This article presents six core strategies for resolving context failure problems in large language models (LLMs). It explains various failure types including context poisoning, confusion, and clash, and emphasizes the importance of efficient information management through RAG, tool loadout, context quarantine, pruning, summarization, and offloading techniques. Ultimately, since all information in the context influences the model's responses, the key to high-quality outputs is removing unnecessary information and retaining only what's relevant.
1. Revisiting Context Failure Types and the Importance of Information Management
This article follows up on the earlier post "How Long Context Fails" published on June 22, 2025, covering how to mitigate or entirely avoid context failures. First, let's briefly revisit the main ways long context can fail.
- Context Poisoning: Occurs when hallucinations or other errors enter the context and are repeatedly referenced.
- Context Distraction: Occurs when the context grows so long that the model over-focuses on the accumulated context and neglects what it learned during training.
- Context Confusion: Occurs when irrelevant information in the context is used by the model to produce low-quality responses.
- Context Clash: Occurs when new information or tools accumulate in the context and conflict with other information in the prompt.
All of these problems ultimately come down to information management. All information within the context influences the model's responses, which aligns with the old programming adage "Garbage in, garbage out." Fortunately, there are various methods to address these issues.
2. Introduction to Context Management Tactics
The main management tactics for resolving context failures are as follows:
- RAG (Retrieval-Augmented Generation): Selectively adding relevant information so the LLM generates better responses.
- Tool Loadout: Selecting only the relevant tool definitions to add to the context.
- Context Quarantine: Isolating contexts in their own dedicated threads, each used separately by one or more LLMs.
- Context Pruning: Removing irrelevant or unnecessary information from the context.
- Context Summarization: Condensing accumulated context into compact summaries.
- Context Offloading: Storing information outside the LLM's context, typically through tools that store and manage data.
3. RAG (Retrieval-Augmented Generation)
RAG is a technique for selectively adding relevant information so the LLM generates better responses. Much has already been written about RAG, but it remains an extremely important technique.
Recently, Llama 4 Scout introduced an enormous context window of 10 million tokens, reigniting the debate that "RAG is dead." However, if you treat the context like a junk drawer, that junk will influence the responses. In other words, no matter how large the context window becomes, having lots of irrelevant information can actually degrade response quality.
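The retrieval step at the heart of RAG can be sketched in a few lines. This is a minimal illustration only: it uses word overlap as a stand-in relevance score, whereas a real pipeline would use an embedding model and a vector store. All names here (`score`, `retrieve`, the sample chunks) are hypothetical.

```python
def score(query: str, chunk: str) -> float:
    """Crude relevance score: fraction of query words found in the chunk.
    A production RAG system would use embeddings and vector search instead."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words)

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Keep only the top-k most relevant chunks instead of stuffing all of
    them into the context -- the core idea behind RAG."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]

chunks = [
    "The billing API accepts JSON payloads over HTTPS.",
    "Our office dog is named Biscuit.",
    "Invoices are generated nightly by the billing cron job.",
]
context = "\n".join(retrieve("how does the billing API work", chunks))
```

The point is the filtering, not the scoring: only the selected chunks ever reach the model, so irrelevant material (the office dog) never enters the context at all.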
4. Tool Loadout
Tool loadout means adding only the most relevant tool definitions for a given task to the context. The concept is borrowed from video games, where you select abilities, weapons, and equipment tailored to a specific situation--a 'loadout.'
The simplest approach is applying RAG to tool descriptions. In the "RAG MCP" paper, Tiantian Gan and Qiyao Sun proposed storing tool descriptions in a vector database and selecting the most relevant tools based on the input prompt. Testing with the DeepSeek-v3 model, they found that correct tool selection becomes critical beyond 30 tools, and the model almost certainly fails with more than 100 tools. When using RAG to select fewer than 30 tools, prompts became much shorter and tool selection accuracy improved up to 3x.
The "Less is More" paper showed that the Llama 3.1 8B model failed benchmarks when given 46 tools but succeeded when given only 19 tools, suggesting the problem is context confusion rather than context window limitations. The research team developed a method for dynamically selecting tools using an LLM-based tool recommender, which improved Llama 3.1 8B's performance by 44%. As a side benefit, the method also reduced power consumption by 18% and latency by 77%.
Most agents only need a small number of manually curated tools, but whenever the breadth of functionality or the volume of integrations needs to expand, loadout should always be considered.
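The selection step described above can be sketched as follows. This is a hedged illustration, not the RAG-MCP implementation: the tool definitions are invented examples in a generic function-calling shape, and word overlap stands in for the vector-database lookup the paper actually uses.

```python
# Hypothetical tool definitions in a generic function-calling shape.
TOOLS = [
    {"name": "get_weather", "description": "Look up the current weather for a city."},
    {"name": "search_flights", "description": "Search for a flight between two cities."},
    {"name": "create_invoice", "description": "Create a billing invoice for a customer."},
    {"name": "convert_currency", "description": "Convert an amount between two currencies."},
]

def select_loadout(task: str, tools: list[dict], k: int = 2) -> list[dict]:
    """Rank tools by word overlap between the task and each description, and
    pass only the top-k definitions to the model. A production system would
    embed the descriptions in a vector store, as in the RAG-MCP paper."""
    task_words = set(task.lower().split())
    def overlap(tool: dict) -> int:
        return len(task_words & set(tool["description"].lower().split()))
    return sorted(tools, key=overlap, reverse=True)[:k]

loadout = select_loadout("book a flight and check the weather in Paris", TOOLS)
```

Only the selected definitions are sent with the prompt, keeping the tool list well under the 30-tool threshold where selection accuracy starts to degrade.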
5. Context Quarantine
Context quarantine isolates contexts in their own dedicated threads, each used separately by one or more LLMs. An effective way to prevent context from becoming too long or filled with irrelevant content is to split work into smaller, independent tasks, each with its own context.
Anthropic's blog post on multi-agent research systems provides a good example of this strategy. They explain that "the essence of search is compression," with sub-agents operating in parallel within their own context windows, simultaneously exploring different aspects of a question, then compressing and passing the most important tokens to a senior research agent. This not only speeds up information gathering and distillation but also prevents each context from accumulating too much information or information unrelated to the specific prompt, delivering higher-quality results.
"Our internal evaluations show that multi-agent research systems excel particularly on broad queries requiring pursuit of multiple independent directions simultaneously. We found that a multi-agent system using Claude Opus 4 as the senior agent and Claude Sonnet 4 as sub-agents performed 90.2% better than single-agent Claude Opus 4 on internal research evaluations."
This approach also helps with tool loadout, as agent designers can create multiple agent archetypes, each with its own dedicated loadout and instructions for using each tool. The challenge for agent builders is finding independent tasks suitable for parallelization and separating them into distinct threads. Problems requiring context sharing among multiple agents are not well-suited to this tactic.
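The fan-out-then-compress pattern can be sketched with a few functions. This is an illustrative skeleton, not Anthropic's system: it assumes a generic `llm(messages) -> str` callable, and the function names are invented for the example.

```python
def run_subagent(llm, subtask: str) -> str:
    """Each subagent gets a fresh, isolated message list (its own context
    window) and returns only a compressed summary, not its full transcript."""
    messages = [
        {"role": "system", "content": "Research the subtask and reply with a short summary."},
        {"role": "user", "content": subtask},
    ]
    return llm(messages)

def research(llm, question: str, subtasks: list[str]) -> str:
    """Orchestrator: fan subtasks out to quarantined subagents, then hand only
    their summaries to a final synthesis call. The subagents' intermediate
    context never leaks into the orchestrator's window."""
    summaries = [run_subagent(llm, t) for t in subtasks]
    messages = [
        {"role": "system", "content": "Synthesize the findings into one answer."},
        {"role": "user", "content": question + "\n\nFindings:\n" + "\n".join(summaries)},
    ]
    return llm(messages)
```

In a real system the subagent calls would run in parallel; the essential property is that each call starts from an empty message list rather than inheriting the orchestrator's history.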
6. Context Pruning
Context pruning is the act of removing irrelevant or unnecessary information from the context. Agents accumulate context as they run tools and assemble documents. Sometimes it's important to evaluate accumulated content and remove unnecessary parts. This can be delegated to the main LLM, or a separate LLM-based tool can be designed to review and edit the context.
Context pruning has a relatively long history in the natural language processing (NLP) field, dating back to before ChatGPT, when context length was a bigger bottleneck. A current option is Provence, described by its authors as an efficient and robust context pruner for question answering.
Provence is fast, accurate, easy to use, and relatively small (1.75GB). It can be invoked with just a few lines of code. For example, it successfully reduced a Wikipedia entry for Alameda, California by up to 95% based on a question, retaining only the relevant portions.
These pruning functions can be used to clean up documents or entire contexts. Additionally, this pattern provides a strong case for maintaining a structured version of the context as a dictionary or other format, assembling it into a compiled string before every LLM call. This structure is useful during pruning, allowing you to preserve key instructions and objectives while pruning or summarizing document or record sections.
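The structured-context pattern above can be made concrete with a small sketch. This is an assumed design, not a published API: the class and method names are invented, and the `keep` set stands in for whatever a pruner such as Provence or an LLM judge would decide.

```python
from dataclasses import dataclass, field

@dataclass
class StructuredContext:
    """Keep the context as structured sections instead of one growing string,
    so key instructions survive while bulky sections can be pruned."""
    instructions: str
    objective: str
    documents: dict[str, str] = field(default_factory=dict)

    def prune(self, keep: set[str]) -> None:
        """Drop every document section not explicitly kept. In practice the
        keep-set would come from a pruner model or an LLM-based judge."""
        self.documents = {k: v for k, v in self.documents.items() if k in keep}

    def compile(self) -> str:
        """Assemble the final prompt string just before each LLM call."""
        parts = [f"# Instructions\n{self.instructions}", f"# Objective\n{self.objective}"]
        for name, text in self.documents.items():
            parts.append(f"# {name}\n{text}")
        return "\n\n".join(parts)

ctx = StructuredContext(
    instructions="Answer using only the documents provided.",
    objective="Summarize the city's founding.",
    documents={"history": "Alameda was founded in 1853.", "sports": "The local team plays on weekends."},
)
ctx.prune(keep={"history"})
prompt = ctx.compile()
```

Because instructions and objective are separate fields, pruning can never accidentally delete them: only the document sections are candidates for removal.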
7. Context Summarization
Context summarization is the act of condensing accumulated context into compact summaries. This technique originally emerged as a tool for handling smaller context windows. When chat sessions approached the maximum context length, a summary would be generated and a new thread started. Chatbot users would manually request summaries from ChatGPT or Claude and paste them into new sessions.
However, as context windows grew, agent builders discovered that summarization has benefits beyond just staying within the total context limit. As context grows longer, the model relies less on its trained knowledge and becomes distracted--the context distraction phenomenon. A team running Gemini agents playing Pokemon found that this behavior was triggered beyond 100,000 tokens.
"Gemini 2.5 Pro supports over 1 million tokens of context, but using it effectively for agents presents new research challenges. In this agent setup, as context grew well beyond 100K tokens, agents showed a tendency to repeat actions from the vast history rather than synthesizing new plans. While anecdotal, this highlights an important distinction between long context for retrieval versus long context for multi-step generative reasoning."
Summarizing context is easy, but applying it perfectly to a specific agent is harder. Understanding which information to preserve and instructing the LLM-based compression step in detail is important for agent builders. It's advisable to separate this function into its own LLM-based step or app so you can collect evaluation data and directly optimize this task.
8. Context Offloading
Context offloading is the act of storing information outside the LLM's context, typically through tools that store and manage data. This method is so simple it seems too good to be true.
Anthropic provides a good explanation of the "think" tool, which is essentially a scratchpad.
"Through the 'think' tool, we give Claude the ability to include additional thinking steps (with its own designated space) as part of the process of arriving at a final answer... This is particularly useful when performing long chains of tool calls or in long multi-turn conversations with users."
If this tool had been named scratchpad, its function would have been immediately obvious. It's a space where the model records notes and progress that can be referenced later without polluting the context. Anthropic shows that combining the "think" tool with domain-specific prompts yields significant performance improvements of up to 54% on specialized agent benchmarks.
Anthropic identified three scenarios where the context offloading pattern is useful:
- Tool output analysis: When Claude needs to carefully process the output of previous tool calls and reconsider its approach before acting.
- Policy-heavy environments: When Claude must follow detailed instructions and verify compliance.
- Sequential decision-making: When each action builds on previous ones and mistakes are costly (often found in multi-step domains).
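In the same spirit, a scratchpad tool can be sketched as below. This is an invented illustration, not Anthropic's implementation: the class, the tool schema, and the sample notes are all hypothetical.

```python
class Scratchpad:
    """A minimal scratchpad: the model writes notes here via tool calls, so
    intermediate reasoning lives outside the main context and is read back
    only when needed."""
    def __init__(self):
        self.notes: list[str] = []

    def write(self, note: str) -> str:
        self.notes.append(note)
        return f"Noted ({len(self.notes)} entries)."

    def read(self) -> str:
        return "\n".join(f"{i + 1}. {n}" for i, n in enumerate(self.notes))

# A tool definition the model would see -- illustrative schema, not Anthropic's exact spec.
THINK_TOOL = {
    "name": "think",
    "description": "Record a thought or intermediate result for later reference.",
    "input_schema": {"type": "object", "properties": {"note": {"type": "string"}}},
}

pad = Scratchpad()
pad.write("Policy check: refunds over $100 need manager approval.")
pad.write("Customer's refund request is $250.")
recap = pad.read()
```

The agent loop would route `think` tool calls to `write`, and inject `read()` output only at the decision points where the notes are actually needed, rather than carrying them in every prompt.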
9. Conclusion
Context management is one of the hardest parts of building agents. As Karpathy put it, the core job of an agent designer is programming the LLM to "appropriately fill the context window": smartly deploying tools and information, and performing regular context maintenance.
The key insight across all the tactics mentioned above is that context is not free. Every token in the context influences the model's behavior, for better or worse. The massive context windows of modern LLMs are a powerful capability, but they cannot serve as an excuse for neglecting information management.
When building your next agent or optimizing an existing one, ask yourself: "Is everything in this context pulling its weight?" If not, you now know six ways to fix it.
