1. The Promise and Reality of Long Context Windows
As model context windows expanded to one million tokens, many people grew excited by the idea that "you can now stuff everything into a single prompt and let the agent handle it all." The fantasy: load every tool, document, and instruction at once, and the model will work out the best answer on its own.
"Now that we can fit all our documents into the prompt, we don't have to worry about which documents to retrieve!"
This enthusiasm cooled interest in RAG (Retrieval-Augmented Generation) while fueling expectations around MCP (Model Context Protocol) and agents. In reality, however, longer context does not guarantee better results. When there is too much context, agents and applications can fail in unexpected ways: the context can become poisoned, distracting, confusing, or self-contradictory. These problems are especially severe for agents, since they rely heavily on context to gather information, synthesize it, and coordinate actions.
2. Four Types of Context Failure
The most common context problems that cause agents to fail can be grouped into four categories:
- Context Poisoning: Incorrect information (such as hallucinations) enters the context and is repeatedly referenced
- Context Distraction: The context grows so long that the model fixates on it rather than on what it has learned during training
- Context Confusion: Irrelevant information enters the context, leading the model to produce low-quality responses
- Context Clash: Contradictory information or conflicting tools enter the context, producing internal inconsistencies
3. Concrete Examples and Impact of Each Failure Type
1) Context Poisoning
Context poisoning occurs when a hallucination or other error enters the context and is then referenced repeatedly. One example comes from Gemini 2.5 playing a Pokémon game: the model hallucinated, the incorrect information was written into the context, and later steps kept relying on it.
"The most severe form of this is 'context poisoning,' where multiple parts of the context (goals, summaries, etc.) become contaminated with false information about the game state, and it can take a very long time to recover from. As a result, the model becomes fixated on achieving impossible or irrelevant goals."
Once a goal section is poisoned, the agent devises nonsensical strategies and keeps pursuing objectives it can never reach.
2) Context Distraction
Context distraction occurs when the context grows so long that the model forgets the strategies it learned during training and starts repeating past actions. For instance, Gemini 2.5 Pro supports over one million tokens of context, but in practice it began repeating past actions once the context exceeded around 100,000 tokens.
"In agentic settings, once the context grew well beyond 100k tokens, the agent tended to repeat actions from its extensive history rather than forming new plans. This shows that long context may be useful for information retrieval but has real limits for multi-step reasoning."
Smaller models hit this ceiling even sooner. Llama 3.1 405B begins to lose accuracy around 32,000 tokens, and smaller models degrade earlier still.
Because of this, the real value of long context windows lies in summarization and fact retrieval. For any other use case, you need to watch out for the model's distraction ceiling.
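A minimal sketch of one way to stay under a distraction ceiling: keep only the most recent turns that fit a token budget. The `Turn` type, the `trim_history` helper, and the four-characters-per-token estimate are illustrative assumptions, not part of any particular framework. A real application would use the model's own tokenizer and would likely summarize the dropped turns rather than discard them.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str
    text: str


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    # Swap in the model's real tokenizer for accurate counts.
    return max(1, len(text) // 4)


def trim_history(turns: list[Turn], budget: int) -> list[Turn]:
    """Keep the most recent contiguous turns whose estimated tokens fit the budget."""
    kept: list[Turn] = []
    used = 0
    # Walk backwards from the newest turn and stop at the first one that does not fit.
    for turn in reversed(turns):
        cost = estimate_tokens(turn.text)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

The right budget depends on the model. The 100,000-token figure reported for Gemini 2.5 Pro is an observation about one agent, not a general threshold.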
3) Context Confusion
Context confusion occurs when too many tools or pieces of information are loaded at once, leaving the model uncertain about how to respond. For example, when every tool description exposed through MCP (Model Context Protocol) is packed into the prompt, the model becomes confused about which tool to use.
"The Berkeley Function-Calling Leaderboard shows that every model's performance drops once more than one tool is provided. Even when none of the available functions are relevant, every model occasionally calls an irrelevant tool."
In the GeoEngine benchmark, where 46 tools were provided, Llama 3.1 8B failed even though the prompt fit within its 16k context window. With only 19 tools, it succeeded.
In other words, everything placed in the context demands the model's attention, so the more irrelevant information is included, the more confused the model becomes.
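One way to keep irrelevant tools out of the prompt is to pick a small subset per request. The sketch below is an illustration under stated assumptions: `select_tools` is a hypothetical helper, and plain word overlap stands in for whatever relevance measure a real system would use (embedding similarity, for instance).

```python
def select_tools(query: str, tools: dict[str, str], limit: int = 5) -> list[str]:
    """Return up to `limit` tool names whose descriptions share words with the query.

    `tools` maps a tool name to its natural-language description.
    Word overlap is a deliberately simple stand-in for a real relevance score.
    """
    query_words = set(query.lower().split())
    scored: list[tuple[int, str]] = []
    for name, description in tools.items():
        overlap = len(query_words & set(description.lower().split()))
        if overlap > 0:
            scored.append((overlap, name))
    # Highest overlap first; ties are broken alphabetically so results are deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]
```

Only the descriptions of the selected tools would then be placed in the prompt. When nothing matches, the function returns an empty list, so the model is offered no tools rather than irrelevant ones.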
4) Context Clash
Context clash occurs when contradictory information enters the context and the model becomes trapped in the contradiction. Microsoft and Salesforce research teams explored this with "sharded" prompts: they split a single prompt into pieces and supplied the pieces over multiple successive turns.
"LLMs make assumptions early in a conversation and rush to a final answer too soon. Then they over-rely on that flawed answer. In simple terms, once an LLM goes down the wrong path in a conversation, it gets lost and cannot find its way back."
When information is introduced incrementally across multiple turns, an early incorrect answer remains in the context and continues to distort subsequent responses. In these experiments, OpenAI's o3 model saw its score drop from 98.1 to 64.1.
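Since the damage comes from early assistant answers lingering in the context, one possible response is to restart with a fresh prompt built only from what the user actually said. This is offered as an assumption for illustration, not as the method used in the experiments. `consolidate_shards` is a hypothetical helper, and the `role`/`content` message shape is assumed.

```python
def consolidate_shards(messages: list[dict[str, str]]) -> str:
    """Merge every user message into a single prompt and drop assistant attempts.

    Each message is assumed to be a dict with "role" and "content" keys.
    """
    user_parts = [m["content"].strip() for m in messages if m["role"] == "user"]
    # Skip empty fragments and keep the requirements in the order they were given.
    return "\n".join(part for part in user_parts if part)
```

The consolidated prompt can then be sent in a new conversation, so the model sees all requirements at once without its own premature answers.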
4. The Limits of Long Context and What We Can Learn
The arrival of million-token context windows seemed revolutionary, but in practice it has introduced new categories of failure.
- Context poisoning causes errors to accumulate over time.
- Context distraction causes agents to become fixated on the past.
- Context confusion causes the model to invoke irrelevant tools or documents.
- Context clash causes reasoning to collapse under internal contradictions.
These failures are especially serious in agents, which gather information from multiple sources, make sequential tool calls, perform multi-step reasoning, and accumulate long conversation histories.
5. Looking Ahead: Solutions
Fortunately, there are ways to mitigate or avoid these problems. For example, dynamically loading tools and quarantining context are both viable strategies. These will be covered in detail in a follow-up article.
For more on solutions, see "How to Fix Your Context"!
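As a rough illustration of what quarantining context could look like, the sketch below runs a subtask in its own isolated context and copies only the final result back into the main one. The interpretation, the `run_quarantined` name, and the representation of a context as a list of strings are all assumptions made for this example. The follow-up article may define the technique differently.

```python
from typing import Callable


def run_quarantined(
    task: str,
    worker: Callable[[list[str]], str],
    main_context: list[str],
) -> str:
    """Run `worker` on a fresh context and append only its result to `main_context`."""
    isolated_context = [task]  # the worker never sees main_context
    result = worker(isolated_context)
    # Intermediate notes stay in isolated_context and are discarded with it.
    main_context.append(f"Result for '{task}': {result}")
    return result
```

Here `worker` stands in for a model call or sub-agent. Whatever it writes into its own context while working, including mistakes, does not leak into the main context.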
Key Concept Summary
- Context window
- Agent
- Context poisoning, distraction, confusion, and clash
- RAG, MCP
- Summarization, fact retrieval
- Dynamic tool loading, context quarantine
🧠 Long context is not a silver bullet! What matters most for agent success is how well you manage information and extract only what you actually need.
