The Long Context Trap: How Agents Fail and What to Learn

1. The Promise and Reality of Long Context Windows

As AI model context windows expanded to one million tokens, many people grew excited by the idea that "you can now stuff everything into a single prompt and let the agent handle it all." The fantasy: load every tool, document, and instruction at once, and the model will figure out the optimal answer on its own.

"Now that we can fit all our documents into the prompt, we don't have to worry about which documents to retrieve!"

This enthusiasm cooled interest in RAG (Retrieval-Augmented Generation) while fueling expectations around MCP (Model Context Protocol) and agents. In reality, however, longer context does not guarantee better results. When there is too much context, agents and applications can fail in unexpected ways: the context can become poisoned, distracting, confusing, or self-contradictory. These problems are especially severe for agents, since they rely heavily on context to gather information, synthesize it, and coordinate actions.

2. Four Types of Context Failure

The most common context problems that cause agents to fail can be grouped into four categories:

Context Poisoning
- Incorrect information (such as hallucinations) enters the context and is repeatedly referenced
Context Distraction
- The context grows so long that the model fixates on it rather than on what it has learned during training
Context Confusion
- Irrelevant information enters the context, leading the model to produce low-quality responses
Context Clash
- Contradictory information or conflicting tools enter the context, producing internal inconsistencies

3. Concrete Examples and Impact of Each Failure Type

1) Context Poisoning

Context poisoning occurs when a hallucination or other error enters the context and is then referenced repeatedly. For example, while Gemini 2.5 was playing a Pokémon game, it generated a hallucination, and that incorrect information stayed in the context and kept being referenced from then on.

"The most severe form of this is 'context poisoning,' where multiple parts of the context (goals, summaries, etc.) become contaminated with false information about the game state, and it can take a very long time to recover from. As a result, the model becomes fixated on achieving impossible or irrelevant goals."

Once a goal section is poisoned, the agent devises nonsensical strategies and keeps pursuing objectives it can never reach.
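One possible guard against this failure, sketched below, is to persist a claim into a long-lived context section only when an observation backs it. This is an illustrative assumption rather than anything described in the source: the AgentContext class, the record_claim function, and the sample game facts are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AgentContext:
    """Long-lived context sections that the agent rereads on every step."""
    state_notes: list[str] = field(default_factory=list)
    unverified: list[str] = field(default_factory=list)


def record_claim(ctx: AgentContext, claim: str, observed_facts: set[str]) -> bool:
    """Persist a claim only if an observation backs it.

    Unbacked claims are parked separately so they are not reread as truth.
    """
    if claim in observed_facts:
        ctx.state_notes.append(claim)
        return True
    ctx.unverified.append(claim)
    return False


# Hypothetical game state: the observation confirms only one of the two claims.
observed = {"player is in Viridian City"}
ctx = AgentContext()
record_claim(ctx, "player is in Viridian City", observed)
record_claim(ctx, "player holds the Master Ball", observed)  # hallucinated claim
```

Exact string matching is far too strict for real use. The point is only that unverified claims stay out of the sections the agent treats as ground truth.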

2) Context Distraction

Context distraction occurs when the context grows so long that the model forgets the strategies it learned during training and starts repeating past actions. For instance, Gemini 2.5 Pro supports over one million tokens of context, but in practice it began repeating past actions once the context exceeded around 100,000 tokens.

"In agentic settings, once the context grew well beyond 100k tokens, the agent tended to repeat actions from its extensive history rather than forming new plans. This shows that long context may be useful for information retrieval but has real limits for multi-step reasoning."

Other models hit this ceiling much sooner. Llama 3.1 405B, for example, begins to lose accuracy around 32,000 tokens, and smaller models degrade earlier still.

Because of this, the real value of long context windows lies in summarization and fact retrieval. For any other use case, you need to watch out for the model's distraction ceiling.
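A minimal sketch of respecting a distraction ceiling is to cap how much history goes back into the prompt. The helper names, the estimate of four characters per token, and the choice to drop the oldest messages (rather than summarize them) are all assumptions made for illustration. The right ceiling depends on the model and the task.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def trim_history(messages: list[str], ceiling_tokens: int) -> list[str]:
    """Keep the most recent messages whose estimated total fits the ceiling."""
    kept: list[str] = []
    used = 0
    # Walk backwards so the newest messages are kept first.
    for message in reversed(messages):
        cost = estimate_tokens(message)
        if used + cost > ceiling_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["a" * 400, "b" * 400, "c" * 400]  # about 100 estimated tokens each
recent = trim_history(history, ceiling_tokens=250)  # keeps the last two messages
```

In practice a real tokenizer would replace estimate_tokens, and older messages might be summarized instead of discarded.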

3) Context Confusion

Context confusion occurs when too many tools or pieces of information are loaded at once, leaving the model uncertain about how to respond. For example, when every available tool description is packed into the prompt through MCP (Model Context Protocol), the model has trouble deciding which tool to use.

"The Berkeley Function-Calling Leaderboard shows that every model's performance drops once more than one tool is provided. Even when none of the available functions are relevant, every model occasionally calls an irrelevant tool."

In the GeoEngine benchmark, where 46 tools were provided, Llama 3.1 8B failed even within a 16k context window. But with only 19 tools, it succeeded.

In other words, everything placed in the context demands the model's attention, so the more irrelevant information is included, the more confused the model becomes.
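One way to keep irrelevant tools out of the prompt is to rank tool descriptions against the query and include only the top few. The sketch below uses plain word overlap as the ranking signal. The select_tools function and the tool catalog are hypothetical, and a production system would more likely use embeddings or another retrieval method.

```python
def select_tools(query: str, tools: dict[str, str], limit: int) -> list[str]:
    """Rank tools by how many query words appear in their description."""
    query_words = set(query.lower().split())
    scored = []
    for name, description in tools.items():
        overlap = len(query_words & set(description.lower().split()))
        if overlap > 0:  # tools with no overlap never reach the prompt
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:limit]]


# Hypothetical tool catalog; names and descriptions are illustrative only.
catalog = {
    "get_weather": "fetch the current weather forecast for a city",
    "send_email": "send an email message to a recipient",
    "search_maps": "search maps for a city or address",
}
chosen = select_tools("what is the weather in this city", catalog, limit=2)
```

Here the email tool is dropped because its description shares no words with the query, so it never competes for the model's attention.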

4) Context Clash

Context clash occurs when contradictory information enters the context and the model cannot reconcile it. Research teams at Microsoft and Salesforce explored this with "sharded" prompt experiments, in which the information is supplied piece by piece over multiple successive turns instead of all at once.

"LLMs make assumptions early in a conversation and rush to a final answer too soon. Then they over-rely on that flawed answer. In simple terms, once an LLM goes down the wrong path in a conversation, it gets lost and cannot find its way back."

When information is introduced incrementally across multiple turns, an early incorrect answer remains in the context and continues to distort subsequent responses. In these experiments, OpenAI's o3 model saw its score fall from 98.1 to 64.1.
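The sketch below shows one way to sidestep a stale early answer: rebuild a single prompt from the user's shards and leave the intermediate assistant replies out. It also works through the arithmetic on the o3 scores quoted above, which comes to a relative drop of about 35%. The function names and the sample conversation are hypothetical, and this is an illustration rather than the researchers' procedure.

```python
def consolidate_shards(turns: list[tuple[str, str]]) -> str:
    """Merge user-provided shards into one prompt, dropping earlier answers.

    Each turn is a (role, text) pair; only "user" turns are kept, in order.
    """
    shards = [text for role, text in turns if role == "user"]
    return "\n".join(shards)


def relative_drop(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100


# Hypothetical sharded conversation with a premature assistant answer.
conversation = [
    ("user", "Write a function that sorts a list."),
    ("assistant", "Here is an ascending sort."),
    ("user", "It must sort in descending order."),
]
prompt = consolidate_shards(conversation)
o3_drop = relative_drop(98.1, 64.1)  # scores quoted in the text
```

The consolidated prompt contains both requirements and none of the premature answer, so the model sees the full specification at once.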

4. The Limits of Long Context and What We Can Learn

The arrival of million-token context windows seemed revolutionary, but in practice it has introduced new categories of failure.

Context poisoning causes errors to accumulate over time.
Context distraction causes agents to become fixated on the past.
Context confusion causes the model to invoke irrelevant tools or documents.
Context clash causes reasoning to collapse under internal contradictions.

These failures are especially serious in agents, which gather information from multiple sources, make sequential tool calls, perform multi-step reasoning, and accumulate long conversation histories.

5. Looking Ahead: Solutions

Fortunately, there are ways to mitigate or avoid these problems. For example, dynamically loading tools and quarantining context are both viable strategies. These will be covered in detail in a follow-up article.
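The source does not define context quarantine here. One plausible reading, sketched below as an assumption, is to run a subtask in its own fresh context and pass only the final result back to the main context. The run_quarantined and demo_worker functions are hypothetical stand-ins, and no real model is called.

```python
from typing import Callable


def run_quarantined(task: str, worker: Callable[[list[str]], str]) -> str:
    """Run a subtask in a fresh context and return only its final result."""
    isolated_context = [task]  # the worker starts with nothing but the task
    return worker(isolated_context)  # intermediate steps stay inside the worker


def demo_worker(context: list[str]) -> str:
    """Stand-in for a model call (hypothetical); it fills only its own context."""
    context.append("scratch: tried approach A, failed")
    context.append("scratch: tried approach B, worked")
    return f"result for {context[0]!r}"


main_context = ["user: plan a trip"]
main_context.append(run_quarantined("find flights", demo_worker))
```

The main context gains one line per subtask, and the worker's scratch notes are discarded along with its isolated context.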

For more on solutions, see "How to Fix Your Context"!

Key Concept Summary

Context window
Agent
Context poisoning, distraction, confusion, and clash
RAG, MCP
Summarization, fact retrieval
Dynamic tool loading, context quarantine

🧠 Long context is not a silver bullet! What matters most for agent success is how well you manage information and extract only what you actually need.
