The long-context trap: why agents break down
Expectations vs. reality with massive context windows
With context windows stretching toward one million tokens, many assumed we could simply pack every document, tool description, and guideline into a single prompt and the agent would handle everything flawlessly.
"Now that we can dump every document into the prompt, why bother searching for the right file?"
That enthusiasm fueled excitement around RAG (Retrieval-Augmented Generation) and MCP (Model Context Protocol), but more context doesn't always help. Overstuffed contexts can poison, distract, confuse, or clash inside the model, and agents, which orchestrate tools and multi-step reasoning, are especially vulnerable to these failure modes.
Four types of context failure
- Context poisoning: hallucinated or incorrect data leaks into the prompt and keeps getting referenced.
- Context distraction: the context grows so long that the model fixates on past behavior instead of planning new actions.
- Context confusion: superfluous or irrelevant information causes the model to choose the wrong tool or produce low-quality answers.
- Context clash: contradictory data in the prompt leads the model down inconsistent reasoning paths.
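The first failure mode above can be made concrete with a toy loop: once a hallucinated "fact" is appended to the running context, it is re-sent on every later turn, so the error never goes away. The `fake_model` function and the game-state strings here are entirely hypothetical stand-ins for a real LLM call.

```python
def fake_model(prompt: str) -> str:
    # Stand-in for an LLM call: hallucinates a game status on the first
    # turn, then reasons from that status whenever it appears in the prompt.
    if "status: level 99" in prompt:
        return "Plan: grind toward level 100 (impossible goal)."
    return "status: level 99"  # the hallucinated state

context = ["Game state: level 3."]
for turn in range(3):
    prompt = "\n".join(context)
    reply = fake_model(prompt)
    context.append(reply)  # the poisoned line is never removed

# Every turn after the first hallucination reasons from the bad state.
print(context)
```

The fix is not a bigger window but hygiene: validating or expiring entries before they re-enter the prompt.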
Concrete examples
- Poisoning: When Gemini 2.5 hallucinated in a game scenario, the false status became embedded in the context. The agent then chased impossible goals because the corrupted state persisted across runs.
"Once a goal section is poisoned, the agent doggedly pursues nonsense strategies."
- Distraction: Gemini 2.5 Pro supports over 1M tokens, but beyond ~100k tokens it started replaying past actions rather than making new plans. Smaller models hit this ceiling sooner: Llama 3.1 405B begins degrading around 32k tokens.
"After 100k tokens, agents stop planning and just repeat large chunks of history."
- Confusion: when MCP-style prompts list every available tool description, benchmarks show performance drops once many tools are present; the wrong tool gets called simply because it is in the prompt.
- Clash: Microsoft and Salesforce studies found that once incorrect assumptions are baked into a conversation, LLMs latch onto them and struggle to recover. OpenAI's o3 accuracy dropped from 98.1 to 64.1 when conflicting information was introduced incrementally.
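A common guard against the distraction pattern above is to cap how much history reaches the model at all. The sketch below keeps only the most recent messages that fit a budget; whitespace word count is a deliberately crude proxy for a real tokenizer, and the budget value is made up.

```python
def trim_history(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages whose combined size fits the budget.

    len(msg.split()) is a rough stand-in for a token count; a real system
    would use the model's tokenizer and likely summarize dropped turns
    instead of discarding them outright.
    """
    kept: list[str] = []
    used = 0
    for msg in reversed(messages):   # walk newest -> oldest
        cost = len(msg.split())
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))      # restore chronological order

history = ["first old message here", "a much newer message", "latest user turn"]
print(trim_history(history, budget=8))
# → ['a much newer message', 'latest user turn']
```

Truncation alone loses information, which is why the summarization strategies discussed later usually pair with it.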
Lessons learned
These failure categories only became prominent once context windows grew massive. Each has consequences:
- Poisoning accumulates errors over time.
- Distraction leads agents to replay history instead of planning new actions.
- Confusion makes agents trigger irrelevant tools.
- Clash leaves the reasoning pipeline inconsistent.
Because agents combine sources, call tools, and reason across long conversations, they are hit by these failures especially hard.
Moving forward
Mitigations exist: load tools dynamically, quarantine different context sources, and summarize history before feeding it back. The author promises deeper coverage in a follow-up post.
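The first mitigation, loading tools dynamically, can be sketched as a relevance filter over tool descriptions so that only a few reach the prompt. The keyword-overlap scoring below is a naive stand-in for real retrieval (e.g., embedding similarity), and every tool name and description is invented for illustration.

```python
import string

TOOLS = {
    "search_web": "Search the web for up-to-date information.",
    "run_sql": "Execute a SQL query against the analytics database.",
    "send_email": "Send an email to a recipient.",
    "read_file": "Read a local file from disk.",
}

def _words(text: str) -> set[str]:
    # Lowercase and strip surrounding punctuation for crude matching.
    return {w.strip(string.punctuation) for w in text.lower().split()}

def select_tools(query: str, tools: dict[str, str], k: int = 2) -> list[str]:
    """Rank tools by keyword overlap with the query and keep the top k.

    A real agent would use embedding similarity or a tool-retrieval index;
    the point is that unselected descriptions never enter the context,
    which directly counters the confusion failure mode.
    """
    query_words = _words(query)
    ranked = sorted(
        tools,
        key=lambda name: -len(query_words & _words(tools[name])),
    )
    return ranked[:k]

print(select_tools("query the database for last week's sales", TOOLS))
# → ['run_sql', 'search_web']
```

Context quarantine and summarization follow the same principle: shape what the model sees rather than handing it everything.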
"For more strategies, see 'How to Fix Your Context.'"
Key concepts summary
- context windows and agents
- poisoning, distraction, confusion, clash
- RAG and MCP
- summarization and fact retrieval
- dynamic tool loading and context isolation
🧠 Long context isn't a panacea. Managing information carefully and only keeping what matters determines whether an agent succeeds.
