This study evaluates how well large language models (LLMs) preserve document content when performing delegated tasks such as document editing. Using a new benchmark called DELEGATE-52, researchers tested 19 LLMs across 52 professional domains and found that current models severely corrupt document content in long-horizon workflows. Notably, even the latest models corrupted an average of 25% of document content after just 20 interactions. Document size, interaction length, and the presence of distractor context all worsened corruption severity. These findings indicate a reliability gap that LLMs must close before they can serve as trustworthy agents in knowledge work.
1. Raising the Question of Reliability in Delegated LLM Tasks
Recent advances in large language models have enabled new interaction paradigms such as delegated tasks, where knowledge workers hand off work to an LLM and oversee it. Because users often lack the expertise or time to review every change the LLM makes, they must be able to trust that the model executes each task accurately and without introducing errors.
This study used simulation to broadly investigate whether LLMs are ready to perform delegated tasks without introducing errors into documents. The central goal was to understand how well LLMs preserve document integrity when carrying out delegated knowledge work.
2. Introducing the DELEGATE-52 Benchmark 🚀
The centerpiece of this research is DELEGATE-52, a new benchmark spanning 52 professional domains — including coding, crystallography, genealogy, and musical notation — with 310 work environments. Each environment contains a real document of roughly 15,000 tokens and 5–10 complex editing tasks for the LLM to perform.
Unlike prior work that focused on a single domain (e.g., code editing), DELEGATE-52 covers a far broader range of fields and aims to assess the general capabilities of LLMs.
2.1. Simulating Long-Horizon Workflows: The Round-Trip Relay 🔁
DELEGATE-52 evaluates models with round-trip relay simulation, a method that mimics long-horizon delegated interactions. A key advantage of this method is that evaluation requires no annotations or reference solutions.
- Reversible edit tasks: Each editing task is defined by a forward instruction and a corresponding backward instruction, making it reversible by design.
- Round-trip process: The LLM first applies the forward instruction to the original document (s) to produce a transformed document (t). It then applies the backward instruction to t to produce a reconstructed document (ŝ) that should resemble the original. For an ideal model, ŝ would match s exactly.
- Similarity measurement: Domain-specific similarity functions are implemented to measure reconstruction quality, returning a score between 0 and 1 that reflects semantic similarity between the original and reconstructed documents.
These round trips can be chained sequentially into a relay, simulating long-horizon workflows across many steps. For example, 20 interactions correspond to 10 round-trip edits. The primary evaluation metric is the reconstruction score after k interactions (RS@k).
Figure 2: The backtranslation round-trip primitive
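The relay described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's implementation: `run_relay`, `generic_similarity`, and `lossy_editor` are hypothetical names, the `edit` callable stands in for an LLM call, and the character-level similarity is only a placeholder for the domain-specific functions. The sketch also assumes that each round is scored against the original seed document.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

# An "editor" stands in for one LLM call: (document, instruction) -> new document.
Editor = Callable[[str, str], str]
# A reversible task is a (forward_instruction, backward_instruction) pair.
Task = Tuple[str, str]


def generic_similarity(original: str, reconstructed: str) -> float:
    """Placeholder similarity in [0, 1]; the benchmark uses domain-specific functions."""
    return SequenceMatcher(None, original, reconstructed).ratio()


def run_relay(seed: str, tasks: List[Task], edit: Editor,
              similarity: Callable[[str, str], float] = generic_similarity) -> List[float]:
    """Chain round trips and return the reconstruction score after each one.

    Entry i of the result corresponds to RS@k with k = 2 * (i + 1) interactions.
    """
    scores = []
    current = seed
    for forward, backward in tasks:
        transformed = edit(current, forward)      # s -> t
        current = edit(transformed, backward)     # t -> reconstructed s, fed to the next round
        scores.append(similarity(seed, current))  # assumption: always compare with the seed
    return scores


def lossy_editor(document: str, instruction: str) -> str:
    """Toy stand-in for an LLM: an upper/lower-case round trip that drops a line on the way back."""
    if instruction == "uppercase":
        return document.upper()
    lines = document.lower().split("\n")
    return "\n".join(lines[:-1]) if len(lines) > 1 else lines[0]
```

With the toy `lossy_editor`, every round trip silently loses one line, so the scores returned by `run_relay` fall from round to round. An editor that returns its input unchanged scores 1.0 throughout.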
2.2. Benchmark Components 🛠️
DELEGATE-52 includes the following components:
- 52 professional domains: Covering a wide range of fields across five categories — Science & Engineering, Code & Configuration, Creative & Media, Structured Records, and Everyday.
Figure 3: The 52 professional domains in the DELEGATE-52 benchmark
- Work environments: Each domain includes 6 work environments, each consisting of a seed document, 5–10 editing tasks, and a distractor context.
- Seed documents: Real documents sourced online, ranging from 2,000 to 5,000 tokens.
- Editing tasks: Pairs of forward and backward instructions requiring deep transformations beyond simple expansion.
- Distractor context: A document unrelated to the task (8,000–12,000 tokens) included to simulate realistic work environments.
- Domain-specific evaluation: Because general text similarity metrics struggle to capture subtle semantic changes, custom similarity functions are implemented for each domain. For example, in the recipe domain, scores are weighted by the importance of ingredients, steps, and tips.
Figure 5: Domain-specific evaluation approach in DELEGATE-52
- Quality assurance: To ensure evaluation validity, several quality assurance stages were conducted, covering parsing robustness, evaluation sensitivity, edit testing, and distractor interference.
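To make the recipe example concrete, here is one way an importance-weighted similarity could look. Everything specific in it is an assumption: the weights, the section names as dictionary keys, the pre-parsed `Recipe` structure, and the use of Jaccard overlap are illustrative choices, since the text only states that ingredients, steps, and tips are weighted by importance.

```python
from typing import Dict, List

# Hypothetical weights: the text says sections are weighted by importance
# but does not give the values.
SECTION_WEIGHTS = {"ingredients": 0.5, "steps": 0.4, "tips": 0.1}

# Assumed parsed form of a recipe: section name -> list of items.
Recipe = Dict[str, List[str]]


def section_overlap(original: List[str], reconstructed: List[str]) -> float:
    """Jaccard overlap of normalized items; two empty sections count as identical."""
    a = {item.strip().lower() for item in original}
    b = {item.strip().lower() for item in reconstructed}
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def recipe_similarity(original: Recipe, reconstructed: Recipe) -> float:
    """Weighted average of per-section overlaps, in [0, 1]."""
    total = sum(SECTION_WEIGHTS.values())
    score = sum(
        weight * section_overlap(original.get(name, []), reconstructed.get(name, []))
        for name, weight in SECTION_WEIGHTS.items()
    )
    return score / total
```

Under these assumed weights, losing one of two ingredients costs more than losing the only tip, which is the kind of distinction a generic text-similarity metric tends to miss.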
3. Key Experimental Results 📊
The study ran large-scale simulations using DELEGATE-52 across 19 LLMs. Each simulation consisted of 10 round trips (20 interactions total), with models receiving the work environment document as text in their context window at each interaction.
3.1. Document Corruption 📉
All models degraded in performance as interactions progressed, with an average of 50% of document content corrupted by the end of simulation. Even the latest models — Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 — showed severe results, with an average of 25% of document content corrupted after 20 interactions.
"Current LLMs are unreliable agents. They silently inject sparse but severe errors into documents, and these accumulate over long-horizon interactions."
By domain, LLMs performed better in programming domains such as Python and databases, and worse in natural-language and niche domains (e.g., earnings reports, musical notation). Among all 52 domains, Python was the only one where most models reached a "ready" state (scoring above 98% after 20 interactions).
These results make clear that a substantial gap exists in LLMs' ability to perform delegated tasks.
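The readiness criterion above is simple to apply to a table of scores. The sketch below assumes reconstruction scores are stored as fractions in a nested dictionary (domain, then model) and reads "most models" as a strict majority. The function name and data layout are hypothetical.

```python
from typing import Dict

READY_THRESHOLD = 0.98  # "ready" means a score above 98% after 20 interactions, per the text


def ready_domains(rs_at_20: Dict[str, Dict[str, float]]) -> Dict[str, bool]:
    """Map each domain to True when more than half of its models exceed the threshold."""
    result = {}
    for domain, by_model in rs_at_20.items():
        ready = sum(score > READY_THRESHOLD for score in by_model.values())
        result[domain] = ready > len(by_model) / 2
    return result
```

For instance, a domain where two of three models score 0.99 and 0.985 counts as ready, while one where only a single model clears 0.98 does not.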
3.2. The Limits of Short-Term Performance ⚠️
Surprisingly, performance after just 2 interactions was not always predictive of long-term performance at 20 interactions. For example, GPT-5 and Kimi K2.5 started at nearly identical scores, but diverged significantly over time. This suggests that short-horizon simulations are insufficient for understanding LLM performance in long-horizon settings and underscores the importance of long-term evaluation.
3.3. The Effect of Agentic Tool Use 🛠️
To test the hypothesis that giving LLMs tools might reduce errors, the researchers implemented a basic agent system equipped with file read/write and code execution tools.
"The LLMs we tested do not benefit from agentic tool use when completing complex editing tasks across diverse text domains."
The results contradicted this hypothesis. All four tested models produced an average of 6% more document corruption when using tools than when operating without them. Two factors are offered as explanation:
- Overhead: Tool use causes models to consume more input tokens, increasing cost and latency.
- Task complexity: DELEGATE-52 tasks require text comprehension and document-level reasoning rather than the execution of short, simple programs. As a result, models tended to prefer manual file-writing tools over code execution.
In conclusion, current LLMs do not effectively leverage agentic tools for complex editing tasks, and DELEGATE-52 may serve as a useful benchmark for developers building such agent systems.
3.4. Effects of Document Size and Interaction Length 📏
- Document size: As document size increased, GPT 5.4's performance degraded progressively. Scaling from a 1,000-token document to a 10,000-token document resulted in a 5× increase in corruption after 20 interactions. This indicates that document size and interaction length compound multiplicatively, causing corruption to snowball.
- Interaction length: When relays were extended from 10 rounds (20 interactions) to 50 rounds (100 interactions), performance continued to degrade, and no model showed any sign of stabilizing. Corruption accrued faster in early rounds than in later ones, yet even the strongest model, GPT 5.4, fell below a 60% score after 50 rounds.
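A rough way to reason about relay length is to convert a final score into an equivalent per-round-trip retention rate. The sketch below assumes a constant multiplicative loss per round trip, a simplification the study does not claim, and the function names are hypothetical.

```python
def per_round_retention(final_score: float, round_trips: int) -> float:
    """Constant per-round-trip retention r such that r ** round_trips == final_score."""
    if not 0.0 < final_score <= 1.0 or round_trips <= 0:
        raise ValueError("final_score must be in (0, 1] and round_trips must be positive")
    return final_score ** (1.0 / round_trips)


def extrapolate(final_score: float, round_trips: int, target_round_trips: int) -> float:
    """Project the score to a different relay length under the same constant-rate assumption."""
    return per_round_retention(final_score, round_trips) ** target_round_trips
```

As an illustration, a score of 0.75 after 10 round trips corresponds to retaining about 97.2% of content per round trip. Since corruption was reported to be faster early than late, projecting that rate out to 50 round trips would likely overstate the long-run loss, so such projections are best treated as a pessimistic bound rather than a forecast.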
3.5. Effects of Distractor Context 🚫
Simulations included distractor documents to reflect real-world work environments. When experiments were re-run without distractor documents, there was little difference early on, but the negative impact of distractors grew as interactions lengthened. By the end of simulation, the distractor-free condition showed 2–8% better performance. This demonstrates that unnecessary context (distractors) accelerates LLM performance degradation and that its severity can be underestimated by short-term evaluation.
3.6. Non-Text Document Editing Capabilities 🖼️
In addition to text documents, the study tested LLM capabilities on non-text document tasks such as image editing. Nine image-generation models were evaluated across six visual work environments.
- Severe corruption: Corruption in image manipulation was far worse than in text domains. Even the best models achieved final reconstruction scores of only 28–30%, compared to 70–80% for text models.
- Not ready: Even after just 2 interactions, no image-generation model exceeded 65% — lower than what text models achieve after 20 interactions.
This suggests that image editing models are not yet ready for delegated tasks and demonstrates that the DELEGATE-52 methodology can be extended to non-text domains.
4. Detailed Analysis 🔬
4.1. Critical Failure Analysis 💥
To understand how LLM performance degrades, cases where a score dropped by 10% or more in a single round trip were defined as critical failures and analyzed.
- Accumulated critical failures: Across all models, critical failures accounted for 80–98% of total corruption. LLMs do not degrade gradually through many small errors — rather, they tend to produce sudden, severe errors in specific rounds that substantially corrupt document content.
- Characteristics of stronger models: Better-performing models delayed critical failures to later rounds or experienced fewer of them across interactions. This indicates that stronger models are not better at avoiding small errors, but at postponing the onset of critical failures.
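The critical-failure analysis can be reproduced on any score trajectory. The following sketch makes three assumptions: that "10% or more" means an absolute drop of 0.10 in the score, that the relay starts from a perfect score of 1.0, and that total corruption is the sum of all per-round drops. The function name is hypothetical.

```python
from typing import List, Tuple

CRITICAL_DROP = 0.10  # assumed to mean an absolute drop of 0.10 within one round trip


def critical_failure_share(scores: List[float], start: float = 1.0) -> Tuple[List[int], float]:
    """Return (1-based rounds with a critical failure, share of total loss they explain)."""
    critical_rounds = []
    critical_loss = 0.0
    total_loss = 0.0
    previous = start
    for round_index, score in enumerate(scores, start=1):
        drop = previous - score
        if drop > 0:
            total_loss += drop  # only count losses; a recovering score adds nothing
        if drop >= CRITICAL_DROP:
            critical_rounds.append(round_index)
            critical_loss += drop
        previous = score
    share = critical_loss / total_loss if total_loss > 0 else 0.0
    return critical_rounds, share
```

For a hypothetical trajectory of 0.99, 0.98, 0.80, 0.79, 0.60, rounds 3 and 5 are critical and together explain roughly 92% of the total loss, which is the pattern of sparse but severe errors described above.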
4.2. Deletion vs. Corruption 🗑️↔️
Document corruption was decomposed into content deletion and corruption of existing content (contamination).
- Weaker models: Lower-performing models showed a higher proportion of corruption attributable to content deletion.
- Latest models: In contrast, most corruption in the latest models (Claude 4.6 Opus, Sonnet) stemmed from contamination of existing content — modification, hallucination, and distortion. This means that state-of-the-art LLMs primarily produce errors by contaminating user documents during delegated tasks.
4.3. Effects of Document Characteristics 📄
The study analyzed which document properties influence LLM performance.
- Repetitiveness and structural density: Models degraded less on documents with high repetitiveness (d=+0.261), abundant numerical data (d=+0.159), and high structural density (d=+0.119) — for example, tabular data and chemical records.
- Natural language and lexical diversity: Conversely, models degraded more on documents with high naturalness (d=−0.260) and diverse vocabulary — for example, prose.
This suggests that LLMs perform better on formal, machine-oriented formats (Science & Engineering, Code & Configuration) and are more vulnerable to natural-language documents (Everyday, Creative & Media).
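The d values above read like standardized effect sizes. Assuming they are Cohen's d (the text does not name the statistic), the computation compares scores on documents with and without a given property. The function name, the grouping, and the sample values below are hypothetical.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def cohens_d(with_property: Sequence[float], without_property: Sequence[float]) -> float:
    """Cohen's d with a pooled standard deviation; positive means the first group scores higher."""
    n_a, n_b = len(with_property), len(without_property)
    pooled_variance = (
        (n_a - 1) * variance(with_property) + (n_b - 1) * variance(without_property)
    ) / (n_a + n_b - 2)
    return (mean(with_property) - mean(without_property)) / sqrt(pooled_variance)


# Made-up reconstruction scores for documents with and without a property.
example_d = cohens_d([0.9, 0.8, 0.7], [0.7, 0.6, 0.5])
```

By the usual rule of thumb, values around 0.2, such as the ones reported above, are small effects: these document properties shift scores noticeably but do not dominate them.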
4.4. Difficulty of Semantic Operations 🧠
The semantic operations required by each editing task were categorized and their effects on LLM performance were analyzed.
- Harder tasks: Tasks requiring global restructuring of documents, such as Split and Merge, Classification, and Format Knowledge, produced significantly lower scores. These tasks require reasoning across the full document structure, which risks information loss or misrouting.
- Easier tasks: Local tasks — such as String Manipulation, Referencing, and Context Expansion — produced higher scores, as models can operate on individual tokens or phrases without requiring global document understanding.
- Compound task difficulty: Tasks combining multiple semantic operations simultaneously produced even lower scores, indicating that coordinating multiple operations is harder for LLMs.
5. Implications and Limitations 💡
5.1. Implications for LLM Developers 💻
DELEGATE-52 was used as an evaluation tool to understand current LLM capabilities, but it can also be repurposed for model training. In particular, the 52 domains can serve as a "mini-gym" for online reinforcement learning, training LLMs to complete work cycles without loss. LLM developers can explore reward designs that simultaneously target instruction-following and content preservation.
5.2. Implications for NLP Practitioners 🗣️
- Need for long-horizon interaction benchmarks: Since short-term performance is not a reliable predictor of long-horizon delegated task performance, more benchmarks covering long-horizon interactions beyond memory management are needed.
- Evaluation across diverse domains: Existing evaluations skewed toward math and code should give way to broader benchmarks spanning diverse professional fields and domains.
- Integrating agent and LLM benchmarks: Rather than treating agent benchmarks and LLM benchmarks as separate fields, an integrated approach is needed to understand how LLMs operate across different modes.
5.3. Implications for AI System Users 🙋
Users delegating tasks to LLMs should be cautious about generalizing capability from one domain to another. Because LLMs' capabilities follow a "jagged frontier," a model may perform surprisingly well on certain tasks while making severe errors on others. Current LLMs are ready for delegated tasks in some domains, such as Python coding, but not in less common ones. As a result, users must continue to closely monitor LLM systems while they carry out work.
5.4. Limitations of the Study 🚧
- Single-turn interactions: The simulations in this study are based on single-turn interactions, whereas real users refine their instructions over multiple turns — meaning corruption could be further amplified in real-world settings.
- Practical constraints: Simulation parameters such as document size, distractor context, and relay length may underestimate real-world scale due to cost and context window limitations. Corruption could worsen further in practice.
- Conceptual constraints: Reliance on backtranslation and domain-specific parsing restricts tasks to document editing, requires edits to be reversible, and creates an inherent advantage for structured domains in evaluation.
6. Conclusion 🎯
This study ran large-scale simulations of LLMs performing delegated tasks across 52 professional domains. The results show that current LLMs are unreliable agents: even the latest models corrupt an average of 25% of document content in long-horizon workflows. This corruption accumulates silently in the form of sparse but severe errors, and worsens with document length, interaction length, and distractor context. Agentic tool use did not mitigate the performance degradation.
These findings make clear that there is a fundamental reliability gap in LLMs' ability to perform delegated roles in knowledge work. The research team is releasing DELEGATE-52 publicly, hoping it will serve as a tool for monitoring the delegated-task readiness of AI systems and for advancing research on long-horizon human–AI interaction.
