1. Reliability Problems in Delegated LLM Work
The paper examines what happens when LLMs are asked to carry out delegated document work over long, multi-turn interactions. The concern is not only whether the final answer looks plausible, but whether the underlying document remains intact.
2. Introducing the DELEGATE-52 Benchmark
DELEGATE-52 tests agents in professional document environments across five domains. It is designed to reveal reliability issues that short, isolated tasks can hide.
2.1. Simulating Long Workflows With Round-Trip Relay
The benchmark uses a round-trip relay method to simulate repeated delegation and revision of the same document, which makes gradual corruption observable over time.
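A minimal sketch of how such a relay might be driven follows; the callables agent_edit and score_similarity are hypothetical placeholders, not the benchmark's actual API.

    # Hypothetical round-trip relay: the same document is repeatedly handed
    # to the agent, edited, and handed back, so small per-step damage can
    # accumulate and be tracked round by round.
    def run_relay(document: str, instructions: list[str], agent_edit, score_similarity):
        """agent_edit(doc, instruction) -> edited doc; both callables are assumed."""
        history = []
        current = document
        for round_no, instruction in enumerate(instructions, start=1):
            current = agent_edit(current, instruction)    # one delegated revision
            score = score_similarity(document, current)   # fidelity vs. the original
            history.append((round_no, score))
        return current, history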
2.2. Benchmark Components
The benchmark includes science and engineering, code and configuration, creative media, structured records, and everyday document tasks. Each domain has parsing and similarity functions so document quality can be scored.
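As a hedged illustration of what a domain-specific scorer could look like (this is not the benchmark's implementation; JSON stands in for the structured-records domain only because its parser is standard):

    import json

    def score_json_record(reference: str, candidate: str) -> float:
        """Toy scorer for a structured-records task: parse both documents,
        then return the fraction of reference key/value pairs the candidate
        preserves. A parse failure counts as total loss."""
        try:
            ref, cand = json.loads(reference), json.loads(candidate)
        except json.JSONDecodeError:
            return 0.0  # corrupted past the point of parsing
        if not isinstance(ref, dict) or not isinstance(cand, dict):
            return float(ref == cand)
        if not ref:
            return 1.0
        kept = sum(1 for key, value in ref.items() if cand.get(key) == value)
        return kept / len(ref)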
3. Key Experimental Results
The results show that LLMs can perform useful delegated work while still damaging documents in ways that are hard to notice immediately.
3.1. Document Corruption
Documents may lose important content, formatting, metadata, or structure during repeated edits. Errors can accumulate even when each individual step looks acceptable.
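A back-of-the-envelope illustration of how acceptable steps compound (the retention rate here is hypothetical, not a number from the paper):

    # Hypothetical: each edit step independently preserves 98% of the content.
    retention_per_step = 0.98
    rounds = 20
    print(retention_per_step ** rounds)  # ~0.668: a third of the document is gone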
3.2. Limits of Short-Term Performance
Good performance on short tasks does not guarantee reliability in long workflows. Evaluation therefore needs to account for persistence and cumulative effects.
3.3. Effects of Agent Tool Use
Tools can help agents inspect and modify documents more reliably, but tool use does not eliminate corruption. The workflow still needs verification.
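One hedged sketch of the kind of verification such a workflow might add after each tool-driven edit (the invariants are illustrative; a real pipeline would derive them from the task):

    def verify_edit(original: str, edited: str, required_phrases: list[str]) -> list[str]:
        """Cheap post-edit guardrail: report required content that the
        agent's tool calls silently dropped."""
        problems = [f"missing required text: {phrase!r}"
                    for phrase in required_phrases
                    if phrase in original and phrase not in edited]
        # Crude size check: large shrinkage often signals silent deletion.
        if len(edited) < 0.5 * len(original):
            problems.append("document shrank by more than half")
        return problems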
3.4. Document Size and Interaction Length
Longer documents and longer interaction histories increase the risk of damage. More context does not automatically mean better preservation.
3.5. Effects of Distractors
Distracting or irrelevant content can make the agent more likely to miss important constraints or change the wrong part of a document.
3.6. Non-Text Document Tasks
The benchmark also shows that agents struggle with non-text documents, where visual layout and hidden metadata carry part of the meaning.
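A minimal sketch of a metadata-preservation check, assuming the hidden metadata has already been extracted into plain dictionaries (the extraction step is format-specific and omitted here):

    def diff_metadata(before: dict, after: dict) -> dict:
        """Report hidden-metadata fields that an edit dropped or altered."""
        return {
            "dropped": sorted(set(before) - set(after)),
            "altered": sorted(key for key in before
                              if key in after and before[key] != after[key]),
        }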
4. Detailed Analysis
The analysis separates obvious failures from subtler degradation. This distinction matters because many corrupted outputs can still look fluent.
4.1. Severe Error Analysis
Severe errors include deleting critical content, breaking schemas, or altering facts that should have remained stable.
4.2. Deletion vs. Corruption
Deletion is only one failure mode. Corruption can also mean changing order, structure, labels, or relationships in ways that are harder to detect.
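A hedged toy example of why this distinction matters for detection: a set-based comparison catches deletion but is blind to reordering, so structure has to be checked separately.

    def classify_damage(ref_lines: list[str], out_lines: list[str]) -> str:
        """Toy classifier separating content loss from structural damage.
        (Ignores duplicate lines; real detectors need richer diffs.)"""
        if set(ref_lines) - set(out_lines):
            return "deletion"    # some reference content is simply gone
        if ref_lines != out_lines:
            return "corruption"  # same content, wrong order or structure
        return "intact"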
4.3. Effects of Document Characteristics
Documents with rigid structure, many cross-references, or dense formatting are more vulnerable to agent mistakes.
4.4. Difficulty of Semantic Tasks
Tasks that require semantic judgment are especially difficult because the agent must preserve intent while making meaningful edits.
5. Implications and Limits
The benchmark suggests that delegated document work needs more robust guardrails, better evaluation, and explicit preservation checks.
5.1. For LLM Developers
Models and tools should be optimized not only for task completion, but for preserving document integrity across long workflows.
5.2. For NLP Practitioners
Benchmarks should measure cumulative reliability, not just single-turn accuracy or surface fluency.
5.3. For AI System Users
Users should verify important documents after delegation, especially when formatting, records, or legal and financial content are at stake.
5.4. Research Limitations
The benchmark does not cover every real-world workflow, and results may change as models and tools improve.
6. Conclusion
Delegation is useful, but document integrity is a separate capability. Reliable agents need to preserve the artifact, not merely produce convincing text.
