1. Reliability Problems in Delegated LLM Work

The paper examines what happens when LLMs are asked to carry out delegated document work over longer interactions. The concern is not only whether the final answer looks plausible, but whether the underlying document remains intact.

2. Introducing the DELEGATE-52 Benchmark

DELEGATE-52 evaluates agents on professional document environments spanning 52 domains in five categories. It is designed to reveal reliability issues that short, isolated tasks can hide.

2.1. Simulating Long Workflows With Round-Trip Relay

Figure 2: The backtranslation round-trip primitive.

The benchmark uses a round-trip relay method to simulate repeated delegation and revision. This makes it possible to observe gradual corruption over time.
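The relay idea can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual code: `agent_edit` stands in for any LLM editing call, and the line-overlap similarity is a toy metric chosen so the drift is easy to see.

```python
def round_trip_relay(document, agent_edit, similarity, rounds=5):
    """Relay a document through repeated edit rounds, scoring each
    result against the original to surface cumulative drift."""
    scores = []
    current = document
    for _ in range(rounds):
        current = agent_edit(current)          # one delegation step
        scores.append(similarity(document, current))
    return current, scores

# Toy "agent" that silently drops the last line on every round.
def lossy_agent(doc):
    lines = doc.splitlines()
    return "\n".join(lines[:-1]) if len(lines) > 1 else doc

# Toy similarity: fraction of the original's lines that survive.
def line_overlap(original, edited):
    sa, sb = set(original.splitlines()), set(edited.splitlines())
    return len(sa & sb) / max(len(sa), 1)

doc = "\n".join(f"line {i}" for i in range(10))
final, scores = round_trip_relay(doc, lossy_agent, line_overlap)
# scores decline steadily even though each single step is small
```

Each individual edit removes only one line, yet the similarity trace makes the cumulative loss visible, which is exactly the effect the relay is designed to expose.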

2.2. Benchmark Components

The benchmark includes science and engineering, code and configuration, creative media, structured records, and everyday document tasks. Each domain has parsing and similarity functions so document quality can be scored.

Figure 3: DELEGATE-52 includes work environments from 52 professional domains in five categories: Science & Engineering, Code & Configuration, Creative & Media, Structured Records, and Everyday.

Figure 5: Top: Domains in DELEGATE-52 implement a parsing function that converts text documents into a structured representation which is then used by a similarity function to score two parsed instances. Bottom: concrete example for the recipe domain.
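The parse-then-score pattern in Figure 5 can be sketched with a toy recipe format. The field names, regexes, and equal weighting below are illustrative assumptions, not the benchmark's actual implementation.

```python
import re

def parse_recipe(text):
    """Parse a plain-text recipe into {title, ingredients, steps}.
    Format assumed: title line, '- ' ingredients, numbered steps."""
    title = text.splitlines()[0].strip()
    ingredients = re.findall(r"^- (.+)$", text, flags=re.M)
    steps = re.findall(r"^\d+\. (.+)$", text, flags=re.M)
    return {"title": title, "ingredients": ingredients, "steps": steps}

def jaccard(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def recipe_similarity(p1, p2):
    """Score two parsed recipes: exact title match plus Jaccard
    overlap of ingredients and steps, equally weighted (assumption)."""
    title = 1.0 if p1["title"] == p2["title"] else 0.0
    return (title
            + jaccard(p1["ingredients"], p2["ingredients"])
            + jaccard(p1["steps"], p2["steps"])) / 3.0

original = "Pancakes\n- flour\n- milk\n- eggs\n1. Mix.\n2. Fry."
edited   = "Pancakes\n- flour\n- milk\n1. Mix.\n2. Fry."  # eggs dropped
score = recipe_similarity(parse_recipe(original), parse_recipe(edited))
```

Scoring the parsed structures rather than the raw text is what lets the metric notice that an ingredient vanished even when the edited document still reads fluently.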

3. Key Experimental Results

The results show that LLMs can perform useful delegated work while still damaging documents in ways that are hard to notice immediately.

3.1. Document Corruption

Documents may lose important content, formatting, metadata, or structure during repeated edits. The errors can accumulate even when each individual step seems acceptable.

3.2. Limits of Short-Term Performance

Good performance on short tasks does not guarantee reliability in long workflows. Evaluation needs to include persistence and cumulative effects.

3.3. Effects of Agent Tool Use

Tools can help agents inspect and modify documents more reliably, but tool use does not eliminate corruption. The workflow still needs verification.
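One shape that verification step could take is a gate that accepts a tool edit only if the revised document still parses and keeps its required fields. The JSON record schema below is a hypothetical example, not part of the benchmark.

```python
import json

REQUIRED_KEYS = {"id", "name", "status"}  # hypothetical record schema

def verified_apply(document, edit_fn):
    """Apply an edit only if the result is still valid JSON with all
    required keys; otherwise keep the original document unchanged."""
    candidate = edit_fn(document)
    try:
        record = json.loads(candidate)
    except json.JSONDecodeError:
        return document, False
    if not REQUIRED_KEYS <= record.keys():
        return document, False
    return candidate, True

doc = '{"id": 7, "name": "pump-A", "status": "active"}'

# An edit that silently drops the "status" field is rejected:
bad_edit = lambda d: '{"id": 7, "name": "pump-A"}'
result, ok = verified_apply(doc, bad_edit)

# An edit that preserves the schema goes through:
good_edit = lambda d: '{"id": 7, "name": "pump-B", "status": "active"}'
result2, ok2 = verified_apply(doc, good_edit)
```

The point is that the check lives outside the agent: even a tool-using agent that believes its edit succeeded gets its output validated before the document is overwritten.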

3.4. Document Size and Interaction Length

Longer documents and longer interaction histories increase the risk of damage. More context does not automatically mean better preservation.

3.5. Effects of Distractors

Distracting or irrelevant content can make the agent more likely to miss important constraints or change the wrong part of a document.

3.6. Non-Text Document Tasks

The benchmark also shows that agents struggle with non-text documents, where visual layout and hidden metadata matter.
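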

4. Detailed Analysis

The analysis separates obvious failures from subtler degradation. This distinction matters because many corrupted outputs can still look fluent.

4.1. Severe Error Analysis

Severe errors include deleting critical content, breaking schemas, or altering facts that should have remained stable.

4.2. Deletion vs. Corruption

Deletion is only one failure mode. Corruption can also mean changing order, structure, labels, or relationships in ways that are harder to detect.
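The distinction can be made concrete with two toy metrics, one for missing items and one for order damage. The metric names and definitions here are illustrative, not the benchmark's.

```python
def deletion_rate(original, edited):
    """Fraction of original items that no longer appear at all."""
    return sum(1 for x in original if x not in edited) / len(original)

def order_corruption(original, edited):
    """Fraction of surviving adjacent pairs whose order was broken."""
    pairs = [(a, b) for a, b in zip(original, original[1:])
             if a in edited and b in edited]
    if not pairs:
        return 0.0
    broken = sum(1 for a, b in pairs if edited.index(a) > edited.index(b))
    return broken / len(pairs)

steps = ["prep", "mix", "bake", "cool", "serve"]
shuffled = ["prep", "bake", "mix", "cool", "serve"]  # nothing deleted

d = deletion_rate(steps, shuffled)      # 0.0: every step survives
c = order_corruption(steps, shuffled)   # order damage is still visible
```

A deletion-only metric scores the shuffled list as perfect, while the order metric flags it: exactly the blind spot this subsection describes.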

4.3. Effects of Document Characteristics

Documents with rigid structure, many cross-references, or dense formatting are more vulnerable to agent mistakes.

4.4. Difficulty of Semantic Tasks

Tasks that require semantic judgment are especially difficult because the agent must preserve intent while making meaningful edits.

5. Implications and Limits

The benchmark suggests that delegated document work needs more robust guardrails, better evaluation, and explicit preservation checks.

5.1. For LLM Developers

Models and tools should be optimized not only for task completion, but for preserving document integrity across long workflows.

5.2. For NLP Practitioners

Benchmarks should measure cumulative reliability, not just single-turn accuracy or surface fluency.

5.3. For AI System Users

Users should verify important documents after delegation, especially when formatting, records, or legal and financial content matter.

5.4. Research Limitations

The benchmark does not cover every real-world workflow, and results may change as models and tools improve.

6. Conclusion

Delegation is useful, but document integrity is a separate capability. Reliable agents need to preserve the artifact, not merely produce convincing text.
