This document explains how Anthropic designed a harness for building long-running applications with Claude. In particular, it shows how a multi-agent structure built around a generator and an evaluator helped overcome two difficult problems at once: subjective quality evaluation and autonomous coding. The core idea is that a three-stage agent architecture made up of a planner, generator, and evaluator could produce richer and more complete applications than simpler approaches. It is a useful case study in how to optimize AI-agent performance for complex development work.
1. Introduction: The Challenge of AI-Agent Development
Anthropic researcher Prithvi Rajasekaran spent the last few months focused on two connected problems: generating high-quality frontend design with Claude, and building fully working applications with minimal human intervention. Prompt engineering and harness design improved Claude's performance substantially at first, but the gains eventually hit a ceiling.
To get beyond that limit, he looked for a new engineering pattern and took inspiration from Generative Adversarial Networks (GANs). The result was a multi-agent setup with separate generator and evaluator roles. A crucial move here was turning vague subjective questions such as "Is this design good?" into concrete evaluation criteria. Once those criteria existed, the evaluator could assess output more reliably.
The same ideas were then applied to long-running autonomous coding. Earlier harness work had already revealed two important lessons: break work into manageable chunks, and use structured artifacts to pass context between sessions. From there, Anthropic developed a three-stage architecture with a planner, generator, and evaluator. That setup was able to produce rich full-stack applications across autonomous coding sessions that lasted for hours.
2. Why Naive Implementations Fail
Previous research had already shown that harness design meaningfully affects long-running agent coding. Early experiments used an initializer agent to break a product spec into tasks, then let a coding agent implement features one by one while artifacts carried context between sessions. Even so, several problems remained.
When tasks became more complex, the agent tended to drift off course over time. Anthropic observed two broad failure modes.
First, long-running tasks caused models to lose consistency as the context window filled up. Some models even showed what the team called context anxiety: as they neared the context limit, they tried to wrap up prematurely. A context reset addressed this by wiping the context window entirely, launching a fresh agent, and handing over structured state and next steps.
This differs from compaction, where earlier conversation is summarized so the same agent can continue with a shortened history. Compaction preserves continuity, but it does not give the model a clean restart, so context anxiety can persist. A reset gives a cleaner starting point, though it also requires richer handoff artifacts so the next agent can pick up exactly where the last one stopped. In early tests, Claude Sonnet 4.5 showed enough context anxiety that compaction alone was not sufficient.
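The reset-and-handoff pattern can be sketched as a small loop. This is a minimal illustration, not Anthropic's implementation: `run_agent_session` is a hypothetical stand-in for one model session, and the `handoff.json` file name and schema are assumptions. The key property is that each "agent" starts with no conversation history and sees only the structured artifact.

```python
import json
from pathlib import Path

HANDOFF = Path("handoff.json")  # hypothetical handoff artifact

def run_agent_session(seed: dict) -> dict:
    """Stand-in for one fresh agent session. In a real harness this would
    call the model until the task finishes or the context window nears
    its limit, then emit structured state: what is done, what is next."""
    pending = seed.get("next_steps", ["feature_x", "feature_y"])
    done = seed.get("completed", []) + [pending[0]]
    return {"completed": done, "next_steps": pending[1:],
            "notes": f"{pending[0]} passes its smoke test"}

def run_with_resets(max_sessions: int = 5) -> dict:
    HANDOFF.write_text(json.dumps({}))            # start with an empty artifact
    state = {}
    for _ in range(max_sessions):
        seed = json.loads(HANDOFF.read_text())    # a fresh agent reads ONLY the file
        state = run_agent_session(seed)
        HANDOFF.write_text(json.dumps(state, indent=2))
        if not state["next_steps"]:               # nothing left: stop spawning agents
            break
    return state
```

Contrast this with compaction, where the same session would continue with a summarized history instead of being replaced outright.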
Second, the team confronted the problem of self-evaluation. When an agent is asked to evaluate its own output, it tends to praise work confidently even when a human observer would consider it mediocre. This is especially severe in subjective tasks like design, where there is no simple binary test. Even in tasks with objective outcomes, agents sometimes make inaccurate judgments that get in the way of execution.
Separating the agent that does the work from the agent that evaluates the work turned out to be a powerful solution. The split does not instantly remove generosity bias, but it is much more realistic to tune an independent evaluator to be more critical than to force a generator to judge its own work rigorously. Once external feedback exists, the generator can iterate concretely on it.
3. Frontend Design: Making Subjective Quality Evaluatable
The self-evaluation issue first became especially visible in frontend design. Left on its own, Claude tended to produce safe, predictable, technically functional layouts that still felt visually ordinary.
Anthropic's frontend harness was built around two insights. First, aesthetics cannot be reduced perfectly to a score, but improvement is possible if you encode design principles and preferences as evaluation criteria. "Is this beautiful?" is hard to answer consistently; "Does this follow our principles for good design?" gives the evaluator something more concrete. Second, separating generation from evaluation creates a feedback loop that pushes the generator toward stronger output.
The team used four criteria:
- Design quality: Does the result feel like a coherent whole rather than a pile of parts?
- Originality: Does it reflect intentional choices, or does it look like a template or a generic AI pattern?
- Craft: Does it show good typography, spacing, color harmony, and contrast?
- Functionality: Can users understand the interface and complete tasks without guessing?
Rajasekaran deliberately emphasized design quality and originality over craft and functionality, because Claude already performed reasonably well on the latter two. The new weighting penalized common "AI slop" patterns and pushed the model to take more aesthetic risks.
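The reweighted rubric can be expressed as a simple scoring function. The four criteria come from the source; the numeric weights below are hypothetical, chosen only to reflect the stated emphasis on design quality and originality over craft and functionality.

```python
# Hypothetical weights: the source says design quality and originality were
# emphasized over craft and functionality, but gives no exact values.
WEIGHTS = {
    "design_quality": 0.35,
    "originality":    0.35,
    "craft":          0.15,
    "functionality":  0.15,
}

def rubric_score(scores: dict) -> float:
    """Combine per-criterion scores (0-10) into one weighted total."""
    assert set(scores) == set(WEIGHTS), "evaluator must score every criterion"
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)
```

For example, a page that is well-crafted and functional but generic, such as `{"design_quality": 8, "originality": 6, "craft": 9, "functionality": 9}`, lands around 7.6 rather than the 8.0 an unweighted average would give, so the weighting penalizes exactly the safe, template-like output the team wanted to discourage.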
The evaluator was further calibrated with few-shot examples and detailed score analysis so that its judgments matched Anthropic's preferences more closely. The loop itself was built on the Claude Agent SDK. A generator first created HTML/CSS/JS from a user prompt. The evaluator was then given Playwright MCP so it could interact with the live page, inspect the result, and write detailed critiques. That feedback was passed back to the generator for the next round.
Each run typically went through 5 to 15 iterations, with every round steering the generator toward more distinctive output. Since the evaluator actively explored the page instead of scoring static screenshots, the full process could take up to four hours.
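The generate-evaluate-critique loop above can be sketched as follows. Both functions are hypothetical stand-ins: in the real system, `generate` would be a model call through the Claude Agent SDK and `evaluate` would drive the live page through Playwright MCP before writing its critique.

```python
def generate(prompt, critique=None):
    """Stand-in for the generator agent: produce HTML/CSS/JS for this round,
    incorporating the previous round's critique if one exists."""
    return f"<html><!-- built for: {prompt}; critique: {critique} --></html>"

def evaluate(page):
    """Stand-in for the evaluator agent: interact with the live page,
    then return a rubric score and a written critique."""
    return 8.5, "Typography is strong; the hero section still feels generic."

def design_loop(prompt, max_rounds=15, bar=8.0):
    page, critique = "", None
    for _ in range(max_rounds):
        page = generate(prompt, critique)      # generator acts on feedback
        score, critique = evaluate(page)       # evaluator critiques the result
        if score >= bar:                       # good enough: stop iterating
            break
    return page
```

The stopping condition is the interesting design choice: the loop ends either when the evaluator's score clears a bar or when the round budget (5 to 15 in the source) runs out.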
The results improved over successive iterations, though not always in a neat linear way. Some runs showed steady progress. Others jumped sharply between visual directions. Interestingly, the phrasing of the criteria itself shaped output. For example, language like "the best design feels museum-grade" nudged the generator toward a particular kind of visual convergence. In other words, the evaluation prompt was not just measuring the work. It was also shaping it.
4. Extending the Pattern to Full-Stack Coding
Anthropic then took the same GAN-inspired pattern into full-stack development. The generator-evaluator loop maps naturally onto software work, where code review and QA serve a similar structural purpose to design critique.
4.1. Architecture
The older long-running harness had already used an initializer agent, feature-by-feature coding agents, and context resets between sessions. Context reset was especially important because Sonnet 4.5 showed the "context anxiety" mentioned earlier. Later, Opus 4.5 reduced that issue enough that Anthropic could remove explicit context resets and rely on a single continuous session plus the Agent SDK's automatic compaction.
On top of that base, Anthropic built a three-stage system:
- Planner: Instead of requiring users to provide detailed specs up front, the planner expanded a short 1-4 sentence prompt into a full product spec. It was instructed to stay ambitious about scope while focusing more on product context and high-level technical design than overly specific implementation details. It was also asked to look for opportunities to integrate AI capabilities into the product.
- Generator: The generator worked in sprint mode, selecting one feature from the spec at a time and implementing it using a React, Vite, FastAPI, and SQLite (later PostgreSQL) stack. At the end of each sprint, it performed a self-check before handing off to QA.
- Evaluator: Using Playwright MCP, the evaluator interacted with the live application like a user would, checking UI functionality, API endpoints, and database behavior. It graded each sprint against a rubric covering product depth, functionality, visual design, and code quality. If any category missed its threshold, the sprint failed and the generator received detailed feedback.
Before each sprint, the generator and evaluator negotiated a sprint contract. This defined what "done" would mean before coding started. Because the product spec remained intentionally high-level, the contract served as the bridge between abstract user stories and concrete, testable implementation. Communication happened through files: one agent wrote, the other read and responded in the same or a new file.
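The file-based protocol between generator and evaluator can be sketched like this. The use of files as the communication channel is from the source; the specific file names, JSON schemas, and "done when" items below are illustrative assumptions.

```python
import json
from pathlib import Path

CONTRACT = Path("sprint_contract.json")   # hypothetical contract file
FEEDBACK = Path("qa_feedback.json")       # hypothetical QA verdict file

def propose_contract(feature: str) -> None:
    """Generator side: define what 'done' means BEFORE coding starts."""
    CONTRACT.write_text(json.dumps({
        "feature": feature,
        "done_when": [
            "UI renders the new screen without console errors",
            "API endpoint returns 200 with the expected schema",
            "record persists to the database and survives reload",
        ],
    }, indent=2))

def review_sprint(passed: bool, notes: list) -> None:
    """Evaluator side: grade the sprint against the contract it read."""
    FEEDBACK.write_text(json.dumps({"passed": passed, "notes": notes}, indent=2))

def sprint_gate() -> bool:
    """Generator side: read QA's verdict; a failed sprint loops back."""
    return json.loads(FEEDBACK.read_text())["passed"]
```

Because the contract is written down before implementation, the evaluator grades against agreed, testable behaviors rather than re-deriving expectations from the high-level spec.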
4.2. Results
Anthropic first tested the harness using Claude Opus 4.5, comparing the full harness with a single-agent run. The prompt was simple: build a 2D retro game maker with a level editor, sprite editor, entity behaviors, and a playable test mode.
The difference in output quality was obvious. The single-agent run looked plausible at first glance, but quickly broke down under actual use. Layout wasted space. The workflow was rigid. Critical features, like moving entities in the playable mode, did not work. The code revealed disconnects between the entity definitions and the game runtime.
The harness run, by contrast, expanded the same one-line prompt into a much larger spec spanning ten sprints and sixteen features. In addition to the core editor and play mode, it included sprite animation, behavior templates, sound effects, music, AI-assisted sprite generation, AI-assisted level design, and shareable export links. The planner even used Anthropic's frontend-design skill to produce a coherent visual design language.
The resulting application felt much more polished immediately. The canvas used the full viewport, the panels were better sized, and the visual identity felt consistent. There were still workflow gaps, but the difference in capability became even clearer during use. The sprite editor was richer. The built-in Claude integration accelerated content generation. Most importantly, the play mode actually worked: the entities moved, and the game was playable.
The evaluator was central to that outcome. It tested the live app, traced sprint contracts back to specific expected behaviors, and filed bugs that the generator could act on directly. That said, getting the evaluator to that level took work. Early QA agents were weak: they noticed issues but approved them anyway, or stayed too shallow to catch edge cases. Rajasekaran had to read the evaluator logs, compare its judgment with his own, and keep refining the prompt until the evaluator graded more rigorously.
Even then, the harness exposed the limits of current QA models. Some layout issues, awkward interactions, and deeply nested bugs remained. But compared with the single-agent baseline, where core features did not work at all, the improvement was undeniable.
4.3. Simplifying the Harness
The first harness worked, but it was bulky, slow, and expensive. The next question was whether the same performance could be achieved with less scaffolding.
Anthropic first tried a more aggressive simplification and failed to reproduce the same quality. So the team switched to a more systematic approach, removing one component at a time and observing what actually mattered.
When Opus 4.6 arrived, simplification became even more appealing. The model itself had improved at planning, staying on task, debugging, reviewing, and retrieving long-context information. Those were precisely the weaknesses the harness had originally been built to offset.
One major change was removing the sprint structure entirely. Planner and evaluator still added value, but the generator no longer needed features to be decomposed into tiny pieces in the same way. Evaluation also moved to the end of the build instead of happening after each sprint. At this stage, whether an evaluator was worth paying for depended on whether the task still sat outside the model's unaided capability frontier.
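The ablation described above amounts to toggling harness components off one at a time. A minimal sketch of that configuration space, with hypothetical field names:

```python
from dataclasses import dataclass

@dataclass
class HarnessConfig:
    use_planner: bool = True
    use_evaluator: bool = True
    sprint_structure: bool = True      # decompose the spec into sprints
    evaluate_per_sprint: bool = True   # QA gate after every sprint

# Original harness: all scaffolding enabled.
V1 = HarnessConfig()

# Simplified harness for a stronger model: no sprint decomposition,
# and a single evaluation pass at the end of the build.
V2 = HarnessConfig(sprint_structure=False, evaluate_per_sprint=False)
```

Note that `use_evaluator` stays on in the simplified configuration: the planner and evaluator still earned their keep, while the sprint machinery did not.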
The simplified harness was then tested on a browser-based DAW built with the Web Audio API. Even in the simplified form, the run still took almost four hours and cost over $120 in tokens. But it produced a genuinely functional music-creation tool with an arrangement view, mixer, transport controls, and even AI-assisted composition flows. QA still found gaps, such as missing clip resizing, clip splitting, and a real recording implementation, but the evaluator once again caught meaningful end-stage issues the generator had glossed over.
5. What This Suggests Going Forward
As models improve, some scaffolding will become unnecessary. Certain problems may simply disappear as raw model capability increases. At the same time, the space for interesting harness designs does not shrink. It shifts.
The key lessons here are durable:
- Keep experimenting directly with models on real problems.
- Read traces instead of assuming you know why something failed.
- Break difficult work apart and assign specialized roles where it helps.
- Revisit your harness whenever a new model arrives, because what mattered for one generation may no longer matter for the next.
Rajasekaran's conclusion is not that harness design becomes less relevant as models get better; it is that harness design becomes a moving target. The interesting work for AI engineers is to keep discovering the next effective combination.
Closing
This piece is a compelling look at how Anthropic approached difficult application-building problems through harness design. From subjective design critique to fully working full-stack systems, the planner-generator-evaluator pattern shows one path toward extracting more value from AI agents than naive prompting alone can deliver.
As models keep advancing, the boundaries of harness design will keep moving too. That makes this not just a case study about current systems, but a useful framework for thinking about the next generation of AI engineering as well.
