This document explains in detail the challenges Anthropic's research team encountered while developing long-running applications with Claude, and the harness design they devised to overcome them. 🧠 In particular, it shows how a multi-agent structure built around generator and evaluator agents addressed two difficult problems simultaneously: subjective quality evaluation and autonomous coding. Going beyond the limits of earlier approaches, a three-stage agent architecture comprising a planner, a generator, and an evaluator made it possible to build richer, more fully featured applications. This piece offers insights for optimizing AI agent performance and efficiently tackling complex development tasks. ✨
1. Introduction: Challenges in AI Agent Development and a New Approach
Anthropic researcher Preethi Rajasekharan spent several months absorbed in two interconnected problems: generating high-quality frontend designs with Claude and building complete applications without human intervention. Early on, prompt engineering and harness design dramatically improved Claude's performance, but she soon hit a wall. 🚧
To break through that wall, Rajasekharan explored a new AI engineering approach. Drawing inspiration from Generative Adversarial Networks (GANs), she designed a multi-agent structure composed of generator and evaluator agents. A key challenge was converting subjective judgments—like "is this design good?"—into concrete, evaluable criteria that would allow the evaluator to assess outputs reliably.
These techniques were then applied to long-running autonomous coding. Two lessons from earlier harness work came into play: decomposing tasks into manageable chunks, and using structured artifacts to carry context across sessions. The result was a three-stage agent architecture comprising a planner, a generator, and an evaluator, capable of producing rich, full-stack applications through hours-long autonomous coding sessions. 🚀
2. Why Naive Implementations Fail
Previous research showed that harness design significantly affects the efficiency of long-running agent coding. Early experiments used an initializer agent to decompose a product specification into a task list, a coding agent to implement each feature, and artifacts to pass context between sessions. But several problems remained. 😥
When tackling more complex tasks, agents tended to drift off course over time. Analyzing this problem revealed two common failure modes.
First, models tend to lose coherence over long tasks as the context window fills up. Some models exhibit "context anxiety"—a tendency to wrap up work prematurely as they approach the context limit. A context reset—clearing the context window entirely and launching a fresh agent with a structured handoff conveying the prior agent's state and next steps—addresses both of these issues.
This differs from compaction, which summarizes earlier parts of the conversation so the same agent can continue with an abbreviated history. Compaction preserves continuity but does not give the agent a clean slate, so context anxiety can persist. A reset provides a clean starting point but requires enough state in the handoff artifact for the next agent to pick up seamlessly. In early tests, Claude Sonnet 4.5 exhibited strong context anxiety, so compaction alone was not enough to enable solid long-task performance. Context reset therefore became an essential part of the harness design. This approach solves the core problem but adds orchestration complexity, token overhead, and latency per harness run. 😩
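The contrast between the two strategies can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the harness's actual code: the `Handoff` fields and the `compact` and `reset` helpers are hypothetical names, and the "summary" is a placeholder string standing in for a model-written summary.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Handoff:
    """Hypothetical structured handoff written by the outgoing agent."""
    completed: list[str] = field(default_factory=list)
    in_progress: str = ""
    next_steps: list[str] = field(default_factory=list)


def compact(history: list[str], keep_last: int = 2) -> list[str]:
    """Compaction: the same agent continues with an abbreviated history."""
    if len(history) <= keep_last:
        return list(history)
    older = history[:-keep_last]
    # Placeholder summary; a real harness would ask the model to summarize.
    summary = f"[summary of {len(older)} earlier messages]"
    return [summary] + history[-keep_last:]


def reset(handoff: Handoff) -> list[str]:
    """Context reset: a fresh agent starts from the handoff artifact only."""
    state = json.dumps(asdict(handoff), indent=2)
    return ["You are resuming a task. Prior state:\n" + state]


history = [f"message {i}" for i in range(1, 7)]
handoff = Handoff(
    completed=["project scaffold", "login page"],
    in_progress="level editor",
    next_steps=["finish tile palette", "add save/load"],
)

compacted = compact(history)  # old context survives in summarized form
fresh = reset(handoff)        # nothing survives except the artifact
```

The reset path makes the trade-off visible: anything not written into the handoff is lost to the next agent, which is why the artifact has to carry enough state on its own.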
Second, a previously unaddressed problem: self-evaluation. When an agent is asked to evaluate its own work, it tends to praise the output with confidence even when a human observer would clearly judge the quality as mediocre. This is especially pronounced for subjective tasks like design, where there are no binary checks like verifiable software tests. Whether a layout feels polished or plain is a matter of judgment, and agents show a positive bias when evaluating their own output.
Even on tasks with verifiable outcomes, agents sometimes make inaccurate judgments that interfere with task execution. Separating the agent that does the work from the agent that evaluates it proved to be a powerful remedy. While this separation does not instantly eliminate the leniency, tuning an independent evaluator to be critical is far more practical than trying to make the generator critique its own work. And once external feedback exists, the generator can iterate concretely on it. 👍
3. Frontend Design: Making Subjective Quality Evaluable
Experimentation began in frontend design, where the self-evaluation problem was most visible. Without any intervention, Claude tended toward safe, predictable layouts—technically functional but visually unremarkable. 🎨
The harness built for frontend design incorporated two insights. First, while aesthetics cannot be fully reduced to scores and personal taste always varies, evaluation criteria that codify design principles and preferences enable improvement. "Is this design beautiful?" is hard to answer consistently, but "Does this follow our principles of good design?" gives Claude concrete criteria to evaluate against. Second, separating frontend generation from frontend evaluation creates a feedback loop that drives the generator toward stronger outputs.
With these ideas in mind, both the generator and the evaluator agents were given the following four evaluation criteria as part of their prompts.
- Design quality: Does the design feel like a coherent whole rather than a collection of parts? Colors, typography, layout, imagery, and other details should combine to create a distinct mood and identity.
- Originality: Is there evidence of custom decisions, or is it a template layout, a library default, or an AI-generated pattern? A human designer should be able to recognize intentional creative choices. Telltale signs of AI generation—unmodified stock components, purple gradients on white cards—fail here.
- Craft: Is the technical execution sound, including typographic hierarchy, spacing consistency, color harmony, and contrast ratios? This is a competence check rather than a creativity check. Most reasonable implementations pass by default; failing here signals a fundamental problem.
- Functionality: Usability independent of aesthetics. Can the user understand what the interface does, find key actions, and complete tasks without guessing?
Design quality and originality were weighted more heavily than craft and functionality. Claude scored well on craft and functionality by default—the required technical competencies are naturally baked into the model. On design and originality, however, Claude frequently produced results that were mediocre at best. These criteria explicitly penalized common "AI slop" patterns, and heavier weighting on design and originality nudged the model to take more aesthetic risks. 📈
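To make the effect of the weighting concrete, here is a small sketch of a weighted score. The source states only that design quality and originality counted for more than craft and functionality; the specific weights, the 0-10 scale, and the sample scores below are assumptions chosen for illustration.

```python
# Hypothetical weights: the source says only that design quality and
# originality were weighted more heavily than craft and functionality.
WEIGHTS = {
    "design_quality": 0.35,
    "originality": 0.35,
    "craft": 0.15,
    "functionality": 0.15,
}


def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores (assumed 0-10 scale) into one number."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# A technically solid but generic design...
safe = {"design_quality": 4, "originality": 3, "craft": 9, "functionality": 9}
# ...versus a riskier design with slightly rougher execution.
bold = {"design_quality": 8, "originality": 8, "craft": 7, "functionality": 8}

safe_total = weighted_score(safe)
bold_total = weighted_score(bold)
```

Under these assumed weights the gap between the two designs is wider than it would be with equal weights, which is the sense in which the weighting rewards aesthetic risk.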
The evaluator was calibrated using few-shot examples with detailed score breakdowns, ensuring its judgments aligned with stated preferences and reducing scoring variance across iterations.
The loop was built on the Claude Agent SDK to streamline orchestration. The generator agent first produced an HTML/CSS/JS frontend based on a user prompt. The evaluator was given Playwright MCP so it could interact directly with the live page, assess each criterion, and write a detailed critique. In practice, the evaluator navigated the page on its own, took screenshots, studied the implementation carefully, and then wrote its assessment. That feedback was passed back to the generator as input for the next iteration.
Five to fifteen iterations were run per generation, with each iteration responding to the evaluator's critique and pushing the generator toward more distinctive directions. Because the evaluator actively navigated the page rather than assessing static screenshots, each cycle took real time, and full runs lasted up to four hours. The generator was also instructed to make a strategic decision after each evaluation: refine the current direction when scores were good, or pivot to a completely different aesthetic when the approach wasn't working. 🔄
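The shape of this loop, including the refine-or-pivot decision, might look like the sketch below. It is not the Claude Agent SDK API: `run_loop`, `choose_strategy`, the `PIVOT_BELOW` cutoff, and the two stand-in agent functions are all hypothetical, and the scores are canned values so the example runs without a model.

```python
from typing import Callable

# Hypothetical cutoff on an assumed 0-10 scale; the source gives no number.
PIVOT_BELOW = 5.0


def choose_strategy(score: float) -> str:
    """Refine when the direction is scoring well, otherwise pivot."""
    return "refine" if score >= PIVOT_BELOW else "pivot"


def run_loop(
    generate: Callable[[str, str, str], str],
    evaluate: Callable[[str], tuple[float, str]],
    prompt: str,
    iterations: int,
) -> list[tuple[str, float, str]]:
    """Alternate generation and evaluation, feeding each critique back in."""
    history = []
    critique, strategy = "", "initial"
    for _ in range(iterations):
        artifact = generate(prompt, critique, strategy)
        score, critique = evaluate(artifact)
        strategy = choose_strategy(score)
        history.append((artifact, score, strategy))
    return history


# Stand-ins for the two agents so the sketch runs without a model.
def fake_generate(prompt: str, critique: str, strategy: str) -> str:
    return f"{prompt} [{strategy}]"


_scores = iter([3.0, 6.5, 7.0])  # canned scores, not real results


def fake_evaluate(artifact: str) -> tuple[float, str]:
    return next(_scores), f"critique of: {artifact}"


history = run_loop(fake_generate, fake_evaluate, "museum site", iterations=3)
best = max(history, key=lambda item: item[1])
```

Keeping the full history rather than only the latest artifact lets the caller pick the highest-scoring iteration instead of assuming the last one is the best.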
Across runs, the evaluator's assessments generally improved over iterations, though room for improvement remained. Some generations improved incrementally; others showed sharp aesthetic shifts between iterations.
The wording of the evaluation criteria shaped the generator in ways that were not fully anticipated. Phrases like "the best designs are museum-quality" nudged designs toward particular visual convergences, suggesting that prompt language associated with the criteria directly shapes the character of the output.
Scores generally rose across iterations, but the pattern was not always cleanly linear. Later implementations tended to be better overall, but intermediate iterations frequently outperformed the final one. Implementation complexity also tended to increase round by round as the generator reached for more ambitious solutions in response to the evaluator's feedback. Even on the first iteration, outputs were noticeably better than with no prompting at all—suggesting that the criteria and associated language steered the model away from common defaults even before evaluator feedback drove additional improvement. ✨
4. Scaling to Full-Stack Coding: The GAN-Inspired Pattern
Building on these findings, the GAN-inspired pattern was applied to full-stack development. The generator-evaluator loop maps naturally onto the software development lifecycle: code review and QA play the same structural role as the design evaluator. 🧑‍💻
4.1. Architecture Design
The earlier long-running harness addressed multi-session coding consistency through an initializer agent, a coding agent working one feature at a time, and context resets between sessions. Context resets were a critical advance, because the harness used Sonnet 4.5, which exhibited the "context anxiety" tendency described above. Building a harness that worked well with context resets was essential to keeping the model on task. Opus 4.5 had largely eliminated that behavior on its own, making it possible to remove context resets entirely from this harness. Agents ran as one continuous session across the entire build, with automatic compaction in the Claude Agent SDK managing context growth along the way.
A three-stage agent system was built on top of the original harness's foundation, with each agent designed to address specific gaps observed in earlier runs. The system comprised the following agent personas.
- Planner: The earlier long-running harness required users to supply a detailed specification upfront. To automate this step, a planner agent was created that takes a simple one-to-four sentence prompt and expands it into a complete product specification. The planner was instructed to be ambitious about scope and to focus on product context and high-level technical design rather than detailed technical implementation. This emphasis reflected concern that if the planner tried to specify granular technical details upfront and got them wrong, those errors would propagate into downstream implementation. It was judged wiser to place constraints on what the agents should produce and let them find their own path as they work. The planner was also asked to look for opportunities to incorporate AI capabilities into the product specification.
- Generator: The "one feature at a time" approach used in the earlier harness worked well for scope management. A similar model was applied here, instructing the generator to work in sprints and select one feature from the specification at a time. Each sprint implemented the app in a React, Vite, FastAPI, SQLite (later PostgreSQL) stack, and the generator was instructed to self-assess its work at the end of each sprint before handing off to QA. Git was also used for version control.
- Evaluator: Applications produced by the earlier harness looked impressive but still had real bugs when actually used. To catch these bugs, the evaluator used Playwright MCP to click through the running application as a user would, testing UI functionality, API endpoints, and database state. It then scored each sprint against a set of criteria modeled on the frontend design experiments and adapted to cover product depth, functionality, visual design, and code quality. Each criterion had a strict threshold, and if any fell below it, the sprint failed and the generator received detailed feedback about what went wrong.
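The evaluator's pass/fail rule, where a single criterion below its threshold fails the sprint, can be sketched as follows. The criterion names come from the text; the numeric thresholds, the 0-10 scale, and the `grade_sprint` helper are assumptions for illustration.

```python
# Hypothetical thresholds on an assumed 0-10 scale; the criterion names
# follow the text, the numbers are illustrative only.
THRESHOLDS = {
    "product_depth": 6,
    "functionality": 7,
    "visual_design": 6,
    "code_quality": 6,
}


def grade_sprint(scores: dict[str, float]) -> tuple[bool, list[str]]:
    """Fail the sprint if any criterion is below its threshold."""
    failures = [
        f"{name}: scored {scores.get(name, 0)}, needs {minimum}"
        for name, minimum in THRESHOLDS.items()
        if scores.get(name, 0) < minimum
    ]
    return (not failures, failures)


passed, feedback = grade_sprint(
    {"product_depth": 8, "functionality": 5, "visual_design": 7, "code_quality": 9}
)
```

Because the gate is per criterion rather than an average, strong visual design cannot compensate for broken functionality, and the failure list doubles as the feedback returned to the generator.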
Before each sprint, the generator and evaluator negotiated a sprint contract—reaching agreement on what "done" would look like for that task before any code was written. This existed because the product specification was intentionally high-level, and a step was needed to bridge the gap between user stories and testable implementation. The generator proposed what it would build and how success would be verified; the evaluator reviewed that proposal to confirm the generator was building the right thing. The two agents iterated until they reached agreement. 🤝
Communication was handled through files. One agent would write a file; the other would read it and respond within that file or write a new file for the first agent to read in turn. The generator then carried out the build according to the agreed contract before handing off to QA. This kept the work aligned to the specification without locking in implementation details too early.
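A rough sketch of file-based negotiation is shown below. The file name, the JSON layout, and the `propose` and `review` helpers are hypothetical, and the evaluator's rule (counting criteria) is a trivial stand-in for what would really be a model's judgment about the proposal.

```python
import json
import tempfile
from pathlib import Path


def propose(contract_path: Path, feature: str, criteria: list[str]) -> None:
    """Generator writes its proposal for the sprint."""
    contract_path.write_text(json.dumps(
        {"feature": feature, "criteria": criteria, "status": "proposed", "notes": []},
        indent=2,
    ))


def review(contract_path: Path, min_criteria: int = 2) -> str:
    """Evaluator reads the proposal and responds in the same file."""
    contract = json.loads(contract_path.read_text())
    if len(contract["criteria"]) < min_criteria:
        contract["status"] = "changes_requested"
        contract["notes"].append("Add more testable criteria.")
    else:
        contract["status"] = "agreed"
    contract_path.write_text(json.dumps(contract, indent=2))
    return contract["status"]


with tempfile.TemporaryDirectory() as workdir:
    path = Path(workdir) / "sprint_contract.json"  # hypothetical file name

    # Round 1: the proposal is too thin, so the evaluator pushes back.
    propose(path, "rectangle fill tool", ["tool appears in the palette"])
    first = review(path)

    # Round 2: the generator adds a testable behavior and the two agree.
    propose(path, "rectangle fill tool", [
        "tool appears in the palette",
        "click-drag fills the whole rectangle with the selected tile",
    ])
    second = review(path)
```

Using a file as the shared medium means the agreed contract persists on disk, so either agent can re-read it later without relying on conversation history.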
4.2. Harness Execution and Results
For the first version of this harness, Claude Opus 4.5 powered both the full harness and, for comparison, a single-agent system run on the same user prompts. Opus 4.5 was the best available coding model at the time.
The following prompt was written to generate a retro video game maker:
Create a 2D retro game maker with features including a level editor, sprite editor, entity behaviors, and a playable test mode.
The table below shows the harness type, run time, and total cost.
| Harness type | Run time | Cost |
|---|---|---|
| Solo run | 20 min | $9 |
| Full harness | 6 hours | $200 |
The harness cost more than twenty times as much, but the difference in output quality was immediately apparent.
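The ratios follow directly from the table; a quick check using only the figures above:

```python
solo_minutes, solo_cost = 20, 9
harness_minutes, harness_cost = 6 * 60, 200

cost_ratio = harness_cost / solo_cost        # 200 / 9
time_ratio = harness_minutes / solo_minutes  # 360 / 20

print(f"cost: {cost_ratio:.1f}x, time: {time_ratio:.0f}x")
# prints "cost: 22.2x, time: 18x"
```

So the full harness cost roughly 22 times as much and ran 18 times as long as the solo run.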
The expectation was an interface for composing levels and their components (sprites, entities, tile layouts) and then pressing "play" to actually play the level. Opening the solo run's output, the initial application appeared to meet that expectation.
But clicking through it, problems started to appear. The layout wasted space; fixed-height panels left most of the viewport empty. The workflow was rigid. Filling a level required creating sprites and entities first, but nothing in the UI guided the user through that order. More critically, the actual game was broken. Entities appeared on screen but nothing responded to input. Examining the code revealed a broken connection between entity definitions and the game runtime, with no indication of where the problem lay. 😖
After evaluating the solo run, attention shifted to the harness run. Starting from the same one-sentence prompt, the planning stage expanded it into a specification covering 16 features across 10 sprints—considerably more than the solo run had attempted. Beyond a core editor and play mode, the specification included a sprite animation system, behavior templates, sound effects and music, an AI-assisted sprite generator and level designer, and a game export feature with shareable links. The planner was given access to the frontend design skill, which it read and used to produce a visual design language for the app as part of the specification. In each sprint, the generator and evaluator negotiated a contract defining the specific implementation details and testable behaviors that would confirm completion.
The app was immediately more polished and fluid than the solo run. The canvas used the full viewport, panels were sensibly sized, and the interface had a consistent visual identity following the specification's design direction. A degree of awkwardness from the solo run remained—the workflow didn't make it obvious that sprites and entities needed to be created before populating a level, and that had to be discovered by trial and error. This read as a gap in the base model's product intuition, not something the harness was designed to address—though it suggested room to further improve output quality through targeted iteration within the harness.
Working through the editor, the advantage of the new run over the solo run became clearer. The sprite editor was richer and more fully featured, offering a cleaner tool palette, a better color picker, and more usable zoom controls. 🤩
Because the planner was asked to incorporate AI capabilities into the specification, the app also came with a built-in Claude integration for generating various parts of the game through prompts. This substantially accelerated the workflow.
The biggest difference appeared in play mode. It was actually possible to move entities and play the game. Physics had some rough edges (a character jumping onto a platform would overlap it in a way that felt intuitively wrong), but the core functionality worked, which was not the case in the solo run. Exploring further revealed some limitations in how the AI constructed game levels: large walls that couldn't be jumped over left the player stuck. This suggested that there were commonsense improvements and edge cases the harness could address to further refine the app.
Reading the logs made it clear that the evaluator had kept the implementation aligned to the specification. Each sprint, it followed the sprint contract's test criteria and ran the live application through Playwright, filing bugs for anything that deviated from expected behavior. Contracts were granular—sprint 3 alone had 27 criteria covering the level editor—and the evaluator's findings were specific enough to act on without further investigation. The table below shows examples of issues identified by the evaluator.
| Contract criterion | Evaluator finding |
|---|---|
| Rectangle fill tool should fill a rectangular area with the selected tile via click-drag | Fail — The tool places tiles only at the drag start/end points rather than filling the area. fillRectangle exists but is not properly triggered on mouseUp. |
| User should be able to select and delete placed entity spawn points | Fail — The delete key handler at LevelEditor.tsx:892 requires both selection and selectedEntityId to be set, but clicking an entity sets only selectedEntityId. The condition should be selection || (selectedEntityId && activeLayer === 'entity'). |
| User should be able to reorder animation frames via the API | Fail — The PUT /frames/reorder route is defined after the /{frame_id} route. FastAPI matches 'reorder' as a frame_id integer and returns 422: "unable to parse string as integer." |
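The third finding in the table is an ordering problem: routes are tried in declaration order, so a parameterized route declared first captures the literal segment "reorder". The toy router below reproduces that behavior using only the standard library. It models first-match dispatch in general and is not FastAPI's implementation; the `/frames/...` path prefix, the `dispatch` helper, and the route tuples are assumptions for illustration.

```python
import re

Route = tuple[str, re.Pattern, str]


def dispatch(routes: list[Route], method: str, path: str) -> tuple[int, str]:
    """Return (status, detail) using the first route whose pattern matches."""
    for route_method, pattern, kind in routes:
        match = pattern.fullmatch(path)
        if route_method != method or match is None:
            continue
        if kind == "by_id":
            # The route matched; only now is the parameter validated.
            try:
                int(match.group("frame_id"))
            except ValueError:
                return 422, "unable to parse string as integer"
            return 200, "updated one frame"
        return 200, "reordered frames"
    return 404, "not found"


BY_ID = ("PUT", re.compile(r"/frames/(?P<frame_id>[^/]+)"), "by_id")
REORDER = ("PUT", re.compile(r"/frames/reorder"), "reorder")

# Parameterized route declared first: "reorder" is captured as frame_id.
buggy = dispatch([BY_ID, REORDER], "PUT", "/frames/reorder")
# Literal route declared first: the request reaches the intended handler.
fixed = dispatch([REORDER, BY_ID], "PUT", "/frames/reorder")
still_ok = dispatch([REORDER, BY_ID], "PUT", "/frames/7")
```

In this model the fix is simply to declare the literal route before the parameterized one, which matches the remedy implied by the evaluator's finding.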
Getting the evaluator to operate at this level took work. Early on, Claude was a poor QA agent. In initial runs it would identify a legitimate problem, then dismiss it and approve the work anyway. It also tended to test superficially rather than exploring edge cases, so subtle bugs were easy to miss. 🕵️‍♀️ The tuning loop involved reading the evaluator's logs, identifying examples where its judgment diverged from what was expected, and updating the QA prompt to address those issues. Several development loops were required before the evaluator rated things in a reasonable way. Even so, harness outputs revealed the limits of the model's QA capabilities: small layout issues, some non-intuitive interactions, and bugs in more deeply nested features that the evaluator didn't test thoroughly. There was clearly more validation headroom to capture with additional tuning. But compared to the solo run, where core application functionality didn't work, the improvement was clear.
4.3. Iterative Harness Improvement
The first results from the harness were encouraging, but it was bulky, slow, and expensive. The logical next step was to find ways to simplify the harness without sacrificing performance. This was partly a common-sense exercise and partly a function of a more general principle: every component of the harness embeds an assumption about something the model can't do on its own. Those assumptions can be wrong, and they can quickly become obsolete as models improve—making them worth stress-testing.
A first attempt at simplification dramatically cut back the harness and tried some creative new ideas, but couldn't reproduce the original performance. It also became harder to tell which parts of the harness design were actually doing important work, and in what way. This experience led to a more systematic approach: removing one component at a time and examining the impact on final results. 🧪
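Removing one component at a time is a standard ablation. A minimal sketch of the procedure follows; the component names are taken from the text, while the `ablate` helper, the `fake_run` scoring stub, and every number in `_FAKE_VALUE` are invented purely to exercise the code and are not measurements from the research.

```python
from typing import Callable

COMPONENTS = ["planner", "sprints", "sprint_contract", "evaluator"]


def ablate(
    run: Callable[[frozenset[str]], float],
    components: list[str],
) -> dict[str, float]:
    """Score drop caused by removing each component on its own."""
    baseline = run(frozenset(components))
    return {
        name: baseline - run(frozenset(components) - {name})
        for name in components
    }


# Made-up scoring stub so the sketch runs; a real run would execute the
# harness and have a human or evaluator rate the resulting app.
_FAKE_VALUE = {"planner": 2.0, "sprints": 0.0, "sprint_contract": 0.5, "evaluator": 1.5}


def fake_run(enabled: frozenset[str]) -> float:
    return 5.0 + sum(_FAKE_VALUE[name] for name in enabled)


drops = ablate(fake_run, COMPONENTS)
removable = [name for name, drop in drops.items() if drop <= 0.0]
```

Changing exactly one component per run is what makes each score drop attributable, which is the property the earlier all-at-once simplification lacked.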
During this iteration cycle, Opus 4.6 was also released, providing additional motivation to reduce harness complexity. There was a reasonable expectation that 4.6 would require less scaffolding than 4.5.
"[Opus 4.6] plans more carefully, sustains agentic work for longer, operates more reliably across larger codebases, and has better code review and debugging skills for catching its own mistakes. Its long-context retrieval capability has also improved substantially."
All of these capabilities were exactly what the harness had been built to compensate for.
4.4. Removing the Sprint Structure
The first thing removed was the sprint structure entirely. It had helped by decomposing tasks into chunks that kept the model working coherently. Given Opus 4.6's improvements, there was good reason to think the model could handle this kind of decomposition on its own.
The planner and evaluator were kept because they continued to add clear value. Without the planner, the generator set scope too narrowly—given a raw prompt, it would start building without pre-specifying the work and end up with a less feature-rich application than the planner would have produced.
With the sprint structure removed, the evaluator was changed to evaluate once at the end of a run rather than after each sprint. Because the model had become substantially more capable, how much the evaluator mattered in any given run depended on how reliably the model could handle the work on its own. On 4.5, that boundary was close—the builds sat right at the limit of what the generator could do well solo, and the evaluator caught meaningful problems throughout. On 4.6, the model's raw capability had increased, moving that boundary outward. Tasks that previously needed an evaluator's check now fell within what the generator could handle on its own, and for work within that range the evaluator became unnecessary overhead. But for the parts of a build sitting at the edge of the generator's capability, the evaluator still delivered real value.
The practical implication is that whether to include an evaluator is not a fixed yes-or-no decision. It is worth the cost when the task exceeds what the current model can reliably handle on its own.
Alongside structural simplification, prompting was added to improve how AI features were built into each app—specifically, generating an appropriate agent capable of driving the app's functionality through tools. This required real iteration work, since the relevant knowledge was recent enough that it was barely represented in Claude's training data. With enough tuning, the generator could build agents correctly. 🛠️
4.5. Updated Harness Results
To test the updated harness, the following prompt was used to generate a DAW (Digital Audio Workstation)—a music production program for composing, recording, and mixing songs.
Build a fully featured DAW in the browser using the Web Audio API.
This run was still long and costly, taking about four hours and costing $124 in tokens.
Most of the time was spent by the generator (the Build rows in the table below), which ran coherently for over two hours without the sprint decomposition that had been necessary with Opus 4.5.
| Agent and stage | Run time | Cost |
|---|---|---|
| Planner | 4.7 min | $0.46 |
| Build (Round 1) | 2 hr 7 min | $71.08 |
| QA (Round 1) | 8.8 min | $3.24 |
| Build (Round 2) | 1 hr 2 min | $36.89 |
| QA (Round 2) | 6.8 min | $3.09 |
| Build (Round 3) | 10.9 min | $5.88 |
| QA (Round 3) | 9.6 min | $4.06 |
| Total V2 Harness | 3 hr 50 min | $124.70 |
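The totals and the split between stages can be checked from the rows above. The figures are copied from the table; the `cost_share` helper is a hypothetical name introduced for this check.

```python
# (stage, minutes, dollars) copied from the table above.
STAGES = [
    ("Planner", 4.7, 0.46),
    ("Build (Round 1)", 127.0, 71.08),
    ("QA (Round 1)", 8.8, 3.24),
    ("Build (Round 2)", 62.0, 36.89),
    ("QA (Round 2)", 6.8, 3.09),
    ("Build (Round 3)", 10.9, 5.88),
    ("QA (Round 3)", 9.6, 4.06),
]

total_minutes = sum(stage_minutes for _, stage_minutes, _ in STAGES)
total_cost = sum(cost for _, _, cost in STAGES)


def cost_share(prefix: str) -> float:
    """Fraction of total cost spent in stages whose name starts with prefix."""
    matching = sum(cost for name, _, cost in STAGES if name.startswith(prefix))
    return matching / total_cost


hours, minutes = divmod(round(total_minutes), 60)
print(f"total: {hours} hr {minutes} min, ${total_cost:.2f}")
print(f"build share: {cost_share('Build'):.1%}, QA share: {cost_share('QA'):.1%}")
```

The rows sum to the stated total, and the build rounds account for roughly 91% of the cost, with the three QA rounds together at about 8%.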
As with the previous harness, the planner expanded a one-line prompt into a complete specification. The logs showed the generator model planning the app and agent design carefully, wiring up the agents, and testing before handing off to QA.
Even so, the QA agent still found real gaps. Its first feedback read:
This is a strong app with great design polish, solid AI agents, and a good backend. The primary failure point is feature completeness. The app is impressive and the AI integration works well, but several core DAW features are display-only without interaction depth. Clips can't be dragged/moved in the timeline, instrument UI panels (synth knobs, drum pads) are absent, and there are no visual effect editors (EQ curve, compressor meter). These aren't edge cases—they're the core interactions that make a DAW usable, and they're explicitly required by the specification. 😢
The second round of feedback found multiple functional gaps as well:
Remaining gaps:
- Audio recording is still stub-only (button toggles but no mic capture).
- Clip resizing via edge-drag and clip splitting are not implemented.
- Effect visualization is numeric sliders, not graphical (no EQ curve).
The generator still tended to miss details or stub out features when working on its own, and QA continued to add value by catching last-mile issues the generator needed to fix.
Based on the prompt, the expectation was a program for creating melodies, harmonies, and drum patterns, arranging them into a song, and getting help from an integrated agent along the way. 🎵 The video linked in the original article shows the result.
The app fell well short of a professional music production program, and the agent's compositional abilities clearly had significant room to grow. Moreover, since Claude cannot actually hear sound, the QA feedback loop was less effective for anything related to musical taste.
But the final app had all the core elements of a functional music production program—a working arrangement view, a mixer, and transport controls running in the browser. Beyond that, a short musical piece could be created through prompting alone. The agent set the tempo and key, laid down a melody, created a drum track, adjusted mixer levels, and added reverb. The core elements for composition were present, and the agent could autonomously drive a simple production from start to finish using tools. Not perfect yet, but getting there! 🤩
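An agent that drives an app "using tools" generally means the model emits named tool calls that the app executes against its own state. The sketch below illustrates that pattern for the actions listed above. The `Project` state, the tool names, the call format, and `apply_tool_calls` are all hypothetical; the source does not describe the generated app's actual tool interface.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Project:
    """Hypothetical in-memory DAW state the agent's tools operate on."""
    tempo_bpm: int = 120
    key: str = "C major"
    tracks: dict[str, list[str]] = field(default_factory=dict)
    levels_db: dict[str, float] = field(default_factory=dict)


def set_tempo(project: Project, bpm: int) -> str:
    project.tempo_bpm = bpm
    return f"tempo set to {bpm} bpm"


def set_key(project: Project, key: str) -> str:
    project.key = key
    return f"key set to {key}"


def add_notes(project: Project, track: str, notes: list[str]) -> str:
    project.tracks.setdefault(track, []).extend(notes)
    return f"{len(notes)} notes added to {track}"


def set_level(project: Project, track: str, db: float) -> str:
    project.levels_db[track] = db
    return f"{track} level set to {db} dB"


TOOLS: dict[str, Callable[..., str]] = {
    "set_tempo": set_tempo,
    "set_key": set_key,
    "add_notes": add_notes,
    "set_level": set_level,
}


def apply_tool_calls(project: Project, calls: list[dict[str, Any]]) -> list[str]:
    """Run each tool call the agent emitted; unknown tools are reported, not raised."""
    results = []
    for call in calls:
        tool = TOOLS.get(call["name"])
        if tool is None:
            results.append(f"error: unknown tool {call['name']}")
            continue
        results.append(tool(project, **call["args"]))
    return results


project = Project()
log = apply_tool_calls(project, [
    {"name": "set_tempo", "args": {"bpm": 96}},
    {"name": "set_key", "args": {"key": "A minor"}},
    {"name": "add_notes", "args": {"track": "melody", "notes": ["A4", "C5", "E5"]}},
    {"name": "set_level", "args": {"track": "melody", "db": -6.0}},
    {"name": "add_reverb", "args": {"track": "melody"}},  # not registered
])
```

Returning an error string for an unregistered tool, rather than raising, gives the agent something it can read and recover from on its next step.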
5. Next Steps: The Continuing Evolution of Harness Design
As models continue to improve, we can expect them to sustain longer and more complex tasks. In some cases, the importance of the scaffold surrounding the model will diminish over time, and developers may find certain problems resolving themselves as they wait for the next model. On the other hand, as models get better, the space for developing harnesses that achieve complex tasks beyond what the model can do by default only grows larger. 🚀
With that in mind, several lessons from this research will continue to matter. Experimenting with models, reading traces of real problems, and tuning performance to achieve desired outcomes are always good practices. For more complex tasks, there may be room for improvement by decomposing the work and applying specialized agents to each aspect of the problem. And when new models are released, it's generally worth revisiting the harness—removing parts that no longer play an important role in performance, and adding new parts to achieve capabilities that weren't previously possible.
This research reinforced the conviction that the space of interesting harness combinations does not shrink as models improve. Instead, it shifts—and for AI engineers, the interesting work is continuing to find the next new combination. 💡
Closing Thoughts
This article offers an engaging look at how the challenges of building complex applications with AI agents—particularly Claude—can be addressed through the creative approach of harness design. From evaluating subjective design quality to building full-stack applications that actually work, the multi-agent system linking generator, evaluator, and planner shows a path to drawing out the full potential of AI agents. As models continue to advance, the domain of harness design will expand further, and the creative efforts of AI engineers will only shine brighter—an exciting prospect indeed! ✨
