This article covers Cursor's internal research into how thousands of AI agents can collaborate to autonomously develop complex software projects. It details the evolution of their agent orchestration (harness) system and the key lessons learned along the way, describing the various experiments and final system design they pursued in building a self-driving codebase.
1. Background and Early Attempts
Cursor began internal research to explore the feasibility of long-horizon autonomous coding projects. 🧑‍💻 The goal was to build a system (harness) that could coordinate thousands of agents and observe their behavior. Last month (May 2026), the system ran continuously for a week, demonstrating enough stability to generate the majority of commits needed for a research project (a web browser). It wasn't a browser for external use, and the code had some bugs, but the fact that thousands of agents collaborated without human intervention to produce a working artifact was a truly significant milestone! 🎉
This research grew out of a personal side project. A web browser was sufficiently complex to expose the limits of frontier models, and it contained diverse subsystems that needed to work together. Initially, Opus 4.5 was asked to plan a browser engine that could render web pages without JavaScript support. But the model quickly got lost — claiming success before finishing properly or getting stuck on complex implementation details. It still showed signs of deep knowledge and intelligence in writing small code snippets.
The core problem was that the task of building a browser was too overwhelming and needed to be decomposed into subtasks. Next, the agent was prompted to plan a dependency graph of major tasks it could execute in parallel. Agents were created manually and nudged to keep going when they stalled. This increased throughput, but results didn't improve much, because agents couldn't communicate with each other or provide feedback about the project as a whole. The system needed to be more dynamic.
Meanwhile, GPT-5.1 and GPT-5.2 began showing better results with improved instruction-following. This seemed well-suited to long-running agents, and the harness was updated to use OpenAI models based on these experiments. At this point the harness could produce a simple version of a JavaScript-free web browser, but it became clear that a single agent alone would be too slow to build a complete browser engine.
This was the start of the next phase of research. 🤔 The question was: "If we invest 10x more compute, can we get 10x more meaningful throughput?"
2. Moving from Single Agent to Multi-Agent
A new repository was started with a simple Rust-based harness. Rather than dealing with the complexity of distributed systems, the harness ran on a single large Linux VM with ample resources. A simple terminal interface over SSH was used to control the harness.
More time was invested in building proper observability into the system. All agent messages, system actions, and command outputs were logged with timestamps so sessions could be analyzed and replayed. This was useful for manual review, and also helpful for piping back through Cursor to sift through large volumes of data and quickly identify patterns.
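As a rough illustration of this kind of logging, the sketch below appends one timestamped JSON line per event and replays them in order. The field names (`ts`, `agent`, `kind`, `payload`) and the helper names are assumptions made for this example; the article does not describe the harness's actual log format.

```python
import io
import json
import time
from typing import Iterator, Optional, TextIO


def log_event(stream: TextIO, agent_id: str, kind: str, payload: str) -> None:
    """Append one timestamped event as a single JSON line."""
    record = {"ts": time.time(), "agent": agent_id, "kind": kind, "payload": payload}
    stream.write(json.dumps(record) + "\n")


def replay(stream: TextIO, agent_id: Optional[str] = None) -> Iterator[dict]:
    """Yield events in the order they were appended, optionally for one agent."""
    for line in stream:
        if not line.strip():
            continue
        event = json.loads(line)
        if agent_id is None or event["agent"] == agent_id:
            yield event


# io.StringIO stands in for an append-only log file (hypothetical data below).
log = io.StringIO()
log_event(log, "worker-7", "command", "cargo build")
log_event(log, "worker-7", "output", "build failed")
log_event(log, "planner-1", "message", "re-scoping the layout task")

log.seek(0)
worker_events = list(replay(log, agent_id="worker-7"))
```

One JSON object per line keeps the log easy to filter by agent or event kind, which is the sort of slicing that manual review and model-assisted analysis both depend on.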
2.1. The Failure of Self-Coordination
The first multi-agent idea was the simplest. 👇
- Agents with equal roles would use a shared state file to check what other agents were doing, decide what to work on, and update the file.

The least prescriptive approach was taken, letting agents figure out how to coordinate themselves — but this failed quickly. 😱
The coordination file caused more problems than it solved. Agents held locks for too long, forgot to release them, or tried to acquire or release locks at illegal times — broadly failing to understand the importance of locking the coordination file. Locking was easy to get wrong and hard to use correctly, and additional prompting didn't help.
Locking also caused too much contention. Twenty agents spent most of their time waiting for locks, cutting effective throughput to that of 1–3 agents. Agents were given tools to explicitly wait for other agents' work, but rarely used them. A lock-free optimistic concurrency control approach was also tried, which reduced overhead but didn't eliminate the confusion.
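For readers unfamiliar with the term, optimistic concurrency control replaces "lock, then write" with "read a version, write only if the version is unchanged, otherwise retry". The sketch below is a minimal single-threaded illustration of that idea. `SharedState` and `claim_task` are hypothetical names, and this is not the coordination mechanism the harness used.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class SharedState:
    """In-memory stand-in for the shared coordination file."""
    version: int = 0
    claims: Dict[str, str] = field(default_factory=dict)  # task -> agent

    def read(self) -> Tuple[int, Dict[str, str]]:
        """Return the current version together with a private copy of the claims."""
        return self.version, dict(self.claims)

    def commit(self, expected_version: int, new_claims: Dict[str, str]) -> bool:
        """Apply the write only if nobody committed since the caller's read.

        A real implementation would need this check-and-write to be atomic.
        """
        if expected_version != self.version:
            return False
        self.claims = dict(new_claims)
        self.version += 1
        return True


def claim_task(state: SharedState, agent: str, task: str, max_retries: int = 3) -> bool:
    """Try to claim a task without holding a lock; re-read and retry on conflict."""
    for _ in range(max_retries):
        version, claims = state.read()
        if task in claims:
            return False  # another agent already owns this task
        claims[task] = agent
        if state.commit(version, claims):
            return True
    return False


state = SharedState()
stale_version, stale_claims = state.read()           # agent-a reads, then stalls
b_claimed = claim_task(state, "agent-b", "parse-css")
stale_claims["layout"] = "agent-a"
stale_write = state.commit(stale_version, stale_claims)  # rejected: version moved on
a_claimed = claim_task(state, "agent-a", "layout")       # fresh read succeeds
```

No agent ever waits on a lock here, but each one still has to notice a rejected write and retry correctly, which is consistent with the observation that overhead dropped while confusion remained.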
The lack of structure among agents meant no single agent took on large, complex tasks. They tended to avoid contention and conflict, preferring small, safe changes rather than taking responsibility for the project as a whole.
3. Adding Structure and Roles
Next, roles were separated to give agents ownership and accountability. 👥

- Planner: Plans the exact approach and deliverables upfront, in accordance with the user's instructions.
- Executor: The sole lead agent responsible for ensuring the planner's plan is fully achieved. The executor can create tasks for Workers, so throughput scales roughly linearly with the number of workers.
- Judge: Runs independently after the executor finishes to determine whether the work is complete and whether another iteration should be run.
This structure resolved many coordination problems. By having a single role that owned and supervised execution, Workers could focus solely on their own tasks while the overall system still produced deliverables.
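The control flow implied by these three roles can be sketched as a simple loop. Everything here is illustrative: `planner`, `worker`, and `judge` are plain callables standing in for model-backed agents, and planning once up front is an assumption drawn from the role descriptions above.

```python
from typing import Callable, Dict, List, Set, Tuple


def run(
    instructions: str,
    planner: Callable[[str], List[str]],
    worker: Callable[[str], bool],
    judge: Callable[[List[str], Set[str]], bool],
    max_iterations: int = 3,
) -> Tuple[int, Set[str]]:
    """Plan once, then alternate executor passes and judge checks."""
    plan = planner(instructions)
    done: Set[str] = set()
    for iteration in range(1, max_iterations + 1):
        # Executor pass: hand every unfinished deliverable to a worker.
        # The pass only ends when all workers have returned.
        for task in plan:
            if task not in done and worker(task):
                done.add(task)
        # Judge: decide whether the work is complete or another pass is needed.
        if judge(plan, done):
            return iteration, done
    return max_iterations, done


attempts: Dict[str, int] = {}


def flaky_worker(task: str) -> bool:
    """Hypothetical worker that fails its first attempt at the layout task."""
    attempts[task] = attempts.get(task, 0) + 1
    return not (task == "layout" and attempts[task] == 1)


iterations, finished = run(
    "build a JavaScript-free browser",
    planner=lambda _: ["html parser", "css parser", "layout"],
    worker=flaky_worker,
    judge=lambda plan, done: set(plan) <= done,
)
```

Because the judge only runs after the whole executor pass finishes, a single slow or failing task delays the decision for every other task in the plan.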
3.1. Observation and Improvement
Making this design work required close observation of the system. 🔎
When a major problem emerged, it recurred repeatedly across many agents and tool calls. For example, it was discovered that many agents were running git restore simultaneously, creating excessive contention. Cursor was used to analyze logs and compare them against prompts to understand why behavior didn't match expectations.
Ultimately, it became clear that this system was bottlenecked by the slowest worker. It was too rigid. Planning everything upfront also made it hard for the system to dynamically readjust when new problems were discovered. Some agents headed in unproductive directions and couldn't self-correct until the next iteration loop.
4. The Continuous Executor
In the next version, the independent planner was removed. 👏
The executor could now plan for itself how to achieve goals, in addition to creating tasks. As the sole lead agent, it no longer needed to record plans anywhere, be bound to a static plan, or wait inflexibly for all workers.
4.1. Ensuring Freshness
Freshness mechanisms were introduced to prevent agents from drifting over long periods. 🌿
- scratchpad.md should be rewritten frequently rather than appended to.
- Individual agents should automatically summarize themselves when they approach context limits.
- Self-reflection and alignment reminders were added to system prompts.
- Agents were encouraged to revise and challenge their assumptions at any time.
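The first two mechanisms can be illustrated with a small sketch. The word-count token estimate, the 0.8 threshold, and the `summarize` callable are all assumptions made for the example; a real agent would use its model's tokenizer and a model-generated summary.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def rewrite_scratchpad(path: Path, current_notes: str) -> None:
    """Overwrite the scratchpad with the current view instead of appending to it."""
    path.write_text(current_notes, encoding="utf-8")


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: count whitespace-separated words."""
    return len(text.split())


def compact_if_needed(
    history: List[str],
    token_limit: int,
    summarize: Callable[[List[str]], str],
    threshold: float = 0.8,
) -> List[str]:
    """Replace older messages with a summary once usage nears the limit."""
    used = sum(estimate_tokens(message) for message in history)
    if len(history) < 2 or used < threshold * token_limit:
        return history
    return [summarize(history[:-1]), history[-1]]


def fake_summarize(messages: List[str]) -> str:
    """Placeholder for a model-written summary."""
    return f"summary of {len(messages)} earlier messages"


with tempfile.TemporaryDirectory() as tmp:
    pad = Path(tmp) / "scratchpad.md"
    rewrite_scratchpad(pad, "plan v1: parse HTML first")
    rewrite_scratchpad(pad, "plan v2: layout is blocked on CSS parsing")
    scratchpad_text = pad.read_text(encoding="utf-8")

history = [
    "read the layout module source",
    "ran the unit tests twice",
    "found a failing float test",
]
compacted = compact_if_needed(history, token_limit=12, summarize=fake_summarize)
```

Overwriting keeps stale plans from accumulating in the file, and compaction keeps the most recent message verbatim so the agent does not lose its immediate context.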
The system became highly dynamic and flexible. It could actively explore code, reconsider decisions, manage workers, interleave tasks, and continuously incorporate the latest information. Because agents turned out to be quite capable of completing instructions, the Judge was removed to keep the system simple.

4.2. Pathological Behaviors
Despite these improvements, the continuous executor began exhibiting pathological behaviors. 😵‍💫 It would randomly go idle, halt agent execution, perform work itself, refuse to plan and create more than a few narrow tasks, fail to properly merge workers' changes, and prematurely claim completion.
It turned out that too many roles and goals were being given to the executor simultaneously: planning, exploring, researching, creating tasks, checking on workers, reviewing code, making edits, merging outputs, and deciding whether the loop was done. In hindsight, it's no wonder it was overwhelmed.
5. Final System Design
The final design integrates everything learned. 🧠
- Root Planner: Owns the full scope of the user's instructions. Its job is to understand the current state and deliver specific, concrete tasks that move toward the goal. It does no coding itself and has no knowledge of who executes its tasks.
- Subplanners: Created when the planner determines its scope can be subdivided. They fully own the narrower portion delegated to them, with the same kind of total ownership but scoped to that area. This is recursive.
- Workers: Bear full responsibility for picking up a task and seeing it through to completion. They have no awareness of the larger system and do not communicate with other planners or workers. They work on their own copy of the repository, and when done, write a single handoff that the system submits to the planner that requested the task.
Interestingly, this design resembles how some software teams operate today. 🤝

Subplanners rapidly fan out workers to increase throughput while ensuring the entire system remains fully owned and accountable to its agents. This also helped with large-scale projects and tasks where a single planner could become overwhelmed and fall into tunnel vision.
Handoffs contain not just what was completed, but also important notes, concerns, deviations, discoveries, thoughts, and feedback. Planners receive these as follow-up messages. This keeps the system continuously moving. Even after a planner is "done," it continues receiving updates, can pull the latest repository, and carries on planning and making follow-up decisions.
Every agent has this mechanism, making the system incredibly dynamic and self-converging. Information flows upward to owners with progressively broader views — without the overhead of global synchronization or inter-agent communication.
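A minimal sketch of this ownership tree and its upward handoff flow follows. The `Planner` and `Handoff` classes, their fields, and the `report_up` method are hypothetical names chosen for illustration; the article does not specify the real message format.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Handoff:
    task: str
    completed: bool
    notes: str  # concerns, deviations, discoveries, feedback


@dataclass
class Planner:
    scope: str
    parent: Optional["Planner"] = None
    inbox: List[Handoff] = field(default_factory=list)

    def delegate(self, sub_scope: str) -> "Planner":
        """Create a subplanner that fully owns a narrower slice of this scope."""
        return Planner(scope=sub_scope, parent=self)

    def receive(self, handoff: Handoff) -> None:
        """Handoffs arrive as follow-up messages, even after planning is 'done'."""
        self.inbox.append(handoff)

    def report_up(self) -> None:
        """Send the owner one level up a single handoff covering this scope."""
        if self.parent is None:
            return
        self.parent.receive(Handoff(
            task=self.scope,
            completed=all(h.completed for h in self.inbox),
            notes="; ".join(f"{h.task}: {h.notes}" for h in self.inbox),
        ))


def run_worker(requester: Planner, task: str, work: Callable[[str], Handoff]) -> None:
    """The worker sees only its task; the system routes its handoff to the requester."""
    requester.receive(work(task))


root = Planner("browser")
layout = root.delegate("layout")
run_worker(layout, "flexbox", lambda t: Handoff(t, True, "done; float handling looks fragile"))
run_worker(layout, "grid", lambda t: Handoff(t, False, "blocked on missing style API"))
layout.report_up()
```

Workers never reference each other or the root, so the only communication path is child to owner, which is what avoids global synchronization.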
5.1. Removing the Integrator
An Integrator was originally added for centralized, globally-aware quality control, and to eliminate contention caused by too many workers simultaneously attempting to push, rebase, resolve conflicts, and merge.
However, the Integrator became an obvious bottleneck — hundreds of workers with all their work funneling through a single gate (a kind of "bureaucracy"). Prompt changes were tried, but ultimately the Integrator was deemed unnecessary and removed to simplify the system.
6. Throughput and Trade-offs
Over one week, the system logged approximately 1,000 commits per hour through roughly 10 million tool calls. After the system started, no human intervention was needed. 🚀 Achieving this throughput required deliberate trade-offs.
6.1. Commit Correctness
When 100% correctness was required before every commit, severe serialization occurred and effective throughput degraded. Even a single small error — an API change or typo — would halt the entire system. Workers would start ranging outside their scope to fix unrelated things, and many agents would end up stepping on each other trying to fix the same problem.
This behavior was neither helpful nor necessary. By allowing some slack, agents can trust that other problems will soon be fixed by peer agents. This holds true because the system has effective ownership and delegation over the entire codebase. Errors occur but are fixed quickly. The error rate stays small and constant — not perfectly clean, but stable and manageable, never exploding or spiraling out of control.
This suggests that an ideally efficient system tolerates some error rate, paired with a final "green" branch where agents periodically take snapshots and quickly fix remaining issues before release.
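One way to see why the error count can stay bounded is a toy model, which is an assumption of this example rather than anything measured in the research. Suppose each hour adds a fixed number of new errors and peers fix a fixed fraction of the errors currently open. The open count then settles at new errors per hour divided by that fraction, instead of growing without limit. All numbers below are hypothetical.

```python
from typing import List


def simulate_open_errors(
    introduced_per_hour: float,
    fix_fraction: float,
    hours: int,
) -> List[float]:
    """Track open errors when a fixed fraction is repaired every hour."""
    open_errors = 0.0
    history = []
    for _ in range(hours):
        open_errors = open_errors * (1.0 - fix_fraction) + introduced_per_hour
        history.append(open_errors)
    return history


# Hypothetical rates: 20 new errors per hour, half of open errors fixed per hour.
with_fixing = simulate_open_errors(introduced_per_hour=20, fix_fraction=0.5, hours=48)
steady_state = 20 / 0.5

# Same introduction rate, but nobody fixes anything: the count grows linearly.
without_fixing = simulate_open_errors(introduced_per_hour=20, fix_fraction=0.0, hours=48)
```

In this model the stable level depends on how quickly errors get fixed relative to how quickly they appear, which matches the article's point that ownership and delegation, not per-commit perfection, keep the rate manageable.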
6.2. Synchronization Overhead
Sometimes multiple agents touch the same file or refactor the same code. Rather than trying to fully eliminate these situations or over-engineer around them, some degree of "turbulence" is accepted, and the system is allowed to naturally converge and stabilize in a short time.
This consumes extra tokens and causes local contention, but keeps the overall system simpler. A simpler system is easier to align without overwhelming the models, easier to manage and observe, and produces less friction and better overall productivity.
7. Infrastructure Learnings
Each multi-agent run executed on its own large machine with sufficient system resources, avoiding premature complexity in distributed systems. This was a good approach — most runs peaked at hundreds of agents, which generally saturated these machines without overloading them. This architecture made it easy to observe system metrics and to share and copy state as needed.
After capping agent RAM usage, disk became the bottleneck. In monolithic projects especially, hundreds of agents compiling simultaneously generated gigabytes of reads/writes per second against build artifacts. This had a significant impact on overall harness throughput — an interesting lesson. Project structure, architecture decisions, and developer experience can affect token and commit throughput, because waiting for codebase operations (e.g., compilation) dominates time rather than actual thinking and coding.
Common development environments also had constraints and inefficiencies. Things that don't matter or are insignificant in a single-user workspace can become glaring when hundreds of agents perform the same operations on one machine. One simple remedy is to give each agent its own machine. But there are also interesting low-risk opportunities for significant efficiency gains by rethinking and redesigning these basic tools and primitives.
For example, many tools like Git and Cargo rely primarily on shared locks as simple concurrency control mechanisms. Could well-established mechanisms from concurrent systems like databases be brought over and made to work well in multi-agent systems? All agents have their own copy of the repository, but most files and artifacts are identical. Could adding simple copy-on-write and deduplication features found in more sophisticated production storage systems bring similar easy wins to otherwise "single-user" systems without building separate infrastructure?
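To make the deduplication question concrete, here is a small content-addressed store in which every agent checkout is just a table of path to digest, and identical file contents are stored once. It is a sketch of the general technique only: `DedupStore` is a hypothetical name, it has no garbage collection, and it is not a proposal for how Git or Cargo should work.

```python
import hashlib
from typing import Dict


class DedupStore:
    """Content-addressed storage shared by many agent checkouts."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}                # digest -> content, stored once
        self.checkouts: Dict[str, Dict[str, str]] = {}   # agent -> path -> digest

    def write(self, agent: str, path: str, content: bytes) -> None:
        """Store content under its hash and point the agent's path at it."""
        digest = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(digest, content)
        self.checkouts.setdefault(agent, {})[path] = digest

    def clone(self, source: str, target: str) -> None:
        """Copy-on-write clone: copy the path table, never the file bytes."""
        self.checkouts[target] = dict(self.checkouts[source])

    def read(self, agent: str, path: str) -> bytes:
        return self.blobs[self.checkouts[agent][path]]


store = DedupStore()
store.write("base", "src/lib.rs", b"pub mod layout;")
store.write("base", "src/layout.rs", b"pub fn layout() {}")

# One hundred agents each get their own checkout of the same two files.
for i in range(100):
    store.clone("base", f"agent-{i}")

# A single agent edits one file; only that new content is stored.
store.write("agent-3", "src/layout.rs", b"pub fn layout() { /* flexbox */ }")
blob_count = len(store.blobs)
```

After one hundred clones and one edit the store holds three blobs rather than roughly two hundred copies, which is the kind of easy win the paragraph above asks about.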
8. Specifying Intent to Agents
The instructions given to this multi-agent system were critically important. 📢
Initially, instructions weren't the primary focus — building a stable and effective harness was. But the importance of instructions quickly became clear. The system was essentially interacting with general-purpose coding agents, just at a much larger scale in terms of time and compute. This amplifies everything, including suboptimal or unclear instructions.
It's worth investing more time upfront in initial instructions. Ultimately, agents are still agents: they're trained to follow instructions strictly and stay on that path, and they won't change or ignore instructions even when those instructions are poor.
Because the goal was for the research project to succeed, initial instructions were revised as both the project and harness evolved. Learning how to build a browser while also learning how to operate a new multi-agent system meant that poorly specified or underspecified requirements showed up in output quality. This wasn't a failure of the harness itself; the harness simply followed instructions precisely.
A few examples from the browser project:
- Early on, instructions focused on spec implementation and bug fixing. Instructions like "implement the spec" were too vague, causing agents to dive deep into obscure, rarely-used features rather than intelligently prioritizing.
- It was assumed that agents would implicitly aim for performance a user would find acceptable. In practice, explicit instructions and enforced timeouts were needed to force agents to balance performance against other goals.
- In complex parts of the system, agents could write code that caused memory leaks or deadlocks. Humans would notice this, but it wasn't always obvious to agents. Explicit process-based resource management tools were needed to make the system recover gracefully and behave more defensively.
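The enforced timeouts and process-based isolation mentioned in the last two examples can be sketched with the standard library alone. The helper name and the result dictionary are assumptions made for this example, and the hanging command simply sleeps to simulate a deadlocked program.

```python
import subprocess
import sys
from typing import Dict, List, Optional


def run_with_timeout(cmd: List[str], timeout_s: float) -> Dict[str, Optional[object]]:
    """Run a command in a separate process and report a timeout instead of hanging."""
    try:
        completed = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed if it exceeds this
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "returncode": None}
    status = "ok" if completed.returncode == 0 else "failed"
    return {"status": status, "returncode": completed.returncode}


# Simulated deadlock: the child sleeps far longer than the allowed budget.
hung = run_with_timeout([sys.executable, "-c", "import time; time.sleep(30)"], timeout_s=0.5)

# A well-behaved command finishes inside its budget.
quick = run_with_timeout([sys.executable, "-c", "print('rendered')"], timeout_s=10)
```

Running risky code in a child process means a leak or deadlock costs one process and one time budget rather than stalling the agent that launched it.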
The first version of the simple JavaScript-free browser converged on an architecture that was unsuitable for evolving into a full browser. This was a failure of the initial specification.
Similarly, even though agents were told from the start this was a browser project, they pulled in some dependencies they could have implemented themselves or used as temporary scaffolding while proper implementations were in progress. This was an oversight in the instructions. Subsequent runs corrected this by explicitly specifying the dependency philosophy and which libraries should not be used.
Those subsequent runs also undertook a major reorganization from a monolith into many self-contained crates. The repository was in a severely broken state, but the multi-agent system converged to working code within days. This demonstrated the system's powerful ability to operate collaboratively and intelligently even from a completely broken state — without degrading further or grinding to a halt. Those runs also spent far less time waiting for compilation and ran at several times higher throughput than before.
Architecture and instructions matter. Agents have tremendous engineering skill, but they will follow instructions to the end, for better or worse. Finding the balance between overly narrow metrics and unstructured freedom was tricky, as was knowing what was obvious versus what needed to be stated explicitly. All of this points to the importance of eliciting, specifying, and communicating intent — and at this scale, that importance only grows. Steerability and Observability will remain fascinating research areas to continue exploring.
9. Optimizing Prompts
Prompt writing was an important part of the evolution process. 📝
It worked better not to instruct models on things they already know how to do, and instead to instruct them only on things they don't know (e.g., multi-agent collaboration) or things specific to the relevant domain (e.g., how to run tests, deployment pipelines). Treat models like smart new hires who know engineering but don't know the specific codebase and processes.
Constraints work better than instructions. "No TODOs, no partial implementations" works better than "don't forget to finish the implementation." Models generally do good things by default. Constraints define their boundaries.
For higher-level or deeper work, avoid a checkbox mindset. Provide detailed instructions about intent, but keep in mind that instructing specific tasks causes models to focus on achieving those tasks rather than the broader scope — and implicitly deprioritizes anything not listed. Generally it's better to let models exercise judgment and autonomy.
When describing how much scope to cover, it was useful to provide specific numbers and ranges. Instructions like "generate many tasks" tended to produce small quantities: a conservative default that technically follows instructions while playing it safe. "Generate 20–100 tasks" communicates a broader sense of intent and implies ambition, and it produced very different, much wider behavior.
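As a small illustration of these guidelines, the sketch below assembles a planner prompt from a goal, a list of constraints, and an explicit task range. The function name and prompt layout are assumptions for the example; only the quoted constraint and the 20–100 range come from the text above.

```python
from typing import List, Tuple


def build_planner_prompt(goal: str, constraints: List[str], task_range: Tuple[int, int]) -> str:
    """Compose a prompt that states boundaries and an explicit quantity range."""
    low, high = task_range
    if not 0 < low <= high:
        raise ValueError("task_range must satisfy 0 < low <= high")
    lines = [f"Goal: {goal}", "", "Constraints:"]
    lines += [f"- {constraint}" for constraint in constraints]
    lines += ["", f"Generate {low}–{high} tasks."]
    return "\n".join(lines)


prompt = build_planner_prompt(
    goal="Render static web pages without JavaScript support.",
    constraints=["No TODOs, no partial implementations."],
    task_range=(20, 100),
)
```

Validating the range in code keeps a vague request such as "many" from slipping back into the prompt by accident.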
10. System Design Learnings
The research established several principles. 💡
- The system must be anti-fragile. As the number of concurrently running agents increases, so does the probability of failure. The system must be able to absorb individual agent failures and allow other agents to recover or try alternative approaches.
- Be empirical, not assumption-based. Rather than approaching the problem with preconceived notions of how it should work based on human organizations or existing system designs, data and observations were used to make adjustments.
- Design for throughput explicitly. This means trading off other aspects of coding, for example tolerating a small but stable error rate that requires a final cleanup pass, rather than demanding 100% perfectly working code that would dramatically slow the system.
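The first principle, absorbing individual failures so that other agents can pick the work up, can be sketched as a queue that re-enqueues a failed task for a different worker. The round-robin assignment, the attempt limit, and all names here are assumptions made for illustration.

```python
from collections import deque
from typing import Callable, List, Sequence, Tuple


def run_tasks(
    tasks: Sequence[str],
    workers: Sequence[Callable[[str], None]],
    max_attempts: int = 3,
) -> Tuple[List[str], List[str]]:
    """Run tasks round-robin; a failed task goes back on the queue for another worker."""
    queue = deque((task, 0) for task in tasks)
    done: List[str] = []
    abandoned: List[str] = []
    turn = 0
    while queue:
        task, attempts = queue.popleft()
        worker = workers[turn % len(workers)]
        turn += 1
        try:
            worker(task)
            done.append(task)
        except Exception:
            # Absorb the failure: retry later, or give up after max_attempts.
            if attempts + 1 < max_attempts:
                queue.append((task, attempts + 1))
            else:
                abandoned.append(task)
    return done, abandoned


def crashing_worker(task: str) -> None:
    """Hypothetical agent that always fails."""
    raise RuntimeError(f"agent crashed while working on {task}")


def steady_worker(task: str) -> None:
    """Hypothetical agent that always succeeds."""
    return None


done, abandoned = run_tasks(
    ["parse-html", "parse-css", "layout"],
    [crashing_worker, steady_worker],
)
```

Even with one of two workers failing every time, all three tasks complete, because no single failure is allowed to stop the queue.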
When designed well, these systems tend to be elegantly simple — but it wasn't clear which simple approaches would work until various approaches had been explored. The current system design runs with minimal overhead and provides useful linear scaling of token throughput. No major additional iterations on the harness were needed.
Conclusion
In this research, taste, judgment, and direction came from humans, but AI served as a significant force multiplier in rapidly iterating and exploring that research. 💪
This resembles a "virtuous" AI loop where AI develops AI. The better the models, agents, and harness become, the more this feeds back into itself, accelerating ever faster. We shape the tools that shape us.
There is a poetic similarity between this research and how some software teams operate today. The fact that these models were not explicitly trained to work this way suggests this is emergent behavior — and perhaps the right way to structure software projects.
Research into very long-running agents will continue, and these findings will illuminate the future of the product. ✨
