Cursor: The Journey Toward Self-Driving Codebases

This article covers Cursor's research into how thousands of AI agents can collaborate to autonomously develop complex software projects. It details the evolution of the agent orchestration (harness) system and key lessons learned along the way, discussing various experiments and the final system design for building self-driving codebases.


1. Research Background and Initial Attempts

Cursor began internal research to explore the possibility of long-term autonomous coding projects. The goal was to build a system (harness) that orchestrates thousands of agents and to observe their behavior. Last month (May 2026), the system demonstrated stability by running continuously for a week, generating most of the commits needed for a research project (a web browser). While not intended for external use and still containing bugs, the fact that thousands of agents collaborated to produce working results without human intervention was a major milestone.

The research started as a personal side project. A web browser was sufficiently complex to expose frontier model limitations, and it contained diverse subsystems that would require collaboration. Initially, Opus 4.5 was asked to plan building a browser engine that renders web pages without JavaScript. The model quickly lost its way, claiming success on incomplete implementations or getting stuck on complex details. Yet it showed flashes of deep knowledge, writing small code fragments well.

The core issue was that building a browser was too overwhelming as a single task; it needed to be decomposed into subtasks. Next, agents were directed to plan a dependency graph of major tasks that could be performed in parallel. Agents were created manually and nudged to continue when they got stuck. This increased throughput but didn't significantly improve results: agents couldn't communicate with each other or give feedback on the overall project. The system needed to be more dynamic.
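The dependency-graph idea can be sketched as a small scheduler: any task whose prerequisites are complete is ready to hand to an agent, so independent tasks run in parallel "waves". This is an illustrative Python sketch (the later harness was written in Rust), and the browser subtasks named here are hypothetical, not taken from the actual plan.

```python
from graphlib import TopologicalSorter

# Hypothetical top-level browser-engine tasks; each maps to the tasks it depends on.
deps = {
    "html_parser": set(),
    "css_parser": set(),
    "dom": {"html_parser"},
    "style_resolver": {"dom", "css_parser"},
    "layout": {"style_resolver"},
    "paint": {"layout"},
}

ts = TopologicalSorter(deps)
ts.prepare()
waves = []
while ts.is_active():
    ready = list(ts.get_ready())   # all tasks whose dependencies are satisfied
    waves.append(sorted(ready))    # each wave could be dispatched to parallel agents
    ts.done(*ready)

for i, wave in enumerate(waves):
    print(f"wave {i}: {wave}")
```

Tasks with no ordering constraint (here, the two parsers) land in the same wave, which is exactly where parallel agents buy throughput.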


2. From Single Agent to Multi-Agent

The team started a new repository with a simple Rust-based harness, running on a single large Linux VM rather than dealing with distributed system complexity. They invested more time in proper observability — logging all agent messages, system actions, and command outputs with timestamps.

2.1. Self-Coordination Failure

The first multi-agent idea was the simplest: equal-role agents coordinating through a shared file. It failed quickly. Agents held locks too long, forgot to release them, or attempted locking in invalid situations. Locking also created heavy contention: with 20 agents, most time was spent waiting for locks, reducing effective throughput to that of 1-3 agents. Agents also had no structural incentive to take on large, complex tasks, preferring small, safe changes that avoided contention.
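The contention problem is easy to reproduce in miniature. In this illustrative Python sketch (the real harness agents and coordination file are stood in for by threads and a lock), 20 "agents" each do a fixed amount of work while holding one global lock, so their work serializes regardless of how many run concurrently:

```python
import threading
import time

# One global lock stands in for the shared coordination file.
coordination_lock = threading.Lock()
completed = []

def agent(agent_id: int) -> None:
    with coordination_lock:          # claim the shared coordination state
        time.sleep(0.01)             # stand-in for work done while holding the lock
        completed.append(agent_id)   # record the "commit"

threads = [threading.Thread(target=agent, args=(i,)) for i in range(20)]
start = time.monotonic()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start

# 20 agents x 0.01s of lock-held work serialize into ~0.2s of wall-clock time,
# so parallelism collapses to roughly single-agent throughput.
print(f"{len(completed)} agents finished in {elapsed:.2f}s")
```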


3. Adding Structure and Roles

Next, roles were separated to give agents ownership and accountability:

  • Planner: Plans the exact approach and deliverables according to user instructions.
  • Executor: The sole lead agent ensuring the planner's plan is fully achieved, able to create tasks for Workers.
  • Judge: Runs independently after the executor finishes to determine whether work is complete.

This structure solved many coordination problems. However, the system was bottlenecked by the slowest worker, and pre-planning everything made dynamic readjustment difficult when new issues emerged.
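The planner/executor/judge split can be pictured as a staged pipeline: the planner produces the full plan up front, the executor fans tasks out to workers, and the judge gates completion independently. This is a heavily stubbed illustrative sketch in Python; in the actual harness each role is a model-driven agent, and these function bodies are invented.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    tasks: list[str]

def planner(instructions: str) -> Plan:
    # Pre-plans exact deliverables up front (the rigidity noted in the text).
    return Plan(tasks=[f"{instructions}: step {i}" for i in range(3)])

def worker(task: str) -> str:
    return f"done: {task}"

def executor(plan: Plan) -> list[str]:
    # Fans tasks out to workers; finishes only when the slowest worker returns,
    # which is why the system bottlenecked on stragglers.
    return [worker(t) for t in plan.tasks]

def judge(plan: Plan, results: list[str]) -> bool:
    # Independent completion check after the executor finishes.
    return len(results) == len(plan.tasks) and all(r.startswith("done") for r in results)

plan = planner("build html parser")
results = executor(plan)
print("complete" if judge(plan, results) else "incomplete")
```

Because the plan is fixed before execution begins, any issue discovered mid-run has no feedback path back into the plan, the weakness the next iteration tried to fix.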


4. Continuous Executor

The next version removed the independent planner. The executor could now plan on its own in addition to creating tasks. Freshness mechanisms were introduced — scratchpads rewritten frequently, self-reflection prompts, and alignment reminders. The system became very dynamic and flexible. The judge was removed since agents proved quite competent at completing instructions.

However, the continuous executor began exhibiting pathological behaviors: randomly going to sleep, stopping agent runs, performing work itself, refusing to create tasks, and prematurely claiming completion. The executor had been given too many roles and goals simultaneously — planning, exploration, research, task creation, worker checking, code review, editing, merging, and completion judgment. In retrospect, being overwhelmed was inevitable.


5. Final System Design

The final design integrates everything learned:

  1. Root Planner: Owns the full scope of user instructions. Understands current state and delegates specific tasks. Does not write code directly.
  2. Subplanners: Created when a planner judges that its scope can be subdivided. Each fully owns its delegated portion. The structure is recursive.
  3. Workers: Solely responsible for picking up and completing tasks. Know nothing about the larger system. Work on their own repository copy and produce a single handoff when complete.

Handoffs include not just what was completed, but important notes, concerns, deviations, discoveries, thoughts, and feedback. Planners receive these as follow-up messages. Information flows upward to owners with increasingly global perspective, without global synchronization overhead.
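The recursive planner tree and upward-flowing handoffs can be sketched together: workers return a single handoff to their planner, and each planner rolls its subtree's notes up to its parent, so information aggregates toward owners with a more global view. This is an illustrative Python sketch; all field names and the rollup format are assumptions, not the harness's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Handoff:
    # A worker's single report: what was completed plus notes, concerns,
    # deviations, and discoveries the planner should see.
    task: str
    completed: str
    notes: list[str] = field(default_factory=list)

@dataclass
class Planner:
    scope: str
    subplanners: list["Planner"] = field(default_factory=list)
    inbox: list[Handoff] = field(default_factory=list)

def worker(task: str) -> Handoff:
    # Workers know nothing about the larger system; they emit one handoff.
    return Handoff(task=task, completed=f"implemented {task}",
                   notes=[f"concern: edge case in {task} untested"])

def run(planner: Planner, tasks_by_scope: dict[str, list[str]]) -> None:
    # Delegate this planner's own tasks, recurse into subplanners,
    # then roll each subtree's notes upward into this planner's inbox.
    for task in tasks_by_scope.get(planner.scope, []):
        planner.inbox.append(worker(task))
    for sub in planner.subplanners:
        run(sub, tasks_by_scope)
        planner.inbox.append(Handoff(task=sub.scope,
                                     completed=f"subscope '{sub.scope}' finished",
                                     notes=[n for h in sub.inbox for n in h.notes]))

root = Planner("browser", subplanners=[Planner("rendering"), Planner("networking")])
tasks = {"rendering": ["layout", "paint"], "networking": ["http client"]}
run(root, tasks)
print([h.task for h in root.inbox])   # only subscope rollups reach the root
```

The root never sees individual worker chatter, only aggregated rollups, which is how the design avoids global synchronization overhead.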


6. Throughput and Tradeoffs

The system recorded approximately 1,000 commits per hour through roughly 10 million tool calls over a week, requiring zero human intervention after startup.
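Taking the reported figures at face value, a quick back-of-envelope check (approximate, since the article gives rounded numbers) puts the totals in perspective:

```python
commits_per_hour = 1_000
hours = 7 * 24                 # one week of continuous operation
tool_calls = 10_000_000

total_commits = commits_per_hour * hours
print(total_commits)                      # about 168,000 commits over the week
print(round(tool_calls / total_commits))  # about 60 tool calls per commit
```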

Key tradeoffs: requiring 100% correctness before every commit caused severe serialization and throughput loss. Allowing a small, steady error rate — trusting that errors would be quickly fixed by peer agents — proved far more productive. Similarly, some synchronization overhead from multiple agents touching the same files was accepted rather than over-engineering prevention.
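The correctness-versus-throughput tradeoff can be made concrete with an amortized-cost comparison. Every number below is invented for illustration; the point is only the shape of the result: paying a small, amortized repair cost per commit beats paying a large verification cost on every commit.

```python
# Hypothetical parameters (not from the article):
agents = 20
commit_time = 1.0          # time units per attempted commit
verify_overhead = 4.0      # extra cost of verifying 100% correctness pre-commit
error_rate = 0.05          # fraction of commits that land broken
fix_cost = 2.0             # peer-agent time to repair one broken commit

# Strict mode: every commit pays full verification before landing.
strict_rate = agents / (commit_time + verify_overhead)

# Loose mode: commit immediately; pay only the amortized peer-repair cost.
loose_rate = agents / (commit_time + error_rate * fix_cost)

print(f"strict: {strict_rate:.1f} commits per time unit")
print(f"loose:  {loose_rate:.1f} commits per time unit")
```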


7. Key Learnings

  • Systems must be anti-fragile — tolerating individual agent failures while others recover.
  • Experience-based, not assumption-based — using data and observation rather than assumptions from human organizations.
  • Design explicitly for throughput — accepting small error rates for dramatic speed gains.
  • Instructions matter enormously at scale — unclear or suboptimal instructions amplify problems.
  • Constraints beat instructions — "No TODOs, no partial implementations" works better than "Remember to complete implementations."
  • Specific numbers and ranges are more effective than vague directives.
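One reason constraints beat reminders is that a constraint can also be enforced mechanically rather than left to the agent's memory. The article doesn't describe such a gate; this is a hypothetical sketch of what a hard "no TODOs, no partial implementations" commit check could look like:

```python
import re

# Hypothetical hard gate: reject any diff containing partial-implementation
# markers, instead of asking the agent to "remember" not to leave them.
FORBIDDEN = [r"\bTODO\b", r"\bFIXME\b", r"unimplemented!"]

def commit_allowed(diff: str) -> bool:
    return not any(re.search(pat, diff) for pat in FORBIDDEN)

print(commit_allowed("fn parse() { /* TODO: handle quirks mode */ }"))  # False
print(commit_allowed("fn parse() { tokenize(input) }"))                 # True
```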

Conclusion

In this research, taste, judgment, and direction came from humans, but AI served as a significant force-multiplier for rapidly iterating and exploring. This resembles a 'virtuous' AI loop where better models, agents, and harnesses feed back into themselves, accelerating ever faster. We shape our tools and then our tools shape us.

The final system design bears a poetic similarity to how some software teams operate today. Since the models weren't explicitly trained to organize this way, this suggests emergent behavior, and perhaps points toward the right way to structure software projects. Cursor will continue researching very long-running agents, with findings illuminating the future of its product.
