This post shares what OpenAI's team learned over five months of developing and shipping a software product with zero lines of manually written code. The product has both internal users and external alpha testers, and every line of code was written by Codex agents. The team estimates that this approach cut development time by roughly a factor of ten. It also shifted the engineer's role from writing code to three activities: designing environments where agents can work reliably, clarifying intent, and building feedback loops.
1. A Journey That Started from an Empty Git Repository 🚀
In late August 2025, our team started the project from an empty Git repository. The first commit was an initial skeleton (repository structure, CI configuration, formatting rules, package manager setup, and application framework) generated by Codex CLI using GPT-5. Even the AGENTS.md file instructing agents on how to work in the repository was written by Codex itself! Remarkable, isn't it? From the very beginning there was no human-written code; everything was shaped by agents.
Five months later, the repository holds roughly one million lines of code spanning application logic, infrastructure, tooling, documentation, and internal developer utilities. Over that period, a team that began with just three engineers driving Codex opened and merged approximately 1,500 pull requests, an average of 3.5 PRs per engineer per day, and throughput increased further as the team grew to seven engineers. The numbers are not the whole story: hundreds of internal users were using the product daily, which demonstrates its real-world utility.
Throughout development, humans made no direct code contributions. This became the team's core philosophy.
"No manually written code."
2. Redefining the Engineer's Role 💡
Not writing code by hand meant a new kind of engineering work focused on systems, scaffolding, and leverage.
Early progress was slower than expected — not because Codex lacked capability, but because the environment was too ambiguous. Agents lacked the tools, abstractions, and internal structure needed to make progress toward high-level goals. So the primary job of our engineering team became creating the environment that would let agents do useful work.
In practice, this meant working depth-first: breaking large goals into smaller components (design, code, review, tests, etc.), directing agents to build those components, and then using them to tackle more complex tasks. When something failed, the answer was almost never "try harder." Since making Codex do the work was the only path forward, human engineers always asked:
"What capability is missing? And how do we make that capability understandable and executable by an agent?"
Humans interact with the system almost entirely through prompts. An engineer describes the task, runs the agent, and the agent opens a pull request. To drive the PR to completion, we instruct Codex to review its own changes locally, request additional specific agent reviews both locally and in the cloud, respond to feedback provided by humans or agents, and loop until all agent reviewers are satisfied. (This is effectively the Ralph Wiggum loop.) Codex directly uses standard development tools (gh, local scripts, repository-native tooling) to gather context, with no need for humans to copy and paste into the CLI.
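The review loop described above can be sketched in a few lines. This is a minimal illustration, not the team's tooling: the `Reviewer` and `Reviser` callables and the `max_rounds` cap are hypothetical stand-ins for agent reviews and agent revisions.

```python
from typing import Callable, List

# Hypothetical types: a reviewer inspects the current change and returns
# feedback comments; an empty list means that reviewer is satisfied.
Reviewer = Callable[[str], List[str]]
# A reviser takes the change plus all feedback and returns an updated change.
Reviser = Callable[[str, List[str]], str]


def drive_to_completion(
    change: str,
    reviewers: List[Reviewer],
    revise: Reviser,
    max_rounds: int = 10,  # assumed safety cap, not from the original post
) -> str:
    """Alternate review and revision until no reviewer has feedback."""
    for _ in range(max_rounds):
        feedback = [comment for review in reviewers for comment in review(change)]
        if not feedback:
            return change  # every reviewer is satisfied
        change = revise(change, feedback)
    raise RuntimeError("reviewers still unsatisfied after max_rounds")
```

The cap matters in practice: without it, a reviewer that can never be satisfied would keep the loop running indefinitely.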
Humans can still review pull requests, but it isn't required. Over time we moved nearly all review work to agent-to-agent handling, which has proven genuinely efficient.
3. Making the Application Readable 📖
As code throughput increased, human QA capacity became the bottleneck. Because human time and attention were the fixed constraint, we focused on extending agent capability by making the application's UI, logs, and metrics directly readable by Codex.
For example, we made the app bootable per Git worktree so Codex can start and run one instance per change. We also connected the Chrome DevTools Protocol to the agent runtime and built the tooling needed for DOM snapshots, screenshots, and navigation actions. This lets Codex reproduce bugs, validate fixes, and reason directly about UI behavior. 🤯
We applied the same approach to observability tooling. Logs, metrics, and traces are exposed to Codex through a local observability stack that spins up temporarily per worktree. Codex works on a fully isolated version of the app, and when that work is done the logs and metrics are cleaned up alongside it. Agents can query logs with LogQL and metrics with PromQL. Given this context, prompts like "ensure service startup completes within 800 ms" or "make sure no span across these four core user journeys exceeds two seconds" become actionable.
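To illustrate why a prompt like "no span exceeds two seconds" becomes actionable once traces are readable, here is a small sketch of the check an agent could run against exported span data. The `Span` record and its fields are assumptions made for this example; the real stack is queried through LogQL and PromQL as described above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified trace span used only for this sketch."""
    journey: str
    name: str
    duration_ms: float


def spans_over_budget(spans: Iterable[Span], budget_ms: float) -> List[Span]:
    """Return every span whose duration exceeds the given budget."""
    return [span for span in spans if span.duration_ms > budget_ms]


if __name__ == "__main__":
    # Illustrative data, not measurements from the real product.
    observed = [
        Span("checkout", "load_cart", 1500.0),
        Span("checkout", "submit_payment", 2400.0),
    ]
    for span in spans_over_budget(observed, budget_ms=2000.0):
        print(f"{span.journey}/{span.name} took {span.duration_ms} ms")
```

An empty result means the budget holds, which gives the agent a clear pass/fail signal to iterate against.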
We regularly observe a single Codex run working on one task for six or more hours — including while humans sleep.
4. Making Repository Knowledge a System of Record 🗺️
One of the biggest challenges in putting agents to work on large, complex tasks is context management. The simplest lesson we learned early was:
"Give Codex a map, not a 1,000-page manual."
- Context is a scarce resource. An oversized instruction file crowds out the task, the code, and the relevant documentation, causing agents to miss key constraints or optimize for the wrong ones.
- Too many instructions become no instructions. If everything is "important," nothing is. Agents end up matching patterns locally rather than navigating deliberately.
- Instructions decay immediately. A giant manual becomes a graveyard of stale rules. Agents cannot know what is still valid, humans stop maintaining it, and the file quietly becomes an attractive nuisance.
- Verification is hard. A single monolithic blob does not lend itself to mechanical inspection (coverage, freshness, ownership, cross-linkage), so drift is inevitable.
So instead of treating AGENTS.md like an encyclopedia, we treat it like a table of contents. 📚
The repository's knowledge base is structured under a docs/ directory and maintained as a system of record. A short AGENTS.md (around 100 lines) is injected into context and serves primarily as a map, pointing to deeper sources of truth elsewhere.
Repository knowledge layout:
docs/
├── architecture/
│   ├── index.md
│   └── components.md
├── beliefs/
│   └── core.md
├── design/
│   ├── feature_x.md
│   └── feature_y.md
├── plans/
│   ├── active/
│   ├── completed/
│   └── debt/
└── quality/
    └── domains.md
Design documents are categorized and indexed along with their validation status, and they include a core set of beliefs that define agent-first operating principles. An architecture document provides the top-level map of domains and package layers. Quality documents grade each product domain and architectural layer and track gaps over time.
Plans are treated as first-class artifacts. Small changes use ad-hoc lightweight plans; complex tasks are recorded in execution plans whose progress and decision logs are checked into the repository. Active plans, completed plans, and known technical debt are all version-controlled and co-located so agents can operate without depending on external context.
This enables progressive disclosure: agents start from a small, stable entry point, learn where to look next, and are never overwhelmed from the start.
We enforce this mechanically. Dedicated linters and CI jobs verify that the knowledge base is up-to-date, interlinked, and correctly structured. Periodic "doc-gardening" agents scan for stale or obsolete documentation that no longer reflects actual code behavior and open fix pull requests.
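One of the simplest mechanical checks of this kind verifies that the map still points at real files. The sketch below is an assumption about what such a linter might look like, not the team's implementation: the `broken_links` function is hypothetical, and it assumes Markdown-style links with paths relative to the repository root.

```python
import re
from pathlib import Path
from typing import List

# Matches Markdown links such as [text](docs/file.md#anchor) and captures the
# path portion without the optional "#anchor" suffix.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)#\s]+)(?:#[^)]*)?\)")


def broken_links(repo_root: Path, entry: str = "AGENTS.md") -> List[str]:
    """List link targets in the entry file that do not exist in the repo."""
    text = (repo_root / entry).read_text(encoding="utf-8")
    missing = []
    for target in LINK_RE.findall(text):
        if "://" in target:
            continue  # external URLs are out of scope for this sketch
        if not (repo_root / target).exists():
            missing.append(target)
    return missing
```

A CI job could fail whenever the returned list is non-empty, which keeps the entry point from silently drifting away from the documents it indexes.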
5. Agent Readability Is the Goal 🎯
As the codebase evolved, the framework for Codex's design decisions had to evolve with it.
Because the repository was produced entirely by agents, it is optimized above all for Codex's readability. Just as teams improve code navigability for new engineering hires, our human engineers aimed to make it possible for agents to reason about the entire business domain directly from the repository itself.
From an agent's perspective, anything that cannot be accessed in context during a run effectively does not exist. Knowledge in Google Docs, chat threads, or people's heads is inaccessible to the system. Only version-controlled artifacts local to the repository — code, Markdown, schemas, executable plans — are visible to agents.
We learned that we need to push more and more context into the repository over time. That Slack discussion where the team aligned on an architectural pattern? If it is not discoverable by an agent, it is just as unreadable as it would be to a new hire three months later.
Giving Codex more context means organizing and exposing information so it can be reasoned over, rather than overwhelming agents with ad-hoc instructions. Just as you would introduce a new team member to product principles, engineering norms, and team culture (including emoji preferences!), giving agents this information produces better-aligned outputs. 😊
This framing clarified many tradeoffs. We favored dependencies and abstractions that could be fully internalized and reasoned about within the repository. Technologies often described as "boring" tend to be easier for agents to model, thanks to their composability, API stability, and representation in training data. In some cases it was cheaper for an agent to reimplement part of a feature than to navigate opaque higher-order behavior from a third-party library. For example, rather than pulling in a generic p-limit-style package, we implemented our own map-with-concurrency helper. It integrates tightly with OpenTelemetry instrumentation, has 100% test coverage, and behaves exactly as the runtime expects.
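For readers unfamiliar with the pattern, a map-with-concurrency helper applies an async function to every item while capping how many calls run at once. The following is a minimal Python sketch of that idea only. It is not the team's helper: the name `map_with_concurrency` and its signature are assumptions, and the OpenTelemetry instrumentation mentioned above is omitted.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def map_with_concurrency(
    items: Iterable[T],
    fn: Callable[[T], Awaitable[R]],
    limit: int,
) -> List[R]:
    """Apply fn to each item with at most `limit` calls in flight.

    Results are returned in the same order as the input items.
    """
    if limit < 1:
        raise ValueError("limit must be >= 1")
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item: T) -> R:
        # The semaphore blocks here once `limit` calls are already running.
        async with semaphore:
            return await fn(item)

    return await asyncio.gather(*(run_one(item) for item in items))
```

Because the helper is a dozen lines with no hidden behavior, an agent can read, test, and modify it entirely within the repository, which is the tradeoff the paragraph above describes.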
Bringing more of the system into a form that agents can directly inspect, validate, and modify provides greater leverage not just for Codex but for other agents working on the codebase — such as Aardvark.
6. Enforcing Architecture and Style ✨
Documentation alone is not enough to keep a fully agent-generated codebase consistent. By enforcing invariants rather than micromanaging implementations, we let agents ship quickly without undermining the foundation. For example, we require Codex to parse data shapes at boundaries, but we do not prescribe how that should happen. (The model seems to prefer Zod, but we never specified a particular library.)
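As a rough illustration of "parse data shapes at boundaries," the sketch below converts untrusted input into a typed value once, so downstream code never handles the raw shape. The `AppSettings` type and its fields are hypothetical, and plain Python validation stands in for a schema library such as Zod.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class AppSettings:
    """Hypothetical typed shape used inside the domain."""
    theme: str
    page_size: int


def parse_app_settings(raw: Mapping[str, Any]) -> AppSettings:
    """Validate untrusted input at the boundary and return a typed value."""
    theme = raw.get("theme")
    page_size = raw.get("page_size")
    if not isinstance(theme, str) or not theme:
        raise ValueError("theme must be a non-empty string")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(page_size, bool) or not isinstance(page_size, int) or page_size <= 0:
        raise ValueError("page_size must be a positive integer")
    return AppSettings(theme=theme, page_size=page_size)
```

The invariant being enforced is only that parsing happens at the edge; how the parser is written is left to the agent.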
Agents are most effective in environments with strict boundaries and predictable structure. So we built the application around a strict architectural model. Each business domain is divided into a fixed set of layers with rigorously validated dependency direction and a limited set of allowable boundaries. These constraints are mechanically enforced through custom linters (generated by Codex, of course) and structural tests.
The diagram below shows the rules. Within each business domain (e.g., app settings), code may only depend "forward" through a fixed set of layers (types → config → repository → service → runtime → UI). Cross-cutting concerns (auth, connectors, telemetry, feature flags) enter through a single explicit interface: Providers. Everything else is disallowed and mechanically enforced.
[Figure: Domain layer structure. Within each domain, code may only depend "forward" through a fixed set of layers; cross-cutting concerns enter through Providers.]
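A structural test for this rule can be quite small. The sketch below rests on one stated interpretation: "forward" is taken to mean that a layer may import only from itself or from layers earlier in the chain (so UI may use service, but types may not use UI). The function names and the lowercase layer identifiers are assumptions for illustration.

```python
from typing import Iterable, List, Tuple

# Layer order as listed in the text; earlier layers are more foundational.
LAYERS = ("types", "config", "repository", "service", "runtime", "ui")
RANK = {name: index for index, name in enumerate(LAYERS)}
# Assumed identifier for the single cross-cutting entry point.
CROSS_CUTTING_ENTRY = "providers"


def is_allowed(importer: str, imported: str) -> bool:
    """Check one dependency edge between two layers of the same domain."""
    if imported == CROSS_CUTTING_ENTRY:
        return True  # any layer may reach cross-cutting concerns via Providers
    # Assumption: a layer may depend on itself or on earlier layers only.
    return RANK[imported] <= RANK[importer]


def violations(edges: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return every (importer, imported) edge that breaks the layer rule."""
    return [(a, b) for a, b in edges if not is_allowed(a, b)]
```

In a real repository the edges would come from parsing import statements; here they are passed in directly to keep the rule itself visible.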
This kind of architecture is normally deferred until you have hundreds of engineers. With coding agents, it becomes an early prerequisite. Constraints are what make it possible to move fast without decay or architectural drift.
In practice we enforce these rules with custom linters and structural tests, plus a small number of "taste invariants." For example, structured logging, naming conventions for schemas and types, file size limits, and platform-specific stability requirements are all statically enforced with custom lint rules. Because the linting is custom, we write error messages designed to inject fix guidance into agent context.
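The idea of a lint error that carries its own fix guidance can be sketched with a file size rule. Everything specific here is assumed for illustration: the 400-line limit, the `LintError` type, and the wording of the message are not from the original post.

```python
from dataclasses import dataclass
from typing import Optional

MAX_LINES = 400  # hypothetical limit chosen for this sketch


@dataclass(frozen=True)
class LintError:
    path: str
    message: str


def check_file_size(path: str, source: str, max_lines: int = MAX_LINES) -> Optional[LintError]:
    """Return a LintError with remediation guidance if the file is too long."""
    line_count = len(source.splitlines())
    if line_count <= max_lines:
        return None
    return LintError(
        path=path,
        message=(
            f"{path} has {line_count} lines (limit {max_lines}). "
            "Fix: split it into smaller modules grouped by responsibility "
            "and keep each one under the limit."
        ),
    )
```

Because the message states both the violation and the expected remedy, an agent that reads the linter output receives the instruction at the moment it is relevant rather than having to recall it from documentation.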
In a human-driven workflow, rules like these can feel overly fussy or restrictive. With agents, they become multipliers: once encoded, they apply everywhere at once.
At the same time, we are explicit about where constraints matter and where they do not — similar to running a large-scale platform engineering organization: enforce boundaries centrally, allow autonomy locally. We care deeply about boundaries, correctness, and reproducibility, while granting the team (or the agents) considerable freedom in how they express solutions within those boundaries.
The resulting code does not always match human stylistic preferences, and that is fine. If the output is correct, maintainable, and readable to future agent runs, that is enough.
Human style is fed back into the system continuously. Review comments, refactoring pull requests, and bugs that reached users are recorded as documentation updates or encoded directly into tooling. When documentation falls short, we promote the rule to code.
7. Throughput Changes Merge Philosophy 💨
As Codex's throughput increased, many traditional engineering norms became counterproductive.
The repository operates with minimal blocking merge gates. Pull requests are short-lived. Test failures are often resolved in a follow-up run rather than blocking progress indefinitely. In a system where agent throughput far exceeds human attention, the cost of fixing something is low and the cost of waiting is high.
This would be irresponsible in a low-throughput environment. Here, it is often the right tradeoff.
8. What "Agent-Generated" Actually Means 🤖
When we say the codebase was generated by Codex agents, we mean everything in the codebase.
Agents generate:
- Product code and tests
- CI configuration and release tooling
- Internal developer tools
- Documentation and design records
- Evaluation harnesses
- Review comments and responses
- Scripts that manage the repository itself
- Production dashboard definition files
Humans remain in the loop, but at a different level of abstraction than before. We prioritize work, translate user feedback into acceptance criteria, and validate outcomes. When an agent struggles, we treat it as a signal: we identify what is missing — tools, guardrails, documentation — and always have Codex itself write the fix, feeding it back into the repository.
Agents use standard development tools directly: fetching review feedback, responding inline, pushing updates, and often squashing and merging their own pull requests.
9. Increasing Levels of Autonomy 🚀
As more of the development loop was encoded directly into the system — tests, validation, review, feedback handling, and recovery — the repository recently crossed a meaningful threshold where Codex can drive new features end-to-end.
Given a single prompt, an agent can now:
- Validate the current state of the codebase.
- Reproduce a reported bug.
- Record a video showing the failure.
- Implement a fix.
- Boot the application to validate the fix.
- Record a second video showing the resolution.
- Open a pull request.
- Respond to agent and human feedback.
- Detect and fix build failures.
- Escalate to humans only when judgment is required.
- Merge the change.
This behavior depends heavily on the specific structure and tooling of this repository and should not be assumed to generalize without similar investment. At least not yet!
10. Entropy and Garbage Collection 🧹
Full agent autonomy also introduces new problems. Codex replicates patterns already present in the repository — even irregular or suboptimal ones. Over time this inevitably leads to drift.
Initially humans addressed this manually. Our team spent every Friday — 20% of the entire week! — cleaning up "AI slop." As expected, this did not scale. 😩
Instead, we began encoding what we call "golden principles" directly into the repository and building a periodic cleanup process. These principles are opinionated, mechanical rules that keep the codebase readable and consistent for future agent runs. For example: (1) prefer shared utility packages over hand-rolled helpers to centralize invariants; and (2) do not inspect data "YOLO-style" — instead validate at boundaries or rely on typed SDKs so agents do not accidentally build on assumed shapes. On a regular cadence we run a set of background Codex tasks that scan for deviations, update quality grades, and open targeted refactoring pull requests. Most can be reviewed and auto-merged in under a minute.
This works like garbage collection. Technical debt is like a high-interest loan: it is almost always better to pay it down steadily in small amounts than to let it accumulate and resolve it painfully all at once. Human style is captured once and then continuously enforced across every line of code. It also lets us catch and address bad patterns daily rather than letting them spread through the codebase for days or weeks.
11. What We Are Still Learning 🧐
This approach has worked well so far, through internal launch and adoption within OpenAI. Building a real product for real users has helped ground our investments in reality and guide us toward long-term maintainability.
What we do not yet know is how architectural coherence will evolve over years in a fully agent-generated system. We are still learning where human judgment adds the most leverage and how to encode that judgment so it compounds. We also do not know how this system will evolve as models become increasingly capable over time.
What has become clear is that building software still requires discipline — but that discipline now manifests more in scaffolding than in code. The tools, abstractions, and feedback loops that keep a codebase consistent are becoming increasingly important.
"Our hardest challenges are now concentrated in designing the environments, feedback loops, and control systems that help agents achieve our goals — namely, building and maintaining complex, reliable software at scale."
As agents like Codex take on a larger share of the software lifecycle, these questions will only grow more important. We hope sharing our early lessons helps you reason about where to invest your own efforts so that you can just build, too. 🛠️
Conclusion 🌟
OpenAI's "harness engineering" experiment points to a substantial shift in how software is developed. With Codex agents handling every step of development and no manual coding involved, the engineer's role has shifted from writing code to designing agent environments and building feedback loops, and development velocity has increased dramatically as a result. By prioritizing "agent readability," systematizing the repository's knowledge base, and enforcing architectural invariants, the team has given agents substantial autonomy: they can handle the full cycle from bug reproduction to feature implementation to PR merge. This experience suggests that software development will increasingly conserve human time and attention by pairing human judgment with agent execution.
