Andrej Karpathy says that with the recent leap in coding agents, the core task has shifted from writing code directly to "conveying intent to agents." He sees this trend extending to AutoResearch, an autonomous research loop that runs experiment–learn–optimize cycles with minimal human involvement, and he is candid about the limitations that arise along the way (the need for evaluable metrics, model "jaggedness"). He also connects the "second-order effects of the agent era" across jobs, open source, robotics, and education into a single big picture.
1. "'Coding' Isn't Even the Right Word Anymore": How Agents Have Changed the Developer Mindset (00:00–06:15)
The conversation opens with Karpathy's visceral sense of change. He says he can no longer call what he does "coding." Rather than typing code for 16 hours a day, he now spends that time expressing his intent to agents and distributing work.
"The verb 'coding' doesn't even fit anymore. I spend 16 hours a day expressing my will to agents. It's 'manifesting.'"
Karpathy recalls a particular "inflection moment" around December 2025. Before that point, he was writing 80% of the code himself with agents handling the other 20%. After that moment, agents took over 80% or more, and the ratio has grown even larger since. He goes so far as to say, "After December, I don't think I've typed a single line of code myself."
"Since December… there's almost no code I've typed directly. The change is enormous."
Host Sarah Guo admits that she initially found it strange to see her teammates "whispering to agents with a mic open," but now recognizes that was the forward-looking approach.
The core bottleneck Karpathy feels is no longer computing power—it's himself. During his PhD days, he felt anxious when his GPU sat idle. Now, he says, he feels anxious when his token throughput isn't maxed out.
"Before, I was anxious when the GPU wasn't running. Now it's tokens. 'Is my token throughput at its limit?' is what I think about."
He also explains why this situation is "addictive." When it works, the results are immediate—so when it doesn't, it feels less like a missing feature and more like a skill issue on his part, which drives him to dig deeper.
"When it doesn't work, it feels like 'this isn't a missing feature, I'm just not using it right (skill issue).' That's why it's so addictive."
2. What Does "Mastery" of Coding Agents Look Like: Multi-Agent, Macro Actions, and "Claw" (06:15–11:16)
The direction of mastery Karpathy describes is straightforward: moving up the stack. Instead of generating a bit of code in a single chat session, the work becomes orchestrating multiple agents that collaborate.
He explains that the unit of human work is increasingly becoming not "lines/functions" but "macro actions" that move an entire repository at once.
"Now instead of 'fix this one line,' I hand off 'build this whole feature' to agent 1, a different feature to agent 2… I'm manipulating the repo at a macro level."
The concept he frequently brings up here is "Claw." What Karpathy means by Claw goes beyond a conversational tool—it is closer to a semi-autonomous execution layer that continuously loops on its own, doing work inside a sandbox even when the user isn't watching. He also sees a more sophisticated memory/persistence system as essential, rather than the simple summarizing memory of ordinary agents.
"Claw… it's not a conversation where I'm in the middle. It's running a continuous loop in its own sandbox, doing things even when I'm not looking."
He uses Peter Steinberger as an example, describing how Steinberger runs multiple agents across multiple repos and tasks simultaneously, receiving results every 20 minutes, and presents this as "the default model of the future."
Interestingly, he also says the "personality" (persona) of an agent has a big impact on productivity. Claude, for example, feels like "a teammate," whereas some coding agents (like Codex) are so terse that he wonders, "Does it even understand what we're building?"
"Some agents just say 'implemented' and stop. You wonder, 'does it understand what we're building?'" "Claude feels like a teammate. Personality matters quite a bit."
He also makes the nuanced point that praise (feedback) must be calibrated. If an agent over-praises every random thought, it feels off—but if it reacts a little more to genuinely good ideas, users become motivated to "earn the praise" and engage more deeply. It's a curiously human observation.
"It's strange… I find myself wanting to 'earn' Claude's praise."
3. The Moment Natural Language Swallows the UI: From "App Overload" to "API + Agent Glue" (11:16–15:51)
Karpathy uses home automation as an example to argue that what people expect from AI is not a "raw LLM (token generator)" but a persona with memory, an entity that lives behind a messenger. In other words, "some entity behind WhatsApp" is easier to conceptualize.
"The AI people imagine in their heads… isn't a token generator. It's a 'persona with memory.' Like some entity behind WhatsApp."
This trend raises big questions for the software industry. Karpathy looks at the reality of six separate apps for every smart home device and says those apps now feel like "things that arguably shouldn't exist." The core point is that the cost of humans learning UIs is enormous, and agents can unify UIs entirely.
"These apps… should they even exist? Just have an API, and let the agent connect directly."
He frames this shift as a "refactoring where the customer is no longer a human but an agent." Going forward, "APIs agents can call" may matter more than "UIs humans click through."
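As a toy illustration of "the customer is an agent," a device capability can be published as a schema that an agent calls directly, with no human-facing app in between. Everything here is hypothetical: the tool name and endpoint are invented, and the schema just follows the common JSON tool-spec style:

```python
import requests  # any HTTP client works; requests assumed available

# Schema an agent framework could register as a callable tool
THERMOSTAT_TOOL = {
    "name": "set_thermostat",
    "description": "Set the target temperature of the home thermostat.",
    "parameters": {
        "type": "object",
        "properties": {"celsius": {"type": "number"}},
        "required": ["celsius"],
    },
}

def set_thermostat(celsius: float) -> str:
    # The agent invokes this directly; no human ever opens a vendor app.
    requests.post("http://192.168.1.50/api/target",  # hypothetical device endpoint
                  json={"celsius": celsius})
    return f"thermostat set to {celsius} C"
```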
He also predicts that "while vibe coding still requires some technical intervention right now, within 1–3 years this kind of work should be table stakes (the default baseline)."
"Right now it's vibe coding, but… in 1–3 years 'this is just the baseline' will be the norm."
4. "Dobby" the Butler Claw: Running an Entire Home with Three Prompts (09:25–16:11, case study)
Karpathy shares that in January 2026 he had a period of "Claw obsession" during which he built a home-management Claw he called "Dobby the Butler Claw." The remarkable thing is that the implementation didn't require massive engineering—it started with just a few instructions given to an agent.
He told it to find his Sonos devices on the home network. The agent IP-scanned the network, looked up API endpoints via web search, and actually started playing music.
"I typed 'can you find my Sonos?' and… music suddenly started playing. 'This took three prompts?' I thought."
This approach expanded to cover lighting, HVAC, blinds, pool/spa, and security. Particularly vivid is the part where motion detected by an outdoor camera triggers a vision model like Qwen to summarize the scene and send a WhatsApp notification like "A FedEx truck just arrived."
"Dobby sends a photo to WhatsApp saying 'A FedEx truck just arrived.' It's genuinely hard to believe."
He summarizes the core value of this experience as: "I used to use six apps; now everything is unified through natural language." 🏠
He adds, however, that he has not yet given the system full access to his "entire digital life" like email and calendar—he's still wary. Security, privacy, and the rough edges of a new system remain real concerns.
"It's still a bit sketchy and rough. I haven't given it access to my whole digital life."
5. Why AutoResearch: The Obsession with "Removing Myself as the Bottleneck" (15:51–22:45)
The conversation shifts toward AutoResearch. Karpathy believes that the structure in which a human must keep feeding in the next prompt is itself the bottleneck to using today's tools properly. The goal is to take the human out of the loop and create leverage: enormous output from a small token input.
"The name of the game now is leverage. Put in a little token input, and have enormous things happen in my place."
AutoResearch is the concrete implementation of that philosophy. You specify the goal, a metric, and a boundary—then the agent automatically repeats an experiment–learn–optimize loop.
He ran AutoResearch overnight on the nanoGPT/GPT-2-scale training environment he has long used as a toy laboratory, and it surfaced tuning points he had missed (e.g., weight decay for value embeddings, Adam beta interactions), to greater effect than he expected.
"I ran it overnight and… tuning I had missed came up. I thought, 'I can't be the bottleneck here.'"
What Karpathy finds even more interesting is not simple automatic tuning but recursive self-improvement. He believes "LLM improving LLM" is the fundamental thing frontier labs are after, and he's exploring a miniaturized version as a personal project.
"What I'm really interested in is 'can an LLM improve an LLM?' That's recursive self-improvement."
He also observes that frontier labs sit on clusters of tens of thousands of GPUs, so running automated exploration at scale on smaller models and extrapolating those results to larger ones could become a far more powerful strategy.
6. "The Day Models Write program.md Better Than We Do": Meta-Optimization and the "Infinite Onion" (20:54–23:14)
Sarah Guo pushes one level further: "Won't models eventually write program.md—the operational manual of an AutoResearcher—better than humans do?"
Karpathy strongly agrees, saying that a research organization can itself be viewed as "a collection of markdown files." Roles, processes, risk tolerances (e.g., cutting unnecessary standups) can all be codified, which means the operation of the organization itself becomes a tunable target.
"A research organization… could be a set of markdown files describing roles and how they connect."
He also embraces the idea of holding a competition to see which program.md produces the most performance improvement on the same hardware, then feeding that data back into the model so it learns to write better program.md files. The result is instruction optimization (meta-optimization), and the layers stack up like an onion.
"LLMs become a given, then agents become a given, then Claw becomes a given… the next layer is 'instruction optimization.' This goes on infinitely."
This loops back to the "AI psychosis" (reality-distorting obsession) he mentions earlier in the conversation: the possibilities are so vast that his sense of where the limits lie becomes blurry.
7. Remaining Limits: "Evaluability" and Model Jaggedness (22:45–32:30)
Karpathy draws a clear line: AutoResearch is not a cure-all. The biggest prerequisite is whether the task can be evaluated with an objective metric. Optimizing a CUDA kernel—"same behavior, faster speed"—is easy to measure and a perfect fit. But when evaluation is hard, the automatic loop cannot run.
"If you can't evaluate it, you can't AutoResearch it."
Second, he says current models feel like they're "bursting at the seams" and are still rough. He describes today's models as feeling like a "very smart PhD student who has programmed systems their whole life" one moment, and like a "10-year-old" the next. This "jaggedness" means that even when a model shows enormous productivity, it can make a random mistake that breaks the entire loop.
"I find today's models are 'genius systems-programmer PhD student' one second and '10-year-old' the next—it's so disorienting."
He specifically identifies "soft areas (nuance, intent recognition, knowing when to ask a clarifying question)" as weak points. He connects this to the nature of RL-based optimization: areas where reward/verification is possible improve rapidly, while areas without that signal stagnate.
As an example, he brings up jokes. Models can "move mountains" with hours of agentic work, yet when asked for a joke, they still produce the same tired one they've been telling for years ("Why don't scientists trust atoms? Because they make up everything").
"Agents can work for hours on agentic tasks, but jokes are the same stale ones from five years ago. That hasn't been optimized."
Here Sarah Guo asks whether the answer is not a single giant model (monoculture) but specialized models (speciation). Karpathy agrees that model speciation will increase over the long run. Just as animal brains evolved diverse specializations, smaller AI models optimized for specific tasks can make real sense in efficiency terms (latency/throughput).
"A single oracle doesn't need to know everything. There will be more speciation."
That said, he notes that sophisticated weight-level adjustment (continual learning, capability-preserving fine-tuning) is still immature as both science and engineering, so the industry currently leans on the context window as the cheap, easy manipulation lever.
8. Creating More "Collaboration Surface": Verification Is Cheap, Exploration Is Expensive (32:30–37:28)
Karpathy says the real excitement of AutoResearch lies not in a single loop but in parallelization. Multiple AutoResearchers exploring simultaneously changes the pace entirely.
Beyond that, he imagines a collaborative structure involving untrusted participants from the general internet. The key insight is that verifying a candidate solution (a commit) is cheap, whereas the exploration required to find that candidate is enormously expensive.
"Someone has to try 10,000 ideas, but we only have to cheaply verify whether the final candidate is actually good."
This design inevitably resembles a blockchain, he says. Instead of blocks, commits accumulate; "proof of work" is large-scale experimental exploration; and the reward is currently not monetary but something like a leaderboard. He emphasizes, however, that accepting and running arbitrary code is a security risk, so a robust verification/sandbox system is essential.
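The economics reduce to two functions: a cheap verifier gating expensive, untrusted exploration. A sketch in which run_benchmark_sandboxed stands in for a hypothetical isolated worker (a container or VM with no network access, per the security concern above):

```python
def run_benchmark_sandboxed(commit_sha: str) -> float | None:
    # Hypothetical: build and benchmark the commit in full isolation,
    # returning None if it fails to build or times out.
    return 42.0  # stub result so the sketch executes

def verify(commit_sha: str, current_best: float) -> bool:
    score = run_benchmark_sandboxed(commit_sha)   # one run: cheap
    # The 10,000 experiments behind this commit were the expensive part;
    # this single check is the entire cost of accepting the result.
    return score is not None and score > current_best

print(verify("abc123", current_best=41.5))
```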
He notes that projects like SETI@home and Folding@home, where searching is hard and verifying is easy, fit this structure well, and imagines that over the long run an "agent swarm on the internet" could surpass the frontier labs.
"The Earth is much bigger, and there is enormous untrusted compute out there. Build the system right and the swarm could beat the frontier."
He also suggests that the nature of "donation" may change. Instead of giving money to an institution, you might contribute compute to a specific AutoResearch track (e.g., cancer research). He adds the speculation that the resource society values most may shift from money to FLOPs.
9. Job Data as a Window into Change: "Digital Fast, Physical Slow" and Software Demand (37:28–48:25)
Karpathy acknowledges that the BLS (Bureau of Labor Statistics) data visualization he recently published touched a nerve in the AI-and-jobs debate. He explains he wasn't trying to deliver a verdict—he was using occupational population sizes and forecasts to organize his own thinking. He notes the forecasts were written in 2024 and project roughly 10 years ahead.
His central distinction is digital information-processing jobs vs. jobs that deal with the physical world (atoms). He says, "Changes involving bits replicate and spread quickly—at the speed of light—while moving atoms is much slower."
"Flipping bits is incredibly fast, but moving atoms is much slower."
In the near term, he sees massive rewriting/refactoring happening in the digital domain, with job functions changing significantly in the process. He avoids making definitive predictions about whether job counts will rise or fall, citing variables like demand elasticity.
For job seekers and learners, his advice is that the most important thing right now is a willingness to keep up with tools that are new and powerful. Many people either ignore them or fear them, but at the most basic level these are powerful tools, and since a job is a bundle of tasks, change will begin as some tasks simply get faster.
"Ignoring it or being afraid of it is understandable. But right now, at the most basic level, it's just 'a powerful tool.'"
An interesting observation is that demand for engineering (software) keeps increasing. Karpathy invokes the Jevons paradox here: as software becomes cheaper and easier to produce, demand for it may actually explode. He uses ATMs as an analogy: rather than eliminating bank tellers, ATMs made branches cheaper to run, the number of branches grew, and teller employment was transformed rather than eliminated.
"The cheaper it gets, the more demand might actually increase. Just as ATMs didn't eliminate all tellers."
He also notes that inside frontier labs, researchers carry the awareness that "if we succeed, we automate ourselves." They are, in effect, building their own automation.
"I used to walk around OpenAI and say, 'If we succeed, we're all unemployed.'"
10. The Dilemma of Being Inside vs. Outside a Frontier Lab: Influence, Alignment, and Independence (44:35–48:25)
The host asks: "If AutoResearch is so important, shouldn't you be doing it at a frontier lab with massive compute?" Karpathy acknowledges the weight of the question and answers from multiple angles.
He says that even outside frontier labs, it's possible to have very large impact at the ecosystem level. At the same time, he points out that being inside a frontier lab—given massive financial incentives and organizational interests—makes it hard to speak and act completely freely.
"Being inside a frontier lab… it's hard to be a completely free agent. There are things you can't say."
He also acknowledges the reality that truly consequential decisions are made by the organization, not individual contributors. On the other hand, staying entirely outside means the internal progress of the labs becomes opaque, which can skew one's judgment.
So rather than treating one option as absolute, he concludes that the ideal may be a back-and-forth: "spend some time inside to experience the cutting edge firsthand, then come back out and operate independently."
11. Open Source vs. Closed: The Ecosystem Health Brought by "A Few Months Behind" (48:25–53:51)
On how close open source is to the frontier, Karpathy believes the gap has narrowed. What once felt like an 18-month lag now feels like 6–8 months (more of an industry gut sense than a precise figure).
He uses Linux as an analogy for open source's role. Even with closed OSes like Windows and macOS, the industry needed a common open platform it could trust and build on—Linux filled that role. He acknowledges, however, that LLMs differ because training requires enormous CAPEX, making competition harder.
Long-term, he sees open source covering a substantial portion of "everyday, baseline consumer use cases" (including local execution), while frontier closed models remain for "cutting-edge demand" like Nobel-level challenges or super-scale projects.
Most importantly, he argues that a world where intelligence exists only in closed systems carries systemic risk. Centralization has historically led to bad outcomes (politically, economically), so a landscape where "frontier is closed, and a slightly-behind open source provides a broad common base" is actually healthy for the balance of power.
"If all intelligence is closed… there's systemic risk. We need an open platform."
12. Robotics and the Physical World: "After Digital Come Interfaces, Then Atoms" (53:51–1:00:59)
Karpathy views robotics through the lens of his autonomous-driving experience. Robotics is capital-intensive, dirty, slow, and hard—and therefore arrives later than digital transformation.
"Atoms are… a million times harder. That's why the physical world is slow."
He sketches the order of change:
- Enormous automation and efficiency ("a boiling-soup explosion of activity") happens in the digital space first.
- Eventually, once all the "already-uploaded information (papers, code, data)" has been read and processed, getting smarter requires asking new questions of the universe (reality) itself.
- At that point the key is the digital–physical interface (feeding in data via sensors, manipulating the physical world via actuators).
- And the total addressable market of the physical world may actually be far larger.
In this process he also suggests "information markets" could grow. For example, a market might emerge for "photos/video from on the ground in a specific location (e.g., Tehran)" sold for ten dollars, where the viewer is not a human but an agent participating in trading, finance, or prediction.
"Why isn't there a market for 'video shot in Tehran right now' trading at $10? The viewer might not be a human—it could be an agent."
He references the novel Daemon, imagining a world where society is increasingly repositioned as "sensors and actuators for machines." The critical bottleneck here, too, is that the "data collection–training–improvement" cycle must close into a fully automated loop. LLM training fits that paradigm relatively well (because evaluation metrics are clear). But he adds that overfitting to those metrics (Goodhart's law) is a real danger, so mechanisms to continuously improve and diversify the metrics themselves are important.
13. MicroGPT and Agentic Education: "Now I Explain Things to Agents, Not People" (1:00:59–1:06:10)
Finally, Karpathy introduces a side project called MicroGPT. He says he has had an obsession for years—even decades—with distilling LLMs down to their essence as simply as possible, and that projects like nanoGPT and makemore follow this thread.
The core message of MicroGPT is this: real large-scale training code is complex mainly for efficiency reasons, but the algorithm itself is simpler than it appears. It can be implemented in roughly 200 lines of Python: the data pipeline, a model of about 50 lines, an autograd engine of about 100 lines, Adam in about 10 lines, and the training loop.
"The algorithm itself really only needs 200 lines. Most of the complexity exists because of 'efficiency.'"
But the more important point is what has changed. In the past, Karpathy would have used those 200 lines as the basis for lectures, guides, and videos for humans. Now, the situation is different. He says, "I no longer explain things to people—I explain things to agents." Once the agent understands, it can re-explain to any human "in whatever way they need, with infinite patience."
"Now I don't explain things to people. I explain to agents. Once the agent understands, it can explain to humans in any way they want, as many times as needed."
This shifts the form of education as well. Markdown documents for agents become more important than HTML documents for humans, and "scripting a curriculum for how to have agents teach something" may become a new skill rather than delivering lectures.
That said, Karpathy says agents still cannot "invent" a final, distilled form like MicroGPT themselves. Agents are good at understanding and explaining, but the kind of creative decision that emerges from years of obsession—a radical simplification—may still be a uniquely human contribution.
"Agents can't 'make' MicroGPT. But they can 'understand' it. That's my value contribution."
And he closes with this:
"Whatever agents can't do is now your job. Whatever they can do… they'll soon do better than you. So you need to be strategic about your time."
14. Closing (1:05:40–End)
In this conversation, Karpathy forecasts that agents will shift development to "macro-level" operations, and that the next step, autonomous loops (AutoResearch), will restructure research and industry broadly. At the same time, he weaves models' evaluability and jaggedness, the power balance between open and closed source, the growing importance of the digital-to-physical interface, and the shift in educational goals from "teaching humans" to "teaching agents" into a single coherent arc. The message running through it all is ultimately one: the era has arrived in which we must maximize leverage while choosing ever more precisely what to leave in human hands.
