Summary: This video is an interview with Edwin Chen, founder of data company 'Surge AI,' which reached $1 billion in annual revenue with fewer than 100 employees and zero external investment. He criticizes the AI industry's focus on flawed benchmarks and 'AI slop,' arguing that true AGI requires nuanced human 'taste' and sophisticated 'reinforcement learning environments.' He also predicts that future AI models will develop distinct personalities based on the values of the companies that build them.


1. Surge AI's Remarkable Growth, Defying Silicon Valley Convention

The interview opens with host Lenny introducing Surge AI's remarkable results: in just four years, with a lean team of roughly 60-70 people, Surge AI surpassed $1 billion in annual revenue. Even more strikingly, it was completely bootstrapped -- profitable from day one with zero venture capital. Edwin Chen drew on his experience as a researcher at big tech companies like Google, Facebook, and Twitter, and took the exact opposite path from conventional Silicon Valley wisdom.

Edwin believed that large tech companies were actually slowed by unnecessary headcount and processes. He was convinced that a small group of exceptional talent could produce much faster and more efficient results when freed from distractions.

I didn't want to play the Silicon Valley game. Every time I worked at a big tech company, I always thought: 'You could fire 90% of the people here and we'd move faster.' Because the best people wouldn't be bogged down with busywork. So when I started Surge, I wanted to do it completely differently -- a very small, elite-focused team.

At its core, Surge AI serves as a "teacher that instructs AI on what's good and bad." Rather than simply amassing large volumes of data, it helps AI models understand complex human intent and high-quality outputs. Edwin points out that most people don't even properly understand what data 'quality' means.

2. What Does True Data Quality Mean? (feat. Writing Poetry)

Many AI companies mistakenly believe that they can simply 'throw bodies at a problem' to acquire data. But Edwin emphasizes that quality isn't about checking boxes. He uses the example of writing an 8-line poem about the moon.

A low-level data approach only checks: "Is this a poem? Is it 8 lines? Does it contain the word 'moon'?" If it meets these conditions, it's deemed good data. But what Surge AI pursues is 'Nobel Prize-level poetry.'
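The box-checking approach he criticizes is easy to make concrete. A minimal sketch of such a rule-based validator (the function and criteria are invented for illustration, not Surge AI's actual pipeline):

```python
import re

def passes_checklist(text: str) -> bool:
    """Naive 'low-level' data validation: checks form, not quality.

    Hypothetical illustration -- not Surge AI's actual criteria.
    """
    lines = [line for line in text.strip().splitlines() if line.strip()]
    is_eight_lines = len(lines) == 8
    mentions_moon = bool(re.search(r"\bmoon\b", text, re.IGNORECASE))
    return is_eight_lines and mentions_moon

# A mechanically valid but artless "poem" sails through:
filler = "\n".join(["The moon is in the sky tonight."] * 8)
print(passes_checklist(filler))  # True -- yet nobody would call this poetry
```

The point of the sketch is what it cannot see: originality, imagery, emotional effect -- exactly the dimensions Surge claims to measure.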

We want Nobel Prize-winning poetry. Is this poem original? Is it full of subtle imagery? Does it surprise you and move you? Does it teach you something about the nature of moonlight? It shouldn't be a poem that mechanically meets conditions -- it needs to touch the reader's emotions and make them think.

Surge AI analyzes thousands of signals including workers' keyboard patterns, task speed, and code standards to measure this higher-dimensional quality. Much like Google's search engine finds good web pages, they select the best workers and the best outputs.
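One way to picture combining "thousands of signals" is a weighted score over per-worker features. Everything below -- the features, the weights, the names -- is invented for illustration; Surge's actual signals are not public.

```python
# Hypothetical illustration of ranking annotators by behavioral signals.
# Features and weights are made up; Surge AI's real model is not public.
from dataclasses import dataclass

@dataclass
class WorkerSignals:
    typing_consistency: float  # 0..1, steadiness of keystroke patterns
    task_speed: float          # 0..1, normalized (too fast can mean skimming)
    review_agreement: float    # 0..1, agreement with expert spot-checks

def quality_score(s: WorkerSignals) -> float:
    """Combine signals into a single ranking score (weights are invented)."""
    return 0.3 * s.typing_consistency + 0.2 * s.task_speed + 0.5 * s.review_agreement

workers = {
    "careful": WorkerSignals(0.9, 0.6, 0.95),
    "rushed": WorkerSignals(0.4, 0.95, 0.5),
}
ranked = sorted(workers, key=lambda name: quality_score(workers[name]), reverse=True)
print(ranked)  # ['careful', 'rushed']
```

Note the design choice in the sketch: raw speed is weighted lowest, since a fast worker who disagrees with expert reviews is exactly the kind of signal a checklist would miss.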


3. Why Claude Leads in Coding and Writing

Anthropic's Claude model has recently been dominating in coding and writing. Edwin attributes this to 'data quality' and 'taste.' Every frontier lab faces countless decision points when training models.

For example, when building a coding model, they must decide whether to prioritize visual design or backend efficiency. They must also choose between boosting benchmark scores for marketing or focusing on real user utility even if scores are slightly lower.

There's an 'art' to post-training. It's not purely scientific. Concepts of 'taste' and 'sophistication' come into play when deciding what model to build. Companies like Anthropic produce better results because they have the taste to consider subtle, implicit quality beyond simply checking boxes.

4. The Benchmark and Leaderboard Trap: Manufacturing AI 'Slop'

Edwin strongly criticizes the AI industry for pushing AGI in the wrong direction. He particularly points to how popular leaderboards like 'LMSYS Chatbot Arena' are making models dumber. In these voting systems, general users skim answers quickly, so flashy-looking responses score high even when the content is wrong.

Instead of building AI that advances humanity -- curing cancer, solving poverty, understanding the universe -- we're optimizing for 'AI slop.' We're literally tuning models to the tastes of people buying tabloids at the grocery checkout. We're teaching models to chase dopamine instead of truth.

In practice, using lots of emojis, adding bold text, and increasing response length boosts leaderboard rankings even when the content is hallucinated. Researchers are degrading model performance to chase benchmark scores for career advancement.

He also warns that just as social media recommended provocative content to increase engagement, AI models are evolving to pander to users with "you're right" and "what a great question" -- enabling delusions rather than providing honest feedback.


5. When Will AGI Arrive? And a Critique of Silicon Valley Startup Culture

Many claim AGI is imminent, but Edwin expects it will take much longer than people think -- over 10 years. Going from 80% to 90% performance, and then to 99%, is exponentially harder.

He also despises Silicon Valley's advice to "pivot and blitzscale." Switching ideas every two weeks and forcing growth doesn't produce genuine innovation.

Don't pivot. Don't blitzscale. Don't hire Stanford grads who just want a hot company on their resume. Build the one thing that only you can build, the thing that can't exist without your unique insight.

6. The Key to Next-Gen AI Training: RL Environments

Until now, AI has primarily learned by imitating human writing (supervised fine-tuning, or SFT) or by receiving human judgments about which of two responses is better (reinforcement learning from human feedback, or RLHF). But Edwin says 'RL environments' will be the next frontier: instead of simply telling the AI the right answer, you build a virtual world (a simulation) resembling reality and let the model solve problems within it.

For example, build a virtual startup environment with Slack, Jira, a codebase, and AWS. Then give the mission: "The AWS server is down -- fix it." The model uses tools within this environment, fails, retries, and learns on its own.
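The setup he describes maps naturally onto the standard reset/step interface used in RL. A toy sketch, with the "fix the AWS server" mission reduced to a tiny state machine -- every name and transition here is an invented stand-in, not a real training environment:

```python
# Toy RL environment: an agent must diagnose and restart a "down" server.
# Invented illustration of a reset/step interface, not a real product.
class ServerDownEnv:
    ACTIONS = ["check_logs", "restart_service", "give_up"]

    def reset(self) -> str:
        self.logs_checked = False
        return "alert: AWS server is down"

    def step(self, action: str):
        """Return (observation, reward, done)."""
        if action == "check_logs":
            self.logs_checked = True
            return "logs: service 'api' crashed (out of memory)", 0.0, False
        if action == "restart_service":
            if self.logs_checked:  # restarting after a diagnosis succeeds
                return "server healthy", 1.0, True
            return "restart failed: root cause unknown", -0.1, False
        return "episode abandoned", -1.0, True

env = ServerDownEnv()
obs = env.reset()
for action in ["check_logs", "restart_service"]:  # one successful trajectory
    obs, reward, done = env.step(action)
print(obs, reward)  # server healthy 1.0
```

An agent that restarts blindly gets a small penalty and must try again -- which is exactly the "fails, retries, and learns on its own" loop described above.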

The 'trajectory' by which the model reaches the answer matters. Sometimes a model gets the right answer but only after failing 50 times or solving it inefficiently. You need to teach not just the result but the entire problem-solving process. This is much more similar to how humans learn through trial and error.
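Scoring the trajectory rather than only the final answer can be sketched as a reward that discounts wasted steps. This is hypothetical reward shaping for illustration, not a published training method:

```python
def trajectory_reward(solved: bool, steps_taken: int, optimal_steps: int) -> float:
    """Reward the process, not just the outcome (illustrative shaping only).

    A correct answer reached after 50 retries scores far below one reached
    efficiently; a failed episode scores zero.
    """
    if not solved:
        return 0.0
    efficiency = optimal_steps / max(steps_taken, optimal_steps)
    return 1.0 * efficiency  # full credit only for a clean solution

print(trajectory_reward(True, 2, 2))   # 1.0 -- solved in the optimal 2 steps
print(trajectory_reward(True, 50, 2))  # 0.04 -- solved, but after 50 attempts
print(trajectory_reward(False, 5, 2))  # 0.0 -- never solved
```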


7. Future Predictions and Edwin's Philosophy: AI as Humanity's Children

Edwin predicts that within the next few years, AI models will become starkly differentiated based on the values of the companies that build them.

A few days ago I asked Claude to draft an email, and we spent 30 minutes going back and forth crafting the perfect email. Then I realized I'd spent 30 minutes on an email that wasn't even important. What kind of model do you want? One that says "you're right, here are 20 more ways to improve" and extends the conversation 50 more turns? Or one that says "No, stop. This is good enough. Just send it and get on with your day"?

He sees 'vibe coding' (describing what you want in natural language and accepting AI-generated code without closely reviewing it) as overrated, but considers features like mini apps running directly within chatbots (such as Artifacts) underrated.

At the interview's close, Edwin shares his unique background (math, linguistics, dreams of communicating with aliens) and says he wants to run Surge AI like a research lab rather than a typical startup. He also confesses that he dislikes the term "data labeling."

Conclusion

I think what we do is similar to 'raising children.' You don't just inject information into a child. You teach them values, creativity, what's beautiful, how to be a good person -- subtle things like that. We're teaching AI exactly those things. We're raising humanity's children.
