How We Accidentally Solved Robotics by Watching 1 Million Hours of YouTube

The Problem

You build a 640B-parameter language model called "Behemoth." It can argue philosophy and solve calculus. But ask it to pick up a coffee cup in the kitchen? It can't. Language models don't learn 3D physics from text alone.

"The secret ingredient wasn't more tokens — it was more video!"

V-JEPA 2: Predicting Reality, Not Words

Feed a neural network 1 million hours of YouTube and have it predict the next moment of reality, not the next word. The result: robots that can pick up never-before-seen objects in completely new environments.

The Beautiful Inside: Latent Space Prediction

Instead of predicting every pixel (like predicting every blade of grass when you just want to know if the ball went in), V-JEPA 2 predicts in latent space — the abstract essence of physical situations.

Three stages:

  1. Encoder (ViT-g, 1B parameters): Understands the physical situation
  2. Predictor: Fills in masked portions of video — a "video Mad Libs" game
  3. 3D-RoPE: Rotary position embeddings extended to three axes (time, height, width), so the model knows where each video patch sits in space and time
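The core of stages 1 and 2 is that the loss lives in latent space, not pixel space. Here is a toy sketch of that objective: `encode` and `predict_masked` are trivial stand-ins for the real ViT-g encoder and transformer predictor (latents are single floats, the "predictor" just interpolates from visible context), but the shape of the loss is the JEPA idea.

```python
# Toy sketch of JEPA-style masked prediction in latent space.
# Everything here is a stand-in: real V-JEPA 2 encodes video patches
# with a 1B-parameter ViT and predicts masked patch latents with a
# transformer. Only the *structure* of the objective is shown.

def encode(frame):
    # Stand-in encoder: map a "frame" (a number) to a latent.
    return frame * 0.5

def predict_masked(latents, masked_idx):
    # Stand-in predictor: guess each masked latent from the mean
    # of the visible (unmasked) latents.
    visible = [z for i, z in enumerate(latents) if i not in masked_idx]
    guess = sum(visible) / len(visible)
    return {i: guess for i in masked_idx}

def latent_loss(frames, masked_idx):
    # The loss compares predicted vs. true *latents*, never pixels --
    # the model is never asked to render every blade of grass.
    targets = [encode(f) for f in frames]
    preds = predict_masked(targets, masked_idx)
    return sum((preds[i] - targets[i]) ** 2 for i in masked_idx) / len(masked_idx)

print(latent_loss([2, 2, 2, 2], {1}))  # -> 0.0 (masked latent matches context)
print(latent_loss([0, 4], {1}))        # -> 4.0 (masked latent differs from context)
```

Predicting in latent space is what lets the encoder discard irrelevant pixel detail: a perfect pixel predictor would waste capacity on texture, while a latent predictor only has to get the abstract situation right.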

V-JEPA 2-AC: The Real Highlight

Freeze the pretrained V-JEPA 2, attach a 300M-parameter transformer for action-conditioned world modeling. Training data: only 62 hours of raw Franka robot arm footage (successes and failures).

Energy minimization control: The robot imagines multiple action sequences, picks the one closest to the goal state, executes the first action, repeats.
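That control loop can be sketched as random-shooting model-predictive control. Everything below is a toy: `toy_predict` stands in for the action-conditioned world model, latents are 2-tuples of numbers, and "energy" is just L1 distance to the goal latent. The loop structure — sample sequences, imagine outcomes, execute only the best first action — is the part that matches the description above.

```python
import random

def energy(z, z_goal):
    # L1 distance in latent space stands in for the model's energy.
    return sum(abs(a - b) for a, b in zip(z, z_goal))

def rollout(z, actions, predict):
    # Imagine the latent trajectory an action sequence would produce.
    for a in actions:
        z = predict(z, a)
    return z

def plan_first_action(z0, z_goal, predict, action_space,
                      horizon=3, samples=2000, rng=None):
    # Random-shooting MPC: sample candidate action sequences, score
    # each imagined end state, execute only the first action of the
    # best sequence, then replan from the new observation.
    rng = rng or random.Random(0)
    best_seq, best_e = None, float("inf")
    for _ in range(samples):
        seq = [rng.choice(action_space) for _ in range(horizon)]
        e = energy(rollout(z0, seq, predict), z_goal)
        if e < best_e:
            best_seq, best_e = seq, e
    return best_seq[0]

# Toy world model: an action simply adds its offset to the latent state.
toy_predict = lambda z, a: (z[0] + a[0], z[1] + a[1])
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]
first = plan_first_action((0, 0), (3, 0), toy_predict, actions)
print(first)  # -> (1, 0): the first step of the lowest-energy plan heads toward the goal
```

Executing only the first action and replanning is what keeps the 16-second-per-action loop robust: the model's imagination only has to be accurate one step at a time.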

Zero-Shot Generalization

Trained in one lab, deployed in a completely different lab with different lighting, objects, and environment:

  • Reach: 100% success
  • Grasp cup: 65%
  • Pick and place: 65-80%

Previous methods failed at nearly everything beyond reaching.

Speed: 16 seconds per action (vs. 4 minutes for diffusion models).

Bonus: Video QA SOTA

Combined with an 8B language model: 84.0% on PerceptionTest, 76.9% on TempCompass — beating models trained with language supervision.

"A video encoder pretrained without any language supervision beat image-text models. The assumption that language supervision is essential for world understanding just got demolished."

Limitations

  • Camera position sensitivity: 10 degrees of change breaks everything
  • Long-horizon planning: Model hallucinates after several steps
  • Language goals: Currently requires a photo of the goal state, not a text command

TL;DR

Property           V-JEPA 2    Diffusion   BC-Policy
Understanding      Excellent   So-so       So-so
Planning speed     Fast        Slow        Slow
Zero-shot magic    Yes         No          No
Data efficiency    High        Low         Medium
Can make coffee?   Probably    Doubtful    Maybe

Key Terms: V-JEPA 2, 1M Hours YouTube, Latent Space Prediction, Zero-Shot Generalization, Data Efficiency, Robot Action Prediction, Energy Minimization, World Model, Language-Free SOTA, Camera Sensitivity
