This piece argues that the hidden driving force behind AI progress is data quality: sophisticated data curation matters far more than model architecture or compute scaling. Drawing on Ari Morcos's own career, it surveys the depth and practical impact of data research and the paradigm shift it implies for future AI development. The core message echoes the "Bitter Lesson": "Good models are built on good data."


1. Datology: How Data Curation Will Change the Future of AI

Datology's mission is to train models faster, better, and smaller through data curation. Ari Morcos put it this way:

"A model is made of the data it eats. Show it good data and you get a great model; show it bad data and you get a poor one."

He points out how often in machine learning data simply sits piled up in storage, neglected. Datology's goal is to automate the entire decision-making process around data--filtering, ordering (curriculum), synthetic data generation--so that anyone can train models with high-quality data.

The reason curation must be automated is simple: at the scale of modern datasets--trillions of tokens, billions of images--manual inspection by humans cannot possibly keep up.
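To make the scale argument concrete, here is a minimal sketch of one automated curation pass: exact-hash deduplication followed by a toy quality heuristic. The function names and the lexical-diversity score are illustrative assumptions, not Datology's actual pipeline.

```python
import hashlib

def quality_score(text: str) -> float:
    """Toy heuristic: penalize very short or highly repetitive documents."""
    words = text.split()
    if len(words) < 5:
        return 0.0
    return len(set(words)) / len(words)  # lexical diversity in [0, 1]

def curate(docs, min_score=0.5):
    """Drop exact duplicates, then keep documents above a quality threshold."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode()).hexdigest()
        if digest in seen:
            continue  # exact duplicate: skip
        seen.add(digest)
        if quality_score(doc) >= min_score:
            kept.append(doc)
    return kept

docs = [
    "the cat sat on the mat and watched the rain",
    "the cat sat on the mat and watched the rain",  # duplicate
    "spam spam spam spam spam spam",                # low diversity
    "data curation decides what a model learns from",
]
print(curate(docs))  # two documents survive
```

A production system would replace both stages with far stronger signals (near-duplicate detection, learned quality classifiers), but the shape of the loop--score, filter, deduplicate, repeat over billions of items--is why this cannot be done by hand.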


2. Ari Morcos's Journey: From Neuroscience to Realizing the Importance of Data

Ari's research career began in neuroscience. He taught rats to count, analyzed data from experiments observing how thousands of neurons in the brain activated, and naturally transitioned into machine learning. He explains that initially he focused on designing deep learning's "inductive biases"--figuring out how to make architectures smarter.

But in 2020, he encountered the same conclusion across multiple papers. What truly mattered was not bias design but the data itself.

"I spent six years focused on inductive biases, but I came to the bitter realization that data ultimately determines everything. The Bitter Lesson--it really stung."

He confesses that improving data quality holds an overwhelmingly larger share of impact than model architecture or labor-intensive theoretical approaches.


3. Why Has Data Been Historically Undervalued?

According to Ari, data research has the largest gap between impact and investment of any area in AI.

"Data is the most underinvested research area relative to its impact. The gap is astonishing."

There are several reasons:

  • The research community has long undervalued data processing as "grunt work" or "plumbing."
  • Much research focused exclusively on improving model structures and training methods within fixed datasets (e.g., ImageNet).
  • In the early days, data was scarce and manually labeled by humans, so there was little reason to question data quality.

But in 2019, the emergence of self-supervised learning changed the game entirely.

"Now models can learn to find answers on their own without labels, transitioning us from an era of scarce data to an era of overflowing data."

While the sheer volume of available data has exploded, problems of duplication, low quality, and noise have grown just as fast, making high-quality data selection and structuring more important than ever.


4. Why Automated Data Curation Is the Answer

Human evaluation and manual selection in data curation increasingly show their limits. For example, the DataComp-LM (DCLM) project revealed a striking fact:

"Thirty NLP experts who spent two years looking at nothing but data tried to predict the results of automatic filters, but their accuracy was no better than chance."

This demonstrates that the value of data depends on the relationship between each data point and the overall dataset, which humans cannot discern at a glance. Humans can assess individual data quality, but judging the "right amount of duplication"--like 10,000 duplicate Hamlet summaries--is beyond human capability.

Ari explains this with the elephant and puppy analogy:

  • Elephants are visually uniform as a species, so a model needs only a few examples to grasp the concept.
  • Dogs vary along many dimensions--breed, size, color, coat texture--so a model needs far more examples.

Because each concept requires different "quantities" and "levels of duplication," it is simply impractical to manage without an automated system.
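The elephant/puppy intuition can be sketched as a per-concept example budget. The caps below are hypothetical; an automated system would estimate each concept's variability from the data itself rather than hard-code it.

```python
from collections import Counter

def cap_per_concept(samples, caps, default_cap=2):
    """Keep at most caps[concept] examples per concept: visually simple
    concepts (elephant) need few, highly variable ones (dog) need many."""
    counts, kept = Counter(), []
    for concept, doc in samples:
        if counts[concept] < caps.get(concept, default_cap):
            counts[concept] += 1
            kept.append((concept, doc))
    return kept

samples = [("elephant", f"elephant photo {i}") for i in range(10)] \
        + [("dog", f"dog photo {i}") for i in range(10)]
caps = {"elephant": 2, "dog": 8}   # hypothetical per-concept budgets
kept = cap_per_concept(samples, caps)
print(Counter(c for c, _ in kept))  # elephant capped at 2, dog at 8
```

The hard part, which this sketch hides, is exactly what Ari describes: choosing the budgets themselves, which depends on how each concept relates to the rest of the dataset.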


5. Synthetic Data and the Technical Evolution of Curation

The data curation Datology emphasizes goes beyond simple "filtering." It encompasses redistribution of data (upsampling/downsampling), data ordering (curriculum), batching methods, and synthetic data (e.g., rephrasing).

There are broadly two types of synthetic data:

  • Net-New Generation: The model creates entirely new data for training. However, it is often difficult to create a "student better than the teacher," with risks of information distortion and model collapse.
  • Rephrasing: Transforming and refining existing data in various styles. Since the original information is preserved, this approach can actually produce models better than the original.
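The rephrasing idea can be sketched as follows. A real system would call an LLM to rewrite each document; here a set of placeholder style templates stands in, purely to show how one source fact fans out into stylistically diverse training examples without losing its content.

```python
def rephrase(fact: str, styles) -> list:
    """Restate the same underlying fact in several styles.
    A production system would call an LLM here; templates stand in."""
    return [style.format(fact=fact) for style in styles]

styles = [
    "{fact}",                        # keep the original
    "In plain terms: {fact}",        # simplified register
    "Q: What is known? A: {fact}",   # question-answer register
]
augmented = rephrase("water boils at 100 degrees Celsius at sea level", styles)
for line in augmented:
    print(line)
```

Because every variant carries the original information, rephrasing sidesteps the model-collapse risk of net-new generation while still adding the stylistic diversity Ari says is decisive.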

He emphasizes:

"Are textbooks alone enough? No. What truly matters is diversity. No matter how good high-quality token repetition is, diverse data is what ultimately determines model performance."

Curriculum approaches are also gaining renewed attention through recent research:

"The issue is no longer data scarcity but underfitting--data overflows but models can't consume it all. Sophisticated data ordering (curriculum) tailored to this situation is a game changer that can improve training efficiency by 10x or even 100x."


6. Real Changes Data Curation Has Produced: Smaller and Cheaper Models

Among the actual changes Datology has achieved, the most remarkable is achieving equivalent performance with just 10% of the data, or surpassing existing performance with models less than half the size.

"Train less and performance stays the same--or even improves. It is possible to build models that are smaller, faster, and yet superior!"

In practice, through collaboration with the RC Foundation, they started with 23 trillion tokens but used only 6.6 trillion tokens to achieve comparable or even better results than competing models.

This is possible because:

  • A step change in data quality breaks through machine learning's usual diminishing returns.
  • Data acts as a "multiplier" on compute: the same training budget can deliver tens of times more value.

7. Data and Model Compression: Pruning and Data-Driven Solutions

Many have pinned their hopes on model parameter pruning (removing unnecessary weights), but Ari explains:

"Pruning is too dependent on the dataset and data distribution to serve as a one-directional solution. Making models smaller through data is the far more fundamental approach."

However, he notes that data-driven compression is complementary to pruning, quantization, and other methods, and he predicts a future dominated by small, specialized models of a few billion parameters ("a few B").


8. Datology's Vision and the Future of the AI Industry

Datology's ultimate goal is "accurately assessing data value according to purpose." Ari likens this to an NP-complete problem at the heart of true AI innovation.

"Most organizations spend millions of dollars preparing model training and only think about dataset design two weeks later. Yet data is the most important thing!"

Ari Morcos emphasizes that the field of data curation is only just beginning. He says:

"We haven't even done 10% of what we could. The remaining potential is 100x. We are truly in the earliest stages."

He also mentions wanting to actively hire researchers who understand data curation well--the type who "stare at data intensely."


9. Meta, Superintelligence, and the Future Value of Data

Watching Meta invest massive resources in data, Ari summarizes:

"Companies like Meta focusing on data is a signal that the entire industry is awakening to the importance of data."

He also forecasts that the AI paradigm itself will gradually shift from "general-purpose large models" to "purpose-specific small models with high-quality data."

"Ultimately, what companies want are small, efficient models optimized for their own tasks. Companies that own their data and models will be the winners."


Conclusion

The fundamental progress of AI comes not from more parameters or hardware, but from better data. Ari Morcos and Datology emphasize that data quality is a core challenge for both AI research and industry, and that only sophisticated approaches--automated curation, curriculum, synthetic data--will determine future competitiveness.

"The real innovation in AI starts with data. Good models are built on good data."

The center of gravity in AI is now shifting from "model-centric" to "data-centric."
