In this talk, Netflix's Yesu Feng takes a detailed look at the company's journey toward a single "Foundation Model" for its recommendation system: a bold attempt to serve diverse recommendation scenarios with one unified model. The presentation covers the complexity of Netflix's recommendations, the technical approach, real-world applications, and future directions.
The Complexity of Netflix's Recommendation System
Netflix's recommendation system serves extremely diverse needs. Yesu Feng uses Netflix's home screen as an example, explaining that recommendation diversity spans at least three dimensions:
- Row-level diversity: various rows exist for genres (comedy, action, etc.), new/trending content, Netflix exclusives, and more.
- Item (content) diversity: beyond traditional movies and TV shows, content types are constantly expanding to include games, live streaming, and more.
- Page-level diversity: home, search, kids home, mobile feed -- each page calls for a different recommendation approach.
"Every page, every row, every item needs different recommendations, so naturally countless specialized models emerged."
This diversity led to a separate model being developed independently for each recommendation scenario, producing redundant, repetitive feature engineering that made management and scaling increasingly difficult.
"Countless derived features were built from the same user behavior data, each used slightly differently by each model, making maintenance extremely challenging."
Problem Awareness and a New Approach
Netflix determined that this situation had hit scalability limits. With content and business needs constantly growing, building new models every time was inefficient.
"If you keep building new models like this, innovation slows down and reuse becomes difficult."
So Netflix asked: "Can we learn user representations in one unified place?" The answer was to adopt a Transformer-based "Foundation Model."
- Core hypotheses
- Large-scale semi-supervised learning with Transformer architecture would significantly improve personalization quality.
- Integrating this foundation model across all recommendation systems would dramatically increase innovation velocity and efficiency.
Data and Training: Lessons from LLMs
In building the foundation model, Netflix drew heavy inspiration from large language model (LLM) development experience.
- Data preprocessing and tokenization
- Like LLMs, tokenization approach significantly impacts model quality.
- Unlike language tokens, Netflix's tokens are "user behavior events," with each event containing multiple attributes.
- Balancing tokenization granularity and context window size is crucial.
"Decisions made in tokenization affect every layer of the model and ultimately show up in quality."
- Model architecture
- Designed with a layered structure: event representation -> embedding -> Transformer -> objective function.
- Event representations include diverse information about when, where, and what happened.
- ID embeddings alone can't handle new content (cold start problem), so semantic information is also learned.
"ID embeddings alone can't process titles never seen during training, so semantic information is essential."
- Objective functions (losses)
- Can have much richer multi-objective targets than LLMs.
- For example, they simultaneously learn to predict next content to watch, behavior type, metadata, watch time, device, and more.
- Multi-task learning, multi-head, hierarchical prediction, and other approaches can be utilized.
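A minimal multi-head sketch of the idea: one shared hidden state feeds several task heads (next title, device, and so on), and the per-task losses are combined with weights. The head matrices and weights here are hypothetical placeholders for learned parameters.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(logits, target):
    return -np.log(softmax(logits)[target] + 1e-12)

def multi_task_loss(hidden, heads, targets, weights):
    """Sum weighted per-task losses computed from one shared hidden state.
    `heads` maps task name -> projection matrix (hypothetical parameters)."""
    total = 0.0
    for task, W in heads.items():
        logits = W @ hidden
        total += weights[task] * cross_entropy(logits, targets[task])
    return total
```

In training, all heads would be optimized jointly, so the shared representation has to serve every objective at once.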
Scaling Up and Key Learnings
Netflix scaled the model, trained on data from tens of millions of profiles, up to 1 billion parameters, confirming that scaling genuinely continued to improve recommendation quality.
"Performance kept improving as we scaled up, and there's still room to expand further."
- Key techniques borrowed from LLMs
- Multi-token prediction: Training the model to predict not just short-term behavior but long-term satisfaction.
"Introducing multi-token prediction significantly improved long-term user satisfaction and behavior prediction."
- Multi-layer representation: Leveraging outputs from multiple layers to create more stable and rich user representations.
- Long context window handling: Applying various strategies to efficiently learn increasingly longer sequences.
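The multi-token prediction idea above amounts to changing how training targets are built: instead of pairing each context with only the single next event, each context is paired with the next several events. A small sketch of that target construction:

```python
def multi_token_targets(sequence, horizon):
    """For each position, the targets are the next `horizon` items, so
    the model is trained toward longer-horizon behavior rather than
    only the immediate next event."""
    examples = []
    for t in range(len(sequence) - 1):
        context = sequence[: t + 1]
        targets = sequence[t + 1 : t + 1 + horizon]
        examples.append((context, targets))
    return examples
```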
Real-World Application and Architectural Changes
Before the foundation model, separate data, features, and models existed independently for each recommendation scenario. Now, data and representations (especially user/content embeddings) are unified, and each application model only needs to add a thin layer on top of the foundation model.
- Foundation model usage patterns
- Subgraph integration: Replacing parts of existing models with the foundation model
- Embedding extraction and sharing: Storing user/content embeddings in a central repository for various teams to use directly
- Fine-tuning/distillation: Additional training or lightweight adaptation for specific applications
"Now new applications can fine-tune the foundation model directly to quickly deliver a first experience."
- Results
- Clear A/B test wins and infrastructure consolidation benefits across diverse applications.
- Innovation velocity, scalability, and reusability improved significantly.
Future Directions and Q&A
Yesu Feng also introduced future development directions:
- Universal representation for heterogeneous content
- Generative retrieval: recommending entire collections that reflect business rules and diversity requirements, not just individual titles
- Prompt tuning: Using soft tokens like LLMs for rapid adaptation
"Through prompt tuning, we can change the model's behavior at inference time just by swapping soft tokens."
Q&A covered the following topics:
- Applications beyond recommendations: Netflix is expanding to comprehensively understand diverse entities and user preferences beyond just recommendations.
- Graph models and reinforcement learning: Knowledge-graph-based embeddings and reinforcement learning with sparse rewards are actively researched and used.
"Graph models cover the entire content ecosystem, and reinforcement learning uses sparse user behavior rewards for applications like collection recommendations."
- Embedding usage and speed: Embeddings are directly used by various downstream models, with speed also being an important consideration.
- Content embedding granularity: Frame-level video embeddings aren't used yet, but there are plans to move in that direction.
Conclusion
Yesu Feng summarized the talk:
"Adopting the foundation model has significantly increased the scalability and innovation velocity of Netflix's recommendation system. We'll continue advancing for even more diverse content and user experiences."
Key Concepts:
Foundation Model, Transformer, Personalization, Scale-up, Embedding, Multitask Learning, Prompt Tuning, Graph Model, Reinforcement Learning, Innovation Velocity
Netflix can now recommend content tailored to us more intelligently, faster, and with greater diversity -- all powered by a single powerful model.
