This article is a friendly summary of large-scale experimental results showing how applying reinforcement learning (RL) to Facebook's ad text generation AI improved real ad performance. The paper demonstrates that an RL-trained model called AdLlama raised click-through rate (CTR) by 6.7% over the existing supervised-learning model while also significantly increasing advertiser satisfaction and engagement, backed by real experiments and data. The key takeaway: RL-based post-training that uses actual ad performance data (CTR) as the reward signal is a powerful technique that can be applied broadly to other business domains.
1. Background: The Intersection of Generative AI and the Advertising Industry
Generative AI and large language models (LLMs) are driving innovation across a wide range of industries, with active research in content creation, education, healthcare, and decision-making, among others. While much research highlights the positive economic impact AI can have, translating this into actual real-world results requires 'post-training' the model to fit the specific use case.
Initial LLMs learn language patterns and common knowledge through pre-training on massive text data. However, custom post-training such as supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF) is essential to align them for real-world use.
"To realize a model's true potential, the process of precisely 'aligning' it through post-training tailored to real-world applications after pre-training is critically important."
The online advertising market is projected to reach $513 billion globally by 2025. This paper is the first to demonstrate at scale how applying RL-based post-training to Facebook's ad copy generation AI can improve actual advertiser ROI.
2. Overview of Meta's Text Generation Product
Meta (Facebook)'s Text Generation feature is a service where an advertiser inputs one piece of ad copy, and an LLM rewrites it into various versions for recommendation. Advertisers can select the copy they prefer or request more variations.

Key points:
- Advertisers are not required to use the AI-generated copy.
- They can edit AI suggestions or mix them with human-written versions.
- AI-generated copy is only served if the advertiser explicitly selects it before the ad is deployed.
Prior Models: Imitation LLM v1/v2
The initial version, Imitation LLM v1, was based on the open-source Llama 2 Chat 7B model, trained using only supervised fine-tuning (SFT) to mimic 'good ad copy style.'
- v1 used training data composed entirely of diverse ad copy generated by a larger AI model.
- v2 added high-quality examples rewritten by real humans to improve quality.
Research Objective
The goal of this paper is:
"To apply a new training approach that quantitatively improves actual advertiser performance (CTR) compared to the existing SFT-based LLM."
3. How Was Reinforcement Learning Applied? (AdLlama and the RLPF Method)
What is RL + Performance Feedback (RLPF)?
Copy quality is evaluated not by human raters but by actual performance data (CTR). That is, instead of artificial feedback where humans judge "this sentence is better," the actions of thousands to tens of thousands of Facebook users -- "clicking" or "ignoring" each ad -- serve as the reward signal.
"Since ad text performance is clearly measured by CTR, RL using actual performance data as rewards becomes a very powerful post-training method."
The actual implementation process:
- Ad performance records are analyzed to create ad pairs where only the copy differs under identical image/conditions.
- Within each pair, the ad with the higher CTR is labeled 'preferred,' yielding pairwise preference training data.
- This data is used to create a 'reward model (RM)' that predicts which ad copy will achieve a higher CTR.

- This reward model serves as the 'environment' when the AI generates copy, and RL (using the PPO algorithm) further trains the model to produce more high-reward (= high-CTR) copy.
- A length penalty is also applied to prevent the LLM from generating overly long copy.
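The steps above can be sketched in code. The paper does not disclose Meta's implementation, so everything here (the feature map, the linear scorer, the penalty weight, all function names) is a toy assumption meant only to illustrate the pairwise Bradley-Terry reward-model objective and the length-penalized reward that the RL stage would maximize:

```python
import math

def features(text):
    # Hypothetical hand-crafted features; the paper's reward model is an
    # LLM-based scorer, not a linear model like this toy.
    words = text.split()
    return [len(words) / 10.0, float(text.count("!")), 1.0]

def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_reward_model(pairs, lr=0.1, epochs=200):
    """pairs: (winner_text, loser_text) tuples built from same-image ads
    where only the copy differed and one side had the higher CTR."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for win, lose in pairs:
            xw, xl = features(win), features(lose)
            # Bradley-Terry objective: P(win beats lose) = sigmoid(s_w - s_l)
            p = 1.0 / (1.0 + math.exp(-(score(w, xw) - score(w, xl))))
            g = 1.0 - p  # gradient of log P w.r.t. the score margin
            for i in range(len(w)):
                w[i] += lr * g * (xw[i] - xl[i])
    return w

def shaped_reward(w, text, length_penalty=0.05):
    # Reward used during the RL (PPO) stage: reward-model score minus a
    # length penalty so training does not drift toward ever-longer copy.
    # The penalty form and weight are assumptions, not the paper's values.
    return score(w, features(text)) - length_penalty * len(text.split())
```

In a real pipeline the shaped reward would be fed to a PPO trainer updating the LLM's generation policy; the sketch only shows how CTR-derived preferences become a scalar training signal.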

The resulting new model is called AdLlama, and this is the key difference from the existing Imitation LLM.
4. Large-Scale Experiment: A/B Test Design & Results
Experimental Design
- February to April 2024, over 10 weeks
- 34,849 advertisers in the United States
- Randomly divided into two groups:
- 'Control:' received the existing Imitation LLM v2
- 'Test:' received the RL-based AdLlama

Performance was measured using a variety of real-world metrics including each advertiser's overall CTR (click-through rate), total clicks, number of ads, and number of ad copy variations.
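The paper's exact statistical procedure is not described in this summary, so the following is a minimal sketch of one standard way to read out such an A/B test: treat each advertiser's CTR as a single observation and compare group means with a permutation test. All names and numbers are illustrative:

```python
import random

def ctr(clicks, impressions):
    """Per-advertiser click-through rate."""
    return clicks / impressions if impressions else 0.0

def mean(xs):
    return sum(xs) / len(xs)

def permutation_pvalue(control, test, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference in mean per-advertiser CTR."""
    rng = random.Random(seed)
    observed = abs(mean(test) - mean(control))
    pooled = list(control) + list(test)
    n_test = len(test)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_test]) - mean(pooled[n_test:]))
        if diff >= observed:
            hits += 1
    return hits / n_perm
```

A per-advertiser aggregation like this keeps the randomization unit (the advertiser) aligned with the analysis unit, which is what makes the reported p-values interpretable.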
5. Key Results
5.1 Direct CTR Performance Comparison
- The AdLlama group saw a 6.7% increase in CTR (p=0.0296)
- Compared to Imitation LLM v2, average CTR rose from 3.1% to 3.3%
- In an environment where ad delivery is already heavily optimized, an improvement of this size represents a meaningful ROI gain
"Advertisers using AdLlama had 6.7% higher click-through rates, representing a meaningful improvement in Facebook ad ROI."
- Further analysis confirmed that the increase in clicks was due to genuinely better ads, not simply more impressions
"AdLlama did not affect advertisers' total impressions, but CTR improved as the number of clicks increased."
5.2 Ad Copy Utilization & Increased Advertiser Satisfaction
- The number of ad copy variations created by advertisers also increased by 18.5%
- Specifically: the average number of variations created per advertiser during the experiment period rose from 16.8 (baseline) to 19.9 (AdLlama), a statistically significant increase (p<0.01)
- This means:
- Advertisers used AI-recommended copy more frequently and more actively
- In other words, the AI's suggestions were more satisfying and trustworthy
"AdLlama's suggestions led to more and more diverse copy usage, reflecting higher advertiser satisfaction."
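As a quick arithmetic check on the figures above:

```python
def relative_lift(baseline, treatment):
    """Relative lift of a treatment metric over its baseline."""
    return (treatment - baseline) / baseline

# Ad copy variations per advertiser: 16.8 -> 19.9 (figures from the article)
variation_lift = relative_lift(16.8, 19.9)  # ~0.1845, i.e. the reported 18.5%

# CTR group means: 3.1% -> 3.3% (rounded figures from the article)
ctr_lift = relative_lift(3.1, 3.3)  # ~0.065; the reported 6.7% presumably
                                    # comes from the unrounded group means
```

The variation lift matches the reported 18.5% exactly; the CTR lift computed from the rounded means (about 6.5%) is consistent with the reported 6.7% once rounding of the underlying means is taken into account.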
6. Discussion: Implications and Limitations of the RL + Performance Feedback Approach
This study demonstrated that fine-tuning AI based on 'actual business performance metrics' can yield significantly better real-world results than training with human evaluators alone.
"Actively incorporating actual performance data (aggregate metrics like CTR) into post-training increases AI's real business contribution."
The authors are candid about the limitations:
- This experiment used only offline data (historical records) for a single round of RL
- In the future, evolving to online RL that incorporates the latest performance data from AI-generated copy would improve adaptability to real-time trends and new copy experimentation
- (e.g., more creative copy, campaign-specific characteristics, multi-objective optimization including user selection probability)
Scalability
- The same approach can be applied to various domains where 'performance can be measured numerically' -- email, e-commerce, customer service, etc.
"If data is available -- such as student scores on education platforms or participation rates in public interest campaigns -- the RLPF approach can be easily adopted."
7. Conclusion
AdLlama demonstrated through large-scale experimental data that an LLM fine-tuned with reinforcement learning using actual performance data can significantly improve ad performance (CTR) compared to supervised learning-based models, while also increasing advertiser satisfaction and engagement.
In particular, this approach of 'quantitatively' using real user responses and performance metrics has major implications not just for advertising but for all businesses with sufficient data. As this evolves further with online RL, multi-objective optimization, and real-time adaptation, the positive impact of AI on real-world business will only grow.
"This study is the largest-scale demonstration that RL post-training based on real metrics can close the gap between AI and real-world outcomes."
