This document covers a study on how Meta used reinforcement learning (RL) to improve Facebook's ad copy generation. Using a new post-training method called RLPF (Reinforcement Learning with Performance Feedback), which uses ad performance data as the reward signal, they trained a large language model (LLM) named "AdLlama." In a large-scale A/B test conducted over 10 weeks in early 2024, AdLlama improved click-through rate (CTR) by 6.7% relative to the existing model and also increased advertiser satisfaction. This result quantitatively demonstrates the real business impact of generative AI.


1. Introduction: The Importance of Generative AI and Post-training 💡

Generative AI has been recognized for its innovative potential across various industries including content creation, education, healthcare, and decision-making, and is expected to have a significant economic impact. In particular, large language models (LLMs) undergo a pre-training stage where they learn general language patterns from vast text data. However, applying them to real-world environments requires post-training — fine-tuning and aligning the model for specific tasks.

While there has been extensive research on the impact of LLMs, the specific economic impact of the post-training stage has been relatively underexplored. This paper analyzes the real-world impact of RL-based post-training through the online advertising industry. As of 2025, global online advertising spending reaches $513 billion, representing 63% of total global advertising revenue — highlighting just how important this market is.

Meta's Text Generation product is a feature that uses LLMs to generate various ad copy variations based on the advertiser's original copy. Through this feature, advertisers can leverage Meta's ad delivery system to select the best-performing ad copy. The initial Text Generation product used an LLM fine-tuned via supervised fine-tuning (SFT) to mimic curated ads.

The goal of this research is to improve the Text Generation LLM to write more compelling ad copy and thereby measurably improve advertiser performance. To achieve this, they introduced a new approach called RLPF (Reinforcement Learning with Performance Feedback) that directly uses ad performance (click-through rate) as a reward signal. RLPF treats user behavior (click or no-click) for each ad impression as a small form of feedback provided by thousands of humans, extending the concept of traditional RLHF (Reinforcement Learning from Human Feedback).

The authors conducted a large-scale online A/B test comparing the RLPF-trained model with the existing SFT-based imitation model. This test, conducted over 10 weeks in early 2024 with approximately 35,000 advertisers and 640,000 ad variations, showed a remarkable 6.7% increase in CTR for advertisers using the RLPF model compared to the existing imitation model. Additionally, the number of ad variations generated by advertisers increased by 18.5%, suggesting that advertisers were more satisfied with AdLlama's output.

These results have several important implications. First, they highlight the effectiveness of reinforcement learning for LLM post-training in business use cases. Second, by quantitatively demonstrating the benefits of RL-based LLM post-training in online advertising, they make an important contribution to understanding the broader impact of generative AI. This study is reported to be the largest real-world application study of generative AI to date.

Figure 1: Overview of research contributions. The left panel shows RLPF, and the right panel shows the results of the large-scale A/B test.


2. Meta's Text Generation Product 📝

Meta's Text Generation product is a generative AI feature that helps advertisers experiment with different versions of ad copy. The feature works by taking the advertiser's original ad copy as input and having the LLM suggest new variations. For example, it can highlight key selling points or add creative messaging.

2.1 User Interface 🖥️

The Text Generation product's user interface appears during the ad creation process. Advertisers first enter their original ad copy. The LLM then generates and displays multiple text variations, and advertisers can select the ones they want to use or edit them directly. They can even add their own custom variations through an "Add text option" button.

For example, if an advertiser enters the original copy "Spend this weekend with a new book! Visit the bookstore today," the LLM might suggest variations like:

  • "Enrich your weekend with a new book! Visit the bookstore now."
  • "No more boring weekends! Discover the special books waiting for you at the bookstore."
  • "Dreaming of the perfect weekend with a book? Come to the bookstore!"

Advertisers can select up to 5 text variations, including the original copy and AI-generated variations, to deliver to users. They can also continue generating additional text variations through a "Generate more" button.

An important point here is that advertisers don't have to select the AI-generated copy. Advertisers can completely ignore the AI's suggestions and use the original text, use the AI's suggestions as reference to write their own multiple copies, edit the AI-generated copy, or use the AI-generated copy as inspiration for new copy. Therefore, the LLM plays a subtle but important role in shaping ad copy regardless of what the advertiser ultimately selects.

Figure 2: The user interface of the Meta Text Generation product. Based on the advertiser's input copy, AI suggests various variations, and advertisers can select or edit the ones they want.

2.2 Existing Imitation Model 🧠

The initial version of the Text Generation LLM, Imitation LLM v1, was launched in November 2023. This LLM is based on the 7B (7 billion) parameter version of Meta's open-source language model, Llama 2 Chat. The model was post-trained via supervised fine-tuning (SFT) to mimic the style of a pre-curated set of ads.

An improved version, Imitation LLM v2, was subsequently released, characterized by the use of higher-quality data. While v1's training dataset was entirely based on synthetically generated data from a large LLM, v2 data included human-written (i.e., contractor-written) examples. These training examples were curated by having LLMs or humans rewrite existing ads according to specific instructions such as "rephrase and shorten," "make it clearer," "make it actionable," "evoke empathy," "turn it into a question," and "focus on selling points."

The work presented in this paper was conducted after the initial launch of the Text Generation product. The goal was to improve the existing imitation-based Text Generation LLM to quantitatively improve advertiser performance in terms of CTR. To achieve this, they leveraged a new idea of applying reinforcement learning to aggregated performance feedback signals.


3. Methodology: RLPF and Experimental Design 🧪

This section describes the methodology for training the new Text Generation LLM, AdLlama, including data preparation, reward model design, and reinforcement learning. It also covers the A/B test design used to quantify the performance improvement of the new model over the existing imitation model.

3.1 Reinforcement Learning with Performance Feedback (RLPF) 🎯

While pre-trained LLMs have acquired vast knowledge, they are generally not ready for deployment as-is. An important alignment step, adapting the model to specific tasks, is needed before putting it in front of users. The main approach to LLM alignment is collecting preference data from human labelers: labelers compare two responses and indicate which is better, and the model is fine-tuned on this preference data to generate responses closer to those humans prefer. This process is known as RLHF (Reinforcement Learning from Human Feedback). Because the "quality" of many LLM tasks, such as open-ended conversation or creative writing, is subjective and hard to quantify, training on human preferences is the closest available approximation to a well-defined optimization objective.

The authors' key insight is that the task of writing ad copy can be clearly linked to a measurable quantitative objective — the ad's click-through rate (CTR). This setting can be applied not only to online advertising but to any domain with concrete performance metrics, such as e-commerce, AI customer support agents, and educational technology. The authors propose the following general approach, which can be seen as a metric-driven extension of RLHF:

  1. Train a Performance Reward Model: Train a Performance Reward Model that scores text using aggregated performance metrics. Higher scores are assigned to better-performing text.
  2. RL Fine-tuning: Use the trained reward model as an interactive environment to perform RL fine-tuning. The goal is to fine-tune the LLM to increase the likelihood of generating high-reward text.

The method for training a CTR-based performance reward model is as follows. Even before the Text Generation product was launched, advertisers had a practice of testing multiple (human-written) text variations for a single ad using Meta's "multi-text optimization" tool. Thanks to this practice, historical ad data could be observed where all other elements of the ad (image, headline, targeting criteria, etc.) were held constant except for the text. This is called multitext data.

From multitext data, preference pairs can be constructed by marking text with higher CTR as "more preferred" and text with lower CTR as "less preferred." This is a pairwise dataset that supports the standard Bradley-Terry preference-based reward model training approach. They also considered a simpler reward modeling approach using a pointwise dataset where each row is simply ad text and its resulting CTR, but pointwise reward models were found to be inferior at identifying the ordering (or ranking) among similar ad texts — which is ultimately more important than pure CTR prediction. The final RM training dataset contained approximately 7 million preference pairs.
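The pair-construction step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `multitext_ads` schema (ad ID mapped to (text, clicks, impressions) variants) is a hypothetical stand-in for the unpublished data format, while the Bradley-Terry loss shown is the standard pairwise form.

```python
import math
from itertools import combinations

def build_preference_pairs(multitext_ads):
    """Construct (preferred, dispreferred) text pairs from multitext data.

    `multitext_ads` maps an ad ID to a list of (text, clicks, impressions)
    tuples where everything except the text was held constant.
    (Hypothetical schema; the paper does not publish its exact format.)
    """
    pairs = []
    for variants in multitext_ads.values():
        for (t1, c1, i1), (t2, c2, i2) in combinations(variants, 2):
            ctr1, ctr2 = c1 / i1, c2 / i2
            if ctr1 == ctr2:
                continue  # a tie carries no preference signal
            # Higher-CTR text is marked "more preferred".
            pairs.append((t1, t2) if ctr1 > ctr2 else (t2, t1))
    return pairs

def bradley_terry_loss(reward_preferred, reward_dispreferred):
    """Standard pairwise Bradley-Terry loss: -log sigmoid(r_w - r_l)."""
    margin = reward_preferred - reward_dispreferred
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Minimizing this loss over the ~7 million pairs pushes the reward model to score the higher-CTR text of each pair above its lower-CTR sibling, which is exactly the ranking behavior the authors found pointwise CTR regression to be worse at.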

Based on the trained reward model, Proximal Policy Optimization (PPO) was used to align the LLM with high-performing ad text. A length penalty was added to counteract the model's tendency to generate excessively long ad text.
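The length-penalized reward can be sketched as below. The paper only states that a length penalty was added; the linear form, the character-count basis, and both constants here are illustrative assumptions.

```python
def shaped_reward(rm_score, ad_text, max_free_chars=125, penalty_per_char=0.01):
    """PPO training reward: reward-model score minus a length penalty.

    Text up to `max_free_chars` characters incurs no penalty; beyond that,
    each extra character subtracts `penalty_per_char` from the reward.
    (Both constants are hypothetical; the paper does not report them.)
    """
    excess = max(0, len(ad_text) - max_free_chars)
    return rm_score - penalty_per_char * excess
```

During PPO, the policy LLM generates a candidate variation, this shaped value is used as the scalar reward, and the policy is updated to raise the likelihood of high-reward generations while the penalty term keeps it from drifting toward ever-longer copy.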

The authors improved Imitation LLM v2 using RLPF techniques and named this model AdLlama. AdLlama is based on the 7B Llama 2 Chat model and differs from existing imitation models in its training method (RLPF vs. SFT) and training data (historical ad performance vs. curated examples).

Figure 4: Comparison of AdLlama and Imitation LLM v2. AdLlama was additionally trained with RLPF and historical ad performance data, while Imitation LLM v2 was trained only with SFT on curated examples.

3.2 Experimental Design 🔬

The authors conducted a large-scale A/B test (randomized controlled trial) comparing the AdLlama model with Imitation LLM v2 to evaluate the impact of RLPF training on advertiser performance. This A/B test was conducted over 10 weeks from February 16 to April 25, 2024, targeting N=34,849 advertisers in the United States. Advertisers were randomly assigned at the advertiser level to use either (1) Imitation LLM v2 ("control group") or (2) RLPF-trained AdLlama LLM ("treatment group").

Figure 5: A/B test timeline. After the launch of Imitation LLM v1 on November 23, 2023, the A/B test was conducted over 10 weeks from February 16 to April 25, 2024.

The primary focus was on advertiser-level performance to understand ways to improve advertisers' return on investment. In particular, the following metrics were analyzed:

  • Total engagement (clicks)
  • Total impressions (views)
  • Total ads generated
  • Total ad variations generated

All metrics were aggregated at the advertiser level on the Facebook mobile feed during the 10-week experimental period. Various advertiser covariates were also recorded, including pre-experiment lifetime CTR, new advertiser status, and ad creation behavior since the initial Text Generation feature launch (November 2023-February 2024). These covariates included advertiser industry, expertise level, budget level, business account status, and account age. Analysis results showed no statistically significant imbalance between the two groups, confirming that sample characteristics were balanced.
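A common diagnostic for covariate balance between randomized arms is the standardized mean difference (SMD). The sketch below is a generic illustration of that check, not the specific balance tests the authors ran (which the summary does not detail).

```python
import math

def standardized_mean_diff(mean_treat, var_treat, mean_ctrl, var_ctrl):
    """Standardized mean difference for one covariate across A/B arms.

    A common rule of thumb treats |SMD| < 0.1 as well balanced.
    This is a generic diagnostic, not the authors' reported test.
    """
    pooled_sd = math.sqrt((var_treat + var_ctrl) / 2.0)
    return (mean_treat - mean_ctrl) / pooled_sd
```

Running such a check per covariate (pre-experiment CTR, account age, budget level, and so on) and finding all differences small is what "no statistically significant imbalance" operationally looks like in an advertiser-level randomization.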


4. Key Results: AdLlama's Performance Gains 🏆

4.1 Advertiser Performance Analysis 📊

The authors used a log-binomial regression model to evaluate the impact of AdLlama and Imitation LLM v2 on advertiser-level CTR. This model is suitable for CTR modeling and uses a log link function to derive relative risks that can be directly interpreted as CTR ratios.
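With only a binary treatment indicator and no covariates, the exponentiated treatment coefficient of a log-binomial model equals the plain CTR ratio between arms, which makes for a useful sanity check. The sketch below is that model-free computation plus the standard delta-method standard error; the authors' actual analysis is covariate-adjusted and is not reproduced here.

```python
import math

def ctr_relative_risk(clicks_t, imps_t, clicks_c, imps_c):
    """CTR ratio (relative risk) of treatment vs. control.

    In a log-binomial regression whose only regressor is the treatment
    indicator, exp(beta_treatment) equals exactly this ratio.
    """
    return (clicks_t / imps_t) / (clicks_c / imps_c)

def log_rr_standard_error(clicks_t, imps_t, clicks_c, imps_c):
    """Delta-method standard error of log(relative risk)."""
    return math.sqrt(1.0 / clicks_t - 1.0 / imps_t
                     + 1.0 / clicks_c - 1.0 / imps_c)
```

For example, moving advertiser-level CTR from roughly 3.1% to 3.3% (the figures reported below) yields a relative risk of about 1.065-1.067, i.e. the ~6.7% relative lift the regression estimates.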

The main regression analysis results are as follows:

  • AdLlama achieved a statistically significant 6.7% increase in advertiser-level CTR compared to Imitation LLM v2 (p=0.0296, standard error 0.0299).
  • This corresponds to an absolute increase in advertiser-level CTR from approximately 3.1% to 3.3%.
  • While the absolute increase may seem small, the 6.7% relative increase represents a significant improvement in advertisers' return on investment on Facebook. Even small CTR increases are extremely difficult to achieve on mature advertising platforms like Facebook, which are already well-optimized.

Several robustness checks were performed to validate the robustness of these results:

  • Model-free analysis confirmed no imbalance between the two groups.
  • Various alternative CTR regression models (quasi-binomial, logistic, Poisson) were tested, all showing qualitatively similar effects.
  • Separate linear regressions for clicks and impressions confirmed that AdLlama increased total clicks per advertiser while having no effect on total impressions. This is consistent with the CTR increase finding.

4.2 Impact on Ad Variation Generation ✍️

The authors were also interested in the impact of AdLlama and Imitation LLM v2 on advertisers' use of the Text Generation product. Two outcomes were considered: the number of ad variations and the number of ads generated during the experimental period.

Linear regression analysis results using the same covariates are as follows:

  • Using the AdLlama LLM significantly increased the number of ad variations generated by advertisers (p<0.01).
  • Specifically, the number of ad variations increased from approximately 16.8 with Imitation LLM v2 to 19.9 with AdLlama — an 18.5% increase.
  • In contrast, the total number of ads generated remained statistically the same.

This suggests that advertisers were more willing to use the Text Generation product's suggestions from AdLlama compared to Imitation LLM v2.


5. Discussion: RLPF's Potential and Future Challenges 🚀

This study demonstrates that Reinforcement Learning with Performance Feedback (RLPF) can train LLMs to generate effective ad copy for both advertisers and users. Large-scale A/B test results for Meta's Text Generation product showed that the RLPF-based model significantly increased advertiser-level CTR and also increased the number of ad text variations that advertisers were willing to use. These results support the concept of anchoring the fine-tuning process to real-world aggregated performance metrics, rather than relying solely on human evaluator preference feedback or rule-based rewards.

5.1 Limitations 🚧

The authors acknowledge several limitations of the study:

  • Offline RL: The model was trained using offline historical performance data, amounting to a single round of offline reinforcement learning without real-time interaction with the environment. To further refine the model, the performance outcomes of LLM-generated ads would need to be fed back into an iterative process. That would be closer to online RL, where the model continuously interacts with the environment and adapts based on real-time feedback; such a system could also adapt to new trends and discover them through exploration.
  • Single-objective optimization: The current model primarily focuses on ad performance, but other important factors should also be considered. For example, there may be trade-offs between generating high-performing ads and demonstrating high creativity. Additionally, the model's ability to adhere to specific advertiser guidelines (e.g., maintaining a specific tone) is an important consideration. Addressing these aspects requires multi-objective optimization approaches that effectively balance diverse goals.
  • Human element not reflected: The current model does not account for the human element of the Text Generation product. Before ad text variations are delivered to users, advertisers must explicitly select them. Future iterations of RLPF reward model training could consider weighting CTR by the likelihood of a text being selected by the advertiser.
  • Platform-level factors: Beyond individual ad performance, platform-level factors such as ad inventory diversity are also important for a positive user experience. Future research should explore strategies that consider these other factors while simultaneously optimizing performance.

5.2 Broader Implications 🌐

These research results contribute to understanding the impact of LLMs. By quantifying the benefits of RL-based post-training in online advertising, they provide concrete data highlighting the potential for these models to impact real businesses by leveraging relevant performance metrics. The ability to generate more compelling ad content not only improves existing advertisers' ROI but can also lower the entry barrier for new and unskilled advertisers (e.g., small businesses) by reducing the need for extensive marketing expertise and resources.

This methodology is not limited to online advertising. The principles of RLPF can be applied to other domains where aggregated performance metrics are available. By using performance data as a feedback mechanism, organizations can fine-tune LLMs to optimize desired outcomes. For example, the core methodology can be easily extended to closely related settings such as personalized email campaigns or e-commerce product descriptions. RLPF can also be extended to settings with multiple rounds of conversational feedback, such as AI customer support agents, using metrics like resolution rate, satisfaction scores, or user response time.

RLPF can also be applied in less obvious settings. For example, on online learning platforms, student performance data (test scores and engagement metrics) can guide adaptive learning content generation, and in specific public awareness campaigns (e.g., vaccination, energy consumption), performance data can help LLMs rewrite communication materials to better resonate with the intended audience.

This research is merely a first step in demonstrating the potential of RL augmented with aggregated performance feedback. The authors believe this is a promising and generalizable approach that can bridge the gap between high-performance language models and real-world outcomes.


Conclusion: RLPF, a New Success Formula for the AI Era 🌟

Meta's study demonstrated that RLPF (Reinforcement Learning with Performance Feedback) is more than a theoretical concept: it can substantially improve the performance of LLMs in real business environments. The 6.7% CTR increase and 18.5% higher ad variation adoption achieved by introducing AdLlama for ad copy generation show how powerful reinforcement learning can be when directly linked to business objectives.

This research suggests that generative AI can play a pivotal role in creating measurable business value beyond simple content generation. Particularly in fields like online advertising where clear performance metrics exist, RLPF presents a powerful alternative that can surpass the limitations of traditional human-feedback-based models.

Of course, challenges remain to be addressed, including the limitations of offline learning and the need for multi-objective optimization. However, this study suggests that RLPF can be a new success formula for maximizing the potential of LLMs across various industries and ultimately increasing companies' return on investment. Future developments through online application of RLPF and multi-objective optimization are anticipated.
