
This video is an OpenAI live workshop covering Reinforcement Fine-Tuning (RFT): its concepts, practical usage, demos, real-world application cases, and Q&A. OpenAI experts walk through RFT's theoretical advantages and actual workflow, focusing on how to efficiently improve models with minimal data. It also includes a real customer case study (Accordance's tax optimization use case) along with practical best practices.
1. Build Hour and RFT Introduction
This workshop is part of OpenAI's Build Hour program, aimed at providing startups and AI practitioners with practical tools, tips, and best practices for maximizing their use of the OpenAI API and models. It is hosted by Christine, with Prashant and Theo, who have hands-on RFT experience.
The Build Hour objective is stated clearly:
"The goal of Build Hour is to provide the best know-how, tools, and AI expertise needed to scale your company using the OpenAI API and models."
Today's topic is Reinforcement Fine-Tuning (RFT), with the main flow outlined in advance:
"First an introduction to RFT, optimization and specific benefits, actual task setup, a live demo (with a code repo to share), and then the latest customer case studies and Q&A."
2. Fine-Tuning Options and RFT's Strengths
Separating 'Knowledge' from 'Reasoning' in Fine-Tuning
There are two dimensions to improving LLM performance: "what the model knows" and "how it thinks and reasons":
- Knowledge augmentation: Prompt engineering, RAG, etc.
- Reasoning improvement: Fine-tuning (Supervised, Preference, Reinforcement)
If information is simply lacking, prompt/RAG enhancement should come first. Fine-tuning is needed when knowledge is sufficient but the model still produces wrong answers or inefficient reasoning patterns.
"Fine-tuning is an investment. Use every other method first, and only then reach for fine-tuning as a last resort."
Three Fine-Tuning Approaches
- Supervised Fine-Tuning (SFT): Learning clear patterns from question-answer pairs
- Preference Fine-Tuning: Learning style preferences through good/bad answer examples (chatbots, marketing, etc.)
- Reinforcement Fine-Tuning (RFT): Instead of labeled answers, an 'auto-grader' scores model outputs
RFT's distinguishing features are explained:
"RFT no longer requires labeling data. With just a few examples, you can significantly improve performance according to complex reasoning or policy criteria."
Where RFT Excels
- Tasks requiring complex rules such as policy compliance, legal reasoning, and medical coding
- Trainable and improvable with very little data (tens to hundreds of examples)
- Eliminates the burden and cost of building large labeled datasets
"Teams are replacing complex policy pipelines with single reasoning agents, or using RFT for compliance checks based on actual policy logic."
3. How RFT Actually Works: Structure and Sample Efficiency
RFT's principle of achieving excellent efficiency with few samples is explained visually:
"In RFT, a single sample is sampled multiple times to explore different reasoning paths. Each path (answer) is graded, and the model ultimately learns better reasoning paths on its own."
That is, for the same input, various outputs/reasoning paths are compared and scored, and the model learns on its own "what constitutes good reasoning."
"Other fine-tuning methods provide one piece of information per sample, but RFT can extract numerous reasoning signals from a single sample, making it highly efficient!"
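The multi-rollout idea above can be sketched as a toy loop. This is purely illustrative (not OpenAI's actual training code, and all names are invented): the same prompt is sampled several times, each reasoning path is graded, and paths that score above the group average get a positive learning signal.

```python
def sample_completions(prompt, n=4):
    # Stand-in for n stochastic rollouts of the model on one prompt.
    return [f"reasoning path {i}" for i in range(n)]

def grade(completion, reference):
    # Stand-in grader: reward paths that mention the reference token.
    return 1.0 if reference in completion else 0.0

def advantage_signals(prompt, reference):
    # One prompt yields several graded paths; the score relative to
    # the group mean is the signal that reinforces better reasoning.
    paths = sample_completions(prompt)
    scores = [grade(p, reference) for p in paths]
    mean = sum(scores) / len(scores)
    return [(p, s - mean) for p, s in zip(paths, scores)]
```

Note how one training sample produces several graded comparisons, which is exactly the sample-efficiency argument made above.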
4. Practical Example: Legal Classification Task Preparation and Evaluation
Dataset and Metric Definition
The live demo task is multi-label classification of official EU legal categories (EuroVoc level 1).
"Given a legal text, the task is to predict which of 21 top-level legal topic categories apply."
Specific samples, dataset construction methods, and ground truth are clearly demonstrated with examples.
Evaluator (Grader) Design
- Precision: Of the predicted answers, how many are actually correct?
- Recall: How many of the true answers were found?
- F1 Score: The harmonic mean of both metrics
"The RFT system requires a single grade score per training sample, so a single score like F1 carries important meaning."
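A minimal F1 grader for this multi-label setup might look like the following sketch (the demo's exact grader is not reproduced here):

```python
def f1_grade(predicted, actual):
    """Return a single scalar grade for one multi-label prediction.

    predicted/actual are collections of category labels.
    """
    pred, gold = set(predicted), set(actual)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)  # true positives
    if tp == 0:
        return 0.0
    precision = tp / len(pred)   # share of predictions that are correct
    recall = tp / len(gold)      # share of true labels that were found
    return 2 * precision * recall / (precision + recall)  # harmonic mean
```

Collapsing precision and recall into one number is what makes this usable as the single per-sample grade RFT expects.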
Additionally, the procedure for applying data sampling strategies to correct class imbalance (frequency differences between categories) is shown:
"Training on imbalanced datasets causes the model to predict only frequent categories, artificially inflating scores — this is 'reward hacking.'"
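One simple balancing strategy along these lines caps how often each category may appear in the training set. This is an illustrative approach, not necessarily the one used in the demo:

```python
import random
from collections import defaultdict

def balance_by_category(samples, cap, seed=0):
    """Cap how often each category appears in the training set.

    samples: list of (text, categories) pairs; categories is a list.
    A sample is kept only while at least one of its categories is
    still under the cap.
    """
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)          # avoid order bias before capping
    counts = defaultdict(int)
    kept = []
    for text, cats in shuffled:
        if any(counts[c] < cap for c in cats):
            kept.append((text, cats))
            for c in cats:
                counts[c] += 1
    return kept
```

Capping frequent categories removes the easy shortcut of always predicting them, which is the reward-hacking failure described above.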
5. Code-Based Demo: Evaluation and RFT Training
Building the Evaluation Environment
- Prompt design: Providing 21 class names, specifying context and output format, designing human-readable instructions
- Grader code: Implemented in Python with exception handling (for model output format mismatches)
- Model: o4-mini with a low reasoning-effort setting for fast, inexpensive execution
"The same prompt, grader, and output format used for training must be reused for evaluation — this ensures valid and reliable performance comparisons."
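A defensive grader in that spirit might parse the model's output and return 0.0 on any format mismatch rather than crash the run. The JSON format and the `categories` field name are assumptions for illustration:

```python
import json

def robust_grade(model_output, gold_labels):
    """Parse a JSON answer and return an F1-style score in [0, 1].

    Any format mismatch (invalid JSON, missing/odd field) scores 0.0
    instead of raising, so one bad completion cannot halt evaluation.
    """
    try:
        parsed = json.loads(model_output)
        predicted = set(parsed["categories"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return 0.0
    gold = set(gold_labels)
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Because format violations score zero, the model is also implicitly trained to respect the requested output format.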
Practical Performance Analysis
- Comparison of various hyperparameters/checkpoints, visualizing how long the model reasons and how output variance changes at each
- Tracking major metrics like F1 score on both training data and validation data (real-use distribution)
"As training progresses, the model should show an upward curve on the training set. On the validation set, it shouldn't overfit too early. Ideally, both curves rise together!"
"When precision goes up excessively and recall drops, the model is cherry-picking correct predictions but missing many others. These details are designed to be verified in real time."
Model Improvement Results
- Direct comparison of GPT-4.1, baseline o4-mini, and the fine-tuned model
- The fine-tuned model showed improvements in both precision and recall, with the best F1 score performance
"On the actual validation set, zero-score samples (complete misses) disappear and the gap between average and maximum scores narrows. This means the model has been upgraded to more consistent, high-quality reasoning!"
6. Real-World Case Study: RFT Applied to Accordance's Tax Automation
In this section, David, CEO of startup Accordance, shares real-world experience applying RFT.
Adopting RFT for Tax Optimization
- RFT applied to complex tax regulations, optimization, and compliance work
- Emphasis that RFT develops the model's reasoning framework for "how to think and approach problems" rather than simply injecting knowledge
"RFT is particularly useful for problems where experts can 'objectively agree on what's correct.'"
Data and Grader Design Strategy
- "Data quality is far more important than quantity — excellent performance is possible with even small amounts (100–300 samples)"
- Grader design concepts: The importance of discrete, continuous, and stratified evaluation methods
"If you only evaluate with right/wrong (0/1), the model just guesses to get it right, and even when correct, you can't tell if it reasoned properly. You need a graduated, continuous evaluation function so that rewards go to correct reasoning paths."
- "For problems like tax strategy optimization, continuous scoring like 'how close to the optimal tax amount' can be designed"
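A "how close to the optimal tax amount" grade could be shaped like this sketch; the linear decay and the tolerance value are illustrative choices, not Accordance's actual grader:

```python
def closeness_reward(predicted_tax, optimal_tax, tolerance=0.25):
    """Continuous grade: 1.0 at the optimum, decaying linearly to 0.0
    once the relative error reaches `tolerance` (25% by default).
    """
    if optimal_tax == 0:
        return 1.0 if predicted_tax == 0 else 0.0
    rel_error = abs(predicted_tax - optimal_tax) / abs(optimal_tax)
    return max(0.0, 1.0 - rel_error / tolerance)
```

Unlike a 0/1 check, near-misses still earn partial credit, so reasoning paths that get closer to the optimum are consistently reinforced.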
Real-World Results and Strategy
"On the TaxBench industry evaluation set, we achieved over 40% performance improvement. And before starting RFT, it's important to also experiment with prompt optimization and RAG."
7. Q&A: Practical Concerns and Best Practices
Key Q&As
Q: What tasks suit RFT?
- Tasks with clear correct answers or preferences, where reasoning-style models are effective. Best when the evaluation function can be formalized and made continuous.
"RFT truly shines when you have a continuous (graduated) evaluation function!"
Q: Data quality/noise issues?
- RFT accumulates reasoning paths from each individual sample, so even one or two low-quality samples can be devastating. Always use thoroughly curated data, even in small quantities.
Q: Balancing cost, performance, and speed?
- Small models (o4-mini) can achieve large-model performance, making RFT a cost-effective investment. However, consider ROI based on training and experiment frequency and production transaction volume.
Q: Chatbots and other inconsistent prompts/unstructured responses?
- Well-curated tasks and evaluation environments (graders, output formats, data curation) are the foundation. In noisy, unstructured environments, first define and structure the task.
Additional RFT Workflow Summary:
- Obsess over quality even with few data samples
- Design sophisticated evaluators (graders)
- Align prompts, data, and evaluators with the "real-use environment"
- Carefully monitor and learn from charts and results across experiments
- Find optimal settings, seeds, and hyperparameters through iterative experimentation
8. Conclusion
This workshop packed everything from RFT concepts to code to real-world applications into a single session. In summary, RFT efficiently achieves performance improvements with small datasets in fields requiring complex reasoning and policy evaluation, with meticulous graders and data quality being the keys to success.
"Noise-free small data, structured evaluators, and prompts matching the real environment — these three elements are the secret to RFT success!"
Additionally, the fact that RFT can build powerful domain-specific AI without large-scale manual data across various fields (tax, medical, policy enforcement, etc.) suggests very high future applicability.
Try experiencing RFT yourself through official documentation, code repos, and Build Hour live participation!