Reinforcement Learning and the Future of Agents: A Summary of Will Brown's Talk

1. Introduction and Topic

Will Brown is a machine learning researcher at Morgan Stanley. In this talk, he explores what reinforcement learning (RL) means for agents. He studied multi-agent reinforcement learning theory at Columbia University and currently works on a variety of large language model (LLM) projects. Rather than focusing on immediate productization, this talk is about anticipating and preparing for the future — specifically, what role reinforcement learning can play in the agent engineering loop.

"This talk is not about proven science or best practices we can apply tomorrow. It's about where we're headed, and what reinforcement learning means for agents."

2. The Current State of LLMs and Agents

Where LLMs stand today: Most LLMs function as chatbots. Using OpenAI's "5-level framework" as a reference, we've made significant progress at the chatbot (level 1) and reasoner (level 2) stages. However, the transition to agents (level 3) — systems that take actions and handle complex tasks — is still in its early stages.
Current limitations of agents:
- Most agents operate for only short durations (under 10 minutes).
- They rely on user feedback loops and lack high levels of autonomy.
- While examples like Devin, Operator, and OpenAI's Deep Research exist, they are still far from fully autonomous agents.

"We might think that waiting for better models will solve everything. But it's also worth revisiting the classical definition of reinforcement learning. An agent is a system that interacts with its environment and progressively improves toward a goal."

3. Core Concepts of Reinforcement Learning

Defining reinforcement learning: RL is a learning paradigm in which an agent improves progressively by interacting with an environment in pursuit of a goal. It is not merely learning from a fixed dataset: the agent acts, receives feedback, adjusts, and repeats, so performance improves through iterative feedback loops.
Explore and Exploit:
- Exploration: Trying new approaches to discover which strategies are effective.
- Exploitation: Using effective strategies more frequently while reducing reliance on ineffective ones.
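
The explore/exploit trade-off above can be illustrated with a small sketch. This is an epsilon-greedy agent on a toy multi-armed bandit, a standard textbook setting rather than anything shown in the talk. The function name, the reward means, and the `epsilon` value are all assumptions chosen for illustration.

```python
import random

def epsilon_greedy_bandit(true_means, steps=2000, epsilon=0.1, seed=0):
    """Run an epsilon-greedy agent on a toy Gaussian multi-armed bandit.

    Each "arm" stands in for one strategy; true_means are hypothetical
    average rewards that the agent does not know in advance.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # how often each strategy was tried
    estimates = [0.0] * n_arms   # running mean reward per strategy
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a strategy at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: use the strategy that currently looks best.
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental update of the running mean.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates

counts, estimates = epsilon_greedy_bandit([0.0, 0.5, 2.0])
print(counts)
```

With a small `epsilon`, most pulls go to whichever strategy has the highest estimated reward, while the occasional random pull keeps the other estimates from going stale.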

"The essence of reinforcement learning is identifying 'good strategies' and using them to solve problems."

4. Recent Examples and Possibilities in Reinforcement Learning

DeepSeek's R1 Model:
- The R1 model learned a long chain of thought through reinforcement learning.
- This was not the result of humans manually providing data — it was learned by the model itself.
- The R1 model received feedback by checking whether it answered questions correctly, and through that process it progressively improved.
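
As a rough sketch of what "checking whether it answered questions correctly" can look like in code, the snippet below scores sampled completions with a binary reward. The answer-extraction rule (take the last integer in the text) and the function names are assumptions made for illustration; they are not taken from the talk or from DeepSeek's implementation.

```python
import re

def extract_final_answer(completion: str):
    """Return the last integer that appears in the completion, or None."""
    matches = re.findall(r"-?\d+", completion)
    return int(matches[-1]) if matches else None

def correctness_reward(completion: str, expected: int) -> float:
    """Binary reward: 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if extract_final_answer(completion) == expected else 0.0

# Hypothetical completions sampled for the prompt "What is 17 + 25?"
samples = [
    "17 + 25: 17 + 20 = 37, 37 + 5 = 42. Answer: 42",
    "17 + 25 is roughly 40. Answer: 40",
    "I am not sure.",
]
rewards = [correctness_reward(s, expected=42) for s in samples]
print(rewards)  # [1.0, 0.0, 0.0]
```

An RL training loop would then make completions like the first one more likely and the others less likely; that update step is omitted here.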

"The long chain of thought in the R1 model was not manually programmed — it emerged as a byproduct of strategies the model learned on its own through reinforcement learning."

The Revival of Open Source:
- Following the R1 project, the open-source community has been working to replicate it or distill it into smaller models.
- This suggests that reinforcement learning can teach models new capabilities from feedback on their own attempts, even without large-scale manually curated data.

5. Connecting Reinforcement Learning and Agents

Integrating RL with agents:
- Reinforcement learning is drawing attention as a core technology for enabling agents to achieve higher autonomy.
- For example, agents can use RL to learn capabilities such as calling multiple tools or browsing the internet to handle complex tasks.
Current limitations:
- While RL is useful for learning new capabilities, it is not a silver bullet that solves every problem.
- For instance, it still struggles with repetitive computational tasks and out-of-distribution problems.
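
To make the tool-calling setting above more concrete, here is a minimal sketch of an episodic environment in which an agent calls tools over several turns and receives a reward only at the end. Everything in it is hypothetical: the `ToolEnv` class, the `lookup` and `add` tools, and the scripted policy standing in for a model. It shows only the interaction loop, not the learning update.

```python
from dataclasses import dataclass, field

# Hypothetical tools the agent may call.
TOOLS = {
    "lookup": lambda arg: {"apples": "3", "pears": "4"}.get(arg, "unknown"),
    "add": lambda arg: str(sum(int(x) for x in arg.split("+"))),
}

@dataclass
class ToolEnv:
    """Toy multi-turn environment; the correct final answer is '7'."""
    target: str = "7"
    max_turns: int = 5
    history: list = field(default_factory=list)

    def reset(self):
        self.history = []
        return "How many apples and pears are there in total?"

    def step(self, action):
        """action is ('tool', name, arg) or ('answer', text).

        Returns (observation, reward, done).
        """
        self.history.append(action)
        if action[0] == "answer":
            return "", (1.0 if action[1] == self.target else 0.0), True
        if len(self.history) >= self.max_turns:
            return "", 0.0, True  # ran out of turns without answering
        _, name, arg = action
        tool = TOOLS.get(name)
        return (tool(arg) if tool else "unknown tool"), 0.0, False

def scripted_policy(observations):
    """Stand-in for a learned policy: look up both counts, add, answer."""
    turn = len(observations) - 1
    if turn == 0:
        return ("tool", "lookup", "apples")
    if turn == 1:
        return ("tool", "lookup", "pears")
    if turn == 2:
        return ("tool", "add", f"{observations[1]}+{observations[2]}")
    return ("answer", observations[3])

def run_episode(env, policy):
    """Roll out one episode and return the total reward."""
    observations = [env.reset()]
    total, done = 0.0, False
    while not done:
        obs, reward, done = env.step(policy(observations))
        observations.append(obs)
        total += reward
    return total

episode_reward = run_episode(ToolEnv(), scripted_policy)
print(episode_reward)  # 1.0
```

In an RL setup, the scripted policy would be replaced by a model whose tool-calling behavior is adjusted based on the episode reward.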

"Reinforcement learning is an important key to new capabilities and greater autonomy, but we have not yet reached the stage of AGI capable of solving every problem."

6. Environment and Reward Design for Reinforcement Learning

Rubric Engineering:
- A critical element of RL is reward design. Rather than simply asking "did the model get the right answer?", we must establish diverse criteria to guide model learning.
- Examples:
  - Did it follow an XML structure?
  - Did it produce an integer-typed answer?
  - Did it solve the problem through a longer chain of thought?
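
The rubric criteria listed above could be expressed as small reward functions that are combined into a single score. The sketch below is one possible shape under stated assumptions: the `<reasoning>` and `<answer>` tag names, the weights, and the target reasoning length are illustrative choices, not values given in the talk.

```python
import re

def _tag(text, name):
    """Return the stripped contents of <name>...</name>, or None if absent."""
    m = re.search(rf"<{name}>(.*?)</{name}>", text, flags=re.DOTALL)
    return m.group(1).strip() if m else None

def xml_format_reward(text):
    """1.0 if the output is exactly a reasoning block followed by an answer block."""
    pattern = r"\s*<reasoning>.*?</reasoning>\s*<answer>.*?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, text, flags=re.DOTALL) else 0.0

def integer_reward(text):
    """1.0 if the answer block contains only an integer."""
    answer = _tag(text, "answer")
    return 1.0 if answer is not None and re.fullmatch(r"-?\d+", answer) else 0.0

def length_reward(text, target_words=10):
    """Partial credit for longer reasoning, capped at 1.0."""
    reasoning = _tag(text, "reasoning") or ""
    return min(len(reasoning.split()) / target_words, 1.0)

def correctness_reward(text, expected):
    """1.0 if the answer block matches the reference answer."""
    return 1.0 if _tag(text, "answer") == str(expected) else 0.0

# Assumed weights: correctness dominates, the other criteria shape behavior.
WEIGHTS = {"correct": 2.0, "format": 0.5, "integer": 0.5, "length": 0.5}

def rubric_score(text, expected):
    parts = {
        "correct": correctness_reward(text, expected),
        "format": xml_format_reward(text),
        "integer": integer_reward(text),
        "length": length_reward(text),
    }
    return sum(WEIGHTS[name] * value for name, value in parts.items())

good = "<reasoning>6 times 7 is 42.</reasoning>\n<answer>42</answer>"
print(rubric_score(good, 42))  # 3.25
```

Splitting the rubric into separate functions gives the model partial credit for getting the format right before it gets the answer right, which can provide a learning signal early in training.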

"Rubric engineering is a creative process of designing rewards so that the model can learn on its own."

The Risk of Reward Hacking:
- Because models try to maximize rewards, they may learn behaviors unrelated to the actual goal. Rewards must therefore be designed to faithfully reflect the true objective.
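
A toy illustration of this risk, under assumed reward definitions: if the reward only measures length, a padded wrong answer outscores a short correct one. Gating any length bonus on correctness is one possible mitigation, shown here as a sketch and not as a method prescribed in the talk. The function names and the "Answer: N" output convention are hypothetical.

```python
def length_only_reward(completion: str) -> float:
    """Naive reward: more words, more reward. Easy to game by padding."""
    return float(len(completion.split()))

def gated_reward(completion: str, expected: int) -> float:
    """Reward correctness first; allow only a small, capped length bonus."""
    correct = completion.strip().endswith(f"Answer: {expected}")
    if not correct:
        return 0.0
    return 1.0 + min(len(completion.split()) / 100, 0.5)

honest = "6 * 7 = 42. Answer: 42"
padded = "Let me think. " * 30 + "Answer: 41"  # long, but wrong

best_by_length = max([honest, padded], key=length_only_reward)
best_by_gated = max([honest, padded], key=lambda c: gated_reward(c, 42))

print(best_by_length == padded)  # True: the naive reward prefers padding
print(best_by_gated == honest)   # True: the gated reward prefers the correct answer
```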

7. The Future of Reinforcement Learning and AI Engineering

How AI engineering is changing:
- As RL becomes a core technology in agent development, environment design and reward design will play an increasingly important role.
- Existing skills — prompt engineering, evaluation systems, monitoring tools — will remain important.
Future possibilities:
- Reinforcement learning opens up the possibility of developing truly autonomous agents.
- However, this still requires a great deal of research and experimentation, and there are many challenges to address: cost, generalization, and reward design, among others.

"We need to imagine and prepare for a world where reinforcement learning has become a core technology in agent development."

8. Conclusion

Will Brown emphasizes that reinforcement learning is a critical technology that will open up the future of agent development, enabling systems with higher autonomy and the ability to handle complex tasks. However, we are still in the early stages, and much experimentation and creative thinking will be required.

"Reinforcement learning opens new possibilities for agent development, but realizing that potential requires us to think more deeply about how we design environments and rewards."

This talk offers valuable insight into the present and future of reinforcement learning and agents, providing a strong foundation for preparing for the coming era of AI.