
Brief Summary: This article walks through the main control methods in model-based reinforcement learning, covering environment simulation, end-to-end learning through models, the distinction between background and decision-time planning, and representative algorithms (e.g., Dyna-Q, MCTS, trajectory optimization) with practical examples. It also contrasts continuous and discrete control, discusses their limitations, and surveys representative remedies (the shooting method, collocation, CEM, etc.) from a practical perspective. The result is a clear view of how theory connects to actual implementation in model-based control RL.
1. Representative Algorithms and Approaches in Model-Based RL
In model-based reinforcement learning, the agent learns the structure of the environment and improves its actions using the learned environment model. This tutorial covers the diverse algorithms and categories below.

There are three representative ways to utilize models.

2. Environment Simulation and Experience Data Augmentation
A learned model can generate synthetic experience that is mixed with data from the actual environment to assist reinforcement learning. This artificially enlarges the dataset, making learning more sample-efficient.

- Dyna-Q: A representative example of this approach, which applies Q-learning updates to both real transitions and virtual transitions generated from the model.
"Dyna performs Q-learning updates on real transitions, and uses the model to generate virtual transitions for the same Q-learning updates. In the end, it simply comes down to generating more experience."
For policy learning as well, the model can roll out multiple virtual steps rather than a single step, enabling more policy updates per real transition.
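To make this concrete, here is a minimal tabular Dyna-Q sketch on a toy 5-state chain (an illustrative example with assumed hyperparameters, not from the article): Q-learning on each real transition, plus extra Q-learning updates on virtual transitions replayed from a learned deterministic model.

```python
import random

N_STATES, ACTIONS = 5, (0, 1)          # chain of states 0..4; reward on reaching 4
ALPHA, GAMMA, EPS, PLAN_STEPS = 0.1, 0.95, 0.2, 10

def env_step(s, a):                    # the real environment (unknown to the agent)
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0), s2 == N_STATES - 1

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
model = {}                             # learned model: (s, a) -> (s', r)

def q_update(s, a, r, s2):
    Q[(s, a)] += ALPHA * (r + GAMMA * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])

random.seed(0)
for _ in range(100):                   # episodes, each from a random start state
    s, done, steps = random.randrange(N_STATES - 1), False, 0
    while not done and steps < 100:
        a = random.choice(ACTIONS) if random.random() < EPS else max(ACTIONS, key=lambda b: Q[(s, b)])
        s2, r, done = env_step(s, a)
        q_update(s, a, r, s2)          # learn from the real transition
        model[(s, a)] = (s2, r)        # record the transition in the model
        for _ in range(PLAN_STEPS):    # planning: replay virtual transitions
            ps, pa = random.choice(list(model))
            ps2, pr = model[(ps, pa)]
            q_update(ps, pa, pr, ps2)
        s, steps = s2, steps + 1

policy = [max(ACTIONS, key=lambda b: Q[(s, b)]) for s in range(N_STATES - 1)]
```

After training, the greedy policy should move right (action 1) from every non-terminal state; the planning loop is exactly the "more experience" the quote above describes.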

3. End-to-End Learning Support and Policy Gradients Through Models
There are also methods that fully incorporate the model itself into the RL learning loop. The key methodology here is end-to-end training.
"End-to-end supervised learning has been very successful in computer vision, NLP, etc. — can the same approach work for RL?"
For example, in policy gradient methods, the policy parameters are adjusted to maximize the sum of future rewards. Traditional sampling-based methods (REINFORCE, etc.) suffer from high variance.

With an accurate and smooth (differentiable) model, deterministic policy gradients can be obtained without bias.
"The policy gradient we obtain is actually a deterministic value, and there is no variance in it."
Advantages:
- Policy gradient values are "exact" with no variance.
- Useful for solving long-term credit assignment problems.
Disadvantages:
- Can get stuck in local minima
- Gradient vanishing, explosion, and other conditioning problems

Real trajectories are safer but less sample-efficient, while model-generated trajectories allow diverse policy changes but are more vulnerable to model errors.
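To make the deterministic policy gradient concrete, here is a hedged toy sketch (not from the article): a known differentiable model s' = s + a with a linear policy a = theta*s, where the gradient of the return with respect to theta is computed exactly by backpropagating through the unrolled model. Gradient clipping is used to tame the conditioning problems noted above; all constants are assumptions for illustration.

```python
H, S0, LR = 10, 1.0, 0.02

def return_and_grad(theta):
    states = [S0]
    for _ in range(H):                      # forward: unroll the model
        states.append(states[-1] + theta * states[-1])   # s' = s + theta*s
    J = -sum(s * s for s in states[1:])     # return with per-step reward -s^2
    dJ_ds, dJ_dtheta = 0.0, 0.0
    for s_next, s in zip(reversed(states[1:]), reversed(states[:-1])):
        dJ_ds += -2.0 * s_next              # reward gradient at this step
        dJ_dtheta += dJ_ds * s              # d s_next / d theta = s
        dJ_ds *= 1.0 + theta                # d s_next / d s = 1 + theta
    return J, dJ_dtheta

theta = 0.5
for _ in range(200):                        # gradient ascent on the return;
    _, g = return_and_grad(theta)           # clipping guards against the huge
    theta += LR * max(-1.0, min(1.0, g))    # gradients of long model unrolls
```

Given theta, the gradient is a single deterministic number (no sampling, no variance), and ascent converges to the optimum theta = -1, the policy that drives every state to zero.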
4. Background Planning vs. Decision-Time Planning
There are two representative modes of planning in model-based RL: background planning and decision-time planning.

- Background Planning: Focuses on "acquiring" habit/policy/value network parameters across all situations for fast decision-making (habit formation)
- Decision-Time Planning: An improvisational, slower (computation-heavy) deliberation that searches for an "immediate action sequence" from the current state
"Decision-time planning allows us to act without having learned anything at all. As long as you have a model, it's just an optimization problem."
Necessity and Differences Between the Two Approaches
Advantages of decision-time planning:
- Reflects the most recent state, so it can respond to the situation at hand
- Remains usable when the learned policy/value networks are uncertain
- Relieves the burden of state representation design (e.g., pixels, angles)
- Independent of the observation space
On the other hand:
- Decision-time planning has limitations under partial observability and in computational efficiency
- Background planning has advantages in predictability, consistency, and unified handling of discrete/continuous actions

The two approaches can also be combined or mixed.
5. Practical Implementation of Discrete and Continuous Planning
There are differences in implementation approaches and issues between discrete and continuous actions.
In background planning, since the policy's action distribution is handled probabilistically, discrete and continuous actions can be processed without major differences.
- Utilizing reparameterization techniques like Gumbel-Softmax (hard/soft action sampling)
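As an illustration, a NumPy sketch of Gumbel-Softmax sampling (a hedged example, not from the article; the temperature and probabilities are assumptions) might look like:

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, hard=False, rng=None):
    """Draw one relaxed (soft) or one-hot (hard) categorical sample."""
    rng = rng if rng is not None else np.random.default_rng()
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0,1) noise
    y = (logits + gumbel) / tau
    soft = np.exp(y - y.max())
    soft /= soft.sum()                      # differentiable "soft" sample
    if hard:
        one_hot = np.zeros_like(soft)
        one_hot[np.argmax(soft)] = 1.0      # discrete "hard" sample; autograd
        return one_hot                      # frameworks pass gradients via soft
    return soft

rng = np.random.default_rng(0)
logits = np.log(np.array([0.7, 0.2, 0.1]))
soft = gumbel_softmax(logits, tau=0.5, rng=rng)
hard = gumbel_softmax(logits, hard=True, rng=rng)
```

The soft sample is a point on the simplex (useful during training), while the hard sample is one-hot; in autograd frameworks the hard variant is typically combined with a straight-through gradient.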
Decision-time planning applies specialized algorithms:
- Discrete: MCTS (Monte Carlo Tree Search)
- Continuous: Trajectory optimization
MCTS (Monte Carlo Tree Search, used in AlphaGo/AlphaZero, etc.)
1. Initialization: Initialize Q-values and visit counts for all state/action pairs under consideration
2. Expansion: Expand nodes from the current state, selecting actions according to the search policy
3. Evaluation: Estimate the value of the reached node through a Monte-Carlo rollout
4. Backup: Propagate Q-values back to parent nodes
5. Repeat: Repeat steps 2-4
"MCTS tracks Q-values — long-term rewards — for all states and actions being considered."
Trajectory Optimization (Continuous Planning)
Finds an optimal trajectory by directly optimizing a single action sequence. Key steps:
1. Initialization: Start from an initial guess for the action sequence
2. Expansion: Execute the assumed action sequence in the model and observe the resulting states
3. Evaluation: Calculate the rewards along the trajectory
4. Backpropagation: Compute gradients step by step using the derivatives of the reward and transition models
5. Update: Move the action sequence along the gradient and repeat

The continuous approach is also called the shooting method.
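A minimal shooting-method sketch on a toy problem (illustrative assumptions, not from the article): optimize a 5-step action sequence for the model s' = s + a so the final state reaches a target, by repeatedly unrolling and descending the gradient of a terminal cost with a small action penalty.

```python
H, S0, TARGET, LAM, LR = 5, 0.0, 2.0, 0.1, 0.05

def unroll(actions):                    # expansion: execute the action sequence
    states = [S0]
    for a in actions:
        states.append(states[-1] + a)   # transition model s' = s + a
    return states

actions = [0.0] * H                     # initialization: an initial guess
for _ in range(500):
    final_state = unroll(actions)[-1]
    err = final_state - TARGET          # evaluation: cost (s_H - target)^2 + penalty
    grads = [2.0 * err + 2.0 * LAM * a for a in actions]    # backpropagation
    actions = [a - LR * g for a, g in zip(actions, grads)]  # update, then repeat

final = unroll(actions)[-1]
```

Because the final state depends on every earlier action, each action's gradient carries the full terminal error; the action penalty LAM pulls the solution slightly short of the target.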
6. Limitations and Mitigation Strategies for Continuous Planning
Continuous planning methods have various practical limitations, and mitigation strategies have been developed for each.
1) Sensitivity and Poor Conditioning
The shooting method has the chronic problem that very small changes in the initial actions can have enormous effects on the entire trajectory. This is similar to gradient explosion/vanishing in RNNs, but in RL the transition function is fixed by the environment, so we cannot reshape it to improve the conditioning of the optimization.
"Each state implicitly depends on all previous actions."
Collocation approach: Introduces the states themselves as optimization variables, optimizing states independently rather than actions alone. This formulation is better conditioned, which simplifies the optimization, and it is widely used in some robotics applications.

2) Local Optimum Problem
Simple gradient descent can easily get stuck in local optima. To address this, sampling-based methods (CEM, etc.) are introduced.
- CEM (Cross-Entropy Method): Sample multiple trajectories, update the mean and variance of the top trajectories, avoid local optima
"It looks very simple, but this method works surprisingly well and guarantees good performance."
An additional advantage is that the search space is much smaller than the policy parameter space.
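A minimal CEM sketch (illustrative assumptions, not from the article): plan a 5-step action sequence for a toy model s' = s + a by sampling candidate sequences, keeping the elite fraction, and refitting the sampling distribution's mean and variance.

```python
import numpy as np

H, POP, N_ELITE, ITERS, TARGET = 5, 200, 20, 30, 2.0

def cost(actions):
    s = 0.0
    for a in actions:
        s += a                          # toy transition model s' = s + a
    return (s - TARGET) ** 2            # terminal cost: distance to target

rng = np.random.default_rng(0)
mean, std = np.zeros(H), np.ones(H)
for _ in range(ITERS):
    samples = rng.normal(mean, std, size=(POP, H))      # sample trajectories
    order = np.argsort([cost(seq) for seq in samples])
    elites = samples[order[:N_ELITE]]                   # keep the top trajectories
    mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-3  # refit mean/variance
best_cost = cost(mean)
```

No gradients are needed, and the search space here is just the H-dimensional action sequence rather than a full policy parameter vector; the small floor added to std keeps the distribution from collapsing prematurely.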
3) Slow Convergence
Simple gradient descent can make the planning process take too long (potentially requiring tens of millions of iterations). To address this, higher-order optimization techniques such as Newton's method and iLQR are applied to accelerate convergence.
- Treating the problem as an LQR (Linear-Quadratic Regulator) by linearly approximating the dynamics and quadratically approximating the reward, which simplifies computation
- iLQR (Iterative LQR): Repeatedly approximating with LQR around the current solution and updating
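A minimal finite-horizon LQR sketch for the scalar case (illustrative constants, not from the article): the backward Riccati recursion that iLQR re-applies around the current solution after each linear/quadratic approximation, followed by a forward pass with the resulting feedback gains.

```python
A, B, Qc, Rc, H = 1.2, 1.0, 1.0, 0.1, 20   # dynamics x' = A x + B u; cost q x^2 + r u^2

P, gains = Qc, []                           # terminal value: V_H(x) = Qc * x^2
for _ in range(H):                          # backward pass over the horizon
    K = (B * A * P) / (Rc + B * B * P)      # optimal feedback gain, u = -K x
    P = Qc + A * P * (A - B * K)            # Riccati update of the value quadratic
    gains.append(K)
gains.reverse()                             # gains[t] is the gain for time step t

x, total_cost = 1.0, 0.0                    # forward pass from x_0 = 1
for K in gains:
    u = -K * x
    total_cost += Qc * x * x + Rc * u * u
    x = A * x + B * u                       # unstable open loop (A > 1) is stabilized
```

Because the value function stays exactly quadratic under linear dynamics, each backward step is closed-form; iLQR repeats this pair of passes around the latest trajectory until the action sequence converges.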
7. References and Related Links
- Tutorial original text: https://kargarisaac.github.io/blog/reinforcement%20learning/machine%20learning/deep%20learning/2020/10/26/mbrl.html
Conclusion
The various control strategies and algorithms in model-based reinforcement learning improve data efficiency and generalization, and enable control tailored to specific situations. Understanding each method's strengths and weaknesses, along with the practical issues that arise in using them, is the key to applying model-based control properly. Strategic selection matched to the situation and problem characteristics is the real power of model-based RL!