ICML Tutorial on Model-Based Reinforcement Learning Summary: 2. Model-Based Control

Brief Summary: This article walks through the control methods of model-based reinforcement learning in roughly chronological order: environment simulation, end-to-end learning, the distinction between background and decision-time planning, and representative algorithms (e.g., Dyna-Q, MCTS, trajectory optimization) with practical examples. It also compares continuous and discrete control, their limitations, and representative remedies (shooting methods, collocation, CEM, etc.) from a practical perspective, giving a clear picture of how theory connects to actual implementation in model-based control.


1. Representative Algorithms and Approaches in Model-Based RL

In model-based reinforcement learning, the structure of the environment is learned, and actions are improved based on the predicted environment model. This tutorial covers the following diverse algorithms and categories.

Algorithm Categories

There are three representative ways to utilize models.

Three Model Utilization Methods


2. Environment Simulation and Experience Data Augmentation

Model-generated experience can be mixed with data from the actual environment to assist reinforcement learning. This artificially enlarges the dataset, making learning more sample-efficient.

Simulation and Mixed Experience

  • Dyna-Q: A representative example of this approach that uses both real transitions and virtual transitions generated from the model.

"Dyna performs Q-learning updates on real transitions, and the model generates virtual transitions that feed the same Q-learning updates. In the end, the effect is simply to increase the amount of experience."

For policy learning as well, the model can generate multi-step virtual rollouts rather than single transitions, enabling more policy updates per real sample.

Multi-step Updates
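The Dyna-Q loop can be sketched in a few lines. The chain environment, constants, and random behavior policy below are assumptions chosen for illustration (Dyna-Q normally explores epsilon-greedily); the essential pattern is one real Q-learning update followed by K model-generated "planning" updates per real step.

```python
import random

# Dyna-Q sketch on a toy chain MDP (environment and constants are assumed
# for illustration). Action 1 moves right, action 0 moves left; reward 1.0
# on reaching the last state, after which the episode resets.
ALPHA, GAMMA, K = 0.5, 0.9, 20
N_STATES, ACTIONS = 4, (0, 1)

def env_step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s2, 1.0 if (a == 1 and s2 == N_STATES - 1) else 0.0

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
model = {}                               # (s, a) -> (s', r), learned on the fly

def q_update(s, a, r, s2):               # same rule for real and virtual data
    target = r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

random.seed(0)
s = 0
for _ in range(200):
    a = random.choice(ACTIONS)           # random exploration, for brevity
    s2, r = env_step(s, a)
    q_update(s, a, r, s2)                # one update on the real transition
    model[(s, a)] = (s2, r)              # record the transition in the model
    for _ in range(K):                   # K planning updates from the model
        ps, pa = random.choice(list(model))
        ps2, pr = model[(ps, pa)]
        q_update(ps, pa, pr, ps2)
    s = 0 if s2 == N_STATES - 1 else s2
```

With the planning updates, the value of moving right propagates back along the chain from just 200 real steps.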


3. End-to-End Learning Support and Policy Gradients Through Models

There are also methods that fully incorporate the model itself into the RL learning loop. The key methodology here is end-to-end training.

"End-to-end supervised learning has been very successful in computer vision, NLP, etc. — can the same approach work for RL?"

For example, in policy gradient methods, the policy parameters are adjusted to maximize the expected sum of future rewards. Traditional sampling-based estimators (REINFORCE, etc.) suffer from high variance.

Policy Gradient and Gradient Ascent

With an accurate and smooth (differentiable) model, the policy gradient can be computed by differentiating through the model, yielding a deterministic, unbiased value.

"The policy gradient we obtain is actually a deterministic value, and there is no variance in it."
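A tiny sketch makes the zero-variance property concrete. The dynamics, policy, and reward here are toy assumptions, and finite differences stand in for backpropagation through the model: because the rollout is a deterministic, differentiable function of the policy parameter, its gradient carries no sampling noise.

```python
# Policy gradient through a known differentiable model (toy assumptions):
# dynamics s' = s + a, deterministic policy a = theta * s, reward -s^2.
# The H-step return is a deterministic function of theta, so its gradient
# is exact (zero variance), unlike REINFORCE-style sampled estimates.

def rollout_return(theta, s0=1.0, horizon=5):
    s, ret = s0, 0.0
    for _ in range(horizon):
        a = theta * s              # deterministic policy
        s = s + a                  # differentiable model step
        ret += -s * s              # reward
    return ret

def exact_grad(theta, eps=1e-6):   # finite differences stand in for backprop
    return (rollout_return(theta + eps)
            - rollout_return(theta - eps)) / (2 * eps)

theta = -0.5
for _ in range(300):               # deterministic gradient ascent on the return
    theta += 0.1 * exact_grad(theta)
# the return -sum((1 + theta)**(2*t)) is maximized at theta = -1
```

Every run produces the same gradient and the same optimization path, which is exactly the point of the quote above.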

Advantages:

  • Policy gradient values are "exact" with no variance.
  • Useful for solving long-term credit assignment problems.

Disadvantages:

  • Can get stuck in local minima
  • Gradient vanishing, explosion, and other conditioning problems

Real vs. Model-Generated Trajectories

Real trajectories are safer but less sample-efficient, while model-generated trajectories allow diverse policy changes but are more vulnerable to model errors.


4. Background Planning vs. Decision-Time Planning

Model-based RL draws one representative distinction between two modes of planning: background planning vs. decision-time planning.

Two Planning Approaches

  • Background Planning: Focuses on "acquiring" habit/policy/value network parameters across all situations for fast decision-making (habit formation)

  • Decision-Time Planning: An improvisational, slower (computation-heavy) thinking approach that finds "immediate action sequences" from the current state

"Decision-time planning allows us to act without having learned anything at all. As long as you have a model, it's just an optimization problem."

Necessity and Differences Between the Two Approaches

  • Reflects the most recent state and adapts to it on the spot (decision-time planning)
  • Adaptability when learned policy/value networks are uncertain
  • Relieves the burden of state representation design (e.g., pixels, angles)
  • Observation space independence

On the other hand:

  • Decision-time planning has limitations in partial observability and computational efficiency
  • Background planning has advantages in predictability, consistency, and unified discrete/continuous action handling

Pros and Cons Comparison

  • The two approaches can also be combined or mixed.

5. Practical Implementation of Discrete and Continuous Planning

There are differences in implementation approaches and issues between discrete and continuous actions.

In background planning, the policy is represented as a probability distribution, so discrete and continuous actions can be handled without major differences.

  • Reparameterization techniques such as Gumbel-Softmax allow hard or soft sampling of discrete actions while keeping gradients
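A framework-free sketch of the idea (plain Python, illustrative only): adding Gumbel noise to the logits and taking the argmax draws an exact categorical sample (the Gumbel-max trick), while replacing the argmax with a temperature-scaled softmax gives the differentiable "soft" relaxation.

```python
import math, random

# Gumbel-Softmax sketch without an autodiff framework (illustrative only).

def gumbel():
    # Sample standard Gumbel noise via inverse transform.
    return -math.log(-math.log(random.random()))

def hard_sample(logits):
    # Gumbel-max trick: argmax of noisy logits is an exact categorical sample.
    z = [l + gumbel() for l in logits]
    return z.index(max(z))

def soft_sample(logits, tau=0.5):
    # Soft relaxation: temperature-scaled softmax of the same noisy logits,
    # producing a relaxed one-hot vector that gradients can flow through.
    z = [(l + gumbel()) / tau for l in logits]
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]
```

At low temperature `tau` the soft sample approaches a one-hot vector; the hard sample's frequencies match the softmax of the logits.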

Decision-time planning applies specialized algorithms:

  • Discrete: MCTS (Monte Carlo Tree Search)
  • Continuous: Trajectory optimization

MCTS (Monte Carlo Tree Search, used in AlphaGo/AlphaZero, etc.)

  1. Initialization: Initialize Q-values and visit counts for all state/action pairs
  2. Expansion: Expand nodes from the current state, selecting actions according to the search policy
  3. Evaluation: Estimate value through Monte-Carlo rollout at the reached node
  4. Backup: Propagate Q-values to parent nodes
  5. Repeat: Repeat steps 2-4

"MCTS tracks Q-values — long-term rewards — for all states and actions being considered."
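The five steps above can be condensed into a short sketch. The chain environment, discount, and constants are toy assumptions, and the selection rule is plain UCB1; AlphaGo/AlphaZero additionally bias selection and evaluation with a learned policy/value network.

```python
import math, random

# Minimal MCTS (UCT) sketch on an assumed toy chain MDP: action 1 moves
# right, reward 1.0 on first reaching the last state, which is absorbing.
N, ACTIONS, GAMMA = 3, (0, 1), 0.9

def step(s, a):                          # the (known) model
    if s == N:                           # absorbing terminal state
        return s, 0.0
    s2 = min(s + 1, N) if a == 1 else s
    return s2, 1.0 if s2 == N else 0.0

def rollout(s, depth=5):                 # 3) evaluation by Monte-Carlo rollout
    ret, g = 0.0, 1.0
    for _ in range(depth):
        s, r = step(s, random.choice(ACTIONS))
        ret += g * r
        g *= GAMMA
    return ret

Q, visits = {}, {}                       # 1) Q-values and visit counts

def select(s):                           # UCB1 over the actions at s
    total = sum(visits.get((s, a), 0) for a in ACTIONS) + 1
    def score(a):
        n = visits.get((s, a), 0)
        return float("inf") if n == 0 else \
            Q[(s, a)] + math.sqrt(2 * math.log(total) / n)
    return max(ACTIONS, key=score)

def simulate(s, depth=5):                # one expansion/evaluation/backup pass
    if depth == 0 or s == N:
        return 0.0
    a = select(s)
    s2, r = step(s, a)
    if (s, a) not in visits:             # 2) expand a new node, evaluate leaf
        visits[(s, a)] = 1
        Q[(s, a)] = r + GAMMA * rollout(s2)
        return Q[(s, a)]
    value = r + GAMMA * simulate(s2, depth - 1)
    visits[(s, a)] += 1                  # 4) backup: running-mean Q update
    Q[(s, a)] += (value - Q[(s, a)]) / visits[(s, a)]
    return value

random.seed(0)
for _ in range(500):                     # 5) repeat from the current state
    simulate(0)
```

Note that Q-values and visit counts are tracked per (state, action) pair, exactly as the quote describes.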

Trajectory Optimization (Continuous Planning)

Trajectory optimization searches for the optimal trajectory by directly optimizing a single action sequence. Key steps:

  1. Initialization: Assume an action sequence based on an initial guess
  2. Expansion: Execute the assumed action sequence and observe state changes
  3. Evaluation: Calculate rewards
  4. Backpropagation: Compute gradients step-by-step using reward and transition model derivatives
  5. Update: Update the action sequence along the gradient, repeat

Continuous vs. Discrete Summary

This style of continuous planning, which optimizes only the action sequence and rolls the states out from it, is also called the shooting method.
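Under a simple assumed model, the five steps above reduce to gradient ascent on an action sequence. The dynamics s' = s + a and the quadratic reward are toy assumptions, and finite differences stand in for the analytic model derivatives mentioned in the backpropagation step.

```python
# Shooting-method sketch (toy assumptions): optimize an open-loop action
# sequence by gradient ascent on the total reward of its rollout.
# Dynamics: s' = s + a; reward: -(s - goal)^2 - 0.01 * a^2.
GOAL, H = 3.0, 4

def total_reward(actions, s0=0.0):
    s, ret = s0, 0.0
    for a in actions:                           # 2) roll out the sequence
        s = s + a                               #    model step
        ret += -(s - GOAL) ** 2 - 0.01 * a * a  # 3) evaluate rewards
    return ret

def grad(actions, eps=1e-5):                    # 4) finite-difference gradients
    g = []                                      #    (stand-in for derivatives)
    for i in range(len(actions)):
        up, dn = list(actions), list(actions)
        up[i] += eps
        dn[i] -= eps
        g.append((total_reward(up) - total_reward(dn)) / (2 * eps))
    return g

actions = [0.0] * H                             # 1) initial guess
for _ in range(500):                            # 5) update and repeat
    actions = [a + 0.05 * g for a, g in zip(actions, grad(actions))]
```

Note that states never appear as variables: they are produced by rolling the actions through the model, which is precisely what makes shooting sensitive to early actions.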


6. Limitations and Mitigation Strategies for Continuous Planning

Continuous planning methods have various practical limitations, and mitigation strategies have been developed for each.

1) Sensitivity and Poor Conditioning

The shooting method has the chronic problem that very small changes in early actions can have enormous effects on the rest of the trajectory. This is similar to gradient explosion/vanishing in RNNs, but worse: in RL the transition function is fixed by the environment, so unlike an RNN's weights it cannot be adjusted to improve the conditioning.

"Each state implicitly depends on all previous actions."

Collocation approach: Introduces optimization variables for the states themselves, so that states are optimized alongside actions, with the dynamics imposed as constraints between them. This formulation has better conditioning, simplifying optimization, and is widely used in some robotics applications.

Collocation Method Illustration
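A minimal sketch of the contrast (toy assumptions throughout; the dynamics constraint is enforced here with a quadratic penalty rather than a proper constrained solver): states and actions are both free variables, and the optimizer only needs to keep each local dynamics residual small instead of differentiating through a long rollout.

```python
# Collocation-style sketch (toy assumptions): optimize state variables AND
# actions jointly, with the dynamics s_{t+1} = s_t + a_t enforced by a
# quadratic penalty. Reward: -(s - goal)^2 on each state.
GOAL, H, LAM = 3.0, 4, 10.0

def objective(z, s0=0.0):
    states, actions = z[:H], z[H:]
    obj, prev = 0.0, s0
    for t in range(H):
        obj += -(states[t] - GOAL) ** 2                       # state reward
        obj += -LAM * (states[t] - (prev + actions[t])) ** 2  # dynamics residual
        prev = states[t]
    return obj

def grad(z, eps=1e-5):                    # finite-difference gradient
    g = []
    for i in range(len(z)):
        up, dn = list(z), list(z)
        up[i] += eps
        dn[i] -= eps
        g.append((objective(up) - objective(dn)) / (2 * eps))
    return g

z = [0.0] * (2 * H)                       # states and actions, all free
for _ in range(5000):                     # plain gradient ascent
    z = [v + 0.01 * g for v, g in zip(z, grad(z))]
states = z[:H]
```

At the optimum the residuals vanish, so the recovered states satisfy the dynamics while each variable only interacts with its neighbors.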


2) Local Optimum Problem

Simple gradient descent can easily get stuck in local optima. To address this, sampling-based methods (CEM, etc.) are introduced.

  • CEM (Cross-Entropy Method): Sample many trajectories, refit the sampling distribution's mean and variance to the top-scoring ones, and repeat; the spread of samples helps avoid local optima

"It looks very simple, but this method works surprisingly well and guarantees good performance."

An additional advantage is that the search space is much smaller than the policy parameter space.
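The CEM loop above fits in a few lines. The dynamics, reward, and constants are toy assumptions; note that the search is over the H-dimensional action sequence, not over policy parameters.

```python
import random, statistics

# Cross-Entropy Method sketch (toy assumptions): plan H actions by sampling
# sequences from a diagonal Gaussian, keeping the elite fraction, and
# refitting the mean and std. Dynamics: s' = s + a; reward: -(s - goal)^2.
GOAL, H, POP, ELITE = 3.0, 4, 100, 10

def total_reward(actions, s0=0.0):
    s, ret = s0, 0.0
    for a in actions:
        s += a
        ret += -(s - GOAL) ** 2
    return ret

random.seed(0)
mean, std = [0.0] * H, [2.0] * H
for _ in range(30):
    pop = [[random.gauss(m, sd) for m, sd in zip(mean, std)]
           for _ in range(POP)]
    pop.sort(key=total_reward, reverse=True)
    elites = pop[:ELITE]                       # keep the top trajectories
    for t in range(H):                         # refit the sampling distribution
        col = [e[t] for e in elites]
        mean[t] = statistics.fmean(col)
        std[t] = statistics.stdev(col) + 1e-3  # floor keeps some exploration
```

No gradients are needed at all, which is why the method is robust to the local optima that trap gradient descent.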


3) Slow Convergence

Simple gradient descent can take too long for the planning process (requiring tens of millions of iterations). To solve this, higher-order optimization techniques like Newton's method and iLQR are applied to accelerate convergence.

  • Approximating the dynamics as linear and the reward as quadratic reduces the problem to an LQR (Linear-Quadratic Regulator), which is computationally simple to solve
  • iLQR (Iterative LQR): Repeatedly approximating with LQR around the current solution and updating
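For intuition, here is a one-dimensional finite-horizon LQR solved exactly by the backward Riccati recursion (the scalar dynamics and cost weights are toy assumptions). iLQR runs this same backward/forward pattern repeatedly on local linear-quadratic approximations of a nonlinear model.

```python
# 1-D finite-horizon LQR sketch (toy numbers): dynamics x' = a*x + b*u,
# cost sum of q*x^2 + r*u^2, solved by the backward Riccati recursion.
a, b, q, r, H = 1.0, 1.0, 1.0, 0.1, 20

P = q                                   # terminal cost-to-go weight
gains = []
for _ in range(H):                      # backward pass: Riccati recursion
    k = (b * P * a) / (r + b * P * b)   # optimal feedback gain, u = -k * x
    P = q + a * P * a - (a * P * b) ** 2 / (r + b * P * b)
    gains.append(k)
gains.reverse()                         # gains now ordered t = 0 .. H-1

x, cost = 5.0, 0.0                      # forward pass from x0 = 5
for k in gains:
    u = -k * x
    cost += q * x * x + r * u * u
    x = a * x + b * u
```

One backward sweep yields the exact optimal controller; no iterative gradient descent over the action sequence is needed, which is the speedup iLQR exploits at each approximation step.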


Conclusion

The various control strategies and algorithms of model-based reinforcement learning improve data efficiency and generalization, and enable control tailored to specific situations. Understanding each method's strengths, weaknesses, and practical pitfalls is the key to applying model-based control properly. Strategic selection matched to the problem's characteristics is the real power of model-based RL!
