
Brief Summary: This article walks through the main control methods in model-based reinforcement learning, covering environment simulation, end-to-end learning through models, the distinction between background and decision-time planning, and representative algorithms (e.g., Dyna-Q, MCTS, trajectory optimization) with practical examples. It also contrasts continuous and discrete control, discusses their limitations, and surveys representative remedies (the shooting method, collocation, CEM, etc.) from a practical perspective. The result is a clear view of how theory connects to actual implementation in model-based control RL.
1. Representative Algorithms and Approaches in Model-Based RL
In model-based reinforcement learning, the agent learns the structure of the environment and improves its actions using the learned environment model. This tutorial covers the diverse algorithms and categories below.

There are three representative ways to utilize models.

2. Environment Simulation and Experience Data Augmentation
A learned model can generate synthetic experience that is mixed with data from the actual environment to assist reinforcement learning. This artificially enlarges the dataset, making learning more sample-efficient.

- Dyna-Q: A representative example of this approach, which applies Q-learning updates to both real transitions and virtual transitions generated from the model.
"Dyna performs Q-learning updates on real transitions, and uses the model to generate virtual transitions for the same Q-learning updates. In the end, it simply comes down to generating more experience."
For policy learning as well, the model can roll out multiple virtual steps rather than a single step, enabling more policy updates per real transition.
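To make this concrete, here is a minimal tabular Dyna-Q sketch on a toy 5-state chain (an illustrative example with assumed hyperparameters, not from the article): Q-learning on each real transition, plus extra Q-learning updates on virtual transitions replayed from a learned deterministic model.

```python
import random

N_STATES, ACTIONS = 5, (0, 1)          # chain of states 0..4; reward on reaching 4
ALPHA, GAMMA, EPS, PLAN_STEPS = 0.1, 0.95, 0.2, 10

def env_step(s, a):                    # the real environment (unknown to the agent)
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0), s2 == N_STATES - 1

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
model = {}                             # learned model: (s, a) -> (s', r)

def q_update(s, a, r, s2):
    Q[(s, a)] += ALPHA * (r + GAMMA * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])

random.seed(0)
for _ in range(100):                   # episodes, each from a random start state
    s, done, steps = random.randrange(N_STATES - 1), False, 0
    while not done and steps < 100:
        a = random.choice(ACTIONS) if random.random() < EPS else max(ACTIONS, key=lambda b: Q[(s, b)])
        s2, r, done = env_step(s, a)
        q_update(s, a, r, s2)          # learn from the real transition
        model[(s, a)] = (s2, r)        # record the transition in the model
        for _ in range(PLAN_STEPS):    # planning: replay virtual transitions
            ps, pa = random.choice(list(model))
            ps2, pr = model[(ps, pa)]
            q_update(ps, pa, pr, ps2)
        s, steps = s2, steps + 1

policy = [max(ACTIONS, key=lambda b: Q[(s, b)]) for s in range(N_STATES - 1)]
```

After training, the greedy policy should move right (action 1) from every non-terminal state; the planning loop is exactly the "more experience" the quote above describes.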

3. End-to-End Learning Support and Policy Gradients Through Models
There are also methods that fully incorporate the model itself into the RL learning loop. The key methodology here is end-to-end training.
"End-to-end supervised learning has been very successful in computer vision, NLP, etc. — can the same approach work for RL?"
For example, in policy gradient methods, the policy parameters are adjusted to maximize the sum of future rewards. Traditional sampling-based methods (REINFORCE, etc.) suffer from high variance.

With an accurate and smooth (differentiable) model, deterministic policy gradients can be obtained without bias.
"The policy gradient we obtain is actually a deterministic value, and there is no variance in it."
Advantages:
- Policy gradient values are "exact" with no variance.
- Useful for solving long-term credit assignment problems.
Disadvantages:
- Can get stuck in local minima
- Gradient vanishing, explosion, and other conditioning problems

Real trajectories are safer but less sample-efficient, while model-generated trajectories allow diverse policy changes but are more vulnerable to model errors.
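To make the deterministic policy gradient concrete, here is a hedged toy sketch (not from the article): a known differentiable model s' = s + a with a linear policy a = theta*s, where the gradient of the return with respect to theta is computed exactly by backpropagating through the unrolled model. Gradient clipping is used to tame the conditioning problems noted above; all constants are assumptions for illustration.

```python
H, S0, LR = 10, 1.0, 0.02

def return_and_grad(theta):
    states = [S0]
    for _ in range(H):                      # forward: unroll the model
        states.append(states[-1] + theta * states[-1])   # s' = s + theta*s
    J = -sum(s * s for s in states[1:])     # return with per-step reward -s^2
    dJ_ds, dJ_dtheta = 0.0, 0.0
    for s_next, s in zip(reversed(states[1:]), reversed(states[:-1])):
        dJ_ds += -2.0 * s_next              # reward gradient at this step
        dJ_dtheta += dJ_ds * s              # d s_next / d theta = s
        dJ_ds *= 1.0 + theta                # d s_next / d s = 1 + theta
    return J, dJ_dtheta

theta = 0.5
for _ in range(200):                        # gradient ascent on the return;
    _, g = return_and_grad(theta)           # clipping guards against the huge
    theta += LR * max(-1.0, min(1.0, g))    # gradients of long model unrolls
```

Given theta, the gradient is a single deterministic number (no sampling, no variance), and ascent converges to the optimum theta = -1, the policy that drives every state to zero.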
4. Background Planning vs. Decision-Time Planning
There are two representative modes of planning in model-based RL: background planning and decision-time planning.

- Background Planning: Focuses on "acquiring" habit/policy/value network parameters across all situations for fast decision-making (habit formation)
- Decision-Time Planning: An improvisational, slower (computation-heavy) deliberation that searches for an "immediate action sequence" from the current state
"Decision-time planning allows us to act without having learned anything at all. As long as you have a model, it's just an optimization problem."
Necessity and Differences Between the Two Approaches
Advantages of decision-time planning:
- Reflects the most recent state, so it can respond to the situation at hand
- Remains usable when the learned policy/value networks are uncertain
- Relieves the burden of state representation design (e.g., pixels, angles)
- Independent of the observation space
On the other hand:
- Decision-time planning has limitations under partial observability and in computational efficiency
- Background planning has advantages in predictability, consistency, and unified handling of discrete/continuous actions

The two approaches can also be combined or mixed.
5. Practical Implementation of Discrete and Continuous Planning
There are differences in implementation approaches and issues between discrete and continuous actions.
In background planning, since the policy's action distribution is handled probabilistically, discrete and continuous actions can be processed without major differences.
- Utilizing reparameterization techniques like Gumbel-Softmax (hard/soft action sampling)
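As an illustration, a NumPy sketch of Gumbel-Softmax sampling (a hedged example, not from the article; the temperature and probabilities are assumptions) might look like:

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, hard=False, rng=None):
    """Draw one relaxed (soft) or one-hot (hard) categorical sample."""
    rng = rng if rng is not None else np.random.default_rng()
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0,1) noise
    y = (logits + gumbel) / tau
    soft = np.exp(y - y.max())
    soft /= soft.sum()                      # differentiable "soft" sample
    if hard:
        one_hot = np.zeros_like(soft)
        one_hot[np.argmax(soft)] = 1.0      # discrete "hard" sample; autograd
        return one_hot                      # frameworks pass gradients via soft
    return soft

rng = np.random.default_rng(0)
logits = np.log(np.array([0.7, 0.2, 0.1]))
soft = gumbel_softmax(logits, tau=0.5, rng=rng)
hard = gumbel_softmax(logits, hard=True, rng=rng)
```

The soft sample is a point on the simplex (useful during training), while the hard sample is one-hot; in autograd frameworks the hard variant is typically combined with a straight-through gradient.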
Decision-time planning applies specialized algorithms:
- Discrete: MCTS (Monte Carlo Tree Search)
- Continuous: Trajectory optimization
MCTS (Monte Carlo Tree Search, used in AlphaGo/AlphaZero, etc.)
1. Initialization: Initialize Q-values and visit counts for all state/action pairs under consideration
2. Expansion: Expand nodes from the current state, selecting actions according to the search policy
3. Evaluation: Estimate the value of the reached node through a Monte-Carlo rollout
4. Backup: Propagate Q-values back to parent nodes
5. Repeat: Repeat steps 2-4
"MCTS tracks Q-values — long-term rewards — for all states and actions being considered."
Trajectory Optimization (Continuous Planning)
Finds an optimal trajectory by directly optimizing a single action sequence. Key steps:
1. Initialization: Start from an initial guess for the action sequence
2. Expansion: Execute the assumed action sequence in the model and observe the resulting states
3. Evaluation: Calculate the rewards along the trajectory
4. Backpropagation: Compute gradients step by step using the derivatives of the reward and transition models
5. Update: Move the action sequence along the gradient and repeat

The continuous approach is also called the shooting method.
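A minimal shooting-method sketch on a toy problem (illustrative assumptions, not from the article): optimize a 5-step action sequence for the model s' = s + a so the final state reaches a target, by repeatedly unrolling and descending the gradient of a terminal cost with a small action penalty.

```python
H, S0, TARGET, LAM, LR = 5, 0.0, 2.0, 0.1, 0.05

def unroll(actions):                    # expansion: execute the action sequence
    states = [S0]
    for a in actions:
        states.append(states[-1] + a)   # transition model s' = s + a
    return states

actions = [0.0] * H                     # initialization: an initial guess
for _ in range(500):
    final_state = unroll(actions)[-1]
    err = final_state - TARGET          # evaluation: cost (s_H - target)^2 + penalty
    grads = [2.0 * err + 2.0 * LAM * a for a in actions]    # backpropagation
    actions = [a - LR * g for a, g in zip(actions, grads)]  # update, then repeat

final = unroll(actions)[-1]
```

Because the final state depends on every earlier action, each action's gradient carries the full terminal error; the action penalty LAM pulls the solution slightly short of the target.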
6. Limitations and Mitigation Strategies for Continuous Planning
Continuous planning methods have various practical limitations, and mitigation strategies have been developed for each.
1) Sensitivity and Poor Conditioning
The shooting method has the chronic problem that very small changes in the initial actions can have enormous effects on the entire trajectory. This is similar to gradient explosion/vanishing in RNNs, but in RL the transition function is fixed by the environment, so we cannot reshape it to improve the conditioning of the optimization.
"Each state implicitly depends on all previous actions."
Collocation approach: Introduces the states themselves as optimization variables, optimizing states independently rather than actions alone. This formulation is better conditioned, which simplifies the optimization, and it is widely used in some robotics applications.

2) Local Optimum Problem
Simple gradient descent can easily get stuck in local optima. To address this, sampling-based methods (CEM, etc.) are introduced.
- CEM (Cross-Entropy Method): Sample multiple trajectories, update the mean and variance of the top trajectories, avoid local optima
"It looks very simple, but this method works surprisingly well and guarantees good performance."
An additional advantage is that the search space is much smaller than the policy parameter space.
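A minimal CEM sketch (illustrative assumptions, not from the article): plan a 5-step action sequence for a toy model s' = s + a by sampling candidate sequences, keeping the elite fraction, and refitting the sampling distribution's mean and variance.

```python
import numpy as np

H, POP, N_ELITE, ITERS, TARGET = 5, 200, 20, 30, 2.0

def cost(actions):
    s = 0.0
    for a in actions:
        s += a                          # toy transition model s' = s + a
    return (s - TARGET) ** 2            # terminal cost: distance to target

rng = np.random.default_rng(0)
mean, std = np.zeros(H), np.ones(H)
for _ in range(ITERS):
    samples = rng.normal(mean, std, size=(POP, H))      # sample trajectories
    order = np.argsort([cost(seq) for seq in samples])
    elites = samples[order[:N_ELITE]]                   # keep the top trajectories
    mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-3  # refit mean/variance
best_cost = cost(mean)
```

No gradients are needed, and the search space here is just the H-dimensional action sequence rather than a full policy parameter vector; the small floor added to std keeps the distribution from collapsing prematurely.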
3) Slow Convergence
Simple gradient descent can make the planning process take too long (potentially requiring tens of millions of iterations). To address this, higher-order optimization techniques such as Newton's method and iLQR are applied to accelerate convergence.
- Treating the problem as an LQR (Linear-Quadratic Regulator) by linearly approximating the dynamics and quadratically approximating the reward, which simplifies computation
- iLQR (Iterative LQR): Repeatedly approximating with LQR around the current solution and updating
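A minimal finite-horizon LQR sketch for the scalar case (illustrative constants, not from the article): the backward Riccati recursion that iLQR re-applies around the current solution after each linear/quadratic approximation, followed by a forward pass with the resulting feedback gains.

```python
A, B, Qc, Rc, H = 1.2, 1.0, 1.0, 0.1, 20   # dynamics x' = A x + B u; cost q x^2 + r u^2

P, gains = Qc, []                           # terminal value: V_H(x) = Qc * x^2
for _ in range(H):                          # backward pass over the horizon
    K = (B * A * P) / (Rc + B * B * P)      # optimal feedback gain, u = -K x
    P = Qc + A * P * (A - B * K)            # Riccati update of the value quadratic
    gains.append(K)
gains.reverse()                             # gains[t] is the gain for time step t

x, total_cost = 1.0, 0.0                    # forward pass from x_0 = 1
for K in gains:
    u = -K * x
    total_cost += Qc * x * x + Rc * u * u
    x = A * x + B * u                       # unstable open loop (A > 1) is stabilized
```

Because the value function stays exactly quadratic under linear dynamics, each backward step is closed-form; iLQR repeats this pair of passes around the latest trajectory until the action sequence converges.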
7. References and Related Links
- Tutorial original text: https://kargarisaac.github.io/blog/reinforcement%20learning/machine%20learning/deep%20learning/2020/10/26/mbrl.html
Conclusion
The various control strategies and algorithms in model-based reinforcement learning improve data efficiency and generalization, and enable control tailored to specific situations. Understanding each method's strengths and weaknesses, along with the practical issues that arise in using them, is the key to applying model-based control properly. Strategic selection matched to the situation and problem characteristics is the real power of model-based RL!