Model-Based Policy Optimization (MBPO) Agent
Model-based policy optimization (MBPO) is a model-based, off-policy reinforcement learning algorithm for environments with a discrete or continuous action space. An MBPO agent contains an internal model of the environment, which it uses to generate additional experiences without interacting with the environment. Specifically, during training, the MBPO agent generates real experiences by interacting with the environment. These experiences are used to train the internal environment model, which in turn is used to generate additional experiences. The training algorithm then uses both the real and generated experiences to update the agent policy. For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.
In Reinforcement Learning Toolbox™, an MBPO agent is implemented by an rlMBPOAgent object.
The following figure shows the components and behavior of an MBPO agent. The agent samples real experience data through environmental interaction and trains a model of the environment using this experience. Then, the agent updates the policy learnable parameters of its base agent using the real experience data and experience generated from the environment model.
Note
MBPO agents do not support recurrent networks.
MBPO agents can be trained in environments with the following observation and action spaces.
Observation Space | Action Space
Continuous | Discrete or continuous
You can use the following off-policy agents as the base agent in an MBPO agent.
Action Space | Base Off-Policy Agent
Discrete | DQN agents, SAC agents
Continuous | DDPG agents, TD3 agents, SAC agents
Note
Soft actor-critic agents with a hybrid action space cannot be used as the base agent of an MBPO agent.
MBPO agents use an environment model that you define using an rlNeuralNetworkEnvironment
object, which contains the following components. In
general, these components use a deep neural network to learn the environment behavior during
training.
One or more transition functions that predict the next observation based on the current observation and action. You can define deterministic transition functions using rlContinuousDeterministicTransitionFunction objects or stochastic transition functions using rlContinuousGaussianTransitionFunction objects.

A reward function that predicts the reward from the environment based on a combination of the current observation, current action, and next observation. You can define a deterministic reward function using an rlContinuousDeterministicRewardFunction object or a stochastic reward function using an rlContinuousGaussianRewardFunction object. You can also define a known reward function using a custom function.

An is-done function that predicts the termination signal based on a combination of the current observation, current action, and next observation. You can also define a known termination signal using a custom function.
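As an illustration of the input/output contracts of these three components, the following Python sketch uses simple stand-in functions. The names and implementations are purely illustrative (they are not Toolbox classes); in practice each component is a learned deep neural network.

```python
import numpy as np

# Illustrative stand-ins (not Toolbox classes) for the three model components.

def transition_fn(obs, act):
    """Deterministic transition: predict the next observation from (obs, action)."""
    # A fixed linear map stands in for the deep neural network used in practice.
    x = np.concatenate([obs, act])
    W = np.eye(len(obs), len(x))          # placeholder "learned" weights
    return W @ x

def reward_fn(obs, act, next_obs):
    """Predict the scalar reward from (obs, action, next observation)."""
    return -float(np.sum(next_obs ** 2))  # toy quadratic cost

def isdone_fn(obs, act, next_obs):
    """Predict the termination signal from (obs, action, next observation)."""
    return float(np.linalg.norm(next_obs) > 10.0)

obs, act = np.zeros(4), np.ones(2)
next_obs = transition_fn(obs, act)
experience = (obs, act, reward_fn(obs, act, next_obs), next_obs,
              isdone_fn(obs, act, next_obs))
```

Each generated experience has the same (observation, action, reward, next observation, is-done) shape as a real experience, which is what lets the base agent consume both interchangeably.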
During training, an MBPO agent:
Updates the environment model at the beginning of each episode by training the transition functions, reward function, and is-done function
Generates samples using the trained environment model and stores the samples in a circular experience buffer
Stores real samples from the interaction between the agent and the environment using a separate circular experience buffer within the base agent
Updates the actor and critic of the base agent using a minibatch of experiences randomly sampled from both the generated experience buffer and the real experience buffer
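The per-episode flow above can be sketched schematically in Python (rather than MATLAB); every callable here is a placeholder for the corresponding component, not a Toolbox API.

```python
from collections import deque

# Schematic MBPO training flow; the callables are placeholders for the
# environment, the model trainer, the rollout generator, and the agent update.
real_buffer = deque(maxlen=10000)    # real experiences (kept by the base agent)
model_buffer = deque(maxlen=10000)   # experiences generated by the learned model

def run_episode(env_step, train_model, rollout, update_agent, num_steps):
    train_model(real_buffer)                     # fit the model at episode start
    model_buffer.extend(rollout(real_buffer))    # generate model-based samples
    for _ in range(num_steps):
        real_buffer.append(env_step())           # store a real experience
        update_agent(real_buffer, model_buffer)  # learn from both buffers

# Dummy usage with trivial placeholder callables:
run_episode(env_step=lambda: ("o", "a", 0.0, "o2", 0),
            train_model=lambda rb: None,
            rollout=lambda rb: [("g", "a", 1.0, "g2", 0)],
            update_agent=lambda rb, mb: None,
            num_steps=5)
```

Note the two separate circular buffers: the generated samples never overwrite the real ones, which is what makes the mixed sampling in the training algorithm possible.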
Training Algorithm
MBPO agents use the following training algorithm, in which they periodically update the
environment model and the base off-policy agent. To configure the training algorithm,
specify options using an rlMBPOAgentOptions
object.
Initialize the actor and critics of the base agent.
Initialize the transition functions, reward function, and is-done function in the environment model.
At the beginning of each training episode:
For each model-training epoch, perform the following steps. To specify the number of epochs, use the NumEpochForTrainingModel option.

Train the transition functions. If the corresponding LearnRate optimizer option is 0, skip this step. Use a half-mean loss for an rlContinuousDeterministicTransitionFunction object and a maximum likelihood loss for an rlContinuousGaussianTransitionFunction object.

To make each observation channel equally important, first compute the loss for each observation channel. Then, divide each loss by the number of elements in its corresponding observation specification.
$$\mathrm{Loss}={\displaystyle \sum _{i=1}^{{N}_{o}}\frac{1}{{M}_{oi}}}\mathrm{Loss}{}_{oi}$$
For example, if the observation specification for the environment is defined by [rlNumericSpec([10 1]) rlNumericSpec([4 1])], then N_{o} is 2, M_{o1} is 10, and M_{o2} is 4.

Train the reward function. If the corresponding LearnRate optimizer option is 0, or if a ground-truth custom reward function is defined, skip this step. Use a half-mean loss for an rlContinuousDeterministicRewardFunction object and a maximum likelihood loss for an rlContinuousGaussianRewardFunction object.
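The half-mean loss and the channel-normalized total loss described above can be sketched in NumPy as follows. The function names are illustrative, not Toolbox APIs.

```python
import numpy as np

def half_mean_loss(pred, target):
    # Half mean-squared error, as used for the deterministic model functions.
    return 0.5 * np.mean((pred - target) ** 2)

def channel_normalized_loss(channel_losses, channel_sizes):
    # Divide each observation channel's loss by its number of elements M_oi,
    # then sum over the N_o channels, so every channel counts equally.
    return sum(l / m for l, m in zip(channel_losses, channel_sizes))

# Two observation channels with 10 and 4 elements, as in the example above:
total = channel_normalized_loss([5.0, 2.0], [10, 4])   # 5/10 + 2/4 = 1.0
```

Without the per-channel normalization, a large observation channel would dominate the total loss simply because it contributes more elements.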
Train the is-done function. If the corresponding LearnRate optimizer option is 0, or if a ground-truth custom is-done function is defined, skip this step. Use a weighted cross-entropy loss function. In general, terminal conditions (isdone = 1) occur much less frequently than nonterminal conditions (isdone = 0). To deal with this heavily imbalanced data, use the following weights and loss function.

$$\begin{array}{l}{w}_{0}=\frac{1}{{\displaystyle {\sum}_{i=1}^{M}\left(1-{T}_{i}\right)}},\text{\hspace{1em}}{w}_{1}=\frac{1}{{\displaystyle {\sum}_{i=1}^{M}{T}_{i}}}\\ \mathrm{Loss}=-\frac{1}{M}{\displaystyle \sum _{i=1}^{M}\left({w}_{1}{T}_{i}\mathrm{ln}{Y}_{i}+{w}_{0}\left(1-{T}_{i}\right)\mathrm{ln}\left(1-{Y}_{i}\right)\right)}\end{array}$$

Here, M is the minibatch size, T_{i} is a target, and Y_{i} is the output from the is-done network for the ith sample in the batch. T_{i} = 1 when isdone is 1, and T_{i} = 0 when isdone is 0.
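The weighted cross-entropy loss above can be sketched in NumPy as follows (an illustrative implementation, not the Toolbox one); the rare terminal samples receive the larger weight, which compensates for the class imbalance.

```python
import numpy as np

def weighted_cross_entropy(Y, T):
    """Class-balanced cross-entropy loss for the is-done predictions.

    Y holds predicted termination probabilities and T holds 0/1 targets;
    the rare terminal samples (T = 1) get the larger weight w1 = 1/sum(T).
    """
    M = len(T)
    w0 = 1.0 / np.sum(1.0 - T)   # weight for nonterminal samples
    w1 = 1.0 / np.sum(T)         # weight for terminal samples
    return -np.sum(w1 * T * np.log(Y) + w0 * (1.0 - T) * np.log(1.0 - Y)) / M

T = np.array([0.0, 0.0, 0.0, 1.0])   # one terminal sample out of four
good = weighted_cross_entropy(np.array([0.1, 0.1, 0.1, 0.9]), T)
bad = weighted_cross_entropy(np.array([0.5, 0.5, 0.5, 0.5]), T)
```

Because of the weighting, the single terminal sample contributes as much to the loss as all three nonterminal samples combined.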
Generate samples using the trained environment model. The following figure shows an example of two rollout trajectories with a horizon of two.
Increase the horizon based on the horizon update settings defined in the ModelRolloutOptions object.

Randomly sample a batch of N_{R} observations from the real experience buffer. To specify N_{R}, use the ModelRolloutOptions.NumRollout option.

For each horizon step:
Randomly divide the observations into N_{M} groups, where N_{M} is the number of transition models, and assign each group to a transition model.
For each observation o_{i}, generate an action a_{i} using the exploration policy defined by the ModelRolloutOptions.NoiseOptions object. If ModelRolloutOptions.NoiseOptions is empty, use the exploration policy of the base agent.

For each observation-action pair, predict the next observation o'_{i} using the corresponding transition model.
Using the environment model reward function, predict the reward value r_{i} based on the observation, action, and next observation.
Using the environment model is-done function, predict the termination signal done_{i} based on the observation, action, and next observation.
Add the experience (o_{i},a_{i},r_{i},o'_{i},done_{i}) to the generated experience buffer.
For the next horizon step, substitute each observation with the predicted next observation.
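The rollout steps above can be sketched as follows. This is a simplified Python illustration with placeholder models and a constant exploration action, not the Toolbox implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_rollouts(real_obs, transition_models, policy,
                      reward_fn, isdone_fn, horizon):
    """Roll sampled real observations forward `horizon` steps through the
    learned models, storing each generated experience."""
    generated = []
    obs = list(real_obs)
    for _ in range(horizon):
        # Randomly assign each observation to one of the N_M transition models.
        assignment = rng.integers(len(transition_models), size=len(obs))
        next_obs = []
        for o, k in zip(obs, assignment):
            a = policy(o)                        # exploration action
            o2 = transition_models[k](o, a)      # predicted next observation
            generated.append((o, a, reward_fn(o, a, o2), o2,
                              isdone_fn(o, a, o2)))
            next_obs.append(o2)
        obs = next_obs   # the next horizon step starts from the predictions
    return generated

# Toy usage: two scalar-observation models and a constant exploration action.
experiences = generate_rollouts(
    [0.0, 1.0, 2.0],
    transition_models=[lambda o, a: o + a, lambda o, a: o - a],
    policy=lambda o: 0.1,
    reward_fn=lambda o, a, o2: -abs(o2),
    isdone_fn=lambda o, a, o2: abs(o2) > 5,
    horizon=2)
```

Each of the N_{R} sampled observations produces one generated experience per horizon step, so the rollout yields N_{R} × horizon experiences in total.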
For each step in each training episode:
Sample a minibatch of M total experiences from the real experience buffer and the generated experience buffer. To specify M, use the MiniBatchSize option.

Sample N_{real} = ⌈M·R⌉ samples from the real experience buffer. To specify R, use the RealRatio option.

Sample N_{model} = M – N_{real} samples from the generated experience buffer.
Train the base agent using the sampled minibatch of data by following the update rule of the base agent. For more information, see the corresponding SAC, TD3, DDPG, or DQN training algorithm.
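The mixed sampling rule above can be sketched in Python as follows (an illustrative helper, not a Toolbox function):

```python
import math
import random

def mixed_minibatch(real_buffer, model_buffer, M, R):
    """Draw M experiences total: N_real = ceil(M*R) real, the rest generated."""
    n_real = min(math.ceil(M * R), len(real_buffer))
    n_model = M - n_real
    batch = (random.sample(real_buffer, n_real)
             + random.sample(model_buffer, n_model))
    random.shuffle(batch)   # mix real and generated experiences
    return batch

# With M = 64 and R = 0.25: 16 real samples and 48 generated samples.
batch = mixed_minibatch(list(range(100)), list(range(1000, 2000)), M=64, R=0.25)
```

Raising R makes the update rely more on real data; lowering it leans more heavily on the learned model.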
Tips
MBPO agents can be more sample-efficient than model-free agents because the model can generate large sets of diverse experiences. However, MBPO agents require much more computational time than model-free agents, because they must train the environment model and generate samples in addition to training the base agent.
To overcome modeling uncertainty, best practice is to use multiple environment transition models.
If they are available, it is best to use known ground-truth reward and is-done functions.
It is better to generate a large number of trajectories (thousands or tens of thousands). Doing so generates many samples, which reduces the likelihood of selecting the same sample multiple times in a training episode.
Since modeling errors can accumulate, it is better to use a shorter horizon when generating samples. A shorter horizon is usually enough to generate diverse experiences.
In general, an agent created using rlMBPOAgent is not suitable for environments with image observations.

When using a SAC base agent, taking more gradient steps (defined by the NumGradientStepsPerUpdate SAC agent option) makes the MBPO agent more sample-efficient. However, doing so increases the computational time.

The MBPO implementation in rlMBPOAgent is based on the algorithm in the original MBPO paper [1], but with the differences shown in the following table.

Original Paper | rlMBPOAgent
Generates samples at each environment step | Generates samples at the beginning of each training episode
Trains actor and critic using only generated samples | Trains actor and critic using both real data and generated data
Uses stochastic environment models | Uses either stochastic or deterministic environment models
Uses SAC agents | Can use SAC, DQN, DDPG, and TD3 agents
References
[1] Janner, Michael, Justin Fu, Marvin Zhang, and Sergey Levine. “When to Trust Your Model: Model-Based Policy Optimization.” In Proceedings of the 33rd International Conference on Neural Information Processing Systems, 12519–12530. Red Hook, NY, USA: Curran Associates Inc., 2019.