Soft Actor-Critic (SAC) Agent
The soft actor-critic (SAC) algorithm is an off-policy actor-critic method for environments with discrete, continuous, and hybrid action spaces. The SAC algorithm attempts to learn a stochastic policy that maximizes a combination of the policy value and its entropy. The policy entropy is a measure of policy uncertainty given the state; a higher entropy value promotes more exploration. Maximizing both the expected discounted cumulative long-term reward and the entropy balances exploration and exploitation of the environment. A soft actor-critic agent uses two critics to estimate the value of the optimal policy, while also featuring target critics and an experience buffer. SAC agents support offline training (training from saved data, without an environment). For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.
In Reinforcement Learning Toolbox™, a soft actor-critic agent is implemented by an rlSACAgent object. This implementation uses two Q-value function critics, which prevents overestimation of the value function. Other implementations of the soft actor-critic algorithm use an additional value function critic.
Soft actor-critic agents can be trained in environments with the following observation and action spaces.
| Observation Space | Action Space |
| --- | --- |
| Discrete or continuous | Discrete, continuous, or hybrid |
Note
Soft actor-critic agents with a hybrid action space do not support training with an evolutionary strategy, and you cannot use them to build model-based agents. Finally, while you can train any SAC agent offline (from existing data), only SAC agents with a continuous action space support batch data regularizer options.
Soft actor-critic agents use the following actor and critics. In the most general case, for hybrid action spaces, the action A has a discrete part A^{d} and a continuous part A^{c}.
| Critics | Actor |
| --- | --- |
| Q-value function critics Q(S,A), which you create using rlQValueFunction or rlVectorQValueFunction | Stochastic policy actor π(A\|S), which you create using rlContinuousGaussianActor, rlDiscreteCategoricalActor, or rlHybridStochasticActor |
During training, a soft actor-critic agent:
Updates the actor and critic learnable parameters at regular intervals during learning.
Estimates the probability distribution of the action and randomly selects an action based on the distribution.
Updates an entropy weight term to reduce the difference between entropy and target entropy.
Stores past experience using a circular experience buffer. The agent updates the actor and critic using a mini-batch of experiences randomly sampled from the buffer.
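The circular experience buffer and mini-batch sampling described above can be sketched as follows. This is an illustrative Python sketch, not Reinforcement Learning Toolbox code; the class and method names are hypothetical.

```python
import random
from collections import deque

# Illustrative sketch of a circular experience buffer with uniform
# mini-batch sampling, as used by SAC (not the toolbox implementation).
class ExperienceBuffer:
    def __init__(self, capacity):
        # deque with maxlen discards the oldest experience when full,
        # which gives the buffer its circular behavior
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # experiences are sampled randomly, so they are typically
        # nonconsecutive, which decorrelates the mini-batch
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

buf = ExperienceBuffer(capacity=3)
for t in range(5):
    buf.store(t, 0.0, 1.0, t + 1, False)
# after five stores, only the three most recent experiences remain
```

The fixed capacity corresponds to the ExperienceBufferLength option, and the sample size to the MiniBatchSize option.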
If the UseExplorationPolicy option of the agent is set to false, the action with maximum likelihood is always used in sim and generatePolicyFunction. As a result, the simulated agent and generated policy behave deterministically.

If UseExplorationPolicy is set to true, the agent selects its actions by sampling its probability distribution. As a result, the policy is stochastic and the agent explores its observation space.
This option affects only simulation and deployment; it does not affect training.
Actor and Critic Function Approximators
To estimate the policy and value function, a soft actor-critic agent maintains the following function approximators.
Stochastic actor π(A|S;θ).
For continuous-only action spaces, the actor outputs a vector containing the mean and standard deviation of the Gaussian distribution for the continuous part of the action. Note that the SAC algorithm bounds the continuous action sampled from the actor.
For discrete-only action spaces, the actor outputs a vector containing the probabilities of each possible discrete action.
For hybrid action spaces, the actor outputs both these vectors.
Both distributions are parameterized by θ and conditioned on the observation S.
One or two Q-value (or vector Q-value) critics Q_{k}(S,A^{c};ϕ_{k}) — The critics, each with parameters ϕ_{k}, take observation S and the continuous part of the action A^{c} (if present) as inputs and return the corresponding value function (for continuous action spaces), or the value of each possible discrete action A^{d} (for discrete or hybrid action spaces). The value function includes both the entropy of the policy and its expected discounted cumulative long-term reward.
One or two target critics Q_{tk}(S,A^{c};ϕ_{tk}) — To improve the stability of the optimization, the agent periodically sets the target critic parameters ϕ_{tk} to the latest corresponding critic parameter values. The number of target critics matches the number of critics.
When you use two critics, Q_{1}(S,A^{c};ϕ_{1}) and Q_{2}(S,A^{c};ϕ_{2}), the critics can have different structures. When the critics have the same structure, they must have different initial parameter values.
Each critic Q_{k}(S,A^{c};ϕ_{k}) and corresponding target critic Q_{tk}(S,A^{c};ϕ_{tk}) must have the same structure and parameterization.
For more information on creating actors and critics for function approximation, see Create Policies and Value Functions.
During training, the agent tunes the parameter values in θ. After training, the parameters remain at their tuned values and the trained actor function approximator is stored in π(A|S).
Continuous Action Generation
In a continuous action space soft actor-critic agent, the neural network in the actor takes the current observation and generates two outputs, one for the mean and the other for the standard deviation. To select an action, the actor randomly selects an unbounded action from this Gaussian distribution. If the soft actor-critic agent needs to generate bounded actions, the actor applies tanh and scaling operations to the action sampled from the Gaussian distribution.
During training, the agent uses the unbounded Gaussian distribution to calculate the entropy of the policy for the given observation.
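The sample-squash-scale sequence above can be sketched numerically. This is an illustrative Python sketch, not toolbox code; the mean, standard deviation, and action bounds are hypothetical values.

```python
import numpy as np

# Illustrative sketch of bounded continuous action generation:
# sample an unbounded action from the Gaussian policy output,
# then apply tanh and scaling to respect the action bounds.
rng = np.random.default_rng(0)

def sample_bounded_action(mean, std, lower, upper):
    u = rng.normal(mean, std)   # unbounded Gaussian sample
    squashed = np.tanh(u)       # bound to the open interval (-1, 1)
    # rescale from (-1, 1) to (lower, upper)
    return lower + (squashed + 1.0) * 0.5 * (upper - lower)

# hypothetical action range [-2, 2]
a = sample_bounded_action(mean=0.0, std=1.0, lower=-2.0, upper=2.0)
```

Because tanh maps into (-1, 1), the scaled action always stays strictly inside the specified bounds, regardless of how large the unbounded Gaussian sample is.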
Discrete Action Generation
In a discrete action space soft actor-critic agent, the actor takes the current observation and generates a categorical distribution, in which each possible action is associated with a probability. Since each action that belongs to the finite set is already assumed feasible, no bounding is needed.
During training, the agent uses the categorical distribution to calculate the entropy of the policy for the given observation.
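The categorical sampling and entropy computation can be sketched as follows. This is an illustrative Python sketch, not toolbox code; the probability vector is a hypothetical actor output.

```python
import numpy as np

# Illustrative sketch of discrete action generation: the actor outputs
# one probability per action; an action index is sampled from the
# categorical distribution, and the policy entropy is -sum(p * ln p).
rng = np.random.default_rng(1)
probs = np.array([0.7, 0.2, 0.1])   # hypothetical actor output

action = rng.choice(len(probs), p=probs)      # sampled action index
entropy = -np.sum(probs * np.log(probs))      # policy entropy
```

The entropy is maximal (ln N for N actions) when the distribution is uniform, and approaches zero as the policy concentrates on a single action.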
Hybrid Action Generation
In a hybrid action space soft actor-critic agent, the actor takes the current observation and generates both a categorical and a Gaussian distribution, which are both used to calculate the entropy of the policy during training.
A discrete action is then sampled from the categorical distribution, and a continuous action is sampled from the Gaussian distribution. If needed, the continuous action is then also automatically bounded as for continuous action generation.
The discrete and continuous actions are then returned to the environment using two different action channels.
Agent Creation
You can create and train soft actor-critic agents at the MATLAB^{®} command line or using the Reinforcement Learning Designer app. For more information on creating agents using Reinforcement Learning Designer, see Create Agents Using Reinforcement Learning Designer.
At the command line, you can create a soft actor-critic agent with default actor and critic based on the observation and action specifications from the environment. To do so, perform the following steps.
1. Create observation specifications for your environment. If you already have an environment object, you can obtain these specifications using getObservationInfo.
2. Create action specifications for your environment. If you already have an environment object, you can obtain these specifications using getActionInfo.
3. If needed, specify the number of neurons in each learnable layer of the default network or whether to use a recurrent default network. To do so, create an agent initialization option object using rlAgentInitializationOptions.
4. If needed, specify agent options using an rlSACAgentOptions object (alternatively, you can skip this step and then modify the agent options later using dot notation).
5. Create the agent using an rlSACAgent object.
Alternatively, you can create your own actor and critic objects and use them to create your agent. In this case, ensure that the input and output dimensions of the actor and critic match the corresponding action and observation specifications of the environment. To create an agent using your custom actor and critic objects, perform the following steps.
1. Create a stochastic actor using an rlContinuousGaussianActor object (for continuous action spaces), an rlDiscreteCategoricalActor object (for discrete action spaces), or an rlHybridStochasticActor object (for hybrid action spaces). For soft actor-critic agents with continuous or hybrid action spaces, the actor network must not contain a tanhLayer and scalingLayer as the last two layers in the output path for the mean values, since the scaling already occurs automatically. However, to ensure that the standard deviation values are nonnegative, the actor network must contain a reluLayer as the last layer in the output path for the standard deviation values.
2. Create one or two critics using rlQValueFunction objects (for continuous action spaces) or rlVectorQValueFunction objects (for hybrid or discrete action spaces). For hybrid action spaces, the critics must take as inputs both the observation and the continuous action. If the critics have the same structure, they must have different initial parameter values.
3. Specify agent options using an rlSACAgentOptions object (alternatively, you can skip this step and then modify the agent options later using dot notation).
4. Create the agent using an rlSACAgent object.
For more information on creating actors and critics for function approximation, see Create Policies and Value Functions.
Training Algorithm
The soft actor-critic agent uses the following training algorithm, in which it periodically updates the actor and critic models and the entropy weight. To configure the training algorithm, specify options using an rlSACAgentOptions object.

Here, K = 2 is the number of critics and k is the critic index.
Initialize each critic Q_{k}(S,A;ϕ_{k}) with random parameter values ϕ_{k}, and initialize each target critic with the same random parameter values, $${\varphi}_{tk}={\varphi}_{k}$$.
Initialize the actor π(A|S;θ) with random parameter values θ.
Perform a warm start by taking a sequence of actions following the initial random policy in π(A|S). For each action, store the experience (S,A,R,S') in the experience buffer. To specify the size of the experience buffer, use the ExperienceBufferLength option in the agent rlSACAgentOptions object. To specify the number of warm-up actions, use the NumWarmStartSteps option.

For each training time step:
For the current observation S, select the action A (with its continuous part bounded) using the policy in π(A|S;θ).
Execute action A. Observe the reward R and next observation S'.
Store the experience (S,A,R,S') in the experience buffer.
Every D_{C} time steps (to specify D_{C}, use the LearningFrequency option), perform the following operations for each epoch (to specify the number of epochs, use the NumEpoch option):

Create at most B different mini-batches. To specify B, use the MaxMiniBatchPerEpoch option. Each mini-batch contains M different (typically nonconsecutive) experiences (S_{i},A_{i},R_{i},S'_{i}) that are randomly sampled from the experience buffer (each experience can only be part of one mini-batch). To specify M, use the MiniBatchSize option.

If the agent contains recurrent neural networks, each mini-batch contains M different sequences. Each sequence contains K consecutive experiences (starting from a randomly sampled experience). To specify K, use the SequenceLength option.

For each mini-batch, perform the learning operations described in Mini-Batch Learning Operations.
Mini-Batch Learning Operations
Operations performed for each mini-batch.
Update the parameters of each critic by minimizing the loss L_{k} across all sampled experiences.
$${L}_{k}=\frac{1}{2M}{\displaystyle \sum _{i=1}^{M}{\left({y}_{i}-{Q}_{k}\left({S}_{i},{A}_{i};{\varphi}_{k}\right)\right)}^{2}}$$
To specify the optimizer options used to minimize L_{k}, use the options contained in the CriticOptimizerOptions option (which in turn contains an rlOptimizerOptions object). If the agent contains recurrent neural networks, each element of the sum over the batch elements is itself a sum over the time (sequence) dimension.
If S'_{i} is a terminal state, the value function target y_{i} is set equal to the experience reward R_{i}. Otherwise, the value function target is the sum of R_{i}, the minimum discounted future reward from the critics, and the weighted entropy. The following formulas show the value function target in discrete, continuous, and hybrid action spaces, respectively.
$$\begin{array}{l}{y}_{i}^{d}={R}_{i}+\gamma \underset{k}{\mathrm{min}}\left({\displaystyle \sum _{j=1}^{{N}^{d}}{\pi}^{d}\left({A}_{j}^{d}\text{'}|{S}_{i}\text{'};\theta \right){Q}_{tk}\left({S}_{i}\text{'},{A}_{j}^{d}\text{'};{\varphi}_{tk}\right)}\right)-{\alpha}^{d}{\displaystyle \sum _{j=1}^{{N}^{d}}{\pi}^{d}\left({A}_{j}^{d}\text{'}|{S}_{i}\text{'};\theta \right)\mathrm{ln}{\pi}^{d}\left({A}_{j}^{d}\text{'}|{S}_{i}\text{'};\theta \right)}\\ {y}_{i}^{c}={R}_{i}+\gamma \underset{k}{\mathrm{min}}\left({Q}_{tk}\left({S}_{i}\text{'},{A}_{i}^{c}\text{'};{\varphi}_{tk}\right)\right)-{\alpha}^{c}\mathrm{ln}{\pi}^{c}\left({A}_{i}^{c}\text{'}|{S}_{i}\text{'};\theta \right)\\ {y}_{i}^{h}={R}_{i}+\gamma \underset{k}{\mathrm{min}}\left({\displaystyle \sum _{j=1}^{{N}^{d}}{\pi}^{d}\left({A}_{j}^{d}\text{'}|{S}_{i}\text{'};\theta \right){Q}_{tk}\left({S}_{i}\text{'},{A}_{j}^{d}\text{'},{A}_{i}^{c}\text{'};{\varphi}_{tk}\right)}\right)-{\alpha}^{d}{\displaystyle \sum _{j=1}^{{N}^{d}}{\pi}^{d}\left({A}_{j}^{d}\text{'}|{S}_{i}\text{'};\theta \right)\mathrm{ln}{\pi}^{d}\left({A}_{j}^{d}\text{'}|{S}_{i}\text{'};\theta \right)}-{\alpha}^{c}\mathrm{ln}{\pi}^{c}\left({A}_{i}^{c}\text{'}|{S}_{i}\text{'};\theta \right)\end{array}$$
Here:
The superscripts d, c, and h indicate the quantity in the discrete, continuous, and hybrid cases, respectively. N^{d} is the number of possible discrete actions, and A^{d}_{j} indicates the jth action belonging to the discrete action set.
γ is the discount factor, which you specify in the DiscountFactor option.

The last two terms of the target equation for the hybrid case (or the last term in the other cases) represent the weighted policy entropy for the output of the actor when in state S. Here, α^{d} and α^{c} are the entropy loss weights for the discrete and continuous action spaces, which you specify by setting the EntropyWeight option of the respective EntropyWeightOptions property. To specify the other optimizer options used to tune one of the entropy terms, use the other properties of the EntropyWeightOptions agent option.
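The discrete-action target formula can be checked with a small numeric example. This is an illustrative Python sketch with hypothetical values for the reward, discount factor, entropy weight, action probabilities, and target critic outputs; it is not toolbox code.

```python
import numpy as np

# Numeric sketch of the discrete-action critic target y_i^d:
# reward plus the discounted minimum expected target-critic value,
# minus the weighted (negative) policy entropy term.
gamma, alpha_d, R = 0.99, 0.2, 1.0          # hypothetical values
pi_d = np.array([0.6, 0.4])                 # pi^d(A_j'|S'; theta)
Q_t = np.array([[1.0, 2.0],                 # Q_t1(S', A_j'; phi_t1)
                [1.5, 0.5]])                # Q_t2(S', A_j'; phi_t2)

expected_q = Q_t @ pi_d                     # expected value per target critic
entropy_term = np.sum(pi_d * np.log(pi_d))  # sum_j pi ln pi (negative entropy)
y = R + gamma * np.min(expected_q) - alpha_d * entropy_term
```

Because the entropy term sum π ln π is negative, subtracting it rewards high-entropy (more exploratory) policies, raising the target value.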
If you specify a value of NumStepsToLookAhead equal to N, then the N-step return (which adds the rewards of the following N steps and the discounted estimated value of the state that caused the Nth reward) is used to calculate the target y_{i}.

At every critic update, update the target critics depending on the target update method. For more information, see Target Update Methods.
Every D_{A} critic updates (to set D_{A}, use both the LearningFrequency and the PolicyUpdateFrequency options), perform the following two operations:

Update the parameters of the actor by minimizing the following objective function across all sampled experiences. The following formulas show the objective function in discrete, continuous, and hybrid action spaces, respectively.
$$\begin{array}{l}{J}_{\pi}^{d}=\frac{1}{M}{\displaystyle \sum _{i=1}^{M}\left(-\underset{k}{\mathrm{min}}\left({\displaystyle \sum _{j=1}^{{N}^{d}}{\pi}^{d}\left({A}_{j}^{d}|{S}_{i};\theta \right){Q}_{tk}\left({S}_{i},{A}_{j}^{d};{\varphi}_{tk}\right)}\right)+{\alpha}^{d}{\displaystyle \sum _{j=1}^{{N}^{d}}{\pi}^{d}\left({A}_{j}^{d}|{S}_{i};\theta \right)\mathrm{ln}{\pi}^{d}\left({A}_{j}^{d}|{S}_{i};\theta \right)}\right)}\\ {J}_{\pi}^{c}=\frac{1}{M}{\displaystyle \sum _{i=1}^{M}\left(-\underset{k}{\mathrm{min}}\left({Q}_{k}\left({S}_{i},{A}_{i}^{c};{\varphi}_{k}\right)\right)+{\alpha}^{c}\mathrm{ln}{\pi}^{c}\left({A}_{i}^{c}|{S}_{i};\theta \right)\right)}\\ {J}_{\pi}^{h}=\frac{1}{M}{\displaystyle \sum _{i=1}^{M}\left(-\underset{k}{\mathrm{min}}\left({\displaystyle \sum _{j=1}^{{N}^{d}}{\pi}^{d}\left({A}_{j}^{d}|{S}_{i};\theta \right){Q}_{tk}\left({S}_{i},{A}_{j}^{d},{A}_{i}^{c};{\varphi}_{tk}\right)}\right)+{\alpha}^{d}{\displaystyle \sum _{j=1}^{{N}^{d}}{\pi}^{d}\left({A}_{j}^{d}|{S}_{i};\theta \right)\mathrm{ln}{\pi}^{d}\left({A}_{j}^{d}|{S}_{i};\theta \right)}+{\alpha}^{c}\mathrm{ln}{\pi}^{c}\left({A}_{i}^{c}|{S}_{i};\theta \right)\right)}\end{array}$$
To specify the optimizer options used to minimize J_{π}, use the options contained in the ActorOptimizerOptions option (which in turn contains an rlOptimizerOptions object). If the agent contains recurrent neural networks, each element of the sum over the mini-batch elements is itself a sum over the time (sequence) dimension.
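The continuous-action objective J^{c} can be evaluated on a small numeric example. This is an illustrative Python sketch with hypothetical log-probabilities and critic values; it is not toolbox code.

```python
import numpy as np

# Numeric sketch of the continuous-action actor objective J^c for one
# mini-batch of M = 3 experiences: the negative minimum critic value
# plus the weighted log-probability of the sampled actions, averaged
# over the mini-batch.
alpha_c = 0.2                               # hypothetical entropy weight
log_pi = np.array([-1.2, -0.8, -1.5])       # ln pi^c(A_i^c|S_i; theta)
Q = np.array([[0.5, 1.0, 0.2],              # Q_1(S_i, A_i^c; phi_1)
              [0.7, 0.9, 0.4]])             # Q_2(S_i, A_i^c; phi_2)

J_c = np.mean(-np.min(Q, axis=0) + alpha_c * log_pi)
```

Minimizing this objective pushes the policy toward actions with high (minimum) critic value while the α^{c} ln π term penalizes low-entropy policies.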
Update the entropy weights by minimizing the following loss functions. When the action space is discrete or continuous, only the respective entropy weight is minimized. When the action space is hybrid, both weights are updated by minimizing both functions.
$$\begin{array}{l}{L}_{\alpha}^{d}=\frac{1}{M}{\displaystyle \sum _{i=1}^{M}\left(-{\alpha}^{d}{\displaystyle \sum _{j=1}^{{N}^{d}}{\pi}^{d}\left({A}_{j}^{d}|{S}_{i};\theta \right)\mathrm{ln}{\pi}^{d}\left({A}_{j}^{d}|{S}_{i};\theta \right)}-{\alpha}^{d}{\mathscr{H}}^{d}\right)}\\ {L}_{\alpha}^{c}=\frac{1}{M}{\displaystyle \sum _{i=1}^{M}\left(-{\alpha}^{c}\mathrm{ln}{\pi}^{c}\left({A}_{i}^{c}|{S}_{i};\theta \right)-{\alpha}^{c}{\mathscr{H}}^{c}\right)}\end{array}$$
Here, ℋ^{d} and ℋ^{c} are the target entropies for the discrete and continuous cases, which you specify in the TargetEntropy property of the corresponding EntropyWeightOptions object.
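The continuous entropy-weight loss and its effect on α^{c} can be sketched numerically. This is an illustrative Python sketch with hypothetical values for the weight, learning rate, target entropy, and log-probabilities; it is not toolbox code.

```python
import numpy as np

# Numeric sketch of the continuous entropy-weight loss L_alpha^c and a
# plain gradient-descent step on alpha_c: when the current entropy
# estimate (-ln pi) exceeds the target entropy H^c, the gradient drives
# alpha_c down, reducing the entropy bonus, and vice versa.
alpha_c, lr, H_c = 0.2, 0.1, 1.0            # hypothetical values
log_pi = np.array([-1.5, -1.3])             # ln pi^c(A_i^c|S_i; theta)

loss = np.mean(-alpha_c * log_pi - alpha_c * H_c)
grad = np.mean(-log_pi - H_c)               # dL/d(alpha_c)
alpha_c = alpha_c - lr * grad               # gradient descent step
```

Here the mean entropy estimate is 1.4, above the target of 1.0, so the update lowers α^{c} from 0.2 toward a smaller value.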
Target Update Methods
Soft actor-critic agents update their target critic parameters using one of the following target update methods.
Smoothing — Update the target critic parameters using smoothing factor τ. To specify the smoothing factor, use the TargetSmoothFactor option.

$${\varphi}_{tk}=\tau {\varphi}_{k}+\left(1-\tau \right){\varphi}_{tk}$$

Periodic — Update the target critic parameters periodically without smoothing (TargetSmoothFactor = 1). To specify the update period, use the TargetUpdateFrequency parameter.

$${\varphi}_{tk}={\varphi}_{k}$$
Periodic smoothing — Update the target parameters periodically with smoothing.
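The update rules above can be expressed in a few lines. This is an illustrative Python sketch, not toolbox code; the parameter vectors are hypothetical.

```python
import numpy as np

# Sketch of the target critic update rule: smoothing with factor tau,
# which reduces to a periodic hard copy when tau = 1.
def update_target(phi, phi_t, tau):
    # phi_t <- tau * phi + (1 - tau) * phi_t
    return tau * phi + (1.0 - tau) * phi_t

phi = np.array([1.0, 2.0])      # hypothetical critic parameters
phi_t = np.array([0.0, 0.0])    # hypothetical target critic parameters

phi_t = update_target(phi, phi_t, tau=0.5)  # smoothing step
hard = update_target(phi, phi_t, tau=1.0)   # periodic (hard) copy
```

A small τ makes the target critics track the critics slowly, which stabilizes the value targets during optimization.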
To configure the target update method, set the TargetUpdateFrequency and TargetSmoothFactor parameters as shown in the following table.
| Update Method | TargetUpdateFrequency | TargetSmoothFactor |
| --- | --- | --- |
| Smoothing (default) | 1 | Less than 1 |
| Periodic | Greater than 1 | 1 |
| Periodic smoothing | Greater than 1 | Less than 1 |
References
[1] Haarnoja, Tuomas, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, et al. "Soft Actor-Critic Algorithms and Applications." Preprint, submitted January 29, 2019. https://arxiv.org/abs/1812.05905.
[2] Christodoulou, Petros. "Soft Actor-Critic for Discrete Action Settings." arXiv preprint arXiv:1910.07207 (2019). https://arxiv.org/abs/1910.07207.
[3] Delalleau, Olivier, Maxim Peter, Eloi Alonso, and Adrien Logut. "Discrete and Continuous Action Representation for Practical RL in Video Games." arXiv preprint arXiv:1912.11077 (2019). https://arxiv.org/abs/1912.11077.
See Also
Functions
Objects
rlSACAgent | rlSACAgentOptions | rlQValueFunction | rlVectorQValueFunction | rlContinuousGaussianActor | rlHybridStochasticActor | rlDDPGAgent | rlTD3Agent | rlACAgent | rlPPOAgent