A reinforcement learning policy is a mapping that selects the action that the agent takes based on observations from the environment. During training, the agent tunes the parameters of its policy representation to maximize the expected cumulative long-term reward.

Reinforcement learning agents estimate policies and value functions using function approximators called actor and critic representations respectively. The actor represents the policy that selects the best action to take, based on the current observation. The critic represents the value function that estimates the expected cumulative long-term reward for the current policy.

Before creating an agent, you must create the required actor and critic representations using deep neural networks, linear basis functions, or lookup tables. The type of function approximators you use depends on your application.

For more information on agents, see Reinforcement Learning Agents.

The Reinforcement Learning Toolbox™ software supports the following types of representations:

- $V(S \mid \theta_V)$: Critics that estimate the expected cumulative long-term reward based on a given observation $S$. You can create these critics using `rlValueRepresentation`.
- $Q(S,A \mid \theta_Q)$: Critics that estimate the expected cumulative long-term reward for a given discrete action $A$ and a given observation $S$. You can create these critics using `rlQValueRepresentation`.
- $Q_i(S,A_i \mid \theta_Q)$: Multi-output critics that estimate the expected cumulative long-term reward for all possible discrete actions $A_i$ given observation $S$. You can create these critics using `rlQValueRepresentation`.
- $\mu(S \mid \theta_\mu)$: Actors that select an action based on a given observation $S$. You can create these actors using either `rlDeterministicActorRepresentation` or `rlStochasticActorRepresentation`.

Each representation uses a function approximator with a corresponding set of parameters ($\theta_V$, $\theta_Q$, $\theta_\mu$), which the agent tunes during training.

For systems with a limited number of discrete observations and discrete actions, you can store value functions in a lookup table. For systems that have many discrete observations and actions and for observation and action spaces that are continuous, storing the observations and actions is impractical. For such systems, you can represent your actors and critics using deep neural networks or custom (linear in the parameters) basis functions.

The following table summarizes the way in which you can use the four representation objects available with the Reinforcement Learning Toolbox software, depending on the action and observation spaces of your environment, and on the approximator and agent that you want to use.

**Representations vs. Approximators and Agents**

Representation | Supported Approximators | Observation Space | Action Space | Supported Agents |
---|---|---|---|---|
Value function critic | Table | Discrete | Not applicable | PG, AC, PPO |
Value function critic | Deep neural network or custom basis function | Discrete or continuous | Not applicable | PG, AC, PPO |
Q-value function critic | Table | Discrete | Discrete | Q, DQN, SARSA |
Q-value function critic | Deep neural network or custom basis function | Discrete or continuous | Discrete | Q, DQN, SARSA |
Q-value function critic | Deep neural network or custom basis function | Discrete or continuous | Continuous | DDPG, TD3 |
Multi-output Q-value function critic | Deep neural network or custom basis function | Discrete or continuous | Discrete | Q, DQN, SARSA |
Deterministic policy actor | Deep neural network or custom basis function | Discrete or continuous | Continuous | DDPG, TD3 |
Stochastic policy actor | Deep neural network or custom basis function | Discrete or continuous | Discrete | PG, AC, PPO |
Stochastic policy actor | Deep neural network | Discrete or continuous | Continuous | PG, AC, PPO, SAC |

For more information on agents, see Reinforcement Learning Agents.

Representations based on lookup tables are appropriate for environments with a limited number of *discrete* observations and actions. You can create two types of lookup table representations:

- Value tables, which store rewards for corresponding observations
- Q-tables, which store rewards for corresponding observation-action pairs

To create a table representation, first create a value table or Q-table using the `rlTable` function. Then, create a representation for the table using either an `rlValueRepresentation` or `rlQValueRepresentation` object. To configure the learning rate and optimization used by the representation, use an `rlRepresentationOptions` object.
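The table workflow described above can be sketched as follows. The observation and action sets and the learning rate are hypothetical values chosen only for illustration; in practice you would take the specifications from your environment.

```matlab
% Hypothetical discrete specifications: 4 observations and 2 actions.
obsInfo = rlFiniteSetSpec([1 2 3 4]);
actInfo = rlFiniteSetSpec([-1 1]);

% Value table: one entry per observation.
vTable = rlTable(obsInfo);

% Q-table: one entry per observation-action pair.
qTable = rlTable(obsInfo,actInfo);

% Learning rate chosen arbitrarily for this sketch.
opt = rlRepresentationOptions('LearnRate',0.1);

% Wrap each table in the matching critic representation.
vCritic = rlValueRepresentation(vTable,obsInfo,opt);
qCritic = rlQValueRepresentation(qTable,obsInfo,actInfo,opt);
```

The `Table` property of each `rlTable` object holds the stored values, so you can inspect or initialize it before creating the representation.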

You can create actor and critic function approximators using deep neural networks. Doing so uses Deep Learning Toolbox™ software features.

The dimensions of your actor and critic networks must match the corresponding action and observation specifications from the training environment object. To obtain the action and observation dimensions for environment `env`, use the `getActionInfo` and `getObservationInfo` functions, respectively. Then access the `Dimension` property of the specification objects.

```
actInfo = getActionInfo(env);
actDimensions = actInfo.Dimension;
obsInfo = getObservationInfo(env);
obsDimensions = obsInfo.Dimension;
```

Networks for value function critics (such as the ones used in AC, PG, or PPO agents) must take only observations as inputs and must have a single scalar output. For these networks, the dimensions of the input layers must match the dimensions of the environment observation specifications. For more information, see `rlValueRepresentation`.

Networks for single-output Q-value function critics (such as the ones used in Q, DQN, SARSA, DDPG, TD3, and SAC agents) must take both observations and actions as inputs, and must have a single scalar output. For these networks, the dimensions of the input layers must match the dimensions of the environment specifications for both observations and actions. For more information, see `rlQValueRepresentation`.

Networks for multi-output Q-value function critics (such as those used in Q, DQN, and SARSA agents) take only observations as inputs and must have a single output layer with output size equal to the number of discrete actions. For these networks, the dimensions of the input layers must match the dimensions of the environment observation specifications. For more information, see `rlQValueRepresentation`.
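As a minimal sketch of the input and output sizing rules above, the following code builds a value function critic and a multi-output Q-value critic for the same observation space. The specifications, layer sizes, and layer names are hypothetical.

```matlab
% Hypothetical specifications: 4-element observation, 3 discrete actions.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-1 0 1]);

% Value function critic: observation input, single scalar output.
vNet = [
    imageInputLayer([4 1 1],'Normalization','none','Name','state')
    fullyConnectedLayer(8,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(1,'Name','value')];
vCritic = rlValueRepresentation(vNet,obsInfo,'Observation',{'state'});

% Multi-output Q-value critic: observation input, one output per action.
numActions = numel(actInfo.Elements);
qNet = [
    imageInputLayer([4 1 1],'Normalization','none','Name','state')
    fullyConnectedLayer(8,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(numActions,'Name','qvalues')];
qCritic = rlQValueRepresentation(qNet,obsInfo,actInfo,'Observation',{'state'});

% Evaluate both critics on a random observation.
obs = rand(4,1);
v = getValue(vCritic,{obs});   % scalar value estimate
q = getValue(qCritic,{obs});   % one Q-value per discrete action
```

Because the multi-output critic receives only the observation, you pass the action specification to the constructor but do not name an action input layer.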

For actor networks, the dimensions of the input layers must match the dimensions of the environment observation specifications.

Networks used in actors with a discrete action space (such as the ones in PG, AC, and PPO agents) must have a single output layer with an output size equal to the number of possible discrete actions.

Networks used in deterministic actors with a continuous action space (such as the ones in DDPG and TD3 agents) must have a single output layer with an output size matching the dimension of the action space defined in the environment action specification.

Networks used in stochastic actors with a continuous action space (such as the ones in PG, AC, PPO, and SAC agents) must have a single output layer with output size having twice the dimension of the action space defined in the environment action specification. These networks must have two separate paths, the first producing the mean values (which must be scaled to the output range) and the second producing the standard deviations (which must be non-negative).

For more information, see `rlDeterministicActorRepresentation` and `rlStochasticActorRepresentation`.
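The two-path structure required for a stochastic actor with a continuous action space might look like the following sketch. The specifications, layer names, and layer sizes are hypothetical, and the concatenation dimension (3) assumes `imageInputLayer` inputs as used elsewhere on this page; other input layer types or releases may require a different dimension.

```matlab
% Hypothetical specifications: 6 observations, 2 continuous actions in [-10,10].
obsInfo = rlNumericSpec([6 1]);
actInfo = rlNumericSpec([2 1],'LowerLimit',-10,'UpperLimit',10);
numAct = actInfo.Dimension(1);

% Shared input path.
inPath = [
    imageInputLayer([6 1 1],'Normalization','none','Name','state')
    fullyConnectedLayer(numAct,'Name','infc')];

% Mean path: tanh bounds the output, scalingLayer stretches it to the action range.
meanPath = [
    tanhLayer('Name','tanh')
    scalingLayer('Name','scale','Scale',actInfo.UpperLimit)];

% Standard deviation path: softplus keeps the values positive.
sdevPath = softplusLayer('Name','splus');

% Single output layer holding means followed by standard deviations
% (twice the action dimension).
outLayer = concatenationLayer(3,2,'Name','mean_sdev');

net = layerGraph(inPath);
net = addLayers(net,meanPath);
net = addLayers(net,sdevPath);
net = addLayers(net,outLayer);
net = connectLayers(net,'infc','tanh/in');
net = connectLayers(net,'infc','splus/in');
net = connectLayers(net,'scale','mean_sdev/in1');
net = connectLayers(net,'splus','mean_sdev/in2');

actor = rlStochasticActorRepresentation(net,obsInfo,actInfo,'Observation',{'state'});

% Sample an action for a random observation.
act = getAction(actor,{rand(6,1)});
```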

Deep neural networks consist of a series of interconnected layers. The following table lists some common deep learning layers used in reinforcement learning applications. For a full list of available layers, see List of Deep Learning Layers.

Layer | Description |
---|---|
`featureInputLayer` | Inputs feature data and applies normalization. |
`imageInputLayer` | Inputs vectors and 2-D images and applies normalization. |
`sigmoidLayer` | Applies a sigmoid function to the input such that the output is bounded in the interval (0,1). |
`tanhLayer` | Applies a hyperbolic tangent activation layer to the input. |
`reluLayer` | Sets any input values that are less than zero to zero. |
`fullyConnectedLayer` | Multiplies the input vector by a weight matrix, and adds a bias vector. |
`convolution2dLayer` | Applies sliding convolutional filters to the input. |
`additionLayer` | Adds the outputs of multiple layers together. |
`concatenationLayer` | Concatenates inputs along a specified dimension. |
`sequenceInputLayer` | Inputs sequence data to a network. |
`lstmLayer` | Applies a Long Short-Term Memory layer to the input. Supported for DQN and PPO agents. |

The `bilstmLayer` and `batchNormalizationLayer` layers are not supported for reinforcement learning.

The Reinforcement Learning Toolbox software provides the following layers, which contain no tunable parameters (that is, parameters that change during training).

Layer | Description |
---|---|
`scalingLayer` | Applies a linear scale and bias to an input array. This layer is useful for scaling and shifting the outputs of nonlinear layers, such as `tanhLayer` and `sigmoidLayer`. |
`quadraticLayer` | Creates a vector of quadratic monomials constructed from the elements of the input array. This layer is useful when you need an output that is some quadratic function of its inputs, such as for an LQR controller. |
`softplusLayer` | Implements the softplus activation $Y = \log(1 + e^{X})$, which ensures that the output is always positive. This function is a smoothed version of the rectified linear unit (ReLU). |

You can also create your own custom layers. For more information, see Define Custom Deep Learning Layers.

For reinforcement learning applications, you construct your deep neural network by connecting a series of layers for each input path (observations or actions) and for each output path (estimated rewards or actions). You then connect these paths together using the `connectLayers` function.

You can also create your deep neural network using the **Deep Network Designer** app. For an example, see Create Agent Using Deep Network Designer and Train Using Image Observations.

When you create a deep neural network, you must specify names for the first layer of each input path and the final layer of the output path.

The following code creates and connects the following input and output paths:

- An observation input path, `observationPath`, with the first layer named `'observation'`.
- An action input path, `actionPath`, with the first layer named `'action'`.
- An estimated value function output path, `commonPath`, which takes the outputs of `observationPath` and `actionPath` as inputs. The final layer of this path is named `'output'`.

```
observationPath = [
    imageInputLayer([4 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(24,'Name','CriticObsFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(24,'Name','CriticObsFC2')];
actionPath = [
    imageInputLayer([1 1 1],'Normalization','none','Name','action')
    fullyConnectedLayer(24,'Name','CriticActFC1')];
commonPath = [
    additionLayer(2,'Name','add')
    reluLayer('Name','CriticCommonRelu')
    fullyConnectedLayer(1,'Name','output')];
criticNetwork = layerGraph(observationPath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'CriticObsFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActFC1','add/in2');
```

For all observation and action input paths, you must specify an `imageInputLayer` as the first layer in the path.

You can view the structure of your deep neural network using the `plot` function.

```
plot(criticNetwork)
```

For PG and AC agents, the final output layers of your deep neural network actor representation are a `fullyConnectedLayer` and a `softmaxLayer`. When you specify the layers for your network, you must specify the `fullyConnectedLayer` and you can optionally specify the `softmaxLayer`. If you omit the `softmaxLayer`, the software automatically adds one for you.

Determining the number, type, and size of layers for your deep neural network representation can be difficult and is application dependent. However, the most critical component in deciding the characteristics of the function approximator is whether it is able to approximate the optimal policy or discounted value function for your application, that is, whether it has layers that can correctly learn the features of your observation, action, and reward signals.

Consider the following tips when constructing your network.

- For continuous action spaces, bound actions with a `tanhLayer` followed by a `scalingLayer`, if necessary.
- Deep dense networks with `reluLayer` layers can be fairly good at approximating many different functions. Therefore, they are often a good first choice.
- Start with the smallest possible network that you think can approximate the optimal policy or value function.
- When you approximate strong nonlinearities or systems with algebraic constraints, adding more layers is often better than increasing the number of outputs per layer. In general, the ability of the approximator to represent more complex functions grows only polynomially in the size of the layers, but grows exponentially with the number of layers. In other words, more layers allow approximating more complex and nonlinear compositional functions, although this generally requires more data and longer training times. Networks with fewer layers can require exponentially more units to successfully approximate the same class of functions, and might fail to learn and generalize correctly.
- For on-policy agents (the ones that learn only from experience collected while following the current policy), such as AC and PG agents, parallel training works better if your networks are large (for example, a network with two hidden layers with 32 nodes each, which has a few hundred parameters). On-policy parallel updates assume each worker updates a different part of the network, such as when they explore different areas of the observation space. If the network is small, the worker updates can correlate with each other and make training unstable.
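The first tip, bounding a continuous action with a `tanhLayer` followed by a `scalingLayer`, can be sketched as below. The specifications, the action limit of 2, and all layer names and sizes are hypothetical.

```matlab
% Hypothetical specifications: 4 observations, one continuous action in [-2,2].
obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([1 1],'LowerLimit',-2,'UpperLimit',2);

actorNet = [
    imageInputLayer([4 1 1],'Normalization','none','Name','state')
    fullyConnectedLayer(16,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(1,'Name','fc2')
    tanhLayer('Name','tanh')                    % output in (-1,1)
    scalingLayer('Name','action','Scale',2)];   % output in (-2,2)

actor = rlDeterministicActorRepresentation(actorNet,obsInfo,actInfo, ...
    'Observation',{'state'},'Action',{'action'});

% The returned action stays within the specified limits.
act = getAction(actor,{rand(4,1)});
```

Scaling after the `tanhLayer` keeps the bound independent of the weights, so the actor cannot request an action outside the range regardless of how training proceeds.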

To create a critic representation for your deep neural network, use an `rlValueRepresentation` or `rlQValueRepresentation` object. To create an actor representation for your deep neural network, use an `rlDeterministicActorRepresentation` or `rlStochasticActorRepresentation` object. To configure the learning rate and optimization used by the representation, use an `rlRepresentationOptions` object.

For example, create a Q-value representation object for the critic network `criticNetwork`, specifying a learning rate of `0.0001`. When you create the representation, pass the environment action and observation specifications to the `rlQValueRepresentation` object, and specify the names of the network layers to which the observations and actions are connected (in this case `'observation'` and `'action'`).

```
opt = rlRepresentationOptions('LearnRate',0.0001);
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,...
    'Observation',{'observation'},'Action',{'action'},opt);
```

When you create your deep neural network and configure your representation object, consider using the following approach as a starting point.

- Start with the smallest possible network and a high learning rate (`0.01`). Train this initial network to see if the agent converges quickly to a poor policy or acts in a random manner. If either of these issues occurs, rescale the network by adding more layers or more outputs on each layer. Your goal is to find a network structure that is just big enough, does not learn too fast, and shows signs of learning (an improving trajectory of the reward graph) after an initial training period.
- Once you settle on a good network architecture, a low initial learning rate can allow you to see if the agent is on the right track, and help you check that your network architecture is satisfactory for the problem. A low learning rate makes tuning parameters much easier, especially for difficult problems.

Also, consider the following tips when configuring your deep neural network representation.

Be patient with DDPG and DQN agents, since they might not learn anything for some time during the early episodes, and they typically show a dip in cumulative reward early in the training process. Eventually, they can show signs of learning after the first few thousand episodes.

For DDPG and DQN agents, promoting exploration of the agent is critical.

For agents with both actor and critic networks, set the initial learning rates of both representations to the same value. For some problems, setting the critic learning rate to a higher value than that of the actor can improve learning results.

When creating representations for use with any agent except Q and SARSA, you can use recurrent neural networks (RNN). These networks are deep neural networks with a `sequenceInputLayer` input layer and at least one layer that has hidden state information, such as an `lstmLayer`. They can be especially useful when the environment has states that cannot be included in the observation vector.

For agents that have both actor and critic, you must either use an RNN for both of them, or not use an RNN for any of them. You cannot use an RNN only for the critic or only for the actor.

When using PG agents, the learning trajectory length for the RNN is the whole episode. For an AC agent, the `NumStepsToLookAhead` property of its options object is treated as the training trajectory length. For a PPO agent, the trajectory length is the `MiniBatchSize` property of its options object.

For DQN, DDPG, SAC, and TD3 agents, you must specify the training trajectory length as an integer greater than one in the `SequenceLength` property of their options object.
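As a rough sketch of a recurrent critic for a DQN agent, the following code uses a `sequenceInputLayer` and an `lstmLayer`, and sets `SequenceLength` in the agent options. The specifications, layer sizes, layer names, and the sequence length of 20 are hypothetical, and disabling double DQN is only a choice made for this sketch.

```matlab
% Hypothetical specifications: 4 observations, 2 discrete actions.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-1 1]);

% Multi-output Q-value critic with a recurrent (LSTM) layer.
criticNet = [
    sequenceInputLayer(4,'Normalization','none','Name','state')
    fullyConnectedLayer(16,'Name','fc1')
    reluLayer('Name','relu1')
    lstmLayer(8,'OutputMode','sequence','Name','lstm')
    fullyConnectedLayer(numel(actInfo.Elements),'Name','qvalues')];

critic = rlQValueRepresentation(criticNet,obsInfo,actInfo,'Observation',{'state'});

% Training trajectory length must be an integer greater than one.
agentOpts = rlDQNAgentOptions('UseDoubleDQN',false,'SequenceLength',20);
agent = rlDQNAgent(critic,agentOpts);
```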

Note that code generation is not supported for continuous-action PG, AC, PPO, and SAC agents that use a recurrent neural network (RNN), or for any agent that has multiple input paths and contains an RNN in any of the paths.

For more information and examples on policies and value functions, see `rlValueRepresentation`, `rlQValueRepresentation`, `rlDeterministicActorRepresentation`, and `rlStochasticActorRepresentation`.

Custom (linear in the parameters) basis function approximators have the form `f = W'B`, where `W` is a weight array and `B` is the column vector output of a custom basis function that you must create. The learnable parameters of a linear basis function representation are the elements of `W`.

For value function critic representations (such as the ones used in AC, PG, or PPO agents), `f` is a scalar value, so `W` must be a column vector with the same length as `B`, and `B` must be a function of the observation. For more information, see `rlValueRepresentation`.
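A minimal sketch of a value function critic built from a custom basis function follows. The observation specification, the basis function, and the initial weights are hypothetical; the cell array syntax `{basisFcn,W0}` pairs the basis function handle with the initial weight vector.

```matlab
% Hypothetical specification: 2-element continuous observation.
obsInfo = rlNumericSpec([2 1]);

% Hypothetical basis function: B = [x1; x2; x1^2; x2^2].
basisFcn = @(obs) [obs(1); obs(2); obs(1)^2; obs(2)^2];

% Initial weights: column vector with the same length as B.
W0 = [1; 0; 0.5; 0];

critic = rlValueRepresentation({basisFcn,W0},obsInfo);

% Evaluate f = W0'*B at the observation [2;3]:
% B = [2;3;4;9], so f = 1*2 + 0.5*4 = 4.
v = getValue(critic,{[2;3]});
```

Only the elements of `W0` are learnable; the basis function itself stays fixed during training.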

For single-output Q-value function critic representations (such as the ones used in Q, DQN, SARSA, DDPG, TD3, and SAC agents), `f` is a scalar value, so `W` must be a column vector with the same length as `B`, and `B` must be a function of both the observation and action. For more information, see `rlQValueRepresentation`.

For multi-output Q-value function critic representations with discrete action spaces (such as those used in Q, DQN, and SARSA agents), `f` is a vector with as many elements as the number of possible actions. Therefore, `W` must be a matrix with as many columns as the number of possible actions and as many rows as the length of `B`. `B` must be only a function of the observation. For more information, see `rlQValueRepresentation`.

- For actors with a discrete action space (such as the ones in PG, AC, and PPO agents), `f` must be a column vector with length equal to the number of possible discrete actions.
- For deterministic actors with a continuous action space (such as the ones in DDPG and TD3 agents), the dimensions of `f` must match the dimensions of the agent action specification, which is either a scalar or a column vector.
- Stochastic actors with continuous action spaces cannot rely on custom basis functions (they can only use neural network approximators, due to the need to enforce positivity for the standard deviations).

For any actor representation, `W` must have as many columns as the number of elements in `f`, and as many rows as the number of elements in `B`. `B` must be only a function of the observation. For more information, see `rlDeterministicActorRepresentation` and `rlStochasticActorRepresentation`.

For an example that trains a custom agent that uses a linear basis function representation, see Train Custom LQR Agent.

Once you create your actor and critic representations, you can create a reinforcement learning agent that uses these representations. For example, create a PG agent using a given actor representation, `actor`, and a critic representation, `baseline`, that serves as the baseline.

```
agentOpts = rlPGAgentOptions('UseBaseline',true);
agent = rlPGAgent(actor,baseline,agentOpts);
```

For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.

You can obtain the actor and critic representations from an existing agent using `getActor` and `getCritic`, respectively.

You can also set the actor and critic of an existing agent using `setActor` and `setCritic`, respectively. When you specify a representation for an existing agent using these functions, the input and output layers of the specified representation must match the observation and action specifications of the original agent.
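The get-and-set workflow might be used as in the following sketch, which extracts the critic from a DQN agent, modifies its learnable parameters, and writes it back. The agent, network, and the doubling of the parameters are hypothetical and serve only to show the round trip.

```matlab
% Hypothetical specifications and critic network for a DQN agent.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-1 1]);
net = [
    imageInputLayer([4 1 1],'Normalization','none','Name','state')
    fullyConnectedLayer(16,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(2,'Name','qvalues')];
critic = rlQValueRepresentation(net,obsInfo,actInfo,'Observation',{'state'});
agent = rlDQNAgent(critic);

% Extract the critic and read its learnable parameters (a cell array).
critic = getCritic(agent);
params = getLearnableParameters(critic);

% Modify the parameters; doubling them is arbitrary and only for illustration.
newParams = cellfun(@(p) 2*p,params,'UniformOutput',false);
critic = setLearnableParameters(critic,newParams);

% Write the modified critic back to the agent.
agent = setCritic(agent,critic);
```

Because the modified critic keeps the same network structure, its input and output layers still match the specifications of the original agent.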