
rlDiscreteCategoricalActor

Stochastic categorical actor with a discrete action space for reinforcement learning agents

Description

This object implements a function approximator to be used as a stochastic actor within a reinforcement learning agent with a discrete action space. A discrete categorical actor takes an environment state as input and returns as output a random action sampled from a categorical (also known as Multinoulli) probability distribution over the finite set of possible actions, thereby implementing a stochastic policy. After you create an rlDiscreteCategoricalActor object, use it to create a suitable agent, such as an rlACAgent or rlPGAgent agent. For more information on creating actors and critics, see Create Policies and Value Functions.

Creation

Description

example

actor = rlDiscreteCategoricalActor(net,observationInfo,actionInfo) creates a stochastic actor with a discrete action space, using the deep neural network net as function approximator. For this actor, actionInfo must specify a discrete action space. The network input layers are automatically associated with the environment observation channels according to the dimension specifications in observationInfo. The network must have a single output layer with as many elements as the number of possible discrete actions, as specified in actionInfo. This function sets the ObservationInfo and ActionInfo properties of actor to the inputs observationInfo and actionInfo, respectively.

Note

actor does not enforce constraints set by the action specification. Therefore, when using this actor, you must enforce action space constraints within the environment.
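For example, the following minimal sketch creates such an actor from a simple layer array; the network and specification definitions mirror the first example in the Examples section.

obsInfo = rlNumericSpec([4 1]);          % four-element continuous observation channel
actInfo = rlFiniteSetSpec([-10 0 10]);   % three possible discrete actions
net = [ featureInputLayer(4)             % input sized to the observation
        fullyConnectedLayer(3) ];        % one output element per possible action
actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo);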

example

actor = rlDiscreteCategoricalActor(net,observationInfo,actionInfo,ObservationInputNames=netObsNames) specifies the names of the network input layers to be associated with the environment observation channels. The function assigns, in sequential order, each environment observation channel specified in observationInfo to the layer specified by the corresponding name in the string array netObsNames. Therefore, the network input layers, ordered as the names in netObsNames, must have the same data type and dimensions as the observation specifications, as ordered in observationInfo.
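For instance, the following sketch uses this syntax, assuming the network has a single input layer named 'state' (as in the second example in the Examples section).

actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo, ...
            ObservationInputNames="state");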

example

actor = rlDiscreteCategoricalActor({basisFcn,W0},observationInfo,actionInfo) creates a discrete space stochastic actor using a custom basis function as underlying approximator. The first input argument is a two-element cell array whose first element is the handle basisFcn to a custom basis function and whose second element is the initial weight matrix W0. This function sets the ObservationInfo and ActionInfo properties of actor to the inputs observationInfo and actionInfo, respectively.

actor = rlDiscreteCategoricalActor(___,UseDevice=useDevice) specifies the device used to perform computational operations on the actor object, and sets the UseDevice property of actor to the useDevice input argument. You can use this syntax with any of the previous input-argument combinations.
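For example, the following sketch creates the actor on a GPU; as described in the UseDevice property, this requires Parallel Computing Toolbox software and a supported GPU.

actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo,UseDevice="gpu");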

Input Arguments


Deep neural network used as the underlying approximator within the actor. The network must have the environment observation channels as inputs and a single output layer with as many elements as the number of possible discrete actions. Since the output of the network must represent the probability of executing each possible action, the software automatically adds a softmaxLayer as a final output layer if you do not specify it explicitly. When computing the action, the actor then randomly samples the distribution to return an action.

You can specify the network either as a dlnetwork object or as another Deep Learning Toolbox network object, such as a layer array, which is converted internally to a dlnetwork object.

Note

Among the different network representation options, dlnetwork is preferred, since it has built-in validation checks and supports automatic differentiation. If you pass another network object as an input argument, it is internally converted to a dlnetwork object. However, best practice is to convert other representations to dlnetwork explicitly before using it to create a critic or an actor for a reinforcement learning agent. You can do so using dlnet=dlnetwork(net), where net is any neural network object from the Deep Learning Toolbox™. The resulting dlnet is the dlnetwork object that you use for your critic or actor. This practice allows a greater level of insight and control for cases in which the conversion is not straightforward and might require additional specifications.
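For example, a minimal sketch of this conversion, with net, obsInfo, and actInfo defined as in the Examples section:

dlnet = dlnetwork(net);    % explicit conversion before creating the actor
actor = rlDiscreteCategoricalActor(dlnet,obsInfo,actInfo);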

rlDiscreteCategoricalActor objects support recurrent deep neural networks. For an example, see Create Discrete Categorical Actor from Deep Recurrent Neural Network.

The learnable parameters of the actor are the weights of the deep neural network. For a list of deep neural network layers, see List of Deep Learning Layers. For more information on creating deep neural networks for reinforcement learning, see Create Policies and Value Functions.

Network input layer names corresponding to the environment observation channels, specified as a string array or a cell array of character vectors. When you use the ObservationInputNames=netObsNames name-value argument, the function assigns, in sequential order, each environment observation channel specified in observationInfo to each network input layer specified by the corresponding name in the string array netObsNames. Therefore, the network input layers, ordered as the names in netObsNames, must have the same data type and dimensions as the observation specifications, as ordered in observationInfo.

Note

Of the information specified in observationInfo, the function uses only the data type and dimension of each channel, but not its (optional) name or description.

Example: {"NetInput1_airspeed","NetInput2_altitude"}

Custom basis function, specified as a function handle to a user-defined MATLAB function. The user-defined function can either be an anonymous function or a function on the MATLAB path. The output of the actor is an action, among the ones defined in actionInfo, randomly sampled from a categorical distribution with probabilities p = softmax(W'*B), where W is a weight matrix containing the learnable parameters and B is the column vector returned by the custom basis function. Each element of p represents the probability of executing the corresponding action from the observed state.

Your basis function must have the following signature.

B = myBasisFunction(obs1,obs2,...,obsN)

Here, obs1 to obsN are inputs in the same order and with the same data type and dimensions as the environment observation channels defined in observationInfo.

Example: @(obs1,obs2,obs3) [obs3(2)*obs1(1)^2; abs(obs2(5)+obs3(1))]

Initial value of the basis function weights W, specified as a matrix having as many rows as the length of the vector returned by the basis function and as many columns as the dimension of the action space.

Properties


Observation specifications, specified as an rlFiniteSetSpec or rlNumericSpec object or an array of such objects. These objects define properties such as the dimensions, data types, and names of the observation signals.

rlDiscreteCategoricalActor sets the ObservationInfo property of actor to the input observationInfo.

You can extract ObservationInfo from an existing environment or agent using getObservationInfo. You can also construct the specifications manually.

Action specifications, specified as an rlFiniteSetSpec object. This object defines the properties of the environment action channel, such as its dimensions, data type, and name. Note that the function does not use the name of the action channel specified in actionInfo.

Note

Only one action channel is allowed.

rlDiscreteCategoricalActor sets the ActionInfo property of actor to the input actionInfo.

You can extract ActionInfo from an existing environment or agent using getActionInfo. You can also construct the specifications manually.

Computation device used to perform operations such as gradient computation, parameter update and prediction during training and simulation, specified as either "cpu" or "gpu".

The "gpu" option requires both Parallel Computing Toolbox™ software and a CUDA® enabled NVIDIA® GPU. For more information on supported GPUs see GPU Support by Release (Parallel Computing Toolbox).

You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be used with MATLAB®.

Note

Training or simulating an agent on a GPU involves device-specific numerical round-off errors. These errors can produce different results compared to performing the same operations on a CPU.

To speed up training by using parallel processing over multiple cores, you do not need to use this argument. Instead, when training your agent, use an rlTrainingOptions object in which the UseParallel option is set to true. For more information about training using multicore processors and GPUs, see Train Agents Using Parallel Computing and GPUs.

Example: 'UseDevice',"gpu"
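For example, the following sketch enables the parallel training mentioned above; here, agent and env stand for an agent and environment that you have already created.

trainOpts = rlTrainingOptions(UseParallel=true);   % parallel processing over multiple cores
results = train(agent,env,trainOpts);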

Object Functions

rlACAgent       Actor-critic reinforcement learning agent
rlPGAgent       Policy gradient reinforcement learning agent
rlPPOAgent      Proximal policy optimization reinforcement learning agent
getAction       Obtain action from agent or actor given environment observations

Examples


Create an observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment). For this example, define the observation space as a continuous four-dimensional space, so that a single observation is a column vector containing four doubles.

obsInfo = rlNumericSpec([4 1]);

Create an action specification object (or alternatively use getActionInfo to extract the specification object from an environment). For this example, define the action space as consisting of three values, -10, 0, and 10.

actInfo = rlFiniteSetSpec([-10 0 10]);

Create a deep neural network approximator for the actor. The input of the network must accept a four-element vector (the observation vector just defined by obsInfo), and its output must be a three-element vector. Each element of the output vector must be between 0 and 1 since it represents the probability of executing each of the three possible actions (as defined by actInfo). Using softmax as the output layer enforces this requirement (the software automatically adds a softmaxLayer as a final output layer if you do not specify it explicitly). When computing the action, the actor then randomly samples the distribution to return an action.

net = [  featureInputLayer(4,'Normalization','none')
         fullyConnectedLayer(3) ];

Create the actor with rlDiscreteCategoricalActor, using the network and the observation and action specification objects. The network input layer is automatically associated with the environment observation channel according to the dimension specifications in obsInfo.

actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo);

To validate your actor, use getAction to return a random action from the observation vector [1 1 1 1], using the current network weights.

act = getAction(actor,{[1 1 1 1]}); 
act
act = 0

You can now use the actor to create a suitable agent, such as an rlACAgent or rlPGAgent agent.
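For instance, a minimal sketch creating a policy gradient agent with default options from this actor:

agent = rlPGAgent(actor);                       % PG agent that uses this actor
getAction(agent,{rand(obsInfo.Dimension)})      % sample an action from the agent policy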

Create an observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment). For this example, define the observation space as a continuous four-dimensional space, so that a single observation is a column vector containing four doubles.

obsInfo = rlNumericSpec([4 1]);

Create an action specification object (or alternatively use getActionInfo to extract the specification object from an environment). For this example, define the action space as consisting of three values, -10, 0, and 10.

actInfo = rlFiniteSetSpec([-10 0 10]);

Create a deep neural network approximator for the actor. The input of the network (here called state) must accept a four-element vector (the observation vector just defined by obsInfo), and its output (here called actionProb) must be a three-element vector. Each element of the output vector must be between 0 and 1 since it represents the probability of executing each of the three possible actions (as defined by actInfo). Using softmax as the output layer enforces this requirement (however, the software automatically adds a softmaxLayer as a final output layer if you do not specify it explicitly). When computing the action, the actor then randomly samples the distribution to return an action.

net = [  featureInputLayer(4,'Normalization','none', ...
             'Name','state')
         fullyConnectedLayer(3,'Name','fc')
         softmaxLayer('Name','actionProb')  ];

Create the actor with rlDiscreteCategoricalActor, using the network, the observation and action specification objects, and the name of the network input layer.

actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo,...
            'ObservationInputNames','state');

To validate your actor, use getAction to return a random action from the observation vector [1 1 1 1], using the current network weights.

act = getAction(actor,{[1 1 1 1]}); 
act
act = 10

You can now use the actor to create a suitable agent, such as an rlACAgent or rlPGAgent agent.

Create an observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment). For this example, define the observation space as consisting of two channels, the first being a two-dimensional vector in a continuous space, and the second being a two-dimensional vector that can assume only three values: -[1 1], [0 0], and [1 1]. Therefore, a single observation consists of two two-dimensional vectors, one continuous and the other discrete.

obsInfo = [rlNumericSpec([2 1]) rlFiniteSetSpec({-[1 1],[0 0],[1 1]})];

Create a discrete action space specification object (or alternatively use getActionInfo to extract the specification object from an environment with a discrete action space). For this example, define the action space as a finite set consisting of three possible values (7, 5, and 3 in this case).

actInfo = rlFiniteSetSpec([7 5 3]);

Create a custom basis function. Each element of its output is a function of the observations defined by obsInfo.

myBasisFcn = @(obsC,obsD) [obsC(1)^2-obsD(2)^2; 
                           obsC(2)^2-obsD(1)^2;  
                           exp(obsC(2))+abs(obsD(1)); 
                           exp(obsC(1))+abs(obsD(2))];

The actor randomly samples the action to return, among the ones defined in actInfo, according to the probability distribution softmax(W'*myBasisFcn(obsC,obsD)). W is a weight matrix, containing the learnable parameters, which must have as many rows as the length of the basis function output and as many columns as the number of possible actions.

Define an initial parameter matrix.

W0 = rand(4,3);

Create the actor. The first argument is a two-element cell array containing both the handle to the custom function and the initial parameter matrix. The second and third arguments are, respectively, the observation and action specification objects.

actor = rlDiscreteCategoricalActor({myBasisFcn,W0},obsInfo,actInfo);
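As an illustration of the relationship p = softmax(W'*B) described above, you can compute the action probabilities implied by the initial weights for a sample observation. This is only a sketch for inspection; the explicit exponential form stands in for the softmax operation.

z = W0'*myBasisFcn(rand(2,1),[1 1]);   % one unnormalized score per possible action
p = exp(z)/sum(exp(z))                 % categorical probabilities over the three actions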

To check your actor, use the getAction function to return one of the three possible actions, depending on a given random observation and on the current parameter matrix.

getAction(actor,{rand(2,1),[1 1]})
ans = 3

Note that the discrete set constraint on the second observation channel is not enforced.

getAction(actor,{rand(2,1),[0.5 -0.7]})
ans = 3

You can now use the actor (along with a critic) to create a suitable discrete action space agent (such as an rlACAgent, rlPGAgent, or rlPPOAgent agent).

This example shows you how to create a stochastic actor with a discrete action space using a recurrent neural network. You can also use a recurrent neural network for a continuous stochastic actor.

For this example, use the same environment used in Train PG Agent to Balance Cart-Pole System. Load the environment and obtain the observation and action specifications.

env = rlPredefinedEnv('CartPole-Discrete');
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Create a recurrent deep neural network for the actor. To create a recurrent neural network, use a sequenceInputLayer as the input layer (with size equal to the number of dimensions of the observation channel) and include at least one lstmLayer.

actorNetwork = [
    sequenceInputLayer(prod(obsInfo.Dimension), ...
         'Normalization','none','Name','state')
    fullyConnectedLayer(8,'Name','fc')
    reluLayer('Name','relu')
    lstmLayer(8,'OutputMode','sequence','Name','lstm')
    fullyConnectedLayer(numel(actInfo.Elements)) ];

Create a stochastic actor representation for the network.

actor = rlDiscreteCategoricalActor(actorNetwork, ...
    obsInfo,actInfo,...
    'ObservationInputNames','state');

To check your actor, use getAction to return one of the two possible actions, depending on a given random observation and on the current network weights.

getAction(actor,{rand(obsInfo.Dimension)})
ans = 1x1 cell array
    {[-10]}

Use evaluate to return the probability of each of the two possible actions. Note that the type of the returned numbers is single, not double.

prob = evaluate(actor,{rand(obsInfo.Dimension)});
prob{1}
ans = 2x1 single column vector

    0.4704
    0.5296

You can use getState and setState to extract and set the current state of the recurrent neural network in the actor.

getState(actor)
ans=2×1 cell array
    {8x1 single}
    {8x1 single}

actor = setState(actor, ...
    {-0.01*single(rand(8,1)), ...
      0.01*single(rand(8,1))});

To evaluate the actor using sequential observations, use the sequence length (time) dimension. For example, obtain actions for 5 independent sequences each one consisting of 9 sequential observations.

[action,state] = getAction(actor, ...
    {rand([obsInfo.Dimension 5 9])});

Display the action corresponding to the seventh element of the observation sequence in the fourth sequence.

action = action{1};
action(1,1,4,7)
ans = 10

Display the updated state of the recurrent neural network.

state
state=2×1 cell array
    {8x5 single}
    {8x5 single}

For more information on input and output format for recurrent neural networks, see the Algorithms section of lstmLayer.

You can now use the actor (along with a critic) to create a suitable discrete action space agent (such as an rlACAgent, rlPGAgent, or rlPPOAgent agent).

Version History

Introduced in R2022a