rlDQNAgentOptions
Options for DQN agent
Description
Use an rlDQNAgentOptions object to specify options when creating a deep Q-network (DQN) agent. To create a DQN agent, use rlDQNAgent.
For more information, see Deep Q-Network (DQN) Agent.
For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.
Creation
Description
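opt = rlDQNAgentOptions creates an options object for a DQN agent using all default options. You can modify the object properties using dot notation.
opt = rlDQNAgentOptions(Name=Value) creates the options object opt and sets its properties using one or more name-value arguments. For example, rlDQNAgentOptions(DiscountFactor=0.95) creates an options object with a discount factor of 0.95. You can specify multiple name-value arguments.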
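As a brief illustration of the name-value syntax, the following sketch sets two properties at creation and then adjusts a third with dot notation. The specific values are arbitrary choices for illustration, not recommendations.

```matlab
% Create the options object with two name-value arguments
% (the values here are illustrative only).
opt = rlDQNAgentOptions(DiscountFactor=0.95, MiniBatchSize=64);

% Adjust another property afterward using dot notation.
opt.ExperienceBufferLength = 1e5;
```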
Properties
Sample time of the agent, specified as a positive scalar or as -1.
Within a MATLAB® environment, the agent is executed every time the environment advances, so SampleTime does not affect the timing of the agent execution. If SampleTime is set to -1 in MATLAB environments, the time interval between consecutive elements in the returned output experience is considered equal to 1.
Within a Simulink® environment, the RL Agent block that uses the agent object executes every SampleTime seconds of simulation time. If SampleTime is set to -1, the block inherits the sample time from its input signals. Set SampleTime to -1 when the block is a child of an event-driven subsystem.
Set SampleTime to a positive scalar when the block is not a child of an event-driven subsystem. Doing so ensures that the block executes at appropriate intervals when input signal sample times change due to model variations. If SampleTime is a positive scalar, this value is also the time interval between consecutive elements in the output experience returned by sim or train, regardless of the type of environment.
If SampleTime is set to -1 in Simulink environments, the time interval between consecutive elements in the returned output experience reflects the timing of the events that trigger the RL Agent block execution.
This property is shared between the agent and the agent options object within the agent. If you change this property in the agent options object, it also changes in the agent, and vice versa.
Example: SampleTime=-1
Discount factor applied to future rewards during training, specified as a nonnegative scalar less than or equal to 1.
Example: DiscountFactor=0.9
Options for epsilon-greedy exploration, specified as an EpsilonGreedyExploration object with these properties.
| Property | Description | Default Value |
|---|---|---|
| Epsilon | Initial value of the probability threshold to either randomly select an action or select the action that maximizes the state-action value function. A larger Epsilon value means that the agent randomly explores the action space at a higher rate. | 1 |
| EpsilonMin | Minimum value of Epsilon | 0.01 |
| EpsilonDecay | Decay rate | 0.0050 |
At each interaction with the environment (that is, at each training step), if Epsilon is greater than EpsilonMin, then it is updated using this formula.
Epsilon = Epsilon*(1-EpsilonDecay)
Epsilon is conserved between the end of an episode and the start of the next one. So, Epsilon keeps decreasing across episodes until it reaches EpsilonMin.
If your agent converges on a local optimum too quickly, you can promote agent exploration by increasing the value of Epsilon.
To specify exploration options, use dot notation after creating the rlDQNAgentOptions object opt. For example, set the initial epsilon value to 0.9.
opt.EpsilonGreedyExploration.Epsilon = 0.9;
Note
The Epsilon property of an EpsilonGreedyExploration object represents the initial value of Epsilon at the beginning of the first episode.
Experience buffer size, specified as a positive integer. During training, the agent computes updates using a mini-batch of experiences randomly sampled from the buffer.
Example: ExperienceBufferLength=1e6
Size of random experience mini-batch, specified as a positive integer. During each training episode, the agent randomly samples experiences from the experience buffer when computing gradients for updating the critic properties. Large mini-batches reduce the variance when computing gradients but increase the computational effort.
When using a recurrent neural network for the critic, MiniBatchSize is the number of experience trajectories in a batch, where each trajectory has length equal to SequenceLength.
Example: MiniBatchSize=128
Maximum batch-training trajectory length when using a recurrent neural network, specified as a positive integer. This value must be greater than 1 when using a recurrent neural network and 1 otherwise.
Example: SequenceLength=4
Critic optimizer options, specified as an rlOptimizerOptions object. Use this object to specify training parameters of the critic approximator, such as the learning rate and gradient threshold, as well as the optimizer algorithm and its parameters. For more information, see rlOptimizerOptions and rlOptimizer.
Example: CriticOptimizerOptions = rlOptimizerOptions(LearnRate=5e-3)
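As a minimal sketch of how the optimizer options might be combined with the agent options, the following code builds an rlOptimizerOptions object, passes it at creation, and then changes the learning rate with dot notation. The numeric values are placeholders chosen for illustration.

```matlab
% Build the critic optimizer options (illustrative values).
criticOpts = rlOptimizerOptions(LearnRate=5e-3, GradientThreshold=1);

% Pass them when creating the DQN agent options.
opt = rlDQNAgentOptions(CriticOptimizerOptions=criticOpts);

% Nested properties can also be changed afterward with dot notation.
opt.CriticOptimizerOptions.LearnRate = 1e-3;
```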
Number of future rewards used to estimate the value of the policy, specified as a positive integer. Specifically, if NumStepsToLookAhead is equal to N, the target value of the policy at a given step is calculated by adding the rewards for the following N steps and the discounted estimated value of the state that caused the N-th reward. This target is also called N-step return.
Note
When using a recurrent neural network for the critic, NumStepsToLookAhead must be 1.
For more information, see [1], Chapter 7.
Example: NumStepsToLookAhead=3
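The N-step return can be worked through numerically. The sketch below uses plain MATLAB and assumes the standard formulation from [1], in which the k-th intermediate reward is discounted by the discount factor raised to the power k and the bootstrapped value by the discount factor raised to the power N. The rewards and the critic estimate are made-up numbers, and the exact computation inside the toolbox is not described on this page.

```matlab
gamma          = 0.9;      % discount factor
rewards        = [1 0 2];  % rewards for the N = 3 following steps (made-up values)
bootstrapValue = 5;        % hypothetical critic estimate for the state reached after N steps

N         = numel(rewards);
discounts = gamma.^(0:N-1);   % [1 gamma gamma^2]

% N-step return: discounted rewards plus the discounted bootstrapped value.
nStepReturn = sum(discounts.*rewards) + gamma^N*bootstrapValue;
```

Here the result is 1 + 0 + 0.81*2 + 0.729*5 = 6.265.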
Minimum number of samples to generate before learning starts. Use this option to ensure that learning takes place over a more diverse data set at the beginning of training. The default, and minimum, value is the value of MiniBatchSize. After the software collects a minimum of NumWarmStartSteps samples, learning occurs at the intervals specified by the LearningFrequency property.
Example: NumWarmStartSteps=20
Number of times an agent learns over the data set stored in the experience buffer, specified as a positive integer. For off-policy agents that support this property (DQN, DDPG, TD3 and SAC), this value defines the number of passes over the data in the replay buffer at each learning iteration.
Example: NumEpoch=2
Maximum number of mini-batches used for learning during a single epoch, specified as a positive integer.
For off-policy agents that support this property (DQN, DDPG, TD3, and SAC), the actual number of mini-batches used for learning depends on the length of the replay buffer, and MaxMiniBatchPerEpoch specifies the upper bound. This value also specifies the maximum number of gradient steps per learning iteration because the maximum number of gradient steps is equal to the MaxMiniBatchPerEpoch value multiplied by the NumEpoch value.
For off-policy agents that support this property, a high MaxMiniBatchPerEpoch value means that more time is spent on learning than collecting new data. Therefore, you can use this parameter to control the sample efficiency of the learning process.
Example: MaxMiniBatchPerEpoch=200
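The upper bound on gradient steps per learning iteration follows directly from the two properties. As a small sketch with illustrative values:

```matlab
% Two passes over the buffer, at most 200 mini-batches per pass.
opt = rlDQNAgentOptions(NumEpoch=2, MaxMiniBatchPerEpoch=200);

% Upper bound on gradient steps per learning iteration.
maxGradientSteps = opt.MaxMiniBatchPerEpoch*opt.NumEpoch;
```

With these values the bound is 400 gradient steps. The actual number can be lower, because it depends on how many experiences the replay buffer holds.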
Minimum number of environment interactions between learning iterations, specified as a positive integer or -1. This value defines how many new data samples need to be generated before learning. For DQN, DDPG, TD3, and SAC agents, the default value of -1 means that learning occurs after each episode is finished. Note that for these agents learning can start only after the software collects a minimum of NumWarmStartSteps samples. It then occurs at the intervals specified by the LearningFrequency property.
Example: LearningFrequency=4
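NumWarmStartSteps and LearningFrequency work together: the first delays the start of learning, and the second sets the spacing between learning iterations afterward. A possible configuration, with values picked only for illustration, is sketched below.

```matlab
opt = rlDQNAgentOptions(MiniBatchSize=64);

% Collect 1000 samples before the first learning iteration
% (must be at least MiniBatchSize).
opt.NumWarmStartSteps = 1000;

% After the warm start, learn once every 4 environment interactions
% instead of once per episode (the default, -1).
opt.LearningFrequency = 4;
```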
Option to use double DQN for value function target updates, specified as a logical value. For more information, see Deep Q-Network (DQN) Agent.
Example: UseDoubleDQN=false
Smoothing factor for target critic updates, specified as a positive scalar less than or equal to 1. For more information, see Target Update Methods.
Example: TargetSmoothFactor=1e-2
Number of steps between target critic updates, specified as a positive integer. For more information, see Target Update Methods.
Example: TargetUpdateFrequency=5
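To illustrate what the smoothing factor does, the sketch below applies one soft update to a toy parameter vector in plain MATLAB. It assumes the usual rule in which the target parameters move a fraction TargetSmoothFactor toward the critic parameters. The vectors are made-up stand-ins for network parameters, and the Target Update Methods topic remains the authoritative description.

```matlab
tau = 1e-2;                    % plays the role of TargetSmoothFactor

criticParams = [1.0; 2.0];     % hypothetical critic parameters
targetParams = [0.0; 0.0];     % hypothetical target critic parameters

% One smoothing update: blend the critic into the target.
targetParams = tau*criticParams + (1 - tau)*targetParams;
```

After this single update, targetParams is [0.01; 0.02]. With tau equal to 1, the target would instead be replaced by the critic parameters outright.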
Batch data regularizer options, specified as an rlBehaviorCloningRegularizerOptions object. These options are typically used to train the agent offline, from existing data. If you leave this option empty, no regularizer is used.
For more information, see rlBehaviorCloningRegularizerOptions.
Example: BatchDataRegularizerOptions = rlBehaviorCloningRegularizerOptions(BehaviorCloningRegularizerWeight=10)
Option for clearing the experience buffer before training, specified as a logical value.
Example: ResetExperienceBufferBeforeTraining=true
Options to save additional agent data, specified as a structure containing the following fields.
- Optimizer
- PolicyState
- Target
- ExperienceBuffer
You can save an agent object using one of these methods:
- Use the save command.
- Specify saveAgentCriteria and saveAgentValue in an rlTrainingOptions object.
- Specify an appropriate logging function within a FileLogger object.
When you save an agent using any method, the fields in the InfoToSave structure determine whether the corresponding data saves with the agent. For example, if you set the PolicyState field to true, then the policy state saves along with the agent.
You can modify the InfoToSave property only after you create the agent options object.
Example: options.InfoToSave.Optimizer=true
Option to save the actor and critic optimizers, specified as a logical value. If you set the Optimizer field to false, then the actor and critic optimizers (which are hidden properties of the agent and can contain internal states) are not saved along with the agent, therefore saving disk space and memory. However, when the optimizers contain internal states, the state of the saved agent is not identical to the state of the original agent.
Example: true
Option to save the state of the explorative policy, specified as a logical value. If you set the PolicyState field to false, then the state of the explorative policy (which is a hidden agent property) is not saved along with the agent. In this case, the state of the saved agent is not identical to the state of the original agent.
Example: true
Option to save the actor and critic targets, specified as a logical value. If you set the Target field to false, then the actor and critic targets (which are hidden agent properties) are not saved along with the agent. In this case, when the targets contain internal states, the state of the saved agent is not identical to the state of the original agent.
Example: true
Option to save the experience buffer, specified as a logical value. If you set the ExperienceBuffer field to false, then the content of the experience buffer (which is accessible as an agent property using dot notation) is not saved along with the agent. In this case, the state of the saved agent is not identical to the state of the original agent.
Example: true
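Because InfoToSave can be modified only after the options object exists, its fields are set with dot notation. The following sketch, with choices made purely for illustration, requests that the optimizer and the experience buffer be saved with the agent and leaves the other fields at their current values.

```matlab
opt = rlDQNAgentOptions;

% Keep the optimizer and the experience buffer contents with the
% saved agent, for example to resume training later.
opt.InfoToSave.Optimizer = true;
opt.InfoToSave.ExperienceBuffer = true;
```

Saving the experience buffer can increase the size of the saved agent considerably when ExperienceBufferLength is large.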
Object Functions
| rlDQNAgent | Deep Q-network (DQN) reinforcement learning agent | 
Examples
Create an rlDQNAgentOptions object that specifies the agent mini-batch size.
opt = rlDQNAgentOptions(MiniBatchSize=48)
opt = 
  rlDQNAgentOptions with properties:
                             SampleTime: 1
                         DiscountFactor: 0.9900
               EpsilonGreedyExploration: [1×1 rl.option.EpsilonGreedyExploration]
                 ExperienceBufferLength: 10000
                          MiniBatchSize: 48
                         SequenceLength: 1
                 CriticOptimizerOptions: [1×1 rl.option.rlOptimizerOptions]
                    NumStepsToLookAhead: 1
                      NumWarmStartSteps: 48
                               NumEpoch: 1
                   MaxMiniBatchPerEpoch: 100
                      LearningFrequency: -1
                           UseDoubleDQN: 1
                     TargetSmoothFactor: 1.0000e-03
                  TargetUpdateFrequency: 1
            BatchDataRegularizerOptions: []
    ResetExperienceBufferBeforeTraining: 0
                             InfoToSave: [1×1 struct]
You can modify options using dot notation. For example, set the agent sample time to 0.5.
opt.SampleTime = 0.5;
References
[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Second edition. Adaptive Computation and Machine Learning. Cambridge, Mass: The MIT Press, 2018.
Version History
Introduced in R2019a
The default value of the ResetExperienceBufferBeforeTraining property has changed from true to false.
When creating a new DQN agent, if you want to clear the experience buffer before training, you must specify ResetExperienceBufferBeforeTraining as true. For example, before training, set the property using dot notation.
agent.AgentOptions.ResetExperienceBufferBeforeTraining = true;
Alternatively, you can set the property to true in an rlDQNAgentOptions object and use this object to create the DQN agent.
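A minimal sketch of this alternative is shown here. The observation and action specifications are hypothetical placeholders; in practice they come from your environment. This form of rlDQNAgent creates a default critic network, which is assumed to be acceptable for the illustration.

```matlab
% Request that the buffer be cleared before training.
opt = rlDQNAgentOptions(ResetExperienceBufferBeforeTraining=true);

% Hypothetical specifications, for illustration only.
obsInfo = rlNumericSpec([4 1]);      % 4-element continuous observation
actInfo = rlFiniteSetSpec([-1 1]);   % two discrete actions

% Create the agent with a default critic and the options above.
agent = rlDQNAgent(obsInfo, actInfo, opt);
```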
Target update method settings for DQN agents have changed. The following changes require updates to your code:
- The TargetUpdateMethod option has been removed. Now, DQN agents determine the target update method based on the TargetUpdateFrequency and TargetSmoothFactor option values.
- The default value of TargetUpdateFrequency has changed from 4 to 1.
To use one of the following target update methods, set the TargetUpdateFrequency and TargetSmoothFactor properties as indicated.
| Update Method | TargetUpdateFrequency | TargetSmoothFactor | 
|---|---|---|
| Smoothing | 1 | Less than 1 | 
| Periodic | Greater than 1 | 1 | 
| Periodic smoothing (new method in R2020a) | Greater than 1 | Less than 1 | 
The default target update configuration, which is a smoothing update with a TargetSmoothFactor value of 0.001, remains the same.
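The three rows of the table above can be expressed as option objects. The sketch below uses illustrative values for the update frequency and smoothing factor; only the pattern (frequency equal to or greater than 1, factor equal to or less than 1) is taken from the table.

```matlab
% Smoothing: update every step with a small smoothing factor (the default).
smoothingOpt = rlDQNAgentOptions(TargetUpdateFrequency=1, TargetSmoothFactor=1e-3);

% Periodic: copy the critic into the target every 4 steps.
periodicOpt = rlDQNAgentOptions(TargetUpdateFrequency=4, TargetSmoothFactor=1);

% Periodic smoothing: apply a smoothed update every 4 steps.
periodicSmoothingOpt = rlDQNAgentOptions(TargetUpdateFrequency=4, TargetSmoothFactor=1e-2);
```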
This table shows some typical uses of rlDQNAgentOptions and how to update your code to use the new option configuration.
| Not Recommended | Recommended | 
|---|---|
| opt = rlDQNAgentOptions('TargetUpdateMethod',"smoothing"); | opt = rlDQNAgentOptions; |
| opt = rlDQNAgentOptions('TargetUpdateMethod',"periodic"); | opt = rlDQNAgentOptions; opt.TargetUpdateFrequency = 4; opt.TargetSmoothFactor = 1; |
| opt = rlDQNAgentOptions; opt.TargetUpdateMethod = "periodic"; opt.TargetUpdateFrequency = 5; | opt = rlDQNAgentOptions; opt.TargetUpdateFrequency = 5; opt.TargetSmoothFactor = 1; |