The deep Q-network (DQN) algorithm is a model-free, online, off-policy reinforcement learning method. A DQN agent is a value-based reinforcement learning agent that trains a critic to estimate the return or future rewards. DQN is a variant of Q-learning. For more information on Q-learning, see Q-Learning Agents.

For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.

DQN agents can be trained in environments with the following observation and action spaces.

Observation Space | Action Space |
---|---|
Continuous or discrete | Discrete |

During training, the agent:

- Updates the critic properties at each time step during learning.
- Explores the action space using epsilon-greedy exploration. During each control interval, the agent selects a random action with probability *ϵ*; otherwise, with probability 1-*ϵ*, it selects an action greedily with respect to the value function. This greedy action is the action for which the value function is greatest.
- Stores past experience using a circular experience buffer. The agent updates the critic based on a mini-batch of experiences randomly sampled from the buffer.
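The exploration and replay behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not the toolbox implementation: the names `epsilon_greedy` and `CircularBuffer` are hypothetical, the critic is reduced to a plain list of action values, and uniform sampling without replacement is an assumption.

```python
import random
from collections import deque


def epsilon_greedy(q_values, epsilon, rng):
    """Return an action index: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy action: the index with the largest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


class CircularBuffer:
    """Fixed-capacity experience buffer; the oldest entry is overwritten first."""

    def __init__(self, capacity):
        self._data = deque(maxlen=capacity)

    def store(self, experience):
        self._data.append(experience)

    def sample(self, batch_size, rng):
        # Assumption: uniform sampling without replacement.
        return rng.sample(list(self._data), batch_size)

    def __len__(self):
        return len(self._data)


rng = random.Random(0)
buffer = CircularBuffer(capacity=3)
for step in range(5):
    buffer.store((step, 0, 1.0, step + 1))  # experience tuple (S, A, R, S')
batch = buffer.sample(2, rng)
```

With a capacity of 3 and five stored experiences, only the three most recent remain in the buffer, so every sampled experience comes from steps 2 to 4.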

To estimate the value function, a DQN agent maintains two function approximators:

- Critic *Q*(*S*,*A*): The critic takes observation *S* and action *A* as inputs and outputs the corresponding expectation of the long-term reward.
- Target critic *Q'*(*S*,*A*): To improve the stability of the optimization, the agent periodically updates the target critic based on the latest critic parameter values.

Both *Q*(*S*,*A*) and *Q'*(*S*,*A*) have the same structure and parameterization.

For more information on creating critics for value function approximation, see Create Policy and Value Function Representations.

When training is complete, the trained value function approximator is stored in critic *Q*(*S*,*A*).

To create a DQN agent:

1. Create a critic representation object.
2. Specify agent options using the `rlDQNAgentOptions` function.
3. Create the agent using the `rlDQNAgent` function.

For more information, see `rlDQNAgent` and `rlDQNAgentOptions`.

DQN agents use the following training algorithm, in which they update their critic model at each time step. To configure the training algorithm, specify options using `rlDQNAgentOptions`.

- Initialize the critic *Q*(*S*,*A*) with random parameter values $$\theta_Q$$, and initialize the target critic with the same values: $$\theta_{Q'}=\theta_Q$$.

- For each training time step:

   1. For the current observation *S*, select a random action *A* with probability *ϵ*. Otherwise, select the action for which the critic value function is greatest.

      $$A=\arg\max_{A}Q\left(S,A|\theta_Q\right)$$

      To specify *ϵ* and its decay rate, use the `EpsilonGreedyExploration` option.

   2. Execute action *A*. Observe the reward *R* and next observation *S'*.

   3. Store the experience (*S*,*A*,*R*,*S'*) in the experience buffer.

   4. Sample a random mini-batch of *M* experiences $$\left(S_i,A_i,R_i,S'_i\right)$$ from the experience buffer. To specify *M*, use the `MiniBatchSize` option.

   5. If $$S'_i$$ is a terminal state, set the value function target $$y_i$$ to $$R_i$$. Otherwise set it to:

      $$\begin{array}{ll}
      \begin{array}{l}A_{\max}=\arg\max_{A'}Q\left(S'_i,A'|\theta_Q\right)\\ y_i=R_i+\gamma Q'\left(S'_i,A_{\max}|\theta_{Q'}\right)\end{array} & \left(\text{double DQN}\right)\\
       & \\
      y_i=R_i+\gamma\max_{A'}Q'\left(S'_i,A'|\theta_{Q'}\right) & \left(\text{DQN}\right)
      \end{array}$$

      To set the discount factor *γ*, use the `DiscountFactor` option. To use double DQN, set the `UseDoubleDQN` option to `true`.

   6. Update the critic parameters by one-step minimization of the loss *L* across all sampled experiences.

      $$L=\frac{1}{M}\sum_{i=1}^{M}\left(y_i-Q\left(S_i,A_i|\theta_Q\right)\right)^2$$

   7. Update the target critic depending on the target update method (smoothing or periodic). To select the update method, use the `TargetUpdateMethod` option.

      $$\begin{array}{ll}
      \theta_{Q'}=\tau\theta_Q+\left(1-\tau\right)\theta_{Q'} & \left(\text{smoothing}\right)\\
      \theta_{Q'}=\theta_Q & \left(\text{periodic}\right)
      \end{array}$$

      By default the agent uses target smoothing and updates the target critic at every time step using smoothing factor *τ*. To specify the smoothing factor, use the `TargetSmoothFactor` option. Alternatively, you can update the target critic periodically. To specify the number of episodes between target critic updates, use the `TargetUpdateFrequency` option.

   8. Update the probability threshold *ϵ* for selecting a random action based on the decay rate specified in the `EpsilonGreedyExploration` option.
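The target, loss, and target-update formulas in the algorithm above can be checked on a small example. The following Python sketch is illustrative only and is not how the toolbox implements them: the critic and target critic are stand-in lookup tables rather than neural networks, each experience carries an assumed `done` flag to mark a terminal next state, and all function names are hypothetical.

```python
def dqn_targets(batch, q, q_target, gamma, use_double_dqn):
    """Compute y_i for each experience (S, A, R, S_next, done).

    q and q_target map an observation to a list of action values; they
    stand in for the critic Q and the target critic Q'.
    """
    targets = []
    for _, _, reward, next_obs, done in batch:
        if done:
            # Terminal next state: the target is the reward alone.
            targets.append(reward)
            continue
        target_values = q_target(next_obs)
        if use_double_dqn:
            # Double DQN: pick the action with Q, evaluate it with Q'.
            online = q(next_obs)
            a_max = max(range(len(online)), key=lambda a: online[a])
            targets.append(reward + gamma * target_values[a_max])
        else:
            # DQN: pick and evaluate the action with Q'.
            targets.append(reward + gamma * max(target_values))
    return targets


def mse_loss(batch, targets, q):
    """Mean squared error between the targets and Q(S_i, A_i)."""
    errors = [(y - q(exp[0])[exp[1]]) ** 2 for exp, y in zip(batch, targets)]
    return sum(errors) / len(batch)


def smooth_update(theta_target, theta, tau):
    """Smoothing: theta_target <- tau * theta + (1 - tau) * theta_target."""
    return [tau * p + (1.0 - tau) * pt for p, pt in zip(theta, theta_target)]


def periodic_update(theta_target, theta, counter, period):
    """Periodic: copy theta whenever the counter reaches a multiple of period."""
    return list(theta) if counter % period == 0 else theta_target


# Toy critics over two observations (0 and 1) and two actions.
q_table = {0: [1.0, 2.0], 1: [0.5, 3.0]}
q_target_table = {0: [1.0, 1.5], 1: [4.0, 2.0]}
q = q_table.__getitem__
q_target = q_target_table.__getitem__

batch = [(0, 1, 1.0, 1, False), (1, 0, 2.0, 0, True)]
y_dqn = dqn_targets(batch, q, q_target, gamma=0.9, use_double_dqn=False)
y_double = dqn_targets(batch, q, q_target, gamma=0.9, use_double_dqn=True)
loss = mse_loss(batch, y_dqn, q)
theta_target = smooth_update([0.0, 1.0], [1.0, 3.0], tau=0.1)
```

In this toy setup the two target rules disagree on the first experience: the target critic prefers action 0 in the next state, while the critic prefers action 1, so double DQN evaluates a different target-critic value than the plain maximum.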


`rlDQNAgent` | `rlDQNAgentOptions` | `rlRepresentation`