
Q-Learning Agents

The Q-learning algorithm is a model-free, online, off-policy reinforcement learning method. A Q-learning agent is a value-based reinforcement learning agent that trains a critic to estimate the return or future rewards.

For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.

Q-learning agents can be trained in environments with the following observation and action spaces.

Observation Space          Action Space
Continuous or discrete     Discrete

Q-learning agents use the following critic representation.

  • Q-value function critic Q(S,A), which you create using rlQValueRepresentation

Q-learning agents do not use an actor.

During training, the agent explores the action space using epsilon-greedy exploration. During each control interval, the agent selects a random action with probability ϵ. Otherwise, with probability 1-ϵ, it selects an action greedily with respect to the value function. This greedy action is the action for which the value function is greatest.
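The selection rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the toolbox implementation: the function name `epsilon_greedy` and the representation of the critic output as a plain list of Q-values (one per discrete action) are assumptions made for the example.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index given the Q-values for the current observation.

    q_values: list of value estimates, one per discrete action (assumed layout).
    epsilon:  probability of choosing a random action instead of the greedy one.
    """
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the greatest estimated value
    # (the first such action if several are tied).
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon` set to 0 the function always returns the greedy action, and with `epsilon` set to 1 it always samples uniformly, which makes the two branches easy to check in isolation.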

Critic Function

To estimate the value function, a Q-learning agent maintains a critic Q(S,A), which is a table or function approximator. The critic takes observation S and action A as inputs and returns the corresponding expectation of the long-term reward.
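For the table case, a critic can be pictured as a lookup keyed by (observation, action) pairs. The Python sketch below is a hypothetical illustration of that idea, not the rlQValueRepresentation object: the class name `TableCritic` and its methods are invented for the example, and entries default to a constant initial value for simplicity.

```python
from collections import defaultdict

class TableCritic:
    """Hypothetical tabular critic mapping (observation, action) to a value estimate."""

    def __init__(self, actions, initial_value=0.0):
        self.actions = list(actions)
        # Entries that have never been written read back as initial_value.
        self._table = defaultdict(lambda: initial_value)

    def value(self, observation, action):
        """Return the current estimate Q(S,A)."""
        return self._table[(observation, action)]

    def set_value(self, observation, action, new_value):
        """Overwrite the estimate Q(S,A)."""
        self._table[(observation, action)] = new_value

    def greedy_action(self, observation):
        """Return the action with the greatest estimate for this observation."""
        return max(self.actions, key=lambda a: self.value(observation, a))
```

A table like this only works when observations are discrete and hashable; for continuous observations, a function approximator would take its place.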

For more information on creating critics for value function approximation, see Create Policy and Value Function Representations.

When training is complete, the trained value function approximator is stored in critic Q(S,A).

Agent Creation

To create a Q-learning agent:

  1. Create a critic using an rlQValueRepresentation object.

  2. Specify agent options using an rlQAgentOptions object.

  3. Create the agent using an rlQAgent object.

Training Algorithm

Q-learning agents use the following training algorithm. To configure the training algorithm, specify options using an rlQAgentOptions object.

  • Initialize the critic Q(S,A) with random values.

  • For each training episode:

    1. Set the initial observation S.

    2. Repeat the following for each step of the episode until S is a terminal state.

      1. For the current observation S, select a random action A with probability ϵ. Otherwise, select the action for which the critic value function is greatest.


        To specify ϵ and its decay rate, use the EpsilonGreedyExploration option.

      2. Execute action A. Observe the reward R and next observation S'.

      3. If S' is a terminal state, set the value function target y to R. Otherwise, set it to

        $$y = R + \gamma \max_{A} Q(S', A)$$

        To set the discount factor γ, use the DiscountFactor option.

      4. Compute the critic parameter update.

        $$\Delta Q = y - Q(S, A)$$

      5. Update the critic using the learning rate α.

        $$Q(S, A) = Q(S, A) + \alpha \, \Delta Q$$

        Specify the learning rate when you create the critic representation by setting the LearnRate option in the rlRepresentationOptions object.

      6. Set the observation S to S'.
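The steps above can be sketched as a small tabular training loop in Python. This is a minimal sketch under stated assumptions, not the toolbox's implementation. The five-state chain environment, the `step` and `train` functions, and all parameter names are hypothetical. The per-step multiplicative decay of ϵ down to a floor is an assumed schedule, chosen only to illustrate a decay rate.

```python
import random

N_STATES = 5      # hypothetical toy task: chain of states 0..4, state 4 is terminal
ACTIONS = (0, 1)  # 0 = move left, 1 = move right

def step(state, action):
    """Toy environment: return (reward, next_state, is_terminal)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return (1.0 if done else 0.0), next_state, done

def train(episodes=300, alpha=0.5, gamma=0.9, epsilon=1.0,
          epsilon_decay=0.01, epsilon_min=0.05, seed=0):
    rng = random.Random(seed)
    # Initialize the critic Q(S,A) with (small) random values.
    q = {(s, a): rng.uniform(-0.01, 0.01)
         for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s = 0          # set the initial observation S
        done = False
        while not done:
            # Step 1: random action with probability epsilon, else greedy action.
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: q[(s, act)])
            # Step 2: execute A, observe reward R and next observation S'.
            r, s_next, done = step(s, a)
            # Step 3: target is R at a terminal state, else R + gamma * max_A Q(S',A).
            if done:
                y = r
            else:
                y = r + gamma * max(q[(s_next, act)] for act in ACTIONS)
            # Steps 4 and 5: move Q(S,A) toward the target using learning rate alpha.
            q[(s, a)] += alpha * (y - q[(s, a)])
            # Step 6: set S to S'.
            s = s_next
            # Assumed schedule: shrink epsilon after each step until it reaches the floor.
            if epsilon > epsilon_min:
                epsilon *= 1 - epsilon_decay
    return q
```

Calling `train()` returns the learned table as a dictionary keyed by (state, action). In this toy task, moving right should end up with the larger value in every non-terminal state, because it is the only action that leads toward the reward.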
