The SARSA algorithm is a model-free, online, on-policy reinforcement learning method. A SARSA agent is a value-based reinforcement learning agent that trains a critic to estimate the return, or future rewards.

For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.

SARSA agents can be trained in environments with the following observation and action spaces.

| Observation Space | Action Space |
| --- | --- |
| Continuous or discrete | Discrete |

During training, the agent explores the action space using epsilon-greedy exploration.
During each control interval, the agent selects a random action with probability
*ϵ*; otherwise, with probability 1-*ϵ*, it selects the action greedily with respect
to the value function. The greedy action is the action for which the value function
estimate is greatest.
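As a minimal sketch of this selection rule for a tabular critic (the variable names `Q`, `S`, and `epsilon` here are illustrative, not part of the toolbox API):

```matlab
% Epsilon-greedy action selection over a discrete action set.
% Q is an nS-by-nA table of value estimates; S is the current state index.
if rand < epsilon
    A = randi(size(Q, 2));      % explore: uniform random action
else
    [~, A] = max(Q(S, :));      % exploit: greedy action, argmax over Q(S,:)
end
```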

To estimate the value function, a SARSA agent maintains a critic
*Q*(*S*,*A*), which is a table or
function approximator. The critic takes observation *S* and action
*A* as inputs and outputs the corresponding expectation of the long-term
reward.

For more information on creating critics for value function approximation, see Create Policy and Value Function Representations.
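For example, a table-based critic might be constructed as follows. This is a hedged sketch assuming a hypothetical environment `env` with discrete observation and action spaces and the `rlTable`/`rlRepresentation` workflow this page references; verify the exact calls against your toolbox release.

```matlab
% Sketch: build a table-based critic Q(S,A) from an environment's
% observation and action specifications. env is a hypothetical,
% previously created environment object.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
qTable  = rlTable(obsInfo, actInfo);   % one Q value per (S,A) pair
critic  = rlRepresentation(qTable);    % wrap the table as a critic
critic.Options.LearnRate = 0.1;        % learning rate alpha (see algorithm below)
```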

When training is complete, the trained value function approximator is stored in critic
*Q*(*S*,*A*).

To create a SARSA agent, first create a critic representation object. Then, using this representation, create the agent using the `rlSARSAAgent` function.
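For instance, continuing from the critic sketch above:

```matlab
% Sketch: create a SARSA agent from the critic representation,
% using default agent options.
agent = rlSARSAAgent(critic);
```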

SARSA agents use the following training algorithm. To configure the training algorithm, specify options using `rlSARSAAgentOptions`.
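As a sketch of how such options might be set (the property names follow the `rlSARSAAgentOptions` documentation; the specific values here are illustrative):

```matlab
% Sketch: configure exploration and discounting, then create the agent.
opt = rlSARSAAgentOptions;
opt.DiscountFactor = 0.99;                          % gamma in the target below
opt.EpsilonGreedyExploration.Epsilon      = 1;      % initial exploration rate
opt.EpsilonGreedyExploration.EpsilonMin   = 0.01;   % lower bound on epsilon
opt.EpsilonGreedyExploration.EpsilonDecay = 0.005;  % per-step decay
agent = rlSARSAAgent(critic, opt);
```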

1. Initialize the critic *Q*(*S*,*A*) with random values.

2. For each training episode:

   1. Set the initial observation *S*.

   2. For the current observation *S*, select a random action *A* with probability *ϵ*. Otherwise, select the action for which the critic value function is greatest.

      $$A=\underset{A}{\mathrm{arg\,max}}\,Q\left(S,A\right)$$

      To specify *ϵ* and its decay rate, use the `EpsilonGreedyExploration` option.

   3. Repeat the following for each step of the episode until *S'* is a terminal state:

      1. Execute action *A*. Observe the reward *R* and the next observation *S'*.

      2. Select an action *A'* by following the policy from state *S'*. Because SARSA is on-policy, this is the same epsilon-greedy policy used above: a random action with probability *ϵ*, and otherwise the greedy action

         $$A'=\underset{A'}{\mathrm{arg\,max}}\,Q\left(S',A'\right)$$

      3. If *S'* is a terminal state, set the value function target *y* to *R*. Otherwise, set it to:

         $$y=R+\gamma Q\left(S',A'\right)$$

         To set the discount factor *γ*, use the `DiscountFactor` option.

      4. Compute the critic parameter update.

         $$\Delta Q=y-Q\left(S,A\right)$$

      5. Update the critic using the learning rate *α*.

         $$Q\left(S,A\right)=Q\left(S,A\right)+\alpha\,\Delta Q$$

         Specify the learning rate when you create the critic representation by setting the `LearnRate` option in the `rlRepresentationOptions` object.

      6. Set the observation *S* to *S'*.

      7. Set the action *A* to *A'*.
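The following plain-MATLAB sketch mirrors these steps for a tabular critic. It illustrates the algorithm, not the toolbox's internal implementation; the environment interface (`envStep`), the state and action counts (`nS`, `nA`), and the hyperparameter variables (`epsilon`, `gamma`, `alpha`) are all hypothetical and would come from your own setup.

```matlab
% Tabular SARSA loop mirroring the algorithm steps above.
Q = rand(nS, nA);                        % initialize critic with random values
for episode = 1:numEpisodes              % for each training episode
    S = S0;                              % set the initial observation
    A = selectAction(Q, S, epsilon);     % epsilon-greedy initial action
    done = false;
    while ~done                          % repeat until a terminal state
        [Sp, R, done] = envStep(S, A);   % execute A; observe R and S'
        Ap = selectAction(Q, Sp, epsilon);          % on-policy choice of A'
        if done
            y = R;                       % terminal target
        else
            y = R + gamma * Q(Sp, Ap);   % y = R + gamma*Q(S',A')
        end
        Q(S, A) = Q(S, A) + alpha * (y - Q(S, A)); % critic update
        S = Sp;                          % S <- S'
        A = Ap;                          % A <- A'
    end
end

function A = selectAction(Q, S, epsilon)
% Epsilon-greedy selection over the action dimension of the Q table.
if rand < epsilon
    A = randi(size(Q, 2));               % explore
else
    [~, A] = max(Q(S, :));               % exploit
end
end
```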

See Also: `rlRepresentation` | `rlSARSAAgentOptions`