
rlLSPIAgent

Least-squares policy iteration reinforcement learning agent

Since R2025a

    Description

    The least-squares policy iteration (LSPI) algorithm is an off-policy reinforcement learning method for environments with a discrete action space. Like a Q-learning agent, an LSPI agent trains a Q-value function critic to estimate the value of the optimal policy while following an epsilon-greedy policy based on the value estimated by the critic. The approximation model used by the critic must be a linear-in-the-parameters custom basis function.

    For more information on LSPI agents, see LSPI Agent.

    For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.

    Creation

    Description

    agent = rlLSPIAgent(critic) creates an LSPI agent with the specified custom-value-function-based critic. The AgentOptions property of agent is initialized using default values.


    agent = rlLSPIAgent(critic,agentOptions) also sets the AgentOptions property of agent using the agentOptions argument.
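    For example, the following sketch shows both syntaxes, assuming that critic is an rlQValueFunction object that uses a custom basis function, as described under Input Arguments below.

    % Create the agent using default agent options.
    agent = rlLSPIAgent(critic);

    % Alternatively, configure an options object and pass it at creation time.
    opt = rlLSPIAgentOptions;
    agent = rlLSPIAgent(critic,opt);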

    Input Arguments


    Critic, specified as an rlQValueFunction object that uses a custom basis function as an approximation model. For more information on creating critics, see Create Policies and Value Functions.
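    For example, the following is a minimal sketch of a suitable critic, assuming a two-element observation channel obsInfo and a scalar discrete action channel actInfo (obtained, for instance, using getObservationInfo and getActionInfo). The specific features are illustrative only; the basis function output must be linear in the learnable parameters W0.

    % Custom basis function; the third dimension is the batch dimension.
    basisFcn = @(obs,act) [obs(1,1,:); obs(2,1,:); act(1,1,:); obs(1,1,:).*act(1,1,:)];

    % Initial learnable parameters, one per feature.
    W0 = zeros(4,1);

    % Critic based on the custom basis function.
    critic = rlQValueFunction({basisFcn,W0},obsInfo,actInfo);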

    Properties


    Agent options, specified as an rlLSPIAgentOptions object.
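    For example, you can adjust the options of an existing agent using dot notation, as in the sketch below. The exploration rate of 0.1 is an arbitrary illustrative value.

    agent.AgentOptions.EpsilonGreedyExploration.Epsilon = 0.1;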

    Option to use an exploration policy when selecting actions during simulation or after deployment, specified as a logical value.

    • true — Specify this value to use the base agent exploration policy when you use the agent with the sim and generatePolicyFunction functions. Specifically, in this case, the agent uses the rlEpsilonGreedyPolicy object. The action selection has a random component, so the agent explores its action and observation spaces.

    • false — Specify this value to force the agent to use the base agent greedy policy (the action that maximizes the estimated Q-value) when you use the agent with the sim and generatePolicyFunction functions. Specifically, in this case, the agent uses the rlMaxQPolicy policy. The action selection is greedy, so the policy behaves deterministically and the agent does not explore its action and observation spaces.

    Note

    This option affects only simulation and deployment and does not affect training. When you train an agent using the train function, the agent always uses its exploration policy independently of the value of this property.
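    For example, to keep the agent exploring during simulation, enable the exploration policy before calling sim. This is a brief sketch, assuming env is an environment consistent with the agent specifications.

    agent.UseExplorationPolicy = true;
    experience = sim(env,agent);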

    Observation specifications, specified as an rlFiniteSetSpec or rlNumericSpec object or an array containing a mix of such objects. Each element in the array defines the properties of an environment observation channel, such as its dimensions, data type, and name.

    If you create the agent by specifying a critic, the value of ObservationInfo matches the value specified in the critic object. If you create a default agent, the agent constructor function sets the ObservationInfo property to the input argument observationInfo.

    You can extract observationInfo from an existing environment, function approximator, or agent using getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec or rlNumericSpec.

    Example: [rlNumericSpec([2 1]) rlFiniteSetSpec([3,5,7])]
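    For example, the following sketch extracts the specifications from one of the predefined environments (the same environment used in the example below).

    env = rlPredefinedEnv("DoubleIntegrator-Discrete");
    obsInfo = getObservationInfo(env);
    actInfo = getActionInfo(env);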

    Action specifications, specified as an rlFiniteSetSpec object. This object defines the properties of the environment action channel, such as its dimensions, data type, and name.

    Note

    For this agent, only one action channel is allowed.

    If you create the agent by specifying a critic object, the value of ActionInfo matches the value specified in critic. If you create a default agent, the agent constructor function sets the ActionInfo property to the input argument actionInfo.

    You can extract actionInfo from an existing environment, function approximator, or agent using getActionInfo. You can also construct the specification manually using rlFiniteSetSpec.

    Example: rlFiniteSetSpec([3,-5,7])

    Sample time of the agent, specified as a positive scalar or as -1.

    Within a MATLAB® environment, the agent is executed every time the environment advances, so SampleTime does not affect the timing of the agent execution. In MATLAB environments, if SampleTime is set to -1, the time interval between consecutive elements in the returned output experience is considered equal to 1.

    Within a Simulink® environment, the RL Agent block that uses the agent object executes every SampleTime seconds of simulation time. If SampleTime is set to -1 the block inherits the sample time from its input signals. Set SampleTime to -1 when the block is a child of an event-driven subsystem.

    Set SampleTime to a positive scalar when the block is not a child of an event-driven subsystem. Doing so ensures that the block executes at appropriate intervals when input signal sample times change due to model variations. If SampleTime is a positive scalar, this value is also the time interval between consecutive elements in the output experience returned by sim or train, regardless of the type of environment.

    If SampleTime is set to -1, in Simulink environments, the time interval between consecutive elements in the returned output experience reflects the timing of the events that trigger the RL Agent block execution.

    This property is shared between the agent and the agent options object within the agent. If you change this property in the agent options object, it also changes in the agent, and vice versa.

    Example: SampleTime=-1
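    For example, the following sketch sets a sample time of 0.1 seconds (an arbitrary illustrative value). Because the property is shared, the change is also visible in the agent options object.

    agent.SampleTime = 0.1;
    agent.AgentOptions.SampleTime   % also returns 0.1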

    Object Functions

    train — Train reinforcement learning agents within a specified environment
    sim — Simulate trained reinforcement learning agents within specified environment
    getAction — Obtain action from agent, actor, or policy object given environment observations
    getCritic — Extract critic from reinforcement learning agent
    setCritic — Set critic of reinforcement learning agent
    generatePolicyFunction — Generate MATLAB function that evaluates policy of an agent or policy object
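    For example, the following sketch uses a few of these functions, assuming agent is an LSPI agent and obsInfo is its observation specification.

    critic = getCritic(agent);                          % extract the critic
    agent = setCritic(agent,critic);                    % set a (possibly modified) critic
    act = getAction(agent,{rand(obsInfo.Dimension)});   % action for a random observation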

    Examples


    Create an environment object. For this example, use the same environment as in the example Train PG Agent with Custom Networks to Control Discrete Double Integrator.

    env = rlPredefinedEnv("DoubleIntegrator-Discrete");

    Get observation and action specifications.

    obsInfo = getObservationInfo(env)
    obsInfo = 
      rlNumericSpec with properties:
    
         LowerLimit: -Inf
         UpperLimit: Inf
               Name: "states"
        Description: "x, dx"
          Dimension: [2 1]
           DataType: "double"
    
    
    actInfo = getActionInfo(env)
    actInfo = 
      rlFiniteSetSpec with properties:
    
           Elements: [-2 0 2]
               Name: "force"
        Description: [0×0 string]
          Dimension: [1 1]
           DataType: "double"
    
    

    LSPI agents use a parameterized Q-value function based on a linear-in-the-parameters custom basis function to estimate the value of the policy. A Q-value function takes the current observation and an action as inputs and returns a single scalar as output (the estimated discounted cumulative long-term reward for taking the action from the state corresponding to the current observation, and following the policy thereafter).

    The custom basis function must have two inputs. The first input receives the content of the observation channel, as specified by obsInfo. The second input receives the content of the action channel, as specified by actInfo.

    Create a custom function that returns a vector of nine elements (the feature vector), given an observation and an action as inputs. Here, the third dimension is the batch dimension. For each element of the batch dimension, the output of the basis function is a vector with nine elements.

    myBasisFcn = @(obs,act) [
        obs(1,1,:);
        obs(2,1,:);
        act(1,1,:);
        obs(1,1,:).*act(1,1,:);
        obs(2,1,:).*act(1,1,:);
        obs(1,1,:).*obs(2,1,:);
        obs(1,1,:).^2;
        obs(2,1,:).^2;
        act(1,1,:).^2;
        ];

    The output of the critic is the scalar W'*myBasisFcn(myobs,myact), which represents the estimated value of the observation-action pair under the given policy. Here, W is a weight column vector that must have the same size as the custom function output. The elements of W are the learnable parameters.

    Define an initial parameter vector.

    W0 = ones(9,1);

    Create the critic. The first argument is a two-element cell array containing both the handle to the custom function and the initial weight vector. The second and third arguments are the observation and action specification objects, respectively.

    critic = rlQValueFunction({myBasisFcn,W0},obsInfo,actInfo);
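    As an optional check, you can verify that the critic output for an observation-action pair matches the inner product of the weight vector with the feature vector. The sketch below uses getValue to evaluate the critic; the specific observation and action values are arbitrary.

    obs = rand(2,1);
    act = 2;
    getValue(critic,{obs},{act})
    W0'*myBasisFcn(obs,act)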

    Create an LSPI agent using the approximator object.

    agent = rlLSPIAgent(critic)
    agent = 
      rlLSPIAgent with properties:
    
                AgentOptions: [1×1 rl.option.rlLSPIAgentOptions]
        UseExplorationPolicy: 0
             ObservationInfo: [1×1 rl.util.rlNumericSpec]
                  ActionInfo: [1×1 rl.util.rlFiniteSetSpec]
                  SampleTime: 1
    
    

    Specify an Epsilon value of 0.2.

    agent.AgentOptions.EpsilonGreedyExploration.Epsilon = 0.2;

    To check your agent, use the getAction function to return the action from a random observation.

    act = getAction(agent,{rand(obsInfo.Dimension)});
    act{1}
    ans = 
    2
    

    You can now test and train the agent against the environment.
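    For example, a minimal training sketch follows. The training option values are arbitrary and chosen only for illustration; see rlTrainingOptions for details.

    trainOpts = rlTrainingOptions( ...
        MaxEpisodes=1000, ...
        MaxStepsPerEpisode=200, ...
        StopTrainingCriteria="AverageReward", ...
        StopTrainingValue=-40);
    trainStats = train(agent,env,trainOpts);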

    Version History

    Introduced in R2025a