Train Reinforcement Learning Agents

Once you have created an environment and reinforcement learning agent, you can train the agent in the environment using the train function. To configure your training, use the rlTrainingOptions function. For example, create a training option set opt, and train agent agent in environment env.

opt = rlTrainingOptions('DiscountFactor',0.95);
trainStats = train(agent,env,opt);

For more information on creating environments and agents, see the corresponding documentation topics, such as Reinforcement Learning Agents.

train updates the agent as training progresses. To preserve the original agent parameters for later use, save the agent to a MAT-file.

save("initialAgent.mat","agent")

Training terminates automatically when the conditions specified in the StopTrainingCriteria and StopTrainingValue options of your rlTrainingOptions object are satisfied. To manually terminate training in progress, press Ctrl+C or, in the Reinforcement Learning Episode Manager, click Stop Training. Because train updates the agent at each episode, you can resume training by calling train(agent,env,opt) again, without losing the trained parameters learned during the first call to train.

Training Algorithm

In general, training performs the following iterative steps (see the code sketch after the list):

  1. Initialize the agent.

  2. For each episode:

    1. Reset the environment.

    2. Get the initial observation s0 from the environment.

    3. Compute the initial action a0 = μ(s0), where μ(s) is the current policy.

    4. Set the current action to the initial action (a ← a0), and set the current observation to the initial observation (s ← s0).

    5. While the episode is not finished or terminated:

      1. Step the environment with action a to obtain the next observation s' and the reward r.

      2. Learn from the experience set (s,a,r,s').

      3. Compute the next action a' = μ(s').

      4. Update the current action with the next action (a ← a') and update the current observation with the next observation (s ← s').

      5. Break if the episode termination conditions defined in the environment are met.

  3. If the training termination condition is met, terminate training. Otherwise, begin the next episode.
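
The following MATLAB-style sketch outlines this loop. It is a schematic illustration, not the actual toolbox implementation; the helper names (reset, step, getAction, learnFromExperience) and maxEpisodes are used here only as placeholders for the corresponding operations.

for episode = 1:maxEpisodes
    s = reset(env);                          % reset environment, get initial observation s0
    a = getAction(agent,s);                  % initial action a0 = mu(s0)
    isDone = false;
    while ~isDone
        [s2,r,isDone] = step(env,a);         % step environment with action a, get s' and r
        learnFromExperience(agent,s,a,r,s2); % learn from the experience (s,a,r,s')
        a = getAction(agent,s2);             % next action a' = mu(s')
        s = s2;                              % s <- s'
    end
end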

The specifics of how the software performs these steps depend on the configuration of the agent and environment. For instance, resetting the environment at the start of each episode can include randomizing initial state values, if you configure your environment to do so. For more information on agents and their training algorithms, see Reinforcement Learning Agents.

Episode Manager

By default, calling the train function opens the Reinforcement Learning Episode Manager, which lets you visualize the training progress. The Episode Manager plot shows the reward for each episode (EpisodeReward) and a running average reward value (AverageReward). Also, for agents that have critics, the plot shows the critic's estimate of the discounted long-term reward at the start of each episode (EpisodeQ0). The Episode Manager also displays various episode and training statistics. This episode and training information is also returned by the train function.

For agents with a critic, Episode Q0 is the estimate of the discounted long-term reward at the start of each episode, given the initial observation of the environment. As training progresses, Episode Q0 should approach the true discounted long-term reward if the critic is well designed.

To turn off the Reinforcement Learning Episode Manager, set the Plots option of rlTrainingOptions to "none".
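
For example, a minimal option set with the Episode Manager plot disabled might look like this (other training options are left at their defaults):

opt = rlTrainingOptions('Plots',"none");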

Save Candidate Agents

During training, you can save candidate agents that meet conditions you specify in SaveAgentCriteria and SaveAgentValue of your rlTrainingOptions object. For instance, you can save any agent whose episode reward exceeds a certain value, even if the overall condition for terminating training is not yet satisfied. For example, to save agents when the episode reward is greater than 100, use:

opt = rlTrainingOptions('SaveAgentCriteria',"EpisodeReward",'SaveAgentValue',100);

train stores saved agents in a MAT-file in the folder you specify using the SaveAgentDirectory option of rlTrainingOptions. Saved agents can be useful, for instance, to allow you to test candidate agents generated during a long-running training process. For details about saving criteria and saving location, see rlTrainingOptions.
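
For instance, you can combine the saving criteria with a save folder in a single option set; the folder name "savedAgents" here is only an illustrative choice:

opt = rlTrainingOptions('SaveAgentCriteria',"EpisodeReward",...
    'SaveAgentValue',100,'SaveAgentDirectory',"savedAgents");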

After training is complete, you can save the final trained agent from the MATLAB® workspace using the save function. For example, save the agent agent to the file finalAgent.mat in the folder specified by the SaveAgentDirectory training option.

save(opt.SaveAgentDirectory + "/finalAgent.mat",'agent')

By default, when DDPG and DQN agents are saved, the experience buffer data is not saved. If you plan to further train your saved agent, you can start training with the previous experience buffer as a starting point. In this case, set the SaveExperienceBufferWithAgent agent option to true. For some agents, such as those with large experience buffers and image-based observations, the memory required for saving their experience buffer is large. In these cases, you must ensure that there is enough memory available for the saved agents.
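
For example, assuming you are creating a DDPG agent, you can enable this option when constructing the agent options (a corresponding option exists for DQN agent options):

agentOpts = rlDDPGAgentOptions('SaveExperienceBufferWithAgent',true);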

Parallel Computing

You can accelerate agent training by running parallel training simulations. If you have:

  • Parallel Computing Toolbox™ software, you can run parallel simulations on multicore computers

  • MATLAB Parallel Server™ software, you can run parallel simulations on computer clusters or cloud resources

When training with parallel computing, the host client sends copies of the agent and environment to each parallel worker. Each worker simulates the agent within the environment and sends its simulation data back to the host. The host agent learns from the data sent by the workers and sends the updated policy parameters back to the workers.

To create a parallel pool of N workers, type:

pool = parpool(N);

If you do not create a parallel pool using parpool, the train function automatically creates one using your default parallel pool preferences. For more information on specifying these preferences, see Specify Your Parallel Preferences (Parallel Computing Toolbox).

For off-policy agents, such as DDPG and DQN, do not use all of your cores for parallel training. For example, if your CPU has six cores, train with four workers. Doing so provides more resources for the host client to compute gradients based on the experiences sent back from the workers. Limiting the number of workers is not necessary for on-policy agents, such as PG and AC, when the gradients are computed on the workers.

For more information on configuring your training to use parallel computing, see UseParallel and ParallelizationOptions in rlTrainingOptions.
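
As a minimal sketch, enabling parallel training can look like the following; the asynchronous mode setting is an illustrative choice, not a requirement:

opt = rlTrainingOptions('UseParallel',true);
opt.ParallelizationOptions.Mode = "async";
trainStats = train(agent,env,opt);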

To benefit from parallel computing, the computational cost for simulating the environment must be relatively expensive compared to the optimization of parameters when sending experiences back to the host. If the simulation of the environment is not expensive enough, the workers idle while waiting for the host to learn and send back updated parameters.

When sending experiences back from the workers, you can improve sample efficiency when the ratio R = (complexity of environment step)/(complexity of learning) is large. If the environment is fast to simulate (R is small), you are unlikely to gain any benefit from experience-based parallelization. If the environment is expensive to simulate but learning is also expensive (for example, if the mini-batch size is large), then you are also unlikely to improve sample efficiency. However, in this case, for off-policy agents, you can reduce the mini-batch size to make R larger, which improves sample efficiency.
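
For example, assuming a DDPG agent, you can reduce the mini-batch size through the agent options; the value 64 is only illustrative:

agentOpts = rlDDPGAgentOptions('MiniBatchSize',64);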

For examples that train agents using parallel computing, see the related examples in the documentation.

GPU Acceleration

When using deep neural network function approximators for your actor or critic representations, you can speed up training by performing representation operations on a GPU rather than a CPU. To do so, set the UseDevice option to "gpu".

opt = rlRepresentationOptions('UseDevice',"gpu");

The size of any performance improvement depends on your specific application and network configuration.

Validate Trained Policy

To validate your trained agent, you can simulate the agent within the training environment using the sim function. To configure the simulation, use rlSimulationOptions.
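
For instance, a minimal validation sketch might look like the following; the MaxSteps value is illustrative, and the total reward is computed from the returned experience data:

simOpts = rlSimulationOptions('MaxSteps',500);   % illustrative step limit
experience = sim(env,agent,simOpts);             % run one simulation episode
totalReward = sum(experience.Reward.Data)        % total reward collected during the episode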

When validating your agent, consider checking how your agent handles conditions it did not encounter during training, such as different initial conditions.

Environment Visualization

If your training environment implements the plot method, you can visualize the environment behavior during training and simulation. If you call plot(env) before training or simulation, where env is your environment object, then the visualization updates during training to allow you to visualize the progress of each episode or simulation. For custom environments, you must implement your own plot method. For more information on creating custom environments with plot functions, see Create Custom MATLAB Environment from Template.
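
For example, a minimal sketch that opens the visualization before validation, assuming your environment env implements plot:

plot(env)                       % open the environment visualization
experience = sim(env,agent);    % the visualization updates during this simulation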
