PPO | RL | Training for 10000 episodes still does not produce effective learning, and the reward curve is very flat

When training with the PPO reinforcement learning algorithm, the episode reward keeps oscillating over a wide range, and the average reward curve changes very little.
My question is whether this oscillation in episode reward is caused by the large differences in the randomly generated initial environment, by large differences in the reward obtained per episode, or by unreasonable training parameter settings.
The initial environment generates a random location from a distance and an angle:
angle = -4*pi/8 + sign(2*rand(1)-1)*rand(1)*pi/8;   % -pi/2 plus a random offset of up to +/- pi/8 rad
dist  = 10000 + rand(1)*4000;                       % initial distance between 10000 and 14000
The current agent settings and training settings are as follows:
actorOpts = rlOptimizerOptions(LearnRate=1e-4, GradientThreshold=1);
criticOpts = rlOptimizerOptions(LearnRate=1e-4, GradientThreshold=1);
agentOpts = rlPPOAgentOptions(...
    ActorOptimizerOptions=actorOpts, ...
    CriticOptimizerOptions=criticOpts, ...
    ExperienceHorizon=2500, ...
    ClipFactor=0.1, ...
    EntropyLossWeight=0.02, ...
    MiniBatchSize=256, ...
    NumEpoch=9, ...
    AdvantageEstimateMethod="gae", ...
    GAEFactor=0.95, ...
    SampleTime=0.01, ...
    DiscountFactor=0.99);
trainOpts = rlTrainingOptions(...
    'MaxEpisodes',100000, ...
    'MaxStepsPerEpisode',2500, ...
    'Verbose',false, ...
    'StopTrainingCriteria',"AverageReward", ...
    'StopTrainingValue',1000, ...
    'ScoreAveragingWindowLength',1000);
trainOpts.UseParallel = true;
trainOpts.ParallelizationOptions.StepsUntilDataIsSent = 2^11;

Answers (1)

Prasanna on 13 Sep 2024
Hi Yu,
The oscillation of episode rewards in reinforcement learning (RL), particularly with algorithms like Proximal Policy Optimization (PPO), can be caused by several factors. Potential causes include the initial environment setup, the reward function design, the training hyperparameters, the experience horizon, and the mini-batch size.
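As a quick check of the first point, you can sample the reset formulas from your question many times and look at how widely the initial conditions are spread between episodes. A minimal sketch that just reuses your own formulas:
% Sample the random initial conditions many times to see how much they vary
N = 1e4;
angle = -4*pi/8 + sign(2*rand(N,1) - 1).*rand(N,1)*pi/8;  % about -pi/2 plus up to +/- pi/8
dist  = 10000 + rand(N,1)*4000;                           % 10000 to 14000
fprintf('angle: min %.3f, max %.3f rad\n', min(angle), max(angle));
fprintf('dist:  min %.0f, max %.0f\n',     min(dist),  max(dist));
If the spread is large and the per-episode reward depends strongly on these values, part of the oscillation you see may simply reflect the initial conditions rather than the policy.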
To improve the training reward curve, you can consider normalizing the range of initial conditions to reduce variability, increasing the clip factor, and reducing the entropy loss weight. The latter two are properties of ‘rlPPOAgentOptions’, as sketched below. You can also experiment with GAE factor values closer to 1 for more stable advantage estimates, and check that the agent is not overfitting to specific trajectories.
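For example, these properties can be changed directly on your existing options object before recreating the agent. The values below (ClipFactor of 0.2, EntropyLossWeight of 0.01, GAEFactor of 0.97) are only illustrative starting points, not tuned recommendations:
% Illustrative adjustments to the PPO agent options (example values only)
agentOpts.ClipFactor        = 0.2;    % allow somewhat larger policy updates per iteration
agentOpts.EntropyLossWeight = 0.01;   % reduce exploration pressure
agentOpts.GAEFactor         = 0.97;   % closer to 1 for smoother advantage estimates
Similarly, you can temporarily narrow the random ranges of the angle and dist variables in the environment reset to test whether the reward oscillation shrinks.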
By systematically experimenting with these training and agent options and monitoring their impact on training stability, you should be able to reduce the reward oscillations and achieve more consistent results.
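One way to monitor stability between experiments is to look at a moving mean and a moving standard deviation of the episode reward returned by train. A sketch, assuming the agent and env variables from your own script:
% Train and inspect the spread of episode rewards over a moving window
trainingStats = train(agent, env, trainOpts);
r = trainingStats.EpisodeReward;
plot(r); hold on
plot(movmean(r,100), 'LineWidth', 2)   % trend of the reward
plot(movstd(r,100),  'LineWidth', 2)   % size of the oscillation
legend('Episode reward','Moving mean (100 episodes)','Moving std (100 episodes)')
xlabel('Episode'); ylabel('Reward')
A shrinking moving standard deviation across experiments is a simple sign that a parameter change is making training more stable.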
For more information about training Proximal Policy Optimization (PPO) reinforcement learning agents, refer to the MATLAB documentation on PPO agents, in particular the ‘rlPPOAgent’ and ‘rlPPOAgentOptions’ reference pages.
Hope this helps!

Release

R2023b
