PPO | RL | Training for 10000 episodes still does not produce effective learning, and the reward curve is very flat

When training with the PPO reinforcement learning algorithm, the episode reward keeps oscillating over a wide range, and the average reward curve changes very little.
My question is whether this oscillation in episode reward is caused by the large differences in the randomly generated initial environment, by large differences in the reward obtained per episode, or by unreasonable training parameter settings.
The initial environment generates a random location from a distance and an angle:
angle = -4*pi/8 + sign(2*rand(1)-1)*rand(1)*pi/8;   % -pi/2 plus a random offset of up to +/- pi/8 rad
dist  = 10000 + rand(1)*4000;                       % initial distance between 10000 and 14000
The current agent settings and training settings are as follows:
actorOpts = rlOptimizerOptions(LearnRate=1e-4, GradientThreshold=1);
criticOpts = rlOptimizerOptions(LearnRate=1e-4, GradientThreshold=1);
agentOpts = rlPPOAgentOptions(...
    ActorOptimizerOptions=actorOpts, ...
    CriticOptimizerOptions=criticOpts, ...
    ExperienceHorizon=2500, ...
    ClipFactor=0.1, ...
    EntropyLossWeight=0.02, ...
    MiniBatchSize=256, ...
    NumEpoch=9, ...
    AdvantageEstimateMethod="gae", ...
    GAEFactor=0.95, ...
    SampleTime=0.01, ...
    DiscountFactor=0.99);
trainOpts = rlTrainingOptions(...
    'MaxEpisodes',100000, ...
    'MaxStepsPerEpisode',2500, ...
    'Verbose',false, ...
    'StopTrainingCriteria',"AverageReward", ...
    'StopTrainingValue',1000, ...
    'ScoreAveragingWindowLength',1000);
trainOpts.UseParallel = true;
trainOpts.ParallelizationOptions.StepsUntilDataIsSent = 2^11;

Answers (1)

Prasanna on 13 Sep 2024
Hi Yu,
The oscillation of episode rewards in reinforcement learning (RL), particularly with algorithms like Proximal Policy Optimization (PPO), can be caused by several factors. Potential causes include the initial environment setup, the reward function design, the training hyperparameters, the experience horizon, and the mini-batch size.
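As a quick check of the first point, you can sample the reset formulas from your question many times and look at how widely the initial conditions are spread between episodes. A minimal sketch that just reuses your own formulas:
% Sample the random initial conditions many times to see how much they vary
N = 1e4;
angle = -4*pi/8 + sign(2*rand(N,1) - 1).*rand(N,1)*pi/8;  % about -pi/2 plus up to +/- pi/8
dist  = 10000 + rand(N,1)*4000;                           % 10000 to 14000
fprintf('angle: min %.3f, max %.3f rad\n', min(angle), max(angle));
fprintf('dist:  min %.0f, max %.0f\n',     min(dist),  max(dist));
If the spread is large and the per-episode reward depends strongly on these values, part of the oscillation you see may simply reflect the initial conditions rather than the policy.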
To improve the training reward curve, you can consider normalizing the range of initial conditions to reduce variability, increasing the clip factor, and reducing the entropy loss weight. The latter two are properties of ‘rlPPOAgentOptions’, as sketched below. You can also experiment with GAE factor values closer to 1 for more stable advantage estimates, and check that the agent is not overfitting to specific trajectories.
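For example, these properties can be changed directly on your existing options object before recreating the agent. The values below (ClipFactor of 0.2, EntropyLossWeight of 0.01, GAEFactor of 0.97) are only illustrative starting points, not tuned recommendations:
% Illustrative adjustments to the PPO agent options (example values only)
agentOpts.ClipFactor        = 0.2;    % allow somewhat larger policy updates per iteration
agentOpts.EntropyLossWeight = 0.01;   % reduce exploration pressure
agentOpts.GAEFactor         = 0.97;   % closer to 1 for smoother advantage estimates
Similarly, you can temporarily narrow the random ranges of the angle and dist variables in the environment reset to test whether the reward oscillation shrinks.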
By systematically experimenting with these training and agent options and monitoring their impact on training stability, you should be able to reduce the reward oscillations and achieve more consistent results.
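One way to monitor stability between experiments is to look at a moving mean and a moving standard deviation of the episode reward returned by train. A sketch, assuming the agent and env variables from your own script:
% Train and inspect the spread of episode rewards over a moving window
trainingStats = train(agent, env, trainOpts);
r = trainingStats.EpisodeReward;
plot(r); hold on
plot(movmean(r,100), 'LineWidth', 2)   % trend of the reward
plot(movstd(r,100),  'LineWidth', 2)   % size of the oscillation
legend('Episode reward','Moving mean (100 episodes)','Moving std (100 episodes)')
xlabel('Episode'); ylabel('Reward')
A shrinking moving standard deviation across experiments is a simple sign that a parameter change is making training more stable.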
For more information about training Proximal Policy Optimization (PPO) reinforcement learning agents, refer to the MATLAB documentation on PPO agents, in particular the ‘rlPPOAgent’ and ‘rlPPOAgentOptions’ reference pages.
Hope this helps!

Release

R2023b
