Policy Gradient with Baseline Reward Oscillation (MATLAB Reinforcement Learning Toolbox)

Question

rakbar on 18 Mar 2020

https://nl.mathworks.com/matlabcentral/answers/511574-policy-gradient-with-baseline-reward-oscillation-matlab-reinforcement-learning-toolbox

Commented: rakbar on 22 Mar 2020

I'm trying to train a Policy Gradient agent with a baseline for my RL research. I'm using MATLAB's built-in Reinforcement Learning Toolbox (https://www.mathworks.com/help/reinforcement-learning/ug/pg-agents.html) and have created my own environment following the MATLAB documentation.

The goal is to train the system to sample an underlying time series (y) given battery constraints, including a battery cost. I also have a prediction model which outputs a prediction of y given an exogenous input time series (X).

Environment setup:

Geophysical time series to be sampled: y
State/Obs. time series: X
X includes some exogenous time series, along with some system info such as battery level, date/time, etc.
N ~ 3 years of hourly data (30k)
A binary action is taken at each time step t. If the action is 0, keep the model prediction; if it is 1, sample the "true" time series (y).

Episodes setup:

Get a random snippet of each time series, each of the episode's length.
Episode start and end points are currently set randomly at the beginning of each episode. Overlaps are OK.
Randomly set the system's initial battery level.
State/Obs. time series:
**At each time step t, the policy receives an input and should determine the action.**
Episodes end if the time step reaches the end of the time series or the system runs out of battery.

Reward function is

where the error term is the RMSE between the sampled time series and the true time series y.
The terminal-state rewards are T1 = -100 if the sensor runs out of battery, and T2 = 100 if the episode reaches its end with RMSE < threshold and some battery level remaining. The goal is to always end in T2.
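The terminal-state logic described above can be sketched as follows. This is only an illustration: all variable names and numeric values are hypothetical, the per-step reward is left out because its formula is not shown here, and a terminal reward of 0 when the episode ends with RMSE above the threshold is an assumption.

```matlab
% Hypothetical values for one time step (all names are illustrative).
batteryLevel  = 0;      % remaining battery after the current action
stepIdx       = 120;    % current time step within the episode
episodeLength = 500;    % number of steps in this episode
rmse          = 0.8;    % RMSE between the sampled series and y so far
rmseThreshold = 1.0;    % assumed acceptance threshold

T1 = -100;  % sensor ran out of battery
T2 =  100;  % reached the end with RMSE < threshold and battery remaining

terminalReward = 0;     % assumed default when neither T1 nor T2 applies
isDone = false;
if batteryLevel <= 0
    % Running out of battery ends the episode with the penalty T1.
    terminalReward = T1;
    isDone = true;
elseif stepIdx >= episodeLength
    % End of the time series: grant T2 only if the error is acceptable.
    isDone = true;
    if rmse < rmseThreshold
        terminalReward = T2;
    end
end
```

In a custom environment, a check like this would typically run at the end of the step function, with terminalReward added to that step's reward.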

RL Code:

obsInfo = getObservationInfo(env);
numObservations = obsInfo.Dimension(1); % 13x1
actInfo = getActionInfo(env);
numActions = numel(actInfo.Elements); % 2x1
learning_rate = 1e-4;
% Actor network: maps the observation vector to one output per discrete action
ActorNetwork_ = [
    imageInputLayer([numObservations 1 1],'Normalization','none','Name','state')
    fullyConnectedLayer(32,'Name','fc1')
    reluLayer
    fullyConnectedLayer(16,'Name','fc2')
    reluLayer
    fullyConnectedLayer(numActions,'Name','action')];
actorOpts = rlRepresentationOptions('LearnRate',learning_rate,'GradientThreshold',1);
ActorNetwork = rlRepresentation(ActorNetwork_,obsInfo,actInfo,'Observation',{'state'},...
                                                      'Action',{'action'},actorOpts);
% Critic (baseline) network: maps the observation vector to a scalar value estimate
CriticNetwork_ = [
    imageInputLayer([numObservations 1 1], 'Normalization', 'none', 'Name', 'state')
    fullyConnectedLayer(32,'Name','fc1')
    reluLayer
    fullyConnectedLayer(16,'Name','fc2')
    reluLayer
    fullyConnectedLayer(1,'Name','action')];
baselineOpts = rlRepresentationOptions('LearnRate',learning_rate,'GradientThreshold',1);
CriticNetwork = rlRepresentation(CriticNetwork_,baselineOpts,'Observation',{'state'},obsInfo);
agentOpts = rlPGAgentOptions(...
    'UseBaseline',true, ...
    'DiscountFactor', 0.99,...
    'EntropyLossWeight',0.2);
agent = rlPGAgent(ActorNetwork,CriticNetwork,agentOpts);
validateEnvironment(env)
%
warning('off','all')
trainOpts = rlTrainingOptions(...
                            'MaxEpisodes', 2500, ...
                            'MaxStepsPerEpisode', envConstants.MaxEpsiodeStesp, ...
                            'Verbose', true, ...
                            'Plots','training-progress',...
                            'StopTrainingCriteria','AverageReward',...
                            'StopTrainingValue',100,...
                            'ScoreAveragingWindowLength',20,...
                            'SaveAgentDirectory',save_path,...
                            'SaveAgentCriteria','AverageReward',... 
                            'SaveAgentValue',-50,...                
                            'UseParallel',true);
trainOpts.ParallelizationOptions.DataToSendFromWorkers = 'Gradients';
trainingStats = train(agent,env,trainOpts);

My current setup mostly uses MATLAB's default RL settings, with a learning rate of 1e-4 and the ADAM optimizer. Training is slow and shows a lot of reward oscillation between the two terminal states. The MATLAB RL toolbox also outputs an Episode Q0 value, which they describe as follows:

Episode Q0 is the estimate of the discounted long-term reward at the start of each episode, given the initial observation of the environment. As training progresses, Episode Q0 should approach the true discounted long-term reward if the critic is well-designed.
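As a small worked example of the quantity Episode Q0 is meant to approach, the discounted long-term reward from the start of an episode can be computed by hand. The reward sequence below is made up for illustration; only the discount factor of 0.99 comes from the agent options above.

```matlab
gamma   = 0.99;                 % DiscountFactor from the rlPGAgentOptions call
rewards = [-1 -1 -1 100];       % hypothetical per-step rewards, ending in T2
k       = 0:numel(rewards)-1;   % step offsets from the start of the episode

% Discounted return from step 0: sum over k of gamma^k * r_k.
G0 = sum((gamma .^ k) .* rewards);   % = 94.0598 for this sequence
```

If the critic were accurate for this initial observation, its Episode Q0 output would be close to G0. A Q0 that stays flat while episode rewards swing between the two terminal outcomes suggests the baseline is not tracking returns.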

**Questions**

Are my training and episodes too random? i.e., time-series of different lengths and random initial sensor setup.
Should I simplify my reward function to be just T2? (probably not)
Why doesn't Q0 change at all?
Why not use DQN? I'll give that a try as well.


Answer 1

Emmanouil Tzorakoleftherakis on 19 Mar 2020

https://nl.mathworks.com/matlabcentral/answers/511574-policy-gradient-with-baseline-reward-oscillation-matlab-reinforcement-learning-toolbox#answer_420953

Hello,

Some suggestions:

1) For a 13-to-2 mapping, maybe you need another set of FC + ReLU layers in your actor

2) Since you have a discrete action space, have you considered trying DQN instead? PG is Monte Carlo based, so training will be slower

3) I wouldn't reduce the reward to just T2 - that would make it very sparse and would make training harder

In terms of randomness, it's not clear to me how the time-series is processed in the environment/how the environment works. How often does the actor take an action? You mentioned your observations are 13x1, so does this mean that you have a 12xlength time series of data coming into the sensor? At each time step, the policy should receive 13 values so I am trying to understand how the time-series is being processed by the policy.
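For suggestion 2, a DQN agent for a 13x1 observation and a binary action might be set up roughly as below. This is a sketch under assumptions, not a tuned configuration: it is written against the R2020a-era API (rlQValueRepresentation, imageInputLayer as the input layer), the specs are created by hand instead of being read from the environment, and the layer sizes and option values are placeholders.

```matlab
% Assumed specs matching the question: 13x1 observation, actions {0, 1}.
obsInfo = rlNumericSpec([13 1]);
actInfo = rlFiniteSetSpec([0 1]);

% Multi-output Q-network: one Q-value per discrete action.
criticNet = [
    imageInputLayer([obsInfo.Dimension(1) 1 1],'Normalization','none','Name','state')
    fullyConnectedLayer(32,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(32,'Name','fc2')
    reluLayer('Name','relu2')
    fullyConnectedLayer(numel(actInfo.Elements),'Name','qvalues')];

criticOpts = rlRepresentationOptions('LearnRate',1e-4,'GradientThreshold',1);
critic = rlQValueRepresentation(criticNet,obsInfo,actInfo, ...
    'Observation',{'state'},criticOpts);

% Placeholder agent options; values are illustrative only.
agentOpts = rlDQNAgentOptions( ...
    'UseDoubleDQN',true, ...
    'DiscountFactor',0.99, ...
    'ExperienceBufferLength',100000, ...
    'MiniBatchSize',64);
agent = rlDQNAgent(critic,agentOpts);
```

Unlike PG, DQN learns from a replay buffer of individual transitions, so it does not need to wait for complete episodes before updating.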


Emmanouil Tzorakoleftherakis on 19 Mar 2020

Thanks for the info. Here are some more comments based on what you added:

1) I wouldn't randomize the episode length, because the episode rewards will not be directly comparable. I would divide the three years of data into equal smaller chunks (there could be overlaps) and randomly pick which chunk to train on at the beginning of each episode

2) I would add more neurons to the FC layers, probably 50 for each (maybe more)

3) In the reward, instead of using the RMSE term, maybe you could use just the error, otherwise the choices from previous time steps are considered too (I am assuming [...] if [...]?)

4) You may not need to use all the states as observations. If nothing else, you should definitely use the RMSE error (or some other error) as an observation. At the end of the day you want the agent to correct your prediction model by sampling when you are off, and this does not seem to be part of your observations.
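The equal-length chunking in point 1 could look something like this. The chunk length and stride are assumptions chosen for illustration, and y and X stand for the series described in the question.

```matlab
rng(0);                         % fixed seed, for a reproducible illustration only
N           = 30000;            % approximate number of hourly samples
chunkLength = 24 * 365;         % assumed fixed episode length (one year, hourly)
stride      = 24 * 30;          % assumed offset between chunk starts, so chunks overlap

% Start index of every full-length chunk that fits inside the data.
chunkStarts = 1:stride:(N - chunkLength + 1);

% At the beginning of an episode, pick one chunk uniformly at random.
c   = randi(numel(chunkStarts));
idx = chunkStarts(c):(chunkStarts(c) + chunkLength - 1);

% y(idx) and X(idx,:) would then supply the data for this episode.
```

Because every episode has the same number of steps, episode rewards become comparable across episodes, which is the point of the suggestion.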

I hope that helps

rakbar on 22 Mar 2020

Thanks!!

1) I updated my RL setup to use DQN, with Actor/Critic each having 2 FC layers of size 68. I'm using rlQValueRepresentation from MATLAB R2020a.

2) I also set all the episodes to have the same length (equivalent to 1 year of data), but with random time-series snippets. The system's initial battery level is, however, still randomly set at the beginning of each episode.

3) Part of the reward, at each time step i, was updated to be

with the rationale that at the end of the episode, this will be the mean absolute (percentage) error.

4) In "practice" I don't know the RMSE, so I input the model's prediction confidence intervals (CI) into the agent as a state/observation (the model's one-step forecast and CI will keep growing and the prediction will worsen).
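The rationale in point 3 can be checked numerically. The sketch below assumes the per-step term is the negative absolute error divided by the episode length (the exact formula is not shown above), and uses made-up data.

```matlab
y    = [2.0 3.0 4.0 5.0];      % hypothetical true series over one episode
yHat = [2.5 3.0 3.0 5.0];      % hypothetical sampled/predicted series
N    = numel(y);               % episode length

stepReward = -abs(y - yHat) / N;     % assumed per-step reward term
episodeSum = sum(stepReward);        % what the agent accumulates over the episode
mae        = mean(abs(y - yHat));    % mean absolute error over the episode

% The accumulated reward is the negative MAE: episodeSum = -0.375 here.
```

For the percentage variant, each term would additionally be divided by abs(y), which needs care wherever y is close to zero.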

Will post results here once done. Thanks again!
