rlDDPGAgent learns to generate extreme, low-reward outputs during training.

I have been working on an RL project for data center cooling, and after spending a while setting up the environment the agent is giving me some problems. When I run the agent with the default function values, it gives varied outputs between 0.1 and 1 as expected, but once I start training it quickly diverges to either extreme.
Here's a run without training:
Here's what the first training episode looks like:
After this, running the agent normally shows its outputs are either 1 or 0.1, with none of the jumping around observed during training.
I have seen many responses to similar problems saying the cause is too large a standard deviation, but I've gone as far as setting it to zero and still get similar results.
Here's my initialization code.
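% Observation and action specs: 15-by-1 continuous vectors with the bounds below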
obsInfo = rlNumericSpec([15 1],"LowerLimit",0.01,"UpperLimit",1);
actInfo = rlNumericSpec([15 1],"LowerLimit",0.1,"UpperLimit",1);
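% Agent options: 1 s sample time, 900-step lookahead, and a replay buffer kept between training sessions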
opt = rlDDPGAgentOptions("SampleTime",1,"NumStepsToLookAhead",900,"ResetExperienceBufferBeforeTraining",false,"SaveExperienceBufferWithAgent",true);
opt.NoiseOptions.StandardDeviation = 0.003;
agent = rlDDPGAgent(obsInfo,actInfo,opt);
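% Move the actor and critic onto the GPU and set their learning rates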
actor = getActor(agent);
actor.Options.UseDevice = 'gpu';
actor.Options.LearnRate = 0.05;
critic = getCritic(agent);
critic.Options.UseDevice = 'gpu';
critic.Options.LearnRate = 0.0005;
agent = setActor(agent,actor);
agent = setCritic(agent,critic);
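% Simulink environment and training options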
env = rlSimulinkEnv("collagenSim","collagenSim/Collagen",obsInfo, actInfo);
opts = rlTrainingOptions("MaxStepsPerEpisode",1800,"MaxEpisodes",2000,"StopTrainingValue",3600);
train(agent,env,opts);
This is all quite troubling since I have to turn in the project before long and training already takes long enough. If anyone could help me solve this it would be greatly appreciated.

Answers (1)

Alan on 14 Mar 2024
Hi Genis,
It seems like an interesting application of RL. This answer may be late, since you mentioned you have to turn in the project soon, but let me give it a shot anyway.
It would be easier to find the root cause if you could share your Simulink model, or some more information about the behaviour of the environment.
I am assuming that the reward is based on the error between the observed and target temperatures, and that there are 15 sensors and 15 air conditioners mapped to the observation and action spaces.
There are a few things you can try out:
  1. Increase the update rate of the agent (i.e., reduce its SampleTime): it is possible that the agent acts too infrequently, so by the time a cooler is switched on or off the temperature has already drifted far from the target, pushing the agent towards extreme actions. Also, experiment with the TargetSmoothFactor parameter, which dictates the rate at which the target networks are updated (see the sketch after this list).
  2. Bring the learning rates closer: currently, the learning rates used for the actor and critic networks differ by a factor of 100, which can cause instability during training. Consider bringing them closer together (maybe both around 0.05), as in the sketch after this list.
  3. Better modelling of the reward function: the reward function dictates how the agent learns, so try modifying it to better suit the environment; for example, you could also penalize the coolers for over-cooling and the resulting higher power consumption. A sketch of such a reward follows this list.
  4. Try out other agents: if tuning the parameters does not help, there are various other agents that accept a continuous action space which you could try out (a TD3 example is included in the sketch below): https://www.mathworks.com/help/deeplearning/ug/reinforcement-learning-using-deep-neural-networks.html
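If it helps, here is a minimal sketch of how points 1, 2 and 4 could be applied to your initialization code. It reuses your obsInfo and actInfo, and the numeric values are starting points to experiment with, not tuned settings.
opt = rlDDPGAgentOptions("SampleTime",0.5, ...                % act more often than once per second (point 1)
    "TargetSmoothFactor",1e-3, ...                            % rate at which the target networks track the learned networks (point 1)
    "NumStepsToLookAhead",900, ...
    "ResetExperienceBufferBeforeTraining",false, ...
    "SaveExperienceBufferWithAgent",true);
opt.NoiseOptions.StandardDeviation = 0.003;                   % same exploration noise as in your code
agent = rlDDPGAgent(obsInfo,actInfo,opt);
% Use the same learning rate for actor and critic (point 2), set with the
% getActor/setActor pattern you already use.
actor = getActor(agent);
actor.Options.UseDevice = 'gpu';
actor.Options.LearnRate = 0.05;
critic = getCritic(agent);
critic.Options.UseDevice = 'gpu';
critic.Options.LearnRate = 0.05;
agent = setActor(agent,actor);
agent = setCritic(agent,critic);
% If DDPG keeps saturating, a TD3 agent accepts the same continuous
% observation and action specs (point 4).
td3Agent = rlTD3Agent(obsInfo,actInfo,rlTD3AgentOptions("SampleTime",0.5));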
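For point 3, a shaped reward could look something like the function below. This is purely illustrative: the signal names and the 0.1 weight are placeholders, and in a Simulink environment the same logic would live in whatever block computes the reward signal.
function r = coolingReward(temps, targets, actions)
% Illustrative reward only: penalize the temperature tracking error and
% also the cooling effort, so the agent is not encouraged to pin the
% coolers at an extreme.
trackingPenalty = sum((temps - targets).^2);   % distance from the target temperatures
effortPenalty   = 0.1*sum(actions.^2);         % proxy for the power used by the coolers
r = -(trackingPenalty + effortPenalty);
end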
Also, how the action plot looks should not matter as long as the reward is being maximized.
I hope this helped.
