DDPG: Actor clips outputs to zero, thus, keeping exploration minimal

Question

Tobias Michl on 18 Jul 2022


https://nl.mathworks.com/matlabcentral/answers/1762865-ddpg-actor-clips-outputs-to-zero-thus-keeping-exploration-minimal

Edited: Tobias Michl on 18 Jul 2022

I'm training a DDPG agent from the Reinforcement Learning Toolbox to tune a PI controller, so the agent should output the gains P and I. After some initial learning episodes (about 10 to 50) with high values for both P and I, both outputs decrease to zero.

This is followed by one of two cases, which alternate from time to time:

- Both output values stay at zero (marked green in the following picture).
- Output I stays at zero while P takes a very low value (marked purple in the following picture).

The actor is structured as follows:

[
    featureInputLayer(20, 'Normalization', 'none', 'Name', 'state vector')
    fullyConnectedLayer(20, 'Name', 'fc1')
    reluLayer('Name', 'relu1')
    fullyConnectedLayer(256, 'Name', 'fc2')
    reluLayer('Name', 'relu2')
    fullyConnectedLayer(2, 'Name', 'fc3')
    tanhLayer('Name', 'output')];

The PI controller controls a plant described by a transfer function while a timed disturbance occurs. The disturbance is always the same.

The fitness function is the IAE (integral absolute error) of the speed error:
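For reference, the IAE is conventionally defined as the time integral of the absolute error. Written with the symbols used further down (n_ref, n_act, I), it would read as below. The episode duration T is an assumed symbol that does not appear in the post.

```latex
I = \int_{0}^{T} \left| n_{\mathrm{ref}}(t) - n_{\mathrm{act}}(t) \right| \, dt
```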

The reward is then calculated as:

r = r1*(2*exp(r2*I/In)-r3) + p;

Here r1, r2, and r3 are constants, I is the IAE value achieved with the DDPG agent, In is the IAE value of the reference system, and p is a punishment capped to [-15, 0]:

p = -max(|n_ref-n_act|²) * p1;
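The two formulas above can be sketched in plain MATLAB as follows. This is only an illustration: the values of r1, r2, r3, and p1 are not given in the post, so the numbers below (including the negative sign of r2) are placeholders, and the example signals are made up.

```matlab
% Placeholder constants; the post does not state their values.
r1 = 1; r2 = -1; r3 = 1; p1 = 1;

% Punishment: largest squared speed error scaled by p1, clamped to [-15, 0].
punishment = @(nRef, nAct) max(-15, min(0, -max(abs(nRef - nAct).^2) * p1));

% Reward: I is the agent's IAE, In is the IAE of the reference system.
reward = @(I, In, p) r1 * (2*exp(r2 * I / In) - r3) + p;

% Made-up speed signals for illustration only.
nRef = [1.0 1.0 1.0 1.0];
nAct = [0.5 0.9 1.2 1.0];

p = punishment(nRef, nAct);   % largest squared error is 0.25, so p = -0.25
r = reward(2, 4, p);          % 2*exp(-0.5) - 1 + p
fprintf('p = %.4f, r = %.4f\n', p, r);
```

With these placeholder constants, a large speed deviation drives p to the cap of -15, which can dominate the bounded exponential term, so the relative scale of the two parts is worth checking for the real constants.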

What I have done so far:

- tried to recreate a paper's solution
  - the agent should take one action per episode, when the disturbance is detected
  - copied the transfer function, network sizes, observation, and all options (critic, actor, DDPG agent, training)
  - added a flexible punishment (so that the system does not oscillate)
- adjusted the range of the punishment to the range of the reward
- changed the gradient threshold from 'inf' to '1'
- set the lower and upper limits within actionInfo
- set the standard deviation to different values
  - it is currently 0.1
  - for comparison, 1% of the action range is 0.8943 and 10% is 8.943
  - with a standard deviation of 0.8943, I stays at zero; P explores a bit and then stays at its maximum value
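The relation between the noise standard deviation and the action range mentioned above can be checked with a few lines of arithmetic. This sketch assumes an action range of 89.43, which is what the stated 1% value of 0.8943 implies, and assumes both actions share that range. The commented lines show how the value might be applied through the DDPG agent options; treat those option names as an assumption to verify against the installed toolbox release.

```matlab
% Action range implied by the post: 1% of the range is 0.8943.
actionRange = 89.43;

sigma1  = 0.01 * actionRange;   % 1% of the action range
sigma10 = 0.10 * actionRange;   % 10% of the action range

% Share of the action range covered by the current setting of 0.1.
currentSigma = 0.1;
currentShare = currentSigma / actionRange;

fprintf('1%% -> %.4f, 10%% -> %.4f, current -> %.3f%% of range\n', ...
    sigma1, sigma10, 100 * currentShare);

% Possible use with the Reinforcement Learning Toolbox (assumed option names):
% agentOpts = rlDDPGAgentOptions;
% agentOpts.NoiseOptions.StandardDeviation = sigma1;
```

Seen this way, the current value of 0.1 corresponds to roughly 0.11% of the action range, well below the 1% setting that was also tried.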
