
Struggling to train a DDPG agent - 3DOF aircraft control

14 views (last 30 days)
Hello everyone,
As the title says, I am struggling to train a DDPG agent to behave like a TECS (Total Energy Control System) algorithm. The idea is the following: I have my 3DOF airframe model plus the inner loops that track the commanded pitch angle and the commanded thrust. These two commands must come from the RL agent, so that the airframe holds the same altitude while gaining airspeed.
The observations are:
  1. The reference airspeed and altitude at three different instants of time;
  2. The error signals between the airspeed/altitude and the references.
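For reference, the observation and action specifications are defined more or less like this (the exact dimensions are an assumption from the description above; obsInfo and actInfo are the objects used later when building the agent):
% 8 observations: airspeed and altitude references at 3 instants (6) + the 2 tracking errors
obsInfo = rlNumericSpec([8 1]);
obsInfo.Name = "observations";
% 2 actions: commanded pitch angle (rad) and commanded thrust (N);
% the dynamic saturation limits are enforced inside the model, not in the spec
actInfo = rlNumericSpec([2 1]);
actInfo.Name = "pitch and thrust commands";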
To build the reward functions, I used generateRewardFunction, as suggested in the water tank example here. In particular, I generated two Model Verification blocks:
- Check Step Response Characteristics for the airspeed: I'd like the airspeed to reach a setpoint 2 m/s above the trim value with a given transient shape.
This is the generated reward function:
function spd_reward = SpdTrackRewardFunction(V_out, t, V0)
% REWARDFUNCTION generates rewards from Simulink block specifications.
%
% V_out : Input of RL_Training_TECS_Model/Check Airspeed Response Characteristics
% t     : Simulation time (s)
% Reinforcement Learning Toolbox
% 07-May-2024 12:19:41
%#codegen

%% Specifications from RL_Training_TECS_Model/Check Airspeed Response Characteristics
Block1_x = V_out;
Block1_InitialValue = V0;
Block1_FinalValue = V0 + 2;
Block1_StepTime = 0;
Block1_StepRange = Block1_FinalValue - Block1_InitialValue;
Block1_MinRise = Block1_InitialValue + Block1_StepRange * 80/100;
Block1_MaxSettling = Block1_InitialValue + Block1_StepRange * (1+1/100);
Block1_MinSettling = Block1_InitialValue + Block1_StepRange * (1-1/100);
Block1_MaxOvershoot = Block1_InitialValue + Block1_StepRange * (1+2/100);
Block1_MinUndershoot = Block1_InitialValue - Block1_StepRange * 1/100;

if t >= Block1_StepTime
    if Block1_InitialValue <= Block1_FinalValue
        Block1_UpperBoundTimes = [0,100; 100,max(100+1,t+1)];
        Block1_UpperBoundAmplitudes = [ ...
            Block1_MaxOvershoot,Block1_MaxOvershoot; ...
            Block1_MaxSettling,Block1_MaxSettling];
        Block1_LowerBoundTimes = [0,60; 60,100; 100,max(100+1,t+1)];
        Block1_LowerBoundAmplitudes = [ ...
            Block1_MinUndershoot,Block1_MinUndershoot; ...
            Block1_MinRise,Block1_MinRise; ...
            Block1_MinSettling,Block1_MinSettling];
    else
        Block1_UpperBoundTimes = [0,60; 60,100; 100,max(100+1,t+1)];
        Block1_UpperBoundAmplitudes = [ ...
            Block1_MinUndershoot,Block1_MinUndershoot; ...
            Block1_MinRise,Block1_MinRise; ...
            Block1_MinSettling,Block1_MinSettling];
        Block1_LowerBoundTimes = [0,100; 100,max(100+1,t+1)];
        Block1_LowerBoundAmplitudes = [ ...
            Block1_MaxOvershoot,Block1_MaxOvershoot; ...
            Block1_MaxSettling,Block1_MaxSettling];
    end
    Block1_xmax = zeros(1,size(Block1_UpperBoundTimes,1));
    for idx = 1:numel(Block1_xmax)
        tseg = Block1_UpperBoundTimes(idx,:);
        xseg = Block1_UpperBoundAmplitudes(idx,:);
        Block1_xmax(idx) = interp1(tseg,xseg,t,'linear',NaN);
    end
    if all(isnan(Block1_xmax))
        Block1_xmax = Inf;
    else
        Block1_xmax = max(Block1_xmax,[],'omitnan');
    end
    Block1_xmin = zeros(1,size(Block1_LowerBoundTimes,1));
    for idx = 1:numel(Block1_xmin)
        tseg = Block1_LowerBoundTimes(idx,:);
        xseg = Block1_LowerBoundAmplitudes(idx,:);
        Block1_xmin(idx) = interp1(tseg,xseg,t,'linear',NaN);
    end
    if all(isnan(Block1_xmin))
        Block1_xmin = -Inf;
    else
        Block1_xmin = max(Block1_xmin,[],'omitnan');
    end
else
    Block1_xmin = -Inf;
    Block1_xmax = Inf;
end

%% Penalty function weight (specify nonnegative)
Weight = 1;

%% Compute penalty
% Penalty is computed for violation of linear bound constraints.
%
% Use the following functions to compute penalties on constraints.
% - exteriorPenalty
% - barrierPenalty
% - hyperbolicPenalty
% For exteriorPenalty, specify the penalty method as step or quadratic.
%
% For more information, see help for these functions.
Penalty = sum(exteriorPenalty(Block1_x, Block1_xmin, Block1_xmax, 'step'));

%% Compute reward
spd_reward = -Weight * Penalty;
end
- Check Custom Bounds for the altitude, to enforce a "safe box" of ±1.6 m around the trim altitude.
The following is the generated reward function:
function alt_reward = AltBoundsRewardFunction(Alt_out, t, Alt_hold, t_fin)
% REWARDFUNCTION generates rewards from Simulink block specifications.
%
% Alt_out : Input of RL_Training_TECS_Model/Check Altitude Bounds
% t       : Simulation time (s)
% Reinforcement Learning Toolbox
% 07-May-2024 14:07:07
%#codegen

%% Specifications from RL_Training_TECS_Model/Check Altitude Bounds
Block1_x = Alt_out;
Block1_UpperBoundTimes = [0 t_fin];
Block1_UpperBoundAmplitudes = [Alt_hold+1.6 Alt_hold+1.6];
Block1_xmax = zeros(1,size(Block1_UpperBoundTimes,1));
for idx = 1:numel(Block1_xmax)
    tseg = Block1_UpperBoundTimes(idx,:);
    xseg = Block1_UpperBoundAmplitudes(idx,:);
    Block1_xmax(idx) = interp1(tseg,xseg,t,'linear',NaN);
end
if all(isnan(Block1_xmax))
    Block1_xmax = Inf;
else
    Block1_xmax = max(Block1_xmax,[],'omitnan');
end
Block1_LowerBoundTimes = [0 t_fin];
Block1_LowerBoundAmplitudes = [Alt_hold-1.6 Alt_hold-1.6];
Block1_xmin = zeros(1,size(Block1_LowerBoundTimes,1));
for idx = 1:numel(Block1_xmin)
    tseg = Block1_LowerBoundTimes(idx,:);
    xseg = Block1_LowerBoundAmplitudes(idx,:);
    Block1_xmin(idx) = interp1(tseg,xseg,t,'linear',NaN);
end
if all(isnan(Block1_xmin))
    Block1_xmin = -Inf;
else
    Block1_xmin = max(Block1_xmin,[],'omitnan');
end

%% Penalty function weight (specify nonnegative)
Weight = 1;

%% Compute penalty
% Penalty is computed for violation of linear bound constraints.
%
% Use the following functions to compute penalties on constraints.
% - exteriorPenalty
% - barrierPenalty
% - hyperbolicPenalty
% For exteriorPenalty, specify the penalty method as step or quadratic.
%
% For more information, see help for these functions.
Penalty = sum(exteriorPenalty(Block1_x, Block1_xmin, Block1_xmax, 'step'));

%% Compute reward
alt_reward = -Weight * Penalty;
end
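(Both functions come from calling generateRewardFunction on the corresponding verification blocks, roughly like this:)
generateRewardFunction("RL_Training_TECS_Model/Check Airspeed Response Characteristics");
generateRewardFunction("RL_Training_TECS_Model/Check Altitude Bounds");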
In addition to these two rewards, I also apply two penalty functions when the commanded actions get too close to the saturation boundaries. Indeed, as you can see from the picture of the model, both actions must stay between a minimum and a maximum value. In particular, the minimum for both the pitch and the thrust command is a constant, while the maximum for each depends on the current airspeed/altitude pair, so it is a dynamic bound.
function PitchPenalty = PitchCmdPenalty(u, u_max, u_min)
    % Quadratic penalty (negative reward) growing as the pitch command
    % moves away from the center of its allowed range [u_min, u_max]
    k_PP = -100;
    u_center = (u_max + u_min)/2;
    PitchPenalty = k_PP*(u - u_center)^2;
end

function ThrustPenalty = ThrustCmdPenalty(u, u_max, u_min)
    % Same idea for the thrust command, with a much smaller gain because
    % thrust is in newtons while pitch is in radians
    k_TP = -1e-7;
    u_center = (u_max + u_min)/2;
    ThrustPenalty = k_TP*(u - u_center)^2;
end
The two gains are selected so that the penalty magnitude is roughly comparable for both actions, as you can see.
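The scalar reward the agent sees is the sum of these four terms. As a sketch (the signal names here are placeholders; the actual wiring is done inside the model), the MATLAB Function block computing the reward looks like this:
function r = TotalReward(V_out, Alt_out, t, V0, Alt_hold, t_fin, theta_cmd, theta_max, theta_min, T_cmd, T_max, T_min)
    % Sum of the two tracking rewards and the two saturation penalties
    r = SpdTrackRewardFunction(V_out, t, V0) ...
      + AltBoundsRewardFunction(Alt_out, t, Alt_hold, t_fin) ...
      + PitchCmdPenalty(theta_cmd, theta_max, theta_min) ...
      + ThrustCmdPenalty(T_cmd, T_max, T_min);
end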
The IsDone signal is generated so that the episode stops when the airspeed or the altitude stays outside the boundaries for more than x seconds.
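Schematically it works like this (only a sketch; in the model this is done with Simulink blocks, and the airspeed bounds and x are placeholders here):
function isdone = EpisodeIsDone(V_out, Alt_out, V_lower, V_upper, Alt_hold, Ts, x_seconds)
    % End the episode once the signals have been out of bounds for more than x seconds
    persistent t_outside
    if isempty(t_outside)
        t_outside = 0;
    end
    outside = Alt_out > Alt_hold + 1.6 || Alt_out < Alt_hold - 1.6 || ...
              V_out > V_upper || V_out < V_lower;
    if outside
        t_outside = t_outside + Ts;   % Ts = 0.1 s, the agent sample time
    else
        t_outside = 0;
    end
    isdone = t_outside > x_seconds;
end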
The reset function that initializes each episode generates a random trim condition within a given "reference box" inside the aircraft envelope.
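As a sketch (the agent block name, the variable names and the "reference box" ranges are placeholders), the environment and reset function look like this:
mdl = "RL_Training_TECS_Model";
rlEnv = rlSimulinkEnv(mdl, mdl + "/RL Agent", obsInfo, actInfo);
rlEnv.ResetFcn = @localResetFcn;

function in = localResetFcn(in)
    % Draw a random trim point inside the "reference box" (bounds defined elsewhere)
    V0       = V0_min  + (V0_max  - V0_min)*rand;   % trim airspeed (m/s)
    Alt_hold = Alt_min + (Alt_max - Alt_min)*rand;  % trim altitude (m)
    in = setVariable(in, "V0", V0);
    in = setVariable(in, "Alt_hold", Alt_hold);
end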
The agent and training options are the following:
% Build and configure the DDPG object
ddpg_opt = rlDDPGAgentOptions; % option constructor
ddpg_opt.SampleTime = 0.1;
ddpg_opt.NoiseOptions.StandardDeviation = [0.1; 0.1];
ddpg_opt.NoiseOptions.StandardDeviationDecayRate = 1e-6;
ddpg_opt.NoiseOptions.Variance = 0.015;
ddpg_opt.NoiseOptions.VarianceDecayRate = 1e-6;
ddpg_opt.ExperienceBufferLength = 1e6;
ddpg_opt.MiniBatchSize = 256;
AltHoldTECSAgent = rlDDPGAgent(obsInfo, actInfo, ddpg_opt)
% get the actor
actor = getActor(AltHoldTECSAgent);
actorNet = getModel(actor);
actorLayers = actorNet.Layers;
% configure the learning
learnOptions = rlOptimizerOptions("LearnRate",1e-03,"GradientThreshold",1);
actor.UseDevice = 'cpu';
ddpg_opt.ActorOptimizerOptions = learnOptions;
% get the critic
critic = getCritic(AltHoldTECSAgent);
criticNet = getModel(critic);
criticLayers = criticNet.Layers;
% configure the critic
critic.UseDevice = 'gpu';
ddpg_opt.CriticOptimizerOptions = learnOptions;
AltHoldTECSAgent = rlDDPGAgent(actor, critic, ddpg_opt);
trainOptions = rlTrainingOptions();
trainOptions.MaxEpisodes = 1200;
trainOptions.MaxStepsPerEpisode = 150;
trainOptions.StopTrainingCriteria = "AverageReward";
trainOptions.StopTrainingValue = 0;
trainOptions.ScoreAveragingWindowLength = 5;
trainStats = train(AltHoldTECSAgent, rlEnv, trainOptions);
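For reference, to look at what the agent is commanding I simulate it in the environment with something like:
simOpts = rlSimulationOptions(MaxSteps=150);
experience = sim(rlEnv, AltHoldTECSAgent, simOpts);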
I've started several trainings, but I have stopped them because every time the 'Episode Q0' diverges and the commanded actions end up on the order of 10^6 radians for the pitch and 10^7 newtons for the thrust...
Do you have any suggestions to improve the training? Or any other related examples?
Thank you.

Answers (0)
