# Custom Training Loop with Simulink Action Noise

This example shows how to tune a controller for vehicle platooning applications using a custom reinforcement learning (RL) training loop. For this application, action noise is generated in the Simulink® model to promote exploration during training.

For an example on tuning a PID-based vehicle platooning system, see Design Controller for Vehicle Platooning (Simulink Control Design).

Platooning has the following control objectives [1].

• Individual vehicle stability — Spacing error for each following vehicle converges to zero if the preceding vehicle is traveling at constant speed.

• String stability — Spacing errors do not amplify as they propagate towards the tail of the vehicle string.

### Platooning Environment Model

In this example, there are five vehicles in the platoon. Every vehicle is modeled as a truck-trailer system with the following parameters. All lengths are in meters.

```L1 = 6; % Truck length L2 = 10; % Trailer length M1 = 1; % Hitch length L = L1 + L2 + M1 + 5; % Desired front-to-front vehicle spacing```

The lead vehicle follows a given acceleration profile. Each trailing vehicle has a controller that controls its acceleration.

```mdl = "fiveVehiclePlatoonEnv"; open_system(mdl)```

The model contains an RL Agent block with its last action input port enabled. This input port allows the specification of custom noise in the Simulink model for off-policy RL agents, such as deep deterministic policy gradient (DDPG) agents.

Specify the path to the RL Agent block.

`agentBlk = mdl + "/RL Agent";`

### Controller Structure

In this example, each trailing vehicle (ego vehicle) has the same continuous-time controller structure and parameterization.

`${\mathit{a}}_{\mathrm{ego}}={\mathit{K}}_{1}{\mathit{a}}_{\mathrm{front}}-{\mathit{K}}_{2}\left({\mathit{v}}_{\mathrm{ego}}-{\mathit{v}}_{\mathrm{front}}\right)-{\mathit{K}}_{3}\left({\mathit{x}}_{\mathrm{ego}}-{\mathit{x}}_{\mathrm{front}}+\mathit{L}\right)$`

Here:

• ${\mathit{a}}_{\mathrm{ego}}$, ${\mathit{v}}_{\mathrm{ego}}$, and ${\mathit{x}}_{\mathrm{ego}}$ are the respective acceleration, velocity, and position of the ego vehicle.

• ${\mathit{a}}_{\mathrm{front}}$, ${\mathit{v}}_{\mathrm{front}}$, and ${\mathit{x}}_{\mathrm{front}}$ are the respective acceleration, velocity, and position of the vehicle directly in front of the ego vehicle.

Each vehicle has full access to its own velocity and position states but can only access the acceleration, velocity, and position of the vehicle directly in front using wireless communication.

The controller minimizes the velocity error ${\mathit{v}}_{\mathrm{front}}-{\mathit{v}}_{\mathrm{ego}}$ using the velocity gain ${\mathit{K}}_{2}$ and minimizes the spacing error ${\mathit{x}}_{\mathrm{ego}}-{\mathit{x}}_{\mathrm{front}}+\mathit{L}$ using the spacing gain ${\mathit{K}}_{3}$. The feedforward gain ${\mathit{K}}_{1}$ is used to improve tracking of the front vehicle.

The lead vehicle' acceleration is assumed to be a sine wave

`${\mathit{a}}_{\mathrm{lead}}=\mathit{A}\text{\hspace{0.17em}}\mathrm{sin}\left(\mathrm{wt}\right)$`

Here:

• ${\mathit{a}}_{\mathrm{lead}}$ is the lead vehicle acceleration (m/s^2).

• $\mathit{A}$ is the amplitude of the sine wave (m/s).

• $\mathit{w}$ is the frequency of the sine wave (rad/s).

• $\mathit{t}$ is the simulation time (s).

### Reinforcement Learning Agent Design

The objective of the agent is to compute adaptive gains so that each vehicle can track the desired spacing with respect to the vehicle immediately in front. Therefore, the model is configured such that:

• The action signal consists of the gains $\mathbit{K}=\left[{\mathit{K}}_{1}\text{\hspace{0.17em}}{\mathit{K}}_{2}\text{\hspace{0.17em}}{\mathit{K}}_{3}\right]$ shared by all vehicles except the lead vehicle. Each gain has a lower bound of 0 and upper bounds of 1, 20, and 20, respectively. The agent calculates new gains once per second. To encourage exploration during training, the gains are perturbed by random noise with a zero-mean normal distribution: ${\mathit{K}}_{\mathrm{noisy}}=\mathit{K}+Ν\left(0,{\sigma }^{2}\right)$ where the variance ${\sigma }^{2}=\left[0.02,0.1,0.1\right]$.

• The observation signal consists of the vehicle spacing ($\mathrm{diff}\left(\mathbit{x}\right)\right)$ minus the target spacing ($\mathit{L}\right)$, the vehicle velocities ($\mathbit{v}$), and the vehicle accelerations ($\mathbit{a}$).

• The reward calculated at every time step $\mathit{t}$ is

`${\mathit{r}}_{\mathit{t}}=\sum _{}\frac{1}{1+{\left({\mathbit{s}}_{\mathbit{t}}-\mathit{L}\right)}^{2}}-0.2\sum _{}{\left({\mathbit{K}}_{\mathbit{t}-1}-{\mathbit{K}}_{\mathbit{t}-2}\right)}^{2}-1.0\text{\hspace{0.17em}}\mathrm{maxOvershoot}\left({\mathbit{s}}_{\mathbit{t}},\mathit{L}\right)-10{\mathit{c}}_{\mathit{t}\text{\hspace{0.17em}}}$`

where:

• ${\mathbit{s}}_{\mathbit{t}}$ is the vehicle spacing ($\mathrm{diff}\left({\mathbit{x}}_{\mathbit{t}}\right)$) at time step $\mathit{t}$.

• $\mathrm{maxOvershoot}\left({\mathbit{s}}_{\mathbit{t}},\mathit{L}\right)$ calculates the max overshoot of all vehicles given the actual vehicle spacing ${\mathit{s}}_{\mathit{t}}$ and the desired spacing $\mathit{L}$. In this case, overshoot is defined as when the vehicle spacing is less than the desired spacing ${\mathbit{s}}_{\mathbit{t}}<\mathit{L}$.

• ${\mathit{c}}_{\mathit{t}}=$$\mathrm{any}\left(\left({\mathbit{s}}_{\mathbit{t}}+\text{\hspace{0.17em}}{\mathit{L}}_{1}+{\mathit{L}}_{2}+{\mathit{M}}_{1}-\mathit{L}\right)\le 0\right)\right)$ indicates if a vehicle collision occurs. The simulation will terminate if ${\mathit{c}}_{\mathit{t}\text{\hspace{0.17em}}}$is true.

The first term in the reward function encourages the vehicle spacing to match $\mathit{L}$. The second term penalizes large changes in gain between time steps. The third term penalizes overshooting the target spacing (getting too close to the front vehicle). Finally, the fourth term penalizes collisions.

For this example, to accommodate the custom noise specified in the model, you implement a custom DDPG training loop.

#### Define Model Parameters

Define the training and simulation parameters that remain fixed during training.

```Ts = 1; % Sample time (seconds) Tf = 100; % Simulation length (seconds) AccelNoiseV = ones(1,5)*0.01; % Acceleration input noise variance VelNoiseV = ones(1,5)*0.01; % Velocity sensor noise variance PosNoiseV = ones(1,5)*0.01; % Position sensor noise variance ParamLowerLimit = [0 0 0]'; % Lower limits for the controller gains ParamUpperLimit = [1 20 20]'; % Upper limits for the controller gains UseParamNoise = 1; % Option to indicate if noise is injected```

Define the parameters that change every training episode. The values for these parameters are updated in the environment reset function `resetFunction`.

```LeadA = 2; % Lead vehicle acceleration amplitude LeadF = 1; % Lead vehicle acceleration frequency ParamNoiseV = [0.02 0.1 0.1]; % Variance for controller gains % Random noise seeds ParamNoiseSeed = 1:3; % Controller gain noise seed AccelNoiseSeed = 1:5 + 100; % Acceleration input noise seed VelNoiseSeed = 1:5 + 200; % Velocity sensor noise seed PosNoiseSeed = 1:5 + 300; % Position sensor noise seed % Initial position and velocity of each vehicle InitialPositions = [200 150 100 50 0] + 50; % Positions InitialVelocities = [10 10 10 10 10]; % Velocities```

#### Create Environment

Create an environment using `rlSimulinkEnv`.

To do so, first define the observation and action specifications for the environment.

```obsInfo = rlNumericSpec([14 1]); actInfo = rlNumericSpec([3 1],... LowerLimit=ParamLowerLimit,... UpperLimit=ParamUpperLimit); obsInfo.Name = "measurements"; actInfo.Name = "control_gains";```

Next, create the environment object.

`env = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo);`

Set the environment reset function to the local function `resetFunction` included with this example. This function varies the training conditions for each episode.

`env.ResetFcn = @resetFunction;`

Noise in the model is specified using the Random Number (Simulink) block. Each block has its own random number generator and thus its own starting seed parameter. To ensure the noise stream varies across episodes, the seed variables are updated using `resetFunction` .

#### Create Actor, Critic, and Policy

Create actor and critic function approximators for the agent using the local function `createNetworks` included with this example.

`[critic,actor] = createNetworks(obsInfo,actInfo);`

Create optimizer objects for updating the actor and critic. Use the same options for both optimizers.

```optimizerOpt = rlOptimizerOptions(... LearnRate=1e-3, ... GradientThreshold=1, ... L2RegularizationFactor=1e-3); criticOptimizer = rlOptimizer(optimizerOpt); actorOptimizer = rlOptimizer(optimizerOpt);```

Create a deterministic policy for the actor approximator.

`policy = rlDeterministicActorPolicy(actor);`

Specify the policy sample time.

`policy.SampleTime = Ts;`

#### Create Experience Buffer

Create an experience buffer for the agent with a maximum length of `1e6`.

`replayMemory = rlReplayMemory(obsInfo,actInfo,1e6);`

#### Data Required for Learning

To update the actor and critic during training, the `runEpisode` function processes each experience as it is received from the environment. For this example, the processing function is the `processExperienceFcn` local function.

This function requires additional data to perform its processing. Create a structure to store this additional data.

```processExpData.Critic = critic; processExpData.TargetCritic = critic; processExpData.Actor = actor; processExpData.TargetActor = actor; processExpData.ReplayMemory = replayMemory; processExpData.CriticOptimizer = criticOptimizer; processExpData.ActorOptimizer = actorOptimizer; processExpData.MiniBatchSize = 128; processExpData.DiscountFactor = 0.99; processExpData.TargetSmoothFactor = 1e-3;```

Each episode, the `processExperienceFcn` function updates the critics, actors, replay memory, and optimizers. The updated data is used as the input for the next episode.

### Training Loop

To train the agent, the custom training loop simulates the agent in the environment for a maximum of `maxEpisodes` episodes.

`maxEpisodes = 1000;`

Compute the maximum number of steps per episode using the simulation time and sample time.

`maxSteps = ceil(Tf/Ts);`

For this custom training loop:

• The `runEpisode` function simulates the agent in the environment for one episode.

• Experiences are processed as they are received from the environment using the `processExperienceFcn` function.

• Experiences are not logged by `runEpisode` since the experiences are processed as they are received.

• To speed up training, when calling `runEpisode`, the `CleanupPostSim` option is set to `false`. Doing so keeps the model compiled between episodes.

• The `PlatooningTrainingCurvePlotter` object is a helper object to plot training data while the training is running.

• You can stop the training using a Stop button in the training plot.

• After all the episodes are complete, the `cleanup` function cleans up the environment and terminates the model compilation.

Training the policy is a computationally intensive process that can take several minutes to hours to complete. To save time while running this example, load a pretrained agent by setting `doTraining` to `false`. To train the policy yourself, set `doTraining` to `true`.

```doTraining = false; if doTraining % Create plotting helper object. plotObj = PlatooningTrainingCurvePlotter(); % Training loop for i = 1:maxEpisodes % Run the episode. out = runEpisode(... env,policy,... MaxSteps=maxSteps,... ProcessExperienceFcn=@processExperienceFcn,... ProcessExperienceData=processExpData,... LogExperiences=false,... CleanupPostSim=false); % Extract episode information for updating the training curves. episodeInfo = out.AgentData.EpisodeInfo; % Extract updated processExpData for the next episode. processExpData = out.AgentData.ProcessExperienceData; % Extract the updated policy for the next episode. policy = out.AgentData.Agent; % Extract the critic and actor networks from processExpData. critic = processExpData.Critic; actor = processExpData.Actor; % Extract the cumulative reward and calculate average reward % per step for this episode. cumulativeRwd = episodeInfo.CumulativeReward; avgRwdPerStep = cumulativeRwd/episodeInfo.StepsTaken; % Evaluate q0 from the initial observation of the episode. obs0 = episodeInfo.InitialObservation; q0 = evaluate(critic,[obs0,evaluate(actor,obs0)]); q0 = double(q0{1}); % Update the plot. update(plotObj,i,avgRwdPerStep,cumulativeRwd,q0); % Exit training if button pushed. if plotObj.StopTraining break; end end % Clean up the environment. cleanup(env); % Save the policy. save("PlatooningDDPGPolicy.mat","policy"); else % Load the pretrained policy. load("PlatooningDDPGPolicy.mat"); end```

### Validate Trained Policy

Validate the learned policy by running five simulations with random initial conditions specified by the reset function.

First, turn off parameter noise in the model.

`UseParamNoise = 0;`

Simulate the model against the trained policy five times.

```N = 5; simOpts = rlSimulationOptions(... MaxSteps=maxSteps, ... NumSimulations=N); experiences = sim(env,policy,simOpts);```

Plot the vehicle spacing error, gains, and reward from the `experiences` output structure.

```f = figure(Position=[100 100 1024 768]); tiledlayout(f,N,3); for i = 1:N % get the spacing tspacing = experiences(i).Observation.measurements.Time; spacing = squeeze(experiences(i).Observation.measurements.Data(1:4,:,:)); % get the gains tgains = experiences(i).Action.control_gains.Time; gains = squeeze(experiences(i).Action.control_gains.Data); % get the reward trwd = experiences(i).Reward.Time; rwd = experiences(i).Reward.Data; % plot the spacing nexttile stairs(tspacing,spacing'); title(sprintf("Vehicle Spacing Error Simulation %u",i)) grid on % plot the gains nexttile stairs(tgains,gains'); title(sprintf("Vehicle Gains Simulation %u",i)) grid on % plot the reward nexttile stairs(trwd,rwd); title(sprintf("Vehicle Reward Simulation %u",i)) grid on end```

From the plots, you can see that the trained policy generates adaptive gains that adequately track the desired spacing for all vehicles.

### Local Functions

The process experience function is called every time an experience is processed by the RL Agent block. Here, `processExperienceFcn` appends the experience to the replay memory, samples a mini-batch of experiences from the replay memory, and updates the critic, actor, and target networks.

```function [policy,processExpData] = processExperienceFcn(exp,episodeInfo,policy,processExpData) % Append experience to replay memory buffer. append(processExpData.ReplayMemory,exp); % Sample a mini-batch of experiences from replay memory. miniBatch = sample(processExpData.ReplayMemory, ... processExpData.MiniBatchSize,... DiscountFactor=processExpData.DiscountFactor); if ~isempty(miniBatch) % Update network parameters using the mini-batch. [processExpData,actorParams] = learnFcn(processExpData,miniBatch); % Update the policy parameters using the actor parameters. policy = setLearnableParameters(policy,actorParams); end end```

The `learnFcn` function updates the critic, actor, and target networks given a sampled mini-batch

```function [processExpData,actorParams] = learnFcn(processExpData,miniBatch) % Find the terminal experiences. doneidx = (miniBatch.IsDone == 1); % Compute target next actions against the next observations. nextAction = evaluate(processExpData.TargetActor,miniBatch.NextObservation); % compute qtarget = reward + gamma*Q(nextObservation,nextAction) % = reward + gamma*expectedFutureReturn targetq = miniBatch.Reward; % Bootstrap the target at nonterminal experiences. expectedFutureReturn = ... getValue(processExpData.TargetCritic,miniBatch.NextObservation,nextAction); targetq(~doneidx) = targetq(~doneidx) + ... processExpData.DiscountFactor.*expectedFutureReturn(~doneidx); % Compute critic gradient using deepCriticLoss function. criticGradient = gradient(processExpData.Critic,@deepCriticLoss,... [miniBatch.Observation,miniBatch.Action],targetq); % Update the critic parameters. [processExpData.Critic,processExpData.CriticOptimizer] = update(... processExpData.CriticOptimizer,processExpData.Critic,... criticGradient); % Compute the actor gradient using the deepActorGradient function. To % accelerate the deepActorGradient function, the critic network is % extracted outside the function and is passed in as a field to the % actorGradData input struct. actorGradData.CriticNet = getModel(processExpData.Critic); actorGradData.MiniBatchSize = processExpData.MiniBatchSize; actorGradient = customGradient(processExpData.Actor,@deepActorGradient,... miniBatch.Observation,actorGradData); % Update the actor parameters. [processExpData.Actor,processExpData.ActorOptimizer] = update(... processExpData.ActorOptimizer,processExpData.Actor,... actorGradient); actorParams = getLearnableParameters(processExpData.Actor); % Update targets using the given TargetSmoothFactor hyperparameter. processExpData.TargetCritic = syncParameters(processExpData.TargetCritic,... processExpData.Critic,processExpData.TargetSmoothFactor); processExpData.TargetActor = syncParameters(processExpData.TargetActor ,... processExpData.Actor ,processExpData.TargetSmoothFactor); end```

The critic gradient is computed against the `deepCriticLoss` function.

```function loss = deepCriticLoss(q,targetq) q = q{1}; % Loss is the half mean-square error of q = Q(observation,action) %against qtarget loss = mse(q,reshape(targetq,size(q))); end```

The actor gradient is computed to maximize the expected value of an observation-action pair given the policy parameters. Here, the negative sign is used to maximize $\mathit{Q}$ with respect to $\theta$.

`$\frac{d}{\mathrm{d}\theta \text{\hspace{0.17em}}}\left(-\frac{1}{\mathit{N}}\sum _{}{\mathit{Q}}_{\varphi }\left(\mathbit{s},\mathbit{a}\right)\right)=\frac{d}{\mathrm{d}\theta \text{\hspace{0.17em}}}\left(-\frac{1}{\mathit{N}}\sum _{}{\mathit{Q}}_{\varphi }\left(\mathbit{s},{\pi }_{\theta }\left(\mathbit{s}\right)\text{\hspace{0.17em}}\right)\right)$`

Here:

• $\mathit{s}$ is the batch observations

• $\mathit{a}$ is the batch actions

• ${\mathit{Q}}_{\varphi }$ is the critic network parameterized by $\varphi$

• ${\pi }_{\theta }$ is the actor network parameterized by $\theta$

• $\mathit{N}$ is the mini batch size

```function dQdTheta = deepActorGradient(actorNet,observation,gradData) % Evaluate actions from current observations. action = forward(actorNet,observation{:}); % Compute: q = Q(s,a) q = predict(gradData.CriticNet,observation{:},action); % Compute: qsum = -sum(q)/N to maximize q qsum = -sum(q,"all")/gradData.MiniBatchSize; % Compute: d(-sum(q)/N)/dActorParams dQdTheta = dlgradient(qsum,actorNet.Learnables); end```

The environment reset function varies the initial conditions, reference trajectory, and noise seeds for every episode.

```function in = resetFunction(in) % Perturb the nominal reference amplitude and frequency. LeadA = max(2 + 0.1*randn,0.1); LeadF = max(1 + 0.1*randn,0.1); % Perturb the nominal spacing. L = 22 + 3*randn; % Perturb the initial states. InitialPositions = [250 200 150 100 50] + 5*randn(1,5); InitialVelocities = [10 10 10 10 10] + 1*randn(1,5); % Update the noise seeds. ParamNoiseSeed = randi(100,1,3); AccelNoiseSeed = randi(100,1,5) + 100; VelNoiseSeed = randi(100,1,5) + 200; PosNoiseSeed = randi(100,1,5) + 300; % Update the model variables. in = setVariable(in,"L",L); in = setVariable(in,"LeadA",LeadA); in = setVariable(in,"LeadF",LeadF); in = setVariable(in,"InitialPositions",InitialPositions); in = setVariable(in,"InitialVelocities",InitialVelocities); in = setVariable(in,"ParamNoiseSeed",ParamNoiseSeed); in = setVariable(in,"AccelNoiseSeed",AccelNoiseSeed); in = setVariable(in,"VelNoiseSeed",VelNoiseSeed); in = setVariable(in,"PosNoiseSeed",PosNoiseSeed); end```

Create the critic and actor networks.

```function [critic,actor] = createNetworks(obsInfo,actInfo) rng(0); hiddenLayerSize = 64; numObs = prod(obsInfo.Dimension); numAct = prod(actInfo.Dimension); % Create the critic network. obsInput = featureInputLayer(numObs, ... Normalization="none", ... Name=obsInfo.Name); actInput = featureInputLayer(numAct, ... Normalization="none", ... Name=actInfo.Name); catPath = [ concatenationLayer(1,2,Name="concat") fullyConnectedLayer(hiddenLayerSize,Name="fc1") reluLayer(Name="relu1") fullyConnectedLayer(hiddenLayerSize,Name="fc2") reluLayer(Name="relu2") fullyConnectedLayer(1,Name="q")]; net = layerGraph(); net = addLayers(net,obsInput); net = addLayers(net,actInput); net = addLayers(net,catPath); net = connectLayers(net,obsInfo.Name,"concat/in1"); net = connectLayers(net,actInfo.Name,"concat/in2"); net = dlnetwork(net); % Create the critic function approximator. critic = rlQValueFunction(net,obsInfo,actInfo); critic = accelerate(critic,true); % Create the actor network. scale = (actInfo.UpperLimit - actInfo.LowerLimit)/2; bias = actInfo.LowerLimit + scale; obsPath = [ featureInputLayer(numObs, ... Normalization="none", ... Name=obsInfo.Name) fullyConnectedLayer(hiddenLayerSize,Name="fc1") reluLayer(Name="relu1") fullyConnectedLayer(numAct,Name="fc2") reluLayer(Name="relu2") fullyConnectedLayer(numAct,Name="fc3") tanhLayer(Name="tanh1") scalingLayer(Scale=scale,... Bias=bias,... Name=actInfo.Name)]; net = layerGraph; net = addLayers(net,obsPath); net = dlnetwork(net); % Create the actor function approximator. actor = rlContinuousDeterministicActor(net,obsInfo,actInfo); actor = accelerate(actor,true); end```

### References

[1] Rajamani, Rajesh. Vehicle Dynamics and Control. 2. ed. Mechanical Engineering Series. New York, NY Heidelberg: Springer, 2012.