RL toolbox train on continuous simulation with delay between episodes

I have a Simulink simulation of an environment that runs continuously and cannot be interrupted. I want to train a PPO agent using this simulation. Episodes would have to start and run for a set number of time steps (receiving observations and rewards and sending out actions) and then end without the environment simulation stopping. After an episode ends I would like to have a delay before the next episode starts and during this delay I want to apply a pre-defined control that stabilizes the simulation.
Is there any way learning can be set up like this? If delays are not possible, I'm still very interested in learning on a continuous simulation (continuous in the sense of running for a long time, not in terms of action and observation space).
Thanks for any help on this.

Answers (2)

Hi Joe,
I believe the setup you mention may be possible, but it will require some work. Essentially, you need to set up training as a single very long episode and put the RL Agent block in an enabled subsystem. Once some condition A is met, the enabled subsystem switches OFF and the input to the system is directed by some other source until A no longer holds, at which point the RL Agent becomes active again. The downside is that you would not be able to view the evolution of episode rewards, since you only have one episode.
I would be curious to find out more, though, since what you are describing (one very long episode with a periodic delay during which you stabilize the simulation) is more naturally implemented by having distinct episodes and resetting the states in between. What is the application? Any reason you want to set the problem up that way?

5 Comments

Hi Emmanouil,
thanks for the reply, I will try to implement your suggestion. If my condition A just depends on the value of a timer that, for example, enables the subsystem with the RL agent every 20 s (mod(t,20)==0) and then disables it after 10 s (mod(t+10,20)==0), would this give me 10 s 'episodes' with 10 s of settling time? (I suppose a pulse generator would be the easiest way to implement this.) Do I set the ExperienceHorizon in rlPPOAgentOptions to 10/Ts but the MaxStepsPerEpisode in rlTrainingOptions to "very long time"/Ts?
The reason for setting the problem up this way is that we have a physical test rig running in real time that we don't want to start and stop, but we do want to start episodes from a 'standard' state. In a Simscape model of the rig, I want to simulate as closely as possible what will happen when switching between the training policy and the pre-defined control, to assess whether transitions could result in potentially harmful motions/forces.
I also want to see whether we can train a policy on our Simscape simulation that can then be used on the physical rig (this assumes the model we made of the rig is sufficiently accurate). That way we can train faster than real time, rather than in real time as we would on the rig.
Any thoughts are very much appreciated, including if you think this is not a sensible way of doing things.
Joe
Hello Emmanouil,
To add to Joe's explanation, what we are trying to do is use Simscape to simulate an RL experiment on a real rig that runs continuously without stopping. We plan to do the following:
  1. Put the running rig into a "safe mode" by running a known safe control policy.
  2. Switch to an RL training control for a period of time (which we'll call an episode) and gather data for the agent (PPO) to use to update its control policy.
  3. Switch back to the safe control while the RL agent trains on the data from the last episode, and put the rig back into the initial "safe mode" state, ready for the next RL episode.
  4. Repeat from 2.
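For simulation purposes, the switching in steps 1–3 could be sketched in a MATLAB Function block along these lines (the function, signal, and parameter names here are our own illustration, not from any toolbox):

```matlab
function u = selectControl(t, uRL, uSafe, episodeLen, settleLen)
% Select between RL and safe control on a repeating schedule.
% episodeLen: seconds of RL control per cycle (assumed value, e.g. 10)
% settleLen:  seconds of safe/settling control per cycle (assumed, e.g. 10)
period = episodeLen + settleLen;
if mod(t, period) < episodeLen
    u = uRL;    % RL episode phase: the agent's action drives the rig
else
    u = uSafe;  % settling phase: known safe controller stabilizes the rig
end
end
```

The same mod(t, period) comparison can also drive the Enable port of the subsystem containing the RL Agent block.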
Paul
Thank you, that makes more sense. I thought it might have something to do with training on hardware, since the concept of episodes and resetting state is not straightforward with a physical system.
I think it also makes sense to train with the simulation model first before going to the real hardware. This should in theory reduce the amount of training with the real hardware.
Joe, to your questions: I wouldn't view this setup as having "episodes". My understanding is that the agent keeps track under the hood of how many training steps it has gone through, as well as any experience data it has collected, so even after the delay these will still be in use (whereas the agent would start clean in a new episode). For example, the data from an experience sequence in PPO is discarded after the agent reaches the ExperienceHorizon or the end of an actual episode.
It certainly makes sense to have a very long MaxStepsPerEpisode value. For the ExperienceHorizon, you can sync it up with the 'episode' duration as you are suggesting, but I *think* it wouldn't matter if it's asynchronous, since the data points collected are (St, At, Rt+1, St+1) tuples.
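For concreteness, a minimal sketch of what those option settings might look like (Ts and the durations below are assumed example values, not recommendations):

```matlab
Ts = 0.01;            % agent sample time (assumed)
episodeLen = 10;      % seconds of active RL control per cycle (assumed)

% Sync the experience horizon with the 'episode' duration
agentOpts = rlPPOAgentOptions( ...
    'SampleTime', Ts, ...
    'ExperienceHorizon', episodeLen/Ts);

% One very long 'episode' so the simulation never resets
trainOpts = rlTrainingOptions( ...
    'MaxEpisodes', 1, ...
    'MaxStepsPerEpisode', 1e6);
```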
Unfortunately we don't have an example with enabled subsystems but we would be interested to hear how this works to make appropriate adjustments in the product.
Hi @Emmanouil Tzorakoleftherakis, I'm currently facing a similar issue. I have a model based on Simscape that requires some amount of time to attain a steady state. My objective is to train a reinforcement learning agent once the system is in steady state. Could you suggest a suitable method to pause the training for a period and resume it once the model reaches the desired state in each episode? I don't want to train my RL agent before the model is in steady state.
I would greatly appreciate any assistance.
I think your question is a bit different, so ideally this would be a new thread so that the answer is more discoverable. In any case, since R2022a you can place the RL Agent block inside conditionally executed subsystems, so you can initiate training whenever it makes sense.


Thanks for this. I found the 'Enabled Subsystem' block. As the control input to this subsystem, I guess I can use a Step block that outputs 1 after some time (when I want my agent to start training) and 0 otherwise.
This should solve the problem, right?
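Equivalently, if you prefer to generate that enable signal in a MATLAB Function block fed by a Clock, it could look like this (the function name and startTime parameter are just illustrative assumptions):

```matlab
function enable = trainingEnable(t, startTime)
% Output 1 once simulation time passes startTime, like a Step block;
% before that, output 0 so the RL Agent subsystem stays disabled.
enable = double(t >= startTime);
end
```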

Asked: 8 Feb 2021
Answered: 24 Apr 2023
