RL toolbox train on continuous simulation with delay between episodes
I have a Simulink simulation of an environment that runs continuously and cannot be interrupted. I want to train a PPO agent using this simulation. Episodes would have to start and run for a set number of time steps (receiving observations and rewards and sending out actions) and then end without the environment simulation stopping. After an episode ends I would like to have a delay before the next episode starts and during this delay I want to apply a pre-defined control that stabilizes the simulation.
Is there any way learning can be set up in this way? If delays are not possible I'm still very interested in learning on a continuous simulation (continuous in terms of ongoing for a long time, not in terms of action and observation space)
Thanks for any help on this
Answers (2)
Emmanouil Tzorakoleftherakis
on 9 Feb 2021
0 votes
Hi Joe,
I believe the setup you mention may be possible, but it will require some work. Essentially, you need to set up training as a single very long episode and put the RL Agent block in an enabled subsystem. While some condition A holds, the enabled subsystem is OFF and the input to the system is supplied by some other source; once A no longer holds, the RL Agent becomes active again. The downside is that you would not be able to view the evolution of episode rewards, since you only have 1 episode.
I would be curious to find out more, though, since what you are describing (having one very long episode and periodically introducing a delay during which you stabilize the simulation) is more naturally implemented by having distinct episodes and resetting the states in between. What is the application? Any reason you want to set up the problem that way?
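For reference, the single-long-episode setup described above can be configured from MATLAB roughly as follows. This is only a sketch: `env` and `agent` stand in for your own Simulink environment and PPO agent objects, and the step count is an arbitrary placeholder.

```matlab
% Train for one very long "episode"; pseudo-episode boundaries are then
% handled inside the model (enabled subsystem + safe controller),
% not by the toolbox.
trainOpts = rlTrainingOptions( ...
    'MaxEpisodes',          1, ...      % a single continuous run
    'MaxStepsPerEpisode',   1e6, ...    % long enough to cover all pseudo-episodes
    'StopTrainingCriteria', 'EpisodeCount', ...
    'StopTrainingValue',    1);         % stop after that one episode

trainingStats = train(agent, env, trainOpts);
```

As noted above, with this setup `trainingStats` will contain only one episode-reward entry, so any per-pseudo-episode diagnostics would have to be logged from inside the model.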
5 Comments
Joseph van 't Hoff
on 10 Feb 2021
Paul Stansell
on 10 Feb 2021
Edited: Paul Stansell
on 10 Feb 2021
Hello Emmanouil,
To add to Jos' explanation, what we are trying to do is use Simscape to simulate an RL experiment on a real rig that runs continuously without stopping. On the rig we plan to do the following:
1. Put the running rig into a "safe mode" by running a known safe control policy.
2. Switch to an RL training control for a period of time (which we'll call an episode) and gather data for the agent (PPO) to use to update its control policy.
3. Switch back to the safe control while the RL agent trains on the data from the last episode, and put the rig back into the initial "safe mode" state, ready for the next RL episode.
4. Repeat from step 2.
Paul
Emmanouil Tzorakoleftherakis
on 10 Feb 2021
Thank you that makes more sense. I thought it would have something to do with training on hardware since the concept of episodes and resetting state with a physical system is not straightforward.
I think it also makes sense to train with the simulation model first before going to the real hardware. This should in theory reduce the amount of training with the real hardware.
Joe, to your questions: I wouldn't view this setup as having "episodes". My understanding is that the agent keeps track, under the hood, of how many training steps it has gone through, as well as any experience data it has collected, so even after the delay these will still be in use (whereas the agent would start clean in a new episode). For example, the data from an experience sequence in PPO will be discarded after the agent reaches the ExperienceHorizon or the end of an actual episode.
It certainly makes sense to use a very long MaxStepsPerEpisode value. For the ExperienceHorizon, you could sync it with the 'episode' duration as you are suggesting, but I *think* it wouldn't matter if it's asynchronous, since the data points collected are (St, At, Rt+1, St+1).
Unfortunately we don't have an example with enabled subsystems but we would be interested to hear how this works to make appropriate adjustments in the product.
Bipin Paudel
on 19 Apr 2023
Hi @Emmanouil Tzorakoleftherakis, I'm currently facing a similar issue. I have a Simscape-based model that requires some time to reach a steady state. My objective is to train a reinforcement learning agent once the system is in that steady state. Could you suggest a suitable way to pause training for a period and resume it once the model reaches the desired state in each episode? I don't want to train my RL agent before the model is in steady state.
I would greatly appreciate any assistance.
Emmanouil Tzorakoleftherakis
on 24 Apr 2023
I think your question is a bit different, so ideally this would be a new thread so that the answer is more discoverable. In any case, since R2022a you can place the RL Agent block inside conditionally executed subsystems, so you can initiate training whenever it makes sense.
Bipin Paudel
on 24 Apr 2023
0 votes
Thanks for this. I found a subsystem called 'Enabled Subsystem'. As the control input to this subsystem, I guess I can use a Step block that outputs 0 until the time I want my agent to start training, and 1 afterwards.
This should solve the problem, right?
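That wiring can also be set up programmatically. A sketch with hypothetical model and block names (replace them with your own):

```matlab
% Sketch (hypothetical names): drive the Enabled Subsystem's enable port
% from a Step block that switches 0 -> 1 once the Simscape model has had
% time to reach steady state, so the RL Agent only trains after that point.
set_param('myModel/Enable Step', ...
    'Time',   '50', ...  % seconds to wait for steady state (assumption)
    'Before', '0',  ...  % subsystem (and RL Agent block) disabled
    'After',  '1');      % training active from t = 50 s onward
```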