How does PPO+LSTM work? Can anyone explain my confusion?
Hello, everyone!
When I read about PPO in the official MATLAB documentation, I found this sentence: “When the agent uses a recurrent neural network, MiniBatchSize is treated as the training trajectory length.”
I'm puzzled: how does PPO+LSTM sample and learn from the current set of experiences?
How should I understand "MiniBatchSize is treated as the training trajectory length"?
Answers (1)
Prasanna
on 22 May 2024
Hi,
When training reinforcement learning (RL) agents, ‘MiniBatchSize’ typically refers to the number of samples from the experience replay buffer that are used for one iteration of learning. In the case of standard (non-recurrent) neural networks, these samples can be randomly selected because the network treats each input independently.
When the documentation mentions that "MiniBatchSize is treated as the training trajectory length" for an agent using an LSTM, it implies a shift in how data samples are structured and used during training:
- Trajectory-based learning: instead of learning from randomly sampled individual experiences, the agent learns from sequences (trajectories) of experiences. Each trajectory consists of states, actions, rewards, and next states that are sequentially connected, reflecting the temporal dependencies inherent in the decision-making process.
- ‘MiniBatchSize’ interpretation: the ‘MiniBatchSize’ value specifies the length of these trajectories. For example, if ‘MiniBatchSize’ is set to 256, the LSTM network is trained on sequences of experiences that are each 256 steps long. This setup allows the LSTM to learn policies that depend on temporal sequences of events.
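To make the distinction concrete, here is a minimal, framework-free Python sketch (the helper names `sample_random` and `sample_trajectory` are hypothetical, not part of MATLAB's implementation) contrasting random per-sample draws with drawing one contiguous trajectory of length `mini_batch_size`:

```python
import random

def sample_random(buffer, mini_batch_size):
    # Non-recurrent case: each experience is treated independently,
    # so a mini-batch is just a random draw from the buffer.
    return random.sample(buffer, mini_batch_size)

def sample_trajectory(buffer, mini_batch_size):
    # Recurrent (LSTM) case: MiniBatchSize acts as the trajectory length,
    # so a mini-batch is one contiguous, temporally ordered sequence.
    start = random.randrange(len(buffer) - mini_batch_size + 1)
    return buffer[start:start + mini_batch_size]

# Toy buffer of (state, action, reward, next_state) tuples
buffer = [(s, s % 2, 1.0, s + 1) for s in range(1000)]

traj = sample_trajectory(buffer, 256)
# Consecutive steps in the trajectory stay temporally connected:
# each step's next_state equals the following step's state.
assert all(traj[i][3] == traj[i + 1][0] for i in range(len(traj) - 1))
```

The key point is that the trajectory variant preserves the chain of state transitions, which is what lets the LSTM's hidden state carry context from step to step during training.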
Therefore, when configuring a PPO agent with an LSTM network in MATLAB using ‘rlPPOAgentOptions’, setting ‘MiniBatchSize’ appropriately is crucial for effective learning. The choice of ‘MiniBatchSize’ affects how well the LSTM can learn from the temporal dependencies in the data. Sequences that are too short may not capture enough temporal context, while sequences that are too long can be computationally expensive and harder to learn from due to the vanishing-gradient problem common in RNNs.
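As a configuration sketch, the option might be set along these lines (the property names are from ‘rlPPOAgentOptions’; the specific values here are illustrative, not recommendations):

```matlab
% Illustrative values only; tune for your own environment.
agentOpts = rlPPOAgentOptions( ...
    MiniBatchSize=256, ...     % with an RNN: trajectory length, not sample count
    ExperienceHorizon=512, ... % steps collected before each learning phase
    NumEpoch=3);               % passes over the collected experiences
```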
For more information, refer to the following documentation:
Hope this helps.
1 Comment
Lance
about 13 hours ago
Hi,
Here are two methods in creating a PPO agent:
Method 1: rlPPOAgent(actor, critic, agentOpts), in which custom actor and critic networks are defined, in my case with LSTMs.
Method 2: rlPPOAgent(obsInfo, actInfo, rlAgentInitializationOptions(UseRNN=true), agentOpts), in which default networks are used (a single LSTM) but RNN is specified as true.
My question is: if Method 1 is used, does the MATLAB PPO algorithm recognize the actor/critic networks as being RNNs? This is important because you cannot use rlAgentInitializationOptions in Method 1, so I am wondering whether MiniBatchSize is treated correctly.
Thanks