What Is Reinforcement Learning?
3 things you need to know
Reinforcement learning is a machine learning technique in which a computer agent learns to perform a task through repeated trial-and-error interactions with an environment. This learning approach enables the agent to make a series of decisions that maximize a reward metric for the task, without human intervention and without being explicitly programmed to achieve the task.
The typical training mechanism behind reinforcement learning reflects many real-world scenarios. Consider, for example, pet training through positive reinforcement.
Using reinforcement learning terminology, the goal of learning in this case is to train the dog (agent) to complete a task within an environment, which includes the surroundings of the dog as well as the trainer. First, the trainer issues a command or cue, which the dog observes (observation). The dog then responds by taking an action. If the action is close to the desired behavior, the trainer will likely provide a reward, such as a food treat or a toy; otherwise, no reward will be provided. At the beginning of training, the dog will likely take more random actions, such as rolling over when the command given is “sit,” as it tries to associate specific observations with actions and rewards. This association, or mapping, between observations and actions is called the policy. From the dog’s perspective, the ideal case would be one in which it responds correctly to every cue so that it gets as many treats as possible. So, the whole point of reinforcement learning training is to “tune” the dog’s policy so that it learns the desired behaviors that maximize some reward. After training is complete, the dog should be able to observe the trainer and take the appropriate action, for example, sitting when commanded to “sit,” using the internal policy it has developed. By this point, treats are welcome but, theoretically, shouldn’t be necessary.
Keeping in mind the dog training example, consider the task of parking a vehicle using an automated driving system. The goal is to teach the vehicle computer (agent) to park in the correct parking spot with reinforcement learning. As in the dog training case, the environment is everything outside the agent and could include the dynamics of the vehicle, other vehicles that may be nearby, weather conditions, and so on. During training, the agent uses readings from sensors such as cameras, GPS, and lidar (observations) to generate steering, braking, and acceleration commands (actions). To learn how to generate the correct actions from the observations (policy tuning), the agent repeatedly tries to park the vehicle using a trial-and-error process. A reward signal can be provided to evaluate the goodness of a trial and to guide the learning process.
In the dog training example, training is happening inside the dog’s brain. In the autonomous parking example, training is handled by a training algorithm. The training algorithm is responsible for tuning the agent’s policy based on the collected sensor readings, actions, and rewards. After training is complete, the vehicle’s computer should be able to park using only the tuned policy and sensor readings. Note that both of these are examples of model-free reinforcement learning as they both involve trial-and-error interactions with the environment for data generation.
The main components of reinforcement learning (policy, environment, agent, actions, rewards, and observations) are formalized within a framework known as a Markov decision process, which provides a mathematical model for decision-making in environments with uncertainty.
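In standard notation, an MDP is the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$: a state space, an action space, transition probabilities $P(s' \mid s, a)$, a reward function $R(s, a)$, and a discount factor $0 \le \gamma < 1$. The agent’s goal of “maximizing a reward metric” then has a precise form: find a policy $\pi$ that maximizes the expected discounted return

$$J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1}\right].$$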
The trade-off between exploration and exploitation is a critical aspect of reinforcement learning and can greatly affect the quality of learning. The idea is this: Should the agent exploit the environment by choosing the actions it already knows collect the most reward, or should it choose actions that explore parts of the environment that are still unknown? The choices the agent makes determine the information it receives and, therefore, the information from which it can learn. Too much exploration and the agent may never converge to a good policy. Too much exploitation and the agent may get stuck in local, suboptimal solutions. In general, it makes sense for an agent to explore more at the beginning of learning, when there is not enough information to exploit, and to gradually shift toward exploitation as learning progresses.
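A common way to manage this trade-off in practice is an epsilon-greedy rule: with probability epsilon the agent takes a random (exploratory) action, otherwise it takes the best action it currently knows, and epsilon is decayed over time. The plain MATLAB sketch below is purely illustrative; Q, stateIdx, and numActions are placeholder names, not part of any toolbox API.

% Illustrative epsilon-greedy action selection with decay (placeholder variable names)
epsilon    = 1.0;       % start fully exploratory
epsilonMin = 0.05;      % keep a small amount of exploration
decayRate  = 0.995;     % per-step decay factor

if rand < epsilon
    action = randi(numActions);          % explore: pick a random action index
else
    [~, action] = max(Q(stateIdx, :));   % exploit: pick the best known action for this state
end
epsilon = max(epsilonMin, epsilon*decayRate);   % gradually shift from exploration to exploitation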
Deep reinforcement learning combines reinforcement learning and deep learning. While for simpler problems a policy in the form of a lookup table may be sufficient, this approach does not scale well to large problems or problems with continuous states and actions. Deep neural networks trained with deep reinforcement learning can encode complex behaviors, providing an alternative approach for applications that are otherwise intractable or more challenging to tackle with traditional methods. For example, in autonomous driving, a neural network can replace the driver and decide how to turn the steering wheel by simultaneously looking at multiple sensors, such as camera frames and lidar measurements. Without neural networks, the problem would normally be broken down into smaller pieces (extracting features from camera frames, filtering the lidar measurements, fusing the sensor outputs, and making “driving” decisions based on the fused inputs) that are easier to solve with traditional policy representations, such as lookup tables or polynomial functions.
The general workflow for training an agent using reinforcement learning includes the following steps: creating the environment, defining the reward signal, creating the agent, training and validating the agent, and deploying the trained policy.
Training an agent using reinforcement learning is an iterative process. Decisions and results in later stages can require you to return to an earlier stage in the learning workflow. If the training process does not converge to an optimal policy within a reasonable amount of time, you may have to go back and take another look at the problem definition (dynamics, observations, actions), reward signal, policy architecture, and algorithm hyperparameters before training again.
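As a concrete illustration of this workflow, the sketch below runs through the steps for one of the predefined environments that ships with Reinforcement Learning Toolbox. The agent type, stopping criteria, and episode counts are arbitrary choices made for the example, not recommendations.

% Illustrative end-to-end workflow with Reinforcement Learning Toolbox
env = rlPredefinedEnv("CartPole-Discrete");    % 1. Create the environment (its reward is predefined)

obsInfo = getObservationInfo(env);             % 2. Inspect the observation and action specifications
actInfo = getActionInfo(env);

agent = rlDQNAgent(obsInfo, actInfo);          % 3. Create an agent with default networks

trainOpts = rlTrainingOptions( ...             % 4. Train the agent
    MaxEpisodes=500, ...
    StopTrainingCriteria="AverageReward", ...
    StopTrainingValue=480);
trainStats = train(agent, env, trainOpts);

simOpts = rlSimulationOptions(MaxSteps=500);   % 5. Validate the trained policy in simulation
experience = sim(env, agent, simOpts);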
Unlike unsupervised and supervised machine learning, reinforcement learning does not rely on a static data set; it operates in a dynamic environment and learns from collected experiences. Data points, or experiences, are collected during training through trial-and-error interactions between the agent and the environment. This aspect of reinforcement learning is important because it alleviates the need for the data collection, preprocessing, and labeling that are otherwise necessary before training in supervised and unsupervised learning.
Deep learning spans all three types of machine learning; reinforcement learning and deep learning are not mutually exclusive. Complex reinforcement learning problems often rely on deep neural networks and deep reinforcement learning.
Reinforcement learning algorithms can be organized into several categories based on their approaches to learning and decision-making.
Common ways to distinguish these algorithms include: model-based versus model-free, online versus offline, on-policy versus off-policy, gradient-based versus evolutionary, tabular versus neural-network-based, and single-agent versus multi-agent.
Model-based reinforcement learning builds or uses a model of the environment’s dynamics (transition probabilities and reward functions) to plan and make decisions. The word “plan” is key; these algorithms typically need fewer interactions with the environment, or none at all, because they rely on their internal model to simulate future states. The internal model can be provided a priori (in which case the agent does not need to interact with the environment at all) or can be learned from data collected through interactions with the actual environment. Model-based reinforcement learning is usually more sample-efficient than model-free reinforcement learning because the model can be used to quickly generate large sets of training data. However, when the internal model must be learned rather than provided, model-based methods can require many more computational resources than model-free ones because, in addition to training the agent, they must also train the environment model and use it to generate training data.
Unlike model-based algorithms, model-free methods do not construct an explicit model of the environment. Instead, they learn optimal actions through direct interaction, relying on trial and error (recall the dog training and automated parking examples from the previous section). This approach is simpler and better suited to high-dimensional or unstructured environments, though it’s typically less efficient in terms of data usage. The majority of state-of-the-art reinforcement learning algorithms are model-free.
Online reinforcement learning involves an agent that actively interacts with the environment during learning—collecting experiences, updating its policy, and adapting continuously as new data arrives. The dog training and automated parking scenarios described above are examples of online reinforcement learning.
In contrast, offline (or batch) reinforcement learning learns solely from a static data set of logged experiences (e.g., from human demonstrations or past policies), without further environment interaction. Offline methods excel where real-world interaction is costly or unsafe and can extract useful information even from random or non-expert data (though learning quality will lag behind that from expert or more structured data). In practice, offline reinforcement learning is often a good option for pretraining a policy before moving to online reinforcement learning, which, though sample-inefficient, typically achieves superior performance since it continually adapts using new data.
On-policy reinforcement learning algorithms update and evaluate the same policy that’s used to generate training data, meaning the agent learns the value of the current policy based on its actual actions (e.g., SARSA, PPO, TRPO). This approach often leads to more stable and reliable updates since you’re not trying to reconcile differences between behavior and target policies as in off-policy methods. Because only a single policy is maintained and updated, they also tend to have lower computational complexity.
Off-policy methods (e.g., Q-learning, DQN), on the other hand, leverage data collected by one policy (the behavior policy) to learn or improve a different target policy (such as a greedy or optimal strategy). As a result, they can store past experiences (data) in a replay buffer and reuse them multiple times. This dramatically improves sample efficiency compared with on-policy methods, which discard data after each policy update. Another advantage of off-policy methods is that they can learn from any policy, including random, outdated, or even human-generated data. This flexibility enables training from offline data sets or demonstrations.
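The plain MATLAB sketch below shows the replay-buffer idea conceptually: experiences are stored as they are collected, and random minibatches are drawn later for updates. It is only an illustration; Reinforcement Learning Toolbox agents manage their experience buffers internally.

% Conceptual replay buffer (obs, action, reward, nextObs are assumed to come from a simulation step)
bufferSize = 10000;
buffer(bufferSize) = struct('obs',[],'action',[],'reward',[],'nextObs',[]);   % preallocate
writeIdx  = 0;
numStored = 0;

% Store one experience, overwriting the oldest entry once the buffer is full
writeIdx = mod(writeIdx, bufferSize) + 1;
buffer(writeIdx).obs     = obs;
buffer(writeIdx).action  = action;
buffer(writeIdx).reward  = reward;
buffer(writeIdx).nextObs = nextObs;
numStored = min(numStored + 1, bufferSize);

% Later, sample a random minibatch and reuse it for an off-policy update
batchSize = 64;
minibatch = buffer(randi(numStored, batchSize, 1));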
Gradient-based methods are the workhorse of contemporary deep reinforcement learning, as they leverage noisy estimates of the policy gradient (computed via backpropagation), yielding fast, sample-efficient learning. On the other hand, they are sensitive to hyperparameters, prone to local optima, and require differentiability.
Evolutionary reinforcement learning, in contrast, treats the policy as a black box and is a powerful tool when gradients are unavailable or untrustworthy, when massive parallel compute is accessible, or when you need broad exploration in rugged search spaces. Evolutionary strategies search globally via population-based mutations and selections—offering robustness to sparse or non-differentiable reward signals, though they are far less sample-efficient and slower to converge.
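As a rough illustration of the evolutionary approach, the plain MATLAB sketch below perturbs a policy parameter vector with Gaussian noise, scores each perturbation with an episode-return function, and nudges the parameters toward the better-scoring candidates. The function evaluateReturn is a placeholder you would supply for your own environment, and the population size, noise scale, and learning rate are arbitrary.

% Minimal evolution-strategies sketch (evaluateReturn is a placeholder episode-return function)
numParams = 32;                  % size of the policy parameter vector
popSize   = 50;                  % perturbations evaluated per generation
sigma     = 0.1;                 % exploration noise scale
alpha     = 0.02;                % learning rate
theta     = zeros(numParams,1);  % initial policy parameters

for generation = 1:200
    noise   = randn(numParams, popSize);            % one Gaussian perturbation per candidate
    returns = zeros(popSize, 1);
    for k = 1:popSize
        returns(k) = evaluateReturn(theta + sigma*noise(:,k));   % run one episode per candidate
    end
    weights = (returns - mean(returns)) / (std(returns) + eps);  % standardize returns
    theta   = theta + alpha/(popSize*sigma) * (noise * weights); % move toward better candidates
end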
Hybrid techniques increasingly let you exploit the best of both: they use evolution for exploration and global search, then refine promising policies with gradient-based updates, often achieving stronger performance overall.
The nature of the problem at hand often dictates which algorithm(s) is appropriate. If the state and action spaces of the environment are discrete and few in number, then you could use a simple table to represent policies. Q-learning and SARSA are examples of common tabular algorithms. Representing policy parameters in a table is not feasible when the number of state/action pairs becomes large or infinite. This is the so-called curse of dimensionality, and this is where neural networks come in. In general, most modern reinforcement learning algorithms rely on neural networks because they are good candidates for large state/action spaces and complex problems.
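For a small, discrete problem of this kind, the tabular Q-learning update fits in a few lines of plain MATLAB. In the sketch below, stepEnvironment is a placeholder for your environment dynamics, and s and a are the current state and action indices.

% One tabular Q-learning step (stepEnvironment, s, and a are placeholders)
numStates = 16;  numActions = 4;
Q     = zeros(numStates, numActions);   % the lookup-table policy representation
alpha = 0.1;                            % learning rate
gamma = 0.99;                           % discount factor

[sNext, r] = stepEnvironment(s, a);                  % apply action a in state s, observe result
target     = r + gamma * max(Q(sNext, :));           % best value achievable from the next state
Q(s, a)    = Q(s, a) + alpha * (target - Q(s, a));   % move Q(s,a) toward the target
s          = sNext;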
In single-agent reinforcement learning, only one agent interacts with the environment; this makes learning simpler, more stable, and easier to analyze. In contrast, multi-agent reinforcement learning (MARL) involves multiple agents interacting within the same environment. Because the agents influence each other, the environment is non-stationary from each agent’s perspective, violating Markov assumptions and destabilizing the learning process. While MARL can solve more complex tasks and develop emergent behaviors such as coordination or negotiation, it also introduces challenges, including convergence issues, high computational demands, and instability when each agent updates independently.
While reinforcement learning is by no means a new concept, recent progress in deep learning and computing power has made it possible to achieve some remarkable results in the area of artificial intelligence.
The benefits of reinforcement learning include:
Reinforcement learning is a powerful technique, but it also comes with challenges, including the effort of setting up the problem correctly, long training times and sample inefficiency, sensitivity to hyperparameters, the gap between simulation and the real world, difficult debugging, and the limited interpretability and verifiability of trained policies.
Reinforcement learning has been used in multiple areas over the past few years, including AI chatbots and large language models (LLMs), recommender systems, marketing and advertising, and gaming. However, it is still (for the most part) under evaluation for production applications, and that is especially true for engineered systems. On the bright side, the aforementioned benefits are slowly but steadily opening up the technology to several areas. Real-life applications of reinforcement learning in engineered systems typically fall into these areas:
MATLAB®, Simulink®, and Reinforcement Learning Toolbox™ simplify reinforcement learning tasks. You can implement controllers and decision-making algorithms by working through every step of the reinforcement learning workflow in the same ecosystem. Specifically, you can:
Creating a reinforcement learning agent interactively with the Reinforcement Learning Designer app. (See documentation.)
You can use MATLAB and Simulink to address many of the challenges commonly associated with reinforcement learning.
Get started with reinforcement learning quickly: try algorithms out of the box (no need to develop them yourself), consult reference examples for ideas on how to set up your problem, and ramp up with free learning resources and training courses.
With Parallel Computing Toolbox™ and MATLAB Parallel Server™ you can train reinforcement learning policies faster by leveraging multiple GPUs, multiple CPUs, computer clusters, and cloud resources. For example, you can generate training data faster by spinning off multiple simulations in parallel, and you can also accelerate learning by speeding up gradient calculations.
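For example, assuming a parallel pool is available, enabling parallel simulation is a one-line change to the training options (a sketch; the agent and environment are assumed to be defined already, and the episode count is arbitrary):

% Distribute episode simulations across parallel workers during training (sketch)
trainOpts = rlTrainingOptions( ...
    MaxEpisodes=1000, ...
    UseParallel=true);
trainStats = train(agent, env, trainOpts);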
Reinforcement Learning Toolbox also provides capabilities for model-based reinforcement learning, which can help with sample efficiency.
Model-based policy optimization (MBPO) agents can be more sample-efficient than model-free agents because the model can generate large sets of diverse experiences.
You can use Reinforcement Learning Toolbox to reduce the number of hyperparameters you need to tune manually. For example, you can create agents without manually specifying the architecture of the neural network policies, and you can tune agent hyperparameters interactively with Bayesian optimization in the Reinforcement Learning Designer app. Also, you can generate reward functions automatically if you already have Model Predictive Control Toolbox™ specifications or performance constraints specified with Simulink Design Optimization™ model verification blocks.
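For instance, if you already have a tuned model predictive controller, its cost and constraint specifications can serve as the starting point for a reward function (a sketch, assuming mpcObj is an existing MPC controller object in your workspace):

% Generate an initial reward function from existing MPC specifications (sketch)
generateRewardFunction(mpcObj);   % creates a reward function file that you can review and customize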
Simulations are key in reinforcement learning. With tight integration to Simulink, it is easy to enhance generalization of policies through domain randomization by training agents across various scenarios, even for extreme or dangerous conditions that would be hard and risky to create in the real world.
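With a Simulink environment, one way to apply domain randomization is through the environment reset function, which can perturb model parameters at the start of every training episode. In the sketch below, the model name, agent block path, and the vehicleMass variable are all hypothetical.

% Randomize a plant parameter at the start of every episode (hypothetical model and variable names)
env = rlSimulinkEnv("vehicleModel", "vehicleModel/RL Agent", obsInfo, actInfo);
env.ResetFcn = @(in) setVariable(in, "vehicleMass", 1500 + 150*randn, ...
    "Workspace", "vehicleModel");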
Using Reinforcement Learning Toolbox, you can fully exploit all available sources of data for better generalization and bridge the sim2real gap. With offline reinforcement learning you can pretrain a policy using existing data (such as data coming from target hardware). You can then improve the policy by training against a simulated environment and apply domain randomization to make it robust to uncertain factors and scenarios. To ensure the simulation model accurately represents the real-world system, you can rely on system identification. The last step of the process is to fine-tune the trained policy by training directly against the real hardware, if needed. While including actual hardware in the training loop would normally be risky and even dangerous, the first two steps of this workflow ensure that the number of interactions with the physical hardware required for fine-tuning the policy is minimal. This functionality can be used for real-time applications as well.
Reinforcement learning is often characterized by “silent errors”—subtle, difficult-to-detect issues that can arise during training or execution. Reinforcement Learning Toolbox lets you log and visualize key training data for easier analysis and debugging.
Interpretability and verification are still open and active areas in the research community, especially with regard to neural networks. Deep Learning Toolbox™ provides a range of visualization methods—a type of interpretability technique that explains network predictions using visual representations of what a network is looking at. Another approach is through fuzzy logic; training a fuzzy inference system (FIS) to replicate the behavior of a (deep) reinforcement learning policy enables you to use the FIS rules to explain its behavior.
Simulation-based verification is the most common approach to verify reinforcement learning policies, and is made easy through Simulink. With Model-Based Design, simulation-based verification can be extended through traditional verification and validation; for example, you can formalize requirements for your policy and analyze them for consistency, completeness, and correctness using Requirements Toolbox™. In addition, you can assess certain properties of neural network policies such as robustness and network output bounds using formal methods provided in the Deep Learning Toolbox Verification Library.
One last thing to consider is that breaking down a complex problem into smaller subproblems can help with all of the challenges discussed in this section: debugging and interpretability become more tractable (a smaller problem typically requires a simpler policy architecture), and the verification requirements can potentially be reduced. In these situations, reinforcement learning can be combined with traditional (control) methods. The main idea behind this architecture is that verifiable, traditional methods handle the safety-critical aspects of the problem, while black-box reinforcement learning policies handle higher-level, potentially less critical components. Other architectures to consider include a hybrid approach in which a traditional method runs alongside reinforcement learning, or having reinforcement learning complement or correct a traditional method. MATLAB makes it easy to implement such architectures; in addition to reinforcement learning and AI-based methods, you can access a variety of traditional methods out of the box and combine them using a single simulation platform, Simulink.