Reinforcement Learning

What Is Reinforcement Learning?


Reinforcement learning is a machine learning technique in which a computer agent learns to perform a task through repeated trial-and-error interactions with an environment. This learning approach enables the agent to make a series of decisions that maximize a reward metric for the task, without human intervention and without being explicitly programmed to achieve the task.

How Reinforcement Learning Works

The typical training mechanism behind reinforcement learning reflects many real-world scenarios. Consider, for example, pet training through positive reinforcement.

Illustration of a dog interacting with a trainer. In reinforcement learning terminology, the dog is the agent and the trainer is part of the environment.

Reinforcement learning in dog training.

Illustration of a vehicle learning how to park autonomously, with the vehicle computer being the reinforcement learning agent.

Reinforcement learning in autonomous parking.

Using reinforcement learning terminology, the goal of learning in this case is to train the dog (agent) to complete a task within an environment, which includes the surroundings of the dog as well as the trainer. First, the trainer issues a command or cue, which the dog observes (observation). The dog then responds by taking an action. If the action is close to the desired behavior, the trainer will likely provide a reward, such as a food treat or a toy; otherwise, no reward will be provided. At the beginning of training, the dog will likely take more random actions, such as rolling over when the command given is “sit,” as it tries to associate specific observations with actions and rewards. This association, or mapping, between observations and actions is called the policy.

From the dog’s perspective, the ideal case would be to respond correctly to every cue so that it gets as many treats as possible. The whole point of reinforcement learning training, then, is to “tune” the dog’s policy so that it learns the desired behaviors that maximize some reward. After training is complete, the dog should be able to observe the owner and take the appropriate action (for example, sitting when commanded to “sit”) using the internal policy it has developed. By this point, treats are welcome but, theoretically, shouldn’t be necessary.

Keeping in mind the dog training example, consider the task of parking a vehicle using an automated driving system. The goal is to teach the vehicle computer (agent) to park in the correct parking spot with reinforcement learning. As in the dog training case, the environment is everything outside the agent and could include the dynamics of the vehicle, other vehicles that may be nearby, weather conditions, and so on. During training, the agent uses readings from sensors such as cameras, GPS, and lidar (observations) to generate steering, braking, and acceleration commands (actions). To learn how to generate the correct actions from the observations (policy tuning), the agent repeatedly tries to park the vehicle using a trial-and-error process. A reward signal can be provided to evaluate the goodness of a trial and to guide the learning process.

In the dog training example, training happens inside the dog’s brain. In the autonomous parking example, training is handled by a training algorithm, which is responsible for tuning the agent’s policy based on the collected sensor readings, actions, and rewards. After training is complete, the vehicle’s computer should be able to park using only the tuned policy and sensor readings. Note that both are examples of model-free reinforcement learning, as they rely on trial-and-error interactions with the environment to generate data.

The main components of reinforcement learning (policy, environment, agent, actions, rewards, and observations) are formalized within a framework known as a Markov decision process, which provides a mathematical model for decision-making in environments with uncertainty.
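
For reference, a Markov decision process is commonly written as a tuple of states, actions, transition dynamics, rewards, and a discount factor, with the agent's objective being to maximize the expected discounted sum of rewards. A compact statement of this standard formulation (not specific to any particular toolbox) is:

    \[
    \mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad
    P(s' \mid s, a) = \Pr(s_{t+1} = s' \mid s_t = s,\; a_t = a)
    \]
    \[
    \max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, R(s_t, a_t)\right],
    \qquad a_t \sim \pi(\cdot \mid s_t), \quad 0 \le \gamma < 1
    \]

Here the policy π plays the role of the mapping from observations to actions described in the dog training example, and the discount factor γ controls how strongly future rewards count toward the current decision.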

Exploration vs. Exploitation

The trade-off between exploration and exploitation is a critical aspect of reinforcement learning and can greatly affect the quality of learning. The idea is this: Should the agent exploit the environment by choosing the actions it already knows will collect the most reward, or should it choose actions that explore parts of the environment that are still unknown? The choices the agent makes determine the information it receives and, therefore, the information from which it can learn. Too much exploration, and the agent may never settle on a good policy; too much exploitation, and it may get stuck in local, suboptimal solutions. In general, it makes sense for an agent to explore more at the beginning of learning, when there is not enough information to exploit, and gradually shift toward exploitation as learning progresses.
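
One common way to balance the two is an epsilon-greedy strategy with a decaying exploration rate: with probability epsilon the agent picks a random action (exploration); otherwise it picks the action with the highest estimated value (exploitation). The sketch below is a minimal, toolbox-independent illustration; the action count, value estimates, and decay schedule are hypothetical placeholders.

    % Epsilon-greedy action selection with a decaying exploration rate (sketch).
    % qValues is a 1-by-numActions vector of estimated action values for the
    % current state; epsilon shrinks over episodes so the agent explores early
    % and exploits later.
    numActions = 4;                          % hypothetical action count
    qValues = zeros(1, numActions);          % placeholder value estimates
    epsilon = 1.0;                           % start fully exploratory
    epsilonMin = 0.05;
    epsilonDecay = 0.995;

    for episode = 1:1000
        if rand < epsilon
            action = randi(numActions);      % explore: random action
        else
            [~, action] = max(qValues);      % exploit: best-known action
        end
        % ... apply the action, observe the reward, and update qValues here ...
        epsilon = max(epsilonMin, epsilon * epsilonDecay);   % decay exploration
    end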

Deep Reinforcement Learning

Deep reinforcement learning combines reinforcement learning and deep learning. While for simpler problems a policy in the form of a lookup table may be sufficient, this approach does not scale well to large problems or to problems with continuous states and actions. Deep neural networks trained with deep reinforcement learning can encode complex behaviors, providing an alternative approach for applications that are otherwise intractable or more challenging to tackle with traditional methods. For example, in autonomous driving, a neural network can replace the driver and decide how to turn the steering wheel by simultaneously looking at multiple sensors, such as camera frames and lidar measurements. Without neural networks, the problem would normally be broken down into smaller pieces, such as extracting features from camera frames, filtering the lidar measurements, fusing the sensor outputs, and making “driving” decisions based on the processed sensor inputs, with each piece simple enough to solve with more traditional policy representations, such as lookup tables or polynomial functions.
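
To make the idea concrete, a neural-network policy is simply a parameterized function that maps an observation vector to action values or action commands. The sketch below is a toy, toolbox-independent forward pass through a single hidden layer; the layer sizes and weights are hypothetical placeholders, and in practice the weights would be learned by a deep reinforcement learning algorithm rather than set by hand.

    % A toy neural-network policy: observation vector in, action scores out.
    % In deep reinforcement learning, W1, b1, W2, b2 are tuned by the training
    % algorithm instead of being hand-designed.
    obsDim = 8; hiddenDim = 16; numActions = 3;       % hypothetical sizes
    W1 = 0.1 * randn(hiddenDim, obsDim);  b1 = zeros(hiddenDim, 1);
    W2 = 0.1 * randn(numActions, hiddenDim); b2 = zeros(numActions, 1);

    observation = randn(obsDim, 1);                   % e.g., processed sensor readings
    hidden = max(0, W1 * observation + b1);           % ReLU hidden layer
    actionScores = W2 * hidden + b2;                  % one score per discrete action
    [~, action] = max(actionScores);                  % greedy action from the policy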

Reinforcement Learning Workflow

The general workflow for training an agent using reinforcement learning includes the following steps (a minimal code sketch of the workflow appears at the end of this section):

  1. Create the environment. First you need to define the environment within which the reinforcement learning agent operates. The environment can be either a simulation model or a real physical system, but simulated environments are usually a good first step since they are safer and allow experimentation.
  2. Define the reward. Next, specify the reward signal that the agent uses to measure its performance against the task goals and how this signal is calculated from the environment. Reward shaping can be tricky and may require a few iterations to get it right.
  3. Create the agent. Then you create the agent, which involves selecting the policy representation (e.g., neural networks or lookup tables) and choosing and tuning the reinforcement learning training algorithm.
  4. Train and validate the agent. Set up training options (such as stopping criteria) and train the agent to tune the policy. Make sure to validate the trained policy after training ends.
  5. Deploy the policy. Deploy the trained policy representation using, for example, generated C/C++ or CUDA code. At this point, the policy is a standalone decision-making system.
Steps in the reinforcement learning workflow: environment, reward, agent, training of agent, and deployment.

Reinforcement learning workflow.

Training an agent using reinforcement learning is an iterative process. Decisions and results in later stages can require you to return to an earlier stage in the learning workflow. If the training process does not converge to an optimal policy within a reasonable amount of time, you may have to go back and take another look at the problem definition (dynamics, observations, actions), reward signal, policy architecture, and algorithm hyperparameters before training again.
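
As a concrete illustration of these five steps, here is a minimal sketch using Reinforcement Learning Toolbox with a built-in cart-pole environment and a default DQN agent. It assumes the toolbox functions rlPredefinedEnv, rlDQNAgent, rlTrainingOptions, rlSimulationOptions, train, sim, and generatePolicyFunction behave as documented; check the documentation for the exact options available in your release, and treat the numbers below as placeholders.

    % 1. Create the environment (a predefined simulated cart-pole system).
    env = rlPredefinedEnv("CartPole-Discrete");
    obsInfo = getObservationInfo(env);
    actInfo = getActionInfo(env);

    % 2. The reward is defined inside this predefined environment; for a custom
    %    environment you would specify the reward signal yourself.

    % 3. Create the agent (a default DQN agent with an automatically built network).
    agent = rlDQNAgent(obsInfo, actInfo);

    % 4. Train and validate the agent.
    trainOpts = rlTrainingOptions( ...
        "MaxEpisodes", 500, ...
        "StopTrainingCriteria", "AverageReward", ...
        "StopTrainingValue", 480);               % placeholder stopping criterion
    train(agent, env, trainOpts);
    simOpts = rlSimulationOptions("MaxSteps", 500);
    experience = sim(env, agent, simOpts);       % validate the trained policy

    % 5. Deploy the policy (generates a standalone policy evaluation function
    %    suitable for C/C++ or CUDA code generation).
    generatePolicyFunction(agent);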


Reinforcement Learning vs. Machine Learning vs. Deep Learning

Unlike unsupervised and supervised machine learning, reinforcement learning does not need to rely on a static data set; instead, it can operate in a dynamic environment and learn from collected experiences. Data points, or experiences, can be collected during training through trial-and-error interactions between the environment and a software agent. This aspect of reinforcement learning is important because it alleviates the need for the data collection, preprocessing, and labeling that are otherwise necessary before training in supervised and unsupervised learning.

Deep learning spans all three types of machine learning; reinforcement learning and deep learning are not mutually exclusive. Complex reinforcement learning problems often rely on deep neural networks and deep reinforcement learning.

Illustrations of unsupervised, supervised, and reinforcement learning showing clustering, classification, and a reward-based path system, respectively.

Three broad categories of machine learning: unsupervised learning, supervised learning, and reinforcement learning.

A Venn diagram illustrating three major classes of reinforcement learning algorithms: value-based and policy-based algorithms overlap to create actor-critic algorithms.

Three major categories of reinforcement learning algorithms.

Reinforcement Learning Algorithms

Reinforcement learning algorithms can be organized into several categories based on their approaches to learning and decision-making.

Value-Based vs. Policy-Based vs. Actor-Critic Reinforcement Learning Algorithms

There are three major classes of reinforcement learning algorithms:

  • Value-based methods focus on learning a value function, such as the Q-function, which estimates the expected cumulative reward for taking a particular action in a given state and following a certain policy thereafter. The most well-known example is Q-learning. The policy is derived indirectly from the value function by selecting the action with the highest value in each state (i.e., greedy with respect to the value function). Value-based methods are generally simple and sample-efficient but are limited to reasonably sized sets of discrete actions (consider how time-consuming it would be to extract the policy from the value function if the action space were high-dimensional). A tabular Q-learning sketch appears after the figure below.
  • Policy-based methods directly learn a parameterized policy that maps states to actions, optimizing the policy parameters to maximize expected reward, typically via gradient ascent on that objective. Examples include REINFORCE and other policy gradient methods. These methods are naturally suited to continuous or large action spaces and can also learn stochastic policies, which can be useful for exploration and domain randomization. The downsides of policy-based methods include slow and unstable learning, sample inefficiency, and sensitivity to local optima.
  • Actor-critic methods combine the strengths of value-based and policy-based approaches. The actor updates the policy directly, while the critic evaluates actions by estimating value functions. The critic’s feedback leads to more stable and efficient updates by lowering the variance of the policy gradient estimates. Also, having a direct representation of the policy (actor) helps with high-dimensional and continuous action space problems. The main disadvantage of these methods is that there are more moving parts to implement and tune (actor and critic). Examples of actor-critic algorithms include deep deterministic policy gradient (DDPG), proximal policy optimization (PPO), and soft actor-critic (SAC).
Four high-level steps of an actor-critic reinforcement learning algorithm: 1) The actor chooses an action 2) The critic makes a prediction of the value of that action 3) The critic updates itself using the reward collected from applying that action 4) The actor updates itself with the response from the critic.

Actor-critic reinforcement learning in action.
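
As a concrete value-based example, here is a minimal tabular Q-learning sketch on a toy five-state chain, where the agent must move right to reach a goal state that pays a reward of 1. The environment, its size, and all learning parameters are hypothetical placeholders chosen for illustration; the essential piece is the Q-table update, which bootstraps from the highest-valued action in the next state.

    % Tabular Q-learning on a toy 5-state chain (illustrative sketch).
    % States 1..5; action 1 moves left, action 2 moves right; reaching state 5
    % ends the episode with a reward of 1.
    nStates = 5; nActions = 2;
    Q = zeros(nStates, nActions);            % the value table the policy is read from
    alpha = 0.1; gamma = 0.95; epsilon = 0.1;

    for episode = 1:500
        s = 1;                               % start at the left end of the chain
        while s ~= nStates
            % Epsilon-greedy action selection
            if rand < epsilon
                a = randi(nActions);
            else
                [~, a] = max(Q(s, :));
            end
            % Environment dynamics and reward
            if a == 2, sNext = min(s + 1, nStates); else, sNext = max(s - 1, 1); end
            r = double(sNext == nStates);
            % Q-learning update: greedy bootstrap from the next state
            Q(s, a) = Q(s, a) + alpha * (r + gamma * max(Q(sNext, :)) - Q(s, a));
            s = sNext;
        end
    end

    % The learned (greedy) policy: the best action in each state
    [~, greedyPolicy] = max(Q, [], 2);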

Model-Based vs. Model-Free Reinforcement Learning Algorithms

Model-based reinforcement learning builds or uses a model of the environment’s dynamics (transition probabilities and reward functions) to plan and make decisions. The word “plan” is key; these algorithms typically need fewer interactions with the environment, or none at all, because they rely on their internal model to simulate future states. The internal model can be provided a priori (in which case the agent does not need to interact with the environment at all) or can be learned from data collected through interactions with the actual environment. Model-based reinforcement learning is usually more sample-efficient than model-free because the model can be used to quickly generate large sets of training data. However, depending on whether the internal model is available or learned, model-based methods can require more computational resources than model-free ones because, in addition to training the base agent, they must also train the environment model and generate training data.

Unlike model-based algorithms, model-free methods do not construct an explicit model of the environment. Instead, they learn optimal actions through direct interaction, relying on trial and error (recall the dog training and automated parking examples from the previous section). This approach is simpler and better suited to high-dimensional or unstructured environments, though it’s typically less efficient in terms of data usage. The majority of state-of-the-art reinforcement learning algorithms are model-free.
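
To illustrate how a learned model can generate extra training data, the snippet below extends the tabular Q-learning sketch from the previous section with a simple Dyna-style planning loop. It assumes the Q table, learning rate alpha, discount gamma, and chain dimensions from that sketch already exist; the tabular model and the number of planning steps are hypothetical placeholders.

    % Dyna-style planning step (sketch). After observing a real transition
    % (s, a, r, sNext), store it in a learned tabular model and then replay
    % several "imagined" transitions from the model to update Q without
    % touching the environment again.
    model = nan(nStates, nActions, 2);               % model(s,a,:) = [reward, nextState]
    nPlanningSteps = 10;

    % ... inside the training loop, right after the real Q-learning update ...
    model(s, a, 1) = r;                              % remember the observed reward
    model(s, a, 2) = sNext;                          % remember the observed next state

    [seenS, seenA] = find(~isnan(model(:, :, 1)));   % state/action pairs seen so far
    for k = 1:nPlanningSteps
        idx = randi(numel(seenS));
        sp = seenS(idx); ap = seenA(idx);
        rp = model(sp, ap, 1);
        spNext = model(sp, ap, 2);
        Q(sp, ap) = Q(sp, ap) + alpha * (rp + gamma * max(Q(spNext, :)) - Q(sp, ap));
    end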

Online vs. Offline Reinforcement Learning Algorithms

Online reinforcement learning involves an agent that actively interacts with the environment during learning—collecting experiences, updating its policy, and adapting continuously as new data arrives. The dog training and automated parking scenarios described above are examples of online reinforcement learning.

In contrast, offline (or batch) reinforcement learning learns solely from a static data set of logged experiences (e.g., from human demonstrations or past policies), without further environment interaction. Offline methods excel where real-world interaction is costly or unsafe and can extract useful information even from random or non-expert data (though learning quality will lag behind that from expert or more structured data). In practice, offline reinforcement learning is often a good option for pretraining a policy before moving to online reinforcement learning, which, though sample-inefficient, typically achieves superior performance since it continually adapts using new data.

On-Policy vs. Off-Policy Reinforcement Learning Algorithms

On-policy reinforcement learning algorithms update and evaluate the same policy that is used to generate training data, meaning the agent learns the value of its current policy based on the actions it actually takes (e.g., SARSA, PPO, TRPO). This approach often leads to more stable and reliable updates, since there is no need to reconcile differences between a behavior policy and a target policy as in off-policy methods. Because only a single policy is maintained and updated, on-policy methods also tend to have lower computational complexity.

Off-policy methods (e.g., Q-learning, DQN), on the other hand, leverage data collected by one policy (the behavior policy) to learn or improve a different target policy (such as a greedy or optimal strategy). As a result, they can store past experiences (data) in a replay buffer and reuse them multiple times. This dramatically improves sample efficiency compared with on-policy methods, which discard data after each policy update. Another advantage of off-policy methods is that they can learn from any policy, including random, outdated, or even human-generated data. This flexibility enables training from offline data sets or demonstrations.
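
The distinction is easiest to see in the update rules of SARSA (on-policy) and Q-learning (off-policy), written here in standard notation with learning rate α and discount factor γ:

    \[
    \text{SARSA:}\qquad Q(s,a) \leftarrow Q(s,a) + \alpha\big[\,r + \gamma\, Q(s', a') - Q(s,a)\,\big]
    \]
    \[
    \text{Q-learning:}\qquad Q(s,a) \leftarrow Q(s,a) + \alpha\big[\,r + \gamma \max_{a'} Q(s', a') - Q(s,a)\,\big]
    \]

In SARSA, a′ is the action the behavior policy actually takes in the next state, so the update evaluates the policy being followed. In Q-learning, the update bootstraps from the greedy action in the next state regardless of what the behavior policy does, which is what allows it to learn a target policy from data generated by a different policy, such as experiences stored in a replay buffer.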

Gradient-Based vs. Evolutionary Reinforcement Learning Algorithms

Gradient-based methods are the workhorse of contemporary deep reinforcement learning: they leverage noisy estimates of the policy gradient (computed via backpropagation), yielding fast, relatively sample-efficient learning. On the other hand, they are sensitive to hyperparameters, prone to local optima, and require differentiability.

Evolutionary reinforcement learning, in contrast, treats the policy as a black box and is a powerful tool when gradients are unavailable or untrustworthy, when massive parallel compute is accessible, or when you need broad exploration in rugged search spaces. Evolutionary strategies search globally via population-based mutations and selections—offering robustness to sparse or non-differentiable reward signals, though they are far less sample-efficient and slower to converge.

Hybrid techniques increasingly let you exploit the best of both: they use evolution for exploration and global search, then refine promising policies with gradient-based updates, often achieving stronger performance overall.
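
The sketch below illustrates the evolutionary (black-box) view of policy search: a population of perturbed policy parameter vectors is evaluated on the task, and the parameters are moved toward the perturbations that scored best. The function evaluateReturn is a hypothetical placeholder that would run one or more episodes with the given parameters and return the total reward; the population size, noise level, and step size are illustrative only.

    % A simple evolution-strategies-style policy search (black-box sketch).
    % theta holds the policy parameters (e.g., flattened neural network weights).
    nParams = 100;                         % hypothetical number of policy parameters
    theta = zeros(nParams, 1);
    popSize = 20; sigma = 0.1; stepSize = 0.02;

    for generation = 1:200
        noise = randn(nParams, popSize);              % one perturbation per candidate
        returns = zeros(popSize, 1);
        for i = 1:popSize
            candidate = theta + sigma * noise(:, i);
            returns(i) = evaluateReturn(candidate);   % hypothetical rollout function
        end
        % Move theta toward perturbations with above-average returns
        advantages = (returns - mean(returns)) / (std(returns) + eps);
        theta = theta + stepSize / (popSize * sigma) * (noise * advantages);
    end

No gradients of the policy are computed; only episode returns are needed, which is why this approach tolerates sparse or non-differentiable rewards and parallelizes naturally across candidates.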

Tabular vs. Neural Network–Based Reinforcement Learning Algorithms

The nature of the problem at hand often dictates which algorithms are appropriate. If the state and action spaces of the environment are discrete and small, you can use a simple table to represent the policy or value function. Q-learning and SARSA are examples of common tabular algorithms. Representing policy parameters in a table is not feasible when the number of state/action pairs becomes very large or infinite. This is the so-called curse of dimensionality, and this is where neural networks come in. In general, most modern reinforcement learning algorithms rely on neural networks because they are good candidates for large state/action spaces and complex problems.

Single- vs. Multi-Agent Reinforcement Learning Algorithms

In single-agent reinforcement learning, only one agent interacts with the environment, which makes learning simpler, more stable, and easier to analyze. In contrast, multi-agent reinforcement learning (MARL) involves multiple agents interacting within the same environment. Because the agents influence each other, the environment is non-stationary from any single agent’s point of view, violating Markov assumptions and destabilizing the learning process. While MARL can solve more complex tasks and develop emergent behaviors such as coordination or negotiation, it also introduces challenges, including convergence issues, high computational demands, and instability if each agent updates independently.

Benefits and Challenges of Reinforcement Learning

Benefits of Reinforcement Learning

While reinforcement learning is by no means a new concept, recent progress in deep learning and computing power has made it possible to achieve some remarkable results in the area of artificial intelligence.

The benefits of reinforcement learning include:

  • Ability to solve complex, sequential tasks. Reinforcement learning can learn to optimize long-term goals across many steps, which provides an alternative route to explore for problems that are hard to solve with more traditional methods. Deep reinforcement learning can also potentially generate end-to-end solutions using advanced or complex sensors, exploiting the representational capabilities of neural networks.
  • Less reliance on labeled data. Unlike supervised learning, reinforcement learning can learn directly from environment feedback through rewards and penalties, reducing the need for costly labeled data sets.
  • Fewer requirements for existing data sets. Reinforcement learning typically learns from data generated on the fly, but this does not mean it cannot use existing data sets. In fact, unlike supervised learning, offline reinforcement learning can extract useful information even from non-expert data or data that does not encode the desired behavior to be learned.
  • Adaptiveness, self-correction, and robustness. Reinforcement learning continuously refines behavior using trial and error, enabling agents to revise strategies based on performance even after deployment. Also, it is designed to handle non-deterministic conditions where outcomes are unpredictable, which suits real-world complexity.

Challenges of Reinforcement Learning

Reinforcement learning is a powerful technique, but it also comes with challenges, including:

  • High entry barrier. Reinforcement learning often presents a high entry barrier due to the complexity of the algorithms and concepts.
  • High data and compute demands (sample inefficiency). Training often requires a lot of data, which translates to a large number of interactions or simulations. As a result, it’s not uncommon for complex problems to require multiple days of training to converge.
  • High number of design parameters. Reinforcement learning has a large number of hyperparameters that require tuning, such as the reward signal, neural network architecture, and agent-specific hyperparameters. Even small changes to these parameters can drastically affect training performance, often requiring multiple attempts to train an acceptable policy.
  • Generalization and transfer challenges. Agents often struggle to perform outside their training scenarios or transfer learning from simulation to the real world efficiently (aka the sim2real gap).
  • Verification, interpretability, and debugging issues. As powerful as deep reinforcement learning is, neural networks come with their own caveats. Complex neural network policies are hard to explain and troubleshoot, reducing transparency. Additionally, formal verification of neural networks is still an open area, which makes applying deep reinforcement learning to safety-critical systems challenging.

Reinforcement Learning Applications in Engineered Systems

Reinforcement learning has been used in multiple areas over the past few years, including AI chatbots and large language models (LLMs), recommender systems, marketing and advertising, and gaming. However, it is still (for the most part) under evaluation for production applications, and that is especially true for engineered systems. On the bright side, the benefits discussed above are slowly but steadily opening up the technology to more areas. Real-life applications of reinforcement learning in engineered systems typically fall into these areas:

  • Advanced controls: Controlling nonlinear or complex systems, or systems with advanced sensor feedback, is challenging and often requires additional prework (e.g., linearizing the system at different operating points or extracting features from sensor data, such as images). In such cases, reinforcement learning can often be applied directly, without this extra prework. Application areas include automated driving (e.g., driving decisions based on camera input) and robotics (e.g., teaching a robotic arm to manipulate a variety of objects for pick-and-place applications or teaching a robot how to walk).
  • Scheduling: Scheduling problems appear in traffic light control, coordinating resources on the factory floor, and many other scenarios. Reinforcement learning is a good alternative to evolutionary methods to solve these (high-dimensional) combinatorial optimization problems.
  • Calibration: Applications that involve manual calibration of parameters, such as electronic control unit (ECU) or engine calibration, or those that are prone to human error, such as production line optimization, may be good candidates for reinforcement learning.
  • Adversarial problems: In machine learning, adversarial applications involve deliberately crafted inputs (called adversarial examples) that are designed to exploit weaknesses in other models, systems, or environments. An example of adversarial reinforcement learning would be to design a policy-verifying agent that tries to “break” another policy (or other algorithm for that matter) to identify counterexamples and assess system vulnerabilities. Adversarial reinforcement learning is also particularly useful in cybersecurity applications, where it can be used to simulate attacks to test defenses and vice versa.
  • Design optimization: An area that can greatly benefit from reinforcement learning is simulation-based design optimization, especially if simulations are expensive. Reinforcement learning agents can explore the design space in a smart way, thus reducing the amount of training data and simulations required. Specific application examples include radar design and optimization of chip placements.

Reinforcement Learning with MATLAB and Simulink

MATLAB®, Simulink®, and Reinforcement Learning Toolbox™ simplify reinforcement learning tasks. You can implement controllers and decision-making algorithms by working through every step of the reinforcement learning workflow in the same ecosystem. Specifically, you can:

  1. Create training environments and reward signals in MATLAB and Simulink. Interface with created environments easily through MATLAB functions and Simulink blocks (a code sketch follows the figure below).
  2. Create neural network–based policies programmatically or interactively with Deep Network Designer. Alternatively, you can use lookup tables and polynomials.
  3. Switch, evaluate, and compare popular value-based, policy-based, and actor-critic algorithms such as DQN, DDPG, PPO, and SAC with only minor code changes, or create your own custom algorithm. Experiment with various training approaches out of the box, such as single-agent/multi-agent, gradient-based/evolutionary, online/offline, and model-based/model-free reinforcement learning.
  4. Design, train, and simulate agents interactively with the Reinforcement Learning Designer app.
  5. Deploy reinforcement learning policies to production systems and embedded devices (with automatic code generation tools). If needed, reduce the memory footprint of neural network policies with compression techniques before deployment.
Screenshot of the Reinforcement Learning Designer app showing interactive dialogs for setting agent hyperparameters.

Creating a reinforcement learning agent interactively with the Reinforcement Learning Designer app. (See documentation.)
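
For example, for a control problem with continuous actions modeled in Simulink, a minimal sketch of the environment and agent creation steps might look like the following. The model name, agent block path, and signal dimensions are hypothetical placeholders, and the calls assume the documented Reinforcement Learning Toolbox interfaces (rlNumericSpec, rlSimulinkEnv, rlDDPGAgent, rlTrainingOptions); check the documentation for the exact signatures in your release.

    % Observation and action specifications for a hypothetical Simulink model
    obsInfo = rlNumericSpec([6 1]);                        % e.g., six measured signals
    actInfo = rlNumericSpec([1 1], ...
        "LowerLimit", -1, "UpperLimit", 1);                % one bounded control input

    % Create the environment from a Simulink model containing an RL Agent block
    mdl = "myControlModel";                                % hypothetical model name
    env = rlSimulinkEnv(mdl, mdl + "/RL Agent", obsInfo, actInfo);

    % Create a default DDPG agent (actor-critic, continuous actions) and train it
    agent = rlDDPGAgent(obsInfo, actInfo);
    trainOpts = rlTrainingOptions("MaxEpisodes", 1000);    % placeholder options
    train(agent, env, trainOpts);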

Addressing Common Reinforcement Learning Challenges with MATLAB and Simulink

You can use MATLAB and Simulink to address many of the challenges commonly associated with reinforcement learning.

High Entry Barrier

Get started with reinforcement learning quickly: try out algorithms out of the box (no need to manually develop these yourself), consult reference examples to get ideas on how to set up your problem, and ramp up with free learning resources and training courses.

High Data and Compute Demands (Sample Inefficiency)

With Parallel Computing Toolbox™ and MATLAB Parallel Server™ you can train reinforcement learning policies faster by leveraging multiple GPUs, multiple CPUs, computer clusters, and cloud resources. For example, you can generate training data faster by spinning off multiple simulations in parallel, and you can also accelerate learning by speeding up gradient calculations.
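
For instance, enabling parallel training is typically a small change to the training options. The sketch below assumes the documented UseParallel option of rlTrainingOptions and an available Parallel Computing Toolbox installation, with the remaining settings as placeholders and agent and env created as in the earlier workflow sketch.

    % Run episode simulations on parallel workers during training (sketch).
    trainOpts = rlTrainingOptions( ...
        "MaxEpisodes", 2000, ...          % placeholder
        "UseParallel", true);             % distribute simulations across workers
    train(agent, env, trainOpts);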

Reinforcement Learning Toolbox also provides capabilities for model-based reinforcement learning, which can help with sample efficiency.

Model-based policy optimization (MBPO) agents can be more sample-efficient than model-free agents because the model can generate large sets of diverse experiences.

A diagram showing how the problem can be run on multiple machines using parallel computing to accelerate reinforcement learning and reach an optimal policy faster.

Training a sample-inefficient learning problem with parallel computing.

High Number of Design Parameters

You can use Reinforcement Learning Toolbox to reduce the number of hyperparameters you need to tune manually. For example, you can create agents without manually specifying the architecture of neural network policies, and you can tune agent hyperparameters interactively with Bayesian optimization in the Reinforcement Learning Designer app. Also, you can generate reward functions automatically if you already have Model Predictive Control Toolbox™ specifications or performance constraints specified with Simulink Design Optimization™ model verification blocks.

Generalization and Transfer Challenges

Simulations are key in reinforcement learning. With tight integration with Simulink, it is easy to enhance the generalization of policies through domain randomization, by training agents across a variety of scenarios, including extreme or dangerous conditions that would be hard and risky to create in the real world.

Using Reinforcement Learning Toolbox, you can fully exploit all available sources of data for better generalization and bridge the sim2real gap. With offline reinforcement learning you can pretrain a policy using existing data (such as data coming from target hardware). You can then improve the policy by training against a simulated environment and apply domain randomization to make it robust to uncertain factors and scenarios. To ensure the simulation model accurately represents the real-world system, you can rely on system identification. The last step of the process is to fine-tune the trained policy by training directly against the real hardware, if needed. While including actual hardware in the training loop would normally be risky and even dangerous, the first two steps of this workflow ensure that the number of interactions with the physical hardware required for fine-tuning the policy is minimal. This functionality can be used for real-time applications as well.

Diagram showing offline reinforcement learning, system identification, and training directly against real hardware systems.

A sim2real transfer workflow using Reinforcement Learning Toolbox that exploits all sources of available data for better generalization.

Verification, Interpretability, and Debugging Issues

Reinforcement learning is often characterized by “silent errors”—subtle, difficult-to-detect issues that can arise during training or execution. Reinforcement Learning Toolbox lets you log and visualize key training data for easier analysis and debugging.

Interpretability and verification are still open and active areas in the research community, especially with regard to neural networks. Deep Learning Toolbox™ provides a range of visualization methods—a type of interpretability technique that explains network predictions using visual representations of what a network is looking at. Another approach is through fuzzy logic; training a fuzzy inference system (FIS) to replicate the behavior of a (deep) reinforcement learning policy enables you to use the FIS rules to explain its behavior.

Simulation-based verification is the most common approach to verify reinforcement learning policies, and is made easy through Simulink. With Model-Based Design, simulation-based verification can be extended through traditional verification and validation; for example, you can formalize requirements for your policy and analyze them for consistency, completeness, and correctness using Requirements Toolbox™. In addition, you can assess certain properties of neural network policies such as robustness and network output bounds using formal methods provided in the Deep Learning Toolbox Verification Library.

One last thing to consider is that breaking down a complex problem into smaller subproblems can help with all the challenges discussed in this section; debugging and interpretability become more tractable (a smaller problem will typically require a simpler policy architecture), and the verification requirements can potentially be reduced. In these situations, reinforcement learning can be combined with traditional (control) methods. The main idea behind this architecture is that verifiable or traditional methods handle the safety-critical aspects of the problem, while black-box reinforcement learning policies handle higher-level, potentially less critical components. Other architectures to consider include a hybrid approach in which a traditional method runs alongside reinforcement learning, or having reinforcement learning complement or correct a traditional method. MATLAB makes it easy to implement such architectures; in addition to reinforcement learning and AI-based methods, you can access a variety of traditional methods out of the box and combine them in a single simulation platform, Simulink.

Block diagram combining reinforcement learning with a traditional controller.

Combining reinforcement learning with traditional methods, which can make it easier to apply to safety-critical systems.