Speeding Up Signal Processing Algorithm Simulation in Simulink Models

By Akash Gopisetty, MathWorks


Signal processing system designs often have high computational complexity due to the algorithm or data-intensive applications involved. Building and simulating these complex systems can be time-consuming. The Dataflow domain feature in Simulink® provides a way to reduce model simulation times. This feature accelerates simulation by automatically partitioning Simulink models and executing them in parallel using the CPU cores available on the host computer.

This article shows you how to set up Dataflow in three simple steps. We then demonstrate Dataflow in action, using a radio model as an example, and compare model simulation times with and without Dataflow enabled.

The models used in this example are available for download.

Types of Parallelism Used by Dataflow

To partition models and execute them in parallel, Dataflow uses one of the following combinations of data and task parallelism (Figure 1):

  • Explicit parallelism processes different data sets through different algorithms.
  • Unfolding parallelism processes consecutive frames of a data stream through the same algorithm.
  • Pipelining parallelism processes different parts of the same data through different algorithms.
Figure 1. Types of parallelism used by Dataflow.

Setting Up Dataflow

To enable the Dataflow domain in a Simulink model, you first implement a subsystem. The way you do this depends on how far your design has progressed.

If you are just beginning the design process, use the Dataflow Subsystem block in DSP System Toolbox™ (Figure 2). This block is preconfigured: drag it into the Simulink model and populate it with your algorithmic components.

Figure 2. Dataflow subsystem in the DSP System Toolbox block library.

If you have already built your design model, place the blocks modeling the algorithm that you would like to parallelize in a subsystem and set up Dataflow as follows:

  1. Select the subsystem you just created.
  2. Select the Set execution domain check box from the Execution tab of the Property Inspector.
  3. Set the Domain option to Dataflow.

Inside the subsystem, the > icon on the bottom left indicates that the subsystem is set to the Dataflow domain.

The Dataflow domain first profiles the model by running it on a single thread and then automatically partitions the subsystem for multithreaded execution.
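The three Property Inspector steps above can also be scripted. The following is a minimal sketch, assuming the subsystem parameters SetExecutionDomain and ExecutionDomainType are available in your release. The model name and subsystem path are hypothetical placeholders, not names taken from the downloadable models.

```matlab
% Hypothetical names: replace with your own model and subsystem path.
modelName  = 'my_model';
subsysPath = [modelName '/Algorithm'];

load_system(modelName);    % load the model without opening the editor

% Equivalent of steps 2 and 3 in the Property Inspector
% (parameter names are assumptions; check them against your release).
set_param(subsysPath, 'SetExecutionDomain', 'on');
set_param(subsysPath, 'ExecutionDomainType', 'Dataflow');

% Read the setting back to confirm that it was applied.
domain = get_param(subsysPath, 'ExecutionDomainType')
```

Scripting the setting is mainly useful when the same change has to be applied to several models or undone for a comparison run.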

Dataflow in Action

Our example model simulates a radio transmitter and receiver. It contains digital up and down converters to adjust the signal frequencies, and implements a modulator and demodulator (Figure 3). The input is a speech recording sampled at 8 kHz. The outputs are two spectrum analyzers and an audio sink.

Figure 3. Radio model with a single-channel audio input.

First, we measure the time taken to simulate this model without enabling Dataflow [1]. With the output blocks commented out, we can focus on simulating the algorithm, and are not bound by the fixed amount of time needed to run the scopes and audio output.

We measure simulation time using the tic-toc commands:

modelname = 'mono_radiomodel';
tic;
simData = sim(modelname);
t = toc

The execution time to run the model is 3.67 seconds.
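A single tic-toc reading can vary from run to run. One way to get a steadier figure is to repeat the simulation and look at the median, as in the sketch below. The repeat count is an assumption chosen for illustration, and the times you measure will differ from those reported in this article.

```matlab
modelname = 'mono_radiomodel';
numRuns   = 5;                     % assumed repeat count; adjust as needed
runTimes  = zeros(1, numRuns);

for k = 1:numRuns
    tic;
    sim(modelname);                % output discarded; only the elapsed time matters
    runTimes(k) = toc;
end

t_first  = runTimes(1)             % can include one-time costs such as loading the model
t_median = median(runTimes)        % less sensitive to outliers than a single run
```

Reporting the first run separately from the median makes it easier to tell one-time setup cost apart from steady-state simulation time.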

Now let’s introduce Dataflow. We’ll place the blocks representing the algorithm into a subsystem and set the domain to Dataflow (Figure 4).

Figure 4. Dataflow enabled on the radio model with a single-channel audio input.

The Dataflow assistant displays suggested model setting changes (Figure 5). 

Figure 5. Dataflow assistant showing suggested changes to model settings.

One change recommended by the assistant is to add latency. Latency is typically added to a model when Dataflow identifies an opportunity for parallelism to increase throughput. The delays added along the signal lines are indicated with a z^-n label.
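To see why a delay creates an opportunity for parallelism, consider the plain MATLAB sketch below. It is an illustration only, not how Dataflow is implemented: stageA and stageB are hypothetical stand-ins for two blocks in series, and each frame is a single number.

```matlab
% Hypothetical stages standing in for two blocks connected in series.
stageA = @(x) 2 * x;
stageB = @(y) y + 1;

frames    = 1:6;                   % one scalar "frame" per time step
numFrames = numel(frames);

% Direct chain: stageB must wait for stageA within the same time step.
direct = zeros(1, numFrames);
for k = 1:numFrames
    direct(k) = stageB(stageA(frames(k)));
end

% Pipelined chain: a one-frame delay (z^-1) sits between the stages.
pipelined  = zeros(1, numFrames);
delayState = 0;                    % initial condition of the delay
for k = 1:numFrames
    pipelined(k) = stageB(delayState);   % uses the result from the previous step
    delayState   = stageA(frames(k));    % does not depend on the line above
end

% The outputs match, shifted by one frame of latency.
sameValues = isequal(pipelined(2:end), direct(1:end-1))
```

Inside the second loop, the two statements no longer depend on each other within a time step, so they could run on separate threads. The cost is that the output arrives one frame later.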

We accept the changes and save the model with Dataflow enabled as mono_radiomodel_dataflow.

We then measure the execution time of the new subsystem using the same tic-toc commands that we used before.

modelname = 'mono_radiomodel_dataflow';
tic; 
simData2 = sim(modelname);
t_Dataflow = toc

The execution time with Dataflow enabled is 2.5 seconds, which is about 1.5 times faster than the 3.67 seconds measured without Dataflow (3.67 / 2.5 ≈ 1.47). The speedup is due to the compiler optimizations, model settings changes, and latency added by Dataflow.

However, the model was executed on just one thread, and the speedup is not significant. This is because most of the computational load is concentrated in the up- and down-converter blocks. Dataflow works best when the computational load is spread across the model, providing more opportunities to create parallel threads. In the next section, we extend our model and show how implementing Dataflow further improves simulation performance.

Working with Larger Models

We increase the computational complexity of the model by introducing a multichannel audio input signal. This doubles the amount of data that needs to be processed and gives Dataflow more avenues to optimize simulation performance. Figure 6 shows the model modified to use a stereo audio input. Without Dataflow, this model takes 18.6 seconds to run. By enabling the signal dimensions information overlay, we see that the signal input does indeed have two audio channels.

Figure 6. Radio model with a stereo input multichannel audio signal.

After turning on Dataflow and rerunning the model, we observe an execution time of 4.5 seconds, a roughly four-fold speedup (18.6 / 4.5 ≈ 4.1), with the model running on five concurrent threads (Figure 7).

Figure 7. Dataflow assistant showing the latency and number of threads for model execution.
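The speedup for the stereo model can be worked out directly from the two reported times. The short calculation below also divides the speedup by the thread count to give a rough parallel efficiency. That efficiency figure is an added illustration, not a number reported by the Dataflow assistant.

```matlab
% Times reported in this article for the stereo model, in seconds.
t_single   = 18.6;     % Dataflow disabled
t_dataflow = 4.5;      % Dataflow enabled
numThreads = 5;        % threads reported by the Dataflow assistant

speedup    = t_single / t_dataflow;     % about 4.13
efficiency = speedup / numThreads;      % about 0.83 of ideal linear scaling

fprintf('Speedup: %.2fx, efficiency: %.0f%%\n', speedup, 100 * efficiency);
```

An efficiency below 100% is expected, since some time goes to synchronizing the threads and the load is rarely split perfectly evenly.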

Multithreaded Code Generation with Dataflow

Dataflow supports both single-core and multicore C/C++ code generation with Simulink Coder™ and Embedded Coder®. You first enable the Allow tasks to be executed concurrently on target parameter in the Solver pane of the Simulink model and then generate code using Ctrl + B. Code generated for desktop targets is multithreaded via OpenMP. Code generated for Embedded Coder targets is multithreaded via POSIX threads.
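The same two steps can be scripted. This is a sketch under assumptions: ConcurrentTasks is taken to be the programmatic name of the Solver pane parameter described above, and slbuild is used as the command-line counterpart of Ctrl + B. Check both against the documentation for your release, and note that a code generation license is required.

```matlab
modelName = 'mono_radiomodel_dataflow';   % model saved earlier in this article

load_system(modelName);

% Assumed parameter name for "Allow tasks to execute concurrently on target".
set_param(modelName, 'ConcurrentTasks', 'on');

% Build the model from the command line instead of pressing Ctrl + B.
slbuild(modelName);
```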

Figure 8 shows OpenMP C code generated from our radio model, including concurrent tasks created by Dataflow.

Figure 8. Multithreaded code generated with Dataflow and OpenMP.

Limitations of Dataflow

While Dataflow helps speed up most simulations, it might not benefit every model. Examples are small, simple models and models whose computational load is concentrated in a few blocks. In these cases, the speedup achieved by Dataflow does not offset the overhead of synchronizing and executing the model on parallel threads. As the radio model example showed, Dataflow does best when the computational load is spread evenly across the model, because that gives it more opportunities to partition the model for parallel execution.

In terms of modeling limitations, as of Release 2020b Dataflow does not support continuous blocks, variable-sized signals, or virtual Simulink buses for multithreaded simulation.

Summary

The Dataflow domain lets you identify modeling patterns in a Simulink model that can be distributed into multiple threads and executed in parallel. This approach takes advantage of the processing power available on the host CPU, optimizes throughput, and reduces model simulation time. The Dataflow domain is most effective when the computational load is spread out across the model so that parallelism can be introduced, and it works only with discrete signals. 

[1] All simulations were run on a Windows® desktop computer with an Intel® Xeon® W-2133 CPU @ 3.6 GHz (6 cores, 12 threads).

Published 2021