
Generate GPU Kernels for Reduction Operations

A reduction is a common pattern in MATLAB® code that uses a loop to accumulate a value that depends on all the iterations together, but which is independent of the iteration order. You can implement reductions by using the gpucoder.reduce function or for-loops.

What Is a Reduction?

In MATLAB, a reduction is a code pattern that uses a variable on both sides of an assignment statement to accumulate a value. For example, this pseudocode shows a reduction for the variable acc:

acc = ...; % initialize acc to a value
for i=1:n
  ...
  acc = f(acc,b);
end
Each iteration calculates acc by using the value of acc from the previous iteration with a binary function f. Because each loop iteration depends on the previous iteration, this implementation is usually not suitable for parallel execution. However, you can calculate the reduction in parallel on the GPU if the function f:

  • Is associative. For inputs a, b, and c, f(a,f(b,c)) is equal to f(f(a,b),c).

  • Is commutative. For inputs a and b, f(a,b) is equal to f(b,a).
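You can spot-check these properties numerically for a candidate operator. This sketch (plain MATLAB, not part of GPU Coder) is an illustration on sample values, not an exhaustive proof:

```matlab
% Check associativity and commutativity of a candidate reduction operator
% on sample inputs. plus qualifies; minus does not.
f = @plus;
a = 3; b = 5; c = 7;
isequal(f(a,f(b,c)),f(f(a,b),c))  % true: plus is associative
isequal(f(a,b),f(b,a))            % true: plus is commutative

g = @minus;
isequal(g(a,g(b,c)),g(g(a,b),c))  % false: minus is not associative
```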

For example, this code uses plus as a reduction operator to calculate the sum of the matrix X:

X = [3,1,7,0,4,1,6,3];
out = 0;
for i=1:numel(X)
  out = out+X(i);
end

Because adding integers is associative and commutative, the result of summing the elements is the same regardless of how you pair the elements and the order in which you add them. You can calculate the sum by using a binary tree.

In this example, the tree represents the process of summing the array elements that contain the values [3,1,7,0,4,1,6,3] from top to bottom. Each level of the tree is the result of adding a pair of values from the level above it. The last level contains the sum of the array, 25. The code generated by GPU Coder™ calculates reductions in parallel by using this binary tree-based approach in which multiple GPU threads apply the reduction operator to pairs of values.
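The tree-based process can be mimicked in plain MATLAB by repeatedly combining adjacent pairs of elements. This sketch assumes the number of elements is a power of two and is only an illustration of the idea, not the generated code:

```matlab
% Sum an array level by level, as in the binary tree: each pass adds
% adjacent pairs, halving the number of values until one remains.
X = [3,1,7,0,4,1,6,3];
level = X;
while numel(level) > 1
    level = level(1:2:end) + level(2:2:end);  % combine adjacent pairs
end
level  % 25, the same result as the sequential loop
```

Each `while` iteration corresponds to one level of the tree; on the GPU, all the pairwise additions within a level can execute in parallel.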

Note

Mathematical addition of real numbers is associative and commutative. However, addition of floating-point numbers is only approximately associative. Reducing large arrays of floating-point numbers in parallel might introduce numeric errors.
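You can observe this order dependence directly in single precision, where adding a small value to a much larger one loses the small value to rounding:

```matlab
% Floating-point addition is only approximately associative: the grouping
% of the additions changes the result.
a = single(1e8); b = single(1); c = single(-1e8);
(a + b) + c  % 0: b is lost when added to the large value a
(a + c) + b  % 1: cancelling the large values first preserves b
```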

Choose an Approach for Writing Reductions

You can write parallel reductions by using the gpucoder.reduce function or by using for-loops.

You can use gpucoder.reduce to express common reduction patterns, like reducing an array along one dimension or preprocessing before a reduction. GPU Coder maps calls to gpucoder.reduce to kernels in generated GPU code.

To use gpucoder.reduce, pass the array and a handle to the binary function that reduces it. For example, consider this pseudocode that calculates a reduction by using a loop:

acc = ...; % initialize acc to a value
for i=1:n
  ...
  acc = f(acc,b);
end

If the function f is commutative and associative, you can replace the loop with a call to gpucoder.reduce. This pseudocode shows how to reduce the array by using gpucoder.reduce:

acc = ...; % initialize acc to a value
if ~isempty(X)
  acc = gpucoder.reduce(X,@f);
end
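As a concrete case, the earlier loop that sums the elements of X could be written with gpucoder.reduce like this (a sketch following the pattern above; the isempty guard is needed because gpucoder.reduce requires a nonempty input):

```matlab
function out = sumReduce(X) %#codegen
% Sum the elements of X by using plus as the reduction operator.
out = 0;
if ~isempty(X)
    out = gpucoder.reduce(X,@plus);
end
end
```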

GPU Coder can also generate kernels for for-loops that implement reductions. However, if GPU Coder cannot prove that the result of the loop is independent of the iteration order, it generates a CPU loop.

Generate Kernels for for-Loop Reductions

GPU Coder generates kernels for for-loops when it can determine that the result of the loop is independent of the iteration order. For example, this function, sumArray, computes the sum of an input array X.

function y = sumArray(X) %#codegen
y = zeros(1,"like",X);
for i=1:numel(X)
    y = y+X(i);
end
end

To check if the generated code contains a GPU kernel, generate code for and profile sumArray by using the gpuPerformanceAnalyzer function. The GPU Performance Analyzer opens. In the Profiling Timeline pane, the analyzer shows the function uses one kernel, sumArray_kernel1, to calculate the sum.

X = gpuArray(randi(100,512));
cfg = coder.gpuConfig("mex");
gpuPerformanceAnalyzer("sumArray.m",{X},Config=cfg);

Profiling Timeline pane showing the generated code for sumArray executes one GPU kernel and zero loops
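If you want the generated MEX function without profiling, you can also call codegen directly with the same configuration (this assumes GPU Coder and a supported CUDA toolchain are installed):

```matlab
% Generate a GPU-accelerated MEX function for sumArray and call it.
X = gpuArray(randi(100,512));
cfg = coder.gpuConfig("mex");
codegen -config cfg sumArray -args {X}
y = sumArray_mex(X);  % run the generated MEX function
```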

Generating GPU Kernels by Using gpucoder.reduce

When you use gpucoder.reduce to implement a reduction, GPU Coder maps the call to a GPU kernel in the generated code. You can use gpucoder.reduce to reduce arrays to scalars or to reduce arrays along one dimension.

Reduce Array to Scalar

To reduce an array to a scalar, first determine the binary function you want to use as a reduction operator. For example, consider the numberOdd function. The function returns the number of odd elements in an array.

function y = numberOdd(X) %#codegen
y = 0;
for i=1:numel(X)
    y = y + mod(X(i),2);
end
end

Generate code for and profile numberOdd. In the GPU Performance Analyzer, the GPU Activities row does not contain kernel activities. A loop numberOdd_loop_0 executes for most of the run time.

X = randi(32,1024,"gpuArray");
cfg = coder.gpuConfig("mex");
gpuPerformanceAnalyzer("numberOdd.m",{X},Config=cfg);

Profiling Timeline pane showing no GPU kernels

In this example, the loop uses mod to preprocess the input array X and plus to add to the output variable y. To generate a GPU kernel that returns the same result, replace the loop with a call to gpucoder.reduce that uses mod as the preprocessing function and plus as the reduction operator. Name the new function numberOddReduce.

function y = numberOddReduce(X) %#codegen
y = 0;
if ~isempty(X)
    y = gpucoder.reduce(X,@plus,preprocess=@(x) mod(x,2));
end
end

Generate code for and profile numberOddReduce. The code generator maps the call to gpucoder.reduce to a GPU kernel in the generated code.

gpuPerformanceAnalyzer("numberOddReduce.m",{X},Config=cfg);

Performance Analyzer showing the results for the numberOddReduce function. The GPU Activities row shows a reduction kernel and other kernels.
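To confirm that the two implementations agree before generating code, you can compare them in MATLAB simulation (gpucoder.reduce also runs in MATLAB):

```matlab
% Sanity check: both implementations count the odd elements of X.
X = randi(32,1024,"gpuArray");
isequal(numberOdd(X),numberOddReduce(X))  % expected: true
```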

Reduce Array Along One Dimension

You can also use gpucoder.reduce to reduce along one dimension of an array. For example, consider this function, colNorm, which calculates the norm of each column of a three-dimensional array.

function Y = colNorm(X) %#codegen
% X: 3d matrix, MxNxK
% Y: 2d matrix, NxK, each element is a norm of a column of input
[~,n,k] = size(X);
Y = coder.nullcopy(zeros(n,k,"like",X));
for i = 1:n
    for j = 1:k
        Y(i,j) = norm(X(:,i,j));
    end
end
end

Generate code for and profile colNorm. In the GPU Performance Analyzer, in the Profiling Timeline pane, the Loops row shows that the loop colNorm_loop_0 executes for most of the execution time, and the GPU Activities row contains many kernel launches.

X = rand(512,256,128,"gpuArray");
gpuPerformanceAnalyzer("colNorm.m",{X},Config=cfg);

Profiling Timeline pane of the Performance Analyzer showing data for colNorm

Because the norm calculation is the square root of a sum, you can write it as a reduction by using gpucoder.reduce. To calculate the norm of the columns of the array:

  1. Preprocess the array by squaring the elements.

  2. Calculate the sum of each column of the array by using gpucoder.reduce with plus as the reduction operator. To sum the columns, specify the dimension to reduce along as 1.

  3. Take the square root of the array that gpucoder.reduce returns.

This code shows a function, colNormReduce, that calculates the norm of each column by using gpucoder.reduce. The function uses the coder.ceval function to synchronize the GPU before the program terminates.

function Y = colNormReduce(X) %#codegen
% input: 3d matrix, MxNxK
% output: 2d matrix, NxK, each element is a norm of a column of input
[~,n,k] = size(X);
Y = coder.nullcopy(zeros(n,k,"like",X));
if ~isempty(X)
    fh = @(x)x^2;
    Y = squeeze(gpucoder.reduce(X,@plus,dim=1,preprocess=fh));
    Y = sqrt(Y);
    if coder.target("GPU")
        coder.ceval("cudaDeviceSynchronize");
    end
end
end

Generate code for and profile colNormReduce. In the analyzer, in the Profiling Timeline pane, the GPU Activities row shows that the generated code uses a GPU kernel to calculate the norm.

gpuPerformanceAnalyzer("colNormReduce.m",{X},Config=cfg);

Profiling Timeline pane for colNormReduce showing that it executes a kernel, ReduceFirstDimBaseCaseKernel, for most of the execution time
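Because the reduction accumulates the squared elements in a different order than the loop, the two implementations agree only to floating-point tolerance. You can check this in MATLAB simulation:

```matlab
% Compare the loop-based and reduce-based implementations. The results
% differ only by rounding from the different accumulation orders.
X = rand(512,256,128,"gpuArray");
err = max(abs(colNorm(X) - colNormReduce(X)),[],"all");
% err is on the order of machine precision, not exactly zero
```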
