You can often improve code performance by executing it on a graphics processing unit (GPU). For example, execution on a GPU can improve performance if:
Your code is computationally expensive, where computing time significantly exceeds the time spent transferring data to and from GPU memory.
Your workflow uses functions with gpuArray (Parallel Computing Toolbox) support and large array inputs.
When writing code for the GPU, it is best to start with code that already performs well on the CPU. Vectorization is usually critical for achieving high performance on the GPU. Convert code to use functions that support GPU array arguments and transfer the input data to the GPU. For more information about MATLAB functions with GPU array inputs, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).
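The transfer-then-compute pattern described above can be sketched as follows. This is a minimal illustration, not part of the original example: the input vector x is hypothetical, and the canUseGPU guard is an added assumption so that the same code falls back to the CPU when no supported GPU is present.

```matlab
% Hypothetical input data; x is not defined in the original text.
x = randn(1e6,1);

% Transfer the data to the GPU only if a supported device is available.
if canUseGPU
    x = gpuArray(x);
end

% Vectorized functions run on the GPU when the input is a gpuArray.
m = mean(x);
s = std(x);
z = (x - m)./s;      % element-wise standardization, no explicit loop

% Bring only the small summary values back to host memory.
summary = gather([m s max(abs(z))]);
```

Transferring the data once, keeping intermediate results on the GPU, and gathering only a few scalars at the end keeps the transfer cost small relative to the computation.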
Many functions in Statistics and Machine Learning Toolbox™ automatically execute on the GPU when you use GPU array input data. For example, you can create a probability distribution object on the GPU, where the output is a GPU array.
pd = fitdist(gpuArray(x),"Normal")
Using a GPU requires Parallel Computing Toolbox™ and a supported GPU device. For information about supported devices, see GPU Support by Release (Parallel Computing Toolbox). For the complete list of Statistics and Machine Learning Toolbox™ functions that accept GPU arrays, see Functions.
You can query and select your GPU device using the gpuDevice function. If you have multiple GPUs, you can examine the properties of all GPUs detected in your system with the gpuDeviceTable function. Then, you can select a specific GPU for single-GPU execution by passing its index to gpuDevice.
D = gpuDevice
D =
  CUDADevice with properties:

                      Name: 'Tesla V100-PCIE-32GB'
                     Index: 1
         ComputeCapability: '7.0'
            SupportsDouble: 1
             DriverVersion: 11.2000
            ToolkitVersion: 11
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 3.4090e+10
           AvailableMemory: 3.3374e+10
       MultiprocessorCount: 80
              ClockRateKHz: 1380000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 0
          CanMapHostMemory: 1
           DeviceSupported: 1
           DeviceAvailable: 1
            DeviceSelected: 1
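On a system with several GPUs, listing and selecting a device might look like the following sketch. The index value 1 is a hypothetical choice, and the availability check is an added assumption so that the code does not error on a machine without a supported GPU.

```matlab
% Requires Parallel Computing Toolbox.
if gpuDeviceCount("available") > 0
    gpuTable = gpuDeviceTable;   % one row per detected GPU
    disp(gpuTable)

    idx = 1;                     % hypothetical choice of device index
    D = gpuDevice(idx);          % select the device with this index
    fprintf("Selected GPU %d: %s\n", D.Index, D.Name);
else
    disp("No supported GPU is available.")
end
```

Selecting a device with gpuDevice(idx) also resets that device, so select the GPU before you create any GPU arrays.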
Explore a data distribution on the GPU using descriptive statistics.
Generate a data set of normally distributed random numbers on the GPU.
dist = randn(1e5,1e4,"gpuArray");
Determine whether dist is a GPU array.
TF = isgpuarray(dist)
TF = logical 1
Execute a function with a GPU array input argument. For example, calculate the sample skewness for each column in dist. Because dist is a GPU array, the skewness function executes on the GPU and returns the result as a GPU array.
skew = skewness(dist);
Verify that the output skew is a GPU array.
TF = isgpuarray(skew)
TF = logical 1
Evaluate function execution time on the GPU and compare performance with execution on the CPU.
Comparing the time taken to execute code on the CPU and on the GPU can help you select the execution environment. For example, if you want to compute descriptive statistics from sample data, you must consider both the execution time and the data transfer time when evaluating overall performance. If a function has gpuArray support, computation on the GPU generally becomes faster relative to the CPU as the number of observations increases.
Measure the function run time in seconds by using the gputimeit (Parallel Computing Toolbox) function. gputimeit is preferable to timeit for functions that use the GPU because it ensures operation completion and compensates for overhead.
skew = @() skewness(dist);
t = gputimeit(skew)
t = 0.6270
Evaluate the performance difference between the GPU and CPU by independently measuring the CPU execution time. For this GPU, execution of this code is faster than execution on the CPU.
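The original does not show the CPU measurement. One way to make the comparison is sketched below, under stated assumptions: the matrix X is a smaller, hypothetical stand-in for dist so that the sketch runs on modest hardware, and the canUseGPU guard is added so that the CPU timing still runs without a GPU.

```matlab
% Smaller, hypothetical data set (the original dist is 1e5-by-1e4).
X = randn(1e4,1e3);

% Time the CPU execution with timeit.
tCPU = timeit(@() skewness(X));
fprintf("CPU: %.4f s\n", tCPU);

% Time the GPU execution with gputimeit if a supported GPU is available.
if canUseGPU
    Xg = gpuArray(X);
    tGPU = gputimeit(@() skewness(Xg));
    fprintf("GPU: %.4f s (speedup %.1fx)\n", tGPU, tCPU/tGPU);
end
```

Neither measurement includes the time needed to transfer X to the GPU, so for a one-off computation the transfer cost should be added to the GPU figure. For small inputs like this one, the GPU is not necessarily faster.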
The performance of code on a GPU is heavily dependent on the GPU used. For additional information about measuring and improving GPU performance, see Measure and Improve GPU Performance (Parallel Computing Toolbox).
You can often improve the performance of your code by performing calculations in single precision instead of double precision.
Determine the execution time of the skewness function with an input argument of the dist data set in single precision.
dist_single = single(dist);
skew_single = @() skewness(dist_single);
t_single = gputimeit(skew_single)
t_single = 0.2206
For this GPU, execution of this code with single precision data is faster than execution with double precision data.
The performance improvement is dependent on the GPU card and total number of cores. For more information about using single precision with the GPU, see Measure and Improve GPU Performance (Parallel Computing Toolbox).
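Single precision halves the memory needed per element but also reduces numerical precision, so it can be worth checking how far the results move. The following sketch uses a smaller, hypothetical matrix X in place of dist, and the canUseGPU guard is an added assumption so that the code also runs on the CPU.

```matlab
% Smaller, hypothetical data set standing in for dist.
X = randn(1e4,1e3);
if canUseGPU
    X = gpuArray(X);
end

% Convert to single precision; the data stays where it already is.
Xs = single(X);
typeName = underlyingType(Xs);   % type of the stored data, not "gpuArray"

% Compare the column skewness values computed in both precisions.
skDouble = skewness(X);
skSingle = skewness(Xs);
maxDiff = gather(max(abs(skDouble - double(skSingle))));
fprintf("Underlying type: %s, largest difference: %g\n", typeName, maxDiff);
```

Whether a difference of that size is acceptable depends on the downstream use of the statistic.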
Implement dimensionality reduction and classification workflows on a GPU.
The pca function performs principal component analysis (PCA), which reduces data dimensionality by replacing several correlated variables with a new set of variables that are linear combinations of the original variables.
The fitcensemble function fits many classification learners to form an ensemble model that can make better predictions than a single learner.
Both functions are computationally intensive and can be significantly accelerated using the GPU.
For example, use the humanactivity data set. The data set contains 24,075 observations of five different physical human activities: sitting, standing, walking, running, and dancing. Each observation has 60 features extracted from acceleration data measured by smartphone accelerometer sensors. The data set contains the following variables:
actid — Response vector containing the activity IDs in integers: 1, 2, 3, 4, and 5 representing sitting, standing, walking, running, and dancing, respectively
actnames — Activity names corresponding to the integer activity IDs
feat — Feature matrix of 60 features for 24,075 observations
featlabels — Labels of the 60 features
Use 90% of the observations to train a model that classifies the five types of human activities, and use 10% of the observations to validate the trained model. Use cvpartition to specify a 10% holdout for the test set.
Partition = cvpartition(actid,"Holdout",0.10);
trainingInds = training(Partition); % Indices for the training set
testInds = test(Partition); % Indices for the test set
Transfer the training and test data to the GPU.
XTrain = gpuArray(feat(trainingInds,:));
YTrain = gpuArray(actid(trainingInds));
XTest = gpuArray(feat(testInds,:));
YTest = gpuArray(actid(testInds));
Find the principal components for the training data set XTrain.
[coeff,score,~,~,explained,mu] = pca(XTrain);
Find the number of components required to explain at least 99% of variability.
idx = find(cumsum(explained)>99,1);
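To see how the cumsum and find combination picks the number of components, consider a small worked example. The explained values here are made up for illustration and do not come from the humanactivity data.

```matlab
% Hypothetical percentages of variance explained by five components.
explainedDemo = [60; 25; 10; 4.5; 0.5];

% Running total of explained variance: [60; 85; 95; 99.5; 100].
cumExplained = cumsum(explainedDemo);

% Index of the first component at which the running total exceeds 99.
idxDemo = find(cumExplained > 99,1);   % idxDemo is 4
```

The first three components explain only 95% of the variance, so the fourth is the first to push the total past the threshold.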
Determine the principal component scores that represent XTrain in the principal component space.
XTrainPCA = score(:,1:idx);
Fit an ensemble of learners for classification.
template = templateTree("MaxNumSplits",20,"Reproducible",true);
classificationEnsemble = fitcensemble(XTrainPCA,YTrain, ...
    "Method","AdaBoostM2", ...
    "NumLearningCycles",30, ...
    "Learners",template, ...
    "LearnRate",0.1, ...
    "ClassNames",[1; 2; 3; 4; 5]);
To use the trained model for the test set, you need to transform the test data set by using the PCA obtained from the training data set.
XTestPCA = (XTest-mu)*coeff(:,1:idx);
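The projection used for the test set is the same operation that pca applies to the training data: center with mu, then multiply by the coefficients. A quick check on the CPU, using a small hypothetical matrix Xdemo rather than the humanactivity features, illustrates this.

```matlab
rng(0);                                   % reproducible hypothetical data
Xdemo = randn(200,5) * randn(5,5);        % 200 observations, 5 correlated features

% Scores returned by pca with its default centering.
[coeffDemo,scoreDemo,~,~,~,muDemo] = pca(Xdemo);

% Scores recomputed by centering and projecting manually.
scoreManual = (Xdemo - muDemo) * coeffDemo;

% Largest absolute difference; expected to be at rounding-error level.
maxDiff = max(abs(scoreDemo - scoreManual),[],"all");
```

Because the manual projection reproduces the training scores, applying it to new data with the training mu and coeff places the test observations in the same principal component space.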
Evaluate the accuracy of the trained classifier with the test data.
classificationError = loss(classificationEnsemble,XTestPCA,YTest);
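With its default settings, loss returns the classification error for an ensemble, so one minus that value gives an accuracy figure. The sketch below illustrates this on the CPU under stated assumptions: it uses the fisheriris sample data set and a hypothetical model mdl instead of the humanactivity workflow, so that it runs on its own.

```matlab
rng(0);                                   % reproducible partition
load fisheriris                           % meas (150-by-4), species (150-by-1 cell)

% Hold out 10% of the observations for testing.
c = cvpartition(species,"Holdout",0.10);
trainIdx = training(c);
testIdx = test(c);

% Fit a boosted ensemble of shallow trees (hypothetical stand-in model).
mdl = fitcensemble(meas(trainIdx,:),species(trainIdx), ...
    "Method","AdaBoostM2", ...
    "NumLearningCycles",30, ...
    "Learners",templateTree("MaxNumSplits",20));

% Default loss for a classification ensemble is the classification error.
err = loss(mdl,meas(testIdx,:),species(testIdx));
accuracy = 1 - err;

% Predicted labels for the held-out observations.
labels = predict(mdl,meas(testIdx,:));
```

When the inputs are GPU arrays, as in the workflow above, the loss value is also returned as a GPU array and can be brought to the host with gather if needed.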
Transfer data or model properties from the GPU to the local workspace for use with a function that does not support GPU arrays.
Transferring GPU arrays can be costly and is generally not necessary unless you need to use your result with functions that do not support GPU arrays or in another workspace where a GPU is unavailable.
The gather (Parallel Computing Toolbox) function transfers data from the GPU into the local workspace. Gather the dist data and confirm that the data is no longer a GPU array.
dist = gather(dist);
TF = isgpuarray(dist)
TF = logical 0
The gather function also transfers properties of a machine learning model from the GPU into the local workspace. Gather the classificationEnsemble model and confirm that model properties that were previously GPU arrays, such as X, are no longer GPU arrays.
classificationEnsemble = gather(classificationEnsemble);
TF = isgpuarray(classificationEnsemble.X)
TF = logical 0