Deep Learning with MATLAB on Multiple GPUs

MATLAB® supports training a single deep neural network using multiple GPUs in parallel. By using parallel workers with GPUs, you can train with multiple GPUs on your local machine, on a cluster, or on the cloud. Using multiple GPUs can speed up training significantly. To decide if you expect multi-GPU training to deliver a performance gain, consider the following factors:

How long is the iteration on each GPU? If each GPU iteration is short, then the added overhead of communication between GPUs can dominate. Try increasing the computation per iteration by using a larger batch size.
Are all the GPUs on a single machine? Communication between GPUs on different machines introduces a significant communication delay. You can mitigate this if you have suitable hardware. For more information, see Advanced Support for Fast Multi-Node GPU Communication.

Tip

To train a single network using multiple GPUs on your local machine, you can simply specify the ExecutionEnvironment option as "multi-gpu" without changing the rest of your code. The trainnet function automatically uses your available GPUs for training computations. For an example showing how to train a network using multiple local GPUs, see Train Network Using Automatic Multi-GPU Support.

When you train on a remote cluster, specify the ExecutionEnvironment option as "parallel-auto". If the cluster has access to one or more GPUs, then trainnet uses only the GPUs for training. Workers without a unique GPU are never used for training computation.

If you want to use more resources, you can scale up deep learning training to clusters or the cloud. To learn more about parallel options, see Scale Up Deep Learning in Parallel, on GPUs, and in the Cloud. To try an example, see Train Network in the Cloud Using Automatic Parallel Support.

Using a GPU or parallel options requires Parallel Computing Toolbox™. Using a GPU also requires a supported GPU device. For information on supported devices, see GPU Computing Requirements (Parallel Computing Toolbox). Using a remote cluster also requires MATLAB Parallel Server™.

Use Multiple GPUs in Local Machine

Note

If you run MATLAB on a single machine in the cloud that you connect to via ssh or remote desktop protocol (RDP), then network execution and training use the same code as if you were running on your local machine.

If you have access to a machine with multiple GPUs, you can train a network using the trainnet function by setting the ExecutionEnvironment training option to "multi-gpu" using the trainingOptions function.

The "multi-gpu" option allows you to use multiple GPUs in a local parallel pool. If there is no current parallel pool, trainnet automatically starts a local parallel pool using your default cluster profile settings. The pool has as many workers as the number of available GPUs.
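As a minimal sketch of the option described above, the following code creates training options for local multi-GPU training. The solver, the mini-batch size, and the commented-out trainnet call (including the names trainingData and layers and the "crossentropy" loss) are illustrative assumptions, not values taken from this page.

```matlab
% Request training on all available local GPUs.
% The solver ("sgdm") and MiniBatchSize value are illustrative assumptions.
options = trainingOptions("sgdm", ...
    ExecutionEnvironment="multi-gpu", ...
    MiniBatchSize=128);

% Hypothetical call: trainingData and layers are placeholders for your own
% data and network, and "crossentropy" is only an example loss function.
% net = trainnet(trainingData,layers,"crossentropy",options);
```

Only the ExecutionEnvironment value changes compared with single-GPU training; the rest of the training code stays the same.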

For information on how to perform custom training using multiple GPUs in your local machine, see Run Custom Training Loops on a GPU and in Parallel.

Use Multiple GPUs in Cluster

For training with multiple GPUs in a remote cluster, set the ExecutionEnvironment training option to "parallel-auto" or "parallel-gpu" using the trainingOptions function.

If there is no current parallel pool, trainnet automatically starts a parallel pool using your default cluster profile settings. If the pool has access to GPUs, then only workers with a unique GPU perform training computation. If the pool does not have GPUs, then training takes place on all available CPU workers instead.
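The following sketch shows one way to open a pool on a cluster yourself before training, rather than relying on the default profile. The profile name "MyCluster" and the worker count are hypothetical; substitute a profile that exists in your Cluster Profile Manager.

```matlab
% Assumption: a cluster profile named "MyCluster" exists on this client.
c = parcluster("MyCluster");

% Assumption: 8 workers is an illustrative pool size.
pool = parpool(c,8);

% Let trainnet use the GPUs in the pool if there are any, and the
% CPU workers otherwise.
options = trainingOptions("sgdm", ...
    ExecutionEnvironment="parallel-auto");
```

If you skip the parcluster and parpool calls, trainnet starts a pool from your default cluster profile as described above.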

For information on how to perform custom training using multiple GPUs in a remote cluster, see Run Custom Training Loops on a GPU and in Parallel.

Optimize Mini-Batch Size and Learning Rate

Convolutional neural networks are typically trained iteratively using mini-batches of images. This is because the whole data set is usually too large to fit into GPU memory. For optimum performance, you can experiment with the mini-batch size by changing the MiniBatchSize option using the trainingOptions function.

The optimal mini-batch size depends on your exact network, data set, and GPU hardware. When training with multiple GPUs, each image batch is distributed between the GPUs. This effectively increases the total GPU memory available, allowing larger batch sizes. A recommended practice is to scale up the mini-batch size linearly with the number of GPUs, in order to keep the workload on each GPU constant. For example, if you are training on a single GPU using a mini-batch size of 64, and you want to scale up to training with four GPUs of the same type, you can increase the mini-batch size to 256 so that each GPU processes 64 observations per iteration.

Because increasing the mini-batch size improves the significance of each iteration, you can increase the learning rate. A good general guideline is to increase the learning rate proportionally to the increase in mini-batch size. Depending on your application, a larger mini-batch size and learning rate can speed up training without a decrease in accuracy, up to some limit.
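The linear scaling guideline can be sketched as follows, using the mini-batch size of 64 and the four GPUs from the example above. The base learning rate of 0.01 is an illustrative assumption, and the scaled values are a starting point for experimentation rather than guaranteed optimal settings.

```matlab
baseMiniBatchSize = 64;    % single-GPU mini-batch size from the example above
baseLearnRate     = 0.01;  % assumption: illustrative single-GPU learning rate
numGPUs           = 4;     % number of identical GPUs used for training

% Scale the mini-batch size linearly with the number of GPUs so that
% each GPU still processes baseMiniBatchSize observations per iteration.
miniBatchSize = baseMiniBatchSize * numGPUs;

% Scale the learning rate by the same factor as the mini-batch size.
learnRate = baseLearnRate * (miniBatchSize / baseMiniBatchSize);

options = trainingOptions("sgdm", ...
    ExecutionEnvironment="multi-gpu", ...
    MiniBatchSize=miniBatchSize, ...
    InitialLearnRate=learnRate);
```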

You can use the Experiment Manager app to find optimal training options by sweeping through a range of hyperparameter values or by using Bayesian optimization. For more information on how to use Experiment Manager, see Compare Classification Network Architectures Using Experiment.

Select Particular GPUs to Use for Training

If you do not want to use all of your GPUs, you can select the GPUs that you want to use for training and inference directly. Doing so can be useful to avoid training on a poor-performance GPU, for example, your display GPU.

If your GPUs are in your local machine, you can use the gpuDeviceTable (Parallel Computing Toolbox) and gpuDeviceCount (Parallel Computing Toolbox) functions to examine your GPU resources and determine the index of the GPUs you want to use.
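For instance, a short inspection along these lines lists the local devices so that you can note the indices you want. This is a sketch that assumes Parallel Computing Toolbox is installed; the exact columns shown in the table can vary by release.

```matlab
% Count the GPUs that are available for use in this MATLAB session.
numGPUs = gpuDeviceCount("available");
fprintf("Available GPUs: %d\n",numGPUs);

if numGPUs > 0
    % Display a table of the local GPU devices, including the index
    % of each device.
    gpuTable = gpuDeviceTable;
    disp(gpuTable)
end
```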

For single GPU training with the "auto" or "gpu" options, by default, MATLAB uses the GPU device with index 1. You can use a different GPU by selecting the device before you start training. Use gpuDevice (Parallel Computing Toolbox) to select the desired GPU using its index:

gpuDevice(index)

trainnet automatically uses the selected GPU when you set the ExecutionEnvironment option to "auto" or "gpu".

For multiple GPU training with the "multi-gpu" option, by default, MATLAB uses all available GPUs in your local machine. If you want to exclude GPUs, you can start the parallel pool in advance and select the devices manually.

For example, suppose you have three GPUs but you only want to use the devices with indices 1 and 3. You can use the following code to start a parallel pool with two workers and select one GPU on each worker.

useGPUs = [1 3];
parpool("Processes",numel(useGPUs));
spmd 
    gpuDevice(useGPUs(spmdIndex)); 
end

trainnet automatically uses the current parallel pool when you set the ExecutionEnvironment option to "multi-gpu" (or "parallel-auto" or "parallel-gpu" for the same result).
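Before training, you might want to confirm which device each worker selected. The following sketch assumes the pool from the code above is already open and queries the currently selected device on every worker.

```matlab
% Assumption: a parallel pool is already open and each worker has
% selected a GPU as shown above.
spmd
    d = gpuDevice;  % with no input, returns the currently selected device
    disp("Worker " + spmdIndex + " uses GPU " + d.Index + " (" + d.Name + ")");
end

% Training with these options then uses the existing pool.
options = trainingOptions("sgdm",ExecutionEnvironment="multi-gpu");
```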

Train Multiple Networks on Multiple GPUs

To train multiple models in parallel with one GPU each, start a parallel pool with one worker per available GPU, and train each network on a different worker. Use parfor or parfeval to simultaneously execute a network on each worker. Use the trainingOptions function to set the ExecutionEnvironment name-value option to "gpu" on each worker.

For example, use code of the following form to train multiple networks in parallel on all available GPUs:

options = trainingOptions("sgdm",ExecutionEnvironment="gpu");

parfor i=1:gpuDeviceCount("available")
    trainnet(…,options); 
end

To run in the background without blocking your local MATLAB, use parfeval. For examples showing how to train multiple networks using parfor and parfeval, see

Make Predictions Using Multiple GPUs

To make predictions in parallel using multiple GPUs, create a parallel pool with one worker per GPU, divide up your data, and make the predictions in parallel. For an example showing how to make predictions using multiple GPUs, see Train Network Using Automatic Multi-GPU Support.
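The "divide up your data" step can be sketched as a simple partition of observation indices into one contiguous chunk per worker. The observation count and worker count are hypothetical values, and the commented-out loop only indicates the general form of the parallel prediction step; net and X are placeholders for your own trained network and data.

```matlab
numObservations = 1000;  % assumption: illustrative number of observations
numWorkers      = 4;     % assumption: one worker per GPU

% Split the indices 1..numObservations into numWorkers contiguous chunks.
edges  = round(linspace(0,numObservations,numWorkers+1));
chunks = arrayfun(@(k) edges(k)+1:edges(k+1),1:numWorkers, ...
    UniformOutput=false);

% Hypothetical form of the prediction step (net and X are placeholders):
% scores = cell(numWorkers,1);
% parfor k = 1:numWorkers
%     scores{k} = minibatchpredict(net,X(:,:,:,chunks{k}));
% end
```

Contiguous chunks keep the results in the original observation order when you concatenate the per-worker outputs.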

Advanced Support for Fast Multi-Node GPU Communication

Some multi-GPU features in MATLAB, including the trainnet function, are optimized for direct communication via fast interconnects for improved performance.

If you have appropriate hardware connections, then data transfer between multiple GPUs uses fast peer-to-peer communication, including NVLink, if available.

If you are using a Linux® compute cluster with fast interconnects between machines, such as InfiniBand, or fast interconnects between GPUs on different machines, such as GPUDirect RDMA, you might be able to take advantage of fast multi-node support in MATLAB. Enable this support on all the workers in your pool by setting the environment variable PARALLEL_SERVER_FAST_MULTINODE_GPU_COMMUNICATION to 1. Set this environment variable in the Cluster Profile Manager.

This feature is part of the NVIDIA NCCL library for GPU communication. To configure it, you must set additional environment variables to define the network interface protocol, especially NCCL_SOCKET_IFNAME. For more information, see the NCCL documentation and in particular the section on NCCL Environment Variables.
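One way to check that the workers actually see these settings is to read the environment variables on each worker. This is a verification sketch only, assuming a pool is already open on the cluster; it does not set the variables, which you still do in the Cluster Profile Manager as described above.

```matlab
% Assumption: a parallel pool on the cluster is already open.
spmd
    % Convert to string so that an unset variable is shown as empty text.
    fastFlag  = string(getenv('PARALLEL_SERVER_FAST_MULTINODE_GPU_COMMUNICATION'));
    ifaceName = string(getenv('NCCL_SOCKET_IFNAME'));
    disp("Worker " + spmdIndex + ": fast multi-node = [" + fastFlag + ...
        "], NCCL_SOCKET_IFNAME = [" + ifaceName + "]");
end
```

Empty brackets for a worker indicate that the corresponding variable is not set in that worker's environment.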