Use Experiment Manager in the Cloud with MATLAB Deep Learning Container

This example shows how to fine-tune your deep learning network by using Experiment Manager in the cloud. Take advantage of multiple high-performance NVIDIA® GPUs on an AWS® EC2 instance to run several experiments in parallel. Tune the hyperparameters of your network and try different network architectures. You can sweep through a range of hyperparameters automatically and save the results of each variation. Compare the results of your experiments to find the best network.

Classification of CIFAR-10 Image Data with Experiment Manager in the Cloud

To get started with an Experiment Manager classification example, first download the CIFAR-10 training data to the MATLAB® Deep Learning Container. A simple way to do so is to open the following Live Script and run the first section.

openExample('deeplearning_shared/TrainNetworkUsingAutomaticMultiGPUSupportExample')
directory = pwd; % download the data to the current folder
[locationCifar10Train,locationCifar10Test] = downloadCIFARToFolders(directory);
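
To confirm that the download succeeded, one optional check (not part of the original example) is to count the images in each class:

% Optional sanity check: CIFAR-10 has 10 classes with 5000 training images each.
imdsCheck = imageDatastore(locationCifar10Train, ...
    'IncludeSubfolders',true,'LabelSource','foldernames');
countEachLabel(imdsCheck)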

Next, open Experiment Manager by running experimentManager in the MATLAB Command Window or by opening the Experiment Manager app from the Apps tab on the toolstrip.

experimentManager

Open an example as a starting point to edit. On the Experiment Manager start page, select the "Create a Deep Learning Experiment for Classification" example.

Hyperparameters

In the Hyperparameters section, delete the two existing parameters and add two new parameters. Name the first "Momentum" with values [0.01,0.1] and the second "InitialLearningRate" with values [1e-3,4e-3].
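
An exhaustive sweep over these two hyperparameters produces four trials (2-by-2 combinations). Each trial receives one combination through the params struct that Experiment Manager passes to the setup function; for example, a single trial might see values like these (illustrative only, as the app builds this struct for you):

% Illustrative hyperparameter values for one trial of the sweep; Experiment
% Manager constructs the params struct automatically.
params.Momentum = 0.01;
params.InitialLearningRate = 1e-3;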

Setup Function

Click Edit on the setup function, ClassificationExperiment_setup1, and delete its contents. Copy the setup function ClassificationExperiment_setup1 provided at the end of this example and paste the entire contents into the ClassificationExperiment_setup1.m file. As a final step, set the paths to the training and test data. Check the workspace variables locationCifar10Train and locationCifar10Test created when you downloaded the data, and replace the placeholder paths in the ClassificationExperiment_setup1 function with the values of these variables.

locationCifar10Train = '/path/to/train/data'; % replace with the path to the CIFAR-10 training data, see the locationCifar10Train workspace variable
locationCifar10Test = '/path/to/test/data'; % replace with the path to the CIFAR-10 test data, see the locationCifar10Test workspace variable

The function in ClassificationExperiment_setup1.m is an adaptation of the Train Network Using Automatic Multi-GPU Support example Live Script. The definition of the deep learning network is copied unchanged. The training options are modified to:

  • Set 'ExecutionEnvironment' to 'gpu'.

  • Replace 'InitialLearnRate' with 'params.InitialLearningRate', which takes the values specified in the hyperparameters section of Experiment Manager.

  • Add a 'Momentum' training option set to 'params.Momentum', also specified in the hyperparameters table.
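
For reference, the resulting trainingOptions call in the setup function includes these lines (this is a simplified excerpt; the full call, with the remaining unchanged options, is in the appendix):

options = trainingOptions('sgdm', ...
    'ExecutionEnvironment','gpu', ...                  % one GPU per trial
    'InitialLearnRate',params.InitialLearningRate, ... % swept hyperparameter
    'Momentum',params.Momentum);                       % swept hyperparameter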

Run in Parallel

You are now ready to run the experiments. Check the number of available GPUs by running the following function:

gpuDeviceCount("available")

  1. Configure your local cluster to specify a number of workers equal to the number of available GPUs, as shown in the sketch after this list. For information on specifying the number of workers in a cluster, see Run Code on Parallel Pools.

  2. In Experiment Manager, select "Use Parallel" and then "Run" to run the experiments in parallel on one GPU each (you cannot select the multi-GPU training option when running trials in parallel). You can see your experiments running concurrently on the Experiment Manager results tab. This example was run on 4 NVIDIA® Titan Xp GPUs, so 4 trials run concurrently. This step shows the R2021a release of the MATLAB Deep Learning Container.
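
A minimal sketch of the cluster configuration in step 1, assuming the default local cluster profile is named 'local':

% Size the local cluster so each parallel worker gets its own GPU.
numGPUs = gpuDeviceCount("available");
c = parcluster('local');  % default local cluster profile
c.NumWorkers = numGPUs;   % one worker per available GPU
saveProfile(c);           % save so Experiment Manager picks up the new limit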

Export Trial and Save to Cloud

Once the trials have finished running, compare the results to choose your preferred network. You can view the Training Plot and Confusion Matrix for each trial to help with your comparison.

After you have selected your preferred trained network, export it to the MATLAB workspace by clicking "Export". This creates the (default) variable trainedNetwork in the MATLAB workspace. Following the procedure to create an S3 bucket and AWS access keys (if you have not done so already) in the earlier "Step 2", save trainedNetwork directly to Amazon S3™.

setenv('AWS_ACCESS_KEY_ID', 'YOUR_AWS_ACCESS_KEY_ID'); 
setenv('AWS_SECRET_ACCESS_KEY', 'YOUR_AWS_SECRET_ACCESS_KEY');
setenv('AWS_SESSION_TOKEN', 'YOUR_AWS_SESSION_TOKEN'); % optional
setenv('AWS_DEFAULT_REGION', 'YOUR_AWS_DEFAULT_REGION'); % optional
save('s3://mynewbucket/trainedNetwork.mat','trainedNetwork','-v7.3');

You can then, for example, load this trained network into your local MATLAB session on your desktop. Note: Saving and loading MAT-files to and from remote file systems using the save and load functions is supported in MATLAB R2021a and later, provided the MAT-files are version 7.3. Ensure you are running MATLAB R2021a or later on both your local machine and in the Deep Learning Container.

setenv('AWS_ACCESS_KEY_ID', 'YOUR_AWS_ACCESS_KEY_ID'); 
setenv('AWS_SECRET_ACCESS_KEY', 'YOUR_AWS_SECRET_ACCESS_KEY');
setenv('AWS_SESSION_TOKEN', 'YOUR_AWS_SESSION_TOKEN'); % optional
setenv('AWS_DEFAULT_REGION', 'YOUR_AWS_DEFAULT_REGION'); % optional
load('s3://mynewbucket/trainedNetwork.mat')
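
As a quick sanity check that the network loaded correctly, you can classify a sample 32-by-32-by-3 image with it. This sketch assumes you have a CIFAR-10 sized image on your local machine; the file name is illustrative:

% Classify one test image with the loaded network. The image must match
% the 32-by-32-by-3 input size of the network.
img = imread('cifar10_test_image.png'); % illustrative file name
label = classify(trainedNetwork,img)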

Appendix - Setup Function for CIFAR-10 Classification Network

function [augmentedImdsTrain,layers,options] = ClassificationExperiment_setup1(params)
locationCifar10Train = '/path/to/train/data'; % Replace with the path to the CIFAR-10 training data, see the locationCifar10Train workspace variable
locationCifar10Test = '/path/to/test/data'; % Replace with the path to the CIFAR-10 test data, see the locationCifar10Test workspace variable
imdsTrain = imageDatastore(locationCifar10Train, ...
    'IncludeSubfolders',true, ...
    'LabelSource','foldernames');
imdsTest = imageDatastore(locationCifar10Test, ...
    'IncludeSubfolders',true, ...
    'LabelSource','foldernames');
imageSize = [32 32 3];
pixelRange = [-4 4];
imageAugmenter = imageDataAugmenter( ...
    'RandXReflection',true, ...
    'RandXTranslation',pixelRange, ...
    'RandYTranslation',pixelRange);
augmentedImdsTrain = augmentedImageDatastore(imageSize,imdsTrain, ...
    'DataAugmentation',imageAugmenter, ...
    'OutputSizeMode','randcrop');
blockDepth = 4; % blockDepth controls the depth of a convolutional block
netWidth = 32; % netWidth controls the number of filters in a convolutional block
layers = [
    imageInputLayer(imageSize) 
    
    convolutionalBlock(netWidth,blockDepth)
    maxPooling2dLayer(2,'Stride',2)
    convolutionalBlock(2*netWidth,blockDepth)
    maxPooling2dLayer(2,'Stride',2)    
    convolutionalBlock(4*netWidth,blockDepth)
    averagePooling2dLayer(8) 
    
    fullyConnectedLayer(10)
    softmaxLayer
    classificationLayer
    ];
miniBatchSize = 256;
options = trainingOptions('sgdm', ...
    'ExecutionEnvironment','gpu', ... 
    'InitialLearnRate',params.InitialLearningRate, ... % hyperparameter 'InitialLearningRate'
    'Momentum', params.Momentum, ... % hyperparameter 'Momentum'
    'MiniBatchSize',miniBatchSize, ... 
    'Verbose',false, ... 
    'Plots','training-progress', ... 
    'L2Regularization',1e-10, ...
    'MaxEpochs',50, ...
    'Shuffle','every-epoch', ...
    'ValidationData',imdsTest, ...
    'ValidationFrequency',floor(numel(imdsTrain.Files)/miniBatchSize), ...
    'LearnRateSchedule','piecewise', ...
    'LearnRateDropFactor',0.1, ...
    'LearnRateDropPeriod',45);
end
function layers = convolutionalBlock(numFilters,numConvLayers)
% Create a block of numConvLayers identical convolution-batchnorm-ReLU
% layers, each convolution using numFilters 3-by-3 filters.
layers = [
    convolution2dLayer(3,numFilters,'Padding','same')
    batchNormalizationLayer
    reluLayer];

layers = repmat(layers,numConvLayers,1);
end
