# Classify Videos Using Deep Learning with Custom Training Loop

This example shows how to create a network for video classification by combining a pretrained image classification model and a sequence classification network.

You can perform video classification without using a custom training loop by using the `trainNetwork` function. For an example, see Classify Videos Using Deep Learning. However, if `trainingOptions` does not provide the options you need (for example, a custom learning rate schedule), then you can define your own custom training loop, as shown in this example.

To create a deep learning network for video classification:

1. Convert videos to sequences of feature vectors using a pretrained convolutional neural network, such as GoogLeNet, to extract features from each frame.

2. Train a sequence classification network on the sequences to predict the video labels.

3. Assemble a network that classifies videos directly by combining layers from both networks.

The following diagram illustrates the network architecture:

• To input image sequences to the network, use a sequence input layer.

• To extract features from the image sequences, use convolutional layers from the pretrained GoogLeNet network.

• To classify the resulting vector sequences, include the sequence classification layers.

When training this type of network with the `trainNetwork` function (not done in this example), you must use sequence folding and unfolding layers to process the video frames independently. When you train this type of network with a `dlnetwork` object and a custom training loop (as in this example), sequence folding and unfolding layers are not required because the network uses dimension information given by the `dlarray` dimension labels.
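As a minimal sketch of the dimension labels (array sizes assumed for illustration, not taken from this example), you can label a mini-batch of feature-vector sequences with a formatted `dlarray`:

```
% Sketch with assumed sizes: 1024 features, 400 time steps, 16 observations.
X = rand(1024,400,16,'single');

% Label the dimensions as channel (C), time (T), and batch (B).
% dlarray stores the labels in a canonical order, so dims(dlX) is 'CBT'.
dlX = dlarray(X,'CTB');
```

Because the `dlarray` carries these labels, the network knows which dimensions are spatial, time, and batch, so no explicit folding and unfolding layers are needed.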

To convert frames of videos to feature vectors, use the activations of a pretrained network.

Load a pretrained GoogLeNet model using the `googlenet` function. This function requires the Deep Learning Toolbox™ Model for GoogLeNet Network support package. If this support package is not installed, then the function provides a download link.

`netCNN = googlenet;`

Download the HMDB51 data set from HMDB: a large human motion database and extract the RAR file into a folder named `"hmdb51_org"`. The data set contains about 2 GB of video data for 7000 clips over 51 classes, such as `"drink"`, `"run"`, and `"shake_hands"`.

After extracting the RAR file, make sure that the folder `hmdb51_org` contains subfolders named after the body motions. If it contains RAR files, you need to extract them as well. Use the supporting function `hmdb51Files` to get the file names and the labels of the videos. To speed up training at the cost of accuracy, specify a fraction in the range [0 1] to read only a random subset of files from the database. If you do not specify the `fraction` input argument, then the function `hmdb51Files` reads the full data set without changing the order of the files.

```
dataFolder = "hmdb51_org";
fraction = 1;
[files,labels] = hmdb51Files(dataFolder,fraction);
```

Read the first video using the `readVideo` helper function, defined at the end of this example, and view the size of the video. The video is an H-by-W-by-C-by-T array, where H, W, C, and T are the height, width, number of channels, and number of frames of the video, respectively.

```
idx = 1;
filename = files(idx);
video = readVideo(filename);
size(video)
```
```
ans = 1×4

   240   352     3   115
```

View the corresponding label.

`labels(idx)`
```
ans = categorical
     shoot_ball
```

To view the video, loop over the individual frames and use the `image` function. Alternatively, you can use the `implay` function (requires Image Processing Toolbox).

```
numFrames = size(video,4);

figure
for i = 1:numFrames
    frame = video(:,:,:,i);
    
    image(frame);
    xticklabels([]);
    yticklabels([]);
    
    drawnow
end
```

### Convert Frames to Feature Vectors

Use the convolutional network as a feature extractor: input video frames to the network and extract the activations. Convert the videos to sequences of feature vectors, where the feature vectors are the output of the `activations` function on the last pooling layer of the GoogLeNet network (`"pool5-7x7_s1"`).

This diagram illustrates the data flow through the network.

Read the video data using the `readVideo` function, defined at the end of this example, and resize it to match the input size of the GoogLeNet network. Note that this step can take a long time to run. After converting the videos to sequences, save the sequences and corresponding labels in a MAT file in the `tempdir` folder. If the MAT file already exists, then load the sequences and labels from the MAT file directly. If a MAT file already exists and you want to overwrite it, set the variable `overwriteSequences` to `true`.

```
inputSize = netCNN.Layers(1).InputSize(1:2);
layerName = "pool5-7x7_s1";

tempFile = fullfile(tempdir,"hmdb51_org.mat");
overwriteSequences = false;

if exist(tempFile,'file') && ~overwriteSequences
    load(tempFile)
else
    numFiles = numel(files);
    sequences = cell(numFiles,1);
    
    for i = 1:numFiles
        fprintf("Reading file %d of %d...\n", i, numFiles)
        
        video = readVideo(files(i));
        video = imresize(video,inputSize);
        sequences{i,1} = activations(netCNN,video,layerName,'OutputAs','columns');
    end
    
    % Save the sequences and the labels associated with them.
    save(tempFile,"sequences","labels","-v7.3");
end
```

View the sizes of the first few sequences. Each sequence is a D-by-T array, where D is the number of features (the output size of the pooling layer) and T is the number of frames of the video.

`sequences(1:10)`
```
ans=10×1 cell array
    {1024×115 single}
    {1024×227 single}
    {1024×180 single}
    {1024×40  single}
    {1024×60  single}
    {1024×156 single}
    {1024×83  single}
    {1024×42  single}
    {1024×82  single}
    {1024×110 single}
```

### Prepare Training Data

Prepare the data for training by partitioning the data into training and validation partitions and removing any long sequences.

#### Create Training and Validation Partitions

Partition the data. Assign 90% of the data to the training partition and 10% to the validation partition.

```
numObservations = numel(sequences);
idx = randperm(numObservations);
N = floor(0.9 * numObservations);

idxTrain = idx(1:N);
sequencesTrain = sequences(idxTrain);
labelsTrain = labels(idxTrain);

idxValidation = idx(N+1:end);
sequencesValidation = sequences(idxValidation);
labelsValidation = labels(idxValidation);
```

#### Remove Long Sequences

Sequences that are much longer than typical sequences in the training data can introduce large amounts of padding into the training process. Too much padding can negatively impact the classification accuracy.

Get the sequence lengths of the training data and visualize them in a histogram.

```
numObservationsTrain = numel(sequencesTrain);
sequenceLengths = zeros(1,numObservationsTrain);

for i = 1:numObservationsTrain
    sequence = sequencesTrain{i};
    sequenceLengths(i) = size(sequence,2);
end

figure
histogram(sequenceLengths)
title("Sequence Lengths")
xlabel("Sequence Length")
ylabel("Frequency")
```

Only a few sequences have more than 400 time steps. To improve the classification accuracy, remove the training sequences that have more than 400 time steps along with their corresponding labels.

```
maxLength = 400;
idx = sequenceLengths > maxLength;
sequencesTrain(idx) = [];
labelsTrain(idx) = [];
```

#### Create Datastore for Data

Create an `arrayDatastore` object for the sequences and the labels, and then combine them into a single datastore.

```
dsXTrain = arrayDatastore(sequencesTrain,'OutputType','same');
dsYTrain = arrayDatastore(labelsTrain,'OutputType','cell');

dsTrain = combine(dsXTrain,dsYTrain);
```

Determine the classes in the training data.

`classes = categories(labelsTrain);`

### Create Sequence Classification Network

Next, create a sequence classification network that can classify the sequences of feature vectors representing the videos.

Define the sequence classification network architecture. Specify the following network layers:

• A sequence input layer with an input size corresponding to the feature dimension of the feature vectors.

• A BiLSTM layer with 2000 hidden units. To output only one label for each sequence, set the `'OutputMode'` option of the BiLSTM layer to `'last'`.

• A dropout layer with a probability of 0.5.

• A fully connected layer with an output size corresponding to the number of classes and a softmax layer.

```
numFeatures = size(sequencesTrain{1},1);
numClasses = numel(categories(labelsTrain));

layers = [
    sequenceInputLayer(numFeatures,'Name','sequence')
    bilstmLayer(2000,'OutputMode','last','Name','bilstm')
    dropoutLayer(0.5,'Name','drop')
    fullyConnectedLayer(numClasses,'Name','fc')
    softmaxLayer('Name','softmax')
    ];
```

Convert the layers to a `layerGraph` object.

`lgraph = layerGraph(layers);`

Create a `dlnetwork` object from the layer graph.

`dlnet = dlnetwork(lgraph);`

### Specify Training Options

Train for 15 epochs and specify a mini-batch size of 16.

```
numEpochs = 15;
miniBatchSize = 16;
```

Specify the options for Adam optimization. Specify an initial learning rate of `1e-4` with a decay of 0.001, a gradient decay of 0.9, and a squared gradient decay of 0.999.

```
initialLearnRate = 1e-4;
decay = 0.001;
gradDecay = 0.9;
sqGradDecay = 0.999;
```

Visualize the training progress in a plot.

`plots = "training-progress";`

### Train Sequence Classification Network

Create a `minibatchqueue` object that processes and manages mini-batches of sequences during training. For each mini-batch:

• Use the custom mini-batch preprocessing function `preprocessLabeledSequences` (defined at the end of this example) to convert the labels to dummy variables.

• Format the vector sequence data with the dimension labels `'CTB'` (channel, time, batch). By default, the `minibatchqueue` object converts the data to `dlarray` objects with underlying type `single`. Do not add a format to the class labels.

• Train on a GPU if one is available. By default, the `minibatchqueue` object converts each output to a `gpuArray` object if a GPU is available. Using a GPU requires Parallel Computing Toolbox™ and a supported GPU device. For information on supported devices, see GPU Support by Release (Parallel Computing Toolbox).

```
mbq = minibatchqueue(dsTrain,...
    'MiniBatchSize',miniBatchSize,...
    'MiniBatchFcn', @preprocessLabeledSequences,...
    'MiniBatchFormat',{'CTB',''});
```

Initialize the training progress plot.

```
if plots == "training-progress"
    figure
    lineLossTrain = animatedline('Color',[0.85 0.325 0.098]);
    ylim([0 inf])
    xlabel("Iteration")
    ylabel("Loss")
    grid on
end
```

Initialize the average gradient and average squared gradient parameters for the Adam solver.

```
averageGrad = [];
averageSqGrad = [];
```

Train the model using a custom training loop. For each epoch, shuffle the data and loop over mini-batches of data. For each mini-batch:

• Evaluate the model gradients, state, and loss using `dlfeval` and the `modelGradients` function and update the network state.

• Determine the learning rate for the time-based decay learning rate schedule: for each iteration, the solver uses the learning rate given by $\rho_t = \frac{\rho_0}{1 + k\,t}$, where $t$ is the iteration number, $\rho_0$ is the initial learning rate, and $k$ is the decay.

• Update the network parameters using the `adamupdate` function.

• Display the training progress.

Note that training can take a long time to run.

```
iteration = 0;
start = tic;

% Loop over epochs.
for epoch = 1:numEpochs
    
    % Shuffle data.
    shuffle(mbq);
    
    % Loop over mini-batches.
    while hasdata(mbq)
        
        iteration = iteration + 1;
        
        % Read mini-batch of data.
        [dlX, dlY] = next(mbq);
        
        % Evaluate the model gradients, state, and loss using dlfeval and the
        % modelGradients function.
        [gradients,state,loss] = dlfeval(@modelGradients,dlnet,dlX,dlY);
        
        % Determine learning rate for time-based decay learning rate schedule.
        learnRate = initialLearnRate/(1 + decay*iteration);
        
        % Update the network parameters using the Adam optimizer.
        [dlnet,averageGrad,averageSqGrad] = adamupdate(dlnet,gradients,averageGrad,averageSqGrad, ...
            iteration,learnRate,gradDecay,sqGradDecay);
        
        % Display the training progress.
        if plots == "training-progress"
            D = duration(0,0,toc(start),'Format','hh:mm:ss');
            addpoints(lineLossTrain,iteration,double(gather(extractdata(loss))))
            title("Epoch: " + epoch + " of " + numEpochs + ", Elapsed: " + string(D))
            drawnow
        end
    end
end
```

### Test Model

Test the classification accuracy of the model by comparing the predictions on the validation set with the true labels.

After training is complete, making predictions on new data does not require the labels.

To create a `minibatchqueue` object for testing:

• Create an array datastore containing only the predictors of the test data.

• Specify the same mini-batch size used for training.

• Preprocess the predictors using the `preprocessUnlabeledSequences` helper function, listed at the end of the example.

• For the single output of the datastore, specify the mini-batch format `'CTB'` (channel, time, batch).

```
dsXValidation = arrayDatastore(sequencesValidation,'OutputType','same');

mbqTest = minibatchqueue(dsXValidation, ...
    'MiniBatchSize',miniBatchSize, ...
    'MiniBatchFcn',@preprocessUnlabeledSequences, ...
    'MiniBatchFormat','CTB');
```

Loop over the mini-batches and classify the sequences using the `modelPredictions` helper function, listed at the end of the example.

`predictions = modelPredictions(dlnet,mbqTest,classes);`

Evaluate the classification accuracy by comparing the predicted labels to the true validation labels.

`accuracy = mean(predictions == labelsValidation)`
```
accuracy = 0.6721
```

### Assemble Video Classification Network

To create a network that classifies videos directly, assemble a network using layers from both of the created networks. Use the layers from the convolutional network to transform the videos into vector sequences and the layers from the sequence classification network to classify the vector sequences.

The following diagram illustrates the network architecture:

• To input image sequences to the network, use a sequence input layer.

• To extract features by applying the convolutional operations to each frame of the videos independently, use the GoogLeNet convolutional layers.

• To classify the resulting vector sequences, include the sequence classification layers.

When training this type of network with the `trainNetwork` function (not done in this example), you have to use sequence folding and unfolding layers to process the video frames independently. When training this type of network with a `dlnetwork` object and a custom training loop (as in this example), sequence folding and unfolding layers are not required because the network uses dimension information given by the `dlarray` dimension labels.

First, create a layer graph of the GoogLeNet network.

`cnnLayers = layerGraph(netCNN);`

Remove the input layer (`"data"`) and the layers after the pooling layer used for the activations (`"pool5-drop_7x7_s1"`, `"loss3-classifier"`, `"prob"`, and `"output"`).

```
layerNames = ["data" "pool5-drop_7x7_s1" "loss3-classifier" "prob" "output"];
cnnLayers = removeLayers(cnnLayers,layerNames);
```

Create a sequence input layer that accepts image sequences containing images of the same input size as the GoogLeNet network. To normalize the images using the same average image as the GoogLeNet network, set the `'Normalization'` option of the sequence input layer to `'zerocenter'` and the `'Mean'` option to the average image of the input layer of GoogLeNet.

```
inputSize = netCNN.Layers(1).InputSize(1:2);
averageImage = netCNN.Layers(1).Mean;

inputLayer = sequenceInputLayer([inputSize 3], ...
    'Normalization','zerocenter', ...
    'Mean',averageImage, ...
    'Name','input');
```

Add the sequence input layer to the layer graph. Connect the output of the input layer to the input of the first convolutional layer (`"conv1-7x7_s2"`).

```
lgraph = addLayers(cnnLayers,inputLayer);
lgraph = connectLayers(lgraph,"input","conv1-7x7_s2");
```

Add the previously trained sequence classification network layers to the layer graph and connect them.

Take the layers from the sequence classification network and remove the sequence input layer.

```
lstmLayers = dlnet.Layers;
lstmLayers(1) = [];
```

Add the sequence classification layers to the layer graph. Connect the last pooling layer `"pool5-7x7_s1"` to the `bilstm` layer.

```
lgraph = addLayers(lgraph,lstmLayers);
lgraph = connectLayers(lgraph,"pool5-7x7_s1","bilstm");
```

#### Convert to `dlnetwork`

To be able to make predictions, convert the layer graph to a `dlnetwork` object.

`dlnetAssembled = dlnetwork(lgraph)`
```
dlnetAssembled = 
  dlnetwork with properties:

         Layers: [144×1 nnet.cnn.layer.Layer]
    Connections: [170×2 table]
     Learnables: [119×3 table]
          State: [2×3 table]
     InputNames: {'input'}
    OutputNames: {'softmax'}
    Initialized: 1
```

### Classify Using New Data

Unzip the file `pushup_mathworker.zip`.

`unzip("pushup_mathworker.zip")`

The extracted `pushup_mathworker` folder contains a video of a push-up. Create a file datastore for this folder. Use a custom read function to read the videos.

```
ds = fileDatastore("pushup_mathworker", ...
    'ReadFcn',@readVideo);
```

Read the first video from the datastore. To be able to read the video again, reset the datastore.

```
video = read(ds);
reset(ds);
```

To view the video, loop over the individual frames and use the `image` function. Alternatively, you can use the `implay` function (requires Image Processing Toolbox).

```
numFrames = size(video,4);

figure
for i = 1:numFrames
    frame = video(:,:,:,i);
    
    image(frame);
    xticklabels([]);
    yticklabels([]);
    
    drawnow
end
```

To preprocess the videos to have the input size expected by the network, use the `transform` function and apply the `imresize` function to each image in the datastore.

`dsXTest = transform(ds,@(x) imresize(x,inputSize));`

To manage and process the unlabeled videos, create a `minibatchqueue` object:

• Specify a mini-batch size of 1.

• Preprocess the videos using the `preprocessUnlabeledVideos` helper function, listed at the end of the example.

• For the single output of the datastore, specify the mini-batch format `'SSCTB'` (spatial, spatial, channel, time, batch).

```
mbqTest = minibatchqueue(dsXTest,...
    'MiniBatchSize',1,...
    'MiniBatchFcn', @preprocessUnlabeledVideos,...
    'MiniBatchFormat',{'SSCTB'});
```

Classify the videos using the `modelPredictions` helper function, defined at the end of this example. The function expects three inputs: a `dlnetwork` object, a `minibatchqueue` object, and a cell array containing the network classes.

`[predictions] = modelPredictions(dlnetAssembled,mbqTest,classes)`
```
predictions = categorical
     pushup
```

### Helper Functions

#### Video Reading Function

The `readVideo` function reads the video in `filename` and returns an H-by-W-by-C-by-T array, where H, W, C, and T are the height, width, number of channels, and number of frames of the video, respectively.

```
function video = readVideo(filename)

vr = VideoReader(filename);
H = vr.Height;
W = vr.Width;
C = 3;

% Preallocate video array
numFrames = floor(vr.Duration * vr.FrameRate);
video = zeros(H,W,C,numFrames,'uint8');

% Read frames
i = 0;
while hasFrame(vr)
    i = i + 1;
    video(:,:,:,i) = readFrame(vr);
end

% Remove unallocated frames
if size(video,4) > i
    video(:,:,:,i+1:end) = [];
end

end
```

#### Model Gradients Function

The `modelGradients` function takes as input a `dlnetwork` object `dlnet` and a mini-batch of input data `dlX` with corresponding labels `Y`, and returns the gradients of the loss with respect to the learnable parameters in `dlnet`, the network state, and the loss. To compute the gradients automatically, use the `dlgradient` function.

```
function [gradients,state,loss] = modelGradients(dlnet,dlX,Y)

[dlYPred,state] = forward(dlnet,dlX);

loss = crossentropy(dlYPred,Y);
gradients = dlgradient(loss,dlnet.Learnables);

end
```

#### Model Predictions Function

The `modelPredictions` function takes as input a `dlnetwork` object `dlnet`, a `minibatchqueue` object of input data `mbq`, and the network classes, and computes the model predictions by iterating over all data in the mini-batch queue. The function uses the `onehotdecode` function to find the predicted class with the highest score. The function returns the predicted labels.

```
function [predictions] = modelPredictions(dlnet,mbq,classes)
predictions = [];

while hasdata(mbq)
    
    % Extract a mini-batch from the minibatchqueue and pass it to the
    % network for predictions
    [dlXTest] = next(mbq);
    dlYPred = predict(dlnet,dlXTest);
    
    % To obtain categorical labels, one-hot decode the predictions
    YPred = onehotdecode(dlYPred,classes,1)';
    predictions = [predictions; YPred];
    
end
end
```

#### Labeled Sequence Data Preprocessing Function

The `preprocessLabeledSequences` function preprocesses the sequence data using the following steps:

1. Use the `padsequences` function to pad the sequences in the time dimension and concatenate them in the batch dimension.

2. Extract the label data from the incoming cell array and concatenate into a categorical array.

3. One-hot encode the categorical labels into numeric arrays.

4. Transpose the array of one-hot encoded labels to match the shape of the network output.

```
function [X, Y] = preprocessLabeledSequences(XCell,YCell)
% Pad the sequences with zeros in the second dimension (time) and
% concatenate along the third dimension (batch)
X = padsequences(XCell,2);

% Extract label data from cell and concatenate
Y = cat(1,YCell{1:end});

% One-hot encode labels
Y = onehotencode(Y,2);

% Transpose the encoded labels to match the network output
Y = Y';
end
```

#### Unlabeled Sequence Data Preprocessing Function

The `preprocessUnlabeledSequences` function preprocesses the sequence data using the `padsequences` function. This function pads the sequences with zeros in the time dimension and concatenates the result in the batch dimension.

```
function [X] = preprocessUnlabeledSequences(XCell)
% Pad the sequences with zeros in the second dimension (time) and
% concatenate along the third dimension (batch)
X = padsequences(XCell,2);
end
```

#### Unlabeled Video Data Preprocessing Function

The `preprocessUnlabeledVideos` function preprocesses unlabeled video data using the `padsequences` function. This function pads the videos with zeros in the time dimension and concatenates the result in the batch dimension.

```
function [X] = preprocessUnlabeledVideos(XCell)
% Pad the sequences with zeros in the fourth dimension (time) and
% concatenate along the fifth dimension (batch)
X = padsequences(XCell,4);
end
```