
Deep Learning for Computer Vision with MATLAB

By Avinash Nehemiah and Valerie Leung, MathWorks

Computer vision engineers have used machine learning techniques for decades to detect objects of interest in images and to classify or identify categories of objects. They extract features representing points, regions, or objects of interest and then use those features to train a model to classify or learn patterns in the image data.

In traditional machine learning, feature selection is a time-consuming manual process. Feature extraction usually involves processing each image with one or more image processing operations, such as calculating gradients, to extract discriminative information from the image.

Enter deep learning. Deep learning algorithms can learn features, representations, and tasks directly from images, text, and sound, eliminating the need for manual feature selection.

Using a simple object detection and recognition example, this article illustrates how easy it is to use MATLAB® for deep learning, even without extensive knowledge of advanced computer vision algorithms or neural networks.

The code used in this example is available for download.

Getting Started

The goal in this example is to train an algorithm to detect a pet in a video and correctly label the pet as a cat or a dog. We’ll be using a convolutional neural network (CNN), a specific type of deep learning algorithm that can both perform classification and extract features from raw images.

To build the object detection and recognition algorithm in MATLAB, all we need is a pretrained CNN and some dog and cat images. We’ll use the CNN to extract discriminative features from the images, and then use a MATLAB app to train a machine learning algorithm to discriminate between cats and dogs.

Importing a CNN Classifier

We begin by downloading a CNN classifier pretrained on ImageNet, a database containing over 1.2 million labeled high-resolution images in 1000 categories. In this example we’ll be using the AlexNet architecture.

websave('\networks\imagenet-caffe-alex.mat', ...
    'http://www.vlfeat.org/matconvnet/models/beta16/imagenet-caffe-alex.mat');

We import the network into MATLAB as a SeriesNetwork using Neural Network Toolbox™, and display the architecture of the CNN. The SeriesNetwork object represents the CNN.

% Load MatConvNet network into a SeriesNetwork
% (cnnFullMatFile is the file downloaded above)
cnnFullMatFile = '\networks\imagenet-caffe-alex.mat';
convnet = helperImportMatConvNet(cnnFullMatFile);


% View the CNN architecture
convnet.Layers

We’ve stored the images in separate cat and dog folders under a parent folder called PetImages. The advantage of using this folder structure is that the MATLAB imageDatastore we create will be able to automatically read and manage image locations and class labels. (imageDatastore is a repository for collections of data that are too large to fit in memory.)

We initialize an imageDatastore to access the images in MATLAB.

%% Set up image data
dataFolder = '\data\PetImages';
categories = {'Cat', 'Dog'};
imds = imageDatastore(fullfile(dataFolder, categories), ...
    'LabelSource', 'foldernames');

We then select a subset of the data that gives us an equal number of dog and cat images.

tbl = countEachLabel(imds)

%% Determine the smallest number of images in a category
minSetCount = min(tbl{:,2});

% Use splitEachLabel method to trim the set.
imds = splitEachLabel(imds, minSetCount, 'randomize');

% Notice that each set now has exactly the same number of images.
countEachLabel(imds)

Since the AlexNet network was trained on 227x227-pixel images, we have to resize all our training images to the same resolution. The following code lets us preprocess each image as it is read from the imageDatastore.

%% Pre-process Images For CNN
% Set the ImageDatastore ReadFcn
imds.ReadFcn = @(filename)readAndPreprocessImage(filename);
 
%% Divide data into training and testing sets
[trainingSet, testSet] = splitEachLabel(imds, 0.3, 'randomize');

We use the readAndPreprocessImage function to resize the images to 227x227 pixels.

function Iout = readAndPreprocessImage(filename)

I = imread(filename);

% Some images may be grayscale. Replicate the image 3 times to
% create an RGB image.
if ismatrix(I)
    I = cat(3,I,I,I);
end

% Resize the image as required for the CNN.
Iout = imresize(I, [227 227]);

end

Performing Feature Extraction

We want to use this new dataset with the pretrained AlexNet CNN. CNNs can learn to extract generic features that can be used to train a new classifier to solve a different problem—in our case, classifying cats and dogs (Figure 1).

Figure 1. Workflow for using a pretrained CNN to extract features for a new task.

We pass the training data through the CNN and use the activations method to extract features at a particular layer in the network. Like other neural networks, CNNs are formed using interconnected layers of nonlinear processing elements, or neurons. Input and output layers connect to input and output signals, and hidden layers provide nonlinear complexity that gives a neural network its computational capacity.

While each layer of a CNN produces a response to an input image, only a few layers are suitable for image feature extraction. There is no exact formula for identifying these layers. The best approach is to simply try a few different layers and see how well they work.
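One way to compare layers, jumping ahead to the classifier training shown later in this article, is to extract features from each candidate layer and check how well a quick classifier trained on them performs. The following is a minimal sketch; the fc6 layer name and the 5-fold setting are illustrative assumptions.

% Candidate layers to compare (fc6 is assumed to exist in this network)
candidateLayers = {'fc6', 'fc7'};

for i = 1:numel(candidateLayers)
    % Extract features from this layer for the training images
    feats = activations(convnet, trainingSet, candidateLayers{i}, ...
        'MiniBatchSize', 32, 'OutputAs', 'columns');

    % Train a quick linear SVM and estimate accuracy with cross-validation
    mdl = fitcsvm(feats', trainingSet.Labels);
    cv  = crossval(mdl, 'KFold', 5);
    fprintf('%s: CV accuracy %.2f\n', candidateLayers{i}, 1 - kfoldLoss(cv));
end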

The layers at the beginning of the network capture basic image features, such as edges and blobs. To see this, we visualize the network filter weights from the first convolutional layer (Figure 2).

% Get the network weights for the first convolutional layer
% (Layers(1) is the image input layer, so the first convolutional
% layer is Layers(2))
w1 = convnet.Layers(2).Weights;
 
% Scale and resize the weights for visualization
w1 = mat2gray(w1);
w1 = imresize(w1,5); 
 
% Display a montage of network weights. There are 96 individual
% sets of weights in the first layer.
figure
montage(w1)
title('First convolutional layer weights')

Figure 2. Visualization of first layer filter weights.

Notice that the first layer of the network has learned filters for capturing blob and edge features. These "primitive" features are then processed by deeper network layers, which combine the early features to form higher-level image features. These higher-level features are better suited for recognition tasks because they combine all the primitive features into a richer image representation. You can easily extract features from one of the deeper layers using the activations method.

The fully connected layer fc7, just before the network’s final classification layers, is a good place to start. We extract training features using that layer.

featureLayer = 'fc7';
trainingFeatures = activations(convnet, trainingSet, featureLayer, ...
    'MiniBatchSize', 32, 'OutputAs', 'columns');

Training an SVM Classifier Using the Extracted Features

We’re now ready to train a “shallow” classifier with the features extracted in the previous step. Note that the original network was trained to classify 1000 object categories; the “shallow” classifier will be trained to solve the specific dogs vs. cats problem.

The Classification Learner app in Statistics and Machine Learning Toolbox™ lets us train and compare multiple models interactively (Figure 3).

Figure 3. Classification Learner app.

Alternatively, we could train the classifier in our MATLAB script.

We have already split the data into a training set and a test set. Next, we train a support vector machine (SVM) classifier using the extracted features by calling the fitcsvm function, with trainingFeatures as the input, or predictors, and trainingLabels as the output, or response values. We then cross-validate the classifier to estimate its accuracy, an unbiased estimate of how the classifier would perform on new data.

%% Train a classifier using extracted features 
trainingLabels = trainingSet.Labels;

% Train a linear support vector machine (SVM) classifier. The features
% were extracted with one column per image, so transpose them to give
% fitcsvm one row per observation.
svmmdl = fitcsvm(trainingFeatures', trainingLabels);

% Perform cross-validation and check accuracy
cvmdl = crossval(svmmdl,'KFold',10);
fprintf('kFold CV accuracy: %2.2f\n',1-cvmdl.kfoldLoss)
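
We can also check how the classifier generalizes by evaluating it on the held-out test set created earlier. The following is a minimal sketch reusing the testSet, featureLayer, and svmmdl variables defined above.

% Extract features from the held-out test images using the same layer
testFeatures = activations(convnet, testSet, featureLayer, ...
    'MiniBatchSize', 32, 'OutputAs', 'columns');

% Predict labels for the test images (transpose: one row per observation)
predictedLabels = predict(svmmdl, testFeatures');

% Compare predictions against the ground-truth test labels
testAccuracy = mean(predictedLabels == testSet.Labels);
fprintf('Test set accuracy: %2.2f\n', testAccuracy)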

We can now use the svmmdl classifier to classify an image as a cat or a dog (Figure 4).

Figure 4. Result of using the trained pet classifier on an image of a cat.
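
As a rough sketch of that step (the image file name here is hypothetical), a new image can be preprocessed, passed through the CNN, and classified with the trained SVM:

% Read and resize a new image the same way the training images were handled
I = readAndPreprocessImage('new_pet_photo.jpg');

% Extract fc7 features for this single image
imageFeatures = activations(convnet, I, featureLayer, 'OutputAs', 'columns');

% Classify the image; the predicted label is either 'Cat' or 'Dog'
label = predict(svmmdl, imageFeatures')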

Performing Object Detection

In most images and video frames, there is a lot going on. For example, in addition to a dog, there could be a tree, or a flock of pigeons, or a raccoon chasing the dog. Even a reliable image classifier will only work well if we can locate the object of interest, crop the object, and then feed it to the classifier—in other words, if we can perform object detection.

For object detection we will use a technique called optical flow, which tracks the apparent motion of pixels from one video frame to the next. Figure 5 shows a single frame of video with the motion vectors overlaid.

Figure 5. A single frame of video showing the motion vectors overlaid.
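
A sketch of the optical flow step is shown below; the video file name is hypothetical, and opticalFlowFarneback is just one of the optical flow estimators available in Computer Vision System Toolbox.

% Read the video and set up an optical flow estimator
videoReader = VideoReader('pet_video.mp4');
opticFlow = opticalFlowFarneback;

while hasFrame(videoReader)
    frame = readFrame(videoReader);

    % Estimate per-pixel motion relative to the previous frame
    flow = estimateFlow(opticFlow, rgb2gray(frame));

    % Keep only pixels with significant motion (threshold chosen by trial)
    movingPixels = flow.Magnitude > 4;
end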

The next step in the detection process is to separate out the moving pixels and then use the Image Region Analyzer app to analyze the connected components in the resulting binary image, filtering out noise caused by camera motion. The output of the app is a MATLAB function that can locate the pet in the field of view (Figure 6).

Figure 6. Image Region Analyzer app.
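
The exported function behaves roughly like the following sketch, which removes small noisy regions, keeps the largest connected component, and returns its bounding box. The function name locatePet and the area threshold are illustrative, not the app's actual output.

function bbox = locatePet(movingPixels)
% Clean up the binary motion mask and locate the largest moving region.

% Remove small connected components caused by camera motion and noise
cleanMask = bwareaopen(movingPixels, 500);

% Keep only the single largest remaining region
cleanMask = bwpropfilt(cleanMask, 'Area', 1);

% Return the bounding box of that region, or empty if nothing was found
stats = regionprops(cleanMask, 'BoundingBox');
if isempty(stats)
    bbox = [];
else
    bbox = stats(1).BoundingBox;
end
end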

We now have all the pieces we need to build a pet detection and recognition system (Figure 7). The system can:

  • Detect the location of the pet in new images using optical flow
  • Crop the pet from the image and extract features using a pretrained CNN
  • Classify the features using the SVM classifier we trained to determine whether the pet is a cat or a dog

These steps are combined in the sketch following the figure.

Figure 7. Accurately classified dog and cat.
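
Putting the pieces together, a single video frame could flow through the system roughly as follows (a sketch reusing the hypothetical locatePet function and the variables defined earlier):

% Detect the pet in the current frame using the motion mask
bbox = locatePet(movingPixels);

if ~isempty(bbox)
    % Crop the detected region and resize it for the CNN
    petImage = imresize(imcrop(frame, bbox), [227 227]);

    % Extract CNN features and classify the crop as a cat or a dog
    features = activations(convnet, petImage, featureLayer, 'OutputAs', 'columns');
    label = predict(svmmdl, features');
end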

In this article we used an existing deep learning network to solve a different task. You can use the same techniques to solve your own image classification problem—for example, classifying types of cars in videos for traffic flow analysis, identifying tumors in mass spectrometry data for cancer research, or identifying individuals by their facial features for security systems.

Article featured in MathWorks News & Notes

Published 2016 - 93019v00

