Object Detection Using YOLO v2 Deep Learning

This example shows how to train an object detector using a deep learning technique named you only look once (YOLO) v2.

Overview

Deep learning is a powerful machine learning technique that automatically learns the image features required for detection tasks. There are several techniques for object detection using deep learning, such as Faster R-CNN and you only look once (YOLO) v2. This example trains YOLO v2, which is an efficient deep learning object detector.

For more information, see Object Detection using Deep Learning (Computer Vision Toolbox).

Note: This example requires Computer Vision Toolbox™ and Deep Learning Toolbox™. Parallel Computing Toolbox™ is recommended to train the detector using a CUDA-capable NVIDIA™ GPU with compute capability 3.0 or higher.

Download Pretrained Detector

This example uses a pretrained detector to allow the example to run without having to wait for training to complete. If you want to train the detector with the trainYOLOv2ObjectDetector function, set the doTraining variable to true. Otherwise, download the pretrained detector.

doTraining = false;
if ~doTraining && ~exist('yolov2ResNet50VehicleExample.mat','file')
    % Download pretrained detector.
    disp('Downloading pretrained detector (98 MB)...');
    pretrainedURL = 'https://www.mathworks.com/supportfiles/vision/data/yolov2ResNet50VehicleExample.mat';
    websave('yolov2ResNet50VehicleExample.mat',pretrainedURL);
end
Downloading pretrained detector (98 MB)...

Load Dataset

This example uses a small vehicle data set that contains 295 images. Each image contains one or two labeled instances of a vehicle. A small data set is useful for exploring the YOLO v2 training procedure, but in practice, more labeled images are needed to train a robust detector.

% Unzip vehicle dataset images.
unzip vehicleDatasetImages.zip

% Load vehicle dataset ground truth.
data = load('vehicleDatasetGroundTruth.mat');
vehicleDataset = data.vehicleDataset;

The training data is stored in a table. The first column contains the path to the image files. The remaining columns contain the ROI labels for vehicles.

% Display first few rows of the data set.
vehicleDataset(1:4,:)
ans=4×2 table
             imageFilename               vehicle   
    _______________________________    ____________

    'vehicleImages/image_00001.jpg'    [1×4 double]
    'vehicleImages/image_00002.jpg'    [1×4 double]
    'vehicleImages/image_00003.jpg'    [1×4 double]
    'vehicleImages/image_00004.jpg'    [1×4 double]
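
Each entry in the vehicle column holds one row per labeled vehicle in [x y width height] format, which is the rectangle format that insertShape expects later in this example. The following minimal sketch uses a made-up box (the values are illustrative and are not taken from the data set) to show how that format relates to corner coordinates and box area.

% Hypothetical box in [x y width height] format (illustrative values only).
exampleBox = [220 136 35 28];

% Top-left corner of the box.
xMin = exampleBox(1);
yMin = exampleBox(2);

% Opposite corner in spatial coordinates: offset the origin by the size.
xMax = exampleBox(1) + exampleBox(3);
yMax = exampleBox(2) + exampleBox(4);

% Box area in square pixels.
boxArea = exampleBox(3) * exampleBox(4);

fprintf('Corners: (%d,%d) to (%d,%d), area %d\n',xMin,yMin,xMax,yMax,boxArea);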

Display one of the images from the data set to understand the type of images it contains.

% Add the full path to the local vehicle data folder.
vehicleDataset.imageFilename = fullfile(pwd,vehicleDataset.imageFilename);

% Read one of the images.
I = imread(vehicleDataset.imageFilename{10});

% Insert the ROI labels.
I = insertShape(I,'Rectangle',vehicleDataset.vehicle{10});

% Resize and display image.
I = imresize(I,3);
imshow(I)

Split the data set into a training set for training the detector, and a test set for evaluating the detector. Select 60% of the data for training. Use the rest for evaluation.

% Set random seed to ensure example training reproducibility.
rng(0);

% Randomly split data into a training and test set.
shuffledIndices = randperm(height(vehicleDataset));
idx = floor(0.6 * length(shuffledIndices) );
trainingData = vehicleDataset(shuffledIndices(1:idx),:);
testData = vehicleDataset(shuffledIndices(idx+1:end),:);
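
As a sanity check on the split logic, the sketch below repeats the same index arithmetic on a stand-in image count, so it runs without the data set. The variable names demoIndices, demoTrainIdx, and demoTestIdx are hypothetical and are chosen so that they do not overwrite the variables used in the example.

% Stand-in for height(vehicleDataset); the data set contains 295 images.
demoNumImages = 295;

rng(0);
demoIndices = randperm(demoNumImages);
demoSplit = floor(0.6 * demoNumImages);

demoTrainIdx = demoIndices(1:demoSplit);
demoTestIdx  = demoIndices(demoSplit+1:end);

% The two sets should be disjoint and together cover every image.
assert(isempty(intersect(demoTrainIdx,demoTestIdx)));
assert(numel(demoTrainIdx) + numel(demoTestIdx) == demoNumImages);

fprintf('%d training images, %d test images\n', ...
    numel(demoTrainIdx),numel(demoTestIdx));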

Create a YOLO v2 Object Detection Network

The YOLO v2 object detection network can be thought of as having two sub-networks: a feature extraction network followed by a detection network.

The feature extraction network is typically a pretrained CNN (see Pretrained Deep Neural Networks for more details). This example uses ResNet-50 for feature extraction. Other pretrained networks such as MobileNet v2 or ResNet-18 can also be used depending on application requirements. The detection sub-network is a small CNN compared to the feature extraction network and is composed of a few convolutional layers and layers specific for YOLO v2.

Use the yolov2Layers function to automatically modify a pretrained ResNet-50 network into a YOLO v2 object detection network. yolov2Layers requires you to specify several inputs that parameterize a YOLO v2 network.

First, specify the image input size and the number of classes. The image input size should be at least as big as the images in the training image set. In this example, the images are 224-by-224 RGB images.

% Define the image input size.
imageSize = [224 224 3];

% Define the number of object classes to detect.
numClasses = width(vehicleDataset)-1;

Next, specify the size of the anchor boxes. The anchor boxes should be selected based on the scale and size of objects in the training data. You can follow the procedure in Estimate Anchor Boxes Using Clustering (Computer Vision Toolbox) to determine a good set of anchor boxes based on the training data. Using this procedure, the anchor boxes for the vehicle dataset are:

anchorBoxes = [
    43 59
    18 22
    23 29
    84 109
];

See Anchor Boxes for Object Detection (Computer Vision Toolbox) for additional details.
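
One possible way to obtain such a set is sketched below. It assumes a Computer Vision Toolbox release that includes the boxLabelDatastore and estimateAnchorBoxes functions, which is not necessarily the procedure used to produce the values listed above. For brevity, the sketch clusters the boxes of the whole ground truth table rather than only the training portion, and it stores its results in separate variables (gt, estimatedAnchors) so that the variables used in this example are left unchanged.

% Load the ground truth table into a separate variable.
gt = load('vehicleDatasetGroundTruth.mat');

% Wrap the box columns (every column after imageFilename) in a datastore.
blds = boxLabelDatastore(gt.vehicleDataset(:,2:end));

% Assumption: four anchors, to match the number of boxes listed above.
numAnchors = 4;
[estimatedAnchors,meanIoU] = estimateAnchorBoxes(blds,numAnchors);

% Each row of estimatedAnchors is one anchor box in [height width] format.
disp(estimatedAnchors)
fprintf('Mean IoU of boxes with their closest anchor: %.2f\n',meanIoU);

Because the clustering depends on its initialization and on which boxes are included, the estimated boxes may differ from the values listed above.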

Finally, specify the network and feature extraction layer within that network to use as the basis of YOLO v2.

% Load a pretrained ResNet-50.
baseNetwork = resnet50;

Select 'activation_40_relu' as the feature extraction layer. The layers after 'activation_40_relu' are discarded and the detection sub-network is attached to 'activation_40_relu'. This feature extraction layer outputs feature maps that are downsampled by a factor of 16. This amount of downsampling is a good trade-off between spatial resolution and the strength of the extracted features (features extracted further down the network encode stronger image features at the cost of spatial resolution). Choosing the optimal feature extraction layer requires empirical analysis and is another hyperparameter to tune.

% Specify the feature extraction layer.
featureLayer = 'activation_40_relu';

% Create the YOLO v2 object detection network. 
lgraph = yolov2Layers(imageSize,numClasses,anchorBoxes,baseNetwork,featureLayer);
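
To make the downsampling factor concrete, the short calculation below derives the spatial size of the feature map for the input size used in this example. It is a back-of-the-envelope sketch: the candidate box count assumes one predicted box per anchor at every feature map location, and the demo variable names are hypothetical.

% Input size and downsampling factor stated above for 'activation_40_relu'.
demoImageSize = [224 224 3];
downsamplingFactor = 16;

% Spatial size of the feature map passed to the detection sub-network.
featureMapSize = demoImageSize(1:2) / downsamplingFactor;

% Assumption: one candidate box per anchor at each feature map location.
demoNumAnchors = 4;
numCandidateBoxes = prod(featureMapSize) * demoNumAnchors;

fprintf('Feature map: %d-by-%d, candidate boxes: %d\n', ...
    featureMapSize(1),featureMapSize(2),numCandidateBoxes);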

You can visualize the network using analyzeNetwork or deepNetworkDesigner from Deep Learning Toolbox™.

Note that you can also create a custom YOLO v2 network layer-by-layer. For details, see Design a YOLO v2 Detection Network (Computer Vision Toolbox).

Train YOLO v2 Object Detector

To use the trainYOLOv2ObjectDetector function, set doTraining to true. Otherwise, load a pretrained detector.

if doTraining
    
    % Configure the training options. 
    %  * Lower the learning rate to 1e-3 to stabilize training. 
    %  * Set CheckpointPath to save detector checkpoints to a temporary
    %    location. If training is interrupted due to a system failure or
    %    power outage, you can resume training from the saved checkpoint.
    options = trainingOptions('sgdm', ...
        'MiniBatchSize', 16, ...
        'InitialLearnRate',1e-3, ...
        'MaxEpochs',10,...
        'CheckpointPath', tempdir, ...
        'Shuffle','every-epoch');    
    
    % Train YOLO v2 detector.
    [detector,info] = trainYOLOv2ObjectDetector(trainingData,lgraph,options);
else
    % Load pretrained detector for the example.
    pretrained = load('yolov2ResNet50VehicleExample.mat');
    detector = pretrained.detector;
end

Note: This example was verified on an NVIDIA™ Titan X with 12 GB of GPU memory. If your GPU has less memory, you may run out of memory. If this happens, lower the 'MiniBatchSize' using the trainingOptions function. Training this network took approximately 5 minutes using this setup. Training time varies depending on the hardware you use.
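
For a GPU with less memory, the training options might be adjusted as in the following sketch. The mini-batch size of 8 is an arbitrary illustration rather than a tested recommendation, and the variable name smallBatchOptions is hypothetical. All other settings repeat the values used in this example.

% Same settings as the training options above, with a smaller mini-batch.
% The value 8 is illustrative only; choose a size that fits your GPU.
smallBatchOptions = trainingOptions('sgdm', ...
    'MiniBatchSize',8, ...
    'InitialLearnRate',1e-3, ...
    'MaxEpochs',10, ...
    'CheckpointPath',tempdir, ...
    'Shuffle','every-epoch');

A smaller mini-batch reduces the memory needed per iteration but increases the number of iterations per epoch, so training may take longer.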

As a quick test, run the detector on one test image.

% Read a test image.
I = imread(testData.imageFilename{end});

% Run the detector.
[bboxes,scores] = detect(detector,I);

% Annotate detections in the image.
I = insertObjectAnnotation(I,'rectangle',bboxes,scores);
imshow(I)
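
The scores returned by detect can be used to keep only the more confident detections before annotating an image. The sketch below illustrates the idea with made-up boxes and scores, so it runs without the detector. The cutoff value and all variable names are hypothetical.

% Made-up detections in [x y width height] format with confidence scores.
exampleBboxes = [ 30  40  60  45
                 120  80  35  30
                 200  60  50  40];
exampleScores = [0.92; 0.55; 0.71];

% Keep only detections at or above a hypothetical confidence cutoff.
minScore = 0.7;
keep = exampleScores >= minScore;
strongBboxes = exampleBboxes(keep,:);
strongScores = exampleScores(keep);

fprintf('Kept %d of %d detections\n',nnz(keep),numel(exampleScores));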

Evaluate Detector Using Test Set

Evaluate the detector on a large set of images to measure the trained detector's performance. Computer Vision Toolbox™ provides object detector evaluation functions to measure common metrics such as average precision (evaluateDetectionPrecision) and log-average miss rates (evaluateDetectionMissRate). Here, the average precision metric is used. The average precision provides a single number that incorporates the ability of the detector to make correct classifications (precision) and the ability of the detector to find all relevant objects (recall).
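
As a small worked illustration of the two quantities that average precision combines, the sketch below computes precision and recall from hypothetical detection counts at a single score threshold. The counts are invented for illustration and are not results from this detector.

% Hypothetical counts at one score threshold (illustrative values only).
truePositives  = 40;   % detections that match a labeled vehicle
falsePositives = 10;   % detections with no matching label
falseNegatives = 5;    % labeled vehicles that the detector missed

% Precision: fraction of detections that are correct.
precisionAtThreshold = truePositives / (truePositives + falsePositives);

% Recall: fraction of labeled vehicles that were found.
recallAtThreshold = truePositives / (truePositives + falseNegatives);

fprintf('Precision = %.2f, Recall = %.2f\n', ...
    precisionAtThreshold,recallAtThreshold);

Repeating this calculation as the score threshold varies traces out a precision/recall curve, which average precision summarizes as a single number.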

The first step for detector evaluation is to collect the detection results by running the detector on the test set.

% Create a table to hold the bounding boxes, scores, and labels output by
% the detector. 
numImages = height(testData);
results = table('Size',[numImages 3],...
    'VariableTypes',{'cell','cell','cell'},...
    'VariableNames',{'Boxes','Scores','Labels'});

% Run detector on each image in the test set and collect results.
for i = 1:numImages
    
    % Read the image.
    I = imread(testData.imageFilename{i});
    
    % Run the detector.
    [bboxes,scores,labels] = detect(detector,I);
   
    % Collect the results.
    results.Boxes{i} = bboxes;
    results.Scores{i} = scores;
    results.Labels{i} = labels;
end

% Extract expected bounding box locations from test data.
expectedResults = testData(:, 2:end);

% Evaluate the object detector using average precision metric.
[ap, recall, precision] = evaluateDetectionPrecision(results, expectedResults);

The precision/recall (PR) curve highlights how precise a detector is at varying levels of recall. Ideally, the precision would be 1 at all recall levels. The use of additional layers in the network can help improve the average precision, but might require additional training data and longer training time.

% Plot precision/recall curve
plot(recall,precision)
xlabel('Recall')
ylabel('Precision')
grid on
title(sprintf('Average Precision = %.2f', ap))

Code Generation

Once the detector is trained and evaluated, you can generate code for the yolov2ObjectDetector using GPU Coder™. See the Code Generation for Object Detection Using YOLO v2 (GPU Coder) example for more details.

Summary

This example showed how to train a vehicle detector using deep learning. You can follow similar steps to train detectors for traffic signs, pedestrians, or other objects.

To learn more about deep learning, see Object Detection using Deep Learning (Computer Vision Toolbox).

References

[1] Redmon, Joseph, and Ali Farhadi. "YOLO9000: Better, Faster, Stronger." 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.