
Perform Instance Segmentation Using Mask R-CNN

This example shows how to segment individual instances of people and cars using a multiclass Mask region-based convolutional neural network (R-CNN).

Instance segmentation is a computer vision technique in which you detect and localize objects while simultaneously generating a segmentation map for each of the detected instances.

This example first shows how to perform instance segmentation using a pretrained Mask R-CNN that detects two classes. Then, you can optionally download a data set and train a multiclass Mask R-CNN using transfer learning.

Note: This example requires the Computer Vision Toolbox™ Model for Mask R-CNN. You can install the Computer Vision Toolbox Model for Mask R-CNN from the Add-On Explorer. For more information about installing add-ons, see Get and Manage Add-Ons.

Perform Instance Segmentation Using Pretrained Mask R-CNN

Download the pretrained Mask R-CNN. The network is stored as a maskrcnn (Computer Vision Toolbox) object.

dataFolder = fullfile(tempdir,"coco");
trainedMaskRCNN_url = 'https://www.mathworks.com/supportfiles/vision/data/maskrcnn_object_person_car.mat';
helper.downloadTrainedMaskRCNN(trainedMaskRCNN_url,dataFolder);
Pretrained MaskRCNN network already exists.
pretrained = load(fullfile(dataFolder,'maskrcnn_object_person_car.mat'));
net = pretrained.net;

Read a test image that contains objects of the target classes.

imTest = imread('visionteam.jpg');

Segment the objects and their masks using the segmentObjects (Computer Vision Toolbox) function. The segmentObjects function performs these preprocessing steps on the input image before performing prediction:

  1. Zero center the images using the COCO data set mean.

  2. Resize the image to the input size of the network, while maintaining the aspect ratio (letter boxing).

[masks,labels,scores,boxes] = segmentObjects(net,imTest);

Visualize the predictions by overlaying the detected masks on the image using the insertObjectMask (Computer Vision Toolbox) function.

overlayedImage = insertObjectMask(imTest,masks);
imshow(overlayedImage)

Show the bounding boxes and labels on the objects.

showShape("rectangle",gather(boxes),"Label",labels,"LineColor",'r')

Download Training Data

The COCO 2014 train images data set [2] consists of 82,783 images. The annotations data contains at least five captions corresponding to each image.

Create directories to store the COCO training images and annotation data.

imageFolder = fullfile(dataFolder,"images");
captionsFolder = fullfile(dataFolder,"annotations");
if ~exist(imageFolder,'dir')
    mkdir(imageFolder)
    mkdir(captionsFolder)
end

Download the COCO 2014 training images and annotation data from https://cocodataset.org/#download by clicking the "2014 Train images" and "2014 Train/Val annotations" links, respectively. Extract the image files into the folder specified by imageFolder. Extract the annotation files into the folder specified by captionsFolder.
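
If you prefer to script the download, the following is a minimal sketch. The archive URLs are assumptions based on the standard COCO download links, so confirm them at https://cocodataset.org/#download before running. The 2014 training image archive is roughly 13 GB.

imagesURL = "http://images.cocodataset.org/zips/train2014.zip";                             % assumed URL
annotationsURL = "http://images.cocodataset.org/annotations/annotations_trainval2014.zip";  % assumed URL
unzip(annotationsURL,dataFolder);  % creates dataFolder/annotations, including instances_train2014.json
unzip(imagesURL,imageFolder);      % creates imageFolder/train2014; move the images into imageFolder if needed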

annotationFile = fullfile(captionsFolder,"instances_train2014.json");
str = fileread(annotationFile);

Read and Preprocess Training Data

To train a Mask R-CNN, you need this data.

  • RGB images that serve as input to the network, specified as H-by-W-by-3 numeric arrays.

  • Bounding boxes for objects in the RGB images, specified as NumObjects-by-4 matrices, with rows in the format [x y w h].

  • Instance labels, specified as NumObjects-by-1 string vectors.

  • Instance masks. Each mask is the segmentation of one instance in the image. The COCO data set specifies object instances using polygon coordinates formatted as NumObjects-by-2 cell arrays. Each row of the cell array contains the (x,y) coordinates of a polygon along the boundary of one instance in the image. However, the Mask R-CNN in this example requires binary masks specified as logical arrays of size H-by-W-by-NumObjects, as shown in the sketch after this list.
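
The following is a minimal sketch, using a made-up polygon and image size, of how you can convert one set of polygon coordinates into this binary mask format with the poly2mask (Image Processing Toolbox) function.

polyXY = [100 200; 300 200; 300 400; 100 400];           % hypothetical [x y] boundary coordinates of one instance
H = 800; W = 800;                                         % image height and width
maskInstance = poly2mask(polyXY(:,1),polyXY(:,2),H,W);    % H-by-W logical mask for this instance
masks = cat(3,maskInstance);                              % stack all instance masks: H-by-W-by-NumObjects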

Initialize Training Data Parameters

Specify the names of the object classes to train the network to detect and the network input size.

trainClassNames = {'person', 'car'};
numClasses = length(trainClassNames);
imageSizeTrain = [800 800 3];

Format COCO Annotation Data as MAT Files

The COCO API for MATLAB enables you to access the annotation data. Download the COCO API for MATLAB from https://github.com/cocodataset/cocoapi by clicking the "Code" button and selecting "Download ZIP." Extract the cocoapi-master directory and its contents to the folder specified by dataFolder. If needed for your operating system, compile the gason parser by following the instructions in the gason.m file within the MatlabAPI subdirectory.

Specify the directory location for the COCO API for MATLAB and add the directory to the path.

cocoAPIDir = fullfile(dataFolder,"cocoapi-master","MatlabAPI");
addpath(cocoAPIDir);

Specify the folder in which to store the MAT files.

unpackAnnotationDir = fullfile(dataFolder,"annotations_unpacked","matFiles");
if ~exist(unpackAnnotationDir,'dir')
    mkdir(unpackAnnotationDir)
end

Extract the COCO annotations to MAT files using the unpackAnnotations helper function, which is attached to this example as a supporting file in the folder helper. Each MAT file corresponds to a single training image and contains the file name, bounding boxes, instance labels, and instance masks for that image. The function converts object instances specified as polygon coordinates to binary masks using the poly2mask (Image Processing Toolbox) function.

helper.unpackAnnotations(trainClassNames,annotationFile,imageFolder,unpackAnnotationDir);
Loading and preparing annotations... DONE (t=9.11s).
Unpacking annotations into MAT files...
Done!

Create Datastore

The Mask R-CNN expects input data as a 1-by-4 cell array containing the RGB training image, bounding boxes, instance labels, and instance masks.

Create a file datastore with a custom read function, cocoAnnotationMATReader, that reads the content of the unpacked annotation MAT files, converts grayscale training images to RGB, and returns the data as a 1-by-4 cell array in the required format. The custom read function is attached to this example as a supporting file in the folder helper.

ds = fileDatastore(unpackAnnotationDir, ...
    'ReadFcn',@(x)helper.cocoAnnotationMATReader(x,imageFolder));
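
For reference, a hypothetical reader of the same shape might look like the following sketch. The field names loaded from the MAT file (imageName, bbox, label, masks) are assumptions; the actual helper.cocoAnnotationMATReader attached to the example may differ.

function out = exampleAnnotationMATReader(matFile,imageFolder)
    % Load the per-image annotation data (field names are assumed).
    s = load(matFile);
    im = imread(fullfile(imageFolder,s.imageName));
    if size(im,3) == 1
        im = repmat(im,[1 1 3]);   % convert grayscale training images to RGB
    end
    % Return the 1-by-4 cell array format that the Mask R-CNN expects.
    out = {im, s.bbox, s.label, s.masks};
end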

Preprocess the training images, bounding boxes, and instance masks to the size expected by the network using the transform function. The transform function processes the data using the operations specified in the preprocessData helper function. The helper function is attached to the example as a supporting file in the folder helper.

The preprocessData helper function performs these operations on the training images, bounding boxes, and instance masks:

  • Resize the RGB images and masks using the imresize function and rescale the bounding boxes using the bboxresize (Computer Vision Toolbox) function. The helper function selects a homogeneous scale factor such that the smaller dimension of the image, bounding box, or mask is equal to the target network input size.

  • Crop the RGB images and masks using the imcrop (Image Processing Toolbox) function and crop the bounding boxes using the bboxcrop (Computer Vision Toolbox) function. The helper function crops the image, bounding box, or mask such that the larger dimension is equal to the target network input size.

  • Zero center the images using the COCO data set image mean. (The standard deviation normalization factor is included in the weights of the first convolutional layer.)

dsTrain = transform(ds,@(x)helper.preprocessData(x,imageSizeTrain));
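
As a rough illustration, the resize and crop operations described above might look like the following sketch for one training sample consisting of im, boxes, labels, and masks. This is a simplified corner-crop sketch under assumptions, not the actual helper.preprocessData implementation, and it omits the zero-centering step.

targetSize = [800 800];

% Resize so that the smaller image dimension matches the target size.
scale = max(targetSize./size(im,[1 2]));
im = imresize(im,scale);
masks = imresize(masks,scale,"nearest");    % nearest neighbor keeps the masks binary
boxes = bboxresize(boxes,scale);

% Crop so that the larger dimension also matches the target size.
win = [1 1 targetSize(2) targetSize(1)];    % [x y width height] crop window
im = im(1:targetSize(1),1:targetSize(2),:);
masks = masks(1:targetSize(1),1:targetSize(2),:);
[boxes,valid] = bboxcrop(boxes,win,"OverlapThreshold",0.5);  % assumed overlap threshold
labels = labels(valid);
masks = masks(:,:,valid);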

Preview the data returned by the transformed datastore.

data = preview(dsTrain)
data=1×4 cell array
    {800×800×3 single}    {14×4 double}    {14×1 categorical}    {800×800×14 logical}

Create Mask R-CNN Network Layers

The Mask R-CNN builds upon a Faster R-CNN with a ResNet-50 base network. To perform transfer learning with the pretrained Mask R-CNN network, use the maskrcnn object to load the pretrained network and customize the network for the new set of classes and input size. By default, the maskrcnn object uses the same anchor boxes as used for training on the COCO data set.

net = maskrcnn("resnet50-coco",trainClassNames,"InputSize",imageSizeTrain)
net = 
  maskrcnn with properties:

      ModelName: 'maskrcnn'
     ClassNames: {'person'  'car'}
      InputSize: [800 800 3]
    AnchorBoxes: [15×2 double]

Create a structure containing configuration parameters for the Mask R-CNN network.

params = createMaskRCNNConfig(imageSizeTrain,numClasses,[trainClassNames {'background'}]);
params.ClassAgnosticMasks = false;
params.AnchorBoxes = net.AnchorBoxes;
params.FreezeBackbone = true;

Specify Training Options

Specify the options for SGDM optimization. Train the network for 10 epochs.

initialLearnRate = 0.0012;
momentum = 0.9;
decay = 0.01;
velocity = [];
maxEpochs = 10;
miniBatchSize = 2;

Batch Training Data

Create a minibatchqueue object that manages the mini-batching of observations in a custom training loop. The minibatchqueue object also casts data to dlarray objects, which enable automatic differentiation in deep learning applications.

Define a custom batching function named miniBatchFcn. The function concatenates the images along the fourth dimension to produce an H-by-W-by-C-by-miniBatchSize batch, and returns the other ground truth data as cell arrays of length equal to the mini-batch size.

miniBatchFcn = @(img,boxes,labels,masks) deal(cat(4,img{:}),boxes,labels,masks);

Specify the mini-batch data extraction format for the image data as "SSCB" (spatial, spatial, channel, batch). If a supported GPU is available for computation, then the minibatchqueue object preprocesses mini-batches in the background in a parallel pool during training.

mbqTrain = minibatchqueue(dsTrain,4, ...
    "MiniBatchFormat",["SSCB","","",""], ...
    "MiniBatchSize",miniBatchSize, ...
    "OutputCast",["single","","",""], ...
    "OutputAsDlArray",[true,false,false,false], ...
    "MiniBatchFcn",miniBatchFcn, ...
    "OutputEnvironment",["auto","cpu","cpu","cpu"]);

Train Network

To train the network, set the doTraining variable in the following code to true. Train the model in a custom training loop. For each iteration:

  • Read the data for the current mini-batch using the next function.

  • Evaluate the model gradients using the dlfeval function and the networkGradients helper function. The function networkGradients, listed as a supporting function, returns the gradients of the loss with respect to the learnable parameters, the corresponding mini-batch loss, and the state of the current batch.

  • Update the network parameters using the sgdmupdate function.

  • Update the state parameters of the network with the moving average.

  • Update the training progress plot.

Train on a GPU if one is available. Using a GPU requires Parallel Computing Toolbox™ and a CUDA® enabled NVIDIA® GPU. For more information, see GPU Support by Release (Parallel Computing Toolbox).

doTraining = true;
if doTraining
    
    iteration = 1; 
    start = tic;
    
    % Create subplots for the learning rate and mini-batch loss
    fig = figure;
    [lossPlotter, learningratePlotter] = helper.configureTrainingProgressPlotter(fig);
    
    % Initialize verbose output
    helper.initializeVerboseOutput([]);
    
    % Custom training loop
    for epoch = 1:maxEpochs
        reset(mbqTrain)
        shuffle(mbqTrain)
    
        while hasdata(mbqTrain)
            % Get next batch from minibatchqueue
            [X,gtBox,gtClass,gtMask] = next(mbqTrain);
        
            % Evaluate the model gradients and loss using dlfeval
            [gradients,loss,state,learnables] = dlfeval(@networkGradients,X,gtBox,gtClass,gtMask,net,params);
            %dlnet.State = state;
            
            % Compute the learning rate for the current iteration
            learnRate = initialLearnRate/(1 + decay*(epoch-1));
            
            if(~isempty(gradients) && ~isempty(loss))
                [net.AllLearnables,velocity] = sgdmupdate(learnables,gradients,velocity,learnRate,momentum);
            else
                continue;
            end
            
            % Display verbose output and update the training plots every 10 iterations
            if(mod(iteration,10)==0)
                helper.displayVerboseOutputEveryEpoch(start,learnRate,epoch,iteration,loss);
                D = duration(0,0,toc(start),'Format','hh:mm:ss');
                addpoints(learningratePlotter,iteration,learnRate)
                addpoints(lossPlotter,iteration,double(gather(extractdata(loss))))
                subplot(2,1,2)
                title(strcat("Epoch: ",num2str(epoch),", Elapsed: "+string(D)))
                drawnow
            end
            
            iteration = iteration + 1;    
        end
    
    end
    
    % Save the trained network
    modelDateTime = string(datetime('now','Format',"yyyy-MM-dd-HH-mm-ss"));
    save(strcat("trainedMaskRCNN-",modelDateTime,"-Epoch-",num2str(epoch),".mat"),'net');
    
end
 
Training on GPU.
|=========================================================================|
|  Epoch  |  Iteration  |  Time Elapsed  |  Mini-batch  |  Base Learning  |
|         |             |   (hh:mm:ss)   |     Loss     |      Rate       |
|=========================================================================|
|    1    |     10      |    00:00:26    |    1.9042    |     0.0012      | 
|    1    |     20      |    00:00:45    |    2.3645    |     0.0012      | 
|    1    |     30      |    00:01:03    |    2.1728    |     0.0012      | 
|    1    |     40      |    00:01:22    |    2.4587    |     0.0012      | 
|    1    |     50      |    00:01:40    |    1.6101    |     0.0012      | 
|    1    |     60      |    00:01:59    |    1.9428    |     0.0012      | 
|    1    |     70      |    00:02:17    |    2.0966    |     0.0012      | 
|    1    |     80      |    00:02:35    |    1.8483    |     0.0012      | 
|    1    |     90      |    00:02:53    |    1.9071    |     0.0012      | 
|    1    |     100     |    00:03:11    |    2.3982    |     0.0012      | 
|    1    |     110     |    00:03:29    |    1.8156    |     0.0012      | 
|    1    |     120     |    00:03:48    |    1.1133    |     0.0012      | 
|    1    |     130     |    00:04:07    |    1.5866    |     0.0012      | 
|    1    |     140     |    00:04:24    |    1.5608    |     0.0012      | 
|    1    |     150     |    00:04:43    |    0.9455    |     0.0012      | 
|    1    |     160     |    00:05:01    |    1.5179    |     0.0012      | 
|    1    |     170     |    00:05:20    |    1.5809    |     0.0012      | 
|    1    |     180     |    00:05:39    |    1.1198    |     0.0012      | 
|    1    |     190     |    00:05:58    |    1.9142    |     0.0012      | 
|    1    |     200     |    00:06:17    |    1.5293    |     0.0012      | 
|    1    |     210     |    00:06:35    |    1.9376    |     0.0012      | 
|    1    |     220     |    00:06:53    |    1.1024    |     0.0012      | 
|    1    |     230     |    00:07:11    |    2.7115    |     0.0012      | 
|    1    |     240     |    00:07:29    |    1.0415    |     0.0012      | 
|    1    |     250     |    00:07:48    |    2.0512    |     0.0012      | 
|    1    |     260     |    00:08:07    |    1.9210    |     0.0012      | 

Using the trained network, you can perform instance segmentation on test images, as demonstrated in the section Perform Instance Segmentation Using Pretrained Mask R-CNN.
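
For example, you can repeat the earlier inference steps with the newly trained network.

[masks,labels,scores,boxes] = segmentObjects(net,imTest);
overlayedImage = insertObjectMask(imTest,masks);
imshow(overlayedImage)
showShape("rectangle",gather(boxes),"Label",labels,"LineColor",'r')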

References

[1] He, Kaiming, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. “Mask R-CNN.” Preprint, submitted January 24, 2018. https://arxiv.org/abs/1703.06870.

[2] Lin, Tsung-Yi, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. “Microsoft COCO: Common Objects in Context,” May 1, 2014. https://arxiv.org/abs/1405.0312v3.
