YOLO v2 Basics

The you-only-look-once (YOLO) v2 object detector uses a single-stage object detection network. YOLO v2 is faster than two-stage deep learning object detectors, such as regions with convolutional neural networks (Faster R-CNNs).

The YOLO v2 model runs a deep learning CNN on an input image to produce network predictions. The object detector decodes the predictions and generates bounding boxes.

Predicting Objects in the Image

YOLO v2 uses anchor boxes to detect classes of objects in an image. For more details, see Anchor Boxes for Object Detection. YOLO v2 predicts these three attributes for each anchor box:

  • Intersection over union (IoU): Predicts the objectness score of each anchor box.

  • Anchor box offsets: Refine the anchor box position.

  • Class probability: Predicts the class label assigned to each anchor box.

The figure shows the predefined anchor box (the dotted line) and the refined location after offsets are applied.
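
The refinement and the IoU measure described above can be sketched in a few lines. The following Python snippet is a conceptual illustration only, not the toolbox implementation: it assumes the offset parameterization from the YOLO9000 paper (sigmoid-bounded center offsets within a grid cell, exponential scaling of the anchor size), and the function names decode_box and iou are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function; maps any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(offsets, cell, anchor):
    """Refine one anchor box with predicted offsets (assumed YOLO9000-style).

    offsets: (tx, ty, tw, th) raw network outputs (hypothetical values)
    cell:    (cx, cy) top-left corner of the grid cell, in grid units
    anchor:  (pw, ph) predefined anchor width and height, in grid units
    Returns (bx, by, bw, bh): refined box center and size, in grid units.
    """
    tx, ty, tw, th = offsets
    cx, cy = cell
    pw, ph = anchor
    bx = cx + sigmoid(tx)   # center stays inside the responsible cell
    by = cy + sigmoid(ty)
    bw = pw * math.exp(tw)  # anchor size is scaled, never negative
    bh = ph * math.exp(th)
    return bx, by, bw, bh


def iou(a, b):
    """Intersection over union of two boxes given as (x_center, y_center, w, h)."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


# Zero offsets leave the anchor centered in its cell with its original size.
refined = decode_box((0.0, 0.0, 0.0, 0.0), cell=(3, 2), anchor=(2.0, 1.5))
print(refined)                # (3.5, 2.5, 2.0, 1.5)
print(iou(refined, refined))  # 1.0
```

In this sketch, a high IoU between a refined box and a ground truth box corresponds to a high objectness target for that anchor, which is the role the first attribute in the list plays during training.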

Transfer Learning

With transfer learning, you can use a pretrained CNN as the feature extractor in a YOLO v2 detection network. Use the yolov2Layers function to create a YOLO v2 detection network from any pretrained CNN, for example, MobileNet v2. For a list of pretrained CNNs, see Pretrained Deep Neural Networks (Deep Learning Toolbox).

You can also design a custom model based on a pretrained image classification CNN. For more details, see Design a YOLO v2 Detection Network.

Design a YOLO v2 Detection Network

You can design a custom YOLO v2 model layer by layer. The model starts with a feature extractor network, which can be initialized from a pretrained CNN or trained from scratch. The detection subnetwork contains a series of convolution, batch normalization, and ReLU layers, followed by the transform and output layers, which are yolov2TransformLayer and yolov2OutputLayer objects, respectively. yolov2TransformLayer transforms the raw CNN output into the form required to produce object detections. yolov2OutputLayer defines the anchor box parameters and implements the loss function used to train the detector.
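
When designing the detection subnetwork by hand, the size of the final convolution layer follows from the three attributes predicted per anchor box. The Python sketch below is a conceptual illustration, assuming the standard YOLO v2 layout of four box offsets, one objectness score, and one probability per class for each anchor; the function names are hypothetical and are not part of the toolbox.

```python
def yolo_v2_output_channels(num_anchors, num_classes):
    """Number of filters needed in the last convolution layer.

    Each anchor box predicts 4 offsets, 1 objectness (IoU) score,
    and one probability per class (assumed YOLO v2 layout).
    """
    predictions_per_anchor = 5 + num_classes
    return num_anchors * predictions_per_anchor


def yolo_v2_output_shape(grid_height, grid_width, num_anchors, num_classes):
    """Shape of the raw prediction tensor as (height, width, channels)."""
    channels = yolo_v2_output_channels(num_anchors, num_classes)
    return (grid_height, grid_width, channels)


# Example: 5 anchor boxes and 20 classes on a 13-by-13 output grid.
print(yolo_v2_output_shape(13, 13, num_anchors=5, num_classes=20))  # (13, 13, 125)
```

The transform layer then operates on this tensor, one group of 5 + numClasses values per anchor box at each grid location.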

You can also use the Deep Network Designer app to manually create a network. The designer incorporates Computer Vision Toolbox™ YOLO v2 features.

Design a YOLO v2 Detection Network with a Reorg Layer

The reorganization layer (created using the yolov2ReorgLayer object) and the depth concatenation layer (created using the depthConcatenationLayer object) are used to combine low-level and high-level features. Adding low-level image information improves detection accuracy for smaller objects. Typically, the reorganization layer is attached to a layer within the feature extraction network whose output feature map is larger than the feature extraction layer output.
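
To combine a larger low-level feature map with a smaller high-level one, the reorganization step moves spatial blocks into the channel dimension so that both maps share the same height and width. The Python sketch below illustrates the idea on nested lists indexed as [row][column][channel]. It is a conceptual illustration only: the function names reorg and depth_concat are hypothetical, and the channel ordering produced by the actual yolov2ReorgLayer may differ.

```python
def reorg(feature_map, stride=2):
    """Space-to-depth: [H][W][C] -> [H/stride][W/stride][C*stride*stride]."""
    height = len(feature_map)
    width = len(feature_map[0])
    if height % stride or width % stride:
        raise ValueError("height and width must be divisible by stride")
    out = []
    for i in range(0, height, stride):
        row = []
        for j in range(0, width, stride):
            channels = []
            # Stack every pixel of the stride-by-stride block along depth.
            for di in range(stride):
                for dj in range(stride):
                    channels.extend(feature_map[i + di][j + dj])
            row.append(channels)
        out.append(row)
    return out


def depth_concat(a, b):
    """Concatenate two maps with equal height and width along channels."""
    return [[pa + pb for pa, pb in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


# Low-level map: 4-by-4 with 1 channel, values 0..15.
low = [[[r * 4 + c] for c in range(4)] for r in range(4)]
# High-level map: 2-by-2 with 3 channels (placeholder values).
high = [[[-1, -2, -3] for _ in range(2)] for _ in range(2)]

reorganized = reorg(low, stride=2)      # now 2-by-2 with 4 channels
combined = depth_concat(reorganized, high)
print(reorganized[0][0])                # [0, 1, 4, 5]
print(len(combined), len(combined[0]), len(combined[0][0]))  # 2 2 7
```

No values are discarded by the reorganization; the spatial resolution drops by the stride while the channel count grows by the stride squared.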


For more details on how to create this kind of network, see Create YOLO v2 Object Detection Network.

Train an Object Detector and Detect Objects with a YOLO v2 Model

To learn how to train an object detector by using the YOLO deep learning technique with a CNN, see the Object Detection Using YOLO v2 Deep Learning example.

Code Generation

To learn how to generate CUDA® code using the YOLO v2 object detector (created using the yolov2ObjectDetector object), see Code Generation for Object Detection Using YOLO v2.

Label Training Data for Deep Learning

You can use the Image Labeler, Video Labeler, or Ground Truth Labeler (available in Automated Driving Toolbox™) apps to interactively label ground truth data and export the labels for training. The apps can label rectangular regions of interest (ROIs) for object detection, scene labels for image classification, and pixels for semantic segmentation.


References

[1] Redmon, J., and A. Farhadi. "YOLO9000: Better, Faster, Stronger." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6517–6525. Honolulu, HI: CVPR, 2017.

[2] Redmon, J., S. Divvala, R. Girshick, and A. Farhadi. "You Only Look Once: Unified, Real-Time Object Detection." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 779–788. Las Vegas, NV: CVPR, 2016.
