# trainYOLOv2ObjectDetector

Train YOLO v2 object detector

## Syntax

``detector = trainYOLOv2ObjectDetector(trainingData,lgraph,options)``
``[detector,info] = trainYOLOv2ObjectDetector(___)``
``detector = trainYOLOv2ObjectDetector(trainingData,checkpoint,options)``
``detector = trainYOLOv2ObjectDetector(trainingData,detector,options)``
``detector = trainYOLOv2ObjectDetector(___,'TrainingImageSize',trainingSizes)``
``detector = trainYOLOv2ObjectDetector(___,Name,Value)``

## Description

### Train a Detector


`detector = trainYOLOv2ObjectDetector(trainingData,lgraph,options)` returns an object detector trained using the you only look once version 2 (YOLO v2) network architecture specified by the input `lgraph`. The `options` input specifies training parameters for the detection network.


`[detector,info] = trainYOLOv2ObjectDetector(___)` also returns information on the training progress, such as the training loss and learning rate for each iteration.

### Resume Training a Detector


`detector = trainYOLOv2ObjectDetector(trainingData,checkpoint,options)` resumes training from the saved detector checkpoint.

You can use this syntax to:

• Add more training data and continue the training.

• Improve training accuracy by increasing the maximum number of iterations.

### Fine Tune a Detector

`detector = trainYOLOv2ObjectDetector(trainingData,detector,options)` continues training a YOLO v2 object detector. Use this syntax for fine-tuning a detector.

### Multiscale Training

`detector = trainYOLOv2ObjectDetector(___,'TrainingImageSize',trainingSizes)` specifies the image sizes for multiscale training by using a name-value pair in addition to the input arguments in any of the preceding syntaxes.

`detector = trainYOLOv2ObjectDetector(___,Name,Value)` uses additional options specified by one or more `Name,Value` pair arguments and any of the previous inputs.

## Examples


Load the training data for vehicle detection into the workspace.

```
data = load('vehicleTrainingData.mat');
trainingData = data.vehicleTrainingData;
```

Specify the directory in which the training samples are stored. Add the full path to the file names in the training data.

```
dataDir = fullfile(toolboxdir('vision'),'visiondata');
trainingData.imageFilename = fullfile(dataDir,trainingData.imageFilename);
```

Randomly shuffle data for training.

```
rng(0);
shuffledIdx = randperm(height(trainingData));
trainingData = trainingData(shuffledIdx,:);
```

Create an imageDatastore using the files from the table.

`imds = imageDatastore(trainingData.imageFilename);`

Create a boxLabelDatastore using the label columns from the table.

`blds = boxLabelDatastore(trainingData(:,2:end));`

Combine the datastores.

`ds = combine(imds, blds);`

Load a preinitialized YOLO v2 object detection network.

```
net = load('yolov2VehicleDetector.mat');
lgraph = net.lgraph
```
```
lgraph = 
  LayerGraph with properties:

         Layers: [25×1 nnet.cnn.layer.Layer]
    Connections: [24×2 table]
     InputNames: {'input'}
    OutputNames: {'yolov2OutputLayer'}
```

Inspect the layers in the YOLO v2 network and their properties. You can also create the YOLO v2 network by following the steps given in Create YOLO v2 Object Detection Network.

`lgraph.Layers`
```
ans = 
  25x1 Layer array with layers:

     1   'input'               Image Input                128x128x3 images
     2   'conv_1'              Convolution                16 3x3 convolutions with stride [1 1] and padding [1 1 1 1]
     3   'BN1'                 Batch Normalization        Batch normalization
     4   'relu_1'              ReLU                       ReLU
     5   'maxpool1'            Max Pooling                2x2 max pooling with stride [2 2] and padding [0 0 0 0]
     6   'conv_2'              Convolution                32 3x3 convolutions with stride [1 1] and padding [1 1 1 1]
     7   'BN2'                 Batch Normalization        Batch normalization
     8   'relu_2'              ReLU                       ReLU
     9   'maxpool2'            Max Pooling                2x2 max pooling with stride [2 2] and padding [0 0 0 0]
    10   'conv_3'              Convolution                64 3x3 convolutions with stride [1 1] and padding [1 1 1 1]
    11   'BN3'                 Batch Normalization        Batch normalization
    12   'relu_3'              ReLU                       ReLU
    13   'maxpool3'            Max Pooling                2x2 max pooling with stride [2 2] and padding [0 0 0 0]
    14   'conv_4'              Convolution                128 3x3 convolutions with stride [1 1] and padding [1 1 1 1]
    15   'BN4'                 Batch Normalization        Batch normalization
    16   'relu_4'              ReLU                       ReLU
    17   'yolov2Conv1'         Convolution                128 3x3 convolutions with stride [1 1] and padding 'same'
    18   'yolov2Batch1'        Batch Normalization        Batch normalization
    19   'yolov2Relu1'         ReLU                       ReLU
    20   'yolov2Conv2'         Convolution                128 3x3 convolutions with stride [1 1] and padding 'same'
    21   'yolov2Batch2'        Batch Normalization        Batch normalization
    22   'yolov2Relu2'         ReLU                       ReLU
    23   'yolov2ClassConv'     Convolution                24 1x1 convolutions with stride [1 1] and padding [0 0 0 0]
    24   'yolov2Transform'     YOLO v2 Transform Layer.   YOLO v2 Transform Layer with 4 anchors.
    25   'yolov2OutputLayer'   YOLO v2 Output             YOLO v2 Output with 4 anchors.
```

Configure the network training options.

```
options = trainingOptions('sgdm',...
    'InitialLearnRate',0.001,...
    'Verbose',true,...
    'MiniBatchSize',16,...
    'MaxEpochs',30,...
    'Shuffle','never',...
    'VerboseFrequency',30,...
    'CheckpointPath',tempdir);
```

Train the YOLO v2 network.

`[detector,info] = trainYOLOv2ObjectDetector(ds,lgraph,options);`
```
*************************************************************************
Training a YOLO v2 Object Detector for the following object classes:

* vehicle

Training on single CPU.
|========================================================================================|
|  Epoch  |  Iteration  |  Time Elapsed  |  Mini-batch  |  Mini-batch  |  Base Learning  |
|         |             |   (hh:mm:ss)   |     RMSE     |     Loss     |      Rate       |
|========================================================================================|
|       1 |           1 |       00:00:01 |         7.13 |         50.8 |          0.0010 |
|       2 |          30 |       00:00:14 |         1.35 |          1.8 |          0.0010 |
|       4 |          60 |       00:00:27 |         1.13 |          1.3 |          0.0010 |
|       5 |          90 |       00:00:39 |         0.64 |          0.4 |          0.0010 |
|       7 |         120 |       00:00:51 |         0.65 |          0.4 |          0.0010 |
|       9 |         150 |       00:01:04 |         0.72 |          0.5 |          0.0010 |
|      10 |         180 |       00:01:16 |         0.52 |          0.3 |          0.0010 |
|      12 |         210 |       00:01:28 |         0.45 |          0.2 |          0.0010 |
|      14 |         240 |       00:01:41 |         0.61 |          0.4 |          0.0010 |
|      15 |         270 |       00:01:52 |         0.43 |          0.2 |          0.0010 |
|      17 |         300 |       00:02:05 |         0.42 |          0.2 |          0.0010 |
|      19 |         330 |       00:02:17 |         0.52 |          0.3 |          0.0010 |
|      20 |         360 |       00:02:29 |         0.43 |          0.2 |          0.0010 |
|      22 |         390 |       00:02:42 |         0.43 |          0.2 |          0.0010 |
|      24 |         420 |       00:02:54 |         0.59 |          0.4 |          0.0010 |
|      25 |         450 |       00:03:06 |         0.61 |          0.4 |          0.0010 |
|      27 |         480 |       00:03:18 |         0.65 |          0.4 |          0.0010 |
|      29 |         510 |       00:03:31 |         0.48 |          0.2 |          0.0010 |
|      30 |         540 |       00:03:42 |         0.34 |          0.1 |          0.0010 |
|========================================================================================|
Detector training complete.
*************************************************************************
```

Inspect the properties of the detector.

`detector`
```
detector = 
  yolov2ObjectDetector with properties:

            ModelName: 'vehicle'
              Network: [1×1 DAGNetwork]
    TrainingImageSize: [128 128]
          AnchorBoxes: [4×2 double]
           ClassNames: vehicle
```

You can verify the training accuracy by inspecting the training loss for each iteration.

```
figure
plot(info.TrainingLoss)
grid on
xlabel('Number of Iterations')
ylabel('Training Loss for Each Iteration')
```

Read a test image into the workspace.

`img = imread('detectcars.png');`

Run the trained YOLO v2 object detector on the test image for vehicle detection.

`[bboxes,scores] = detect(detector,img);`

Display the detection results.

```
if(~isempty(bboxes))
    img = insertObjectAnnotation(img,'rectangle',bboxes,scores);
end
figure
imshow(img)
```

## Input Arguments


Labeled ground truth images, specified as a datastore or a table.

• If you use a datastore, your data must be set up so that calling the datastore with the `read` and `readall` functions returns a cell array or table with two or three columns. When the output contains two columns, the first column must contain bounding boxes, and the second column must contain labels, {boxes,labels}. When the output contains three columns, the second column must contain the bounding boxes, and the third column must contain the labels. In this case, the first column can contain any type of data. For example, the first column can contain images or point cloud data.

| data | boxes | labels |
|---|---|---|
| The first column must be images. | M-by-4 matrices of bounding boxes of the form [x, y, width, height], where [x,y] represent the top-left coordinates of the bounding box. | The third column must be a cell array that contains M-by-1 categorical vectors containing object class names. All categorical data returned by the datastore must contain the same categories. |

For more information, see Datastores for Deep Learning (Deep Learning Toolbox).

• If you use a table, the table must have two or more columns. The first column of the table must contain image file names with paths. The images must be grayscale or truecolor (RGB) and they can be in any format supported by `imread`. Each of the remaining columns must be a cell vector that contains M-by-4 matrices that represent a single object class, such as vehicle, flower, or stop sign. The columns contain 4-element double arrays of M bounding boxes in the format [x,y,width,height]. The format specifies the upper-left corner location and size of the bounding box in the corresponding image. To create a ground truth table, you can use the Image Labeler app or Video Labeler app. To create a table of training data from the generated ground truth, use the `objectDetectorTrainingData` function.

Note

When the training data is specified using a table, the `trainYOLOv2ObjectDetector` function checks these conditions:

• The bounding box values must be integers. Otherwise, the function automatically rounds each noninteger value to its nearest integer.

• The bounding box must not be empty and must be within the image region. While training the network, the function ignores empty bounding boxes and bounding boxes that lie partially or fully outside the image region.
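As a rough illustration of the table layout described above, the following sketch builds a two-image training table by hand. The file names, box coordinates, and image size are hypothetical placeholders. The rounding and bounds check only mimic the conditions in the note under the assumption that a box covers pixel columns x through x + width - 1; they are not the function's internal implementation.

```matlab
% Hypothetical image files; replace with paths to your own labeled images.
imageFilename = {'images/car001.png'; 'images/car002.png'};

% One column per object class. Each cell holds an M-by-4 matrix of
% [x y width height] boxes for the corresponding image.
vehicle = {[10.4 20.6 50 40]; [30 40 60 30; 100 80 40 40]};

% Variable names of the table are taken from the workspace variable names.
trainingData = table(imageFilename,vehicle);

% Mimic the documented checks for the first image: round noninteger
% values and test whether the box lies inside an assumed 128-by-128 image.
imageSize = [128 128];                      % [height width], assumed
boxes = round(trainingData.vehicle{1});     % becomes [10 21 50 40]
isInside = boxes(:,1) >= 1 & boxes(:,2) >= 1 & ...
    boxes(:,1) + boxes(:,3) - 1 <= imageSize(2) & ...
    boxes(:,2) + boxes(:,4) - 1 <= imageSize(1);
```

A table built this way has the image file names in the first column and one box column per class, which is the shape the table input expects.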

Layer graph, specified as a `LayerGraph` object. The layer graph contains the architecture of the YOLO v2 network. You can create this network by using the `yolov2Layers` function. Alternatively, you can create the network layers by using `yolov2TransformLayer`, `yolov2ReorgLayer`, and `yolov2OutputLayer` functions. For more details on creating a custom YOLO v2 network, see Design a YOLO v2 Detection Network.

Training options, specified as a `TrainingOptionsSGDM`, `TrainingOptionsRMSProp`, or `TrainingOptionsADAM` object returned by the `trainingOptions` (Deep Learning Toolbox) function. To specify the solver name and other options for network training, use the `trainingOptions` (Deep Learning Toolbox) function.

Note

The `trainYOLOv2ObjectDetector` function does not support these training options:

• The `trainingOptions` `Shuffle` values, `'once'` and `'every-epoch'` are not supported when you use a datastore input.

• Datastore inputs are not supported when you set the `DispatchInBackground` training option to `true`.

Saved detector checkpoint, specified as a `yolov2ObjectDetector` object. To periodically save a detector checkpoint during training, specify `CheckpointPath`. To control how frequently checkpoints are saved, see the `CheckpointFrequency` and `CheckpointFrequencyUnit` training options.

To load a checkpoint for a previously trained detector, load the MAT-file from the checkpoint path. For example, if the `CheckpointPath` property of the object specified by `options` is `'/checkpath'`, you can load a checkpoint MAT-file by using this code.

```
data = load('/checkpath/yolov2_checkpoint__216__2018_11_16__13_34_30.mat');
checkpoint = data.detector;
```

The name of the MAT-file includes the iteration number and timestamp of when the detector checkpoint was saved. The detector is saved in the `detector` variable of the file. Pass the loaded `checkpoint` back into the `trainYOLOv2ObjectDetector` function:

`yoloDetector = trainYOLOv2ObjectDetector(trainingData,checkpoint,options);`
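When several checkpoints have accumulated, it can be convenient to pick the most recent one programmatically. The following is a minimal sketch, assuming the checkpoint folder is `tempdir` (as in the example on this page) and that checkpoint files follow the `yolov2_checkpoint__*.mat` naming pattern inferred from the file name shown above. Both are assumptions, so adjust them to your setup.

```matlab
% Assumed checkpoint folder; use the value you passed as 'CheckpointPath'.
checkpointDir = tempdir;

% File name pattern inferred from the example checkpoint name above.
files = dir(fullfile(checkpointDir,'yolov2_checkpoint__*.mat'));

if isempty(files)
    disp('No checkpoint files found.')
else
    % Pick the most recently modified checkpoint file.
    [~,idx] = max([files.datenum]);
    data = load(fullfile(files(idx).folder,files(idx).name));
    checkpoint = data.detector;   % detector stored in the 'detector' variable
end
```

The resulting `checkpoint` variable can then be passed to `trainYOLOv2ObjectDetector` in place of `lgraph`.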

Previously trained YOLO v2 object detector, specified as a `yolov2ObjectDetector` object. Use this syntax to continue training a detector with additional training data or to perform more training iterations to improve detector accuracy.

Set of image sizes for multiscale training, specified as an M-by-2 matrix, where each row is of the form [`height` `width`]. For each training epoch, the input training images are randomly resized to one of the M image sizes specified in this set.

If you do not specify the `trainingSizes`, the function sets this value to the size in the image input layer of the YOLO v2 network. The network resizes all training images to this value.

Note

The input `trainingSizes` values specified for multiscale training must be greater than or equal to the input size in the image input layer of the `lgraph` input argument.
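The size constraint in the note can be checked before training. This is a small sketch, assuming a 128-by-128-by-3 input layer as in the example network on this page. The specific training sizes are illustrative choices, not recommended values.

```matlab
% Input size of the network's image input layer. For the example network
% this could be read with lgraph.Layers(1).InputSize; hardcoded here.
inputSize = [128 128 3];

% Candidate sizes for multiscale training, one [height width] per row.
trainingSizes = [128 128; 160 160; 192 192; 224 224];

% Every height and width must be at least the network input size.
sizesAreValid = all(trainingSizes >= inputSize(1:2),'all');

% With valid sizes, training could then be started as follows:
% detector = trainYOLOv2ObjectDetector(ds,lgraph,options, ...
%     'TrainingImageSize',trainingSizes);
```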

### Name-Value Arguments

Specify optional pairs of arguments as `Name1=Value1,...,NameN=ValueN`, where `Name` is the argument name and `Value` is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose `Name` in quotes.

Example: `'ExperimentManager'`,`'none'` sets the `'ExperimentManager'` to `'none'`.

Detector training experiment monitoring, specified as an `experiments.Monitor` (Deep Learning Toolbox) object for use with the Experiment Manager (Deep Learning Toolbox) app. You can use this object to track the progress of training, update information fields in the training results table, record values of the metrics used by the training, and to produce training plots. For an example using this app, see Train Object Detectors in Experiment Manager.

Information monitored during training:

• Training loss at each iteration.

• Training accuracy at each iteration.

• Training root mean square error (RMSE) for the box regression layer.

• Learning rate at each iteration.

Validation information when the training `options` input contains validation data:

• Validation loss at each iteration.

• Validation accuracy at each iteration.

• Validation RMSE at each iteration.

## Output Arguments


Trained YOLO v2 object detector, returned as a `yolov2ObjectDetector` object. You can train a YOLO v2 object detector to detect multiple object classes.

Training progress information, returned as a structure array with seven fields. Each field corresponds to a stage of training.

• `TrainingLoss` — Training loss at each iteration is the mean squared error (MSE) calculated as the sum of localization error, confidence loss, and classification loss. For more information about the training loss function, see Training Loss.

• `TrainingRMSE` — Training root mean squared error (RMSE) is the RMSE calculated from the training loss at each iteration.

• `BaseLearnRate` — Learning rate at each iteration.

• `ValidationLoss` — Validation loss at each iteration.

• `ValidationRMSE` — Validation RMSE at each iteration.

• `FinalValidationLoss` — Final validation loss at end of the training.

• `FinalValidationRMSE` — Final validation RMSE at end of the training.

Each field is a numeric vector with one element per training iteration. Values that have not been calculated at a specific iteration are assigned as `NaN`. The struct contains the `ValidationLoss`, `ValidationRMSE`, `FinalValidationLoss`, and `FinalValidationRMSE` fields only when `options` specifies validation data.
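The relationship between `TrainingRMSE` and `TrainingLoss` can be checked against the training log printed in the example on this page. The values below are copied from the first four rows of that log. The reading that the RMSE is the square root of the loss is an interpretation that these rounded values appear consistent with, not a statement about the internal implementation.

```matlab
% Mini-batch loss and RMSE from the first four rows of the example log.
loss = [50.8 1.8 1.3 0.4];
rmse = [7.13 1.35 1.13 0.64];

% Compare the logged RMSE with the square root of the logged loss. The
% log rounds the loss to one decimal place, so allow a small tolerance.
difference = abs(sqrt(loss) - rmse);
isConsistent = all(difference < 0.05);

% With a real training run, the same comparison would use
% info.TrainingLoss and info.TrainingRMSE.
```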


### Data Preprocessing

By default, the `trainYOLOv2ObjectDetector` function preprocesses the training images by:

• Resizing the input images to match the input size of the network.

• Normalizing the pixel values of the input images to lie in the range [0, 1].

When you specify the training data by using a table, the `trainYOLOv2ObjectDetector` function performs data augmentation for preprocessing. The function augments the input dataset by:

• Reflecting the training data horizontally. The probability for horizontally flipping each image in the training data is 0.5.

• Uniformly scaling (zooming) the training data by a scale factor that is randomly picked from a continuous uniform distribution in the range [1, 1.1].

• Random color jittering for brightness, hue, saturation, and contrast.

When you specify the training data by using a datastore, the `trainYOLOv2ObjectDetector` function does not perform data augmentation. Instead, you can augment the training data in the datastore by using the `transform` function and then train the network with the augmented training data. For more information on how to apply augmentation while using datastores, see Preprocess Deep Learning Data (Deep Learning Toolbox).
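As one possible way to add augmentation to a datastore, the sketch below defines a hypothetical helper, `flipBoxesAndImage`, that mirrors an image and its boxes horizontally. It assumes each read of the datastore returns a 1-by-3 cell array `{image, boxes, labels}` and that a box `[x y width height]` covers pixel columns x through x + width - 1. Neither the helper name nor this pixel convention comes from the toolbox.

```matlab
% Usage, assuming ds is the combined datastore from the example above:
%   augmentedDs = transform(ds,@(data) flipBoxesAndImage(data,rand > 0.5));

function data = flipBoxesAndImage(data,doFlip)
% Mirror the image in data{1} and the boxes in data{2} left to right
% when doFlip is true. The labels in data{3} are left unchanged.
    if doFlip
        I = data{1};
        boxes = data{2};
        imageWidth = size(I,2);

        % Flip the image along its second (column) dimension.
        data{1} = flip(I,2);

        % Pixel column c maps to imageWidth + 1 - c, so the new left
        % edge of a box is imageWidth - x - width + 2.
        boxes(:,1) = imageWidth - boxes(:,1) - boxes(:,3) + 2;
        data{2} = boxes;
    end
end
```

Passing the flip decision in as an argument keeps the helper deterministic, which makes it easier to verify on a single sample before wrapping it in `transform`.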

### Training Loss

During training, the YOLO v2 object detection network optimizes the MSE loss between the predicted bounding boxes and the ground truth. The loss function is defined as

$\begin{array}{l}
K_1 \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{obj} \left[ \left(x_i - \hat{x}_i\right)^2 + \left(y_i - \hat{y}_i\right)^2 \right] \\
\quad + K_1 \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
\quad + K_2 \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{obj} \left(C_i - \hat{C}_i\right)^2 \\
\quad + K_3 \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{noobj} \left(C_i - \hat{C}_i\right)^2 \\
\quad + K_4 \sum_{i=0}^{S^2} 1_{i}^{obj} \sum_{c \in classes} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{array}$

where:

• S is the number of grid cells.

• B is the number of bounding boxes in each grid cell.

• ${1}_{ij}^{obj}$ is 1 if the jth bounding box in grid cell i is responsible for detecting the object. Otherwise it is set to 0. A grid cell i is responsible for detecting the object, if the overlap between the ground truth and a bounding box in that grid cell is greater than or equal to 0.6.

• ${1}_{ij}^{noobj}$ is 1 if the jth bounding box in grid cell i does not contain any object. Otherwise it is set to 0.

• ${1}_{i}^{obj}$ is 1 if an object is detected in grid cell i. Otherwise it is set to 0.

• $K_1$, $K_2$, $K_3$, and $K_4$ are the weights. To adjust the weights, modify the `LossFactors` property of the output layer by using the `yolov2OutputLayer` function.
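A minimal sketch of setting the weights when creating the output layer is shown below. It requires Computer Vision Toolbox. The anchor boxes and weight values are arbitrary placeholders, and the assumption that `LossFactors` is ordered as [K1 K2 K3 K4] should be confirmed against the `yolov2OutputLayer` documentation.

```matlab
% Placeholder anchor boxes, one [height width] per row.
anchorBoxes = [16 16; 32 32; 64 64; 96 96];

% Illustrative weights, assumed to be ordered as [K1 K2 K3 K4]. Here the
% objectness weight K2 is raised and the no-object weight K3 is lowered.
lossFactors = [5 2 0.5 1];

outputLayer = yolov2OutputLayer(anchorBoxes, ...
    'Name','yolov2OutputLayer', ...
    'LossFactors',lossFactors);

% To use the new layer in an existing layer graph whose output layer is
% named 'yolov2OutputLayer', it could be swapped in like this:
% lgraph = replaceLayer(lgraph,'yolov2OutputLayer',outputLayer);
```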

The loss function can be split into three parts:

• Localization loss

The first and second terms in the loss function comprise the localization loss. It measures error between the predicted bounding box and the ground truth. The parameters for computing the localization loss include the position, size of the predicted bounding box, and the ground truth. The parameters are defined as follows.

• $\left(x_i, y_i\right)$ is the center of the jth bounding box relative to grid cell i.

• $\left(\hat{x}_i, \hat{y}_i\right)$ is the center of the ground truth relative to grid cell i.

• $w_i$ and $h_i$ are the width and the height of the jth bounding box in grid cell i, respectively. The size of the predicted bounding box is specified relative to the input image size.

• $\hat{w}_i$ and $\hat{h}_i$ are the width and the height of the ground truth in grid cell i, respectively.

• $K_1$ is the weight for localization loss. Increase this value to give more weight to bounding box prediction errors.

• Confidence loss

The third and fourth terms in the loss function comprise the confidence loss. The third term measures the objectness (confidence score) error when an object is detected in the jth bounding box of grid cell i. The fourth term measures the objectness error when no object is detected in the jth bounding box of grid cell i. The parameters for computing the confidence loss are defined as follows.

• $C_i$ is the confidence score of the jth bounding box in grid cell i.

• $\hat{C}_i$ is the confidence score of the ground truth in grid cell i.

• $K_2$ is the weight for objectness error, when an object is detected in the predicted bounding box. You can adjust the value of $K_2$ to weigh confidence scores from grid cells that contain objects.

• $K_3$ is the weight for objectness error, when an object is not detected in the predicted bounding box. You can adjust the value of $K_3$ to weigh confidence scores from grid cells that do not contain objects.

The confidence loss can cause the training to diverge when the number of grid cells that do not contain objects is more than the number of grid cells that contain objects. To remedy this, increase the value for $K_2$ and decrease the value for $K_3$.

• Classification loss

The fifth term in the loss function comprises the classification loss. For example, suppose that an object is detected in the predicted bounding box contained in grid cell i. Then, the classification loss measures the squared error between the class conditional probabilities for each class in grid cell i. The parameters for computing the classification loss are defined as follows.

• $p_i(c)$ is the estimated conditional class probability for object class c in grid cell i.

• $\hat{p}_i(c)$ is the actual conditional class probability for object class c in grid cell i.

• $K_4$ is the weight for classification error when an object is detected in the grid cell. Increase this value to give more weight to the classification loss.
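To make the five terms concrete, the following sketch evaluates the loss on a toy case with two grid cells, one bounding box per cell, and two classes. All numbers, including the weights, are made up for illustration. The code mirrors the formula term by term and is not the toolbox implementation.

```matlab
K = [5 1 1 1];                   % illustrative weights [K1 K2 K3 K4]
obj = [true; false];             % box in cell 1 is responsible, cell 2 is empty

% Predicted values (x, y, w, h, C) and ground truth (xh, yh, wh, hh, Ch).
x = [0.5; 0.2];    xh = [0.5; 0.2];
y = [0.4; 0.7];    yh = [0.5; 0.7];
w = [0.25; 0.09];  wh = [0.25; 0.09];
h = [0.16; 0.04];  hh = [0.25; 0.04];
C = [0.8; 0.3];    Ch = [1; 0];

% Class probabilities, one row per grid cell and one column per class.
p  = [0.9 0.1; 0.5 0.5];
ph = [1 0; 0 0];

% Localization loss: first and second terms.
locLoss = K(1)*sum(obj.*((x - xh).^2 + (y - yh).^2),'all') + ...
    K(1)*sum(obj.*((sqrt(w) - sqrt(wh)).^2 + (sqrt(h) - sqrt(hh)).^2),'all');

% Confidence loss: third term (object) and fourth term (no object).
confLoss = K(2)*sum(obj.*(C - Ch).^2,'all') + ...
    K(3)*sum((~obj).*(C - Ch).^2,'all');

% Classification loss: fifth term, counted only for cells with an object.
cellHasObject = any(obj,2);
classLoss = K(4)*sum(cellHasObject.*sum((p - ph).^2,2));

totalLoss = locLoss + confLoss + classLoss;   % approximately 0.25 here
```

In this toy case, the localization terms contribute 0.10, the confidence terms 0.04 + 0.09, and the classification term 0.02.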

## References

[1] Redmon, J., S. Divvala, R. Girshick, and A. Farhadi. "You Only Look Once: Unified, Real-Time Object Detection." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788. Las Vegas, NV: CVPR, 2016.

[2] Redmon, J., and A. Farhadi. "YOLO9000: Better, Faster, Stronger." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6517–6525. Honolulu, HI: CVPR, 2017.

## Version History

Introduced in R2019a