Quantize Layers in Object Detectors and Generate CUDA Code

This example shows how to generate CUDA® code for an SSD vehicle detector and a YOLO v2 vehicle detector that perform inference computations in 8-bit integers for the convolutional layers.

Deep learning is a powerful machine learning technique in which you train a network to learn image features and perform detection tasks. There are several techniques for object detection using deep learning, such as Faster R-CNN, You Only Look Once (YOLO v2) [2], and SSD [1]. For more information, see Object Detection Using YOLO v2 Deep Learning (Computer Vision Toolbox) and Object Detection Using SSD Deep Learning (Computer Vision Toolbox).

Neural network architectures used for deep learning applications contain many processing layers, including convolutional layers. Deep learning models typically work on large sets of labeled data. Performing inference on these models is computationally intensive, consuming significant amounts of memory. Neural networks use memory to store input data, parameters (weights), and activations from each layer as the input propagates through the network. Deep neural networks trained in MATLAB® use single-precision floating point data types. Even networks that are small in size require a considerable amount of memory and hardware to perform these floating-point arithmetic operations. These restrictions can inhibit deployment of deep learning models to devices that have low computational power and smaller memory resources. By using a lower precision to store the weights and activations, you can reduce the memory requirements of the network.

You can use Deep Learning Toolbox™ in tandem with the Deep Learning Toolbox Model Quantization Library support package to reduce the memory footprint of a deep neural network by quantizing the weights, biases, and activations of convolution layers to 8-bit scaled integer data types. Then, you can use GPU Coder™ to generate CUDA code for the optimized network.
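
To make the idea of an 8-bit scaled integer concrete, the following sketch quantizes a set of synthetic single-precision weights to int8 with a power-of-two scale factor and then converts them back. The scaling rule, the synthetic data, and the variable names are assumptions chosen for illustration; this is not necessarily the exact scheme that the Model Quantization Library applies.

```matlab
% Synthetic single-precision "weights" (illustrative values only).
rng(0);
w = single(2*rand(1000,1) - 1);          % values in (-1, 1)

% Choose a power-of-two scale so that the largest magnitude fits in int8.
% Assumption: symmetric range, scale = 2^exponent.
maxAbs   = max(abs(w));
exponent = ceil(log2(double(maxAbs)/127));
scale    = 2^exponent;

% Quantize: divide by the scale and round to the nearest integer.
q = int8(round(double(w)/scale));        % int8() saturates to [-128, 127]

% Dequantize to compare against the original values.
wHat = single(double(q)*scale);

maxErr     = max(abs(double(w) - double(wHat)));
bytesFloat = numel(w)*4;                 % single: 4 bytes per element
bytesInt8  = numel(q)*1;                 % int8:   1 byte per element
fprintf('scale = 2^%d, max error = %.4g, memory: %d -> %d bytes\n', ...
    exponent,maxErr,bytesFloat,bytesInt8);
```

In this sketch the rounding error of each value is at most half the scale factor, and the storage drops from 4 bytes to 1 byte per element.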

Download Pretrained Network

Download a pretrained object detector to avoid having to wait for training to complete. Set detectorType to 1 to use the SSD detector or to 2 to use the YOLO v2 detector.

detectorType = 2

detectorType = 2

switch detectorType
    case 1
        if ~exist('ssdResNet50VehicleExample_20a.mat','file')
            disp('Downloading pretrained detector...');
            pretrainedURL = 'https://www.mathworks.com/supportfiles/vision/data/ssdResNet50VehicleExample_20a.mat';
            websave('ssdResNet50VehicleExample_20a.mat',pretrainedURL);
        end
    case 2
        if ~exist('yolov2ResNet50VehicleExample_19b.mat','file')    
            disp('Downloading pretrained detector...');
            pretrainedURL = 'https://www.mathworks.com/supportfiles/vision/data/yolov2ResNet50VehicleExample_19b.mat';
            websave('yolov2ResNet50VehicleExample_19b.mat',pretrainedURL);
        end
end

Load Data

This example uses a small vehicle data set that contains 295 images. Many of these images come from the Caltech Cars 1999 and 2001 data sets, created by Pietro Perona and used with permission. Each image contains one or two labeled instances of a vehicle. A small data set is useful for exploring the training procedure, but in practice, more labeled images are needed to train a robust detector. Extract the vehicle images and load the vehicle ground truth data.

unzip vehicleDatasetImages.zip
data = load('vehicleDatasetGroundTruth.mat');
vehicleDataset = data.vehicleDataset;

Prepare Data for Training, Calibration, and Validation

The training data is stored in a table. The first column contains the path to the image files. The remaining columns contain the ROI labels for vehicles. Display the first few rows of the data.

vehicleDataset(1:4,:)

ans=4×2 table
              imageFilename                   vehicle     
    _________________________________    _________________

    {'vehicleImages/image_00001.jpg'}    {[220 136 35 28]}
    {'vehicleImages/image_00002.jpg'}    {[175 126 61 45]}
    {'vehicleImages/image_00003.jpg'}    {[108 120 45 33]}
    {'vehicleImages/image_00004.jpg'}    {[124 112 38 36]}

Split the data set into training, calibration, and validation sets. Select 60% of the data for training, 10% for calibration, and the remainder for validating the trained detector.

rng(0);
shuffledIndices = randperm(height(vehicleDataset));
idx = floor(0.6 * length(shuffledIndices) );

trainingIdx = 1:idx;
trainingDataTbl = vehicleDataset(shuffledIndices(trainingIdx),:);

calibrationIdx = idx+1 : idx + 1 + floor(0.1 * length(shuffledIndices) );
calibrationDataTbl = vehicleDataset(shuffledIndices(calibrationIdx),:);

validationIdx = calibrationIdx(end)+1 : length(shuffledIndices);
validationDataTbl = vehicleDataset(shuffledIndices(validationIdx),:);
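
As a quick sanity check on the index arithmetic, the following sketch repeats the same expressions for a data set of 295 rows and confirms that the three index ranges partition the data. The variable numImages is introduced here for illustration only.

```matlab
% Repeat the index arithmetic for a data set with 295 rows.
numImages = 295;
idx = floor(0.6*numImages);

trainingIdx    = 1:idx;
calibrationIdx = idx+1 : idx + 1 + floor(0.1*numImages);
validationIdx  = calibrationIdx(end)+1 : numImages;

% Each position should land in exactly one subset.
allIdx = [trainingIdx calibrationIdx validationIdx];
assert(isequal(allIdx,1:numImages),'Subsets must partition the data.');

fprintf('training: %d, calibration: %d, validation: %d\n', ...
    numel(trainingIdx),numel(calibrationIdx),numel(validationIdx));
```

For 295 images this gives 177 training, 30 calibration, and 88 validation images. The calibration range holds one image more than floor(0.1*295) = 29 because both end points of the range are included.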

Use imageDatastore and boxLabelDatastore to create datastores for loading the image and label data during training and evaluation.

imdsTrain = imageDatastore(trainingDataTbl{:,'imageFilename'});
bldsTrain = boxLabelDatastore(trainingDataTbl(:,'vehicle'));

imdsCalibration = imageDatastore(calibrationDataTbl{:,'imageFilename'});
bldsCalibration = boxLabelDatastore(calibrationDataTbl(:,'vehicle'));

imdsValidation = imageDatastore(validationDataTbl{:,'imageFilename'});
bldsValidation = boxLabelDatastore(validationDataTbl(:,'vehicle'));

Combine the image and box label datastores.

trainingData = combine(imdsTrain,bldsTrain);
calibrationData = combine(imdsCalibration,bldsCalibration);
validationData = combine(imdsValidation,bldsValidation);

Display one of the calibration images and box labels.

data = read(calibrationData);
I = data{1};
bbox = data{2};
annotatedImage = insertShape(I,'Rectangle',bbox);
annotatedImage = imresize(annotatedImage,2);
figure
imshow(annotatedImage)

Define Network Parameters

To reduce the computational cost of running the example, specify a network input size that corresponds to the minimum size required to run the network.

inputSize = []; 
switch detectorType 
    case 1
        inputSize = [300 300 3]; % Minimum size for SSD
    case 2
        inputSize = [224 224 3]; % Minimum size for YOLO v2
end

Define the number of object classes to detect.

numClasses = width(vehicleDataset)-1;

Data Augmentation

Data augmentation is used to improve network accuracy by randomly transforming the original data during training. By using data augmentation, you can add more variety to the training data without actually having to increase the number of labeled training samples.

Use transformations to augment the calibration data by:

Randomly flipping the image and associated box labels horizontally.
Randomly scaling the image and associated box labels.
Randomly jittering the image color.

Note that data augmentation is not applied to the validation data. Ideally, validation data is representative of the original data and left unmodified for unbiased evaluation.

augmentedCalibrationData = transform(calibrationData,@augmentVehicleData);

Visualize the augmented calibration data by reading the same image multiple times.

augmentedData = cell(4,1);
for k = 1:4
    data = read(augmentedCalibrationData);
    augmentedData{k} = insertShape(data{1},'Rectangle',data{2});
    reset(augmentedCalibrationData);
end

figure
montage(augmentedData,'BorderSize',10)
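
The helper augmentVehicleData is not listed in this section. As an illustration of the first transformation only, the following sketch flips a synthetic image horizontally and updates an [x y width height] box to match. It assumes that x and y are integer pixel indices of the upper-left corner of the box; the variable names are hypothetical, and the actual helper may work differently.

```matlab
% Synthetic 6-by-10 grayscale image with one bright rectangular "object".
I = zeros(6,10,'uint8');
bbox = [2 3 4 2];                        % [x y width height]
I(bbox(2):bbox(2)+bbox(4)-1, bbox(1):bbox(1)+bbox(3)-1) = 255;

% Flip the image left to right.
Iflip = I(:,end:-1:1);

% Mirror the box: the old right edge becomes the new left edge.
imageWidth = size(I,2);
bboxFlip = bbox;
bboxFlip(1) = imageWidth - (bbox(1) + bbox(3) - 1) + 1;

% The mirrored box should cover exactly the bright pixels of the flipped image.
region = Iflip(bboxFlip(2):bboxFlip(2)+bboxFlip(4)-1, ...
               bboxFlip(1):bboxFlip(1)+bboxFlip(3)-1);
disp(bboxFlip)
disp(all(region(:) == 255))
```

The box width and height stay the same; only the x-coordinate changes, here from 2 to 6.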

Preprocess Calibration Data

Preprocess the augmented calibration data to prepare for calibration of the network.

preprocessedCalibrationData = transform(augmentedCalibrationData,@(data)preprocessVehicleData(data,inputSize));

Read the preprocessed calibration data.

data = read(preprocessedCalibrationData);

Display the image and bounding boxes.

I = data{1};
bbox = data{2};
annotatedImage = insertShape(I,'Rectangle',bbox);
annotatedImage = imresize(annotatedImage,2);
figure
imshow(annotatedImage)
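
The helper preprocessVehicleData is also not listed in this section. The following function is a minimal sketch of what such a helper could look like, assuming that it only resizes the image and its boxes to the network input size. The name preprocessVehicleDataSketch is hypothetical. The sketch relies on imresize and on bboxresize from Computer Vision Toolbox.

```matlab
function data = preprocessVehicleDataSketch(data,targetSize)
% Hypothetical stand-in for preprocessVehicleData (not the original helper).
% data is a cell array {image, boxes, ...} as returned by read on the
% combined datastore; targetSize is [height width channels].

% Scale factors [sy sx] from the original image size to the target size.
scale = targetSize(1:2)./size(data{1},[1 2]);

% Resize the image to the network input size.
data{1} = imresize(data{1},targetSize(1:2));

% Resize the [x y width height] boxes by the same factors.
data{2} = bboxresize(round(data{2}),scale);
end
```

Any remaining cells, such as the class labels, pass through unchanged, so the function can be used with transform in the same way as the original helper.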

Load and Test Pretrained Detector

Load the pretrained detector.

switch detectorType
    case 1
        % Load pretrained SSD detector for the example.
        pretrained = load('ssdResNet50VehicleExample_20a.mat');
        detector = pretrained.detector;
    case 2 
        % Load pretrained YOLO v2 detector for the example.
        pretrained = load('yolov2ResNet50VehicleExample_19b.mat');
        detector = pretrained.detector;
end

As a quick test, run the detector on one image from the calibration set.

data = read(calibrationData);
I = data{1,1};
I = imresize(I,inputSize(1:2));
[bboxes,scores] = detect(detector,I, 'Threshold', 0.4);

Display the results.

I = insertObjectAnnotation(I,'rectangle',bboxes,scores);
figure
imshow(I)

Validate Floating-Point Network

Evaluate the trained object detector on a large set of images to measure the performance. Use the evaluateObjectDetection (Computer Vision Toolbox) function to measure common object detector metrics, such as average precision and log-average miss rates. For this example, use the average precision metric to evaluate performance. The average precision provides a single number that incorporates the ability of the detector to make correct classifications (precision) and the ability of the detector to find all relevant objects (recall).

Apply the same preprocessing transform to the validation data as to the calibration data. Note that data augmentation is not applied to the validation data. Ideally, validation data is representative of the original data and left unmodified for unbiased evaluation.

preprocessedValidationData = transform(validationData,@(data)preprocessVehicleData(data,inputSize));

Run the detector on all the validation images.

detectionResults = detect(detector, preprocessedValidationData,'Threshold',0.4);

Evaluate the object detector using the average precision metric.

metrics = evaluateObjectDetection(detectionResults,preprocessedValidationData);
ap = averagePrecision(metrics,ClassName="vehicle");
[precision, recall] = precisionRecall(metrics,ClassName="vehicle");
precision = precision{:};
recall = recall{:};

The precision/recall (PR) curve highlights how precise a detector is at varying levels of recall. Ideally, the precision is 1 at all recall levels. Using more data can help improve the average precision, but might require more training time. Plot the PR curve.

figure
plot(recall,precision)
xlabel('Recall')
ylabel('Precision')
grid on
title(sprintf('Average Precision = %.2f',ap))
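
To show how precision and recall follow from ranked detections, the following sketch computes both for a toy list of detections that have already been matched against ground truth. The scores, the match flags, and the ground-truth count are invented for illustration, and evaluateObjectDetection may perform the matching and the averaging differently.

```matlab
% Toy detections: confidence score and whether each one matched a
% ground-truth box (true positive). All values are made up.
scores = [0.6 0.9 0.7 0.8];
isTruePositive = logical([1 1 1 0]);
numGroundTruth = 4;                      % total labeled vehicles

% Rank detections from most to least confident.
[~,order] = sort(scores,'descend');
tp = isTruePositive(order);

% Cumulative counts as the score threshold is lowered.
cumTP = cumsum(tp);
cumFP = cumsum(~tp);

precision = cumTP./(cumTP + cumFP);      % fraction of detections that are correct
recall    = cumTP./numGroundTruth;       % fraction of vehicles that are found

disp([recall; precision])
```

Each column of the displayed matrix is one point on the PR curve. In this toy case the false positive ranked second lowers the precision from 1 to 0.5 without changing the recall.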

Generate Calibration Result File for the Network

Create a dlquantizer object and specify the detector to quantize. By default, the execution environment is set to GPU. To learn about the products required to quantize and deploy the detector to a GPU environment, see Quantization Workflow Prerequisites. Note that code generation does not support quantized deep neural networks produced by the quantize function.

quantObj = dlquantizer(detector)

quantObj = 
  dlquantizer with properties:

           NetworkObject: [1×1 yolov2ObjectDetector]
    ExecutionEnvironment: 'GPU'

Specify the metric function in a dlquantizationOptions object.

quantOpts = dlquantizationOptions('Target','gpu', ...
    'MetricFcn', ...
    {@(x)hVerifyDetectionResults(x, detector.Network, preprocessedValidationData)});

Use the calibrate function to exercise the network with sample inputs and collect range information. The function collects the dynamic ranges of the weights and biases in the convolution and fully connected layers of the network, as well as the dynamic ranges of the activations in all layers of the network. The function returns a table. Each row of the table contains range information for a learnable parameter or activation of the optimized network.

calResults = calibrate(quantObj,preprocessedCalibrationData)

calResults=202×5 table
       Optimized Layer Name       Network Layer Name    Learnables / Activations    MinValue    MaxValue
    __________________________    __________________    ________________________    ________    ________

    {'conv1_Weights'         }    {'conv1'         }           "Weights"             -9.3984      9.511 
    {'conv1_Bias'            }    {'conv1'         }           "Bias"                -2.6468     6.3474 
    {'res2a_branch2a_Weights'}    {'res2a_branch2a'}           "Weights"            -0.85967    0.35191 
    {'res2a_branch2a_Bias'   }    {'res2a_branch2a'}           "Bias"                -5.0999     5.6429 
    {'res2a_branch2b_Weights'}    {'res2a_branch2b'}           "Weights"            -0.24903    0.32103 
    {'res2a_branch2b_Bias'   }    {'res2a_branch2b'}           "Bias"                 -2.749     5.1706 
    {'res2a_branch2c_Weights'}    {'res2a_branch2c'}           "Weights"             -1.6711     1.6394 
    {'res2a_branch2c_Bias'   }    {'res2a_branch2c'}           "Bias"                -6.8159     9.2926 
    {'res2a_branch1_Weights' }    {'res2a_branch1' }           "Weights"             -2.4565     1.1476 
    {'res2a_branch1_Bias'    }    {'res2a_branch1' }           "Bias"                -5.3913     22.913 
    {'res2b_branch2a_Weights'}    {'res2b_branch2a'}           "Weights"            -0.46713    0.34267 
    {'res2b_branch2a_Bias'   }    {'res2b_branch2a'}           "Bias"                -2.9678     3.5533 
    {'res2b_branch2b_Weights'}    {'res2b_branch2b'}           "Weights"            -0.42871    0.57949 
    {'res2b_branch2b_Bias'   }    {'res2b_branch2b'}           "Bias"                 -2.697     2.1982 
    {'res2b_branch2c_Weights'}    {'res2b_branch2c'}           "Weights"             -1.1761     1.3237 
    {'res2b_branch2c_Bias'   }    {'res2b_branch2c'}           "Bias"                -4.9467     5.1857 
      ⋮

Use the validate function to quantize the learnable parameters in the convolution layers of the network and exercise the network. The function uses the metric function defined in the dlquantizationOptions object to compare the results of the network before and after quantization.

valResults = validate(quantObj,preprocessedValidationData,quantOpts)

valResults = struct with fields:
       NumSamples: 88
    MetricResults: [1×1 struct]
       Statistics: [2×2 table]

Examine the MetricResults.Result and Statistics fields of the validation output to see the performance of the optimized network. The first row of each table contains information for the original, floating-point implementation. The second row contains the information for the quantized implementation. The output of the metric function is displayed in the MetricOutput column.

valResults.MetricResults.Result

ans=2×2 table
    NetworkImplementation    MetricOutput
    _____________________    ____________

     {'Floating-Point'}        0.75749   
     {'Quantized'     }        0.72435

valResults.Statistics

ans=2×2 table
    NetworkImplementation    LearnableParameterMemory(bytes)
    _____________________    _______________________________

     {'Floating-Point'}                1.0979e+08           
     {'Quantized'     }                  2.75e+07

The metrics show that quantization reduces the memory required for the learnable parameters by approximately 75% and lowers the metric output by approximately 0.03 (from 0.757 to 0.724).
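
The arithmetic behind these figures can be checked directly from the values in the validation output. The variable names in this sketch are chosen for illustration.

```matlab
% Values copied from the validation output.
memFloat    = 1.0979e+08;                % bytes, floating-point
memQuant    = 2.75e+07;                  % bytes, quantized
metricFloat = 0.75749;
metricQuant = 0.72435;

memReduction = 1 - memQuant/memFloat;    % fraction of memory saved
metricDrop   = metricFloat - metricQuant;% absolute change in the metric
relativeDrop = metricDrop/metricFloat;   % change relative to floating point

fprintf('Memory reduction: %.1f%%\n',100*memReduction);
fprintf('Metric drop: %.3f absolute, %.1f%% relative\n', ...
    metricDrop,100*relativeDrop);
```

This prints a memory reduction of 75.0% and a metric drop of 0.033 in absolute terms, which is 4.4% relative to the floating-point result.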

To visualize the calibration statistics, use the Deep Network Quantizer app. First, save the dlquantizer object.

save('dlquantObj.mat','quantObj')

In the MATLAB® Command Window, open the Deep Network Quantizer app.

deepNetworkQuantizer

Then import the dlquantizer object quantObj in the Deep Network Quantizer app by selecting New > Import dlquantizer object.

Generate CUDA Code

After you train and evaluate the detector, you can generate code for the ssdObjectDetector or yolov2ObjectDetector using GPU Coder™. For more details, see Code Generation for Object Detection by Using Single Shot Multibox Detector (Computer Vision Toolbox) and Code Generation for Object Detection by Using YOLO v2 (GPU Coder).

cfg = coder.gpuConfig('mex');
cfg.TargetLang = 'C++';

% Check compute capability of GPU
gpuInfo = gpuDevice;
cc = gpuInfo.ComputeCapability;

% Create deep learning code generation configuration object
cfg.DeepLearningConfig = coder.DeepLearningConfig('cudnn');

% INT8 precision requires a CUDA GPU with compute capability 6.1, 6.3, or
% higher (compute capability 6.2 does not support INT8)
cfg.GpuConfig.ComputeCapability = cc;
cfg.DeepLearningConfig.DataType = 'int8';
cfg.DeepLearningConfig.CalibrationResultFile = 'dlquantObj.mat';

Run the codegen command to generate CUDA code.

codegen -config cfg mynet_detect -args {coder.Constant(detectorType), ones(inputSize, 'single')} -report

When code generation is successful, you can view the resulting code generation report by clicking View Report in the MATLAB Command Window. The report is displayed in the Report Viewer window. If the code generator detects errors or warnings during code generation, the report describes the issues and provides links to the problematic MATLAB code. See Code Generation Reports (MATLAB Coder).
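
The entry-point function mynet_detect that is passed to codegen is not listed in this section. The following is a minimal sketch of what such a function could look like, assuming that it loads the detector selected by detectorType with coder.loadDeepLearningNetwork and calls detect on the input image. The persistent variable names and the threshold value are assumptions, and the original function may differ.

```matlab
function [bboxes,scores,labels] = mynet_detect(detectorType,image)
% Sketch of an entry-point function for codegen (assumed, not the original).
% detectorType: 1 for SSD, 2 for YOLO v2, passed as a compile-time constant.
% image: single-precision image whose size matches inputSize.

% Keep each detector in a persistent variable so that it loads only once.
persistent ssdObj
persistent yolov2Obj

switch detectorType
    case 1
        if isempty(ssdObj)
            ssdObj = coder.loadDeepLearningNetwork('ssdResNet50VehicleExample_20a.mat');
        end
        % The threshold is an assumption chosen to match the earlier test.
        [bboxes,scores,labels] = detect(ssdObj,image,'Threshold',0.4);
    case 2
        if isempty(yolov2Obj)
            yolov2Obj = coder.loadDeepLearningNetwork('yolov2ResNet50VehicleExample_19b.mat');
        end
        [bboxes,scores,labels] = detect(yolov2Obj,image,'Threshold',0.4);
end
end
```

Because detectorType is passed as coder.Constant in the codegen command, only the branch for the selected detector is compiled into the generated code.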

References

[1] Liu, Wei, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng Yang Fu, and Alexander C. Berg. "SSD: Single Shot Multibox Detector." In Computer Vision - ECCV 2016, edited by Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, 9905:21-37. Cham: Springer International Publishing, 2016. https://doi.org/10.1007/978-3-319-46448-0_2

[2] Redmon, Joseph, and Ali Farhadi. "YOLO9000: Better, Faster, Stronger." In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6517-25. Honolulu, HI: IEEE, 2017. https://doi.org/10.1109/CVPR.2017.690