Deploy Image Recognition Network on FPGA with and Without Pruning

This example shows how to deploy an image recognition network to an FPGA, with and without convolutional filter pruning. Filter pruning is a compression technique that uses a ranking criterion to identify and remove the least important convolutional filters from a network, reducing the overall memory footprint of the network without significantly reducing its accuracy.
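For illustration, one common ranking criterion is the L1-norm of each filter's weights: filters with the smallest norms contribute least and are pruned first. (The pruned network used later in this example was produced with Taylor scores, as described in Prune Image Classification Network Using Taylor Scores; the sketch below, with stand-in weights, only illustrates the general idea.)

```matlab
% Sketch: rank the filters of one convolution layer by L1-norm.
% W is assumed to be a 4-D weight array of size
% filterHeight-by-filterWidth-by-numChannels-by-numFilters.
W = randn(3,3,16,32);                        % stand-in for conv layer weights

numFilters = size(W,4);
scores = zeros(numFilters,1);
for k = 1:numFilters
    scores(k) = sum(abs(W(:,:,:,k)),'all');  % L1-norm of filter k
end

% Keep the highest-scoring filters; prune the rest.
fracToPrune = 0.25;
[~,order] = sort(scores,'descend');
keepIdx = order(1:round((1-fracToPrune)*numFilters));
```

Pruning 25% of 32 filters this way leaves 24 filters, which shrinks both this layer's weights and the channel dimension of the next layer's weights.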

Load Unpruned Network

Load the unpruned trained network. For information on network training, see Train Residual Network for Image Classification.

load("trainedYOLONet.mat");

Test Network

Load a test image. The test image is part of the CIFAR-10 data set [1]. To download the data set, see the Prepare Data section in Train Residual Network for Image Classification.

load("testImage.mat");

Use the runOnHW function to:

  • Prepare the network for deployment.

  • Compile the network to generate weights, biases, and instructions.

  • Deploy the network to the FPGA board.

  • Retrieve the prediction results using MATLAB®.

To view the code for this function, see Helper Functions.
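The helper follows the standard Deep Learning HDL Toolbox workflow. A minimal sketch of such a helper (assuming a Xilinx board reachable over Ethernet; the actual helper in Helper Functions may differ in its details):

```matlab
function [result, speed] = runOnHWSketch(net, img, bitstream)
% Sketch of a deploy-and-predict helper using the dlhdl workflow.
% Assumes Deep Learning HDL Toolbox and a Xilinx board on Ethernet.
hTarget = dlhdl.Target('Xilinx', Interface = 'Ethernet');
hW = dlhdl.Workflow(Network = net, Bitstream = bitstream, ...
                    Target = hTarget);
compile(hW);                    % generate weights, biases, and instructions
deploy(hW);                     % program the FPGA and load the network
[result, speed] = predict(hW, img, Profile = 'on');  % run and profile
end
```

The Profile = 'on' option is what produces the Deep Learning Processor Profiler reports shown below.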

[~, speedInitial] = runOnHW(trainedNet,testImage,'zcu102_single');
### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_single.
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### Notice: The layer 'input' of type 'ImageInputLayer' is split into an image input layer 'input' and an addition layer 'input_norm' for normalization on hardware.
### The network includes the following layers:
     1   'input'         Image Input             32×32×3 images with 'zerocenter' normalization                      (SW Layer)
     2   'convInp'       2-D Convolution         16 3×3×3 convolutions with stride [1  1] and padding 'same'         (HW Layer)
     3   'reluInp'       ReLU                    ReLU                                                                (HW Layer)
     4   'S1U1_conv1'    2-D Convolution         16 3×3×16 convolutions with stride [1  1] and padding 'same'        (HW Layer)
     5   'S1U1_relu1'    ReLU                    ReLU                                                                (HW Layer)
     6   'S1U1_conv2'    2-D Convolution         16 3×3×16 convolutions with stride [1  1] and padding 'same'        (HW Layer)
     7   'add11'         Addition                Element-wise addition of 2 inputs                                   (HW Layer)
     8   'relu11'        ReLU                    ReLU                                                                (HW Layer)
     9   'S1U2_conv1'    2-D Convolution         16 3×3×16 convolutions with stride [1  1] and padding 'same'        (HW Layer)
    10   'S1U2_relu1'    ReLU                    ReLU                                                                (HW Layer)
    11   'S1U2_conv2'    2-D Convolution         16 3×3×16 convolutions with stride [1  1] and padding 'same'        (HW Layer)
    12   'add12'         Addition                Element-wise addition of 2 inputs                                   (HW Layer)
    13   'relu12'        ReLU                    ReLU                                                                (HW Layer)
    14   'S1U3_conv1'    2-D Convolution         16 3×3×16 convolutions with stride [1  1] and padding 'same'        (HW Layer)
    15   'S1U3_relu1'    ReLU                    ReLU                                                                (HW Layer)
    16   'S1U3_conv2'    2-D Convolution         16 3×3×16 convolutions with stride [1  1] and padding 'same'        (HW Layer)
    17   'add13'         Addition                Element-wise addition of 2 inputs                                   (HW Layer)
    18   'relu13'        ReLU                    ReLU                                                                (HW Layer)
    19   'S2U1_conv1'    2-D Convolution         32 3×3×16 convolutions with stride [2  2] and padding 'same'        (HW Layer)
    20   'S2U1_relu1'    ReLU                    ReLU                                                                (HW Layer)
    21   'S2U1_conv2'    2-D Convolution         32 3×3×32 convolutions with stride [1  1] and padding 'same'        (HW Layer)
    22   'skipConv1'     2-D Convolution         32 1×1×16 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
    23   'add21'         Addition                Element-wise addition of 2 inputs                                   (HW Layer)
    24   'relu21'        ReLU                    ReLU                                                                (HW Layer)
    25   'S2U2_conv1'    2-D Convolution         32 3×3×32 convolutions with stride [1  1] and padding 'same'        (HW Layer)
    26   'S2U2_relu1'    ReLU                    ReLU                                                                (HW Layer)
    27   'S2U2_conv2'    2-D Convolution         32 3×3×32 convolutions with stride [1  1] and padding 'same'        (HW Layer)
    28   'add22'         Addition                Element-wise addition of 2 inputs                                   (HW Layer)
    29   'relu22'        ReLU                    ReLU                                                                (HW Layer)
    30   'S2U3_conv1'    2-D Convolution         32 3×3×32 convolutions with stride [1  1] and padding 'same'        (HW Layer)
    31   'S2U3_relu1'    ReLU                    ReLU                                                                (HW Layer)
    32   'S2U3_conv2'    2-D Convolution         32 3×3×32 convolutions with stride [1  1] and padding 'same'        (HW Layer)
    33   'add23'         Addition                Element-wise addition of 2 inputs                                   (HW Layer)
    34   'relu23'        ReLU                    ReLU                                                                (HW Layer)
    35   'S3U1_conv1'    2-D Convolution         64 3×3×32 convolutions with stride [2  2] and padding 'same'        (HW Layer)
    36   'S3U1_relu1'    ReLU                    ReLU                                                                (HW Layer)
    37   'S3U1_conv2'    2-D Convolution         64 3×3×64 convolutions with stride [1  1] and padding 'same'        (HW Layer)
    38   'skipConv2'     2-D Convolution         64 1×1×32 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
    39   'add31'         Addition                Element-wise addition of 2 inputs                                   (HW Layer)
    40   'relu31'        ReLU                    ReLU                                                                (HW Layer)
    41   'S3U2_conv1'    2-D Convolution         64 3×3×64 convolutions with stride [1  1] and padding 'same'        (HW Layer)
    42   'S3U2_relu1'    ReLU                    ReLU                                                                (HW Layer)
    43   'S3U2_conv2'    2-D Convolution         64 3×3×64 convolutions with stride [1  1] and padding 'same'        (HW Layer)
    44   'add32'         Addition                Element-wise addition of 2 inputs                                   (HW Layer)
    45   'relu32'        ReLU                    ReLU                                                                (HW Layer)
    46   'S3U3_conv1'    2-D Convolution         64 3×3×64 convolutions with stride [1  1] and padding 'same'        (HW Layer)
    47   'S3U3_relu1'    ReLU                    ReLU                                                                (HW Layer)
    48   'S3U3_conv2'    2-D Convolution         64 3×3×64 convolutions with stride [1  1] and padding 'same'        (HW Layer)
    49   'add33'         Addition                Element-wise addition of 2 inputs                                   (HW Layer)
    50   'relu33'        ReLU                    ReLU                                                                (HW Layer)
    51   'globalPool'    2-D Average Pooling     8×8 average pooling with stride [1  1] and padding [0  0  0  0]     (HW Layer)
    52   'fcFinal'       Fully Connected         10 fully connected layer                                            (HW Layer)
    53   'softmax'       Softmax                 softmax                                                             (SW Layer)
    54   'classoutput'   Classification Output   crossentropyex with 'airplane' and 9 other classes                  (SW Layer)
                                                                                                                   
### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'classoutput' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.
### Compiling layer group: convInp>>reluInp ...
### Compiling layer group: convInp>>reluInp ... complete.
### Compiling layer group: S1U1_conv1>>S1U1_conv2 ...
### Compiling layer group: S1U1_conv1>>S1U1_conv2 ... complete.
### Compiling layer group: S1U2_conv1>>S1U2_conv2 ...
### Compiling layer group: S1U2_conv1>>S1U2_conv2 ... complete.
### Compiling layer group: S1U3_conv1>>S1U3_conv2 ...
### Compiling layer group: S1U3_conv1>>S1U3_conv2 ... complete.
### Compiling layer group: skipConv1 ...
### Compiling layer group: skipConv1 ... complete.
### Compiling layer group: S2U1_conv1>>S2U1_conv2 ...
### Compiling layer group: S2U1_conv1>>S2U1_conv2 ... complete.
### Compiling layer group: S2U2_conv1>>S2U2_conv2 ...
### Compiling layer group: S2U2_conv1>>S2U2_conv2 ... complete.
### Compiling layer group: S2U3_conv1>>S2U3_conv2 ...
### Compiling layer group: S2U3_conv1>>S2U3_conv2 ... complete.
### Compiling layer group: skipConv2 ...
### Compiling layer group: skipConv2 ... complete.
### Compiling layer group: S3U1_conv1>>S3U1_conv2 ...
### Compiling layer group: S3U1_conv1>>S3U1_conv2 ... complete.
### Compiling layer group: S3U2_conv1>>S3U2_conv2 ...
### Compiling layer group: S3U2_conv1>>S3U2_conv2 ... complete.
### Compiling layer group: S3U3_conv1>>S3U3_conv2 ...
### Compiling layer group: S3U3_conv1>>S3U3_conv2 ... complete.
### Compiling layer group: globalPool ...
### Compiling layer group: globalPool ... complete.
### Compiling layer group: fcFinal ...
### Compiling layer group: fcFinal ... complete.

### Allocating external memory buffers:

          offset_name          offset_address    allocated_space 
    _______________________    ______________    ________________

    "InputDataOffset"           "0x00000000"     "4.0 MB"        
    "OutputResultOffset"        "0x00400000"     "4.0 MB"        
    "SchedulerDataOffset"       "0x00800000"     "4.0 MB"        
    "SystemBufferOffset"        "0x00c00000"     "28.0 MB"       
    "InstructionDataOffset"     "0x02800000"     "4.0 MB"        
    "ConvWeightDataOffset"      "0x02c00000"     "4.0 MB"        
    "FCWeightDataOffset"        "0x03000000"     "4.0 MB"        
    "EndOffset"                 "0x03400000"     "Total: 52.0 MB"

### Network compilation complete.

### Programming FPGA Bitstream using Ethernet...
### Attempting to connect to the hardware board at 192.168.1.101...
### Connection successful
### Programming FPGA device on Xilinx SoC hardware board at 192.168.1.101...
### Copying FPGA programming files to SD card...
### Setting FPGA bitstream and devicetree for boot...
# Copying Bitstream zcu102_single.bit to /mnt/hdlcoder_rd
# Set Bitstream to hdlcoder_rd/zcu102_single.bit
# Copying Devicetree devicetree_dlhdl.dtb to /mnt/hdlcoder_rd
# Set Devicetree to hdlcoder_rd/devicetree_dlhdl.dtb
# Set up boot for Reference Design: 'AXI-Stream DDR Memory Access : 3-AXIM'
### Rebooting Xilinx SoC at 192.168.1.101...
### Reboot may take several seconds...
### Attempting to connect to the hardware board at 192.168.1.101...
### Connection successful
### Programming the FPGA bitstream has been completed successfully.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 07-Mar-2023 11:23:24
### Loading weights to FC Processor.
### FC Weights loaded. Current time is 07-Mar-2023 11:23:24
### Finished writing input activations.
### Running single input activation.


              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                     820446                  0.00373                       1             823069            267.3
    input_norm                7334                  0.00003 
    convInp                  14042                  0.00006 
    S1U1_conv1               32046                  0.00015 
    S1U1_conv2               32198                  0.00015 
    add11                    30643                  0.00014 
    S1U2_conv1               32428                  0.00015 
    S1U2_conv2               32212                  0.00015 
    add12                    30553                  0.00014 
    S1U3_conv1               32074                  0.00015 
    S1U3_conv2               32289                  0.00015 
    add13                    30553                  0.00014 
    skipConv1                20674                  0.00009 
    S2U1_conv1               21193                  0.00010 
    S2U1_conv2               26334                  0.00012 
    add21                    15373                  0.00007 
    S2U2_conv1               26655                  0.00012 
    S2U2_conv2               26481                  0.00012 
    add22                    15353                  0.00007 
    S2U3_conv1               26614                  0.00012 
    S2U3_conv2               26584                  0.00012 
    add23                    15313                  0.00007 
    skipConv2                25361                  0.00012 
    S3U1_conv1               24950                  0.00011 
    S3U1_conv2               41437                  0.00019 
    add31                     7714                  0.00004 
    S3U2_conv1               41695                  0.00019 
    S3U2_conv2               41679                  0.00019 
    add32                     7827                  0.00004 
    S3U3_conv1               41513                  0.00019 
    S3U3_conv2               42203                  0.00019 
    add33                     7764                  0.00004 
    globalPool               10197                  0.00005 
    fcFinal                    973                  0.00000 
 * The clock frequency of the DL processor is: 220MHz

Load Pruned Network

Load the trained, pruned network. For more information on network training, see Prune Image Classification Network Using Taylor Scores.

load("prunedDAGNet.mat");

Test Network

Load a test image. The test image is part of the CIFAR-10 data set [1]. To download the data set, see the Prepare Data section in Train Residual Network for Image Classification.

load("testImage.mat");

Use the runOnHW function to:

  • Prepare the network for deployment.

  • Compile the network to generate weights, biases, and instructions.

  • Deploy the network to the FPGA board.

  • Retrieve the prediction results using MATLAB®.

To view the code for this function, see Helper Functions.

[~, speedPruned] = runOnHW(prunedDAGNet,testImage,'zcu102_single');
### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_single.
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### Notice: The layer 'input' of type 'ImageInputLayer' is split into an image input layer 'input' and an addition layer 'input_norm' for normalization on hardware.
### The network includes the following layers:
     1   'input'         Image Input             32×32×3 images with 'zerocenter' normalization                      (SW Layer)
     2   'convInp'       2-D Convolution         16 3×3×3 convolutions with stride [1  1] and padding [1  1  1  1]   (HW Layer)
     3   'reluInp'       ReLU                    ReLU                                                                (HW Layer)
     4   'S1U1_conv1'    2-D Convolution         5 3×3×16 convolutions with stride [1  1] and padding [1  1  1  1]   (HW Layer)
     5   'S1U1_relu1'    ReLU                    ReLU                                                                (HW Layer)
     6   'S1U1_conv2'    2-D Convolution         16 3×3×5 convolutions with stride [1  1] and padding [1  1  1  1]   (HW Layer)
     7   'add11'         Addition                Element-wise addition of 2 inputs                                   (HW Layer)
     8   'relu11'        ReLU                    ReLU                                                                (HW Layer)
     9   'S1U2_conv1'    2-D Convolution         8 3×3×16 convolutions with stride [1  1] and padding [1  1  1  1]   (HW Layer)
    10   'S1U2_relu1'    ReLU                    ReLU                                                                (HW Layer)
    11   'S1U2_conv2'    2-D Convolution         16 3×3×8 convolutions with stride [1  1] and padding [1  1  1  1]   (HW Layer)
    12   'add12'         Addition                Element-wise addition of 2 inputs                                   (HW Layer)
    13   'relu12'        ReLU                    ReLU                                                                (HW Layer)
    14   'S1U3_conv1'    2-D Convolution         14 3×3×16 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    15   'S1U3_relu1'    ReLU                    ReLU                                                                (HW Layer)
    16   'S1U3_conv2'    2-D Convolution         16 3×3×14 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    17   'add13'         Addition                Element-wise addition of 2 inputs                                   (HW Layer)
    18   'relu13'        ReLU                    ReLU                                                                (HW Layer)
    19   'S2U1_conv1'    2-D Convolution         22 3×3×16 convolutions with stride [2  2] and padding [0  1  0  1]  (HW Layer)
    20   'S2U1_relu1'    ReLU                    ReLU                                                                (HW Layer)
    21   'S2U1_conv2'    2-D Convolution         27 3×3×22 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    22   'skipConv1'     2-D Convolution         27 1×1×16 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
    23   'add21'         Addition                Element-wise addition of 2 inputs                                   (HW Layer)
    24   'relu21'        ReLU                    ReLU                                                                (HW Layer)
    25   'S2U2_conv1'    2-D Convolution         30 3×3×27 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    26   'S2U2_relu1'    ReLU                    ReLU                                                                (HW Layer)
    27   'S2U2_conv2'    2-D Convolution         27 3×3×30 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    28   'add22'         Addition                Element-wise addition of 2 inputs                                   (HW Layer)
    29   'relu22'        ReLU                    ReLU                                                                (HW Layer)
    30   'S2U3_conv1'    2-D Convolution         26 3×3×27 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    31   'S2U3_relu1'    ReLU                    ReLU                                                                (HW Layer)
    32   'S2U3_conv2'    2-D Convolution         27 3×3×26 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    33   'add23'         Addition                Element-wise addition of 2 inputs                                   (HW Layer)
    34   'relu23'        ReLU                    ReLU                                                                (HW Layer)
    35   'S3U1_conv1'    2-D Convolution         37 3×3×27 convolutions with stride [2  2] and padding [0  1  0  1]  (HW Layer)
    36   'S3U1_relu1'    ReLU                    ReLU                                                                (HW Layer)
    37   'S3U1_conv2'    2-D Convolution         39 3×3×37 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    38   'skipConv2'     2-D Convolution         39 1×1×27 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
    39   'add31'         Addition                Element-wise addition of 2 inputs                                   (HW Layer)
    40   'relu31'        ReLU                    ReLU                                                                (HW Layer)
    41   'S3U2_conv1'    2-D Convolution         38 3×3×39 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    42   'S3U2_relu1'    ReLU                    ReLU                                                                (HW Layer)
    43   'S3U2_conv2'    2-D Convolution         39 3×3×38 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    44   'add32'         Addition                Element-wise addition of 2 inputs                                   (HW Layer)
    45   'relu32'        ReLU                    ReLU                                                                (HW Layer)
    46   'S3U3_conv1'    2-D Convolution         36 3×3×39 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    47   'S3U3_relu1'    ReLU                    ReLU                                                                (HW Layer)
    48   'S3U3_conv2'    2-D Convolution         39 3×3×36 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    49   'add33'         Addition                Element-wise addition of 2 inputs                                   (HW Layer)
    50   'relu33'        ReLU                    ReLU                                                                (HW Layer)
    51   'globalPool'    2-D Average Pooling     8×8 average pooling with stride [1  1] and padding [0  0  0  0]     (HW Layer)
    52   'fcFinal'       Fully Connected         10 fully connected layer                                            (HW Layer)
    53   'softmax'       Softmax                 softmax                                                             (SW Layer)
    54   'classoutput'   Classification Output   crossentropyex with 'airplane' and 9 other classes                  (SW Layer)
                                                                                                                   
### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'classoutput' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.
### Compiling layer group: convInp>>reluInp ...
### Compiling layer group: convInp>>reluInp ... complete.
### Compiling layer group: S1U1_conv1>>S1U1_conv2 ...
### Compiling layer group: S1U1_conv1>>S1U1_conv2 ... complete.
### Compiling layer group: S1U2_conv1>>S1U2_conv2 ...
### Compiling layer group: S1U2_conv1>>S1U2_conv2 ... complete.
### Compiling layer group: S1U3_conv1>>S1U3_conv2 ...
### Compiling layer group: S1U3_conv1>>S1U3_conv2 ... complete.
### Compiling layer group: skipConv1 ...
### Compiling layer group: skipConv1 ... complete.
### Compiling layer group: S2U1_conv1>>S2U1_conv2 ...
### Compiling layer group: S2U1_conv1>>S2U1_conv2 ... complete.
### Compiling layer group: S2U2_conv1>>S2U2_conv2 ...
### Compiling layer group: S2U2_conv1>>S2U2_conv2 ... complete.
### Compiling layer group: S2U3_conv1>>S2U3_conv2 ...
### Compiling layer group: S2U3_conv1>>S2U3_conv2 ... complete.
### Compiling layer group: skipConv2 ...
### Compiling layer group: skipConv2 ... complete.
### Compiling layer group: S3U1_conv1>>S3U1_conv2 ...
### Compiling layer group: S3U1_conv1>>S3U1_conv2 ... complete.
### Compiling layer group: S3U2_conv1>>S3U2_conv2 ...
### Compiling layer group: S3U2_conv1>>S3U2_conv2 ... complete.
### Compiling layer group: S3U3_conv1>>S3U3_conv2 ...
### Compiling layer group: S3U3_conv1>>S3U3_conv2 ... complete.
### Compiling layer group: globalPool ...
### Compiling layer group: globalPool ... complete.
### Compiling layer group: fcFinal ...
### Compiling layer group: fcFinal ... complete.

### Allocating external memory buffers:

          offset_name          offset_address    allocated_space 
    _______________________    ______________    ________________

    "InputDataOffset"           "0x00000000"     "4.0 MB"        
    "OutputResultOffset"        "0x00400000"     "4.0 MB"        
    "SchedulerDataOffset"       "0x00800000"     "4.0 MB"        
    "SystemBufferOffset"        "0x00c00000"     "28.0 MB"       
    "InstructionDataOffset"     "0x02800000"     "4.0 MB"        
    "ConvWeightDataOffset"      "0x02c00000"     "4.0 MB"        
    "FCWeightDataOffset"        "0x03000000"     "4.0 MB"        
    "EndOffset"                 "0x03400000"     "Total: 52.0 MB"

### Network compilation complete.

### FPGA bitstream programming has been skipped as the same bitstream is already loaded on the target FPGA.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 07-Mar-2023 11:24:09
### Loading weights to FC Processor.
### FC Weights loaded. Current time is 07-Mar-2023 11:24:09
### Finished writing input activations.
### Running single input activation.


              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                     587863                  0.00267                       1             590483            372.6
    input_norm                7266                  0.00003 
    convInp                  14102                  0.00006 
    S1U1_conv1               20170                  0.00009 
    S1U1_conv2               20248                  0.00009 
    add11                    30471                  0.00014 
    S1U2_conv1               20486                  0.00009 
    S1U2_conv2               20079                  0.00009 
    add12                    30656                  0.00014 
    S1U3_conv1               32404                  0.00015 
    S1U3_conv2               31891                  0.00014 
    add13                    30563                  0.00014 
    skipConv1                19154                  0.00009 
    S2U1_conv1               17965                  0.00008 
    S2U1_conv2               18679                  0.00008 
    add21                    13442                  0.00006 
    S2U2_conv1               23890                  0.00011 
    S2U2_conv2               24006                  0.00011 
    add22                    13462                  0.00006 
    S2U3_conv1               21638                  0.00010 
    S2U3_conv2               21691                  0.00010 
    add23                    13472                  0.00006 
    skipConv2                15603                  0.00007 
    S3U1_conv1               16138                  0.00007 
    S3U1_conv2               18238                  0.00008 
    add31                     4850                  0.00002 
    S3U2_conv1               17971                  0.00008 
    S3U2_conv2               18210                  0.00008 
    add32                     4830                  0.00002 
    S3U3_conv1               16631                  0.00008 
    S3U3_conv2               17296                  0.00008 
    add33                     4760                  0.00002 
    globalPool                6576                  0.00003 
    fcFinal                    838                  0.00000 
 * The clock frequency of the DL processor is: 220MHz
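From the two profiler summaries, the throughput improvement from pruning alone can be computed directly (the frames-per-second values are taken from the reports above):

```matlab
% Frames/s reported by the profiler for the zcu102_single bitstream.
fpsUnpruned = 267.3;   % unpruned network
fpsPruned   = 372.6;   % pruned network

speedup = fpsPruned / fpsUnpruned;
fprintf('Pruning speedup: %.2fx\n', speedup)   % about 1.39x
```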

Quantize Pruned Network

You can quantize the pruned network to further improve performance.

Create an augmentedImageDatastore object to store the calibration images.

imds = augmentedImageDatastore([32,32],testImage); 

Create a dlquantizer object.

dlqObj = dlquantizer(prunedDAGNet, ExecutionEnvironment="FPGA");

Calibrate the dlquantizer object using the image datastore.

calibrate(dlqObj,imds)
ans=100×5 table
     Optimized Layer Name     Network Layer Name    Learnables / Activations     MinValue     MaxValue 
    ______________________    __________________    ________________________    __________    _________

    {'convInp_Weights'   }      {'convInp'   }             "Weights"            -0.0060522    0.0076182
    {'convInp_Bias'      }      {'convInp'   }             "Bias"                 -0.23065      0.79941
    {'S1U1_conv1_Weights'}      {'S1U1_conv1'}             "Weights"              -0.36637      0.37601
    {'S1U1_conv1_Bias'   }      {'S1U1_conv1'}             "Bias"                 0.076761      0.79494
    {'S1U1_conv2_Weights'}      {'S1U1_conv2'}             "Weights"               -0.8197      0.54487
    {'S1U1_conv2_Bias'   }      {'S1U1_conv2'}             "Bias"                 -0.27783      0.85751
    {'S1U2_conv1_Weights'}      {'S1U2_conv1'}             "Weights"              -0.29579      0.27284
    {'S1U2_conv1_Bias'   }      {'S1U2_conv1'}             "Bias"                 -0.55448      0.85351
    {'S1U2_conv2_Weights'}      {'S1U2_conv2'}             "Weights"              -0.78735      0.52628
    {'S1U2_conv2_Bias'   }      {'S1U2_conv2'}             "Bias"                 -0.50762      0.56423
    {'S1U3_conv1_Weights'}      {'S1U3_conv1'}             "Weights"              -0.18651      0.12745
    {'S1U3_conv1_Bias'   }      {'S1U3_conv1'}             "Bias"                 -0.33809      0.73826
    {'S1U3_conv2_Weights'}      {'S1U3_conv2'}             "Weights"              -0.49925      0.55922
    {'S1U3_conv2_Bias'   }      {'S1U3_conv2'}             "Bias"                 -0.42145      0.64184
    {'S2U1_conv1_Weights'}      {'S2U1_conv1'}             "Weights"               -0.1328        0.121
    {'S2U1_conv1_Bias'   }      {'S2U1_conv1'}             "Bias"                -0.097249       1.1291
      ⋮

Use the runOnHW function to:

  • Prepare the network for deployment.

  • Compile the network to generate weights, biases, and instructions.

  • Deploy the network to the FPGA board.

  • Retrieve the prediction results using MATLAB®.

To view the code for this function, see Helper Functions.

[~, speedQuantized] = runOnHW(dlqObj,testImage,'zcu102_int8');
### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_int8.
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### The network includes the following layers:
     1   'input'         Image Input             32×32×3 images with 'zerocenter' normalization                      (SW Layer)
     2   'convInp'       2-D Convolution         16 3×3×3 convolutions with stride [1  1] and padding [1  1  1  1]   (HW Layer)
     3   'reluInp'       ReLU                    ReLU                                                                (HW Layer)
     4   'S1U1_conv1'    2-D Convolution         5 3×3×16 convolutions with stride [1  1] and padding [1  1  1  1]   (HW Layer)
     5   'S1U1_relu1'    ReLU                    ReLU                                                                (HW Layer)
     6   'S1U1_conv2'    2-D Convolution         16 3×3×5 convolutions with stride [1  1] and padding [1  1  1  1]   (HW Layer)
     7   'add11'         Addition                Element-wise addition of 2 inputs                                   (HW Layer)
     8   'relu11'        ReLU                    ReLU                                                                (HW Layer)
     9   'S1U2_conv1'    2-D Convolution         8 3×3×16 convolutions with stride [1  1] and padding [1  1  1  1]   (HW Layer)
    10   'S1U2_relu1'    ReLU                    ReLU                                                                (HW Layer)
    11   'S1U2_conv2'    2-D Convolution         16 3×3×8 convolutions with stride [1  1] and padding [1  1  1  1]   (HW Layer)
    12   'add12'         Addition                Element-wise addition of 2 inputs                                   (HW Layer)
    13   'relu12'        ReLU                    ReLU                                                                (HW Layer)
    14   'S1U3_conv1'    2-D Convolution         14 3×3×16 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    15   'S1U3_relu1'    ReLU                    ReLU                                                                (HW Layer)
    16   'S1U3_conv2'    2-D Convolution         16 3×3×14 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    17   'add13'         Addition                Element-wise addition of 2 inputs                                   (HW Layer)
    18   'relu13'        ReLU                    ReLU                                                                (HW Layer)
    19   'S2U1_conv1'    2-D Convolution         22 3×3×16 convolutions with stride [2  2] and padding [0  1  0  1]  (HW Layer)
    20   'S2U1_relu1'    ReLU                    ReLU                                                                (HW Layer)
    21   'S2U1_conv2'    2-D Convolution         27 3×3×22 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    22   'skipConv1'     2-D Convolution         27 1×1×16 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
    23   'add21'         Addition                Element-wise addition of 2 inputs                                   (HW Layer)
    24   'relu21'        ReLU                    ReLU                                                                (HW Layer)
    25   'S2U2_conv1'    2-D Convolution         30 3×3×27 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    26   'S2U2_relu1'    ReLU                    ReLU                                                                (HW Layer)
    27   'S2U2_conv2'    2-D Convolution         27 3×3×30 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    28   'add22'         Addition                Element-wise addition of 2 inputs                                   (HW Layer)
    29   'relu22'        ReLU                    ReLU                                                                (HW Layer)
    30   'S2U3_conv1'    2-D Convolution         26 3×3×27 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    31   'S2U3_relu1'    ReLU                    ReLU                                                                (HW Layer)
    32   'S2U3_conv2'    2-D Convolution         27 3×3×26 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    33   'add23'         Addition                Element-wise addition of 2 inputs                                   (HW Layer)
    34   'relu23'        ReLU                    ReLU                                                                (HW Layer)
    35   'S3U1_conv1'    2-D Convolution         37 3×3×27 convolutions with stride [2  2] and padding [0  1  0  1]  (HW Layer)
    36   'S3U1_relu1'    ReLU                    ReLU                                                                (HW Layer)
    37   'S3U1_conv2'    2-D Convolution         39 3×3×37 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    38   'skipConv2'     2-D Convolution         39 1×1×27 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
    39   'add31'         Addition                Element-wise addition of 2 inputs                                   (HW Layer)
    40   'relu31'        ReLU                    ReLU                                                                (HW Layer)
    41   'S3U2_conv1'    2-D Convolution         38 3×3×39 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    42   'S3U2_relu1'    ReLU                    ReLU                                                                (HW Layer)
    43   'S3U2_conv2'    2-D Convolution         39 3×3×38 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    44   'add32'         Addition                Element-wise addition of 2 inputs                                   (HW Layer)
    45   'relu32'        ReLU                    ReLU                                                                (HW Layer)
    46   'S3U3_conv1'    2-D Convolution         36 3×3×39 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    47   'S3U3_relu1'    ReLU                    ReLU                                                                (HW Layer)
    48   'S3U3_conv2'    2-D Convolution         39 3×3×36 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    49   'add33'         Addition                Element-wise addition of 2 inputs                                   (HW Layer)
    50   'relu33'        ReLU                    ReLU                                                                (HW Layer)
    51   'globalPool'    2-D Average Pooling     8×8 average pooling with stride [1  1] and padding [0  0  0  0]     (HW Layer)
    52   'fcFinal'       Fully Connected         10 fully connected layer                                            (HW Layer)
    53   'softmax'       Softmax                 softmax                                                             (SW Layer)
    54   'classoutput'   Classification Output   crossentropyex with 'airplane' and 9 other classes                  (SW Layer)
                                                                                                                   
### Notice: The layer 'input' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'classoutput' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.
### Compiling layer group: convInp>>reluInp ...
### Compiling layer group: convInp>>reluInp ... complete.
### Compiling layer group: S1U1_conv1>>S1U1_conv2 ...
### Compiling layer group: S1U1_conv1>>S1U1_conv2 ... complete.
### Compiling layer group: S1U2_conv1>>S1U2_conv2 ...
### Compiling layer group: S1U2_conv1>>S1U2_conv2 ... complete.
### Compiling layer group: S1U3_conv1>>S1U3_conv2 ...
### Compiling layer group: S1U3_conv1>>S1U3_conv2 ... complete.
### Compiling layer group: skipConv1 ...
### Compiling layer group: skipConv1 ... complete.
### Compiling layer group: S2U1_conv1>>S2U1_conv2 ...
### Compiling layer group: S2U1_conv1>>S2U1_conv2 ... complete.
### Compiling layer group: S2U2_conv1>>S2U2_conv2 ...
### Compiling layer group: S2U2_conv1>>S2U2_conv2 ... complete.
### Compiling layer group: S2U3_conv1>>S2U3_conv2 ...
### Compiling layer group: S2U3_conv1>>S2U3_conv2 ... complete.
### Compiling layer group: skipConv2 ...
### Compiling layer group: skipConv2 ... complete.
### Compiling layer group: S3U1_conv1>>S3U1_conv2 ...
### Compiling layer group: S3U1_conv1>>S3U1_conv2 ... complete.
### Compiling layer group: S3U2_conv1>>S3U2_conv2 ...
### Compiling layer group: S3U2_conv1>>S3U2_conv2 ... complete.
### Compiling layer group: S3U3_conv1>>S3U3_conv2 ...
### Compiling layer group: S3U3_conv1>>S3U3_conv2 ... complete.
### Compiling layer group: globalPool ...
### Compiling layer group: globalPool ... complete.
### Compiling layer group: fcFinal ...
### Compiling layer group: fcFinal ... complete.

### Allocating external memory buffers:

          offset_name          offset_address    allocated_space 
    _______________________    ______________    ________________

    "InputDataOffset"           "0x00000000"     "4.0 MB"        
    "OutputResultOffset"        "0x00400000"     "4.0 MB"        
    "SchedulerDataOffset"       "0x00800000"     "4.0 MB"        
    "SystemBufferOffset"        "0x00c00000"     "28.0 MB"       
    "InstructionDataOffset"     "0x02800000"     "4.0 MB"        
    "ConvWeightDataOffset"      "0x02c00000"     "4.0 MB"        
    "FCWeightDataOffset"        "0x03000000"     "4.0 MB"        
    "EndOffset"                 "0x03400000"     "Total: 52.0 MB"

### Network compilation complete.

### Programming FPGA Bitstream using Ethernet...
### Attempting to connect to the hardware board at 192.168.1.101...
### Connection successful
### Programming FPGA device on Xilinx SoC hardware board at 192.168.1.101...
### Copying FPGA programming files to SD card...
### Setting FPGA bitstream and devicetree for boot...
# Copying Bitstream zcu102_int8.bit to /mnt/hdlcoder_rd
# Set Bitstream to hdlcoder_rd/zcu102_int8.bit
# Copying Devicetree devicetree_dlhdl.dtb to /mnt/hdlcoder_rd
# Set Devicetree to hdlcoder_rd/devicetree_dlhdl.dtb
# Set up boot for Reference Design: 'AXI-Stream DDR Memory Access : 3-AXIM'
### Rebooting Xilinx SoC at 192.168.1.101...
### Reboot may take several seconds...
### Attempting to connect to the hardware board at 192.168.1.101...
### Connection successful
### Programming the FPGA bitstream has been completed successfully.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 07-Mar-2023 11:26:00
### Loading weights to FC Processor.
### FC Weights loaded. Current time is 07-Mar-2023 11:26:00
### Finished writing input activations.
### Running single input activation.


              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                     210121                  0.00084                       1             212770           1175.0
    convInp                   7514                  0.00003 
    S1U1_conv1                7043                  0.00003 
    S1U1_conv2                7378                  0.00003 
    add11                     9185                  0.00004 
    S1U2_conv1                7543                  0.00003 
    S1U2_conv2                7292                  0.00003 
    add12                     8605                  0.00003 
    S1U3_conv1               10908                  0.00004 
    S1U3_conv2               11192                  0.00004 
    add13                     8515                  0.00003 
    skipConv1                 7147                  0.00003 
    S2U1_conv1                6392                  0.00003 
    S2U1_conv2                7332                  0.00003 
    add21                     4344                  0.00002 
    S2U2_conv1                8832                  0.00004 
    S2U2_conv2                9117                  0.00004 
    add22                     4484                  0.00002 
    S2U3_conv1                9175                  0.00004 
    S2U3_conv2                9136                  0.00004 
    add23                     4614                  0.00002 
    skipConv2                 6643                  0.00003 
    S3U1_conv1                6525                  0.00003 
    S3U1_conv2                6498                  0.00003 
    add31                     1520                  0.00001 
    S3U2_conv1                6273                  0.00003 
    S3U2_conv2                6448                  0.00003 
    add32                     1450                  0.00001 
    S3U3_conv1                6255                  0.00003 
    S3U3_conv2                6751                  0.00003 
    add33                     1500                  0.00001 
    globalPool                3605                  0.00001 
    fcFinal                    718                  0.00000 
 * The clock frequency of the DL processor is: 250MHz
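The profiler latencies above follow directly from the cycle counts and the 250 MHz clock. As a quick sanity check, using the numbers reported in this run:

```matlab
% Latency in seconds = cycles / clock frequency (250 MHz).
networkLatency = 210121/250e6      % last-frame latency, about 0.00084 s
framesPerSec   = 1/(212770/250e6)  % total latency of 212770 cycles gives about 1175 frames/s
```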

Compare the Original, Pruned, and Pruned and Quantized Network Performance

Determine the impact of pruning and quantization on network performance. Pruning alone improves the throughput from 267 frames per second to 372 frames per second. Quantizing the pruned network to int8 improves it further, from 372 frames per second to 1175 frames per second.

fprintf('The performance achieved for the original network is %s frames per second. \n', speedInitial.("Frame/s")(1));
The performance achieved for the original network is 267.2923 frames per second. 
fprintf('The performance achieved after pruning is %s frames per second. \n', speedPruned.("Frame/s")(1));
The performance achieved after pruning is 372.5763 frames per second. 
fprintf('The performance achieved after pruning and quantizing the network to int8 fixed point is %s frames per second. \n', speedQuantized.("Frame/s")(1));
The performance achieved after pruning and quantizing the network to int8 fixed point is 1174.9777 frames per second. 
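The relative speedups can be computed directly from the measured frame rates. A minimal sketch, using the values printed above:

```matlab
% Measured frame rates: original, pruned, pruned + int8 quantized.
frameRates = [267.2923 372.5763 1174.9777];
speedup = frameRates/frameRates(1)  % roughly [1.00 1.39 4.40]
```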

References

[1] Krizhevsky, Alex. 2009. "Learning Multiple Layers of Features from Tiny Images." https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf

Helper Functions

The runOnHW function prepares the network for deployment, compiles the network, deploys the network to the FPGA board, and retrieves the prediction results.

function [result, speed] = runOnHW(network, image, bitstream)
    % Create a workflow object for the network and target bitstream.
    wfObj = dlhdl.Workflow(Network=network, Bitstream=bitstream);
    % Target a Xilinx board over an Ethernet connection.
    wfObj.Target = dlhdl.Target("Xilinx", Interface="Ethernet");
    % Compile the network to generate weights, biases, and instructions.
    compile(wfObj);
    % Program the FPGA and load the network parameters.
    deploy(wfObj);
    % Run prediction on the input image and retrieve the profiler results.
    [result, speed] = predict(wfObj, image, Profiler='on');
end
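If the board is not reachable over Ethernet, dlhdl.Target also supports a JTAG connection. A sketch of the one-line change to the helper above, assuming a JTAG cable is connected to the board:

```matlab
% Alternative target configuration: program and communicate over JTAG.
wfObj.Target = dlhdl.Target("Xilinx", Interface="JTAG");
```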
