Running Convolution-Only Networks by Using FPGA Deployment

Typical classification networks include a sequence of convolution layers followed by one or more fully connected layers. Research results indicate that, for feature extraction and recognition, using the activations of the convolution layers directly can perform better than using those of the subsequent fully connected layers.

Running a network and visualizing its intermediate data is a useful way to understand and debug convolutional networks. This example shows how to deploy, run, and debug a convolution-only network by using FPGA deployment.

Prerequisites

  • Xilinx™ Zynq™ UltraScale+™ ZCU102 Evaluation Kit

  • Deep Learning HDL Toolbox™ Support Package for Xilinx FPGA and SoC

  • Deep Learning Toolbox™

  • Deep Learning HDL Toolbox™

  • Deep Learning Toolbox™ Model for ResNet-50 Network

ResNet-50 Network

ResNet-50 is a convolutional neural network that is 50 layers deep. This pretrained network can classify images into 1000 object categories (such as keyboard, mouse, and pencil). The network has learned rich feature representations for a wide range of images and has an image input size of 224-by-224. This example uses ResNet-50 as a starting point.

Load ResNet-50 Network

Load the ResNet-50 network.

rnet = imagePretrainedNetwork('resnet50');

To visualize the structure of the ResNet-50 network, at the MATLAB® command prompt, enter:

deepNetworkDesigner(rnet)

Create a Convolution-Only Network

Create a convolution-only network by selecting a subset of the ResNet-50 network. The subset consists of the first five layers of the ResNet-50 network, which are convolutional in nature.

To create the convolution-only network, enter:

layers = rnet.Layers(1:5);
snet = dlnetwork(layers);

Create Target Object

To deploy the network on an FPGA, create a target object with a custom name and an interface to connect your target device to the host computer. Interface options are JTAG and Ethernet. To use JTAG, install Xilinx™ Vivado™ Design Suite 2023.1. To set the Xilinx Vivado toolpath, enter:

%hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2023.1\bin\vivado.bat');
hTarget = dlhdl.Target('Xilinx','Interface','Ethernet');

Create Workflow Object

Create an object of the dlhdl.Workflow class. When you create the object, specify the network and the bitstream name. Specify the convolution-only network, snet, as the network. Make sure that the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Xilinx ZCU102 SoC board and the bitstream uses a single data type.

hW = dlhdl.Workflow('Network', snet, 'Bitstream', 'zcu102_single','Target',hTarget);

Compile Convolution-Only Network

To compile the convolution-only network, run the compile function of the dlhdl.Workflow object.

dn = hW.compile
### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_single.
### An output layer called 'Output1_max_pooling2d_1' of type 'nnet.cnn.layer.RegressionOutputLayer' has been added to the provided network. This layer performs no operation during prediction and thus does not affect the output of the network.
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### Notice: The layer 'input_1' of type 'ImageInputLayer' is split into an image input layer 'input_1' and an addition layer 'input_1_norm' for normalization on hardware.
### The network includes the following layers:
     1   'input_1'                   Image Input         224×224×3 images with 'zerocenter' normalization                   (SW Layer)
     2   'conv1'                     2-D Convolution     64 7×7×3 convolutions with stride [2  2] and padding [3  3  3  3]  (HW Layer)
     3   'activation_1_relu'         ReLU                ReLU                                                               (HW Layer)
     4   'max_pooling2d_1'           2-D Max Pooling     3×3 max pooling with stride [2  2] and padding [1  1  1  1]        (HW Layer)
     5   'Output1_max_pooling2d_1'   Regression Output   mean-squared-error                                                 (SW Layer)
                                                                                                                          
### Notice: The layer 'Output1_max_pooling2d_1' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.
### Compiling layer group: conv1>>max_pooling2d_1 ...
### Compiling layer group: conv1>>max_pooling2d_1 ... complete.

### Allocating external memory buffers:

          offset_name          offset_address    allocated_space 
    _______________________    ______________    ________________

    "InputDataOffset"           "0x00000000"     "23.0 MB"       
    "OutputResultOffset"        "0x016f8000"     "23.0 MB"       
    "SchedulerDataOffset"       "0x02df0000"     "1.6 MB"        
    "SystemBufferOffset"        "0x02f84000"     "6.9 MB"        
    "InstructionDataOffset"     "0x0366c000"     "524.0 kB"      
    "ConvWeightDataOffset"      "0x036ef000"     "336.0 kB"      
    "EndOffset"                 "0x03743000"     "Total: 55.3 MB"

### Network compilation complete.
dn = struct with fields:
             weights: [1×1 struct]
        instructions: [1×1 struct]
           registers: [1×1 struct]
    syncInstructions: [1×1 struct]
        constantData: {{}  [1×200704 single]}
             ddrInfo: [1×1 struct]
       resourceTable: [6×2 table]

Program Bitstream onto FPGA and Download Network Weights

To deploy the network on the Xilinx ZCU102 hardware, run the deploy function of the dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA board with the programming file and downloads the network weights and biases. The deploy function also displays progress messages and the time it takes to deploy the network.

hW.deploy
### Programming FPGA Bitstream using Ethernet...
### Attempting to connect to the hardware board at 192.168.1.101...
### Connection successful
### Programming FPGA device on Xilinx SoC hardware board at 192.168.1.101...
### Attempting to connect to the hardware board at 192.168.1.101...
### Connection successful
### Copying FPGA programming files to SD card...
### Setting FPGA bitstream and devicetree for boot...
# Copying Bitstream zcu102_single.bit to /mnt/hdlcoder_rd
# Set Bitstream to hdlcoder_rd/zcu102_single.bit
# Copying Devicetree devicetree_dlhdl.dtb to /mnt/hdlcoder_rd
# Set Devicetree to hdlcoder_rd/devicetree_dlhdl.dtb
# Set up boot for Reference Design: 'AXI-Stream DDR Memory Access : 3-AXIM'
### Programming done. The system will now reboot for persistent changes to take effect.
### Rebooting Xilinx SoC at 192.168.1.101...
### Reboot may take several seconds...
### Attempting to connect to the hardware board at 192.168.1.101...
### Connection successful
### Programming the FPGA bitstream has been completed successfully.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 20-Jun-2024 13:34:11

Load Example Image

Load and display an image to use as an input image to the network. If the image is not already 224-by-224, resize it to match the network input size.

I = imread('daisy.jpg');
I = imresize(I,[224 224]); % match the 224-by-224 network input size
imshow(I)

Run the Prediction

Execute the predict function of the dlhdl.Workflow object.

I = dlarray(single(I), 'SSCB');
[P, speed] = hW.predict(I,'Profile','on');
### Finished writing input activations.
### Running single input activation.


              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                    3084552                  0.01402                       1            3085179             71.3
    input_1_norm            351857                  0.00160 
    conv1                  2227138                  0.01012 
    max_pooling2d_1         505542                  0.00230 
 * The clock frequency of the DL processor is: 220MHz

The result data is returned as a 3-D array, with the third dimension indexing across the 64 feature images.

P = extractdata(P);
sz = size(P)
sz = 1×3

    56    56    64
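
The 56-by-56 spatial size follows from the strides and padding reported by the compiler. As a sanity check, the standard output-size formula, floor((in + 2*pad - kernel)/stride) + 1, reproduces it:

```matlab
% Spatial output size per layer: floor((in + 2*pad - kernel)/stride) + 1
convOut = floor((224 + 2*3 - 7)/2) + 1;    % conv1: 7x7, stride 2, padding 3 -> 112
poolOut = floor((convOut + 2*1 - 3)/2) + 1 % max pooling: 3x3, stride 2, padding 1 -> 56
```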

To visualize all 64 features in a single image, reshape the data into four dimensions, the input format that the imtile function expects:

R = reshape(P, [sz(1) sz(2) 1 sz(3)]);
sz = size(R)
sz = 1×4

    56    56     1    64

The third dimension in the input to imtile function represents the image color. Set the third dimension to size 1 because the activation signals in this example are scalars and do not include color. The fourth dimension indexes the channel.

The input to imtile is normalized using mat2gray. All values are scaled so that the minimum activation is 0 and the maximum activation is 1.

J = imtile(mat2gray(R), 'GridSize', [8 8]);

A grid size of 8-by-8 is selected because there are 64 features to display.

imshow(J)

The image shows activation data for each of the 64 features. Bright features indicate a strong activation.

The output of a convolution-only network differs from that of a network with both convolution and fully connected layers. Convolution layers reduce the spatial size of the input image while retaining the features needed for a good prediction. Convolution-only networks are used to study feature extraction. The earlier convolution layers extract low-level features such as edges, colors, and gradients, while the later convolution layers extract higher-level features such as patterns, curves, and lines. These high-level features can then be used to identify objects.
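
To study an individual feature more closely, you can display a single channel of the result data P. The channel index used here is an arbitrary choice between 1 and 64:

```matlab
% Display one of the 64 feature maps, scaled so its values span [0, 1]
k = 12;                            % arbitrary channel index
imshow(mat2gray(P(:,:,k)))
title(['Feature map ' num2str(k)])
```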

See Also


Related Topics