Custom Deep Learning Processor Generation to Meet Performance Requirements
This example shows how to create a custom processor configuration and estimate the performance of a pretrained series network. You can then modify parameters of the custom processor configuration and re-estimate the performance. Once you have achieved your performance requirements you can generate a custom bitstream by using the custom processor configuration.
Prerequisites
Deep Learning HDL Toolbox™Support Package for Xilinx FPGA and SoC
Deep Learning Toolbox™
Deep Learning HDL Toolbox™
Deep Learning Toolbox Model Compression Library
MATLAB Coder Interface for Deep Learning
Load Pretrained Series Network
To load the pretrained series network LogoNet, enter:
net = getLogoNetwork;
Define Training and Validation Data Sets
This example uses the logos_dataset data set. The data set consists of 320 images. Create an augmentedImageDatastore object to use for training and validation.
curDir = pwd; unzip('logos_dataset.zip'); imds = imageDatastore('logos_dataset', ... 'IncludeSubfolders',true, ... 'LabelSource','foldernames'); [imdsTrain,imdsValidation] = splitEachLabel(imds,0.7,'randomized');
Create Custom Processor Configuration
To create a custom processor configuration, use the dlhdl.ProcessorConfig object. For more information, see dlhdl.ProcessorConfig. To learn about modifiable parameters of the processor configuration, see getModuleProperty and setModuleProperty.
hPC = dlhdl.ProcessorConfig; hPC.TargetFrequency = 220; hPC
hPC =
Processing Module "conv"
ModuleGeneration: 'on'
LRNBlockGeneration: 'off'
SegmentationBlockGeneration: 'on'
WeightBlockGeneration: 'off'
ConvThreadNumber: 16
InputMemorySize: [227 227 3]
OutputMemorySize: [227 227 3]
FeatureSizeLimit: 2048
Processing Module "fc"
ModuleGeneration: 'on'
SoftmaxBlockGeneration: 'off'
FCThreadNumber: 4
InputMemorySize: 25088
OutputMemorySize: 4096
Processing Module "custom"
ModuleGeneration: 'on'
Addition: 'on'
MishLayer: 'off'
Multiplication: 'on'
Resize2D: 'off'
Sigmoid: 'off'
SwishLayer: 'off'
TanhLayer: 'off'
InputMemorySize: 40
OutputMemorySize: 120
Processor Top Level Properties
RunTimeControl: 'register'
RunTimeStatus: 'register'
InputStreamControl: 'register'
OutputStreamControl: 'register'
SetupControl: 'register'
ProcessorDataType: 'single'
UseVendorLibrary: 'on'
System Level Properties
TargetPlatform: 'Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit'
TargetFrequency: 220
SynthesisTool: 'Xilinx Vivado'
ReferenceDesign: 'AXI-Stream DDR Memory Access : 3-AXIM'
SynthesisToolChipFamily: 'Zynq UltraScale+'
SynthesisToolDeviceName: 'xczu9eg-ffvb1156-2-e'
SynthesisToolPackageName: ''
SynthesisToolSpeedValue: ''
Estimate LogoNet Performance
To estimate the performance of the LogoNet series network, use the estimatePerformance function of the dlhdl.ProcessorConfig object. The function returns the estimated layer latency, network latency, and network performance in frames per second (Frames/s).
hPC.estimatePerformance(net)
### Notice: The layer 'imageinput' of type 'ImageInputLayer' is split into an image input layer 'imageinput' and an addition layer 'imageinput_norm' for normalization on hardware.
### The network includes the following layers:
1 'imageinput' Image Input 227×227×3 images with 'zerocenter' normalization and 'randfliplr' augmentations (SW Layer)
2 'conv_1' 2-D Convolution 96 5×5×3 convolutions with stride [1 1] and padding [0 0 0 0] (HW Layer)
3 'relu_1' ReLU ReLU (HW Layer)
4 'maxpool_1' 2-D Max Pooling 3×3 max pooling with stride [2 2] and padding [0 0 0 0] (HW Layer)
5 'conv_2' 2-D Convolution 128 3×3×96 convolutions with stride [1 1] and padding [0 0 0 0] (HW Layer)
6 'relu_2' ReLU ReLU (HW Layer)
7 'maxpool_2' 2-D Max Pooling 3×3 max pooling with stride [2 2] and padding [0 0 0 0] (HW Layer)
8 'conv_3' 2-D Convolution 384 3×3×128 convolutions with stride [1 1] and padding [0 0 0 0] (HW Layer)
9 'relu_3' ReLU ReLU (HW Layer)
10 'maxpool_3' 2-D Max Pooling 3×3 max pooling with stride [2 2] and padding [0 0 0 0] (HW Layer)
11 'conv_4' 2-D Convolution 128 3×3×384 convolutions with stride [2 2] and padding [0 0 0 0] (HW Layer)
12 'relu_4' ReLU ReLU (HW Layer)
13 'maxpool_4' 2-D Max Pooling 3×3 max pooling with stride [2 2] and padding [0 0 0 0] (HW Layer)
14 'fc_1' Fully Connected 2048 fully connected layer (HW Layer)
15 'relu_5' ReLU ReLU (HW Layer)
16 'fc_2' Fully Connected 2048 fully connected layer (HW Layer)
17 'relu_6' ReLU ReLU (HW Layer)
18 'fc_3' Fully Connected 32 fully connected layer (HW Layer)
19 'softmax' Softmax softmax (SW Layer)
20 'classoutput' Classification Output crossentropyex with 'adidas' and 31 other classes (SW Layer)
### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'classoutput' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.
Deep Learning Processor Estimator Performance Results
LastFrameLatency(cycles) LastFrameLatency(seconds) FramesNum Total Latency Frames/s
------------- ------------- --------- --------- ---------
Network 39245087 0.17839 1 39245087 5.6
imageinput_norm 275732 0.00125
conv_1 6836112 0.03107
maxpool_1 3706776 0.01685
conv_2 10461413 0.04755
maxpool_2 1174098 0.00534
conv_3 9392181 0.04269
maxpool_3 1230834 0.00559
conv_4 1768564 0.00804
maxpool_4 24482 0.00011
fc_1 2651287 0.01205
fc_2 1696631 0.00771
fc_3 26977 0.00012
* The clock frequency of the DL processor is: 220MHz
The estimated frames per second is 5.5 Frames/s. To improve the network performance, modify the custom processor convolution module kernel data type, convolution processor thread number, fully connected module kernel data type, and fully connected module thread number. For more information about these processor parameters, see getModuleProperty and setModuleProperty.
Create Modified Custom Processor Configuration
To create a custom processor configuration, use the dlhdl.ProcessorConfig object. For more information, see dlhdl.ProcessorConfig. To learn about modifiable parameters of the processor configuration, see getModuleProperty and setModuleProperty.
hPCNew = dlhdl.ProcessorConfig; hPCNew.TargetFrequency = 300; hPCNew.ProcessorDataType = 'int8'; hPCNew.setModuleProperty('conv', 'ConvThreadNumber', 64); hPCNew.setModuleProperty('fc', 'FCThreadNumber', 16); hPCNew.UseVendorLibrary = 'off'; hPCNew
hPCNew =
Processing Module "conv"
ModuleGeneration: 'on'
LRNBlockGeneration: 'off'
SegmentationBlockGeneration: 'on'
WeightBlockGeneration: 'off'
ConvThreadNumber: 64
InputMemorySize: [227 227 3]
OutputMemorySize: [227 227 3]
FeatureSizeLimit: 2048
Processing Module "fc"
ModuleGeneration: 'on'
SoftmaxBlockGeneration: 'off'
FCThreadNumber: 16
InputMemorySize: 25088
OutputMemorySize: 4096
Processing Module "custom"
ModuleGeneration: 'on'
Addition: 'on'
MishLayer: 'off'
Multiplication: 'on'
Resize2D: 'off'
Sigmoid: 'off'
SwishLayer: 'off'
TanhLayer: 'off'
InputMemorySize: 40
OutputMemorySize: 120
Processor Top Level Properties
RunTimeControl: 'register'
RunTimeStatus: 'register'
InputStreamControl: 'register'
OutputStreamControl: 'register'
SetupControl: 'register'
ProcessorDataType: 'int8'
UseVendorLibrary: 'off'
System Level Properties
TargetPlatform: 'Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit'
TargetFrequency: 300
SynthesisTool: 'Xilinx Vivado'
ReferenceDesign: 'AXI-Stream DDR Memory Access : 3-AXIM'
SynthesisToolChipFamily: 'Zynq UltraScale+'
SynthesisToolDeviceName: 'xczu9eg-ffvb1156-2-e'
SynthesisToolPackageName: ''
SynthesisToolSpeedValue: ''
Quantize LogoNet Series Network
To quantize the LogoNet network, enter:
imageData = imageDatastore(fullfile(curDir,'logos_dataset'),... 'IncludeSubfolders',true,'FileExtensions','.JPG','LabelSource','foldernames'); imageData_reduced = imageData.subset(1:20); dlquantObj = dlquantizer(net,'ExecutionEnvironment','FPGA'); dlquantObj.calibrate(imageData_reduced)
Warning: Support for GPU devices with compute capability 6.1 will be removed in a future MATLAB release. For more information on GPU support, see <a href="matlab:web('http://www.mathworks.com/help/parallel-computing/gpu-computing-requirements.html','-browser')">GPU Computing Requirements</a>.
ans=35×5 table
Optimized Layer Name Network Layer Name Learnables / Activations MinValue MaxValue
__________________________ __________________ ________________________ ___________ __________
"conv_1_Weights" {'conv_1' } "Weights" -0.048978 0.039352
"conv_1_Bias" {'conv_1' } "Bias" 0.99996 1.0028
"conv_2_Weights" {'conv_2' } "Weights" -0.055518 0.061901
"conv_2_Bias" {'conv_2' } "Bias" -0.00061171 0.00227
"conv_3_Weights" {'conv_3' } "Weights" -0.045942 0.046927
"conv_3_Bias" {'conv_3' } "Bias" -0.0013998 0.0015218
"conv_4_Weights" {'conv_4' } "Weights" -0.045967 0.051
"conv_4_Bias" {'conv_4' } "Bias" -0.00164 0.0037892
"fc_1_Weights" {'fc_1' } "Weights" -0.051394 0.054344
"fc_1_Bias" {'fc_1' } "Bias" -0.00052319 0.00084454
"fc_2_Weights" {'fc_2' } "Weights" -0.05016 0.051557
"fc_2_Bias" {'fc_2' } "Bias" -0.0017564 0.0018502
"fc_3_Weights" {'fc_3' } "Weights" -0.050706 0.04678
"fc_3_Bias" {'fc_3' } "Bias" -0.02951 0.024855
"imageinput" {'imageinput'} "Activations" 0 255
"imageinput_normalization" {'imageinput'} "Activations" -139.34 198.72
⋮
Estimate LogoNet Performance
To estimate the performance of the LogoNet series network, use the estimatePerformance function of the dlhdl.ProcessorConfig object. The function returns the estimated layer latency, network latency, and network performance in frames per second (Frames/s).
hPCNew.estimatePerformance(dlquantObj)
### The network includes the following layers:
1 'imageinput' Image Input 227×227×3 images with 'zerocenter' normalization and 'randfliplr' augmentations (SW Layer)
2 'conv_1' 2-D Convolution 96 5×5×3 convolutions with stride [1 1] and padding [0 0 0 0] (HW Layer)
3 'relu_1' ReLU ReLU (HW Layer)
4 'maxpool_1' 2-D Max Pooling 3×3 max pooling with stride [2 2] and padding [0 0 0 0] (HW Layer)
5 'conv_2' 2-D Convolution 128 3×3×96 convolutions with stride [1 1] and padding [0 0 0 0] (HW Layer)
6 'relu_2' ReLU ReLU (HW Layer)
7 'maxpool_2' 2-D Max Pooling 3×3 max pooling with stride [2 2] and padding [0 0 0 0] (HW Layer)
8 'conv_3' 2-D Convolution 384 3×3×128 convolutions with stride [1 1] and padding [0 0 0 0] (HW Layer)
9 'relu_3' ReLU ReLU (HW Layer)
10 'maxpool_3' 2-D Max Pooling 3×3 max pooling with stride [2 2] and padding [0 0 0 0] (HW Layer)
11 'conv_4' 2-D Convolution 128 3×3×384 convolutions with stride [2 2] and padding [0 0 0 0] (HW Layer)
12 'relu_4' ReLU ReLU (HW Layer)
13 'maxpool_4' 2-D Max Pooling 3×3 max pooling with stride [2 2] and padding [0 0 0 0] (HW Layer)
14 'fc_1' Fully Connected 2048 fully connected layer (HW Layer)
15 'relu_5' ReLU ReLU (HW Layer)
16 'fc_2' Fully Connected 2048 fully connected layer (HW Layer)
17 'relu_6' ReLU ReLU (HW Layer)
18 'fc_3' Fully Connected 32 fully connected layer (HW Layer)
19 'softmax' Softmax softmax (SW Layer)
20 'classoutput' Classification Output crossentropyex with 'adidas' and 31 other classes (SW Layer)
### Notice: The layer 'imageinput' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'classoutput' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.
Deep Learning Processor Estimator Performance Results
LastFrameLatency(cycles) LastFrameLatency(seconds) FramesNum Total Latency Frames/s
------------- ------------- --------- --------- ---------
Network 13791559 0.04597 1 13791559 21.8
conv_1 3488979 0.01163
maxpool_1 1852524 0.00618
conv_2 2940919 0.00980
maxpool_2 586833 0.00196
conv_3 2584863 0.00862
maxpool_3 615201 0.00205
conv_4 612220 0.00204
maxpool_4 12217 0.00004
fc_1 665265 0.00222
fc_2 425425 0.00142
fc_3 7113 0.00002
* The clock frequency of the DL processor is: 300MHz
The estimated frames per second is 21.7 Frames/s.
Generate Custom Processor and Bitstream
Use the new custom processor configuration to build and generate a custom processor and bitstream. Use the custom bitstream to deploy the LogoNet network to your target FPGA board.
hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2020.2\bin\vivado.bat'); dlhdl.buildProcessor(hPCNew);
To learn how to use the generated bitstream file, see Generate Custom Bitstream.
The generated bitstream in this example is similar to the zcu102_int8 bitstream. To deploy the quantized LogoNet network using the zcu102_int8 bitstream, see Classify Images on FPGA Using Quantized Neural Network.
See Also
dlhdl.ProcessorConfig | estimatePerformance | estimateResources