Image Classification Using Neural Network on FPGA
This example shows how to train, compile, and deploy a dlhdl.Workflow
object that has ResNet-18 neural network to an FPGA and use MATLAB® to retrieve the prediction results.
For this example, you need:
Deep Learning Toolbox ™
Deep Learning HDL Toolbox ™
Deep Learning Toolbox Model for ResNet-18 Network
Deep Learning HDL Toolbox Support Package for Xilinx FPGA and SoC Devices
Image Processing Toolbox ™
Load Pretrained Network
Lad the pretrained ResNet-18 network.:
snet = resnet18;
View the layers of the pretrained network:
deepNetworkDesigner(snet);
The first layer, the image input layer, requires input images of size 224-by-224-by-3, where three is the number of color channels.
inputSize = snet.Layers(1).InputSize;
Load Data
This example uses the MathWorks
MerchData data set. This is a small data set containing 75 images of MathWorks merchandise, belonging to five different classes (cap, cube, playing cards, screwdriver, and torch).
curDir = pwd; unzip('MerchData.zip'); imds = imageDatastore('MerchData', ... 'IncludeSubfolders',true, ... 'LabelSource','foldernames');
Specify Training and Validation Sets
Divide the data into training and validation data sets, so that 30% percent of the images go to the training data set and 70% of the images to the validation data set. splitEachLabel
splits the datastore imds
into two new datastores, imdsTrain
and imdsValidation
.
[imdsTrain,imdsValidation] = splitEachLabel(imds,0.7,'randomized');
Replace Final layers
To retrain ResNet-18 to classify new images, replace the last fully connected layer and final classification layer of the network. In ResNet-18 , these layers have the names 'fc1000'
and 'ClassificationLayer_predictions
', respectively. The fully connected layer and classification layer of the pretrained network net
are configured for 1000 classes. These two layers fc1000
and ClassificationLayer_predictions
in ResNet-18, contain information on how to combine the features that the network extracts into class probabilities and predicted labels. These two layers must be fine-tuned for the new classification problem. Extract all the layers, except the last two layers, from the pretrained network.
lgraph = layerGraph(snet)
lgraph = LayerGraph with properties: InputNames: {'data'} OutputNames: {'ClassificationLayer_predictions'} Layers: [71×1 nnet.cnn.layer.Layer] Connections: [78×2 table]
numClasses = numel(categories(imdsTrain.Labels))
numClasses = 5
newLearnableLayer = fullyConnectedLayer(numClasses, ... 'Name','new_fc', ... 'WeightLearnRateFactor',10, ... 'BiasLearnRateFactor',10); lgraph = replaceLayer(lgraph,'fc1000',newLearnableLayer); newClassLayer = classificationLayer('Name','new_classoutput'); lgraph = replaceLayer(lgraph,'ClassificationLayer_predictions',newClassLayer);
Prepare Data for Training
The network requires input images of size 224-by-224-by-3, but the images in the image datastores have different sizes. Use an augmented image datastore to automatically resize the training images. Specify additional augmentation operations to perform on the training images, such as randomly flipping the training images along the vertical axis and randomly translating them up to 30 pixels horizontally and vertically. Data augmentation helps prevent the network from overfitting and memorizing the exact details of the training images.
pixelRange = [-30 30]; imageAugmenter = imageDataAugmenter( ... 'RandXReflection',true, ... 'RandXTranslation',pixelRange, ... 'RandYTranslation',pixelRange);
To automatically resize the validation images without performing further data augmentation, use an augmented image datastore without specifying any additional preprocessing operations.
augimdsTrain = augmentedImageDatastore(inputSize(1:2),imdsTrain, ... 'DataAugmentation',imageAugmenter); augimdsValidation = augmentedImageDatastore(inputSize(1:2),imdsValidation);
Specify Training Options
Specify the training options. For transfer learning, keep the features from the early layers of the pretrained network (the transferred layer weights). To slow down learning in the transferred layers, set the initial learning rate to a small value. Specify the mini-batch size and validation data. The software validates the network every ValidationFrequency
iterations during training.
options = trainingOptions('sgdm', ... 'MiniBatchSize',10, ... 'MaxEpochs',6, ... 'InitialLearnRate',1e-4, ... 'Shuffle','every-epoch', ... 'ValidationData',augimdsValidation, ... 'ValidationFrequency',3, ... 'Verbose',false, ... 'Plots','training-progress');
Train Network
Train the network that consists of the transferred and new layers. By default, trainNetwork
uses a GPU if one is available. Using this function on a GPU requires Parallel Computing Toolbox™ and a supported GPU device. For more information, see GPU Computing Requirements (Parallel Computing Toolbox). If a GPU is not available, the network uses a CPU (requires MATLAB® Coder™ Interface for Deep learning). You can also specify the execution environment by using the ExecutionEnvironment
name-value argument of trainingOptions
.
netTransfer = trainNetwork(augimdsTrain,lgraph,options);
Define FPGA Board Interface
Define the target FPGA board programming interface by using the dlhdl.Target
object. Create a programming interface with custom name for your target device and an Ethernet interface to connect the target device to the host computer.
hTarget = dlhdl.Target('Xilinx','Interface','Ethernet');
Prepare Network for Deployment
Prepare the network for deployment by creating a dlhdl.Workflow
object. Specify the network and bitstream name. Ensure that the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Xilinx® Zynq® UltraScale+™ MPSoC ZCU102 board and the bitstream uses the single data type.
hW = dlhdl.Workflow(Network=netTransfer,Bitstream='zcu102_single',Target=hTarget);
Compile Network
Run the compile
method of the dlhdl.Workflow
object to compile the network and generate the instructions, weights, and biases for deployment.
dn =compile(hW,'InputFrameNumberLimit',15)
### Compiling network for Deep Learning FPGA prototyping ... ### Targeting FPGA bitstream zcu102_single. ### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer' ### Notice: The layer 'data' of type 'ImageInputLayer' is split into an image input layer 'data', an addition layer 'data_norm_add', and a multiplication layer 'data_norm' for hardware normalization. ### The network includes the following layers: 1 'data' Image Input 224×224×3 images with 'zscore' normalization (SW Layer) 2 'conv1' 2-D Convolution 64 7×7×3 convolutions with stride [2 2] and padding [3 3 3 3] (HW Layer) 3 'conv1_relu' ReLU ReLU (HW Layer) 4 'pool1' 2-D Max Pooling 3×3 max pooling with stride [2 2] and padding [1 1 1 1] (HW Layer) 5 'res2a_branch2a' 2-D Convolution 64 3×3×64 convolutions with stride [1 1] and padding [1 1 1 1] (HW Layer) 6 'res2a_branch2a_relu' ReLU ReLU (HW Layer) 7 'res2a_branch2b' 2-D Convolution 64 3×3×64 convolutions with stride [1 1] and padding [1 1 1 1] (HW Layer) 8 'res2a' Addition Element-wise addition of 2 inputs (HW Layer) 9 'res2a_relu' ReLU ReLU (HW Layer) 10 'res2b_branch2a' 2-D Convolution 64 3×3×64 convolutions with stride [1 1] and padding [1 1 1 1] (HW Layer) 11 'res2b_branch2a_relu' ReLU ReLU (HW Layer) 12 'res2b_branch2b' 2-D Convolution 64 3×3×64 convolutions with stride [1 1] and padding [1 1 1 1] (HW Layer) 13 'res2b' Addition Element-wise addition of 2 inputs (HW Layer) 14 'res2b_relu' ReLU ReLU (HW Layer) 15 'res3a_branch2a' 2-D Convolution 128 3×3×64 convolutions with stride [2 2] and padding [1 1 1 1] (HW Layer) 16 'res3a_branch2a_relu' ReLU ReLU (HW Layer) 17 'res3a_branch2b' 2-D Convolution 128 3×3×128 convolutions with stride [1 1] and padding [1 1 1 1] (HW Layer) 18 'res3a_branch1' 2-D Convolution 128 1×1×64 convolutions with stride [2 2] and padding [0 0 0 0] (HW Layer) 19 'res3a' Addition Element-wise addition of 2 inputs (HW Layer) 20 'res3a_relu' ReLU ReLU (HW Layer) 21 'res3b_branch2a' 2-D Convolution 128 3×3×128 convolutions with stride [1 1] and padding [1 1 1 1] (HW Layer) 22 'res3b_branch2a_relu' ReLU ReLU (HW Layer) 23 'res3b_branch2b' 2-D Convolution 128 3×3×128 convolutions with stride [1 1] and padding [1 1 1 1] (HW Layer) 24 'res3b' Addition Element-wise addition of 2 inputs (HW Layer) 25 'res3b_relu' ReLU ReLU (HW Layer) 26 'res4a_branch2a' 2-D Convolution 256 3×3×128 convolutions with stride [2 2] and padding [1 1 1 1] (HW Layer) 27 'res4a_branch2a_relu' ReLU ReLU (HW Layer) 28 'res4a_branch2b' 2-D Convolution 256 3×3×256 convolutions with stride [1 1] and padding [1 1 1 1] (HW Layer) 29 'res4a_branch1' 2-D Convolution 256 1×1×128 convolutions with stride [2 2] and padding [0 0 0 0] (HW Layer) 30 'res4a' Addition Element-wise addition of 2 inputs (HW Layer) 31 'res4a_relu' ReLU ReLU (HW Layer) 32 'res4b_branch2a' 2-D Convolution 256 3×3×256 convolutions with stride [1 1] and padding [1 1 1 1] (HW Layer) 33 'res4b_branch2a_relu' ReLU ReLU (HW Layer) 34 'res4b_branch2b' 2-D Convolution 256 3×3×256 convolutions with stride [1 1] and padding [1 1 1 1] (HW Layer) 35 'res4b' Addition Element-wise addition of 2 inputs (HW Layer) 36 'res4b_relu' ReLU ReLU (HW Layer) 37 'res5a_branch2a' 2-D Convolution 512 3×3×256 convolutions with stride [2 2] and padding [1 1 1 1] (HW Layer) 38 'res5a_branch2a_relu' ReLU ReLU (HW Layer) 39 'res5a_branch2b' 2-D Convolution 512 3×3×512 convolutions with stride [1 1] and padding [1 1 1 1] (HW Layer) 40 'res5a_branch1' 2-D Convolution 512 1×1×256 convolutions with stride [2 2] and padding [0 0 0 0] (HW Layer) 41 'res5a' Addition Element-wise addition of 2 inputs (HW Layer) 42 'res5a_relu' ReLU ReLU (HW Layer) 43 'res5b_branch2a' 2-D Convolution 512 3×3×512 convolutions with stride [1 1] and padding [1 1 1 1] (HW Layer) 44 'res5b_branch2a_relu' ReLU ReLU (HW Layer) 45 'res5b_branch2b' 2-D Convolution 512 3×3×512 convolutions with stride [1 1] and padding [1 1 1 1] (HW Layer) 46 'res5b' Addition Element-wise addition of 2 inputs (HW Layer) 47 'res5b_relu' ReLU ReLU (HW Layer) 48 'pool5' 2-D Global Average Pooling 2-D global average pooling (HW Layer) 49 'new_fc' Fully Connected 5 fully connected layer (HW Layer) 50 'prob' Softmax softmax (SW Layer) 51 'new_classoutput' Classification Output crossentropyex with 'MathWorks Cap' and 4 other classes (SW Layer) ### Notice: The layer 'prob' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software. ### Notice: The layer 'new_classoutput' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software. ### Compiling layer group: conv1>>pool1 ... ### Compiling layer group: conv1>>pool1 ... complete. ### Compiling layer group: res2a_branch2a>>res2a_branch2b ... ### Compiling layer group: res2a_branch2a>>res2a_branch2b ... complete. ### Compiling layer group: res2b_branch2a>>res2b_branch2b ... ### Compiling layer group: res2b_branch2a>>res2b_branch2b ... complete. ### Compiling layer group: res3a_branch1 ... ### Compiling layer group: res3a_branch1 ... complete. ### Compiling layer group: res3a_branch2a>>res3a_branch2b ... ### Compiling layer group: res3a_branch2a>>res3a_branch2b ... complete. ### Compiling layer group: res3b_branch2a>>res3b_branch2b ... ### Compiling layer group: res3b_branch2a>>res3b_branch2b ... complete. ### Compiling layer group: res4a_branch1 ... ### Compiling layer group: res4a_branch1 ... complete. ### Compiling layer group: res4a_branch2a>>res4a_branch2b ... ### Compiling layer group: res4a_branch2a>>res4a_branch2b ... complete. ### Compiling layer group: res4b_branch2a>>res4b_branch2b ... ### Compiling layer group: res4b_branch2a>>res4b_branch2b ... complete. ### Compiling layer group: res5a_branch1 ... ### Compiling layer group: res5a_branch1 ... complete. ### Compiling layer group: res5a_branch2a>>res5a_branch2b ... ### Compiling layer group: res5a_branch2a>>res5a_branch2b ... complete. ### Compiling layer group: res5b_branch2a>>res5b_branch2b ... ### Compiling layer group: res5b_branch2a>>res5b_branch2b ... complete. ### Compiling layer group: pool5 ... ### Compiling layer group: pool5 ... complete. ### Compiling layer group: new_fc ... ### Compiling layer group: new_fc ... complete. ### Allocating external memory buffers: offset_name offset_address allocated_space _______________________ ______________ _________________ "InputDataOffset" "0x00000000" "12.0 MB" "OutputResultOffset" "0x00c00000" "4.0 MB" "SchedulerDataOffset" "0x01000000" "8.0 MB" "SystemBufferOffset" "0x01800000" "28.0 MB" "InstructionDataOffset" "0x03400000" "4.0 MB" "ConvWeightDataOffset" "0x03800000" "52.0 MB" "FCWeightDataOffset" "0x06c00000" "4.0 MB" "EndOffset" "0x07000000" "Total: 112.0 MB" ### Network compilation complete.
dn = struct with fields:
weights: [1×1 struct]
instructions: [1×1 struct]
registers: [1×1 struct]
syncInstructions: [1×1 struct]
constantData: {{1×2 cell} [0.0135 0.0162 0.0141 0 0.0135 0.0162 0.0141 0 0.0135 0.0162 0.0141 0 0.0135 0.0162 0.0141 0 0.0135 0.0162 0.0141 0 0.0135 0.0162 0.0141 0 0.0135 0.0162 0.0141 0 0.0135 0.0162 0.0141 0 0.0135 0.0162 0.0141 0 0.0135 … ]}
ddrInfo: [1×1 struct]
Program Bitstream onto FPGA and Download Network Weights
To deploy the network on the Xilinx® Zynq® UltraScale+ MPSoC ZCU102 hardware, run the deploy
method of the dlhdl.Workflow
object. This method programs the FPGA board using the output of the compile method and the programming file, downloads the network weights and biases, displays progress messages, and the time it takes to deploy the network.
deploy(hW)
### Programming FPGA Bitstream using Ethernet... ### Attempting to connect to the hardware board at 192.168.1.101... ### Connection successful ### Programming FPGA device on Xilinx SoC hardware board at 192.168.1.101... ### Copying FPGA programming files to SD card... ### Setting FPGA bitstream and devicetree for boot... # Copying Bitstream zcu102_single.bit to /mnt/hdlcoder_rd # Set Bitstream to hdlcoder_rd/zcu102_single.bit # Copying Devicetree devicetree_dlhdl.dtb to /mnt/hdlcoder_rd # Set Devicetree to hdlcoder_rd/devicetree_dlhdl.dtb # Set up boot for Reference Design: 'AXI-Stream DDR Memory Access : 3-AXIM' ### Rebooting Xilinx SoC at 192.168.1.101... ### Reboot may take several seconds... ### Attempting to connect to the hardware board at 192.168.1.101... ### Connection successful ### Programming the FPGA bitstream has been completed successfully. ### Loading weights to Conv Processor. ### Conv Weights loaded. Current time is 21-Dec-2022 11:15:51 ### Loading weights to FC Processor. ### FC Weights loaded. Current time is 21-Dec-2022 11:15:51
Test Network
Load the example image.
imgFile = fullfile(pwd,'MerchData','MathWorks Cube','Mathworks cube_0.jpg'); inputImg = imresize(imread(imgFile),[224 224]); imshow(inputImg)
Classify the image on the FPGA by using the predict
method of the dlhdl.Workflow
object and display the results.
[prediction,speed] = predict(hW,single(inputImg),'Profile','on');
### Finished writing input activations. ### Running single input activation. Deep Learning Processor Profiler Performance Results LastFrameLatency(cycles) LastFrameLatency(seconds) FramesNum Total Latency Frames/s ------------- ------------- --------- --------- --------- Network 25200171 0.11455 1 25202830 8.7 data_norm_add 351489 0.00160 data_norm 351607 0.00160 conv1 2226639 0.01012 pool1 503083 0.00229 res2a_branch2a 973823 0.00443 res2a_branch2b 973242 0.00442 res2a 374320 0.00170 res2b_branch2a 973556 0.00443 res2b_branch2b 973335 0.00442 res2b 374442 0.00170 res3a_branch1 539156 0.00245 res3a_branch2a 541381 0.00246 res3a_branch2b 908821 0.00413 res3a 187225 0.00085 res3b_branch2a 908911 0.00413 res3b_branch2b 909169 0.00413 res3b 187275 0.00085 res4a_branch1 490738 0.00223 res4a_branch2a 493429 0.00224 res4a_branch2b 894069 0.00406 res4a 93604 0.00043 res4b_branch2a 893888 0.00406 res4b_branch2b 893250 0.00406 res4b 93654 0.00043 res5a_branch1 1131245 0.00514 res5a_branch2a 1133770 0.00515 res5a_branch2b 2210255 0.01005 res5a 46862 0.00021 res5b_branch2a 2211423 0.01005 res5b_branch2b 2211330 0.01005 res5b 46872 0.00021 pool5 73644 0.00033 new_fc 24477 0.00011 * The clock frequency of the DL processor is: 220MHz
[val,idx] = max(prediction); netTransfer.Layers(end).ClassNames{idx}
ans = 'MathWorks Cube'
See Also
dlhdl.Workflow
| dlhdl.Target
| compile
| deploy
| predict