Main Content

Speech Command Recognition by Using FPGA

This example shows how to deploy a custom pretrained series network that detects the presence of speech commands in audio to a Xilinx™ Zynq® UltraScale+™ MPSoC ZCU102 Evaluation Kit. This example uses the pretrained network that was trained by using the Speech Commands Dataset [1]. To create the pretrained network, see Train Speech Command Recognition Model Using Deep Learning.


  • Deep Learning Toolbox™

  • Deep Learning HDL Toolbox™

  • Deep Learning HDL Toolbox™ Support Package for Xilinx FPGA and SoC

  • Audio Toolbox™

  • Xilinx™ Zynq® UltraScale+™ MPSoC ZCU102 Evaluation Kit

Load Speech Commands Data Set

This example uses the Google Speech Commands Dataset [1]. Download the dataset and untar the downloaded file. Set PathToDatabase to the location of the data.

url = '';
downloadFolder = tempdir;
dataFolder = fullfile(downloadFolder,'google_speech');

if ~exist(dataFolder,'dir')
    disp('Downloading data set (1.4 GB) ...')
Downloading data set (1.4 GB) ...

Load Pretrained Speech Recognition Network

The pretrained network trainedAudioNet is a simple series network made up of 24 layers. The network uses max pooling layers to downsample the feature maps "spatially" (that is, in time and frequency) and a final max pooling layer that pools the input feature map globally over time. This enforces (approximate) time-translation invariance in the input spectrograms, allowing the network to perform the same classification independent of the exact position of the speech in time. Global pooling also significantly reduces the number of parameters in the final fully connected layer. To reduce the possibility of the network memorizing specific features of the training data, add a small amount of dropout to the input to the last fully connected layer.

The network is small, as it has only five convolutional layers with few filters. numF controls the number of filters in the convolutional layers. To increase the accuracy of the network, try increasing the network depth by adding identical blocks of convolutional, batch normalization, and ReLU layers. You can also try increasing the number of convolutional filters by increasing numF.

Use a weighted cross entropy classification loss. The weightedClassificationLayer function creates a custom classification layer that calculates the cross entropy loss with observations weighted by classWeights. Specify the class weights in the same order as the classes appear in categories(YTrain). To give each class equal total weight in the loss, use class weights that are inversely proportional to the number of training examples in each class. When using the Adam optimizer to train the network, the training algorithm is independent of the overall normalization of the class weights. Load the pretrained network trainedAudioNet.


Create Training and Validation Datastore

Create an audioDataStore that points to the training and validation data sets. See audioDatastore (Audio Toolbox).

ads = audioDatastore(fullfile(dataFolder, 'train'), ...
    'IncludeSubfolders',true, ...
    'FileExtensions','.wav', ...

Specify the words that you want your model to recognize as commands. Label words that are not commands as unknown. Labeling words that are not commands as unknown creates a group of words that approximates the distribution of all words other than the commands. The network uses this group to learn the difference between commands and all other words.

To reduce the class imbalance between the known and unknown words and speed up processing, include only a fraction of the unknown words in the training set.

To create a datastore that contains only the commands and the subset of unknown words, Use subset (Audio Toolbox) (Audio Toolbox). Count the number of examples belonging to each category.

commands = categorical(["yes","no","up","down","left","right","on","off","stop","go"]);

isCommand = ismember(ads.Labels,commands);
isUnknown = ~isCommand;

includeFraction = 0.2;
mask = rand(numel(ads.Labels),1) < includeFraction;
isUnknown = isUnknown & mask;
ads.Labels(isUnknown) = categorical("unknown");

adsTrain = subset(ads,isCommand|isUnknown);
ans=11×2 table
     Label     Count
    _______    _____

    down       1842 
    go         1861 
    left       1839 
    no         1853 
    off        1839 
    on         1864 
    right      1852 
    stop       1885 
    unknown    6520 
    up         1843 
    yes        1860 

ads = audioDatastore(fullfile(dataFolder, 'validation'), ...
    'IncludeSubfolders',true, ...
    'FileExtensions','.wav', ...

isCommand = ismember(ads.Labels,commands);
isUnknown = ~isCommand;

includeFraction = 0.2;
mask = rand(numel(ads.Labels),1) < includeFraction;
isUnknown = isUnknown & mask;
ads.Labels(isUnknown) = categorical("unknown");

adsValidation = subset(ads,isCommand|isUnknown);
ans=11×2 table
     Label     Count
    _______    _____

    down        264 
    go          260 
    left        247 
    no          270 
    off         256 
    on          257 
    right       256 
    stop        246 
    unknown     883 
    up          260 
    yes         261 

To train the network with the entire dataset and achieve the highest possible accuracy, set reduceDataset to false. To run this example quickly, set reduceDataset to true.

reduceDataset = false;
if reduceDataset
    numUniqueLabels = numel(unique(adsTrain.Labels));
    % Reduce the dataset by a factor of 20
    adsTrain = splitEachLabel(adsTrain,round(numel(adsTrain.Files) / numUniqueLabels / 20));
    adsValidation = splitEachLabel(adsValidation,round(numel(adsValidation.Files) / numUniqueLabels / 20));

Compute Auditory Spectrograms

To prepare the data for efficient training of a convolutional neural network, convert the speech waveforms to auditory-based spectrograms.

Define the parameters of the feature extraction. The segmentDuration variable is the duration of each speech clip (in seconds). The frameDuration variable is the duration of each frame for spectrum calculation. The hopDuration variable is the time step between each spectrum. numBands is the number of filters in the auditory spectrogram.

To perform the feature extraction, create an audioFeatureExtractor (Audio Toolbox) (Audio Toolbox) object.

fs = 16e3; % Known sample rate of the data set.

segmentDuration = 1;
frameDuration = 0.025;
hopDuration = 0.010;

segmentSamples = round(segmentDuration*fs);
frameSamples = round(frameDuration*fs);
hopSamples = round(hopDuration*fs);
overlapSamples = frameSamples - hopSamples;

FFTLength = 512;
numBands = 50;

afe = audioFeatureExtractor( ...
    'SampleRate',fs, ...
    'FFTLength',FFTLength, ...
    'Window',hann(frameSamples,'periodic'), ...
    'OverlapLength',overlapSamples, ...

Read a file from the dataset. Training a convolutional neural network requires input to be a consistent size. Some files in the data set are less than 1 second long. Apply zero-padding to the front and back of the audio signal so that it is of length segmentSamples.

x = read(adsTrain);

numSamples = size(x,1);

numToPadFront = floor( (segmentSamples - numSamples)/2 );
numToPadBack = ceil( (segmentSamples - numSamples)/2 );

xPadded = [zeros(numToPadFront,1,'like',x);x;zeros(numToPadBack,1,'like',x)];

To extract audio features, call extract. The output is a Bark spectrum with time across rows.

features = extract(afe,xPadded);
[numHops,numFeatures] = size(features)
numHops = 98
numFeatures = 50

In this example, you post-process the auditory spectrogram by applying a logarithm. Taking a log of small numbers can lead to roundoff error.

To speed up processing, you can distribute the feature extraction across multiple workers by using parfor.

First, determine the number of partitions for the dataset. If you do not have Parallel Computing Toolbox™, use a single partition.

if ~isempty(ver('parallel')) && ~reduceDataset
    pool = gcp;
    numPar = numpartitions(adsTrain,pool);
    numPar = 1;
Starting parallel pool (parpool) using the 'Processes' profile ...
03-Jan-2024 14:16:35: Job Queued. Waiting for parallel pool job with ID 1 to start ...
Connected to parallel pool with 6 workers.

For each partition, read from the datastore, zero-pad the signal, and then extract the features.

parfor ii = 1:numPar
    subds = partition(adsTrain,numPar,ii);
    XTrain = zeros(numHops,numBands,1,numel(subds.Files));
    for idx = 1:numel(subds.Files)
        x = read(subds);
        xPadded = [zeros(floor((segmentSamples-size(x,1))/2),1);x;zeros(ceil((segmentSamples-size(x,1))/2),1)];
        XTrain(:,:,:,idx) = extract(afe,xPadded);
    XTrainC{ii} = XTrain;

Convert the output to a four-dimensional array that has auditory spectrograms along the fourth dimension.

XTrain = cat(4,XTrainC{:});

[numHops,numBands,numChannels,numSpec] = size(XTrain)
numHops = 98
numBands = 50
numChannels = 1
numSpec = 25058

To obtain data that has a smoother distribution, take the logarithm of the spectrograms by using a small offset.

epsil = 1e-6;
XTrain = log10(XTrain + epsil);

Perform the feature extraction steps described above for the validation set.

if ~isempty(ver('parallel'))
    pool = gcp;
    numPar = numpartitions(adsValidation,pool);
    numPar = 1;
parfor ii = 1:numPar
    subds = partition(adsValidation,numPar,ii);
    XValidation = zeros(numHops,numBands,1,numel(subds.Files));
    for idx = 1:numel(subds.Files)
        x = read(subds);
        xPadded = [zeros(floor((segmentSamples-size(x,1))/2),1);x;zeros(ceil((segmentSamples-size(x,1))/2),1)];
        XValidation(:,:,:,idx) = extract(afe,xPadded);
    XValidationC{ii} = XValidation;
XValidation = cat(4,XValidationC{:});
XValidation = log10(XValidation + epsil);

Isolate the train and validation labels. Remove empty categories.

YTrain = removecats(adsTrain.Labels);
YValidation = removecats(adsValidation.Labels);

Add Background Noise Data

The network must be able to recognize different spoken words and also to detect if the input contains silence or background noise.

To create samples of one-second clips of background noise, use the audio files in the _background_ folder. Create an equal number of background clips from each background noise file. You can also create your own recordings of background noise and add them to the _background_ folder. Before calculating the spectrograms, the function rescales each audio clip by using a factor sampled from a log-uniform distribution in the range provided by volumeRange.

adsBkg = audioDatastore(fullfile(dataFolder, 'background'));
numBkgClips = 4000;
if reduceDataset
    numBkgClips = numBkgClips/20;
volumeRange = log10([1e-4,1]);

numBkgFiles = numel(adsBkg.Files);
numClipsPerFile = histcounts(1:numBkgClips,linspace(1,numBkgClips,numBkgFiles+1));
Xbkg = zeros(size(XTrain,1),size(XTrain,2),1,numBkgClips,'single');
bkgAll = readall(adsBkg);
ind = 1;

for count = 1:numBkgFiles
    bkg = bkgAll{count};
    idxStart = randi(numel(bkg)-fs,numClipsPerFile(count),1);
    idxEnd = idxStart+fs-1;
    gain = 10.^((volumeRange(2)-volumeRange(1))*rand(numClipsPerFile(count),1) + volumeRange(1));
    for j = 1:numClipsPerFile(count)

        x = bkg(idxStart(j):idxEnd(j))*gain(j);

        x = max(min(x,1),-1);

        Xbkg(:,:,:,ind) = extract(afe,x);

        if mod(ind,1000)==0
            disp("Processed " + string(ind) + " background clips out of " + string(numBkgClips))
        ind = ind + 1;
Processed 1000 background clips out of 4000
Processed 2000 background clips out of 4000
Processed 3000 background clips out of 4000
Processed 4000 background clips out of 4000
Xbkg = log10(Xbkg + epsil);

Split the spectrograms of background noise among the training, validation, and test sets. Because the _background_noise_ folder contains only about five and a half minutes of background noise, the background samples in the different data sets are highly correlated. To increase the variation in the background noise, you can create your own background files and add them to the folder. To increase the robustness of the network to noise, you can also try mixing background noise into the speech files.

numTrainBkg = floor(0.85*numBkgClips);
numValidationBkg = floor(0.15*numBkgClips);

XTrain(:,:,:,end+1:end+numTrainBkg) = Xbkg(:,:,:,1:numTrainBkg);
YTrain(end+1:end+numTrainBkg) = "background";

XValidation(:,:,:,end+1:end+numValidationBkg) = Xbkg(:,:,:,numTrainBkg+1:end);
YValidation(end+1:end+numValidationBkg) = "background";

Create Target Object

Create a target object for your target device that has a vendor name and an interface to connect your target device to the host computer. Interface options are JTAG (default) and Ethernet. Vendor options are Intel or Xilinx. Use the installed Xilinx Vivado Design Suite over an Ethernet connection to program the device.

hT = dlhdl.Target('Xilinx', Interface = 'Ethernet');

Create Workflow Object

Create an object of the dlhdl.Workflow class. When you create the object, specify the network and the bitstream name. Specify the saved pretrained series network trainedAudioNet as the network. Make sure that the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Zynq UltraScale+ MPSoC ZCU102 board. The bitstream uses a single data type.

hW = dlhdl.Workflow(Network = trainedNet, Bitstream = 'zcu102_single', Target = hT);

Compile trainedAudioNet Network

To compile the trainedAudioNet series network, run the compile function of the dlhdl.Workflow object.

### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_single.
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### Notice: The layer 'imageinput' of type 'ImageInputLayer' is split into an image input layer 'imageinput' and an addition layer 'imageinput_norm' for normalization on hardware.
### The network includes the following layers:
     1   'imageinput'    Image Input             98×50×1 images with 'zerocenter' normalization                (SW Layer)
     2   'conv_1'        2-D Convolution         12 3×3×1 convolutions with stride [1  1] and padding 'same'   (HW Layer)
     3   'relu_1'        ReLU                    ReLU                                                          (HW Layer)
     4   'maxpool_1'     2-D Max Pooling         3×3 max pooling with stride [2  2] and padding 'same'         (HW Layer)
     5   'conv_2'        2-D Convolution         24 3×3×12 convolutions with stride [1  1] and padding 'same'  (HW Layer)
     6   'relu_2'        ReLU                    ReLU                                                          (HW Layer)
     7   'maxpool_2'     2-D Max Pooling         3×3 max pooling with stride [2  2] and padding 'same'         (HW Layer)
     8   'conv_3'        2-D Convolution         48 3×3×24 convolutions with stride [1  1] and padding 'same'  (HW Layer)
     9   'relu_3'        ReLU                    ReLU                                                          (HW Layer)
    10   'maxpool_3'     2-D Max Pooling         3×3 max pooling with stride [2  2] and padding 'same'         (HW Layer)
    11   'conv_4'        2-D Convolution         48 3×3×48 convolutions with stride [1  1] and padding 'same'  (HW Layer)
    12   'relu_4'        ReLU                    ReLU                                                          (HW Layer)
    13   'conv_5'        2-D Convolution         48 3×3×48 convolutions with stride [1  1] and padding 'same'  (HW Layer)
    14   'relu_5'        ReLU                    ReLU                                                          (HW Layer)
    15   'maxpool_4'     2-D Max Pooling         13×1 max pooling with stride [1  1] and padding [0  0  0  0]  (HW Layer)
    16   'fc'            Fully Connected         12 fully connected layer                                      (HW Layer)
    17   'softmax'       Softmax                 softmax                                                       (SW Layer)
    18   'classoutput'   Classification Output   Weighted cross entropy                                        (SW Layer)
### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'classoutput' with type 'weightedClassificationLayer' is implemented in software.
### Compiling layer group: conv_1>>relu_5 ...
### Compiling layer group: conv_1>>relu_5 ... complete.
### Compiling layer group: maxpool_4 ...
### Compiling layer group: maxpool_4 ... complete.
### Compiling layer group: fc ...
### Compiling layer group: fc ... complete.

### Allocating external memory buffers:

          offset_name          offset_address     allocated_space  
    _______________________    ______________    __________________

    "InputDataOffset"           "0x00000000"     "2.2 MB"          
    "OutputResultOffset"        "0x0023f000"     "4.0 kB"          
    "SchedulerDataOffset"       "0x00240000"     "336.0 kB"        
    "SystemBufferOffset"        "0x00294000"     "484.0 kB"        
    "InstructionDataOffset"     "0x0030d000"     "196.0 kB"        
    "ConvWeightDataOffset"      "0x0033e000"     "224.0 kB"        
    "FCWeightDataOffset"        "0x00376000"     "16.0 kB"         
    "EndOffset"                 "0x0037a000"     "Total: 3560.0 kB"

### Network compilation complete.
ans = struct with fields:
             weights: [1×1 struct]
        instructions: [1×1 struct]
           registers: [1×1 struct]
    syncInstructions: [1×1 struct]
        constantData: {{}  [1×19600 single]}
             ddrInfo: [1×1 struct]
       resourceTable: [6×2 table]

Program Bitstream onto FPGA and Download Network Weights

To deploy the network on the Zynq® UltraScale+™ MPSoC ZCU102 hardware, run the deploy function of the dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA board by using the programming file.The function also downloads the network weights and biases. The deploy function verifies the Xilinx Vivado tool and the supported tool version. It then starts programming the FPGA device by using the bitstream, displays progress messages, and the time it takes to deploy the network.

### Programming FPGA Bitstream using Ethernet...
### Attempting to connect to the hardware board at
### Connection successful
### Programming FPGA device on Xilinx SoC hardware board at
### Copying FPGA programming files to SD card...
### Setting FPGA bitstream and devicetree for boot...
# Copying Bitstream zcu102_single.bit to /mnt/hdlcoder_rd
# Set Bitstream to hdlcoder_rd/zcu102_single.bit
# Copying Devicetree devicetree_dlhdl.dtb to /mnt/hdlcoder_rd
# Set Devicetree to hdlcoder_rd/devicetree_dlhdl.dtb
# Set up boot for Reference Design: 'AXI-Stream DDR Memory Access : 3-AXIM'
### Rebooting Xilinx SoC at
### Reboot may take several seconds...
### Attempting to connect to the hardware board at
### Connection successful
### Programming the FPGA bitstream has been completed successfully.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 03-Jan-2024 14:19:08
### Loading weights to FC Processor.
### FC Weights loaded. Current time is 03-Jan-2024 14:19:08

Run Prediction on Audio Files

Classify five inputs from the validation data set and compare the prediction results to the classification results from the Deep Learning Toolbox™. YPred is the classification result from the Deep learning Toolbox™. The fpga_prediction variable is the classification result from the FPGA.

numtestFrames = size(XValidation,4);
numView = 5;
listIndex = randperm(numtestFrames,numView);
testDataBatch = XValidation(:,:,:,listIndex);
YPred = classify(trainedNet,testDataBatch);
[scores,speed] = predict(hW,testDataBatch, Profile ='on');
### Finished writing input activations.
### Running in multi-frame mode with 5 inputs.

              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                     318982                  0.00145                       5            1544587            712.2
    imageinput_norm          43337                  0.00020 
    conv_1                   47867                  0.00022 
    maxpool_1                37802                  0.00017 
    conv_2                   45553                  0.00021 
    maxpool_2                21443                  0.00010 
    conv_3                   39147                  0.00018 
    maxpool_3                16390                  0.00007 
    conv_4                   28150                  0.00013 
    conv_5                   28730                  0.00013 
    maxpool_4                 8027                  0.00004 
    fc                        2496                  0.00001 
 * The clock frequency of the DL processor is: 220MHz
[~,idx] = max(scores,[],2);
fpga_prediction = trainedNet.Layers(end).Classes(idx);

Compare the prediction results from Deep Learning Toolbox™ and the FPGA side by side. The prediction results from the FPGA match the prediction results from Deep Learning Toolbox™. In this table, the ground truth prediction is the Deep Learning Toolbox™ prediction.

fprintf('%12s %24s\n','Ground Truth','FPGA Prediction');for i= 1:size(fpga_prediction,1)
    fprintf('%s %24s\n',YPred(i),fpga_prediction(i)); end
Ground Truth          FPGA Prediction
background               background
yes                      yes
background               background
background               background
no                       no


[1] Warden P. "Speech Commands: A public dataset for single-word speech recognition", 2017. Available at Copyright Google 2017. The Speech Commands Dataset is licensed under the Creative Commons Attribution 4.0 license, available here:

MathWorks, Inc.

See Also

| | | | |

Related Topics