Main Content

This example shows how to classify each time step of sequence data using a generic temporal convolutional network (TCN).

While sequence-to-sequence tasks are commonly solved with recurrent neural network architectures, Bai et al. [1] show that convolutional neural networks can match the performance of recurrent networks on typical sequence modeling tasks or even outperform them. Potential benefits of using convolutional networks can be better parallelism, better control over the receptive field size, better control of the network's memory footprint during training, and more stable gradients. Just like recurrent networks, convolutional networks can operate on variable length input sequences and can be used to model sequence-to-sequence or sequence-to-one tasks.

This example uses sensor data obtained from a smartphone worn on the body and trains a temporal convolutional network as a function using a custom training loop and automatic differentiation to recognize the activity of the wearer given time series data representing accelerometer readings in three different directions.

Load the human activity recognition data. The data contains seven time series of sensor data obtained from a smartphone worn on the body. Each sequence has three features and varies in length. The three features correspond to the accelerometer readings in three different directions. Six sequences are used for training and one sequence is used for testing after training.

s = load("HumanActivityTrain.mat"); dsXTrain = arrayDatastore(s.XTrain,'OutputType','same'); dsYTrain = arrayDatastore(s.YTrain,'OutputType','same'); dsTrain = combine(dsXTrain,dsYTrain);

Determine the number of observations and the number of classes in the training data.

numObservations = numel(s.XTrain); classes = categories(s.YTrain{1}); numClasses = numel(classes);

Visualize one training sequence in a plot. Plot the first feature of the first training sequence and color the plot according to the corresponding activity.

X = s.XTrain{1}(1,:); figure for j = 1:numel(classes) label = classes(j); idx = find(s.YTrain{1} == label); hold on plot(idx,X(idx)) end hold off xlabel("Time Step") ylabel("Acceleration") title("Training Sequence 1, Feature 1") legend(classes,'Location','northwest')

The main building block of a temporal convolutional network is a dilated causal convolution layer which operates over the time steps of each sequence. In this context, "causal" means that the activations computed for a particular time step cannot depend on activations from future time steps.

In order to build up context from previous time steps, multiple convolutional layers are typically stacked on top of each other. To achieve large receptive field sizes the dilation factor of subsequent convolution layers is increased exponentially as shown in the image below. Assuming the dilation factor of the k-th convolutional layer is ${2}^{\left(\mathit{k}-1\right)}$ and the stride is 1, then the receptive field size of such a network can be computed as $\mathit{R}=\left(\mathit{f}-1\right)\left({2}^{\mathit{K}}-1\right)+1$, where $\mathit{f}$ is the filter size and $\mathit{K}$ is the number of convolutional layers. Change the filter size and number of layers to easily adjust the receptive field size and the number or learnable parameters as necessary for the data and task at hand.

One of the disadvantages of TCNs compared to RNNs is the higher memory footprint during inference. The entire raw sequence is required to compute the next time step. To reduce inference time and memory consumption, especially for step-ahead predictions, it can be beneficial to train with the smallest sensible receptive field size $\mathit{R}$ and only perform prediction with the last $\mathit{R}$ time steps of the input sequence.

A general TCN architecture (as described in [1]) consists of multiple residual blocks, each containing two sets of dilated causal convolution layers with the same dilation factor, followed by normalization, ReLU activation, and spatial dropout layers. The input to each block is added to the output of the block (including a 1-by-1 convolution on the input when the number of channels between the input and output do not match) and a final activation function is applied.

Specify the parameters for the TCN architecture with four residual blocks containing dilated causal convolution layers with each 175 filters of size 3.

numBlocks = 4; numFilters = 175; filterSize = 3; dropoutFactor = 0.05;

To pass the model hyperparameters to the model functions (the number of blocks and the dropout factor), create a struct containing these values.

hyperparameters = struct; hyperparameters.NumBlocks = numBlocks; hyperparameters.DropoutFactor = dropoutFactor;

Create a struct containing `dlarray`

objects for all the learnable parameters of the model based on the number of input channels and the hyperparameters that define the model architecture. Each residual block requires weights parameters and bias parameters for each of the two convolution operations. The first residual block usually also requires weights and biases for an additional convolution operation with filter size 1. The final fully connected operation requires a weights and bias parameter as well. Initialize the learnable layer weights using the function `initializeGaussian`

, listed at the end of the example. Initialize the learnable layer biases with zeros.

numInputChannels = 3; parameters = struct; numChannels = numInputChannels; for k = 1:numBlocks parametersBlock = struct; blockName = "Block"+k; weights = initializeGaussian([filterSize, numChannels, numFilters]); bias = zeros(numFilters, 1, 'single'); parametersBlock.Conv1.Weights = dlarray(weights); parametersBlock.Conv1.Bias = dlarray(bias); weights = initializeGaussian([filterSize, numFilters, numFilters]); bias = zeros(numFilters, 1, 'single'); parametersBlock.Conv2.Weights = dlarray(weights); parametersBlock.Conv2.Bias = dlarray(bias); % If the input and output of the block have different numbers of % channels, then add a convolution with filter size 1. if numChannels ~= numFilters weights = initializeGaussian([1, numChannels, numFilters]); bias = zeros(numFilters, 1, 'single'); parametersBlock.Conv3.Weights = dlarray(weights); parametersBlock.Conv3.Bias = dlarray(bias); end numChannels = numFilters; parameters.(blockName) = parametersBlock; end weights = initializeGaussian([numClasses,numChannels]); bias = zeros(numClasses,1,'single'); parameters.FC.Weights = dlarray(weights); parameters.FC.Bias = dlarray(bias);

View the network parameters.

parameters

`parameters = `*struct with fields:*
Block1: [1×1 struct]
Block2: [1×1 struct]
Block3: [1×1 struct]
Block4: [1×1 struct]
FC: [1×1 struct]

View the parameters of the first block.

parameters.Block1

`ans = `*struct with fields:*
Conv1: [1×1 struct]
Conv2: [1×1 struct]
Conv3: [1×1 struct]

View the parameters of the first convolutional operation of the first block.

parameters.Block1.Conv1

`ans = `*struct with fields:*
Weights: [3×3×175 dlarray]
Bias: [175×1 dlarray]

Create the function `model`

, listed in the Model Function section at the end of the example, that computes the outputs of the deep learning model. The function `model`

takes the input data, the learnable model parameters, the model hyperparameters, and a flag that specifies whether the model should return outputs for training or prediction. The network outputs the predictions for the labels at each time step of the input sequence.

Create the function `modelGradients`

, listed in the Model Gradients Function section at the end of the example, that takes a mini-batch of input data, the corresponding target sequences, and the parameters of the network, and returns the gradients of the loss with respect to the learnable parameters and the corresponding loss.

Specify a set of training options used in the custom training loop.

Train for 30 epochs with mini-batch size 1.

Start with an initial learn rate of 0.001

Multiply the learn rate by 0.1 every 12 epochs.

Clip the gradients using the $${L}_{2}$$ norm with threshold 1.

maxEpochs = 30; miniBatchSize = 1; initialLearnRate = 0.001; learnRateDropFactor = 0.1; learnRateDropPeriod = 12; gradientThreshold = 1;

To monitor the training progress, you can plot the training loss after each iteration. Create the variable `plots`

that contains `"training-progress"`

. If you do not want to plot the training progress, then set this value to `"none"`

.

`plots = "training-progress";`

Train the network via stochastic gradient descent by looping over the sequences in the training dataset, computing parameter gradients, and updating the network parameters via the Adam update rule. This process is repeated multiple times (referred to as *epochs*) until training converges and the maximum number of epochs is reached.

Use `minibatchqueue`

to process and manage mini-batches of images during training. For each mini-batch:

Use the custom mini-batch preprocessing function

`preprocessMiniBatch`

(defined at the end of this example) to preprocess the sequence data, determine the number of time steps, and one-hot encode the class labels. The function has three outputs, so specify three output variables for the`minibatchqueue`

.Convert the data to unformated

`dlarray`

objects with underlying type single.Train on a GPU if one is available. By default, the

`minibatchqueue`

object converts each output to a`gpuArray`

if a GPU is available. Using a GPU requires Parallel Computing Toolbox™ and a CUDA® enabled NVIDIA® GPU with compute capability 3.0 or higher.

mbq = minibatchqueue(dsTrain,3,... 'MiniBatchSize',miniBatchSize,... 'MiniBatchFcn', @preprocessMiniBatch);

For each epoch, shuffle the training data. For each mini-batch:

Evaluate the model gradients and loss using

`dlfeval`

and the`modelGradients`

function.Clip the gradients if they are too large, using the function

`thresholdL2Norm`

, listed at the end of the example, and the`dlupdate`

function.Update the network parameters using the

`adamupdate`

function.Update the training progress plot.

After completing `learnRateDropPeriod`

epochs, reduce the learn rate by multiplying the current learning rate by `learnRateDropFactor`

.

Initialize the learning rate which will be multiplied by the `LearnRateDropFactor`

value every `LearnRateDropPeriod`

epochs.

learnRate = initialLearnRate;

Initialize the moving average of the parameter gradients and the element-wise squares of the gradients used by the Adam optimizer.

trailingAvg = []; trailingAvgSq = [];

Initialize a plot showing the training progress.

if plots == "training-progress" figure lineLossTrain = animatedline('Color',[0.85 0.325 0.098]); ylim([0 inf]) xlabel("Iteration") ylabel("Loss") grid on end

Train the model.

iteration = 0; start = tic; % Loop over epochs. for epoch = 1:maxEpochs % Shuffle the data. shuffle(mbq) % Loop over mini-batches. while hasdata(mbq) iteration = iteration + 1; [dlX,dlY,numTimeSteps] = next(mbq); % Evaluate the model gradients and loss using dlfeval. [gradients, loss] = dlfeval(@modelGradients,dlX,dlY,parameters,hyperparameters,numTimeSteps); % Clip the gradients. gradients = dlupdate(@(g) thresholdL2Norm(g,gradientThreshold),gradients); % Update the network parameters using the Adam optimizer. [parameters,trailingAvg,trailingAvgSq] = adamupdate(parameters,gradients, ... trailingAvg, trailingAvgSq, iteration, learnRate); if plots == "training-progress" % Plot training progress. D = duration(0,0,toc(start),'Format','hh:mm:ss'); % Normalize the loss over the sequence lengths loss = mean(loss ./ numTimeSteps); loss = double(gather(extractdata(loss))); loss = mean(loss); addpoints(lineLossTrain,iteration, mean(loss)); title("Epoch: " + epoch + ", Elapsed: " + string(D)) drawnow end end % Reduce the learning rate after learnRateDropPeriod epochs if mod(epoch,learnRateDropPeriod) == 0 learnRate = learnRate*learnRateDropFactor; end end

Test the classification accuracy of the model by comparing the predictions on a held-out test set with the true labels for each time step. Manage the test data set using a `minibatchqueue`

object with the same settings as the training data.

s = load("HumanActivityTest.mat"); dsXTest = arrayDatastore(s.XTest,'OutputType','same'); dsYTest = arrayDatastore(s.YTest,'OutputType','same'); dsTest = combine(dsXTest,dsYTest); mbqTest = minibatchqueue(dsTest,3,... 'MiniBatchSize',miniBatchSize,... 'MiniBatchFcn', @preprocessMiniBatch);

To predict the labels of the test data, use the model function with the trained parameters, the hyperparameters, and the `doTraining`

option set to false. Use the `onehotdecode`

function to find the predicted class with the highest score and then compare the prediction with the true label. Evaluate the mean classification accuracy.

doTraining = false; predictions = []; predCorr = []; while hasdata(mbqTest) [dlXTest,dlYTest] = next(mbqTest); dlYPred = model(dlXTest,parameters,hyperparameters,doTraining); dlYPred = softmax(dlYPred,'DataFormat','CBT'); YPred = onehotdecode(dlYPred,classes,1); YTest = onehotdecode(dlYTest,classes,1); predictions = [predictions YPred]; predCorr = [predCorr YPred == YTest]; end mean(predCorr)

ans = 0.9996

Compare the predictions for a single sequence with the corresponding test data by using a plot.

figure idx = 1; plot(squeeze(predictions),'.-') hold on plot(squeeze(YTest)) hold off xlabel("Time Step") ylabel("Activity") title("Predicted Activities") legend(["Predicted" "Test Data"])

The function `model`

takes the input data `dlX`

, the learnable model parameters, the model hyperparameters, and the flag `doTraining`

which specifies whether the model should return outputs for training or prediction. The network outputs the predictions for the labels at each time step of the input sequence. The model consists of multiple residual blocks with exponentially increasing dilation factors. After the last residual block, a final `fullyconnect`

operation maps the output to the number of classes in the target data.

function dlY = model(dlX,parameters,hyperparameters,doTraining) numBlocks = hyperparameters.NumBlocks; dropoutFactor = hyperparameters.DropoutFactor; dlY = dlX; % Residual blocks. for k = 1:numBlocks dilationFactor = 2^(k-1); parametersBlock = parameters.("Block"+k); dlY = residualBlock(dlY,dilationFactor,dropoutFactor,parametersBlock,doTraining); end % Fully connect weights = parameters.FC.Weights; bias = parameters.FC.Bias; dlY = fullyconnect(dlY,weights,bias,'DataFormat','CBT'); end

The function `residualBlock`

implements the core building block of the temporal convolutional network.

To apply 1-D causal dilated convolution, use the `dlconv`

function:

To convolve over the spatial dimensions, set the

`'DataFormat'`

option to`'CBS'`

(use the dimension label`'S'`

instead of`'T'`

),Set the

`'DilationFactor'`

option according to the dilation factor of the residual block.To ensure only the past time steps are used, apply padding only at the beginning of the sequence.

function dlY = residualBlock(dlX,dilationFactor,dropoutFactor,parametersBlock,doTraining) % Convolution options. filterSize = size(parametersBlock.Conv1.Weights,1); paddingSize = (filterSize - 1) * dilationFactor; % Convolution. weights = parametersBlock.Conv1.Weights; bias = parametersBlock.Conv1.Bias; dlY = dlconv(dlX,weights,bias, ... 'DataFormat','CBS', ... 'Stride', 1, ... 'DilationFactor', dilationFactor, ... 'Padding', [paddingSize; 0] ); % Instance normalization, ReLU, spatial dropout. dlY = instanceNormalization(dlY,'CBS'); dlY = relu(dlY); dlY = spatialDropout(dlY,dropoutFactor,'CBS',doTraining); % Convolution. weights = parametersBlock.Conv2.Weights; bias = parametersBlock.Conv2.Bias; dlY = dlconv(dlY,weights,bias, ... 'DataFormat','CBS', ... 'Stride', 1, ... 'DilationFactor', dilationFactor, ... 'Padding',[paddingSize; 0] ); % Instance normalization, ReLU, spatial dropout. dlY = instanceNormalization(dlY,'CBS'); dlY = relu(dlY); dlY = spatialDropout(dlY,dropoutFactor,'CBS',doTraining); % Optional 1-by-1 convolution. if ~isequal(size(dlX),size(dlY)) weights = parametersBlock.Conv3.Weights; bias = parametersBlock.Conv3.Bias; dlX = dlconv(dlX,weights,bias,'DataFormat','CBS'); end % Addition and ReLU dlY = relu(dlX+dlY); end

The `modelGradients`

function takes a mini-batch of input data `dlX`

, the corresponding target sequences `T`

, the learnable parameters, and the hyperparameters, and returns the gradients of the loss with respect to the learnable parameters and the corresponding loss. To compute the gradients, evaluate the `modelGradients`

function using the `dlfeval`

function in the training loop.

function [gradients,loss] = modelGradients(dlX,T,parameters,hyperparameters,numTimeSteps) dlY = model(dlX,parameters,hyperparameters,true); dlY = softmax(dlY,'DataFormat','CBT'); dlT = dlarray(T,'CBT'); loss = maskedCrossEntropyLoss(dlY, dlT, numTimeSteps); gradients = dlgradient(mean(loss),parameters); end

The `maskedCrossEntropyLoss`

function computes the cross-entropy loss for mini-batches of sequences, where the sequences are different lengths.

function loss = maskedCrossEntropyLoss(dlY,dlT,numTimeSteps) numObservations = size(dlY,2); loss = dlarray(zeros(1,numObservations,'like',dlY)); for i = 1:numObservations idx = 1:numTimeSteps(i); loss(i) = crossentropy(dlY(:,i,idx),dlT(:,i,idx),'DataFormat','CBT'); end end

The `instanceNormalization`

function normalizes the input `dlX`

by first calculating the mean $\mu $ and the variance ${\sigma}^{2}\text{\hspace{0.17em}}$ for each observation over each input channel. Then it calculates the normalized activations as

$$\stackrel{\u02c6}{\mathit{X}}=\frac{\mathit{X}-\mu \text{\hspace{0.17em}}}{\sqrt{\sigma {\text{\hspace{0.17em}}}^{2}+\u03f5}}.$$

In comparison to batch normalization, the mean and variance is different for each observation in the mini-batch. Use normalization, such as instance normalization, between convolutional layers and nonlinearities to speed up training of convolutional neural networks and improve convergence.

function dlY = instanceNormalization(dlX,fmt) reductionDims = find(fmt == 'S'); mu = mean(dlX,reductionDims); sigmaSq = var(dlX,1,reductionDims); epsilon = 1e-5; dlY = (dlX-mu) ./ sqrt(sigmaSq+epsilon); end

The `spatialDropout`

function performs spatial dropout [3] on the input `dlX`

with dimension labels `fmt`

when the `doTraining`

flag is `true`

. Spatial dropout drops an entire channel of the input data. That is, all time steps of a certain channel are dropped with the probability specified by `dropoutFactor`

. The channels are dropped out independently in the batch dimension.

function dlY = spatialDropout(dlX,dropoutFactor,fmt,doTraining) if doTraining maskSize = size(dlX); maskSize(fmt=='S') = 1; dropoutScaleFactor = single(1 - dropoutFactor); dropoutMask = (rand(maskSize,'like',dlX) > dropoutFactor) / dropoutScaleFactor; dlY = dlX .* dropoutMask; else dlY = dlX; end end

The `initializeGaussian`

function samples weights from a Gaussian distribution with mean 0 and standard deviation 0.01.

function parameter = initializeGaussian(sz) parameter = randn(sz,'single') .* 0.01; end

The `preprocessMiniBatch`

function preprocesses the data for training. The *N* input sequences are transformed into a *C*-by-*N*-by-*S* numeric array of left-padded 1-D sequences and the number of time steps in each sequence, where *C* corresponds to the number of features of the sequences and *S* corresponds to the number of time steps of the longest sequence.

The `preprocessMiniBatch`

function preprocesses the data using the following steps:

Calculate the length of the longest sequence.

One-hot encode the categorical labels of each time step into numeric arrays.

Pad the sequences to the same length as the longest sequence in the mini-batch using the

`leftPad`

function

function [XTransformed,YTransformed,numTimeSteps] = preprocessMiniBatch(XCell,YCell) numTimeSteps = cellfun(@(sequence) size(sequence,2),XCell); sequenceLength = max(cellfun(@(sequence) size(sequence,2),XCell)); miniBatchSize = numel(XCell); numFeatures = size(XCell{1},1); classes = categories(YCell{1}); numClasses = numel(classes); szX = [numFeatures miniBatchSize sequenceLength]; XTransformed = zeros(szX,'single'); szY = [numClasses miniBatchSize sequenceLength]; YTransformed = zeros(szY,'single'); for i = 1:miniBatchSize predictors = XCell{i}; responses = onehotencode(YCell{i},1); % Left pad. XTransformed(:,i,:) = leftPad(predictors,sequenceLength); YTransformed(:,i,:) = leftPad(responses,sequenceLength); end end

The `leftPad`

function takes a sequence and left-pads it with zeros to have the specified sequence length.

function sequencePadded = leftPad(sequence,sequenceLength) [numFeatures,numTimeSteps] = size(sequence); paddingSize = sequenceLength - numTimeSteps; padding = zeros(numFeatures,paddingSize); sequencePadded = [padding sequence]; end

The `thresholdL2Norm`

function scales the gradient `g`

so that its $${L}_{2}$$ norm equals `gradientThreshold`

when the $${L}_{2}$$ norm of the gradient is larger than `gradientThreshold`

.

function g = thresholdL2Norm(g,gradientThreshold) gradientNorm = sqrt(sum(g.^2,'all')); if gradientNorm > gradientThreshold g = g * (gradientThreshold / gradientNorm); end end

[1] Bai, Shaojie, J. Zico Kolter, and Vladlen Koltun. "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling." *arXiv preprint arXiv:1803.01271* (2018).

[2] Van Den Oord, Aäron, et al. "WaveNet: A generative model for raw audio." *SSW* 125 (2016).

[3] Tompson, Jonathan, et al. "Efficient object localization using convolutional networks." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*. 2015.

`adamupdate`

| `crossentropy`

| `dlarray`

| `dlconv`

| `dlfeval`

| `dlgradient`

| `fullyconnect`

| `minibatchqueue`

| `onehotdecode`

| `onehotencode`

| `relu`

| `softmax`