# Sequence-to-Sequence Classification Using 1-D Convolutions

This example shows how to classify each time step of sequence data using a generic temporal convolutional network (TCN).

While sequence-to-sequence tasks are commonly solved with recurrent neural network architectures, Bai et al. [1] show that convolutional neural networks can match the performance of recurrent networks on typical sequence modeling tasks or even outperform them. Potential benefits of using convolutional networks include better parallelism, better control over the receptive field size, better control of the network's memory footprint during training, and more stable gradients. Just like recurrent networks, convolutional networks can operate on variable-length input sequences and can be used to model sequence-to-sequence or sequence-to-one tasks.

This example trains a TCN to recognize the activity of a person wearing a smartphone on the body, given time series data representing accelerometer readings in three different directions. The example defines the TCN as a function and trains the model using a custom training loop.

To convolve over the time dimension of the input data (also known as 1-D convolutions), use the dlconv function and specify the dimensions of the weights using the 'WeightsFormat' option or specify the weights as a formatted dlarray.
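As a minimal sketch of this pattern (with hypothetical sizes, not the activity data), a causal 1-D convolution over a 'CTB' formatted dlarray might look like this:

% Hypothetical input: 3 channels, 10 time steps, 1 observation.
dlX = dlarray(rand(3,10,1),'CTB');

% Weights in 'TCU' format: filter size 3 over time, 3 input channels, 8 filters.
weights = rand(3,3,8);
bias = zeros(8,1);

% Left-pad by filterSize - 1 = 2 time steps so that each output depends
% only on the current and past time steps (causal convolution).
dlY = dlconv(dlX,weights,bias,'WeightsFormat','TCU','Padding',[2 0]);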

Load the human activity recognition data. The data contains seven time series of sensor data obtained from a smartphone worn on the body. Each sequence has three features and varies in length. The three features correspond to the accelerometer readings in three different directions. Six sequences are used for training and one sequence is used for testing after training.

Create arrayDatastore objects containing the data and combine them using the combine function.

dsXTrain = arrayDatastore(s.XTrain,'OutputType','same');
dsYTrain = arrayDatastore(s.YTrain,'OutputType','same');
dsTrain = combine(dsXTrain,dsYTrain);

View the number of observations in the training data.

numObservations = numel(s.XTrain)
numObservations = 6

View the number of classes in the training data.

classes = categories(s.YTrain{1});
numClasses = numel(classes)
numClasses = 5

Visualize one of the training sequences in a plot. Plot the features of the first training sequence and the corresponding activity.

figure
for i = 1:3
    X = s.XTrain{1}(i,:);

    subplot(4,1,i)
    plot(X)
    ylabel("Feature " + i + newline + "Acceleration")
end

subplot(4,1,4)

hold on
plot(s.YTrain{1})
hold off

xlabel("Time Step")
ylabel("Activity")

subplot(4,1,1)
title("Training Sequence 1")

### Define Deep Learning Model

The main building block of a temporal convolutional network is a dilated causal convolution layer which operates over the time steps of each sequence. In this context, "causal" means that the activations computed for a particular time step cannot depend on activations from future time steps.

In order to build up context from previous time steps, multiple convolutional layers are typically stacked on top of each other. To achieve large receptive field sizes the dilation factor of subsequent convolution layers is increased exponentially as shown in the image below. Assuming the dilation factor of the k-th convolutional layer is ${2}^{\left(\mathit{k}-1\right)}$ and the stride is 1, then the receptive field size of such a network can be computed as $\mathit{R}=\left(\mathit{f}-1\right)\left({2}^{\mathit{K}}-1\right)+1$, where $\mathit{f}$ is the filter size and $\mathit{K}$ is the number of convolutional layers. Change the filter size and number of layers to easily adjust the receptive field size and the number of learnable parameters as necessary for the data and task at hand.
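As a quick check of the formula, using the filter size and number of dilated layers that this example defines below:

f = 3;  % filter size
K = 4;  % number of dilated convolutional layers
R = (f-1)*(2^K - 1) + 1
R = 31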

One of the disadvantages of TCNs compared to RNNs is the higher memory footprint during inference. The entire raw sequence is required to compute the next time step. To reduce inference time and memory consumption, especially for step-ahead predictions, it can be beneficial to train with the smallest sensible receptive field size $\mathit{R}$ and only perform prediction with the last $\mathit{R}$ time steps of the input sequence.
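This idea can be sketched as follows (hypothetical variable names; X is assumed to be a numChannels-by-numTimeSteps sequence and R the receptive field size of the trained network):

R = 31;                    % receptive field size of the trained network
XLast = X(:,end-R+1:end);  % keep only the last R time steps
dlXLast = dlarray(XLast,'CTB');

doTraining = false;
dlY = model(parameters,hyperparameters,dlXLast,doTraining);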

A general TCN architecture (as described in [1]) consists of multiple residual blocks, each containing two sets of dilated causal convolution layers with the same dilation factor, followed by normalization, ReLU activation, and spatial dropout layers. The input to each block is added to the output of the block (including a 1-by-1 convolution on the input when the number of channels between the input and output do not match) and a final activation function is applied.

#### Define and Initialize Model Parameters

Specify the parameters for the TCN architecture: four residual blocks, each containing dilated causal convolution layers with 175 filters of size 3.

numBlocks = 4;
numFilters = 175;
filterSize = 3;
dropoutFactor = 0.05;

To pass the model hyperparameters to the model functions (the number of blocks and the dropout factor), create a structure containing these values.

hyperparameters = struct;
hyperparameters.NumBlocks = numBlocks;
hyperparameters.DropoutFactor = dropoutFactor;

Create a structure containing dlarray objects for all the learnable parameters of the model based on the number of input channels and the hyperparameters that define the model architecture. Each residual block requires weights parameters and bias parameters for each of the two convolution operations. The first residual block usually also requires weights and biases for an additional convolution operation with filter size 1. The final fully connected operation requires a weights and bias parameter as well. Initialize the learnable layer weights and biases using the initializeGaussian and initializeZeros functions, respectively. These functions are attached to this example as a supporting file.

numInputChannels = 3;

parameters = struct;
numChannels = numInputChannels;

mu = 0;
sigma = 0.01;

for k = 1:numBlocks
    parametersBlock = struct;
    blockName = "Block"+k;

    % 1-D Convolution.
    sz = [filterSize numChannels numFilters];
    parametersBlock.Conv1.Weights = initializeGaussian(sz,mu,sigma);
    parametersBlock.Conv1.Bias = initializeZeros([numFilters 1]);

    % 1-D Convolution.
    sz = [filterSize numFilters numFilters];
    parametersBlock.Conv2.Weights = initializeGaussian(sz,mu,sigma);
    parametersBlock.Conv2.Bias = initializeZeros([numFilters 1]);

    % If the input and output of the block have different numbers of
    % channels, then add a convolution with filter size 1.
    if numChannels ~= numFilters

        % 1-D Convolution.
        sz = [1 numChannels numFilters];
        parametersBlock.Conv3.Weights = initializeGaussian(sz,mu,sigma);
        parametersBlock.Conv3.Bias = initializeZeros([numFilters 1]);
    end
    numChannels = numFilters;

    parameters.(blockName) = parametersBlock;
end

% Fully connect.
sz = [numClasses numChannels];
parameters.FC.Weights = initializeGaussian(sz,mu,sigma);
parameters.FC.Bias = initializeZeros([numClasses 1]);

View the network parameters.

parameters
parameters = struct with fields:
Block1: [1×1 struct]
Block2: [1×1 struct]
Block3: [1×1 struct]
Block4: [1×1 struct]
FC: [1×1 struct]

View the parameters of the first block.

parameters.Block1
ans = struct with fields:
Conv1: [1×1 struct]
Conv2: [1×1 struct]
Conv3: [1×1 struct]

View the parameters of the first convolutional operation of the first block.

parameters.Block1.Conv1
ans = struct with fields:
Weights: [3×3×175 dlarray]
Bias: [175×1 dlarray]

#### Define Model and Model Gradients Functions

Create the function model, listed in the Model Function section at the end of the example, that computes the outputs of the deep learning model. The function model takes the input data, the learnable model parameters, the model hyperparameters, and a flag that specifies whether the model should return outputs for training or prediction. The network outputs the predictions for the labels at each time step of the input sequence.

Create the function modelGradients, listed in the Model Gradients Function section at the end of the example, that takes a mini-batch of input data, the corresponding target sequences, and the parameters of the network, and returns the gradients of the loss with respect to the learnable parameters and the corresponding loss.

### Specify Training Options

Specify a set of training options used in the custom training loop.

• Train for 30 epochs with mini-batch size 1.

• Multiply the learn rate by 0.1 every 12 epochs.

• Clip the gradients using the ${L}_{2}$ norm with threshold 1.

maxEpochs = 30;
miniBatchSize = 1;
initialLearnRate = 0.001;
learnRateDropFactor = 0.1;
learnRateDropPeriod = 12;
gradientThreshold = 1;

To monitor the training progress, you can plot the training loss after each iteration. Create the variable plots that contains "training-progress". If you do not want to plot the training progress, then set this value to "none".

plots = "training-progress";

### Train Model

Train the network via stochastic gradient descent by looping over the sequences in the training dataset, computing parameter gradients, and updating the network parameters via the Adam update rule. This process is repeated multiple times (referred to as epochs) until training converges and the maximum number of epochs is reached.

Use minibatchqueue to process and manage mini-batches of sequences during training. For each mini-batch:

• Preprocess the data using the custom mini-batch preprocessing function preprocessMiniBatch (defined at the end of this example) that returns padded predictors, padded one-hot encoded targets, and the padding mask. The function has three outputs, so specify three output variables for the minibatchqueue object.

• Convert the outputs to dlarray objects with format 'CTB' (channel, time, batch). The minibatchqueue object, by default, outputs data with underlying type single.

• Train on a GPU if one is available. The minibatchqueue object, by default, converts each output to a gpuArray if a GPU is available. Using a GPU requires Parallel Computing Toolbox™ and a CUDA® enabled NVIDIA® GPU. For more information, see GPU Support by Release (Parallel Computing Toolbox).

numDatastoreOutputs = 3;
mbq = minibatchqueue(dsTrain,numDatastoreOutputs,...
'MiniBatchSize',miniBatchSize,...
'MiniBatchFormat',{'CTB','CTB','CTB'},...
'MiniBatchFcn', @preprocessMiniBatch);

For each epoch, shuffle the training data. For each mini-batch:

• Evaluate the model gradients and loss using dlfeval and the modelGradients function.

• Clip any large gradients using the function thresholdL2Norm, listed at the end of the example, and the dlupdate function.

• Update the network parameters using the adamupdate function.

• Update the training progress plot.

After completing learnRateDropPeriod epochs, reduce the learn rate by multiplying the current learning rate by learnRateDropFactor.

Initialize the learning rate which will be multiplied by the LearnRateDropFactor value every LearnRateDropPeriod epochs.

learnRate = initialLearnRate;

Initialize the moving average of the parameter gradients and the element-wise squares of the gradients used by the Adam optimizer.

trailingAvg = [];
trailingAvgSq = [];

Initialize a plot showing the training progress.

if plots == "training-progress"
    figure
    lineLossTrain = animatedline('Color',[0.85 0.325 0.098]);
    ylim([0 inf])
    xlabel("Iteration")
    ylabel("Loss")
    grid on
end

Train the model.

iteration = 0;

start = tic;

% Loop over epochs.
for epoch = 1:maxEpochs

    % Shuffle the data.
    shuffle(mbq)

    % Loop over mini-batches.
    while hasdata(mbq)

        iteration = iteration + 1;

        % Read a mini-batch of data.
        [dlX,dlT,mask] = next(mbq);

        % Evaluate the model gradients and loss using dlfeval.
        [gradients,loss] = dlfeval(@modelGradients,parameters,hyperparameters,dlX,dlT,mask);

        % Clip any large gradients.
        gradients = dlupdate(@(g) thresholdL2Norm(g,gradientThreshold),gradients);

        % Update the network parameters using the Adam optimizer.
        [parameters,trailingAvg,trailingAvgSq] = adamupdate(parameters,gradients, ...
            trailingAvg,trailingAvgSq,iteration,learnRate);

        if plots == "training-progress"
            % Plot training progress.
            D = duration(0,0,toc(start),'Format','hh:mm:ss');

            % Normalize the loss over the sequence lengths.
            numTimeSteps = size(dlX,finddim(dlX,'T'));
            loss = mean(loss ./ numTimeSteps);
            loss = double(gather(extractdata(loss)));

            addpoints(lineLossTrain,iteration,loss)

            title("Epoch: " + epoch + ", Elapsed: " + string(D))
            drawnow
        end
    end

    % Reduce the learning rate after learnRateDropPeriod epochs.
    if mod(epoch,learnRateDropPeriod) == 0
        learnRate = learnRate*learnRateDropFactor;
    end
end

### Test Model

Test the classification accuracy of the model by comparing the predictions on a held-out test set with the true labels for each time step.

Create a datastore containing the test predictors.

dsXTest = arrayDatastore(s.XTest,'OutputType','same');

After training, making predictions on new data does not require the labels. Create a minibatchqueue object containing only the predictors of the test data:

• To ignore the labels for testing, set the number of outputs of the mini-batch queue to 1.

• Specify the same mini-batch size used for training.

• Preprocess the predictors using the preprocessMiniBatchPredictors function, listed at the end of the example.

• For the single output of the datastore, specify the mini-batch format 'CTB' (channel, time, batch).

numDatastoreOutputs = 1;
mbqTest = minibatchqueue(dsXTest,numDatastoreOutputs,...
'MiniBatchSize',miniBatchSize,...
'MiniBatchFormat','CTB', ...
'MiniBatchFcn', @preprocessMiniBatchPredictors);

Loop over the mini-batches and classify the sequences using the modelPredictions function, listed at the end of the example.

predictions = modelPredictions(parameters,hyperparameters,mbqTest,classes);

Evaluate the classification accuracy by comparing the predictions to the test labels.

YTest = s.YTest{1};
accuracy = mean(predictions == YTest)
accuracy = 0.9996

Visualize the test sequence in a plot and compare the predictions with the corresponding test data.

figure
for i = 1:3
    X = s.XTest{1}(i,:);

    subplot(4,1,i)
    plot(X)
    ylabel("Feature " + i + newline + "Acceleration")
end

subplot(4,1,4)

idx = 1;
plot(predictions(idx,:),'.-')
hold on
plot(YTest(idx,:))
hold off

xlabel("Time Step")
ylabel("Activity")
legend(["Predicted" "Test Data"],'Location','northeast')

subplot(4,1,1)
title("Test Sequence")

### Model Function

The function model, described in the Define Model and Model Gradients Functions section of the example, takes as input the model parameters and hyperparameters, input data dlX, and the flag doTraining which specifies whether the model should return outputs for training or prediction. The network outputs the predictions for the labels at each time step of the input sequence. The model consists of multiple residual blocks with exponentially increasing dilation factors. After the last residual block, a final fullyconnect operation maps the output to the number of classes in the target data. The model function creates residual blocks using the residualBlock function listed in the Residual Block Function section of the example.

function dlY = model(parameters,hyperparameters,dlX,doTraining)

numBlocks = hyperparameters.NumBlocks;
dropoutFactor = hyperparameters.DropoutFactor;

dlY = dlX;

% Residual blocks.
for k = 1:numBlocks
    dilationFactor = 2^(k-1);
    parametersBlock = parameters.("Block"+k);

    dlY = residualBlock(dlY,dilationFactor,dropoutFactor,parametersBlock,doTraining);
end

% Fully connect
weights = parameters.FC.Weights;
bias = parameters.FC.Bias;
dlY = fullyconnect(dlY,weights,bias);

% Softmax.
dlY = softmax(dlY);

end

#### Residual Block Function

The function residualBlock implements the core building block of the temporal convolutional network.

To apply 1-D causal dilated convolution, use the dlconv function:

• To convolve over the time dimension, set the 'WeightsFormat' option to 'TCU' (time, channel, unspecified).

• Set the 'DilationFactor' option according to the dilation factor of the residual block.

• To ensure only the past time steps are used, apply padding only at the beginning of the sequence.

function dlY = residualBlock(dlX,dilationFactor,dropoutFactor,parametersBlock,doTraining)

% Convolution options.
filterSize = size(parametersBlock.Conv1.Weights,1);
paddingSize = (filterSize - 1) * dilationFactor;

% Convolution.
weights = parametersBlock.Conv1.Weights;
bias =  parametersBlock.Conv1.Bias;
dlY = dlconv(dlX,weights,bias, ...
    'WeightsFormat','TCU', ...
    'DilationFactor',dilationFactor, ...
    'Padding',[paddingSize 0]);

% Normalization.
dim = find(dims(dlY)=='T');
mu = mean(dlY,dim);
sigmaSq = var(dlY,1,dim);
epsilon = 1e-5;
dlY = (dlY - mu) ./ sqrt(sigmaSq + epsilon);

% ReLU, spatial dropout.
dlY = relu(dlY);
dlY = spatialDropout(dlY,dropoutFactor,doTraining);

% Convolution.
weights = parametersBlock.Conv2.Weights;
bias = parametersBlock.Conv2.Bias;
dlY = dlconv(dlY,weights,bias, ...
    'WeightsFormat','TCU', ...
    'DilationFactor',dilationFactor, ...
    'Padding',[paddingSize 0]);

% Normalization.
dim = find(dims(dlY)=='T');
mu = mean(dlY,dim);
sigmaSq = var(dlY,1,dim);
epsilon = 1e-5;
dlY = (dlY - mu) ./ sqrt(sigmaSq + epsilon);

% ReLU, spatial dropout.
dlY = relu(dlY);
dlY = spatialDropout(dlY,dropoutFactor,doTraining);

% Optional 1-by-1 convolution.
if ~isequal(size(dlX),size(dlY))
weights = parametersBlock.Conv3.Weights;
bias = parametersBlock.Conv3.Bias;
dlX = dlconv(dlX,weights,bias,'WeightsFormat','TCU');
end

dlY = relu(dlX + dlY);

end

### Model Gradients Function

The modelGradients function, described in the Define Model and Model Gradients Functions section of the example, takes as input the model parameters and hyperparameters, a mini-batch of input data dlX, the corresponding target sequences dlT, and the sequence padding mask, and returns the gradients of the loss with respect to the learnable parameters and the corresponding loss. To calculate the masked cross-entropy loss, use the 'Mask' option of the crossentropy function. To compute the gradients, evaluate the modelGradients function using the dlfeval function in the training loop.

function [gradients,loss] = modelGradients(parameters,hyperparameters,dlX,dlT,mask)

doTraining = true;
dlY = model(parameters,hyperparameters,dlX,doTraining);

% Calculate the masked cross-entropy loss, ignoring the padded time steps.
loss = crossentropy(dlY,dlT, ...
    'Mask',mask, ...
    'Reduction','none');

loss = sum(loss,[1 3]);
loss = mean(loss);

% Compute the gradients of the loss with respect to the learnable parameters.
gradients = dlgradient(loss,parameters);

end

### Model Predictions Function

The modelPredictions function takes as input the model parameters and hyperparameters, a minibatchqueue of input data mbq, and the network classes, and computes the model predictions by iterating over all data in the minibatchqueue object. The function uses the onehotdecode function to find the predicted classes with the highest score.

function predictions = modelPredictions(parameters,hyperparameters,mbq,classes)

doTraining = false;

predictions = [];

while hasdata(mbq)
    dlX = next(mbq);

    dlYPred = model(parameters,hyperparameters,dlX,doTraining);

    YPred = onehotdecode(dlYPred,classes,1);

    predictions = [predictions; YPred];
end

predictions = permute(predictions,[2 3 1]);

end

### Spatial Dropout Function

The spatialDropout function performs spatial dropout [3] on the input dlX with dimension labels fmt when the doTraining flag is true. Spatial dropout drops an entire channel of the input data. That is, all time steps of a certain channel are dropped with the probability specified by dropoutFactor. The channels are dropped out independently in the batch dimension.

function dlY = spatialDropout(dlX,dropoutFactor,doTraining)

fmt = dims(dlX);

if doTraining
    % Create a dropout mask that is constant over the time dimension so
    % that entire channels are dropped, independently for each observation.
    maskSize = size(dlX);
    maskSize(fmt == 'T') = 1;

    dropoutScaleFactor = single(1 - dropoutFactor);
    dropoutMask = (rand(maskSize,'like',dlX) > dropoutFactor) / dropoutScaleFactor;

    dlY = dlX .* dropoutMask;
else
    dlY = dlX;
end

end

### Mini-Batch Preprocessing Function

The preprocessMiniBatch function preprocesses the data for training. The function transforms the input sequences to a numeric array of left-padded 1-D sequences and also returns the padding mask.

The preprocessMiniBatch function preprocesses the data using the following steps:

1. Preprocess the predictors using the preprocessMiniBatchPredictors function.

2. One-hot encode the categorical labels of each time step into numeric arrays.

3. Pad the sequences to the same length as the longest sequence in the mini-batch using the padsequences function.

function [XTransformed,YTransformed,mask] = preprocessMiniBatch(XCell,YCell)

% Preprocess the predictors.
XTransformed = preprocessMiniBatchPredictors(XCell);

miniBatchSize = numel(XCell);

% One-hot encode the labels of each time step.
responses = cell(1,miniBatchSize);
for i = 1:miniBatchSize
    responses{i} = onehotencode(YCell{i},1);
end

% Left-pad the targets to the length of the longest sequence in the
% mini-batch and return the corresponding padding mask.
[YTransformed,mask] = padsequences(responses,2,'Direction','left');

end

### Mini-Batch Predictors Preprocessing Function

The preprocessMiniBatchPredictors function preprocesses a mini-batch of predictors by extracting the sequence data from the input cell array and left-padding the sequences so that they have the same length.

function XTransformed = preprocessMiniBatchPredictors(XCell)

% Left-pad the sequences to the length of the longest sequence in the mini-batch.
XTransformed = padsequences(XCell,2,'Direction','left');

end

### Gradient Clipping Function

The thresholdL2Norm function scales the gradient g so that its ${L}_{2}$ norm equals gradientThreshold when the ${L}_{2}$ norm of the gradient is larger than gradientThreshold.

function gradients = thresholdL2Norm(gradients,gradientThreshold)

gradientNorm = sqrt(sum(gradients(:).^2));
if gradientNorm > gradientThreshold
    gradients = gradients * (gradientThreshold/gradientNorm);
end

end

### References

[1] Bai, Shaojie, J. Zico Kolter, and Vladlen Koltun. "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling." arXiv preprint arXiv:1803.01271 (2018).

[2] Van den Oord, Aäron, et al. "WaveNet: A generative model for raw audio." Proceedings of the 9th ISCA Speech Synthesis Workshop (SSW). 2016.

[3] Tompson, Jonathan, et al. "Efficient object localization using convolutional networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.