Low LSTM Accuracy in Speech Recognition

Hamza on 31 Oct 2023
Hello everyone, I am applying an LSTM to speech emotion recognition. I performed feature extraction using MFCC, which resulted in a matrix of dimensions 60,575 × 39. I then converted this matrix into a cell array named "AllCellTrain" of dimensions 280 × 1, containing sequences of varying lengths, as illustrated in the image below. I used "AllCellTrain" as the input to the trainNetwork function, along with the labels YCA, the network layers, and the training options. However, the accuracy is very low, only around 20%, and I'm not sure where I have made a mistake. Could someone please offer some assistance?
% Network: 39-dimensional MFCC feature vectors in, 7 emotion classes out
num_features = 39;
num_classes = 7;
num_hidden_units = 1024;
layers = [
    sequenceInputLayer(num_features)
    lstmLayer(num_hidden_units, 'OutputMode', 'last')
    fullyConnectedLayer(num_classes)
    softmaxLayer
    classificationLayer];

% Specify the training options
max_epochs = 36;
mini_batch_size = 28;
initial_learning_rate = 0.001;
options = trainingOptions('adam', ...
    'MaxEpochs', max_epochs, ...
    'MiniBatchSize', mini_batch_size, ...
    'InitialLearnRate', initial_learning_rate, ...
    'SequenceLength', 'shortest', ...
    'Shuffle', 'every-epoch', ...
    'ExecutionEnvironment', 'gpu', ...
    'Verbose', false, ...
    'Plots', 'training-progress');

% Train on the 280x1 cell array of sequences and the categorical labels YCA
net = trainNetwork(AllCellTrain, YCA, layers, options);

% Evaluate on the held-out test set
predicted_labels = classify(net, AllCellTest, 'ExecutionEnvironment', 'gpu');
acc = mean(predicted_labels == YCT)
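
For reference, the data layout can be sanity-checked before training: trainNetwork expects each cell of a sequence cell array to be a numFeatures-by-numTimeSteps matrix, so with sequenceInputLayer(39) every cell should be 39-by-T, and YCA should be a 280-by-1 categorical aligned with the sequences. A minimal check (reusing num_features from above; AllCellTrain and YCA are assumed to already be in the workspace):

% Every sequence must have num_features (39) rows; if the cells are T-by-39, transpose them
rowCounts = cellfun(@(c) size(c, 1), AllCellTrain);
if ~all(rowCounts == num_features)
    AllCellTrain = cellfun(@transpose, AllCellTrain, 'UniformOutput', false);
end

% The labels should be a 280x1 categorical that lines up with the cell array
disp(size(AllCellTrain))
disp(class(YCA))
summary(YCA)    % per-class counts, if YCA is categorical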
  4 Comments
Hamza on 6 Nov 2023 (edited 6 Nov 2023)
Hi @Christopher McCausland, thanks for your answer. I am trying to classify 7 emotion classes. For your information, I have used the same data with a 1D CNN and got 90% accuracy, so I don't know what the issue is with the LSTM. Also, when I shuffled the columns (the features) I got different results, which shouldn't be the case. You'll find the curve attached. Thanks in advance!
Christopher McCausland on 6 Nov 2023
Hi @Hamza,
To me this looks like classic overfitting: your model appears to train well and learn features, but those features are overfitted to the training data and are not representative of generalised data.
A few things to consider:
  1. Do you have multiple speakers? If so, how do you pick which speakers go in the test/train sets?
  2. You have 280 input sequences and seven classes, so if the data is perfectly balanced you have 40 observations per class. Is this enough?
  3. Can you include a validation split to help catch overfitting? (A rough sketch is at the end of this comment.)
  4. These are just a few ways to prevent overfitting and ensure your data is appropriate for training; there are many others which I would suggest you take a look at.
In terms of the CNN performance, were the test/train sets the same, and how many epochs did you train the CNN for?
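
For point 3, a rough sketch of what that could look like. It assumes YCA is a categorical label vector and that cvpartition (Statistics and Machine Learning Toolbox) is available; the XTrain/XVal names are just placeholders, and it reuses the layers and option values from your post. If you do have multiple speakers, split by speaker rather than at random so that no speaker appears in both sets:

cv = cvpartition(YCA, 'HoldOut', 0.2);           % stratified 80/20 split by class
XTrain = AllCellTrain(training(cv));
YTrain = YCA(training(cv));
XVal   = AllCellTrain(test(cv));
YVal   = YCA(test(cv));

summary(YTrain)                                  % check the per-class counts

options = trainingOptions('adam', ...
    'MaxEpochs', max_epochs, ...
    'MiniBatchSize', mini_batch_size, ...
    'InitialLearnRate', initial_learning_rate, ...
    'SequenceLength', 'shortest', ...
    'Shuffle', 'every-epoch', ...
    'ValidationData', {XVal, YVal}, ...          % monitor loss on held-out data
    'ValidationFrequency', 10, ...
    'ValidationPatience', 5, ...                 % stop when validation loss stops improving
    'ExecutionEnvironment', 'gpu', ...
    'Verbose', false, ...
    'Plots', 'training-progress');

net = trainNetwork(XTrain, YTrain, layers, options);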


Answers (0)

Release

R2023b
