Low LSTM Accuracy in Speech Recognition
Hello everyone, I am applying LSTM to speech emotion recognition. I have performed feature extraction using MFCC, resulting in a matrix of dimensions 60,575 × 39. I subsequently transformed this matrix into a cell array named "AllCellTrain" with dimensions 280 × 1, containing signals of varying sizes, as illustrated in the image below. I then utilized "AllCellTrain" as input for the trainNetwork function, along with the labels YCA, network layers, and training options. However, I encountered a significant issue with accuracy, achieving only around 20%. I'm unsure where I may have made a mistake. Could someone please offer some assistance?
num_hidden_units = 1024;
layers = [
    sequenceInputLayer(num_features)
    lstmLayer(num_hidden_units, 'OutputMode', 'last')
    fullyConnectedLayer(num_classes)
    softmaxLayer
    classificationLayer];
% Specify the training options
max_epochs = 36;
mini_batch_size = 28;
initial_learning_rate = 0.001;
options = trainingOptions('adam', ...
    'MaxEpochs', max_epochs, ...
    'MiniBatchSize', mini_batch_size, ...
    'InitialLearnRate', initial_learning_rate, ...
    'SequenceLength', 'shortest', ...
    'Shuffle', 'every-epoch', ...
    'ExecutionEnvironment', 'gpu', ...
    'Verbose', false, ...
    'Plots', 'training-progress');
net = trainNetwork(AllCellTrain, YCA, layers, options);
predicted_labels = classify(net, AllCellTest,'ExecutionEnvironment','gpu');
acc = mean(predicted_labels == YCT)
4 Comments
Christopher McCausland
on 6 Nov 2023
To me this looks like classic overfitting. Your model appears to train well and learn features, but those features are overfitted to the training data and are not representative of generalised data.
A few things to consider:
- Do you have multiple speakers? If so, how do you pick which speakers are in the test/train set?
- You have 280 input sequences and seven classes. If the data is perfectly balanced, you have 40 observations per class. Is this enough?
- Can you include a validation split to prevent overfitting?
- These are just a few ways to prevent overfitting and ensure your data is appropriate for training; there are many others which I would suggest you take a look at.
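The class-balance and validation-split points above can be sketched roughly as follows. This is only an illustration under stated assumptions: the synthetic AllCellTrain and YCA stand in for the real data (39 features, 280 sequences, 7 balanced classes, as described in the question), and the 20% hold-out fraction, the validation frequency and the patience value are arbitrary choices, not recommendations from the thread.

```matlab
% Hypothetical stand-ins so the sketch runs on its own; replace them with the
% real AllCellTrain (280x1 cell) and YCA (280x1 categorical) from the question.
rng(0);                                         % reproducible split
num_obs = 280; num_features = 39; num_classes = 7;
seq_lengths  = randi([100 300], num_obs, 1);    % assumed sequence lengths
AllCellTrain = arrayfun(@(n) randn(num_features, n), seq_lengths, ...
    'UniformOutput', false);                    % each cell: features x time steps
YCA = categorical(repmat((1:num_classes)', num_obs / num_classes, 1));

% 1) Class balance: number of observations per class.
classes      = categories(YCA);
class_counts = countcats(YCA);
disp(table(classes, class_counts, 'VariableNames', {'Class', 'Count'}));

% 2) Stratified hold-out: set aside 20% of each class for validation.
val_fraction = 0.2;                             % assumed value
is_val = false(num_obs, 1);
for k = 1:numel(classes)
    idx   = find(YCA == classes{k});
    idx   = idx(randperm(numel(idx)));          % shuffle within the class
    n_val = round(val_fraction * numel(idx));
    is_val(idx(1:n_val)) = true;
end
XTrain = AllCellTrain(~is_val);  YTrain = YCA(~is_val);
XVal   = AllCellTrain(is_val);   YVal   = YCA(is_val);

% 3) Same options as in the question, plus validation monitoring.
%    ExecutionEnvironment is left at its default so the sketch runs without a GPU.
options = trainingOptions('adam', ...
    'MaxEpochs', 36, ...
    'MiniBatchSize', 28, ...
    'InitialLearnRate', 0.001, ...
    'SequenceLength', 'shortest', ...
    'Shuffle', 'every-epoch', ...
    'ValidationData', {XVal, YVal}, ...
    'ValidationFrequency', 8, ...               % 224 / 28 = 8 iterations per epoch
    'ValidationPatience', 5, ...                % stop if validation loss stalls
    'Verbose', false, ...
    'Plots', 'training-progress');
```

Training would then be net = trainNetwork(XTrain, YTrain, layers, options), and a growing gap between the training and validation curves would point towards overfitting. Note that this split is by utterance only. If speaker IDs are available, the split would need to be done by speaker instead, so that no speaker appears in both sets.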
In terms of the CNN performance, were the test/train sets the same, and how many epochs did you train the CNN for?
Answers (0)