How to extract VGGish features from audio files shorter than 1s?

Question

Robert-Valentin Bencze on 7 Mar 2022

Commented: jibrahim on 10 Mar 2022

feature_extraction_VGGish.m

I used the code from the example presented here: https://www.mathworks.com/help/audio/ref/vggish.html by replacing the line [audioIn,fs0] = audioread('Ambiance-16-44p1-mono-12secs.wav'); with [audioIn,fs0] = audioread('1340-a_h.wav');

The file's original sample rate is 50 kHz and its length is 43501 samples (about 0.87 s). After resampling to 16 kHz, its length becomes 13921 samples.

The attempt to run the attached file returned the following errors:

To reproduce the bug: 1340-a_h.wav is a vocal recording from the Saarbrucken Voice Dataset that can be downloaded here. If the first link is not working, try here: click "Databankanfrage" (Database Request) and select the "Cyste" pathology from the list on the right. Click the blue "Exportieren" (Export) button at the bottom right. Select the blue "Alle" (All) checkbox and the WAV checkbox to the right of "Sprach-Signal" (Speech signal). Click the blue "Ubernehmen" (Take over) button, then click "Herunterladen" (Download).

Answer 1

jibrahim on 9 Mar 2022

Hi Robert-Valentin,

The VGGish network accepts auditory spectrograms that each correspond to roughly one second of audio (0.975 s at 16 kHz), so your file does not contain enough audio to generate even one set of embeddings.
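As a rough check, the numbers from the question can be compared with that requirement. This is only a sketch: the variable names are made up for illustration, and the resampled length is estimated as ceil(length*p/q), which is how the resample function documents its output length.

```matlab
% Numbers taken from the question; 0.975 s is the VGGish input duration.
fsOrig   = 50e3;     % original sample rate (Hz)
nOrig    = 43501;    % original length (samples)
fsTarget = 16e3;     % sample rate VGGish works at (Hz)

durationSec = nOrig/fsOrig;                 % about 0.87 s
nResampled  = ceil(nOrig*fsTarget/fsOrig);  % 13921, matching the question
nRequired   = round(0.975*fsTarget);        % 15600 samples for one spectrogram
nMissing    = nRequired - nResampled;       % 1679 samples short

fprintf('%.3f s of audio, %d samples missing\n', durationSec, nMissing);
```

So the recording falls 1679 samples (roughly 105 ms) short of one full VGGish input.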

One way around this is to pad your input with zeros. For example (after you've resampled to 16 kHz):

audioIn = [audioIn; zeros(round(0.975*16e3) - size(audioIn,1), size(audioIn,2))];

Also, note that these two functions should make your life easier:

vggishPreprocess: Accepts the audio signal and creates the Mel spectrogram for you, including resampling to the right sample rate. No need to do it yourself.
vggishFeatures: Combines Mel spectrogram generation and network inference. You feed the function the audio signal, and it does everything for you and gives you the embeddings.
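A minimal sketch of how these two functions might be used on a padded signal is shown below. It assumes Audio Toolbox, Deep Learning Toolbox, and the downloaded VGGish model are available, and it uses synthetic noise as a stand-in for the real recording. Function names can differ between releases, so check the documentation for your version.

```matlab
% Requires Audio Toolbox, Deep Learning Toolbox, and the VGGish model.
fs = 16e3;
audioIn = 0.01*randn(13921,1);   % synthetic placeholder, same length as the resampled file

% Zero-pad at the end up to 0.975 s (15600 samples at 16 kHz).
nRequired = round(0.975*fs);
audioIn = [audioIn; zeros(max(nRequired - size(audioIn,1), 0), 1)];

% Option 1: only build the Mel spectrograms (to feed the network yourself).
features = vggishPreprocess(audioIn, fs);

% Option 2: spectrogram generation plus inference in one call.
embeddings = vggishFeatures(audioIn, fs);   % one 128-element row per spectrogram

disp(size(features));
disp(size(embeddings));
```

Either path takes the raw signal and its sample rate, so the manual resampling step from the original script is not needed.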

Robert-Valentin Bencze on 10 Mar 2022

Edited: Robert-Valentin Bencze on 10 Mar 2022

Thank you @jibrahim.

However, I expect that if I apply the zero-padding strategy to 20 ms signal windows, the extracted features will be of poor quality (i.e. they will not yield good prediction accuracy if used for a voice pathology classifier based on windowed signals). Am I right?

jibrahim on 10 Mar 2022

Hi Robert-Valentin,

I guess it depends. If you're padding a small amount compared to the length of the audio, the spectrogram will probably still have enough valuable info to give good results. VGGish essentially expects spectrograms that correspond to roughly one second of audio (975 ms), so there is no way around this if your entire signal is shorter than that.
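To put "a small amount" into numbers, here is a rough comparison of the two cases raised in this thread: the full resampled file and a single 20 ms window. The variable names are illustrative only.

```matlab
fs        = 16e3;
nRequired = round(0.975*fs);   % 15600 samples per VGGish input
nFile     = 13921;             % resampled file from the question
nWindow   = round(0.020*fs);   % 320 samples in a 20 ms window

fracFile   = nFile/nRequired;    % about 0.89
fracWindow = nWindow/nRequired;  % about 0.02

fprintf('File: %.1f%% signal, 20 ms window: %.1f%% signal\n', ...
    100*fracFile, 100*fracWindow);
```

The padded file would still be about 89% real audio, while a padded 20 ms window would be about 98% zeros.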

Note that, in some of our examples, we do a similar zero-padding if the signal is too short (see this example), and results are fine. I think we pad zeros on each side rather than putting all the zeros on one side. That might help too.
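Padding on both sides could look roughly like the sketch below. It is not the code from the example mentioned above; the variable names are hypothetical and the input is synthetic noise standing in for a real recording already at 16 kHz.

```matlab
fs        = 16e3;
nRequired = round(0.975*fs);       % 15600 samples
audioIn   = 0.01*randn(13921,1);   % synthetic placeholder signal

% Split the missing samples as evenly as possible between both ends.
nPad    = max(nRequired - size(audioIn,1), 0);
nBefore = floor(nPad/2);
nAfter  = nPad - nBefore;

padded = [zeros(nBefore, size(audioIn,2)); ...
          audioIn; ...
          zeros(nAfter, size(audioIn,2))];
```

Signals that are already long enough pass through unchanged, because nPad is clamped at zero.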
