How to extract VGGish features from audio files shorter than 1s?

I used the code from the example presented here, replacing the line [audioIn,fs0] = audioread('Ambiance-16-44p1-mono-12secs.wav'); with [audioIn,fs0] = audioread('1340-a_h.wav');
The file's original sample rate is 50 kHz and its length is 43501 samples. After resampling to 16 kHz, its length becomes 13921 samples (about 0.87 s).
Running the attached file returned the following errors:
To reproduce the bug, use 1340-a_h.wav, a vocal recording from the Saarbrucken Voice Dataset that can be downloaded here. If the first link is not working, try here: click "Databankanfrage" (Database Request) and select the "Cyste" pathology from the list on the right. Click the blue "Exportieren" (Export) button on the bottom right. Tick the blue "Alle" (All) checkbox and the WAV checkbox to the right of "Sprach-Signal" (Speech signal). Click the blue "Ubernehmen" (Take over) button, then click "Herunterladen" (Download).

Answers (1)

jibrahim on 9 Mar 2022
Hi Robert-Valentin,
The VGGish network accepts auditory spectrograms that correspond to roughly one second of audio (975 ms), so your 0.87 s signal does not contain enough audio to generate a set of embeddings.
One way around this is to pad your input with zeros. For example (after you've resampled to 16 kHz):
audioIn = [audioIn; zeros(round(0.975*16e3) - size(audioIn,1), 1)]; % pad with zeros up to 15600 samples (975 ms at 16 kHz)
Also, note that these two functions should make your life easier:
  • vggishPreprocess: Accepts the audio signal and creates the Mel spectrogram for you, including resampling to the right sample rate. No need to do it yourself.
  • vggishFeatures: Combines Mel spectrogram generation and network inference. You feed the function the audio signal, and it does everything for you and gives you the embeddings.
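As a rough sketch of how the zero-padding and vggishFeatures fit together, the snippet below pads a short signal and then extracts embeddings. The random signal is only a stand-in for the resampled recording (an assumption made so the snippet runs without the file), and the variable names minLen and numPad are illustrative. It assumes Audio Toolbox and the pretrained VGGish model are installed.

```matlab
% Stand-in for the resampled recording: 13921 samples at 16 kHz (about 0.87 s).
% With the real file, audioIn and fs would come from audioread instead.
fs = 16e3;
audioIn = 0.1*randn(13921, 1);

% VGGish needs about 975 ms of audio, i.e. 15600 samples at 16 kHz.
minLen = round(0.975*fs);

% Append zeros only if the signal is shorter than the minimum length.
numPad = max(minLen - size(audioIn, 1), 0);
audioIn = [audioIn; zeros(numPad, 1)];

% Mel spectrogram generation and network inference in one call.
embeddings = vggishFeatures(audioIn, fs);
disp(size(embeddings))
```

Each row of embeddings is one 128-element VGGish embedding. Because vggishFeatures handles the spectrogram generation itself, only the length check has to happen beforehand.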
jibrahim on 10 Mar 2022
Hi Robert-Valentin,
I guess it depends. If you're padding a small amount compared to the length of the audio, the spectrogram will probably still have enough valuable info to give good results. VGGish essentially expects spectrograms that correspond to roughly one second of audio (975 ms), so there is no way around this if your entire signal is shorter than that.
Note that, in some of our examples, we do a similar zero-padding if the signal is too short (see this example), and results are fine. I think we pad zeros on each side rather than put all the zeros at the front. That might help too.
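Padding on each side, as mentioned above, could look roughly like the following. This is a minimal sketch rather than the code from the referenced example: the random signal is again a stand-in for the resampled recording, and the names padBefore, padAfter and padded are illustrative.

```matlab
% Stand-in for the resampled recording: 13921 samples at 16 kHz.
fs = 16e3;
audioIn = 0.1*randn(13921, 1);

% Minimum length VGGish needs: 975 ms, i.e. 15600 samples at 16 kHz.
minLen = round(0.975*fs);

% Split the missing samples as evenly as possible between both ends.
numPad    = max(minLen - size(audioIn, 1), 0);
padBefore = floor(numPad/2);
padAfter  = numPad - padBefore;

% Center the original signal between the two blocks of zeros.
padded = [zeros(padBefore, 1); audioIn; zeros(padAfter, 1)];
```

The padded signal can then be passed to vggishPreprocess or vggishFeatures together with fs.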
