vadnetPreprocess

Preprocess audio for voice activity detection (VAD) network

Since R2023a

Syntax

features = vadnetPreprocess(audioIn,fs)

Description

features = vadnetPreprocess(audioIn,fs) computes a mel spectrogram of the audio input audioIn, sampled at rate fs, that you can pass to the pretrained VAD network returned by audioPretrainedNetwork.

Examples

Detect Speech with Pretrained VAD Model

Read in an audio signal containing speech and music and listen to the sound.

[audioIn,fs] = audioread("MusicAndSpeech-16-mono-14secs.ogg");
sound(audioIn,fs)

Use vadnetPreprocess to preprocess the audio by computing a mel spectrogram.

features = vadnetPreprocess(audioIn,fs);

Call audioPretrainedNetwork to obtain a pretrained VAD neural network.

net = audioPretrainedNetwork("vadnet");

Pass the preprocessed audio through the network to obtain the probability of speech in each frame.

probs = predict(net,features);

Use vadnetPostprocess to postprocess the network output and determine the boundaries of the speech regions in the signal.

roi = vadnetPostprocess(audioIn,fs,probs)

roi = 2×2

           1       63120
       83600      150000

Plot the audio with the detected speech regions.

vadnetPostprocess(audioIn,fs,probs)

[Figure: "Detected Speech". Amplitude versus Time (s), with the detected speech regions marked.]

Use VAD Neural Network on Streaming Audio

Create a dsp.AudioFileReader object to stream an audio file for processing. Set the SamplesPerFrame property to read 100 ms nonoverlapping chunks from the signal.

afr = dsp.AudioFileReader("MaleVolumeUp-16-mono-6secs.ogg");
analysisDuration = 0.1; % seconds
afr.SamplesPerFrame = floor(analysisDuration*afr.SampleRate);

The vadnet architecture does not retain state between calls, and it performs best when analyzing larger chunks of audio signals. When you use vadnet in a streaming scenario, specific application requirements of accuracy, computational efficiency, and latency dictate the analysis duration and whether to overlap analysis chunks.
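As a rough illustration of the overlap trade-off described above, the following sketch computes the start indices of overlapping analysis chunks using plain indexing. The 1 s chunk length, the 0.5 s overlap, and the random placeholder signal are assumptions chosen for illustration, not recommended settings. The commented lines mark where vadnetPreprocess and predict would be called on each chunk.

```matlab
fs = 16e3;                         % sample rate (Hz), assumed for illustration
x = randn(3*fs,1);                 % placeholder 3 s signal standing in for real audio

chunkLen   = round(1.0*fs);        % analysis duration: 1 s (assumed)
overlapLen = round(0.5*fs);        % overlap between consecutive chunks: 0.5 s (assumed)
step       = chunkLen - overlapLen;

% Start index of every full-length chunk that fits in the signal
starts = 1:step:(numel(x) - chunkLen + 1);

for k = 1:numel(starts)
    chunk = x(starts(k):starts(k) + chunkLen - 1);
    % features = vadnetPreprocess(chunk,fs);
    % probs    = predict(net,features);
end
```

Longer chunks give the network more context per call, and a larger overlap updates the decision more often at the cost of reprocessing samples.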

Create a timescope object to plot the audio signal and the corresponding speech probabilities. Create an audioDeviceWriter to play the audio as you stream it.

scope = timescope(NumInputPorts=2, ...
    SampleRate=afr.SampleRate, ...
    TimeSpanSource="property",TimeSpan=5, ...
    YLimits=[-1.2,1.2], ...
    ShowLegend=true,ChannelNames=["Audio","Speech Probability"]);
adw = audioDeviceWriter(afr.SampleRate);

Call audioPretrainedNetwork to obtain a pretrained VAD neural network.

net = audioPretrainedNetwork("vadnet");

In a streaming loop:

1. Read in a 100 ms chunk from the audio file.
2. Preprocess the audio into a mel spectrogram using vadnetPreprocess.
3. Use the VAD network to predict the probability of speech in each frame of the spectrogram. Replicate the probabilities to correspond to each sample in the audio signal.
4. Plot the audio signal and the probabilities of speech.
5. Play the audio with the device writer.

hop = 0.01 * afr.SampleRate;
while ~isDone(afr)
    audioIn = afr();

    features = vadnetPreprocess(audioIn,afr.SampleRate);
    probs = predict(net,features);
    % Replicate probs to correspond to samples in audioIn
    probs = repelem(probs,hop)';
    probs = probs((hop/2)+1:end-hop/2);

    scope(audioIn,probs)
    adw(audioIn);
end
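The replicate-and-trim step in the loop can be checked with simple arithmetic. The sketch below assumes a 16 kHz sample rate and that a centered STFT with a 10 ms hop yields one frame per hop plus one (an inference from the Algorithms description, not a documented guarantee). The probability values are placeholders rather than network output.

```matlab
fs = 16e3;                          % assumed sample rate (Hz)
hop = round(0.01*fs);               % 10 ms hop -> 160 samples
numSamples = round(0.1*fs);         % 100 ms chunk -> 1600 samples

% Assumed frame count for a centered STFT: one frame per hop, plus one
numFrames = numSamples/hop + 1;     % 11 frames

probs = linspace(0,1,numFrames)';   % placeholder probabilities, one per frame

% Repeat each frame probability hop times, then trim half a hop from each
% end so the result lines up sample-for-sample with the audio chunk
perSample = repelem(probs,hop);                      % 11*160 = 1760 values
perSample = perSample((hop/2)+1:end-hop/2);          % 1760 - 160 = 1600 values
```

Under these assumptions the trimmed vector has exactly one probability per audio sample, which is what timescope needs to plot both signals on the same time axis.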

Input Arguments

`audioIn` — Audio input
column vector

Audio input signal, specified as a column vector (single channel).

Data Types: single | double

`fs` — Sample rate (Hz)
positive scalar

Sample rate in Hz, specified as a positive scalar.

Data Types: single | double

Output Arguments

`features` — Mel spectrogram
40-by-T matrix

Mel spectrogram, returned as a 40-by-T matrix, where T is the number of spectra in the spectrogram.

Algorithms

The vadnetPreprocess function preprocesses the audio data using the following steps.

1. Resample the audio to 16 kHz.
2. Compute a centered short-time Fourier transform (STFT) using a 25 ms periodic Hamming window and 10 ms hop length. Pad the signal so that the first window is centered at 0 s.
3. Convert the STFT to a power spectrogram.
4. Apply a mel filter bank with 40 bands to obtain a mel spectrogram.
5. Convert the mel spectrogram to a log scale.
6. Standardize each of the mel bands to have zero mean and standard deviation of 1.
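The steps above can be approximated with standard Signal Processing Toolbox and Audio Toolbox functions. This is a minimal sketch, not the internal implementation of vadnetPreprocess: the FFT length of 512, the zero-padding used for centering, the eps offset inside the logarithm, and the default mel filter bank design are all assumptions, so the values will generally differ from the function's actual output.

```matlab
fsIn = 48e3;                         % assumed input sample rate (Hz)
x = randn(fsIn,1);                   % placeholder 1 s signal standing in for real audio

% 1. Resample to 16 kHz
fs = 16e3;
[p,q] = rat(fs/fsIn);
x = resample(x,p,q);

% 2. Centered STFT: 25 ms periodic Hamming window, 10 ms hop
winLen = round(0.025*fs);            % 400 samples
hop    = round(0.010*fs);            % 160 samples
win    = hamming(winLen,"periodic");
fftLen = 512;                        % assumed FFT length
xPad   = [zeros(winLen/2,1); x; zeros(winLen/2,1)];  % zero-padding is an assumption
S = stft(xPad,fs,Window=win,OverlapLength=winLen-hop, ...
    FFTLength=fftLen,FrequencyRange="onesided");

% 3. Power spectrogram
P = abs(S).^2;

% 4. 40-band mel filter bank (default design options assumed)
fb = designAuditoryFilterBank(fs,FFTLength=fftLen,NumBands=40);
M = fb*P;

% 5. Log scale (eps avoids log(0); the offset is an assumption)
L = log(M + eps);

% 6. Standardize each band across time
features = (L - mean(L,2))./std(L,0,2);
```

With this padding, a 1 s signal at 16 kHz produces a 40-by-101 matrix, consistent with the 40-by-T output shape described for vadnetPreprocess.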

Extended Capabilities

C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.

GPU Arrays
Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.

This function fully supports GPU arrays. For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).
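A minimal sketch of GPU usage, assuming Parallel Computing Toolbox and a supported GPU are available. It reuses the audio file from the first example and falls back to the CPU when no GPU is detected.

```matlab
[audioIn,fs] = audioread("MusicAndSpeech-16-mono-14secs.ogg");

% Move the audio to the GPU only if one is available
if canUseGPU
    audioIn = gpuArray(audioIn);
end

features = vadnetPreprocess(audioIn,fs);

% Bring the result back to host memory (no effect if it is already there)
features = gather(features);
```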

Version History

Introduced in R2023a
