Audio Toolbox and the Audio Toolbox Interface for SpeechBrain and Torchaudio Libraries enable advanced signal processing and analysis tasks on audio and speech signals with pretrained AI models.
With individual function calls, and without requiring any deep learning expertise, you can:
- Transcribe speech with automatic speech recognition (ASR) using speech-to-text (STT) pipelines
- Synthesize speech using text-to-speech (TTS) pipelines
- Detect speech with voice activity detection (VAD), identify spoken languages, and classify sounds
- Enroll and identify speakers via speaker recognition deep learning models and machine learning pipelines
- Separate speech sources in a cocktail party problem and enhance and denoise speech signals
- Estimate musical pitch and extract embeddings from audio, speech, and music signals
The functions use pretrained machine learning and deep learning models, and are run using a combination of MATLAB, Python®, and PyTorch®.
Audio Toolbox Interface for SpeechBrain and Torchaudio Libraries
The Audio Toolbox Interface for SpeechBrain and Torchaudio Libraries enables the use of a collection of pretrained AI models with Audio Toolbox functions for signal processing and signal analysis.
The interface automates the installation of Python and PyTorch, and it downloads selected deep learning models from the SpeechBrain and Torchaudio libraries. Once installed, the interface enables the following functions to run using local AI models:
- speech2text accepts a speechClient object with the model set to emformer or whisper, in addition to the local wav2vec model and the cloud service options Google, IBM, Microsoft, and Amazon. Using whisper also requires downloading the model weights separately, as described in Download Whisper Speech-to-Text Model.
- text2speech accepts a speechClient object with the model set to hifigan, in addition to the cloud service options Google, IBM, Microsoft, and Amazon.
The speech2text and text2speech functions accept and return text strings and audio samples directly, so you do not need to write any signal preprocessing, feature extraction, model prediction, or output postprocessing code.
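As a minimal sketch of this workflow, the following code transcribes a recording with a local model and synthesizes speech from a text string. The file name is hypothetical, and the exact model names and return types (a string or a table of segments) depend on the speechClient configuration you choose:

```matlab
% Read any speech recording (file name is illustrative).
[audioIn,fs] = audioread("speech.wav");

% Transcribe locally with a pretrained ASR model.
transcriber = speechClient("wav2vec2.0");
transcript  = speech2text(transcriber,audioIn,fs);

% Synthesize speech locally from a text string.
synthesizer = speechClient("hifigan");
[audioOut,fsOut] = text2speech(synthesizer,"Hello from Audio Toolbox.");
```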

Ready-to-Use AI with Additional Functions for Speech and Audio
Audio Toolbox includes additional functions, such as classifySound, separateSpeakers, enhanceSpeech, detectspeechnn, pitchnn, and identifyLanguage. These functions let you apply advanced deep learning models to process and analyze audio signals without requiring AI expertise, and they do not require the Audio Toolbox Interface for SpeechBrain and Torchaudio Libraries.
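The sketch below shows how a few of these functions are typically called on a raw signal. The file name is hypothetical; each function downloads its pretrained model on first use, and some functions may impose sample-rate requirements (for example, resampling to 16 kHz before speech enhancement):

```matlab
% Load an audio recording (file name is illustrative).
[audioIn,fs] = audioread("multispeaker.wav");

sounds  = classifySound(audioIn,fs);   % detected sound class labels
rois    = detectspeechnn(audioIn,fs);  % speech region boundaries in samples
cleaned = enhanceSpeech(audioIn,fs);   % denoised speech signal
f0      = pitchnn(audioIn,fs);         % pitch estimates over time
```

Each call encapsulates the preprocessing, model inference, and postprocessing steps, so the inputs and outputs stay in familiar signal and label formats.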
Using MATLAB with PyTorch for Deep Learning Model Development
MATLAB and PyTorch users who are familiar with deep learning can use both languages together to develop and train AI models, including through co-execution and model exchange workflows.
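As one illustration of co-execution, MATLAB can call PyTorch directly through its Python interface. This sketch assumes a Python environment with PyTorch installed and configured via pyenv; the tensor values are illustrative:

```matlab
% Import PyTorch through the MATLAB-Python interface.
torch = py.importlib.import_module("torch");

% Build a tensor from MATLAB data and run a PyTorch operation.
x = torch.tensor(py.list({1.0,-2.0,3.0}));
y = torch.relu(x);

% Convert the result back into a MATLAB array.
yMat = double(y.numpy());
```

Model exchange works in the other direction as well: networks can be imported from or exported to PyTorch via the ONNX format or dedicated import functions.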
Learn more: