visqol

Objective metric for perceived audio quality

Since R2024a

    Description

    metric = visqol(degraded,reference,fs) returns the mean opinion score (MOS) calculated by the Virtual Speech Quality Objective Listener (ViSQOL) metric. This metric compares the degraded speech or audio signal with a clean reference signal to measure the perceived audio quality.

    metric = visqol(degraded,reference,fs,Name=Value) specifies options using one or more name-value arguments. For example, visqol(degraded,reference,fs,Mode="speech") computes the ViSQOL metric for speech signals.

    [metric,ftable] = visqol(___) also returns a table containing statistics for each gammatone frequency band.

    [metric,ftable,ttable] = visqol(___) also returns a table containing timing information on the matching of patches between the degraded and reference signals.

    Examples

    Read in an audio signal and average together the stereo channels to convert it to mono. Listen to the audio with sound.

    [rockdrums,fs] = audioread("RockDrums-48-stereo-11secs.mp3");
    rockdrums = mean(rockdrums,2);
    sound(rockdrums,fs)

    Create two noisy signals with different levels of additive pink noise.

    noisy1 = rockdrums + 0.1*pinknoise(size(rockdrums));
    noisy2 = rockdrums + 0.5*pinknoise(size(rockdrums));

    Listen to the first noisy signal.

    sound(noisy1,fs)

    Listen to the second noisy signal.

    sound(noisy2,fs)

    Use visqol with the clean reference signal to measure the audio quality of both noisy signals.

    mos1 = visqol(noisy1,rockdrums,fs)
    mos1 = 
    4.2153
    
    mos2 = visqol(noisy2,rockdrums,fs)
    mos2 = 
    3.2949
    

    Read in an audio file containing speech and noise. Also read in an audio file containing the original clean speech to use as a reference signal.

    [noisySpeech,fs] = audioread("NoisySpeech-16-mono-3secs.ogg");
    reference = audioread("CleanSpeech-16-mono-3secs.ogg");

    Calculate the ViSQOL metric for the noisy speech signal using visqol.

    noisySpeechMOS = visqol(noisySpeech,reference,fs,Mode="speech")
    noisySpeechMOS = 2.9550
    

    Use enhanceSpeech to enhance the speech signal. Evaluate the enhanced signal using the ViSQOL metric and see the improvement compared to the noisy signal.

    enhancedSpeech = enhanceSpeech(noisySpeech,fs);
    enhancedSpeechMOS = visqol(enhancedSpeech,reference,fs,Mode="speech")
    enhancedSpeechMOS = single
        3.2205
    

    Read in an audio signal and average together the stereo channels to convert it to mono.

    [rockdrums,fs] = audioread("RockDrums-48-stereo-11secs.mp3");
    rockdrums = mean(rockdrums,2);

    Create a noisy signal by adding pink noise. Simulate latency by prepending 800 zeros, and simulate packet loss by removing a block of fs/10 samples (100 ms) starting at sample 60001.

    noisy = rockdrums + 0.5*pinknoise(size(rockdrums));
    noisy = [zeros(800,1); noisy([1:60000 60001+fs/10:end],1)];
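The size of the simulated impairments follows directly from the index arithmetic in the line above. The sketch below works it out, assuming a 48 kHz sample rate (inferred from the file name and not read from the file); the variable names are illustrative only.

```matlab
% Assumed sample rate of the RockDrums file (48 kHz, inferred from its name).
fs = 48000;

% Latency introduced by the zeros prepended to the degraded signal.
numLeadingZeros = 800;
latencyMs = 1000*numLeadingZeros/fs;        % about 16.7 ms

% Samples skipped by the index vector [1:60000 60001+fs/10:end].
firstDropped = 60001;
lastDropped = 60001 + fs/10 - 1;
numDropped = lastDropped - firstDropped + 1;
dropMs = 1000*numDropped/fs;                % duration of the removed block
dropStartSec = (firstDropped - 1)/fs;       % position in the clean signal

fprintf("Latency: %.1f ms, dropped: %.0f ms starting at %.2f s\n", ...
    latencyMs,dropMs,dropStartSec)
```

Under that assumption, the degraded signal starts about 16.7 ms late and is missing 100 ms of audio beginning 1.25 s into the clean signal.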

    Call visqol with additional output arguments to get information about the frequency bands and timing alignment used in the ViSQOL computation. The frequency table, ftable, contains statistics about the NSIM for each gammatone frequency band. The timing table, ttable, contains information about the timing alignment between the reference and degraded signals.

    [metrics,ftable,ttable] = visqol(noisy,rockdrums,fs,OutputMetric="MOS and NSIM")
    metrics = 1×2
    
        3.2639    0.7549
    
    
    ftable=32×5 table
        FrequencyBand    FVNSIM     FVNSIM10    FVNSIMSTD    DegradedEnergy
        _____________    _______    ________    _________    ______________
    
               50        0.72699    0.37752      0.43309         23.974    
           91.748        0.81116      0.562      0.30221         22.413    
           139.75        0.83848     0.6642      0.31742         22.104    
           194.93        0.87307    0.50747      0.29136         24.002    
           258.38        0.88401    0.58191      0.24084         22.485    
           331.33        0.82519    0.60645      0.28942         20.694    
           415.19        0.77425    0.54247       0.3168          20.48    
           511.62        0.70612    0.44807      0.40192          19.82    
           622.48        0.61074     0.3624      0.47376         18.911    
           749.95        0.57177    0.30356      0.46667         18.545    
           896.49        0.63006    0.35169      0.42972         18.668    
             1065        0.73258    0.53353      0.33579         18.228    
           1258.7        0.76097    0.44779      0.32103          18.81    
           1481.4        0.81142    0.54293       0.2684         18.695    
           1737.5        0.84971    0.45247      0.26654         19.418    
           2031.9        0.91892    0.58922      0.17226         19.147    
          ⋮
    
    
    ttable=18×4 table
        PatchIndex    Similarity    ReferencePatch    DegradedPatch
        __________    __________    ______________    _____________
    
             1         0.77977       0.28    0.88     0.38    0.98 
             2         0.54941       0.88    1.48     0.98    1.58 
             3         0.74057       1.48    2.08     1.48    2.08 
             4         0.76372       2.08    2.68     2.08    2.68 
             5         0.76232       2.68    3.28     2.68    3.28 
             6          0.6989       3.28    3.88     3.28    3.88 
             7         0.79208       3.88    4.48     3.88    4.48 
             8         0.79986       4.48    5.08     4.48    5.08 
             9         0.80775       5.08    5.68     5.08    5.68 
            10         0.83136       5.68    6.28     5.68    6.28 
            11         0.75019       6.28    6.88     6.28    6.88 
            12         0.71107       6.88    7.48     6.88    7.48 
            13         0.76068       7.48    8.08     7.48    8.08 
            14         0.76206       8.08    8.68     8.08    8.68 
            15         0.78091       8.68    9.28     8.68    9.28 
            16         0.71875       9.28    9.88     9.28    9.88 
          ⋮
    
    

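    The ReferencePatch and DegradedPatch columns in ttable display the start and end times of each patch, in seconds, within the reference and degraded signals, respectively. Compare the two columns to see how the function aligned the signals after the simulated latency and packet loss. In this output, the first two reference patches are matched to degraded patches that start 0.1 seconds later, while the remaining displayed patches are matched at identical times.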
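One way to read the alignment at a glance is to subtract the patch start times. This is a minimal sketch that rebuilds the first four rows of ttable from the output displayed above, so it runs without audio files; the names ttableHead and startOffset are illustrative. With a real ttable, the same subtraction applies directly.

```matlab
% First four rows of ttable, copied from the displayed output.
PatchIndex = (1:4)';
Similarity = [0.77977; 0.54941; 0.74057; 0.76372];
ReferencePatch = [0.28 0.88; 0.88 1.48; 1.48 2.08; 2.08 2.68];
DegradedPatch = [0.38 0.98; 0.98 1.58; 1.48 2.08; 2.08 2.68];
ttableHead = table(PatchIndex,Similarity,ReferencePatch,DegradedPatch);

% Column 1 of each patch variable holds the start time in seconds.
startOffset = ttableHead.DegradedPatch(:,1) - ttableHead.ReferencePatch(:,1);

% Show the offset for each patch, rounded to the table's precision.
disp(round(startOffset,2)')
```

A nonzero offset means the best-matching degraded patch was found at a different time than the reference patch.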

    Input Arguments

    degraded

    Degraded audio signal, specified as a column vector (single channel).

    Data Types: single | double

    reference

    Reference audio signal, specified as a column vector (single channel).

    Data Types: single | double

    fs

    Sample rate in Hz, specified as a positive scalar.

    Data Types: single | double

    Name-Value Arguments

    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Example: Mode="speech"

    Mode

    ViSQOL mode, specified as "audio" or "speech".

    • "audio" — Compute the metric for a generic audio signal. The recommended sample rate is 48 kHz.

    • "speech" — Compute the metric for a speech signal. The recommended sample rate is 16 kHz. In speech mode, the function uses voice activity detection to identify relevant parts of the signal.

    Data Types: char | string
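If a signal pair was recorded at a different rate, one option is to resample both signals to the recommended rate before computing the metric. The sketch below assumes Signal Processing Toolbox is available for resample and uses synthetic stand-in signals so that it needs no audio files. The 44.1 kHz input rate and the test tone are illustrative choices, not requirements of visqol.

```matlab
% Synthetic stand-in signals at 44.1 kHz so the sketch needs no audio files.
fsIn = 44100;
t = (0:3*fsIn-1)'/fsIn;
reference = 0.5*sin(2*pi*440*t).*(1 + 0.5*sin(2*pi*2*t))/1.5;
rng(0)                                   % repeatable noise
degraded = reference + 0.05*randn(size(reference));

% Resample both signals to the rate recommended for "audio" mode.
fsTarget = 48000;
reference48 = resample(reference,fsTarget,fsIn);
degraded48 = resample(degraded,fsTarget,fsIn);

% Compute the metric at the recommended rate.
mos = visqol(degraded48,reference48,fsTarget,Mode="audio")
```

For speech signals, the same pattern applies with a target rate of 16 kHz and Mode="speech".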

    OutputMetric

    Output metric, specified as "MOS", "NSIM", or "MOS and NSIM".

    • "MOS" — The output is a scalar representing the mean opinion score (MOS) in the range [1,5], where a higher value corresponds to higher quality.

    • "NSIM" — The output is a scalar representing the neurogram similarity index measure (NSIM) [2] in the range [-1,1], where 1 corresponds to a perfect similarity between the degraded and reference signals. In practice, the NSIM is generally in the range [0,1].

    • "MOS and NSIM" — The output is a two-element row vector with both metrics in the form [mos nsim], where the first element is the MOS value and the second element is the NSIM value.

    Data Types: char | string
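When both metrics are requested, the two-element output can be unpacked by position. This short sketch uses the values shown in the example output on this page in place of a live call, and checks them against the documented ranges; the variable names are illustrative.

```matlab
% Values copied from the example output on this page. In practice they come from
%   metrics = visqol(degraded,reference,fs,OutputMetric="MOS and NSIM");
metrics = [3.2639 0.7549];

mos = metrics(1);     % first element: mean opinion score
nsim = metrics(2);    % second element: neurogram similarity index measure

% Sanity-check both values against the documented ranges.
assert(mos >= 1 && mos <= 5,"MOS outside [1,5]")
assert(nsim >= -1 && nsim <= 1,"NSIM outside [-1,1]")
fprintf("MOS = %.2f, NSIM = %.2f\n",mos,nsim)
```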

    ScaleMOS

    Scale MOS, specified as true or false. When you set this argument to true, a similarity of 1 produces an MOS of 5. When you set this argument to false, a similarity of 1 produces an MOS less than 5.

    This argument applies only when Mode is "speech".

    Data Types: logical

    SearchWindowSize

    Size of the search window for aligning the signals, specified as a nonnegative integer. The search window size determines how many signal patches the function searches through to align the reference and degraded signals in time. For each patch in the reference signal, the function searches through 2*L+1 patches in the degraded signal, where L is the size of the search window.

    A larger window helps the function match patches that have shifted further in time, for example because of packet loss. A small or zero-length window requires less computation but cannot handle large latency variations.

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
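The cost of a larger window can be estimated from the 2*L+1 relation above. The window sizes in this sketch are arbitrary illustrative values, not defaults or recommendations.

```matlab
% Illustrative search window sizes (arbitrary values chosen for this sketch).
L = [0 10 60];

% Number of degraded-signal patches examined for each reference patch.
patchesSearched = 2*L + 1;

for k = 1:numel(L)
    fprintf("L = %2d -> %3d candidate patches per reference patch\n", ...
        L(k),patchesSearched(k))
end
```

A window size of 0 compares each reference patch with a single degraded patch, which is why it cannot compensate for timing shifts.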

    Output Arguments

    metric

    ViSQOL metric measuring the quality of the degraded signal, returned as a scalar or two-element row vector. The output metric can be NSIM, MOS, or both, depending on the OutputMetric argument.

    ftable

    Frequency information table, returned as a table with the following columns:

    • FrequencyBand — Center frequency of each gammatone frequency band.

    • FVNSIM — NSIM value for each band.

    • FVNSIM10 — Mean of the first decile of the NSIM.

    • FVNSIMSTD — Standard deviation of the NSIM.

    • DegradedEnergy — Energy of the degraded signal in each band.
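Because the frequency information is an ordinary table, standard table operations can locate the bands that contribute most to the degradation. The sketch below rebuilds four rows from the example output on this page so that it runs on its own; ftableRows, worstBand, and sortedRows are illustrative names.

```matlab
% Four rows of ftable, copied from the example output on this page.
FrequencyBand = [511.62; 622.48; 749.95; 896.49];
FVNSIM = [0.70612; 0.61074; 0.57177; 0.63006];
FVNSIM10 = [0.44807; 0.3624; 0.30356; 0.35169];
FVNSIMSTD = [0.40192; 0.47376; 0.46667; 0.42972];
DegradedEnergy = [19.82; 18.911; 18.545; 18.668];
ftableRows = table(FrequencyBand,FVNSIM,FVNSIM10,FVNSIMSTD,DegradedEnergy);

% Band with the lowest similarity between degraded and reference signals.
[worstNSIM,idx] = min(ftableRows.FVNSIM);
worstBand = ftableRows.FrequencyBand(idx);

% Rank all bands from least to most similar.
sortedRows = sortrows(ftableRows,"FVNSIM");
disp(sortedRows(:,["FrequencyBand" "FVNSIM"]))
```

Among these four rows, the band centered at 749.95 has the lowest FVNSIM value.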

    ttable

    Timing information table, returned as a table with the following columns:

    • PatchIndex — One-based index of the patch.

    • Similarity — Similarity metric for each patch.

    • ReferencePatch — Start and end times of the reference patch in seconds.

    • DegradedPatch — Start and end times of the degraded patch in seconds.

    References

    [1] Hines, Andrew, Jan Skoglund, Anil C Kokaram, and Naomi Harte. “ViSQOL: An Objective Speech Quality Model.” EURASIP Journal on Audio, Speech, and Music Processing 2015, no. 1 (December 2015): 13. https://doi.org/10.1186/s13636-015-0054-9.

    [2] Hines, Andrew, and Naomi Harte. “Speech Intelligibility Prediction Using a Neurogram Similarity Index Measure.” Speech Communication 54, no. 2 (February 2012): 306–20. https://doi.org/10.1016/j.specom.2011.09.004.

    [3] Hines, Andrew, Eoin Gillen, Damien Kelly, Jan Skoglund, Anil Kokaram, and Naomi Harte. “ViSQOLAudio: An Objective Audio Quality Metric for Low Bitrate Codecs.” The Journal of the Acoustical Society of America 137, no. 6 (June 1, 2015): EL449–55. https://doi.org/10.1121/1.4921674.

    [4] Chinen, Michael, Felicia S. C. Lim, Jan Skoglund, Nikita Gureev, Feargus O’Gorman, and Andrew Hines. “ViSQOL v3: An Open Source Production Ready Objective Speech and Audio Metric.” In 2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX), 1–6. Athlone, Ireland: IEEE, 2020. https://doi.org/10.1109/QoMEX48832.2020.9123150.

    Extended Capabilities

    C/C++ Code Generation
    Generate C and C++ code using MATLAB® Coder™.

    Version History

    Introduced in R2024a