
Artificial Intelligence and Acoustics | AI sound quality restoration

Source: 21dB Acoustics

Author: Wang Jiajie

Introduction

With the rapid development of audio platforms such as headphones and in-vehicle systems, manufacturers are raising their standards for sound quality in addition to competing on conventional functions such as ANC, ENC, KWS, SV, ASR, and TTS. After lossless music is encoded and transmitted over Bluetooth, how to restore the damaged sound quality on the device side has gradually become a task of interest.


1. Background of the problem

After single-channel lossless music at a 48 kHz sample rate is encoded at a low bit rate (e.g., 32 kbps) in a given format (e.g., AAC), the sound quality is damaged to a certain extent. The damage is reflected in the following aspects:

(1) On the spectrogram, there is a loss of high-frequency bandwidth and low-frequency "hole" damage. (Taking the soprano Han Hong's rendition of "Qinghai-Tibet Plateau" as an example, in the climactic final phrase the music encoded with 32 kbps AAC shows missing high frequencies and low-frequency holes; a comparison of this kind can be reproduced with the sketch after this list.)

(2) In terms of listening, corresponding to (1), golden ears can hear the narrowing of the music's bandwidth in the vocal range, and because the holes break the continuity of the spectrum along the time axis, a slight, somewhat grainy "click" can be heard.

(3) In terms of objective indicators, using the ODG score widely recognized for sound-quality evaluation [1], with the lossless music as the reference signal and the encoded music as the test signal, the score typically falls between -3 and -4 (the ODG scale runs from 0 to -5; the lower the score, the worse the sound quality).
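
The spectrogram comparison described in (1) is straightforward to reproduce. Below is a minimal sketch, not taken from the article, assuming ffmpeg is on the PATH and librosa and matplotlib are installed; the file names are placeholders.

```python
# Minimal sketch: round-trip a lossless track through 32 kbps AAC with ffmpeg,
# then plot both spectrograms to see the missing high band and low-frequency holes.
import subprocess

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

SRC = "lossless_48k_mono.wav"   # placeholder input file
AAC = "encoded_32kbps.m4a"
DEC = "decoded_from_aac.wav"

# Encode at 32 kbps AAC, then decode back to 48 kHz mono PCM (requires ffmpeg).
subprocess.run(["ffmpeg", "-y", "-i", SRC, "-c:a", "aac", "-b:a", "32k", AAC], check=True)
subprocess.run(["ffmpeg", "-y", "-i", AAC, "-ar", "48000", "-ac", "1", DEC], check=True)

ref, sr = librosa.load(SRC, sr=48000, mono=True)
deg, _ = librosa.load(DEC, sr=48000, mono=True)
deg = deg[: len(ref)]  # crude alignment; the codec may introduce a small delay

fig, axes = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
for ax, sig, title in zip(axes, (ref, deg), ("lossless", "AAC 32 kbps")):
    S = librosa.amplitude_to_db(np.abs(librosa.stft(sig, n_fft=2048, hop_length=512)), ref=np.max)
    librosa.display.specshow(S, sr=sr, hop_length=512, x_axis="time", y_axis="hz", ax=ax)
    ax.set_title(title)
plt.tight_layout()
plt.savefig("spectrogram_comparison.png")
```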

If the encoded lossy music can be restored, and the model's parameter count and compute footprint allow it to run on edge devices such as headphones and in-vehicle systems, this can become a marketing highlight and selling point.

2. Historical work

In October 2020, Shichao Hu et al. of Tencent Music Lab [2] addressed the prediction of high-frequency phase when reconstructing high-frequency music content by improving the Griffin-Lim algorithm and the MelGAN vocoder, achieving a subjective listening improvement. The low-resolution signal in that work is the result of low-pass filtering, unlike the impairment caused by coding.
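
For context, the sketch below shows the plain Griffin-Lim baseline that [2] improves upon: reconstructing a waveform from a magnitude-only spectrogram by iterating toward a consistent phase. It is a generic illustration with a toy signal, not the method of [2].

```python
# Plain Griffin-Lim as a phase-reconstruction baseline (generic illustration,
# not the improved variant of [2]): recover a waveform from magnitude only.
import numpy as np
import librosa

sr = 48000
t = np.arange(2 * sr) / sr
y = 0.5 * np.sin(2 * np.pi * 440 * t) + 0.25 * np.sin(2 * np.pi * 2200 * t)  # toy signal

mag = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))   # keep magnitude, discard true phase
y_hat = librosa.griffinlim(mag, n_iter=60, hop_length=512)  # iterate toward a consistent phase
```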

In April 2022, Moliner and Välimäki proposed BEHM-GAN [3], trained on classical piano music; blind listening tests showed that the algorithm effectively improves subjective sound quality when restoring early 20th-century gramophone recordings. The low-resolution signal in that work is a low-sample-rate signal obtained after denoising and distortion removal, and the goal is upsampling.

In November 2022, Yan Yanqiao et al. [4] used a Wasserstein GAN to generate high-frequency amplitude spectra. They argued that the accuracy of traditional phase flipping or the Griffin-Lim algorithm is too low for music signals and that these methods involve much redundant computation, so they estimated the high-frequency subband phase with a fully connected network instead. The method outperforms phase flipping and Griffin-Lim on the log-spectral distortion (LSD) and segmental signal-to-noise ratio (segSNR) metrics.

(Figure reproduced from Figure 4 in [4])
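
For reference, the sketch below implements common textbook forms of the two objective metrics cited above, log-spectral distortion (LSD) and segmental SNR; the exact frame lengths, overlap, and clipping used in [4] may differ.

```python
# Common textbook forms of LSD and segmental SNR (frame sizes and clipping are
# assumptions; the settings in [4] may differ).
import numpy as np
import librosa

def lsd(ref, est, n_fft=2048, hop=512, eps=1e-10):
    """Log-spectral distortion (one common definition), averaged over frames."""
    R = np.abs(librosa.stft(ref, n_fft=n_fft, hop_length=hop)) ** 2
    E = np.abs(librosa.stft(est, n_fft=n_fft, hop_length=hop)) ** 2
    diff = np.log10(R + eps) - np.log10(E + eps)
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=0))))

def seg_snr(ref, est, frame=2048, hop=512, eps=1e-10):
    """Segmental SNR in dB, clipped per frame to the usual [-10, 35] range."""
    snrs = []
    for start in range(0, min(len(ref), len(est)) - frame, hop):
        r, e = ref[start:start + frame], est[start:start + frame]
        snr = 10 * np.log10((np.sum(r ** 2) + eps) / (np.sum((r - e) ** 2) + eps))
        snrs.append(np.clip(snr, -10.0, 35.0))
    return float(np.mean(snrs))
```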

In December 2022, Davidson et al. of Dolby Laboratories proposed MDCTNet [5], which uses RNNs to capture the correlations of the modified discrete cosine transform (MDCT) spectrum along both the time and frequency dimensions. The low-resolution signal in that work is the signal damaged by encoding, and the network can simultaneously fill low-frequency holes and reconstruct high-frequency music content. In subjective listening tests on a variety of music content with 10 professional listeners, music encoded at 24 kbps and then restored was judged to have a perceived quality equivalent to a 48 kbps bit rate.
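
The sketch below illustrates only the general idea of recurrence along both axes of a time-frequency representation, using arbitrary layer sizes; it is not the actual MDCTNet architecture described in [5].

```python
# General idea only (not MDCTNet): one GRU runs along the time axis for each
# frequency band, another along the frequency axis for each frame.
import torch
import torch.nn as nn

class DualAxisRNN(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.time_rnn = nn.GRU(1, hidden, batch_first=True, bidirectional=True)
        self.freq_rnn = nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, 1)

    def forward(self, x):                             # x: (batch, time, freq) spectral coefficients
        b, t, f = x.shape
        h = x.permute(0, 2, 1).reshape(b * f, t, 1)   # one time sequence per frequency bin
        h, _ = self.time_rnn(h)                       # (b*f, t, 2*hidden)
        h = h.reshape(b, f, t, -1).permute(0, 2, 1, 3).reshape(b * t, f, -1)
        h, _ = self.freq_rnn(h)                       # (b*t, f, 2*hidden), one sequence per frame
        return self.proj(h).reshape(b, t, f)          # enhanced coefficients

# Example: a batch of 8 clips, 100 frames, 256 coefficients per frame.
model = DualAxisRNN()
out = model(torch.randn(8, 100, 256))
```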

3. Reflection and discussion

(1) Music restoration actually belongs to the broad research field of audio super-resolution, of which bandwidth extension (BWE) is a sub-branch. Compared with speech BWE, however, music sound-quality restoration is a harder task. This is reflected in the following points:

    1. In terms of defective components, sound-quality restoration must not only reconstruct the high frequencies but also fill the holes at low frequencies, whereas a speech signal only needs high-frequency reconstruction.
    2. In terms of data correlation, the structure of speech is relatively simple: the harmonic components of voiced sounds correlate well between the low and high bands, and the high-frequency energy "smear" region and depth of unvoiced syllables/phonemes in the power spectrum can also be predicted. Music signals, by contrast, are ever-changing; the correlation between low and high frequencies is poor, and for some sounds such as wind chimes it is essentially zero, so it is basically impossible to predict the original high-frequency components from the residual, weak, damaged low-frequency components.
    3. In terms of listening, when speech BWE is done well, the benefit of the extended bandwidth is easy to hear. The bandwidth covered after music encoding, however, is already wide: for example, a single-channel 48 kHz signal encoded with AAC at 32 kbps can still reach about 13 kHz (see the sketch after this list), which already covers the frequency range to which ordinary listeners are sensitive. Even when an AI restoration algorithm does bring a benefit, professionally trained golden ears may need several attentive, repeated listens to confirm it, yet once abnormal sounds or artifacts appear, even untrained listeners notice them immediately. An industrial-grade, deployable music restoration algorithm may deliver only a small benefit, which is still better than nothing, but it must never introduce artifacts.
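
As referenced in point 3, one rough way to check the effective bandwidth of an encoded file is to find the frequency above which its long-term average spectrum falls below a threshold. The sketch below is a simple approximation; the -60 dB threshold and the file name are assumptions.

```python
# Rough effective-bandwidth check: highest frequency at which the long-term
# average spectrum stays within `threshold_db` of its peak (threshold assumed).
import numpy as np
import librosa

def effective_bandwidth(path, sr=48000, threshold_db=-60.0, n_fft=4096):
    y, _ = librosa.load(path, sr=sr, mono=True)
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=n_fft // 4))
    lt = 20 * np.log10(np.mean(S, axis=1) + 1e-12)  # long-term average spectrum in dB
    lt -= lt.max()                                  # normalize so the peak is 0 dB
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    above = np.where(lt > threshold_db)[0]
    return float(freqs[above[-1]]) if len(above) else 0.0

# Per the article's example, a 32 kbps AAC round-trip should land around 13 kHz.
print(effective_bandwidth("decoded_from_aac.wav"))
```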

(2) For certain types of music signals with strong temporal variability and poor correlation between the low and high bands, such as drums, ringtones, and vocals, dedicated approaches such as DrumGAN may be needed to achieve a better restoration effect.

Bibliography:

[1] Kabal P. An Examination and Interpretation of ITU-R BS.1387: Perceptual Evaluation of Audio Quality.

[2] Hu S, Zhang B, Liang B, et al. Phase-aware music super-resolution using generative adversarial networks[J]. 2020.

[3] Moliner E, Välimäki V. BEHM-GAN: Bandwidth extension of historical music using generative adversarial networks[J]. arXiv e-prints, 2022.

[4] Yan Y, Nguyen B T, Geng Y, Iwai K, Nishiura T. Phase-aware audio super-resolution for music signals using Wasserstein generative adversarial network[C]//Asia-Pacific Signal and Information Processing Association Annual Summit and Conference. Thailand, 2022: 1673-1677.

[5] Davidson G, Vinton M, Ekstrand P, et al. High quality audio coding with MDCTNet[J]. 2022.
