
How can speech recognition accuracy be improved in noisy scenes? Facebook: Look at the lips

With lip reading, humans can more easily understand what other people are saying. Can AI do the same?

Recently, Meta proposed an audiovisual version of BERT that not only reads lips, but also reduces the recognition error rate by 75%.

The effect looks roughly like this: given a video, the model outputs what the speaker says based on their mouth movements and voice.


Compared with previous methods of the same kind, it uses only one-tenth as much labeled data, yet its performance exceeds the best audio-visual speech recognition systems to date.

By incorporating lip reading, this speech recognition method is especially helpful in noisy environments.

Abdelrahman Mohamed, a researcher at Meta, said the technology could be used in the future on smart devices such as smart assistants and AR glasses.

Meta has open-sourced the code on GitHub.

Self-supervised + multimodal

Meta names the method AV-HuBERT, a multimodal, self-supervised learning framework.

The multimodal part is easy to understand: the framework takes two kinds of input, speech audio and lip-movement video, and outputs the corresponding text.

Meta says that by combining the movements of the lips and teeth with the speech signal as people talk, AV-HuBERT can capture the subtle connections between audio and video.

This closely mirrors how humans themselves perceive speech.

Previous studies have shown that lip reading is an important way for humans to understand speech; in noisy environments in particular, it can improve recognition accuracy by up to 6 times.

In this model, masked audio and image sequences are encoded into audio-visual features by a ResNet-Transformer framework, which is trained to predict sequences of discrete cluster assignments.

Specifically, AV-HuBERT takes frame-level synchronized audio and video streams as input, so it can better model and extract the correlations between the two modalities.

The image sequence and the audio features first pass through lightweight modality-specific encoders to produce intermediate features, which are then fused and fed into a shared Transformer backbone encoder to predict the masked cluster assignments.

The prediction targets are generated by clustering the audio features, or features extracted from the previous iteration of the AV-HuBERT model.
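To make the pipeline above concrete, here is a minimal sketch of the forward pass in PyTorch. The module shapes, the concatenation-based fusion, and the simple zeroing of masked frames are illustrative assumptions, not the exact layers used in the open-sourced av_hubert code.

```python
# A minimal sketch of an AV-HuBERT-style forward pass (illustrative only;
# module choices, dimensions and the fusion strategy are assumptions).
import torch
import torch.nn as nn

class AVHuBERTSketch(nn.Module):
    def __init__(self, feat_dim=256, num_clusters=500, num_layers=12):
        super().__init__()
        # Lightweight modality-specific encoders
        self.visual_frontend = nn.Sequential(             # stands in for the ResNet lip encoder
            nn.Conv3d(1, feat_dim, kernel_size=(5, 7, 7), stride=(1, 4, 4), padding=(2, 3, 3)),
            nn.AdaptiveAvgPool3d((None, 1, 1)),
            nn.Flatten(2),
        )
        self.audio_frontend = nn.Linear(104, feat_dim)     # e.g. stacked filterbank features per frame
        # Shared Transformer backbone over the fused sequence
        layer = nn.TransformerEncoderLayer(d_model=2 * feat_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Predict a discrete cluster assignment for every frame
        self.cluster_head = nn.Linear(2 * feat_dim, num_clusters)

    def forward(self, lip_video, audio_feats, mask):
        # lip_video: (B, 1, T, H, W); audio_feats: (B, T, 104); mask: (B, T) bool
        v = self.visual_frontend(lip_video).transpose(1, 2)  # (B, T, feat_dim)
        a = self.audio_frontend(audio_feats)                 # (B, T, feat_dim)
        fused = torch.cat([v, a], dim=-1)                    # simple concatenation fusion
        fused = fused.masked_fill(mask.unsqueeze(-1), 0.0)   # hide the masked frames
        hidden = self.backbone(fused)
        logits = self.cluster_head(hidden)                   # (B, T, num_clusters)
        return logits  # trained with cross-entropy against cluster targets on masked frames
```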

When fine-tuned for lip reading, the model uses only the visual input, not the audio.

The results show that after training on just 30 hours of labeled TED Talk videos, AV-HuBERT reaches a word error rate (WER) of 32.5%, while the previous best method achieves 33.6% at its lowest, and that method was trained on as much as 31,000 hours of data.

WER measures the error rate of a speech recognition system: the number of incorrectly recognized words divided by the total number of words. A WER of 32.5% means roughly one error in every three words.
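As a quick check of that arithmetic, a minimal WER computation (word-level edit distance divided by the reference length) can be sketched as follows; this is a generic illustration, not the scoring script used in the paper.

```python
# Minimal word error rate (WER) sketch: Levenshtein distance over words,
# normalized by the number of reference words. Illustrative only.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

print(wer("look at the lips to read speech", "look at his lips to read speech"))  # 1/7 ≈ 0.14
```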

After training on 433 hours of TED Talk data, the error rate drops further to 26.9%.


On the other hand, the biggest difference between AV-HuBERT and its predecessors is that it uses a self-supervised learning approach.

Earlier methods from DeepMind and the University of Oxford relied on labeled datasets, which limits the vocabulary the models can learn.

During pre-training, AV-HuBERT alternates between two steps, feature clustering and masked prediction, so that it can learn to assign labels to unlabeled data on its own.
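A minimal sketch of that two-step iteration (cluster features to obtain frame-level pseudo-labels, train with masked prediction on them, then re-cluster the improved representations) might look like the loop below. The use of k-means and the three callables passed in are hypothetical placeholders for illustration, not the exact recipe in the paper.

```python
# Illustrative outline of the iterative pre-training loop: feature clustering
# produces pseudo-labels, masked prediction trains the model on them, and the
# model's own features are re-clustered for the next round. The callables
# (extract_mfcc, extract_model_features, train_masked_prediction) are
# hypothetical placeholders, not functions from the av_hubert repository.
import numpy as np
from sklearn.cluster import KMeans

def pretrain_loop(clips, extract_mfcc, extract_model_features, train_masked_prediction,
                  num_iterations=3, num_clusters=500):
    model = None
    for it in range(num_iterations):
        if it == 0:
            feats = [extract_mfcc(c) for c in clips]                    # bootstrap from acoustic features
        else:
            feats = [extract_model_features(model, c) for c in clips]   # use the model's own features
        kmeans = KMeans(n_clusters=num_clusters, n_init=10).fit(np.concatenate(feats))
        pseudo_labels = [kmeans.predict(f) for f in feats]              # one cluster id per frame
        model = train_masked_prediction(clips, pseudo_labels)           # predict labels on masked frames
    return model
```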

In this way, AV-HuBERT can also learn well for languages with little audio data available.

Using less than one-tenth of the labeled data (433 hours vs. 30 hours), the method reduces the recognition error rate by about 75% on average compared with the previous method (25.8% vs. 5.8%).

In noisy environments, speech recognition that can read lips is even more effective.

Meta researchers say that when speech and background noise are the same volume, AV-HuBERT's WER is only 3.2 percent, compared to 25.5 percent for the previous best multimodal model.
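For context, "speech and background noise at the same volume" corresponds to a signal-to-noise ratio of 0 dB. A minimal sketch of mixing noise into clean speech at a target SNR (a generic audio recipe, not the paper's exact noise protocol) looks like this:

```python
# Mix background noise into clean speech at a target SNR (in dB).
# 0 dB means speech and noise have equal power ("the same volume").
# Generic illustration, not the exact noise augmentation used in the paper.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = np.resize(noise, speech.shape)        # loop/trim noise to match the speech length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(speech_power / scaled_noise_power) == snr_db
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```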


There are still drawbacks

Judging by the numbers, the performance of Meta's new method is clearly impressive.

However, some scholars have raised practical concerns.

Among them, Os Keyes, an AI ethics researcher at the University of Washington, asked whether speech recognition that relies on lip reading still makes sense for people whose facial movement is affected by conditions such as Down syndrome or stroke.

Meta's researchers responded that AV-HuBERT focuses on lip movements rather than the entire face.

And, like most AI models, the performance of AV-HuBERT is "proportional to the representative sample size of different populations in the training data."

Paper links:

https://arxiv.org/abs/2201.02184

https://arxiv.org/abs/2201.01763

GitHub Address:

https://github.com/facebookresearch/av_hubert

Reference Links:

https://venturebeat.com/2022/01/07/meta-claims-its-ai-improves-speech-recognition-quality-by-reading-lips/
