Recently, the In-Car Multi-Channel Automatic Speech Recognition Challenge (ICMC-ASR), a flagship challenge of ICASSP 2024 (International Conference on Acoustics, Speech, and Signal Processing), came to a close. On both of its tracks, ASR (Automatic Speech Recognition) and ASDR (Automatic Speech Diarization and Recognition), the joint team of iFLYTEK and the National Engineering Research Center of Speech and Language Information Processing at the University of Science and Technology of China (USTC-NERCSLIP) took first place.
The In-Car Multi-Channel Automatic Speech Recognition Challenge was jointly initiated by AISHELL, Li Auto, the Audio, Speech and Language Processing Group of Northwestern Polytechnical University, Nanyang Technological University (Singapore), Tianjin University, the WeNet open-source community, Microsoft, and the China Academy of Information and Communications Technology, among others, and attracted many companies and research institutions.
Close to real, complex in-car scenarios
A challenging dual-track speech recognition task
The car cockpit is one of the most common use cases for speech recognition. Unlike multi-speaker recognition in homes or meeting rooms, in-car speech recognition faces additional challenges:
A complex acoustic environment in the cockpit: the enclosed, irregularly shaped space has a distinctive room impulse response, producing unusual reverberation conditions;
Many noise sources inside and outside the cabin, such as wind, engine and tire noise, background music, and interfering talkers;
Varying driving conditions that also affect recognition performance, such as parking, high-speed or low-speed driving, and daytime versus nighttime driving.
In addition, the lack of large-scale public real-world in-car data has been one of the main obstacles to progress in this field.
The challenge built a corpus of 1000+ hours of real multi-channel, multi-speaker in-car Mandarin speech. The data come from speakers in different seats; distributed in-car microphones and headset microphones worn by the participants captured far-field and near-field recordings, respectively.
Far-field microphone layout provided by the challenge organizers
On this basis, the challenge set up two tracks, ASR and ASDR, whose tasks closely mirror the speech recognition needs of real in-car scenarios:
ASR: speaker segmentation for the cockpit speakers is manually annotated, and contestants perform speech recognition directly on the given oracle segment boundaries;
ASDR: speaker diarization on the far-field data must be completed first, i.e., segmenting the continuous multi-speaker audio and determining which speaker each segment belongs to, before speech recognition is performed.
In the end, the iFLYTEK joint team won first place on both tracks with recognition error rates of 13.16% and 21.48%, respectively; compared with the official baseline system, these results correspond to relative error-rate reductions of 49.84% and 70.52%.
ASR track rankings
ASDR track rankings
The core evaluation metric of the ASR track is CER (Character Error Rate), which counts the minimum number of character insertions, deletions and substitutions needed to turn the hypothesis into the reference;
The core evaluation metric of the ASDR track is the concatenated minimum-permutation CER (cpCER), which jointly measures the quality of multi-speaker diarization and the accuracy of speech recognition.
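To make the two metrics concrete, here is a minimal Python sketch of how CER and cpCER can be computed on character strings. It is a simplified illustration (it assumes the same number of reference and hypothesis speakers), not the official scoring script.

```python
from itertools import permutations

def edit_distance(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # delete a reference character
                        dp[j - 1] + 1,    # insert a hypothesis character
                        prev + (r != h))  # substitute (or match)
            prev = cur
    return dp[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: (S + D + I) / number of reference characters."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

def cp_cer(refs_by_spk: dict, hyps_by_spk: dict) -> float:
    """Concatenated minimum-permutation CER: concatenate each speaker's
    utterances, try every speaker mapping, keep the lowest total error."""
    ref_texts = ["".join(v) for v in refs_by_spk.values()]
    hyp_texts = ["".join(v) for v in hyps_by_spk.values()]
    total_ref = sum(len(t) for t in ref_texts)
    best = min(sum(edit_distance(r, h) for r, h in zip(ref_texts, perm))
               for perm in permutations(hyp_texts))
    return best / max(total_ref, 1)
```

Because cpCER searches over speaker permutations, a system is penalized both for mis-recognized characters and for attributing correct text to the wrong speaker, which is why ASDR error rates are higher than ASR error rates.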
Facing these challenges, what technical innovations did the team make?
Amid the many noise sources inside and outside the car, how can an automotive voice system "overcome all obstacles" and accurately recognize the target speaker's voice?
iFLYTEK has long worked on speech recognition in complex scenarios. After winning the CHiME championship four times in a row, it entered the in-car ICMC-ASR challenge, focusing mainly on multi-channel in-car recognition with fixed speaker positions and accented speech, and proposed a variety of new technical methods. These solutions tackle the problem from both the front end and the back end:
In the front-end algorithms, because the target and non-target speakers in the car sit close to each other, the maximum signal-to-noise-ratio criterion can select the wrong channel for the target speaker. Sound source localization is therefore incorporated into channel selection to improve the separation of the target speaker:
Channel Selection Based on Multi-Zone Sound Source Localization
This algorithm replaces the reference-channel selection criterion, switching from the maximum signal-to-noise-ratio criterion to a speaker-position criterion: the speaker position estimated by multi-zone sound source localization, based on inter-channel energy and phase differences, is used to select the channel closest to the speaker, avoiding the mistaken choice of a channel closest to an interference source. At the same time, an iterative averaging algorithm is introduced to obtain a more accurate estimate of the power spectral density of the signal source, so that beamforming achieves better results. The algorithm improves suppression of interference sources and noise without introducing speech distortion, providing downstream speech recognition with single-channel audio of high signal-to-noise ratio and intelligibility.
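As a rough illustration of this front-end idea (not the team's actual implementation), the sketch below picks the reference channel by the active speaker's zone rather than by maximum SNR, and iteratively re-averages the target source's spatial covariance before beamforming. The microphone-to-zone mapping, the initial speech mask, and the number of iterations are assumptions made for the example.

```python
import numpy as np

def select_reference_channel(stft: np.ndarray, zone_of_mic: list) -> int:
    """Choose the channel closest to the active speaker.
    stft: complex array of shape (channels, frames, bins);
    zone_of_mic: for each channel, the seat/zone that microphone covers.
    A real system would also use inter-channel phase differences; here the
    inter-channel energy difference alone stands in for localization."""
    energy = (np.abs(stft) ** 2).mean(axis=(1, 2))        # per-channel energy
    active_zone = zone_of_mic[int(np.argmax(energy))]     # zone of the loudest mic
    candidates = [c for c, z in enumerate(zone_of_mic) if z == active_zone]
    return max(candidates, key=lambda c: energy[c])

def iterative_psd_estimate(stft: np.ndarray, mask: np.ndarray, iters: int = 3) -> np.ndarray:
    """Iteratively re-average the target spatial covariance (PSD matrix).
    mask: (frames, bins) initial speech-presence estimate."""
    weights = mask.copy()
    for _ in range(iters):
        # Weighted spatial covariance per frequency bin: shape (bins, ch, ch).
        psd = np.einsum("ctf,dtf,tf->fcd", stft, stft.conj(), weights)
        psd /= weights.sum(axis=0)[:, None, None] + 1e-8
        # Beamform with the principal eigenvector and re-weight frames by the
        # enhanced energy, so frames dominated by the target count for more.
        w = np.linalg.eigh(psd)[1][:, :, -1]               # (bins, ch)
        enhanced = np.einsum("fc,ctf->tf", w.conj(), stft)
        weights = mask * np.abs(enhanced) ** 2
    return psd
```

The resulting covariance estimate could then be plugged into a standard MVDR or GEV beamformer to produce the single-channel output fed to the recognizer.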
In the back-end algorithms, speakers' heavy accents degrade both diarization and recognition. Accent information is therefore integrated into speaker diarization and into speech recognition to improve the system's ability to handle accented speech:
Multi-Speaker Diarization Using Self-Supervised Learning Representation Speaker Embeddings
This method targets speaker diarization in scenes with heavy noise, strong reverberation and high speaker overlap. By introducing an accent-adapted self-supervised pre-trained model to extract voiceprint information, and fusing these different voiceprint representations, the diarization model learns richer and more accurate characteristics of accented-Mandarin speakers. The model fully exploits the speaker information in the audio signal, effectively improves diarization performance, and lays a solid foundation for the subsequent separation and recognition modules.
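As a simplified illustration of this idea, the sketch below fuses a self-supervised representation (WavLM from HuggingFace Transformers is used here as a stand-in for the accent-adapted model) with a conventional speaker embedding before clustering segments. The model choice, mean pooling, simple concatenation and agglomerative clustering are assumptions for the example, not the team's recipe; `xvector_fn` is a placeholder for any conventional voiceprint extractor.

```python
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, WavLMModel
from sklearn.cluster import AgglomerativeClustering

extractor = Wav2Vec2FeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
ssl_model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus").eval()

def ssl_embedding(waveform: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Mean-pool the last hidden layer of the self-supervised model."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = ssl_model(**inputs).last_hidden_state     # (1, frames, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()

def diarize(segments, xvector_fn, n_speakers: int):
    """Cluster per-segment embeddings built by fusing SSL and voiceprint features."""
    embs = []
    for wav in segments:
        fused = np.concatenate([ssl_embedding(wav), xvector_fn(wav)])
        embs.append(fused / (np.linalg.norm(fused) + 1e-8))  # length-normalize
    return AgglomerativeClustering(n_clusters=n_speakers).fit_predict(np.array(embs))
```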
Accent ASR Based on Multi-Grained Unit Enhancement
To address the accent problem, multi-task learning with pinyin sequences is introduced: the aligned pinyin sequences and the encoder's acoustic features are fused through twin cross-attention and contrastive learning, so that the fine-grained units better capture pronunciation information. At the same time, frame- and segment-level speaker information is injected at the fusion stage of the encoder backbone, making the coarse-grained units produced by speakers with different accents easier to distinguish and improving recognition in complex scenarios.
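The PyTorch sketch below shows the general shape of such a multi-grained setup: an auxiliary pinyin prediction head on the encoder output, plus a cross-attention fusion of pinyin embeddings with the acoustic features. The dimensions, vocabulary sizes, and the single attention direction shown are illustrative assumptions, not the published configuration.

```python
import torch
import torch.nn as nn

class PinyinEnhancedHead(nn.Module):
    """Auxiliary pinyin (fine-grained) task fused back into the character
    (coarse-grained) recognition head via cross-attention."""
    def __init__(self, d_model: int = 256, pinyin_vocab: int = 1300, char_vocab: int = 5000):
        super().__init__()
        self.pinyin_ctc = nn.Linear(d_model, pinyin_vocab)   # fine-grained unit head
        self.pinyin_emb = nn.Embedding(pinyin_vocab, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.char_ctc = nn.Linear(d_model, char_vocab)       # coarse-grained unit head

    def forward(self, enc_out: torch.Tensor, pinyin_ids: torch.Tensor):
        # enc_out: (batch, frames, d_model) acoustic encoder output
        # pinyin_ids: (batch, py_len) aligned pinyin token sequence
        pinyin_logits = self.pinyin_ctc(enc_out)
        py = self.pinyin_emb(pinyin_ids)
        # Acoustic frames attend to the aligned pinyin sequence; a "twin"
        # setup would also attend in the opposite direction.
        fused, _ = self.cross_attn(query=enc_out, key=py, value=py)
        char_logits = self.char_ctc(enc_out + fused)
        return pinyin_logits, char_logits
```

In training, the pinyin and character heads would each carry their own recognition loss, with a contrastive term pulling matched pinyin and acoustic representations together; speaker and accent embeddings can be added at the encoder fusion stage in the same spirit.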
From practical to easy to use
In-car voice interaction has a promising future
Since 2003, iFLYTEK has worked in automotive intelligence for 21 years. It has ranked first in domestic in-car voice market coverage for many years*, and its product cooperation covers more than 90% of China's mainstream independent and joint-venture car manufacturers. By the end of 2023, iFLYTEK's automotive intelligent products and technologies had reached a cumulative 53.49 million factory-installed units, with more than 10 billion online interactions per year and over 25 million average monthly active users.
From "practical" to "easy to use", from "passive execution machine" to "anthropomorphic intimate assistant", from "in-car interaction" to "cross-scene interaction", from "main and co-driver interaction" to "multi-passenger interaction", iFLYTEK's intelligent voice technology continues to empower the in-vehicle intelligent cockpit.
In the face of complex background sounds in the car, iFLYTEK has effectively improved recognition accuracy through sound source localization, noise-reduction solutions built on microphone arrays of up to six microphones, and the speech recognition resources it has accumulated over the years.
The application of multi-channel recognition technology has changed the situation in which other passengers could not interact with the voice assistant once the driver had woken it up, allowing passengers in multiple seats to interact with the assistant without interfering with one another.
Winning first place on both ICMC-ASR tracks is a strong affirmation of the iFLYTEK joint team's in-car multi-channel speech recognition technology. At the same time, the rapid development of cognitive large models also brings new opportunities and experience upgrades for automotive intelligence.
Based on the iFLYTEK Xinghuo (Spark) cognitive model, in-car interaction extends from simple control commands to diverse intelligent interaction, supporting scenarios such as casual chat, knowledge Q&A, entertainment, and trip planning. The deep integration of voice interaction and smart cars will bring a safer, more comfortable and more considerate driving experience.
Taking a longer view, multi-channel speech recognition will also shine beyond the car, in smart homes, smart offices and other fields. At home, it can recognize instructions from different family members and distinguish casual chat from commands; in the office, it can automatically separate and recognize speakers, attribute transcripts by role and generate meeting minutes. Staying true to its original aspiration, iFLYTEK will continue to deepen its work in intelligent voice technology and gradually turn imagination about the future into daily reality.
*The data in this article comes from the iFLYTEK intelligent vehicle data platform, and the market share comes from a third-party research report