
CNVSRC 2023 Chinese Continuous Visual Speech Recognition Challenge is officially released

Visual speech recognition, also known as lip reading, is a technique that infers spoken content from lip movements. It has important applications in public safety, assistance for the elderly and disabled, video authenticity verification, and other fields. Research on lip reading is flourishing: while great progress has been made on recognizing isolated words and phrases, large-vocabulary continuous recognition remains highly challenging. For Chinese in particular, progress has been limited by the lack of suitable data resources. To address this, Tsinghua University released the CN-CVS dataset in 2023, the first large-scale Chinese visual speech recognition database, making further progress on large-vocabulary continuous visual speech recognition (LVCVSR) possible.

To promote this research direction, the 2023 NCMMSC special session, the Chinese Continuous Visual Speech Recognition Challenge (CNVSRC), jointly organized by Tsinghua University, Beijing University of Posts and Telecommunications, Haitian Ruisheng, and Voice Home, has been officially announced. Based on the CN-CVS Chinese visual speech recognition database, the challenge evaluates the performance of LVCVSR systems on both studio reading and online speech. Results will be announced and awards presented at the NCMMSC 2023 conference.

01 Dataset

• CN-CVS: CN-CVS contains more than 300 hours of audio-visual data from 2,557 speakers, covering news broadcast and public speaking scenarios, making it the largest open-source Chinese audio-visual dataset. The organizers provide text annotations of the database for this challenge. For more information about CN-CVS, please visit the database website http://www.cnceleb.org/. This dataset serves as the training set for the closed-set (fixed-track) tasks of this challenge.

• CNVSRC-Single: a large single-speaker dataset for CNVSRC 2023. It contains more than 100 hours of audio-visual data from a single speaker, collected from online videos; nine-tenths of the data constitutes the development set, and the remaining one-tenth serves as the test set.

• CNVSRC-Multi: a limited multi-speaker dataset for CNVSRC 2023. It contains audio-visual data from 43 speakers, nearly 1 hour per speaker; two-thirds of each speaker's data constitutes the development set, and the remainder constitutes the test set. Data from 23 of the speakers was recorded with fixed cameras while reading aloud in a controlled environment, with relatively short individual recordings. Data from the other 20 speakers comes from online speech videos, with longer individual recordings and more complex environments and content.

For the training and development sets, the organizers provide audio, video, and corresponding transcripts; for the test sets, only video data is provided. Participants may not use the test sets in any way, including but not limited to using them to help train or fine-tune models.

Dataset                  CNVSRC-Single         CNVSRC-Multi
                         Dev       Eval        Dev       Eval
Number of videos         25,947    2,881       20,450    10,269
Video duration (hours)   94.00     8.41        29.24     14.49

Note: The reading data in CNVSRC-Multi comes from the [Chinese Mandarin Audio-Visual Recognition Corpus (Mobile Phone)] dataset donated to Tsinghua University by Haitian Ruisheng (website: www.dataoceanai.com) to promote scientific development.

02 Task settings

CNVSRC 2023 defines two tasks: Single-speaker Visual Speech Recognition (T1), which focuses on performance for a specific speaker after tuning on a large amount of that speaker's data, and Multi-speaker Visual Speech Recognition (T2), which focuses on the baseline performance of a system on non-specific speakers. Each task is divided into a "fixed track" and an "open track": the fixed track may only use data and resources approved by the organizing committee, while the open track may use any resources except the test sets.

Resources that may not be used on the fixed track include: non-public pre-trained models used as feature extractors, pre-trained language models with more than 1B parameters, and non-public pre-trained language models. Permitted tools and resources include: publicly available preprocessing tools for face detection, face extraction, lip-region extraction, contour extraction, and the like; publicly available external models, tools, and datasets for data augmentation; and publicly available vocabularies, pronunciation dictionaries, n-gram language models, and neural language models with fewer than 1B parameters.
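As an illustration of the kind of lip-region preprocessing permitted on the fixed track, a mouth region-of-interest crop can be derived from facial landmarks supplied by any public face detector. The sketch below only computes the crop geometry; the landmark coordinates and the padding factor are illustrative assumptions, not part of any official pipeline:

```python
def mouth_roi(landmarks, pad=0.3):
    """Given (x, y) landmark points around the mouth, return a square
    crop box (x0, y0, x1, y1) centered on the landmarks and expanded
    by `pad` on each side. The landmarks themselves are assumed to
    come from an external face detector."""
    xs = [p[0] for p in landmarks]
    ys = [p[1] for p in landmarks]
    # Center of the landmark bounding box.
    cx, cy = (min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2
    # Half-size of a square covering the larger extent, plus padding.
    half = max(max(xs) - min(xs), max(ys) - min(ys)) / 2 * (1 + pad)
    return (cx - half, cy - half, cx + half, cy + half)

# Mouth landmarks spanning x in [40, 60] and y in [70, 80]:
box = mouth_roi([(40, 70), (60, 70), (50, 80)])
```

The returned box can then be used to crop each video frame before feeding the lip region to the recognizer.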

Task                             Fixed track                      Open track
T1: Single-speaker lip reading   CN-CVS, CNVSRC-Single dev set    Any data and tools
T2: Multi-speaker lip reading    CN-CVS, CNVSRC-Multi dev set     Any data and tools

03 How to participate

Participants need to register a CNVSRC account on the CNCeleb official website at the following address:

http://cnceleb.org/competition

After registering, participants can download the data resources (CN-CVS, CNVSRC-Single, CNVSRC-Multi) by following the prompts.

CNVSRC 2023 uses the Character Error Rate (CER) as the evaluation metric. To submit results, participants log in to their CNVSRC account, open the CNVSRC 2023 result submission page, select the corresponding task and track, and upload the result file. Each line in the result file corresponds to one test video, starting with the video ID followed by the corresponding transcript. After submission, the system automatically computes the CER and displays it to the participant. Participants are allowed 5 submissions per task per track.
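CER is conventionally the character-level Levenshtein edit distance between the hypothesis and the reference transcript, divided by the reference length. A minimal sketch of this computation (the official scoring script may differ in details such as text normalization):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein edit distance between the
    reference and hypothesis character sequences, divided by the
    number of reference characters."""
    ref, hyp = list(reference), list(hypothesis)
    # Dynamic-programming edit distance over characters
    # (substitutions, insertions, deletions all cost 1).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            curr[j] = min(prev[j] + 1,             # deletion
                          curr[j - 1] + 1,         # insertion
                          prev[j - 1] + (r != h))  # substitution
        prev = curr
    return prev[-1] / len(ref)

# One substitution in a 4-character reference -> CER = 0.25
print(cer("abcd", "abxd"))  # 0.25
```

Note that CER can exceed 100% when the hypothesis contains many insertions relative to the reference.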

04 Baseline System

The organizers provide baseline systems for both the single-speaker and multi-speaker tasks under fixed-track conditions. The baselines adopt a Conformer-based architecture; their performance is as follows.

Task              Single-speaker VSR   Multi-speaker VSR
CER on Dev set    48.57%               58.77%
CER on Eval set   48.60%               58.37%

Participants can obtain the code for the baseline system at the following URL: https://github.com/MKT-Dataoceanai/CNVSRC2023Baseline

05 Schedule

Date         Agenda
2023/09/20   Registration opens; release of training sets, development sets, and baseline system
2023/10/10   Test set release
2023/11/01   Submission system opens
2023/12/01   Deadline for result submission (12:00 noon)
2023/12/09   NCMMSC 2023 workshop: results announced and top systems presented

06 Organizing Committee

Name         Affiliation
Wang         Tsinghua University
Chen Chen    Tsinghua University
Li Lantian   Beijing University of Posts and Telecommunications
Li Ke        Haitian Ruisheng
Bu Hui       Voice Home
