Mozilla's open source speech dataset has 20,000 hours of content, and new support for Cantonese and Hokkien languages

2022-05-01 17:40:18

Earlier this week, Mozilla announced that its Common Voice dataset now contains more than 20,000 hours of content, nearly twice as much as a year ago. Anyone in the world can use it to improve their speech recognition software.

IT House understands that the latest English dataset is 71 GB, and that the release supports more languages than ever before, adding Tigre, Hokkien, Meadow Mari, Bengali, Toki Pona and Cantonese.

According to Mozilla, the Common Voice project lets anyone contribute their voice, which helps virtual assistants understand more accents. The project is also open source, which keeps large tech companies from monopolizing the data and gives small developers and companies the opportunity to build competing products and services.

Mozilla notes the following highlights in its latest dataset release:

6 new languages: Tigre, Hokkien, Meadow Mari, Bengali, Toki Pona and Cantonese.

27 languages have at least 100 hours of speech data, including Bengali, Thai, Basque, and Frisian.

9 languages have at least 500 hours of speech data, including Kinyarwanda (2,383 hours), Catalan (2,045 hours), and Swahili (719 hours).

In 9 languages, at least 45% of the gender labels are female, including Marathi, Dhivehi and Luganda.
