
【TinyML】End-side speech recognition technology

Author: Shepherd

I. Preface

Automatic Speech Recognition (ASR) is a technology that converts human speech into text. Voice assistants such as "Hey Siri" and "Hi Alexa" are applications of speech recognition technology. Through a voice assistant, users can control home devices such as air conditioners, TVs, curtains and lights directly with their voice, making device control more convenient and natural.


At present, most mainstream intelligent voice solutions on the market use online (cloud-based) speech recognition, combined with platform content to build a rich smart home ecosystem. However, relying on the network also introduces many uncertainties: privacy and security, network latency, linkage speed, network stability, and so on. For this reason, offline, localized voice control on the end side has become another option for users.

II. End-side speech recognition

1. Overview

Modern speech recognition can be traced back to 1952, when Davis et al. built the world's first experimental system capable of recognizing the ten spoken English digits, formally opening up the field of speech recognition. Over more than 70 years of development, the technology is commonly divided into three stages: template matching, statistical modeling (GMM-HMM), and deep-learning-based end-to-end recognition.


In terms of the scope of content to be recognized, speech recognition can be divided into "open-domain recognition" and "closed-domain recognition":

| Open-domain recognition | Closed-domain recognition |
| --- | --- |
| No need to predefine a set of recognition terms | Requires a predefined set of command terms |
| Models are generally large and the recognition engine is computationally intensive | The recognition engine requires few hardware resources |
| Relies on "online" recognition in the cloud | "Offline" recognition deployed on embedded devices |
| Mainly oriented to multi-turn dialogue interaction | Mainly aimed at simple device-control scenarios |

End-side speech recognition belongs to closed-domain recognition.

2. Principle

The typical workflow of speech recognition technology is as follows:

  • The device microphone picks up the raw speech, and an ADC converts the analog signal into a digital signal.
  • The acoustic front-end module performs echo cancellation, noise suppression, and voice activity detection (VAD) to exclude non-speech segments, and then extracts MFCC features from the speech signal (see the sketch after this list).
  • The back-end takes the speech features as input to the acoustic model for inference, combines the result with the language model to score candidate commands, retrieves the highest-scoring entry from the command set, and outputs the recognition result.
  • The back-end recognition process can be understood as decoding the feature vectors into text, which requires two models:
  • Acoustic model: a model of the sound itself; it converts the speech input into an acoustic representation, more precisely the probability that a segment of speech corresponds to a given acoustic unit. In English this unit can be a syllable or a finer-grained phoneme; in Chinese it can be an initial/final or, at the same granularity as in English, a phoneme.
  • Language model: resolves ambiguities such as homophones. After the acoustic model produces a pronunciation sequence, the language model finds the most probable string among the candidate text sequences. It also constrains and rescores the acoustic decoding so that the final result conforms to grammatical rules. The most common choices today are N-gram language models and RNN-based language models.
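
As a rough illustration of the front-end feature-extraction step above, here is a minimal Python sketch that computes MFCC features with librosa; the file name, sample rate, window length and number of coefficients are example choices, not values required by any particular engine.

```python
import librosa
import numpy as np

# Load a short utterance; 16 kHz mono is a common choice for command recognition.
audio, sr = librosa.load("hello_device.wav", sr=16000, mono=True)

# 13 MFCC coefficients per frame, with a 25 ms analysis window and a 10 ms frame shift.
mfcc = librosa.feature.mfcc(
    y=audio,
    sr=sr,
    n_mfcc=13,
    n_fft=int(0.025 * sr),
    hop_length=int(0.010 * sr),
)

# Per-utterance mean/variance normalization, a common step before the acoustic model.
mfcc = (mfcc - mfcc.mean(axis=1, keepdims=True)) / (mfcc.std(axis=1, keepdims=True) + 1e-8)

print(mfcc.shape)  # (13, number_of_frames)
```

The resulting feature matrix is what the acoustic model consumes, frame by frame, during decoding.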

3. Trends

At present, open-domain speech recognition is moving towards large generative AI models. In contrast, research on end-side speech recognition mainly focuses on three aspects: smaller models, faster response, and stronger robustness:

  • Model optimization and compression: optimization algorithms such as hyperdimensional computing, memory-swapping mechanisms, and constrained neural architecture search have been proposed, and model size and computational requirements are further reduced through quantization, pruning, knowledge distillation and other compression methods while maintaining high accuracy (see the sketch after this list).
  • Low-latency real-time processing: meeting real-time requirements while maintaining high accuracy by employing low-latency acoustic feature extraction algorithms, improved model inference methods, and streaming recognition techniques.
  • End-side acoustic environment adaptation: noise suppression algorithms, data augmentation of acoustic model training across a variety of noise environments, multi-channel audio processing, and so on.
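
As one concrete example of the model compression direction above, the sketch below applies full-integer post-training quantization with the TensorFlow Lite converter; the saved-model path, input shape and representative data are placeholders, and the actual size/accuracy trade-off depends on the model and calibration data.

```python
import numpy as np
import tensorflow as tf

# Convert a trained keyword-spotting model (path is a placeholder) to TensorFlow Lite
# with full-integer post-training quantization.
converter = tf.lite.TFLiteConverter.from_saved_model("kws_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_dataset():
    # In practice, yield a few hundred real MFCC feature examples so the converter can
    # calibrate activation ranges; random data is used here only to keep the sketch
    # self-contained, and the (1, 49, 13, 1) shape is an arbitrary example input.
    for _ in range(100):
        yield [np.random.rand(1, 49, 13, 1).astype(np.float32)]

converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("kws_int8.tflite", "wb") as f:
    f.write(tflite_model)

print(f"Quantized model size: {len(tflite_model)} bytes")
```

An int8 model of this kind is typically several times smaller than its float32 counterpart and maps well onto the DSP/NN instructions of microcontrollers.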

4. Open source tools and datasets

  • Open source tools
| Tool | Brief introduction | Programming language | Link |
| --- | --- | --- | --- |
| Kaldi | A powerful speech recognition toolkit that supports a variety of acoustic modeling and decoding techniques. It provides a range of tools and libraries for acoustic model training, feature extraction, decoding, and more. | C++, Shell, Python | https://github.com/kaldi-asr/kaldi |
| vosk-api | Vosk is an offline open-source speech recognition toolkit. It recognizes 16 languages, including Chinese. | Python | https://github.com/alphacep/vosk-api |
| PocketSphinx | A small speech recognition engine developed by Carnegie Mellon University for embedded systems and mobile applications. | Python, C/C++ | https://github.com/cmusphinx/pocketsphinx |
| DeepSpeech | An open-source speech recognition engine developed by Mozilla; it uses RNNs and CNNs to process acoustic features and provides pre-trained models. | Python | https://github.com/mozilla/DeepSpeech |
| Julius | An open-source large-vocabulary continuous speech recognition engine that supports multiple languages and models. | C/C++ | https://github.com/julius-speech/julius |
| HTK | A toolkit for building hidden Markov models (HMMs). | C | https://htk.eng.cam.ac.uk/ |
| ESPnet | An end-to-end speech processing toolkit covering tasks such as speech recognition, speech synthesis, and more. | Python, Shell | https://github.com/espnet/espnet |

  • End-side inference open-source frameworks

| Framework | Brief introduction | Programming language |
| --- | --- | --- |
| TensorFlow Lite | A lightweight machine learning inference framework developed by Google for mobile devices, embedded systems, and edge devices. | C++, Python |
| uTVM | A branch of TVM focused on low-latency, efficient deep learning model inference on embedded systems, edge devices, and IoT devices. | C++, Python |
| Edge Impulse | An end-to-end platform for developing and deploying machine learning models on IoT devices, supporting rich sensor and data integration, model development tools, model deployment, and inference. | C/C++ |
| NCNN | An optimized neural network computing library focused on high-performance, efficient deep learning model inference on resource-constrained devices. | C++, Python |

  • Open-source datasets

| Dataset | Brief introduction | Link |
| --- | --- | --- |
| TIMIT | A dataset widely used in speech recognition research, containing sentences read in American English by speakers of multiple accents, genders and ages, used for training and testing speech recognition systems. | https://catalog.ldc.upenn.edu/LDC93s1 |
| LibriSpeech | A large speech recognition dataset containing audio and text from public-domain English readings. | http://www.openslr.org/94/ |
| Haitian Ruisheng (Speechocean) AAC | Provides multilingual, cross-domain, cross-modal artificial intelligence data and related data services to the whole industry. | https://www.speechocean.com/dsvoice/catid-52.htm |
| Datatang | A Chinese artificial intelligence data service company providing customized training datasets, data collection and annotation. | https://www.datatang.com/ |

More open-source datasets: http://www.openslr.org/resources.php
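
To give a feel for how one of the tools above is used in practice, here is a minimal offline-recognition sketch with vosk-api; the model directory and WAV file names are placeholders, and the audio is assumed to be 16 kHz, 16-bit mono PCM.

```python
import json
import wave

from vosk import Model, KaldiRecognizer

# Path to a downloaded Vosk model directory and a test recording (both placeholders).
model = Model("vosk-model-small-cn-0.22")
wf = wave.open("command.wav", "rb")  # assumed 16 kHz, 16-bit mono PCM

rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        # A full utterance has been decoded; print the recognized text.
        print(json.loads(rec.Result()).get("text", ""))

# Flush the final partial result.
print(json.loads(rec.FinalResult()).get("text", ""))
```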

III. Technical solutions

1. Based on speech recognition chip modules

Speech recognition chips usually integrate the DSP instruction set required for signal processing and speech recognition, an FPU for floating-point arithmetic, and an FFT accelerator, and they process the audio signal through neural networks to improve recognition of speech. The following table lists some suppliers of speech recognition chip modules.

| Vendor | Chip/module | Wake word & command customization | Update method |
| --- | --- | --- | --- |
| Chipintelli (Qiying Tairen) | CI120 series, CI130 series, CI230 series | https://aiplatform.chipintelli.com/home/index.html | Serial port flashing / STING |
| Hi-Link (Helinco) | HLK-V20 | http://voice.hlktech.com/yunSound/public/toWebLogin | Serial port flashing |
| Ai-Thinker (Anxinko) | VC series modules | https://udp.hivoice.cn/solutionai | Serial port flashing |
| Movement Intelligence | SU-03T, SU-1X series, SU-3X series | https://udp.hivoice.cn/solutionai | Serial port flashing |
| Waytronic | WTK6900 | Offline customized services | - |
| Kyushu Electronics | NRK330x series | Offline customized services | - |

Voice chip modules usually need little external circuitry and can work with just a power supply, a microphone, and a speaker connected.


Taking the HLK-V20 as an example, its default firmware ships with a number of built-in wake words and command entries. After recognizing a voice command, it outputs a fixed-protocol command over the serial port, and users can connect a main control chip of their choice to add wireless communication and other functions as needed. The HLK-V20's wake words and command words can also be customized through the Hi-Link voice product customization management system: the command words are configured as required, a firmware SDK is generated online, and the firmware is then flashed to the chip module as an update.
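
A minimal sketch of the host side of this setup is shown below, assuming Python with pyserial on a gateway or PC; the serial port, baud rate and, in particular, the two-byte command frames are invented for illustration and are not the actual HLK-V20 protocol, which is defined in the module's documentation.

```python
import serial  # pyserial

# UART that the voice module's TX pin is wired to (port and baud rate are placeholders;
# use the values given in the module's datasheet).
ser = serial.Serial("/dev/ttyUSB0", baudrate=115200, timeout=1)

# Hypothetical mapping from command frames to actions. These byte values are invented
# for illustration only and do not correspond to the real HLK-V20 protocol.
ACTIONS = {
    b"\xaa\x01": "light_on",
    b"\xaa\x02": "light_off",
}

while True:
    frame = ser.read(2)  # read one illustrative 2-byte command frame
    if len(frame) < 2:
        continue  # read timed out, nothing recognized yet
    action = ACTIONS.get(frame)
    if action:
        print(f"Voice command recognized -> {action}")
        # The host would trigger the corresponding device control (relay, RF, Wi-Fi, ...) here.
```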

2. Based on the development framework of chip manufacturers

Chip manufacturers provide their own AI application development frameworks, and the speech recognition models they supply are deeply optimized for their own chip architectures, which greatly improves model inference speed.

| Manufacturer | Framework | Related link |
| --- | --- | --- |
| ESP32 | ESP-ADF speech development framework, built on the ESP-IDF infrastructure and the ESP-SR speech recognition algorithm library. | https://github.com/espressif/esp-adf/tree/master |
| STM32 | An end-to-end solution that enables developers to rapidly deploy a variety of AI models on STM32 microcontrollers. | https://stm32ai.st.com/stm32-cube-ai/ |
| Silicon Labs | Provides the MLTK toolset. | https://siliconlabs.github.io/mltk/ |

Taking Espressif's ESP-ADF framework as an example: it is a set of audio application components developed on top of ESP-IDF, with the ESP-SR speech recognition algorithm library at its core, plus a variety of codec drivers and transport protocols, so that audio and voice applications can be developed easily.


In terms of hardware, because neural network inference is required and, to ensure recognition accuracy, models such as LSTM or Seq2Seq are generally introduced, the final model file is large and the runtime memory requirements are significant, so external Flash or an SD card is generally needed.

3. Based on open-source frameworks

At present, many open-source neural network frameworks can handle both the training of speech recognition models and product-level industrial deployment, for example TensorFlow Lite, which is very popular in the industry, and TVM, a neural network compiler that supports automatic optimization of operators for the target board.


The biggest advantage of building on an open-source neural network framework is full, autonomous control of the whole process, including model training and deployment; the disadvantage is that the pipeline is long, and designing and tuning the network model is a significant challenge.
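
As a small illustration of the deployment end of such a pipeline, the sketch below runs a quantized keyword-spotting model with the TensorFlow Lite Python interpreter; the model file name, input shape and label list are placeholders carried over from the quantization sketch earlier.

```python
import numpy as np
import tensorflow as tf

LABELS = ["silence", "unknown", "light_on", "light_off"]  # placeholder command set

# Load the converted .tflite model (file name is a placeholder).
interpreter = tf.lite.Interpreter(model_path="kws_int8.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# One MFCC feature map (49 frames x 13 coefficients, chosen only for the example).
features = np.random.rand(1, 49, 13, 1).astype(np.float32)

# For a fully quantized model, scale the float features into the int8 input range.
if input_details["dtype"] == np.int8:
    scale, zero_point = input_details["quantization"]
    features = (features / scale + zero_point).astype(np.int8)

interpreter.set_tensor(input_details["index"], features)
interpreter.invoke()

scores = interpreter.get_tensor(output_details["index"])[0]
print("Predicted command:", LABELS[int(np.argmax(scores))])
```

On a microcontroller the same model would instead be executed by TensorFlow Lite for Microcontrollers in C++, but the calling pattern (load, allocate, set input, invoke, read output) is analogous.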

4. Based on neural network chips

Neural network chips usually offer high computing power and can meet complex functional requirements, but they cost more; they are mainly used in smart speakers and intelligent voice central-control scenarios.

| Manufacturer | Models |
| --- | --- |
| Allwinner Technology | R328, R58, R16, H6, F1C600 |
| Amlogic | A113X, A112, S905D |
| Beken | BK3260 |
| Intel | Atom x5-Z8350 |
| MediaTek (MTK) | MT7668, MT7658, MT8167A, MT8765V, MT7688AN, MT8516, MT2601 |
| Rockchip | RK3308, RK3229, RK3326, OS1000RK |
| Spitz | TH1520 |
| iFLYTEK | CSK4002 |

IV. Summary

This article has introduced the basic concepts and working principles of speech recognition technology, and described four technical approaches for implementing speech recognition on the end side.
