
【TinyML】End-side speech recognition technology

Author: Shepherd

I. Preface

Automatic Speech Recognition (ASR) is a technology that converts human speech into text. Voice assistants such as "Hey Siri" and "Hi Alexa" are applications of speech recognition technology. Through a voice assistant, users can control home devices such as air conditioners, TVs, curtains and lights directly with their voice, making device control more convenient and natural.


At present, most mainstream intelligent voice solutions on the market use online (cloud-based) speech recognition, combined with platform content to build a rich smart home ecosystem. However, relying on the network also introduces many uncertainties: privacy and security, network latency, linkage speed, network stability, and so on. For this reason, offline, localized voice control on the end side has become another option for users.

II. End-side speech recognition

1. Overview

Modern speech recognition can be traced back to 1952, when Davis et al. built the world's first experimental system capable of recognizing the ten spoken English digits, formally opening up the field of speech recognition. Over more than 70 years of development, the technology is commonly divided into three stages: template matching, statistical modeling (GMM-HMM), and deep-learning-based end-to-end recognition.


In terms of the scope of content to be recognized, speech recognition can be divided into "open-domain recognition" and "closed-domain recognition":

| Open-domain recognition | Closed-domain recognition |
| --- | --- |
| No need to predefine a set of recognition terms | Requires a predefined set of command terms |
| Models are generally large and the recognition engine is computationally intensive | The recognition engine requires few hardware resources |
| Relies on "online" recognition in the cloud | "Offline" recognition deployed on embedded devices |
| Mainly oriented to multi-turn dialogue interaction | Mainly aimed at simple device-control scenarios |

End-side speech recognition belongs to closed-domain recognition.

2. Principle

The typical workflow of speech recognition technology is as follows:

  • The device microphone picks up the raw speech, and an ADC converts the analog signal into a digital signal.
  • The acoustic front-end module performs echo cancellation, noise suppression, and voice activity detection (VAD) to exclude non-speech segments, and then extracts MFCC features from the speech signal (see the sketch after this list).
  • The back-end takes the speech features as input to the acoustic model for inference, combines the result with the language model to score candidate commands, retrieves the highest-scoring entry from the command set, and outputs the recognition result.
  • The back-end recognition process can be understood as decoding the feature vectors into text, which requires two models:
  • Acoustic model: a model of the sound itself; it converts the speech input into an acoustic representation, more precisely the probability that a segment of speech corresponds to a given acoustic unit. In English this unit can be a syllable or a finer-grained phoneme; in Chinese it can be an initial/final or, at the same granularity as in English, a phoneme.
  • Language model: resolves ambiguities such as homophones. After the acoustic model produces a pronunciation sequence, the language model finds the most probable string among the candidate text sequences. It also constrains and rescores the acoustic decoding so that the final result conforms to grammatical rules. The most common choices today are N-gram language models and RNN-based language models.
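
As a rough illustration of the front-end feature-extraction step above, here is a minimal Python sketch that computes MFCC features with librosa; the file name, sample rate, window length and number of coefficients are example choices, not values required by any particular engine.

```python
import librosa
import numpy as np

# Load a short utterance; 16 kHz mono is a common choice for command recognition.
audio, sr = librosa.load("hello_device.wav", sr=16000, mono=True)

# 13 MFCC coefficients per frame, with a 25 ms analysis window and a 10 ms frame shift.
mfcc = librosa.feature.mfcc(
    y=audio,
    sr=sr,
    n_mfcc=13,
    n_fft=int(0.025 * sr),
    hop_length=int(0.010 * sr),
)

# Per-utterance mean/variance normalization, a common step before the acoustic model.
mfcc = (mfcc - mfcc.mean(axis=1, keepdims=True)) / (mfcc.std(axis=1, keepdims=True) + 1e-8)

print(mfcc.shape)  # (13, number_of_frames)
```

The resulting feature matrix is what the acoustic model consumes, frame by frame, during decoding.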

3. Trends

At present, open-domain speech recognition is moving towards large generative AI models. In contrast, research on end-side speech recognition mainly focuses on three aspects: smaller models, faster response, and stronger robustness:

  • Model optimization and compression: optimization algorithms such as hyperdimensional computing, memory-swapping mechanisms, and constrained neural architecture search have been proposed, and model size and computational requirements are further reduced through quantization, pruning, knowledge distillation and other compression methods while maintaining high accuracy (see the sketch after this list).
  • Low-latency real-time processing: meeting real-time requirements while maintaining high accuracy by employing low-latency acoustic feature extraction algorithms, improved model inference methods, and streaming recognition techniques.
  • End-side acoustic environment adaptation: noise suppression algorithms, data augmentation of acoustic model training across a variety of noise environments, multi-channel audio processing, and so on.
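
As one concrete example of the model compression direction above, the sketch below applies full-integer post-training quantization with the TensorFlow Lite converter; the saved-model path, input shape and representative data are placeholders, and the actual size/accuracy trade-off depends on the model and calibration data.

```python
import numpy as np
import tensorflow as tf

# Convert a trained keyword-spotting model (path is a placeholder) to TensorFlow Lite
# with full-integer post-training quantization.
converter = tf.lite.TFLiteConverter.from_saved_model("kws_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_dataset():
    # In practice, yield a few hundred real MFCC feature examples so the converter can
    # calibrate activation ranges; random data is used here only to keep the sketch
    # self-contained, and the (1, 49, 13, 1) shape is an arbitrary example input.
    for _ in range(100):
        yield [np.random.rand(1, 49, 13, 1).astype(np.float32)]

converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("kws_int8.tflite", "wb") as f:
    f.write(tflite_model)

print(f"Quantized model size: {len(tflite_model)} bytes")
```

An int8 model of this kind is typically several times smaller than its float32 counterpart and maps well onto the DSP/NN instructions of microcontrollers.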

4. Open source tools and datasets

  • Open source tools
| Tool | Brief introduction | Programming language | Link |
| --- | --- | --- | --- |
| Kaldi | A powerful speech recognition toolkit that supports a variety of acoustic modeling and decoding techniques. It provides a range of tools and libraries for acoustic model training, feature extraction, decoding, and more. | C++, Shell, Python | https://github.com/kaldi-asr/kaldi |
| vosk-api | Vosk is an offline open-source speech recognition toolkit. It recognizes 16 languages, including Chinese. | Python | https://github.com/alphacep/vosk-api |
| PocketSphinx | A small speech recognition engine developed by Carnegie Mellon University for embedded systems and mobile applications. | Python, C/C++ | https://github.com/cmusphinx/pocketsphinx |
| DeepSpeech | An open-source speech recognition engine developed by Mozilla; it uses RNNs and CNNs to process acoustic features and provides pre-trained models. | Python | https://github.com/mozilla/DeepSpeech |
| Julius | An open-source large-vocabulary continuous speech recognition engine that supports multiple languages and models. | C/C++ | https://github.com/julius-speech/julius |
| HTK | A toolkit for building hidden Markov models (HMMs). | C | https://htk.eng.cam.ac.uk/ |
| ESPnet | An end-to-end speech processing toolkit covering tasks such as speech recognition, speech synthesis, and more. | Python, Shell | https://github.com/espnet/espnet |

  • End-side inference open-source frameworks

| Framework | Brief introduction | Programming language |
| --- | --- | --- |
| TensorFlow Lite | A lightweight machine learning inference framework developed by Google for mobile devices, embedded systems, and edge devices. | C++, Python |
| uTVM | A branch of TVM focused on low-latency, efficient deep learning model inference on embedded systems, edge devices, and IoT devices. | C++, Python |
| Edge Impulse | An end-to-end platform for developing and deploying machine learning models on IoT devices, supporting rich sensor and data integration, model development tools, model deployment, and inference. | C/C++ |
| NCNN | An optimized neural network computing library focused on high-performance, efficient deep learning model inference on resource-constrained devices. | C++, Python |

  • Open-source datasets

| Dataset | Brief introduction | Link |
| --- | --- | --- |
| TIMIT | A dataset widely used in speech recognition research, containing sentences read in American English by speakers of multiple accents, genders and ages, used for training and testing speech recognition systems. | https://catalog.ldc.upenn.edu/LDC93s1 |
| LibriSpeech | A large speech recognition dataset containing audio and text from public-domain English readings. | http://www.openslr.org/94/ |
| Haitian Ruisheng (Speechocean) AAC | Provides multilingual, cross-domain, cross-modal artificial intelligence data and related data services to the whole industry. | https://www.speechocean.com/dsvoice/catid-52.htm |
| Datatang | A Chinese artificial intelligence data service company providing customized training datasets, data collection and annotation. | https://www.datatang.com/ |

More open-source datasets: http://www.openslr.org/resources.php
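
To give a feel for how one of the tools above is used in practice, here is a minimal offline-recognition sketch with vosk-api; the model directory and WAV file names are placeholders, and the audio is assumed to be 16 kHz, 16-bit mono PCM.

```python
import json
import wave

from vosk import Model, KaldiRecognizer

# Path to a downloaded Vosk model directory and a test recording (both placeholders).
model = Model("vosk-model-small-cn-0.22")
wf = wave.open("command.wav", "rb")  # assumed 16 kHz, 16-bit mono PCM

rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        # A full utterance has been decoded; print the recognized text.
        print(json.loads(rec.Result()).get("text", ""))

# Flush the final partial result.
print(json.loads(rec.FinalResult()).get("text", ""))
```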

III. Technical solutions

1. Based on speech recognition chip modules

Speech recognition chips usually integrate the DSP instruction set required for signal processing and speech recognition, an FPU for floating-point arithmetic, and an FFT accelerator, and they process the audio signal through neural networks to improve recognition of speech. The following table lists some suppliers of speech recognition chip modules.

| Vendor | Chip/module | Wake word & command customization | Update method |
| --- | --- | --- | --- |
| Chipintelli (Qiying Tairen) | CI120 series, CI130 series, CI230 series | https://aiplatform.chipintelli.com/home/index.html | Serial port flashing / STING |
| Hi-Link (Helinco) | HLK-V20 | http://voice.hlktech.com/yunSound/public/toWebLogin | Serial port flashing |
| Ai-Thinker (Anxinko) | VC series modules | https://udp.hivoice.cn/solutionai | Serial port flashing |
| Movement Intelligence | SU-03T, SU-1X series, SU-3X series | https://udp.hivoice.cn/solutionai | Serial port flashing |
| Waytronic | WTK6900 | Offline customized services | - |
| Kyushu Electronics | NRK330x series | Offline customized services | - |

Voice chip modules usually need little external circuitry and can work with just a power supply, a microphone, and a speaker connected.


Taking the HLK-V20 as an example, its default firmware ships with a number of built-in wake words and command entries. After recognizing a voice command, it outputs a fixed-protocol command over the serial port, and users can connect a main control chip of their choice to add wireless communication and other functions as needed. The HLK-V20's wake words and command words can also be customized through the Hi-Link voice product customization management system: the command words are configured as required, a firmware SDK is generated online, and the firmware is then flashed to the chip module as an update.
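
A minimal sketch of the host side of this setup is shown below, assuming Python with pyserial on a gateway or PC; the serial port, baud rate and, in particular, the two-byte command frames are invented for illustration and are not the actual HLK-V20 protocol, which is defined in the module's documentation.

```python
import serial  # pyserial

# UART that the voice module's TX pin is wired to (port and baud rate are placeholders;
# use the values given in the module's datasheet).
ser = serial.Serial("/dev/ttyUSB0", baudrate=115200, timeout=1)

# Hypothetical mapping from command frames to actions. These byte values are invented
# for illustration only and do not correspond to the real HLK-V20 protocol.
ACTIONS = {
    b"\xaa\x01": "light_on",
    b"\xaa\x02": "light_off",
}

while True:
    frame = ser.read(2)  # read one illustrative 2-byte command frame
    if len(frame) < 2:
        continue  # read timed out, nothing recognized yet
    action = ACTIONS.get(frame)
    if action:
        print(f"Voice command recognized -> {action}")
        # The host would trigger the corresponding device control (relay, RF, Wi-Fi, ...) here.
```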

2. Based on the development framework of chip manufacturers

Chip manufacturers provide their own AI application development frameworks, and the speech recognition models they supply are deeply optimized for their own chip architectures, which greatly improves model inference speed.

| Manufacturer | Framework | Related link |
| --- | --- | --- |
| ESP32 | ESP-ADF speech development framework, built on the ESP-IDF infrastructure and the ESP-SR speech recognition algorithm library. | https://github.com/espressif/esp-adf/tree/master |
| STM32 | An end-to-end solution that enables developers to rapidly deploy a variety of AI models on STM32 microcontrollers. | https://stm32ai.st.com/stm32-cube-ai/ |
| Silicon Labs | Provides the MLTK toolset. | https://siliconlabs.github.io/mltk/ |

Taking Espressif's ESP-ADF framework as an example: it is a set of audio application components developed on top of ESP-IDF, with the ESP-SR speech recognition algorithm library at its core, plus a variety of codec drivers and transport protocols, so that audio and voice applications can be developed easily.


In terms of hardware, because neural network inference is required and, to ensure recognition accuracy, models such as LSTM or Seq2Seq are generally introduced, the final model file is large and the runtime memory requirements are significant, so external Flash or an SD card is generally needed.

3. Based on open-source frameworks

At present, many open-source neural network frameworks can handle both the training of speech recognition models and product-level industrial deployment, for example TensorFlow Lite, which is very popular in the industry, and TVM, a neural network compiler that supports automatic optimization of operators for the target board.


The biggest advantage of building on an open-source neural network framework is full, autonomous control of the whole process, including model training and deployment; the disadvantage is that the pipeline is long, and designing and tuning the network model is a significant challenge.
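
As a small illustration of the deployment end of such a pipeline, the sketch below runs a quantized keyword-spotting model with the TensorFlow Lite Python interpreter; the model file name, input shape and label list are placeholders carried over from the quantization sketch earlier.

```python
import numpy as np
import tensorflow as tf

LABELS = ["silence", "unknown", "light_on", "light_off"]  # placeholder command set

# Load the converted .tflite model (file name is a placeholder).
interpreter = tf.lite.Interpreter(model_path="kws_int8.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# One MFCC feature map (49 frames x 13 coefficients, chosen only for the example).
features = np.random.rand(1, 49, 13, 1).astype(np.float32)

# For a fully quantized model, scale the float features into the int8 input range.
if input_details["dtype"] == np.int8:
    scale, zero_point = input_details["quantization"]
    features = (features / scale + zero_point).astype(np.int8)

interpreter.set_tensor(input_details["index"], features)
interpreter.invoke()

scores = interpreter.get_tensor(output_details["index"])[0]
print("Predicted command:", LABELS[int(np.argmax(scores))])
```

On a microcontroller the same model would instead be executed by TensorFlow Lite for Microcontrollers in C++, but the calling pattern (load, allocate, set input, invoke, read output) is analogous.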

4. Based on neural network chips

Neural network chips usually offer high computing power and can meet complex functional requirements, but they cost more; they are mainly used in smart speakers and intelligent voice central-control scenarios.

| Manufacturer | Models |
| --- | --- |
| Allwinner Technology | R328, R58, R16, H6, F1C600 |
| Amlogic | A113X, A112, S905D |
| Beken | BK3260 |
| Intel | Atom x5-Z8350 |
| MediaTek (MTK) | MT7668, MT7658, MT8167A, MT8765V, MT7688AN, MT8516, MT2601 |
| Rockchip | RK3308, RK3229, RK3326, OS1000RK |
| Spitz | TH1520 |
| iFLYTEK | CSK4002 |

IV. Summary

This article has introduced the basic concepts and working principles of speech recognition technology, and described four technical approaches for implementing speech recognition on the end side.
