AI sonar glasses are coming! Read lips and air control mobile phones with 95% accuracy

Zhidx (WeChat official account: Zhidxcom)

Compiled by Wu Feining

Edited by Li Shuiqing

Zhidx reported on April 17 that Cornell University's Smart Computer Interfaces for Future Interactions (SciFi) Lab recently unveiled EchoSpeech, a pair of sonar glasses that recognizes silent commands from the movement of the lips and facial muscles. Combining acoustic sensing with AI and fitted with two pairs of speakers and microphones, these ordinary-looking glasses can currently recognize up to 31 "silent speech commands" continuously, with accuracy as high as 95%.

EchoSpeech's main application scenarios include noisy environments, occasions where speaking is inconvenient, and private conversations; it can also help people with speech impairments communicate with others, giving it both consumer and healthcare value. The research team built an AI deep learning pipeline that deciphers how sound waves travel across the moving face, and uses convolutional neural networks to decode silent speech.

In addition, the team is working to commercialize the technology through the Ignite funding program, with wider rollout expected in the future.

The paper, titled "EchoSpeech: Continuous Silent Speech Recognition on Minimally-obtrusive Eyewear Powered by Acoustic Sensing," will be presented this month at the ACM CHI Conference on Human Factors in Computing Systems in Hamburg, Germany.

The paper link is:

https://dl.acm.org/doi/10.1145/3534621

1. Recognizing the wearer's lip movements with conversion accuracy as high as 95%

Ruidong Zhang, a doctoral student in information science at Cornell University, a key participant in the EchoSpeech research and the paper's lead author, demonstrated the glasses' appearance, working principle and usage in a video.

To an onlooker, Ruidong Zhang appeared to be oddly talking to himself: his lips were clearly moving, but no sound came out. In fact, he was silently speaking a passcode to EchoSpeech to unlock his phone and have it play the next song in his playlist.

A scene like this, once possible only in the movies, is not telepathy but EchoSpeech, Cornell University's new device that recognizes silent commands from the movement of the lips and facial muscles.

According to Cheng Zhang, an assistant professor of information science at Cornell's College of Computing and Information Science and director of the SciFi Lab, the research team is essentially "moving sonar onto the body." Fitted with microphones and speakers smaller than pencil erasers, the EchoSpeech glasses form a wearable AI sonar system that sends sound waves across the face, receives the reflections, and senses the wearer's lip movements.

When the wearer communicates silently, a deep learning algorithm developed by the researchers analyzes the resulting echo profiles in real time, with an accuracy of about 95 percent.

In Cheng Zhang's view, the biggest limitation of earlier silent-speech recognition technology was that it could only handle a small set of predetermined commands and required the user to wear a small camera, which made it neither practical nor feasible. Wearable cameras also raise privacy concerns for the wearer and the people around them, so security has to be managed carefully.

The acoustic sensing technology used by EchoSpeech removes the need for a wearable camera. And because audio data is much smaller than image or video data, it requires less bandwidth to process and can be relayed to a smartphone in real time via Bluetooth.
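
A rough back-of-the-envelope comparison makes the point; the sample rates and video settings below are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope comparison of raw data rates. All numbers here are
# illustrative assumptions, not figures from the EchoSpeech paper.

def audio_rate_kbps(sample_rate_hz=50_000, bits_per_sample=16, channels=2):
    """Raw data rate of the microphone streams, in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1_000

def video_rate_kbps(width=640, height=480, bits_per_pixel=12, fps=30):
    """Raw data rate of an uncompressed low-resolution video stream."""
    return width * height * bits_per_pixel * fps / 1_000

if __name__ == "__main__":
    print(f"two-channel audio : {audio_rate_kbps():,.0f} kbps")  # ~1,600 kbps
    print(f"640x480 video     : {video_rate_kbps():,.0f} kbps")  # ~110,000 kbps
```

Even uncompressed, the audio streams are roughly two orders of magnitude smaller than low-resolution video, which is why they fit comfortably within a Bluetooth link.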

François Guimbretière, a professor of information science and co-author of the paper, said: "Because the data is processed locally on the user's phone rather than uploaded to the cloud, privacy-sensitive information never leaves the user's control."

The most common use cases for EchoSpeech are situations where speaking is inconvenient or impossible, such as a noisy restaurant or a quiet library. When people in public want to discuss private topics or highly confidential work, EchoSpeech keeps the conversation from being overheard by outsiders. It can also be paired with a stylus and used with design software such as CAD, largely eliminating the need for a keyboard and mouse.

Speaking about future uses of the technology, Ruidong Zhang said that for people with hearing or speech impairments, this silent-speech technology could be an excellent partner for a voice synthesizer, allowing them to express themselves smoothly and naturally. The current acoustic-sensing version of the glasses reportedly offers about 10 hours of battery life, compared with 30 minutes for a camera-based version.

Whether used as a consumer-grade smart wearable or in a healthcare role, EchoSpeech maximizes the utility of smart wearable technology.

2. 31 commands recognized continuously, with only about 6 minutes needed to adapt to a new user

EchoSpeech looks like a pair of ordinary prescription glasses, but it is far more than that. In a small study with 12 participants, EchoSpeech continuously recognized 31 distinct silent commands as well as strings of consecutive digits, with an error rate below 10%.

The published paper explains in detail how the technology works.

Two pairs of tiny speakers and microphones are mounted under the frame to monitor different sides of the face. The speakers emit sound waves at around 20,000 hertz, which travel along specific paths from one side of the glasses, across the lips, to the other side. As lip movements reflect and diffract these waves, the microphones capture their distinctive patterns and build an "echo profile" for each sentence or command, like a complete miniature sonar system working beneath the lenses.
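
To make the mechanism concrete, here is a minimal Python sketch of the general echo-profiling idea: correlate each received microphone frame with a known near-ultrasound sweep so that reflections from the lips and face show up as delay-dependent peaks whose position and strength change with movement. The sweep parameters, frame length and function names are illustrative assumptions, not the paper's exact signal design.

```python
import numpy as np

FS = 48_000   # sample rate in Hz (assumed)
FRAME = 600   # samples per echo frame (assumed)

def chirp(f0=18_000, f1=21_000, duration=FRAME / FS, fs=FS):
    """Linear frequency sweep in the near-ultrasound band (assumed design)."""
    t = np.arange(int(duration * fs)) / fs
    k = (f1 - f0) / duration
    return np.sin(2 * np.pi * (f0 * t + 0.5 * k * t ** 2))

def echo_profile(received, transmitted):
    """One column of the echo profile: correlation of a received frame with
    the transmitted sweep, i.e. echo strength as a function of delay."""
    corr = np.correlate(received, transmitted, mode="full")
    return np.abs(corr[len(transmitted) - 1:])  # keep non-negative delays

def echo_profile_image(mic_stream, transmitted, frame=FRAME):
    """Stack per-frame profiles into a 2-D delay x time image for the model."""
    n_frames = len(mic_stream) // frame
    cols = [echo_profile(mic_stream[i * frame:(i + 1) * frame], transmitted)
            for i in range(n_frames)]
    return np.stack(cols, axis=1)
```

With two speakers and two microphones, several such delay-by-time images can be stacked as channels and fed to the recognition model.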

▲System layout and echo profiles

In the figure above, panel (a) shows the final sensor placement, and panel (b) shows the signal transmission paths P1 through P4, where S1 and S2 are the speakers and M1 and M2 are the microphones. Each path consists of multiple reflections and diffractions that originate at a speaker and end at a microphone. Panel (c) shows the echo profiles EchoSpeech forms for different commands.

Through machine learning, the system infers the wearer's silent speech, the words they intend to say, from these echo profiles. The model is pre-trained on a common set of commands across users, then fine-tuned for each wearer; adapting to a new user takes only about 6 to 7 minutes.
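
A hedged sketch of what such per-user adaptation could look like in PyTorch: start from a command classifier pre-trained across users, freeze the feature extractor, and fine-tune only the classification head on the few minutes of echo profiles a new wearer records. The `classifier` attribute, epoch count and learning rate are assumptions for illustration, not details from the paper.

```python
import torch
from torch import nn

def finetune_for_new_user(model: nn.Module, user_loader, epochs=10, lr=1e-3):
    """Adapt a pre-trained command classifier to one wearer's echo profiles."""
    # Freeze everything, then unfreeze only the final classification layer.
    for p in model.parameters():
        p.requires_grad = False
    for p in model.classifier.parameters():   # 'classifier' is a hypothetical attribute
        p.requires_grad = True

    opt = torch.optim.Adam(model.classifier.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for echo_profiles, labels in user_loader:  # a few minutes of user data
            opt.zero_grad()
            loss = loss_fn(model(echo_profiles), labels)
            loss.backward()
            opt.step()
    return model
```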

The acoustic sensors connect to a microcontroller through custom audio circuitry, and the microcontroller can also be connected to a computer via a USB cable.

In a live demonstration, the team showed how a low-power version of EchoSpeech communicates wirelessly with a phone through Bluetooth and a microcontroller. Once paired with an Android phone, the system predicts facial movements and converts the results into action commands, instructing the phone to play music, activate the voice assistant or otherwise control the device. This is how Ruidong Zhang switched songs in the demonstration simply by "talking to himself."
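
Conceptually, the phone-side step reduces to dispatching each predicted label to a matching action, as in this toy Python sketch; the label names and handlers are hypothetical, not EchoSpeech's actual command set.

```python
# Toy dispatch table mapping predicted command labels (received over Bluetooth)
# to phone-side actions. Labels and handlers are placeholders for illustration.
ACTIONS = {
    "play_music":      lambda: print("starting music playback"),
    "next_track":      lambda: print("skipping to the next track"),
    "voice_assistant": lambda: print("activating the voice assistant"),
}

def handle_prediction(label: str) -> None:
    """Run the action associated with a predicted silent-speech command."""
    action = ACTIONS.get(label)
    if action is None:
        print(f"unrecognized command: {label}")
    else:
        action()

handle_prediction("next_track")
```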

In addition, the research team designed a custom deep learning pipeline to decode the acoustic traces that facial movements leave during silent speech. Building on the computed echo profiles that capture facial movement patterns, they added a model based on convolutional neural networks (CNNs) to EchoSpeech to decode silent speech from the echo profiles.
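
As an illustration of this kind of decoder, the sketch below shows a small CNN that maps a fixed-size echo-profile "image" (delay by time, one channel per microphone) to one of the silent commands. The layer sizes and the 31-way output are illustrative assumptions, not the authors' actual architecture.

```python
import torch
from torch import nn

class EchoCNN(nn.Module):
    """Toy CNN classifier over an echo-profile image of shape (channels, delay, time)."""
    def __init__(self, in_channels=2, n_commands=31):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_commands)

    def forward(self, x):                 # x: (batch, channels, delay, time)
        h = self.features(x)
        return self.classifier(h.flatten(1))

logits = EchoCNN()(torch.randn(1, 2, 64, 100))  # dummy echo-profile input
```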

The team also experimented with appending a recurrent neural network (RNN) to the end of the CNN, trying both a long short-term memory (LSTM) layer and a gated recurrent unit (GRU) layer to improve performance, and evaluated the resulting convolutional recurrent neural network (CRNN) models. The results showed that the GRU performed noticeably better than the LSTM; in most cases the CNNs performed similarly to the CRNNs, but the CNNs converged faster within the same number of training epochs.
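
The sketch below illustrates what such a CRNN variant can look like: a convolutional front end that preserves the time axis, followed by a GRU whose final hidden state feeds the classifier. The dimensions are again illustrative assumptions, not the paper's exact configuration.

```python
import torch
from torch import nn

class EchoCRNN(nn.Module):
    """Toy CRNN: convolutional features per time step, then a GRU over time."""
    def __init__(self, in_channels=2, n_commands=31, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),             # pool over delay only, keep time steps
        )
        self.gru = nn.GRU(input_size=16 * 32, hidden_size=hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_commands)

    def forward(self, x):                     # x: (batch, channels, delay=64, time)
        h = self.conv(x)                      # (batch, 16, 32, time)
        h = h.permute(0, 3, 1, 2).flatten(2)  # (batch, time, 16*32)
        out, _ = self.gru(h)
        return self.classifier(out[:, -1])    # classify from the last time step

logits = EchoCRNN()(torch.randn(1, 2, 64, 100))
```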

3. Both discrete commands and continuous sentences recognized, with the same performance whether stationary or moving

Research shows that privacy concerns and social awkwardness are major reasons people turn to silent voice assistants: they want to communicate without speaking aloud and without leaking a single sound, and silent voice assistants protect user privacy well on this point. To meet users' needs for a silent speech interface (SSI), the developers want EchoSpeech to come as close to real-life scenarios as possible.

In the experiments, the team designed two sets of commands to test EchoSpeech's ability to recognize discrete and continuous speech, while also covering the two most common usage conditions: stationary and moving.

The discrete study focused on independent commands, while the continuous study focused on continuous silent speech recognition; each participant completed both. During data collection, the commands to be executed appeared on a computer screen, and participants mouthed the displayed words without making any sound while the computer's camera recorded the entire process, clearly capturing the movement of each participant's facial muscles.

In the discrete study, each silent command was allotted up to 3 seconds before the prompt automatically advanced to the next one. In the continuous study, participants had 4 seconds to deliver each sentence to the sonar glasses and pressed the space bar or right arrow to move to the next prompt, "speaking" at as natural a pace and tone as possible.
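
For concreteness, a toy sketch of the two prompting protocols is shown below; the command and sentence lists are placeholders, and the continuous session here auto-advances instead of waiting for the space bar or arrow-key press used in the real study.

```python
import time

DISCRETE_COMMANDS = ["play", "pause", "next", "volume up"]          # placeholders
CONTINUOUS_SENTENCES = ["call five seven nine", "set timer ten minutes"]

def run_discrete_session(commands, seconds_per_command=3):
    """Discrete protocol: each prompt stays up ~3 s, then auto-advances."""
    for cmd in commands:
        print(f"SILENTLY SAY: {cmd}")
        time.sleep(seconds_per_command)

def run_continuous_session(sentences, seconds_per_sentence=4):
    """Continuous protocol: ~4 s per sentence (auto-advance in this sketch)."""
    for sentence in sentences:
        print(f"SILENTLY SAY: {sentence}")
        time.sleep(seconds_per_sentence)
```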

To test whether recognition performance remained stable in both stationary scenarios (such as sitting at a desk) and mobile ones (such as walking down the street), some participants walked freely around the room at their own pace and in their own style, while others walked while carrying a computer. The results showed no significant difference in the glasses' performance between the two conditions.

▲Comparison of speech recognition performance of EchoSpeech in static and mobile states

According to the research team, users need to provide only 6 to 8 minutes of training data recorded while stationary to use the sonar glasses equally well in both stationary and mobile settings, with good performance in each.

With potential large-scale deployment in the future, this performance could improve further, a solid step toward bringing SSI into everyday life.

4. The technology may be commercialized into an everyday consumer product

Beyond EchoSpeech, the SciFi Lab previously developed a system called EarIO, which uses sonar-equipped earphones to capture the wearer's facial expressions. As the wearer speaks, the facial skin moves, stretches and wrinkles, changing the echo profile accordingly; algorithms then interpret these echo profiles to quickly reconstruct the user's facial expressions on a digital avatar.

A research team at the University at Buffalo in New York has developed a similar device, EarCommand. When we mouth a word silently, muscle and bone movements deform the ear canal in a distinctive way, which means a specific deformation pattern can be matched to a specific word; AI algorithms then analyze the ear canal's deformation to determine the word the wearer has spoken.

The SciFi Lab is also taking part in Cornell's Ignite program to explore commercializing the EchoSpeech technology. The researchers next plan to develop smart-glasses applications that track the user's face, eyes and upper-body movements. Cheng Zhang said that smart glasses will become an important personal computing platform for understanding people's activities in everyday environments.

Conclusion: Smart wearables have entered a mature stage of development, but three major bottlenecks remain to be broken

Since Google unveiled its Project Glass smart glasses in 2012, the smart wearable market has drawn wide attention. The emergence of Cornell's EchoSpeech sonar glasses shows that the functions and application scenarios of wearables keep being refined and expanded; it is fair to say the wearable industry has entered a mature stage of research and development.

For EchoSpeech and other smart wearables alike, several key technical bottlenecks remain, spanning product form factor and AI computing power. First, high power consumption and short battery life prevent prolonged use, a problem especially evident in the camera-equipped version of EchoSpeech. Second, product functions are not yet well integrated. Third, product designs are not yet suited to everyday wear, which will require more miniaturized hardware.

Driven by real user needs and continued technology iteration, future versions of EchoSpeech are expected to improve further in wearability, mobility, interaction and sustained use.

Source: Cornell University website
