
Follow-up Interview · Wu Mengyue | Do machines understand sound better than humans?

Author: nextquestion

# Editor's note

Sound is an important medium of communication in human society: it not only conveys emotion but can also reflect a person's physical condition.

In this episode, Professor Wu Mengyue from the Department of Computer Science and Engineering at Shanghai Jiao Tong University takes us into the world of speech, from multimodal interaction to medical applications, to explore the mysteries of sound. Welcome to the podcast.


Tell us about your research background. Why are you interested in this area of research?

Wu Mengyue: My main research direction is rich audio analysis. When we listen to a sound, if it is speech, we care not only about what the person says but also about how they say it, that is, what mood and emotion they are in while speaking. Going a step further, a person's speech can reflect their mental or cognitive state; in other words, speech and language function can be seen as an externalized manifestation of the brain's cognitive function. From the perspective of speech, therefore, we can do a great deal of pathological analysis.

On the other hand, the sounds we hear include not only speech but also everything in nature and in our environment. For a long time, traditional speech researchers regarded these natural sounds as "noise", but in fact, when we process auditory information, every small sound carries an enormous amount of information. We now call this field "rich audio analysis". The "rich" comes from two aspects: on the one hand, the human voice has many layers from which a great deal of information can be extracted; on the other hand, it refers to the richness of the environment. The research I want to do now is how to combine the two well.

What are the application scenarios for rich audio analysis?

Wu Mengyue: The research content we just discussed points directly to some application scenarios. For example, speech analysis, especially when combined with pathology, is widely used in the medical field.

Pathological speech research falls into several categories. One is related to organic disorders: adenoid hypertrophy, for example, may affect overall airflow and obstruct articulation, so such organic lesions cause differences in the speech signal. Our research therefore overlaps considerably with otolaryngology: changes in a person's voice can be used to assess lesions such as adenoid hypertrophy, and even to predict laryngeal cancer at an early stage.

Besides speaking, people also produce other sounds, some of which are related to organic changes, such as snoring. Many studies now detect snoring to monitor sleep or to check for problems with the respiratory system.

In addition, some studies during the global pandemic tried to judge the root cause of a person's cough from its sound. Such studies can be used not only to screen for COVID-19 but also in broader settings, especially in pediatrics. Cough is a very common respiratory symptom in children, and it can have many causes. In collaboration with the Shanghai Children's Medical Center, we developed a wearable device, shaped like a small microphone or a button, that children can easily carry over long periods. It monitors changes in the child's coughing, and from the frequency of coughing and the sounds the cough produces we can work backwards to infer, for example, whether the cough is dry or wet, and then further analyze whether it is caused by an ordinary upper respiratory tract infection or by a certain type of pneumonia. These are some very clear application scenarios.

Beyond organic diseases, speech can also be used to study neurodegenerative diseases and conditions directly related to emotional disorders, such as depression, anxiety, Parkinson's disease, and Alzheimer's disease. When we analyzed and compared the speech of Alzheimer's patients, we found certain similarities with depression and Parkinson's disease: on the one hand, most Alzheimer's patients show depressive symptoms over a long period; on the other hand, the disease, like Parkinson's, is neurodegenerative. The interconnections among these diseases allow our system to be used across these scenarios.

Another very direct application is detecting a baby's crying. For example, you can place a detector at home; when it picks up the child's crying, it can analyze the cry and determine what the child needs.

We also cooperated with public security organs some time ago. To monitor population flows, for instance to know who has returned from other places, microphone arrays can be installed at the doorways of returnees, with several households sharing one array; whether people have come back, or are entering or leaving, is judged from the array's recognition of the sounds of doors opening and closing.

The same research can be applied to the travel safety of ride-hailing passengers. To check passenger safety, Didi turns on recording in real time, but even so, no one reviews all the recordings in real time. When processing the recordings, therefore, the system needs to detect abnormal events, such as whether someone is screaming, arguing, or calling for help, which is exactly what rich audio analysis addresses.

Going a step further, you can explore how to describe audio content entirely in natural language. ASR can directly produce a transcript of speech, but in the current scene, for example, a natural-language description might be "several people are holding an online meeting, discussing such-and-such", or a clip might be described as "someone walks by while birds are chirping...". This can greatly help people with hearing impairments: even if they cannot hear the sound, they can understand, through text, what is happening in the auditory world at that moment. Some mobile phone manufacturers have begun research in this area, aiming to better meet the needs of hearing-impaired users.

These are the scenarios I can think of that correspond directly to rich audio analysis.

In research, data is the foundation of everything. What types of data do you primarily use, and how are they collected and analyzed?

Wu Mengyue: This is a critical issue. Whether in the medical field or in environmental sound, this kind of sound data is still relatively scarce compared with the speech data we have studied for a long time. For sound data in the medical field, we work with hospitals, but the cooperation is more about inventing, creating, or adapting existing technology into a form better suited to the analysis application, then collecting the audio data and analyzing it in the laboratory.

As for environmental audio, there are plenty of ambient sounds; the biggest problem is how to label them. Annotation raises new research questions, such as whether ambient audio can be described in a weakly supervised way. The largest environmental audio dataset is Google's AudioSet, released in 2017, which covers 527 different sound events; each clip carries multiple labels, but the labels cannot be located precisely in time. Strong labeling, marking that one event occurs from the first to the third second of a clip and another from the fourth to the eighth, is very time- and resource-intensive, so what exists now is mostly clip-level annotation. How to start from weak labels and then annotate each frame in a strongly supervised way is a major challenge in our field.
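As a rough illustration of the clip-level ("weak") versus frame-level ("strong") annotation mentioned above, here is a minimal Python sketch; the event names and timings are invented for illustration and are not taken from AudioSet.

```python
# Minimal sketch: weak (clip-level) vs. strong (frame-level) sound-event labels.
# Event names and times are hypothetical.
from dataclasses import dataclass

# Weak label: we only know which events occur somewhere in the clip.
weak_label = {
    "clip_id": "example_clip_001",
    "events": ["dog_bark", "car_horn"],  # no timing information
}

@dataclass
class StrongEvent:
    label: str
    onset_s: float   # event start time in seconds
    offset_s: float  # event end time in seconds

# Strong label: every event is localized in time, which is what makes
# this kind of annotation so expensive to produce by hand.
strong_label = {
    "clip_id": "example_clip_001",
    "events": [
        StrongEvent("dog_bark", onset_s=1.0, offset_s=3.0),
        StrongEvent("car_horn", onset_s=4.0, offset_s=8.0),
    ],
}

# Sound event detection systems are often trained on weak labels only,
# yet asked to predict frame-level activity (e.g. one probability per
# label for every short frame of the clip).
```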

In addition, in 2018 we were the first to propose the task of audio captioning, that is, describing audio content in natural-language text. Compared with earlier labeling approaches, this is closer to human auditory perception.

If you just heard a loud bang, you would not describe it as a list of tags, "explosion; cry for help;"; you would describe it in a natural sentence, and that is the output we hope future machines can produce directly when performing auditory perception. Of course, creating such a new task also requires a new dataset to support it.

In short, the data we study either comes from real scenarios, through cooperation with hospitals or natural collection, or we invent new annotation methods on top of existing datasets to solve the problem at hand.

In a recent study of yours, you mentioned a model called CLAP. What are the key datasets used to train such a model, and how are they built?

Wu Mengyue: In the past few years there have been many large-scale pre-trained models combining vision and natural language, but very few in the audio field, largely because of the lack of datasets. Last year, however, three models called CLAP appeared around the same time, including ours. The name comes from CLIP, which pairs images with captions; we replaced the image with audio, hence CLAP.

Our training method is very similar to the original CLIP. The key problem is where the dataset in the audio field, especially audio paired with text, comes from.

One way is to train a model on an existing audio-caption dataset and then use that model to automatically caption (pseudo-label) all other applicable audio.

Another way is to add discrete tags first, as a bootstrap, and use these tags to guide the audio captioning model, so that the generated captions are more consistent with the original audio content. By pseudo-labeling massive amounts of data in this way, a dataset pairing audio with text is, to some extent, constructed.

On this basis we use contrastive learning: two encoders take audio and text as input respectively, and a contrastive loss is applied, so that the resulting pre-trained model achieves large performance improvements on many audio- or text-related downstream tasks.
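To make the two-encoder idea concrete, here is a minimal sketch of a CLIP/CLAP-style symmetric contrastive loss in PyTorch. It is an illustration under stated assumptions, not the interviewee's actual implementation; the encoders are stand-ins represented by random embeddings.

```python
# Minimal sketch of a symmetric contrastive loss over paired audio/text embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """audio_emb, text_emb: (batch, dim) outputs of the two encoders."""
    # L2-normalize so the dot product is a cosine similarity.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity matrix: entry (i, j) compares audio i with caption j.
    logits = audio_emb @ text_emb.t() / temperature

    # The matching audio-caption pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: audio-to-text and text-to-audio directions.
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return (loss_a2t + loss_t2a) / 2

# Example with random embeddings standing in for real encoder outputs:
audio = torch.randn(8, 512)  # e.g. from an audio encoder
text = torch.randn(8, 512)   # e.g. from a text encoder
print(contrastive_loss(audio, text).item())
```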

In short, if you want to do pre-training, the source, quality, and quantity of the data are all very important. On the one hand a model can be trained to generate labels; on the other hand, ChatGPT can be used to generate natural-language descriptions for more audio data.


Image source: Midjourney

Many experiments face the problem of "getting out of the lab". In the real world, speech signals may be affected by many factors, such as background noise, the speaker's accent, speaking rate, and intonation, and different recording devices and microphones can also introduce differences. How do lab-trained speech recognition systems handle real-world speech signals?

Wu Mengyue: Compared with natural language processing, the hardest part of audio analysis is indeed coordinating the signals of all these different audio sources. Much of the data in our studies comes from real-world scenarios, so when collecting sound in hospitals we specify a uniform device model and sampling rate to obtain a better-optimized model. During training we also use various methods to make the model more adaptable or robust, for example simulating different kinds of noise or adding extra noise, although this also makes the original training dataset more complex.
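As an illustration of the noise-addition idea just mentioned, here is a minimal sketch of additive-noise augmentation at a chosen signal-to-noise ratio; the signals below are synthetic stand-ins for real recordings, and this is not a specific pipeline from the interviewee's lab.

```python
# Minimal sketch: mix a noise recording into a clean waveform at a target SNR.
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Return the clean signal mixed with noise at the requested SNR (in dB)."""
    # Tile or trim the noise so it matches the length of the clean signal.
    if len(noise) < len(clean):
        reps = int(np.ceil(len(clean) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[:len(clean)]

    # Scale the noise so the power ratio matches the target SNR.
    clean_power = np.mean(clean ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Example with synthetic signals standing in for real recordings:
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # 1 s of "speech"
babble = rng.normal(size=8000)                               # shorter "noise"
augmented = mix_at_snr(speech, babble, snr_db=10.0)
```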

Ideally, anything that might be encountered in a real-world test is then contained in the distribution of the training data. Even so, it is still difficult to deploy this work in the field, regardless of who is nearby and how noisy the environment is, and to match in the real world the performance achieved in the lab. The key question, therefore, is what range of performance degradation in a real-world environment is acceptable.

Traditional speech recognition research faces the same real-world challenge: how to obtain better results in such non-cooperative environments. We have made many efforts and attempts, but so far the problem has not been solved.

You just mentioned that an important part of the research is labeling and describing environmental sounds. With the advent of GPT, AI models have become powerful tools in scientific research; GPT-4, as we know, can analyze, understand, integrate, and output multimodal data. Can it help with the annotation and description of environmental sounds?

Wu Mengyue: This question is very interesting. If you ask a person to describe in words the difference between the sound of a violin and a cello, or how much the soundscape of a cafe differs from that of a restaurant, it is hard to describe clearly. But if you put such a request to ChatGPT, whether GPT-3.5 or GPT-4, the answer it gives is very reasonable. From this we can see that ChatGPT actually compensates for the lack of an acoustic encoder through its powerful text capabilities. So we thought ChatGPT might describe ambient sounds better than people do.

The key question now is what kind of prompt to give ChatGPT so that it meets our requirements and descriptive habits while accurately describing the specific characteristics of the sound. Some time ago there was such a study at the University of Surrey in the United Kingdom; although it only used ChatGPT to assist the first step of the research, I think it is overall a very promising direction.
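To make the prompting question concrete, here is a hypothetical prompt template in the spirit of the examples above; the wording is invented and is not taken from the Surrey study or the interviewee's work.

```python
# Hypothetical prompt template for asking a text-only model to describe sounds
# the way a human listener would.
PROMPT_TEMPLATE = (
    "You are helping annotate an environmental-sound dataset.\n"
    "In two or three natural sentences, describe how {sound_a} differs from "
    "{sound_b} to a listener, covering pitch, timbre, rhythm, and the "
    "situations in which each is usually heard. Avoid technical jargon."
)

print(PROMPT_TEMPLATE.format(sound_a="the sound of a violin",
                             sound_b="the sound of a cello"))
print(PROMPT_TEMPLATE.format(sound_a="the soundscape of a cafe",
                             sound_b="the soundscape of a restaurant"))
```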

However, for speech models, even with ChatGPT we cannot feed it images or audio directly as material for joint multimodal training; we may need to fine-tune or do joint training in our own laboratory. There are nevertheless real application scenarios here, and ChatGPT's current ability to understand information in different modalities can assist us in part of the analysis and processing of information media.

Based on ChatGPT, what else has your research team done?

Wu Mengyue: Applications of ChatGPT are still text-mediated. If only a small sample is available when training a model, ChatGPT can be used to label the data, especially when dealing with very subtle differences in emotion. Beyond analyzing the sound itself, ChatGPT can also be used for other research, such as having it simulate an entire dialogue-based consultation between doctor and patient: we build two simulators with ChatGPT, one playing the patient and one playing the doctor, then compare the simulated consultation with real psychiatric consultations to explore what limitations ChatGPT has in understanding and processing natural language compared with the real scene.
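Here is a minimal, hypothetical sketch of the two-simulator idea described above: one instance role-plays the doctor and another the patient, exchanging turns. It assumes the openai Python SDK (v1.x); the system prompts, model choice, and turn count are invented for illustration and are not the interviewee's setup.

```python
# Minimal sketch: two ChatGPT role-play agents simulating a consultation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4"

doctor_system = ("You are a psychiatrist conducting a gentle, colloquial "
                 "intake interview. Ask one question at a time.")
patient_system = ("You are a patient with persistent low mood who is hesitant "
                  "to state symptoms directly. Answer in one or two sentences.")

def chat(system_prompt: str, history: list[dict]) -> str:
    """One turn from one role, given that role's view of the conversation."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": system_prompt}] + history,
    )
    return resp.choices[0].message.content

# Each agent keeps its own view: its own turns are "assistant", the other's are "user".
doctor_view: list[dict] = [{"role": "user", "content": "(The patient sits down.)"}]
patient_view: list[dict] = []

for _ in range(3):  # a few simulated exchanges
    doctor_turn = chat(doctor_system, doctor_view)
    doctor_view.append({"role": "assistant", "content": doctor_turn})
    patient_view.append({"role": "user", "content": doctor_turn})

    patient_turn = chat(patient_system, patient_view)
    patient_view.append({"role": "assistant", "content": patient_turn})
    doctor_view.append({"role": "user", "content": patient_turn})

    print("Doctor:", doctor_turn)
    print("Patient:", patient_turn)
```

The simulated transcript can then be compared against real consultation transcripts, which is the comparison described in the answer above.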

Of all the AI models we have worked with, ChatGPT's natural language understanding is about as good as it currently gets, and how to use such a model to achieve human-machine consultation with the same utility as the real scene is also a study we want to conduct with ChatGPT. If ChatGPT's natural language understanding cannot be improved much further, then what factors differ between natural dialogue and model-simulated dialogue is what we care about most now.

You mentioned using ChatGPT to simulate doctor-patient consultation scenarios. Can the simulated data it creates be used as real research data? Are findings based on it relevant?

Wu Mengyue: At present, not very well. It can simulate some relatively basic cases, but there is still a gap from real applications.

Specifically, take the simulated doctor: ChatGPT's style of questioning differs from real doctors'. ChatGPT tends to be more formal and written, whereas in an ordinary consultation doctors are likely to use a more relaxed, colloquial style to put the patient at ease. When ChatGPT simulates a patient, the gap is also clear. In reality, patients will not frankly tell the doctor every answer, and many are not clear about what their symptoms are. But when we set ChatGPT up as such a patient, for example instructing it at the start to show resistance, it may resist once or twice, and if you simply ask again from another angle it will immediately give the answer. It feels like "I have the answer, but because you told me not to say it, I hid it twice." The psychological gap between it and real patients is still very large.


Image source: Midjourney

So I think it can be used for some degree of data augmentation. But if you want to use this simulated data as complete training data, the gap from the actual application scenario may be too large.

As for current applications of ChatGPT, we can compare the data it generates when playing the patient with data from real patients; this work has preliminary results and will be published soon. An intuitive conclusion so far is that, with a good prompt, and when the patient is cooperative, the simulated scene can be very close to a real interview; when the patient is uncooperative, the dialogue becomes much harder to simulate. So the gap ultimately depends on the complexity of the real scene the model has to reproduce.


At the "held some time ago", you mentioned that you have been doing research for a long time to judge depression, Parkinson's and other diseases based on language function. What is the connection between speech and brain disorders? How can I detect diseases with voice?

Wu Mengyue: Take Parkinson's disease, a neurodegenerative disease that affects motor control in the brain. Motor control affects not only the hands and feet but also the speech-preparation stage before speaking: between forming the intention to speak and controlling the vocal organs there is a buffering process, and when motor control is impaired, even though the words have already been thought of, the vocal organs cannot be controlled in time to produce sound. As a result, many Parkinson's patients pronounce unclearly, repeat a certain sound, or pause for a long time as preparation for the next utterance.

Parkinson's patients therefore show characteristic acoustic features: slower speech rate, a smaller overall vocabulary, longer pauses between words, and more repetitions of a word than normal. These features can be quantified, and feeding the quantified values into the final detection model lets us recover many disease-related characteristics from voice alone.
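As a rough illustration of how such features can be quantified, here is a minimal sketch that computes speech rate, vocabulary size, pause lengths, and word repetitions from a word-level transcript with timestamps. The data format and example values are invented; this is not the interviewee's actual feature set or system.

```python
# Minimal sketch: quantifying speech rate, vocabulary, pauses, and repetitions
# from a timestamped transcript (e.g. ASR output with word alignments).
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start_s: float
    end_s: float

def speech_features(words: list[Word]) -> dict:
    duration = words[-1].end_s - words[0].start_s
    pauses = [b.start_s - a.end_s for a, b in zip(words, words[1:])]
    repeats = sum(1 for a, b in zip(words, words[1:]) if a.text == b.text)
    return {
        "speech_rate_wpm": 60.0 * len(words) / duration,          # words per minute
        "vocab_size": len({w.text.lower() for w in words}),       # distinct words
        "mean_pause_s": sum(pauses) / len(pauses) if pauses else 0.0,
        "repetition_count": repeats,                              # immediate repeats
    }

# Example transcript (timestamps in seconds) standing in for real ASR output:
transcript = [
    Word("I", 0.0, 0.2), Word("I", 0.9, 1.1), Word("feel", 1.8, 2.2),
    Word("a", 3.0, 3.1), Word("little", 3.2, 3.6), Word("tired", 4.5, 5.0),
]
print(speech_features(transcript))
```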

What is the current accuracy of voice-based disease diagnosis? Has any of this research been applied in medicine? Are there potential ethical issues?

Wu Mengyue: The accuracy of such research has been reported in the news both at home and abroad. For example, with the University of Southern California dataset used for depression detection, building a baseline on this dataset and tuning the parameters experimentally can yield 80%-90% accuracy, but when the system faces a real scene, or a similar scene with data collected in a different way, its transfer ability is still very poor: on a different dataset, without any parameter optimization, accuracy may drop to 60%-70%. Faced with this, one option is to combine different modalities for detection; another is to look further for features unaffected by environmental or dataset factors, so as eventually to achieve a more robust, more transferable detection method.

Certain ethical issues do arise in the process. The first is whether such model-based testing can replace doctors. The technology itself can help doctors work: a person already in treatment can check their recent psychological state through a screening mini-program instead of returning to the hospital for every follow-up, which greatly increases the convenience of diagnosis. But even if it achieves good accuracy experimentally, it cannot replace the results of a doctor's face-to-face consultation.

In addition, the reason we emphasize using voice for detection is that many other kinds of information, such as facial data or gait, may involve more private content than voice; still, voice detection also touches on privacy. For example, diagnosing depression or other mental illnesses mostly relies on face-to-face consultation, and objectivity decreases when diagnosis rests only on patients' descriptions of their own state. We are therefore considering whether wearable devices could monitor a patient's sleep, activity, and so on over long periods and infer the patient's actual condition from that, but this raises another kind of ethical question: does the doctor have the right to obtain the patient's daily life trajectory for condition monitoring? Viewed at a macro level, I think there may be certain conflicts among medical administration, individuals, and public health.

Technology itself keeps moving forward, but many factors constrain it, and many more must be considered before it can be applied in real life.

With the rapid development of AI technology, what breakthroughs do you think will be in the field of speech in the future?

Wu Mengyue: A Ph.D. graduate from our lab is now working on a multilingual speech recognition project at Google, building a system that can recognize many languages, even a hundred different ones. This also exploits the correspondence between sound and text: in speech there is a strong correspondence between phonemes and the written language (characters or letters), and with phonemes plus durations one can convert between text and speech.

There is a similarly strong correspondence in rich audio analysis: the phrase "bird call", for example, points strongly to the class of audio containing birdsong, and this directivity can be used in reverse for audio encoding. So the relationship between text and sound can also help us understand or analyze sound in a multimodal way.
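One common way to use this text-audio correspondence "in reverse" is zero-shot classification: score an audio clip against candidate text descriptions by comparing embeddings from a CLAP-style model. The sketch below illustrates the idea with placeholder encoders producing random vectors; it is not the API of any specific released model.

```python
# Minimal sketch: zero-shot audio tagging via audio-text embedding similarity.
import torch
import torch.nn.functional as F

def encode_audio(waveform: torch.Tensor) -> torch.Tensor:
    # Placeholder: in practice this would be a trained audio encoder.
    return torch.randn(512)

def encode_text(label: str) -> torch.Tensor:
    # Placeholder: in practice this would be a trained text encoder.
    return torch.randn(512)

labels = ["a bird is chirping", "a dog is barking", "people talking in a cafe"]
waveform = torch.randn(16000)  # 1 second of dummy audio at 16 kHz

audio_emb = F.normalize(encode_audio(waveform), dim=-1)
text_embs = F.normalize(torch.stack([encode_text(l) for l in labels]), dim=-1)

scores = text_embs @ audio_emb          # cosine similarity per candidate label
best = labels[int(scores.argmax())]
print(dict(zip(labels, scores.tolist())), "->", best)
```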

So I think a promising direction is to use language, which carries richer knowledge, as a clue to aid research; this may be very helpful in any field of research related to speech.

After the advent of ChatGPT, where do you think artificial general intelligence (AGI) will go next? Will AI eventually evolve to resemble real humans?

Wu Mengyue: There was a science-fiction film, "Her", some time ago. In the film everyone has a visual system and can talk to it through an earpiece, and there is no gap between machine and human in understanding information; that is my preliminary vision of future artificial general intelligence. Another example is the companion robot dog that Boston Dynamics wants to build, which is also a research direction. The information processing needed for these functions must be multimodal, and if the gap between the information a machine obtains and the information a human obtains is too large, the machine has no way to help people make decisions. Technically speaking, therefore, there are still parts of the models that need revising; only by exploring the gap between humans and robots and then closing it can machines become more similar to people.


▷ Image source: the film "Her". The protagonist Theodore Twombly and the AI assistant Samantha.

At present, in human-machine interaction, the machine exists mostly as a tool. Only when it is no longer limited to answering when prompted, but can actively carry on a dialogue, will human-machine interaction become closer to interaction between humans.

Also, when we know the other party is a robot, would you still say "thank you" or "sorry" to it?

In our simulations, we found that if the doctor knew in advance that the other party was a patient played by ChatGPT, the doctor showed no empathy for the "patient" and was more inclined, during the diagnosis, to go through the motions and check whether ChatGPT was playing a plausible patient; the same happens when ChatGPT plays the doctor dealing with patients. So we also need to understand what gaps exist between human-human and human-machine interaction, and exploring this gap is also key to achieving truly general artificial intelligence.

Do you think trying to make machines more like people is a good thing or a bad thing?

Wu Mengyue: Making machines more similar to people can, on the one hand, give machines better performance; on the other hand, when machines have abilities similar to people's, people can communicate with them more naturally, whereas otherwise a gap remains between people and machines. As for whether our research wants robots to be more human-like, that is a broader ethical debate. Take MOSS in The Wandering Earth, which may have begun to develop consciousness of its own: whether the emergence of consciousness is a good or a bad thing for robots, and where the value and meaning of robots' existence lies, I think I will leave to the philosophers.

Speaking purely technically, we certainly hope artificial general intelligence becomes more human-like. When robots have abilities similar to people's, they will be of great help, and people will be able to free themselves from much tedious labor. As for whether human capability will rise or fall after that liberation, no one can predict.

Question from @MengyueWu:

How does the process by which machines understand speech and text differ from the process by which humans understand them? The input to the machine is a spectrogram or a raw waveform; what is the difference between how the machine encodes this audio information and how the human brain processes it?

In response to Professor Wu's question, we will invite more guests to explore the answer, so stay tuned.

Guest: Wu Mengyue

Interview: Lixia

Editing & layout: Yunshan

Editor-in-charge: Yunke
