
Tsinghua's Department of Electronic Engineering and the Volcano Speech team jointly launch SALMONN, a cognition-oriented auditory large language model

Author: Voice House

Recently, the Department of Electronic Engineering of Tsinghua University and the Volcano Speech team jointly released SALMONN (Speech Audio Language Music Open Neural Network), a cognition-oriented, open-source auditory large language model.


SALMONN logo

Unlike other large models that support only speech input or only non-speech audio input, SALMONN can perceive and understand a wide range of audio inputs, including speech, audio events, and music. This is equivalent to "adding an ear" to the large language model, from which advanced capabilities such as multilingual and cross-modal reasoning emerge. In this article, emergent capabilities refer to cross-modal capabilities that were not learned during model training.

Specifically, SALMONN attaches a universal audio encoder based on the Whisper encoder to the Vicuna 13B large language model and uses a connection module to align the audio and text modalities. The model thus perceives the audio modality directly and is no longer a "brain in a vat". Compared with the API-based route, in which a Toolformer-style tool call first converts speech or non-speech audio into text that is then fed to the large language model, SALMONN can obtain knowledge directly from the physical world and exhibits an emergent, complete understanding of some complex audio scenarios.
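To make the pipeline above concrete, here is a minimal, purely illustrative sketch of how an audio encoder's frame sequence could be compressed by a window-level connection module and prepended to the embedded text instruction before reaching the language model. All function names, dimensions, and frame rates below are hypothetical stand-ins, not the authors' released code.

```python
# Hypothetical sketch of SALMONN-style input assembly (not the authors' code).
# A Whisper-style encoder output is compressed window-by-window into a fixed
# number of "query" embeddings, which are prepended to the text-instruction
# embeddings before being fed to the language model. All sizes are illustrative.

def encode_audio(num_samples, frames_per_second=50):
    """Stand-in for a Whisper-style encoder: one frame vector per 20 ms,
    assuming 16 kHz input audio. Returns dummy 8-dim frame embeddings."""
    num_frames = num_samples * frames_per_second // 16000
    return [[0.0] * 8 for _ in range(num_frames)]

def window_connector(frames, window_size=10, queries_per_window=2):
    """Stand-in for a window-level connection module: each window of encoder
    frames is summarized into a fixed number of tokens (here a plain average
    takes the place of learned cross-attention)."""
    tokens = []
    for start in range(0, len(frames), window_size):
        window = frames[start:start + window_size]
        mean = [sum(col) / len(window) for col in zip(*window)]
        tokens.extend([mean] * queries_per_window)
    return tokens

def build_llm_input(audio_tokens, text_tokens):
    """The audio tokens are simply prepended to the embedded text instruction."""
    return audio_tokens + text_tokens

frames = encode_audio(num_samples=32000)     # 2 s of 16 kHz audio -> 100 frames
audio_tokens = window_connector(frames)      # 10 windows x 2 queries = 20 tokens
text_tokens = [[0.0] * 8 for _ in range(5)]  # dummy embedded instruction
llm_input = build_llm_input(audio_tokens, text_tokens)
print(len(frames), len(audio_tokens), len(llm_input))  # 100 20 25
```

The key property this sketch captures is that the number of audio tokens grows with audio length (per window) rather than per frame, keeping the language model's input sequence short.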

In addition, compared with traditional speech and audio processing tasks such as speech recognition and audio captioning, SALMONN exploits the common sense and cognitive abilities that the large language model has learned from massive text, realizing cognition-oriented audio perception and greatly improving the model's versatility and the richness of its tasks. SALMONN can also follow the user's text instructions, and even spoken commands, more accurately. Since the model is trained only on data with text instructions, following spoken commands is itself a cross-modal emergent capability.


Schematic diagram of the SALMONN structure

Overall, SALMONN currently handles important speech and audio tasks such as English speech recognition, English-to-Chinese speech translation, emotion recognition, audio captioning, and music description. At the same time, a variety of multilingual and cross-modal capabilities that were not explicitly learned during training have emerged, covering non-English speech recognition, English speech translation into languages other than Chinese, summarization and keyword extraction of speech content, audio-based story generation, audio question answering, and joint speech-audio reasoning.

The research team divided these tasks into three categories by difficulty and presented a demo for each:

  • Tasks learned during model training
  • Tasks not learned during model training, but which large language models can complete from text input
  • Tasks not learned during model training that require a multimodal large model with direct audio perception
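All three categories above share the same interface: one audio clip paired with a free-form text instruction that selects the task. As a purely hypothetical illustration (the template string and function below are invented, not the released API), the same clip can be steered to a trained task or an emergent one by changing only the instruction:

```python
# Hypothetical illustration: the same audio clip is steered to different tasks
# purely by the text instruction paired with it. The chat template is invented
# for illustration; the real prompt format may differ.

def make_prompt(instruction, audio_placeholder="<AUDIO>"):
    """Wrap an audio placeholder and a text instruction in a chat-style template."""
    return f"USER: {audio_placeholder} {instruction}\nASSISTANT:"

tasks = {
    "trained (ASR)": "Transcribe the speech in the audio.",
    "emergent (keywords)": "List the keywords of the spoken content.",
    "emergent (audio story)": "Write a short story based on the sounds you hear.",
}
for name, instruction in tasks.items():
    print(name, "->", make_prompt(instruction))
```

This is why the second and third categories can work at all: the language model has seen such instructions in text form, and the aligned audio tokens slot into the same prompt position as ordinary text.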

The first category: tasks learned in model training

  • Automatic Speech Recognition

ASR "Link"

But I was also stumbling because I couldn't look away from these things.


Screenshot of SALMONN operation (voice input)

  • Audio Captioning

audiocaption "Link"


Screenshot of SALMONN running (audio input)

SALMONN's output (translated from Chinese): The sound effects include gunshots, explosions, and shouts. High sound quality and good mixing create a realistic, immersive audio experience. The dialogue is clear and easy to understand, and the characters shout and communicate with each other during the intense battle scenes. Music is also present, though less prominent than the sound effects, adding to the overall atmosphere of the scene.

  • Music Description

Music "Link"


Screenshot of SALMONN running (music input)

SALMONN's output (translated from Chinese): The music is a light and cheerful tune with a steady rhythm. It is played on the piano and has a simple, catchy melody.

  • Emotion Recognition

emotion "link"


Screenshot of SALMONN running (speech input in emotion recognition)

SALMONN's output (translated from Chinese): The speaker's mood is disappointed.

The second category: tasks not learned in model training that large language models can complete from text input

  • Speech translation from English to other languages, a multilingual emergent capability

asr_en2de "Link"

English speech input

But I was also stumbling because I couldn't look away from these things.


Screenshot of SALMONN running

SALMONN's German output: Ich stolperte auch, weil ich nicht aus den Dingen heraussehen konnte. (Back-translation: I also stumble because I cannot see clearly.)

  • Spoken keyword extraction, an emergent capability

keywords "Link"

Speech input

He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similes drawn from eating and its results occur most readily to the mind.


Screenshot of SALMONN running

Keywords output by SALMONN: festive season, Christmas, roast beef

  • Spoken question answering, an emergent capability

spoken_query "Link"

How does a rocket engine work?


Screenshot of SALMONN running

SALMONN's output (translated from Chinese): Rocket engines work by burning fuels such as hydrogen or kerosene in a combustion chamber; the hot gas produced is ejected through a nozzle. The shape of the nozzle directs the gas in a specific direction, creating thrust that propels the rocket forward.

Fuels are usually mixed with oxidants such as liquid oxygen, where a chemical reaction occurs, releasing large amounts of energy in the form of heat and pressure. The heat generated by the combustion process vaporizes the fuel, and the pressure generated pushes the gas out of the engine.

The hot gas generated during combustion is then directed into a nozzle shaped like a cone. The function of the nozzle is to expand the gas, thereby increasing the speed of the gas, forming a high-speed gas jet, generating thrust.

The thrust generated by the rocket engine propels the rocket forward, freeing it from Earth's gravity and flying through space. The specific design of the rocket engine, including the fuel used, the shape of the nozzle and the size of the engine, will determine the amount of thrust generated and the specific characteristics of the rocket flight.

The third category: tasks not learned in model training that require a multimodal large model with direct audio perception

  • Audio-based story generation

audio_story_telling "Link"

Audio scene input containing speech

Audio description: rustling occurs, ducks quack and water splashes, followed by an adult female and an adult male speaking and a duck call being blown.


Screenshot of SALMONN running

SALMONN's output (translated from Chinese): It is sunny, birds are singing, and a group of children are walking in the woods. They are on a treasure hunt, determined to find as many items on the list as possible.

As they walked, they heard a strange noise coming from a nearby pond. It sounds like a duck quacking, but it's louder and longer-lasting. The children hurried to the pond to investigate.

As they approached the pond, they saw something they had never seen before. A huge duck was sitting by the pond, quacking loudly. The duck was the size of a car, and its call was loud enough to be heard from miles away.

The children were stunned to see this giant duck. They had never seen anything like this before. They slowly approached the duck, trying not to alarm it.

As they approached, the duck turned to speak to them. "Hello, children," it said in a deep, loud voice. "What brings you to my pond?" …

  • Audio reasoning with voice commands

spoken_audio_query "Link"

Audio input with voice commands

Can you guess where I am right now?


Screenshot of SALMONN running

SALMONN's output (translated from Chinese): Sure, I can guess where you are now. Based on the gunfire in the background, it sounds like you may be in a war zone or a dangerous area. The gunfire resembles that of automatic weapons such as an AK-47 or M-16. In addition, the sound of explosions in the background suggests you may be in a location where bombs or other explosive devices are being used.

It is hard to be more specific without more information, but based on the sounds of gunfire and explosions, you could be in a war zone or a war-torn country. You might also be in a place where criminal activity or violence is rampant.

I hope the above helps; let me know if you have any more questions or need additional help.

Overall, SALMONN performs well across all three categories of tasks of varying difficulty, exhibiting remarkable emergent audio understanding and reasoning capabilities, successfully bridging the audio and text modalities, and showing early signs of general auditory artificial intelligence. The research team will open-source the SALMONN v1.0 model and related code in the near future, and will continue to update SALMONN so that large models can better perceive the multimodal physical world and contribute to building open-source general artificial intelligence. Stay tuned!

Also attached:

Github repository: https://github.com/bytedance/SALMONN/

Demo link: https://bytedance.github.io/SALMONN/

The Multimedia Signal and Intelligent Information Processing Laboratory of the Department of Electronic Engineering at Tsinghua University has a rich research record at the intersection of medicine and engineering and in speech processing. The lab's research group led by Zhang Chao was established in 2022 and focuses on multimodal speech and language processing and computational cognitive neuroscience.

The Volcano Speech team provides high-quality speech AI capabilities and full-stack speech product solutions for ByteDance's internal business lines, and serves external customers through Volcano Engine. Since its establishment in 2017, the team has focused on developing industry-leading intelligent speech technology and has continuously explored efficient combinations of AI and business scenarios to create greater user value.
