SALMONN: A large language model that supports speech, audio events, and music input

Author: O&M development Mu Zi Li

SALMONN is a large language model (LLM) that supports speech, audio event, and music input, developed by the Department of Electronic Engineering of Tsinghua University and ByteDance. Rather than accepting only speech or only audio events, SALMONN can perceive and understand a variety of audio inputs, which enables emergent capabilities such as multilingual speech recognition and translation and joint audio-speech reasoning. This can be seen as giving the LLM "ears" and cognitive hearing abilities, which makes SALMONN a step towards hearing-enabled artificial general intelligence.

SALMONN uses speech and audio encoders to produce a general-purpose audio representation, and then uses an audio-text aligner to map the audio features into the text space. Finally, the large language model generates its response based on the text prompt and the auditory tokens.
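The data flow described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not SALMONN's actual code: the function names, the tiny dimensions, the hand-made frame features, the random linear projection standing in for the aligner, and the choice to place auditory tokens before the text tokens are all hypothetical.

```python
import random

# All names and sizes here are illustrative assumptions, not SALMONN's real configuration.
AUDIO_DIM = 6  # width of the encoder's per-frame features (hypothetical)
TEXT_DIM = 4   # width of the LLM's token embeddings (hypothetical)


def encode_audio(samples, frame_len=4):
    """Stand-in for the speech and audio encoders: one feature vector per frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean = sum(frame) / len(frame)
        energy = sum(x * x for x in frame) / len(frame)
        feats.append([mean, energy, max(frame), min(frame), frame[0], frame[-1]])
    return feats


def make_aligner(in_dim, out_dim, seed=0):
    """Stand-in for the audio-text aligner: a fixed random linear projection."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]

    def align(features):
        # Project every audio feature vector into the text embedding space.
        return [[sum(w * x for w, x in zip(row, feat)) for row in weights]
                for feat in features]

    return align


def embed_text(prompt, dim=TEXT_DIM):
    """Stand-in for the LLM embedding table: a deterministic vector per word."""
    embedded = []
    for word in prompt.split():
        rng = random.Random(word)  # seeding with the word keeps the vector stable
        embedded.append([rng.uniform(-1, 1) for _ in range(dim)])
    return embedded


def build_llm_input(samples, prompt):
    """Combine auditory tokens and text tokens into one input sequence."""
    align = make_aligner(AUDIO_DIM, TEXT_DIM)
    auditory_tokens = align(encode_audio(samples))
    text_tokens = embed_text(prompt)
    # Ordering (audio first, then text) is an assumption made for this sketch.
    return auditory_tokens + text_tokens


samples = [0.0, 0.5, -0.5, 0.25, 0.1, 0.2, 0.3, 0.4]
llm_input = build_llm_input(samples, "Describe the audio")
print(len(llm_input), len(llm_input[0]))  # 5 4
```

The point of the sketch is that, after alignment, audio frames and prompt words are vectors of the same width, so the language model can consume them as a single sequence.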

Demo

Compared with traditional speech and audio processing tasks such as speech recognition and audio captioning, SALMONN achieves cognition-oriented audio perception by drawing on the LLM's common-sense knowledge and cognitive abilities, which greatly broadens the model's versatility and the range of tasks it can handle. In addition, SALMONN can follow text commands, and even spoken commands, with relatively high accuracy. Since SALMONN is trained only on data with text-based commands, following spoken commands is also an emergent cross-modal capability.

Here are some demos from SALMONN.

Audio clips (each was originally shown alongside an image of SALMONN's response, which is not reproduced here):

asr.wav
Audio captioning.wav
Music.wav
Emotion.wav
asr_en2de.wav
Keywords.flac
Spoken query.wav
Audio storytelling.wav
Spoken audio query.wav

Project Address:
