SALMONN is a large language model (LLM) that supports speech, audio-event, and music input, created by the Department of Electronic Engineering at Tsinghua University and ByteDance. Instead of accepting only speech or only audio events, SALMONN can perceive and understand a variety of audio inputs, enabling emergent capabilities such as multilingual speech recognition, speech translation, and audio-speech reasoning. This can be seen as giving LLMs "ears" and cognitive hearing, making SALMONN a step toward general artificial intelligence with auditory abilities.
SALMONN encodes a joint audio representation with a speech encoder and an audio encoder, then uses an audio-text aligner to map the audio features into the text embedding space. Finally, the large language model generates its response conditioned on both the text prompt and these auditory tokens.
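The pipeline above can be sketched in a few lines of numpy. This is a minimal illustration, not SALMONN's actual implementation: the dimensions are hypothetical, and a plain linear projection stands in for the audio-text aligner, which in the real system is a learned module.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
T = 100        # number of audio frames
D_SPEECH = 8   # speech-encoder feature size
D_AUDIO = 6    # audio-event-encoder feature size
D_LLM = 16     # LLM text-embedding size

# 1. Two encoders produce frame-level features from the same audio clip.
speech_feats = rng.standard_normal((T, D_SPEECH))
audio_feats = rng.standard_normal((T, D_AUDIO))

# 2. Concatenate them into a joint audio representation.
joint = np.concatenate([speech_feats, audio_feats], axis=-1)  # (T, 14)

# 3. The aligner (here just a linear projection standing in for the
#    learned module) maps audio features into the LLM's text embedding
#    space, yielding "auditory tokens".
W = rng.standard_normal((joint.shape[-1], D_LLM)) * 0.1
auditory_tokens = joint @ W  # (T, D_LLM)

# 4. Prepend the auditory tokens to the embedded text prompt; the LLM
#    then generates a response conditioned on both.
prompt_embeds = rng.standard_normal((5, D_LLM))
llm_input = np.concatenate([auditory_tokens, prompt_embeds], axis=0)
print(llm_input.shape)  # (105, 16)
```

The key design point is that everything downstream of the aligner lives in the LLM's own embedding space, so the frozen language model can treat audio as just another stretch of input tokens.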
Compared with traditional speech and audio processing tasks such as speech recognition and audio captioning, SALMONN achieves cognition-oriented audio perception by drawing on the LLM's common-sense knowledge and reasoning abilities, which greatly improves the model's versatility and the range of tasks it can handle. In addition, SALMONN can follow text instructions, and even spoken instructions, with relatively high accuracy. Since SALMONN is trained only with text instructions, following spoken instructions is itself a cross-modal emergent capability.
Here are some demos from SALMONN.
| Audio | Reply |
| --- | --- |
| asr.wav | |
| Audio captioning.wav | |
| Music.wav | |
| Emotion.wav | |
| asr_en2de.wav | |
| Keywords.flac | |
| Spoken query.wav | |
| Audio storytelling.wav | |
| Spoken audio query.wav | |
Project Address: