To give language models comprehensive audiovisual capabilities, DAMO Academy open-sources Video-LLaMA

Heart of the Machine column

Heart of the Machine Editorial Office

Video plays an increasingly important role in today's social media and internet culture. Platforms such as Douyin, Kuaishou, and Bilibili attract hundreds of millions of users, who share life moments, creative work, and entertaining clips in video form to interact and communicate with others.

Recently, large language models have demonstrated remarkable capabilities. Can we equip a large model with "eyes" and "ears" so that it can understand video and interact with users?

Starting from this question, researchers at DAMO Academy proposed Video-LLaMA, a large model with comprehensive audiovisual capabilities. Video-LLaMA can perceive and understand both the visual and the audio signals in a video, and can follow user instructions to complete a range of complex audio- and video-based tasks, such as audio/video description, writing, and question answering. The paper, code, and interactive demos are all publicly available. In addition, the research team provides Chinese versions of the model on the Video-LLaMA project homepage to give Chinese-speaking users a smoother experience.


Link to the paper: https://arxiv.org/abs/2306.02858

Code address: https://github.com/DAMO-NLP-SG/Video-LLaMA

Demo Address:

Modelscope: https://modelscope.cn/studios/damo/video-llama/summary

Huggingface: https://huggingface.co/spaces/DAMO-NLP-SG/Video-LLaMA

Sample input file address:

https://github.com/DAMO-NLP-SG/Video-LLaMA/tree/main/examples

Model design

Video-LLaMA follows a modular design: it maps the visual and audio information in a video into the input space of a large language model to achieve cross-modal instruction following. Unlike previous large-model work that focused on still-image understanding (MiniGPT-4, LLaVA), Video-LLaMA faces two challenges specific to video understanding: capturing dynamic scene changes in the visual stream, and integrating audio and visual signals.

To capture dynamic scene changes in video, Video-LLaMA introduces a pluggable vision-language branch. This branch first uses the pre-trained image encoder from BLIP-2 to extract features for each frame individually, then adds the corresponding frame position embedding to each of them. All frame features are fed into the Video Q-Former, which aggregates the frame-level representations into a fixed-length video representation. Finally, a linear layer aligns the video representation with the embedding space of the large language model.
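The data flow of the vision-language branch described above can be illustrated with a minimal sketch. It is not the authors' implementation: the real Video Q-Former is a multi-layer transformer and each frame yields several feature tokens, whereas this sketch assumes one feature vector per frame and a single cross-attention step. All dimensions and the helper names (`video_branch`, `random_matrix`) are made up for illustration. The point it shows is that a fixed set of learned queries turns any number of frames into the same number of output vectors.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Random stand-in for learned weights or encoder outputs."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def video_branch(frame_feats, pos_table, queries, proj):
    """Map T frame features to len(queries) vectors in the LLM embedding space."""
    # 1. Add the position embedding of each frame index to its feature.
    frames = [[f + p for f, p in zip(feat, pos)]
              for feat, pos in zip(frame_feats, pos_table)]
    scale = math.sqrt(len(queries[0]))
    video_tokens = []
    for q in queries:
        # 2. Each learned query attends over all frames (simplified Q-Former).
        weights = softmax([dot(q, f) / scale for f in frames])
        pooled = [sum(w * f[i] for w, f in zip(weights, frames))
                  for i in range(len(q))]
        # 3. A linear layer projects the result to the LLM embedding size.
        video_tokens.append([dot(pooled, col) for col in zip(*proj)])
    return video_tokens


# Hypothetical sizes, chosen only to keep the example small.
rng = random.Random(0)
feat_dim, llm_dim, num_queries, max_frames = 8, 16, 4, 32
queries = random_matrix(num_queries, feat_dim, rng)
pos_table = random_matrix(max_frames, feat_dim, rng)
proj = random_matrix(feat_dim, llm_dim, rng)

for num_frames in (4, 12):
    frame_feats = random_matrix(num_frames, feat_dim, rng)
    tokens = video_branch(frame_feats, pos_table, queries, proj)
    print(num_frames, "frames ->", len(tokens), "x", len(tokens[0]))
```

Both the 4-frame and the 12-frame input produce `num_queries` vectors of size `llm_dim`, which is what lets the language model consume videos of different lengths through a fixed number of input positions.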

Video-LLaMA processes the sound in a video with the audio-language branch. Multiple two-second audio clips are first sampled uniformly from the original video, and each clip is converted into a 128-bin Mel spectrogram. ImageBind then serves as the audio encoder and extracts features from each clip individually. After learnable positional embeddings are added, the Audio Q-Former aggregates the clip features and produces a fixed-length audio representation. As in the vision-language branch, a linear layer aligns the audio representation with the embedding space of the large language model.
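As a small illustration of the first step, the sketch below computes start times for evenly spaced two-second clips. The article states only that clips are two seconds long and sampled evenly; the exact spacing rule, the handling of short audio, and the function name `sample_clip_starts` are assumptions made for this example.

```python
def sample_clip_starts(duration_s, num_clips, clip_len_s=2.0):
    """Return start times (in seconds) of clips spread evenly over the audio.

    Assumed scheme: the first clip starts at 0 and the last clip ends at the
    end of the audio. Audio shorter than one clip yields starts of 0.0.
    """
    if num_clips <= 0:
        raise ValueError("num_clips must be positive")
    last_start = max(duration_s - clip_len_s, 0.0)
    if num_clips == 1:
        # A single clip is centred in the audio.
        return [last_start / 2]
    step = last_start / (num_clips - 1)
    return [i * step for i in range(num_clips)]


print(sample_clip_starts(10.0, 5))  # [0.0, 2.0, 4.0, 6.0, 8.0]
```

Each returned start time would then select a two-second window to convert into a Mel spectrogram before encoding.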

To reduce training costs, Video-LLaMA freezes the pre-trained image and audio encoders and updates only the following parameters in the visual and audio branches: the Video/Audio Q-Former, the positional embedding layer, and the linear layer (as shown in Figure 1).

To learn vision-text alignment, the authors first pre-trained the vision branch on a large-scale video-text dataset (WebVid-2M) and an image-text dataset (CC-595K). They then fine-tuned it on the image instruction datasets from MiniGPT-4 and LLaVA and the video instruction dataset from Video-Chat to achieve better cross-modal instruction following.

For audio-text alignment, large-scale, high-quality audio-text data is scarce, so the authors adopt a workaround. The goal of the learnable parameters in the audio-language branch can be understood as aligning the output of the audio encoder with the embedding space of the LLM. The audio encoder, ImageBind, has strong multimodal alignment capability: it embeds different modalities into a common space. The authors therefore trained the audio-language branch on visual-text data, aligning ImageBind's common embedding space with the LLM's text embedding space and, as a result, aligning the audio modality with the LLM's text embedding space as well. In this way, Video-LLaMA can understand audio at inference time even though it was never trained on audio data.
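The reasoning behind this workaround can be illustrated with a toy example. It is a sketch under strong assumptions, not the authors' training procedure: the "shared space" is three hand-picked concept vectors, the `embed` function stands in for a frozen encoder that places any modality of the same concept near the same point, and the trainable part is a single linear map fitted by plain stochastic gradient descent. All names and numbers are hypothetical.

```python
import random

rng = random.Random(0)

# Hypothetical shared-space positions of three concepts (one row each).
concepts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Hypothetical target vectors in the "LLM text embedding space".
text_targets = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]


def embed(concept_id, noise=0.05):
    """Stand-in for a frozen encoder with a modality-shared space:
    any modality of a concept lands near the same point, plus small noise."""
    return [c + rng.gauss(0.0, noise) for c in concepts[concept_id]]


def project(x, weights):
    """Apply the trainable linear map (3 inputs -> 2 outputs)."""
    return [sum(xi * row[j] for xi, row in zip(x, weights))
            for j in range(len(weights[0]))]


# Train the linear map using "visual" embeddings only.
weights = [[0.0, 0.0] for _ in range(3)]
learning_rate = 0.5
for _ in range(200):
    for k in range(3):
        x = embed(k)  # visual embedding of concept k
        err = [p - t for p, t in zip(project(x, weights), text_targets[k])]
        for i in range(3):
            for j in range(2):
                weights[i][j] -= learning_rate * x[i] * err[j]


def nearest_target(vec):
    dists = [sum((a - b) ** 2 for a, b in zip(vec, t)) for t in text_targets]
    return dists.index(min(dists))


# At inference, feed "audio" embeddings, which were never used in training.
predictions = [nearest_target(project(embed(k), weights)) for k in range(3)]
print(predictions)
```

Because the audio embeddings fall in the same region of the shared space as the visual ones, the map trained on visual data sends each of them to the correct text target. The quality of this transfer in the real model depends on how well ImageBind actually aligns the two modalities.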

Example demonstration

The authors show several examples of Video-LLaMA's video-, audio-, and image-based dialogue.

(1) The following two examples demonstrate Video-LLaMA's combined audiovisual perception; the conversations revolve around videos with sound. In the second example, only the performer appears on screen, while the audio consists of the audience's cheers and applause. A model that received only visual signals could not infer the audience's positive response. Conversely, the audio contains no instrument sound, but a saxophone is visible in the frame, so a model that received only audio could not know that the performer played the saxophone.

(2) Video-LLaMA also shows strong perception and understanding of static images, and can complete tasks such as image description and question answering.

(3) Notably, Video-LLaMA can identify famous landmarks and people and answer common-sense questions about them. In one example, Video-LLaMA correctly identifies the White House and gives some background on it. In another, given a still of Daenerys Targaryen (the "Mother of Dragons") and Jon Snow, characters in the television series Game of Thrones, Video-LLaMA not only identifies them but also describes their tangled relationship.

(4) Video-LLaMA can also capture dynamic events in video well, such as a shushing gesture or the direction in which a boat is moving.

Summary

Audio-video understanding remains a complex research problem without a mature solution. Although Video-LLaMA has shown impressive capabilities, the authors also note several limitations.

(1) Limited perception: Video-LLaMA's visual and auditory abilities are still relatively rudimentary, and it struggles to recognize complex visual and audio information. Part of the reason is that the available datasets are limited in both quality and scale. The research team is actively building high-quality audio-video-text aligned datasets to strengthen the model's perception.

(2) Difficulty with long videos: long videos (such as movies and TV shows) contain a large amount of information and place high demands on the model's reasoning ability and on computing resources.

(3) Hallucination: the hallucination problem inherent in language models also persists in Video-LLaMA.

Overall, Video-LLaMA, as a large model with comprehensive audiovisual capabilities, has achieved impressive results in audio-video understanding. As researchers continue to work on these difficulties, the challenges above are expected to be overcome one by one, giving audio-video understanding models broad practical value.