
"It's too difficult to find a film"?Large model + video search is easy to solve!

Author: Cloudinsight

Video retrieval, now with a large-model buff stacked on.

Video retrieval, colloquially "finding the video", means entering a piece of text and finding the videos that best match that description.

With the trend toward video-based social media and the rapid rise of video platforms of all kinds, video retrieval has increasingly become a new requirement for users and platforms that want to find videos and locate target content efficiently.

For individual users, faced with a huge volume of online video, it is important to be able to find the videos they are interested in quickly and accurately via keywords or descriptions. At the same time, on personal storage such as phones or cloud drives, users also need to retrieve the videos they have shot and recorded themselves.

For video editors and production teams, searching a vast media asset library for the clips or footage they need is part of the daily routine. Accurate and efficient video retrieval lets them lock onto matching material quickly, effectively improving creative efficiency.

"It's too difficult to find a film"?Large model + video search is easy to solve!

Screenshot of media asset search on a new media editing website

In addition, for video platforms and regulators, locating videos featuring disgraced performers among libraries of hundreds of millions of items and taking them offline is a huge challenge.

The development and application of video retrieval technology is not only an effective way to cope with information overload and improve data-processing efficiency; it also plays a vital role in meeting the pressing needs of individual users and professional creators, and in the growth of the video industry as a whole.

In this article, we review the development of video retrieval technology and look at the large models behind the new generation of natural language video retrieval.

01 Current status of video retrieval

Take Youku's video search ("Sopian") as an example. Its video retrieval technology is based on the following:

• The main search fields are the title and description;

• Multimodal recognition of people, ASR, and OCR, converted into text for search;

• A degree of query understanding to match entity knowledge (converted into search keywords);

• Query intent analysis with some semantic understanding (identifying question types such as "how to").

"It's too difficult to find a film"?Large model + video search is easy to solve!

Image source: Ali Entertainment technical team

The technical solution above meets users' basic video retrieval needs, but it also has shortcomings:

• A large amount of visual information cannot take part in retrieval and recall: search based on the existing multimodal algorithms can only recognize people, objects, ASR, and OCR content within the label system, so a great deal of visual information (such as a bird soaring in a blue sky) cannot be expressed as text and therefore cannot be retrieved.

• Strong reliance on knowledge graphs and semantic analysis: maintaining and updating the knowledge graph and the intent-understanding capability requires continuous investment, which makes the approach burdensome to operate.

• Loss of semantic connections in keyword-based search: take "Mr. Ma riding a bicycle" as an example. Keyword search can only intersect the results for the two keywords "Mr. Ma" and "bicycle"; the concept of "riding" is lost, which biases the recall.

At the same time, users' ways of searching for videos keep getting more demanding. They are no longer satisfied with a keyword or two; they want to use natural language like the following to match the content of the video itself, not just what can be turned into text by recognizing people, objects, ASR, and OCR, for example: a football player gets injured, a plane flies through Tianmen Mountain, spring breeze and rain nurturing peaches and plums...

What should we do if we want to achieve such intelligent search results? Let's first review the development of video retrieval technology.

02 The development of video retrieval technology

First generation: Traditional text-based video retrieval

In the era before networks were widespread, computers had very limited power to process audio and video, and media data was treated as little more than an extension of text data. To make media data searchable, website editors usually catalogued it manually: picking a title, writing a description, and even adding a few keywords by hand.

Therefore, traditional video retrieval essentially degenerates into text retrieval, using the capabilities of relational databases (such as MySQL) or inverted-index text engines (such as Elasticsearch) to search and rank text tokens.
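As a concrete illustration, here is a minimal sketch of this first-generation, text-only retrieval using the Elasticsearch Python client; the index name, fields, and documents are hypothetical examples, not the actual schema of any product mentioned in this article.

```python
# A minimal sketch of first-generation text retrieval, assuming a local Elasticsearch
# instance; index name and fields ("media_assets", title/description/keywords) are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Manually catalogued metadata is the only thing that gets indexed.
es.index(index="media_assets", id="v1", document={
    "title": "City marathon highlights",
    "description": "Runners pass the old town; drone shots of the finish line.",
    "keywords": ["marathon", "sports", "drone"],
})

# Retrieval degenerates into keyword matching over those text fields.
resp = es.search(index="media_assets", query={
    "multi_match": {"query": "marathon drone", "fields": ["title^2", "description", "keywords"]}
})
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"], hit["_source"]["title"])
```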

"It's too difficult to find a film"?Large model + video search is easy to solve!

Second generation: cross-modal video retrieval based on AI tags

As the amount of audio and video data on the internet grew, manual cataloging of media assets became unsustainable, which inevitably called for technologies with higher productivity.

By the 2010s, with the maturity of CNN-based neural networks, AI could readily recognize the objective entities in a video and classify videos with classification models, and smart-label technology emerged. For example, Alibaba Cloud Video Cloud's smart tagging technology can automatically tag videos with the following:

• Objective entities: celebrities/politicians/sensitive figures, landmarks, logos

• Scene and action events

• Keywords such as time, region, and person

• Video category information

Second-generation video retrieval builds on the first generation: it automatically analyzes the visual and auditory modalities and converts them into text data for indexing.
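To make "convert modalities into text" concrete, here is a small sketch of how automatically produced tags might be flattened into the same kind of text document and indexed alongside the manual metadata; the tag structure, field names, and index name are assumptions for illustration, not the real smart-label output format.

```python
# A sketch of second-generation ingestion: AI-produced tags are flattened into text
# fields and indexed next to the manually catalogued metadata. The tag dict below is
# a made-up example of what a tagging service might return, not a real API response.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

auto_tags = {
    "entities": ["landmark: Tianmen Mountain", "logo: airline"],
    "scenes_actions": ["aerial view", "plane flying"],
    "keywords": ["Zhangjiajie", "stunt flight"],
    "category": "travel",
    "asr_text": "today the plane flies through the mountain gate",
    "ocr_text": "live broadcast",
}

doc = {
    "title": "Plane crosses Tianmen Mountain",   # first-generation manual metadata
    "ai_labels": auto_tags["entities"] + auto_tags["scenes_actions"] + auto_tags["keywords"],
    "category": auto_tags["category"],
    "asr_text": auto_tags["asr_text"],           # speech transcribed to text
    "ocr_text": auto_tags["ocr_text"],           # on-screen text
}
es.index(index="media_assets", id="v2", document=doc)
```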

"It's too difficult to find a film"?Large model + video search is easy to solve!

Third generation: natural language video retrieval based on large models

Both of these approaches rely on keywords or tags to index and retrieve content, and this has significant limitations, especially for non-text content such as images and videos: it is very difficult to describe their full information with a limited set of tags. Such labels may neither cover all relevant concepts nor convey the nuances and deeper meaning of the content.

With the rise of AIGC and so-called Artificial General Intelligence (AGI), and especially the broad application of large models represented by large language models (LLMs), third-generation video retrieval technology has begun to mature. An LLM encodes a representation of massive human knowledge; by extending it to audio and visual modalities, we can obtain representations of media data.

Multimodal representation models convert text, images, audio, video, and other content into vector representations in a high-dimensional space, also known as embeddings. These embeddings capture the semantic information of the content and map it into a continuous vector space, so that semantically similar content lies close together in that space.

Large model search technology supports natural language search, allowing users to describe what they're looking for in their own words, rather than relying on predefined keywords or tags. Through the understanding of natural language descriptions, the large model can convert these descriptions into corresponding vector representations and look for the best matching content in the high-dimensional space.
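The following toy sketch, with random vectors standing in for real embeddings, shows the basic mechanics implied here: the query and every item live in the same vector space, and retrieval is a nearest-neighbor search by cosine similarity.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Toy stand-ins for embeddings produced by a multimodal representation model.
video_embeddings = np.random.randn(1000, 768).astype("float32")   # one vector per video clip
query_embedding = np.random.randn(1, 768).astype("float32")       # vector for the text query

scores = cosine_similarity(query_embedding, video_embeddings)[0]
top_k = np.argsort(-scores)[:5]          # the five semantically closest clips
print(top_k, scores[top_k])
```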

"It's too difficult to find a film"?Large model + video search is easy to solve!

The advantage of third-generation search technology lies in its flexibility and expressiveness. Users are no longer restricted to a handful of keywords; they can describe what they want precisely and in nuance, in their own words. And because large models understand the deeper meaning of content, search results tend to be more relevant and accurate, improving the user experience and providing a more powerful tool for finding and discovering information.

For example, suppose a user wants to find images and videos depicting "a warrior in ancient armor standing still on a mountaintop at sunset." In a traditional tag-based system, the user may need to try various keyword combinations, such as "warrior", "armor", "sunset", and "mountaintop". In a large-model cross-modal retrieval system, the user can enter the complete description directly, and the system understands its semantics and returns matching pictures and videos.

03 Natural language video retrieval goes live

Building on the DAMO Academy's multimodal representation models, Alibaba Cloud Video Cloud has launched natural language video retrieval in VOD and Intelligent Media Services. Combined with the existing AI tag retrieval, face retrieval, and image similarity retrieval, this forms a complete multi-mode retrieval solution.

Natural Language Video Retrieval Demo: https://v.youku.com/v_show/id_XNjM2MzE5NTg5Ng==.html

Our current implementation of natural language video retrieval delivers the following performance:

• Recalls relevant clips from video libraries of up to 100,000 hours

• Search latency (RT) under 1 second at 10 QPS

• Accuracy of the recalled clips above 80%

Of course, in the process of realizing natural language video retrieval, we also encountered a series of difficulties and challenges.

"It's too difficult to find a film"?Large model + video search is easy to solve!

The rest of this article describes how we overcame these difficulties, introduces the technical principles and implementation, and outlines the direction of future evolution for video retrieval.

04 Multimodal Representation Large Model Algorithm

Algorithm principles

CLIP is a vision-language pre-training model proposed by OpenAI in 2021; its pre-trained model achieves excellent transfer results on downstream tasks without fine-tuning. To break free of supervised learning's heavy dependence on labeled datasets, CLIP adopts a contrastive learning scheme, learning the correspondence between images and text from 400 million image-text pairs collected from the internet and thereby acquiring vision-language alignment ability.

The CLIP pre-trained model consists of two main modules, a Text Encoder and an Image Encoder. The Text Encoder extracts text features with a 63M-parameter text Transformer; the Image Encoder extracts image features with either a CNN-based ResNet or a Transformer-based ViT.

"It's too difficult to find a film"?Large model + video search is easy to solve!

Text-to-image search is one of the most direct applications of CLIP: first, the candidate images are fed into the Image Encoder to produce image features, which are stored; then the query text is fed into the Text Encoder to produce a text feature, which is compared against the stored image features one by one, and the image with the highest cosine similarity is returned as the top result.

Although CLIP is trained on text-image pairs, it generalizes naturally to text-video retrieval: keyframes are extracted from the video, and each keyframe image is fed into the Image Encoder to extract image features.
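Below is a hedged sketch of this text-to-keyframe retrieval flow using the original English CLIP weights via the Hugging Face transformers library; the keyframe file paths and the query are made-up examples, and keyframe extraction is assumed to have been done beforehand.

```python
# A minimal sketch of text-to-video retrieval with CLIP, assuming keyframes have already
# been extracted to image files (the paths below are hypothetical).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Keyframes sampled from the video library (hypothetical paths).
keyframe_paths = ["video1_t001.jpg", "video1_t002.jpg", "video2_t001.jpg"]
images = [Image.open(p) for p in keyframe_paths]

with torch.no_grad():
    # Image Encoder: one feature vector per keyframe, L2-normalized for cosine similarity.
    image_inputs = processor(images=images, return_tensors="pt")
    image_feats = model.get_image_features(**image_inputs)
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)

    # Text Encoder: one feature vector for the query.
    text_inputs = processor(text=["a football player gets injured"], return_tensors="pt", padding=True)
    text_feat = model.get_text_features(**text_inputs)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# Cosine similarity between the query and every stored keyframe feature.
scores = (text_feat @ image_feats.T).squeeze(0)
best = scores.argmax().item()
print(keyframe_paths[best], scores[best].item())
```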

Algorithm selection

Although CLIP has excellent zero-shot transfer ability, it was trained on an English dataset, and applying it to Chinese search would require cumbersome translation. To avoid the extra computation of a translation module, we turned to two open-source Chinese retrieval models published by the DAMO Academy: TEAM and ChineseCLIP.

TEAM is a work released by the DAMO Academy in 2022. The authors add a Token Embeddings AlignMent (TEAM) module to CLIP's twin-tower structure, which explicitly aligns image token features with text token features and produces a matching score for each input image-text pair.

In this framework, the image encoder adopts a vit-large-patch14 structure and the text encoder adopts a bert-base structure. The authors also constructed a billion-scale Chinese vision-language pre-training dataset (collected via Quark), on which the framework is pre-trained, achieving state-of-the-art performance on Chinese cross-modal retrieval benchmarks (Flickr8K-CN, Flickr30K-CN, and COCO-CN).

"It's too difficult to find a film"?Large model + video search is easy to solve!

ChineseCLIP is another work released by the DAMO Academy in 2022. It localizes CLIP to Chinese mainly through a 200-million-scale Chinese dataset (native Chinese data plus translated Chinese data), without major changes to the model structure.

To migrate the cross-modal foundation model to Chinese data efficiently, the authors designed a two-stage pre-training method. The core idea is to use LiT (Locked-image Tuning) so that the text encoder first learns to read high-quality representations from CLIP's vision backbone, and then to transfer the whole model to the new pre-training data domain.

First, the image and text towers are initialized from existing pre-trained models: the Image Encoder uses CLIP's parameters and the Text Encoder uses Chinese RoBERTa's parameters. In stage one, the Image Encoder is frozen and only the Text Encoder's parameters are updated; in stage two, both encoders are fine-tuned jointly with contrastive learning. With this two-stage training, state-of-the-art performance was achieved on Chinese cross-modal retrieval benchmarks (MUGE, Flickr30K-CN, and COCO-CN).
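The following is a minimal PyTorch sketch of that two-stage idea, assuming generic `image_encoder` / `text_encoder` modules and data loaders that yield encoder-ready batches (all placeholders); the loss is the standard symmetric in-batch contrastive loss, not ChineseCLIP's exact training code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over the in-batch image-text similarity matrix."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def train_two_stage(image_encoder, text_encoder, loader_stage1, loader_stage2, epochs=1):
    # Stage 1 (LiT): freeze the image tower, update only the text tower.
    for p in image_encoder.parameters():
        p.requires_grad_(False)
    opt = torch.optim.AdamW(text_encoder.parameters(), lr=1e-4)
    for _ in range(epochs):
        for images, texts in loader_stage1:        # batches assumed encoder-ready
            loss = clip_contrastive_loss(image_encoder(images), text_encoder(texts))
            opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: unfreeze the image tower and fine-tune both towers jointly.
    for p in image_encoder.parameters():
        p.requires_grad_(True)
    opt = torch.optim.AdamW(
        list(image_encoder.parameters()) + list(text_encoder.parameters()), lr=1e-5)
    for _ in range(epochs):
        for images, texts in loader_stage2:
            loss = clip_contrastive_loss(image_encoder(images), text_encoder(texts))
            opt.zero_grad(); loss.backward(); opt.step()
```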

"It's too difficult to find a film"?Large model + video search is easy to solve!

Algorithm evaluation

We built the test video set from short videos accumulated over time by Alibaba Cloud Video Cloud's AI editorial team. The set mainly consists of short videos ranging from a few minutes to about ten minutes, covering news, promotional videos, interviews, animation, and other types, which also matches the profile of Video Cloud customers.

"It's too difficult to find a film"?Large model + video search is easy to solve!

After ingesting the test set, we designed a set of natural-language sentences as search queries, making sure that every query had at least one corresponding video in the set. Given the small size of the video set, we only evaluated the accuracy of the TOP1 recall.

In practical tests, both TEAM and ChineseCLIP reach roughly 80% TOP1 accuracy, and either can be embedded in the system framework as the large-model feature extractor.
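For reference, TOP1 accuracy in such a test reduces to a simple hit ratio; `search_top1` below is a hypothetical stand-in for whichever retrieval call is being evaluated, and the query set is made up for illustration.

```python
def top1_accuracy(queries, search_top1):
    """queries: list of (query_text, expected_video_id);
    search_top1: callable returning the id of the best-matching video."""
    hits = sum(1 for text, expected in queries if search_top1(text) == expected)
    return hits / len(queries)

# Example usage with a made-up query set and a retrieval function to be plugged in.
queries = [
    ("a plane flies through Tianmen Mountain", "video_042"),
    ("a football player gets injured", "video_007"),
]
# accuracy = top1_accuracy(queries, search_top1=my_retrieval_fn)
```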

05 Search engineering solution

In terms of system architecture, the search service adopts a Core-Module design: the core search flow, which rarely changes, is implemented in the Core, while the various search capabilities are split into separate Modules. A module manager inside the search core manages all modules (modules are designed to register themselves).

Each module exposes three interfaces: feature extraction, query rewriting, and aggregation scoring.
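A minimal sketch of this Core-Module pattern is shown below, with hypothetical class and method names; it only illustrates self-registration plus the three interfaces, not the actual service code.

```python
from abc import ABC, abstractmethod

MODULE_REGISTRY = {}  # the "module manager": module name -> module instance

class SearchModule(ABC):
    """Every search capability implements the same three interfaces."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        MODULE_REGISTRY[cls.__name__] = cls()   # self-registration at class definition time

    @abstractmethod
    def extract_features(self, media): ...      # used at ingest time

    @abstractmethod
    def rewrite_query(self, query): ...         # used at query time

    @abstractmethod
    def score(self, query, candidates): ...     # aggregation scoring of recalled candidates

class MMModule(SearchModule):                   # large-model (cross-modal) search module
    def extract_features(self, media): return {"embedding": [0.0] * 768}
    def rewrite_query(self, query): return {"vector_query": query}
    def score(self, query, candidates): return candidates

print(MODULE_REGISTRY)   # {'MMModule': <MMModule instance>}
```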

"It's too difficult to find a film"?Large model + video search is easy to solve!

The traditional search, cross-modal search, and large-model search described above correspond to three kinds of modules; face search and DNA search modules are also supported, and further search modules can be added in the future.

"It's too difficult to find a film"?Large model + video search is easy to solve!

On the ingest side, multi-dimensional content understanding is performed when media assets are stored:

• Basic information (base-module): traditional search-engine indexing

• Smart labels (aiLabel-module): relying on the DAMO Academy's self-developed smart-label algorithms, supporting recognition of objects, scenes, landmarks, events, logos, subtitle OCR, speech ASR, keywords, categories, topics, personas, and custom labels

• Face features (face-module): face recognition

• DNA features (dna-module): feature extraction for duplicate (homologous) content detection

• Large-model features (mm-module): multimodal large-model feature extraction for content understanding

Media asset content is thus understood along different dimensions: traditional scalar data is written to ES to build an inverted index, while vector data is written to a self-developed distributed vector database.
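The ingest-side split might look roughly like the sketch below, where Elasticsearch holds the scalar fields and a toy `VectorStore` class stands in for the self-developed distributed vector database; index names, fields, and the example document are assumptions.

```python
import numpy as np
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

class VectorStore:
    """Toy stand-in for the self-developed distributed vector database."""
    def __init__(self):
        self.ids, self.vectors = [], []
    def upsert(self, media_id, vector):
        self.ids.append(media_id)
        self.vectors.append(np.asarray(vector, dtype="float32"))

vector_store = VectorStore()

def ingest(media_id, understanding):
    """Route each module's output: scalar fields -> ES inverted index, embeddings -> vector DB."""
    scalar_doc = {k: v for k, v in understanding.items() if k != "mm_embedding"}
    es.index(index="media_assets", id=media_id, document=scalar_doc)
    vector_store.upsert(media_id, understanding["mm_embedding"])

ingest("video_001", {
    "title": "City marathon highlights",            # base-module
    "ai_labels": ["running", "street", "crowd"],    # aiLabel-module
    "asr_text": "welcome to the city marathon",     # speech to text
    "mm_embedding": np.random.randn(768).tolist(),  # mm-module feature
})
```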

"It's too difficult to find a film"?Large model + video search is easy to solve!

On the query side, cross-modal large-model search takes the user's query text, extracts text features with the large model, and retrieves the target content from the vector database. Users can also run a traditional ES text search to obtain target content, or combine the two methods; the multi-channel recall capability is still in internal testing.

The self-developed distributed vector database supports feature storage at massive scale (on the order of one billion vectors), with search latency within 1 second.

At present, TOP1 accuracy for natural-language queries is about 80%, and complex semantics remain difficult to understand and retrieve.

Face search supports query by image, multimodal tag search supports query by text, and large-model search supports both text and image queries.

"It's too difficult to find a film"?Large model + video search is easy to solve!

In terms of scalability, large-model search supports multi-tenancy with data isolated between tenants, offering both low-cost shared instances and high-performance dedicated instances. When managing media asset data, users can create multiple search libraries, configure each library independently (including the choice of underlying index operators), and add, delete, modify, or query media data per library. The search architecture is highly scalable, reliable, and stable.

"It's too difficult to find a film"?Large model + video search is easy to solve!

06 Summary and outlook

This article introduced the implementation and application of cross-modal large-model retrieval in Intelligent Media Services: media assets are analyzed along multiple dimensions, and traditional ES-based scalar retrieval is combined with vector-based feature retrieval to meet users' needs for content understanding and accurate cross-modal retrieval of long videos.

However, video retrieval technology is far from the end of its evolution; optimizations and breakthroughs are still needed in the following areas.

The first is the improvement of algorithms.

Precision and recall optimization: the recall accuracy of the TEAM and ChineseCLIP models is currently about 80%; the new multimodal representation fusion model MBA under development at the DAMO Academy reaches 93% recall accuracy and will be integrated in the future.

New modality fusion: the current representation model only aligns text and images; the audio modality is missing. Imagine how nice it would be to search for "an empty mountain after fresh rain" and get a landscape video complete with the sound of rain.

Multi-representation fusion: the current algorithm only extracts features at the sentence level for text and the frame level for images, which loses the details of objective entities such as people and objects. Ideally, a representation model should fuse multiple representations. For example, a search for "Messi holding the Ballon d'Or" should return Messi holding the Ballon d'Or, not Ronaldo holding the Golden Boot. This means the large-model representation needs the ability to recognize people and written text, rather than relying solely on the text-image pairs seen during training.

The second is the balance between cost and performance.

Representation feature compression: the current feature is a 768-dimensional float32 vector; compressing float32 to uint8 gives essentially the same search quality, and compression to binary (0/1) vectors is being explored to achieve low-cost storage and search (see the combined sketch after the next item).

Segment-level representation: currently one frame per second is extracted for feature computation and storage; we are studying merging video segments and aggregating features ahead of time to reduce the number of stored frames, cutting storage and improving search efficiency.
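A small sketch of both ideas, under assumed parameters (768-dimensional features, 1 fps sampling, 5-second segments), is given below; the quantization shown is a simple per-vector min-max mapping, used only as an illustration, not the scheme actually deployed.

```python
import numpy as np

def quantize_uint8(vec):
    """Min-max quantize a float32 feature vector to uint8; keep scale/offset for dequantization."""
    lo, hi = float(vec.min()), float(vec.max())
    scale = (hi - lo) / 255.0 or 1.0
    q = np.round((vec - lo) / scale).astype("uint8")
    return q, lo, scale

def dequantize(q, lo, scale):
    return q.astype("float32") * scale + lo

def pool_segment(frame_feats, seg_len=5):
    """Mean-pool consecutive 1-fps frame features into segment-level features."""
    n = len(frame_feats) // seg_len * seg_len
    return frame_feats[:n].reshape(-1, seg_len, frame_feats.shape[1]).mean(axis=1)

frames = np.random.randn(60, 768).astype("float32")   # 60 s of 1-fps frame features (toy data)
segments = pool_segment(frames, seg_len=5)            # 12 segment features instead of 60 frames
q, lo, scale = quantize_uint8(segments[0])            # 4x smaller per vector than float32
```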

The third is in engineering and experience.

Multi-channel recall: users will be able to query AI tag search, face search, and large-model search at the same time, with the results merged, re-scored, and re-ranked (a fusion sketch appears after the next item).

LLM enhancement: understanding complex search statements, rewriting user queries to provide query-processing (QP) capabilities at search time, recognizing fields such as filter and groupBy for SQL-style conversion of the search statement, and using the large model together with the original query to analyze and filter the search results.
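The article does not specify how the merged multi-channel results are re-scored; as one common illustrative choice, the sketch below uses reciprocal rank fusion (RRF) to combine ranked lists from the different recall channels into a single ranking.

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked id lists from several recall channels into one ranking via RRF."""
    scores = {}
    for results in result_lists:
        for rank, media_id in enumerate(results, start=1):
            scores[media_id] = scores.get(media_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy ranked results from three recall channels.
tag_hits = ["v3", "v1", "v9"]     # AI-tag search
face_hits = ["v1", "v4"]          # face search
mm_hits = ["v1", "v3", "v7"]      # large-model search
print(reciprocal_rank_fusion([tag_hits, face_hits, mm_hits]))   # e.g. ['v1', 'v3', ...]
```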

At present, natural language video retrieval has been launched on Alibaba Cloud Intelligent Media Service (IMS).

Media Asset Search Product Documentation: https://help.aliyun.com/document_detail/2582336.html

Welcome to join the official Q&A "DingTalk Group" consultation and exchange: 30415005038

References & Models:

[1] "Video Search is Too Difficult! Ali Entertainment Multimodal Search Algorithm Practice": https://mp.weixin.qq.com/s/n_Rw8oa0Py7j_hPIL1kG1Q

[2] "Depth | Hundreds of millions of users watch 100 minutes a day!Understanding of short video content based on multimodal embedding and retrieval":https://mp.weixin.qq.com/s/M_E89uEPkWrMRBan1kF8AQ

[3] "Youku launched "AI Search" | Fuzzy search accurate matching, solving the difficulty of finding a film": https://mp.weixin.qq.com/s/Wr09Sfn3XxJ-CqvJmeC-Uw

[4] ChineseClip模型:https://modelscope.cn/models/iic/multi-modal_clip-vit-base-patch16_zh/summary

[5] TEAM Image and text retrieval model: https://modelscope.cn/models/iic/multi-modal_team-vit-large-patch14_multi-modal-similarity/summary
