
High and low resolution, all in one! Eight Chinese researchers release the largest and highest-resolution video-language dataset to date


Reporting by XinZhiyuan

Editor: LRS

【XinZhiyuan Introduction】The scale of video-language datasets has set a new record! Eight Chinese researchers from MSRA have jointly released HD-VILA-100M, the largest video-language dataset to date and the first large-scale high-resolution one. The paper also proposes a pre-training model, and models trained on this data improve performance by as much as 53.6%!

Recall that just a few years ago, most content on the Internet was still static, such as pictures and novels.

But with the rise of major video sites and short-form video, the number of people watching videos online has grown significantly in recent years, and the quality, resolution, and diversity of video content keep climbing.

Sharing videos of everyday life, such as travel, sports, and music, has become the new normal, and these videos usually come with a text description.

AI research has therefore followed close behind into the multimodal era of text + video: video search, video recommendation, and video editing all need this kind of multimodal modeling capability.


However, the development of existing video-language understanding models is in fact largely limited by the size and coverage of available datasets.

Early datasets such as MSR-VTT, DiDeMo, and EPIC-KITCHENS consist of videos paired with manually written text descriptions. Because manual annotation drives up construction costs sharply, these datasets could never grow very large.

In addition, these datasets contain only simple descriptive sentences, so their complexity and diversity are also quite limited, which in turn hurts the generalization of models developed on them.


Some researchers instead train directly on videos paired with automatic speech recognition (ASR) transcripts; by eliminating manual annotation of video text, the dataset size grows dramatically. The most representative example is the HowTo100M dataset, which contains over a hundred million video-text pairs.

Although dataset size went up, quality came down.


There is a large gap between automatically labeled video data and real-world videos in terms of both quality and semantic diversity.

To better understand video and address the data problems above, eight Chinese researchers from Microsoft Research Asia (MSRA) recently co-published a paper focusing on joint video-language pre-training and proposed a new dataset, HD-VILA-100M (High-resolution and Diversified VIdeo and LAnguage).

The videos in the dataset cover a wide range of categories and are useful for downstream applications such as text-to-video retrieval and video QA.


This dataset has three main features:

1. The scale is particularly large

The dataset contains 100 million video-text pairs drawn from 3 million videos, with a total duration of 370,000 hours, 2.8 times that of the aforementioned HowTo100M; the average sentence length is also about 8 times that of HowTo100M.

As mentioned earlier, subtitles generated by ASR are generally of low quality and contain no punctuation. To overcome this, the researchers used punctuator2, a tool from GitHub, to segment the captions into complete sentences, and then aligned video clips with sentences via dynamic time warping over YouTube's own caption timestamps (a rough sketch of this pipeline follows the list of features below).

After processing, the video clips in HD-VILA-100M average 13.4 seconds in length, and each sentence contains an average of 32.5 words.

2. The resolution is particularly high

All videos in the dataset are 720p, whereas current mainstream video-text datasets are only 240p or 360p.

3. The diversity is particularly high

The dataset covers the 15 most popular video categories on YouTube, such as sports, music, and cars, and the researchers also balanced the number of videos in each category.
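The caption-processing step described under feature 1 compresses several operations, so here is a minimal, self-contained Python sketch of one way such a pipeline could look: a mocked-up punctuation-restoration step standing in for punctuator2, a plain dynamic-time-warping (DTW) alignment over word sequences, and a mapping from each restored sentence back to caption timestamps. Every function name, the sentence-splitting heuristic, and the toy timestamps are hypothetical illustrations, not the authors' actual code.

```python
from dataclasses import dataclass

@dataclass
class CaptionWord:
    token: str
    start: float  # second at which the word is spoken
    end: float

def restore_punctuation(cap_words):
    """Stand-in for punctuator2: here we naively end a sentence every 10 words."""
    pieces = []
    for i, w in enumerate(cap_words, 1):
        pieces.append(w.token + ("." if i % 10 == 0 else ""))
    return " ".join(pieces)

def dtw_align(sent_words, cap_words):
    """Dynamic time warping over two word sequences; returns (i, j) index pairs."""
    n, m = len(sent_words), len(cap_words)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = sent_words[i - 1].strip(".,!?").lower() == cap_words[j - 1].token.lower()
            cost[i][j] = (0.0 if same else 1.0) + min(
                cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    path, i, j = [], n, m
    while i > 0 and j > 0:  # backtrack the warping path
        path.append((i - 1, j - 1))
        best = min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
        if best == cost[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif best == cost[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return list(reversed(path))

def sentences_to_clips(transcript, cap_words):
    """Give each punctuated sentence a (start, end) time span via the DTW path."""
    sent_words = transcript.split()
    word_to_cap = dict(dtw_align(sent_words, cap_words))  # sentence word -> caption word
    clips, first = [], 0
    for i, w in enumerate(sent_words):
        if w.endswith((".", "!", "?")) or i == len(sent_words) - 1:
            clips.append((cap_words[word_to_cap[first]].start,
                          cap_words[word_to_cap[i]].end))
            first = i + 1
    return clips

# Tiny usage example with fabricated timestamps.
caps = [CaptionWord(t, i * 0.5, i * 0.5 + 0.4) for i, t in enumerate(
    "we mix the batter then we pour it into the pan and bake for twenty minutes ok".split())]
print(sentences_to_clips(restore_punctuation(caps), caps))  # [(0.0, 4.9), (5.0, 8.4)]
```

In the real pipeline the sentence boundaries come from punctuator2 and the timestamps from YouTube's caption track; the sketch only shows how DTW can carry those timestamps over to the re-segmented sentences.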

A hybrid high- and low-resolution model

Once you have the data, you're ready to start training!

However, due to practical constraints such as memory and compute, previous work either used a simple end-to-end encoder over video frames for visual encoding and multimodal fusion, or used pre-trained spatio-temporal encoders to perform visual encoding and multimodal fusion step by step.

Little research has explored learning joint spatio-temporal video representations in an end-to-end video-language pre-training model.

And just like that, an opportunity for innovation presents itself.

The researchers proposed a new model whose input is a hybrid image sequence containing a small number of high-resolution (HR) video frames and a large number of low-resolution (LR) video frames for multimodal video learning.


Such a design enables end-to-end training of high-resolution spatiotemporal video representations and answers two key questions in the model design:

1. Which HR and LR video frames should be extracted?

The researchers first randomly sample a few HR frames from a video clip, which ensures that the learned video features are sufficiently robust.

The LR frames are then sampled uniformly from the frames neighboring each HR frame, which ensures that the HR frame in the middle carries spatial information similar to that of the LR frames, something that is critical for learning temporal features (a toy sketch of this sampling scheme appears after question 2 below).

2. How to learn spatiotemporal features from a hybrid image sequence?

The researchers encode the HR and LR frames separately and use a hybrid Transformer to map the encoded HR and LR features into the same embedding space. This design ensures that the spatiotemporal information in the video covers both the HR and LR frames in a learnable way.
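For question 1, the sampling idea can be pictured with a toy sketch like the one below: one HR frame index is drawn at random per segment, and LR indices are spaced uniformly across the segment surrounding it. The segment layout and the counts (`num_segments`, `lr_per_segment`) are made-up parameters for illustration, not the paper's configuration.

```python
import random

def sample_hybrid_frames(num_frames, num_segments=2, lr_per_segment=6):
    """Return (hr_indices, lr_indices) into a clip of `num_frames` frames."""
    seg_len = num_frames // num_segments
    hr_indices, lr_indices = [], []
    for s in range(num_segments):
        lo, hi = s * seg_len, (s + 1) * seg_len - 1
        hr_indices.append(random.randint(lo, hi))  # one random HR frame per segment
        step = max(1, seg_len // lr_per_segment)
        lr_indices.extend(list(range(lo, hi + 1, step))[:lr_per_segment])  # uniform LR neighbours
    return hr_indices, lr_indices

print(sample_hybrid_frames(32))  # e.g. ([7, 21], [0, 2, 4, 6, 8, 10, 16, 18, 20, 22, 24, 26])
```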
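For question 2, the paragraph above can be illustrated with a toy PyTorch module: HR and LR frames pass through separate encoders, are projected into a shared embedding space, and a single Transformer fuses the combined token sequence over time. Every layer size, frame resolution, and module name here is a hypothetical placeholder, not the architecture from the paper.

```python
import torch
import torch.nn as nn

class HybridVideoEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Separate backbones: a heavier one for HR frames, a lighter one for LR frames.
        self.hr_encoder = nn.Sequential(nn.Conv2d(3, dim, 16, 16), nn.AdaptiveAvgPool2d(1))
        self.lr_encoder = nn.Sequential(nn.Conv2d(3, dim, 8, 8), nn.AdaptiveAvgPool2d(1))
        # Project both streams into the same embedding space before temporal fusion.
        self.hr_proj = nn.Linear(dim, dim)
        self.lr_proj = nn.Linear(dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, hr_frames, lr_frames):
        # hr_frames: (B, T_hr, 3, 256, 256), lr_frames: (B, T_lr, 3, 64, 64)
        B, Th = hr_frames.shape[:2]
        Tl = lr_frames.shape[1]
        hr = self.hr_encoder(hr_frames.flatten(0, 1)).flatten(1).view(B, Th, -1)
        lr = self.lr_encoder(lr_frames.flatten(0, 1)).flatten(1).view(B, Tl, -1)
        tokens = torch.cat([self.hr_proj(hr), self.lr_proj(lr)], dim=1)
        return self.temporal(tokens)  # (B, T_hr + T_lr, dim) spatiotemporal features

model = HybridVideoEncoder()
out = model(torch.randn(2, 2, 3, 256, 256), torch.randn(2, 12, 3, 64, 64))
print(out.shape)  # torch.Size([2, 14, 256])
```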

The researchers ran experiments on the video-text retrieval task, where the proposed HD-VILA model shows clear advantages over models previously trained on the HowTo100M dataset.

In the zero-shot setting, HD-VILA even beats VideoCLIP's R@1 by 38.5% (10.4 -> 14.4), which shows that the learned video representations generalize well; after fine-tuning, the model surpasses all baseline models.
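The percentages quoted here are relative, not absolute, improvements in Recall@1: going from 10.4 to 14.4 works out to about 38.5%. The tiny helper below just reproduces that arithmetic; the baseline behind the 53.6% LSMDC figure mentioned next is not given in the article, so only the formula is shown.

```python
def relative_improvement(baseline, new):
    """Relative gain in percent, e.g. Recall@1 going from 10.4 to 14.4."""
    return (new - baseline) / baseline * 100

print(f"{relative_improvement(10.4, 14.4):.1f}%")  # -> 38.5%
```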


On the movie dataset LSMDC, the model achieves an even larger gain over the other baselines (53.6%). Since the style of films differs markedly from the videos in HowTo100M, a model pre-trained on HowTo100M struggles to adapt to the film domain; and because the videos in LSMDC are generally of higher resolution, and HD-VILA handles high-resolution video better than other models, the performance gain is larger still.

HD-VILA also performs better in experiments on the DiDeMo and ActivityNet datasets. These two datasets are larger, cover richer video categories, and have longer videos, so a model needs stronger temporal understanding to retrieve the correct results, which matches HD-VILA's training objectives.

In the text-to-visual generation experiments, the researchers compared against StyleCLIP and TediGAN, both of which use cross-modal pre-training for language-guided image generation and are widely recognized for their image quality. The quality of the generated visuals also reflects, to some extent, the quality of the cross-modal embeddings.

In the first example of the text-guided manipulation task, all three models succeeded in making the hair wavier, but HD-VILA was the only one that followed the text prompt and applied lipstick to the subject.


In the image super-resolution task, HD-VILA, SR3, and pSp generate 1024 × 1024 images from ultra-low-resolution 16 × 16 inputs, a challenging setting given how little information the input images contain.

The results show that SR3 and pSp cannot reconstruct high-quality faces from visual information alone, while HD-VILA, aided by the pre-trained model and the text descriptions, accurately reconstructs facial attributes such as lipstick and straight hair.


One of the paper's authors, Dr. Baining Guo, is currently the Executive Vice President of Microsoft Research Asia, responsible for research in graphics and imaging. He joined Microsoft Research China (the predecessor of Microsoft Research Asia) in 1999. Before that, he was a senior researcher at Intel's research lab in Silicon Valley. He holds master's and doctoral degrees from Cornell University and a bachelor's degree from Peking University.

Dr. Baining Guo's research interests include computer graphics, computer visualization, natural user interfaces, and statistical learning. His work in areas such as texture mapping and modeling, real-time rendering, and geometry is particularly prominent.

Resources:

https://arxiv.org/abs/2111.10337
