
Huawei Noah's Ark open-sources Wukong, the first 100-million-scale Chinese multimodal dataset, filling a major gap for the Chinese NLP community

Selected from arXiv

Written by Jiaxi Gu et al

Compiled by Machine Heart

Editor: Juniper

Researchers at Huawei's Noah's Ark Lab have proposed a large-scale Chinese cross-modal dataset, Wukong, and benchmarked a range of multimodal pre-training models on it, which should aid the development of Chinese vision-language pre-training algorithms.

Pre-training large models on big data and then fine-tuning them on downstream tasks has become an emerging paradigm for AI systems. Models such as BERT and GPT are increasingly popular in the NLP community because they transfer well to a wide range of downstream tasks, and even to zero-shot learning tasks, while delivering SOTA performance. Recent work such as CLIP, ALIGN, and FILIP has further extended this paradigm to vision-language pre-training (VLP) and has shown results superior to SOTA methods on a variety of downstream tasks. This promising direction has attracted a great deal of attention from industry and researchers as a pathway to the next generation of AI models.

Two factors account for the success of VLP models. On the one hand, more advanced model architectures (such as ViT/BERT) and training objectives (such as contrastive learning) tend to improve model generalization and the robustness of the learned representations. On the other hand, thanks to advances in hardware and distributed training frameworks, more and more data can be fed into large models, improving their generalization, transferability, and zero-shot capabilities. In vision and language tasks, pre-training on large-scale data (e.g., JFT-300M for image classification, the C4 dataset for T5) followed by transfer learning or prompt learning has proven very effective for improving downstream performance. In addition, recent work has demonstrated the potential of VLP models trained on more than 100 million noisy image-text pairs crawled from the web.

As a result, the success of pre-trained VLP models on large-scale data has driven the crawling and collection of ever larger datasets. Table 1 below gives an overview of popular datasets in the VLP field. Publicly available (English) vision-language datasets such as Flickr30k, SBU Captions, and CC12M have relatively small sample sizes (around 10 million), while datasets like LAION-400M are much larger. However, training Chinese models directly on translated English datasets can significantly degrade performance on Chinese tasks: many Chinese-specific idioms and slang expressions have no English counterpart, and machine translation often introduces errors in exactly these areas, which in turn hurts downstream task execution.


At present, the lack of large-scale publicly available Chinese datasets not only hinders the development of the community, but also forces each work to rely on its own large private dataset, achieving impressive performance that other works cannot fairly compare against.

To bridge this gap, researchers at Huawei's Noah's Ark Lab released a large Chinese cross-modal dataset named Wukong, containing 100 million image-text pairs collected from the web. To ensure diversity and generalization, the Wukong dataset was collected from a list of 200,000 high-frequency Chinese words. The paper also applies image-based and text-based filtering strategies to further refine the data, making Wukong the largest Chinese vision-language cross-modal dataset to date. The researchers analyzed the dataset and showed that it covers a wide range of visual and textual concepts.


Address of the paper: https://arxiv.org/pdf/2202.06767.pdf

Dataset address: https://wukong-dataset.github.io/wukong-dataset/benchmark.html

The researchers also released a set of large pre-trained models built with different architectures (ResNet/ViT/SwinT) and different methods (CLIP, FILIP, and LiT). The main contributions of the paper are as follows:

Released a large-scale Chinese vision-language pre-training dataset containing 100 million image-text pairs that cover a comprehensive range of visual concepts;

Released a set of large pre-trained vision-language models built with a variety of popular architectures and methods, together with comprehensive benchmarking against previously published models;

The released pre-trained models achieve the best performance on several Chinese benchmarks, including a zero-shot image classification task spanning 17 datasets and an image-text retrieval task spanning 5 datasets.

"Wukong" dataset

The researchers built a new dataset named Wukong, which contains 100 million image-text pairs collected from the web. To cover a sufficiently diverse set of visual concepts, the Wukong dataset was collected from a query list of 200,000 terms. This base query list is taken from "Directional Skip-Gram: Explicitly Distinguishing Left and Right Context for Word Embeddings" (Yan Song et al.) and filtered according to the frequency of Chinese words and phrases in Huawei's massive news text corpus.

After the query list was established, the researchers searched Baidu Images for each query to obtain a list of image URLs and the corresponding caption information. To maintain a balance across different queries, they kept at most 1,000 samples per query. The images were then downloaded from the collected URLs, yielding a total of 166 million image-text pairs. Following common practice, the researchers then constructed the final Wukong dataset using the series of filtering strategies described below. Figure 2 below shows some samples from the Wukong dataset.
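As a rough illustration of the download step, the sketch below fetches images from an already collected list of (URL, caption) pairs. The function name, file layout, and error handling are illustrative assumptions; the authors' actual crawling pipeline is not described at this level of detail.

```python
import os
import requests

def download_images(url_caption_pairs, out_dir, timeout=5):
    """Download images from (url, caption) pairs and keep the successful ones."""
    os.makedirs(out_dir, exist_ok=True)
    kept = []
    for i, (url, caption) in enumerate(url_caption_pairs):
        try:
            resp = requests.get(url, timeout=timeout)
            resp.raise_for_status()
            path = os.path.join(out_dir, f"{i:09d}.jpg")
            with open(path, "wb") as f:
                f.write(resp.content)
            kept.append((path, caption))
        except requests.RequestException:
            continue  # skip unreachable or broken URLs
    return kept
```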


Image-based filtering

The researchers first filtered the data by image size and aspect ratio. Only images larger than 200 pixels in length or width and with an aspect ratio of no more than 3 are retained. This filters out images that are too small, too tall, or too wide, since they can end up at low resolution after pre-training image augmentations such as upsampling and square cropping.
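A minimal sketch of such a filter follows. It assumes the size rule means both dimensions must exceed 200 pixels (the text could also be read as either dimension); the threshold values are taken from the description above.

```python
from PIL import Image

def keep_image(path, min_size=200, max_aspect_ratio=3.0):
    """Image-based filter: drop images that are too small or too elongated."""
    try:
        with Image.open(path) as img:
            width, height = img.size
    except OSError:
        return False  # unreadable or corrupted file
    if min(width, height) <= min_size:
        return False  # too small (assumes both dimensions must exceed the threshold)
    if max(width, height) / min(width, height) > max_aspect_ratio:
        return False  # too tall or too wide
    return True
```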

Text-based filtering

Second, to ensure that the selected samples carry high-quality Chinese descriptions of the corresponding images, the researchers further filtered the data according to the language, length, and frequency of the text attached to each image. Specifically, they first checked language and length, keeping only sentences that contain at least one and fewer than 32 Chinese characters. Meaningless image descriptions such as "000.jpg" were also discarded. In addition, text that is paired with too many images is usually unrelated to the image content, for example "View source page", "Expand text", or "Photography community". In practice, the researchers set this threshold to 10, discarding image-text pairs whose text appears more than 10 times in the entire collected corpus.

To protect the privacy of people mentioned in the text, the researchers replaced person names with a special placeholder token. In addition, they built a list of sensitive Chinese words, and image-text pairs containing sensitive words were also discarded.
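The text-based filters described above can be sketched roughly as follows. The placeholder token, the source of the name list, and the exact sensitive-word list are assumptions for illustration; only the length bounds and the frequency threshold come from the description.

```python
import re
from collections import Counter

CJK = re.compile(r"[\u4e00-\u9fff]")   # basic CJK unified ideographs
NAME_TOKEN = "<PERSON>"                # hypothetical placeholder token

def keep_text(text, min_chars=1, max_chars=32, sensitive_words=()):
    """Text-based filter: Chinese length check plus a sensitive-word blacklist."""
    n_chinese = len(CJK.findall(text))
    if not (min_chars <= n_chinese < max_chars):
        return False
    if any(word in text for word in sensitive_words):
        return False
    return True

def mask_names(text, name_list):
    """Replace person names with a placeholder (name list assumed to be supplied
    externally, e.g. from an NER tool or a lexicon)."""
    for name in name_list:
        text = text.replace(name, NAME_TOKEN)
    return text

def drop_frequent_captions(pairs, max_count=10):
    """Discard pairs whose caption appears more than `max_count` times overall."""
    counts = Counter(caption for _, caption in pairs)
    return [(img, cap) for img, cap in pairs if counts[cap] <= max_count]
```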

After applying the above filtering strategies, the researchers ended up with a dataset of about 100 million image-text pairs. Table 2 below shows statistics of the dataset: the texts contain 20,442 unique tokens, and each description has 22 tokens on average.


In Figure 3 below, the researchers visualize the distribution of words (each consisting of one or more tokens) in the dataset. They used Jieba, a Chinese text segmentation tool, to segment the words and build a word cloud for the dataset.
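A minimal sketch of this kind of word-cloud construction with the Jieba segmenter and the wordcloud package is shown below; the font path and the single-character filter are illustrative choices, not details from the paper.

```python
from collections import Counter

import jieba
from wordcloud import WordCloud

def build_word_cloud(captions, font_path="SimHei.ttf"):
    """Segment Chinese captions with Jieba and render a frequency word cloud.
    A CJK-capable font (font_path) is needed to display Chinese glyphs."""
    freqs = Counter()
    for caption in captions:
        # Keep multi-character words only, an illustrative simplification.
        freqs.update(w for w in jieba.lcut(caption) if len(w.strip()) > 1)
    cloud = WordCloud(font_path=font_path, width=800, height=600,
                      background_color="white")
    return cloud.generate_from_frequencies(freqs)

# Usage: build_word_cloud(["一只蜂鸟停在树枝上"]).to_file("wordcloud.png")
```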


Method

Text-image joint alignment

Similar to recent well-validated methods, the researchers adopt a contrastive pre-training architecture, as shown in Figure 1 below. They use a dual-stream model with Transformer-based text and image encoders. Both encoders convert the text and visual input tokens into embeddings of the same dimension. In this learned joint embedding space, a contrastive loss encourages paired images and texts to have similar embeddings while pushing unpaired ones apart.
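A minimal sketch of the symmetric contrastive (InfoNCE) objective used by CLIP-style dual-stream models is given below; the temperature value and function name are assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) outputs of the two encoders; row i of
    each tensor is assumed to belong to the same image-text pair.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature       # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)           # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)       # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```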


Model architecture

Since the encoders for the visual and text modalities are decoupled, different encoder architectures can be explored for each. The researchers experimented with three visual encoder variants (ResNet, Vision Transformer, and Swin Transformer) and a single BERT-like text encoder to train Chinese VLP models.

Pre-training objectives

Cross-modal contrastive learning is a particularly effective way to train a model from paired image-text data: it learns representations of both modalities simultaneously by distinguishing paired from unpaired samples. The researchers follow the notation of FILIP (Yao et al., 2022), treating the image data and the text data as two sets of samples. Given an image sample and a text sample, the model's goal is to bring the representations of paired images and texts close together in a joint multimodal space while pushing unpaired ones apart.

In this work, the researchers explore two ways of measuring the similarity between images and texts. The learned representations of an image and a text are denoted z^I ∈ R^(n_1×d) and z^T ∈ R^(n_2×d), respectively, where n_1 and n_2 are the numbers of (non-padded) tokens in the image and the text.
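The two similarity measures can be sketched as follows: a CLIP-style global similarity between pooled embeddings, and a FILIP-style token-wise similarity in which each token is matched with its most similar token in the other modality. This is a simplified sketch that averages the two directions; FILIP uses them separately for the image-to-text and text-to-image losses, and the assumption that the first token is the global one is illustrative.

```python
import torch
import torch.nn.functional as F

def global_similarity(z_img, z_txt):
    """CLIP-style similarity between global embeddings.

    z_img: (n1, d) image patch-token embeddings; z_txt: (n2, d) text-token
    embeddings. The first token of each sequence is assumed to be the global one.
    """
    g_img = F.normalize(z_img[0], dim=-1)
    g_txt = F.normalize(z_txt[0], dim=-1)
    return torch.dot(g_img, g_txt)

def token_wise_similarity(z_img, z_txt):
    """FILIP-style similarity: each token is matched with its most similar
    token in the other modality, then the maxima are averaged."""
    z_img = F.normalize(z_img, dim=-1)
    z_txt = F.normalize(z_txt, dim=-1)
    sim = z_img @ z_txt.t()              # (n1, n2) token-to-token similarities
    i2t = sim.max(dim=1).values.mean()   # image patches -> best-matching text token
    t2i = sim.max(dim=0).values.mean()   # text tokens -> best-matching image patch
    return (i2t + t2i) / 2
```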

LiT-tuning

The researchers were inspired by a recently proposed fine-tuning paradigm, LiT-tuning (Locked-image Text tuning), which shows that a frozen image encoder paired with a learnable text encoder works best in VLP models. They adopt the same setup in their contrastive learning framework, updating only the weights of the text encoder, not those of the image encoder.

Specifically, the LiT-tuning approach used here aims to teach a Chinese text encoder to read suitable representations out of an existing image encoder that was pre-trained on English datasets. The researchers also add an optional learnable linear transformation layer to each encoder, which maps the representations of both modalities to the same dimension. LiT-tuning works well because it decouples the data sources and techniques used for learning image features from those used for vision-language alignment (Zhai et al., 2021b). Moreover, the image encoder has already been well pre-trained on relatively clean or (semi-)manually labeled images.

The researchers extend this idea to multilingual data sources, aligning an image encoder pre-trained and frozen on English data with a trainable Chinese text encoder. In addition, LiT-tuning significantly speeds up training and reduces memory requirements, because no gradients need to be computed for the visual encoder.
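A minimal sketch of this LiT-style setup is shown below: the image tower is frozen, while the Chinese text encoder and the optional linear projections remain trainable. The embedding dimensions, learning rate, and function name are assumptions for illustration.

```python
import torch

def make_lit_tunable(image_encoder, text_encoder, embed_dim=512,
                     img_dim=768, txt_dim=768):
    """LiT-style setup: freeze the image encoder, keep the Chinese text
    encoder and the projection layers trainable."""
    for p in image_encoder.parameters():
        p.requires_grad = False        # locked image tower, no gradients needed
    image_encoder.eval()

    # Optional learnable linear projections mapping both modalities to the
    # same joint embedding dimension (dimensions here are assumptions).
    img_proj = torch.nn.Linear(img_dim, embed_dim)
    txt_proj = torch.nn.Linear(txt_dim, embed_dim)

    trainable = (list(text_encoder.parameters())
                 + list(img_proj.parameters())
                 + list(txt_proj.parameters()))
    optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # learning rate is an assumption
    return img_proj, txt_proj, optimizer
```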

Experimental results

Table 3 below lists the model parameters and the details of the visual encoders.


Zero-shot image classification. The researchers evaluate the pre-trained models on 17 zero-shot image classification tasks; the results are shown in Table 5 below. They compared several LiT-tuning models with different visual encoders, i.e., loading existing visual encoders from CLIP or Swin Transformer and freezing their weights during training. They found that using token-level similarity yields more significant improvements than using global similarity.
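For context, zero-shot classification with such a dual-encoder model typically works by embedding a Chinese prompt for each class name and picking the class whose text embedding is closest to the image embedding. The sketch below illustrates this; the prompt template, the `text_encoder` interface, and the function name are assumptions, not the authors' exact setup.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_emb, class_names, text_encoder,
                       template="一张{}的照片"):
    """Pick the class whose prompt embedding is closest to the image embedding.

    `text_encoder` is assumed to map a list of strings to an (n, d) tensor;
    `image_emb` is a (d,) image embedding from the visual encoder.
    """
    prompts = [template.format(name) for name in class_names]
    text_emb = F.normalize(text_encoder(prompts), dim=-1)   # (num_classes, d)
    image_emb = F.normalize(image_emb, dim=-1)               # (d,)
    scores = text_emb @ image_emb                            # cosine similarities
    return class_names[int(scores.argmax())]
```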


Image-text retrieval. The researchers evaluate two sub-tasks, image-to-text retrieval and text-to-image retrieval. Tables 6 and 7 below show the results under the zero-shot and fine-tuned settings, respectively. In the zero-shot setting, Wukong_ViT achieves the best results on 3 of the 4 datasets compared with the other models, while Wukong_ViT-500M achieves the best results on the larger MUGE dataset. In the fine-tuned setting, Wukong_ViT-500M achieves the best results on all datasets except AIC-ICC, where Wukong_ViT performs best.
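Retrieval results of this kind are usually reported as recall@K. A minimal sketch of how text-to-image recall@K could be computed from the two embedding matrices is given below; it assumes row i of each matrix is a matched image-text pair and that the embeddings are already L2-normalized.

```python
import torch

def recall_at_k(image_emb, text_emb, k=(1, 5, 10)):
    """Text-to-image retrieval recall@K for paired, L2-normalized embeddings."""
    sims = text_emb @ image_emb.t()                # (n_texts, n_images) similarities
    ranks = sims.argsort(dim=1, descending=True)   # candidate images per text query
    targets = torch.arange(sims.size(0)).unsqueeze(1)
    results = {}
    for kk in k:
        hits = (ranks[:, :kk] == targets).any(dim=1).float().mean()
        results[f"R@{kk}"] = hits.item()
    return results
```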


Visualization of word-patch alignment. The researchers visualize word-patch alignment using the pre-trained models Wukong_ViT and Wukong_Swin. As shown in Figure 4, images from six classes of the Chinese version of ImageNet (damselfly, lifeboat, hummingbird, phablet, church, and electric fan) are visualized. The same visualization method as in FILIP (Yao et al., 2022) is then applied to align the word tokens with the image patch tokens.

From Figure 4 below, the researchers found that both models can locate the image patches corresponding to the target object. Because Wukong_ViT works with more image patches, its word-patch alignment is more fine-grained than that of Wukong_Swin.

