
Huawei Noah Open-Sources "Wukong", the First Large-Scale Chinese Multimodal Dataset: 100 Million Image-Text Pairs, Plus Baseline Large Models

Author: AI Tech Review

Recently, Huawei's Noah's Ark Lab open-sourced "Wukong", a Chinese cross-modal dataset containing on the order of 100 million image-text pairs.

Multimodal large models (such as OpenAI's CLIP and Google's ALIGN) have set off a new wave of large-scale multimodal learning in recent years. These models exhibit excellent open-domain zero-shot capability on various downstream tasks, making them a possible path toward the next generation of general artificial intelligence.

The success of these large models relies heavily on large-scale pre-training datasets, but Chinese has long lacked open-source data at this scale, which hinders the development and application of Chinese multimodal large models.

In response to this gap, Huawei Noah decided to open-source "Wukong". The new dataset is reported to contain 100 million Chinese image-text pairs collected from the web and then filtered.

Illustration: word cloud of the text in the Wukong dataset

In addition, because training large models is expensive and time-consuming, the Noah team has also open-sourced a series of multimodal baseline large models to facilitate future use and development in the Chinese-language community.

For flexibility, these models use different image encoders (ResNet/ViT/SwinT) and different pre-training methods (CLIP/FILIP/LiT).

They also ran a series of benchmarks on different tasks. Experiments show that Wukong can serve as an excellent Chinese cross-modal pre-training dataset and benchmark, with the baseline models performing well on a variety of downstream tasks.

For more information, please refer to:

  • https://arxiv.org/abs/2202.06767
  • https://wukong-dataset.github.io/wukong-dataset/

Here are the project details:

1 Dataset Construction

Illustration Note: well-known image-text pre-training datasets in the industry; Wukong is the first large-scale Chinese dataset among them

Pre-training of visual-language multimodal large models relies heavily on large-scale image-text datasets.

Although open-source English image-text datasets are relatively abundant (e.g., CC12M, YFCC100M, LAION-400M), the Chinese community has long lacked such large-scale datasets that are freely downloadable for research. Wukong, the first large-scale image-text dataset open-sourced for Chinese, fills this gap and can accelerate research on Chinese cross-modal pre-trained large models.

The Wukong100m dataset contains about 100 million image-text pairs collected from the Internet. To cover as many visual concepts as possible, the raw data collection revolves around a list of 200,000 base keywords: each keyword is fed into a search engine, and the returned images together with their corresponding text are used to build the dataset. In addition, to balance the number of samples per keyword, at most 1,000 samples are kept for each base keyword.
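As a rough sketch of this collection step (purely illustrative: the search helper below is a hypothetical stand-in and not part of the release), the keyword loop and the per-keyword cap might look like this:

```python
# Minimal sketch of the keyword-driven collection loop described above.
MAX_SAMPLES_PER_KEYWORD = 1000   # cap to balance the number of samples per keyword

def collect_pairs(base_keywords, search_image_text_pairs):
    """Collect at most 1,000 (image_url, caption) pairs per base keyword.

    `search_image_text_pairs` is a hypothetical callable standing in for the
    search-engine query used by the authors; it is not part of the release.
    """
    dataset = []
    for keyword in base_keywords:
        results = search_image_text_pairs(keyword)      # list of (image_url, caption)
        dataset.extend(results[:MAX_SAMPLES_PER_KEYWORD])
    return dataset
```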

The collected data then went through a series of filtering strategies to produce the final version, divided into image-based filtering and text-based filtering. For image filtering, images whose width or height is no more than 200 pixels, or whose aspect ratio is greater than 3, are discarded, so that the remaining images present their visual concepts clearly.
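A minimal sketch of this image-based filter, assuming PIL images and the thresholds described above:

```python
from PIL import Image

def keep_image(path: str) -> bool:
    """Image-based filtering: drop images whose width or height is no more
    than 200 pixels, or whose aspect ratio is greater than 3."""
    with Image.open(path) as img:
        width, height = img.size
    if min(width, height) <= 200:                      # too small to show the concept clearly
        return False
    if max(width, height) / min(width, height) > 3:    # overly elongated image
        return False
    return True
```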

For the text (image description) side, first, only text containing at least one Chinese character is retained; a caption such as "000.jpg" is filtered out together with its image, because it carries almost no meaning.

Second, if too many images share the same description, such as "view source page", "expand the full text", or "photography tribe", the description usually carries little meaning and is excluded when building the dataset.

In addition, to protect privacy, the names of people appearing in the dataset are replaced with the special token "<人名>" (person name), and data containing sensitive words is removed.
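The text-side rules can be sketched as follows. The frequency threshold, sensitive-word list and person-name list are illustrative placeholders; in practice, masking names with "<人名>" would rely on a proper named-entity recognizer:

```python
import re
from collections import Counter

HAS_CHINESE = re.compile(r'[\u4e00-\u9fff]')   # at least one Chinese character required
TOO_FREQUENT = 10_000                          # illustrative cut-off for over-used captions

def build_caption_blacklist(all_captions):
    """Descriptions shared by too many images (e.g. "view source page",
    "expand the full text") carry little meaning and are excluded."""
    counts = Counter(all_captions)
    return {c for c, n in counts.items() if n > TOO_FREQUENT}

def clean_caption(caption, blacklist, sensitive_words, person_names):
    """Return the cleaned caption, or None if the image-text pair should be dropped."""
    if not HAS_CHINESE.search(caption):        # e.g. "000.jpg" is dropped
        return None
    if caption in blacklist:
        return None
    if any(word in caption for word in sensitive_words):
        return None
    for name in person_names:                  # privacy: replace person names
        caption = caption.replace(name, "<人名>")
    return caption
```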

2 Model Pre-training Method

2.1 Model Structure

We adopt the mainstream Transformer-based two-tower pre-training architecture, which encodes images and text into a space of the same dimension.

We use a contrastive loss to align the text and visual modalities, so that matched image-text pairs have more similar encoded representations while unmatched pairs are less similar. By training such image and text encoders, we can efficiently align the textual and visual modalities in a shared representation space.

Illustration Note: the Wukong baseline models mainly adopt a two-tower structure.

Image encoder: our pre-training uses common image encoder structures: ResNet, ViT and Swin Transformer. For ResNet we experimented with ResNet-50 and ResNet-101; for ViT we used ViT-B/16, ViT-B/32 and ViT-L/14; for SwinT we used the Swin-L model. For the ViT models, the feature of the first token, the [CLS] token, is used to represent the entire image, while the ResNet and SwinT models use the mean of all patch-token features to represent the entire image.
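To make the pooling difference concrete, here is a small PyTorch sketch (tensor shapes are assumptions) of taking a global image feature either from the [CLS] token or from the mean of the patch tokens:

```python
import torch

def global_image_feature(token_features: torch.Tensor, use_cls_token: bool) -> torch.Tensor:
    """token_features: [batch, num_tokens, dim] output of the image encoder.
    ViT-style encoders use the first ([CLS]) token as the global feature;
    ResNet/SwinT-style encoders average all patch-token features."""
    if use_cls_token:
        return token_features[:, 0]       # [CLS] token feature
    return token_features.mean(dim=1)     # mean over patch tokens
```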

Text encoder: we use a Transformer with 12 layers, 12 attention heads, and a hidden dimension of 768. For Chinese tokenization we use the WordPiece tokenizer of the Chinese BERT model, with a vocabulary size of 21,128. The first token of the text is the usual [CLS] token, whose output from the text encoder represents the entire text.

Linear projection layer: after the images and text are encoded, a learnable linear projection layer maps the image and text features into a shared multimodal space.

LiT-tuning: to improve training efficiency and save computing resources, we load a pre-trained image encoder into the two-tower model and lock its parameters during contrastive learning, so that only the text encoder and the two linear projection layers need to be trained. The image encoder may have been pre-trained on image data with English labels.
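Below is a minimal sketch of the LiT-style setup described in this section, assuming generic `image_encoder` and `text_encoder` modules that already return one global feature per sample (names and dimensions are illustrative, not the released code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerModel(nn.Module):
    """Two-tower image/text model with linear projections into a shared space.
    With LiT-tuning, the (possibly English-pretrained) image encoder is frozen,
    and only the text encoder and the two projection layers are trained."""

    def __init__(self, image_encoder, text_encoder, img_dim, txt_dim, embed_dim=256):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        self.image_proj = nn.Linear(img_dim, embed_dim)   # linear projection layers
        self.text_proj = nn.Linear(txt_dim, embed_dim)
        for p in self.image_encoder.parameters():         # LiT-tuning: lock the image tower
            p.requires_grad = False

    def forward(self, images, text_tokens):
        with torch.no_grad():                              # frozen image encoder
            img_feat = self.image_encoder(images)          # [B, img_dim]
        txt_feat = self.text_encoder(text_tokens)          # [B, txt_dim], e.g. the [CLS] output
        img_emb = F.normalize(self.image_proj(img_feat), dim=-1)
        txt_emb = F.normalize(self.text_proj(txt_feat), dim=-1)
        return img_emb, txt_emb
```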

2.2 Contrastive Learning

We use in-batch contrastive learning: in each training iteration, the positive example for an image is its paired text, and all other texts in the same batch are its negatives (and vice versa for each text). For a batch of $b$ image-text pairs, the image-to-text contrastive loss for the $k$-th image can be written as

$$\mathcal{L}_k^{I} = -\log \frac{\exp(s_{k,k}^{I}/\tau)}{\sum_{j=1}^{b} \exp(s_{k,j}^{I}/\tau)}$$

and the text-to-image loss for the $k$-th text as

$$\mathcal{L}_k^{T} = -\log \frac{\exp(s_{k,k}^{T}/\tau)}{\sum_{j=1}^{b} \exp(s_{k,j}^{T}/\tau)}$$

where $s_{k,j}^{I}$ denotes the similarity of the $k$-th image to the $j$-th text, $s_{k,j}^{T}$ denotes the similarity of the $k$-th text to the $j$-th image, and $\tau$ is the temperature. The mean of these two losses is used as the final training objective. When computing the similarity, we use two methods: the global similarity of CLIP and the token-wise similarity of FILIP.
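With the global similarity, this in-batch loss reduces to a symmetric cross-entropy over the batch similarity matrix. A minimal PyTorch sketch, assuming L2-normalized embeddings such as those returned above and a fixed illustrative temperature:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: [B, D] L2-normalized embeddings of matched pairs.
    The k-th image's positive is the k-th text; all other texts in the batch
    are its negatives, and vice versa."""
    logits = img_emb @ txt_emb.t() / temperature      # [B, B] similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text-to-image direction
    return (loss_i2t + loss_t2i) / 2                  # mean of the two losses
```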

As the name suggests, the global similarity is the dot product of the global image representation and the global text representation. The token-wise similarity uses a finer-grained contrastive objective to capture token-level matching and localization between image and text: for each image patch we first find the most similar text token and record that similarity, and then use the average over all patches as the final image-to-text similarity. The text-to-image similarity is computed analogously.
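A sketch of this FILIP-style token-wise similarity (tensor shapes are assumptions; padding masks are omitted for brevity):

```python
import torch

def token_wise_similarity(img_tokens, txt_tokens):
    """img_tokens: [B_i, N_patch, D], txt_tokens: [B_t, N_tok, D], L2-normalized.
    Returns image-to-text similarities of shape [B_i, B_t]: for each image patch,
    take its most similar text token, then average over all patches."""
    # pairwise token similarities: [B_i, B_t, N_patch, N_tok]
    sim = torch.einsum('ipd,jtd->ijpt', img_tokens, txt_tokens)
    best_per_patch = sim.max(dim=-1).values   # most similar text token for each patch
    return best_per_patch.mean(dim=-1)        # average over image patches

# The text-to-image similarity is computed analogously: for each text token,
# take its most similar image patch, then average over text tokens.
```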

3 Experimental Evaluation

3.1 Zero-shot Image Classification and Image-Text Retrieval

We evaluate on the downstream tasks of zero-shot image classification and image-text retrieval. Except for the "-500M" model variant, all other Wukong model variants are trained on this 100M dataset. The zero-shot image classification results on seventeen different datasets are as follows:

Illustration: zero-shot image classification results on seventeen datasets.
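As an illustration of how zero-shot classification is typically done with such a two-tower model: the Chinese class names are encoded as text, and each image is assigned to its most similar class. The `encode_image`/`encode_text` interface below is a placeholder, not the released API:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, tokenizer, images, class_names_zh):
    """Assign each image to the most similar Chinese class name."""
    text_tokens = tokenizer(class_names_zh)                         # tokenize the Chinese labels
    txt_emb = F.normalize(model.encode_text(text_tokens), dim=-1)   # [num_classes, D]
    img_emb = F.normalize(model.encode_image(images), dim=-1)       # [B, D]
    logits = img_emb @ txt_emb.t()                                  # [B, num_classes] similarities
    return logits.argmax(dim=-1)                                    # predicted class indices
```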

The image-text retrieval task is evaluated under two settings, fine-tuning and zero-shot, on five different datasets for both image-to-text and text-to-image retrieval, and the results are compared with existing Chinese multimodal large models. The fine-tuned retrieval results are as follows:

Illustration: fine-tuned image-text retrieval results on five datasets.

In addition, the zero-shot image-text retrieval results are as follows:

Illustration: zero-shot image-text retrieval results on five datasets.

The above results demonstrate the effectiveness of the Chinese multimodal dataset we built, as well as of the open-sourced baseline multimodal pre-trained models. They show that an image encoder pre-trained on English data can be adapted to Chinese multimodal pre-training and still yield a model with excellent results. At the same time, under the same model configuration, token-wise similarity performs better than global similarity. On image-text retrieval, our baseline pre-trained models achieve results close to, or even better than, SOTA models.

3.2 Visualization of Fine-grained Alignment

We believe that a Chinese multimodal model trained with token-wise similarity can also exhibit fine-grained alignment and localization capability, as FILIP does. We therefore visualize token-wise alignment using ImageNet images and their Chinese labels.

Illustration: visualization of token-wise alignment on ImageNet images with Chinese labels.

It can be seen that both model variants shown above exhibit a degree of fine-grained alignment.

Take the image labeled "豆娘" (damselfly) as an example. In the text sequence "[CLS] 豆娘 [SEP]", the tokens of the class label "豆娘" have position indices 1 and 2. On the image, we can then find the image patch tokens whose best-matching text-token index is 1 or 2, and these patch tokens roughly outline the object of this class.
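The alignment maps above can be produced with a simple argmax over text tokens: for each image patch, find the text token it is most similar to, and highlight the patches whose best match is a label token. A sketch under the same assumptions as before:

```python
import torch

@torch.no_grad()
def patch_alignment_map(img_tokens, txt_tokens, grid_size):
    """img_tokens: [N_patch, D] patch embeddings of one image (L2-normalized);
    txt_tokens: [N_tok, D] token embeddings of its text, e.g. "[CLS] 豆娘 [SEP]".
    Returns a [grid_size, grid_size] map of the text-token index each patch aligns to."""
    sim = img_tokens @ txt_tokens.t()      # [N_patch, N_tok] patch-token similarities
    best_txt_idx = sim.argmax(dim=-1)      # index of the most similar text token per patch
    return best_txt_idx.reshape(grid_size, grid_size)

# Patches whose best-matching index falls on the label tokens (indices 1 and 2
# for "豆娘" above) roughly outline the object of that class.
```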

Because one of the two model variants divides the image into a finer patch grid (16×16 patches versus 7×7), its image patch tokens outline the shape of the object more finely and completely.

Experimental results show that token-wise similarity computation works with a variety of patch-based image encoders as well as with the LiT-tuning training scheme. This fine-grained image-text alignment capability opens up more possibilities for work on image object recognition and localization.

4 Summary

Huawei's Noah's Ark Lab has open-sourced "Wukong", a Chinese cross-modal dataset on the scale of 100 million image-text pairs, and will also open-source a series of multimodal baseline models to further promote future use and development in the Chinese-language community.

The team also benchmarked the dataset and models on a range of tasks, including zero-shot image classification and image-text retrieval. Experiments show that Wukong serves as an excellent Chinese multimodal pre-training dataset and benchmark, with strong performance on various downstream tasks.

At the same time, the Wukong models' fine-grained token-level alignment capability shows potential for serving more vision tasks such as object localization.

The Wukong dataset is now available for download on the official website; interested readers can start using it right away!
