
Google SynCLR Technology: Efficient Image Modeling, Self-Training with AI-Generated Data!

Author: Xi Xiaoyao Technology Says

Google has introduced an innovative synthetic-image framework whose distinguishing feature is that it does not rely on real data at all. The framework first generates synthetic image captions, then generates the corresponding images from those captions. Contrastive learning is then used to train a model that can accurately recognize and understand these images, and the resulting model performs surprisingly well on a variety of downstream tasks. Let's take a look at how this magic works!

Paper Title: Learning Vision from Models Rivals Learning Vision from Data

Paper Link:

https://arxiv.org/pdf/2312.17742.pdf

Introduction

Large-scale real-world data comes with a number of challenges: large, unfiltered datasets are cheap to collect but yield limited benefit, while carefully curated smaller datasets, though more accurate, limit how broadly models can be applied. To overcome these obstacles, a new study proposes a distinctive solution: using synthetic data to learn visual representations. The method generates a large number of image captions and the corresponding images, enabling effective contrastive learning by treating images generated from the same caption as mutual positive examples. Notably, the research team shows that this synthetic-data-based approach not only scales well, but also delivers performance comparable to traditional methods across a variety of downstream tasks.


Traditional methods ("learning from data") focus on drawing knowledge purely from real data. A typical example is the CLIP model, which extracts information directly from paired text and image datasets and achieves an impressive linear-transfer accuracy of 80.2% on ImageNet.

The hybrid approach ("Hybrid") uses a two-pronged strategy that combines real text with generated images. The StableRep model, for example, operates under this framework: it learns from a text dataset together with an image generator, and reaches a respectable linear-transfer accuracy of 76.7% on ImageNet.

The "Learning from models" approach proposed in this paper, SynCLR, marks a leap forward in innovation. By learning Xi from synthetic text and composite images, it is able to compete with CLIP on ImageNet even without direct contact with any real data, achieving a very good linear transfer accuracy of 80.7%.

Method

The core innovation of SynCLR is that it redefines the granularity of visual classes through generative models. In contrast to traditional self-supervised and supervised learning methods, SynCLR uses captions as class definitions, where each caption describes a visual class in detail. The ingenuity of this approach is that images can be grouped by the semantics they share through a caption, rather than being limited to broad class labels such as "Golden Retriever". In experiments, this caption-based fine-grained grouping proves superior to traditional self-supervised and supervised training. The system consists of the following three steps:

Image caption generation

First, the authors generate a large corpus of image captions. To achieve this, they leverage the in-context learning capabilities of large language models (LLMs), crafting a series of prompt templates that guide the LLM to produce context-specific text.

By curating lists of concepts from existing datasets such as ImageNet-21k and Places-365, the authors construct a dedicated prompt for each concept that leads the LLM to generate descriptive and creative image captions. At the heart of this process is ensuring that the resulting captions both describe plausible image content accurately and show enough variety to cover a wide range of visual concepts. This diversity is critical: it ensures that the generated image set represents the widest possible variety of scenes, objects, and activities, which in turn improves the generalization of the learned visual representations.


In this way, the authors synthesize a large and diverse set of image captions, which are then fed to an image generation model to produce the corresponding synthetic images. These synthetic images, paired with their synthetic captions, form a rich dataset for training visual representation models. This makes it possible to train vision models without any real image data at all, providing an innovative and effective complement to traditional vision learning methods that rely on real datasets.


In the prompt example from the paper, the research team provides an in-context demonstration (left) that guides the model to generate a specific descriptive caption for a given pair of a category and a context (e.g., "tiger, forest" or "groom, wedding ceremony"). In the actual generated results (right), for the pair "red fox, yard" the model produces the caption: "wild red fox sitting on a partially snow covered front yard of a house in the suburbs of a small city". Three such in-context examples are randomly sampled for each inference.
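To make this concrete, here is a minimal sketch of how such a caption prompt could be assembled. The example captions, function name, and number of in-context demonstrations are illustrative stand-ins rather than the paper's actual templates.

```python
import random

# Hypothetical in-context examples in the style described above:
# each pairs a "concept, context" string with a hand-written caption.
IN_CONTEXT_EXAMPLES = [
    ("tiger, forest", "a Bengal tiger walking through a dense green forest at dawn"),
    ("groom, wedding ceremony", "a nervous groom in a navy suit waiting at an outdoor wedding ceremony"),
    ("sailboat, harbor", "a small white sailboat moored in a quiet harbor at sunset"),
    ("bakery, street corner", "a busy bakery on a street corner with fresh bread in the window"),
]

def build_caption_prompt(concept: str, context: str, k: int = 3) -> str:
    """Assemble a prompt asking the LLM for one descriptive image caption.

    k in-context examples are sampled at random for every inference call,
    mirroring the procedure described in the text.
    """
    examples = random.sample(IN_CONTEXT_EXAMPLES, k)
    lines = ["Write a descriptive caption for an image of the given subject and setting.", ""]
    for pair, caption in examples:
        lines.append(f"Input: {pair}")
        lines.append(f"Caption: {caption}")
        lines.append("")
    lines.append(f"Input: {concept}, {context}")
    lines.append("Caption:")
    return "\n".join(lines)

print(build_caption_prompt("red fox", "yard"))
```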

Image generation

The research team generates images by initiating the reverse diffusion process from different random noise. The classifier-free guidance (CFG) scale plays a crucial role here, balancing sample quality, text-image consistency, and sample diversity. To produce a set of distinct images for each caption, the team varies the random noise input, enriching the diversity of the generated images.
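As a rough illustration of this step (not the paper's exact pipeline or settings), the idea can be sketched with the Hugging Face diffusers library: the caption stays fixed while the random seed, and therefore the initial noise, changes, and guidance_scale is the CFG scale discussed above. The checkpoint name and CFG value below are assumptions for the example.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a text-to-image diffusion model (checkpoint and CFG value are illustrative).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

caption = ("wild red fox sitting on a partially snow covered front yard "
           "of a house in the suburbs of a small city")

images = []
for seed in range(4):
    # A different seed gives different initial noise, so the reverse diffusion
    # process produces a distinct image for the same caption.
    generator = torch.Generator(device="cuda").manual_seed(seed)
    out = pipe(caption, guidance_scale=7.5, generator=generator)
    images.append(out.images[0])
```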

Representation learning method

The representation learning method is built on StableRep, which introduces a multi-positive contrastive loss. The core idea is to align images generated from the same caption in the embedding space, while incorporating several techniques from other self-supervised learning methods, including a patch-level masked image modeling objective.

StableRep

The StableRep method minimizes a cross-entropy loss over the similarities between samples. This trains the model to recognize and distinguish images generated from the same caption versus different captions.

iBOT

The iBOT method employs a masked image modeling objective: local patches are masked, and the model's task is to predict the tokenized representations of those masked patches. This extends the DINO objective from the image level to the patch level.
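The toy sketch below shows the shape of such a patch-level objective: a student predicts token distributions for masked patches and is supervised by a teacher's distributions on the clean view. All dimensions, temperatures, and random stand-in tensors are illustrative assumptions, and details of the real iBOT recipe (e.g., centering, the EMA teacher update) are omitted.

```python
import torch
import torch.nn.functional as F

# Toy sizes: 2 images, 16 patches each, 64-d patch tokens, 128 "prototype" classes
# for the online tokenizer head. All names and sizes are illustrative.
B, N, D, K = 2, 16, 64, 128

student_head = torch.nn.Linear(D, K)
teacher_head = torch.nn.Linear(D, K)  # in practice an EMA copy of the student head

# Patch tokens from a student ViT on a masked view and from the teacher
# on the clean view (random stand-ins here).
student_tokens = torch.randn(B, N, D)
teacher_tokens = torch.randn(B, N, D)

# Randomly mask ~40% of the patches; the loss is computed only on masked positions.
mask = torch.rand(B, N) < 0.4

teacher_probs = F.softmax(teacher_head(teacher_tokens) / 0.04, dim=-1).detach()
student_logp = F.log_softmax(student_head(student_tokens) / 0.1, dim=-1)

# Cross-entropy between teacher targets and student predictions on masked patches.
loss = -(teacher_probs * student_logp).sum(-1)[mask].mean()
```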

Exponential Moving Average (EMA)

EMA was first introduced to self-supervised learning by MoCo; here, the EMA model encodes crops and generates the targets for the iBOT loss. During training, the EMA coefficient follows a cosine schedule, which smooths the update of the model parameters and keeps the teacher stable throughout training.
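A minimal sketch of such an EMA update with a cosine-scheduled coefficient might look like the following; the base and final momentum values and the toy model are illustrative, not the paper's settings.

```python
import copy
import math
import torch

def ema_momentum(step: int, total_steps: int, base: float = 0.994, final: float = 1.0) -> float:
    """Cosine schedule for the EMA coefficient, rising from base to final."""
    return final - (final - base) * (math.cos(math.pi * step / total_steps) + 1) / 2

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, m: float) -> None:
    """teacher = m * teacher + (1 - m) * student, parameter by parameter."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(m).add_(s, alpha=1 - m)

# Usage sketch: the teacher starts as a copy of the student and is never updated
# by gradients, only by EMA after each optimizer step.
student = torch.nn.Linear(8, 8)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

total_steps = 1000
for step in range(total_steps):
    # ... student forward/backward/optimizer.step() would go here ...
    ema_update(teacher, student, ema_momentum(step, total_steps))
```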

Multi-crop strategy

As a way to improve computational efficiency, the multi-crop strategy lets the model learn from multiple views and contexts, increasing the diversity of training samples and improving the generalization of the representations. Concretely, StableRep trains by minimizing the cross-entropy between a ground-truth assignment distribution and a contrastive distribution. In this framework there is one encoded anchor sample and a set of encoded candidate samples. The contrastive distribution is the model's predicted probability that the anchor and each candidate were generated from the same caption, while an indicator function records whether the two samples actually come from the same caption.
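Putting these pieces together, here is a minimal sketch of a multi-positive contrastive loss in the spirit of StableRep. The function name, temperature, and toy tensors are illustrative assumptions; the ground-truth distribution spreads its mass uniformly over all candidates that share the anchor's caption.

```python
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(anchor, candidates, caption_ids_anchor,
                                    caption_ids_cand, temperature=0.1):
    """Sketch of a multi-positive contrastive loss.

    anchor:      (B, D) embeddings of anchor crops
    candidates:  (M, D) embeddings of candidate crops
    caption_ids: integer id of the caption each crop was generated from
    """
    a = F.normalize(anchor, dim=-1)
    c = F.normalize(candidates, dim=-1)

    # Contrastive distribution: softmax over cosine similarities.
    logits = a @ c.t() / temperature                      # (B, M)
    log_p = F.log_softmax(logits, dim=-1)

    # Ground-truth distribution: indicator of "same caption", normalized so the
    # probability mass is spread uniformly over all positives.
    match = (caption_ids_anchor[:, None] == caption_ids_cand[None, :]).float()
    q = match / match.sum(dim=-1, keepdim=True)

    # Cross-entropy between the ground-truth and contrastive distributions.
    return -(q * log_p).sum(dim=-1).mean()

# Toy usage: 4 anchors and 8 candidates drawn from 4 captions.
loss = multi_positive_contrastive_loss(
    torch.randn(4, 64), torch.randn(8, 64),
    torch.tensor([0, 1, 2, 3]),
    torch.tensor([0, 0, 1, 1, 2, 2, 3, 3]),
)
```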

Experiments

The research team pre-trained the model for up to 500k steps with a large batch size of 8,192 captions, with all pre-training performed at 224x224 resolution. They compared SynCLR to OpenAI's CLIP, OpenCLIP, and DINO v2, which represent different approaches to learning from data. Note that the ViT-B/14 and ViT-L/14 models in DINO v2 are distilled from a ViT-g model, which gives DINO v2 an advantage in the comparison.

ImageNet linear evaluation

For a fair comparison, all models use the cls token of the last block as the representation (whereas DINO v2's reported results use a concatenation of multiple layers). As shown in Table 6, SynCLR achieves 80.7% with the ViT-B architecture and 83.0% with ViT-L. These results are comparable to models that learn directly from real data, such as CLIP and DINO v2, even though SynCLR uses only synthetic data.
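For reference, linear evaluation of this kind amounts to freezing the backbone and training only a linear classifier on its features. The sketch below uses a random stand-in backbone; the feature dimension, optimizer settings, and names are illustrative.

```python
import torch

# Minimal linear-probe sketch: the backbone is frozen, only the classifier trains.
feat_dim, num_classes = 1024, 1000
backbone = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, feat_dim))
backbone.eval()                                   # frozen feature extractor (random stand-in)

classifier = torch.nn.Linear(feat_dim, num_classes)
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

def linear_probe_step(images, labels):
    with torch.no_grad():                         # no gradients flow into the backbone
        feats = backbone(images)                  # stands in for the ViT's last-block cls token
    loss = criterion(classifier(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random data.
loss = linear_probe_step(torch.randn(16, 3, 32, 32), torch.randint(0, num_classes, (16,)))
```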


UperNet semantic segmentation

The research team ran UperNet semantic segmentation at a single-scale 512x512 resolution (models with a 14x14 patch size use 518x518 instead). Using 600M synthetic images, they compared SynCLR against models including MoCo v3, SimMIM, MAE, PeCo, data2vec, iBOT, BEiT v2, CLIP, and OpenCLIP, which were largely pre-trained on real ImageNet data. SynCLR reaches 54.3% and 57.7% mIoU at standard and high resolution, respectively, which compares favorably with models trained on real data.


ImageNet image classification

SynCLR's performance on ImageNet image classification is also noteworthy. Using 600M synthetic images, SynCLR was compared to models pre-trained on various datasets such as IN21K, WIT-400M, and LAION-2B. With the ViT-B architecture, SynCLR reaches a Top-1 accuracy of 85.8%, and with ViT-L it reaches 87.9%, better than most models trained on real data.


These results show clearly that, despite relying entirely on synthetic data, SynCLR is comparable to models trained on real data for visual representation learning, demonstrating the remarkable effectiveness and great potential of this approach.

Summary

The authors make the following key points and conclusions:

Reasons to learn from generative models: A significant advantage of generative models is their ability to act like hundreds or thousands of datasets at once. In traditional pipelines, researchers often need to collect separate datasets for different image categories (e.g., cars, flowers, cats, dogs). Systems like DINO v2 build robust representations by assembling and integrating large numbers of such datasets; a generative model can condense that coverage into a single source.

Significant advantages of generative models: Compared to traditional data collection and annotation methods, generative models provide a more efficient and broader way to cover visual concepts. This approach eliminates the need for significant time and resources spent on real-world image data collection and annotation.

The paper highlights the critical role synthetic data can play in visual representation learning. Although synthetic data may not match real data in classification accuracy, it proves highly effective for training visual representation models. The trained representations can then be adapted to downstream tasks with only a small amount of real data, demonstrating the usefulness and adaptability of synthetic data.
