
The CV community explodes again? Zuckerberg personally announces DINOv2, an all-rounder at segmentation and retrieval. Netizens: Meta is the "Open" AI

New Zhiyuan reports

Editors: Taozi, LaYan

【New Zhiyuan Guide】Meta has dropped another bombshell in the CV field! Self-supervised and no fine-tuning needed: does computer vision even exist as a problem anymore?

Following "Divide Everything", Meta reissued DINOv2.

Once again announced personally by Zuckerberg, it is another heavyweight open-source project from Meta in the CV field.


Zuckerberg said that Meta has long been committed to open-sourcing AI tools, and that DINOv2, released today, is a SOTA-level model. Trained with self-supervision, it handles depth estimation, semantic segmentation, and image similarity comparison.

Zuckerberg said the model can be used to estimate the height of forests on different continents from satellite imagery. In the future, it could also help with medical imaging, food production, and more.

Of course, in the end Zuckerberg did not forget his pet theme, the metaverse. He believes DINOv2 can greatly aid the construction of the metaverse and make users' immersive experience in it even better.


Netizens exclaimed: "Computer vision no longer exists!"

Demos

On its official website, Meta showcased demos of depth estimation, semantic segmentation, and instance retrieval.

Depth estimation:

For readers unfamiliar with computer vision, "depth estimation" may be an unfamiliar term, but once you see its application scenario, its meaning becomes clear.

Simply put, a 2D photo is flat, so when reconstructing the scene in 3D it is crucial to know how far each point in the photo was from the camera.

That's what depth estimation is all about.

In the resulting depth map, points of the same color are the same distance from the camera, and the lighter the color, the closer the point. Put together, this gives the depth of the entire scene.

Let's look at a few more examples:

[Image: more depth estimation examples]
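To make this concrete, here is a minimal sketch of how frozen DINOv2 patch features could drive depth prediction. The `dinov2_vits14` checkpoint name on PyTorch Hub is real; the linear head is a hypothetical stand-in for the stronger depth decoders evaluated in the paper.

```python
import torch
import torch.nn as nn

# Load a small DINOv2 backbone from PyTorch Hub and freeze it.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()

# Hypothetical head: predict one depth value per 14x14 patch.
depth_head = nn.Linear(384, 1)  # ViT-S/14 features are 384-dimensional

image = torch.randn(1, 3, 224, 224)  # image sides must be multiples of 14
with torch.no_grad():
    patches = backbone.forward_features(image)["x_norm_patchtokens"]  # (1, 256, 384)

depth = depth_head(patches).reshape(1, 16, 16)  # coarse 16x16 depth map
print(depth.shape)  # torch.Size([1, 16, 16])
```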

Semantic segmentation:

Semantic segmentation is easier to grasp. "Semantics" means different things in different contexts: in speech recognition it refers to the content of the speech; in images, it refers to the content of the picture.

Segmentation means coloring the different parts of an image so that the boundaries between them are clear.

It's a bit like the coloring books we played with as children, filling in the different regions of a blank outline drawing.

[Image: semantic segmentation example]

Of course, there is still a difference: in a coloring book, we are free to give the same region several different colors.

In the example above, the bridge is one color, the river another, the grass another, and the distant trees yet another.

More examples:

[Image: more semantic segmentation examples]
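In code, the same frozen backbone supports segmentation by classifying each patch token and upsampling the result to pixel resolution. A minimal sketch, assuming a simple linear probe (in the spirit of the paper's linear evaluation protocol, not its exact setup):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()

num_classes = 4  # e.g. bridge, river, grass, trees, as in the demo image
seg_head = nn.Linear(384, num_classes)  # illustrative linear probe

image = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    patches = backbone.forward_features(image)["x_norm_patchtokens"]  # (1, 256, 384)

# One class score per patch, reshaped onto the 16x16 patch grid.
logits = seg_head(patches).reshape(1, 16, 16, num_classes).permute(0, 3, 1, 2)

# Upsample coarse patch predictions back to pixel resolution.
mask = F.interpolate(logits, size=(224, 224), mode="bilinear").argmax(dim=1)
print(mask.shape)  # torch.Size([1, 224, 224]): one class label per pixel
```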

Instance retrieval:

[Image: instance retrieval demo with a photo of the Eiffel Tower as the query]

This one is the easiest to understand: upload an image to the model, and it finds similar images in a vast image library.

The Eiffel Tower in the image above is the input, and the model then retrieves a large number of pictures of the same subject in very different styles.

[Image: retrieved pictures of the Eiffel Tower in different styles]
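Under the hood, this kind of retrieval reduces to nearest-neighbour search over global image embeddings. A minimal sketch with an in-memory library of random stand-in images; a production system would embed real photos and use an approximate-nearest-neighbour index:

```python
import torch
import torch.nn.functional as F

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()

def embed(images: torch.Tensor) -> torch.Tensor:
    """Return L2-normalized global (CLS) embeddings, one per image."""
    with torch.no_grad():
        cls = backbone.forward_features(images)["x_norm_clstoken"]
    return F.normalize(cls, dim=-1)

library = embed(torch.randn(100, 3, 224, 224))  # stand-in for a photo library
query = embed(torch.randn(1, 3, 224, 224))      # e.g. the Eiffel Tower photo

scores = query @ library.T     # cosine similarity against every library image
top5 = scores.topk(5).indices  # indices of the five closest images
print(top5)
```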

DINOv2


Paper: https://arxiv.org/pdf/2304.07193.pdf

Having seen the SOTA-level demos, let's look at the technical breakthroughs hidden behind them.

The breakthrough of pre-training models on large amounts of data in natural language processing has opened the way for similar foundation models in computer vision.

Such models can greatly simplify the use of images in any system by producing general-purpose visual features, features that work across different image distributions and tasks without fine-tuning.

This work shows that existing pre-training methods, especially self-supervised ones, can produce such features if trained on enough data from diverse sources.

Meta's researchers revisited existing methods and combined different techniques to scale pre-training in terms of both data and model size.

Most of these techniques serve to accelerate and stabilize training at scale. On the data side, Meta proposes an automated pipeline for building a dedicated, diverse, and curated image dataset, instead of the uncurated data typically used in the self-supervised learning literature.

On the model side, the researchers trained a ViT with 1B parameters and distilled it into a series of smaller models that surpass the best available general-purpose features, such as OpenCLIP, on most benchmarks at both the image and pixel level.
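As a rough illustration of the distillation step, the sketch below matches a small student's global embedding to a frozen larger teacher's with a cosine loss. The two checkpoint names are real Hub models; the projection layer and the plain cosine objective are simplifying assumptions, not DINOv2's exact recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Frozen large teacher, trainable small student (both real Hub checkpoints).
teacher = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").eval()
student = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")

# Hypothetical projection to match feature widths (ViT-S: 384, ViT-L: 1024).
proj = nn.Linear(384, 1024)
opt = torch.optim.AdamW(list(student.parameters()) + list(proj.parameters()), lr=1e-4)

images = torch.randn(8, 3, 224, 224)  # a stand-in training batch
with torch.no_grad():
    t = F.normalize(teacher.forward_features(images)["x_norm_clstoken"], dim=-1)

s = F.normalize(proj(student.forward_features(images)["x_norm_clstoken"]), dim=-1)
loss = (1 - (s * t).sum(dim=-1)).mean()  # cosine distance, teacher vs. student
loss.backward()
opt.step()
print(loss.item())
```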

Task-agnostic pre-trained representations have become the standard in natural language processing (NLP). These features can be used as-is, without fine-tuning, and still achieve noticeably better performance on downstream tasks than task-specific models.

This success was driven by pre-training on large quantities of raw text with pretext objectives that require no supervision, such as language modeling or word vectors.

Following this paradigm shift in NLP, similar foundation models are expected to emerge in computer vision: models that produce visual features useful for any task, at the image level (e.g., image classification) as well as the pixel level (e.g., segmentation, as in the examples above).

Most efforts toward such foundation models have focused on text-guided pre-training, i.e., using textual supervision to guide the training of features. But this form of pre-training limits the information that can be retained about an image: a caption captures only the surface content of an image, and complex pixel-level information may not be represented at all.

In addition, these image encoders require corpora of aligned text-image pairs. An alternative to text-guided pre-training is self-supervised learning, where features are learned from images alone. These methods are conceptually closer to pretext tasks such as language modeling and can capture information at both the image and pixel level.

However, most advances in self-supervised learning were obtained by pre-training on ImageNet-1k, a small curated dataset. Some attempts have been made to extend these methods beyond ImageNet-1k, but they focused on uncurated datasets, which led to a significant drop in feature quality.

This is because of the lack of control over data quality and diversity.

Meta's researchers instead asked whether self-supervised pre-training on a large amount of curated data could learn all-purpose visual features. They revisited existing discriminative self-supervised methods that learn features at both the image and patch level, such as iBOT, and reconsidered some of iBOT's design choices in the context of a larger dataset.

Most of Meta's technical contributions are aimed at stabilizing and accelerating discriminative self-supervised learning when scaling up model and data size. These improvements make the new method about 2x faster and 3x less memory-hungry than similar discriminative self-supervised methods, allowing longer training with larger batch sizes.

As for pre-training data, the researchers built an automatic pipeline to filter and rebalance a dataset drawn from a large pool of unprocessed images. Inspired by pipelines used in NLP, it relies on data similarity rather than external metadata and requires no manual annotation.

In this work, a simple clustering method solves the problem well.
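A minimal sketch of the underlying idea, assuming image embeddings are already computed: cluster the uncurated pool by visual similarity and sample evenly across clusters so that over-represented concepts stop dominating. Plain k-means here is an illustrative stand-in for the paper's full deduplication-and-retrieval pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for precomputed embeddings of a large uncurated image pool.
embeddings = np.random.randn(10_000, 384).astype("float32")

# Group the pool into clusters of visually similar images.
kmeans = KMeans(n_clusters=100, n_init=10).fit(embeddings)

# Rebalance: keep at most the same number of images from every cluster,
# so over-represented concepts no longer dominate the training set.
per_cluster = 20
selected = []
for c in range(100):
    members = np.where(kmeans.labels_ == c)[0]
    selected.extend(members[:per_cluster].tolist())

print(f"{len(selected)} images kept out of {len(embeddings)}")
```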

To validate this approach, Meta's researchers assembled a diverse corpus of 142 million images. The result is a family of pre-trained visual models called DINOv2, the protagonist of today's story.

Meta has also released all the models and the code, so DINOv2 can be retrained on any data.
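The released checkpoints are on PyTorch Hub, so extracting a global image feature takes only a few lines (the model names below are the ones listed in the repository):

```python
import torch

# Four sizes are published: dinov2_vits14 / vitb14 / vitl14 / vitg14.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
model.eval()

image = torch.randn(1, 3, 224, 224)  # height and width must be multiples of 14
with torch.no_grad():
    embedding = model(image)  # forward() returns the global image embedding
print(embedding.shape)  # torch.Size([1, 768]) for ViT-B/14
```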

The researchers validated DINOv2's capabilities on a variety of computer vision benchmarks, at both the image and pixel level, as shown below.

[Image: DINOv2 benchmark results at the image and pixel level]

Netizens: This is the "Open" AI

After its release, DINOv2 drew unanimous praise from netizens.

"Computer vision foundational models are progressing incredibly fast. Similar to LLMs driven by self-supervised learning on large-scale data and models. Thanks to Meta open source DINOv2 and SAM - for ~~90% of ordinary domain tasks, these models are becoming more and more capable, and basically do not require fine-tuning."


"SAM+DINO, it is too strong in agriculture."


"Meta is the real "Open" AI company!"

Resources:

https://www.maginative.com/article/meta-ai-unveils-dinov2-a-game-changer-in-self-supervised-vision-transformer-models

https://github.com/facebookresearch/dinov2
