
Zuckerberg personally announces Meta's large vision model! Self-supervised learning, no fine-tuning required

Xiao Xiao, from Aofei Temple

Qubits | Official account QbitAI

No text labels required: Meta's fully self-supervised large vision model is here!

Zuckerberg announced it personally, and it drew plenty of attention on release:

On tasks such as semantic segmentation, instance segmentation, depth estimation, and image retrieval, this large vision model, called DINOv2, achieves very strong results.


It even surpasses OpenCLIP, currently the best open-source vision model.

Meta had previously released DINO, a self-supervised large vision model, but this time the model's ability to recognize image features goes clearly further: it can accurately segment the subject in a video.


And DINOv2 does not just learn image segmentation through self-supervision. Across photos of different categories and scenes, it can accurately identify where the head, body, and limbs of the same kind of object (say, a dog) are located.


In other words, DINOv2 learned to find image features on its own.

Meta has not only open-sourced the code but also put up a web demo to play with. One netizen quipped:

What counts as open source? LLaMA, SAM, DINOv2: this is what open source looks like!
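For those who want to go beyond the web demo, the released checkpoints can be loaded through PyTorch Hub. Here is a minimal loading sketch; the entrypoint names follow the facebookresearch/dinov2 repo, and the random input is just a stand-in for a real, normalized image.

```python
# A minimal sketch of loading a released DINOv2 checkpoint via PyTorch Hub.
# Entrypoint names ("dinov2_vits14", etc.) follow the facebookresearch/dinov2
# repo; swap in dinov2_vitb14 / dinov2_vitl14 / dinov2_vitg14 for larger models.
import torch

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

# Input sides must be multiples of the patch size (14 for DINOv2).
x = torch.randn(1, 3, 224, 224)  # stand-in for a normalized real image
with torch.no_grad():
    emb = model(x)  # global image embedding (384-d for the ViT-S/14 variant)
print(emb.shape)  # torch.Size([1, 384])
```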


Let's take a look at how effective DINOv2 really is.

Accurately identifying the same objects across different art styles

DINOv2 is a large vision model built on the previous-generation DINO (v1).

The model has 1 billion parameters and still uses a Vision Transformer (ViT) architecture, but unlike DINO, DINOv2 is trained on a carefully curated dataset.

Specifically, DINOv2 builds a data-filtering pipeline that carefully selects images with similar content while screening out exact duplicates.


The training images ultimately fed to DINOv2 carry no text labels, but their visual features are indeed similar.
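Meta's actual pipeline runs on large-scale embedding, deduplication, and retrieval infrastructure; the toy sketch below only illustrates the two steps described above, with made-up thresholds: drop near-duplicate candidates, then keep candidates that sit close to curated "seed" images in embedding space.

```python
# A toy sketch of the curation idea (not Meta's actual pipeline): embed every
# candidate image, drop near-exact duplicates above a high similarity
# threshold, then keep candidates similar to curated seed images.
# The thresholds and the random embeddings here are illustrative assumptions.
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def curate(cand_emb: np.ndarray, seed_emb: np.ndarray,
           dup_thresh: float = 0.98, keep_thresh: float = 0.5) -> list[int]:
    cand, seed = l2_normalize(cand_emb), l2_normalize(seed_emb)
    kept_idx, kept_vecs = [], []
    for i, v in enumerate(cand):
        # 1) near-duplicate removal against already-kept images
        if kept_vecs and max(float(v @ k) for k in kept_vecs) > dup_thresh:
            continue
        # 2) retrieval step: keep only images close to some curated seed
        if float((seed @ v).max()) > keep_thresh:
            kept_idx.append(i)
            kept_vecs.append(v)
    return kept_idx

# Usage with random stand-in embeddings:
rng = np.random.default_rng(0)
cands = rng.normal(size=(100, 384))
seeds = rng.normal(size=(10, 384))
print(len(curate(cands, seeds, keep_thresh=0.0)))  # non-duplicates kept
```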

So how well does a vision model trained on this kind of data perform?

Here is DINOv2's performance on eight vision tasks, including semantic segmentation, classification, and depth estimation; orange marks the self-supervised methods and dark pink the weakly supervised ones.

As you can see, the self-supervised model performs on par with the weakly supervised ones.


Concretely, even when the same kind of object appears in very different styles across a series of photos, DINOv2 accurately picks out their features and groups them together.

For example: winged birds and airplanes in group (a), elephants and elephant sculptures in group (b), cars and toy car models in group (c), horses and graffiti horses in group (d).


And judging from the PCA (principal component analysis) visualizations, DINOv2 not only classifies accurately but also marks corresponding parts in matching colors: elephant trunks in green, wheels in red, horse tails in yellow, and so on.

In other words, DINOv2 understands the similarities in these images, just as one would describe an airplane as "looking like a bird."
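This kind of visualization is straightforward to reproduce in spirit: take DINOv2's per-patch features, run PCA, and map the first three components to RGB. A rough sketch follows; the forward_features output key is taken from the dinov2 repo and should be treated as an assumption if the API has since changed.

```python
# A rough sketch of the PCA visualization trick: project DINOv2 patch
# features to 3 principal components and read them as RGB, so matching
# parts (trunks, wheels, tails) get matching colors across images.
import torch
from sklearn.decomposition import PCA

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

img = torch.randn(1, 3, 224, 224)  # stand-in for a normalized real image
with torch.no_grad():
    out = model.forward_features(img)  # dict of token features
patches = out["x_norm_patchtokens"][0]  # (256, 384): 16x16 patches, 384-d each

rgb = PCA(n_components=3).fit_transform(patches.numpy())  # (256, 3)
rgb = (rgb - rgb.min(0)) / (rgb.max(0) - rgb.min(0))      # scale to [0, 1]
heatmap = rgb.reshape(16, 16, 3)  # coarse per-patch color map of the image
```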

Meta has also released a DINOv2 demo, and we tried it out ourselves.

The demo can be tried directly

The official site offers trials of three functions: semantic segmentation, image retrieval, and depth estimation.

According to Meta, DINOv2 surpasses OpenCLIP, the best-performing open-source vision model to date, on most benchmarks.

Let's first look at the effect of depth estimation.


It is worth mentioning that, while achieving better results, DINOv2 also beats iBOT on efficiency: on the same hardware it runs more than twice as fast while using only a third of the memory.


Here is how the Meta paper compares DINOv2 with OpenCLIP on a practical example.


We tried it on the "muscle-man" version of the Shin Takarajima meme, and it looks pretty good: even a very blurry image gets a reasonably good depth estimate.
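In the paper, depth (like segmentation) is evaluated by attaching a small head to frozen DINOv2 features. Below is a minimal, untrained sketch of that linear-probing idea, predicting one depth value per patch token; Meta's actual evaluation heads differ in detail, and the same pattern extends to segmentation by swapping in a per-patch classifier.

```python
# A minimal sketch of depth probing on frozen features: a linear layer maps
# each patch token to one depth value, then the coarse map is upsampled.
# The head is untrained here and purely illustrative.
import torch
import torch.nn.functional as F

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()
head = torch.nn.Linear(384, 1)  # would be trained on a depth dataset

img = torch.randn(1, 3, 224, 224)  # stand-in for a normalized real image
with torch.no_grad():
    tokens = backbone.forward_features(img)["x_norm_patchtokens"]  # (1, 256, 384)

depth = head(tokens).reshape(1, 1, 16, 16)  # one value per 14x14 patch
depth = F.interpolate(depth, size=(224, 224), mode="bilinear")  # full-res map
```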


Next up is semantic segmentation; here again is the data comparison from the Meta paper.


There is also a visual comparison between the two: the middle image shows OpenCLIP's segmentation and the right one shows DINOv2's.


We also tried it on a picture of an office; DINOv2 segments people and objects fairly accurately, though with some noise in the details.


Finally, there is image retrieval.

The examples on the official site look quite good: upload a photo of a tower and it returns plenty of similar art images containing that tower.


We tried it too, uploading a frame from the "Huaqiang buys melons" meme, and most of the art images returned were indeed watermelon-related.
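Mechanically, this kind of retrieval can be done by comparing DINOv2 image embeddings with cosine similarity. A rough sketch, with random tensors standing in for real, preprocessed images:

```python
# A sketch of embedding-based retrieval: encode a gallery of images once,
# then rank them by cosine similarity to the query embedding.
import torch
import torch.nn.functional as F

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

def embed(batch: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        return F.normalize(model(batch), dim=-1)  # unit-norm embeddings

gallery = embed(torch.randn(8, 3, 224, 224))  # precomputed index, (8, 384)
query = embed(torch.randn(1, 3, 224, 224))    # the uploaded picture

scores = query @ gallery.T                    # cosine similarities, (1, 8)
top = scores.topk(k=3).indices[0]             # indices of the 3 best matches
print(top.tolist())
```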


So, where can such a large self-supervised vision model be used?

Judging from Meta's video, current applications lean environmental, such as estimating tree heights around the world.


In addition, as Zuckerberg said, DINOv2 could also be used to improve medical imaging, monitor crop growth, and more. Of course, he went on to emphasize:

It can be used to build a more immersive metaverse.

Well, it seems that Meta's metaverse route will continue...
