Xiao Xiao | From Aofei Temple
QbitAI | Official WeChat account QbitAI
No text labels required: Meta's fully self-supervised large vision model is here!
Zuckerberg announced it personally, and it drew plenty of attention on release:
On tasks such as semantic segmentation, instance segmentation, depth estimation, and image retrieval, this large vision model, called DINOv2, achieves excellent results.
It even surpasses OpenCLIP, currently the best-performing open-source vision model.
Meta had already released DINO, a self-supervised large vision model, but this time the AI's ability to recognize image features has clearly gone further: it accurately segments the moving subject in a video:
And don't assume DINOv2 only learns image segmentation through self-supervision. It can also work out where the head, body, and limbs of the same kind of object (a dog, say) are located across photos of different categories and scenes:
In other words, DINOv2 has learned to find image features on its own.
Meta has not only open-sourced the code but also put up a web demo to play with. As one netizen pointedly remarked:
What counts as open source? LLaMA, SAM, DINOv2. That is open source!
Let's take a look at how effective DINOv2 really is.
Accurately identifying the same objects across different art styles
DINOv2 is a large vision model built on the previous-generation DINO.
It has 1 billion parameters and still uses a Vision Transformer (ViT) architecture, but unlike DINO, DINOv2's training data has been carefully curated.
Specifically, DINOv2 builds a data-filtering pipeline that carefully selects images with similar content while excluding duplicates:
The final training images fed to DINOv2 carry no text labels, yet their visual content is indeed similar.
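Meta has not released this pipeline as a drop-in script, but the idea described above (keep images that resemble a curated seed set, then drop near-duplicates) can be sketched with cosine similarity over image embeddings. Everything below is illustrative: the `curate` helper, the thresholds, and the synthetic embeddings are assumptions, not Meta's actual code.

```python
import numpy as np

def curate(embeddings, seed_embeddings, sim_keep=0.5, sim_dup=0.99):
    """Toy curation pass: keep images similar to a curated seed set,
    while dropping near-duplicates. Thresholds are made up for illustration."""
    # L2-normalize so dot products become cosine similarities
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    seeds = seed_embeddings / np.linalg.norm(seed_embeddings, axis=1, keepdims=True)

    kept = []
    for i, e in enumerate(emb):
        # must resemble at least one curated seed image
        if (seeds @ e).max() < sim_keep:
            continue
        # must not near-duplicate anything already kept
        if kept and max(emb[j] @ e for j in kept) > sim_dup:
            continue
        kept.append(i)
    return kept

rng = np.random.default_rng(0)
seeds = rng.normal(size=(5, 16))
pool = np.concatenate([seeds + 0.1 * rng.normal(size=(5, 16)),  # near-dupes of seeds
                       rng.normal(size=(20, 16))])               # unrelated images
print(curate(pool, seeds))
```

With real data, the embeddings would come from a pretrained image encoder rather than a random generator; the filtering logic itself stays the same.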
How effective is the visual model trained with this type of data?
Here is DINOv2's performance on 8 vision tasks, including semantic segmentation, classification, and depth estimation; orange marks the self-supervised methods and dark pink the weakly supervised ones.
As the chart shows, the self-supervised model now performs on par with weakly supervised models.
The qualitative results are good too: even when the same kind of object appears in very different styles across a set of photos, DINOv2 accurately picks out their shared features and groups them together.
For example, the winged birds and airplanes in group (a), the elephants and elephant sculptures in group (b), the cars and toy car models in group (c), and the horses and graffiti horses in group (d):
Judging from the PCA (principal component analysis) visualizations, DINOv2 not only classifies accurately but also marks corresponding parts in matching colors: elephant trunks in green, wheels in red, horse tails in yellow, and so on.
In other words, DINOv2 understands what these images have in common, much as a person might describe an airplane as "looking like a bird."
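The visualization technique behind those colored part maps can be sketched in a few lines: project each patch's feature vector onto the top principal components and show the first three as RGB. The synthetic features below stand in for real DINOv2 patch features; the `pca_color` helper is an assumption for illustration, not Meta's code.

```python
import numpy as np

def pca_color(patch_feats, n_components=3):
    """Project per-patch features onto their top principal components,
    then rescale to [0, 1] so the three components can be shown as RGB.
    patch_feats: (num_patches, dim) array of features for one image."""
    centered = patch_feats - patch_feats.mean(axis=0)
    # SVD gives principal directions without forming the covariance matrix
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:n_components].T          # (num_patches, 3)
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    return (proj - lo) / (hi - lo + 1e-8)          # normalize per channel

rng = np.random.default_rng(0)
feats = rng.normal(size=(256, 64))   # stand-in for 16x16 grid of patch features
rgb = pca_color(feats)
print(rgb.shape)                      # (256, 3), values in [0, 1]
```

Patches whose features point the same way in the principal subspace end up with similar colors, which is why an elephant's trunk and a sculpture's trunk light up alike.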
DINOv2 now has a public demo, so we tried out its actual performance.
The demo is playable right away
The official website offers trials of three functions: semantic segmentation, image retrieval, and depth estimation.
According to Meta, DINOv2 surpasses OpenCLIP, the best performing current open source visual model, on most benchmarks.
Let's first look at the effect of depth estimation.
It is worth mentioning that, while delivering better results, DINOv2 also runs faster than iBOT: on the same hardware it needs only a third of the memory and runs more than twice as fast.
Here is how the Meta paper compares it with OpenCLIP on a practical example:
We tried it with this macho version of the "Shin Takarajima" meme. The result looks pretty good: even a heavily blurred image gets a reasonable depth estimate:
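One reason these tasks work from a single backbone is that DINOv2's features are meant to be used frozen, with only a simple head trained per task. As a hedged sketch of that idea, here is a depth-style linear probe fitted by ordinary least squares; the synthetic "features" and "depths" are placeholders for real DINOv2 patch features and ground-truth depth, not Meta's actual training setup.

```python
import numpy as np

# Sketch: a task head as a linear probe on frozen features.
# Synthetic data stands in for real DINOv2 features and depth labels.
rng = np.random.default_rng(0)
true_w = rng.normal(size=64)                 # hidden linear relation
feats = rng.normal(size=(500, 64))           # "frozen" per-patch features
depth = feats @ true_w + 0.01 * rng.normal(size=500)  # noisy targets

w, *_ = np.linalg.lstsq(feats, depth, rcond=None)     # fit the probe
pred = feats @ w
print(float(np.abs(pred - depth).mean()))    # small residual error
```

The backbone never updates; only `w` is learned, which is why swapping in a new task (segmentation, depth, classification) is cheap once the features are good.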
Next is semantic segmentation; again, here is the comparison from the Meta paper:
And here is a side-by-side comparison: the middle image is OpenCLIP's segmentation, the right one is DINOv2's:
We also tried it with a photo of an office. DINOv2 segments the people and objects fairly accurately, though there is some noise in the details:
Finally, there is image retrieval.
The example on the official website works well: given a photo of the tower, it retrieves plenty of similar art images containing the same tower:
We tried it too, entering the "Huaqiang buys a melon" meme image; most of the art pictures returned are indeed watermelon-related:
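Retrieval like this typically boils down to ranking a gallery by cosine similarity between global image embeddings. The sketch below shows the mechanism with synthetic embeddings; the `retrieve` helper and the gallery are illustrative assumptions, not the demo's actual backend.

```python
import numpy as np

def retrieve(query_emb, gallery_embs, top_k=3):
    """Rank gallery images by cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q
    order = np.argsort(-sims)[:top_k]        # indices of the best matches
    return order, sims[order]

rng = np.random.default_rng(42)
gallery = rng.normal(size=(100, 32))         # stand-in image embeddings
query = gallery[7] + 0.05 * rng.normal(size=32)  # a slightly perturbed copy
idx, scores = retrieve(query, gallery)
print(idx[0])  # → 7: the near-duplicate ranks first
```

With DINOv2, the interesting part is that these embeddings were learned without any text labels, yet semantically related images (a real tower and a painted one) still land close together.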
So, where can such a large self-supervised vision model be used?
Judging from the video Meta released, there are some environmentally minded uses for now, such as estimating tree heights across the globe:
In addition, as Zuckerberg said, DINOv2 could also be used to improve medical imaging, crop growth monitoring, and so on. Of course, Zuckerberg went on to emphasize:
It can be used to make a more immersive metaverse.
Well, it seems that Meta's metaverse route will continue...