Xiao Xiao | From Aofei Temple
QbitAI | Official WeChat account QbitAI
No text labels required: Meta's fully self-supervised large vision model is here!
Zuckerberg announced it personally, and it drew plenty of attention on release:
On tasks such as semantic segmentation, instance segmentation, depth estimation, and image retrieval, this large vision model, called DINOv2, achieves excellent results.
It even surpasses OpenCLIP, currently the best-performing open-source vision model.
Meta had already released DINO, a self-supervised large vision model, but this time the AI's ability to recognize image features has clearly gone further: it accurately segments the moving subject in a video:
And don't assume DINOv2 only learns image segmentation through self-supervision. It can also work out where the head, body, and limbs of the same kind of object (a dog, say) are located across photos of different categories and scenes:
In other words, DINOv2 has learned to find image features on its own.
Meta has not only open-sourced the code but also put up a web demo to play with. As one netizen pointedly remarked:
What counts as open source? LLaMA, SAM, DINOv2. That is open source!
Let's take a look at how effective DINOv2 really is.
Accurately identifying the same objects across different art styles
DINOv2 is a large vision model built on the previous-generation DINO.
It has 1 billion parameters and still uses a Vision Transformer (ViT) architecture, but unlike DINO, DINOv2's training data has been carefully curated.
Specifically, DINOv2 builds a data-filtering pipeline that carefully selects images with similar content while excluding duplicates:
The final training images fed to DINOv2 carry no text labels, yet their visual content is indeed similar.
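Meta has not released this pipeline as a drop-in script, but the idea described above (keep images that resemble a curated seed set, then drop near-duplicates) can be sketched with cosine similarity over image embeddings. Everything below is illustrative: the `curate` helper, the thresholds, and the synthetic embeddings are assumptions, not Meta's actual code.

```python
import numpy as np

def curate(embeddings, seed_embeddings, sim_keep=0.5, sim_dup=0.99):
    """Toy curation pass: keep images similar to a curated seed set,
    while dropping near-duplicates. Thresholds are made up for illustration."""
    # L2-normalize so dot products become cosine similarities
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    seeds = seed_embeddings / np.linalg.norm(seed_embeddings, axis=1, keepdims=True)

    kept = []
    for i, e in enumerate(emb):
        # must resemble at least one curated seed image
        if (seeds @ e).max() < sim_keep:
            continue
        # must not near-duplicate anything already kept
        if kept and max(emb[j] @ e for j in kept) > sim_dup:
            continue
        kept.append(i)
    return kept

rng = np.random.default_rng(0)
seeds = rng.normal(size=(5, 16))
pool = np.concatenate([seeds + 0.1 * rng.normal(size=(5, 16)),  # near-dupes of seeds
                       rng.normal(size=(20, 16))])               # unrelated images
print(curate(pool, seeds))
```

With real data, the embeddings would come from a pretrained image encoder rather than a random generator; the filtering logic itself stays the same.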
How effective is the visual model trained with this type of data?
Here is DINOv2's performance on 8 vision tasks, including semantic segmentation, classification, and depth estimation; orange marks the self-supervised methods and dark pink the weakly supervised ones.
As the chart shows, the self-supervised model now performs on par with weakly supervised models.
The qualitative results are good too: even when the same kind of object appears in very different styles across a set of photos, DINOv2 accurately picks out their shared features and groups them together.
For example, the winged birds and airplanes in group (a), the elephants and elephant sculptures in group (b), the cars and toy car models in group (c), and the horses and graffiti horses in group (d):
Judging from the PCA (principal component analysis) visualizations, DINOv2 not only classifies accurately but also marks corresponding parts in matching colors: elephant trunks in green, wheels in red, horse tails in yellow, and so on.
In other words, DINOv2 understands what these images have in common, much as a person might describe an airplane as "looking like a bird."
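The visualization technique behind those colored part maps can be sketched in a few lines: project each patch's feature vector onto the top principal components and show the first three as RGB. The synthetic features below stand in for real DINOv2 patch features; the `pca_color` helper is an assumption for illustration, not Meta's code.

```python
import numpy as np

def pca_color(patch_feats, n_components=3):
    """Project per-patch features onto their top principal components,
    then rescale to [0, 1] so the three components can be shown as RGB.
    patch_feats: (num_patches, dim) array of features for one image."""
    centered = patch_feats - patch_feats.mean(axis=0)
    # SVD gives principal directions without forming the covariance matrix
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:n_components].T          # (num_patches, 3)
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    return (proj - lo) / (hi - lo + 1e-8)          # normalize per channel

rng = np.random.default_rng(0)
feats = rng.normal(size=(256, 64))   # stand-in for 16x16 grid of patch features
rgb = pca_color(feats)
print(rgb.shape)                      # (256, 3), values in [0, 1]
```

Patches whose features point the same way in the principal subspace end up with similar colors, which is why an elephant's trunk and a sculpture's trunk light up alike.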
DINOv2 now has a public demo, so we tried out its actual performance.
The demo is playable right away
The official website offers trials of three functions: semantic segmentation, image retrieval, and depth estimation.
According to Meta, DINOv2 surpasses OpenCLIP, the best performing current open source visual model, on most benchmarks.
Let's first look at the effect of depth estimation.
It is worth mentioning that, while delivering better results, DINOv2 also runs faster than iBOT: on the same hardware it needs only a third of the memory and runs more than twice as fast.
Here is how the Meta paper compares it with OpenCLIP on a practical example:
We tried it with this macho version of the "Shin Takarajima" meme. The result looks pretty good: even a heavily blurred image gets a reasonable depth estimate:
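One reason these tasks work from a single backbone is that DINOv2's features are meant to be used frozen, with only a simple head trained per task. As a hedged sketch of that idea, here is a depth-style linear probe fitted by ordinary least squares; the synthetic "features" and "depths" are placeholders for real DINOv2 patch features and ground-truth depth, not Meta's actual training setup.

```python
import numpy as np

# Sketch: a task head as a linear probe on frozen features.
# Synthetic data stands in for real DINOv2 features and depth labels.
rng = np.random.default_rng(0)
true_w = rng.normal(size=64)                 # hidden linear relation
feats = rng.normal(size=(500, 64))           # "frozen" per-patch features
depth = feats @ true_w + 0.01 * rng.normal(size=500)  # noisy targets

w, *_ = np.linalg.lstsq(feats, depth, rcond=None)     # fit the probe
pred = feats @ w
print(float(np.abs(pred - depth).mean()))    # small residual error
```

The backbone never updates; only `w` is learned, which is why swapping in a new task (segmentation, depth, classification) is cheap once the features are good.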
Next is semantic segmentation; again, here is the comparison from the Meta paper:
And here is a side-by-side comparison: the middle image is OpenCLIP's segmentation, the right one is DINOv2's:
We also tried it with a photo of an office. DINOv2 segments the people and objects fairly accurately, though there is some noise in the details:
Finally, there is image retrieval.
The example on the official website works well: given a photo of the tower, it retrieves plenty of similar art images containing the same tower:
We tried it too, entering the "Huaqiang buys a melon" meme image; most of the art pictures returned are indeed watermelon-related:
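Retrieval like this typically boils down to ranking a gallery by cosine similarity between global image embeddings. The sketch below shows the mechanism with synthetic embeddings; the `retrieve` helper and the gallery are illustrative assumptions, not the demo's actual backend.

```python
import numpy as np

def retrieve(query_emb, gallery_embs, top_k=3):
    """Rank gallery images by cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q
    order = np.argsort(-sims)[:top_k]        # indices of the best matches
    return order, sims[order]

rng = np.random.default_rng(42)
gallery = rng.normal(size=(100, 32))         # stand-in image embeddings
query = gallery[7] + 0.05 * rng.normal(size=32)  # a slightly perturbed copy
idx, scores = retrieve(query, gallery)
print(idx[0])  # → 7: the near-duplicate ranks first
```

With DINOv2, the interesting part is that these embeddings were learned without any text labels, yet semantically related images (a real tower and a painted one) still land close together.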
So, where can such a large self-supervised vision model be used?
Judging from the video Meta released, there are some environmentally minded uses for now, such as estimating tree heights across the globe:
In addition, as Zuckerberg said, DINOv2 could also be used to improve medical imaging, crop growth monitoring, and so on. Of course, Zuckerberg went on to emphasize:
It can be used to make a more immersive metaverse.
Well, it seems that Meta's metaverse route will continue...