Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

Ming Min, reporting from Aofei Temple

QbitAI | WeChat official account QbitAI

Released, and immediately open-sourced!

Meta's "Split Everything AI" second-generation SAM2 has just been unveiled on SIGGRAPH.

Compared to the previous generation, its capabilities have been expanded from image segmentation to video segmentation.

It can process videos of arbitrary length in real time, and it can easily segment and track objects it has never seen before.

Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

What's more, the model code, weights, and datasets are all open source!

Like the Llama family, it is openly released: the model is licensed under Apache 2.0, and the evaluation code is shared under the BSD-3 license.

Netizens quipped sarcastically: just asking whether OpenAI is embarrassed.

Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

Meta said that the open-source dataset contains 51,000 real-world videos and 600,000 spatio-temporal masks, far exceeding the previous largest dataset of its kind.

An online demo is also live, so anyone can try it out.

Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

Adding a memory module on top of SAM

Compared with the first-generation SAM, SAM2 adds the following capabilities:

  • Real-time segmentation of arbitrarily long videos
  • Zero-shot generalization
  • Improved segmentation and tracking accuracy
  • Handling of occlusion
Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

Its interactive segmentation process is divided into two main steps: selection and refinement.

In the first frame, the user selects the target object by clicking, and SAM2 automatically propagates the segmentation to the subsequent frames based on the click, forming a spatiotemporal mask.

If SAM2 loses the target object in some frames, the user can correct it by providing an additional prompt in a new frame.

For example, if the object needs to be recovered in the third frame, just click on it in that frame.
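For readers who want to try this flow against the open-source release, here is a minimal usage sketch based on the sam2 repository's video predictor interface. The config and checkpoint names, the frame-directory input, and the exact call signatures (build_sam2_video_predictor, init_state, add_new_points, propagate_in_video) are recalled from the repository's examples and may differ between versions, so treat them as assumptions rather than a definitive recipe.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Config / checkpoint names follow the initial SAM2 release and may differ
# in later versions of the repository.
predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # "video_frames/" is assumed to be a directory of extracted JPEG frames.
    state = predictor.init_state(video_path="video_frames/")

    # Selection: a single positive click on the target object in frame 0.
    predictor.add_new_points(
        inference_state=state, frame_idx=0, obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Propagation: SAM2 extends the mask to every subsequent frame.
    video_masks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        video_masks[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()

    # Refinement: if the object is lost, add a corrective click on, say,
    # frame 3 and propagate again.
    predictor.add_new_points(
        inference_state=state, frame_idx=3, obj_id=1,
        points=np.array([[230, 340]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )
```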

Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

The core idea of SAM2 is to treat an image as a single-frame video, so it can be extended directly from SAM to the video domain, supporting both image and video inputs.

The only difference when processing video is that the model relies on memory to recall previously processed information in order to accurately segment the object at the current time step.

Compared with image segmentation, video segmentation must cope with objects that move, deform, become occluded, and undergo strong lighting changes. Segmenting objects in a video also requires understanding where entities are located across both space and time.

Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

So Meta did three main pieces of work:

  • Designing a promptable visual segmentation task
  • Designing a new model on the basis of SAM
  • Constructing the SA-V dataset
Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

First, the team designed a promptable visual segmentation task that generalizes image segmentation to the video domain.

SAM was trained to take input points, boxes, or masks on an image to define the target and predict a segmentation mask.

SAM2 is then trained to accept a prompt in any frame of a video to define the spatiotemporal mask to be predicted.

Based on the input prompt, SAM2 instantly predicts a mask on the current frame and performs temporal propagation, generating masks of the target object across all frames.

Once the initial mask is predicted, it can be improved iteratively by providing additional prompts to SAM2 in any frame, and this can be repeated as many times as needed until all the masks are obtained.
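To make the notion of a spatiotemporal mask (a "masklet") concrete, here is a toy illustration of the idea: one object ID mapped to a per-frame binary mask that can be overwritten whenever a corrective prompt arrives. This is purely illustrative and is not SAM2's internal representation.

```python
import numpy as np

class Masklet:
    """Toy spatiotemporal mask: frame index -> binary mask for one object."""

    def __init__(self, obj_id: int):
        self.obj_id = obj_id
        self.masks: dict[int, np.ndarray] = {}

    def update(self, frame_idx: int, mask: np.ndarray) -> None:
        # Store (or overwrite after a corrective prompt) the mask for a frame.
        self.masks[frame_idx] = mask.astype(bool)

    def coverage(self) -> list[int]:
        # Frames on which the object currently has a predicted mask.
        return sorted(self.masks)

# Example: the object is predicted on frames 0-2, then a new prompt on
# frame 3 adds (or fixes) the mask there.
masklet = Masklet(obj_id=1)
for t in range(3):
    masklet.update(t, np.zeros((480, 640), dtype=bool))
masklet.update(3, np.ones((480, 640), dtype=bool))
print(masklet.coverage())  # [0, 1, 2, 3]
```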

Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

By introducing streaming memory, the model can process video in real time and segment and track the target object more accurately.

The streaming memory consists of a memory encoder, a memory bank, and a memory attention module. It lets the model process only one frame at a time, using information from previous frames to assist the segmentation of the current frame.

When segmenting an image, the memory components are empty and the model behaves like SAM. When segmenting a video, the memory components store information about the object as well as previous interactions, allowing SAM2 to make mask predictions throughout the video.

If additional prompts are given on other frames, SAM2 can correct its predictions based on the stored memory of the target object.

The memory encoder creates a memory from the current prediction, and the memory bank retains information about past predictions of the target object in the video. The memory attention mechanism conditions the features of the current frame on features from past frames to produce embeddings, which are passed to the mask decoder to generate a mask prediction for that frame; this operation is repeated frame by frame.
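The data flow described above can be summarized with a small conceptual sketch. The module sizes, layer choices, and the unbounded memory bank below are simplifications invented for illustration; the real SAM2 uses a hierarchical image encoder, a transformer-based memory attention stack, and a bounded memory bank of recent and prompted frames.

```python
import torch
import torch.nn as nn

D = 256  # feature dimension (assumed for the sketch)

image_encoder  = nn.Conv2d(3, D, kernel_size=16, stride=16)   # frame -> features
memory_encoder = nn.Conv2d(D + 1, D, kernel_size=1)           # features + mask -> memory
memory_attn    = nn.MultiheadAttention(D, num_heads=8, batch_first=True)
mask_decoder   = nn.Conv2d(D, 1, kernel_size=1)               # features -> mask logits

memory_bank: list[torch.Tensor] = []  # one memory entry per processed frame

@torch.no_grad()
def process_frame(frame: torch.Tensor) -> torch.Tensor:
    """Segment one frame, conditioning on memories of previous frames."""
    feats = image_encoder(frame)                    # (1, D, h, w)
    b, d, h, w = feats.shape
    q = feats.flatten(2).transpose(1, 2)            # (1, h*w, D)
    if memory_bank:                                 # video mode: attend to the past
        mem = torch.cat(memory_bank, dim=1)         # (1, n_mem, D)
        q, _ = memory_attn(q, mem, mem)
    # With an empty memory bank this reduces to image segmentation, as in SAM.
    cond_feats = q.transpose(1, 2).reshape(b, d, h, w)
    mask_logits = mask_decoder(cond_feats)          # (1, 1, h, w)
    # Encode the current prediction into a new memory and append it.
    memory = memory_encoder(torch.cat([feats, mask_logits], dim=1))
    memory_bank.append(memory.flatten(2).transpose(1, 2))
    return mask_logits.sigmoid()

video = torch.rand(4, 3, 256, 256)                  # four dummy RGB frames
masks = [process_frame(f.unsqueeze(0)) for f in video]
```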

Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

This design also allows the model to process videos of any length, which matters not only for collecting annotations for the SA-V dataset but also for robotics and other fields.

If the prompt is ambiguous, SAM2 will also output multiple valid masks. For example, if the user clicks on the tire of a bicycle, the model may interpret the click in several ways, referring either to just the tire or to the whole bicycle, and output multiple predictions.

In a video, if only the tire is visible in a frame, the tire may be what needs to be segmented; if the whole bicycle appears in subsequent frames, then the bicycle may be what needs to be segmented.

If it still cannot determine which part the user wants to segment, the model chooses based on confidence.

In addition, objects in a video are easily occluded. To handle this, SAM2 adds an extra model output, an occlusion head, which predicts whether the object is visible in the current frame.
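Putting the two ideas together, mask selection under ambiguity might look roughly like the following. The tensor names, shapes, and the 0.5 visibility threshold are assumptions made for the sketch, not values taken from the paper.

```python
import torch

def select_mask(mask_logits: torch.Tensor,       # (num_candidates, H, W)
                iou_scores: torch.Tensor,        # (num_candidates,) confidence
                occlusion_logit: torch.Tensor):  # scalar from the occlusion head
    """Pick one mask from several candidates, or none if the object is hidden."""
    visible = occlusion_logit.sigmoid() > 0.5
    if not visible:
        # The occlusion head says the object is not in this frame.
        return None
    best = iou_scores.argmax()                   # most confident candidate
    return (mask_logits[best] > 0).to(torch.bool)

# Dummy example: three candidates ("tire", "frame", "whole bicycle"),
# the second is the most confident, and the object is clearly visible.
masks = torch.randn(3, 480, 640)
scores = torch.tensor([0.41, 0.87, 0.55])
occ = torch.tensor(2.0)
chosen = select_mask(masks, scores, occ)
print(None if chosen is None else chosen.shape)  # torch.Size([480, 640])
```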

Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

Then there is the dataset.

SA-V contains 4.5 times more videos and 53 times more annotations than the largest existing dataset of its kind.

Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

To collect so much data, the research team built a data engine: SAM2 is used to annotate spatiotemporal masks in videos, and the new annotations are then used to update SAM2. By repeating this cycle many times, the dataset and the model are iteratively improved together.

As with SAM, the research team does not apply semantic constraints to the annotated spatiotemporal masks, focusing instead on complete objects.

This approach also greatly speeds up the collection of video object segmentation masks, making it 8.4 times faster than with SAM.
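Conceptually, the data engine is just an annotate-and-retrain loop. The sketch below uses toy stand-ins for the model, the human review step, and retraining; none of these names come from the actual SAM2 tooling.

```python
from dataclasses import dataclass, field

@dataclass
class ToyModel:
    """Stand-in for the segmentation model; version marks retraining rounds."""
    version: int = 0
    def annotate(self, video: str) -> str:
        return f"masks_for_{video}_v{self.version}"

@dataclass
class DataEngine:
    model: ToyModel
    dataset: list = field(default_factory=list)

    def run_round(self, videos: list[str]) -> None:
        # 1. The current model proposes spatiotemporal masks; annotators
        #    correct them (here a no-op stand-in for human review).
        for video in videos:
            proposed = self.model.annotate(video)
            corrected = proposed  # human correction would happen here
            self.dataset.append((video, corrected))
        # 2. The corrected annotations are used to update the model, and the
        #    improved model drives the next annotation round.
        self.model = ToyModel(version=self.model.version + 1)

engine = DataEngine(model=ToyModel())
for _ in range(3):                                 # repeat the cycle several times
    engine.run_round(["vid_001", "vid_002"])
print(len(engine.dataset), engine.model.version)   # 6 3
```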

Solving over-segmentation and surpassing SOTA

In comparison with prior models, SAM2 solves the problem of over-segmentation.

Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

Experimental results show that SAM2 performs well across the board compared with semi-supervised SOTA methods.

Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

However, the research team also acknowledged that SAM2 still has shortcomings.

For example, it may lose track of the target object. This is more likely to happen when the camera viewpoint changes drastically or in crowded scenes. So they designed a real-time interactive mode that supports manual corrections.

Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

And if the target object moves too fast, details may be lost.

Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

Finally, the model is not only open source and free to use, but is already hosted on platforms such as Amazon SageMaker.

It is also worth mentioning that the paper notes SAM2 training took 108 hours on 256 A100 GPUs, compared with 68 hours for SAM1.

Expanding from image segmentation to video at such a low cost?
Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

Reference Links:

[1]https://ai.meta.com/blog/segment-anything-2/

[2]https://x.com/swyx/status/1818074658299855262

— END —
