Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

Ming Min, reporting from Aofei Temple

QbitAI | WeChat official account QbitAI

Released, and immediately open-sourced!

Meta's "Split Everything AI" second-generation SAM2 has just been unveiled on SIGGRAPH.

Compared to the previous generation, its capabilities have been expanded from image segmentation to video segmentation.

It can process videos of arbitrary length in real time, and it can easily segment and track objects it has never seen before.

Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

What's more, the model code, weights, and datasets are all open source!

Like the Llama family, it is openly released: the model is licensed under Apache 2.0, and the evaluation code is shared under the BSD-3 license.

Netizens quipped sarcastically: just asking whether OpenAI is embarrassed.

Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

Meta said that the open-source dataset contains 51,000 real-world videos and 600,000 spatio-temporal masks, far exceeding the previous largest dataset of its kind.

An online demo is also live, so anyone can try it out.

Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

Adding a memory module on top of SAM

Compared with the first-generation SAM, SAM2 adds the following capabilities:

  • Real-time segmentation of arbitrarily long videos
  • Zero-shot generalization
  • Improved segmentation and tracking accuracy
  • Handling of occlusion
Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

Its interactive segmentation process is divided into two main steps: selection and refinement.

In the first frame, the user selects the target object by clicking, and SAM2 automatically propagates the segmentation to the subsequent frames based on the click, forming a spatiotemporal mask.

If SAM2 loses the target object in some frames, the user can correct it by providing an additional prompt in a new frame.

For example, if the object needs to be recovered in the third frame, just click on it in that frame.
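For readers who want to try this flow against the open-source release, here is a minimal usage sketch based on the sam2 repository's video predictor interface. The config and checkpoint names, the frame-directory input, and the exact call signatures (build_sam2_video_predictor, init_state, add_new_points, propagate_in_video) are recalled from the repository's examples and may differ between versions, so treat them as assumptions rather than a definitive recipe.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Config / checkpoint names follow the initial SAM2 release and may differ
# in later versions of the repository.
predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # "video_frames/" is assumed to be a directory of extracted JPEG frames.
    state = predictor.init_state(video_path="video_frames/")

    # Selection: a single positive click on the target object in frame 0.
    predictor.add_new_points(
        inference_state=state, frame_idx=0, obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Propagation: SAM2 extends the mask to every subsequent frame.
    video_masks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        video_masks[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()

    # Refinement: if the object is lost, add a corrective click on, say,
    # frame 3 and propagate again.
    predictor.add_new_points(
        inference_state=state, frame_idx=3, obj_id=1,
        points=np.array([[230, 340]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )
```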

Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

The core idea of SAM2 is to treat an image as a single-frame video, so it can be extended directly from SAM to the video domain, supporting both image and video inputs.

The only difference when processing video is that the model relies on memory to recall previously processed information in order to accurately segment the object at the current time step.

Compared with image segmentation, video segmentation must cope with objects that move, deform, become occluded, and undergo strong lighting changes. Segmenting objects in a video also requires understanding where entities are located across both space and time.

Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

So Meta did three main pieces of work:

  • Designing a promptable visual segmentation task
  • Designing a new model on the basis of SAM
  • Constructing the SA-V dataset
Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

First, the team designed a promptable visual segmentation task that generalizes image segmentation to the video domain.

SAM was trained to take input points, boxes, or masks on an image to define the target and predict a segmentation mask.

SAM2 is then trained to accept a prompt in any frame of a video to define the spatiotemporal mask to be predicted.

Based on the input prompt, SAM2 instantly predicts a mask on the current frame and performs temporal propagation, generating masks of the target object across all frames.

Once the initial mask is predicted, it can be improved iteratively by providing additional prompts to SAM2 in any frame, and this can be repeated as many times as needed until all the masks are obtained.
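To make the notion of a spatiotemporal mask (a "masklet") concrete, here is a toy illustration of the idea: one object ID mapped to a per-frame binary mask that can be overwritten whenever a corrective prompt arrives. This is purely illustrative and is not SAM2's internal representation.

```python
import numpy as np

class Masklet:
    """Toy spatiotemporal mask: frame index -> binary mask for one object."""

    def __init__(self, obj_id: int):
        self.obj_id = obj_id
        self.masks: dict[int, np.ndarray] = {}

    def update(self, frame_idx: int, mask: np.ndarray) -> None:
        # Store (or overwrite after a corrective prompt) the mask for a frame.
        self.masks[frame_idx] = mask.astype(bool)

    def coverage(self) -> list[int]:
        # Frames on which the object currently has a predicted mask.
        return sorted(self.masks)

# Example: the object is predicted on frames 0-2, then a new prompt on
# frame 3 adds (or fixes) the mask there.
masklet = Masklet(obj_id=1)
for t in range(3):
    masklet.update(t, np.zeros((480, 640), dtype=bool))
masklet.update(3, np.ones((480, 640), dtype=bool))
print(masklet.coverage())  # [0, 1, 2, 3]
```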

Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

By introducing streaming memory, the model can process video in real time and segment and track the target object more accurately.

The streaming memory consists of a memory encoder, a memory bank, and a memory attention module. It lets the model process only one frame at a time, using information from previous frames to assist the segmentation of the current frame.

When segmenting an image, the memory components are empty and the model behaves like SAM. When segmenting a video, the memory components store information about the object as well as previous interactions, allowing SAM2 to make mask predictions throughout the video.

If additional prompts are given on other frames, SAM2 can correct its predictions based on the stored memory of the target object.

The memory encoder creates a memory from the current prediction, and the memory bank retains information about past predictions of the target object in the video. The memory attention mechanism conditions the features of the current frame on features from past frames to produce embeddings, which are passed to the mask decoder to generate a mask prediction for that frame; this operation is repeated frame by frame.
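The data flow described above can be summarized with a small conceptual sketch. The module sizes, layer choices, and the unbounded memory bank below are simplifications invented for illustration; the real SAM2 uses a hierarchical image encoder, a transformer-based memory attention stack, and a bounded memory bank of recent and prompted frames.

```python
import torch
import torch.nn as nn

D = 256  # feature dimension (assumed for the sketch)

image_encoder  = nn.Conv2d(3, D, kernel_size=16, stride=16)   # frame -> features
memory_encoder = nn.Conv2d(D + 1, D, kernel_size=1)           # features + mask -> memory
memory_attn    = nn.MultiheadAttention(D, num_heads=8, batch_first=True)
mask_decoder   = nn.Conv2d(D, 1, kernel_size=1)               # features -> mask logits

memory_bank: list[torch.Tensor] = []  # one memory entry per processed frame

@torch.no_grad()
def process_frame(frame: torch.Tensor) -> torch.Tensor:
    """Segment one frame, conditioning on memories of previous frames."""
    feats = image_encoder(frame)                    # (1, D, h, w)
    b, d, h, w = feats.shape
    q = feats.flatten(2).transpose(1, 2)            # (1, h*w, D)
    if memory_bank:                                 # video mode: attend to the past
        mem = torch.cat(memory_bank, dim=1)         # (1, n_mem, D)
        q, _ = memory_attn(q, mem, mem)
    # With an empty memory bank this reduces to image segmentation, as in SAM.
    cond_feats = q.transpose(1, 2).reshape(b, d, h, w)
    mask_logits = mask_decoder(cond_feats)          # (1, 1, h, w)
    # Encode the current prediction into a new memory and append it.
    memory = memory_encoder(torch.cat([feats, mask_logits], dim=1))
    memory_bank.append(memory.flatten(2).transpose(1, 2))
    return mask_logits.sigmoid()

video = torch.rand(4, 3, 256, 256)                  # four dummy RGB frames
masks = [process_frame(f.unsqueeze(0)) for f in video]
```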

Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

This design also allows the model to process videos of any length, which matters not only for collecting annotations for the SA-V dataset but also for robotics and other fields.

If the prompt is ambiguous, SAM2 will also output multiple valid masks. For example, if the user clicks on the tire of a bicycle, the model may interpret the click in several ways, referring either to just the tire or to the whole bicycle, and output multiple predictions.

In a video, if only the tire is visible in a frame, the tire may be what needs to be segmented; if the whole bicycle appears in subsequent frames, then the bicycle may be what needs to be segmented.

If it still cannot determine which part the user wants to segment, the model chooses based on confidence.

In addition, objects in a video are easily occluded. To handle this, SAM2 adds an extra model output, an occlusion head, which predicts whether the object is visible in the current frame.
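Putting the two ideas together, mask selection under ambiguity might look roughly like the following. The tensor names, shapes, and the 0.5 visibility threshold are assumptions made for the sketch, not values taken from the paper.

```python
import torch

def select_mask(mask_logits: torch.Tensor,       # (num_candidates, H, W)
                iou_scores: torch.Tensor,        # (num_candidates,) confidence
                occlusion_logit: torch.Tensor):  # scalar from the occlusion head
    """Pick one mask from several candidates, or none if the object is hidden."""
    visible = occlusion_logit.sigmoid() > 0.5
    if not visible:
        # The occlusion head says the object is not in this frame.
        return None
    best = iou_scores.argmax()                   # most confident candidate
    return (mask_logits[best] > 0).to(torch.bool)

# Dummy example: three candidates ("tire", "frame", "whole bicycle"),
# the second is the most confident, and the object is clearly visible.
masks = torch.randn(3, 480, 640)
scores = torch.tensor([0.41, 0.87, 0.55])
occ = torch.tensor(2.0)
chosen = select_mask(masks, scores, occ)
print(None if chosen is None else chosen.shape)  # torch.Size([480, 640])
```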

Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

Then there is the dataset.

SA-V contains 4.5 times more videos and 53 times more annotations than the largest existing dataset of its kind.

Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

To collect so much data, the research team built a data engine: SAM2 is used to annotate spatiotemporal masks in videos, and the new annotations are then used to update SAM2. By repeating this cycle many times, the dataset and the model are iteratively improved together.

As with SAM, the research team does not apply semantic constraints to the annotated spatiotemporal masks, focusing instead on complete objects.

This approach also greatly speeds up the collection of video object segmentation masks, making it 8.4 times faster than with SAM.
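Conceptually, the data engine is just an annotate-and-retrain loop. The sketch below uses toy stand-ins for the model, the human review step, and retraining; none of these names come from the actual SAM2 tooling.

```python
from dataclasses import dataclass, field

@dataclass
class ToyModel:
    """Stand-in for the segmentation model; version marks retraining rounds."""
    version: int = 0
    def annotate(self, video: str) -> str:
        return f"masks_for_{video}_v{self.version}"

@dataclass
class DataEngine:
    model: ToyModel
    dataset: list = field(default_factory=list)

    def run_round(self, videos: list[str]) -> None:
        # 1. The current model proposes spatiotemporal masks; annotators
        #    correct them (here a no-op stand-in for human review).
        for video in videos:
            proposed = self.model.annotate(video)
            corrected = proposed  # human correction would happen here
            self.dataset.append((video, corrected))
        # 2. The corrected annotations are used to update the model, and the
        #    improved model drives the next annotation round.
        self.model = ToyModel(version=self.model.version + 1)

engine = DataEngine(model=ToyModel())
for _ in range(3):                                 # repeat the cycle several times
    engine.run_round(["vid_001", "vid_002"])
print(len(engine.dataset), engine.model.version)   # 6 3
```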

Solving over-segmentation and surpassing SOTA

In comparison with prior models, SAM2 solves the problem of over-segmentation.

Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

Experimental results show that SAM2 performs well across the board compared with semi-supervised SOTA methods.

Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

However, the research team also acknowledged that SAM2 still has shortcomings.

For example, it may lose track of the target object. This is more likely to happen when the camera viewpoint changes drastically or in crowded scenes. So they designed a real-time interactive mode that supports manual corrections.

Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

And if the target object moves too fast, details may be lost.

Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

Finally, the model is not only open source and free to use, but is already hosted on platforms such as Amazon SageMaker.

It is also worth mentioning that the paper notes SAM2 training took 108 hours on 256 A100 GPUs, compared with 68 hours for SAM1.

Expanding from image segmentation to video at such a low cost?
Meta "Split Everything" Evolution 2.0! Tracking moving objects, the code weight dataset is fully open source

Reference Links:

[1]https://ai.meta.com/blog/segment-anything-2/

[2]https://x.com/swyx/status/1818074658299855262

— END —
