
The Grand Finale of Video Segmentation! Zhejiang University Releases SAM-Track: Universal Intelligent One-Click Video Segmentation

Author: New Zhiyuan

Editor: So Sleepy

[New Zhiyuan Overview] With just a sentence, a click, or a brush stroke, you can segment and track any object in any scene!

Recently, the ReLER Lab at Zhejiang University deeply integrated SAM with video segmentation and released Segment-and-Track Anything (SAM-Track).

SAM-Track gives SAM the ability to track objects across video frames and supports multiple interaction modes (click, brush, and text).

Building on this, SAM-Track unifies several traditional video segmentation tasks, achieving one-click segmentation and tracking of any object in any video, and extends traditional video segmentation to general video segmentation.

SAM-Track delivers excellent performance: it can stably track hundreds of objects with high quality in complex scenes, using only a single GPU.


Project address: https://github.com/z-x-yang/Segment-and-Track-Anything

Paper address: https://arxiv.org/abs/2305.06558

Demo results

SAM-Track supports language input as a prompt. For example, given the category text "panda", it can track all objects belonging to the "panda" category with one click.


You can also give a more detailed description: for example, with the input text "the leftmost panda", SAM-Track can locate that specific target for segmentation and tracking.


Compared with traditional video tracking algorithms, another powerful feature of SAM-Track is that it can track and segment a large number of objects simultaneously while automatically detecting newly appearing objects.


SAM-Track also supports combining multiple interaction methods, which users can mix and match as needed. For example, use the brush to outline a skateboard that is in close contact with the rider, avoiding the segmentation of extraneous regions, and then use clicks to select the person.


Fully automatic video object segmentation and tracking is no problem either: a variety of application scenarios, including street scenes, aerial footage, AR, animation, and medical images, can be segmented and tracked with one click, with newly appearing objects detected automatically.


If users are not satisfied with the automatic segmentation results, they can make interactive corrections on top of them, such as using a click to fix an over-segmented tram.


The latest version of SAM-Track also supports browsing tracking results online: you can select the segmentation result of any intermediate frame, modify it or add targets, and then re-run tracking.


To make it easy to try online, the project provides a WebUI that can be deployed with one click via Colab:


Model composition

The SAM-Track model is built on DeAOT, the solution that won four tracks of the VOT 2022 challenge at the ECCV 2022 workshop.

DeAOT is an efficient multi-object video object segmentation (VOS) model: given object annotations in the first frame, it can track and segment those objects throughout the rest of the video.

DeAOT uses an identification mechanism to embed multiple objects in a video into the same high-dimensional space, enabling simultaneous tracking of multiple objects.
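The idea of a shared identification space can be illustrated with a toy sketch (a simplification for intuition, not DeAOT's actual implementation): each object ID in a multi-object mask is mapped to a vector from an ID bank, so all objects live in one embedding space and can be propagated together, then recovered by nearest-neighbour lookup.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_IDS, DIM = 4, 8                        # up to 4 objects, 8-dim embeddings
id_bank = rng.normal(size=(NUM_IDS, DIM))  # stand-in for learned ID vectors

def embed_ids(id_mask: np.ndarray) -> np.ndarray:
    """Map an HxW integer ID mask to an HxWxDIM embedding map."""
    return id_bank[id_mask]

def decode_ids(emb: np.ndarray) -> np.ndarray:
    """Recover the ID mask by nearest-neighbour lookup in the ID bank."""
    # distance between every pixel embedding and every ID vector: HxWxNUM_IDS
    d = np.linalg.norm(emb[..., None, :] - id_bank, axis=-1)
    return d.argmin(axis=-1)

mask = np.array([[0, 0, 1],
                 [2, 2, 1],
                 [2, 3, 3]])
emb = embed_ids(mask)            # (3, 3, 8): all objects in one space
recovered = decode_ids(emb)
assert (recovered == mask).all()
```

Because every object shares the same embedding map, one forward pass can propagate all of them at once — which is why DeAOT's multi-object tracking cost stays close to single-object tracking.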

When tracking multiple objects, DeAOT's speed is comparable to that of other VOS methods tracking a single object.

In addition, through a propagation mechanism based on a hierarchical Transformer, DeAOT better aggregates long-term and short-term temporal information, yielding excellent tracking performance.

Since DeAOT needs annotations of a reference frame for initialization, SAM-Track uses the Segment Anything Model (SAM), which has recently become popular in image segmentation, to obtain that annotation information.

Leveraging SAM's excellent zero-shot transfer capability and multiple interaction methods, SAM-Track can efficiently obtain high-quality reference-frame annotations for DeAOT.

Although SAM excels at image segmentation, it cannot output semantic labels, and its text prompts cannot support referring object segmentation or other tasks that rely on deep semantic understanding.

Therefore, the SAM-Track model further integrates Grounding-DINO to achieve high-precision language-guided video segmentation. Grounding-DINO is an open-set object detection model with strong language understanding.

Given a category name or a detailed description of the target object, Grounding-DINO can detect the target and return its bounding box.
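From the caller's side, open-set detection results of this kind can be thought of as (box, phrase, score) triples filtered by the text prompt. The sketch below is a toy illustration with made-up data — the real model produces these triples from images and text, and resolves phrases like "the leftmost panda" itself:

```python
# Hypothetical detection triples: (box as (x0, y0, x1, y1), phrase, score)
detections = [
    ((10, 20, 80, 120), "panda", 0.92),
    ((150, 30, 230, 140), "panda", 0.88),
    ((300, 40, 360, 90), "tree", 0.75),
]

def boxes_for_prompt(dets, prompt, threshold=0.5):
    """Keep boxes whose phrase matches the prompt above the score threshold."""
    return [box for box, phrase, score in dets
            if phrase == prompt and score >= threshold]

panda_boxes = boxes_for_prompt(detections, "panda")
# a crude spatial tie-break, e.g. for "the leftmost panda"
leftmost = min(panda_boxes, key=lambda b: b[0])
```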

SAM-Track model architecture

As shown in the figure below, the SAM-Track model supports three object tracking modes: interactive tracking, automatic tracking, and a fusion mode.

[Figure: The three tracking modes of SAM-Track]

In interactive tracking mode, SAM-Track first applies SAM, selecting targets in the reference frame by clicking or drawing boxes until the user is satisfied with the interactive segmentation result.

For language-guided segmentation of video objects, SAM-Track calls Grounding-DINO to obtain the bounding box of the target object from the input text, and SAM then produces the segmentation of the object of interest based on that box.
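This detector-to-segmenter hand-off can be sketched minimally: a box from the detector constrains where the segmenter looks. The thresholding "segmenter" below is a toy stand-in for SAM, used only to show the shape of the pipeline:

```python
import numpy as np

def segment_in_box(score_map, box, thresh=0.5):
    """Binary mask: foreground (score > thresh) clipped to box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    mask = np.zeros_like(score_map, dtype=bool)
    mask[y0:y1, x0:x1] = score_map[y0:y1, x0:x1] > thresh
    return mask

scores = np.zeros((6, 6))
scores[1:5, 1:5] = 0.9                      # a bright "object" blob
mask = segment_in_box(scores, box=(0, 0, 3, 3))
assert mask.sum() == 4                      # only blob pixels inside the box
```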

Finally, DeAOT uses the interactive segmentation result as the reference frame to track the selected targets. During tracking, DeAOT propagates the hierarchical visual embeddings and high-dimensional ID embeddings of past frames to the current frame, segmenting the multiple target objects frame by frame. Thus, through multimodal interaction, SAM-Track can segment and track objects of interest throughout a video.

However, interactive tracking mode cannot handle objects that newly appear during the video, which limits SAM-Track's applicability in certain domains, such as autonomous driving and smart cities.

To further extend SAM-Track's range of applications and performance, it implements an automatic tracking mode that can follow new objects appearing in the video.

Automatic tracking mode uses Segment Everything and object-of-interest segmentation to annotate new objects every n frames. To assign IDs to newly appearing objects, SAM-Track employs a Comparing Mask Results (CMR) module to determine each new object's ID.
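The comparison step can be sketched as follows (a loose, simplified model of the CMR idea, not the paper's exact algorithm): every n frames, candidate masks from "segment everything" are compared by IoU against currently tracked masks; candidates overlapping an existing track keep its ID, and the rest receive fresh IDs.

```python
def iou(a, b):
    """IoU of two masks given as lists of pixel coordinates."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def assign_ids(tracked, candidates, new_id_start, iou_thresh=0.5):
    """tracked: {id: pixels}. Returns {id: pixels} with new objects added."""
    out = dict(tracked)
    next_id = new_id_start
    for cand in candidates:
        matches = [tid for tid, pix in tracked.items()
                   if iou(pix, cand) >= iou_thresh]
        if not matches:          # no overlap with any track: an emerging object
            out[next_id] = cand
            next_id += 1
    return out

tracked = {1: [(0, 0), (0, 1)]}
candidates = [[(0, 0), (0, 1)],  # same object, keeps ID 1
              [(5, 5), (5, 6)]]  # emerging object, gets ID 2
result = assign_ids(tracked, candidates, new_id_start=2)
assert set(result) == {1, 2}
```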

The fusion mode combines interactive and automatic tracking. Interactive tracking lets users easily annotate the first frame of a video, while automatic tracking handles unselected new objects that appear in subsequent frames. Combining the two broadens SAM-Track's applications and increases its practicality.

Resources:

https://github.com/z-x-yang/Segment-and-Track-Anything