Zhidongxi
Author | ZeR0
Editor | Shadow of indifference
Zhidongxi reported on April 8 that, following the language-model frenzy set off by ChatGPT, computer vision ushered in its own GPT moment this week. On Thursday, Meta released the "Segment Anything" model SAM, and shortly afterward the vision team at BAAI (Beijing Academy of Artificial Intelligence) released SegGPT (Segment Everything In Context).
SegGPT can segment everything in context, and is billed as the first general-purpose vision model that uses visual prompts to complete arbitrary segmentation tasks. The differences between SegGPT and SAM are:
(1) SegGPT, "all in one": given one or a few example images and their corresponding masks, the model understands the user's intent and "learns" to perform similar segmentation tasks. By marking and labeling one type of object in a single image, the user can then segment similar objects in batches, whether in the same image, in other images, or in video.
(2) SAM, "one touch": given an interactive prompt such as a point or a bounding box on the image to be predicted, the model identifies and segments the specified object in that image.
Whether it is "one touch" (SAM) or "all in one" (SegGPT), both show that the vision model has "understood" the structure of images.
Combining SAM's fine-grained annotation capability with SegGPT's general-purpose segmentation capability makes it possible to parse any image from an array of pixels into visual structural units and to understand arbitrary scenes, much as biological vision does.
Paper: https://arxiv.org/abs/2304.03284
Code: https://github.com/baaivision/Painter
Demo: https://huggingface.co/spaces/BAAI/SegGPT
First, the goal: segment everything, backed by three advantages
SegGPT is derived from BAAI's general-purpose vision model Painter, optimized specifically for the goal of segmenting everything.
Once training is complete, no fine-tuning is needed: given only example prompts, SegGPT automatically infers and completes the corresponding segmentation task.
Specifically, the SegGPT model has the following advantages:
1. Generality: SegGPT has in-context inference capability; the model adaptively adjusts its predictions according to the provided segmentation examples (prompts) to segment "everything", including instances, categories, parts, contours, text, faces, medical images, and more.
2. Flexible inference: it supports any number of prompts, supports tuned prompts for specific scenarios, and can represent different targets with differently colored masks to perform parallel segmentation inference.
3. Automatic video segmentation and tracking: taking the first frame and its object mask as the in-context example, SegGPT automatically segments subsequent video frames and can use the mask color as the object ID to achieve automatic tracking, as sketched in the example below.
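To make the in-context usage concrete, here is a minimal Python sketch of how first-frame mask propagation could be wired up around a generic in-context segmentation model. The `InContextSegmenter` class and the `track_video` helper are hypothetical illustrations, not the actual API of the baaivision/Painter repository.

```python
# Hypothetical sketch: first-frame mask propagation with a generic
# in-context segmentation model. Not the real SegGPT API.
from typing import List
import numpy as np

class InContextSegmenter:
    """Placeholder interface: given (prompt image, prompt mask) pairs as
    context, predict a mask for a query image."""
    def predict(self, prompt_images: List[np.ndarray],
                prompt_masks: List[np.ndarray],
                query_image: np.ndarray) -> np.ndarray:
        raise NotImplementedError  # a real model would run a forward pass here

def track_video(model: InContextSegmenter,
                frames: List[np.ndarray],
                first_frame_mask: np.ndarray) -> List[np.ndarray]:
    """Propagate the first-frame mask through a video. The mask's colors act
    as object IDs, so each object keeps its color in every predicted frame."""
    masks = [first_frame_mask]
    for frame in frames[1:]:
        # The first frame and its mask serve as the in-context example.
        pred = model.predict(prompt_images=[frames[0]],
                             prompt_masks=[masks[0]],
                             query_image=frame)
        masks.append(pred)
    return masks
```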
Second, application examples: batch "cutting out" rainbows and planetary rings
1. Mark the rainbow in one image, and SegGPT can segment the rainbows in other images in batch.
2. The researchers evaluated SegGPT on a wide range of tasks, including few-shot semantic segmentation, video object segmentation, semantic segmentation, and panoptic segmentation.
The figure below shows SegGPT's segmentation results on instances, categories, parts, contours, text, and arbitrarily shaped objects.
3. Roughly circle the planetary ring with a brush stroke, and SegGPT accurately marks out the planetary ring of the target image in its prediction.
4. Given the astronaut-helmet mask provided by the user as context, SegGPT predicts the corresponding helmet region in a new image.
Third, the training approach: framed as an in-context coloring problem, with multiple techniques unlocking segmentation capabilities
SegGPT unifies various segmentation tasks into a common in-context learning framework, converting diverse segmentation data into images of the same format so that different data forms are handled uniformly.
Training is framed as an in-context coloring problem: each data sample is assigned a random color mapping, so the model must rely on the context, rather than on specific colors, to accomplish the task.
Through in-context inference, SegGPT is trained to perform arbitrary segmentation tasks in images or videos, such as object instances, categories, parts, contours, text, and arbitrarily shaped objects; a simplified sketch of the random color mapping follows.
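As an illustration of the random color-mapping idea (my own simplified sketch, not the paper's code), the prompt mask and the query's target mask of one training sample can be painted with the same freshly drawn color table, so that the meaning of each color is defined only within that sample:

```python
# Simplified sketch of the random color mapping used to frame training as
# in-context coloring. Illustration only, not the paper's implementation.
import numpy as np

def color_pair(prompt_mask: np.ndarray, query_mask: np.ndarray,
               rng: np.random.Generator):
    """Color a (prompt, query) label-mask pair with one shared random color
    table, so the prompt alone defines what each color means for this sample."""
    labels = np.union1d(np.unique(prompt_mask), np.unique(query_mask))
    table = {int(l): rng.integers(0, 256, size=3, dtype=np.uint8)
             for l in labels if l != 0}
    table[0] = np.zeros(3, dtype=np.uint8)  # background stays black

    def paint(mask: np.ndarray) -> np.ndarray:
        out = np.zeros((*mask.shape, 3), dtype=np.uint8)
        for lab, color in table.items():
            out[mask == lab] = color
        return out

    return paint(prompt_mask), paint(query_mask)

# Each sample draws a new table, so the model cannot memorize fixed class colors.
rng = np.random.default_rng(0)
p, q = color_pair(np.array([[0, 1], [1, 2]]), np.array([[2, 2], [0, 1]]), rng)
```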
The SegGPT paper proposes several techniques to unlock and strengthen different kinds of segmentation capability, such as the context ensemble methods shown below; the Feature Ensemble method proposed in the paper supports any number of prompt examples.
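A rough sketch of the ensemble idea is shown below, assuming placeholder `encoder` and `decoder` callables and averaging the query features once before decoding; this is a simplification, not the paper's exact procedure.

```python
# Rough illustration of feature ensemble over multiple prompt examples.
# `encoder` and `decoder` are placeholder callables; this sketch simplifies
# how and where the real method fuses features.
from typing import Callable, Sequence
import torch

def ensemble_predict(encoder: Callable, decoder: Callable,
                     prompt_imgs: Sequence[torch.Tensor],
                     prompt_masks: Sequence[torch.Tensor],
                     query_img: torch.Tensor) -> torch.Tensor:
    feats = []
    for p_img, p_mask in zip(prompt_imgs, prompt_masks):
        # Encode each (prompt image, prompt mask, query image) triple separately.
        feats.append(encoder(p_img, p_mask, query_img))
    fused = torch.stack(feats, dim=0).mean(dim=0)  # average across prompts
    return decoder(fused)                          # one prediction for the query
```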
In addition, SegGPT supports optimizing a dedicated prompt for a specific scenario. For a targeted use case, SegGPT can learn the corresponding prompt through prompt tuning, with no need to update the model parameters.
For example, it can automatically learn a prompt for a given dataset, or a dedicated prompt for a particular room, as shown in the figure below.
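A minimal sketch of what prompt tuning could look like, under the assumption that the in-context prompt is represented as a learnable image/mask tensor pair while the model weights stay frozen; the `model` interface, loss, and loop below are generic placeholders, not the paper's exact recipe.

```python
# Minimal prompt-tuning sketch: the model is frozen and only the prompt
# tensors are optimized for a specific scenario. Placeholder interfaces.
from itertools import cycle, islice
import torch

def tune_prompt(model, dataloader, image_shape, steps=1000, lr=1e-2):
    model.eval()
    for p in model.parameters():
        p.requires_grad_(False)                  # freeze all model weights

    # The prompt itself becomes the only trainable "parameter".
    prompt_img = torch.zeros(1, 3, *image_shape, requires_grad=True)
    prompt_mask = torch.zeros(1, 3, *image_shape, requires_grad=True)
    opt = torch.optim.AdamW([prompt_img, prompt_mask], lr=lr)

    for query_img, target_mask in islice(cycle(dataloader), steps):
        pred = model(prompt_img, prompt_mask, query_img)
        loss = torch.nn.functional.mse_loss(pred, target_mask)
        opt.zero_grad()
        loss.backward()                          # gradients flow only into the prompt
        opt.step()

    return prompt_img.detach(), prompt_mask.detach()
```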
Conclusion: powerful zero-shot scene transfer, with top results on classic CV datasets
The model requires only a few prompt examples to achieve optimal performance on the COCO and PASCAL datasets.
SegGPT shows strong zero-shot scene transfer, for example reaching state-of-the-art performance on the few-shot semantic segmentation benchmark FSS-1000 without being trained on it.
Without any video training data, SegGPT can directly perform video object segmentation and achieve performance comparable to models specifically optimized for that task.
The following shows the effect of tuned prompts on semantic segmentation and instance segmentation tasks:
SegGPT was evaluated on a broad range of tasks, including few-shot semantic segmentation, video object segmentation, semantic segmentation, and panoptic segmentation. The results show a strong ability to segment both in-domain and out-of-domain targets, both qualitatively and quantitatively.
With the release of SAM and SegGPT, two foundation models for image segmentation, the dawn of a general-purpose visual GPT is in sight.