
CVPR 2023 | Zhejiang University & Nanyang Technological University propose PADing: a zero-shot universal segmentation framework

Author: Jiangmen Ventures

Paper Link:

https://arxiv.org/abs/2306.11087

Project Homepage:

https://henghuiding.github.io/PADing/

Code link:

https://github.com/heshuting555/PADing


Figure 1. Examples of zero-shot universal image segmentation


I. Research Motivation

Image segmentation aims to group and classify pixels with different semantics, such as categories or instances, and has progressed rapidly in recent years. However, because deep learning methods are data-driven, their strong demand for large-scale labeled training samples poses great challenges and consumes enormous time and labor. To address this, Zero-Shot Learning (ZSL) was proposed to classify new objects without training samples, and it has been extended to segmentation tasks such as Zero-Shot Semantic Segmentation (ZSS) and Zero-Shot Instance Segmentation (ZSI). Building on these, this paper further introduces Zero-Shot Panoptic Segmentation (ZSP) and aims to use semantic knowledge to construct a universal zero-shot panoptic/semantic/instance segmentation framework, as shown in Figure 1. Starting from the idea of generating better pseudo-features for unseen categories, the paper designs a unified model, PADing, that solves all three segmentation tasks while targeting two common problems in universal segmentation: the visual-linguistic gap and category bias. Quantitative experiments and qualitative visualizations show that PADing compares favorably with mainstream methods on both counts. The main contributions are the following four points:

  1. The universal zero-shot segmentation problem is studied, and a unified framework called Primitive generation with collaborative relationship Alignment and feature Disentanglement learning (PADing) is proposed to handle zero-shot semantic segmentation, instance segmentation, and panoptic segmentation.
  2. A primitive generator is proposed that uses a large number of learned primitives carrying fine-grained attributes to synthesize visual features for unseen categories, which helps address the bias problem and the cross-domain gap.
  3. A collaborative relationship alignment and feature disentanglement learning method is proposed to help the generator produce better synthetic features.
  4. The proposed PADing achieves new state-of-the-art performance on zero-shot panoptic segmentation (ZSP), zero-shot instance segmentation (ZSI), and zero-shot semantic segmentation (ZSS).

II. Methodology

2.1 Method Overview

The proposed method, PADing, is built on primitive generation with collaborative relationship alignment and feature disentanglement learning; its overall architecture is shown in Figure 2. First, the backbone predicts a set of class-agnostic masks and their corresponding class vectors. Next, a primitive generator is trained to synthesize class vectors from semantic vectors. Then, both real and synthetic class vectors are disentangled into semantic-related and semantic-unrelated features, and relationship alignment learning is performed on the semantic-related features. Finally, the classifier is retrained with real class vectors of seen classes together with synthetic class vectors of unseen classes.
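The final step of the overview, retraining the classifier on real seen-class vectors plus synthetic unseen-class vectors, can be sketched with toy data. The class names, dimensions, and the nearest-centroid classifier below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8  # toy feature dimension

# Real class vectors exist only for seen classes; the generator supplies
# synthetic vectors for the unseen class (here simulated as samples around
# a distinct mean). The classifier is then (re)trained on both.
real_seen = {c: rng.normal(loc=m, size=(20, d))
             for c, m in {"seen_a": -2.0, "seen_b": 0.0}.items()}
synthetic_unseen = {"unseen": rng.normal(loc=2.0, size=(20, d))}

train = {**real_seen, **synthetic_unseen}
centroids = {c: x.mean(axis=0) for c, x in train.items()}  # nearest-centroid classifier

def classify(vec):
    """Assign a feature vector to the class with the closest centroid."""
    return min(centroids, key=lambda c: np.linalg.norm(vec - centroids[c]))

# A query near the unseen-class mean is no longer forced into a seen class,
# which is exactly the bias problem the retraining step removes.
print(classify(np.full(d, 2.0)))
```

Without the synthetic unseen vectors, the same query could only ever be labeled with one of the seen classes.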


Figure 2. PADing framework structure

2.2 Primitive Cross-modal Generation

Because samples of unseen classes are unavailable, the classifier cannot be optimized with unseen-class features. A classifier trained only on seen-class features therefore tends to label every object as a seen class, which is known as the bias problem. Previous methods use generative models to synthesize fake visual features for unseen categories. Although they achieve good performance, they ignore the visual-semantic difference in feature granularity. Images usually carry richer information than language: visual information provides very fine-grained attributes of an object, while textual information typically provides abstract, high-level attributes. This mismatch leads to inconsistency between visual and semantic features. To address this, the paper proposes a primitive-based cross-modal generator that builds visual representations from a large number of learned attribute primitives. A set of learnable primitives is first initialized in the hope that they capture fine-grained information. Concretely, a Transformer takes both the semantic vector and the primitive set as input: the semantic vector first computes its similarity with the primitives, the primitives most related to it are selected, and Gaussian noise is added. Features composed of primitives are thus obtained, so that for any input semantic vector the corresponding visual vector can be generated. Finally, an MMD loss pulls the generated visual features toward the real ones. The primitives act as a bridge between language and vision, reducing the gap between the two domains.
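As a rough illustration of the primitive-assembly idea, the NumPy sketch below selects the primitives most similar to a semantic vector, combines them with Gaussian noise, and measures an RBF-kernel MMD between synthetic and real features. All dimensions, the top-k selection, and the kernel bandwidth are assumptions; the paper's actual generator is Transformer-based:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16  # feature dimension (illustrative)
P = 32  # number of learnable primitives
K = 4   # number of primitives selected per semantic vector

# Learnable primitive bank and one class's semantic embedding (e.g. word vector).
primitives = rng.normal(size=(P, D))
semantic_vec = rng.normal(size=(D,))

def synthesize_feature(sem, prims, k=K, noise_std=0.1):
    """Assemble a pseudo visual feature from the primitives most similar
    to the semantic vector, plus Gaussian noise for sample diversity."""
    sim = prims @ sem                  # similarity of each primitive to the class
    topk = np.argsort(sim)[-k:]        # indices of the k most related primitives
    weights = np.exp(sim[topk])
    weights /= weights.sum()           # softmax weighting over selected primitives
    feat = weights @ prims[topk]       # weighted combination of primitives
    return feat + rng.normal(scale=noise_std, size=feat.shape)

def mmd_rbf(x, y, gamma=0.5):
    """Squared MMD with an RBF kernel: pulls the distribution of synthetic
    features toward that of real class vectors."""
    def k(a, b):
        d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d)
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

fake = np.stack([synthesize_feature(semantic_vec, primitives) for _ in range(8)])
real = rng.normal(size=(8, D))
print(fake.shape, mmd_rbf(fake, real))
```

Minimizing the MMD term with respect to the primitives (here left fixed) is what drives the synthetic features toward the real feature distribution.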


Figure 3. Structure of primitive-based cross-modal generation

2.3 Semantic-Visual Relationship Alignment

Relationships between categories naturally differ. For example, given three objects, apples, oranges, and cows, the relationship between apples and oranges is clearly stronger than that between apples and cows. Such categorical relationships in semantic space are powerful prior knowledge, yet per-class feature generation does not explicitly exploit them: objects that are closely related in semantic space should also be similar in visual space and share similar distributions. However, common methods transfer the relationships of semantic space directly and forcefully into visual space. This does not use semantic relationships effectively, because the semantic and visual spaces are not aligned: visual features contain more information, while semantic features can be regarded as a condensation of that information. In other words, visual features carry extra, semantics-irrelevant information. This paper therefore disentangles visual features before aligning relationships. Each visual feature is split into a semantic-related part and a semantic-unrelated part, and only the semantic-related part is aligned with the semantic features. The semantic-unrelated features are encouraged to follow a normal distribution and to capture information without concrete semantics, while the semantic-related features must remain classifiable into their corresponding semantic categories.
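The disentangle-then-align idea can be sketched as follows: split each visual feature into a semantic-related and a semantic-unrelated part, build pairwise-similarity "relationship" matrices in both spaces, and penalize their mismatch. The linear split, cosine-similarity relation matrix, and all dimensions are illustrative assumptions, not the paper's losses:

```python
import numpy as np

rng = np.random.default_rng(1)
n_classes, d_vis, d_rel = 3, 16, 8  # toy sizes

# Class vectors in visual space and their embeddings in semantic space.
visual = rng.normal(size=(n_classes, d_vis))
semantic = rng.normal(size=(n_classes, d_rel))

# Disentanglement, sketched as a fixed linear projection: one part of each
# visual vector plays the "semantic-related" role, the rest "semantic-unrelated".
W_rel = np.eye(d_rel, d_vis)      # stand-in for a learned projection
related = visual @ W_rel.T        # semantic-related part, used for alignment
unrelated = visual[:, d_rel:]     # semantic-unrelated part (pushed toward N(0, I))

def relation_matrix(x):
    """Pairwise cosine similarities: the inter-class relationship structure."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T

# Alignment loss: make the inter-class relations of the semantic-related
# visual features match the relations of the semantic embeddings.
align_loss = ((relation_matrix(related) - relation_matrix(semantic)) ** 2).mean()
print(float(align_loss))
```

Aligning only the `related` part means the `unrelated` residue, which has no counterpart in semantic space, no longer distorts the transferred class relationships.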


Figure 4. Semantic-visual alignment diagram

III. Experiments

3.1 Quantitative Results

To verify the effectiveness of the proposed method, comparative experiments were carried out on COCO for panoptic, instance, and semantic segmentation, as shown in Tables 1, 2, and 3. The results show that PADing achieves state-of-the-art performance.


Table 1. Zero-shot panoptic segmentation results


Table 2. Zero-shot semantic segmentation results


Table 3. Zero-shot instance segmentation results

3.2 Qualitative Results

To explore whether the primitives can represent subtle, detailed elements, Figure 5 visualizes the attention responses of different primitives on an image. The results show that the primitives capture attributes at different granularities; for the cat in the figure, for instance, they focus on the ears, the tail, and the contours.


Figure 5. Attention responses of different primitives on an image

To study the properties of the synthesized unseen features and demonstrate the effectiveness of the proposed method, Figure 6 uses t-SNE to show their distributions. (a) Synthetic features generated by a GMMN generator are disorganized due to the semantic-visual gap. (b) With the proposed primitive generator, features of the same category become more compact and features of different categories are highly separable. (c) After further applying the relationship alignment constraint on the semantic-related features, features of different categories are pushed farther apart and the distribution is better structured, indicating that the structural relationships have been embedded into the synthesized features and that the synthetic unseen features are far more discriminative.


Figure 6. Distributions of unseen-class features produced by different generators

Figure 7 qualitatively visualizes zero-shot universal segmentation results, showing that the proposed method achieves good results.


Figure 7. Zero-shot universal segmentation (panoptic, instance, and semantic) visualization results

IV. Summary

To address the visual-linguistic gap and the class-bias problem in zero-shot universal segmentation, this paper proposes a unified framework, PADing, that combines primitive generation with collaborative relationship alignment and feature disentanglement learning, realizing efficient and practical zero-shot universal segmentation. First, a primitive generator is proposed to synthesize pseudo training features for unseen classes. Then, a collaborative feature disentanglement and relationship alignment learning strategy helps the generator produce better pseudo unseen features: the former disentangles visual features into semantic-related and semantic-unrelated parts, and the latter transfers cross-class knowledge from the semantic space to the visual space. Extensive experiments on the three zero-shot segmentation tasks (semantic, instance, and panoptic segmentation) show that PADing achieves state-of-the-art results.

Written by Henghui Ding. Source: WeChat public account [PaperWeekly]

Illustration by IconScout Store from IconScout

-The End-
