
Contrastive learning leads to a new SOTA in weakly supervised label learning: Zhejiang University's new research selected as an ICLR Oral

Heart of the Machine column

Author: Wang Haobo

This article introduces PiCO, the latest work from Zhejiang University, the University of Wisconsin-Madison, and other institutions. The paper has been accepted at ICLR 2022 (Oral, top 1.59%)!

Partial Label Learning (PLL) is a classic weakly supervised learning problem: it allows each training sample to be associated with a set of candidate labels, which suits the many real-world annotation scenarios where labels are uncertain. However, there is still a large gap between existing PLL algorithms and fully supervised methods.

To this end, this paper proposes a collaborative framework that addresses two key research challenges in PLL: representation learning and label disambiguation. Specifically, the proposed PiCO consists of a contrastive learning module and a novel class-prototype-based label disambiguation algorithm. PiCO produces tightly aligned representations for samples from the same class while facilitating label disambiguation. Theoretically, the researchers show that these two components mutually promote each other, and this can be rigorously justified from the perspective of the Expectation-Maximization (EM) algorithm. Extensive experiments show that PiCO significantly outperforms current state-of-the-art PLL methods and can even achieve results comparable to fully supervised learning.


Paper address: https://arxiv.org/pdf/2201.08984v2.pdf

Project homepage: https://github.com/hbzju/pico

Background

The rise of deep learning relies on large amounts of accurately labeled data, but in many scenarios the labeling itself carries substantial uncertainty. For example, most non-expert annotators cannot tell whether a dog is an Alaskan Malamute or a Siberian Husky. Such problems are called label ambiguity; they stem from the ambiguity of the samples themselves and from the annotators' limited knowledge, and they are very common in annotation scenarios that demand expertise. In such cases, accurate labeling often requires hiring experts with extensive domain knowledge. To reduce the labeling cost of these problems, this paper studies Partial Label Learning [1] (PLL), which allows each sample to be associated with a set of candidate labels that contains the true label.

In the PLL problem, the central issue is disambiguation, i.e., identifying the true label within the candidate set. To solve it, existing work typically assumes that the samples already have good representations and then performs label disambiguation based on the smoothness assumption that samples with close features are likely to share the same true label. However, this dependence on representations traps PLL methods in a representation-disambiguation dilemma: label uncertainty severely harms representation learning, while representation quality in turn limits label disambiguation. As a result, the performance of existing PLL methods is still far from that of fully supervised learning.

To this end, the researchers propose a collaborative framework, PiCO, which introduces contrastive learning (CL) to jointly address the two closely related problems of representation learning and label disambiguation. The main contributions of this article are as follows:

Methodology: This paper is the first to introduce contrastive learning into partial label learning and proposes a new framework called PiCO. As part of the algorithm, the researchers also introduce a novel prototype-based label disambiguation mechanism that effectively exploits the embeddings learned by contrastive learning.

Experiments: The proposed PiCO framework achieves SOTA results on multiple datasets. In addition, the researchers are the first to experiment on a fine-grained classification dataset, improving classification performance on CUB-200 by 9.61% over the best baseline.

Theory: The researchers prove that PiCO is equivalent to maximizing the likelihood via an Expectation-Maximization procedure. Their derivation also generalizes to other contrastive learning methods, showing that the alignment property in CL [2] is mathematically equivalent to the M-step of classical clustering algorithms.

[Figure: illustration of the PiCO framework]

Framework

In short, PiCO consists of two key components, one for representation learning and one for label disambiguation. The two components operate as a whole and feed back into each other. Later, the researchers further interpret PiCO from the EM perspective.

Classification Loss

Given a dataset in which each sample is associated with a set of candidate labels, the researchers maintain a pseudo-label vector for every sample. During training, this pseudo-label vector is continually updated, and the classifier is updated by minimizing a classification loss that uses the pseudo-label vector as a soft target.
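As a concrete illustration, below is a minimal PyTorch-style sketch of such a pseudo-label-weighted classification loss. The function and tensor names are our own, and the exact weighting scheme is an assumption based on the description above rather than the authors' released code.

```python
import torch.nn.functional as F

def classification_loss(logits, pseudo_labels):
    """Cross-entropy weighted by the current pseudo-label vector s.

    logits:        (batch, num_classes) raw classifier outputs
    pseudo_labels: (batch, num_classes) soft targets s, zero outside the
                   candidate set and summing to one per sample
    """
    log_probs = F.log_softmax(logits, dim=1)
    # Each candidate label contributes in proportion to its pseudo-label weight.
    return -(pseudo_labels * log_probs).sum(dim=1).mean()
```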

Contrastive Representation Learning For PLL

Inspired by supervised contrastive learning (SCL) [3], the researchers introduce a contrastive learning mechanism that learns similar representations for samples of the same class. The basic structure of PiCO is similar to MoCo [4]: it consists of two networks, a Query network and a Key network. Given a sample, the researchers first apply random data augmentation to obtain two augmented views, the query view and the key view. These are fed into the two networks respectively, yielding a pair of L2-normalized embeddings, the query embedding $\boldsymbol{q}$ and the key embedding $\boldsymbol{k}$.

In the implementation, the Query network shares the same convolutional backbone as the classifier, with an additional projection head on top. As in MoCo, the Key network is updated as a momentum (moving) average of the Query network. In addition, the researchers maintain a queue that stores key embeddings over time; the embeddings of the current batch together with this queue form the contrastive embedding pool $A(x)$. The contrastive loss for each sample is then computed according to the following formula:

$$\mathcal{L}_{\mathrm{cont}}(\boldsymbol{q};\tau,A(x))=-\frac{1}{|P(x)|}\sum_{\boldsymbol{k}_+\in P(x)}\log\frac{\exp\!\left(\boldsymbol{q}^{\top}\boldsymbol{k}_+/\tau\right)}{\sum_{\boldsymbol{k}'\in A(x)}\exp\!\left(\boldsymbol{q}^{\top}\boldsymbol{k}'/\tau\right)}$$

where $P(x)$ is the positive set in contrastive learning, $A(x)$ is the contrastive embedding pool, and $\tau$ is the temperature parameter.
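Below is a minimal sketch of how this per-sample contrastive loss could be computed given a binary mask marking each query's positive set within the embedding pool. All names and shapes are illustrative assumptions, not the official implementation.

```python
import torch

def contrastive_loss(q, pool, pos_mask, tau=0.07):
    """InfoNCE-style loss averaged over each sample's positive set P(x).

    q:        (batch, dim) L2-normalized query embeddings
    pool:     (pool_size, dim) L2-normalized embedding pool A(x)
    pos_mask: (batch, pool_size) 1 where a pool entry belongs to P(x), else 0
    tau:      temperature parameter
    """
    sim = q @ pool.t() / tau                                    # pairwise similarities
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)  # log-softmax over the pool
    # Average the log-probability of the positives; the clamp guards against
    # samples whose positive set happens to be empty.
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    return -(pos_mask * log_prob).sum(dim=1).div(pos_count).mean()
```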

Positive set selection. The key issue in the contrastive learning module is how to construct the positive set. In the PLL problem, however, the true labels are unknown, so same-class samples cannot be selected directly. The researchers therefore adopt a simple but effective strategy: use the labels predicted by the classifier, and build each sample's positive set from the embeddings in the pool whose predicted labels match its own.

To reduce the computational cost, the researchers also maintain a label queue that stores the predictions of previous batches. Although the strategy is simple, it yields very good experimental results and can be proven effective theoretically.
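The bookkeeping described above, i.e. a momentum-updated Key network, an embedding queue, and a label queue of classifier predictions used to build the positive set, might look roughly like the following sketch. The momentum coefficient and queue size are illustrative assumptions, and the positive mask produced here can be fed into the contrastive loss sketched earlier.

```python
import torch

@torch.no_grad()
def momentum_update(query_net, key_net, m=0.999):
    """Key network parameters follow an exponential moving average of the Query network."""
    for p_q, p_k in zip(query_net.parameters(), key_net.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

@torch.no_grad()
def update_queues(embed_queue, label_queue, keys, pred_labels, max_size=8192):
    """Prepend the newest key embeddings and predicted labels, dropping the oldest entries."""
    embed_queue = torch.cat([keys, embed_queue], dim=0)[:max_size]
    label_queue = torch.cat([pred_labels, label_queue], dim=0)[:max_size]
    return embed_queue, label_queue

def positive_mask(pred_labels, pool_labels):
    """Pool entries whose stored prediction matches the query's prediction form P(x)."""
    return (pred_labels.unsqueeze(1) == pool_labels.unsqueeze(0)).float()
```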

Prototype-based Label Disambiguation

It is important to note that contrastive learning still relies on accurate classifier predictions, so an effective label disambiguation strategy is needed to obtain reliable label estimates. To this end, the researchers propose a novel prototype-based label disambiguation strategy. Specifically, they maintain a prototype embedding vector for each class, which can be viewed as a representative embedding of that class.

Pseudo-label updates. During training, the researchers first initialize the pseudo-label vector $\boldsymbol{s}$ as a uniform vector. Then, based on the class prototypes, they update the pseudo-label vector with a moving-average strategy:

$$\boldsymbol{s}\leftarrow\phi\,\boldsymbol{s}+(1-\phi)\,\boldsymbol{z},\qquad z_c=\mathbb{1}\!\left[c=\arg\max_{j\in Y}\boldsymbol{q}^{\top}\boldsymbol{\mu}_j\right],$$

where $Y$ is the candidate set, $\boldsymbol{\mu}_j$ is the prototype of class $j$, and $\phi$ is the moving-average coefficient.

That is, the researchers pick the candidate label whose prototype is nearest to the query embedding and gradually move the pseudo-label vector $\boldsymbol{s}$ toward it. The reason for using a moving average is that the embeddings produced by the contrastive network are unreliable in the early stage of training; fitting the (nearly uniform) pseudo-targets at that point gives the classifier a good initialization. The moving-average strategy then smoothly shifts the pseudo-labels toward the correct targets, which keeps the training dynamics stable.
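A sketch of this moving-average pseudo-label update, restricted to each sample's candidate set, is given below; the variable names and exact masking are assumptions consistent with the description above.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_pseudo_labels(s, q, prototypes, candidate_mask, phi=0.99):
    """Move each pseudo-label vector toward its nearest-prototype candidate label.

    s:              (batch, num_classes) current pseudo-label vectors
    q:              (batch, dim) L2-normalized query embeddings
    prototypes:     (num_classes, dim) L2-normalized class prototypes
    candidate_mask: (batch, num_classes) 1 for candidate labels, 0 otherwise
    """
    scores = q @ prototypes.t()                                 # similarity to every prototype
    scores = scores.masked_fill(candidate_mask == 0, float("-inf"))
    z = F.one_hot(scores.argmax(dim=1), num_classes=s.size(1)).float()
    return phi * s + (1.0 - phi) * z                            # smooth moving-average update
```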

Prototype updates. A naive way to update the prototypes would be to recompute each class center in every iteration or epoch, but this incurs a heavy computational cost. The researchers therefore again use a moving-average technique to update the prototypes.

That is, whenever a sample is predicted as class $c$, the prototype of class $c$ takes a small step toward that sample's embedding.
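A matching sketch of the prototype moving-average update is shown below; whether the step is taken per sample or over a mini-batch mean is an assumption here, as is the coefficient.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_prototypes(prototypes, q, pred_labels, gamma=0.99):
    """For each sample, nudge the prototype of its predicted class toward its embedding."""
    for q_i, c in zip(q, pred_labels.tolist()):
        prototypes[c] = gamma * prototypes[c] + (1.0 - gamma) * q_i
    # Re-normalize so prototypes stay on the same unit hypersphere as the embeddings.
    return F.normalize(prototypes, dim=1)
```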

Insights. It is worth noting that these two seemingly separate modules actually work hand in hand. First, contrastive learning has a clustering effect in the embedding space, which the label disambiguation module can exploit to obtain more accurate class centers. Second, after label disambiguation, the labels predicted by the classifier become more accurate, which in turn improves the positive sets constructed by the contrastive learning module. When the two modules reach agreement, the whole training process converges. The researchers discuss the resemblance between PiCO and classical EM-style clustering more rigorously in the theory section below.

Experimental results

Main results

Before moving on to the theoretical analysis, let us first look at PiCO's experimental results. The first set of results is on CIFAR-10 and CIFAR-100, where the ambiguity level denotes the probability of each negative label entering the candidate set.

[Table: classification accuracy on CIFAR-10 and CIFAR-100 under different ambiguity levels]

As shown above, PiCO achieves excellent results, reaching SOTA on both datasets across different degrees of ambiguity. It is worth noting that previous work [5][6] only explored settings with a small label space, whereas the results on CIFAR-100 show that PiCO still performs very well even with a large label space. Finally, when the ambiguity level is relatively small, PiCO even achieves results close to full supervision!

Representation learning

In addition, the researchers visualize the representations learned by different methods. Uniform pseudo-labels lead to blurry representations, and the clusters learned by the PRODEN method overlap and cannot be fully separated. In contrast, the representations learned by PiCO are more compact and discriminative.

[Figure: visualization of the representations learned by different methods]

Ablation experiments

Finally, the researchers examine the effect of each module on the results: both the label disambiguation module and the contrastive learning module bring very significant performance improvements, and removing either of them causes performance to degrade. For more experimental results, please refer to the original paper.

[Table: ablation study of PiCO's components]

Theoretical analysis

Finally, the most exciting part! A natural question is: why does PiCO achieve such excellent results? In this section, the researchers theoretically analyze why the prototypes obtained through contrastive learning help disambiguate labels. They show that the alignment property of contrastive learning essentially minimizes the within-class covariance in the embedding space, which coincides with the objective of classical clustering algorithms. This motivates interpreting PiCO from the perspective of the Expectation-Maximization (EM) algorithm.

First, the researchers consider an idealized setup: in each training step, all data samples are accessible, and the augmented samples are also included in the training set. The contrastive loss can then be written as follows:

$$\mathcal{L}_{\mathrm{cont}}=\underbrace{-\frac{1}{|P(x)|}\sum_{\boldsymbol{k}_+\in P(x)}\boldsymbol{q}^{\top}\boldsymbol{k}_+/\tau}_{(a)\ \text{Alignment}}\;+\;\underbrace{\log\!\sum_{\boldsymbol{k}'\in A(x)}\exp\!\left(\boldsymbol{q}^{\top}\boldsymbol{k}'/\tau\right)}_{(b)\ \text{Uniformity}}\qquad\text{(summed over all samples }x\text{)}$$

The researchers focus on the first term (a), the alignment term [2]; the other term, uniformity, has been shown to favor information preservation. In this paper, the researchers relate the alignment term to classical clustering algorithms. They first partition the dataset into subsets in which all samples share the same predicted label; in fact, PiCO's positive set is constructed according to exactly the same partition. Therefore, the researchers have,

$$(a)\;\approx\;\frac{1}{\tau}\sum_{j}\sum_{\boldsymbol{q}\in S_j}\big\lVert\boldsymbol{q}-\boldsymbol{\mu}_j\big\rVert^2\;+\;R$$

where $R$ is a constant and $\boldsymbol{\mu}_j$ is the mean (center) of the subset $S_j$. The approximation replaces each sample's positive-set mean by the cluster mean, which is accurate because $|S_j|$ is usually large. For simplicity, some notation is omitted here. As you can see, alignment minimizes the within-class variance!
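To spell out why pulling embeddings toward their cluster mean is the same as shrinking the within-class variance, here is a short worked identity (our own restatement; it only uses the facts that the embeddings are unit-normalized and that $\boldsymbol{\mu}_j$ is the mean of $S_j$):

$$\sum_{\boldsymbol{q}\in S_j}\big\lVert\boldsymbol{q}-\boldsymbol{\mu}_j\big\rVert^2=\sum_{\boldsymbol{q}\in S_j}\lVert\boldsymbol{q}\rVert^2-|S_j|\,\lVert\boldsymbol{\mu}_j\rVert^2=|S_j|-\sum_{\boldsymbol{q}\in S_j}\boldsymbol{q}^{\top}\boldsymbol{\mu}_j.$$

Up to the constant $|S_j|$, maximizing the inner products $\boldsymbol{q}^{\top}\boldsymbol{\mu}_j$ (what the alignment term does) is therefore exactly minimizing the within-cluster squared deviations.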

At this point, the researchers can interpret the PiCO algorithm as an EM algorithm that optimizes a generative model. In the E-step, the classifier assigns each sample to a specific cluster. In the M-step, the contrastive loss concentrates the embeddings toward the centers of their clusters. Eventually, the training data are mapped to a mixture of von Mises-Fisher distributions on the unit hypersphere.
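For reference (standard background rather than content taken from the paper), the von Mises-Fisher density over unit vectors is

$$p(\boldsymbol{q}\mid\boldsymbol{\mu}_j,\kappa)=C_d(\kappa)\,\exp\!\left(\kappa\,\boldsymbol{\mu}_j^{\top}\boldsymbol{q}\right),$$

where $\boldsymbol{\mu}_j$ is the mean direction, $\kappa$ is the concentration parameter, and $C_d(\kappa)$ is a normalizing constant. Its log-likelihood grows with $\boldsymbol{\mu}_j^{\top}\boldsymbol{q}$, which is exactly the quantity the alignment term increases for samples assigned to cluster $j$.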

EM perspective. To write down the likelihood, the researchers additionally introduce an assumption that relates the candidate label set to the true label.

[Assumption relating the generation of the candidate label set to the true label; see the original paper for the formal statement]

Based on this assumption, the researchers prove that PiCO implicitly maximizes the likelihood, as follows.

E-step. First, the researchers introduce, for each sample, a distribution over its labels that is non-negative and sums to one. The goal is then to maximize the following likelihood:

[Derivation: the log-likelihood is rewritten using the introduced distributions and lower-bounded via Jensen's inequality]

The final step of the derivation uses Jensen's inequality. Since the logarithm is concave, equality holds when the ratio inside it is a constant. Therefore, the researchers have,

[Equation: the optimal choice of the introduced distribution is the posterior probability of the class assignment given the sample]

That is, the optimal distribution is the posterior class probability. In PiCO, the researchers estimate it using the classifier output.
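For readers less familiar with this step, the generic EM bound being invoked is the following (standard background; the paper's own notation is not reproduced here):

$$\log p(\boldsymbol{x}\mid\theta)=\log\sum_{z}Q(z)\,\frac{p(\boldsymbol{x},z\mid\theta)}{Q(z)}\;\ge\;\sum_{z}Q(z)\,\log\frac{p(\boldsymbol{x},z\mid\theta)}{Q(z)},$$

with equality exactly when $p(\boldsymbol{x},z\mid\theta)/Q(z)$ is constant in $z$, i.e., when $Q(z)$ equals the posterior $p(z\mid\boldsymbol{x},\theta)$.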

For this estimation, classical unsupervised clustering methods assign each sample directly to the nearest cluster center, as in k-means; in fully supervised learning, one can simply use the ground-truth label. In PLL, however, the supervision signal lies somewhere between the two. According to the researchers' experimental observations, the candidate labels are more reliable for posterior estimation at the beginning of training, while the prototypes from contrastive learning become more credible as training proceeds. This motivates updating the pseudo-labels with a moving average: the class-posterior estimates start from good initial information and are smoothly improved during training. Finally, since each sample corresponds to a unique true label, the researchers use one-hot predictions for this posterior.

M-step. At this step, the researchers assume the posterior class probabilities are known and maximize the likelihood. The following theorem shows that minimizing the contrastive loss also maximizes a lower bound of the likelihood:

[Theorem: minimizing the contrastive loss maximizes a lower bound of the likelihood; see the original paper for the formal statement]

The proof is given in the original paper. The lower bound becomes tighter as the norm of the mean vector approaches 1, which corresponds to a high intraclass concentration on the hypersphere. Intuitively, when the hypothesis space is rich enough, a small within-class covariance can be achieved in Euclidean space, which yields mean vectors with large norms; the normalized embeddings on the hypersphere then also exhibit strong intraclass concentration, since a large mean norm corresponds to a large concentration parameter κ [7]. The visualizations in the experimental results confirm that PiCO indeed learns compact clusters. The researchers therefore conclude that minimizing the contrastive loss also maximizes the likelihood.

Conclusion

In this paper, the researchers propose a novel partial label learning framework, PiCO. The key idea is to identify the true label within each candidate set by using the embedding prototypes from contrastive learning. Comprehensive experiments show that PiCO achieves SOTA results and, in some cases, performance close to fully supervised learning. The theoretical analysis shows that PiCO can be interpreted as an EM algorithm. The researchers hope this work draws more attention from the community to the broader use of contrastive learning techniques for partial label learning.

Lab Introduction

Welcome to the Data Intelligence Laboratory of Zhejiang University and the M3 Group led by researcher Zhao Junbo (no relation to BMW's sports car)! Under the leadership of Professor Chen Gang, Dean of the School of Computer Science, the laboratory has won the VLDB 2014/2019 Best Paper awards, has in recent years published extensively at top conferences and journals such as VLDB, ICLR, ICML, ACL, KDD, and WWW, and has received numerous national and provincial awards. Zhao Junbo is a researcher and doctoral supervisor under Zhejiang University's Hundred Talents Program; he studied under Yann LeCun, has 10,000+ Google Scholar citations, runs a Zhihu account with over ten thousand followers, and is a serial entrepreneur in the AI space.

Zhao Junbo's homepage: http://jakezhao.net/

References

1. In fact, PLL goes by other, more descriptive names: Ambiguous Label Learning and Superset Label Learning. This article follows the most commonly used name, partial label learning.

2. Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In ICML, volume 119 of Proceedings of Machine Learning Research, pp. 9929–9939. PMLR, 2020.

3. Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. In NeurIPS, 2020.

4. Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross B. Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, pp. 9726–9735. IEEE, 2020.

5. Jiaqi Lv, Miao Xu, Lei Feng, Gang Niu, Xin Geng, and Masashi Sugiyama. Progressive identification of true labels for partial-label learning. In ICML, volume 119 of Proceedings of Machine Learning Research, pp. 6500–6510. PMLR, 2020.

6. Lei Feng, Jiaqi Lv, Bo Han, Miao Xu, Gang Niu, Xin Geng, Bo An, and Masashi Sugiyama. Provably consistent partial-label learning. In NeurIPS, 2020b.

7. Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, and Suvrit Sra. Clustering on the unit hypersphere using von Mises-Fisher distributions. J. Mach. Learn. Res., 6:1345–1382, 2005.

Zhihu original text: https://zhuanlan.zhihu.com/p/463255610
