
A long-form review of the past and present of "class-incremental learning", with an open-source toolkit

Reprinted by Heart of the Machine

Author: Si Wei

With statistical machine learning gradually maturing, it is time to break with the traditional paradigm of isolated learning and study lifelong learning instead, pushing machine learning to new heights.

1. What is lifelong machine learning?

Lifelong machine learning, or lifelong learning, is an advanced machine learning paradigm that accumulates knowledge from past tasks through continuous learning and uses that knowledge to aid future learning. In this process, the learner's knowledge grows ever richer and its learning ever more efficient. This kind of learning ability is an important hallmark of human intelligence.

However, the current mainstream machine learning paradigm learns in isolation: given a training dataset, the algorithm produces a model directly from that set (searching the hypothesis space for an optimal or near-optimal hypothesis). It makes no attempt to retain what it has learned to make future learning more efficient. While this isolated paradigm has been hugely successful, it requires a large number of training examples and suits only well-defined, narrow tasks. In contrast, we humans can learn effectively from just a few examples because of the vast knowledge we have accumulated in the past. This accumulated prior knowledge lets us learn new things efficiently with little data or effort. Lifelong learning aims to equip machine learning models with the same capability.

With statistical machine learning gradually maturing, it is time to break with the traditional paradigm of isolated learning and study lifelong learning instead, pushing machine learning to new heights. Applications such as intelligent assistants, chatbots, and physical robots urgently need this lifelong learning capability. Without the ability to accumulate learned knowledge and use it to keep learning more, a system may never be truly intelligent. [1]

[Figure: lifelong learning]

In recent years, lifelong learning (LLL) has attracted much attention in the deep learning community, where it is often called continual learning. While deep neural networks (DNNs) achieve the best performance on many machine learning tasks, connectionist deep learning algorithms suffer from catastrophic forgetting, which makes the goal of continual learning very hard to reach. When a neural network learns a sequence of tasks, it may perform poorly on old tasks after learning new ones because of catastrophic forgetting. The human brain, by contrast, has the extraordinary ability to learn a great number of different tasks without harmful mutual interference. Continual learning algorithms try to give neural networks the same ability and to solve the catastrophic forgetting problem. In essence, then, continual learning performs incremental learning of new tasks.

Unlike many other lifelong learning techniques, however, current continual learning algorithms focus not on how to exploit knowledge learned in previous tasks to learn new tasks better, but on solving catastrophic forgetting.

2. What is catastrophic forgetting? [2]

Catastrophic forgetting refers to the phenomenon that, after a model learns new knowledge, it almost completely forgets what it learned in earlier training.

In short, the concern is that during continual learning the model is exposed to new and different data or tasks at each stage, while access to old-class data is lost or limited. In this setting, the performance of connectionist models, typified by neural networks, on old tasks drops dramatically.

For example, a traditional image classification model is trained on all the data at once (although popular optimizers work mini-batch by mini-batch, each epoch still passes over all the data, so every epoch effectively reviews everything). Under continual learning, however, tasks are trained sequentially, and while training on a new task the training data of old tasks is unavailable, or only partially available.

A simple example: Xiao Ming is a college student who must sit final exams in digital signal processing, machine learning, and modern history.

As everyone knows, being a college student, Xiao Ming's memory works in a peculiar way: he can cram any subject in two hours, but the moment he crams a new subject he forgets the old one.

On day one, Xiao Ming crammed digital signal processing and walked happily out of the exam room.

On day two, Xiao Ming crammed machine learning and again left the exam happily.

On day three, Xiao Ming crammed modern history, but the school discovered that the questions had been leaked and decided that all three exams would be retaken at the same time. So Xiao Ming happily and "accurately" wrote modern-history answers all over the digital signal processing and machine learning papers, leaving the graders utterly baffled.

Absurd as the story above is, similar problems arise all the time in current research. In classification, if we first train a model on samples of some predetermined categories and then fine-tune the network on samples of new categories, its ability to recognize the initial categories degrades dramatically; in reinforcement learning, training later tasks in isolation makes the agent's performance on earlier tasks drop sharply.

As shown in the figure below, after a neural network is trained on the new task (fish and tiger), the model misclassifies the dog from the old task as a fish.

[Figure: catastrophic forgetting]

3. What are the scenarios of continual learning?

Scenario 1: Task-IL

Task-incremental learning is the easiest continual learning scenario: in both the training and testing phases, the model is told the current task ID.

This property has given rise to methods with task-specific components. For example, PackNet [3] determines in advance a mask over convolutional filters for each task, while HAT trains the convolutional masks dynamically per task; given the task ID, the corresponding mask is selected to make the prediction.

[Figure: PackNet]

Scenario 2: Domain-IL

Compared with Task-IL, Domain-IL adds a restriction at test time: the task ID is not revealed during prediction. The model must classify the data correctly without knowing which task it came from.

Domain-IL typically handles problems where the label space stays the same but the input distribution changes, e.g., a tiger in an anime versus a tiger in reality (an Easter egg for the Year of the Tiger).

[Figure: domain 1, a tiger in the real world]

[Figure: domain 2, an animated tiger]

Scenario 3: Class-IL

In Class-IL, new classes keep arriving, and the model must classify each input into its correct class. This is the most demanding of the three scenarios: given an input, the model must implicitly determine which task it belongs to and then assign it to the right class within that task.

Example[4]

The illustration below shows a visual example in which the model is trained on tasks 1 through 5 in turn.

At prediction time (a toy label-mapping sketch follows this list):

Task-IL is told the task ID, and the model only decides between that task's first and second class. For example, told the task ID is 1, the model only needs to distinguish 0 from 1.

Domain-IL cannot obtain the task ID, but it must determine whether the input belongs to the set {0, 2, 4, 6, 8} or {1, 3, 5, 7, 9}.

Class-IL, on the other hand, must output the specific digit label, i.e., pick one of 0 through 9.
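To make the three settings concrete, here is a toy sketch of what the prediction target looks like in each scenario on split MNIST; the task split and helper name are illustrative, not from the original post:

```python
# Split MNIST into five two-digit tasks, matching the example above.
TASKS = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]

def prediction_target(label, scenario, task_id=None):
    """Return the target a model must produce for a ground-truth digit."""
    if scenario == "task":     # Task-IL: binary choice within the given task
        return TASKS[task_id].index(label)
    if scenario == "domain":   # Domain-IL: first vs. second digit of its task
        return label % 2
    if scenario == "class":    # Class-IL: the full 10-way decision
        return label
    raise ValueError(scenario)
```

For the digit 3, for instance, the target is 1 under Task-IL (the second class of task 1), 1 under Domain-IL, and 3 under Class-IL.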

[Figure: the three continual-learning scenarios on MNIST]

There is also a stricter Data-IL setting, in which even training-time task boundaries are not given, and the model is expected to adapt to such a non-stationary, imbalanced data stream. We do not discuss it here.

4. What is class-incremental learning?

A simple example

[Figure: an example of class-incremental learning]

The model is first trained on Task 1 to classify birds and jellyfish. Building on the current model, it must then learn goose and arctic fox in Task 2 and dog and crab in Task 3. After training on the tasks sequentially, the model is evaluated over all classes seen so far: a good class-incremental model should learn the new classes without forgetting the old ones.

Formal definitions

Class-incremental learning aims to continually learn new classes from a data stream. Suppose there is a sequence of $B$ training sets with mutually disjoint class sets, $\{\mathcal{D}^{1}, \mathcal{D}^{2}, \ldots, \mathcal{D}^{B}\}$, where $\mathcal{D}^{b}=\{(\mathbf{x}_{i}^{b}, y_{i}^{b})\}_{i=1}^{n_{b}}$ is the $b$-th incremental training set, also called a training task. Each $(\mathbf{x}_{i}^{b}, y_{i}^{b})$ is a training sample with label $y_{i}^{b} \in Y_{b}$, where $Y_{b}$ is the label space of the $b$-th task. Label spaces of different tasks do not overlap, i.e., $Y_{b} \cap Y_{b'}=\varnothing$ for $b \neq b'$. While learning the $b$-th task, the model can only be updated with the training set of the current stage. At each training stage, the goal is not only to learn the new classes in the current dataset but also to retain the knowledge of all previously learned classes, so we evaluate the model's incremental learning ability by its discriminative power over all known classes. Writing the model's output on a sample $\mathbf{x}$ as $f(\mathbf{x})$, the expected risk to be minimized is:

$$f^{*}=\underset{f \in \mathcal{H}}{\operatorname{argmin}}\ \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}^{1} \cup \cdots \cup \mathcal{D}^{b}}\left[\ell\left(f(\mathbf{x}), y\right)\right] \qquad (1)$$

where $\mathcal{D}^{b}$ denotes the sample distribution of task $b$, and $\ell(\cdot, \cdot)$ measures the discrepancy between the output and the label; in classification tasks, the cross-entropy loss is generally used. Since the model has to minimize the expected risk over all distributions seen so far simultaneously, a model satisfying Equation (1) learns the new classes without forgetting the knowledge of the old ones. Furthermore, a deep neural network can be decoupled into feature-extraction and linear-classifier layers, so the model consists of a feature extraction module and a linear classifier, i.e., $f(\mathbf{x})=W^{\top} \phi(\mathbf{x})$. For convenience of exposition, we further write the linear classifier as a concatenation of per-class classifiers: $W=[\mathbf{w}_{1}, \ldots, \mathbf{w}_{|\mathcal{Y}|}]$.

5. Interpretation of representative methods

Model decoupling

For the convenience of the subsequent explanation, we first decouple the neural network model.

The model consists of a feature extraction module and a linear classifier, i.e., $f(\mathbf{x})=W^{\top} \phi(\mathbf{x})$. For convenience, we further write the linear classifier as a concatenation of per-class classifiers: $W=[\mathbf{w}_{1}, \ldots, \mathbf{w}_{|\mathcal{Y}|}]$.

5.1 LwF: Learning without Forgetting[5]

Core Summary

LwF (Learning without Forgetting) is an early publication in the incremental learning field. The paper's core points include:

Besides LwF itself, it sets up three baselines, Fine-tuning, Feature Extraction, and Joint Training, and analyzes and compares the different methods experimentally.

It proposes using knowledge distillation to provide "soft" supervision for the old classes to mitigate catastrophic forgetting, and finds that this supervision can greatly improve old-class accuracy even without any old-class data.

It also runs basic comparisons over factors such as the coefficient and form of the parameter-drift regularizer and ways of expanding the model (though in the reported results the effect of these factors is not obvious).

Method comparison

[Figure: method comparison in Learning without Forgetting]

As shown in the figure:

(a) is the traditional multi-class model: it takes an image and outputs per-class probabilities through operators such as linear transforms, nonlinear activations, convolutions, and pooling.

(b) is Fine-tuning: when training on new classes, the old classifier is kept unchanged while the shared feature extractor and the new classifier weights are trained directly.

(c) is Feature Extraction: both the feature extractor and the old classifier weights stay fixed, and only the parameters for the new task are trained.

(d) is Joint Training, which trains on all the data from every task whenever a new task arrives.

(e) is LwF, which builds on Fine-tuning and supplies "soft" supervision for the old classes via knowledge distillation.

Knowledge Distillation[6]

Knowledge distillation was originally a model compression technique: a trained, more complex teacher model guides the training of a lighter student model, shrinking model size and compute while keeping as much of the teacher's accuracy as possible.

Its basic form is:

$$\mathcal{L}_{KD}=-\sum_{i} \frac{\exp \left(z_{i}^{\mathrm{tea}} / T\right)}{\sum_{j} \exp \left(z_{j}^{\mathrm{tea}} / T\right)} \log \frac{\exp \left(z_{i}^{\mathrm{stu}} / T\right)}{\sum_{j} \exp \left(z_{j}^{\mathrm{stu}} / T\right)}$$

where $z_i$ is the logit of class $i$ (superscripts distinguish the teacher and the student) and $T$ is the temperature coefficient. The distillation loss can be viewed as minimizing the KL divergence between the teacher's and the student's predictive distributions on the current dataset. Such supervision is smoother than one-hot labels and, to some extent, reflects the similarity relations among classes.

In LwF, extra memory is spent to keep a copy of the old model; while training the new model, the old model acts as the teacher for the old classes.
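As a concrete reference, here is a minimal PyTorch sketch of this soft-target distillation loss; the function name and defaults are illustrative rather than taken from LwF's code:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Soft-target cross-entropy between teacher and student logits.

    Both inputs have shape (batch, num_old_classes); T > 1 softens both
    distributions so inter-class similarity is preserved in the targets.
    """
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)
    # T^2 keeps gradient magnitudes comparable across temperatures (Hinton et al.)
    return -(soft_teacher * log_student).sum(dim=1).mean() * T ** 2
```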

Training process

For the training set of the new task, LwF's loss function includes:

New-class label supervision: the cross-entropy loss between the new-class logits and the corresponding labels.

Old-class knowledge distillation: a soft cross-entropy between the old and new models' logits on the old classes, with temperature $T > 1$ to strengthen the encoding of inter-class similarity.

A parameter-drift regularizer that penalizes how far the new model's parameters move away from the old model's.

The specific pseudocode is as follows:

[Figure: LwF pseudocode]
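In place of the pseudocode figure, here is a minimal sketch of one LwF loss evaluation, assuming new-task labels are indexed after the old classes; the parameter-drift regularizer is omitted and all names are illustrative:

```python
import torch
import torch.nn.functional as F

def lwf_loss(model, old_model, x, y, n_old, lambda_kd=1.0, T=2.0):
    """Cross-entropy on the new classes plus distillation on the old ones.

    `model` outputs logits over all old + new classes; `old_model` is the
    frozen copy saved before training on the new task.
    """
    logits = model(x)                                   # (batch, n_old + n_new)
    with torch.no_grad():
        old_logits = old_model(x)[:, :n_old]            # teacher targets
    ce = F.cross_entropy(logits[:, n_old:], y - n_old)  # new-class supervision
    # soft cross-entropy at temperature T on the old-class logits
    soft_teacher = F.softmax(old_logits / T, dim=1)
    log_student = F.log_softmax(logits[:, :n_old] / T, dim=1)
    kd = -(soft_teacher * log_student).sum(dim=1).mean() * T ** 2
    return ce + lambda_kd * kd
```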

5.2 iCaRL: Incremental Classifier and Representation Learning[7]

iCaRL can be seen as a cornerstone of much of the work on class-incremental learning. The paper's key contributions include:

It defines a protocol for class-incremental learning:

The model must be trained on streaming data in which new classes keep emerging.

At any stage, the model should be able to accurately classify all the categories currently seen.

The model's compute and storage consumption must have an upper bound or only grow slowly with the number of tasks.

It is among the first to make clear that retaining a portion of representative old-class data for later training stages greatly raises the achievable accuracy ceiling, and it proposes an effective exemplar selection strategy, herding: greedily pick samples so that the feature mean of the exemplar set stays closest to the overall class mean (a sketch follows the figure below).

[Figure: herding]
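A minimal sketch of herding, assuming the per-class features are L2-normalized as in iCaRL; names are illustrative:

```python
import torch

def herding_select(features, m):
    """Greedily pick m exemplar indices whose running feature mean stays
    closest to the class mean.

    `features`: (n, d) tensor of L2-normalized features of one class.
    """
    class_mean = features.mean(dim=0)
    selected, acc = [], torch.zeros_like(class_mean)
    for k in range(1, m + 1):
        # For unit-norm features, minimizing ||mean - (acc + x)/k|| over x
        # is equivalent to maximizing <x, k * mean - acc>.
        gains = features @ (k * class_mean - acc)
        gains[selected] = float("-inf")      # never pick the same sample twice
        i = int(gains.argmax())
        selected.append(i)
        acc += features[i]
    return selected
```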

The paper also proposes nearest-mean-of-exemplars classification using the retained old-class data, rather than directly using the linear classifier from the training phase: training with the cross-entropy loss on an imbalanced dataset easily produces a heavily biased classifier, and classifying with the model's extracted features largely alleviates the problem (a sketch follows the figure below).

[Figure: nearest-mean-of-exemplars classification]
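A minimal sketch of nearest-mean-of-exemplars prediction, where the class means are the normalized feature means of each class's exemplar set; names are illustrative:

```python
import torch
import torch.nn.functional as F

def nme_predict(feature_extractor, x, class_means):
    """Assign each input to the class whose exemplar mean is nearest.

    `class_means`: (num_classes, d) tensor of L2-normalized exemplar means.
    """
    feats = F.normalize(feature_extractor(x), dim=1)  # (batch, d)
    dists = torch.cdist(feats, class_means)           # (batch, num_classes)
    return dists.argmin(dim=1)                        # nearest mean wins
```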

When a new task arrives:

Combine the newly arrived classes' data with the exemplar set of retained old-class data to form the current round's training set.

Pass the model's output logits through a sigmoid to map them into (0, 1), and convert the target labels to one-hot vectors.

For the new classes, compute a binary cross-entropy loss over the new-class outputs only. This lets the model learn new samples without updating the weight vectors of the old linear classifier, reducing the impact of the imbalanced data stream on classification.

For the old classes, follow LwF and compute a binary cross-entropy between the new and old models' probability outputs on the old classes (a sketch of the combined loss follows the figure below).

[Figure: iCaRL]
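A minimal sketch of the combined loss just described, assuming labels are indexed consistently across stages; names are illustrative:

```python
import torch
import torch.nn.functional as F

def icarl_loss(model, old_model, x, y, n_old):
    """Binary cross-entropy with one-hot targets for the new classes and the
    frozen old model's sigmoid outputs as targets for the old classes."""
    logits = model(x)                                   # (batch, n_old + n_new)
    targets = F.one_hot(y, num_classes=logits.shape[1]).float()
    if old_model is not None and n_old > 0:
        with torch.no_grad():
            targets[:, :n_old] = torch.sigmoid(old_model(x)[:, :n_old])
    return F.binary_cross_entropy_with_logits(logits, targets)
```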

iCaRL had a profound influence on later methods: a considerable number of subsequent class-incremental approaches follow its paradigm of maintaining an exemplar set of representative old-class samples and using knowledge distillation to supervise the old classes.

5.3 BiC[8]

BiC largely follows iCaRL's training paradigm but keeps a linear classifier for the prediction phase. BiC argues that a significant cause of catastrophic forgetting in class-incremental learning is the classifier bias induced by the imbalanced training set, and gives an abstract account of how this bias arises. In the figure below, the blue dashed line is the unbiased distribution of all old-class features and the solid green line is the unbiased distribution of the new-class samples. Because only part of the old-class samples are kept while learning the new classes, the old-class feature distribution actually encountered during training may be a narrow, sharp one like the solid blue line, which shifts the learned classifier to the right of the unbiased classifier, so a large fraction of old-class samples get classified into the new classes.

[Figure: BiC]

Following this line of thought, BiC adds a Bias Correction phase that applies a linear rescaling and shift to the new-class outputs until the solid line overlaps the dashed one, yielding an unbiased classifier. Specifically:

After the main stage is trained in an iCaRL-like fashion, two parameters, which control the scaling and the shifting of the outputs respectively, are trained on a small held-out set balanced between old and new classes, i.e.:

$$q_{k}= \begin{cases} o_{k} & 1 \leq k \leq n \quad (\text{old classes}) \\ \alpha\, o_{k}+\beta & n<k \leq n+m \quad (\text{new classes}) \end{cases}$$

That is, the outputs $o_k$ of the new classes are multiplied by $\alpha$ and shifted by $\beta$, with both parameters obtained in the Bias Correction training phase; the old-class outputs are left untouched.
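A minimal sketch of the correction layer; only alpha and beta are trainable in the Bias Correction phase, and names are illustrative:

```python
import torch

class BiasLayer(torch.nn.Module):
    """Rescale and shift only the newest task's logits with two scalars."""

    def __init__(self):
        super().__init__()
        self.alpha = torch.nn.Parameter(torch.ones(1))
        self.beta = torch.nn.Parameter(torch.zeros(1))

    def forward(self, logits, n_old):
        corrected = logits.clone()
        corrected[:, n_old:] = self.alpha * logits[:, n_old:] + self.beta
        return corrected
```

In the second phase one would freeze the backbone and classifier and fit alpha and beta with an ordinary cross-entropy loss on the held-out balanced set.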

5.4 WA[9]

WA points out two main reasons why fine-tuning the model directly on new data degrades its performance:

There are too few old-class samples for training, so the model cannot maintain its ability to discriminate among the old classes.

Old-class samples are vastly outnumbered by new-class samples, inducing a severe classification bias: whether it sees an old-class or a new-class sample, the model outputs a large probability for the new classes.

[Figure: problem analysis]

WA therefore splits the problem into two goals:

Maintain the relative relations among the old classes, i.e., Maintaining Discrimination.

Address the fairness between old and new classes, i.e., align the classifier's preference for old versus new classes: Maintaining Fairness.

[Figure: the two goals of WA]

For the first goal, iCaRL-style knowledge distillation already maintains discrimination among the old classes well. For the second, WA observes that after training on an imbalanced set, the linear classifiers of the new classes tend to carry larger weights than those of the old classes: comparing the L2 norms of the classifier weights shows that the new classes' norms are significantly larger.

[Figure: comparison of classifier weight norms]

What WA does is align these L2 norms: after each training stage, the new-class classifier weights are rescaled as

$$\hat{W}_{\text{new}}=\gamma\, W_{\text{new}}$$

where:

$$\gamma=\frac{\operatorname{Mean}\left(\left\{\left\|\mathbf{w}_{i}\right\|\right\}_{i \in \text{old}}\right)}{\operatorname{Mean}\left(\left\{\left\|\mathbf{w}_{j}\right\|\right\}_{j \in \text{new}}\right)}$$
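A minimal sketch of this rescaling applied to the final linear layer after each training stage; names are illustrative:

```python
import torch

@torch.no_grad()
def align_classifier_weights(fc, n_old):
    """Rescale the new-class rows of `fc.weight` so their mean L2 norm
    matches that of the old-class rows (Weight Aligning)."""
    w = fc.weight                                     # (num_classes, d)
    gamma = w[:n_old].norm(dim=1).mean() / w[n_old:].norm(dim=1).mean()
    w[n_old:] *= gamma                                # close the norm gap
    return gamma.item()
```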

The final experiments show that this simple strategy yields a very significant performance improvement.

[Figure: experimental results of WA]

It is worth pointing out that this solution is not fully watertight, because the magnitude of a classifier's weight is not always positively correlated with the magnitude of its output logit. If a feature points in the same direction as a large classifier weight, the logit indeed grows with the weight norm; but if a sample's feature points against the classifier's direction, their inner product is a negative value, and a larger weight only makes it more negative, i.e., the correlation is negative. So large classifier weights do not always mean large logits. Why, then, does the scheme calibrate old and new classes so well? The authors offer an intuitive explanation: model architectures usually employ non-negative activation functions such as ReLU, so the model's feature outputs, and consequently most classifier weights, are largely non-negative; the angle between the weight vector and the feature vector is therefore acute in most cases, their inner product is positive, and the positive correlation holds.

5.5 DER[10]

Methods based on dynamically expanding the feature representation have been widely used in Task-IL; DER is the first attempt to bring dynamic feature expansion to the Class-IL setting, and it achieves excellent performance.

DER explains that traditional approaches can fall into a stability-plasticity dilemma: with a single-backbone model, imposing no constraints leaves it plastic enough but greatly degrades performance on old classes, while imposing too many constraints deprives it of the plasticity needed to learn new tasks. DER strikes a better stability-plasticity trade-off than traditional methods: it keeps and freezes the old feature extractor to preserve old knowledge, while creating a new trainable feature extractor so the model stays plastic enough to learn the new task.

[Figure: DER]

Specifically, when a new task arrives:

DER freezes the original feature extractor and creates a new one, concatenating the two sets of features to form the overall feature extractor.

The extracted features are fed into a newly created classifier, and the cross-entropy loss against the targets is computed.

To extract better features, DER additionally uses an auxiliary classifier that sees only the new features, forcing the new feature space to discriminate well among the new classes; all old-class samples are mapped by the auxiliary classifier to one and the same label (a sketch of the expandable network follows).
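A minimal sketch of the expandable network, omitting the auxiliary classifier and the pruning step; names are illustrative:

```python
import torch

class DERNet(torch.nn.Module):
    """Dynamically expandable representation, a minimal sketch.

    Frozen backbones from previous stages are kept; a new trainable backbone
    is added per task, and all features are concatenated before a re-created
    unified classifier.
    """
    def __init__(self, old_extractors, new_extractor, feat_dim, num_classes):
        super().__init__()
        self.old_extractors = torch.nn.ModuleList(old_extractors)
        for p in self.old_extractors.parameters():
            p.requires_grad_(False)              # freeze old knowledge
        self.new_extractor = new_extractor       # plasticity for the new task
        total_dim = feat_dim * (len(old_extractors) + 1)
        self.classifier = torch.nn.Linear(total_dim, num_classes)

    def forward(self, x):
        feats = [e(x) for e in self.old_extractors] + [self.new_extractor(x)]
        return self.classifier(torch.cat(feats, dim=1))
```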

DER also designs a pruning method that sharply reduces parameters while preserving model performance as much as possible. The strategy is borrowed from HAT [11], a classic Task-IL method, converting HAT's masks over filter weights into masks over entire channels.

The loss function for the final model training is:

$$\mathcal{L}=\mathcal{L}_{CE}+\lambda_{a}\, \mathcal{L}_{aux}+\lambda_{s}\, \mathcal{L}_{sparse}$$

The three terms are, in order: the overall cross-entropy loss, the auxiliary classifier's cross-entropy loss, and the sparsity loss associated with the pruning strategy.

5.6 COIL[12]

Classic learning systems are often deployed in closed environments, where the model fits a fixed set of classes using a pre-collected dataset. In an open, dynamic environment this assumption rarely holds: new classes appear over time, and the model must keep learning them from the data stream. For example, e-commerce platforms add new products every day, and new trending topics keep emerging on social media. A class-incremental model therefore needs to learn the new classes without forgetting the characteristics of the old ones. COIL observes that the correlations between new and old classes during incremental learning can further assist the model at different stages. It therefore proposes co-transport to aid class-incremental learning, linking different incremental stages through the semantic correlations between classes. Co-transport has two parts: prospective transport, which uses optimal transport to map the knowledge in the old classifiers into initializations for the new-class classifiers; and retrospective transport, which transports the new-class classifiers back into old-class classifiers to resist catastrophic forgetting. Knowledge thus flows in both directions during incremental learning, maintaining discrimination over the old classes while learning the new.

[Figure: feature-level illustration of COIL]

As shown in the figure above, COIL migrates classifiers according to semantic relations between classes. For example, tigers and cats are very similar, so the features that discriminate them largely overlap, and much of the tiger classifier's weight can be reused to initialize the cat classifier; tigers and zebras are dissimilar, so their discriminative features cannot be reused. COIL measures the similarity of class centers in a shared embedding space and builds an inter-class distance matrix. Then, treating the distances as transport costs, an optimal transport algorithm minimizes the classifier-reuse cost between all new classes and the set of old classes, guiding classifier reuse with the semantic relations between classes. Finally, as shown below, the old classifiers are reused as new-class classifiers and the new classifiers as old-class classifiers, constructing knowledge transfer in the two directions, with loss functions designed to constrain the model and prevent catastrophic forgetting. A rough sketch of the prospective direction follows the figure below.

[Figure: overview of the COIL method]
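As a rough sketch of the prospective direction only: build a cost matrix from class-center distances, solve an entropy-regularized optimal transport problem with a basic Sinkhorn iteration, and initialize the new classifiers as transport-weighted combinations of the old ones. This is a simplification of COIL's actual procedure, and all names are illustrative:

```python
import torch

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularized optimal transport with uniform marginals."""
    n, m = cost.shape
    K = torch.exp(-cost / eps)                    # Gibbs kernel
    u = torch.full((n,), 1.0 / n)                 # row marginal
    v = torch.full((m,), 1.0 / m)                 # column marginal
    a, b = torch.ones(n), torch.ones(m)
    for _ in range(n_iters):
        a = u / (K @ b)
        b = v / (K.t() @ a)
    return a.unsqueeze(1) * K * b.unsqueeze(0)    # transport plan, (n, m)

def prospective_init(old_weights, old_means, new_means, eps=0.1):
    """Initialize new-class classifiers from old ones via the transport plan.

    `old_weights`: (n_old, d) classifier rows; `old_means`/`new_means`:
    (n_old, d) and (n_new, d) class centers in a shared embedding space.
    """
    cost = torch.cdist(old_means, new_means)      # semantic distances as cost
    plan = sinkhorn(cost, eps)
    plan = plan / plan.sum(dim=0, keepdim=True)   # each new class: convex combo
    return plan.t() @ old_weights                 # (n_new, d) initializations
```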

Classification boundary visualization:

[Figure: COIL classification boundaries]

6. PyCIL: A Python Toolbox for Class-Incremental Learning

We have open-sourced a PyTorch-based Class-IL framework: PyCIL.

It contains not only early foundational methods such as EWC and iCaRL but also several of today's state-of-the-art Class-IL algorithms, and we hope it helps researchers who want to understand and study the field.

Project Address: PyCIL[13]
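As a usage note: at the time of writing, the project README describes choosing or editing a JSON experiment config under `exps/` and launching training with a command of the form `python main.py --config=./exps/[MODEL NAME].json`; this is an assumption based on the README, so please check the repository for the current interface.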

Methods Reproduced

FineTune: Baseline method which simply updates parameters on the new task and suffers from catastrophic forgetting. By default, weights corresponding to the outputs of previous classes are not updated.

EWC: Overcoming catastrophic forgetting in neural networks. PNAS2017 [paper]

LwF: Learning without Forgetting. ECCV2016 [paper]

Replay: Baseline method with exemplars.

GEM: Gradient Episodic Memory for Continual Learning. NIPS2017 [paper]

iCaRL: Incremental Classifier and Representation Learning. CVPR2017 [paper]

BiC: Large Scale Incremental Learning. CVPR2019 [paper]

WA: Maintaining Discrimination and Fairness in Class Incremental Learning. CVPR2020 [paper]

PODNet: Pooled Outputs Distillation for Small-Tasks Incremental Learning. ECCV2020 [paper]

DER: Dynamically Expandable Representation for Class Incremental Learning. CVPR2021 [paper]

COIL: Co-Transport for Class-Incremental Learning. ACM MM2021 [paper]

Selected experimental results

[Figure: experimental results (1)]

[Figure: experimental results (2)]

References

[1] Zhiyuan Chen, Bing Liu, Ronald Brachman, Peter Stone, Francesca Rossi. Lifelong Machine Learning, Second Edition. Morgan & Claypool, 2018. https://ieeexplore.ieee.org/document/8438617

[2] Catastrophic Forgetting in Connectionist Networks. https://www.sciencedirect.com/science/article/pii/S1364661399012942

[3] PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning. https://arxiv.org/abs/1711.05769

[4] Three Scenarios for Continual Learning. https://arxiv.org/abs/1904.07734

[5] Learning without Forgetting. https://arxiv.org/abs/1606.09282

[6] Distilling the Knowledge in a Neural Network. https://arxiv.org/abs/1503.02531

[7] iCaRL: Incremental Classifier and Representation Learning. https://arxiv.org/abs/1611.07725

[8] Large Scale Incremental Learning. https://arxiv.org/abs/1905.13260

[9] Maintaining Discrimination and Fairness in Class Incremental Learning. https://arxiv.org/abs/1911.07053

[10] DER: Dynamically Expandable Representation for Class Incremental Learning. https://arxiv.org/abs/2103.16788

[11] Overcoming Catastrophic Forgetting with Hard Attention to the Task (HAT). https://arxiv.org/abs/1801.01423

[12] Co-Transport for Class-Incremental Learning. https://arxiv.org/abs/2107.12654

[13] PyCIL: A Python Toolbox for Class-Incremental Learning. https://arxiv.org/abs/2112.12533
