The first visual framework for rapid knowledge distillation: ResNet50 80.1% accuracy, 30% faster training
Paper and project page: http://zhiqiangshen.com/projects/FKD/index.html
Code: https://github.com/szq0214/FKD
Today we introduce a fast knowledge distillation paper from ECCV, by Carnegie Mellon University and other institutions. With only a basic training recipe, it trains ResNet-50 from scratch on ImageNet-1K to 80.1% accuracy (without data augmentation such as mixup or cutmix), while training more than 16% faster than a conventional classification framework (especially in data-reading overhead) and more than 30% faster than the previous SOTA algorithm. It is one of the best knowledge distillation strategies in terms of both accuracy and speed, and the code and models are fully open-sourced.
Reposted from "The Heart of the Machine"
Since knowledge distillation (KD) was proposed by Geoffrey Hinton et al. in 2015, it has had a huge impact on model compression, visual classification, detection, and other fields, and has spawned countless variants and extensions, which can be broadly divided into vanilla KD, online KD, teacher-free KD, and so on. Recent studies have shown that the simplest, most naive knowledge distillation strategy can already bring large performance gains, with accuracy even higher than many complex KD algorithms. However, vanilla KD has an unavoidable drawback: every iteration must forward the training samples through the teacher to produce soft labels, so a large part of the computation is spent traversing the teacher model, whose size is usually much larger than the student's and whose weights are fixed during training. As a result, the learning efficiency of the whole distillation framework is very low.

To solve this problem, the paper first analyzes why it is not possible to simply generate one soft label vector per image and reuse it across iterations. The fundamental reason is the data augmentation used when training vision models, especially random-resized cropping: even if the inputs in different iterations come from the same image, they may come from different regions, so a single soft label vector cannot match the sample well across iterations.

Based on this, the paper proposes a fast knowledge distillation (FKD) design: the required augmentation parameters are encoded in a specific way, the corresponding soft labels are stored and reused, and the target network is trained with a strategy that assigns region coordinates. With this strategy, the whole training process is explicitly teacher-free, and it is both fast (16%/30% training speedup, especially friendly to clusters where data reading is slow) and accurate (ResNet-50 reaches 80.1% on ImageNet-1K without extra data augmentation). First, let's review how a standard knowledge distillation pipeline works, as shown in the following diagram:
[Figure: the standard (vanilla) knowledge distillation framework]
The knowledge distillation framework consists of a pre-trained teacher model (whose weights are fixed during distillation) and a student model to be learned; the teacher generates soft labels to supervise the student. This framework has an obvious drawback: when the teacher is larger than the student, the forward pass of the training images through the teacher costs more computation than the student itself, yet the teacher's weights are not what we are learning, so this computation is essentially "wasted". The motivation of this paper is to study how to avoid or reuse this extra computation during knowledge distillation training. The proposed strategy is to pre-compute the regional soft labels for different regions of each image and save them to disk in advance, then read the training images and label files together during training, thereby reusing the labels. The question then becomes: how should the soft labels be organized and stored most effectively? Let's take a closer look at the strategy proposed in the paper.
1. Introduction to the FKD algorithm framework
The core of the FKD framework consists of two stages, as shown in the figure below: (1) soft label generation and storage; (2) model training with the stored soft labels.
[Figure: the two stages of the FKD framework: soft label generation/storage and student training]
As shown in the figure, the first half depicts the soft label generation process: the authors feed multiple crops of each image into the pre-trained teacher to generate the required soft label vectors, and at the same time save (1) the coordinates of each crop and (2) a Boolean flag indicating whether it was flipped. The second half depicts student training: when an image is randomly sampled, its corresponding soft label file is read as well, and N crops are selected from it for training. Additional data augmentations such as mixup and cutmix are applied at this stage, which avoids the extra storage overhead of recording more augmentation parameters.
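To make the first stage concrete, here is a minimal PyTorch-style sketch of soft label generation under the description above. The function name, the per-image record layout, and the omission of input normalization are assumptions of the sketch, not the authors' released implementation (see the linked repository for that):

```python
import torch
import torch.nn.functional as F
from torchvision import transforms
from torchvision.transforms import functional as TF

def generate_soft_labels(teacher, image, num_crops=200, out_size=224):
    """Stage 1 (sketch): pass `num_crops` random-resized crops of one image
    through the frozen teacher and record, for each crop, its coordinates,
    a horizontal-flip flag, and the teacher's softmax output."""
    teacher.eval()
    rrc = transforms.RandomResizedCrop(out_size)
    records = []
    with torch.no_grad():
        for _ in range(num_crops):
            # Sample crop coordinates (top, left, height, width) and a flip flag.
            i, j, h, w = rrc.get_params(image, rrc.scale, rrc.ratio)
            flip = torch.rand(1).item() < 0.5
            crop = TF.resized_crop(image, i, j, h, w, [out_size, out_size])
            if flip:
                crop = TF.hflip(crop)
            x = TF.to_tensor(crop).unsqueeze(0)  # normalization omitted for brevity
            soft_label = F.softmax(teacher(x), dim=1).squeeze(0)
            records.append({"coords": (i, j, h, w),
                            "flip": flip,
                            "soft_label": soft_label.cpu()})
    return records  # saved to disk per image, e.g. with torch.save, for later reuse
```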
2. Sampling strategy
This paper also proposes a multi-crop sampling strategy: several crops are sampled from each image in a mini-batch. With the total number of training epochs unchanged, this greatly reduces the number of image reads, and the speedup is very noticeable on cluster setups where data reading is inefficient or a serious bottleneck (as shown in the table below). The authors found that a moderate number of crops significantly improves model accuracy, but too many crops from one image reduce the information diversity within each mini-batch (the crops are too similar), so oversampling hurts performance and a reasonable value needs to be chosen. A sketch of the corresponding data loading follows.
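As a rough illustration of the second stage, the dataset below returns several crops per image together with their stored soft labels, so each mini-batch requires far fewer image reads. The class name and the per-image label-file layout are assumptions carried over from the previous sketch, not the released API:

```python
import torch
from torch.utils.data import Dataset
from torchvision.transforms import functional as TF

class FKDCropDataset(Dataset):
    """Stage 2 (sketch): each __getitem__ reads one image plus its saved label
    file and returns `crops_per_image` (crop, soft_label) pairs, reconstructed
    from the stored coordinates and flip flags."""

    def __init__(self, images, label_files, crops_per_image=4, out_size=224):
        self.images = images            # list of PIL images (or an image loader)
        self.label_files = label_files  # per-image files written in stage 1
        self.n = crops_per_image
        self.out_size = out_size

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = self.images[idx]
        records = torch.load(self.label_files[idx])              # list of dicts
        picks = torch.randperm(len(records))[: self.n].tolist()  # sample n crops
        crops, labels = [], []
        for k in picks:
            r = records[k]
            i, j, h, w = r["coords"]
            crop = TF.resized_crop(image, i, j, h, w, [self.out_size, self.out_size])
            if r["flip"]:
                crop = TF.hflip(crop)
            crops.append(TF.to_tensor(crop))
            labels.append(r["soft_label"])
        # The n crops share a single image read but keep individually matched labels.
        return torch.stack(crops), torch.stack(labels)
```

In the training loop the crop dimension would be folded into the batch dimension, and any extra augmentation such as mixup or cutmix would be applied at that point, as described above.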
3. Acceleration ratio
In the experiments, the authors compare training speed with the standard training pipeline and with ReLabel; the results are shown in the table below. FKD is about 16% faster than a normal classification framework and 30% faster than ReLabel, because ReLabel has to read twice as many files as normal training. Note that in this speed comparison the number of FKD crops is 4; choosing a larger number of crops yields an even higher speedup.
[Table: training speed comparison of standard training, ReLabel, and FKD]
Analysis of the speedup: besides the multi-crop sampling introduced above, the authors analyze other sources of acceleration, as shown in the figure below. ReLabel has to generate the coordinates of the sampled regions during training and use RoI-Align and a softmax to produce the required soft labels. In contrast, FKD directly stores the coordinate information and the soft labels in their final form, so after reading the label file it can train without any extra post-processing, which also makes it faster than ReLabel.
[Figure: label post-processing required by ReLabel vs. FKD during training]
4. Label quality analysis
Soft label quality is one of the most important factors for model training accuracy. The authors demonstrate that the proposed method produces better soft labels by visualizing the label distributions and by computing the cross-entropy between the predictions of different models.
[Figure: soft label distributions of ReLabel vs. FKD]
The figure above compares the soft label distributions of FKD and ReLabel, leading to the following observations:
(First row) ReLabel's soft labels are closer to a single semantic (one-hot) label than FKD's. The authors attribute this to ReLabel feeding the whole image into the model rather than local regions, so the generated global label map encodes more global category information while ignoring contextual information.
(Second row) Although for some samples the maximum predicted probabilities of ReLabel and FKD are similar, FKD's label distribution places more probability mass on subordinate classes, information that is not captured in ReLabel's distribution.
FKD is more robust than ReLabel in some unusual cases, such as a loosely bounded target box or a crop containing only part of the object.
In some cases, ReLabel's label distribution unexpectedly collapses to a near-uniform distribution and fails to produce a dominant prediction, whereas FKD still predicts well.
5. Label compression and quantization strategies
1) Hardening. In this strategy, the quantized label Y_H is the index of the teacher's maximum logit. Label hardening still produces a one-hot label, as in the following formula:
[Equation: label hardening]
2) Smoothing. The smoothing quantization strategy replaces the hardened label Y_H with a piecewise combination of the soft label and a uniform distribution, as follows:
[Equation: label smoothing quantization]
3) Marginal Smoothing with Top-K. Compared with keeping a single prediction, the marginal smoothing quantization strategy retains more marginal information (the Top-K predictions) to smooth the label Y_S:
[Equation: marginal smoothing with Top-K]
4) Marginal Re-Norm with Top-K. The marginal re-normalization strategy re-normalizes the Top-K predictions so that they sum to 1 and sets all other elements to zero (FKD uses this normalization to calibrate the Top-K sum because the soft labels it stores are already post-softmax values):
[Equation: marginal re-normalization with Top-K]
The following diagram illustrates the corresponding quantization strategies:
[Figure: illustration of the four label quantization strategies]
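To make the four strategies concrete, here is a minimal sketch that follows the verbal descriptions above rather than the paper's exact formulas or released code; the function names and the uniform redistribution of the leftover probability mass are assumptions of the sketch:

```python
import torch

def harden(p):
    """1) Hardening: keep only the index of the teacher's top prediction (one-hot)."""
    y = torch.zeros_like(p)
    y[p.argmax()] = 1.0
    return y

def smooth(p):
    """2) Smoothing: keep the top-1 probability, spread the remainder uniformly."""
    c, top = p.numel(), p.argmax()
    y = torch.full_like(p, (1.0 - p[top].item()) / (c - 1))
    y[top] = p[top]
    return y

def marginal_smooth_topk(p, k=5):
    """3) Marginal smoothing: keep the Top-K probabilities, spread the rest uniformly."""
    c = p.numel()
    vals, idx = p.topk(k)
    y = torch.full_like(p, (1.0 - vals.sum().item()) / (c - k))
    y[idx] = vals
    return y

def marginal_renorm_topk(p, k=5):
    """4) Marginal re-norm: keep the Top-K, re-normalize them to sum to 1, zero the rest."""
    vals, idx = p.topk(k)
    y = torch.zeros_like(p)
    y[idx] = vals / vals.sum()
    return y

# p: a stored teacher prediction (already post-softmax) over 1000 classes.
p = torch.softmax(torch.randn(1000), dim=0)
for fn in (harden, smooth, marginal_smooth_topk, marginal_renorm_topk):
    print(fn.__name__, fn(p).sum().item())  # each quantized label still sums to 1
```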
6. Comparison of storage sizes for different label quantization/compression strategies
The storage space required by different label compression methods is shown in the table below for ImageNet-1K, where M is the number of crops sampled per image in the soft label generation stage (the authors use 200 as an example), N_im is the number of images (1.2M for ImageNet-1K), S_LM is the size of ReLabel's label map, C_class is the number of classes, and D_DA is the dimension of the data-augmentation parameters that need to be stored.
[Table: storage required by different label compression strategies on ImageNet-1K]
As the table shows, the storage required for uncompressed FKD soft labels is about 0.9T, which is clearly impractical: the label data would be far larger than the training data itself. Label compression greatly reduces the storage size, and subsequent experiments show that an appropriate compression method does not hurt model accuracy.
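As a quick sanity check of the 0.9T figure, assuming each probability is stored as a 4-byte float and no compression is applied:

```latex
\underbrace{1.2\times 10^{6}}_{N_{im}} \times \underbrace{200}_{M} \times
\underbrace{1000}_{C_{class}} \times 4\,\mathrm{B}
\approx 9.6\times 10^{11}\,\mathrm{B} \approx 0.9\,\mathrm{TB}
```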
7. Application on self-supervised learning tasks
FKD's training scheme can also be applied to self-supervised learning. The authors pre-train the teacher with a self-supervised algorithm such as MoCo or SwAV, and then generate unsupervised soft labels for the crops in the same way as in the supervised case. The label generation process keeps the projection head of the original self-supervised model and stores its final output vector as the soft label. Once the soft labels are obtained, the student model can be trained with the same procedure as in the supervised setting.
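A minimal sketch of this variant, assuming a MoCo-style teacher whose projection head is retained; storing an L2-normalized projection vector as the "soft label" is an assumption of the sketch, and the names are illustrative rather than the released API:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def selfsup_soft_label(teacher_backbone, projection_head, crop_tensor):
    """Self-supervised variant (sketch): keep the teacher's projection head and
    store its output vector for the crop, to be reused as the distillation target."""
    feat = teacher_backbone(crop_tensor.unsqueeze(0))   # e.g. a MoCo/SwAV encoder
    z = projection_head(feat)                           # retained projection head
    return F.normalize(z, dim=1).squeeze(0).cpu()       # L2-normalization: sketch assumption
```

The student would then be trained on the same stored crops, for example by matching its own projected output to the stored vector, reusing the supervised pipeline described above.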
8. Experimental results
1) First, the results on ResNet-50 and ResNet-101, shown in the table below: FKD reaches 80.1% with ResNet-50 and 81.9% with ResNet-101, while training much faster than normal training and ReLabel.
[Table: ImageNet-1K results with ResNet-50 and ResNet-101]
2) The authors also tested FKD on MEAL V2, where it likewise reaches 80.91%.
[Table: FKD results on MEAL V2]
3) Results on Vision Transformers: next, the authors show results on vision transformers. Without extra data augmentation, FKD improves over the previous knowledge distillation method by nearly one point while training more than 5 times faster.
[Table: results on Vision Transformers]
4) Results on Tiny CNNs:
[Table: results on tiny CNNs]
5) Ablation studies: first, different compression strategies. Considering both storage requirements and training accuracy, the marginal smoothing strategy is the best choice.
[Table: ablation on label compression strategies]
Next is a comparison of different numbers of crops during training. Because MEAL V2 uses pre-trained parameters as its initialization weights, its performance is relatively stable and similar across different crop numbers, while vanilla training and FKD perform best at crop = 4. In particular, vanilla accuracy improves by one point over crop = 1, and accuracy drops noticeably once the number of crops exceeds 8.
[Table: ablation on the number of crops per image]
6) Results on self-supervised tasks: as shown in the table below, FKD can still learn the target model well on self-supervised learning tasks, and it is three to four times faster than Siamese self-supervised network training and distillation training.
[Table: results on self-supervised learning tasks]
9. Downstream tasks
The table below shows the results of the FKD model on the ImageNet ReaL and ImageNetV2 datasets; FKD achieves consistent improvements on both.
[Table: results on ImageNet ReaL and ImageNetV2]
The table below shows the results of FKD pre-trained models on the COCO object detection task, where the improvement is also clear.
[Table: COCO object detection results with FKD pre-trained backbones]
10. Visualization analysis
As shown in the two visualizations below, the authors explore how FKD, as a region-based training method, affects the model by visualizing attention maps, comparing models obtained with three training schemes: normal one-hot labels, ReLabel, and the proposed FKD.
(i) FKD's predicted probabilities are softer (smaller) than ReLabel's because FKD's training introduces more context: under FKD's random-crop strategy, many samples come from background (context) regions, and the teacher's soft labels reflect the actual input content more faithfully. These soft labels may be completely different from the one-hot label, and FKD's training mechanism makes better use of this contextual information.
(ii) FKD's feature visualizations have larger high-response regions on the object, suggesting that the FKD-trained model uses cues from more regions to make predictions and thus captures more diverse and fine-grained information.
(iii) ReLabel's attention visualizations are closer to those of the PyTorch pre-trained model, while FKD's are very different from both. This suggests that the attention learned by FKD differs significantly from previous models, and the reasons for its effectiveness can be further studied from this perspective.
[Figure: attention map visualizations of models trained with one-hot labels, ReLabel, and FKD]