
Difficulty reproducing a model isn't necessarily the author's fault: the architecture itself may be to blame, finds a CVPR 2022 study

Feng Se, from Aofei Temple

Qubits | Official account QbitAI

Can the same neural network, trained twice from different random initializations, arrive at the same result?

A CVPR 2022 study answers this by visualizing decision boundaries:

Some are easy, some are hard.

For example, in the figure below the researchers found ViT harder to reproduce than ResNet (after two training runs, ViT's decision boundaries clearly differ more):

[Figure: decision boundaries of the same model after two training runs, ViT vs. ResNet]

The researchers also found a strong correlation between a model's reproducibility and its width.

Similarly, they used this method to visualize double descent, one of the most important machine learning findings of 2019, and uncovered some interesting phenomena along the way.


Let's see exactly how they do it.

Wider CNN models are more reproducible

Decision boundaries in deep learning are what a classifier learns in order to minimize its error.

In simple terms, the decision boundary partitions the input space: the classifier assigns points on one side of the boundary to one class and points on the other side to a different class.

In this study, the authors took three random images from the CIFAR-10 training set, used them to span a two-dimensional plane in input space, and then trained 7 different architectures, each with three different random initializations, to map out their decision regions on that plane.
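To make this concrete, here is a minimal PyTorch sketch of how such a plot can be produced: span a plane through the three images and record the predicted class at every grid point. The function `decision_region_grid` and all of its details are illustrative assumptions, not the authors' actual code (their repository is linked at the end of the article):

```python
import torch

def decision_region_grid(model, imgs, steps=100, device="cpu"):
    """Predicted classes over the 2D plane spanned by three images
    (an illustrative sketch, not the paper's exact code)."""
    x0, x1, x2 = [im.flatten() for im in imgs]  # three CIFAR-10 images
    u, v = x1 - x0, x2 - x0                     # basis vectors of the plane
    coords = torch.linspace(0, 1, steps)
    preds = torch.empty(steps, steps, dtype=torch.long)
    model.eval()
    with torch.no_grad():
        for i, a in enumerate(coords):
            # One row of plane points: x0 + a*u + b*v for every b.
            row = x0 + a * u + coords[:, None] * v        # (steps, 3072)
            logits = model(row.view(-1, 3, 32, 32).to(device))
            preds[i] = logits.argmax(dim=1).cpu()
    return preds  # (steps, steps) grid of predicted class ids
```

Rendering the returned grid with matplotlib's `imshow` and a categorical colormap, once per random seed, produces panels like those in the figure below.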

[Figure: decision regions of 7 architectures, 3 random seeds each]

From this we can find:

The three architectures on the left produce very different maps from the four on the right, meaning similarity across these architecture families is low.

Looking closer, the fully connected network, ViT, and MLP-Mixer on the left each carve up the plane differently, while the CNN models on the right look very similar to one another.

Among the CNNs, we can also see a clear repeating pattern across different random seeds, which shows that runs with different initializations can produce essentially the same result.

The authors then devised a quantitative reproducibility score for each architecture, and the results bear out the visual impression:
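One simple way to turn those pictures into a number is to average the per-class IoU (intersection over union) of the decision regions across pairs of seeds. The sketch below does exactly that for the grids produced above; it is an illustrative stand-in, not necessarily the score defined in the paper:

```python
from itertools import combinations

import torch

def reproducibility_score(grids):
    """Average per-class IoU of decision regions across seed pairs.
    `grids`: list of (steps, steps) LongTensors from decision_region_grid.
    An illustrative measure, not necessarily the paper's definition."""
    pair_scores = []
    for g1, g2 in combinations(grids, 2):
        classes = torch.unique(torch.cat([g1.flatten(), g2.flatten()]))
        ious = []
        for c in classes:
            inter = ((g1 == c) & (g2 == c)).sum().item()
            union = ((g1 == c) | (g2 == c)).sum().item()
            ious.append(inter / union)  # union > 0: c appears in g1 or g2
        pair_scores.append(sum(ious) / len(ious))
    return sum(pair_scores) / len(pair_scores)
```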

[Figure: reproducibility scores for each architecture]

It also turned out that wider CNN models, such as WideRN30, tend to have more reproducible decision regions.

And CNN models with residual connections (ResNet and DenseNet) have slightly higher reproducibility than models without them (VGG).

In addition, the choice of optimizer also has an impact.

In the table below, we can see that SAM (Sharpness-Aware Minimization) produces more repeatable decision boundaries than standard optimizers such as SGD and Adam.

However, for MLP-Mixer and ViT, using SAM does not always yield the highest test accuracy.

[Table: decision-boundary reproducibility and test accuracy under different optimizers]
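SAM works by taking a two-step update: first climb to the worst-case weights in a small neighborhood, then descend using the gradient measured there. Here is a simplified sketch of one SAM step, assuming an ordinary PyTorch model and base optimizer; it is not the official implementation:

```python
import torch

def sam_step(model, loss_fn, batch, base_opt, rho=0.05):
    """One Sharpness-Aware Minimization step (simplified sketch,
    not the official implementation). base_opt: any torch.optim optimizer."""
    x, y = batch

    # First pass: gradient at the current weights.
    loss_fn(model(x), y).backward()

    # Global gradient norm, then climb to w + rho * g / ||g||.
    params = [p for p in model.parameters() if p.grad is not None]
    norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
    with torch.no_grad():
        eps = [rho * p.grad / (norm + 1e-12) for p in params]
        for p, e in zip(params, eps):
            p.add_(e)
    model.zero_grad()

    # Second pass: gradient at the perturbed weights drives the update.
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)  # restore the original weights
    base_opt.step()    # descend with the sharpness-aware gradient
    base_opt.zero_grad()
```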

Some netizens wondered whether this could be changed by improving the design of the model itself.

The authors responded that they had tried tuning ViT's learning rate, but its reproducibility still fell short of ResNet's.


Visualizing double descent in ResNet-18

Double descent is an interesting phenomenon that describes the relationship between test/training error and model size.

Previously, it was generally believed that models with too few parameters generalize poorly because they underfit, while models with too many parameters also generalize poorly because they overfit.


Double descent shows that the relationship between the two is not so simple. Specifically:

Test error first decreases as the model grows, rises again once the model begins to overfit, and then, as model size or training time increases further, decreases a second time.

The authors then used the same decision-boundary method to visualize the double descent of ResNet-18.

They increased model capacity by varying the width parameter k from 1 to 64.
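In the usual ResNet-18 double-descent setup, the four stages use k, 2k, 4k, and 8k channels, so k = 64 recovers the standard network; assuming this paper follows that convention, the sweep looks like:

```python
# Stage widths of a width-scaled ResNet-18 (assumed convention:
# the standard network has k = 64 and stage widths 64/128/256/512).
def stage_widths(k):
    return [k, 2 * k, 4 * k, 8 * k]

for k in (1, 4, 10, 64):
    print(f"k={k:>2} -> widths {stage_widths(k)}")
# k=64 -> widths [64, 128, 256, 512], i.e. the standard ResNet-18
```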

Two sets of models were trained: one on the clean training set, and one on a training set in which 20% of the labels were randomly corrupted.
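A common recipe for such a noisy training set is to overwrite a random 20% of the labels with uniformly drawn classes. Here is a minimal sketch using torchvision's CIFAR10; the paper's exact corruption procedure may differ:

```python
import torch
from torchvision.datasets import CIFAR10

def corrupt_labels(dataset, noise_frac=0.2, num_classes=10, seed=0):
    """Overwrite a random fraction of labels with random classes
    (a common label-noise recipe; the paper's may differ in detail)."""
    g = torch.Generator().manual_seed(seed)
    targets = torch.tensor(dataset.targets)
    n_noisy = int(noise_frac * len(targets))
    idx = torch.randperm(len(targets), generator=g)[:n_noisy]
    targets[idx] = torch.randint(0, num_classes, (n_noisy,), generator=g)
    dataset.targets = targets.tolist()
    return dataset

noisy_train = corrupt_labels(CIFAR10(root="./data", train=True, download=True))
```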

In the end, a pronounced double descent was observed in the second (noisy-label) set of models.

[Figure: decision regions of ResNet-18 across widths k, trained with and without label noise]

In this regard, the authors state:

The model instability predicted by linear models also applies to neural networks, but here the instability manifests as heavy fragmentation of the decision regions.

That is to say, under noisy labels the double descent phenomenon coincides with excessive fragmentation of the decision regions.

Specifically, as k approaches the interpolation threshold (around k = 10), the model fits most of the training data, and the decision regions splinter into many small pieces, becoming chaotic, fragmented, and irreproducible; at this point the model's classification function is markedly unstable.

For very narrow (k = 4) and very wide (k = 64) models, the decision regions are much less fragmented and highly reproducible.

To back this up quantitatively, the authors devised a fragmentation score, which confirms the observations in the figure above.
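One plausible way to quantify fragmentation is to count the connected same-class regions in the prediction grid; the sketch below uses SciPy's connected-component labeling for this. Treat it as an illustrative stand-in for the authors' actual definition:

```python
import numpy as np
from scipy.ndimage import label

def fragmentation_score(grid):
    """Count connected same-class regions in a (steps, steps) array of
    predicted class ids; more pieces means more fragmented regions.
    An illustrative stand-in for the authors' definition."""
    grid = np.asarray(grid)
    n_fragments = 0
    for c in np.unique(grid):
        _, n = label(grid == c)  # connected components of class c's region
        n_fragments += n
    return n_fragments
```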

[Figure: fragmentation score of ResNet-18 across widths k]

The model's reproducibility score is as follows:

[Figure: reproducibility score of ResNet-18 across widths k]

Here, too, reproducibility is high in both the under-parameterized and over-parameterized regimes, but drops sharply at the interpolation threshold.

Interestingly, even without label noise, the metric the researchers designed was sensitive enough to detect a subtle dip in reproducibility (the blue line in the chart above).

Now that the code is open source, why not check how easily your own model reproduces?

Paper link:

https://arxiv.org/abs/2203.08124

GitHub link:

https://github.com/somepago/dbVi
