Google "Model Soup" slaughtered ImageNet's list by fine-tuning! The method turned out to be only half a page

Google "Model Soup" slaughtered ImageNet's list by fine-tuning! The method turned out to be only half a page

Reporting by XinZhiyuan

Editors: LRS, 好困

[Xin Zhiyuan Introduction] The ImageNet leaderboard has been refreshed! This time, however, the new champion Google did not propose a new model: it took first place simply by fine-tuning and averaging several models, and the paper is almost entirely experimental analysis, which sparked controversy among netizens: it's all just money and compute!

Recently, Google put its massive computing resources to good use, and even pulled in its friends at Meta AI.

It has to be said that collaboration between these two old rivals is rare.

Google "Model Soup" slaughtered ImageNet's list by fine-tuning! The method turned out to be only half a page

Paper link: https://arxiv.org/abs/2203.05482

The research team proposed a concept called "model soup": fine-tune a large pre-trained model under different hyperparameter configurations, and then average the weights of the resulting models.

Experimental results show that this simple method usually improves both the accuracy and the robustness of the model.

Google "Model Soup" slaughtered ImageNet's list by fine-tuning! The method turned out to be only half a page

In general, getting the best performing model requires two steps:

1. Train multiple models with different hyperparameters

2. Choose the model that works best on the validation set

But a single model chosen this way has a fatal flaw: luck plays a large role, and it is easy to land in a local optimum, so the resulting performance is not globally optimal.

Google "Model Soup" slaughtered ImageNet's list by fine-tuning! The method turned out to be only half a page

Another common strategy is model ensembling, but an ensemble is still essentially multiple models: the same input must be run through inference multiple times, so the inference cost is higher.

Model soup, on the other hand, averages the model weights, and the result is a single model whose performance improves without any additional inference or memory cost.

Of course, you may be wondering: the method is so simple, how did Google dare to publish it as a paper?

Google "Model Soup" slaughtered ImageNet's list by fine-tuning! The method turned out to be only half a page

The Method section is only half a page, and the rest of the paper is essentially all experiments. In other words, Google did something nobody else has done: use a huge amount of computing resources to run enough experiments to prove that this simple method works.

And the model set a new record on ImageNet-1K: 90.94%.

Google "Model Soup" slaughtered ImageNet's list by fine-tuning! The method turned out to be only half a page

So for university researchers, this paper may not hold much academic value; it is purely experimental science. But for large companies with money and resources, strong performance is enough!

The name "model soup" may have been inspired by the "Fibonacci soup" joke: reheat yesterday's soup and the day before yesterday's soup, mix them together, and you get today's fresh "Fibonacci soup".

Google "Model Soup" slaughtered ImageNet's list by fine-tuning! The method turned out to be only half a page

Model soup likewise reheats yesterday's models into today's fresh SOTA model.

Old wine in a new bottle

A common development pattern for CV models is that large companies with computing resources pre-train a model, and other researchers fine-tune it for their specific downstream tasks.

A single fine-tuned model may not be optimal, so another common way to improve performance is ensembling: train multiple models with different hyperparameters, then combine their outputs, for example by voting, taking the prediction that most models agree on as the final output.
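
For contrast with weight averaging, here is a minimal sketch of output-level ensembling by majority vote, assuming a list of fine-tuned PyTorch models of the same architecture (all names are hypothetical):

```python
import torch

# Hypothetical names: `models` is a list of fine-tuned copies of the same
# architecture, `images` is one input batch. Each model runs its own forward pass.
def ensemble_predict(models, images):
    with torch.no_grad():
        votes = torch.stack([m(images).argmax(dim=-1) for m in models])  # one prediction per model
    return votes.mode(dim=0).values  # majority vote is the final output
```

Every extra model in the ensemble costs one more forward pass at inference time, which is exactly the cost model soup avoids.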

Although an ensemble can improve performance, the drawbacks are obvious: the same input has to be predicted multiple times, inference throughput drops noticeably, and you end up needing more GPU memory, more GPUs, or longer inference time.

Google "Model Soup" slaughtered ImageNet's list by fine-tuning! The method turned out to be only half a page

Google proposes to average the weights of multiple fine-tuned models instead of selecting the single model that achieves the highest accuracy on the validation set; the resulting new model is called a model soup.

Because multiple models have to be trained during a normal hyperparameter search anyway, the model soup does not increase training cost. And since the soup is itself a single model, inference cost does not increase either.

Google "Model Soup" slaughtered ImageNet's list by fine-tuning! The method turned out to be only half a page

In fact, previous studies have shown that averaging weights along a single training trajectory can improve the performance of models trained from random initialization.

Model soup extends the effectiveness of weight averaging to the fine-tuning setting.
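
That earlier line of work is in the spirit of stochastic weight averaging (SWA); here is a minimal sketch using PyTorch's built-in SWA utilities, with the training-loop details (model, optimizer, loaders, helper function, epoch counts) left as hypothetical placeholders:

```python
import torch
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

# Hypothetical: `model`, `optimizer`, `train_loader`, `train_one_epoch`,
# `num_epochs`, and `swa_start_epoch` come from a standard training setup.
swa_model = AveragedModel(model)               # keeps a running average of the weights
swa_scheduler = SWALR(optimizer, swa_lr=0.05)  # constant learning rate during the SWA phase

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer, train_loader)   # hypothetical helper
    if epoch >= swa_start_epoch:
        swa_model.update_parameters(model)            # average in the current weights
        swa_scheduler.step()

update_bn(train_loader, swa_model)  # recompute BatchNorm statistics for the averaged model
```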

Google "Model Soup" slaughtered ImageNet's list by fine-tuning! The method turned out to be only half a page

There are many possible strategies for weight averaging; the paper gives three recipes: uniform soup, greedy soup, and learned soup.

Uniform soup is the simplest: the weights of the different models are averaged directly.
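
A minimal sketch of uniform soup in PyTorch, assuming checkpoints fine-tuned from the same pre-trained model (so all state dicts share identical keys and shapes; the file names are hypothetical):

```python
import torch

def uniform_soup(state_dicts):
    """Average corresponding weights across several fine-tuned state dicts."""
    return {key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
            for key in state_dicts[0]}

# Hypothetical usage: load checkpoints from runs with different hyperparameters,
# average them, and load the result back into a single model.
# soup = uniform_soup([torch.load(p) for p in ["ft_lr1e-5.pt", "ft_lr3e-5.pt", "ft_lr1e-4.pt"]])
# model.load_state_dict(soup)
```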

Greedy soup is built by trying candidate models one at a time as potential ingredients, keeping a model in the soup only if it improves the soup's performance on a held-out validation set.

Before running the algorithm, the models are sorted in descending order of validation accuracy, so the greedy soup is guaranteed to be no worse than the best single model on the validation set.
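
A sketch of this greedy recipe; the evaluation helper and the held-out split are hypothetical, and the paper's exact procedure may differ in details:

```python
import torch

def greedy_soup(state_dicts, val_accs, evaluate):
    """state_dicts: fine-tuned checkpoints; val_accs: their held-out accuracies;
    evaluate: callable that scores an averaged state dict on the held-out set."""
    def average(dicts):
        return {k: torch.stack([d[k].float() for d in dicts]).mean(dim=0) for k in dicts[0]}

    # Sort candidates by held-out accuracy, best model first.
    order = sorted(range(len(state_dicts)), key=lambda i: val_accs[i], reverse=True)
    ingredients = [state_dicts[order[0]]]
    best_acc = val_accs[order[0]]

    for i in order[1:]:
        candidate_acc = evaluate(average(ingredients + [state_dicts[i]]))
        if candidate_acc >= best_acc:        # keep the ingredient only if the soup improves
            ingredients.append(state_dicts[i])
            best_acc = candidate_acc
    return average(ingredients)
```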

Google "Model Soup" slaughtered ImageNet's list by fine-tuning! The method turned out to be only half a page

Learned soup treats the mixing coefficient of each model in the soup as a learnable parameter, optimized on held-out data.
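
A rough sketch of the learned-soup idea, with per-model mixing coefficients optimized on a held-out set; this assumes PyTorch 2.x's torch.func.functional_call, simplifies the paper's actual procedure, and all names are hypothetical:

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

# Hypothetical inputs: `model` is the shared architecture, `soup_dicts` the
# fine-tuned state dicts, `val_loader` a held-out validation loader.
alphas = torch.nn.Parameter(torch.zeros(len(soup_dicts)))
optimizer = torch.optim.Adam([alphas], lr=1e-2)

for images, labels in val_loader:
    coeffs = torch.softmax(alphas, dim=0)   # convex mixing weights, one per model
    mixed = {k: sum(c * sd[k].float() for c, sd in zip(coeffs, soup_dicts))
             for k in soup_dicts[0]}
    loss = F.cross_entropy(functional_call(model, mixed, (images,)), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```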

Google "Model Soup" slaughtered ImageNet's list by fine-tuning! The method turned out to be only half a page

Strong performance is king

Although the idea behind model soup is simple, the focus of the paper is not the method but the experiments.

Google "Model Soup" slaughtered ImageNet's list by fine-tuning! The method turned out to be only half a page

In the experimental section, the researchers explored applying model soup when fine-tuning a variety of models. The main models fine-tuned are CLIP and ALIGN, pre-trained with contrastive supervision on image-text pairs; a ViT-G/14 model pre-trained on JFT-3B; and Transformer models for text classification. The experiments mainly used the CLIP ViT-B/32 model.

Fine-tuning is end-to-end, i.e. all parameters are updated, which tends to give higher accuracy than training only the final linear layer.

Before fine-tuning, the experiments use two different approaches to initialize the final linear layer. The first initializes it from a linear probe (LP). The second uses zero-shot initialization, for example taking the classifier produced by the text tower of CLIP or ALIGN as the initialization.
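
As an illustration of the zero-shot initialization, here is a sketch using OpenAI's CLIP package; the class names and prompt template are hypothetical placeholders (the paper uses the full ImageNet class names and prompt ensembles):

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

class_names = ["goldfish", "tabby cat", "pizza"]   # hypothetical subset of class names

model, _ = clip.load("ViT-B/32")
with torch.no_grad():
    tokens = clip.tokenize([f"a photo of a {c}" for c in class_names])
    text_features = model.encode_text(tokens)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# The normalized text embeddings become the weight matrix of the final linear
# classifier (zero-shot initialization); end-to-end fine-tuning then starts from here.
head = torch.nn.Linear(text_features.shape[1], len(class_names), bias=False)
head.weight.data = text_features.float()
```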

The dataset used for fine-tuning is ImageNet. Five natural distribution shifts were also evaluated: ImageNetV2, ImageNet-R, ImageNet-Sketch, ObjectNet, and ImageNet-A.

Since the official ImageNet validation set is used as the test set, the experiments use about 2% of the ImageNet training set as a held-out validation set for building greedy soups.

Comparing the different soup strategies, the results show that greedy soup needs fewer models to reach the same accuracy as picking the best individual model on the held-out validation set. In the corresponding figure, the X axis is the number of models considered in the hyperparameter random search, and the Y axis is the accuracy of each model-selection method; all methods have the same training cost and the same inference cost.

Google "Model Soup" slaughtered ImageNet's list by fine-tuning! The method turned out to be only half a page

For any number of models, greedy soup beats the best single model on both ImageNet and the out-of-distribution test sets; greedy soup beats uniform soup on ImageNet and is comparable to it out of distribution. Logit ensembling is better than greedy soup on ImageNet but worse out of distribution.

Google "Model Soup" slaughtered ImageNet's list by fine-tuning! The method turned out to be only half a page

Greedy soup applied to ViT-G/14, pre-trained on JFT-3B and fine-tuned on ImageNet, outperforms the best single model both in and out of distribution.

Google "Model Soup" slaughtered ImageNet's list by fine-tuning! The method turned out to be only half a page

To test whether the gains from model soup extend beyond image classification, the researchers also experimented with NLP tasks. They fine-tuned BERT and T5 models on four text classification tasks from the GLUE benchmark: MRPC, RTE, CoLA, and SST-2. While the improvement is not as pronounced as in image classification, greedy soup still outperforms the best single model in many cases.

Google "Model Soup" slaughtered ImageNet's list by fine-tuning! The method turned out to be only half a page

Does it make sense?

After reading the paper, most researchers working on AI models are probably thinking to themselves: that's it?

Google "Model Soup" slaughtered ImageNet's list by fine-tuning! The method turned out to be only half a page

As soon as the paper came out, a discussion also started on Zhihu.

Google "Model Soup" slaughtered ImageNet's list by fine-tuning! The method turned out to be only half a page

Some netizens said this kind of paper is meaningless: it just piles up resources to verify a small idea. Earlier work had similar ideas, and the paper lacks any theoretical analysis of the neural networks involved.

However, everything has two sides. Netizen @Zhao Zhao is not bad countered that SOTA is only the surface of the paper; the conclusions drawn from its large number of experiments are quite enlightening, and simple-but-effective is a good idea!

Netizen @Battle Department Pastor said this is very Google-style work: the idea is not hard to come up with, but Google wins by keeping inference speed unchanged, the explanation of the problem is in place, and the experiments are thorough (poorer researchers may not be able to reproduce them). There is indeed a lot to learn from it. Model soup is also more environmentally friendly: instead of throwing trained models away, it puts them to use and avoids wasting electricity.

Google "Model Soup" slaughtered ImageNet's list by fine-tuning! The method turned out to be only half a page

Netizen @Tomato Brisket offered this analysis: "For today's ImageNet leaderboard models, 1 billion parameters is not considered small and 10 billion is not considered large. And rich players like Google and Facebook casually start with 1,000 GPUs, not only using Conv + Transformer but also 'cheating' with JFT-3B. However, if a 1,000-layer ResNet ever reaches 91% Top-1, that would be real progress for the field."

Finally, he joked: "If I ever get to 92% Top-1, I'll wake up laughing in the middle of the night; my KPIs for the whole year would be done."

Google "Model Soup" slaughtered ImageNet's list by fine-tuning! The method turned out to be only half a page

Resources:

https://arxiv.org/abs/2203.05482

https://www.zhihu.com/question/521497951
