
A single GPU can tune GPT-3's hyperparameters! Train a small model first, then "one-click migrate": the method is now open source

Fengse, from Aofei Temple

Qubits | Official account QbitAI

"A GPU can't train GPT-3, let alone adjust the hyperparameters on it."

No, no, no. The situation has now changed:

It is entirely possible to adjust the hyperparameters of a large-scale model on a single GPU.

How so?

It turns out that someone has discovered a new tuning method: no matter how the model size changes, the optimal hyperparameters it finds remain stable.

This means we can first train a small version of the model, tune the hyperparameters on it, and then copy them zero-shot to the full-scale model, obtaining quite good performance.

This is great news for anyone without enough GPU resources.

The related post has already sparked a lively discussion on Reddit, drawing 300+ upvotes.


Tuning a GPT-3-scale model on a single GPU

The method is called µTransfer, and its authors come from Microsoft and OpenAI.

The idea is simple, building on a special parametrization called µP (Maximal Update Parametrization) that they discovered in previous work:

Narrow neural networks and wide neural networks share the same set of optimal hyperparameters, even as the width goes to infinity (width → ∞).

The specific principle can be found in the paper "Feature Learning in Infinite-Width Neural Networks".

Shareable hyperparameters include the learning rate, learning rate schedule, initialization, parameter multipliers, and more; they can even be set individually for each parameter tensor.

The authors validated this conclusion on Transformers and ResNets with widths up to 4096.

Thus, resource-poor alchemists can tune the hyperparameters of a small version of GPT-3 on a single GPU:

If the hyperparameters found on this small model are close to optimal, then the same performance can be obtained on the large model.

P.S. This tuning method has accordingly been named "µTransfer".


How effective is it?

The authors trained a small GPT-3 with only 40 million parameters, small enough to run directly on a single GPU.

Then they "migrated" its hyperparameters to a massive GPT-3 with 6.7 billion parameters, and found that its performance rivaled that of the original GPT-3, even though the original has twice as many parameters!

This tuning accounted for only 7% of the overall pre-training cost.

Because the cost of tuning the small model stays roughly constant as the target model grows, if this method were used to tune the 175-billion-parameter GPT-3, the cost might be as low as 0.3% of the total pre-training cost.

Well, you might ask: does this only work for shrinking the model's width?

The authors say there is no theoretical guarantee for "non-width" dimensions.

The good news is that they empirically tested transfer across depth, batch size, sequence length, and training steps within reasonable ranges on pre-LN Transformers.


Among other experiments, they shrank BERT-base and BERT-large in both width and depth, tuned their hyperparameters simultaneously, and found:

Both outperformed the already-tuned Megatron BERT baseline, with BERT-large improving the most.


This also suggests a rule of thumb:

The larger the target model, the greater the benefit of migrating hyperparameters.

So the authors also joke that although they did not test the 175-billion-parameter GPT-3, the results are guaranteed to make you "drool".


Having said all that, how exactly is this done?

The following table outlines how to adjust your model's initialization and learning rate according to fan-in or fan-out.

Here the pink text is µP, and the gray text in parentheses is the PyTorch default.
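To make the pattern concrete, here is a minimal sketch of the kind of scaling the table encodes for a hidden weight matrix under Adam: relative to the base (small) model, the initialization variance shrinks like 1/fan_in and the Adam learning rate shrinks like 1/fan_in. The function name and the base values below are made up for illustration; the exact per-tensor rules are in the paper's table.

```python
import math

def mup_hidden_scaling(base_fan_in, fan_in, base_lr, base_init_std):
    """Illustrative muP-style scaling for a hidden weight matrix (Adam):
    init variance ~ 1/fan_in, learning rate ~ 1/fan_in.
    The base_* values are whatever was tuned on the small proxy model."""
    ratio = fan_in / base_fan_in                 # how much wider than the base
    init_std = base_init_std / math.sqrt(ratio)  # std ~ 1/sqrt(fan_in)
    lr = base_lr / ratio                         # Adam LR ~ 1/fan_in
    return init_std, lr

# Widening 4x: init std halves, learning rate shrinks 4x
print(mup_hidden_scaling(256, 1024, 1e-3, 0.02))  # -> (0.01, 0.00025)
```

At the base width the function returns the tuned values unchanged, which is exactly the property that lets hyperparameters found on the small model carry over.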

Of course, if you don't want to do this by hand, the authors have also open-sourced a PyTorch implementation, which can be applied to your model after a pip install mup.


About the author

The first author is Greg Yang, a senior researcher at Microsoft.

The corresponding author is Jianfeng Gao, partner research manager of the Microsoft Research Deep Learning Technology Center and IEEE Fellow.

There are also two other Chinese authors: Xiaodong Liu (an alumnus of Beijing University of Posts and Telecommunications) and Weizhu Chen (who has been with Microsoft for 16 years).

Their results have been accepted by NeurIPS 2021.


GitHub link:

https://github.com/microsoft/mup

Paper link:

https://arxiv.org/abs/2203.03466

Official blog post:

https://www.microsoft.com/en-us/research/blog/%C2%B5transfer-a-technique-for-hyperparameter-tuning-of-enormous-neural-networks/

Reddit discussion:

https://www.reddit.com/r/MachineLearning/comments/tb0jm6/r_you_cant_train_gpt3_on_a_single_gpu_but_you_can/
