
Adam optimizer improved again: Peking University's Sun Xu group proposes AdaMod, which uses long-term memory to limit overly high learning rates

Xiao Cha from Aofei Temple | QbitAI report | Official account QbitAI

Adam is widely used as an optimizer because it converges quickly, but the quality of the solutions it converges to is often worse, which limits where it can be used; to guarantee better final results, SGD is still preferred in many cases.

But SGD's slower convergence is also a headache, so researchers keep looking for ways to further improve Adam. AdaBound and RAdam are both attempts in this direction.

Recently, Sun Xu's research group at Peking University proposed a new optimizer, AdaMod. It is an improved Adam-based optimizer with automatic warm-up heuristics and long-term learning-rate buffering.


The name AdaMod comes from Adaptive and Momental Bound.

During training, AdaMod can easily beat Adam, while being less sensitive to the learning-rate hyperparameter, producing smoother training curves, and requiring no warm-up.

AdaMod's principle is to maintain an exponential long-term average of the adaptive learning rates during training and to use that average to clip learning rates that grow too high.

This improves the convergence of the optimizer, eliminates the need for warm-up, and reduces sensitivity to learning rates.


In the comparison figure from the paper, the training results of both SGDM and Adam depend on the choice of initial learning rate. AdaMod, on the other hand, converges to the same result even when the learning rate differs by two orders of magnitude.

Compared with the Adam optimizer, AdaMod adds only one hyperparameter, β3, which describes the length of the memory used during training.

This long-term memory suppresses abnormally large adaptive learning rates, so the optimizer does not end up in a bad state.

Like the earlier RAdam optimizer, AdaMod controls the changes in the adaptive learning rate from the very start of training, ensuring stability in the early phase without needing warm-up.


On three Transformer-based neural machine translation models, AdaMod without warm-up showed faster convergence and better final results than Adam with warm-up.


Without warm-up, Adam's results can be very poor, to the point of being completely unusable.


In fact, AdaMod's idea is quite simple: it is just a small modification on top of Adam.

As the AdaBound paper describes, unstable and abnormally large learning rates often appear near the end of training, which hurts the generalization performance of adaptive methods.


AdaBound's idea is therefore to define a lower bound ηl and an upper bound ηu on the learning rate: the lower bound starts at 0 and the upper bound at ∞, and as training progresses both bounds converge to the SGD learning rate α.
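
Concretely, a simplified sketch of AdaBound's update (a rough reading of the AdaBound paper, with some details omitted) is:

η̂_t = Clip( α / √v_t , ηl(t), ηu(t) )
θ_{t+1} = θ_t − η̂_t ⊙ m_t

where ηl(t) rises from 0 toward the SGD learning rate α, ηu(t) falls from ∞ toward it, m_t is the momentum estimate and v_t the second-moment estimate. Early in training the bounds are loose and the method behaves like Adam; late in training they pinch together and it behaves like SGD with momentum.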

Adam computes adaptive learning rates from estimates of the first and second moments of the gradient. Inspired by the exponential moving average (EMA), AdaMod additionally applies such an average to the adaptive learning rates themselves and carries this memory to the next step via the parameter β3.
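
For reference, the Adam computation that AdaMod builds on is the standard update with bias correction; roughly:

m_t = β1·m_{t−1} + (1 − β1)·g_t
v_t = β2·v_{t−1} + (1 − β2)·g_t²
m̂_t = m_t / (1 − β1^t),  v̂_t = v_t / (1 − β2^t)
η_t = α_t / (√v̂_t + ε)

Here g_t is the gradient at step t and η_t is the adaptive step size that Adam would apply to m̂_t; it is this η_t that AdaMod smooths with β3.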


Comparing the pseudocode in the paper, the first 8 steps of Adam and AdaMod are exactly the same; AdaMod only adds steps 9 and 10.

Specifically, AdaMod adds the following operations on top of Adam.


The averaging range of the exponential moving average is roughly 1/(1−β3). β3 thus measures the length of the memory: the closer it is to 1, the longer the memory.

For example, when β3 = 0.9, the average range of memory is 10 cycles; when β3 = 0.999, the average range is 1000 cycles.

Using β3, the smoothed value at the current step can be computed from the previous smoothed value and the current adaptive learning rate.

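In this notation, the smoothing step added by AdaMod is (up to minor notational differences with the paper):

s_t = β3·s_{t−1} + (1 − β3)·η_t

where η_t is the adaptive learning rate Adam computes at the current step and s_t is its exponential long-term average.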

This equation ties the current smoothed value to the past "long-term memory". Obviously, when β3 = 0, AdaMod is exactly equivalent to Adam.

After the current smoothed value is computed, the element-wise minimum of it and the learning rate ηt computed by Adam at the current step is taken, which avoids excessively large learning rates.

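In the same notation, the bounding step and the resulting parameter update are roughly:

η̂_t = min(η_t, s_t)
θ_t = θ_{t−1} − η̂_t ⊙ m̂_t

with the minimum taken element-wise.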

This operation can be seen as clipping the learning rate element by element, so that it never exceeds the current smoothed value.
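
Putting the pieces together, a minimal NumPy sketch of one AdaMod-style update step could look like the following. This is an illustrative reimplementation for a single parameter tensor, not the official code; the function name adamod_step and the state dictionary are made up here.

import numpy as np

def adamod_step(theta, grad, state, lr=1e-3,
                beta1=0.9, beta2=0.999, beta3=0.999, eps=1e-8):
    # state holds the running moments m, v, the smoothed step size s, and the step count t
    state["t"] += 1
    t = state["t"]

    # Steps 1-8: ordinary Adam moment estimates with bias correction
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)
    eta = lr / (np.sqrt(v_hat) + eps)          # Adam's adaptive step size

    # Step 9: exponential long-term average of the step sizes (the "memory")
    state["s"] = beta3 * state["s"] + (1 - beta3) * eta

    # Step 10: clip the step size element-wise by its long-term average
    eta_hat = np.minimum(eta, state["s"])

    # Parameter update
    return theta - eta_hat * m_hat

# Example state initialization for a parameter array theta:
# state = {"m": np.zeros_like(theta), "v": np.zeros_like(theta),
#          "s": np.zeros_like(theta), "t": 0}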

AdaMod can now be installed directly via pip.
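
Assuming the released package follows the usual PyTorch optimizer interface and exposes an AdaMod class with a beta3 argument (check the repository's README for the exact API), usage would look something like:

pip install adamod

from adamod import AdaMod
# drop-in replacement for torch.optim.Adam; beta3 controls the memory length
optimizer = AdaMod(model.parameters(), lr=1e-3, beta3=0.999)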

Although AdaMod beats Adam, SGDM can still outperform AdaMod when training runs longer.

For this reason, someone has proposed a DiffMod algorithm that combines diffGrad and AdaMod; it replaces β3 with another parameter, "len_memory", to which you can pass the total number of batches, making the memory length easier to reason about and track.
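
Given the averaging-range relation above, len_memory presumably maps to β3 roughly as follows (an assumption based on the 1/(1−β3) rule, not a statement about DiffMod's exact code):

beta3 = 1.0 - 1.0 / len_memory   # e.g. len_memory = 1000 corresponds to beta3 = 0.999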

The paper's first author is Ding Jianbang; the corresponding author is Associate Professor Sun Xu, who received his bachelor's degree from Huazhong University of Science and Technology, received his Ph.D. from the University of Tokyo in 2010, and interned at Microsoft Research Redmond in the United States.


His research interests are natural language processing, machine learning, and deep learning, and he has served as an area chair for international academic conferences such as EMNLP and IJCNLP.

The earlier AdaBound optimizer was proposed by Luo Liangchen of Sun Xu's group, and the first author of this paper thanks Luo Liangchen and others for taking part in the discussion.

Blog Discussion:

https://medium.com/@lessw/meet-adamod-a-new-deep-learning-optimizer-with-memory-f01e831b80bd

Paper link:

https://arxiv.org/abs/1910.12249v1

AdaMod source code:

https://github.com/lancopku/adamod

DiffMod source code:

https://github.com/lessw2020/best-deep-learning-optimizers/blob/master/adamod/diffmod.py

— Ends —

QbitAI · Signed author on Toutiao

Follow us and be the first to know about cutting-edge technology developments