This article demonstrates the behavior of various deep-learning optimizers. All results are based on PyTorch implementations and follow the GitHub project pytorch-optimizer (repository link). pytorch-optimizer implements the commonly used optimizers on top of PyTorch; I highly recommend using it and starring the repository.
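As a minimal usage sketch (assuming the package is installed as `torch-optimizer` and imported as `torch_optimizer`; the linear model and the choice of DiffGrad below are only placeholders to show the standard `torch.optim`-style interface):

```python
# pip install torch-optimizer
import torch
import torch.nn.functional as F
import torch_optimizer as optim

model = torch.nn.Linear(10, 1)
# Every optimizer in the package follows the usual torch.optim interface.
optimizer = optim.DiffGrad(model.parameters(), lr=1e-3)

x, y = torch.randn(32, 10), torch.randn(32, 1)
optimizer.zero_grad()
loss = F.mse_loss(model(x), y)
loss.backward()
optimizer.step()
```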
Table of Contents
- 1 Introduction
- 2 Results
- A2GradExp(2018)
- A2GradInc(2018)
- A2GradUni(2018)
- AccSGD(2019)
- AdaBelief(2020)
- AdaBound(2019)
- AdaMod(2019)
- Adafactor(2018)
- AdamP(2020)
- AggMo(2019)
- Apollo(2020)
- DiffGrad*(2019)
- Lamb(2019)
- Lookahead*(2019)
- NovoGrad(2019)
- PID(2018)
- QHAdam(2019)
- QHM(2019)
- RAdam*(2019)
- Ranger(2019)
- RangerQH(2019)
- RangerVA(2019)
- SGDP(2020)
- SGDW(2017)
- SWATS(2017)
- Shampoo(2018)
- Yogi*(2018)
- Adam
- SGD
- 3 Evaluation
- 4 References
1 Introduction
The optimizers implemented in pytorch-optimizer and the papers they come from are listed below. There is a huge amount of research on optimizers, but the same optimizer can perform very differently across tasks and datasets, so treat these results as a reference rather than a verdict.
optimizer | paper |
--- | --- |
A2GradExp | https://arxiv.org/abs/1810.00553 |
A2GradInc | https://arxiv.org/abs/1810.00553 |
A2GradUni | https://arxiv.org/abs/1810.00553 |
AccSGD | https://arxiv.org/abs/1803.05591 |
AdaBelief | https://arxiv.org/abs/2010.07468 |
AdaBound | https://arxiv.org/abs/1902.09843 |
AdaMod | https://arxiv.org/abs/1910.12249 |
Adafactor | https://arxiv.org/abs/1804.04235 |
AdamP | https://arxiv.org/abs/2006.08217 |
AggMo | https://arxiv.org/abs/1804.00325 |
Apollo | https://arxiv.org/abs/2009.13586 |
DiffGrad | https://arxiv.org/abs/1909.11015 |
Lamb | https://arxiv.org/abs/1904.00962 |
Lookahead | https://arxiv.org/abs/1907.08610 |
NovoGrad | https://arxiv.org/abs/1905.11286 |
PID | https://www4.comp.polyu.edu.hk/~cslzhang/paper/CVPR18_PID.pdf |
QHAdam | https://arxiv.org/abs/1810.06801 |
QHM | https://arxiv.org/abs/1810.06801 |
RAdam | https://arxiv.org/abs/1908.03265 |
Ranger | https://arxiv.org/abs/1908.00700v2 |
RangerQH | https://arxiv.org/abs/1908.00700v2 |
RangerVA | https://arxiv.org/abs/1908.00700v2 |
SGDP | https://arxiv.org/abs/2006.08217 |
SGDW | https://arxiv.org/abs/1608.03983 |
SWATS | https://arxiv.org/abs/1712.07628 |
Shampoo | https://arxiv.org/abs/1802.09568 |
Yogi | https://papers.nips.cc/paper/8186-adaptive-methods-for-nonconvex-optimization |
To evaluate different optimizers, pytorch-optimizer relies on visualization. Visualization helps us understand how different algorithms handle simple situations such as saddle points and local minima, and can offer interesting insight into the inner workings of an algorithm. pytorch-optimizer uses the Rosenbrock and Rastrigin functions for these visualizations (a minimal reproduction sketch follows the list below). Specifically:
- Rosenbrock (also known as the banana function) is a non-convex function with a single global minimum at (1.0, 1.0). The global minimum lies inside a long, narrow, parabolic, flat valley. Finding the valley is trivial, but converging to the global minimum at (1.0, 1.0) is difficult, and optimization algorithms may get stuck in a local minimum.
- The Rastrigin function is non-convex and has a single global minimum at (0.0, 0.0). Because of its large search space and its large number of local minima, finding the minimum of this function is fairly difficult.
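As a rough sketch of how such a trajectory plot can be reproduced (the function definitions follow the standard formulas; the starting point, step count, and learning rate below are my own choices, not the values used by pytorch-optimizer's benchmark script):

```python
import torch

# Two classic 2-D test functions used for the visualizations.
def rosenbrock(xy):
    x, y = xy
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def rastrigin(xy, A=10):
    x, y = xy
    return (A * 2
            + (x ** 2 - A * torch.cos(2 * torch.pi * x))
            + (y ** 2 - A * torch.cos(2 * torch.pi * y)))

def trace(optimizer_cls, func, start, steps=100, **opt_kwargs):
    """Run an optimizer on a 2-D test function and record its trajectory."""
    xy = torch.tensor(start, requires_grad=True)
    optimizer = optimizer_cls([xy], **opt_kwargs)
    path = []
    for _ in range(steps):
        optimizer.zero_grad()
        func(xy).backward()
        optimizer.step()
        path.append(xy.detach().clone())
    return path

path = trace(torch.optim.Adam, rosenbrock, (-2.0, 2.0), lr=0.1)
print(path[-1])  # the closer to (1.0, 1.0), the better
```

The recorded path can then be drawn on top of a contour plot of the function to obtain figures like the ones below.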
2 Results
Below, the results of algorithms from different years on the Rastrigin and Rosenbrock functions are shown as top-down projections of the two surfaces. The green dot marks the optimum; the closer the final coordinates are to the green dot, the better the optimizer performs. Methods that I personally consider to perform well are marked with a * after the method name.
A2GradExp(2018)
Paper: Optimal Adaptive and Accelerated Stochastic Gradient Descent (2018)
rastrigin | rosenbrock |
A2GradInc(2018)
Paper: Optimal Adaptive and Accelerated Stochastic Gradient Descent (2018)
rastrigin | rosenbrock |
A2GradUni(2018)
Paper: Optimal Adaptive and Accelerated Stochastic Gradient Descent (2018)
rastrigin | rosenbrock |
AccSGD(2019)
Paper: On the insufficiency of existing momentum schemes for Stochastic Optimization (2019)
rastrigin | rosenbrock |
AdaBelief(2020)
Paper: AdaBelief Optimizer, adapting stepsizes by the belief in observed gradients (2020)
rastrigin | rosenbrock |
AdaBound(2019)
Paper: Adaptive Gradient Methods with Dynamic Bound of Learning Rate (2019)
rastrigin | rosenbrock |
AdaMod(2019)
Paper: An Adaptive and Momental Bound Method for Stochastic Learning. (2019)
rastrigin | rosenbrock |
Adafactor(2018)
Paper: Adafactor: Adaptive Learning Rates with Sublinear Memory Cost. (2018)
rastrigin | rosenbrock |
AdamP(2020)
Paper: Slowing Down the Weight Norm Increase in Momentum-based Optimizers. (2020)
rastrigin | rosenbrock |
AggMo(2019)
Paper: Aggregated Momentum: Stability Through Passive Damping. (2019)
rastrigin | rosenbrock |
Apollo(2020)
Paper: Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization. (2020)
rastrigin | rosenbrock |
DiffGrad*(2019)
Paper: diffGrad: An Optimization Method for Convolutional Neural Networks. (2019)
Reference Code: https://github.com/shivram1987/diffGrad
rastrigin | rosenbrock |
Lamb(2019)
Paper: Large Batch Optimization for Deep Learning: Training BERT in 76 minutes (2019)
rastrigin | rosenbrock |
Lookahead*(2019)
Paper: Lookahead Optimizer: k steps forward, 1 step back (2019)
Reference Code: https://github.com/alphadl/lookahead.pytorch
Note that, strictly speaking, Lookahead is not an optimizer on its own; it has to work together with another optimizer that performs the actual updates. Here Lookahead is paired with Yogi.
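A minimal sketch of this pairing, assuming the pytorch-optimizer package (the exact default arguments of Lookahead and Yogi may differ from what is shown here):

```python
import torch
import torch.nn.functional as F
import torch_optimizer as optim

model = torch.nn.Linear(10, 1)
# Lookahead wraps a base optimizer that performs the actual updates; Yogi here.
base = optim.Yogi(model.parameters(), lr=1e-2)
optimizer = optim.Lookahead(base, k=5, alpha=0.5)  # sync slow weights every k steps

# The training step looks the same as with any other torch optimizer.
x, y = torch.randn(32, 10), torch.randn(32, 1)
optimizer.zero_grad()
loss = F.mse_loss(model(x), y)
loss.backward()
optimizer.step()
```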
rastrigin | rosenbrock |
NovoGrad(2019)
Paper: Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks (2019)
rastrigin | rosenbrock |
PID(2018)
Paper: A PID Controller Approach for Stochastic Optimization of Deep Networks (2018)
rastrigin | rosenbrock |
QHAdam(2019)
Paper: Quasi-hyperbolic momentum and Adam for deep learning (2019)
rastrigin | rosenbrock |
QHM(2019)
Paper: Quasi-hyperbolic momentum and Adam for deep learning (2019)
rastrigin | rosenbrock |
RAdam*(2019)
Paper: On the Variance of the Adaptive Learning Rate and Beyond (2019)
Reference Code: https://github.com/LiyuanLucasLiu/RAdam
rastrigin | rosenbrock |
Ranger(2019)
Paper: Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM (2019)
rastrigin | rosenbrock |
RangerQH(2019)
Paper: Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM (2019)
rastrigin | rosenbrock |
RangerVA(2019)
Paper: Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM (2019)
rastrigin | rosenbrock |
SGDP(2020)
Paper: Slowing Down the Weight Norm Increase in Momentum-based Optimizers. (2020)
rastrigin | rosenbrock |
SGDW(2017)
Paper: SGDR: Stochastic Gradient Descent with Warm Restarts (2017)
rastrigin | rosenbrock |
SWATS(2017)
Paper: Improving Generalization Performance by Switching from Adam to SGD (2017)
rastrigin | rosenbrock |
Shampoo(2018)
Paper: Shampoo: Preconditioned Stochastic Tensor Optimization (2018)
rastrigin | rosenbrock |
Yogi*(2018)
Paper: Adaptive Methods for Nonconvex Optimization (2018)
Reference Code: https://github.com/4rtemi5/Yogi-Optimizer_Keras
rastrigin | rosenbrock |
Adam
Built into PyTorch
rastrigin | rosenbrock |
SGD
Built into PyTorch
rastrigin | rosenbrock |
3 Evaluation
Looking at the results in Section 2, DiffGrad, Lookahead, RAdam, and Yogi come out reasonably well. However, this kind of visualization should not be taken at face value: the number of optimization steps is small, and the outcome changes dramatically with different data and different learning rates, so picking a suitable optimizer still comes down to tuning for the specific application. For example, SGD and Adam look mediocre in this visualization, yet in practice they are the most thoroughly validated optimizers and give good results across a wide range of tasks: SGD is the famous late-game player, while Adam is the default choice when you do not want to tune hyperparameters. RAdam is quite good, but not as strong as it is sometimes made out to be; see "How should we view the newly proposed Rectified Adam (RAdam)?" for a detailed discussion. DiffGrad and Yogi work well on some tasks and may do worse on others, so in practice they need to be evaluated case by case. Lookahead, proposed jointly by an author of Adam and Hinton, can be wrapped around many optimizers; it does not always help, but it usually brings a small improvement, so it is well worth trying in practice.
Combining these visualization results with practical tuning: first try different learning rates, then try different optimizers. If you are not sure how to tune, my personally recommended order of optimizers is:
- Adam
- Lookahead + (Adam or Yogi or RAdam)
- SGD with momentum
- RAdam,Yogi,DiffGrad
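As an illustrative sketch of how these candidates map onto code (the learning rates and the momentum value are just common starting points, not recommendations from the repository):

```python
import torch
import torch_optimizer as optim

def make_candidates(model, lr=1e-3):
    # Candidate optimizers in my suggested trial order; tune lr per task.
    return {
        "Adam": torch.optim.Adam(model.parameters(), lr=lr),
        "Lookahead(Yogi)": optim.Lookahead(optim.Yogi(model.parameters(), lr=lr)),
        "SGD+momentum": torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9),
        "RAdam": optim.RAdam(model.parameters(), lr=lr),
    }
```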
4 References
- pytorch-optimizer
- pytorch
- How should we view the newly proposed Rectified Adam (RAdam)?