
cs231n-notes-Lecture-7: Introduction and Comparison of Various Optimization Methods

Lecture-7 Training Neural Networks

Optimization

SGD

  • Cons
    1. Very slow progress along shallow dimensions, jitter along steep directions.
    2. Gets stuck at local minima or saddle points; saddle points are much more common in high dimensions.
    3. Gradients come from minibatches, so they can be noisy!
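
The update itself is one line; a minimal numpy sketch on a toy badly conditioned quadratic that shows the shallow-vs-steep problem (the objective, learning rate, and iteration count are only illustrative):

    import numpy as np

    def grad(w):
        # Gradient of 0.5*(w0^2 + 100*w1^2): one shallow and one steep dimension.
        return np.array([1.0, 100.0]) * w

    w = np.array([1.0, 1.0])
    learning_rate = 1e-2
    for t in range(100):
        dw = grad(w)                 # gradient (from a minibatch in practice)
        w -= learning_rate * dw      # vanilla SGD step
    print(w)                         # steep dimension converges fast, shallow one crawls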

SGD + Momentum

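The slide here shows the momentum update: keep a "velocity" as a running mean of gradients and step along it, which damps the jitter along steep directions and speeds up progress along shallow ones. A minimal numpy sketch of the standard formulation (rho is the momentum coefficient, typically around 0.9; the toy quadratic and constants are only illustrative):

    import numpy as np

    def grad(w):                              # toy quadratic gradient for illustration
        return np.array([1.0, 100.0]) * w

    w = np.array([1.0, 1.0])
    v = np.zeros_like(w)                      # velocity
    learning_rate, rho = 1e-2, 0.9
    for t in range(100):
        dw = grad(w)
        v = rho * v + dw                      # build up velocity as a running mean of gradients
        w -= learning_rate * v
    print(w)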

Nesterov Momentum

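The slide shows Nesterov momentum: evaluate the gradient at the "look-ahead" point w + rho*v rather than at w. A sketch of the rearranged form used in the lecture, where the look-ahead is folded into the parameter update (same toy quadratic, illustrative constants):

    import numpy as np

    def grad(w):                              # toy quadratic gradient for illustration
        return np.array([1.0, 100.0]) * w

    w = np.array([1.0, 1.0])                  # w here plays the role of the look-ahead variable
    v = np.zeros_like(w)
    learning_rate, rho = 1e-2, 0.9
    for t in range(100):
        dw = grad(w)
        old_v = v
        v = rho * v - learning_rate * dw
        w += -rho * old_v + (1 + rho) * v     # rearranged Nesterov step
    print(w)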

AdaGrad

  • The per-parameter step size becomes smaller and smaller because grad_squared only ever increases.
  • The update is damped along the wiggling dimensions (large gradients) and relatively amplified along the shallow dimensions (small gradients).
  • Not commonly used in practice: the effective step size decays toward zero, so progress slows down and training can get stuck.
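
A minimal sketch of the AdaGrad update, showing the ever-growing grad_squared accumulator (toy quadratic, illustrative constants):

    import numpy as np

    def grad(w):                              # toy quadratic gradient for illustration
        return np.array([1.0, 100.0]) * w

    w = np.array([1.0, 1.0])
    grad_squared = np.zeros_like(w)
    learning_rate = 1e-2
    for t in range(100):
        dw = grad(w)
        grad_squared += dw * dw               # accumulates forever, so steps keep shrinking
        w -= learning_rate * dw / (np.sqrt(grad_squared) + 1e-7)
    print(w)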

RMSProp

  • decay_rate: commonly 0.9 or 0.99
  • Solves AdaGrad's problem of slowing down in every dimension: because old gradients decay away, the step size only shrinks along dimensions whose recent gradients are large.
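
A minimal sketch of the RMSProp update; the only change from AdaGrad is the leaky, decaying accumulator (toy quadratic, illustrative constants):

    import numpy as np

    def grad(w):                              # toy quadratic gradient for illustration
        return np.array([1.0, 100.0]) * w

    w = np.array([1.0, 1.0])
    grad_squared = np.zeros_like(w)
    learning_rate, decay_rate = 1e-2, 0.9
    for t in range(100):
        dw = grad(w)
        grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dw * dw  # leaky accumulator
        w -= learning_rate * dw / (np.sqrt(grad_squared) + 1e-7)
    print(w)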

Adam

  • Sort of like RMSProp with momentum.
  • Bias correction compensates for the first and second moment estimates starting at zero, so the optimizer does not take a huge step in the very first iterations.
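
A minimal sketch of the full Adam update with bias correction (toy quadratic; beta1 = 0.9 and beta2 = 0.999 are the usual defaults, but all constants here are illustrative):

    import numpy as np

    def grad(w):                              # toy quadratic gradient for illustration
        return np.array([1.0, 100.0]) * w

    w = np.array([1.0, 1.0])
    first_moment = np.zeros_like(w)
    second_moment = np.zeros_like(w)
    learning_rate, beta1, beta2 = 1e-2, 0.9, 0.999
    for t in range(1, 101):                   # t starts at 1 for the bias correction
        dw = grad(w)
        first_moment = beta1 * first_moment + (1 - beta1) * dw          # momentum-like term
        second_moment = beta2 * second_moment + (1 - beta2) * dw * dw   # RMSProp-like term
        first_unbias = first_moment / (1 - beta1 ** t)                  # bias correction
        second_unbias = second_moment / (1 - beta2 ** t)
        w -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)
    print(w)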

Learning rate decay

  • Common with SGD (and SGD+Momentum), but rarely used with Adam.
  • Plot the loss curve first and decide whether decay is needed (e.g. decay when the loss plateaus).
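
A sketch of two common schedules, step decay and exponential decay; the drop factor, interval, and decay constant are purely illustrative:

    import numpy as np

    base_lr = 1e-1

    def step_decay(epoch, drop=0.5, epochs_per_drop=10):
        # Halve the learning rate every 10 epochs (illustrative constants).
        return base_lr * drop ** (epoch // epochs_per_drop)

    def exp_decay(epoch, k=0.05):
        # Smooth exponential decay.
        return base_lr * np.exp(-k * epoch)

    for epoch in (0, 10, 20, 30):
        print(epoch, step_decay(epoch), exp_decay(epoch))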

Second-order Optimization

  • Quasi-Newton methods (BFGS is the most popular): instead of inverting the Hessian (O(n^3)), approximate the inverse Hessian with rank-1 updates over time (O(n^2) each).
  • L-BFGS (Limited-memory BFGS): does not form/store the full inverse Hessian.
  • Usually works very well in full-batch, deterministic mode, i.e. if you have a single deterministic f(x), then L-BFGS will probably work very nicely.
  • Does not transfer well to the mini-batch setting and tends to give bad results. Adapting L-BFGS to the large-scale, stochastic setting is an active area of research.
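
Not from the lecture, but a minimal full-batch example using SciPy's L-BFGS-B implementation on a deterministic toy objective, i.e. exactly the setting where L-BFGS works well:

    import numpy as np
    from scipy.optimize import minimize

    # Full-batch, deterministic toy objective: a badly conditioned quadratic.
    A = np.diag([1.0, 10.0, 100.0])

    def f(x):
        return 0.5 * x @ A @ x

    def grad_f(x):
        return A @ x

    x0 = np.ones(3)
    result = minimize(f, x0, jac=grad_f, method='L-BFGS-B')
    print(result.x)   # very close to the optimum at the origin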

Model Ensembles

  1. Train multiple independent models
  2. At test time average their results
Enjoy 2% extra performance
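
A minimal sketch of test-time averaging; the ConstantModel class and its predict_proba interface are stand-ins for real trained models:

    import numpy as np

    class ConstantModel:
        # Stand-in for a trained model; a real model would return softmax scores for x.
        def __init__(self, probs):
            self.probs = np.asarray(probs)
        def predict_proba(self, x):
            return self.probs

    def ensemble_predict(models, x):
        # Average class probabilities of independently trained models, then take the argmax.
        probs = np.mean([m.predict_proba(x) for m in models], axis=0)
        return np.argmax(probs, axis=-1)

    models = [ConstantModel([0.7, 0.3]), ConstantModel([0.4, 0.6]), ConstantModel([0.6, 0.4])]
    print(ensemble_predict(models, x=None))   # -> 0 (averaged probs are about [0.57, 0.43])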

Tips and Tricks

  • Instead of training independent models, use multiple snapshots of a single model during training!
Loshchilov and Hutter, “SGDR: Stochastic gradient descent with restarts”, arXiv 2016
Huang et al., “Snapshot ensembles: train 1, get M for free”, ICLR 2017
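
A sketch of the cyclic cosine learning rate schedule behind this trick, in the spirit of SGDR / snapshot ensembles; base_lr, cycle_len, and num_cycles are illustrative, and the actual training step is omitted:

    import numpy as np

    base_lr, cycle_len, num_cycles = 0.1, 10, 5   # illustrative constants

    def cyclic_lr(epoch):
        # Cosine annealing within each cycle: lr restarts at base_lr and decays toward 0.
        t = (epoch % cycle_len) / cycle_len
        return 0.5 * base_lr * (1 + np.cos(np.pi * t))

    for epoch in range(cycle_len * num_cycles):
        lr = cyclic_lr(epoch)
        # ... train for one epoch with this lr (omitted) ...
        if (epoch + 1) % cycle_len == 0:
            print(f"epoch {epoch}: lr={lr:.4f} -> save a snapshot for the ensemble")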

Improve single-model performance

Regularization

  • Add a regularization term to the loss; the most common choice is L2 regularization (λ·ΣW²) added to the data loss.
  • Dropout (two explanations)
    • Forces the network to have a redundant representation; prevents co-adaptation of features.
    • Dropout trains a large ensemble of models that share parameters (see the inverted-dropout sketch after this list).
  • Data augmentation
    • Horizontal flips
    • Random crops and scales
    • Color jitter
    • Translation
    • Rotation
    • Stretching
    • Shearing
    • Lens distortions
  • DropConnect
Wan et al, “Regularization of Neural Networks using DropConnect”, ICML 2013
  • Fractional Max Pooling
Graham, “Fractional Max Pooling”, arXiv 2014
  • Stochastic Depth
Huang et al, “Deep Networks with Stochastic Depth”, ECCV 2016
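
A minimal sketch of inverted dropout on a single hidden layer, close to the version in the cs231n notes; p is the keep probability, and the weights and inputs are toy values:

    import numpy as np

    p = 0.5                                   # probability of keeping a unit active
    W = np.random.randn(20, 10) * 0.01        # toy weights
    x = np.random.randn(5, 20)                # toy minibatch of 5 inputs

    def forward_train(x):
        h = np.maximum(0, x @ W)              # ReLU hidden layer
        mask = (np.random.rand(*h.shape) < p) / p   # inverted dropout: scale at train time
        return h * mask

    def forward_test(x):
        return np.maximum(0, x @ W)           # no dropout and no extra scaling at test time

    print(forward_train(x).shape, forward_test(x).shape)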
