
cs231n-notes-Lecture-7: Introduction and Comparison of Various Optimization Methods

Lecture-7 Training Neural Networks

Optimization

SGD

  • Cons
    1. Very slow progress along shallow dimensions, jitter along steep directions.
    2. Gets stuck at local minima or saddle points; saddle points are much more common in high dimensions.
    3. Gradients come from minibatches, so they can be noisy!
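
The update itself is one line; a minimal numpy sketch on a toy badly conditioned quadratic that shows the shallow-vs-steep problem (the objective, learning rate, and iteration count are only illustrative):

    import numpy as np

    def grad(w):
        # Gradient of 0.5*(w0^2 + 100*w1^2): one shallow and one steep dimension.
        return np.array([1.0, 100.0]) * w

    w = np.array([1.0, 1.0])
    learning_rate = 1e-2
    for t in range(100):
        dw = grad(w)                 # gradient (from a minibatch in practice)
        w -= learning_rate * dw      # vanilla SGD step
    print(w)                         # steep dimension converges fast, shallow one crawls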

SGD + Momentum

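The slide here shows the momentum update: keep a "velocity" as a running mean of gradients and step along it, which damps the jitter along steep directions and speeds up progress along shallow ones. A minimal numpy sketch of the standard formulation (rho is the momentum coefficient, typically around 0.9; the toy quadratic and constants are only illustrative):

    import numpy as np

    def grad(w):                              # toy quadratic gradient for illustration
        return np.array([1.0, 100.0]) * w

    w = np.array([1.0, 1.0])
    v = np.zeros_like(w)                      # velocity
    learning_rate, rho = 1e-2, 0.9
    for t in range(100):
        dw = grad(w)
        v = rho * v + dw                      # build up velocity as a running mean of gradients
        w -= learning_rate * v
    print(w)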

Nesterov Momentum

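The slide shows Nesterov momentum: evaluate the gradient at the "look-ahead" point w + rho*v rather than at w. A sketch of the rearranged form used in the lecture, where the look-ahead is folded into the parameter update (same toy quadratic, illustrative constants):

    import numpy as np

    def grad(w):                              # toy quadratic gradient for illustration
        return np.array([1.0, 100.0]) * w

    w = np.array([1.0, 1.0])                  # w here plays the role of the look-ahead variable
    v = np.zeros_like(w)
    learning_rate, rho = 1e-2, 0.9
    for t in range(100):
        dw = grad(w)
        old_v = v
        v = rho * v - learning_rate * dw
        w += -rho * old_v + (1 + rho) * v     # rearranged Nesterov step
    print(w)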

AdaGrad

  • The per-parameter step size becomes smaller and smaller because grad_squared only ever increases.
  • The update is damped along the wiggling dimensions (large gradients) and relatively amplified along the shallow dimensions (small gradients).
  • Not commonly used in practice: the effective step size decays toward zero, so progress slows down and training can get stuck.
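
A minimal sketch of the AdaGrad update, showing the ever-growing grad_squared accumulator (toy quadratic, illustrative constants):

    import numpy as np

    def grad(w):                              # toy quadratic gradient for illustration
        return np.array([1.0, 100.0]) * w

    w = np.array([1.0, 1.0])
    grad_squared = np.zeros_like(w)
    learning_rate = 1e-2
    for t in range(100):
        dw = grad(w)
        grad_squared += dw * dw               # accumulates forever, so steps keep shrinking
        w -= learning_rate * dw / (np.sqrt(grad_squared) + 1e-7)
    print(w)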

RMSProp

  • decay_rate: commonly 0.9 or 0.99
  • Solves AdaGrad's problem of slowing down in every dimension: because old gradients decay away, the step size only shrinks along dimensions whose recent gradients are large.
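
A minimal sketch of the RMSProp update; the only change from AdaGrad is the leaky, decaying accumulator (toy quadratic, illustrative constants):

    import numpy as np

    def grad(w):                              # toy quadratic gradient for illustration
        return np.array([1.0, 100.0]) * w

    w = np.array([1.0, 1.0])
    grad_squared = np.zeros_like(w)
    learning_rate, decay_rate = 1e-2, 0.9
    for t in range(100):
        dw = grad(w)
        grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dw * dw  # leaky accumulator
        w -= learning_rate * dw / (np.sqrt(grad_squared) + 1e-7)
    print(w)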

Adam

  • Sort of like RMSProp with momentum.
  • Bias correction compensates for the first and second moment estimates starting at zero, so the optimizer does not take a huge step in the very first iterations.
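
A minimal sketch of the full Adam update with bias correction (toy quadratic; beta1 = 0.9 and beta2 = 0.999 are the usual defaults, but all constants here are illustrative):

    import numpy as np

    def grad(w):                              # toy quadratic gradient for illustration
        return np.array([1.0, 100.0]) * w

    w = np.array([1.0, 1.0])
    first_moment = np.zeros_like(w)
    second_moment = np.zeros_like(w)
    learning_rate, beta1, beta2 = 1e-2, 0.9, 0.999
    for t in range(1, 101):                   # t starts at 1 for the bias correction
        dw = grad(w)
        first_moment = beta1 * first_moment + (1 - beta1) * dw          # momentum-like term
        second_moment = beta2 * second_moment + (1 - beta2) * dw * dw   # RMSProp-like term
        first_unbias = first_moment / (1 - beta1 ** t)                  # bias correction
        second_unbias = second_moment / (1 - beta2 ** t)
        w -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)
    print(w)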

Learning rate decay

  • Common with SGD (and SGD+Momentum), but rarely used with Adam.
  • Plot the loss curve first and decide whether decay is needed (e.g. decay when the loss plateaus).
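
A sketch of two common schedules, step decay and exponential decay; the drop factor, interval, and decay constant are purely illustrative:

    import numpy as np

    base_lr = 1e-1

    def step_decay(epoch, drop=0.5, epochs_per_drop=10):
        # Halve the learning rate every 10 epochs (illustrative constants).
        return base_lr * drop ** (epoch // epochs_per_drop)

    def exp_decay(epoch, k=0.05):
        # Smooth exponential decay.
        return base_lr * np.exp(-k * epoch)

    for epoch in (0, 10, 20, 30):
        print(epoch, step_decay(epoch), exp_decay(epoch))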

Second-order Optimization

  • Quasi-Newton methods (BFGS is the most popular): instead of inverting the Hessian (O(n^3)), approximate the inverse Hessian with rank-1 updates over time (O(n^2) each).
  • L-BFGS (Limited-memory BFGS): does not form/store the full inverse Hessian.
  • Usually works very well in full-batch, deterministic mode, i.e. if you have a single deterministic f(x), then L-BFGS will probably work very nicely.
  • Does not transfer well to the mini-batch setting and tends to give bad results. Adapting L-BFGS to the large-scale, stochastic setting is an active area of research.
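
Not from the lecture, but a minimal full-batch example using SciPy's L-BFGS-B implementation on a deterministic toy objective, i.e. exactly the setting where L-BFGS works well:

    import numpy as np
    from scipy.optimize import minimize

    # Full-batch, deterministic toy objective: a badly conditioned quadratic.
    A = np.diag([1.0, 10.0, 100.0])

    def f(x):
        return 0.5 * x @ A @ x

    def grad_f(x):
        return A @ x

    x0 = np.ones(3)
    result = minimize(f, x0, jac=grad_f, method='L-BFGS-B')
    print(result.x)   # very close to the optimum at the origin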

Model Ensembles

  1. Train multiple independent models
  2. At test time average their results
Enjoy 2% extra performance
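
A minimal sketch of test-time averaging; the ConstantModel class and its predict_proba interface are stand-ins for real trained models:

    import numpy as np

    class ConstantModel:
        # Stand-in for a trained model; a real model would return softmax scores for x.
        def __init__(self, probs):
            self.probs = np.asarray(probs)
        def predict_proba(self, x):
            return self.probs

    def ensemble_predict(models, x):
        # Average class probabilities of independently trained models, then take the argmax.
        probs = np.mean([m.predict_proba(x) for m in models], axis=0)
        return np.argmax(probs, axis=-1)

    models = [ConstantModel([0.7, 0.3]), ConstantModel([0.4, 0.6]), ConstantModel([0.6, 0.4])]
    print(ensemble_predict(models, x=None))   # -> 0 (averaged probs are about [0.57, 0.43])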

Tips and Tricks

  • Instead of training independent models, use multiple snapshots of a single model during training!
Loshchilov and Hutter, “SGDR: Stochastic gradient descent with restarts”, arXiv 2016
Huang et al., “Snapshot ensembles: train 1, get M for free”, ICLR 2017
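
A sketch of the cyclic cosine learning rate schedule behind this trick, in the spirit of SGDR / snapshot ensembles; base_lr, cycle_len, and num_cycles are illustrative, and the actual training step is omitted:

    import numpy as np

    base_lr, cycle_len, num_cycles = 0.1, 10, 5   # illustrative constants

    def cyclic_lr(epoch):
        # Cosine annealing within each cycle: lr restarts at base_lr and decays toward 0.
        t = (epoch % cycle_len) / cycle_len
        return 0.5 * base_lr * (1 + np.cos(np.pi * t))

    for epoch in range(cycle_len * num_cycles):
        lr = cyclic_lr(epoch)
        # ... train for one epoch with this lr (omitted) ...
        if (epoch + 1) % cycle_len == 0:
            print(f"epoch {epoch}: lr={lr:.4f} -> save a snapshot for the ensemble")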

Improve single-model performance

Regularization

  • Add a regularization term to the loss; the most common choice is L2 regularization (λ·ΣW²) added to the data loss.
  • Dropout (two explanations)
    • Forces the network to have a redundant representation; prevents co-adaptation of features.
    • Dropout trains a large ensemble of models that share parameters (see the inverted-dropout sketch after this list).
  • Data augmentation
    • Horizontal flips
    • Random crops and scales
    • Color jitter
    • Translation
    • Rotation
    • Stretching
    • Shearing
    • Lens distortions
  • DropConnect
Wan et al, “Regularization of Neural Networks using DropConnect”, ICML 2013
  • Fractional Max Pooling
Graham, “Fractional Max Pooling”, arXiv 2014
  • Stochastic Depth
Huang et al, “Deep Networks with Stochastic Depth”, ECCV 2016
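
A minimal sketch of inverted dropout on a single hidden layer, close to the version in the cs231n notes; p is the keep probability, and the weights and inputs are toy values:

    import numpy as np

    p = 0.5                                   # probability of keeping a unit active
    W = np.random.randn(20, 10) * 0.01        # toy weights
    x = np.random.randn(5, 20)                # toy minibatch of 5 inputs

    def forward_train(x):
        h = np.maximum(0, x @ W)              # ReLU hidden layer
        mask = (np.random.rand(*h.shape) < p) / p   # inverted dropout: scale at train time
        return h * mask

    def forward_test(x):
        return np.maximum(0, x @ W)           # no dropout and no extra scaling at test time

    print(forward_train(x).shape, forward_test(x).shape)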
