
Papers Notes_7_ Batch Normalization--Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

  • Introduction
  • Batch Normalization
    • algorithm
    • how it works
    • benefits
  • References

Introduction

  1. mini-batch

    the mini-batch gradient is an estimate of the gradient over the training set; its quality improves as the batch size increases

    computation over a batch is more efficient than m separate computations for individual examples

  2. fix distribution of inputs

    a layer with a sigmoid activation function $z = g(Wu + b)$

    $g(x) = \frac{1}{1 + \exp(-x)}$; as $|x|$ increases, $g'(x)$ tends to zero (a quick numeric check follows this list)

    so for $x = Wu + b$, except for small absolute values of $x$, the gradient will vanish and the model trains slowly

    in practice, the saturation problem and the resulting vanishing gradients are usually addressed by ReLU, careful initialization, and small learning rates

    ensuring the distribution of nonlinearity inputs remains stable → the optimizer is less likely to get stuck in the saturated regime
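A quick numeric check of the saturation claim (a minimal NumPy sketch, not from the paper; it uses the standard sigmoid derivative $g'(x) = g(x)(1 - g(x))$):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # g'(x) = g(x) * (1 - g(x))

# As |x| grows, the gradient quickly approaches zero (saturation).
for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:4.1f}   g'(x) = {sigmoid_grad(x):.6f}")
# x =  0.0   g'(x) = 0.250000
# x =  2.0   g'(x) = 0.104994
# x =  5.0   g'(x) = 0.006648
# x = 10.0   g'(x) = 0.000045
```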

Batch Normalization

algorithm


notation $y = BN_{\gamma,\beta}(x)$ indicates that the parameters $\gamma$ and $\beta$ are to be learned
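The BN transform (Algorithm 1 in the paper) can be sketched as a training-time forward pass; this is a minimal NumPy version, with `eps` and the example shapes chosen here only for illustration:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch Normalizing Transform y = BN_{gamma,beta}(x) over a mini-batch.

    x:     (m, d) mini-batch of activations (m examples, d features)
    gamma: (d,)   learned scale
    beta:  (d,)   learned shift
    """
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale and shift

# Example: 4 examples, 3 features; gamma/beta start as the identity transform.
x = np.random.randn(4, 3) * 5.0 + 2.0
y = batch_norm_forward(x, gamma=np.ones(3), beta=np.zeros(3))
print(y.mean(axis=0), y.std(axis=0))  # per-feature mean ~ 0, std ~ 1
```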

how it works

an affine transformation followed by an element-wise nonlinearity: $z = g(Wu + b)$

add the BN transform immediately before the nonlinearity, by normalizing $x = Wu + b$

in the experiments, BN is applied before the nonlinearity, which yields a more stable input distribution

in contrast, applying BN to the outputs of the nonlinearity results in sparser activations (reported in others’ work)
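As an illustration of this placement, a minimal PyTorch-style sketch (PyTorch and the layer sizes are assumptions, not from the notes); the paper points out that the bias $b$ can be dropped since BN's learned shift $\beta$ subsumes it:

```python
import torch.nn as nn

# A hidden layer with BN inserted immediately before the nonlinearity: z = g(BN(Wu)).
# bias=False because BN's learned shift (beta) subsumes the bias b.
layer = nn.Sequential(
    nn.Linear(256, 128, bias=False),  # affine transform Wu
    nn.BatchNorm1d(128),              # BN transform with learned gamma, beta
    nn.ReLU(),                        # element-wise nonlinearity g
)
```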

benefits

  1. enable higher learning rates

    a too-high learning rate may cause gradients to explode or vanish, as well as getting stuck in poor local minima

    BN prevents small changes to the parameters from amplifying into larger, suboptimal changes in activations and gradients

    BN makes training more resilient to the parameter scale (a large learning rate may increase the scale of layer parameters): $BN(Wu) = BN((aW)u)$ for a scalar $a$ (see the sketch after this list)

  2. regularize the model

    a training example is seen in conjunction with other examples in the mini-batch

    the training network no longer produces deterministic values for a given training example

    → batch normalization can replace dropout (dropout is typically used to reduce overfitting)
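The scale-invariance identity $BN(Wu) = BN((aW)u)$ from benefit 1 can be checked numerically; a minimal sketch with made-up shapes and a positive scalar $a$ (the small `eps` makes the match approximate rather than exact):

```python
import numpy as np

def bn(x, eps=1e-5):
    # Normalization step of BN (gamma=1, beta=0), per feature over the mini-batch.
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(0)
u = rng.normal(size=(8, 16))   # mini-batch of 8 inputs, 16 features
W = rng.normal(size=(16, 4))   # layer weights mapping 16 -> 4 features
a = 10.0                       # positive scalar rescaling the weights

# The per-feature normalization cancels the scale a, so a larger parameter
# scale does not amplify the normalized activations.
print(np.allclose(bn(u @ W), bn(u @ (a * W)), atol=1e-5))  # True
```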

References

Sergey Ioffe, Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015.
