Paper Notes 7: Batch Normalization -- Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
- Introduction
- Batch Normalization
  - algorithm
  - works
  - benefit
- References
Introduction
- mini-batch
  - the gradient of the loss over a mini-batch estimates the gradient over the training set, and the quality of the estimate improves as the batch size increases (see the toy check after this list)
  - computing over a batch of size m is more efficient than m separate computations, thanks to parallelism
- fix the distribution of inputs to each layer
  - consider a layer with a sigmoid activation function $z = g(Wu + b)$
  - with $g(x) = \frac{1}{1 + \exp(-x)}$, $g'(x)$ tends to zero as $|x|$ increases (see the numerical check below)
  - so for $x = Wu + b$, everywhere except at small absolute values the gradient vanishes and the model trains slowly
  - in practice, the saturation problem and the resulting vanishing gradients are usually addressed with ReLU, careful initialization, and small learning rates
  - keeping the distribution of nonlinearity inputs stable → the optimizer is less likely to get stuck in the saturated regime
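A toy check of the mini-batch point above (a sketch with made-up data, not from the paper): the spread of the mini-batch gradient estimate around the full-training-set gradient shrinks as the batch size m grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data and a scalar least-squares model: loss_i = 0.5 * (w * x_i - y_i)^2
x = rng.normal(size=10_000)
y = 3.0 * x + rng.normal(scale=0.1, size=10_000)
w = 1.0

per_example_grad = (w * x - y) * x     # d loss_i / d w for every example
full_grad = per_example_grad.mean()    # gradient over the whole training set

for m in (1, 10, 100, 1000):
    # spread of the mini-batch gradient estimate around the full-set gradient
    estimates = [rng.choice(per_example_grad, size=m).mean() for _ in range(200)]
    print(m, np.std(np.array(estimates) - full_grad))
```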
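And a quick numerical look at the saturation argument for the sigmoid: $g'(x) = g(x)(1 - g(x))$ shrinks rapidly as $|x|$ grows.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (0.0, 2.0, 5.0, 10.0):
    print(f"|x| = {x:4.1f}  g'(x) = {sigmoid_grad(x):.6f}")
# |x| =  0.0  g'(x) = 0.250000
# |x| =  2.0  g'(x) = 0.104994
# |x| =  5.0  g'(x) = 0.006648
# |x| = 10.0  g'(x) = 0.000045
```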
Batch Normalization
algorithm

the notation $y = BN_{\gamma,\beta}(x)$ indicates that the parameters $\gamma$ and $\beta$ are to be learned
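For reference, a minimal NumPy sketch of the batch-normalizing transform (Algorithm 1 in the paper), applied per activation over a mini-batch; the function and variable names here are my own.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """BN transform for a mini-batch x of shape (m, d), per activation (column)."""
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale and shift with learned gamma, beta

x = np.random.default_rng(0).normal(size=(32, 4))
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))  # ~0 mean, ~unit std per column
```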
works
- a layer computes an affine transformation followed by an element-wise nonlinearity: $z = g(Wu + b)$
- the BN transform is added immediately before the nonlinearity, by normalizing $x = Wu + b$ (sketched below); the bias $b$ can be dropped, since the mean subtraction cancels it and its role is subsumed by $\beta$
- in experiments, applying BN before the nonlinearity results in a more stable distribution
- in contrast, applying BN to the outputs of the nonlinearity results in sparser activations (reported in others' work)
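A sketch of this placement using PyTorch modules as stand-ins (layer sizes are arbitrary): the affine transform feeds into BN, which feeds into the nonlinearity, and the bias is dropped since $\beta$ subsumes it.

```python
import torch
import torch.nn as nn

# x = Wu  ->  y = BN_{gamma,beta}(x)  ->  z = g(y)
layer = nn.Sequential(
    nn.Linear(256, 128, bias=False),  # affine transform Wu (bias omitted, beta takes its role)
    nn.BatchNorm1d(128),              # BN immediately before the nonlinearity
    nn.Sigmoid(),                     # element-wise nonlinearity g
)

u = torch.randn(32, 256)  # a mini-batch of 32 inputs
z = layer(u)
print(z.shape)            # torch.Size([32, 128])
```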
benefit
- enables higher learning rates
  - a too-high learning rate may cause gradients to explode or vanish, or get training stuck in poor local minima
  - BN prevents small changes to the parameters from amplifying into larger, suboptimal changes in activations and gradients
  - BN makes training more resilient to the parameter scale (a large learning rate may increase the scale of layer parameters): $BN(Wu) = BN((aW)u)$ (see the check after this list)
- regularizes the model
  - a training example is seen in conjunction with the other examples in its mini-batch
  - the network therefore no longer produces deterministic values for a given training example (see the toy illustration below)
  - → batch normalization reduces the need for dropout (which is typically used to reduce overfitting) and can sometimes replace it
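A quick numerical check of the scale-invariance property $BN(Wu) = BN((aW)u)$ mentioned above (a sketch; $\gamma = 1$, $\beta = 0$, and the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
u = rng.normal(size=(32, 8))   # a mini-batch of 32 inputs

def bn(x, eps=1e-5):
    # normalization only (gamma = 1, beta = 0), per activation over the batch
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

a = 10.0
print(np.allclose(bn(u @ W.T), bn(u @ (a * W).T), atol=1e-4))  # True: the scale a cancels
```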
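And a toy illustration of the regularization point: in training mode, the normalized value of one fixed example depends on which other examples happen to share its mini-batch.

```python
import numpy as np

rng = np.random.default_rng(0)

def bn(x, eps=1e-5):
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

x0 = rng.normal(size=(1, 4))                         # one fixed training example
batch_a = np.vstack([x0, rng.normal(size=(15, 4))])  # ...placed in two different mini-batches
batch_b = np.vstack([x0, rng.normal(size=(15, 4))])

print(bn(batch_a)[0])
print(bn(batch_b)[0])  # different normalized values for the same example
```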
References
Ioffe, S. and Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015. arXiv:1502.03167