Paper Notes 7: Batch Normalization -- Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
- Introduction
- Batch Normalization
  - algorithm
  - works
  - benefit
- References
Introduction
- mini-batch
  - the gradient of the loss over a mini-batch estimates the gradient over the training set, and the quality of the estimate improves as the batch size increases (see the toy check after this list)
  - computing over a batch of size m is more efficient than m separate computations, thanks to parallelism
- fix the distribution of inputs to each layer
  - consider a layer with a sigmoid activation function $z = g(Wu + b)$
  - with $g(x) = \frac{1}{1 + \exp(-x)}$, $g'(x)$ tends to zero as $|x|$ increases (see the numerical check below)
  - so for $x = Wu + b$, everywhere except at small absolute values the gradient vanishes and the model trains slowly
  - in practice, the saturation problem and the resulting vanishing gradients are usually addressed with ReLU, careful initialization, and small learning rates
  - keeping the distribution of nonlinearity inputs stable → the optimizer is less likely to get stuck in the saturated regime
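A toy check of the mini-batch point above (a sketch with made-up data, not from the paper): the spread of the mini-batch gradient estimate around the full-training-set gradient shrinks as the batch size m grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data and a scalar least-squares model: loss_i = 0.5 * (w * x_i - y_i)^2
x = rng.normal(size=10_000)
y = 3.0 * x + rng.normal(scale=0.1, size=10_000)
w = 1.0

per_example_grad = (w * x - y) * x     # d loss_i / d w for every example
full_grad = per_example_grad.mean()    # gradient over the whole training set

for m in (1, 10, 100, 1000):
    # spread of the mini-batch gradient estimate around the full-set gradient
    estimates = [rng.choice(per_example_grad, size=m).mean() for _ in range(200)]
    print(m, np.std(np.array(estimates) - full_grad))
```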
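And a quick numerical look at the saturation argument for the sigmoid: $g'(x) = g(x)(1 - g(x))$ shrinks rapidly as $|x|$ grows.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (0.0, 2.0, 5.0, 10.0):
    print(f"|x| = {x:4.1f}  g'(x) = {sigmoid_grad(x):.6f}")
# |x| =  0.0  g'(x) = 0.250000
# |x| =  2.0  g'(x) = 0.104994
# |x| =  5.0  g'(x) = 0.006648
# |x| = 10.0  g'(x) = 0.000045
```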
Batch Normalization
algorithm

the notation $y = BN_{\gamma,\beta}(x)$ indicates that the parameters $\gamma$ and $\beta$ are to be learned
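For reference, a minimal NumPy sketch of the batch-normalizing transform (Algorithm 1 in the paper), applied per activation over a mini-batch; the function and variable names here are my own.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """BN transform for a mini-batch x of shape (m, d), per activation (column)."""
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale and shift with learned gamma, beta

x = np.random.default_rng(0).normal(size=(32, 4))
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))  # ~0 mean, ~unit std per column
```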
works
- a layer computes an affine transformation followed by an element-wise nonlinearity: $z = g(Wu + b)$
- the BN transform is added immediately before the nonlinearity, by normalizing $x = Wu + b$ (sketched below); the bias $b$ can be dropped, since the mean subtraction cancels it and its role is subsumed by $\beta$
- in experiments, applying BN before the nonlinearity results in a more stable distribution
- in contrast, applying BN to the outputs of the nonlinearity results in sparser activations (reported in others' work)
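A sketch of this placement using PyTorch modules as stand-ins (layer sizes are arbitrary): the affine transform feeds into BN, which feeds into the nonlinearity, and the bias is dropped since $\beta$ subsumes it.

```python
import torch
import torch.nn as nn

# x = Wu  ->  y = BN_{gamma,beta}(x)  ->  z = g(y)
layer = nn.Sequential(
    nn.Linear(256, 128, bias=False),  # affine transform Wu (bias omitted, beta takes its role)
    nn.BatchNorm1d(128),              # BN immediately before the nonlinearity
    nn.Sigmoid(),                     # element-wise nonlinearity g
)

u = torch.randn(32, 256)  # a mini-batch of 32 inputs
z = layer(u)
print(z.shape)            # torch.Size([32, 128])
```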
benefit
- enables higher learning rates
  - a too-high learning rate may cause gradients to explode or vanish, or get training stuck in poor local minima
  - BN prevents small changes to the parameters from amplifying into larger, suboptimal changes in activations and gradients
  - BN makes training more resilient to the parameter scale (a large learning rate may increase the scale of layer parameters): $BN(Wu) = BN((aW)u)$ (see the check after this list)
- regularizes the model
  - a training example is seen in conjunction with the other examples in its mini-batch
  - the network therefore no longer produces deterministic values for a given training example (see the toy illustration below)
  - → batch normalization reduces the need for dropout (which is typically used to reduce overfitting) and can sometimes replace it
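A quick numerical check of the scale-invariance property $BN(Wu) = BN((aW)u)$ mentioned above (a sketch; $\gamma = 1$, $\beta = 0$, and the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
u = rng.normal(size=(32, 8))   # a mini-batch of 32 inputs

def bn(x, eps=1e-5):
    # normalization only (gamma = 1, beta = 0), per activation over the batch
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

a = 10.0
print(np.allclose(bn(u @ W.T), bn(u @ (a * W).T), atol=1e-4))  # True: the scale a cancels
```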
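And a toy illustration of the regularization point: in training mode, the normalized value of one fixed example depends on which other examples happen to share its mini-batch.

```python
import numpy as np

rng = np.random.default_rng(0)

def bn(x, eps=1e-5):
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

x0 = rng.normal(size=(1, 4))                         # one fixed training example
batch_a = np.vstack([x0, rng.normal(size=(15, 4))])  # ...placed in two different mini-batches
batch_b = np.vstack([x0, rng.normal(size=(15, 4))])

print(bn(batch_a)[0])
print(bn(batch_b)[0])  # different normalized values for the same example
```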
References
Ioffe, S. and Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015. arXiv:1502.03167