
A Brief Look at Batch Normalization & Dropout

1. Batch Normalization

Deep neural networks can sometimes be hard to train to a good fit. One remedy is to use more advanced optimization algorithms such as SGD+momentum, RMSProp, or Adam. Another strategy is to change the structure of the network so that it becomes easier to train; Batch Normalization follows this second idea.

Why normalize?

At its core, a neural network learns the distribution of its data. If the training data and the test data follow different distributions, the network's ability to generalize drops sharply. On top of that, if every mini-batch has a different distribution (as in mini-batch gradient descent), the network must adapt to a new distribution at each iteration, which greatly slows down training.

Machine learning methods work better when the input features are uncorrelated, zero-mean and unit-variance, so when training a network we can preprocess the data so that it satisfies such a distribution. However, even if the inputs are preprocessed properly, after passing through several layers of nonlinear activations the data may no longer be uncorrelated, nor zero-mean and unit-variance, which makes the later layers harder to fit. Worse still, during training the feature distribution at every layer keeps changing as the weights of each layer are updated.
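
For example, a common preprocessing step standardizes each input feature to zero mean and unit variance using statistics computed on the training set. A minimal NumPy sketch (with placeholder data) looks like this:

import numpy as np

X_train = np.random.randn(1000, 20) * 3.0 + 7.0   # placeholder training data
X_test = np.random.randn(200, 20) * 3.0 + 7.0     # placeholder test data

mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0) + 1e-8                # small epsilon avoids division by zero

X_train_std = (X_train - mu) / sigma
X_test_std = (X_test - mu) / sigma                # reuse the training statistics at test time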

These shifting feature distributions make deep networks harder to train. To overcome this, Batch Normalization (BN) layers are inserted into the network. During training, a BN layer computes the mean and standard deviation of each feature over the current mini-batch. Running averages of these statistics are recorded during training, and at test time they are used to normalize the features of the test set.

Implementation:

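For a mini-batch of N examples, batch normalization applies the following transform to each feature independently, where γ and β are learnable scale and shift parameters (this matches the code below):

\mu_B = \frac{1}{N}\sum_{i=1}^{N} x_i

\sigma_B^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu_B)^2

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}

y_i = \gamma\,\hat{x}_i + \beta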

Code:

import numpy as np


def batchnorm_forward(x, gamma, beta, bn_param):
    """
    Forward pass for batch normalization.

    During training the sample mean and (uncorrected) sample variance are
    computed from minibatch statistics and used to normalize the incoming data.
    During training we also keep an exponentially decaying running mean of the
    mean and variance of each feature, and these averages are used to normalize
    data at test-time.

    At each timestep we update the running averages for mean and variance using
    an exponential decay based on the momentum parameter:

    running_mean = momentum * running_mean + (1 - momentum) * sample_mean
    running_var = momentum * running_var + (1 - momentum) * sample_var

    Note that the batch normalization paper suggests a different test-time
    behavior: they compute sample mean and variance for each feature using a
    large number of training images rather than using a running average. For
    this implementation we have chosen to use running averages instead since
    they do not require an additional estimation step; the torch7
    implementation of batch normalization also uses running averages.

    Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift parameter of shape (D,)
    - bn_param: Dictionary with the following keys:
      - mode: 'train' or 'test'; required
      - eps: Constant for numeric stability
      - momentum: Constant for running mean / variance.
      - running_mean: Array of shape (D,) giving running mean of features
    - running_var: Array of shape (D,) giving running variance of features

    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    mode = bn_param['mode']
    eps = bn_param.get('eps', 1e-5)
    momentum = bn_param.get('momentum', 0.9)

    N, D = x.shape
    running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))

    out, cache = None, None
    if mode == 'train':
        #######################################################################
        # TODO: Implement the training-time forward pass for batch norm.      #
        # Use minibatch statistics to compute the mean and variance, use      #
        # these statistics to normalize the incoming data, and scale and      #
        # shift the normalized data using gamma and beta.                     #
        #                                                                     #
        # You should store the output in the variable out. Any intermediates  #
        # that you need for the backward pass should be stored in the cache   #
        # variable.                                                           #
        #                                                                     #
        # You should also use your computed sample mean and variance together #
        # with the momentum variable to update the running mean and running   #
        # variance, storing your result in the running_mean and running_var   #
        # variables.                                                          #
        #######################################################################
        sample_mean = x.mean(axis=0)
        sample_var = x.var(axis=0)
        x_hat = (x - sample_mean) / np.sqrt(sample_var + eps)
        out = gamma * x_hat + beta
        running_mean = momentum * running_mean + (1 - momentum) * sample_mean
        running_var = momentum * running_var + (1 - momentum) * sample_var

        cache = (gamma, x, sample_mean, sample_var, eps, x_hat)
        #######################################################################
        #                           END OF YOUR CODE                          #
        #######################################################################
    elif mode == 'test':
        #######################################################################
        # TODO: Implement the test-time forward pass for batch normalization. #
        # Use the running mean and variance to normalize the incoming data,   #
        # then scale and shift the normalized data using gamma and beta.      #
        # Store the result in the out variable.                               #
        #######################################################################
        x_hat = (x - running_mean) / np.sqrt(running_var + eps)
        out = gamma * x_hat + beta
        #######################################################################
        #                          END OF YOUR CODE                           #
        #######################################################################
    else:
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)

    # Store the updated running means back into bn_param
    bn_param['running_mean'] = running_mean
    bn_param['running_var'] = running_var

    return out, cache
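
A minimal usage sketch (synthetic data, gamma = 1, beta = 0) showing how the running statistics stored in bn_param are built up during training and then used at test time:

import numpy as np

np.random.seed(0)
N, D = 200, 3
gamma = np.ones(D)
beta = np.zeros(D)
bn_param = {'mode': 'train', 'momentum': 0.9}

# Training: each call normalizes with batch statistics and also updates
# the running mean / variance stored inside bn_param.
for _ in range(50):
    x = 5.0 + 2.0 * np.random.randn(N, D)          # features with mean 5, std 2
    out, _ = batchnorm_forward(x, gamma, beta, bn_param)

print(out.mean(axis=0))                            # close to 0
print(out.std(axis=0))                             # close to 1
print(bn_param['running_mean'])                    # close to 5

# Test time: the stored running statistics are used instead of batch statistics.
bn_param['mode'] = 'test'
x_test = 5.0 + 2.0 * np.random.randn(N, D)
out_test, _ = batchnorm_forward(x_test, gamma, beta, bn_param)
print(out_test.mean(axis=0))                       # close to 0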


def batchnorm_backward(dout, cache):
    """
    Backward pass for batch normalization.

    For this implementation, you should write out a computation graph for
    batch normalization on paper and propagate gradients backward through
    intermediate nodes.

    Inputs:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from batchnorm_forward.

    Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    dx, dgamma, dbeta = None, None, None
    ###########################################################################
    # TODO: Implement the backward pass for batch normalization. Store the    #
    # results in the dx, dgamma, and dbeta variables.                         #
    ###########################################################################
    gamma, x, sample_mean, sample_var, eps, x_hat = cache
    N, D = x.shape

    # Gradients of the learnable scale and shift parameters.
    dgamma = np.sum(dout * x_hat, axis=0)                          # (D,)
    dbeta = np.sum(dout, axis=0)                                   # (D,)

    # Backprop through y = gamma * x_hat + beta.
    dx_hat = dout * gamma                                          # (N, D)
    std = np.sqrt(sample_var.reshape(1, D) + eps)                  # (1, D)

    # Direct path through x_hat = (x - mean) / std.
    dx = dx_hat / std                                              # (N, D)
    dstd = np.sum(-dx_hat * (x_hat / std), axis=0).reshape(1, D)   # (1, D)
    dmean = np.sum(-dx_hat / std, axis=0).reshape(1, D)            # (1, D)

    # Through the variance: std = sqrt(var + eps), var = mean((x - mean)**2).
    dvar = dstd / (2.0 * std)                                      # (1, D)
    dmean += dvar * (-2.0 / N) * (x - sample_mean).sum(axis=0).reshape(1, D)
    dx += dvar * (2.0 / N) * (x - sample_mean)                     # (N, D)

    # Through the mean: mean = x.mean(axis=0).
    dx += dmean / N                                                # (N, D)
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################

    return dx, dgamma, dbeta
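
A quick way to sanity-check batchnorm_backward against finite differences. This is a self-contained sketch; numeric_grad below is written just for this check and is not taken from any library:

import numpy as np

def numeric_grad(f, x, dout, h=1e-5):
    # Centered finite-difference gradient of sum(f(x) * dout) with respect to x.
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h
        pos = f(x)
        x[ix] = old - h
        neg = f(x)
        x[ix] = old
        grad[ix] = np.sum((pos - neg) * dout) / (2.0 * h)
        it.iternext()
    return grad

np.random.seed(1)
N, D = 4, 5
x = np.random.randn(N, D)
gamma = np.random.randn(D)
beta = np.random.randn(D)
dout = np.random.randn(N, D)

_, cache = batchnorm_forward(x, gamma, beta, {'mode': 'train'})
dx, dgamma, dbeta = batchnorm_backward(dout, cache)

f_x = lambda v: batchnorm_forward(v, gamma, beta, {'mode': 'train'})[0]
f_g = lambda v: batchnorm_forward(x, v, beta, {'mode': 'train'})[0]
f_b = lambda v: batchnorm_forward(x, gamma, v, {'mode': 'train'})[0]

print(np.abs(dx - numeric_grad(f_x, x, dout)).max())        # tiny, around 1e-8 or less
print(np.abs(dgamma - numeric_grad(f_g, gamma, dout)).max())
print(np.abs(dbeta - numeric_grad(f_b, beta, dout)).max())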

           

2. Dropout

Overfitting has always been a problem for deep neural networks (DNNs): the model merely learns to classify the training data, adapting itself to the training samples rather than learning a decision boundary that classifies general data. Over the years many techniques have been proposed to deal with overfitting. Dropout is one of them; it is very simple yet works very well in practice, which is why it is so widely used.

The idea behind dropout is essentially to treat training a DNN as training an ensemble of models and then averaging their outputs, rather than training a single DNN.

The network is assigned a dropout rate p, which means each neuron is kept with probability 1 - p. When a neuron is dropped, its output is set to 0 regardless of its inputs or its parameters.

During training, a dropped neuron contributes to neither the forward nor the backward pass of backpropagation. Because of this, every training step effectively trains a different network.

In short: dropout works very well in practice because it prevents neurons from co-adapting during training.


Code implementation (plain dropout):
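
A minimal sketch of plain (non-inverted) dropout, assuming p is the drop probability used in this post: units are zeroed with probability p during training, and at test time nothing is dropped but every output is scaled by 1 - p so that it matches the expected training-time activation.

import numpy as np

def plain_dropout_train(x, p):
    # Zero each activation with probability p (keep it with probability 1 - p).
    mask = np.random.rand(*x.shape) >= p
    return x * mask

def plain_dropout_test(x, p):
    # Keep every unit, but rescale so expected activations match training.
    return x * (1 - p)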

- Inverted Dropout (an improved version of dropout)

Advantage: we only need to scale the activation outputs during training, and nothing has to change at test time.


Deep learning frameworks implement inverted dropout rather than plain dropout, because it keeps the model self-contained: only one parameter (the keep/drop probability) needs to be adjusted, while the rest of the model stays untouched.
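
Dividing the surviving activations by 1 - p during training is exactly what makes the test-time pass an identity: each unit is kept with probability 1 - p, so the expected value of the scaled activation equals the original activation,

\mathbb{E}[\text{mask} \cdot x] = (1 - p) \cdot \frac{x}{1 - p} + p \cdot 0 = x

and the layer can simply pass its input through at test time.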

def dropout_forward(x, dropout_param):
    """
    Performs the forward pass for (inverted) dropout.

    Inputs:
    - x: Input data, of any shape
    - dropout_param: A dictionary with the following keys:
      - p: Dropout parameter. We drop each neuron output with probability p.
      - mode: 'test' or 'train'. If the mode is train, then perform dropout;
        if the mode is test, then just return the input.
      - seed: Seed for the random number generator. Passing seed makes this
        function deterministic, which is needed for gradient checking but not
        in real networks.

    Outputs:
    - out: Array of the same shape as x.
    - cache: tuple (dropout_param, mask). In training mode, mask is the dropout
      mask that was used to multiply the input; in test mode, mask is None.
    """
    p, mode = dropout_param['p'], dropout_param['mode']
    if 'seed' in dropout_param:
        np.random.seed(dropout_param['seed'])

    mask = None
    out = None

    if mode == 'train':
        #######################################################################
        # TODO: Implement training phase forward pass for inverted dropout.   #
        # Store the dropout mask in the mask variable.                        #
        #######################################################################
        # Keep each unit with probability 1 - p and pre-scale by 1 / (1 - p),
        # so that no extra scaling is needed at test time (inverted dropout).
        mask = (np.random.rand(*x.shape) >= p) / (1 - p)
        out = x * mask
        #######################################################################
        #                           END OF YOUR CODE                          #
        #######################################################################
    elif mode == 'test':
        #######################################################################
        # TODO: Implement the test phase forward pass for inverted dropout.   #
        #######################################################################
        out = x  # inverted dropout: the test-time pass is the identity
        #######################################################################
        #                            END OF YOUR CODE                         #
        #######################################################################

    cache = (dropout_param, mask)
    out = out.astype(x.dtype, copy=False)

    return out, cache


def dropout_backward(dout, cache):
    """
    Perform the backward pass for (inverted) dropout.

    Inputs:
    - dout: Upstream derivatives, of any shape
    - cache: (dropout_param, mask) from dropout_forward.
    """
    dropout_param, mask = cache
    mode = dropout_param['mode']

    dx = None
    if mode == 'train':
        #######################################################################
        # TODO: Implement training phase backward pass for inverted dropout   #
        #######################################################################
        dx = dout * mask
        #######################################################################
        #                          END OF YOUR CODE                           #
        #######################################################################
    elif mode == 'test':
        dx = dout
    return dx
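
A small sanity check for the two dropout functions above:

import numpy as np

np.random.seed(0)
x = 10 + np.random.randn(500, 500)
dropout_param = {'mode': 'train', 'p': 0.3, 'seed': 123}

out, cache = dropout_forward(x, dropout_param)
print(x.mean(), out.mean())                    # means stay close thanks to the 1/(1-p) scaling

dropout_param['mode'] = 'test'
out_test, _ = dropout_forward(x, dropout_param)
print(np.array_equal(out_test, x))             # True: test time is the identity

dropout_param['mode'] = 'train'
out, cache = dropout_forward(x, dropout_param)
dout = np.random.randn(*x.shape)
dx = dropout_backward(dout, cache)
mask = cache[1]
print(np.all(dx[mask == 0] == 0))              # True: dropped units receive no gradient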
           

Author: 郭耀華

Source: http://www.guoyaohua.com

WeChat: guoyaohua167

Email: [email protected]

This article is copyrighted by the author and 部落格園. Reposting is welcome; please credit the source.

