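Batch Normalization and Layer Normalization

Introduction

This article discusses batch normalization (BN) and layer normalization (LN). BN is well suited to CNNs, while LN is well suited to RNNs. Why BN works is still not fully understood, but it does work in practice, so we use it.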

Batch Normalization

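Batch Normalization (BN) was proposed by Sergey Ioffe et al. [1] and is a widely used technique for training deep neural networks. It speeds up convergence, makes training less sensitive to initialization, and allows larger learning rates.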

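Concepts

The original paper introduces several concepts to explain why BN works, although, as discussed later in this article, that explanation has since been challenged.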

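Covariate shift and internal covariate shift

In the original paper [1], covariate shift is described, from the viewpoint of distribution stability, as a change in the distribution of a model's inputs.

Internal covariate shift refers to the change in the distribution of the inputs to the hidden layers of a deep neural network as the parameters of the earlier layers change during training.

From the model's point of view, a large difference between the training and test data distributions is also a form of covariate shift.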
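As a rough illustration of internal covariate shift, the toy sketch below measures what a second layer would receive before and after the first layer's parameters change. This setup is not from the paper: the layer sizes, the seed and the size of the perturbation (a stand-in for a training update) are all arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical setup: the next layer's input is the output of layer1.
layer1 = nn.Linear(4, 8)
x = torch.randn(256, 4)  # a fixed batch of inputs

with torch.no_grad():
    before = torch.relu(layer1(x))
    # Stand-in for a training update: shift layer1's parameters
    # (an arbitrary perturbation, chosen only for illustration).
    layer1.weight.add_(0.5)
    layer1.bias.add_(0.5)
    after = torch.relu(layer1(x))

# The data x did not change, yet the distribution seen by the next layer did.
mean_before, std_before = before.mean().item(), before.std().item()
mean_after, std_after = after.mean().item(), after.std().item()
print(f"before update: mean={mean_before:.3f}, std={std_before:.3f}")
print(f"after update:  mean={mean_after:.3f}, std={std_after:.3f}")
```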

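Algorithm description

The original paper describes the algorithm as follows:

$$\mu_{\mathcal{B}} = \frac{1}{m}\sum_{i=1}^{m} x_i$$

$$\sigma_{\mathcal{B}}^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2$$

$$\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}$$

$$y_i = \gamma \hat{x}_i + \beta$$

Here $\mathcal{B} = \{x_1, \dots, x_m\}$ is the set of all samples in a training mini-batch, the learnable parameters are $\gamma$ and $\beta$, and the output $y_i$ is the result of applying BN to each sample.

There are four steps in total. The first three compute the mean and the variance of the samples in the batch and then standardize them.

The last step undoes the standardization by scaling and shifting the standardized data, so that the model can learn for itself whether standardization is needed and to what degree. $\epsilon$ is a small constant that keeps the denominator from being zero.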
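To see why the final scale-and-shift step lets the model decide how much normalization to keep, here is a minimal sketch. The tensor shapes, the seed and the variable names are assumptions made for illustration. With gamma = 1 and beta = 0 the output stays standardized, while gamma = sqrt(var + eps) and beta = mean recover the original input.

```python
import torch

torch.manual_seed(0)
eps = 1e-5
x = torch.randn(32, 3) * 4.0 + 10.0   # a mini-batch: 32 samples, 3 features

# Steps 1-3: batch mean, (biased) batch variance, standardization
mean = x.mean(dim=0, keepdim=True)
var = ((x - mean) ** 2).mean(dim=0, keepdim=True)
x_hat = (x - mean) / torch.sqrt(var + eps)

# Step 4 with gamma = 1, beta = 0: the output stays standardized
y_default = 1.0 * x_hat + 0.0

# Step 4 with gamma = sqrt(var + eps), beta = mean: the output is the input again
gamma = torch.sqrt(var + eps)
beta = mean
y_identity = gamma * x_hat + beta

print(y_default.mean(dim=0), y_default.var(dim=0, unbiased=False))
print(torch.allclose(y_identity, x, atol=1e-4))
```

In a real layer gamma and beta are learned, so the network can end up anywhere between these two extremes.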

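Everything above applies to training data. What should be done for test data, or for data arriving online?

Online data may arrive one sample at a time, so a batch mean and variance cannot be computed. The approach is to keep the mean and variance of each training batch and estimate their expectation over all batches. Implementations commonly do this with an exponential moving average (EMA), which gives more weight to the statistics of recent iterations.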
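The sketch below shows how such running statistics could be tracked and then used to normalize a single sample at inference time. The synthetic data, the momentum value and the variable names are assumptions, not taken from the original paper.

```python
import torch

torch.manual_seed(0)
momentum, eps = 0.1, 1e-5          # assumed values
moving_mean = torch.zeros(3)
moving_var = torch.ones(3)

# Training: fold each batch's statistics into the running estimates.
for _ in range(200):
    batch = torch.randn(64, 3) * 2.0 + 5.0   # synthetic data: mean 5, std 2
    mean = batch.mean(dim=0)
    var = batch.var(dim=0, unbiased=False)
    moving_mean = (1 - momentum) * moving_mean + momentum * mean
    moving_var = (1 - momentum) * moving_var + momentum * var

# Inference: a single sample is normalized with the stored statistics,
# so no batch is needed.
sample = torch.tensor([[5.0, 7.0, 3.0]])
normalized = (sample - moving_mean) / torch.sqrt(moving_var + eps)
print(moving_mean, moving_var, normalized)
```

Because old batches are discounted by a factor of (1 - momentum) at every step, the initial values of the running estimates stop mattering after enough iterations.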

Implementation

import torch
import torch.nn as nn


class BatchNorm(nn.Module):
    def __init__(self, num_features, epsilon=1e-05, momentum=0.1, device=None):
        '''
        num_features: output size of the preceding fully connected layer
        momentum: coefficient used by the EMA
        device: optional device on which the parameters and buffers are created
        '''
        super(BatchNorm, self).__init__()

        self.device = device

        # learnable parameters, created with nn.Parameter
        self.beta = nn.Parameter(torch.zeros(1, num_features, device=device))
        self.gamma = nn.Parameter(torch.ones(1, num_features, device=device))

        self.epsilon = epsilon
        self.momentum = momentum

        # running statistics are buffers: they are not trained, but they follow
        # model.to(device) and are saved in state_dict
        self.register_buffer('moving_mean', torch.zeros(1, num_features, device=device))
        self.register_buffer('moving_var', torch.ones(1, num_features, device=device))

    def forward(self, X):
        '''
        X: [batch_size, num_features]
        '''
        # training mode
        if self.training:
            # mean and (biased) variance of the current batch
            mean = X.mean(dim=0, keepdim=True)
            var = ((X - mean) ** 2).mean(dim=0, keepdim=True)
            # standardize
            X_normalized = (X - mean) / torch.sqrt(var + self.epsilon)
            # update the moving averages outside the autograd graph; same EMA form as
            # nn.BatchNorm1d, which however stores the unbiased variance
            with torch.no_grad():
                self.moving_mean.mul_(1 - self.momentum).add_(self.momentum * mean)
                self.moving_var.mul_(1 - self.momentum).add_(self.momentum * var)
        else:
            # inference mode: use the stored statistics
            X_normalized = (X - self.moving_mean) / torch.sqrt(self.moving_var + self.epsilon)

        # y in the formula
        Y = self.gamma * X_normalized + self.beta

        return Y  # [batch_size, num_features]

    def __repr__(self):
        return f'BatchNorm(num_features={self.moving_mean.size(1)}, momentum={self.momentum})'
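
As a sanity check, the training-mode computation can be compared with PyTorch's own nn.BatchNorm1d. The sketch below repeats the arithmetic by hand instead of importing the class above, so that it runs on its own; the input shape and the seed are arbitrary assumptions. It also shows one difference: nn.BatchNorm1d folds the unbiased batch variance into running_var, whereas the class above uses the biased one.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(16, 4) * 3.0 + 1.0
eps, momentum = 1e-5, 0.1

# Manual training-mode forward pass (gamma = 1, beta = 0 at initialization)
mean = x.mean(dim=0)
var_biased = ((x - mean) ** 2).mean(dim=0)
manual_out = (x - mean) / torch.sqrt(var_biased + eps)

# Reference layer from PyTorch; a freshly built module is in training mode
bn = nn.BatchNorm1d(4, eps=eps, momentum=momentum)
ref_out = bn(x)

# The normalized outputs agree ...
outputs_match = torch.allclose(manual_out, ref_out, atol=1e-4)

# ... but running_var is updated with the unbiased variance (divide by n - 1).
n = x.shape[0]
expected_running_var = (1 - momentum) * torch.ones(4) + momentum * var_biased * n / (n - 1)
running_var_is_unbiased = torch.allclose(bn.running_var, expected_running_var, atol=1e-4)
print(outputs_match, running_var_is_unbiased)
```

For large batches the two variance estimates are nearly identical, so the difference mostly matters for very small batch sizes.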

Next, we use a regression task to look at the effect of batch normalization.

The following example is adapted from 莫凡Python [2].

import numpy as np
import torch
import torch.nn as nn
import torch.nn.init as init
import torch.utils.data as Data
import matplotlib.pyplot as plt

# hyperparameters
N_SAMPLES = 2000
BATCH_SIZE = 64
EPOCH = 12
LR = 0.03
N_HIDDEN = 8
ACTIVATION = torch.relu
B_INIT = -0.2   # initialize the biases with a negative value

# training data
x = np.linspace(-7, 10, N_SAMPLES)[:, np.newaxis]
noise = np.random.normal(0, 2, x.shape)
y = np.square(x) - 5 + noise

# test data
test_x = np.linspace(-7, 10, 200)[:, np.newaxis]
noise = np.random.normal(0, 2, test_x.shape)
test_y = np.square(test_x) - 5 + noise

train_x, train_y = torch.from_numpy(x).float(), torch.from_numpy(y).float()
# plain tensors are enough; Variable(..., volatile=True) is obsolete
test_x = torch.from_numpy(test_x).float()
test_y = torch.from_numpy(test_y).float()

train_dataset = Data.TensorDataset(train_x, train_y)
train_loader = Data.DataLoader(dataset=train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2)

# show data
plt.scatter(train_x.numpy(), train_y.numpy(), c='#FF9359', s=50, alpha=0.2, label='train')
plt.legend(loc='upper left')
plt.show()

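The data are drawn from the function $y = x^2 - 5$, with some noise added.

We then build a deep network to fit these data, using the custom batch normalization implementation above.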

class Net(nn.Module):
    def __init__(self, batch_normalization=False):
        super(Net, self).__init__()
        self.do_bn = batch_normalization
        self.fcs = []
        self.bns = []
        # use the custom BatchNorm layer defined above
        self.bn_input = BatchNorm(1)

        # the input layer has size 1, each hidden layer has size 10
        for i in range(N_HIDDEN):
            input_size = 1 if i == 0 else 10
            fc = nn.Linear(input_size, 10)
            # register the layer via setattr so that its parameters are tracked
            setattr(self, 'fc%i' % i, fc)
            self._set_init(fc)
            self.fcs.append(fc)
            if self.do_bn:
                bn = BatchNorm(10)
                setattr(self, 'bn%i' % i, bn)
                self.bns.append(bn)
        # the output layer also has size 1, since this is a regression task
        self.predict = nn.Linear(10, 1)         # output layer
        self._set_init(self.predict)            # parameters initialization

    def _set_init(self, layer):
        init.normal_(layer.weight, mean=0., std=.1)
        init.constant_(layer.bias, B_INIT)

    def forward(self, x):
        # values before each activation
        pre_activation = [x]
        if self.do_bn:
            x = self.bn_input(x)     # BN on the input
        # input of each hidden layer
        layer_input = [x]
        for i in range(N_HIDDEN):
            x = self.fcs[i](x)
            pre_activation.append(x)
            if self.do_bn:
                x = self.bns[i](x)   # BN on the hidden layer
            x = ACTIVATION(x)
            layer_input.append(x)

        out = self.predict(x)
        # the training loop unpacks three values, so pre_activation is returned as well
        return out, layer_input, pre_activation

Print the network structures:

nets = [Net(batch_normalization=False), Net(batch_normalization=True)]
print(*nets)    # print net architecture

We build two network instances for comparison: one uses batch normalization and the other does not.

Train and plot.

opts = [torch.optim.Adam(net.parameters(), lr=LR) for net in nets]

loss_func = torch.nn.MSELoss()

f, axs = plt.subplots(4, N_HIDDEN+1, figsize=(10, 5))
plt.ion()   # interactive mode, so the figure can be redrawn during training

def plot_histogram(l_in, l_in_bn, pre_ac, pre_ac_bn):
    for i, (ax_pa, ax_pa_bn, ax, ax_bn) in enumerate(zip(axs[0, :], axs[1, :], axs[2, :], axs[3, :])):
        for a in [ax_pa, ax_pa_bn, ax, ax_bn]:
            a.clear()
        # the input layer (i == 0) uses the data range; hidden layers use narrower ranges
        if i == 0:
            p_range = (-7, 10)
            the_range = (-7, 10)
        else:
            p_range = (-4, 4)
            the_range = (-1, 1)
        ax_pa.set_title('L' + str(i))
        ax_pa.hist(pre_ac[i].detach().numpy().ravel(), bins=10, range=p_range, color='#FF9359', alpha=0.5)
        ax_pa_bn.hist(pre_ac_bn[i].detach().numpy().ravel(), bins=10, range=p_range, color='#74BCFF', alpha=0.5)
        ax.hist(l_in[i].detach().numpy().ravel(), bins=10, range=the_range, color='#FF9359')
        ax_bn.hist(l_in_bn[i].detach().numpy().ravel(), bins=10, range=the_range, color='#74BCFF')
        for a in [ax_pa, ax, ax_pa_bn, ax_bn]:
            a.set_yticks(())
            a.set_xticks(())
        ax_pa_bn.set_xticks(p_range)
        ax_bn.set_xticks(the_range)
    axs[0, 0].set_ylabel('PreAct')
    axs[1, 0].set_ylabel('BN PreAct')
    axs[2, 0].set_ylabel('Act')
    axs[3, 0].set_ylabel('BN Act')
    plt.pause(0.01)

# training
losses = [[], []]  # record the test loss of the two networks
for epoch in range(EPOCH):
    print('Epoch: ', epoch)
    layer_inputs, pre_acts = [], []
    for net, l in zip(nets, losses):
        net.eval()              # eval mode: use the stored moving_mean and moving_var
        with torch.no_grad():   # no gradients are needed for evaluation
            pred, layer_input, pre_act = net(test_x)
        l.append(loss_func(pred, test_y).item())
        layer_inputs.append(layer_input)
        pre_acts.append(pre_act)
        net.train()             # training mode: moving_mean and moving_var are updated again
    plot_histogram(*layer_inputs, *pre_acts)     # plot histograms

    for step, (b_x, b_y) in enumerate(train_loader):
        for net, opt in zip(nets, opts):     # train each network
            pred, _, _ = net(b_x)
            loss = loss_func(pred, b_y)
            opt.zero_grad()
            loss.backward()
            opt.step()    # this also updates the learnable parameters of the BN layers

plt.ioff()

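L0 is the input layer and L1 to L8 are the hidden layers. The plots show the distribution of each layer's input values at every iteration.

Red is the network without BN and blue is the network with BN.

PreAct is the value before the activation and Act is the value after it. In the network without BN, with ReLU as the activation function, the outputs of all layers barely change and the units look dead.

In the network with BN, the values in each layer are more spread out instead of piling up in one place, and after BN and the activation many values are still greater than zero.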
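The dying-ReLU effect can already be seen at initialization. The sketch below stacks eight layers with the same width, weight scale and negative bias as the example, and reports the fraction of ReLU outputs greater than zero per layer. Note the assumptions: it uses PyTorch's nn.BatchNorm1d as a stand-in for the custom layer, the networks are untrained, and the helper name active_fractions is hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N_LAYERS, WIDTH, B_INIT = 8, 10, -0.2   # same depth, width and bias as the example

def active_fractions(use_bn):
    """Return, per hidden layer, the fraction of ReLU outputs that are > 0."""
    x = torch.linspace(-7, 10, 256).unsqueeze(1)
    fractions = []
    with torch.no_grad():
        if use_bn:
            x = nn.BatchNorm1d(1)(x)
        for i in range(N_LAYERS):
            fc = nn.Linear(1 if i == 0 else WIDTH, WIDTH)
            nn.init.normal_(fc.weight, mean=0.0, std=0.1)
            nn.init.constant_(fc.bias, B_INIT)
            x = fc(x)
            if use_bn:
                # a fresh layer is in training mode: it normalizes with batch statistics
                x = nn.BatchNorm1d(WIDTH)(x)
            x = torch.relu(x)
            fractions.append((x > 0).float().mean().item())
    return fractions

frac_plain = active_fractions(use_bn=False)
frac_bn = active_fractions(use_bn=True)
print("no BN:", [round(f, 2) for f in frac_plain])
print("BN:   ", [round(f, 2) for f in frac_bn])
```

Without BN, the small weights shrink the signal at every layer until the negative bias dominates and the ReLUs output zero. With BN, each layer's pre-activations are re-centered, so part of them stays positive.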

Next, we plot the loss curves and the fitted curves of the two networks.

# plot the test loss recorded once per epoch
plt.figure(2)
plt.plot(losses[0], c='#FF9359', lw=3, label='Original')
plt.plot(losses[1], c='#74BCFF', lw=3, label='Batch Normalization')
plt.xlabel('epoch')
plt.ylabel('test loss')
plt.ylim((0, 2000))
plt.legend(loc='best')

# evaluation
# eval mode makes the BN layers use the stored moving_mean and moving_var
for net in nets:
    net.eval()
with torch.no_grad():
    preds = [net(test_x)[0] for net in nets]
plt.figure(3)
plt.plot(test_x.numpy(), preds[0].numpy(), c='#FF9359', lw=4, label='Original')
plt.plot(test_x.numpy(), preds[1].numpy(), c='#74BCFF', lw=4, label='Batch Normalization')
plt.scatter(test_x.numpy(), test_y.numpy(), c='r', s=50, alpha=0.2, label='test')
plt.legend(loc='best')
plt.show()

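Why BN works

BN makes deep neural networks easier to train, but exactly why is still unsettled. Several hypotheses exist.

Hypothesis 1: The authors of the original paper [1] conjectured that BN reduces internal covariate shift (ICS), which makes the network easier to train.

❌ Shibani Santurkar et al. [3] showed experimentally that the benefit of BN is not related to ICS.
Hypothesis 2: BN uses its two learnable parameters to adjust the input distribution of the hidden layers, which helps the optimizer work better.

❓ This hypothesis holds that the interdependence among parameters is what makes the optimization task harder, but there is no conclusive evidence for it.
Hypothesis 3: BN reparametrizes the underlying optimization problem, making it smoother and more stable.

❓ This is the most recent line of work at the time of writing, and no one has disputed it yet. The authors provide some theoretical support, but basic questions remain unanswered, such as how BN helps generalization.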

Layer Normalization

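Building on Batch Normalization, Jimmy Lei Ba et al. [4] proposed Layer Normalization (LN).

The paper points out that applying BN to RNNs causes problems. Sentence lengths in NLP tasks are not fixed, so with BN the statistics differ at every time step, and at some time step a given sentence may have no input left. Worse, BN cannot cope with online learning (one sample per batch) or with very small batch sizes.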
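The batch-of-one problem is easy to reproduce by hand. In the sketch below (the input values are arbitrary), statistics taken over the batch dimension collapse when there is only one sample, while statistics taken over the features of that sample remain usable.

```python
import torch

eps = 1e-5
x = torch.tensor([[1.0, 2.0, 3.0, 4.0]])   # a "batch" with a single sample

# BN statistics are taken over the batch dimension: with one sample the
# mean equals the sample itself and the variance is zero.
bn_mean = x.mean(dim=0, keepdim=True)
bn_var = ((x - bn_mean) ** 2).mean(dim=0, keepdim=True)
bn_out = (x - bn_mean) / torch.sqrt(bn_var + eps)

# LN statistics are taken over the features of each sample, so one sample is enough.
ln_mean = x.mean(dim=-1, keepdim=True)
ln_var = ((x - ln_mean) ** 2).mean(dim=-1, keepdim=True)
ln_out = (x - ln_mean) / torch.sqrt(ln_var + eps)

print(bn_out)   # every entry is 0: the input information is lost
print(ln_out)
```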

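If BN computes its statistics over the whole batch, LN computes them over all the features of a single sample.

In other words, LN normalizes across all the neurons of a hidden layer.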
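In code, the difference comes down to the axis over which the mean and variance are taken. A minimal sketch, with arbitrary tensor sizes and the learnable scale and shift left out:

```python
import torch

torch.manual_seed(0)
x = torch.randn(8, 5) * 2.0 + 3.0   # [batch_size, num_features]; sizes are arbitrary
eps = 1e-5

# BN: one mean/variance per feature, computed down the batch dimension (dim=0)
bn = (x - x.mean(dim=0, keepdim=True)) / torch.sqrt(
    x.var(dim=0, unbiased=False, keepdim=True) + eps)

# LN: one mean/variance per sample, computed across its features (dim=-1)
ln = (x - x.mean(dim=-1, keepdim=True)) / torch.sqrt(
    x.var(dim=-1, unbiased=False, keepdim=True) + eps)

print(bn.mean(dim=0))    # close to 0 for each of the 5 features
print(ln.mean(dim=-1))   # close to 0 for each of the 8 samples
```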

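Algorithm description

LN is similar to BN, but the statistics are computed within each sample, so training and testing behave the same and no EMA is needed.

Let $x = (x_1, \dots, x_H)$ be the input vector of size $H$ to the LN layer at some time step. LN normalizes $x$ with the following formulas:

$$\mu = \frac{1}{H}\sum_{i=1}^{H} x_i, \qquad \sigma^2 = \frac{1}{H}\sum_{i=1}^{H} (x_i - \mu)^2$$

$$h = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

Here $h$ is the output of the LN layer, $\odot$ is element-wise multiplication, $\mu$ and $\sigma^2$ are the mean and variance over the dimensions of the input, and $\gamma$ and $\beta$ are two learnable parameters with the same dimension as $h$.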

Implementation

import torch
import torch.nn as nn


class LayerNorm(nn.Module):
    def __init__(self, normalized_shape, epsilon=1e-05):
        '''
        normalized_shape: the shape of the input tensor, or the size of its last dimension;
                          only the last dimension is normalized
        '''
        super(LayerNorm, self).__init__()

        if isinstance(normalized_shape, int):
            normalized_shape = (normalized_shape, )
        else:
            normalized_shape = (normalized_shape[-1], )

        self.normalized_shape = torch.Size(normalized_shape)

        # learnable parameters, created with nn.Parameter
        self.beta = nn.Parameter(torch.zeros(*normalized_shape))
        self.gamma = nn.Parameter(torch.ones(*normalized_shape))

        self.epsilon = epsilon

    def forward(self, X):
        '''
        X: [batch_size, *, num_features]
        '''
        # mean and variance of each sample over its last dimension
        mean = X.mean(dim=-1, keepdim=True)
        var = ((X - mean) ** 2).mean(dim=-1, keepdim=True)
        # standardize
        X_normalized = (X - mean) / torch.sqrt(var + self.epsilon)

        # h in the formula
        Y = self.gamma * X_normalized + self.beta

        return Y  # same shape as X

    def __repr__(self):
        return f'LayerNorm(normalized_shape={self.normalized_shape})'
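
The same arithmetic can be checked against PyTorch's torch.nn.functional.layer_norm. The sketch below repeats the computation by hand so that it runs on its own; the tensor sizes and the seed are arbitrary assumptions. It also checks that the result for one sample does not depend on the rest of the batch, which is why LN needs no separate inference path.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 3, 6)     # e.g. [batch_size, seq_len, hidden_size]; sizes are arbitrary
eps = 1e-5

# Manual layer normalization over the last dimension (gamma = 1, beta = 0)
mean = x.mean(dim=-1, keepdim=True)
var = ((x - mean) ** 2).mean(dim=-1, keepdim=True)
manual = (x - mean) / torch.sqrt(var + eps)

reference = F.layer_norm(x, normalized_shape=(6,), eps=eps)
matches = torch.allclose(manual, reference, atol=1e-4)

# Each sample is normalized on its own, so the result for one sample
# does not depend on what else is in the batch.
single = F.layer_norm(x[:1], normalized_shape=(6,), eps=eps)
batch_independent = torch.allclose(single, reference[:1], atol=1e-5)
print(matches, batch_independent)
```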

References

[1] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
[2] 莫凡Python
[3] How Does Batch Normalization Help Optimization?
[4] Layer Normalization
