How Batch Normalization speeds up convergence and improves accuracy in deep learning

Reposted from https://blog.csdn.net/whitesilence/article/details/75667002. Many thanks to the author for the original work; I am reposting it here to make my own later study easier.

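I first ran into batch normalization (BN) while reading the ladder network paper (https://arxiv.org/pdf/1507.02672v2.pdf). The paper says BN brings benefits such as faster convergence, but I did not understand why, so I looked up some material on BN online.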

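After reading the Zhihu discussion "Why does Batch Normalization work so well in deep learning?" and a set of study notes on Batch Normalization on CSDN, I finally have some understanding of BN. This post only summarizes the concrete steps of BN. A deeper understanding of why BN is needed, and whether it really helps, is something I am still studying and experimenting with.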

BN adds a standardization step to the input of every layer while the neural network is being trained.

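A traditional neural network standardizes the sample x only once, before the input layer (subtract the mean, divide by the standard deviation), to reduce the differences between samples. BN builds on this: it standardizes not only the input x of the input layer but also the input of every hidden layer.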
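As a minimal sketch of the input standardization described above (subtract the mean, divide by the standard deviation), the following NumPy snippet may help. The function name `standardize`, the small constant `eps`, and the random data are assumptions made for illustration; they are not from the original post.

```python
import numpy as np

def standardize(x, eps=1e-8):
    """Standardize each feature (column) to zero mean and unit variance."""
    mean = x.mean(axis=0)           # per-feature mean over the samples
    std = x.std(axis=0)             # per-feature standard deviation
    return (x - mean) / (std + eps)  # eps guards against a zero std

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(200, 4))  # hypothetical raw inputs
x_std = standardize(x)

print(x_std.mean(axis=0))  # close to 0 for every feature
print(x_std.std(axis=0))   # close to 1 for every feature
```

BN applies the same idea to the input of every hidden layer, using the statistics of the current mini-batch.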

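Without BN, the standardized x is multiplied by the weight matrix $W_{h1}$ and the bias $b_{h1}$ is added, giving the first layer's input $W_{h1}x + b_{h1}$; passing it through the activation function gives $h_1 = \mathrm{ReLU}(W_{h1}x + b_{h1})$. With BN added, $h_1$ is computed as shown in the dashed box of the original figure: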

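1. The matrix x first goes through the linear transformation $W_{h1}$ to give $s_1$. (Note: subtracting the batch mean $\mu_B$ cancels out any bias b, so there is no need to add b.) The batch mean $\mu_B$ is then subtracted from $s_1$ and the result is divided by the batch standard deviation $\sqrt{\sigma_B^2 + \epsilon}$, giving $s_2$. Here $\epsilon$ is a small positive constant that prevents division by zero.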

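where m is the number of samples in the batch and

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} W_{h1}x_i$$

$$\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}\left(W_{h1}x_i - \mu_B\right)^2$$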

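(Note: after this step $s_2$ is essentially constrained to a standardized distribution with zero mean and unit variance, which reduces the expressive power of the network. To address this, two new parameters $\gamma$ and $\beta$ are introduced; both are learned by the network during training.)

2. $s_2$ is multiplied by $\gamma$ to rescale it, and $\beta$ is added to shift it, giving $s_3$.

3. $s_3$ is passed through the activation function to give $h_1$.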
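The steps above can be sketched in NumPy roughly as follows. This is only an illustration under assumptions: the function name `bn_layer_forward`, the value of `eps`, the optional bias argument `b`, and the random inputs are all hypothetical and chosen for the demo. The bias is included only to check the claim that it is cancelled by subtracting the batch mean.

```python
import numpy as np

def bn_layer_forward(x, W, gamma, beta, b=None, eps=1e-5):
    """Training-mode forward pass: linear -> normalize -> scale/shift -> ReLU."""
    s1 = x @ W                           # linear transformation
    if b is not None:
        s1 = s1 + b                      # optional bias, only to show that it cancels
    mu = s1.mean(axis=0)                 # batch mean, one value per unit
    var = s1.var(axis=0)                 # batch variance (divides by m)
    s2 = (s1 - mu) / np.sqrt(var + eps)  # normalized pre-activation
    s3 = gamma * s2 + beta               # learned scale and shift
    h1 = np.maximum(s3, 0.0)             # ReLU
    return h1, s2

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 10))            # hypothetical batch of 64 samples
W = rng.normal(size=(10, 5))
gamma, beta = np.ones(5), np.zeros(5)    # typical initial values

h1, s2 = bn_layer_forward(x, W, gamma, beta)
h1_bias, _ = bn_layer_forward(x, W, gamma, beta, b=rng.normal(size=5))

print(s2.mean(axis=0).round(6))          # approximately 0 per unit
print(s2.var(axis=0).round(3))           # approximately 1 per unit
print(np.allclose(h1, h1_bias))          # the bias makes no difference
```

With `gamma` fixed at 1 and `beta` at 0 the layer outputs the plain normalized values; letting the network learn them is what restores the flexibility mentioned in the note.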

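Note that the computation above is what is used during training. At test time, the $\mu$ and $\sigma^2$ used are the mean $\mu_p$ and variance $\sigma_p^2$ of the whole training set, which are usually estimated during training with a moving average.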
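To make the train/test distinction concrete, here is a small NumPy sketch. The class `RunningStats`, the function `bn_inference`, the decay of 0.9, and the synthetic data (mean 2, standard deviation 3) are assumptions for illustration, not part of the original code.

```python
import numpy as np

class RunningStats:
    """Moving averages of the batch mean and variance, updated during training."""
    def __init__(self, size, decay=0.99):
        self.decay = decay
        self.mean = np.zeros(size)
        self.var = np.ones(size)

    def update(self, batch):
        d = self.decay
        self.mean = d * self.mean + (1 - d) * batch.mean(axis=0)
        self.var = d * self.var + (1 - d) * batch.var(axis=0)

def bn_inference(s, stats, eps=1e-5):
    """Test-time normalization: uses the stored averages, not the batch statistics."""
    return (s - stats.mean) / np.sqrt(stats.var + eps)

rng = np.random.default_rng(0)
stats = RunningStats(size=3, decay=0.9)
for _ in range(200):                       # pretend these are training batches
    batch = rng.normal(loc=2.0, scale=3.0, size=(128, 3))
    stats.update(batch)

print(stats.mean)                          # approaches the true mean of 2
print(stats.var)                           # approaches the true variance of 9

single = rng.normal(loc=2.0, scale=3.0, size=(1, 3))
print(bn_inference(single, stats))         # works even for a single sample
```

A single test sample has no meaningful batch variance, which is one practical reason for normalizing with stored statistics at test time.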

Before looking at the full code, here is how two averaging functions are used:

mean, variance = tf.nn.moments(x, axes, name=None, keep_dims=False)

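The input argument x is the sample tensor, shaped like [batchsize, height, width, kernels].

axes is a list giving the dimensions over which the statistics are computed.

The function returns the mean and the variance.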

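import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (tf.Session)

'''
batch = np.array(np.random.randint(1, 100, [10, 5]))
At first no dtype was given here, so batch had dtype=int64 and
sess.run([mm, vv]) kept raising InvalidArgumentError, because
tf.nn.moments requires floating-point input.
'''
batch = np.array(np.random.randint(1, 100, [10, 5]), dtype=np.float64)
mm, vv = tf.nn.moments(batch, axes=[0])  # mean and variance along dimension 0
# mm, vv = tf.nn.moments(batch, axes=[0, 1])  # mean and variance over all the data
sess = tf.Session()
print(batch)
print(sess.run([mm, vv]))  # pay attention to the argument dtype
sess.close()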

Output:

[[ 53.   9.  67.  30.  69.]
 [ 79.  25.   7.  80.  16.]
 [ 77.  67.  60.  30.  85.]
 [ 45.  14.  92.  12.  67.]
 [ 32.  98.  70.  98.  48.]
 [ 45.  89.  73.  73.  80.]
 [ 35.  67.  21.  77.  63.]
 [ 24.  33.  56.  85.  17.]
 [ 88.  43.  58.  82.  59.]
 [ 53.  23.  34.   4.  33.]]
[array([ 53.1,  46.8,  53.8,  57.1,  53.7]), array([  421.09,   896.96,   598.36,  1056.69,   542.61])]
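The numbers above can be cross-checked without TensorFlow. The sketch below feeds the printed batch to NumPy; it suggests that the variance returned by tf.nn.moments here is the population variance (divide by m, not m - 1). The 4-D `images` array at the end is a made-up example to show what `axes=[0, 1, 2]` would correspond to for [batchsize, height, width, kernels] input.

```python
import numpy as np

batch = np.array([
    [53,  9, 67, 30, 69],
    [79, 25,  7, 80, 16],
    [77, 67, 60, 30, 85],
    [45, 14, 92, 12, 67],
    [32, 98, 70, 98, 48],
    [45, 89, 73, 73, 80],
    [35, 67, 21, 77, 63],
    [24, 33, 56, 85, 17],
    [88, 43, 58, 82, 59],
    [53, 23, 34,  4, 33],
], dtype=np.float64)

mean = batch.mean(axis=0)      # same role as axes=[0] in tf.nn.moments
variance = batch.var(axis=0)   # population variance (ddof=0)
print(mean)
print(variance)

# For image-shaped input, reducing over batch, height and width
# leaves one mean per channel.
images = np.zeros((10, 28, 28, 3))  # hypothetical [batchsize, height, width, kernels]
print(images.mean(axis=(0, 1, 2)).shape)
```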

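ema = tf.train.ExponentialMovingAverage(decay) computes a moving average and requires a decay rate, which controls how fast the averaged value is updated. For every variable it tracks (the variables being trained), ExponentialMovingAverage maintains a shadow variable. The shadow variable starts at the variable's initial value and is updated as: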

shadow_variable=decay×shadow_variable+(1−decay)×variable

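As the formula shows, decay controls how fast the average is updated: the larger it is, the more stable the average. In practice decay is usually set to a constant very close to 1 (0.99 or 0.999). To let the average move faster early in training, ExponentialMovingAverage also provides a num_updates parameter that sets decay dynamically: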

$$\mathrm{decay} = \min\left\{\mathrm{decay},\ \frac{1 + \mathrm{num\_updates}}{10 + \mathrm{num\_updates}}\right\}$$
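To see how this formula behaves, the following plain-Python sketch evaluates it for a few step counts. The helper name `effective_decay` is hypothetical; it only mirrors the formula above and is not a TensorFlow function.

```python
def effective_decay(decay, num_updates):
    """Decay actually applied when num_updates is supplied, per the formula above."""
    return min(decay, (1.0 + num_updates) / (10.0 + num_updates))

for step in (0, 10, 100, 1000, 10000):
    print(step, effective_decay(0.99, step))
```

Early on the effective decay is small (0.1 at step 0), so the average follows the variable closely; once (1 + num_updates) / (10 + num_updates) exceeds the configured decay, the configured value takes over.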

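This is how I understand the moving average (I am not sure it is right, so corrections are welcome):

Suppose we have a time series $\{a_1, a_2, a_3, \cdots, a_t, a_{t+1}, \cdots\}$.

The average at time t is $mv_t = \frac{a_1 + a_2 + \cdots + a_t}{t}$.

The average at time t+1 is

$$mv_{t+1} = \frac{a_1 + a_2 + \cdots + a_t + a_{t+1}}{t+1} = \frac{t \cdot mv_t + a_{t+1}}{t+1} = \frac{t}{t+1}\,mv_t + \frac{1}{t+1}\,a_{t+1}$$

Letting $\mathrm{decay} = \frac{t}{t+1}$ gives $mv_{t+1} = \mathrm{decay} \cdot mv_t + (1 - \mathrm{decay}) \cdot a_{t+1}$.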
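The derivation can be checked numerically. In this sketch (the function name `running_mean` and the sample data are made up for illustration), applying the update with decay = t/(t+1) at every step reproduces the ordinary arithmetic mean.

```python
def running_mean(values):
    """Arithmetic mean computed incrementally with decay = t / (t + 1)."""
    mv = 0.0
    for t, a in enumerate(values):  # t = number of values already averaged
        decay = t / (t + 1)
        mv = decay * mv + (1 - decay) * a
    return mv

data = [3.0, 5.0, 10.0, 2.0]
print(running_mean(data))      # incremental result
print(sum(data) / len(data))   # direct arithmetic mean
```

The exponential moving average differs in that decay is held constant instead of growing with t, so recent values keep a fixed weight and old values fade out geometrically.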

import tensorflow as tf  # TensorFlow 1.x API

graph = tf.Graph()
with graph.as_default():
    w = tf.Variable(dtype=tf.float32, initial_value=1.0)
    ema = tf.train.ExponentialMovingAverage(0.9)
    update = tf.assign_add(w, 1.0)

    with tf.control_dependencies([update]):
        # Returns an op that updates the moving average.
        # This line must come before the ema.average(w) call below.
        ema_op = ema.apply([w])

    # Returns the variable holding the current moving average; the argument must not be a list.
    ema_val = ema.average(w)

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(3):
        print(i)
        print('w_old=', sess.run(w))
        print(sess.run(ema_op))
        print('w_new=', sess.run(w))
        print(sess.run(ema_val))
        print('**************')

Output:

0
w_old= 1.0
None
w_new= 2.0  # running ema_op first applies the update to w
1.1  #0.9*1.0+0.1*2.0=1.1
**************
1
w_old= 2.0
None
w_new= 3.0
1.29  #0.9*1.1+0.1*3.0=1.29
**************
2
w_old= 3.0
None
w_new= 4.0
1.561  #0.9*1.29+0.1*4.0=1.561
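The same three values can be reproduced in plain Python, which may make the update order easier to follow. This is only a sketch of the arithmetic shown in the output comments, not of how TensorFlow implements it; the variable names are chosen for illustration.

```python
decay = 0.9
w = 1.0
shadow = w              # the shadow variable starts at the variable's initial value

history = []
for i in range(3):
    w += 1.0            # the update runs first (control dependency)
    shadow = decay * shadow + (1 - decay) * w
    history.append(round(shadow, 6))

print(history)
```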

Complete code for a neural network that classifies MNIST handwritten digits with batch normalization added:

import tensorflow as tf
#import input_data
from tqdm import tqdm
import numpy as np
import math
from six.moves import cPickle as pickle
# Data preprocessing
pickle_file = '/home/sxl/tensor學習/My Udacity/notM/notMNISTs.pickle'
# To speed up computation, this is a preprocessed small-sample MNIST handwritten-digit set; the data can be downloaded [here](http://download.csdn.net/detail/whitesilence/9908115)
with open(pickle_file, 'rb') as f:
  save = pickle.load(f)
  train_dataset = save['train_dataset']
  train_labels = save['train_labels']
  valid_dataset = save['valid_dataset']
  valid_labels = save['valid_labels']
  test_dataset = save['test_dataset']
  test_labels = save['test_labels']
  del save  # hint to help gc free up memory
  print('Training set', train_dataset.shape, train_labels.shape)
  print('Validation set', valid_dataset.shape, valid_labels.shape)
  print('Test set', test_dataset.shape, test_labels.shape)

image_size = 28
num_labels = 10

def reformat(dataset, labels):
    dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
    # Map 0 to [1.0, 0.0, 0.0 ...], 1 to [0.0, 1.0, 0.0 ...]
    labels = (np.arange(num_labels) == labels[:, None]).astype(np.float32)
    return dataset, labels

train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

# Build a 7-layer network (counting the input layer)
layer_sizes = [784, 1000, 500, 250, 250,250,10]
L = len(layer_sizes) - 1  # number of layers
num_examples = train_dataset.shape[0]
num_epochs = 100
starter_learning_rate = 0.02
decay_after = 15  # epoch after which to begin learning rate decay
batch_size = 120
num_iter = (num_examples // batch_size) * num_epochs  # number of loop iterations (integer division so range() accepts it)

x = tf.placeholder(tf.float32, shape=(None, layer_sizes[0]))
outputs = tf.placeholder(tf.float32)
testing=tf.placeholder(tf.bool)
learning_rate = tf.Variable(starter_learning_rate, trainable=False)

def bi(inits, size, name):
    return tf.Variable(inits * tf.ones([size]), name=name)

def wi(shape, name):
    return tf.Variable(tf.random_normal(shape, name=name)) / math.sqrt(shape[0])

shapes = zip(layer_sizes[:-1], layer_sizes[1:])  # shapes of linear layers

weights = {'W': [wi(s, "W") for s in shapes],  # feedforward weights
           # batch normalization parameter to shift the normalized value
           'beta': [bi(0.0, layer_sizes[l+1], "beta") for l in range(L)],
           # batch normalization parameter to scale the normalized value
           'gamma': [bi(1.0, layer_sizes[l+1], "gamma") for l in range(L)]}

ewma = tf.train.ExponentialMovingAverage(decay=0.99)  # to calculate the moving averages of mean and variance
bn_assigns = []  # this list stores the updates to be made to average mean and variance

def batch_normalization(batch, mean=None, var=None):
    if mean is None or var is None:
        mean, var = tf.nn.moments(batch, axes=[0])
    return (batch - mean) / tf.sqrt(var + tf.constant(1e-10))

# average mean and variance of all layers
running_mean = [tf.Variable(tf.constant(0.0, shape=[l]), trainable=False) for l in layer_sizes[1:]]
running_var = [tf.Variable(tf.constant(1.0, shape=[l]), trainable=False) for l in layer_sizes[1:]]

def update_batch_normalization(batch, l):
    "batch normalize + update average mean and variance of layer l"
    mean, var = tf.nn.moments(batch, axes=[0])
    assign_mean = running_mean[l-1].assign(mean)
    assign_var = running_var[l-1].assign(var)
    bn_assigns.append(ewma.apply([running_mean[l-1], running_var[l-1]]))
    with tf.control_dependencies([assign_mean, assign_var]):
        return (batch - mean) / tf.sqrt(var + 1e-10)


def eval_batch_norm(batch,l):
    mean = ewma.average(running_mean[l - 1])
    var = ewma.average(running_var[l - 1])
    s = batch_normalization(batch, mean, var)
    return s

def net(x,weights,testing=False):
    d={'m': {}, 'v': {}, 'h': {}}
    h=x
    for l in range(1, L+1):
        print("Layer ", l, ": ", layer_sizes[l-1], " -> ", layer_sizes[l])
        d['h'][l-1]=h
        s= tf.matmul(d['h'][l-1], weights['W'][l-1])
        m, v = tf.nn.moments(s, axes=[0])
        if testing:
            s=eval_batch_norm(s,l)
        else:
            s=update_batch_normalization(s, l)
        s=weights['gamma'][l-1] * s + weights["beta"][l-1]
        if l == L:
            # use softmax activation in output layer
            h = tf.nn.softmax(s)
        else:
            h= tf.nn.relu(s)
        d['m'][l]=m
        d['v'][l]=v
    d['h'][l]=h
    return h,d

y,_=net(x,weights)

cost = -tf.reduce_mean(tf.reduce_sum(outputs*tf.log(y), 1))

correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(outputs, 1))  # no of correct predictions

accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float")) * tf.constant(100.0)


train_step = tf.train.AdamOptimizer(learning_rate).minimize(cost)

# add the updates of batch normalization statistics to train_step
bn_updates = tf.group(*bn_assigns)
with tf.control_dependencies([train_step]):
    train_step = tf.group(bn_updates)

print("===  Starting Session ===")

sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

i_iter = 0
print("=== Training ===")
#print("Initial Accuracy: ", sess.run(accuracy, feed_dict={x: test_dataset, outputs: test_labels, testing: True}), "%")

for i in tqdm(range(i_iter, num_iter)):
    #images, labels = mnist.train.next_batch(batch_size)
    start = (i * batch_size) % num_examples
    images=train_dataset[start:start+batch_size,:]
    labels=train_labels[start:start+batch_size,:]
    sess.run(train_step, feed_dict={x: images, outputs: labels})
    if (i > 1) and ((i+1) % (num_iter // num_epochs) == 0):  # i > 1 and one epoch is complete, i.e. all the data has been trained on once
        epoch_n = i // (num_examples // batch_size)  # index of the current epoch
        perm = np.arange(num_examples)
        np.random.shuffle(perm)
        train_dataset = train_dataset[perm]  # after a full pass over the training data, reshuffle it so the next pass does not see the same batches
        train_labels = train_labels[perm]
        if (epoch_n+1) >= decay_after:
            # decay learning rate
            # learning_rate = starter_learning_rate * ((num_epochs - epoch_n) / (num_epochs - decay_after))
            ratio = 1.0 * (num_epochs - (epoch_n+1))  # epoch_n + 1 because learning rate is set for next epoch
            ratio = max(0, ratio / (num_epochs - decay_after))
            sess.run(learning_rate.assign(starter_learning_rate * ratio))
        print("Train Accuracy: ", sess.run(accuracy, feed_dict={x: images, outputs: labels}))

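# The graph above was built with testing=False, so feeding the `testing` placeholder has no effect.
# Build an inference-mode copy of the network that normalizes with the moving averages instead.
y_test, _ = net(x, weights, testing=True)
test_correct = tf.equal(tf.argmax(y_test, 1), tf.argmax(outputs, 1))
test_accuracy = tf.reduce_mean(tf.cast(test_correct, "float")) * tf.constant(100.0)
print("Final Accuracy: ", sess.run(test_accuracy, feed_dict={x: test_dataset, outputs: test_labels}), "%")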

sess.close()
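The learning-rate schedule buried in the training loop above can be read in isolation. The sketch below restates it as a standalone function; the function name `learning_rate_for_next_epoch` is hypothetical, while the default values are the constants used in the listing (starter_learning_rate = 0.02, num_epochs = 100, decay_after = 15).

```python
def learning_rate_for_next_epoch(epoch_n, starter=0.02, num_epochs=100, decay_after=15):
    """Constant rate at first, then linear decay to zero at the last epoch."""
    if (epoch_n + 1) < decay_after:
        return starter                      # no decay yet
    ratio = 1.0 * (num_epochs - (epoch_n + 1))
    ratio = max(0.0, ratio / (num_epochs - decay_after))
    return starter * ratio

for epoch_n in (0, 14, 56, 99):
    print(epoch_n, learning_rate_for_next_epoch(epoch_n))
```

The rate stays at its starting value until epoch decay_after, then falls linearly and reaches zero after the final epoch.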

Another reference on batch normalization: http://blog.csdn.net/intelligence1994/article/details/53888270

An introduction to commonly used TensorFlow functions: http://blog.csdn.net/wuqingshan2010/article/details/71056292

