This article is reposted from https://blog.csdn.net/whitesilence/article/details/75667002. Many thanks to the original author; I am reposting it to make it easier for my own future study.
I first came across batch normalization (BN) while reading the ladder network paper (https://arxiv.org/pdf/1507.02672v2.pdf). The paper mentions that BN speeds up convergence, among other benefits, but I did not really understand it, so I searched online for material about BN.
After reading the Zhihu discussion on why Batch Normalization works so well in deep learning, and a study note on Batch Normalization on CSDN, I finally have a reasonable grasp of BN. Here I only summarize the concrete procedure of BN; a deeper understanding of it, why BN is needed, and whether it really helps, is something I am still studying and experimenting with.
BN adds a normalization step to the input of every layer during the training of a neural network.

A traditional neural network only standardizes the sample x before the input layer (subtract the mean, divide by the standard deviation) to reduce the variation between samples. BN builds on this: it standardizes not only the input data x of the input layer, but also the input of every hidden layer.
Without BN, the standardized x is multiplied by the weight matrix $W_{h1}$ and the bias $b_{h1}$ is added to give the first layer's pre-activation $W_{h1}x + b_{h1}$, and the activation function yields $h_1 = \mathrm{ReLU}(W_{h1}x + b_{h1})$. With BN added, the computation of $h_1$ proceeds as shown in the dashed box of the original figure:
1. The matrix x first goes through the linear transformation $W_{h1}$ to give $s_1 = W_{h1}x$. (Note: because the batch mean $\mu_B$ is subtracted in the next step, the effect of the bias $b$ would be cancelled out anyway, so there is no need to include $b$.)
2. $s_1$ is normalized by subtracting the batch mean $\mu_B$ and dividing by the batch standard deviation $\sqrt{\sigma_B^2 + \epsilon}$ to give $s_2$; $\epsilon$ is a small positive constant used to avoid division by zero:
$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} W_{h1}x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} \left(W_{h1}x_i - \mu_B\right)^2$$
3. (Note: after this step $s_2$ is essentially constrained to a standard normal distribution, which reduces the expressive power of the network. To solve this, two new parameters $\gamma$ and $\beta$ are introduced; they are learned by the network itself during training.) $s_2$ is multiplied by $\gamma$ to rescale it and $\beta$ is added as an offset, giving $s_3$.
4. $s_3$ is passed through the activation function to give $h_1$.
Note that the computation above is used during training. At test time, the $\mu$ and $\sigma^2$ that are used are the mean $\mu_P$ and variance $\sigma_P^2$ of the entire training set; these are usually maintained with a moving average while training.
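To make the steps above concrete, here is a minimal NumPy sketch of the training-time BN transform for one fully connected layer. The names (s1, s2, s3, gamma, beta) follow the notation above; this is only an illustration, not the code used later in this post:

```python
import numpy as np

def bn_forward_train(x, W, gamma, beta, eps=1e-5):
    """Training-time BN for one layer, following the s1 -> s2 -> s3 steps above."""
    s1 = x.dot(W)                            # linear transform (no bias: it would be cancelled by the mean subtraction)
    mu_B = s1.mean(axis=0)                   # batch mean of each unit
    var_B = s1.var(axis=0)                   # batch variance of each unit
    s2 = (s1 - mu_B) / np.sqrt(var_B + eps)  # normalize
    s3 = gamma * s2 + beta                   # learned rescale and shift
    h = np.maximum(s3, 0.0)                  # ReLU activation
    return h, mu_B, var_B

# usage: a batch of 4 samples with 3 features, mapped to 2 hidden units
x = np.random.randn(4, 3)
W = np.random.randn(3, 2)
h, mu_B, var_B = bn_forward_train(x, W, gamma=np.ones(2), beta=np.zeros(2))
```

At test time, mu_B and var_B would be replaced by the moving-average statistics collected during training, as described above.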
Before looking at the full code, let's first look at how two averaging-related functions are used:
mean, variance = tf.nn.moments(x, axes, name=None, keep_dims=False)
The input x of this function is the batch of samples, with a shape such as [batchsize, height, width, kernels].
axes is a list specifying along which dimensions the moments are computed.
The function returns the mean and the variance.
```python
import numpy as np
import tensorflow as tf

'''
batch = np.array(np.random.randint(1, 100, [10, 5]))
At first no dtype was specified here, so batch had dtype=int64, which made
sess.run([mm, vv]) below keep raising InvalidArgumentError, because the
computation inside tf.nn.moments requires float arguments.
'''
batch = np.array(np.random.randint(1, 100, [10, 5]), dtype=np.float64)
mm, vv = tf.nn.moments(batch, axes=[0])       # mean and variance along dimension 0
# mm, vv = tf.nn.moments(batch, axes=[0, 1])  # mean and variance over all the data
sess = tf.Session()
print batch
print sess.run([mm, vv])  # pay attention to the argument dtype
sess.close()
```

Output:

```
[[ 53.   9.  67.  30.  69.]
 [ 79.  25.   7.  80.  16.]
 [ 77.  67.  60.  30.  85.]
 [ 45.  14.  92.  12.  67.]
 [ 32.  98.  70.  98.  48.]
 [ 45.  89.  73.  73.  80.]
 [ 35.  67.  21.  77.  63.]
 [ 24.  33.  56.  85.  17.]
 [ 88.  43.  58.  82.  59.]
 [ 53.  23.  34.   4.  33.]]
[array([ 53.1,  46.8,  53.8,  57.1,  53.7]), array([  421.09,   896.96,   598.36,  1056.69,   542.61])]
```
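For image-shaped inputs like the [batchsize, height, width, kernels] tensor mentioned above, BN for convolutional layers usually computes one mean and variance per kernel (channel) by reducing over the first three axes. A small illustrative sketch (the shapes here are just an example):

```python
import numpy as np
import tensorflow as tf

images = np.random.rand(10, 28, 28, 3).astype(np.float32)   # [batchsize, height, width, kernels]
m, v = tf.nn.moments(tf.constant(images), axes=[0, 1, 2])   # one mean/variance per kernel
sess = tf.Session()
print sess.run([m, v])   # both have shape (3,)
sess.close()
```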
ema = tf.train.ExponentialMovingAverage(decay): computing a moving average requires a decay rate, which controls how fast the model updates. ExponentialMovingAverage maintains a shadow variable for every variable that is to be updated during training. The shadow variable is initialized to the variable's initial value and is then updated as

$$\text{shadow\_variable} = \text{decay} \times \text{shadow\_variable} + (1 - \text{decay}) \times \text{variable}$$

From this formula, decay controls how quickly the model updates: the larger it is, the more stable the result. In practice decay is usually set to a constant very close to 1, such as 0.99 or 0.999. To let the model update faster in the early stage of training, ExponentialMovingAverage also accepts a num_updates argument that sets decay dynamically:

$$\text{decay} = \min\left\{\text{decay},\ \frac{1 + \text{num\_updates}}{10 + \text{num\_updates}}\right\}$$
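As a quick illustration of the formula above (plain Python; effective_decay is just a helper name for this note, not a TensorFlow API), the effective decay starts small and approaches the configured value as num_updates grows:

```python
def effective_decay(decay, num_updates):
    # the dynamic decay used when num_updates is supplied
    return min(decay, (1.0 + num_updates) / (10.0 + num_updates))

for n in [0, 10, 100, 1000]:
    print n, effective_decay(0.999, n)
# 0 -> 0.1, 10 -> 0.55, 100 -> ~0.918, 1000 -> ~0.991 (eventually capped at 0.999)
```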
Here is how I understand the moving average (I am not sure whether this is right; if you think something is wrong, please point it out).

Suppose we have a time series $\{a_1, a_2, a_3, \cdots, a_t, a_{t+1}, \cdots\}$.

The average at time $t$ is $mv_t = \dfrac{a_1 + a_2 + \cdots + a_t}{t}$.

The average at time $t+1$ is
$$mv_{t+1} = \frac{a_1 + a_2 + \cdots + a_t + a_{t+1}}{t+1} = \frac{t \cdot mv_t + a_{t+1}}{t+1} = \frac{t}{t+1}\,mv_t + \frac{1}{t+1}\,a_{t+1}$$

Letting $\text{decay} = \dfrac{t}{t+1}$ gives $mv_{t+1} = \text{decay} \cdot mv_t + (1 - \text{decay}) \cdot a_{t+1}$.
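A tiny sanity check of this recurrence (plain Python, purely illustrative): updating with decay = t/(t+1) reproduces the running arithmetic mean exactly, while a fixed decay such as 0.9 gives an exponentially weighted average that only slowly forgets the old values.

```python
a = [1.0, 2.0, 3.0, 4.0, 5.0]

mv = a[0]
for t in range(1, len(a)):             # t counts how many values mv already averages
    decay = float(t) / (t + 1)
    mv = decay * mv + (1 - decay) * a[t]
print mv                               # 3.0, the arithmetic mean of a

ema = a[0]
for x in a[1:]:                        # fixed decay, as ExponentialMovingAverage uses
    ema = 0.9 * ema + 0.1 * x
print ema                              # ~1.90, still close to the early values because decay is large
```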
The following small example shows how ema.apply and ema.average are used:

```python
import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    w = tf.Variable(dtype=tf.float32, initial_value=1.0)
    ema = tf.train.ExponentialMovingAverage(0.9)
    update = tf.assign_add(w, 1.0)
    with tf.control_dependencies([update]):
        ema_op = ema.apply([w])   # returns an op that updates the moving average
        # this line and the next one must not be swapped
        ema_val = ema.average(w)  # returns the current moving average; the argument cannot be a list

with tf.Session(graph=graph) as sess:
    sess.run(tf.initialize_all_variables())
    for i in range(3):
        print i
        print 'w_old=', sess.run(w)
        print sess.run(ema_op)
        print 'w_new=', sess.run(w)
        print sess.run(ema_val)
        print '**************'
```

Output:

```
0
w_old= 1.0
None
w_new= 2.0     # running ema_op first executes the update of w (control dependency)
1.1            # 0.9*1.0 + 0.1*2.0 = 1.1
**************
1
w_old= 2.0
None
w_new= 3.0
1.29           # 0.9*1.1 + 0.1*3.0 = 1.29
**************
2
w_old= 3.0
None
w_new= 4.0
1.561          # 0.9*1.29 + 0.1*4.0 = 1.561
```

The complete code of a fully connected network with batch normalization for MNIST handwritten digit classification:
```python
import tensorflow as tf
#import input_data
from tqdm import tqdm
import numpy as np
import math
from six.moves import cPickle as pickle

# data preprocessing
pickle_file = '/home/sxl/tensor學習/My Udacity/notM/notMNISTs.pickle'
# To speed up computation this is a preprocessed small sample of MNIST-style handwritten digits;
# the data can be downloaded [here](http://download.csdn.net/detail/whitesilence/9908115)
with open(pickle_file, 'rb') as f:
    save = pickle.load(f)
    train_dataset = save['train_dataset']
    train_labels = save['train_labels']
    valid_dataset = save['valid_dataset']
    valid_labels = save['valid_labels']
    test_dataset = save['test_dataset']
    test_labels = save['test_labels']
    del save  # hint to help gc free up memory
    print('Training set', train_dataset.shape, train_labels.shape)
    print('Validation set', valid_dataset.shape, valid_labels.shape)
    print('Test set', test_dataset.shape, test_labels.shape)

image_size = 28
num_labels = 10

def reformat(dataset, labels):
    dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
    # Map 0 to [1.0, 0.0, 0.0 ...], 1 to [0.0, 1.0, 0.0 ...]
    labels = (np.arange(num_labels) == labels[:, None]).astype(np.float32)
    return dataset, labels

train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

# build a 7-layer network
layer_sizes = [784, 1000, 500, 250, 250, 250, 10]
L = len(layer_sizes) - 1  # number of layers
num_examples = train_dataset.shape[0]
num_epochs = 100
starter_learning_rate = 0.02
decay_after = 15  # epoch after which to begin learning rate decay
batch_size = 120
num_iter = (num_examples / batch_size) * num_epochs  # number of loop iterations

x = tf.placeholder(tf.float32, shape=(None, layer_sizes[0]))
outputs = tf.placeholder(tf.float32)
testing = tf.placeholder(tf.bool)
learning_rate = tf.Variable(starter_learning_rate, trainable=False)

def bi(inits, size, name):
    return tf.Variable(inits * tf.ones([size]), name=name)

def wi(shape, name):
    return tf.Variable(tf.random_normal(shape, name=name)) / math.sqrt(shape[0])

shapes = zip(layer_sizes[:-1], layer_sizes[1:])  # shapes of linear layers

weights = {'W': [wi(s, "W") for s in shapes],  # feedforward weights
           # batch normalization parameter to shift the normalized value
           'beta': [bi(0.0, layer_sizes[l+1], "beta") for l in range(L)],
           # batch normalization parameter to scale the normalized value
           'gamma': [bi(1.0, layer_sizes[l+1], "gamma") for l in range(L)]}

ewma = tf.train.ExponentialMovingAverage(decay=0.99)  # to calculate the moving averages of mean and variance
bn_assigns = []  # this list stores the updates to be made to average mean and variance

def batch_normalization(batch, mean=None, var=None):
    if mean is None or var is None:
        mean, var = tf.nn.moments(batch, axes=[0])
    return (batch - mean) / tf.sqrt(var + tf.constant(1e-10))

# average mean and variance of all layers
running_mean = [tf.Variable(tf.constant(0.0, shape=[l]), trainable=False) for l in layer_sizes[1:]]
running_var = [tf.Variable(tf.constant(1.0, shape=[l]), trainable=False) for l in layer_sizes[1:]]

def update_batch_normalization(batch, l):
    "batch normalize + update average mean and variance of layer l"
    mean, var = tf.nn.moments(batch, axes=[0])
    assign_mean = running_mean[l-1].assign(mean)
    assign_var = running_var[l-1].assign(var)
    bn_assigns.append(ewma.apply([running_mean[l-1], running_var[l-1]]))
    with tf.control_dependencies([assign_mean, assign_var]):
        return (batch - mean) / tf.sqrt(var + 1e-10)

def eval_batch_norm(batch, l):
    mean = ewma.average(running_mean[l - 1])
    var = ewma.average(running_var[l - 1])
    s = batch_normalization(batch, mean, var)
    return s

def net(x, weights, testing=False):
    d = {'m': {}, 'v': {}, 'h': {}}
    h = x
    for l in range(1, L+1):
        print "Layer ", l, ": ", layer_sizes[l-1], " -> ", layer_sizes[l]
        d['h'][l-1] = h
        s = tf.matmul(d['h'][l-1], weights['W'][l-1])
        m, v = tf.nn.moments(s, axes=[0])
        if testing:
            s = eval_batch_norm(s, l)
        else:
            s = update_batch_normalization(s, l)
        s = weights['gamma'][l-1] * s + weights["beta"][l-1]
        if l == L:
            # use softmax activation in output layer
            h = tf.nn.softmax(s)
        else:
            h = tf.nn.relu(s)
        d['m'][l] = m
        d['v'][l] = v
        d['h'][l] = h
    return h, d

y, _ = net(x, weights)

cost = -tf.reduce_mean(tf.reduce_sum(outputs * tf.log(y), 1))
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(outputs, 1))  # no of correct predictions
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float")) * tf.constant(100.0)

train_step = tf.train.AdamOptimizer(learning_rate).minimize(cost)
# add the updates of batch normalization statistics to train_step
bn_updates = tf.group(*bn_assigns)
with tf.control_dependencies([train_step]):
    train_step = tf.group(bn_updates)

print "=== Starting Session ==="
sess = tf.Session()
init = tf.initialize_all_variables()
sess.run(init)
i_iter = 0
print "=== Training ==="
#print "Initial Accuracy: ", sess.run(accuracy, feed_dict={x: test_dataset, outputs: test_labels, testing: True}), "%"

for i in tqdm(range(i_iter, num_iter)):
    #images, labels = mnist.train.next_batch(batch_size)
    start = (i * batch_size) % num_examples
    images = train_dataset[start:start+batch_size, :]
    labels = train_labels[start:start+batch_size, :]
    sess.run(train_step, feed_dict={x: images, outputs: labels})
    if (i > 1) and ((i+1) % (num_iter/num_epochs) == 0):  # i > 1 and a full epoch has finished, i.e. all data has been seen once
        epoch_n = i / (num_examples/batch_size)  # which epoch this is
        perm = np.arange(num_examples)
        np.random.shuffle(perm)
        train_dataset = train_dataset[perm]  # after one pass over all training data, reshuffle it so the next epoch does not draw the same batches
        train_labels = train_labels[perm]
        if (epoch_n+1) >= decay_after:
            # decay learning rate
            # learning_rate = starter_learning_rate * ((num_epochs - epoch_n) / (num_epochs - decay_after))
            ratio = 1.0 * (num_epochs - (epoch_n+1))  # epoch_n + 1 because learning rate is set for next epoch
            ratio = max(0, ratio / (num_epochs - decay_after))
            sess.run(learning_rate.assign(starter_learning_rate * ratio))
        print "Train Accuracy: ", sess.run(accuracy, feed_dict={x: images, outputs: labels})

print "Final Accuracy: ", sess.run(accuracy, feed_dict={x: test_dataset, outputs: test_labels, testing: True}), "%"
sess.close()
```
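For comparison, TensorFlow also provides a ready-made op, tf.nn.batch_normalization, which applies the same $\gamma\,(x-\mu)/\sqrt{\sigma^2+\epsilon}+\beta$ transform described earlier in one call. A minimal sketch of normalizing one layer's pre-activation with it (the shapes and variable names here are only illustrative, not part of the code above):

```python
import tensorflow as tf

s = tf.placeholder(tf.float32, shape=(None, 100))   # pre-activation of some layer
gamma = tf.Variable(tf.ones([100]))                  # learned scale
beta = tf.Variable(tf.zeros([100]))                  # learned shift
mean, var = tf.nn.moments(s, axes=[0])               # batch statistics
s_bn = tf.nn.batch_normalization(s, mean, var, offset=beta, scale=gamma, variance_epsilon=1e-10)
h = tf.nn.relu(s_bn)
```

At test time the moving-average statistics would be passed in place of mean and var, just as eval_batch_norm does in the full code above.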
Another reference on batch normalization: http://blog.csdn.net/intelligence1994/article/details/53888270
An introduction to commonly used TensorFlow functions: http://blog.csdn.net/wuqingshan2010/article/details/71056292