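Li Li: Automatic Gradient Computation, Implementing a Multi-Layer Neural Network with Automatic Differentiation

This series of articles is aimed at deep learning developers. Using Image Caption Generation, an interesting concrete task, it introduces deep learning in an accessible way. The series covers many popular deep learning models, such as CNN, RNN/LSTM, and Attention. This is the sixth article.

Author: Li Li

He currently works at Easemob (環信), an instant messaging cloud platform and omni-channel intelligent customer service platform, where he works on intelligent customer service and chatbots and focuses on using deep learning to improve chatbot performance.

Related articles:

Li Li: Understanding Deep Learning through Image Caption Generation (part I)

Li Li: Understanding Deep Learning through Image Caption Generation (part II)

Li Li: Understanding Deep Learning through Image Caption Generation (part III)

Li Li: Automatic Gradient Computation, Another View of the Backpropagation Algorithm

Li Li: Automatic Gradient Computation, the cs231n Notes

How Common Deep Learning Frameworks and Tools Are Used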

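Earlier we introduced four ways to compute gradients:

Manual computation
Numerical computation
Symbolic differentiation
Automatic differentiation

A framework or tool obviously cannot rely on manual computation, and numerical computation is too inefficient, so it is generally used only for gradient checking. That leaves symbolic differentiation and automatic differentiation, and current frameworks all use automatic differentiation.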
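To make the cost of the numerical approach concrete, here is a minimal sketch of a gradient check on a toy function. The function `f`, the helper `numerical_gradient`, and the step size `h` are assumptions made for this illustration; they are not part of the course code.

```python
import numpy as np

def f(x):
    # Toy scalar function: f(x) = sum(x_i ** 2)
    return np.sum(x ** 2)

def analytic_gradient(x):
    # Hand-derived gradient of f: df/dx_i = 2 * x_i
    return 2 * x

def numerical_gradient(func, x, h=1e-5):
    # Centered difference: two evaluations of func per element of x,
    # which is why this approach is too slow for actual training.
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h
        f_plus = func(x)
        x.flat[i] = old - h
        f_minus = func(x)
        x.flat[i] = old  # restore the original value
        grad.flat[i] = (f_plus - f_minus) / (2 * h)
    return grad

x = np.array([1.0, -2.0, 3.0])
num_grad = numerical_gradient(f, x)
print(np.max(np.abs(num_grad - analytic_gradient(x))))
```

The printed difference should be tiny, which is the sense in which the numerical gradient is used only to check a gradient computed some other way.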

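[Note: Theano describes itself as performing Symbolic Differentiation, but what it means is not symbolic differentiation in the mathematical sense; interested readers can consult the linked reference.]

Going one level deeper, deep learning frameworks fall into two categories:

1. Those where users can define the computation graph from basic functions (also called operations, or ops);

2. Those where users can only use higher-level functions.

The boundary between the two is blurry, though. Which functions count as basic and which as higher-level? How many functions must be provided to express every neural network?

These questions are hard to settle, but most frameworks are extensible. In TensorFlow, for example, you can define a custom op: if a function is missing, you can implement it yourself. Theano likewise supports custom ops.

Other frameworks or tools may be less flexible, but in essence they are all similar. In some way (code or a configuration file) we define a computation graph, declare which nodes are variables (trainable) and which are constants (or values fed in per batch, such as a placeholder in TensorFlow), and specify the loss function. The framework then automatically computes the derivative of the loss with respect to every trainable parameter. Most frameworks also package the common gradient descent methods, so we only need to specify a few hyperparameters such as the batch size and the learning rate.

Some frameworks, such as Theano, do not do all of this and only help us compute gradients. Such tools are lower-level: they demand more of the user but are also more flexible, and they suit users who care about the details or need to invent their own network structures, which is why many people in academia like Theano. Caffe, Torch, and Keras, by contrast, are higher-level tools. With them I only need to define the CNN or DNN layers one by one (how many hidden units a layer has, which activation function it uses, whether to apply dropout, which loss function to use), and the rest is taken care of.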

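Implementing a Multi-Layer Neural Network with Automatic Differentiation

This amounts to completing part of Assignment 2 of CS231n.

Environment

Please read the list of required software carefully. Here I list some installation commands for my environment (Ubuntu 14.04 LTS).

1. Download and extract

The download link is here.

2. Install virtualenv and the dependencies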

cd assignment2
sudo pip install virtualenv      # This may already be installed
virtualenv .env                  # Create a virtual environment
source .env/bin/activate         # Activate the virtual environment
pip install -r requirements.txt  # Install dependencies
# Work on the assignment for a while ...
deactivate                       # Exit the virtual environment

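virtualenv can be thought of as a virtual Python environment that is isolated from the system environment, and installing packages into it does not require root privileges. Remember to run source .env/bin/activate before using it!

3. Download the data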

cd cs231n/datasets
./get_datasets.sh

4. Compile the Cython extension

5. Start the IPython notebook

(.env) lili@lili-desktop:~/cs231n/assignment2$ ipython notebook

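A browser window should open at http://localhost:8888/tree; open FullyConnectedNets.ipynb.

If you have not used the IPython notebook before, please read the linked reference first. Make sure you understand the basic operations, such as how to execute a cell, before reading on.

Assignment

The cs231n assignment is to implement a fully connected neural network whose number of layers is configurable.

We decompose the neural network into basic Layers. [Note: a Layer here is not the same as the layer we talked about before. Previously a layer meant a fully connected layer of the network, whereas here a Layer can be thought of as a gate, a function, or an op.] For every Layer we can run the feedforward and backward computations. We then assemble these basic Layers into a complex neural network, run the feedforward and backprop computations for the whole network, train the parameters, and make predictions.

So many of the Layers we implement have the following structure: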

def layer_forward(x, w):
  """ Receive inputs x and weights w """
  # Do some computations ...
  z = # ... some intermediate value
  # Do some more computations ...
  out = # the output

  cache = (x, w, z, out) # Values we need to compute gradients

  return out, cache

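We run the forward computation, store the input, the output, and any intermediate results in cache (they are needed in the backward pass), and return the output together with the cache.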

def layer_backward(dout, cache):
  """
  Receive derivative of loss with respect to outputs and cache,
  and compute derivative with respect to inputs.
  """
  # Unpack cache values
  x, w, z, out = cache

  # Use values in cache to compute derivatives
  dx = # Derivative of loss with respect to x
  dw = # Derivative of loss with respect to w

  return dx, dw

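In the backward computation we are given the cache and dout, the gradient passed back from the later layer.

We first read the input, output, and intermediate values from the cache. We then compute the local gradient with respect to each variable, multiply it by the dout passed back from the later layer to obtain the final dLoss/dx and dLoss/dw, and return them.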
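As a concrete instance of this forward/backward pattern, here is a minimal sketch of an elementwise multiplication Layer. The names `scale_forward` and `scale_backward` are hypothetical and chosen only for this illustration; they do not appear in the course code.

```python
import numpy as np

def scale_forward(x, w):
    # Elementwise layer: out = x * w
    out = x * w
    cache = (x, w)  # values the backward pass will need
    return out, cache

def scale_backward(dout, cache):
    x, w = cache
    dx = dout * w  # local gradient d(out)/dx = w, times the upstream gradient
    dw = dout * x  # local gradient d(out)/dw = x, times the upstream gradient
    return dx, dw

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -1.0, 2.0])
out, cache = scale_forward(x, w)
dx, dw = scale_backward(np.ones_like(out), cache)
```

With an upstream gradient of all ones, dx equals w and dw equals x, which matches the local gradients of a product.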

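cell-1

Click this cell, then choose Run from the Cell menu. If there is no output, congratulations, your environment is fine. If you see import errors or the like, the environment and dependencies were not installed correctly; please search for the error message to resolve it.

This cell imports some dependencies and defines a rel_error function:

def rel_error(x, y):
  """ returns relative error """
  return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))

This function computes the relative error between two ndarrays (two numbers, two vectors, or two matrices). It is mainly used for gradient checking, that is, comparing the numerical gradient with the gradient we compute. If the relative error is small, our gradient is probably correct. [A large error means it is definitely wrong, but a small error does not guarantee it is right. It is like unit testing: passing the tests does not mean there are no bugs, but failing them certainly means there is one.]

The computation is simple: take the absolute value of x - y and divide it by the sum of the absolute values of x and y. Because of machine precision the denominator may be 0, so a max is applied: if the sum is smaller than 10^-8 the denominator is taken to be 10^-8, otherwise it is |x| + |y|. The result is the largest such ratio over all elements. [In numerical computing always consider overflow, both underflow to 0 and overflow to infinity.]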
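A short usage sketch of rel_error follows. The input arrays are made up for illustration; the function body uses the 10^-8 floor described above.

```python
import numpy as np

def rel_error(x, y):
    """Returns the maximum relative error between x and y."""
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))

a = np.array([1.0, 2.0, 3.0])
b = a + 1e-9                    # nearly identical values
c = np.array([1.0, 2.0, 4.0])   # one clearly different value
zeros = np.zeros(3)

err_small = rel_error(a, b)
err_large = rel_error(a, c)         # |3 - 4| / (3 + 4)
err_zero = rel_error(zeros, zeros)  # denominator is clamped, so no division by zero
```

The second comparison is dominated by the single mismatched element, which shows why taking the maximum over elements is a strict check.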

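cell-2

This cell loads the CIFAR-10 data. To learn about the data format, see the first few cells of knn.ipynb in Assignment 1.

Note: Assignment 1 is available at the linked address.

It is installed the same way as Assignment 2. However, pip install -r requirements.txt may report that pillow-3.0 already exists. Open requirements.txt; it lists two versions of pillow, and deleting one of them fixes the problem.

Below is the result of running the first few cells of Assignment 1. You can try it yourself too, and if you have time it is best to work through Assignment 1.

[Screenshot: output of the first few cells of Assignment 1]

Below is the result of running this cell:

[Screenshot: output of cell-2]

CIFAR-10 consists of images in 10 classes: 'plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck'. There are 60,000 labeled images in total, 50,000 for training and 10,000 for testing. Here we use 49,000 of the 50,000 training images for actual training and 1,000 for validation.

cell-3

Next we open the file cs231n/layers.py and implement its affine_forward function.

First, look at the code the course already provides before we write anything: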

def affine_forward(x, w, b):
  """
  Computes the forward pass for an affine (fully-connected) layer.

  The input x has shape (N, d_1, ..., d_k) and contains a minibatch of N
  examples, where each example x[i] has shape (d_1, ..., d_k). We will
  reshape each input into a vector of dimension D = d_1 * ... * d_k, and
  then transform it to an output vector of dimension M.

  Inputs:
  - x: A numpy array containing input data, of shape (N, d_1, ..., d_k)
  - w: A numpy array of weights, of shape (D, M)
  - b: A numpy array of biases, of shape (M,)

  Returns a tuple of:
  - out: output, of shape (N, M)
  - cache: (x, w, b)
  """
  out = None
  #############################################################################
  # TODO: Implement the affine forward pass. Store the result in out. You     #
  # will need to reshape the input into rows.                                 #
  #############################################################################
  pass
  #############################################################################
  #                             END OF YOUR CODE                              #
  #############################################################################
  cache = (x, w, b)
  return out, cache

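Read the docstring under the function signature carefully.

This function computes the forward pass of an affine (fully connected) layer. An affine transformation sounds very mathematical, but all we need to know is that it is essentially a simple linear transformation. For details, see Wikipedia.

In the special one-dimensional case, y = Ax + b is an affine transformation; in the multi-dimensional case A becomes a matrix and b becomes a vector.

[Note: fully connected here means the affine transformation only and does not include the nonlinear activation. Some literature calls a layer that includes the activation a fully connected layer.]

Input parameter x

x is a numpy ndarray with shape:

(N, d_1, ..., d_k)

N is the batch size. For efficiency we usually compute the forward and backward passes for a whole batch at once. Why is the number of remaining dimensions variable? It is for convenience, because the size of a CNN filter is not fixed. We can simply flatten this multi-dimensional tensor into a one-dimensional vector, because a fully connected layer does not take the spatial position of its inputs into account. [A CNN does take spatial relationships into account, which is why it works better in image processing; we will cover this in detail when we introduce CNNs.] We denote the dimension of the flattened vector by D:

$D = d_1 \times \cdots \times d_k$

Input parameter w

w is a numpy ndarray of shape (D, M). This is easy to understand: for a fully connected layer with D input neurons and M output neurons, the parameter w is a (D, M) matrix. [Note: readers who remember the earlier code will recall that we had it the other way around, with w an M×D matrix. Either convention works; it is only a matter of habit, although some computations then need a transpose and others do not. The one thing to remember is that the shapes must satisfy the rules of matrix multiplication!]

Input parameter b

b is a numpy ndarray of shape (M,). It is an M-dimensional vector, the bias.

Output out

The output out is an ndarray of shape (N, M).

Output cache

cache stores the inputs and intermediate variables of this layer. Here cache = (x, w, b) is a tuple holding x, w, and b, which will be used in the backward pass.

Implementing affine_forward

Having described the inputs and outputs, we need to implement the function. The course code tells us to compute out at the designated place (where pass is).

As mentioned, the affine function is out = Wx + b; what needs attention is the dimensions of the matrix multiplication. First we turn x from a high-dimensional tensor into a 2-dimensional matrix.

N = x.shape[0]
x_temp = x.reshape(N, -1)

This uses ndarray's reshape function. You can look up the documentation online or view it directly in Python, which is more convenient: start ipython, import numpy, and use ? to view it:

$ ipython
import numpy as np
np.reshape?

We could write code ourselves to compute

$D = d_1 \times \cdots \times d_k$

but numpy's reshape offers a shortcut: set one dimension to -1 and let numpy infer it. We know the first dimension is N and the remaining dimensions are flattened into one vector, so we set the second dimension to -1.

Of course we can also compute D entirely ourselves; readers are encouraged to modify the code to do so. [The first approach many will think of is a for loop, but in numpy and similar tools such as MATLAB, avoid for loops where possible, because numpy's built-in functions are optimized.]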
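One loop-free way to compute D explicitly is np.prod over the trailing dimensions. The sketch below uses a made-up input shape and checks that the explicit reshape matches the -1 shortcut.

```python
import numpy as np

x = np.zeros((4, 3, 32, 32))   # hypothetical minibatch: N=4 inputs of shape (3, 32, 32)
N = x.shape[0]

D = int(np.prod(x.shape[1:]))  # D = d_1 * ... * d_k, computed without a for loop
x_explicit = x.reshape(N, D)
x_inferred = x.reshape(N, -1)  # let numpy infer D
```

Both reshapes produce an array of shape (N, D), so either form can be used inside affine_forward.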

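Next comes out = Wx + b.

The output out is N×M, W is D×M, and x_temp is N×D, so the only valid multiplication is x_temp·W. out is therefore computed as:

out = x_temp.dot(w) + b

Wait a moment! b is an M-dimensional vector and x_temp.dot(w) is N×M, so how can these two ndarrays be added? The technique used here is numpy broadcasting. If you are not familiar with it, please read the numpy broadcasting documentation.

If we processed a single example rather than a batch, N would be 1 and the addition would be straightforward. Now we compute x·W for N training examples at once, while b is the same for all of them (W and b do not change across the N computations). Without broadcasting we would have to copy b into an N×M matrix, which wastes memory.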
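The following sketch compares the broadcast addition with an explicit N×M copy of b. The sizes and values are arbitrary and chosen only for illustration.

```python
import numpy as np

N, D, M = 2, 3, 4
x_temp = np.arange(N * D, dtype=float).reshape(N, D)
w = np.ones((D, M))
b = np.array([10.0, 20.0, 30.0, 40.0])           # shape (M,)

out_broadcast = x_temp.dot(w) + b                # b is broadcast across the N rows
out_tiled = x_temp.dot(w) + np.tile(b, (N, 1))   # explicit (N, M) copy of b
```

The two results are identical; broadcasting simply avoids materializing the tiled copy.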

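That completes the function. Simple, isn't it?

How do we know the function is correct once it is written? A great feature of the CS231n course is that every step comes with checking code. After writing the function we can run this cell to test it:

[Screenshot: output of the affine_forward test cell]

The comment in the cell says that if the relative error is smaller than 10^-9, the code is fine (or at least passes the unit test). Congratulations! You have correctly completed the first function!

cell-4

The second function to implement is affine_backward, which computes the gradients in the backward direction.

Input dout

The dLoss/dout passed back from the upstream (later) layer. Its shape is the same as out, (N, M).

Input cache

The cache we saved. It is a tuple containing:

x: the input, of shape (N, d1, ..., dk)

w: the weight matrix, of shape (D, M)

b: the bias, of shape (M,)

Output

The function returns a tuple:

dx: dLoss/dx, of shape (N, d1, ..., dk)

dw: dLoss/dw, of shape (D, M)

db: dLoss/db, of shape (M,)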

def affine_backward(dout, cache):
  x, w, b = cache
  dx, dw, db = None, None, None
  #############################################################################
  # TODO: Implement the affine backward pass.                                 #
  #############################################################################
  db = np.sum(dout, axis=0)
  x_temp = x.reshape(x.shape[0], -1)
  dw = x_temp.T.dot(dout)
  dx = dout.dot(w.T).reshape(x.shape)
  #############################################################################
  #                             END OF YOUR CODE                              #
  #############################################################################
  return dx, dw, db

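The code is already shown above; now let us analyze why it is correct.

First we compute dw and dx.

By the chain rule:

$$\frac{\partial Loss}{\partial w} = \frac{\partial Loss}{\partial out} \cdot \frac{\partial out}{\partial w}$$

Here $\partial Loss / \partial out$ is dout, of shape (N, M), x_temp is (N, D), and dw is (D, M), so the only valid multiplication is:

$$dw = \text{x\_temp}^{T} \cdot dout$$

So the code is:

dw = x_temp.T.dot(dout)

dx is obtained in the same way. The only difference is that the result is the flattened dx, which must be reshaped into a tensor with the same shape as x:

dx = dout.dot(w.T).reshape(x.shape)

Finally, db. If the batch size were 1 this would simply be db = dout, but dout now holds the gradients of N training examples, so they must be summed. This is done with np.sum (a for loop would also work, but it would be inefficient and verbose): db = np.sum(dout, axis=0)

In []: dout=np.array([[1,2,3],[4,5,6]])

In []: dout
Out[]: 
array([[1, 2, 3],
       [4, 5, 6]])

In []: dout.sum(axis=0)
Out[]: array([5, 7, 9])

Above is an example of the sum function; make sure you understand how db is computed.

With the implementation done, we test it in cell-4. The course code already provides the unit test, so we only need to run cell-4.

Also worth noting is that this cell uses the eval_numerical_gradient_array function from cs231n/gradient_check.py. The same file also contains eval_numerical_gradient. Both compute numerical gradients, which are then compared against the gradients we derived; interested readers can study this code carefully.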
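To show what such a check does, here is a self-contained sketch that compares affine_backward against centered finite differences. The helper `numeric_grad` is written for this illustration and is not the course's eval_numerical_gradient_array; the shapes and the random seed are arbitrary.

```python
import numpy as np

def affine_forward(x, w, b):
    out = x.reshape(x.shape[0], -1).dot(w) + b
    return out, (x, w, b)

def affine_backward(dout, cache):
    x, w, b = cache
    x_temp = x.reshape(x.shape[0], -1)
    dx = dout.dot(w.T).reshape(x.shape)
    dw = x_temp.T.dot(dout)
    db = np.sum(dout, axis=0)
    return dx, dw, db

def numeric_grad(f, a, dout, h=1e-5):
    # Centered differences of sum(f(a) * dout) with respect to each entry of a.
    grad = np.zeros_like(a)
    for idx in np.ndindex(*a.shape):
        old = a[idx]
        a[idx] = old + h
        pos = f(a).copy()
        a[idx] = old - h
        neg = f(a).copy()
        a[idx] = old  # restore the original value
        grad[idx] = np.sum((pos - neg) * dout) / (2 * h)
    return grad

rng = np.random.RandomState(0)
x = rng.randn(4, 2, 3)   # N=4, D=2*3=6
w = rng.randn(6, 5)      # D=6, M=5
b = rng.randn(5)
dout = rng.randn(4, 5)

_, cache = affine_forward(x, w, b)
dx, dw, db = affine_backward(dout, cache)
dx_num = numeric_grad(lambda x_: affine_forward(x_, w, b)[0], x, dout)
dw_num = numeric_grad(lambda w_: affine_forward(x, w_, b)[0], w, dout)
db_num = numeric_grad(lambda b_: affine_forward(x, w, b_)[0], b, dout)
```

If the three analytic gradients agree with their numerical counterparts to within a small tolerance, the backward pass is very likely correct.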


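cell-5

This cell implements the forward pass of ReLU.

The code is a single line:

out = np.maximum(0, x)

Note the difference between numpy's maximum and max functions. The former takes two arguments and returns the larger of the two elementwise, like the mathematical function max(x, y). numpy's max, on the other hand, finds the largest value within a single ndarray (optionally along a given axis).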
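A quick sketch of the difference, using a small made-up array:

```python
import numpy as np

x = np.array([[-1.0, 2.0], [3.0, -4.0]])

relu_out = np.maximum(0, x)   # elementwise max(0, x_ij); same shape as x
largest = np.max(x)           # single largest entry of x
row_max = np.max(x, axis=1)   # largest entry in each row
```

np.maximum keeps the shape of x, while np.max reduces it, either to a scalar or along the chosen axis.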

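cell-6

This cell implements the backward pass of ReLU.

It is also just one line of code:

dx = (x >= 0) * dout

Where does it come from? Recall the partial derivative of max(x, y) from earlier:

$$\frac{\partial \max(x, y)}{\partial x} = \mathbb{1}(x \ge y)$$

Setting y to 0 gives:

$$\frac{\partial \max(x, 0)}{\partial x} = \mathbb{1}(x \ge 0)$$

What does x>=0 return? Let us test it:

In []: x=np.array([1,-1])

In []: x>=0
Out[]: array([ True, False], dtype=bool)

It returns a bool array. What happens when a bool array is multiplied by a double array?

In []: x=np.array([1,-1])

In []: x>=0
Out[]: array([ True, False], dtype=bool)

In []: y=np.array([2.0,3.0])

In []: (x>=0)*y
Out[]: array([ 2.,  0.])

As you can see, True is converted to 1 and False to 0.

So numpy's x>=0 is really the mathematical indicator function:

$$\mathbb{1}(x \ge 0) = \begin{cases} 1 & x \ge 0 \\ 0 & x < 0 \end{cases}$$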
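Putting the two ReLU cells together, a minimal sketch might look like the following. It assumes the cache is simply the input x; the function names follow the course's naming, but the bodies and the test values here are illustrative.

```python
import numpy as np

def relu_forward(x):
    out = np.maximum(0, x)
    cache = x  # assumption: the backward pass only needs the input
    return out, cache

def relu_backward(dout, cache):
    x = cache
    # (x >= 0) acts as the indicator function: 1 where x >= 0, else 0
    dx = (x >= 0) * dout
    return dx

x = np.array([1.5, -2.0, 0.5, -0.1])
dout = np.array([10.0, 20.0, 30.0, 40.0])
out, cache = relu_forward(x)
dx = relu_backward(dout, cache)
```

The upstream gradient passes through unchanged where the input was non-negative and is zeroed elsewhere.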

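cell-7

This cell covers affine_relu_forward and affine_relu_backward. [They are already implemented, so we only need to read the code.] Because one layer of a neural network needs both an affine layer and a ReLU layer, combining them is more convenient.

The code is in cs231n/layer_utils.py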

def affine_relu_forward(x, w, b):
  """
  Convenience layer that performs an affine transform followed by a ReLU

  Inputs:
  - x: Input to the affine layer
  - w, b: Weights for the affine layer

  Returns a tuple of:
  - out: Output from the ReLU
  - cache: Object to give to the backward pass
  """
  a, fc_cache = affine_forward(x, w, b)
  out, relu_cache = relu_forward(a)
  cache = (fc_cache, relu_cache)
  return out, cache


def affine_relu_backward(dout, cache):
  """
  Backward pass for the affine-relu convenience layer
  """
  fc_cache, relu_cache = cache
  da = relu_backward(dout, relu_cache)
  dx, dw, db = affine_backward(da, fc_cache)
  return dx, dw, db

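cell-8

svm-loss and softmax-loss.

The course code already provides these, because they were supposed to be completed in the first assignment. Since we skipped Assignment 1, we still need to understand the code. We will not cover the SVM loss in detail; interested readers can consult the linked reference. We will briefly explain the softmax loss, because it is very common in neural networks. For a detailed treatment, please read the linked reference.

First, a concept needs clarifying: strictly speaking there is no loss function called softmax loss. The term means adding a softmax function at the output layer and then using the cross entropy loss.

The softmax function

Simply put, the softmax function maps a vector to another vector whose elements are all greater than 0 (and, by the next condition, less than 1) and sum to 1. A further condition is that the mapping is monotonic: the order of any two numbers is preserved after the mapping.

Given only this description, how would you implement softmax? First we must turn the inputs into positive numbers. There are many ways to do this; the exponential function is the most natural choice. First:

$$e^{x} > 0$$

Moreover, if x1 > x2, then:

$$e^{x_1} > e^{x_2}$$

How do we make them sum to 1? That is easy too: divide by their sum. This gives the softmax function:

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{k=1}^{K} e^{x_k}}$$

Because every element of this vector is between 0 and 1 and they sum to 1, if we treat it as the output of a K-class classifier we can interpret it as the classifier's probabilities.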
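The construction just described (exponentiate, then divide by the sum) can be sketched directly. This naive version is for illustration only and ignores overflow; the input scores are made up.

```python
import numpy as np

def softmax(x):
    # Exponentiate to make every entry positive, then divide by the sum
    e = np.exp(x)
    return e / np.sum(e)

scores = np.array([1.0, 3.0, 2.0])
probs = softmax(scores)
```

The output is positive, sums to 1, and preserves the ordering of the input scores.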

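Cross entropy loss

The actual class is one of 1, 2, ..., K, which we can represent in one-hot form. For example, if the class is 2, we represent it as [0, 1, 0, ..., 0].

We can then use cross entropy to measure the "distance" between the true distribution p = [0, 1, 0, ..., 0] and the distribution q output by the model; see the linked reference for details.

$$H(p, q) = -\sum_{i=1}^{K} p_i \log q_i$$

The smaller the distance, the smaller the loss.

Because exactly one entry of p is 1 and the rest are 0, we only need -log q at the index where p is 1.

For example, suppose K = 5, the true class is 2, and the classifier outputs [0.1, 0.7, 0.1, 0.1, 0]. The loss is then -log 0.7. If the classifier outputs [0.3, 0.7, 0, 0, 0], the loss is still -log 0.7. The loss depends only on the value assigned to the true class, which is a reasonable property. The larger the classifier's second element, the higher the probability it assigns to class 2, so the log value is larger (at most log 1 = 0, meaning no loss), -log is smaller, and the loss is smaller!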
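The K = 5 example can be reproduced in a few lines. The helper `cross_entropy` and the third output `q3` are hypothetical additions for illustration; note that the class "2" in the text corresponds to index 1 in 0-based indexing.

```python
import numpy as np

def cross_entropy(q, label):
    # With a one-hot target, only the predicted probability of the true class matters
    return -np.log(q[label])

label = 1  # the second class, as a 0-based index
q1 = np.array([0.1, 0.7, 0.1, 0.1, 0.0])
q2 = np.array([0.3, 0.7, 0.0, 0.0, 0.0])
q3 = np.array([0.05, 0.9, 0.05, 0.0, 0.0])  # hypothetical, more confident output

loss1 = cross_entropy(q1, label)
loss2 = cross_entropy(q2, label)
loss3 = cross_entropy(q3, label)
```

loss1 and loss2 are equal because both outputs assign 0.7 to the true class, while the more confident q3 yields a smaller loss.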

Code for softmax loss

def softmax_loss(x, y):
  """
  Computes the loss and gradient for softmax classification.

  Inputs:
  - x: Input data, of shape (N, C) where x[i, j] is the score for the jth class
    for the ith input.
  - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
    0 <= y[i] < C

  Returns a tuple of:
  - loss: Scalar giving the loss
  - dx: Gradient of the loss with respect to x
  """
  probs = np.exp(x - np.max(x, axis=1, keepdims=True))
  probs /= np.sum(probs, axis=1, keepdims=True)
  N = x.shape[0]
  loss = -np.sum(np.log(probs[np.arange(N), y])) / N
  dx = probs.copy()
  dx[np.arange(N), y] -= 1
  dx /= N
  return loss, dx

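First, the function's input parameters:

x is the input data of shape (N, C), where N is the batch size and C is the number of classes.

y is the labels, of shape (N,). y[i] takes values in [0, C) and gives the correct class.

Outputs:

loss, the scalar loss

dx, dLoss/dx

The function looks like only a few simple lines, but there is a lot going on in it. Let us go through the code line by line:

Line 1

probs = np.exp(x - np.max(x, axis=1, keepdims=True))

Computed directly, this is just the softmax formula above, but in practice the numerical computation may overflow.

For example, with input x = [1000000, 1000000], exp(1000000) overflows. What can we do?

Consider an example with x = [1, 2, 3] and x = [101, 102, 103]. We compute, respectively:

$$\frac{e^{1}}{e^{1} + e^{2} + e^{3}} \qquad \text{and} \qquad \frac{e^{101}}{e^{101} + e^{102} + e^{103}}$$

If we divide both the numerator and the denominator of the second expression by exp(100), its value is unchanged, and we find that the two values are identical!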

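So one property of softmax is that it depends only on the relative sizes of the entries of the input vector. To prevent overflow we can subtract the largest entry and then compute exp. We discussed the difference between np.max and np.maximum earlier: np.max finds the largest number in x, but because we process a batch of N examples at once, we need the maximum along axis=1. Below is an example of max; please study it carefully:

In []: x=np.array([[1,3,5],[2,1,1]])

In []: np.max(x)
Out[]: 5

In []: np.max(x, axis=1)
Out[]: array([5, 2])

In []: np.max(x, axis=1, keepdims=True)
Out[]: 
array([[5],
       [2]])

In []: np.max(x, axis=1).shape
Out[]: (2,)

In []: np.max(x, axis=1, keepdims=True).shape
Out[]: (2, 1)

Note the effect of keepdims. Taking the max of a vector yields a scalar, reducing the number of dimensions by 1; likewise taking the max of a matrix along one axis yields a vector. keepdims means keep the dimensions. x is (N, C) and np.max(x, axis=1, keepdims=True) is (N, 1), so x - np.max(x, axis=1, keepdims=True) is a valid subtraction under the broadcasting rules. With keepdims=False, a vector of shape (N,) cannot in general be subtracted from an (N, C) array.

Finally np.exp computes the exp of every element of this (N, C) matrix (it is a universal function).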
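The effect of subtracting the row maximum can be checked with a short sketch. The helper name `stable_softmax` is an assumption for this illustration; the inputs are the examples from the text.

```python
import numpy as np

def stable_softmax(x):
    # x has shape (N, C); subtract each row's max before exponentiating
    shifted = x - np.max(x, axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=1, keepdims=True)

small = stable_softmax(np.array([[1.0, 2.0, 3.0]]))
large = stable_softmax(np.array([[101.0, 102.0, 103.0]]))
huge = stable_softmax(np.array([[1000000.0, 1000000.0]]))
```

small and large agree because softmax depends only on the differences between entries, and huge stays finite because the largest exponent evaluated is exp(0).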

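Line 2

This line normalizes, using np.sum. As with max, keepdims matters here too.

Lines 3 and 4

N = x.shape[0]
loss = -np.sum(np.log(probs[np.arange(N), y])) / N

First we need to understand probs[np.arange(N), y]

In []: x=np.array([[1,3,5],[2,1,1]])

In []: probs = np.exp(x - np.max(x, axis=1, keepdims=True))

In []: probs /= np.sum(probs, axis=1, keepdims=True)

In []: N = x.shape[0]

In []: y=np.array([1,0])

In []: probs[np.arange(N), y]
Out[]: array([ 0.11731043,  0.57611688])

As discussed, we need the log of the entry at the index of the true class. The true class indices are given by y. In the example above x holds two training examples and y holds the corresponding correct class indices, 1 and 0. So we need the entry in row 0, column 1 and the entry in row 1, column 0; we take -log of each and add them up. We could write a for loop, but numpy's ndarray provides a convenient way to pick out part of an array. np.arange is similar to Python's standard range but returns an ndarray holding the N numbers (0, 1, ..., N-1). probs[np.arange(N), y] then indexes with these two one-dimensional arrays to produce a one-dimensional array, equivalent to [probs[0,1], probs[1,0]].

Next the log function is applied to every element of this array, the results are summed and negated, and dividing by N gives the average loss.

Please make sure you understand this line of code. If you are not familiar with ndarray indexing, see the linked reference.

Next we compute dLoss/dx. The formula is somewhat involved, so I derive it in detail below; readers with time should work through it step by step themselves. To keep the formulas simple we write p for probs in the code:

$$p_i = \frac{e^{x_i}}{\sum_k e^{x_k}}, \qquad Loss = -\sum_k y_k \log p_k$$

First we compute

$$\frac{\partial p_i}{\partial x_j}$$

There are two cases. The first is i = j. Recall the quotient rule:

$$\left(\frac{f}{g}\right)' = \frac{f'g - fg'}{g^2}$$

which gives

$$\frac{\partial p_i}{\partial x_i} = \frac{e^{x_i}\sum_k e^{x_k} - e^{x_i}e^{x_i}}{\left(\sum_k e^{x_k}\right)^2} = p_i(1 - p_i)$$

The second case is i ≠ j:

$$\frac{\partial p_i}{\partial x_j} = \frac{0 - e^{x_i}e^{x_j}}{\left(\sum_k e^{x_k}\right)^2} = -p_i p_j$$

Next we compute

$$\frac{\partial Loss}{\partial x_i} = -\sum_k y_k \frac{1}{p_k}\frac{\partial p_k}{\partial x_i}$$

Splitting the summation index k into the two cases k = i and k ≠ i and substituting the formulas above gives:

$$\frac{\partial Loss}{\partial x_i} = -y_i(1 - p_i) + \sum_{k \ne i} y_k p_i = -y_i + p_i \sum_k y_k = p_i - y_i$$

The last step uses $\sum_k y_k = 1$.

The derivation is somewhat involved, but the result is easy to remember: the gradient of softmax + cross entropy is the model's prediction p minus the label y.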
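The result p - y can be sanity-checked numerically for a single example. The scores, the one-hot target, and the step size `h` below are made up for this sketch.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

def loss_fn(x, y):
    # Cross entropy between the one-hot target y and softmax(x)
    return -np.sum(y * np.log(softmax(x)))

x = np.array([0.2, -1.0, 1.5, 0.3])
y = np.array([0.0, 0.0, 1.0, 0.0])   # one-hot target

analytic = softmax(x) - y            # the gradient claimed above: p - y

h = 1e-5
numeric = np.zeros_like(x)
for i in range(x.size):
    step = np.zeros_like(x)
    step[i] = h
    numeric[i] = (loss_fn(x + step, y) - loss_fn(x - step, y)) / (2 * h)
```

The centered-difference gradient should match p - y closely, and the entries of p - y sum to zero because both p and y sum to 1.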

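Now let us see how the code implements this!

dx = probs.copy()
dx[np.arange(N), y] -= 1
dx /= N

We need to implement probs - y, but the y in the formula is a one-hot vector, whereas our y here is an index (if we ignore the batch dimension N). So we first copy probs into a new variable dx [personally I think modifying probs in place would also be fine]. Because y is 1 only at the index of the correct label, we apply dx[np.arange(N), y] -= 1, and then divide by N to get the averaged dx.

cell-9

Next we assemble these Layers into a complete multi-layer neural network. Open cs231n/classifiers/fc_net.py; we now complete the TwoLayerNet class.

The code is given below with explanations in the comments; additional explanation follows afterward.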

class TwoLayerNet(object):
  """
  A two-layer fully connected neural network with a ReLU activation and
  softmax loss. We assume an input dimension of D, H hidden units, and an
  output of C classes.

  The architecture is affine - relu - affine - softmax.

  Note: this class does not implement gradient descent; instead, it relies on
  a separate Solver object to optimize the parameters.

  The learnable (trainable) parameters of the model are stored in the dict
  self.params, whose keys are parameter names and whose values are the
  corresponding numpy ndarrays.
  """

  def __init__(self, input_dim=3*32*32, hidden_dim=100, num_classes=10,
               weight_scale=1e-3, reg=0.0):
    """
    Initialize a new network.

    Inputs:
    - input_dim: An integer giving the size of the input vector; default 3*32*32 (CIFAR-10 data).
    - hidden_dim: An integer giving the number of hidden units; default 100.
    - num_classes: An integer giving the number of output classes; default 10 (CIFAR-10 classes).
    - dropout: Scalar between 0 and 1 giving the dropout probability.
    - weight_scale: Scalar giving the standard deviation used to randomly initialize the weights; default 1e-3.
    - reg: Scalar giving the L2 regularization strength.
    """
    self.params = {}
    self.reg = reg

    ############################################################################
    # TODO: Initialize the weights and biases of the two-layer net. Weights    #
    # are drawn from a Gaussian with mean 0 and standard deviation             #
    # weight_scale; biases are initialized to zero. All weights and biases     #
    # are stored in self.params: the first layer uses the keys 'W1' and 'b1',  #
    # the second layer uses 'W2' and 'b2'.                                     #
    ############################################################################
    # np.random.normal generates a matrix of the given shape with standard deviation weight_scale
    self.params['W1'] = np.random.normal(0, weight_scale, (input_dim, hidden_dim))
    self.params['b1'] = np.zeros(hidden_dim)
    self.params['W2'] = np.random.normal(0, weight_scale, (hidden_dim, num_classes))
    self.params['b2'] = np.zeros(num_classes)
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################


  def loss(self, X, y=None):
    """
    Compute the loss and gradient for a minibatch of data.

    Inputs:
    - X: ndarray of input data, of shape (N, d_1, ..., d_k)
    - y: Array of labels, of shape (N,). y[i] is the label for X[i], with values in {0, 1, ..., C-1}

    Returns:
    If y is None, run a test-time forward pass and return:
    - scores: Array of shape (N, C) giving classification scores, where
      scores[i, c] is the score of class c for X[i].
    (At test time the final softmax is not needed: we only want to pick a
    class, and softmax is monotonic, so argmax softmax([x1, x2]) = argmax [x1, x2].)

    If y is not None, run a training-time forward and backward pass and
    return a tuple of:
    - loss: Scalar giving the loss
    - grads: Dict with the same keys as self.params, mapping each parameter
      name to the corresponding gradient.
    """  
    scores = None
    ############################################################################
    # TODO: Implement the forward pass for the two-layer net, computing the    #
    # class scores for X and storing them in the scores variable.              #
    ############################################################################
    affine_relu_out, affine_relu_cache = affine_relu_forward(X, self.params['W1'], self.params['b1'])
    affine2_out, affine2_cache = affine_forward(affine_relu_out, self.params['W2'], self.params['b2'])

    scores = affine2_out
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################

    # If y is None we are in test mode, so just return scores
    if y is None:
      return scores

    loss, grads = 0, {}
    ############################################################################
    # TODO: Implement the backward pass for the two-layer net. Store the loss  #
    # in the loss variable and gradients in the grads dictionary. Compute data #
    # loss using softmax, and make sure that grads[k] holds the gradients for  #
    # self.params[k]. Don't forget to add L2 regularization!                   #
    #                                                                          #
    # NOTE: To ensure that your implementation matches ours and you pass the   #
    # automated tests, make sure that your L2 regularization includes a factor #
    # of 0.5 to simplify the expression for the gradient.                      #
    ############################################################################
    loss, dscores = softmax_loss(scores, y)
    loss += 0.5 * self.reg * (
    np.sum(self.params['W1'] * self.params['W1']) + np.sum(self.params['W2'] * self.params['W2']))

    affine2_dx, affine2_dw, affine2_db = affine_backward(dscores, affine2_cache)
    grads['W2'] = affine2_dw + self.reg * self.params['W2']
    grads['b2'] = affine2_db

    affine1_dx, affine1_dw, affine1_db = affine_relu_backward(affine2_dx, affine_relu_cache)
    grads['W1'] = affine1_dw + self.reg * self.params['W1']

    grads['b1'] = affine1_db
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################

    return loss, grads

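Please read the code carefully; it is quite simple. We have not introduced L2 regularization before, though, so here is a brief introduction; see the linked reference for details.

Its purpose is to prevent overfitting, so we add the following term to the loss function:

$$\frac{1}{2} \lambda \sum_{w} w^2$$

In the code, λ corresponds to the self.reg parameter, hence this line:

loss += 0.5 * self.reg * (np.sum(self.params['W1'] * self.params['W1']) + np.sum(self.params['W2'] * self.params['W2']))

Likewise, when computing the gradient of each weight we must add λw:

grads['W2'] = affine2_dw + self.reg * self.params['W2']

Note that the biases are not included in the regularization term.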
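A tiny numeric sketch of the regularization term and its gradient follows; the weight matrix and the value of `reg` are made up for illustration.

```python
import numpy as np

reg = 0.1                       # hypothetical regularization strength (lambda)
W = np.array([[1.0, -2.0],
              [3.0, 0.5]])

reg_loss = 0.5 * reg * np.sum(W * W)   # 0.5 * lambda * sum of squared weights
reg_grad = reg * W                     # gradient of that term with respect to W: lambda * W
```

The factor 0.5 is what makes the gradient exactly λW rather than 2λW.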

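Next we run this cell and check that the relative error is small enough.

cell-10

Using the Solver. The forward and backward code is done; what remains is the (mini-batch) gradient descent logic.

Open cs231n/solver.py. The course has already implemented it for us; we need to understand the code and then use it. The Solver code is fairly long. If you do not want to read all of it, at least read the comment at the top to understand how it is used.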

"""
  A Solver encapsulates all the logic necessary for training classification
  models. The Solver performs stochastic gradient descent using different
  update rules defined in optim.py.

  The solver accepts both training and validation data and labels so it can
  periodically check classification accuracy on both training and validation
  data to watch out for overfitting.

  To train a model, you will first construct a Solver instance, passing the
  model, dataset, and various options (learning rate, batch size, etc) to the
  constructor. You will then call the train() method to run the optimization
  procedure and train the model.

  After the train() method returns, model.params will contain the parameters
  that performed best on the validation set over the course of training.
  In addition, the instance variable solver.loss_history will contain a list
  of all losses encountered during training and the instance variables
  solver.train_acc_history and solver.val_acc_history will be lists containing
  the accuracies of the model on the training and validation set at each epoch.

  Example usage might look something like this:

  data = {
    'X_train': # training data
    'y_train': # training labels
    'X_val': # validation data
    'y_val': # validation labels
  }
  model = MyAwesomeModel(hidden_size=100, reg=10)
  solver = Solver(model, data,
                  update_rule='sgd',
                  optim_config={
                    'learning_rate': 1e-3,
                  },
                  lr_decay=0.95,
                  num_epochs=10, batch_size=100,
                  print_every=100)
  solver.train()


  A Solver works on a model object that must conform to the following API:

  - model.params must be a dictionary mapping string parameter names to numpy
    arrays containing parameter values.

  - model.loss(X, y) must be a function that computes training-time loss and
    gradients (when y is not None), and test-time classification scores (when
    y is None), with the following inputs and outputs:

    Inputs:
    - X: Array giving a minibatch of input data of shape (N, d_1, ..., d_k)
    - y: Array of labels, of shape (N,) giving labels for X where y[i] is the
      label for X[i]. 

    Returns:
    If y is None, run a test-time forward pass and return:
    - scores: Array of shape (N, C) giving classification scores for X where
      scores[i, c] gives the score of class c for X[i].

    If y is not None, run a training time forward and backward pass and return
    a tuple of:
    - loss: Scalar giving the loss
    - grads: Dictionary with the same keys as self.params mapping parameter
      names to gradients of the loss with respect to those parameters.
  """

  def __init__(self, model, data, **kwargs):
    """
    Construct a new Solver instance.

    Required arguments:
    - model: A model object conforming to the API described above
    - data: A dictionary of training and validation data with the following:
      'X_train': Array of shape (N_train, d_1, ..., d_k) giving training images
      'X_val': Array of shape (N_val, d_1, ..., d_k) giving validation images
      'y_train': Array of shape (N_train,) giving labels for training images
      'y_val': Array of shape (N_val,) giving labels for validation images

    Optional arguments:
    - update_rule: A string giving the name of an update rule in optim.py.
      Default is 'sgd'.
    - optim_config: A dictionary containing hyperparameters that will be
      passed to the chosen update rule. Each update rule requires different
      hyperparameters (see optim.py) but all update rules require a
      'learning_rate' parameter so that should always be present.
    - lr_decay: A scalar for learning rate decay; after each epoch the learning
      rate is multiplied by this value, so the learning rate shrinks over time.
    - batch_size: Size of minibatches used to compute loss and gradient during
      training.
    - num_epochs: The number of epochs to run for during training.
    - print_every: Integer; training losses will be printed every print_every
      iterations.
    - verbose: Boolean; if set to false then no output will be printed during
      training.

    """
    self.model = model
    self.X_train = data['X_train']
    self.y_train = data['y_train']
    self.X_val = data['X_val']
    self.y_val = data['y_val']

    # Unpack keyword arguments
    self.update_rule = kwargs.pop('update_rule', 'sgd')
    self.optim_config = kwargs.pop('optim_config', {})
    self.lr_decay = kwargs.pop('lr_decay', 1.0)
    self.batch_size = kwargs.pop('batch_size', 100)
    self.num_epochs = kwargs.pop('num_epochs', 10)

    self.print_every = kwargs.pop('print_every', 10)
    self.verbose = kwargs.pop('verbose', True)

    # Throw an error if there are extra keyword arguments
    if len(kwargs) > 0:
      extra = ', '.join('"%s"' % k for k in kwargs.keys())
      raise ValueError('Unrecognized arguments %s' % extra)

    # Make sure the update rule exists, then replace the string
    # name with the actual function
    if not hasattr(optim, self.update_rule):
      raise ValueError('Invalid update_rule "%s"' % self.update_rule)
    self.update_rule = getattr(optim, self.update_rule)

    self._reset()


  def _reset(self):
    """
    Set up some book-keeping variables for optimization. Don't call this
    manually.
    """
    # Set up some variables for book-keeping
    self.epoch = 0
    self.best_val_acc = 0
    self.best_params = {}
    self.loss_history = []
    self.train_acc_history = []
    self.val_acc_history = []

    # Make a deep copy of the optim_config for each parameter
    self.optim_configs = {}
    for p in self.model.params:
      d = {k: v for k, v in self.optim_config.iteritems()}
      self.optim_configs[p] = d


  def _step(self):
    """
    Make a single gradient update. This is called by train() and should not
    be called manually.
    """
    # Make a minibatch of training data
    num_train = self.X_train.shape[0]
    batch_mask = np.random.choice(num_train, self.batch_size)
    X_batch = self.X_train[batch_mask]
    y_batch = self.y_train[batch_mask]

    # Compute loss and gradient
    loss, grads = self.model.loss(X_batch, y_batch)
    self.loss_history.append(loss)

    # Perform a parameter update
    for p, w in self.model.params.iteritems():
      dw = grads[p]
      config = self.optim_configs[p]
      next_w, next_config = self.update_rule(w, dw, config)
      self.model.params[p] = next_w
      self.optim_configs[p] = next_config


  def check_accuracy(self, X, y, num_samples=None, batch_size=100):
    """
    Check accuracy of the model on the provided data.

    Inputs:
    - X: Array of data, of shape (N, d_1, ..., d_k)
    - y: Array of labels, of shape (N,)
    - num_samples: If not None, subsample the data and only test the model
      on num_samples datapoints.
    - batch_size: Split X and y into batches of this size to avoid using too
      much memory.

    Returns:
    - acc: Scalar giving the fraction of instances that were correctly
      classified by the model.
    """

    # Maybe subsample the data
    N = X.shape[0]
    if num_samples is not None and N > num_samples:
      mask = np.random.choice(N, num_samples)
      N = num_samples
      X = X[mask]
      y = y[mask]

    # Compute predictions in batches
    num_batches = N / batch_size
    if N % batch_size != 0:
      num_batches += 1
    y_pred = []
    for i in xrange(num_batches):
      start = i * batch_size
      end = (i + 1) * batch_size
      scores = self.model.loss(X[start:end])
      y_pred.append(np.argmax(scores, axis=1))
    y_pred = np.hstack(y_pred)
    acc = np.mean(y_pred == y)

    return acc


  def train(self):
    """
    Run optimization to train the model.
    """
    num_train = self.X_train.shape[0]
    iterations_per_epoch = max(num_train / self.batch_size, 1)
    num_iterations = self.num_epochs * iterations_per_epoch

    for t in xrange(num_iterations):
      self._step()

      # Maybe print training loss
      if self.verbose and t % self.print_every == 0:
        print '(Iteration %d / %d) loss: %f' % (
               t + 1, num_iterations, self.loss_history[-1])

      # At the end of every epoch, increment the epoch counter and decay the
      # learning rate.
      epoch_end = (t + 1) % iterations_per_epoch == 0
      if epoch_end:
        self.epoch += 1
        for k in self.optim_configs:
          self.optim_configs[k]['learning_rate'] *= self.lr_decay

      # Check train and val accuracy on the first iteration, the last
      # iteration, and at the end of each epoch.
      first_it = (t == 0)
      # NOTE: as written this is never true, since t < num_iterations; the
      # final iteration is still checked because it always ends an epoch.
      last_it = (t == num_iterations + 1)
      if first_it or last_it or epoch_end:
        train_acc = self.check_accuracy(self.X_train, self.y_train,
                                        num_samples=1000)
        val_acc = self.check_accuracy(self.X_val, self.y_val)
        self.train_acc_history.append(train_acc)
        self.val_acc_history.append(val_acc)

        if self.verbose:
          print '(Epoch %d / %d) train acc: %f; val_acc: %f' % (
                 self.epoch, self.num_epochs, train_acc, val_acc)

        # Keep track of the best model
        if val_acc > self.best_val_acc:
          self.best_val_acc = val_acc
          self.best_params = {}
          for k, v in self.model.params.iteritems():
            self.best_params[k] = v.copy()

    # At the end of training swap the best params into the model
    self.model.params = self.best_params

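Note: the SGD implemented here differs slightly from the earlier one. Suppose there are 1000 training examples and the minibatch size is 100, so one epoch has 10 iterations of 100 examples each. The earlier code guarantees that the 10 iterations sweep through all 1000 training examples, using each exactly once. This code instead randomly samples 100 examples in each of the 10 iterations, so it does not guarantee that each of the 1000 examples is used once: some may not be used at all, while others may be used several times.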
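The difference between the two sampling schemes can be seen by counting how many distinct examples each one touches in one epoch. This is an illustrative sketch with a fixed seed; the variable names are made up and it is not code from solver.py.

```python
import numpy as np

rng = np.random.RandomState(0)
num_train, batch_size = 1000, 100
iterations = num_train // batch_size

# Solver-style: each iteration draws batch_size indices at random (with replacement)
sampled = np.concatenate([rng.choice(num_train, batch_size) for _ in range(iterations)])

# Epoch-style: shuffle once, then walk through the data in consecutive batches
perm = rng.permutation(num_train)
swept = np.concatenate([perm[i * batch_size:(i + 1) * batch_size] for i in range(iterations)])

num_unique_sampled = np.unique(sampled).size
num_unique_swept = np.unique(swept).size
```

The shuffled sweep visits all 1000 examples exactly once, while random sampling visits noticeably fewer distinct examples and repeats some of them.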

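In addition, the parameter update is encapsulated in optim.py. The contract between Solver and optim.py is: next_w, next_config = self.update_rule(w, dw, config). Both of these points can be seen in the _step function: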

def _step(self):
    """
    Make a single gradient update. This is called by train() and should not
    be called manually.
    """
    # Make a minibatch of training data
    num_train = self.X_train.shape[0]
    batch_mask = np.random.choice(num_train, self.batch_size)
    X_batch = self.X_train[batch_mask]
    y_batch = self.y_train[batch_mask]

    # Compute loss and gradient
    loss, grads = self.model.loss(X_batch, y_batch)
    self.loss_history.append(loss)

    # Perform a parameter update
    for p, w in self.model.params.iteritems():
      dw = grads[p]
      config = self.optim_configs[p]
      next_w, next_config = self.update_rule(w, dw, config)
      self.model.params[p] = next_w
      self.optim_configs[p] = next_config

The update protocol is documented in more detail in optim.py; I won't repeat it here, so please read that file.

Now let's look at the default sgd implementation:

def sgd(w, dw, config=None):
  """
  Performs vanilla stochastic gradient descent.

  config format:
  - learning_rate: Scalar learning rate.
  """
  if config is None: config = {}
  config.setdefault('learning_rate', 1e-2)

  w -= config['learning_rate'] * dw
  return w, config

The core of the code is a single line: w -= config['learning_rate'] * dw
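
To see the Solver/optim contract at work outside the assignment code, here is a minimal sketch that drives an sgd rule with the same per-parameter loop as _step. The toy "model" (two scalar parameters with loss equal to the sum of their squares) and the learning rate 0.1 are assumptions made purely for illustration.

```python
def sgd(w, dw, config=None):
    """Vanilla SGD obeying the (w, dw, config) -> (next_w, next_config) contract."""
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)
    next_w = w - config['learning_rate'] * dw
    return next_w, config

# Hypothetical toy setup: one scalar "parameter" per name, loss = sum of w**2.
params = {'W1': 3.0, 'b1': -2.0}
optim_configs = {p: {'learning_rate': 0.1} for p in params}

for step in range(100):
    grads = {p: 2.0 * w for p, w in params.items()}  # d(w**2)/dw = 2w
    for p in list(params):
        next_w, next_config = sgd(params[p], grads[p], optim_configs[p])
        params[p] = next_w               # the Solver stores the updated weights
        optim_configs[p] = next_config   # and the rule's (possibly updated) state

print(params)
```

Keeping one config dictionary per parameter is what lets stateful rules such as momentum or Adam carry their own running averages for each weight matrix.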

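Having read Solver.py and optim.py, we now use them together with the earlier TwoLayerNet to train a two-layer neural network; the goal is a validation accuracy above 50%.

Here is the code: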

model = TwoLayerNet()
solver = None

##############################################################################
# TODO: Use a Solver instance to train a TwoLayerNet that achieves at least  #
# 50% accuracy on the validation set.                                        #
##############################################################################
for k, v in data.iteritems():
  print '%s: ' % k, v.shape

model = TwoLayerNet(hidden_dim=, reg=1e-2)
solver = Solver(model, data,
                update_rule='sgd',
                optim_config={
                  'learning_rate': 1e-3,
                },
                lr_decay=,
                num_epochs=, batch_size=,
                print_every=)
solver.train()
##############################################################################
#                             END OF YOUR CODE                               #
##############################################################################

The output of the run is shown below (because of randomness, your results may differ from mine):

[Figure: output of the training run]

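Readers may ask why learning_rate=1e-3 and reg=1e-2; other values do not seem to reach 50% validation accuracy. This is one of the tricks of training neural networks. Nobody can know good values in advance; the only way to find reasonably good hyperparameters is to keep experimenting. Interested readers can consult the linked references, and there are many tips for training neural networks online that you can search for yourself. Later parts of the assignment also include optimization algorithms that converge faster than sgd, such as RMSProp and Adam; interested readers can consult the linked reference. From here on I will only give the code; please work through the details yourself.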
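
Since good hyperparameters can only be found by trial, it helps to automate the trials. Below is a minimal random-search sketch that samples learning_rate and reg on a log scale. Everything in it is an assumption for illustration: the search ranges, the helper names, and in particular fake_val_acc, which is a stand-in objective so the sketch runs without the assignment code.

```python
import math
import random

def sample_log_uniform(low_exp, high_exp, rng):
    """Draw 10**u with u uniform in [low_exp, high_exp]."""
    return 10 ** rng.uniform(low_exp, high_exp)

def random_search(evaluate, num_trials, seed=0):
    """Evaluate random (learning_rate, reg) pairs and return the best one found."""
    rng = random.Random(seed)
    best_hp, best_acc = None, float('-inf')
    for _ in range(num_trials):
        hp = {
            'learning_rate': sample_log_uniform(-5, -1, rng),  # assumed search range
            'reg': sample_log_uniform(-4, 0, rng),             # assumed search range
        }
        val_acc = evaluate(hp)
        if val_acc > best_acc:
            best_hp, best_acc = hp, val_acc
    return best_hp, best_acc

def fake_val_acc(hp):
    """Stand-in objective (NOT a real network): peaks at lr=1e-3, reg=1e-2."""
    d_lr = math.log10(hp['learning_rate']) + 3
    d_reg = math.log10(hp['reg']) + 2
    return 0.5 - 0.05 * (d_lr ** 2 + d_reg ** 2)

best_hp, best_acc = random_search(fake_val_acc, num_trials=50)
print(best_hp, best_acc)
```

In real use, evaluate would build a TwoLayerNet and a Solver from hp, call solver.train(), and return solver.best_val_acc.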

cell-11

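This part does not need to be implemented; just run it. Readers who are not familiar with Python may still want to read the code to learn how the plots are drawn.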

[Figure: plots produced by this cell]

cell-12

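Implement FullyConnectedNet, generalizing the two-layer network to a fully connected network with an arbitrary number of layers. It is much the same as the two-layer case, so I won't go into detail; please read the code yourself. The one thing to watch is that the first and last layers need special handling.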

class FullyConnectedNet(object):
  """
  A fully-connected neural network with an arbitrary number of hidden layers,
  ReLU nonlinearities, and a softmax loss function. This will also implement
  dropout and batch normalization as options. For a network with L layers,
  the architecture will be

  {affine - [batch norm] - relu - [dropout]} x (L - 1) - affine - softmax

  where batch normalization and dropout are optional, and the {...} block is
  repeated L - 1 times.

  Similar to the TwoLayerNet above, learnable parameters are stored in the
  self.params dictionary and will be learned using the Solver class.
  """

  def __init__(self, hidden_dims, input_dim=3*32*32, num_classes=10,
               dropout=0, use_batchnorm=False, reg=0.0,
               weight_scale=1e-2, dtype=np.float32, seed=None):
    """
    Initialize a new FullyConnectedNet.

    Inputs:
    - hidden_dims: A list of integers giving the size of each hidden layer.
    - input_dim: An integer giving the size of the input.
    - num_classes: An integer giving the number of classes to classify.
    - dropout: Scalar between 0 and 1 giving dropout strength. If dropout=0 then
      the network should not use dropout at all.
    - use_batchnorm: Whether or not the network should use batch normalization.
    - reg: Scalar giving L2 regularization strength.
    - weight_scale: Scalar giving the standard deviation for random
      initialization of the weights.
    - dtype: A numpy datatype object; all computations will be performed using
      this datatype. float32 is faster but less accurate, so you should use
      float64 for numeric gradient checking.
    - seed: If not None, then pass this random seed to the dropout layers. This
      will make the dropout layers deterministic so we can gradient check the
      model.
    """
    self.use_batchnorm = use_batchnorm
    self.use_dropout = dropout > 0
    self.reg = reg
    self.num_layers = 1 + len(hidden_dims)
    self.dtype = dtype
    self.params = {}

    ############################################################################
    # TODO: Initialize the parameters of the network, storing all values in    #
    # the self.params dictionary. Store weights and biases for the first layer #
    # in W1 and b1; for the second layer use W2 and b2, etc. Weights should be #
    # initialized from a normal distribution with standard deviation equal to  #
    # weight_scale and biases should be initialized to zero.                   #
    #                                                                          #
    # When using batch normalization, store scale and shift parameters for the #
    # first layer in gamma1 and beta1; for the second layer use gamma2 and     #
    # beta2, etc. Scale parameters should be initialized to one and shift      #
    # parameters should be initialized to zero.                                #
    ############################################################################
    pass
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################

    # When using dropout we need to pass a dropout_param dictionary to each
    # dropout layer so that the layer knows the dropout probability and the mode
    # (train / test). You can pass the same dropout_param to each dropout layer.
    self.dropout_param = {}
    if self.use_dropout:
      self.dropout_param = {'mode': 'train', 'p': dropout}
      if seed is not None:
        self.dropout_param['seed'] = seed

    # With batch normalization we need to keep track of running means and
    # variances, so we need to pass a special bn_param object to each batch
    # normalization layer. You should pass self.bn_params[0] to the forward pass
    # of the first batch normalization layer, self.bn_params[1] to the forward
    # pass of the second batch normalization layer, etc.
    self.bn_params = []
    if self.use_batchnorm:
      self.bn_params = [{'mode': 'train'} for i in xrange(self.num_layers - 1)]

    # Cast all parameters to the correct datatype
    for k, v in self.params.iteritems():
      self.params[k] = v.astype(dtype)


  def loss(self, X, y=None):
    """
    Compute loss and gradient for the fully-connected net.

    Input / output: Same as TwoLayerNet above.
    """
    X = X.astype(self.dtype)
    mode = 'test' if y is None else 'train'

    # Set train/test mode for batchnorm params and dropout param since they
    # behave differently during training and testing.
    if self.dropout_param is not None:
      self.dropout_param['mode'] = mode   
    if self.use_batchnorm:
      for bn_param in self.bn_params:
        bn_param['mode'] = mode  # set the 'mode' key, not a key named after the mode

    scores = None
    ############################################################################
    # TODO: Implement the forward pass for the fully-connected net, computing  #
    # the class scores for X and storing them in the scores variable.          #
    #                                                                          #
    # When using dropout, you'll need to pass self.dropout_param to each       #
    # dropout forward pass.                                                    #
    #                                                                          #
    # When using batch normalization, you'll need to pass self.bn_params[0] to #
    # the forward pass for the first batch normalization layer, pass           #
    # self.bn_params[1] to the forward pass for the second batch normalization #
    # layer, etc.                                                              #
    ############################################################################
    pass
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################

    # If test mode return early
    if mode == 'test':
      return scores

    loss, grads = 0.0, {}
    ############################################################################
    # TODO: Implement the backward pass for the fully-connected net. Store the #
    # loss in the loss variable and gradients in the grads dictionary. Compute #
    # data loss using softmax, and make sure that grads[k] holds the gradients #
    # for self.params[k]. Don't forget to add L2 regularization!               #
    #                                                                          #
    # When using batch normalization, you don't need to regularize the scale   #
    # and shift parameters.                                                    #
    #                                                                          #
    # NOTE: To ensure that your implementation matches ours and you pass the   #
    # automated tests, make sure that your L2 regularization includes a factor #
    # of 0.5 to simplify the expression for the gradient.                      #
    ############################################################################
    pass
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################

    return loss, grads
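
For readers who want a starting point, here is a minimal sketch of the two places where the first and last layers differ from the rest: the first weight matrix takes input_dim rows, and the last affine layer is not followed by a ReLU. This is not the assignment's reference solution. It leaves out batch normalization, dropout, the loss and the backward pass, and the helper names init_params and forward_scores as well as the hidden sizes [100, 50] are made up for illustration.

```python
import numpy as np

def init_params(input_dim, hidden_dims, num_classes, weight_scale=1e-2, seed=0):
    """Create W1, b1, ..., WL, bL for a net with len(hidden_dims) + 1 affine layers."""
    rng = np.random.RandomState(seed)
    # Layer i maps dims[i] -> dims[i + 1]; the first entry is the input size
    # and the last entry is the number of classes.
    dims = [input_dim] + list(hidden_dims) + [num_classes]
    params = {}
    for i in range(len(dims) - 1):
        params['W%d' % (i + 1)] = weight_scale * rng.randn(dims[i], dims[i + 1])
        params['b%d' % (i + 1)] = np.zeros(dims[i + 1])
    return params

def forward_scores(X, params):
    """Forward pass: {affine - relu} x (L - 1), then a final affine layer."""
    num_layers = len(params) // 2
    out = X.reshape(X.shape[0], -1)  # flatten each example into a row vector
    for i in range(1, num_layers + 1):
        out = out.dot(params['W%d' % i]) + params['b%d' % i]
        if i < num_layers:            # no ReLU after the last layer
            out = np.maximum(0, out)
    return out

params = init_params(3 * 32 * 32, [100, 50], 10)
scores = forward_scores(np.zeros((4, 3, 32, 32)), params)
print(scores.shape)
```

A full implementation would also keep each layer's cache during the forward pass so that the backward pass can walk the layers in reverse order.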

cell-13

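Overfit 50 training examples with a two-layer neural network

The hyperparameters need tuning. The method is simply to try many values, preferably with a script.

The parameters I used are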

weight_scale = 5e-2

learning_rate = 1e-3

cell-14

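Overfit 50 training examples with a 5-layer neural network. We find that suitable parameters are harder to find than for the two-layer network.

The parameters I used: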

weight_scale = 5e-2

learning_rate = 5e-3

cell-15

sgd_momentum

Copy the following code into the TODO section:

  next_w = w
  v = config['momentum'] * v - config['learning_rate'] * dw
  next_w += v
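
Written out, the momentum update is v ← momentum · v − learning_rate · dw followed by w ← w + v. The sketch below wraps those two lines in a complete update rule so it can be run on its own. The 'velocity' config key and the default values are assumptions of this sketch, and it uses only arithmetic operators so that it works on plain Python floats as well as NumPy arrays.

```python
def sgd_momentum(w, dw, config=None):
    """SGD with momentum; the velocity is carried between calls inside config."""
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)  # assumed default
    config.setdefault('momentum', 0.9)        # assumed default
    v = config.get('velocity', 0.0 * w)       # zero velocity shaped like w

    v = config['momentum'] * v - config['learning_rate'] * dw
    next_w = w + v

    config['velocity'] = v                    # remember v for the next call
    return next_w, config

# Two steps on the toy loss f(w) = w**2, whose gradient is 2w.
w, config = 1.0, {'learning_rate': 1e-2, 'momentum': 0.9}
for _ in range(2):
    dw = 2.0 * w
    w, config = sgd_momentum(w, dw, config)
print(w)
```

The first step moves w by -0.02; the second step moves it by -0.0376 because 90% of the previous velocity is added to the new gradient step, which is why momentum speeds up progress along directions where the gradient keeps the same sign.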

Below is a comparison of the convergence speed of sgd and sgd_momentum:

[Figure: convergence of sgd versus sgd_momentum]

cell-16

rmsprop and adam

# rmsprop
  config['cache'] = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * (dx * dx)
  next_x = x - config['learning_rate'] * dx / (np.sqrt(config['cache']) + config['epsilon'])

# adam
  config['t'] += 1
  config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dx
  config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * (dx ** 2)
  mb = config['m'] / (1 - config['beta1'] ** config['t'])  # bias-corrected first moment
  vb = config['v'] / (1 - config['beta2'] ** config['t'])  # bias-corrected second moment

  next_x = x - config['learning_rate'] * mb / (np.sqrt(vb) + config['epsilon'])
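
To make the two snippets above runnable on their own, the sketch below wraps them in complete functions and applies a single step to a scalar. It assumes the caller supplies a fully populated config (the hyperparameter values used here are commonly quoted defaults, not values taken from this article), and it replaces np.sqrt with ** 0.5 so that it runs on plain floats as well as NumPy arrays.

```python
def rmsprop(x, dx, config):
    """RMSProp: scale the step by a running average of squared gradients."""
    config['cache'] = (config['decay_rate'] * config['cache']
                       + (1 - config['decay_rate']) * dx * dx)
    next_x = x - config['learning_rate'] * dx / (config['cache'] ** 0.5 + config['epsilon'])
    return next_x, config

def adam(x, dx, config):
    """Adam: momentum-like first moment plus RMSProp-like second moment."""
    config['t'] += 1
    config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dx
    config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * dx ** 2
    # Both averages start at zero, so early estimates are biased toward zero;
    # dividing by (1 - beta**t) undoes that bias.
    m_hat = config['m'] / (1 - config['beta1'] ** config['t'])
    v_hat = config['v'] / (1 - config['beta2'] ** config['t'])
    next_x = x - config['learning_rate'] * m_hat / (v_hat ** 0.5 + config['epsilon'])
    return next_x, config

# Assumed hyperparameter values, chosen for illustration.
adam_config = {'learning_rate': 1e-3, 'beta1': 0.9, 'beta2': 0.999,
               'epsilon': 1e-8, 'm': 0.0, 'v': 0.0, 't': 0}
rms_config = {'learning_rate': 1e-3, 'decay_rate': 0.99,
              'epsilon': 1e-8, 'cache': 0.0}

x_adam, _ = adam(1.0, 5.0, adam_config)    # gradient 5.0 at x = 1.0
x_rms, _ = rmsprop(1.0, 5.0, rms_config)
print(x_adam, x_rms)
```

On the first step the bias correction makes m_hat equal to dx and v_hat equal to dx squared, so Adam moves by almost exactly learning_rate regardless of the gradient's magnitude. RMSProp has no such correction: its cache is only 1% of dx squared after one step, so its first move here is about ten times larger.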

Convergence speed comparison:

[Figure: convergence speed comparison of the update rules]

cell-17

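Train a 5-layer fully connected network on cifar-10; the goal is a model with more than 50% accuracy on the validation set. You will find that tuning all the parameters really is tricky. Don't spend too much time here, though: our focus later is CNNs, so let's save the tuning effort for them. If you are not yet familiar with neural networks, however, tuning a bit more helps build intuition; write a Python script to search for the best parameters. If you have enough compute, you can also run several scripts in parallel. Another trick is to stop early when the loss or val_acc is no longer changing much.

I won't list the best parameters here; please try for yourselves to find hyperparameters that give a val_acc greater than 0.5.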

X_val= data['X_val']
y_val= data['y_val']
X_test= data['X_test']
y_test= data['y_test']

lr =   # hyperparameter to tune
ws =   # hyperparameter to tune

model = FullyConnectedNet([, , , ],
      weight_scale=ws, dtype=np.float64,use_batchnorm=False, reg= )
solver = Solver(model, data,
        print_every=, num_epochs=, batch_size=,
        update_rule='adam',
        optim_config={
          'learning_rate': lr,
        },
        lr_decay = , # hyperparameter to tune
        verbose = True
        )   

solver.train()

plt.subplot(2, 1, 1)
plt.plot(solver.loss_history)
plt.title('Loss history')
plt.xlabel('Iteration')
plt.ylabel('Loss')

plt.subplot(2, 1, 2)
plt.plot(solver.train_acc_history, label='train')
plt.plot(solver.val_acc_history, label='val')
plt.title('Classification accuracy history')
plt.xlabel('Epoch')
plt.ylabel('Classification accuracy')
plt.show() 

best_model = model
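
The early-stopping trick mentioned for this cell can be sketched as a small check on the validation-accuracy history. The function name should_stop and the patience and min_delta parameters are hypothetical; the Solver shown in this article has no such hook, so using the check would mean training one epoch at a time or adding the test to the training loop yourself.

```python
def should_stop(val_acc_history, patience=3, min_delta=1e-3):
    """Return True if the last `patience` entries failed to beat the earlier
    best validation accuracy by at least `min_delta`."""
    if len(val_acc_history) <= patience:
        return False  # not enough history to judge yet
    best_before = max(val_acc_history[:-patience])
    best_recent = max(val_acc_history[-patience:])
    return best_recent < best_before + min_delta

# Made-up histories for illustration.
plateaued = [0.30, 0.40, 0.45, 0.45, 0.449, 0.4505]
improving = [0.30, 0.35, 0.40, 0.45, 0.50]
print(should_stop(plateaued), should_stop(improving))
```

The same check could be fed solver.val_acc_history, which the Solver appends to at the end of every epoch.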

This series is also being serialized on the CSDN AI WeChat public account AI_Thinker.
