前向和反向傳播公式推導

1 定義網絡結構

假設某二分類問題的網絡結構由如圖1.1組成（暫僅以2層網絡舉例，更高層數可依此類比），其輸入的特征向量維數為n，隐藏層神經元個數為

前向和反向傳播公式推導

，輸出層神經元個數為

前向和反向傳播公式推導

（由于是二分類問題，故僅含一個）。

前向和反向傳播公式推導

圖1.1 神經網絡結構

其訓練過程為：首先擷取訓練資料 X ，初始化每層的訓練參數 w、b ，并通過前向傳播算法（包含線性前向傳播和激活函數前向傳播）得到

前向和反向傳播公式推導

并計算

前向和反向傳播公式推導

，然後通過反向傳播算法（包含線性反向傳播和激活函數反向傳播）計算每層的 dw 和 db，最終通過梯度下降算法更新參數 w、b，并多次疊代該過程，得到訓練參數。

前向和反向傳播公式推導

圖1.2 定義神經網絡算法結構

其中參數為：

前向和反向傳播公式推導

類比可知

前向和反向傳播公式推導

維數為

前向和反向傳播公式推導

，

前向和反向傳播公式推導

維數為

前向和反向傳播公式推導

。

X 包含 m 個樣本的

前向和反向傳播公式推導

個特征值，

前向和反向傳播公式推導

包含網絡第1層的所有權重參數，

前向和反向傳播公式推導

包含網絡第1層的所有偏置參數。

2 前向傳播算法

輸入資料 X ，然後通過每層的權重 w 和偏置 b 計算得到輸出預測值

前向和反向傳播公式推導

，并最終通過代價函數衡量算法的運作情況。

前向和反向傳播公式推導

注：上式中

前向和反向傳播公式推導

需要先廣播（python中的一種運算機制）至

前向和反向傳播公式推導

再參與運算。

前向和反向傳播公式推導

，此處激活函數

前向和反向傳播公式推導

選擇 Relu 函數。

前向和反向傳播公式推導

，此處激活函數

前向和反向傳播公式推導

選擇 sigmoid 函數。

前向和反向傳播公式推導

，即為輸出的預測值。

損失函數：

前向和反向傳播公式推導

代價函數（暫不考慮正則化）：

前向和反向傳播公式推導

注意：此處不同于線性回歸的平方和代價函數，因為sigmoid函數在平方和函數（

前向和反向傳播公式推導

）中會成為“非凸函數”，産生很多局部最小值，這對後續的梯度下降法尋找最優值是極為不利的，故重此處新定義代價函數。

前向和反向傳播公式推導

圖2.1 非凸函數

3 反向傳播算法

可以說，一個神經網絡的計算，都是按照前向或反向傳播過程組織的。首先我們計算出一個新的網絡的輸出（前向過程），緊接着進行一個反向傳輸操作。後者即用來計算出每層神經元對應的梯度或導數。

3.1計算圖

如圖3.1，先舉一個計算圖的例子，該例也可從本質上反應反向傳播算法的計算過程。

前向和反向傳播公式推導

圖3.1 計算圖

此處先對

前向和反向傳播公式推導

求導，

前向和反向傳播公式推導

接下來向前傳遞一次，

前向和反向傳播公式推導

即可得到 a, b, c 的所有導數：

前向和反向傳播公式推導

3.2 反向傳播

該例為一簡易的邏輯回歸，其計算圖如圖3.2所示（代碼中我們使用dx表示變量x的導數）：

前向和反向傳播公式推導

圖3.2 邏輯回歸計算圖

(1) 因為

前向和反向傳播公式推導

且

前向和反向傳播公式推導

，輸出值

前向和反向傳播公式推導

，故

前向和反向傳播公式推導

又因為，在邏輯回歸中，損失函數

前向和反向傳播公式推導

故

前向和反向傳播公式推導

(2) 根據計算圖向前計算一步，

前向和反向傳播公式推導

因為

前向和反向傳播公式推導

，故其導函數

前向和反向傳播公式推導

因為

前向和反向傳播公式推導

，故

前向和反向傳播公式推導

則

前向和反向傳播公式推導

(3) 根據計算圖，再次推進一步，

前向和反向傳播公式推導

因為

前向和反向傳播公式推導

，故

前向和反向傳播公式推導

(4)

前向和反向傳播公式推導

因為

前向和反向傳播公式推導

，故

前向和反向傳播公式推導

(5)

前向和反向傳播公式推導

因為

前向和反向傳播公式推導

，故

前向和反向傳播公式推導

(6)

前向和反向傳播公式推導

因為

前向和反向傳播公式推導

，故

前向和反向傳播公式推導

(7)

前向和反向傳播公式推導

因為

前向和反向傳播公式推導

，故

前向和反向傳播公式推導

(8)

前向和反向傳播公式推導

因為

前向和反向傳播公式推導

，故

前向和反向傳播公式推導

注意：由于X是固定的，且所有所需的參數dw、db也已全部求導完成，也不需要對X進行優化，故不需要對

前向和反向傳播公式推導

進行求導。

至此，反向傳播算法的所有計算步驟已全部推導完成，接下來隻需将計算式向量化為代碼所需的形式即可。

3.3 向量化

對應上述所有計算式，向量化為：

(1)

前向和反向傳播公式推導

(2)

前向和反向傳播公式推導

（注意：其中加入

前向和反向傳播公式推導

求平均是因為防止樣本過大而導緻數值過大的情況）

(3)

前向和反向傳播公式推導

(4)

前向和反向傳播公式推導

(5)

前向和反向傳播公式推導

(6)

前向和反向傳播公式推導

4 梯度下降算法

梯度下降是一個用來求函數最小值的算法，此處将使用梯度下降算法來求出代價函數

前向和反向傳播公式推導

的最小值。梯度下降背後的思想是：開始時随機初始化參數w、b，并通過前向傳播計算代價函數，然後尋找下一個能讓代價函數值下降最多的參數組合。多次疊代該步驟直到找到一個局部最小值。

前向和反向傳播公式推導

圖4.1 梯度下降算法

梯度下降算法的種類較多，此處僅以批量梯度下降算法為例，其公式為：

repeat until convergence{

前向和反向傳播公式推導

}

其中 a 是學習率（learning rate），它決定了我們沿着能讓代價函數下降程度最大的方向向下邁出的步子有多大。

5 代碼實作

import neural_networks as nn

n0 = train_x.shape[0]             # 輸入的特征向量維數：12288
Layer = (n0, 20, 7, 5, 1)         # 定義4層網絡，3個隐藏層，1個輸出層

parameters = nn.L_layer_model(train_x, train_y, Layer, num_iterations = 2500, print_cost = True)    # 執行該函數後即可得到優化後的參數w、b

import numpy as np

def initialize_parameters_deep(layer_dims):
    np.random.seed(1)
    parameters = {}
    L = len(layer_dims)            # number of layers in the network

    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1]) * np.sqrt(1 / layer_dims[l-1]) # * 0.01
        parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))
        
        assert(parameters['W' + str(l)].shape == (layer_dims[l], layer_dims[l-1]))
        assert(parameters['b' + str(l)].shape == (layer_dims[l], 1))

    return parameters

"""
Implements the sigmoid activation in numpy
"""
def sigmoid(Z):

    A = 1/(1+np.exp(-Z))
    cache = Z
    
    return A, cache

"""
Implement the RELU function.
"""
def relu(Z):

    A = np.maximum(0,Z)
    
    assert(A.shape == Z.shape)
    
    cache = Z 
    return A, cache


"""
Implement the backward propagation for a single RELU unit.
"""
def relu_backward(dA, cache):

    Z = cache
    dZ = np.array(dA, copy=True) # just converting dz to a correct object.
    
    # When z <= 0, you should set dz to 0 as well. 
    dZ[Z <= 0] = 0
    
    assert (dZ.shape == Z.shape)
    
    return dZ


"""
Implement the backward propagation for a single SIGMOID unit.
"""
def sigmoid_backward(dA, cache):

    Z = cache
    
    s = 1/(1+np.exp(-Z))
    dZ = dA * s * (1-s)
    
    assert (dZ.shape == Z.shape)
    
    return dZ

"""
Implement the linear part of a layer's forward propagation.
"""
def linear_forward(A, W, b):

    Z = np.dot(W, A) + b
    
    assert(Z.shape == (W.shape[0], A.shape[1]))
    cache = (A, W, b)
    
    return Z, cache


"""
Implement the forward propagation for the LINEAR->ACTIVATION layer
"""
def linear_activation_forward(A_prev, W, b, activation):
    if activation == "sigmoid":
        # Inputs: "A_prev, W, b". Outputs: "A, activation_cache".
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = sigmoid(Z)
    elif activation == "relu":
        # Inputs: "A_prev, W, b". Outputs: "A, activation_cache".
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = relu(Z)
    
    assert (A.shape == (W.shape[0], A_prev.shape[1]))
    cache = (linear_cache, activation_cache)

    return A, cache

"""
Implement forward propagation for the [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID computation
"""
def L_model_forward(X, parameters):
    caches = []
    A = X
    L = len(parameters) // 2                  # number of layers in the neural network
    
    # Implement [LINEAR -> RELU]*(L-1). Add "cache" to the "caches" list.
    for l in range(1, L):
        A_prev = A 
        A, cache = linear_activation_forward(A_prev, parameters['W' + str(l)], \
            parameters['b' + str(l)], activation = "relu")
        caches.append(cache)
    
    # Implement LINEAR -> SIGMOID. Add "cache" to the "caches" list.
    AL, cache = linear_activation_forward(A, parameters['W' + str(L)], \
            parameters['b' + str(L)], activation = "sigmoid")
    caches.append(cache)
    
    assert(AL.shape == (1,X.shape[1]))
            
    return AL, caches

"""
Implement the cost function defined by equation (7).
"""
def compute_cost(AL, Y):
    m = Y.shape[1]

    # Compute loss from aL and y.
    cost = (1./m) * (-np.dot(Y,np.log(AL).T) - np.dot(1-Y, np.log(1-AL).T))
    cost = np.squeeze(cost)      # To make sure your cost's shape is what we expect (e.g. this turns [[17]] into 17).
    assert(cost.shape == ())
    
    return cost


"""
Implement the linear portion of backward propagation for a single layer (layer l)
"""
def linear_backward(dZ, cache):
    
    A_prev, W, b = cache  # b沒用上
    m = A_prev.shape[1]
    
    dW = (1 / m) * np.dot(dZ, A_prev.T)
    db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ)
    
    assert (dA_prev.shape == A_prev.shape)
    assert (dW.shape == W.shape)
    assert (db.shape == b.shape)
    
    return dA_prev, dW, db


"""
Implement the backward propagation for the LINEAR->ACTIVATION layer.
"""
def linear_activation_backward(dA, cache, activation):
    
    linear_cache, activation_cache = cache
    
    if activation == "relu":
        dZ = relu_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
        
    elif activation == "sigmoid":
        dZ = sigmoid_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
    
    return dA_prev, dW, db


"""
Implement the backward propagation for the [LINEAR->RELU] * (L-1) -> LINEAR -> SIGMOID group
"""
def L_model_backward(AL, Y, caches):
    
    grads = {}
    L = len(caches) # the number of layers
#    m = AL.shape[1]
    Y = Y.reshape(AL.shape) # after this line, Y is the same shape as AL

    # Initializing the backpropagation
    dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL)) # derivative of cost with respect to AL
    
    # Lth layer (SIGMOID -> LINEAR) gradients. Inputs: "AL, Y, caches". Outputs: "grads["dAL"], grads["dWL"], grads["dbL"]
    current_cache = caches[L-1] #((A W b),  (Z))
    grads["dA" + str(L)], grads["dW" + str(L)], grads["db" + str(L)] = \
    linear_activation_backward(dAL, current_cache, activation = "sigmoid")
    
    for l in reversed(range(L - 1)):
        # lth layer: (RELU -> LINEAR) gradients.
        # Inputs: "grads["dA" + str(l + 2)], caches". Outputs: "grads["dA" + str(l + 1)] , grads["dW" + str(l + 1)] , grads["db" + str(l + 1)] 

        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = \
        linear_activation_backward(grads["dA" + str(l+2)], current_cache, activation = "relu")
        grads["dA" + str(l + 1)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp

    return grads

"""
Update parameters using gradient descent
"""
def update_parameters(parameters, grads, learning_rate):

    L = len(parameters) // 2 # number of layers in the neural network

    # Update rule for each parameter. Use a for loop.
    for l in range(L):
        parameters["W" + str(l+1)] =  parameters["W" + str(l+1)] - learning_rate * grads["dW" + str(l + 1)]
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * grads["db" + str(l + 1)]
        
    return parameters


"""
Implements a L-layer neural network: [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID.
"""
def L_layer_model(X, Y, layers_dims, learning_rate = 0.0075, num_iterations = 3000, print_cost=False):#lr was 0.009

    costs = []                         # keep track of cost
    
    # Parameters initialization.
    parameters = initialize_parameters_deep(layers_dims)
    
    # Loop (gradient descent)
    for i in range(0, num_iterations):

        # Forward propagation: [LINEAR -> RELU]*(L-1) -> LINEAR -> SIGMOID.
        AL, caches = L_model_forward(X, parameters)
        
        # Compute cost.
        cost = compute_cost(AL, Y)
    
        # Backward propagation.
        grads = L_model_backward(AL, Y, caches)
 
        # Update parameters.
        parameters = update_parameters(parameters, grads, learning_rate)
                
        # Print the cost every 100 training example
        if print_cost and i % 100 == 0:
            print ("Cost after iteration %i: %f" %(i, cost))
        if print_cost and i % 100 == 0:
            costs.append(cost)
    
    return parameters

在此也對吳恩達老師的課程以及黃海廣博士整理的資料表示感謝 (≧∇≦)ﾉ，贈人玫瑰，手有餘香。

前向和反向傳播公式推導

1 定義網絡結構

2 前向傳播算法

3 反向傳播算法

3.1計算圖

3.2 反向傳播

3.3 向量化

4 梯度下降算法

5 代碼實作

繼續閱讀

考證大全 | 證券從業資格考試

敲黑闆！2021年證券從業考試考點預測

2021年銀行從業考試考情介紹,果斷收藏!

證券從業合格證書什麼時候列印？有哪些注意事項？

【幹貨滿滿】初級銀行從業考試《個人理财》重點梳理

2020年經濟師考試，難嗎？

初級銀行從業資格證有什麼用？

MBA提前面試純幹貨分享

MBA值得學麼

吳恩達logistic回歸實作

【人工智能行業大師訪談1】吳恩達采訪 Geoffery Hinton

深度學習模型分析人類複雜疾病的準确性

人工智能如何有效地運用于自然語言處理

【趨高機器視覺】機器視覺技術原了解析及解決方案

解碼器用于語義分割：資料依賴的解碼可以實作靈活的特征聚合

cs231n斯坦福基于卷積神經網絡的CV學習筆記（一）KNN和線性分類器/分類器損失/反向傳播一，KNN圖像分類算法二，線性分類器三，線性分類器損失四，反向傳播五，神經網絡