
Feature Scaling in Multivariate Linear Regression

Why do we need feature scaling?

"Most of the times, your dataset will contain features highly varying in magnitudes, units and range. But since, most of the machine learning algorithms use Eucledian distance between two data points in their computations, this is a problem.

If left alone, these algorithms only take in the magnitude of features neglecting the units. The results would vary greatly between different units, 5kg and 5000gms. The features with high magnitudes will weigh in a lot more in the distance calculations than features with low magnitudes.

To supress this effect, we need to bring all features to the same level of magnitudes. This can be acheived by scaling."

The passage above is quoted from https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e
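
To make that point concrete, here is a small sketch (not from the original post) of how an unscaled feature dominates a Euclidean distance; the numbers are made up purely for illustration:

import numpy as np

# Two houses described by (size in square feet, number of bedrooms).
a = np.array([2100.0, 3.0])
b = np.array([1600.0, 4.0])

# Unscaled: the distance is almost entirely determined by the size feature.
print(np.linalg.norm(a - b))          # ~500; the bedroom difference is invisible

# After bringing both features to a comparable range (dividing by an assumed
# feature range, for illustration only), both features contribute.
ranges = np.array([3000.0, 4.0])
print(np.linalg.norm(a / ranges - b / ranges))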

Using linear regression as an example, this article briefly illustrates how feature scaling is applied when solving for the parameters of a multivariate linear regression model with gradient descent.

Note: the data used in the examples below comes from Andrew Ng's public machine learning course.

On to the code:

%matplotlib notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d

# Load the housing data: living area, number of bedrooms, and price
path = 'ex1data2.txt'
data = pd.read_csv(path, header=None, names=['Size', 'Bedrooms', 'Price'])
           

A few rows of data are shown below:

[figure: the first few rows of the raw data]
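
The screenshot is not reproduced here; if you are following along, the same view can be printed directly:

print(data.head())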
# Scale the features in data using mean normalization: (x - mean) / (max - min)
max_min_scaler = lambda x: (x - np.average(x)) / (np.max(x) - np.min(x))
data[['Size']] = data[['Size']].apply(max_min_scaler)
data[['Bedrooms']] = data[['Bedrooms']].apply(max_min_scaler)
           

After feature scaling, the same rows of data look like this:

[figure: the first few rows of data after mean normalization]

The descriptive statistics of data are as follows:

[figure: output of data.describe() after scaling]
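
That screenshot is also omitted; the statistics can be regenerated with pandas and serve as a quick sanity check, since after mean normalization each feature has mean 0 and spans a range of exactly 1:

print(data.describe())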

Solving for the unknown parameters of the multivariate regression model with gradient descent:
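
Before the code, it may help to write down the formulas that the functions below implement; these are the standard batch gradient descent equations from Andrew Ng's course (m is the number of training examples, α the learning rate):

$$h_\theta(x) = \theta^{T}x, \qquad J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2$$

$$\theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_j^{(i)} \quad \text{(all } \theta_j \text{ updated simultaneously)}$$

computeCost below evaluates J(θ), and gradientDescent applies the update rule for a fixed number of iterations.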

# Add a column of ones to data so the intercept term can be handled in the matrix computations:
data.insert(0, 'Ones', 1)


# Define the cost function J(theta):
def computeCost(X, y, theta):
    inner = np.power(((X * theta.T) - y), 2)
    return np.sum(inner) / (2 * len(X))


# Define the gradient descent function:
def gradientDescent(X, y, theta, alpha, iters):
    temp = np.matrix(np.zeros(theta.shape))   # holds the simultaneously updated parameters
    parameters = int(theta.ravel().shape[1])  # number of parameters
    cost = np.zeros(iters)                    # cost recorded at every iteration
    
    for i in range(iters):
        error = (X * theta.T) - y             # prediction error for every training example
        
        for j in range(parameters):
            term = np.multiply(error, X[:,j])
            temp[0,j] = theta[0,j] - ((alpha / len(X)) * np.sum(term))
            
        theta = temp
        cost[i] = computeCost(X, y, theta)
        
    return theta, cost
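
As a side note, the inner loop over the parameters can be removed by computing the whole gradient as one matrix product. A minimal, behavior-equivalent sketch (not part of the original post):

def gradientDescentVectorized(X, y, theta, alpha, iters):
    cost = np.zeros(iters)
    for i in range(iters):
        error = (X * theta.T) - y                         # m x 1 column of prediction errors
        theta = theta - (alpha / len(X)) * (error.T * X)  # 1 x n gradient step
        cost[i] = computeCost(X, y, theta)
    return theta, cost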


# Extract the features and labels from the training set:
cols = data.shape[1]
X = data.iloc[:,0:cols-1]
y = data.iloc[:,cols-1:cols]

# Convert the features and labels to matrix form:
X = np.matrix(X.values)
y = np.matrix(y.values)

# Initialize the learning rate, the number of iterations and the parameters
alpha = 0.01
iters = 10000
theta = np.matrix(np.array([0,0,0]))

# Run gradient descent to solve for the unknown parameters of the multivariate linear regression model:
g, cost = gradientDescent(X, y, theta, alpha, iters)

# The value of g is matrix([[340412.65957447, 468817.94834267,  10324.476191  ]])
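
One practical point worth spelling out: because the model was trained on mean-normalized features, a new input must be scaled with the training data's statistics before g can be applied. A minimal sketch (the 1650 sq ft / 3 bedroom house is just a made-up example):

# Re-read the raw file to recover the unscaled training statistics
raw = pd.read_csv(path, header=None, names=['Size', 'Bedrooms', 'Price'])

def scale_like_training(value, col):
    # same transformation as used on the training data: (x - mean) / (max - min)
    return (value - np.average(raw[col])) / (np.max(raw[col]) - np.min(raw[col]))

x_new = np.matrix([1.0,
                   scale_like_training(1650, 'Size'),
                   scale_like_training(3, 'Bedrooms')])
predicted_price = float(x_new * g.T)
print(predicted_price)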
           

Plot the cost as a function of the number of iterations:

fig, ax = plt.subplots()
ax.plot(np.arange(iters), cost, 'r')
ax.set_xlabel('Iterations')
ax.set_ylabel('Cost')
ax.set_title('Error vs. Training Epoch')
fig.savefig('p1.png')
           

Result:

[figure: p1.png, cost vs. training iterations]
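
The plot shows the cost decreasing smoothly, which is exactly what feature scaling buys us here: with the features left in their original ranges (sizes in the thousands), a learning rate of 0.01 is far too large and gradient descent will typically diverge. A quick way to see this for yourself (a sketch, not part of the original post):

# Gradient descent on the unscaled features, with the same learning rate
raw = pd.read_csv(path, header=None, names=['Size', 'Bedrooms', 'Price'])
raw.insert(0, 'Ones', 1)
X_raw = np.matrix(raw.iloc[:, 0:3].values)
y_raw = np.matrix(raw.iloc[:, 3:4].values)

g_raw, cost_raw = gradientDescent(X_raw, y_raw, np.matrix([0.0, 0.0, 0.0]), alpha, 100)
print(cost_raw[:10])   # the cost grows (eventually overflowing) instead of shrinking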

Plot the linear fit of the scaled data:

# First plot a 3D scatter of the training data (features already scaled)
fig = plt.figure()
axes=plt.subplot(111,projection='3d')
axes.scatter(X[:,1],X[:,2],y)

# Then overlay the fitted values from the regression
h = X*g.T
axes.scatter(X[:,1],X[:,2],h,c='r')
axes.set(xlabel='Size',ylabel='Bedrooms',zlabel='Price')
fig.savefig('3dscatter.png')
           

Result:

[figure: 3dscatter.png, 3D scatter of the data with the fitted values in red]

Note: the feature scaling method used in this article is mean normalization, (x - mean) / (max - min); other commonly used methods include standardization (Z-score scaling), min-max scaling and unit-vector scaling. See https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e for details.
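
For completeness, a small sketch of two of those alternatives on the same data (plain NumPy; the helper names are illustrative only, not from any particular library):

def standardize(x):            # Z-score: (x - mean) / std
    return (x - np.average(x)) / np.std(x)

def min_max(x):                # rescale to [0, 1]
    return (x - np.min(x)) / (np.max(x) - np.min(x))

raw = pd.read_csv(path, header=None, names=['Size', 'Bedrooms', 'Price'])
print(raw[['Size']].apply(standardize).head())
print(raw[['Size']].apply(min_max).head())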

Other references: https://blog.csdn.net/tiancai13579/article/details/72781111

PS: This is an original post by the author; please credit the source when reposting.
