
Feature Scaling in Multivariate Linear Regression

Why do we need feature scaling?

"Most of the times, your dataset will contain features highly varying in magnitudes, units and range. But since, most of the machine learning algorithms use Eucledian distance between two data points in their computations, this is a problem.

If left alone, these algorithms only take in the magnitude of features neglecting the units. The results would vary greatly between different units, 5kg and 5000gms. The features with high magnitudes will weigh in a lot more in the distance calculations than features with low magnitudes.

To supress this effect, we need to bring all features to the same level of magnitudes. This can be acheived by scaling."

The passage above is quoted from https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e
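
To make that point concrete, here is a small sketch (not from the original post) of how an unscaled feature dominates a Euclidean distance; the numbers are made up purely for illustration:

import numpy as np

# Two houses described by (size in square feet, number of bedrooms).
a = np.array([2100.0, 3.0])
b = np.array([1600.0, 4.0])

# Unscaled: the distance is almost entirely determined by the size feature.
print(np.linalg.norm(a - b))          # ~500; the bedroom difference is invisible

# After bringing both features to a comparable range (dividing by an assumed
# feature range, for illustration only), both features contribute.
ranges = np.array([3000.0, 4.0])
print(np.linalg.norm(a / ranges - b / ranges))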

Using linear regression as an example, this article briefly illustrates how feature scaling is applied when solving for the parameters of a multivariate linear regression model with gradient descent.

Note: the data used in the examples below comes from Andrew Ng's public machine learning course.

On to the code:

%matplotlib notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d

# Load the housing data: living area, number of bedrooms, and price
path = 'ex1data2.txt'
data = pd.read_csv(path, header=None, names=['Size', 'Bedrooms', 'Price'])
           

A few rows of data are shown below:

[figure: the first few rows of the raw data]
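
The screenshot is not reproduced here; if you are following along, the same view can be printed directly:

print(data.head())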
# Scale the features in data using mean normalization: (x - mean) / (max - min)
max_min_scaler = lambda x: (x - np.average(x)) / (np.max(x) - np.min(x))
data[['Size']] = data[['Size']].apply(max_min_scaler)
data[['Bedrooms']] = data[['Bedrooms']].apply(max_min_scaler)
           

After feature scaling, the same rows of data look like this:

[figure: the first few rows of data after mean normalization]

The descriptive statistics of data are as follows:

[figure: output of data.describe() after scaling]
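
That screenshot is also omitted; the statistics can be regenerated with pandas and serve as a quick sanity check, since after mean normalization each feature has mean 0 and spans a range of exactly 1:

print(data.describe())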

Solving for the unknown parameters of the multivariate regression model with gradient descent:
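
Before the code, it may help to write down the formulas that the functions below implement; these are the standard batch gradient descent equations from Andrew Ng's course (m is the number of training examples, α the learning rate):

$$h_\theta(x) = \theta^{T}x, \qquad J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2$$

$$\theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_j^{(i)} \quad \text{(all } \theta_j \text{ updated simultaneously)}$$

computeCost below evaluates J(θ), and gradientDescent applies the update rule for a fixed number of iterations.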

# Add a column of ones to data so the intercept term can be handled in the matrix computations:
data.insert(0, 'Ones', 1)


# Define the cost function J(theta):
def computeCost(X, y, theta):
    inner = np.power(((X * theta.T) - y), 2)
    return np.sum(inner) / (2 * len(X))


# Define the gradient descent function:
def gradientDescent(X, y, theta, alpha, iters):
    temp = np.matrix(np.zeros(theta.shape))   # holds the simultaneously updated parameters
    parameters = int(theta.ravel().shape[1])  # number of parameters
    cost = np.zeros(iters)                    # cost recorded at every iteration
    
    for i in range(iters):
        error = (X * theta.T) - y             # prediction error for every training example
        
        for j in range(parameters):
            term = np.multiply(error, X[:,j])
            temp[0,j] = theta[0,j] - ((alpha / len(X)) * np.sum(term))
            
        theta = temp
        cost[i] = computeCost(X, y, theta)
        
    return theta, cost
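
As a side note, the inner loop over the parameters can be removed by computing the whole gradient as one matrix product. A minimal, behavior-equivalent sketch (not part of the original post):

def gradientDescentVectorized(X, y, theta, alpha, iters):
    cost = np.zeros(iters)
    for i in range(iters):
        error = (X * theta.T) - y                         # m x 1 column of prediction errors
        theta = theta - (alpha / len(X)) * (error.T * X)  # 1 x n gradient step
        cost[i] = computeCost(X, y, theta)
    return theta, cost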


# Extract the features and labels from the training set:
cols = data.shape[1]
X = data.iloc[:,0:cols-1]
y = data.iloc[:,cols-1:cols]

# Convert the features and labels to matrix form:
X = np.matrix(X.values)
y = np.matrix(y.values)

# Initialize the learning rate, the number of iterations and the parameters
alpha = 0.01
iters = 10000
theta = np.matrix(np.array([0,0,0]))

# Run gradient descent to solve for the unknown parameters of the multivariate linear regression model:
g, cost = gradientDescent(X, y, theta, alpha, iters)

# The value of g is matrix([[340412.65957447, 468817.94834267,  10324.476191  ]])
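
One practical point worth spelling out: because the model was trained on mean-normalized features, a new input must be scaled with the training data's statistics before g can be applied. A minimal sketch (the 1650 sq ft / 3 bedroom house is just a made-up example):

# Re-read the raw file to recover the unscaled training statistics
raw = pd.read_csv(path, header=None, names=['Size', 'Bedrooms', 'Price'])

def scale_like_training(value, col):
    # same transformation as used on the training data: (x - mean) / (max - min)
    return (value - np.average(raw[col])) / (np.max(raw[col]) - np.min(raw[col]))

x_new = np.matrix([1.0,
                   scale_like_training(1650, 'Size'),
                   scale_like_training(3, 'Bedrooms')])
predicted_price = float(x_new * g.T)
print(predicted_price)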
           

Plot the cost as a function of the number of iterations:

fig, ax = plt.subplots()
ax.plot(np.arange(iters), cost, 'r')
ax.set_xlabel('Iterations')
ax.set_ylabel('Cost')
ax.set_title('Error vs. Training Epoch')
fig.savefig('p1.png')
           

Result:

[figure: p1.png, cost vs. training iterations]
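
The plot shows the cost decreasing smoothly, which is exactly what feature scaling buys us here: with the features left in their original ranges (sizes in the thousands), a learning rate of 0.01 is far too large and gradient descent will typically diverge. A quick way to see this for yourself (a sketch, not part of the original post):

# Gradient descent on the unscaled features, with the same learning rate
raw = pd.read_csv(path, header=None, names=['Size', 'Bedrooms', 'Price'])
raw.insert(0, 'Ones', 1)
X_raw = np.matrix(raw.iloc[:, 0:3].values)
y_raw = np.matrix(raw.iloc[:, 3:4].values)

g_raw, cost_raw = gradientDescent(X_raw, y_raw, np.matrix([0.0, 0.0, 0.0]), alpha, 100)
print(cost_raw[:10])   # the cost grows (eventually overflowing) instead of shrinking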

Plot the linear fit of the scaled data:

# First plot a 3D scatter of the training data (features already scaled)
fig = plt.figure()
axes=plt.subplot(111,projection='3d')
axes.scatter(X[:,1],X[:,2],y)

# Then overlay the fitted values from the regression
h = X*g.T
axes.scatter(X[:,1],X[:,2],h,c='r')
axes.set(xlabel='Size',ylabel='Bedrooms',zlabel='Price')
fig.savefig('3dscatter.png')
           

Result:

[figure: 3dscatter.png, 3D scatter of the data with the fitted values in red]

Note: the feature scaling method used in this article is mean normalization, (x - mean) / (max - min); other commonly used methods include standardization (Z-score scaling), min-max scaling and unit-vector scaling. See https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e for details.
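
For completeness, a small sketch of two of those alternatives on the same data (plain NumPy; the helper names are illustrative only, not from any particular library):

def standardize(x):            # Z-score: (x - mean) / std
    return (x - np.average(x)) / np.std(x)

def min_max(x):                # rescale to [0, 1]
    return (x - np.min(x)) / (np.max(x) - np.min(x))

raw = pd.read_csv(path, header=None, names=['Size', 'Bedrooms', 'Price'])
print(raw[['Size']].apply(standardize).head())
print(raw[['Size']].apply(min_max).head())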

Other references: https://blog.csdn.net/tiancai13579/article/details/72781111

PS: This is an original post by the author; please credit the source when reposting.
