
Machine Learning Study Notes 1: Linear Regression (Part 1)

This article walks through building models with four estimators: LinearRegression, ElasticNetCV, LassoCV, and RidgeCV, covering the basic workflow for each. It also uses polynomial feature expansion to compare fit quality and R² from degree 1 to degree 5, and uses a Pipeline to keep the code short. By the end you should know the basic sklearn usage of all four models and how to work with Pipeline.

First, import the modules we will use (you can also add imports as you go).

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
from sklearn.linear_model import LinearRegression, ElasticNetCV, LassoCV, RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split

Next, configure matplotlib to display Chinese text in plots. On Windows this is simple; on macOS I usually use the approach below (there are other methods online, but they take more steps).

# Display Chinese text on macOS; this helper is used later
def getChineseFont():
    return FontProperties(fname='/System/Library/Fonts/STHeiti Medium.ttc')

# Display Chinese text on Windows
mpl.rcParams['font.sans-serif'] = [u'simHei']
mpl.rcParams['axes.unicode_minus'] = False

Then load the data and take a look at it.

path = r'./datas/household_power_consumption_1000.txt'
datas = pd.read_csv(path, sep=';')
datas.head(10)
datas.describe()  # summary statistics, shown below the head() output
         Date      Time  Global_active_power  Global_reactive_power  Voltage  Global_intensity  Sub_metering_1  Sub_metering_2  ...
0  16/12/2006  17:24:00                4.216                  0.418   234.84              18.4             0.0             1.0  ...
1  16/12/2006  17:25:00                5.360                  0.436   233.63              23.0             0.0             1.0  ...
2  16/12/2006  17:26:00                5.374                  0.498   233.29              23.0             0.0             2.0  ...
3  16/12/2006  17:27:00                5.388                  0.502   233.74              23.0             0.0             1.0  ...
4  16/12/2006  17:28:00                3.666                  0.528   235.68              15.8             0.0             1.0  ...
5  16/12/2006  17:29:00                3.520                  0.522   235.02              15.0             0.0             2.0  ...
6  16/12/2006  17:30:00                3.702                  0.520   235.09              15.8             0.0             1.0  ...
7  16/12/2006  17:31:00                3.700                  0.520   235.22              15.8             0.0             1.0  ...
8  16/12/2006  17:32:00                3.668                  0.510   233.99              15.8             0.0             1.0  ...
9  16/12/2006  17:33:00                3.662                  0.510   233.86              15.8             0.0             2.0  ...

       Global_active_power  Global_reactive_power     Voltage  Global_intensity  Sub_metering_1  Sub_metering_2  ...
count          1000.000000            1000.000000  1000.00000       1000.000000          1000.0     1000.000000  ...
mean              2.418772               0.089232   240.03579         10.351000             0.0        2.749000  ...
std               1.239979               0.088088     4.08442          5.122214             0.0        8.104053  ...
min               0.206000               0.000000   230.98000          0.800000             0.0        0.000000  ...
25%               1.806000               0.000000   236.94000          8.400000             0.0        0.000000  ...
50%               2.414000               0.072000   240.65000         10.000000             0.0        0.000000  ...
75%               3.308000               0.126000   243.29500         14.000000             0.0        1.000000  ...
max               7.706000               0.528000   249.37000         33.200000             0.0       38.000000  ...

Specify X and Y, then split into training and test sets.

names = ['Date', 'Time', 'Global_active_power', 'Global_reactive_power', 'Voltage', 'Global_intensity', 'Sub_metering_1', 'Sub_metering_2', 'Sub_metering_3']
X = datas[names[4:6]]  # Voltage and Global_intensity
Y = datas[names[2]]    # Global_active_power
# split into training and test sets, with 20% held out for testing
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=3)
# standardize: fit on the training set, then apply the same transform to the test set
ss = StandardScaler()
x_train = ss.fit_transform(x_train)
x_test = ss.transform(x_test)
lr = LinearRegression()
lr.fit(x_train, y_train)
print('Train R^2: ', lr.score(x_train, y_train))
print('Intercept: ', lr.intercept_)
print('Coefficients: ', lr.coef_)
print('Test R^2: ', lr.score(x_test, y_test))
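
Here lr.score returns the coefficient of determination R². As a quick sanity check, the same value can be computed with sklearn.metrics.r2_score; a minimal sketch, assuming the variables above:

from sklearn.metrics import r2_score

# r2_score(y_true, y_pred) gives the same value as lr.score(x_test, y_test)
print(r2_score(y_test, lr.predict(x_test)))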

y_predict = lr.predict(x_test)
# plot actual vs. predicted values on the test set
plt.figure(figsize=(12, 6), facecolor='w')
plt.plot(range(len(x_test)), y_test, 'r-', lw=1, label='test', zorder=10)
plt.plot(range(len(x_test)), y_predict, 'b-', lw=1, label='predict', zorder=10)
# Chinese title: "relationship between power, current, and voltage"
plt.title(u'功率與電流、電壓的關系', fontproperties=getChineseFont())
plt.legend(loc='upper left')
plt.show()

Train R^2:  0.9914990458818783

Intercept:  2.4425775

Coefficients:  [0.02165243 1.2555645 ]

Test R^2:  0.9901973293430661
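
Reading off the intercept and coefficients, the fitted model on the standardized features is approximately:

Global_active_power ≈ 2.4426 + 0.0217 · Voltage_std + 1.2556 · Global_intensity_std

where Voltage_std and Global_intensity_std denote the standardized inputs; unsurprisingly, current (Global_intensity) carries almost all of the predictive weight.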

[Figure: actual (red) vs. predicted (blue) Global_active_power on the test set]

That was a quick warm-up with the linear model. Next we compare all four models, using polynomial feature expansion and a Pipeline.

# Use a Pipeline to cut down on repeated code; once defined, each pipeline can be reused directly
models = [
    Pipeline([
        ('poly', PolynomialFeatures()),
        ('lr', LinearRegression())
    ]),
    Pipeline([
        ('poly', PolynomialFeatures()),
        ('lr', LassoCV())
    ]),
    Pipeline([
        ('poly', PolynomialFeatures()),
        ('lr', RidgeCV())
    ]),
    Pipeline([
        ('poly', PolynomialFeatures()),
        ('lr', ElasticNetCV())
    ])
]
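
Inside a Pipeline, a step's parameters are addressed as '<step name>__<parameter name>', joined by two underscores. The comparison loop below relies on this; a minimal sketch of the mechanism:

# 'poly' is the step name given in the Pipeline, 'degree' is the
# PolynomialFeatures parameter; 'poly__degree' joins them with '__'
# (harmless here: the comparison loop below sets poly__degree itself)
models[0].set_params(poly__degree=2)
print(models[0].get_params()['poly__degree'])  # 2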
# split into training and test sets, with 30% held out for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
model_name = ['lr', 'lasso', 'ridge', 'ela']
colors = ['r', 'r', 'y', 'y', 'b', 'b']
# standardization
ss = StandardScaler()
X_train_ = ss.fit_transform(X_train)  # fit and transform on the training set
X_test_ = ss.transform(X_test)  # already fit above; train and test must share one fit, so only transform here
# note: the loop below fits the pipelines on the raw X_train/X_test, so X_train_/X_test_ go unused
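
Since the scaled copies above are not actually used, a variant (my sketch, not what these notes run) is to put the scaler inside the pipeline, so scaling is applied consistently without separate fit/transform calls:

# hypothetical variant: StandardScaler as the first pipeline step
scaled_model = Pipeline([
    ('ss', StandardScaler()),
    ('poly', PolynomialFeatures(degree=3)),
    ('lr', LinearRegression())
])
scaled_model.fit(X_train, Y_train)          # fit_transform happens inside the pipeline
print(scaled_model.score(X_test, Y_test))   # transform only, reusing the training fit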

# build each model and plot its predictions
plt.figure(figsize=(21, 24), facecolor='w')  # one figure; the original plt.figure + plt.subplots created two
for i in range(len(models)):
    plt.subplot(4, 1, i+1)
    model = models[i]  # take one pipeline at a time
    for j in range(1, 6, 2):
        # set the degree: 'poly' is the step alias, 'degree' the parameter, joined by two underscores
        model.set_params(poly__degree=j)
        model.fit(X_train, Y_train)  # train the model
        poly = model.named_steps['poly']  # grab the PolynomialFeatures step by its alias
        # feature names let us match each coefficient to a term and write out the
        # model expression (use get_feature_names_out() on sklearn >= 1.0)
        feature = poly.get_feature_names()
        lin = model.named_steps['lr']  # grab the regressor step by its alias
        score = model.score(X_test, Y_test)
        output = 'degree %d, %s model, score: %.3f, coefficients:' % (j, model_name[i], score)
        print(output, lin.coef_)
        print('feature:', feature)
        y_predict = model.predict(X_test)
        label = 'degree %d, score: %.3f' % (j, score)
        plt.plot(range(len(X_test)), y_predict, color=colors[j-1], lw=1, label=label)
    plt.plot(range(len(X_test)), Y_test, 'g-', lw=1)
    plt.legend(loc='upper left')  # legend position
    plt.title(model_name[i], fontsize=16)

plt.show()

The results:

degree 1, lr model, score: 0.992, coefficients: [0.         0.00611712 0.24437297]
feature: ['1', 'x0', 'x1']

degree 3, lr model, score: 0.993, coefficients: [ 0.00000000e+00 -3.70951434e+01 -6.40167154e+00  1.53651601e-01
  5.22340225e-02  1.79832277e-02 -2.12195366e-04 -1.02794156e-04
 -6.16294133e-05 -8.12414760e-05]
feature: ['1', 'x0', 'x1', 'x0^2', 'x0 x1', 'x1^2', 'x0^3', 'x0^2 x1', 'x0 x1^2', 'x1^3']

degree 5, lr model, score: 0.995, coefficients: [ 0.00000000e+00 -6.69622090e-01  1.45306572e+00 -1.13216456e+01
 -3.90041968e+01  2.69703377e+02  8.95263141e-02  4.74095642e-01
 -3.14655737e+00 -1.19288298e+00 -2.65128414e-04 -1.92118023e-03
  1.22165319e-02  9.45563983e-03  1.96964251e-03  2.78816740e-07
  2.59505716e-06 -1.57775191e-05 -1.88099529e-05 -7.25206107e-06
 -2.76570191e-06]
feature: ['1', 'x0', 'x1', 'x0^2', 'x0 x1', 'x1^2', 'x0^3', 'x0^2 x1', 'x0 x1^2', 'x1^3', 'x0^4', 'x0^3 x1', 'x0^2 x1^2', 'x0 x1^3', 'x1^4', 'x0^5', 'x0^4 x1', 'x0^3 x1^2', 'x0^2 x1^3', 'x0 x1^4', 'x1^5']

degree 1, lasso model, score: 0.992, coefficients: [0.         0.00488852 0.24340701]
feature: ['1', 'x0', 'x1']

degree 3, lasso model, score: 0.992, coefficients: [ 0.00000000e+00 -0.00000000e+00  0.00000000e+00 -0.00000000e+00
  0.00000000e+00  0.00000000e+00 -9.96040425e-08  4.24340539e-06
  0.00000000e+00  0.00000000e+00]
feature: ['1', 'x0', 'x1', 'x0^2', 'x0 x1', 'x1^2', 'x0^3', 'x0^2 x1', 'x0 x1^2', 'x1^3']

degree 5, lasso model, score: 0.990, coefficients: [ 0.00000000e+00 -0.00000000e+00  0.00000000e+00 -0.00000000e+00
  0.00000000e+00  0.00000000e+00 -0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00 -2.46402283e-12
  7.32871262e-11  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00]
feature: ['1', 'x0', 'x1', 'x0^2', 'x0 x1', 'x1^2', 'x0^3', 'x0^2 x1', 'x0 x1^2', 'x1^3', 'x0^4', 'x0^3 x1', 'x0^2 x1^2', 'x0 x1^3', 'x1^4', 'x0^5', 'x0^4 x1', 'x0^3 x1^2', 'x0^2 x1^3', 'x0 x1^4', 'x1^5']

The results for the remaining two models are omitted.
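
To turn a printout like this into an explicit model expression, pair each feature name with its coefficient. A minimal sketch, plugging in the degree-1 lr values from above:

# pair feature names with coefficients from the degree-1 lr run above
terms = zip(['1', 'x0', 'x1'], [0.0, 0.00611712, 0.24437297])
expr = ' + '.join('%.5f*%s' % (c, f) for f, c in terms)
print('y =', expr)  # y = 0.00000*1 + 0.00612*x0 + 0.24437*x1

Here x0 and x1 are Voltage and Global_intensity respectively, following the column order of X.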


[Figures: actual (green) vs. predicted curves at degrees 1, 3, 5 for the lr, lasso, ridge, and ela models]

Conclusion: Ridge overfits at degree 5, while the other models fit well up to degree 5. This suggests the L2 penalty can still overfit as the polynomial degree grows, which is worth watching for in practice.

Summary: the point of this article is to get comfortable with the basic usage of the four linear regression models, and to learn to use polynomial expansion and Pipeline to improve both the model and the code.
