Select and train a model
Linear regression model
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
This gives us a working linear regression model.
We can measure the regression model's RMSE on the whole training set using scikit-learn's mean_squared_error function:
import numpy as np  # needed for np.sqrt below
from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)  # 68628.413493824875
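The snippet above assumes `housing_prepared` and `housing_labels` already exist from the data-preparation pipeline. As a self-contained sketch of the same fit-predict-RMSE workflow, here is a hypothetical version with synthetic stand-in data (the variables `X` and `y` are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(42)
X = rng.rand(200, 3)                                           # stand-in for housing_prepared
y = X @ np.array([3.0, -2.0, 1.5]) + rng.normal(0, 0.1, 200)   # stand-in for housing_labels

lin_reg = LinearRegression()
lin_reg.fit(X, y)

predictions = lin_reg.predict(X)
mse = mean_squared_error(y, predictions)
rmse = np.sqrt(mse)  # small here, since the synthetic data is nearly linear
print(rmse)
```

On the real housing data the RMSE is large (about $68,628), which suggests the linear model underfits.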
Decision tree model
It can find complex nonlinear relationships in the data.
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse) # 0.0
The reported error is exactly zero, which means the model has badly overfit the data.
Now let's improve the evaluation strategy.
Better evaluation using cross-validation
We can use scikit-learn's cross-validation feature. The following code performs K-fold cross-validation: it randomly splits the training set into 10 distinct subsets called folds, then trains and evaluates the decision tree model 10 times, each time picking a different fold for evaluation and training on the other 9 folds. The result is an array containing the 10 evaluation scores:
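A quick way to see this overfitting, sketched here on synthetic stand-in data (the split into a held-out test set is added for illustration; the original housing variables are not used): an unconstrained decision tree memorizes pure noise perfectly on the training set, yet fails on data it has not seen.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(42)
X = rng.rand(200, 3)
y = rng.rand(200) * 100  # pure noise: there is nothing real to learn

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(X_train, y_train)

# training error is zero because the tree memorizes every sample,
# but the error on unseen data stays large
train_rmse = np.sqrt(mean_squared_error(y_train, tree_reg.predict(X_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, tree_reg.predict(X_test)))
print(train_rmse, test_rmse)
```

This is why a training-set score alone cannot be trusted, and why the next step switches to cross-validation.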
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
Scikit-learn's cross-validation features expect a utility function (greater is better) rather than a cost function (lower is better), so the scoring function actually computes the negative MSE; that is why the code takes the square root of -scores.
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())
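Putting the pieces together, here is a hypothetical end-to-end run of cross_val_score plus display_scores on synthetic stand-in data (the `X`/`y` arrays are invented; on the real housing data you would pass `housing_prepared` and `housing_labels`):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

rng = np.random.RandomState(42)
X = rng.rand(200, 3)
y = rng.rand(200) * 100

tree_reg = DecisionTreeRegressor(random_state=42)
# 10-fold CV; scores come back as negative MSE, so negate before sqrt
scores = cross_val_score(tree_reg, X, y,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
display_scores(rmse_scores)
```

Unlike the single training-set RMSE, this gives both an estimate of performance (the mean) and of its precision (the standard deviation).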
Random forest
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
Random forest:
Scores: [49565.79529588 47337.80993895 50185.64303548 52405.39139117
49600.49582891 53273.54270025 48704.41836964 47764.91984528
52851.82081761 50215.74587276]
Mean: 50190.55830959307
Standard deviation: 1961.179867922108
Decision tree:
Scores: [69626.46134399 67991.90860685 71566.04190367 70032.02237379
70596.49995302 74664.05771371 70091.6453497 71805.24386367
78157.17712767 69618.17027461]
Mean: 71414.92285106823
Standard deviation: 2804.7690022906345
Linear regression:
Scores: [66877.52325028 66608.120256 70575.91118868 74179.94799352
67683.32205678 71103.16843468 64782.65896552 67711.29940352
71080.40484136 67687.6384546 ]
Mean: 68828.99948449328
Standard deviation: 2662.7615706103393
You can easily save scikit-learn models using Python's pickle module or the joblib library, which is more efficient.
import joblib
joblib.dump(my_model, "my_model.pkl")
my_model_loaded = joblib.load("my_model.pkl")
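As a self-contained sketch of that save/load round trip (the trained model is a stand-in, and a temp-directory path is used here instead of the bare filename so the example does not clutter the working directory):

```python
import os
import tempfile

import numpy as np
import joblib
from sklearn.linear_model import LinearRegression

X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1

my_model = LinearRegression().fit(X, y)

# persist the fitted model, then load it back
path = os.path.join(tempfile.gettempdir(), "my_model.pkl")
joblib.dump(my_model, path)
my_model_loaded = joblib.load(path)

# the loaded model makes identical predictions
same = np.allclose(my_model.predict(X), my_model_loaded.predict(X))
print(same)
```

Saving both the model and its evaluation scores lets you compare models across experiments without retraining.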
Fine-tune the model
Grid search
You can use scikit-learn's GridSearchCV to do the search for you.
from sklearn.model_selection import GridSearchCV
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)
The best combination of parameters:
grid_search.best_params_
The best estimator:
grid_search.best_estimator_
The evaluation scores:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
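Here is a runnable sketch of the same grid-search flow on synthetic stand-in data, with a smaller grid so it finishes quickly (the `X`/`y` arrays and the reduced grid are assumptions; the real run uses `housing_prepared`, `housing_labels`, and the full param_grid above):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(42)
X = rng.rand(120, 4)
y = X @ np.array([3.0, -2.0, 1.5, 0.5]) + rng.normal(0, 0.1, 120)

# 2 x 2 = 4 combinations, each evaluated with 5-fold CV (20 fits total)
param_grid = [
    {'n_estimators': [3, 10], 'max_features': [2, 4]},
]

forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(X, y)

print(grid_search.best_params_)
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
```

Each printed row shows the cross-validated RMSE for one hyperparameter combination, so you can see how far the runners-up are from the best one.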
Randomized search
When you are exploring relatively few combinations, grid search is fine; but when the search space is large, it is usually preferable to use RandomizedSearchCV instead. Rather than trying all possible combinations, it evaluates a given number of random combinations, selecting a random value for each hyperparameter at every iteration.
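A minimal sketch of RandomizedSearchCV, again on synthetic stand-in data (the `param_distribs` ranges and `n_iter=8` budget are illustrative assumptions, not values from the original):

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.RandomState(42)
X = rng.rand(120, 4)
y = X @ np.array([3.0, -2.0, 1.5, 0.5]) + rng.normal(0, 0.1, 120)

# distributions to sample from, instead of an explicit grid
param_distribs = {
    'n_estimators': randint(low=3, high=30),
    'max_features': randint(low=1, high=5),
}

forest_reg = RandomForestRegressor(random_state=42)
# n_iter controls the search budget: 8 random combinations, 5-fold CV each
rnd_search = RandomizedSearchCV(forest_reg, param_distribs, n_iter=8, cv=5,
                                scoring='neg_mean_squared_error',
                                random_state=42)
rnd_search.fit(X, y)
print(rnd_search.best_params_)
```

The key advantage is that the compute budget (n_iter) is decoupled from the number of hyperparameters and the size of their ranges.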
You can also inspect the relative importance of each feature for making accurate predictions:
feature_importances = grid_search.best_estimator_.feature_importances_