Select and train a model
Linear regression model
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
This gives us a working linear regression model.
We can measure the regression model's RMSE on the whole training set using scikit-learn's mean_squared_error function:
import numpy as np  # needed for np.sqrt below
from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)  # 68628.413493824875
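The snippet above assumes `housing_prepared` and `housing_labels` already exist from the data-preparation pipeline. As a self-contained sketch of the same fit-predict-RMSE workflow, here is a hypothetical version with synthetic stand-in data (the variables `X` and `y` are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(42)
X = rng.rand(200, 3)                                           # stand-in for housing_prepared
y = X @ np.array([3.0, -2.0, 1.5]) + rng.normal(0, 0.1, 200)   # stand-in for housing_labels

lin_reg = LinearRegression()
lin_reg.fit(X, y)

predictions = lin_reg.predict(X)
mse = mean_squared_error(y, predictions)
rmse = np.sqrt(mse)  # small here, since the synthetic data is nearly linear
print(rmse)
```

On the real housing data the RMSE is large (about $68,628), which suggests the linear model underfits.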
Decision tree model
It can find complex nonlinear relationships in the data.
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse) # 0.0
The reported error is exactly zero, which means the model has badly overfit the data.
Now let's improve the evaluation strategy.
Better evaluation using cross-validation
We can use scikit-learn's cross-validation feature. The following code performs K-fold cross-validation: it randomly splits the training set into 10 distinct subsets called folds, then trains and evaluates the decision tree model 10 times, each time picking a different fold for evaluation and training on the other 9 folds. The result is an array containing the 10 evaluation scores:
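A quick way to see this overfitting, sketched here on synthetic stand-in data (the split into a held-out test set is added for illustration; the original housing variables are not used): an unconstrained decision tree memorizes pure noise perfectly on the training set, yet fails on data it has not seen.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(42)
X = rng.rand(200, 3)
y = rng.rand(200) * 100  # pure noise: there is nothing real to learn

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(X_train, y_train)

# training error is zero because the tree memorizes every sample,
# but the error on unseen data stays large
train_rmse = np.sqrt(mean_squared_error(y_train, tree_reg.predict(X_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, tree_reg.predict(X_test)))
print(train_rmse, test_rmse)
```

This is why a training-set score alone cannot be trusted, and why the next step switches to cross-validation.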
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
Scikit-learn's cross-validation features expect a utility function (greater is better) rather than a cost function (lower is better), so the scoring function actually computes the negative MSE; that is why the code takes the square root of -scores.
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())
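Putting the pieces together, here is a hypothetical end-to-end run of cross_val_score plus display_scores on synthetic stand-in data (the `X`/`y` arrays are invented; on the real housing data you would pass `housing_prepared` and `housing_labels`):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

rng = np.random.RandomState(42)
X = rng.rand(200, 3)
y = rng.rand(200) * 100

tree_reg = DecisionTreeRegressor(random_state=42)
# 10-fold CV; scores come back as negative MSE, so negate before sqrt
scores = cross_val_score(tree_reg, X, y,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
display_scores(rmse_scores)
```

Unlike the single training-set RMSE, this gives both an estimate of performance (the mean) and of its precision (the standard deviation).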
Random forest
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
Random forest:
Scores: [49565.79529588 47337.80993895 50185.64303548 52405.39139117
49600.49582891 53273.54270025 48704.41836964 47764.91984528
52851.82081761 50215.74587276]
Mean: 50190.55830959307
Standard deviation: 1961.179867922108
Decision tree:
Scores: [69626.46134399 67991.90860685 71566.04190367 70032.02237379
70596.49995302 74664.05771371 70091.6453497 71805.24386367
78157.17712767 69618.17027461]
Mean: 71414.92285106823
Standard deviation: 2804.7690022906345
Linear regression:
Scores: [66877.52325028 66608.120256 70575.91118868 74179.94799352
67683.32205678 71103.16843468 64782.65896552 67711.29940352
71080.40484136 67687.6384546 ]
Mean: 68828.99948449328
Standard deviation: 2662.7615706103393
You can easily save scikit-learn models using Python's pickle module or the joblib library, which is more efficient.
import joblib
joblib.dump(my_model, "my_model.pkl")
my_model_loaded = joblib.load("my_model.pkl")
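As a self-contained sketch of that save/load round trip (the trained model is a stand-in, and a temp-directory path is used here instead of the bare filename so the example does not clutter the working directory):

```python
import os
import tempfile

import numpy as np
import joblib
from sklearn.linear_model import LinearRegression

X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1

my_model = LinearRegression().fit(X, y)

# persist the fitted model, then load it back
path = os.path.join(tempfile.gettempdir(), "my_model.pkl")
joblib.dump(my_model, path)
my_model_loaded = joblib.load(path)

# the loaded model makes identical predictions
same = np.allclose(my_model.predict(X), my_model_loaded.predict(X))
print(same)
```

Saving both the model and its evaluation scores lets you compare models across experiments without retraining.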
Fine-tune the model
Grid search
You can use scikit-learn's GridSearchCV to do the search for you.
from sklearn.model_selection import GridSearchCV
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)
The best combination of parameters:
grid_search.best_params_
The best estimator:
grid_search.best_estimator_
The evaluation scores:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
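Here is a runnable sketch of the same grid-search flow on synthetic stand-in data, with a smaller grid so it finishes quickly (the `X`/`y` arrays and the reduced grid are assumptions; the real run uses `housing_prepared`, `housing_labels`, and the full param_grid above):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(42)
X = rng.rand(120, 4)
y = X @ np.array([3.0, -2.0, 1.5, 0.5]) + rng.normal(0, 0.1, 120)

# 2 x 2 = 4 combinations, each evaluated with 5-fold CV (20 fits total)
param_grid = [
    {'n_estimators': [3, 10], 'max_features': [2, 4]},
]

forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(X, y)

print(grid_search.best_params_)
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
```

Each printed row shows the cross-validated RMSE for one hyperparameter combination, so you can see how far the runners-up are from the best one.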
Randomized search
When you are exploring relatively few combinations, grid search is fine; but when the search space is large, it is usually preferable to use RandomizedSearchCV instead. Rather than trying all possible combinations, it evaluates a given number of random combinations, selecting a random value for each hyperparameter at every iteration.
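A minimal sketch of RandomizedSearchCV, again on synthetic stand-in data (the `param_distribs` ranges and `n_iter=8` budget are illustrative assumptions, not values from the original):

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.RandomState(42)
X = rng.rand(120, 4)
y = X @ np.array([3.0, -2.0, 1.5, 0.5]) + rng.normal(0, 0.1, 120)

# distributions to sample from, instead of an explicit grid
param_distribs = {
    'n_estimators': randint(low=3, high=30),
    'max_features': randint(low=1, high=5),
}

forest_reg = RandomForestRegressor(random_state=42)
# n_iter controls the search budget: 8 random combinations, 5-fold CV each
rnd_search = RandomizedSearchCV(forest_reg, param_distribs, n_iter=8, cv=5,
                                scoring='neg_mean_squared_error',
                                random_state=42)
rnd_search.fit(X, y)
print(rnd_search.best_params_)
```

The key advantage is that the compute budget (n_iter) is decoupled from the number of hyperparameters and the size of their ranges.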
You can also inspect the relative importance of each feature for making accurate predictions:
feature_importances = grid_search.best_estimator_.feature_importances_