
Before getting into the model itself, let me first introduce the three main approaches to ensemble learning: bagging, boosting, and stacking.
1. Bagging draws random samples with replacement (bootstrap sampling), trains a learner on each sample, and combines the results by voting.
2. Boosting builds an ensemble of weak classifiers based on weights: it iteratively updates the sampling weights and each classifier's weight, driving the combined result toward the optimal classification. Boosting makes samples that were misclassified more likely to appear for the next classifier, raising the probability that they are classified correctly afterwards.
3. Stacking differs from bagging and boosting in two main ways. First, stacking usually considers heterogeneous weak learners (different learning algorithms combined together), whereas bagging and boosting mainly consider homogeneous weak learners. Second, stacking learns a meta-model to combine the base models, whereas bagging and boosting combine their weak learners with a deterministic rule.
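As a quick illustration of the first two methods, here is a minimal sketch using scikit-learn (assumed installed); the synthetic data is only for demonstration. `BaggingClassifier` votes over trees trained on bootstrap samples, while `AdaBoostClassifier` reweights misclassified samples between rounds:

```python
# Contrast bagging vs. boosting on toy data: bagging votes over trees fit on
# bootstrap samples; AdaBoost reweights misclassified samples each round.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)

bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        random_state=0).fit(X, y)
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

print(bag.score(X, y), ada.score(X, y))  # training accuracy of each ensemble
```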
Now that we have a rough picture of these three ensemble methods: bagging is similar to random forest, but with two main differences. First, random forest draws bootstrap samples the size of the full training set before voting, while bagging may train each learner on subsamples smaller than the full sample size. Second, bagging builds each classifier on all features, whereas random forest uses only a random subset of features. Boosting mainly introduces the two kinds of weights described above, which is the key point. With stacking, you might choose a KNN classifier, logistic regression, and an SVM as the weak learners, and pick a neural network as the meta-model: the neural network takes the outputs of the three weak learners as input and returns the final prediction based on them.
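The stacking setup just described can be sketched with scikit-learn's `StackingClassifier` (assumed installed), using KNN, logistic regression, and an SVM as base learners and a small neural network as the meta-model; the synthetic dataset is purely illustrative:

```python
# Heterogeneous stacking: three different base learners, with a small neural
# network as the meta-model that combines their outputs.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[('knn', KNeighborsClassifier()),
                ('lr', LogisticRegression(max_iter=1000)),
                ('svm', SVC(probability=True))],
    final_estimator=MLPClassifier(max_iter=2000, random_state=0))
stack.fit(X_tr, y_tr)

print(round(stack.score(X_te, y_te), 2))  # held-out accuracy of the stack
```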
The model we use, XGBoost, is a boosting method; the name is short for "eXtreme Gradient Boosting". XGBoost grew out of the gradient boosting framework but is far more efficient: the secret lies in parallel computation, approximate tree construction, effective handling of sparse data, and optimized memory usage, which together give XGBoost at least a 10x speedup over earlier gradient boosting implementations.
The code is as follows:
# A simple application of XGBoost
# Income-bracket prediction
# Import the analysis libraries
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
# Load the data file (the raw adult.data file ships without a header row,
# so pass header=None; otherwise the first record is consumed as column names)
df = pd.read_csv('adult.data', header=None)
df
The data source looks roughly like this:
39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
0 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
1 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
2 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
3 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K
4 37 Private 284582 Masters 14 Married-civ-spouse Exec-managerial Wife White Female 0 0 40 United-States <=50K
5 49 Private 160187 9th 5 Married-spouse-absent Other-service Not-in-family Black Female 0 0 16 Jamaica <=50K
... ... ... (remaining rows omitted; note that missing values appear as ' ?')
32560 rows × 15 columns
Now we clean the data:
# Data cleaning
# Rename the columns: age, working class, id (the dataset's fnlwgt column), education,
# years of education, marital status, occupation, relationship, race, sex,
# capital gain, capital loss, hours worked per week, native country, income
df.columns = ['age', 'workplace', 'id', 'education', 'education_num', 'marital_status',
'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
'hours_per_week', 'native_country', 'income']
df1 = df.copy()
# Inspect the dataset
df1.info()
# Check for missing values (isnull() is simply an alias of isna())
df1.isna().sum()
df2 = df1.replace(' ?', np.nan).dropna()  # map the ' ?' placeholders to NaN, then drop those rows
With cleaning done, we move on to feature selection:
# Feature engineering: drop features that are not useful
df3 = df2.drop('id',axis=1)
# One-hot encode the categorical features
df_get_dum = pd.get_dummies(df3.iloc[:,:-1])
# Show all columns
pd.set_option('display.max_columns', None)
# Show all rows
pd.set_option('display.max_rows', None)
df_get_dum.head()
df_get_dum['income'] = df3['income']
data = df_get_dum.copy()
# Encode the target labels
from sklearn.preprocessing import LabelEncoder
lab = LabelEncoder()
data['income'] = lab.fit_transform(data['income'])
data['income'].value_counts()
db = data.copy()  # the final dataset ready for modeling; preprocessing is complete
Next, we build the model:
# Modeling and prediction; evaluation metrics come from sklearn.metrics
# XGBoost
import numpy as np
import xgboost as xgb
from xgboost import plot_importance, plot_tree
from sklearn.metrics import accuracy_score
%matplotlib inline
# Shuffle the dataset (the display options were already set above)
from sklearn.utils import shuffle
db1 = shuffle(db)
db1.head(20)
Note: sklearn.utils provides a shuffle() method for randomizing the dataset, which helps make the subsequent train/test split more robust.
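A small check of that behavior (scikit-learn assumed installed): shuffle() permutes the rows while keeping features and labels aligned. The toy arrays below are only for illustration:

```python
# shuffle() applies the same permutation to every array passed in,
# so each shuffled row still lines up with its label.
import numpy as np
from sklearn.utils import shuffle

X = np.arange(10).reshape(5, 2)   # row i is [2i, 2i+1]
y = np.array([0, 1, 2, 3, 4])     # label i belongs to row i
Xs, ys = shuffle(X, y, random_state=0)

print(Xs[:, 0] // 2 == ys)  # every shuffled row still matches its label
```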
# Split into training and test sets
from sklearn.model_selection import train_test_split
target_name = 'income'
X = db1.drop('income', axis=1)
y = db1[target_name]
# Training data and training targets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=123, stratify=y)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
With the training and test sets split, it is time to set the model parameters. XGBoost has three main groups of parameters:
1. General parameters (the macro-level settings)
a. booster: the model used in each boosting round. Two choices are available, gbtree and gblinear: gbtree boosts with tree-based models, gblinear with linear models. The default is gbtree.
2. Booster parameters
a. eta: the learning rate, which controls the weight given to each new tree
b. max_depth: the tree depth, used to guard against overfitting; the default is 6
3. Learning task parameters:
a. objective: [ default=reg:linear ]
# Defines the learning task and the corresponding objective; the available objective functions include:
# "reg:linear" – linear regression.
# "reg:logistic" – logistic regression.
# "binary:logistic" – logistic regression for binary classification; outputs a probability.
# "multi:softmax" – multi-class classification with a softmax objective; also requires setting num_class (the number of classes).
# "multi:softprob" – same as softmax, but outputs an ndata * nclass vector, which can be reshaped into an ndata-by-nclass matrix where each row holds the probability of the sample belonging to each class.
Here we use the softmax objective:
# Parameter settings
# Training parameters for the algorithm
params = {
    # General parameters
    'booster': 'gbtree',
    'nthread': 4,
    'silent': 0,              # 0 prints run-time messages; 1 runs silently (default 0)
    'num_feature': 103,
    'seed': 1000,
    # Task parameters
    'objective': 'multi:softmax',  # softmax objective; requires num_class (number of classes)
    'num_class': 2,
    # Booster parameters
    'gamma': 0.1,             # minimum loss reduction required to split a leaf; larger values make the algorithm more conservative, and it should be tuned
    'max_depth': 6,           # maximum tree depth (default 6); deeper trees fit the data more closely but also overfit more easily
    'lambda': 2,              # L2 regularization weight
    'subsample': 0.7,         # fraction of samples used to train each tree, to prevent overfitting
    'colsample_bytree': 0.7,  # fraction of features sampled when building each tree
    'min_child_weight': 3,    # minimum sum of instance weights a leaf needs to keep splitting
    'eta': 0.1,               # shrinkage step size used in the additive model
}
plst = list(params.items())
Next we set the number of boosting rounds, train the model, and measure its accuracy on the test set:
# Convert the datasets into XGBoost's DMatrix format
dtrain = xgb.DMatrix(X_train, y_train)
dtest = xgb.DMatrix(X_test)
# Set the number of boosting rounds and train the model
# For multi-class problems the rounds apply per class, so the total number of base learners = rounds * number of classes
num_rounds = 100
model = xgb.train(plst, dtrain, num_rounds)  # train the XGBoost model
# Predict on the test set
y_pred = model.predict(dtest)
# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print("accuracy: %.2f%%" % (accuracy * 100.0))
plot_importance(model)
plt.show()
Conclusions: 1. The prediction accuracy of the XGBoost algorithm is noticeably higher than that of the other algorithms we compared.
2. The feature with the greatest influence on the income prediction is age, which suggests that age is the most important factor affecting income.
Summary: this project aims at understanding the code-level implementation of the XGBoost algorithm, verifying its effectiveness on an income classification task. The accuracy is respectable, and working through the algorithm step by step makes later real-world applications easier.