
Income Bracket Prediction with XGBoost


Before getting into the model itself, let's review the three main ensemble learning methods: bagging, boosting, and stacking.

1. Bagging draws bootstrap samples (random sampling with replacement) and combines the resulting models by voting.

2. Boosting builds an ensemble of weak classifiers based on weights: it iteratively updates the sample weights and the classifier weights so that the combined result approaches the optimal classifier. Boosting gives samples misclassified in one round a higher probability of appearing in the next round, improving the chance that they end up classified correctly.

3. Stacking differs from bagging and boosting in two main ways. First, stacking usually combines heterogeneous weak learners (different learning algorithms are combined), whereas bagging and boosting mostly combine homogeneous ones. Second, stacking trains a meta-model to combine the base models, whereas bagging and boosting combine their weak learners with a fixed, deterministic rule.

Having covered the three methods at a high level: bagging resembles random forest, but with two differences. First, random forest votes over trees trained on bootstrap samples of the full sample size, while plain bagging may draw fewer samples than the sample size; second, bagging builds each classifier on all features, while random forest uses a random subset of features. Boosting's key idea is the introduction of the two kinds of weights described above. For stacking, one can choose, say, a KNN classifier, logistic regression, and an SVM as weak learners, and train a neural network as the meta-model; the network then takes the outputs of the three weak learners as input and returns the final prediction.
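The stacking setup just described can be sketched with scikit-learn's StackingClassifier. This is a minimal sketch on synthetic data: the dataset, hidden-layer size, and hyperparameters are illustrative assumptions, not part of the original project.

```python
# Stacking: KNN, logistic regression and an SVM as base learners,
# with a small neural network as the meta-model.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

stack = StackingClassifier(
    estimators=[
        ('knn', KNeighborsClassifier()),
        ('lr', LogisticRegression(max_iter=1000)),
        ('svm', SVC(probability=True)),
    ],
    # The meta-model sees the base learners' predicted probabilities as its input.
    final_estimator=MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0),
)
stack.fit(X, y)
print(stack.score(X, y))  # training accuracy of the stacked ensemble
```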

The XGBoost we use here is a boosting method; the name is short for eXtreme Gradient Boosting. XGBoost grows out of the gradient boosting framework but is far more efficient: parallel computation, approximate tree construction, efficient handling of sparse data, and optimized memory usage give it roughly a 10x speedup over earlier gradient boosting implementations.

The code is as follows:

# A simple application of XGBoost
# Income bracket prediction
# Import the analysis libraries
import pandas as pd 
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np 
%matplotlib inline
# Load the data file
# adult.data ships without a header row; header=None keeps the first record as data
df = pd.read_csv('adult.data', header=None)
df
           

資料源大緻是個這樣的一個結構

39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	0	40	United-States	<=50K
50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	0	13	United-States	<=50K
38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	0	40	United-States	<=50K
53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	0	40	United-States	<=50K
28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	0	40	Cuba	<=50K
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
32560 rows × 15 columns (missing values appear as " ?", e.g. in the workclass and native-country columns)

Now for data cleaning:

# Data cleaning
# Rename the columns: age, work class, id (this column is actually the census fnlwgt
# weight, kept under the name 'id' here), education, years of education, marital status,
# occupation, relationship, race, sex, capital gain, capital loss,
# hours worked per week, native country, income
df.columns = ['age', 'workplace', 'id', 'education', 'education_num', 'marital_status',
       'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
       'hours_per_week', 'native_country', 'income']
df1 = df.copy()
# Inspect the dataset
df1.info()
# Check for missing values (isnull is an alias of isna, so the two calls are equivalent)
df1.isna().sum()
df1.isnull().sum()
df2 = df1.replace(' ?', np.nan).dropna()  # drop the rows whose value is ' ?'
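A quirk of adult.data is that missing values are stored as the string ' ?' with a leading space rather than as true NaN, which is why isna().sum() reports zero everywhere while replace(' ?', np.nan) is still needed. A minimal sketch of the pattern on a toy frame (the values here are illustrative, not the real dataset):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'workplace': ['Private', ' ?', 'State-gov'],
    'occupation': ['Sales', ' ?', 'Adm-clerical'],
    'age': [38, 41, 45],
})
print(toy.isna().sum().sum())         # 0 -- the ' ?' strings are not NaN yet
cleaned = toy.replace(' ?', np.nan).dropna()
print(cleaned.shape)                  # (2, 3) -- the row containing ' ?' is gone
```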
           

With cleaning done, we move on to feature selection:

# Feature engineering: drop the unused feature
df3 = df2.drop('id', axis=1)
# One-hot encode the categorical features
df_get_dum = pd.get_dummies(df3.iloc[:,:-1])
# Show all columns
pd.set_option('display.max_columns', None)
# Show all rows
pd.set_option('display.max_rows', None)
df_get_dum.head()
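pd.get_dummies expands each categorical column into one indicator column per category, which is how the dataset ends up with roughly a hundred feature columns. A minimal sketch on toy values (not the real dataset):

```python
import pandas as pd

toy = pd.DataFrame({'sex': ['Male', 'Female', 'Male'],
                    'race': ['White', 'Black', 'White']})
dum = pd.get_dummies(toy)
# Each category becomes its own 0/1 indicator column, named prefix_category.
print(list(dum.columns))  # ['sex_Female', 'sex_Male', 'race_Black', 'race_White']
```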
           
df_get_dum['income'] = df3['income']
data = df_get_dum.copy()
# Encode the target labels
from sklearn.preprocessing import LabelEncoder
lab = LabelEncoder()
data['income'] = lab.fit_transform(data['income'])
data['income'].value_counts()
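LabelEncoder assigns integer codes to the string labels in sorted order, so the dataset's two income labels map to 0 and 1. A sketch with those two label values (note they carry a leading space in the raw file):

```python
from sklearn.preprocessing import LabelEncoder

lab = LabelEncoder()
codes = lab.fit_transform([' <=50K', ' >50K', ' <=50K'])
print(list(codes))          # [0, 1, 0]
print(list(lab.classes_))   # classes are stored in sorted order
```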
           
db = data.copy()  # the final dataset, ready for machine learning; preprocessing complete
           

Next we build the model:

# Modeling, analysis and prediction; evaluation metrics come from sklearn.metrics
# XGBoost
import time
import numpy as np
import xgboost as xgb
from xgboost import plot_importance,plot_tree
from sklearn.metrics import accuracy_score
import os
%matplotlib inline
# Shuffle the dataset
# Show all columns
pd.set_option('display.max_columns', None)
# Show all rows
pd.set_option('display.max_rows', None)
from sklearn.utils import shuffle 
db1 = shuffle(db)
db1.head(20)
           

Note: scikit-learn provides shuffle() to randomize the row order of the dataset, which makes the training more robust to any ordering in the raw data.

# Train/test split
from sklearn.model_selection import train_test_split
target_name = 'income'
X = db1.drop('income', axis=1)
y = db1[target_name]
# Training features and training target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=123, stratify=y)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
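stratify=y makes the split preserve the class proportions of y in both subsets, which matters here because the income labels are imbalanced (roughly 3:1 in favor of <=50K). A minimal sketch on synthetic labels with the same imbalance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 75 + [1] * 25)       # imbalanced labels, roughly like income here
X = np.arange(100).reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
# Both subsets keep the minority-class fraction close to the overall 0.25
print(y_tr.mean(), y_te.mean())
```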
           

With the training and test sets split, it is time to set the model parameters. XGBoost parameters fall into three groups:

1. General parameters (also called macro parameters)

a. booster: the model used in each iteration. There are two choices, gbtree and gblinear: gbtree boosts tree-based models, gblinear boosts linear models. The default is gbtree.

2. Booster parameters

a. eta: the learning rate, which shrinks the weight of each new tree

b. max_depth: the maximum tree depth, used to avoid overfitting; the default is 6

3. Learning task parameters:

a. objective: [ default=reg:linear ]

# Defines the learning task and objective. The available objective functions include:

# "reg:linear" – linear regression.

# "reg:logistic" – logistic regression.

# "binary:logistic" – logistic regression for binary classification; the output is a probability.

# "multi:softmax" – multiclass classification with a softmax objective; requires setting num_class (the number of classes).

# "multi:softprob" – like softmax, but outputs a vector of length ndata * nclass that can be reshaped into an ndata-by-nclass matrix, where each row holds the probability of the sample belonging to each class.

Here we use the softmax objective (for a two-class problem, binary:logistic is the more conventional choice, but multi:softmax with num_class set to 2 works as well).
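For reference, the equivalent binary setup would swap the task-parameter block for binary:logistic. This is a config sketch, not part of the original project; with this objective the model outputs probabilities, so predictions must be thresholded before computing accuracy:

```python
# Alternative task parameters for a plain binary objective (illustrative):
binary_params = {
    'objective': 'binary:logistic',  # outputs P(income > 50K) instead of a class id
    'eval_metric': 'logloss',
}
# With this objective, model.predict(dtest) returns probabilities, so:
# y_prob = model.predict(dtest)
# y_pred = (y_prob > 0.5).astype(int)
```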

# Parameter settings
# Training parameters
params = {
    # General parameters
    'booster': 'gbtree', 
    'nthread': 4,
    'silent': 0,  # 0 prints runtime messages, 1 runs silently; the default is 0
    'num_feature': 103,
    'seed': 1000, 
    # Task parameters
    'objective': 'multi:softmax',  # softmax objective for multiclass problems; requires num_class
    'num_class': 2,  # the number of classes
    # Booster parameters
    'gamma': 0.1,  # minimum loss reduction required to split a leaf node; larger values make
                   # the algorithm more conservative, and it depends on the loss function,
                   # so it should be tuned
    'max_depth': 6,  # maximum tree depth, default 6; deeper trees fit the data more closely
                     # (and overfit more easily), so this parameter also controls overfitting
    'lambda': 2,  # L2 regularization weight
    'subsample': 0.7,  # fraction of the training samples used per tree, to prevent overfitting
    'colsample_bytree': 0.7,  # fraction of features sampled when constructing each tree
    'min_child_weight': 3,  # minimum sum of instance weight needed in a leaf to keep splitting
    'eta': 0.1,  # shrinkage step size used in the additive model
}
plst = list(params.items())
           

Next we set the number of boosting rounds, train the model, and measure its accuracy on the test set:

# Convert the datasets to XGBoost's DMatrix format
dtrain = xgb.DMatrix(X_train, y_train)
dtest = xgb.DMatrix(X_test)
# Set the number of boosting rounds and train the model
# For multiclass objectives, boosting runs this many rounds per class,
# so the total number of base learners = rounds * number of classes
num_rounds = 100
model = xgb.train(plst, dtrain, num_rounds)  # train the XGBoost model
           
# Predict on the test set
y_pred = model.predict(dtest)
# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print("accuracy: %.2f%%" % (accuracy * 100.0))
plot_importance(model)
plt.show()
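Because roughly three quarters of the records are <=50K, a model that always predicts <=50K would already score around 75% accuracy, so it is worth reporting precision and recall alongside accuracy. A minimal sketch with sklearn.metrics on toy labels (the arrays below are illustrative stand-ins for y_test and y_pred):

```python
from sklearn.metrics import classification_report, confusion_matrix

y_true = [0, 0, 0, 1, 1, 0, 1, 0]   # stand-in for y_test
y_hat  = [0, 0, 1, 1, 0, 0, 1, 0]   # stand-in for y_pred
# Rows of the confusion matrix are true classes, columns are predicted classes
print(confusion_matrix(y_true, y_hat))
# Per-class precision, recall and F1
print(classification_report(y_true, y_hat, target_names=['<=50K', '>50K']))
```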
           

Conclusions: 1. The XGBoost model achieves a high prediction accuracy on this task.

2. The feature with the greatest influence on the income prediction is age, which suggests that age is the most important factor affecting income in this dataset.

Summary: this project is meant as a code-level walkthrough of the XGBoost algorithm, validated on an income classification task. The accuracy is respectable, and getting familiar with the workflow makes later real-world applications easier.