
Before getting into the model, let me outline the three main ensemble learning approaches: bagging, boosting, and stacking.
1. Bagging draws bootstrap samples (sampling with replacement) and combines the resulting models by voting.
2. Boosting builds a weighted ensemble of weak classifiers: it iteratively updates the sample weights and the classifier weights so that the combined result approaches the optimal classifier. Samples misclassified in one round are given larger weights, so they are emphasized by the next classifier, which raises the chance that they end up classified correctly.
3. Stacking differs from bagging and boosting in two main ways. First, stacking usually combines heterogeneous weak learners (different learning algorithms), while bagging and boosting mostly combine homogeneous ones. Second, stacking combines the base models with a learned meta-model, whereas bagging and boosting combine weak learners with a deterministic rule.
With those three approaches roughly understood: bagging is closely related to random forest, but they differ — a random forest trains decision trees on bootstrap samples and, in addition, considers only a random subset of the features at each split, whereas plain bagging lets each base classifier use all of the features. Boosting's key idea is the two kinds of weights introduced above, which is the crucial point. Stacking can, for example, take a KNN classifier, logistic regression, and an SVM as weak learners and train a neural network as the meta-model: the network takes the three weak learners' outputs as input and returns the final prediction based on them.
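The three approaches above can be sketched side by side. This is a minimal illustration using scikit-learn's built-in ensemble classes on a synthetic dataset; the dataset and estimator choices here are my own, not part of this project:

```python
# Sketch: bagging, boosting and stacking on a toy dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: homogeneous trees on bootstrap samples, combined by voting
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
# Boosting: sequential weak learners, re-weighting misclassified samples
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
# Stacking: heterogeneous weak learners combined by a learned meta-model
stack = StackingClassifier(
    estimators=[('knn', KNeighborsClassifier()),
                ('lr', LogisticRegression(max_iter=1000)),
                ('svm', SVC())],
    final_estimator=LogisticRegression())

for name, clf in [('bagging', bag), ('boosting', boost), ('stacking', stack)]:
    clf.fit(X_tr, y_tr)
    print(name, round(clf.score(X_te, y_te), 3))
```

Note that the stacking meta-model here is a logistic regression rather than the neural network mentioned above; any learner can serve as the meta-model.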
The model we use, XGBoost, is a boosting method; the name is short for "eXtreme Gradient Boosting". XGBoost comes from the gradient boosting framework but is far more efficient: parallel computation, approximate tree construction, effective handling of sparse data, and careful memory usage are reported to make XGBoost more than ten times faster than earlier gradient boosting implementations.
The code is as follows:
# A simple application of XGBoost
# Predicting income bracket
# Import the analysis libraries
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
# Load the data file; adult.data ships without a header row, so pass header=None
# (otherwise the first record is silently consumed as the column names)
df = pd.read_csv('adult.data', header=None)
df
The data source has roughly the following structure (a preview of the raw frame; 32560 rows × 15 columns in the original printout):

39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Now we clean the data:
# Data cleaning
# Rename the columns: age, working class, id, education, years of education, marital status,
# occupation, relationship, race, sex, capital gain, capital loss, hours worked per week,
# native country, income
# (the third column, named 'id' here, is really the census sampling weight fnlwgt; it is dropped later)
df.columns = ['age', 'workplace', 'id', 'education', 'education_num', 'marital_status',
'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
'hours_per_week', 'native_country', 'income']
df1 = df.copy()
# Inspect the dataset
df1.info()
# Count the missing values (isna and isnull are aliases, so one call is enough)
df1.isna().sum()
df2 = df1.replace(' ?', np.nan).dropna()  # missing values are coded as ' ?'; turn them into NaN and drop those rows
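A minimal sketch of what the cleaning step does, on a made-up two-row frame rather than the actual census data (note the leading space in ' ?', which matches how the values appear in this file):

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame with one missing value coded as ' ?'
toy = pd.DataFrame({'workplace': [' Private', ' ?'], 'age': [38, 54]})

# Turn the ' ?' placeholder into NaN, then drop rows containing NaN
cleaned = toy.replace(' ?', np.nan).dropna()

print(len(toy), len(cleaned))  # 2 1 — only the ' Private' row survives
```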
With cleaning done, we move on to feature selection.
# Feature engineering: drop the unhelpful feature
df3 = df2.drop('id', axis=1)
# One-hot encode the categorical features
df_get_dum = pd.get_dummies(df3.iloc[:, :-1])
# Show all columns
pd.set_option('display.max_columns', None)
# Show all rows
pd.set_option('display.max_rows', None)
df_get_dum.head()
df_get_dum['income'] = df3['income']
data = df_get_dum.copy()
# Encode the label column
from sklearn.preprocessing import LabelEncoder
lab = LabelEncoder()
data['income'] = lab.fit_transform(data['income'])
data['income'].value_counts()
db = data.copy()  # the final machine-learning-ready dataset; preprocessing is complete
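To make the two encoding steps concrete, here is a toy illustration (with a hypothetical three-row frame, not the real dataset) of how get_dummies expands a categorical feature into indicator columns and LabelEncoder maps the label strings to integers:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({'education': ['Bachelors', 'HS-grad', 'Bachelors'],
                    'income': ['<=50K', '>50K', '>50K']})

# One-hot encode the categorical feature: one 0/1 column per category value
dummies = pd.get_dummies(toy[['education']])
print(dummies.columns.tolist())  # ['education_Bachelors', 'education_HS-grad']

# Encode the label into integers; classes are assigned in sorted order
lab = LabelEncoder()
print(lab.fit_transform(toy['income']).tolist())  # [0, 1, 1]
```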
Next we build the model.
# Modelling, prediction and evaluation (metrics from sklearn.metrics)
# XGBoost
import time
import numpy as np
import xgboost as xgb
from xgboost import plot_importance, plot_tree
from sklearn.metrics import accuracy_score
import os
%matplotlib inline
# Shuffle the dataset (the display options were already set above)
from sklearn.utils import shuffle
db1 = shuffle(db)
db1.head(20)
Note: sklearn provides a shuffle() helper that randomizes the row order of the dataset, which helps the trained model generalize more robustly.
# Train/test split
from sklearn.model_selection import train_test_split
target_name = 'income'
X = db1.drop('income', axis=1)
y = db1[target_name]
# Training data and training target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=123, stratify=y)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
With the training and test sets split, it is time to set the model parameters. XGBoost parameters fall into three groups:
1. General parameters, also called macro parameters
a. booster: the model used at each boosting iteration; the two choices are gbtree and gblinear. gbtree boosts tree-based models, gblinear boosts linear models. The default is gbtree.
2. Booster parameters
a. eta: the learning rate, which shrinks the weight of each new tree
b. max_depth: the tree depth, used to limit overfitting; the default is 6
3. Learning task parameters:
a. objective [default=reg:linear]
# Defines the learning task and the corresponding objective. The available objectives include:
# "reg:linear" – linear regression.
# "reg:logistic" – logistic regression.
# "binary:logistic" – logistic regression for binary classification; outputs probabilities.
# "multi:softmax" – multiclass classification with a softmax objective; also requires num_class (the number of classes).
# "multi:softprob" – same as softmax, but outputs a vector of ndata * nclass values, which can be reshaped into an ndata-row, nclass-column matrix; each row holds the probabilities of that sample belonging to each class.
Here we use the softmax objective.
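The difference between multi:softmax and multi:softprob is only whether the per-class scores are collapsed by argmax or returned as probabilities. A small numpy sketch of the softmax step (the raw scores here are made up for illustration):

```python
import numpy as np

def softmax(scores):
    # Subtract each row's max for numerical stability, then normalize
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

raw = np.array([[1.2, -0.3],   # hypothetical per-class margins
                [-0.5, 0.8],   # for 3 samples and 2 classes
                [2.0, 1.9]])
probs = softmax(raw)           # what multi:softprob would return
labels = probs.argmax(axis=1)  # what multi:softmax would return
print(labels.tolist())         # [0, 1, 0]
```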
# Parameter settings
# Training parameters
params = {
    # General parameters
    'booster': 'gbtree',
    'nthread': 4,
    'silent': 0,              # 0 prints run-time messages, 1 runs silently; the default is 0
    'num_feature': 103,
    'seed': 1000,
    # Learning task parameters
    'objective': 'multi:softmax',  # softmax objective for multiclass problems; requires num_class
    'num_class': 2,
    # Booster parameters
    'gamma': 0.1,             # minimum loss reduction required to split a leaf; larger values make the algorithm more conservative, and it depends on the loss function, so it should be tuned
    'max_depth': 6,           # maximum tree depth (default 6); deeper trees fit the data more closely, so this also controls overfitting
    'lambda': 2,              # L2 regularization weight
    'subsample': 0.7,         # fraction of the training samples used per tree, to prevent overfitting
    'colsample_bytree': 0.7,  # fraction of the features sampled when building each tree
    'min_child_weight': 3,    # minimum sum of sample weights required to keep splitting a leaf
    'eta': 0.1,               # shrinkage step size used in the additive model
}
plst = list(params.items())
Next we set the number of boosting rounds, train the model, and evaluate its accuracy on the test set.
# Convert the datasets to XGBoost's DMatrix format
dtrain = xgb.DMatrix(X_train, y_train)
dtest = xgb.DMatrix(X_test)
# Set the number of boosting rounds and train the model
# For multiclass objectives this is the number of rounds per class, so the total
# number of base learners = rounds * number of classes
num_rounds = 100
model = xgb.train(plst, dtrain, num_rounds)  # train the XGBoost model
# Predict on the test set
y_pred = model.predict(dtest)
# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print("accuracy: %.2f%%" % (accuracy * 100.0))
plot_importance(model)
plt.show()
Conclusions:
1. The prediction accuracy of XGBoost is noticeably higher than that of the other algorithms we compared against.
2. The feature with the strongest influence on the prediction is age, which suggests age is the most important factor for income.
Summary: this project aims to understand the code-level implementation of the XGBoost algorithm and to verify its effectiveness on an income classification task. The accuracy is respectable, and walking through the algorithm step by step makes later real-world applications easier.