xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

前言

本文是对 <<XGBoost：A Scalable Tree Boosting System>> 的算法解读。本着最近养成的写此类文章的习惯，我会延续先介绍 Paper 中涉及到的一些扫盲概念，然后逐步切入 XGBoost 主要设计思想，以及对公式的详细解读，使读者能对 Paper 中的抽象部分有更深度的理解。

此 Paper 是华人学术明星来自华盛顿大学的 XGBoost 作者本人：陈天奇，在SIGKDD 2016 大会上发表过的论文。在深度学习大火之前的年代 XGBoost 几乎横扫了 Kaggle 竞赛里面的大部分的奖项，XGBoost 出于它天生的树结构设计优势，也在工业界的分布式计算中得到广泛应用。在我个人经历的项目中，树类模型出于出色的可解释性，经常被用于 Baseline 模型（比如 feature importance 的存在）。

由于原始论文中涉及到了大量的决策树（Decision Tree），提升树（Boosting Tree）等算法的相关的知识，本文第一章会先从一些容易混淆的概念，数学公式和抽象的算法思路入手，使读者对该论文涉及到的“树”知识体系有一个大体的认识。然后在第二章、第三章以与论文的相同顺序详细解释作者提到的三个 Tree Boosting 算法和分裂算法（Split Finding Algorithms）。由于本文除了数学算法上的优化外还涉及到计算机系统方向的优化，这部分内容会在后面做简单的总结。由于论文后几章的内容不涉及到抽象的数学和计算机系统优化，考虑到篇幅的关系，不在此赘述，有兴趣的朋友可以直接阅读原文。

1. 基础概念 1.1 决策树 & CART

决策树(Decision Tree）是在已知各种情况发生概率的基础上，通过构成决策树来求取净现值的期望值大于等于零的概率，评价项目风险，判断其可行性的决策分析方法，是直观运用概率分析的一种图解法。由于这种决策分支画成图形很像一棵树的枝干，故称决策树[1]。

决策树是一种十分常用的分类方法，需要监督学习（Supervised Learning），就是给出一堆样本，每个样本都有一组属性和一个分类结果，通过学习，优化这些样本得到决策树，能够对新的数据给出正确的分类。

在树分裂（某节点分出新节点）时，使用信息论中的熵来衡量随机事件的“混乱程度”。随机变量的信息熵越大，则它的值（内容）能给你补充的信息量越大。数学上被记为：

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

。其中 p(x) 为 X 的概率函数。决策树本着熵越大分类结果越好的原则，使用数据的维度生成树。

常见的熵计算有[2]：

ID3 算法中根据特征选择和信息增益评估，每次选择 信息增益最大 的特征作为分支标准。
C4.5 使用“ 增益率 ”（gain ratio）来选择最优的分支标准
CART 的分支标准建立在 GINI 指数 上，GINI 值越大表明样本集合的类别越杂乱

决策树可宏观上分为：

分类树

和

回归树

两类。基本流程都是根据样本的特征找出阈值，但是这两类树的阈值衡量 / 计算标准有所不同，所以相对应的损失函数也不同。分类树选择阈值的标准是实现每个分枝都尽可能只有一个分类，即男性分枝上尽可能多的是男性，女性尽可能少；女性分枝上尽可能多的是女性，男性尽可能少；回归树选择阈值的标准则是：实现预测值和真实值的最小化均方差。回归和分类从实现上，分别用不同的方法实现了"动态确定样本权值"这一目标。

总结：回归树是用拟合残差，分类树是用错误率来调整样本权值。

1.2 提升树 1.2.1 基本思想

梯度提升（Gradient Boost）其实只是一个框架或者设计理念，里面可以套用不同的算法。其核心思想是，用多个弱分类器（比如生成的子树）来构建一个强分类器（最终集成出来的树）。大体流程是首先建立多棵决策树做集成，再将所有树的输出结果（学出的最佳参数）累加起来。或者可以这样说，每一棵树（以回归树为例）学习的是之前所有树结论和的

残差，

这个残差就是一个加预测值后能得真实值的累加量。

其数学表达式可写为：

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

其中

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

为第 k 个棵树（弱分类器），

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

为第 k 个决策树的参数， K 为决策树的数量。

算法具体步骤如下：

初始化提升树

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》
= 0
其中残差

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》
为当前模型拟合数据的残差
第 k 步的模型为：

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》
, 其中

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

为当前的待求的由残差训练得出第

k 棵树
通过经验风险极小化确定第 k 个决策树的参数

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》
：

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》
其中：对于回归类的损失函数为

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》
；对分类类的损失函数为

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》
;

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》
为真实值，

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》
为预测值

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》
的另一种写法。利用这两种损失函数的性质在求导过程中会给我们带来很多方便。
最后整理后得出分类树的损失函数（此处为推导过程，不包括损失函数）：

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

将损失函数带入

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

变成求目标函数最小值问题。迭代方法见图1。

见到网上很多文章里把 Gradient Boost 理解成 Gradient Boosting Decision Tree（GBDT），这其实是错误的认识，后者严格上来讲是指迭代梯度决策树，是 Gradient Boosting Tree 中的一个分类算法。另外与 Boosting 同类的集成模型框架中还有一个叫做 Bagging ，在机器学习训练过程中和 Boosting 有一定的区别。具体见Bagging vs Boosting。

本文所涉及到的 XGBoost 看名字就知道，是讲的 Boosting 的算法。下面小节中我会介绍两种常见的 Boosting 算法，其中 Paper 中第二章讨论的基础就是 Gradient Boosting Regression Tree（GBRT），且使用了 CART 作为树分裂算法，所以后面所涉及到的优化算法例子也是基于它的。

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

图1：Boosting Tree 计算过程

1.2.1 提升树的优化（Gradient Boosting Tree）

上文中的例子，损失函数可以是

平方损失函数

或

指数损失函数

，在此基础上

Freidman

提出了

梯度提升树算法(GBT)

。梯度提升树(GBT)的一个核心思想是

利用损失函数的负梯度在当前模型的值作为残差的近似值

，本质上是对损失函数进行一阶泰勒展开，从而拟合一个回归树。

回忆

泰勒展开公式（稍后 XGBoost 会用）：

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

在

GBT

中损失函数使用

一阶泰勒展开公式（

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

相当于这里的

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

）

：

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

则有：

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

意思为：要使得损失函数降低，则根据梯度下降的思想让损失函数对

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

进行求导，按照负梯度更新改损失函数值，所以才有：

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

其中对于平方或指数损失函数，就是通常意义上的残差。对于其他普通函数，残差是导数的近似值。用于回归模型时，是

梯度提升回归树GBRT；

梯度提升树用于分类模型时，是

梯度提升决策树 GBDT

；二者的区别主要是损失函数不同。此处借鉴了GBT、GBDT、GBRT与Xgboost 文章。

1.2.1 迭代梯度回归树 (GBRT)

迭代决策树算法，由多棵决策树组成，并将所有树的输出结果累加起来。且每一棵树是从之前所有树的

残差

中来学习的。GBRT 几乎可用于所有的回归问题（线性/非线性），相对logistic regression仅能用于线性回归，GBRT的适用面非常广。亦可用于二分类问题（设定阈值，大于阈值为正例，反之为负例）。损失函数数学表达式见 1.2 。

1.2.2 迭代梯度决策树 (GBDT)

GBDT 分类算法在思想上和回归算法没有区别，但是由于样本输出不是连续的值，而是离散的类别，导致我们无法直接从输出类别去拟合类别输出的误差。为解决此问题，我们尝试用类似于逻辑回归的对数似然损失函数的方法,也就是说我们用的是类别的预测概率值和真实概率值来拟合损失函数。对于对数似然损失函数，我们有二元分类和多元分类的区别。[3]

2. 从 GDBT 升级到 XGBoost

在第一章节，我用了大量的篇幅介绍了 GBT 的数学推导和工作原理，正是因为在这些理论上才有了 XGBoost ，在原始论文中作者本着读者已经了解 GBT 和 Freidman 提出的优化观点直接切入 XGBoost 话题，导致了很多初学者一时间无法领悟这么多容易混淆的概念。

在有了良好的基础后，我们下面回到论文的主题，对 XGBoost 进行更深度的剖析。原文作者在现有的 GDBT 思路上做了些小改动。注：下文中关于 XGBoost 的讲解基于 GBRT 和 CART 算法。

2.1 目标函数添加正则化

由此章节前一部分和我们讲得基本吻合，所以此处只会对提到的数学符号做一下解释。

给定 XGBoost 的目标函数：

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

；其中

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

公式中

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》
是 1.2.1 章节中提到的损失函数，限制为可微、且为凸函数
xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》
表示对模型复杂度的惩罚项（正则项）
在惩罚项中

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》
、

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》
表示惩罚系数
xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》
表示给定一颗树的叶节点数
决策树定义为

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》
，

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》
为某一样本，这里的

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》
表示该样本所在的叶子结点，

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》
为叶子结点权重

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》
，所以得出

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》
为每个样本的取值

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》
（即预测值）。所以

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》
表示每颗树叶节点上的输出分数的平方(相当于L2正则)

从目标函数的定义可以看出 XGBoost 的惩罚项对模型复杂度考虑了每颗树的叶节点个数，以及每颗树叶节点输出得分值得平方和。作者强调与 Regularized greedy forest[4] 相比，这个惩罚项更便于做并行化计算的优化。

2.2 梯度提升树

在通常的模型中针对这类目标函数可以使用梯度下降的方式进行优化，但注意到

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

表示的是一颗树，而非一个数值型的向量，所以不能使用梯度下降的方式去优化该目标函数。在此作者提出了使用

前向分步算法

（additive manner）。

目标函数为：

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

其中第 i 个样本在第 t 颗树的预测值 (

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

)等于样本 i 在前 t-1 棵树的预测值(

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

) 加当前第 t 颗树预测值(

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

) 公式表达为：

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

对于该函数的优化，在 XGBoost 中使用了我们前面提到的泰勒展开式，与 GDBT 不同的是 XGBoost 使用了泰勒二次展开式。

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

使

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

则有：

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

这里是针对

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

的求导，所以可以将

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

部分看成常数去掉。

得出简化后的损失函数：

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

定义集合

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

为树的第 j 个叶节点上的所有样本点的集合，即给定一颗树，所有按照决策规则被划分到第 j 个叶节点的样本集合。根据模型复杂度惩罚项的定义有：

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

第二步推导第三步过程请回顾 2.1 章节公式，

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

代表每个样本的损失函数，但这些值最终都会被归为某个叶子节点，且有

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

；

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

表示该样本所在的叶子结点；

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

表示第

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

个叶子节点取。所以也可以把叶子节点上所有损失累加起来，效果和 2.1 的损失函数效果是一样的。这里有可能会思考起来有点抽象，建议把整个 GBDT 计算过程走一遍会更容易理解，推荐阅读 GDBT 实例理解。

如果确定了树的结构，为了使目标函数最小，可以令其导数为0，解得每个叶节点的最优预测分数为：

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

将

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

代入简化后的目标函数得到最小损失为：

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

替换：

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

则有：

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

用做为得分值评价一颗树的好坏。这和决策树中的剪枝是一样原理，给定一个损失函数，判断剪枝后,根据损失函数是否减小来决定是否执行剪枝如图2。

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

图2 Structure score calcuation

这里作者也提出了，通常是不会把所有可能出现的排列组合树形都枚举出来。他解决此问题用了

贪心算法

。假设在某节点分裂了左

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

，右节点

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

且满足如果确定了树的结构（即q(x)确定），为了使目标函数最小，可以令其导数为0，解得每个叶节点的最有预测分数为：

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

那么分裂后的 L 可以表示为：

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

该值越大，说明分裂后能使目标函数减少越多，就越好。此方法被用于 XGBoost 判断最优特征以及最优切分点（此处借鉴了xgboost的理解），换句话说这个目标函数的值诠释了

树结构的好坏

。

2.3 Shrinkage & 列降采样

作为 Gradient Boosting 算法，作者提出可以使用

Shrinkage 策略（权重衰减，类似 LSTM）。比如把整个训练过程看作一个时间序列，离当前树时间点越靠前的权重对当前权重的累加影响越小，这个衰减就是论文里的参数

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》
控制的
特征降采样的方法来避免过拟合，只是这里的降采样使用的是 列降采样 （与随机森林做法一样），它的另外一个好处是可以方便加速并行化

3. 分裂查找算法

在此作者点出了树类算法与其他算法（比如求距离，概率）比较头疼的地方在于：树的生成/分裂过程。因为每个树生成的问题都要考虑到两个问题是：1）按照哪个维度 / 顺序组合切 2）如何判断

最佳切分点

。

围绕着这两个问题引出了如下两个解决方案。

3.1 精确贪心算法

从树的根节点开始，对每个叶节点枚举所有的可用特征。此处文中指出：对该节下的数据需令其按特征值进行排序，这样做可以使计算分裂收益值更加高效，

分裂收益值

计算如下：

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

然后选择该轮最大收益值作为

最佳分裂点

，使在该节点行分裂出左右两个新的叶节点，并重新分配新节点上的样本。至此一个循环结束，继续按同样流程递归迭代直至满足条件停止。相应实现步骤在论文中给出了伪代码，见 3.1 章节 Algorithm 1：Exact Greedy Algorithm for Split Finding。

由于该方法需要穷举所有分裂可能，可以在特征量大的情况下内存是肯定不够的。对此作者给出了下一个近似值计算方法。

3.2 求近似值算法

为了减少计算作者使用了将连续特征变量按照特征值分布进行分桶操作的方法，这样只需要计算每个桶中的统计信息就可以求出最佳分裂点的最佳分裂收益值。

在确定分裂点时作者提出了两种方法：

Global、Local

。

Global

表示在生成树

之前

进行候选切分点（candidate splits）的计算，且在整个计算过程中只做

一次

操作。在以后的节点划分时都使用已经提前计算好的候选切分点；

Local

则是在每次节点划分时才进行候选切分点的计算。经作者实战经验，如果想使两种方法达到逼近

精确贪婪算法

的准确性，需要对 Global 取更多的候选切分点来提高精度。由于 Local 在每次节点划分时都要计算，所以他们在有些情况下计算量很接近。

这里作者给出的总结是：Global 适合在取大量切分点下使用； Local 更适用于深度较大的树结构。具体需要在实际中根据训练数据自由调整。

3.3 加权分位数略图（Weighted Quantile Sketch）

为了避免简单统计指标对定义候选切分点不合理的问题，作者引出了加权分位数略图的概念。

设数据集

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

表示每个样本的第 k 个特征值(

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

)和二阶导数(

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

)的集合。

定义排名函数

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

；表示数据集中第 k 个特征值小于 z 的样本所在比例。

然后根据

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

进行侯选点的选取。

这里有一个也许让正常人都觉得很抽象的概念，

为什么会用二阶导数？它是怎么影响样本权重划分的？

由于作者只用一个公式带过，让这件事看起来更难以理解。我本人也是借鉴了其他人的推导过程[5]。

具体如下：

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

反正我自己是想不出来，但看过程无非就是凑个平方差公式。最后在标签为

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

的情况下，权重为

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

。

最后给我这类数学分析基础较差的爱好者们总结下：综上所述证明了在求解对样本点合理切割的问题上，可以用样本的特征值和其二阶导数的关系更好地划分数据分布范围！

3.4 稀疏值处理（Sparsity-aware Split Finding）

对于缺失数据作者让模型自动学习默认的划分方向。采用的是在每次的切分中，让缺失值分别被切分到左节点以及右节点，通过计算得分值比较两种切分方法哪一个更优，则会对每个特征的缺失值都会学习到一个最优的默认切分方向。

4. 总结

以上只是总结了该论文在数学算法上的分析，另外作者也从计算机系统方向做了大量优化，使模型在训练中的速度大大提升，同时也可以利用硬盘 IO 处理超大规模数据，甚至优化了缓存。由于这部分和计算机系统更加相关，在此就不做赘述。

对比传统的 GBDT，XGBoost 在目标函数上加入了惩罚项，使模型的泛化能力大大增强，且对行列支持降采样，优化了计算速度。

本人不太确定的是：对于特征工程处理缺失值使用让模型自动学习的方法，也许会造成模型在生产环境中的稳定性造成一定影响，文中作者的只提到了它在计算速度上的效率，但并未给出模型处理这类稀疏问题的准确性。我从个人之前数据挖掘经验来看，这种插值方法在真正的生产环境中对模型的稳定性也许会存在一定隐患（比如模型准确率不到一个月就会大打折扣）。不知具体实战效果如何？不知可否使用传统的中位数代替？

iris 数据集测试代码

：

#importing library and segregation of data as train and test using DMatrix 
# Data structure
from sklearn import datasets
from sklearn import metrics
from sklearn.model_selection import train_test_split
import xgboost as xgb

iris = datasets.load_iris() #dataset loading
X = iris.data               #Features stored in X 
y = iris.target             #Class variable

#Splitting dataset into Training (80%) and testing data (20%) using train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 这里提前定义了特征名称，以便之后 feature_importance 能直接显示特征名，而不是 f{n}
feature_names = ['sepal_len', 'sepal_width', 'petal_len', 'petal_width']

dtrain = xgb.DMatrix(data = X_train, 
                     label = y_train, 
                     feature_names = feature_names)

dtest = xgb.DMatrix(data = X_test, 
                    label = y_test)

param = {
    'max_depth': 3,  # the maximum depth of each tree
    'eta': 0.3,  # the training step for each iteration
    'silent': 1,  # logging mode - quiet
    'objective': 'multi:softprob',  # error evaluation for multiclass training
    'num_class': 3}  # the number of classes that exist in this datset

num_round = 5  # the number of training iterations

bst = xgb.train(param, dtrain, num_round)
y_predict = bst.predict(dtest)
print(metrics.accuracy_score(y_test, y_pred))

bst.dump_model('dump.raw.txt')

这个文件会生成 xgboost 库在创建树结构时的具体信息，包括：

ID: 每个节点的ID
Feature: 使用树分裂的特征，分裂停止于叶子节点
Split: 分裂时计算出的特征值
Yes: 分支中下一个节点满足分裂条件的特征ID
No: 分支中下一个节点不满足分裂条件的特征ID
Missing: 分支中下一个节点用于分裂条件的特征ID，且该特征有缺失值

booster[0]:
0:[petal_len<2.45000005] yes=1,no=2,missing=1
	1:leaf=0.426035523
	2:leaf=-0.218845025
booster[1]:
0:[petal_len<2.45000005] yes=1,no=2,missing=1
	1:leaf=-0.213017777
	2:[petal_width<1.75] yes=3,no=4,missing=3
		3:leaf=0.357142866
		4:leaf=-0.193288609
booster[2]:
0:[petal_len<4.75] yes=1,no=2,missing=1
	1:[petal_width<1.45000005] yes=3,no=4,missing=3
		3:leaf=-0.217894763
		4:leaf=-0.109756112
	2:[petal_width<1.75] yes=5,no=6,missing=5
		5:leaf=0.0878048688
		6:leaf=0.404698014
booster[3]:
0:[petal_len<2.45000005] yes=1,no=2,missing=1
	1:leaf=0.293278694
	2:leaf=-0.195857152
booster[4]:
0:[petal_len<2.45000005] yes=1,no=2,missing=1
	1:leaf=-0.189503655
	2:[petal_len<4.94999981] yes=3,no=4,missing=3
		3:leaf=0.262102395
		4:leaf=-0.163355067
booster[5]:
0:[petal_len<4.75] yes=1,no=2,missing=1
	1:[petal_width<1.45000005] yes=3,no=4,missing=3
		3:leaf=-0.19525367
		4:leaf=-0.0907141864
	2:[petal_len<4.94999981] yes=5,no=6,missing=5
		5:leaf=0.0648387149
		6:leaf=0.284316152
booster[6]:
0:[petal_len<2.45000005] yes=1,no=2,missing=1
	1:leaf=0.234882191
	2:leaf=-0.180683166
booster[7]:
0:[petal_len<2.45000005] yes=1,no=2,missing=1
	1:leaf=-0.173286781
	2:[petal_width<1.6500001] yes=3,no=4,missing=3
		3:leaf=0.210330531
		4:leaf=-0.145057067
booster[8]:
0:[petal_len<4.75] yes=1,no=2,missing=1
	1:[petal_width<1.45000005] yes=3,no=4,missing=3
		3:leaf=-0.179621071
		4:leaf=-0.0728611574
	2:[petal_width<1.75] yes=5,no=6,missing=5
		5:leaf=0.046688173
		6:leaf=0.23004286
booster[9]:
0:[petal_len<2.45000005] yes=1,no=2,missing=1
	1:leaf=0.202377468
	2:leaf=-0.169775575
booster[10]:
0:[petal_len<2.45000005] yes=1,no=2,missing=1
	1:leaf=-0.161111236
	2:[petal_len<4.94999981] yes=3,no=4,missing=3
		3:leaf=0.176073477
		4:leaf=-0.133052737
booster[11]:
0:[petal_len<4.75] yes=1,no=2,missing=1
	1:[petal_width<1.45000005] yes=3,no=4,missing=3
		3:leaf=-0.168202192
		4:leaf=-0.059584748
	2:[petal_len<5.05000019] yes=5,no=6,missing=5
		5:leaf=0.0603878833
		6:leaf=0.202011034
booster[12]:
0:[petal_len<2.45000005] yes=1,no=2,missing=1
	1:leaf=0.18176958
	2:leaf=-0.161522761
booster[13]:
0:[petal_len<2.45000005] yes=1,no=2,missing=1
	1:leaf=-0.151254103
	2:[petal_width<1.75] yes=3,no=4,missing=3
		3:leaf=0.149364129
		4:leaf=-0.133772731
booster[14]:
0:[petal_len<4.75] yes=1,no=2,missing=1
	1:[sepal_width<2.54999995] yes=3,no=4,missing=3
		3:leaf=-0.0491230562
		4:leaf=-0.160760283
	2:[petal_width<1.75] yes=5,no=6,missing=5
		5:leaf=0.0225836635
		6:leaf=0.175175846

树结构可视化：

# visualization
import matplotlib.pyplot as plt

xgb.plot_tree(bst, num_trees=1)
fig = plt.gcf()
fig.set_size_inches(150, 100)
fig.savefig('treeIris.png')

fig.show()

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

特征重要性展示：

from xgboost import plot_importance
from matplotlib import pyplot
plot_importance(bst)
pyplot.show()

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

如图所示，在样本量大的情况下我们可以用分布式的方式训练，在得到结果后通过特征贡献程度来辅助选择特征（类似 PCA 的降维操作，不过 XGBoost 是可监督的模型，准确度会更高）。由此可见用此类模型做 Baseline 模型还是有诸多益处的。

如果一定要抱怨，那恐怕只有超参数过多吧。代码例子中只展示了部分超参数，具体见 API。

备注：

这篇文章写得不是太满意，因为树本身不像其他算法那样只涉及到算距离，概率的问题，而是用数学表达了一个抽象的树结构，以及叶子节点信息等概念。。。刚刚看会比较难理解其真正的数学意义。以后会慢慢改进对这部分的描述。

推荐阅读：

阿泽：【机器学习】决策树（下）——XGBoost、LightGBM（非常详细）zhuanlan.zhihu.com

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

金贵涛：对xgboost的理解zhuanlan.zhihu.com

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

XGboost: A Scalable Tree Boosting System论文及源码导读mlnote.com

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

yyHaker：GBT、GBDT、GBRT与Xgboostzhuanlan.zhihu.com

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

参考

^决策树 https://baike.baidu.com/item/%E5%86%B3%E7%AD%96%E6%A0%91
^决策树熵计算： https://blog.csdn.net/u010089444/article/details/53241218
^GDBT 详解 https://zhuanlan.zhihu.com/p/36339161
^RGF https://arxiv.org/pdf/1109.0887.pdf
^加权分位数略图 https://zhuanlan.zhihu.com/p/75217528

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》

参考

继续阅读

xgboost分类_XGBoost (eXtreme Gradient Boosting)可扩展的提升树系统

xgboost分类_算法笔记：XGBoost: A Scalable Tree Boosting Sysem

xgboost分类_XGBoost多分类预测

xgboost分类_收入分类预测--XGboost法

xgboost分类_解读论文 《XGBoost: A Scalable Tree Boosting System》

参考

继续阅读

xgboost分类_解读论文《XGBoost: A Scalable Tree Boosting System》