
Regression Trees and Rule-Based Models (Part 2): Simple Regression Trees

Study notes, for reference only; corrections are welcome.

Regression Trees and Rule-Based Models

Simple Regression Trees

Basic regression trees partition the data into groups in which the samples are relatively homogeneous with respect to the outcome. To achieve this homogeneous partitioning, a regression tree must determine:

  • the predictor to split on and the value of the split
  • the depth or complexity of the tree
  • the prediction equation in the terminal nodes

Here, we focus first on models in which the terminal nodes are constants.

There are many techniques for constructing regression trees; the most common is the classification and regression tree (CART) methodology of Breiman et al.

For regression, the model begins with the entire data set $S$ and searches every distinct value of every predictor to find the predictor and split value that partition the data into two groups ($S_1$ and $S_2$) such that the overall sum of squared errors is minimized:

$$SSE = \sum_{i \in S_1} (y_i - \bar{y}_1)^2 + \sum_{i \in S_2} (y_i - \bar{y}_2)^2$$

where $\bar{y}_1$ and $\bar{y}_2$ are the averages of the training set outcomes within groups $S_1$ and $S_2$.

Then, within each of $S_1$ and $S_2$, the method searches again for the predictor and split value that best reduce the SSE. Because this process splits the data recursively, the approach is also commonly called recursive partitioning.
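To make this split search concrete, here is a minimal sketch in Python (illustrative only, not the CART implementation itself): it enumerates every distinct value of every predictor and returns the split with the smallest SSE. In a full tree, this search would then be repeated recursively within each resulting group.

```python
import numpy as np

def best_split(X, y):
    """Search every distinct value of every predictor for the split that
    minimizes SSE = sum((y_left - mean_left)^2) + sum((y_right - mean_right)^2)."""
    best = (None, None, np.inf)  # (predictor index, split value, SSE)
    for j in range(X.shape[1]):
        for value in np.unique(X[:, j]):
            left, right = y[X[:, j] <= value], y[X[:, j] > value]
            if len(left) == 0 or len(right) == 0:
                continue
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, value, sse)
    return best

# Small synthetic example: the outcome depends on whether the first predictor is positive.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.where(X[:, 0] > 0, 5.0, 0.0) + rng.normal(scale=0.5, size=100)
print(best_split(X, y))
```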

When the predictor is continuous, the process for finding the optimal split point is straightforward since the data can be ordered in a natural way.

Binary predictors are also easy to split, because there is only one possible split point.

If grown to full size, the tree can become very large and is likely to over-fit the training set. It is therefore often pruned back to a smaller depth. The procedure used for this is called cost-complexity tuning.

The goal of this process is to find a "right-sized tree" that has the smallest error rate. To do this, we penalize the error rate using the size of the tree:

$$SSE_{c_p} = SSE + c_p \times (\text{number of terminal nodes})$$

where $c_p$ is called the complexity parameter. For a specific value of the complexity parameter, we find the smallest pruned tree that has the lowest penalized error rate.

As with other regularization methods previously discussed, smaller penalties tend to produce more complex models, which result in larger trees.

Larger values of the complexity parameter may result in a tree with one split (a stump) or, perhaps, even a tree with no splits.
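As a hedged illustration, scikit-learn's DecisionTreeRegressor exposes this style of cost-complexity pruning through its ccp_alpha parameter (its analogue of $c_p$); the sketch below, on synthetic data, shows how larger penalties produce smaller trees.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=300)

# Grow an unpruned tree, then compute its cost-complexity pruning path.
full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)
path = full_tree.cost_complexity_pruning_path(X, y)

# Larger ccp_alpha values give smaller (more heavily pruned) trees.
for alpha in path.ccp_alphas[::10]:
    pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}")
```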

To find the best pruned tree, the data are evaluated across a sequence of $c_p$ values, and this process produces one SSE for each candidate value. However, these SSE values will change if a different sample is chosen. To gauge the variation in SSE at each $c_p$ value, Breiman et al. suggested a cross-validation-like procedure. They also proposed the one-standard-error rule for selecting the simplest tree: choose the simplest tree whose error is within one standard error of the tree with the smallest absolute error.

Alternatively, the model can be tuned by choosing the value of the complexity parameter associated with the smallest possible RMSE value.
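A minimal sketch of this kind of tuning, again using scikit-learn's ccp_alpha as a stand-in for $c_p$ on synthetic data: cross-validated RMSE is estimated over a grid of penalty values and the value with the smallest RMSE is selected. (The one-standard-error rule would instead pick the largest penalty whose mean RMSE stays within one standard error of that minimum.)

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=300)

alphas = np.linspace(0.0, 0.05, 11)
results = []
for alpha in alphas:
    scores = cross_val_score(
        DecisionTreeRegressor(random_state=0, ccp_alpha=alpha),
        X, y, cv=10, scoring="neg_root_mean_squared_error")
    results.append((alpha, -scores.mean(), scores.std()))

# Choose the complexity parameter with the smallest cross-validated RMSE.
best_alpha, best_rmse, _ = min(results, key=lambda r: r[1])
print(f"best alpha={best_alpha:.3f}, RMSE={best_rmse:.3f}")
```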

This particular tree methodology can also handle missing data. When building the tree, missing data are ignored. For each split, a variety of alternatives are evaluated.

These alternatives are called surrogate splits.

A surrogate split is one whose results are similar to the original split actually used in the tree. If a surrogate split approximates the original split well, it can be used when the predictor data associated with the original split are not available.
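Surrogate splits are implemented in R's rpart rather than in scikit-learn, so the sketch below is only a simplified illustration of the idea (names and procedure are illustrative, not rpart's exact algorithm): among the other predictors, find the split that best agrees with the left/right assignment made by the primary split.

```python
import numpy as np

def surrogate_split(X, primary_col, primary_value):
    """Find, among the other predictors, the split that best reproduces the
    left/right assignment of the primary split (a simplified illustration)."""
    primary_goes_left = X[:, primary_col] <= primary_value
    best = (None, None, 0.0)  # (column, value, agreement rate)
    for j in range(X.shape[1]):
        if j == primary_col:
            continue
        for value in np.unique(X[:, j]):
            agreement = np.mean((X[:, j] <= value) == primary_goes_left)
            agreement = max(agreement, 1.0 - agreement)  # allow the reversed direction
            if agreement > best[2]:
                best = (j, value, agreement)
    return best

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.2, size=200)   # highly correlated with x1
x3 = rng.normal(size=200)                   # unrelated predictor
X = np.column_stack([x1, x2, x3])
# x2 should emerge as a good surrogate for a split on x1 at 0.
print(surrogate_split(X, primary_col=0, primary_value=0.0))
```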

Once the tree has been finalized, we can begin to assess the relative importance of the predictors to the outcome. One way to compute an aggregate measure of importance is to keep track of the overall reduction in the optimization criteria for each predictor. If SSE is the optimization criterion, then the reduction in SSE for the training set is aggregated for each predictor. Intuitively, predictors that appear higher in the tree (earlier splits) or those that appear multiple times in the tree will be more important than predictors that occur lower in the tree or not at all.
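As one concrete example of this aggregation, scikit-learn's feature_importances_ attribute reports, for each predictor, its total contribution to the reduction of the splitting criterion, normalized to sum to one; the data below are synthetic and only for demonstration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
# Only the first two predictors influence the outcome.
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

# Importance = total reduction in the splitting criterion attributed to each predictor.
for name, imp in zip(["x1", "x2", "x3", "x4"], tree.feature_importances_):
    print(f"{name}: {imp:.3f}")
```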

An advantage of tree-based models is that, when the tree is not large, the model is simple and interpretable. Also, this type of tree can be computed quickly (despite using multiple exhaustive searches). Tree models intrinsically conduct feature selection; if a predictor is never used in a split, the prediction equation is independent of these data. This advantage is weakened when there are highly correlated predictors. If two predictors are extremely correlated, the choice of which to use in a split is somewhat random.

While trees are highly interpretable and easy to compute, they do have some noteworthy disadvantages. First, single regression trees are more likely to have sub-optimal predictive performance compared to other modeling approaches. This is partly due to the simplicity of the model. By construction, tree models partition the data into rectangular regions of the predictor space. If the relationship between predictors and the outcome is not adequately described by these rectangles, then the predictive performance of a tree will not be optimal. Also, the number of possible predicted outcomes from a tree is finite and is determined by the number of terminal nodes.

An additional disadvantage is that an individual tree tends to be unstable. If the data are slightly altered, a completely different set of splits might be found.
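A small sketch of this instability, under the assumption that we refit the tree on bootstrap resamples of the same data and record which predictor is chosen for the first (root) split; with two correlated, nearly equally useful predictors, the root split can change from resample to resample.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 1] = X[:, 0] + rng.normal(scale=0.3, size=200)  # correlated competitor
y = X[:, 0] + rng.normal(scale=0.5, size=200)

root_splits = []
for _ in range(20):
    idx = rng.integers(0, len(y), size=len(y))  # bootstrap resample
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X[idx], y[idx])
    root_splits.append(tree.tree_.feature[0])   # predictor index used at the root

# Counts of which predictor each resample chose for its root split.
print(np.bincount(root_splits))
```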

Finally, these trees suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors.

The danger occurs when a data set consists of a mix of informative and noise variables, and the noise variables have many more splits than the informative variables. Then there is a high probability that the noise variables will be chosen to split the top nodes of the tree. Pruning will produce either a tree with misleading structure or no tree at all.

Also, as the number of missing values increases, the selection of predictors becomes more biased.

Several unbiased regression tree techniques do exist in the literature. Loh proposed the generalized, unbiased, interaction detection and estimation (GUIDE) algorithm.

GUIDE solves the problem by decoupling the process of selecting the split variable and the split value. This algorithm ranks the predictors using statistical hypothesis testing and then finds the appropriate split value associated with the most important factor.

Hothorn et al. proposed conditional inference trees. In this model, statistical hypothesis tests are used to do an exhaustive search across the predictors and their possible split points. For a candidate split, a statistical test is used to evaluate the difference between the means of the two groups created by the split, and a p-value can be computed for the test.
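Conditional inference trees themselves are implemented in R's party/partykit packages; as a rough Python sketch of the idea only (not Hothorn's exact procedure, and the names are illustrative), the snippet below scores candidate splits with a two-sample t-test p-value comparing the outcome means of the two resulting groups.

```python
import numpy as np
from scipy import stats

def split_p_value(x, y, split_value):
    """p-value of a two-sample t-test comparing the mean outcomes of the two
    groups induced by splitting predictor x at split_value."""
    left, right = y[x <= split_value], y[x > split_value]
    _, p = stats.ttest_ind(left, right, equal_var=False)
    return p

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.where(x > 5, 2.0, 0.0) + rng.normal(size=200)

# Smaller p-values indicate stronger candidate splits.
for v in [2.5, 5.0, 7.5]:
    print(f"split at {v}: p = {split_p_value(x, y, v):.2e}")
```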
