
Regression Trees and Rule-Based Models (Part 2): Basic Regression Trees

Study notes, for reference only; corrections are welcome.

Regression Trees and Rule-Based Models

Basic Regression Trees

Basic regression trees partition the data into groups whose samples are relatively homogeneous with respect to the outcome. To achieve such homogeneous partitions, a regression tree must decide:

  • the predictor to use for the split and the value of the split point
  • the depth or complexity of the tree
  • the prediction equation in the terminal nodes

Here, we first focus on models in which the terminal nodes predict a constant.

There are many techniques for constructing regression trees; one of the most widely used is the classification and regression tree (CART) methodology of Breiman et al.

For regression, the model begins with the entire data set S and searches every distinct value of every predictor to find the predictor and split value that partition the data into two groups (S_1 and S_2) such that the overall sum of squared errors is minimized:

SSE = \sum_{i \in S_1} (y_i - \bar{y}_1)^2 + \sum_{i \in S_2} (y_i - \bar{y}_2)^2

where \bar{y}_1 and \bar{y}_2 are the averages of the training set outcomes within groups S_1 and S_2, respectively.

Then, within each of the groups S_1 and S_2, the method searches for the predictor and split value that best reduce the SSE. Because this splitting process is recursive in nature, the approach is commonly known as recursive partitioning.
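As a concrete illustration, below is a minimal Python sketch of this exhaustive split search; the function name `best_split` and the toy data are assumptions for illustration, not part of the CART implementation itself.

```python
import numpy as np

def best_split(X, y):
    """Search every distinct value of every predictor and return the
    (predictor index, split value, SSE) triple that minimizes the total
    SSE of the two resulting groups."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):                    # every predictor
        for v in np.unique(X[:, j]):               # every distinct value
            left, right = y[X[:, j] <= v], y[X[:, j] > v]
            if len(left) == 0 or len(right) == 0:
                continue
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, v, sse)
    return best

# Toy example: two predictors, only the first one is informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] > 0.3, 5.0, 1.0) + rng.normal(scale=0.5, size=100)
print(best_split(X, y))  # expected to choose predictor 0, near 0.3
```

Recursive partitioning simply applies the same search again within each of the two groups produced by the chosen split.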

When the predictor is continuous, the process for finding the optimal split point is straightforward, since the data can be ordered in a natural way.

Binary predictors are also easy to split, because there is only one possible split point.

A fully grown tree can become very large and is therefore likely to overfit the training set. For this reason, the tree is usually pruned back to a smaller depth. The pruning procedure employed is known as cost-complexity tuning.

The goal of this process is to find a "right-sized tree" that has the smallest error. To do this, we penalize the error rate using the size of the tree:

SSE_{c_p} = SSE + c_p \times (\text{number of terminal nodes})

where c_p is called the complexity parameter. For a specific value of the complexity parameter, we find the smallest pruned tree that has the lowest penalized error rate.

As with other regularization methods discussed previously, smaller penalties tend to produce more complex models, which result in larger trees.

Larger values of the complexity parameter may result in a tree with one split (a stump) or, perhaps, even a tree with no splits.
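The effect of the penalty can be seen in a small numerical sketch: computing SSE_{c_p} for a sequence of candidate subtrees shows that larger values of c_p favor smaller trees. The subtree sizes and SSE values below are invented purely for illustration.

```python
# Hypothetical pruning sequence: (number of terminal nodes, training SSE).
subtrees = [(1, 120.0), (2, 80.0), (4, 55.0), (8, 45.0), (16, 42.0)]

def penalized_sse(sse, n_terminal, cp):
    # Cost-complexity criterion: SSE_cp = SSE + cp * (# terminal nodes)
    return sse + cp * n_terminal

for cp in (0.0, 1.0, 5.0, 20.0):
    best = min(subtrees, key=lambda t: penalized_sse(t[1], t[0], cp))
    print(f"cp={cp:5.1f} -> best subtree has {best[0]} terminal nodes")
```

With these numbers, cp = 0 keeps the full 16-node tree, while cp = 20 leaves only a single split.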

To find the best pruned tree, the data are evaluated across a sequence of c_p values, and an SSE is computed for each value. However, the SSE will vary when a different sample is chosen. To reflect the variability of the SSE at each c_p value, Breiman et al. suggested a procedure similar to cross-validation. They also proposed the one-standard-error rule as the optimization criterion for identifying the simplest tree: find the simplest tree whose error is within one standard error of the tree with the smallest absolute error.

Alternatively, the model can be tuned by choosing the value of the complexity parameter associated with the smallest possible RMSE value.
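Below is a minimal sketch of both selection strategies, assuming cross-validation has already produced a mean RMSE and a standard error for each candidate c_p value; the numbers are invented for illustration.

```python
# Hypothetical cross-validation results: (cp, mean RMSE, standard error of RMSE).
cv_results = [
    (0.001, 1.90, 0.08),
    (0.010, 1.85, 0.07),
    (0.050, 1.88, 0.06),
    (0.200, 2.10, 0.05),
]

# Strategy 1: pick the cp with the smallest cross-validated RMSE.
best_cp, best_rmse, best_se = min(cv_results, key=lambda r: r[1])

# Strategy 2 (one-standard-error rule): among all trees whose RMSE is within
# one standard error of the best, pick the simplest one (largest cp here).
threshold = best_rmse + best_se
one_se_cp = max(cp for cp, rmse, _ in cv_results if rmse <= threshold)

print(best_cp, one_se_cp)  # 0.01 and 0.05 with these illustrative numbers
```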

This particular tree methodology can also handle missing data. When building the tree, missing data are ignored. For each split, a variety of alternatives are evaluated.

These alternatives are known as surrogate splits.

A surrogate split is one whose results are similar to the original split actually used in the tree. If a surrogate split approximates the original split well, it can be used when the predictor data associated with the original split are not available.
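As a rough sketch of how this works at prediction time, the fragment below routes a sample through a node using the surrogate split whenever the primary predictor is missing. The dictionary-based node layout and the predictor names are assumptions for illustration, not the actual CART data structures.

```python
import math

# A single tree node with a primary split and one surrogate split.
node = {
    "primary":   {"predictor": "Temperature", "cutpoint": 20.0},
    "surrogate": {"predictor": "Humidity",    "cutpoint": 0.65},
}

def route(sample, node):
    """Return 'left' or 'right' for a sample, falling back to the
    surrogate split when the primary predictor is missing (NaN)."""
    primary = node["primary"]
    value = sample.get(primary["predictor"], float("nan"))
    if math.isnan(value):  # primary value missing: use the surrogate
        surrogate = node["surrogate"]
        value = sample[surrogate["predictor"]]
        return "left" if value <= surrogate["cutpoint"] else "right"
    return "left" if value <= primary["cutpoint"] else "right"

print(route({"Temperature": 18.0, "Humidity": 0.70}, node))        # primary split used
print(route({"Temperature": float("nan"), "Humidity": 0.70}, node))  # surrogate split used
```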

Once the tree has been finalized, we begin to assess the relative importance of the predictors to the outcome. One way to compute an aggregate measure of importance is to keep track of the overall reduction in the optimization criterion for each predictor. If SSE is the optimization criterion, then the reduction in the SSE for the training set is aggregated for each predictor. Intuitively, predictors that appear higher in the tree (earlier splits) or those that appear multiple times in the tree will be more important than predictors that occur lower in the tree or not at all.
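A minimal sketch of this aggregation, assuming each split of a fitted tree has been recorded together with the SSE reduction it achieved; the split records and predictor names below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical list of splits from a fitted tree:
# (predictor name, reduction in training-set SSE achieved by the split).
splits = [
    ("NumCarbon",   450.0),   # top split: largest reduction
    ("MolWeight",   180.0),
    ("NumCarbon",    60.0),   # reused deeper in the tree
    ("SurfaceArea",  25.0),
]

importance = defaultdict(float)
for predictor, sse_reduction in splits:
    importance[predictor] += sse_reduction  # aggregate reduction per predictor

for predictor, total in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{predictor:12s} {total:8.1f}")
```

Predictors used in early splits, or used repeatedly, accumulate larger totals and therefore rank higher.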

An advantage of tree-based models is that, when the tree is not large, the model is simple and interpretable. Also, this type of tree can be computed quickly (despite using multiple exhaustive searches). Tree models intrinsically conduct feature selection; if a predictor is never used in a split, the prediction equation is independent of those data. This advantage is weakened when there are highly correlated predictors. If two predictors are extremely correlated, the choice of which to use in a split is somewhat random.

While trees are highly interpretable and easy to compute, they do have some noteworthy disadvantages. First, single regression trees are more likely to have sub-optimal predictive performance compared to other modeling approaches. This is partly due to the simplicity of the model. By construction, tree models partition the data into rectangular regions of the predictor space. If the relationship between predictors and the outcome is not adequately described by these rectangles, then the predictive performance of a tree will not be optimal. Also, the number of possible predicted outcomes from a tree is finite and is determined by the number of terminal nodes.

An additional disadvantage is that an individual tree tends to be unstable. If the data are slightly altered, a completely different set of splits might be found.

Finally, these trees suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors.

The danger occurs when a data set consists of a mix of informative and noise variables, and the noise variables have many more splits than the informative variables. Then there is a high probability that the noise variables will be chosen to split the top nodes of the tree. Pruning will produce either a tree with misleading structure or no tree at all.

Also, as the number of missing values increases, the selection of predictors becomes more biased.

Several unbiased regression tree techniques do exist in the literature. Loh, for example, proposed the generalized, unbiased, interaction detection and estimation (GUIDE) algorithm.

GUIDE solves the problem by decoupling the process of selecting the split variable from that of selecting the split value. This algorithm ranks the predictors using statistical hypothesis testing and then finds the appropriate split value associated with the most important factor.

Hothorn et al. proposed conditional inference trees. In this model, statistical hypothesis tests are used to do an exhaustive search across the predictors and their possible split points. For a candidate split, a statistical test is used to evaluate the difference between the means of the two groups created by the split, and a p-value can be computed for the test.
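As a rough sketch of this idea, the fragment below scores a single candidate split with a two-sample t-test. Note that conditional inference trees actually use permutation-based tests with multiplicity adjustments, so the t-test here is only a simplified stand-in, and the simulated data are for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(size=200)                                  # one continuous predictor
y = np.where(x > 0.5, 3.0, 1.0) + rng.normal(scale=1.0, size=200)

def split_pvalue(x, y, cutpoint):
    """p-value for the difference in outcome means between the two groups
    created by splitting the predictor at `cutpoint`."""
    left, right = y[x <= cutpoint], y[x > cutpoint]
    _, p = stats.ttest_ind(left, right, equal_var=False)
    return p

# A strongly informative split yields a tiny p-value; a weak one does not.
print(split_pvalue(x, y, 0.5))
print(split_pvalue(x, y, 0.05))
```

The split with the smallest (adjusted) p-value is then chosen, which is what makes the variable selection less biased toward predictors with many distinct values.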
