
Regression Trees and Rule-Based Models (Part 3): Regression Model Trees

Study notes, for reference only; corrections are welcome.

Regression Trees and Rule-Based Models

Regression Model Trees

One limitation of simple regression trees is that each terminal node uses the average of the training set outcomes in that node for prediction. As a consequence, these models may not do a good job predicting samples whose true outcomes are extremely high or low.

One approach to dealing with this issue is to use a different estimator in the terminal nodes.

Here we focus on the model tree approach described in Quinlan (1992), called M5, which is similar to regression trees except that:

  • the splitting criterion is different;
  • the terminal nodes predict the outcome using a linear model (rather than the simple average);
  • the prediction for a new sample is usually a combination of the predictions from several models along the same path through the tree.

Like simple regression trees, the initial split is found using an exhaustive search over the predictors and training set samples, but, unlike those models, the expected reduction in the node's error rate is used as the optimization criterion. Let $S$ denote the entire set of data and let $S_1, \ldots, S_P$ represent the $P$ subsets of the data after splitting. The split criterion would be:

$$\text{reduction} = SD(S) - \sum_{i=1}^{P} \frac{n_i}{n} \times SD(S_i) \tag{1}$$

where $SD$ is the standard deviation and $n_i$ is the number of samples in partition $i$.

This metric determines if the total variation in the splits, weighted by sample size, is lower than in the presplit data.
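As a rough illustration, here is a minimal Python sketch of this split criterion; the helper name `sd_reduction` and the toy data are my own choices for illustration, not part of M5:

```python
import numpy as np

def sd_reduction(y, partitions):
    """Expected reduction in error for a candidate split, Eq. (1).

    y          : 1-D array of outcomes in the node before splitting (S).
    partitions : list of index arrays, one per subset S_1, ..., S_P.
    """
    n = len(y)
    weighted_sd = sum(len(idx) / n * np.std(y[idx]) for idx in partitions)
    return np.std(y) - weighted_sd  # larger reduction = better split

# Example: evaluate a split of a toy outcome vector at x <= 0.5
rng = np.random.default_rng(0)
x = rng.uniform(size=100)
y = np.where(x <= 0.5, 1.0, 5.0) + rng.normal(scale=0.1, size=100)
left, right = np.where(x <= 0.5)[0], np.where(x > 0.5)[0]
print(sd_reduction(y, [left, right]))
```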

The split that is associated with the largest reduction in error is chosen, and a linear model is created within the partitions using the split variable in the model.

For subsequent splitting iterations, this process is repeated:

an initial split is determined and a linear model is created for the partition using the current split variable and all others that preceded it.

The error associated with each linear model is used in place of $SD(S)$ in Eq. (1) to determine the expected reduction in the error rate for the next split.

The tree growing process continues along the branches of the tree until there are no further improvements in the error rate or there are not enough samples to continue the process. Once the tree is fully grown, there is a linear model for every node in the tree.

Once the complete set of linear models has been created, each undergoes a simplification procedure to potentially drop some of the terms. For a given model, an adjusted error rate is computed. First, the absolute differences between the observed and predicted data are calculated, then multiplied by a term that penalizes models with large numbers of parameters:

$$\text{Adjusted Error Rate} = \frac{n^* + p}{n^* - p} \sum_{i=1}^{n^*} |y_i - \hat{y}_i| \tag{2}$$

where $n^*$ is the number of training set data points that were used to build the model and $p$ is the number of parameters.
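A direct transcription of Eq. (2) into Python might look as follows; the function name `adjusted_error_rate` is a hypothetical helper, not from the original source:

```python
import numpy as np

def adjusted_error_rate(y, y_hat, p):
    """Adjusted error rate of Eq. (2) for one linear model.

    y, y_hat : observed and predicted outcomes for the n* training
               points used to build the model.
    p        : number of parameters in the model.
    """
    n_star = len(y)
    penalty = (n_star + p) / (n_star - p)  # penalizes larger models
    return penalty * np.sum(np.abs(y - y_hat))
```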

Each model term is dropped in turn and the adjusted error rate is recomputed; if removing a term does not increase the adjusted error rate, that term is removed from the model. In some cases, the linear model may be simplified to having only an intercept. This procedure is applied independently to each linear model.
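Reusing `adjusted_error_rate` from the sketch above, the simplification step can be approximated with a greedy backward elimination; fitting each candidate model by ordinary least squares is my assumption about implementation details not stated here:

```python
import numpy as np

def simplify_model(X, y):
    """Greedily drop columns of X while the adjusted error rate does
    not increase. Assumes X contains an intercept column and that
    adjusted_error_rate (defined above) is in scope. Returns the
    indices of the retained columns.
    """
    keep = list(range(X.shape[1]))

    def fit_and_score(cols):
        beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        return adjusted_error_rate(y, X[:, cols] @ beta, len(cols))

    best = fit_and_score(keep)
    improved = True
    while improved and len(keep) > 1:
        improved = False
        for col in list(keep):
            trial = [c for c in keep if c != col]
            score = fit_and_score(trial)
            if score <= best:  # dropping this term does not hurt
                keep, best, improved = trial, score, True
                break
    return keep
```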

Model trees also incorporate a type of smoothing to decrease the potential for over-fitting. The technique is based on the "recursive shrinking" methodology of Hastie and Pregibon (1990).

When predicting, the new sample goes down the appropriate path of the tree and, moving from the bottom up, the linear models along that path are combined.

The predictions of a child node and its parent node are combined as follows to obtain the combined prediction for the parent node:

$$\hat{y}_{(p)}' = \frac{n_{(k)}\,\hat{y}_{(k)} + c\,\hat{y}_{(p)}}{n_{(k)} + c}$$

where $\hat{y}_{(k)}$ is the prediction from the child node, $n_{(k)}$ is the number of training set samples in the child node, $\hat{y}_{(p)}$ is the prediction from the parent node, and $c$ is a constant with a default value of 15.
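The bottom-up combination can be written directly from this formula; a minimal sketch, with a hypothetical path representation (predictions and sample counts listed from leaf to root):

```python
def smoothed_prediction(path_predictions, path_n, c=15.0):
    """Combine linear-model predictions along a root-to-leaf path.

    path_predictions : predictions of the nodes, from leaf to root.
    path_n           : training sample counts of the same nodes.
    c                : smoothing constant, default 15.
    """
    y_hat = path_predictions[0]  # start at the leaf (child) node
    for n_k, y_parent in zip(path_n[:-1], path_predictions[1:]):
        # child prediction y_hat (built on n_k samples) is shrunk
        # toward the parent node's prediction y_parent
        y_hat = (n_k * y_hat + c * y_parent) / (n_k + c)
    return y_hat

# Example: leaf predicts 4.0 (20 samples), its parent 3.0 (50 samples),
# and the root 2.5 (120 samples)
print(smoothed_prediction([4.0, 3.0, 2.5], [20, 50, 120]))
```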
