
Regression Trees and Rule-Based Models (Part 3): Regression Model Trees

Study notes, for reference only; corrections are welcome.

Regression Trees and Rule-Based Models

Regression Model Trees

One limitation of simple regression trees is that each terminal node uses the average of the training set outcomes in that node for prediction. As a consequence, these models may not do a good job predicting samples whose true outcomes are extremely high or low.

One approach to dealing with this issue is to use a different estimator in the terminal nodes.

Here we focus on the model tree approach described in Quinlan (1992), called M5, which is similar to regression trees except that:

  • the splitting criterion is different;
  • the terminal nodes predict the outcome using a linear model (rather than the simple average);
  • the prediction for a new sample is usually a combination of the predictions from several models along the same path through the tree.

Like simple regression trees, the initial split is found using an exhaustive search over the predictors and training set samples, but, unlike those models, the expected reduction in the node's error rate is used as the optimization criterion. Let $S$ denote the entire set of data and let $S_1, \ldots, S_P$ represent the $P$ subsets of the data after splitting. The split criterion would be:

$$\text{reduction} = SD(S) - \sum_{i=1}^{P} \frac{n_i}{n} \times SD(S_i) \tag{1}$$

where $SD$ is the standard deviation and $n_i$ is the number of samples in partition $i$.

This metric determines if the total variation in the splits, weighted by sample size, is lower than in the presplit data.
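As a rough illustration, here is a minimal Python sketch of this split criterion; the helper name `sd_reduction` and the toy data are my own choices for illustration, not part of M5:

```python
import numpy as np

def sd_reduction(y, partitions):
    """Expected reduction in error for a candidate split, Eq. (1).

    y          : 1-D array of outcomes in the node before splitting (S).
    partitions : list of index arrays, one per subset S_1, ..., S_P.
    """
    n = len(y)
    weighted_sd = sum(len(idx) / n * np.std(y[idx]) for idx in partitions)
    return np.std(y) - weighted_sd  # larger reduction = better split

# Example: evaluate a split of a toy outcome vector at x <= 0.5
rng = np.random.default_rng(0)
x = rng.uniform(size=100)
y = np.where(x <= 0.5, 1.0, 5.0) + rng.normal(scale=0.1, size=100)
left, right = np.where(x <= 0.5)[0], np.where(x > 0.5)[0]
print(sd_reduction(y, [left, right]))
```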

The split that is associated with the largest reduction in error is chosen, and a linear model is created within the partitions using the split variable in the model.

For subsequent splitting iterations, this process is repeated:

an initial split is determined and a linear model is created for the partition using the current split variable and all others that preceded it.

The error associated with each linear model is used in place of $SD(S)$ in Eq. (1) to determine the expected reduction in the error rate for the next split.

The tree growing process continues along the branches of the tree until there are no further improvements in the error rate or there are not enough samples to continue the process. Once the tree is fully grown, there is a linear model for every node in the tree.

Once the complete set of linear models has been created, each undergoes a simplification procedure to potentially drop some of the terms. For a given model, an adjusted error rate is computed. First, the absolute differences between the observed and predicted data are calculated, then multiplied by a term that penalizes models with large numbers of parameters:

$$\text{Adjusted Error Rate} = \frac{n^* + p}{n^* - p} \sum_{i=1}^{n^*} |y_i - \hat{y}_i| \tag{2}$$

where $n^*$ is the number of training set data points that were used to build the model and $p$ is the number of parameters.
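A direct transcription of Eq. (2) into Python might look as follows; the function name `adjusted_error_rate` is a hypothetical helper, not from the original source:

```python
import numpy as np

def adjusted_error_rate(y, y_hat, p):
    """Adjusted error rate of Eq. (2) for one linear model.

    y, y_hat : observed and predicted outcomes for the n* training
               points used to build the model.
    p        : number of parameters in the model.
    """
    n_star = len(y)
    penalty = (n_star + p) / (n_star - p)  # penalizes larger models
    return penalty * np.sum(np.abs(y - y_hat))
```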

Each model term is dropped in turn and the adjusted error rate is recomputed; if removing a term does not increase the adjusted error rate, that term is removed from the model. In some cases, the linear model may be simplified to having only an intercept. This procedure is applied independently to each linear model.
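Reusing `adjusted_error_rate` from the sketch above, the simplification step can be approximated with a greedy backward elimination; fitting each candidate model by ordinary least squares is my assumption about implementation details not stated here:

```python
import numpy as np

def simplify_model(X, y):
    """Greedily drop columns of X while the adjusted error rate does
    not increase. Assumes X contains an intercept column and that
    adjusted_error_rate (defined above) is in scope. Returns the
    indices of the retained columns.
    """
    keep = list(range(X.shape[1]))

    def fit_and_score(cols):
        beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        return adjusted_error_rate(y, X[:, cols] @ beta, len(cols))

    best = fit_and_score(keep)
    improved = True
    while improved and len(keep) > 1:
        improved = False
        for col in list(keep):
            trial = [c for c in keep if c != col]
            score = fit_and_score(trial)
            if score <= best:  # dropping this term does not hurt
                keep, best, improved = trial, score, True
                break
    return keep
```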

Model trees also incorporate a type of smoothing to decrease the potential for over-fitting. The technique is based on the "recursive shrinking" methodology of Hastie and Pregibon (1990).

When predicting, the new sample goes down the appropriate path of the tree and, moving from the bottom up, the linear models along that path are combined.

The predictions of a child node and its parent node are combined as follows to obtain the combined prediction for the parent node:

$$\hat{y}_{(p)}' = \frac{n_{(k)}\,\hat{y}_{(k)} + c\,\hat{y}_{(p)}}{n_{(k)} + c}$$

where $\hat{y}_{(k)}$ is the prediction from the child node, $n_{(k)}$ is the number of training set samples in the child node, $\hat{y}_{(p)}$ is the prediction from the parent node, and $c$ is a constant with a default value of 15.
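The bottom-up combination can be written directly from this formula; a minimal sketch, with a hypothetical path representation (predictions and sample counts listed from leaf to root):

```python
def smoothed_prediction(path_predictions, path_n, c=15.0):
    """Combine linear-model predictions along a root-to-leaf path.

    path_predictions : predictions of the nodes, from leaf to root.
    path_n           : training sample counts of the same nodes.
    c                : smoothing constant, default 15.
    """
    y_hat = path_predictions[0]  # start at the leaf (child) node
    for n_k, y_parent in zip(path_n[:-1], path_predictions[1:]):
        # child prediction y_hat (built on n_k samples) is shrunk
        # toward the parent node's prediction y_parent
        y_hat = (n_k * y_hat + c * y_parent) / (n_k + c)
    return y_hat

# Example: leaf predicts 4.0 (20 samples), its parent 3.0 (50 samples),
# and the root 2.5 (120 samples)
print(smoothed_prediction([4.0, 3.0, 2.5], [20, 50, 120]))
```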
