AdaBoost can be derived from the forward stagewise algorithm; the relationship is stated as a theorem:
Theorem:
The AdaBoost algorithm is a special case of the forward stagewise additive algorithm, in which the model is an additive model composed of basic classifiers and the loss function is the exponential loss.
Proof:
The forward stagewise algorithm learns an additive model. When the basis functions are basic classifiers, this additive model is equivalent to the final classifier of AdaBoost:
$$f(x) = \sum_{m=1}^{M} \alpha_m G_m(x)$$
which is composed of the basic classifiers $G_m(x)$ and their coefficients $\alpha_m$, $m = 1, 2, \ldots, M$. The forward stagewise algorithm learns the basis functions one at a time, which matches the way the AdaBoost algorithm learns its basic classifiers one at a time. We now show that when the loss function of the forward stagewise algorithm is the exponential loss function (exponential loss function)
$$L(y, f(x)) = \exp[-y f(x)],$$
its learning steps are equivalent to those of the AdaBoost algorithm.
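As a quick numerical illustration (a minimal sketch; the arrays and the function name are invented, not from the source), the exponential loss of an additive model's scores on labels $y_i \in \{-1, +1\}$ can be computed directly:

```python
import numpy as np

def exponential_loss(y, f):
    """Total exponential loss: sum_i exp(-y_i * f(x_i)), for labels in {-1, +1}."""
    return np.sum(np.exp(-y * f))

y = np.array([1, -1, 1, 1])            # toy labels
f = np.array([0.8, -0.5, -0.2, 1.3])   # additive-model scores f(x_i)
print(exponential_loss(y, f))          # the misclassified third sample costs the most
```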
Suppose that after $m-1$ rounds of iteration the forward stagewise algorithm has already obtained $f_{m-1}(x)$:
$$\begin{aligned} f_{m-1}(x) &= f_{m-2}(x) + \alpha_{m-1} G_{m-1}(x) \\ &= \alpha_1 G_1(x) + \cdots + \alpha_{m-1} G_{m-1}(x) \end{aligned}$$
In the $m$-th round of iteration we obtain $\alpha_m$, $G_m(x)$, and $f_m(x)$:
$$f_m(x) = f_{m-1}(x) + \alpha_m G_m(x)$$
The goal is for the $\alpha_m$ and $G_m(x)$ obtained by the forward stagewise algorithm to minimize the exponential loss of $f_m(x)$ on the training set $T$, i.e.:
$$\begin{aligned} (\alpha_m, G_m(x)) &= \arg\min_{\alpha, G} \sum_{i=1}^{N} L(y_i, f_m(x_i)) \\ &= \arg\min_{\alpha, G} \sum_{i=1}^{N} \exp[-y_i (f_{m-1}(x_i) + \alpha G(x_i))] \end{aligned} \tag{1}$$
Note:
For regression problems, the loss function of the forward stagewise algorithm can instead be the squared loss, i.e.
$$L(y_i, f(x)) = (y_i - f(x))^2$$
so that:
$$\begin{aligned} L(y_i, f_{m-1}(x_i) + \alpha G(x_i)) &= (y_i - f_{m-1}(x_i) - \alpha G(x_i))^2 \\ &= (r_{mi} - \alpha G(x_i))^2 \end{aligned}$$
where $r_{mi} = y_i - f_{m-1}(x_i)$ is the residual of the current model. To obtain $\alpha G(x_i)$, we simply fit it to the residual of the current model.
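To make the residual-fitting view concrete, here is a minimal sketch (all arrays invented for illustration): with squared loss, the objective for the update term is, point by point, the squared error against the residuals:

```python
import numpy as np

# Toy regression data and a current model f_{m-1} (both invented for illustration).
y = np.array([3.0, -1.0, 2.5])
f_prev = np.array([2.0, 0.0, 3.0])

r = y - f_prev  # residuals r_{mi} of the current model

# Candidate update alpha * G(x_i): the squared loss of the updated model equals
# the squared error of fitting the residuals, term by term.
update = np.array([0.9, -0.8, -0.4])
lhs = np.sum((y - (f_prev + update)) ** 2)   # L(y_i, f_{m-1}(x_i) + alpha*G(x_i))
rhs = np.sum((r - update) ** 2)              # (r_{mi} - alpha*G(x_i))^2
assert np.isclose(lhs, rhs)
```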
AdaBoost, however, is a classifier, and for classification problems the squared loss is not well suited; this is why the exponential loss is introduced.
Equation (1) can be rewritten as:
$$(\alpha_m, G_m) = \arg\min_{\alpha, G} \sum_{i=1}^{N} \overline{w}_{mi} \exp(-\alpha y_i G(x_i)) \tag{2}$$
where $\overline{w}_{mi} = \exp(-y_i f_{m-1}(x_i))$. Since $\overline{w}_{mi}$ depends on neither $\alpha$ nor $G$, it is irrelevant to the minimization. It does depend on $f_{m-1}(x)$, however, and therefore changes with every round of iteration.
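This rewriting can be sanity-checked numerically (a sketch with invented arrays): the objective of Equation (1) equals the weighted objective of Equation (2):

```python
import numpy as np

y = np.array([1, -1, 1])
f_prev = np.array([0.4, -0.2, -0.6])   # f_{m-1}(x_i), invented values
g = np.array([1, -1, 1])               # a candidate G(x_i)
alpha = 0.7

w_bar = np.exp(-y * f_prev)            # w_bar_{mi}, fixed during the minimization

eq1 = np.sum(np.exp(-y * (f_prev + alpha * g)))   # objective of Eq. (1)
eq2 = np.sum(w_bar * np.exp(-alpha * y * g))      # objective of Eq. (2)
assert np.isclose(eq1, eq2)
```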
We now show that the $\alpha_m^*$ and $G_m^*(x)$ that minimize Equation (2) are exactly the $\alpha_m$ and $G_m(x)$ obtained by the AdaBoost algorithm. Solving Equation (2) proceeds in two steps.
Step 1: find $G_m^*(x)$.
Since $y_i \in \{-1, +1\}$, each $y_i$ is either equal to $G(x_i)$ or not, so we split the sum into two parts. (The $\arg\min$ is omitted for the moment.)
$$\begin{aligned} \sum_{i=1}^{N} \overline{w}_{mi} \exp(-\alpha y_i G(x_i)) &= e^{-\alpha} \sum_{y_i = G(x_i)} \overline{w}_{mi} + e^{\alpha} \sum_{y_i \neq G(x_i)} \overline{w}_{mi} \\ &= e^{-\alpha} \sum_{y_i = G(x_i)} \overline{w}_{mi} + e^{\alpha} \sum_{y_i \neq G(x_i)} \overline{w}_{mi} + e^{-\alpha} \sum_{y_i \neq G(x_i)} \overline{w}_{mi} - e^{-\alpha} \sum_{y_i \neq G(x_i)} \overline{w}_{mi} \end{aligned}$$
Merging the terms above (the first and third sums combine into a sum over all $i$, while the second and fourth combine into the factor $e^{\alpha} - e^{-\alpha}$) gives:
$$(e^{\alpha} - e^{-\alpha}) \sum_{i=1}^{N} \overline{w}_{mi} I(y_i \neq G(x_i)) + e^{-\alpha} \sum_{i=1}^{N} \overline{w}_{mi} \tag{3}$$
For the $m$-th step of the iteration, regard $\alpha$ as a constant; then the second term above and the factor $(e^{\alpha} - e^{-\alpha})$, which is positive for any $\alpha > 0$, can both be treated as constants, so minimizing the loss only requires minimizing $\sum_{i=1}^{N} \overline{w}_{mi} I(y_i \neq G(x_i))$. Hence:
$$G_m^*(x) = \arg\min_{G} \sum_{i=1}^{N} \overline{w}_{mi} I(y_i \neq G(x_i))$$
This classifier $G_m^*(x)$ is exactly the basic classifier $G_m(x)$ of the AdaBoost algorithm: it is the basic classifier that minimizes the weighted classification error rate on the training data in the $m$-th round.
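A minimal sketch of this step (the decision stumps and data are invented for illustration): among a set of candidate base classifiers, pick the one with the smallest weighted error:

```python
import numpy as np

# Toy data, current (unnormalized) weights, and a few candidate stumps.
X = np.array([0.5, 1.5, 2.5, 3.5])
y = np.array([1, 1, -1, -1])
w = np.array([0.4, 0.1, 0.3, 0.2])  # plays the role of w_bar_{mi}

def stump(threshold):
    """Base classifier G(x): +1 below the threshold, -1 above it."""
    return lambda x: np.where(x < threshold, 1, -1)

candidates = [stump(t) for t in (1.0, 2.0, 3.0)]

# G_m^* = argmin_G sum_i w_i * I(y_i != G(x_i))
errors = [np.sum(w * (G(X) != y)) for G in candidates]
G_m = candidates[int(np.argmin(errors))]  # here the threshold-2.0 stump wins
```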
Step 2: find $\alpha_m^*$.
Now suppose $G_m$ is known and return to Equation (3). The only variable left is $\alpha$; to minimize the loss, first take the partial derivative with respect to $\alpha$:
$$\frac{\partial L}{\partial \alpha} = e^{\alpha} \sum_{i=1}^{N} \overline{w}_{mi} I(y_i \neq G(x_i)) + e^{-\alpha} \sum_{i=1}^{N} \overline{w}_{mi} I(y_i \neq G(x_i)) - e^{-\alpha} \sum_{i=1}^{N} \overline{w}_{mi}$$
Setting $\frac{\partial L}{\partial \alpha} = 0$ gives:
$$e^{\alpha} \sum_{i=1}^{N} \overline{w}_{mi} I(y_i \neq G(x_i)) = \left[ \sum_{i=1}^{N} \overline{w}_{mi} - \sum_{i=1}^{N} \overline{w}_{mi} I(y_i \neq G(x_i)) \right] e^{-\alpha}$$
Taking the logarithm of both sides yields:
$$\log \sum_{i=1}^{N} \overline{w}_{mi} I(y_i \neq G(x_i)) + \log e^{\alpha} = \log \left[ \sum_{i=1}^{N} \overline{w}_{mi} - \sum_{i=1}^{N} \overline{w}_{mi} I(y_i \neq G(x_i)) \right] + \log e^{-\alpha}$$
Since $\log e^{-\alpha} = -\log e^{\alpha}$, we have:
$$\log e^{\alpha} = \frac{1}{2} \log \frac{\sum_{i=1}^{N} \overline{w}_{mi} - \sum_{i=1}^{N} \overline{w}_{mi} I(y_i \neq G(x_i))}{\sum_{i=1}^{N} \overline{w}_{mi} I(y_i \neq G(x_i))}$$
which, since $\log e^{\alpha} = \alpha$, solves to:
$$\alpha_m = \frac{1}{2} \log \frac{\sum_{i=1}^{N} \overline{w}_{mi} - \sum_{i=1}^{N} \overline{w}_{mi} I(y_i \neq G(x_i))}{\sum_{i=1}^{N} \overline{w}_{mi} I(y_i \neq G(x_i))}$$
Furthermore, since the weighted error rate is:
$$e_m = \frac{\sum_{i=1}^{N} \overline{w}_{mi} I(y_i \neq G(x_i))}{\sum_{i=1}^{N} \overline{w}_{mi}}$$
$\alpha_m$ can be written as
$$\alpha_m = \frac{1}{2} \log \frac{1 - e_m}{e_m}$$
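In code (a minimal sketch; the function name is invented), this coefficient behaves as expected:

```python
import numpy as np

def alpha_from_error(e_m):
    """alpha_m = 1/2 * log((1 - e_m) / e_m) for a weighted error rate e_m in (0, 1)."""
    return 0.5 * np.log((1.0 - e_m) / e_m)

print(alpha_from_error(0.3))   # ~0.424: a better-than-chance classifier gets positive weight
print(alpha_from_error(0.5))   # 0.0: a coin-flip classifier contributes nothing
```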
Having obtained $G_m(x)$ and $\alpha_m$, we can write the update rule for $f(x)$:
$$f_m(x) = f_{m-1}(x) + \alpha_m G_m(x)$$
From $\overline{w}_{mi} = \exp(-y_i f_{m-1}(x_i))$ we obtain the update rule for $w$:
$$\begin{aligned} \overline{w}_{m+1,i} &= \exp(-y_i f_m(x_i)) \\ &= \exp(-y_i (f_{m-1}(x_i) + \alpha_m G_m(x_i))) \\ &= \overline{w}_{m,i} \exp(-\alpha_m y_i G_m(x_i)) \end{aligned}$$
This differs from the sample-weight update of the AdaBoost algorithm only by a normalization factor, and is therefore equivalent to it.
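A sketch of this update in code (the function name is invented): dividing by the normalization factor rescales every weight by the same constant, so it changes neither the weighted error rate $e_m$ nor the $\arg\min$ over $G$:

```python
import numpy as np

def update_weights(w, alpha, y, pred):
    """w_{m+1,i} = w_{m,i} * exp(-alpha_m * y_i * G_m(x_i)), then normalized."""
    w_new = w * np.exp(-alpha * y * pred)
    return w_new / np.sum(w_new)   # dividing by Z_m rescales all weights equally
```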
This completes the derivation: when the loss function of the forward stagewise algorithm is the exponential loss, the forward stagewise algorithm is exactly AdaBoost.
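Putting the pieces together, here is a minimal end-to-end sketch of the forward stagewise view of AdaBoost with decision-stump base classifiers (all names, data, and the stump family are invented for illustration; a real implementation would use a proper base learner):

```python
import numpy as np

def fit_adaboost(X, y, M=10):
    """Forward stagewise AdaBoost with 1-D threshold stumps as base classifiers."""
    N = len(X)
    w = np.full(N, 1.0 / N)               # initial weights w_{1,i}
    thresholds = np.unique(X) + 0.5       # candidate stump thresholds (illustrative)
    model = []                            # list of (alpha_m, threshold, sign)

    for m in range(M):
        # Step 1: G_m^* = argmin_G of the weighted classification error.
        best = None
        for t in thresholds:
            for s in (1, -1):             # orientation of the stump
                pred = s * np.where(X < t, 1, -1)
                err = np.sum(w * (pred != y))
                if best is None or err < best[0]:
                    best = (err, t, s, pred)
        e_m, t, s, pred = best
        if e_m >= 0.5:                    # no better-than-chance stump left
            break
        e_m = max(e_m, 1e-12)             # guard log(0) when the stump is perfect

        # Step 2: alpha_m = 1/2 * log((1 - e_m) / e_m).
        alpha = 0.5 * np.log((1 - e_m) / e_m)
        model.append((alpha, t, s))

        # Weight update: w_{m+1,i} proportional to w_{m,i} * exp(-alpha_m y_i G_m(x_i)).
        w = w * np.exp(-alpha * y * pred)
        w /= np.sum(w)

    return model

def predict(model, X):
    """Sign of f(x) = sum_m alpha_m * G_m(x)."""
    f = sum(a * s * np.where(X < t, 1, -1) for a, t, s in model)
    return np.sign(f)

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1, 1, -1, -1, 1, 1])
print(predict(fit_adaboost(X, y, M=5), X))   # recovers the labels on this toy set
```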
References:
李航, 《统计学习方法》 (Li Hang, *Statistical Learning Methods*)
從前向分步算法推導出AdaBoost (a blog post on deriving AdaBoost from the forward stagewise algorithm)