
Statistical Learning Methods (7): Deriving AdaBoost from the Forward Stagewise Algorithm in Detail

AdaBoost can be derived from the forward stagewise algorithm. This relationship is stated as a theorem:

Theorem:

The AdaBoost algorithm is a special case of the forward stagewise additive algorithm, in which the model is an additive model composed of base classifiers and the loss function is the exponential loss function.

Proof:

The forward stagewise algorithm learns an additive model. When the basis functions are base classifiers, this additive model is exactly AdaBoost's final classifier:

$$f(x) = \sum_{m=1}^{M} \alpha_m G_m(x)$$

It is composed of the base classifiers $G_m(x)$ and their coefficients $\alpha_m$, $m = 1, 2, \ldots, M$. The forward stagewise algorithm learns these basis functions one at a time, which matches how the AdaBoost algorithm learns its base classifiers one at a time. We now show that when the loss function of the forward stagewise algorithm is the exponential loss function

$$L(y, f(x)) = \exp[-y f(x)],$$

its learning steps are equivalent to those of the AdaBoost algorithm.
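To make the loss concrete, here is a minimal Python check (the margin values below are made up for illustration): the exponential loss decays for correctly classified points, where $y f(x) > 0$, and grows exponentially for misclassified ones.

```python
import numpy as np

# Exponential loss L(y, f(x)) = exp(-y * f(x)) for labels y in {-1, +1}.
def exp_loss(y, fx):
    return np.exp(-y * fx)

y  = np.array([+1, +1, -1, -1])
fx = np.array([2.0, 0.5, -1.0, 0.3])  # sign(f) is the prediction, |f| the confidence

# Margins y*f(x): positive means correct; the loss is below 1 for those
# and exceeds 1 for misclassified points.
print(exp_loss(y, fx))  # approx [0.135, 0.607, 0.368, 1.350]
```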

Suppose that after $m-1$ rounds of iteration the forward stagewise algorithm has obtained $f_{m-1}(x)$:

$$
\begin{aligned}
f_{m-1}(x) &= f_{m-2}(x) + \alpha_{m-1} G_{m-1}(x) \\
&= \alpha_1 G_1(x) + \cdots + \alpha_{m-1} G_{m-1}(x)
\end{aligned}
$$

In the $m$-th round of iteration we obtain $\alpha_m$, $G_m(x)$, and $f_m(x)$:

$$f_m(x) = f_{m-1}(x) + \alpha_m G_m(x)$$

The goal is to choose the $\alpha_m$ and $G_m(x)$ produced by the forward stagewise algorithm so that $f_m(x)$ minimizes the exponential loss on the training set $T$:

$$
\begin{aligned}
(\alpha_m, G_m(x)) &= \arg\min_{\alpha, G} \sum_{i=1}^{N} L\bigl(y_i, f_m(x_i)\bigr) \\
&= \arg\min_{\alpha, G} \sum_{i=1}^{N} \exp\bigl[-y_i \bigl(f_{m-1}(x_i) + \alpha G(x_i)\bigr)\bigr]
\end{aligned}
\tag{1}
$$
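Objective (1) is a greedy search over pairs $(\alpha, G)$, one round at a time. Below is a minimal sketch of that loop, assuming a hypothetical finite pool `candidates` of base classifiers and a grid search over $\alpha$ (both are illustrative simplifications; under the exponential loss, $\alpha$ actually has the closed-form solution derived later in this article):

```python
import numpy as np

def forward_stagewise(X, y, candidates, M, alphas=np.linspace(0.05, 2.0, 40)):
    """Greedy forward stagewise fitting under the exponential loss.

    candidates: a (hypothetical) finite pool of base classifiers; each G is a
    callable with G(X) -> array of {-1, +1} predictions.
    Each round picks (alpha, G) minimizing
        sum_i exp(-y_i * (f_{m-1}(x_i) + alpha * G(x_i))).
    """
    f = np.zeros(len(y))        # f_0(x) = 0
    model = []                  # learned (alpha_m, G_m) pairs
    for _ in range(M):
        best = None
        for G in candidates:
            g = G(X)
            for a in alphas:
                loss = np.sum(np.exp(-y * (f + a * g)))
                if best is None or loss < best[0]:
                    best = (loss, a, G, g)
        _, a, G, g = best
        f += a * g              # f_m = f_{m-1} + alpha_m * G_m
        model.append((a, G))
    return model, f
```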

Note:

For regression problems, the forward stagewise algorithm can use the squared loss:

$$L(y_i, f(x)) = (y_i - f(x))^2$$

Hence:

$$
\begin{aligned}
L\bigl(y_i, f_{m-1}(x_i) + \alpha G(x_i)\bigr) &= \bigl(y_i - f_{m-1}(x_i) - \alpha G(x_i)\bigr)^2 \\
&= \bigl(r_{mi} - \alpha G(x_i)\bigr)^2
\end{aligned}
$$

where $r_{mi} = y_i - f_{m-1}(x_i)$ is the residual of the current model. Minimizing the squared loss therefore amounts to choosing $\alpha G(x_i)$ to fit the current model's residuals.
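As an illustration of this regression case (a toy sketch with made-up data, where a threshold-0 stump stands in for a real base learner), one round reduces to fitting the residuals:

```python
import numpy as np

# One forward-stagewise round under squared loss: the new term is fit to the
# residuals r_mi = y_i - f_{m-1}(x_i). Data and the threshold-0 stump are toys.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=100)
y = np.sin(X) + 0.1 * rng.normal(size=100)

f_prev = np.full_like(y, y.mean())   # f_{m-1}: the sample mean as a stand-in
r = y - f_prev                       # residuals of the current model

# Stand-in base learner: predict the residual mean on each side of x = 0.
g = np.where(X <= 0, r[X <= 0].mean(), r[X > 0].mean())

f_new = f_prev + g                   # alpha absorbed into g for simplicity
print(np.mean((y - f_prev) ** 2), ">", np.mean((y - f_new) ** 2))
```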


AdaBoost, however, is a classifier, and for classification the squared loss is not well suited; this is why the exponential loss is used instead.

Rewrite equation (1) as:

$$
(\alpha_m, G_m) = \arg\min_{\alpha, G} \sum_{i=1}^{N} \overline{w}_{mi} \exp\bigl(-\alpha y_i G(x_i)\bigr)
\tag{2}
$$

where $\overline{w}_{mi} = \exp(-y_i f_{m-1}(x_i))$. Since $\overline{w}_{mi}$ depends on neither $\alpha$ nor $G$, it is irrelevant to the minimization. It does depend on $f_{m-1}(x)$, however, and therefore changes from round to round.
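In code, the round-$m$ weights are just a vector computed once from $f_{m-1}$ before the search for $(\alpha, G)$ begins (toy values, purely illustrative):

```python
import numpy as np

# The round-m weights depend only on f_{m-1}, not on (alpha, G):
#     w_bar_mi = exp(-y_i * f_{m-1}(x_i)).
y      = np.array([+1, -1, +1, -1])
f_prev = np.array([0.8, -0.2, -0.5, 0.4])  # f_{m-1}(x_i) from earlier rounds

w_bar = np.exp(-y * f_prev)
print(w_bar)  # currently misclassified samples (y*f < 0) get the largest weights
```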

We now show that the $\alpha_m^*$ and $G_m^*(x)$ minimizing equation (2) are exactly the $\alpha_m$ and $G_m(x)$ obtained by the AdaBoost algorithm. Solving (2) proceeds in two steps.

Step 1: find $G_m^*(x)$.

Since $y_i \in \{-1, 1\}$ and $y_i$ either equals $G(x_i)$ or does not, we can split the sum into two parts (the $\arg\min$ is omitted for the moment):

$$
\begin{aligned}
\sum_{i=1}^{N} \overline{w}_{mi} \exp\bigl(-\alpha y_i G(x_i)\bigr)
&= e^{-\alpha} \sum_{y_i = G_m(x_i)} \overline{w}_{mi} + e^{\alpha} \sum_{y_i \neq G_m(x_i)} \overline{w}_{mi} \\
&= e^{-\alpha} \sum_{y_i = G_m(x_i)} \overline{w}_{mi} + e^{\alpha} \sum_{y_i \neq G_m(x_i)} \overline{w}_{mi} + e^{-\alpha} \sum_{y_i \neq G_m(x_i)} \overline{w}_{mi} - e^{-\alpha} \sum_{y_i \neq G_m(x_i)} \overline{w}_{mi}
\end{aligned}
$$

Merging terms gives:

$$
\bigl(e^{\alpha} - e^{-\alpha}\bigr) \sum_{i=1}^{N} \overline{w}_{mi} I\bigl(y_i \neq G_m(x_i)\bigr) + e^{-\alpha} \sum_{i=1}^{N} \overline{w}_{mi}
\tag{3}
$$

For the $m$-th iteration, fix $\alpha > 0$ as a constant. Then the second term and the factor $(e^{\alpha} - e^{-\alpha}) > 0$ are both constants with respect to $G$, so minimizing the loss only requires minimizing $\sum_{i=1}^{N} \overline{w}_{mi} I(y_i \neq G_m(x_i))$. Hence:
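A quick numeric check of the split identity (3), with randomly drawn toy values:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8
y     = rng.choice([-1, 1], size=N)
G_x   = rng.choice([-1, 1], size=N)      # G_m(x_i)
w_bar = rng.uniform(0.1, 2.0, size=N)
a     = 0.7                              # alpha

# Left-hand side: the original weighted exponential sum.
lhs = np.sum(w_bar * np.exp(-a * y * G_x))

# Right-hand side: form (3) after the merge.
mis = (y != G_x)
rhs = (np.exp(a) - np.exp(-a)) * w_bar[mis].sum() + np.exp(-a) * w_bar.sum()
print(np.isclose(lhs, rhs))  # True
```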

$$
G_m^*(x) = \arg\min_{G} \sum_{i=1}^{N} \overline{w}_{mi} I\bigl(y_i \neq G(x_i)\bigr)
$$

This classifier $G_m^*(x)$ is exactly the base classifier $G_m(x)$ of the AdaBoost algorithm: the base classifier with the smallest weighted classification error on the training data in round $m$.
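Minimizing the weighted error over a concrete hypothesis class can be done by exhaustive search. A minimal sketch for 1-D threshold stumps follows (the function name `best_stump` and the stump class itself are illustrative choices, not part of the original derivation):

```python
import numpy as np

def best_stump(X, y, w_bar):
    """Exhaustively pick the 1-D threshold stump minimizing the weighted error
    sum_i w_bar_i * I(y_i != G(x_i)). A minimal sketch, not an efficient search."""
    best = (np.inf, None, None)
    for thr in np.unique(X):
        for sign in (+1, -1):
            pred = sign * np.where(X <= thr, 1, -1)   # G(x_i) in {-1, +1}
            err = w_bar[pred != y].sum()              # weighted misclassification
            if err < best[0]:
                best = (err, thr, sign)
    return best  # (weighted error, threshold, polarity)
```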

Step 2: find $\alpha_m^*$.

Now assume $G_m$ is known and return to equation (3). The only remaining variable is $\alpha$. To minimize the loss, first take the partial derivative with respect to $\alpha$:

$$
\frac{\partial L}{\partial \alpha} = e^{\alpha} \sum_{i=1}^{N} \overline{w}_{mi} I\bigl(y_i \neq G(x_i)\bigr) + e^{-\alpha} \sum_{i=1}^{N} \overline{w}_{mi} I\bigl(y_i \neq G(x_i)\bigr) - e^{-\alpha} \sum_{i=1}^{N} \overline{w}_{mi}
$$

Setting $\frac{\partial L}{\partial \alpha} = 0$ gives:

$$
e^{\alpha} \sum_{i=1}^{N} \overline{w}_{mi} I\bigl(y_i \neq G(x_i)\bigr) = \Bigl[\sum_{i=1}^{N} \overline{w}_{mi} - \sum_{i=1}^{N} \overline{w}_{mi} I\bigl(y_i \neq G(x_i)\bigr)\Bigr] e^{-\alpha}
$$

Taking the logarithm of both sides:

$$
\log \sum_{i=1}^{N} \overline{w}_{mi} I\bigl(y_i \neq G(x_i)\bigr) + \log e^{\alpha} = \log \Bigl[\sum_{i=1}^{N} \overline{w}_{mi} - \sum_{i=1}^{N} \overline{w}_{mi} I\bigl(y_i \neq G(x_i)\bigr)\Bigr] + \log e^{-\alpha}
$$

Since $\log e^{-\alpha} = -\log e^{\alpha}$, we get:

$$
\log e^{\alpha} = \frac{1}{2} \log \frac{\sum_{i=1}^{N} \overline{w}_{mi} - \sum_{i=1}^{N} \overline{w}_{mi} I\bigl(y_i \neq G(x_i)\bigr)}{\sum_{i=1}^{N} \overline{w}_{mi} I\bigl(y_i \neq G(x_i)\bigr)}
$$

Solving for $\alpha_m$:

$$
\alpha_m = \frac{1}{2} \log \frac{\sum_{i=1}^{N} \overline{w}_{mi} - \sum_{i=1}^{N} \overline{w}_{mi} I\bigl(y_i \neq G(x_i)\bigr)}{\sum_{i=1}^{N} \overline{w}_{mi} I\bigl(y_i \neq G(x_i)\bigr)}
$$

With the weighted error rate defined as:

$$
e_m = \frac{\sum_{i=1}^{N} \overline{w}_{mi} I\bigl(y_i \neq G(x_i)\bigr)}{\sum_{i=1}^{N} \overline{w}_{mi}}
$$

$\alpha_m$ can be written as:

$$
\alpha_m = \frac{1}{2} \log \frac{1 - e_m}{e_m}
$$
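Computing $\alpha_m$ from $e_m$ is one line; a small sketch also confirms the familiar behavior that a better-than-random base classifier ($e_m < 1/2$) receives a positive coefficient:

```python
import numpy as np

def alpha_from_error(e_m):
    # alpha_m = (1/2) * log((1 - e_m) / e_m); log is the natural logarithm
    return 0.5 * np.log((1.0 - e_m) / e_m)

for e in (0.1, 0.3, 0.5):
    print(e, alpha_from_error(e))
# e_m < 0.5 yields alpha_m > 0, alpha_m grows as e_m shrinks,
# and e_m = 0.5 (random guessing) yields alpha_m = 0.
```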

Having obtained $G_m(x)$ and $\alpha_m$, we can write the update formula for $f(x)$:

$$f_m(x) = f_{m-1}(x) + \alpha_m G_m(x)$$

From $\overline{w}_{mi} = \exp(-y_i f_{m-1}(x_i))$, the update formula for $w$ follows:

$$
\begin{aligned}
\overline{w}_{m+1,i} &= \exp\bigl(-y_i f_m(x_i)\bigr) \\
&= \exp\bigl(-y_i (f_{m-1}(x_i) + \alpha_m G_m(x_i))\bigr) \\
&= \overline{w}_{m,i} \exp\bigl(-\alpha_m y_i G_m(x_i)\bigr)
\end{aligned}
$$

This differs from AdaBoost's sample-weight update only by the normalization factor, and is therefore equivalent to it.
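The equivalence up to normalization is easy to verify numerically (toy values; the point is only that normalizing before or after the exponential update yields the same distribution):

```python
import numpy as np

# Compare the unnormalized update w_bar_{m+1,i} = w_bar_{m,i}*exp(-a*y_i*G(x_i))
# with AdaBoost's normalized update.
y     = np.array([+1, -1, +1, -1, +1])
G_x   = np.array([+1, -1, -1, -1, +1])
w_bar = np.array([0.5, 1.2, 0.8, 0.3, 1.0])
alpha = 0.6

w_next = w_bar * np.exp(-alpha * y * G_x)        # unnormalized, as derived above

w_ada = w_bar / w_bar.sum()                      # AdaBoost: normalize first...
w_ada = w_ada * np.exp(-alpha * y * G_x)         # ...apply the same factor...
w_ada = w_ada / w_ada.sum()                      # ...then renormalize by Z_m

print(np.allclose(w_next / w_next.sum(), w_ada)) # True: they differ only by scale
```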

This completes the derivation: when the loss function of the forward stagewise algorithm is the exponential loss, the forward stagewise algorithm is AdaBoost.

References:

《統計學習方法》, 李航 (Li Hang, Statistical Learning Methods)

從前向分步算法推導出AdaBoost ("Deriving AdaBoost from the forward stagewise algorithm")
