
Notes on 《統計學習方法》 (Statistical Learning Methods): Deriving the EM Formulas, the EM Algorithm, and the GMM

Contents

  • The EM Algorithm
    • Problem Statement
    • Deriving the Algorithm
  • The GMM
    • GMM Model Description
    • Estimating a GMM with EM

The EM Algorithm

Problem Statement

The EM algorithm estimates a generative model from data; that is, the goal is to recover the probability model that the data follow.

Suppose the observations are $Y=\{y_1,y_2,\dots,y_N\}$, the latent variable is $Z$, and the parameter to be estimated is $\theta$.

Following the maximum-likelihood principle, the natural idea is to find the parameter $\theta$ that maximizes the probability of the observed data:

$$\arg \max_\theta P(Y|\theta) \tag{1}$$

Assuming, as usual, that the samples are independent and identically distributed, we have

$$P(Y|\theta)=\prod_{j=1}^N P(y_j|\theta) \tag{2}$$

For computational convenience, (1) is usually rewritten in logarithmic form, which turns the product in (2) into a sum:

$$\arg \max_\theta \log P(Y|\theta) \tag{3}$$

Because of the latent variable, this maximization cannot be solved in closed form, so the EM algorithm attacks it iteratively.

Deriving the Algorithm

First define $L(\theta)$:

$$L(\theta) \triangleq \log P(Y|\theta) \tag{4}$$

During the iteration we want the sequence of estimates $\theta^{(1)},\theta^{(2)},\cdots,\theta^{(i)},\cdots$ to make $L(\theta)$ increase monotonically. To that end, consider the difference $L(\theta)-L(\theta^{(i)})$:

$$\begin{aligned} L(\theta)-L(\theta^{(i)}) & = \log P(Y|\theta)-\log P(Y|\theta^{(i)})\\ & =\log\Big(\sum_Z P(Y|Z,\theta)P(Z|\theta)\Big)-\log P(Y|\theta^{(i)})\\ &=\log\Big(\sum_Z P(Z|Y,\theta^{(i)})\frac{P(Y|Z,\theta)P(Z|\theta)}{P(Z|Y,\theta^{(i)})}\Big)-\log P(Y|\theta^{(i)})\\ &=\log\Big(E_{P(Z|Y,\theta^{(i)})}\Big[\frac{P(Y|Z,\theta)P(Z|\theta)}{P(Z|Y,\theta^{(i)})}\Big]\Big)-\log P(Y|\theta^{(i)})\\ &\geqslant E_{P(Z|Y,\theta^{(i)})}\Big[\log\frac{P(Y|Z,\theta)P(Z|\theta)}{P(Z|Y,\theta^{(i)})}\Big]-E_{P(Z|Y,\theta^{(i)})}\big[\log P(Y|\theta^{(i)})\big]\\ &=E_{P(Z|Y,\theta^{(i)})}\Big[\log\frac{P(Y|Z,\theta)P(Z|\theta)}{P(Z|Y,\theta^{(i)})P(Y|\theta^{(i)})}\Big] \end{aligned}$$
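Two facts used above are left implicit in the derivation, so they are stated explicitly here: the second equality marginalizes over $Z$ by the law of total probability, and the inequality is Jensen's inequality for the concave logarithm,

$$\log E_P[X] \geqslant E_P[\log X].$$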

Now define

$$B(\theta,\theta^{(i)}) \triangleq E_{P(Z|Y,\theta^{(i)})}\Big[\log\frac{P(Y|Z,\theta)P(Z|\theta)}{P(Z|Y,\theta^{(i)})P(Y|\theta^{(i)})}\Big]+L(\theta^{(i)}) \tag{5}$$

Then

$$L(\theta)\geqslant B(\theta,\theta^{(i)})$$

The tedious computation above serves one purpose: to show that $B(\theta,\theta^{(i)})$ is a lower bound of $L(\theta)$, with equality if and only if $\theta = \theta^{(i)}$.

It follows that whenever some $\theta'$ satisfies $B(\theta',\theta^{(i)})>B(\theta^{(i)},\theta^{(i)})$, we have $L(\theta') \geqslant B(\theta',\theta^{(i)})>B(\theta^{(i)},\theta^{(i)})=L(\theta^{(i)})$.

In other words, any $\theta$ that increases $B(\theta,\theta^{(i)})$ is guaranteed to satisfy $L(\theta)>L(\theta^{(i)})$, even though this $\theta$ need not maximize $L(\theta)$ itself.

This is the basic idea of the EM iteration. Concretely, given the current estimate $\theta^{(i)}$, compute $\theta^{(i+1)}$ as

$$\theta^{(i+1)} = \arg \max_\theta B(\theta,\theta^{(i)})$$

Dropping the terms that are constant with respect to the maximization over $\theta$:

$$\begin{aligned} \theta^{(i+1)} &= \arg \max_\theta E_{P(Z|Y,\theta^{(i)})}\Big[\log\frac{P(Y|Z,\theta)P(Z|\theta)}{P(Z|Y,\theta^{(i)})P(Y|\theta^{(i)})}\Big]\\ &=\arg \max_\theta E_{P(Z|Y,\theta^{(i)})}\big[\log P(Y|Z,\theta)P(Z|\theta)\big]\\ &=\arg \max_\theta E_{P(Z|Y,\theta^{(i)})}\big[\log P(Y,Z|\theta)\big] \end{aligned}$$

The quantity $E_{P(Z|Y,\theta^{(i)})}[\log P(Y,Z|\theta)]$ is exactly the Q function:

$$Q(\theta,\theta^{(i)}) \triangleq E_{P(Z|Y,\theta^{(i)})}\big[\log P(Y,Z|\theta)\big]$$

At this point the algorithm's name makes sense: the E step computes the Q function (an expectation), and the M step maximizes it.

The derivation can therefore be read as follows: first find a lower bound $B(\theta,\theta^{(i)})$ of the likelihood $L(\theta)$; then $\theta^{(i+1)} = \arg \max_\theta B(\theta,\theta^{(i)})$ guarantees $L(\theta^{(i+1)}) \geqslant L(\theta^{(i)})$. Repeating this process pushes the likelihood ever closer to a maximum.

Since the M step never needs to work with $B$ directly, in practice the algorithm is a loop that repeatedly builds and maximizes the Q function.
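As a minimal sketch of that loop (my own illustration, not from the book): the two model-specific callbacks `e_step` and `m_step` are hypothetical placeholders that a concrete model must supply; the GMM section below gives one instantiation.

```python
import numpy as np

def em(y, theta, e_step, m_step, n_iter=100, tol=1e-8):
    """Generic EM loop: alternate E and M steps until theta stabilizes.

    e_step(y, theta)     -> posterior stats of P(Z | Y, theta^(i)) defining Q
    m_step(y, posterior) -> argmax_theta Q(theta, theta^(i))
    theta is assumed to be a flat array of parameters.
    """
    theta = np.asarray(theta, dtype=float)
    for _ in range(n_iter):
        posterior = e_step(y, theta)                   # E step
        theta_new = np.asarray(m_step(y, posterior))   # M step
        if np.max(np.abs(theta_new - theta)) < tol:    # parameters converged
            return theta_new
        theta = theta_new
    return theta
```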

The GMM

GMM Model Description

A GMM (Gaussian mixture model) is a mixture of Gaussian distributions with the probability model

$$P(y|\theta)= \sum_{k=1}^K\alpha_k\phi(y|\theta_k)$$

where the $\alpha_k$ are mixing coefficients satisfying $\alpha_k \geqslant 0$ and $\sum_{k=1}^K\alpha_k=1$, and $\phi(y|\theta_k)$ is the Gaussian probability density with parameters $\theta_k=(\mu_k,\sigma_k^2)$.

A GMM can be read generatively: first select the $k$-th Gaussian component $Y\sim N(\mu_k,\sigma_k^2)$ with probability $\alpha_k$, then emit an observation drawn from that component's density.
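A short sketch of this generative process (the parameter values below are hypothetical, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gmm(n, alpha, mu, sigma):
    """Draw n samples from a 1-D GMM: pick component k with probability
    alpha[k], then draw from N(mu[k], sigma[k]^2)."""
    k = rng.choice(len(alpha), size=n, p=alpha)  # latent component labels
    return rng.normal(mu[k], sigma[k])           # observations y_1, ..., y_n

# Hypothetical two-component mixture:
y = sample_gmm(1000,
               alpha=np.array([0.3, 0.7]),
               mu=np.array([-2.0, 3.0]),
               sigma=np.array([0.5, 1.0]))
```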

Estimating a GMM with EM

To ease the notation, define latent variables $\gamma_{jk}$:

$$\gamma_{jk} =\begin{cases}1 & \text{observation } j \text{ comes from component } k \\ 0 & \text{otherwise}\end{cases}$$

With the $\gamma_{jk}$ in hand, the complete-data likelihood of the observed sequence can be written down:

$$\begin{aligned} P(Y,\gamma|\theta) &=\prod_{j=1}^N P(y_j,\gamma_{j1},\gamma_{j2},\cdots,\gamma_{jK}|\theta)\\ &=\prod_{j=1}^N\prod_{k=1}^K\big[\alpha_k\phi(y_j|\theta_k)\big]^{\gamma_{jk}}\\ &=\prod_{k=1}^K\alpha_k^{n_k}\prod_{j=1}^N\big[\phi(y_j|\theta_k)\big]^{\gamma_{jk}}\\ &=\prod_{k=1}^K\alpha_k^{n_k}\prod_{j=1}^N\Big[\frac{1}{\sqrt{2\pi}\,\sigma_k}\exp\Big(-\frac{(y_j-\mu_k)^2}{2\sigma_k^2}\Big)\Big]^{\gamma_{jk}} \end{aligned}$$

Here $n_k=\sum_{j=1}^N\gamma_{jk}$ and $\sum_{k=1}^K n_k=N$; the first equality relies on the observations $y_j,\ j=1,2,\cdots,N$ being independent and identically distributed.

Note: $P(y_j,\gamma_{j1},\gamma_{j2},\cdots,\gamma_{jK}|\theta)=\prod_{k=1}^K[\alpha_k\phi(y_j|\theta_k)]^{\gamma_{jk}}$, since on the right-hand side every factor except the single one with $\gamma_{jk}=1$ evaluates to 1.

The next step is to solve this likelihood problem with the EM algorithm. Recalling the definition of the Q function above, we compute it as follows:

$$\begin{aligned} Q(\theta,\theta^{(i)}) &= E_{P(\gamma|Y,\theta^{(i)})}\big[\log P(Y,\gamma|\theta)\big]\\ &=E_{P(\gamma|Y,\theta^{(i)})}\log\Big\{\prod_{k=1}^K\alpha_k^{n_k}\prod_{j=1}^N\Big[\frac{1}{\sqrt{2\pi}\,\sigma_k}\exp\Big(-\frac{(y_j-\mu_k)^2}{2\sigma_k^2}\Big)\Big]^{\gamma_{jk}}\Big\}\\ &=E_{P(\gamma|Y,\theta^{(i)})}\Big\{\sum_{k=1}^K\Big( n_k\log \alpha_k+\sum_{j=1}^N\gamma_{jk}\Big[\log\frac{1}{\sqrt{2\pi}}-\log \sigma_k-\frac{(y_j-\mu_k)^2}{2\sigma_k^2}\Big]\Big)\Big\}\\ &=\sum_{k=1}^K\Big\{ E_{P(\gamma|Y,\theta^{(i)})}\Big[\sum_{j=1}^N\gamma_{jk}\Big]\log \alpha_k+\sum_{j=1}^N E_{P(\gamma|Y,\theta^{(i)})}[\gamma_{jk}]\Big[\log\frac{1}{\sqrt{2\pi}}-\log \sigma_k-\frac{(y_j-\mu_k)^2}{2\sigma_k^2}\Big]\Big\} \end{aligned}$$

Computing the Q function is nothing more than taking an expectation with respect to the random variable $\gamma$. Only two kinds of terms in the expression involve $\gamma$, and both enter through linear operations, so the expectation operator can be pushed directly inside.

Write $\hat{\gamma}_{jk}$ for $E_{P(\gamma|Y,\theta^{(i)})}[\gamma_{jk}]$. Since $\gamma_{jk}$ is a 0-1 (Bernoulli) variable, $\hat{\gamma}_{jk}$ can be interpreted as the probability that $y_j$ came from component $k$.

$$\begin{aligned} \hat{\gamma}_{jk} &=E_{P(\gamma|y_j,\theta^{(i)})}[\gamma_{jk}]=P(\gamma_{jk}=1|y_j,\theta^{(i)})\\ &=\frac{P(\gamma_{jk}=1,y_j|\theta^{(i)})}{P(y_j|\theta^{(i)})}\\ &=\frac{P(\gamma_{jk}=1,y_j|\theta^{(i)})}{\sum_{k'=1}^K P(y_j|\gamma_{jk'}=1,\theta^{(i)})P(\gamma_{jk'}=1|\theta^{(i)})}\\ &=\frac{\alpha_k\phi(y_j|\theta_k)}{\sum_{k'=1}^K\alpha_{k'}\phi(y_j|\theta_{k'})} \end{aligned}$$

where all parameters on the last line take their current values from $\theta^{(i)}$.
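This E step translates directly into code; a minimal sketch (function and variable names are mine, not the book's):

```python
import numpy as np

def gaussian_pdf(y, mu, sigma):
    """phi(y | theta_k): the N(mu, sigma^2) density."""
    return np.exp(-(y - mu) ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)

def e_step(y, alpha, mu, sigma):
    """Responsibilities gamma_hat[j, k] = P(y_j came from component k)."""
    # dens[j, k] = alpha_k * phi(y_j | theta_k), via broadcasting to shape (N, K)
    dens = alpha * gaussian_pdf(y[:, None], mu, sigma)
    return dens / dens.sum(axis=1, keepdims=True)  # normalize over k
```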

Looking at the Q function again, note that $\hat{\gamma}_{jk}$ does not depend on the parameters $\theta=(\alpha,\mu,\sigma)$ (it is computed from $\theta^{(i)}$). We can therefore take partial derivatives with respect to each of the three parameters and set them to zero (for $\alpha$, adding a Lagrange multiplier for the constraint $\sum_{k=1}^K\alpha_k=1$), which solves $\theta^{(i+1)} = \arg \max_\theta Q(\theta,\theta^{(i)})$ in closed form.
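Setting those derivatives to zero yields the standard closed-form updates $\mu_k=\frac{\sum_j\hat{\gamma}_{jk}y_j}{\sum_j\hat{\gamma}_{jk}}$, $\sigma_k^2=\frac{\sum_j\hat{\gamma}_{jk}(y_j-\mu_k)^2}{\sum_j\hat{\gamma}_{jk}}$, and $\alpha_k=\frac{n_k}{N}$. A minimal end-to-end sketch combining this M step with the `e_step` above (the initialization and iteration count are my choices, not prescribed by the book):

```python
import numpy as np

def m_step(y, gamma_hat):
    """Closed-form argmax of Q: update (alpha, mu, sigma) from responsibilities."""
    n_k = gamma_hat.sum(axis=0)                    # expected count per component
    alpha = n_k / len(y)
    mu = (gamma_hat * y[:, None]).sum(axis=0) / n_k
    var = (gamma_hat * (y[:, None] - mu) ** 2).sum(axis=0) / n_k
    return alpha, mu, np.sqrt(var)

def fit_gmm(y, K, n_iter=200):
    """Run EM for a 1-D GMM from a crude initialization."""
    alpha = np.full(K, 1.0 / K)
    mu = np.quantile(y, np.linspace(0.1, 0.9, K))  # spread means over the data
    sigma = np.full(K, y.std())
    for _ in range(n_iter):
        gamma_hat = e_step(y, alpha, mu, sigma)    # E step
        alpha, mu, sigma = m_step(y, gamma_hat)    # M step
    return alpha, mu, sigma
```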
