Table of Contents
- The EM Algorithm
  - Problem Statement
  - Deriving the Algorithm
- The GMM
  - Description of the GMM
  - Estimating a GMM with the EM Algorithm
The EM Algorithm
Problem Statement
The EM algorithm estimates a generative model from data; in other words, its goal is to solve for the probability model that the data follow.
Suppose the observed variables are $Y=\{y_1,y_2,\dots,y_N\}$, the hidden variable is $Z$, and the parameter to estimate is $\theta$.
Following the idea of maximum likelihood estimation, the natural approach is to look for the parameter $\theta$ that maximizes the probability of the observations:
$$\arg \max_\theta P(Y|\theta)\tag{1}$$
We usually assume the samples are independent and identically distributed, so
$$P(Y|\theta)=\prod_{j=1}^N P(y_j|\theta)\tag{2}$$
For computational convenience, equation $(1)$ is usually rewritten in the form
$$\arg \max_\theta \log P(Y|\theta)\tag{3}$$
Because of the hidden variable, this maximization cannot be solved directly, so the EM algorithm is used to solve it iteratively.
Deriving the Algorithm
First define $L(\theta)$:
$$L(\theta) \triangleq \log P(Y|\theta) \tag{4}$$
During the iteration we want the successive estimates $\theta^{(1)},\theta^{(2)},\cdots,\theta^{(i)},\cdots$ to make $L(\theta)$ monotonically increasing. To this end, consider the difference $L(\theta)-L(\theta^{(i)})$:
$$\begin{aligned}
L(\theta)-L(\theta^{(i)}) &= \log P(Y|\theta)-\log P(Y|\theta^{(i)})\\
&=\log\Big(\sum_Z P(Y|Z,\theta)P(Z|\theta)\Big)-\log P(Y|\theta^{(i)})\\
&=\log\Big(\sum_Z P(Z|Y,\theta^{(i)})\,\frac{P(Y|Z,\theta)P(Z|\theta)}{P(Z|Y,\theta^{(i)})}\Big)-\log P(Y|\theta^{(i)})\\
&=\log\Big(E_{P(Z|Y,\theta^{(i)})}\Big[\frac{P(Y|Z,\theta)P(Z|\theta)}{P(Z|Y,\theta^{(i)})}\Big]\Big)-\log P(Y|\theta^{(i)})\\
&\geqslant E_{P(Z|Y,\theta^{(i)})}\Big[\log\frac{P(Y|Z,\theta)P(Z|\theta)}{P(Z|Y,\theta^{(i)})}\Big]-E_{P(Z|Y,\theta^{(i)})}\big[\log P(Y|\theta^{(i)})\big]\\
&=E_{P(Z|Y,\theta^{(i)})}\Big[\log\frac{P(Y|Z,\theta)P(Z|\theta)}{P(Z|Y,\theta^{(i)})P(Y|\theta^{(i)})}\Big]
\end{aligned}$$
where the inequality is Jensen's inequality applied to the concave function $\log$, and the second line marginalizes $P(Y|\theta)$ over the hidden variable $Z$.
Let
$$B(\theta,\theta^{(i)}) \triangleq E_{P(Z|Y,\theta^{(i)})}\Big[\log\frac{P(Y|Z,\theta)P(Z|\theta)}{P(Z|Y,\theta^{(i)})P(Y|\theta^{(i)})}\Big]+L(\theta^{(i)})\tag{5}$$
Then
$$L(\theta)\geqslant B(\theta,\theta^{(i)})$$
The lengthy computation above simply shows that $B(\theta,\theta^{(i)})$ is a lower bound of $L(\theta)$, with equality if and only if $\theta=\theta^{(i)}$.
It is then easy to see that whenever there exists $\theta'$ such that $B(\theta',\theta^{(i)})>B(\theta^{(i)},\theta^{(i)})$, we have $L(\theta')\geqslant B(\theta',\theta^{(i)})>B(\theta^{(i)},\theta^{(i)})=L(\theta^{(i)})$.
In other words, any $\theta$ that increases $B(\theta,\theta^{(i)})$ is guaranteed to satisfy $L(\theta)>L(\theta^{(i)})$, even though this $\theta$ need not maximize $L(\theta)$ itself.
This is the basic idea behind the EM iteration. Concretely, given the current parameter value $\theta^{(i)}$, compute $\theta^{(i+1)}$ as
$$\theta^{(i+1)} = \arg \max_\theta B(\theta,\theta^{(i)})$$
Dropping the terms that are constant with respect to the $\theta$ being maximized,
$$\begin{aligned}
\theta^{(i+1)} &= \arg \max_\theta E_{P(Z|Y,\theta^{(i)})}\Big[\log\frac{P(Y|Z,\theta)P(Z|\theta)}{P(Z|Y,\theta^{(i)})P(Y|\theta^{(i)})}\Big]\\
&=\arg \max_\theta E_{P(Z|Y,\theta^{(i)})}\big[\log\big(P(Y|Z,\theta)P(Z|\theta)\big)\big]\\
&=\arg \max_\theta E_{P(Z|Y,\theta^{(i)})}\big[\log P(Y,Z|\theta)\big]
\end{aligned}$$
The quantity $E_{P(Z|Y,\theta^{(i)})}[\log P(Y,Z|\theta)]$ is exactly the Q-function:
$$Q(\theta,\theta^{(i)}) \triangleq E_{P(Z|Y,\theta^{(i)})}\big[\log P(Y,Z|\theta)\big]$$
At this point the name of the EM algorithm becomes clear: the E step computes the Q-function (an expectation), and the M step maximizes it.
The derivation above can therefore be read as follows: first find a lower bound $B(\theta,\theta^{(i)})$ of the likelihood $L(\theta)$; then $\theta^{(i+1)} = \arg\max_\theta B(\theta,\theta^{(i)})$ guarantees $L(\theta^{(i+1)})\geqslant L(\theta^{(i)})$. Repeating this process pushes the estimate ever closer to a maximum of the likelihood.
Since the M step never needs to operate on the $B$ function directly, the actual algorithm is an iteration that repeatedly maximizes the Q-function.
The GMM
Description of the GMM
A GMM (Gaussian mixture model) is a mixture of Gaussians with the probability model
$$P(y|\theta)= \sum_{k=1}^K\alpha_k\,\phi(y|\theta_k)$$
where the $\alpha_k$ are mixing coefficients with $\alpha_k \geqslant 0$ and $\sum_{k=1}^K\alpha_k=1$, $\phi(y|\theta_k)$ is the Gaussian probability density function, and $\theta_k=(\mu_k,\sigma_k^2)$.
A GMM can be understood generatively: first select the $k$-th Gaussian model $Y\sim N(\mu_k,\sigma_k^2)$ with probability $\alpha_k$, then draw a random observation from the density of that $k$-th Gaussian.
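This two-stage generative view can be sketched in a few lines of NumPy. The function name `sample_gmm` and the concrete parameter values below are illustrative choices, not part of the original text:

```python
import numpy as np

def sample_gmm(alphas, mus, sigmas, n, rng):
    """Draw n samples from a 1-D GMM: pick component k with probability
    alpha_k, then draw from N(mu_k, sigma_k^2)."""
    ks = rng.choice(len(alphas), size=n, p=alphas)            # component labels
    return rng.normal(np.asarray(mus)[ks], np.asarray(sigmas)[ks])

rng = np.random.default_rng(0)
# two components: alpha = (0.3, 0.7), mu = (-2, 3), sigma = (0.5, 1)
y = sample_gmm([0.3, 0.7], [-2.0, 3.0], [0.5, 1.0], 1000, rng)
```

With these parameters the mixture mean is $0.3\cdot(-2)+0.7\cdot 3 = 1.5$, which the sample mean of `y` should approximate.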
Estimating a GMM with the EM Algorithm
To ease the description, define hidden variables $\gamma_{jk}$:
$$\gamma_{jk} =\begin{cases}1 &\text{observation } j \text{ comes from component } k \\ 0 &\text{otherwise}\end{cases}$$
With the help of $\gamma_{jk}$ we can now write down the complete-data likelihood of the observation sequence:
$$\begin{aligned}
P(Y,\gamma|\theta) &=\prod_{j=1}^N P(y_j,\gamma_{j1},\gamma_{j2},\cdots,\gamma_{jK}|\theta)\\
&=\prod_{j=1}^N\prod_{k=1}^K\big[\alpha_k\,\phi(y_j|\theta_k)\big]^{\gamma_{jk}}\\
&=\prod_{k=1}^K\alpha_k^{n_k}\prod_{j=1}^N\big[\phi(y_j|\theta_k)\big]^{\gamma_{jk}}\\
&=\prod_{k=1}^K\alpha_k^{n_k}\prod_{j=1}^N\Big[\frac{1}{\sqrt{2\pi}\,\sigma_k}\exp\Big(-\frac{(y_j-\mu_k)^2}{2\sigma_k^2}\Big)\Big]^{\gamma_{jk}}
\end{aligned}$$
Here $n_k=\sum_{j=1}^N\gamma_{jk}$ and $\sum_{k=1}^K n_k=N$; the first equality holds because the observations $y_j,\ j=1,2,\cdots,N$ are independent and identically distributed.
Note: $P(y_j,\gamma_{j1},\gamma_{j2},\cdots,\gamma_{jK}|\theta)=\prod_{k=1}^K[\alpha_k\,\phi(y_j|\theta_k)]^{\gamma_{jk}}$; on the right-hand side, every factor except the one with $\gamma_{jk}=1$ clearly evaluates to 1.
Next we apply the EM algorithm to this likelihood problem. Recalling the definition of the Q-function above, here it is computed as
$$\begin{aligned}
Q(\theta,\theta^{(i)}) &= E_{P(\gamma|Y,\theta^{(i)})}\big[\log P(Y,\gamma|\theta)\big]\\
&=E_{P(\gamma|Y,\theta^{(i)})}\log\Big\{\prod_{k=1}^K\alpha_k^{n_k}\prod_{j=1}^N\Big[\frac{1}{\sqrt{2\pi}\,\sigma_k}\exp\Big(-\frac{(y_j-\mu_k)^2}{2\sigma_k^2}\Big)\Big]^{\gamma_{jk}}\Big\}\\
&=E_{P(\gamma|Y,\theta^{(i)})}\Big\{\sum_{k=1}^K\Big(n_k\log \alpha_k+\sum_{j=1}^N\gamma_{jk}\Big[\log\frac{1}{\sqrt{2\pi}}-\log \sigma_k-\frac{(y_j-\mu_k)^2}{2\sigma_k^2}\Big]\Big)\Big\}\\
&=\sum_{k=1}^K E_{P(\gamma|Y,\theta^{(i)})}\Big[\sum_{j=1}^N\gamma_{jk}\Big]\log \alpha_k+\sum_{k=1}^K\sum_{j=1}^N E_{P(\gamma|Y,\theta^{(i)})}[\gamma_{jk}]\Big[\log\frac{1}{\sqrt{2\pi}}-\log \sigma_k-\frac{(y_j-\mu_k)^2}{2\sigma_k^2}\Big]
\end{aligned}$$
Computing the Q-function is just a matter of taking an expectation. The expectation is with respect to the random variable $\gamma$, and only two terms in the expression depend on $\gamma$; since both enter through simple linear operations, the expectation operator can be moved directly inside.
Write $\hat{\gamma}_{jk}$ for $E_{P(\gamma|Y,\theta^{(i)})}[\gamma_{jk}]$. Since $\gamma_{jk}$ is a 0-1 (Bernoulli) variable, $\hat{\gamma}_{jk}$ can be interpreted as the probability that $y_j$ comes from component $k$.
$$\begin{aligned}
\hat{\gamma}_{jk} &=E_{P(\gamma|y_j,\theta^{(i)})}[\gamma_{jk}]=P(\gamma_{jk}=1|y_j,\theta^{(i)})\\
&=\frac{P(\gamma_{jk}=1,y_j|\theta^{(i)})}{P(y_j|\theta^{(i)})}\\
&=\frac{P(\gamma_{jk}=1,y_j|\theta^{(i)})}{\sum_{k=1}^K P(y_j|\gamma_{jk}=1,\theta^{(i)})P(\gamma_{jk}=1|\theta^{(i)})}\\
&=\frac{\alpha_k\,\phi(y_j|\theta_k)}{\sum_{k=1}^K\alpha_k\,\phi(y_j|\theta_k)}
\end{aligned}$$
where in the last line all parameters $\alpha_k,\theta_k$ are taken at the current estimate $\theta^{(i)}$.
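The closed form for $\hat{\gamma}_{jk}$ vectorizes directly. The sketch below (the function name `responsibilities` and the array shapes are my own choices) computes the full $N\times K$ matrix of $\hat{\gamma}_{jk}$, whose rows must each sum to 1:

```python
import numpy as np

def responsibilities(y, alphas, mus, sigmas):
    """gamma_hat[j, k] = alpha_k * phi(y_j|theta_k) / sum_k alpha_k * phi(y_j|theta_k)."""
    y = np.asarray(y, dtype=float)[:, None]                   # shape (N, 1)
    alphas, mus, sigmas = map(np.asarray, (alphas, mus, sigmas))
    # Gaussian density phi(y_j | mu_k, sigma_k^2), broadcast to shape (N, K)
    phi = np.exp(-(y - mus) ** 2 / (2 * sigmas ** 2)) / (np.sqrt(2 * np.pi) * sigmas)
    num = alphas * phi                                        # numerator, shape (N, K)
    return num / num.sum(axis=1, keepdims=True)               # normalize over k

g = responsibilities([0.0, 2.9], [0.3, 0.7], [-2.0, 3.0], [0.5, 1.0])
```

For example, the point $y=2.9$ lies very close to the second component's mean $\mu_2=3$, so its responsibility $\hat{\gamma}_{22}$ is nearly 1.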
Looking back at the Q-function, it is easy to see that $\hat{\gamma}_{jk}$ does not depend on the parameters $\theta=(\alpha,\mu,\sigma)$ (it is computed from $\theta^{(i)}$), so we can take the partial derivative with respect to each of the three parameters, set it to zero, and thereby carry out $\theta^{(i+1)} = \arg \max_\theta Q(\theta,\theta^{(i)})$.
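Setting those derivatives to zero yields the standard closed-form M-step updates (the article stops before stating them, but they follow from the Q-function above): with $n_k=\sum_j\hat{\gamma}_{jk}$,
$$\alpha_k=\frac{n_k}{N},\qquad \mu_k=\frac{\sum_{j}\hat{\gamma}_{jk}\,y_j}{n_k},\qquad \sigma_k^2=\frac{\sum_{j}\hat{\gamma}_{jk}(y_j-\mu_k)^2}{n_k}$$
A minimal end-to-end sketch of the resulting algorithm follows; the initialisation scheme, the variance floor, and the function name `em_gmm` are my own choices, not prescribed by the derivation:

```python
import numpy as np

def em_gmm(y, K, n_iter=50, seed=0):
    """EM for a 1-D K-component GMM. Returns the fitted parameters and the
    log-likelihood trace, which EM guarantees to be non-decreasing."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    N = len(y)
    mus = rng.choice(y, size=K, replace=False)       # init means from the data
    sigmas = np.full(K, y.std())
    alphas = np.full(K, 1.0 / K)
    ll_trace = []
    for _ in range(n_iter):
        # E step: responsibilities gamma_hat[j, k] under the current parameters
        phi = np.exp(-(y[:, None] - mus) ** 2 / (2 * sigmas ** 2)) \
              / (np.sqrt(2 * np.pi) * sigmas)
        num = alphas * phi
        ll_trace.append(np.log(num.sum(axis=1)).sum())   # current log-likelihood
        gamma = num / num.sum(axis=1, keepdims=True)
        # M step: closed-form updates from the zero-derivative conditions
        nk = gamma.sum(axis=0)
        mus = (gamma * y[:, None]).sum(axis=0) / nk
        sigmas = np.sqrt((gamma * (y[:, None] - mus) ** 2).sum(axis=0) / nk)
        sigmas = np.maximum(sigmas, 1e-6)                # guard against collapse
        alphas = nk / N
    return alphas, mus, sigmas, ll_trace
```

Running it on data drawn from two well-separated Gaussians, the log-likelihood trace should increase monotonically, illustrating the $L(\theta^{(i+1)})\geqslant L(\theta^{(i)})$ guarantee derived above.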