Notation:
$$
\begin{aligned}
X_i &= \sum_{j=1}^{N} X_{i,j} \\
P_{i,k} &= \frac{X_{i,k}}{X_i} \\
ratio_{i,j,k} &= \frac{P_{i,k}}{P_{j,k}}
\end{aligned}
$$
Value of $ratio_{i,j,k}$ | Words j, k related | Words j, k unrelated |
---|---|---|
Words i, k related | close to 1 | very large |
Words i, k unrelated | very small | close to 1 |
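The four cells of the table can be reproduced with a small sketch. The word list and counts below are hypothetical numbers chosen only for illustration, and `cooccurrence_prob` and `ratio` are helper names introduced here, not part of any library.

```python
# Hypothetical vocabulary and symmetric co-occurrence counts (illustrative only).
words = ["ice", "steam", "solid", "gas", "water", "fashion"]
X = [
    [0, 2, 20, 1, 30, 1],    # ice
    [2, 0, 1, 20, 30, 1],    # steam
    [20, 1, 0, 2, 5, 1],     # solid
    [1, 20, 2, 0, 5, 1],     # gas
    [30, 30, 5, 5, 0, 1],    # water
    [1, 1, 1, 1, 1, 0],      # fashion
]

def cooccurrence_prob(X):
    """P[i][k] = X[i][k] / X_i, where X_i is the sum of row i."""
    return [[x / sum(row) for x in row] for row in X]

def ratio(P, i, j, k):
    """ratio_{i,j,k} = P[i][k] / P[j][k]; assumes P[j][k] is nonzero."""
    return P[i][k] / P[j][k]

P = cooccurrence_prob(X)
i, j = words.index("ice"), words.index("steam")

for probe in ["solid", "gas", "water", "fashion"]:
    k = words.index(probe)
    print(f"ratio(ice, steam, {probe}) = {ratio(P, i, j, k):.2f}")
```

With these counts, "solid" (related to ice only) gives a large ratio, "gas" (related to steam only) gives a small one, and "water" and "fashion" (related to both or to neither) give a ratio close to 1.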
Derivation:
Suppose the word vectors have already been obtained. They should then agree closely with the co-occurrence matrix. Let $g(w_i, w_j, w_k)$ be the function that computes $ratio_{i,j,k}$ from the word vectors $w_i, w_j, w_k$. Then:
$$
\frac{P_{i,k}}{P_{j,k}} = ratio_{i,j,k} = g(w_i, w_j, w_k)
$$
The two sides of this equation should be as close as possible, which gives the cost function:
$$
J = \sum_{i,j,k}^{N} \left( \frac{P_{i,k}}{P_{j,k}} - g(w_i, w_j, w_k) \right)^2
$$
However, this model involves three words at once, so its complexity is on the order of $N \times N \times N$.
How to simplify:
- To capture the relationship between words i and j, g should probably contain $w_i - w_j$;
- $ratio_{i,j,k}$ is a scalar, so g should be a scalar as well, which suggests that g contains $(w_i - w_j)^T w_k$;
- Finally, wrap this in an exponential $\exp()$, giving $g(w_i, w_j, w_k) = \exp((w_i - w_j)^T w_k)$.
$$
\begin{aligned}
\frac{P_{i,k}}{P_{j,k}} &= g(w_i, w_j, w_k) \\
\frac{P_{i,k}}{P_{j,k}} &= \exp((w_i - w_j)^T w_k) \\
\frac{P_{i,k}}{P_{j,k}} &= \exp(w_i^T w_k - w_j^T w_k) \\
\frac{P_{i,k}}{P_{j,k}} &= \frac{\exp(w_i^T w_k)}{\exp(w_j^T w_k)}
\end{aligned}
$$
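As a quick numerical sanity check of this chain of equalities, the sketch below evaluates both ends for arbitrary vectors. The vector values are made up, and `dot` and `g` are helper names introduced only for this example.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def g(w_i, w_j, w_k):
    """g(w_i, w_j, w_k) = exp((w_i - w_j)^T w_k)."""
    diff = [a - b for a, b in zip(w_i, w_j)]
    return math.exp(dot(diff, w_k))

# Arbitrary illustrative vectors (not trained embeddings).
w_i = [0.5, -1.0, 0.25]
w_j = [0.1, 0.3, -0.2]
w_k = [1.0, 0.5, 2.0]

lhs = g(w_i, w_j, w_k)
rhs = math.exp(dot(w_i, w_k)) / math.exp(dot(w_j, w_k))

# The two values agree up to floating-point rounding.
print(math.isclose(lhs, rhs, rel_tol=1e-12))
```

The exponential is what turns a difference of inner products into a quotient, so the three-word function splits into two terms that each involve only two words.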
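Matching numerators and denominators, the equation holds if:

$$
P_{i,j} = \exp(w_i^T w_j)
$$

Taking the logarithm of both sides and using $P_{i,j} = X_{i,j} / X_i$:

$$
\log(X_{i,j}) - \log(X_i) = w_i^T w_j
$$

The term $\log(X_i)$ depends only on word i, so it can be absorbed into a bias $b_i$; a second bias $b_j$ is added so that the expression stays symmetric in i and j:

$$
\log(X_{i,j}) = w_i^T w_j + b_i + b_j
$$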
The loss function becomes:
$$
J = \sum_{i,j}^{N} \left( w_i^T w_j + b_i + b_j - \log(X_{i,j}) \right)^2
$$
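A minimal sketch of this two-word loss, assuming one vector and one bias per word as in the formula. Skipping zero counts is an assumption made here because $\log(0)$ is undefined; `unweighted_loss` is a hypothetical helper name.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def unweighted_loss(X, W, b):
    """Sum over pairs of (w_i^T w_j + b_i + b_j - log X_ij)^2."""
    total = 0.0
    for i, row in enumerate(X):
        for j, x_ij in enumerate(row):
            if x_ij > 0:  # assumption: pairs with zero count are skipped
                err = dot(W[i], W[j]) + b[i] + b[j] - math.log(x_ij)
                total += err * err
    return total

# Illustrative parameters (made up), and counts constructed so that
# log X_ij = w_i^T w_j + b_i + b_j holds exactly.
W = [[0.3, -0.2], [0.1, 0.4], [-0.5, 0.2]]
b = [0.5, 1.0, 0.2]
X = [[math.exp(dot(W[i], W[j]) + b[i] + b[j]) for j in range(3)]
     for i in range(3)]

loss_exact = unweighted_loss(X, W, b)                    # essentially zero
loss_perturbed = unweighted_loss(X, W, [0.6, 1.0, 0.2])  # strictly positive
print(loss_exact, loss_perturbed)
```

The double loop makes the saving visible: the cost is now quadratic in the vocabulary size rather than cubic.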
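Viewed as a matrix factorization method, this has a drawback: every word pair carries the same weight.
Following the principle that word pairs that co-occur more frequently should carry more weight, a weighting term is added to the loss function: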
$$
J = \sum_{i,j}^{N} f(X_{i,j}) \left( w_i^T w_j + b_i + b_j - \log(X_{i,j}) \right)^2
$$

$$
f(x) = \begin{cases} (x / x_{max})^{0.75}, & \text{if } x < x_{max} \\ 1, & \text{if } x \ge x_{max} \end{cases}
$$
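The weighting function and the weighted loss can be sketched as follows. The default `x_max=100.0` is an assumption for illustration (the text above does not fix its value), and `weight` and `weighted_loss` are hypothetical helper names.

```python
import math

def weight(x, x_max=100.0, alpha=0.75):
    """f(x): grows as (x / x_max)^alpha below x_max, capped at 1 above it.

    x_max=100.0 is an assumed default, not a value given in the text.
    """
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0

def weighted_loss(X, W, b, x_max=100.0):
    """Sum over pairs of f(X_ij) * (w_i^T w_j + b_i + b_j - log X_ij)^2."""
    total = 0.0
    for i, row in enumerate(X):
        for j, x_ij in enumerate(row):
            if x_ij > 0:  # f(0) = 0, so zero counts contribute nothing
                inner = sum(a * c for a, c in zip(W[i], W[j]))
                err = inner + b[i] + b[j] - math.log(x_ij)
                total += weight(x_ij, x_max) * err * err
    return total

# Tiny illustrative case: two words, zero vectors and zero biases.
X = [[0, 4], [4, 0]]
W = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]
loss = weighted_loss(X, W, b)
print(weight(0.0), weight(50.0), weight(250.0), loss)
```

Because $f(0) = 0$, pairs that never co-occur drop out of the sum, which also avoids evaluating $\log(0)$; the cap at 1 keeps very frequent pairs from dominating the loss.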