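Attention Summary

A summary of the NMT paper, titled:

NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE

It is generally regarded as the pioneering work on attention.

The attention model introduced in this paper (called the alignment model in the paper) is built on top of the RNN Encoder-Decoder architecture, so the authors first give a brief introduction to the basic RNN Encoder-Decoder architecture.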

RNN Encoder-Decoder

An encoder reads the input sentence, a sequence of vectors $\mathbf{x} = (x_1, \ldots, x_{T_x})$, into a vector $c$. Here $\mathbf{x}$ is the source sentence, each $x_i$ is a 1-of-K coded word vector, and $T_x$ is the length of the source sentence.

For an RNN,

$$h_t = f(x_t, h_{t-1}) \\ c = q(\{h_1, \ldots, h_{T_x}\})$$

where $h_t$ is the hidden state at time $t$, and $f$ and $q$ are some nonlinear functions.
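The two encoder equations can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: $f$ is taken to be a simple tanh recurrence, $q$ simply returns the last hidden state, and the names `rnn_step`, `encode`, `W_x`, `W_h` and the toy weights are hypothetical.

```python
import math

def rnn_step(x_t, h_prev, W_x, W_h):
    """One step of an assumed f: h_t = tanh(W_x x_t + W_h h_{t-1})."""
    return [
        math.tanh(
            sum(w * x for w, x in zip(W_x[r], x_t))
            + sum(w * h for w, h in zip(W_h[r], h_prev))
        )
        for r in range(len(W_h))
    ]

def encode(xs, W_x, W_h):
    """Run f over the sentence; return all hidden states and c = q({h_1..h_Tx})."""
    h = [0.0] * len(W_h)          # h_0 is the zero vector
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W_x, W_h)
        states.append(h)
    c = states[-1]                # assumed q: keep only the last hidden state
    return states, c

# Three 1-of-K coded words (K = 3) and a hidden size of 2, with toy weights.
xs = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
W_x = [[0.5, -0.3, 0.1], [0.2, 0.4, -0.6]]
W_h = [[0.1, 0.0], [0.0, 0.1]]
states, c = encode(xs, W_x, W_h)
```

Whatever the sentence length, `c` has a fixed size, which is exactly the bottleneck that the alignment model later removes.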

The decoder is often trained to predict the next word $y_{t'}$ given the context vector $c$ and all the previously predicted words $\{y_1, \ldots, y_{t'-1}\}$. In other words, the decoder defines a probability over the translation $\mathbf{y}$ by decomposing the joint probability into the ordered conditionals:

$$p(\mathbf{y}) = \prod_{t=1}^{T_y} p(y_t \mid \{y_1, \ldots, y_{t-1}\}, c) \\ \mathbf{y} = (y_1, \ldots, y_{T_y})$$
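As a small worked example of this decomposition, the sketch below multiplies per-step conditionals in log space. The probabilities are made-up numbers and `sentence_log_prob` is a hypothetical helper name.

```python
import math

def sentence_log_prob(step_probs):
    """log p(y) = sum over t of log p(y_t | y_1..y_{t-1}, c)."""
    return sum(math.log(p) for p in step_probs)

# Hypothetical conditionals for a 3-word translation.
step_probs = [0.5, 0.4, 0.25]
log_p = sentence_log_prob(step_probs)
p = math.exp(log_p)  # same value as 0.5 * 0.4 * 0.25, up to rounding
```

Summing logs rather than multiplying probabilities avoids underflow for long sentences.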

With an RNN, each conditional probability is modeled as

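$$p(y_t \mid \{y_1, \ldots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c)$$

where $y_{t-1}$ is the output at the previous step, $s_t$ is the decoder hidden state at time $t$, and $g$ is a nonlinear (possibly multi-layered) function that outputs the probability of $y_t$; the recurrent unit behind it can be a plain RNN or an LSTM.

Of course, the RNN here can be replaced by an LSTM, which generally works better.

Note that the context vector $c$ is the same in the joint probability and in every individual conditional probability. This is the so-called "distracted model" (one with no focus of attention), which motivates the alignment model introduced below.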

alignment model

The authors use a BiRNN encoder as the example when introducing this model.

With the alignment model, each conditional probability defined in the previous section becomes:

$$p(y_i \mid y_1, \ldots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, s_i, c_i)$$

The update for $s_i$ (the hidden state at time $i$) is:

$$s_i = f(s_{i-1}, y_{i-1}, c_i)$$

The update for $s_i$ has much the same form as in an ordinary RNN or LSTM, except for the additional input $c_i$.

The key point of these two expressions is that the decoder uses a different $c_i$ at each time step: it searches through the source sentence $\mathbf{x}$ while decoding a translation to form $c_i$, instead of sharing one $c$ across all time steps as in the previous section.

The figure below visualizes how the attention allocation is computed:

[Figure: computation of the attention allocation]

Next, how the context vector $c_i$ is computed:

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j \\ \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} \\ e_{ij} = a(s_{i-1}, h_j)$$

These three expressions make up the so-called alignment model (the attention mechanism as we know it today). Why does the paper call it alignment? Because $e_{ij}$ scores how well the inputs around position $j$ and the output at position $i$ match.
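The first two expressions (softmax over the energies, then a weighted sum of the annotations) can be sketched as follows. The energies $e_{ij}$ are given as plain numbers here, and `softmax`, `context_vector` and the toy values are hypothetical names and data chosen for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(energies, annotations):
    """alpha_ij = softmax over j of e_ij; c_i = sum over j of alpha_ij * h_j."""
    alphas = softmax(energies)
    dim = len(annotations[0])
    c = [sum(a * h[d] for a, h in zip(alphas, annotations)) for d in range(dim)]
    return alphas, c

# Energies e_i1..e_i3 for one decoder step i, and three annotations h_1..h_3.
energies = [2.0, 0.0, 0.0]
annotations = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
alphas, c_i = context_vector(energies, annotations)
```

Because the first energy is the largest, `alphas[0]` dominates and `c_i` is pulled toward `h_1`.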

Visualization:

[Figure: visualization of the alignment model]

The AM (alignment model) here is really a soft AM: when computing the attention distribution, every word in the source sentence $\mathbf{x}$ receives an alignment probability (how likely it is that the target word was decoded from this source word, which is what alignment means), so the result is a probability distribution. Where there is a soft AM there is also a hard AM, which is not covered here.

In the paper, the AM is a feedforward neural network which is jointly trained with all the other components of the proposed system. Its concrete form is:

$$a(s_{i-1}, h_j) = \mathbf{v}_a^T \tanh(W_a s_{i-1} + U_a h_j)$$
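A minimal sketch of this scoring function in plain Python, assuming small dense matrices stored as nested lists. The helper names `matvec` and `additive_score` and the toy parameters are hypothetical; in the paper the parameters are learned jointly with the rest of the model.

```python
import math

def matvec(W, v):
    """Matrix-vector product for a nested-list matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def additive_score(s_prev, h_j, W_a, U_a, v_a):
    """e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j)."""
    pre = [a + b for a, b in zip(matvec(W_a, s_prev), matvec(U_a, h_j))]
    return sum(v * math.tanh(p) for v, p in zip(v_a, pre))

s_prev = [0.5, -0.5]                        # previous decoder state s_{i-1}
annotations = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_a = [[1.0, 0.0], [0.0, 1.0]]              # 2 x 2, acts on s_{i-1}
U_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # 2 x 3, acts on h_j
v_a = [1.0, 1.0]
energies = [additive_score(s_prev, h, W_a, U_a, v_a) for h in annotations]
```

Since `U_a h_j` does not depend on the decoder step, it can be computed once per source sentence and reused at every step.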

In practice there are many variants of the attention function. The mainstream ones are additive attention, multiplicative (dot-product) attention, self-attention, and key-value attention.

additive attention:

From the paper: Attention-Based Models for Speech Recognition. The score has the same form as the alignment model $a(s_{i-1}, h_j)$ above; note that in this notation $i$ indexes the source position and $j$ the target position.

$$f_{att}(h_i, s_{j-1}) = \mathbf{v}_a^T \tanh(W_a [h_i; s_{j-1}]) \\ \text{i.e.} \quad f_{att}(h_i, s_{j-1}) = \mathbf{v}_a^T \tanh(W_1 h_i + W_2 s_{j-1})$$

In essence, a feedforward network is used to compute the attention allocation.

multiplicative attention:

From the paper: Effective Approaches to Attention-based Neural Machine Translation

$$f_{att}(h_i, s_{j-1}) = h_i^T W_a s_{j-1}$$

Additive and multiplicative attention have similar theoretical complexity, but multiplicative attention is faster and more memory-efficient in practice because it can be implemented with matrix operations.
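To illustrate why the multiplicative form maps onto matrix operations, the sketch below scores every source position with two matrix-vector products: $W_a s_{j-1}$ is computed once and then multiplied by the matrix $H$ whose rows are the $h_i$. The names `matvec` and `multiplicative_scores` and the toy numbers are hypothetical.

```python
def matvec(W, v):
    """Matrix-vector product for a nested-list matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def multiplicative_scores(H, W_a, s_prev):
    """Scores h_i^T W_a s_{j-1} for every row h_i of H, computed as H (W_a s_{j-1})."""
    projected = matvec(W_a, s_prev)   # computed once, shared by all positions
    return matvec(H, projected)

H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # rows are h_1, h_2, h_3
W_a = [[2.0, 0.0], [0.0, 3.0]]
s_prev = [1.0, -1.0]
scores = multiplicative_scores(H, W_a, s_prev)
```

With an array library the same computation is a single batched matrix product, with no per-position tanh layer as in the additive form.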

self-attention:

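From the paper: Attention Is All You Need. Note that the additive form shown below is the one used in A Structured Self-Attentive Sentence Embedding; the Transformer itself uses scaled dot-product attention.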

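$$A = \mathrm{softmax}(V_a \tanh(W_a H^T)) \\ C = AH$$

Self-attention differs considerably from ordinary attention: it relates positions within a single sequence, so the expression here does not involve the decoder state $s_i$. The Transformer is the typical application of self-attention.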
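The two self-attention expressions can be traced with a small pure-Python sketch. The shapes are assumptions made for illustration: $H$ is $n \times d$ (one row per position), $W_a$ is $d_a \times d$, $V_a$ is $r \times d_a$, and the softmax is taken over the $n$ positions in each row of $A$. The helper names are hypothetical.

```python
import math

def matmul(A, B):
    """Product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(H, W_a, V_a):
    """A = softmax(V_a tanh(W_a H^T)), C = A H."""
    hidden = [[math.tanh(x) for x in row] for row in matmul(W_a, transpose(H))]  # d_a x n
    A = [softmax(row) for row in matmul(V_a, hidden)]  # r x n, each row sums to 1
    return A, matmul(A, H)                             # C is r x d

H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # n = 3 positions, d = 2
W_a = [[1.0, 0.0], [0.0, 1.0]]             # d_a = 2
V_a = [[1.0, 0.0], [0.0, 1.0]]             # r = 2 attention rows
A, C = self_attention(H, W_a, V_a)
```

Each row of `A` is a separate attention distribution over the same sequence, so `C` holds `r` different weighted summaries of `H`.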

key-value attention:

From the paper: Frustratingly Short Attention Spans in Neural Language Modeling

The key idea of this variant is to split $h_i$ into a key vector $k_i$ and a value vector $v_i$, i.e. $[k_i; v_i] = h_i$:

$$\mathbf{a}_i = \mathrm{softmax}(\mathbf{v}_a^T \tanh(W_1 [\mathbf{k}_{i-L}; \ldots; \mathbf{k}_{i-1}] + (W_2 \mathbf{k}_i) \mathbf{1}^T)) \\ \mathbf{c}_i = [\mathbf{v}_{i-L}; \ldots; \mathbf{v}_{i-1}] \mathbf{a}_i^T \\ \mathbf{c} = [\mathbf{c}_i; \mathbf{v}_i]$$

$L$ is the length of the attention window.
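The following sketch walks through the key-value computation column by column instead of in matrix form. It assumes the first `key_dim` entries of each $h_t$ form the key and the rest form the value, and that $i \geq 1$ so the window is non-empty. The function name, the parameter names and the toy data are hypothetical.

```python
import math

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def key_value_attention(hs, i, L, key_dim, W_1, W_2, v_a):
    """Attend from position i over the previous L positions (requires i >= 1)."""
    keys = [h[:key_dim] for h in hs]       # k_t
    values = [h[key_dim:] for h in hs]     # v_t
    window = range(max(0, i - L), i)       # positions i-L .. i-1
    query = matvec(W_2, keys[i])           # W_2 k_i, shared by every column
    scores = []
    for t in window:
        pre = [p + q for p, q in zip(matvec(W_1, keys[t]), query)]
        scores.append(sum(v * math.tanh(x) for v, x in zip(v_a, pre)))
    a = softmax(scores)
    dim = len(values[0])
    c = [sum(w * values[t][d] for w, t in zip(a, window)) for d in range(dim)]
    return a, c

# Four states of size 3: two key dimensions followed by one value dimension.
hs = [[1.0, 0.0, 10.0], [0.0, 1.0, 20.0], [1.0, 1.0, 30.0], [0.0, 1.0, 40.0]]
W_1 = [[1.0, 0.0], [0.0, 1.0]]
W_2 = [[1.0, 0.0], [0.0, 1.0]]
v_a = [1.0, 1.0]
a, c = key_value_attention(hs, i=3, L=2, key_dim=2, W_1=W_1, W_2=W_2, v_a=v_a)
```

Only the keys enter the scores and only the values enter `c`, which is the separation of roles this variant is built around.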

The authors introduce the AM using a BiRNN encoder; what the BiRNN changes is mainly how $h_j$ is computed:

$$h_j = [\overrightarrow{h_j}^T; \overleftarrow{h_j}^T]^T$$

That is, the forward and backward hidden states at position $j$ are concatenated.
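A minimal sketch of this concatenation, assuming a simple tanh recurrence in both directions in place of the gated units used in the paper. The names `rnn_states`, `birnn_annotations` and the toy weights are hypothetical.

```python
import math

def rnn_states(xs, W_x, W_h):
    """Return [h_1, ..., h_T] with h_t = tanh(W_x x_t + W_h h_{t-1}) and h_0 = 0."""
    h = [0.0] * len(W_h)
    states = []
    for x_t in xs:
        h = [
            math.tanh(
                sum(w * x for w, x in zip(W_x[r], x_t))
                + sum(w * p for w, p in zip(W_h[r], h))
            )
            for r in range(len(W_h))
        ]
        states.append(h)
    return states

def birnn_annotations(xs, fwd_params, bwd_params):
    """h_j = [forward h_j; backward h_j] for every source position j."""
    forward = rnn_states(xs, *fwd_params)
    # Run the backward RNN on the reversed sentence, then re-align to positions.
    backward = rnn_states(xs[::-1], *bwd_params)[::-1]
    return [f + b for f, b in zip(forward, backward)]

xs = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
fwd_params = ([[0.5, -0.3, 0.1], [0.2, 0.4, -0.6]], [[0.1, 0.0], [0.0, 0.1]])
bwd_params = ([[-0.2, 0.3, 0.7], [0.6, -0.1, 0.2]], [[0.1, 0.0], [0.0, 0.1]])
annotations = birnn_annotations(xs, fwd_params, bwd_params)
```

Each annotation therefore summarizes both the words before position $j$ and the words after it.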

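That covers the main content of the paper.

The AM above was explained in the context of an encoder-decoder, but it does not need to be tied to any particular architecture. What matters is the underlying idea of the AM; for details, see the blog posts linked at the end of this post.

Two points in the paper that I do not yet understand:

The maxout hidden layer mentioned in the paper; see the paper: Maxout Networks;
The gated hidden unit used as the activation function $f$; see the paper: Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation.

When I have time, I will write up self-attention and the Transformer.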

References

1: https://blog.csdn.net/mpk_no1/article/details/72862348

2: https://blog.csdn.net/TG229dvt5I93mxaQ5A6U/article/details/78422216
