
Attention Is All You Need (20.11.21)

Abstract

  • The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism.

  • We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.

  • Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU.

  • On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.

  • We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

1. Introduction

  • Since the advent of the RNN variants LSTM and GRU, numerous efforts have continued to push the boundaries of recurrent language models and encoder-decoder architectures.

  • Recurrent models typically factor computation along the symbol positions of the input and output sequences.

  • Aligning the positions to steps in computation time, they generate a sequence of hidden states h_t, as a function of the previous hidden state h_{t−1} and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.

  • Recent work has achieved significant improvements in computational efficiency through factorization tricks and conditional computation, while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.

  • Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences. In all but a few cases, however, such attention mechanisms are used in conjunction with a recurrent network.
  • In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization … .

2. Background

  • Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations.

  • End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks.

  • The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

3. Model Architecture

- Most competitive neural sequence transduction models have an encoder-decoder structure.

- Here, the encoder maps an input sequence of symbol representations (x_1, …, x_n) to a sequence of continuous representations z = (z_1, …, z_n). Given z, the decoder then generates an output sequence (y_1, …, y_m) of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next. (A minimal greedy-decoding sketch follows Figure 1 below.)

- The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.

[Figure 1: The Transformer model architecture, with the encoder stack on the left and the decoder stack on the right]
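
To make the auto-regressive generation described above concrete, here is a minimal greedy decoding sketch in Python (not from the paper); `model.predict_next`, `bos_id`, and `eos_id` are hypothetical placeholders for a trained Transformer and its special token ids.

```python
def greedy_decode(model, src_ids, bos_id, eos_id, max_len=100):
    """Generate target symbols one at a time, feeding previous outputs back as input."""
    tgt_ids = [bos_id]
    for _ in range(max_len):
        # Hypothetical interface: score the next symbol given the source and everything generated so far.
        next_id = model.predict_next(src_ids, tgt_ids)
        tgt_ids.append(next_id)
        if next_id == eos_id:
            break
    return tgt_ids[1:]  # drop the start-of-sequence token
```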
  • 3.1 Encoder and Decoder Stacks

    1) Encoder:

    i. The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers.

    ii. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1].

    iii. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.

    2) Decoder:

    i. The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization.

    ii. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
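
As a minimal PyTorch-style sketch of the sub-layer pattern above, LayerNorm(x + Sublayer(x)) with d_model = 512; the class name and interface are illustrative assumptions, not the authors' reference code.

```python
import torch
import torch.nn as nn

class ResidualSublayer(nn.Module):
    """Wraps any sub-layer (attention or feed-forward) as LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, sublayer) -> torch.Tensor:
        # sublayer: any callable mapping (batch, seq, d_model) -> (batch, seq, d_model)
        return self.norm(x + sublayer(x))
```

An encoder layer would apply this wrapper twice (self-attention, then the feed-forward network) and a decoder layer three times, with the extra encoder-decoder attention sub-layer in the middle.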

  • 3.2 Attention

    An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

[Figure 2: (left) Scaled Dot-Product Attention; (right) Multi-Head Attention, several attention layers running in parallel]
  • 3.2.1 Scaled Dot-Product Attention
  • We call our particular attention “Scaled Dot-Product Attention” (Figure 2). The input consists of queries and keys of dimension d_k, and values of dimension d_v. We compute the dot products of the query with all keys, divide each by √d_k, and apply a softmax function to obtain the weights on the values.

  • In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V. We compute the matrix of outputs as:

    Attention(Q, K, V) = softmax(Q K^T / √d_k) V
  • The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of 1/√d_k. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.

  • While for small values of d_k the two mechanisms perform similarly, additive attention outperforms dot-product attention without scaling for larger values of d_k [3]. We suspect that for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by 1/√d_k.
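
A sketch of Scaled Dot-Product Attention exactly as defined above (softmax of QKᵀ/√d_k applied to V); the optional mask argument anticipates the masking of Section 3.2.3, and its calling convention is an assumption.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k: (..., seq_len, d_k); v: (..., seq_len, d_v). Returns the weighted values and the weights."""
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)   # dot products, scaled by 1/sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))        # block illegal connections
    weights = torch.softmax(scores, dim=-1)                          # compatibility scores -> weights
    return torch.matmul(weights, v), weights                         # weighted sum of the values
```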

  • 3.2.2 Multi-Head Attention
  • Instead of performing a single attention function with d_model-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to d_k, d_k and d_v dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding d_v-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.

  • Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.

    MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O,  where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
  • In this work we employ h = 8 parallel attention layers, or heads. For each of these we use d_k = d_v = d_model/h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
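
A compact sketch of the multi-head scheme (h = 8, d_k = d_v = d_model / h = 64), reusing the scaled_dot_product_attention function sketched in 3.2.1; the layer names are illustrative, not from a reference implementation.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int = 512, h: int = 8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        # Learned linear projections for queries, keys, values, and the final output (W^Q, W^K, W^V, W^O).
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        b = q.size(0)
        def split(x):  # (b, seq, d_model) -> (b, h, seq, d_k)
            return x.view(b, -1, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))
        out, _ = scaled_dot_product_attention(q, k, v, mask)          # attend in parallel on every head
        out = out.transpose(1, 2).contiguous().view(b, -1, self.h * self.d_k)
        return self.w_o(out)                                          # concatenate heads and project once more
```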

  • 3.2.3 Applications of Attention in our Model

    The Transformer uses multi-head attention in three different ways:

    1) In “encoder-decoder attention” layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [38, 2, 9].

    2) The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.

    3) Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections. See Figure 2.
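
A sketch of the mask described in item 3: a lower-triangular matrix that lets each position attend only to itself and earlier positions; passed to the attention sketch above, the blocked scores are set to −∞ before the softmax. The 1-means-allowed convention is my assumption.

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """(seq_len, seq_len) boolean mask: True where attention is allowed, False where it is blocked."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Example: position 2 may attend to positions 0, 1, 2 but not to 3 or 4.
print(causal_mask(5).int())
```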

  • 3.3 Position-wise Feed-Forward Networks

    In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.

    FFN(x) = max(0, x W_1 + b_1) W_2 + b_2

    While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is d_model = 512, and the inner-layer has dimensionality d_ff = 2048.
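
A sketch of the position-wise feed-forward network, FFN(x) = max(0, xW_1 + b_1)W_2 + b_2 with d_model = 512 and d_ff = 2048; nn.Linear already acts on the last dimension, so the same transformation is applied to every position separately and identically.

```python
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        # Two linear transformations with a ReLU activation in between.
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)
```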

  • 3.4 Embeddings and Softmax

    Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension d_model. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [30]. In the embedding layers, we multiply those weights by √d_model.
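
A sketch of the weight sharing described above: one matrix serves as the token embedding (scaled by √d_model) and as the pre-softmax projection; the class and method names are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class TiedEmbeddingSoftmax(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 512):
        super().__init__()
        self.d_model = d_model
        self.embed = nn.Embedding(vocab_size, d_model)  # the shared weight matrix

    def embed_tokens(self, token_ids: torch.Tensor) -> torch.Tensor:
        # In the embedding layers the shared weights are multiplied by sqrt(d_model).
        return self.embed(token_ids) * math.sqrt(self.d_model)

    def logits(self, decoder_output: torch.Tensor) -> torch.Tensor:
        # The same weights act as the pre-softmax linear transformation.
        return decoder_output @ self.embed.weight.t()
```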

  • 3.5 Positional Encoding
  1. Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add “positional encodings” to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension d_model as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [9].

    In this work, we use sine and cosine functions of different frequencies:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

where pos is the position and i is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000·2π. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE_{pos+k} can be represented as a linear function of PE_pos.
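
A sketch of the sinusoidal encoding defined above: sine on even dimensions, cosine on odd dimensions, with wavelengths following the 10000^(2i/d_model) schedule (d_model assumed even).

```python
import torch

def positional_encoding(max_len: int, d_model: int = 512) -> torch.Tensor:
    """Returns a (max_len, d_model) matrix with PE[pos, 2i] = sin(...) and PE[pos, 2i+1] = cos(...)."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)     # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)              # even dimension indices 2i
    angle = pos / torch.pow(torch.tensor(10000.0), i / d_model)       # (max_len, d_model / 2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe  # added to the (scaled) input embeddings at the bottom of both stacks
```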

4. Why Self-Attention

5. Training

  • 5.1 Training Data and Batching
  • 5.2 Hardware and Schedule

    8 NVIDIA P100 GPUs.

  • 5.3 Optimizer
  • 5.4 Regularization

6. Results

  • 6.1 Machine Translation

    1) On the WMT 2014 English-to-German translation task, the big Transformer model (Transformer (big) in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4. The configuration of this model is listed in the bottom line of Table 3. Training took 3.5 days on 8 P100 GPUs. Even our base model surpasses all previously published models and ensembles, at a fraction of the training cost of any of the competitive models.

    2) On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41.0, outperforming all of the previously published single models, at less than 1/4 the training cost of the previous state-of-the-art model. The Transformer (big) model trained for English-to-French used dropout rate P_drop = 0.1, instead of 0.3.

    3) For the base models, we used a single model obtained by averaging the last 5 checkpoints, which were written at 10-minute intervals. For the big models, we averaged the last 20 checkpoints. We used beam search with a beam size of 4 and length penalty α = 0.6 [38]. These hyperparameters were chosen after experimentation on the development set. We set the maximum output length during inference to input length + 50, but terminate early when possible [38]. (A checkpoint-averaging sketch follows at the end of this subsection.)

    4) Table 2 summarizes our results and compares our translation quality and training costs to other model architectures from the literature. We estimate the number of floating point operations used to train a model by multiplying the training time, the number of GPUs used, and an estimate of the sustained single-precision floating-point capacity of each GPU.

    [Table 2: BLEU scores and training costs of the Transformer compared with previous models on WMT 2014 English-to-German and English-to-French]
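
A sketch of the checkpoint averaging mentioned in 3): the parameters of the last few saved checkpoints are averaged into a single set of weights. File paths and the plain state_dict format are assumptions, not the authors' actual tooling.

```python
import torch

def average_checkpoints(paths):
    """Average the parameter tensors of several saved state_dicts (e.g. the last 5 or 20 checkpoints)."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {name: tensor.clone().float() for name, tensor in state.items()}
        else:
            for name, tensor in state.items():
                avg[name] += tensor.float()
    return {name: tensor / len(paths) for name, tensor in avg.items()}
```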
  • 6.2 Model Variations

    To evaluate the importance of different components of the Transformer, we varied our base model in different ways, measuring the change in performance on English-to-German translation on the development set, newstest2013. We used beam search as described in the previous section, but no checkpoint averaging. We present these results in Table 3.

    In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions, keeping the amount of computation constant, as described in Section 3.2.2. While single-head attention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.

[Table 3: Variations on the Transformer architecture, measured on English-to-German newstest2013]

In Table 3 rows (B), we observe that reducing the attention key size d_k hurts model quality. This suggests that determining compatibility is not easy and that a more sophisticated compatibility function than dot product may be beneficial. We further observe in rows (C) and (D) that, as expected, bigger models are better, and dropout is very helpful in avoiding over-fitting. In row (E) we replace our sinusoidal positional encoding with learned positional embeddings [9], and observe nearly identical results to the base model.

  • 6.3 English Constituency Parsing
  • To evaluate if the Transformer can generalize to other tasks we performed experiments on English constituency parsing. This task presents specific challenges: the output is subject to strong structural constraints and is significantly longer than the input. Furthermore, RNN sequence-to-sequence models have not been able to attain state-of-the-art results in small-data regimes [37].
  • We trained a 4-layer Transformer with d_model = 1024 on the Wall Street Journal (WSJ) portion of the Penn Treebank [25], about 40K training sentences. We also trained it in a semi-supervised setting, using the larger high-confidence and BerkeleyParser corpora with approximately 17M sentences [37]. We used a vocabulary of 16K tokens for the WSJ-only setting and a vocabulary of 32K tokens for the semi-supervised setting.

  • We performed only a small number of experiments to select the dropout, both attention and residual (Section 5.4), learning rates and beam size on the Section 22 development set; all other parameters remained unchanged from the English-to-German base translation model. During inference, we increased the maximum output length to input length + 300. We used a beam size of 21 and α = 0.3 for both the WSJ-only and the semi-supervised setting.

  • Our results in Table 4 show that despite the lack of task-specific tuning our model performs surprisingly well, yielding better results than all previously reported models with the exception of the Recurrent Neural Network Grammar [8]. In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the BerkeleyParser [29] even when training only on the WSJ training set of 40K sentences.

7. Conclusion

  • In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.

  • For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles.

  • We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. Making generation less sequential is another research goal of ours.
