Paper: https://arxiv.org/abs/1706.03762

The code is on my GitHub: https://github.com/JingBob/myTransformer, where the comments are more detailed.

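Contents

I. Background
II. Model Architecture
- 1. Overall Architecture
- 2. Encoder
- 3. Decoder
- 4. Attention Layers
  - Scaled Dot-Product Attention
  - Multi-Head Attention
  - Applications of Attention in our Model
- 5. Position-wise Feed-Forward Network
- 6. Embeddings and Softmax
- 7. Positional Encoding
- 8. Full Model
III. Model Training
IV. Hands-on Example
References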

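The Transformer gave many NLP tasks a new architecture. It is based entirely on attention and dispenses with recurrence and convolution, which makes it highly parallelizable, and it set new state-of-the-art results on many NLP tasks. It is a very advanced model, so I am recording my study notes here. They mainly follow the PyTorch implementation by the Harvard NLP group: http://nlp.seas.harvard.edu/2018/04/03/attention.html

For the underlying theory there is an excellent write-up: Transformer Explained in Detail (Transformer 詳解).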

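I. Background

Before the Transformer, RNNs, LSTMs, and similar models were widely regarded as the state of the art for sequence modeling and transduction problems. They typically factor computation along the symbol positions of the input and output sequences, aligning positions with time steps to generate a sequence of hidden states. This inherently sequential computation cannot be parallelized, which makes it inefficient, and it also runs into alignment problems. Attention makes it possible to model dependencies between symbols regardless of their distance in the input or output sequence, but it was typically used only in combination with the recurrent networks above, so it could not fix the inherent limitations of RNNs. The Transformer was therefore proposed: it abandons the convention that an encoder-decoder model must be built on a CNN or RNN and relies on attention alone. The main goals are to reduce computation and improve parallel efficiency without hurting accuracy.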

II. Model Architecture

My environment: pytorch==1.9.0

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import math, copy, time
from torch.autograd import Variable
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_context(context="talk")


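1. Overall Architecture

Most neural sequence transduction models have an encoder-decoder structure. The encoder maps an input sequence of symbol representations (x1, …, xn) to a sequence of continuous representations z = (z1, …, zn). Given z, the decoder generates an output sequence of symbols (y1, …, ym) one element at a time. The Transformer follows this overall design; its structure is shown in the figure below:

[Figure: the Transformer encoder-decoder architecture]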

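class EncoderDecoder(nn.Module):
    """
    A standard encoder-decoder architecture.
    Args:
        encoder: the encoder
        decoder: the decoder
        src_embed: source embedding
        tgt_embed: target embedding
        generator: the generator, i.e. the final linear + softmax in the figure above
    """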
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator

    def forward(self, src, tgt, src_mask, tgt_mask):
        "Take in and process the masked source and target sequences."
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)

    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)


class Generator(nn.Module):
    "Define the standard linear + softmax generation step."
    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return F.log_softmax(self.proj(x), dim=-1)

2. Encoder

[Figure: the encoder]

# The encoder is a stack of N = 6 identical EncoderLayer modules.
def clones(module, N):
    "Produce N identical layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

# The full encoder
class Encoder(nn.Module):
    "A stack of N identical layers (see the figure above)."
    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, mask):
        "Pass the input (and mask) through each layer in turn."
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)


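# Layer normalization. In the paper the output of each sub-layer is LayerNorm(x + Sublayer(x)),
# where Sublayer(x) is the function implemented by the sub-layer itself.
# Dropout is applied to the output of each sub-layer before it is added to the sub-layer input and normalized.
# To make these residual connections possible, all sub-layers and the embedding layers produce outputs of dimension 512.
class LayerNorm(nn.Module):
    "Layer normalization"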
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2

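# Add & Norm
class SublayerConnection(nn.Module):
    """
    A residual connection combined with layer normalization.
    Note that in this code the norm is applied to the sub-layer input, not to its output.
    """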
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))


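# An encoder layer has two sub-layers: multi-head self-attention, then a simple position-wise fully connected feed-forward network.
class EncoderLayer(nn.Module):
    "EncoderLayer is made up of self_attn and feed_forward (both defined later)."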
    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)
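
To make the LayerNorm arithmetic above concrete, here is a small dependency-free sketch that normalizes a single feature vector the same way: subtract the mean, then divide by the standard deviation plus eps. It assumes the gain a_2 is all ones and the bias b_2 is all zeros (their initial values), and it uses the sample standard deviation because torch's std applies Bessel's correction by default. The helper name layer_norm_1d is made up for this illustration.

```python
import statistics


def layer_norm_1d(x, eps=1e-6):
    """Normalize one feature vector: (x - mean) / (std + eps)."""
    mean = statistics.mean(x)
    # Sample standard deviation (divides by n - 1), like torch.std's default.
    std = statistics.stdev(x)
    return [(v - mean) / (std + eps) for v in x]


x = [1.0, 2.0, 3.0, 4.0]
y = layer_norm_1d(x)
print([round(v, 4) for v in y])
```

After normalization the vector has mean 0 and (up to eps) standard deviation 1, whatever the scale of the input was.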

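3. Decoder

In addition to the two sub-layers found in each encoder layer, each decoder layer inserts a third sub-layer, which performs multi-head attention over the output of the encoder.

# Like the encoder, the decoder is a stack of N identical DecoderLayer modules.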
class Decoder(nn.Module):
    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)


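# In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer that performs multi-head attention over the encoder output.
# As in the encoder, each sub-layer is wrapped in a residual connection followed by layer normalization.
class DecoderLayer(nn.Module):
    "DecoderLayer is made up of self-attn, src-attn, and feed forward (defined later)."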
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)

    def forward(self, x, memory, src_mask, tgt_mask):
        m = memory
        # masked Multi-Head Attention
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        # Multi-Head Attention
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        # feed forward
        return self.sublayer[2](x, self.feed_forward)


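# The self-attention sub-layer in the decoder stack is modified to prevent positions from attending to subsequent positions.
# This masking, combined with the fact that the output embeddings are offset by one position, ensures that the prediction for position i can depend only on the known outputs at positions less than i.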
def subsequent_mask(size):
    attn_shape = (1, size, size)
    subsequent_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')
    return torch.from_numpy(subsequent_mask) == 0
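
To see what subsequent_mask produces, the sketch below builds the same lower-triangular pattern with plain Python lists, without the leading batch dimension. The function name causal_mask is only for illustration. Entry [i][j] is True when position i is allowed to attend to position j, i.e. when j <= i.

```python
def causal_mask(size):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(size)] for i in range(size)]


mask = causal_mask(4)
for row in mask:
    # "x" marks an allowed position, "." a masked (future) position.
    print(" ".join("x" if allowed else "." for allowed in row))
```

Each row allows one more position than the row above it, which is the staircase shape the mask plot shows.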

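4. Attention Layers

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.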

Scaled Dot-Product Attention

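Given an input X, three linear transformations first turn X into Q (query), K (key), and V (value). The input to Scaled Dot-Product Attention then consists of queries and keys of dimension d_k and values of dimension d_v. We compute the dot products of the query with all keys, divide each dot product by √d_k, and apply a softmax function to obtain the weights on the values. In practice the attention function is computed on a set of queries simultaneously, packed together into a matrix Q, as in the figure below:

[Figure: Scaled Dot-Product Attention]

The formula is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$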

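The two most commonly used attention functions are additive attention and dot-product (multiplicative) attention. The two have similar theoretical complexity, but dot-product attention is faster in practice, so the paper uses it. Compared with ordinary dot-product attention, the paper adds a scaling factor of 1/√d_k. The factor keeps the dot products from growing too large, which would push the softmax into regions where its gradients are extremely small.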

def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    # Attention scores; the keys must be transposed
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn
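
The 1/√d_k factor is easier to appreciate with numbers. The sketch below is a small experiment of my own, not something from the paper's code. It assumes the components of q and k are independent standard normal values, draws many such pairs, and compares the spread of the raw dot products with the spread after scaling. Under that assumption the raw dot product has variance d_k, so its standard deviation grows like √d_k, while the scaled version stays near 1.

```python
import math
import random
import statistics

random.seed(0)


def random_dot_products(d_k, n_samples=2000):
    """Dot products of pairs of vectors with i.i.d. standard normal components."""
    dots = []
    for _ in range(n_samples):
        q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    return dots


d_k = 64
raw = random_dot_products(d_k)
scaled = [s / math.sqrt(d_k) for s in raw]
raw_std = statistics.stdev(raw)
scaled_std = statistics.stdev(scaled)
print(f"std of q.k:             {raw_std:.2f}")
print(f"std of q.k / sqrt(d_k): {scaled_std:.2f}")
```

With logits this large before scaling, a softmax would put almost all of its weight on a single key, which is the small-gradient regime the scaling factor is meant to avoid.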

Multi-Head Attention

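Multi-Head Attention runs the Scaled Dot-Product Attention procedure h times and then combines the outputs Z. How are they combined? The formula from the paper is:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

That is, the heads are concatenated and then multiplied by a matrix W^O, so that the output has the same shape as the input.

[Figure: Multi-Head Attention]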

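Note that the encoder uses self-attention, while the decoder uses masked self-attention.

Why mask?

In a traditional Seq2Seq model the decoder is an RNN, so when the word at time step t is fed in during training, the model cannot possibly see words from future time steps: a recurrent network is driven by time, and the word at t+1 becomes visible only after the computation at step t has finished. The Transformer decoder drops the RNN, and this creates a problem: during training the entire ground truth is exposed to the decoder, which is clearly wrong. The decoder input therefore has to be masked. Concretely, the upper triangle of the attention score matrix (the entries where a position would look at a later position) is masked out.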
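
Here is a small sketch of what the mask does to the attention weights. It uses plain Python lists and a made-up 3x3 score matrix. As in the attention function above, masked entries are filled with -1e9 before the softmax. The helper names are for illustration only.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention_weights(scores):
    """Apply a causal mask to a square score matrix, then softmax each row."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may only look at positions j <= i;
        # future positions get a very large negative score.
        row = [scores[i][j] if j <= i else -1e9 for j in range(n)]
        weights.append(softmax(row))
    return weights


scores = [[0.5, 2.0, 1.0],
          [1.0, 0.2, 3.0],
          [0.3, 0.3, 0.3]]
weights = masked_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

Every row still sums to 1, but the weight on future positions is exactly zero, so the first position can attend only to itself even though its largest raw score pointed at position 1.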

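# Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k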
        self.d_k = d_model // h
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

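    def forward(self, query, key, value, mask=None):
        "Implements the multi-head attention figure above."
        if mask is not None:
            # The same mask is applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)

        # 1) Do all the linear projections in batch: d_model => h x d_k.
        # The projections are fully connected layers whose weights play the role of the Q, K, V projection matrices.
        query, key, value = [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
             for l, x in zip(self.linears, (query, key, value))]

        # 2) Apply attention to all the projected vectors in batch.
        x, self.attn = attention(query, key, value, mask=mask, dropout=self.dropout)

        # 3) Concatenate the heads, then apply the final linear layer.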
        x = x.transpose(1, 2).contiguous().view(nbatches, -1, self.h * self.d_k)
        return self.linears[-1](x)

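Let's visualize the mask:

def subsequent_mask(size):
    """
    Mask out subsequent positions.
    The self-attention sub-layer in the decoder is modified to prevent positions from attending to subsequent positions.
    This masking, combined with the fact that the output embeddings are offset by one position,
    ensures that the prediction for position i can depend only on the known outputs at positions less than i.
    :param size: (int) sequence length
    :return: (Tensor, bool) the mask, of shape [1, size, size]
    """
    attn_shape = (1, size, size)
    # Upper-triangular part of the all-ones matrix, starting at the first superdiagonal (k=1)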
    subsequent_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')
    return torch.from_numpy(subsequent_mask) == 0


if __name__ == "__main__":
    plt.figure(figsize=(5, 5))
    plt.imshow(subsequent_mask(20)[0])
    plt.show()


Applications of Attention in our Model

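The Transformer uses multi-head attention in three ways, all of which appear in the code above:

- Encoder-decoder attention (src_attn in DecoderLayer): the queries come from the previous decoder layer, while the keys and values come from the encoder output (memory).
- Encoder self-attention (self_attn in EncoderLayer): queries, keys, and values all come from the output of the previous encoder layer.
- Decoder self-attention (self_attn in DecoderLayer): the same, but with tgt_mask so that each position attends only to itself and earlier positions.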

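5. Position-wise Feed-Forward Network

In addition to the attention sub-layers, each layer in the encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. It consists of two linear transformations with a ReLU activation in between:

$$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

class PositionwiseFeedForward(nn.Module):
    "Implements the FFN equation."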
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w_2(self.dropout(F.relu(self.w_1(x))))
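
"Position-wise" means the same two-layer network is applied to every position independently. The sketch below is a toy, dependency-free version with made-up weights (d_model = 2 and d_ff = 3 instead of 512 and 2048) and no dropout. The function name ffn and all the numbers are assumptions for illustration.

```python
def ffn(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position vector x."""
    hidden = [max(0.0, sum(xi * w1[i][j] for i, xi in enumerate(x)) + b1[j])
              for j in range(len(b1))]
    return [sum(h * w2[j][k] for j, h in enumerate(hidden)) + b2[k]
            for k in range(len(b2))]


# Toy parameters: W1 is d_model x d_ff, W2 is d_ff x d_model.
w1 = [[1.0, -1.0, 0.5],
      [0.0, 2.0, -0.5]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]
b2 = [0.1, -0.1]

# A "sequence" of two positions; each one goes through the same network.
sequence = [[1.0, 2.0], [3.0, -1.0]]
out = [ffn(x, w1, b1, w2, b2) for x in sequence]
print(out)
```

Because each position is processed on its own, changing the vector at one position leaves the outputs at all other positions unchanged. Mixing information across positions is the job of the attention sub-layers.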

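6. Embeddings and Softmax

As in other sequence transduction models, learned embeddings convert the input tokens and output tokens to vectors of dimension d_model. A linear transformation and a softmax function convert the decoder output into predicted next-token probabilities. In the Transformer paper, the same weight matrix is shared between the two embedding layers and the pre-softmax linear transformation. In the embedding layers, those weights are multiplied by √d_model. (The code in this post does not tie these weights: each Embeddings module and the Generator has its own parameters.)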

class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)

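7. Positional Encoding

Because the Transformer contains no recurrence and no convolution, the model needs some information about the relative or absolute position of the tokens in the sequence in order to make use of sequence order. To this end, "positional encodings" are added to the input embeddings at the bottoms of the encoder and decoder stacks; the sum is what the encoder and decoder actually receive. There are many possible choices of positional encoding; the paper uses sine and cosine functions:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

For intuition about this positional encoding, see: How should we understand the positional encoding in the Transformer paper, and what does it have to do with trigonometric functions?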

class PositionalEncoding(nn.Module):
    "Implement the PE function."

    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) *
                             -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + Variable(self.pe[:, :x.size(1)],
                         requires_grad=False)
        return self.dropout(x)

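To see how different dimensions encode different positions, take an input of shape [1, 100, 20], i.e. a sequence length of 100 and an embedding dimension of 20, and plot it:

# Add the sine waves according to position
plt.figure(figsize=(15, 5))
# Use an embedding dimension of 20
pe = PositionalEncoding(20, 0)
# Run the forward pass of PE on an input tensor of shape [1, 100, 20]
y = pe.forward(Variable(torch.zeros(1, 100, 20)))
# Plot the positional encoding for a few arbitrary dimensions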
plt.plot(np.arange(100), y[0, :, 4:8].data.numpy())
plt.legend(["dim %d" % p for p in [4, 5, 6, 7]])
plt.show()


plt.figure(figsize=(10, 10))
sns.heatmap(y[0, :, :].data.numpy())
plt.title("Sinusoidal Function")
plt.xlabel("hidden dimension")
plt.ylabel("sequence length")
plt.show()


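As the plots show, the encoding varies more and more slowly (its wavelength gets longer) as the embedding dimension index increases. Each position receives a different combination of sine and cosine values of different periods across the embedding dimensions, which yields a unique positional pattern and ultimately lets the model learn dependencies between positions and the sequential nature of natural language.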
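
For a quick check of the formula without PyTorch, the sketch below computes the encoding of a single position with the math module. The function name positional_encoding is chosen for this example. The loop variable i steps over the even dimension indices, so it plays the role of 2i in the paper's notation.

```python
import math


def positional_encoding(pos, d_model):
    """Return the d_model-dimensional sinusoidal encoding of one position."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):
        angle = pos / (10000.0 ** (i / d_model))
        pe[i] = math.sin(angle)          # even dimensions use sine
        if i + 1 < d_model:
            pe[i + 1] = math.cos(angle)  # odd dimensions use cosine
    return pe


pe0 = positional_encoding(0, 4)
pe1 = positional_encoding(1, 4)
print(pe0)
print([round(v, 4) for v in pe1])
```

Position 0 always encodes to alternating 0 and 1 (sin 0 and cos 0). With d_model = 4, the first sine/cosine pair advances by 1 radian per position and the second by only 0.01 radian, which is the "slower variation at higher dimensions" seen in the plots.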

8. Full Model

def make_model(src_vocab, tgt_vocab, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1):
    c = copy.deepcopy
    attn = MultiHeadedAttention(h, d_model)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn),c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab))

    # This was important from their code.
    # Initialize parameters with Glorot / fan_avg.
    for p in model.parameters():
        if p.dim() > 1:
            # xavier_uniform_ is the in-place initializer; the name without the underscore is deprecated.
            nn.init.xavier_uniform_(p)
    return model

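III. Model Training

We define a batch object that holds the source and target sentences used for training and also builds the masks.

The reason for the mask is that attention is normally computed on mini-batches, i.e. several sentences at once, so the input X has shape [batch size, sequence length], where sequence length is the sentence length. A mini-batch is made up of sentences of different lengths, so the shorter sentences must be padded to the length of the longest one in the mini-batch, usually with 0. This padding causes a problem in the subsequent softmax, whose formula is:

$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

Since e^0 = 1, the padded positions would take part in the computation, meaning that meaningless entries contribute to the result, which can cause serious problems. A boolean mask matrix therefore marks the padded positions, and later computations simply check it.

class Batch:
    "Holds a batch of data, together with its masks, for use during training."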
    def __init__(self, src, trg=None, pad=0):
        self.src = src
        self.src_mask = (src != pad).unsqueeze(-2)
        if trg is not None:
            self.trg = trg[:, :-1]
            self.trg_y = trg[:, 1:]
            self.trg_mask = self.make_std_mask(self.trg, pad)
            self.ntokens = (self.trg_y != pad).data.sum()

    @staticmethod
    def make_std_mask(tgt, pad):
        "Create a mask that hides padding and future words."
        tgt_mask = (tgt != pad).unsqueeze(-2)
        tgt_mask = tgt_mask & Variable(subsequent_mask(tgt.size(-1)).type_as(tgt_mask.data))
        return tgt_mask
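
The two masks built by Batch can be reproduced on plain Python lists to see exactly which entries survive. This is a sketch for a single sentence rather than a batch, and the names padding_mask and target_mask are made up for the example. The padding id is 0, as in Batch.

```python
PAD = 0


def padding_mask(tokens, pad=PAD):
    """True for real tokens, False for padding."""
    return [tok != pad for tok in tokens]


def target_mask(tokens, pad=PAD):
    """Combine the padding mask with the causal mask: row i may attend to
    column j only if j <= i and token j is not padding."""
    keep = padding_mask(tokens, pad)
    n = len(tokens)
    return [[keep[j] and j <= i for j in range(n)] for i in range(n)]


src = [5, 7, 2, 0, 0]   # a source sentence padded to length 5
tgt = [1, 4, 6, 0]      # a target sentence padded to length 4

src_mask = padding_mask(src)
tgt_mask = target_mask(tgt)
print(src_mask)
for row in tgt_mask:
    print(row)
```

The source mask only hides padding, while the target mask hides both padding and future positions, which matches the logical AND used in make_std_mask.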

Next, a generic function for training and computing the loss.

def run_epoch(data_iter, model, loss_compute):
    start = time.time()
    total_tokens = 0
    total_loss = 0
    tokens = 0
    for i, batch in enumerate(data_iter):
        out = model.forward(batch.src, batch.trg,
                            batch.src_mask, batch.trg_mask)
        loss = loss_compute(out, batch.trg_y, batch.ntokens)
        total_loss += loss
        total_tokens += batch.ntokens
        tokens += batch.ntokens
        if i % 50 == 1:
            elapsed = time.time() - start
            print("Epoch Step: %d Loss: %f Tokens per Sec: %f" %
                  (i, loss / batch.ntokens, tokens / elapsed))
            start = time.time()
            tokens = 0
    return total_loss / total_tokens

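Training data and batching

The paper trained on the standard WMT 2014 English-German dataset, consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding, with a source-target vocabulary of about 37000 tokens. For English-French, the much larger WMT 2014 English-French dataset was used, consisting of 36 million sentences, with tokens split into a 32000 word-piece vocabulary. Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs with approximately 25000 source tokens and 25000 target tokens.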

global max_src_in_batch, max_tgt_in_batch

def batch_size_fn(new, count, sofar):
    global max_src_in_batch, max_tgt_in_batch
    if count == 1:
        max_src_in_batch = 0
        max_tgt_in_batch = 0
    max_src_in_batch = max(max_src_in_batch, len(new.src))
    max_tgt_in_batch = max(max_tgt_in_batch, len(new.trg) + 2)
    src_elements = count * max_src_in_batch
    tgt_elements = count * max_tgt_in_batch
    return max(src_elements, tgt_elements)
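
The point of batch_size_fn is that batch size is measured in padded tokens rather than in sentences. The sketch below is a simplified stand-in of my own: it takes lists of sentence lengths instead of example objects and keeps the running maxima in local variables instead of globals. The function name and the lengths are made up. The +2 on the target side mirrors the function above, presumably to leave room for start and end tokens.

```python
def padded_batch_tokens(src_lens, tgt_lens):
    """Padded batch size after adding each sentence pair in turn:
    (number of pairs so far) x (longest source or target length so far)."""
    max_src = 0
    max_tgt = 0
    sizes = []
    for count, (s, t) in enumerate(zip(src_lens, tgt_lens), start=1):
        max_src = max(max_src, s)
        max_tgt = max(max_tgt, t + 2)  # +2 as in batch_size_fn
        sizes.append(max(count * max_src, count * max_tgt))
    return sizes


sizes = padded_batch_tokens(src_lens=[3, 5, 4], tgt_lens=[2, 6, 3])
print(sizes)
```

A single long sentence raises the padded cost of every sentence in the batch, which is why grouping pairs of similar length keeps batches efficient.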

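Hardware

Nothing comparable to Google's setup: I used my own modest laptop with an RTX 3060, so multi-GPU parallel training is not considered here.

Optimizer: Adam

The learning rate is adjusted dynamically according to the formula in the paper:

$$lrate = d_{model}^{-0.5} \cdot \min\left(step\_num^{-0.5},\; step\_num \cdot warmup\_steps^{-1.5}\right)$$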

class NoamOpt:
    def __init__(self, model_size, factor, warmup, optimizer):
        self.optimizer = optimizer
        self._step = 0
        self.warmup = warmup
        self.factor = factor
        self.model_size = model_size
        self._rate = 0

    def step(self):
        "Update the parameters and the learning rate."
        self._step += 1
        rate = self.rate()
        for p in self.optimizer.param_groups:
            p['lr'] = rate
        self._rate = rate
        self.optimizer.step()

    def rate(self, step=None):
        "Compute the learning rate defined above."
        if step is None:
            step = self._step
        return self.factor * (self.model_size ** (-0.5) * min(step ** (-0.5), step * self.warmup ** (-1.5)))

# Example usage
def get_std_opt(model):
    return NoamOpt(model.src_embed[0].d_model, 2, 4000,
                   torch.optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9))
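
To get a feel for the schedule, the sketch below evaluates the same expression as NoamOpt.rate as a standalone function and prints it at a few steps. The function name noam_rate and the default values (d_model = 512, factor = 1, warmup = 4000) are chosen for this illustration. The step must be at least 1, since step ** -0.5 is undefined at 0.

```python
def noam_rate(step, model_size=512, factor=1.0, warmup=4000):
    """factor * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    return factor * (model_size ** -0.5) * min(step ** -0.5, step * warmup ** -1.5)


for step in (1, 1000, 4000, 8000, 100000):
    print(f"step {step:>6}: lr = {noam_rate(step):.6f}")

peak = noam_rate(4000)
```

The rate rises linearly during the first warmup steps, peaks at step = warmup (where the two arguments of min are equal), and then decays in proportion to the inverse square root of the step number.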

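Regularization

One is dropout; the other is label smoothing.

During training the paper uses label smoothing of value ε_ls = 0.1; here it is implemented with a KL divergence loss. The aim is to keep the model from predicting labels over-confidently during training, which helps with poor generalization.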

class LabelSmoothing(nn.Module):
    def __init__(self, size, padding_idx, smoothing=0.0):
        super(LabelSmoothing, self).__init__()
        # reduction='sum' replaces the deprecated size_average=False
        self.criterion = nn.KLDivLoss(reduction='sum')
        self.padding_idx = padding_idx
        self.confidence = 1.0 - smoothing
        self.smoothing = smoothing
        self.size = size
        self.true_dist = None

    def forward(self, x, target):
        assert x.size(1) == self.size
        true_dist = x.data.clone()
        true_dist.fill_(self.smoothing / (self.size - 2))
        # Note: the index tensor passed to scatter_ must be of type long
        true_dist.scatter_(1, target.data.unsqueeze(1).long(), self.confidence)
        true_dist[:, self.padding_idx] = 0
        mask = torch.nonzero(target.data == self.padding_idx)
        if mask.dim() > 0:
            true_dist.index_fill_(0, mask.squeeze(), 0.0)
        self.true_dist = true_dist
        return self.criterion(x, Variable(true_dist, requires_grad=False))
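
The target distribution that LabelSmoothing builds is easy to inspect on a tiny example. The sketch below reproduces only the construction of true_dist, with plain lists and no loss computation. The function name smoothed_targets and the numbers (vocabulary size 5, smoothing 0.4) are made up for illustration.

```python
def smoothed_targets(targets, size, padding_idx=0, smoothing=0.1):
    """Build the smoothed target distribution for each target token id."""
    confidence = 1.0 - smoothing
    rows = []
    for t in targets:
        if t == padding_idx:
            # Padded positions get an all-zero row and contribute no loss.
            rows.append([0.0] * size)
            continue
        # Spread the smoothing mass over every class except the target and the pad.
        row = [smoothing / (size - 2)] * size
        row[t] = confidence
        row[padding_idx] = 0.0
        rows.append(row)
    return rows


dist = smoothed_targets([2, 1, 0], size=5, smoothing=0.4)
for row in dist:
    print([round(p, 4) for p in row])
```

Dividing by size - 2 accounts for the two classes that never receive smoothing mass (the true label and the padding index), so each non-padding row still sums to 1.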

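IV. Hands-on Example

A translation task requires downloading datasets and so on, which is a bit of a hassle, so let's start with a simple task: given a random sequence of input symbols from a small vocabulary, the goal is to generate the same symbols back. This is called the src-tgt copy task.

First, create a dataset:

def data_gen(V, batch, nbatches):
    "Generate random data for the src-tgt copy task."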
    for i in range(nbatches):
        data = torch.from_numpy(np.random.randint(1, V, size=(batch, 10)))
        data[:, 0] = 1
        src = Variable(data, requires_grad=False)
        tgt = Variable(data, requires_grad=False)
        yield Batch(src, tgt, 0)

Loss computation

class SimpleLossCompute:
    def __init__(self, generator, criterion, opt=None):
        self.generator = generator
        self.criterion = criterion
        self.opt = opt

    def __call__(self, x, y, norm):
        x = self.generator(x)
        loss = self.criterion(x.contiguous().view(-1, x.size(-1)), y.contiguous().view(-1)) / norm

        loss.backward()
        if self.opt is not None:
            self.opt.step()
            self.opt.optimizer.zero_grad()
        return loss.data.item() * norm

Decoding uses a greedy strategy

def greedy_decode(model, src, src_mask, max_len, start_symbol):
    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type_as(src.data)
    for i in range(max_len - 1):
        out = model.decode(memory, src_mask,
                           Variable(ys),
                           Variable(subsequent_mask(ys.size(1)).type_as(src.data)))
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.data[0]
        ys = torch.cat([ys, torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=1)
    return ys

Training the model

V = 11
criterion = LabelSmoothing(size=V, padding_idx=0, smoothing=0.0)
model = make_model(V, V, N=2)
model_opt = NoamOpt(model.src_embed[0].d_model, 1, 400,
                    torch.optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9))

for epoch in range(10):
    model.train()
    run_epoch(data_gen(V, 30, 20), model, SimpleLossCompute(model.generator, criterion, model_opt))
    model.eval()
    print(run_epoch(data_gen(V, 30, 5), model, SimpleLossCompute(model.generator, criterion, None)))


Testing the model

model.eval()
src = Variable(torch.LongTensor([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]))
src_mask = Variable(torch.ones(1, 1, 10))
print(greedy_decode(model, src, src_mask, max_len=10, start_symbol=1))

Input: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

If training has converged, the printed output should reproduce the input sequence.

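References

Transformer Explained in Detail (Transformer 詳解)

The Annotated Transformer

How should we understand the positional encoding in the Transformer paper, and what does it have to do with trigonometric functions?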
