
Deep Learning Interview Question 37: How LSTM Networks Work (Long Short Term Memory networks)

Contents

  LSTM network architecture

  The core idea behind LSTMs

  Forget gate

  Input gate

  Output gate

  How do LSTMs solve the long-range dependency problem?

  What is a peephole?

  Multi-layer LSTM

  References

Long Short Term Memory networks, usually just called LSTMs, are a special kind of RNN capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber in 1997 and were refined and popularized by many people in later work. They work extremely well on a wide variety of problems and are now widely used. LSTMs are explicitly designed to avoid the long-term dependency problem: remembering information for long periods of time is practically their default behavior, not something they struggle to learn.

LSTM network architecture
[Figure: LSTM network architecture]


The core idea behind LSTMs

The key to LSTMs is the cell state, the horizontal line running across the top of the diagram. The cell state is a bit like a conveyor belt: it runs straight down the entire chain with only a few minor linear interactions, so it is very easy for information to flow along it unchanged.

[Figure: the cell state running horizontally through the LSTM cell]

The LSTM does have the ability to remove information from, or add information to, the cell state, carefully regulated by structures called gates. A gate is a way to optionally let information through. An LSTM has three gates to protect and control the cell state.
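
As a minimal sketch of what a gate computes (the shapes and weights below are made up for illustration and are not the full LSTM update):

import torch

torch.manual_seed(0)
x_t = torch.randn(4)      # current input
h_prev = torch.randn(4)   # previous hidden state
c_prev = torch.randn(4)   # previous cell state
W = torch.randn(4, 8)     # hypothetical gate weights
b = torch.zeros(4)        # hypothetical gate bias

# A gate is a sigmoid layer over [h_prev, x_t]; every output lies in (0, 1).
gate = torch.sigmoid(W @ torch.cat([h_prev, x_t]) + b)

# The gate scales the signal it controls element-wise:
# 0 means "let nothing through", 1 means "let everything through".
print(gate * c_prev)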

Forget gate
[Figure: the forget gate]

The forget gate outputs a vector of values between 0 and 1, which is then multiplied pointwise with the memory cell C; this can be understood as the model forgetting some of what it has stored.
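
Using the notation of the colah.github.io post listed in the references, the forget gate is a sigmoid layer over the previous hidden state and the current input:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)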

Input gate

Some sources also call this the update gate.

[Figure: the input gate and the candidate values]

The input gate has two branches: the left branch outputs a vector of values between 0 and 1, indicating what fraction of this step's information should be written into the memory cell C; the right branch produces the candidate information extracted at this step.
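
In the same notation, the two branches are the input gate activation and the candidate values:

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)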

[Figure: updating the cell state]

After passing through the forget gate and the input gate, the memory cell has been updated.

Note that the memory cell in an LSTM only passes through the forget gate and the input gate; it does not pass through the output gate directly.
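
Combining the two gates, the new cell state is the old state scaled by the forget gate plus the candidate values scaled by the input gate:

C_t = f_t * C_{t-1} + i_t * \tilde{C}_t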

Output gate
[Figure: the output gate]

The output gate takes in information from three sources and produces outputs in two directions.

The three inputs are: the current time step's input, the previous time step's output, and the current contents of the memory cell.

The two outputs are: the prediction for the current step, and the hidden state fed into the next time step.
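
In the same notation, the output gate and the new hidden state are:

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
h_t = o_t * \tanh(C_t)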

How do LSTMs solve the long-range dependency problem?

Compared with a simple RNN, an LSTM does not rely solely on the rapidly changing hidden state to produce its predictions; it also draws on the information stored in the memory cell C.

For example, consider a prediction task with a long-range dependency:

I grew up in France… I speak fluent ().

When the LSTM reads "France", it stores that information at particular positions in the memory cell. As later time steps pass, the "France" information is diluted by the forget gate's multiplications, but this dilution is deliberately weak; if the memory were washed out too aggressively, the model would behave much like a simple RNN. (One might ask: how strong should this washing-out be? Answer: the LSTM learns that on its own.) When the LSTM reads "fluent", it combines it with the "France" information in the memory cell and predicts "French".
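
A small numeric sketch of this effect (the forget-gate values here are invented for illustration, not learned):

import torch

# How much of a stored value survives 20 time steps of the forget gate's
# pointwise multiplication, for two different gate values.
c = torch.tensor([1.0, 1.0])          # a "France" feature stored in the cell state
forget = torch.tensor([0.98, 0.50])   # hypothetical forget-gate activations
for _ in range(20):
    c = forget * c
print(c)  # ~0.67 survives with a gate of 0.98; ~1e-6 with a gate of 0.50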

What is a peephole?

In 2000, Gers & Schmidhuber introduced a variant of the LSTM. The peephole connection, shown in the figure, lets the three gates also make use of the information in the memory cell, which makes the model more powerful.

[Figure: an LSTM with peephole connections]
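
In the colah post's notation, the peephole connections simply add the cell state to each gate's input:

f_t = \sigma(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f)
i_t = \sigma(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i)
o_t = \sigma(W_o \cdot [C_t, h_{t-1}, x_t] + b_o)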

The figure below shows the corresponding structure as drawn in Hung-yi Lee's (李宏毅) lectures; it is exactly the same.

[Figure: the same peephole structure as drawn in Hung-yi Lee's lectures]

PyTorch demo

# https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html?highlight=lstm
# tensorboard --logdir=runs/lstm --host=127.0.0.1
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)


class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores


def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)


training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]
word_to_ix = {}
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
print(word_to_ix)
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}

# These will usually be more like 32 or 64 dimensional.
# We will keep them small, so we can see how the weights change as we train.
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# See what the scores are before training
# Note that element i,j of the output is the score for tag j for word i.
# Here we don't need to train, so the code is wrapped in torch.no_grad()

from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter('../runs/lstm')

with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)
    writer.add_graph(model, inputs)
    writer.close()

    print(tag_scores)

for epoch in range(300):  # again, normally you would NOT do 300 epochs, it is toy data
    for sentence, tags in training_data:
        # Step 1. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Get our inputs ready for the network, that is, turn them into
        # Tensors of word indices.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)

        # Step 3. Run our forward pass.
        tag_scores = model(sentence_in)

        # Step 4. Compute the loss, gradients, and update the parameters by
        #  calling optimizer.step()
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()

# See what the scores are after training
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)

    # The sentence is "the dog ate the apple".  i,j corresponds to score for tag j
    # for word i. The predicted tag is the maximum scoring tag.
    # Here, we can see the predicted sequence below is 0 1 2 0 1
    # since 0 is index of the maximum value of row 1,
    # 1 is the index of maximum value of row 2, etc.
    # Which is DET NOUN VERB DET NOUN, the correct sequence!
    print(tag_scores)      


Multi-layer LSTM

Like a simple RNN, an LSTM can be stacked into multiple layers and can also be made bidirectional, as in the figure and the short sketch below.

[Figure: a multi-layer (stacked) LSTM]
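
A minimal PyTorch sketch of stacking and bidirectionality (the dimensions are arbitrary, chosen only for illustration):

import torch
import torch.nn as nn

# Two stacked LSTM layers, each running in both directions.
lstm = nn.LSTM(input_size=6, hidden_size=6, num_layers=2, bidirectional=True)

x = torch.randn(5, 1, 6)   # (seq_len, batch, input_size)
output, (h_n, c_n) = lstm(x)
print(output.shape)        # (5, 1, 12): forward and backward outputs are concatenated
print(h_n.shape)           # (4, 1, 6):  num_layers * num_directions final hidden states
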
References

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

https://www.bilibili.com/video/BV1JE411g7XF?p=20