Contents
- Understanding LSTM and its diagrams
This article is a translation of Shi Yan's blog post Understanding LSTM and its diagrams, which presents the author's more accessible take on Christopher Olah's post Understanding LSTM Networks.
Chinese translation of Understanding LSTM Networks: 【翻譯】了解 LSTM 網絡
Understanding LSTM and its diagrams
I'm not good at explaining LSTM; I'm writing this down as a way to remember it myself. I think the blog post written by Christopher Olah is the best LSTM material you will find. Please visit the original link if you want to learn about LSTM. (But I did create some nicer diagrams.)
Although we don't know how the brain functions yet, we have the feeling that it must have a logic unit and a memory unit. We make decisions by reasoning and by experience. Computers are similar: they have logic units, CPUs and GPUs, and they also have memory.
But when you look at a neural network, it functions like a black box. You feed in some inputs from one side and receive some outputs from the other side. The decision it makes is based mostly on the current inputs.
I think it's unfair to say that a neural network has no memory at all. After all, the learnt weights are a kind of memory of the training data. But this memory is rather static. Sometimes we want to remember an input for later use. There are many examples of such situations, such as the stock market: to make a good investment judgement, we have to look at the stock data over at least some time window.
The naive way to let a neural network accept time series data is to connect several neural networks together, with each network handling one time step. Instead of feeding the data at each individual time step, you provide data at all time steps within a window, or a context, to the network.
A lot of the time, you need to process data that has periodic patterns. As a silly example, suppose you want to predict Christmas tree sales. This is a very seasonal thing, likely to peak only once a year. A good strategy for predicting Christmas tree sales is therefore to look at the data from exactly one year back. For this kind of problem, you either need a very large context that includes the old data points, or you need a good memory: you have to know which data is valuable to remember for later use and which needs to be forgotten once it is useless.
Theoretically the naively connected neural network, the so-called recurrent neural network, can work. But in practice it suffers from two problems, vanishing gradients and exploding gradients, which make it unusable.
LSTM (long short-term memory) was later invented to solve this issue by explicitly introducing a memory unit, called the cell, into the network. This is the diagram of an LSTM building block:

At first sight, this looks intimidating. Let's ignore the internals and only look at the inputs and outputs of the unit. The network takes three inputs: \(X_t\) is the input at the current time step; \(h_{t-1}\) is the output of the previous LSTM unit; and \(C_{t-1}\) is the “memory” of the previous unit, which I think is the most important input. As for outputs, \(h_t\) is the output of the current network and \(C_t\) is the memory of the current unit.
Therefore, this single unit makes its decision by considering the current input, the previous output, and the previous memory, and it generates a new output and alters its memory.
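In equation form, a single LSTM step can be summarized as the mapping below (this compact notation is my addition, not from the original post):

\[
(h_t,\; C_t) = \mathrm{LSTM}(X_t,\; h_{t-1},\; C_{t-1})
\]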
The way its internal memory \(C_t\) changes is pretty similar to piping water through a pipe. Assume the memory is water flowing through a pipe. You want to change this memory flow along the way, and this change is controlled by two valves.
The first valve is called the forget valve. If you shut it, none of the old memory will be kept. If you fully open it, all of the old memory will pass through.
The second valve is the new memory valve. New memory comes in through a T-shaped joint and merges with the old memory. Exactly how much new memory comes in is controlled by this second valve.
On the LSTM diagram, the top “pipe” is the memory pipe. Its input is the old memory (a vector). The first \(\times\) it passes through is the forget valve; it is actually an element-wise multiplication. If you multiply the old memory \(C_{t-1}\) by a vector that is close to 0, it means you want to forget most of the old memory. If your forget valve equals 1, the old memory passes through completely.
The second operation the memory flow goes through is the \(+\) operator, an element-wise summation. It resembles the T-shaped joint pipe: the new memory and the old memory merge through this operation. How much new memory is added to the old memory is controlled by another valve, the \(\times\) below the \(+\) sign.
After these two operations, the old memory \(C_{t-1}\) has been changed into the new memory \(C_t\).
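Written out, the two operations on the memory pipe amount to the standard LSTM cell-state update (the symbols \(f_t\), \(i_t\) and \(\tilde{C}_t\) for the forget valve, the new memory valve and the candidate new memory are conventional notation, not names used in the original post):

\[
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
\]

where \(\odot\) denotes element-wise multiplication.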
Now let's look at the valves. The first one is the forget valve. It is controlled by a simple one-layer neural network. The inputs of this network are \(h_{t-1}\) (the output of the previous LSTM block), \(X_t\) (the input for the current LSTM block), \(C_{t-1}\) (the memory of the previous block), and finally a bias vector \(b_0\). The network has a sigmoid activation, and its output vector is the forget valve, which is applied to the old memory \(C_{t-1}\) by element-wise multiplication.
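As a sketch in formula form, with \(\sigma\) the sigmoid function and \(W_f\) the weight matrix of this one-layer network, the forget valve can be written as

\[
f_t = \sigma\!\left(W_f \cdot [h_{t-1},\; X_t,\; C_{t-1}] + b_0\right)
\]

Note that feeding \(C_{t-1}\) into the gate is the “peephole”-style variant described here; many common LSTM formulations compute the gates from \(h_{t-1}\) and \(X_t\) only.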
The second valve is called the new memory valve. Again, it is a simple one-layer neural network that takes the same inputs as the forget valve. This valve controls how much the new memory should influence the old memory.
The new memory itself, however, is generated by another neural network. It is also a one-layer network, but it uses tanh as the activation function. The output of this network is element-wise multiplied by the new memory valve and added to the old memory to form the new memory.
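In the same conventional notation (again my symbols, not the post's), with \(i_t\) for the new memory valve and \(\tilde{C}_t\) for the freshly generated memory:

\[
i_t = \sigma\!\left(W_i \cdot [h_{t-1},\; X_t,\; C_{t-1}] + b_1\right), \qquad
\tilde{C}_t = \tanh\!\left(W_C \cdot [h_{t-1},\; X_t] + b_2\right)
\]

and the two results enter the memory pipe as the term \(i_t \odot \tilde{C}_t\) in the cell-state update above.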
The two \(\times\) signs in the diagram are the forget valve and the new memory valve.
Finally, we need to generate the output of this LSTM unit. This step has an output valve that is controlled by the new memory \(C_t\), the previous output \(h_{t-1}\), the current input \(X_t\), and a bias vector. This valve controls how much of the new memory should be output to the next LSTM unit.
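To close the loop in the same notation (a sketch, with \(o_t\) as the output valve; the post feeds the new memory \(C_t\) into this gate, whereas many textbook formulations use only \(h_{t-1}\) and \(X_t\)):

\[
o_t = \sigma\!\left(W_o \cdot [h_{t-1},\; X_t,\; C_t] + b_3\right), \qquad
h_t = o_t \odot \tanh(C_t)
\]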
The diagram above is inspired by Christopher's blog post. But most of the time you will see a diagram like the one below. The major difference between the two variations is that the following diagram doesn't treat the memory \(C\) as an input to the unit; instead, it treats it as something internal to the “cell”.
I like Christopher's diagram because it explicitly shows how the memory \(C\) gets passed from the previous unit to the next. In the following image, you can't easily see that \(C_{t-1}\) actually comes from the previous unit and that \(C_t\) is part of the output.
The second reason I don't like the following diagram is that the computation performed within the unit is ordered, but you can't see that clearly from the diagram. For example, to calculate the output of the unit you need \(C_t\), the new memory, to be ready; therefore the first step should be evaluating \(C_t\).
The following diagram tries to represent this “delay” or “order” with dashed and solid lines (there are some errors in the picture). Dashed lines stand for the old memory, which is available at the beginning; solid lines stand for the new memory. Operations that require the new memory have to wait until \(C_t\) is available.
But these two diagrams are essentially the same. Here, I redraw the diagram above using the same symbols and colors as the first diagram:
This is the forget gate (valve) that shuts off the old memory:
This is the new memory valve and the new memory:
These are the two valves and the element-wise summation that merge the old memory and the new memory to form \(C_t\) (in green, flowing back to the big “Cell”):
This is the output valve and the output of the LSTM unit:
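Putting the whole walk-through together, here is a minimal NumPy sketch of one LSTM step as described above. It is an illustration only: the weight and bias names (W_f, W_i, W_c, W_o, b_f, ...) are my own, and the choice to feed the memory into the gates follows the peephole-style connections mentioned in the text; real implementations usually fuse these matrices and may omit the peepholes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step following the walk-through above (a sketch, not a reference implementation).

    x_t, h_prev, c_prev are 1-D vectors; params is a dict of weight matrices and bias vectors.
    """
    # Gate inputs: previous output, current input, and (peephole-style) the old memory.
    gate_in = np.concatenate([h_prev, x_t, c_prev])

    # Forget valve: decides how much of the old memory passes through.
    f_t = sigmoid(params["W_f"] @ gate_in + params["b_f"])

    # New memory valve: decides how much fresh memory is let in.
    i_t = sigmoid(params["W_i"] @ gate_in + params["b_i"])

    # Candidate new memory, generated by a tanh layer from h_prev and x_t.
    c_tilde = np.tanh(params["W_c"] @ np.concatenate([h_prev, x_t]) + params["b_c"])

    # Memory pipe: element-wise forget, then element-wise add the gated new memory.
    c_t = f_t * c_prev + i_t * c_tilde

    # Output valve uses the *new* memory, so it must be computed after c_t.
    o_t = sigmoid(params["W_o"] @ np.concatenate([h_prev, x_t, c_t]) + params["b_o"])
    h_t = o_t * np.tanh(c_t)

    return h_t, c_t
```

The ordering inside the function also makes the point from the diagram discussion explicit: the gates and the candidate memory are evaluated first, \(C_t\) is formed next, and only then can the output valve and \(h_t\) be computed.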