LSTM: A Translation and Commentary on “Understanding LSTM Networks” (Part 2)

The Core Idea Behind LSTMs

The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.

The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged.

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation.

The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”

An LSTM has three of these gates, to protect and control the cell state.

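As a minimal sketch (not code from the original post) of what one of these gates does, the snippet below applies a sigmoid layer to an input and uses its output to scale a vector pointwise; the weight matrix W, bias b, and helper names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative shapes: a 4-dimensional vector gated from a 3-dimensional input.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))   # hypothetical gate weights
b = np.zeros(4)                   # hypothetical gate bias

def gate(gate_input, vector):
    """Sigmoid layer + pointwise multiplication: each output in (0, 1) says
    how much of the corresponding component of `vector` to let through."""
    g = sigmoid(W @ gate_input + b)
    return g * vector   # g ~ 0: let nothing through; g ~ 1: let everything through
```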

Step-by-Step LSTM Walk Through

The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.” It looks at h_{t-1} and x_t, and outputs a number between 0 and 1 for each number in the cell state C_{t-1}. A 1 represents “completely keep this” while a 0 represents “completely get rid of this.”

Let’s go back to our example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.

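As a concrete sketch of the forget gate described above (a NumPy illustration, assuming a hypothetical weight matrix W_f and bias b_f sized for the concatenated [h_{t-1}, x_t] vector):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(h_prev, x, W_f, b_f):
    """f_t = sigmoid(W_f . [h_{t-1}, x_t] + b_f): one number in (0, 1) per entry
    of the cell state C_{t-1}; 1 means 'completely keep this', 0 means 'get rid of this'."""
    return sigmoid(W_f @ np.concatenate([h_prev, x]) + b_f)
```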

The next step is to decide what new information we’re going to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values, \tilde{C}_t, that could be added to the state. In the next step, we’ll combine these two to create an update to the state.

In the example of our language model, we’d want to add the gender of the new subject to the cell state, to replace the old one we’re forgetting.

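A corresponding sketch of the two parts of this step, again with hypothetical weights (W_i, b_i for the input gate and W_C, b_C for the candidate layer):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gate_and_candidate(h_prev, x, W_i, b_i, W_C, b_C):
    """i_t   = sigmoid(W_i . [h_{t-1}, x_t] + b_i)   -- which entries to update
    C~_t = tanh(W_C . [h_{t-1}, x_t] + b_C)          -- candidate values to add"""
    concat = np.concatenate([h_prev, x])
    i_t = sigmoid(W_i @ concat + b_i)
    C_tilde = np.tanh(W_C @ concat + b_C)
    return i_t, C_tilde
```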

It’s now time to update the old cell state, C_{t-1}, into the new cell state C_t. The previous steps already decided what to do; we just need to actually do it.

We multiply the old state by f_t, forgetting the things we decided to forget earlier. Then we add i_t ∗ \tilde{C}_t. These are the new candidate values, scaled by how much we decided to update each state value.

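The update itself is just pointwise arithmetic; a toy sketch with arbitrarily chosen values:

```python
import numpy as np

def update_cell_state(C_prev, f_t, i_t, C_tilde):
    """C_t = f_t * C_{t-1} + i_t * C~_t  (all products are pointwise)."""
    return f_t * C_prev + i_t * C_tilde

# Toy example: a forget value near 0 erases an entry, one near 1 preserves it.
C_prev  = np.array([0.9, -0.5])
f_t     = np.array([0.1,  0.9])
i_t     = np.array([0.8,  0.1])
C_tilde = np.array([1.0,  1.0])
print(update_cell_state(C_prev, f_t, i_t, C_tilde))   # -> [ 0.89 -0.35]
```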

In the case of the language model, this is where we’d actually drop the information about the old subject’s gender and add the new information, as we decided in the previous steps.

Finally, we need to decide what we’re going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that’s what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows next.

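A sketch of this output step, with a hypothetical output-gate weight matrix W_o and bias b_o:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_step(h_prev, x, C_t, W_o, b_o):
    """o_t = sigmoid(W_o . [h_{t-1}, x_t] + b_o)
    h_t = o_t * tanh(C_t)   -- a filtered version of the cell state."""
    o_t = sigmoid(W_o @ np.concatenate([h_prev, x]) + b_o)
    return o_t * np.tanh(C_t)
```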

Variants on Long Short Term Memory

What I’ve described so far is a pretty normal LSTM. But not all LSTMs are the same as the above. In fact, it seems like almost every paper involving LSTMs uses a slightly different version. The differences are minor, but it’s worth mentioning some of them.

One popular LSTM variant, introduced by Gers & Schmidhuber (2000), is adding “peephole connections.” This means that we let the gate layers look at the cell state.

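One common way to write such a peephole gate is to give it the cell state as an extra input; the sketch below assumes W_f is sized for the longer concatenated vector [C_{t-1}, h_{t-1}, x_t]:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def peephole_forget_gate(C_prev, h_prev, x, W_f, b_f):
    """Peephole variant: f_t = sigmoid(W_f . [C_{t-1}, h_{t-1}, x_t] + b_f),
    so the forget gate can also 'look at' the cell state."""
    return sigmoid(W_f @ np.concatenate([C_prev, h_prev, x]) + b_f)
```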

The above diagram adds peepholes to all the gates, but many papers will give some peepholes and not others.

Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what we should add new information to, we make those decisions together. We only forget when we’re going to input something in its place. We only input new values to the state when we forget something older.

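As a sketch, the coupled variant ties the input gate to the forget gate, so the state update becomes C_t = f_t ∗ C_{t-1} + (1 − f_t) ∗ \tilde{C}_t:

```python
import numpy as np

def coupled_update(C_prev, f_t, C_tilde):
    """Coupled forget/input gates: new information is written exactly where
    old information is forgotten: C_t = f_t * C_{t-1} + (1 - f_t) * C~_t."""
    return f_t * C_prev + (1.0 - f_t) * C_tilde

# Wherever f_t keeps little of the old state, (1 - f_t) admits much of the new candidate.
print(coupled_update(np.array([0.9, -0.5]), np.array([0.1, 0.9]), np.array([1.0, 1.0])))
# -> [ 0.99 -0.35]
```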

A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014). It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.

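A sketch of one GRU step under the usual formulation (all weights W_z, W_r, W_h and biases here are hypothetical placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x, W_z, b_z, W_r, b_r, W_h, b_h):
    """One GRU step: the update gate z_t plays the role of the combined
    forget/input gates, and the cell state is merged into the hidden state."""
    concat = np.concatenate([h_prev, x])
    z_t = sigmoid(W_z @ concat + b_z)                                   # update gate
    r_t = sigmoid(W_r @ concat + b_r)                                   # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x]) + b_h)    # candidate
    return (1.0 - z_t) * h_prev + z_t * h_tilde                         # new hidden state
```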

These are only a few of the most notable LSTM variants. There are lots of others, like Depth Gated RNNs by Yao, et al. (2015). There’s also some completely different approach to tackling long-term dependencies, like Clockwork RNNs by Koutnik, et al. (2014).

Which of these variants is best? Do the differences matter? Greff, et al. (2015) do a nice comparison of popular variants, finding that they’re all about the same. Jozefowicz, et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks.

Conclusion

Earlier, I mentioned the remarkable results people are achieving with RNNs. Essentially all of these are achieved using LSTMs. They really work a lot better for most tasks!

Written down as a set of equations, LSTMs look pretty intimidating. Hopefully, walking through them step by step in this essay has made them a bit more approachable.

LSTMs were a big step in what we can accomplish with RNNs. It’s natural to wonder: is there another big step? A common opinion among researchers is: “Yes! There is a next step and it’s attention!” The idea is to let every step of an RNN pick information to look at from some larger collection of information. For example, if you are using an RNN to create a caption describing an image, it might pick a part of the image to look at for every word it outputs. In fact, Xu, et al. (2015) do exactly this – it might be a fun starting point if you want to explore attention! There’s been a number of really exciting results using attention, and it seems like a lot more are around the corner…

Attention isn’t the only exciting thread in RNN research. For example, Grid LSTMs by Kalchbrenner, et al. (2015) seem extremely promising. Work using RNNs in generative models – such as Gregor, et al. (2015), Chung, et al. (2015), or Bayer & Osendorfer (2015) – also seems very interesting. The last few years have been an exciting time for recurrent neural networks, and the coming ones promise to only be more so!

Acknowledgments

I’m grateful to a number of people for helping me better understand LSTMs, commenting on the visualizations, and providing feedback on this post.

I’m very grateful to my colleagues at Google for their helpful feedback, especially Oriol Vinyals, Greg Corrado, Jon Shlens, Luke Vilnis, and Ilya Sutskever. I’m also thankful to many other friends and colleagues for taking the time to help me, including Dario Amodei, and Jacob Steinhardt. I’m especially thankful to Kyunghyun Cho for extremely thoughtful correspondence about my diagrams.

Before this post, I practiced explaining LSTMs during two seminar series I taught on neural networks. Thanks to everyone who participated in those for their patience with me, and for their feedback.
