
LSTM: Translation and Commentary on "Long Short-Term Memory" (Part 2)

3 CONSTANT ERROR BACKPROP

3.1 EXPONENTIALLY DECAYING ERROR

Conventional BPTT (e.g. Williams and Zipser 1992). Output unit k's target at time t is denoted by d_k(t). Using mean squared error, k's error signal is ϑ_k(t) = f'_k(net_k(t)) (d_k(t) − y^k(t)), and some non-output unit j's backpropagated error signal is ϑ_j(t) = f'_j(net_j(t)) Σ_i w_ij ϑ_i(t+1).


The corresponding contribution to w_jl's total weight update is α ϑ_j(t) y^l(t−1), where α is the learning rate, and l stands for an arbitrary unit connected to unit j.

Outline of Hochreiter's analysis (1991, pages 19-21). Suppose we have a fully connected net whose non-input unit indices range from 1 to n. Let us focus on local error flow from unit u to unit v (later we will see that the analysis immediately extends to global error flow). The error occurring at an arbitrary unit u at time step t is propagated "back into time" for q time steps, to an arbitrary unit v. This will scale the error by the following factor:

∂ϑ_v(t−q) / ∂ϑ_u(t) = f'_v(net_v(t−1)) w_uv for q = 1, and
∂ϑ_v(t−q) / ∂ϑ_u(t) = f'_v(net_v(t−q)) Σ_{l=1}^{n} (∂ϑ_l(t−q+1) / ∂ϑ_u(t)) w_lv for q > 1.

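A minimal numerical sketch of this scaling effect (mine, not from the paper), assuming a single backward path through logistic sigmoid units with a shared weight w: each of the q steps multiplies the error by f'(net) * w, so the backpropagated error either vanishes or blows up exponentially in q.

```python
import numpy as np

def sigmoid_prime(net):
    """Derivative of the logistic sigmoid; its maximum is 0.25 at net = 0."""
    s = 1.0 / (1.0 + np.exp(-net))
    return s * (1.0 - s)

def scaled_error(q, w, net=0.0, error=1.0):
    """Error after being propagated back q steps along a single path;
    each step multiplies it by f'(net) * w (cf. Section 3.1)."""
    for _ in range(q):
        error *= sigmoid_prime(net) * w
    return error

for q in (1, 10, 50):
    # |f'(net) * w| < 1.0: the error vanishes exponentially with q
    print(q, scaled_error(q, w=1.0))   # 0.25, ~9.5e-07, ~7.9e-31
    # |f'(net) * w| > 1.0: the error blows up exponentially with q
    print(q, scaled_error(q, w=8.0))   # 2.0,  1024.0,   ~1.1e+15
```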

3.2 CONSTANT ERROR FLOW: NAIVE APPROACH

A single unit. To avoid vanishing error signals, how can we achieve constant error flow through a single unit j with a single connection to itself? According to the rules above, at time t, j's local error back flow is ϑ_j(t) = f'_j(net_j(t)) ϑ_j(t+1) w_jj. To enforce constant error flow through j, we require f'_j(net_j(t)) w_jj = 1.0.


In the experiments, this will be ensured by using the identity function f_j: f_j(x) = x for all x, and by setting w_jj = 1.0. We refer to this as the constant error carrousel (CEC). CEC will be LSTM's central feature (see Section 4). Of course unit j will not only be connected to itself but also to other units. This invokes two obvious, related problems (also inherent in all other gradient-based approaches):

1. Input weight conflict: for simplicity, let us focus on a single additional input weight w_ji. Assume that the total error can be reduced by switching on unit j in response to a certain input, and keeping it active for a long time (until it helps to compute a desired output). Provided i is non-zero, since the same incoming weight has to be used for both storing certain inputs and ignoring others, w_ji will often receive conflicting weight update signals during this time (recall that j is linear): these signals will attempt to make w_ji participate in (1) storing the input (by switching on j) and (2) protecting the input (by preventing j from being switched off by irrelevant later inputs). This conflict makes learning difficult, and calls for a more context-sensitive mechanism for controlling "write operations" through input weights.

2. Output weight conflict: assume j is switched on and currently stores some previous input. For simplicity, let us focus on a single additional outgoing weight w_kj. The same w_kj has to be used for both retrieving j's content at certain times and preventing j from disturbing k at other times. As long as unit j is non-zero, w_kj will attract conflicting weight update signals generated during sequence processing: these signals will attempt to make w_kj participate in (1) accessing the information stored in j and, at different times, (2) protecting unit k from being perturbed by j. For instance, with many tasks there are certain "short time lag errors" that can be reduced in early training stages. However, at later training stages j may suddenly start to cause avoidable errors in situations that already seemed under control by attempting to participate in reducing more difficult "long time lag errors". Again, this conflict makes learning difficult, and calls for a more context-sensitive mechanism for controlling "read operations" through output weights.



Of course, input and output weight conflicts are not specific to long time lags, but occur for short time lags as well. Their effects, however, become particularly pronounced in the long time lag case: as the time lag increases, (1) stored information must be protected against perturbation for longer and longer periods, and, especially in advanced stages of learning, (2) more and more already correct outputs also require protection against perturbation.

Due to the problems above, the naive approach does not work well except in the case of certain simple problems involving local input/output representations and non-repeating input patterns (see Hochreiter 1991 and Silva et al. 1996). The next section shows how to do it right.

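The following toy sketch (an illustration under the assumptions of Section 3.2, not code from the paper) contrasts a saturating self-connected unit with the naive CEC: with the identity activation and w_jj = 1.0 the locally backpropagated error ϑ_j(t) = f'_j(net_j(t)) ϑ_j(t+1) w_jj stays constant, while any |f'_j w_jj| < 1 makes it decay.

```python
def backprop_through_self_loop(error, steps, w_jj, f_prime):
    """Error flowing back through a single self-connected unit j:
    theta_j(t) = f'_j(net_j(t)) * theta_j(t+1) * w_jj  (Section 3.2)."""
    for _ in range(steps):
        error = f_prime * error * w_jj
    return error

# Logistic unit (f' <= 0.25): the error has all but vanished after 100 steps.
print(backprop_through_self_loop(1.0, 100, w_jj=1.0, f_prime=0.25))  # ~6.2e-61

# Constant error carrousel: identity activation (f' = 1) and w_jj = 1.0
# keep the error signal exactly constant over arbitrarily long time lags.
print(backprop_through_self_loop(1.0, 100, w_jj=1.0, f_prime=1.0))   # 1.0
```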

4 LONG SHORT-TERM MEMORY

Memory cells and gate units. To construct an architecture that allows for constant error flow through special, self-connected units without the disadvantages of the naive approach, we extend the constant error carrousel CEC embodied by the self-connected, linear unit j from Section 3.2 by introducing additional features. A multiplicative input gate unit is introduced to protect the memory contents stored in j from perturbation by irrelevant inputs. Likewise, a multiplicative output gate unit is introduced which protects other units from perturbation by currently irrelevant memory contents stored in j.

Figure 1: Architecture of memory cell c_j (the box) and its gate units in_j and out_j. The self-recurrent connection (with weight 1.0) indicates feedback with a delay of 1 time step. It builds the basis of the "constant error carrousel" CEC. The gate units open and close access to CEC. See text and appendix A.1 for details.


Why gate units? To avoid input weight conflicts, in_j controls the error flow to memory cell c_j's input connections w_{c_j,i}. To circumvent c_j's output weight conflicts, out_j controls the error flow from unit j's output connections. In other words, the net can use in_j to decide when to keep or override information in memory cell c_j, and out_j to decide when to access memory cell c_j and when to prevent other units from being perturbed by c_j (see Figure 1).

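To make the gating concrete, here is a minimal forward pass for a single memory cell in the spirit of Figure 1. The weight vectors are hypothetical, and the choice of tanh for g and h and logistic sigmoids for the gates is a simplification of the paper's scaled squashing functions (Appendix A.1); this is a sketch, not the paper's exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def memory_cell_step(y_prev, s_prev, W_in, W_out, W_c):
    """One time step of a single 1997-style memory cell c_j (no forget gate).

    y_prev: activations y(t-1) of the units feeding the cell and its gates.
    s_prev: previous internal state s_cj(t-1).
    W_in, W_out, W_c: incoming weight vectors of the input gate, the output
    gate and the cell input (illustrative; biases omitted).
    """
    y_in  = sigmoid(W_in  @ y_prev)   # input gate activation  y_inj(t)
    y_out = sigmoid(W_out @ y_prev)   # output gate activation y_outj(t)
    g     = np.tanh(W_c @ y_prev)     # squashed cell input    g(net_cj(t))
    s     = s_prev + y_in * g         # CEC: s_cj(t) = s_cj(t-1) + y_inj(t) * g
    y_c   = y_out * np.tanh(s)        # cell output: y_cj(t) = y_outj(t) * h(s_cj(t))
    return y_c, s

# Toy usage: four incoming units, internal state starts at 0.
rng = np.random.default_rng(0)
W_in, W_out, W_c = rng.standard_normal((3, 4))
y_prev, s = rng.standard_normal(4), 0.0
for t in range(3):
    y_c, s = memory_cell_step(y_prev, s, W_in, W_out, W_c)
    print(t, float(y_c), float(s))
```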

Error signals trapped within a memory cell's CEC cannot change, but different error signals flowing into the cell (at different times) via its output gate may get superimposed. The output gate will have to learn which errors to trap in its CEC, by appropriately scaling them. The input gate will have to learn when to release errors, again by appropriately scaling them. Essentially, the multiplicative gate units open and close access to constant error flow through CEC.

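Schematically (with illustrative numbers, not values from the paper), the error is scaled exactly twice on its way through a memory cell: once by the output gate and h' when it enters the CEC, and once by the input gate and g' when it leaves; in between it can flow back for arbitrarily many steps unchanged.

```python
def error_entering_cec(e_out, y_out, h_prime):
    """Error arriving at the cell output is scaled once, by the output gate
    activation and h', as it is trapped in the CEC."""
    return e_out * y_out * h_prime

def error_leaving_cec(e_cec, y_in, g_prime):
    """Inside the CEC the error is never rescaled; it is scaled a second
    time only when it leaves through the input gate and g."""
    return e_cec * y_in * g_prime

# The trapped error is unchanged no matter how many steps it flows back:
e = error_entering_cec(e_out=1.0, y_out=0.8, h_prime=0.5)   # 0.4
for _ in range(1000):
    e *= 1.0                                                # CEC: no scaling
print(error_leaving_cec(e, y_in=0.7, g_prime=0.9))          # ~0.252
```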

Distributed output representations typically do require output gates. Not always are both gate types necessary, though; one may be sufficient. For instance, in Experiments 2a and 2b in Section 5, it will be possible to use input gates only. In fact, output gates are not required in case of local output encoding: preventing memory cells from perturbing already learned outputs can be done by simply setting the corresponding weights to zero. Even in this case, however, output gates can be beneficial: they prevent the net's attempts at storing long time lag memories (which are usually hard to learn) from perturbing activations representing easily learnable short time lag memories. (This will prove quite useful in Experiment 1, for instance.)


Network topology. We use networks with one input layer, one hidden layer, and one output layer. The (fully) self-connected hidden layer contains memory cells and corresponding gate units (for convenience, we refer to both memory cells and gate units as being located in the hidden layer). The hidden layer may also contain "conventional" hidden units providing inputs to gate units and memory cells. All units (except for gate units) in all layers have directed connections (serve as inputs) to all units in the layer above (or to all higher layers; Experiments 2a and 2b).

Memory cell blocks. S memory cells sharing the same input gate and the same output gate form a structure called a "memory cell block of size S". Memory cell blocks facilitate information storage; as with conventional neural nets, it is not so easy to code a distributed input within a single cell. Since each memory cell block has as many gate units as a single memory cell (namely two), the block architecture can be even slightly more efficient (see paragraph "computational complexity"). A memory cell block of size 1 is just a simple memory cell. In the experiments (Section 5), we will use memory cell blocks of various sizes.


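A sketch of a memory cell block of size S, reusing the simplified activation choices from the single-cell sketch above (again an illustration, not the paper's exact formulation): one shared input gate and one shared output gate control S internal states, so the number of gate units stays at two regardless of S.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def memory_cell_block_step(y_prev, s_prev, W_in, W_out, W_c):
    """One step of a memory cell block of size S (S = len(s_prev)).

    A single input gate and a single output gate (weight vectors W_in, W_out)
    are shared by all S cells; W_c holds one row of cell-input weights per
    cell.  With S = 1 this reduces to the single-cell step above.
    """
    y_in  = sigmoid(W_in  @ y_prev)    # one shared input gate
    y_out = sigmoid(W_out @ y_prev)    # one shared output gate
    g     = np.tanh(W_c @ y_prev)      # S squashed cell inputs
    s     = s_prev + y_in * g          # S internal states, one CEC each
    return y_out * np.tanh(s), s       # S cell outputs

rng = np.random.default_rng(1)
n_in, S = 4, 3                         # 4 incoming units, block size S = 3
W_in, W_out = rng.standard_normal((2, n_in))
W_c = rng.standard_normal((S, n_in))
y_c, s = memory_cell_block_step(rng.standard_normal(n_in), np.zeros(S),
                                W_in, W_out, W_c)
print(y_c.shape, s.shape)              # (3,) (3,)
```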

Learning. We use a variant of RTRL (e.g., Robinson and Fallside 1987) which properly takes into account the altered, multiplicative dynamics caused by input and output gates. However, to ensure non-decaying error backprop through internal states of memory cells, as with truncated BPTT (e.g., Williams and Peng 1990), errors arriving at "memory cell net inputs" (for cell c_j, this includes net_cj, net_inj, net_outj) do not get propagated back further in time (although they do serve to change the incoming weights). Only within memory cells are errors propagated back through previous internal states s_cj. To visualize this: once an error signal arrives at a memory cell output, it gets scaled by the output gate activation and h'. Then it is within the memory cell's CEC, where it can flow back indefinitely without ever being scaled. Only when it leaves the memory cell through the input gate and g is it scaled once more, by the input gate activation and g'. It then serves to change the incoming weights before it is truncated (see appendix for explicit formulae).
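The following sketch conveys the flavor of this bookkeeping in a heavily simplified form (the names, the tanh/sigmoid choices, and the single-cell setting are my assumptions; the exact update rules are in Appendix A.1): the only quantities carried forward in time are the partials ∂s_cj/∂w for the cell-input and input-gate weights, updated online at every step and later multiplied by the truncated error that reaches the internal state.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_state_partials(y_prev, dS_c, dS_in, W_c, W_in):
    """Online update of the stored partials d s_cj / d w for one cell.

    dS_c[m]  ~ d s_cj / d w_{cj,m}   (cell input weights)
    dS_in[m] ~ d s_cj / d w_{inj,m}  (input gate weights)
    These are carried forward through time like the internal state itself;
    all other error paths into the cell's net inputs are truncated.
    """
    g    = np.tanh(W_c @ y_prev)              # g(net_cj(t))
    y_in = sigmoid(W_in @ y_prev)             # y_inj(t)
    g_prime = 1.0 - g ** 2                    # g'(net_cj(t))
    f_prime = y_in * (1.0 - y_in)             # f'_inj(net_inj(t))
    dS_c  = dS_c  + y_in * g_prime * y_prev   # + d(y_inj * g)/d w_{cj,m}
    dS_in = dS_in + g * f_prime * y_prev      # + d(y_inj * g)/d w_{inj,m}
    return dS_c, dS_in

# A weight change at time t then uses only the truncated error e_s reaching
# the internal state: delta w_{cj,m} = alpha * e_s * dS_c[m] (likewise W_in).
rng = np.random.default_rng(2)
n = 4
W_c, W_in, y_prev = rng.standard_normal((3, n))
print(update_state_partials(y_prev, np.zeros(n), np.zeros(n), W_c, W_in))
```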

Computational complexity. As with Mozer's focused recurrent backprop algorithm (Mozer 1989), only the derivatives ∂s_cj/∂w_il need to be stored and updated. Hence the LSTM algorithm is very efficient, with an excellent update complexity of O(W), where W is the number of weights (see details in appendix A.1). Hence, LSTM and BPTT for fully recurrent nets have the same update complexity per time step (while RTRL's is much worse). Unlike full BPTT, however, LSTM is local in space and time: there is no need to store activation values observed during sequence processing in a stack with potentially unlimited size.



Abuse problem and solutions. In the beginning of the learning phase, error reduction may be possible without storing information over time. The network will thus tend to abuse memory cells, e.g., as bias cells (i.e., it might make their activations constant and use the outgoing connections as adaptive thresholds for other units). The potential difficulty is: it may take a long time to release abused memory cells and make them available for further learning. A similar "abuse problem" appears if two memory cells store the same (redundant) information. There are at least two solutions to the abuse problem: (1) Sequential network construction (e.g., Fahlman 1991): a memory cell and the corresponding gate units are added to the network whenever the error stops decreasing (see Experiment 2 in Section 5). (2) Output gate bias: each output gate gets a negative initial bias, to push initial memory cell activations towards zero. Memory cells with more negative bias automatically get "allocated" later (see Experiments 1, 3, 4, 5, 6 in Section 5).


Internal state drift and remedies. If memory cell c_j's inputs are mostly positive or mostly negative, then its internal state s_j will tend to drift away over time. This is potentially dangerous, because h'(s_j) will then adopt very small values, and the gradient will vanish. One way to circumvent this problem is to choose an appropriate function h. But h(x) = x, for instance, has the disadvantage of an unrestricted memory cell output range. Our simple but effective way of solving drift problems at the beginning of learning is to initially bias the input gate in_j towards zero. Although there is a trade-off between the magnitudes of h'(s_j) on the one hand and of y_inj and f'_inj on the other, the potential negative effect of the input gate bias is negligible compared to that of the drifting effect. With logistic sigmoid activation functions, there appears to be no need for fine-tuning the initial bias, as confirmed by Experiments 4 and 5 in Section 5.4.
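A sketch of the two initial-bias remedies above (the concrete bias values are illustrative assumptions, not taken from the paper): increasingly negative output gate biases push initial cell outputs towards zero and stagger when cells get "allocated", while a negative initial input gate bias keeps y_inj near zero and thereby limits early internal state drift.

```python
import numpy as np

def init_gate_biases(n_blocks, out_step=-0.5, in_bias=-1.0):
    """Illustrative initial biases for the gates of n_blocks cell blocks.

    Output gates get increasingly negative biases (-0.5, -1.0, -1.5, ...):
    initial cell outputs stay near zero and cells are "allocated" one after
    another instead of being abused as bias cells (Section 4, abuse problem).
    Input gates get a negative bias so that y_inj starts near zero, which
    limits internal state drift at the beginning of learning.
    """
    out_gate_bias = out_step * np.arange(1, n_blocks + 1)
    in_gate_bias = np.full(n_blocks, in_bias)
    return out_gate_bias, in_gate_bias

out_b, in_b = init_gate_biases(4)
print(out_b)   # [-0.5 -1.  -1.5 -2. ]
print(in_b)    # [-1. -1. -1. -1.]
```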
