5 EXPERIMENTS
Introduction. Which tasks are appropriate to demonstrate the quality of a novel long time lag algorithm? First of all, minimal time lags between relevant input signals and corresponding teacher signals must be long for all training sequences. In fact, many previous recurrent net algorithms sometimes manage to generalize from very short training sequences to very long test sequences. See, e.g., Pollack (1991). But a real long time lag problem does not have any short time lag exemplars in the training set. For instance, Elman's training procedure, BPTT, offline RTRL, online RTRL, etc., fail miserably on real long time lag problems. See, e.g., Hochreiter (1991) and Mozer (1992). A second important requirement is that the tasks should be complex enough such that they cannot be solved quickly by simple-minded strategies such as random weight guessing.
Guessing can outperform many long time lag algorithms. Recently we discovered (Schmidhuber and Hochreiter 1996, Hochreiter and Schmidhuber 1996, 1997) that many long time lag tasks used in previous work can be solved more quickly by simple random weight guessing than by the proposed algorithms. For instance, guessing solved a variant of Bengio and Frasconi's "parity problem" (1994) much faster than the seven methods tested by Bengio et al. (1994) and Bengio and Frasconi (1994). Similarly for some of Miller and Giles' problems (1993). Of course, this does not mean that guessing is a good algorithm. It just means that some previously used problems are not extremely appropriate to demonstrate the quality of previously proposed algorithms.
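The random weight guessing baseline referred to above is trivial to state in code. A minimal sketch, assuming a hypothetical task-specific test `run_trial(weights) -> bool` that builds a net from a flat weight vector and checks whether it solves the task (both the helper and the guessing range are our own illustration):

```python
import numpy as np

def guess_weights(run_trial, n_weights, weight_range=1.0,
                  max_trials=100_000, seed=0):
    """Random weight guessing: repeatedly draw ALL weights at random
    (no learning at all) until the task happens to be solved.
    run_trial(weights) -> bool is a hypothetical task-specific test."""
    rng = np.random.default_rng(seed)
    for trial in range(1, max_trials + 1):
        weights = rng.uniform(-weight_range, weight_range, n_weights)
        if run_trial(weights):
            return weights, trial  # weights found, and number of guesses needed
    return None, max_trials
```

If such a loop succeeds after fewer trials than a learning algorithm needs training sequences, the task cannot discriminate between good and bad algorithms.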
What's common to Experiments 1-6. All our experiments (except for Experiment 1) involve long minimal time lags; there are no short time lag training exemplars facilitating learning. Solutions to most of our tasks are sparse in weight space. They require either many parameters/inputs or high weight precision, such that random weight guessing becomes infeasible.
We always use on-line learning (as opposed to batch learning), and logistic sigmoids as activation functions. For Experiments 1 and 2, initial weights are chosen in the range [-0.2, 0.2], for the other experiments in [-0.1, 0.1]. Training sequences are generated randomly according to the various task descriptions. In slight deviation from the notation in Appendix A1, each discrete time step of each input sequence involves three processing steps (a code sketch follows the list):
(1) Use the current input to set the input units.
(2) Compute activations of hidden units (including input gates, output gates, memory cells).
(3) Compute output unit activations.
Except for Experiments 1, 2a, and 2b, sequence elements are randomly generated on-line, and error signals are generated only at sequence ends. Net activations are reset after each processed input sequence.
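To make steps (1)-(3) concrete, here is a minimal sketch of one such discrete time step for a heavily simplified single-cell-layer LSTM-style net with logistic sigmoids and initial weights drawn from [-0.2, 0.2]. The layer sizes and names are our own, and the simplifications (gates see only the current input; the cell output is not separately squashed) deviate from the paper's exact architecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # logistic sigmoid, used throughout

rng = np.random.default_rng(0)
n_in, n_cell, n_out = 4, 2, 1
# Initial weights drawn uniformly from [-0.2, 0.2], as in Experiments 1 and 2.
W_ig  = rng.uniform(-0.2, 0.2, (n_cell, n_in))   # input gate weights
W_og  = rng.uniform(-0.2, 0.2, (n_cell, n_in))   # output gate weights
W_c   = rng.uniform(-0.2, 0.2, (n_cell, n_in))   # cell input weights
W_out = rng.uniform(-0.2, 0.2, (n_out, n_cell))  # output unit weights

def step(x, s):
    """One discrete time step, following processing steps (1)-(3)."""
    # (1) the current input x sets the input units
    # (2) hidden unit activations: input gates, output gates, memory cells
    y_in  = sigmoid(W_ig @ x)            # input gate activation
    y_out = sigmoid(W_og @ x)            # output gate activation
    s     = s + y_in * sigmoid(W_c @ x)  # cell state: old state plus gated input
    y_c   = y_out * s                    # cell output, gated by the output gate
    # (3) output unit activations
    y = sigmoid(W_out @ y_c)
    return y, s

s = np.zeros(n_cell)                     # activations reset before each sequence
for x in rng.uniform(0, 1, (5, n_in)):   # a toy 5-step input sequence
    y, s = step(x, s)
```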
For comparisons with recurrent nets taught by gradient descent, we give results only for RTRL, except for comparison 2a, which also includes BPTT. Note, however, that untruncated BPTT (see, e.g., Williams and Peng 1990) computes exactly the same gradient as offline RTRL. With long time lag problems, offline RTRL (or BPTT) and the online version of RTRL (no activation resets, online weight changes) lead to almost identical, negative results (as confirmed by additional simulations in Hochreiter 1991; see also Mozer 1992). This is because offline RTRL, online RTRL, and full BPTT all suffer badly from exponential error decay.
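A toy numeric illustration of this decay (ours, not from the paper): error flowing back through q time steps is scaled per unit by roughly (|w| · max f')^q, and the logistic sigmoid's derivative is at most 0.25, so even a fairly large recurrent weight leaves the per-step factor well below 1.

```python
# Per-step error scaling bound: |w| * max f', with max f' = 0.25 for
# the logistic sigmoid. Even with w = 2.0 the factor is only 0.5.
w = 2.0
scale = abs(w) * 0.25
for q in (10, 100, 1000):
    print(q, scale ** q)   # ~1e-3, ~8e-31, ~1e-301: the gradient vanishes
```

With 1000-step lags, as in Task 2c below, the backpropagated error is numerically indistinguishable from zero.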
Our LSTM architectures are selected quite arbitrarily. If nothing is known about the complexity of a given problem, a more systematic approach would be: start with a very small net consisting of one memory cell. If this does not work, try two cells, etc. Alternatively, use sequential network construction (e.g., Fahlman 1991).
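Such a cell-growing search could look like the following sketch, where `train` and `solved` are hypothetical task-specific helpers (building and training an LSTM with a given number of memory cells, and testing it):

```python
def find_minimal_lstm(train, solved, max_cells=10):
    """Systematic alternative to an arbitrary architecture choice:
    grow the net one memory cell at a time until the task is solved."""
    for n_cells in range(1, max_cells + 1):
        net = train(n_cells)   # hypothetical: build and train with n_cells cells
        if solved(net):
            return net, n_cells
    return None, max_cells
```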
Outline of experiments
Experiment 1 focuses on a standard benchmark test for recurrent nets: the embedded Reber grammar. Since it allows for training sequences with short time lags, it is not a long time lag problem. We include it because (1) it provides a nice example where LSTM's output gates are truly beneficial, and (2) it is a popular benchmark for recurrent nets that has been used by many authors; we want to include at least one experiment where conventional BPTT and RTRL do not fail completely (LSTM, however, clearly outperforms them). The embedded Reber grammar's minimal time lags represent a border case in the sense that it is still possible to learn to bridge them with conventional algorithms. Only slightly longer minimal time lags would make this almost impossible. The more interesting tasks in our paper, however, are those that RTRL, BPTT, etc. cannot solve at all.
Experiment 2 focuses on noise-free and noisy sequences involving numerous input symbols distracting from the few important ones. The most difficult task (Task 2c) involves hundreds of distractor symbols at random positions, and minimal time lags of 1000 steps. LSTM solves it, while BPTT and RTRL already fail in case of 10-step minimal time lags (see also, e.g., Hochreiter 1991 and Mozer 1992). For this reason RTRL and BPTT are omitted in the remaining, more complex experiments, all of which involve much longer time lags.
Experiment 3 addresses long time lag problems with noise and signal on the same input line. Experiments 3a/3b focus on Bengio et al.'s (1994) "2-sequence problem". Because this problem actually can be solved quickly by random weight guessing, we also include a far more difficult 2-sequence problem (3c) which requires learning real-valued, conditional expectations of noisy targets, given the inputs.
Experiments 4 and 5 involve distributed, continuous-valued input representations and require learning to store precise, real values for very long time periods. Relevant input signals can occur at quite different positions in input sequences. Again, minimal time lags involve hundreds of steps. Similar tasks have never been solved by other recurrent net algorithms.
Experiment 6 involves tasks of a different complex type that also has not been solved by other recurrent net algorithms. Again, relevant input signals can occur at quite different positions in input sequences. The experiment shows that LSTM can extract information conveyed by the temporal order of widely separated inputs.
Subsection 5.7 will provide a detailed summary of experimental conditions in two tables for reference.
5.1 EXPERIMENT 1: EMBEDDED REBER GRAMMAR
Task. Our first task is to learn the "embedded Reber grammar", e.g. Smith and Zipser (1989), Cleeremans et al. (1989), and Fahlman (1991). Since it allows for training sequences with short time lags (of as few as 9 steps), it is not a long time lag problem. We include it for two reasons: (1) it is a popular recurrent net benchmark used by many authors; we wanted to have at least one experiment where RTRL and BPTT do not fail completely, and (2) it shows nicely how output gates can be beneficial.
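For concreteness, here is a sketch of a generator for embedded Reber strings, assuming the standard transition diagram used in the cited work (the state numbering is ours):

```python
import random

# Standard Reber grammar as a table of (next_state, symbol) choices per
# state; a string starts with B and the terminal state emits E.
REBER = {
    0: [(1, 'T'), (2, 'P')],
    1: [(1, 'S'), (3, 'X')],
    2: [(2, 'T'), (4, 'V')],
    3: [(2, 'X'), (5, 'S')],
    4: [(3, 'P'), (5, 'V')],
}

def reber_string(rng=random):
    state, out = 0, ['B']
    while state != 5:
        state, sym = rng.choice(REBER[state])
        out.append(sym)
    out.append('E')
    return ''.join(out)          # e.g. 'BTSSXXTVVE'

def embedded_reber_string(rng=random):
    # B, then T or P, then a Reber string, then the SAME symbol, then E.
    # Predicting that second-to-last symbol requires remembering the
    # second input symbol across the whole embedded string.
    sym = rng.choice('TP')
    return 'B' + sym + reber_string(rng) + sym + 'E'
```

The shortest embedded string produced this way has 9 symbols (e.g. 'BT' + 'BTXSE' + 'TE'), which matches the minimal time lag of 9 steps mentioned above.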