
Averaged-DQN: Variance Reduction and Stabilization for Deep Reinforcement Learning


Abstract

Instability and variability of Deep Reinforcement Learning (DRL) algorithms tend to adversely affect their performance. Averaged-DQN is a simple extension to the DQN algorithm, based on averaging previously learned Q-values estimates, which leads to a more stable training procedure and improved performance by reducing approximation error variance in the target values. To understand the effect of the algorithm, we examine the source of value function estimation errors and provide an analytical comparison within a simplified model. We further present experiments on the Arcade Learning Environment benchmark that demonstrate significantly improved stability and performance due to the proposed extension.


1. Introduction

In Reinforcement Learning (RL) an agent seeks an optimal policy for a sequential decision making problem (Sutton & Barto, 1998). It does so by learning which action is optimal for each environment state. Over the course of time, many algorithms have been introduced for solving RL problems, including Q-learning (Watkins & Dayan, 1992), SARSA (Rummery & Niranjan, 1994; Sutton & Barto, 1998), and policy gradient methods (Sutton et al., 1999). These methods are often analyzed in the setup of linear function approximation, where convergence is guaranteed under mild assumptions (Tsitsiklis, 1994; Jaakkola et al., 1994; Tsitsiklis & Van Roy, 1997; Even-Dar & Mansour, 2003). In practice, real-world problems usually involve high-dimensional inputs, forcing linear function approximation methods to rely upon hand-engineered features for problem-specific state representation. These problem-specific features diminish the agent's flexibility, and so the need for an expressive and flexible non-linear function approximation emerges. Except for a few successful attempts (e.g., TD-Gammon, Tesauro (1995)), the combination of non-linear function approximation and RL was considered unstable and was shown to diverge even in simple domains (Boyan & Moore, 1995).

The recent Deep Q-Network (DQN) algorithm (Mnih et al., 2013) was the first to successfully combine a powerful non-linear function approximation technique known as Deep Neural Networks (DNN) (LeCun et al., 1998; Krizhevsky et al., 2012) together with the Q-learning algorithm. DQN presented a remarkably flexible and stable algorithm, showing success in the majority of games within the Arcade Learning Environment (ALE) (Bellemare et al., 2013). DQN increased the training stability by breaking the RL problem into sequential supervised learning tasks. To do so, DQN introduces the concept of a target network and uses an Experience Replay buffer (ER) (Lin, 1993).

Following the DQN work, additional modifications and extensions to the basic algorithm further increased training stability. Schaul et al. (2015) suggested a sophisticated ER sampling strategy. Several works extended standard RL exploration techniques to deal with high-dimensional input (Bellemare et al., 2016; Tang et al., 2016; Osband et al., 2016). Mnih et al. (2016) showed that sampling from the ER could be replaced with asynchronous updates from parallel environments (which enables the use of on-policy methods). Wang et al. (2015) suggested a network architecture based on the advantage function decomposition (Baird III, 1993).


In this work we address issues that arise from the combination of Q-learning and function approximation. Thrun & Schwartz (1993) were the first to investigate one of these issues, which they termed the overestimation phenomenon. The max operator in Q-learning can lead to overestimation of state-action values in the presence of noise. Van Hasselt et al. (2015) suggest the Double-DQN, which uses the Double Q-learning estimator (Van Hasselt, 2010) as a solution to the problem. Additionally, Van Hasselt et al. (2015) showed that Q-learning overestimation does occur in practice (at least in the ALE).

This work suggests a different solution to the overestimation phenomenon, named Averaged-DQN (Section 3), based on averaging previously learned Q-values estimates. The averaging reduces the target approximation error variance (Sections 4 and 5), which leads to stability and improved results. Additionally, we provide experimental results on selected games of the Arcade Learning Environment.

We summarize the main contributions of this paper as follows:

• A novel extension to the DQN algorithm which stabilizes training, and improves the attained performance, by averaging over previously learned Q-values.

• Variance analysis that explains some of the DQN problems, and how the proposed extension addresses them.

• Experiments with several ALE games demonstrating the favorable effect of the proposed scheme.


2. Background

In this section we elaborate on relevant RL background, and specifically on the Q-learning algorithm.


2.1. Reinforcement Learning

Value-based methods for solving RL problems encode policies through the use of value functions, which denote the expected discounted cumulative reward from a given state s, following a policy π. Specifically, we are interested in state-action value functions:
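
A standard formulation of this quantity (the discount factor γ ∈ [0, 1) and per-step reward r_t are notation assumed here for concreteness) is:

    Q^{\pi}(s, a) \;=\; \mathbb{E}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0}=s,\; a_{0}=a,\; \pi \right].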


2.2. Q-learning

One of the most popular RL algorithms is the Q-learning algorithm (Watkins & Dayan, 1992). This algorithm is based on a simple value iteration update (Bellman, 1957), directly estimating the optimal value function Q*. Tabular Q-learning assumes a table that contains old action-value function estimates and performs updates using the following update rule:
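
In its common textbook form, with step size α_t and discount factor γ, the Watkins & Dayan (1992) update reads:

    Q_{t+1}(s_t, a_t) \;=\; Q_t(s_t, a_t) + \alpha_t \left( r_t + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t) \right).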


2.3. Deep Q Networks (DQN)

We present in Algorithm 1 a slightly different formulation of the DQN algorithm (Mnih et al., 2013). In iteration i the DQN algorithm solves a supervised learning problem to approximate the action-value function Q(s, a; θ) (line 6). This is an extension of implementing (1) in its function approximation form (Riedmiller, 2005).
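
For reference, the regression solved at each iteration is commonly written as follows (a sketch of the standard DQN loss with target-network parameters θ⁻ and replay buffer ER, not a verbatim restatement of Algorithm 1):

    y \;=\; r + \gamma \max_{a'} Q(s', a'; \theta^{-}), \qquad
    L_i(\theta) \;=\; \mathbb{E}_{(s,a,r,s') \sim \mathrm{ER}} \left[ \big( y - Q(s, a; \theta) \big)^{2} \right].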


Note that in the original implementation (Mnih et al., 2013; 2015), transitions are added to the ER buffer simultaneously with the minimization of the DQN loss (line 6). Using the hyperparameters employed by Mnih et al. (2013; 2015) (detailed for completeness in Appendix E), 1% of the experience transitions in the ER buffer are replaced between target network parameter updates, and 8% are sampled for minimization.
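
These percentages are consistent with the commonly reported hyperparameters of Mnih et al. (2015), assuming an ER buffer of 1M transitions, a target-network update every 10,000 agent steps, and a minibatch of 32 sampled every 4 steps:

    \frac{10{,}000}{1{,}000{,}000} = 1\%, \qquad \frac{(10{,}000/4) \cdot 32}{1{,}000{,}000} = \frac{80{,}000}{1{,}000{,}000} = 8\%.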


3. Averaged DQN

The Averaged-DQN algorithm (Algorithm 2) is an extension of the DQN algorithm. Averaged-DQN uses the K previously learned Q-values estimates to produce the current action-value estimate (line 5). The Averaged-DQN algorithm stabilizes the training process (see Figure 1) by reducing the variance of the target approximation error, as we elaborate in Section 5. Compared to DQN, the computational effort is K-fold more forward passes through a Q-network while minimizing the DQN loss (line 7). The number of back-propagation updates remains the same as in DQN. Computational cost experiments are provided in Appendix D. The output of the algorithm is the average over the last K previously learned Q-networks.
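
As an illustration of how the averaged estimate enters the target (a sketch only; the helper name, the use of PyTorch, and the argument shapes are assumptions, not part of the original algorithm listing):

    import torch

    def averaged_dqn_targets(q_nets, rewards, next_states, dones, gamma=0.99):
        """Averaged-DQN targets: y = r + gamma * max_a (1/K) * sum_k Q_k(s', a).

        q_nets      -- list of the K previously learned Q-networks (frozen copies)
        rewards     -- float tensor of shape [batch]
        next_states -- float tensor of shape [batch, state_dim]
        dones       -- bool tensor of shape [batch], True for terminal transitions
        """
        with torch.no_grad():
            # Average the action-value estimates of the K previous networks.
            q_avg = torch.stack([net(next_states) for net in q_nets], dim=0).mean(dim=0)
            max_next_q = q_avg.max(dim=1).values
            # Terminal transitions contribute only the immediate reward.
            return rewards + gamma * (1.0 - dones.float()) * max_next_q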


Figure 1. DQN and Averaged-DQN performance in the Atari game of BREAKOUT. The bold lines are averages over seven independent learning trials. Every 1M frames, a performance test using an ε-greedy policy with ε = 0.05 for 500000 frames was conducted. The shaded area presents one standard deviation. For both DQN and Averaged-DQN the hyperparameters used were taken from Mnih et al. (2015).


In Figures 1 and 2 we can see the performance of Averaged-DQN compared to DQN (and Double-DQN); further experimental results are given in Section 6. We note that recently learned state-action value estimates are likely to be better than older ones; therefore we have also considered a recency-weighted average. In practice, a weighted average scheme did not improve performance and therefore is not presented here.


4. Overestimation and Approximation Errors

Next, we discuss the various types of errors that arise due to the combination of Q-learning and function approximation in the DQN algorithm, and their effect on training stability. We refer to DQN’s performance in the BREAKOUT game in Figure 1. The source of the learning curve variance in DQN’s performance is an occasional sudden drop in the average score that is usually recovered in the next evaluation phase (for another illustration of the variance source see Appendix A). Another phenomenon can be observed in Figure 2, where DQN initially reaches a steady state (after 20 million frames), followed by a gradual deterioration in performance.

For the rest of this section, we list the above-mentioned errors, and discuss our hypothesis as to the relations between each error and the instability phenomena depicted in Figures 1 and 2.


Figure 2. DQN, Double-DQN, and Averaged-DQN performance (left), and average value estimates (right) in the Atari game of ASTERIX. The bold lines are averages over seven independent learning trials. The shaded area presents one standard deviation. Every 2M frames, a performance test using an ε-greedy policy with ε = 0.05 for 500000 frames was conducted. The hyperparameters used were taken from Mnih et al. (2015).


The optimality difference can be seen as the error of standard tabular Q-learning; here we address the other errors. We next discuss each error in turn.


4.1. Target Approximation Error (TAE)


We hypothesize that the variability in DQN’s performance in Figure 1, which was discussed at the start of this section, is related to deviating from a steady-state policy induced by the TAE.


4.2. Overestimation Error

The overestimation error is different in its nature from the TAE since it presents a positive bias that can cause asymptotically sub-optimal policies, as was shown by Thrun & Schwartz (1993), and later by Van Hasselt et al. (2015) in the ALE environment. Note that a uniform bias in the action-value function will not cause a change in the induced policy. Unfortunately, the overestimation bias is uneven and is bigger in states where the Q-values are similar for the different actions, or in states which are the start of a long trajectory (as we discuss in Section 5 on accumulation of TAE variance).
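
For context, the upper bound referred to in the next paragraph can be sketched under the uniform-noise assumption used by Thrun & Schwartz (1993): if the errors ε_a of the n applicable actions are independent and uniformly distributed on [-ε, ε], the expected overestimation carried into the bootstrapped target is at most

    \gamma\, \mathbb{E}\Big[ \max_{a} \epsilon_{a} \Big] \;=\; \gamma\, \epsilon\, \frac{n-1}{n+1},

with equality when the true Q-values of all actions coincide.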

Following from the above-mentioned overestimation upper bound, the magnitude of the bias is controlled by the variance of the TAE.

Double Q-learning and its DQN implementation, Double-DQN (Van Hasselt et al., 2015; Van Hasselt, 2010), is one possible approach to tackle the overestimation problem, which replaces the positive bias with a negative one. Another possible remedy to the adverse effects of this error is to directly reduce the variance of the TAE, as in our proposed scheme (Section 5).

In Figure 2 we repeated the experiment presented in Van Hasselt et al. (2015) (along with the application of Averaged-DQN). This experiment is discussed in Van Hasselt et al. (2015) as an example of overestimation that leads to asymptotically sub-optimal policies. Since Averaged-DQN reduces the TAE variance, this experiment supports the hypothesis that the main cause for overestimation in DQN is the TAE variance.


5. TAE Variance Reduction

5.1. DQN Variance

We assume the statistical model mentioned at the start of this section. Consider a unidirectional Markov Decision Process (MDP) as in Figure 3, where the agent starts at state s_0, state s_{M-1} is a terminal state, and the reward in any state is equal to zero.

Employing DQN on this MDP model, we get that for i > M:
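
Making the assumed statistical model explicit (zero-mean TAE terms Z^i_{s,a}, i.i.d. across iterations, with variance σ²_s at state s; these are assumptions of the simplified model, stated here for readability), the target errors compound along the M-state chain, and one obtains an expression of the form

    \mathrm{Var}\left[ Q^{\mathrm{DQN}}_{i}(s_0, a) \right] \;=\; \sum_{m=0}^{M-1} \gamma^{2m} \sigma^{2}_{s_m},

i.e., discounted TAE contributions from all states along the update trajectory.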


The above example gives intuition about the behavior of the TAE variance in DQN. The TAE is accumulated over the past DQN iterations on the updates trajectory. Accumulation of TAE errors results in bigger variance with its associated adverse effect, as was discussed in Section 4.


5.2. Ensemble DQN Variance

We consider two approaches for TAE variance reduction. The first one is the Averaged-DQN, and the second we term Ensemble-DQN. We start with Ensemble-DQN, which is a straightforward way to obtain a 1/K variance reduction, with a computational effort of K-fold learning problems, compared to DQN. Ensemble-DQN (Algorithm 3) solves K DQN losses in parallel, then averages over the resulting Q-values estimates.

For Ensemble-DQN on the unidirectional MDP in Figure 3, we get for i > M:
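
Under the same i.i.d. TAE assumptions sketched above, averaging K independently trained networks divides each variance term by K (this restates, in the notation of the previous expression, the 1/K reduction mentioned in the text):

    \mathrm{Var}\left[ Q^{E}_{i}(s_0, a) \right] \;=\; \frac{1}{K} \sum_{m=0}^{M-1} \gamma^{2m} \sigma^{2}_{s_m}.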


5.3. Averaged DQN Variance

We continue with Averaged-DQN, and calculate the variance in state s_0 for the unidirectional MDP in Figure 3. We get that for i > KM:


This means that Averaged-DQN is theoretically more efficient in TAE variance reduction than Ensemble-DQN, and at least K times better than DQN. The intuition here is that Averaged-DQN averages over TAEs averages, which are the value estimates of the next states.
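
In summary, and restating the comparison in the text rather than reproducing the exact closed form (which depends on details of the TAE model not restated here):

    \mathrm{Var}\left[ Q^{A}_{i}(s_0, a) \right] \;\le\; \mathrm{Var}\left[ Q^{E}_{i}(s_0, a) \right] \;=\; \frac{1}{K}\, \mathrm{Var}\left[ Q^{\mathrm{DQN}}_{i}(s_0, a) \right].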


6. Experiments

The experiments were designed to address the following questions:

  • How does the number K of averaged target networks affect the error in value estimates, and in particular the overestimation error?
  • How does the averaging affect the quality of the learned policies?

To that end, we ran Averaged-DQN and DQN on the ALE benchmark. Additionally, we ran Averaged-DQN, Ensemble-DQN, and DQN on a Gridworld toy problem where the optimal value function can be computed exactly.


6.1. Arcade Learning Environment (ALE)

To evaluate Averaged-DQN, we adopt the typical RL methodology where agent performance is measured at the end of training. We refer the reader to Liang et al. (2016) for further discussion about DQN evaluation methods on the ALE benchmark. The hyperparameters used were taken from Mnih et al. (2015), and are presented for completeness in Appendix E. The DQN code was taken from McGill University RLLAB, and is available online (together with the Averaged-DQN implementation).

We have evaluated the Averaged-DQN algorithm on three Atari games from the Arcade Learning Environment (Bellemare et al., 2013). The game of BREAKOUT was selected due to its popularity and the relative ease with which DQN reaches a steady-state policy. In contrast, the game of SEAQUEST was selected due to its relative complexity, and the significant improvement in performance obtained by other DQN variants (e.g., Schaul et al. (2015); Wang et al. (2015)). Finally, the game of ASTERIX was presented in Van Hasselt et al. (2015) as an example of overestimation in DQN that leads to divergence.

As can be seen in Figure 4 and in Table 1 for all three games, increasing the number of averaged networks in Averaged-DQN results in lower average value estimates, better-performing policies, and less variability between the runs of independent learning trials. For the game of ASTERIX, we see, similarly to Van Hasselt et al. (2015), that the divergence of DQN can be prevented by averaging.

Overall, the results suggest that in practice Averaged-DQN reduces the TAE variance, which leads to smaller overestimation, stabilized learning curves and significantly improved performance.


6.2. Gridworld

The Gridworld problem (Figure 5) is a common RL benchmark (e.g., Boyan & Moore (1995)). As opposed to the ALE, Gridworld has a smaller state space that allows the ER buffer to contain all possible state-action pairs. Additionally, it allows the optimal value function Q to be accurately computed.

For the experiments, we have used Averaged-DQN and Ensemble-DQN with an ER buffer containing all possible state-action pairs. The network architecture used was a small fully connected neural network with one hidden layer of 80 neurons. For minimization of the DQN loss, the ADAM optimizer (Kingma & Ba, 2014) was used on 100 mini-batches of 32 samples per target network parameter update in the first experiment, and 300 mini-batches in the second.
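
A minimal sketch of this setup (PyTorch is used for illustration only; the input dimension, number of actions, and ADAM learning rate below are assumptions, since the text does not specify them):

    import torch
    import torch.nn as nn

    state_dim, n_actions = 25, 4   # hypothetical Gridworld dimensions

    # One hidden layer of 80 neurons, as described in the experimental setup.
    q_net = nn.Sequential(
        nn.Linear(state_dim, 80),
        nn.ReLU(),
        nn.Linear(80, n_actions),
    )

    # ADAM optimizer (Kingma & Ba, 2014); the learning rate is an assumed default.
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

    # Training schedule from the text: the DQN loss is minimized on 100 mini-batches
    # of 32 samples per target-network parameter update (300 mini-batches in the
    # second experiment).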


Figure 4. The top row shows Averaged-DQN performance for the different number K of averaged networks on three Atari games. For K = 1, Averaged-DQN is reduced to DQN. The bold lines are averaged over seven independent learning trials. Every 2M frames, a performance test using an ε-greedy policy with ε = 0.05 for 500000 frames was conducted. The shaded area presents one standard deviation. The bottom row shows the average value estimates for the three games. It can be seen that as the number of averaged networks is increased, overestimation of the values is reduced, performance improves, and less variability is observed. The hyperparameters used were taken from Mnih et al. (2015).


6.2.1. ENVIRONMENT SETUP


Figure 5. Gridworld problem. The agent starts at the bottom-left of the grid. In the upper-right corner, a reward of +1 is obtained.


6.2.2. OVERESTIMATION

In Figure 6 it can be seen that increasing the number K of averaged target networks eventually leads to reduced overestimation. Also, more averaged target networks seem to reduce the overshoot of the values, and lead to smoother and more consistent convergence.


6.2.3. AVERAGED VERSUS ENSEMBLE DQN

In Figure 7, it can be seen that, as was predicted by the analysis in Section 5, Ensemble-DQN is also inferior to Averaged-DQN regarding variance reduction, and as a consequence overestimates the values far more. We note that Ensemble-DQN was not implemented for the ALE experiments due to its demanding computational effort, and the empirical evidence that was already obtained in this simple Gridworld domain.


Figure 6. Averaged-DQN average predicted value in Gridworld. Increasing the number K of averaged target networks leads to a faster convergence with less overestimation (positive bias). The bold lines are averages over 40 independent learning trials, and the shaded area presents one standard deviation. In the figure, A, B, C, and D present the average overestimation of DQN, and of Averaged-DQN for K = 5, 10, 20, respectively.

Figure 7. Averaged-DQN and Ensemble-DQN predicted value in Gridworld. Averaging over past learned values is more beneficial than learning in parallel. The bold lines are averages over 20 independent learning trials, where the shaded area presents one standard deviation.


7. Discussion and Future Directions

In this work, we have presented the Averaged-DQN algorithm, an extension to DQN that stabilizes training and improves performance by efficient TAE variance reduction. We have shown both in theory and in practice that the proposed scheme is superior in TAE variance reduction, compared to a straightforward but computationally demanding approach such as Ensemble-DQN (Algorithm 3). We have demonstrated in several games of Atari that increasing the number K of averaged target networks leads to better policies while reducing overestimation. Averaged-DQN is a simple extension that can be easily integrated with other DQN variants such as Schaul et al. (2015); Van Hasselt et al. (2015); Wang et al. (2015); Bellemare et al. (2016); He et al. (2016). Indeed, it would be of interest to study the added value of averaging when combined with these variants. Also, since Averaged-DQN has a variance reduction effect on the learning curve, a more systematic comparison between the different variants can be facilitated, as discussed in (Liang et al., 2016).

In future work, we may dynamically learn when and how many networks to average for best results. One simple suggestion may be to correlate the number of networks with the state TD-error, similarly to Schaul et al. (2015). Finally, incorporating averaging techniques similar to Averaged-DQN within on-policy methods such as SARSA and Actor-Critic methods (Mnih et al., 2016) can further stabilize these algorithms.


References

Bryson, Arthur E and Ho, Yu Chi. Applied Optimal Control: Optimization, Estimation and Control. Hemisphere Publishing, 1975.

Baird III, Leemon C. Advantage updating. Technical report, DTIC Document, 1993.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Bellemare, Marc G, Srinivasan, Sriram, Ostrovski, Georg, Schaul, Tom, Saxton, David, and Munos, Remi. Unifying count-based exploration and intrinsic motivation. arXiv preprint arXiv:1606.01868, 2016.

Bellman, Richard. A Markovian decision process. Indiana Univ. Math. J., 6:679–684, 1957.

Boyan, Justin and Moore, Andrew W. Generalization in reinforcement learning: Safely approximating the value function. Advances in neural information processing systems, pp. 369–376, 1995.

Even-Dar, Eyal and Mansour, Yishay. Learning rates for q-learning. Journal of Machine Learning Research, 5 (Dec):1–25, 2003.

He, Frank S., Liu, Yang, Schwing, Alexander G., and Peng, Jian. Learning to play in a day: Faster deep reinforcement learning by optimality tightening. arXiv preprint arXiv:1611.01606, 2016.

Jaakkola, Tommi, Jordan, Michael I, and Singh, Satinder P. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6):1185–1201, 1994.

Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv: 1412.6980, 2014.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Advances in NIPS, pp. 1097–1105, 2012.

LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Liang, Yitao, Machado, Marlos C, Talvitie, Erik, and Bowling, Michael. State of the art control of Atari games using shallow reinforcement learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pp. 485–493, 2016.

Lin, Long-Ji. Reinforcement learning for robots using neural networks. Technical report, DTIC Document, 1993.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Graves, Alex, Antonoglou, Ioannis, Wierstra, Daan, and Riedmiller, Martin. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Mnih, Volodymyr, Badia, Adria Puigdomenech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy P, Harley, Tim, Silver, David, and Kavukcuoglu, Koray. Asynchronous methods for deep reinforcement learning. arXiv preprint arXiv:1602.01783, 2016.

Osband, Ian, Blundell, Charles, Pritzel, Alexander, and Van Roy, Benjamin. Deep exploration via bootstrapped DQN. arXiv preprint arXiv:1602.04621, 2016.

Riedmiller, Martin. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pp. 317–328. Springer, 2005.

Rummery, Gavin A and Niranjan, Mahesan. On-line Q-learning using connectionist systems. University of Cambridge, Department of Engineering, 1994.

Schaul, Tom, Quan, John, Antonoglou, Ioannis, and Silver, David. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.

Sutton, Richard S and Barto, Andrew G. Reinforcement Learning: An Introduction. MIT Press Cambridge, 1998.

Sutton, Richard S, McAllester, David A, Singh, Satinder P, and Mansour, Yishay. Policy gradient methods for reinforcement learning with function approximation. In NIPS, volume 99, pp. 1057–1063, 1999.

Tang, Haoran, Houthooft, Rein, Foote, Davis, Stooke, Adam, Chen, Xi, Duan, Yan, Schulman, John, De Turck, Filip, and Abbeel, Pieter. #Exploration: A study of count-based exploration for deep reinforcement learning. arXiv preprint arXiv:1611.04717, 2016.

Tesauro, Gerald. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, 1995.

Thrun, Sebastian and Schwartz, Anton. Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School, Hillsdale, NJ. Lawrence Erlbaum, 1993.

Tsitsiklis, John N. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202, 1994.

Tsitsiklis, John N and Van Roy, Benjamin. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.

Van Hasselt, Hado. Double Q-learning. In Lafferty, J. D., Williams, C. K. I., Shawe-Taylor, J., Zemel, R. S., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 23, pp. 2613–2621. 2010.

Van Hasselt, Hado, Guez, Arthur, and Silver, David. Deep reinforcement learning with double Q-learning. arXiv preprint arXiv:1509.06461, 2015.

Wang, Ziyu, de Freitas, Nando, and Lanctot, Marc. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.

Watkins, Christopher JCH and Dayan, Peter. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
