
[paper reading] Trust Region Policy Optimization

Trust Region Policy Optimization

https://arxiv.org/pdf/1502.05477.pdf

Policy gradient with guaranteed monotonic improvement

The problem

Nothing much to point out here; the paper directly proposes a policy gradient method with monotonic improvement.

Theoretical derivation

An MDP is specified by the tuple $(\mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma)$,

where $\rho_0$ is the distribution of the initial state $s_0$.

The expected discounted return can then be written as

$$\eta(\pi)=\mathbb{E}_{s_0,a_0,\ldots}\left[\sum_{t=0}^\infty \gamma^t r(s_t)\right],\quad\text{where } s_0\sim\rho_0,\ a_t\sim\pi(a_t|s_t),\ s_{t+1}\sim P(s_{t+1}|s_t,a_t)$$
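As a quick numerical illustration (my own sketch, not from the paper), a single rollout gives a Monte Carlo estimate of $\eta(\pi)$ by discounting and summing the rewards collected while following $\pi$:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Single-rollout Monte Carlo estimate of eta(pi): sum_t gamma^t * r(s_t)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# Example: three steps of reward 1.0 with gamma = 0.9 -> 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))
```

Averaging this over many rollouts started from $s_0\sim\rho_0$ estimates $\eta(\pi)$.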

For any policy $\pi$, the (unnormalized) discounted state visitation frequencies are defined as

$$\rho_{\pi}(s)=\sum_{t=0}^\infty \gamma^t P(s_t=s \mid \pi)$$

As shown in [Kakade & Langford, 2002], for any two policies $\tilde\pi$ and $\pi$,

$$\eta(\tilde\pi)=\eta(\pi)+\sum_s \rho_{\tilde\pi}(s)\sum_a \tilde\pi(a|s)A_\pi(s,a)$$
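A quick sketch of why this identity holds (my own note, following the standard argument): write the advantage as $A_\pi(s_t,a_t)=r(s_t)+\gamma V_\pi(s_{t+1})-V_\pi(s_t)$ and telescope along trajectories sampled from $\tilde\pi$:

$$\mathbb{E}_{\tau\sim\tilde\pi}\left[\sum_{t=0}^\infty \gamma^t A_\pi(s_t,a_t)\right]=\mathbb{E}_{\tau\sim\tilde\pi}\left[\sum_{t=0}^\infty \gamma^t r(s_t)\right]-\mathbb{E}_{s_0\sim\rho_0}\left[V_\pi(s_0)\right]=\eta(\tilde\pi)-\eta(\pi)$$

The left-hand side equals $\sum_s \rho_{\tilde\pi}(s)\sum_a \tilde\pi(a|s)A_\pi(s,a)$ by the definition of $\rho_{\tilde\pi}$ above, which is also why the visitation frequencies carry the $\gamma^t$ weighting.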

Define the local approximation $L_\pi(\tilde\pi)=\eta(\pi)+\sum_s \rho_{\pi}(s)\sum_a \tilde\pi(a|s)A_\pi(s,a)$, which replaces the visitation frequencies $\rho_{\tilde\pi}$ with $\rho_\pi$. Let $\pi_{old}$ be the current policy and let $\pi'=\arg\max_{\pi'}L_{\pi_{old}}(\pi')$. Conservative policy iteration then takes $\pi_{new}$ to be a mixture policy:

$$\pi_{new}(a|s)=(1-\alpha)\pi_{old}(a|s)+\alpha\pi'(a|s)$$

From [Kakade & Langford, 2002] we have

$$\begin{aligned} \eta(\pi_{new}) - \eta(\pi_{old}) &= \sum_{s} \rho_{\pi_{new}}(s) \sum_a \pi_{new}(a|s) A_{\pi_{old}}(s,a) \\ &\geq \sum_{s} \rho_{\pi_{old}}(s) \sum_a \pi_{new}(a|s) A_{\pi_{old}}(s,a) - 2 \alpha \epsilon \left( \frac{1}{1-\gamma} - \frac{1}{1-\gamma(1-\alpha)} \right) \\ &\geq \sum_{s} \rho_{\pi_{old}}(s) \sum_a \pi_{new}(a|s) A_{\pi_{old}}(s,a) - \frac{2 \epsilon \gamma}{(1-\gamma)^2} \alpha^2, \\ \text{where } \epsilon &= \max_s \left| \mathbb{E}_{a \sim \pi'(a|s)} [A_{\pi_{old}}(s,a)] \right| \end{aligned}$$

The last step relaxes the lower bound into a simpler form:

$$\begin{aligned} \frac{1}{1-\gamma}-\frac{1}{1-\gamma(1-\alpha)} &= \frac{1-\gamma(1-\alpha)-(1-\gamma)}{(1-\gamma)[1-\gamma(1-\alpha)]} \\ &= \frac{\alpha\gamma}{(1-\gamma)[1-\gamma(1-\alpha)]} \\ &\leq \frac{\alpha\gamma}{(1-\gamma)^2} \end{aligned}$$

since $1-\gamma(1-\alpha)\geq 1-\gamma$; subtracting the larger quantity $\frac{2\epsilon\gamma}{(1-\gamma)^2}\alpha^2$ therefore still gives a valid (looser) lower bound.

$\pi_{new}$ above is a mixture policy with mixing coefficient $\alpha$, but mixture policies are rarely used in practice, so the bound above is hard to apply to real problems. The idea is therefore to replace $\alpha$ with the total variation divergence $D_{TV}(p\ \|\ q)=\frac{1}{2}\sum_i |p_i-q_i|$, where $p$ and $q$ are discrete probability distributions. Define $D_{TV}^{\max}(\pi,\tilde\pi)$ as

$$D_{TV}^{\max}(\pi,\tilde\pi)=\max_s D_{TV}\big(\pi(\cdot|s)\ \|\ \tilde\pi(\cdot|s)\big)$$

Theorem 1. Let $\alpha = D_{TV}^{\max}(\pi_{old},\pi_{new})$. Then the following lower bound still holds:

$$\eta(\pi_{new}) - \eta(\pi_{old}) \geq \sum_{s} \rho_{\pi_{old}}(s) \sum_a \pi_{new}(a|s) A_{\pi_{old}}(s,a) - \frac{4 \epsilon \gamma}{(1-\gamma)^2} \alpha^2, \quad \text{where } \epsilon = \max_{s,a} \left| A_{\pi_{old}}(s,a) \right|$$

The proof in the original paper is a bit involved; I will read it when I have time...

Next, total variation divergence and KL divergence are related by $D_{TV}(p\ \|\ q)^2 \leq D_{KL}(p\ \|\ q)$. Let $D_{KL}^{\max}(\pi,\tilde\pi)=\max_s D_{KL}(\pi(\cdot|s)\ \|\ \tilde\pi(\cdot|s))$; then the following lower bound holds:

$$\eta(\pi_{new}) - \eta(\pi_{old}) \geq \sum_{s} \rho_{\pi_{old}}(s) \sum_a \pi_{new}(a|s) A_{\pi_{old}}(s,a) - \frac{4 \epsilon \gamma}{(1-\gamma)^2} D_{KL}^{\max}(\pi_{old},\pi_{new})$$

In practice, if we use the penalty coefficient $C=\frac{4\epsilon\gamma}{(1-\gamma)^2}$ suggested by the theory, the step sizes become very small. A workable alternative is to turn the KL term into a constraint:

$$\begin{aligned} &\underset{\theta}{\text{maximize}}\ \sum_{s} \rho_{\pi_{old}}(s) \sum_a \pi_{new}(a|s) A_{\pi_{old}}(s,a) \\ &\text{subject to}\ D_{KL}^{\max}(\pi_{old},\pi_{new}) \leq \delta \end{aligned}$$

Since $D_{KL}^{\max}(\pi_{old},\pi_{new})$ is inconvenient to compute, it can be replaced by the average KL divergence:

$$\overline D_{KL}^{\rho}(\pi,\tilde\pi):=\mathbb{E}_{s\sim\rho}\left[D_{KL}\big(\pi(\cdot|s)\ \|\ \tilde\pi(\cdot|s)\big)\right]$$
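As a rough sketch (my own illustration, assuming discrete actions and a batch of states sampled from $\rho$), the average KL is just the mean of the per-state KL between the two policies' action distributions:

```python
import numpy as np

def average_kl(p_old, p_new, eps=1e-8):
    """Monte Carlo estimate of E_{s~rho}[ KL(pi_old(.|s) || pi_new(.|s)) ].

    p_old, p_new: arrays of shape (num_states, num_actions) with the action
    probabilities of the two policies at a batch of sampled states.
    """
    kl_per_state = np.sum(p_old * np.log((p_old + eps) / (p_new + eps)), axis=1)
    return float(np.mean(kl_per_state))

# Toy example: two sampled states, three actions
p_old = np.array([[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]])
p_new = np.array([[0.4, 0.4, 0.2], [0.2, 0.5, 0.3]])
print(average_kl(p_old, p_new))
```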

Importance sampling

In practice the data come from Monte Carlo sampling, so we rewrite the objective in terms of sampled expectations:

$$\begin{aligned} \sum_{s} \rho_{\pi_{old}}(s) \sum_a \pi_{new}(a|s) A_{\pi_{old}}(s,a) &= \mathbb{E}_{s \sim \rho_{\pi_{old}}}\left[ \sum_a \pi_{new}(a|s) A_{\pi_{old}}(s,a)\right] \\ &= \mathbb{E}_{s \sim \rho_{\pi_{old}}}\left[ \sum_a \pi_{old}(a|s) \frac{\pi_{new}(a|s)}{\pi_{old}(a|s)} A_{\pi_{old}}(s,a)\right] \\ &= \mathbb{E}_{s \sim \rho_{\pi_{old}},\,a \sim \pi_{old}} \left[\frac{\pi_{new}(a|s)}{\pi_{old}(a|s)} A_{\pi_{old}}(s,a)\right] \end{aligned}$$

Therefore the surrogate objective can be written as:

$$\begin{aligned} &\underset{\theta}{\text{maximize}}\ \mathbb{E}_{s \sim \rho_{\pi_{old}},\,a \sim \pi_{old}} \left[\frac{\pi_{new}(a|s)}{\pi_{old}(a|s)} A_{\pi_{old}}(s,a)\right] \\ &\text{subject to}\ \mathbb{E}_{s \sim \rho_{\pi_{old}}} \left[D_{KL}\big(\pi_{old}(\cdot|s)\ \|\ \pi_{new}(\cdot|s)\big)\right] \leq \delta \end{aligned}$$
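A minimal sketch of the sampled surrogate objective (my own illustration; `new_probs` and `old_probs` are the probabilities $\pi_{new}(a|s)$ and $\pi_{old}(a|s)$ at the sampled state-action pairs, and `advantages` are advantage estimates under $\pi_{old}$):

```python
import numpy as np

def surrogate_objective(new_probs, old_probs, advantages):
    """Sample estimate of E[ pi_new(a|s) / pi_old(a|s) * A_old(s,a) ]."""
    ratios = new_probs / old_probs        # importance weights at the sampled (s, a)
    return float(np.mean(ratios * advantages))

# Toy batch collected under pi_old
old_probs  = np.array([0.20, 0.50, 0.30, 0.40])
new_probs  = np.array([0.25, 0.45, 0.35, 0.40])
advantages = np.array([1.0, -0.5, 0.2, 0.8])
print(surrogate_objective(new_probs, old_probs, advantages))
```

The KL constraint is evaluated on the same batch of sampled states, e.g. with the `average_kl` helper above.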

The paper then introduces two sampling schemes: single path and vine. The single-path scheme is straightforward, plain Monte Carlo rollouts. The vine scheme is harder to grasp.


In the vine scheme, we first run Monte Carlo rollouts to obtain trajectories, then pick a subset of $N$ states along them, called the rollout set. For each state $s_n$ in the rollout set, we sample $K$ actions $a_{n,k} \sim q(\cdot|s_n)$. Empirically, $q(\cdot|s_n)=\pi_\theta(\cdot|s_n)$ works well for continuous-action problems (e.g., robotic locomotion), while a uniform random distribution over actions works better for discrete-action problems (e.g., Atari), since it explores better.

However, the vine scheme is a hassle in practice: even resetting a gym environment to a specified state is hard to do... so I will skip the details of the vine scheme here.

Practical algorithm

Restating the surrogate objective from above:

$$\begin{aligned} &\underset{\theta}{\text{maximize}}\ \mathbb{E}_{s \sim \rho_{\pi_{old}},\,a \sim \pi_{old}} \left[\frac{\pi_{new}(a|s)}{\pi_{old}(a|s)} A_{\pi_{old}}(s,a)\right] \\ &\text{subject to}\ \mathbb{E}_{s \sim \rho_{\pi_{old}}} \left[D_{KL}\big(\pi_{old}(\cdot|s)\ \|\ \pi_{new}(\cdot|s)\big)\right] \leq \delta \end{aligned}$$

We define

$$J(\theta)=\mathbb{E}_{s \sim \rho_{\pi_{old}},\,a \sim \pi_{old}} \left[\frac{\pi_{new}(a|s)}{\pi_{old}(a|s)} A_{\pi_{old}}(s,a)\right]$$

Then, to first order around the current parameters $\theta_{old}$, $\text{maximize}_\theta\, J(\theta)$ can be replaced by $\text{maximize}_{\Delta\theta}\, \nabla_\theta J(\theta)^T \Delta\theta$, where $\Delta\theta=\theta-\theta_{old}$.

We know that the second-order Taylor expansion of the KL divergence relates it to the Fisher information matrix:

$$KL\big(f(x;\theta)\ \|\ f(x;\theta+\Delta\theta)\big)\approx\frac{1}{2}\,\Delta\theta^T F\,\Delta\theta$$
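As a concrete check of this approximation (my own example), take a one-dimensional Gaussian with fixed standard deviation $\sigma$ and mean parameter $\theta$. Its Fisher information is $F=1/\sigma^2$, and the KL divergence is exactly quadratic in the parameter change:

$$KL\big(\mathcal{N}(\theta,\sigma^2)\ \|\ \mathcal{N}(\theta+\Delta\theta,\sigma^2)\big)=\frac{\Delta\theta^2}{2\sigma^2}=\frac{1}{2}\,\Delta\theta\, F\,\Delta\theta$$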

Let us look at the second derivative of the KL divergence ($\theta^-$ denotes parameters held constant):

This part follows https://zhuanlan.zhihu.com/p/106984936, which explains it well and is worth reading.

$$\begin{aligned} \nabla_\theta^2 KL\big(f(x;\theta^-)\ \|\ f(x;\theta)\big) &= \nabla_\theta^2 \left[\int f(x;\theta^-)\log f(x;\theta^-)\,dx - \int f(x;\theta^-)\log f(x;\theta)\,dx\right] \\ &= -\int f(x;\theta^-)\,\nabla_\theta^2 \log f(x;\theta)\,dx \\ &= -\mathbb{E}\left[\nabla_\theta^2 \log f(x;\theta)\right] \\ &= F \end{aligned}$$

(the first term vanishes because it does not depend on $\theta$, and the last equality holds when evaluated at $\theta=\theta^-$, which is exactly the Fisher information matrix).

The paper uses the Hessian of the KL divergence, $H=\nabla_\theta^2 KL(\cdot)$, as an estimate of the Fisher matrix $F$.
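In practice $H$ is never formed explicitly; TRPO-style implementations compute Hessian-vector products of the average KL by differentiating twice. A rough PyTorch sketch (my own, assuming a `policy` network that maps a batch of `states` to action logits; `vector` is a flat vector with one entry per parameter):

```python
import torch
from torch.distributions import Categorical, kl_divergence

def kl_hessian_vector_product(policy, states, vector, damping=1e-2):
    """Compute (H + damping*I) v, where H is the Hessian of the average KL
    between a frozen copy of the current policy and the current policy."""
    logits = policy(states)
    dist_old = Categorical(logits=logits.detach())   # "old" policy, treated as constant
    dist_new = Categorical(logits=logits)
    kl = kl_divergence(dist_old, dist_new).mean()    # average KL over sampled states

    params = list(policy.parameters())
    grads = torch.autograd.grad(kl, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    grad_dot_v = (flat_grad * vector).sum()
    hvp = torch.autograd.grad(grad_dot_v, params)    # second backward pass gives H v
    flat_hvp = torch.cat([h.reshape(-1) for h in hvp])
    return flat_hvp + damping * vector               # damping keeps the product well-conditioned
```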

Therefore, we can rewrite the surrogate problem as:


$$\begin{aligned} &\underset{\Delta\theta}{\text{maximize}}\ \nabla_\theta J(\theta)^T \Delta\theta \\ &\text{subject to } \frac{1}{2}\,\Delta\theta^T H \Delta\theta = \epsilon \end{aligned}$$

This is a simple constrained optimization problem, solved with the method of Lagrange multipliers:

$$f(\Delta\theta)=\nabla_\theta J(\theta)^T \Delta\theta + \lambda\left(\frac{1}{2}\,\Delta\theta^T H \Delta\theta - \epsilon\right)$$

Setting the derivatives to zero gives the system of equations

$$\begin{cases} \nabla_{\Delta\theta} f(\Delta\theta)=\nabla_\theta J(\theta)+\lambda H\Delta\theta = 0 & (1)\\ \frac{1}{2}\,\Delta\theta^T H\Delta\theta - \epsilon = 0 & (2) \end{cases}$$

From $(1)$ we get $\Delta\theta=-\frac{1}{\lambda}H^{-1}g$, where $g=\nabla_\theta J(\theta)$. Substituting into $(2)$ (note that $H$ is a Hessian and hence symmetric, $H^T=H$, so its inverse is symmetric too) gives

$$\begin{aligned} \frac{1}{2\lambda^2}\,g^T(H^{-1})^T H H^{-1}g &= \epsilon \\ \frac{1}{2\epsilon}\,g^T H^{-1}g &= \lambda^2 \\ \pm\sqrt{\frac{g^T H^{-1}g}{2\epsilon}} &= \lambda \end{aligned}$$

Substituting back, and taking the negative root $\lambda=-\sqrt{g^T H^{-1}g/(2\epsilon)}$ so that the step increases the objective (i.e., $g^T\Delta\theta>0$), we obtain

$$\begin{aligned} \Delta\theta &= -\frac{1}{\lambda}H^{-1}g \\ &= \sqrt{\frac{2\epsilon}{g^T H^{-1}g}}\,H^{-1}g \end{aligned}$$

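Putting the pieces together, a rough sketch of the final update (my own illustration): solve $Hx=g$ with conjugate gradient rather than inverting $H$, then scale the step so that $\frac{1}{2}\Delta\theta^T H\Delta\theta=\epsilon$. Here `hvp` is any function returning $Hv$ (e.g., the KL Hessian-vector product above) and `g` is the gradient of the surrogate objective:

```python
import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g using only Hessian-vector products."""
    x = np.zeros_like(g)
    r = g.copy()                 # residual g - H x (x starts at zero)
    p = r.copy()
    rs_old = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs_old / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def natural_gradient_step(hvp, g, epsilon=0.01):
    """Delta theta = sqrt(2*epsilon / (g^T H^{-1} g)) * H^{-1} g."""
    x = conjugate_gradient(hvp, g)                # x ~ H^{-1} g
    step_size = np.sqrt(2.0 * epsilon / (g @ x))
    return step_size * x

# Toy check with an explicit positive-definite H
H = np.array([[2.0, 0.3], [0.3, 1.0]])
g = np.array([1.0, -2.0])
delta = natural_gradient_step(lambda v: H @ v, g, epsilon=0.01)
print(delta, 0.5 * delta @ H @ delta)   # the second value should be close to 0.01
```

On top of this step, TRPO performs a backtracking line search along $\Delta\theta$ to make sure the sampled KL constraint is satisfied and the surrogate objective actually improves.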