Trust Region Policy Optimization
https://arxiv.org/pdf/1502.05477.pdf
A policy gradient method with guaranteed monotonic improvement
The problem being addressed
There is no single flaw being attacked here; the paper directly proposes a policy gradient method with guaranteed monotonic improvement.
Theoretical derivation
Consider an MDP $(\mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma)$, where $\rho_0$ is the distribution of the initial state $s_0$.
The expected discounted return can then be written as
$$\eta(\pi)=\mathbb{E}_{s_0,a_0,\ldots}\left[\sum_{t=0}^\infty \gamma^t r(s_t)\right],\quad\text{where } s_0 \sim \rho_0,\ a_t \sim \pi(a_t|s_t),\ s_{t+1} \sim P(s_{t+1}|s_t,a_t)$$
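As a quick sanity check on the definition of $\eta(\pi)$, here is a minimal sketch (the function name is my own, not from the paper) that computes the discounted return of one sampled trajectory; averaging it over many trajectories gives a Monte Carlo estimate of $\eta(\pi)$:

```python
def discounted_return(rewards, gamma=0.99):
    # eta for one trajectory: sum_t gamma^t * r(s_t)
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Example: a 3-step trajectory with rewards 1, 0, 2 at gamma=0.9
# gives 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```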
The discounted state visitation distribution of a policy $\pi$ is

$$\rho_\pi(s)=\sum_{t=0}^\infty \gamma^t P(s_t=s \mid \pi)$$
As shown in [Kakade 2002], for any two policies $\tilde\pi$ and $\pi$:
$$\eta(\tilde\pi)=\eta(\pi)+\sum_s \rho_{\tilde\pi}(s)\sum_a \tilde\pi(a|s)A_\pi(s,a)$$
Here $L_\pi(\tilde\pi)$ denotes the local approximation obtained by replacing $\rho_{\tilde\pi}$ with $\rho_\pi$ in the identity above, i.e. $L_\pi(\tilde\pi)=\eta(\pi)+\sum_s \rho_\pi(s)\sum_a \tilde\pi(a|s)A_\pi(s,a)$. Let $\pi_{old}$ be the current policy and let $\pi'=\arg\max_{\pi'}L_{\pi_{old}}(\pi')$. Then $\pi_{new}$ is a mixture policy:
$$\pi_{new}(a|s)=(1-\alpha)\pi_{old}(a|s)+\alpha\pi'(a|s)$$
From [Kakade 2002] we have
$$\begin{aligned} \eta(\pi_{new}) - \eta(\pi_{old}) &= \sum_{s} \rho_{\pi_{new}}(s) \sum_a \pi_{new}(a|s) A_{\pi_{old}}(s,a) \\ &\geq \sum_{s} \rho_{\pi_{old}}(s) \sum_a \pi_{new}(a|s) A_{\pi_{old}}(s,a) - 2\alpha\epsilon\left(\frac{1}{1-\gamma}-\frac{1}{1-\gamma(1-\alpha)}\right) \\ &\geq \sum_{s} \rho_{\pi_{old}}(s) \sum_a \pi_{new}(a|s) A_{\pi_{old}}(s,a) - \frac{2\epsilon\gamma}{(1-\gamma)^2}\alpha^2 \\ \text{where } \epsilon &= \max_s \left|\mathbb{E}_{a\sim\pi'(a|s)}\left[A_\pi(s,a)\right]\right| \end{aligned}$$
The last step loosens the bound a little to get a cleaner expression. Since $1-\gamma(1-\alpha) \geq 1-\gamma$,

$$\begin{aligned} \frac{1}{1-\gamma}-\frac{1}{1-\gamma(1-\alpha)}&=\frac{1-\gamma(1-\alpha)-(1-\gamma)}{(1-\gamma)[1-\gamma(1-\alpha)]} \\ &=\frac{\alpha\gamma}{(1-\gamma)[1-\gamma(1-\alpha)]} \\ &\leq \frac{\alpha\gamma}{(1-\gamma)^2} \end{aligned}$$

so subtracting the larger penalty $\frac{2\epsilon\gamma}{(1-\gamma)^2}\alpha^2$ still yields a valid (looser) lower bound.
$\pi_{new}$ is a mixture policy and $\alpha$ is the mixing coefficient, but mixture policies are rarely used in practice, which makes the formula above hard to apply to real problems. The fix is to replace $\alpha$ with the total variation divergence $D_{TV}(p\ \|\ q)=\frac{1}{2}\sum_i |p_i-q_i|$, where $p, q$ are discrete probability distributions. Define $D_{TV}^{\max}(\pi,\tilde\pi)$ as
$$D_{TV}^{\max}(\pi,\tilde\pi)=\max_s D_{TV}(\pi(\cdot|s)\ \|\ \tilde\pi(\cdot|s))$$
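For concreteness, a small sketch of $D_{TV}$ and $D_{TV}^{\max}$ for finite state and action spaces (the function names and the tabular representation are my own assumptions, not from the paper):

```python
import numpy as np

def tv_divergence(p, q):
    # D_TV(p || q) = 1/2 * sum_i |p_i - q_i| for discrete distributions
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

def tv_max(pi, pi_tilde):
    # D_TV^max(pi, pi_tilde): pi and pi_tilde are (n_states, n_actions)
    # arrays whose rows are the action distributions pi(.|s)
    return max(tv_divergence(p, q) for p, q in zip(pi, pi_tilde))
```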
Theorem 1. Let $\alpha = D_{TV}^{\max}(\pi,\tilde\pi)$. Then the following lower bound holds:
$$\eta(\pi_{new}) - \eta(\pi_{old}) \geq \sum_{s} \rho_{\pi_{old}}(s) \sum_a \pi_{new}(a|s) A_{\pi_{old}}(s,a) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\alpha^2,\quad\text{where } \epsilon = \max_{s,a}\left|A_\pi(s,a)\right|$$
The proof in the original paper is a bit involved; I'll come back to it when I have time…
Next, the total variation divergence and the KL divergence are related by $D_{TV}(p\ \|\ q)^2 \leq D_{KL}(p\ \|\ q)$ (a relaxed form of Pinsker's inequality). Let $D_{KL}^{\max}(\pi,\tilde\pi)=\max_s D_{KL}(\pi(\cdot|s)\ \|\ \tilde\pi(\cdot|s))$; then the bound becomes:
$$\eta(\pi_{new}) - \eta(\pi_{old}) \geq \sum_{s} \rho_{\pi_{old}}(s) \sum_a \pi_{new}(a|s) A_{\pi_{old}}(s,a) - \frac{4\epsilon\gamma}{(1-\gamma)^2} D_{KL}^{\max}(\pi,\tilde\pi)$$
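The TV–KL relation is easy to check numerically; a quick sketch reusing `tv_divergence` from the block above:

```python
import numpy as np

def kl_divergence(p, q):
    # D_KL(p || q) = sum_i p_i * log(p_i / q_i)
    p, q = np.asarray(p), np.asarray(q)
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(0)
for _ in range(1000):
    p, q = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
    assert tv_divergence(p, q) ** 2 <= kl_divergence(p, q) + 1e-12
```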
In practice, if we optimize the penalized objective with the penalty coefficient $C=\frac{4\epsilon\gamma}{(1-\gamma)^2}$ from the bound above, the step sizes become very, very small. A practical alternative is to turn the KL term into a constraint:
$$\text{maximize}_\theta\ \sum_{s} \rho_{\pi_{old}}(s) \sum_a \pi_{new}(a|s) A_{\pi_{old}}(s,a)\\ \text{subject to}\ D_{KL}^{\max}(\pi,\tilde\pi) \leq \delta$$
Since computing $D_{KL}^{\max}(\pi,\tilde\pi)$ is troublesome (it requires a max over all states), it can be replaced by the average KL divergence:
$$\overline{D}_{KL}^{\rho}(\pi,\tilde\pi) := \mathbb{E}_{s\sim\rho}\left[D_{KL}(\pi(\cdot|s)\ \|\ \tilde\pi(\cdot|s))\right]$$
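A sample-based estimate of the average KL is then just a mean over visited states; a minimal sketch under the same tabular assumptions as above, reusing `kl_divergence`:

```python
import numpy as np

def mean_kl(pi_probs, pi_tilde_probs):
    # Estimate E_{s~rho}[D_KL(pi(.|s) || pi_tilde(.|s))]: each row holds
    # the action distribution at one state sampled from rho.
    return float(np.mean([kl_divergence(p, q)
                          for p, q in zip(pi_probs, pi_tilde_probs)]))
```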
Importance sampling
In practice the data comes from Monte Carlo sampling, so we rewrite the objective in terms of sampled expectations:
$$\begin{aligned} \sum_{s} \rho_{\pi_{old}}(s) \sum_a \pi_{new}(a|s) A_{\pi_{old}}(s,a)&=\mathbb{E}_{s \sim \rho_{\pi_{old}}} \sum_a \pi_{new}(a|s) A_{\pi_{old}}(s,a) \\ &=\mathbb{E}_{s \sim \rho_{\pi_{old}}} \sum_a \pi_{old}(a|s) \frac{\pi_{new}(a|s)}{\pi_{old}(a|s)} A_{\pi_{old}}(s,a) \\ &=\mathbb{E}_{s \sim \rho_{\pi_{old}}} \mathbb{E}_{a \sim \pi_{old}} \frac{\pi_{new}(a|s)}{\pi_{old}(a|s)} A_{\pi_{old}}(s,a) \end{aligned}$$
The surrogate objective can therefore be written as:
$$\text{maximize}_\theta\ \mathbb{E}_{s \sim \rho_{\pi_{old}}} \mathbb{E}_{a \sim \pi_{old}} \frac{\pi_{new}(a|s)}{\pi_{old}(a|s)} A_{\pi_{old}}(s,a)\\ \text{subject to}\ \mathbb{E}_{s \sim \rho_{\pi_{old}}}\left[D_{KL}(\pi_{old}(\cdot|s)\ \|\ \pi_{new}(\cdot|s))\right] \leq \delta$$
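As a concrete illustration, here is a minimal PyTorch-style sketch of the sampled surrogate objective and its KL constraint (the names and tensor shapes are my assumptions: all tensors are per-sample batches collected under $\pi_{old}$):

```python
import torch

def surrogate_objective(logp_new, logp_old, advantages):
    # E[(pi_new(a|s) / pi_old(a|s)) * A_old(s,a)], estimated on a batch
    # of (s, a) pairs sampled with pi_old; this quantity is maximized.
    ratio = torch.exp(logp_new - logp_old.detach())
    return (ratio * advantages).mean()

def mean_kl_discrete(probs_old, probs_new):
    # Sampled mean KL: rows are action distributions at visited states.
    return (probs_old * (probs_old.log() - probs_new.log())).sum(-1).mean()
```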
The paper then introduces two sampling schemes: single path and vine. The single-path method is simple, plain Monte Carlo rollouts; the vine method is harder to grasp.

In the vine method, we first generate a trajectory with Monte Carlo sampling, then choose a subset of $N$ states along that trajectory, called the rollout set. For each state $s_n$ in the rollout set, we sample $K$ actions via $a_{n,k} \sim q(\cdot|s_n)$. Empirically, $q(\cdot|s_n)=\pi_\theta(\cdot|s_n)$ works well for continuous-action problems (e.g. robotic locomotion), while the uniform distribution works well for discrete-action problems (e.g. Atari), because it gives better exploration.
But the vine method is a hassle in practice; even just resetting a gym environment to a specified state is hard to do… so I won't write up the vine method's details here.
Practical algorithm
Restating the surrogate objective from above:
$$\text{maximize}_\theta\ \mathbb{E}_{s \sim \rho_{\pi_{old}}} \mathbb{E}_{a \sim \pi_{old}} \frac{\pi_{new}(a|s)}{\pi_{old}(a|s)} A_{\pi_{old}}(s,a)\\ \text{subject to}\ \mathbb{E}_{s \sim \rho_{\pi_{old}}}\left[D_{KL}(\pi_{old}(\cdot|s)\ \|\ \pi_{new}(\cdot|s))\right] \leq \delta$$
Define
$$J(\theta)=\mathbb{E}_{s \sim \rho_{\pi_{old}}} \mathbb{E}_{a \sim \pi_{old}} \frac{\pi_{new}(a|s)}{\pi_{old}(a|s)} A_{\pi_{old}}(s,a)$$
Then, expanding $J$ to first order around the current parameters, $\text{maximize}_\theta\ J(\theta)$ can be approximated as $\text{maximize}_{\Delta\theta}\ \nabla_\theta J(\theta)^T \Delta\theta$.
We know that the second-order Taylor expansion of the KL divergence relates it to the Fisher information matrix:
$$KL(f(x;\theta)\ \|\ f(x;\theta+\Delta\theta)) \approx \frac{1}{2}\Delta\theta^T F\Delta\theta$$
Let's look at the second derivative of the KL divergence ($\theta^-$ means the parameter is held fixed as a constant):
This part follows https://zhuanlan.zhihu.com/p/106984936; the author explains it well and it's worth a read.
$$\begin{aligned} \nabla_\theta^2 KL(f(x;\theta^-)\ \|\ f(x;\theta))&=\nabla_\theta^2 \left[ \int f(x;\theta^-) \log f(x;\theta^-)dx-\int f(x;\theta^-) \log f(x;\theta)dx \right] \\ &=-\int f(x;\theta^-) \nabla_\theta^2 \log f(x;\theta) dx \\ &=-\mathbb{E}\left[ \nabla_\theta^2 \log f(x;\theta) \right] \\ &= F \end{aligned}$$
The paper uses the second derivative of the KL divergence, $H=\nabla_\theta^2 KL(\cdot)$, as an estimate of the Fisher matrix $F$.
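In implementations, $H$ is never formed explicitly; TRPO-style code uses Hessian-vector products obtained by differentiating the mean KL twice. A minimal PyTorch sketch (the function name and the damping term are my own additions):

```python
import torch

def fisher_vector_product(mean_kl, params, v, damping=1e-2):
    # Computes (H + damping*I) v, where H is the Hessian of the mean KL,
    # via double backprop: grad of (grad_kl . v).
    grads = torch.autograd.grad(mean_kl, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    hvp = torch.autograd.grad((flat_grad * v).sum(), params)
    flat_hvp = torch.cat([h.reshape(-1) for h in hvp])
    return flat_hvp + damping * v  # damping keeps the product well-conditioned
```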
So the surrogate problem can be rewritten as:
$$\text{maximize}_{\Delta\theta}\ \nabla_\theta J(\theta)^T \Delta\theta\\ \text{subject to}\ \frac{1}{2}\Delta\theta^T H\Delta\theta = \delta$$

Here $\delta$ is the trust-region size from the constraint above, and since the linear objective attains its maximum on the boundary, the constraint is written as an equality.
This is a simple constrained optimization problem, which we solve with the method of Lagrange multipliers:
$$f(\Delta\theta)=\nabla_\theta J(\theta)^T \Delta\theta + \lambda\left(\frac{1}{2}\Delta\theta^T H\Delta\theta - \delta\right)$$
which gives the system of equations
$$\begin{cases} \nabla_{\Delta\theta}f=\nabla_\theta J(\theta)+\lambda H \Delta\theta = 0 &(1)\\ \frac{1}{2} \Delta\theta^T H\Delta\theta - \delta = 0 &(2) \end{cases}$$
From $(1)$ we get $\Delta\theta=-\frac{1}{\lambda}H^{-1}g$, where $g=\nabla_\theta J(\theta)$. Substituting into $(2)$ (note that $H$ is a Hessian and hence symmetric, $H^T=H$, and its inverse is symmetric as well):
$$\begin{aligned} \frac{1}{2\lambda^2}g^T(H^{-1})^T H H^{-1}g&=\delta \\ \frac{1}{2\delta}\, g^T H^{-1}g&=\lambda^2 \\ \lambda&=\pm\sqrt{\frac{g^T H^{-1}g}{2\delta}} \end{aligned}$$
Substituting back, we take the negative root $\lambda=-\sqrt{g^T H^{-1}g/(2\delta)}$, since the positive root gives $g^T\Delta\theta<0$ and would decrease the objective:
$$\begin{aligned} \Delta\theta&=-\frac{1}{\lambda}H^{-1}g \\ &=\sqrt{\frac{2\delta}{g^T H^{-1}g}}\,H^{-1}g \end{aligned}$$
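Putting the pieces together: $H^{-1}g$ is typically computed with conjugate gradient, so that only Hessian-vector products are needed, and the resulting direction is rescaled by $\sqrt{2\delta/g^T H^{-1}g}$. A minimal sketch assuming an `hvp_fn` like the `fisher_vector_product` above, with all parameters flattened into one vector:

```python
import torch

def conjugate_gradient(hvp_fn, g, iters=10, tol=1e-10):
    # Approximately solve H x = g using only Hessian-vector products.
    x = torch.zeros_like(g)
    r, p = g.clone(), g.clone()
    rs_old = r.dot(r)
    for _ in range(iters):
        Hp = hvp_fn(p)
        alpha = rs_old / p.dot(Hp)
        x = x + alpha * p
        r = r - alpha * Hp
        rs_new = r.dot(r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def natural_gradient_step(g, hvp_fn, delta=0.01):
    # Delta-theta = sqrt(2*delta / g^T H^{-1} g) * H^{-1} g
    Hinv_g = conjugate_gradient(hvp_fn, g)
    step_size = torch.sqrt(2 * delta / g.dot(Hinv_g))
    return step_size * Hinv_g
```

The full algorithm in the paper additionally runs a backtracking line search along this step to make sure the surrogate objective actually improves and the KL constraint holds, since both the linear objective and the quadratic KL approximation are only local.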