上一篇博文的内容整理了我們如何去近似價值函數或者是動作價值函數的方法：

\[V_{\theta}(s)\approx V^{\pi}(s) \\

Q_{\theta}(s)\approx Q^{\pi}(s, a)

通過機器學習的方法我們一旦近似了價值函數或者是動作價值函數就可以通過一些政策進行控制，比如 \(\epsilon\)-greedy。

那麼我們簡單回顧下 RL 的學習目标：通過 agent 與環境進行互動，擷取累計回報最大化。既然我們最終要學習如何與環境互動的政策，那麼我們可以直接學習政策嗎，而之前先近似價值函數，再通過貪婪政策控制的思路更像是"曲線救國"。

這就是本篇文章的内容，我們如何直接來學習政策，用數學的形式表達就是：

\[\pi_{\theta}(s, a) = P[a | s, \theta]

這就是被稱為政策梯度（Policy Gradient，簡稱PG）算法。

當然，本篇内容同樣的是針對 model-free 的強化學習。

Value-Based vs. Policy-Based RL

Value-Based：

學習價值函數
Implicit policy，比如 \(\epsilon\)-greedy

Policy-Based：

沒有價值函數
直接學習政策

Actor-Critic：

學習政策

三者的關系可以形式化地表示如下：

![](https://img2018.cnblogs.com/blog/764050/201811/764050-20181102094901268-512503979.png)

認識到 Value-Based 與 Policy-Based 差別後，我們再來讨論下 Policy-Based RL 的優缺點：

優點：

收斂性更好
對于具有高維或者連續動作空間的問題更加有效
可以學習随機政策

缺點：

絕大多數情況下收斂到局部最優點，而非全局最優
評估一個政策一般情況下低效且存在較高的方差

Policy Search

我們首先定義下目标函數。

Policy Objective Functions

目标：給定一個帶有參數 \(\theta\) 的政策 \(\pi_{\theta}(s, a)\)，找到最優的參數 \(\theta\)。

但是我們如何評估不同參數下政策 \(\pi_{\theta}(s, a)\) 的優劣呢？

對于episode 任務來說，我們可以使用start value：

\[J_1(\theta)=V^{\pi_{\theta}}(s_1)=E_{\pi_{\theta}}[v_1]

對于連續性任務來說，我們可以使用 average value：

\[J_{avV}(\theta)=\sum_{s}d^{\pi_{\theta}}(s)V^{\pi_{\theta}}(s)

或者每一步的平均回報：

\[J_{avR}(\theta)=\sum_{s}d^{\pi_{\theta}}(s)\sum_{a}\pi_{\theta}(s, a)R_s^a

其中 \(d^{\pi_{\theta}}(s)\) 是馬爾卡夫鍊在 \(\pi_{\theta}\) 下的靜态分布。

Policy Optimisation

在明确目标以後，我們再來看基于政策的 RL 為一個典型的優化問題：找出 \(\theta\) 最大化 \(J(\theta)\)。

最優化的方法有很多，比如不依賴梯度（gradient-free）的算法：

爬山算法
模拟退火
進化算法
...

但是一般來說，如果我們能在問題中獲得梯度的話，基于梯度的最優化方法具有比較好的效果：

梯度下降
共轭梯度
拟牛頓法

我們本篇讨論梯度下降的方法。

政策梯度定理

假設政策 \(\pi_{\theta}\) 為零的時候可微，并且已知梯度 \(\triangledown_{\theta}\pi_{\theta}(s, a)\)，定義 \(\triangledown_{\theta}\log\pi_{\theta}(s, a)\) 為得分函數（score function）。二者關系如下：

\[\triangledown_{\theta}\pi_{\theta}(s, a) = \triangledown_{\theta}\pi_{\theta}(s, a) \frac{\triangledown_{\theta}\pi_{\theta}(s, a)}{\pi_{\theta}(s, a)}=\pi_{\theta}(s, a)\triangledown_{\theta}\log\pi_{\theta}(s, a)

接下來我們考慮一個隻走一步的MDP，對它使用政策梯度下降。\(\pi_{\theta}(s, a)\) 表示關于參數 \(\theta\) 的函數，映射是 \(p(a|s,\theta)\)。它在狀态 \(s\) 向前走一步，獲得獎勵\(r=R_{s, a}\)。那麼選擇行動 \(a\) 的獎勵為 \(\pi_{\theta}(s, a)R_{s, a}\)，在狀态 \(s\) 的權重獎勵為 \(\sum_{a\in A}\pi_{\theta}(s, a)R_{s, a}\)，應用政策所能獲得的獎勵期望及梯度為：

\[J(\theta)=E_{\pi_{\theta}}[r] = \sum_{s\in S}d(s)\sum_{a\in A}\pi_{\theta}(s, a)R_{s, a}\\

\triangledown_{\theta}J(\theta) = \color{Red}{\sum_{s\in S}d(s)\sum_{a\in A}\pi_{\theta}(s, a)}\triangledown_{\theta}\log\pi_{\theta}(s, a)R_{s, a}=E_{\pi_{\theta}}[\triangledown_{\theta}\log\pi_{\theta}(s, a)r]

再考慮走了多步的MDP，使用 \(Q^{\pi_{\theta}}(s, a)\) 代替獎勵值 \(r\)，對于任意可微的政策，政策梯度為：

\[\triangledown_{\theta}J(\theta) = E_{\pi_{\theta}}[\triangledown_{\theta}\log\pi_{\theta}(s, a)Q^{\pi_{\theta}}(s, a)]

對于任意可微政策 \(\pi_{\theta}(s, a)\)，任意政策目标方程 \(J = J_1, J_{avR}, ...\)，政策梯度：

蒙特卡洛政策梯度算法（REINFORCE）

Monte-Carlo政策梯度算法，即REINFORCE：

通過采樣episode來更新參數：；
使用随機梯度上升法更新參數；
使用return \(v_t\) 作為 \(Q^{\pi_{\theta}}(s_t, a_t)\) 的無偏估計

則 \(\Delta\theta_t = \alpha \triangledown_{\theta}\log\pi_{\theta}(s_t, a_t)v_t\)，具體如下：

[Reinforcement Learning] Policy Gradient Methods

Actir-Critic 政策梯度算法

Monte-Carlo政策梯度的方差較高，是以放棄用return來估計行動-價值函數Q，而是使用 critic 來估計Q：

\[Q_w(s, a)\approx Q^{\pi_{\theta}}(s, a)

這就是大名鼎鼎的 Actor-Critic 算法，它有兩套參數：

Critic：更新動作價值函數參數 \(w\)
Actor：朝着 Critic 方向更新政策參數 \(\theta\)

Actor-Critic 算法是一個近似的政策梯度算法：

\[\triangledown_\theta J(\theta)\approx E_{\pi_{\theta}}[\triangledown_{\theta}\log \pi_{\theta}(s, a)Q_w(s, a)]\\

\Delta\theta = \alpha\triangledown_\theta\log\pi_{\theta}(s,a)Q_w(s,a)

Critic 本質就是在進行政策評估：How good is policy \(\pi_{\theta}\) for current parameters \(\theta\).

政策評估我們之前介紹過MC、TD、TD(\(\lambda\))，以及價值函數近似方法。如下所示，簡單的 Actir-Critic 算法 Critic 為動作價值函數近似，使用最為簡單的線性方程，即：\(Q_w(s, a) = \phi(s, a)^T w\)，具體的僞代碼如下所示：

在 Actir-Critic 算法中，對政策進行了估計，這會産生誤差（bias），但是當滿足以下兩個條件時，政策梯度是準确的：

價值函數的估計值沒有和政策相違背，即：\(\triangledown_w Q_w(s,a) = \triangledown_\theta\log\pi_{\theta}(s,a)\)
價值函數的參數w能夠最小化誤差，即：\(\epsilon = E_{\pi_{\theta}}[(Q^{\pi_{\theta}}(s, a) - Q_w(s,a))^2]\)

優勢函數

另外，我們可以通過将政策梯度減去一個基線函數（baseline funtion）B(s)，可以在不改變期望的情況下降低方差（variance）。證明不改變期望，就是證明相加和為0：

\[\begin{align}

E_{\pi_{\theta}}[\triangledown_\theta\log\pi_{\theta}(s,a)B(s)]

&=\sum_{s\in S}d^{\pi_{\theta}}(s)\sum_a \triangledown_\theta\pi_{\theta}(s, a)B(s)\\

&=\sum_{s\in S}d^{\pi_{\theta}}(s)B(s)\triangledown_\theta\sum_{a\in A}\pi_{\theta}(s,a )\\

&= 0

\end{align}

狀态價值函數 \(V^{\pi_{\theta}}(s)\) 是一個好的基線。是以可以通過使用優勢函數（Advantage function）\(A^{\pi_{\theta}}(s,a)\) 來重寫價值梯度函數。

\[A^{\pi_{\theta}}(s,a)=Q^{\pi_{\theta}}(s,a)-V^{\pi_{\theta}}(s)\\

\triangledown_\theta J(\theta)=E_{\pi_{\theta}}[\triangledown_\theta\log\pi_{\theta}(s,a)A^{\pi_{\theta}}(s,a)]

設 \(V^{\pi_{\theta}}(s)\) 是真實的價值函數，TD算法利用bellman方程來逼近真實值，誤差為 \(\delta^{\pi_{\theta}}=r+\gamma V^{\pi_{\theta}}(s') - V^{\pi_{\theta}}(s)\)。該誤差是優勢函數的無偏估計。是以我們可以使用該誤差計算政策梯度：

\[\triangledown_\theta J(\theta)=E_{\pi_{\theta}}[\triangledown_\theta\log\pi_{\theta}(s,a)\delta^{\pi_{\theta}}]

該方法隻需要critic，不需要actor。更多關于 Advantage Function 的可以看這裡。

最後總結一下政策梯度算法：

Reference

[1] Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, 2018

[2] David Silver's Homepage

[3] Advantage Learning

作者：Poll的筆記

部落格出處：http://www.cnblogs.com/maybe2030/

本文版權歸作者和部落格園所有，歡迎轉載，轉載請标明出處。

<如果你覺得本文還不錯，對你的學習帶來了些許幫助，請幫忙點選右下角的推薦>

[Reinforcement Learning] Policy Gradient Methods

Value-Based vs. Policy-Based RL

Policy Search

Policy Objective Functions

Policy Optimisation

政策梯度定理

蒙特卡洛政策梯度算法（REINFORCE）

Actir-Critic 政策梯度算法

優勢函數

Reference

繼續閱讀

算法導論8-5思考題-平均排序-average sorting

POJ題目分類（不定期更新）

繩索資料結構（字元串快速拼接）

Count ways to reach the n’th stair Count ways to reach the n’th stair

符合泊松分布的事件模拟到達時間生成符合泊松分布的事件模拟到達時間生成

相隔為1的編輯距離

Algorithms Review: Divide and Conquer(Binary Search & Merge Sort)

watermark performance standard &amp; algorithms

Visual Tracking 和 Motion Estimation的差別

[zz]The Most Important Algorithms (in CS and Math)

High-level Synthesis from AutoESL: A Game-changer for Chip Design

采用ODC改善軟體品質：一個案例研究

各種二分查找

查找算法學習之二分查找（Python版本）——BinarySearch

一道某高大上網際網路公司的筆試題分享

【python】【資料處理】畫多元資料分布圖