paper2: Policy Gradient Methods for Reinforcement Learning with Function Approximation

2023-04-25 09:35:01

Policy Gradient Methods for Reinforcement Learning with Function Approximation

I. New Takeaways
- 1. Understanding of and takeaways from each section
  - abstract
  - (1) Policy Gradient Theorem
  - (2) Policy Gradient with Approximation
  - (3) Application to Deriving Algorithms and Advantages
  - (4) Convergence of Policy Iteration with Function Approximation
II. Summary

I. New Takeaways

1. Understanding of and takeaways from each section

abstract

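The abstract states directly that policy gradient methods update the policy parameters along the gradient of the expected reward.

The main new result of the paper: the gradient can be written in a form suitable for estimation from experience, aided by an approximate action-value function or advantage function.

Value-function methods have worked well in many applications, but they have some limitations:

(1) They are oriented toward finding deterministic policies, whereas the optimal policy is often stochastic, selecting different actions with specific probabilities;

(2) An arbitrarily small change in the estimated value of an action can change whether that action is selected at all.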

(1) Policy Gradient Theorem

This section states the policy gradient theorem; the proof is given in the appendix.

$$\frac{\partial \rho}{\partial \theta} = \sum_{s} d^{\pi}(s) \sum_{a} \frac{\partial \pi(s,a)}{\partial \theta} Q^{\pi}(s,a)$$

where, in the average-reward formulation:

$$d^{\pi}(s) = \lim_{t \to \infty} \Pr\{s_t = s \mid s_0, \pi\}$$

or, in the start-state formulation:

$$d^{\pi}(s) = \sum_{t=0}^{\infty} \gamma^{t} \Pr\{s_t = s \mid s_0, \pi\}$$
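To make the theorem concrete, here is a minimal numerical check, not code from the paper. It assumes the start-state (discounted) formulation, a tabular softmax policy, and a small random MDP; the names `evaluate`, `theorem_gradient`, and `finite_difference_gradient` are made up for this sketch. It computes the gradient as the sum over states of d(s) times the sum over actions of dpi(s,a)/dtheta times Q(s,a), and compares it with a central finite difference of the return from the start state.

```python
import numpy as np

def softmax_policy(theta):
    """Row-wise softmax: theta[s, a] are action preferences in state s."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def evaluate(theta, P, R, gamma, s0):
    """Exact rho, Q^pi and discounted state weighting d^pi for a small MDP."""
    n_states = R.shape[0]
    pi = softmax_policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)   # state transition matrix under pi
    r_pi = (pi * R).sum(axis=1)             # expected one-step reward under pi
    I = np.eye(n_states)
    V = np.linalg.solve(I - gamma * P_pi, r_pi)
    Q = R + gamma * (P @ V)                 # shape (n_states, n_actions)
    # d(s) = sum_t gamma^t Pr(s_t = s | s0, pi)
    d = np.linalg.solve((I - gamma * P_pi).T, I[s0])
    return V[s0], Q, d, pi

def theorem_gradient(theta, P, R, gamma, s0):
    """Policy gradient theorem, specialised to a tabular softmax policy.

    For this policy, sum_a dpi(s,a)/dtheta[s,b] * Q(s,a)
    simplifies to pi(s,b) * (Q(s,b) - V(s)).
    """
    _, Q, d, pi = evaluate(theta, P, R, gamma, s0)
    V = (pi * Q).sum(axis=1, keepdims=True)
    return d[:, None] * pi * (Q - V)

def finite_difference_gradient(theta, P, R, gamma, s0, eps=1e-6):
    """Central finite differences of rho with respect to every theta[s, a]."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        step = np.zeros_like(theta)
        step[idx] = eps
        up = evaluate(theta + step, P, R, gamma, s0)[0]
        down = evaluate(theta - step, P, R, gamma, s0)[0]
        grad[idx] = (up - down) / (2 * eps)
    return grad

# A made-up random MDP: 4 states, 3 actions.
rng = np.random.default_rng(0)
n_states, n_actions, gamma, s0 = 4, 3, 0.9, 0
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)           # normalise to transition probabilities
R = rng.random((n_states, n_actions))       # expected reward for each (s, a)
theta = rng.normal(size=(n_states, n_actions))

analytic = theorem_gradient(theta, P, R, gamma, s0)
numeric = finite_difference_gradient(theta, P, R, gamma, s0)
print(np.max(np.abs(analytic - numeric)))   # should be close to zero
```

Note that the derivative of d(s) with respect to the parameters never appears in `theorem_gradient`, which is exactly what makes the theorem convenient for sampling-based estimation.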

(2) Policy Gradient with Approximation

This section states the policy gradient theorem with function approximation:

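If $f_w$ satisfies

$$\sum_{s} d^{\pi}(s) \sum_{a} \pi(s,a) \left[ Q^{\pi}(s,a) - f_w(s,a) \right] \frac{\partial f_w(s,a)}{\partial w} = 0$$

and is compatible with the policy parameterization in the sense that

$$\frac{\partial f_w(s,a)}{\partial w} = \frac{\partial \pi(s,a)}{\partial \theta} \frac{1}{\pi(s,a)}$$

then

$$\frac{\partial \rho}{\partial \theta} = \sum_{s} d^{\pi}(s) \sum_{a} \frac{\partial \pi(s,a)}{\partial \theta} f_w(s,a)$$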
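The compatibility idea can be illustrated on the simplest possible case. The sketch below is my own and assumes a single-state problem (a bandit, so the state weighting is 1 and Q is just a vector), a softmax policy over linear action preferences, and made-up features `phi` and action values `Q`. The approximator is f_w(a) = w . psi(a), where psi(a) is the gradient of log pi(a), and w is fitted by pi-weighted least squares. With only 2 features for 5 actions, f_w cannot reproduce Q, yet plugging it into the gradient formula gives the same gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_features = 5, 2

phi = rng.normal(size=(n_actions, n_features))   # hypothetical action features
theta = rng.normal(size=n_features)              # policy parameters
Q = np.array([1.0, 0.5, 2.0, 1.5, 0.3])          # made-up true action values

# Softmax (Gibbs) policy over linear preferences theta . phi(a).
prefs = phi @ theta
pi = np.exp(prefs - prefs.max())
pi /= pi.sum()

# Compatible features: psi(a) = grad_theta log pi(a) = phi(a) - sum_b pi(b) phi(b).
psi = phi - pi @ phi

# Fit w by pi-weighted least squares, so that at the solution
# sum_a pi(a) [Q(a) - f_w(a)] psi(a) = 0.
sqrt_pi = np.sqrt(pi)
w, *_ = np.linalg.lstsq(sqrt_pi[:, None] * psi, sqrt_pi * Q, rcond=None)
f_w = psi @ w

# dpi(a)/dtheta = pi(a) * psi(a), so both gradients are pi-weighted sums.
exact_grad = (pi[:, None] * psi).T @ Q     # uses the true Q
approx_grad = (pi[:, None] * psi).T @ f_w  # uses f_w in place of Q

print(np.max(np.abs(f_w - Q)))             # f_w is a poor fit to Q itself
print(np.max(np.abs(exact_grad - approx_grad)))
```

A side effect worth noticing: because the compatible features average to zero under the policy, f_w also averages to zero under the policy, so it behaves more like an advantage estimate than like an estimate of Q.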

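(3) Application to Deriving Algorithms and Advantages

This section is about deriving algorithms. For example, given a policy parameterization, whether linear or nonlinear, Theorem 2 can be used to derive the matching parameterized form of the value-function approximator, and different policy parameterizations lead to different forms. In addition, $f_w$ can be regarded as an approximator of the advantage function. The advantage function is defined as follows: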

$$A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)$$

The advantage function measures how much better the chosen action a is than the average action in that state.
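A small worked example may help here. The numbers below are invented for a single state with three actions, and the policy is assumed to be a softmax over one preference per action. The sketch computes the advantage as Q minus the policy-weighted average of Q, and checks that using the advantage instead of Q leaves the gradient unchanged, because the derivatives of the action probabilities sum to zero.

```python
import numpy as np

pi = np.array([0.5, 0.3, 0.2])   # made-up action probabilities in one state
Q = np.array([1.0, 2.0, 4.0])    # made-up action values in that state

V = pi @ Q                       # state value: average action value under pi
A = Q - V                        # advantage of each action over the average

# Jacobian of a softmax policy w.r.t. its preferences:
# J[a, b] = dpi(a)/dtheta_b = pi(a) * (1[a == b] - pi(b)).
J = np.diag(pi) - np.outer(pi, pi)

grad_with_Q = J.T @ Q            # sum_a dpi(a)/dtheta * Q(a)
grad_with_A = J.T @ A            # same sum with the advantage

print(V, A)
print(grad_with_Q, grad_with_A)
```

Actions with a positive advantage have their probability pushed up, actions with a negative advantage are pushed down, and the policy-weighted advantages cancel out.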

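(4) Convergence of Policy Iteration with Function Approximation

This section presents Theorem 3: policy iteration with function approximation converges to a locally optimal policy. The proof is given in the paper.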

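With step sizes satisfying $\lim_{k \to \infty} \alpha_k = 0$ and $\sum_k \alpha_k = \infty$, the sequence

$$\theta_{k+1} = \theta_k + \alpha_k \sum_{s} d^{\pi_k}(s) \sum_{a} \frac{\partial \pi_k(s,a)}{\partial \theta} f_{w_k}(s,a)$$

converges in the sense that

$$\lim_{k \to \infty} \frac{\partial \rho(\pi_k)}{\partial \theta} = 0$$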
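The iteration behind this result can be sketched on a toy problem. This is only an illustration under my own assumptions, not the paper's experiment: a single-state problem with three actions, one-hot action features, known action values `Q`, a compatible approximator refitted at every iteration, and step sizes 1/(k+1), which shrink to zero while their sum diverges. The helper `compatible_gradient` is a made-up name.

```python
import numpy as np

def softmax(prefs):
    z = np.exp(prefs - prefs.max())
    return z / z.sum()

def compatible_gradient(theta, phi, Q):
    """Fit the compatible approximator f_w for the current policy and
    return the resulting gradient estimate and the expected reward rho."""
    pi = softmax(phi @ theta)
    psi = phi - pi @ phi                      # grad_theta log pi(a)
    sqrt_pi = np.sqrt(pi)
    # pi-weighted least squares; lstsq copes with the rank-deficient design.
    w, *_ = np.linalg.lstsq(sqrt_pi[:, None] * psi, sqrt_pi * Q, rcond=None)
    f_w = psi @ w
    grad = (pi[:, None] * psi).T @ f_w        # sum_a dpi(a)/dtheta * f_w(a)
    return grad, pi @ Q

phi = np.eye(3)                               # one-hot features (an assumption)
Q = np.array([1.0, 2.0, 3.0])                 # made-up action values
theta = np.zeros(3)                           # start from the uniform policy

grad, rho_initial = compatible_gradient(theta, phi, Q)
grad_norm_initial = np.linalg.norm(grad)

for k in range(2000):
    grad, _ = compatible_gradient(theta, phi, Q)
    theta = theta + grad / (k + 1)            # alpha_k = 1 / (k + 1)

grad, rho_final = compatible_gradient(theta, phi, Q)
grad_norm_final = np.linalg.norm(grad)

print(rho_initial, rho_final)
print(grad_norm_initial, grad_norm_final)
```

In this toy run the expected reward rises and the gradient norm shrinks as the policy concentrates on the best action. The slowly decaying step size makes progress deliberately slow; it is chosen to match the conditions of the theorem, not for speed.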

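II. Summary

This paper is mainly about three theorems on policy gradients, together with the conditions under which they apply and their proofs. It is not recommended for readers who just want to understand what a policy gradient is. It is recommended for readers who want to see how the formulas are derived and where the final results come from: the derivations are very detailed, no steps are skipped, and they are easy to follow.

That concludes this paper analysis. Thank you very much for reading!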
