
Of Moments and Matching: Trade-offs and Treatments in Imitation Learning (cs.LG)

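We present a unified view of a large family of existing imitation learning algorithms through the lens of moment matching. At its core, our classification scheme rests on whether the learner tries to match (1) the reward moments or (2) the action-value moments of the expert's behavior; each choice leads to a different algorithmic approach. By considering adversarially chosen divergences between learner and expert behavior, we derive bounds on policy performance that hold for all algorithms in each of these classes, which are the first such bounds to our knowledge. We also introduce the notion of recoverability, which is implicit in many earlier analyses of imitation learning and lets us state clearly how well each algorithmic family can mitigate compounding errors. Finally, we derive two new algorithm templates, AdVIL and AdRIL, which offer strong guarantees, simple implementations, and competitive empirical performance.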

Original title: Of Moments and Matching: Trade-offs and Treatments in Imitation Learning

Original abstract: We provide a unifying view of a large family of previous imitation learning algorithms through the lens of moment matching. At its core, our classification scheme is based on whether the learner attempts to match (1) reward or (2) action-value moments of the expert's behavior, with each option leading to differing algorithmic approaches. By considering adversarially chosen divergences between learner and expert behavior, we are able to derive bounds on policy performance that apply for all algorithms in each of these classes, the first to our knowledge. We also introduce the notion of recoverability, implicit in many previous analyses of imitation learning, which allows us to cleanly delineate how well each algorithmic family is able to mitigate compounding errors. We derive two novel algorithm templates, AdVIL and AdRIL, with strong guarantees, simple implementation, and competitive empirical performance.
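To make the idea of an adversarially chosen divergence over reward moments more concrete, here is a minimal Python sketch. It is not the paper's AdVIL or AdRIL. It only illustrates, under simplifying assumptions, how one might measure the largest mismatch between expert and learner behavior over a small class of moment functions. The function name `reward_moment_gap`, the feature arrays, and the finite class of linear moment functions are all hypothetical choices made for this illustration.

```python
import numpy as np


def reward_moment_gap(expert_feats, learner_feats, moment_weights):
    """Worst-case gap in average moments over a finite class of linear functions.

    Assumptions (illustrative only, not taken from the paper):
      - expert_feats, learner_feats: arrays of shape (n_samples, d) holding
        state-action features phi(s, a) sampled from each policy.
      - moment_weights: array of shape (k, d); row w defines the moment
        function f(s, a) = w . phi(s, a).
      - The class is treated as symmetric (f and -f both allowed), so the
        gap is measured in absolute value.
    """
    expert_feats = np.asarray(expert_feats, dtype=float)
    learner_feats = np.asarray(learner_feats, dtype=float)
    moment_weights = np.asarray(moment_weights, dtype=float)

    # Difference of empirical feature means between expert and learner.
    mean_diff = expert_feats.mean(axis=0) - learner_feats.mean(axis=0)

    # Moment mismatch for every candidate function in the class.
    gaps = moment_weights @ mean_diff

    # The "adversary" picks the function with the largest mismatch.
    worst = int(np.argmax(np.abs(gaps)))
    return float(np.abs(gaps[worst])), worst


# Toy usage: the expert always produces feature [1, 0], the learner [0, 1].
expert = [[1.0, 0.0], [1.0, 0.0]]
learner = [[0.0, 1.0], [0.0, 1.0]]
weights = [[1.0, 0.0], [1.0, -1.0], [0.5, 0.5]]

gap, index = reward_moment_gap(expert, learner, weights)
print(gap, index)  # prints 2.0 1
```

In this toy setting a learner that drove the gap to zero for every function in the class would be indistinguishable from the expert under those moments. Matching action-value moments instead would require a different set of functions and a different sampling scheme, which this sketch does not attempt.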

2103.03236.pdf