
TRAJECTORY VAE FOR MULTI-MODAL IMITATION

ABSTRACT

We address the problem of imitating multi-modal expert demonstrations in sequential decision making problems. In many practical applications, for example video games, behavioural demonstrations are readily available that contain multi-modal structure not captured by typical existing imitation learning approaches. For example, differences in the observed players’ behaviours may be representative of different underlying playstyles.

In this paper, we use a generative model to capture different emergent playstyles in an unsupervised manner, enabling the imitation of a diverse range of distinct behaviours. We utilise a variational autoencoder to learn an embedding of the different types of expert demonstrations at the trajectory level, and jointly learn a latent representation with a policy. In experiments on a range of 2D continuous control problems representative of Minecraft environments, we empirically demonstrate that our model can capture a multi-modal structured latent space from the demonstrated behavioural trajectories.

1 INTRODUCTION

Imitation learning has become successful in a wide range of sequential decision making problems, in which the goal is to mimic expert behaviour given demonstrations (Ziebart et al., 2008; Wang et al., 2017; Li et al., 2017; D’Este et al., 2003). Compared with reinforcement learning, imitation learning does not require access to a reward function – a key advantage in domains where rewards are not naturally or easily obtained. Instead, the agent learns a behavioural policy implicitly through demonstrated trajectories.

Expert demonstrations are typically assumed to be provided by a human demonstrator and generally can vary from person to person, e.g., according to their personality, experience and skill at the task. Therefore, when capturing demonstrations from multiple humans, observed behaviours may be distinctly different due to multi-modal structure caused by differences between demonstrators. Variations like these, which are very common in video games where players often cluster into distinct play styles, are typically not modelled explicitly, as the structure of these differences is not known a priori but instead emerges over time as part of the changing meta-game.

In this paper, we propose the Trajectory Variational Autoencoder (T-VAE), a deep generative model that learns a structured representation of the latent features of human demonstrations that result in diverse behaviour, enabling the imitation of different types of emergent behaviour. In particular, we use a Variational Autoencoder (VAE) to maximise the Evidence Lower Bound (ELBO) of the log likelihood of the expert demonstrations at the trajectory level, where the policy is learned directly by optimising the ELBO. Not only can our model reconstruct expert demonstrations, but we empirically demonstrate that it learns a meaningful latent representation of distinct emergent variations in the observed trajectories.
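To make this concrete, the following is a minimal sketch (in PyTorch) of a trajectory-level ELBO of the kind described above. The `encoder`, `state_decoder` and `policy_decoder` callables are hypothetical stand-ins rather than the exact architecture used in this paper: the encoder maps a whole trajectory to the mean and log-variance of q(z | trajectory), and the two decoders return the log-likelihood of the state sequence and of the actions given the sampled latent code.

```python
import torch

def tvae_loss(encoder, state_decoder, policy_decoder, states, actions):
    """Negative trajectory-level ELBO: reconstruct both the state sequence and
    the actions from a single latent code per trajectory, regularised towards
    a standard normal prior. All modules here are hypothetical stand-ins."""
    mu, logvar = encoder(states, actions)                 # parameters of q(z | trajectory)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterisation trick
    state_loglik = state_decoder(z, states)               # sum_t log p(s_t | s_<t, z)
    action_loglik = policy_decoder(z, states, actions)    # sum_t log pi(a_t | s_t, z)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
    return -(state_loglik + action_loglik - kl).mean()
```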

2 RELATED WORK

Popular imitation learning methods include behaviour cloning (BC) (Pomerleau, 1991), a supervised learning method that learns a policy from expert demonstrations of state-action pairs. However, this approach assumes independent observations, which is not the case for sequential decision making problems, as future observations depend on previous actions. It has been shown that BC cannot generalise well to unseen observations (Ross & Bagnell, 2010). Ross et al. (2011) proposed a new iterative algorithm, which trains a stationary deterministic policy with no-regret learning in an online setting to overcome this issue. Torabi et al. (2018) also improve behaviour cloning with a two-phase approach in which the agent first learns an inverse dynamics model by interacting with the environment in a self-supervised fashion, and then uses the model to infer missing actions given expert demonstrations. An alternative approach is Apprenticeship Learning (AL) (Abbeel & Ng, 2004), which uses inverse reinforcement learning to infer a reward function from expert trajectories. However, it suffers from expensive computation due to the requirement of repeatedly performing reinforcement learning from tabula rasa to convergence. Whilst each of these methods has had successful applications, none is able to capture multi-modal structure in the demonstration data representative of underlying emergent differences in playstyle.

More recently, the learning of a latent space for imitation learning has been studied in the literature. Generative Adversarial Imitation Learning (GAIL) (Ho & Ermon, 2016) learns a latent space of demonstrations with a Generative Adversarial Network (GAN) (Goodfellow et al., 2014) style approach, which is inherently mode-seeking and does not explicitly model multi-modal structure in the demonstrations. This limitation was addressed by Li et al. (2017), who built on the GAIL framework to infer a latent structure of expert demonstrations, enabling imitation of diverse behaviours. Similarly, Wang et al. (2017) combined a VAE with a GAN architecture to imitate diverse behaviours. However, these methods require interacting with the environment and rollouts of the policy whilst learning. For comparison, we note that our method does not need access to the environment simulator during training and is computationally cheaper, as the policy is learned simply by gradient descent using a fixed dataset of trajectories. Additionally, whilst the aim in GAIL is to keep the agent behaviour close to the expert’s state distribution, our model can serve as an alternative approach to capturing state sequence structure.

In work more closely related to our approach, Co-Reyes et al. (2018) have also proposed a Variational Autoencoder (VAE) (Kingma & Welling, 2013) that embeds the expert demonstrations at the trajectory level, with promising results. However, their approach only encodes the trajectories of the states, whereas ours encodes both the state and action trajectories, which also allows us to learn the policy directly from the probabilistic model rather than adding a penalty term to the ELBO. Rabinowitz et al. (2018) also learn an interpretable representation of the latent space in a hierarchical way, but their focus is more on representing the mental states of other agents and differs from our goal of imitating diverse emergent behaviours.

3 METHODS

3.1 PRELIMINARIES

3.2 ENCODER NETWORK

We encode whole trajectories into the latent space in order to embed useful features of different behaviours and to extract distinguishing features that differ from trajectory to trajectory. Note that the latent z is therefore a single variable rather than a sequence that depends on t.
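A minimal sketch of such a trajectory-level encoder follows, assuming a GRU that reads the concatenated (state, action) sequence and summarises it into the parameters of q(z | trajectory); the specific recurrent architecture and dimensions are illustrative assumptions rather than the exact design used in this paper.

```python
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """Hypothetical encoder: one latent code per trajectory, not per timestep."""
    def __init__(self, obs_dim, act_dim, hidden_dim, latent_dim):
        super().__init__()
        self.rnn = nn.GRU(obs_dim + act_dim, hidden_dim, batch_first=True)
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, states, actions):
        # states: (batch, T, obs_dim); actions: (batch, T, act_dim)
        _, h = self.rnn(torch.cat([states, actions], dim=-1))
        h = h[-1]                             # final hidden state summarises the whole trajectory
        return self.mu(h), self.logvar(h)     # parameters of q(z | trajectory)
```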

我們将整個軌迹編碼到潛在空間中,以嵌入不同行為的有用特征,并提取不同于軌迹到軌迹的差別特征。 請注意,潛在z是以是單個變量而不是依賴于t的序列。 為了利用

TRAJECTORY VAE FOR MULTI-MODAL IMITATION(用于多模态模拟的軌迹VAE)
TRAJECTORY VAE FOR MULTI-MODAL IMITATION(用于多模态模拟的軌迹VAE)
3.3 VARIATIONAL BOUND

The marginal likelihood for each trajectory can be written as
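A form consistent with the model described here (assuming a single latent code z per trajectory, a standard normal prior p(z), a state decoder for the state sequence and a policy decoder for the actions; the exact factorisation is an assumption) is

$$
\log p_\theta(\tau) \;=\; \log \int p_\theta(s_{1:T}, a_{1:T} \mid z)\, p(z)\, \mathrm{d}z
\;\ge\; \mathbb{E}_{q_\phi(z \mid \tau)}\!\left[ \sum_{t=1}^{T} \log p_\theta(s_t \mid s_{<t}, z) + \log \pi_\theta(a_t \mid s_t, z) \right] \;-\; D_{\mathrm{KL}}\big( q_\phi(z \mid \tau) \,\|\, p(z) \big),
$$

where $\tau = (s_1, a_1, \dots, s_T, a_T)$ denotes a trajectory and $q_\phi(z \mid \tau)$ is the trajectory encoder; the right-hand side is the ELBO maximised during training.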

3.4 GENERATING TRAJECTORIES

4 EXPERIMENTS

4.1 2D NAVIGATION EXAMPLE


We first apply our model to a 2D navigation example with 3 types of trajectories, representative of players moving towards different goal locations. This experiment confirms that our approach can detect and imitate multi-modal structure in demonstrations, and learns a meaningful and consistent latent representation. Starting from (0, 0), the state space consists of the 2D (continuous) coordinates and the action is the angle along which the agent moves a fixed distance (= 1). The time horizon is fixed at 100.
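As a concrete illustration of these dynamics, the sketch below assumes the angle is given in radians and applied as a fixed-length step in that direction; the `policy` argument is a placeholder for any learned policy rather than the paper's own.

```python
import numpy as np

def step(state, angle, step_size=1.0):
    # move a fixed distance along the chosen heading
    return state + step_size * np.array([np.cos(angle), np.sin(angle)])

def rollout(policy, horizon=100):
    # roll the policy forward from the fixed start position (0, 0)
    state, trajectory = np.zeros(2), []
    for _ in range(horizon):
        state = step(state, policy(state))
        trajectory.append(state.copy())
    return np.array(trajectory)
```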

In Figure 2, the ground truth trajectories are given in (a), and we reconstruct the trajectories through the state decoder and the policy decoder in (b) and (c) respectively. It can be seen that they are consistent with each other and represent the test set well. The latent embedding can be found in (d), where we can clearly identify 3 clusters corresponding to the 3 types of trajectories.

Figure 3 shows interpolations as we navigate through the latent space, i.e., we sample a 4-by-4 grid in the latent space and generate trajectories using the state decoder and the policy decoder. We can see that T-VAE shows consistent behaviour as we interpolate in the latent space. This confirms that our approach can detect and imitate latent structure, and that it learns a meaningful latent representation that captures the main dimensions of variation.
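The interpolation itself amounts to decoding a regular grid of latent codes; the sketch below assumes a 2-D latent space, and `decode_trajectory` is a hypothetical stand-in for either the state decoder or the policy decoder.

```python
import numpy as np

def latent_grid(low=-2.0, high=2.0, n=4):
    # regular n-by-n grid of 2-D latent codes covering the region of interest
    axis = np.linspace(low, high, n)
    return [np.array([zx, zy]) for zy in axis for zx in axis]

# hypothetical usage: decode every grid point into a trajectory
# trajectories = [decode_trajectory(z) for z in latent_grid()]
```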

我們首先将我們的模型應用于2D導航示例,其中3種類型的軌迹代表玩家朝向不同的目标位置移動。該實驗證明了我們的方法可以檢測和模仿多模态結構示範,并學習有意義且一緻的潛在表示。從(0; 0)開始,狀态空間由2D(連續)坐标組成,動作是移動固定距離(= 1)的角度。時間範圍固定為100。

在圖2中,地面實況軌迹在(a)中給出,并且我們分别通過狀态解碼器和(b)和(c)中的政策解碼器重建軌迹。可以看出它們彼此一緻并且很好地代表了測試集。潛在嵌入可以在(d)中找到,其中我們可以清楚地識别對應于3種類型的軌迹的3個簇。

圖3示出了當我們在潛在空間中導航時的插值,即我們在潛在空間中采樣4乘4網格,并使用狀态解碼器和政策解碼器生成軌迹。我們可以看到,當我們在潛在空間中插值時,T-VAE顯示出一緻的行為。這證明了我們的方法可以檢測和模仿潛在結構,并且它學習了一個有意義的潛在表示,捕獲變異的主要次元。

TRAJECTORY VAE FOR MULTI-MODAL IMITATION(用于多模态模拟的軌迹VAE)

Figure 3: Interpolation of the latent space for (a) the state decoder and (b) the policy decoder. It can be observed that the top left corner, top right corner and bottom right corner behave like the red, blue and green types of trajectories respectively, and the bottom left corner has a mixed behaviour.

圖3:(a)狀态解碼器的潛在空間的插值; (b)政策解碼器。 可以觀察到,左上角,右上角和右下角分别表現為紅色,藍色和綠色類型的軌迹,左下角具有混合行為。

4.2 2D CIRCLE EXAMPLE


Figure 4: (a) Ground truth trajectories on the test set; (b) reconstructed trajectories with the state decoder; (c) reconstructed trajectories with the policy decoder; (d) 2D latent space.

We next apply our model to another 2D example, designed to replicate the experimental setting in Figure 1 of Li et al. (2017). There are three types of circles (coloured red, blue and green in the figures), as shown in Figure 4a. The agent starts from (0, 0), the observation consists of the continuous 2D coordinates, and the action is the relative angle towards which the agent moves. The reconstructed test set using the state decoder and the policy decoder, and visualisations of the 2D latent space, can be found in Figure 4.

These results show that when the sequence length is not fixed (unlike in the previous example), T-VAE is still able to produce consistency between the state and policy decoders and to learn latent features that underpin different behaviours. Furthermore, as Figure 1 in Li et al. (2017) already showed that both behaviour cloning and GAIL fail at this task whereas InfoGAIL (and now T-VAE) perform well, using a latent representation to capture long-term dependency appears to be crucial in this example.

4.3 ZOMBIE ATTACK SCENARIO

Finally, we evaluate our model on a simplified 2D Minecraft-like environment. This set of experiments shows that T-VAE is able to capture long-term dependencies and model a mixed action space, and that performance improves when using a rolling window during prediction. In each episode, the agent needs to reach a goal. A zombie moves towards the agent, and there are two types of demonstrated expert behaviour: the ’attacking’ behaviour, where the agent moves to the zombie and attacks it before going to the goal, and the ’avoiding’ behaviour, where the agent avoids the zombie and reaches the goal. The initial positions of the agent and the goal are kept fixed, whereas the initial position of the zombie is sampled uniformly at random. The observation space consists of the distance and angle to the goal and to the zombie respectively, and there are two types of actions: 1) the angle along which the agent moves by a fixed step size (= 0.5), and 2) a Bernoulli variable indicating whether to attack the zombie in a given timestep, which is very sparse and typically equals 1 only once for the ’attacking’ behaviour. This setup therefore exemplifies a mixed continuous-discrete action space; a minimal sketch of a decoder head for such an action space is given below. Episodes end when the agent reaches the goal or when the number of time steps reaches the maximum allowed, defined as the maximum sequence length in the training set (30).
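The sketch assumes (purely for illustration) a Gaussian density over the movement angle and a Bernoulli distribution over the attack flag; the actual parameterisation used in the paper may differ.

```python
import torch
import torch.nn as nn
from torch.distributions import Bernoulli, Normal

class MixedActionHead(nn.Module):
    """Hypothetical policy-decoder head for a continuous angle plus a discrete attack flag."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.angle_mean = nn.Linear(hidden_dim, 1)
        self.angle_logstd = nn.Linear(hidden_dim, 1)
        self.attack_logit = nn.Linear(hidden_dim, 1)

    def log_prob(self, h, angle, attack):
        # total action log-likelihood = continuous term + discrete term
        angle_dist = Normal(self.angle_mean(h), self.angle_logstd(h).exp())
        attack_dist = Bernoulli(logits=self.attack_logit(h))
        return angle_dist.log_prob(angle) + attack_dist.log_prob(attack)
```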

Figure 5 shows the ground truth and reconstructions of the two types of behaviour on the test set, and Figure 6 shows the learned latent space. We also provide animations: https://youtu.be/fvcJbYnRND8 (the ’attacking’ region) and https://youtu.be/DAruY-Dd9z8 (the ’avoiding’ region). These show test-time behaviour where we randomly sample from the posterior distribution of the latent variable z in the region of the latent space corresponding to the respective cluster.

To examine the diversity of the generated behaviour, we randomly select a latent z in each of the ’attacking’ and ’avoiding’ clusters in Figure 6a and generate 1000 trajectories. Histograms of different statistics are displayed in Figure 7, where the top and bottom rows represent ’attacking’ and ’avoiding’ behaviour respectively. We can see a clear differentiation between these two latent variables. Although the agent does not always succeed in killing the zombie, as shown in Figure 7b, the closest distances to the zombie (shown in Figure 7d) are almost all within the demonstrated range, meaning that the agent moves towards the zombie but attacks with slightly different timing.

Results comparing different rolling window lengths can be found in Figure 8. For the attacking agent, an episode is a success if the zombie is dead and the agent reaches the goal. For the avoiding agent, an episode is a success if the agent reaches the goal while staying beyond the zombie’s attacking range. It can be seen that for small rolling window lengths the performance of the ’attacking’ agent is worse, since the model fails to capture long-term dependencies; provided a sufficient window length, diverse behaviours can be imitated.
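A sketch of what a rolling window means at prediction time, assuming it simply bounds the observation history that the decoder conditions on to the most recent steps; `policy_decoder` and `env` are hypothetical placeholders, not this paper's interfaces.

```python
from collections import deque

def predict_with_rolling_window(policy_decoder, env, z, window_length, horizon=30):
    """Hypothetical rollout that conditions the policy decoder on a bounded
    window of recent observations rather than the full history."""
    history = deque(maxlen=window_length)   # older observations fall out automatically
    state = env.reset()
    for _ in range(horizon):
        history.append(state)
        action = policy_decoder(z, list(history))   # hypothetical call signature
        state, done = env.step(action)
        if done:
            break
```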

5 CONCLUSION

In this paper, we proposed a new method – Trajectory Variational Autoencoder (T-VAE) – for imitation learning that is designed to capture latent multi-modal structure in demonstrated behaviour. Our approach encodes trajectories of state-action pairs and learns latent representations with a VAE on the trajectory level.

T-VAE encourages consistency between the state and action decoders, helping to avoid the compounding errors that are common in simpler behavioural cloning approaches to imitation learning. We demonstrate that this approach successfully avoids such errors in several tasks that require long-term consistency and generalisation.

Our model is successful in generating diverse behaviours and learning a policy directly from a probabilistic model. It is simple to train and gives promising results in a range of tasks, including a zombie task that requires generalisation given a moving opponent as well as a mixed continuous-discrete action space.

Figure 5: (a) Ground truth and (b) reconstruction for the zombie attack scenario. The agent starts at (0, 0), the goal is positioned at (5, 5), and the zombie starts at a random location and moves towards the agent.

Figure 7: The top row and bottom row display the results of the trajectories generated from the ’attacking’ and ’avoiding’ clusters respectively. The first and second columns show whether the agent attacks the zombie and whether the zombie is dead in the episode; the difference arises because the agent sometimes attacks the zombie while outside the attacking range, so the zombie does not die. The third and fourth columns show the closest distance to the goal and to the zombie in each episode. The agent reaches the goal when the distance to it is < 0.5, which is indicated by the red dashed line (success).

A wide range of future work can be built upon this approach. For example, reinforcement learning could be bootstrapped with these initial policies to improve beyond the demonstrated behaviour, given an additional reward signal, whilst aiming to maintain the diversity in behaviours.

REFERENCES

Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, pp. 1. ACM, 2004.

John D Co-Reyes, YuXuan Liu, Abhishek Gupta, Benjamin Eysenbach, Pieter Abbeel, and Sergey Levine. Self-consistent trajectory autoencoder: Hierarchical reinforcement learning with trajectory embeddings. arXiv preprint arXiv:1806.02813, 2018.

Claire D’Este, Mark O’Sullivan, and Nicholas Hannah. Behavioural cloning and robot control. In Robotics and Applications, pp. 179–182, 2003.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573, 2016.

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Yunzhu Li, Jiaming Song, and Stefano Ermon. Infogail: Interpretable imitation learning from visual demonstrations. In Advances in Neural Information Processing Systems, pp. 3812–3822, 2017.

Dean A Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88–97, 1991.

Neil C Rabinowitz, Frank Perbet, H Francis Song, Chiyuan Zhang, SM Eslami, and Matthew Botvinick. Machine theory of mind. arXiv preprint arXiv:1802.07740, 2018.

Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 661–668, 2010.

Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635, 2011.

Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation. arXiv preprint arXiv:1805.01954, 2018.

Ziyu Wang, Josh S Merel, Scott E Reed, Nando de Freitas, Gregory Wayne, and Nicolas Heess. Robust imitation of diverse behaviors. In Advances in Neural Information Processing Systems, pp. 5320–5329, 2017.

Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pp. 1433–1438. Chicago, IL, USA, 2008.
