
When reinforcement learning collides with vision-language models: UC Berkeley proposes the Language Reward Modulation (LAMP) framework

Author: Jiangmen Ventures

In reinforcement learning (RL), an important research direction is how to design the reward mechanism cleverly. The traditional approach is to hand-design a reward function and feed rewards back to the model based on how it performs the task. Later, learned reward functions (LRFs) emerged as a data-driven alternative: instead of relying on hand specification or sparse task rewards, the reward function is learned from data, and such methods have shown good performance on many complex real-world tasks.

This article presents a new paper from a UC Berkeley research team in which the authors question whether it makes sense to use LRFs as a drop-in replacement for task rewards. Instead, they treat the zero-shot capability of today's popular vision-language models (VLMs) as a source of pretraining supervision for RL, rather than simply as a reward for downstream tasks, and propose LAMP, a language-reward-modulated pretraining method. LAMP first takes a frozen pretrained VLM and queries it with a rich set of language instructions paired with the visual observations captured by the agent, producing a diverse collection of pretraining rewards; these rewards are then optimized with a reinforcement learning algorithm. Through extensive experiments, the authors show that LAMP differs from previous ways of using VLM rewards and achieves remarkably sample-efficient learning on robot manipulation tasks.


Paper link:

https://arxiv.org/abs/2308.12270

Code repository:

https://github.com/ademiadeniji/lamp


1. Introduction

Looking back, reinforcement learning has gone through a development from hand-designed reward functions to rewards learned by networks. Hand-designed reward functions tend to be over-engineered, which makes them hard to transfer to new agents and new environments; reward functions learned from large amounts of demonstration data, in turn, are often noisy and produce spurious rewards, making them unreliable for complex tasks such as high-precision robot manipulation. The authors draw inspiration from large-scale pretrained VLMs, which show strong zero-shot performance across a wide variety of tasks and adapt quickly to new ones. A VLM scores the alignment between an image representation and a task-specific piece of text, which gives it an implicit multi-task flexibility: simply prompting it with different language instructions yields a scalable family of different rewards. This property fits naturally with the premise of RL pretraining, namely using such cross-task rewards to pretrain a general-purpose RL agent, rather than relying on noisy LRFs to train expert RL models that only work on a single task.


In the pretraining stage, LAMP pairs highly diverse language prompts with the visual observations collected by the agent and queries the VLM with these text-image pairs, generating diverse pretraining rewards of many forms. In the downstream fine-tuning stage, a simple language-conditioned multi-task reinforcement learning algorithm is used to optimize these rewards. Experiments show that LAMP effectively reduces the number of samples needed for downstream fine-tuning in realistic robot environments while maintaining strong manipulation performance.

2. Methodology

The figure below shows the overall LAMP pipeline, which consists of two training stages: (1) a task-agnostic RL pretraining stage, in which a set of language instructions is used to query rewards from the VLM and pretrain the RL agent; and (2) a downstream fine-tuning stage, in which instructions for a new task are supplied, the pretrained language-conditioned policy is adapted under these instructions, and the target task is solved by maximizing the new task's reward.

[Figure: overview of the two-stage LAMP pipeline — language-conditioned RL pretraining followed by downstream fine-tuning.]
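To make the two stages concrete, here is a minimal pseudocode-style sketch of the control flow described above. The function and class names (`policy.act`, `reward_fn`, `update_fn`) are illustrative placeholders rather than the authors' actual implementation, and the Gym-style environment interface is an assumption; the sketch only captures the structure of the pipeline.

```python
import random

def pretrain(policy, env, reward_fn, instructions, update_fn, num_episodes):
    """Stage 1: task-agnostic pretraining; rewards come from a frozen VLM.

    reward_fn(obs0, obs, lang) -> float is the VLM alignment score;
    update_fn(policy, transition) applies one RL update step.
    """
    for _ in range(num_episodes):
        lang = random.choice(instructions)        # sample one diverse language prompt
        obs0 = env.reset()                        # first frame of the episode
        obs, done = obs0, False
        while not done:
            action = policy.act(obs, lang)        # language-conditioned policy
            next_obs, _, done, _ = env.step(action)
            r = reward_fn(obs0, next_obs, lang)   # query the frozen VLM for a reward
            update_fn(policy, (obs, action, r, next_obs, lang))
            obs = next_obs


def finetune(policy, env, task_instruction, update_fn, num_episodes):
    """Stage 2: condition on the downstream instruction and use the task reward."""
    for _ in range(num_episodes):
        obs, done = env.reset(), False
        while not done:
            action = policy.act(obs, task_instruction)
            next_obs, task_r, done, _ = env.step(action)   # environment-defined reward
            update_fn(policy, (obs, action, task_r, next_obs, task_instruction))
            obs = next_obs
```

The key design point is that the same language-conditioned policy interface is reused in both stages; only the source of the reward changes, from the frozen VLM during pretraining to the task reward during fine-tuning.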

2.1 Language reward modulation

To extract an RL pretraining reward signal from VLMs, the authors adopt R3M [1] as the vision-language feature extractor. R3M learns semantic feature representations from the large-scale egocentric human video dataset Ego4D and has been shown to substantially improve the data efficiency of imitation learning on real-world robots. Language inputs are processed by $\psi$, a pretrained DistilBERT transformer that aggregates the embeddings of the words in a text instruction. R3M serves as a score generator between a text instruction and visual observations; the authors argue that R3M scores are well suited to providing rewards over visual observations because its representations are explicitly trained to capture temporal information in videos. Concretely, the reward defined from the R3M score is

$$r_t^{\text{R3M}} = \mathcal{F}\big(\phi(o_0),\, \phi(o_t),\, \psi(l)\big),$$

where $\mathcal{F}$ is R3M's score predictor, $\phi(o_0)$ and $\phi(o_t)$ are the embeddings of the initial image $o_0$ and the current image $o_t$, and $\psi(l)$ is the embedding of the language instruction $l$; the score measures how much the transition from $o_0$ to $o_t$ progresses toward completing $l$.
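As a rough illustration of how such a reward could be computed, the snippet below scores an (initial frame, current frame, instruction) triple with a frozen image encoder, a frozen DistilBERT text encoder, and a small score head. The `score_head` and the pooling choice are stand-ins for R3M's internal score predictor, not the actual R3M API; treat this as a sketch under the assumption that the predictor consumes the two image embeddings together with the language embedding.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class VLMReward(nn.Module):
    """Sketch of an R3M-style language-conditioned reward (components are stand-ins)."""

    def __init__(self, image_encoder: nn.Module, img_dim: int, txt_dim: int = 768):
        super().__init__()
        self.image_encoder = image_encoder                      # frozen visual backbone
        self.tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
        self.text_encoder = AutoModel.from_pretrained("distilbert-base-uncased")
        # Hypothetical score head: maps (phi(o0), phi(ot), psi(l)) to a scalar score.
        self.score_head = nn.Sequential(
            nn.Linear(2 * img_dim + txt_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )
        for p in self.parameters():
            p.requires_grad_(False)                             # everything stays frozen

    @torch.no_grad()
    def forward(self, obs0: torch.Tensor, obs_t: torch.Tensor, instruction: str) -> torch.Tensor:
        # Assumes a single observation pair (batch size 1) for simplicity.
        tokens = self.tokenizer(instruction, return_tensors="pt")
        psi_l = self.text_encoder(**tokens).last_hidden_state.mean(dim=1)  # pooled text embedding
        phi_0 = self.image_encoder(obs0)                                   # embedding of initial frame
        phi_t = self.image_encoder(obs_t)                                  # embedding of current frame
        return self.score_head(torch.cat([phi_0, phi_t, psi_l], dim=-1)).squeeze(-1)
```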

The figure below shows how well R3M and two other models, InternVideo [2] and ZeST [3], align vision and language on downstream RLBench tasks. As the reward curves show, none of the three methods yields a stable reward signal, which suggests that directly optimizing a final policy against these rewards would be difficult. The authors therefore use these rewards only as an exploration signal during the pretraining phase.

[Figure: reward curves produced by R3M, InternVideo, and ZeST on RLBench downstream tasks.]

2.2 Language-conditioned behavior learning

To make the pretrained RL model usable for a variety of downstream tasks, the authors design a task suite with diverse visuals and objects for LAMP. They first build a custom environment on top of the RLBench simulation toolkit; to simulate realistic visual scenes, they download a large number of real-scene images from the Ego4D dataset and overlay them as textures on the tabletop and background. To vary the objects and their appearance, they import a large number of ShapeNet 3D object meshes, so that the visual textures and objects encountered during training are randomized at every iteration. Because the LAMP reward score measures how far the agent's behavior is from what the instruction actually asks for, it can easily be combined with unsupervised RL methods. To encourage exploration of new tasks, the authors therefore combine the LAMP reward with the intrinsic reward of Plan2Explore [4], an unsupervised RL algorithm that seeks out novelty by using the disagreement among an ensemble of models predicting the future latent state as a novelty score, denoted $r_t^{\text{P2E}}$. The agent's pretraining objective is then the following weighted sum of rewards:

$$r_t^{\text{pre}} = \alpha\, r_t^{\text{R3M}} + \beta\, r_t^{\text{P2E}},$$

where $\alpha$ and $\beta$ are weighting coefficients.
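A minimal sketch of this combination is shown below, assuming a `vlm_reward_fn` like the one sketched earlier and an ensemble of next-latent predictors; the disagreement is computed as the variance of their predictions, following the general idea of Plan2Explore, and `alpha`/`beta` are hypothetical weighting coefficients rather than values from the paper.

```python
import torch

def ensemble_disagreement(ensemble, latent: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
    """Plan2Explore-style novelty: variance across an ensemble of next-latent predictors."""
    preds = torch.stack([model(latent, action) for model in ensemble])  # (K, B, D)
    return preds.var(dim=0).mean(dim=-1)                                # (B,) disagreement score

def pretrain_reward(vlm_reward_fn, ensemble, obs0, obs_t, instruction,
                    latent, action, alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    """Weighted sum of the language-alignment reward and the exploration bonus."""
    r_vlm = vlm_reward_fn(obs0, obs_t, instruction)          # R3M-style alignment score
    r_explore = ensemble_disagreement(ensemble, latent, action)
    return alpha * r_vlm + beta * r_explore
```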

The authors used ChatGPT to generate a large set of robotic manipulation task descriptions, such as "push button" and "pick up cup". During pretraining, LAMP randomly samples language prompts $l$ from this set, computes the corresponding language embeddings $\psi(l)$, and then calculates the final reward as described in the previous section. After pretraining, LAMP obtains a fairly general language-conditioned policy that can drive the robot to perform the various behaviors specified by the instruction $l$. As shown in the figure below, pretraining is carried out mainly in randomized environments textured with Ego4D images.

[Figure: pretraining rollouts in environments with randomized Ego4D textures.]
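To make the prompt sampling concrete, here is a small illustrative sketch of building an instruction bank and drawing one prompt per pretraining episode. The verb and noun lists and the template are hypothetical stand-ins; the paper's actual prompt set was generated with ChatGPT and, as the ablations below show, varied in style.

```python
import itertools
import random

# Hypothetical verb/noun lists; the paper's actual prompts were generated with ChatGPT.
VERBS = ["push", "pick up", "lift", "slide", "close", "turn"]
NOUNS = ["button", "cup", "saucepan lid", "microwave door", "tap", "block"]

def build_instruction_bank() -> list:
    """Combine verbs and nouns into simple imperative prompts, e.g. 'pick up the cup'."""
    return [f"{verb} the {noun}" for verb, noun in itertools.product(VERBS, NOUNS)]

def sample_pretraining_prompt(bank: list) -> str:
    """Each pretraining episode conditions the policy on one randomly drawn prompt."""
    return random.choice(bank)

if __name__ == "__main__":
    bank = build_instruction_bank()
    print(len(bank), "prompts; example:", sample_pretraining_prompt(bank))
```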

Because LAMP has already learned a language-conditioned policy, fine-tuning only requires choosing a language instruction whose semantics roughly match the downstream task and conditioning the policy on it. The authors highlight this as a significant advantage of LAMP: using language as the task specifier makes it possible to fine-tune the model for downstream tasks at very low cost.

3. Experiments

The experiments are run on 96 randomized domains obtained by sampling different Ego4D textures; with probability 0.2 the authors instead sample the default RLBench textures. For the robot's action space, the authors use a 4-dimensional continuous action space in which the first three dimensions give the position of the robot's end effector and the last dimension controls the gripper. As baselines, the authors compare against an agent trained from scratch and against the Plan2Explore (P2E) method.
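The setup above translates roughly into the following sketch, using a Gym-style `Box` action space; the bounds, texture list, and sampling helper are assumptions for illustration rather than the authors' exact configuration.

```python
import random
from typing import List, Optional

import numpy as np
from gym import spaces

# 4-D continuous action: (x, y, z) end-effector position target + 1-D gripper command.
# The [-1, 1] bounds are an assumption; actions would be rescaled to the workspace.
action_space = spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)

def sample_domain(ego4d_textures: List[str], p_default: float = 0.2) -> Optional[str]:
    """Pick a visual domain for an episode: RLBench default textures with probability 0.2,
    otherwise one of the 96 Ego4D-derived textures."""
    if random.random() < p_default:
        return None                       # None = keep RLBench's default appearance
    return random.choice(ego4d_textures)
```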

3.1 Fine-tuning results

The authors selected five common manipulation tasks: Pick Up Cup, Take Lid Off Saucepan, Push Button, Close Microwave, and Turn Tap. The figure below compares the experimental results.

[Figure: fine-tuning curves of LAMP, Plan2Explore, and training from scratch on the five RLBench tasks.]

Training a randomly initialized agent from scratch on a new task exhibits high sample complexity. On most RLBench tasks, Plan2Explore, which uses unsupervised exploration, clearly outperforms training from scratch, and the proposed LAMP method performs better still. The authors attribute this to LAMP's VLM-based pretraining rewards, which expose the agent to far more diverse rewards; the resulting learned representations allow it to adapt quickly to entirely new tasks during fine-tuning.

3.2 Ablation study on language prompts

One advantage of using pretrained VLMs is that rewards can be obtained for any query text. The authors therefore ran an ablation study on the prompt styles used in the pretraining phase, comparing the following six styles:

[Table: the six language prompt styles used during pretraining.]

Prompt styles 1-5 compare related verbs and nouns against various unrelated combinations, while for prompt style 6 the authors deliberately choose a more difficult Shakespearean text fragment to observe adaptation to samples entirely outside the pretraining distribution. The figure below compares fine-tuning performance after pretraining with the different prompt styles.

[Figure: fine-tuning performance after pretraining with each of the six prompt styles.]

Prompts 1-5 are all based on task actions. The Pick Up Cup task is chosen here because its name is simple and very close to the pretraining prompts; on this task, prompt style 2, which is semantically similar yet diverse, achieves the best performance. The right side of the figure analyzes the effect of the Shakespearean text on fine-tuning, compared against the model using the best-performing prompt style 2: with the P2E reward removed, LAMP with prompt 6 and LAMP with prompt 2 perform roughly the same, but once the P2E reward is added, using out-of-distribution language prompts significantly hurts LAMP's performance.

3.3 Comparison with other vision-language reward models

Beyond language prompts, the authors also compared different VLMs in the pretraining stage. Here they chose the ZeST model, which is trained in largely the same way as CLIP and likewise uses the similarity between text and image features as the reward.

[Figure: fine-tuning on Pick Up Cup after pretraining with R3M rewards versus ZeST rewards.]

The figure above compares LAMP's fine-tuning performance on the downstream Pick Up Cup task when pretrained with R3M versus ZeST rewards. R3M appears to give better sustained performance, but pretraining with ZeST is not far behind. The authors conclude that the proposed approach does not inherently depend on a specific VLM, and that swapping in more powerful VLMs in the future could further improve performance.

4. Summary

In this work, the authors study how the flexibility of VLMs can be used to generate diverse rewards for reinforcement learning, and propose LAMP, a language-prompt-based reward modulation method. LAMP sidesteps many limitations of learned reward functions in traditional deep RL and exploits the strong zero-shot generalization of VLMs to generate many different rewards during pretraining. The authors also find that the VLM-based reward model combines well with newer RL optimization methods such as Plan2Explore, yielding strong performance. Extensive experiments show that LAMP delivers superior reinforcement learning performance in a variety of challenging scenarios.

References

[1] Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3M: A universal visual representation for robot manipulation, 2022.
[2] Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, and Yu Qiao. InternVideo: General video foundation models via generative and discriminative learning, 2022.
[3] Yuchen Cui, Scott Niekum, Abhinav Gupta, Vikash Kumar, and Aravind Rajeswaran. Can foundation models perform zero-shot task specification for robot manipulation?, 2022.
[4] Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and Deepak Pathak. Planning to explore via self-supervised world models. CoRR, abs/2005.05960, 2020.

Author: seven_

Illustration by IconScout Store from IconScout

-The End-
