Scientists propose an imitation learning algorithm to effectively align agents with the real world

Author: DeepTech

Embodied agents with multimodal capabilities are among the most important components of achieving artificial general intelligence, and the hope is that they can be deployed to help complete everyday tasks such as common household chores, autonomous driving, and robotic manipulation.

At present, the field has no widely accepted technical solution for effectively training multimodal embodied agents.

Large language models famously follow the Scaling Laws: put simply, the larger the model and the more data it is trained on, the better the performance. However, it is difficult to replicate the success of large language models when training embodied agents.

The main reasons are:

First, unlike the massive corpora used to train large language models, data for embodied intelligence is scarce, homogeneous, and expensive to collect (often in the multi-million-dollar range); second, the field lacks a training method as effective as supervised learning.
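
For reference, the scaling laws mentioned above are often written as a power law in model size N and data size D. The parametric form below comes from the general scaling-law literature, with fitted constants; it is not from this article:

```latex
% One common parametric form from the LM scaling-law literature
% (E, A, B, \alpha, \beta are fitted constants, assumed here):
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Larger N and D drive the loss L toward its irreducible floor E, which is the sense in which "bigger model, more data" yields better performance.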

Based on this, researchers from the Southern University of Science and Technology (SUSTech), the University of Maryland, College Park, the University of Technology Sydney, and JD Explore Academy proposed a new training framework for embodied agents to address the misalignment between the training of multimodal embodied agents and a changing environment.

In this framework, a large language model provides step-by-step feedback and guidance for the agent during imitation learning, which significantly improves the success rate on household robot tasks.

Previous studies often assumed that, when training an embodied agent, a sufficiently large offline dataset would yield better performance.

This study offers the field a new perspective: even with a very large dataset, the future variations of the world are infinite and impossible to exhaust or fully generalize over. The agent therefore needs to collect feedback data from the environment in real time and keep learning from it interactively.

Recently, the related paper, titled "Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld," was published on the preprint site arXiv [1] and has been accepted by CVPR 2024.

Yang Yijun, a doctoral student at SUSTech, is the first author. Shi Yuhui, a chair professor at SUSTech, and Dr. Shen Li of JD Explore Academy (now an associate professor at Sun Yat-sen University) are the co-corresponding authors.

Figure | Related paper (Source: arXiv)

Key issue: agents are misaligned with environment dynamics

The researchers aim to train embodied agents that follow language instructions based on visual observations. Under the existing paradigm, however, such agents are typically trained on offline, fixed datasets, which leads to a series of problems: hallucination, distribution shift, and sparse rewards.

Specifically:

First, hallucination, that is, misalignment with human goals.

When an agent is trained on a fixed, offline dataset, that dataset can only reflect what the world looked like at a certain point in time.

The world, however, is dynamic. If the agent encounters a scene or situation that never appeared in the dataset, it may perform actions that seem unreasonable to humans. This is commonly called "hallucination" and manifests as wrong, irrational, or even dangerous behavior.

Yang Yijun pointed out: "The most direct way to completely solve the agent's hallucination problem is to let the agent interact with the environment constantly, collect feedback data from the environment in real time, and then keep learning interactively, and so on."

(Source: Southern University of Science and Technology)
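
A minimal sketch of this "interact, collect feedback, keep learning" loop might look as follows (the `env` and `agent` interfaces here are hypothetical placeholders, not the paper's code):

```python
# Minimal sketch of online interactive learning: the agent acts,
# collects real-time feedback from the environment, and updates itself.
# `env` and `agent` are hypothetical placeholder interfaces.

def online_learning_loop(env, agent, num_episodes=100):
    for _ in range(num_episodes):
        obs = env.reset()                      # current visual observation
        done = False
        trajectory = []
        while not done:
            action = agent.act(obs)            # decide from the live observation
            next_obs, feedback, done = env.step(action)
            trajectory.append((obs, action, feedback))
            obs = next_obs
        agent.update(trajectory)               # learn from fresh feedback,
                                               # not only a fixed offline dataset
```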

Second, distribution shift, that is, misalignment with environment dynamics.

Distribution shift is similar to hallucination: the distribution of the data the agent originally learned from differs from the distribution of the data it encounters later. Because the data distribution keeps changing over time as decisions are made, an agent fully trained on the original dataset will produce abnormal actions or model outputs when deciding under the shifted distribution.
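
As a toy numeric illustration of the idea (the Gaussian "state" distributions and the amount of drift below are invented for this sketch), one can quantify how far deployment-time data has drifted from the offline training data:

```python
import numpy as np

# Toy illustration of distribution shift: states seen at deployment
# drift away from the offline training distribution over time.
rng = np.random.default_rng(0)

train_states = rng.normal(loc=0.0, scale=1.0, size=10_000)   # offline dataset
deploy_states = rng.normal(loc=1.5, scale=1.2, size=10_000)  # drifted world

# Compare the empirical distributions with a symmetric KL estimate.
bins = np.linspace(-6.0, 8.0, 61)
width = bins[1] - bins[0]
p, _ = np.histogram(train_states, bins=bins, density=True)
q, _ = np.histogram(deploy_states, bins=bins, density=True)
p, q = p + 1e-8, q + 1e-8                    # avoid log(0) and division by zero
sym_kl = width * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
print(f"symmetric KL ~ {sym_kl:.2f}")        # grows as the shift widens
```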

Third, sparse rewards. When an agent is trained by interacting with the environment, as in reinforcement learning, the feedback it gets from the environment can be very sparse.

Yang Yijun explained: "Successfully completing a task requires accumulating many decisions. However, the agent may get no valuable feedback at some, or even all, of the intermediate steps, and only receives a success signal after the final task is completed."

Therefore, if a task has too many intermediate steps and the agent is not guided step by step, it may be difficult for the agent to reach the final goal.
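
A toy example of how little signal a sparse reward carries (the task length and reward rule here are invented for illustration):

```python
# Toy sparse-reward episode: every intermediate step returns 0,
# and only completing the full task yields a reward of 1.
TASK_LENGTH = 50  # hypothetical number of decisions in one household task

def sparse_reward(task_done: bool) -> float:
    return 1.0 if task_done else 0.0

rewards = [sparse_reward(t == TASK_LENGTH - 1) for t in range(TASK_LENGTH)]
print(sum(rewards), rewards[:5])  # -> 1.0 [0.0, 0.0, 0.0, 0.0, 0.0]
# 49 of the 50 steps carry no signal, so the agent learns almost nothing
# about which intermediate decisions actually helped.
```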

Cross-modal training of embodied agents with large language models

In this study, Tianyi Zhou, an assistant professor at the University of Maryland, College Park, identified the key problem described above: the agent's misalignment with environment dynamics.

After team discussions, Yang Yijun proposed that the agent's policy could be trained more efficiently by having the agent interact with the environment continuously while a large language model provides step-by-step guidance based on the environment's feedback.

"In fact, we are the first team in the field to realize the dynamic misalignment between agents and the environment, which was also affirmed by the reviewers at CVPR 2024. He said.

The researchers propose a cross-modal imitation learning framework that exploits real-time feedback from the environment. Imitation learning involves two key roles: a teacher and a student.

After the state information of the environment is obtained, it is first fed to the large language model "teacher"; the "teacher" then digests the feedback and outputs an easier-to-learn target for the "student" to imitate.

"The teacher's output solves the earlier sparse-reward problem: the teacher can give the student guidance at every step of environment feedback, removing the need to wait until the whole task finishes to find out whether it succeeded," Yang said.

(Source: arXiv)
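
In the spirit of this teacher-student setup, a simplified sketch might look as follows (all interfaces are hypothetical placeholders; this is not the paper's released implementation):

```python
# Simplified cross-modal imitation sketch: an LLM "teacher" reads a text
# description of the environment state and proposes a target action at
# every step; the embodied "student" learns to imitate it from vision.
# `env`, `student`, and `llm_teacher` are hypothetical placeholders.

def cross_modal_imitation(env, student, llm_teacher, num_episodes=100):
    dataset = []
    for _ in range(num_episodes):
        obs, text_state = env.reset()          # paired visual + text views
        done = False
        while not done:
            # Step-level guidance from the parallel text world gives the
            # student a learning signal at every step (no sparse rewards).
            teacher_action = llm_teacher.advise(text_state)
            dataset.append((obs, teacher_action))

            # The student acts in the visual world from its own policy.
            action = student.act(obs)
            (obs, text_state), done = env.step(action)
        student.fit(dataset)                   # supervised imitation update
```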

In imitation learning, the traditional approach uses human annotation to guide training. For example, several candidate actions are offered at each step, and a human picks the one most likely to help accomplish the final goal.

Learning from human feedback, however, is not only time-consuming and laborious; it also requires annotators with specialized domain knowledge, especially for robotics problems, which drives up the cost of annotation.

At present, large language models can already complete many kinds of tasks, including some decision-making tasks. The research group therefore innovatively proposed replacing humans with a large language model to provide the feedback signal during imitation learning.

They invoke GPT-4 and, at each step, have it select the most appropriate text action from the candidate actions based on the environment feedback, further guiding the "student" toward the final goal.
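
A minimal sketch of this kind of LLM-as-teacher call, using the OpenAI Python client (the model name, prompt wording, and response parsing are assumptions for illustration, not the paper's exact setup):

```python
# Sketch of asking an LLM to pick the best action from candidates.
# Model name, prompt, and parsing are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def choose_action(text_state: str, candidates: list[str]) -> str:
    options = "\n".join(f"{i}: {a}" for i, a in enumerate(candidates))
    prompt = (
        "You are guiding a household robot.\n"
        f"Current state:\n{text_state}\n"
        f"Candidate actions:\n{options}\n"
        "Reply with only the index of the best action."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    # A robust implementation would validate the reply before indexing.
    return candidates[int(resp.choices[0].message.content.strip())]
```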

Figure | Rich test scenarios: the agent must complete a variety of household chores in different scenes (Source: ProcTHOR)

The method was evaluated in ALFWorld, a simulated environment rendered with Unity3D, where robots must complete tasks such as washing dishes, picking up apples, and taking out the garbage across thousands of different household scenes. For reference, humans complete these tasks with a success rate of about 91%, while agents that rely only on the robot's camera view of the scene, with no human intervention, succeed only about 20% of the time. Agents trained with the new method improve the task success rate by a significant 20%-70%, reaching 88%.

"It's also the only way to come close to human success at the moment. In the future, if our approach goes further to achieve economies of scale, it is possible to achieve or exceed a 91% success rate in a test environment with a larger model. Yang Yijun said.

Figure | Comparison of three vision-language-model-based agents in the ALFWorld visual environment (Source: arXiv)

The embodied agent training framework will continue to be extended

Before the advent of large language models, Yang Yijun's research focused on reinforcement learning, including offline reinforcement learning and continual reinforcement learning. These explorations laid a solid foundation for the present study and helped inspire and advance it.

"Based on the consideration of applying technology to practical problems, with the emergence of large language models, my research direction has gradually shifted to using the prior knowledge of large language models to help improve the efficiency of reinforcement learning algorithms. He said.

Figure | Yang Yijun (Source: Yang Yijun)

It cannot be ignored that the biggest problem with reinforcement learning is that learning a good policy through trial-and-error interaction with the environment requires a huge amount of data, while data in embodied intelligence is expensive. This remains one of the hardest problems to solve.

Next, the group plans to extend the method to achieve higher performance. Yang Yijun said: "We will try to introduce human feedback into the algorithm framework. In addition, human feedback can be mixed with feedback from large language models to reduce the cost problem."

They also intend to reduce the number of interactions with the environment from the perspective of optimizing the imitation learning algorithm itself. The number of interactions an agent has with the environment is closely tied to cost, so the researchers want to limit the number of interactions as much as possible while achieving the same learning performance.

For example, meta-learning allows the robot to reuse previously acquired, general prior knowledge to accelerate similar tasks (continual reinforcement learning), which can greatly reduce environment interactions, as the sketch below illustrates.

Yang Yijun gave an example: "Suppose the robot has already learned to wash plates; learning to wash bowls later is essentially similar, so the earlier skill can be reused."
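
A minimal sketch of this kind of warm-starting (the policy interface, environment, and episode count are invented for illustration):

```python
# Sketch of reusing a previously trained policy to warm-start a similar
# task, cutting the environment interactions needed to learn it.
import copy

def warm_start(old_policy, new_task_env, finetune_episodes=100):
    # Start from the old task's weights instead of training from scratch.
    new_policy = copy.deepcopy(old_policy)
    for _ in range(finetune_episodes):       # far fewer than training anew
        obs = new_task_env.reset()
        done = False
        while not done:
            action = new_policy.act(obs)
            obs, reward, done = new_task_env.step(action)
            new_policy.update(obs, action, reward)
    return new_policy
```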

In the past, many people believed that solving a problem required a sufficiently clever algorithm design. With the emergence and development of large language models, however, the way people think about solving AI problems has gradually changed.

At this stage, the algorithm can be quite simple, but the computing resources and data must be large enough. In other words, data and compute have become more important than algorithms.

Previous research on artificial intelligence has mainly focused on perception problems, that is, recognizing what is in a scene, such as using computer vision for detection, segmentation, depth estimation, and object recognition.

Speaking about the next possible stage of artificial intelligence, Yang Yijun said: "The next step for artificial intelligence should be the shift from perception problems to decision-making problems."

In the future, he hopes that decision-making problems can be tackled with the help of large language models, more data, more computing power, and larger models.

"On the decision-making side, we look forward to the emergence of a general decision-making model that can solve a wide variety of decision-making problems. I think that could be a milestone," Yang Yijun concluded.

References:

1. Yijun Yang et al. Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld. arXiv:2311.16714 (2024). https://arxiv.org/abs/2311.16714

2. https://procthor.allenai.org/

Operation/Typesetting: He Chenlong
