
Invited Article丨Taskless Learning and Its Application in Robot Task and Motion Planning

Author: Chinese Society of Artificial Intelligence

Text / Zhang Xianqi, Fan Xiaopeng

Abstract:

This paper proposes taskless learning, explains its differences from and connections with existing methods (including self-supervised learning, transfer learning, imitation learning, and reinforcement learning), introduces its application in robot task and motion planning, and analyzes its advantages and the main research difficulties in this field. Finally, the development of taskless learning in robotics and its application prospects in production and daily life are discussed.

Keywords:

Taskless learning, task and motion planning, robotics, artificial intelligence

0 Introduction

In 1961, the first industrial robot, the Unimate, was deployed on a General Motors production line, and robots have flourished in industrial settings ever since. In contrast, the development of domestic robots has been less satisfactory: the universal robots that served the Rossums in the 1921 stage play R.U.R. have still not been realized a century later. Compared with industrial robots, domestic robots require stronger intelligence, and related research faces more difficulties, such as intent recognition, tool use and construction, task-oriented object replacement, and user personalization. In recent years, the rapid development of artificial intelligence technologies such as intelligent decision-making and large language models has made robots with human-like intelligence appear attainable.

At present, the methods used for intelligent decision-making in robots fall mainly into two categories: reinforcement learning and imitation learning. Reinforcement learning requires task-related reward functions to guide the agent in learning how to complete a task. In contrast, imitation learning lets agents learn from pre-collected expert data, either imitating the expert's behavior directly (behavior cloning) or learning a reasonable reward function from it (inverse reinforcement learning). Both methods still have drawbacks: task-related reward functions usually hurt the model's generalization performance, and collecting expert data is costly. To this end, we propose a new learning method called taskless (task-agnostic) learning.

Taskless learning is motivated mainly by the fragmented and purposeless nature of human knowledge. Fragmentation means that the knowledge required to complete a specific task is usually not learned coherently and completely: people rarely learn all task-related knowledge at once and then complete the task. Instead, knowledge accumulates through daily life, and when a specific task arises, the relevant fragments are screened and integrated (possibly supplemented by some newly learned, task-related knowledge). For example, we have long known how to open a door and how to place a cup, so for the task of putting a plate into the refrigerator we only need to transfer and merge this fragmented knowledge. Purposelessness means that much knowledge is acquired without a specific goal, often incidentally while exploring the environment. For example, we may notice that a newly bought cup is somewhat heavy, or that a bookmark has an edge sharp enough to cut a hand; such knowledge can suddenly become useful for a specific task, such as remembering that the cup can hold down a note to keep it from being blown away by the wind, or realizing that the sharp-edged bookmark is suitable for opening a package. Inspired by these phenomena, we propose taskless learning. We prefer the name taskless learning to task-agnostic learning, because there may be no purpose or task at all during the learning process.

In the following sections, we first define taskless learning and describe its differences from and connections with existing methods; we then introduce a robot task and motion planning method based on taskless learning and discuss the advantages and difficulties of taskless learning in this field; finally, we discuss its development and application prospects.

1 Taskless learning

1.1 Basic Definitions

If a learning method's training data is collected entirely by means that are not directly related to the final task, and the model does not need to be retrained when solving the final task, we call this learning style taskless learning. We assume by default that the training data contains fragmented knowledge sufficient to solve the final task. In robotics research specifically, if the environmental exploration method is not directly related to the final task, such as completely random exploration or novelty-guided exploration, we call a method that uses such exploration data to teach the agent an intelligent decision-making method based on taskless learning.

1.2 Differences and connections with existing methods

Self-supervised learning is usually used to pre-train feature extraction models by generating data labels (i.e., supervision signals) automatically. For example, part of an image can be covered with a mask and a model trained to recover it, or an image can be split into blocks and shuffled, with the model trained to rearrange the blocks correctly. This approach focuses on enabling the model to extract better features; it can serve as an auxiliary task to improve model performance, or provide a pre-trained model that is fine-tuned on downstream tasks. In contrast, taskless learning focuses more on the relationship between the training data and the test task, and does not prescribe how the training data's supervision signal is generated; the signal may come from historical task-related information or from self-supervised methods.
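For concreteness, here is a minimal sketch of the masked-reconstruction pretext task described above, under assumed toy settings (the PretextAE model, patch size, and random image batch are all illustrative, not from the paper):

```python
# Masked-reconstruction pretext task: mask part of an image and train a
# model to recover it; the "label" is the original image itself.
import torch
import torch.nn as nn

class PretextAE(nn.Module):  # illustrative toy autoencoder
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def mask_random_patch(images, patch=16):
    """Zero out one random square patch per image."""
    masked = images.clone()
    b, _, h, w = images.shape
    ys = torch.randint(0, h - patch + 1, (b,))
    xs = torch.randint(0, w - patch + 1, (b,))
    for i in range(b):
        masked[i, :, ys[i]:ys[i] + patch, xs[i]:xs[i] + patch] = 0.0
    return masked

model = PretextAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
images = torch.rand(8, 3, 64, 64)  # stand-in batch of images
loss = nn.functional.mse_loss(model(mask_random_patch(images)), images)
opt.zero_grad(); loss.backward(); opt.step()
```

After pre-training, the encoder would typically be kept and fine-tuned on a downstream task, while the decoder is discarded.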

Transfer learning reduces the model's data requirements for a new task (the target domain) and improves performance on it by transferring knowledge learned in a source domain, which usually requires strong correlation or similarity between the two domains. This resembles taskless learning in one respect: the source domain/training data contains knowledge that can solve the target domain/target task. However, taskless learning emphasizes the fragmentation of knowledge, that is, the tasks covered by the training data as a whole may differ considerably from the target task.

Reinforcement learning (RL), which guides agents to explore the environment and learn to solve tasks by setting reward functions, has been very successful in many fields in recent years (e.g., AlphaGo). However, designing task-related reward functions usually requires task-specific expertise and tends to hurt the agent's generalization performance. The branch most relevant to taskless learning is goal-oriented reinforcement learning, which adds a goal as extra input so that agents can complete multiple tasks, but it still needs goal-related reward functions. In recent years many scholars have also proposed self-supervised reinforcement learning, but these methods usually use self-supervision only to extract state features or to help construct reward functions; completely abandoning task-related reward functions remains very difficult, so they are fundamentally different from taskless learning.
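To make the dependence on goal-related rewards concrete, here is a minimal sketch of the kind of reward function goal-oriented reinforcement learning still requires (the sparse form and the tolerance value are illustrative assumptions):

```python
import numpy as np

def goal_conditioned_reward(state, goal, tol=0.05):
    # Sparse goal-conditioned reward: 0 once the state is within `tol`
    # of the goal, -1 otherwise. Even this generic form is still a
    # goal-related reward that taskless learning avoids specifying.
    return 0.0 if np.linalg.norm(state - goal) < tol else -1.0
```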

Imitation learning (IL) requires collecting large amounts of expert data for agent training. An agent can be trained to mimic expert behavior in a supervised manner (behavior cloning) or to learn a good reward function from expert demonstrations (inverse reinforcement learning); much work also combines imitation with generative adversarial ideas (generative adversarial imitation learning). However, expert data are usually demonstrations of a specific task, which differs considerably from the training-data requirements of taskless learning. Some recent works use data not directly related to the target task to assist reinforcement learning and imitation learning, but such data still play only an auxiliary role.
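For contrast with taskless learning's data requirements, a minimal behavior-cloning sketch follows (the network size and the stand-in expert dataset are illustrative):

```python
# Behavior cloning: fit a policy to expert (state, action) pairs by plain
# supervised regression; this is exactly the task-specific expert data
# that taskless learning tries to do without.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

expert_states = torch.rand(256, 4)   # stand-in expert observations
expert_actions = torch.rand(256, 2)  # stand-in expert actions

for _ in range(100):
    loss = nn.functional.mse_loss(policy(expert_states), expert_actions)
    opt.zero_grad(); loss.backward(); opt.step()
```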

2 Application of taskless learning in robot task and motion planning

2.1 Robot task and motion planning method based on taskless learning

This subsection introduces a robot task and motion planning method based on taskless learning; its main framework is shown in Figure 1.


Figure 1 Robot task and motion planning based on taskless learning

2.1.1 Scene reconstruction and understanding

We adopt a real scene-virtual scene-real scene architecture (real to simulation to real, Real2Sim2Real) for robot task and motion planning (TAMP): through 3D reconstruction, scene-information estimation, and related techniques, the real scene is reconstructed in a virtual scene (i.e., a physics simulator); decisions are made in the virtual scene, and the selected actions are then executed in the real scene. For the Real2Sim conversion, a depth-map-based 3D reconstruction method is used, while object properties (size, material, etc.) are estimated by relevant artificial intelligence methods. Commonly used robots and physics simulators for building virtual scenes are shown in Figure 2.


Figure 2 Common robots and physics simulators
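As an illustration of the Real2Sim step, here is a minimal sketch using PyBullet (one widely used physics simulator); the object list stands in for the output of the upstream 3D reconstruction and attribute-estimation modules, and all values are illustrative:

```python
# Rebuild an estimated real-world object inside a physics simulator.
import pybullet as p

p.connect(p.DIRECT)           # headless simulation
p.setGravity(0, 0, -9.81)

# Example perception output: one box-shaped object with estimated
# size (3D reconstruction), pose (depth maps), and mass (material).
estimated_objects = [
    {"half_extents": [0.03, 0.03, 0.05],
     "position": [0.4, 0.0, 0.05],
     "mass": 0.2},
]

for obj in estimated_objects:
    col = p.createCollisionShape(p.GEOM_BOX, halfExtents=obj["half_extents"])
    p.createMultiBody(baseMass=obj["mass"],
                      baseCollisionShapeIndex=col,
                      basePosition=obj["position"])

p.stepSimulation()            # the virtual scene is now ready for planning
```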

2.1.2 Environmental Exploration

For the agent to understand the effects of different actions when interacting with objects in the environment, as well as intuitive physics and other information, it must perform different actions in the environment and collect data to support later learning. Exploration is conducted in a way that is not directly related to the task goal, simulating how people interact with the environment in daily life; it may be random exploration, novelty-guided exploration, or any other intrinsic-reward-driven exploration method unrelated to the task. The data are saved as [..., state_i, action_i, state_{i+1}, ...], which contain fragmented knowledge for solving downstream tasks. For robot action execution, we assume by default that the kinematics and dynamics are solved, and we impose no additional requirements on the control method (motion control, force control, hybrid control, etc., depending on the specific problem).
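A minimal sketch of such task-agnostic data collection, assuming a gym-style environment with the classic 4-tuple step interface, might look like this:

```python
def explore(env, steps=1000):
    """Collect [..., state_i, action_i, state_{i+1}, ...] transitions with
    a policy that deliberately ignores any downstream task (and reward)."""
    dataset = []
    state = env.reset()
    for _ in range(steps):
        action = env.action_space.sample()          # completely random exploration
        next_state, _, done, _ = env.step(action)   # reward is ignored on purpose
        dataset.append((state, action, next_state))
        state = env.reset() if done else next_state
    return dataset
```

Novelty-guided or other intrinsic-reward-driven exploration would only change how `action` is chosen; the saved transitions keep the same form.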

2.1.3 Knowledge Learning

The ability to learn the properties and functions of objects, to summarize and abstract objective laws, and to predict the results of executing actions is at the core of human intelligence, and it is also a key problem for artificial intelligence to solve. Knowledge can be divided into low-level and high-level knowledge. Low-level knowledge is tied to the specific environment and mainly involves fields such as scene understanding, i.e., the results of the robot interacting with objects through different actions in the current environment. High-level knowledge relates only to attributes such as object category and mainly involves fields such as function and affordance reasoning, tool use, physics/intuitive physics, and causality. For learning low-level knowledge, the object information in the current scene can be fed directly into a neural network model; for high-level knowledge, information such as object category, shape, and material can be extracted as the model input.

To learn such knowledge, a simple approach is to feed the object states before and after action execution, together with the corresponding action, into a neural network, and treat the extracted feature as the effect of executing that action. To make the extracted action-effect features more accurate, additional constraints may be needed, such as making the features of the same action as similar as possible. The combination of action-effect features and object features can then serve as fragmented knowledge that can be composed to accomplish specific tasks; of course, fragmented knowledge can also be represented in other ways.
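A hedged sketch of this idea follows, with an illustrative MLP encoder and the same-action similarity constraint realized as a simple L2 penalty (all dimensions are arbitrary assumptions):

```python
# Encode (state_before, state_after, action) into an action-effect feature.
import torch
import torch.nn as nn

effect_encoder = nn.Sequential(
    nn.Linear(8 + 8 + 2, 64), nn.ReLU(),  # 8-dim states, 2-dim actions (assumed)
    nn.Linear(64, 16))                    # 16-dim effect feature

def effect_feature(s, s_next, a):
    return effect_encoder(torch.cat([s, s_next, a], dim=-1))

# Two transitions produced by the SAME action on different objects:
s1, s1n = torch.rand(8), torch.rand(8)
s2, s2n = torch.rand(8), torch.rand(8)
a = torch.rand(2)

f1, f2 = effect_feature(s1, s1n, a), effect_feature(s2, s2n, a)
consistency_loss = (f1 - f2).pow(2).mean()  # same action -> similar effect feature
```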

2.1.4 Task and motion planning

In task and motion planning, task planning decomposes a target task into multiple subtasks, much as humans simplify a complex problem into several simple sub-problems, while motion planning moves the robot from a start state to an end state subject to constraints such as collision avoidance and the specific robot's joint torque and pose limits. Because task planning considers only a discrete task space and often ignores the real environment and the robot hardware, some subtasks may turn out to be infeasible. For this reason, many scholars have in recent years combined the two, using a single planner that considers both task and motion planning.

To apply taskless learning to task and motion planning, one task-decomposition approach takes the current environment state and the specific task information as input to a neural network whose output is a subtask; at the same time, subtask features can be constrained so that they can be composed from fragmented knowledge. Using the subtask features and the learned fragmented-knowledge features, suitable knowledge fragments are selected to complete the task. Robot motion planning can be solved by training neural network models on the saved exploration data, similarly to goal-oriented reinforcement learning methods, or by traditional methods such as grid methods and artificial potential fields.
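One way such selection could look, sketched with illustrative feature dimensions and simple cosine-similarity retrieval (the bank and the subtask feature stand in for the learned representations described above):

```python
# Retrieve the fragmented-knowledge entries best matching a subtask feature.
import torch
import torch.nn.functional as F

knowledge_bank = torch.rand(500, 16)  # fragmented-knowledge features (learned earlier)
subtask_feature = torch.rand(16)      # output of the task-decomposition network

scores = F.cosine_similarity(knowledge_bank, subtask_feature.unsqueeze(0), dim=1)
selected = scores.topk(5).indices     # knowledge fragments chosen for this subtask
```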

We use the real scene-virtual scene-real scene framework to simulate the way humans think: the brain reconstructs the real scene from visual and other information, thinks and simulates the results of actions internally, and finally executes the chosen action in the real scene to complete the task. Correspondingly, depth images from a depth camera are first used to reconstruct the real scene in the virtual environment; thinking (planning) is then carried out in the virtual scene; finally, the result of thinking (an action) is executed in the real scene to obtain a new environment state, and thinking and execution iterate until the task is completed. Since planning takes place in the virtual environment, multiple ideas (i.e., action sequences) can be generated and the optimal one selected for execution in the real environment. This is similar to Monte Carlo tree search, except that the simulation step is neither computed directly (as in Go, where the next board state follows directly from the current state and action) nor estimated by a neural network model, but simulated in the virtual environment (the physics simulator).
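A minimal sketch of this simulate-then-act loop follows; the environment and simulator hooks (observe, reset_to, rollout, sample_action, execute) are assumed interfaces for illustration, not an actual API:

```python
def plan_and_act(real_env, simulator, candidates=16, horizon=5):
    """Sample several candidate action sequences, roll each out in the
    physics simulator, and execute the best first action in reality."""
    best_score, best_plan = float("-inf"), None
    for _ in range(candidates):
        plan = [simulator.sample_action() for _ in range(horizon)]
        simulator.reset_to(real_env.observe())  # Real2Sim: mirror the real state
        score = simulator.rollout(plan)         # simulated, not estimated by a network
        if score > best_score:
            best_score, best_plan = score, plan
    real_env.execute(best_plan[0])              # act, observe, then re-plan
```

Re-planning after every executed action keeps the loop robust to the gap between simulation and reality.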

2.2 Analysis of advantages and difficulties

Taskless learning greatly reduces the data requirements of related artificial intelligence methods and promotes the realization of human-like intelligent robots. On the data side, since taskless learning needs only exploration data not directly related to the task, there is no need to collect expert data or to design reward functions that may require strong domain expertise, which greatly reduces data costs; moreover, because such exploration data are easier to obtain, agents can acquire training data at a larger scale. One difficulty of taskless learning is the learning and representation of fragmented knowledge, together with its retrieval and composition when facing a specific task. With no expert data or reward information for guidance at all, it is hard for fragmented-knowledge representations to generalize well across environments and tasks. One possible solution is to combine taskless learning with imitation learning, guiding the extraction and composition of fragmented knowledge with a small amount of expert data. This may be more reasonable, as it resembles how humans learn: partly from their own exploration and thinking, and partly from the words and deeds of parents and teachers.

3 Future Prospects

3.1 Development prospects

The following introduces several research areas that may be combined with taskless learning; they also correspond to several problems that intelligent robots urgently need to solve.

3.1.1 Use and Construction of Tools

In contrast to direct interaction between the robot and an object, a tool lets the robot interact with the object indirectly. The right tool can make a task easier, for example, using a box to carry multiple objects at once. In daily life, however, tool use and construction remain difficult problems for robots.

For tool use, on the one hand, the same tool is used differently in different tasks. For example, a direct downward force can cut a mostly rigid object, but cutting a deformable object also requires a back-and-forth "sawing" force. On the other hand, the same tool may be used differently at different stages of the same task. For example, when hammering a nail into a wall, one might initially use a forward grip and small swings to hold the nail in place, and later a farther-back grip and larger swings to drive the nail into the wall with more force.

For tool construction, on the one hand, a single property can turn an object into a tool; for example, the flat surface of a book can serve as a tray. On the other hand, for a specific task, a tool built on a single attribute may not meet the task's needs; for example, a book can serve as a tray for carrying fruit, but it may be a poor choice for a teacup filled with water.

3.1.2 Task-oriented object replacement

Another major challenge for intelligent robots is task-oriented object replacement, mainly because training data can hardly cover ever-changing work environments: objects involved in the agent's plan may simply not exist in the actual working environment. Determining similarity between objects and choosing a replacement is difficult. Unlike current artificial intelligence, which usually judges similarity by physical features, replacement is usually tied to the specific task and must respect conventional habits. For example, in the kitchen, salt and soy sauce can sometimes substitute for each other, as can soy sauce and vinegar; but even though rock sugar and fruit candy are similar, the latter is rarely used in place of the former in the kitchen.

3.1.3 User Personalization

Personalized customization can be regarded as the agent adapting to the user's preferences. Most applications today include personalized recommendation, and for intelligent robots that directly serve user needs, considering user preferences in decision-making to improve the user experience is an inevitable direction of development. However, individual preferences are often hard to learn. One reason is that human expressions of preference are complex: repeated interaction with an object does not directly indicate preference for it, since the type and intention of the interaction also matter. For example, a book that is regularly used to pad a table corner carries a negative preference despite frequent interaction; it would therefore be wrong for the agent to shelve that book or to substitute a different book as the pad, while treating other books by the same author in the same way may well be correct.

3.2 Application prospects

Since taskless learning reduces the agent's data requirements, and combined with the real scene-virtual scene-real scene framework, exploration data can be used for efficient learning in the virtual scene and verification in the real scene, with feedback used to gradually adapt the learned knowledge to the specific working environment. In production, such robots can replace people in dangerous work areas such as coal mines and field sites; in daily life, domestic robots can explore the home environment and learn object properties, facilitating knowledge transfer to complete specific tasks.

4 Concluding remarks

This paper has proposed the taskless learning method, introduced its definition and its differences from and connections with existing methods, and described its application in robot task and motion planning. We look forward to further progress in intelligent robots so that they can facilitate people's lives as soon as possible.

(References omitted)


Zhang Xianqi

Ph.D. candidate at Harbin Institute of Technology. His main research interests are robotics and computer vision.


Fan Xiaopeng

Professor at Harbin Institute of Technology and national high-level talent. His main research interests are video coding, computer vision, and robotics.

Excerpt from "Newsletter of Chinese Society of Artificial Intelligence"

Vol. 14, No. 2, 2024

A New Paradigm in Scientific Research: Special Topic on Foundation Models
