
Guotai Junan Securities: Embodied intelligence, the next wave of artificial intelligence

Author: Finance

Guotai Junan argues that "embodied intelligence" gives machines the ability to perceive and to learn to act the way a human child does, and that its basic assumption is that intelligent behavior can be learned by an agent with a corresponding physical form through adaptation to its environment. Tesla Bot's capabilities are progressing rapidly and its commercialization prospects are promising, while the "compute overlord" NVIDIA is making a high-profile push into embodied intelligence. The AI value created by embodied intelligence goes far beyond humanoid robots themselves.

The following is the original text:

From symbolism to connectionism, the interaction of agents with the real world has received growing emphasis. In the period following the 1956 Dartmouth Conference, artificial intelligence research was largely confined to the symbolic processing paradigm (symbolism). The limitations of symbolism were soon exposed in practical applications, prompting the rise of connectionism and a family of methods including multilayer perceptrons, feedforward neural networks, recurrent neural networks, and the deep neural networks now popular in academia and industry. Simulating cognitive processes with artificial neural networks has indeed brought great progress in adaptation, generalization, and learning, but it has not truly solved the problem of interaction between agents and the real physical world. Moravec's paradox can be stated colloquially as follows: it is relatively easy to make a computer play chess like an adult, but it is quite difficult, perhaps impossible, to give a computer the perceptual and motor abilities of a one-year-old child.

In response to these problems, the concept of "Embodied AI" came into being. Addressing the interaction problem of agents, Minsky proposed the concept of "reinforcement learning" from the perspective of behavioral learning. In 1986, Brooks argued from a cybernetic perspective that intelligence is embodied and situated, that the traditional, representation-centered evolutionary path of classical AI is wrong, and that the way to eliminate representation is to build behavior-based robots. In his book How the Body Shapes the Way We Think, Rolf Pfeifer describes "the embodiment of intelligence" by analyzing how the body affects intelligence, clarifying the profound impact of "embodiment" on understanding the nature of intelligence and on building artificial intelligence systems. These works laid a solid foundation for the behaviorist approach, the third school of artificial intelligence, represented by embodied intelligence.

The basic assumption of "embodied intelligence" is that intelligent behavior can be learned by an agent with a corresponding physical form through adaptation to its environment. It can be understood simply as putting robots of various forms into real physical environments and having them perform a variety of tasks, thereby completing an evolutionary process of artificial intelligence. Taking the term apart: the basic meaning of "embodiment" is the dependence of cognition on the body, that is, the body influences cognition; in other words, the body participates in cognition and shapes mental processes such as thinking and judgment. "Embodiment" therefore means that cognition cannot exist apart from the body. The opposite concept is "disembodiment", the decoupling of cognition from the body (large models represented by ChatGPT realize only this disembodied intelligence). "Intelligence" denotes an agent's ability, whether biological or mechanical, to understand and transform the objective world through its own learning after interacting with the environment. Some robots trained through reinforcement learning can also be considered a form of embodied intelligence, such as OpenAI's one-handed Rubik's Cube robot. In sum, embodied intelligence aims to create agents that combine software and hardware and can learn and evolve autonomously through interaction between the machine and the physical world.

The concept of embodiment is verifiable and measurable. The concepts by which humans understand the world include not only non-embodied concepts unique to humans, such as responsibility, honor, feelings, and desires, but also embodied concepts such as cups and cars together with their corresponding behaviors. Embodied concepts are accessible, verifiable, and interpretable: the entities and behaviors they correspond to can be measured, verified by whether a task is completed, and reasoned about through embodied learning. In contrast, the basic elements of non-embodied concepts cannot be measured or verified.


"The unity of knowledge and action" is the scientific stance of embodied intelligence. According to the technical implementation logic of embodied intelligence, "knowing" is based on "doing", that is, only through "embodiment" can a certain scene be understood. For example, there is a bedroom, which has behavioral characteristics such as sleeping, resting, and putting clothes, and this kind of behavior is designed based on the human body, so to truly understand the scene of the bedroom, it is necessary to be able to directly verify it by sitting on a chair, lying on the bed and other behavioral tasks. Similarly, the robot can achieve the above behavior by understanding the scene to represent that it truly understands the scene. Because in essence, the categories of objects and scenes are mostly defined by functions and tasks, "what I can use it for, what it is", for example, a hammer cannot be called a stick, a hammer has its unique behavioral properties.

Embodied knowledge accounts for a high proportion of ancient Chinese characters. Most ancient Chinese characters, such as those in oracle bone script, depict a concept through a representation of behavior; the ancient form of the character for "fight", for example, shows two people pulling at a rope with their hands. Understanding behavior is therefore the key to understanding concepts and scenes.


Computer vision and NLP are therefore better seen as tools for embodied intelligence, while general artificial intelligence is its ultimate goal. Embodied intelligence should be able to use the body and its various parts to complete physical tasks: a foreigner who cannot use chopsticks can still spear food and eat it, so an embodied agent must likewise complete tasks through the physical environment, including in scenes it has not encountered before. Just as concepts such as speed, momentum, and elasticity laid the foundation of classical mechanics and drove the development of later science, embodied intelligence, because it realizes knowledge, concepts, explainability, and behavioral causality, is expected to become a driving force toward general artificial intelligence.

Embodied intelligence starts with affordance. Affordance means letting the machine know what objects and scenes can offer, for example how a whole body or its parts can fit effectively into a scene. In the example from the paper "GenDexGrasp: Generalizable Dexterous Grasping", a pillar may be gripped with two, three, or five fingers; if different hands can all produce an error-free grip, the affordance exists, and physics is the key to the machine's understanding of affordance.


Embodied intelligence also needs functionality. When using objects as tools, an embodied agent must be able to understand function under the guidance of task execution. For an agent, understanding the world centers on the task, that is, changing the state of entities, and it is task fulfilment that drives the agent. For example, to solve a "shoveling" task, different tools such as cups, shovels, or pans may all be used to move soil, as long as they let the agent accomplish "shoveling soil". The functionality of embodied intelligence is therefore the ability to assign an object a function that solves a specific task.


Embodied intelligence requires the realization of causal chains. In the "shoveling" example above, whether the agent can successfully shovel the soil is governed by causal relationships: quantities such as the way the tool is swung, its momentum, and its impulse, together with how they change over time, must be controlled through mathematical and physical causal chains. Professor Zhu Songchun's team at the Institute of Artificial Intelligence introduced a learning and planning framework and showed that it can identify the basic physical quantities critical to task success, so that the agent can independently plan effective tool-use strategies and imitate the essential characteristics of human tool use.

Learning how an agent uses tools involves multiple cognitive and intelligent processes that are not easy even for humans. Getting a robot to master all the skills a tool requires is a challenging task with three levels. The first is low-level motion control. Many studies use impedance control to track the tool's trajectory, change force and motion constraints across different phases, or use learning-based methods to control the robot's trajectory; at this level, robust execution of motion trajectories is the central concern. The second is mid-level representation. Various intermediate representations useful to downstream tasks have been proposed for a better understanding of tool use; although they make it easier to learn a wider range of tool-using skills, they are currently limited to the geometric association between the shape of the tool and the task. The third is understanding the high-level concepts involved in tool use, such as the functionality and affordance of objects, together with the causal relations and common sense involved, so as to achieve better generalization.
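
To make the "low-level motion control" layer concrete, here is a minimal sketch of an impedance-style control law in Python: a spring-damper pulls the tool toward a desired trajectory on a toy point-mass system. The gains and dynamics are illustrative assumptions, not any specific robot's controller.

```python
import numpy as np

def impedance_step(x, v, x_des, v_des, k=200.0, d=20.0):
    """One step of a minimal impedance law: the commanded force behaves like a
    spring-damper pulling the tool toward the desired trajectory.
    k (stiffness) and d (damping) are illustrative values, not tuned gains."""
    return k * (x_des - x) + d * (v_des - v)

# Track a short straight-line tool trajectory with a unit point mass (toy dynamics).
dt, x, v = 0.01, np.zeros(3), np.zeros(3)
for t in np.arange(0.0, 1.0, dt):
    x_des = np.array([t, 0.0, 0.0])          # desired position at time t
    v_des = np.array([1.0, 0.0, 0.0])        # desired velocity along the line
    f = impedance_step(x, v, x_des, v_des)   # commanded force
    v += f * dt                               # integrate the toy dynamics
    x += v * dt
print("final tracking error:", np.linalg.norm(x - np.array([1.0, 0.0, 0.0])))
```

In practice the stiffness and damping would be retuned for different phases of tool use, which is exactly the kind of stage-dependent adjustment of force and motion constraints the studies above describe.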

Most existing work on embodied intelligence focuses on only one of the three levels above: it either concentrates on the robot's motion trajectory without understanding the task itself, or oversimplifies motion planning in favor of high-level conceptual understanding, and so fails to cover all levels. As a result, robots are still far from being able to develop tool-use strategies suited to a specific context, and because of significant differences in kinematic structure, the strategies a robot observes humans using may not be ideal for the robot itself. For example, given a set of objects (typical tools or other items), how does the robot decide which one is the best choice for the task? And once an object has been chosen as a tool, how does the robot use it effectively given its own kinematic structure and the dynamic constraints of robot and tool? These questions remain frontier research topics for the field.

For machines to understand entities and behaviors, three core scientific questions must be answered. First, from the perspective of machine cognition, how can machines be made to understand behavior? Second, from the perspective of neurocognition, what is the intrinsic relationship between machine cognitive semantics and neural cognition? Third, from the perspective of embodied cognition, how can knowledge of behavior understanding be transferred to robot systems?

To achieve embodied intelligence, we must first answer whether machines can clone human behavior. Behavioral cognition is a central problem in intelligence science: letting machines understand the world means letting them understand entities and understand behavior, because the uncertain space of the world can be divided into exactly these two, entities and behaviors.


Deep learning frameworks have hit a bottleneck in behavioral cognition. As deep learning has advanced, computer vision has come to rest on two pillars: object-centric perception and human-centric perception. With ever-improving deep learning algorithms, complex object recognition can be highly successful, but it remains very difficult for machines to understand the true semantics of behavior from a human perspective. Market performance tells the same story: many commercial products are based on object detection, while very few address behavior understanding. Human-centric perception is hard because deep learning itself has reached a bottleneck; according to Professor Lu Cewu's research, the state of the art (SOTA) in behavior recognition is far below that in object recognition.

The key to behavior understanding is extracting the elements relevant to behavior amid great semantic noise. Behavior is an abstract concept, so the elements related to behavior must be captured in the image. The semantic judgment interval of an image can be characterized by a semantics-to-noise ratio (semantics-to-noise ratio = region supporting the semantic judgment / whole-image region), found by erasing the smallest region of the image such that others can no longer recognize the type of behavior. Professor Lu's team found through such measurements that the semantics-to-noise ratio of object recognition is much greater than that of behavior recognition: an object can still be recognized even when a large area is masked, whereas a behavior becomes unrecognizable once even a small critical area is masked. The key to behavior understanding is therefore extracting the elements of behavior under heavy semantic noise, truly mining the real semantics of an image amid strong interference, and this cannot be achieved simply by scaling up deep learning.
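
As a minimal illustration of the ratio just described, the sketch below computes it for a toy "object" case and a toy "behavior" case, assuming we already have, for a given image, a mask of the smallest region whose erasure makes the judgment fail. The masks and numbers are invented for illustration, not Professor Lu's measurements.

```python
import numpy as np

def semantics_to_noise_ratio(support_mask: np.ndarray) -> float:
    """Ratio of the region supporting the semantic judgment to the whole image.
    support_mask is a boolean H x W array marking the smallest region whose
    erasure would make the object or behavior unrecognizable."""
    return float(support_mask.sum()) / support_mask.size

# Toy comparison: an object judged from a large region vs. a behavior judged
# from a small set of interaction pixels (hand, tool, contact point).
h, w = 100, 100
object_support = np.zeros((h, w), dtype=bool); object_support[20:80, 20:80] = True
behavior_support = np.zeros((h, w), dtype=bool); behavior_support[45:55, 45:55] = True
print("object ratio:  ", semantics_to_noise_ratio(object_support))    # 0.36
print("behavior ratio:", semantics_to_noise_ratio(behavior_support))  # 0.01
```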

A better scientific path is to decompose the problem of behavioral cognition into two simpler stages: from perception to knowledge, and from knowledge to reasoning. Discrete semantic symbols are shared across different behaviors; eating, reading, and cleaning, for example, all carry a "hand-hold-something" label. By transferring, reusing, and combining these shared labels, behavior primitives can be formed, building a layer of "middle-level knowledge". Such combination has a degree of generalization: by combining primitives, machines can recognize behaviors they have never seen.


Building a large base of primitive knowledge and logical rules is therefore the first task. What humans understand as the cause of a behavior is roughly equivalent to what each part of the body is doing, so the first step is to construct a large body of primitive knowledge grounded in these local body states and to be able to detect the primitives. Second, given good primitive detection, the primitives need to be composed so that data-driven learning is guided by logical rules. The difficulty is that the rules reflect what humans think, and a wrong rule base would have a large impact, so rule learning is the solution: sample randomly from the knowledge base of behavior primitives to form a judgment of the behavior, then search from a prior starting point given by humans, sampling in rule space; if accuracy improves, add the rule, otherwise delete it, and form new rules from the adjusted rule distribution. Professor Lu Cewu found, taking images of a person cycling as an example, that after this process the machine can automatically recognize the "cycling" behavior even without having seen cycling rules, so this technical route can effectively approach human performance in behavior recognition.
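
The rule-learning loop described above can be sketched as a simple accept/reject search over combinations of primitives. Everything below, the primitive names, the clip format, and the proposal scheme, is a hypothetical illustration of that procedure, not the team's actual system.

```python
import random

# Hypothetical primitives and labelled clips; the names are invented for illustration.
PRIMITIVES = ["hand-hold-handlebar", "foot-on-pedal", "person-sit-on-seat",
              "hand-hold-book", "eyes-look-at-book"]

def detect_primitives(clip):
    return clip["primitives"]            # stand-in for a real primitive detector

def rule_predicts(rule, clip):
    return set(rule) <= set(detect_primitives(clip))   # all primitives of the rule present?

def accuracy(rules, clips, label="cycling"):
    hits = sum(any(rule_predicts(r, c) for r in rules) == (c["label"] == label)
               for c in clips)
    return hits / len(clips)

def learn_rules(clips, prior_rules, steps=200, seed=0):
    """Greedy rule-space search as sketched in the text: start from human priors,
    randomly propose adding or dropping a primitive-combination rule, and keep
    the change only if accuracy on the clips improves."""
    random.seed(seed)
    rules, best = list(prior_rules), accuracy(prior_rules, clips)
    for _ in range(steps):
        candidate = list(rules)
        if rules and random.random() < 0.5:
            candidate.remove(random.choice(rules))            # try deleting a rule
        else:
            candidate.append(random.sample(PRIMITIVES, k=2))  # try a new 2-primitive rule
        score = accuracy(candidate, clips)
        if score > best:
            rules, best = candidate, score
    return rules, best

clips = [
    {"label": "cycling", "primitives": ["hand-hold-handlebar", "foot-on-pedal"]},
    {"label": "cycling", "primitives": ["foot-on-pedal", "person-sit-on-seat"]},
    {"label": "reading", "primitives": ["hand-hold-book", "eyes-look-at-book"]},
]
rules, score = learn_rules(clips, prior_rules=[["hand-hold-handlebar", "foot-on-pedal"]])
print(rules, score)
```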


Machines need scientific backing to understand human behavior. Scientists therefore need to determine whether there is a stable mapping between the features machine vision uses to classify behavior and neural features. If such a stable relationship exists, there is an objective basis for defining behavior visually.

Experiments found a mapping from behavior patterns to brain signals, and the model was stable. Professor Lu Cewu and a biomedical team built the first large-scale closed-loop system linking visual understanding and neural signals, and analyzed the correlation between the behavior patterns and neural signals of mice. The experiments showed, via machine learning, that behavior maps from patterns to brain signals and that a stable model can be established. In addition, by building a machine-learning-based system for discovering behavior-related neural circuits, the team successfully identified the neural circuits underlying "social hierarchy" behavior in mice. In summary, defining behavior through vision has a scientific basis.

It is not enough to understand behavior; the machine must also be able to perform it, and only by performing a behavior can it truly understand it. Using computer vision and behavioral cognition to let a machine confirm and distinguish a behavior is only the first step, the level achieved by traditional "spectator" AI learning. Traditional AI learning can teach a machine the concept of a "box" and have it produce the label "box" in new scenes; in the embodied learning mode, by contrast, the machine perceives the entities in its environment, completes embodied learning through first-hand experience, and finally understands the scene and forms a concept such as "open". The point at which the machine can actually perform the behavior is where embodied intelligence takes hold.

Performing behavior requires systematic interaction among form, behavior, and learning. In embodied intelligence, form, behavior, and learning are closely related. First, form should be used to generate behavior: the morphological characteristics of the embodied agent are exploited to realize specific behaviors skillfully, partially replacing "computation". Second, behavior should be used to drive learning: the agent's behavioral capabilities, such as exploration and manipulation, are used to actively obtain learning samples and label information so as to achieve autonomous learning, which is currently at the research frontier. Third, learning should be used to improve behavior, and behavior used to control form. The latter can be implemented in many ways, but using learning to improve behavior and then control form is a new intelligent-control approach that emerged with modern artificial intelligence, and reinforcement-learning-based techniques have become a popular means. Finally, embodied intelligence needs to use learning to optimize form, applying advanced learning-based optimization to the morphological design of embodied agents.


"Embodied perception" is the full-concept interactive perception oriented to the execution of actions. Embodied intelligence first and foremost has to solve the problem of embodied concept learning, that is, how to define, acquire, and express physical concepts that can be used by robots. Embodied perception is different from traditional computer vision, which does not analyze all knowledge, while embodied perception includes "full-concept perception" and "interactive perception", so as to ensure that the machine sees not the label, but how to use it. For example, from the perspective of human cognition, build a large-scale joint body knowledge base, the knowledge base covers shape, structure, semantics, physical properties, while marking the mass, volume, inertia, etc. of each component of the joint body, recording real-world object operation force feedback and simulation operation force feedback, under the blessing of physical attribute knowledge, the object force feedback curve can be fully fitted, at this time, when simulating object operation, it is no longer to detect the label, but all knowledge is detected, after detection, The accuracy of perception can be judged by the accuracy rate of machine execution.

Through behavioral feedback and the compression of learned patterns into a tractable space, "embodied execution" can achieve a degree of generalization. Under interactive perception, if the machine only looks at an object, the amount of information does not increase, but interacting with it can quickly reduce the error. Facing an object, the machine first tests its knowledge; where its knowledge structure is inaccurate, it can be guided to act on its guess about how the behavior is produced, and if the outcome differs from reality, the guess is shown to be wrong and can be corrected in turn. Moreover, all captured feature patterns can be compressed into a space small enough to be learned, and through this mechanism the machine can also perform related behaviors when facing objects it has never seen, giving it a certain generality.

Tesla Bot's capabilities are progressing rapidly, and its commercialization prospects are promising. At "Tesla AI Day" in 2021, Musk announced Tesla's general-purpose robot plan and showed the rough form of the humanoid robot Tesla Bot in pictures, but at that time Tesla Bot was only a concept. A year later, at Tesla AI Day 2022, the humanoid robot Optimus was unveiled in physical form. At the Tesla shareholders' meeting in mid-May 2023, Musk showed the latest progress: Tesla Bot can now walk smoothly and flexibly grasp and release objects. Musk said at the meeting: "Humanoid robots will be Tesla's main source of long-term value in the future. If the ratio of humanoid robots to humans is 2 to 1, demand for robots may reach 10 billion or even 20 billion units, far exceeding the number of electric vehicles."

Tesla Bot's recent breakthroughs come from improvements in Tesla's motor torque control and environment modeling. Tesla has applied several techniques to improve the humanoid robot's movement and control, including motor torque control, environment discovery and memory, and training the robot from human demonstrations. First, the research team uses motor torque control to drive the robot's leg movements so that its landing force stays gentle. Because observing and perceiving the surroundings is essential for a robot, Tesla also added the ability to discover and remember the environment, so the humanoid robot can now roughly model its surroundings. Finally, since the robot's body structure is similar to a human's, Tesla's team has trained it on a large number of human demonstrations, especially of hand movements, aiming to give it human-like grasping ability.

The AI value brought by embodied intelligence is far greater than that of humanoid robots. The defining trait of embodied intelligence is the ability to perceive the physical world autonomously from a first-person perspective and to learn along anthropomorphic paths of thought, producing the behavioral feedback humans expect rather than passively waiting to be fed data. Humanoid robots provide a learning and feedback system modeled on human behavior, and an iteration base and proving ground for more complex behavioral semantics, so their gradual improvement also points a direction for embodied intelligence to land. But embodied-intelligence applications for industry and other scenarios do not have to be humanoid robots; the technology and methodology behind embodied intelligence are the core, which means the value embodied intelligence brings is far higher than that of the humanoid robot itself. In other words, humanoid robots are one important application scenario of embodied intelligence and will also provide direction and room for its iterative optimization.

After the rise of reinforcement learning, embodied intelligence received more attention. With the success of AlphaGo, academic interest in reinforcement learning surged, and many researchers began using RL to link an agent's perception, decision-making, and execution in the hope of achieving embodied intelligence. Training RL is a trial-and-error process, so since 2017-2018 many simulation platforms have emerged in which an embodied agent interacts with the environment, obtains rewards, and learns a policy. But because there is always a gap between simulation and the real environment (the sim2real gap), a learned policy cannot necessarily be transferred to the real world. At present, the skills that transfer from simulation to reality are mostly single skills such as mobile navigation or single-step grasping and manipulation, which are hard to generalize.
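
The interaction loop those simulation platforms provide can be sketched as below using the standard gymnasium API; CartPole stands in here for a full embodied simulator (for example Habitat or Isaac Gym), and the random action is a placeholder for the RL policy actually being trained.

```python
import gymnasium as gym

# Minimal embodied-agent interaction loop: act, observe, collect reward, and
# (in a real system) improve the policy from that feedback.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
episode_return = 0.0
done = False
while not done:
    action = env.action_space.sample()                      # placeholder policy
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    done = terminated or truncated
print("return from one simulated episode:", episode_return)
env.close()
```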

Recently, large language models have stolen the limelight from reinforcement learning. The industry now hopes to unify vision, language, and robotics in a single model through large-scale sequence-to-sequence training, and has achieved some results. However, robot execution requires "4D" data, a three-dimensional environment plus the timed trajectory of the robot's motion, whose volume and richness fall far short of images and text and whose acquisition cost is much higher, so iterative evolution is much harder than for large language models.
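
For concreteness, a toy record of such "4D" data might look like the sketch below; the field names are assumptions made for illustration, not a standard dataset format.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class EmbodiedDemo:
    """Toy record of the '4D' data referred to above: a 3D scene plus the timed
    trajectory of the robot's motion (illustrative fields only)."""
    scene_point_cloud: List[Tuple[float, float, float]]   # 3D environment as (x, y, z) points
    timestamps_s: List[float]                              # time axis
    joint_trajectory: List[List[float]]                    # joint angles at each timestamp
    language_instruction: str = ""                         # optional paired text

demo = EmbodiedDemo(
    scene_point_cloud=[(0.1, 0.2, 0.0), (0.4, 0.1, 0.3)],
    timestamps_s=[0.0, 0.1],
    joint_trajectory=[[0.0] * 7, [0.05] * 7],
    language_instruction="pick up the cup",
)
```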

Multimodal large models provide an important driving force for breaking through the technical bottlenecks of embodied intelligence. Embodied intelligence is the natural result of the convergence of artificial intelligence, robotics, and other branches of technology: computer vision opens a window for image acquisition and processing; graphics provides tool support for physical simulation; NLP makes human-machine interaction convenient and gives machines an effective way to learn knowledge from text; cognitive science offers a scientific approach to the behavioral-cognitive principles of embodied intelligence; and robot components of all kinds provide the bridge between the agent and the physical environment. These branches of AI, together with improving robot functionality, make further development of embodied intelligence possible, and the large models of the current AIGC era can integrate these earlier branches better than before. Many researchers have tried to use multimodal large language models as the bridge between humans and robots, jointly training on images, text, and embodied data and introducing multimodal input to strengthen the model's understanding of real objects. This helps robots handle embodied reasoning tasks more efficiently and, to a certain extent, raises the generalization level of embodied intelligence. AI models such as GPT therefore offer new research methods for the self-perception and task-handling optimization of embodied intelligence.


"Computing power overlord" NVIDIA has a high-profile layout and intelligence. At ITF World 2023, Jensen Huang said the next wave of AI will be embodied intelligence, i.e. intelligent systems that understand, reason and interact with the physical world. At the same time, he also introduced NVIDIA's multimodal embodied intelligence system Nvidia VIMA, which can perform complex tasks, acquire concepts, understand boundaries, and even simulate physics under the guidance of visual text prompts, which also marks a significant improvement in AI capabilities.

Combining sensor modalities with a language model, Google's visual language model adds vision compared with ChatGPT. In March 2023, AI researchers at Google and the Technical University of Berlin released PaLM-E, a multimodal visual language model (VLM) that was the largest of its kind at the time with 562 billion parameters. It integrates vision and language to control robots, incorporating continuous real-world sensor modalities directly into the language model and thereby linking words to perception, and it can perform a variety of tasks without retraining, adding visual capability relative to ChatGPT. The main architectural idea of PaLM-E is to inject continuous, embodied observations (such as images, state estimates, or other sensor modalities) into the language embedding space of a pre-trained language model, so that continuous information enters the language model in a way analogous to language tokens.
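
The injection idea can be sketched roughly as follows: an observation is encoded, projected into the language model's token-embedding space, and concatenated with ordinary text-token embeddings before being fed to the backbone. All modules, sizes, and weights below are toy stand-ins, not PaLM-E's actual components.

```python
import torch
import torch.nn as nn

d_model, vocab = 512, 32000
text_embed    = nn.Embedding(vocab, d_model)       # pre-trained LM embedding (stand-in)
image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 1024), nn.ReLU())
projector     = nn.Linear(1024, d_model)            # maps image features to "soft tokens"

text_ids = torch.randint(0, vocab, (1, 10))         # a tokenized instruction
image    = torch.rand(1, 3, 32, 32)                 # a camera observation

text_tokens  = text_embed(text_ids)                             # (1, 10, 512)
image_tokens = projector(image_encoder(image)).unsqueeze(1)     # (1, 1, 512)
sequence = torch.cat([image_tokens, text_tokens], dim=1)        # fed to the LM backbone
print(sequence.shape)   # torch.Size([1, 11, 512])
```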

Google has achieved high-level, real-time coupling between vision-language models and robots, and has observed emergent capabilities such as multimodal chain-of-thought reasoning and multi-image reasoning. Building on the language model, PaLM-E takes continuous observations, such as image or sensor data, and encodes them into a sequence of vectors of the same size as language tokens, so the model can "understand" sensory information the same way it processes language. The same PaLM-E model can also control robots in real time. Despite being trained only on single-image prompts, PaLM-E exhibits emergent abilities such as multimodal chain-of-thought reasoning (analyzing a range of inputs including linguistic and visual information) and multi-image reasoning (using several input images at once to reason or predict). That said, the spatial scope, object types, and task-planning complexity in Google's demos remain limited, and as deep learning models grow more capable, PaLM-E should open up more feasible application space.

Microsoft plans to extend ChatGPT's capabilities to robotics, making it possible to control robots with spoken and written language. In its experiments, typing commands into ChatGPT's dialog box can direct a robot to find, for example, a "healthy drink" or "something with sugar and a red logo" in a room. According to the Microsoft researchers, "the goal of the study is to see if ChatGPT can go beyond generating text and reason about real-world situations to help robots complete tasks." Microsoft wants people to interact with robots more easily, without having to learn complex programming languages or the details of robotic systems.
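
The prompting pattern implied here can be sketched as giving the language model a small, human-written robot API and asking it to compose calls that satisfy a spoken instruction. The API names and the canned "plan" below are hypothetical illustrations, not Microsoft's actual interface or real ChatGPT output.

```python
# The language model is shown only these functions and asked to write code with them.
ROBOT_API_PROMPT = """
You control a mobile robot through these Python functions only:
  detect_objects() -> list[str]       # names of objects currently visible
  go_to(object_name: str) -> None     # navigate to a detected object
  pick_up(object_name: str) -> None   # grasp the object
Write code that finds a healthy drink in the room and brings it back.
"""

def fake_llm_plan(prompt: str) -> str:
    """Placeholder for a call to a chat-completion API; returns a canned plan
    of the kind the model would be asked to generate."""
    return (
        "objects = detect_objects()\n"
        "target = next(o for o in objects if o in ('orange juice', 'water'))\n"
        "go_to(target)\n"
        "pick_up(target)\n"
    )

print(fake_llm_plan(ROBOT_API_PROMPT))   # the generated plan would then run on the robot
```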

Alibaba is taking a path similar to Microsoft's and is experimenting with connecting its Qianwen model to industrial robots. At the recent 6th Digital China Construction Summit, Alibaba Cloud released a demonstration video showing practical application scenarios of the Qianwen model. In it, the Qianwen model is connected to an industrial robot; after an engineer issues an instruction through a DingTalk dialog box, the model automatically writes a piece of code in the background and sends it to the robot, which then recognizes its surroundings, finds a bottle of water on a nearby table, and automatically completes a sequence of actions, moving, grasping, and delivering, to hand it to the engineer. Directing a robot to work by typing natural language into a DingTalk dialog box would be a revolutionary change for the development and application of industrial robots, and it means large models are opening a new door for them: because models such as Qianwen give robots the ability to reason and decide, the flexibility and intelligence of robots can be expected to improve substantially.

This article is excerpted from a brokerage research report.
