
Special report on artificial intelligence industry: AI large model empowers humanoid robots


(Report producer/author: Guotai Junan Securities, Xiao Qunxi, Bao Yanxin)

1. Versatility: solving the contradiction between high demand for robots and their low penetration rate

1.1. The robot evolution path: from fixed to mobile, from standalone to collaborative, from single-purpose to general-purpose

The premise of commercializing service robots is that the product provides real value, and the test of real value is whether the robot is versatile. Against the backdrop of a global labor shortage, the robot industry is booming: the global service robot market reached $21.7 billion in 2022, with a compound annual growth rate of more than 20% over the past five years. Yet despite this rapid development, the penetration rate of service robots remains low, and large-scale commercial deployment has not gone smoothly.

We believe the reason is that most service robots today have scene-adaptation problems to some degree. They cannot adapt to environmental changes, and users cannot restore scene adaptation through simple operations after the environment changes. Their degree of intelligence is low, so pedestrian obstacle avoidance and functional performance are unsatisfactory. Deployment is complex (SLAM mapping, target-point annotation, and so on), and every deployment operation must be performed by an on-site deployment engineer; it is difficult for users to operate or participate, and whenever changes are needed, an on-site engineer is required again.

Take the supermarket scene as an example. The environment is complex: hollowed-out shelves (overhanging obstacles), narrow aisles, fall-prone areas, low obstacles, and temporary floor coverings all test the robot's passability, perception, and task planning. It is highly dynamic: malls have heavy, easily crowding foot traffic and many moving obstacles, placing high demands on safe obstacle avoidance. There are many special objects, and scene lighting varies greatly: highly transparent objects such as glass guardrails, escalators, glass turnstiles, and glass walls are essentially unrecognizable to most robots and easily interfere with lidar, causing misjudgment, collisions, falls, or an inability to approach and operate; robots that rely on vision sensors, in turn, struggle to work stably under lighting conditions ranging from ordinary light to darkness to overexposure.

The same problems exist in the field of industrial robots, and they suppressed industrial robot penetration until collaborative robots emerged. The global collaborative robot market reached 8.95 billion yuan in 2022 and is expected to reach 30 billion yuan by 2028, a compound growth rate of 22.05% over 2022-2028. From 2017 to 2022, China's collaborative robot sales grew from 3,618 units to 19,351 units, with shipments expected to exceed 25,000 units in 2023; the market grew from 360 million yuan in 2016 to 2.039 billion yuan in 2021, a compound growth rate of 41.5%. Cobots can also be considered service robots, because they are designed to work side by side with humans. Traditional industrial robots work apart from people, behind fences, and complete a limited set of jobs such as welding, spraying, and hoisting. Collaborative robots are more flexible, smarter, easier to work with, and more adaptable, enabling manufacturing industries such as automotive and electronics to extend automation to final product assembly and complete tasks such as polishing, coating, and quality inspection.

1.2. How to make robots more versatile?

To make robots more versatile, their perception, thinking and decision-making, and action-execution abilities must all be improved. We believe the emergence of GPT (generative pre-trained transformer) models and humanoid robots is a big step on the road from robots toward general artificial intelligence. The ability to perceive the world (the robot's eyes): laser and visual navigation are the mainstream approaches to perception and positioning for autonomous robot movement. Computer vision has progressed from traditional methods built on feature descriptors to deep learning built on CNN convolutional neural networks, and general-purpose vision large models are now in the research and exploration stage. The scenes humanoid robots face are more general and complex than those of industrial robots, and the all-in-one multi-task training scheme of vision large models can help robots better adapt to human living scenes. On the one hand, the strong fitting ability of large models gives humanoid robots higher accuracy in tasks such as object recognition, obstacle avoidance, 3D reconstruction, and semantic segmentation. On the other hand, large models solve deep learning's over-reliance on single-task data distributions and its poor scene generalization: a general vision large model learns more general knowledge from massive data and transfers it to downstream tasks, and the resulting pre-trained model has good knowledge completeness and generalizes better across scenes.

The ability to think and make decisions (the robot's brain): today's robots are special-purpose machines that can only be applied in limited scenes. Even robot grasping based on computer vision remains scene-limited: the algorithm only identifies objects, while what to do and how to do it must still be defined by humans. A general-purpose robot, told to water the flowers, should know to fetch the watering can, fill it with water, and then water the flowers; completing this requires common sense. How can robots acquire common sense? Before the advent of large models, this problem was almost unsolvable. Large models give robots common sense, allowing them to complete varied tasks and completely changing how general-purpose robots are built.
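A minimal sketch of this idea, assuming an LLM reached through the OpenAI Python client and a hypothetical skill list (fetch, fill, pour): the model supplies the commonsense decomposition of "water the flowers" into primitive steps the robot already knows how to execute.

```python
# Hedged sketch: an LLM decomposes a household instruction into primitive
# skills. The skill names and model id are illustrative assumptions, not a
# confirmed robot API.
from openai import OpenAI

SKILLS = ["fetch(object)", "fill(container, liquid)", "pour(container, target)"]

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": (f"Available skills: {', '.join(SKILLS)}. "
                    "Decompose the task 'water the flowers' into an ordered "
                    "list of skill calls, one per line, nothing else."),
    }],
)
plan = reply.choices[0].message.content.splitlines()
print(plan)  # e.g. ['fetch(watering_can)', 'fill(watering_can, water)', ...]
```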

Adapting to human tools and environments eliminates the need to build dedicated tools for robots. Execution ability (the robot's limbs): mobility (legs) + fine manipulation (hands). The purpose of giving a robot human form is to make its execution ability more versatile. The environments in which robots perform tasks are built around the human body: buildings, roads, facilities, and tools are all designed for the convenience of human-shaped beings. If a new form of robot emerged, a new environment would have to be redesigned for it. Designing a robot that performs tasks within a narrow range is relatively easy; to improve a robot's versatility, a humanoid robot that can act as a stand-in for a human is the natural choice. In addition, humans are more likely to form emotional connections with humanoid robots, which feel approachable. The Japanese roboticist Masahiro Mori hypothesized that as robots become more similar to humans in appearance and movement, humans respond to them with increasingly positive emotions (up to the point, in his uncanny valley hypothesis, where near-perfect likeness provokes unease).


1.3. Humanoid robots are on the eve of commercialization

From the DARPA Robotics Challenge in 2015, through the industry-wide downturn around 2019 when humanoid robots remained largely scientific research projects, to the flourishing set off by Tesla in 2022, the humanoid robot industry has developed in an upward spiral. Boston Dynamics' Atlas, Tesla's Optimus, Xiaomi's CyberOne, IHMC's Nadia, Agility Robotics' Digit, and Japan's ASIMO and HRP-5P are all exploring the commercial form of humanoid robots. We have sorted representative products from the history of humanoid robot development. The first humanoid robot: WABOT-1 (1973). In 1973, a team led by Ichiro Kato of Waseda University in Japan developed the world's first full-scale humanoid intelligent robot, WABOT-1. The robot had a limb control system, a vision system, and a dialogue system, with two cameras on the chest and tactile sensors on the hands.

Honda's E-series robots (1986-1993) laid the foundation for stable walking. Honda's E-series bipedal robots, E0 through E6, progressed from slow to fast walking and from walking in straight lines to walking stably on steps and slopes, laying the groundwork for the P-series humanoid robots that followed, a milestone in the history of robotics. Honda P-series robots (1993-1997) & ASIMO (2000-2011). In 1993 Honda developed its first humanoid robot prototype, P1; in 2000 the fourth and final robot in the series was born, widely known as ASIMO. The third-generation ASIMO, launched in 2011, is 1.3 meters tall, weighs 48 kg, and walks at speeds of 0-9 km/h. The 2012 version of ASIMO, beyond walking and a range of human-like body movements, can execute pre-set motions and respond to human voice, gestures, and other commands; it also has basic memory and recognition abilities. In 2018, Honda announced that it would stop developing ASIMO to focus the technology on more practical applications.


HRP series robots (1998-2018), replacing heavy work in the construction industry: this was a general household-assistant robot development project sponsored by Japan's Ministry of Economy, Trade and Industry (METI) and the New Energy and Industrial Technology Development Organization (NEDO), led by Kawada Industries, the National Institute of Advanced Industrial Science and Technology (AIST), and Kawasaki Heavy Industries. The project began with HRP-1 (based on Honda's P3) in 1998 and successively launched HRP-2P, HRP-2, HRP-3P, HRP-3, HRP-4C, HRP-4, and other humanoid robots. The latest, HRP-5P, was released in 2018; it is 182 cm tall, weighs 101 kg, has 37 degrees of freedom in the whole body, and is designed to take over heavy work in the construction industry.

Boston Dynamics (1986-2023): legged-robot motion control technology at the frontier, with clear military-application characteristics. Boston Dynamics first became known worldwide for BigDog, and went on to release RiSE, LittleDog, PETMAN, LS3, Spot, Handle, Atlas, and other robots, moving from single-legged and multi-legged robots to humanoids along a route with obvious military-application characteristics. Boston Dynamics is a typical technology-driven company, continuously iterating its robots' mechanical structures, gait-control algorithms, and power-system energy consumption; its core aim is to develop legged robots that adapt to different environments, and its key technologies lie in power systems and robot balance control.

Digit series robots (2019-2023): walking ability, with a focus on commercialization in the logistics field. The Digit series is Agility Robotics' attempt at commercialization in logistics. Agility Robotics, a robotics spin-off from Oregon State University (OSU) dedicated to developing and manufacturing bipedal robots, has developed the MABEL, ATRIAS, CASSIE, and DIGIT legged robots. Among them, Cassie achieved a remarkable pace of 4 m/s, a milestone in the fast-walking ability of legged robots. In 2019, Agility launched the humanoid robot Digit, adding a torso, arms, and more computing power to Cassie; it can carry boxes of up to 18 kg for package handling, unloading, and similar work.


Xiaomi "Tieda" robot (2022): In 21, Xiaomi released a mechanical dog Cyberdog, which was its first attempt at foot robots. In August 2022, Xiaomi's first full-size humanoid bionic robot, CyberOne, was unveiled at the autumn conference. CyberOne is 177cm tall, weighs 52kg, and has the stage name "Tie Da", which can perceive 45 human semantic emotions and distinguish 85 environmental semantics; Equipped with Xiaomi's self-developed whole body control algorithm, it can coordinate the movement of 21 joints; Equipped with the Mi Sense visual space system, which can recreate the real world in three dimensions; 5 types of coupling drives throughout the body, peak torque of 300Nm.

Tesla Optimus robot (2022): advancing the commercialization of humanoid robots. The Optimus prototype shown at Tesla AI Day 2022 is 1.72 m tall, weighs 57 kg, carries a 20 kg load, and has a maximum movement speed of 8 km/h. Optimus is still under rapid development; in only 8 months the robot became able to perform complex actions such as walking upright, carrying objects, and watering plants.

Interactive robots Sophia (2015) and Ameca (2021), attempts at anthropomorphic facial expression: Sophia is a humanoid robot developed by Hanson Robotics and introduced in 2015. Sophia's skin is made of the bionic material Frubber; based on speech recognition and computer vision, she can recognize and replicate a wide variety of human facial expressions and converse with humans by analyzing their expressions and language. Ameca, built by Engineered Arts, the UK's leading designer and manufacturer of bionic entertainment robots, features 12 new facial actuators; after its facial-expression upgrade it can wink, purse its lips, frown, and smile at itself in the mirror. Ameca can freely perform dozens of human-like body movements and has been called "the most realistic robot in the world."


2. AI large model + humanoid robot: providing the robot with common sense

2.1. AI large model training process and development trends

Large model = pre-training + fine-tuning. From the Transformer in 2017 through the emergence of GPT-1, BERT, GPT-2, GPT-3, and GPT-4, model parameter counts have leapt from hundreds of millions to trillions. Large models (pre-trained models, foundation models) are pre-trained on unlabeled data and then fine-tuned with small amounts of task-specific labeled data, after which they can be used for downstream prediction. Transfer learning is the central idea of pre-trained models: when data for the target scenario is insufficient, a deep neural network is first trained on a large public dataset and then migrated to the target scenario, where fine-tuning on a small in-scenario dataset brings the model to the required performance. Pre-trained models greatly reduce the labeled data needed for downstream tasks, making them suitable for scenarios where large labeled datasets are hard to obtain.
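A minimal sketch of the pre-train + fine-tune recipe, using a pretrained torchvision backbone as a stand-in for the foundation model; the class count and mini-batch are placeholders:

```python
# Hedged sketch of transfer learning: freeze a pretrained backbone, replace the
# head, and fine-tune on a small labeled dataset. Class count and the fake
# mini-batch are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pre-trained
for p in model.parameters():
    p.requires_grad = False                # freeze the general-knowledge backbone
model.fc = nn.Linear(model.fc.in_features, 5)   # new head for 5 target classes

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One illustrative step on a fake mini-batch standing in for scarce target data.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 5, (8,))
loss = loss_fn(model(images), labels)
loss.backward()
optimizer.step()
print(float(loss))
```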

The development process and trend of large models: in terms of parameter scale, large models have progressed through pre-trained models, large-scale pre-trained models, and ultra-large-scale pre-trained models, with parameter counts growing from hundreds of millions to trillions. In terms of data modality, large models are evolving from single-modality models for text, speech, and vision toward multimodal fusion on the road to general artificial intelligence.


2.2. AI large models give humanoid robots universal task-solving capabilities

AI large models will combine with humanoid robots in speech, vision, decision-making, control, and other respects, forming a closed loop of perception, decision-making, and control and greatly improving the robot's "intelligence". Speech: as a pre-trained language model, ChatGPT can be applied to natural-language interaction between robots and humans. For example, a robot can use ChatGPT to understand a human's natural-language instructions and act on them. Natural language is humanity's most common interaction medium, and speech, as the carrier of natural language, is the key task in robot anthropomorphization. Although deep learning has pushed speech-interaction technology, with speech recognition, natural language processing, and speech generation as its modules, to relative maturity, in practice semantic-understanding deviations (irony and the like), weak multi-turn dialogue, and stilted output still occur. Large language models provide a solution to the robot's autonomous speech interaction: in general language tasks such as context understanding, multilingual recognition, multi-turn dialogue, emotion recognition, and fuzzy-semantics recognition, ChatGPT shows understanding and language-generation ability no weaker than a human's. With large models such as ChatGPT, humanoid robots' understanding of and interaction in everyday language can be put on the agenda, the beginning of general AI enabling general-purpose service robots.

Vision: vision large models let humanoid robots recognize more accurately and serve more varied scenes. Computer vision progressed from traditional methods represented by feature descriptors to deep learning represented by CNN convolutional neural networks, and general vision large models are currently in the research and exploration stage. On the one hand, the strong fitting ability of large-parameter models gives humanoid robots higher accuracy in tasks such as object recognition, obstacle avoidance, 3D reconstruction, and semantic segmentation. On the other hand, general large models solve the older convolutional approach's over-reliance on single-task data distributions and its poor scene generalization: a general vision large model learns more general knowledge from massive data and transfers it to downstream tasks, and the resulting pre-trained model has better knowledge completeness and markedly better scene generalization. Humanoid robots operate in scenes more varied and complex than industrial robots', and the all-in-one multi-task training scheme of vision large models can help robots better adapt to human living scenes.

Decision-making: understanding everyday language and perceiving the environment are the basis of automated decision-making, and multimodal large models meet humanoid robots' decision-making needs. Single-modality intelligence cannot solve decision problems that involve multimodal information, such as the task "a voice tells the robot to fetch the green apple on the table". The purpose of unified multimodal modeling is to strengthen the model's cross-modal semantic alignment, gradually standardize the model, and let the robot synthesize visual, speech, and text information into decisions made with all of its senses. Pre-trained multimodal large models may become artificial intelligence infrastructure, increasing the diversity and versatility of the tasks robots can complete so that they are not limited to a single mode such as text or images but are compatible with many applications, expanding single-modality intelligence into fused intelligence so robots can combine the multimodal data they perceive into automated decisions.
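A minimal sketch of the cross-modal alignment step for the "green apple" example above, assuming a pretrained vision-language model (CLIP via the Hugging Face transformers library); the object crops come from a hypothetical upstream detector:

```python
# Hedged sketch: score candidate object crops against a spoken instruction
# transcribed to text, then pick the best match. File names are placeholders.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Crops of candidate objects produced by an upstream detector (hypothetical files).
crops = [Image.open(p) for p in ["obj0.png", "obj1.png", "obj2.png"]]
instruction = "a green apple on a table"

inputs = processor(text=[instruction], images=crops,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
scores = outputs.logits_per_image.squeeze(-1)  # similarity of each crop to text
target = int(scores.argmax())
print(f"Best-matching object index: {target}")  # hand this to the grasp planner
```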


Control: generative AI empowers robots to control themselves, finally closing the perception-decision-control loop. For humanoid robots to have universal capabilities, they first need "common sense", that is, general language understanding (speech) and scene understanding (vision); second, they need decision-making ability, that is, decomposing tasks after receiving instructions; finally, they need self-control and execution, and the code-generation ability of generative AI will ultimately close the loop between perception, decision-making, and action to achieve self-control. Indeed, the Microsoft team recently tried applying ChatGPT to robot-control scenarios: by writing the robot's underlying function library in advance and describing its functions and goals, ChatGPT can generate code to complete the task. Driven by generative AI, the threshold for robot programming will gradually fall, eventually reaching self-programming and self-control and completing the universal tasks humans take for granted.
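A minimal sketch of that pattern, assuming a hypothetical two-function robot library and the OpenAI Python client; the function names, coordinates, and model id are illustrative, not taken from the Microsoft paper:

```python
# Hedged sketch of "pre-written function library + LLM-generated code".
# move_to/grasp are hypothetical stubs standing in for a real control stack.
from openai import OpenAI

def move_to(x: float, y: float):
    print(f"moving to ({x}, {y})")       # real version would drive the base

def grasp(object_name: str):
    print(f"grasping {object_name}")     # real version would close the gripper

PROMPT = """You control a robot through these Python functions ONLY:
  move_to(x, y)      - drive the base to coordinates (x, y) in meters
  grasp(object_name) - pick up the named object
Task: go to the table at (2.0, 1.5) and pick up the cup.
Reply with Python code only."""

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
reply = client.chat.completions.create(
    model="gpt-4", messages=[{"role": "user", "content": PROMPT}]
)
generated = reply.choices[0].message.content
print(generated)   # a human inspects the generated code before running it
```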

2.3. OpenAI and Microsoft apply large language models to robots

OpenAI led the investment in the Norwegian humanoid robotics company 1X Technologies. In 2017, OpenAI launched Roboschool, open-source software for robot simulation, and deployed a new one-shot imitation learning algorithm demonstrating how a robot can learn a task from a human demonstration given in VR. In 2018, OpenAI released eight simulated robotics environments and a baseline implementation of hindsight experience replay, used to train models that work on physical robots. In 2022, Halodi Robotics tested its medical assistant robot EVE at Sunnaas Hospital in Norway, performing logistics work. On March 28, 2023, OpenAI led the investment in 1X Technologies (formerly Halodi Robotics). Through the Ansys Startup Program, Halodi Robotics uses Ansys simulation software to develop humanoid robots that can safely collaborate with people in everyday scenarios.

Microsoft proposed ChatGPT for Robotics, using ChatGPT to solve the problem of writing robot applications. In early 2023, Microsoft published a paper titled "ChatGPT for Robotics: Design Principles and Model Abilities"; the goal of the research was to see whether ChatGPT could think beyond text and reason about the physical world to help complete robotics tasks. Humans still rely heavily on hand-written code to control robots, and the team explored how to change that reality, using OpenAI's new language model ChatGPT to enable natural human-robot interaction.

Humans can go from being in the loop of the robot pipeline to being on the loop. The paper proposes that the LLM need not output code specific to a robot platform or library; instead, one creates simple high-level libraries for ChatGPT to call, and on the back end these high-level libraries are linked to existing libraries and APIs for the various platforms, scenarios, and tools. The results show that with ChatGPT, humans can interact with the language model through high-level commands in natural language; users feed human perception information into ChatGPT through text dialogue, and ChatGPT parses the observation stream and outputs the relevant actions within the dialogue, without generating code. In this way humans can seamlessly deploy across platforms and tasks while evaluating the quality and safety of ChatGPT's output. The human's main tasks in the robot pipeline are: 1) first, define a set of high-level robot APIs or a library. The library can be designed for a specific robot type and should map from the robot's control stack or perception library to existing low-level concrete implementations; it is important to give the high-level APIs descriptive names so that ChatGPT can reason about their behavior. 2) Write a text prompt for ChatGPT that describes the task objective and clearly states which functions are available in the high-level library; the prompt can also contain task constraints, or specify how ChatGPT should structure its answers, such as using a particular programming language or auxiliary parsing components. 3) The user evaluates ChatGPT's code output by direct inspection or in a simulator, and if needed gives ChatGPT natural-language feedback on the quality and safety of its answers. 4) When the user is satisfied with the solution, the final code can be deployed to the robot.
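A hypothetical prompt of the kind step 2 describes; the function names and constraints are illustrative, not drawn from the paper:

```
You are controlling a mobile manipulator. You may ONLY use these functions:
  detect_object(name) -> (x, y) position of the object, or None
  move_to(x, y)       -> drive the base to coordinates (x, y), in meters
  pick(name)          -> grasp the named object once within 0.5 m of it
Constraints: never move faster than 0.5 m/s near people; if the requested
object is ambiguous, ask a clarifying question instead of acting.
Structure your answer as a single Python code block, followed by one
sentence explaining your plan.
Task: bring the red mug from the kitchen table to the charging desk.
```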


ChatGPT can solve simple robot tasks in a zero-shot fashion. For simple robot tasks, the user only needs to provide a text prompt and a library description, with no concrete code examples; ChatGPT solved, zero-shot, spatio-temporal reasoning (controlling a planar robot to intercept a basketball using visual servoing), controlling a real drone to complete an object search, and manipulating a virtual drone for industrial inspection.

With a human user interacting on the loop, ChatGPT can complete more complex robot-control tasks. 1) Curriculum learning: teach ChatGPT the simple skills of picking and placing objects, then logically combine the learned skills for a more complex block-arrangement task; 2) AirSim obstacle avoidance: ChatGPT builds most of the key modules of the obstacle-avoidance algorithm but requires human feedback on information such as the drone's orientation. The human feedback is given in high-level natural language, which ChatGPT understands and uses to make code corrections where appropriate.

ChatGPT's dialogue system can parse observations and output the relevant actions. 1) Closed-loop object navigation with an API: ChatGPT is given access to computer-vision models as part of its library. In its "code" output, ChatGPT builds a perception-action loop, implementing the ability to estimate relative object angles, explore unknown environments, and navigate to user-specified objects; 2) closed-loop visual-language navigation using ChatGPT's dialogue system: in a simulation scenario, the human user feeds each new state observation in as dialogue text, and ChatGPT's output returns only a forward movement distance and a turning angle, so the "dialogue system" guides the robot step by step to the area of interest.
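A minimal sketch of that dialogue-as-controller loop, assuming the OpenAI Python client and a JSON reply format; the model id and message format are illustrative, not the paper's exact setup:

```python
# Hedged sketch: each step, the latest observation is sent to the model as
# text, and the model replies with only a forward distance and a turn angle.
import json
from openai import OpenAI

client = OpenAI()
history = [{
    "role": "system",
    "content": ("You guide a robot to the kitchen. Each user message gives "
                "the current observation. Reply ONLY with JSON: "
                '{"forward_m": <float>, "turn_deg": <float>}'),
}]

def step(observation: str) -> dict:
    history.append({"role": "user", "content": observation})
    reply = client.chat.completions.create(model="gpt-4", messages=history)
    text = reply.choices[0].message.content
    history.append({"role": "assistant", "content": text})
    return json.loads(text)  # e.g. {"forward_m": 0.5, "turn_deg": -15.0}

# One illustrative iteration; a real loop would execute the action on the
# robot, read new sensor data, and call step() again.
action = step("Doorway 2 m ahead, slightly to the left; corridor beyond.")
print(action)
```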

3. Humanoid form: making the robot's motion execution more versatile

Execution ability (the robot's limbs): mobility (legs) + fine manipulation (hands). The purpose of making robots humanoid is to make their execution ability more versatile. The environments in which robots perform tasks are built around the human body: buildings, roads, facilities, and tools are all designed for the convenience of human-shaped beings, and a new form of robot would require redesigning a new environment for it to fit. Designing a robot that performs tasks within a narrow range is relatively easy; to improve versatility, a humanoid robot that can stand in for a human is the natural choice. In this chapter, two representative products, Boston Dynamics' Atlas and Tesla's Optimus, are compared in terms of drive, environment perception, and motion control, to explore the trend in commercializable humanoid-robot motion-control schemes.

Boston Dynamics' Atlas is positioned as forward-looking technology research, focused on exploring the possibilities of the technology rather than commercialization. In hardware architecture, Atlas has excellent dynamic performance, instantaneous power density, and motion stability, achieving high-load, highly complex movements, a showcase of technology-driven engineering. Commercialization is not a major consideration for Boston Dynamics; the Atlas project serves more as a research platform for academic experiments. Tesla Optimus, by contrast, is driven by the goals of scale, standardization, and commercialization of humanoid robots, so cost and energy consumption are key metrics for the Tesla team.


3.1. Drive: hydraulic drive vs. electric drive

3.1.1. Electric drive: low cost, easy maintenance, high control accuracy, and high commercialization potential

Mainstream humanoid-robot drive schemes include hydraulic drive and electric drive (servo motor + reducer). Compared with electric drive, hydraulic drive offers larger output torque, higher power density, and higher overload capacity, meeting Boston Dynamics Atlas's needs for high-load actions and fast movement; however, hydraulic drive consumes more energy, costs more, and is prone to leakage and poor maintainability. On the one hand, high-load actions (parkour, backflips, and the like) are non-essential in commercial scenarios; on the other, the power density and response speed of electric drive systems keep improving. Combining this with electric drive's low cost, easy maintenance, and mature application, we believe electrically driven humanoid robots have the higher likelihood of commercialization.

3.1.2. Boston Dynamics Atlas: "hydraulically driven" scheme

Atlas has a total of 28 hydraulic actuators throughout its body to perform complex, high-load movements. The HPU (hydraulic power unit), Atlas's hydraulic power source, achieves high energy density in a very small package (~5 kW from ~5 kg); it is connected to each hydraulic actuator through fluid lines, enabling fast response and precise force control, and its high instantaneous power density lets the robot run, jump, backflip, and perform other complex actions, while the robot's structural strength owes to its highly integrated structural assembly. Based on officially disclosed images and patent details, we speculate that the ankle, knee, and elbow joints are driven by hydraulic cylinders, while the hips, shoulders, wrists, waist, and abdomen are driven by rotary (swing) hydraulic actuators.

3.1.3. Tesla Optimus: "electric drive" scheme

A single Optimus has 40 actuators in its whole body, 6-7 times as many as a typical multi-joint industrial robot. Of these, the body joints use reducer/lead-screw + servo-motor transmission, 28 actuators in total; the manipulator, based on an underactuated scheme, uses motor + tendon-drive transmission, with 6 motors and 11 degrees of freedom per hand.


According to Tesla AI Day, of the six actuator types Tesla developed in-house, the rotary joint scheme is inherited from industrial robots, while the linear actuator and the micro servo motor are new requirements specific to humanoid robots. Specifically:

Rotary joint scheme (shoulders, hips, waist and abdomen): servo motor + reducer. We speculate that a single humanoid robot will carry 6 RV reducers (hips, waist and abdomen) and 8 harmonic reducers (shoulders, wrists). Under the Tesla Optimus actuator scheme, the RV reducer is large, strong under load, and highly rigid, suiting the heavily loaded hip and waist joints: 2 joints per hip on each side plus 2 degrees of freedom at the waist and abdomen, 6 units in total. The harmonic reducer is small, with a high transmission ratio and high precision, suiting the shoulder and wrist joints: 3 per shoulder and 1 per wrist on each side, 8 units in total. As more manufacturers enter the field, their actuator schemes may differ; if linear actuators were replaced by rotary actuators, the number of reducers per robot would rise.

Joints with small swing angles (knees, elbows, ankles, wrists): linear actuators (servo motor + lead screw). The integrated servo electric cylinder (servo motor + lead screw) has self-locking ability and lower energy consumption than a purely rotary-joint solution; linear actuators are space-efficient and deliver large driving forces. We speculate that linear actuators based on a torque motor combined with a planetary roller screw will be applied to the linear-actuator joints (hips, knees, ankles, elbows, wrists), 14 linear actuators in total.
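A quick arithmetic check of the counts speculated above, in a short sketch (all figures are this report's estimates, not confirmed Tesla specifications):

```python
# Tally of the speculated Optimus actuator counts (report estimates only).
rv_reducers       = 2 * 2 + 2       # hips (2 per side) + waist/abdomen = 6
harmonic_reducers = 3 * 2 + 1 * 2   # shoulders (3 per side) + wrists   = 8
rotary_actuators  = rv_reducers + harmonic_reducers   # 14
linear_actuators  = 14              # knees, elbows, ankles, etc. (estimate)
body_actuators    = rotary_actuators + linear_actuators  # 28
hand_motors       = 6 * 2           # 6 micro motors per hand
total = body_actuators + hand_motors
print(total)  # 40, matching the whole-body actuator count cited earlier
```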

With its high load capacity, high rigidity, and long life, the planetary roller screw has become the key transmission device in the humanoid robot's linear actuators. According to information presented at Tesla AI Day 2022, the Optimus linear actuator uses an integrated servo electric cylinder with a planetary roller screw. We believe the servo electric cylinders of the lower-limb hip, knee, and ankle joints and of the elbow are the most likely to use planetary roller screws, with their high load capacity and stiffness, as the transmission device. The planetary roller screw has a complex structure, is difficult to machine, and is very costly; reducing its cost by adjusting the design and process to fit humanoid-robot requirements is the prerequisite for large-scale application.


Manipulator: each Optimus hand contains 6 actuators achieving 11 degrees of freedom, driven by micro motors; the "underactuated" solution is cost-effective, while the exact "tendon-drive" structure remains uncertain. "Underactuation" means the number of actuators in a system is smaller than its number of degrees of freedom. Because a manipulator inherently has many degrees of freedom, designers reduce the number of motors (that is, actuators) to improve integration and compactness, cut cost, and simplify subsequent motion control, producing an underactuated scheme with fewer actuators than degrees of freedom. By optimizing the mechanical structure, more degrees of freedom are driven with fewer actuators at lower cost, which is the mainstream choice for commercial products and university-built manipulators.

Manipulator drive schemes vary greatly, and light weight and low cost in the motor are key. In mechanical transmission, mainstream manipulator solutions include tendon drive, linkages, rack and pinion, material deformation, and others, and drive philosophies differ widely: the Ritsumeikan Hand drives 15 joints with 2 actuators through coupled transmission; the Stanford/JPL dexterous hand uses 16 motors per hand; the Shadow Hand has 30 motors per hand for a total of 24 degrees of freedom. A humanoid robot's manipulator must be lightweight and compact with strong grasping force, so its motors should be small, light, precise, and high-torque. The coreless (hollow-cup) motor is compact, energy-dense, and low in energy consumption, matching the needs of humanoid-robot manipulators well.
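A minimal sketch of the underactuation idea: a fixed coupling matrix maps a few motor commands to more joint angles, so 6 actuators can drive 11 degrees of freedom; the matrix values here are made-up placeholders, not Optimus parameters:

```python
# Hedged sketch: coupled transmission maps n_motors commands to n_joints
# angles, e.g. one tendon flexing all three joints of a finger by fixed ratios.
import numpy as np

n_motors, n_joints = 6, 11
rng = np.random.default_rng(0)
C = np.abs(rng.normal(size=(n_joints, n_motors)))  # placeholder coupling ratios

motor_angles = np.array([0.2, 0.0, 0.5, 0.1, 0.3, 0.0])  # radians
joint_angles = C @ motor_angles   # 11 joint angles from 6 motor commands
print(joint_angles.round(3))
```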

The Tesla Optimus manipulator uses motor + tendon (rope) drive, and the hand transmission solution may yet be optimized. Although rope drive gives the manipulator great flexibility and greatly simplifies system design, its reliability and transmission efficiency are lower than traditional linkages or rack and pinion; it may be an expedient short-term choice by the R&D team.

3.2. Environment perception: depth camera + lidar vs. pure vision

The perception and positioning technologies that enable autonomous robot movement are based mainly on vision, laser, ultrasound, GPS, IMU, and so on, corresponding to the different sensor categories of a robot perception system. SLAM (simultaneous localization and mapping) is a relatively mature and widely used positioning technology: the robot collects and processes data from its various sensors to generate an estimate of its own position and pose together with a map of the scene. The SLAM problem can be described as a robot moving from an unknown location in an unknown environment, localizing itself from position estimates and sensor data as it moves while building an incremental map. With localization and a map, the robot can then move autonomously according to path-planning algorithms (global, local, obstacle avoidance).
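A minimal sketch of the mapping half of this loop, assuming the pose estimate is already available from odometry (full SLAM would jointly correct the pose and close loops, which this omits); the grid size and scan are illustrative:

```python
# Hedged sketch: project lidar returns into a 2D occupancy grid at a known pose.
import math
import numpy as np

GRID, RES = 200, 0.05                 # 200x200 cells, 5 cm per cell
grid = np.zeros((GRID, GRID))         # occupancy evidence, 0 = unknown

def integrate_scan(pose, ranges, angles):
    """pose = (x, y, heading); mark each lidar endpoint as occupied."""
    x, y, th = pose
    for r, a in zip(ranges, angles):
        ox = x + r * math.cos(th + a)
        oy = y + r * math.sin(th + a)
        i, j = int(oy / RES) + GRID // 2, int(ox / RES) + GRID // 2
        if 0 <= i < GRID and 0 <= j < GRID:
            grid[i, j] += 0.9          # increase occupancy evidence

# One fake scan: a wall 2 m ahead, spanning -30..30 degrees.
angles = np.radians(np.linspace(-30, 30, 61))
ranges = np.full_like(angles, 2.0)
integrate_scan((0.0, 0.0, 0.0), ranges, angles)
print(int((grid > 0).sum()), "cells marked occupied")
```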


3.2.1. Boston Dynamics Atlas: Depth Camera + LiDAR

Boston Dynamics' Atlas perception solution combines a depth camera and lidar, with gait planning based on a multi-plane segmentation algorithm. Atlas uses a ToF depth camera to generate point clouds at 15 frames per second, extracts environmental surfaces from the point cloud with a multi-plane segmentation algorithm, and maps the data to identify surrounding objects. The onboard industrial computer then performs gait planning from the identified surfaces and objects to achieve obstacle avoidance, ground-condition detection, cruising, and similar tasks. IHMC (the Institute for Human and Machine Cognition) is a leading organization in robot control algorithms, chiefly developing the key algorithms humanoid robots need in order to walk; the algorithms that direct Atlas to stand and walk come from IHMC.
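A minimal sketch of the multi-plane segmentation idea, fitting the dominant plane of a point cloud with RANSAC; the thresholds and synthetic cloud are illustrative assumptions, not Atlas's actual algorithm:

```python
# Hedged sketch: iterative RANSAC plane extraction from a depth point cloud.
import numpy as np

def ransac_plane(points, iters=200, tol=0.02, rng=np.random.default_rng(1)):
    """Return (normal, d, inlier_mask) of the dominant plane n.p + d = 0."""
    best_mask, best = None, -1
    for _ in range(iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        if np.linalg.norm(n) < 1e-9:
            continue                          # degenerate sample, skip
        n = n / np.linalg.norm(n)
        mask = np.abs((points - p0) @ n) < tol
        if mask.sum() > best:
            best, best_mask, plane = mask.sum(), mask, (n, -n @ p0)
    return plane[0], plane[1], best_mask

# Synthetic cloud: a floor (z near 0) plus scattered obstacle points.
rng = np.random.default_rng(0)
floor = np.column_stack([rng.uniform(-1, 1, 500), rng.uniform(-1, 1, 500),
                         rng.normal(0, 0.005, 500)])
clutter = rng.uniform(-1, 1, (100, 3))
cloud = np.vstack([floor, clutter])

n, d, inliers = ransac_plane(cloud)
print("plane normal:", n.round(2), "| inliers:", int(inliers.sum()))
# Removing the inliers and re-running extracts the next plane, and so on.
```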

3.2.2. Tesla Optimus: Vision-only solution at a lower cost

Tesla Optimus environment perception uses a camera-based pure-vision solution, porting Tesla's full self-driving stack at lower cost. Optimus carries three cameras on its head (a fisheye camera plus left and right cameras) and perceives its environment through panoptic segmentation plus Tesla's self-developed 3D reconstruction algorithm (Occupancy Network); the pure-vision solution costs less than sensing devices such as lidar but demands high computing power. The robot inherits the Autopilot algorithm framework and, by collecting new data, trains neural networks suited to the robot for 3D environment reconstruction, path planning, autonomous navigation, dynamic interaction, and more. Implanting Tesla's powerful Full Self-Driving (FSD) system lets the robot's vision solution advance toward greater accuracy and intelligence without increasing hardware cost.

3.3. Motion control: no universal controller solution has yet emerged

Motion-control algorithms are the core competitiveness, and each manufacturer's humanoid-robot control algorithm is self-developed. Humanoid robots place high demands on motion control and on perception computing, and the number and type of actuators vary greatly across manufacturers, so the motion-control algorithm may become a manufacturer's core competitiveness in the future and is most likely to be developed in-house. In addition, the control scheme and the understanding of customer application scenarios and process requirements are also important factors; downstream scenarios are currently scattered, and it is still difficult for any single manufacturer to make a humanoid robot universal across every scene.

3.3.1. Motion-control algorithms: similar approaches, all using an offline behavior library plus real-time adjustment

Boston Dynamics Atlas: behavior control based on an offline behavior library and model predictive control (MPC). The offline behavior library is created with trajectory-optimization algorithms (centroidal dynamics optimization + kinematics optimization) and motion capture, and technicians add new capabilities to the robot by adding new trajectories to the library. Given a behavior target, the robot selects from the library the behavior closest to the target, obtaining a theoretically feasible, dynamically continuous action. Model predictive control (MPC) then adjusts parameter details (forces, posture, joint timing, and so on) using real-time sensor feedback, adapting to the differences between the real environment and the ideal one and other real-time factors. MPC is an online control method that allows the robot to deviate from the template and predicts the transitions between behaviors, such as from a jump into a backflip, simplifying creation of the behavior library.
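A minimal sketch of the receding-horizon idea behind MPC, on a 1-D double integrator rather than a full humanoid model; the dynamics, horizon, and weights are illustrative assumptions:

```python
# Hedged sketch of MPC: each tick, optimize a short control sequence toward a
# reference, apply only the first input, then re-plan from fresh state.
import numpy as np

dt, N = 0.05, 20                       # timestep, horizon length
A = np.array([[1, dt], [0, 1]])        # state: [position, velocity]
B = np.array([[0.5 * dt**2], [dt]])

def mpc_step(x, target):
    """Solve min sum (pos_k - target)^2 + r*u_k^2 by least squares."""
    # Free response of the position over the horizon (no control applied).
    Phi = np.vstack([np.linalg.matrix_power(A, k) @ x for k in range(1, N + 1)])
    G = np.zeros((N, N))               # maps controls to predicted positions
    for k in range(1, N + 1):
        for j in range(k):
            G[k - 1, j] = (np.linalg.matrix_power(A, k - 1 - j) @ B)[0, 0]
    r = 0.01                           # control-effort weight
    H = G.T @ G + r * np.eye(N)
    u = np.linalg.solve(H, G.T @ (target - Phi[:, 0]))
    return u[0]                        # apply only the first control

x = np.array([0.0, 0.0])
for t in range(100):                   # closed loop: re-plan every tick
    u = mpc_step(x, target=1.0)
    x = A @ x + B.flatten() * u        # "real" system step (here ideal model)
print("final position:", round(x[0], 3))
```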


Tesla Optimus: the gait-planning approach is similar to Atlas's. A motion planner generates a reference trajectory, a controller adjusts and optimizes behavior in real time from sensor information, and the control algorithm is not yet mature. In the gait-control algorithm, the motion planner first generates a reference trajectory from the desired path and determines the dynamic parameters of the robot model. The controller estimates the robot's attitude from sensor data and corrects the behavior parameters according to the differences between the real environment and the ideal model to obtain the actual behavior. Between successive gaits, the algorithm incorporates the foot-fall pattern of human walking (the foot lands heel-first and the toes leave the ground last), combined with coordinated arm swing from the upper body, to achieve natural arm swing, long strides, and straight-knee walking as far as possible, improving walking efficiency and posture. At present the robot's gait-control scheme is not mature enough, with weak disturbance rejection and poor dynamic stability; Tesla engineers have said Optimus's balance problem may take 18-36 months to solve.

Similarly, Optimus's upper-limb manipulation uses an offline behavior library built from motion capture and inverse-kinematics mapping, with real-time trajectory optimization for adaptive operation.

3.3.2. Motion controllers: mostly self-designed, with requirements varying greatly across manufacturers

Humanoid robots collect and process data in many modalities, and their actuators are far more complex than industrial robots', demanding high real-time computing power and high integration from the controller. The types and number of sensors on a humanoid robot far exceed an industrial robot's, and the controller must simultaneously handle 3D map construction, path planning, multi-sensor data acquisition and computation, and closed-loop control; the process is complex, and both data dimensionality and data volume exceed industrial robots', so computing requirements are high. Industrial robots generally perform recognition and detection through external image grabbers and image-processing software; a humanoid robot in a mobile scenario requires the image processor to be integrated into the controller chip, demanding chip-level integration. Most humanoid-robot controllers are independently designed, and needs differ greatly across manufacturers. With strong uncertainty in downstream scenarios, manufacturers differ considerably in drive scheme (drive mode, motor scheme), perception scheme (pure vision, multi-sensor fusion, etc.), and control algorithms, and their robots differ in controller computing-power and storage needs, so controller composition differs and is mainly independently designed. We believe the humanoid-robot controller is likely to adopt a distributed control system: a core controller plus multiple small controllers, with the small controllers driving the joints of each body region.
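A minimal sketch of such a distributed layout: a core controller routes whole-body joint targets to per-region joint controllers, each of which would run its own local servo loop; all names and the message format are illustrative assumptions:

```python
# Hedged sketch: core controller dispatching targets to per-region controllers.
from dataclasses import dataclass, field

@dataclass
class JointController:
    region: str
    targets: dict = field(default_factory=dict)

    def command(self, joint: str, angle: float):
        self.targets[joint] = angle          # a local loop would servo to this

class CoreController:
    def __init__(self):
        self.nodes = {r: JointController(r) for r in
                      ("left_leg", "right_leg", "left_arm", "right_arm")}

    def dispatch(self, plan: dict):
        """plan: {'left_leg/knee': 0.4, ...} -> route to the owning node."""
        for path, angle in plan.items():
            region, joint = path.split("/")
            self.nodes[region].command(joint, angle)

core = CoreController()
core.dispatch({"left_leg/knee": 0.40, "right_leg/knee": 0.38,
               "left_arm/elbow": 1.10})
print(core.nodes["left_leg"].targets)   # {'knee': 0.4}
```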

Boston Dynamics Atlas: the robot body carries 3 industrial computers responsible for the motion-control computations. The controller receives data from the lidar and ToF depth camera, generates maps and paths, and plans target behavior from the offline behavior library; during actual movement it collects sensor data such as IMU, joint positions, forces, oil pressure, and temperature to adjust and optimize the action sequence in real time.

Tesla Optimus: reusing the perception and computing capabilities of Tesla vehicles, a controller system suited to humanoid robots is developed around the Full Self-Driving (FSD) chip. The FSD chip integrates a central processing unit, a neural network processor (NPU), an image processor (GPU), synchronous dynamic random-access memory (SDRAM), a signal processor (ISP), a video encoder (H.265), and security modules, efficiently implementing image processing, environment perception, general-purpose computing, and real-time behavior control. To match the differing needs of humanoid robots and cars, the Optimus controller chip is adapted from the FSD chip: it adds multimodal input support for visual, auditory, tactile, and other data, implants voice-interaction and wireless-connectivity modules to support human-machine communication, and includes hardware protection functions to ensure the safety of the robot and the people around it, thereby realizing behavioral decision-making and motion control.

