The world is multidimensional, and the same scene in life takes on different forms when viewed from different perspectives. If artificial intelligence is to become more human-like, its perspective needs to be brought closer to ours. Looking at the environment from a human perspective, artificial intelligence may see a whole new world.
Recently, an academic alliance formed by Facebook and 13 universities and laboratories across nine countries announced that in November it will open-source Ego4D (Egocentric 4D Perception), a project that allows artificial intelligence to interact with the world from a first-person perspective. The project contains more than 3,025 hours of first-person video covering the daily lives of more than 700 participants from 73 cities. These videos will help make the way AI perceives the world more human.
So, through what perspective does artificial intelligence mainly perceive the world, and what impact do different perspectives have on how artificial intelligence understands its environment? What are the main technologies artificial intelligence uses to perceive the environment and understand the world? And if it is to perceive the world in a way more like humans do, what bottlenecks does artificial intelligence need to break through?
AI often employs a third-person perspective
"For AI systems to interact with the world like humans, the field of AI needs to develop a whole new paradigm of first-person perception. This means that artificial intelligence should understand daily activities from a first-person perspective when moving and interacting in real time. Facebook's chief research scientist Kristin Grauman once said.
Today's computer vision systems mostly learn from millions of photos and videos shot in the third person. "To build a new paradigm of perception, we need to teach AI to understand and interact with the world as humans do, through immersive observation from a first-person perspective, that of 'me'; this can also be called egocentric cognition," Tan Mingzhou, director of the Artificial Intelligence Division of the Yuanwang Think Tank and chief strategy officer of Turing Robot, pointed out in an interview with a Science and Technology Daily reporter on October 26.
How should the first-person and third-person perspectives of artificial intelligence be understood? Tan Mingzhou explains: "The first-person perspective carries a strong sense of immersion; for example, when playing a game you are inside the scene, and the game screen you see is what you would see in the real world. The third-person perspective, also known as the God's-eye view, is as if you were always floating near the character, like a shadow, able to see both the character and its surroundings. For example, when hiding behind a bunker in a third-person view you can see what lies in front of the bunker, whereas in a first-person view, limited to your own line of sight, you can see only the bunker itself from behind it."
"Another example is autonomous driving, if its vision system only collects data from the perspective of a bystander (such as a car), even if it is trained through hundreds of thousands of images or videos of vehicles traveling based on bystander perspectives, artificial intelligence may still not know how to do it, and it is difficult to reach the current level of autonomous driving." Because the perspective of this bystander is very different from the perspective of sitting in front of the steering wheel in the car, under the first-person perspective, the reaction of the real driver also includes acts such as clicking and braking, and these data cannot be collected from the perspective of the bystander. Tan Mingzhou further said.
"In the past, the ARTIFICIAL community rarely collected data sets from a first-person perspective, and this project makes up for the shortcomings of the AI perspective system. The future development of AR and VR is very important, if artificial intelligence can start from 'I' and observe and understand the world from a first-person perspective, it will open a new era of immersive experience for humans and artificial intelligence. Tan Mingzhou pointed out.
Kristen Grauman also said: "The next generation of AI systems needs to learn from an entirely different kind of data: video that shows the world from the center of the action rather than from the sidelines."
Build real-world datasets
At present, what levers can be used to let artificial intelligence perceive the environment, understand the world, and build a human-like cognitive system?
Industry experts point out that history has shown benchmarks and datasets to be key catalysts of innovation in the AI industry. Today's computer vision systems, which can recognize almost any object in an image, are built on datasets and benchmarks that give researchers a test bed for studying real-world images.
"The project Facebook released a few days ago is, in essence, the construction of a dataset aimed at training artificial intelligence models to be more human-like. It sets out five benchmark challenges around the first-person visual experience; that is, it breaks the first-person perspective down into five goals and runs corresponding training-set competitions," Tan Mingzhou pointed out.
The five Ego4D benchmarks are: episodic memory (what happened when?), forecasting (what might I do next?), hand-object interaction (what am I doing?), audio-visual diarization (who said what and when?), and social interaction (who is interacting with whom?).
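As a rough illustration of how these five tasks differ, the sketch below pairs each benchmark with the kind of question it answers; the query strings are invented examples, not items from the actual dataset.

```python
# Illustrative only: the example queries are invented, not taken from Ego4D.
EGO4D_BENCHMARKS = {
    "episodic_memory":          "Where did I leave my keys?",          # what happened when?
    "forecasting":              "Which object will I reach for next?", # what might I do next?
    "hand_object_interaction":  "What am I doing with this pan?",      # what am I doing?
    "audio_visual_diarization": "Who said 'let's start', and when?",   # who said what and when?
    "social_interaction":       "Who is speaking to me right now?",    # who is interacting with whom?
}

for task, example_query in EGO4D_BENCHMARKS.items():
    print(f"{task}: {example_query}")
```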
Tan Mingzhou stressed that these benchmarks will drive research on the building blocks needed to develop artificial intelligence assistants, ones that can understand and act on instructions not only in the real world but also in the metaverse.
To build this dataset, a university team working with Facebook distributed off-the-shelf head-mounted cameras and other wearable sensors to study participants to capture first-person, unedited videos of everyday life. The project focused on participants capturing videos from everyday scenes such as shopping, cooking, chatting while playing games, and other group activities with family and friends.
The video captures what the camera wearer chooses to look at in a particular environment and how the camera wearer interacts with people and objects from a self-centered perspective. So far, camera wearers have performed hundreds of activities and interacted with hundreds of different objects, and all data from the project is public.
"Facebook's research can drive the advancement of self-centered cognition research in the field of artificial intelligence more quickly. This will have a positive impact on the way we live, work and play in the future. Tan Mingzhou said.
Make AI cognitive abilities more human-like
The ultimate goal of AI development is to benefit humanity and help us meet the increasingly complex challenges of the real world. Imagine AR devices that accurately demonstrate how to play the piano, play chess, hold a pen and sketch in music, chess, calligraphy and painting classes; that vividly guide home cooks through grilling and preparing dishes according to a recipe; or that help forgetful elderly people recall the past through holograms in front of them...
Facebook stressed that it hopes the Ego4D project will open up a whole new path for academic and industry experts and help build smarter, more flexible and more interactive computer vision systems. As AI develops a deeper understanding of everyday human life, it is believed that this project will be able to contextualize and personalize the AI experience in unprecedented ways. However, current research has only scratched the surface of egocentric cognition.
How can AI's cognitive abilities be made more human-like? "The first is attention. The attention mechanism in artificial intelligence is closer to intuition, whereas human attention is selective. At present, most AI attention mechanisms work by repeatedly telling the AI during training which locations and which things are relevant. In the future, trial participants may be able to wear special devices that track where the eyes focus, so that further relevant data can be collected," Tan Mingzhou pointed out.
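For readers unfamiliar with what an "attention mechanism" does in practice, here is a minimal sketch of scaled dot-product attention, the building block most modern vision and language models use; it is a generic illustration, not the specific mechanism of any Ego4D model.

```python
# Generic scaled dot-product attention, for illustration only.
import numpy as np

def scaled_dot_product_attention(query, keys, values):
    """Weight the `values` by how well `query` matches each row of `keys`."""
    d_k = query.shape[-1]
    scores = query @ keys.T / np.sqrt(d_k)          # similarity of query to each key
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ values                         # attention-weighted summary

# Toy example: one query vector attending over three "image region" features.
rng = np.random.default_rng(0)
query = rng.standard_normal((1, 8))
keys = rng.standard_normal((3, 8))
values = rng.standard_normal((3, 8))
print(scaled_dot_product_attention(query, keys, values).shape)  # (1, 8)
```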
"Second, it is also necessary to define the behavior of artificial intelligence with the correlation of events and behaviors as a research center. The occurrence of an event involves multiple behaviors, and artificial intelligence systems are trained in the form of human feedback to make the behavior of artificial intelligence consistent with our intentions. Tan Mingzhou further said.
Tan Mingzhou emphasized: "In addition, hearing and vision, language and behavior need to cooperate, respond to one another and work in concert. This requires building a multimodal interaction model, studying in depth why a viewpoint focuses in a particular direction, combining that with intention recognition, and forming a linkage mechanism with behavior."
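As a sketch of what a multimodal interaction model could mean at its simplest, the following code fuses pre-computed vision, audio and language feature vectors into one joint representation through concatenation and a learned projection; the dimensions and the fusion scheme are assumptions for illustration, not a description of any system mentioned in this article.

```python
# Minimal late-fusion sketch; all sizes and the fusion scheme are illustrative.
import numpy as np

def fuse_modalities(vision_feat, audio_feat, text_feat, weight, bias):
    """Concatenate per-modality features and project them into a shared space."""
    joint = np.concatenate([vision_feat, audio_feat, text_feat])  # shape (3 * d,)
    return np.tanh(weight @ joint + bias)                         # fused embedding

d = 16                                        # per-modality feature size (assumed)
rng = np.random.default_rng(0)
weight = 0.1 * rng.standard_normal((32, 3 * d))
bias = np.zeros(32)

fused = fuse_modalities(rng.standard_normal(d), rng.standard_normal(d),
                        rng.standard_normal(d), weight, bias)
print(fused.shape)  # (32,) -- a joint embedding an intention classifier could consume
```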
Hualing
Source: Science and Technology Daily