
Where should computer vision go next? Fei-Fei Li points to three "North Stars": embodied intelligence, visual reasoning, and scene understanding

Reporting by XinZhiyuan

Editor: LRS

ImageNet has witnessed the remarkable course of computer vision's development, with performance on some tasks now surpassing humans. So how should the field develop from here? In a recent article, Fei-Fei Li points to three directions: embodied intelligence, visual reasoning, and scene understanding.

During the deep learning revolution, computer vision, powered by the large-scale ImageNet dataset, has shown astonishing performance on tasks such as image classification, object detection, and image generation, in some cases exceeding human accuracy.

But why has CV achieved so much? Where does the future lie?

Recently, "Chinese AI goddess" Li Feifei published an article in the journal D dalus of the American Academy of Arts and Sciences, taking the object recognition task in computer vision as the starting point to study the development process of ImageNet dataset and related algorithms.


Article link: https://www.amacad.org/publication/searching-computer-vision-north-stars

The article argues that much technological progress stems from the pursuit of "North Stars": key problems in a scientific discipline that focus researchers' attention, spark enthusiasm, and lead to breakthroughs.

After the success of ImageNet and object recognition, more and more North Star problems have emerged.

The article reviews the brief history of ImageNet, the work surrounding it, and subsequent developments, with the aim of inspiring more North Star work that advances the field and artificial intelligence in general.


The paper's second author, Ranjay Krishna, is an assistant professor in the Allen School of Computer Science & Engineering at the University of Washington. He received his Ph.D. from Stanford University in 2021 under Fei-Fei Li's supervision. His research sits at the intersection of computer vision and human-computer interaction, using frameworks drawn from social and behavioral science to develop machine learning representations, interaction models, training paradigms, data collection pipelines, and evaluation protocols.


ImageNet's past and present

To most ordinary users, artificial intelligence is a rapidly evolving field. All of it, of course, stems from the engineering feats of modern computer science, and in recent years AI engineering has progressed faster and faster.

From spam filtering to personalized recommendation systems to smart automatic braking in cars, engineering practice pervades these systems.

The science behind the engineering, however, is often overlooked.


Researchers in AI, who tend to understand both the engineering and the science deeply, regard the two as inseparable and complementary: practice stimulates new ideas and explorations, which over time are put back into engineering practice.

Once you have identified the fundamental problem and found the next North Star, you are already at the frontier of the field. As Albert Einstein put it, posing a question is often more important than solving one.

The field of artificial intelligence has been driven by North Star problems since 1950, when Turing cleverly posed the question of how to judge whether a computer deserves to be called intelligent: the "Turing test."

Six years later, when the founders of AI convened the Dartmouth conference, they set another ambitious goal: to build machines that could "use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves."


Without such guiding lights, we might never solve new problems.

In the study of artificial intelligence, vision is central; some evolutionary biologists hypothesize that the evolution of animal eyes was a driving force behind the diversification of species.

So how do you teach a computer to see things?

At the turn of the century, inspired by a large body of earlier work, Fei-Fei Li and her collaborators posed the problem of object recognition: can a computer correctly identify what appears in a given image?

This looked like a promising North Star problem. From the 1990s to the early 2000s, object recognition researchers made real progress toward this daunting goal, but progress was slow, because real-world objects vary enormously in appearance.

Even within a single, specific category, such as houses, dogs, or flowers, objects can look completely different. For example, a model that accurately identifies the object in a photo as a dog should do so whether it is a German Shepherd, a Poodle, or a Chihuahua; whether it is shot from the front or the side; whether it is running to catch a ball or standing on all fours; even with a blue bandana around its neck. In short, dog images are dizzyingly diverse, and the models once used to teach computers to recognize such objects could not cope with this diversity.

One major reason is that past models tended to rely on hand-designed templates to capture image features, and lacked large-scale image data as input to cope with the diversity of objects.

This meant that a completely new dataset was needed, one meeting three design goals: scale, diversity, and quality.


The first is scale. Psychologists hypothesize that human-like perception requires exposure to thousands of different objects. From the time a toddler begins to learn, daily life exposes them to an enormous number of images; a six-year-old, for example, has probably seen three thousand different objects and learned enough features to distinguish more than thirty thousand categories.

At the time, the most commonly used object recognition dataset contained only 20 object categories, so expanding the dataset was essential: the team collected 15 million images from the Internet and labeled them with their corresponding object categories.

Following WordNet, Fei-Fei Li named the new dataset ImageNet.

The second is diversity. The images collected from the Internet cover many categories; there are more than 800 species of birds alone, and in total 21,841 categories organize these tens of millions of images. To make trained models more robust, ImageNet contains images from varied scenarios, such as "German Shepherd in the kitchen," and organizes its labels hierarchically with hypernyms and hyponyms, so that "husky," for instance, includes "Alaskan husky" and "heavy-coated Arctic sled dog."
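To make the hypernym/hyponym idea concrete, here is a minimal sketch that walks a WordNet-style category hierarchy using NLTK's WordNet interface. The library choice and the exact traversal are assumptions for illustration; this is not the ImageNet construction pipeline itself.

```python
# Illustrative only: browse WordNet's noun hierarchy, the structure
# ImageNet's categories were organized around.
from nltk.corpus import wordnet as wn  # pip install nltk; then nltk.download('wordnet')

dog = wn.synset('dog.n.01')

# Hyponyms are the more specific categories beneath a synset,
# e.g. 'corgi' or 'poodle' beneath 'dog'.
for hyponym in dog.hyponyms():
    print(hyponym.name(), '-', hyponym.definition())

# One full hypernym path, from the root ('entity') down to 'dog'.
path = dog.hypernym_paths()[0]
print(' -> '.join(s.name() for s in path))
```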


The third is quality. To create a gold-standard dataset reflecting human visual acuity, ImageNet accepted only high-resolution images. To make the labels more accurate, the research team first asked Princeton undergraduates to label and validate them, then turned to Amazon's crowdsourcing platform, eventually hiring about 50,000 labelers from 167 countries and territories between 2007 and 2009 to label and verify the objects in the dataset.
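One common way crowdsourced labels are verified is to accept a label only when enough independent annotators agree. The consensus rule below is a hypothetical sketch of that idea; the article does not spell out ImageNet's exact verification algorithm.

```python
# A hypothetical majority-vote consensus rule for crowdsourced labels.
from collections import Counter

def consensus_label(votes, min_agreement=0.8):
    """votes: labels from independent annotators for one image.
    Returns the winning label, or None if agreement is too low."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None

print(consensus_label(["dog", "dog", "dog", "wolf", "dog"]))  # -> "dog"
print(consensus_label(["dog", "wolf", "dog", "wolf"]))        # -> None (re-label)
```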

With the ImageNet data built, the key question became how to put it to work.

The ImageNet team made the dataset freely available to any interested researcher and set up an annual competition to incentivize the development of better models.

The turning point came in 2012, when AlexNet demonstrated the power of convolutional neural networks on large-scale object recognition, and its accuracy crushed the runner-up.


Although neural networks had been studied for decades, it was ImageNet that let them unleash their true power.

Within a year, nearly every AI paper seemed to be about neural networks, and as more researchers joined in, object recognition accuracy climbed higher and higher.

In 2017, the challenge concluded. Over eight years, contestants raised the algorithms' recognition accuracy from 71.8% to 97.3%, surpassing even human accuracy (about 95%).

Learning to recognize objects is just one way of learning to "see." Computer vision includes many more tasks, such as object detection, and the similarities among them mean that experience on one task can carry over to another.

In theory, computers should be able to exploit these similarities, a process known as "transfer learning."

Humans are very good at transfer, and transfer learning is of great help to AI as well. The current way to enable it in computers is pre-training, whose typical starting point is learning object recognition on the ImageNet dataset.
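As a concrete sketch of pre-training followed by transfer: the snippet below, assuming PyTorch and torchvision (the article names no framework), loads ImageNet-pre-trained weights, freezes the feature extractor, and retrains only a new classification head for a hypothetical 5-class task.

```python
# A minimal transfer-learning sketch: reuse ImageNet features for a new task.
import torch
import torch.nn as nn
from torchvision import models

# Load a ResNet-18 whose weights were learned on ImageNet classification.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the pre-trained feature extractor ...
for param in model.parameters():
    param.requires_grad = False

# ... and replace the 1000-class ImageNet head with one for the new task
# (here a hypothetical 5-class problem).
num_classes = 5
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new head's parameters are updated during fine-tuning.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch of 224x224 RGB images.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_classes, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```

Freezing the backbone keeps the general visual features learned from ImageNet intact, so only a small amount of task-specific data is needed to train the new head.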

But that is not to say ImageNet is useful for every computer vision task.

One example is medical imaging. Conceptually, classifying medical images (such as screening for tumors) is not fundamentally different from recognizing photos taken with a phone: both require visual images and category labels, and both can be judged by a properly trained model.


But the ImageNet dataset cannot be used to screen tumors, because it contains no data for that task. What's more, crowdsourcing platforms are largely unworkable here: labeling data for medical diagnosis requires a very high level of expertise, which is scarce and expensive.

Computer vision certainly has other applications, such as analyzing satellite imagery to help governments assess changes in crop yields, water levels, deforestation, and wildfires, and to track climate change.

The use of ImageNet also exposed a problem: people focus so much on large-scale data that they ignore the impact of individual data points. For example, some "adversarial examples" can make a model misclassify an image by modifying a single pixel, and researchers are still working out how to defend against such attacks.
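A simplified sketch of the single-pixel idea follows. The published one-pixel attack searches with differential evolution, but even the naive random search below, which keeps whichever single-pixel change most lowers the model's confidence in the true class, illustrates the fragility. All names and parameters here are illustrative assumptions.

```python
# A toy single-pixel perturbation search (not the published attack algorithm).
import torch

def one_pixel_attack(model, image, true_label, trials=500):
    """image: (3, H, W) tensor with values in [0, 1]; returns a perturbed copy."""
    model.eval()
    _, h, w = image.shape
    best, best_conf = image, 1.0
    for _ in range(trials):
        candidate = image.clone()
        # Overwrite one random pixel with a random RGB value.
        y = torch.randint(h, (1,)).item()
        x = torch.randint(w, (1,)).item()
        candidate[:, y, x] = torch.rand(3)
        with torch.no_grad():
            probs = torch.softmax(model(candidate.unsqueeze(0)), dim=1)
        conf = probs[0, true_label].item()
        if conf < best_conf:  # keep the most damaging single-pixel change
            best, best_conf = candidate, conf
    return best
```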

Finally, ImageNet's widespread impact has drawn criticism of datasets generally and raised questions that were not fully considered at its inception.

The most serious is fairness in images of people. Although the team knew early on to filter out blatantly derogatory image tags involving race, sex, and so on, subtler problems remain in the dataset: labels that are not pejorative in themselves but can be offensive when applied improperly.

While these fairness issues are difficult to eliminate completely, there is ongoing work dedicated to mitigating the effects of bias.

Where are CV's North Stars?

What's next for computer vision?

The authors believe one of the most promising directions is embodied AI: robots that can navigate, manipulate objects, and execute instructions.

A robot here does not mean a humanoid with a head and two legs; any tangible intelligent machine that moves through space is a form of embodied AI, whether a self-driving car, a robot vacuum, or a robotic arm in a factory. Just as ImageNet set out to represent the vast, diverse imagery of the real world, research on embodied AI must address the complex diversity of human tasks, from folding laundry to exploring a new city.


Another North Star is visual reasoning, such as understanding three-dimensional relationships in a two-dimensional scene. Even having a robot carry out a seemingly simple command like "bring the cup back to the left side of the cereal bowl" requires visual reasoning. Executing such instructions takes more than vision alone, but vision is an essential part.


Understanding the people in a scene, including their social relationships and intentions, adds yet another level of complexity; this basic social intelligence is also a North Star for computer vision. Seeing a woman with a little girl on her lap, we infer a likely mother-daughter relationship; seeing a man open a refrigerator, we infer he may be hungry. Computers are not yet intelligent enough to make such inferences.


Computer vision, like human vision, requires not only perception but also deep cognition. All of these North Stars are undoubtedly huge challenges, bigger than ImageNet was.

Recognizing a dog or a chair in a picture is one thing; reasoning about and navigating an endless world of people and spaces is another.

But it is a set of challenges well worth pursuing: as computer vision grows more intelligent, the world can become a better place. Doctors and nurses will have a pair of tireless eyes to help them diagnose and treat patients, cars will drive more safely, and robots will venture into disaster zones in place of humans to rescue the trapped and injured.

Scientists, with the help of more powerful intelligent machines, will break through human blind spots, discovering new species and better materials and exploring uncharted territory.

Resources:

https://www.amacad.org/publication/searching-computer-vision-north-stars
