Ng: Farewell, big data

Compiled by Vik and Wang Ye

Andrew Ng is one of the most authoritative scholars in artificial intelligence (AI) and machine learning. Over the past year he has been talking about "data-centric AI," hoping to shift the field's attention from being model-centric to being data-centric.

Recently, in an interview with IEEE Spectrum, he shared his thinking on foundation models, big data, small data, and data engineering, and explained why he launched the "data-centric AI" movement.

"Over the past decade, the architecture of code-neural networks has matured very well. Keeping the neural network architecture fixed and looking for ways to improve the data will be more efficient. ”

Ng said his data-centric view has drawn plenty of criticism, just as he was criticized when he launched the Google Brain project to build large-scale neural networks: the idea is not new, and the direction is wrong. According to Ng, many of the critics are industry veterans.

Regarding small data, Ng believes it can be powerful too: "Just having 50 good examples can be enough to explain to the neural network what you want it to learn."

What follows is the interview, compiled by AI Technology Review without changing its original meaning.

IEEE: The success of deep learning over the past decade has come from big data and big models, but some people argue that this is an unsustainable path. Do you agree?

Ng: Good question.

We've seen the power of foundation models in natural language processing (NLP). Honestly, I'm excited about larger NLP models, and also about building foundation models in computer vision (CV). There is a lot of usable signal in video data, but because of limits on computing performance and the cost of processing video, we haven't been able to build foundation models for it yet.

Big data and big models have been running as deep learning's engine for some 15 years, and that engine still has momentum. That said, there are scenarios where big data doesn't apply and "small data" is the better solution.

IEEE: What do you mean by a foundation model for CV?

Ng: A model that is very large in scale, trained on big data, and then fine-tuned for specific applications when used. The term was coined by friends of mine at Stanford; GPT-3, for example, is a foundation model in NLP. Foundation models offer a new paradigm for developing machine learning applications, which holds great promise, but they also face challenges: how do we ensure that they are reasonably fair and free of bias? These challenges become more apparent as more people build applications on top of foundation models.

IEEE: Where is the opportunity to create a foundation model for CV?

Ng: There are still scalability challenges. Compared with NLP, CV demands far more computing power. If processors were 10 times faster than today's, it would be easy to build vision foundation models trained on 10 times more video data. That said, there are already early signs of foundation models emerging in CV.

Speaking of which, let me add this: over the past decade, deep learning's successes have come mostly from consumer-facing companies, which are characterized by huge amounts of user data. In other industries, where such datasets don't exist, deep learning's "scale paradigm" does not apply.

IEEE: That reminds me that early in your career you worked at consumer-facing companies with millions of users.

Ng: Ten years ago, when I started the Google Brain project and used Google's computing infrastructure to build "big" neural networks, it caused a lot of controversy. An industry veteran quietly told me at the time that starting the Google Brain project would be bad for my career, and that I should focus not just on scale but on architectural innovation.

To this day I remember when my students and I published the first NeurIPS workshop paper advocating the use of CUDA. Another industry veteran advised me that CUDA programming was too complicated, and too much work to serve as a programming paradigm. I tried to convince him otherwise, but I failed.

IEEE: I think now they've all been convinced.

Ng: I think so.

I've been talking about data-centric AI for the past year, and I've come across the same comments as 10 years ago: "Nothing new," "It's the wrong direction."

IEEE: How do you define "data-centric AI" and why do you call it a movement?

Ng: "Data-centric AI" is a systems discipline that aims to focus on the data needed to build AI systems. For AI systems, it is necessary to implement the algorithm in code and then train on the dataset. For the past decade, people have been following the "download datasets, improve the code" paradigm, thanks to which deep learning has been a huge success.

But for many applications, the code, meaning the neural network architecture, is largely a solved problem and is no longer the main pain point. So it is more efficient to keep the neural network architecture fixed and look for ways to improve the data.
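As a minimal sketch of this paradigm (our illustration, using scikit-learn and synthetic data; none of the specifics come from Ng), the model and its hyperparameters stay frozen across iterations, and only the training data changes:

```python
# Sketch: hold the model (the "code") fixed, iterate on the data. Illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Simulate a noisily labeled training set: flip 15% of the labels.
rng = np.random.default_rng(0)
noisy = rng.random(len(y_train)) < 0.15
y_noisy = np.where(noisy, 1 - y_train, y_train)

def train_eval(X_tr, y_tr):
    # The "code" is frozen: the same model and hyperparameters every iteration.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_tr, y_tr)
    return model.score(X_test, y_test)

print("before data work:", train_eval(X_train, y_noisy))

# The "data work": here we simply drop the examples we know are mislabeled.
# In practice this step is label auditing, consistency fixes, targeted collection.
clean = ~noisy
print("after data work: ", train_eval(X_train[clean], y_noisy[clean]))
```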

When I first raised this, many hands went up in agreement: we've been doing this intuitively, by rule of thumb, for 20 years, and it's time to turn it into a systematic engineering discipline.

"Data-centric AI" is far bigger than a company or a group of researchers. When I organized a "Data-Centric AI" workshop with friends at NeurIPS, I was very pleased with the number of authors and speakers present.

IEEE: Most companies have only a small amount of data. How can "data-centric AI" help them?

Ng: You often hear about vision systems built with millions of images; I once built a face recognition system using 350 million images. But architectures built for those scales don't work when you have only 50 images. It turns out that if you have just 50 high-quality images, you can still build something very valuable, such as a defect-inspection system. In many industries big datasets simply don't exist, so I think the focus now has to shift from big data to high-quality data. In fact, 50 good examples can be enough to explain to the neural network what you want it to learn.
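To make the "50 good examples" point concrete, here is a hedged sketch of fine-tuning a pretrained vision model on a tiny dataset with PyTorch and torchvision; the tiny_defects folder and its two classes are hypothetical stand-ins:

```python
# Sketch: fine-tune a pretrained backbone on ~50 labeled images (PyTorch).
import torch
from torch import nn
from torchvision import datasets, models, transforms

# Hypothetical folder layout: tiny_defects/{ok,defect}/*.jpg, ~50 images total.
tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
data = datasets.ImageFolder("tiny_defects", transform=tfm)
loader = torch.utils.data.DataLoader(data, batch_size=8, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in model.parameters():          # freeze the pretrained features;
    p.requires_grad = False           # with 50 images, train only the head
model.fc = nn.Linear(model.fc.in_features, len(data.classes))

opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(20):
    for xb, yb in loader:
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        opt.step()
```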

IEEE: What kind of model can you train with just 50 images? Is it fine-tuning a big model, or a brand-new model?

Ng: Let me talk about Landing AI's work. When doing visual inspection for manufacturers, we often use a pretrained model such as RetinaNet, and pre-training is only a small part of the puzzle. The harder problem is providing tools that let manufacturers pick the right set of images and label them in a consistent way for fine-tuning. This is a very practical problem, whether in vision, NLP, or speech: even expert labelers can disagree on how an example should be labeled. With big data, the common response to inconsistent data is to collect a huge amount of it and let the algorithm average over the noise. But if you can build tools that flag where the data is inconsistent and give you a very targeted way to improve its consistency, that is a more efficient route to a high-performance system.

For example, suppose you have 10,000 images, and among them a group of 30 similar images is labeled inconsistently. One of the things we do is build tools that draw your attention to that inconsistent subset, so you can relabel those images very quickly to make them consistent, which improves performance.
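A toy sketch of that kind of tooling (our illustration; the group_id column, standing for clusters of near-duplicate images, is hypothetical): group the labeled examples and surface every group whose labels disagree for review.

```python
# Sketch: surface groups of similar images whose labels disagree (pandas).
import pandas as pd

# Hypothetical labeling table: image path, cluster of near-duplicates, label.
df = pd.DataFrame({
    "image":    ["img_001.jpg", "img_002.jpg", "img_003.jpg", "img_004.jpg"],
    "group_id": [7, 7, 7, 12],
    "label":    ["scratch", "scratch", "dent", "ok"],
})

# A group is inconsistent if its images carry more than one distinct label.
labels_per_group = df.groupby("group_id")["label"].nunique()
inconsistent = labels_per_group[labels_per_group > 1].index

# These are the images a reviewer should look at first and relabel.
to_review = df[df["group_id"].isin(inconsistent)]
print(to_review)
```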

IEEE: If the data could be engineered better before training, would this focus on high-quality data help with the problem of bias in datasets?

Ng: Quite possibly. Many researchers have pointed out that biased data is one of many factors that lead to biased systems, and there have already been serious efforts at engineering data. At the NeurIPS workshop, Olga Russakovsky gave a great talk on the subject. I also really enjoyed Mary Gray's talk, which made the point that "data-centric AI" is part of the solution, but not the whole solution. New tools like Datasheets for Datasets also seem to be an important piece.

One of the powerful ideas "data-centric AI" gives us is the ability to engineer a subset of the data. Imagine a trained machine learning system that performs well on most of the data but is biased on just one subset. Changing the entire neural network architecture to improve performance on that subset is quite difficult. But if you can engineer just that subset of the data, you can attack the problem in a much more targeted way.
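A minimal sketch of that subset (slice) analysis, with a hypothetical slice tag and random placeholder predictions: score the same trained model separately on each slice to see where it underperforms, then aim the data work there.

```python
# Sketch: per-slice evaluation to find the subset where a model underperforms.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical evaluation table: true label, model prediction, and a slice tag
# (e.g., which production line or lighting condition each image came from).
df = pd.DataFrame({
    "y_true": rng.integers(0, 2, 600),
    "y_pred": rng.integers(0, 2, 600),
    "slice":  rng.choice(["line_A", "line_B", "line_C"], 600),
})

# Accuracy computed separately on each slice of the data.
per_slice = (df["y_true"] == df["y_pred"]).groupby(df["slice"]).mean()
print(per_slice.sort_values())   # the worst slice is where to engineer data
```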

IEEE: What do you mean by data engineering specifically?

Ng: In AI, data cleaning is very important, but it has usually been done by hand. In computer vision, someone might visualize images in a Jupyter notebook to find and fix problems.

But I'm interested in tools that can handle large datasets. Even when labels are noisy, such tools can quickly and effectively draw your attention to the subset of data with problems, or to the one class out of a hundred where collecting more data would actually help. Collecting more data often helps, but collecting more data for everything can be extremely expensive.
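One way such a tool might rank likely label errors, sketched with scikit-learn (this confident-learning-style heuristic is our illustration, not a Landing AI product): score every example with out-of-fold predicted probabilities, and review first the examples whose given label the model finds least plausible.

```python
# Sketch: rank likely label errors by out-of-fold model confidence.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Out-of-fold probabilities: each example is scored by a model not trained on it.
proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                          cv=5, method="predict_proba")

# Low probability assigned to the *given* label => suspicious label.
given_label_conf = proba[np.arange(len(y)), y]
suspects = np.argsort(given_label_conf)[:20]   # 20 most suspicious examples
print("review these indices first:", suspects)
```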

For example, I once found that a speech recognition system performed poorly when there was car noise in the background. Knowing that, I could collect more data with car noise in the background, rather than collecting more data for everything, which would have been expensive and time-consuming.
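A related trick, offered here as our own illustration rather than Ng's prescription: when collecting real recordings is slow, car-noise training data can also be synthesized by mixing recorded noise into clean speech at a chosen signal-to-noise ratio. A minimal NumPy sketch:

```python
# Sketch: mix car noise into clean speech at a target SNR (NumPy).
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech`, scaled so the mix has the given SNR in dB."""
    noise = np.resize(noise, speech.shape)          # loop/trim noise to match
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Toy signals standing in for real recordings (1 s at 16 kHz).
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16000))
car_noise = rng.normal(size=8000)
augmented = mix_at_snr(speech, car_noise, snr_db=5.0)
```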

IEEE: Would it be a good solution to use synthetic data?

Ng: I think synthetic data is an important tool in the "data-centric AI" toolbox. At the NeurIPS workshop, Anima Anandkumar gave a great talk on synthetic data. I think the important uses of synthetic data go beyond a preprocessing step that augments a learning algorithm's dataset. I'd like to see more tools that let developers use synthetic data generation as part of the closed loop of machine learning iteration.

IEEE: Do you mean that synthetic data lets you try a model on more datasets?

Ng: Not really. For example, suppose you want to detect defects in smartphone housings. There are many different types of defects: scratches, dents, pit marks, material discoloration, and so on. If you train a model and find through error analysis that it performs well overall but poorly on pit marks, then synthetic data generation lets you address the problem in a more targeted way: you can generate more data just for the pit-mark category.
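A hedged sketch of generating extra examples for just one weak category (the defects/pit folder is hypothetical), using simple torchvision augmentations; a real pipeline might instead use rendering or a generative model:

```python
# Sketch: targeted augmentation for one underperforming class (torchvision + PIL).
import os
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.3, contrast=0.3),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
])

# Hypothetical folders: only the weak "pit" class gets extra synthetic examples.
src_dir, dst_dir = "defects/pit", "defects_augmented/pit"
os.makedirs(dst_dir, exist_ok=True)
for name in os.listdir(src_dir):
    img = Image.open(os.path.join(src_dir, name)).convert("RGB")
    for i in range(10):                      # 10 variants per original image
        augment(img).save(os.path.join(dst_dir, f"{i}_{name}"))
```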

IEEE: Can you give a concrete example? If a company comes to Landing AI and says it has a problem with visual inspection, how do you convince them? What solution do you offer?

Ng: Synthetic data generation is a very powerful tool, but I usually try many simpler tools first: for example, data augmentation, improving label consistency, or simply asking the manufacturer to collect more data.

When a customer comes to us, we usually start by discussing their inspection problem and looking at some images to verify that the problem is feasible with computer vision. If it is, we ask them to upload the data to the LandingLens platform. We usually advise them on the "data-centric AI" methodology and help them label the data.

One of Landing AI's priorities is to enable manufacturing companies to do the machine learning work themselves, so a lot of our work goes into making the software easy to use. Throughout the iterations of machine learning development, we advise customers on how to train models on the platform, how to improve data labeling to improve model performance, and so on. Our training and our software support them through this process, right up until the trained model is deployed to edge devices in the factory.

IEEE: How do you deal with changing requirements? If the product changes or the factory's lighting conditions change, can the model adapt?

Ng: It varies from manufacturer to manufacturer. Data drift exists in many cases, but there are also manufacturers who have been running the same production line for 20 years with almost no changes and don't expect changes in the next five years; a stable environment makes everything easier. For other manufacturers, we provide tools that flag significant data-drift problems. I've found it really important to enable manufacturing customers to correct data, retrain, and update models on their own. If something changes at, say, 3 a.m. U.S. time, I want them to be able to update their learning algorithm immediately and keep things running.
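A minimal sketch of the kind of drift check such a tool might run (our illustration; the threshold is arbitrary): compare a feature's distribution in a reference window against recent production data with a two-sample Kolmogorov-Smirnov test from SciPy.

```python
# Sketch: flag data drift with a two-sample KS test (SciPy).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 5000)    # e.g., feature values at deploy time
production = rng.normal(0.4, 1.0, 5000)   # e.g., the same feature this week

stat, p_value = ks_2samp(reference, production)
if p_value < 0.01:                        # illustrative threshold
    print(f"drift detected (KS={stat:.3f}); consider relabeling and retraining")
```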

In consumer internet software, we could train a handful of machine learning models to serve a billion users. In manufacturing, you might have 10,000 manufacturers, each needing its own custom AI model. The challenge is: how can Landing AI do this without hiring 10,000 machine learning experts?

IEEE: So to improve quality, you have to empower users to train their own models?

Ng: Yes, exactly! This is an industry-wide AI problem, not just a manufacturing one. In healthcare, for example, every hospital's electronic medical records have a slightly different format; how should each hospital train and customize its own AI model? Expecting every hospital to reinvent neural network architectures is unrealistic. The way out is to build tools that let users create their own models by engineering the data and expressing their domain knowledge.

IEEE: Is there anything else you'd like readers to know?

Ng: The biggest shift in AI over the last decade was the shift to deep learning, and in the next decade I think there will be a shift toward being data-centric. As neural network architectures mature, the bottleneck for many practical applications will be how to obtain and engineer the data they need. The data-centric AI movement has tremendous energy and potential in the community, and I hope more researchers will join it!

Source: Leifeng Network
