90% of papers are model-centric. In AI, which matters more: data or models?

2022-02-19 12:18:46

Selected from neptune.ai

Author: Harshil Patel

Compiled by Machine Heart

In machine learning, which matters more, the data or the model? This is a difficult question to answer.

Models and data are the foundation of AI systems, and both play an important role in how those systems are developed.

Andrew Ng, one of the most authoritative scholars in artificial intelligence, once proposed that "80% data + 20% model = better machine learning." He believes that 80% of a team's work should go into data preparation: data quality matters, yet few people pay attention to it. If the field put more emphasis on being data-centric rather than model-centric, machine learning would advance faster.

This raises the question of whether advances in machine learning come from models or from data. There is no clear answer.

In this article, Harshil Patel, an Android developer and machine learning enthusiast, compares data-centric and model-centric machine learning to determine which of the two matters more, and explains how to use a data-centric infrastructure.

Data-centric approach VS Model-centric approach

A model-centric approach means running experiments to improve the performance of the machine learning model, which involves choosing the model architecture and the training process. In a model-centric approach, you keep the data fixed and improve performance by improving the code and the model architecture. Improving the code is the central goal of the model-centric approach.

Currently, most AI applications are model-centric. One possible reason is that the AI field pays close attention to academic research on models. According to Ng, more than 90% of research papers in AI are model-centric, because it is difficult to create large datasets that can become accepted standards. As a result, the AI community sees model-centric machine learning as more promising, and researchers tend to overlook the importance of the data while focusing on the model.

For researchers, data is at the heart of every decision-making process. Data-centric companies can get more accurate, organized, and transparent results by using the information generated by their operations, which helps them run more smoothly. The data-centric approach involves systematically improving and refining datasets to raise the accuracy of ML applications; working on the data is the central goal of the data-centric approach.

Data-driven VS data-centric

Many people confuse the concepts of "data-centric" and "data-driven". Data-driven is a method of collecting, analyzing, and extracting insights from data, sometimes referred to as "analytics." A data-centric approach, on the other hand, focuses on using data to define what should be created first. A data-centric architecture refers to a system in which data is a primary and permanent asset, whereas a data-driven architecture means creating technologies, skills, and environments by leveraging large amounts of data.

For data scientists and machine learning engineers, a model-centric approach seems to be more popular. This is because practitioners can use their knowledge to solve specific problems. On the other hand, no one wants to spend a lot of time labeling data.

However, although data is critical in today's machine learning, it is often overlooked and mishandled as AI evolves. Because of data errors, researchers can spend a lot of time troubleshooting. The root cause of a model's low accuracy may lie not in the model itself but in a flawed dataset.

Alongside the data, models and code also matter. But researchers tend to focus on the model while ignoring the importance of the data. The best approach is a hybrid one that attends to both data and models, with the balance between the two depending on the application.

Data-centric infrastructure

Model-centric machine learning systems focus on optimizing the model architecture and its parameters.

Model-centric ML applications

The model-centric workflow depicted in the preceding diagram is suitable for a small number of industries, such as media and advertising. Industries such as healthcare or manufacturing, however, may face the following challenges:

Highly customized systems are required: Unlike companies in the media and advertising industries, many businesses can't use a single machine learning system to detect production failures across their products. While media companies can afford a complete ML department to handle optimization issues, manufacturing enterprises that require multiple ML solutions cannot follow such a template;

The importance of large data sets: In most cases, companies don't have a lot of data available. Instead, they are often forced to work with tiny data sets that can easily produce disappointing results if their approach is model-centric.

In his AI talk, Ng explained why he believes data-centric ML is more valuable and advocated for the community to move in a data-centric direction. He gave the example of steel defect detection, where a model-centric approach failed to improve the model's accuracy, while a data-centric approach increased accuracy by 16%.

Data-centric ML applications

When implementing a data-centric architecture, you can think of data as a fundamental asset that is more durable than applications and infrastructure. Data-centric ML makes data sharing and movement simple. So, what exactly is involved in data-centric machine learning? When implementing a data-centric approach, we should consider the following factors:

Data label quality: When a large number of images are incorrectly labeled, unexpected errors will occur, so the quality of data labeling needs to be improved;

Data augmentation: generate more data from limited data, increasing the number and diversity of training samples (for example, by adding noise) and improving model robustness;

Feature engineering: Adding features to a model by transforming the input data, using prior knowledge, or applying algorithms. It is often used in machine learning to help improve the accuracy of predictive models;

Data versioning: By comparing two versions of a dataset, developers can track down errors and spot meaningless content. Data versioning is one of the most indispensable steps in maintaining data: it helps researchers track changes to datasets (additions and deletions) and makes code collaboration and dataset management easier;

Domain knowledge: Domain knowledge is very valuable in a data-centric approach. Domain experts can often detect subtle differences that ML engineers, data scientists, and annotators cannot, yet the involvement of domain experts is still missing from many ML systems. If additional domain knowledge is available, ML systems may perform better.
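The data versioning factor above can be made concrete with a small example. The following is a minimal sketch, not a replacement for a dedicated versioning tool: it assumes a dataset is a list of records, fingerprints each record by a content hash, and reports the additions and deletions between two versions. The function names and the toy records are hypothetical.

```python
import hashlib
import json


def record_id(record):
    """Stable content hash of one record (keys are sorted so key order does not matter)."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def diff_versions(old, new):
    """Return the records added to and removed from a dataset between two versions."""
    old_by_id = {record_id(r): r for r in old}
    new_by_id = {record_id(r): r for r in new}
    added = [new_by_id[k] for k in new_by_id if k not in old_by_id]
    removed = [old_by_id[k] for k in old_by_id if k not in new_by_id]
    return {"added": added, "removed": removed}


# Hypothetical toy data: version 2 corrects one label and adds one image.
v1 = [
    {"image": "img_001.png", "label": "scratch"},
    {"image": "img_002.png", "label": "dent"},
]
v2 = [
    {"image": "img_001.png", "label": "scratch"},
    {"image": "img_002.png", "label": "scratch"},  # relabeled
    {"image": "img_003.png", "label": "dent"},     # new sample
]

changes = diff_versions(v1, v2)
print(len(changes["added"]), "added,", len(changes["removed"]), "removed")
```

In this sketch a relabeled record counts as one deletion plus one addition, because its content hash changes. That is a deliberate simplification; a real tool would usually key records by a stable identifier and report the label change directly.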

Which one should be prioritized: data quantity or data quality?

It should be emphasized that a large amount of data does not equate to good data quality. Of course, a neural network cannot be trained on just a few images, so the amount of data does matter, but the focus now is on quality rather than quantity.

As you can see in the image above, most Kaggle datasets aren't that large. In a data-centric approach, the size of the dataset matters less, and a lot can be done with a small, high-quality dataset. What matters is that the data quality is high and the labeling is correct.

The image above shows two ways to label data: labeling each object individually or labeling objects as a group. For example, if Data Scientist 1 labels pineapples individually and Data Scientist 2 labels them as a group, the two sets of labels are incompatible, which confuses the learning algorithm. You therefore need to keep the labels consistent: if objects are to be labeled individually, make sure that every annotator labels them the same way.

Above, Ng explains the importance of consistency in small data sets
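Label consistency of this kind can be checked mechanically before training. The following is a minimal sketch, assuming each annotator's work is summarized as the number of bounding boxes drawn per image; the function name, image IDs, and counts are all hypothetical.

```python
def find_label_conflicts(labels_a, labels_b):
    """Return the IDs of items that both annotators labeled, but labeled differently."""
    shared = labels_a.keys() & labels_b.keys()
    return sorted(item for item in shared if labels_a[item] != labels_b[item])


# Hypothetical annotations: number of bounding boxes drawn per image.
# Annotator 1 boxes each pineapple separately; annotator 2 sometimes
# draws a single box around a whole group.
annotator_1 = {"img_01": 3, "img_02": 1, "img_03": 2}
annotator_2 = {"img_01": 1, "img_02": 1, "img_03": 2}

conflicts = find_label_conflicts(annotator_1, annotator_2)
shared_count = len(annotator_1.keys() & annotator_2.keys())
agreement = 1 - len(conflicts) / shared_count
print(conflicts, round(agreement, 2))
```

The conflicting items are the ones worth sending back for relabeling under a single agreed convention, rather than letting both versions reach the training set.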

How much data do you really need?

Data quality cannot be ignored, but the amount of data is also crucial: researchers must have enough data to solve the problem. Deep networks have low bias and high variance, and more data can be expected to reduce the variance problem. But how much data is enough? This is difficult to answer at the moment. Having a lot of data is an advantage, but it is not a necessity.
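The link between more data and lower variance can be illustrated with a toy experiment. The sketch below does not use a deep network: it assumes a flexible piecewise-constant model (the mean label in each of 20 bins), a made-up target y = x^2 with Gaussian noise, and hypothetical function names. It measures how much the model's prediction at one point changes across freshly sampled training sets.

```python
import random
import statistics


def fit_bin_means(xs, ys, n_bins):
    """Fit a flexible piecewise-constant model: the mean label in each bin of [0, 1)."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    # Bins with no samples fall back to the global mean label.
    global_mean = sum(ys) / len(ys)
    return [s / c if c else global_mean for s, c in zip(sums, counts)]


def prediction_variance(n_samples, n_bins=20, n_trials=100, seed=0):
    """Variance of the prediction at x = 0.5 across independently drawn training sets."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_trials):
        xs = [rng.random() for _ in range(n_samples)]
        ys = [x * x + rng.gauss(0, 0.3) for x in xs]  # assumed toy target plus noise
        means = fit_bin_means(xs, ys, n_bins)
        preds.append(means[int(0.5 * n_bins)])
    return statistics.variance(preds)


var_small = prediction_variance(n_samples=40)
var_large = prediction_variance(n_samples=4000)
print(var_small > var_large)
```

With 40 samples each bin holds only about two points, so the prediction swings with the noise; with 4,000 samples each bin averages about 200 points and the prediction is far more stable. The same model becomes more reliable purely because it sees more data.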

If you take a data-centric approach, keep the following in mind:

Ensure data consistency throughout the ML project lifecycle;

Consistent data annotation;

Timely feedback on the results;

Conduct error analysis;

Eliminate noise samples.
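The error analysis step in this checklist can be sketched as follows. This is a minimal example under stated assumptions: each evaluation example carries a hypothetical "slice" tag (here, lighting conditions), and the labels, predictions, and function name are made up for illustration.

```python
from collections import defaultdict


def error_rate_by_slice(examples):
    """Group evaluation examples by their slice tag and return each slice's error rate."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for ex in examples:
        totals[ex["slice"]] += 1
        if ex["predicted"] != ex["label"]:
            errors[ex["slice"]] += 1
    return {s: errors[s] / totals[s] for s in totals}


# Hypothetical evaluation results for a defect detector.
results = [
    {"slice": "bright", "label": "defect", "predicted": "defect"},
    {"slice": "bright", "label": "ok", "predicted": "ok"},
    {"slice": "dim", "label": "defect", "predicted": "ok"},
    {"slice": "dim", "label": "ok", "predicted": "ok"},
    {"slice": "dim", "label": "defect", "predicted": "ok"},
    {"slice": "dim", "label": "defect", "predicted": "defect"},
]

rates = error_rate_by_slice(results)
worst = max(rates, key=rates.get)
print(worst, rates[worst])
```

The slice with the highest error rate is where relabeling, collecting more samples, or removing noisy samples is most likely to pay off, which is how error analysis feeds back into the data rather than the model.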

So, where can we find high-quality datasets? Here are a few recommended sites.

Kaggle: On Kaggle, you'll find the code and data you need to do your data science work. Kaggle has over 50,000 public datasets and 400,000 public notebooks to help you get things done quickly.

Datahub.io: DataHub is a dataset platform that focuses primarily on business and finance. Many datasets, such as lists of countries, populations, and geographic boundaries, are currently available on DataHub.

Finally, Graviti Open Datasets: Graviti is a new data platform that primarily provides high-quality datasets for computer vision. Individual developers or organizations can easily access, share, and better manage open data.
