
The future is here: how data is driving the competitive advantage of big AI models

Author: MobotStone

With the rapid development of artificial intelligence, the importance of high-quality data has become increasingly obvious. Take large language models as an example: their leaps in capability in recent years rely heavily on large, high-quality training datasets. Compared with GPT-2, GPT-3 made only minor changes to the model architecture; far more effort went into collecting a larger, higher-quality dataset for training. Likewise, ChatGPT uses essentially the same architecture as GPT-3, but relies on RLHF (reinforcement learning from human feedback) to generate high-quality labeled data for fine-tuning.
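
The RLHF step mentioned above hinges on turning human feedback into training signal. As a rough, hypothetical illustration only (not OpenAI's actual pipeline), the sketch below trains a toy reward model on made-up pairwise preference data with a pairwise logistic loss, using plain NumPy; the feature vectors and the hidden "preference direction" are fabricated for the example.

```python
import numpy as np

# Toy data: each pair is (features of the preferred answer, features of the rejected answer).
# In real RLHF these pairs would come from human labelers ranking model outputs.
rng = np.random.default_rng(0)
dim = 8
true_w = rng.normal(size=dim)  # hidden "human preference" direction (toy assumption)
pairs = []
for _ in range(500):
    a, b = rng.normal(size=dim), rng.normal(size=dim)
    # The simulated labeler prefers whichever answer scores higher under the hidden preference.
    pairs.append((a, b) if a @ true_w > b @ true_w else (b, a))

# Reward model: a linear scorer r(x) = w . x, trained with the pairwise logistic loss
#   loss = -log sigmoid(r(preferred) - r(rejected))
w = np.zeros(dim)
lr = 0.1
for _ in range(200):
    grad = np.zeros(dim)
    for preferred, rejected in pairs:
        margin = (preferred - rejected) @ w
        p = 1.0 / (1.0 + np.exp(-margin))            # probability the model agrees with the labeler
        grad += (p - 1.0) * (preferred - rejected)   # gradient of the logistic loss
    w -= lr * grad / len(pairs)

agree = float(np.mean([(pref - rej) @ w > 0 for pref, rej in pairs]))
print(f"reward model agrees with human preferences on {agree:.0%} of pairs")
```

In a full RLHF pipeline, a reward model like this would then guide fine-tuning of the language model with a policy-optimization method; the point here is only that human feedback itself becomes labeled training data.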


Recognizing this trend, Andrew Ng, an authoritative scholar in the field of artificial intelligence, launched the "data-centric AI" movement: a new approach that advocates improving overall model performance by improving the quality and quantity of data while keeping the model architecture relatively fixed. This includes adding data labels, cleaning and transforming data, data reduction, increasing data diversity, and continuously monitoring and maintaining data. As a result, the share of data costs (data collection, cleaning, annotation, etc.) in large-model development is likely to grow.
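
To make the "fix the data, not the model" idea concrete, here is a minimal, hypothetical sketch of a few of the operations listed above (deduplication, dropping incomplete records, and monitoring label diversity) over a toy list of labeled examples; the field names and records are illustrative, not from any specific pipeline.

```python
from collections import Counter

# Toy labeled dataset; in practice these records would come from collection and annotation.
records = [
    {"text": "great product, works well", "label": "positive"},
    {"text": "great product, works well", "label": "positive"},   # exact duplicate
    {"text": "stopped working after a week", "label": "negative"},
    {"text": "", "label": "negative"},                            # missing text
    {"text": "does what it says", "label": None},                 # missing label
]

def clean(records):
    """Data cleaning: drop incomplete records and exact duplicates."""
    seen, cleaned = set(), []
    for r in records:
        if not r["text"] or r["label"] is None:
            continue                       # incomplete record
        key = (r["text"], r["label"])
        if key in seen:
            continue                       # duplicate record
        seen.add(key)
        cleaned.append(r)
    return cleaned

def label_distribution(records):
    """Monitoring / diversity check: a heavily skewed label distribution signals overfitting risk."""
    counts = Counter(r["label"] for r in records)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

cleaned = clean(records)
print(len(records), "->", len(cleaned), "records after cleaning")
print("label distribution:", label_distribution(cleaned))
```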

The datasets required by large AI models should have the following characteristics:

1) High quality: A high-quality dataset improves the accuracy and interpretability of the model, and also shortens the time it takes the model to converge to a good solution, i.e., the training time.

2) Large scale: In "Scaling Laws for Neural Language Models", OpenAI described scaling laws for language models: as the amount of training data, the number of model parameters, or the amount of training compute is increased independently, the performance of the pre-trained model keeps improving in a predictable way (see the formulas after this list).

3) Diversity: Diverse data helps improve the model's generalization ability; overly homogeneous data may cause the model to overfit the training set.
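
For reference, the scaling-law paper expresses these trends as power laws in which the pre-training loss falls smoothly as parameters, data, or compute grow; the forms and approximate exponents below are the empirical fits reported by Kaplan et al. (2020):

```latex
% Approximate power-law fits from Kaplan et al., "Scaling Laws for Neural Language Models" (2020)
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C_{\min}) = \left(\frac{C_c}{C_{\min}}\right)^{\alpha_C}
% N = model parameters, D = training tokens, C = training compute.
% The fitted exponents are small: roughly \alpha_N \approx 0.076, \alpha_D \approx 0.095,
% \alpha_C \approx 0.050, so loss keeps improving as each resource is scaled up on its own.
```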


Dataset generation and processing

The dataset creation process mainly includes the following steps:

  • Data collection: The collected data may include video, images, audio, and text of various types and formats. Common collection methods include system log collection, web/network data collection, and ETL.
  • Data cleaning: Data cleaning is particularly important because the collected data may have quality problems such as missing values, noise, and duplicates. As a crucial step in data preprocessing, the quality of the cleaned data largely determines the effectiveness of AI algorithms.
  • Data labeling: This is the most important part of the process. An administrator divides the data to be labeled into different labeling tasks according to the labeling requirements. Each task has its own specifications and labeling-point requirements, and a single task is assigned to multiple labelers to complete (see the sketch after this list).
  • Model training: Model trainers will use the labeled data to train the required algorithm model.
  • Model testing: Testers evaluate the models and feed the results back to the trainers, who keep adjusting parameters to obtain better-performing models.
  • Product evaluation: Product evaluators repeatedly verify the model's labeling performance and assess whether the model meets the release criteria. Only a model that has passed product evaluation can be considered truly acceptable.
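
Because a single labeling task is distributed to several labelers, their answers must be reconciled before training. Below is a minimal, hypothetical sketch of one common approach (majority voting, plus flagging low-agreement items for re-labeling); the annotations and threshold are illustrative, not from any particular platform.

```python
from collections import Counter

# Hypothetical annotations: item_id -> labels assigned by different labelers.
annotations = {
    "item-1": ["cat", "cat", "cat"],
    "item-2": ["dog", "cat", "dog"],
    "item-3": ["cat", "dog", "bird"],   # low agreement: send back for review
}

def aggregate(labels, min_agreement=2 / 3):
    """Majority vote with an agreement threshold for quality control."""
    winner, count = Counter(labels).most_common(1)[0]
    agreement = count / len(labels)
    return (winner, agreement) if agreement >= min_agreement else (None, agreement)

for item_id, labels in annotations.items():
    label, agreement = aggregate(labels)
    if label is None:
        print(f"{item_id}: agreement {agreement:.0%} too low, route back to labelers")
    else:
        print(f"{item_id}: final label '{label}' (agreement {agreement:.0%})")
```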

However, despite China's abundant data resources, high-quality Chinese datasets are still scarce, owing to factors such as insufficient data mining and the inability of data to circulate freely in the market. According to statistics, Chinese text accounts for less than one-thousandth of ChatGPT's training data, while English accounts for more than 92.6%. In addition, research by the University of California and Google Research found that 50% of the datasets currently used by machine learning and natural language processing models are provided by 12 top institutions, of which 10 are American, one is German, and only one is from China, namely The Chinese University of Hong Kong.

We believe that the main reasons for the lack of high-quality datasets in China are as follows:

  • High-quality datasets require huge capital investment, but the current domestic investment in data mining and data governance is insufficient.
  • Domestic companies often lack open-source awareness, so data cannot circulate freely in the market.
  • Relevant domestic companies were founded relatively late and have accumulated less data than their foreign counterparts.
  • In academia, Chinese-language datasets receive relatively little attention.
  • Domestic datasets have relatively low market influence and visibility.

At present, leading domestic Internet technology companies mainly train large models on public data plus their own proprietary data. For example, the proprietary data used by Baidu's "Wenxin" model mainly includes trillions of web pages, billions of search queries, and image data. The training data of Alibaba's "Tongyi" model mainly comes from Alibaba DAMO Academy. The proprietary training data of Tencent's "Hunyuan" model mainly comes from high-quality sources such as WeChat Official Accounts and WeChat Search. In addition to public data, the training data of Huawei's Pangu model is supplemented by business-side (B-end) industry data, covering meteorology, mining, railways, and other industries. The training data of SenseTime's "Ririxin" (SenseNova) model includes the self-built OmniObject3D multimodal dataset.

China's data environment and future

Despite the shortcomings of the current situation, China's data environment still has huge potential. First, China has the world's largest population of Internet users, and the huge volume of data generated every day provides the basis for building large-scale, high-quality datasets. Second, the Chinese government attaches great importance to AI and data governance, and both policy support and capital investment provide favorable conditions for improving and developing the data environment.

In the future, China needs to make efforts in the following aspects:

  1. Establish a data acquisition and cleaning system: Establish a complete data collection and cleaning system to ensure the quality and validity of data and provide a reliable data basis for subsequent model training.
  2. Improve the accessibility and usability of open data: Encourage companies, research institutions, and other organizations to open up their data and allow it to circulate freely in the market.
  3. Increase investment in data labeling: Reduce labeling costs by improving labeling efficiency and quality, so as to obtain more and higher-quality annotation data.
  4. Train more data scientists and AI engineers: Increase the number and quality of data scientists and AI engineers through education and training to promote AI research and application in China.
  5. Strengthen data cooperation at home and abroad: Through data cooperation, learn from successful foreign experiences, improve technologies and methods in data collection, processing and use, so as to enhance the quality and value of Chinese data.

Data is the "fuel" of AI models, and future competition among large AI models will undoubtedly depend even more on high-quality data. Therefore, how data is invested in and utilized will determine China's position and achievements in the global AI competition.