
AI Contract Theory (7): Toxic Data Threatens the Large-Model Data Pool; How Can AI Training Guard Against New "Trojan Horses"?

Author: 21st Century Business Herald

Southern Finance All Media reporter Wu Liyang and 21st Century Business Herald reporters Zheng Xue and Wang Jun, reporting from Shanghai and Beijing

Editor's note:

In the opening months of 2023, major companies raced to deploy large models, GPT's commercialization was explored at pace, and computing-power infrastructure boomed. Much like the Age of Exploration that began in the 15th century, human exchange, trade, and wealth have exploded as a revolution in space sweeps the world. But change also brings challenges to order: data leakage, risks to personal privacy, copyright infringement, false information... Beyond these, the post-humanist crisis brought by AI is already on the table. With what posture should people face the confusions born of the blending of human and machine?

At this moment, seeking consensus on AI governance and reshaping a new order have become issues common to all countries. The Nancai Regtech Research Institute will launch a series of reports on AI contract theory, analyzing Chinese and foreign regulatory models, the allocation of responsibility among entities, corpus-data compliance, AI ethics, and industrial development, in the hope of offering ideas for AI governance and ensuring responsible innovation.

With the rapid development of the AI industry, the datasets used for training are growing exponentially. The experience and culture that humanity has accumulated over its long history are being rapidly absorbed by this emerging intelligent form; the accretion of ages has become the soil in which future technology is cultivated, and the bright stars of knowledge illuminate civilization's past, present, and future all at once.

As more and more data is learned, and even understood, by AI, people are pleased to see its intelligence and capability changing at a pace far faster than that of any known natural or human creation, yet this also carries society toward unknown worries.

What is hard to fully confirm is whether, when the data and text ingested by AI are manually screened and cleaned, the systemic malice and bias that are likewise rooted in history can truly be erased. One cannot help asking whether AI, seemingly of unlimited potential, should also be instilled with the spiritual and moral laws of human society as it gazes up at the starry sky of knowledge accumulated since antiquity.

Goodwill, virtue, and law: when we trace AI's generation and growth back to their source and ask what shapes this form of intelligence, the answer still seems to lie in the data that humans produce and process. How rules are established in the construction and use of that data also reveals how we actually deal with AI, as content, tool, or partner, and how we and it will shape each other in the future.

Compared with privacy and copyright issues, the influence of the data itself on AI seems harder to control: on the one hand, the black box of AI training and content generation makes it difficult to trace the source of an output; on the other, the moral standards of human society have not been fully internalized into AI's operating mechanisms, and the ethical and safety problems this causes often attract wide public attention. This article focuses on how the cleaning and labeling process affects model quality, and on how database risks such as toxic data can be prevented and managed.

Cleaning and labeling

China's "war of a hundred models" is intensifying, and high-quality, large-scale, and richly varied datasets, the fuel of large models, have become indispensable to the competition.

Where do datasets come from? Take ChatGPT, the overseas large model that set off the AI boom, as an example: its training data falls into six categories, namely Wikipedia, books, journals, Reddit links, Common Crawl, and other datasets. The datasets of domestic large models mostly come from three sources: data accumulated by the vendors themselves, data crawled from public channels, and various free or paid third-party databases and datasets.

The most critical part of a dataset is data that is highly relevant to the model's task, diverse, and of high quality. Because collected data may contain gaps, noise, and duplicates, this mass of raw data cannot be fed directly into a large model; it must first be cleaned, annotated, and otherwise processed into a usable dataset, which is then combined with algorithms, computing power, and the rest before the large model can truly be put to work.

Take GPT-3 as an example: its raw data volume was 45 TB, while the high-quality data left after cleaning was 570 GB, meaning only roughly 1% of the raw data ended up in the corpus.

What stages does data pass through on its way to becoming a corpus?

Cleaning is essential. Gu Dujuan, director of NSFOCUS's Tianshu Lab, told reporters that data cleaning means removing noisy data and meaningless information from text so that only data useful to the task remains; it generally covers deduplication, error correction, removal of abnormal data, and standardization of data formats.
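To make those steps concrete, here is a minimal sketch of such a cleaning pass; the thresholds, the `clean_corpus` helper, and the filtering rules are illustrative assumptions, not any vendor's actual pipeline.

```python
import re
import unicodedata

def clean_corpus(texts, min_length=20):
    """Illustrative cleaning pass: standardize formats, drop noise, deduplicate."""
    seen, cleaned = set(), []
    for text in texts:
        # Format standardization: normalize unicode and collapse whitespace.
        text = unicodedata.normalize("NFKC", text)
        text = re.sub(r"\s+", " ", text).strip()
        # Abnormal-data removal: drop fragments that are too short
        # or consist mostly of non-alphanumeric noise.
        if len(text) < min_length:
            continue
        if sum(c.isalnum() or c.isspace() for c in text) / len(text) < 0.7:
            continue
        # Deduplication: keep only the first occurrence of identical text.
        if text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned

print(clean_corpus(["Hello   world, this is a sample document.",
                    "Hello world, this is a sample document.",
                    "@@@###!!!"]))
```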

Page parsing, that is, structuring unstructured data, is the first step of data cleaning. "Take crawled web-page data as an example: technical staff need to extract the useful information from the original text, such as the page title, the page body, and image captions. For data that is already structured, filtering measures such as various kinds of anti-spam detection should be applied; what comes out of cleaning is basically usable data," an engineer working on algorithms told reporters.

In his view, data cleaning comes down to two ideas: pushing garbage data out, and pulling high-quality data out of the mass. "On this basis, large-model training also does some related cleaning, which may target specific fields such as the humanities or history, as well as identifying and capturing specific high-quality text," the algorithm engineer added.
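A minimal sketch of the page-parsing step the engineer describes might look like the following, assuming the BeautifulSoup library is available; the field names and the tags stripped out are illustrative choices, not a standard recipe.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def parse_page(html):
    """Turn an unstructured web page into structured fields ready for cleaning."""
    soup = BeautifulSoup(html, "html.parser")
    # Strip markup that rarely carries training value.
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    return {
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "body": " ".join(p.get_text(strip=True) for p in soup.find_all("p")),
        "image_captions": [img.get("alt", "") for img in soup.find_all("img")],
    }

html = "<html><head><title>Example</title></head><body><p>Body text.</p><img alt='A cat'></body></html>"
print(parse_page(html))
```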

Labeling is equally important.

"Text data annotation is divided into entity recognition, relationship extraction, event extraction, part-of-speech tagging, sentiment analysis, syntactic analysis and other types in natural language tasks, depending on the model task." Gu Dujuan introduced.

Unlike traditional deep learning, which relies on manual labeling, the data required by today's large models cannot be labeled entirely by hand; much of the work is done by algorithms. The algorithm engineer told reporters that, by rough estimate, a considerable share of a large-model team works on data cleaning and annotation, and that this work runs through the entire life of a large model.
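As a toy illustration of pushing labeling from people to programs, the sketch below uses simple keyword rules in place of a trained labeling model; real pipelines typically combine model predictions with such heuristics and spot-check the results by hand.

```python
import re

def label_sentiment(text):
    """Toy automatic labeler: keyword rules stand in for a trained model."""
    positive = {"excellent", "reliable", "improved"}
    negative = {"toxic", "broken", "biased"}
    words = set(re.findall(r"[a-z]+", text.lower()))
    if words & negative:
        return "negative"
    if words & positive:
        return "positive"
    return "neutral"

corpus = ["The cleaned corpus looks excellent.", "The scraped data is toxic."]
print([(t, label_sentiment(t)) for t in corpus])
```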

Data "anti-virus"

Cleaning and labeling are the basic processes for building large-model datasets and an important threshold for data quality. But as the volume of data needed for AI training expands rapidly, and especially as more and more AI systems connect to the Internet, hidden dangers such as toxic data have begun to pose serious threats to AI reliability and even compliance.

Even before ChatGPT arrived, data poisoning had drawn wide attention from AI developers. Whether the underground industry uses toxic data to degrade a machine-learning model's overall reliability or to push its output toward a particular bias, as AI applications deepen in finance, health care, education, and other fields, toxic data buried at the training stage may cause increasingly concrete harm.

In terms of attack methods, injecting toxic data into a database and modifying entries in an existing dataset are both possible forms of data poisoning. The former does not need to touch much data: studies have shown that altering as little as 0.00025% of the data (for example, mixing other images in among apple images while still labeling them "apple") is enough to deceive the AI. The latter is harder to identify and troubleshoot.
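A minimal sketch of the first kind of attack, flipping labels on a tiny slice of a toy classification set, is shown below; the dataset, the poisoning fraction, and the `poison_labels` helper are purely illustrative and not a reproduction of the study cited above.

```python
import random

def poison_labels(dataset, target_label, fraction=0.001, seed=0):
    """Flip a tiny fraction of labels toward an attacker-chosen class.

    `dataset` is a list of (features, label) pairs; even a very small
    `fraction` can bias what the downstream model learns.
    """
    rng = random.Random(seed)
    poisoned = list(dataset)
    n_poison = max(1, int(len(poisoned) * fraction))
    for i in rng.sample(range(len(poisoned)), n_poison):
        features, _ = poisoned[i]
        poisoned[i] = (features, target_label)  # e.g. relabel a non-apple image as "apple"
    return poisoned

clean = [([0.1 * i], "not_apple") for i in range(10_000)]
dirty = poison_labels(clean, target_label="apple")
print(sum(label == "apple" for _, label in dirty), "poisoned samples out of", len(dirty))
```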

Chris Anley, chief scientist at NCC Group, has pointed out that hackers can mix carefully crafted bad data in with the normal data fed to an AI, increasing the likelihood that the application ends up with a "backdoor".

In addition, different large models may differ in how well they can identify and defend against toxic data because of their different data sources. For models that learn from closed databases, a high-accuracy cleaning and annotation process can largely keep toxic data out; but for models that must update in real time, or even connect to Internet databases, high-frequency data flows make it easier for toxic data to seep into the AI's iteration and generation process.

Gu Dujuan pointed out that base models draw mostly on broad, general-purpose corpora, while large models for vertical domains focus on professional data from specific fields. Different models therefore differ in their data sources and in the channels their corpus data comes through, which introduces variables into the overall accuracy of the data.

Notably, several industry insiders told reporters that in this period of rapid AI development it is difficult to supervise toxic data at its source; controlling inputs and outputs is more feasible, but that approach also faces problems such as difficult tracing and lagging remediation.

"At present, enterprises that carry out large model development work often have a wide range of data sources, and it is difficult to have a unified high-standard scheme for self-accumulated data and externally obtained data to completely exclude toxic data. An artificial intelligence architect of a large Internet factory told reporters.

However, he also pointed out that the "emergent" character of large-model performance and the so-called "AI hallucination" problem seen in the recent boom both show that simply supervising AIGC output offers limited control: "The current compromise is to impose clearer limits on usage scenarios and to confine the output content and format to a certain range, so that the AIGC process stays relatively controllable."

Zhang Wei, a partner for Greater China cybersecurity and privacy protection advisory at Ernst & Young (China) Consulting Co., Ltd., told reporters that rather than tracing back through the data after a compliance incident, the better approach is to build compliance management into every link of the AI research and development stage.

"AI R&D involves many small business processes, and code, transmission, application and other levels need to have corresponding detection methods to ensure reliable sources and processes. For example, when using an open source database, it is verified whether it is certified, whether it is maintained out of the community, whether the code has been checked, etc., compared with post-training optimization, compliance control before training and during training is more feasible. Zhang Wei said.

On the other hand, for compliance issues running from data all the way to output, manufacturers at different points in the industry chain are developing their own solutions. In April, NVIDIA announced on its official website that it would open-source NeMo Guardrails to help applications built on ChatGPT and other similar large language models set up safety guardrails and reduce the output of illegal, discriminatory, unethical, and other problematic content.

According to NVIDIA, NeMo Guardrails helps developers improve the safety of applications backed by large language models, and covers code, examples, documentation, monitoring, and safety information filtering.
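As a rough illustration of how such guardrails are wired up, the sketch below follows the usage pattern described in the project's documentation; the exact class names, model settings, and Colang rules should be read as an unverified, version-dependent assumption rather than a definitive example.

```python
# pip install nemoguardrails -- a sketch following the project's documented usage;
# exact APIs and configuration keys may differ between versions.
from nemoguardrails import LLMRails, RailsConfig

yaml_content = """
models:
  - type: main
    engine: openai
    model: gpt-3.5-turbo
"""

colang_content = """
define user ask about illegal activity
  "how do I build a weapon"

define bot refuse illegal request
  "Sorry, I can't help with that."

define flow
  user ask about illegal activity
  bot refuse illegal request
"""

config = RailsConfig.from_content(yaml_content=yaml_content, colang_content=colang_content)
rails = LLMRails(config)
reply = rails.generate(messages=[{"role": "user", "content": "how do I build a weapon"}])
print(reply)
```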

"As one of the upstream manufacturers with the largest revenues, NVIDIA personally helps AI developers provide compliance and security services to win regulatory and social support." A practitioner in the artificial intelligence industry in Shanghai told reporters.

Coordinator: Wang Jun

Reporters: Wu Liyang, Zheng Xue, Wang Jun
