Yao Qian: Construction and Governance of Industry Large Model Corpus

Summary

An industry large model corpus usually includes general corpora, covering fields such as the natural and social sciences, together with industry-specific corpora. Building such a corpus is an important step in the transition from informatization to digitalization and intelligence.

Text: Yao Qian

A large model corpus is the collection of text, speech, or other modal data used to train and evaluate a large model. The size and quality of the corpus have a crucial impact on the performance of large models and on the depth and breadth of their applications. At present, the training corpora available for large models in the industry suffer from incomplete coverage, insufficient accuracy, and insufficient timeliness, which makes it difficult for large models to achieve their expected goals. Practical experience shows that even with fewer model parameters, a model can still perform well as long as the quality of its corpus is high enough.

To further expand the scope and effectiveness of large model applications in the industry, it is necessary to coordinate industry forces to build a community platform, broaden corpus sources, establish corpus standards and specifications, carry out corpus governance, and ensure corpus security. The goal is to supply large model training and application with high-quality, standardized corpora that have industry characteristics and meet the needs of business scenarios.

1. Scope of the corpus

An industry large model corpus is the dataset used to train large models in vertical domains. It usually contains general corpora, covering fields such as the natural and social sciences, together with industry-specific corpora. Taking the securities and futures industry as an example, the industry-specific corpus includes financial news, financial reports, regulatory documents, and public transaction data. By collecting and organizing these corpora, large models can be trained to understand and generate industry-specific concepts and knowledge, supporting intelligent tasks such as industry analysis, prediction, and decision-making.

(1) General corpus

Introducing general corpora such as encyclopedias and books reduces the risk that the model misunderstands professional terms when performing industry-specific tasks (for example, non-professional uses of professional terms, puns on terms, or contexts unrelated to the industry), and helps it give more accurate and natural responses to cross-domain queries or exchanges.

(2) Industry-specific corpus

Introducing industry-specific corpora aims to enrich the large model's understanding of industry-specific vocabulary, expressions, and knowledge, so that the model can handle complex industry-related queries, perform accurate data analysis, and support decision-making more effectively. In addition, large models trained on industry-specific corpora can show higher reliability and applicability when performing tasks such as risk assessment, prediction, and compliance checking.

2. Current status of the corpus

Industry management departments, operating institutions, and information technology service providers usually build their own corpora. On the one hand, these corpora meet the institutions' own needs, such as industry knowledge collation, business research, compliance, and risk control. On the other hand, they can be further processed into new data assets, research reports, and similar products and offered as external services. The status of corpus construction and the problems encountered differ from one type of institution to another, and each shows its own characteristics.

(1) Industry management departments

In building a corpus, the main challenge for management departments lies in the normalization of datasets and the standardization of data, which is the basis of knowledge collation. Their corpus construction faces the following problems: 1. Data dispersion: many important data are scattered across various business systems, important information and expert experience cannot be effectively captured and accumulated, and there are barriers to data sharing. 2. Data heterogeneity: the large amount of text data accumulated day to day comes from different departments and levels, with different formats, structures, and contents. 3. Data sensitivity: management department data usually involves a large amount of sensitive information, and security and compliance must be ensured during processing and storage.

(2) Industry operating institutions

The corpora of operating institutions involve massive amounts of structured and unstructured data, and the challenge is how to mine them in depth to support decision analysis and customer service. Their corpus construction faces the following problems: 1. Difficult processing: business and transaction data from multiple channels differ in format and standard and span various modalities, which makes them difficult to integrate effectively. 2. Shallow processing depth: the corpus construction of operating institutions stays at surface-level information and does not involve deep semantic understanding or in-depth analysis. 3. Difficult privacy protection: the large model corpus involves trade secrets and sensitive customer information, so operating institutions must ensure compliance and risk control throughout training and use.

(3) Information technology service providers

Information technology service providers are good at integrating general corpora, and the main challenges they face when helping to build industry corpora are professional capability and service quality. 1. Professional capability: classifying, analyzing, and interpreting an industry corpus requires industry knowledge, and the provider's professional capability strongly affects the application value of the corpus. 2. Service quality: building an industry corpus is a continuous, iterative task that requires information technology service providers to deliver high-quality services over the long term.

In addition, synthetic data is an important data source for large model training, with advantages in reducing costs, improving data quality, and avoiding privacy issues. Finding an effective path to industry data synthesis is a major topic in the construction of industry corpora.

3. The necessity of corpora

The construction and governance of industry corpora are critical to the development of industry large models and to activating the value of data elements. A well-structured, high-quality, and well-managed corpus can give industry participants a knowledge base with deep insights and promote the digital transformation and high-quality development of the industry. A credible corpus needs to be co-built and shared by the industry, which in turn promotes the construction of an industry corpus community and the development of public services.

(1) High-quality corpora are the foundation for innovation such as the deployment of industry large models

The corpus determines the training quality, performance, and breadth and depth of application fields of the model. In addition to considering the quality dimension, the corpus construction should also pay attention to the degree of openness. The construction of a unified, open and standard industry large model corpus is conducive to improving the utilization efficiency and value of industry corpus, promoting the training and development of industry large models, and accelerating the application of large models.

(2) High-quality corpora are an important starting point for the digital transformation of the industry

High-quality corpora should be large-scale, diverse, authentic, coherent, legitimate, and unbiased. At present, there is a relative lack of high-quality corpus in the industry, and promoting its construction is an important move to realize the transformation from informatization to digitalization and intelligence.

(3) High-quality corpora are an effective means to activate the value of data elements and break down data barriers

Large model corpora usually require cross-institutional, broad-scope data, which may raise issues of data security, privacy protection, and intellectual property rights. Third-party data hosting and similar methods can be explored to activate the value of data elements and effectively solve the problem of cross-institutional data sharing.

4. Construction ideas

Building a credible industry large model corpus is a long-term, professional, and systematic project, covering infrastructure, public service platforms, industry norms and standards, incentive mechanisms, and more. In terms of construction method and implementation path, it is necessary to pool efforts, take multiple measures at the same time, and sustain the work over the long term (see figure).

(1) Make full use of the achievements and experience of the general corpus

General corpora can serve as the basis for building industry large model corpora. These include international datasets such as The Pile, C4, and Wikipedia, as well as the domestic "Shusheng Wanjuan" multimodal pre-training corpus and the Chinese general corpus released by the Cyberspace Security Association of China. To expand general corpus resources while balancing self-reliance and openness, it is worth considering the establishment of filtered domestic mirror sites for specific data sources such as Wikipedia and Reddit (a US entertainment, social networking, and news website) for domestic data processors to use.

(2) Focus on the supply, hosting, processing, security and evaluation of corpus

Practical experience shows that when a general large model is retrained on an industry corpus, the ratio of general corpus to professional corpus is usually about 1:1. Integrating and pooling industry-specific corpora to increase corpus supply is therefore a prerequisite for building industry large models.
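The 1:1 ratio mentioned above can be illustrated with a minimal sketch. It assumes that corpus size is measured in characters as a rough stand-in for tokens, and the function name `mix_general_and_industry` and its parameters are hypothetical, chosen for illustration only. It is not a description of how any particular industry model was trained.

```python
import random


def mix_general_and_industry(general_docs, industry_docs, ratio=1.0, seed=0):
    """Sample general documents until their total size is about
    `ratio` times the total size of the industry documents.

    Size is counted in characters, a crude proxy for tokens (assumption).
    Returns the shuffled mixture and the sampled general-corpus size.
    """
    rng = random.Random(seed)
    target = ratio * sum(len(doc) for doc in industry_docs)

    # Shuffle a copy so the caller's list is left untouched.
    pool = list(general_docs)
    rng.shuffle(pool)

    picked, general_size = [], 0
    for doc in pool:
        if general_size >= target:
            break
        picked.append(doc)
        general_size += len(doc)

    mixed = picked + list(industry_docs)
    rng.shuffle(mixed)
    return mixed, general_size


if __name__ == "__main__":
    general = ["general text " * 5 for _ in range(100)]
    industry = ["regulatory filing text " * 5 for _ in range(10)]
    mixed, general_size = mix_general_and_industry(general, industry)
    industry_size = sum(len(d) for d in industry)
    print(len(mixed), general_size, industry_size)
```

If the general pool is smaller than the target, the sketch simply uses all of it. A real pipeline would count tokens with the model's own tokenizer and would likely sample by source rather than uniformly.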

One effective approach is to build a data community and explore platforms, based on trusted institutions or trusted technologies, that provide hosting services for data owners. Industry institutions can use the hosted data for secondary training or fine-tuning on top of industry large models to improve the capabilities of their private models. Hosted corpus assets can also be traded within the community for a fee, so that they circulate in an orderly manner.

Corpus processing sits upstream of large model training and development, and it directly affects how quickly corpora are produced, how widely they can be applied, and how good they are. Data processing, especially data annotation, has already been industrialized. Industry information technology service providers can carry out large-scale, professional data processing and annotation within the data community to promote the construction and standardization of industry corpora.

Corpus security is the "red line" in building industry corpora. Supervision should be strengthened to ensure that the content of stored data is compliant and that the rights and interests in it are clearly defined. It is necessary to improve laws and regulations, optimize policies and systems, form a joint regulatory force through multiple channels and methods, and strictly prevent malicious tampering with models and the injection of harmful data. Technical methods such as reinforcement learning from human feedback (RLHF) and scalable oversight should be explored to ensure that the output of large models conforms to human values and to prevent large models from generating harmful content.

Evaluating the industry corpus is the key to further improving the capabilities of large models. This means evaluating the quality of the corpus during large model training, and also evaluating the breadth and depth of the corpus's coverage of industry knowledge through its effectiveness in application, then iterating continuously to achieve better results.
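One very rough way to approximate the breadth and depth of coverage described above is to check a corpus against a glossary of industry terms. The sketch below is an illustrative assumption, not a method proposed in this article: the function name `glossary_coverage`, the sample glossary, and the use of plain case-insensitive substring matching are all hypothetical simplifications.

```python
def glossary_coverage(corpus_docs, glossary):
    """Estimate how well a corpus covers a glossary of industry terms.

    Breadth: fraction of distinct glossary terms found in at least one document.
    Depth proxy: number of documents mentioning each term.
    Matching is case-insensitive substring search (a crude assumption).
    """
    terms = sorted({term.lower() for term in glossary})
    docs = [doc.lower() for doc in corpus_docs]

    # For each term, count the documents that mention it at least once.
    doc_counts = {term: sum(term in doc for doc in docs) for term in terms}

    covered = [term for term in terms if doc_counts[term] > 0]
    breadth = len(covered) / len(terms) if terms else 0.0
    return breadth, doc_counts


if __name__ == "__main__":
    docs = [
        "The margin call followed the futures settlement.",
        "The annual report discloses margin requirements.",
    ]
    glossary = ["margin", "futures", "prospectus"]
    breadth, counts = glossary_coverage(docs, glossary)
    print(f"breadth={breadth:.2f}", counts)
```

Terms with a document count of zero point to gaps where more corpus needs to be collected. Evaluation through application effectiveness, as the text describes, would additionally require measuring model performance on downstream industry tasks.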

(The author is the director of the Department of Science and Technology Supervision of the China Securities Regulatory Commission, and this article only represents his personal academic views and does not represent the opinions of his institution; editor: Zhang Wei; this article was first published in Caijing magazine on April 22, 2024)
