【In Depth】"Cash Cow" or "Gold-Swallowing Beast"? Large Models Put AI Data Service Providers to the Test

"Large models have very high requirements for data collection and annotation. In the past, everyone was still trading prices, but now the cost of processing a piece of data can even reach hundreds of dollars. In a large-scale model corpus data promotion meeting, Qiao Tian, a data expert at Beijing Qingshu Wisdom Technology Co., Ltd. (hereinafter referred to as "Qingshu Wisdom"), said.

As a data service company, Qingshu Wisdom mainly provides high-quality AI training datasets and professional consulting services to AI R&D enterprises and scientific research institutions. Qiao Tian's experience is not unique. Securities Times reporters interviewed a number of AI data service providers and found that the surge in popularity of large models since the beginning of this year has brought these providers more orders, but has also sharply increased the cost of their data products and services.

In the era of large models, opportunities and challenges coexist. Are companies that bet on large models planting a sure "cash cow" for the future, or raising a "gold-swallowing beast" with uncertain financial prospects? With third-quarter reports now out, the results of some listed companies send a signal: the earnings of AI data service providers are under pressure, and the companies face a cost test.

Revenue – Surging demand brings more orders

Computing power, data, and algorithms are known as the troika that powers large AI models. At this year's World Artificial Intelligence Conference, Wu Chao, director of the expert committee of CITIC Think Tank and director of the Securities Research Institute of China Securities Construction Investment, said that a model's quality is determined 20% by the algorithm and 80% by data quality, and that high-quality data will be the key to improving model performance in the future.

Training large models requires large amounts of high-quality data. If a large model is compared to a learner, then only high-quality "learning materials" allow it to master knowledge effectively and raise its level of intelligence. As pre-trained large-model technology develops, the requirements for both the quality and the quantity of data keep rising. According to Deloitte's forecast, the market for AI pre-training data services is expected to reach 16 billion yuan in 2027, with a five-year compound annual growth rate of 28.9%.

Moreover, the application of large models across a wide range of industries is accelerating, and demand for high-quality datasets in vertical fields is growing even more explosively. A review by the reporter found that the major AI data service companies in the A-share market have recently announced partnerships with large-model companies or scientific research institutions.

For example, Haitian AAC, a leading AI training data company in China, recently announced that it had formally signed a strategic cooperation agreement with Beijing Academy of Artificial Intelligence covering large-model data processing, large-model evaluation, dataset research and development, and the development of artificial intelligence standards. Tors recently said on the investor interaction platform that it has signed sales contracts with artificial intelligence companies and national laboratories to supply them with high-quality, diversified data for use as large-model pre-training datasets.

"One of our obvious feelings is that the large model has achieved a real explosion on the scene side." Cao Feng, chief technology officer of Shanghai Data Bank Technology, said in an interview with reporters. As a data technology company, Databank Technology has accumulated a huge number of data products and system services in the financial and industrial fields. Cao Feng told reporters that customers will now put forward many needs that cannot be met by previous technologies based on the ability of large models, such as in-depth analysis of existing research reports and announcements, and the content of interviews with listed companies will be formed into text and automatically extract key points.

The diversification of scenario applications and the deepening of information processing mean more and more complex data requirements. According to reports, some of these data need to be produced with the help of large models, and some are used as training corpora for large models in vertical domains or reference materials for generating content.

Cost – Computing power and manpower costs are rising

Although AI data service providers had accumulated many mature data products before large models emerged, many of those products do not meet the requirements for training large models. "A large-model company is like a chef and a data service provider is like a vegetable farmer, and some of the 'ingredients' the chef orders are things the farmer has never seen before," said Qiu Huihui, founder of Feidi Technology, a financial information service provider, offering reporters a vivid analogy.

"Chefs" put forward the demand for customized and higher-end ingredients, and "vegetable farmers" can only invest more energy and spend higher costs to make them. One of the immediate effects of this is that the cost of data products and services has become higher.

Where exactly does the extra money go? An artificial intelligence researcher told reporters that the higher cost shows up mainly in two areas: computing power and manpower. On the computing side, large models require data to be mined more deeply and at a finer granularity, which depends on stronger computing power, so data service providers often need to lease or purchase more hardware such as chips and graphics cards.

On the manpower side, AI data services, and data annotation in particular, were long regarded as labor-intensive, grueling, low value-added work. Some large technology companies and data service providers set up data labeling teams in economically underdeveloped areas, which both helped local people find employment and reduced labor costs. In the era of large models, however, data quality requirements have risen sharply and data processing has become more difficult, so the old model of relying on cheap labor and competing on low prices and high volume no longer works.

"In the past, secondary school or high school students could meet the requirements of data labeling, but now we need to recruit college students, even master's and doctoral students, to process vertical data in designated industries." An AI data service provider told reporters. According to media reports, the undergraduate rate of the first batch of annotators in the data annotation base established by a leading large model manufacturer has reached 100%. There is no doubt that, at least at this stage, the large model has rolled up the academic qualifications of data annotators, and the labor cost has naturally risen.

In addition, a new platform may need to be built to process and preprocess the data before it is "fed" to a large model. AI data service providers have to deploy hardware for data storage and processing, which inevitably comes with more investment in algorithm engineers. Moreover, amid the large-model wave, some data providers that have accumulated high-quality industry data are no longer content with providing data services and are building industry-specific large models themselves, which is an even larger investment.

Moving into large models is therefore bound to be a "money-burning" business. In the secondary market, a number of A-share listed companies in the data business have announced private placement plans to fund large-model R&D. In June, Haitian AAC announced a plan to issue A-shares to specific investors, aiming to raise no more than 790 million yuan for a project to build AI large-model training datasets and a project to develop a vertical large model for data production. In July, Transwarp Technology announced a plan to issue A-shares to specific investors, aiming to raise no more than 1.521 billion yuan for a project to build a large model for data analysis and a project to build an integrated platform for intelligent quantitative investment and research. In August, Tors announced a plan to issue shares to specific investors, aiming to raise no more than 1.845 billion yuan for the research and development of its Tuotian industry large model and for an AIGC application industrialization project.

Test – Earnings of AI data service providers are broadly under pressure

Since the beginning of this year, large models have stayed in the spotlight and ignited investment enthusiasm in both the primary and secondary markets, but the market also has doubts about whether such heavy investment can produce corresponding returns. Notably, after Haitian AAC and Transwarp announced their private placement plans, both companies received inquiry letters from regulators asking for specific explanations of the necessity of the fundraising, the company's existing business, and the related market prospects.

In its reply to the inquiry letter in September, Haitian AAC said that the large-model products launched so far are mainly general-purpose large language models, that large models in vertical and multimodal fields are still few in number, and that data demand has not yet been fully released. Because the products of the company's downstream large-model customers are still at the first-generation release or R&D stage and have not yet been widely adopted, the company said the related data demand will be released further once those products reach the market, and it expects its large-model-related revenue to grow further in the future.

Transwarp's reply to the inquiry letter in September mentioned that based on the current development trend of the artificial intelligence industry and market competition, if the company does not carry out research and development related to large models, it may not be able to continue to maintain market competitive advantages in related fields in the future.

When a wave of new technology rolls in, everyone is afraid of being left behind, so companies rush to position themselves for new sources of growth. Judging from third-quarter financial statements, however, the earnings of AI data service providers are generally under considerable pressure.

The reporter also noted that Haitian AAC issued a revised private placement plan on October 25, cutting the amount to be raised from 790 million yuan to 666 million yuan. The funds earmarked for the project to develop a vertical large model for data production shrank by 23.51%, and the funds originally earmarked for the project to build AI large-model training datasets shrank by 7.38%.

With the earnings of AI data service providers under pressure, is betting on large models a dangerous gamble? It may be too early to ask. A brokerage analyst told reporters that seizing the commercial opportunity in large models inevitably requires heavy investment, but the related industries are still at an early stage of development. Large models need time to reach more application scenarios, data demand will not be released overnight, and current revenue figures cannot be used to judge how things will look in the future.

"Doing data itself is a long-distance race, and the data industry is a long-cycle industry, which requires advance layout and some patience." Zhang Qingqing, founder of Qingshu Wisdom, said. She told reporters that the company has been focusing on conversational scenarios in the past and has accumulated a lot of high-quality voice data, including voice data with a high sampling rate of 48kHz for multiple speakers. Recently, many AI synthetic videos of celebrities speaking native foreign languages or dialects have been widely circulated on the Internet, and "video interpretation" has become a very popular application, and one of the key technologies supporting this application, speech replication technology, is realized using data with a high sampling rate of multiple speakers. "There are a lot of manufacturers who have asked us recently, but the premise is that we have been working silently in this direction for 7 years and have been precipitating and accumulating this kind of data." Zhang Qingqing said.

A recent research report by Caitong Securities pointed out that the real-world deployment of scenario-based applications has become the new growth driver for large AI models, and that demand for AI pre-training data is expected to grow rapidly as those applications are deployed. The report added that as the industry enters a period of rapid development and evolves toward multimodal, compliant, and semi-automated data production, technology giants and specialized pre-training data service providers have stronger R&D advantages, are expected to build barriers in resource integration and R&D technology, and stand to capture a larger share of the incremental market.

There is a consensus among AI data service providers: the emergence of large models is good for the data industry, and 2023 is the first year of the industry's high-quality development. A writer once wrote, "What you do in March and April will have its answer in August and September." Only time will tell what flowers and fruit will grow from the seeds that AI data service providers have planted in this first year.

Editor-in-charge: Ye Shuyun

Proofreading: Gao Yuan
