Interview with the founder of Dataku Technology: General-purpose large models developing to the extreme is still a long process

【Text/Observer Network Lv Dong】

"We mainly use OpenAI models now. We have also tested a series of domestic large models; they are progressing quickly, but for now their maturity can still be improved," Shen Xin, founder and president of Dataku Technology, said on July 7 in an exclusive interview with Observer Network at the 6th World Artificial Intelligence Conference (WAIC).

He believes that if general-purpose large models are developed to the extreme, there will be no need for industry-specific large models, but that this will be a very long process, because high-quality data that large models can really use is scarce. For example, only a handful of companies can hold financial data to a high standard, and these companies will certainly not contribute their data to others.

Shen Xin, founder and president of Dataku Technology

Shen Xin told Observer Network frankly that people today live in a world where everything is connected and all industries are linked together. In the past, people who analyzed chips and people who analyzed cars rarely crossed paths, but today industrial networks are intertwined, which places very high demands on people's analytical capabilities. In this situation, connecting and weaving data together is particularly important for helping people make decisions.

He also said that there is no such thing as "magic" in this world, and large models will not solve every problem at once, because a large model is in itself an efficiency tool. "We are now integrating some large-model technology into the data production side to further improve production efficiency. This is how we look at large models: don't pursue technology for technology's sake."

Dataku Technology was founded in 2009 by Shen Xin and Liu Yanhai, a returnee from overseas. It mainly provides intelligent data products and system services based on industrial logic for the financial and industrial sectors, helping financial institutions, enterprise groups, and government departments meet data and system requirements in business scenarios.

At this year's WAIC, Observer Network tried istari, the concept graph product released by Dataku Technology. After a user's question is parsed by a large language model, it is converted into a query against the unified product knowledge graph (UPG), which then shows the relevant professional industry knowledge and the relationships between the various knowledge points. The product mainly uses the large model to infer industrial relationships.

The following is a transcript of the interview:

Observer Network: This year's booth is larger than last year's. What is the focus of the display?

Shen Xin: Compared with last year, this year's exhibition is more about enhancement and productization. For banks, for example, what was a custom solution last year is a standardized product this year. Over time, if a company wants to grow, it must become more and more productized. Everyone is still exploring digital transformation, and as we reach more and more customers, we have to extract their common needs. Our goal is to serve not only the top financial institutions but also a large number of small and medium-sized banks. They may not have as many resources and capabilities, so standardized products matter to them. And once they have experienced the benefits of standardized products, they become more determined and confident about investing more in further refinement.

Observer Network: Dataku has taken part in the World Artificial Intelligence Conference several years running. How does its business combine with artificial intelligence technology?

Shen Xin: Technology is always a tool. Dataku uses a great deal of artificial intelligence technology at three levels: data production, data analysis, and data weaving. Technology empowerment is an underlying capability. We never monetize technology directly, because showing off technology for its own sake is meaningless. For a company to be sustainable, the key is to consolidate its underlying capabilities. It is like athletes: what they do most is physical conditioning. Their final performance also depends on mentality, but the underlying ability is still the core.

Observer Network: Which underlying artificial intelligence capabilities is Dataku currently using?

Shen Xin: When we parse and analyze different types of data, we use a variety of small models, such as NLP (natural language processing) models. Previously, we did not build datasets that required massive manual annotation, because that would reduce gross margin. But today, with large models, we can annotate data at scale: a small model first breaks a long paragraph into smaller segments, and a large model then extracts the key elements. Integrating these engineering techniques further strengthens the data factory.
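The two-stage annotation flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Dataku's actual pipeline: `split_into_segments` is a simple rule-based stand-in for the "small model", and `fake_extract` is a hypothetical placeholder for a real large-model call, which would be passed in as the `extract` argument.

```python
import re
from typing import Callable, Dict, List


def split_into_segments(paragraph: str, max_chars: int = 80) -> List[str]:
    """Stand-in for the 'small model': split on sentence ends, then pack
    consecutive sentences into segments of at most max_chars characters.
    A single sentence longer than max_chars is kept whole."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]
    segments: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            segments.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments


def annotate(
    paragraph: str,
    extract: Callable[[str], List[str]],
    max_chars: int = 80,
) -> List[Dict[str, object]]:
    """Run the extractor (the 'large model' step) on every segment."""
    return [
        {"segment": seg, "elements": extract(seg)}
        for seg in split_into_segments(paragraph, max_chars)
    ]


def fake_extract(segment: str) -> List[str]:
    """Hypothetical placeholder for a large-model call: it simply returns
    capitalized words as the 'key elements'."""
    return re.findall(r"\b[A-Z][a-z]+\b", segment)


if __name__ == "__main__":
    text = "Acme supplies chips to Bolt. Bolt builds electric cars. Demand rose last year."
    for record in annotate(text, fake_extract, max_chars=40):
        print(record)
```

Keeping the extractor as a plain callable means the same splitting step can be paired with whichever large model performs best, which matches the "use whichever model works well" stance taken later in the interview.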

Observer Network: Are the models Dataku currently uses self-developed, or are they models available on the market?

Shen Xin: We will not build large models ourselves. Large models require long-term investment before they produce results, which suits the big tech companies, and ready-made large models are already available. Dataku is currently connected to OpenAI, and we are also testing domestic large models such as Baidu's Wenxin. We simply use good tools in vertical fields, and we will use whichever large model works well. We focus on vertical fields, and our customers, including financial institutions, cannot develop large models themselves either, so they also use ready-made ones. Therefore, on the one hand we must clearly understand the level of the large models on the market, and on the other hand we must be compatible with what customers need.

Observer Network: Which large models on the market is Dataku mainly working with?

Shen Xin: We mainly use OpenAI models now. We have also tested a series of domestic large models; they are progressing very quickly, but for now their maturity can still be improved.

Observer Network: How do you think about the application of large models?

Shen Xin: There is no such thing as "magic" in this world, and a large model cannot solve every problem at once. A large model is in itself an efficiency tool, so we are now integrating some large-model technology into the data production side to further improve production efficiency. This is how we look at large models: don't pursue technology for technology's sake.

Observer Network: How do you view the conflict between general-purpose large models and industry-specific large models?

Shen Xin: If general-purpose large models are developed to the extreme, I think there will be no industry-specific large models, but this is a very long process. There is a great lack of high-quality data that large models can really use.

For example, in our industry, the companies that can hold financial data to a high standard can probably be counted on one's fingers, and those few will definitely not contribute their data to others. So it is actually a long process, not as fast as everyone thinks. Another point is that many scenarios do not need large models at all. They may be very useful in consulting-service scenarios, but in an industry with strict data requirements such as finance, large models are of little use, because their output is relatively vague.

Observer Network: How should we understand Dataku's business, or how Dataku uses data?

Shen Xin: We are connecting all the data in the market that looks like isolated silos, and within that data network we look for useful pieces of information that are hard for outsiders to capture. People's thinking is always limited. However capable an expert is, they may be truly professional in only one or two fields, and once they cross into another field they may be at a loss. Today we live in a world where everything is connected and all industries are linked. In the past, we might have said that chip analysis and car analysis rarely intersected, but today new energy vehicles are inseparable from chips, industrial networks are intertwined, and the demands on human analytical capabilities are very high. In this situation, connecting and weaving data together is especially important for helping people make decisions.
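The idea of connecting siloed industry data into one network can be illustrated with a small graph search. This is a minimal sketch, not Dataku's data or method: the edge list is made up for illustration (only chips and new energy vehicles come from the interview), and `build_graph` and `find_link` are hypothetical helper names.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Set, Tuple


def build_graph(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from (industry, industry) pairs."""
    graph: Dict[str, Set[str]] = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def find_link(graph: Dict[str, Set[str]], start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search: return a shortest chain of industries linking
    start to goal, or None if the two are not connected."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in sorted(graph[node]):  # sorted for deterministic output
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Illustrative, invented relationships between industries.
EDGES = [
    ("chips", "automotive electronics"),
    ("automotive electronics", "new energy vehicles"),
    ("lithium", "batteries"),
    ("batteries", "new energy vehicles"),
]

if __name__ == "__main__":
    graph = build_graph(EDGES)
    print(find_link(graph, "chips", "lithium"))
```

Once the separate datasets share one graph, a question that spans two specialties (here, chips and lithium) becomes a path lookup rather than something that depends on a single analyst knowing both fields.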

Observer Network: What proportion of Dataku's staff currently work in R&D?

Shen Xin: We now have more than 200 employees, of whom more than 100 work in R&D, which is more than half. These R&D staff are constantly consolidating Dataku's underlying capabilities. When the underlying technical capabilities reach a certain level, you find that data extraction capability and data accuracy both improve. It is the same as China launching rockets to explore the moon: once aerospace technology breaks through, technologies across the civilian field improve as a whole.

Observer Network: What is Dataku's current revenue level, and are there any IPO plans?

Shen Xin: Our revenue has exceeded 100 million. Last year our business volume tripled, and this year it will double again. I think that whether it is an IPO or any other route into the capital market in the future, it depends on the company's fundamentals, that is, whether it really creates value for customers.

This article is an exclusive manuscript of the Observer Network and may not be reproduced without authorization.