
Zhang Peng talks to Xia Lixue: China's Scaling Law is a scenario advantage, and heterogeneous computing power solves the problem of deploying large models

Author: Geek Park

The Scaling Law has become the "one true path" of large model evolution.

The larger the parameter count, the larger the dataset, and the greater the computing power consumed, the better a large model performs. Compared with their overseas counterparts, domestic large model companies face more severe computing power problems, such as capital constraints and restrictions on buying GPUs, leading many to ask: does a Scaling Law even exist for Chinese large models?

Xia Lixue, co-founder & CEO of Wuwen Xinqiong, said: "I think there can be another interpretation of the Scaling Law in China, namely a Scaling Law of application scenarios."

The "MxN" architecture his company launched "solves the problem of how a bunch of similar large models can run on different cards, and ultimately delivers them to developers as a resource, like water, electricity, and gas."

In his view, "the core task for large models this year is deployment, and the sticking point of deployment is cost-effectiveness."

On April 10, in a live-streamed dialogue, Zhang Peng, founder & president of Geek Park, and Xia Lixue discussed the Scaling Law of large models, China's computing power challenges, and the deployment of large models, and tried to put forward some non-consensus views.

01 CUDA is NVIDIA's moat, and inference scenarios are the future focus of computing power

Zhang Peng: From your point of view, what was most worth noting at last month's GTC?

Xia Lixue: Since GTC 2018, everyone's focus has been on NVIDIA's latest graphics cards, and this time that meant the release of the latest B-series GPU (Blackwell B200).

The B-series still brings many technical improvements, such as doubled video memory, the new PCIe 6.0 protocol, and greatly improved interconnect bandwidth. This shows that NVIDIA remains at the forefront as the technology develops, and that it is clearly determined to take on ever-larger systems engineering, because these upgrades are genuinely aimed at "building a bigger training system."

However, some of the headline numbers do leave room for discussion. For example, some reports mention a 30x improvement, but we have not yet found clear evidence for it. We speculate it may be data from specific scenarios: for instance, once the scale grows large enough that the original H-series GPUs suffer saturation losses, a comparison at that point might indeed show a 30x gain.

The core improvement we have seen so far is that the B-series packages two chips together without significant performance loss, achieving roughly a 2x effect.

Overall, this conference offered no exaggerated "black magic" improvements, but it did prove that NVIDIA has made strong system-level technical upgrades in service of the Scaling Law.

Zhang Peng: Two days ago, Anker's Yang Meng shared a view: he believes that in the long run NVIDIA still faces huge challenges and uncertainties, that compute-in-memory (the integration of storage and computing) is the hope for the future, and that this launch showed no corresponding plan at all. Do you have any non-consensus findings that people have overlooked?

Xia Lixue: In fact, one core reason NVIDIA can keep leading is that it has a huge user base now, so it can see where the future is heading and then build that direction into its next generation of products. So we can analyze its new features and see what it is thinking.

One thing I noticed at this launch was that 4-bit floating point (FP4) was officially added to the B-series spec sheet, which the previous H-series did not have. Considering that the H-series' 8-bit formats (such as FP8) have not been widely used in training, this 4-bit format is clearly not meant for training large models, but for how large models will ultimately be served in inference and deployed, so that more developers can enjoy the benefits of NVIDIA GPUs. So beyond helping you build ever-larger models, NVIDIA is also thinking about how to help you deploy models in specific scenarios.
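
To make concrete what a 4-bit floating-point format buys at inference time, here is a minimal numpy sketch of block-wise weight quantization onto an e2m1-style value grid. The grid, block size, and function names are illustrative assumptions, not NVIDIA's published FP4 specification:

```python
import numpy as np

# Representable magnitudes of an e2m1-style 4-bit float (2 exponent bits,
# 1 mantissa bit). Illustrative assumption, not NVIDIA's actual FP4 spec.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_quantize(weights, block_size=32):
    """Block-wise quantization: one shared scale per block, 4 bits per weight
    (3-bit grid index + 1 sign bit). Assumes weights.size % block_size == 0."""
    flat = weights.reshape(-1, block_size)
    scales = np.abs(flat).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scales[scales == 0] = 1.0                  # avoid division by zero
    scaled = flat / scales                     # map each block into [-6, 6]
    signs = np.sign(scaled)
    idx = np.abs(scaled[..., None] - signs[..., None] * FP4_GRID).argmin(-1)
    return idx.astype(np.uint8), signs, scales

def fp4_dequantize(idx, signs, scales, shape):
    return (signs * FP4_GRID[idx] * scales).reshape(shape)

w = np.random.randn(4, 64).astype(np.float32)  # a toy weight matrix
idx, signs, scales = fp4_quantize(w)
w_hat = fp4_dequantize(idx, signs, scales, w.shape)
print("mean abs quantization error:", np.abs(w - w_hat).mean())
```

Cutting each weight from 16 bits to 4 roughly quarters memory and bandwidth needs, which matters far more for serving cost than for training, where such coarse precision is generally too lossy.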

Combine this with NVIDIA's latest earnings report and you can infer that revenue from inference scenarios already accounts for nearly 40%, which exceeds the industry's expectations. Wall Street previously put the training-to-inference ratio at 8:2, but on NVIDIA alone it is now 6:4.

So whether from the perspective of the real returns NVIDIA is already earning or of its future strategic planning, it will increasingly support inference scenarios.

Zhang Peng: There are many excellent traditional chip companies, such as Intel and AMD, and many cutting-edge startups today. Why has NVIDIA reached such heights?

Xia Lixue: NVIDIA's core competitiveness is that it always knows which specifications its next generation of chips needs in order to serve the tasks of the next era.

How does it know this? Through the CUDA ecosystem. In the AI space, this is one of NVIDIA's most important moats.

Every piece of hardware has an interface, and the interface is like a manual: developers use the hardware according to the "manual." NVIDIA invested heavily in building its CUDA development ecosystem very early on, making this manual very easy to read, so that every developer can easily use NVIDIA's hardware.

So basically since the last AI era, all the most advanced models and applications have run on NVIDIA's CUDA first. This creates a positive cycle: everyone spontaneously develops new features on NVIDIA's cards, and NVIDIA enjoys the dividends, while its competitors must invest extra manpower to port those features into their own environments. It is as if NVIDIA does nothing and competitors do twice the work. This is NVIDIA's core moat, and the core reason it can maintain its "hegemony" even without a generational lead in hardware.

Of course, this moat is not completely unshakable, because the large model has arrived.

In the previous era, AI models had to be optimized for each scenario: convolutional neural networks for vision, recurrent neural networks for language processing, and so on. Even then, everyone inevitably converged on doing their development in the same language system. For example, having accumulated a set of tools on CUDA, I would naturally migrate the common pieces to other scenarios.

That is what built up the thickness of NVIDIA's CUDA ecosystem, but the large model is thinning it out. Because the structural differences between large models are not that great, I no longer need 100 different models; what everyone pursues now is whether it is cheap.

From this point of view, other hardware manufacturers have more opportunities. That is why, after large models emerged, players like AMD and Intel were so quick to release their core software and products: they saw this too.

02 China's Scaling Law is a scenario advantage

Zhang Peng: Back in China, we still face a computing power ceiling. Some time ago a friend even raised a particularly pessimistic question: does the Scaling Law really exist in China? In theory the Scaling Law requires endless computing power to lead to AGI, but China has a computing power ceiling, so in the end we may have no way to truly enjoy the Scaling Law's technical dividend. What do you think of this view?

Xia Lixue: Before the term entered public discourse, its source was an OpenAI paper. The core question of that paper is: when we want to train a model that makes the best predictions, what rules should we follow? It points out that two factors drive the Scaling Law: not just computing power, but also data.

Views on the Scaling Law have already collided once, between OpenAI and Llama. The logic of OpenAI's Scaling Law is that a good large model is obtained most efficiently with more computing power and more data, purely from the perspective of the cost-effectiveness of training the model. Llama's idea is that the model must ultimately be deployed, so cost-effectiveness should account for the final inference stage: taking inference as the goal, you keep piling data onto a model of "good enough" scale, and end up with a scaling law at the data level.
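
To make the two readings concrete, the widely cited Chinchilla parameterization (Hoffmann et al., 2022, which revisited OpenAI's original scaling law work) models the loss in terms of parameter count $N$ and training tokens $D$:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad C \approx 6ND$$

The training-cost view minimizes $L$ for a fixed compute budget $C$, which favors growing $N$ and $D$ together. The inference-aware, Llama-style view fixes a smaller $N$ (since serving cost scales with $N$) and keeps pushing $D$ past the training-optimal point, trading extra training compute for a model that is cheaper to run.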

This picture is familiar. Looking back at the Internet and mobile Internet eras, some technologies originated in Europe and the United States and then achieved their scenario explosion in China. Because China has the largest user base and the most scenario data, we also have many enterprises and developers who can implement application scenarios.

Therefore, I think there can be another interpretation of the Scaling Law in China: a Scaling Law of application scenarios. Suppose we have a model that reaches a baseline level, and we apply it across a wide range of industries. Empowering thousands of industries means accumulating high-quality data in each of them; once that high-quality data is fed back into the model, the data flywheel starts spinning quickly.

You could say the computing power scaling law raises the output value of an industry itself, while the scenario scaling law solves the problem of penetration, that is, how large models reach all walks of life. Our advantage is that we can have our own distinctive definition of the Scaling Law.

Zhang Peng: What are your long-term judgments on the domestic computing power market?

Xia Lixue: First of all, we have voted with our feet; that is why we do "MxN", because we believe NVIDIA is not the only computing power vendor.

Of course, NVIDIA still dominates the domestic computing power market, but we also see that many manufacturers, whether AMD or some of the other chip makers we work with, have gradually gained capabilities comparable to NVIDIA's.

What is missing is the so-called next customer. It is a vicious circle: no one knows your chips are usable, so no one uses them at scale, and so still no one knows they are usable.

We also tell our model partners: don't do two highly uncertain things at the same time. The model is yours; hand the uncertainty of computing power over to me, and first run your business on our Infini-AI. I can prove to you that other cards can also run your business well, fast, and economically.

We can maintain good relationships with so many chip manufacturers because they need us too: they need us to help prove their strength, they need our optimization capabilities to help them do better, and they need us to open up the industry chain.

Going back to the original question: I think NVIDIA still dominates today, but there will certainly be a non-NVIDIA market in the future.

Zhang Peng: What is diverse, heterogeneous computing power, and why is it important?

Xia Lixue: Essentially, it comes from China's particular ecology. If there were enough NVIDIA chips, everyone would simply use NVIDIA; the problem now is that there are not enough NVIDIA chips.

So why do heterogeneity? Because the domestic ecosystem is still relatively scattered: everyone tills their own small plot of land. So the market will stay in this state for quite a while: you have many options at your disposal, and at the same time those options are fragmented.

It is impossible for everyone to have enough NVIDIA chips, so whether you are a large model maker or an application maker, you need to adapt to many chips. So can we consolidate these needs and turn them into a service that works for everyone? It amounts to doing, once and for everyone, the work everyone would otherwise each have to repeat. Originally everyone had to do MxN adaptation, but through its platform, Wuwen Xinqiong has connected M models and applications with N chips, so the entire ecosystem only needs M+N adaptations, with no wasted effort.
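
A toy sketch of why a middle layer turns MxN into M+N: each model is written once against a shared hardware interface, and each chip vendor implements that interface once. All names below are invented for illustration; this is not Infini-AI's actual API:

```python
from abc import ABC, abstractmethod

class Backend(ABC):
    """Shared hardware interface: each of the N chips implements this once."""
    @abstractmethod
    def matmul(self, a, b): ...

class CudaLikeBackend(Backend):
    def matmul(self, a, b):  # stand-in for a vendor library call (e.g. cuBLAS)
        return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

class DomesticChipBackend(Backend):
    def matmul(self, a, b):  # a second vendor's kernel behind the same interface
        return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

class TinyModel:
    """Each of the M models targets Backend, not any particular chip."""
    def __init__(self, weights):
        self.weights = weights
    def forward(self, x, backend: Backend):
        return backend.matmul(self.weights, x)

model = TinyModel([[1, 2], [3, 4]])
backends = {"cuda_like": CudaLikeBackend(), "domestic": DomesticChipBackend()}
for name, backend in backends.items():  # same model runs on every backend
    print(name, model.forward([[1], [1]], backend))
```

With M models and N chips, everyone adapting to everyone costs MxN integrations; one shared interface costs only M+N.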

In fact, this is also an opportunity born of the unique situation of China's computing power market.

03 Training-inference integration is the future, and the Transformer architecture will not be overturned anytime soon

Zhang Peng: How do you understand the idea that "inference is training"?

Xia Lixue: This is a very important point. What is the core capability of human beings? Some say tool use, but monkeys use tools too; some say social division of labor, but ants have that as well. So my understanding is that humanity's core capability is continuous learning: passing wisdom from generation to generation and iterating continuously. That is the foundation on which a civilization grows.

Given current technical limitations, the way we train a model today is to pre-train it first, then use it in the corresponding scenario, and let the returned results become a new dataset for the model's next iteration. It is like software upgrades: iOS 13 ships today, and you upgrade to iOS 14 tomorrow.

But people are not actually like this: if I get a question wrong in the morning, I won't make the same mistake in the afternoon.

So ideally, training and inference are integrated: we feed data to the system in real time as we use it, and the system generates feedback in the moment. This model was actually used in the Internet industry for a whole era, in the advertising system. If you don't click on an ad, it most likely won't show you a similar ad next time; once you do click on one, it immediately knows what you like.
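
A minimal sketch of that ad-style loop, where every served impression immediately updates the model, so inference and training run as one continuous process. Plain online logistic regression stands in for a real ad ranker:

```python
import math
import random

weights = [0.0] * 4     # the "model": a tiny click-probability predictor
LR = 0.1                # learning rate

def predict(x):
    """Inference: estimate P(click) for one impression."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1 / (1 + math.exp(-z))

def update(x, clicked):
    """Training: one immediate gradient step on the observed outcome."""
    err = predict(x) - clicked
    for i in range(len(weights)):
        weights[i] -= LR * err * x[i]

random.seed(0)
for _ in range(1000):                          # serve, observe, learn, repeat
    x = [random.random() for _ in range(4)]    # impression features
    clicked = 1 if x[0] + x[1] > 1.0 else 0    # hidden user preference
    p = predict(x)                             # a real system would rank with p
    update(x, clicked)                         # feedback applied on the spot
print("learned weights:", [round(w, 2) for w in weights])
```

The same loop becomes prohibitively expensive when "update" means adjusting a model with hundreds of billions of parameters, which is exactly the cost barrier described next.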

That system could be deployed quickly back then because the economics worked: the total cost of training plus inference could support the system learning and running continuously, 24/7.

Right now the cost of large models is too high; doing both training and inference continuously makes the total cost unbearable. So it remains a goal, but I think it is a very important direction.

Zhang Peng: To some extent, this can be understood as follows: cultivating general artificial intelligence without a clear goal is a very high-cost endeavor, but if the goal is clearly to strengthen intelligence for a specific capability, there may be a different path.

In fact, this is how business works: in the past, as long as the business case penciled out, that part of the technology developed rapidly.

So the claim that whoever closes the business loop first may see their intelligence develop fastest also makes sense; it is not just about the absolute amount of computing power.

Zhang Peng: Besides GPUs, what other chip approaches do you think are worth watching?

Xia Lixue: First of all, I think NVIDIA represents one direction: the GPU, with its massively parallel computing, is the most efficient kind of execution logic under the Transformer structure.

AMD, along with some domestic manufacturers, is also building GPU-like architectures, and I think there is definitely a lot of room there. Large models were born on GPU architectures, and in turn GPUs have developed rapidly thanks to the growth of large models.

The Transformer structure will not be quickly or substantially overturned; it has absorbed most of humanity's knowledge, and creating a new "god" to "fight" it would cost more. So right now no one has the incentive to build an entirely new architecture to disrupt the GPU.

Along this path, besides GPU architectures, there will also be people building hardware fully specialized for the Transformer structure, which is also worth looking forward to.

Zhang Peng: Some people mention SambaNova, which follows that idea of further specializing for the Transformer and building a complete system. Are you bullish on this type of company?

Xia Lixue: We still hope that more people will explore, which is conducive to the healthy development of the industry.

But there is a core problem here: hardware development must stay continuously coupled to real scenarios; you cannot really build a nuclear bomb in silence.

When you look at hardware's future development, you must see how it can follow a plannable path and continuously absorb new computing paradigms to achieve continuous iterative optimization.

Artificial intelligence provides a very good foundation for joint software-hardware optimization. In the previous era, the hardware and software of many tasks were designed separately. But because AI models are adjustable, the hardware's structure can be taken into account during design, so hardware that both meets the task and is computationally efficient can be designed with maximal efficiency.

This is the unique space AI provides for joint software-hardware design, and I think it will be of even greater value in the future.

04 Wuwen Xinqiong is committed to turning computing power and large models into basic resources like water and electricity

Zhang Peng: Where did the name Wuwen Xinqiong come from? It feels very romantic, not like the style of science and engineering students.

Xia Lixue: The character "Wu" (无) is the nickname of Tsinghua University's Department of Electronic Engineering, because the department's predecessor in the 1980s was the Department of Radio (无线电, literally "wireless"). "Wuwen" (无问) and "qiong" (穹) appear in the lyrics of the Tsinghua school anthem, and they also fit our company's ideals and vision. Hence the name.

Zhang Peng: In the chip field, what opportunities does Wuwen Xinqiong see, and what problems do you want to solve?

Xia Lixue: On the one hand, since the large model has unified model structure and generalized the task, a new requirement has emerged: optimization through joint software-hardware integration.

On the other hand, since the large model has thinned CUDA's ecological moat, and the domestic hardware and algorithm ecosystem is growing more prosperous, a gap has opened up in connecting models to hardware. End customers don't really care about models or computing power; they care about what the large model can bring to their application scenarios.

So Wuwen Xinqiong has two core tasks.

One is to connect different models with different hardware, what we call "MxN": achieving unified deployment and joint optimization between M different large models and N different pieces of hardware. This unites everyone into a joint force that provides better model and computing power services to the final industrial customers, and ultimately drives the explosion of large models in China's uniquely rich application scenarios.

The second task is making the economics work. Matching models to hardware is not enough; the core is how to calculate the costs so as to achieve ultimate performance. So after solving ease of use, the more important thing we do is deep optimization from the model down to the hardware.

These two points are the basic capabilities our team has accumulated, and they are why we are willing to step forward at this point in time to become this kind of company and push the development of the whole industry forward.

Zhang Peng: That sounds very similar to what CUDA does. What is the difference between you and CUDA?

Xia Lixue: You can understand it this way: CUDA solves the problem of how a bunch of dissimilar models can run on NVIDIA's chips, while we solve the problem of how a bunch of similar large models can run on different cards, ultimately delivered to developers as a resource like water, electricity, and gas. It is equivalent to unifying originally fragmented resources into one set of services and handing it to the customers who ultimately need computing power and models.

Just as you don't need to care whether the electricity you use comes from wind or thermal power generation; electricity itself is a unified resource. That is what we are doing.

Zhang Peng: It sounds like Wuwen Xinqiong is doing something like a middle layer. This work sounds in demand today, but will it be eroded by the model or computing power players in the future?

Xia Lixue: There are actually two points here.

First, computing power in China is in overall short supply. On one side, many software companies cannot find good computing power; on the other, the computing power many chip manufacturers produce cannot find good customers. Under this supply-demand relationship, the middle layer has great value, because it opens up the supply chain. That is the industrial value inherent in the middle layer itself.

Second, our team's core strength is optimization capability, ultimately delivering cost-effective, extreme optimization. Our team is very confident in cross-layer joint optimization from model to hardware, and is one of the strongest teams in this field.

We have accumulated experience here and want to work with upstream hardware manufacturers and downstream model manufacturers to solve the problem of deploying large models. Many models are ready now, but cost is the bottleneck.

This is a shared mission for our industry, and within it our optimization capability is very important. In the process of achieving that mission, we can already realize our industrial value.

05 The core task of large models this year is deployment, and the sticking point is cost-effectiveness

Zhang Peng: Baidu, Tencent, and Zhipu have all invested in Wuwen Xinqiong; it is rare to see a startup backed jointly by the industry's key players right out of the gate. How did those conversations go, and how did you reach such a clear consensus with them?

Xia Lixue: First of all, it must be because everyone still needs our team's technical accumulation. What this era of large models actually requires is for the economics to work out in the end, which involves a great deal of cost-effectiveness optimization. Our ability to jointly optimize models with hardware, and to make all kinds of cards usable for everyone, is exactly what the industry needs.

These investors are all downstream, scenario-oriented players, and we can help supplement their resources; that is our position in the industry.

Our core optimization capability, then, helps everyone achieve cost-effectiveness. The core task of large models this year is deployment, and the sticking point of deployment is cost-effectiveness. This requires us, model manufacturers, and hardware manufacturers to work together: model makers refine the models, hardware makers build better computing power, and what we do is make these refined models and this hardware fit each other better.

If in the end we can reduce the cost of large models by several orders of magnitude, we can drive the entire industry forward.

Zhang Peng: So do you think they ultimately recognized Wuwen Xinqiong's value because it can effectively solve the problem of diverse heterogeneous computing power, or because of long-term performance optimization?

Xia Lixue: I think the two are equally important, and they complement each other.

The ongoing shortage of computing power stems from everyone building ever-bigger models. We face a computing power shortage on one side and very high costs on the other, so both of these values will persist.

And in the current international situation, localization plus heterogeneity is a very clear path. Wuwen Xinqiong has firmly chosen it.

Zhang Peng: If you had joined a large model company, it would have become very competitive. Did you consider that at the start, and why did you ultimately found an independent company?

Xia Lixue: It is a bit like the difference of opinion in the Scaling Law discussion just now: the ultimate goal is to empower thousands of industries with large models, but there are different paths to get there.

You can choose to stack intelligence and capability to the extreme, build the best training infrastructure, and then gradually solve deployment. Or you can choose to make large models available to all walks of life right now.

Why be an independent middle-layer ecosystem? Because we want to do both. On the one hand, we work with large model manufacturers to help them explore the limits of intelligence. On the other hand, we hope to help existing software companies, the holders of data and scenarios, quickly adopt advanced technology. For example, a while ago we released our MaaS platform (Infini-AI), which makes it easy for small developers to use this computing power and these models. Doing such inclusive things helps the entire ecosystem make money quickly.

Infini-AI trial address: http://infini-ai.com

Zhang Peng: Who are Wuwen Xinqiong's customers, and how do you get them to understand its value?

Xia Lixue: We have many types of customers, including scenario customers in many industries.

For these customers, the core problem right now is how to combine their own scenarios with large models cost-effectively. So the core capability we provide is large model service resources that are sufficiently easy to use and cost-effective. Customers can use them from us out of the box, with ample supply. The underlying reason is that our core technical and product capabilities let us put all kinds of cards to use.

In practice, though, it often doesn't need to be explained to customers so explicitly, and they usually don't care much. However strong our technology is, it ultimately shows up as a product that is easy to use and cost-effective, and that is the most direct value we bring to customers.

06 Wuwen Xinqiong is an intelligent computing operator, and in the future every company will have its own intelligent computing resource department

Zhang Peng: MxN sounds like a very complicated undertaking, yet your team is confident it can do it. Where does that confidence come from?

Xia Lixue: Our team originated in the Department of Electronic Engineering at Tsinghua University: the company's founder, Professor Wang Yu, heads the department, and I am Professor Wang Yu's student.

In fact, since 2008, our lab has been doing joint software-hardware optimization for various scenarios, artificial intelligence being a very important one. Software-hardware co-optimization exists precisely to solve problems like "MxN", and we have been accumulating there for more than a decade.

It is just that the previous era had all kinds of models, and we were still mostly at the stage of academic research. We formed a methodology that could take each different small model and optimize it to the extreme. While that essentially required 100 optimizations for 100 models, we could run those 100 passes far more conveniently.

Today, the opportunity of large models tells us that the market no longer needs per-model work, but rather deeper optimization for this one large language model. We found that the technology accumulated over more than a decade could finally exert its power in a sufficiently focused and sufficiently large scenario. That makes founding a company to do this an overall ROI-positive move.

So at this point in time, we have the confidence to do it.

Zhang Peng: Many people understand what you do as a compiler. What kind of system does Wuwen Xinqiong rely on to create value? Can you summarize it in one sentence?

Xia Lixue: Words like "compiler" are more an interpretation of our technical positioning. Our technology stack is more than a compiler, and the final offering is more than software.

I think we are the equivalent of an operator in the field of intelligent computing: providing computing power and models as a basic resource, like water, electricity, and gas.

Zhang Peng: Can you talk more about the concept of operators?

Xia Lixue: At first, people thought infrastructure meant electricity, buildings, and servers. Later, computing power came to be seen as infrastructure too. Then, with the renewed explosion of large models, some say tokens are the basic resource of the future.

If computing power is the basic resource, then I am the equivalent of an operator, because I actually integrate heterogeneous, cross-regional computing power to serve customers. If in the end the token becomes the basic resource, then we are the supplier of that basic resource.

Our positioning looks somewhat different from different perspectives, which may depend on each person's prior industry background and the angle they are used to seeing from.

Zhang Peng: I have learned from the industry that at this time last year, helping a company deploy a private model could bring in tens of millions. But since the middle of last year, prices seem to have dropped to the millions, even hundreds of thousands. So I'd like to understand how the cost of training and using models has come down for enterprises over the past year, and what the future holds: will it fall exponentially, or linearly?

Xia Lixue: Prices did change last year, but interpreted properly, the change is not necessarily negative.

It may be because customer needs are changing. In the initial exploration stage, what you were solving was a very complex, comprehensive task for a large customer; the investment required and the functionality to be delivered were the heaviest and most complex, so the corresponding price was higher.

Later, some customers found that their requirements for the intelligence of large models were actually not that high: they didn't need to spend so much money hiring an "expert" when bringing in an "assistant" could solve plenty of problems.

So in a sense, this price change also reflects everyone's judgment about the level of return a large model can bring in their own scenario; it is the process of the pricing system continuously becoming richer and more complete. From the outside, it just looks like what used to cost tens of millions can now be bought for hundreds of thousands.

In the end, it comes down to what problems the large model solves in the deployment scenario, and how much investment the capabilities at each tier require. I think offerings at the tens-of-millions, millions, and hundreds-of-thousands tiers can all exist, just as Taobao carries different brands at different prices.

Zhang Peng: If intelligence is regarded as a kind of productivity, in what form will it exist within a company's organizational structure in the future? For example, by analogy with the human resources department, will there be an intelligent resource department?

Xia Lixue: This concept is quite forward-looking, but it also matches situations our customers are actually encountering now.

When large models first came out, everyone's need was to satisfy curiosity and learn the tools. At that stage, enterprises used large models without reaching the customization step, managing them more like standard, uniform machines.

But recently, many of our customers have clearly run into this kind of problem. These customers are not small, and they are highly digitalized. Many of their business units want to use large models, which effectively spins off many versions internally. At that point, coordinating the allocation of these resources becomes a problem. How do you do version control of models within the company, and can these versions coordinate with one another, even train one another? Like rotating staff through roles: learn some basic knowledge, then some product knowledge; can a model likewise be trained into a head of production and research? Career-planning problems for models are also problems our customers face. Because if everything is redone from scratch, you need N training programs for N models, which runs contrary to the whole idea of large models, and the cost to the enterprise is very high.

We have used some technical means to let different versions of the model exchange information and quickly generate specific internal versions.
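
He does not specify the technique, but one standard way to let one model version "teach" another is knowledge distillation, where a student model is trained to match a teacher's softened output distribution. A schematic numpy sketch, not necessarily what Wuwen Xinqiong actually uses:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-softened softmax; higher T exposes more of the teacher's
    'dark knowledge' about relative class similarities."""
    e = np.exp((logits - logits.max()) / T)
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between the teacher's and student's softened distributions."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -(p_teacher * np.log(p_student + 1e-9)).sum()

teacher = np.array([2.0, 0.5, -1.0])  # e.g., a department's specialized version
student = np.array([0.1, 0.2, 0.0])   # a new internal version being brought up to speed
print("distillation loss:", round(float(distill_loss(student, teacher)), 3))
```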

I think that in the future, beyond computing power as a resource, models will also become an important resource. How can this resource generate greater value, and how can it be upgraded and iterated? We will also customize training programs for it, just like cultivating a core employee.
