
Robin Li says open-source models will fall further and further behind. Why do so many people disagree?

Source|Jiazi Lightyear
The gap is narrowing, but open source may never fully catch up.

Author|Zhao Jian

Last week, Baidu's chairman and CEO Robin Li's remarks about the open source model sparked controversy.

Robin Li said at the Create 2024 Baidu AI Developer Conference: "Open-source models will fall further and further behind."

Robin Li explained that Baidu's foundation model, Wenxin 4.0, can be flexibly tailored to different needs in terms of quality, response speed, and inference cost, generating slimmed-down models suited to various scenarios while supporting fine-tuning and post-pretraining. Compared with using an open-source model directly, a model tailored from Wenxin 4.0 performs better at the same size and costs less at the same quality. On this basis, he predicted that open-source models will fall further and further behind.

But many AI practitioners don't agree with this conclusion. For example, Fu Sheng, chairman and CEO of Cheetah Mobile and chairman of Orion Star, quickly retorted in a video, saying that "the open source community will eventually defeat closed source".

The question of whether open source models can surpass closed-source models has been controversial since last year.

In May last year, foreign media reported a leaked Google document titled "We Have No Moat, and Neither Does OpenAI," which argued that while Google and OpenAI were still squabbling, open source had quietly been eating their lunch.

After Meta released the open-source large model Llama 2 last year, Yann LeCun, Meta's vice president and Chief AI Scientist, said that Llama 2 would change the market landscape of large language models.

People have pinned their hopes on the open-source community led by the Llama family of models. But to this day, the newly released Llama 3 still has not caught up with the most advanced closed-source model, GPT-4, although the gap between the two is already small.

"Jiazi Lightyear" interviewed a number of AI practitioners, and the common feedback was that whether open source or closed source is better depends on where one stands; it is not a simple binary question.

Open source versus closed source is not so much a technical question as a business-model question. Yet the current reality of large-model development is that neither open source nor closed source has found a viable business model.

So, what exactly does the future hold?

1. The gap is not widening, but narrowing

Which is stronger, open-source or closed-source models? Let's look at the objective rankings.

The most authoritative leaderboard in the large-model field is the LLM Arena, which uses the Elo rating system originally devised for chess. Its basic rule: a user poses any question to two anonymous models (e.g., ChatGPT, Claude, Llama) and votes for the one that answers better. The winning model earns points, and cumulative score determines the final ranking. Arena Elo has collected votes from more than 500,000 users.
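The Elo update behind such pairwise-vote rankings is simple enough to sketch in a few lines of Python. This is an illustrative approximation only; the arena's actual scoring (e.g., a Bradley-Terry-style fit over all votes) is more involved, and the K-factor of 32 here is a conventional chess value, not the arena's.

```python
def elo_update(r_a, r_b, score_a, k=32):
    """Update two Elo ratings after one head-to-head vote.

    score_a: 1.0 if model A's answer won, 0.0 if model B's won, 0.5 for a tie.
    Returns the pair of updated ratings (new_r_a, new_r_b).
    """
    # Expected score of A against B under the Elo logistic model.
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    new_r_a = r_a + k * (score_a - expected_a)
    # B's result and expectation are the complements of A's.
    new_r_b = r_b + k * ((1 - score_a) - (1 - expected_a))
    return new_r_a, new_r_b

# Two models start level at 1000; model A wins one vote.
print(elo_update(1000, 1000, 1.0))  # (1016.0, 984.0)
```

Note that points are zero-sum per vote: an upset win over a much higher-rated model transfers far more points than a win over an equal, which is why a model's final rank reflects the strength of the opponents it beat, not just its win count.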


Large model leaderboard, picture from LLM Arena screenshot

On the LLM Arena leaderboard, OpenAI's GPT-4 long dominated first place. Anthropic's newly released Claude 3 briefly took the top spot, but OpenAI soon released the latest version of GPT-4 Turbo to reclaim No. 1.

The top ten models on LLM Arena are essentially all closed-source. Only two open-source models squeeze into the top ten: one is the Llama 3 70B released by Meta last week, which ranks fifth and is the best-performing open-source model; the other is the recently released Command R+ from Cohere, the company founded by one of the eight authors of the Transformer paper, which ranks seventh. It is worth mentioning that Alibaba's open-source Qwen1.5-72B-Chat ranks twelfth and is the best-performing Chinese open-source model.

In absolute ranking, closed-source models are still ahead of open-source models. But the gap between the two is not widening; it is narrowing.


The gap between the closed-source model and the open-source model, image from X

Fang Han, chairman and CEO of Kunlun Wanwei, previously told "Jiazi Lightyear" that the gap between open-source and closed-source models has narrowed from about two years to only four to six months.

What factors affect the difference in capabilities between open source and closed source models?

Zhang Junlin, head of new-technology R&D at Weibo, believes the key variable is how steep the growth curve of model capability is. The steeper the curve (the faster capabilities improve per unit of time, like greater "acceleration" in the motion of an object), the more computing resources must be poured in over a short period.

Conversely, the flatter the capability growth curve, the smaller the difference between open-source and closed-source models, and the faster open source can catch up. This capability gap, determined by the steepness of the growth curve, can be called the "acceleration gap" of model capability.

Zhang Junlin believes that whether the gap between open source and closed source shrinks or widens over the next few years depends on technological progress in "synthetic data." If synthetic-data technology sees a breakthrough within the next two years, the gap is likely to widen; if not, the capabilities of open-source and closed-source models will converge.

Therefore, "synthetic data" is the single most critical and decisive technology for large language models over the next two years, bar none.

2. The "true and false open source" of the open source model

People's expectations for the open source model lie largely in the word "open source".

Open source is a decisive force behind the software industry's prosperity. As Zhou Hongyi, founder of 360 Group, said in a recent speech at Harvard University: "Without open source, there would be no Linux, no PHP, no MySQL, not even the Internet. In the development of artificial intelligence too, without Google open-sourcing the Transformer, there would be no OpenAI and no GPT. We have all, as individuals and as companies, grown up benefiting from open source."

However, this time the open source model may be going to disappoint many open source believers.

Shortly after Llama 2's release last year, critics were already saying that Meta's open source was actually "fake open source."

For example, Erica Brescia, managing director at Redpoint, an open-source-friendly venture capital firm, asked: "Can anyone explain to me how Meta and Microsoft can call Llama 2 open source when it neither uses an OSI (Open Source Initiative)-approved license nor meets the OSD (Open Source Definition)? Are they deliberately challenging the definition of OSS (open-source software)?"

Indeed, Llama 2 does not follow any such license; it ships its own set of "open-source rules" that prohibit using Llama 2 to train other language models, and any application or service with more than 700 million monthly users must obtain a special license from Meta.

Moreover, although Llama 2 claims to be an open-source model, it only releases the model weights, i.e., the parameters after training, without opening key information such as the training data and training code.

Lin Luqiang, head of open source at Zero One Everything, told "Jiazi Lightyear" that today's open-source models, compared with open-source software, sit in an intermediate state between closed and open: developers can fine-tune them and build RAG on top of them, but they cannot modify the model itself the way they can open-source software, let alone obtain its training data.

In "truly open source" software, a defining feature is shared source code: developers in the open-source community can not only report bugs but also contribute code directly.

For example, TiDB, a domestic open-source database, once shared a statistic: about 40% of the code updated each year was contributed by external contributors.

However, because the large model's algorithm is a black box, "semi-open-sourcing" only the model weights leads to one result: no matter how many developers use Llama 2, it does not help Meta improve Llama 3's capabilities or know-how, and Meta gains no data flywheel from Llama 2.

If Meta wants to train a stronger Llama 3, it can only rely on the talent, data, and GPU resources of its own team: it still has to run experiments (such as on scaling laws), collect more high-quality data, and build larger computing clusters. In essence, this is no different from OpenAI training the closed-source GPT-4.

As Robin Li said in an internal Baidu letter, open-source models cannot achieve what open-source software does, where "the flame rises high when everyone adds firewood."

Today, many makers of open-source models are aware of this problem. For example, when Google released Gemma, it deliberately called it an "open model" rather than an "open-source model." Google says open models offer free access to the model weights, but the terms of use, redistribution, and ownership of variants vary with each model's specific terms, which may not be based on an open-source license.

Cheng Cheng, head of AI Infra at Kunlun Wanwei, rated open-source models on Zhihu at the following levels:

Level 1: Only the model (the weights) is open source; the technical report lists only evaluations. This mainly benefits companies building applications (continued training and fine-tuning) and ordinary users (direct deployment).

Level 2: The technical report open-sources the training process, describing the key details of model training. This benefits algorithm research.

Level 3: The training code is open source, or the technical report gives full details, including the core information on data mixture. This information is invaluable know-how that would otherwise take enormous GPU resources to obtain.

Level 4: The full training data is open source. Other teams with computing resources can fully reproduce the model from the training data and code. Training data is arguably the core asset of a large-model team.

Level 5: The data-cleansing framework and pipeline are open source. The whole process from raw data sources (such as CC web pages and PDF e-books) to trainable data is open, so other teams can not only reproduce the preprocessing but also expand their data scale by adding more sources (such as full web pages crawled by search engines) to obtain a base model stronger than the original.

He noted that most open-source models, such as Llama 2, Mistral, and Qwen, only reach Level 1; a few, such as DeepSeek, reach Level 2. None reaches Level 4 or above: to date, no company has open-sourced all of its training data and data-cleansing code, so no open-source model can be fully reproduced by a third party.

The result is that the core secrets of model progress (data and data mixture) stay firmly in the hands of the model company, and no force from the open-source community, other than the company's own team, can help it train the next model better.

So we are back to a key question: if open source cannot enlist external forces to improve model performance, why open source at all?

3. What is the significance of open source models?

Open source or closed source does not determine a model's performance. Closed-source models are not leading because they are closed, and open-source models are not lagging because they are open. If anything, the causality runs the other way: a model goes closed-source because it is leading, and open-source because it is not leading enough.

Therefore, once a company builds a model with strong enough performance, it may well stop open-sourcing it.

Take the French star startup Mistral: its strongest open-source 7B model, Mistral-7B, and its first open-source MoE model, 8x7B (MMLU 70), are among the most talked-about models in the open-source community. But the models Mistral trained later, Mistral-Medium (MMLU 75) and Mistral-Large (MMLU 81), are both closed-source.

At present, both the best-performing closed-source models and the best-performing open-source models come from large companies, and among them Meta is the most committed to open source. If OpenAI's choice not to open-source is explained by commercial returns, what does Meta gain by open-sourcing its models for free use?

On last quarter's earnings call, Zuckerberg responded that Meta open-sources its AI technology to drive technological innovation, improve model quality, establish industry standards, attract talent, increase transparency, and support its long-term strategy.

Specifically, open source brings a number of strategic benefits.

First, open-source software is generally safer, more reliable, and more efficient, thanks to continuous feedback and scrutiny from the community. This matters because security is one of the most critical topics in AI.

Second, open-source software often becomes the industry standard, and when other companies build on Meta's technology stack, new innovations integrate more easily into Meta's products. This subtle advantage is a huge competitive edge.

Third, open source is popular with developers. Because technologists are eager to work on widely adopted open systems, Meta can attract more top talent and stay ahead in emerging technologies. Meanwhile, thanks to Meta's unique data and product integration, open-sourcing the Llama infrastructure does not weaken Meta's core competitiveness.

Meta is the large company most committed to open source, and also the one that has benefited most from it. Although training large models costs tens of billions of dollars, Meta's share price has risen by about 272% since it shifted its focus to open-source large models in 2023. Meta has reaped not only reputation from open source but also huge financial rewards.


Meta stock price chart, image from X

Meta's newly released Llama 3 is also open source. Beyond the smaller 8B and 70B models, the Llama 3 400B still in training is also likely to be open-sourced, and is expected to become the first open-source model to surpass GPT-4.

4. Closed source goes to C, open source goes to B

Whether it's an open-source or closed-source model, you need to find the right business model.

Today, a trend has gradually taken shape in the large-model industry: closed-source models lean toward to-C business, while open-source models lean toward to-B.

Yang Zhilin, founder of Moonshot AI (Dark Side of the Moon), once said that to build a super app in the consumer space, you must use a self-developed (closed-source) model, because "only a self-developed model can differentiate the user experience."

In Yang Zhilin's view, the open-source model is essentially a to-B customer-acquisition tool, or serves long-tail applications outside the super apps, where the advantages of data or scenarios can be exploited on top of an open-source model.

But open-source models cannot build product moats. For example, hundreds of overseas applications are built on the open-source diffusion model Stable Diffusion, yet none has truly broken out.

Second, a product built on an open-source model cannot continuously optimize the model through a data siphon effect, because open-source models are deployed in a distributed way and there is no centralized place to collect data.

By contrast, open-source models are better suited to landing in the to-B field.

Lin Luqiang, head of open source at Zero One Everything, told "Jiazi Lightyear" that to-B business earns money directly from customers and provides not products but services and solutions, i.e., customized service. Should those services be built on open source or closed source? To-B customers clearly prefer open-source models, which both save licensing costs and leave more room for customization.

The open-source model is often seen as the cheapest means of acquiring leads. Vendors can grow their user base with smaller open-source models (tens of billions of parameters or fewer), gaining sales leads and demonstrating technical strength; when customers need more customization, the model vendor can then sell more services.

Meanwhile, open source and closed source are not an either-or choice; many companies, such as Zhipu AI, Baichuan Intelligence, and Zero One Everything, have adopted a dual-track strategy of open source plus closed source.

Wang Xiaochuan believes that from a to-B perspective, both open source and closed source are needed: "In the future, 80% of enterprises will use open-source models, because closed-source models cannot adapt to their products as well, or cost too much; closed source can serve the remaining 20%. The two are not competitors but complements in different products."

Whether open source or closed source, the fundamental problem facing the commercialization of large models is how to reduce inference cost. Only when inference becomes cheap enough can large models truly be deployed at scale.

Today, the open-source and closed-source camps each have their supporters. But if we take the development trajectories of iOS and Android as a reference, healthy competition between the two greatly accelerated product iteration and upgrades in user experience. That is the ultimate value of the battle between open and closed source.

(Cover image source: Create 2024 Baidu AI Developer Conference)
