The spring of open source large models is coming?

Author: Data Ape

In contemporary artificial intelligence, if computing power is the fuel of AI, then large models are its engine. Computing power is the foundation on which AI systems run, enabling complex algorithms and models to execute, while large models convert that computing power into concrete intelligent output. An open-source model is an engine that everyone can use.

In the development of information technology, the role of the open source community is like a catalyst, which accelerates the sharing of knowledge, technological innovation and problem solving, so that a new technology can be iteratively improved in the shortest time. This spirit of openness and collaboration is also driving the development of large models today. As a cutting-edge technology in the field of artificial intelligence, the complexity and R&D cost of large models are relatively high, and the existence of open source communities has greatly lowered these thresholds.

In this article, we survey the state of open-source large models at home and abroad, compare the technical routes of different open-source models (especially the explorations and attempts of Chinese developers), and look for the development trends hidden behind the data.

The Evolution of Open Source Models – From Exploration to Breakthrough

Widespread attention to large models undoubtedly began with OpenAI's ChatGPT, and the GPT-3.5 and GPT-4 behind it are both closed-source models. It may look as if the entire large-model industry was set in motion by closed-source models, but the opposite is closer to the truth.

As early as 2018, Google open-sourced the BERT model based on the Transformer architecture, breaking the assumption that language models can only read text sequentially. With bidirectional input, BERT can be pre-trained on two different but related NLP tasks: masked language modeling (MLM) and next sentence prediction (NSP). This lets BERT build context from both directions and helps computers resolve ambiguity in text. BERT-large has 340 million parameters, is pre-trained on a large corpus of text, and can be fine-tuned on question-answering datasets. It also laid down the template of "large-scale parameters + pre-training + fine-tuning" that later models followed.
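
To make the masked-language-modeling idea concrete, here is a minimal sketch using the Hugging Face transformers library; the checkpoint name and example sentence are illustrative choices, not details from this article. BERT fills in a masked token by drawing on context from both sides.

```python
# Minimal sketch: masked language modeling with a pre-trained BERT checkpoint
# via the Hugging Face `transformers` library. Checkpoint and sentence are
# illustrative, not from the article.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT reads the whole sentence at once, so both the left and right context
# help it rank candidates for the masked position.
for candidate in fill_mask("The bank raised interest [MASK] last quarter."):
    print(candidate["token_str"], round(candidate["score"], 3))
```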

In 2019, OpenAI open-sourced GPT-2. Compared with BERT, GPT-2 has stronger generative capability and a larger pre-training dataset. Since then, open-source models have gradually moved in the direction of "more parameters, larger datasets, longer contexts".

Comparison of the BERT model and the GPT model

In 2020, OpenAI's GPT-3 was born with 175 billion parameters, and with its successors the idea of reinforcement learning from human feedback (RLHF) became deeply rooted in people's minds. Since then, OpenAI has led the way. Even so, OpenAI's CEO Sam Altman admits that OpenAI's ultimate goal is open artificial general intelligence.

Because the cost of training large models is so high, many enterprises choose to keep their models closed for business reasons, and some worry about the security risks of a model being abused once it is open-sourced. Despite this, the open-source community continues to thrive.

In February 2023, Meta open-sourced the LLaMA model, with up to 65 billion parameters trained on roughly 1.4 trillion tokens; its successor, Llama 2, raised the training data to 2.0T tokens.

In March 2024, xAI open-sourced the Grok-1 model, with 314 billion parameters, the largest parameter count among open-source models at the time.

At the same time, domestic open-source large models have also begun to emerge, with the GLM series from Tsinghua University and Zhipu AI among the representatives.

Global Perspective – Players of the Open Source Scale

According to a report published by the IISS, the main countries developing large models are China and the United States, followed by the United Kingdom, France, South Korea, Russia, and Israel, along with many multinational companies and research institutions. As the figure below shows, global demand for computing power for large models is growing rapidly, and countries have invested substantial resources in building their own models.

Trend chart of the global demand for computing power for large language models

Model developers have tried a variety of release schemes: not releasing the model at all (such as Google's limited disclosure of Bard as of March 21, 2023); restricting API output (for example, OpenAI opened API access to GPT-4 but tightly capped the number of calls within a fixed period); sharing models under restricted licenses (Meta's open-sourced LLaMA allows anyone to use it, and only requires a dedicated Meta license if a product exceeds 700 million monthly active users); and making the model fully available for download on the web (as the EleutherAI and BigScience research groups do).

In general, research institutes and multinational corporations are more inclined to open-source their models. For the former, open source not only promotes innovation but also reduces the risk of duplicated work. For the latter, open source lets a company demonstrate its technical strength, build brand influence, attract potential customers and partners (especially R&D talent), and even turn the open-source ecosystem into a competitive advantage. For example, Alibaba has both open-sourced its Qwen models and launched commercial versions, alongside other large models on Alibaba Cloud.

Balancing open source and commerce is not easy. The open-source side needs clear licenses and commercial-use terms. On the commercial side, users and developers must be able to understand the difference between the open-source and commercial models, which requires sufficient transparency, and there must be a strategy to ensure that commercial use of the open-source model does not create conflicts of interest with the community. Against the backdrop of rapid industry growth, the benefits of open source outweigh the drawbacks.

According to estimates by Large Model Home, the global large model market will reach 28 billion US dollars in 2024 and grow to 109.5 billion US dollars by 2028.

With such a huge market, how much of it does China account for?

The same forecast puts China's large model market at 21.6 billion yuan in 2024, rising to 117.9 billion yuan by 2028.

This market is not only vast but also growing rapidly, and China's large pool of engineers adds a favorable external environment for local large models to grow and expand. So what does the current competitive landscape look like?

Open Source Large Model Panorama – Core players and their models

On April 18, 2024, Meta released its latest open-source model, Llama 3, in 8 billion (8B) and 70 billion (70B) parameter versions. Llama 3 was trained on a dataset of over 15 trillion (15T) tokens, seven times the training data of Llama 2, with four times as much code.
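
As a side note on how accessible such releases are, here is a hedged sketch of loading the smaller Llama 3 checkpoint with the Hugging Face transformers library; the repository id and the gated-access step are assumptions based on how Meta has distributed recent Llama releases, not details from this article.

```python
# Hedged sketch: loading and sampling from an 8B-class open checkpoint with
# `transformers`. The repo id "meta-llama/Meta-Llama-3-8B" is an assumption;
# access is gated, so you must accept Meta's license and authenticate first.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Open-source large models are", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```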

Gemma, open-sourced by Google in February 2024, aims for the best performance in its size class with 2B and 7B parameter versions.

Mistral AI has open-sourced Mixtral 8x7B, billed as the world's first open large model with a "mixture-of-experts" (MoE) architecture, adding fresh fuel to the development of AI agents. The open-source LLM leaderboard on the Hugging Face website records many more players.
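
To illustrate the MoE idea in the simplest terms, the toy layer below routes each token to its top-k experts, so only a fraction of the total parameters is active for any given input. Shapes, expert counts, and routing details are illustrative assumptions, not Mixtral's actual implementation.

```python
# Toy sketch of a mixture-of-experts (MoE) layer: a router scores the experts
# per token and only the top-k experts run, which is how a model can carry
# many parameters while keeping per-token compute modest.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                       # x: (tokens, d_model)
        logits = self.router(x)                 # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e        # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(1) * expert(x[mask])
        return out

layer = ToyMoELayer()
print(layer(torch.randn(10, 64)).shape)         # torch.Size([10, 64])
```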

Back in China, Tsinghua University open-sourced GLM-130B, a Chinese-English bilingual pre-trained model built on the General Language Model (GLM) algorithm, in August 2022. In June 2023, Baichuan Intelligent released Baichuan-7B, an open-source, commercially usable pre-trained language model supporting both Chinese and English. In October 2023, Zhipu AI open-sourced the ChatGLM3 series. In November 2023, vivo open-sourced a large model with 7 billion parameters. In December 2023, Alibaba Cloud open-sourced the Qwen-72B, Qwen-1.8B, and Qwen-Audio models.

At present, there is no authoritative standard for evaluating large models; most results are obtained on a handful of test sets, and test sets are easy to overfit. To borrow the words of Yang Zhilin, founder of Moonshot AI (Kimi): the large model is the computer of a new era, with the parameter count as its CPU and the context length as its memory. From this perspective, the author has compiled the figures for the major open-source models at home and abroad as follows (as of January 2024):

(Tables: major open-source models at home and abroad, as of January 2024)

Technology path selection - multi-dimensional exploration of open source large models

At present, the vast majority of open-source models are based on the Transformer architecture, whose dominance has not been seriously challenged so far. There are dissenting voices, however, such as "the Transformer is too inefficient" or "the Transformer cannot achieve AGI". The reason is that the Transformer's advantage is also its disadvantage: the self-attention mechanism at the heart of the model, while powerful, comes with computational challenges. Its complexity grows quadratically with sequence length, so the computing resources and memory needed rise sharply when processing long input sequences or running in resource-constrained environments, which is one of the reasons behind the current shortage of computing power.
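
A minimal sketch of where the quadratic term comes from is shown below (dimensions are illustrative; real models add projections, multiple heads, and batching): every token is scored against every other token, so the attention matrix alone holds n x n entries.

```python
# Minimal sketch of single-head self-attention in NumPy. The (n, n) score
# matrix is the quadratic term: doubling the sequence length quadruples the
# memory and compute it needs.
import numpy as np

def self_attention(x):                       # x: (n, d); projections omitted
    scores = x @ x.T / np.sqrt(x.shape[1])   # (n, n) pairwise token scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                       # (n, d)

for n in (1_000, 4_000, 16_000):
    print(f"n = {n:>6}: attention matrix holds {n * n:,} entries")
```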

The Transformer's limitations have given rise to many alternative models, such as RWKV from China, Meta's Mega, Microsoft Research Asia's RetNet, Mamba, and DeepMind's Hawk and Griffin. These models were introduced one after another once the Transformer had come to dominate large-model development.

In January 2024, Yuanshi Intelligence, the company behind the open-source RWKV, completed a seed round of financing. RWKV is an RNN that delivers Transformer-level LLM performance; like GPT, it can be trained in parallel, combining the advantages of RNNs and Transformers. At a time when computing power is increasingly tight, such exploration is particularly necessary.

Mega can model sequences of more than one million bytes through its multi-scale decoder architecture, which lets it handle far longer sequences than traditional models. Because the amount of self-attention computation is reduced, Mega also generates text significantly faster.

RetNet is a new autoregressive architecture that introduces a multi-scale retention (MSR) mechanism to replace the multi-head attention in Transformers. RetNet does well on scaling curves and in-context learning, and its inference cost is independent of sequence length. It outperforms the Transformer in memory consumption, throughput, and latency, especially when the model size exceeds 2B.

Mamba is based on a selective state-space model, which decides which incoming inputs to focus on and which to ignore. Mamba is characterized by fast inference (about 5x the throughput of a Transformer) and linear scaling with sequence length. It excels at language modeling and is comparable to Transformer models twice its size.
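
The common thread in RWKV, RetNet, and Mamba is a fixed-size state updated once per token instead of an n x n attention matrix. The toy scan below is an assumption-laden illustration of that idea, not Mamba's actual parameterization or hardware-aware kernel: an input-dependent gate decides how much of each token to keep, at constant cost per step.

```python
# Toy selective-recurrence sketch: a fixed-size hidden state is updated once
# per token, so total cost grows linearly with sequence length. The gate is
# input-dependent, loosely mirroring the "focus on or ignore" idea; all
# parameters here are random placeholders, not a real trained model.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_state = 8, 16
W_gate = rng.normal(size=d_model)               # illustrative gate projection
A = np.full(d_state, 0.9)                       # decay applied to the state
B = rng.normal(size=(d_state, d_model)) * 0.1   # input -> state
C = rng.normal(size=(d_model, d_state)) * 0.1   # state -> output

def selective_scan(xs):                         # xs: (seq_len, d_model)
    h = np.zeros(d_state)
    ys = []
    for x in xs:                                # one constant-cost step per token
        gate = 1.0 / (1.0 + np.exp(-x @ W_gate))  # how much of this token to keep
        h = A * h + gate * (B @ x)
        ys.append(C @ h)
    return np.array(ys)

print(selective_scan(rng.normal(size=(1000, d_model))).shape)  # (1000, 8)
```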

Both Griffin and Hawk use a novel gated linear recurrent unit (RG-LRU), a new recurrent layer inspired by linear recurrent units, to build their recurrent blocks. Hawk mixes multilayer perceptrons (MLPs) with recurrent blocks; Griffin further blends MLPs, recurrent blocks, and local attention for greater efficiency. By combining recurrent blocks with local attention, Griffin and Hawk achieve strong performance and resource efficiency while retaining the efficiency of RNNs and the expressive power of Transformers, especially when working with long sequences and large-scale parameters.

On top of the Transformer architecture, open-source models differ mainly in three respects: data usage, training strategy, and optimization method. As the table above shows, many models start from pre-trained bases such as LLaMA or Baichuan and are fine-tuned on specialized datasets. Behind the final benchmark numbers lies fierce competition in data, computing resources, and algorithms.
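
A hedged sketch of the "open base model plus specialized dataset" recipe is shown below, using parameter-efficient LoRA fine-tuning via the peft library. The base checkpoint, dataset file, target modules, and hyperparameters are placeholders, not taken from the article or from any particular model's actual recipe.

```python
# Hedged sketch: LoRA fine-tuning of an open LLaMA-family base model on a
# domain dataset. Repo id, dataset path, and hyperparameters are placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_id = "openlm-research/open_llama_3b"        # placeholder open base model
tokenizer = AutoTokenizer.from_pretrained(base_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token    # needed for padding during batching
model = AutoModelForCausalLM.from_pretrained(base_id)

# LoRA trains small adapter matrices on the attention projections instead of
# updating all base parameters, which keeps the hardware bill manageable.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

data = load_dataset("json", data_files="domain_corpus.jsonl")["train"]  # placeholder data
data = data.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
                batched=True, remove_columns=data.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", per_device_train_batch_size=2,
                           num_train_epochs=1),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```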

Comparison of the development of large models in China and the United States

The comparison between China and the United States is, in short, a comparison between application and basic R&D: China is strong at putting applications into practice, while the United States leans toward foundational model R&D. Looking at specific industries, the characteristics of each side become even more pronounced.

First, the penetration rate of large models differs across industries. Two types of industries show high penetration: those whose data is strong in scale, quality, and diversity, such as office work and transportation, and those with high demand for technology and strong innovation capability, such as finance and entertainment.

The development of large models in China and the United States in their respective industries follows the same pattern. In the office field, for example, Microsoft has brought large model technology fully into Office, and domestic vendors such as Kingsoft Office have followed closely by integrating large models such as MiniMax and Baidu's Wenxin. In finance, the Agricultural Bank of China has launched ChatABC, a model with tens of billions of parameters. In these respects China and the United States look alike.

The differences are more interesting.

In education, for example, the United States tends to use AI to assist teachers, while China focuses more on exam-oriented study. Turnitin's Gradescope is an assignment-grading tool, while MathGPT, launched by Good Future, is China's first large model for mathematics.

In the medical industry, the penetration of large models in China is constrained by data and progress is slow, while the United States' advantage in data leads it to apply large models to medical research and development; Google's Med-PaLM is one example.

In the entertainment industry, development in the United States has met resistance over values, and China is expected to overtake on the curve. Ctrip launched "Ctrip Asks", the first vertical model for the travel industry, and the "Titian" model from Alibaba Entertainment powered the popularity of the Wonderful Duck camera product.

In transportation, China and the United States are in direct competition, especially in intelligent driving. With China's abundant basic data in transportation, synergies with electric vehicles, new energy, and related fields, and government policy support for basic data and computing power (cities such as Beijing and Shanghai have issued concrete measures to support AI development), large models in China's transportation sector are bound to strike the strongest note.

Looking to the future: development trends and challenges of open-source large models

Although open-source models let small and medium-sized developers apply them across a wide range of industries, I believe they cannot replace true general-purpose models. At present, the largest open-source model has around 314 billion parameters, while GPT-4 is estimated at more than 1.8 trillion. In performance, versatility, and the ability to handle complex tasks, no open-source model yet matches the large closed or proprietary models built for advanced applications and research.

However, the open-source model can still serve as a good starting point, as it has in the past. Especially where computing power is short, we often do not need to run such a large model, as shown by the 1.3B model Xiaomi put into its cars. The key is to create value.

Conclusion

Although a large number of open-source models have been released in China, their influence is not yet as great as that of foreign models. On the one hand, the huge domestic downstream market makes people more inclined to build applications and startups on top of leading companies' open-source models, while foreign players are stronger at basic research. On the other hand, constrained by talent, capital, and technology, China's primary-market investment in large-model projects is less active than abroad. In industry applications, beyond respecting how technology penetrates, it is the long-term accumulation of basic data that shapes the development of large models.

In the long run, China's AI field still lags in industry-level basic data and computing power, and reversing these disadvantages will not happen overnight. However, by playing to its own strengths, especially in Chinese-language applications, China can still stay one step ahead.
