
The open-source champion strikes again with a "family bucket" bilingual LLM release! 34 billion parameters surpass Llama2-70B

Author: New Zhiyuan

Editor: Editorial Office

Just now, the 34-billion-parameter domestic large model Wudao Skyhawk Aquila2 shot straight to the top of the leaderboards, becoming the strongest open-source Chinese-English bilingual large model. What's more, this time KLCII has open-sourced not only the star model itself, but also a generous set of model "peripherals".

The strongest Chinese-English bilingual large model, open source!

Today, the Aquila language model series has been fully upgraded to Aquila2, and a new heavyweight member has been added - the 34 billion parameter Aquila2-34B.

Across 22 evaluation benchmarks spanning four dimensions - code generation, examinations, comprehension and reasoning, and language - Aquila2-34B takes the TOP 1 spot on many leaderboards.


That said, "comprehensively surpassing Llama 2" is hardly news anymore. More than benchmark scores, what the industry values is what a large model can actually do.

And on these practical capabilities, AquilaChat2's performance is just as eye-catching:

It not only has strong reasoning ability, but also greatly improved long-text handling; its powerful generalization lets it adapt to a variety of real-world application scenarios, including AI agents, code generation, and literature retrieval.

What's even more surprising is that KLCII has open-sourced not only the Aquila2 model series, but also the innovative training stack behind it, including the FlagScale framework and the FlagAttention operator library, as well as a new version of the semantic vector model BGE.


It is fair to say that releasing the innovative training algorithms and best practices alongside the models is unprecedented in the industry. This kind of "family bucket" open source can be called the conscience of the large-model open-source world.


Aquila2 full open-source addresses:

https://github.com/FlagAI-Open/Aquila2

https://model.baai.ac.cn/

https://huggingface.co/BAAI


Leading on 22 comprehensive benchmarks while using only about 1/2 the parameters and 2/3 the training data of Llama2-70B, and surpassing it and the other open-source base models - how did Aquila2-34B do it?

This is largely thanks to the high-quality corpus KLCII has accumulated over the years. Models pre-trained on this corpus show powerful overall capability, surpassing Tongyi Qianwen and Llama 2.

Architecture upgrades, algorithm innovations, and data iteration have pushed Aquila2's overall Chinese and English capabilities to a further breakthrough.


The Aquila2 base model provides a strong foundation for the AquilaChat2 dialogue model.

After fine-tuning on high-quality instruction datasets, AquilaChat2-34B has become the strongest open-source Chinese-English bilingual dialogue model available today, leading comprehensively in both subjective and objective evaluations.


SFT model evaluation results

In addition, AquilaChat2-34B shows several interesting traits: it not only has rich native knowledge of the Chinese-speaking world, but also gives more accurate, comprehensive and human-like answers.


When it comes to mastery of the Chinese-speaking world, AquilaChat2-34B can even outdo GPT-4.


Asked "how do you stir-fry tomatoes with screws", AquilaChat2-34B immediately and cleverly guessed that the user probably meant "scrambled eggs with tomatoes".

GPT-4, by contrast, only gets as far as reading it as "tomatoes stir-fried with luosifen (snail rice noodles)".


If you ask the models "what units of analysis could be used to study how easily college students find jobs", GPT-4's answer is blunt and simple: by major.

AquilaChat2-34B, more insightfully, points out that the unit of analysis could be industry, company type, rank, region, salary level, degree of professional fit, and so on.


Reasoning surpasses Llama 2, second only to GPT-4

When we will reach AGI is one of the hottest topics in the industry today.

How do we get there? One of the most critical factors is the reasoning ability of large models.

On the Integrated Reasoning Dataset (IRD) evaluation benchmark, more than a dozen popular models were compared comprehensively on the accuracy of both results and reasoning processes across inductive, deductive, abductive and causal reasoning.

The results show that AquilaChat2-34B ranks first under the IRD evaluation protocol, surpassing Llama2-70B-Chat, GPT-3.5 and other models, second only to GPT-4.


Evaluation results of SFT models on IRD datasets

Context window length, expanded to 16K

Handling long text input is one of the industry's most pressing problems right now.

How much text a model can take in directly determines how much "memory" it has; together with parameter count, this determines how well the model works in real applications.

To address this, KLCII took Aquila2-34B as the base, applied position-encoding interpolation, and ran SFT on 200K high-quality long-text dialogue samples, extending the model's effective context window to 16K.
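As a rough illustration of what position-encoding interpolation does (a generic sketch, not KLCII's actual code), the idea is to rescale token positions so that a longer input is squeezed back into the position range the model saw during pre-training:

```python
import torch

def rope_angles(positions, dim=128, base=10000.0, scale=1.0):
    """Rotary-embedding angles; `scale` < 1 squeezes long sequences back
    into the position range seen during pre-training (a generic sketch of
    position interpolation, not KLCII's exact implementation)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    scaled_pos = positions.float() * scale
    return torch.outer(scaled_pos, inv_freq)  # (seq_len, dim/2)

# Example: a model pre-trained on 4K positions handles 16K inputs
# by scaling positions with 4096 / 16384 = 0.25.
angles = rope_angles(torch.arange(16384), scale=4096 / 16384)
```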

Evaluation on LongBench's four Chinese and English long-text Q&A and long-text summarization tasks shows that AquilaChat2-34B-16K is at the leading level among open-source long-text models, approaching GPT-3.5.


Long text comprehension task assessment

Beyond that, it is well known that large models generally suffer from limited length extrapolation, which seriously restricts their long-text ability.

KLCII and a Peking University team visualized the attention distributions of several language models when processing ultra-long text. They found that every language model has a fixed relative-position bottleneck that is significantly smaller than its context window length.

To address this, the KLCII team proposed NLPE (Non-Linearized Position Embedding), which builds on RoPE by adjusting the relative position encoding and constraining the maximum relative length, thereby improving the model's length extrapolation.
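The article does not spell out NLPE's exact formulation, but the "constrain the maximum relative length" part can be illustrated generically: relative offsets beyond a chosen bound are clamped so attention never sees position gaps larger than those encountered during training. A minimal sketch under that assumption:

```python
import torch

def clamped_relative_positions(seq_len, max_rel=2048):
    """Generic illustration of constraining the maximum relative length:
    offsets beyond `max_rel` are clamped so distant token pairs are treated
    as if they were exactly `max_rel` apart. (A sketch of the general idea
    only; NLPE's actual formulation is not given in this article.)"""
    pos = torch.arange(seq_len)
    rel = pos[None, :] - pos[:, None]          # (seq_len, seq_len) offsets
    return rel.clamp(min=-max_rel, max=max_rel)

rel = clamped_relative_positions(8192, max_rel=2048)
```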

Experiments on text continuation across code, Chinese and English Few-Shot Learning, e-books and other domains show that NLPE can extrapolate the 4K-context Aquila2-34B model to 32K, and the fluency of the continued text is far better than with Dynamic-NTK, position interpolation and other methods.

As shown in the figure below, in instruction-following tests on HotpotQA, 2WikiMultihopQA and other datasets with lengths of 5K to 15K, AquilaChat2-7B (2K) extrapolated with NLPE reaches 17.2% accuracy, while the same model extended with Dynamic-NTK reaches only 0.4%.


Comparison of NLPE and the mainstream Dynamic-NTK extrapolation method on SFT models


Comparison of NLPE and the mainstream Dynamic-NTK extrapolation method on base models (lower PPL is better)

At the same time, KLCII has developed PiecewiseAttention, a piecewise attention operator for long-text inference, to efficiently support Attention-Map-oriented optimization methods such as NLPE, further reducing GPU memory usage and improving computation speed.

Strong generalization: no "high scores, low ability"

That said, many large models, however well they do on standardized tests, flounder when it comes to practical applications.

The Aquila2 models do well on the exams; but how do they perform in real-world application scenarios?

Keep in mind that a large model's ability to generalize, that is, to draw inferences beyond the cases it has seen, is crucial.


It means an LLM can still handle unseen tasks effectively and respond accurately beyond its training data.

If a model scores high on benchmarks but performs poorly in practice, that is, it is good at test questions but bad at solving real problems, that is exactly "high scores, low ability".

To evaluate Aquila2's generalization capability, the KLCII team validated it in three real-world application scenarios.


AI agents think for themselves in "Minecraft"

General-purpose agents that can learn a variety of tasks in an open environment are an embodiment of a model's key capabilities.

When it comes to testing agent tasks, the most common open-world game we can think of is, of course, "Minecraft".

It offers an endlessly generated, complex world and a large number of open-ended tasks, providing a rich interaction interface for agents.

In March this year, the KLCII team proposed Plan4MC, an efficient method for solving Minecraft multi-task problems without expert data.


Plan4MC trains the agent's basic skills with intrinsically rewarded reinforcement learning.

The agent then uses the reasoning capabilities of the large model AquilaChat2 to complete task planning.


For example, when the agent receives the task "chop wood, craft a crafting table and place it nearby", it interacts with AquilaChat2 over multiple rounds.

First, the agent establishes that its main task is to craft the crafting table, and builds a prompt from this, including the "current environment state" and the "task to be completed".


AquilaChat2 then receives the prompt and gives feedback, telling the agent which skill to use next and determining the next sub-task: find nearby wood.


After the agent finds the wood, the next sub-task is chopping it. Feeding the updated environment information back in, AquilaChat2 returns the name of the next skill.


Step by step, the agent keeps pushing toward the overall goal, interacting with AquilaChat2 until the task is complete.

In this way, with the help of AquilaChat2, the agent successfully crafted and placed the crafting table.
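A minimal sketch of this multi-round planning loop may help make it concrete. The function names (`chat`, `observe_environment`, `execute_skill`) are hypothetical stand-ins; the real Plan4MC and AquilaChat2 interfaces are not shown in this article:

```python
def plan_and_act(goal, chat, observe_environment, execute_skill, max_rounds=20):
    """Hypothetical agent <-> LLM planning loop in the spirit of Plan4MC."""
    for _ in range(max_rounds):
        state = observe_environment()                 # e.g. inventory, nearby blocks
        prompt = (
            f"Current environment state: {state}\n"
            f"Overall task: {goal}\n"
            "Which low-level skill should be used next? Reply with one skill name."
        )
        skill = chat(prompt).strip()                  # the LLM picks the next sub-task
        done = execute_skill(skill)                   # RL-trained basic skill runs in the game
        if done:                                      # e.g. the crafting table has been placed
            return True
    return False
```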


Aquila2 + BGE2: even complex literature can be searched

Complex literature searches have driven many researchers half-bald.


With retrieval based on traditional vector stores, large models can handle simple questions well.

But when faced with complex questions that require deep understanding, their capabilities are limited.

KLCII combines Aquila2 with the open-source semantic vector model BGE2 to crack this problem.

Retrieving papers by a particular author on a particular topic, or asking the model to generate a summary across multiple papers on a topic, is no longer a problem.


For example, ask Aquila2 to find Mirella Lapata's papers on "summarization".

Aquila2 immediately returns literature that meets the requirements.


Example of a complex query in the Aquila2 + BGE2 literature retrieval scenario
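Conceptually, such a retrieval-then-generate pipeline can be sketched as follows. The `embed` (a BGE-style encoder) and `generate` (an AquilaChat2 call) functions are hypothetical placeholders, not the actual KLCII APIs:

```python
import numpy as np

def search_papers(query, papers, embed, generate, top_k=5):
    """Retrieve the most relevant papers with vector similarity, then let the
    LLM answer over them. `embed` and `generate` are hypothetical callables."""
    query_vec = embed(query)                                   # (d,) normalized vector
    paper_vecs = np.stack([embed(p["abstract"]) for p in papers])
    scores = paper_vecs @ query_vec                            # cosine similarity if normalized
    best = np.argsort(-scores)[:top_k]
    context = "\n\n".join(papers[i]["title"] + "\n" + papers[i]["abstract"] for i in best)
    prompt = f"Based on these papers:\n{context}\n\nAnswer the query: {query}"
    return generate(prompt)
```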

AquilaSQL: the strongest "text-to-SQL" generation model

AquilaSQL, meanwhile, acts as a "translator", accurately turning users' natural-language instructions into well-formed SQL queries.

In this way, the threshold for data query and analysis is greatly reduced.

In practical scenarios, users can also build on AquilaSQL with secondary development, grafting it onto a local knowledge base to generate local SQL queries.

The model's data-analysis capability can also be improved further, so that it not only returns query results but goes on to produce analysis conclusions, charts and more.


The Aquila base model itself already has excellent code-generation capabilities.

Building on this, AquilaSQL went through continued pre-training on SQL corpora and two-stage SFT, ultimately topping the Cspider "text-to-SQL" leaderboard with 67.3% accuracy and surpassing the previous SOTA model; GPT-4 without SQL fine-tuning reaches only 30.8%.

In the example below, we asked AquilaSQL to compute "the average height of people living in Beijing with an income greater than 1,000" across three tables: height, income, and location.


AquilaSQL open source repository address: https://github.com/FlagAI-Open/FlagAI/tree/master/examples/Aquila/Aquila-sql

AquilaSQL successfully generated the multi-table query needed to complete this complex task.
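For a request like this, a correct multi-table query would look roughly like the sketch below; the table and column names are invented for illustration, since the article does not show the actual schema or AquilaSQL's exact output:

```python
# Hypothetical illustration of what a text-to-SQL model might return for the
# request above. Schema assumed: height(person_id, height_cm),
# income(person_id, income), location(person_id, city).
question = "Average height of people living in Beijing with an income greater than 1000"
expected_sql = """
SELECT AVG(h.height_cm) AS avg_height
FROM height AS h
JOIN income   AS i ON h.person_id = i.person_id
JOIN location AS l ON h.person_id = l.person_id
WHERE l.city = 'Beijing' AND i.income > 1000;
"""
print(expected_sql)
```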

"Family Bucket" level open source, the conscience of the industry

A side note: although Llama2 is also open source, its commercial license is not so friendly to Chinese users.

Moreover, Llama2 not only restricts commercial use in Chinese, but also caps the monthly active users of commercial deployments.


The Llama 2 commercial agreement expressly states that commercial use in languages other than English is not permitted

Aquila, in contrast, can be used commercially worldwide: it is neither as restrictive as Llama2, nor does it require form-filling like some other commercially usable models.

Moreover, many teams that open-source models do not release key details such as the hyperparameters and optimization schemes used in training. This time Aquila2 open-sources its innovative training stack in full: BGE, FlagScale and FlagAttention are all shared with developers.

With this set of tools, developers can easily reproduce Aquila2.


This unprecedented "family bucket" open source is simply the GOAT of the large-model open-source world!

The reason KLCII open-sources its training tools and algorithms without reservation comes from its positioning as a non-profit organization: to promote a thriving global large-model ecosystem through thorough, comprehensive open-source sharing.

The new generation of semantic vector model BGE2

BGE (BAAI General Embedding) is the semantic vector model that KLCII open-sourced in August this year.

This time, the new-generation BGE2 is open-sourced alongside Aquila2.

The BGE-LLM Embedder model in BGE2 integrates four capabilities: "knowledge retrieval", "memory retrieval", "example search", and "tool search".

For the first time, a single semantic vector model comprehensively covers the main retrieval demands of a large language model.

Combined with specific usage scenarios, BGE-LLM Embedder will significantly improve the performance of large language models in important areas such as handling knowledge-intensive tasks, long-term memory, instruction following, and tool use.
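As a rough idea of how such a semantic vector model is used for retrieval, here is a minimal sketch via Hugging Face transformers. The model id, the pooling choice (CLS token plus L2 normalization, as used by the BGE family) and the omission of instruction prefixes are assumptions; consult the official BGE repository for the recommended usage:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "BAAI/llm-embedder"          # assumed id for BGE-LLM Embedder
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

texts = ["How do I call a tool to query the weather?", "Weather API documentation ..."]
batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state              # (batch, seq, dim)
emb = torch.nn.functional.normalize(hidden[:, 0], dim=-1)  # CLS pooling + L2 norm
similarity = emb[0] @ emb[1]                               # cosine similarity
```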

FlagScale, an efficient parallel training framework

FlagScale is the efficient parallel training framework used to train Aquila2-34B; it provides one-stop training capabilities for large language models.

Thanks to the KLCII team's sharing, large-model developers can obtain Aquila2's training configurations, optimization schemes and hyperparameters from the FlagScale project.


FlagScale open source repository: https://github.com/FlagOpen/FlagScale

With this, KLCII also becomes the first large-model team in China to fully open-source its training code and hyperparameters.

FlagScale extends Megatron-LM with a number of enhancements, including re-partitioning of distributed optimizer state, precise localization of problematic training data, and conversion of parameters to Hugging Face format.

In actual measurements, Aquila2's training throughput and GPU utilization both reach industry-leading levels.


FlagScale training throughput and GPU utilization

FlagScale also employs a variety of parallelism techniques, such as data parallelism, tensor parallelism and 1F1B pipeline parallelism, to accelerate training, and uses BF16 mixed-precision training.

For performance optimization, FlagScale adopts FlashAttention V2, overlap of computation and communication, gradient accumulation and other techniques to significantly improve compute efficiency.
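For readers unfamiliar with these techniques, here is a generic PyTorch illustration of BF16 mixed precision combined with gradient accumulation; it is not FlagScale code, and it assumes a Hugging Face-style model that returns a `.loss`:

```python
import torch

def train_step(model, optimizer, batches, accum_steps=8):
    """Accumulate gradients over several micro-batches under BF16 autocast,
    then apply a single optimizer update (generic sketch, not FlagScale)."""
    optimizer.zero_grad()
    for i, (inputs, labels) in enumerate(batches):
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            loss = model(inputs, labels=labels).loss / accum_steps
        loss.backward()                       # gradients accumulate across micro-batches
        if (i + 1) % accum_steps == 0:
            optimizer.step()                  # one optimizer update per accumulation window
            optimizer.zero_grad()
```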

Going forward, FlagScale will keep in sync with the latest upstream Megatron-LM code, introduce more customized features, integrate the latest distributed training and inference technologies and mainstream large models, and support heterogeneous AI hardware.

The goal is a general, convenient and efficient distributed training and inference framework for large models that can serve training tasks of different scales and needs.

FlagAttention, an open-source operator library

In addition, FlagAttention is the first customized, high-performance open-source attention operator library built for long-text large-model training, implemented in the Triton language.

Targeting the needs of large-model training, it extends the memory-efficient attention operators of the FlashAttention series.

So far it has implemented the piecewise attention operator, PiecewiseAttention, which has already been adapted to a number of domestic chips, with more heterogeneous chips to follow.


FlagAttention open source repository: https://github.com/FlagOpen/FlagAttention

PiecewiseAttention mainly addresses the extrapolation problem of Transformer models that use rotary position embedding (RoPE/Roformer).

When the sequence length at inference time exceeds the maximum sequence length used in training, the attention weights between far-apart tokens grow abnormally large.

FlashAttention, however, cannot efficiently support this kind of piecewise treatment of attention scores, so the KLCII team developed the piecewise operator PiecewiseAttention, which large-model developers can use for more flexible attention-score processing.
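The repository documents the exact piecewise rule; as a naive, non-fused illustration of the general idea, attention scores for nearby token pairs and for distant pairs can simply come from two differently position-encoded key sets:

```python
import torch

def piecewise_scores(q, k_near, k_far, window=2048):
    """Toy illustration of piecewise score handling: pairs within `window` use
    normally encoded keys, more distant pairs use keys whose position encoding
    has been clamped (as in NLPE). The real PiecewiseAttention is a fused Triton
    kernel; this only sketches the kind of Attention-Map treatment it supports."""
    seq_len, dim = q.shape
    pos = torch.arange(seq_len)
    dist = (pos[:, None] - pos[None, :]).abs()
    scores_near = (q @ k_near.T) / dim ** 0.5    # normal position encoding
    scores_far = (q @ k_far.T) / dim ** 0.5      # e.g. clamped position encoding
    return torch.where(dist <= window, scores_near, scores_far)
```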

In short, PiecewiseAttention has the following features:

- Versatility: models that compute attention piecewise can migrate to it; it is not limited to the Aquila series of large language models.

- Ease of use: FlagAttention is implemented in the Triton language and provides a PyTorch interface, so building and installing it is easier than the CUDA-based FlashAttention.

- Extensibility: also thanks to Triton, the FlagAttention code has a low barrier to modification and extension, and developers can easily add new features on top of it.

Going forward, the FlagAttention project will continue to add attention operators with other functional extensions, further optimize operator performance, and adapt to more heterogeneous AI hardware, in response to the needs of large-model research.

Developer's Guide: Get started quickly with Aquila2

Aquila2 model weights & repository:

Usage method 1 (recommended): Load Aquila2 series models through FlagAI

https://github.com/FlagAI-Open/Aquila2

Usage method 2: download the weights directly from the FlagOpen model repository

https://model.baai.ac.cn/

Usage method 3: load Aquila2 series models through Hugging Face

https://huggingface.co/BAAI
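A minimal loading sketch for usage method 3, assuming a model id such as `BAAI/AquilaChat2-7B` (check the Hugging Face model cards under https://huggingface.co/BAAI for the official instructions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BAAI/AquilaChat2-7B"   # assumed model id; see the BAAI model cards
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

inputs = tok("What are the advantages of a bilingual Chinese-English LLM?",
             return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tok.decode(output[0], skip_special_tokens=True))
```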

The full Aquila2 series is compatible with a number of open-source projects in the large-model ecosystem:

• LoRA/QLoRA: lightweight fine-tuning techniques that both speed up large-model training and reduce memory footprint.

• vLLM: supports building high-throughput large-language-model services, streaming output, and single-machine multi-GPU as well as distributed parallel inference (see the sketch after this list).

• llama.cpp: supports non-GPU inference and 4-bit quantization, further lowering the bar for developers.
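And a rough sketch of serving an Aquila2 model with vLLM, as mentioned in the list above; the model id and arguments are assumptions, so consult the vLLM and Aquila2 documentation for supported versions and exact options:

```python
from vllm import LLM, SamplingParams

# Assumed model id; vLLM loads it from Hugging Face and handles batching/paging.
llm = LLM(model="BAAI/AquilaChat2-7B", trust_remote_code=True)
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Briefly introduce the Aquila2 model family."], params)
print(outputs[0].outputs[0].text)
```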