
Baichuan Intelligence Releases Baichuan 3, a 100-Billion-Parameter Large Model That Surpasses GPT-4 on Chinese Evaluations

Author: DoNews

DoNews reported on January 29 that Baichuan Intelligence has released Baichuan 3, a large language model with more than 100 billion parameters. In a number of authoritative general-ability evaluations such as CMMLU, GAOKAO, and AGI-Eval, Baichuan 3 demonstrated excellent capabilities, surpassing GPT-4 on Chinese tasks in particular.

It also performed well on math- and code-focused evaluations such as MATH, HumanEval, and MBPP, demonstrating Baichuan 3's strength in natural language processing and code generation.

Moreover, its results on authoritative Chinese medical evaluations such as MCMLE, MedExam, and CMExam, which demand strong logical reasoning and deep domain expertise, also exceed GPT-4's, making it the best-performing large model on Chinese medical tasks.

Baichuan 3 also achieves a breakthrough in "iterative reinforcement learning" technology, further improving its semantic understanding and generation abilities. In poetry creation it excels at form, rhyme, and imagery, ahead of other large models.

Comprehensively improved base capabilities: results on a number of authoritative Chinese tasks surpass GPT-4

Baichuan 3 performed well on several English evaluations, reaching a level close to GPT-4. On a number of evaluation lists, including the Chinese benchmarks CMMLU and GAOKAO as well as HumanEval and MBPP, it surpassed GPT-4, demonstrating its advantage on Chinese tasks.


In addition, on alignment evaluations such as MT-Bench and IFEval, Baichuan 3 surpassed GPT-3.5, Claude, and other large models, placing it at an industry-leading level.


Compared with training models of billions or tens of billions of parameters, training a model with more than 100 billion parameters raises the requirements for high-quality data, training stability, and training efficiency by orders of magnitude. To address these problems, Baichuan Intelligence introduced a variety of innovative techniques during training, such as "dynamic data selection", "importance maintenance", and "asynchronous checkpoint storage", which effectively improved Baichuan 3's capabilities.

On high-quality data: traditional data screening relies on manually defined rules, selecting data through methods such as filtering, quality scoring, and textbook-style screening. Baichuan Intelligence believes that data optimization and sampling is a dynamic process that should be optimized jointly with the model's own training, rather than relying solely on manual, prior-driven sampling and screening.

To comprehensively improve data quality, Baichuan Intelligence designed a dynamic training-data selection scheme based on causal sampling, which dynamically selects training data during model training and greatly improves data quality.
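Baichuan has not published the details of its causal-sampling scheme, but the general idea of coupling data selection to the training loop can be sketched as follows. This is a minimal illustration only: the loss-based scoring criterion, the keep_ratio parameter, and the Hugging Face-style model interface are all assumptions, not the actual method.

```python
import torch
import torch.nn.functional as F

def dynamic_selection_step(model, candidate_batch, keep_ratio=0.5):
    """Score candidate examples with the current model and keep the most
    informative ones. Purely illustrative: Baichuan's actual causal-sampling
    criterion has not been published."""
    model.eval()
    with torch.no_grad():
        # Per-example next-token loss as a crude "informativeness" proxy.
        logits = model(candidate_batch["input_ids"]).logits  # (B, T, V)
        losses = F.cross_entropy(
            logits[:, :-1].transpose(1, 2),          # (B, V, T-1)
            candidate_batch["input_ids"][:, 1:],     # (B, T-1)
            reduction="none",
        ).mean(dim=1)                                # (B,)
    k = max(1, int(keep_ratio * losses.numel()))
    keep = losses.topk(k).indices  # prefer high-loss (harder) examples
    model.train()
    return {key: val[keep] for key, val in candidate_batch.items()}
```

The key point the article makes is that the selection score is recomputed with the evolving model, so the data mix shifts as training progresses instead of being fixed up front.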

On training stability: because models with more than 100 billion parameters have so many parameters, training often runs into problems such as exploding gradients, loss spikes or divergence, and failure to converge. To address this, Baichuan Intelligence proposed a "Salience-Consistency" progressive initialization method to ensure the model's stability in the early stage of training.

It also optimized the monitoring of the training process, introducing the "effective rank" of parameters alongside gradient and loss metrics to detect training problems in advance. This greatly accelerates the diagnosis of training issues and safeguards the final model's convergence.
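"Effective rank" is a standard quantity (Roy & Vetterli, 2007): the exponential of the entropy of a matrix's normalized singular values. A minimal monitoring helper might look like the sketch below; its use as a training-health signal follows the article's description, not a published Baichuan implementation.

```python
import torch

def effective_rank(weight: torch.Tensor, eps: float = 1e-12) -> float:
    """Effective rank = exp(entropy of normalized singular values).
    A collapsing effective rank during training can flag degenerate
    parameter updates before loss or gradient norms react."""
    s = torch.linalg.svdvals(weight.float())
    p = s / (s.sum() + eps)
    entropy = -(p * (p + eps).log()).sum()
    return entropy.exp().item()

# Example use: periodically log the effective rank of each 2-D parameter.
# for name, param in model.named_parameters():
#     if param.ndim == 2:
#         print(f"{name}: erank={effective_rank(param.data):.1f}")
```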

In addition, to keep a model of this scale training efficiently and stably on thousands of GPUs, Baichuan Intelligence jointly optimized the model's training stability and training framework and adopted an "asynchronous checkpoint storage" mechanism. This increases checkpoint frequency without performance loss and reduces the impact of machine failures on the training job: Baichuan 3's stable training runs exceeded one month, with fault recovery taking no more than 10 minutes.
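The core idea of asynchronous checkpoint storage is to snapshot the weights to host memory on the training thread (a fast device-to-host copy) and hand the slow disk write to a background thread. Below is a minimal single-process sketch assuming PyTorch; Baichuan's distributed implementation is not public.

```python
import threading
import torch

def _to_cpu(obj):
    """Recursively move tensors in a (possibly nested) state dict to CPU,
    so the background writer never races with in-place GPU updates."""
    if torch.is_tensor(obj):
        return obj.detach().cpu().clone()
    if isinstance(obj, dict):
        return {k: _to_cpu(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(_to_cpu(v) for v in obj)
    return obj

def async_checkpoint(model, optimizer, step, path):
    # Snapshot on the training thread: the device-to-host copy is much
    # cheaper than the disk write, so training stalls only briefly.
    snapshot = {
        "step": step,
        "model": _to_cpu(model.state_dict()),
        "optim": _to_cpu(optimizer.state_dict()),
    }
    # Persist in the background; training continues immediately.
    writer = threading.Thread(target=torch.save, args=(snapshot, path), daemon=True)
    writer.start()
    return writer  # caller should join() before reusing `path`
```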

On training efficiency, Baichuan Intelligence carried out a series of optimizations for parallel training at this scale: highly optimized RoPE and SwiGLU computation operators; overlapping parameter communication with computation in data parallelism; and overlapping activation-value communication with computation in sequence parallelism, effectively reducing the share of time spent on communication. It also introduced activation-value offloading in pipeline parallelism to address uneven memory usage across pipeline stages, reducing the number of pipeline partitions and significantly cutting the pipeline bubble (idle) rate. Through these innovations, Baichuan 3's training framework improved performance by more than 30% over mainstream frameworks in the industry.
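RoPE (Su et al., 2021) and SwiGLU (Shazeer, 2020) are published building blocks, even if Baichuan's fused kernels are not. For reference, this is the feed-forward computation that a fused SwiGLU operator implements, written in plain PyTorch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: FFN(x) = W2(SiLU(W1 x) * W3 x).
    An optimized operator fuses the three matmuls and the gating into
    fewer kernel launches; the math itself is unchanged."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden, bias=False)  # value projection
        self.w2 = nn.Linear(hidden, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```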

A medical dataset of more than 100 billion tokens; medical capability approaches GPT-4

From disease diagnosis and treatment to patient care and drug research and development, large models can help doctors improve the efficiency and quality of diagnosis and treatment, help patients get better service and experience, and help society reduce medical costs and risks while making medical resources more universally accessible.

Moreover, medical problems are highly specialized: knowledge is updated quickly, accuracy requirements are high, and individual differences are large, all of which fully exercise a large model's capabilities; Baichuan Intelligence calls medicine the "crown jewel of large models". Accordingly, leading large model companies such as OpenAI and Google treat medicine as a key training direction and an important benchmark for performance evaluation.

ChatGPT passed the United States Medical Licensing Examination (USMLE) as early as February 2023, demonstrating strong capabilities in the medical field. Google places even greater emphasis on medicine: it built the medical large model Med-PaLM on top of the PaLM model, and its successor Med-PaLM 2 scored more than 80 points on the MedQA medical exam, reaching expert level.

In the medical field, the all-round nature of large models plays a crucial role. First, multimodal learning can integrate diverse types of medical data such as text, images, and sound to support more comprehensive and accurate analysis and diagnosis. Second, the deep reasoning ability of large models can assist in complex medical decision-making.

In addition, stable performance and up-to-date knowledge ensure the reliability and timeliness of medical advice. The language understanding and generation capabilities of large models enable them to handle technical terminology and complex sentence patterns. Finally, pattern recognition and learning capabilities enable large models to learn and identify important patterns and features in complex medical data.

It is therefore not easy for a large model to perform well in the medical field: this requires not only rich medical knowledge and appropriate prompts, but also excellent logical reasoning in the model itself.

To inject rich medical knowledge into Baichuan 3, Baichuan Intelligence built a medical dataset of more than 100 billion tokens during the model's pre-training stage, including medical research literature, real electronic medical record data, professional medical books and knowledge-base resources, and medical Q&A materials. The dataset covers medical knowledge from theory to practice and from fundamentals to clinical application, ensuring the model's professional depth in the medical domain.

To address the problem of eliciting medical knowledge, Baichuan Intelligence conducted systematic research and tuning of prompts at the inference stage. Through accurate task descriptions and appropriate sample selection, the model produces more accurate outputs and clearer logical reasoning steps, which not only improves Baichuan 3's performance on a number of medical exams, but also provides users with more accurate and detailed answers in real medical Q&A scenarios.
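The article does not disclose Baichuan's prompts. As a generic illustration only, "accurate task description plus appropriate sample selection" typically means a few-shot prompt with worked reasoning, along these lines (the exemplar below is invented for illustration):

```python
MEDICAL_PROMPT = """You are a licensed physician. Answer the multiple-choice
question. Reason step by step, then give the single best option.

Example:
Question: A 55-year-old patient presents with crushing chest pain radiating
to the left arm. Which test should be ordered first?
Options: A. Chest X-ray  B. ECG  C. Abdominal ultrasound  D. Spirometry
Reasoning: The presentation suggests acute coronary syndrome; an ECG is the
fastest first-line test to detect ischemia.
Answer: B

Question: {question}
Options: {options}
Reasoning:"""

# Hypothetical inputs, purely for demonstration.
user_question = "Which electrolyte disturbance most commonly causes peaked T waves?"
option_text = "A. Hyponatremia  B. Hyperkalemia  C. Hypocalcemia  D. Hypomagnesemia"
prompt = MEDICAL_PROMPT.format(question=user_question, options=option_text)
print(prompt)
```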

On logical reasoning, Baichuan 3 surpassed GPT-4 on Chinese tasks in multiple authoritative math and code evaluations, fully demonstrating its strong foundational reasoning ability. With rich, high-quality medical knowledge that can be fully elicited through optimized prompts, combined with the reasoning power of a model with more than 100 billion parameters, Baichuan 3 significantly improved its results on medical tasks, with performance on various Chinese and English medical tests rising by 2 to 14 percentage points.

Baichuan 3 performed well across a number of authoritative medical evaluation tasks: its results on Chinese medical tasks such as MCMLE, MedExam, and CMExam exceed GPT-4's, and its results on English medical tasks such as USMLE and MedMCQA are close to GPT-4's level, making it the Chinese large model with the strongest medical capabilities.


A breakthrough in "iterative reinforcement learning": creative accuracy greatly improved

Semantic understanding and text generation, as the most basic underlying capabilities of large models, are the pillars of all other capabilities. To improve these two capabilities, the industry has done extensive exploration, and RLHF (reinforcement learning from human feedback) and RLAIF (reinforcement learning from AI feedback), introduced by OpenAI, Google, and Anthropic, are among the key technologies.

A model aligned with reinforcement learning can not only understand user instructions more accurately, especially instructions with multiple constraints and across multiple rounds of dialogue, but also further improve the quality of generated content. However, fully exploiting reinforcement learning in large models requires not only a stable and efficient RL training framework and high-quality preference (partial-order) data, but also a balance between exploration and exploitation to keep model capability climbing.

For these problems, Baichuan Intelligence conducted in-depth research and developed targeted solutions. For the reinforcement learning training framework, it built a PPO training framework with a dual-engine design that fuses training and inference and schedules multiple models in parallel. The framework supports efficient training of models with more than 100 billion parameters, with training efficiency 400% higher than mainstream frameworks in the industry.

For preference data, Baichuan Intelligence innovatively combines RLHF and RLAIF to generate high-quality preference (partial-order) data, achieving a better balance between data quality and data cost. On this basis, to address the fundamental challenge of exploration versus exploitation, it implements "iterative reinforcement learning" (iterative RLHF & RLAIF) by synchronously upgrading the PPO exploration space and the reward model's evaluation space. This reinforcement-learning-based version iteration further unlocks the potential of the base model beyond SFT and greatly improves Baichuan 3's semantic understanding and generative creation capabilities.
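At a high level, "iterative reinforcement learning" alternates PPO policy optimization with refreshing the reward model on newly labeled preference data, so that the reward model's evaluation space keeps pace with the policy's exploration space. Below is a schematic of the loop as described in the article; every function name is hypothetical, not Baichuan's API.

```python
def iterative_rlhf(policy, reward_model, prompts, steps, rounds=3):
    """Schematic of iterative RLHF/RLAIF as described in the article.
    `steps` bundles the four stage functions (all hypothetical here):
    ppo_train, sample, label_preferences, train_reward_model."""
    for _ in range(rounds):
        # 1. PPO pushes the policy into new regions of output space.
        policy = steps.ppo_train(policy, reward_model, prompts)
        # 2. Draw fresh responses from the updated policy.
        responses = steps.sample(policy, prompts)
        # 3. Label them with human and/or AI-judge preferences (RLHF + RLAIF).
        prefs = steps.label_preferences(responses)
        # 4. Retrain the reward model so its evaluation space covers
        #    the policy's new exploration space.
        reward_model = steps.train_reward_model(reward_model, prefs)
    return policy
```

The point of the iteration is the failure mode described below for single-round RLHF: a fixed reward model is eventually outrun by the policy it is supposed to score.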

Take Tang and Song poetry, among the most challenging tasks in text creation. As a treasure of traditional Chinese culture, classical poems not only carry strict constraints on form, tonal pattern, parallelism, and rhyme, but are also highly condensed in content and far-reaching in meaning.

Fine-tuning with SFT alone faces two problems: on the one hand, high-quality poetry training data demands extremely high expert costs; on the other, it cannot achieve good understanding of and compliance with the many constraints of tonal pattern, parallelism, and prosody. Moreover, the traditional single-round RLHF paradigm also struggles with Tang and Song poetry: responses generated by PPO during training may exceed the reward model's evaluation range, causing "exploration" to spin out of control.

Baichuan 3 combines RLHF & RLAIF with iterative reinforcement learning to bring the poetry-creation ability of large models to a new level: usability is 500% higher than the best model currently in the industry, and adherence to poetic form far exceeds GPT-4. Even for Song ci, a difficult genre with varied formats, deep structure, and rich prosody, the generated content is neatly paralleled and harmonious in rhyme. This lets anyone easily compose well-formed five-character poems and seven-character quatrains, which can both raise the public's humanistic literacy and help traditional Chinese culture truly "come alive" in the era of large models.


As a large language model with more than 100 billion parameters, Baichuan 3 not only reaches a level close to GPT-4 in English, but also surpasses GPT-4 on a number of general Chinese tasks, marking a new milestone for Baichuan Intelligence.

Baichuan 3's comprehensive general capabilities and strong performance in the medical field lay the groundwork for Baichuan Intelligence to build a "super application" and provide strong support for deploying large model technology in many complex application scenarios.
