
ChatGLM2-6B released: 8-32k context, 42% faster inference

Author: Open Source China (OSC)

The GLM technical team has announced another upgrade to ChatGLM-6B with the release of ChatGLM2-6B. ChatGLM-6B was released on March 14 and had been downloaded more than 3 million times on Hugging Face as of June 24.

As of June 25, the ChatGLM2 model topped the C-Eval leaderboard, which primarily evaluates the Chinese-language capabilities of LLMs, at Rank 0 with a score of 71.1, while the ChatGLM2-6B model placed at Rank 6 with a score of 51.7, the highest of any open-source model on the list.

[Image: C-Eval leaderboard]

ChatGLM2-6B is the second-generation version of the open-source Chinese-English bilingual dialogue model ChatGLM-6B. While retaining many excellent features of the first-generation model, such as smooth dialogue and a low deployment threshold, ChatGLM2-6B introduces the following new features:

  • More powerful performance: Building on the development experience of the first-generation ChatGLM model, the base model of ChatGLM2-6B has been fully upgraded. ChatGLM2-6B uses GLM's hybrid objective function and has undergone pre-training on 1.4T Chinese and English tokens plus human preference alignment training. Evaluation results show that, compared with the original model, ChatGLM2-6B achieves significant improvements on MMLU (+23%), C-Eval (+33%), GSM8K (+571%), BBH (+60%), and other datasets, making it highly competitive among open-source models of the same size.
  • Longer context: Using FlashAttention, the project team extended the base model's context length from ChatGLM-6B's 2K to 32K, and trained with a context length of 8K during the dialogue phase, allowing more rounds of dialogue (see the usage sketch after this list). The current version of ChatGLM2-6B still has limited understanding of single-turn ultra-long documents; the team will focus on optimizing this in subsequent iterations.
  • More efficient inference: Based on Multi-Query Attention, ChatGLM2-6B achieves faster inference and lower memory usage: with the official model implementation, inference speed is 42% higher than the first generation's, and under INT4 quantization, the dialogue length supported by 6 GB of video memory increases from 1K to 8K.
  • More open license: The ChatGLM2-6B weights are fully open to academic research, and commercial use is permitted with official written permission.
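As a quick illustration of the low deployment threshold and multi-round dialogue, below is a minimal usage sketch following the standard Hugging Face transformers pattern the project documents (the hub ID `THUDM/chatglm2-6b` and the `chat()` helper come from the project's own repository; `trust_remote_code=True` is needed because the repo ships custom model code):

```python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the FP16 model from the Hugging Face hub.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).half().cuda()
model = model.eval()

# Multi-round dialogue: feeding `history` back in carries earlier rounds
# into the next reply (and lets their KV cache be reused, see below).
response, history = model.chat(tokenizer, "Hello, introduce yourself briefly.", history=[])
print(response)
response, history = model.chat(tokenizer, "Now say that in one sentence.", history=history)
print(response)
```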

Evaluation results

The following are the results of the ChatGLM2-6B model on MMLU (English), C-Eval (Chinese), GSM8K (mathematics), and BBH (English).

[Table: evaluation results of ChatGLM2-6B on MMLU, C-Eval, GSM8K, and BBH]

Inference performance

ChatGLM2-6B uses Multi-Query Attention to improve generation speed. The average speed of generating 2000 characters is compared below.

[Table: average speed of generating 2000 characters, ChatGLM-6B vs. ChatGLM2-6B]

Multi-Query Attention also reduces the memory footprint of the KV cache during generation. In addition, ChatGLM2-6B uses a causal mask for dialogue training, which lets the KV cache from previous rounds be reused in continuous conversation, further reducing memory usage. As a result, when running INT4-quantized inference on a graphics card with 6 GB of video memory, the first-generation ChatGLM-6B could generate at most 1119 characters before running out of memory, while ChatGLM2-6B can generate at least 8192 characters.

[Figure: dialogue length supported by 6 GB of video memory under INT4 quantization]
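To make the memory argument concrete, here is a minimal PyTorch sketch of Multi-Query Attention (illustrative only, not the actual ChatGLM2-6B implementation): every query head shares a single key/value head, so a cached token needs one K and one V vector instead of one per head, and the causal mask matches the dialogue-training setup described above.

```python
import torch

def multi_query_attention(x, w_q, w_k, w_v, n_heads):
    """Minimal Multi-Query Attention: n_heads query heads, one shared K/V head."""
    b, s, d = x.shape
    head_dim = d // n_heads

    q = (x @ w_q).view(b, s, n_heads, head_dim).transpose(1, 2)  # (b, h, s, hd)
    # Single shared key/value head: this is all the KV cache must store,
    # i.e. n_heads times less than standard multi-head attention.
    k = (x @ w_k).view(b, s, 1, head_dim).transpose(1, 2)        # (b, 1, s, hd)
    v = (x @ w_v).view(b, s, 1, head_dim).transpose(1, 2)        # (b, 1, s, hd)

    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5           # K/V broadcast over heads
    # Causal mask: each position may attend only to itself and earlier tokens.
    causal = torch.triu(torch.ones(s, s, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))

    out = scores.softmax(dim=-1) @ v                             # (b, h, s, hd)
    return out.transpose(1, 2).reshape(b, s, d)

# Example: d_model=128 with 8 query heads -> K/V projections are 8x narrower.
x = torch.randn(2, 16, 128)
w_q = torch.randn(128, 128)
w_k = torch.randn(128, 16)   # d_model -> head_dim, one shared head
w_v = torch.randn(128, 16)
print(multi_query_attention(x, w_q, w_k, w_v, n_heads=8).shape)  # (2, 16, 128)
```

Under standard multi-head attention, the KV cache for the same sequence would be n_heads times larger; the shared head is where the lower footprint, and the headroom to reuse previous rounds' cache, comes from.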

The project team also tested the impact of quantization on model performance. The results show that its effect on performance is within an acceptable range.

[Table: model performance under quantization]
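For readers who want to reproduce the low-memory setup, below is a sketch of loading the model with INT4 quantization, assuming the `quantize()` helper that the project's custom model code exposes (method names live in the repository, not in transformers itself):

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
# Quantize the weights to INT4 before moving to the GPU; per the figures above,
# this fits a ~6 GB card and supports dialogue lengths up to 8K.
model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).quantize(4).cuda()
model = model.eval()

response, history = model.chat(tokenizer, "Hello!", history=[])
print(response)
```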

Example comparison

Compared with the first-generation model, ChatGLM2-6B has improved capabilities across multiple dimensions. Here are some comparison examples.

Mathematical logic

[Screenshots: ChatGLM-6B vs. ChatGLM2-6B on a mathematical logic prompt]

Knowledge reasoning

[Screenshots: ChatGLM-6B vs. ChatGLM2-6B on a knowledge reasoning prompt]

Long document understanding

[Screenshots: ChatGLM-6B vs. ChatGLM2-6B on a long document understanding prompt]