Details of the technical principles of mainstream large language models

1. A comparison of the details of LLaMA, ChatGLM, Falcon, and other large language models: tokenizer, positional encoding, layer normalization, activation function, and more. 2. Distributed training techniques for large language models: data parallelism, tensor model parallelism, pipeline parallelism, 3D parallelism, the zero redundancy optimizer ZeRO, CPU offloading with ZeRO-Offload, mixed precision training, activation recomputation, Flash Attention, and Paged Attention. 3. Parameter-efficient fine-tuning techniques for large language models: prompt tuning, prefix tuning, adapters, LLaMA-Adapter, and LoRA.
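As a minimal taste of one of the fine-tuning techniques surveyed below, a LoRA-style low-rank update can be sketched in a few lines. This is an illustrative sketch only (the matrix sizes, helper names, and values are assumptions, not taken from the article): LoRA freezes the pretrained weight W and learns a low-rank delta B @ A, so the effective weight becomes W + (alpha / r) * B @ A.

```python
# Illustrative LoRA-style update sketch (pure Python, toy-sized matrices).
# The pretrained weight W stays frozen; only the low-rank factors A and B
# (rank r) would be trained, and their product is scaled by alpha / r.

def matmul(X, Y):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]

def lora_effective_weight(W, A, B, alpha, r):
    """Effective weight W + (alpha / r) * (B @ A); W itself is unchanged."""
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Frozen 2x2 pretrained weight and a rank-1 adapter (r = 1).
W = [[1.0, 0.0],
     [0.0, 1.0]]
B = [[1.0],           # d_out x r
     [2.0]]
A = [[0.5, 0.5]]      # r x d_in
W_eff = lora_effective_weight(W, A, B, alpha=2.0, r=1)
print(W_eff)  # -> [[2.0, 1.0], [2.0, 3.0]]
```

Because only A and B are updated, the number of trainable parameters scales with r * (d_in + d_out) instead of d_in * d_out, which is the core idea behind the efficiency of LoRA discussed in section 3.5.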

0. Outline

1. Details of large language models

1.0 Transformer and LLM

1.1 Model structure

1.2 Training objectives

1.3 Tokenizer

1.4 Positional encoding

1.5 Layer normalization

1.6 Activation Functions

1.7 Multi-query Attention and Grouped-query Attention

1.8 Parallel transformer block

1.9 Summary - Training stability

2. Distributed pre-training of LLM

2.0 Peer-to-peer communication and collective communication

2.1 Data parallelism

2.2 Tensor parallelism

2.3 Pipeline parallelism

2.4 3D parallelism

2.5 Mixed Precision Training

2.6 Activation recomputation

2.7 ZeRO: Zero Redundancy Optimizer

2.8 CPU offloading: ZeRO-Offload

2.9 Flash Attention

2.10 vLLM: Paged Attention

3. Parameter-efficient fine-tuning of LLMs

3.0 Why efficient parameter fine-tuning?

3.1 Prompt tuning

3.2 Prefix tuning

3.3 Adapter

3.4 LLaMA-Adapter

3.5 LoRA

3.6 Experimental comparison

4. References

Written by Spring

Source: WeChat public account: Tencent Technology Engineering

Source: https://mp.weixin.qq.com/s/P1enjLqH-UWNy7uaIviWRA
