
LoRA fine-tuning of ChatGLM2-6B on the P40

Author: JD Cloud developer

Background:

Applications of large models are currently blossoming everywhere, and the fastest way to put them to work is to fine-tune a model on data from your own vertical domain. ChatGLM2-6B stands out among domestically developed open-source large models. This article shares how to perform vertical-domain LoRA fine-tuning with the ChatGLM2-6B model on the group's EA P40 machines.

1. Introduction to ChatGLM2-6B

github: https://github.com/THUDM/ChatGLM2-6B

ChatGLM2-6B has several improvements over the first-generation ChatGLM-6B:

1. Stronger performance: the base model of ChatGLM2-6B has been upgraded compared with the first generation and achieves good results on various dataset evaluations;

2. Longer context: the context length of the base model is extended from ChatGLM-6B's 2K to 32K, and an 8K context length is used for training in the dialogue stage;

3. More efficient inference: based on Multi-Query Attention, ChatGLM2-6B achieves faster inference and lower GPU memory usage; with the official model implementation, inference speed is 42% higher than the first generation;

4. More open license: the ChatGLM2-6B weights are fully open for academic research, and free commercial use is permitted after registering via a questionnaire.

2. Fine-tuning environment

2.1 Performance Requirements

For inference, ChatGLM2-6B needs only about 14 GB of GPU memory at FP16 precision (roughly 6.2B parameters × 2 bytes ≈ 12.4 GB for the weights, plus activations and cache), so the 24 GB P40 can cover it.


The configuration of the P40 graphics card on the EA is as follows:

(Figure: P40 GPU configuration on EA)

2.2 Docker image environment

Before fine-tuning, the environment needs to be configured. I use a Docker image to load the environment; the specific configuration is as follows:

FROM base-clone-mamba-py37-cuda11.0-gpu

# mpich
RUN yum install -y mpich

# create my own environment
RUN conda create -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/ --override --yes --name py39 python=3.9
# display my own environment in Launcher
RUN source activate py39 \
    && conda install --yes --quiet ipykernel \
    && python -m ipykernel install --name py39 --display-name "py39"

# install your own requirement package
RUN source activate py39 \
    && conda install -y -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/ \
    pytorch  torchvision torchaudio faiss-gpu \
    && pip install --no-cache-dir  --ignore-installed -i https://pypi.tuna.tsinghua.edu.cn/simple \
    protobuf \
    streamlit \
    transformers==4.29.1 \
    cpm_kernels \
    mdtex2html \
    gradio==3.28.3 \
    sentencepiece \
    accelerate \
    langchain \
    pymupdf \
    unstructured[local-inference] \
    layoutparser[layoutmodels,tesseract] \
    nltk~=3.8.1 \
    sentence-transformers \
    beautifulsoup4 \
    icetk \
    fastapi~=0.95.0 \
    uvicorn~=0.21.1 \
    pypinyin~=0.48.0 \
    click~=8.1.3 \
    tabulate \
    feedparser \
    azure-core \
    openai \
    pydantic~=1.10.7 \
    starlette~=0.26.1 \
    numpy~=1.23.5 \
    tqdm~=4.65.0 \
    requests~=2.28.2 \
    rouge_chinese \
    jieba \
    datasets \
    deepspeed \
    pdf2image \
    urllib3==1.26.15 \
    tenacity~=8.2.2 \
    autopep8 \
    paddleocr \
    mpi4py \
    tiktoken

If you want to train with DeepSpeed, note that the mpich message-passing toolkit is missing on EA and has to be installed manually (hence the yum install step in the Dockerfile above).

2.3 Model Download

Huggingface Address: https://huggingface.co/THUDM/chatglm2-6b/tree/main
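
As a reference (not part of the original write-up), the weights can also be pulled programmatically with the huggingface_hub library; the repo_id below comes from the address above, and the cache path is only an example:

from huggingface_hub import snapshot_download

# download the ChatGLM2-6B weights; cache_dir is an example path, adjust as needed
model_dir = snapshot_download(
    repo_id="THUDM/chatglm2-6b",
    cache_dir="./chatglm2-6b",
)
print("model downloaded to:", model_dir)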

3. LoRA fine-tuning

3.1 Introduction to LoRA

paper: https://arxiv.org/pdf/2106.09685.pdf

LoRA (Low-Rank Adaptation of Large Language Models) is a fine-tuning method that freezes the pre-trained model's weights and, while keeping the original parameters frozen, adds extra trainable layers to the model, training only the parameters of these new layers.

(Figure: LoRA structure - frozen pre-trained weights with a low-rank bypass of matrices A and B)

The core ideas of LoRA:

  • Add a bypass branch next to the original PLM (Pre-trained Language Model) that performs a down-projection followed by an up-projection.
  • During training, the PLM's parameters are frozen and only the down-projection matrix A and the up-projection matrix B are trained. The input and output dimensions of the model stay unchanged, and at output time BA is added to the PLM's parameters.
  • Initialize A with a random Gaussian distribution and B with zeros, so that the bypass matrix BA is still a zero matrix at the beginning of training. A minimal PyTorch sketch of this idea follows below.
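
To make this concrete, here is a minimal, self-contained PyTorch sketch of a LoRA-wrapped linear layer. It only illustrates the idea described above; it is not code from the PEFT library or from the original project:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank bypass (illustrative sketch)."""

    def __init__(self, base: nn.Linear, r: int = 8, lora_alpha: int = 32, lora_dropout: float = 0.1):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                            # freeze the pre-trained weights
            p.requires_grad = False
        self.lora_A = nn.Linear(base.in_features, r, bias=False)    # down-projection A
        self.lora_B = nn.Linear(r, base.out_features, bias=False)   # up-projection B
        nn.init.normal_(self.lora_A.weight, std=0.02)               # A: random Gaussian init
        nn.init.zeros_(self.lora_B.weight)                          # B: zeros, so BA = 0 at the start
        self.dropout = nn.Dropout(lora_dropout)
        self.scaling = lora_alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # original output plus the scaled low-rank bypass B(A(x))
        return self.base(x) + self.lora_B(self.lora_A(self.dropout(x))) * self.scaling

# usage: wrap one projection of the model, e.g. LoRALinear(nn.Linear(4096, 4096), r=8)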

3.2 Fine-tuning

The PEFT library provided by Hugging Face makes it easy to fine-tune PLMs, and it is also used here to create the LoRA model.

PEFT's GitHub: https://gitcode.net/mirrors/huggingface/peft?utm_source=csdn_github_accelerator

Load the model and build the LoRA model:

from transformers import AutoTokenizer, AutoModel
from peft import LoraConfig, get_peft_model

# load the base model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(args.model_dir, trust_remote_code=True)
model = AutoModel.from_pretrained(args.model_dir, trust_remote_code=True)

print("tokenizer:", tokenizer)

# build the LoRA config
config = LoraConfig(
    r=args.lora_r,
    lora_alpha=32,
    lora_dropout=0.1,
    bias="none",
)

# wrap the base model with the LoRA adapters
model = get_peft_model(model, config)
# run in half precision on the target device
model = model.half().to(device)
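
One thing worth checking (not part of the original snippet): depending on the PEFT version, the target modules for ChatGLM2 may need to be spelled out explicitly, and print_trainable_parameters() confirms that only the LoRA matrices will be updated. A possible variant, under those assumptions:

from peft import LoraConfig, TaskType, get_peft_model

# variant of the config above; "query_key_value" is the fused attention projection
# in ChatGLM2's modeling code
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=args.lora_r,
    lora_alpha=32,
    lora_dropout=0.1,
    bias="none",
    target_modules=["query_key_value"],
)

model = get_peft_model(model, config)
model.print_trainable_parameters()   # only the LoRA A/B matrices are trainable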

Note that when loading a local model with Hugging Face, a cache/working directory is needed. On EA there is no permission to create it under the default .cache, so you first have to point the cache to a working path of your own (set these environment variables before importing transformers):

# point the Hugging Face cache directories to a writable working path
import os
os.environ['TRANSFORMERS_CACHE'] = os.path.dirname(os.path.abspath(__file__)) + "/work/"
os.environ['HF_MODULES_CACHE'] = os.path.dirname(os.path.abspath(__file__)) + "/work/"

           

If you want to train with DeepSpeed, choose the ZeRO stage you need in the configuration:

conf = {
    "train_micro_batch_size_per_gpu": args.train_batch_size,
    "gradient_accumulation_steps": args.gradient_accumulation_steps,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 1e-5,
            "betas": [0.9, 0.95],
            "eps": 1e-8,
            "weight_decay": 5e-4
        }
    },
    "fp16": {
        "enabled": True
    },
    "zero_optimization": {
        "stage": 1,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True
        },
        "allgather_partitions": True,
        "allgather_bucket_size": 2e8,
        "overlap_comm": True,
        "reduce_scatter": True,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": True
    },
    "steps_per_print": args.log_steps
}
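
For completeness, here is a minimal sketch of how such a config dict is typically handed to DeepSpeed (not taken from the original script; keyword names may vary slightly between DeepSpeed versions, and model is the LoRA-wrapped model from above):

import deepspeed

# initialize the DeepSpeed engine with the config dict above;
# only the trainable (LoRA) parameters are passed to the optimizer
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=[p for p in model.parameters() if p.requires_grad],
    config=conf,
)

# inside the training loop:
#   loss = model_engine(input_ids=batch["input_ids"], labels=batch["labels"]).loss
#   model_engine.backward(loss)
#   model_engine.step()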

The rest is data processing. What deserves the most attention is how the prompts are constructed; in my experience, prompt construction matters a great deal for domain fine-tuning and has a relatively large impact on the final model. One possible way to assemble the training samples is sketched below.
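
For illustration only, the sketch below shows one common way to turn an instruction-style record into a training sample, masking the prompt tokens out of the loss with -100. The field names (instruction, input, output), the prompt template, and max_seq_length are assumptions for the example, not taken from the original project:

# Sketch: build one supervised sample for causal-LM fine-tuning.
# The template and field names here are illustrative assumptions.
def build_sample(example, tokenizer, max_seq_length=1024):
    prompt = "Instruction: {}\nInput: {}\nAnswer: ".format(
        example["instruction"], example["input"])
    prompt_ids = tokenizer.encode(prompt, add_special_tokens=True)
    answer_ids = tokenizer.encode(example["output"], add_special_tokens=False)
    answer_ids = answer_ids + [tokenizer.eos_token_id]

    input_ids = (prompt_ids + answer_ids)[:max_seq_length]
    # compute the loss only on the answer tokens; -100 positions are ignored
    labels = ([-100] * len(prompt_ids) + answer_ids)[:max_seq_length]
    return {"input_ids": input_ids, "labels": labels}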

4. Fine-tuning results

At the time of writing, the model is still being fine-tuned with batch size 1 and 3 epochs, and one round of iteration has been completed.

(Figure: current fine-tuning results)

Author: JD Retail Zheng Shaoqiang

Source: JD Cloud Developer Community. Please indicate the source when reprinting.