
How to fine-tune large language models (LLMs) with Hugging Face

Author: Shining Star AK

Large language models (LLMs) have made significant progress over the past year. Since the release of ChatGPT, many open LLMs have followed, such as Meta AI's Llama 2, Mistral AI's Mistral and Mixtral models, and TII's Falcon. These models can handle a wide range of tasks, including chatbots, question answering, and summarization, without any additional training. However, if you want to customize a model for your application, you may need to fine-tune it on your own dataset to get higher-quality results than you would by prompting it directly or by training a smaller model from scratch.

In this article, we'll show you how to fine-tune an open large language model using Hugging Face's TRL, Transformers, and Datasets libraries. We will work through the following steps: defining our use case, setting up the development environment, creating and preparing the dataset, fine-tuning the model with trl and the SFTTrainer, testing and evaluating the model, and deploying it to production.

Note: This article is written to run on consumer-grade GPUs (24GB), such as NVIDIA A10G or RTX 4090/3090, but can also be easily adapted to run on larger GPUs.

1. Define our use cases

When fine-tuning an LLM, it's important to understand your use case and the problem you're trying to solve. This will help you choose the right model, or help you create a dataset to fine-tune it. If you haven't defined your use case yet, you may want to go back to the drawing board: not every use case requires fine-tuning, and it's worth evaluating an already fine-tuned model or an API-based model before fine-tuning your own.

As an example, we'll use the following use case:

We want to fine-tune a model that can generate SQL queries from natural language instructions, which can then be integrated into our BI tools. The goal is to reduce the time it takes to create SQL queries and to make it easier for non-technical users to write them.

Converting natural language to SQL queries is a good use case for fine-tuning LLMs because it is a complex task that requires a deep understanding of the data and the SQL language.
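
To make the task concrete, here is a hypothetical example of the kind of input and output we want from the fine-tuned model. The schema, question, and query below are illustrative only; the actual training samples come from the dataset we prepare in step 3.

# Illustrative example only; real samples come from the dataset prepared in step 3
example = {
    "schema": "CREATE TABLE head (age INTEGER)",
    "question": "How many heads of the departments are older than 56?",
    "sql": "SELECT COUNT(*) FROM head WHERE age > 56",
}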

2. Set up the development environment

First, we need to install Hugging Face's libraries, including trl, transformers, and datasets, as well as PyTorch. trl is a newer library built on top of transformers and datasets that simplifies fine-tuning, RLHF (reinforcement learning from human feedback), and alignment of open LLMs. If you're not familiar with trl yet, don't worry; it is designed to make the fine-tuning process easier.

# Install PyTorch and other libraries
!pip install "torch==2.1.2" tensorboard

# Install Hugging Face libraries
!pip install --upgrade \
  "transformers==4.36.2" \
  "datasets==2.16.1" \
  "accelerate==0.26.1" \
  "evaluate==0.4.1" \
  "bitsandbytes==0.42.0" \
  # "trl==0.7.10" # \
  # "peft==0.7.1" \

# Install peft & trl from GitHub
!pip install git+https://github.com/huggingface/trl@a3c5b7178ac4f65569975efadc97db2f3749c65e --upgrade
!pip install git+https://github.com/huggingface/peft@4a1559582281fc3c9283892caea8ccef1d6f5a4f --upgrade

If your GPU is based on the Ampere architecture (such as the NVIDIA A10G or RTX 4090/3090) or newer, you can take advantage of Flash Attention. Flash Attention significantly speeds up the attention computation and reduces memory consumption by optimizing how attention is computed, using techniques such as tiling and recomputation. In short, it can speed up training by up to three times. For more details, see the official FlashAttention page.

Note: If your machine has less than 96GB of RAM but many CPU cores, you may need to reduce the value of MAX_JOBS. On the g5.2xlarge instance we used for testing, we set it to 4.

import torch; assert torch.cuda.get_device_capability()[0] >= 8, 'Hardware not supported for Flash Attention'
# install flash-attn
!pip install ninja packaging
!MAX_JOBS=4 pip install flash-attn --no-build-isolation           

It may take a while (about 10 to 45 minutes) to install Flash Attention.
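
Once the build finishes, a quick check (a small addition, not part of the original walkthrough) confirms that the package imports correctly:

# verify that flash-attn is importable and print its version
import flash_attn
print(flash_attn.__version__)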

We will be leveraging Hugging Face Hub as a remote model version control service. This means that during the training process, our models, logs, and related information will be automatically uploaded to the Hugging Face Hub. In order to use this service, you will need to register an account on Hugging Face. Once registration is complete, we'll use the login tool included in the huggingface_hub package to log in to your account and save your access token on your local disk.

from huggingface_hub import login

login(
  token="", # 在此处添加您的token
  add_to_git_credential=True
)           

3. Create and prepare the dataset

Once you've determined that fine-tuning is the right solution, we need to prepare a dataset to train our model. This dataset should contain a diverse set of demonstrations of the task you want to solve. There are several ways to create such a dataset, for example:
  • Leverage existing open-source datasets such as Spider
  • Generate synthetic datasets, such as Alpaca, with large language models
  • Hire humans to create datasets, such as Dolly
  • Combine the above methods, such as Orca

Each method has its own advantages and disadvantages and depends on your budget, time, and quality requirements. For example, using an existing dataset is the easiest option, but it may not match your specific use case, while creating a dataset manually is accurate but costly and time-consuming. You can also combine several of these methods to create an instruction dataset, as was done for Orca.

In our example, we'll use an existing dataset called sql-create-context, which contains natural language instructions, database schema definitions, and corresponding SQL query samples.

With the release of the latest version of TRL, we now support popular instruction and conversation dataset formats. This means that we only need to convert the dataset to one of the supported formats, and TRL will automatically handle the next steps. Supported formats include:

  • Conversation format
{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}           
  • Instruction format
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}{"prompt": "<prompt text>", "completion": "<ideal generated text>"}{"prompt": "<prompt text>", "completion": "<ideal generated text>"}           

We'll use the Hugging Face Datasets library to load our open-source dataset and convert it into the conversational format, including the database schema definition in the system message for our assistant. We then save the dataset as JSON files so they can be used to fine-tune our model. We randomly downsample the dataset to 12,500 samples, which we'll split into 10,000 training and 2,500 test samples.

Note: If you already have a suitable dataset, e.g. from your existing use of OpenAI, you can skip this step and go straight to fine-tuning.

from datasets import load_dataset

# Convert dataset to OAI messages
system_message = """You are a text-to-SQL query translator. Users will ask you questions in English and you will generate a SQL query based on the provided SCHEMA.
SCHEMA:
{schema}"""

def create_conversation(sample):
  return {
    "messages": [
      {"role": "system", "content": system_message.format(schema=sample["context"])},
      {"role": "user", "content": sample["question"]},
      {"role": "assistant", "content": sample["answer"]}
    ]
  }

# Load dataset from the hub
dataset = load_dataset("b-mc2/sql-create-context", split="train")
dataset = dataset.shuffle().select(range(12500))

# Convert dataset to OAI messages
dataset = dataset.map(create_conversation, remove_columns=dataset.features, batched=False)
# Split dataset into 10,000 training samples and 2,500 test samples
dataset = dataset.train_test_split(test_size=2500/12500)

print(dataset["train"][345]["messages"])

# Save datasets to disk
dataset["train"].to_json("train_dataset.json", orient="records")
dataset["test"].to_json("test_dataset.json", orient="records")           

4. Use trl and SFTTrainer to fine-tune large language models

Now we're ready to fine-tune our model. We'll use the SFTTrainer from trl, which simplifies supervised fine-tuning of open large language models. SFTTrainer is a subclass of the Trainer from the transformers library and inherits all of its core features, such as logging, evaluation, and model checkpointing, while adding several useful extras, including:

  • Dataset format conversion, with support for conversational and instruction formats
  • Training on completions only, ignoring the prompt tokens (see the sketch after this list)
  • Packing datasets for more efficient training
  • Parameter-efficient fine-tuning (PEFT) support, including QLoRA
  • Preparing the model and tokenizer for conversational fine-tuning (e.g. adding special tokens)
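
Completion-only training, mentioned in the list above, is handled in trl via the DataCollatorForCompletionOnlyLM collator. We don't use it in this article because we rely on packing instead, but a minimal sketch might look like the following, reusing the model, tokenizer, args, and dataset objects we create below; the response_template value is an assumption and has to match how your chat template marks the assistant turn:

# Sketch only (not used in this article): mask prompt tokens and train on completions.
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

collator = DataCollatorForCompletionOnlyLM(
    response_template="<|im_start|>assistant",  # assumption: must match your chat template
    tokenizer=tokenizer,
)
trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    data_collator=collator,
    packing=False,  # completion-only masking cannot be combined with packing
)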

In our example, we'll take advantage of the dataset format conversion, packing, and PEFT features. We'll use QLoRA, a technique that quantizes the base model to reduce its memory footprint during fine-tuning while preserving model quality.

First, we'll load our JSON format dataset from disk.

from datasets import load_dataset

# Load jsonl data from disk
dataset = load_dataset("json", data_files="train_dataset.json", split="train")           

Next, we'll load our large language model. For our use case, we chose CodeLlama 7B, a model trained specifically for code synthesis and understanding. If you prefer another model, such as Mistral, Mixtral, or TII's Falcon, you can switch by simply changing the model_id. We'll use bitsandbytes to quantize the model to 4 bits and reduce its memory requirements.

Note that the larger the model, the more memory it requires. In our example, we use the 7B version, which can be fine-tuned on a GPU with 24GB of memory. If you have less GPU memory, you may want to consider a smaller model.
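
As a rough back-of-the-envelope estimate (weights only; activations, the LoRA optimizer state, and CUDA overhead come on top of this), a 4-bit quantized model needs about half a byte per parameter:

# rough estimate of the memory needed for the quantized weights alone
params = 7e9            # 7B parameters
bytes_per_param = 0.5   # 4-bit quantization ~ 0.5 bytes per parameter
print(f"~{params * bytes_per_param / 1e9:.1f} GB")  # ~3.5 GB for the weights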

It is important to properly prepare the model and tokenizer for training chat/conversational models. We need to add new special tokens to the tokenizer and model to teach them the different roles in a conversation. trl provides a convenient helper for this, setup_chat_format, which:

  • Adds special conversation tokens to the tokenizer, such as <|im_start|> and <|im_end|>, to mark the beginning and end of a turn.
  • Resizes the model's embedding layer to accommodate the new tokens.
  • Sets the tokenizer's chat_template, which is used to format inputs into a chat-like structure. The default is chatml, the format introduced by OpenAI.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from trl import setup_chat_format

# Hugging Face model id
model_id = "codellama/CodeLlama-7b-hf" # or `mistralai/Mistral-7B-v0.1`

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = 'right' # to prevent warnings

# Set chat template to OAI chatML; remove this if you start from an already fine-tuned model
model, tokenizer = setup_chat_format(model, tokenizer)           

SFTTrainer's integration with peft makes it very easy to fine-tune LLMs efficiently with QLoRA. We just need to create a LoraConfig and provide it to the trainer. Our LoraConfig parameters are based on the QLoRA paper and Sebastian Raschka's experiments.

from peft import LoraConfig

# LoRA config based on the QLoRA paper and Sebastian Raschka's experiments
peft_config = LoraConfig(
        lora_alpha=128,
        lora_dropout=0.05,
        r=256,
        bias="none",
        target_modules="all-linear",
        task_type="CAUSAL_LM",
)           

Before we start training, we need to define the hyperparameters (TrainingArguments) we want to use.

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="code-llama-7b-text-to-sql", # 要保存的目录和存储库ID
    num_train_epochs=3,                     # 训练周期数
    per_device_train_batch_size=3,          # 训练期间每个设备的批量大小
    gradient_accumulation_steps=2,          # 反向/更新前的步骤数
    gradient_checkpointing=True,            # 使用渐变检查点来节省内存
    optim="adamw_torch_fused",              # 使用融合的adamw优化器
    logging_steps=10,                       # 每10步记录一次
    save_strategy="epoch",                  # 每个epoch保存检查点
    learning_rate=2e-4,                     # 学习率,基于QLoRA论文
    bf16=True,                              # 使用bfloat16精度
    tf32=True,                              # 使用tf32精度
    max_grad_norm=0.3,                      # 基于QLoRA论文的最大梯度范数
    warmup_ratio=0.03,                      # 根据QLoRA论文的预热比例
    lr_scheduler_type="constant",           # 使用恒定学习率调度器
    push_to_hub=True,                       # 将模型推送到Hub
    report_to="tensorboard",                # 将指标报告到Tensorboard
)           
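
One thing worth spelling out: the effective batch size per optimizer step is the per-device batch size multiplied by the gradient accumulation steps.

# effective batch size per optimizer step with the arguments above (single GPU)
effective_batch_size = 3 * 2  # per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch_size)   # 6 packed sequences per update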

We now have all the elements to create an SFTTrainer and start model training.

from trl import SFTTrainer

max_seq_length = 3072 # max sequence length for the model and packing of the dataset

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    dataset_kwargs={
        "add_special_tokens": False,  # 我们使用特殊 tokens
        "append_concat_token": False, # 不需要添加额外的分隔符 token
    }
)           
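
If you're curious how small the trainable portion actually is, peft models expose print_trainable_parameters(). Since we passed a peft_config, the SFTTrainer should have wrapped our model with the adapter, so the following quick check (a small addition, not part of the original walkthrough) prints the trainable parameter count:

# optional sanity check: with a peft_config, SFTTrainer wraps the model in a PeftModel,
# so we can print how many parameters the LoRA adapter actually trains
trainer.model.print_trainable_parameters()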

We can start training by calling the train() method on our Trainer instance. This starts a training loop that runs for three epochs. Because we use a parameter-efficient fine-tuning method, we only save the adapter weights, not the full model.

# Start training; the model will be automatically saved to the Hub and the output directory
trainer.train()

# Save the model
trainer.save_model()

Training for three epochs with Flash Attention on the 10k-sample dataset took 01:29:58 on a g5.2xlarge. The instance costs $1.212/h, which brings the total cost to only about $1.8.

# free up memory again
del model
del trainer
torch.cuda.empty_cache()           

Optional step: Merge the LoRA adapter into the original model

When using QLoRA, we only train the adapter, not the full model, so only the adapter weights are saved during training. If you want to save the full model so that it is easier to use with serving stacks such as Text Generation Inference, you can merge the adapter weights into the model weights with the merge_and_unload method and then save the model with save_pretrained. This produces a standard model that can be used for inference.

Note: This process may require more than 30GB of CPU memory.

#### COMMENT IN TO MERGE PEFT AND BASE MODEL ####
# from peft import AutoPeftModelForCausalLM

# # Load PEFT model on CPU
# model = AutoPeftModelForCausalLM.from_pretrained(
#     args.output_dir,
#     torch_dtype=torch.float16,
#     low_cpu_mem_usage=True,
# )
# # Merge LoRA and base model and save
# merged_model = model.merge_and_unload()
# merged_model.save_pretrained(args.output_dir, safe_serialization=True, max_shard_size="2GB")

5. Test and evaluate large language models

Once training is complete, we want to evaluate and test the model. We'll take different samples from the original dataset and evaluate the model with a simple loop, using exact-match accuracy as our metric.

Note: Evaluating generative AI models is not trivial, since a single input can have multiple correct outputs.

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer, pipeline

peft_model_id = "./code-llama-7b-text-to-sql"
# peft_model_id = args.output_dir

# Load Model with PEFT adapter
model = AutoPeftModelForCausalLM.from_pretrained(
  peft_model_id,
  device_map="auto",
  torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)

# load into pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

Let's load the test dataset and try to generate an answer for a sample instruction.

from datasets import load_dataset
from random import randint

# Load our test dataset
eval_dataset = load_dataset("json", data_files="test_dataset.json", split="train")
rand_idx = randint(0, len(eval_dataset)-1)

# Test on a random sample
prompt = pipe.tokenizer.apply_chat_template(eval_dataset[rand_idx]["messages"][:2], tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=False, temperature=0.1, top_k=50, top_p=0.1, eos_token_id=pipe.tokenizer.eos_token_id, pad_token_id=pipe.tokenizer.pad_token_id)

print(f"Query:\n{eval_dataset[rand_idx]['messages'][1]['content']}")
print(f"Original Answer:\n{eval_dataset[rand_idx]['messages'][2]['content']}")
print(f"Generated Answer:\n{outputs[0]['generated_text'][len(prompt):].strip()}")           

Our model successfully generated a SQL query from the natural language instruction. Now let's run a broader evaluation on our test dataset. As mentioned earlier, assessing the accuracy of a generative model is not easy. In our experiment, we check whether the generated SQL query exactly matches the reference SQL query. A more precise approach would be to execute the queries against a database and compare the results, but that requires more setup.

from tqdm import tqdm

def evaluate(sample):
    prompt = pipe.tokenizer.apply_chat_template(sample["messages"][:2], tokenize=False, add_generation_prompt=True)
    outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95, eos_token_id=pipe.tokenizer.eos_token_id, pad_token_id=pipe.tokenizer.pad_token_id)
    predicted_answer = outputs[0]['generated_text'][len(prompt):].strip()
    if predicted_answer == sample["messages"][2]["content"]:
        return 1
    else:
        return 0

success_rate = []
number_of_eval_samples = 1000
# Iterate over the eval dataset and predict
for s in tqdm(eval_dataset.shuffle().select(range(number_of_eval_samples))):
    success_rate.append(evaluate(s))

# Compute accuracy
accuracy = sum(success_rate)/len(success_rate)

print(f"Accuracy: {accuracy*100:.2f}%")           

We evaluated on 1,000 samples from the test dataset and reached 79.50% accuracy; the whole run took about 25 minutes.

This is a solid result, but we should treat the metric with caution. Running the generated queries against a real database and comparing the results would be a more reliable way to evaluate, since the same instruction can correspond to multiple correct SQL queries. We could also try to further improve performance with techniques such as few-shot prompting, RAG (retrieval-augmented generation), and self-healing.
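
If you want to go in that direction, the sketch below shows one way to do execution-based comparison with Python's built-in sqlite3 module. It assumes the CREATE TABLE statements embedded in the system message are SQLite-compatible and leaves the tables empty, so it only checks that both queries run and return the same (possibly empty) result; treat it as a starting point rather than a full evaluator.

import sqlite3

def execution_match(schema: str, predicted_sql: str, reference_sql: str) -> bool:
    """Sketch: build an in-memory database from the schema and compare query results."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)  # create the (empty) tables from the CREATE TABLE statements
        predicted = conn.execute(predicted_sql).fetchall()
        reference = conn.execute(reference_sql).fetchall()
        return predicted == reference
    except sqlite3.Error:
        return False
    finally:
        conn.close()

# usage against a test sample; the schema sits after "SCHEMA:" in our system message
# schema = sample["messages"][0]["content"].split("SCHEMA:")[-1]
# execution_match(schema, predicted_answer, sample["messages"][2]["content"])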

6. Deploy the large language model to the production environment

Now you can deploy your large language model to production. For deploying open LLMs in production, we recommend Text Generation Inference (TGI). TGI is a purpose-built, high-performance solution for deploying and serving large language models. It supports many popular open models, including Llama, Mistral, Mixtral, StarCoder, T5, and more, with tensor parallelism and continuous batching. Companies such as IBM, Grammarly, Uber, and Deutsche Telekom use Text Generation Inference. You can deploy your model in several ways, for example:

  • Use the Inference Endpoints service provided by Hugging Face
  • Deploy it yourself (DIY), for example with Docker as shown below

If you already have Docker installed, you can use the following command to start the inference server.

Note: Make sure you have enough GPU memory to run the container. In a notebook, you may need to reboot the kernel to free up any allocated GPU memory.

%%bash
# model=$PWD/{args.output_dir} # path to model
model=$(pwd)/code-llama-7b-text-to-sql # path to model
num_shard=1             # number of shards
max_input_length=1024   # max input length
max_total_tokens=2048   # max total tokens

docker run -d --name tgi --gpus all -ti -p 8080:80 \
  -e MODEL_ID=/workspace \
  -e NUM_SHARD=$num_shard \
  -e MAX_INPUT_LENGTH=$max_input_length \
  -e MAX_TOTAL_TOKENS=$max_total_tokens \
  -v $model:/workspace \
  ghcr.io/huggingface/text-generation-inference:latest           
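
The container needs a little time to load the model weights before it can serve requests. A simple way to wait for readiness (a small convenience, not part of the original walkthrough) is to poll TGI's health endpoint, which should return HTTP 200 once the model is loaded:

import time
import requests

# poll the TGI health endpoint until the server reports ready
while True:
    try:
        if requests.get("http://127.0.0.1:8080/health", timeout=2).status_code == 200:
            print("TGI is ready")
            break
    except requests.exceptions.ConnectionError:
        pass
    time.sleep(5)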

Once your container is started, you can start sending inference requests.

import requests as r
from transformers import AutoTokenizer
from datasets import load_dataset
from random import randint

# Load our test dataset and tokenizer again
tokenizer = AutoTokenizer.from_pretrained("code-llama-7b-text-to-sql")
eval_dataset = load_dataset("json", data_files="test_dataset.json", split="train")
rand_idx = randint(0, len(eval_dataset))

# Generate the same prompt as in the first local test
prompt = tokenizer.apply_chat_template(eval_dataset[rand_idx]["messages"][:2], tokenize=False, add_generation_prompt=True)
request= {"inputs":prompt,"parameters":{"temperature":0.2, "top_p": 0.95, "max_new_tokens": 256}}

# Send the request to the inference server
resp = r.post("http://127.0.0.1:8080/generate", json=request)

output = resp.json()["generated_text"].strip()
time_per_token = resp.headers.get("x-time-per-token")
time_prompt_tokens = resp.headers.get("x-prompt-tokens")

# Print the results
print(f"Query:\n{eval_dataset[rand_idx]['messages'][1]['content']}")
print(f"Original Answer:\n{eval_dataset[rand_idx]['messages'][2]['content']}")
print(f"Generated Answer:\n{output}")
print(f"Latency per token: {time_per_token}ms")
print(f"Latency prompt encoding: {time_prompt_tokens}ms")           

When you're done with your work, don't forget to stop your container.

!docker stop tgi           

7. Summary

With the development of large language models and the proliferation of tools like TRL, now is an excellent time for businesses to invest in open large language model technology. Fine-tuning open large language models for specific tasks can significantly improve efficiency and open up new possibilities for innovation and service quality improvement. With the increasing availability of technology and the increasing cost-effectiveness, there has never been a better time to start taking advantage of open large language models.

8. References

[1]. Llama 2 (https://huggingface.co/meta-llama/Llama-2-70b-chat-hf)

[2]. Mistral (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)

[3]. Mixtral (https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)

[4]. Falcon (https://huggingface.co/tiiuae/falcon-40b)

[5]. TRL (https://huggingface.co/docs/trl/index)

[6]. Transformers (https://huggingface.co/docs/transformers/index)

[7]. datasets(https://huggingface.co/docs/datasets/index)

[8]. FlashAttention (https://github.com/Dao-AILab/flash-attention/tree/main)

[9]. Hugging Face Hub (https://huggingface.co/models)

[10]. Orca: Progressive Learning from Complex Explanation Traces of GPT-4. (https://arxiv.org/abs/2306.02707)

[11]. sql-create-context (https://huggingface.co/datasets/b-mc2/sql-create-context)

[12]. Text Generation Inference (TGI) (https://github.com/huggingface/text-generation-inference)
