A Code Walkthrough of an Open-Source Project That Fine-Tunes Tsinghua University's ChatGLM-6B with LoRA

Author: 大狗在海里 2023-04-24 11:04:00

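Disclaimer:

1. Without open source there would be no progress in any industry; the authors of open-source projects deserve everyone's respect.

2. This article is not meant to belittle the author. As an evangelist, I want to spare learners the cost of avoidable mistakes.

In this era where "open source is king, data is king, models are king", only the ability to keep learning keeps the age-35 crisis away.

I came across an open-source project that applies LoRA to ChatGLM-6B and helps you quickly build your own private chatbot on top of ChatGLM-6B. Here is my walkthrough: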

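1. Data preparation (data is what matters most)

cover_alpaca2jsonl.py

This script converts the Stanford Alpaca data directly into the project's own format. The core mapping is: instruction --> Instruction; input --> Input; the prompt then ends with "Answer: ", and output --> target.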

def format_example(example: dict) -> dict:
    # Build the prompt: instruction, optional input, then the answer marker
    context = f"Instruction: {example['instruction']}\n"
    if example.get("input"):
        context += f"Input: {example['input']}\n"
    context += "Answer: "
    target = example["output"]
    return {"context": context, "target": target}

Sample of the Stanford Alpaca data format:

{

"instruction": "Give three tips for staying healthy.",

"input": "",

"output": "1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule."

}

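Sample of the data format after conversion by cover_alpaca2jsonl.py. Note: be sure to read the source code! Whether through the author's oversight or for some other reason, the generated jsonl data in the git repository uses "Response:", while the code uses "Answer: ".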

{

"context":"Instruction: Give three tips for staying healthy.\nAnswer: ",

"target":"1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule."

}
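To see the conversion end to end, here is a minimal sketch that applies format_example to the sample record and writes one JSON object per line. The helpers convert_records and find_suffix_mismatches are hypothetical names introduced for illustration, not part of the project. The second helper simply flags records whose context does not end with the marker the code produces ("Answer: "), which is the kind of mismatch described above.

```python
import io
import json


def format_example(example: dict) -> dict:
    """Same mapping as the project's function quoted earlier."""
    context = f"Instruction: {example['instruction']}\n"
    if example.get("input"):
        context += f"Input: {example['input']}\n"
    context += "Answer: "
    return {"context": context, "target": example["output"]}


def convert_records(records, out):
    """Hypothetical helper: write one converted JSON object per line to `out`."""
    for record in records:
        out.write(json.dumps(format_example(record), ensure_ascii=False) + "\n")


def find_suffix_mismatches(jsonl_text, expected_suffix="Answer: "):
    """Hypothetical helper: 0-based indices of lines whose context ends differently."""
    bad = []
    for i, line in enumerate(jsonl_text.splitlines()):
        if not json.loads(line)["context"].endswith(expected_suffix):
            bad.append(i)
    return bad


sample = {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.",
}

buffer = io.StringIO()  # stands in for the output .jsonl file
convert_records([sample], buffer)
converted = buffer.getvalue()
mismatches = find_suffix_mismatches(converted)

# A made-up record in the "Response:" style, to show what the check catches.
old_style = json.dumps({"context": "Instruction: hi\nResponse: ", "target": "hello"})
old_mismatches = find_suffix_mismatches(old_style)

print(converted, mismatches, old_mismatches)
```

Running a check like this on a downloaded jsonl file before tokenizing is a cheap way to confirm that the prompt format in the data matches the one you will use at inference time.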

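2. tokenize_dataset_rows.py

# Note: the usage shown in the documentation is wrong

Given how the code implements this option, the documentation should show the usage as:

--skip_overlength true/false

Alternatively, change the code as follows so that the option works as a plain flag (note that action="store_true" cannot be combined with type=bool; argparse raises a TypeError):

parser.add_argument("--skip_overlength", action="store_true", default=False)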
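Why this matters comes down to how argparse handles booleans. The sketch below uses two hypothetical parsers (not the project's actual script) and assumes the option is declared with type=bool, as the line above suggests. With type=bool, argparse calls bool() on the raw string, so any non-empty value, including "false", becomes True. With action="store_true" the option is a real on/off flag that takes no value.

```python
import argparse

# Variant A: type=bool converts the raw string with bool(), so "false" -> True.
parser_a = argparse.ArgumentParser()
parser_a.add_argument("--skip_overlength", type=bool, default=False)

# Variant B: store_true makes the option a value-less flag.
parser_b = argparse.ArgumentParser()
parser_b.add_argument("--skip_overlength", action="store_true")

a_false = parser_a.parse_args(["--skip_overlength", "false"]).skip_overlength
a_default = parser_a.parse_args([]).skip_overlength
b_on = parser_b.parse_args(["--skip_overlength"]).skip_overlength
b_off = parser_b.parse_args([]).skip_overlength

# Passing type= together with action="store_true" is rejected by argparse.
try:
    argparse.ArgumentParser().add_argument("--flag", type=bool, action="store_true")
    combined_ok = True
except TypeError:
    combined_ok = False

print(a_false, a_default, b_on, b_off, combined_ok)
```

In other words, under Variant A the only reliable way to turn the option off is to omit it, which is why the flag form is usually the safer choice.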

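3. Fine-tuning: finetune.py

Choose the data type (for example fp32, fp16/half, or int8) according to your actual hardware and what the code requires, and adjust it as needed.

If you hit RuntimeError: expected scalar type Half but found Float, simply remove --fp16.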

python finetune.py \
    --dataset_path data/alpaca \
    --lora_rank 8 \
    --per_device_train_batch_size 6 \
    --gradient_accumulation_steps 1 \
    --max_steps 52000 \
    --save_steps 1000 \
    --save_total_limit 2 \
    --learning_rate 1e-4 \
    --fp16 \
    --remove_unused_columns false \
    --logging_steps 50 \
    --output_dir output
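As a rough sanity check on these settings, the sketch below works out how many samples such a run consumes. The batch size, accumulation steps, and step count come from the command. The single-GPU setup and the dataset size of about 52,000 records (the commonly cited size of the Stanford Alpaca set) are assumptions for illustration, not values taken from the project.

```python
# Values taken from the command line.
per_device_train_batch_size = 6
gradient_accumulation_steps = 1
max_steps = 52000

# Assumptions for illustration only.
num_devices = 1        # single GPU
dataset_size = 52000   # approximate size of the Stanford Alpaca dataset

# Samples consumed per optimizer step across all devices.
samples_per_step = per_device_train_batch_size * gradient_accumulation_steps * num_devices
total_samples = samples_per_step * max_steps
approx_epochs = total_samples / dataset_size

print(samples_per_step, total_samples, approx_epochs)
```

Under these assumptions the run makes roughly six passes over the data. Lowering --max_steps shortens it proportionally, and skipping overlength records during tokenization would make each pass slightly shorter.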

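Finally, go train your own large model on a single machine with one GPU, a single machine with multiple GPUs, or multiple machines with multiple GPUs.

Project git repository: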

https://github.com/mymusise/ChatGLM-Tuning.git
