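Preface

With AI booming, new AI applications keep appearing and AI is reshaping one industry after another. Most large companies are probably pursuing some form of AI strategy, but only a few have the capability to develop large models in-house. For most companies, fine-tuning an open-source, commercially usable large model on industry data is becoming a good option.

This article is a catalog of the open-source, commercially usable large language models currently available, together with datasets and other learning resources. It will be updated over time, so feel free to like and bookmark it.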

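Here are some earlier resource posts compiled by the author, for anyone interested:

Resource library of excellent LangChain-based projects

Resource library of excellent multimodal large models (LLM)

Open LLMs

These LLMs are all licensed for commercial use (e.g., Apache 2.0, MIT, OpenRAIL-M). Contributions welcome!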

Language Model	Release Date	Checkpoints	Paper/Blog	Params (B)	Context Length	License	Try it
T5	2019/10	T5 & Flan-T5[1], Flan-T5-xxl (HF)[2]	Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer[3]	0.06 - 11	512[4]	Apache 2.0	T5-Large[5]
UL2	2022/10	UL2 & Flan-UL2[6], Flan-UL2 (HF)[7]	UL2 20B: An Open Source Unified Language Learner[8]	20	512, 2048[9]	Apache 2.0
Cerebras-GPT	2023/03	Cerebras-GPT[10]	Cerebras-GPT: A Family of Open, Compute-efficient, Large Language Models[11] (Paper[12])	0.111 - 13	2048[13]	Apache 2.0	Cerebras-GPT-1.3B[14]
Open Assistant (Pythia family)	2023/03	OA-Pythia-12B-SFT-8[15], OA-Pythia-12B-SFT-4[16], OA-Pythia-12B-SFT-1[17]	Democratizing Large Language Model Alignment[18]	12	2048	Apache 2.0	Pythia-2.8B[19]
Pythia	2023/04	pythia 70M - 12B[20]	Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling[21]	0.07 - 12	2048[22]	Apache 2.0
Dolly	2023/04	dolly-v2-12b[23]	Free Dolly: Introducing the World's First Truly Open, Commercially Viable Instruction-Tuned LLM[24]	3, 7, 12	2048[25]	MIT
DLite	2023/05	dlite-v2-1_5b[26]	Announcing DLite V2: Lightweight, Open LLMs That Can Run Anywhere[27]	0.124 - 1.5	1024	Apache 2.0	DLite-v2-1.5B[28]
RWKV	2021/08	RWKV, ChatRWKV[29]	The RWKV Language Model (and my LM tricks)[30]	0.1 - 14	Unlimited (RNN)[31]	Apache 2.0
GPT-J-6B	2021/06	GPT-J-6B[32], GPT4All-J[33]	GPT-J-6B: 6B JAX-Based Transformer[34]	6	2048[35]	Apache 2.0
GPT-NeoX-20B	2022/04	GPT-NEOX-20B[36]	GPT-NeoX-20B: An Open-Source Autoregressive Language Model[37]	20	2048[38]	Apache 2.0
Bloom	2022/11	Bloom[39]	BLOOM: A 176B-Parameter Open-Access Multilingual Language Model[40]	176	2048[41]	OpenRAIL-M v1[42]
StableLM-Alpha	2023/04	StableLM-Alpha[43]	Stability AI Launches the First of its StableLM Suite of Language Models[44]	3 - 65	4096[45]	CC BY-SA-4.0
FastChat-T5	2023/04	fastchat-t5-3b-v1.0[46]	We are excited to release FastChat-T5: our compact and commercial-friendly chatbot![47]	3	512	Apache 2.0
h2oGPT	2023/05	h2oGPT[48]	Building the World's Best Open-Source Large Language Model: H2O.ai's Journey[49]	12 - 20	256 - 2048[50]	Apache 2.0
MPT-7B	2023/05	MPT-7B[51], MPT-7B-Instruct[52]	Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs[53]	7	84k (ALiBi)[54]	Apache 2.0, CC BY-SA-3.0
RedPajama-INCITE	2023/05	RedPajama-INCITE[55]	Releasing 3B and 7B RedPajama-INCITE family of models including base, instruction-tuned and chat models[56]	3 - 7	2048[57]	Apache 2.0	RedPajama-INCITE-Instruct-3B-v1[58]
OpenLLaMA	2023/05	open_llama_7b_700bt_preview[59], open_llama_3b_600bt_preview[60]	OpenLLaMA: An Open Reproduction of LLaMA[61]	3, 7	2048[62]	Apache 2.0	OpenLLaMA-7B-Preview_200bt[63]
Falcon	2023/05	Falcon-40B[64], Falcon-7B[65]	Paper coming soon	7, 40	2048	Apache 2.0
Baichuan	2023/06	Baichuan-7B[66]	None	7	4096	Apache 2.0	baichuan/7b[67]
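
As a rough illustration of how one might shortlist candidates from the table above, the sketch below hard-codes a few of its rows and filters them by license and minimum context length. The `Model` record and the `shortlist` helper are hypothetical names introduced here for illustration; they are not part of the original list, and the rows would need to be extended by hand.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str
    max_params_b: float            # largest size in the family, in billions
    context_length: Optional[int]  # None where the table gives no fixed limit (RWKV)
    license: str


# A handful of rows copied from the table above; extend as needed.
MODELS = [
    Model("T5", 11, 512, "Apache 2.0"),
    Model("Dolly", 12, 2048, "MIT"),
    Model("RWKV", 14, None, "Apache 2.0"),
    Model("Bloom", 176, 2048, "OpenRAIL-M v1"),
    Model("StableLM-Alpha", 65, 4096, "CC BY-SA-4.0"),
    Model("Baichuan", 7, 4096, "Apache 2.0"),
]


def shortlist(models, licenses, min_context):
    """Return the names of models whose license is in `licenses` and whose
    context length is at least `min_context`.

    Models without a fixed context limit always pass the length check.
    """
    return [
        m.name
        for m in models
        if m.license in licenses
        and (m.context_length is None or m.context_length >= min_context)
    ]


if __name__ == "__main__":
    print(shortlist(MODELS, {"Apache 2.0", "MIT"}, 2048))
```

Keeping the license as a plain string mirrors the table; a real selection process would also need to weigh model quality and the license terms themselves.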

Open LLMs for code

Language Model	Release Date	Checkpoints	Paper/Blog	Params (B)	Context Length	License	Try it
SantaCoder	2023/01	santacoder[68]	SantaCoder: don't reach for the stars![69]	1.1	2048[70]	OpenRAIL-M v1[71]	SantaCoder[72]
StarCoder	2023/05	starcoder[73]	StarCoder: A State-of-the-Art LLM for Code[74], StarCoder: May the source be with you![75]	15	8192[76]	OpenRAIL-M v1[77]
StarChat Alpha	2023/05	starchat-alpha[78]	Creating a Coding Assistant with StarCoder[79]	16	8192[80]	OpenRAIL-M v1[81]
Replit Code	2023/05	replit-code-v1-3b[82]	Training a SOTA Code LLM in 1 week and Quantifying the Vibes, with Reza Shabani of Replit[83]	2.7	Infinity? (ALiBi)[84]	CC BY-SA-4.0	Replit-Code-v1-3B[85]
CodeGen2	2023/04	codegen2 1B-16B[86]	CodeGen2: Lessons for Training LLMs on Programming and Natural Languages[87]	1 - 16	2048[88]	Apache 2.0
CodeT5+	2023/05	CodeT5+[89]	CodeT5+: Open Code Large Language Models for Code Understanding and Generation[90]	0.22 - 16	512[91]	BSD-3-Clause[116]	Codet5+-6B[92]

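Open LLM datasets for pre-training

Name	Release Date	Paper/Blog	Dataset	Tokens (T)	License
starcoderdata	2023/05	StarCoder: A State-of-the-Art LLM for Code[93]	starcoderdata[94]	0.25	Apache 2.0
RedPajama	2023/04	RedPajama, a project to create leading open-source models, starts by reproducing the LLaMA training dataset of over 1.2 trillion tokens[95]	RedPajama-Data[96]	1.2	Apache 2.0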

Open LLM datasets for instruction-tuning

Name	Release Date	Paper/Blog	Dataset	Samples (K)	License
MPT-7B-Instruct	2023/05	Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs[97]	dolly_hhrlhf[98]	59	CC BY-SA-3.0
databricks-dolly-15k	2023/04	Free Dolly: Introducing the World's First Truly Open, Commercially Viable Instruction-Tuned LLM[99]	databricks-dolly-15k[100]	15	CC BY-SA-3.0
OIG (Open Instruction Generalist)	2023/03	The OIG Dataset[101]	OIG[102]	44,000	Apache 2.0

Open LLM datasets for alignment-tuning

Name	Release Date	Paper/Blog	Dataset	Samples (K)	License
OpenAssistant Conversations Dataset	2023/04	OpenAssistant Conversations - Democratizing Large Language Model Alignment[103]	oasst1[104]	161	Apache 2.0

Evaluations of open LLMs

• lmsys.org's Leaderboard[105]

• MosaicML's Evaluation[106]

• Holistic Evaluation of Language Models (HELM)[107]

• LLM-Leaderboard[108]

• TextSynth Server Benchmarks[109]

• Hugging Face's Open LLM Leaderboard[110]

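What do the licenses mean?

• Apache 2.0[111]: Allows users to use the software for any purpose, to distribute it, to modify it, and to distribute modified versions under the terms of the license, without paying royalties.

• MIT[112]: Similar to Apache 2.0 but shorter and simpler. Unlike Apache 2.0, it does not require stating any significant changes to the original code.

• CC BY-SA-4.0[113]: Allows (i) copying and redistributing the material and (ii) remixing, transforming, and building upon the material for any purpose, even commercially. However, if you do the latter, you must distribute your contributions under the same license as the original. (Thus, it may not be feasible for internal teams.)

• OpenRAIL-M v1[114]: Allows royalty-free access and flexible downstream use and sharing of the model and modifications of it, and comes with a set of use restrictions (see Attachment A[115]).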
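
To make the summaries above easier to apply when scanning the tables, here is a minimal sketch that maps each license name to the obligations mentioned in this section. The `LICENSE_FLAGS` table and the `review_flags` helper are hypothetical names introduced for illustration; they encode only the points stated above, whereas the actual license texts contain further terms.

```python
# Obligations as summarized in the list above, and nothing more.
# This is a reading aid under those assumptions, not legal advice.
LICENSE_FLAGS = {
    "Apache 2.0": ["state significant changes to the original code"],
    "MIT": [],
    "CC BY-SA-4.0": ["distribute derivative contributions under the same license"],
    "OpenRAIL-M v1": ["comply with the use restrictions in Attachment A"],
}


def review_flags(license_name):
    """Return the obligations noted above for `license_name`.

    Licenses that this section does not summarize get a single reminder
    to read the license text instead of an empty (and misleading) list.
    """
    flags = LICENSE_FLAGS.get(license_name)
    if flags is None:
        return [f"{license_name}: not summarized here, read the license text"]
    return list(flags)  # copy so callers cannot mutate the table


if __name__ == "__main__":
    for name in ("MIT", "CC BY-SA-4.0", "BSD-3-Clause"):
        print(name, "->", review_flags(name))
```

Returning an explicit reminder for unknown licenses is a deliberate choice here: an empty list would read as "no obligations", which the sketch cannot know.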

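Disclaimer: The information provided in this repository does not, and is not intended to, constitute legal advice. The maintainers of this repository are not responsible for the actions of third parties who use the models. Please consult an attorney before using a model for commercial purposes.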

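Improvements

• Complete the entries for context length, and check the entries marked with ?

• Add the number of tokens trained? (see considerations[117])

• Add (links to) training code?

• Add (links to) evaluation benchmarks?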

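Notice

This article was translated and compiled from GitHub - eugeneyan/open-llms: A list of open LLMs available for commercial use.[118]

It will be updated from time to time as that project changes; if you are interested, feel free to like and bookmark it.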

References

[1] T5 & Flan-T5: https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints

[2] Flan-T5-xxl (HF): https://huggingface.co/google/flan-t5-xxl

[3] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer: https://github.com/google-research/text-to-text-transfer-transformer#released-model-checkpoints

[4] 512: https://discuss.huggingface.co/t/does-t5-truncate-input-longer-than-512-internally/3602

[5] T5-Large: https://github.com/slai-labs/get-beam/tree/main/examples/t5

[6] UL2 & Flan-UL2: https://github.com/google-research/google-research/tree/master/ul2#checkpoints

[7] Flan-UL2 (HF): https://huggingface.co/google/flan-ul2

[8] UL2 20B: An Open Source Unified Language Learner: https://ai.googleblog.com/2022/10/ul2-20b-open-source-unified-language.html

[9] 512, 2048: https://huggingface.co/google/flan-ul2#tldr

[10] Cerebras-GPT: https://huggingface.co/cerebras

[11] Cerebras-GPT: A Family of Open, Compute-efficient, Large Language Models: https://www.cerebras.net/blog/cerebras-gpt-a-family-of-open-compute-efficient-large-language-models/

[12] Paper: https://arxiv.org/abs/2304.03208

[13] 2048: https://huggingface.co/cerebras/Cerebras-GPT-13B#model-details

[14] Cerebras-GPT-1.3B: https://github.com/slai-labs/get-beam/tree/main/examples/cerebras-gpt

[15] OA-Pythia-12B-SFT-8: https://huggingface.co/OpenAssistant/pythia-12b-sft-v8-7k-steps

[16] OA-Pythia-12B-SFT-4: https://huggingface.co/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5

[17] OA-Pythia-12B-SFT-1: https://huggingface.co/OpenAssistant/oasst-sft-1-pythia-12b

[18] Democratizing Large Language Model Alignment: https://arxiv.org/abs/2304.07327

[19] Pythia-2.8B: https://github.com/slai-labs/get-beam/tree/main/examples/pythia

[20] pythia 70M - 12B: https://github.com/EleutherAI/pythia

[21] Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling: https://arxiv.org/abs/2304.01373

[22] 2048: https://arxiv.org/pdf/2304.01373.pdf

[23] dolly-v2-12b: https://huggingface.co/databricks/dolly-v2-12b

[24] Free Dolly: Introducing the World's First Truly Open, Commercially Viable Instruction-Tuned LLM: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm

[25] 2048: https://github.com/databrickslabs/dolly#dolly

[26] dlite-v2-1_5b: https://huggingface.co/aisquared/dlite-v2-1_5b

[27] Announcing DLite V2: Lightweight, Open LLMs That Can Run Anywhere: https://medium.com/ai-squared/announcing-dlite-v2-lightweight-open-llms-that-can-run-anywhere-a852e5978c6e

[28] DLite-v2-1.5B: https://github.com/slai-labs/get-beam/tree/main/examples/dlite-v2

[29] RWKV, ChatRWKV: https://github.com/BlinkDL/RWKV-LM#rwkv-parallelizable-rnn-with-transformer-level-llm-performance-pronounced-as-rwakuv-from-4-major-params-r-w-k-v

[30] The RWKV Language Model (and my LM tricks): https://github.com/BlinkDL/RWKV-LM

[31] Unlimited (RNN): https://github.com/BlinkDL/RWKV-LM#rwkv-parallelizable-rnn-with-transformer-level-llm-performance-pronounced-as-rwakuv-from-4-major-params-r-w-k-v

[32] GPT-J-6B: https://github.com/kingoflolz/mesh-transformer-jax/#gpt-j-6b

[33] GPT4All-J: https://github.com/nomic-ai/gpt4all#raw-model

[34] GPT-J-6B: 6B JAX-Based Transformer: https://arankomatsuzaki.wordpress.com/2021/06/04/gpt-j/

[35] 2048: https://github.com/kingoflolz/mesh-transformer-jax/#gpt-j-6b

[36] GPT-NEOX-20B: https://huggingface.co/EleutherAI/gpt-neox-20b

[37] GPT-NeoX-20B: An Open-Source Autoregressive Language Model: https://arxiv.org/abs/2304.04165

[38] 2048: https://huggingface.co/EleutherAI/gpt-neox-20b

[39] Bloom: https://huggingface.co/bigscience/bloom

[40] BLOOM: A 176B-Parameter Open-Access Multilingual Language Model: https://arxiv.org/abs/2211.05100

[41] 2048: https://huggingface.co/bigscience/bloom

[42] OpenRAIL-M v1: https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement

[43] StableLM-Alpha: https://github.com/Stability-AI/StableLM#stablelm-alpha

[44] Stability AI Launches the First of its StableLM Suite of Language Models: https://stability.ai/blog/stability-ai-launches-the-first-of-its-stablelm-suite-of-language-models

[45] 4096: https://github.com/Stability-AI/StableLM#stablelm-alpha

[46] fastchat-t5-3b-v1.0: https://huggingface.co/lmsys/fastchat-t5-3b-v1.0

[47] We are excited to release FastChat-T5: our compact and commercial-friendly chatbot!: https://twitter.com/lmsysorg/status/1652037026705985537?s=20

[48] h2oGPT: https://github.com/h2oai/h2ogpt

[49] Building the World's Best Open-Source Large Language Model: H2O.ai's Journey: https://h2o.ai/blog/building-the-worlds-best-open-source-large-language-model-h2o-ais-journey/

[50] 256 - 2048: https://huggingface.co/h2oai

[51] MPT-7B: https://huggingface.co/mosaicml/mpt-7b

[52] MPT-7B-Instruct: https://huggingface.co/mosaicml/mpt-7b-instruct

[53] Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs: https://www.mosaicml.com/blog/mpt-7b

[54] 84k (ALiBi): https://huggingface.co/mosaicml/mpt-7b#how-is-this-model-different

[55] RedPajama-INCITE: https://huggingface.co/togethercomputer

[56] Releasing 3B and 7B RedPajama-INCITE family of models including base, instruction-tuned and chat models: https://www.together.xyz/blog/redpajama-models-v1

[57] 2048: https://huggingface.co/togethercomputer/RedPajama-INCITE-Instruct-7B-v0.1/blob/157bf3174feebb67f37e131ea68f84dee007c687/config.json#L13

[58] RedPajama-INCITE-Instruct-3B-v1: https://github.com/slai-labs/get-beam/tree/main/examples/redpajama-incite-instruct

[59] open_llama_7b_700bt_preview: https://huggingface.co/openlm-research/open_llama_7b_700bt_preview

[60] open_llama_3b_600bt_preview: https://huggingface.co/openlm-research/open_llama_3b_600bt_preview

[61] OpenLLaMA: An Open Reproduction of LLaMA: https://github.com/openlm-research/open_llama

[62] 2048: https://huggingface.co/h2oai

[63] OpenLLaMA-7B-Preview_200bt: https://github.com/slai-labs/get-beam/tree/main/examples/openllama

[64] Falcon-40B: https://huggingface.co/tiiuae/falcon-40b

[65] Falcon-7B: https://huggingface.co/tiiuae/falcon-7b

[66] Baichuan-7B: https://huggingface.co/baichuan-inc/baichuan-7B

[67] baichuan/7b: https://github.com/baichuan-inc/baichuan-7B

[68] santacoder: https://huggingface.co/bigcode/santacoder

[69] SantaCoder: don't reach for the stars!: https://arxiv.org/abs/2301.03988

[70] 2048: https://huggingface.co/bigcode/santacoder/blob/main/README.md#model-summary

[71] OpenRAIL-M v1: https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement

[72] SantaCoder: https://github.com/slai-labs/get-beam/tree/main/examples/santacoder

[73] starcoder: https://huggingface.co/bigcode/starcoder

[74] StarCoder: A State-of-the-Art LLM for Code: https://huggingface.co/blog/starcoder

[75] StarCoder: May the source be with you!: https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view

[76] 8192: https://huggingface.co/bigcode/starcoder#model-summary

[77] OpenRAIL-M v1: https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement

[78] starchat-alpha: https://huggingface.co/HuggingFaceH4/starchat-alpha

[79] Creating a Coding Assistant with StarCoder: https://huggingface.co/blog/starchat-alpha

[80] 8192: https://huggingface.co/bigcode/starcoder#model-summary

[81] OpenRAIL-M v1: https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement

[82] replit-code-v1-3b: https://huggingface.co/replit/replit-code-v1-3b

[83] Training a SOTA Code LLM in 1 week and Quantifying the Vibes, with Reza Shabani of Replit: https://www.latent.space/p/reza-shabani#details

[84] Infinity? (ALiBi): https://huggingface.co/replit/replit-code-v1-3b#model-description

[85] Replit-Code-v1-3B: https://github.com/slai-labs/get-beam/tree/main/examples/replit-code

[86] codegen2 1B-16B: https://github.com/salesforce/CodeGen2

[87] CodeGen2: Lessons for Training LLMs on Programming and Natural Languages: https://arxiv.org/abs/2305.02309

[88] 2048: https://arxiv.org/abs/2305.02309

[89] CodeT5+: https://github.com/salesforce/CodeT5/tree/main/CodeT5+

[90] CodeT5+: Open Code Large Language Models for Code Understanding and Generation: https://arxiv.org/abs/2305.07922

[91] 512: https://arxiv.org/abs/2305.07922

[92] Codet5+-6B: https://github.com/slai-labs/get-beam/tree/main/examples/codeT5%2B

[93] StarCoder: A State-of-the-Art LLM for Code: https://huggingface.co/blog/starcoder

[94] starcoderdata: https://huggingface.co/datasets/bigcode/starcoderdata

[95] RedPajama, a project to create leading open-source models, starts by reproducing the LLaMA training dataset of over 1.2 trillion tokens: https://www.together.xyz/blog/redpajama

[96] RedPajama-Data: https://github.com/togethercomputer/RedPajama-Data

[97] Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs: https://www.mosaicml.com/blog/mpt-7b

[98] dolly_hhrlhf: https://huggingface.co/datasets/mosaicml/dolly_hhrlhf

[99] Free Dolly: Introducing the World's First Truly Open, Commercially Viable Instruction-Tuned LLM: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm

[100] databricks-dolly-15k: https://huggingface.co/datasets/databricks/databricks-dolly-15k

[101] The OIG Dataset: https://laion.ai/blog/oig-dataset/

[102] OIG: https://huggingface.co/datasets/laion/OIG

[103] OpenAssistant Conversations - Democratizing Large Language Model Alignment: https://drive.google.com/file/d/10iR5hKwFqAKhL3umx8muOWSRm7hs5FqX/view

[104] oasst1: https://huggingface.co/datasets/OpenAssistant/oasst1

[105] lmsys.org's Leaderboard: https://chat.lmsys.org/?leaderboard

[106] MosaicML's Evaluation: https://twitter.com/jefrankle/status/1654631746506301441

[107] Holistic Evaluation of Language Models (HELM): https://crfm.stanford.edu/helm/latest/?groups=1

[108] LLM-Leaderboard: https://github.com/LudwigStumpp/llm-leaderboard

[109] TextSynth Server Benchmarks: https://bellard.org/ts_server/

[110] Hugging Face's Open LLM Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

[111] Apache 2.0: https://en.wikipedia.org/wiki/Apache_License

[112] MIT: https://en.wikipedia.org/wiki/MIT_License

[113] CC BY-SA-4.0: https://creativecommons.org/licenses/by-sa/4.0/

[114] OpenRAIL-M v1: https://www.bigcode-project.org/docs/pages/model-license/

[115] Attachment A: https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement

[116] BSD-3-Clause: https://en.wikipedia.org/wiki/BSD_licenses

[117] considerations: https://github.com/eugeneyan/open-llms/issues/7

[118] GitHub - eugeneyan/open-llms: A list of open LLMs available for commercial use.: https://github.com/eugeneyan/open-llms/tree/main
