Preface
With AI on the rise, new AI applications keep entering the public eye, and AI is reshaping industry after industry. Most large companies are now pursuing some degree of AI strategy, but only a handful can develop large models in-house. For most companies, fine-tuning an open, commercially usable large model on domain-specific data is becoming a practical choice.
This article is a resource library of the open, commercially usable large language models currently available. It catalogs many large language models, datasets, and other learning resources, and will be updated continuously.
For reference, here are resource lists the author has compiled previously:
A resource library of excellent LangChain-based projects
A resource library of excellent multimodal large models (LLMs)
Open LLMs
All of these LLMs can be used for commercial purposes (e.g., under Apache 2.0, MIT, or OpenRAIL-M licenses). Contributions welcome!
Language Model | Release Date | Checkpoints | Paper/Blog | Params (B) | Context Length | License | Try It |
--- | --- | --- | --- | --- | --- | --- | --- |
T5 | 2019/10 | T5 & Flan-T5[1], Flan-T5-xxl (HF)[2] | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer[3] | 0.06 - 11 | 512[4] | Apache 2.0 | T5-Large[5] |
UL2 | 2022/10 | UL2 & Flan-UL2[6], Flan-UL2 (HF)[7] | UL2 20B: An Open Source Unified Language Learner[8] | 20 | 512, 2048[9] | Apache 2.0 | |
Cerebras-GPT | 2023/03 | Cerebras-GPT[10] | Cerebras-GPT: A Family of Open, Compute-efficient, Large Language Models[11] (Paper[12]) | 0.111 - 13 | 2048[13] | Apache 2.0 | Cerebras-GPT-1.3B[14] |
Open Assistant (Pythia family) | 2023/03 | OA-Pythia-12B-SFT-8[15], OA-Pythia-12B-SFT-4[16], OA-Pythia-12B-SFT-1[17] | Democratizing Large Language Model Alignment[18] | 12 | 2048 | Apache 2.0 | Pythia-2.8B[19] |
Pythia | 2023/04 | pythia 70M - 12B[20] | Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling[21] | 0.07 - 12 | 2048[22] | Apache 2.0 | |
Dolly | 2023/04 | dolly-v2-12b[23] | Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM[24] | 3, 7, 12 | 2048[25] | MIT | |
DLite | 2023/05 | dlite-v2-1_5b[26] | Announcing DLite V2: Lightweight, Open LLMs That Can Run Anywhere[27] | 0.124 - 1.5 | 1024 | Apache 2.0 | DLite-v2-1.5B[28] |
RWKV | 2021/08 | RWKV, ChatRWKV[29] | The RWKV Language Model (and my LM tricks)[30] | 0.1 - 14 | Unlimited (RNN)[31] | Apache 2.0 | |
GPT-J-6B | 2021/06 | GPT-J-6B[32], GPT4All-J[33] | GPT-J-6B: 6B JAX-Based Transformer[34] | 6 | 2048[35] | Apache 2.0 | |
GPT-NeoX-20B | 2022/04 | GPT-NEOX-20B[36] | GPT-NeoX-20B: An Open-Source Autoregressive Language Model[37] | 20 | 2048[38] | Apache 2.0 | |
Bloom | 2022/11 | Bloom[39] | BLOOM: A 176B-Parameter Open-Access Multilingual Language Model[40] | 176 | 2048[41] | OpenRAIL-M v1[42] | |
StableLM-Alpha | 2023/04 | StableLM-Alpha[43] | Stability AI Launches the First of its StableLM Suite of Language Models[44] | 3 - 65 | 4096[45] | CC BY-SA-4.0 | |
FastChat-T5 | 2023/04 | fastchat-t5-3b-v1.0[46] | We are excited to release FastChat-T5: our compact and commercial-friendly chatbot![47] | 3 | 512 | Apache 2.0 | |
h2oGPT | 2023/05 | h2oGPT[48] | Building the World's Best Open-Source Large Language Model: H2O.ai's Journey[49] | 12 - 20 | 256 - 2048[50] | Apache 2.0 | |
MPT-7B | 2023/05 | MPT-7B[51], MPT-7B-Instruct[52] | Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs[53] | 7 | 84k (ALiBi)[54] | Apache 2.0, CC BY-SA-3.0 | |
RedPajama-INCITE | 2023/05 | RedPajama-INCITE[55] | Releasing 3B and 7B RedPajama-INCITE family of models including base, instruction-tuned & chat models[56] | 3 - 7 | 2048[57] | Apache 2.0 | RedPajama-INCITE-Instruct-3B-v1[58] |
OpenLLaMA | 2023/05 | open_llama_7b_700bt_preview[59], open_llama_3b_600bt_preview[60] | OpenLLaMA: An Open Reproduction of LLaMA[61] | 3, 7 | 2048[62] | Apache 2.0 | OpenLLaMA-7B-Preview_200bt[63] |
Falcon | 2023/05 | Falcon-40B[64], Falcon-7B[65] | Paper coming soon | 7, 40 | 2048 | Apache 2.0 | |
Baichuan | 2023/06 | Baichuan-7B[66] | None | 7 | 4096 | Apache 2.0 | baichuan/7b[67] |
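Most of the checkpoints in the table above are hosted on the Hugging Face Hub, so a quick local trial usually takes only a few lines with the transformers library. Below is a minimal sketch, assuming transformers and torch are installed; google/flan-t5-large is an illustrative pick from the T5/Flan-T5 row (not an endorsement), and the prompt and generation settings are placeholders, not tuned values.

```python
# Minimal sketch: trying an open checkpoint from the table above.
# Assumes `pip install transformers torch` and enough memory for the model;
# "google/flan-t5-large" is an illustrative choice from the T5/Flan-T5 row.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

prompt = "Summarize: Open, commercially usable LLMs let teams fine-tune on in-house data."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Decoder-only entries in the table (e.g., MPT-7B or Falcon) load the same way but through AutoModelForCausalLM instead of the seq2seq class.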
Open LLMs for Code
Language Model | Release Date | Checkpoints | Paper/Blog | Params (B) | Context Length | License | Try It |
--- | --- | --- | --- | --- | --- | --- | --- |
SantaCoder | 2023/01 | santacoder[68] | SantaCoder: don't reach for the stars![69] | 1.1 | 2048[70] | OpenRAIL-M v1[71] | SantaCoder[72] |
StarCoder | 2023/05 | starcoder[73] | StarCoder: A State-of-the-Art LLM for Code[74], StarCoder: May the source be with you![75] | 15 | 8192[76] | OpenRAIL-M v1[77] | |
StarChat Alpha | 2023/05 | starchat-alpha[78] | Creating a Coding Assistant with StarCoder[79] | 16 | 8192[80] | OpenRAIL-M v1[81] | |
Replit Code | 2023/05 | replit-code-v1-3b[82] | Training a SOTA Code LLM in 1 week and Quantifying the Vibes - with Reza Shabani of Replit[83] | 2.7 | Infinite? (ALiBi)[84] | CC BY-SA-4.0 | Replit-Code-v1-3B[85] |
CodeGen2 | 2023/04 | codegen2 1B-16B[86] | CodeGen2: Lessons for Training LLMs on Programming and Natural Languages[87] | 1 - 16 | 2048[88] | Apache 2.0 | |
CodeT5+ | 2023/05 | CodeT5+[89] | CodeT5+: Open Code Large Language Models for Code Understanding and Generation[90] | 0.22 - 16 | 512[91] | BSD-3-Clause | Codet5+-6B[92] |
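The code models load the same way as the general-purpose ones. Here is a minimal completion sketch with bigcode/santacoder from the table; SantaCoder ships custom modeling code, so trust_remote_code=True is required, and the prompt and output length are illustrative only.

```python
# Minimal sketch: code completion with an open code LLM from the table above.
# bigcode/santacoder uses custom modeling code, hence trust_remote_code=True.
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "bigcode/santacoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```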
Open LLM Datasets for Pre-training
Name | Release Date | Paper/Blog | Dataset | Tokens (T) | License |
--- | --- | --- | --- | --- | --- |
starcoderdata | 2023/05 | StarCoder: A State-of-the-Art LLM for Code[93] | starcoderdata[94] | 0.25 | Apache 2.0 |
RedPajama | 2023/04 | RedPajama, a project to create leading open-source models, starts by reproducing LLaMA training dataset of over 1.2 trillion tokens[95] | RedPajama-Data[96] | 1.2 | Apache 2.0 |
Open LLM Datasets for Instruction-Tuning
Name | Release Date | Paper/Blog | Dataset | Samples (K) | License |
--- | --- | --- | --- | --- | --- |
MPT-7B-Instruct | 2023/05 | Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs[97] | dolly_hhrlhf[98] | 59 | CC BY-SA-3.0 |
databricks-dolly-15k | 2023/04 | Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM[99] | databricks-dolly-15k[100] | 15 | CC BY-SA-3.0 |
OIG (Open Instruction Generalist) | 2023/03 | The OIG Dataset[101] | OIG[102] | 44,000 | Apache 2.0 |
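These instruction datasets are likewise on the Hugging Face Hub and load with the datasets library. A minimal sketch with databricks-dolly-15k follows; the field names ("instruction", "response") come from that dataset's card, and the other datasets in the table use different schemas, so check each card first.

```python
# Minimal sketch: inspecting an instruction-tuning dataset from the table above.
# Assumes `pip install datasets`; field names follow the dolly-15k dataset card.
from datasets import load_dataset

ds = load_dataset("databricks/databricks-dolly-15k", split="train")
print(len(ds))  # roughly 15k instruction/response pairs

example = ds[0]
print(example["instruction"])
print(example["response"])
```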
Open LLM Datasets for Alignment-Tuning
Name | Release Date | Paper/Blog | Dataset | Samples (K) | License |
--- | --- | --- | --- | --- | --- |
OpenAssistant Conversations Dataset | 2023/04 | OpenAssistant Conversations - Democratizing Large Language Model Alignment[103] | oasst1[104] | 161 | Apache 2.0 |
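Unlike the instruction datasets above, oasst1 stores whole conversation trees as flat rows of messages: each row carries a message_id, a parent_id pointing at the message it replies to, a role ("prompter" or "assistant"), and the text. Below is a hedged sketch of pairing assistant replies with their parent prompts; the field names reflect the dataset card at the time of writing and may evolve.

```python
# Hedged sketch: pairing assistant replies with their parent prompts in oasst1.
# Field names (message_id, parent_id, role, text) follow the oasst1 dataset card.
from datasets import load_dataset

ds = load_dataset("OpenAssistant/oasst1", split="train")
by_id = {row["message_id"]: row for row in ds}

pairs = [
    (by_id[row["parent_id"]]["text"], row["text"])
    for row in ds
    if row["role"] == "assistant" and row["parent_id"] in by_id
]
print(len(pairs))
print(pairs[0][0][:80], "->", pairs[0][1][:80])
```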
Open LLM Evaluations
• Leaderboard by lmsys.org[105]
• Evals by MosaicML[106]
• Holistic Evaluation of Language Models (HELM)[107]
• LLM-Leaderboard[108]
• TextSynth Server Benchmarks[109]
• Open LLM Leaderboard by Hugging Face[110]
What do the licenses mean?
• Apache 2.0[111]: Allows users to use the software for any purpose, to distribute it, to modify it, and to distribute modified versions under the terms of the license, without royalties.
• MIT[112]: Similar to Apache 2.0, but shorter and simpler. Unlike Apache 2.0, it does not require stating any significant changes made to the original code.
• CC BY-SA-4.0[113]: Allows (i) copying and redistributing the material, and (ii) remixing, transforming, and building upon the material for any purpose, even commercially. If you do the latter, you must distribute your contributions under the same license as the original. (Thus, may not be viable for internal teams.)
• OpenRAIL-M v1[114]: Allows royalty-free access and flexible downstream use and sharing of the model and its modifications, subject to a set of use restrictions (see Attachment A[115]).
• BSD-3-Clause[116]: This version allows unlimited redistribution for any purpose, as long as its copyright notices and the license's disclaimers of warranty are retained.
Disclaimer: The information provided in this repository does not, and is not intended to, constitute legal advice. The maintainers of this repository are not liable for the actions of third parties who use these models. Please consult an attorney before using any of these models for commercial purposes.
Improvements
• Complete the context-length entries, and verify the entries marked with ?
• Add the number of training tokens? (see considerations[117])
• Add (links to) training code?
• Add (links to) evaluation benchmarks?
Statement
This article was translated and adapted from GitHub - eugeneyan/open-llms: A list of open LLMs available for commercial use.[118] It will be synced with updates to that project from time to time; if you find it useful, feel free to like and bookmark it.
References
[1] T5 & Flan-T5: https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints
[2] Flan-T5-xxl (HF): https://huggingface.co/google/flan-t5-xxl
[3] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer: https://github.com/google-research/text-to-text-transfer-transformer#released-model-checkpoints
[4] 512: https://discuss.huggingface.co/t/does-t5-truncate-input-longer-than-512-internally/3602
[5] T5-Large: https://github.com/slai-labs/get-beam/tree/main/examples/t5
[6] UL2 & Flan-UL2: https://github.com/google-research/google-research/tree/master/ul2#checkpoints
[7] Flan-UL2 (HF): https://huggingface.co/google/flan-ul2
[8] UL2 20B: An Open Source Unified Language Learner: https://ai.googleblog.com/2022/10/ul2-20b-open-source-unified-language.html
[9] 512, 2048: https://huggingface.co/google/flan-ul2#tldr
[10] Cerebras-GPT: https://huggingface.co/cerebras
[11] Cerebras-GPT: A Family of Open, Compute-efficient, Large Language Models: https://www.cerebras.net/blog/cerebras-gpt-a-family-of-open-compute-efficient-large-language-models/
[12] Paper: https://arxiv.org/abs/2304.03208
[13] 2048: https://huggingface.co/cerebras/Cerebras-GPT-13B#model-details
[14] Cerebras-GPT-1.3B: https://github.com/slai-labs/get-beam/tree/main/examples/cerebras-gpt
[15] OA-Pythia-12B-SFT-8: https://huggingface.co/OpenAssistant/pythia-12b-sft-v8-7k-steps
[16] OA-Pythia-12B-SFT-4: https://huggingface.co/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5
[17] OA-Pythia-12B-SFT-1: https://huggingface.co/OpenAssistant/oasst-sft-1-pythia-12b
[18] Democratizing Large Language Model Alignment: https://arxiv.org/abs/2304.07327
[19] Pythia-2.8B: https://github.com/slai-labs/get-beam/tree/main/examples/pythia
[20] pythia 70M - 12B: https://github.com/EleutherAI/pythia
[21] Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling: https://arxiv.org/abs/2304.01373
[22] 2048: https://arxiv.org/pdf/2304.01373.pdf
[23] dolly-v2-12b: https://huggingface.co/databricks/dolly-v2-12b
[24] Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
[25] 2048: https://github.com/databrickslabs/dolly#dolly
[26] dlite-v2-1_5b: https://huggingface.co/aisquared/dlite-v2-1_5b
[27] Announcing DLite V2: Lightweight, Open LLMs That Can Run Anywhere: https://medium.com/ai-squared/announcing-dlite-v2-lightweight-open-llms-that-can-run-anywhere-a852e5978c6e
[28] DLite-v2-1.5B: https://github.com/slai-labs/get-beam/tree/main/examples/dlite-v2
[29] RWKV, ChatRWKV: https://github.com/BlinkDL/RWKV-LM#rwkv-parallelizable-rnn-with-transformer-level-llm-performance-pronounced-as-rwakuv-from-4-major-params-r-w-k-v
[30] The RWKV Language Model (and my LM tricks): https://github.com/BlinkDL/RWKV-LM
[31] Unlimited (RNN): https://github.com/BlinkDL/RWKV-LM#rwkv-parallelizable-rnn-with-transformer-level-llm-performance-pronounced-as-rwakuv-from-4-major-params-r-w-k-v
[32] GPT-J-6B: https://github.com/kingoflolz/mesh-transformer-jax/#gpt-j-6b
[33] GPT4All-J: https://github.com/nomic-ai/gpt4all#raw-model
[34] GPT-J-6B: 6B JAX-Based Transformer: https://arankomatsuzaki.wordpress.com/2021/06/04/gpt-j/
[35] 2048: https://github.com/kingoflolz/mesh-transformer-jax/#gpt-j-6b
[36] GPT-NEOX-20B: https://huggingface.co/EleutherAI/gpt-neox-20b
[37] GPT-NeoX-20B: An Open-Source Autoregressive Language Model: https://arxiv.org/abs/2204.06745
[38] 2048: https://huggingface.co/EleutherAI/gpt-neox-20b
[39] Bloom: https://huggingface.co/bigscience/bloom
[40] BLOOM: A 176B-Parameter Open-Access Multilingual Language Model: https://arxiv.org/abs/2211.05100
[41] 2048: https://huggingface.co/bigscience/bloom
[42] OpenRAIL-M v1: https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement
[43] StableLM-Alpha: https://github.com/Stability-AI/StableLM#stablelm-alpha
[44] Stability AI Launches the First of its StableLM Suite of Language Models: https://stability.ai/blog/stability-ai-launches-the-first-of-its-stablelm-suite-of-language-models
[45] 4096: https://github.com/Stability-AI/StableLM#stablelm-alpha
[46] fastchat-t5-3b-v1.0: https://huggingface.co/lmsys/fastchat-t5-3b-v1.0
[47] We are excited to release FastChat-T5: our compact and commercial-friendly chatbot!: https://twitter.com/lmsysorg/status/1652037026705985537?s=20
[48] h2oGPT: https://github.com/h2oai/h2ogpt
[49] Building the World's Best Open-Source Large Language Model: H2O.ai's Journey: https://h2o.ai/blog/building-the-worlds-best-open-source-large-language-model-h2o-ais-journey/
[50] 256 - 2048: https://huggingface.co/h2oai
[51] MPT-7B: https://huggingface.co/mosaicml/mpt-7b
[52] MPT-7B-Instruct: https://huggingface.co/mosaicml/mpt-7b-instruct
[53] Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs: https://www.mosaicml.com/blog/mpt-7b
[54] 84k (ALiBi): https://huggingface.co/mosaicml/mpt-7b#how-is-this-model-different
[55] RedPajama-INCITE: https://huggingface.co/togethercomputer
[56] Releasing 3B and 7B RedPajama-INCITE family of models including base, instruction-tuned & chat models: https://www.together.xyz/blog/redpajama-models-v1
[57] 2048: https://huggingface.co/togethercomputer/RedPajama-INCITE-Instruct-7B-v0.1/blob/157bf3174feebb67f37e131ea68f84dee007c687/config.json#L13
[58] RedPajama-INCITE-Instruct-3B-v1: https://github.com/slai-labs/get-beam/tree/main/examples/redpajama-incite-instruct
[59] open_llama_7b_700bt_preview: https://huggingface.co/openlm-research/open_llama_7b_700bt_preview
[60] open_llama_3b_600bt_preview: https://huggingface.co/openlm-research/open_llama_3b_600bt_preview
[61] OpenLLaMA: An Open Reproduction of LLaMA: https://github.com/openlm-research/open_llama
[62] 2048: https://huggingface.co/h2oai
[63] OpenLLaMA-7B-Preview_200bt: https://github.com/slai-labs/get-beam/tree/main/examples/openllama
[64] Falcon-40B: https://huggingface.co/tiiuae/falcon-40b
[65] Falcon-7B: https://huggingface.co/tiiuae/falcon-7b
[66] Baichuan-7B: https://huggingface.co/baichuan-inc/baichuan-7B
[67] baichuan/7b: https://github.com/baichuan-inc/baichuan-7B
[68] santacoder: https://huggingface.co/bigcode/santacoder
[69] SantaCoder: don't reach for the stars!: https://arxiv.org/abs/2301.03988
[70] 2048: https://huggingface.co/bigcode/santacoder/blob/main/README.md#model-summary
[71] OpenRAIL-M v1: https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement
[72] SantaCoder: https://github.com/slai-labs/get-beam/tree/main/examples/santacoder
[73] starcoder: https://huggingface.co/bigcode/starcoder
[74] StarCoder: A State-of-the-Art LLM for Code: https://huggingface.co/blog/starcoder
[75] StarCoder: May the source be with you!: https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view
[76] 8192: https://huggingface.co/bigcode/starcoder#model-summary
[77] OpenRAIL-M v1: https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement
[78] starchat-alpha: https://huggingface.co/HuggingFaceH4/starchat-alpha
[79] Creating a Coding Assistant with StarCoder: https://huggingface.co/blog/starchat-alpha
[80] 8192: https://huggingface.co/bigcode/starcoder#model-summary
[81] OpenRAIL-M v1: https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement
[82] replit-code-v1-3b: https://huggingface.co/replit/replit-code-v1-3b
[83] Training a SOTA Code LLM in 1 week and Quantifying the Vibes - with Reza Shabani of Replit: https://www.latent.space/p/reza-shabani#details
[84] Infinite? (ALiBi): https://huggingface.co/replit/replit-code-v1-3b#model-description
[85] Replit-Code-v1-3B: https://github.com/slai-labs/get-beam/tree/main/examples/replit-code
[86] codegen2 1B-16B: https://github.com/salesforce/CodeGen2
[87] CodeGen2: Lessons for Training LLMs on Programming and Natural Languages: https://arxiv.org/abs/2305.02309
[88] 2048: https://arxiv.org/abs/2305.02309
[89] CodeT5+: https://github.com/salesforce/CodeT5/tree/main/CodeT5+
[90] CodeT5+: Open Code Large Language Models for Code Understanding and Generation: https://arxiv.org/abs/2305.07922
[91] 512: https://arxiv.org/abs/2305.07922
[92] Codet5+-6B: https://github.com/slai-labs/get-beam/tree/main/examples/codeT5%2B
[93] StarCoder: A State-of-the-Art LLM for Code: https://huggingface.co/blog/starcoder
[94] starcoderdata: https://huggingface.co/datasets/bigcode/starcoderdata
[95] RedPajama, a project to create leading open-source models, starts by reproducing LLaMA training dataset of over 1.2 trillion tokens: https://www.together.xyz/blog/redpajama
[96] RedPajama-Data: https://github.com/togethercomputer/RedPajama-Data
[97] Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs: https://www.mosaicml.com/blog/mpt-7b
[98] dolly_hhrlhf: https://huggingface.co/datasets/mosaicml/dolly_hhrlhf
[99] Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
[100] databricks-dolly-15k: https://huggingface.co/datasets/databricks/databricks-dolly-15k
[101] The OIG Dataset: https://laion.ai/blog/oig-dataset/
[102] OIG: https://huggingface.co/datasets/laion/OIG
[103] OpenAssistant Conversations - Democratizing Large Language Model Alignment: https://drive.google.com/file/d/10iR5hKwFqAKhL3umx8muOWSRm7hs5FqX/view
[104] oasst1: https://huggingface.co/datasets/OpenAssistant/oasst1
[105] Leaderboard by lmsys.org: https://chat.lmsys.org/?leaderboard
[106] Evals by MosaicML: https://twitter.com/jefrankle/status/1654631746506301441
[107] Holistic Evaluation of Language Models (HELM): https://crfm.stanford.edu/helm/latest/?groups=1
[108] LLM-Leaderboard: https://github.com/LudwigStumpp/llm-leaderboard
[109] TextSynth Server Benchmarks: https://bellard.org/ts_server/
[110] Open LLM Leaderboard by Hugging Face: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
[111] Apache 2.0: https://en.wikipedia.org/wiki/Apache_License
[112] MIT: https://en.wikipedia.org/wiki/MIT_License
[113] CC BY-SA-4.0: https://creativecommons.org/licenses/by-sa/4.0/
[114] OpenRAIL-M v1: https://www.bigcode-project.org/docs/pages/model-license/
[115] Attachment A: https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement
[116] BSD-3-Clause: https://en.wikipedia.org/wiki/BSD_licenses
[117] Considerations: https://github.com/eugeneyan/open-llms/issues/7
[118] GitHub - eugeneyan/open-llms: A list of open LLMs available for commercial use.: https://github.com/eugeneyan/open-llms/tree/main