Preface
With AI on the rise, new AI applications keep entering the public eye, and AI is reshaping industry after industry. Most large companies are now pursuing some degree of AI strategy, but only a handful can develop large models in-house. For most companies, fine-tuning an open, commercially usable large model on domain-specific data is becoming a practical choice.
This article is a resource library of the open, commercially usable large language models currently available. It catalogs many large language models, datasets, and other learning resources, and will be updated continuously.
For reference, here are resource lists the author has compiled previously:
A resource library of excellent LangChain-based projects
A resource library of excellent multimodal large models (LLMs)
Open LLMs
All of these LLMs can be used for commercial purposes (e.g., under Apache 2.0, MIT, or OpenRAIL-M licenses). Contributions welcome!
Language Model | Release Date | Checkpoints | Paper/Blog | Params (B) | Context Length | License | Try It |
--- | --- | --- | --- | --- | --- | --- | --- |
T5 | 2019/10 | T5 & Flan-T5[1], Flan-T5-xxl (HF)[2] | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer[3] | 0.06 - 11 | 512[4] | Apache 2.0 | T5-Large[5] |
UL2 | 2022/10 | UL2 & Flan-UL2[6], Flan-UL2 (HF)[7] | UL2 20B: An Open Source Unified Language Learner[8] | 20 | 512, 2048[9] | Apache 2.0 | |
Cerebras-GPT | 2023/03 | Cerebras-GPT[10] | Cerebras-GPT: A Family of Open, Compute-efficient, Large Language Models[11] (Paper[12]) | 0.111 - 13 | 2048[13] | Apache 2.0 | Cerebras-GPT-1.3B[14] |
Open Assistant (Pythia family) | 2023/03 | OA-Pythia-12B-SFT-8[15], OA-Pythia-12B-SFT-4[16], OA-Pythia-12B-SFT-1[17] | Democratizing Large Language Model Alignment[18] | 12 | 2048 | Apache 2.0 | Pythia-2.8B[19] |
Pythia | 2023/04 | pythia 70M - 12B[20] | Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling[21] | 0.07 - 12 | 2048[22] | Apache 2.0 | |
Dolly | 2023/04 | dolly-v2-12b[23] | Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM[24] | 3, 7, 12 | 2048[25] | MIT | |
DLite | 2023/05 | dlite-v2-1_5b[26] | Announcing DLite V2: Lightweight, Open LLMs That Can Run Anywhere[27] | 0.124 - 1.5 | 1024 | Apache 2.0 | DLite-v2-1.5B[28] |
RWKV | 2021/08 | RWKV, ChatRWKV[29] | The RWKV Language Model (and my LM tricks)[30] | 0.1 - 14 | Unlimited (RNN)[31] | Apache 2.0 | |
GPT-J-6B | 2021/06 | GPT-J-6B[32], GPT4All-J[33] | GPT-J-6B: 6B JAX-Based Transformer[34] | 6 | 2048[35] | Apache 2.0 | |
GPT-NeoX-20B | 2022/04 | GPT-NEOX-20B[36] | GPT-NeoX-20B: An Open-Source Autoregressive Language Model[37] | 20 | 2048[38] | Apache 2.0 | |
Bloom | 2022/11 | Bloom[39] | BLOOM: A 176B-Parameter Open-Access Multilingual Language Model[40] | 176 | 2048[41] | OpenRAIL-M v1[42] | |
StableLM-Alpha | 2023/04 | StableLM-Alpha[43] | Stability AI Launches the First of its StableLM Suite of Language Models[44] | 3 - 65 | 4096[45] | CC BY-SA-4.0 | |
FastChat-T5 | 2023/04 | fastchat-t5-3b-v1.0[46] | We are excited to release FastChat-T5: our compact and commercial-friendly chatbot![47] | 3 | 512 | Apache 2.0 | |
h2oGPT | 2023/05 | h2oGPT[48] | Building the World's Best Open-Source Large Language Model: H2O.ai's Journey[49] | 12 - 20 | 256 - 2048[50] | Apache 2.0 | |
MPT-7B | 2023/05 | MPT-7B[51], MPT-7B-Instruct[52] | Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs[53] | 7 | 84k (ALiBi)[54] | Apache 2.0, CC BY-SA-3.0 | |
RedPajama-INCITE | 2023/05 | RedPajama-INCITE[55] | Releasing 3B and 7B RedPajama-INCITE family of models including base, instruction-tuned & chat models[56] | 3 - 7 | 2048[57] | Apache 2.0 | RedPajama-INCITE-Instruct-3B-v1[58] |
OpenLLaMA | 2023/05 | open_llama_7b_700bt_preview[59], open_llama_3b_600bt_preview[60] | OpenLLaMA: An Open Reproduction of LLaMA[61] | 3, 7 | 2048[62] | Apache 2.0 | OpenLLaMA-7B-Preview_200bt[63] |
Falcon | 2023/05 | Falcon-40B[64], Falcon-7B[65] | Paper coming soon | 7, 40 | 2048 | Apache 2.0 | |
Baichuan | 2023/06 | Baichuan-7B[66] | None | 7 | 4096 | Apache 2.0 | baichuan/7b[67] |
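Most of the checkpoints in the table above are hosted on the Hugging Face Hub, so a quick local trial usually takes only a few lines with the transformers library. Below is a minimal sketch, assuming transformers and torch are installed; google/flan-t5-large is an illustrative pick from the T5/Flan-T5 row (not an endorsement), and the prompt and generation settings are placeholders, not tuned values.

```python
# Minimal sketch: trying an open checkpoint from the table above.
# Assumes `pip install transformers torch` and enough memory for the model;
# "google/flan-t5-large" is an illustrative choice from the T5/Flan-T5 row.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

prompt = "Summarize: Open, commercially usable LLMs let teams fine-tune on in-house data."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Decoder-only entries in the table (e.g., MPT-7B or Falcon) load the same way but through AutoModelForCausalLM instead of the seq2seq class.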
Open LLMs for Code
Language Model | Release Date | Checkpoints | Paper/Blog | Params (B) | Context Length | License | Try It |
--- | --- | --- | --- | --- | --- | --- | --- |
SantaCoder | 2023/01 | santacoder[68] | SantaCoder: don't reach for the stars![69] | 1.1 | 2048[70] | OpenRAIL-M v1[71] | SantaCoder[72] |
StarCoder | 2023/05 | starcoder[73] | StarCoder: A State-of-the-Art LLM for Code[74], StarCoder: May the source be with you![75] | 15 | 8192[76] | OpenRAIL-M v1[77] | |
StarChat Alpha | 2023/05 | starchat-alpha[78] | Creating a Coding Assistant with StarCoder[79] | 16 | 8192[80] | OpenRAIL-M v1[81] | |
Replit Code | 2023/05 | replit-code-v1-3b[82] | Training a SOTA Code LLM in 1 week and Quantifying the Vibes - with Reza Shabani of Replit[83] | 2.7 | Infinite? (ALiBi)[84] | CC BY-SA-4.0 | Replit-Code-v1-3B[85] |
CodeGen2 | 2023/04 | codegen2 1B-16B[86] | CodeGen2: Lessons for Training LLMs on Programming and Natural Languages[87] | 1 - 16 | 2048[88] | Apache 2.0 | |
CodeT5+ | 2023/05 | CodeT5+[89] | CodeT5+: Open Code Large Language Models for Code Understanding and Generation[90] | 0.22 - 16 | 512[91] | BSD-3-Clause | Codet5+-6B[92] |
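The code models load the same way as the general-purpose ones. Here is a minimal completion sketch with bigcode/santacoder from the table; SantaCoder ships custom modeling code, so trust_remote_code=True is required, and the prompt and output length are illustrative only.

```python
# Minimal sketch: code completion with an open code LLM from the table above.
# bigcode/santacoder uses custom modeling code, hence trust_remote_code=True.
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "bigcode/santacoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```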
Open LLM Datasets for Pre-training
Name | Release Date | Paper/Blog | Dataset | Tokens (T) | License |
--- | --- | --- | --- | --- | --- |
starcoderdata | 2023/05 | StarCoder: A State-of-the-Art LLM for Code[93] | starcoderdata[94] | 0.25 | Apache 2.0 |
RedPajama | 2023/04 | RedPajama, a project to create leading open-source models, starts by reproducing LLaMA training dataset of over 1.2 trillion tokens[95] | RedPajama-Data[96] | 1.2 | Apache 2.0 |
Open LLM Datasets for Instruction-Tuning
Name | Release Date | Paper/Blog | Dataset | Samples (K) | License |
--- | --- | --- | --- | --- | --- |
MPT-7B-Instruct | 2023/05 | Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs[97] | dolly_hhrlhf[98] | 59 | CC BY-SA-3.0 |
databricks-dolly-15k | 2023/04 | Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM[99] | databricks-dolly-15k[100] | 15 | CC BY-SA-3.0 |
OIG (Open Instruction Generalist) | 2023/03 | The OIG Dataset[101] | OIG[102] | 44,000 | Apache 2.0 |
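These instruction datasets are likewise on the Hugging Face Hub and load with the datasets library. A minimal sketch with databricks-dolly-15k follows; the field names ("instruction", "response") come from that dataset's card, and the other datasets in the table use different schemas, so check each card first.

```python
# Minimal sketch: inspecting an instruction-tuning dataset from the table above.
# Assumes `pip install datasets`; field names follow the dolly-15k dataset card.
from datasets import load_dataset

ds = load_dataset("databricks/databricks-dolly-15k", split="train")
print(len(ds))  # roughly 15k instruction/response pairs

example = ds[0]
print(example["instruction"])
print(example["response"])
```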
Open LLM Datasets for Alignment-Tuning
Name | Release Date | Paper/Blog | Dataset | Samples (K) | License |
--- | --- | --- | --- | --- | --- |
OpenAssistant Conversations Dataset | 2023/04 | OpenAssistant Conversations - Democratizing Large Language Model Alignment[103] | oasst1[104] | 161 | Apache 2.0 |
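Unlike the instruction datasets above, oasst1 stores whole conversation trees as flat rows of messages: each row carries a message_id, a parent_id pointing at the message it replies to, a role ("prompter" or "assistant"), and the text. Below is a hedged sketch of pairing assistant replies with their parent prompts; the field names reflect the dataset card at the time of writing and may evolve.

```python
# Hedged sketch: pairing assistant replies with their parent prompts in oasst1.
# Field names (message_id, parent_id, role, text) follow the oasst1 dataset card.
from datasets import load_dataset

ds = load_dataset("OpenAssistant/oasst1", split="train")
by_id = {row["message_id"]: row for row in ds}

pairs = [
    (by_id[row["parent_id"]]["text"], row["text"])
    for row in ds
    if row["role"] == "assistant" and row["parent_id"] in by_id
]
print(len(pairs))
print(pairs[0][0][:80], "->", pairs[0][1][:80])
```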
Open LLM Evaluations
• Leaderboard by lmsys.org[105]
• Evals by MosaicML[106]
• Holistic Evaluation of Language Models (HELM)[107]
• LLM-Leaderboard[108]
• TextSynth Server Benchmarks[109]
• Open LLM Leaderboard by Hugging Face[110]
What do the licenses mean?
• Apache 2.0[111]: Allows users to use the software for any purpose, to distribute it, to modify it, and to distribute modified versions under the terms of the license, without royalties.
• MIT[112]: Similar to Apache 2.0, but shorter and simpler. Unlike Apache 2.0, it does not require stating any significant changes made to the original code.
• CC BY-SA-4.0[113]: Allows (i) copying and redistributing the material, and (ii) remixing, transforming, and building upon the material for any purpose, even commercially. If you do the latter, you must distribute your contributions under the same license as the original. (Thus, may not be viable for internal teams.)
• OpenRAIL-M v1[114]: Allows royalty-free access and flexible downstream use and sharing of the model and its modifications, subject to a set of use restrictions (see Attachment A[115]).
• BSD-3-Clause[116]: This version allows unlimited redistribution for any purpose, as long as its copyright notices and the license's disclaimers of warranty are retained.
Disclaimer: The information provided in this repository does not, and is not intended to, constitute legal advice. The maintainers of this repository are not liable for the actions of third parties who use these models. Please consult an attorney before using any of these models for commercial purposes.
Improvements
• Complete the context-length entries, and verify the entries marked with ?
• Add the number of training tokens? (see considerations[117])
• Add (links to) training code?
• Add (links to) evaluation benchmarks?
Statement
This article was translated and adapted from GitHub - eugeneyan/open-llms: A list of open LLMs available for commercial use.[118] It will be synced with updates to that project from time to time; if you find it useful, feel free to like and bookmark it.
References
[1] T5 & Flan-T5: https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints
[2] Flan-T5-xxl (HF): https://huggingface.co/google/flan-t5-xxl
[3] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer: https://github.com/google-research/text-to-text-transfer-transformer#released-model-checkpoints
[4] 512: https://discuss.huggingface.co/t/does-t5-truncate-input-longer-than-512-internally/3602
[5] T5-Large: https://github.com/slai-labs/get-beam/tree/main/examples/t5
[6] UL2 & Flan-UL2: https://github.com/google-research/google-research/tree/master/ul2#checkpoints
[7] Flan-UL2 (HF): https://huggingface.co/google/flan-ul2
[8] UL2 20B: An Open Source Unified Language Learner: https://ai.googleblog.com/2022/10/ul2-20b-open-source-unified-language.html
[9] 512, 2048: https://huggingface.co/google/flan-ul2#tldr
[10] Cerebras-GPT: https://huggingface.co/cerebras
[11] Cerebras-GPT: A Family of Open, Compute-efficient, Large Language Models: https://www.cerebras.net/blog/cerebras-gpt-a-family-of-open-compute-efficient-large-language-models/
[12] Paper: https://arxiv.org/abs/2304.03208
[13] 2048: https://huggingface.co/cerebras/Cerebras-GPT-13B#model-details
[14] Cerebras-GPT-1.3B: https://github.com/slai-labs/get-beam/tree/main/examples/cerebras-gpt
[15] OA-Pythia-12B-SFT-8: https://huggingface.co/OpenAssistant/pythia-12b-sft-v8-7k-steps
[16] OA-Pythia-12B-SFT-4: https://huggingface.co/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5
[17] OA-Pythia-12B-SFT-1: https://huggingface.co/OpenAssistant/oasst-sft-1-pythia-12b
[18] Democratizing Large Language Model Alignment: https://arxiv.org/abs/2304.07327
[19] Pythia-2.8B: https://github.com/slai-labs/get-beam/tree/main/examples/pythia
[20] pythia 70M - 12B: https://github.com/EleutherAI/pythia
[21] Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling: https://arxiv.org/abs/2304.01373
[22] 2048: https://arxiv.org/pdf/2304.01373.pdf
[23] dolly-v2-12b: https://huggingface.co/databricks/dolly-v2-12b
[24] Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
[25] 2048: https://github.com/databrickslabs/dolly#dolly
[26] dlite-v2-1_5b: https://huggingface.co/aisquared/dlite-v2-1_5b
[27] Announcing DLite V2: Lightweight, Open LLMs That Can Run Anywhere: https://medium.com/ai-squared/announcing-dlite-v2-lightweight-open-llms-that-can-run-anywhere-a852e5978c6e
[28] DLite-v2-1.5B: https://github.com/slai-labs/get-beam/tree/main/examples/dlite-v2
[29] RWKV, ChatRWKV: https://github.com/BlinkDL/RWKV-LM#rwkv-parallelizable-rnn-with-transformer-level-llm-performance-pronounced-as-rwakuv-from-4-major-params-r-w-k-v
[30] The RWKV Language Model (and my LM tricks): https://github.com/BlinkDL/RWKV-LM
[31] Unlimited (RNN): https://github.com/BlinkDL/RWKV-LM#rwkv-parallelizable-rnn-with-transformer-level-llm-performance-pronounced-as-rwakuv-from-4-major-params-r-w-k-v
[32] GPT-J-6B: https://github.com/kingoflolz/mesh-transformer-jax/#gpt-j-6b
[33] GPT4All-J: https://github.com/nomic-ai/gpt4all#raw-model
[34] GPT-J-6B: 6B JAX-Based Transformer: https://arankomatsuzaki.wordpress.com/2021/06/04/gpt-j/
[35] 2048: https://github.com/kingoflolz/mesh-transformer-jax/#gpt-j-6b
[36] GPT-NEOX-20B: https://huggingface.co/EleutherAI/gpt-neox-20b
[37] GPT-NeoX-20B: An Open-Source Autoregressive Language Model: https://arxiv.org/abs/2204.06745
[38] 2048: https://huggingface.co/EleutherAI/gpt-neox-20b
[39] Bloom: https://huggingface.co/bigscience/bloom
[40] BLOOM: A 176B-Parameter Open-Access Multilingual Language Model: https://arxiv.org/abs/2211.05100
[41] 2048: https://huggingface.co/bigscience/bloom
[42] OpenRAIL-M v1: https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement
[43] StableLM-Alpha: https://github.com/Stability-AI/StableLM#stablelm-alpha
[44] Stability AI Launches the First of its StableLM Suite of Language Models: https://stability.ai/blog/stability-ai-launches-the-first-of-its-stablelm-suite-of-language-models
[45] 4096: https://github.com/Stability-AI/StableLM#stablelm-alpha
[46] fastchat-t5-3b-v1.0: https://huggingface.co/lmsys/fastchat-t5-3b-v1.0
[47] We are excited to release FastChat-T5: our compact and commercial-friendly chatbot!: https://twitter.com/lmsysorg/status/1652037026705985537?s=20
[48] h2oGPT: https://github.com/h2oai/h2ogpt
[49] Building the World's Best Open-Source Large Language Model: H2O.ai's Journey: https://h2o.ai/blog/building-the-worlds-best-open-source-large-language-model-h2o-ais-journey/
[50] 256 - 2048: https://huggingface.co/h2oai
[51] MPT-7B: https://huggingface.co/mosaicml/mpt-7b
[52] MPT-7B-Instruct: https://huggingface.co/mosaicml/mpt-7b-instruct
[53] Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs: https://www.mosaicml.com/blog/mpt-7b
[54] 84k (ALiBi): https://huggingface.co/mosaicml/mpt-7b#how-is-this-model-different
[55] RedPajama-INCITE: https://huggingface.co/togethercomputer
[56] Releasing 3B and 7B RedPajama-INCITE family of models including base, instruction-tuned & chat models: https://www.together.xyz/blog/redpajama-models-v1
[57] 2048: https://huggingface.co/togethercomputer/RedPajama-INCITE-Instruct-7B-v0.1/blob/157bf3174feebb67f37e131ea68f84dee007c687/config.json#L13
[58] RedPajama-INCITE-Instruct-3B-v1: https://github.com/slai-labs/get-beam/tree/main/examples/redpajama-incite-instruct
[59] open_llama_7b_700bt_preview: https://huggingface.co/openlm-research/open_llama_7b_700bt_preview
[60] open_llama_3b_600bt_preview: https://huggingface.co/openlm-research/open_llama_3b_600bt_preview
[61] OpenLLaMA: An Open Reproduction of LLaMA: https://github.com/openlm-research/open_llama
[62] 2048: https://huggingface.co/h2oai
[63] OpenLLaMA-7B-Preview_200bt: https://github.com/slai-labs/get-beam/tree/main/examples/openllama
[64] Falcon-40B: https://huggingface.co/tiiuae/falcon-40b
[65] Falcon-7B: https://huggingface.co/tiiuae/falcon-7b
[66] Baichuan-7B: https://huggingface.co/baichuan-inc/baichuan-7B
[67] baichuan/7b: https://github.com/baichuan-inc/baichuan-7B
[68] santacoder: https://huggingface.co/bigcode/santacoder
[69] SantaCoder: don't reach for the stars!: https://arxiv.org/abs/2301.03988
[70] 2048: https://huggingface.co/bigcode/santacoder/blob/main/README.md#model-summary
[71] OpenRAIL-M v1: https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement
[72] SantaCoder: https://github.com/slai-labs/get-beam/tree/main/examples/santacoder
[73] starcoder: https://huggingface.co/bigcode/starcoder
[74] StarCoder: A State-of-the-Art LLM for Code: https://huggingface.co/blog/starcoder
[75] StarCoder: May the source be with you!: https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view
[76] 8192: https://huggingface.co/bigcode/starcoder#model-summary
[77] OpenRAIL-M v1: https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement
[78] starchat-alpha: https://huggingface.co/HuggingFaceH4/starchat-alpha
[79] Creating a Coding Assistant with StarCoder: https://huggingface.co/blog/starchat-alpha
[80] 8192: https://huggingface.co/bigcode/starcoder#model-summary
[81] OpenRAIL-M v1: https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement
[82] replit-code-v1-3b: https://huggingface.co/replit/replit-code-v1-3b
[83] Training a SOTA Code LLM in 1 week and Quantifying the Vibes - with Reza Shabani of Replit: https://www.latent.space/p/reza-shabani#details
[84] Infinite? (ALiBi): https://huggingface.co/replit/replit-code-v1-3b#model-description
[85] Replit-Code-v1-3B: https://github.com/slai-labs/get-beam/tree/main/examples/replit-code
[86] codegen2 1B-16B: https://github.com/salesforce/CodeGen2
[87] CodeGen2: Lessons for Training LLMs on Programming and Natural Languages: https://arxiv.org/abs/2305.02309
[88] 2048: https://arxiv.org/abs/2305.02309
[89] CodeT5+: https://github.com/salesforce/CodeT5/tree/main/CodeT5+
[90] CodeT5+: Open Code Large Language Models for Code Understanding and Generation: https://arxiv.org/abs/2305.07922
[91] 512: https://arxiv.org/abs/2305.07922
[92] Codet5+-6B: https://github.com/slai-labs/get-beam/tree/main/examples/codeT5%2B
[93] StarCoder: A State-of-the-Art LLM for Code: https://huggingface.co/blog/starcoder
[94] starcoderdata: https://huggingface.co/datasets/bigcode/starcoderdata
[95] RedPajama, a project to create leading open-source models, starts by reproducing LLaMA training dataset of over 1.2 trillion tokens: https://www.together.xyz/blog/redpajama
[96] RedPajama-Data: https://github.com/togethercomputer/RedPajama-Data
[97] Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs: https://www.mosaicml.com/blog/mpt-7b
[98] dolly_hhrlhf: https://huggingface.co/datasets/mosaicml/dolly_hhrlhf
[99] Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
[100] databricks-dolly-15k: https://huggingface.co/datasets/databricks/databricks-dolly-15k
[101] The OIG Dataset: https://laion.ai/blog/oig-dataset/
[102] OIG: https://huggingface.co/datasets/laion/OIG
[103] OpenAssistant Conversations - Democratizing Large Language Model Alignment: https://drive.google.com/file/d/10iR5hKwFqAKhL3umx8muOWSRm7hs5FqX/view
[104] oasst1: https://huggingface.co/datasets/OpenAssistant/oasst1
[105] Leaderboard by lmsys.org: https://chat.lmsys.org/?leaderboard
[106] Evals by MosaicML: https://twitter.com/jefrankle/status/1654631746506301441
[107] Holistic Evaluation of Language Models (HELM): https://crfm.stanford.edu/helm/latest/?groups=1
[108] LLM-Leaderboard: https://github.com/LudwigStumpp/llm-leaderboard
[109] TextSynth Server Benchmarks: https://bellard.org/ts_server/
[110] Open LLM Leaderboard by Hugging Face: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
[111] Apache 2.0: https://en.wikipedia.org/wiki/Apache_License
[112] MIT: https://en.wikipedia.org/wiki/MIT_License
[113] CC BY-SA-4.0: https://creativecommons.org/licenses/by-sa/4.0/
[114] OpenRAIL-M v1: https://www.bigcode-project.org/docs/pages/model-license/
[115] Attachment A: https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement
[116] BSD-3-Clause: https://en.wikipedia.org/wiki/BSD_licenses
[117] Considerations: https://github.com/eugeneyan/open-llms/issues/7
[118] GitHub - eugeneyan/open-llms: A list of open LLMs available for commercial use.: https://github.com/eugeneyan/open-llms/tree/main