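Introduction

Today, Meta released Llama 2, a family of state-of-the-art open large language models, and we are excited to fully support the launch with comprehensive integration in Hugging Face. Llama 2 ships under a very permissive community license and is available for commercial use. The code, pretrained models, and fine-tuned models are all being released today.

We have worked with Meta to ensure a smooth integration of Llama 2. You can find 12 open-access models on the Hub: 3 base models and 3 fine-tuned models, each available as two checkpoints (the original Meta checkpoint and a transformers-format checkpoint). The main pieces of Hugging Face support for Llama 2 are: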

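Llama 2 models on the Hub, with their model cards and license
transformers integration
Examples for fine-tuning the small variants of the model on a single GPU
Integration with Text Generation Inference (TGI) for fast and efficient production-ready inference
Integration with Inference Endpoints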
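
As a rough illustration of how the 12 checkpoints break down, the sketch below enumerates repository names from 3 sizes, 2 variants, and 2 formats. Only meta-llama/Llama-2-7b-hf and meta-llama/Llama-2-7b-chat-hf are named in this post; the remaining names are an assumption extrapolated from that naming pattern, so check the Hub for the exact identifiers.

```python
from itertools import product

SIZES = ["7b", "13b", "70b"]
VARIANTS = ["", "-chat"]  # base model vs. fine-tuned chat model
FORMATS = ["", "-hf"]     # original Meta checkpoint vs. transformers format


def llama2_repo_names(org="meta-llama"):
    """Build the assumed repo names: 3 sizes x 2 variants x 2 formats."""
    return [
        f"{org}/Llama-2-{size}{variant}{fmt}"
        for size, variant, fmt in product(SIZES, VARIANTS, FORMATS)
    ]


repo_names = llama2_repo_names()
print(len(repo_names))
for name in repo_names:
    print(name)
```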

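Why Llama 2?

Llama 2 introduces a family of pretrained and fine-tuned LLMs in three sizes: 7B, 13B, and 70B parameters. The pretrained models improve significantly on the Llama 1 models: they were trained on 40% more tokens, have a longer context length (4k tokens), and use grouped-query attention to speed up inference of the 70B model.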
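
To get a feel for why grouped-query attention helps at inference time, consider the key-value cache, which stores one key and one value vector per key-value head, per layer, per token. The sketch below is back-of-the-envelope arithmetic only. The architecture numbers (80 layers, 64 query heads, 8 key-value heads, head dimension 128) are assumptions taken from the published Llama 2 70B configuration rather than from this post, and the helper name is illustrative.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache for one token: a key and a value vector per KV head per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed Llama 2 70B shape (not stated in this post).
NUM_LAYERS, NUM_QUERY_HEADS, NUM_KV_HEADS, HEAD_DIM = 80, 64, 8, 128

# Standard multi-head attention would keep one KV head per query head.
mha_bytes = kv_cache_bytes_per_token(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM)
# Grouped-query attention shares each KV head across a group of query heads.
gqa_bytes = kv_cache_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)

context_length = 4096
print(f"MHA: {mha_bytes * context_length / 2**30:.2f} GiB per 4k-token sequence")
print(f"GQA: {gqa_bytes * context_length / 2**30:.2f} GiB per 4k-token sequence")
print(f"Reduction: {mha_bytes // gqa_bytes}x")
```

Under these assumptions, the fp16 cache for one 4k-token sequence shrinks by a factor of 8, which leaves more memory for larger batches.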

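The most exciting part of the release, however, is the fine-tuned models (Llama 2-Chat), which have been optimized for dialogue using Reinforcement Learning from Human Feedback (RLHF). Across a wide range of helpfulness and safety benchmarks, the Llama 2-Chat models perform better than most open models, and human evaluations rate them as comparable to ChatGPT. See the paper for more details.

Model training and fine-tuning workflow

Figure from the paper Llama 2: Open Foundation and Fine-Tuned Chat Models

If you have been waiting for an open alternative to closed-source chatbots, the wait is over: Llama 2-Chat is likely your best choice!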

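| Model | License | Commercial use? | Pretraining tokens | Leaderboard score |
| --- | --- | --- | --- | --- |
| Falcon-7B | Apache 2.0 | ✅ | 1,500B | 47.01 |
| MPT-7B | Apache 2.0 | ✅ | 1,000B | 48.7 |
| Llama-7B | Llama license | ❌ | 1,000B | 49.71 |
| Llama-2-7B | Llama 2 license | ✅ | 2,000B | 54.32 |
| Llama-33B | Llama license | ❌ | 1,500B | * |
| Llama-2-13B | Llama 2 license | ✅ | 2,000B | 58.67 |
| mpt-30B | Apache 2.0 | ✅ | 1,000B | 55.7 |
| Falcon-40B | Apache 2.0 | ✅ | 1,000B | 61.5 |
| Llama-65B | Llama license | ❌ | 1,500B | 62.1 |
| Llama-2-70B | Llama 2 license | ✅ | 2,000B | * |
| Llama-2-70B-chat* | Llama 2 license | ✅ | 2,000B | 66.8 |

*We are currently evaluating Llama 2 70B (the non-chat version). This table will be updated with the results.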

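Demo

You can easily try the big Llama 2 model (70 billion parameters!) in this Space or in the app below:

Both are backed by Hugging Face's TGI framework, which also powers HuggingChat. We share more about it in the sections below.

Inference

This section covers two different ways to run inference with the Llama 2 models. Before using these models, make sure you have requested access on the Meta Llama 2 repository page.

Note: Make sure to also fill in the official Meta form, following the instructions on that page. Access to the model repositories is granted a few hours after both forms have been submitted.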

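Using transformers

Starting with transformers release 4.31, Llama 2 works with all the tools in the HF ecosystem, such as:

Training and inference scripts and examples
Safe file format (safetensors)
Integrations with tools such as bitsandbytes (4-bit quantization) and PEFT
Utilities and helpers to run generation with the model
Mechanisms to export the models for deployment

Just make sure you are using the latest transformers release and are logged in to your Hugging Face account.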

pip install transformers
huggingface-cli login
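
If you are unsure whether an existing environment is recent enough, a quick check along the following lines can help. This is a minimal sketch using only the standard library; the helper names are illustrative, and the parsing assumes a conventional `major.minor.patch` version string.

```python
from importlib.metadata import PackageNotFoundError, version

MINIMUM = (4, 31)  # first transformers release with Llama 2 support, per this post


def parse_major_minor(version_string):
    """Return (major, minor) from strings such as '4.31.0' or '4.32.0.dev0'."""
    major, minor = version_string.split(".")[:2]
    return int(major), int(minor)


def supports_llama2(version_string, minimum=MINIMUM):
    """True if the given version is at least the minimum release."""
    return parse_major_minor(version_string) >= minimum


try:
    installed = version("transformers")
    print(f"transformers {installed}: Llama 2 support = {supports_llama2(installed)}")
except PackageNotFoundError:
    print("transformers is not installed; run `pip install transformers` first")
```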

The following snippet shows how to run inference with transformers:

from transformers import AutoTokenizer
import transformers
import torch

# Chat-tuned 7B checkpoint in transformers format (requires approved access).
model = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model)
# Load the weights in half precision and place them on the available devices.
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Sample one completion with top-k sampling; max_length counts prompt plus generated tokens.
sequences = pipeline(
    'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Result: I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?
Answer:
Of course! If you enjoyed "Breaking Bad" and "Band of Brothers," here are some other TV shows you might enjoy:
1. "The Sopranos" - This HBO series is a crime drama that explores the life of a New Jersey mob boss, Tony Soprano, as he navigates the criminal underworld and deals with personal and family issues.
2. "The Wire" - This HBO series is a gritty and realistic portrayal of the drug trade in Baltimore, exploring the impact of drugs on individuals, communities, and the criminal justice system.
3. "Mad Men" - Set in the 1960s, this AMC series follows the lives of advertising executives on Madison Avenue, expl

Although the models have a context length of only 4k tokens, techniques supported in transformers, such as rotary position embedding scaling, can extend it further!
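
As a rough sketch of the idea behind linear rotary position embedding scaling: position indices are divided by a scaling factor, so positions beyond the trained 4k window map back onto rotation angles the model has already seen. The helper below is illustrative plain Python, not the transformers implementation, and the head dimension of 8 and base of 10000 are assumptions chosen to keep the example small. In transformers, as far as we are aware, this is configured through the `rope_scaling` entry of the Llama configuration (for example `{"type": "linear", "factor": 2.0}`); check the documentation of your installed version.

```python
import math


def rope_angles(position, head_dim=8, base=10000.0, scaling_factor=1.0):
    """Rotation angles for one position; linear scaling divides the position index."""
    scaled_position = position / scaling_factor
    return [
        scaled_position / (base ** (2 * i / head_dim))
        for i in range(head_dim // 2)
    ]


trained_context = 4096
factor = 2.0

# Last position inside the trained window, without scaling.
angles_in_range = rope_angles(trained_context - 1)
# A position twice as far out, with a scaling factor of 2.
angles_extended = rope_angles(2 * (trained_context - 1), scaling_factor=factor)

same = all(math.isclose(a, b) for a, b in zip(angles_in_range, angles_extended))
print(same)
```

The scaled far position lands on the same angles as the last trained position, which is the mechanism that lets the model address a longer window, usually at some cost in quality unless it is fine-tuned with the scaling applied.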

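Using TGI and Inference Endpoints

Text Generation Inference (TGI) is a production-ready inference container developed by Hugging Face for easily deploying large language models. It supports continuous batching, token streaming, tensor parallelism for fast inference on multiple GPUs, and production-ready logging and tracing.

You can try TGI on your own infrastructure, or use Hugging Face's Inference Endpoints directly. To deploy a Llama 2 model with Inference Endpoints, go to the model page and click Deploy -> Inference Endpoints.

For 7B models, we advise you to select “GPU [medium] - 1x Nvidia A10G”.
For 13B models, we advise you to select “GPU [xlarge] - 1x Nvidia A100”.
For 70B models, we advise you to select “GPU [xxxlarge] - 8x Nvidia A100”.

Note: If your quota is insufficient, send an email to [email protected] to request a quota upgrade. Once it is approved, you will be able to access A100s.

You can learn more about deploying LLMs with Hugging Face Inference Endpoints in our other blog post, which covers the supported hyperparameters and how to stream responses using the Python and JavaScript APIs.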
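
Once an endpoint is running, querying it comes down to an authenticated HTTP POST. The sketch below builds, but does not send, a request for TGI's `/generate` route using only the standard library. The endpoint URL and token are hypothetical placeholders, the helper name is illustrative, and the parameter values are just examples; consult the TGI documentation for the full list of supported parameters.

```python
import json
import urllib.request


def build_generate_request(endpoint_url, prompt, token, max_new_tokens=128, top_k=10):
    """Build (but do not send) a POST request for TGI's /generate route."""
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "do_sample": True,
            "top_k": top_k,
        },
    }
    return urllib.request.Request(
        url=endpoint_url.rstrip("/") + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical values: replace with your own endpoint URL and access token.
request = build_generate_request(
    "https://example.endpoints.huggingface.cloud",
    "Do you have any recommendations of other shows I might like?",
    token="hf_xxx",
)
print(request.full_url)

# To actually call the endpoint:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read())["generated_text"])
```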

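Fine-tuning with PEFT

Training LLMs can be technically and computationally challenging. This section looks at the tools in the Hugging Face ecosystem for training Llama 2 efficiently on modest hardware, and shows how to fine-tune the 7B version of Llama 2 on a single NVIDIA T4 (16GB, Google Colab). You can learn more in the blog post Making LLMs even more accessible.

We created a script that instruction-tunes Llama 2 using QLoRA and the SFTTrainer from trl.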
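
To see why 4-bit quantization matters on a 16GB card, a back-of-the-envelope estimate of the memory taken by the weights alone is useful. The sketch below uses nominal parameter counts and ignores quantization constants, activations, LoRA adapters, and optimizer state, so treat the figures as lower bounds rather than measurements; the helper name is illustrative.

```python
def weight_memory_gb(num_params_billion, bits_per_param):
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return num_params_billion * bits_per_param / 8


for size in (7, 13, 70):
    fp16 = weight_memory_gb(size, 16)
    int4 = weight_memory_gb(size, 4)
    print(f"Llama 2 {size}B: fp16 ~{fp16:.1f} GB, 4-bit ~{int4:.1f} GB")
```

For the 7B model, fp16 weights alone take about 14 GB, which leaves very little of a 16GB T4 for anything else, while 4-bit weights take roughly 3.5 GB.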

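The command below shows an example of fine-tuning Llama 2 7B on the timdettmers/openassistant-guanaco dataset. With the merge_and_push argument, the script can merge the LoRA weights into the model weights and save them in safetensors format, so that the fine-tuned model can then be deployed with TGI and Inference Endpoints.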
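
Conceptually, merging LoRA weights means folding the low-rank update back into the base weight matrix: W' = W + (alpha / r) * B A. The toy example below sketches that arithmetic in plain Python on tiny matrices. It is not the script's or PEFT's implementation, and the function names and values are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: a 2x3 base weight with a rank-1 adapter.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[1.0, 2.0, 3.0]]  # shape: rank x d_in
B = [[0.5], [-1.0]]    # shape: d_out x rank

merged = merge_lora(W, A, B, alpha=2, rank=1)
print(merged)
```

After merging, the model has the same shape as the base model and needs no adapter code at inference time, which is what makes it straightforward to serve.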

First, install the trl package and clone the script:

pip install trl
git clone https://github.com/lvwerra/trl

Then you can run the script:

python trl/examples/scripts/sft_trainer.py \
    --model_name meta-llama/Llama-2-7b-hf \
    --dataset_name timdettmers/openassistant-guanaco \
    --load_in_4bit \
    --use_peft \
    --batch_size 4 \
    --gradient_accumulation_steps 2
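
Assuming `--batch_size` is the per-device micro-batch size (an assumption about the script, not something stated here), the settings above give an effective batch of 4 x 2 = 8 samples per optimizer step. The toy sketch below illustrates why gradient accumulation is equivalent to a larger batch, using a hypothetical one-parameter model and made-up data.

```python
import math


def mse_grad(w, xs, ys):
    """Gradient of the mean squared error for the one-parameter model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]

batch_size, accumulation_steps = 4, 2

# Gradient computed on all 8 samples at once.
full_batch_grad = mse_grad(w, xs, ys)

# Same gradient, accumulated over 2 micro-batches of 4 samples each.
accumulated_grad = 0.0
for step in range(accumulation_steps):
    start = step * batch_size
    micro_x = xs[start:start + batch_size]
    micro_y = ys[start:start + batch_size]
    accumulated_grad += mse_grad(w, micro_x, micro_y) / accumulation_steps

print(batch_size * accumulation_steps)
print(math.isclose(full_batch_grad, accumulated_grad))
```

Only one micro-batch of activations has to fit in memory at a time, which is what makes this useful on a small GPU.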

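Additional resources

Paper
Models on the Hub
Open LLM Leaderboard
Meta's guide to using the Llama 2 models

Conclusion

We are very excited about the release of Llama 2! We will keep publishing more content around it, including how to fine-tune your own model and how to run the smaller Llama 2 models on-device. Stay tuned!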

Original English post: https://hf.co/blog/llama2

Original authors: Philipp Schmid, Omar Sanseviero, Pedro Cuenca, Lewis Tunstall

Llama 2 is here - get it on Hugging Face
