Authors | BBuf, Xie Zipeng (謝子鵬), Feng Wen (馮文)

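In 2017, Google proposed the Transformer architecture. Pretrained models such as BERT, GPT, and T5 followed in quick succession and kept setting new SOTA records across tasks. Last year, Tsinghua University proposed the GLM model (https://github.com/THUDM/GLM). Unlike the pretrained architectures above, GLM uses an autoregressive blank-infilling method and achieves good results on all three main categories of NLP tasks: natural language understanding, unconditional generation, and conditional generation.

Soon afterwards, Tsinghua released GLM-130B (https://keg.cs.tsinghua.edu.cn/glm-130b/zh/posts/glm-130b/), built on the GLM architecture. It is an open-source, openly available bilingual (Chinese and English) bidirectional dense model with 130 billion parameters, and it performs even better at language understanding, language modeling, translation, zero-shot tasks, and more.

Pretrained models depend on open-source deep learning frameworks. Until now, the open-source GLM code was implemented mainly with PyTorch, DeepSpeed, and Apex, and large models such as GLM-Large (335M), GLM-515M (515M), and GLM-10B (10B) were trained with the data parallelism and model parallelism provided by DeepSpeed. To some extent this lowered the barrier to using GLM pretrained models.

Even so, for the wider community of ordinary users, training a model like GLM is still a headache, and there is still considerable room to improve the performance of pretraining.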

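To address this, we recently ported the original GLM project to One-GLM, which trains with OneFlow as the backend. Because OneFlow is seamlessly compatible with PyTorch, the port was quick and smooth, and the pretraining task (training GLM-large) ran successfully.

In addition, OneFlow natively supports many of the features and optimizations of DeepSpeed and Apex, so users no longer need those plugins to train large models such as GLM. More importantly, with only simple tuning, the OneFlow port of GLM shows large improvements in both speed and GPU memory usage.

How was this achieved? The rest of this article explains.

One-GLM: https://github.com/Oneflow-Inc/one-glm
OneFlow: https://github.com/Oneflow-Inc/oneflow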

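1. GLM-large training performance and GPU memory usage

We first compare the performance and GPU memory usage of training the GLM-large network with the official GLM repository and with the One-GLM repository (using data parallelism). The hardware is A100 PCIE 40G, and the batch size is set to 8.

The results show that in the GLM-large training task, compared with the original GLM implementation based on PyTorch, DeepSpeed, and Apex, OneFlow delivers a 120% to 276% speedup and reduces GPU memory usage by 10% to 30% (all results can be reproduced with oneflow >= 0.9.0).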

2. Migrating GLM takes only a few lines of code

Because OneFlow is seamlessly compatible with the PyTorch ecosystem, users can migrate GLM to One-GLM by changing only a few lines of code:

Replace import torch with import oneflow as torch
Replace import torch.xx with import oneflow.xx
Replace from apex.optimizers import FusedAdam as Adam with from oneflow.optim import Adam
Replace from apex.normalization.fused_layer_norm import FusedLayerNorm as LayerNorm with from oneflow.nn import LayerNorm
Comment out the torch.distributed.ReduceOp, torch.distributed.new_group, torch.distributed.TCPStore, and torch.distributed.all_reduce APIs. PyTorch DDP needs them, but OneFlow implements data parallelism through its internal SBP and Global Tensor mechanisms and does not need these APIs.
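The four import substitutions above are mechanical, so they can be sketched as a small text rewrite. The helper below is hypothetical (migrate_source and RULES are names made up for this illustration and are not part of One-GLM), it only handles top-level import lines, and it does not attempt the last step of commenting out torch.distributed calls, which needs a human decision.

```python
import re

# Hypothetical rewrite rules mirroring the import substitutions listed above.
# The specific Apex rules come first; the generic torch rules come last.
RULES = [
    (r"^from apex\.optimizers import FusedAdam as Adam\s*$",
     "from oneflow.optim import Adam"),
    (r"^from apex\.normalization\.fused_layer_norm import FusedLayerNorm as LayerNorm\s*$",
     "from oneflow.nn import LayerNorm"),
    (r"^import torch\s*$", "import oneflow as torch"),
    (r"^import torch\.(\S+)", r"import oneflow.\1"),
]


def migrate_source(source: str) -> str:
    """Return `source` with its top-level import lines rewritten for OneFlow."""
    out = []
    for line in source.splitlines():
        for pattern, replacement in RULES:
            new_line, count = re.subn(pattern, replacement, line)
            if count:
                line = new_line
                break  # at most one rule applies to a line
        out.append(line)
    return "\n".join(out)


example = """import torch
import torch.nn
from apex.optimizers import FusedAdam as Adam
from apex.normalization.fused_layer_norm import FusedLayerNorm as LayerNorm
x = torch.ones(2)"""

migrated = migrate_source(example)
print(migrated)
```

Because the first rule aliases oneflow as torch, the body of the script (such as the torch.ones call in the example) is left untouched.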

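Many other models are even easier to migrate. In flowvision, the counterpart of torchvision, many models are obtained simply by adding import oneflow as torch to the torchvision model file, at almost no extra cost to the user.

OneFlow also provides a global "mock torch" feature (https://docs.oneflow.org/master/cookies/oneflow_torch.html): running eval $(oneflow-mock-torch) on the command line makes every import torch in the Python scripts run afterwards point to oneflow automatically.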
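How oneflow-mock-torch redirects the import is not described in this article. As a rough intuition only, Python itself allows one module to be served under another name through sys.modules. The sketch below demonstrates that general mechanism with stand-in modules; demo_backend and demo_torch are made-up names, chosen so the example cannot shadow a real torch installation.

```python
import sys
import types

# Stand-in for the real backend package (an assumption for this sketch).
backend = types.ModuleType("demo_backend")
backend.ones = lambda n: [1.0] * n

# Register the backend under a second name. Any later `import demo_torch`
# is answered from sys.modules, even though no such file exists on disk.
sys.modules["demo_torch"] = backend

import demo_torch  # resolved from the sys.modules entry above

print(demo_torch is backend)
print(demo_torch.ones(3))
```

The same lookup order is what lets an aliasing tool affect every script in a session without editing the scripts.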

3. Two main optimizations

Optimizing the loss computation

In the original GLM implementation, the loss computation uses the function mpu.vocab_parallel_cross_entropy (https://github.com/THUDM/GLM/blob/main/pretrain_glm.py#L263).

Analysis of this function shows that it implements sparse_softmax_cross_entropy. The original GLM repository implements it as a PyTorch autograd.Function and composes the overall sparse_softmax_cross_entropy behavior from many small operators. OneFlow's operator library already has a CUDA implementation of this operator, exposed as the flow.sparse_softmax_cross_entropy API.

We therefore replaced GLM's naive implementation of sparse_softmax_cross_entropy with flow.sparse_softmax_cross_entropy and ran a loss alignment experiment.
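To make the idea of a loss alignment check concrete, here is a minimal pure-Python sketch. It is not the GLM or OneFlow code and it ignores vocabulary parallelism; both function names are invented for this illustration. One version builds the loss from several small steps with temporaries, the other computes the same value in one pass through log-sum-exp, and the two are compared.

```python
import math


def naive_sparse_softmax_ce(logits, label):
    """Cross entropy assembled from small steps, each producing a temporary."""
    row_max = max(logits)
    shifted = [v - row_max for v in logits]   # step 1: subtract the max
    exps = [math.exp(v) for v in shifted]     # step 2: exponentiate
    total = sum(exps)                         # step 3: reduce
    probs = [e / total for e in exps]         # step 4: normalize
    return -math.log(probs[label])            # step 5: pick the label, take log


def fused_sparse_softmax_ce(logits, label):
    """Same loss via log-sum-exp, without building the probability vector."""
    row_max = max(logits)
    log_sum_exp = row_max + math.log(sum(math.exp(v - row_max) for v in logits))
    return log_sum_exp - logits[label]


batch_logits = [[2.0, 1.0, 0.1], [0.5, 2.5, -1.0]]
labels = [0, 1]  # "sparse" labels: class indices, not one-hot vectors

naive = [naive_sparse_softmax_ce(row, y) for row, y in zip(batch_logits, labels)]
fused = [fused_sparse_softmax_ce(row, y) for row, y in zip(batch_logits, labels)]
max_diff = max(abs(a - b) for a, b in zip(naive, fused))
print(max_diff)
```

A real alignment experiment compares the training loss curve over many iterations, but the principle is the same: the replacement is accepted only if both paths agree within floating-point tolerance.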

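What were the results? The loss alignment was measured over the first 1000 iterations of training GLM-large in OneFlow's Graph mode, in both FP32 and AMP modes.

After replacing the original GLM's naive sparse_softmax_cross_entropy implementation with flow.sparse_softmax_cross_entropy, the loss is fully aligned, which confirms correctness.

Compared with the single-GPU performance of the original GLM, this replacement substantially improves the single-GPU performance of One-GLM. The main reasons are that OneFlow has heavily optimized the sparse_softmax_cross_entropy operator and that the replacement removes the memory-access overhead of the many small operators stitched together in the original GLM. It also reduces some of the system overhead of torch.autograd.Function itself.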

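CUDA Kernel Fusion

Beyond the optimization above, the GLM model is essentially an encoder-decoder Transformer architecture, so we also applied to One-GLM some of the fusion patterns previously used to optimize GPT and BERT. Specifically, two fusion patterns are used:

fused_bias_add_gelu: fuses the bias_add and gelu operators.
fused_bias_add_dropout: fuses the bias_add and dropout operators.

Both fusions significantly reduce memory traffic and kernel launch overhead. Larger GLM models have more layers, so the benefit of these fusion patterns grows with model size.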
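The data flow behind fused_bias_add_gelu can be sketched in plain Python. This is only an illustration of why fusion saves memory traffic, not OneFlow's CUDA kernel: the unfused version writes an intermediate buffer and reads it back, while the fused version keeps the sum in a local value. The erf-based GELU is an assumption, since the article does not state which GELU variant the kernel uses.

```python
import math


def gelu(v):
    # Exact (erf-based) GELU; the variant is an assumption for this sketch.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def bias_add_then_gelu(x, bias):
    """Unfused: two passes with an intermediate buffer between them."""
    tmp = [xi + bi for xi, bi in zip(x, bias)]  # pass 1 writes tmp
    return [gelu(t) for t in tmp]               # pass 2 reads tmp back


def fused_bias_add_gelu(x, bias):
    """Fused: one pass, the sum never leaves the loop body."""
    return [gelu(xi + bi) for xi, bi in zip(x, bias)]


x = [-1.0, 0.0, 2.0]
bias = [0.5, 0.0, -1.0]
unfused = bias_add_then_gelu(x, bias)
fused = fused_bias_add_gelu(x, bias)
print(unfused == fused)
```

On a GPU the intermediate buffer lives in global memory and each pass is a separate kernel launch, which is where the savings described above come from.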

With both optimizations applied, when training GLM-large on an A100 PCIE 40G with batch_size = 8, single-GPU training achieves a speedup of 280% (FP32 mode) and 307% (AMP mode) over the original GLM.

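4. LiBai also handles GLM inference with ease

When a model becomes so large that a single GPU cannot hold its parameters, convenient distributed training and inference become necessary, and the industry has introduced tools to meet that need.

LiBai, a model library built on OneFlow, minimizes the difficulty of getting started with distributed execution. Users do not need to care how the model is distributed across GPUs; changing a few configuration values is enough to select a different distributed strategy. The speedup is also outstanding.

LiBai: https://github.com/Oneflow-Inc/libai
About LiBai: "Is training large models as hard as climbing to the sky? The LiBai model library, easy to use for pretraining and highly efficient, is here!"
GLM: https://github.com/Oneflow-Inc/libai/tree/glm_project/projects/GLM

A GLM built with LiBai can easily run inference with model parallelism plus pipeline parallelism, which solves the problem of a large model not fitting on a single GPU.

So how can users build the distributed inference part of GLM with LiBai, a repository for large-scale model training and inference? A small example follows.

The natural advantages of distributed inference

A model's parameters are simply a set of tensors, that is, matrices, and the parameters of a large model are large matrices. A parallel strategy splits a large matrix into several smaller ones and distributes them to different GPUs or devices. The basic linear layer is implemented in LiBai as follows: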

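# Abridged excerpt from LiBai; "..." marks code omitted in this article.
import oneflow as flow
from oneflow import nn
from libai.utils import distributed as dist


class Linear1D(nn.Module):
    def __init__(self, in_features, out_features, parallel="data", layer_idx=0, ...):
        super().__init__()

        # The weight has shape (out_features, in_features).
        if parallel == "col":
            # Split the weight along dim 0 (out_features).
            weight_sbp = dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.split(0)])
        elif parallel == "row":
            # Split the weight along dim 1 (in_features).
            weight_sbp = dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.split(1)])
        elif parallel == "data":
            # Keep a full copy of the weight on every device.
            weight_sbp = dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast])
        else:
            raise KeyError(f"{parallel} is not supported! Only support ('data', 'row' and 'col')")

        self.weight = flow.nn.Parameter(
            flow.empty(
                (out_features, in_features),
                dtype=flow.float32,
                placement=dist.get_layer_placement(layer_idx),  # for pipeline parallelism placement
                sbp=weight_sbp,
            )
        )
        # init_method is not defined in this excerpt (presumably among the omitted arguments).
        init_method(self.weight)
        ...
    
    def forward(self, x):
        ...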

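Here the user chooses how to split the matrix of the Linear layer and how to split the data matrix. SBP in OneFlow controls whether a matrix is split vertically, horizontally, or in some other way (model parallelism, data parallelism), and Placement controls which GPU this linear layer lives on (pipeline parallelism).

Thanks to the design of LiBai's layers and the SBP and Placement attributes built into OneFlow tensors, models built by users can achieve data parallelism, model parallelism, and pipeline parallelism with little effort.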
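The arithmetic behind the two weight splits can be checked on a single process with plain Python lists. The sketch below does not use OneFlow and involves no real devices; linear and the variable names are illustrative only. It shows that splitting the (out_features, in_features) weight along dim 0 yields output slices to concatenate, while splitting along dim 1 yields partial outputs to sum.

```python
def linear(x, weight):
    """y[i][o] = sum_k x[i][k] * weight[o][k]; weight is (out_features, in_features)."""
    return [[sum(xv * wv for xv, wv in zip(row, w_row)) for w_row in weight]
            for row in x]


x = [[1.0, 2.0, 3.0, 4.0]]           # (batch=1, in_features=4)
weight = [[1.0, 0.0, 2.0, 1.0],
          [0.0, 1.0, 1.0, 3.0]]      # (out_features=2, in_features=4)
full = linear(x, weight)

# split(0): each "device" holds some rows of the weight (a slice of
# out_features) and produces a slice of the output; slices are concatenated.
w_dev0, w_dev1 = weight[:1], weight[1:]
split0 = [a + b for a, b in zip(linear(x, w_dev0), linear(x, w_dev1))]

# split(1): each "device" holds a slice of in_features, so the input is
# sliced the same way and the partial outputs are summed element-wise.
w_left = [row[:2] for row in weight]
w_right = [row[2:] for row in weight]
x_left = [row[:2] for row in x]
x_right = [row[2:] for row in x]
split1 = [[p + q for p, q in zip(ra, rb)]
          for ra, rb in zip(linear(x_left, w_left), linear(x_right, w_right))]

print(full, split0, split1)
```

In OneFlow the concatenation and summation are handled by the framework from the SBP signatures, so the model code does not have to spell out this communication.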

GLM inference demo

This section shows single-GPU and multi-GPU GLM inference demos in LiBai. The models are available on HuggingFace: https://huggingface.co/models?filter=glm

Single-GPU generate task, using the glm-10b model:

python demo.py

# demo.py
import oneflow as flow
from projects.GLM.tokenizer.glm_tokenizer import GLMGPT2Tokenizer
from libai.utils import distributed as dist
from projects.GLM.configs.glm_inference import cfg
from projects.GLM.modeling_glm import GLMForConditionalGeneration
from projects.GLM.utils.glm_loader import GLMLoaderHuggerFace
from omegaconf import DictConfig

tokenizer = GLMGPT2Tokenizer.from_pretrained("/data/home/glm-10b")
input_ids = tokenizer.encode(
    [
        "Ng is an adjunct professor at [MASK] (formerly associate professor and Director of its Stanford AI Lab or SAIL ). Also a pioneer in online education, Ng co-founded Coursera and deeplearning.ai."
    ],
    return_tensors="of",
)
inputs = {"input_ids": input_ids, "attention_mask": flow.ones(input_ids.size())}
inputs = tokenizer.build_inputs_for_generation(inputs, max_gen_length=512)

sbp = dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast])
placement = dist.get_layer_placement(0)

dist.set_device_type("cpu")
loader = GLMLoaderHuggerFace(GLMForConditionalGeneration, cfg, "/path/to/glm-10b")
model = loader.load()
model = model.half().cuda()

dist.set_device_type("cuda")
outputs = model.generate(
    inputs=inputs['input_ids'].to_global(sbp=sbp, placement=placement), 
    position_ids=inputs['position_ids'].to_global(sbp=sbp, placement=placement), 
    generation_attention_mask=inputs['generation_attention_mask'].to_global(sbp=sbp, placement=placement).half(), 
    max_length=512
)
res = tokenizer.decode(outputs[0])
print(res)
# Output:
# [CLS] Ng is an adjunct professor at [MASK] (formerly associate professor and Director of its Stanford AI Lab or SAIL ). Also a pioneer in online education, Ng co-founded Coursera and deeplearning.ai.<|endoftext|> <|startofpiece|>  Stanford University and a co-founder of <|endofpiece|>

4-GPU model parallel + pipeline parallel generate task, using the glm-10b model:

python3 -m oneflow.distributed.launch --nproc_per_node 4 demo.py

# demo.py
import oneflow as flow
from projects.GLM.tokenizer.glm_tokenizer import GLMGPT2Tokenizer
from libai.utils import distributed as dist
from projects.GLM.configs.glm_inference import cfg
from projects.GLM.modeling_glm import GLMForConditionalGeneration
from projects.GLM.utils.glm_loader import GLMLoaderHuggerFace
from omegaconf import DictConfig

# Only the parallelism scheme needs to be configured
parallel_config = DictConfig(
    dict(
        data_parallel_size=1,
        tensor_parallel_size=2,
        pipeline_parallel_size=2,
        pipeline_num_layers=2 * 24
    )
)
dist.setup_dist_util(parallel_config)

tokenizer = GLMGPT2Tokenizer.from_pretrained("/data/home/glm-10b")
input_ids = tokenizer.encode(
    [
        "Ng is an adjunct professor at [MASK] (formerly associate professor and Director of its Stanford AI Lab or SAIL ). Also a pioneer in online education, Ng co-founded Coursera and deeplearning.ai."
    ],
    return_tensors="of",
)
inputs = {"input_ids": input_ids, "attention_mask": flow.ones(input_ids.size())}
inputs = tokenizer.build_inputs_for_generation(inputs, max_gen_length=512)

sbp = dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast])
placement = dist.get_layer_placement(0)

loader = GLMLoaderHuggerFace(GLMForConditionalGeneration, cfg, "/path/to/glm-10b")
model = loader.load()

outputs = model.generate(
    inputs=inputs['input_ids'].to_global(sbp=sbp, placement=placement), 
    position_ids=inputs['position_ids'].to_global(sbp=sbp, placement=placement), 
    generation_attention_mask=inputs['generation_attention_mask'].to_global(sbp=sbp, placement=placement), 
    max_length=512
)
res = tokenizer.decode(outputs[0])
if dist.is_main_process():
    print(res)
# Output:
# [CLS] Ng is an adjunct professor at [MASK] (formerly associate professor and Director of its Stanford AI Lab or SAIL ). Also a pioneer in online education, Ng co-founded Coursera and deeplearning.ai.<|endoftext|> <|startofpiece|>  Stanford University and a co-founder of <|endofpiece|>
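The numbers in the parallel configuration of the 4-GPU demo are related: 1 x 2 x 2 matches the 4 processes started by the launch command, and 2 * 24 layers are spread over 2 pipeline stages. The sketch below checks such a configuration before launching. It is a hypothetical helper, not LiBai code, and it assumes that the product of the three sizes must equal the number of processes and that layers are divided into equal contiguous stages; LiBai's actual placement logic may differ.

```python
def check_parallel_config(world_size, data_parallel_size, tensor_parallel_size,
                          pipeline_parallel_size, pipeline_num_layers):
    """Validate the sizes and return the layer indices of each pipeline stage."""
    needed = data_parallel_size * tensor_parallel_size * pipeline_parallel_size
    if needed != world_size:
        raise ValueError(
            f"config needs {needed} devices, but {world_size} were launched")
    if pipeline_num_layers % pipeline_parallel_size != 0:
        raise ValueError("layers do not divide evenly across pipeline stages")
    per_stage = pipeline_num_layers // pipeline_parallel_size
    # Assumption: stage s owns a contiguous block of per_stage layers.
    return [list(range(s * per_stage, (s + 1) * per_stage))
            for s in range(pipeline_parallel_size)]


stages = check_parallel_config(
    world_size=4,               # matches --nproc_per_node 4
    data_parallel_size=1,
    tensor_parallel_size=2,
    pipeline_parallel_size=2,
    pipeline_num_layers=2 * 24,
)
print(len(stages), len(stages[0]), stages[1][0])
```

Under these assumptions each stage holds 24 layers, and each layer's weights are further split across the 2 tensor-parallel devices of that stage.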

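Inference with a model trained by One-GLM

LiBai loads OneFlow models just as easily. To run inference with a model trained by One-GLM, simply replace GLMLoaderHuggerFace with GLMLoaderLiBai in the demos above.

5. Conclusion

Porting GLM to OneFlow is straightforward. Compared with training GLM-large with the original PyTorch GLM, OneFlow substantially improves performance and saves GPU memory.

In addition, running inference with GLM-10B, a model with ten billion parameters, shows that LiBai on OneFlow works out of the box for large-model inference and delivers higher inference speed. To run inference with a different parallel scheme, only a few parameters in the configuration need to change.

In the future, the OneFlow team will explore the feasibility of training the larger GLM-130B model with OneFlow. We believe OneFlow can train models at the GLM-130B scale faster and accelerate training and inference for Chinese-developed large models.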

Feel free to star and try One-GLM and the latest version of OneFlow:

One-GLM: https://github.com/Oneflow-Inc/one-glm
OneFlow: https://github.com/Oneflow-Inc/oneflow

Accelerating training of GLM, a Chinese-developed large model: up to 3x faster, with one third less GPU memory
