LLM inference accelerated by up to 2.8x: CMU Tsinghua Yao Class alumni propose the speculative inference engine SpecInfer

Author: Heart of the Machine Editorial Office

Recently, the Catalyst Group team at Carnegie Mellon University (CMU) released a "speculative inference" engine, SpecInfer, which uses lightweight small models to assist a large model, achieving two to three times faster inference without affecting the quality of the generated content.

With the advent of ChatGPT, research on large language models (LLMs) and their applications has received extensive attention from both academia and industry. On the one hand, open-source LLMs such as OPT, BLOOM, and LLaMA keep emerging; these pre-trained models have greatly promoted LLM research and made it possible to apply LLMs to increasingly complex practical problems. With these open-source models, quickly building an LLM-based application service has become much easier, but LLMs still face high compute and storage requirements, and their cost remains prohibitive.

On the other hand, miniaturized LLMs obtained by fine-tuning or distillation, represented by the "alpaca family" (Alpaca, Vicuna, Guanaco, etc.), have also become a current research focus and have shown excellent performance in many evaluations. In addition, system optimization techniques such as quantization, LoRA, and offloading make it possible to deploy these LLMs with lower resource requirements. However, there is no free lunch: evidence shows [1] that these miniaturized LLMs and low-resource system optimizations often degrade model quality and affect the final application results.

Therefore, how to make LLM inference efficient and cheap while guaranteeing the quality of the model's output has become an important research problem in the MLSys field. Recently, the Catalyst Group team at Carnegie Mellon University (CMU) released a "speculative inference" engine, SpecInfer, which uses lightweight small models to assist a large model, achieving two to three times faster inference without affecting the quality of the generated content.

Link to the paper: https://arxiv.org/abs/2305.09781

Project address: https://github.com/flexflow/FlexFlow/tree/inference

Zhihao Jia, an assistant professor at CMU and one of the authors of the paper, said: "Generative large language models are not only inefficient at inference but also expensive to deploy; their miniaturized versions have advantages in speed and price, but compromise the quality of the generated content. SpecInfer achieves a win-win on both."

Tianqi Chen, also an assistant professor in the CMU Catalyst Group, said: "SpecInfer can be applied to scenarios such as LLM deployment on the cloud, making LLM inference more scalable."

Research status

At present, LLM inference mainly relies on auto-regressive decoding: each decoding step produces only one output token, and the historical output must be appended to the input before the next step. Given this data dependence, existing LLM inference systems such as FasterTransformer use incremental decoding, caching the keys/values of already-decoded tokens to avoid recomputation. However, such systems still face two key drawbacks: 1) the token-by-token decoding paradigm limits operator parallelism, making it hard to fully utilize GPU hardware; 2) when the sequence is long, the KV-cache consumes more space than the limited GPU memory can hold. As a result, for very large-scale LLM inference (such as GPT-4 with 32K tokens), existing systems often suffer from inefficient resource utilization and high inference latency.

Incremental Decoding diagram
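
To make the data dependence concrete, the following is a minimal Python sketch of incremental decoding with a KV cache; `llm.init_kv_cache`, `llm.prefill`, and `llm.forward_step` are hypothetical interfaces used only for illustration, not the API of FasterTransformer or SpecInfer:

    def incremental_decode(llm, prompt_tokens, max_new_tokens, eos_id):
        # Cache of keys/values for all tokens processed so far; grows by one entry per step.
        kv_cache = llm.init_kv_cache()
        # Process the whole prompt once ("prefill") and get logits for the next token.
        logits = llm.prefill(prompt_tokens, kv_cache)
        output = []
        for _ in range(max_new_tokens):
            next_token = logits.argmax()        # greedy decoding for simplicity
            if next_token == eos_id:
                break
            output.append(next_token)
            # Only the newest token is fed in; its attention reads the KV cache, so earlier
            # tokens are not recomputed -- but each step still emits exactly one token
            # and must wait for the previous step to finish.
            logits = llm.forward_step(next_token, kv_cache)
        return output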

To address these problems, the researchers propose the "speculative" inference engine SpecInfer. Its core idea is to use computationally cheaper "small models", called SSMs (Small Speculative Models), to speculate several steps ahead on behalf of the LLM. The speculations from multiple SSMs are aggregated into a Speculated Token Tree, which the LLM then verifies in parallel using an efficient tree decoding operator; the path that passes verification is emitted as the model's output sequence.

In general, SpecInfer uses the inherent knowledge of the SSMs to carry out most of the inference at low computational cost, while the LLM verifies the results in parallel. This breaks the sequential dependence of token-by-token decoding to a large extent and guarantees that the final output fully matches the semantics of the original inference.

SpecInfer workflow
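
The workflow above can be summarized as a speculate-then-verify loop. Below is a high-level Python sketch under assumed interfaces; `build_token_tree` and `verify_tree` are illustrative placeholders, not SpecInfer's actual API:

    def speculative_step(llm, ssms, prefix, speculate_len=8):
        # 1. Each small speculative model (SSM) cheaply guesses several future tokens.
        candidates = [ssm.generate(prefix, max_new_tokens=speculate_len) for ssm in ssms]
        # 2. Candidate sequences are merged into a token tree (shared prefixes share nodes).
        token_tree = build_token_tree(candidates)
        # 3. One parallel LLM pass scores every node of the tree (tree attention); the longest
        #    root-to-leaf path that matches the LLM's own choices is accepted.
        verified_tokens = verify_tree(llm, prefix, token_tree)
        # 4. At least one token (the LLM's own next token) is always produced, so the output
        #    is identical to what plain auto-regressive decoding would generate.
        return prefix + verified_tokens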

System design

SpecInfer system architecture

Learning-based Speculator

The main role of the Speculator is to use SSMs to quickly speculate on the LLM's future output. An SSM can be a (fine-tuned) small version of the LLM (such as LLaMA 7B), a quantized or distilled small-scale LLM, a searchable knowledge base (such as reference text), or a user-defined function. In short, the closer an SSM's output is to the LLM's, the more likely its speculation is to pass verification, and the higher the overall inference efficiency.

To this end, SpecInfer borrows the idea of ensemble learning and fuses the results of multiple SSMs to improve the coverage of the speculated outputs. To maximize the match rate, the Speculator uses a Collective Boost-Tuning method: starting from a weak SSM fine-tuned on a public general-purpose corpus (such as OpenWebText), it repeatedly filters out the sequences that are still poorly matched and hands them to a new SSM to learn, improving the overall speculation quality. In addition, the Speculator introduces a learnable scheduler to decide which SSMs to use in order to obtain longer matched sequences.
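
As an illustration of the boosting idea described above, the sketch below adaptively fine-tunes a sequence of SSMs, each focusing on the corpus sequences that the current ensemble still speculates poorly; the `fine_tune` and `match_length` helpers are assumptions for illustration, not SpecInfer's code:

    def collective_boost_tuning(base_ssm, llm, corpus, num_ssms, match_threshold):
        ensemble, remaining = [], list(corpus)
        for _ in range(num_ssms):
            # Adapt a weak SSM to the sequences the current ensemble still covers poorly.
            ssm = fine_tune(base_ssm, remaining)
            ensemble.append(ssm)
            # Keep only the sequences whose speculation still matches the LLM poorly,
            # so the next SSM concentrates on what the ensemble gets wrong.
            remaining = [seq for seq in remaining
                         if match_length(ensemble, llm, seq) < match_threshold]
            if not remaining:
                break
        return ensemble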

Token Tree Verifier

The SSMs' inference-speed advantage is a prerequisite for SpecInfer's acceleration, but another indispensable factor is the LLM's support for parallel verification. In SpecInfer, the LLM does not directly generate output tokens; instead, it verifies the tokens generated by the SSMs in the Speculator, ensuring that the output conforms to the LLM's own inference semantics.
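
For a single speculated sequence under greedy decoding, verification can be pictured as follows: the LLM scores the whole candidate in one parallel forward pass, and the longest prefix that agrees with the LLM's own argmax choices is accepted. This is a minimal sketch with illustrative interfaces; the paper also covers stochastic sampling, which requires a different acceptance rule:

    def verify_sequence(llm, prefix, speculated_tokens):
        # One parallel forward pass over prefix + speculation gives logits at every position.
        logits = llm.forward(prefix + speculated_tokens)   # logits[i] predicts token i+1
        accepted = []
        pos = len(prefix) - 1
        for tok in speculated_tokens:
            if logits[pos].argmax() == tok:                # speculation agrees with the LLM
                accepted.append(tok)
                pos += 1
            else:
                # First mismatch: fall back to the LLM's own token and stop, so the result
                # is exactly what plain incremental decoding would have produced.
                accepted.append(logits[pos].argmax())
                break
        return accepted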

In SpecInfer, the output sequences generated by the SSMs are organized into a token tree to avoid redundant storage. To verify the token tree in parallel, SpecInfer proposes a tree attention computation: with a specially constructed mask matrix and a depth-first KV-cache update mechanism, the Verifier can decode every path in the tree in parallel as far as possible without extra storage. Compared with naive sequence-by-sequence or token-by-token decoding, tree decoding achieves the best of both memory overhead and computational efficiency.

Tree-based Decoding diagram
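
The tree attention mask can be pictured as follows: a node may attend only to itself and its ancestors in the token tree, so one batched forward pass decodes all branches at once. The NumPy sketch below is illustrative and is not SpecInfer's implementation:

    import numpy as np

    def tree_attention_mask(parents):
        """parents[i] is the index of node i's parent in the flattened token tree (-1 for the root)."""
        n = len(parents)
        mask = np.zeros((n, n), dtype=bool)
        for i in range(n):
            j = i
            while j != -1:            # walk up to the root, marking every ancestor as visible
                mask[i, j] = True
                j = parents[j]
        return mask                   # positions where mask is False are excluded from attention

    # Example: root(0) -> 1 -> 2 and root(0) -> 3, i.e. two branches sharing the root.
    # tree_attention_mask([-1, 0, 1, 0]) lets node 2 attend to {0, 1, 2} and node 3 to {0, 3}.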

Collaboration between the large-scale LLM and small-scale SSMs

Timeline comparison of speculative inference

A large-scale LLM usually has tens or even hundreds of times more parameters than a small-scale SSM, while the SSM, on typical system implementations, is several to dozens of times faster at inference. SpecInfer combines the SSM's extremely low inference latency with the LLM's parallel verification capability, greatly reducing the number of expensive LLM inference steps, and can therefore significantly improve inference speed while preserving the quality of the results.
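
A rough back-of-envelope cost model illustrates the benefit (the numbers below are assumptions for illustration, not measured results): if one LLM verification step costs t_llm, one SSM speculation step costs t_ssm, and on average m speculated tokens are accepted per verification, the cost per output token drops from t_llm to roughly (m * t_ssm + t_llm) / m:

    def estimated_speedup(t_llm, t_ssm, m):
        # Time per output token with speculation: m SSM steps plus one LLM verification,
        # amortized over the m tokens accepted on average.
        per_token_speculative = (m * t_ssm + t_llm) / m
        return t_llm / per_token_speculative

    # Example: if an SSM step is 20x cheaper than an LLM step and ~3 tokens pass
    # verification per step, estimated_speedup(1.0, 0.05, 3) is roughly 2.6x.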

System implementation

SpecInfer is implemented on top of the FlexFlow system. It supports user-defined model structures and imported model parameters, is compatible with the operator and layer abstractions of mainstream deep learning frameworks, and already supports common base models such as standard GPT and LLaMA. Notably, FlexFlow is a deep learning system for distributed scenarios, jointly maintained by researchers from CMU, Stanford, MIT, NVIDIA, and other institutions; it is among the earliest work in the machine learning systems field to propose "automatic parallelism" (MLSys'19, ICML'18) [2, 3], and the first to combine computational graph optimization and automatic parallelization into a joint optimization (Unity, OSDI'22) [4].

With FlexFlow's automatic parallelism, SpecInfer can automatically find optimal distributed deployments for large-scale LLMs. SpecInfer also supports offloading, allowing the model to scale at lower cost. Through its unique "speculative inference" mechanism, SpecInfer greatly reduces the number of inference steps the LLM must take, thereby reducing network communication overhead in distributed scenarios and alleviating the PCIe bandwidth bottleneck in offloading scenarios.

Experimental results

End-to-end inference latency

End-to-end experiment: tested on five conversational datasets with LLaMA-7B as the LLM and LLaMA-160M as the SSM, SpecInfer reduces end-to-end inference latency by 1.9-2.8x compared with an LLM relying on incremental decoding.

Average verified length per LLM decoding step (LLM: OPT-13B + SSMs: OPT-125M)

Average verified length per LLM decoding step (LLM: LLaMA-7B + SSMs: LLaMA-160M)

Match-length test: using the OPT and LLaMA model families, the average verification pass length of the LLM in SpecInfer was measured. As the number of SSMs increases, the verified length per LLM step improves on every dialogue dataset. With 5 SSMs, for example, OPT and LLaMA average 3.68 and 2.67 tokens per step across the five datasets, which is 26.4% and 24.8% higher than with a single SSM.

For more detailed experimental results, please refer to the original text of the paper: https://arxiv.org/abs/2305.09781

Summary

SpecInfer is the first distributed LLM inference engine based on "speculative decoding". By ensembling multiple small models and performing token-tree-based verification, it helps existing mainstream LLMs reduce memory-access requirements and achieve two to three times lossless inference acceleration, greatly lowering inference costs.

About the author

The advisor of the SpecInfer project is Zhihao Jia, currently an assistant professor in the School of Computer Science at Carnegie Mellon University. His research interests include systems for machine learning, quantum computing, and large-scale data analysis. He graduated from the Yao Class at Tsinghua University, received his Ph.D. from Stanford University under Alex Aiken and Matei Zaharia, and has received the Stanford Arthur Samuel Best Doctoral Thesis Award, an NSF CAREER Award, and research awards from Amazon, Google, Meta, Oracle, and Qualcomm. Personal page: https://www.cs.cmu.edu/~zhihaoj2/.

The SpecInfer project was mainly incubated in CMU's Catalyst Group lab, co-led by Zhihao Jia and Tianqi Chen, which is dedicated to integrating optimization techniques spanning machine learning algorithms, systems, and hardware to build automated machine learning systems. The lab has previously released open-source projects such as MLC-LLM [5] to promote research on and application of systems for large language models. Lab homepage: https://catalyst.cs.cmu.edu.

The co-authors of the paper are Xupeng Miao (postdoctoral researcher), Gabriele Oliaro (first-year Ph.D. student), and Zhihao Zhang (first-year Ph.D. student), all from the CMU Catalyst Group. Dr. Xupeng Miao received his Ph.D. from Peking University; his main research directions include machine learning systems, data management, and distributed computing, and he has received honors including the VLDB 2022 Best Scalable Data Science Paper Award, the 2022 ACM China Doctoral Dissertation Award, and the 2022 World Artificial Intelligence Conference (WAIC) Yunfan Award. Personal homepage: https://hsword.github.io.

Bibliography:

[1] Gudibande, A., Wallace, E., Snell, C., Geng, X., Liu, H., Abbeel, P., Levine, S., & Song, D. (2023). The False Promise of Imitating Proprietary LLMs.

[2] Jia, Z., Lin, S., Qi, C. R., & Aiken, A. (2018, July). Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks. In ICML (pp. 2279-2288).

[3] Jia, Z., Zaharia, M., & Aiken, A. (2019). Beyond Data and Model Parallelism for Deep Neural Networks. Proceedings of Machine Learning and Systems, 1, 1-13.

[4] Unger, C., Jia, Z., Wu, W., Lin, S., Baines, M., Narvaez, C. E. Q., ... & Aiken, A. (2022). Unity: Accelerating Training Through Joint Optimization of Algebraic Transformations and Parallelization. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22) (pp. 267-284).

[5] https://github.com/mlc-ai/mlc-llm