The paper, titled Sorted LLaMA, aims to unlock the potential of the intermediate layers of large language models. It proposes Sorted Fine-Tuning (SoFT), a tuning method that makes those intermediate layers usable for dynamic inference.
The authors argue that while LLMs excel at understanding and generating natural language, they are costly to deploy at scale. To address this, they build on the SortedNet technique, which trains many-in-one modular networks whose nested sub-models are sorted by computational cost and accuracy, yielding sub-models with different computational budgets.
The core idea of SoFT is to enable dynamic inference without any additional pre-training, simply by replacing standard supervised fine-tuning (SFT) with sorted fine-tuning at the same cost. Because the sub-models share parameters, this removes the need to train and deploy multiple separate models for different inference scenarios.
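The idea of training nested sub-models that share parameters can be sketched with a toy stack of layers: each sub-model is a prefix of the full stack, and the training objective sums the loss of every prefix so intermediate layers learn to produce usable outputs. This is only a minimal illustration, not the paper's actual implementation; the layer shapes, the shared output head, and the chosen depths are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for transformer blocks: each "layer" is just a weight matrix.
DIM, N_LAYERS = 8, 6
layers = [rng.normal(size=(DIM, DIM)) / np.sqrt(DIM) for _ in range(N_LAYERS)]
head = rng.normal(size=(DIM, 1))  # shared output head (an assumption here)

# Nested sub-models: prefixes of the full layer stack, sorted by depth.
SUB_DEPTHS = [2, 4, 6]

def forward(x, depth):
    """Run only the first `depth` layers, then the shared head."""
    h = x
    for w in layers[:depth]:
        h = np.tanh(h @ w)
    return h @ head

def soft_loss(x, y):
    """Sorted-style objective: sum the loss of every nested sub-model,
    so each prefix of the network is trained to solve the task."""
    return sum(float(np.mean((forward(x, d) - y) ** 2)) for d in SUB_DEPTHS)

x = rng.normal(size=(4, DIM))
y = rng.normal(size=(4, 1))
print(f"summed sub-model loss: {soft_loss(x, y):.3f}")
```

At inference time, one would simply pick a depth from `SUB_DEPTHS` that fits the latency budget; shallower prefixes cost proportionally less compute while using the same shared weights.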
By applying SoFT to LLaMA 2 13B and tuning it on the Stanford Alpaca dataset, the authors show that SoFT can deliver sub-models roughly twice as fast as the full model while maintaining or even exceeding its performance. In sum, the paper offers a way to exploit the intermediate layers of large language models for dynamic inference while improving deployment efficiency.