
The Xiaohongshu search team proposed a new decoding strategy to reduce the cost of large model inference

Author: Flash Gene

One of the key technologies for improving the reasoning ability of large models is the Chain of Thought (CoT), which effectively enhances their logical reasoning by guiding them to simulate the human thinking process step by step.

The self-consistency method (SC) has long been a widely used decoding strategy for chain-of-thought reasoning. SC improves model performance by generating multiple chains of thought and taking the majority answer as the final answer. Although it brings significant performance gains on a variety of multi-step reasoning tasks, it is a costly method that requires sampling many times up to a preset sample size.

At ICLR 2024, the Xiaohongshu search algorithm team proposed a simple and scalable sampling process, the Early-Stopping Self-Consistency (ESC) method, which significantly reduces the cost of SC without sacrificing performance. Building on this, the team further derived a control scheme for ESC that dynamically selects the performance-cost balance for different tasks and models.

Subsequently, researchers from Xiaohongshu and Beijing Institute of Technology selected three mainstream reasoning tasks (mathematics, common sense, and symbolic reasoning) and used language models of different scales for experiments. Experimental results showed that ESC significantly reduced the average number of samples across six benchmarks, including MATH (-33.8%), GSM8K (-80.1%), StrategyQA (-76.8%), Commonsense QA (-78.5%), Coin Flip (-84.2%), and Last Letters (-67.4%), while maintaining almost the same performance.

This illustrates the effectiveness and innovation of ESC, which can significantly reduce the number of samples while maintaining inference performance, thereby reducing the computational cost. This is important for large language models, where the inference process is often computationally intensive.


With the help of Chain-of-Thought (CoT) prompting, large language models (LLMs) demonstrate powerful reasoning capabilities. Building on this, and because complex reasoning tasks usually admit multiple reasoning paths that lead to the correct answer, previous researchers introduced a decoding strategy called Self-Consistency (SC) to further improve reasoning performance.

Compared with standard chain-of-thought prompting, which generates only a single path (greedy search), the SC method samples multiple reasoning paths up to a preset sample size and determines the final answer through a voting mechanism. While effective, this approach incurs overhead proportional to the number of samples. Taking GPT-4 as an example, with a sample size of 40 the cost of evaluating on the MATH dataset is as high as $2,000, which makes reducing the cost of SC an urgent need.

In SC, generating multiple samples can be viewed as approximating the true answer distribution predicted by the LLM. Selecting the most frequent result as the final answer reduces the randomness of a single-sample strategy. However, since SC only needs the most credible answer, it does not require a perfect match to the entire answer distribution. It is therefore unnecessary to generate, for each input, the full set of reasoning paths specified by the preset sample size all at once. Instead, the generation process can be serialized into smaller segments, each called a sampling window. Since both a small window and the full set of samples are drawn from the same predicted answer distribution, a sampling window can be viewed as a probe that reveals information about the true distribution from only a few samples.


Figure 1: GPT-4's average entropy score within the sampling window of the MATH dataset

For the answer distribution, one conjecture is that the distribution over candidates is usually more concentrated when the answer is correct and relatively scattered when it is wrong. We use entropy to characterize the shape of the answer distribution. The figure above shows the average entropy of the within-window answer distribution for correctly and incorrectly voted answers: windows whose voted answer is correct usually have lower entropy, so entropy can be used as an indicator of whether to continue sampling.
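For illustration, here is a minimal sketch of how such a window entropy can be computed; the answer strings are hypothetical:

```python
from collections import Counter
from math import log

def answer_entropy(window_answers):
    """Entropy (in nats) of the empirical answer distribution inside one sampling window."""
    counts = Counter(window_answers)
    n = len(window_answers)
    return -sum((c / n) * log(c / n) for c in counts.values())

# Hypothetical windows of 5 sampled answers each:
print(answer_entropy(["42", "42", "42", "42", "42"]))  # 0.0   -> distribution is concentrated
print(answer_entropy(["42", "17", "42", "9", "17"]))   # ~1.05 -> distribution is scattered
```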

Based on this, we propose an Early-Stopping Self-Consistency (ESC) method, which truncates the sampling process in a low-entropy window. To maintain performance as much as possible, we set the strictest threshold: entropy equals zero, i.e. all samples generated within the window have the same answer. Stopping sampling when this happens reduces sampling consumption while minimizing performance impact.

Early stopping is a technique widely used during model training to prevent overfitting. In this paper, we instead apply an early-stopping strategy to reduce the cost of the multi-sampling process. Like the original SC, ESC is completely unsupervised and model-agnostic, requiring no human annotation or additional training. We derive a theoretical upper bound on the probability that the SC result with early stopping differs from the result without it, and the bound shows that ESC maintains performance with high probability. In addition, we propose a dynamic control scheme for ESC: by selecting the window size and the maximum number of samples, we can dynamically find the optimal performance-cost balance for different tasks and models to meet actual needs.


Figure 2: ESC vs. original SC process

The diagram shows a complete comparison of ESC and the original SC. We divide the large sample size (20 in this example) into several consecutive small windows (of size 5 here), and stop sampling as soon as all answers within one window are identical, i.e., the predicted answer distribution has zero entropy.

2.1 Analysis of the self-consistency method

The core idea of the self-consistency approach is that a complex problem usually admits multiple lines of reasoning, all of which ultimately lead to the same correct answer. Based on this, the voting process with a sample size of L can be expressed as follows:

$$\hat{a} = \arg\max_{a} \sum_{i=1}^{L} \mathbb{1}(a_i = a)$$

where the sum counts the frequency with which a candidate answer a appears among the L sampled reasoning paths. According to the law of large numbers, as L approaches infinity the empirical distribution of the sampled results approaches the true answer distribution p(a | x) predicted by the model. Further, we can conclude that:

$$\lim_{L \to \infty} \hat{a} = \arg\max_{a} p(a \mid x)$$

From the above formulas, we can see that taking multiple samples reduces the noise introduced by a single sample and thereby improves performance. Our goal is to ensure that the answer with the highest predicted probability is chosen as the final answer. From this point of view, the entropy of the answer distribution determines how many samples are needed: when the entropy is low, only a few samples are enough to suppress the impact of sampling noise.
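A minimal sketch of the voting step described above, assuming a hypothetical `sample_reasoning_path` function that returns the final answer of one sampled chain of thought:

```python
from collections import Counter

def self_consistency(sample_reasoning_path, question, num_samples=40):
    """Original SC: draw a fixed number of reasoning paths, then take the majority answer."""
    answers = [sample_reasoning_path(question) for _ in range(num_samples)]
    # The most frequent answer approximates argmax_a p(a | question)
    # as num_samples grows (law of large numbers).
    return Counter(answers).most_common(1)[0][0]
```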

2.2 Early-stopping self-consistency

Based on the analysis in 2.1, we design a dynamic truncation strategy for multi-sampling that achieves performance comparable to the original sample size at a lower cost. Specifically, instead of generating all samples at once, we generate them in successive sampling windows and use the answer-distribution entropy (or, for open-ended outputs, the similarity) within each window as the truncation condition for early stopping.

When all predictions within a window are consistent, the entropy of the answer distribution is 0, indicating that the voting result for this sample is very likely to agree with the result that would be obtained with an infinite number of samples. So, as soon as this happens, we stop sampling.

If no window satisfies the stopping condition, sampling proceeds window by window until the preset maximum sample size is reached. The full procedure is given as Algorithm 1 in the paper; a simplified sketch follows.

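As a rough, unofficial sketch of this procedure (not the paper's reference implementation), the loop below samples in small windows and stops at the first unanimous window; `sample_reasoning_path` is again a hypothetical sampling function:

```python
from collections import Counter

def early_stopping_self_consistency(sample_reasoning_path, question,
                                    window_size=5, max_samples=40):
    """ESC sketch: sample window by window, stop early on a zero-entropy
    (unanimous) window, otherwise continue up to max_samples, then majority-vote."""
    answers = []
    while len(answers) < max_samples:
        window = [sample_reasoning_path(question) for _ in range(window_size)]
        answers.extend(window)
        if len(set(window)) == 1:  # all answers in this window agree -> entropy is zero
            break
    return Counter(answers).most_common(1)[0][0]
```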

To assess the effect of introducing the early-stopping mechanism on result consistency, we derive a theoretical upper bound on the probability that SC with early stopping disagrees with SC without it. The results show that with a window size of 8, the probability that ESC disagrees with SC is less than 0.002. This verifies that ESC can effectively reduce the number of samples while maintaining performance.

2.3 Dynamic control scheme

To meet different budget and performance requirements, we study a dynamic control scheme for ESC that adjusts the truncation strategy by choosing an appropriate window size w and maximum number of samples L.

We propose a control scheme for dynamic truncation: based on the answer distribution observed in a first observation window, the expected inference performance and sampling cost under different window-size (w) and maximum-sample-size (L) settings can be derived:

The expected number of samples and an upper bound on the probability that the early-stopped result disagrees with the original result can both be expressed as functions of (w, L); the exact formulas are given in the paper.

Finally, given the sampling budget and performance requirements, an appropriate (w, L) pair is selected based on these expected values and ESC is executed with it. The full procedure is given as Algorithm 2 in the paper; an illustrative sketch follows.

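The paper derives these expectations in closed form. Purely as an illustration of how a (window size, maximum sample size) setting could be evaluated from a first observation window, the sketch below estimates the expected sample count by resampling from the window's empirical answer distribution; this resampling shortcut is an assumption for illustration, not the paper's derivation.

```python
import random
from collections import Counter

def estimate_expected_samples(first_window, window_size, max_samples, trials=10_000):
    """Monte Carlo estimate of how many samples ESC would draw under one
    (window_size, max_samples) setting, treating the empirical distribution of
    the first observation window as the true answer distribution
    (an illustrative assumption)."""
    answers, weights = zip(*Counter(first_window).items())
    total = 0
    for _ in range(trials):
        drawn = 0
        while drawn < max_samples:
            window = random.choices(answers, weights=weights, k=window_size)
            drawn += window_size
            if len(set(window)) == 1:  # unanimous window -> early stop
                break
        total += drawn
    return total / trials

# Hypothetical first window with a fairly concentrated answer distribution.
print(estimate_expected_samples(["42", "42", "42", "17", "42"],
                                window_size=5, max_samples=40))
```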

We evaluate the proposed ESC on six benchmark datasets for three types of inference tasks:

  • Arithmetic reasoning: MATH and GSM8K
  • Commonsense reasoning: CommonsenseQA and StrategyQA
  • Symbolic reasoning: Last Letter Concatenation and Coin Flip

ESC is evaluated on three language models of different scales: GPT-4, GPT-3.5-Turbo, and LLaMA-2 7B. All experiments were performed in a few-shot setting, without training or fine-tuning the language models. For the MATH dataset the sampling temperature is set to 0.5, and for the other datasets it is set to 0.7.

3.1 Experimental results of ESC

We compare against two baselines: chain-of-thought prompting with greedy decoding (CoT) and the original SC.

The maximum sample size is 64 for the MATH dataset and 40 for the other datasets; ESC uses the same values as its maximum sample size.

Correspondingly, the window size is 8 for MATH and 5 for the other datasets. The reported results are averaged over 10 runs, with variance data omitted for space. L̄ denotes the average number of samples actually used by ESC, and L̄-SC denotes the accuracy of SC when its sample size is set to L̄.

The Xiaohongshu search team proposed a new decoding strategy to reduce the cost of large model inference

Table 1: Test results on six inference tasks

The Xiaohongshu search team proposed a new decoding strategy to reduce the cost of large model inference

Table 2: Inference accuracy with different maximum sample sizes on the MATH dataset (%)

The Xiaohongshu search team proposed a new decoding strategy to reduce the cost of large model inference

Figure 3: Robustness analysis of observation window sizes for different models on the GSM8K dataset

Based on the above results, the following three conclusions can be drawn:

  • ESC significantly reduces costs with little to no impact on performance
  • SC is significantly superior to CoT, confirming the validity of the voting process for reasoning. For ESC, the average number of samples L̄ is much smaller than the corresponding maximum sample size, while performance remains almost unchanged. We also tested SC with L̄ as its sample size, and its accuracy dropped significantly. Overall, ESC greatly reduces cost with little to no impact on performance, and at the same sampling cost ESC achieves higher accuracy.
  • ESC is a decoding process that is robust to the maximum sample size and window size
  • Table 2 and Figure 3 show the performance at different maximum sample sizes and window sizes, respectively. As can be seen, ESC is robust to both the maximum sample size and the window size. As the sample size increases, the performance of SC continues to improve, and on top of this, ESC provides significant cost savings while maintaining performance.
  • Cost savings are positively correlated with performance
  • As shown in Tables 1 and 2, it is clear that cost savings are positively correlated with performance. This is because better performance usually doesn't require larger sample sizes. However, ESC does not require any prior knowledge of model capabilities and task difficulty.

3.2 Experimental results of dynamic control schemes

To verify the effectiveness of the ESC dynamic control scheme, we compared the true and predicted sampling quantities and the percentage change in performance on the GSM8K dataset.

Pearson correlation coefficients were used to measure the agreement, and the results are shown in Table 3 below. They show that the predictions produced by the dynamic control scheme are highly reliable for balancing sampling cost against voting performance.
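As a small sketch of this kind of check (the numbers are made up), predicted and observed average sample counts can be compared with a Pearson correlation:

```python
from statistics import correlation  # Pearson correlation, available in Python 3.10+

predicted_samples = [12.4, 9.8, 15.1, 20.3]   # hypothetical predicted averages per setting
observed_samples  = [11.9, 10.2, 14.6, 21.0]  # hypothetical measured averages per setting
print(correlation(predicted_samples, observed_samples))  # close to 1 -> predictions are reliable
```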


Table 3: Experimental results of the dynamic control scheme

3.3 Experimental results of ESC in the open domain

The original SC applies only to questions with fixed answers. Jain et al. proposed UCS, which extends SC to open-ended generation tasks by replacing exact-match voting with text-similarity matching.

We performed ESC experiments on the MBPP dataset for different sample sizes (window size of 5). The experimental results show that ESC is also suitable for open-ended tasks.
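As a rough sketch of the similarity-matching idea (a simplified stand-in, not the exact UCS scoring from Jain et al.), one can pick the generation that is most similar, on average, to all the others:

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two generations."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 1.0

def most_consistent(generations):
    """Return the generation with the highest average similarity to the others,
    a simplified similarity-based replacement for exact-match voting."""
    return max(
        generations,
        key=lambda g: sum(token_overlap(g, o) for o in generations if o is not g),
    )

# Hypothetical open-ended outputs:
print(most_consistent([
    "def add(a, b): return a + b",
    "def add(a, b): return a + b  # sum",
    "def add(a, b): return b - a",
]))
```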


Table 4: Experimental results of ESC in the open domain

3.4 Robustness studies of ESC

We perform a series of additional experiments to further test the robustness of ESC, including robustness to sampling parameters and prompts:

  • The top half of Figure 4 shows that ESC's sample-count savings remain robust as the decoding sampling temperature increases.
  • The lower-left part of Figure 4 shows that ESC is robust to the value of top-p sampling.
  • The lower-right part of Figure 4 shows that ESC generalizes to the zero-shot setting.
  • Table 5 shows the accuracy of ESC and SC for different groups of demonstrations; ESC is robust across a variety of exemplars.

Figure 4: ESC robustness analysis for sampling temperature, top-p values, and zero-shot prompting

Table 5: Experimental results for different groups of demonstrations

In this work, we introduced a simple and efficient sampling process called Early-Stopping Self-Consistency (ESC). By stopping the decoding process at a high-confidence window, ESC dramatically reduces the cost of SC without sacrificing performance. We further derive a control scheme for ESC to dynamically select the performance-cost balance for different tasks and models, without requiring prior knowledge of model capabilities or task difficulty.

Experimental results show that ESC significantly reduces the actual number of samples needed for self-consistent reasoning on six mainstream benchmarks while achieving similar performance, which is important for large-model inference and can significantly reduce its cost. We also demonstrate that ESC's control scheme can accurately predict the performance-cost trade-off for various tasks and models, better meeting actual budget and performance requirements. Analysis experiments show that ESC robustly achieves significant cost savings under different decoding settings and demonstrations, even on open-ended generation tasks.

Paper: https://arxiv.org/abs/2401.10480

About the authors
  • Li Yiwei
  • He has published several papers in top conferences/journals in the field of machine learning and natural language processing, such as ICLR, AAAI, ACL, EMNLP, NAACL, NeurIPS, KBS, etc., and his main research directions are large language model inference and distillation, open domain dialogue generation, etc.
  • Yuan Peiwen
  • He is currently studying at Beijing Institute of Technology, an intern in the Xiaohongshu Community Search Group, and has published a number of papers in NeurIPS, ICLR, AAAI, EACL, etc. His main research interests are large language model inference and evaluation, and information retrieval.
  • Feng Shaoxiong
  • Responsible for vector recall in Xiaohongshu community search. He graduated from Beijing Institute of Technology with a Ph.D. degree and has published several papers in top conferences/journals in the field of machine learning and natural language processing, such as ICLR, AAAI, ACL, EMNLP, NAACL, EACL, KBS, etc. His main research interests include large language model evaluation, reasoning distillation, generative retrieval, and open-domain dialogue generation.
  • Dōgen-hsien
  • Head of the Xiaohongshu Transaction Search Team. He graduated from Zhejiang University with a Ph.D., and has published several papers in top conferences in the field of machine learning such as NeurIPS and ICML, and has been a reviewer for many top conferences/journals for a long time. The main business covers content search, e-commerce search, live broadcast search, etc.
  • Zeng Shu
  • Head of Xiaohongshu Community Search Semantic Understanding and Recall. He graduated from the Department of Electronic Engineering at Tsinghua University with a master's degree, and works on algorithms for natural language processing, recommendation, search, and related directions in the Internet industry.

Source: WeChat public account: Xiaohongshu Technology REDtech

Source: https://mp.weixin.qq.com/s/OVNMoAhkbgWISsIplC0vGQ
