
ACL 2023 | Losing money? Bringing GPT-3-class large models into intelligent customer service may not pay off

Source | Xi Xiaoyao Technology Says
Authors | Xiao Xi, Python

When it comes to commercializing large models, intelligent customer service is one of the first scenarios that comes to mind. It has long been one of the most important application areas not just for large models but for NLP as a whole: because human customer service is expensive, AI customer service has gradually entered our lives as model technology has advanced, and almost every major mobile app now ships with some form of intelligent customer service assistant.

Large language models (LLMs) with "Chat" in their names seem naturally suited to intelligent customer service, and LLM-driven customer service is an appealing direction for commercial deployment. At this year's ACL 2023, however, researchers from the conversational AI company LivePerson did the economic math on having large models "replace" customer service work, and found that using a large model such as GPT-3 as the backbone of an intelligent customer service system may well lose money.


The moat of large models, and also their unavoidable problem, is their high training and inference cost. A single response generated with GPT-2 on an NVIDIA A100 GPU costs about 0.0011 cents, while the same response generated with the GPT-3-based Davinci model through OpenAI's API costs about 1.1 cents. The obvious weakness of this kind of estimate is that raw API pricing can differ substantially from what enterprises actually pay in commercial use, and these costs will keep changing as large models evolve rapidly. Moreover, in many application scenarios the original large model API is not called directly: the model has to be adapted to the target scenario, for example through prompt design or fine-tuning, so working out the true cost of using a large model becomes a prominent problem in its own right.

To address this, the paper proposes the Expected Net Cost Savings (ENCS) framework, which measures the overall trade-off between the savings a given LLM brings and the cost of deploying it for different brands. Using ENCS in a case study, the authors found that some smaller models, such as GPT-2, deliver better cost savings on inference and response tasks than GPT-3: although some "response quality" is lost, the "response cost" drops dramatically, which shows that the response cost of the larger models is still too high for customers to actually realize savings.

ENCS

First, let's look at how the expected net cost savings (ENCS) is calculated; the overall measurement process is illustrated in the paper. In its simplest form, ENCS multiplies the probability P(U) that a response generated by the large model is useful (i.e., actually used) by the cost saving S_U that each used response brings, and subtracts the cost C of generating the response:

ENCS = P(U) * S_U - C

Refining this a little: if the model's response is not used directly but is edited or "ignored" (ignoring generally brings a negative cost saving), the equation above becomes:

ENCS = P(U) * S_U + P(E) * S_E + P(I) * S_I - C

Here S_U, S_E, and S_I can be estimated from the hourly cost R of a human agent, the time T_r a human agent needs to write a response, and the time spent accepting, editing, or ignoring the model's suggestion. For example, the saving from an accepted suggestion is roughly the agent's rate multiplied by the time saved, S_U ≈ R * (T_r - T_U), where T_U is the time needed to read and accept it, and likewise for editing and ignoring.

The paper gives a small worked example of computing the ENCS value for a single model response with the three possible actions of accept, edit, and ignore; a minimal sketch of the same calculation follows.
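The sketch below is only an illustration of the ENCS formula above; the action probabilities, timings, and per-response generation cost are placeholders chosen for readability, not numbers from the paper.

```python
# Minimal sketch of the ENCS calculation (all monetary values in cents).

def encs(p_use, p_edit, p_ignore, s_use, s_edit, s_ignore, cost_per_response):
    """Expected net cost savings per model-generated response."""
    return (p_use * s_use
            + p_edit * s_edit
            + p_ignore * s_ignore
            - cost_per_response)

RATE = 10 / 3600 * 100      # $10/hour agent, expressed in cents per second
s_use = RATE * 25           # an accepted reply saves ~25 s of agent time
s_edit = RATE * 10          # an edited reply saves less time
s_ignore = -RATE * 5        # an ignored reply wastes a few seconds of reading

# Placeholder action probabilities and a ~1.1 cent generation cost per call.
print(encs(0.5, 0.3, 0.2, s_use, s_edit, s_ignore, cost_per_response=1.09))
```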


Case study

The paper presents a case study of an anonymous retailer (AR) whose customer base mainly consists of merchants and consumers buying and selling on the AR platform. The professional human agents AR hires receive dedicated training and can respond expertly to different customers and all kinds of questions. AR employs about 350 agents, who send an average of 100,000 messages and handle about 15,000 conversations per month.

From the conversation data provided by the retailer, the paper constructs a brand-specific training dataset (Brand) and a general dataset (General) for AR, and trains 11 mainstream models with three mainstream strategies: prompt engineering, fine-tuning, and knowledge distillation.


To measure how "useful" these "intelligent customer service" responses are, the paper has experts score each model's conversations on whether a reply would be accepted, edited, or ignored. Even for human agents, responses are not always accepted; among the automated systems, the GPT-3-based models perform best.
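For illustration, this is one way such expert annotations could be turned into the action probabilities P(U), P(E), and P(I) that ENCS needs; the label list below is made up, not data from the paper.

```python
# Convert accept/edit/ignore annotations into empirical action probabilities.
from collections import Counter

labels = ["accept", "accept", "edit", "ignore", "accept", "edit", "accept", "accept"]
counts = Counter(labels)
probs = {action: counts[action] / len(labels)
         for action in ("accept", "edit", "ignore")}
print(probs)   # {'accept': 0.625, 'edit': 0.25, 'ignore': 0.125}
```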


Assume a human agent costs $10 per hour, each message takes an average of 30 seconds to write, and an LLM suggestion saves about 25 seconds. Generating a response costs 0.002 cents with GPT-2 and 0.0011 cents with distilled GPT-2; through OpenAI's API it costs 1.09 cents, or 6.54 cents for a fine-tuned model; through Cohere's API it costs 0.25 cents, or 0.5 cents for a fine-tuned model. Evaluating each model's "cost savings" with ENCS, the paper finds that GPT-3, despite its higher response quality, ends up with a negative ENCS value: instead of saving the enterprise money, it actually adds to its cost burden.
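A rough per-message comparison under these assumptions is sketched below. The 0.7 acceptance rate is a placeholder rather than the paper's annotation result, and only the "accept" action is counted for simplicity, but it already shows how the expensive fine-tuned GPT-3 call can flip the saving negative.

```python
RATE = 10 / 3600 * 100        # agent cost in cents per second ($10/hour)
SECONDS_SAVED = 25            # time saved when a suggestion is accepted

def per_message_encs(p_accept, generation_cost_cents):
    # Simplified ENCS: only accepted suggestions counted as savings.
    return p_accept * RATE * SECONDS_SAVED - generation_cost_cents

generation_cost = {           # cents per response, as quoted above
    "GPT-2 (self-hosted)": 0.002,
    "Distilled GPT-2": 0.0011,
    "GPT-3 via OpenAI API": 1.09,
    "Fine-tuned GPT-3": 6.54,
    "Cohere API": 0.25,
}

for name, cost in generation_cost.items():
    print(f"{name:>22}: {per_message_encs(0.7, cost):+.2f} cents per message")
```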


Specifically, with GPT-2 BFT BD the AR retailer saves 4.47 cents per message; at AR's volume of 1,200,000 messages per year, that amounts to roughly $53,653 saved with the GPT-2 model, versus a potential loss of about $18,691 with the GPT-3 model.
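The annual figure follows directly from the per-message saving; the quick check below reproduces it up to rounding against the quoted $53,653.

```python
# Scale the per-message saving to AR's annual message volume.
messages_per_year = 1_200_000
gpt2_saving_cents_per_message = 4.47       # GPT-2 BFT BD, per message
print(messages_per_year * gpt2_saving_cents_per_message / 100)   # ≈ $53,640
```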

From the calculated ENCS, a break-even point can be derived for each model: in the paper's figure, it is where the green line (labor cost savings) crosses the red line (investment in building the model). For a small enterprise handling roughly 500,000 messages per year, building intelligent customer service on large models only pays off if the initial development cost can be driven down quickly; for a large enterprise handling roughly 20 million messages per year, building intelligent agents with large models does lead to real cost savings.
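A minimal break-even sketch, assuming a fixed up-front development cost and a constant net saving per message; both numbers below are illustrative, not the paper's.

```python
def break_even_messages(dev_cost_dollars, saving_cents_per_message):
    """Messages needed before cumulative savings cover the build cost."""
    return dev_cost_dollars / (saving_cents_per_message / 100)

# e.g. a $25,000 build recouped at ~4.5 cents saved per message
print(break_even_messages(25_000, 4.5))   # ≈ 555,556 messages
```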


Summary and discussion

This paper takes a detailed, in-depth look at the business case for large models in intelligent customer service and proposes the ENCS framework for evaluating how much "cost saving" a model's responses actually deliver. Its conclusion is somewhat counterintuitive but entirely reasonable: the cost of applying large models is still high, and only large enterprises, with the economies of scale their volume brings, currently have a real incentive to deploy them, while for small enterprises the cost remains prohibitive. These analyses are mainly intended to offer insights for management and decision-making, and more fine-grained cost accounting remains to be done. In the end, they call not only for technical progress that lowers the cost of large models, but also for third-party platform companies that can solve the real problem of small enterprises being unable to afford them. Let's look forward to what comes next for large models!

Paper title:

The economic trade-offs of large language models: A case study