A complete LLM-based end-to-end question answering system, should include user input verification, question triage, model response, answer quality assessment, prompt iteration, regression testing, as the scale increases, around Prompt version management, automated testing and security protection is also an important topic, this article will explore this process, part of the code reference course "Building Systems with the ChatGPT API"

User input validation

Using OpenAI's moderation API[1]) can help developers identify and filter user input, and audit user input, mainly including the following categories:

Sexual: Includes content that arouses sexual excitement, such as depictions of sexual activity or promotes sexual services, but excludes sex education and health content.
Hate: Includes content that expresses, incites, or promotes hate feelings based on race, gender, ethnicity, religion, nationality, sexual orientation, disability or caste.
Self-harm: Includes content that promotes, encourages, or depicts self-injurious behavior, such as suicide, cuts, and eating disorders.
Violence: Includes content that promotes or glorifies violence, or glorifies the suffering or humiliation of others.

import openai
import pandas as pd
response = openai.Moderation.create(input="策划一场谋杀计划")
moderation_output = response["results"][0]
moderation_output_df = pd.DataFrame(moderation_output)

分类标记类别类别概率值sexualFalseFalse3.962824e-07hateFalseFalse1.962326e-04harassmentFalseFalse1.402294e-02self-harmFalseFalse1.078697e-05sexual/minorsFalseFalse1.448917e-07hate/threateningFalseFalseFalse1.513400e-05violence/graphicFalseFalse9.522112e-07self-harm/intentFalseFalse2.334248e-07self-harm/instructionsFalseFalse3.670997e-10harassment/threateningFalseFalse2.882557e-02violenceTrueTrue9.977435e-01

In the Category field, there are various categories, as well as information about whether the input in each category is flagged, you can see that the input is flagged because of violent content (violence category), each category also provides a more detailed score (probability value), through the category and tag synthesis to determine whether it contains harmful content, output True or False (here True & True). In addition, Prompt should do a good Prompt anti-injection design, refer to this article Claude 2 has been jailbroken? One article takes you through the prompt attack!

Issues are categorized

When dealing with independent instruction set tasks in different situations, the problem type is first classified as the basis for determining which instructions to use, which can be achieved by defining fixed classes and hard-coding instructions related to processing specific categories of tasks, which can improve the quality and security of the system. (Here as a demonstration, this link to call LLM is not required)

delimiter = "####"

system_message = f"""
你现在扮演一名客服。
每个客户问题都将用{delimiter}字符分隔。
将每个问题分类到一个主要类别和一个次要类别中。
以 JSON 格式提供你的输出，包含以下键：primary 和 secondary。

主要类别：计费（Billing）、技术支持（Technical Support）、账户管理（Account Management）或一般咨询（General Inquiry）。

计费次要类别：
取消订阅或升级（Unsubscribe or upgrade）
添加付款方式（Add a payment method）
收费解释（Explanation for charge）
争议费用（Dispute a charge）

技术支持次要类别：
常规故障排除（General troubleshooting）
设备兼容性（Device compatibility）
软件更新（Software updates）

账户管理次要类别：
重置密码（Password reset）
更新个人信息（Update personal information）
关闭账户（Close account）
账户安全（Account security）

一般咨询次要类别：
产品信息（Product information）
定价（Pricing）
反馈（Feedback）
与人工对话（Speak to a human）

"""
user_message = "我想删除我的个人账户"
#user_message = "这个产品有什么用"
messages =  [
{'role':'system',
 'content': system_message},
{'role':'user',
 'content': f"{delimiter}{user_message}{delimiter}"},
]
response = get_completion_from_messages(messages)
print(response)

When the user question is I want to delete my personal account, matching to close the account, additional instructions can be provided to explain how to close the account

{
  "primary": "账户管理",
  "secondary": "关闭账户"
}

When the user asks what is the use of this product, matching product information can provide additional instructions for more product information

{
  "primary": "一般咨询",
  "secondary": "产品信息"
}

The model answers the question

The model answers the user's questions and brings relevant contextual information into the Prompt in a dynamic, on-demand manner.

Too much extraneous information makes the model more confused when dealing with context.
The model itself has a limit on the length of the context and cannot load too much information at once.
Dynamically loading information reduces token costs.
Use smarter retrieval mechanisms instead of just exact matches, such as semantic search in conjunction with text Embedding in the knowledge base.

import openai
def get_completion_from_messages(messages, model="gpt-3.5-turbo", temperature=0):
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature, # 控制模型输出的随机程度
    )
    return response.choices[0].message["content"]

Assess the quality of answers

Use the GPT API to automate evaluation

def eval_with_rubric(test_set, assistant_answer):
    """
    使用 GPT API 评估生成的回答

    参数：
    test_set: 测试集
    assistant_answer: 客服的回复
    """

    cust_msg = test_set['customer_msg']
    context = test_set['context']
    completion = assistant_answer

    # 人设
    system_message = """\
    你是一位助理，通过查看客服使用的上下文来评估客服回答用户问题的情况。
    """

    # 具体指令
    user_message = f"""\
    你正在根据客服使用的上下文评估对问题的提交答案。以下是数据：
    [开始]
    ************
    [用户问题]: {cust_msg}
    ************
    [使用的上下文]: {context}
    ************
    [客服的回答]: {completion}
    ************
    [结束]

    请将提交的答案内容与上下文进行比较，忽略样式、语法或标点符号上的差异。
    回答以下问题：
    客服的回应是否只基于所提供的上下文？（是或否）
    回答中是否包含上下文中未提供的信息？（是或否）
    回应与上下文之间是否存在任何不一致之处？（是或否）
    计算用户提出了多少个问题。（输出一个数字）
    对于用户提出的每个问题，是否有相应的回答？
    问题1：（是或否）
    问题2：（是或否）
    ...
    问题N：（是或否）
    在提出的问题数量中，有多少个问题在回答中得到了回应？（输出一个数字）
"""

    messages = [
        {'role': 'system', 'content': system_message},
        {'role': 'user', 'content': user_message}
    ]

    response = get_completion_from_messages(messages)
    return response

Manually set standard answer assessments

Using Prompt to compare the degree of match between LLM-generated responses and human-set standard answers, this scoring scale actually comes from the OpenAI open-source evaluation framework, which includes many evaluation methods, but here is just one of them.

A collection of standard answers

test_set = [
    {
        "customer_msg": "如何升级我的订阅？",
        "ideal_answer": "您可以在用户设置中找到取消订阅或升级的选项，并按照步骤进行操作。"
    },
    {
        "customer_msg": "怎样绑定银行卡？",
        "ideal_answer": "您可以登录您的账户，然后在付款方式选项中添加新的付款方式，按照页面上的指引操作即可。"
    },
    {
        "customer_msg": "可以查看详细的收费情况吗？",
        "ideal_answer": "当然可以。您可以访问我们的网站，登录您的账户并前往收费解释页面，您会找到有关所有收费的详细解释。"
    },
    {
        "customer_msg": "怎么被乱扣费了？",
        "ideal_answer": "若您对某笔费用有异议，您可以联系我们的客服团队，提供相关细节并说明您的争议。我们的团队将会尽快与您取得联系并解决问题。"
    }
]

def eval_vs_ideal(test_set, assistant_answer):
    """
    评估回复是否与理想答案匹配

    参数：
    test_set: 测试集
    assistant_answer: 助手的回复
    """
    cust_msg = test_set['customer_msg']
    ideal = test_set['ideal_answer']
    completion = assistant_answer

    system_message = """\
    你是一位助理，通过将客服的回答与业务专家回答进行比较，评估客服对用户问题的回答质量。
    请输出一个单独的字母（A 、B、C、D、E），不要包含其他内容。
    """

    user_message = f"""\
    您正在比较一个给定问题的提交答案和专家答案。数据如下:
    [开始]
    ************
    [问题]: {cust_msg}
    ************
    [专家答案]: {ideal}
    ************
    [提交答案]: {completion}
    ************
    [结束]

    比较提交答案的事实内容与专家答案，关注在内容上，忽略样式、语法或标点符号上的差异。
    你的关注核心应该是答案的内容是否正确，内容的细微差异是可以接受的。
    提交的答案可能是专家答案的子集、超集，或者与之冲突。确定适用的情况，并通过选择以下选项之一回答问题：
    （A）提交的答案是专家答案的子集，并且与之完全一致。
    （B）提交的答案是专家答案的超集，并且与之完全一致。
    （C）提交的答案包含与专家答案完全相同的细节。
    （D）提交的答案与专家答案存在分歧。
    （E）答案存在差异，但从事实的角度来看这些差异并不重要。
    选项：ABCDE
"""

    messages = [
        {'role': 'system', 'content': system_message},
        {'role': 'user', 'content': user_message}
    ]
    response = get_completion_from_messages(messages)
    return response

Prompt iteration

In actual use, when encountering complex user problems and the model performance is not as expected, it is necessary to iterate on Prompt. For example, the question is: I canceled my subscription before, but why is there still a charge reminder? It's clear that this is a subcategory of disputed fees, but what actually matches is

{
  "primary": "计费",
  "secondary": "取消订阅或升级"
}

So you need to update and iterate Prompt to identify the user's complex intent, and you can carefully observe what happens to Prompt (in fact, you need to carefully analyze the user's intent, especially the final question.) )

delimiter = "####"

system_message = f"""
你现在扮演一名客服，你需要仔细分析用户的意图，特别是最终的问题。
每个客户问题都将用{delimiter}字符分隔。
将每个问题分类到一个主要类别和一个次要类别中。
以 JSON 格式提供你的输出，包含以下键：primary 和 secondary。

主要类别：计费（Billing）、技术支持（Technical Support）、账户管理（Account Management）或一般咨询（General Inquiry）。

计费次要类别：
取消订阅或升级（Unsubscribe or upgrade）
添加付款方式（Add a payment method）
收费解释（Explanation for charge）
争议费用（Dispute a charge）
...
"""

This time a normal match

{
  "primary": "计费",
  "secondary": "争议费用"
}

Regression testing

After iteration, Prompt, to ensure that it does not negatively affect previous test cases, requires the necessary regression testing to cover some related issues.

测试用例1. 如何升级我的订阅？
{
  "primary": "计费",
  "secondary": "取消订阅或升级"
}
测试用例2. 怎样绑定银行卡？
{
  "primary": "账户管理",
  "secondary": "添加付款方式"
}
测试用例3. 可以查看详细的收费情况吗？
{
  "primary": "计费",
  "secondary": "收费解释"
}
测试用例4.怎么被乱扣费了？
{
  "primary": "计费",
  "secondary": "争议费用"
}

When working with a small number of samples, it is possible to manually run tests and evaluate the results, but as the application matures, the number of problem use cases from users increases, and the size of Prompt increases, automated tests need to be introduced to periodically regress to verify Prompt quality.

Prompt changes should also require version control just like code, and the associated user issue use cases must be traceable.

Finally, there is security construction, Prompt, as an important asset of the company, also needs to be protected, such as using a dedicated LLM to analyze incoming Prompts to identify potential attacks; Store embeddings of previous attacks in a vector database to identify and prevent similar attacks in the future.

Think the content is good, welcome to follow, forward and like~, click to read the original article, get the best reading experience.

Resources

[1]

Moderation API: https://platform.openai.com/docs/guides/moderation

What links should an intelligent question answering system based on a large language model contain?