How do I test a large model?

Author: Not bald programmer

There is a boom in the industry around reducing manual work with AI assistants, increasing software developer productivity with code generators, and innovating through generative AI. These business opportunities have led many development teams to build knowledge bases, use vector databases, and embed large language models (LLMs) into their applications.

Some common use cases for building applications with LLM capabilities include search experiences, content generation, document summarization, chatbots, and customer service applications. Industry examples include developing patient portals in healthcare, improving workflows for junior bankers in financial services, and paving the way for the future of factories in manufacturing.

Companies investing in LLMs face some initial challenges, including improving data governance in terms of data quality, choosing an LLM architecture, addressing security risks, and developing a cloud infrastructure plan.

I'm more concerned with how organizations plan to test their LLMs and applications. Issues of concern include an airline being required to honor a refund offered by its chatbot, lawsuits over copyright infringement, and mitigating the risk of hallucinations.

Amit Jain, co-founder and COO of Roadz, said: "Testing LLMs requires a multifaceted approach that goes beyond technical rigor. Teams should make iterative improvements and maintain detailed documentation of the model's development process, testing methodology, and performance metrics. Partnering with the research community to benchmark and share best practices is also effective."

Four testing strategies for LLMs embedded in applications

Development teams need a testing strategy for LLMs embedded in custom applications. Consider the following practical approaches as a starting point:

  • Create test data to extend software quality assurance
  • Automate model quality and performance testing
  • Evaluate RAG quality based on use case
  • Develop quality metrics and benchmarks

Create test data to extend software quality assurance

Most development teams don't create general-purpose large language models; rather, they develop applications for specific end users and use cases. To develop a testing strategy, the team needs to understand the user roles, goals, workflows, and quality benchmarks involved. "The first requirement for testing LLMs is to understand the tasks that the LLM should be able to solve," says Jakob Praher, CTO at Mindbreeze. "One can then systematically refine the prompts or fine-tune the model."

For example, a large language model designed for customer service might be tested against a dataset of common user questions and reference responses. Other LLM use cases may lack a direct way to evaluate the results, but developers can still use test data for validation. "The most reliable way to test LLMs is to create relevant test data, but the challenge is the cost and time it takes to create such datasets," said Kishore Gadiraju, vice president of engineering at Solix Technologies. "In addition, LLM testing requires bias, fairness, security, content control, and explainability testing."
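A minimal sketch of that kind of dataset-driven check might look like the following; the prompts, reference answers, and the `ask_llm` and `similarity` helpers are hypothetical placeholders (for example, a model client and an embedding-similarity scorer), not part of any specific framework.

```python
# Minimal sketch of test data for a customer-service LLM.
# ask_llm() and similarity() are hypothetical placeholders for your own
# model call and scoring function (e.g., embedding cosine similarity).

TEST_CASES = [
    {
        "prompt": "How do I reset my password?",
        "reference": "Go to Settings > Security, choose 'Reset password', "
                     "and follow the emailed link.",
    },
    {
        "prompt": "What is your refund policy?",
        "reference": "Refunds are available within 30 days of purchase "
                     "with a valid receipt.",
    },
]

def test_customer_service_responses(ask_llm, similarity, threshold=0.8):
    """Fail if any response drifts too far from its reference answer."""
    failures = []
    for case in TEST_CASES:
        response = ask_llm(case["prompt"])
        score = similarity(response, case["reference"])
        if score < threshold:
            failures.append((case["prompt"], score))
    assert not failures, f"Responses below threshold: {failures}"
```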

Gadiraju shared the following LLM testing libraries and tools:

  • AI Fairness 360, an open-source toolkit for checking, reporting, and mitigating discrimination and bias in machine learning models
  • DeepEval, an open-source LLM evaluation framework, similar to Pytest but dedicated to unit testing LLM outputs
  • Baserun, a tool to help debug, test, and iteratively improve models
  • Nvidia NeMo Guardrails, an open-source toolkit for adding programmable constraints to LLM outputs

Automate model quality and performance testing

Monica Romila, Head of Data Science Tools and Runtime at IBM's Data and Artificial Intelligence Division, shared two areas of testing for enterprise LLM use:

Model quality evaluation uses academic and internal datasets to assess quality for use cases such as classification, extraction, summarization, generation, and retrieval-augmented generation (RAG).

Model performance testing validates the model's latency (the time it takes to return a response) and throughput (the amount of data processed in a given period of time).

According to Romila, performance testing depends on two key parameters: the number of concurrent requests and the number of tokens generated (the chunks of text the model processes and produces). "It's important to test a variety of load sizes and types, and to compare performance against existing models to see whether updates are needed."
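A rough sketch of such a load test is shown below; `call_llm` is a hypothetical placeholder for whatever client call the application makes, and word count stands in crudely for tokens generated.

```python
# Rough load-test sketch: measure latency and throughput for an LLM endpoint
# under concurrent requests. call_llm() is a hypothetical placeholder.
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, quantiles

def timed_call(call_llm, prompt):
    start = time.perf_counter()
    response = call_llm(prompt)             # blocking call to the model endpoint
    latency = time.perf_counter() - start
    return latency, len(response.split())   # crude proxy for tokens generated

def load_test(call_llm, prompts, concurrency=8):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda p: timed_call(call_llm, p), prompts))
    wall_time = time.perf_counter() - start

    latencies = [lat for lat, _ in results]
    tokens = sum(tok for _, tok in results)
    return {
        "mean_latency_s": mean(latencies),
        "p95_latency_s": quantiles(latencies, n=20)[18],  # ~95th percentile
        "throughput_req_per_s": len(prompts) / wall_time,
        "throughput_tokens_per_s": tokens / wall_time,
    }
```

Running the same script at several concurrency levels, against both the current and the candidate model, gives the side-by-side comparison Romila describes.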

DevOps and cloud architects should consider the infrastructure required to conduct performance and load testing of LLM applications. Heather Sundheim, General Manager of Solutions Engineering at SADA, said: "Deploying a test infrastructure for large language models involves setting up powerful computing resources, storage solutions, and testing frameworks. Automated provisioning tools such as Terraform and version control systems such as Git play a key role in repeatable deployments and effective collaboration, emphasizing the importance of balancing resources, storage, deployment strategies, and collaboration tools for reliable LLM testing."

Evaluate RAG quality based on use case

Techniques for improving the accuracy of LLMs include centralizing content, updating models with the latest data, and using retrieval-augmented generation (RAG) in the query flow. RAG is essential for combining the power of LLMs with a company's proprietary information.

In a typical LLM application, the user enters a prompt, the application sends it to the LLM, the LLM generates a response, and the application sends the response back to the user. When using RAG, the application first sends prompts to an information database (such as a search engine or vector database) to retrieve relevant and topic-related information. The application sends the prompt and this contextual information to the LLM, which uses it to formulate a response. As a result, RAG limits the LLM's response to relevant and contextual information.
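A minimal sketch of that flow, assuming hypothetical `retriever` and `llm` clients (for example, a vector database wrapper and a model client), might look like this:

```python
# Minimal sketch of the RAG flow described above. The retriever and llm
# objects are hypothetical placeholders; the prompt template is illustrative.

def answer_with_rag(question, retriever, llm, top_k=3):
    # 1. Retrieve topically relevant passages from the knowledge store.
    passages = retriever.search(question, top_k=top_k)

    # 2. Build a prompt that grounds the model in the retrieved context.
    context = "\n\n".join(p.text for p in passages)
    prompt = (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3. Ask the LLM and return the answer plus its sources for attribution.
    answer = llm.generate(prompt)
    return {"answer": answer, "sources": [p.source for p in passages]}
```

Testing can then target the retrieval step (are the passages relevant?) and the generation step (does the answer stay grounded in the retrieved context?) separately.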

Igor Jablokov, Founder and CEO of Pryon, said: "RAG is more suitable for enterprise-level deployments where verifiable attribution of source content is required, especially in critical infrastructure."

Studies have shown that the use of RAG in conjunction with LLMs can reduce hallucinations and improve accuracy. However, using RAG also adds a new component, which needs to be tested for relevance and performance. The type of testing depends on how easy it is to assess RAG and LLM responses, and how well the development team is able to leverage end-user feedback.

I recently spoke with Deon Nicholas, CEO of Forethought, about his company's RAG evaluation options for generative customer support AI. He shared three different approaches:

(1) gold-standard datasets, i.e., human-labeled datasets of correct answers to queries that serve as a benchmark for model performance;

(2) reinforcement learning, i.e., testing the model in real-world scenarios, such as asking users about their satisfaction after interacting with a chatbot;

and (3) adversarial networks, i.e., training a secondary LLM to evaluate the performance of the primary LLM, an approach that provides automated evaluation without relying on human feedback.

"Each approach has its trade-offs and needs to strike a balance between human input and the risk of ignoring mistakes," Nicholas said. The best systems leverage these approaches across system components to minimize errors and facilitate robust AI deployments. ”

Develop quality metrics and benchmarks

Once you have the test data, the new or updated large language model (LLM), and the testing strategy, the next step is to validate the quality against the established goals.

Atena Reyhani, Chief Product Officer at ContractPodAi, said: "To ensure that safe, reliable, and trustworthy AI is developed, it is critical to have specific, measurable key performance indicators (KPIs) and clear guardrails in place. Some criteria to consider include accuracy, consistency, speed, and relevance to domain-specific use cases. Developers need to evaluate the entire LLM ecosystem and operating model in the target domain to ensure that it delivers accurate, relevant, and comprehensive results."

One tool worth exploring is Chatbot Arena, an open environment for comparing LLM results. It uses the Elo rating system, an algorithm commonly used to rank players in competitive games, which works just as well for ranking the responses produced by different LLM algorithms or versions.
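For intuition, here is a minimal sketch of an Elo update for a head-to-head comparison between two models; the starting ratings and K-factor are illustrative defaults, not Chatbot Arena's exact parameters.

```python
# Sketch of an Elo update for pairwise LLM comparisons, as used by
# leaderboards like Chatbot Arena. Ratings and K-factor are illustrative.

def elo_update(rating_a, rating_b, score_a, k=32):
    """score_a is 1.0 if model A's response wins, 0.0 if it loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: two models start at 1500; model A wins a pairwise comparison.
print(elo_update(1500, 1500, 1.0))  # A gains ~16 points, B loses ~16
```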

Joe Regensburger, vice president of research at Immuta, said: "Human evaluation is a core part of testing, especially when it comes to hardening LLMs for real-world queries. Chatbot Arena is an example of crowdsourced testing, a type of human-evaluator study that can provide an important feedback loop for incorporating user feedback."

Romila from IBM Data & AI shared three metrics to consider for different LLM use cases:

(1) The F1 score is a composite score of precision and recall and is suitable for situations where LLMs are used for classification or prediction. For example, a customer support LLM can be evaluated by assessing the accuracy of its recommended course of action.

(2) ROUGE-L can be used to test the performance of RAG and LLMs in summarization use cases, but it typically requires a human-created reference summary against which to evaluate the results.

(3) sacreBLEU, originally a method for evaluating language translation, is now also used to quantitatively evaluate LLM responses, alongside other methods such as TER, ChrF, and BERTScore.
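As a small illustration of the first metric, the sketch below computes macro-averaged precision, recall, and F1 for a hypothetical "recommended action" classification task using scikit-learn; the labels are invented, and ROUGE-L or sacreBLEU would be computed with their own libraries against human-written references.

```python
# Sketch of computing F1 for a classification-style LLM task (e.g., a support
# bot choosing a recommended action). Labels are illustrative; scikit-learn
# is assumed to be installed.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = ["refund", "escalate", "reset_password", "refund", "escalate"]
y_pred = ["refund", "reset_password", "reset_password", "refund", "escalate"]

print("precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("recall:   ", recall_score(y_true, y_pred, average="macro", zero_division=0))
print("F1:       ", f1_score(y_true, y_pred, average="macro"))
```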

Some industries have specific quality and risk metrics to consider. Karthik Sj, Vice President of Product Management and Marketing at Aisera, said: "Assessing age-appropriateness and avoiding toxic content is critical in education, but in consumer-facing applications, relevance and response latency should be prioritized."

Testing does not end once a model is deployed: data scientists should gather end-user reactions, performance metrics, and other feedback to improve the model. Dustin Pearce, VP of Engineering and Chief Information Security Officer at Amplitude, said: "Once deployed, it becomes critical to combine results with behavioral analytics, which provide rapid feedback and clearer measures of model performance."

An important step in preparing for production is to use feature flags in the application. AI technology companies Anthropic, Character.ai, Notion, and Brex all use feature flags when building their products to test applications collaboratively, gradually introduce capabilities to larger groups of users, and run targeted experiments for different user segments.
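A minimal sketch of a percentage-based feature flag for that kind of gradual rollout, using an illustrative hashing scheme rather than any particular vendor's API, might look like this:

```python
# Sketch of a percentage-based feature flag for gradually rolling out an LLM
# feature to user cohorts. The feature names, percentages, and hashing scheme
# are illustrative; real deployments typically use a feature-flag service.
import hashlib

ROLLOUT = {"llm_summaries": 10, "llm_chat_assistant": 50}  # percent of users

def is_enabled(feature, user_id):
    """Deterministically bucket each user into [0, 100) per feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < ROLLOUT.get(feature, 0)

# The same user always lands in the same bucket, so rollouts stay stable
# as the percentage is raised over time.
print(is_enabled("llm_summaries", "user-42"))
```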

Although new techniques for validating LLM applications are emerging, none of them is easy to implement or provides exact results. Today it may be relatively easy to integrate RAG and an LLM to build an application, but that is only the tip of the iceberg when it comes to testing the application and supporting ongoing improvement.
