How do I test a large model?

Author: Not bald programmer

There is a boom in the industry around reducing manual work with AI assistants, increasing software developer productivity with code generators, and innovating through generative AI. These business opportunities have led many development teams to build knowledge bases, use vector databases, and embed large language models (LLMs) into their applications.

Some common use cases for building applications with LLM capabilities include search experiences, content generation, document summarization, chatbots, and customer service applications. Industry examples include developing patient portals in healthcare, improving workflows for junior bankers in financial services, and paving the way for the future of factories in manufacturing.

Companies investing in LLMs face some initial challenges, including improving data governance in terms of data quality, choosing an LLM architecture, addressing security risks, and developing a cloud infrastructure plan.

I'm more concerned with how organizations plan to test their LLMs and applications. Issues of concern include an airline being required to honor a refund offered by its chatbot, lawsuits over copyright infringement, and mitigating the risk of hallucinations.

Amit Jain, co-founder and COO of Roadz, said: "Testing LLMs requires a multifaceted approach that goes beyond technical rigor. Teams should make iterative improvements and maintain detailed documentation of the model's development process, testing methodology, and performance metrics. Partnering with the research community to benchmark and share best practices is also effective."

Four testing strategies for LLMs embedded in applications

Development teams need a testing strategy for LLMs embedded in custom applications. Consider the following practical approaches as a starting point:

  • Create test data to extend software quality assurance
  • Automate model quality and performance testing
  • Evaluate RAG quality based on use case
  • Develop quality metrics and benchmarks

Create test data to extend software quality assurance

Most development teams don't create general-purpose large language models; rather, they develop applications for specific end users and use cases. To develop a testing strategy, the team needs to understand the user roles, goals, workflows, and quality benchmarks involved. "The first requirement for testing LLMs is to understand the tasks that the LLM should be able to solve," says Jakob Praher, CTO at Mindbreeze. "One can then systematically refine the prompts or fine-tune the model."

For example, a large language model designed for customer service might be tested against a dataset of common user questions and reference responses. Other LLM use cases may lack a direct way to evaluate the results, but developers can still use test data for validation. "The most reliable way to test LLMs is to create relevant test data, but the challenge is the cost and time it takes to create such datasets," said Kishore Gadiraju, vice president of engineering at Solix Technologies. "In addition, LLM testing requires bias, fairness, security, content control, and explainability testing."
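A minimal sketch of that kind of dataset-driven check might look like the following; the prompts, reference answers, and the `ask_llm` and `similarity` helpers are hypothetical placeholders (for example, a model client and an embedding-similarity scorer), not part of any specific framework.

```python
# Minimal sketch of test data for a customer-service LLM.
# ask_llm() and similarity() are hypothetical placeholders for your own
# model call and scoring function (e.g., embedding cosine similarity).

TEST_CASES = [
    {
        "prompt": "How do I reset my password?",
        "reference": "Go to Settings > Security, choose 'Reset password', "
                     "and follow the emailed link.",
    },
    {
        "prompt": "What is your refund policy?",
        "reference": "Refunds are available within 30 days of purchase "
                     "with a valid receipt.",
    },
]

def test_customer_service_responses(ask_llm, similarity, threshold=0.8):
    """Fail if any response drifts too far from its reference answer."""
    failures = []
    for case in TEST_CASES:
        response = ask_llm(case["prompt"])
        score = similarity(response, case["reference"])
        if score < threshold:
            failures.append((case["prompt"], score))
    assert not failures, f"Responses below threshold: {failures}"
```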

Gadiraju shared the following LLM testing libraries and tools:

  • AI Fairness 360, an open-source toolkit for checking, reporting, and mitigating discrimination and bias in machine learning models
  • DeepEval, an open-source LLM evaluation framework, similar to Pytest but dedicated to unit testing LLM outputs
  • Baserun, a tool to help debug, test, and iteratively improve models
  • Nvidia NeMo Guardrails, an open-source toolkit for adding programmable constraints to LLM outputs

Automate model quality and performance testing

Monica Romila, Head of Data Science Tools and Runtime at IBM's Data and Artificial Intelligence Division, shared two areas of testing for enterprise LLM use:

Model quality evaluation uses academic and internal datasets to assess quality for use cases such as classification, extraction, summarization, generation, and retrieval-augmented generation (RAG).

Model performance testing validates the model's latency (the time it takes to return a response) and throughput (the amount of data processed in a given period of time).

According to Romila, performance testing depends on two key parameters: the number of concurrent requests and the number of tokens generated (the chunks of text the model processes and produces). "It's important to test a variety of load sizes and types, and to compare performance against existing models to see whether updates are needed."
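A rough sketch of such a load test is shown below; `call_llm` is a hypothetical placeholder for whatever client call the application makes, and word count stands in crudely for tokens generated.

```python
# Rough load-test sketch: measure latency and throughput for an LLM endpoint
# under concurrent requests. call_llm() is a hypothetical placeholder.
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, quantiles

def timed_call(call_llm, prompt):
    start = time.perf_counter()
    response = call_llm(prompt)             # blocking call to the model endpoint
    latency = time.perf_counter() - start
    return latency, len(response.split())   # crude proxy for tokens generated

def load_test(call_llm, prompts, concurrency=8):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda p: timed_call(call_llm, p), prompts))
    wall_time = time.perf_counter() - start

    latencies = [lat for lat, _ in results]
    tokens = sum(tok for _, tok in results)
    return {
        "mean_latency_s": mean(latencies),
        "p95_latency_s": quantiles(latencies, n=20)[18],  # ~95th percentile
        "throughput_req_per_s": len(prompts) / wall_time,
        "throughput_tokens_per_s": tokens / wall_time,
    }
```

Running the same script at several concurrency levels, against both the current and the candidate model, gives the side-by-side comparison Romila describes.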

DevOps and cloud architects should consider the infrastructure required to conduct performance and load testing of LLM applications. Heather Sundheim, General Manager of Solutions Engineering at SADA, said: "Deploying a test infrastructure for large language models involves setting up powerful computing resources, storage solutions, and testing frameworks. Automated provisioning tools such as Terraform and version control systems such as Git play a key role in repeatable deployments and effective collaboration, emphasizing the importance of balancing resources, storage, deployment strategies, and collaboration tools for reliable LLM testing."

Evaluate RAG quality based on use case

Techniques for improving the accuracy of LLMs include centralizing content, updating models with the latest data, and using retrieval-augmented generation (RAG) in the query flow. RAG is essential for combining the power of LLMs with a company's proprietary information.

In a typical LLM application, the user enters a prompt, the application sends it to the LLM, the LLM generates a response, and the application sends the response back to the user. When using RAG, the application first sends prompts to an information database (such as a search engine or vector database) to retrieve relevant and topic-related information. The application sends the prompt and this contextual information to the LLM, which uses it to formulate a response. As a result, RAG limits the LLM's response to relevant and contextual information.
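A minimal sketch of that flow, assuming hypothetical `retriever` and `llm` clients (for example, a vector database wrapper and a model client), might look like this:

```python
# Minimal sketch of the RAG flow described above. The retriever and llm
# objects are hypothetical placeholders; the prompt template is illustrative.

def answer_with_rag(question, retriever, llm, top_k=3):
    # 1. Retrieve topically relevant passages from the knowledge store.
    passages = retriever.search(question, top_k=top_k)

    # 2. Build a prompt that grounds the model in the retrieved context.
    context = "\n\n".join(p.text for p in passages)
    prompt = (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3. Ask the LLM and return the answer plus its sources for attribution.
    answer = llm.generate(prompt)
    return {"answer": answer, "sources": [p.source for p in passages]}
```

Testing can then target the retrieval step (are the passages relevant?) and the generation step (does the answer stay grounded in the retrieved context?) separately.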

Igor Jablokov, Founder and CEO of Pryon, said: "RAG is more suitable for enterprise-level deployments where verifiable attribution of source content is required, especially in critical infrastructure."

Studies have shown that the use of RAG in conjunction with LLMs can reduce hallucinations and improve accuracy. However, using RAG also adds a new component, which needs to be tested for relevance and performance. The type of testing depends on how easy it is to assess RAG and LLM responses, and how well the development team is able to leverage end-user feedback.

I recently spoke with Deon Nicholas, CEO of Forethought, about his company's RAG evaluation options for generative customer support AI. He shared three different approaches:

(1) gold-standard datasets, i.e., human-labeled datasets of correct answers to queries that serve as a benchmark for model performance;

(2) reinforcement learning, i.e., testing the model in real-world scenarios, such as asking users about their satisfaction after interacting with a chatbot;

and (3) adversarial networks, i.e., training a secondary LLM to evaluate the performance of the primary LLM, an approach that provides automated evaluation without relying on human feedback.

"Each approach has its trade-offs and needs to strike a balance between human input and the risk of ignoring mistakes," Nicholas said. The best systems leverage these approaches across system components to minimize errors and facilitate robust AI deployments. ”

Develop quality metrics and benchmarks

Once you have the test data, the new or updated large language model (LLM), and the testing strategy, the next step is to validate the quality against the established goals.

Atena Reyhani, Chief Product Officer at ContractPodAi, said: "To ensure that safe, reliable, and trustworthy AI is developed, it is critical to have specific, measurable key performance indicators (KPIs) and clear guardrails in place. Some criteria to consider include accuracy, consistency, speed, and relevance to domain-specific use cases. Developers need to evaluate the entire LLM ecosystem and operating model in the target domain to ensure that it delivers accurate, relevant, and comprehensive results."

One tool worth exploring is Chatbot Arena, an open environment for comparing LLM results. It uses the Elo rating system, an algorithm commonly used to rank players in competitive games, which works just as well for ranking the responses produced by different LLM algorithms or versions.
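For intuition, here is a minimal sketch of an Elo update for a head-to-head comparison between two models; the starting ratings and K-factor are illustrative defaults, not Chatbot Arena's exact parameters.

```python
# Sketch of an Elo update for pairwise LLM comparisons, as used by
# leaderboards like Chatbot Arena. Ratings and K-factor are illustrative.

def elo_update(rating_a, rating_b, score_a, k=32):
    """score_a is 1.0 if model A's response wins, 0.0 if it loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: two models start at 1500; model A wins a pairwise comparison.
print(elo_update(1500, 1500, 1.0))  # A gains ~16 points, B loses ~16
```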

Joe Regensburger, vice president of research at Immuta, said: "Human evaluation is a core part of testing, especially when it comes to hardening LLMs for real-world queries. Chatbot Arena is an example of crowdsourced testing, a type of human-evaluator study that can provide an important feedback loop for incorporating user feedback."

Romila from IBM Data & AI shared three metrics to consider for different LLM use cases:

(1) The F1 score is a composite score of precision and recall and is suitable for situations where LLMs are used for classification or prediction. For example, a customer support LLM can be evaluated by assessing the accuracy of its recommended course of action.

(2) ROUGE-L can be used to test the performance of RAG and LLMs in summarization use cases, but it typically requires a human-created reference summary against which to evaluate the results.

(3) sacreBLEU, originally a method for evaluating language translation, is now also used to quantitatively evaluate LLM responses, alongside other methods such as TER, ChrF, and BERTScore.
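As a small illustration of the first metric, the sketch below computes macro-averaged precision, recall, and F1 for a hypothetical "recommended action" classification task using scikit-learn; the labels are invented, and ROUGE-L or sacreBLEU would be computed with their own libraries against human-written references.

```python
# Sketch of computing F1 for a classification-style LLM task (e.g., a support
# bot choosing a recommended action). Labels are illustrative; scikit-learn
# is assumed to be installed.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = ["refund", "escalate", "reset_password", "refund", "escalate"]
y_pred = ["refund", "reset_password", "reset_password", "refund", "escalate"]

print("precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("recall:   ", recall_score(y_true, y_pred, average="macro", zero_division=0))
print("F1:       ", f1_score(y_true, y_pred, average="macro"))
```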

Some industries have specific quality and risk metrics to consider. Karthik Sj, Vice President of Product Management and Marketing at Aisera, said: "Assessing age-appropriateness and avoiding toxic content is critical in education, but in consumer-facing applications, relevance and response latency should be prioritized."

Testing does not end once a model is deployed: data scientists should gather end-user reactions, performance metrics, and other feedback to improve the model. Dustin Pearce, VP of Engineering and Chief Information Security Officer at Amplitude, said: "Once deployed, it becomes critical to combine results with behavioral analytics, which provide rapid feedback and clearer measures of model performance."

An important step in preparing for production is to use feature flags in the application. AI technology companies Anthropic, Character.ai, Notion, and Brex all use feature flags when building their products to test applications collaboratively, gradually introduce capabilities to larger groups of users, and run targeted experiments for different user segments.
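A minimal sketch of a percentage-based feature flag for that kind of gradual rollout, using an illustrative hashing scheme rather than any particular vendor's API, might look like this:

```python
# Sketch of a percentage-based feature flag for gradually rolling out an LLM
# feature to user cohorts. The feature names, percentages, and hashing scheme
# are illustrative; real deployments typically use a feature-flag service.
import hashlib

ROLLOUT = {"llm_summaries": 10, "llm_chat_assistant": 50}  # percent of users

def is_enabled(feature, user_id):
    """Deterministically bucket each user into [0, 100) per feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < ROLLOUT.get(feature, 0)

# The same user always lands in the same bucket, so rollouts stay stable
# as the percentage is raised over time.
print(is_enabled("llm_summaries", "user-42"))
```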

Although new techniques for validating LLM applications are emerging, none of them is easy to implement or provides exact results. Today it may be relatively easy to integrate RAG and an LLM to build an application, but that is only the tip of the iceberg when it comes to testing the application and supporting ongoing improvement.
