
"AI" Popular Science: Understanding Agents Based on Large Models

author:Chinese Society of Artificial Intelligence

Redirected from Datawhale

Datawhale practical insights

Author: Andong Chen, a member of Datawhale


Datawhale

An open-source organization focused on AI that brings together many outstanding learners, with the mission "for the learner," growing together with learners.


Foreword

In today's information age, large language models (LLMs) are developing at a remarkable pace and with growing influence. Their powerful reasoning and generation capabilities make them ideal building blocks for agents. This content comes from Datawhale's open-source course "Fundamentals of Large Language Models" (so-large-lm), a cutting-edge course dedicated to exploring and understanding the development of large models.


Through this open-source course, readers will gain a more comprehensive understanding of agents: their design principles, advantages, application scenarios, and current limitations. We hope the course provides value to a wide range of learners, promotes deeper study and application of the theoretical foundations of large models, and stimulates further innovation and exploration.

Introduction

Throughout the history of technology, humans have tried to create agents, i.e., AI Agents, entities that can autonomously accomplish preset goals and assist humans with a variety of tedious tasks. Over the years, agents have attracted continuous research and exploration as an active application area of artificial intelligence, and today's rapidly evolving large language models are giving this field new momentum.

In the implementation of agent technology, and especially in building agents on large language models (LLMs), the LLM plays a crucial role in the agent's intelligence. Such agents can perform complex tasks by integrating an LLM with planning, memory, and other key modules. In this framework, the LLM acts as the core processing unit, or "brain," responsible for managing and executing the sequence of actions required for a specific task or user query.

To illustrate the potential of LLM agents, imagine that we need to design a system to respond to the following query:

What are the most popular EV brands in Europe at the moment?

This question can be answered directly by an LLM trained on up-to-date data. If the LLM lacks real-time data, it can use a RAG (Retrieval-Augmented Generation) system to access the latest automotive sales data or market reports.
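The RAG flow just described can be sketched in a few lines. This is a minimal illustration under toy assumptions: the corpus, the keyword-overlap scoring, and the prompt template are all stand-ins for a real embedding-based retriever and LLM call.

```python
def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Rank documents by naive keyword overlap with the query
    (a real system would use embedding similarity)."""
    q_words = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Prepend retrieved context so the LLM can answer from fresh data."""
    context = "\n".join(docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "2024 EV sales report: brand A led European electric vehicle registrations.",
    "A recipe for sourdough bread.",
]
docs = retrieve("most popular EV brands in Europe", corpus)
prompt = build_prompt("What are the most popular EV brands in Europe?", docs)
```

The key point is that the retrieved text is injected into the prompt, so the model answers from current data rather than from its frozen training set.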

Now, let's consider a more complex query:

What has been the growth trend of the European EV market over the past decade and how has this affected environmental policy? Can you provide a chart of the market growth over this period?

Relying on an LLM alone is not enough to answer such a complex question. While a RAG system combining an LLM with an external knowledge base can help, answering the question holistically requires more: it must first be broken down into sub-questions, each solved with specific tools and processes, before the pieces can be assembled into the desired answer. One possible approach is to develop an LLM agent with access to the latest environmental policy literature, market reports, and publicly available databases covering the growth of the electric vehicle market and its environmental impact.

In addition, the LLM agent needs a "data analysis" tool that can turn the collected data into intuitive charts and graphs, clearly showing the growth trend of the European electric vehicle market over the past decade. While such advanced capabilities are currently somewhat idealized, they involve several important technical considerations, such as developing a solution plan and possibly a memory module, which help the agent track its operational process and monitor and evaluate overall progress.

LLM Agent Architecture

In general, an LLM-based agent framework includes the following core components:

  • User Request - a question or request from the user
  • Agent/Brain - the agent core that acts as a coordinator
  • Planning - assists the agent in planning future actions
  • Memory - manages the agent's past behavior
  • Tools - external resources the agent can call on to obtain information or act on the environment
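These components can be wired together as a minimal agent loop. The sketch below is illustrative only: the `llm()` stub and the trivial planner stand in for a real model call and a real planning module.

```python
def llm(prompt: str) -> str:
    """Placeholder for the core LLM call (the agent's 'brain')."""
    return f"step for: {prompt}"

class Agent:
    def __init__(self):
        self.memory: list[str] = []          # records past thoughts/actions

    def plan(self, request: str) -> list[str]:
        """Break the user request into subtasks (here, a trivial split)."""
        return [s.strip() for s in request.split(" and ")]

    def run(self, request: str) -> list[str]:
        results = []
        for subtask in self.plan(request):   # planning module
            out = llm(subtask)               # brain executes each step
            self.memory.append(out)          # memory module logs it
            results.append(out)
        return results

agent = Agent()
out = agent.run("summarize EV growth and chart the trend")
```

Even in this toy form, the division of labor is visible: planning decomposes the request, the LLM executes each piece, and memory accumulates the trace.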

Agents

The LLM is crucial in an agent system built around it, acting as the system's brain and the coordinator of its multiple tasks. The agent parses and executes instructions based on prompt templates that not only guide the LLM but also define the agent's role and personality in detail, including background, personality, social environment, and demographic information. This personified description allows the agent to understand and perform tasks more precisely.

In order to optimize this process, the system design needs to take into account several key aspects:

  • First, the system needs rich contextual understanding and continuous learning capabilities: it must process and remember large amounts of interaction information, and continuously refine its execution strategies and prediction models.
  • Second, multi-modal interaction integrates multiple input and output forms such as text, images, and sound, so that the system can handle complex tasks and environments more naturally and effectively. Dynamic role adaptation and personalized feedback are also key to improving user experience and execution efficiency.
  • Finally, security and reliability must be strengthened to ensure stable operation and win users' trust.

Integrating these elements, LLM-based agent systems can demonstrate greater efficiency and accuracy on specific tasks, along with stronger adaptability and sustainability in user interaction and long-term system development. Such a system is not merely a tool for executing commands, but an intelligent partner that can understand complex instructions, adapt to different scenarios, and continuously optimize its own behavior.

Planning

Planning without feedback

The planning module is key to the agent's ability to understand the problem and reliably find a solution; it responds to user requests by breaking them down into necessary steps or subtasks. Popular techniques for task decomposition include Chain of Thought (CoT) and Tree of Thoughts (ToT), which can be categorized as single-path reasoning and multi-path reasoning, respectively.

First, the Chain of Thought (CoT) approach tackles problems by increasing test-time computation, subdividing a complex problem step by step into a series of smaller, simpler tasks. This not only makes large tasks manageable, but also helps us understand how the model solves problems step by step.
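In practice, the simplest form of CoT is a prompt-level change. The sketch below only constructs prompts (no model is called); the wording of the trigger phrase is the commonly used zero-shot CoT instruction.

```python
def cot_prompt(question: str) -> str:
    """Append a step-by-step trigger so the model spells out
    intermediate reasoning instead of answering in one jump."""
    return f"Q: {question}\nA: Let's think step by step."

# Same question, with and without the CoT trigger:
plain = "Q: If a pack has 3 pens and I buy 4 packs, how many pens?\nA:"
cot = cot_prompt("If a pack has 3 pens and I buy 4 packs, how many pens?")
```

The only difference is the added instruction, yet it reliably elicits intermediate reasoning steps from sufficiently capable models.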

Building on this, researchers proposed the Tree of Thoughts (ToT) method, which explores multiple possible paths at each decision step, forming a tree structure. This approach allows different search strategies, such as breadth-first or depth-first search, and uses a classifier to evaluate the effectiveness of each possibility.
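A breadth-first ToT search can be sketched as follows. This is a toy illustration: `expand()` stands in for an LLM proposing candidate thoughts, and `score()` stands in for the evaluator/classifier; here thoughts are just bit strings scored by counting '1's.

```python
def expand(thought: str) -> list[str]:
    """Propose two child thoughts (an LLM would generate these)."""
    return [thought + "0", thought + "1"]

def score(thought: str) -> int:
    """Evaluate a thought (an LLM classifier would do this);
    here, simply the number of '1' characters."""
    return thought.count("1")

def tot_bfs(root: str, depth: int, beam: int = 2) -> str:
    """Breadth-first search over the thought tree, keeping the
    `beam` best-scoring thoughts at each depth."""
    frontier = [root]
    for _ in range(depth):
        children = [c for t in frontier for c in expand(t)]
        frontier = sorted(children, key=score, reverse=True)[:beam]
    return frontier[0]

best = tot_bfs("", depth=3)  # best thought after 3 levels of search
```

Swapping the sort for a stack or queue discipline turns the same skeleton into depth-first or plain breadth-first exploration.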


Task decomposition can be done in different ways, including prompting the LLM directly with simple instructions, using task-specific instructions, or incorporating direct human input. These strategies allow the solution to be flexibly adapted to different needs. Another approach is LLM+P, which combines an LLM with a classical planner and relies on the external planner for long-horizon planning. This approach first translates the problem into PDDL format, then uses the planner to generate a solution, and finally translates that solution back into natural language. It suits scenarios requiring detailed long-horizon planning, although reliance on domain-specific PDDL and planners may limit its applicability.
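The three-stage LLM+P pipeline can be sketched like this. All three stages below are stubs for illustration: a real system would use an LLM for the two translation steps and an actual PDDL planner (e.g., Fast Downward) in the middle.

```python
def to_pddl(task: str) -> str:
    """Stage 1: an LLM would translate the request into a PDDL
    problem description; here a trivial stub."""
    return f"(:goal ({task.replace(' ', '-')}))"

def classical_planner(problem: str) -> list[str]:
    """Stage 2: an external classical planner would solve the PDDL
    problem; here a hard-coded toy plan."""
    return ["(pick-up a)", "(stack a b)"]

def to_natural_language(plan: list[str]) -> str:
    """Stage 3: an LLM would verbalize the plan back for the user."""
    return " then ".join(step.strip("()") for step in plan)

problem = to_pddl("stack block a on b")
answer = to_natural_language(classical_planner(problem))
```

The design choice worth noting is that the LLM never plans; it only translates in and out of a formalism where a sound, dedicated planner does the heavy lifting.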


These innovative approaches not only demonstrate the diversity and flexibility of problem-solving, but also provide us with a new perspective on how LLMs approach complex tasks.

Planning with feedback

The planning modules above involve no feedback, which makes long-horizon planning for complex tasks challenging. To address this, the model can be made to iteratively reflect on and refine its execution plan based on past actions and observations. The goal is to correct and improve on past mistakes, which raises the quality of the final result. This matters especially in complex real-world environments and tasks, where trial and error is key to completing the task. Two popular methods implementing this reflection or criticism mechanism are ReAct and Reflexion.


The ReAct method integrates reasoning and acting in large language models by combining task-specific discrete actions with language descriptions. Discrete actions let the LLM interact with its environment (for example, via the Wikipedia search API), while the language component lets the LLM generate natural-language reasoning paths. This strategy improves the LLM's ability to handle complex problems and, through direct interaction with the external environment, enhances its adaptability and flexibility in real-world applications. The natural-language reasoning paths also increase the interpretability of the model's decision process, so users can better understand and verify its behavior. ReAct's design likewise brings transparency and control over the model's actions, supporting safe and reliable behavior. Its integration of reasoning and acting thus opens a new way to solve complex problems with large language models.
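The Thought/Action/Observation cycle at the heart of ReAct can be sketched as a loop. Everything here is a toy stand-in: `llm()` is a scripted stub, `search()` replaces a real Wikipedia API call, and the bracketed action format mimics the style used in ReAct traces.

```python
def llm(history: str) -> str:
    """Scripted stub model: searches first, then finishes."""
    if "Observation" not in history:
        return "Thought: I should look this up.\nAction: search[EV market]"
    return "Thought: I have the answer.\nAction: finish[growing]"

def search(query: str) -> str:
    """Stand-in for an external tool such as a Wikipedia search API."""
    return "The EV market is growing."

def react(question: str, max_steps: int = 5) -> str:
    history = f"Question: {question}"
    for _ in range(max_steps):
        step = llm(history)                      # Thought + Action
        action = step.split("Action: ")[1]
        if action.startswith("finish["):
            return action[len("finish["):-1]     # final answer
        tool_arg = action[len("search["):-1]
        # append the Observation so the next Thought can react to it
        history += f"\n{step}\nObservation: {search(tool_arg)}"
    return "no answer"

answer = react("Is the EV market growing?")
```

The feedback lives in the growing `history` string: each Observation is fed back into the next model call, which is what lets the model revise its reasoning mid-task.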

Reflexion is a framework that enhances agents' reasoning skills by equipping them with dynamic memory and self-reflection abilities. The method employs a standard reinforcement learning (RL) setup, where the reward model provides a simple binary reward and the action space follows the setting in ReAct, i.e., the task-specific action space is augmented with language to enable complex reasoning steps. After each action, the agent computes a heuristic assessment and, based on the results of self-reflection, may selectively reset the environment to start a new attempt. The heuristic function determines when a trajectory is inefficient or contains hallucinations and should be stopped. Inefficient planning refers to trajectories that take too long without succeeding; hallucination is defined as encountering a series of consecutive identical actions that yield the same observation in the environment.
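The hallucination heuristic described above, flagging a run of identical action/observation pairs, can be written directly. The repeat threshold below is an illustrative choice, not a value from the paper.

```python
def is_hallucinating(trajectory: list[tuple[str, str]], repeats: int = 3) -> bool:
    """True if the last `repeats` steps are identical
    (action, observation) pairs, i.e. the agent is looping."""
    if len(trajectory) < repeats:
        return False
    tail = trajectory[-repeats:]
    return all(step == tail[0] for step in tail)

# A stuck trajectory repeats the same action and sees the same result:
traj = [("search[x]", "no result")] * 3
# A healthy trajectory keeps making distinct moves:
ok = [("search[x]", "no result"), ("search[y]", "found")]
```

When the check fires, a Reflexion-style agent would record a self-reflection about the failure and reset for a fresh attempt instead of looping further.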


Memory

The memory module is the agent's internal log, responsible for storing past thoughts, actions, observations, and interactions with users. It is essential for the agent's learning and decision-making. According to the LLM agent literature, memory can be divided into two main types, short-term and long-term, plus hybrid memory that combines the two to improve the agent's long-term reasoning and accumulation of experience.

  • Short-term memory - contextual information about the current situation; it is ephemeral and limited, typically realized through in-context learning within the model's context window.
  • Long-term memory - stores the agent's historical actions and thoughts, implemented via an external vector store for fast retrieval of important information.
  • Hybrid memory - integrates short-term and long-term memory, both optimizing the agent's understanding of the current situation and strengthening its use of past experience, thereby improving its long-term reasoning and experience accumulation.

When designing the memory module of an agent, it is necessary to select the appropriate memory format according to the task requirements, such as natural language, embedding vectors, databases, or structured lists. These different formats have a direct impact on the agent's information processing ability and task execution efficiency.
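A hybrid memory along these lines can be sketched with a bounded buffer for short-term context and a retrievable store for long-term history. The word-overlap ranking below is a toy stand-in for the embedding-vector similarity search an actual long-term store would use.

```python
from collections import deque

class Memory:
    def __init__(self, short_capacity: int = 3):
        # short-term: only the most recent entries, like a context window
        self.short = deque(maxlen=short_capacity)
        # long-term: everything, available for retrieval
        self.long: list[str] = []

    def add(self, entry: str) -> None:
        self.short.append(entry)   # oldest entry drops out when full
        self.long.append(entry)

    def recall(self, query: str, k: int = 1) -> list[str]:
        """Retrieve the most relevant long-term entries; a real system
        would rank by embedding similarity instead of word overlap."""
        q = set(query.lower().split())
        ranked = sorted(self.long,
                        key=lambda e: len(q & set(e.lower().split())),
                        reverse=True)
        return ranked[:k]

mem = Memory()
for note in ["user asked about EV sales", "charted EV growth",
             "greeted user", "said goodbye"]:
    mem.add(note)
```

The short-term buffer silently discards the oldest note once capacity is exceeded, while `recall()` can still surface it from long-term storage when it becomes relevant again.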

Tools

Tools enable large language models (LLMs) to obtain information or complete subtasks through external environments such as the Wikipedia search API, code interpreters, and math engines. They also include databases, knowledge bases, and other external models, greatly expanding what LLMs can do. In our earlier car-sales query, producing an intuitive chart via code is an example of tool use: the agent executes code and generates the chart information the user requested.

LLMs utilize tools in different ways:

  1. MRKL: an architecture for autonomous agents. An MRKL system contains a collection of "expert" modules, while a general-purpose large language model (LLM) acts as a router that directs queries to the most appropriate expert. These modules can be neural (other large models) or symbolic (e.g., a math calculator, currency converter, or weather API). Using arithmetic as a test case, the authors ran a fine-tuning experiment in which the LLM calls a calculator. The experiments showed that verbally stated math problems are harder than explicitly stated ones, because the LLM (the 7B Jurassic-1 large model) failed to reliably extract the correct arguments for basic arithmetic operations. The results highlight that, for external symbolic tools to work reliably, knowing when and how to use them is critical, and that this is determined by the LLM's capabilities.
  2. Toolformer: this work trains a model to decide when to call which APIs, what arguments to pass, and how best to incorporate the results. Training uses a fine-tuning approach that requires only a few examples per API. The work integrates a range of tools, including a calculator, a Q&A system, a search engine, a translation system, and a calendar. Toolformer achieves significantly improved zero-shot performance across a wide range of downstream tasks, often competitive with much larger models, without sacrificing its core language-modeling ability.
  3. Function Calling: a strategy for augmenting an LLM's tool use by defining a set of tool APIs and providing them to the model as part of the request, enabling the model to call external functions or services while handling text tasks. This approach extends the LLM to tasks beyond the scope of its training data and improves the accuracy and efficiency of task execution.
  4. HuggingGPT: an LLM-powered framework designed to autonomously handle complex AI tasks. HuggingGPT blends the capabilities of LLMs with the resources of the machine-learning community (ChatGPT combined with Hugging Face), allowing it to process inputs from different modalities. The LLM plays the role of the brain: it first uses ChatGPT to analyze the user's request and understand the intent, breaking it into candidate subtasks; it then selects the expert models hosted on Hugging Face best suited to each subtask based on their model descriptions; each selected model is called and executed, and its results are fed back to ChatGPT; finally, ChatGPT integrates all the models' predictions into a response for the user. This complete pipeline of task planning, model selection, task execution, and response generation extends beyond traditional single-model processing and, through intelligent model selection and execution, offers an efficient and accurate solution for cross-domain tasks.
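The function-calling pattern in item 3 can be sketched concretely. The tool registry, the JSON reply format, and the `get_ev_sales` function below are all illustrative assumptions, not any particular vendor's API; real providers return tool calls in their own structured formats.

```python
import json

def get_ev_sales(region: str) -> dict:
    """Hypothetical local tool standing in for a real data source."""
    return {"region": region, "units": 1_000_000}

# Registry of tools the model is allowed to call:
TOOLS = {"get_ev_sales": get_ev_sales}

def dispatch(model_reply: str) -> dict:
    """Parse the model's structured tool call (assumed JSON here)
    and invoke the matching local function with its arguments."""
    call = json.loads(model_reply)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# A simulated model reply requesting a tool call:
reply = '{"name": "get_ev_sales", "arguments": {"region": "Europe"}}'
result = dispatch(reply)
```

The division of labor is the point: the model only emits a structured request, while the host application validates it against the registry and performs the actual call, keeping execution under the application's control.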

The combination of these strategies and tools not only enhances the ability of the LLM to interact with the external environment, but also provides strong support for handling more complex and cross-domain tasks, opening a new chapter in the capabilities of agents.

Challenges of Agents

Building agents based on large language models (LLMs) is an emerging field that faces numerous challenges and limitations. Here are a few of the main challenges and possible solutions:

Role suitability issues

Agents need to work effectively within a specific domain; for roles that are hard to characterize or transfer, performance can be improved by fine-tuning the LLM in a targeted way, including boosting its ability to represent unusual characters or psychological traits.

Context length limitations

The limited context length constrains what the LLM can do, although vector storage and retrieval make a larger knowledge base accessible. System designs must innovate to operate effectively within this limited communication bandwidth.
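One common way to operate within a limited window is to keep only the most recent turns that fit a token budget. In this sketch, word count stands in for a real tokenizer, an illustrative simplification.

```python
def fit_context(history: list[str], budget: int) -> list[str]:
    """Keep the most recent turns whose total length fits the budget,
    dropping the oldest ones first (length in words approximates tokens)."""
    kept: list[str] = []
    used = 0
    for turn in reversed(history):   # walk from newest to oldest
        cost = len(turn.split())
        if used + cost > budget:
            break                    # this turn and anything older is dropped
        kept.append(turn)
        used += cost
    return list(reversed(kept))      # restore chronological order

history = ["one two three", "four five", "six"]
window = fit_context(history, budget=3)
```

Dropped turns need not be lost: as described in the Memory section, they can be moved to long-term storage and retrieved back into the window when relevant.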

Robustness of prompts

The agent's prompt design needs to be robust enough that small changes do not cause reliability problems. Possible solutions include automatically optimizing and adjusting prompts, or using LLMs to generate prompts automatically.

Control of knowledge boundaries

Controlling the LLM's internal knowledge, and avoiding the introduction of bias or the use of knowledge the user is unaware of, is a challenge. This requires agents to be more transparent and controllable in how they process information.

Efficiency and cost issues

The efficiency and cost of LLMs when handling a large number of requests are important considerations. Optimizing inference speed and cost efficiency is the key to improving the performance of multi-agent systems.

Overall, LLM-based agent building is a complex and multifaceted challenge that requires innovation and optimization in multiple aspects. Continuous research and technological development are essential to overcome these challenges.
