
Large-Model Autonomous Agents Are Taking Off, and OpenAI Is Quietly Working on Them Too: An Insider's Analysis


Has OpenAI, which aspires to build AGI, already been quietly building a large-model agent of its own?

In recent months, as large language models have continued to surge in popularity, research on using them to build AI agents has come into public view. The concept of the AI agent has taken off as well, repeatedly stretching people's imagination.

First, researchers at Stanford University and Google built a "virtual town" whose inhabitants are not people but 25 AI agents. They behave more believably than human role-players and even hosted a Valentine's Day party.

Soon after, SenseTime, Tsinghua University, and other institutions proposed Ghost in the Minecraft (GITM), a generalist AI agent that learns to solve tasks on its own and outperforms all previous agents in Minecraft.


At the same time, NVIDIA's open-source Voyager also gave the AI community no small shock. A large-model-driven, lifelong-learning game agent, Voyager plays Minecraft at a high level. The emergence of these AI agents is even regarded as a prototype of future artificial general intelligence (AGI).

Many leading figures and technology giants in the AI field take great interest in, and have high expectations for, the development of AI agents. Andrej Karpathy, Tesla's former AI director who returned to OpenAI earlier this year, revealed at a developer event that whenever a new AI agent paper comes out, OpenAI takes it very seriously and discusses it with great interest.


Source:

https://twitter.com/GPTDAOCN/status/1673781206121578498

So one cannot help but ask: what are the components of an AI agent? Where does its magic come from?

Recently, Lilian Weng, head of Safety Systems at OpenAI, wrote a blog post about AI agents. She argues that the core driver of an AI agent is the large language model, and that planning, memory, and tool use are the three key components for realizing it.


An earlier Heart of the Machine article, "The Developers Behind GPT-4: Seven Teams, More Than Thirty Chinese", also introduced Lilian Weng, who joined OpenAI in 2018 and worked mainly on pretraining, reinforcement learning & alignment, and model safety within the GPT-4 project.

Lilian Weng walks through each component in detail and provides case studies, including agents for scientific discovery, generative agent simulations, and proof-of-concept examples. She also shares her views on the challenges AI agents will face in the future.


Heart of the Machine has compiled and organized the core content of the blog post.

Blog Link:

https://lilianweng.github.io/posts/2023-06-23-agent/

01

The concept of an agent system

In an autonomous agent system powered by a large language model (LLM), the LLM acts as the agent's brain and is complemented by three key components:

The first is planning, which is divided into the following:

  • Subgoal decomposition: the agent breaks large tasks down into smaller, manageable subgoals so that complex tasks can be handled efficiently;
  • Reflection and refinement: the agent can self-criticize and reflect on past actions, learn from mistakes, and refine its approach for future steps, thereby improving the quality of the final result.

The second is memory, which is divided into short-term memory and long-term memory:

  • Short-term memory: the author regards all in-context learning (see prompt engineering) as making use of the model's short-term memory to learn.
  • Long-term memory: provides the agent with the ability to retain and recall (effectively unlimited) information over long periods, usually by means of an external vector store and fast retrieval.

The third is tool use:

  • The agent learns to call external APIs to obtain extra information that is missing from the model weights (which are hard to change after pretraining), including up-to-date information, code execution capabilities, access to proprietary information sources, and so on.

Figure 1 below provides an overview of the LLM-enabled autonomous agent system.


02

Component 1: Planning

We know that a complex task usually involves many steps. The agent must understand what the task is and plan ahead.

Task decomposition

The first is Chain of Thought (CoT), which has become the standard prompting technique for improving model performance on complex tasks. The model is instructed to "think step by step", using more test-time computation to break a hard task down into smaller, simpler steps. CoT turns a large task into several manageable ones and sheds light on the model's reasoning process.
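
To make this concrete, here is a minimal sketch of CoT prompting in Python. The llm helper is a hypothetical stand-in for any completion or chat API and is not part of the original blog post.

# Minimal Chain-of-Thought prompting sketch (illustrative only).
# `llm` is a hypothetical helper that sends a prompt to a language model
# and returns the generated text.

def llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to the model of your choice")

def cot_answer(question: str) -> str:
    prompt = (
        f"Q: {question}\n"
        "A: Let's think step by step."   # the CoT trigger phrase
    )
    return llm(prompt)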

The second is Tree of Thoughts (ToT), which extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into several thinking steps and generates multiple candidate thoughts per step, forming a tree structure. The search over this tree can be breadth-first (BFS) or depth-first (DFS), and each state is evaluated by a classifier (via a prompt) or by majority vote.
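
To make the tree search concrete, below is a minimal breadth-first Tree-of-Thoughts sketch. The propose_thoughts and score_state helpers are assumptions (in practice both would be LLM prompts), and the sketch simplifies the method rather than reproducing the paper's implementation.

# Breadth-first Tree-of-Thoughts sketch (simplified, illustrative only).
# propose_thoughts(state, k): asks the LLM for k candidate next thoughts.
# score_state(state): rates a partial solution, e.g. via a prompt or a vote.

def tree_of_thoughts_bfs(problem, propose_thoughts, score_state,
                         steps=3, breadth=5, keep=2):
    frontier = [problem]                          # a state = problem + thoughts so far
    for _ in range(steps):
        candidates = []
        for state in frontier:
            for thought in propose_thoughts(state, breadth):
                candidates.append(state + "\n" + thought)
        candidates.sort(key=score_state, reverse=True)   # prune to the best states
        frontier = candidates[:keep]
    return frontier[0] if frontier else problem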

Specifically, the task decomposition process can be done in three ways:

  • Simple LLM prompts, such as "What are the steps for XYZ?" or "What are the subgoals for achieving XYZ?";
  • Use task-specific instructions, such as "write a story outline";
  • Manual input.

A final, quite different approach is LLM+P, which relies on an external classical planner for long-horizon planning. This method uses the Planning Domain Definition Language (PDDL) as an intermediate interface for describing the planning problem. In this process, the LLM (1) translates the problem into a "Problem PDDL", then (2) asks a classical planner to generate a PDDL plan based on an existing "Domain PDDL", and finally (3) translates the PDDL plan back into natural language.

Essentially, the planning step is outsourced to an external tool, under the assumption that a domain-specific PDDL and a suitable planner are available. This is common in certain robotics setups but not in many other domains.
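
The three LLM+P steps can be sketched as a short pipeline. The llm helper and the planner command line below are assumptions made for illustration (a classical PDDL planner such as Fast Downward would be used in practice, with its own flags).

# LLM+P pipeline sketch (illustrative; the planner invocation is an assumption).
import subprocess

def llm(prompt: str) -> str:
    raise NotImplementedError

def llm_plus_p(task_description: str, domain_pddl_path: str) -> str:
    # (1) The LLM translates the natural-language task into a Problem PDDL.
    problem_pddl = llm(
        "Translate this task into a PDDL problem for the given domain:\n"
        + task_description
    )
    with open("problem.pddl", "w") as f:
        f.write(problem_pddl)

    # (2) A classical planner generates a PDDL plan from Domain PDDL + Problem PDDL.
    #     The exact command depends on which planner is installed.
    subprocess.run(["planner", domain_pddl_path, "problem.pddl",
                    "--plan-file", "plan.txt"], check=True)
    with open("plan.txt") as f:
        pddl_plan = f.read()

    # (3) The LLM translates the PDDL plan back into natural language.
    return llm("Explain this PDDL plan in plain English:\n" + pddl_plan)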

Self-reflection

Self-reflection plays a crucial role in real-world tasks where trial and error is inevitable: it allows autonomous agents to improve iteratively by refining past action decisions and correcting earlier mistakes.

ReAct integrates reasoning and acting within the LLM by extending the action space to a combination of task-specific discrete actions and the language space. Discrete actions let the LLM interact with the environment (for example, using the Wikipedia search API), while the language space lets the LLM generate reasoning traces in natural language.

The ReAct prompt template contains clear steps for LLM thinking, roughly formatted as follows:

Thought: ...
Action: ...
Observation: ...
... (Repeated many times)           
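
A minimal loop that drives this Thought/Action/Observation template might look like the sketch below. The llm callable and the tools dictionary are hypothetical placeholders; real implementations add stop sequences and far more robust parsing.

# Minimal ReAct-style loop (illustrative only).
# `tools` maps an action name (e.g. "Search") to a Python callable.

def react_agent(question, llm, tools, max_steps=6):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")        # model emits a Thought and an Action
        transcript += "Thought:" + step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        # Crude parse of a line such as: Action: Search[Apollo 11]
        action_lines = [l for l in step.splitlines() if l.strip().startswith("Action:")]
        if not action_lines:
            break
        name, arg = action_lines[0].split("Action:", 1)[1].strip().split("[", 1)
        observation = tools[name.strip()](arg.rstrip("]"))
        transcript += f"Observation: {observation}\n"
    return transcript    # fall back to the raw trace if no final answer was produced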

Figure 2 below shows an example of an inference trajectory for knowledge-intensive tasks (e.g., HotpotQA, FEVER) and decision-making tasks (e.g., AlfWorld Env, WebShop).


Source: https://arxiv.org/abs/2210.03629

The results show that on both knowledge-intensive and decision-making tasks, ReAct outperforms the Act-only baseline, which removes the "Thought: ..." step.

The Reflexion framework equips the agent with dynamic memory and self-reflection capabilities to improve its reasoning skills. It uses a standard RL setup in which the reward model provides a simple binary reward and the action space follows the ReAct setup, with the task-specific action space augmented with language to enable complex reasoning steps. After each action a_t, the agent computes a heuristic h_t and, based on the result of self-reflection, may decide to reset the environment and start a new trial.

Figure 3 below shows an overview of the Reflexion framework.


Source: https://arxiv.org/abs/2303.11366

The heuristic function determines when a trajectory has become inefficient or contains hallucination and should be stopped. Inefficient planning means a trajectory that takes too long without success. Hallucination is defined as encountering a run of consecutive identical actions that lead to the same observations in the environment.

Self-reflection is created by showing the LLM two-shot examples, where each example is a pair consisting of a failed trajectory and an ideal reflection that guides future changes to the plan. Reflections are then added to the agent's working memory (at most three) to serve as context when querying the LLM.
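
A toy sketch of the Reflexion heuristic and its bounded reflection memory is given below. The thresholds, the llm helper, and the trajectory format are illustrative assumptions, not the paper's code.

# Reflexion-style heuristic and reflection memory (toy sketch, not the paper's code).

MAX_REFLECTIONS = 3    # the working memory keeps at most three reflections

def is_hallucinating(trajectory, repeat_threshold=3):
    # Hallucination: the same (action, observation) pair repeated consecutively.
    tail = trajectory[-repeat_threshold:]
    return len(tail) == repeat_threshold and len(set(tail)) == 1

def is_inefficient(trajectory, max_len=30):
    # Inefficient planning: the trial runs too long without succeeding.
    return len(trajectory) > max_len

def maybe_reflect(llm, trajectory, reflections):
    if is_hallucinating(trajectory) or is_inefficient(trajectory):
        reflection = llm(
            "The following trial failed. Explain what went wrong and how to "
            "improve the plan next time:\n" + "\n".join(map(str, trajectory))
        )
        reflections.append(reflection)
        del reflections[:-MAX_REFLECTIONS]    # keep only the most recent three
        return True                           # signal: reset the environment, start a new trial
    return False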

Figure 4 below shows experiments on AlfWorld Env and HotpotQA. In AlfWorld, hallucination is a more common failure mode than inefficient planning.


Source: https://arxiv.org/abs/2303.11366

The Chain of Hindsight (CoH) encourages the model to improve its own outputs by explicitly presenting it with a sequence of its past outputs, each annotated with feedback. The human feedback data is a collection D_h = {(x, y_i, r_i, z_i)}, i = 1, ..., n, where x is the prompt, each y_i is a model completion, r_i is the human rating of y_i, and z_i is the corresponding hindsight feedback provided by a human. Assuming the feedback tuples are ranked by reward, r_n ≥ r_{n-1} ≥ ... ≥ r_1, the process is supervised fine-tuning on data in the sequence form τ_h = (x, z_i, y_i, ..., z_j, y_j, ..., z_n, y_n), where i ≤ j ≤ n. The model is fine-tuned to predict only y_n conditioned on the sequence prefix, so that it can self-reflect on the feedback sequence and produce better outputs. At test time, the model can optionally receive multiple rounds of instructions from human annotators.

To avoid overfitting, CoH adds a regularization term that maximizes the log-likelihood of the pretraining dataset. At the same time, to avoid shortcut learning and copying (since the feedback sequences share many common words), the researchers randomly mask 0%-5% of past tokens during training.
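
Below is a hedged sketch of how a CoH training sequence might be assembled, together with the random masking of a small fraction of past tokens. The data format and tokenization are assumptions made for illustration.

# Chain of Hindsight: build one training sequence and mask a few past tokens.
# (Illustrative sketch; the field layout and tokenizer are assumptions.)
import random

def build_coh_sequence(prompt, completions):
    # completions: list of (completion, feedback) pairs already sorted by reward,
    # worst first, so the final completion y_n is the best one.
    parts = [prompt]
    for completion, feedback in completions:
        parts += [feedback, completion]
    return "\n".join(parts)            # the model is trained to predict only y_n

def mask_past_tokens(token_ids, mask_id, max_frac=0.05):
    # Randomly mask 0-5% of past tokens to discourage copying and shortcutting.
    frac = random.uniform(0.0, max_frac)
    n_mask = int(len(token_ids) * frac)
    for i in random.sample(range(len(token_ids)), n_mask):
        token_ids[i] = mask_id
    return token_ids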

The training dataset used in the experiments is a combination of WebGPT comparisons, summarization from human feedback, and the human preference dataset. Figure 5 below shows that after fine-tuning with CoH, the model can follow instructions to produce outputs that improve incrementally within a sequence.


Source: https://arxiv.org/abs/2302.02676

The idea of CoH is to present, in context, a history of progressively improved outputs and to train the model to pick up on the trend and produce better outputs. Algorithm Distillation (AD) applies the same idea to cross-episode trajectories in reinforcement-learning tasks, where the algorithm is encapsulated in a long history-conditioned policy.

Figure 6 below shows how algorithmic distillation works.


Source: https://arxiv.org/abs/2210.14215

In the Algorithm Distillation paper, the researchers hypothesize that any algorithm that generates a set of learning histories can be distilled into a neural network by performing behavioral cloning on the actions. The history data is generated by a set of source policies, each trained for a specific task.

During training, for each RL run the researchers sample a random task and train on subsequences of multi-episode history, so that the learned policy is task-agnostic.

In practice, the model has a limited context window, so episodes should be short enough that a multi-episode history can be constructed. A multi-episode context of 2 to 4 episodes is necessary to learn a near-optimal in-context RL algorithm; the emergence of in-context RL requires a sufficiently long context.

AD was compared against three baselines: ED (expert distillation, i.e., behavioral cloning on expert trajectories rather than learning histories), the source policy (used to generate the trajectories for distillation with UCB), and RL^2 (an online reinforcement-learning algorithm proposed in 2017, used as an upper bound). Although AD uses only offline reinforcement learning, its performance approaches that of RL^2 and it learns much faster than the other baselines. When conditioned on a partial training history of the source policy, AD also improves much faster than the ED baseline.

Figure 7 below shows a comparison of AD, ED, source policy, and RL^2.


03

Component 2: Memory

According to the author, this section was drafted with the help of ChatGPT. Let's look at the details.

Memory type

Memory types fall into three categories: sensory memory, short-term memory (STM) or working memory, and long-term memory (LTM).

Sensory memory: this is the earliest stage of memory, which retains impressions of sensory information (visual, auditory, etc.) after the original stimulus has ended. Sensory memory usually lasts only a few seconds. Its subcategories include iconic memory (vision), echoic memory (hearing), and haptic memory (touch).

Short-term memory (STM) or working memory: it stores the information we are currently aware of and need in order to carry out complex cognitive tasks such as learning and reasoning. Short-term memory is generally believed to last about 20-30 seconds.

Long-term memory: long-term memory can store information for long periods, ranging from a few days to decades, and its storage capacity is essentially unlimited. There are two subtypes of LTM:

  • Explicit / declarative memory: the memory of facts and events that can be consciously recalled, including episodic memory (events and experiences) and semantic memory (facts and concepts);
  • Implicit / procedural memory: this type of memory is unconscious and involves skills and routines performed automatically, such as riding a bicycle or typing on a keyboard.

Human memory classification

Referring to the classification of human memory, we can get the following mapping:

  • Sensory memory corresponds to learned embedding representations of raw inputs, including text, images, or other modalities;
  • Short-term memory corresponds to in-context learning, which is short and finite because it is constrained by the Transformer's limited context window;
  • Long-term memory corresponds to an external vector store that the agent can query and retrieve from quickly.

Maximum Inner Product Search (MIPS)

External memory can alleviate the limitation of finite attention span. A common practice is to save embedding representations of information in a vector-store database that supports fast maximum inner product search (MIPS). To optimize retrieval speed, approximate nearest neighbors (ANN) algorithms are commonly used.
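
The following sketch shows the basic idea of long-term memory as a vector store queried with (exact) maximum inner product search, using NumPy. The embed function is a placeholder for any text-embedding model, and a real system would swap the brute-force search for one of the ANN indexes described below.

# Long-term memory as a vector store with exact MIPS (NumPy sketch).
# `embed` is a placeholder for any text-embedding model.
import numpy as np

class VectorMemory:
    def __init__(self, embed):
        self.embed = embed
        self.vectors = []        # list of embedding vectors
        self.texts = []

    def add(self, text):
        self.vectors.append(np.asarray(self.embed(text), dtype=np.float32))
        self.texts.append(text)

    def search(self, query, k=3):
        if not self.vectors:
            return []
        q = np.asarray(self.embed(query), dtype=np.float32)
        scores = np.stack(self.vectors) @ q          # inner products with every memory
        top = np.argsort(-scores)[:k]                # maximum inner product first
        return [(self.texts[i], float(scores[i])) for i in top]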

Commonly used ANN algorithms for accelerating MIPS include:

Locality-Sensitive Hashing (LSH): it introduces a hashing function that maps similar inputs to the same buckets with high probability, where the number of buckets is much smaller than the number of inputs.

Approximate Nearest Neighbors Oh Yeah (ANNOY): its core data structure is a set of random projection trees, binary trees in which each non-leaf node represents a hyperplane splitting the input space in two and each leaf node stores one data point. The trees are built independently and at random, so in this sense they behave somewhat like hash functions. The idea is closely related to KD-trees, a tree data structure that partitions points in space, but it scales better.

Hierarchical Navigable Small World (HNSW): this approach is inspired by small-world networks, graph structures in which most nodes can be reached from any other node in only a few steps. HNSW builds a hierarchy of such small-world graphs: the bottom layer contains the actual data points, and the middle layers create shortcuts that speed up the search. A search starts from a random node in the top layer and navigates toward the target; whenever it can no longer get closer, it moves down one layer, until it reaches the bottom. Each move in an upper layer can cover a large distance in the data space, while each move in a lower layer refines the accuracy of the search.

FAISS (Facebook AI Similarity Search), an open-source library from the Facebook AI (now Meta AI) team: FAISS works on the assumption that in high-dimensional space the distances between nodes follow a Gaussian distribution, so the data points should form clusters. FAISS applies vector quantization by partitioning the vector space into clusters and then refining the quantization within each cluster.
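
For example, FAISS can serve as the index behind such a memory. The snippet below uses the library's flat inner-product index (exact search) as a minimal starting point; it assumes pip install faiss-cpu and 768-dimensional embeddings, both arbitrary choices for illustration.

# Minimal FAISS example: an inner-product index over embedding vectors.
# Assumes `pip install faiss-cpu`; the dimension and data are illustrative.
import faiss
import numpy as np

d = 768                                               # embedding dimension (assumption)
corpus = np.random.rand(10000, d).astype("float32")   # stand-in for real embeddings
faiss.normalize_L2(corpus)                            # normalized vectors: MIPS == cosine

index = faiss.IndexFlatIP(d)                          # exact inner-product index
index.add(corpus)

query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)                  # top-5 most similar memory entries
print(ids[0], scores[0])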

Scalable Nearest Neighbors (ScaNN): the main innovation in ScaNN is anisotropic vector quantization (AVQ), which quantizes a data point x_i to a point x̃_i such that the inner product ⟨q, x_i⟩ is as close as possible to ⟨q, x̃_i⟩, rather than simply picking the nearest quantization centroid, thereby reducing the error in the inner products between the query and the data points.


MIPS algorithm comparison.

04

Component 3: Using tools

The use of tools is a distinctive feature of human beings. We create, modify, and use external objects to explore and perceive the real world. Similarly, equipping LLM with external tools can greatly expand the capabilities of the model.


A sea otter floating in the water, cracking open a shellfish with a rock. While some other animals can use tools, the complexity does not compare with that of humans. Source: Animals using tools

MRKL (Karpas et al. 2022), short for Modular Reasoning, Knowledge and Language, is a neuro-symbolic architecture for autonomous agents. An MRKL system contains a set of "expert" modules, with a general-purpose LLM acting as a router that dispatches queries to the most suitable expert module. These modules can be neural (such as deep-learning models) or symbolic (such as a math calculator, a currency converter, or a weather API).
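
A toy version of this routing idea is sketched below. The expert modules and the routing prompt are made-up examples; a real MRKL system would extract arguments and route queries far more carefully.

# Toy MRKL-style router: an LLM decides which expert module handles the query.
# The modules and the routing prompt are illustrative, not from the MRKL paper.

def calculator(expression: str) -> str:
    # Demo only: never eval untrusted input in real code.
    return str(eval(expression, {"__builtins__": {}}, {}))

def weather(city: str) -> str:
    return f"(stub) current weather for {city}"      # would call a real weather API

EXPERTS = {"calculator": calculator, "weather": weather}

def route(query, llm):
    choice = llm(
        "Pick exactly one expert for this query and give its input.\n"
        f"Experts: {list(EXPERTS)}\n"
        f"Query: {query}\n"
        "Answer as: expert_name | input"
    )
    name, arg = [part.strip() for part in choice.split("|", 1)]
    return EXPERTS[name](arg)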

The MRKL research team ran experiments on fine-tuning an LLM to call a calculator, using arithmetic problems as the test case. Because the LLM (a 7B Jurassic1-large model) could not reliably extract the correct arguments for basic arithmetic, the experiments showed that solving math problems stated in plain language is harder than solving explicitly formulated ones. The results highlight that even when external symbolic tools work reliably, knowing when and how to use them is crucial, and that this is determined by the LLM's own capabilities.

Two other studies, TALM (Parisi et al. 2022) and Toolformer (Schick et al. 2023), both fine-tune a language model (LM) to learn to use external tool APIs. The dataset is expanded based on whether a newly added API call annotation improves the quality of the model's output.

The ChatGPT plugins and OpenAI's API function calling are good practical examples of LLMs augmented with tools. The collection of tool APIs can be provided by other developers (as with plugins) or defined by the user (as with function calls).
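
As a rough illustration of function calling, the sketch below declares one tool and lets the model decide whether to call it. It follows the shape of the mid-2023 openai Python SDK (later SDK versions use a different interface), and the weather function is just a stub.

# OpenAI function-calling sketch (mid-2023 openai SDK shape; later SDKs differ).
import json
import openai

def get_weather(city: str) -> str:
    return json.dumps({"city": city, "forecast": "sunny"})     # stub tool

functions = [{
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

messages = [{"role": "user", "content": "What's the weather in Paris?"}]
resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613", messages=messages,
    functions=functions, function_call="auto",
)
msg = resp["choices"][0]["message"]
if msg.get("function_call"):
    args = json.loads(msg["function_call"]["arguments"])
    tool_result = get_weather(**args)
    messages += [msg, {"role": "function", "name": "get_weather",
                       "content": tool_result}]
    final = openai.ChatCompletion.create(model="gpt-3.5-turbo-0613",
                                         messages=messages)
    print(final["choices"][0]["message"]["content"])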

HuggingGPT (Shen et al. 2023) is a framework that uses ChatGPT as a task planner: it selects models available on the HuggingFace platform based on their model descriptions and summarizes the responses based on the execution results.


Schematic diagram of how HuggingGPT works. Source: Shen et al. 2023

The HuggingGPT system consists of 4 phases:

(1) Task planning: the LLM acts as the brain and parses user requests into multiple tasks. Each task has four associated attributes: task type, task ID, dependencies, and arguments. The research team uses few-shot examples to guide the LLM in task parsing and planning.

The AI assistant can parse user input to several tasks: [{"task": task, "id", task_id, "dep": dependency_task_ids, "args": {"text": text, "image": URL, "audio": URL, "video": URL}}]. The "dep" field denotes the id of the previous task which generates a new resource that the current task relies on. A special tag "-task_id" refers to the generated text image, audio and video in the dependency task with id as task_id. The task MUST be selected from the following options: {{ Available Task List }}. There is a logical relationship between tasks, please note their order. If the user input can't be parsed, you need to reply empty JSON. Here are several cases for your reference: {{ Demonstrations }}. The chat history is recorded as {{ Chat History }}. From this chat history, you can find the path of the user-mentioned resources for your task planning.           

(2) Model selection: the LLM picks a model from a list of candidate models and distributes the tasks to expert models. Because the context length is limited, filtering by task type is required.

Given the user request and the call command, the AI assistant helps the user to select a suitable model from a list of models to process the user request. The AI assistant merely outputs the model id of the most appropriate model. The output must be in a strict JSON format: "id": "id", "reason": "your detail reason for the choice". We have a list of models for you to choose from {{ Candidate Models }}. Please select one model from the list.           

(3) Task execution: the expert models execute the specific tasks and log the results.

With the input and the inference results, the AI assistant needs to describe the process and results. The previous stages can be formed as - User Input: {{ User Input }}, Task Planning: {{ Tasks }}, Model Selection: {{ Model Assignment }}, Task Execution: {{ Predictions }}. You must first answer the user's request in a straightforward manner. Then describe the task process and show your analysis and model inference results to the user in the first person. If inference results contain a file path, must tell the user the complete file path.           

(4) Response generation: the LLM receives the execution results and delivers a summarized result to the user.

To put HuggingGPT into practical use, several challenges need to be addressed: (1) efficiency must improve, since both LLM inference and interaction with other models slow down the process; (2) it relies on a long context window to communicate complex task content; (3) the stability of LLM outputs and of external model services needs to be improved.
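
In outline, the four stages can be glued together as in the skeleton below. Each llm(...) call would carry one of the prompt templates shown above; the model registry and execution function are placeholders for the HuggingFace APIs that HuggingGPT actually uses, and dependency handling between tasks is omitted.

# Skeleton of the four HuggingGPT stages (placeholders, not the real system).
import json

def hugginggpt(user_request, llm, model_registry, run_model):
    # (1) Task planning: parse the request into a list of task dicts.
    tasks = json.loads(llm("Plan tasks as JSON for: " + user_request))

    results = {}
    for task in tasks:
        # (2) Model selection: filter candidates by task type, then let the LLM pick one.
        candidates = model_registry.get(task["task"], [])
        chosen = json.loads(llm(
            f"Select the best model id for task {task} from {candidates}; "
            'answer as {"id": ..., "reason": ...}'))["id"]
        # (3) Task execution: run the expert model (dependency handling omitted).
        results[task["id"]] = run_model(chosen, task["args"])

    # (4) Response generation: summarize all results for the user.
    return llm("Summarize these results for the user: " + json.dumps(results, default=str))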

API-Bank (Li et al. 2023) is a benchmark for evaluating the performance of tool-augmented LLMs. It contains 53 commonly used API tools, a complete tool-augmented LLM workflow, and 264 annotated dialogues involving 568 API calls. The APIs available in API-Bank are quite diverse, covering search engines, calculators, calendar queries, smart home control, schedule management, and more. An LLM can first find an appropriate API to call through an API search engine, and then call the API using the corresponding documentation.


The pseudocode of LLM making API calls in API-BANK. (Image source: Li et al. 2023)

In the workflow of API-Bank, LLM needs to make a number of decisions, including:

  • Whether an API call is needed;
  • Identify the correct API to call: if the result is not good enough, the LLM needs to iteratively modify the API inputs (for example, changing the search keywords for a search-engine API);
  • Respond based on the API results: if the results are not satisfactory, the model can choose to refine the call and try again.

This benchmark evaluates the agent's ability to use tools at three levels (a rough sketch of the overall decision loop follows the list):

  • The ability to call an API: given the description of an API, the model needs to determine whether to call the given API, call it correctly, and respond properly to the returned results;
  • The ability to retrieve an API: the model needs to search for APIs that might solve the user's need and learn how to use them by reading the documentation;
  • The ability to plan beyond retrieving and calling APIs: given an ambiguous user request (for example, scheduling a group meeting, or booking flights/hotels/restaurants for a trip), the model may need to make multiple API calls to solve the problem.
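
The decision loop described above can be sketched roughly as follows; llm, search_api, and call_api are stand-ins for the benchmark's API search engine and tool interface.

# Rough sketch of the API-Bank-style decision loop (stand-in helpers).

def answer_with_tools(user_request, llm, search_api, call_api, max_calls=5):
    context = f"User request: {user_request}\n"
    for _ in range(max_calls):
        # Decision 1: is an API call needed at all?
        if llm(context + "Is an API call needed? (yes/no)").strip().lower() != "yes":
            break
        # Decision 2: find the right API (keyword search over the API documentation).
        keywords = llm(context + "Give search keywords for a suitable API:")
        api_doc = search_api(keywords)
        call_spec = llm(context + f"API documentation:\n{api_doc}\n"
                        "Write the exact API call to make:")
        result = call_api(call_spec)
        context += f"Called: {call_spec}\nResult: {result}\n"
        # Decision 3: if the result is unsatisfactory, the next iteration refines the call.
    return llm(context + "Give the final response to the user.")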

05

Case studies

Agents for scientific discovery

ChemCrow is an LLM-based chemistry agent designed to accomplish tasks such as organic synthesis, drug discovery, and materials design. By integrating 17 expert-designed tools, ChemCrow improves the LLM's performance on chemistry tasks and gives rise to new capabilities.

An interesting observation about ChemCrow is that although an LLM-based evaluation concluded that GPT-4 and ChemCrow performed almost equally well, expert human evaluation showed that ChemCrow substantially outperformed GPT-4. This suggests that using an LLM to evaluate its own performance in domains that require deep expertise is potentially problematic: the lack of expertise may keep the LLM from recognizing its own flaws, so it cannot judge the correctness of task results well.

The paper by Boiko et al. examines AI agents for scientific discovery, used to autonomously design, plan, and execute complex scientific experiments. Such an agent can use tools to browse the internet, read documentation, execute code, call robotic lab APIs, and leverage other LLMs.

For example, when the agent receives the prompt "develop a novel anticancer drug", its reasoning steps are as follows:

  • Ask about current trends in anticancer drug discovery;
  • Select a target;
  • Start searching for compounds that act on this target;
  • Once a compound has been identified, attempt to synthesize it.

Generative agents

Generative agents combine an LLM with memory, planning, and reflection mechanisms, enabling the agent to act on past experience and to interact with other agents.


Generative agent architecture diagram.

Proof-of-concept example

Here the author mentions AutoGPT, an autonomous AI agent that can carry out tasks with essentially no human intervention. Andrej Karpathy has also praised it: "AutoGPT is the next frontier of prompt engineering."

Specifically, AutoGPT is like giving a GPT-based model a memory and a body. With it, you can hand a task to the AI agent, let it come up with a plan on its own, and then execute that plan. It features internet access, long- and short-term memory management, GPT-4 instances for text generation, and file storage and summarization with GPT-3.5. AutoGPT can be used for tasks that require continual updating, such as analyzing markets and proposing trading strategies, providing customer service, and running marketing campaigns.

In addition, the author lists the GPT-Engineer project, a code-generation tool that can produce an entire codebase from a prompt: as long as you state your requirements clearly, GPT-Engineer can build it.

06

Challenges

After going through the key ideas and demos for building LLM-centric agents, we should also recognize some limitations:

Finite context length: an LLM's capacity to process contextual information is limited. Although mechanisms such as self-reflection can learn from past mistakes, longer or infinite context windows would bring great benefits. Vector stores and retrieval can provide access to a larger knowledge pool, but their representational power is not as strong as full attention.

Challenges in long-term planning and task decomposition: LLMs struggle to adjust their plans and recover when faced with unexpected errors, and their robustness is still poor compared with humans, who can learn from continual trial and error.

Reliability of the natural-language interface: current agent systems rely on natural language as the interface between the LLM and external components such as memory and tools. However, the reliability of model outputs is questionable, because LLMs can make formatting errors and occasionally exhibit rebellious behavior (for example, refusing to follow an instruction).

Source: Heart of the Machine
