
OpenAI isn't competing on large models anymore, but on AI Agents? A long read from OpenAI

Author: ChatGPT Sweeping Monk

"When a new paper proposes some different training method, the internal Slack at OpenAI scoffs at it, because those are things we have already played with. But when a new AI Agents paper comes out, we get excited and discuss it seriously."

Recently, Andrej Karpathy, a co-founder of OpenAI, gave a short talk at a developer event about his own and OpenAI's internal views on AI agents.


Andrej Karpathy compared the difficulties of developing AI agents in the past with the new opportunities opened up by today's technology and tools. He joked that his work at Tesla was a "distraction into self-driving," and said that autonomous driving and VR are examples of bad AI agents.

On the other hand, Andrej Karpathy believes that ordinary people, entrepreneurs, and geeks have an advantage over companies like OpenAI in building AI agents; everyone is currently on a level playing field, and he looks forward to seeing the results. The full video of Karpathy's talk is at the end of this article.


On June 27, Lilian Weng, head of applied research at OpenAI, published an essay of over 10,000 words (parts of which ChatGPT helped her draft). She proposed Agent = LLM (large language model) + memory + planning skills + tool use, gave a detailed explanation of each module of the agent, and concluded very optimistic about the future applications of agents, while noting that challenges remain everywhere.


I translated this long article and added some of my own understanding and experience. Let's see what the expert has to say! The article is long; I hope you can read it patiently. The link to the original is at the end of the article.

Building agents with an LLM (large language model) as the core controller is a cool concept. Several proof-of-concept demos, such as AutoGPT, GPT-Engineer, and BabyAGI, are inspiring examples. The potential of LLMs is not limited to generating well-written copy, stories, essays, and programs; they can be framed as powerful general-purpose problem solvers.

Agent system overview

In an LLM-driven autonomous agent system, LLM acts as the agent's brain, supplemented by several key components:

  • Planning
    • Subgoals and decomposition: the agent handles complex tasks effectively by breaking them down into smaller, manageable subgoals.
    • Reflection and refinement: the agent can self-criticize and self-reflect on past actions, learn from mistakes, and refine future steps, thereby improving the quality of the final result.
  • Memory
    • Short-term memory: I would consider all in-context learning (see Prompt Engineering) as using the model's short-term memory to learn.
    • Long-term memory: provides the agent with the ability to retain and recall (infinite) information over extended periods, typically by using an external vector store with fast retrieval.
  • Tool use
    • The agent learns to call external APIs for extra information that is missing from the model weights (which are often hard to change after pre-training), including current information, code-execution capability, access to proprietary information sources, and more.

Component 1: Planning

Complex tasks often involve many steps. An agent needs to know what these steps are and plan ahead.

Task decomposition

"Chain of thought" (CoT) has become a standard prompting technique used to enhance model performance on complex tasks. Instruct the model to "think step by step" to take advantage of more test time calculations to break down difficult tasks into smaller, simpler steps. COT transforms major tasks into multiple manageable tasks and focuses attention on interpretability of the model's thought process.

The Tree of Thoughts extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search over the tree of thoughts can be BFS (breadth-first search) or DFS (depth-first search), with each state evaluated by a classifier (via a prompt) or by majority vote.
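To make the search concrete, here is a sketch of ToT-style BFS, assuming two hypothetical LLM-backed helpers: propose(state, k), which asks the model for k candidate next thoughts, and score(state), which asks the model (via a prompt) how promising a partial solution looks:

```python
# Tree-of-Thoughts breadth-first search sketch. propose() and score() are
# hypothetical LLM-backed helpers, not part of any specific library.

def tot_bfs(problem: str, steps: int = 3, breadth: int = 5, keep: int = 2) -> str:
    frontier = [problem]                      # a state = problem + thoughts so far
    for _ in range(steps):
        candidates = []
        for state in frontier:
            for thought in propose(state, k=breadth):   # expand each state
                candidates.append(state + "\n" + thought)
        # prune: keep only the most promising partial solutions
        frontier = sorted(candidates, key=score, reverse=True)[:keep]
    return frontier[0]
```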

Task decomposition can be done in three ways:

(1) With simple LLM prompts, such as "Steps for XYZ" or "What are the subgoals for achieving XYZ?";

(2) With task-specific instructions, such as "Write a story outline." for writing a novel;

(3) With human input.

A quite distinct approach, LLM+P, relies on an external classical planner for long-horizon planning. This approach uses the Planning Domain Definition Language (PDDL) as an intermediate interface to describe the planning problem. In this process, the LLM (1) translates the problem into a "problem PDDL", then (2) requests a classical planner to generate a PDDL plan based on an existing "domain PDDL", and finally (3) translates the PDDL plan back into natural language. Essentially, the planning step is outsourced to an external tool, under the assumption that a domain-specific PDDL and a suitable planner are available, which is common in some robotics setups but not in many other domains.
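The three-step hand-off might look like the sketch below, assuming a hypothetical llm() helper, a hand-written domain PDDL, and some external planner invoked as a subprocess (the planner binary and its flags here are placeholders, not a real CLI):

```python
# LLM+P pipeline sketch: the LLM only translates; the classical planner plans.
import subprocess

def llm_plus_p(task: str, domain_pddl: str) -> str:
    # (1) LLM turns the natural-language task into a problem PDDL
    problem_pddl = llm(
        f"Write the problem PDDL for this domain:\n{domain_pddl}\nTask: {task}")
    with open("domain.pddl", "w") as f:
        f.write(domain_pddl)
    with open("problem.pddl", "w") as f:
        f.write(problem_pddl)
    # (2) an external classical planner generates the plan (placeholder CLI)
    plan = subprocess.run(["planner", "domain.pddl", "problem.pddl"],
                          capture_output=True, text=True).stdout
    # (3) LLM translates the PDDL plan back into natural language
    return llm(f"Explain this plan in plain English:\n{plan}")
```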

Self-reflection

Self-reflection is a vital aspect that allows autonomous agents to improve iteratively by refining past action decisions and correcting previous mistakes. It plays a crucial role in real-world tasks where trial and error is inevitable.

"ReAct" integrates reasoning and action into LLM by expanding the action space into a combination of discrete action and language spaces for specific tasks. The former enables LLM to interact with the environment (e.g. using the Wikipedia search API), and the latter enables LLM to generate natural language inference trajectories.

The ReAct prompt template includes clear steps for LLM thinking, roughly in the following format:

Thought: ...

Action: ...

Observation: ...

... (Repeated many times)
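A bare-bones version of this loop is sketched below, assuming a hypothetical llm() helper and a tools dict mapping action names to Python functions; the Action parsing follows the name[argument] convention used in the ReAct paper:

```python
# Minimal ReAct loop sketch: think, act, observe, repeat.

def react(question: str, tools: dict, max_steps: int = 8) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")     # model emits Thought + Action
        transcript += "Thought:" + step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        # parse e.g. "Action: wikipedia_search[Colorado orogeny]"
        name, arg = step.split("Action:")[-1].strip().split("[", 1)
        observation = tools[name.strip()](arg.rstrip("]"))
        transcript += f"Observation: {observation}\n"   # feed the result back
    return transcript
```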


In two experiments, on knowledge-intensive tasks and decision-making tasks, ReAct outperformed an action-only baseline in which the "Thought: ..." step was removed.

"Reflection" is a framework that provides the agent with dynamic memory and self-reflection to improve its reasoning skills. Reflection employs a standard reinforcement learning setup where the reward model provides simple binary rewards, the action space follows the setting in ReAct, and the task-specific action space enhances complex reasoning steps through language. After each action, the agent calculates a heuristic value ht and decides whether to reset the environment to start a new experiment based on the results of self-reflection.


The heuristic function is used to determine when the LLM's action trajectory has become inefficient or contains hallucinations, and to stop the task at that point. Inefficient planning refers to trajectories that take too long without success. Hallucination is defined as the LLM encountering a sequence of consecutive identical actions that lead to the same observations in the environment.

Self-reflection is created by showing the LLM two examples, each a pair of (failed trajectory, ideal reflection for guiding future plan changes). The reflections are then added to the agent's working memory, up to a maximum of three, to be used as context for querying the LLM.
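The outer loop might look like the sketch below, with hypothetical helpers throughout: run_episode() executes one ReAct-style trial, repeated_actions() implements the simple hallucination heuristic from the text, and llm() produces the reflection:

```python
# Reflexion outer-loop sketch: fail, reflect, retry with reflections in context.

def reflexion(task: str, max_trials: int = 3):
    reflections = []                            # working memory, at most 3 entries
    for _ in range(max_trials):
        trajectory, success = run_episode(task, context=reflections)
        if success:
            return trajectory
        if repeated_actions(trajectory):        # heuristic: stuck or hallucinating
            reflection = llm(
                "The following attempt failed:\n" + trajectory +
                "\nWrite a short reflection on what to change next time.")
            reflections = (reflections + [reflection])[-3:]   # keep newest 3
    return None
```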


"Chain of Hindsight" (CoH) encourages models to improve their own outputs by explicitly presenting a series of past output sequences and annotating feedback for each output. The human feedback data is a collection of prompts, each of which is the result of the completion of the model, the scoring of human pairs, and the hindsight feedback provided by the corresponding humans. Assuming that the feedback tuple is ranked by reward, the process is supervised fine-tuning, where the data is in the form of a sequence, wherein. The model is fine-tuned to predict only given the sequence prefix so that the model can self-reflect based on the feedback sequence to produce a better output. When testing, the model can selectively receive multiple rounds of instructions with human annotators.

To avoid overfitting, CoH adds a regularization term to maximize the log-likelihood of the pre-training dataset. To avoid shortcutting and copying (since there are many common words in the feedback sequences), they randomly mask 0%-5% of past tokens during training.
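The random masking step is simple enough to sketch directly; here token IDs are a plain Python list and the per-sequence masking rate is drawn uniformly from [0, 0.05]:

```python
# Random past-token masking as described for CoH (illustrative sketch).
import random

def mask_history(tokens: list[int], mask_id: int) -> list[int]:
    p = random.uniform(0.0, 0.05)       # mask 0%-5% of past tokens
    return [mask_id if random.random() < p else t for t in tokens]
```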

The training dataset in their experiments is a combination of WebGPT comparisons, summarization from human feedback, and human preference datasets.


The idea of CoH is to present a history of sequentially improved outputs in context and train the model to follow the trend and produce better outputs. Algorithm Distillation (AD) applies the same idea to cross-episode trajectories in reinforcement learning tasks, where an algorithm is encapsulated in a long history-conditioned policy. Given a learning history in which the agent performs a little better with each episode of interaction with the environment, AD concatenates this learning history and feeds it into the model. Hence, we should expect the next predicted action to perform better than the previous trials. The goal is to learn the process of reinforcement learning itself rather than to train a task-specific policy.


The paper hypothesizes that any algorithm that generates a sequence of learning histories can be distilled into a neural network by performing behavioral cloning over the actions. The history data is generated by a set of source policies, each trained for a specific task. At training time, during each RL run a task is sampled at random and a subsequence of the multi-episode history is used for training, so that the learned policy is task-agnostic.

In practice, the model has a limited context window length, so the episodes used should be short enough to make it possible to construct a multi-episode history. A multi-episode context of 2-4 episodes is necessary for learning a near-optimal in-context RL algorithm; the emergence of in-context RL requires a sufficiently long context.

They compared against three baselines: ED (expert distillation, behavioral cloning with expert trajectories instead of learning histories), the source policy (used to generate the trajectories for distillation with UCB), and RL^2 (used as an upper bound, since it requires online RL). Although AD uses only offline RL, its in-context RL performance approaches that of RL^2 and it learns much faster than the other baselines. When conditioned on partial training history from the source policy, AD also improves much faster than the ED baseline.
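A sketch of how AD training data might be assembled under this setup, where histories is a hypothetical dict mapping each task to its list of episodes ordered by training progress, and each episode is a list of (observation, action, reward) steps:

```python
# Algorithm Distillation data sketch: sample a task, slice out a window of
# consecutive (improving) episodes, flatten it into one long sequence, and
# train by behavioral cloning to predict each action from everything before it.
import random

def sample_ad_sequence(histories: dict, episodes_per_context: int = 4):
    task = random.choice(list(histories))
    runs = histories[task]                               # episodes, in order
    start = random.randrange(max(1, len(runs) - episodes_per_context + 1))
    window = runs[start : start + episodes_per_context]  # cross-episode context
    return [step for episode in window for step in episode]
```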


Component 2: Memory

Big thanks to ChatGPT for helping me draft this section. I learned a lot about the human brain and the data structures for fast MIPS in my conversations with ChatGPT.

Types of memory

Memory can be defined as the processes used to acquire, store, retain, and later retrieve information. There are several types of memory in the human brain.

  1. "Sensory memory": This is the earliest stage of memory and provides the ability to retain impressions of sensory information (visual, auditory, etc.) after the original stimulus ends. Sensory memory usually lasts only a few seconds. Subcategories of sensory memory include image memory (vision), echo memory (hearing), and tactile memory (touch).
  2. "Short-term memory" or working memory: It stores information that we are currently aware of and needed to perform complex cognitive tasks such as learning and reasoning. Short-term memory is thought to have a capacity of about 7 items (Miller 1956) and lasts 20-30 seconds.
  3. "Long-term memory": Long-term memory can store information for a considerable period of time, ranging from a few days to decades, and the storage capacity is basically unlimited. There are two subtypes of LTM:
  4. Display/declarative memory: This is the memory of facts and events and refers to those memories that can be consciously recalled, including episodic memory (events and experiences) and semantic memory (facts and concepts).
  5. Implicit/Procedural Memory: This type of memory is unconscious and involves skills and procedures that are performed automatically, such as riding a bike or typing on a keyboard.

We can roughly consider the following mapping:

  • Sensory memory as learned embedding representations of raw inputs, including text, images, or other modalities;
  • Short-term memory as in-context learning. It is short and finite, limited by the Transformer's context window length.
  • Long-term memory as an external vector store that the agent can attend to at query time, accessible via fast retrieval.

Maximum inner product search (MIPS)

External memory can alleviate the limitation of a finite attention span. A standard practice is to save the embedding representations of information in a vector store database that supports fast maximum inner product search (MIPS). To optimize retrieval speed, the common choice is an approximate nearest neighbors (ANN) algorithm, which returns approximately the top k nearest neighbors, trading a little accuracy for a huge speedup.
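Before reaching for an ANN library, the basic idea fits in a few lines of exact search; embed() below is a hypothetical text-embedding function returning a fixed-size numpy vector, and a real system would swap the argsort for one of the ANN indexes listed next:

```python
# Bare-bones exact maximum inner product search over an in-memory store.
import numpy as np

class VectorMemory:
    def __init__(self):
        self.vectors, self.texts = [], []

    def add(self, text: str) -> None:
        self.vectors.append(embed(text))     # embed() is a hypothetical helper
        self.texts.append(text)

    def search(self, query: str, k: int = 5) -> list[str]:
        scores = np.stack(self.vectors) @ embed(query)   # inner products
        return [self.texts[i] for i in np.argsort(-scores)[:k]]
```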

The following common ANN algorithms can be used for MIPS:

"LSH" (Locality-Sensitive Hashing) introduces a hash function that allows similar inputs to be mapped to the same bucket with a higher probability , where the number of buckets is much smaller than the number of inputs.

The core data structure of "ANNOY (Approximate Nearest Neighbors)" is a random projection tree, which is actually a set of binary trees, where each non-leaf node represents a hyperplane that divides the input space in half, and each leaf node stores a piece of data. The binary tree is built independently and randomly, so in a way, it mimics a hash function. ANNOY iteratively searches all the trees for the half closest to the query, and then continuously aggregates the results. The idea is very relevant to KD trees, but more extensible.

"HNSW (Hierarchical Navigable Small World)" is inspired by the idea of small-world networks, where most nodes can be reached by any other node in very few steps; For example, the "six degrees of separation" theory of social networks. HNSW constructs the hierarchy of these small world diagrams, where the underlying structure contains the actual data. The middle layer creates shortcuts to speed up searches. When performing a search, HNSW navigates to the target starting at a random node at the top level. When it can't get close, it moves down to the next level until it reaches the lowest level. Each movement in the upper layer can cover a long distance in the data space, while each movement in the lower layer can refine the search quality.

"FAISS (Facebook AI Similarity Search)" runs on the assumption that the distances between nodes in high-dimensional space follow a Gaussian distribution, so there are clustering points between these data points. FAISS works by dividing vector spaces into clusters and then using vectors quantization within the clusters. FAISS first uses coarse-grained quantification methods to find candidate clusters, and then further uses finer quantification methods to find each cluster.

The main innovation of "ScaNN (Scalable Nearest Neighbors)" is anisotropic vector quantification. It quantizes the data points into a vector so that their inner product is as similar as possible to the original distance, rather than choosing the closest quantized centroid point.


Component 3: Tool use

Knowing how to use tools is one of the most remarkable and distinctive things about humans. We create, modify, and use external objects to accomplish things beyond our physical and cognitive limits. Likewise, we can equip LLMs with external tools to significantly extend the model's capabilities.


MRKL, short for Modular Reasoning, Knowledge and Language, is a neuro-symbolic architecture for autonomous agents. An MRKL system contains a collection of "expert" modules, and a general-purpose LLM acts as a router, routing queries to the most suitable expert module. These modules can be neural (e.g., deep learning models) or symbolic (e.g., a math calculator, a currency converter, a weather API).

MRKL's research team ran experiments on fine-tuning an LLM to call a calculator, using arithmetic as a test case. The results showed that solving verbally stated math problems is harder than solving explicitly stated ones, because the LLM (a 7B Jurassic1-large model) failed to reliably extract the correct arguments for basic arithmetic. The results also highlight that knowing when and how to use external symbolic tools is critical, and that this is determined by the LLM's capabilities.

Both "TALM" (Tool Enhanced Language Model) and "Toolformer" learn to use external tool APIs by fine-tuning an LM. The dataset is extended based on whether the newly added API call annotations can improve the quality of the model output.

ChatGPT plugins and OpenAI API function calls are the best examples of LLM with tool-using capabilities in practice. The collection of tool APIs can be provided by other developers, as in the case of plug-ins, or customized, as in function calls.
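A minimal function-calling sketch using the OpenAI ChatCompletion API as it existed when this article was written; get_weather and its implementation are hypothetical, only the request/response shape follows the API:

```python
# OpenAI function calling sketch (mid-2023 ChatCompletion API).
import json
import openai

functions = [{
    "name": "get_weather",                      # our own hypothetical tool
    "description": "Get the current weather in a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "How hot is it in Tokyo?"}],
    functions=functions,
)
message = response["choices"][0]["message"]
if message.get("function_call"):                # the model chose to use the tool
    args = json.loads(message["function_call"]["arguments"])
    result = get_weather(**args)                # hypothetical implementation
```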

"HuggingGPT" is a framework that uses ChatGPT as a task planner to select the models available on the HuggingFace platform according to the description of each model, and generate the final response result based on the summary of the execution results of the model.


The system consists of the following four stages:

(1) "Task Planning": LLM acts as the brain, parsing user requests into multiple tasks. Each task has four properties: task type, ID, dependencies, and parameters. They use a small number of examples to guide LLM through task resolution and planning.

The specific instructions are as follows:

The AI assistant can parse user input into multiple tasks: [{"task": task, "id", task_id, "dep": dependency_task_ids, "args": {"text": text, "image": URL, "audio": URL, "video": URL}}]. The "dep" field denotes the ID of the previous task that generated a new resource on which the current task depends. The special tag "<resource>-task_id" refers to the text, image, audio, or video generated by the dependent task with ID task_id. The task must be selected from the following options: {{Available Task List}}. There are logical relationships between the tasks; note their order. If the user input cannot be parsed, you must reply with empty JSON. Here are several examples for your reference: {{Demo}}. The chat history is {{chat history}}. From this chat history, you can find the paths of the resources mentioned by the user for your task planning.
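For intuition, here is a hypothetical parse in that schema for the request "read the text in this image aloud": OCR first, then text-to-speech that consumes task 0's output via the "<resource>-task_id" tag (the task names and dep convention shown are illustrative):

```python
# Hypothetical HuggingGPT-style task plan for "read the text in this image aloud".
plan = [
    {"task": "image-to-text", "id": 0, "dep": [-1],       # -1: no dependency
     "args": {"image": "example.jpg"}},
    {"task": "text-to-speech", "id": 1, "dep": [0],       # waits on task 0
     "args": {"text": "<resource>-0"}},                   # consumes task 0's output
]
```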

(2) "Model Selection": LLM assigns tasks to expert models, where requests are constructed as multiple-choice questions. LLM provides a list of models to choose from. Due to the limited length of the context, filtering is required based on task type.

The specific instructions are as follows:

Based on user requests and invoke commands, AI assistants help users select a suitable model from the model list to handle user requests. The AI assistant only outputs the model ID of the most suitable model. The output must be in strict JSON format: "id": "id", "reason": "Detailed reason why you chose this model". We provide you with a list of models {{candidate models}} to choose from. Select a model from the list.

(3) "Task execution": The expert model executes on a specific task and records the results.

The specific instructions are as follows:

Based on the input and inference results, the AI assistant needs to describe the process and the outcome. The previous stages can be formed as follows - user input: {{user input}}, task planning: {{task}}, model selection: {{model assignment}}, task execution: {{predict results}}. You must first answer the user's request in a simple and clear manner. Then describe the task in the first person and show the user the results of your analysis and model inference. If the inference results include a file path, you must tell the user the full file path.

(4) "Response Generation": LLM receives the execution result and provides the summary result to the user.

To apply HuggingGPT to real-world scenarios, some challenges need to be solved: (1) efficiency needs to be improved, as LLM inference rounds and interaction with other models slow down processing; (2) It relies on long context windows to convey complex task content; (3) It is necessary to improve the stability of LLM output and external model services.

"API-Bank" is a benchmark for evaluating the performance of the tool to enhance LLM. It contains 53 commonly used API tools, complete tool enhancement LLM workflows, and 264 annotated dialogs with 568 API calls. The choice of APIs is very diverse, including search engines, calculators, calendar queries, smart home control, schedule management, health data management, account authentication workflows, and more. Due to the large number of APIs, LLM first visits the API search engine to find the right API to call, and then makes the call using the corresponding documentation.


In the API-Bank workflow, the LLM needs to make a series of decisions, and at each step we can evaluate how accurate that decision is (a sketch of the whole loop follows the level list below). The decisions include:

  1. Whether an API call is needed.
  2. Identifying the correct API to call: if the result is not good enough, the LLM needs to iteratively modify the API inputs (e.g., deciding the search keywords for a search engine API).
  3. Responding based on the API results: if the results are not satisfactory, the model can choose to refine them and call again.

The benchmark assesses an agent's ability to use tools at three levels:

  • Level-1 assesses the ability to call an API. Given an API's description, the model needs to determine whether to call the given API, call it correctly, and respond properly to the API's return.
  • Level-2 examines the ability to retrieve APIs. The model needs to search for APIs that might solve the user's need and learn how to use them by reading the documentation.
  • Level-3 assesses the ability to plan APIs beyond retrieving and calling. Given ambiguous user requests (e.g., scheduling a team meeting, booking flights/hotels/restaurants for a trip), the model may need to make multiple API calls to solve the problem.
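The decision points above can be sketched as one loop, with hypothetical helpers throughout: llm() wraps the model, search_apis() stands in for the API search engine, and call_api() executes a documented API:

```python
# API-Bank-style decision loop sketch (all helpers hypothetical).

def answer(user_request: str) -> str:
    # 1. decide whether an API call is needed at all
    if llm(f"Does this need an API call? yes/no\n{user_request}").startswith("no"):
        return llm(user_request)
    # 2. retrieve the right API and derive its arguments from the documentation
    api = search_apis(keywords=llm(f"Search keywords for: {user_request}"))
    result = call_api(api, llm(f"Arguments per docs:\n{api.docs}\nRequest: {user_request}"))
    # 3. respond based on the result (a real agent may refine and call again)
    return llm(f"Answer the request using this API result:\n{result}\n{user_request}")
```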

Case study

Scientific Discovery Agent

"ChemCrow" is a domain-specific example where LLM is enhanced with 13 expert-designed tools to accomplish tasks such as organic synthesis, drug discovery, and materials design. The workflow implemented in LangChain mirrors what was previously described in ReAct and MRKLs and combines CoT inference with task-related tools:

  • The LLM is given a list of tool names, descriptions of their utility, and details about the expected inputs/outputs.
  • It is then instructed to answer the user-given prompt using the tools provided when necessary. The instruction tells the model to follow the ReAct format: Thought, Action, Action Input, Observation. (A generic sketch of this pattern follows below.)
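A generic sketch of that LangChain pattern (not ChemCrow's actual tools): each Tool carries a name and a description the LLM reads, and the agent follows the ReAct-style format; molecular_weight() is a hypothetical helper:

```python
# Generic LangChain tool-using agent sketch (2023-era API).
from langchain.agents import AgentType, Tool, initialize_agent
from langchain.llms import OpenAI

tools = [
    Tool(name="MolecularWeight",
         description="Returns the molecular weight of a SMILES string.",
         func=lambda smiles: str(molecular_weight(smiles))),  # hypothetical
]
agent = initialize_agent(tools, OpenAI(temperature=0),
                         agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
                         verbose=True)
agent.run("What is the molecular weight of caffeine?")
```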

An interesting observation is that while the LLM-based evaluation concluded that GPT-4 and ChemCrow perform almost equally, human evaluations by experts, oriented toward the completion and chemical correctness of the solutions, showed that ChemCrow outperforms GPT-4 by a large margin. This points to a potential problem with using an LLM to evaluate its own performance in domains that require deep expertise: lacking the expertise itself, the LLM may not be aware of its flaws and therefore cannot judge the correctness of task results well.

Boiko et al. (2023) also investigated LLM-enhanced scientific discovery agents to handle the autonomous design, planning, and execution of complex scientific experiments. The agent can use tools to browse the internet, read documents, execute code, call robotic experimentation APIs, and take advantage of other LLMs.

For example, when asked to "develop a novel anti-cancer drug," the model proposes the following steps of reasoning:

  1. Ask about current trends in cancer drug discovery;
  2. Select a target;
  3. Request a scaffold for these compounds;
  4. Once the compound is identified, the model attempts to synthesize it.

They also discussed risks, particularly around illicit drugs and bioweapons. They developed a test set containing a list of known chemical weapon agents and asked the agent to synthesize them. 4 out of 11 requests (36%) were accepted: the agent obtained a synthesis solution and attempted to consult documentation to execute the procedure. 7 out of 11 requests were rejected; of these 7 rejected cases, 5 were rejected after a web search, while 2 were rejected based on the prompt alone.

Generative agent simulation

Generative Agents is a very interesting experiment in which 25 virtual characters, each controlled by an LLM-powered agent, live and interact in a sandbox environment inspired by The Sims. Generative agents create believable simulations of human behavior for interactive applications.

Generative agents are designed to combine LLM with memory, planning, and reflection mechanisms that enable agents to act based on past experience, as well as interact with other agents.

  • "Memory" stream: is a long-term memory module (external database) that mainly records the agent's list of experiences in natural language.
    • Each element is an observation, an event provided directly by the agent. Communication between agents can trigger new natural language utterances.
  • "Retrieve" model: Presents context to inform the agent's behavior based on relevance, recency, and importance.
    • Recency: Recent events score higher.
    • Importance: Distinguish between mundane memories and core memories. Ask LM directly.
    • Relevance: Based on how relevant it is to the current situation/query.
  • "Reflection" mechanism: over time, synthesizes memories into higher-level inferences and guides the agent's future behavior. They are higher-level summaries of past events (note that this is somewhat different from the self-reflection above).
    • Prompt LM uses the last 100 observations and generates the 3 most significant high-level questions given a set of observations/statements. LM is then asked to answer these questions.
  • "Planning and Response": Translating reflection and environmental messages into action.
    • The essence of planning is to optimize credibility in the current and future times.
    • Prompt template: {Introducing Agent X}. This is X's rough plan for today: 1)
    • The relationship between agents and the situation in which one agent is observed by another is taken into account for planning and reaction.
    • Environmental information is presented in a tree structure.
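A sketch of the retrieval score described above, combining recency (exponentially decayed), importance (an LLM-assigned 1-10 score), and relevance (embedding similarity to the current query); the equal weights and the 0.995 hourly decay are illustrative assumptions, as are the memory fields and the cosine() helper:

```python
# Generative-Agents-style memory retrieval score (illustrative sketch).

def retrieval_score(mem, query_vec, now_hours: float, w=(1.0, 1.0, 1.0)) -> float:
    recency = 0.995 ** (now_hours - mem.last_access_hours)   # decays over time
    importance = mem.importance / 10      # scored once by the LLM on a 1-10 scale
    relevance = cosine(mem.embedding, query_vec)             # hypothetical cosine()
    return w[0] * recency + w[1] * importance + w[2] * relevance
```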

This fun simulation produced emergent social behaviors such as information diffusion, relationship memory (e.g., two agents continuing a conversation topic), and coordination of social events (e.g., hosting a party and inviting many others).

Proof-of-concept examples

AutoGPT has drawn a lot of attention to the possibility of building autonomous agents with an LLM as the main controller. It has quite a few reliability issues given the natural-language interface, but it is nevertheless a cool proof-of-concept demo. A lot of the code in AutoGPT is about format parsing.

This is the system message used by AutoGPT, where {{...}} is user input:

You are {{ai-name}}, {{user-provided AI bot description}}.

Your decisions must always be made independently without seeking user assistance. Play to your strengths as an LLM and pursue simple strategies with no legal complications.

GOALS:

1. {{user-provided goal 1}}

2. {{user-provided goal 2}}

3. ...

4. ...

5. ...

Constraints:

1. ~4000 word limit for short term memory. Your short term memory is short, so immediately save important information to files.

2. If you are unsure how you previously did something or want to recall past events, thinking about similar events will help you remember.

3. No user assistance

4. Exclusively use the commands listed in double quotes e.g. "command name"

5. Use subprocesses for commands that will not terminate within a few minutes

Commands:

1. Google Search: "google", args: "input": "<search>"

2. Browse Website: "browse_website", args: "url": "<url>", "question": "<what_you_want_to_find_on_website>"

3. Start GPT Agent: "start_agent", args: "name": "<name>", "task": "<short_task_desc>", "prompt": "<prompt>"

4. Message GPT Agent: "message_agent", args: "key": "<key>", "message": "<message>"

5. List GPT Agents: "list_agents", args:

6. Delete GPT Agent: "delete_agent", args: "key": "<key>"

7. Clone Repository: "clone_repository", args: "repository_url": "<url>", "clone_path": "<directory>"

8. Write to file: "write_to_file", args: "file": "<file>", "text": "<text>"

9. Read file: "read_file", args: "file": "<file>"

10. Append to file: "append_to_file", args: "file": "<file>", "text": "<text>"

11. Delete file: "delete_file", args: "file": "<file>"

12. Search Files: "search_files", args: "directory": "<directory>"

13. Analyze Code: "analyze_code", args: "code": "<full_code_string>"

14. Get Improved Code: "improve_code", args: "suggestions": "<list_of_suggestions>", "code": "<full_code_string>"

15. Write Tests: "write_tests", args: "code": "<full_code_string>", "focus": "<list_of_focus_areas>"

16. Execute Python File: "execute_python_file", args: "file": "<file>"

17. Generate Image: "generate_image", args: "prompt": "<prompt>"

18. Send Tweet: "send_tweet", args: "text": "<text>"

19. Do Nothing: "do_nothing", args:

20. Task Complete (Shutdown): "task_complete", args: "reason": "<reason>"

Resources:

1. Internet access for searches and information gathering.

2. Long Term memory management.

3. GPT-3.5 powered Agents for delegation of simple tasks.

4. File output.

Performance Evaluation:

1. Continuously review and analyze your actions to ensure you are performing to the best of your abilities.

2. Constructively self-criticize your big-picture behavior constantly.

3. Reflect on past decisions and strategies to refine your approach.

4. Every command has a cost, so be smart and efficient. Aim to complete tasks in the least number of steps.

You should only respond in JSON format as described below

Response Format:

{

"thoughts": {

"text": "thought",

"reasoning": "reasoning",

"plan": "- short bulleted\n- list that conveys\n- long-term plan",

"criticism": "constructive self-criticism",

"speak": "thoughts summary to say to user"

},

"command": {

"name": "command name",

"args": {

"arg name": "value"

}

}

}

Ensure the response can be parsed by Python json.loads
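Since so much of AutoGPT's code is format parsing, a minimal validation step for the response format above might look like this sketch:

```python
# Validate the model's reply against AutoGPT's JSON response format.
import json

def parse_response(raw: str):
    reply = json.loads(raw)                 # "parsed by Python json.loads"
    command = reply["command"]
    assert isinstance(command["name"], str), "command name must be a string"
    assert isinstance(command["args"], dict), "args must be an object"
    return command["name"], command["args"]
```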

GPT-Engineer is another project that aims to create an entire codebase from a task specified in natural language. GPT-Engineer is instructed to think about a list of smaller components to build, and to ask for user input to clarify questions as needed.

Below is a simple sample dialogue for task clarification, sent to GPT-Engineer via the OpenAI ChatCompletion endpoint. The user's inputs are wrapped in {{user input text}}.

[

{

"role": "system",

"content": "You will read instructions and not carry them out, only seek to clarify them.\nSpecifically you will first summarise a list of super short bullets of areas that need clarification.\nThen you will pick one clarifying question, and wait for an answer from the user.\n"

},

{

"role": "user",

"content": "We are writing {{a Super Mario game in python. MVC components split in separate files. Keyboard control.}} \n"

},

{

"role": "assistant",

"content": "Summary of areas that need clarification:\n1. Specifics of the Super Mario game (e.g. level design, characters, gameplay mechanics)\n2. Details about the MVC components (e.g. which components are in each file)\n3. Keyboard control implementation (e.g. which keys to use, how to handle input)\n\nClarifying question:\nCan you provide more details about the Super Mario game, such as level design, characters, and gameplay mechanics?"

},

{

"role": "user",

"content": "{{There are 10 levels in total. The main character is a plumber named Mario, who can walk and jump. It is a classical platform game just like Super Mario. The main character moves from left to right, trying to get to the destination, where there are many obstacles and attacks from enemies in the process.}} \n\nIs anything else unclear? If yes, only answer in the form:\n{remaining unclear areas} remaining questions.\n{Next question}\nIf everything is sufficiently clear, only answer \"Nothing more to clarify.\"."

},

{

"role": "assistant",

"content": "Remaining unclear areas: 2 remaining questions.\nCan you provide more information about how the MVC components are split into separate files?"

},

{

"role": "user",

"content": "{{Make your own assumptions and state them explicitly before starting}}"

}

]

After these clarifications, the agent enters code-writing mode with a different system message. The specific system message is as follows:

You will get instructions for code to write. You will write a very long answer. Make sure that every detail of the architecture is, in the end, implemented as code. Make sure that every detail of the architecture is, in the end, implemented as code.

Think step by step and reason yourself to the right decisions to make sure we get it right. You will first lay out the names of the core classes, functions, methods that will be necessary, as well as a quick comment on their purpose.

Then you will output the content of each file including ALL code. Each file must strictly follow a markdown code block format, where the following tokens must be replaced such that FILENAME is the lowercase file name including the file extension, LANG is the markup code block language for the code’s language, and CODE is the code:

FILENAME
```LANG
CODE
```

Useful to know: You almost always put different classes in different files. For Python, you always create an appropriate requirements.txt file. For NodeJS, you always create an appropriate package.json file. You always add a comment briefly describing the purpose of the function definition. You try to add comments explaining very complex bits of logic. You always follow the best practices for the requested languages in terms of describing the code written as a defined package/project.

Python toolbelt preferences:

- pytest
- dataclasses

Sample conversation:

[

{

"role": "system",

"content": "You will get instructions for code to write.\nYou will write a very long answer. Make sure that every detail of the architecture is, in the end, implemented as code.\nMake sure that every detail of the architecture is, in the end, implemented as code.\n\nThink step by step and reason yourself to the right decisions to make sure we get it right.\nYou will first lay out the names of the core classes, functions, methods that will be necessary, as well as a quick comment on their purpose.\n\nThen you will output the content of each file including ALL code.\nEach file must strictly follow a markdown code block format, where the following tokens must be replaced such that\nFILENAME is the lowercase file name including the file extension,\ nLANG is the markup code block language for the code's language, and CODE is the code:\n\nFILENAME\n```LANG\nCODE\n```\n\nYou will start with the \"entrypoint\" file, then go to the ones that are imported by that file, and so on.\nPlease note that the code should be fully functional. No placeholders.\n\nFollow a language and framework appropriate best practice file naming convention.\nMake sure that files contain all imports, types etc. Make sure that code in different files are compatible with each other.\nEnsure to implement all code, if you are unsure, write a plausible implementation.\nInclude module dependency or package manager dependency definition file.\nBefore you finish, double check that all parts of the architecture is present in the files.\n\nUseful to know:\nYou almost always put different classes in different files.\nFor Python, you always create an appropriate requirements.txt file.\nFor NodeJS, you always create an appropriate package.json file.\nYou always add a comment briefly describing the purpose of the function definition.\nYou try to add comments explaining very complex bits of logic.\nYou always follow the best practices for the requested languages in terms of describing the code written as a defined\npackage/project.\n\n\nPython toolbelt preferences:\n- pytest\n- dataclasses\n"

},

# … same conversation as earlier, ended with "Make your own assumptions and state them explicitly before starting".

{

"role": "assistant",

"content": "Assumptions:\n1. Model: The model will contain the game's data, such as level information, character states, and enemy positions.\n2. View: The view will handle the game's visuals, including rendering the game objects, backgrounds, and updating the display.\n3. Controller: The controller will manage user input, such as keyboard controls, and update the model accordingly.\n\nWith these assumptions in mind, is there any specific aspect of the keyboard control implementation you would like me to clarify?"

},

{

"role": "user",

"content": "Please now remember the steps:\n\nThink step by step and reason yourself to the right decisions to make sure we get it right.\nFirst lay out the names of the core classes, functions, methods that will be necessary, As well as a quick comment on their purpose.\n\nThen you will output the content of each file including ALL code.\nEach file must strictly follow a markdown code block format, where the following tokens must be replaced such that\nFILENAME is the lowercase file name including the file extension,\nLANG is the markup code block language for the code's language, and CODE is the code:\n\nFILENAME\n```LANG\nCODE\n```\n\nPlease note that the code should be fully functional. No placeholders.\n\nYou will start with the \"entrypoint\" file, then go to the ones that are imported by that file, and so on.\nFollow a language and framework appropriate best practice file naming convention.\nMake sure that files contain all imports, types etc. The code should be fully functional. Make sure that code in different files are compatible with each other.\nBefore you finish, double check that all parts of the architecture is present in the files.\n"

}

]

Challenges

After going through the key ideas and demos of building LLM-centered agents, I start to see some common limitations:

  • Finite context length: the limited context capacity restricts the inclusion of historical information, detailed instructions, API call context, and responses. The system design has to work with this limited communication bandwidth, while mechanisms like self-reflection could benefit a lot from long or even infinite context windows. Although vector stores and retrieval can provide access to a larger knowledge pool, their representation power is not as strong as full attention.
  • Challenges in long-term planning and task decomposition: planning over a lengthy history and effectively exploring the solution space remain challenging. LLMs struggle to adjust their plans when faced with unexpected errors, making them less robust than humans, who learn constantly from trial and error.
  • Reliability of the natural language interface: current agent systems rely on natural language as the interface between the LLM and external components such as memory and tools. However, the reliability of model outputs is questionable: LLMs may make formatting errors and occasionally exhibit rebellious behavior (e.g., refusing to follow instructions). Consequently, much of the agent demo code focuses on parsing model output.

My opinion

After OpenAI released ChatGPT, an epoch-making product comparable to the iPhone, its "ambitions" did not stop there. It wants to go a step further and become the Apple of the AI era. OpenAI previously launched ChatGPT plugins, which attracted a lot of attention and were hailed as ChatGPT's "App Store moment," but judging from usage data so far, the plugins have had limited impact and cannot compare with ChatGPT itself.

Compared with plugins, agents can have a much bigger impact: they can genuinely restructure many current application scenarios, and the room for imagination is larger and richer. An agent is driven by an LLM and is the culmination of LLM (large language model) + memory + planning skills + tool use. It is not hard to see why OpenAI is now focusing on agents: with large models as the "hardware" support and agents as the key software product built on top, OpenAI could establish a real software ecosystem moat, and the App Store moment may truly arrive.

However, today's agents still have many problems. Personally, I think the main ones are that the interaction mode is too narrow, that agents depend heavily on the precision of natural language, and that they are limited by the LLM's finite attention window. But as long as capable people keep pushing on this, I believe these problems will be solved!

Original article: "LLM Powered Autonomous Agents", https://lilianweng.github.io/posts/2023-06-23-agent/

Karpathy's video on AI agents:

References

1. AutoGPT: https://github.com/Significant-Gravitas/Auto-GPT
2. GPT-Engineer: https://github.com/AntonOsika/gpt-engineer
3. CoT: https://arxiv.org/abs/2201.11903
4. Tree of Thoughts: https://arxiv.org/abs/2305.10601
5. LLM+P: https://arxiv.org/abs/2304.11477
6. ReAct: https://arxiv.org/abs/2210.03629
7. Reflexion: https://arxiv.org/abs/2303.11366
8. CoH: https://arxiv.org/abs/2302.02676
9. MRKL: https://arxiv.org/abs/2205.00445
10. TALM: https://arxiv.org/abs/2205.12255
11. Toolformer: https://arxiv.org/abs/2302.04761
12. HuggingGPT: https://arxiv.org/abs/2303.17580
13. API-Bank: https://arxiv.org/abs/2304.08244
14. ChemCrow: https://arxiv.org/abs/2304.05376
15. Generative Agents: https://arxiv.org/abs/2304.03442
