
SenseTime and Tsinghua's generalist agent unlocks Minecraft: surviving, exploring and creating like humans

Author: Heart of the Machine Pro

Published by Heart of the Machine

Heart of the Machine Editorial Office

From AlphaGo in Go to AlphaStar in StarCraft II and OpenAI Five in Dota 2, research on superhuman game agents has made major breakthroughs in increasingly realistic and open virtual environments. Now the generalist AI agent "Ghost in the Minecraft" (GITM) can play Minecraft.

Minecraft, the world's best-selling game, involves survival, exploration, and creation, activities that closely mirror the real world, which makes it something like a scaled-down version of reality. Many well-known research teams, including DeepMind and OpenAI, have invested in building AI agents for the game, hoping the results will carry over to real-world problems.

Researchers from SenseTime, Tsinghua University, Shanghai Artificial Intelligence Laboratory and other institutions have proposed Ghost in the Minecraft (GITM), a generalist AI agent that learns to solve tasks on its own. It outperforms all previous agents in Minecraft while greatly reducing training cost. The study is thus presented as an important step towards artificial general intelligence (AGI). The goal of AGI research is to develop agents that can perceive, understand, and interact like humans in open-world environments. Progress on AGI could bring breakthroughs to industries such as robotics and autonomous driving and help move AI technology into industrial deployment.

The agent unlocks all 262 items in the Minecraft Overworld tech tree (all previous agent methods combined, including those from OpenAI and DeepMind, unlocked only 78). On the standard "Get Diamond" task it raises the success rate by 47.5 percentage points (from the 20% of OpenAI's VPT method to 67.5%). Training takes two days on a single CPU node. The number of environment interaction steps needed for training is one ten-thousandth of that of previous methods, and the compute cost is far below the 6,480 GPU days required by OpenAI's VPT or the 17 GPU days required by DeepMind's DreamerV3.


Project Home Page:

AI can also cope with the open world, survive, explore and create like humans!

Ghost in the Minecraft (GITM), a generalist AI agent, starts from scratch in survival mode, gets all the items in the Overworld, digs diamonds, and makes enchanted books!


"Ghost in the Minecraft" (GITM)


GITM successfully crafts an Enchanted Book, the highest-level item in the Overworld tech tree



Why Minecraft

Current artificial intelligence research increasingly aims to create AI agents with generalist capabilities. The hope is that such agents will master a wide range of skills, adapt to changing environments, and approach human ability on complex problems.

Minecraft, the world's best-selling game, involves survival, exploration, and creation, all of which closely mirror the real world. The researchers' goal is to develop an AI agent that can overcome every technical challenge in Minecraft, as a step towards general-purpose AI that can learn on its own and master the full range of real-world skills.

However, AI agents in Minecraft run into Moravec's paradox:

Some tasks that are relatively difficult for humans, such as playing chess, are relatively easy for AI. In an open world like Minecraft, by contrast, interacting with the environment, planning, and making decisions are relatively simple for humans but pose huge challenges for AI.

GITM pushes past the limits of this paradox and achieves a major breakthrough in a complex, real-world-like environment. This opens up new possibilities for advancing AI technology and building more general AI agents.

How strong is GITM



High task success rate: GITM achieved a 67.5% success rate on the most widely discussed task, "Get Diamond", an improvement of 47.5 percentage points over the previous best result (OpenAI VPT, 20%).
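The reported gains can be checked with a few lines of arithmetic. The sketch below only uses figures quoted in this article; the variable names are illustrative, and the point is simply that 20% to 67.5% is a gain of 47.5 percentage points, not a 47.5% relative increase.

```python
# Figures as reported in the article; variable names are illustrative only.
vpt_success = 20.0    # OpenAI VPT success rate on "Get Diamond", in percent
gitm_success = 67.5   # GITM success rate on the same task, in percent
prev_items = 78       # items unlocked by all previous agents combined
gitm_items = 262      # items unlocked by GITM

# Absolute gain, in percentage points.
gain_points = gitm_success - vpt_success

# Relative gain: how many times higher the success rate is.
relative = gitm_success / vpt_success

# How many times more Overworld items GITM unlocks.
item_ratio = gitm_items / prev_items

print(f"Gain: {gain_points} percentage points ({relative}x the VPT rate)")
print(f"Items unlocked: {item_ratio:.2f}x the previous total")
```

In relative terms, the success rate is 3.375 times that of VPT, and the number of unlocked items is roughly 3.4 times the previous total.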


Extremely high training efficiency: GITM needs only one ten-thousandth of the environment interaction steps used by existing methods, and training completes in 2 days on a single CPU node. That is a large reduction compared with the 6,480 GPU days required by OpenAI VPT or the 17 GPU days required by DeepMind DreamerV3.



The difficulty for traditional reinforcement learning (RL) agents lies in mapping extremely complex tasks directly onto the lowest-level keyboard and mouse operations.

GITM departs from the traditional RL-based architecture and adopts a new paradigm in which a Large Language Model (LLM) is the core of the agent.


GITM consists mainly of an LLM Decomposer, an LLM Planner, and an LLM Interface, which progressively break complex tasks down into subtasks, then structured actions, and finally low-level keyboard and mouse operations:

  • The LLM Decomposer draws on external knowledge, such as game knowledge bases on the Internet, to break complex tasks down into simple subtasks
  • The LLM Planner plans a series of structured actions for each subtask, adjusts the plan based on feedback, and improves itself by continuously learning from successful attempts
  • The LLM Interface carries out structured actions through low-level keyboard and mouse operations and collects observations as it interacts with the environment
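
The three-level decomposition above can be pictured with a toy sketch. This is not GITM's code: the LLM calls are replaced by simple lookups, and every name here (`KNOWLEDGE`, `KEYMAP`, `decompose`, `Planner`, `Interface`, `run`) is a hypothetical stand-in chosen for illustration. It only shows the flow from goal to subtasks, to structured actions, to low-level inputs, with feedback-driven replanning and a memory of successful plans.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for external knowledge: goal -> prerequisite sub-goals.
KNOWLEDGE = {
    "stone_pickaxe": ["wooden_pickaxe", "cobblestone"],
    "wooden_pickaxe": ["planks", "stick"],
    "cobblestone": ["wooden_pickaxe"],
    "stick": ["planks"],
    "planks": ["log"],
    "log": [],
}
RAW_MATERIALS = {"log", "cobblestone"}  # mined rather than crafted

# Hypothetical mapping from structured actions to keyboard/mouse operations.
KEYMAP = {
    "explore": ["key:W", "mouse:turn"],
    "mine": ["mouse:aim", "mouse:hold_left"],
    "craft": ["key:E", "mouse:click_recipe"],
}

def decompose(goal):
    """Decomposer: list sub-goals with prerequisites first (depth-first)."""
    ordered = []
    def visit(item):
        if item in ordered:
            return
        for dep in KNOWLEDGE.get(item, []):
            visit(dep)
        ordered.append(item)
    visit(goal)
    return ordered

@dataclass
class Planner:
    """Planner: structured actions per sub-goal, revised using feedback."""
    memory: dict = field(default_factory=dict)  # sub-goal -> plan that worked

    def plan(self, sub_goal, feedback=None):
        if feedback is None and sub_goal in self.memory:
            return self.memory[sub_goal]  # reuse a previously successful plan
        verb = "mine" if sub_goal in RAW_MATERIALS else "craft"
        plan = [(verb, sub_goal)]
        if feedback == "target not in view":
            plan.insert(0, ("explore", sub_goal))  # adjust plan after failure
        return plan

@dataclass
class Interface:
    """Interface: turns structured actions into low-level inputs (simulated)."""
    seen: set = field(default_factory=set)
    inventory: list = field(default_factory=list)
    inputs: list = field(default_factory=list)

    def execute(self, action):
        verb, target = action
        self.inputs.extend(KEYMAP[verb])
        if verb == "explore":
            self.seen.add(target)
            return True, "ok"
        if verb == "mine" and target not in self.seen:
            return False, "target not in view"
        self.inventory.append(target)  # toy model: ingredients are not consumed
        return True, "ok"

def run(goal, max_attempts=3):
    planner, interface = Planner(), Interface()
    for sub_goal in decompose(goal):
        feedback = None
        for _ in range(max_attempts):
            plan = planner.plan(sub_goal, feedback)
            ok = True
            for action in plan:
                ok, feedback = interface.execute(action)
                if not ok:
                    break
            if ok:
                planner.memory[sub_goal] = plan  # learn from success
                break
        else:
            raise RuntimeError(f"could not complete sub-goal: {sub_goal}")
    return planner, interface

planner, interface = run("stone_pickaxe")
print(interface.inventory)
```

In this toy run, the first attempt to mine a log fails because the target is not in view, the planner prepends an explore action, and the successful plan is stored for reuse. In the described system each of these three roles is played by an LLM rather than a lookup table.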

Advanced applications of GITM


GITM can also be applied to more complex tasks in Minecraft, such as building the shelters, farmland, and iron golems needed for survival, the redstone circuits needed for automated equipment, and the Nether portal needed to enter the Nether. These tasks demonstrate GITM's capability and scalability, allowing agents to survive over long periods, develop, and explore more advanced worlds in Minecraft.
