
AGI – Lessons learned from GPT and large language models

Author: Computer Vision Research Institute



Address: https://arxiv.org/pdf/2306.08641.pdf


The AI community has long been pursuing algorithms, known as artificial general intelligence (AGI), that are applicable to any type of real-world problem.

01

Abstract

Recently, chat systems powered by large language models (LLMs) have emerged and quickly become a promising direction for achieving AGI in natural language processing (NLP), but the path to AGI in computer vision (CV) remains unclear. One might attribute this dilemma to visual signals being more complex than linguistic ones, but we are interested in identifying the specific causes and in learning from GPT and LLMs to address the problem.


In today's sharing, we start from the definition of AGI and briefly review how NLP has come to solve a wide range of tasks through chat systems. This analysis suggests that unification is the next important goal for CV. Yet despite all the efforts made in this direction, CV is still far from a system that naturally integrates all tasks the way GPT does. We point out that the essential weakness of CV lies in the lack of a paradigm for learning from an environment, whereas NLP has accomplished this in the text world. We then imagine a pipeline that places a CV algorithm in an interactive, world-scale environment, pre-trains it to predict future frames resulting from its actions, and then fine-tunes it with instructions to accomplish various tasks. We hope to advance and scale up this idea through substantial research and engineering efforts, and to that end we share our views on future research directions.

02

Background

The world is witnessing an epic journey towards artificial general intelligence (AGI), which we conventionally define as a computer algorithm that can replicate any intellectual task that humans or other animals can accomplish. Specifically, in natural language processing (NLP), computer algorithms have advanced to the point where they can solve a wide range of tasks by chatting with humans. Some researchers believe that these systems can be seen as early sparks of AGI. Most of them are built on large language models (LLMs) and enhanced with instruction tuning. Equipped with external knowledge bases and specially designed modules for complex tasks such as solving mathematical problems and generating visualizations, they demonstrate the ability to understand user intent and carry out preliminary chains of thought. Despite known weaknesses in some areas (e.g., occasionally misstating scientific facts or misattributing them to named people), these groundbreaking studies show a clear trend towards unifying most NLP tasks into a single system, which reflects the quest for AGI.


Compared to the rapid progress of unification in NLP, the computer vision community remains far from the goal of unifying all tasks. Conventional CV tasks, such as visual recognition, tracking, and generation, are mostly handled with different network architectures and/or specially designed pipelines. Researchers look forward to GPT-like systems that can handle a wide range of CV tasks through a unified prompting mechanism, but there is a trade-off between strong performance on a single task and generalization across many tasks. For example, to report high accuracy in object detection and semantic segmentation, the best strategy is to design task-specific head modules on top of a backbone pre-trained for image classification, and such designs usually do not transfer to other problems.

Therefore, two questions arise: (1) Why is the unification of CV so difficult? (2) What can be learned from GPT and LLMs to achieve it?

To answer these questions, we revisit GPT and understand it as establishing an environment in the text world and allowing algorithms to learn from interaction. CV research lacks such an environment. As a result, algorithms cannot simulate the world; instead, they sample the world and learn to perform well on so-called proxy tasks. After an epic decade of deep learning, proxy tasks no longer meaningfully demonstrate the capabilities of CV algorithms; it is becoming increasingly clear that continuing to pursue high accuracy on them may lead us away from AGI.

03

Overview

Simply put, AGI amounts to learning a generalized function a = π(s). Despite its simple form, traditional AI algorithms struggled to handle all such problems with the same methods, algorithms, or even models. Over the past decade, deep learning has provided an efficient and unified approach: one can train deep neural networks to approximate the function a = π(s) without knowing the actual relationship between s and a. The emergence of powerful neural architectures such as Transformers has even enabled researchers to train a single model across different data modalities.
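To make the abstraction concrete, here is a minimal sketch (our own illustration, with toy dimensions and random stand-in data, not anything from the paper) of approximating a = π(s) with a neural network fitted to observed state-action pairs:

```python
import torch
import torch.nn as nn

# Toy dimensions for illustration only; s is a state vector, a an action vector.
STATE_DIM, ACTION_DIM = 32, 8

# A small network approximating the generalized function a = pi(s).
pi = nn.Sequential(
    nn.Linear(STATE_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, ACTION_DIM),
)

# Random stand-in data: in practice these would be observed (state, action) pairs.
states = torch.randn(1024, STATE_DIM)
actions = torch.randn(1024, ACTION_DIM)

optimizer = torch.optim.Adam(pi.parameters(), lr=1e-3)
for step in range(100):
    loss = nn.functional.mse_loss(pi(states), actions)  # fit pi without knowing the true mapping
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The point is that the same recipe applies regardless of what s and a actually represent.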

There are great difficulties in achieving AGI, including but not limited to the following problems.

  • The complexity of the data. Real-world data is multifaceted and rich. Some data modalities (e.g., images) may have fairly high dimensions, and the relationships between different modalities may be complex and latent.
  • The complexity of human intelligence. The goal of AGI is not only to solve problems, but also to plan, reason, react to different events, etc. Sometimes, the relationship between human behavior and goals is vague and difficult to represent mathematically.
  • Lack of neurological or cognitive theory. Humans do not yet understand how human intelligence is achieved. Currently, computer algorithms offer an avenue, but with future research in neurology and/or cognition, more possibilities may emerge.

04

GPT

The spark of AGI in NLP

Over the past year, ChatGPT, GPT-4, and other AI chatbots, such as Vicuna, have made significant strides towards AGI. They are computer algorithms developed for natural language processing (NLP). By chatting with humans, they can understand human intent and complete a wide range of tasks, as long as those tasks can be rendered in plain text. In particular, GPT-4 shows strong general problem-solving ability and is considered an early spark of AGI in the NLP field.


Although GPT-4 has not yet opened its visual interface to the public, the official technical report presents several impressive examples of multimodal conversation, i.e., chats that take input images as references. This means that GPT-4 already has the ability to combine linguistic features with visual features, so it can perform basic visual comprehension tasks. As we will see later, the vision community has developed several alternatives for the same purpose, the key being to use ChatGPT or GPT-4 to generate (bootstrap) training data. In addition, with simple prompts, GPT-4 is also able to call external software for image generation (e.g., Midjourney) and external libraries for solving complex computer vision problems (e.g., the HuggingFace library).


These AI chatbots are trained in two stages. In the first stage, a large language model (LLM), most often based on the Transformer architecture, is pre-trained on large text corpora with self-supervised learning. In the second stage, the pre-trained LLM is fine-tuned under the supervision of human instructions to complete specific tasks. If necessary, human feedback is collected and reinforcement learning is applied to further fine-tune the LLM for better performance and data efficiency.
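As a rough illustration of the first stage (our own toy sketch, not the actual GPT setup: the tiny Transformer, vocabulary size, and random token data are placeholders), self-supervised pre-training reduces to next-token prediction under a causal mask:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, SEQ = 1000, 128, 64  # toy sizes

class TinyCausalLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):
        n = tokens.size(1)
        # Causal mask: each position may only attend to earlier positions.
        mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        h = self.encoder(self.embed(tokens), mask=mask)
        return self.head(h)

model = TinyCausalLM()
tokens = torch.randint(0, VOCAB, (8, SEQ))  # stand-in for tokenized text
logits = model(tokens[:, :-1])              # predict token t+1 from tokens <= t
loss = F.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
loss.backward()
```

The second stage then continues training the same model on (instruction, response) pairs, optionally followed by reinforcement learning from human feedback.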

05

CV: AGI's next battleground

Humans perceive the world based on multiple data modalities. It is well known that about 85% of what we learn is done through our visual system. So, given that the NLP community has shown promise for AGI, it's only natural to see computer vision (CV) or multimodality (including at least the visual and language domains) as the next battleground for AGI.

Two additional comments supplement the above statement. First, CV is arguably a superset of NLP, because a human reading an article first recognizes the characters in the captured image and then understands the content. In other words, AGI (or multimodality) in CV should cover all the capabilities of AGI in NLP. Second, we believe that in many cases language alone is not enough. For example, when people try to find detailed information about an unknown object (e.g., an animal or a fashion item), the best way is to capture an image and use it for an online search; relying solely on textual descriptions can introduce uncertainty and inaccuracy. Another case is that, as mentioned earlier, it is not always easy to reference fine-grained semantics (for recognition or image editing) in a scene, and it is often more efficient to interact in a visually friendly way, for example, using points or boxes to locate targets instead of saying "someone in a black jacket, standing in front of a yellow car, talking to another person."

Ideals and realities

Ideally, we hope for a CV algorithm that can solve general tasks by interacting with the environment. Note that the requirement is not limited to recognizing all content or holding conversations based on images or video clips; it should be a holistic system that receives generic commands from humans and produces the desired results. However, the current status of CV is still very preliminary: CV has been using different modules, and even different systems, for different vision tasks.


Unification is the trend

Below, we summarize the topics of recent research on CV unification into five categories.

  • Open-world Visual Recognition

For a long time, most CV algorithms could only recognize concepts that appeared in the training data, resulting in a "closed world" of visual concepts. In contrast, the "open world" setting refers to the ability of a CV algorithm to recognize or understand any concept, whether or not it has appeared before. Open-world capability is often introduced through natural language, because language is the natural way for humans to refer to new concepts. This explains why language-related tasks such as image captioning and visual question answering contributed to the earliest open-world settings in visual recognition.
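As a conceptual sketch of how language enables open-world recognition (the encoders below are random placeholders standing in for a CLIP-style pre-trained model; the prompt template and class names are our own), an image is simply matched against text embeddings of arbitrary, user-supplied concepts:

```python
import torch
import torch.nn.functional as F

# Hypothetical pre-trained encoders mapping images and class-name prompts into a
# shared embedding space (as CLIP-style models do); random placeholders here.
def encode_image(image):            # image: (3, H, W) tensor
    return torch.randn(512)

def encode_text(prompt):            # prompt: e.g. "a photo of a zebra"
    return torch.randn(512)

def open_world_classify(image, class_names):
    """Score an image against an arbitrary, user-supplied list of concepts."""
    img = F.normalize(encode_image(image), dim=0)
    txt = torch.stack([F.normalize(encode_text(f"a photo of a {c}"), dim=0)
                       for c in class_names])
    scores = txt @ img                      # cosine similarities
    return dict(zip(class_names, scores.softmax(dim=0).tolist()))

# The class list is open: any concept expressible in language can be queried.
print(open_world_classify(torch.rand(3, 224, 224), ["zebra", "okapi", "tapir"]))
```

Because the class list is just text, nothing restricts it to the concepts seen during training.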

  • The Segment Anything Task

The Segment Anything task is a recently introduced generic formulation for clustering raw image pixels into groups, many of which correspond to basic visual units in the image. The task supports several types of prompts, including points, boxes, text, and so on, and generates one or more masks with scores for each prompt or combination of prompts. After training on a large-scale dataset of about 11 million images, the derived model, SAM, transfers to a wide range of segmentation tasks, including medical image analysis, camouflaged object segmentation, 3D object segmentation, object tracking, and image restoration applications. SAM can also be combined with state-of-the-art visual recognition algorithms, for example by refining the bounding boxes produced by a detection or grounding algorithm into masks, or by feeding the segmented units into an open-set classification algorithm for image labeling.
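A sketch of how promptable segmentation is typically invoked through the released segment_anything package, as we recall its interface; the checkpoint path, dummy image, and point coordinates are placeholders and should be treated as assumptions:

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Checkpoint path and image are placeholders; the point prompt marks a pixel
# believed to lie on the object of interest (label 1 = foreground).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)     # stand-in RGB image
predictor.set_image(image)

masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=True,    # SAM returns several candidate masks with scores
)
best = masks[scores.argmax()]  # pick the highest-scoring mask
```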

  • Generalized Visual Encoding

Another way to unify CV tasks is to provide them with a common visual encoding. There are several ways to achieve this. A key difficulty lies in the huge differences between visual tasks: for example, object detection requires a set of bounding boxes, while semantic segmentation requires dense prediction over the entire image, and both differ greatly from the single label required for image classification. Natural language offers a unified form in which to represent everything. An earlier study, Pix2Seq, showed that object detection results, i.e. bounding boxes and class labels, can be formulated as sequences of coordinate and class tokens output by a vision model. The later Pix2Seq-v2 generalized this representation to object detection, instance segmentation, keypoint detection, and image captioning. Similar ideas have been applied to other image recognition, video recognition, and multimodal understanding tasks.
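A toy sketch of the underlying idea (our own simplification of the Pix2Seq-style encoding; the bin count, token layout, and coordinate order are illustrative choices, not the papers' exact configuration): box coordinates are quantized into a discrete vocabulary so that each detection becomes a short token sequence.

```python
NUM_BINS = 1000          # quantization bins for coordinates (illustrative)
CLASS_OFFSET = NUM_BINS  # class tokens live after the coordinate tokens

def box_to_tokens(box, class_id, img_w, img_h):
    """Encode one detection (x_min, y_min, x_max, y_max, class) as 5 tokens."""
    x0, y0, x1, y1 = box
    return [
        int(x0 / img_w * (NUM_BINS - 1)),
        int(y0 / img_h * (NUM_BINS - 1)),
        int(x1 / img_w * (NUM_BINS - 1)),
        int(y1 / img_h * (NUM_BINS - 1)),
        CLASS_OFFSET + class_id,
    ]

def tokens_to_box(tokens, img_w, img_h):
    """Decode the 5-token sequence back into a box and class id."""
    x0, y0, x1, y1, cls = tokens
    scale = lambda t, s: t / (NUM_BINS - 1) * s
    return (scale(x0, img_w), scale(y0, img_h),
            scale(x1, img_w), scale(y1, img_h)), cls - CLASS_OFFSET

toks = box_to_tokens((48, 30, 320, 200), class_id=7, img_w=640, img_h=480)
print(toks, tokens_to_box(toks, 640, 480))
```

Once every task's output is a token sequence, a single sequence model can in principle produce them all.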

  • LLM-guided Visual Understanding

Visual recognition can be complex, especially when it involves relationships between compositional concepts and/or visual instances. End-to-end models (e.g., vision-language pre-trained models for visual question answering) have a hard time producing answers through a procedure that humans can easily interpret. To alleviate this problem, a practical approach is to generate interpretable logic to aid visual recognition. The idea is not new: years ago, before the advent of Transformer architectures, researchers proposed using long short-term memory (LSTM) models to generate programs that call vision modules to answer complex questions. At the time, the limited capability of LSTMs largely restricted the idea to relatively simple and templated questions.

Recently, the advent of large language models, especially the GPT series, has made it possible to handle arbitrary questions. Specifically, GPT can assist in different ways: it can summarize basic recognition results into a final answer, or generate code or natural-language scripts that invoke basic vision modules. In this way, visual problems can be decomposed into basic modules. This is especially useful for logical questions, such as those asking about spatial relationships between objects or those that depend on the number of objects.
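A hypothetical illustration of this decomposition; the vision modules, their outputs, and the "generated" script below are all invented stand-ins, not a real system:

```python
# Hypothetical basic vision modules that an LLM-generated program could call.
def detect(image, category):
    """Return bounding boxes for `category` (stub outputs for illustration)."""
    return [(10, 20, 50, 80), (120, 40, 180, 110)] if category == "cup" else []

def count(boxes):
    return len(boxes)

def left_of(box_a, box_b):
    return box_a[2] < box_b[0]   # a's right edge lies left of b's left edge

# For "How many cups are there, and is one left of the plate?", an LLM might
# emit a small script like this instead of answering end-to-end, making each
# reasoning step inspectable:
def answer(image):
    cups = detect(image, "cup")
    plates = detect(image, "plate")
    return {"num_cups": count(cups),
            "cup_left_of_plate": bool(cups and plates and left_of(cups[0], plates[0]))}

print(answer(image=None))
```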

  • Multimodal Dialog

Multimodal dialog extends text-based dialog into the visual realm. Early work involved visual question answering, for which various datasets with simple questions were constructed. With the rapid development of LLMs, multi-turn question answering can be achieved by jointly fine-tuning pre-trained vision and language models. Studies also show that a wide range of questions can be answered through multimodal in-context learning or by using GPT as a logic controller.


Recently, instruction tuning, a paradigm developed in the GPT series, has been inherited to improve the quality of multimodal dialog. The idea is to provide some reference data (e.g., objects, descriptions) from ground-truth annotations or recognition results and ask a GPT model to generate instruction data (i.e., rich question-answer pairs). By fine-tuning on this data (without the reference data at test time), the vision and language foundation models can interact with each other through lightweight network modules such as Q-Former. Multimodal dialog provides an initial interaction benchmark for computer vision, but as a language-guided task it also shares the weaknesses analyzed above for open-world visual recognition. We hope that enriching the form of queries, for example through a generalized visual encoding, can push multimodal dialog to a higher level.
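A sketch of how such instruction data might be generated in this spirit; the prompt wording is ours, and query_llm is a hypothetical stand-in for a call to a GPT-style model, not a real API:

```python
import json

def build_instruction_prompt(caption, boxes):
    """Pack reference annotations into a prompt asking a GPT-style model to
    invent rich question-answer pairs about the image."""
    context = {"caption": caption, "objects": boxes}
    return (
        "You are given annotations of an image:\n"
        f"{json.dumps(context, indent=2)}\n"
        "Generate three diverse question-answer pairs about this image, as if "
        'you could see it. Return JSON: [{"question": ..., "answer": ...}]'
    )

def query_llm(prompt):
    """Hypothetical stand-in for a GPT-style model call."""
    return '[{"question": "What is the man holding?", "answer": "A red umbrella."}]'

caption = "A man holding a red umbrella walks past a yellow taxi."
boxes = [{"label": "man", "bbox": [40, 60, 180, 400]},
         {"label": "umbrella", "bbox": [30, 20, 200, 120]}]

qa_pairs = json.loads(query_llm(build_instruction_prompt(caption, boxes)))
print(qa_pairs)
# The resulting (image, question, answer) triples are used to fine-tune the
# vision-language model; the reference annotations are not needed at test time.
```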

06

Future

Learn from the environment

An Imaginary Pipeline


Our imagined pipeline consists of three stages: stage 0 sets up the environment, stage 1 performs generative pre-training in it, and stage 2 fine-tunes the model with instructions. If necessary, the fine-tuned model can also be prompted to perform traditional visual recognition tasks.
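A skeleton of how the three stages might fit together; the environment interface, the dummy policy, and all function names are our own illustrative stand-ins, not the paper's implementation:

```python
# Stage 0: an interactive environment that renders observations (frames).
class InteractiveWorld:
    def reset(self):
        return "frame_0"                      # initial observation
    def step(self, action):
        return "next_frame", 0.0, False       # observation, reward, done

class DummyPolicy:
    """Placeholder agent exposing the hooks the pipeline needs."""
    def act(self, obs): return "move_forward"
    def predict_next_frame(self, obs, action): return "predicted_frame"
    def update(self, predicted, observed): pass          # e.g., a reconstruction loss
    def imitate(self, instruction, demonstration): pass  # supervised fine-tuning

# Stage 1: explore and learn to predict the future frames of one's own actions.
def pretrain(policy, env, steps):
    obs = env.reset()
    for _ in range(steps):
        action = policy.act(obs)
        predicted = policy.predict_next_frame(obs, action)
        obs, _, _ = env.step(action)
        policy.update(predicted, obs)

# Stage 2: align the pre-trained model with human instructions.
def finetune(policy, instruction_data):
    for instruction, demonstration in instruction_data:
        policy.imitate(instruction, demonstration)

env, policy = InteractiveWorld(), DummyPolicy()
pretrain(policy, env, steps=10)
finetune(policy, [("pick up the red cup", "demo_trajectory")])
```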

Comments on Research Directions

Finally, we look ahead to future research directions. As the primary goal shifts from performance on proxy tasks to learning from the environment, many popular research directions may have to adjust their objectives. A disclaimer: all statements below are our personal opinions and may turn out to be wrong.

On creating an environment

A clear goal is to keep increasing the scale, diversity, and fidelity of virtual environments. A variety of techniques can help; for example, new 3D representations (e.g., neural radiance fields, NeRF) may achieve a better trade-off between reconstruction quality and overhead. Another important direction is enriching the environment: defining new, complex tasks and unifying them into a prompting system is a nontrivial undertaking. In addition, AI algorithms can benefit greatly from better simulation of other agents' behavior, as it improves the richness of the environment and thus the robustness of the trained algorithms.

On generative pretraining

Two main factors influence the pre-training stage, namely neural architecture design and proxy-task design. The latter is clearly more important, and the former should build on it. Existing pre-training tasks, including contrastive learning and masked image modeling, should be modified for effective exploration in a virtual environment. We expect newly designed proxy tasks to focus on data compression, because redundancy in visual data is much heavier than in linguistic data. The new pre-training proxy task in turn defines requirements for the neural architecture; for example, to balance data compression against visual recognition, the architecture should be able to extract visual features at different levels (granularities) on request. In addition, cross-modal (e.g., text-to-image) generation will become a direct metric for measuring pre-training performance; once a unified tokenization method is available, it can be formulated as a multimodal version of the reconstruction loss.
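A toy sketch of one such compression-oriented proxy task (the architecture, sizes, and random data are illustrative only, not a proposal from the paper): encode the current frame into a compact latent code and predict the next frame from the latent plus the action, trained with a reconstruction loss.

```python
import torch
import torch.nn as nn

# Toy proxy task: compress the current frame into a compact latent code and
# predict the next frame from (latent, action); sizes are illustrative.
FRAME = 3 * 64 * 64
ACTION_DIM, LATENT = 8, 256

encoder = nn.Sequential(nn.Flatten(), nn.Linear(FRAME, LATENT), nn.ReLU())
predictor = nn.Sequential(nn.Linear(LATENT + ACTION_DIM, LATENT), nn.ReLU(),
                          nn.Linear(LATENT, FRAME))

frames_t = torch.rand(16, 3, 64, 64)       # current frames (dummy data)
frames_t1 = torch.rand(16, 3, 64, 64)      # observed next frames
actions = torch.randn(16, ACTION_DIM)      # actions taken in between

latent = encoder(frames_t)                               # compressed representation
pred_next = predictor(torch.cat([latent, actions], dim=1))
loss = nn.functional.mse_loss(pred_next, frames_t1.flatten(1))  # reconstruction loss
loss.backward()
```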

On instruction fine-tuning

We have not yet touched on how tasks should be defined in the new paradigm. Since tasks in the real world can be very complex, we speculate that a set of basic tasks can be defined and trained first, so that complex tasks can be decomposed into them. To this end, a unified prompting system should be designed and a rich collection of human instructions should be gathered. As a reasonable guess, the amount of instruction data may be orders of magnitude larger than that collected for training GPT and other chatbots. This is a whole new story for CV: the road ahead is full of unknown difficulties and uncertainties. We cannot see far at the moment, but we believe a clear path will emerge.

