
The latest advances in multimodal large language models, take a look at the latest research results

Author: Salary Technology Review

1. Review: Recent advances in multimodal large language models

Multimodal Large Language Models: A Comprehensive Survey

* Summarizes the design schemes, training methods, and performance evaluation metrics of MM-LLMs.

* Describes how 26 MM-LLMs are built, their advantages and disadvantages, and their application areas.

* Analyzes the performance of MM-LLMs on mainstream benchmarks and proposes key training methods to enhance their performance.

Advantages of MM-LLMs:

* Large-scale pre-training: MM-LLMs use massive multi-modal data for pre-training, covering multiple modalities such as text, image, and audio.

* Multimodal Representation Learning: MM-LLMs can learn and understand the relationships and interactions between different modalities, and provide a unified representation of multimodal information.

* Wide range of applications: MM-LLMs perform well in tasks such as multimodal machine translation, multimodal information retrieval, and multimodal question answering, promoting the development of multimodal artificial intelligence.

Challenges faced by MM-LLMs:

* Data bias: The pre-trained data of MM-LLMs may be biased, resulting in biased output results of the model.

* Training cost: Large-scale pre-training of MM-LLMs requires huge computing and storage resources, and the training cost is high.

* Inference speed: MM-LLM inference is usually slow, making it hard to meet the needs of real-time applications.

The survey was conducted by a team of researchers from Tencent, Kyoto University, and the University of Chinese Academy of Sciences. It gives an overview of model architectures and general design choices for training pipelines, briefly introduces 26 existing MM-LLMs and how each is built, reviews their performance on mainstream benchmarks, and summarizes key training recipes for enhancing MM-LLM performance.


2. SUPIR: Intelligent, Realistic Image Restoration Technology

SUPIR: A breakthrough approach to image restoration

A research team from the University of Chinese Academy of Sciences and the Shanghai Artificial Intelligence Laboratory has proposed a breakthrough image restoration method called SUPIR (Scaling-UP Image Restoration). By leveraging generative priors and model scaling, the method makes significant progress toward intelligent, photorealistic image restoration.

Advantages of SUPIR:

* On classical image restoration tasks, SUPIR achieves better restoration quality than existing methods.

* SUPIR can restore images guided by text prompts, a new capability that lets it generate photorealistic results tailored to the user's needs.

The emergence of SUPIR marks a new stage in image restoration technology, which will be widely used in image processing, computer vision, and multimedia.


3. CreativeSynth: Creative mixing and synthesis of visual arts based on multimodal diffusion

CreativeSynth: A unified framework for the field of artistic image generation

Developed by a team of researchers from the University of Chinese Academy of Sciences, the Chinese Academy of Sciences, ByteDance, and Tsinghua University, CreativeSynth is an innovative framework that brings real-world semantic content into the art domain through inversion and real-time style transfer. The framework has the following features:

* Coordinate multimodal input: Process text, images, and other forms of input simultaneously to generate artistic images.

* Multitasking: Supports the generation of multiple art styles and content, including oil paintings, watercolors, sketches, and more.

* Precise control of style and content: Precise manipulation of image style and content while maintaining the integrity of the original model parameters.

CreativeSynth has made breakthroughs in the field of image generation, such as:

* Meet or exceed state-of-the-art levels on multiple art image generation datasets.

* The quality of the generated images has been significantly improved.


CreativeSynth provides new ideas and tools for the research and application of artistic image generation, and has broad application prospects.


4. Tsinghua New Research: Make GPT-3.5 Comparable to GPT-4

ICE: A New Strategy for Adaptive and Flexible Artificial Intelligence Agents

A team of researchers from Tsinghua University and collaborating institutions have proposed a new strategy called ICE that significantly improves the adaptability and flexibility of artificial intelligence (AI) agents. ICE-equipped agents match the original GPT-4 across a variety of agent tasks while making 80% fewer API calls and requiring significantly less model capability.

ICE employs a novel "Explore-Consolidate-Exploit" strategy: the agent progressively explores new problems and tasks, then consolidates and reuses the experience it has gathered, continuously improving its adaptability and flexibility. This strategy allows ICE to adapt quickly and make decisions across many different environments and tasks.

The key advantage of ICE is its efficient use of model capacity, which significantly reduces the model power required. This lets ICE achieve strong performance in resource-constrained environments and be deployed across a wider range of use cases.
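The Explore-Consolidate-Exploit loop described above can be sketched in a few lines of Python. Everything below is an illustrative stand-in, not the paper's actual interface: the `Task` class, the `solve_from_scratch` callback, and keying memory by task kind are assumptions made for the sketch.

```python
class Task:
    """A toy task with a coarse category used for experience lookup."""
    def __init__(self, kind, payload):
        self.kind, self.payload = kind, payload

def explore(task, solve_from_scratch):
    """Try a task without prior experience; return (trajectory, success)."""
    return solve_from_scratch(task)

def consolidate(memory, task, trajectory, success):
    """Keep only successful trajectories as reusable experience."""
    if success:
        memory[task.kind] = trajectory
    return memory

def exploit(memory, task):
    """Reuse consolidated experience when a similar task reappears."""
    return memory.get(task.kind)

def run_agent(tasks, solve_from_scratch):
    memory, results = {}, []
    for task in tasks:
        plan = exploit(memory, task)
        if plan is None:                      # unseen task kind: explore
            plan, ok = explore(task, solve_from_scratch)
            memory = consolidate(memory, task, plan, ok)
        results.append(plan)
    return results, memory
```

The point of the pattern is visible in the loop: once a task kind has been solved, later tasks of that kind skip the expensive from-scratch solve entirely, which is where the reported API-call savings would come from.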


5. WebVoyager: Building end-to-end web agents with large multimodal models

WebVoyager: Ushering in a new era of web agents

A team of researchers from Zhejiang University, Tencent, and Westlake University launched WebVoyager, an innovative web agent driven by a large multimodal model (LMM) that completes end-to-end user instructions by interacting with real-world websites. Its automated evaluation agrees with human judgment 85.3% of the time.

WebVoyager performs a variety of tasks on real-world websites, such as searching for information, booking flights, and purchasing goods. Its performance exceeds that of both traditional rule-based web agents and reinforcement-learning-based web agents.
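A web agent of this kind is essentially an observe-decide-act loop. The sketch below is a hypothetical skeleton, not WebVoyager's real code: `browser` and `choose_action` (the LMM policy) are stand-ins for a browser driver and a multimodal model call.

```python
def run_web_agent(instruction, browser, choose_action, max_steps=15):
    """Drive a browser with an LMM policy until it emits an answer.

    browser.observe() returns the current page state (e.g. a screenshot
    plus a list of numbered interactive elements); choose_action maps
    (instruction, observation, history) to one action dict such as
    {"type": "click", ...} or {"type": "answer", "content": ...}.
    """
    history = []
    for _ in range(max_steps):
        obs = browser.observe()                 # screenshot + element list
        action = choose_action(instruction, obs, history)
        history.append(action)
        if action["type"] == "answer":          # task finished
            return action["content"], history
        browser.execute(action)                 # click / type / scroll ...
    return None, history                        # step budget exhausted
```

The `max_steps` cap is the usual safeguard against an agent looping forever on a page it cannot parse.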

The success of WebVoyager marks a new era for web agents and is expected to enable a wide range of applications in areas such as e-commerce, online education, and healthcare.


6. Google launches Lumiere, an AI video generator

* Google launches Lumiere, a diffusion model specifically generated for video.

* Lumiere is able to directly generate full-frame-rate, low-resolution video by processing video on multiple spatial and temporal scales.

* Lumiere readily supports a variety of content creation tasks and video editing applications, including text-to-video, image-to-video, video inpainting, and stylized generation.

* Lumiere is a powerful and versatile video generation tool that can power a wide range of video creation and editing applications.

7. ConTextual: Evaluating context-sensitive rich-text visual reasoning in large multimodal models

1. Benchmark for evaluating LMMs for performing complex tasks: ConTextual

- A UCLA research team has proposed ConTextual, a benchmark to evaluate the ability of large multimodal models (LMMs) to perform context-sensitive text-rich visual reasoning.

2. GPT-4V(ision)'s overall performance lags behind humans

- The overall performance of the best-performing LMM, GPT-4V(ision), still lags behind humans.

3. Conclusion: LMMs still have room for improvement

- LMMs have not yet fully mastered the visual reasoning capabilities of context-sensitive texts, and there is still room for improvement.

The latest advances in multimodal large language models, take a look at the latest research results

8. AgentBoard: a multi-round LLM agent analysis and evaluation framework

AgentBoard, a groundbreaking evaluation framework, enables the development of large language model agents

A research team and collaborators from the University of Hong Kong, Zhejiang University, Shanghai Jiao Tong University, and Tsinghua University jointly proposed a groundbreaking comprehensive benchmark for the analysis and evaluation of large language model (LLM) agents, and a supporting open-source evaluation framework, AgentBoard.

AgentBoard makes significant progress in demystifying agent behavior and accelerating the development of more robust LLM agents. The framework achieves this by:

1. Provide 19 assessment tasks covering language, logic, mathematics and general studies;

2. Seven evaluation indicators are proposed to comprehensively evaluate the agent from the perspectives of efficiency, effectiveness, and robustness.

3. Open-source evaluation code and data to make it easy for researchers and practitioners to use AgentBoard.

AgentBoard not only provides a comprehensive methodology and standard for evaluating LLM agents, but also promotes the development and application of LLM agents.


9. Meta-Prompting: One model becomes an expert in multiple fields on demand

With Meta-Prompting, a single model can act as an expert in fields such as law, medicine, and finance as needed. Proposed by OpenAI and Stanford University, the technique lets large language models adapt to different tasks without additional training, simply by adjusting prompts. It can be widely applied in natural language processing, code generation, question answering, and other areas to provide users with more accurate and relevant information.


* Meta-Prompting: An effective scaffolding technique to improve the functionality of language models. It transforms a single LM into a multi-functional commander who excels at managing and consolidating multiple independent LM queries.

* Technical advantages: Seamless integration of external tools (such as Python interpreters) into the framework expands its applicability and utility.

* Applications: Wide, such as text summarization, question answering, code generation and translation, etc.

* Technical Highlights:

* 1) Proposes a unified meta-prompting framework for performing diverse language understanding and generation tasks.

* 2) Introduces external tools such as a Python interpreter to augment the model's capabilities and enable more complex reasoning tasks.

* 3) Demonstrates the effectiveness of the technique on multiple benchmark datasets across a variety of tasks, including text summarization, question answering, code generation, and translation.
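The conductor-and-experts pattern can be sketched as follows. `call_lm` and the expert personas are hypothetical stand-ins for any chat-completion API, not OpenAI's actual implementation; the key property of meta-prompting preserved here is that each expert query is a fresh, independent call to the same underlying model.

```python
def meta_prompt(task, call_lm, experts=("mathematician", "programmer")):
    """One LM plays conductor: it spawns independent expert queries
    (each with its own persona prompt and no shared history), then
    consolidates their answers in a final call."""
    answers = {}
    for role in experts:
        prompt = f"You are an expert {role}. Solve: {task}"
        answers[role] = call_lm(prompt)          # fresh, independent query
    summary_prompt = "Combine these answers into one:\n" + "\n".join(
        f"{role}: {ans}" for role, ans in answers.items())
    return call_lm(summary_prompt)               # the conductor consolidates
```

In the paper's full framework the conductor also decides *which* experts to consult and may loop; the sketch fixes the expert list for brevity.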


10. Beyond Stable Diffusion: Large-Scale Reinforcement Learning for Diffusion Models

* Reinforcement learning is used to improve diffusion models, significantly exceeding existing methods.

* Diverse reward functions, such as human preference, combinability, and fairness.

* More in line with human preferences, generating more realistic and beautiful images.

* Scalable algorithm that can be used for a variety of diffusion models.

* Open-source code for easy use by researchers and developers.
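The article does not spell out the exact training objective, but the generic signal behind reward-driven fine-tuning of a generative model can be illustrated with a minimal REINFORCE-style surrogate loss (a sketch, not the paper's algorithm): samples whose reward beats the batch average get their log-probability pushed up, below-average samples get pushed down.

```python
import numpy as np

def policy_gradient_loss(log_probs, rewards):
    """REINFORCE-style surrogate for reward fine-tuning.

    log_probs: per-sample log-probability of each generated image
    rewards:   per-sample scores from any reward function
               (human preference, compositionality, fairness, ...)
    A mean-reward baseline reduces gradient variance.
    """
    log_probs = np.asarray(log_probs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    advantages = rewards - rewards.mean()     # simple baseline
    return -np.mean(advantages * log_probs)
```

Minimizing this quantity (with autodiff over real model log-probabilities) increases the likelihood of high-reward samples; the diverse reward functions listed above would simply be different choices of `rewards`.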


11. Nailing logo design: HKUST proposes the AI-assisted tool TypeDance

- Breakthrough Creation: TypeDance introduces a unique and comprehensive design workflow that seamlessly integrates creative ideation, selection, generation, evaluation, and iteration to ensure a more efficient and intelligent logo design process.

- Personalized Semantic Typography: With personalized semantic typography at its core, TypeDance uses semantic analysis and machine learning algorithms to automatically create logo typography that closely matches the corporate image and message.

- Dual-task user evaluation: Using two user evaluation tasks, imitation and creation, TypeDance has demonstrated strong design practicality and usability in different application scenarios, proving its value in the field of logo design.

- Practicality & Usability: In practice, TypeDance makes it easy for a diverse audience to create logo designs in a variety of styles, from simple and modern to creatively avant-garde.


12. OK-Robot: A new robot framework based on open knowledge

OK-Robot: A groundbreaking open-knowledge robotics framework

A team of researchers from New York University and Meta has developed OK-Robot, a new open-knowledge robotics framework. It combines vision-language models (VLMs), navigation primitives, and grasping primitives into an integrated, training-free solution for pick-and-place operations.

With a 58.5% success rate on open-vocabulary pick-and-place tasks, OK-Robot sets a new state of the art in Open Vocabulary Mobile Manipulation (OVMM), nearly 1.8 times better than prior work. In cleaner, less cluttered environments, OK-Robot's success rate rises to 82%, demonstrating its real-world usefulness.

OK-Robot has the following features:

* No training required: OK-Robot can perform pick and place operations without any training, which makes it a very flexible and adaptable tool.

* Open-ended vocabulary: OK-Robot can understand and execute a wide variety of instructions, including those expressed in natural language.

* Visual-language fusion: OK-Robot can combine visual information with verbal instructions for better understanding and execution of tasks.

OK-Robot has a wide range of applications and can be used in many scenarios, including homes, offices, hospitals, and warehouses. It can help people with a wide variety of tasks, including organizing items, cleaning rooms, preparing food, and home delivery.


13. SpatialVLM: Teach visual language models to learn spatial reasoning

Automatically generating 3D spatial VQA datasets to advance spatial reasoning in vision-language models (VLMs).

- The research team developed an automated 3D spatial VQA data-generation framework that produced 2 billion VQA examples from 10 million real-world images.

- SpatialVLM's key asset: an Internet-scale 3D spatial reasoning dataset in metric space.

- Training a VLM on this data greatly improves its ability to perform both qualitative and quantitative spatial VQA.
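To make the data-generation idea concrete, here is a toy QA generator. It assumes object names and metric 3D positions have already been extracted from an image (the hard part the real pipeline automates), and the question templates are illustrative, not the paper's.

```python
import numpy as np

def spatial_qa(name_a, pos_a, name_b, pos_b):
    """Turn two labeled 3D object positions (in metres, camera frame
    with x pointing right) into one quantitative and one qualitative
    spatial VQA pair."""
    pos_a, pos_b = np.asarray(pos_a, float), np.asarray(pos_b, float)
    dist = float(np.linalg.norm(pos_a - pos_b))   # metric distance
    quantitative = (f"How far is the {name_a} from the {name_b}?",
                    f"About {dist:.1f} m.")
    rel = "left of" if pos_a[0] < pos_b[0] else "right of"
    qualitative = (f"Is the {name_a} left or right of the {name_b}?",
                   f"The {name_a} is {rel} the {name_b}.")
    return [quantitative, qualitative]
```

Run over millions of images with many templates, this kind of generator is how a handful of rules can yield billions of QA pairs without human annotation.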


14. WARM: Improving the overall quality and alignment of LLM predictions

The weight-averaged reward model (WARM) is used to solve the reward-hacking problem in reinforcement learning

Background:

In reinforcement learning from human feedback (RLHF), large language models (LLMs) can exploit flaws in reward models (RMs) to obtain seemingly high rewards without achieving the underlying objective, a failure known as "reward hacking".

Method:

The Google research team proposed the weight-averaged reward model (WARM), which improves the reliability and consistency of the reward signal by averaging the weights of multiple fine-tuned reward models into a single RM.
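The averaging step itself is simple. The sketch below is an assumption-laden illustration, with parameter dicts of NumPy arrays standing in for real model checkpoints; it requires that all reward models share one architecture (same parameter names and shapes), which is the setting weight averaging needs.

```python
import numpy as np

def warm_average(reward_model_weights):
    """Average several fine-tuned reward models in weight space.

    reward_model_weights: list of dicts mapping parameter name ->
    ndarray, one dict per fine-tuned RM. Returns one merged dict
    usable as a single reward model."""
    keys = reward_model_weights[0].keys()
    return {k: np.mean([w[k] for w in reward_model_weights], axis=0)
            for k in keys}
```

Note the contrast with ensembling: the merged model costs one forward pass at RL time instead of one per RM, which is what makes weight averaging attractive for RLHF.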

Outcome:

Experiments show that WARM improves the overall quality and consistency of LLM predictions. For example, a policy fine-tuned with RL against WARM achieves a 79.4% win rate over a policy fine-tuned with RL against a single RM.

Conclusion:

WARM effectively mitigates reward hacking in RLHF, improves the quality and consistency of LLM predictions, and provides new ideas and methods for the further development of RLHF.


15. PhotoMaker: A text-to-image model for efficient, personalized portrait photo customization

PhotoMaker: An efficient method for personalized text-to-image generation

Research team: Nankai University, Tencent, and the University of Tokyo

Core Innovations:

* Proposes PhotoMaker, an efficient method for personalized text-to-image generation.

* PhotoMaker uses stacked ID embeddings to preserve ID information as a unified ID representation.

* This embedding is able to fully encapsulate features of the same input ID and accommodate features of different IDs for subsequent integration.
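As a toy illustration of the stacked-ID idea (not PhotoMaker's actual encoder), per-photo ID embeddings can be stacked into a single unified representation whose first dimension grows with the number of input photos, so every photo's features survive for downstream fusion:

```python
import numpy as np

def stacked_id_embedding(per_photo_embeddings):
    """Stack any number of per-photo ID embeddings (each a length-d
    vector) into an (n, d) unified ID representation. Unlike averaging,
    stacking preserves each input photo's features, which is what lets
    later stages both reinforce a single identity and mix several."""
    vecs = [np.asarray(e, dtype=float) for e in per_photo_embeddings]
    return np.stack(vecs, axis=0)
```

In the real model this stacked representation conditions the diffusion process; here the point is only the shape contract: n photos in, an (n, d) embedding out.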

Application prospects:

* PhotoMaker can encode any number of input ID images into the stacked ID embedding, opening up the possibility of more interesting and valuable applications.
