
Renmin University of China's multimodal model achieves autonomous updating for the first time, and its photo-to-video generation outperforms Sora

Author: ifanr

AGI (Artificial General Intelligence) is the holy grail of the entire AI industry.

Ilya Sutskever, former chief scientist at OpenAI, put it this way last year: "As long as you can predict the next token very well, you can help humanity reach AGI."

Geoffrey Hinton, Turing Award winner and known as the father of deep learning, and Sam Altman, CEO of OpenAI, both believe that AGI will come within a decade or even sooner.

AGI is not the end, but a new beginning in the history of human development. There's a lot more to consider on the road to AGI, and China's AI industry is a force to be reckoned with.

At the artificial general intelligence parallel forum of the Zhongguancun Forum held on April 27, Zhizi Engine, a startup incubated at Renmin University of China, officially released its new multimodal large model Awaker 1.0, taking a crucial step towards AGI.

Compared with Zhizi Engine's previous-generation ChatImg series models, Awaker 1.0 adopts a new MoE architecture and can update itself autonomously, making it the industry's first multimodal large model to achieve "true" autonomous updating. On the visual generation side, Awaker 1.0 uses VDT, a fully self-developed video generation base, which achieves better results than Sora in photo-to-video generation and aims to break the "last mile" deployment dilemma of large models.


Awaker 1.0 is a multimodal large model that deeply unifies visual understanding and visual generation. On the understanding side, Awaker 1.0 interacts with the digital world and the real world, feeding the scene and behavior data it collects while performing tasks back into the model so that it can be continuously updated and trained. Most importantly, because it is "truly" capable of autonomous updating, Awaker 1.0 suits a wide range of industry scenarios and can solve more complex practical tasks, such as AI agents, embodied intelligence, integrated governance, and security inspection.


Awaker's MoE Base Model

On the understanding side, Awaker 1.0's base model mainly tackles the serious conflicts that arise in multimodal multi-task pre-training. Thanks to a carefully designed multi-task MoE architecture, the base model can not only inherit the fundamental capabilities of ChatImg, Zhizi Engine's previous-generation model, but also learn the capabilities specific to each multimodal task. Compared with ChatImg, Awaker 1.0's base model has improved substantially across multiple tasks.
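The article does not disclose the internal design of Awaker's base model, but the general idea of a multi-task MoE can be illustrated with a minimal sketch. The PyTorch code below is an illustration under assumptions, not Zhizi Engine's implementation: a router conditioned on a hypothetical task embedding sends each token to a few experts, so different multimodal tasks can lean on different experts and interfere less with one another during pre-training.

```python
# Minimal multi-task MoE sketch (illustrative only, not Awaker's actual architecture).
import torch
import torch.nn as nn


class MultiTaskMoELayer(nn.Module):
    def __init__(self, d_model: int, n_experts: int, n_tasks: int, top_k: int = 2):
        super().__init__()
        # Each expert is an ordinary feed-forward block.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        self.task_embed = nn.Embedding(n_tasks, d_model)  # hypothetical task conditioning
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor, task_id: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); task_id: (batch,) long tensor identifying the task.
        gate_in = x + self.task_embed(task_id)[:, None, :]   # condition routing on the task
        weights, idx = self.router(gate_in).topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                chosen = idx[..., k] == e                     # tokens routed to expert e
                if chosen.any():
                    out[chosen] += weights[..., k][chosen].unsqueeze(-1) * expert(x[chosen])
        return out
```

Because experts that a task never routes to receive no gradient from that task, this kind of routing is one plausible way to reduce the cross-task conflicts the article mentions.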

In view of the data-leakage problems that plague mainstream multimodal benchmarks, Zhizi Engine said it built its own evaluation set under strict standards, with most of the test images drawn from personal mobile phone albums. On this evaluation set, Awaker 1.0 and three of the most advanced multimodal large models at home and abroad were compared in a fair manual evaluation; the detailed results are shown in the table below. Note that GPT-4V and Intern-VL do not directly support detection tasks, so their detection results were obtained by asking the models to describe object locations in language.

[Table: manual evaluation results of Awaker 1.0, GPT-4V, Qwen-VL-Max, and Intern-VL on Zhizi Engine's self-built evaluation set]

We can see that Awaker 1.0's base model surpasses GPT-4V, Qwen-VL-Max, and Intern-VL on visual question answering and business application tasks, and achieves second-best results on description, reasoning, and detection tasks. Overall, Awaker 1.0's average score exceeds those of the three most advanced models at home and abroad, validating the effectiveness of the multi-task MoE architecture. Below are a few specific comparative examples.

[Figure: comparison examples covering counting, OCR, and detailed description tasks]

As these comparison examples show, Awaker 1.0 answers the counting and OCR questions correctly, while the other three models all answer incorrectly or only partially correctly. On detailed description, Qwen-VL-Max is prone to hallucination, and Intern-VL can describe the overall content of the picture but is less accurate and specific on some details. GPT-4V and Awaker 1.0 can not only describe the image content in detail but also accurately identify details within it, such as the Coca-Cola shown here.

Awaker+ Embodied Intelligence: Towards AGI

The combination of multimodal large models and embodied intelligence is very natural, because the visual understanding ability of multimodal large models pairs directly with the cameras of embodied agents. In the field of artificial intelligence, "multimodal large model + embodied intelligence" is even considered a feasible path to artificial general intelligence (AGI).

On the one hand, it is expected that embodied intelligence will be adaptable, that is, the agent can adapt to the changing application environment through continuous learning, and can not only do better on known multimodal tasks, but also quickly adapt to unknown multimodal tasks. On the other hand, embodied intelligence is also expected to be truly creative, with the hope that it will be able to discover new strategies and solutions and explore the boundaries of AI's capabilities through autonomous exploration of the environment. By using multimodal large models as the "brains" of embodied intelligence, it is possible to greatly improve the adaptability and creativity of embodied intelligence, and eventually approach the threshold of AGI (or even achieve AGI).

However, existing multimodal large models have two obvious problems: first, their iteration and update cycles are long and demand heavy manpower and financial investment; second, they are trained only on existing data and cannot continuously acquire large amounts of new knowledge. Although newly emerging knowledge can be injected through RAG and long context windows, the multimodal large model itself never actually learns it, and both remedies introduce additional problems. In short, current multimodal large models lack real adaptability in practical application scenarios, let alone creativity, which causes all kinds of difficulties when deploying them in industry.

[Figure: framework diagram of Awaker 1.0's autonomous update mechanism]

Awaker 1.0, released by Zhizi Engine, is the world's first multimodal large model with an autonomous update mechanism and can serve as the "brain" of an embodied agent. Its autonomous update mechanism includes three key technologies: active data generation, model reflection and evaluation, and continuous model update.

Unlike other multimodal large models, Awaker 1.0 is "live": its parameters can be continuously updated in real time. As the framework diagram above shows, Awaker 1.0 can be combined with various smart devices, observe the world through them, generate action intentions, and automatically construct instructions that control the devices to complete various actions. From these actions and the feedback they produce, Awaker 1.0 obtains effective training data for continuous self-updating, steadily enhancing the model's capabilities.
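The article describes this loop only at a high level. The sketch below is schematic, with hypothetical method names rather than Awaker's real API, showing how one cycle of the three key technologies (active data generation, model reflection and evaluation, continuous model update) could fit together.

```python
# Schematic self-update loop (hypothetical interfaces, not Awaker's real API).

def autonomous_update_step(model, device):
    observation = device.capture()                  # observe the world through a smart device
    intent = model.propose_action(observation)      # active data generation: the model decides what to do
    feedback = device.execute(intent)               # control the device and collect feedback

    # Model reflection and evaluation: keep only samples the model judges useful.
    sample = {"obs": observation, "action": intent, "feedback": feedback}
    if model.reflect_and_score(sample) > 0.5:
        model.continual_update([sample])            # continuous update: write the experience into the weights


def run(model, device, steps=1000):
    for _ in range(steps):
        autonomous_update_step(model, device)
```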

Take new knowledge injection as an example: Awaker 1.0 can continuously learn the latest news on the Internet and answer complex questions by drawing on the newly learned news. Unlike the traditional RAG and long-context approaches, Awaker 1.0 genuinely learns the new knowledge and "remembers" it in the model's parameters.
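To make the contrast concrete, the sketch below (with invented helper names) shows the difference: with RAG or long context the new text only enters the prompt at inference time, whereas a parameter update writes the new knowledge into the weights through a training step, so the model can later answer without retrieving anything.

```python
# Illustrative contrast between RAG and parameter updates (hypothetical helpers).

def answer_with_rag(model, retriever, question):
    # New knowledge stays outside the weights and must be fetched every time.
    passages = retriever.search(question, k=5)
    prompt = "\n".join(passages) + f"\n\nQ: {question}\nA:"
    return model.generate(prompt)


def learn_news(model, optimizer, news_batch):
    # New knowledge is written into the weights themselves, e.g. via a
    # next-token-prediction loss on the day's news articles.
    loss = model.language_modeling_loss(news_batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```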

[Figure: Awaker 1.0 learning each day's news over three consecutive days of self-updates and answering related questions]

As the example above shows, over three consecutive days of self-updates Awaker 1.0 learns each day's news and accurately recalls the corresponding information when answering questions. At the same time, Awaker 1.0 does not forget what it has already learned during this continuous learning; for example, knowledge about the Zhijie S7 is still remembered and understood two days later.

Awaker 1.0 can also be combined with a variety of smart devices to achieve cloud-edge collaboration: Awaker 1.0 is deployed in the cloud as the "brain" that controls edge smart devices as they perform their tasks, and the feedback those devices receive while performing tasks is continuously transmitted back to Awaker 1.0, so that it keeps obtaining training data and updating itself.
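A rough sketch of such a cloud-edge loop is shown below, again with illustrative rather than real interfaces: edge devices execute instructions and push their feedback to a queue, and the cloud-side "brain" batches that feedback to update itself, reusing the same continual-update idea as above.

```python
# Illustrative cloud-edge collaboration loop (names and interfaces are assumptions).
import queue

feedback_queue = queue.Queue()


def edge_worker(device, cloud_brain):
    task = cloud_brain.next_instruction(device.status())   # cloud decides what the device should do
    result = device.run(task)
    feedback_queue.put(result)                              # stream execution feedback back to the cloud


def cloud_update_loop(cloud_brain, batch_size=32):
    batch = []
    while True:
        batch.append(feedback_queue.get())
        if len(batch) >= batch_size:
            cloud_brain.continual_update(batch)             # cloud model keeps updating itself
            batch.clear()
```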


This cloud-edge collaboration approach has already been applied in scenarios such as intelligent power grid inspection and smart cities, where it achieves far better recognition results than traditional small models and has been well received by industry customers.


Real-world simulator: VDT

The generation side of Awaker 1.0 is VDT, a Sora-like video generation base independently developed by Zhizi Engine that can serve as a simulator of the real world. VDT's research results were posted on arXiv in May 2023, ten months before OpenAI released Sora, and the VDT paper has been accepted by ICLR 2024, a top international artificial intelligence conference.


The innovations of the video generation base VDT mainly include the following aspects:

  • It applies the Transformer architecture to diffusion-based video generation, demonstrating the Transformer's great potential in this field. VDT's strength is its excellent ability to capture temporal dependencies, enabling it to generate temporally coherent video frames, including simulating the physical dynamics of 3D objects over time.
  • It proposes a unified spatiotemporal mask modeling mechanism that enables VDT to handle a wide variety of video generation tasks. VDT's flexible handling of conditional information, such as simple concatenation in token space, effectively unifies inputs of different lengths and modalities. Combined with the spatiotemporal mask modeling mechanism, VDT becomes a general-purpose video diffusion tool that can be applied to unconditional generation, future frame prediction, frame interpolation, image-to-video generation, and video completion without modifying the model structure (see the sketch after this list).
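As an illustration of this mask-based unification (not the released VDT code; shapes and interfaces are assumed), a single per-frame mask can express several of the tasks above, while condition tokens are simply concatenated with the frame tokens inside the model:

```python
# Sketch of spatiotemporal mask modeling for a video diffusion transformer (illustrative only).
import torch


def make_frame_mask(task: str, n_frames: int) -> torch.Tensor:
    mask = torch.ones(n_frames, dtype=torch.bool)    # True = frame to be generated
    if task == "future_prediction":
        mask[: n_frames // 2] = False                # first half is given, predict the rest
    elif task == "interpolation":
        mask[0] = mask[-1] = False                   # first and last frames are given
    elif task == "image_to_video":
        mask[0] = False                              # a single photo conditions the whole clip
    # "unconditional": every frame stays True and is generated from noise
    return mask


def denoise_step(vdt, noisy_frames, clean_frames, mask, cond_tokens, t):
    # Keep the given frames clean and only denoise the masked ones; condition
    # tokens are appended in token space, unifying inputs of different lengths.
    frames = torch.where(mask[None, :, None, None, None], noisy_frames, clean_frames)
    return vdt(frames, cond_tokens, t)               # hypothetical model call
```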

The Zhizi Engine team focused on exploring VDT's simulation of simple physical laws and trained VDT on the Physion dataset. In the examples, VDT successfully simulates physical processes such as a ball moving along a parabolic trajectory and a ball rolling across a plane and colliding with other objects. The second example in row 2 also suggests that VDT captures the ball's velocity and momentum, because the ball fails to knock over the pillar when its impact is insufficient. This demonstrates that the Transformer architecture can learn certain physical laws.

The team also explored the photo-to-video generation task in depth. This task places very high demands on the quality of the generated video, because we are naturally more sensitive to the dynamics of faces and human figures. Given the particularity of the task, researchers need to combine VDT (or Sora) with controllable generation to meet the challenges of photo-to-video generation. At present, Zhizi Engine has broken through most of the key technologies for photo-to-video generation and achieves better photo-to-video quality than Sora. Zhizi Engine will continue to optimize its controllable portrait generation algorithms and is actively exploring commercialization; it has already identified a definite commercial deployment scenario and expects to break the "last mile" deployment dilemma of large models in the near future.


In the future, a more general VDT will be a powerful tool for solving the data-sourcing problem of multimodal large models: through video generation, VDT will be able to simulate the real world, further improving the efficiency of visual data production and supporting the autonomous updating of the Awaker multimodal large model.

Epilogue

Awaker 1.0 is a key step for the Zhizi Engine team towards its ultimate goal of achieving AGI. Zhizi Engine told APPSO that the team believes AI's capacity for self-exploration and self-reflection is an important criterion for judging its level of intelligence, no less important than continually scaling up parameters (the Scaling Law).

Awaker 1.0 implements key technical frameworks such as active data generation, model reflection and evaluation, and continuous model updating, and achieves breakthroughs on both the understanding and generation sides. It is expected to accelerate the development of the multimodal large model industry and, ultimately, help humanity realize AGI.
