
Google's "Gemini" model 6-minute video was exposed and edited

Google's "Gemini" model 6-minute video was exposed and edited

After Bard's embarrassing debut at the beginning of the year, Google launched its large model Gemini on December 7, Beijing time, and released a series of dazzling demonstration videos. Can Gemini go head-to-head with GPT-4 this time?

The most striking of these demos is a roughly six-minute video in which, as the tester draws, performs magic tricks and more, Gemini comments and interacts in real time. Judging from the video alone, Gemini's comprehension appears to approach human level.

"Judging by the content of the presentation alone, Gemini's video comprehension ability is undoubtedly at the most advanced level at the moment. In an interview with a reporter from the Beijing News Shell Finance, an algorithm engineer of a large model in Beijing said, "This ability comes from the fact that Gemini naturally adds a large amount of video data during training, and supports video understanding in architecture." ”

However, just one day after the launch, many users found in their own tests that Gemini's video comprehension was not as smooth as in the demo. Google soon published a blog post explaining the multimodal interaction process behind the demo video, all but acknowledging that the effect was achieved with static image frames and multiple text prompts. Some netizens also noticed an important disclaimer in the demo video: latency was reduced for the demonstration, and Gemini's outputs were shortened.

Even so, many professionals believe Google has finally launched a large model that can trade blows with OpenAI's. As a veteran artificial-intelligence company with deep resources, Google could make Gemini a strong competitor to GPT.

Where was the video edited? How large is the gap between the demo and reality?

"Have you seen the video demonstration of Google's latest large model? Multimodal switching is a qualitative change, especially when playing game maps, people may not be able to react. On December 7, Mr. Liu, who is engaged in website development, sent a demonstration video to a reporter from Shell Finance.

In this Gemini demo video, which excited many practitioners, the tester takes out a piece of paper and Gemini immediately responds, "You took out a piece of paper." As the tester draws curves and adds color, Gemini keeps up instantly, narrating along with the tester's movements: "You are drawing a curve... it looks like a bird... it's a duck, but blue ducks are uncommon; most ducks are brown. The Chinese word for duck is pronounced 'yazi', and Chinese has four tones." When the tester places a blue rubber duck on a world map, Gemini sees it and immediately says, "This duck has been placed in the middle of the sea; there aren't many ducks there."

The tester then begins to "interact" with Gemini using gestures. When the tester plays rock-paper-scissors, Gemini quickly says, "You're playing rock-paper-scissors." Gemini also guesses the eagle and dog shapes the tester mimics with his hands.

However, the Shell Finance reporter found many traces of editing in this video; in the rock-paper-scissors segment, for example, much of the tester's hand movement was clearly cut out. In response, Google published a blog post to clarify: when shown a picture of the "paper" gesture, Gemini replied "I see a right hand with the palm open and fingers spread apart"; when shown a picture of the "rock" gesture, it replied "a person knocking on a door"; and when shown a picture of the "scissors" gesture, it replied "I see a hand with the index and middle fingers extended." Only when the three pictures are presented together with the question "What do you think I'm doing?" does Gemini reply, "You're playing rock-paper-scissors."
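To make the distinction concrete, the interaction Google describes can be reproduced by sending still images plus a text question to a multimodal model in a single request, rather than streaming live video. The minimal Python sketch below assumes Google's google-generativeai SDK; the API key placeholder, image file names and model name are illustrative only and are not taken from Google's blog post.

    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key="YOUR_API_KEY")  # placeholder key
    model = genai.GenerativeModel("gemini-pro-vision")  # multimodal Gemini endpoint

    # Three still frames of the hand gestures, sent together with one question,
    # mirroring the image-plus-text prompting described in Google's blog post.
    frames = [Image.open(name) for name in ("paper.jpg", "rock.jpg", "scissors.jpg")]
    response = model.generate_content(["What do you think I'm doing?", *frames])
    print(response.text)

Whether the model answers "rock-paper-scissors" depends on it seeing all three frames in one prompt, which is precisely the point the blog post makes.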

So in reality, although Gemini's answers are genuine, the practical experience may not be as smooth as the demo video suggests.

Google's "Gemini" model 6-minute video was exposed and edited

Source: "Gemini" demo video published by Google.

How are multimodal capabilities "refined"?

The demonstration nonetheless convinced many people in the industry that Google has taken a real step toward catching up with OpenAI. Before ChatGPT appeared, Google had long been a leader in artificial intelligence, but ChatGPT's runaway success put it under great pressure, and after Bard, launched in February this year as a ChatGPT rival, stumbled at its debut, Google lacked a large model impressive enough to boost morale.

With Gemini, Google has at least shown distinctive strengths in multimodal understanding. "Gemini is a natively multimodal large model, that is, it was multimodal from training. Google already has a strong ecosystem in search, long-form video, online documents and so on. Google also has a large number of graphics cards, and its computing power is several times that of OpenAI's," a large-model practitioner who studied automation at Tsinghua University told the Shell Finance reporter.

Specifically, the Gemini model comes in three versions: Gemini Ultra, the largest and most capable; Gemini Pro, intended for a wide range of tasks; and Gemini Nano, intended for specific tasks and mobile devices.

Beyond multimodality, Gemini also performs well in text comprehension, coding and other areas. On the MMLU (massive multitask language understanding) benchmark, Gemini Ultra not only surpasses GPT-4 but even exceeds human experts. When the Shell Finance reporter visited the Google DeepMind website, the homepage prominently declared Gemini "our most capable large model."

At present, users can try Gemini Pro through Google Bard, though the Shell Finance reporter found that this capability is available only in some regions. Tests by overseas netizens show that users can feed Gemini both images and text, and the results suggest that Gemini Pro and GPT-4V, which also has multimodal capabilities, each have their own strengths; Gemini Pro is by no means crushed by GPT-4V.

"According to my observations, Gemini's ability in text is still slightly inferior to GPT4 at the moment, but Google's technical strength is still in the first echelon. The above-mentioned large model algorithm engineer said.

He told the Shell Finance reporter that giving a large model the "multimodal ability" to understand images, video and audio can technically be viewed as extending the image-understanding module of LLaVA (an open-source multimodal model) to video and speech, and adding extra video and audio data during training. "In effect, Gemini is the first to build video and speech understanding into a large model, verifying the feasibility of both."
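That extension can be pictured as a LLaVA-style pipeline in which each sampled video frame is encoded and projected into the language model's embedding space, so that video simply becomes additional tokens in the prompt. The toy PyTorch sketch below illustrates only this idea; the tiny dimensions, token counts and the linear stand-in for a real vision encoder are assumptions chosen for readability, not Gemini's or LLaVA's actual components.

    import torch
    import torch.nn as nn

    class VideoToTokens(nn.Module):
        """Toy LLaVA-style adapter: turn video frames into LLM-space pseudo-tokens."""
        def __init__(self, vision_dim=64, llm_dim=256, tokens_per_frame=4):
            super().__init__()
            self.tokens_per_frame = tokens_per_frame
            # Stand-in for a frozen image encoder (a real system would use a ViT).
            self.frame_encoder = nn.Linear(3 * 32 * 32, vision_dim * tokens_per_frame)
            # Projector that maps visual features into the language model's embedding space.
            self.projector = nn.Linear(vision_dim, llm_dim)

        def forward(self, frames):  # frames: (num_frames, 3, 32, 32)
            feats = self.frame_encoder(frames.flatten(start_dim=1))          # (F, D*T)
            feats = feats.view(frames.shape[0], self.tokens_per_frame, -1)   # (F, T, D)
            visual_tokens = self.projector(feats)                            # (F, T, llm_dim)
            # Flatten into one sequence that would be prepended to the text tokens.
            return visual_tokens.reshape(-1, visual_tokens.shape[-1])

    frames = torch.randn(8, 3, 32, 32)        # 8 sampled frames from a short clip
    print(VideoToTokens()(frames).shape)      # torch.Size([32, 256]) pseudo-token embeddings

Audio can be handled the same way in principle: a separate encoder turns the waveform into features that the projector maps into the same token space as the text.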

"Overall, the release of Google's large model is in line with expectations, and each technical point of Gemini has been verified in the academic community before, and corresponding papers can be found. In the future, personal assistants are a very attractive scenario, and compared with large language models, multimodal large models can play an assistant that can hear, see, speak, and draw, more like a human being. The large model algorithm engineer told the Shell financial reporter.


Beijing News Shell Financial Reporter Luo Yidan

Edited by Yue Caizhou

Proofread by Liu Baoqing
