Produced | Microjam (WJAM123456)
Author | Chen Chumu
Google's recent progress on large AI models has drawn plenty of attention.
But just as everyone was eagerly waiting to see how Google would stage its comeback, OpenAI, the reigning heavyweight of the large-model field, dropped another piece of big news.
According to The Information, OpenAI is about to launch a multimodal model called GPT-Vision, and the article's headline bluntly frames it as a counterpunch at Google.
Although the new version has not actually arrived, it is enough to give us a glimpse of the next battleground in this race: multimodality.
01# How far along is GPT-5?
According to The Information, OpenAI is preparing to launch GPT-Vision on top of GPT-4. That amounts to stacking buffs on GPT-4, polishing it one squeeze of toothpaste at a time; either way, the widely acknowledged number-one AI today is still GPT-4.
The report also mentioned that OpenAI may follow GPT-Vision with a large model codenamed "Gobi". Unlike GPT-4, the supposedly "more powerful" Gobi is being designed from the ground up as a multimodal model.
Outside observers have pegged this new model as a strong candidate for GPT-5, largely because few people believed the earlier denial from OpenAI CEO and co-founder Sam Altman at an MIT event:
We are not training right now and will not be training GPT-5 anytime soon.
Sam Altman responding to GPT-5 rumors at MIT
After all, that statement was mainly a response to the open letter "Pause Giant AI Experiments". On March 29, thousands of people in the tech community, including Tesla CEO Elon Musk, Apple co-founder Steve Wozniak, and Turing Award winner Yoshua Bengio, signed a call for a six-month moratorium on developing AI systems more powerful than GPT-4, to buy time to address AI safety and ethics.
Then, earlier this month, Mustafa Suleyman, DeepMind co-founder and now CEO of Inflection AI, said in an interview that he believes OpenAI is secretly training GPT-5. Suleyman put what most people suspected on the table, and the pressure landed back on OpenAI.
Screenshot from the interview: Inflection AI CEO Mustafa Suleyman on the risks of artificial intelligence
But it is probably too early to talk about GPT-5, as OpenAI has not responded to the reports. Beyond the fact that the new model codenamed Gobi may be the rumored GPT-5, we know nothing else; according to foreign media sources, OpenAI does not even appear to have started training Gobi.
By comparison, there is more to go on with GPT-Vision.
Much of the current speculation holds that GPT-Vision is the multimodal capability demonstrated at GPT-4's launch event in March, when GPT-4 stunned the world by generating working web code from a simple handwritten sketch.
Presentation at the GPT-4 launch in March
But after the initial amazement, apart from being offered to Be My Eyes, a company that builds technology for the blind and visually impaired, there has been no sign of the feature in subsequent updates or real-world use, nor of related capabilities such as text-to-image generation.
The reason may be inferred from a July New York Times report: OpenAI worried the feature could be abused for purposes such as facial recognition. Combined with Sam Altman's earlier rebuttal, that OpenAI was busy addressing GPT-4 safety issues the open letter had overlooked, a solution to those security concerns may already exist.
That, in turn, suggests the restriction is likely to be lifted.
According to The Information, OpenAI hopes to make image understanding broadly available under the name "GPT-Vision", which would open up many new image-based applications for GPT-4, such as generating text that describes a picture.
There are also rumors that DALL-E 3 is in development and may be integrated into ChatGPT or GPT-4. Both it and GPT-Vision could be announced at OpenAI's developer conference on November 6; as OpenAI CEO Sam Altman put it:
There will be "great things", albeit not as big as GPT-4.5 or GPT-5.
Overall, even though GPT-5 has yet to arrive, GPT-4 is pushing hard into multimodality, and a new round of AI reshaping our sense of what technology can do may not be far off.
02# OpenAI and Google go head to head
Coverage of OpenAI's new move has been strikingly consistent across Chinese and foreign media: the move is aimed squarely at Google's Gemini.
On September 14, media reports citing three directly informed sources said Google has provided an early version of Gemini to a small number of companies and is selling it to enterprises through its cloud computing services. Google is also considering building it into consumer services, which means Gemini's release may be imminent.
Gemini is billed as Google's magnum opus. Word of it has trickled out since April this year; the project's participants include DeepMind co-founder Demis Hassabis and other heavyweights, and Google co-founder Sergey Brin has personally joined in training Gemini.
Late last month, SemiAnalysis analysts Dylan Patel and Daniel Nishball revealed more about it.
Based on the available information, here is what we know about Gemini:
1. The original Gemini was likely trained on TPUv4, using a smaller number of chips to ensure reliability and hot-swappability. Training has now begun on TPUv5 pods, with roughly five times the compute used to train GPT-4.
2. Gemini's training data includes 9.36 billion minutes of YouTube video captions, and the total dataset is about twice the size of GPT-4's.
3. Gemini consists of a set of large language models and may use an MoE (mixture-of-experts) architecture along with speculative sampling: small models draft tokens in advance, and the large model then verifies them, improving overall inference speed.
4. Gemini supports functions such as chatbots, summarizing text, and generating original text (email drafts, song lyrics, news articles) and original images.
5. Gemini can help engineers write code; Google hopes it will improve developers' code-generation tools enough to catch up with Microsoft's GitHub Copilot, which relies on OpenAI.
6. Google employees have also discussed using Gemini for tasks such as chart analysis, for example asking the model to explain what a chart means, and for navigating the web or other software via text or voice commands.
7. Gemini comes in different sizes, so developers can buy a simplified version for simple tasks, including a version small enough to run on personal devices.
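The speculative sampling mentioned in point 3 can be sketched in a few lines. This is only an illustration of the general draft-and-verify idea, not Google's implementation: the two toy "models" below are hypothetical stand-ins for a cheap draft model and an expensive target model, each returning a fixed distribution over a three-token vocabulary.

```python
import random

# Hypothetical stand-ins for real models: each maps a context to a
# probability distribution over a tiny vocabulary.
VOCAB = ["a", "b", "c"]

def draft_model(context):
    # Cheap draft model: fast but slightly off-target.
    return {"a": 0.6, "b": 0.3, "c": 0.1}

def target_model(context):
    # Expensive target model: the distribution we actually want.
    return {"a": 0.5, "b": 0.4, "c": 0.1}

def speculative_step(context, k, rng):
    """Draft k tokens cheaply, then verify each with the target model
    using the standard accept/reject rule. Returns the accepted tokens
    (plus one corrected token on the first rejection)."""
    # 1) Draft phase: sample k tokens from the cheap model.
    drafted, ctx = [], list(context)
    for _ in range(k):
        p = draft_model(tuple(ctx))
        tok = rng.choices(VOCAB, weights=[p[v] for v in VOCAB])[0]
        drafted.append(tok)
        ctx.append(tok)

    # 2) Verify phase: accept each draft token with prob min(1, p/q).
    accepted, ctx = [], list(context)
    for tok in drafted:
        q = draft_model(tuple(ctx))[tok]   # draft probability
        p = target_model(tuple(ctx))[tok]  # target probability
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)           # draft agrees well enough
            ctx.append(tok)
        else:
            # Rejected: resample from the residual max(0, p - q),
            # renormalized, so the output still follows the target.
            pt, qt = target_model(tuple(ctx)), draft_model(tuple(ctx))
            resid = {v: max(0.0, pt[v] - qt[v]) for v in VOCAB}
            z = sum(resid.values())
            accepted.append(
                rng.choices(VOCAB, weights=[resid[v] / z for v in VOCAB])[0]
            )
            break
    return accepted
```

The speedup comes from the verify phase: the expensive model scores all drafted tokens in one batch, so several tokens can be accepted per expensive call instead of one.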
It is worth noting that Gemini has one advantage over GPT-4: beyond publicly available online data, Google can draw on a large amount of proprietary data from its own consumer products. Accordingly, some observers believe:
The model should be especially accurate at understanding user intent on a given query, and it appears to produce fewer wrong answers (that is, hallucinations).
Although Gemini has not yet appeared, many people are already optimistic. The aforementioned article by Dylan Patel and Daniel Nishball makes a similar point:
The statement that may not be obvious is that the sleeping giant, Google, has woken up, and they are iterating on a pace that will smash GPT-4 total pre-training FLOPS by 5x before the end of the year.
Every item about Gemini gets compared to GPT-4, which is, of course, inevitable: before ChatGPT arrived, it was Google that led the field in AI.
So the public consensus is:
The point here is Google had all the keys to the kingdom, but they fumbled the bag.
Against this backdrop, Google has to work even harder to prove, quickly, that it can still score points in AI. It has chosen to strike first, trying to plant its flag on the multimodal high ground before OpenAI ships a true multimodal model. OpenAI, of course, has no intention of letting Google pull ahead, which is exactly what GPT-Vision and Gobi are for.
This also points to the focus of the next stage of AI competition: the multimodality every company is now racing toward. Text-only generative AI is no longer novel; however intelligent it gets, it will remain in ChatGPT's shadow.
Today, though, the AI battlefield is no longer two armies facing off; Google and OpenAI are merely the most conspicuous giants in a melee.
Both also need to turn a profit, so their large-model projects carry a commercial component, such as offerings aimed at enterprise customers. Latecomer Meta, however, has blazed a different path: open source, a steady stream of new releases, and a bet on scale and free access.
It is hard to say whether cost alone will push everyone toward Meta.
The AI melee, you could say, has reached a white-hot stalemate. As for who breaks out next, let the bullets fly a while longer.