A multimodal assistant comparable to GPT-4o, AI overview search results, and a new video generation model, Veo

Author: New List

Facing OpenAI's head-on challenge, Google chose to fight back on the spot.

At 1 a.m. today (Beijing time), Google unveiled its big move at the 2024 I/O conference:

Project Astra, a more powerful multimodal agent that can understand the world inside and outside the camera frame in real time.

Multimodality and long context are the keywords of this release. Google CEO Sundar Pichai said that combining multimodality with long context expands both the kinds of questions we can ask and the kinds of answers we can get.

The Gemini model family is pushing hard on long context: the 1.5 Pro context window will expand to 2 million tokens, and the newly released Flash is a lightweight model priced at 35 cents per 1 million tokens, well below GPT-4o's $5.
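The price gap is easiest to feel with a quick back-of-envelope calculation. A minimal sketch, assuming both quoted figures are input-token prices per million tokens; the monthly volume below is a made-up illustration, not a reported number:

```python
# Back-of-envelope cost comparison for the per-token prices quoted above.
# Assumptions: both figures are input-token prices per 1M tokens;
# the monthly token volume is hypothetical.

PRICE_PER_M_TOKENS = {
    "gemini-1.5-flash": 0.35,  # USD per 1M tokens (quoted above)
    "gpt-4o": 5.00,            # USD per 1M tokens (quoted above)
}

def monthly_cost(model: str, tokens_per_month: int) -> float:
    """Cost in USD for a given monthly input-token volume."""
    return PRICE_PER_M_TOKENS[model] * tokens_per_month / 1_000_000

tokens = 50_000_000  # hypothetical: 50M tokens per month
flash = monthly_cost("gemini-1.5-flash", tokens)
gpt4o = monthly_cost("gpt-4o", tokens)
print(f"Flash: ${flash:.2f}, GPT-4o: ${gpt4o:.2f}, ratio: {gpt4o / flash:.1f}x")
```

At these list prices the ratio works out to roughly 14x in Flash's favor, regardless of volume.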

Google's product family, now powered by Gemini, has also been handsomely upgraded: Google Search supports asking questions via video, and an "AI Overviews" results page is rolling out; Android phones get a built-in AI assistant, and you can search anything on screen by circling it.

On the AIGC front, Google released Imagen 3, a more realistic image generation model, and Veo, a new video generation model designed to generate high-definition video longer than 60 seconds.

Pichai said that with 2 billion users already served, the Gemini era has just begun, and Google hopes to ultimately make AI work for everyone.

"The Number One AI Player" watched the entire live stream; here are the key points of the keynote we've compiled.

The multimodal agent is coming,

Gemini takes direct aim at GPT-4o

The long-awaited AI agent is finally here.

At the I/O conference, Google shared its new Project Astra, an AI assistant every bit a match for GPT-4o that can understand the complex world around us like a human and provide real-time assistance in daily life.

For example, if you pan your camera around the office, the AI can recognize objects in the frame, interpret code on a screen, and even determine your location.

The official demo video also showed Astra paired with AR glasses, one of the highlights of the presentation. With the glasses on, Astra's answers are displayed in real time; when helping revise a flowchart on a whiteboard, for example, arrows point out where to make changes.

That said, GPT-4o, launched by OpenAI the day before, showed more striking emotional interaction in its demo, though netizens were quick to quip, "It feels like OpenAI wants to build everyone their own personal sycophant."

When Google first unveiled Gemini, its multimodal interaction demo video turned out to have been edited; this time, the Astra video pointedly emphasizes that it was "shot in real time in a single take."

An agent is an intelligent system that understands multimodal information, plans multiple steps ahead, and takes action on the user's behalf. Judging from the demo, Astra has low latency, fast responses, and natural interaction, like an expert assistant at your side.

In addition, Google has also announced the latest progress of the Gemini series of models.

Gemini 1.5 Pro's context window will expand to 2 million tokens, enough to process hundreds of pages of documentation, and is available to developers in private preview.
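For a sense of what that window means in pages, here is a rough sketch; the words-per-page and tokens-per-word figures are common rules of thumb for plain English text, not Google's numbers, and dense or image-heavy technical documents can run several times higher per page:

```python
# Back-of-envelope: pages of plain text that fit in a context window.
# Assumptions (rules of thumb, not official figures):
#   ~500 words per single-spaced page, ~4/3 tokens per English word.

WORDS_PER_PAGE = 500      # typical single-spaced page
TOKENS_PER_WORD_NUM = 4   # ~4 tokens ...
TOKENS_PER_WORD_DEN = 3   # ... per 3 words

def pages_in_window(context_tokens: int) -> int:
    """Approximate page count a context window can hold."""
    # tokens per page ~= 500 * 4/3 ~= 667
    return round(context_tokens * TOKENS_PER_WORD_DEN
                 / (TOKENS_PER_WORD_NUM * WORDS_PER_PAGE))

print(pages_in_window(1_000_000))  # ~1,500 pages
print(pages_in_window(2_000_000))  # ~3,000 pages
```

Under these assumptions a 2-million-token window holds on the order of a few thousand plain-text pages, so "hundreds of pages" of denser documentation fits comfortably.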

Gemini Advanced, available to users around the world, offers a context window of up to 1 million tokens and supports more than 35 languages.

Gemini Advanced will roll out new data-analysis features in a few weeks, with trip planning to follow later, using advanced reasoning to create personalized itineraries.

Gemini 1.5 Flash is a new lightweight model optimized for low-latency, low-cost tasks and more efficient deployment. Developers can use it today in Google AI Studio and Vertex AI, with a context window of up to 1 million tokens.

Gemini's Gems feature, arriving this summer, is similar to OpenAI's GPTs: through prompts, you can set up AI assistants with different specialties.

Meanwhile, as a natively multimodal model, Gemini has upgraded voice and video capabilities, and the upcoming "Live" feature is clearly benchmarked against GPT-4o.

You'll be able to have deeper, two-way conversations with Gemini, interrupt it at any time mid-answer, and turn on the camera so Gemini can see and understand what's happening around you. The ideal AI assistant, it seems, carries shades of the movie "Her."

Image and video search added,

AI summarizes the web with one click

As AI products such as ChatGPT and Copilot sweep the world, the way users obtain information is quietly changing.

When searching, you can now ask questions through video, for example recording a clip and asking, "Why can't I put this on?"

Gemini understands that the record in the video won't stay on the turntable, quickly searches articles, forums, videos, and other sources across the web, and offers a solution.

Beyond the traditional list of results, Google Search powered by Gemini will also launch a new results-integration feature called "AI Overviews."

In the live demo, for example, a search for the best yoga or Pilates studios in Boston returned their membership offers and the distance from a given address right in the results.

Gemini takes all the information from a single search and organizes it into a well-structured results page.

According to the presentation, Google Search now supports multi-step reasoning: it can break a large problem into parts and work out which sub-problems to solve in what order, so results that might otherwise take minutes or even hours can be delivered in seconds.

The AI Overviews feature will launch first in the U.S. and is expected to reach 1 billion users in the future.

In addition, on mobile, Google Photos will soon launch a new feature called "Ask Photos."

By circling part of an image, you can search for a specific object, such as a photo containing your license plate number; or ask "How has my daughter's swimming progressed lately?" and Gemini finds the corresponding photos and videos by understanding complex context.

New AIGC models arrive,

Veo can generate HD video over 60 seconds long

Across images, music, and video, Google released new models or products in every category.

Image generation

Google launched Imagen 3, its highest-quality text-to-image model to date, which produces more detailed, more realistic images and understands complex text prompts.

Generated by Imagen 3

Music generation

Google and YouTube have jointly created the Music AI Sandbox, a set of professional AI music-creation tools that help creators go from zero to a finished piece quickly.

Video generation

Google also released Veo, its latest video generation model, which can create high-quality 1080p footage longer than 60 seconds from a single text, image, or video prompt, in a variety of cinematic styles including realism, surrealism, and animation. Perhaps everyone will be a director someday.

All of the AIGC models above are currently open for trial requests at labs.google.

AI upgrades across the Google product suite,

Android phones are the first to get a built-in AI assistant

Unsurprisingly, the upgraded Gemini 1.5 Pro will be integrated into more Google products, including software such as email, meetings, and documents, as well as hardware such as phones.

For example, Gemini in Gmail can summarize email threads with one click; instead of digging through multiple emails and attachments, you can let Gemini analyze the context and suggest replies.

Ask Gemini a question in Sheets and there's no need for laborious formulas: the AI automatically analyzes the data and presents the results as charts.

For AI-centric phones, Google highlighted three key applications: AI-driven search (the circle-to-search feature mentioned above), a system-level built-in AI assistant (already available on Android), and AI-powered privacy and security (flagging fraud risks).

Google said that later this year it will expand Gemini Nano's multimodal capabilities to include visual, sound, and spoken input, meaning AI phones can better help visually impaired users and others communicate and live their lives.

In a recent interview on The Circuit with Emily Chang, Pichai said that in the tech world, any company that fails to keep innovating and stay ahead of the curve will inevitably fall behind.

Artificial intelligence has been a central focus at Google since 2016; its researchers invented the Transformer, the "T" in GPT. At that time, OpenAI, the developer of ChatGPT, was still in its infancy.

In today's era of generative AI, Google has repeatedly been upstaged by OpenAI and faces stiff challenges from competitors such as Microsoft.

Judging from the sheer number of new models and product upgrades in this year's I/O keynote, Google is still holding firmly to its AI-first strategy.
