
Google's answer to GPT-4o: "AI" mentioned 121 times

Author: Love Fan'er

After OpenAI released GPT-4o last night, the pressure was on Google I/O, as if Google just couldn't shake the title of "Wang Feng of the AI world" (the Chinese meme for someone whose news is always upstaged).

Google, for its part, said "AI" 121 times over a nearly two-hour keynote and launched more than 10 new products and upgrades.


Let's first run through the highlights of the keynote in one pass; read on for a closer look at each feature.

Highlights of the press conference:

  • Google Search AI: launched AI Overviews, an upgraded AI search-summary feature, with multi-step reasoning now rolling out.
  • Gemini large models: Gemini 1.5 Flash (1 million-token context); Gemini 1.5 Pro (2 million-token context).
  • Gemma large models: released the open-source models PaliGemma (multimodal) and Gemma 2.
  • AI in Google Workspace: uses Gemini's capabilities, surfaced as a Side Panel, to tie Google's products together.
  • Gemini app: the mobile Gemini app, which will soon support video conversations with the AI, ships in the coming weeks.
  • Project Astra: the latest multimodal AI assistant project, alongside generative models for images, music, and video: Imagen 3, Music AI Sandbox, and Veo.

Opening with Search, Google's signature product

Google Search is one of Google's biggest areas of investment and innovation, and it is the company's founding product.


Twenty-five years ago, Google launched Search; tonight, it is pushing the boundaries of search again.

In short, with generative AI built in, Google Search can do much more:

Whatever you're thinking, whatever you need to get done, just ask, and Google Search will find it.

And all the evolution of Google Search is based on the Gemini model that is customized for it.

According to Google's announcement, there are three unique advantages of Google Search:

  • Google's real-time information includes more than a trillion facts about people, places, and things
  • Unmatched ranking and quality systems
  • The power of Gemini

Bringing these three things together unlocks a whole new level of Google's capabilities in search.

The first new feature is AI Overviews, which puts summaries generated by large AI models at the top of search results, simplifying the whole search flow and making complex queries easy to resolve.


By the end of the year, more than a billion people will be using AI Overviews in Google Search; Google calls it one of the biggest updates to its search engine in 25 years.

Multi-step reasoning is another headline feature in Google Search.


With the new multi-step reasoning, making plans for life, work, and travel becomes very simple.

For example, search for "the best yoga studios nearby", and all the key information about them (ratings, class recommendations, distances, and so on) is sorted into blocks and displayed clearly in the results.


Drawing on Google's own huge database, the AI can pull up the latest, most complete, high-quality information during a search, so the accuracy and credibility of the results are better guaranteed.

Google's index now covers more than 250 million places worldwide, updated in real time, along with ratings, reviews, opening hours, and other key details.

Planning in Search is another update that lightens your load.


For example, if you're readjusting your diet and planning your meal from scratch, you don't want to eat macaroni and cheese for breakfast, lunch, and dinner.

Throw the request straight into the search box, and Google Search returns a week's worth of recipes that match your requirements and are sensibly arranged.


What's more, you can change the criteria and details at any time, and the results of your search will be updated in real time with the latest prompts.

If we have already seen or even used the above features in products from other companies, then Ask with Video will definitely give you some surprises.

Everyday objects all have their own technical names, and when a device develops a minor fault there is usually a corresponding fix. But often only professionals know what a part is called, and only they can prescribe the right remedy.

Now, with Google Search's Ask with Video, anyone can play the expert: it is the equivalent of an encyclopedia packed into a phone.


A record player's part won't work and you don't know where to start; your camera's shutter suddenly fails... In the past, these might have gone back to the manufacturer for after-sales service. Now you can capture the problem with your device's camera, and Google Search makes a preliminary diagnosis based on what it sees; minor faults can be fixed on the spot.

In the live demo at the keynote, the AI listed the entire repair procedure step by step, and by following the on-screen instructions the presenter fixed the minor problem in no time.


This feature uses AI to break the video down frame by frame, feeds the key information from each frame into Gemini's long context window for analysis, and combs through relevant articles, forums, and videos on the web for insights, which is how Ask with Video arrives at its suggestions.
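The frame-sampling-into-a-long-context pipeline described above can be sketched roughly as follows. This is an illustration only: the helper names, the 258-token-per-frame cost, and the 1M-token budget are assumptions, since Google has not published Ask with Video's internals.

```python
def sample_frames(duration_s: float, fps_sample: float = 1.0):
    """Pick evenly spaced timestamps (in seconds) to extract as key frames."""
    n = max(1, int(duration_s * fps_sample))
    return [round(i / fps_sample, 2) for i in range(n)]

def build_video_prompt(question: str, frame_timestamps,
                       token_budget=1_000_000, tokens_per_frame=258):
    """Pack as many frames as the context window allows, earliest first."""
    max_frames = token_budget // tokens_per_frame
    kept = frame_timestamps[:max_frames]
    return {
        "question": question,
        "frames": kept,  # stand-ins for per-frame image tokens
        "est_tokens": len(kept) * tokens_per_frame,
    }
```

At one frame per second, even a long clip fits: a 2-hour video sampled this way stays within a 1M-token budget only after truncation, which is exactly why the jump to a 2M-token context matters for video.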

Compared with traditional text input, the biggest advantage of video is that the interaction process between us and AI becomes more intuitive, and vague words such as "here" and "this" can also make the large model know what we are referring to.


Google says these latest AI features will be available in Labs in the coming weeks, which means a more powerful Google Search is not far off.

In subsequent versions, it will even be able to find answers based on the automatic subtitles of the videos on the page, and I wonder if it will steal the jobs of those bloggers who "watch XX movies in 1 minute".

Images, music, and video: taking aim at OpenAI

If GPT-4o two days ago was another shock to the world brought by AI, then Project Astra, which was officially announced by Google tonight, is a continuation of the shock.

Project Astra is Google DeepMind's prototype of a universal AI assistant.


Similar to GPT-4o, users can use it to have real-time conversations with the AI, as well as video chats.

The feature was well illustrated by a live demo, in which a staff member pointed the camera at nearby objects and asked Project Astra questions, which it answered with almost zero latency.

For example, Project Astra can tell that the top half of the speaker is a tweeter, and can easily identify the code displayed on the computer screen.

According to Google:

Our new project focuses on building a future AI assistant that can really help in everyday life.

On top of this more powerful AI foundation, Google also announced three other practical tools at I/O, showcasing the future of the technology in images, music, and video.

Imagen 3 is the latest image generation model released by Google.


It understands our prompts better and creates more photorealistic images from them.

In the "wolf" image shown at the keynote, Imagen 3 accurately extracted eight details from the narrative prompt and reflected them all in the picture.


It's not hard to see that the resulting images are not only accurate in detail, but also very realistic.

Imagen 3 can also handle more abstract image creation, such as creative images based on prompts like "rainbow colors", "light made of feathers", and "black background".


It's as if it knows exactly what you want.

The presenter even joked at the keynote that "you can use it to count the whiskers on someone's face".

Google has also broken new ground when it comes to music generation.

The Music AI Sandbox is the latest music generation model to be launched, and Google has invited Marc Rebillet to share it at the I/O site.


Given a short music demo created by an artist, the Music AI Sandbox can extend it, and can further re-create it according to user prompts such as musical style and genre.

Google says they built the Music AI Sandbox with YouTube:

It's a professional set of AI music tools to create new instrument parts from scratch, switch styles between tracks, and more to help us design and test them.

Another new model, called Veo, focuses on video generation.


Users can simply type in relevant text, images, or video prompts, and Veo can create high-quality 1080p videos that can be up to 60 seconds long.

It captures detailed instructions from the prompt across different visual and cinematic styles.


For example, we can ask for an aerial shot, a landscape, or a time-lapse in the prompt, then edit the video further with additional prompts.

For a long time, video-generation AI has remained "theory only", facing many obstacles to practical use; the biggest barrier to usability is that generated clips last only a few seconds and loop awkwardly through one or two motions.


That's why Sora caused such a stir when it appeared. Starting tonight, Google's Veo takes the spotlight: from photorealism to surrealism and animation, it can handle most film and television styles.

In addition to Project Astra, Google also offers a customizable Gemini: Gems.

Google says a Gem can complete tasks while retaining a set persona, making it a personal assistant for all kinds of people: users can tune its role to be a yoga buddy, a virtual pop personality, a fitness partner, a creative-writing coach, or even a calculus tutor.


Gemini goes all-in on long context, and the Gemini family gains new members

The Gemini project has been in the spotlight since it first surfaced. It drew some controversy early on, but later redeemed its reputation on the strength of its results, and it is now maturing steadily.

With more than 1.5 million developers using Gemini models and 2 billion users across its products, Pichai now speaks of a "Gemini era", with the goal of integrating Gemini into all products, bringing new experiences to users and creating new opportunities for creators, developers, and startups.


The latest Gemini 1.5 Pro currently supports a 1 million-token context, and later this year that figure is said to reach 2 million, enough to process 2 hours of video, 22 hours of audio, more than 60,000 lines of code, or more than 1.4 million words at once.
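Those headline figures imply a rough token "exchange rate" for each medium, which a quick back-of-envelope calculation recovers (the capacities are taken straight from the announcement; the implied per-unit rates are simple arithmetic, not official numbers):

```python
CONTEXT_TOKENS = 2_000_000  # Gemini 1.5 Pro's announced upper limit

# Capacities quoted in the announcement for one full context window.
capacities = {
    "video_seconds": 2 * 3600,    # 2 hours of video
    "audio_seconds": 22 * 3600,   # 22 hours of audio
    "code_lines":    60_000,      # lines of code
    "words":         1_400_000,   # words of text
}

# Implied token cost per unit of each medium.
implied_rate = {k: CONTEXT_TOKENS / v for k, v in capacities.items()}

print(round(implied_rate["words"], 2))       # ~1.43 tokens per word
print(round(implied_rate["video_seconds"]))  # ~278 tokens per second of video
print(round(implied_rate["audio_seconds"]))  # ~25 tokens per second of audio
```

The ~1.4 tokens-per-word rate is close to what subword tokenizers typically produce for English text, which suggests the quoted capacities are internally consistent.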

In addition, Google announced Gemini Advanced, based on Gemini 1.5 Pro, which it claims can handle "multiple large documents totaling up to 1,500 pages, or summarize 100 emails", and is available in 35 languages and more than 150 countries.

It has to be said that, when it comes to sheer text volume, Gemini is pushing hard, and this is "a big step toward the goal of turning any input into any output".


Safety is always a top priority

Since the dawn of AI, there has been a debate about how to distinguish AI-generated content. Google's countermeasure is to use SynthID to add invisible watermarks to AI-generated images and audio to make them easier to distinguish.

In the future, Google will extend this to text and video, and in the coming months it will open-source SynthID text watermarking through updates to its generative-AI toolkit, making it easier for developers to build AI responsibly.
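SynthID's exact algorithm is not public, but the general idea behind statistical text watermarks can be shown with a toy "green-list" scheme: a keyed hash splits the vocabulary in two, generation is quietly biased toward the green half, and a detector later measures the green fraction. This sketch is illustrative only and is not SynthID's method:

```python
import hashlib

def is_green(token: str, key: str = "secret") -> bool:
    """Keyed hash assigns each token to the 'green' half of the vocabulary."""
    h = hashlib.sha256((key + token).encode()).digest()
    return h[0] % 2 == 0

def green_fraction(text: str, key: str = "secret") -> float:
    """Detector: fraction of tokens in the green set.
    Unwatermarked text hovers near 0.5; watermarked text is biased higher."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(is_green(t, key) for t in tokens) / len(tokens)
```

The watermark is invisible to readers because no individual word looks unusual; only someone holding the key can run the statistical test.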


With Gemini integrated, Android will warn you when it detects suspicious activity during a call, such as a request for your social security number or banking information.
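Google hasn't detailed how that on-device check works; as a crude illustration of the idea, a detector could scan the live transcript for requests for sensitive identifiers (the pattern list and function below are invented for this sketch, and the real feature presumably uses Gemini Nano rather than keyword matching):

```python
# Phrases whose appearance in a caller's request is treated as a red flag.
SENSITIVE_PATTERNS = (
    "social security", "ssn", "bank account",
    "routing number", "one-time code", "verification code",
)

def flag_suspicious(transcript: str) -> bool:
    """Return True if the call transcript asks for sensitive identifiers."""
    t = transcript.lower()
    return any(p in t for p in SENSITIVE_PATTERNS)
```

Running on-device matters here: the transcript never has to leave the phone for the warning to fire.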

The TalkBack accessibility feature will also be enhanced with Gemini Nano, offering clearer, richer descriptions that help users with low vision navigate their phones through voice feedback, a reflection of Google's customary attention to accessibility.


As for Google's performance tonight, Jim Fan, research manager at NVIDIA, offered a very pertinent comment.

Google's newly released models appear to take multimodal input, but their outputs are not multimodal: Imagen 3 and Music AI Sandbox remain separate components, split off from Gemini. Natively merging all modal I/O is the inevitable future.

Such a natively multimodal model could perform tasks like "use a more robotic voice", "edit this image", or "generate a consistent comic strip".

By not losing information at modality boundaries, such as emotion and background sounds, such a model unlocks new in-context capabilities, letting users teach it with a handful of examples and combine modalities in novel ways.


GPT-4o isn't perfect, but it gets the form factor right. To borrow Andrej Karpathy's LLM-as-operating-system metaphor:

We need the model to natively support as many file extensions as possible.

Google has done one thing right: they've finally made a serious effort to integrate AI into the search box.

Gemini doesn't have to be the best one, but it can be the most widely used.


Love Fan'er | Original link · Sina Weibo