
Google "Strikes Back", releasing nearly 10 models overnight! The context window rolled to 2 million tokens, the Sora competitor Veo was released, and Android was also transformed


Author | Tu Min

Produced by | CSDN (ID: CSDNnews)

This May feels like a flashback to March 2023: one lively AI feast after another.

Whether by coincidence or by design, in March last year, almost at the very moment Google opened up the PaLM API for its large language model, OpenAI stunned everyone by releasing its strongest model, GPT-4; just a few days later, Microsoft officially announced at a press conference that its Office suite had been revamped with GPT-4, leaving Google looking all but ignored.

Somewhat awkwardly, the same scene seems to be playing out this year: on one side, OpenAI opened this month's AI gala in the early hours of yesterday morning with its fully upgraded flagship, GPT-4o; on the other, Microsoft will hold Build 2024 next week. Whether Google, squeezed in the middle once again, can turn the tide against its two rivals is something we can begin to judge from the I/O 2024 developer conference that kicked off in the early hours of this morning.

This year's I/O conference also marks the eighth year of Google's unmistakable "AI First" strategy.

Google "Strikes Back", releasing nearly 10 models overnight! The context window rolled to 2 million tokens, the Sora competitor Veo was released, and Android was also transformed

Highlights at a glance

As expected, "AI" was the keyword running through the nearly two-hour keynote; what I didn't expect was that it would be mentioned as many as 121 times, a count that makes Google's anxiety about AI hard to miss.

Google "Strikes Back", releasing nearly 10 models overnight! The context window rolled to 2 million tokens, the Sora competitor Veo was released, and Android was also transformed

Facing aggressive outside competitors, Google CEO Sundar Pichai said recently in an interview, "AI is still in the early stages of development, and I believe that Google will eventually win this battle, just as Google was not the first company to do search."

Speaking at the I/O launch, Sundar Pichai also emphasized this point, "We are still in the early stages of the transformation of our AI platform. We see a huge opportunity for creators, developers, startups, and everyone."

Sundar Pichai said that when Gemini was released last year, it was positioned as a multimodal model that could reason across text, images, videos, code, and more. In February, Google released Gemini 1.5 Pro, a breakthrough in long text, extending the context window length to 1 million tokens, more than any other large-scale base model. Today, more than 1.5 million developers use Gemini models in Google tools.

At the launch, Sundar Pichai shared a rundown of updates from inside Google:

  • The Gemini app is now available for Android and iOS. With Gemini Advanced, users have access to Google's most powerful models.
  • Google is rolling out an improved version of Gemini 1.5 Pro to all developers worldwide. In addition, Gemini 1.5 Pro, with its 1 million token context window, is now available directly to consumers in Gemini Advanced and can be used in 35 languages.
  • Google has expanded the Gemini 1.5 Pro context window to 2 million tokens and made it available to developers in private preview.
  • We're still in the early stages of agents, but Google has already begun exploring Project Astra, which analyzes the world through a smartphone camera, recognizes and explains code, helps its user find their glasses, distinguishes sounds, and more...
  • Gemini 1.5 Flash, a lighter-weight model than Gemini 1.5 Pro, was released, optimized for tasks where low latency and cost matter most.
  • Veo, a model for producing "high-quality" 1080p video, and Imagen 3, a text-to-image model, were released;
  • Gemma 2, with a new architecture and a 27B parameter size, is here;
  • Android, the first mobile operating system to include a built-in on-device foundation model, is deeply integrated with the Gemini model, becoming an operating system with Google AI at its core;
  • The sixth-generation TPU Trillium was released, delivering a 4.7x increase in compute performance per chip compared to the previous generation TPU v5e.
Google "Strikes Back", releasing nearly 10 models overnight! The context window rolled to 2 million tokens, the Sora competitor Veo was released, and Android was also transformed

Google is "crazy", and a variety of models are launched at the same time

The large-model race is famously cutthroat, but I didn't expect that, in its rush to catch up, Google would push the arms race far beyond imagination: at the conference, Google not only upgraded its previous large models but also released a number of new ones.

Gemini 1.5 Pro gets an upgrade

When Gemini was released last year, Google positioned it as a multimodal model that could reason across text, images, video, code, and more. In February, Google released Gemini 1.5 Pro, a breakthrough in long text, extending the context window length to 1 million tokens, more than any other large-scale base model.

At the event, Google first announced quality improvements to key use cases of Gemini 1.5 Pro, such as translation, coding, and reasoning, so it can handle a broader range of complex tasks. 1.5 Pro can now follow increasingly complex and fine-grained instructions, including ones that specify product-level behavior involving roles, formats, and styles. It also lets users steer model behavior by setting system instructions.
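For developers, a system instruction is set once when the model object is created and then applies to every turn. Here is a minimal sketch, assuming the google-generativeai Python SDK; the model name, role, and prompt are illustrative, not Google's examples:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# A system instruction fixes role, format, and style for every turn.
model = genai.GenerativeModel(
    model_name="gemini-1.5-pro-latest",
    system_instruction=(
        "You are a release-notes editor. Answer in terse bullet points, "
        "leading with a one-line summary."
    ),
)

response = model.generate_content("Summarize the changes in Gemini 1.5 Pro.")
print(response.text)
```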

At the same time, Google has added audio understanding to the Gemini API and Google AI Studio, so 1.5 Pro can now reason over both the imagery and the audio of videos uploaded in Google AI Studio.
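In practice, feeding a video into the model goes through the File API. A minimal sketch, again assuming the google-generativeai SDK; the file name and question are placeholders:

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload the video through the File API and wait for server-side processing.
video = genai.upload_file(path="keynote_clip.mp4")
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = genai.get_file(video.name)

# 1.5 Pro can now reason over both the frames and the soundtrack.
model = genai.GenerativeModel("gemini-1.5-pro-latest")
response = model.generate_content(
    [video, "What is announced in this clip, and what does the speaker say?"]
)
print(response.text)
```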

More noteworthy still: as if a 1 million token context were not long enough, today Google further expanded the context window to 2 million tokens, available to developers in private preview, taking the next step toward its ultimate goal of infinite context.

Google "Strikes Back", releasing nearly 10 models overnight! The context window rolled to 2 million tokens, the Sora competitor Veo was released, and Android was also transformed

To access 1.5 Pro with a 2 million token context window, you'll need to join a waitlist in Google AI Studio or, for Google Cloud customers, in Vertex AI.

A new, lighter-weight model: Gemini 1.5 Flash

Gemini 1.5 Flash is a lightweight model built for scale and the fastest Gemini model served in the API. It is optimized for low-latency, cost-sensitive tasks, offering more cost-efficient serving while keeping the breakthrough long context window.

Google "Strikes Back", releasing nearly 10 models overnight! The context window rolled to 2 million tokens, the Sora competitor Veo was released, and Android was also transformed

Although it is lighter than the 1.5 Pro model, it can still perform multimodal reasoning over vast amounts of information. By default, Flash also comes with a 1 million token context window, which means it can process an hour of video, 11 hours of audio, a codebase of more than 30,000 lines, or over 700,000 words.
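If you want to check how close your own data comes to that ceiling, the API can count tokens without running a full generation. A small sketch, assuming the google-generativeai SDK; the file path is illustrative:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash-latest")

# Read a large local text dump (placeholder path) and count its tokens
# without paying for a generation call.
with open("large_codebase_dump.txt") as f:
    text = f.read()

print(model.count_tokens(text).total_tokens)
```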

Gemini 1.5 Flash excels at summarizing, chat, image and video captioning, extracting data from long documents and tables, and more. That is because it was trained from 1.5 Pro through a process called "distillation," which transfers the most essential knowledge and skills from a larger model to a smaller, more efficient one.
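Google has not disclosed its exact recipe, but the textbook form of the distillation the passage alludes to has the student model match the teacher's softened output distribution. A purely illustrative PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    # Soften both distributions with a temperature, then penalize the
    # KL divergence of the student from the teacher.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```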

Google "Strikes Back", releasing nearly 10 models overnight! The context window rolled to 2 million tokens, the Sora competitor Veo was released, and Android was also transformed

Gemini 1.5 Flash is priced at 35 cents per 1 million tokens, far cheaper than GPT-4o's $5 per 1 million tokens.
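For a back-of-the-envelope sense of the gap, applying the two quoted input-token prices to the same workload:

```python
# Input-token prices quoted above, in USD per 1 million tokens.
flash_price, gpt4o_price = 0.35, 5.00

tokens = 10_000_000  # e.g., a batch of long-document summaries
print(f"Gemini 1.5 Flash: ${tokens / 1e6 * flash_price:.2f}")  # $3.50
print(f"GPT-4o:           ${tokens / 1e6 * gpt4o_price:.2f}")  # $50.00
```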

Both Gemini 1.5 Pro and 1.5 Flash are now available in public preview in Google AI Studio and Vertex AI.

Google's first open visual language model, PaliGemma, is now available

PaliGemma is a powerful open VLM (vision-language model) inspired by PaLI-3. Built on open components such as the SigLIP vision model and the Gemma language model, PaliGemma is designed to deliver best-in-class fine-tuning performance across a wide range of vision-language tasks: captioning images and short videos, visual question answering, understanding text in images, object detection, and object segmentation.

Google says that to promote open exploration and research, PaliGemma is available through a variety of platforms and resources: you can find it on GitHub, Hugging Face Models, Kaggle, Vertex AI Model Garden, and ai.nvidia.com (accelerated with TensorRT-LLM), and it is easy to integrate via JAX and Hugging Face Transformers.
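As one concrete route among those listed, here is a minimal sketch of loading PaliGemma through Hugging Face Transformers; it assumes a transformers release with PaliGemma support, and the checkpoint id and image file are illustrative:

```python
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"  # one of the published variants
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("street_scene.jpg")  # illustrative input
# "caption en" is the task prefix the mix checkpoints use for captioning.
inputs = processor(text="caption en", images=image, return_tensors="pt")

output = model.generate(**inputs, max_new_tokens=30)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```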

Gemma 2 released

Gemma 2 arrives in a new size and with an all-new architecture designed for breakthrough performance and efficiency. At 27 billion parameters, Gemma 2 delivers performance comparable to Llama 3 70B at less than half the size.

Google "Strikes Back", releasing nearly 10 models overnight! The context window rolled to 2 million tokens, the Sora competitor Veo was released, and Android was also transformed

According to Google, Gemma 2's efficient design means it requires less than half the compute of comparable models. The 27B model is optimized to run on NVIDIA GPUs or on a single TPU host in Vertex AI.
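For reference, running a 27B Gemma-family checkpoint locally through Hugging Face Transformers might look like the sketch below. The checkpoint id, dtype, and device setup are assumptions (the Gemma 2 weights were only announced, not yet published, at the time of the keynote), so check Google's model card for the released names and hardware requirements:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-27b-it"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "In one paragraph, what trade-offs let a 27B model punch above its size?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```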
