
Google DeepMind joins forces! Jeff Dean and Hassabis sum up 2023's comeback in a 10,000-word review

Author: New Zhiyuan

Editors: Aeneas, So Sleepy

Just now, Jeff Dean, chief scientist of Google DeepMind, and Demis Hassabis, CEO of Google DeepMind, jointly released Google's authoritative 2023 annual research summary in the field of artificial intelligence.

Google DeepMind has turned in its report card!

Just now, Jeff Dean and Hassabis jointly published a post reviewing everything Google Research and Google DeepMind accomplished in 2023.


At the beginning of this year, next to ChatGPT's worldwide popularity, Google looked like it had lost badly. Countless streams of hot money flowed toward OpenAI, whose valuation and popularity soared to unprecedented heights almost overnight.

In April, Google, which had been forced onto the back foot, played its trump card: Google Brain and DeepMind officially merged!

In May, Google held its head high at the I/O conference: the all-new PaLM 2 surpassed GPT-4, the entire office suite received an explosive upgrade, and Bard underwent an epic evolution.

In December, Google released its revenge weapon, Gemini, late at night; the strongest natively multimodal model crushed GPT-4 outright. Although the product demo was partly edited, it is undeniable that Google has pushed the world's multimodal research to unprecedented heights.

Let's take a look at how Google's heavyweights joined forces to fight their 2023 comeback battle.

Advances in products and technologies

This year, generative AI officially entered a period of explosive growth.

In February, Google hurriedly launched Bard, two months behind OpenAI in shipping its own AI chatbot.

In May, at the I/O conference, Google unveiled the results of months and even years of research, including the language model PaLM 2, which combines compute-optimal scaling, an improved dataset mixture, and architectural refinements to excel even at the most advanced reasoning tasks.


After fine-tuning and adapting PaLM 2 for different purposes, Google integrated it into a number of Google products and features, including:

1. Bard

Bard now supports more than 40 languages, is available in more than 230 countries and territories, and can pull in information from the Google tools you use every day, such as Gmail, Google Maps, and YouTube.

2. Search Generative Experience (SGE)

It uses LLMs to reimagine how information is organized and how users navigate it, creating a smoother conversational interaction model for Google's core search product.

3. MusicLM

This text-to-music model, built on AudioLM and MuLAN, can generate music from text, humming, images, or video, and can produce instrumental accompaniments and songs.

4. Duet AI

Duet AI in Google Workspace can help users create text, create images, analyze spreadsheets, draft and summarize email and chat messages, summarize meetings, and more. Duet AI in Google Cloud helps users write, deploy, scale, and monitor applications, as well as identify and resolve cybersecurity threats.


Article: https://blog.google/technology/developers/google-io-2023-100-announcements/

Following last year's release of Imagen, a text-to-image generation model, Google released Imagen Editor in June, which provides more precise control over model output through region masking and natural-language editing prompts.


Subsequently, Google released Imagen 2, which improves outputs with a specialized image-aesthetics model that reflects human preferences for good lighting, framing, exposure, and sharpness.


In October, Google launched a new Google Search feature to help users practice speaking and improve their language skills.

The key technology behind this feature is a new deep learning model called Deep Aligner, developed in collaboration with the Google Translate team.

Compared to the Hidden Markov Model (HMM)-based alignment method, this single new model dramatically improves the alignment quality across all tested language pairs, reducing the average alignment error rate from 25% to 5%.


In November, Google partnered with YouTube to release Lyria, Google's most advanced AI music generation model to date.


In December, Google launched Gemini, Google's most powerful and versatile AI model.

From the beginning, Gemini has been built as a multimodal model across text, audio, images, and video.

Gemini comes in three different sizes, Nano, Pro, and Ultra. Nano is the smallest and most efficient model used to provide an on-device experience for products like the Pixel. The Pro model is powerful and best for scaling across tasks. The Ultra model is the largest and most performant model for highly complex tasks.


According to the Gemini technical report, Gemini Ultra exceeds the previous state-of-the-art results on 30 of the 32 widely used academic benchmarks.

With a score of 90.04%, Gemini Ultra is the first model to outperform human experts on MMLU, and it achieves a state-of-the-art score of 59.4% on the new MMMU benchmark.


Building on AlphaCode, Google launched AlphaCode 2, powered by a specialized version of Gemini and the first AI system to perform at the level of the median competitor in coding competitions.

Compared to the original AlphaCode, AlphaCode 2 solved 1.7× as many problems and outperformed 85% of the participants.


At the same time, with the addition of the Gemini Pro model, Bard has been greatly upgraded, with much-improved comprehension, summarization, reasoning, coding, and planning capabilities.

Gemini Pro outperformed GPT-3.5 on six of eight benchmarks, including MMLU, one of the key standards for evaluating LLMs, and GSM8K, which measures grade-school math reasoning.

Early next year, Gemini Ultra will also come to Bard, which is sure to spark a new cutting-edge AI experience.

What's more, Gemini Pro is also available for Vertex AI, Google Cloud's end-to-end AI platform that enables developers to build applications that process text, code, images, and video information.


Gemini Pro was also made available in AI Studio in December.
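For readers who want to experiment, the sketch below shows one minimal way to call Gemini Pro from Python via AI Studio. It assumes the google-generativeai package, the "gemini-pro" model name, and an API key issued by AI Studio; treat it as a sketch and check the current documentation rather than relying on these details.

```python
# Minimal sketch of calling Gemini Pro through AI Studio's Python SDK.
# Assumes the google-generativeai package and the "gemini-pro" model name.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # key issued by AI Studio

model = genai.GenerativeModel("gemini-pro")
response = model.generate_content(
    "Summarize the key idea behind natively multimodal models in two sentences."
)
print(response.text)
```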

As you can see, the things that Gemini can do include, but are not limited to:

Unlock insights from the scientific literature.


Good at competitive programming.


Process and understand raw audio.


Gemini can answer why the dish isn't cooked yet: because the eggs are still raw

Explain reasoning in mathematics and physics.


Understand user intent and deliver tailored experiences.


Machine Learning / Artificial Intelligence

In addition to its advances in products and technologies, Google has also made many important advances this year in the broader fields of machine learning and AI research.

The core architecture of today's most advanced machine learning models is the Transformer, developed by Google researchers in 2017.

Originally, Transformer was developed for language, but today, it has proven to be extremely useful in various fields such as computer vision, audio, genomics, protein folding, and more.

This year, Google's work on scaling Vision Transformers reached state-of-the-art results on a variety of vision tasks and can be used to build more capable robots.

Extending the versatility of these models requires the ability to perform higher-level, multi-step reasoning.

This year, Google approached this goal through several studies.

For example, a new approach to algorithmic prompting teaches a language model to reason by demonstrating a series of algorithmic steps, which the model can then apply to a new context.

This approach increased accuracy on a middle-school math benchmark from 25.9% to 61.1%.
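To make the idea concrete, here is a hedged, illustrative sketch of what an algorithmic prompt might look like for multi-digit addition: every intermediate step, including the carries, is spelled out so the model can imitate the procedure in-context. The prompt text and the call_llm placeholder are illustrative, not the exact prompt or interface used in Google's work.

```python
# Illustrative algorithmic prompt: the worked example spells out every step of
# column-wise addition, including carries, so the model can apply the same
# procedure to a new problem purely in-context.
algorithmic_prompt = """Problem: 128 + 367.
Explanation: Add the digits from right to left.
8 + 7 = 15, write 5, carry 1.
2 + 6 + 1 (carry) = 9, write 9, carry 0.
1 + 3 + 0 (carry) = 4, write 4.
Answer: 495.

Problem: 582 + 349.
Explanation:"""

def call_llm(prompt: str) -> str:
    """Placeholder for whatever text-completion API you use."""
    raise NotImplementedError

# completion = call_llm(algorithmic_prompt)
```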


By providing algorithmic prompts, we can teach a model arithmetic rules via in-context learning

In the field of visual question answering, Google collaborated with researchers at UC Berkeley to better answer complex visual questions, such as "Is the carriage to the right of the horse?", by combining visual models with language models.


Diagram of the CodeVQA method. First, the large language model generates a Python program that calls visual functions representing the question. In this example, a simple VQA method answers one part of the question, and an object localizer finds the position of the mentioned object. The program then combines the outputs of these functions to produce an answer to the original question

The language model is trained to answer visual questions by performing multi-step inference through a synthetic program.
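As a rough illustration of the kind of program such a system might generate for the carriage-and-horse question, consider the sketch below. The helper find_object is an illustrative stand-in for the visual primitives the real system exposes (object localization, simple VQA), not its actual API.

```python
# Hedged sketch of a CodeVQA-style generated program for
# "Is the carriage to the right of the horse?".
def find_object(image, name: str) -> tuple[float, float]:
    """Return the (x, y) center of the named object via an object localizer (stub)."""
    ...

def answer_question(image) -> str:
    horse_x, _ = find_object(image, "horse")
    carriage_x, _ = find_object(image, "carriage")
    # "To the right" here means the carriage's center lies further right than the horse's.
    return "yes" if carriage_x > horse_x else "no"
```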

To train large-scale machine learning models for software development, Google developed a general-purpose model called DIDACT.

It understands every aspect of the software development lifecycle and can automatically generate code review comments, respond to code review comments, suggest performance improvements for code snippets, fix code in response to compilation errors, and more.


Through years of collaboration with the Google Maps team, Google has scaled up inverse reinforcement learning and applied it to the world-scale problem of improving route suggestions for more than a billion users.


With the RHIP inverse reinforcement learning policy, Google Maps' route match rate improved relative to existing baselines

This work resulted in a relative improvement of 16-24% in the global route match rate, ensuring that suggested routes better align with user preferences.

Google is also continuing to work on techniques that improve the inference performance of machine learning models.

While investigating computationally friendly approaches to pruning connections in neural networks, the team devised an approximation algorithm for the computationally intractable problem of selecting the best subset of edges. It can prune 70% of the edges from an image classification model while retaining almost all of the original model's accuracy.
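Google's result relies on an approximation algorithm for that best-subset selection problem; the sketch below is only a much simpler baseline, global magnitude pruning in NumPy, included to make the "remove 70% of edges, keep the accuracy" idea concrete rather than to reproduce their method.

```python
# Global magnitude pruning: zero out the 70% of weights with the smallest absolute
# value. A simple baseline for illustration, not Google's approximation algorithm.
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float = 0.7) -> np.ndarray:
    threshold = np.quantile(np.abs(weights), sparsity)
    return weights * (np.abs(weights) >= threshold)

w = np.random.randn(256, 256)
w_pruned = magnitude_prune(w)
print(f"fraction of edges removed: {np.mean(w_pruned == 0):.2f}")  # ~0.70
```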


Original vs. pruned networks

To accelerate on-device diffusion models, Google made various optimizations to the attention mechanism, convolution kernels, and operator fusion so that high-quality image generation models can run directly on devices.

Now, in just 12 seconds, you can generate a "realistic high-resolution image of a cute puppy surrounded by flowers" on your smartphone.


Example output of LDM on mobile GPU, prompt: "A photorealistic high-resolution image of a cute puppy with flowers around"

The progress of language and multimodal models is also conducive to robotics research.

Google combined separately trained language, vision, and robot control models into PaLM-E, an embodied multimodal model for robotics, and Robotic Transformer 2 (RT-2).

This is a novel vision-language-action (VLA) model that learns from web and robot data and translates this knowledge into general instructions for robot control.


RT-2 architecture and training: a pre-trained vision-language model is jointly fine-tuned on robot and web data. The resulting model takes in images from the robot's camera and directly predicts the actions the robot should perform

In addition, Google has also looked at the use of language to control the gait of quadruped robots.


SayTap uses foot contact patterns (e.g., the sequence of 0s and 1s for each foot in the illustration, where 0 means the foot is in the air and 1 means it is on the ground) as an interface to bridge natural-language user commands and low-level control commands. With a reinforcement learning-based locomotion controller, SayTap allows quadruped robots to follow both simple, direct commands (e.g., "trot forward slowly") and vague user commands (e.g., "Good news, we're going on a picnic this weekend!") and react accordingly
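The sketch below illustrates the contact-pattern interface described above: a natural-language command is mapped to a per-foot 0/1 sequence that a low-level controller can track. The foot ordering and the patterns themselves are made-up illustrations, not SayTap's actual templates.

```python
# Illustrative foot-contact patterns: one row per foot (assumed order FL, FR, RL, RR),
# one column per timestep; 1 = foot on the ground, 0 = foot in the air.
GAIT_PATTERNS = {
    "trot": [            # diagonal pairs alternate
        [1, 0, 1, 0],    # front-left
        [0, 1, 0, 1],    # front-right
        [0, 1, 0, 1],    # rear-left
        [1, 0, 1, 0],    # rear-right
    ],
    "stand": [[1, 1, 1, 1]] * 4,
}

def command_to_pattern(command: str) -> list[list[int]]:
    """Stand-in for the LLM that turns a natural-language command into a contact pattern."""
    return GAIT_PATTERNS["trot" if "trot" in command.lower() else "stand"]

pattern = command_to_pattern("slowly trot forward")
# The pattern is then handed to the reinforcement-learning-based locomotion controller.
```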

Google also explored using language to help define more explicit reward functions, bridging the gap between human language and robot actions.

The language-to-reward system consists of two core components: (1) a reward translator and (2) a motion controller. The reward translator maps natural-language instructions from users to reward functions represented as Python code. The motion controller then optimizes the given reward function with receding-horizon optimization to find the optimal low-level robot actions, such as the torque that should be applied to each motor.
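As a hedged illustration, the snippet below shows the kind of Python reward function a reward translator might emit for an instruction like "raise the robot's front body". The state fields, target values, and weights are assumptions made for this sketch; the motion controller would then optimize this reward over a short horizon to produce motor torques.

```python
# Illustrative reward function of the kind a "reward translator" might generate.
# State fields and coefficients are assumptions for this sketch.
def reward(state) -> float:
    pitch_term = -abs(state.body_pitch - 0.6)        # target front-body pitch (rad), assumed
    upright_term = -abs(state.body_roll)              # discourage rolling sideways
    effort_term = -0.01 * sum(t * t for t in state.joint_torques)  # small energy penalty
    return 2.0 * pitch_term + 1.0 * upright_term + effort_term
```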


Because such data is missing from pre-training datasets, LLMs cannot directly generate low-level robot actions. The team proposes using reward functions to bridge the gap between language and low-level robot actions, enabling novel, complex robot motions from natural-language instructions


In Barkour, the team benchmarked the agility limits of quadruped robots.


Several dogs were invited to participate in the obstacle course, and the results showed that small dogs can complete the obstacle course in about 10 seconds, while robot dogs generally take about 20 seconds

Algorithms & Optimizations

Designing efficient, robust, and scalable algorithms has always been the focus of Google's research.

One of the most important achievements is AlphaDev, which broke through a decade-old algorithmic bottleneck.

Its innovation lies in the fact that AlphaDev used reinforcement learning to discover faster algorithms entirely from scratch, rather than by improving existing algorithms.


Address: https://www.nature.com/articles/s41586-023-06004-9

The results show that AlphaDev discovered new sorting algorithms that bring significant improvements to the LLVM libc++ sorting library: for shorter sequences, they are up to 70% faster, and for sequences of more than 250,000 elements, about 1.7% faster.

Now, these algorithms are part of two standard C++ libraries and are used trillions of times a day by programmers around the world.


To better estimate the execution performance of large programs, Google developed a new method for predicting the properties of large graphs and released a new dataset, TPUGraphs.


The TPUGraphs dataset contains 44 million graphs for machine learning program optimization

In addition, Google proposed a new load-balancing algorithm, Prequal, which significantly reduces CPU usage, response time, and memory usage when distributing queries across servers.


Google also improved the state of the art in clustering and graph algorithms by developing new techniques for computing minimum cuts, approximate correlation clustering, and massively parallel graph clustering.

These include TeraHAC, a new hierarchical clustering algorithm designed for graphs with trillions of edges; KwikBucks, a text clustering algorithm that achieves both high quality and scalability; and an efficient algorithm for approximating Chamfer Distance, the standard similarity function for multi-embedding models, which is more than 50× faster than highly optimized exact algorithms and scales to billions of points.
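For reference, the snippet below computes the exact Chamfer distance between two embedding sets in its common symmetric nearest-neighbor form (the multi-embedding literature also uses an inner-product "Chamfer similarity" variant, so this may differ from the exact formulation in Google's paper). This brute-force version is quadratic in the number of points; Google's contribution is an approximation that is far faster and scales to billions of points.

```python
# Exact Chamfer distance (symmetric nearest-neighbor form), brute force in NumPy.
# Quadratic time: shown only as the reference definition, not the fast approximation.
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    # a: (n, d), b: (m, d)
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)   # (n, m) pairwise squared distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

a, b = np.random.randn(100, 64), np.random.randn(120, 64)
print(chamfer_distance(a, b))
```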


In addition, Google has optimized for large-scale embedding models (LEMs).

These include Unified Embedding, which provides battle-tested feature representations for large-scale machine learning systems, and Sequential Attention, which discovers efficient sparse model structures during model training.

Science and Society

In the not-too-distant future, the application of AI in scientific research is expected to speed up discovery in some fields by a factor of 10, 100, or more.

This has led to major breakthroughs in bioengineering, materials science, weather forecasting, climate forecasting, neuroscience, genetic medicine, and healthcare.

Climate and sustainability

In its study of contrails, Google analyzed large amounts of weather data, historical satellite imagery, and past flight records to train an AI model that predicts where contrails will form so that flight routes can be adjusted accordingly. The results showed that the system could reduce contrails by 54%.


To help combat the challenges posed by climate change, Google is constantly developing new approaches to technology.

Google's flood forecasting service, for example, now covers 80 countries and directly affects more than 460 million people.


In addition, Google has made recent progress in the development of weather prediction models.

Building on MetNet and MetNet-2, Google built the stronger MetNet-3, which can surpass traditional numerical weather simulations for forecasts up to 24 hours ahead.

In the field of medium-term weather forecasting, the new AI model GraphCast can accurately predict 10-day global weather in less than 1 minute, and even predict extreme weather events.


Address: https://www.science.org/doi/10.1126/science.adi2336

The study found that, compared to the industry's gold-standard weather simulation system, the High RESolution forecast (HRES), GraphCast made more accurate predictions on more than 90% of the 1,380 tested variables.

What's more, GraphCast can identify severe weather events earlier than traditional forecasting models – predicting the potential path of future cyclones up to 3 days in advance.

It is worth mentioning that the source code of the GraphCast model has been fully open-sourced, allowing scientists and forecasters around the world to use it for the benefit of billions of people.


Health & Life Sciences

In healthcare, AI has shown great potential.

The original Med-PaLM was the first AI model to pass the U.S. medical licensing exam. The subsequent Med-PaLM 2 was further improved by 19% to achieve an expert-level accuracy of 86.5%.

The recently released multimodal Med-PaLM M can not only process natural language input, but also interpret medical images, text data, and many other data types.


Med-PaLM M is a large-scale, multimodal generative model that provides the flexibility to encode and interpret biomedical data, including clinical language, imaging, and genomics data, with the same model weights

Not only that, but AI systems can also explore new signals and biomarkers in existing medical data.

By analyzing images of the retina, Google has demonstrated that it can predict multiple new biomarkers related to different organ systems (e.g., kidney, blood, liver) from photos of the eye.

In another study, Google also found that combining retinal images with genetic information could help reveal some of the underlying factors associated with aging.


In genomics, Google collaborated with 119 scientists at 60 institutions to create a new map of the human genome.

And, building on the groundbreaking AlphaFold, Google released a catalog of predictions covering 89% of all 71 million possible missense variants.


In addition, Google released the latest evolution of AlphaFold, "AlphaFold-latest", which can make atomically accurate structure predictions for almost all molecules in the Protein Data Bank (PDB).

This advance has not only deepened our understanding of biomolecules, but also dramatically improved accuracy in several important areas, including ligands (small molecules), proteins, nucleic acids (DNA and RNA), and biomacromolecules containing post-translational modifications (PTMs).


Quantum computing

Quantum computers have the potential to solve major real-world problems in science and industry.

But to realize this potential, quantum computers must be much larger than they are today, and they must be able to reliably perform tasks that classical computers cannot.

In order to ensure the reliability of quantum computing, it is also necessary to reduce its error rate from the current 1 in 10^3 to 1 in 10^8.

This year, Google took a major step forward on the road to a large, practical quantum computer: for the first time ever, it showed that adding qubits can reduce the computational error rate.


Responsible AI

Generative AI is revolutionizing healthcare, education, security, energy, transportation, manufacturing, and entertainment.

In the face of these leaps and bounds, ensuring that the technology design aligns with Google's AI principles remains a top priority.


Make AI universal

While advancing the latest techniques in machine learning and artificial intelligence, Google is also committed to helping people understand AI and apply it to specific problems.

To this end, Google has launched Google AI Studio, a web-based platform that helps developers build and iterate on lightweight AI applications.

At the same time, to help AI engineers understand and debug models more deeply, Google has released the most advanced open-source machine learning model debugging tool, LIT 1.0.

One of Google's most popular tools, Colab gives developers and students access to powerful computing resources directly in the browser, and currently has more than 10 million users.

Some time ago, Google added AI code assistance to Colab, giving all users a more convenient and integrated experience in data analysis and machine learning workflows.


Just recently, Google innovatively launched the FunSearch method to ensure that AI can provide accurate information in real-world applications.

Through a combination of evolutionary algorithms and large language models, FunSearch is able to generate verified, real-world knowledge in the field of mathematical sciences.

Specifically, FunSearch pairs a pre-trained LLM with an automated "evaluator". The former proposes creative solutions in the form of computer code, while the latter guards against hallucinations and incorrect ideas. Through iteration between these two components, initial solutions "evolve" into new knowledge.
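Schematically, that loop can be sketched as below: an LLM proposes candidate programs, an automated evaluator scores (and effectively verifies) them, and the best candidates seed the next round of prompts. Both llm_propose and evaluate are placeholders; this is a sketch of the idea, not DeepMind's implementation.

```python
# Schematic FunSearch-style loop: propose programs with an LLM, keep only those the
# evaluator can score, and feed the best back into the prompt. Placeholders throughout.
def llm_propose(examples: list[str]) -> str:
    """Ask an LLM for a new candidate program, prompted with prior high-scoring ones."""
    raise NotImplementedError

def evaluate(program: str) -> float:
    """Run the candidate on the target problem and return a verified score."""
    raise NotImplementedError

def funsearch_loop(seed_programs: list[str], rounds: int = 100, pool_size: int = 20):
    pool = [(evaluate(p), p) for p in seed_programs]
    for _ in range(rounds):
        best = [p for _, p in sorted(pool, reverse=True)[:5]]   # best-so-far as examples
        candidate = llm_propose(best)
        try:
            score = evaluate(candidate)   # discards programs that crash or fail checks
        except Exception:
            continue
        pool.append((score, candidate))
        pool = sorted(pool, reverse=True)[:pool_size]
    return max(pool)
```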


Address: https://www.nature.com/articles/s41586-023-06924-6

Community involvement

By publishing research and participating in and organizing academic conferences, Google is continuing to advance AI and computer science.

This year, Google published more than 500 papers, many of which were accepted at top conferences, including ICML, ICLR, NeurIPS, ICCV, CVPR, ACL, CHI, and Interspeech.

In addition, Google joined forces with 33 academic labs to create the Open X-Embodiment dataset and the RT-X model by aggregating data from 22 different robot types.

Google, with the support of the MLCommons Standards Group, is leading the industry in promoting AI safety benchmarks with OpenAI, Anthropic, Microsoft, Meta, Hugging Face, and other leading organizations in generative AI.

Looking to the future

As multimodal models continue to advance, they will help humanity achieve astonishing things in new fields of knowledge, from science to education.

As time progresses, Google's products and research will continue to improve, and people will find more creative ways to use AI.

At the end of this year-end summary, let's return to the topic we began with. As Google put it in "Why We Focus on AI (and to what end)":

"If we boldly and responsibly advance the development of AI, we believe that AI can become a foundational technology that will revolutionize the lives of people around the world – and that is what we are striving for, and this is our passion!"


Resources:

https://blog.research.google/2023/12/2023-year-of-groundbreaking-advances-in.html