
Jeff Dean's long article outlook: Five potential trends in machine learning after 2021

Author: Heart of the Machine Pro

Report from the Heart of the Machine

Editors: Du Wei, Maye

After 2021, in which areas will machine learning have an unprecedented impact?

Over the past few years, the fields of machine learning (ML) and computer science have seen remarkable change. Continuing along this long arc of progress, the coming years should bring many exciting advances that ultimately benefit the lives of billions of people, with a more profound impact than ever before.

In a lengthy review article, Jeff Dean, the well-known scholar and head of Google AI, highlights five areas where machine learning has the greatest potential after 2021:

  • Trend 1: More capable, more versatile machine learning models
  • Trend 2: Machine learning continues to improve efficiency
  • Trend 3: Machine learning becomes more personalized and more beneficial to the community
  • Trend 4: Machine learning is having a growing impact on science, health, and sustainability
  • Trend 5: A deeper and broader understanding of machine learning

The main content of the article is as follows.

Trend 1: More capable, more versatile machine learning models

Researchers are training machine learning models that are larger and more capable than ever before. Over the past few years, the language field has progressed from models with billions of parameters trained on tens of billions of tokens (such as the 11-billion-parameter T5 model) to models with hundreds of billions or trillions of parameters trained on trillions of tokens. These include dense models such as OpenAI's 175-billion-parameter GPT-3 and DeepMind's 280-billion-parameter Gopher, as well as sparse models such as Google's 600-billion-parameter GShard and 1.2-trillion-parameter GLaM. The growth in dataset and model size has led to significant accuracy gains on a wide range of language tasks, as evidenced by across-the-board improvements on standard NLP benchmarks.

Many of these advanced models focus on the single but important modality of written language and have demonstrated SOTA results on language-understanding benchmarks and in open-ended conversation, even across multiple tasks within the same domain. They have also shown the ability to generalize to new language tasks with relatively little training data; in some cases, a new task requires few or no training examples at all.
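
To make the few-shot idea concrete, here is a minimal sketch of how a task can be posed to such a model purely through its input text, with no retraining; the prompt format and example strings are illustrative only and not tied to any particular model's API.

```python
# A minimal sketch of few-shot prompting: the task is specified entirely in
# the input text, with no gradient updates. The format and examples here are
# illustrative and not tied to any specific model's API.
examples = [
    ("The movie was wonderful.", "positive"),
    ("I wasted two hours of my life.", "negative"),
]

def build_few_shot_prompt(examples, query):
    """Concatenate labeled examples, then the unlabeled query."""
    lines = [f"Review: {text}\nSentiment: {label}\n" for text, label in examples]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(examples, "A delightful surprise from start to finish.")
print(prompt)  # A large language model would be asked to continue this text.
```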


A conversation in which Google's LaMDA language model plays the part of a Weddell seal.

Transformer models have also had a major impact on image, video, and speech models, all of which benefit greatly from scale. Transformer models for image recognition and video classification achieve SOTA results on many benchmarks, and we have shown that co-training models on image and video data can achieve higher performance than training on video data alone.

We developed sparse, axial attention mechanisms for image and video Transformers, found better ways to tokenize images for visual Transformer models, and improved our understanding of how visual Transformer methods operate compared with CNNs. Combining convolution operations with the Transformer architecture has also proven beneficial in visual and speech recognition tasks.
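
As a rough illustration of why axial attention helps, the sketch below (plain NumPy, with toy shapes of our own choosing, not Google's implementation) attends along rows and then columns of a feature map, reducing the cost of full 2D self-attention from O((HW)^2) to O(HW(H+W)).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention along the second-to-last axis."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def axial_attention(x):
    """Attend along rows, then columns, of an (H, W, D) feature map.

    Full 2D self-attention costs O((H*W)**2); two axial passes cost
    O(H*W*(H+W)), which is what makes larger images tractable.
    """
    rows = attention(x, x, x)             # each row attends within itself
    xt = np.swapaxes(rows, 0, 1)          # (W, H, D): columns become rows
    cols = attention(xt, xt, xt)          # each column attends within itself
    return np.swapaxes(cols, 0, 1)        # back to (H, W, D)

x = np.random.default_rng(0).standard_normal((8, 8, 16))  # toy 8x8 map, 16 channels
print(axial_attention(x).shape)  # (8, 8, 16)
```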

The output of generative models has also improved dramatically, most visibly in image generation, which has made significant progress over the past few years. For example, recent models can create realistic images given only a category, can upsample a low-resolution image into a natural-looking high-resolution counterpart, and can even create aerial views of natural landscapes of arbitrary length.


Schematic of a cascaded diffusion model that generates a completely new image from a given class.
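
The cascade idea can be sketched as a pipeline: a class-conditional base model samples a small image from noise, and successive stages upsample it. The sketch below uses stand-in functions (`denoise_step` and `super_resolve` are placeholders we invented, not trained models) just to show the structure of such a cascade.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(x, t, class_id):
    """Placeholder for a trained class-conditional denoising network; a real
    model would predict and remove noise at step t. Shrinking toward zero
    just lets the loop run end to end."""
    return 0.9 * x

def sample_base(class_id, size=16, steps=10):
    """Base stage: start from pure noise and iteratively denoise."""
    x = rng.standard_normal((size, size, 3))
    for t in reversed(range(steps)):
        x = denoise_step(x, t, class_id)
    return x

def super_resolve(x, factor=2):
    """Placeholder super-resolution stage; a real cascade stage runs another
    conditional diffusion model on the upsampled input."""
    return np.kron(x, np.ones((factor, factor, 1)))  # nearest-neighbor upsample

def cascade(class_id):
    x = sample_base(class_id)   # 16x16 class-conditional sample
    x = super_resolve(x)        # 32x32
    x = super_resolve(x)        # 64x64
    return x

print(cascade(class_id=207).shape)  # (64, 64, 3)
```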

In addition to advanced single-modal models, large-scale multimodal models are also evolving. Some of the most advanced multimodal models can accept multiple input modalities such as language, images, speech, and video, and can produce multiple output modalities. This is an exciting direction because, just as in the real world, some things are easier to learn from multimodal data.

Similarly, pairing images with text helps with multilingual retrieval tasks, and a better understanding of how to pair text and image inputs can improve image-captioning tasks. Co-training on visual and textual data helps improve the accuracy and robustness of visual classification, while joint training on image, video, and speech tasks improves generalization performance across all modalities.
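
One common way image-text pairing is exploited, shown below as a hedged sketch rather than Google's exact objective, is a symmetric contrastive loss that pulls matching image and text embeddings together and pushes mismatched pairs apart.

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matching image-text pairs.

    Row i of each matrix is assumed to come from the same pair, so the
    diagonal of the similarity matrix holds the positives and every
    off-diagonal entry is a negative.
    """
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature   # cosine similarities

    def xent_diagonal(l):
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image->text and text->image directions.
    return 0.5 * (xent_diagonal(logits) + xent_diagonal(logits.T))

rng = np.random.default_rng(0)
print(contrastive_loss(rng.standard_normal((4, 32)), rng.standard_normal((4, 32))))
```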


Schematic from the Robotics at Google team of a vision-based robot manipulation system that can generalize to entirely new tasks.

All of these trends point toward more versatile models that can handle multiple modalities of data and solve thousands or even tens of thousands of tasks. In the coming years, we will pursue this vision with the next-generation Pathways architecture, and we expect to see substantial progress in this area.


Pathways: we are working toward a single model that can generalize across millions of tasks.

Trend 2: Machine learning continues to improve efficiency

Efficiency gains, driven by advances in computer hardware design, machine learning algorithms, and meta-learning research, are powering ever more capable machine learning models. Many parts of the ML pipeline, from the hardware on which models are trained and served to the individual components of an ML architecture, can be optimized for efficiency while maintaining or improving overall performance. These gains have enabled a number of key advances and will continue to significantly improve the efficiency of machine learning, allowing larger, higher-quality models to be developed cost-effectively and further democratizing access.

The first is continuous improvement in ML accelerator performance. Each generation of ML accelerator improves on its predecessors, enabling faster per-chip performance and often increasing the scale of the overall system. In 2021, Google launched its fourth-generation Tensor Processing Unit, TPUv4, which shows a 2.7x improvement over TPUv3 on MLPerf benchmarks. ML capabilities on mobile devices are also improving significantly: the Pixel 6 phone features the new Google Tensor processor, which integrates a powerful ML accelerator to support important on-device features.


Left: a TPUv4 board; center: part of a TPUv4 pod; right: the Google Tensor chip used in Pixel 6 phones.

The second is continued improvement in ML compilation and ML workload optimization. Even when the hardware stays fixed, compiler improvements and other system-software optimizations for ML accelerators can deliver significant efficiency gains.


End-to-end model speedups achieved using ML-based compiler auto-tuning across 150 machine learning models.

The third is the discovery of more efficient model architectures through human creativity. Continuous improvements in model architecture have drastically reduced the computation required to reach a given accuracy on many problems. For example, the Vision Transformer improved SOTA results on a large number of image classification tasks while using 4 to 10 times less computation than CNNs.

The fourth is the machine-driven discovery of more efficient model architectures. Neural architecture search (NAS) can automatically discover new ML architectures that are more efficient for a given problem domain. A key advantage of NAS is that it can greatly reduce the effort required for algorithm development, since it needs to be run only once per combination of search space and problem domain.

In addition, while the initial NAS run requires significant computation, the resulting models can greatly reduce computation in downstream research and production settings, lowering overall resource requirements.
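
The sketch below illustrates the basic shape of architecture search with a toy random search: a search space, a cost proxy, and a stand-in evaluation function. The search space and scoring here are invented for illustration; real NAS systems use far richer spaces and actual training runs or learned performance predictors.

```python
import random

# A toy search space: an architecture is a (depth, width, kernel size) choice.
SEARCH_SPACE = {"depth": [2, 4, 8], "width": [32, 64, 128], "kernel": [3, 5]}

def estimate_cost(arch):
    """Crude FLOPs proxy: deeper, wider, larger-kernel nets cost more."""
    return arch["depth"] * arch["width"] * arch["kernel"] ** 2

def evaluate(arch):
    """Stand-in for training plus validation: a made-up score that rewards
    capacity but penalizes compute, so the search faces a real trade-off."""
    return arch["depth"] * arch["width"] / (1.0 + 1e-3 * estimate_cost(arch))

random.seed(0)
candidates = [{k: random.choice(v) for k, v in SEARCH_SPACE.items()}
              for _ in range(20)]
best = max(candidates, key=evaluate)
print("best architecture:", best, "| cost proxy:", estimate_cost(best))
```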


The Primer architecture found by neural architecture search is 4 times more efficient than the plain Transformer model.

Fifth, the exploitation of sparsity, another major algorithmic advance that can greatly improve efficiency. Sparsity here means that a model has very large capacity, but only a part of the model is activated for any given task, example, or token.

In 2017, we proposed the sparsely-gated mixture-of-experts layer, which achieved better results on multiple translation benchmarks while using 10x less computation than the SOTA dense LSTM models of the time. More recently, the Switch Transformer, which pairs a mixture-of-experts-style architecture with the Transformer architecture, achieved a 7x speedup in training time and efficiency over the dense T5-Base Transformer model. The concept of sparsity can also be applied to reduce the cost of the attention mechanism at the core of the Transformer architecture.
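
A minimal sketch of a sparsely-gated mixture-of-experts layer is shown below (plain NumPy, with toy sizes of our own choosing, not any production implementation): a gating network picks the top-k experts for each token, so total capacity scales with the number of experts while per-token compute scales only with k.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_HIDDEN, N_EXPERTS, TOP_K = 16, 32, 8, 2

# Each expert is a small two-layer MLP; only TOP_K of them run per token.
experts = [(rng.standard_normal((D_MODEL, D_HIDDEN)) * 0.1,
            rng.standard_normal((D_HIDDEN, D_MODEL)) * 0.1)
           for _ in range(N_EXPERTS)]
gate_w = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.1

def moe_layer(tokens):
    """Route each token to its top-k experts and mix their outputs.

    Total capacity grows with the number of experts, but per-token compute
    grows only with k: that is the efficiency argument for sparsity.
    """
    gate_logits = tokens @ gate_w                  # (n_tokens, N_EXPERTS)
    out = np.zeros_like(tokens)
    for i, token in enumerate(tokens):
        top = np.argsort(gate_logits[i])[-TOP_K:]  # chosen expert indices
        weights = np.exp(gate_logits[i, top])
        weights /= weights.sum()                   # softmax over chosen experts
        for w, e in zip(weights, top):
            w1, w2 = experts[e]
            out[i] += w * (np.maximum(token @ w1, 0.0) @ w2)  # ReLU MLP expert
    return out

print(moe_layer(rng.standard_normal((5, D_MODEL))).shape)  # (5, 16)
```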


The BigBird sparse attention model proposed by Google Research combines global tokens that attend over all parts of the input sequence, local tokens, and a set of random tokens.
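
The attention pattern the caption describes can be made concrete as a mask. The sketch below builds a toy BigBird-style boolean mask from the three ingredients (global, local, random); the block-sparse tricks the real implementation uses for speed are omitted.

```python
import numpy as np

def bigbird_style_mask(seq_len, n_global=2, window=1, n_random=2, seed=0):
    """Boolean attention mask combining three patterns: global tokens that
    attend everywhere (and are attended to by everyone), a sliding local
    window, and a few random links per token. A sketch of the pattern only."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    mask[:n_global, :] = True   # global tokens attend to every position
    mask[:, :n_global] = True   # every position attends to global tokens
    for i in range(seq_len):
        mask[i, max(0, i - window):i + window + 1] = True                  # local window
        mask[i, rng.choice(seq_len, size=n_random, replace=False)] = True  # random links
    return mask

m = bigbird_style_mask(12)
print(int(m.sum()), "of", m.size, "entries attended")  # far fewer than the dense 144
```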

Trend 3: Machine learning becomes more personalized and more beneficial to the community

With innovations in machine learning and in silicon hardware such as the Google Tensor processor in the Pixel 6, many new experiences have become possible, making mobile devices better at continuously and efficiently sensing their surrounding context and environment. These advances have improved accessibility and ease of use while boosting computing power, which is essential for mobile photography, live translation, and more. Remarkably, recent technological advances also give users a more personalized experience while strengthening privacy protections.

More people than ever rely on their phone cameras to record daily life and for artistic expression. The clever application of machine learning in computational photography has steadily advanced the capabilities of phone cameras, making them easier to use, more powerful, and capable of producing higher-quality images.

For example, improved HDR+, the ability to take photos in very low light, better portrait processing, and cameras that are more inclusive of all skin tones are all advances that let users take better photos. Powerful ML-based tools in Google Photos, such as Cinematic Photos, can further improve those photos after capture.


HDR+ starts from a burst of full-resolution raw images, each underexposed by the same amount (left); the merged image has reduced noise and increased dynamic range, resulting in a higher-quality final image (right).
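
A toy version of the burst-merging idea is sketched below: averaging several deliberately underexposed raw frames reduces random noise by roughly sqrt(N), after which digital gain restores brightness. Alignment and robust merging, which real HDR+ needs in order to handle motion, are omitted here.

```python
import numpy as np

def merge_burst(frames, gain=4.0):
    """Toy burst merge: averaging N identically underexposed raw frames cuts
    random sensor noise by roughly sqrt(N); digital gain then restores
    brightness. Real HDR+ also aligns frames and merges robustly."""
    merged = np.mean(frames, axis=0)
    return np.clip(merged * gain, 0.0, 1.0)

rng = np.random.default_rng(0)
scene = np.full((4, 4), 0.1)                              # dark, underexposed scene
burst = [scene + rng.normal(0, 0.02, scene.shape) for _ in range(8)]
print(f"single-frame noise: {np.std(burst[0] - scene):.4f}")
print(f"merged-frame noise: {np.std(np.mean(burst, axis=0) - scene):.4f}")
```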

In addition to using their phones for creative expression, many people rely on them to communicate in real time, using Live Translate in messaging apps and Live Caption for phone calls.

Thanks to techniques such as self-supervised learning and noisy student training, speech recognition accuracy continues to improve, with significant gains for accented, noisy, or overlapping speech and in many languages. Building on advances in text-to-speech synthesis, people can listen to web pages and articles with Google's Read Aloud service on a growing number of platforms, making it easier for information to cross barriers of modality and language.


A recent study suggests that gaze recognition is an important biomarker of mental fatigue. (https://www.nature.com/articles/s41746-021-00415-6)

Given the potentially sensitive nature of the data behind these new features, they must be private by design. Many of them run inside Android's Private Compute Core, an open-source, secure environment isolated from the rest of the operating system. Android ensures that data processed in the Private Compute Core is not shared with any app unless the user takes an action.

Android also prevents any feature inside the Private Compute Core from directly accessing the network. Instead, features communicate with Private Compute Services through a small set of open-source APIs, which strip out identifying information and use privacy technologies such as federated learning, federated analytics, and private information retrieval to enable learning while preserving privacy.


Federated Reconstruction is a novel partially local federated learning technique that divides a model into global and local parameters.
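
A minimal sketch of the idea, using an invented linear model rather than anything from Google's implementation: each client first reconstructs its private local parameters from scratch, then computes an update to the shared global parameters, and only the global update ever leaves the device.

```python
import numpy as np

rng = np.random.default_rng(0)
N_CLIENTS, DIM = 4, 8
global_params = np.zeros(DIM)   # shared weights (invented linear model)

def client_round(global_params, x, y, lr=0.1):
    """One client step in the spirit of Federated Reconstruction:
    1) reconstruct the private local parameter from scratch, global frozen;
    2) update the global parameters, local frozen;
    3) return only the global update; the local bias never leaves the device."""
    local_bias = 0.0
    for _ in range(5):                      # step 1: fit the per-user part
        err = x @ global_params + local_bias - y
        local_bias -= lr * err.mean()
    g = global_params.copy()
    for _ in range(5):                      # step 2: fit the shared part
        err = x @ g + local_bias - y
        g -= lr * x.T @ err / len(err)
    return g

updates = []
for _ in range(N_CLIENTS):
    x = rng.standard_normal((16, DIM))
    y = x @ np.ones(DIM) + rng.normal(2.0)  # each client has a private offset
    updates.append(client_round(global_params, x, y))

global_params = np.mean(updates, axis=0)    # server-side federated averaging
print(global_params.round(2))
```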

Trend 4: Machine learning is having a growing impact on science, health, and sustainability

In recent years, machine learning has had a growing impact on the basic sciences, from physics to biology, and has enabled many excellent practical applications in related fields such as renewable energy and medicine. Computer vision models, for example, are being applied to problems at both personal and global scales: they can assist physicians in their daily work, expand our understanding of neurophysiology, and provide more accurate weather forecasts that can streamline disaster relief. Other types of ML models are proving critical in addressing climate change by discovering ways to reduce emissions and improve the output of alternative energy. As machine learning becomes more robust, mature, and widely available, such models are even being used as creative tools for artists.

Large-scale applications of computer vision to gain new insights

Advances in computer vision over the past decade have enabled computers to be used for a wide variety of tasks across scientific fields. In neuroscience, automated reconstruction techniques can recover the neural connectivity structure of brain tissue from high-resolution electron microscopy images of thin sections of brain tissue.

In previous years, Google collaborated to create such resources for fruit fly, mouse, and songbird brains. Last year, Google partnered with Harvard's Lichtman Lab to analyze the largest sample of brain tissue reconstructed and imaged at this level of detail in any species, and produced the first large-scale study of synaptic connectivity in the human cortex spanning multiple cell types across all cortical layers. The goal of this work is to produce a new resource that helps neuroscientists study the astonishing complexity of the human brain. The figure below, for example, shows 6 of the roughly 86 billion neurons in the adult human brain.


A single human chandelier neuron from Google's reconstruction of human cortex, along with some pyramidal neurons that form connections with this cell.

Computer vision technology also provides powerful tools for tackling larger, even global, challenges. A deep-learning-based weather forecasting method that takes satellite and radar imagery as input, combined with other atmospheric data, produces more accurate weather and precipitation forecasts than traditional physics-based models for forecast horizons of up to 12 hours. It can also generate updated forecasts much faster than traditional methods, which matters in times of extreme weather.


A common theme across these cases is that ML models can perform specialized tasks of analyzing available visual data efficiently and accurately, supporting the downstream tasks built on that analysis.

Automated design space exploration

Another way ML is producing excellent results in many areas is by letting algorithms explore and evaluate a problem's design space automatically in search of possible solutions. In one application, a Transformer-based variational autoencoder learns to create aesthetically pleasing and useful document layouts, and the same approach can be extended to explore possible spatial layouts.

Another ML-driven approach automatically explores the design space of computer game rule adjustments, improving playability and other properties of a game and enabling human game designers to create better games faster.


Visualization of the VTN model, which learns to extract meaningful relationships between layout elements (paragraphs, tables, images, and so on) in order to generate realistic synthetic documents (for example, with better alignment and margins).

Other ML algorithms have been used to evaluate the design space of computer architecture decisions for ML accelerator chips themselves. ML can also be used to quickly produce chip placements for ASIC designs that are superior to layouts generated by human experts, and that can be produced in hours rather than weeks. This reduces the fixed engineering costs of chips and lowers the barrier to quickly creating specialized hardware for different applications. Google has successfully used this approach in the design of its upcoming TPU-v5 chip.

This exploratory machine learning approach has also been applied to materials discovery. In a collaboration between Google Research and Caltech, several ML models, combined with a modified inkjet printer and a custom microscope, enabled rapid search over hundreds of thousands of possible materials.

These automated design space exploration methods can help accelerate many areas of science, especially where the entire experimental loop of generating experiments and evaluating results can be done in an automated or mostly automated way. This approach is likely to work well in even more areas in the coming years.

Health applications

In addition to advancing basic science, machine learning can also drive advances in medicine and human health more broadly. Harnessing computer science for advances in health is nothing new, but machine learning opens new doors, brings new opportunities, and poses new challenges.

Take the field of genomics. Computing has been important to genomics since its inception, but machine learning has added new capabilities and disrupted old paradigms. When researchers at Google began working in this area, many experts considered the idea of using deep learning to help infer genetic variants from sequencer output far-fetched. Today, this machine learning method is considered state-of-the-art.

Machine learning will play an even bigger role in the future: genomics companies are developing new sequencing instruments that are more accurate and faster, but that also pose new inference challenges. Google released the open-source software DeepConsensus and, in partnership with UCSC, PEPPER-DeepVariant, which support these new instruments with cutting-edge informatics, in the hope that faster sequencing will soon have real applicability and impact for patients.


Beyond processing sequencer data, there are other opportunities to use machine learning to accelerate the use of genomic information for personalized health. Large biobanks of extensively phenotyped and sequenced individuals could revolutionize how we understand and manage genetic predisposition to disease. Google's ML-based phenotyping method improves the scalability of converting large imaging and text datasets into phenotypes usable for genetic association studies, and its DeepNull method makes better use of large phenotypic data for genetic discovery. Both methods are open source.


The process of generating large-scale quantification of anatomical and disease features to combine with genomic data in biobanks.

Just as machine learning helps us see the hidden features of genomic data, it can also help us discover new information and gather new insights from other health data types. Disease diagnosis is often about identifying patterns, quantifying correlations, or identifying new instances of larger categories, tasks that machine learning excels at.

Google researchers have used machine learning to tackle a wide range of such problems, and its application of machine learning to medical imaging has gone a step further: Google's 2016 paper on applying deep learning to diabetic retinopathy screening was selected by the editors of the Journal of the American Medical Association (JAMA) as one of the ten most influential papers of the decade.

Another ambitious healthcare initiative, Care Studio, uses state-of-the-art ML and advanced NLP techniques to analyze structured data and medical records, surfacing the most relevant information to clinicians at the right time and ultimately helping them deliver more proactive and accurate care.

While machine learning may be important for expanding access and improving accuracy in clinical settings, an equally important new trend is emerging: machine learning applied to helping people improve their daily health and well-being. Everyday devices are increasingly equipped with powerful sensors that can help democratize health metrics and information, so people can make more informed decisions about their health. We have already seen smartphone cameras assess heart rate and respiratory rate without any extra hardware, and Nest Hub devices that support contactless sleep sensing give users a better understanding of their nighttime wellness.
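
As a hedged illustration of how a camera alone can estimate heart rate: blood flow slightly modulates the brightness of skin in video, so the dominant frequency of that signal within a plausible heart-rate band gives the pulse. The sketch below runs on a synthetic signal; a production system would add filtering, motion rejection, and signal-quality checks.

```python
import numpy as np

def heart_rate_bpm(green_means, fps=30.0):
    """Estimate pulse from the mean green-channel brightness of a fingertip
    video: blood volume changes modulate brightness, so the dominant
    frequency within a plausible heart-rate band is the pulse."""
    signal = green_means - np.mean(green_means)
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    band = (freqs > 0.7) & (freqs < 4.0)    # 42-240 beats per minute
    return 60.0 * freqs[band][np.argmax(spectrum[band])]

t = np.arange(0, 10, 1 / 30.0)              # 10 s of video at 30 fps
fake_ppg = (0.05 * np.sin(2 * np.pi * 1.2 * t)          # 1.2 Hz pulse
            + np.random.default_rng(0).normal(0, 0.01, t.size))
print(round(heart_rate_bpm(fake_ppg)))      # ~72 bpm
```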

We've seen that, on the one hand, ASR systems can significantly improve speech recognition quality for disordered speech, and on the other, ML can help recreate the voices of people with speech impairments so that they can communicate in their own voice. ML-enabled smartphones can also help people better examine emerging skin conditions or help people with impaired vision go jogging. These opportunities offer a bright future that cannot be ignored.


A custom ML model for contactless sleep sensing efficiently processes a continuous stream of 3D radar tensors (summarizing activity over a range of distances, frequencies, and times) to automatically compute the probability that a user is present and awake or asleep.

Machine learning applications for the climate crisis

Another area of great importance is climate change, an extremely pressing threat to humanity. We need to work together to bend the curve of harmful emissions and ensure a safe and prosperous future. Information about the climate impact of different choices can help us tackle this challenge in many different ways.


With eco-friendly routing, Google Maps shows both the fastest route and the most fuel-efficient route, so users can choose whichever suits them best.


The wildfire layer in Google Maps provides important, up-to-date information for people in an emergency.

Trend 5: A deeper and broader understanding of machine learning

As ML is used more widely in technology products and in society, we must continue to develop new techniques to ensure it is applied fairly and equitably, benefiting everyone rather than just a subset of users.

One focus area is recommendation systems based on user activity in online products. Because these systems are often composed of several distinct components, understanding their fairness properties typically requires insight into each individual component as well as into how the components behave in combination.

As with recommendation systems, context matters in machine translation. Because most machine translation systems translate individual sentences in isolation, without additional context, they often reinforce biases related to gender, age, or other attributes. To address some of these issues, Google has conducted long-term research on reducing gender bias in its translation systems.

Another common problem in deploying machine learning models is distribution drift: if the statistical distribution of the data a model was trained on differs from that of the data it receives as input, the model's behavior can sometimes become unpredictable.
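
One simple, widely used way to monitor for such drift (our illustration, not a method the article prescribes) is to compare the binned distribution of a feature at training time against its distribution in live traffic, for example with the population stability index.

```python
import numpy as np

def population_stability_index(train_feature, live_feature, bins=10):
    """PSI, a common drift score: compare a feature's binned distribution at
    training time vs. serving time. A rule of thumb treats PSI > 0.2 as
    meaningful drift worth investigating."""
    edges = np.quantile(train_feature, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf   # cover out-of-range live values
    p = np.histogram(train_feature, edges)[0] / len(train_feature)
    q = np.histogram(live_feature, edges)[0] / len(live_feature)
    p, q = np.clip(p, 1e-6, None), np.clip(q, 1e-6, None)
    return float(np.sum((q - p) * np.log(q / p)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)        # training-time feature values
live = rng.normal(0.5, 1.2, 10_000)         # shifted serving distribution
print(round(population_stability_index(train, live), 3))
```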

Data collection and dataset curation are also important, since the data used to train ML models can be a potential source of bias and fairness issues in downstream applications. Analyzing such data cascades in machine learning helps identify the many places in an ML project's lifecycle where decisions can have a significant impact on outcomes. This research on data cascades has produced evidence-backed guidance for data collection and evaluation in the revised PAIR Guidebook for ML developers and designers.


Differently colored arrows represent various types of data cascades, each of which usually originates upstream, is compounded during machine learning development, and manifests downstream.

Creating more inclusive and less biased public data sets is an important way to help improve the field of machine learning for everyone.

In 2016, Google released the Open Images dataset, containing about 9 million images annotated with image labels spanning thousands of object categories and with bounding-box annotations for 600 classes. Last year, Google introduced the More Inclusive Annotations for People (MIAP) dataset within the Open Images Extended collection. The collection contains more complete bounding-box annotations for the person class hierarchy, and each annotation is labeled with fairness-related attributes, including perceived gender presentation and perceived age range.

In addition, as machine learning models become more capable and influential across many domains, protecting the private information used in machine learning remains a research focus. Along these lines, some recent work addresses privacy in large models, both demonstrating that training data can be extracted from large models and pointing to how privacy can be built into them. Alongside the work on federated learning and analytics mentioned above, Google has been enhancing its toolbox with other principled and practical ML privacy techniques.
