
Will Transformer dominate the AI space? It is too early to draw conclusions

Selected from Quanta Magazine

Compiled by Machine Heart

Written by Stephen Ornes

Machine Heart Editorial Department

Having started with natural language processing and gone on to shine in image classification and generation, will the seemingly unstoppable Transformer become the field's next legend?

Imagine walking into your local hardware store and seeing a new kind of hammer on the shelf. You've heard about this hammer: it pounds faster and more accurately than other hammers, and over the last few years it has made many of them obsolete for most uses.

What's more, with a few tweaks, an attachment here and a screw there, the hammer can also be turned into a saw that cuts as fast as any alternative. Some experts at the frontier of tool development say this hammer may signal that all tools will converge into a single device.

A similar story is playing out in artificial intelligence. The versatile new hammer is a kind of artificial neural network, a network of nodes trained on existing data to "learn" how to accomplish certain tasks, called the Transformer. It was originally designed for language tasks, but it has recently begun to influence other areas of AI.

The Transformer first appeared in a 2017 paper, "Attention Is All You Need." In other AI approaches, the system first focuses on local patches of the input data and then builds up to the whole; in a language model, for example, nearby words are grouped together first. The Transformer, by contrast, runs processes so that every element of the input data connects, or attends, to every other element. Researchers call this "self-attention." It means that as soon as training begins, the Transformer can see traces of the entire dataset.
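
To make the self-attention idea concrete, here is a minimal NumPy sketch (illustrative only, not the paper's implementation) of scaled dot-product self-attention, in which every position in a sequence is mixed with every other position according to query, key, and value projections; the projection matrices and the toy input are made up for the example.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence.

    x: (seq_len, d_model) array of input embeddings (words, patches, ...).
    Wq, Wk, Wv: projection matrices of shape (d_model, d_k).
    Every position attends to every other position, so even the first
    layer can link far-apart elements of the input.
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv           # project inputs
    scores = q @ k.T / np.sqrt(k.shape[-1])    # pairwise similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ v                          # weighted mix of all positions

# Tiny usage example with random data: 5 tokens, 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)             # (5, 8) contextualized tokens
```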

Before transformers, AI's progress on language tasks had lagged behind developments in other fields. "In this deep learning revolution that's happened over the last 10 years, natural language processing is somewhat of a latecomer," says Anna Rumshisky, a computer scientist at the University of Massachusetts Lowell. "In a sense, NLP has lagged behind computer vision, and Transformer has changed that."

The Transformer quickly became the front-runner for applications such as word recognition that focus on analyzing and predicting text. It sparked a wave of tools, such as OpenAI's GPT-3, which can be trained on hundreds of billions of words and generate coherent new text.

Transformer's success has prompted researchers in the field of artificial intelligence to think: What else can this model do?

The answers are slowly unfolding: the Transformer is proving to be astonishingly versatile. In some vision tasks, such as image classification, neural networks that use Transformers are faster and more accurate than those that don't. Transformers are also driving progress in other areas of AI, such as processing several kinds of input at once or handling planning tasks.

"Transformer seems to be quite transformative on many issues in the field of machine learning, including computer vision," says Vladimir Haltakov, who works with computer vision for self-driving cars at BMW in Munich.

Just a decade ago, the different subfields of AI had little to say to one another, but the arrival of the Transformer suggests the possibility of convergence. "I think the Transformer is so popular because it implies the potential to become universal," says Atlas Wang, a computer scientist at the University of Texas at Austin. "There is good reason to want to try Transformers across the entire spectrum of AI tasks."

From "Language" to "Vision"

A few months after the release of "Attention Is All You Need," one of the most promising moves to expand the Transformer's reach began. Alexey Dosovitskiy, then at Google Brain's Berlin office, was working on computer vision, the AI subfield focused on teaching computers how to process and classify images.


Alexey Dosovitskiy

Like almost everyone else in the field, he had been using convolutional neural networks (CNNs). For years, CNNs had driven all the major leaps in deep learning, especially in computer vision. They work by repeatedly applying filters to the pixels of an image to build up a recognition of features. CNNs are what let a photos app sort your pictures by face or tell an avocado from a cloud, and they were considered indispensable for vision tasks.

At the time, Dosovitskiy was working on one of the field's biggest challenges: scaling up CNNs, training them on ever-larger datasets representing higher-resolution images, without increasing the processing time. But then he watched the Transformer displace the previous tools of choice for nearly every language-related AI task. "We were clearly inspired by what was happening," he said. "We wondered: can we do something similar for vision?" The idea made a certain kind of sense; after all, if Transformers could handle large datasets of words, why not pictures?

The eventual result was a network called the Vision Transformer, or ViT, which the researchers presented at a conference in May 2021. The model's architecture is almost identical to that of the first Transformer proposed in 2017, with only minor changes that let it analyze images rather than words. "Language tends to be discrete," Rumshisky says, "so the image has to be discretized."

The ViT team knew they couldn't fully mimic the language approach, since self-attention over every individual pixel would be prohibitively expensive in computing time. Instead, they divided the larger image into square units, or tokens. The size is somewhat arbitrary, since the tokens can be made larger or smaller depending on the resolution of the original image (the default is 16 pixels on a side). But by processing pixels in groups and applying self-attention to each patch rather than to each pixel, ViT could quickly churn through enormous training datasets, spitting out increasingly accurate classifications.
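
As a rough sketch of the patching step described above (not the ViT authors' code), here is one way to split an image into flattened 16-pixel square tokens; the 224-pixel input size and the helper name image_to_patches are assumptions for illustration.

```python
import numpy as np

def image_to_patches(image, patch_size=16):
    """Split an (H, W, C) image into flattened square patches ("tokens").

    H and W are assumed to be multiples of patch_size, as in the default
    ViT setup; each patch becomes one row of the output.
    """
    h, w, c = image.shape
    rows, cols = h // patch_size, w // patch_size
    patches = (image
               .reshape(rows, patch_size, cols, patch_size, c)
               .transpose(0, 2, 1, 3, 4)          # group pixels by patch grid
               .reshape(rows * cols, patch_size * patch_size * c))
    return patches                                 # (num_patches, patch_dim)

img = np.zeros((224, 224, 3))                      # a standard ImageNet-sized input
tokens = image_to_patches(img)                     # (196, 768): a 14x14 grid of tokens
```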

The Transformer classified images with more than 90% accuracy, a far better result than Dosovitskiy expected, and set a new state-of-the-art top-1 accuracy on the ImageNet dataset. ViT's success suggested that convolutions might not be as essential to computer vision as researchers had believed.

Neil Houlsby of Google Brain's Zurich office, who co-developed ViT with Dosovitskiy, said: "I think it is quite likely that CNNs will be replaced by vision Transformers or their derivatives in the medium term." Future models, he believes, may be pure Transformers or approaches that add self-attention to existing models.

Other results support these predictions. Researchers routinely test their image classification models on the ImageNet database, and in early 2022 an updated version of ViT was second only to a newer method that combines CNNs with Transformers. CNNs without Transformers, the long-time champions, barely made the top 10.

How Transformer works

The ImageNet results showed that Transformers could compete with leading CNNs. But Maithra Raghu, a computer scientist at Google Brain's office in Mountain View, California, wondered whether they "see" images the same way CNNs do. Neural networks are notoriously hard-to-decipher black boxes, but there are ways to peek inside, for example by examining a network's inputs and outputs layer by layer to see how the training data flows through. Raghu's team did essentially that: they took ViT apart.
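
A minimal sketch of that kind of layer-by-layer inspection, using PyTorch forward hooks on a toy stand-in model (the actual analyses in Raghu's study are more involved, and the helper below is hypothetical):

```python
import torch
import torch.nn as nn

def capture_layer_outputs(model, layers, x):
    """Run `x` through `model` and record each listed layer's output.

    A simple way to peek inside the black box: register forward hooks,
    do one forward pass, then inspect how the representation changes
    layer by layer.
    """
    captured, hooks = {}, []
    for name, module in layers.items():
        hooks.append(module.register_forward_hook(
            lambda m, inp, out, name=name: captured.__setitem__(name, out.detach())))
    with torch.no_grad():
        model(x)
    for h in hooks:
        h.remove()
    return captured

# Toy stand-in for a vision model; a real study would load a trained ViT or CNN.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU(), nn.Linear(64, 10))
layers = {"hidden": model[1], "logits": model[3]}
acts = capture_layer_outputs(model, layers, torch.randn(1, 3, 32, 32))
print({name: tensor.shape for name, tensor in acts.items()})
```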


Maithra Raghu

Her team identified ways in which self-attention leads to a different mode of perception within the algorithm. Ultimately, a Transformer's power comes from the way it processes an image's encoded data. "In a CNN, you start off being very local and slowly get a global perspective," Raghu says. A CNN recognizes an image pixel by pixel, building up features such as corners or lines from the local to the global. But in a Transformer with self-attention, even the very first layer of processing makes connections between distant image locations, just as it does with language. If a CNN's approach is like starting at a single pixel and slowly zooming out, a Transformer slowly brings the whole fuzzy image into focus.

The difference is easier to grasp in the language domain where Transformers got their start. Consider these sentences: "The owl found a squirrel. It tried to grab it with its claws, but only got the end of its tail." The structure of the second sentence is confusing: what do those "it"s refer to? A CNN that focuses only on the words immediately surrounding each "it" would struggle, but a Transformer that connects every word to every other word can work out that the owl did the grabbing and the squirrel lost part of its tail.


Now that it is clear the Transformer processes images in a fundamentally different way from convolutional networks, researchers have grown even more excited. The Transformer's versatility in handling data ranging from one-dimensional strings, such as sentences, to two-dimensional arrays, such as images, suggests that such a model could process many other kinds of data. Wang, for example, believes the Transformer could be a big step toward a convergence of neural network architectures, producing a universal approach to computer vision, and perhaps to other AI tasks as well. "Of course, there are limitations to making that really happen, but if there is a model that can be universal, where you can put all kinds of data into one machine, that would certainly be great."

About ViT's outlook

Now researchers hope to apply Transformers to an even harder task: generating new images. Language tools such as GPT-3 can produce new text based on their training data. In a paper published last year, "TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up," Wang combined two Transformer models in an attempt to do the same for images, a much more difficult problem. When the double-Transformer network was trained on more than 200,000 celebrity faces, it synthesized new face images at moderate resolution. Judged by inception score, a standard way of evaluating images generated by neural networks, the generated celebrity faces were impressive and at least as convincing as those created by CNNs.
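
For reference, the inception score mentioned above can be sketched as follows; this is a generic illustration of the metric, not the TransGAN evaluation code, and the toy probabilities are invented.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Compute the inception score from per-image class probabilities.

    probs: (num_images, num_classes) softmax outputs of a classifier
    (classically Inception-v3) applied to the generated images.
    IS = exp( mean_x KL( p(y|x) || p(y) ) ): high when each image gets a
    confident label and the labels are diverse across the whole set.
    """
    p_y = probs.mean(axis=0, keepdims=True)   # marginal label distribution
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

# Toy example: four "generated images" scored over three classes.
probs = np.array([[0.90, 0.05, 0.05],
                  [0.05, 0.90, 0.05],
                  [0.05, 0.05, 0.90],
                  [0.34, 0.33, 0.33]])
print(inception_score(probs))
```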

Wang argues that the Transformer's success in generating images is even more surprising than ViT's prowess at classifying them. "A generative model needs to synthesize, needs the ability to add information so that it looks plausible," he said. And as in classification, Transformer-based approaches are supplanting convolutional networks in generation as well.

Raghu and Wang also see new uses for Transformers in multimodal processing. "It used to be trickier to do," Raghu says, because each type of data had its own specialized model and the methods lived in silos. But the Transformer suggests a way to combine multiple input sources.

"There are a lot of interesting apps that can combine some of these different types of data and images." For example, a multimodal network might provide support for a system that can read a person's lips in addition to listening to one's voice. "You can have a wealth of linguistic and image information representations," Raghu says, "and more in depth than before."


These faces were created by a Transformer-based network after training on a dataset of more than 200,000 celebrity faces.

A new wave of studies points to a range of further uses for Transformers in other areas of artificial intelligence, including teaching robots to recognize human body movements, training machines to discern emotions in speech, and detecting stress levels in electrocardiograms. Another program with a Transformer component is AlphaFold, which made headlines for its ability to quickly predict protein structures, a problem (how protein molecules fold) that had stumped researchers for five decades.

Transformer isn't all you need

Even as the Transformer helps unify and improve AI's tools, it comes with costs, like other emerging technologies. A Transformer model requires an enormous amount of computing power in the pre-training phase before it can beat its predecessors.

That could be a problem. "There is ever-growing interest in high-resolution images," Wang says, and the training expense could be a drag on the Transformer's adoption. Raghu, however, believes the training hurdle can be overcome with sophisticated filters and other tools.

Wang also notes that even though vision Transformers are already driving advances in AI, many of the new models still incorporate the best parts of convolutions. That means future models are more likely to use both approaches than to abandon CNNs altogether, he says.

It also raises the alluring prospect of hybrid architectures that draw on the Transformer's strengths in ways today's researchers can't foresee. "Maybe we shouldn't rush to the conclusion that the Transformer is the final model," Wang said. But it is becoming increasingly clear that the Transformer will be at least part of every new super-tool in the AI shop.
