laitimes

Feature | Small-sample AI training, accurate protein structure prediction in 16 seconds: the deeper meaning of starting from scratch

In 2016, DeepMind's artificial intelligence program AlphaGo defeated Go world champion Lee Sedol 4:1. That same year, a newly founded Shanghai AI company began developing an AI Go program modeled on AlphaGo, which went on to defeat world Go champion Park Junghwan. In 2020, DeepMind's AlphaFold2 predicted protein structures quickly and accurately, with accuracy comparable to experimental techniques such as cryo-EM. A year later, the same Shanghai company launched TRFold, a domestically developed protein structure prediction platform.

Recently, XLab, a team at Shanghai Tianrang Intelligent Technology Co., Ltd., released the protein structure prediction platform TRFold. Its latest version approaches AlphaFold2 in prediction accuracy while avoiding AlphaFold2's need for very large computing resources: it uses weight sharing to save computing power and predicts most protein chains in less than 16 seconds.

Why work on AI protein structure prediction when AlphaFold2 has already been open sourced? What are the challenges of doing it again? How can a good model be trained when data and computing power are both insufficient? What comes next for TRFold?

In an interview with The Paper (www.thepaper.cn), Tianrang founder Xue Guirong said that AlphaFold2 opened the door for structural biology research. It is like "the Wright brothers' plane", and the core technology must be mastered in-house: those who do not take part in the technology's evolution will be stuck with the original "aircraft" design.

Developing TRFold also made Xue Guirong realize that AlphaFold2 has made another contribution: its training methods can feed back into AI and help build better AI.

Xue Guirong said that if every model needed 10,000 labeled samples to train, it would be a disaster for AI. In practice there is never enough data and never enough computing power, so algorithmic innovation matters more, for example whether a good model can be trained from 10 pictures. He believes that machine learning from small samples is a major challenge for AI, and that industrial-scale AI production that does not depend on so much data is the right path.

Going forward, the team will continue to deepen its simulation of protein-protein interactions. Those interactions can serve as the basis for large-scale interaction network maps, target discovery, mutant protein structure simulation, antibody simulation, and more.

TRFold evaluation results on the CASP14 protein test set. Green: experimental structure; blue: predicted structure.

Accurate prediction in 16 seconds on a single GPU

Proteins are the material basis of life, and a protein's three-dimensional structure directly determines its function. Once that structure is disrupted, the protein's function is lost or altered, and many diseases are caused by structural abnormalities in important proteins in the body.

Each protein's amino acid chain twists, folds, and coils into a complex structure that often takes a long time to decipher, if it can be deciphered at all. So far, the structures of about 180,000 proteins have been determined experimentally, only a small fraction of the billions of proteins that have been sequenced.

Over decades of protein structure determination, X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM) have made great contributions. But these traditional approaches often rely on extensive trial and error and expensive equipment, and a single structure can take years to solve.

Only with the arrival of AI was the problem of predicting the fold of a single protein largely solved, accelerating the development of structural biology. In 2020, at the protein structure prediction competition CASP14, DeepMind's AlphaFold2 for the first time predicted protein structures quickly and accurately, reaching atomic-level accuracy comparable to experimental techniques such as cryo-EM.

Chinese academia and industry are also catching up with the international pace in protein structure prediction. Alongside FALCON from the Chinese Academy of Sciences, tFold from Tencent, and Uni-Fold from DP Technology, there is now TRFold, a self-developed protein structure prediction platform. In the company's internal testing on the CASP14 protein test set, TRFold scored 82.7 (TM-score, a measure of the topological similarity of protein structures), surpassing the 81.3 of the team led by biologist David Baker at the University of Washington and second only to AlphaFold2's 91.1.
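
For readers unfamiliar with the metric, TM-score is usually defined as the average, over the residues of the target, of 1 / (1 + (d_i / d0)^2), where d_i is the distance between the i-th pair of aligned residues and d0 = 1.24 * (L - 15)^(1/3) - 1.8 depends on the target length L. The Python sketch below is only an illustration under simplifying assumptions: the two structures are taken to be already aligned and superimposed (the full metric searches for the superposition that maximizes the score), the names `tm_d0` and `tm_score` and the sample distances are made up for this example, and nothing here reflects how the company computed the figures above. The scores quoted in this article appear to be the usual 0-to-1 value multiplied by 100.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 (angstroms) from the TM-score definition."""
    if target_length <= 21:
        # Assumption of this sketch: very short chains need special handling
        # that is not covered here.
        raise ValueError("this sketch only covers chains longer than 21 residues")
    return 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, target_length):
    """TM-score for ONE fixed superposition (no search over superpositions).

    distances: distances in angstroms between aligned residue pairs of the
               predicted and reference structures.
    target_length: number of residues in the reference structure.
    """
    d0 = tm_d0(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


# A perfect prediction (every aligned pair at distance 0) scores 1.0.
print(tm_score([0.0] * 100, 100))  # 1.0

# Made-up distances for a 100-residue target, reported on a 0-100 scale.
sample = [0.8] * 70 + [2.5] * 20 + [9.0] * 10
print(round(100 * tm_score(sample, 100), 1))
```

Because every term lies between 0 and 1 and the sum is divided by the target length, residues that are far off or missing pull the score down without letting a few large errors dominate it.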

TRFold uses weight sharing to save computing power, consuming about 1/32 of what AlphaFold2 does. For training, AlphaFold2 used 128 TPUv3 cores (roughly equivalent to 256 GPUs), whereas TRFold used only 8 Nvidia RTX 3090 GPUs, achieving results close to AlphaFold2's with minimal computing power.
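
The article does not say how TRFold shares weights, so the following Python sketch shows only the general idea, with a toy network invented for this purpose: one block of parameters is applied several times in a row instead of stacking several distinct blocks. The helper names (`make_block`, `apply_block`, `count_parameters`, `run`) and the sizes are hypothetical. The last lines simply check the hardware ratio quoted above.

```python
import random
from fractions import Fraction


def make_block(width, rng):
    """One toy block: a width x width weight matrix (hypothetical stand-in)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(width)] for _ in range(width)]


def apply_block(weights, vector):
    """Matrix-vector product followed by a ReLU."""
    return [max(0.0, sum(w * x for w, x in zip(row, vector))) for row in weights]


def count_parameters(blocks):
    """Count stored parameters, counting each distinct block object only once."""
    unique = {id(block): block for block in blocks}
    return sum(len(row) for block in unique.values() for row in block)


def run(blocks, vector):
    """Pass the vector through every block in order."""
    for block in blocks:
        vector = apply_block(block, vector)
    return vector


rng = random.Random(0)
width, depth = 16, 8

unshared = [make_block(width, rng) for _ in range(depth)]  # 8 distinct blocks
shared_block = make_block(width, rng)
shared = [shared_block] * depth                            # 1 block reused 8 times

print(count_parameters(unshared))  # 2048 = 8 * 16 * 16
print(count_parameters(shared))    # 256 = 16 * 16
print(len(run(shared, [1.0] * width)))  # 16

# Hardware ratio from the article: 8 GPUs versus roughly 256 GPU equivalents.
print(Fraction(8, 256))  # 1/32
```

In this toy, sharing cuts the number of stored parameters by the depth of the stack, while the work done in one forward pass stays the same; how the savings play out in a real model depends on details the article does not give.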

TRFold uses a recurrent multi-track attention network with 50 million parameters, and supports both prediction of distances between amino acid residues and prediction of the full protein chain structure. On a single Nvidia RTX 3090 GPU, predicting a 400-amino-acid protein chain takes only 16 seconds, whereas AlphaFold2 takes more than 70 seconds for a chain of about the same length.
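
To make "distances between amino acid residues" concrete, the short Python sketch below builds the pairwise distance matrix for a handful of residue coordinates, which is the kind of quantity a distance-prediction model tries to estimate from sequence alone. It is an illustration only: the coordinates are made up, the function names are hypothetical, and the 8-angstrom contact cutoff is a common convention assumed here rather than something stated about TRFold.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances (angstroms) between residue coordinates."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def contact_pairs(dmat, cutoff=8.0):
    """Index pairs (i, j), i < j, whose distance is below the cutoff."""
    n = len(dmat)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if dmat[i][j] < cutoff]


# Hypothetical C-alpha coordinates for a 4-residue fragment (made-up values).
ca_coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (3.8, 3.8, 0.0),
    (0.0, 3.8, 0.0),
]

dmat = distance_matrix(ca_coords)
for row in dmat:
    print([round(d, 2) for d in row])

print(contact_pairs(dmat))              # every pair is closer than 8 angstroms
print(contact_pairs(dmat, cutoff=4.0))  # only the four sides of the square
```

The matrix is symmetric with zeros on the diagonal, so for a chain of n residues there are n * (n - 1) / 2 distinct distances to predict.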

During the CASP14 competition, the CASP organizers launched a structure prediction effort for proteins of the novel coronavirus, and a model submitted by TRFold (nsp6-D2) was selected by CASP as one of the six "most credible models".

As a rule of thumb, a single-protein model that scores above 90 differs little from experimentally determined results. Xue Guirong said that TRFold will continue to iterate. Simulating the structure of single proteins is only the beginning, and Tianrang plans to go on to simulate how proteins interact with their binding partners, including small molecules, peptides, and other proteins.

He said that the clear research direction for now is to keep simulating protein-protein interactions. Those interactions can serve as the basis for large-scale interaction network maps, as well as target discovery, mutant protein structure simulation, antibody simulation, and more.

"In the future, proteins will form an interaction network, and once we have that network we can carry out in-depth analysis." Xue Guirong said that if everyone's protein structures and interaction networks can one day be fully measured, it may become possible to anticipate a person's future health and treatment options in advance through mutation analysis of protein structure.

"There is so much that can be done here. Today we have only scooped a ladle of water out of the sea and taken a look at it." The challenge is self-evident: multi-protein interactions consume far more computing power. If a hundred million proteins interact with another hundred million proteins, that is a hundred million multiplied by a hundred million combinations to compute. "The combinations explode, and which algorithms and strategies to use to speed this up is a very challenging question."

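The scale of that combinatorial explosion is easy to check. The Python sketch below counts the pairings for one hundred million proteins on each side, the figure used in the quote. The function names are made up for illustration, and the per-pair cost passed to `gpu_years` is purely hypothetical: the 16-second figure reported earlier is for a single chain, not for a pair, and is borrowed here only to give a sense of magnitude.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def cross_pairs(n_a, n_b):
    """Number of pairings between two separate sets of proteins."""
    return n_a * n_b


def within_set_pairs(n):
    """Number of unordered pairs inside a single set of n proteins."""
    return n * (n - 1) // 2


def gpu_years(pairs, seconds_per_pair):
    """Years of single-GPU time at a fixed (hypothetical) cost per pair."""
    return pairs * seconds_per_pair / SECONDS_PER_YEAR


n = 100_000_000  # one hundred million proteins

print(cross_pairs(n, n))    # 10000000000000000, i.e. 10**16
print(within_set_pairs(n))  # roughly half of that if both sides are the same set
print(f"{gpu_years(cross_pairs(n, n), 16):.2e} GPU-years at 16 s per pair")
```

Even at a few seconds per pair, exhaustive evaluation would take billions of GPU-years, which is why the choice of algorithm and search strategy matters more than raw hardware.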

Xue Guirong

"The Wright Brothers' Plane"

In 2016, DeepMind's AlphaGo defeated Go world champion Lee Sedol of South Korea 4:1. That year, the newly founded Tianrang began developing an AI Go program modeled on AlphaGo.

In May 2018, Tianrang's AI Go program, playing white, faced world Go champion Park Junghwan, who resigned after three hours of fierce play. The AI Go technology was eventually applied to urban traffic light control to help relieve congestion.

In 2019, Tianrang moved into protein structure prediction. Xue Guirong is often asked why he entered the field, or why he works on protein structure prediction when AlphaFold2 has been open sourced.

"From the very beginning, when we started building AI Go programs, we gradually came to appreciate the power of AI. For years, Tianrang has hoped to use AI to solve big, challenging problems, first transportation and then biology." Xue Guirong said that compared with macro-level transportation, which can be scheduled with algorithms, human understanding of the microscopic world is far more limited.

"We started on this in 2019, when the amount of protein structure data was not much different from today. Whether a hundred thousand or so proteins with known structures are enough for algorithms to work out the three-dimensional structures of unknown proteins in the microscopic world was a very challenging question. At the time we did not know there would be an AlphaFold2, let alone whether it could be done."

But if a protein prediction model could match the accuracy of experimental instruments, it would be a huge step forward. Fortunately, at the end of 2020, AlphaFold2 proved the power of algorithms, astonishing the entire structural biology community and opening a new chapter in the field. "Working out a protein structure used to take a year or two, and suddenly it can be done in an hour."

Solving the protein structure prediction problem is a new starting point for life science exploration, Xue Guirong said. The change brings great opportunities for the whole industry, and the technological breakthrough will reshape much of the existing logic of applied biology, such as drug development processes, disease treatment, and personalized medicine. In fact, however, the AlphaFold2 code that was open sourced is only the inference code; the training code has not been made public.

AlphaFold2's success is a major breakthrough in protein structure prediction, but the development of AI algorithms that address protein structure and function and meet the accuracy requirements of real-world applications has only just begun. Without experience in training such models, or the ability to reproduce AlphaFold2's results through training, it is impossible to push the technology forward to solve deeper problems.

"The whole core technology is still in other people's hands. Today they give you something to use, but you do not know how it was made," Xue Guirong said.

For example, AlphaFold-Multimer, which the DeepMind team released in October to predict the structures of protein complexes, predicts protein-protein interactions after only minor adjustments to AlphaFold2. Deeper research of this kind requires the ability to build one's own underlying algorithms before it can truly be applied in biology.

"It is like building airplanes. The Wright brothers invented the first airplane that could fly, but if you are not involved in what came after, you will always be stuck with the design of that time. Today's large aircraft fly through the sky carrying many people; a great deal of research lies behind that, and a great deal of innovation keeps emerging from it."

Xue Guirong said that today's DeepMind AlphaFold2 is the "Wright Brothers' Plane", and the core technology must be mastered in order to compete on the same track as other teams. "AlphaFold keeps moving forward and we're catching up with them."

AlphaFold2 "gives back" to AI

"We hovered around 70 points for a long time, and only recently jumped to over 80." Over two and a half years of research and development, TRFold has gone through dozens of iterations. The current training architecture was designed at the beginning of this year; processing the data, training, and iterating nonstop then took 10 months, and the score now stands at 82.7.

The team's headaches were computing power and GPU memory, which together determine how large the model can be. Xue Guirong explained: "A small model has limited capacity to memorize. The larger the model, the stronger that capacity, but the greater the corresponding demand for computing power and memory." With limited training resources, the team made improvements in data and network design, achieving relatively good results despite a huge gap in computing power.

"In the later process of building protein interaction networks, which involves interactions between one protein and another as well as among multiple proteins, the amount of computation grows exponentially. TRFold's ability to predict protein structure quickly with modest computing requirements is therefore of far-reaching significance for subsequent in-depth research," said Miao Hongjiang, head of Tianrang's XLab team.

Data is the fuel of machine learning. But whereas earlier image recognition algorithms relied on millions of images to train models and systems, experimental methods have resolved only a hundred thousand or so protein structures. Machine learning expert Andrew Ng has argued that one of the challenges in putting artificial intelligence into practice is the small-data problem: how to make machine learning work even when data is scarce.

This problem also troubled Xue Guirong. On the day he and Miao Hongjiang first met, they discussed whether the existing protein structure data was enough to train a model with satisfactory results, or whether they would have to wait 10 years for cryo-EM to produce 500,000 structures. "At the time we were very worried that this field really would not be cracked for another 10 years."

AlphaFold2, however, proved that good results can be achieved through algorithm and model design, distilled data, and other means. Compared with AlphaFold2, TRFold's model is trained on only a small amount of real data: the team searches multiple metagenomic sequence databases for multiple sequence alignments that contain more accurate coevolutionary information, so that the model learns to recognize genuine coevolution signals during training and thus predicts inter-residue distances and coordinates more accurately. The team expects to add distillation-based data augmentation in the near future to further improve the model's predictive power and generalization.

Looking back, Xue Guirong believes that besides opening the door for structural biology research, AlphaFold2 has made another contribution: its training methods can feed back into AI and help build better AI.

"AI takes three things: annotated data, algorithm design, and a well-defined goal. With those three plus computing power, you can do AI." But in practice there is never enough data and never enough computing power, and in that situation algorithmic innovation matters more.

"Can you get a good model out of 10 pictures? It should not be the case that even 10,000 pictures cannot produce a good model. Whether machine learning can be done well with small samples is a big challenge for the entire AI field, because only small-sample methods can scale to industrial production." Xue Guirong said that if every model needed 10,000 labeled samples to train, the data labeling industry would thrive, but it would be a disaster for AI.

"How can this work if everything requires labeling huge amounts of data? That kind of artificial intelligence is all 'artificial'. What we really need to do is make the intelligence part stronger and confine the artificial part to a small box. That is what we mean by going from artificial intelligence to machine intelligence." Xue Guirong sees the step from AlphaFold to AlphaFold2 as a huge shift from artificial intelligence toward machine intelligence. Industrial-scale AI production that does not require so much data is the right path.
