
Tang Jian: Generative Artificial Intelligence in Life Sciences: How to Build ChatGPT in Life Sciences

Author: Forum for the Future

In the third session of the "Understanding the Future" science lecture series on AI for Science, we invited Tang Jian, Associate Professor at the Mila – Quebec AI Institute and AI Chair at the Canadian Institute for Advanced Research (CIFAR), to introduce the application of generative AI in the life sciences under the title "Generative Artificial Intelligence in Life Sciences: How to Build a 'ChatGPT' of the Life Sciences". Research in artificial intelligence and biotechnology is in a golden age. Language generation models such as ChatGPT have made major breakthroughs in dialogue systems, so researchers are exploring whether similar AI models can be built in the biopharmaceutical field. There have already been many explorations at the intersection of AI and biomedicine, for example: GeoDiff for predicting the three-dimensional conformations of small molecules; E3Bind for protein–ligand complex structure prediction; and ProtSeed for simultaneously generating new protein structures and sequences.

Tang Jian: Thank you very much for the invitation and the introduction. I am honored to have this opportunity today to share some of my views on the future development of artificial intelligence in the life sciences, along with some of the work our team has done.

I will focus on generative artificial intelligence today because, as you know, it has developed very rapidly recently, especially since ChatGPT, released at the end of last year, attracted widespread attention.

The question we are discussing today is: is it possible to build a ChatGPT-like model for the life sciences?

Let me start with the big picture. We are now in the best era for research, because we are experiencing a dual technological revolution in artificial intelligence and biotechnology. On the AI side, the field has advanced rapidly over the past decade, starting from the ImageNet breakthrough in 2012. Since then there have been several major technological revolutions, such as reinforcement learning, represented by AlphaGo: in 2016, the AlphaGo system developed by DeepMind was able to beat the best human Go players. In the past two years, generative models such as GPT-3 and ChatGPT have been able to generate highly realistic text. In addition, graph machine learning and geometric deep learning have also developed rapidly in recent years; the most classic example is AlphaFold2, a protein structure prediction model. AlphaFold2 is essentially a geometric deep learning model: its core idea is to model the geometric relationships between amino acids in space, and with those relationships we can more accurately predict a protein's three-dimensional structure. That is the technological revolution on the AI side.


On the other hand, biotechnology has also developed particularly rapidly over the past decade, especially gene sequencing, gene synthesis and gene editing. We can now perform gene synthesis, sequencing and editing at high throughput and low cost, and thereby obtain a large amount of experimental data.

At the same time, with the rapid development of structure determination technologies represented by cryo-EM, we can quickly determine the structures and study the functions of proteins.

Within AI technology, I want to focus on generative artificial intelligence. AI-generated content (AIGC) has developed rapidly over the past two or three years. What is the core idea of generative AI? Essentially, after training on large amounts of Internet data — massive text, image and code datasets — we can let the model generate new, highly realistic text, images and code.

For example, the text generation model GPT-3 can generate new text after training on a large text corpus: in this example, we enter a sentence and GPT-3 completes the next paragraph — that is the text generation paradigm. Similarly, there are image generation models: given a textual description, the model generates an image matching its semantics — in this example, a surfing panda. Another widely discussed generative model is ChatGPT. ChatGPT is a large-scale language model that, after further optimization, can be used in a dialogue system: the user enters a sentence and the AI model returns a corresponding answer. This answer is newly generated text, not copied from elsewhere; the user can continue the conversation and the bot provides further responses, so it is essentially a chatbot.

What does ChatGPT's training process look like? First, it is a large-scale pre-trained language model, pre-trained on massive amounts of text and code from the Internet; pre-training is equivalent to letting the model understand that data (text and code). Of course, after pre-training alone the model still cannot be used in a dialogue system, so it is further optimized for dialogue tasks.

How is it optimized? First, labeled data is collected: given some user prompts, human annotators write corresponding answers. With enough of this question-and-answer data, the language model can be further optimized for Q&A — that is the second step, further optimizing the model on labeled data. After obtaining this optimized model, the third step is to let the model interact further with people: humans provide answers and feedback, which is essentially a reinforcement learning process. Through reinforcement learning the model is optimized further, finally producing a machine learning model specialized for chat conversations. This is a major recent breakthrough in generative AI.
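The three stages above can be sketched in toy form. This is only an illustrative sketch — the corpus, the `reward_fn`, and the bag-of-replies "model" are all hypothetical stand-ins, not the actual ChatGPT pipeline:

```python
import random

def pretrain(corpus):
    """Stage 1 stand-in: 'understand' the data by collecting candidate replies."""
    return {"candidates": list(corpus), "scores": {c: 0.0 for c in corpus}}

def supervised_finetune(model, labeled_qa):
    """Stage 2 stand-in: boost replies that human annotators wrote as good answers."""
    for _prompt, answer in labeled_qa:
        if answer in model["scores"]:
            model["scores"][answer] += 1.0
    return model

def rl_from_feedback(model, reward_fn, rounds=50, lr=0.5, rng=random):
    """Stage 3 stand-in: sample a reply, get scalar human-style feedback, reinforce it."""
    for _ in range(rounds):
        reply = rng.choice(model["candidates"])
        model["scores"][reply] += lr * reward_fn(reply)
    return model

def best_reply(model):
    """Return the currently highest-scoring reply."""
    return max(model["candidates"], key=lambda c: model["scores"][c])
```

After pre-training on a toy corpus, fine-tuning on one labeled Q&A pair, and running feedback rounds that reward helpful replies, `best_reply` settles on the helpful answer — the same pretrain → supervised fine-tune → reinforcement-learning shape described above.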

As mentioned earlier, biology has also experienced a technological revolution, especially in gene sequencing and gene synthesis. These advances give us large amounts of protein sequence data (such as UniProt) and antibody sequence data, for example from sequencing CDR regions. At the same time, high-throughput wet-lab platforms let us obtain large amounts of labeled experimental data by testing AI-designed antibody or protein sequences at scale.

I also just mentioned the structure determination technologies represented by cryo-EM, which can now determine protein structures quickly. This is one reason the amount of structural data in the PDB keeps growing, and it was very important to the success of AlphaFold2, because the PDB now contains hundreds of thousands of structures.

Based on these two points, we now see a very big opportunity for generative AI in drug discovery, especially in protein design. Why? What is the core goal of drug discovery, or of protein design? The goal is to design entirely new protein molecules. So we want to develop machine learning models that can generate entirely new proteins. This is exactly what generative models do: we can use them to generate new proteins and new molecules, and in this way help find better drug candidates.

Of course, generative AI relies on large amounts of data — ChatGPT, for example, is pre-trained on massive amounts of Internet text and code. In biomedicine, thanks to the development of gene sequencing, gene editing, gene synthesis and cryo-EM technologies, we also have abundant data: large amounts of protein sequence data, antibody sequence data and structural data. Based on these data, we can likewise pre-train a generative model and then use it to generate new proteins.

After a language model like ChatGPT is pre-trained, it needs to be further optimized through interaction with people. Similarly, once we have a pre-trained model in drug discovery that can generate new protein molecules, we want to optimize it further. In this case, however, the feedback does not come from interacting with people: in biomedicine, the AI model can interact with a wet-lab platform. For example, the generative model designs new protein or antibody sequences; these sequences are synthesized, expressed, purified and tested on the wet-lab platform, yielding activity and functional data; and that data is fed back to the generative model so it can be optimized further.


So in biomedicine we see a similar opportunity: we can also develop ChatGPT-like models, because we now have a great deal of data in this field and can obtain abundant feedback through high-throughput wet-lab platforms to further optimize such models.

Take antibody sequence design as an example. In ChatGPT, the input is a user's question; here, the input is the sequence or structure of an antigen. Our AI model likewise provides an answer — an AI-generated antibody sequence. The feedback no longer comes from interacting with people; instead, the wet-lab platform tests these sequences, and the results are fed back to the AI model. After several rounds of interaction between the AI model and the wet-lab platform, we may find the molecules we want. That is the big idea.
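The loop just described — the model proposes sequences, the wet lab measures them, and the measurements feed back into the next round — can be sketched with toy stand-ins. Here `mutate` stands in for the generative model and `wet_lab_affinity` for the experimental measurement (a hidden target sequence plays the role of the unknown optimum); none of this is a real design model:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutate(seq, rng):
    """Stand-in for the generative model: propose a single-point variant."""
    s = list(seq)
    i = rng.randrange(len(s))
    s[i] = rng.choice(AMINO_ACIDS)
    return "".join(s)

def wet_lab_affinity(seq, target):
    """Stand-in for the wet-lab measurement: similarity to a hidden optimum."""
    return sum(a == b for a, b in zip(seq, target)) / len(target)

def design_loop(seed_seq, target, rounds=10, batch=20, rng=random):
    """Closed loop: propose a batch, 'measure' it, keep the best as feedback."""
    best, best_score = seed_seq, wet_lab_affinity(seed_seq, target)
    for _ in range(rounds):
        candidates = [mutate(best, rng) for _ in range(batch)]          # AI proposes
        score, top = max((wet_lab_affinity(c, target), c) for c in candidates)
        if score >= best_score:                                         # feedback updates
            best, best_score = top, score
    return best, best_score
```

Each round mirrors one design–test–feedback cycle; in practice the "measurement" would be a synthesis, expression, purification and affinity test, and the feedback would fine-tune the generative model rather than just reseed it.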


Next, I will introduce our team's work on generative AI in biomedicine; we have worked on both small molecules and large molecules. The first project concerned the three-dimensional structure prediction of small molecules. Structure prediction is a very fundamental problem; AlphaFold2 addresses it for protein macromolecules. Our problem was very similar to protein structure prediction, except that instead of large protein molecules we predicted the structures of small molecules. Here the input is not an amino acid sequence but the chemical formula of the small molecule, or equivalently its molecular graph, and from this graph we want to predict its three-dimensional conformation.

Our team had been working on this problem for some time, even before AlphaFold2 appeared. Our latest work uses the now-popular diffusion generative model to predict the three-dimensional conformations of molecules — the first academic work to model molecular 3D structure with diffusion models. What does a diffusion model essentially do? It establishes a correspondence between two distributions: on the right is the data distribution, such as the 3D conformations of molecules, and on the left is a noise distribution. The diffusion model first has a forward process from the data distribution to the noise distribution: we add noise, randomly perturbing the coordinates of every atom at each step, and after many perturbations the molecule's conformation degenerates into completely random noise. That is the forward diffusion process.
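The forward process just described can be written down in a few lines. This is a generic DDPM-style forward step on atomic coordinates (the β schedule values are illustrative defaults, not the exact settings used in our work):

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Per-step noise levels β_1..β_T (illustrative linear schedule)."""
    return np.linspace(beta_start, beta_end, T)

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0): shrink the clean coordinates by sqrt(ᾱ_t)
    and add Gaussian noise with variance 1 - ᾱ_t, where ᾱ_t = Π(1 - β_s)."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(x0.shape)          # one noise draw per coordinate
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps, eps
```

At t near T, ᾱ_t is close to zero, so x_t is essentially pure noise — the conformation has "degenerated" exactly as described above.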


In practice, we care about the other direction — the reverse generation process: how do I generate a stable structure from a random one? Each step of this generation process performs denoising: starting from a completely random molecular conformation and denoising continually, we obtain better and better conformations. So the most important component of this model is the denoising network. How can each step of the neural network be understood in terms of physical principles? Each step can be understood as the model learning the forces acting on each atom: we use neural networks and data-driven methods to learn a force field. Having learned the forces on each atom, we adjust the atomic positions according to that force field and slowly converge to a stable structure. In addition, the forces on the atoms are equivariant to rotations and translations: if I rotate the input structure, the forces acting on the atoms rotate correspondingly, so the neural network must satisfy this equivariance property.

Here is an example: we input different molecular graphs, and the model produces a number of different stable structures. The schematic on the right shows how the diffusion model finds the more stable structures. At the beginning, each atom's position is randomly initialized; in each denoising step we estimate the force acting on each atom and use it to adjust the atomic coordinates; after many such steps the whole molecule converges to a relatively stable structure. So it is somewhat like learning a force field in a data-driven way. That covers structure prediction for individual molecules.
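The "denoising as a learned force field" picture can be sketched with a toy stand-in. Here the real denoising network is replaced by a hypothetical `learned_force` that simply pulls atoms toward a known reference structure, with annealed noise — enough to show the iterative refinement, but not a trained model:

```python
import numpy as np

def learned_force(x, x_ref):
    """Stand-in for the denoising network: a force pulling each atom toward a
    stable reference structure (in the real model this is learned from data
    and must be equivariant to rotations and translations)."""
    return x_ref - x

def reverse_generate(x_ref, steps=200, step_size=0.1, noise=0.05, seed=0):
    """Start from random coordinates; repeatedly move along the predicted
    force with annealed noise until the structure stabilizes."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(x_ref.shape)              # random initialization
    for k in range(steps):
        f = learned_force(x, x_ref)                   # "force" on each atom
        anneal = 1.0 - k / steps                      # noise shrinks to zero
        x = x + step_size * f + noise * anneal * rng.standard_normal(x.shape)
    return x
```

The annealed noise mirrors the reverse diffusion schedule: large, exploratory moves early, and nearly deterministic refinement near convergence.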

Later, we extended this work further and asked whether we could predict the structure of complexes — for example, protein–ligand complexes. For this problem the input still includes the small molecule's graph structure or chemical formula, but now it also includes the three-dimensional structure of the protein. From these two inputs we want to predict how the protein and the small molecule bind, i.e. their complex structure. We propose an encoder–decoder framework, again essentially based on the diffusion generative model. The encoder models the molecular graph, the protein structure, and the interactions between them, taking their geometric constraints into account. After the encoder, a decoder further infers the complex structure; it is essentially a generative model, or a denoising network. At the beginning, the structure of the protein–ligand complex is also randomly initialized; each step denoises, i.e. optimizes, the complex structure, and after multiple rounds of optimization it finally converges to a relatively stable structure.

This figure shows how the model infers the complex structure: green is the protein, gray on the right is the true ligand structure, and pink is the ligand structure predicted by the model. At the beginning the pink structure is randomly initialized; the model optimizes and adjusts the ligand structure at each step, and after multiple rounds of optimization the prediction converges to the true ligand position. Again, each step can be regarded as denoising — adjusting atomic positions with a learned force field. Here are some quantitative metrics: compared with traditional methods we achieve a fairly large improvement, though of course there is still considerable room for improvement.

Recently we have started a lot of work on large-molecule protein design based on diffusion generative models. A very important problem in protein design is de novo design: can we generate or design new protein structures and sequences with AI methods? The field of de novo design has seen relatively big breakthroughs in the past two or three years. For example, David Baker's group at the University of Washington developed RoseTTAFold a couple of years ago and then extended the model to protein design, releasing the RFdiffusion algorithm last year. RFdiffusion is essentially a diffusion model for protein design; its core idea is to train on the structural data of the entire PDB and then generate completely new protein structures and sequences.

Let me give an example. Suppose there is a target and we want to design a binder; RFdiffusion designs it in two stages. First, given the target, the model generates a structure that may bind to it. This first step yields only the structure, not the sequence, so in the second step a sequence design model (such as ProteinMPNN) designs the protein sequence for that structure. This algorithm works quite well for them: they have designed sequences for many targets based on it, and the experimental success rate is very good, especially compared with the physics-based algorithm they developed two or three years earlier. RFdiffusion is actually very similar to RoseTTAFold, their earlier protein structure prediction algorithm, in that it essentially optimizes the structure at every step; the difference is that in RoseTTAFold the sequence is given, while in the RFdiffusion generative model it is not.
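The two-stage pipeline — generate a backbone first, then design a sequence for it — can be summarized as a skeleton. Both stages here are random stand-ins purely to show the data flow; in the real pipeline they would be RFdiffusion and ProteinMPNN respectively:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def generate_backbone(target, length, rng):
    """Stage 1 stand-in for RFdiffusion: propose binder backbone coordinates."""
    return [(rng.random(), rng.random(), rng.random()) for _ in range(length)]

def design_sequence(backbone, rng):
    """Stage 2 stand-in for ProteinMPNN: design a sequence for the backbone."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in backbone)

def two_stage_design(target, length=30, seed=0):
    rng = random.Random(seed)
    backbone = generate_backbone(target, length, rng)  # structure first...
    sequence = design_sequence(backbone, rng)          # ...then sequence
    return backbone, sequence
```

The point of the skeleton is the ordering: the sequence designer never sees the target directly, only the backbone produced in stage one — which is exactly why our own method, described next, couples the two.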

We have also done some work in this direction. This is work we did last year, and it differs somewhat from their algorithm. Their algorithm is two-stage: first design the structure, then design the sequence. We propose a new algorithm that designs protein structures and sequences simultaneously. It is based on the diffusion generative model; as just described, the core idea of diffusion is denoising, here optimizing the structure and the sequence at the same time. So in this method we have a denoising network whose input is a protein sequence, a protein structure and target-related information; the model re-optimizes the protein sequence and structure for the target. That is one step, and by repeating it over multiple steps we keep optimizing until we finally find a fairly ideal protein structure and sequence. That is the core idea of our model, which is essentially based on the diffusion generative model.

Based on this model we did some protein design. One task was designing the CDR loops of antibodies. Starting from an antigen–antibody complex, we masked all six CDR loops of the antibody and used our model to regenerate new CDR loops. We found that the structures and sequences produced by our model are quite close to real antibodies found in nature, which means our current model can recover antibody sequences that already exist in nature. That is one example.


We have also done some other examples: in this example, a protein in nature has a loop and we want to generate a longer one; in this example we designed β-barrel proteins of different sizes; and in this example we designed α-helical transmembrane proteins with a specified number of helices.

In this last example, I will briefly introduce our latest work on antibody design and optimization based on a generative, ChatGPT-like model. For antibody optimization, we start with the complex structure of an antibody and its antigen. Often the antibodies we obtain in experiments do not have good enough affinity, so we introduce mutations into the antibody to optimize its affinity. Our recent work builds a model similar to ChatGPT with three parts: first, a pre-trained generative model; second, optimization of the generative model with supervised learning; and third, closing the loop with wet experiments for mutual feedback and optimization. We first pre-train on a large number of protein sequences and protein complex structures, after which the AI model has a better understanding of how proteins bind to each other. In addition, we have a labeled experimental dataset, SKEMPI: given a protein complex, some amino acids at its interface are mutated, and the resulting change in binding energy is measured experimentally. We optimize the AI model on this dataset and finally obtain a protein, or antibody, optimization model. Then we design new antibody sequences with this AI model, test them in wet experiments to measure affinity, and feed the activity and affinity data back to the AI model for a new round of sequence design; after multiple rounds we can find more ideal antibody sequences or molecules.
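The "propose mutations, rank by predicted binding ΔΔG" step can be sketched as a single-point mutation scan. The `predict_ddg` function here is a hypothetical stand-in — a toy per-residue lookup table, not the SKEMPI-trained model described above:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Illustrative per-residue scores; a real model would predict ΔΔG from the
# complex structure and interface context, not from a table.
TOY_SCORE = {aa: i * 0.1 for i, aa in enumerate(AMINO_ACIDS)}

def predict_ddg(seq, pos, new_aa):
    """Hypothetical stand-in for the fine-tuned binding ΔΔG predictor."""
    return TOY_SCORE[new_aa] - TOY_SCORE[seq[pos]]

def scan_mutations(seq, interface_positions, top_k=5):
    """Enumerate single mutations at interface positions; rank with the most
    negative (predicted stabilizing) ΔΔG first."""
    candidates = [
        (predict_ddg(seq, pos, aa), pos, aa)
        for pos in interface_positions
        for aa in AMINO_ACIDS
        if aa != seq[pos]
    ]
    candidates.sort()
    return candidates[:top_k]
```

In the closed loop above, the top-ranked mutations would then be synthesized and tested, and the measured affinities fed back to refine the model for the next round.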

Let me show the results of an actual wet experiment, a collaboration with Fudan University in which we optimized CR3022, a SARS-CoV-2 antibody obtained from a patient. We performed AI-based sequence design and wet-lab testing, and after two rounds of iteration we found a more ideal antibody molecule — the original antibody with three mutations. Against the Delta variant its affinity is increased 3-fold, and against the latest Omicron variant the best sequence we found has nearly 9-fold higher affinity than the original.


Finally, I agree with what Academician Weinan E said just now: we need to build a community and a platform to promote AI for Science. Our team is also actively building an open-source community: TorchProtein is an open-source machine learning platform for protein representation learning and design that we established together with several companies.

Summary: we are now in the best of times because we are experiencing a dual technological revolution in artificial intelligence and biotechnology. In AI, generative models are developing rapidly: we can now train generative models on large amounts of text, image and code data and generate new text, images or code. In the life sciences, thanks to the rapid development of gene sequencing, gene editing, gene synthesis, cryo-EM and other technologies, data has grown rapidly — faster than Moore's Law. With this data we can train generative models and then generate entirely new drug molecules, and AI-designed drug molecules can further interact with wet-lab platforms for closed-loop optimization. Generative models will have an important impact not only in biomedicine but also in many other fields, including agriculture, food, materials, energy and the environment.

That's all I have for today, thank you.

Copyright Notice:

Without authorization from the organizer and/or the speakers, no institution or individual may disseminate the PPT content in the video or text in any way;

Reproduction or excerpting by any media without authorization is strictly prohibited, as is reposting to platforms other than WeChat.