
MIT "Oracle" model on Nature cover! Deciphering DNA's past, present, and future


Reporting by XinZhiyuan

Edit: Good Sleepy La Yan

Can a model decipher the evolutionary history and future of non-coding DNA?

Today, machine learning is on the cover of Nature again!

This time, researchers from institutions such as the Massachusetts Institute of Technology and the University of British Columbia built a deep learning neural network model, the "Oracle."

After training with hundreds of millions of experimental observations, the Oracle can predict how mutations in non-coding DNA sequences in yeast will affect gene expression.


In addition, the researchers devised a unique way to represent fitness landscapes in two dimensions, making it easier to apply the approach to organisms other than yeast. It may even become possible to design custom gene expression patterns for gene therapy and industrial applications.

What is non-coding DNA?

Although each human cell contains a large number of genes, the so-called "coding DNA" accounts for only about 1% of the genome. The remaining 99% is non-coding DNA, which does not produce proteins.

This non-coding DNA (once nicknamed "junk DNA") has an important function: it controls whether genes are switched on or off and how much protein is produced.

Over time, cells copy their DNA in order to grow and divide. Mutations often arise in these non-coding regions during copying, sometimes fine-tuning a function and sometimes altering the way gene expression is controlled.

Many of these mutations are inconsequential, and some are even beneficial. Occasionally, however, they can increase the risk of common diseases (such as type 2 diabetes) or of more serious ones (such as cancer).


The plasticity of gene expression in evolution

To better understand the effects of such mutations, researchers have been working on mathematical maps that let them look at an organism's genome, predict which genes will be expressed, and determine how that expression will affect the organism's observable characteristics.

These maps are called "fitness landscapes." The concept was proposed about a century ago to understand how genetic makeup affects an organism's fitness, particularly its reproductive success. Early fitness landscapes were relatively simple, focusing on only a small number of mutations.


A fitness landscape

Today, researchers have richer databases, but they still need additional tools to describe this complex data and visualize it.

Such tools would help researchers understand how a single gene has evolved over time, and would also help predict possible future changes in gene sequence and gene expression.

Another breakthrough in AI in the field of biology

To achieve this goal, MIT graduate student Eeshit Dhaval Vaishnav, co-author Carl de Boer, and their colleagues built a neural network model that predicts gene expression.

They trained the model on a dataset generated by inserting millions of completely random non-coding DNA sequences into yeast and observing how each random sequence affected gene expression.


First, the researchers measured the expression of the gene encoding yellow fluorescent protein (YFP) in a large group of yeast cells.

Different cells in the group carried different promoters. Each promoter sat on a small piece of circular DNA, close to the YFP gene. Because promoters serve as binding sites for regulatory proteins, they can control the expression of nearby genes.

Specifically, the researchers used more than 30 million different promoters, each of which is 80 base pairs in length, and quantified the YFP produced by each cell containing one of these promoters.


The evolution, evolvability, and engineering of gene regulatory DNA

The researchers then fed the resulting expression data into a convolutional neural network and trained the network to predict gene expression from promoter sequence.
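Before a convolutional network can read a DNA sequence, the sequence has to be turned into numbers. The article does not say which encoding the authors used, so the sketch below assumes one-hot encoding, a common choice for sequence models. The `one_hot_encode` helper and the example promoter are illustrative and are not taken from the study.

```python
# Map each nucleotide to a one-hot row, in A, C, G, T order (an assumed convention).
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def one_hot_encode(seq):
    """Return one 4-element tuple per base, giving a len(seq) x 4 matrix."""
    return [ONE_HOT[base] for base in seq.upper()]


promoter = "ACGT" * 20  # a made-up 80-bp sequence, not a promoter from the study
encoded = one_hot_encode(promoter)
print(len(encoded), len(encoded[0]))  # 80 rows, 4 columns
```

Each 80-base-pair promoter becomes an 80 x 4 matrix, which a convolutional layer can scan for short sequence patterns.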

To test the model, the researchers synthesized thousands of promoter sequences that had not been used for training and measured their ability to drive gene expression.

The results showed that the neural network very accurately predicted the extent to which each promoter sequence drove gene expression.

The researchers also gave the network random starting sequences and used its ability to predict expression from sequence to convert them into promoter sequences predicted to drive extremely high or extremely low YFP expression.

Finally, the researchers synthesized 500 of these designed sequences and measured their ability to drive YFP expression. The results showed that the computer-designed sequences can indeed drive very high and very low expression.
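The design step described above can be pictured as a search guided by the model's predictions. The following is a minimal sketch that assumes a simple greedy hill-climb, which is not necessarily the search procedure the authors used. The function `toy_expression` is a hypothetical stand-in for the trained network: it only scores GC content and has no biological meaning.

```python
import random

BASES = "ACGT"


def toy_expression(seq):
    """Hypothetical stand-in for a trained model: the GC fraction, in [0, 1]."""
    return sum(base in "GC" for base in seq) / len(seq)


def single_mutants(seq):
    """Yield every sequence that differs from seq at exactly one position."""
    for i, old in enumerate(seq):
        for new in BASES:
            if new != old:
                yield seq[:i] + new + seq[i + 1:]


def evolve(seq, predict, steps, maximize=True):
    """Greedy search: at each step, move to the best-scoring single mutant."""
    sign = 1 if maximize else -1
    for _ in range(steps):
        best = max(single_mutants(seq), key=lambda s: sign * predict(s))
        if sign * predict(best) <= sign * predict(seq):
            break  # no single mutation improves the score
        seq = best
    return seq


rng = random.Random(0)  # fixed seed so the run is repeatable
start = "".join(rng.choice(BASES) for _ in range(80))
high = evolve(start, toy_expression, steps=5, maximize=True)
low = evolve(start, toy_expression, steps=5, maximize=False)
print(toy_expression(low), toy_expression(start), toy_expression(high))
```

Replacing `toy_expression` with a real sequence-to-expression model would turn the same loop into a simple promoter design routine. As in the study, any designed sequence would still need to be synthesized and measured.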


To tackle the most basic evolutionary questions, Vaishnav and his colleagues surveyed the literature and even ran all the datasets from an existing study through the model.

To build a tool powerful enough to probe any gene, they needed a way to predict the evolutionary patterns of non-coding sequences even without a complete dataset.

To achieve this, they devised a computational technique for plotting the predictions from their framework onto a two-dimensional graph.

This makes it easy to see how any change in non-coding DNA affects gene expression and fitness, without time-consuming and laborious experiments in the lab.

What's the point?

For more than 50 years, biologists have been trying to accurately predict the level of gene expression from non-coding DNA sequences. However, the biochemical mechanism of gene expression is very complex, and even the best efforts of the academic community had not achieved this goal.

Before this study, researchers could mostly train their models only on known mutations, or at best on slight variations of them.

However, Regev's group took a bigger step. They built unbiased models that predicted the fitness and gene expression of organisms, based on any possible DNA sequence, even if some of them had never been seen before.

Experiments showed that for most starting sequences, 3 or 4 mutations are enough for the sequence to evolve very high or very low expression. The expression of about 70% of yeast genes is under stabilizing selection, which favors mutations that do not lead to large changes in expression.

In addition, genes under stabilizing selection are more robust to non-coding DNA mutations. That is, mutations in their promoters alter their expression to a lesser extent.
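To see why a predictive model helps with questions about 3 or 4 mutations, it is worth counting how many sequences lie that far from a given 80-base-pair promoter. The sketch below is plain combinatorics rather than anything taken from the study: choose which k positions change, then choose one of the 3 alternative bases at each. The helper name `sequences_at_distance` is illustrative.

```python
from math import comb


def sequences_at_distance(length, k):
    """Count sequences that differ from a given one at exactly k positions."""
    return comb(length, k) * 3 ** k


for k in range(1, 5):
    print(k, sequences_at_distance(80, k))
```

A single 80-base-pair promoter has 240 one-mutation neighbors, but more than 2 million three-mutation neighbors and more than 100 million four-mutation neighbors. Measuring them all in the lab for many starting sequences would be impractical, whereas a trained model can score them computationally.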


The advent of the Oracle, like other deep learning applications such as predicting protein folding, has led to a new approach for scientists to explore and explain a wider range of areas.

In addition, the "oracle" allows researchers to control cells for pharmaceutical purposes, including the latest treatments for cancer and autoimmune disorders.

Aviv Regev, a professor of biology at MIT and a core member of the Broad Institute of MIT and Harvard, said: "Now, we have an 'oracle' that we can ask a lot of questions, such as what would happen if we tried all the mutations in the sequence, or what new sequences should we design to get the gene expression we want."

She said that scientists can now use the model to address their own evolutionary questions, as well as related problems such as designing sequences that produce a desired level of gene expression.

Martin Taylor, a professor of genetics at the University of Edinburgh's Medical Research Council Human Genetics Unit, said the study amply demonstrated that AI can not only predict the effects of changes in non-coding DNA, but also reveal the underlying logic of millions of years of biological evolution.

Limitations of the study

Still, Andreas Wagner of the Department of Evolutionary Biology and Environmental Studies at the University of Zurich says the oracle has obvious limitations.

First, the researchers changed only the promoter, which is just one of several types of sequence that can affect gene expression. The model does not take into account the effects of changes in the surrounding DNA, including changes in protein-coding regions that may affect gene expression.

Second, it was developed for yeast, where gene regulation is much less complex than in humans. For example, yeast regulatory DNA typically lies within a few hundred base pairs of the gene it regulates, while animal regulatory DNA may lie millions of base pairs away. It is therefore unclear whether the method can be extended to more complex gene regulation.

Finally, like the oracles of myth, this model can predict but not explain.

It doesn't tell us why a promoter has high or low expression, which transcription factors bind on the promoter, or how they interact.

In other words, it does little to elucidate the regulatory logic of gene expression.

Still, we can remain cautiously optimistic.

Although the 30 million sequences used for training are only a tiny fraction (about 2×10^-41) of the 4^80 possible 80-base-pair sequences that the 4 nucleotides of DNA can form, the method has been very successful.

This suggests that even very sparse sampling of sequence space is unlikely to be an obstacle for such models.
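The sampling fraction quoted above can be checked with a few lines of arithmetic. This sketch uses only the figures given in the article (30 million promoters, each 80 base pairs long, over a 4-letter alphabet). The variable names are illustrative.

```python
total = 4 ** 80          # every possible 80-bp sequence over A, C, G, T
sampled = 30_000_000     # promoters measured, per the article
fraction = sampled / total

print(f"possible sequences: {total:.3e}")    # on the order of 1e48
print(f"fraction sampled:   {fraction:.1e}")  # on the order of 2e-41
```

Python integers have arbitrary precision, so `4 ** 80` is computed exactly before the division converts the result to a floating-point number.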

About the author

Eeshit Dhaval Vaishnav, a doctoral student at the Massachusetts Institute of Technology, is the first author of the study.

He has published a total of 8 papers in top journals, including 3 in Nature; 1 each in Nature Medicine, Nature Biotechnology, and Nature Communications; and 1 in Cell.

He previously obtained a double degree in Computer Science and Engineering and Biosciences and Bioengineering at the Indian Institute of Technology.


Dr Carl de Boer, assistant professor at the School of Biomedical Engineering at the University of British Columbia, is a co-author.

He received his BSc in Computer Science and Bioinformatics from the University of Waterloo in 2008 and his PhD in Molecular Genetics from the University of Toronto in 2014, after which he carried out postdoctoral research. In 2020, he joined the University of British Columbia as an assistant professor.


Dr. Aviv Regev, a professor of biology at the Massachusetts Institute of Technology, is the senior author of the study.

She received her M.A. and Ph.D. from Tel Aviv University in 1997 and 2003, respectively. She is a core member of the Broad Institute of MIT and Harvard, a professor in MIT's Department of Biology, and head of Genentech Research and Early Development. She founded and leads the Human Cell Atlas project together with Sarah Teichmann.

Her research interests are in biological networks, gene regulation, and evolution. The focus is on dissecting complex molecular networks to determine how they function and evolve in the face of genetic and environmental changes, as well as during differentiation, evolution, and disease.


Resources:

https://news.mit.edu/2022/oracle-predicting-evolution-gene-regulation-0311
