The "Book of Heaven" of human life is finally complete! Many diseases will be treated

◎ Writer: Zhang Jiaxin (intern reporter); Planners: Feng Weidong, Wang Junming

The international Telomere-to-Telomere (T2T) Consortium has announced the first complete, gap-free sequence of the human genome, and the "book of heaven" of human life is finally complete. For the first time it reveals the highly identical segmental duplications in the human genome and the variation within them, a major upgrade to GRCh38, the standard human reference genome published in 2013.

The human genome is often compared to the "book of heaven" of life. Just four bases, A, T, G, and C, make up DNA, yet they form more than 6 billion base pairs across a person's two sets of chromosomes, which gives a sense of its complexity.

Researchers examine the output of a DNA sequencer. Source: The Associated Press

More than 20 years after the official release of the draft human genome sequence, produced jointly by scientists from China, the United States, the United Kingdom, France, Germany and Japan, the international Telomere-to-Telomere (T2T) Consortium has announced the first complete, gap-free human genome sequence. This "book of heaven" of human life is finally complete. For the first time it reveals the highly identical segmental duplications in the human genome and the variation within them, a major upgrade to GRCh38, the standard human reference genome published in 2013.

Source: Science magazine website

On April 1, the journal Science published six papers reporting the results.

This achievement will fundamentally change the way many diseases are treated. As new variants of the novel coronavirus keep emerging, scientists can use whole-genome sequencing to look for mutations associated with the disease. They can also use the complete sequence to study the evolution of human genetic variation in more detail, which may transform our understanding of human evolution.

The 8% "blank space" is not "junk"

On February 12, 2001, the international Human Genome Project published the human genome map and a preliminary analysis for the first time, and on April 15, 2003, the draft human genome sequence was officially released. Because of technical constraints, however, the original human genome map left a gap of about 8%. This hard-to-sequence portion consists of highly repetitive DNA, including the telomeres at the ends of chromosomes and the centromeres at their central junctions.

The heterochromatic sequences around the centromeres sit at key sites on the chromosomes, yet in the draft human genome sequence they were all marked as long runs of N, meaning "unknown base." The short arms of chromosomes 13, 14, 15, 21, and 22 were likewise left out.
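As a rough illustration of what such placeholder runs look like to software, the sketch below scans a sequence string for stretches of N and reports where each gap starts and ends. The function name find_gaps and the toy sequence are made up for this example; real reference genomes are stored in much larger files.

```python
import re

def find_gaps(sequence):
    """Return (start, end) positions of every run of 'N' (unknown bases).

    Positions are 0-based and the end is exclusive, as in Python slicing.
    """
    return [(m.start(), m.end()) for m in re.finditer(r"N+", sequence)]

# Hypothetical toy sequence with two unresolved stretches.
toy_sequence = "ACGTNNNNACGNNT"
gaps = find_gaps(toy_sequence)
print(gaps)  # [(4, 8), (11, 13)]

# Share of the toy sequence that is still unknown.
unknown = sum(end - start for start, end in gaps)
print(unknown / len(toy_sequence))
```

In a gap-free assembly, a scan like this would return an empty list.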

Eric Green, MD, director of the National Human Genome Research Institute (NHGRI) at the U.S. National Institutes of Health, said that a genome with missing fragments is "as incomplete as a paragraph with missing sentences."

Evan Eichler, a Howard Hughes Medical Institute researcher at the University of Washington, says sequencing DNA is like solving a jigsaw puzzle. Scientists must first break the DNA into smaller pieces, read the pieces with a sequencer, and then fit them back together in the correct order.
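The puzzle analogy can be sketched in a few lines of Python. This toy example is only an illustration under simplifying assumptions, not the method of any real assembler: it repeatedly merges the two reads whose ends overlap the most. The function names overlap and greedy_assemble, the minimum overlap of 3 bases, and the reads themselves are all hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair of reads until no overlaps remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to join; the pieces stay separate
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

# Three overlapping pieces of the toy sequence ATGCGTACCTGAAGT.
pieces = ["CTGAAGT", "ATGCGTA", "GTACCTG"]
print(greedy_assemble(pieces))  # ['ATGCGTACCTGAAGT']
```

The approach works here because every overlap is unique. When many pieces look alike, as in repetitive DNA, a greedy merge like this can join the wrong pieces.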

Now the new T2T genome map fills in the blank 8 percent of the picture on the puzzle box and corrects thousands of errors in the earlier assembly. Most of the newly added DNA sequence lies near the repetitive telomeres and centromeres.

Output from a human genome sequencing device. Source: The Associated Press

The new gap-free version, known as T2T-CHM13, consists of 3.055 billion base pairs and 19,969 protein-coding genes. It adds nearly 200 million base pairs of new DNA sequence, including 99 genes that may code for proteins and nearly 2,000 candidate genes that need further study. Most of these candidate genes are inactive, but 115 of them may still be expressed. The team also found about 2 million additional variants in the human genome, 622 of which fall in medically relevant genes. In addition, the new sequence corrects thousands of structural errors in GRCh38 and eliminates tens of thousands of false-positive variants in each sample, including variants in 269 genes known or suspected to be associated with disease.

According to Eichler, the repetitive sequences that many researchers dismissed as "junk or inconsequential" turn out to be very important.

Because the previous model, GRCh38 (known as the reference genome), was a composite that essentially "stitched together" pieces of several individuals' genomes, it contained some errors and overlaps. The new, complete version eliminates the gaps and better represents what one person's actual genome looks like.

Cracking the last "black box"

Because of the complexity of its repetitive regions, the remaining 8 percent of the human genome has troubled scientists for years. The first challenge is that it contains regions where the same DNA is repeated many times, which made it hard to string the DNA together in the correct order with earlier sequencing methods.

Early DNA sequencing, known as "short-read" sequencing, could read only relatively short stretches at a time, a few hundred DNA bases each. This was the only genome mapping technology available 20 years ago. Suppose, for example, that part of the genome consists of the sentence "Only work and no play, and smart kids become stupid" repeated 9 times in a row. The technology shows only fragments of it, such as "only work", "smart kids", or "become stupid". Researchers could piece these short fragments together to recover the sentence, but they had no way of knowing that it was repeated 9 times. A genome assembled this way is therefore still left with blanks.
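The repeated-sentence problem can be checked directly with a toy example. The sketch below is an illustration with made-up sequences, not real genome data: it builds two toy genomes that differ only in how many times a unit is repeated (3 versus 9) and compares the set of fragments each one yields. The names reads_of_length, LEFT, UNIT and RIGHT are hypothetical.

```python
def reads_of_length(genome, k):
    """Return the set of every distinct length-k fragment of the genome."""
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}

# Made-up flanking sequences and an 8-base repeat unit.
LEFT, UNIT, RIGHT = "TTGACCGATA", "ACGTTGCA", "GGATCCTTAA"
three_copies = LEFT + UNIT * 3 + RIGHT
nine_copies = LEFT + UNIT * 9 + RIGHT

# Fragments shorter than the repeat unit: both genomes give the same set,
# so the fragments alone cannot reveal the number of copies.
short_same = reads_of_length(three_copies, 6) == reads_of_length(nine_copies, 6)
print(short_same)  # True

# Fragments long enough to span three copies plus part of each flank:
# the sets now differ, so the copy number becomes visible.
long_same = reads_of_length(three_copies, 40) == reads_of_length(nine_copies, 40)
print(long_same)  # False
```

Real sequencing data also carries coverage depth, which gives some hints about copy number, so this comparison of fragment sets is a deliberate simplification.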

In a 10,000-piece puzzle whose pieces all look alike, it is hard to place the small pieces correctly, just as it is hard to order small fragments of repetitive DNA. In a 500-piece puzzle, each piece covers a larger area, like a longer stretch of DNA, and placing it correctly is much easier. This is why "long-read" technology came into being. Huge technical advances have allowed researchers to sort out repetitive sequences that used to be unreadable.

Over the past 10 years, two new "long-read" DNA sequencing technologies have emerged. They generate much longer reads without compromising accuracy, and can even read entire "sentences" or "paragraphs" at once.

Oxford Nanopore's ultra-long-read sequencing can read up to 1 million DNA letters at a time with moderate accuracy, while PacBio's HiFi (high-fidelity) sequencing reads about 20,000 letters with near-perfect accuracy. Combining the two methods allowed T2T researchers to resolve the repetitive regions and ensured that the assembled sequence was highly accurate.

Another tool is Merfin, which researchers used to clean up some of the most difficult sequences in the human genome. Merfin checks the accuracy of a sequence: it detects bases that may be incorrect and corrects the errors automatically. Because modern sequencing technologies are already quite accurate, Merfin is needed only in the trickiest cases. For example, existing technologies have a hard time with runs of identical bases such as AAA, and Merfin corrected sequence errors of this kind.
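Runs of identical bases are easy to locate in a finished sequence even though they are hard to read correctly. The sketch below is not Merfin and does not reproduce its method; it only lists such runs in a toy sequence so the kind of region in question is concrete. The function name homopolymer_runs and the threshold of 3 bases are assumptions made for this example.

```python
from itertools import groupby

def homopolymer_runs(sequence, min_len=3):
    """List (start, base, length) for each run of identical bases.

    Only runs of at least min_len bases are reported; start is 0-based.
    """
    runs = []
    position = 0
    for base, group in groupby(sequence):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((position, base, length))
        position += length
    return runs

# Hypothetical toy sequence containing AAA and TTTT.
print(homopolymer_runs("ACAAAGTTTTC"))  # [(2, 'A', 3), (6, 'T', 4)]
```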

In other words, scientists once thought that the puzzle pieces from repetitive regions were almost identical in color and shape, as if they all showed blue sky. Now, more advanced sequencing technologies have shown that these repeated pieces depict not just blue sky but also grass and sun.

The second challenge in cracking the final "black box" of the "book of heaven" was finding cells that contain only one genome. Standard human cells contain two sets of DNA, one maternal and one paternal, but the T2T team used DNA from a cell line derived from a complete hydatidiform mole, which contains only copies of the paternal DNA. A complete hydatidiform mole is a rare complication of pregnancy caused by abnormal growth of cells derived from the placenta.

This approach simplifies the genome, so scientists only need to sequence one set of DNA instead of two sets of DNA.

A key milestone in genomics

The new sequence completes the last piece of the human genome and marks a key milestone in the field of genomics.

The new sequence reveals unprecedented detail about the region around the centromere. This will greatly improve our understanding of chromosomes, especially of centromeres and their role, because this region is essential to understanding human evolution and genetic diversity as well as resistance or susceptibility to many diseases.

At the same time, the new sequence reveals previously undiscovered segmental duplications, that is, long DNA fragments that are repeated within the genome. Of the roughly 20,000 genes in the human genome, about 950 originate from segmental duplications. These human-specific segmental duplications are reservoirs of new genes that drive the formation of more neurons in the developing brain and strengthen the connectivity of synapses in the frontal cortex, which may be related to the higher-level thinking, reasoning, logic, and language functions unique to humans.

A more accurate map of the short arms of those five chromosomes may help scientists open up new research directions and answer basic biological questions about how chromosomes properly separate and divide.

"Generating a truly complete sequence of the human genome represents an incredible scientific achievement, providing the first comprehensive view of the human genetic blueprint." "This foundational information will advance many of the ongoing efforts to help us understand the details of the human genome, which in turn will support genetic research into human disease," Green said. ”

Beyond its implications for medical research, completing the puzzle also helps answer a question: what in our genome makes us human? Some genes in the formerly blank regions are now thought to be essential in helping humans, compared with other apes, develop larger brains. The variability of centromeres may also provide new evidence of how human ancestors evolved.

Scientists can now track these new genomic regions over time, enabling more rigorous comparisons across generations, populations of different origins, and species.

For example, Xavi Guitart, a graduate student in Eichler's lab, analyzed TBC1D3, a gene family associated with the expansion of the human prefrontal cortex, and showed that repeated, independent expansions occurred at different points in primate evolution. The most recent took place about 2 million to 2.6 million years ago, probably around the time the genus Homo appeared. Surprisingly, the human TBC1D3 gene family showed significant large-scale structural variation in a subset of samples.

In their paper, the researchers explain that different people carry very different complements and arrangements of the TBC1D3 gene family. For a gene thought to be so important for brain function, this is unexpected. The scientists also found diversity in the complex structure of LPA, a lipoprotein gene; variation in part of this gene is the most important genetic risk factor for cardiovascular disease caused by abnormal lipid levels in the blood.

The researchers also studied SMN, a motor neuron gene whose mutations are linked to certain neuromuscular diseases. A better sequence of the spinal muscular atrophy region, one of the hardest regions of chromosome 5 to sequence, helps determine disease risk and guide treatment, because the duplicate gene SMN2 is one of the most effective targets of gene therapy.

In addition, many diseases are associated with structural duplications in centromeres, so the new sequence helps scientists study genetic diseases.

Centromeres are known to play a role in cell division, and if their position on a chromosome changes significantly, an entirely new species can arise. Cancer cells divide uncontrollably when certain genes in heterochromatin-rich centromeric regions are overexpressed. Errors in cell division and in the distribution of genetic material between cells can also lead to abnormalities in prenatal development, such as Down syndrome or Robertsonian translocations. A comprehensive understanding of the centromere genome could open new doors to treating these diseases.

Based on these and other findings, the scientists note that the new reference genome "reveals an unprecedented level of human genetic variation in genes important for neurodevelopment and human disease."

Not the end, but a new beginning

This time, the hydatidiform mole cells used by the T2T team carry only an XX pair, that is, a duplicated set of chromosomes with no Y chromosome. Completing the sequence of a haploid genome is not the final goal of human genome research, but a new beginning.

Eichler said: "We have completed a genome. In the next few years, there will be hundreds, if not thousands, of genomes. I think our perception of how humans differ from each other will shift, and that more complex genetic variation will be important not only for understanding what makes us human, but also for understanding what sets us apart."

In the next phase, scientists will sequence the genomes of multiple different individuals to fully grasp human diversity, disease, and human relationships with other primates.

The good news is that researchers are also about to release the complete sequence of a Y chromosome, taken from cells of a different origin. Analysis of this new Y chromosome sequence will appear in future publications.

In addition, the T2T Consortium has a new goal: to sequence 350 genomes from people of different ancestries (70 have been completed so far). Dr. Adam Phillippy, head of the Genome Informatics Section at NHGRI, said the project would cost millions of dollars or more in total. But that is a fraction of the nearly $450 million the Human Genome Project had spent by the time it completed its final sequencing in 2003. As new technologies arrive, sequencing will only get cheaper.

For now, having one's own genome sequenced is still too expensive and time-consuming for most people, but research that uses the new complete sequence to determine whether certain genetic differences are linked to particular cancers is already under way.

Dr. Phillippy said that in the next few years, sequencing a person's entire genome should become cheaper and simpler.

"In the future, when someone sequences their genome, we'll be able to identify all the variants in their DNA and use that information to better guide their health care." "Actually completing the sequence of the human genome is like putting on a new pair of glasses, and now we can see everything clearly, and we're one step closer to understanding what it all means," Philip said. ”

Related reading: How much do you know about genome sequencing projects?

1. 1000 Genomes Project (1KGP)

Understanding the relationship between genotype and phenotype is one of the core goals of biology and medicine.

The 1000 Genomes Project (1KGP), launched in January 2008, is an international research effort to establish the most detailed catalogue of human genetic variation to date. It collects genetically diverse genome sequences from thousands of people on four continents to help identify disease-related genetic variants.

The pilot phase of the project was completed in 2010 with two main results: the most detailed map of human genetic polymorphism to that date, and new techniques for studying such polymorphism. In 2012, the project completed the sequencing of 1,092 genomes. In 2015, two papers in the journal Nature reported the completion of the project and future research directions.

By outlining all human genetic variation, the program will provide valuable tools for all areas of the biological sciences, particularly in disciplines such as genetics, medicine, pharmacology, biochemistry, and bioinformatics.

At the project's launch, the scientists planned to use newly developed, faster and cheaper technology over the following 3 years to sequence the genomes of at least 1,000 anonymous participants from many different populations.

2. Vertebrate Genomes Project (VGP)

About a decade ago, scientists began developing new techniques that produce longer sequence reads and fill gaps in the genomes of humans and other species. One such initiative, led by Erich Jarvis, a researcher at the Howard Hughes Medical Institute (HHMI) in the United States, aims to generate near-error-free reference genome assemblies for all 71,657 extant vertebrate species and use them to address fundamental questions in biology, disease, and biodiversity conservation.

In the April 28, 2021 issue of the journal Nature, the researchers announced that the project had produced its first virtually error-free and nearly complete reference genomes, covering 25 animals. The species include the greater horseshoe bat, the Canada lynx, the platypus and the kakapo (owl parrot), and the set contains some of the first high-quality genomes of endangered vertebrates.

Scientists are using new data from the VGP to study the genes that make bats immune to COVID-19, and the data have called long-standing assumptions in basic science into question, such as whether oxytocin and its receptors really differ significantly among humans, birds, reptiles and fish.

The next steps in the project will be to sequence a representative of each of the roughly 1,000 vertebrate families, then of each of the roughly 10,000 vertebrate genera, and finally of every vertebrate species.

3. Human Pangenome Reference Consortium (HPRC)

The human reference genome is the most widely used resource in human genetics. Its current form is a linear composite of merged haplotypes from more than 20 people, with a single individual accounting for most of the sequence. It does not represent global human genomic variation and contains some biases and errors. A globally representative, high-quality reference is therefore needed, one that captures common variants such as single-nucleotide variants and structural variants as well as functional elements.

The goal of the Human Pangenome Reference Consortium (HPRC) is to create high-quality, near-complete, and nearly error-free genomes for 350 or more people who together represent more than 95% of human genetic diversity, capturing global genomic diversity in a graph-based, telomere-to-telomere form. The Telomere-to-Telomere (T2T) Consortium is now part of this effort.

The consortium draws on technological innovation, research design, and global partnerships to build the highest-quality human genome reference possible. The goal is to improve data representation and simplify analysis so that complete diploid genomes can be assembled routinely. With a focus on ethical frameworks, HPRC will include a more accurate and diverse representation of global genomic variants, improve gene-disease association studies across populations, extend genomic research to the most repetitive and polymorphic regions of the genome, and serve as the ultimate genetic resource for future biomedical research and precision medicine.

Unless otherwise credited, the pictures in this article are from Visual China

Produced by Deep Pupil Studio, Science and Technology Daily

WeChat editor 丨Liu Yiyang

Reviewer 丨Yue Liang

Final review 丨Wang Tingting
