A review of genome assembly in the T2T era

Recently, Heng Li of the Dana-Farber Cancer Institute, a leading figure in bioinformatics, and Richard Durbin published a review in Nature Reviews Genetics entitled "Genome assembly in the telomere-to-telomere era". The review surveys current practice in high-quality assembly of large eukaryotic genomes, with telomere-to-telomere (T2T) assembly as the end goal. It covers T2T assembly from six angles: the genomic properties that affect assembly, the common data types, the latest assembly algorithms, methods for evaluating an assembly, the remaining challenges, and the authors' own recommendations for assembling T2T genomes. This article introduces the first three of these.

Genomic properties that affect assembly

The main factor that determines how hard a genome is to assemble is not its size but its repeat structure. A repeat can be resolved by reads that are longer than the repeat. Some repetitive regions, however, are much longer than the reads produced by current sequencing technologies, which makes them difficult to assemble. Fortunately, because mutations accumulate in long repeats over time, identical repeat copies longer than 10 kb are rare, so highly accurate PacBio HiFi reads and ONT ultra-long reads can usually tell the copies apart and assemble them successfully.
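
As a rough illustration of the idea that a repeat becomes resolvable once a read extends past both of its ends, the sketch below checks whether a read covers a repeat plus some unique flanking sequence on each side. The coordinates, the function name spans_repeat, and the min_anchor parameter are hypothetical and chosen only for illustration; real assemblers work on overlaps and graphs rather than on known repeat coordinates.

```python
def spans_repeat(read_start, read_end, rep_start, rep_end, min_anchor=1000):
    """Return True if the read covers the whole repeat plus at least
    min_anchor bases of flanking sequence on both sides."""
    return (read_start <= rep_start - min_anchor
            and read_end >= rep_end + min_anchor)


# A hypothetical 10 kb repeat located at 50,000-60,000 on a chromosome.
repeat = (50_000, 60_000)

reads = {
    "hifi_18kb": (45_000, 63_000),     # covers the repeat and both flanks
    "short_150bp": (52_000, 52_150),   # falls entirely inside the repeat
    "partial_20kb": (55_000, 75_000),  # starts inside the repeat
}

for name, (start, end) in reads.items():
    print(name, spans_repeat(start, end, *repeat))
```

Only the first read can place this repeat copy unambiguously. The other two would need sequence differences between repeat copies, which is where read accuracy matters.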

Repetitive sequences fall broadly into three categories: interspersed repeats, tandem repeats, and segmental duplications.

Interspersed repeats: Mainly transposons scattered throughout the genome. Almost all of them are shorter than modern long reads, so they are no longer a major obstacle to assembly.

Tandem repeats: On most chromosomes, tandem repeats are shorter than long reads, so they are also easy to assemble. Satellite repeats are the exception: these ultra-long tandem repeats are enriched at centromeres and are particularly difficult to assemble because long reads cannot span an entire satellite array.

Segmental duplications: Very long DNA segments that occur more than once in the genome. They are often longer than long reads and even ultra-long reads, and many are arranged in tandem clusters. For example, ribosomal DNA (rDNA) can form long tandem arrays of highly similar copies, and long rDNA arrays are among the most difficult regions to assemble.

The two homologous haplotypes in a diploid sample can also be viewed as repeats of each other, so for diploid or polyploid samples a T2T assembly also requires every chromosome to be correctly phased. An assembler that can resolve near-identical repeats is naturally good at separating homologous haplotypes. Conversely, an assembler that does not phase haplotypes cannot resolve near-identical repeat copies. Traditional assembly algorithms collapsed homologous haplotypes, whereas current practice preserves haplotype phase across millions of bases and uses multiple data types to produce chromosome-scale, haplotype-phased assemblies.

Long-read and long-distance sequencing techniques

Obtaining a near-T2T assembly typically requires combining several sequencing technologies (Table 1). Reads of 10 kb or longer are produced by long-read sequencing, a field currently represented by two companies, Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT). In 2019, PacBio launched HiFi reads, which are 10-20 kb long with an error rate below 0.5% and are the core data type for high-quality assembly. ONT reads currently on the market are generally 90-95% accurate. Despite this higher error rate, ONT offers ultra-long reads of ≥ 100 kb, which help resolve repeats that HiFi reads cannot assemble. ONT is also actively developing duplex sequencing, which promises accuracy close to that of PacBio HiFi at greater read lengths. Once the technology matures, it will be a compelling data type.
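
To put the error rates quoted above on a common scale, the following sketch converts a per-base error rate into a Phred quality score (Q = -10 log10 of the error probability) and estimates the number of errors in a read of a given length. The function names and the example read lengths (15 kb and 100 kb) are assumptions made for illustration, and the estimate treats errors as uniformly distributed, which real reads are not.

```python
import math


def phred_from_error(error_rate):
    """Convert a per-base error probability into a Phred quality score."""
    return -10 * math.log10(error_rate)


def expected_errors(read_length, error_rate):
    """Expected number of erroneous bases, assuming a uniform error rate."""
    return read_length * error_rate


# Error rates taken from the text; read lengths are illustrative.
examples = [
    ("HiFi, 0.5% error, 15 kb read", 0.005, 15_000),
    ("ONT, 5% error, 100 kb read", 0.05, 100_000),
    ("ONT, 10% error, 100 kb read", 0.10, 100_000),
]

for label, err, length in examples:
    q = phred_from_error(err)
    n = expected_errors(length, err)
    print(f"{label}: Q{q:.1f}, about {n:.0f} expected errors")
```

An error rate of 0.5% corresponds to roughly Q23, while 90-95% accuracy corresponds to roughly Q10 to Q13.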

Even ultra-long reads rarely span more than a few hundred kb, so long-distance data are needed to scaffold and phase at chromosome scale. Hi-C, the most widely used long-distance data type, consists of short-read (next-generation sequencing) pairs whose two ends can come from positions far apart on the same chromosome, which makes it possible to phase whole chromosomes and to order contigs across megabase distances. Pore-C is similar to Hi-C but is sequenced on ONT. Strand-seq is another technique that is particularly good at chromosome-scale phasing and contig ordering, but it is more expensive and harder to obtain commercially. Parental sequencing data, that is, trio data, can also be regarded as a long-distance data type because of its strong genome-wide phasing power.
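
The way Hi-C read pairs link contigs can be sketched with a simple count of pairs whose two ends map to different contigs. This is only a toy illustration with hypothetical contig names and mappings. Real scaffolders typically also account for factors such as contig length and the decay of contact frequency with distance, none of which is modelled here.

```python
from collections import Counter


def contig_link_counts(read_pairs):
    """Count Hi-C read pairs whose two ends map to different contigs."""
    counts = Counter()
    for contig_a, contig_b in read_pairs:
        if contig_a != contig_b:
            # Sort so that (a, b) and (b, a) count as the same link.
            counts[tuple(sorted((contig_a, contig_b)))] += 1
    return counts


# Hypothetical mappings: each tuple names the contigs hit by the two ends of a pair.
read_pairs = [
    ("ctg1", "ctg2"), ("ctg2", "ctg1"), ("ctg1", "ctg2"),
    ("ctg1", "ctg1"),  # both ends in one contig: not a link between contigs
    ("ctg2", "ctg3"),
    ("ctg1", "ctg3"),
]

links = contig_link_counts(read_pairs)
for pair, n in links.most_common():
    print(pair, n)
```

In this toy data, ctg1 and ctg2 share the most links, which would suggest placing them next to each other in a scaffold.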

Table 1 Long-read and long-distance sequencing techniques


Near-T2T assembly method

· Assemble homozygous genomes

For homozygous genomes, the most reliable T2T assembly strategy is to combine PacBio HiFi and ONT ultra-long reads (Figure 1). First, an initial assembly graph is built from the HiFi reads. The graph consists of linear segments (unitigs) that contain no exact long repeats, and the unitigs may be connected to one another depending on the repeat structure. Highly repetitive regions appear as complex subgraphs called "tangles". Next, the ONT ultra-long reads are anchored to the unitigs and threaded through the tangles, which resolves most of them. Ultra-long reads also patch the occasional assembly gap that HiFi reads fail to cover. When chromosomes remain poorly separated or discontiguous in the assembly graph, Hi-C data help produce chromosome-length scaffolds.
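
The step in which ultra-long reads are threaded through a tangle can be sketched on a toy graph. Below, a collapsed repeat unitig "R" has two incoming and two outgoing unitigs, so the HiFi graph alone cannot tell which entry pairs with which exit. Ultra-long read alignments, given here as hypothetical ordered lists of unitigs, vote for the correct pairing. This is a simplified illustration under those assumptions, not the algorithm used by verkko or hifiasm.

```python
from collections import defaultdict

# Toy assembly graph: unitig -> list of unitigs that may follow it.
edges = {
    "A": ["R"],
    "C": ["R"],
    "R": ["B", "D"],
}

# Hypothetical ultra-long read alignments, each an ordered path of unitigs.
ul_read_paths = [
    ["A", "R", "D"],
    ["A", "R", "D"],
    ["C", "R", "B"],
]


def resolve_tangle(edges, read_paths, repeat, min_support=1):
    """Pair each unitig entering `repeat` with the exit unitig supported by
    the most reads that traverse entry -> repeat -> exit."""
    support = defaultdict(lambda: defaultdict(int))
    for path in read_paths:
        for left, mid, right in zip(path, path[1:], path[2:]):
            # Count only traversals that are consistent with the graph.
            if (mid == repeat and mid in edges.get(left, [])
                    and right in edges.get(mid, [])):
                support[left][right] += 1
    resolved = {}
    for left, exits in support.items():
        best_exit, count = max(exits.items(), key=lambda kv: kv[1])
        if count >= min_support:
            resolved[left] = best_exit
    return resolved


print(resolve_tangle(edges, ul_read_paths, "R"))
```

With these reads, the entry A is paired with the exit D and the entry C with the exit B, so the repeat can be walked twice, once in each context.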

Fig. 1 Common assembly flow of homozygous diploid genomes

Currently, verkko and hifiasm can both integrate PacBio HiFi and ONT ultra-long data. They roughly follow the workflow in Figure 1 but use different algorithms at each step. In some cases, HiFi data alone are enough for a good assembly of a homozygous genome: verkko, hifiasm, HiCanu, and LJA can each assemble multiple human chromosomes telomere to telomere from HiFi reads alone. When scaffolding is required, YaHS has replaced SALSA as the recommended Hi-C scaffolder, and it is the preferred scaffolding method of the Vertebrate Genomes Project (VGP) and the Darwin Tree of Life project (DToL).

· Assemble a heterozygous diploid genome

The strategy for assembling a heterozygous diploid genome is similar to that for a homozygous genome (Figure 2). For genomes with long homozygous regions, including the human genome, it may not be possible to phase whole chromosomes with the HiFi and ONT ultra-long combination alone. In that case, trio data, which can accurately phase the entire genome, are advisable. When parental samples are unavailable, Hi-C can be used instead. Hi-C provides only relative phasing information between contigs and is less powerful than trio data, especially in tangled subgraphs. Even so, Hi-C remains a key data type for reliably scaffolding chromosomes.
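
One common way in which trio data provide phasing information is through k-mers that occur in only one parent. The sketch below assigns a sequence to the paternal or maternal haplotype according to which parent-specific k-mers it contains. The sequences, the k-mer size of 5, and the function names are all hypothetical. In practice the parent-specific k-mers come from parental sequencing reads and k is much larger, and the review does not prescribe this particular procedure.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bin_read(read, pat_only, mat_only, k):
    """Assign a read to a parent by counting parent-specific k-mers."""
    read_kmers = kmers(read, k)
    pat_hits = len(read_kmers & pat_only)
    mat_hits = len(read_kmers & mat_only)
    if pat_hits > mat_hits:
        return "paternal"
    if mat_hits > pat_hits:
        return "maternal"
    return "unassigned"


k = 5  # toy value for illustration
paternal_seq = "ACGTTGCAAGGCTTAC"
maternal_seq = "ACGTTGCATGGCTTAC"  # differs from the paternal copy by one base

pat_kmers = kmers(paternal_seq, k)
mat_kmers = kmers(maternal_seq, k)
pat_only = pat_kmers - mat_kmers   # k-mers seen only in the father
mat_only = mat_kmers - pat_kmers   # k-mers seen only in the mother

for read in ["TGCAAGGCT", "TGCATGGCT", "ACGTTG"]:
    print(read, bin_read(read, pat_only, mat_only, k))
```

The third read lies in a stretch that is identical in both parents and stays unassigned, which mirrors why long homozygous regions are hard to phase without genome-wide information.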

Fig. 2 Common assembly workflow for heterozygous diploid genomes

At present, ONT ultra-long data are relatively expensive to obtain and require large amounts of DNA (typically tens of micrograms). Many sequencing projects therefore do not generate ultra-long data, and instead produce a primary/alternate assembly pair or a dual assembly pair from HiFi data alone (Fig. 3b, c). For a reference genome, the primary assembly may be preferred because primary contigs are typically longer. Alternate contigs are fragmented and error-prone, and are often ignored in downstream analysis. A dual assembly pair represents the two genomes of a diploid sample and supports assembly-based variant calling as well as pangenome construction. However, its contigs are shorter, which can complicate scaffolding. Whichever approach is used, it is hard to distinguish paralogous tandem repeats from copies of the same sequence on the two haplotypes, especially at contig ends, and this can lead to false duplications. For primary assemblies, many, but not all, of these problems can be detected and fixed by heuristics such as those implemented in purge_dups.

Fig. 3 Different methods of phased diploid assembly

Combining HiFi with long-range data such as trio, Hi-C, or Strand-seq data can produce a haplotype-resolved assembly pair (Figure 3d, e) with contiguity comparable to that of a dual assembly pair. Such an assembly also preserves phase and can be further scaffolded into phased chromosomes with Hi-C. It has been shown that, even without parental data, imprinted methylation marks can be used to determine the parental origin of these chromosomes, provided the marks are known and frequent enough to label each homologous pair of contigs.

For heterozygous genomes, both verkko and hifiasm can integrate PacBio HiFi, ONT ultra-long, and long-range data to assemble multiple haplotype-resolved human chromosomes telomere to telomere. Both can also run on HiFi data alone to generate dual assembly pairs or primary assemblies. HiCanu can likewise generate primary assemblies of comparable quality from HiFi data.

Reference

Li, H., Durbin, R. Genome assembly in the telomere-to-telomere era. Nat Rev Genet (2024). https://doi.org/10.1038/s41576-024-00718-w
