
What is high-throughput sequencing? The differences between first-, second- and third-generation technologies, and precautions for sequencing

Author: Guhe Health

DNA sequencing is fundamental to characterizing the main properties of living organisms. Since the discovery of the DNA double helix in the 1950s, scientists around the world have been working to determine the primary sequence of the genomes of different species. This task, known as genome sequencing, aims to reveal the genome composition of different organisms and the order in which their genes are arranged.

One of the hallmarks of modern genomic research is the generation of large amounts of raw sequence data. The importance of this work lies in the fact that the decipherment of genome sequences can provide important clues about the genetic information of organisms, including gene function, genetic variation, and evolutionary relationships.

Over the past few decades, continuous development and breakthroughs in sequencing technology have significantly improved the speed and accuracy of sequencing. Early methods relied heavily on Sanger sequencing, which is based on chain termination during DNA strand elongation: the sequence is read from labeled fragments that each end at a known base. However, because of its low throughput and high cost, Sanger sequencing has gradually been replaced by next-generation sequencing (NGS) for large-scale work.

With the rise of next-generation sequencing technologies, such as Illumina's high-throughput sequencing platforms, the 454 platform from 454 Life Sciences (later acquired by Roche), and BGI's DNBSEQ-T7, genome sequencing has entered a new era. These technologies sequence millions of DNA fragments in parallel, greatly improving the speed and efficiency of sequencing. They have also brought significant breakthroughs in cost and accuracy, making large-scale genome sequencing possible.

With the continuous advancement of sequencing technology, more and more prokaryotic and eukaryotic genomes are being sequenced and stored in public databases. The four main databases are:

  • GenBank at the National Center for Biotechnology Information (NCBI)
  • China's own database, CNCB (China National Center for Bioinformation)
  • DNA Data Bank of Japan (DDBJ)
  • European Molecular Biology Laboratory (EMBL)

They currently hold abundant experimental records and raw nucleotide sequence data for samples, in addition to protein sequences and macromolecular structure data. These databases provide scientists with a valuable resource for studying and comparing the genomes of different species to improve understanding of biodiversity, evolution, and gene function.

In addition to the analysis of genome sequences, various bioinformatics tools and databases need to be developed to help interpret and annotate genomic data. These tools can be used to predict gene function, identify regulatory elements, compare genomic differences between different species, and more.

With the continuous advancement of computer technology, the analysis of sequencing data increasingly relies on artificial intelligence, machine learning, and related technologies. These technologies can help analyze and interpret genomic data more quickly and accurately, and uncover hidden patterns and associations in the data. Machine learning algorithms can be used to predict the function of genes, identify important regulatory regions in the genome, or precisely distinguish similar species.

For example, Guhe Health's 16S database of the intestinal microbiota is a 16S sequence library extracted from the test data of hundreds of thousands of intestinal microbiota samples, with species annotation redone using matched metagenomic data and model construction. Through further research and analysis of genomic data, we can provide a deeply personalized health testing solution.

At present, sequencing technology has a wide impact in many application fields. For example, genome sequencing provides important clues for studying the pathogenesis of human genetic diseases, and the combination of short reads, high throughput, and low cost has laid the foundation for personalized medicine and precision treatment. In addition, sequencing is widely used in agriculture, environmental science, and bioengineering, providing strong support for improving crops, protecting the environment, and developing efficient bioprocesses.

Therefore, this article shares with you the knowledge related to DNA sequencing, as well as the development of sequencing technology and the precautions for sequencing.


01

DNA Basics

DNA (deoxyribonucleic acid) is the genetic material found in the cells of all living organisms. It has a double helix structure and is composed of nucleotides, each consisting of a phosphate group, a deoxyribose sugar, and one of four nitrogenous bases (adenine A, thymine T, cytosine C, and guanine G). DNA is responsible for storing genetic information, directing protein synthesis, and replicating itself when cells divide, ensuring that genetic information is passed on to offspring.

The DNA of different species is very similar in structure, but there are differences in sequence and organization

Human DNA contains about 3 billion base pairs and carries about 20,000 to 25,000 genes distributed over 23 pairs of chromosomes. The genetic information in human DNA determines our appearance, physiological functions, and health. Although genetic diversity exists, the DNA sequences of all humans are very similar, about 99.9% identical. The remaining 0.1% of roughly 3 billion positions means that two individual genomes differ at about 3 to 4 million base pairs. Many of these variants are single nucleotide polymorphisms (SNPs), but there are also larger variants called structural variants (SVs).


Many viral genomes are on the order of 10,000 bp, while some plants have more than a hundred billion base pairs. Bacteria, which usually have small genomes, typically range from under one million to somewhat over ten million base pairs. The DNA of bacteria is usually a single circular chromosome, not multiple linear chromosomes. In addition, many bacteria also contain plasmids, small DNA molecules that can be transferred between bacteria. This horizontal transfer of genes is an important mechanism by which bacteria adapt to the environment and develop antimicrobial resistance.

In conclusion, the DNA of different species is functionally a carrier of genetic information, but there are differences in size, morphology, and sequence, and these differences lead to diversity between species.

There are two reasons for the differences in the genomes of different species:

➼ Random mutations, which accumulate during evolution. They are mainly due to "errors" in the DNA replication process during cell division. Many mutations are deleterious, causing harmful phenotypic changes or even cell death. Sometimes, however, natural selection favors the phenotype produced by a mutation, and that mutation remains in the population.

➼ Recombination, which occurs during the reproduction of higher organisms such as mammals. During recombination, the genetic material passed on by the parental organism to the offspring is a mixture of genetic material from the parental organism.

Base complementarity in double-stranded DNA

DNA is double-stranded and constructed in the form of a double helix, where nucleotide pairs act as "rungs" of the helix (hence the name "base pairs"). Adenine always chemically binds to thymine, whereas cytosine always chemically binds to guanine. In other words, A is complementary to T, and similarly C is complementary to G. The AT and CG pairs are called complementary pairs.

The structure of DNA is as follows:

DNA double helix


Image source: MedlinePlus

DNA sequences are typically displayed or written in the direction from the 5' end (head) to the 3' end (tail). Given one DNA strand and the complementary pairing rules, the other strand can be inferred: it is the reverse complement of the first strand.

To obtain the reverse complement, reverse the order of the nucleotides in the original strand and then replace each nucleotide with its complement (i.e., A is interchanged with T and C is interchanged with G).

The image below shows an example of a DNA fragment and its reverse complement.

DNA complement
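The reverse-then-complement rule can also be written as a few lines of Python. This is only a minimal sketch: the function name `reverse_complement` and the example fragment are illustrative and do not come from any particular library.

```python
# Translation table for the four DNA bases (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string written 5'->3'."""
    # Complement every base, then reverse the order of the string.
    return seq.upper().translate(COMPLEMENT)[::-1]


fragment = "ATGCGTTA"  # hypothetical fragment
print(reverse_complement(fragment))  # TAACGCAT
```

Applying the function twice returns the original strand, which is a quick way to check the rule.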


DNA replication

DNA is the basis for cellular replication. When a cell undergoes cell division, also known as mitosis, the DNA in the nucleus is replicated, and through a series of steps shown in the figure below, one parent cell produces two identical daughter cells.

Diagram of mitosis


Image source: Wikipedia

A variety of biomolecules are involved in the process of mitosis, and here we have a highly simplified explanation of the process of mitosis.

In the diagram, we start with two chromosomes: red and blue.

First, the DNA is replicated, giving rise to the more familiar X-shaped chromosome. Through a complex cascade of biomolecular signals and intracellular reorganization, the (now replicated) chromosomes are lined up in the middle of the cell. For each chromosome, the two halves are pulled apart, and each of the two daughter cells receives a copy of the original chromosome. This results in two daughter cells that are genetically identical to the original parental cell.

DNA replication is the most important part of this diagram; it is the basic process used for sequencing. DNA replication is shown in the figure below:


During DNA replication, the two strands of DNA are first unwound, yielding two single strands, each of which acts as a template for replication. A short RNA primer then attaches to a specific site on the DNA, the bases of the primer being complementary to the bases at that site. Enzymes facilitate (or "catalyze") chemical reactions; DNA polymerases are enzymes that extend the bound primer by catalyzing the addition of new nucleotides that pair complementarily with the template DNA.

The nucleotides used by DNA polymerase to extend the strand are called dNTPs (deoxynucleotide triphosphates). From a biochemical point of view, they differ slightly from the nucleotides found in the finished strand: they carry three phosphate groups, which makes them ready substrates for the replication reaction. The dNTPs corresponding to A, C, G, and T are dATP, dCTP, dGTP, and dTTP, respectively.

Obtaining DNA sequences relies primarily on sequencing technology. Commonly used sequencing technologies include Sanger sequencing and next-generation sequencing. More on this in the next section.


Source: Praxilabs

Sanger sequencing is highly accurate but low throughput. Next-generation sequencing is a high-throughput technology that runs reactions in parallel, which greatly increases sequencing throughput and reduces cost and time; it is therefore suitable for sequencing entire genomes or transcriptomes. These technologies enable large-scale, high-precision DNA sequencing analysis.

02

DNA sequencing

The development of DNA sequencing methods peaked around 2000 and was largely based on the contributions of four researchers.

01

Allan Maxam and Walter Gilbert developed a chemical method for DNA sequencing in the 1970s, in which DNA fragments end-labeled with radioactive phosphorus undergo base-specific chemical cleavage and the reaction products are separated by gel electrophoresis.

02

In 1977, Frederick Sanger took a different approach, using chain-terminating dideoxynucleotide analogues that cause base-specific termination of primed DNA synthesis. In this method, primers are usually labeled with radioactive phosphorus.

03

Leroy Hood, together with his colleagues Michael Hunkapiller and Lloyd Smith, modified the Sanger method to a higher throughput configuration in 1986 by using fluorescently labeled dideoxynucleotides. This approach avoids radioactive compounds with a limited lifespan and instead uses stable fluorescent probes. In addition, the analysis of all nucleic acid bases can be done by reading only one instead of four electrophoresis lanes, and the reading process can be automated.

This high-throughput configuration was used for the sequencing of the first human genome, which was completed in 2003 through the Human Genome Project, which took 13 years.


Thanks to improved methods and automation, another human genome was sequenced in 2008 within five months. The completion of the first draft of the human genome was just the beginning of the era of modern DNA sequencing, which brought with it more inventions and new, advanced high-throughput DNA sequencing strategies, known as next-generation sequencing (NGS).

The development of NGS strategies is meeting the need for higher sequencing throughput and lower cost, enabling many current and future applications in genomics research. These advanced methods in turn require new bioinformatics tools to handle the large amounts of data generated during analysis.

First-generation sequencing – Sanger sequencing

Fred Sanger and colleagues developed a related technique based on the detection of radiolabeled partially digested fragments.

The famous Sanger sequencing method originated in the late 1970s, when Sanger developed a gel-based method that combined DNA polymerase with a mixture of standard nucleotides (dNTPs) and chain-terminating nucleotides (ddNTPs). Mixing dNTPs with ddNTPs causes DNA synthesis to terminate prematurely at random positions. Four reactions are performed in parallel, each containing one type of chain-terminating nucleotide. Separating the products by gel electrophoresis allows the sequence to be read base by base. At the time, the technology was revolutionary. It is capable of sequencing fragments of 500 to 1,000 bp.


Source: Praxilabs
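The chain-termination idea can be illustrated with a small, idealized simulation. The sketch below assumes that each of the four ddNTP reactions yields a fragment for every position where its base occurs in the newly synthesized strand, and that the gel resolves fragments perfectly by length. Real reactions terminate at random and real gels are noisy, so this is a teaching model only; all names are hypothetical.

```python
def termination_lanes(new_strand: str) -> dict:
    """Map each ddNTP reaction (A, C, G, T) to the fragment lengths it can produce.

    A fragment of length n appears in lane X when base n of the newly
    synthesized strand is X, because a ddNTP of type X incorporated
    at that position stops further extension.
    """
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes: dict) -> str:
    """Read the sequence by walking from the shortest to the longest fragment."""
    base_at_length = {
        length: base for base, lengths in lanes.items() for length in lengths
    }
    return "".join(base_at_length[length] for length in sorted(base_at_length))


synthesized = "GATTACAGC"  # hypothetical newly synthesized strand, 5'->3'
lanes = termination_lanes(synthesized)
print(lanes["A"])       # fragment lengths seen in the ddATP lane
print(read_gel(lanes))  # sequence recovered from the four lanes
```

Reading the four lanes from the shortest fragment upward recovers the synthesized strand, which is the reverse complement of the template.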

A variant of the Sanger approach, the "plus and minus" method developed by Sanger and Alan Coulson, was used to obtain the first DNA genome sequence, that of bacteriophage φX174, in 1977.


Source: pixels

In 1977, Allan Maxam and Walter Gilbert published their chemical cleavage technique, which became the first widely adopted method of DNA sequencing.


By the 1980s, Sanger's original method had been automated (capillary electrophoresis). Large gels were replaced by thin capillaries, and the results could be viewed on an electropherogram. This technology was critical to the completion of the Human Genome Project in 2003. Even after the Human Genome Project, however, capillary electrophoresis remained too costly for large-scale sequencing projects.

By the mid-2000s, efforts were under way to reduce the cost of sequencing, and laboratories around the world were testing new methods and technologies for higher throughput.

Second-generation sequencing technology

Second-generation sequencing is also known as next-generation sequencing (NGS). In simple terms, it is short-read sequencing that relies on PCR-based library construction and on reading fluorescent signals excited by a laser.

The most common platforms are Illumina and BGI.

Illumina sequencing platform

The second-generation NGS technologies developed by companies such as Illumina can be divided into two broad categories: sequencing by hybridization and sequencing by synthesis.

  • Sequencing by hybridization determines a DNA sequence by assembling a collection of overlapping oligonucleotide sequences.
  • Sequencing by synthesis uses polymerases or ligases to incorporate fluorescently tagged nucleotides and then identifies them to determine the DNA sequence.

BGI sequencing platform

BGI's sequencing chemistry is known as combinatorial probe-anchor synthesis (cPAS). It uses Phi29 DNA polymerase for rolling circle replication to synthesize a long single-stranded DNA that self-assembles into a DNA nanoball about 300 nanometers in size. The bases are then identified to determine the DNA sequence.

Alongside the advance of large-scale dideoxy sequencing, a new technique emerged that laid the foundation for next-generation sequencing (NGS). The method, called pyrosequencing, determines nucleotide sequences from the light produced when pyrophosphate is released during DNA synthesis. The template DNA is immobilized on a solid-phase surface and one type of nucleotide is added at a time; whenever a nucleotide is incorporated, the released pyrophosphate is converted enzymatically into a light signal, from which the sequence is inferred. The technique later adopted beads for more efficient attachment of DNA molecules.

Pyrosequencing was commercialized as a high-throughput platform by 454 Life Sciences, which was later acquired by Roche; it became the first commercially successful NGS platform on the market.
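The way pyrosequencing turns light signals into a sequence can be sketched as an idealized "flowgram". The sketch assumes that nucleotides are flowed in a fixed cyclic order (here `TACG`, chosen for illustration) and that the light signal of a flow is exactly proportional to the number of bases incorporated in that flow. Real signals are noisy, especially for long runs of the same base; the function names are hypothetical.

```python
def flowgram(new_strand: str, flow_order: str = "TACG", max_flows: int = 400):
    """Simulate ideal pyrosequencing signals for the strand being synthesized.

    Returns a list of (nucleotide, signal) pairs, one per flow. The signal is
    the number of consecutive bases incorporated during that flow (0 if none).
    """
    signals = []
    position = 0
    flow = 0
    while position < len(new_strand) and flow < max_flows:
        nucleotide = flow_order[flow % len(flow_order)]
        incorporated = 0
        # The polymerase keeps extending while the next base matches the flow.
        while position < len(new_strand) and new_strand[position] == nucleotide:
            incorporated += 1
            position += 1
        signals.append((nucleotide, incorporated))
        flow += 1
    return signals


def call_bases(signals) -> str:
    """Convert a flowgram back into a base sequence."""
    return "".join(nucleotide * count for nucleotide, count in signals)


signals = flowgram("TTAGC")  # hypothetical strand
print(signals)
print(call_bases(signals))
```

A signal of 2 in the first flow corresponds to the two consecutive T bases; flows with a signal of 0 simply mean that the flowed nucleotide did not match the next template position.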

Emulsion PCR

In this platform, DNA libraries are attached to tiny beads and amplified by water-in-oil emulsion PCR. For sequencing, the beads are loaded into a reaction plate, and smaller enzyme-carrying beads and dNTPs are introduced so that pyrosequencing can proceed. This highly parallelized approach increases sequencing throughput by orders of magnitude.

Bridge amplification

Following the success of 454 sequencing technology, many new parallel sequencing technologies emerged. The most notable of these is Solexa sequencing technology, which was later acquired by Illumina.

  • In the Illumina method, the DNA molecule to be sequenced is first bound, via its adapter, to a complementary oligonucleotide immobilized on the surface of the flow cell.
  • Next, a process called bridge PCR amplification forms high-density clusters of DNA fragments on the surface of the flow cell.
  • In the subsequent sequencing-by-synthesis process, fluorescently labeled dNTPs (deoxynucleotide triphosphates) are added one base at a time, and the identity of each added base is determined by detecting its fluorescent signal.
  • In this way, a very large number of clusters can be read in parallel.

As a result, the Illumina platform has become the most widely used high-throughput parallel sequencing technology.


Other platforms

Over time, new technologies have emerged, including:

  • Ion Torrent, a technology that performs sequencing by measuring pH changes during DNA polymerization;
  • SOLiD technology, which uses a sequencing-by-ligation approach that does not rely on polymerase-catalyzed synthesis.

Image source: SlideServe

These innovations have become part of the field of next-generation sequencing (NGS) technology. NGS platforms are currently the mainstream sequencing technology, and they can perform high-throughput sequencing work at a relatively low cost. However, these platforms often have limited read lengths, typically producing reads between 50 and 500 base pairs (bp).

In this article, we introduce the sequencing principles of Illumina and BGI in more detail; the other platforms are covered only briefly.

Introduction to the Illumina sequencing platform

Illumina obtained its first sequencing platform, the Genome Analyzer, through the acquisition of Solexa, Inc., and began selling it commercially in 2007. The device was capable of sequencing 6 million amplified DNA fragments per sequencing channel, initially with a read length of approximately 30 bases per fragment. Illumina soon increased this read length to more than 100 base pairs. At the same time, the number of amplified fragments in the flow cell was increased, giving the Genome Analyzer an output capacity of 80 gigabases of sequence.

Note: In sequencing, output is measured in gigabases (Gb, one billion bases), which should not be confused with the gigabyte (GB), a unit of computer storage capacity.

In 2010, Illumina launched its second-generation NGS instrument, the HiSeq. The instrument is equipped with two flow cells that work in alternation:

  • One undergoes the chemical reaction that adds bases
  • One is scanned to identify the base added to each amplified cluster

This was followed by the release of the HiSeq X Ten, which further increased the number of analyzable fragments by using patterned flow cells with ordered wells instead of traditional randomly positioned clusters.

Today, Illumina offers a variety of sequencing equipment, including the NextSeq and NovaSeq families, as well as benchtop sequencers for different scale needs, such as the iSeq100 and MiniSeq.


NextSeq

Introduced in 2014, the NextSeq 500 uses two-dye (two-channel) sequencing chemistry instead of the four dyes used by its predecessors. Only red and green images are taken, resulting in significantly shorter cycle and data processing times. The instrument is capable of producing about 400 million reads in approximately 30 hours of runtime.
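With only red and green images, the four bases have to be encoded as combinations of the two signals. The sketch below uses the mapping commonly described for two-channel chemistry (both images = A, red only = C, green only = T, no signal = G). Treat this mapping, the 0.5 threshold, and the normalized intensity values as assumptions for illustration; real base callers work on calibrated intensities with quality models.

```python
# Assumed two-channel encoding: (seen in red image, seen in green image) -> base.
TWO_CHANNEL_CODE = {
    (True, True): "A",    # visible in both images
    (True, False): "C",   # red only
    (False, True): "T",   # green only
    (False, False): "G",  # dark in both images
}


def call_base(red: float, green: float, threshold: float = 0.5) -> str:
    """Call one base from the normalized red/green intensities of one cluster."""
    return TWO_CHANNEL_CODE[(red >= threshold, green >= threshold)]


def call_read(intensities) -> str:
    """Call a read from (red, green) pairs, one pair per sequencing cycle."""
    return "".join(call_base(red, green) for red, green in intensities)


# Hypothetical intensities for a four-cycle read of a single cluster.
cycles = [(0.9, 0.8), (0.1, 0.9), (0.05, 0.1), (0.95, 0.2)]
print(call_read(cycles))  # ATGC
```

Because two images per cycle are enough to distinguish four bases, imaging and data processing take less time than with four separate dye channels.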

The NextSeq 1000 and 2000 machines were released in 2020 and are designed to streamline workflows by providing on-board informatics and cloud-based technology. The P3 flow cell extends the range of the NextSeq 2000 instrument, delivering 1.1 billion reads in a single sequencing run.

NovaSeq 6000

The NovaSeq 6000 was released in 2017. It can run three different flow cell types and can generate 100 Gb (gigabases) of sequence output for just $375. That price applies only to sequencing and does not include DNA isolation, library preparation, sequencing analysis, or data storage.

Essentially, the machine is capable of sequencing up to 48 complete human genomes per run, which can take up to 44 hours. Other key applications include single-cell analysis, transcriptome sequencing, and metagenomic analysis.


HiSeq X Series

The HiSeq X Ten is a high-performance sequencing system capable of producing up to 16 Tb (terabases) of sequence output in a single run. With this system, a human genome can be sequenced at 30x coverage or more for less than $1,000, and more than 18,000 human genomes can be sequenced annually. Each flow cell can generate up to 52 billion reads with a maximum run time of 48 hours.
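Sequencing a genome "by a factor of 30" refers to 30-fold average coverage, often written 30x. A common back-of-the-envelope estimate is coverage = number of reads × read length / genome size. The sketch below applies this formula using the 3-billion-base human genome mentioned earlier and a read length of 150 bases, which is an illustrative assumption; the function names are hypothetical.

```python
import math


def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth of coverage: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Number of reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


HUMAN_GENOME = 3_000_000_000  # about 3 billion base pairs
READ_LENGTH = 150             # assumed read length in bases

n = reads_needed(30, READ_LENGTH, HUMAN_GENOME)
print(n)                                       # 600000000
print(coverage(n, READ_LENGTH, HUMAN_GENOME))  # 30.0
```

Under these assumptions, 30x coverage of one human genome corresponds to about 600 million reads, or 90 gigabases of raw sequence. This is an average: actual depth varies along the genome.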

The system's whole-genome sequencing capabilities extend beyond the human species, and it can also be used for whole-exome sequencing, transcriptome sequencing, single-cell analysis, and multiomics studies.

Introduction to the BGI sequencing platform

BGI was founded in 1999 as a Chinese company involved in the Human Genome Project. BGI acquired Complete Genomics in 2012, and its products are sold by a subsidiary (MGI).

DNBSEQ-T7

DNBSEQ-T7 was launched in 2019 to support a range of large-scale sequencing applications for health programs and clinical research. Together with the Million Genome Total Solution software and hardware, DNBSEQ-T7 has been reported to be able to sequence up to 800,000 samples per year.


The hardware solution includes an automated library preparation system, which means that the sequencer can run 24 hours a day without human intervention and complete 60 human whole-genome sequences per day. Its commercialization is expected to change the sequencing landscape by reducing the cost of whole-genome sequencing for individuals to less than $500.

BGI Sequencing Chemistry

The sequencing chemistry of BGI is known as combinatorial probe-anchor synthesis (cPAS). It uses Phi29 DNA polymerase for rolling circle replication to synthesize a long single-stranded DNA that self-assembles into a DNA nanoball (DNB) about 300 nanometers in size. The nanoballs are loaded onto a patterned silicon flow cell whose positively charged spots selectively bind them in a highly ordered array, and fluorescent probes are bound to them. Fluorescence emission is then imaged and measured to record the base at each position.

As with all short-read sequencing methods, the main drawback of the BGI platform is the inability to obtain long DNA sequences. However, an important advantage of cPAS-based sequencing is the high accuracy of Phi29 DNA polymerase, which ensures accurate amplification of circular templates. In addition, because the DNA nanoballs remain fixed in place on the flow cell, they do not produce optical duplicates and do not interfere with adjacent DNA.

DNBSEQ-G99 (G99)

The DNBSEQ-G99 (hereinafter "G99") gene sequencer uses sequencing technology based on the principle of the polymerase chain reaction (PCR). During sequencing, specific primers guide in vitro amplification of the DNA, and then dNTPs (deoxynucleotide triphosphates) carrying fluorescent markers in four different colors are added. When the primer binds to the sequence to be read, the polymerase begins to synthesize the new strand, and the marker on each incorporated nucleotide emits fluorescence of a characteristic color. By recording these fluorescent signals and decoding them computationally, the identity of each base is determined.


The G99 completes a PE150 (paired-end, 150 bp) sequencing run in 12 hours. It is fast, simple, and flexible while providing high-quality sequencing data based on user needs, which improves the sequencing experience and greatly expands application scenarios.

Moreover, the DNBSEQ-G99 has received a Medical Device Registration Certificate from the National Medical Products Administration (NMPA) (National Device Approval 20233221289). This means that the DNBSEQ-G99, the "speed king" among small and medium-throughput sequencers, has been approved for clinical application in the Chinese domestic market, where its speed and flexibility can serve clinical needs.

Third-generation sequencing technology

Third-generation sequencing is mainly based on single-molecule sequencing or sequencing-by-synthesis methods that read the sequence directly from individual DNA molecules.

Single-molecule sequencing: DNA is fixed on a surface and sequenced using fluorescent dyes or other probes.

Single-molecule real-time sequencing (SMRT): Using PacBio's SMRT technology, sequencing is performed by monitoring the fluorescent signals emitted as a DNA polymerase incorporates nucleotides along the DNA template.

Nanopore Sequencing: Using Oxford Nanopore Technologies' (ONT) nanopore sequencing technology, sequencing is performed by passing DNA molecules through nanopores and measuring the change in current through nanopores.


Sequencing by synthesis: A complementary DNA strand is synthesized stepwise in a reaction system, and each incorporated base is identified through fluorescently labeled nucleotides. Third-generation sequencing technologies typically have long read lengths, from thousands up to millions of bases.

The continuous development and improvement of third-generation sequencing technology has provided more possibilities for genomics research to better resolve complex genome structures and functions. It is suitable for sequencing long fragments, such as whole genome sequencing, long-read transcriptome sequencing, methylation sequencing, etc. However, third-generation sequencing technology also faces some challenges, such as sequencing error rate, data processing and analysis, etc., which require further research and improvement.

Other third-generation sequencing platforms on the market:

MinION: The MinION device is a portable nanopore sequencing instrument that enables real-time sequencing with a small footprint and low cost.

GridION: The GridION device is a high-throughput nanopore sequencing instrument that can sequence multiple samples simultaneously.

PromethION: The PromethION device is a high-yield, nanopore sequencing instrument that enables large-scale genome sequencing.

In addition, there are also a number of domestic companies that have launched or are developing third-generation sequencers, including Zhenmai Biotech, Qi Carbon Technology, etc.

03

Pre-DNA sequencing steps and precautions

Sequencing will continue to become more efficient and affordable, revolutionizing several fields related to genomics. Currently, all high-throughput sequencing (NGS) methods require library preparation. In this procedure, the DNA is fragmented and adapters are attached to the ends of each fragment. This is usually followed by a DNA amplification step to generate a library that can then be sequenced on an NGS platform.


01

A step-by-step guide to sample preparation

The essence of sample preparation is to convert a mixture of nucleic acids from a biological sample into different types of libraries in preparation for the sequencing steps required for NGS technology. If the protocol is not followed correctly, sequencing will be affected. Each preparation step is foundational and has different considerations depending on the sample and the type of NGS platform. Therefore, it is important to consider how to perform the most effective protocol to ensure the highest quality results before starting an experiment.

The general steps for sample preparation are as follows:

Step 1: Extract the genetic material

This is the first step in every sample preparation protocol: nucleic acids (DNA or RNA) are extracted from a variety of biological samples.

Step 2: Library preparation

Library generation requires a series of steps, with the ultimate goal of converting the extracted nucleic acids into a format appropriate for the sequencing technique of choice. This is done by fragmenting the target sequences to the desired length and then attaching specific adapter sequences to the ends of those target fragments.

Adapters can also include barcodes to identify specific samples and allow for multiplexing. Fragmentation can be done by physical or enzymatic methods.
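
As a rough illustration of how barcodes make multiplexing possible, the sketch below assigns reads to samples by an exact match on a leading barcode. The barcode sequences, sample names, and the `demultiplex` helper are hypothetical and chosen only for this example; production demultiplexing tools typically also tolerate mismatches and may read barcodes from separate index reads.

```python
from collections import defaultdict

# Hypothetical barcode-to-sample table (illustrative values only).
BARCODES = {"ACGT": "sample_A", "TGCA": "sample_B"}


def demultiplex(reads, barcodes, barcode_len=4):
    """Group reads by sample using an exact match on the leading barcode."""
    by_sample = defaultdict(list)
    for read in reads:
        tag, insert = read[:barcode_len], read[barcode_len:]
        # Reads whose barcode is not in the table go to "undetermined".
        sample = barcodes.get(tag, "undetermined")
        by_sample[sample].append(insert)
    return dict(by_sample)


reads = ["ACGTGGGTTT", "TGCAAAACCC", "ACGTCCCGGG", "NNNNAAAAAA"]
assigned = demultiplex(reads, BARCODES)
```

Here two reads end up under `sample_A`, one under `sample_B`, and the read with an unreadable barcode is set aside rather than guessed.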

Step 3: Amplification

This step is optional but often required, depending on the NGS application and the amount of sample. Amplification is essential to obtain sufficient coverage to reliably sequence samples containing small amounts of starting material. Polymerase chain reaction (PCR) is a common method for increasing the amount of DNA, and it is what makes nucleic acid detection possible in very small samples.

Step 4: Purification and quality control

This step is necessary to remove any unwanted material that might hinder sequencing. Some NGS platforms have narrow size requirements, so discarding fragments that are too large or too small can improve sequencing efficiency. The optimal library size is determined by the sequencing application. This clean-up is usually done with magnetic beads or agarose gel electrophoresis.

Quality control is the last process before sequencing is performed. Confirming the quality and quantity of DNA can increase confidence in the sequencing data. Subsequent experiments are time-consuming and expensive, requiring strict quality control steps to ensure that all samples are suitable for their application.

02

Common challenges in sample preparation

Challenge 1

Many samples come from a limited amount of material or even from individual cells. They do not provide enough genetic material on their own, so PCR is required. However, this amplification step can easily introduce bias into the sample. PCR duplicates are multiple copies of an identical DNA fragment. Too many PCR duplicates can lead to uneven sequencing coverage.

Solution 1: Eliminating all sources of bias is practically impossible, but it is important to understand where bias arises and to take all practical steps to minimize it. A high PCR duplication rate indicates that library preparation needs some modification and that the complexity of the NGS library may need to be increased.

Many programs can remove PCR duplicates; the most commonly used are Picard MarkDuplicates and SAMtools. In addition, specific PCR enzymes have been shown to minimize amplification bias. Ultimately, the goal of library preparation is to maximize sample complexity and minimize the bias caused by amplification.
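
The idea behind duplicate removal can be sketched as follows. This is a simplified illustration, not the actual algorithm of Picard MarkDuplicates or SAMtools: it assumes that reads aligned to the same chromosome, start position, and strand are copies of one fragment, and it keeps the best-scoring read of each group. The dictionary keys and the `mark_duplicates` helper are hypothetical.

```python
from collections import defaultdict


def mark_duplicates(alignments):
    """Flag reads sharing chromosome, start and strand; keep the best-scoring one.

    `alignments` is a list of dicts with the hypothetical keys
    'name', 'chrom', 'start', 'strand' and 'score'.
    """
    groups = defaultdict(list)
    for aln in alignments:
        groups[(aln["chrom"], aln["start"], aln["strand"])].append(aln)

    duplicates = set()
    for group in groups.values():
        # Highest-scoring read first; every other read in the group is a duplicate.
        ranked = sorted(group, key=lambda a: a["score"], reverse=True)
        duplicates.update(a["name"] for a in ranked[1:])
    return duplicates


alignments = [
    {"name": "r1", "chrom": "chr1", "start": 100, "strand": "+", "score": 35},
    {"name": "r2", "chrom": "chr1", "start": 100, "strand": "+", "score": 38},
    {"name": "r3", "chrom": "chr1", "start": 100, "strand": "-", "score": 30},
    {"name": "r4", "chrom": "chr2", "start": 500, "strand": "+", "score": 36},
]
dups = mark_duplicates(alignments)
duplication_rate = len(dups) / len(alignments)
```

In this toy input only `r1` is flagged, because `r2` covers the same position and strand with a higher score; the resulting duplication rate is one read in four.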

Challenge 2

Inefficient library preparation is a common problem in sample preparation. It shows up as a lower proportion of fragments carrying the correct adapters. The consequences are a decrease in the amount of sequencing data obtained and an increase in the number of chimeric fragments. Chimeric reads combine parts of the genome that are not adjacent to each other and are a source of errors during sequencing.

Solution 2: Efficient A-tailing of PCR products has been reported to prevent chimera formation, and this procedure is versatile and can be applied to a number of different library construction techniques. In addition, identifying strand-split artifact reads (SSARs) has been suggested as a way to reduce the number of chimeric artifacts in the sample, and chimera detection programs can be used to filter the raw sequences to achieve an overall chimera rate of only 1%.

Challenge 3

Sample contamination is an inherent problem, as individual libraries are often prepared in parallel. The most likely major source of contamination is preamplification, which is a method of increasing the amount of nucleotide sequences prior to PCR.

Solution 3: Contamination can be identified through quality control, negative controls, and replicates. Using aseptic technique and sterile experimental conditions during sample preparation helps prevent the introduction of exogenous contamination.

In addition, unique barcodes and labels are used to identify samples (all Guhe samples are uniquely barcoded and tracked throughout the process) to avoid mix-ups and cross-contamination. Finally, clean and disinfect lab equipment and work areas regularly to reduce the accumulation and spread of contamination.

Challenge 4

The significant cost of library preparation is mainly attributable to lab equipment, staff training, and reagents.

Solution 4: By optimizing the protocol and conditions, you can reduce the amount of reagents used and waste, thereby reducing costs. Ensure that laboratory personnel receive appropriate training and technical support to improve the efficiency and accuracy of experiments. Collaborate with other labs or research teams to share equipment and resources, and share costs and experimental burdens. As automation becomes more popular, the accuracy and efficiency of sample preparation is likely to improve.

04

Points to note during NGS sequencing

Base balance

What is Base Balance?

A principle that cannot be ignored in sequencing is base balance: in every sequencing cycle, the four bases A, C, G, and T should be present in relatively even proportions. Both the balance and the complexity of the library need to be taken into account. Maintaining base balance during sequencing is very important for the accuracy and reliability of the results.
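
One simple way to look at base balance is to tabulate the base composition of each cycle across a set of reads. The sketch below does this for a handful of toy reads; the `is_unbalanced` helper and its 0.6 cutoff are arbitrary assumptions made for illustration, not a threshold taken from any instrument specification.

```python
from collections import Counter


def per_cycle_composition(reads):
    """Return, for each cycle, the fraction of A, C, G and T across all reads."""
    n_cycles = min(len(r) for r in reads)
    composition = []
    for cycle in range(n_cycles):
        counts = Counter(read[cycle] for read in reads)
        # Only A/C/G/T are counted; "or 1" avoids dividing by zero.
        total = sum(counts[b] for b in "ACGT") or 1
        composition.append({b: counts[b] / total for b in "ACGT"})
    return composition


def is_unbalanced(fractions, max_fraction=0.6):
    """Flag a cycle where one base dominates (cutoff is illustrative only)."""
    return max(fractions.values()) > max_fraction


# Amplicon-like toy reads: every read starts with the same two bases.
reads = ["ACGTA", "ACGCT", "ACTGG", "ACAAC"]
profile = per_cycle_composition(reads)
flags = [is_unbalanced(f) for f in profile]
```

The first two cycles are flagged because every read carries the same base there, while the later cycles show a mixed composition.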

What is a base imbalance library?

A typical example is a library produced from amplicons, which are characterized by a fixed start site, so every fragment begins with the same bases. On the sequencing images, one image is very bright with many light spots, while the other three are very dark. This makes it harder for the software to register the images and locate the clusters. Base calls become less reliable and base-calling errors increase, which significantly lowers sequencing quality. A common remedy is to spike in a substantial proportion of a base-balanced library, such as a genomic DNA library or a PhiX library. It also helps to pool as many different types of amplicon libraries as possible.

In addition, base errors need to be detected and corrected. Insertions, deletions, or miscalled bases may occur during sequencing and affect the accuracy of the results. Various bioinformatics tools and algorithms have been developed to address them, including quality control and base-correction methods.

Library length

The library length comprises the sequencing adapters on both ends plus the inserted target fragment. The length range of the library should not be too wide: 250 bp to 450 bp is generally recommended, and lengths above 600 bp cause adverse effects.

Excessively long library lengths can reduce sequencing efficiency

On high-throughput platforms such as Illumina, fragment length affects the quality and efficiency of sequencing. Excessively long libraries can increase the error rate and result in shorter usable read lengths. This reduces the reliability and accuracy of sequencing and affects subsequent bioinformatics analysis and data interpretation. Conversely, if the library fragments are too short, the later cycles read into the adapter sequence; once the adapter has been read through, no real signal is left and spurious signals are called instead, which lowers the sequencing quality values.
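
The short-fragment case can be made concrete with a little arithmetic: whenever the read length exceeds the insert length, the surplus cycles read adapter rather than sample. The sketch below computes that surplus and trims a read at the first exact occurrence of an adapter. Both helper functions are hypothetical, the adapter string is only an example value, and real trimming tools also handle partial and mismatched adapter matches.

```python
def adapter_bases_read(insert_length, read_length):
    """Number of cycles that run past the insert into the adapter."""
    return max(0, read_length - insert_length)


def trim_adapter(read, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    idx = read.find(adapter)
    return read if idx == -1 else read[:idx]


# Example adapter sequence, used here purely for illustration.
ADAPTER = "AGATCGGAAGAGC"

short_insert = adapter_bases_read(insert_length=100, read_length=150)
long_insert = adapter_bases_read(insert_length=300, read_length=150)
trimmed = trim_adapter("ACGTACGT" + ADAPTER, ADAPTER)
```

With a 100 bp insert and 150 cycles, 50 cycles fall into the adapter; with a 300 bp insert, none do.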

Excessively long library lengths can reduce cluster density

Illumina sequencing relies on sequencing by synthesis, in which only one base is incorporated per cycle, so all molecules within a cluster must react in step. That is the ideal case; in the actual reaction, the reaction time still differs from molecule to molecule (this is generally controlled through the reaction system and the enzyme), so some molecules within a cluster react quickly while others react slowly. In Illumina sequencing, DNA fragments are also immobilized on the flow cell and amplified by PCR into clusters. Excessively long libraries amplify less efficiently, which in turn reduces cluster density. Low cluster density reduces the number of sequenced fragments obtained, which lowers the coverage and depth of sequencing and affects subsequent data analysis and interpretation.

Excessively long library lengths can cause base shifts

During sequencing, long-fragment libraries are prone to base shifts caused, for example, by slippage of the DNA polymerase.

05

Next-generation sequencing data quality evaluation

Yield

Yield (data volume) refers to the total amount of PF data obtained in a single sequencing run. Note that this is PF data (pass-filter data, i.e., the valid reads that remain after quality filtering), not raw data. More data is generally better, but the actual yield depends on the sequencer model and differs between instruments.

The total amount of PF data is an important indicator of sequencing depth and sequencing quality. A higher total indicates that more valid sequenced fragments were obtained, which provides higher coverage and depth and thereby improves the reliability and accuracy of subsequent data analysis.

Q30

Q30 refers to bases with a quality value (QV) of 30 or higher, which corresponds to an estimated base-calling error probability of at most 1 in 1,000. The quality value is calculated from the sequencing instrument's measurements and signal peaks for each base and indicates the quality of that base call. A higher %Q30 indicates a higher proportion of high-quality bases in the sequencing data.

by:Alexander William Eastman

It is important to note that %Q30 is related to the read length. If the read length is longer, i.e., each sequenced fragment contains more bases, it is harder for every base to reach a quality value of 30 or more, so the average %Q30 may be lower. Conversely, if the read length is shorter, it is relatively easy for every base to reach a quality value of 30 or more, and the average %Q30 may be higher.
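
Quality values follow the Phred scale, where Q = -10 log10(P) and P is the estimated probability that the base call is wrong. As a minimal sketch, %Q30 can be computed from FASTQ quality strings as shown below, assuming the common Phred+33 ASCII encoding; the function names and the toy quality strings are illustrative.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality value to an error probability: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def percent_q30(quality_strings, offset=33):
    """Percentage of bases with quality >= 30, assuming Phred+33 encoding."""
    total = 0
    high = 0
    for qual in quality_strings:
        for ch in qual:
            total += 1
            if ord(ch) - offset >= 30:
                high += 1
    return 100.0 * high / total if total else 0.0


# 'I' encodes Q40, '?' encodes Q30, '5' encodes Q20 and '#' encodes Q2.
quals = ["II?5", "I?##"]
q30 = percent_q30(quals)
p_q30 = phred_to_error_prob(30)
```

Five of the eight toy bases reach Q30 or better, so `q30` is 62.5, and `p_q30` is approximately 0.001.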

Mapping rate

Aligning sequencing data to reference sequences is an important step in sequencing data analysis. The alignment (mapping) rate is the proportion of the sequencing data that can be aligned to the reference sequence, relative to the total amount of sequencing data. The higher the alignment rate, the higher the accuracy and reliability of the sequencing data. In bacterial 16S sequencing, the appropriate alignment tool can be selected according to specific needs.

Commonly used alignment tools include BLAST (Basic Local Alignment Search Tool), a heuristic local-alignment tool in the tradition of the Smith-Waterman algorithm, and Bowtie and BWA, which are based on the Burrows-Wheeler transform. A high alignment rate is one of the important indicators of good sequencing data quality. It indicates that the sequencing data is accurate and reliable and can provide more accurate genomic information, such as variant sites (in Guhe's 16S sequencing, more than 70% of the data from fecal samples can be aligned). In subsequent data analysis and interpretation, sequencing data with a high alignment rate is more helpful for accurate variant detection, gene expression analysis, and functional analysis.

It should be noted that the alignment rate is affected by a variety of factors, including sequencing data quality, accuracy of reference sequences, databases, and selection of alignment algorithms. When analyzing sequencing data, it is necessary to comprehensively consider the alignment rate, sequencing data quality, and other relevant indicators to obtain accurate and reliable analysis results.
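
The alignment rate is often reported at the read level, as mapped reads divided by total reads. The sketch below computes it from SAM FLAG values, using the standard flag bits for unmapped (0x4), secondary (0x100), and supplementary (0x800) records; the `mapping_rate` helper and the list of flags are illustrative, and skipping secondary and supplementary records is a choice made here so that each read is counted once.

```python
UNMAPPED = 0x4          # read has no alignment
SECONDARY = 0x100       # alternative alignment of a read
SUPPLEMENTARY = 0x800   # part of a split (chimeric) alignment


def mapping_rate(flags):
    """Fraction of reads that are mapped, counting each read once."""
    primary = [f for f in flags if not f & (SECONDARY | SUPPLEMENTARY)]
    if not primary:
        return 0.0
    mapped = sum(1 for f in primary if not f & UNMAPPED)
    return mapped / len(primary)


# Toy FLAG values: 0 and 16 are mapped, 4 is unmapped,
# 256 is secondary and 2048 is supplementary.
flags = [0, 16, 4, 256, 0, 4, 2048, 16, 0, 0]
rate = mapping_rate(flags)
```

Of the eight primary records in this toy input, six are mapped, giving a rate of 0.75.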

Coverage

Due to some technical and biological randomness in the generation of sequencing data, the coverage depth of sequencing data in different regions is different.

Depth of coverage refers to the number of reads or sequencing bases of sequencing data at a particular location. The higher the coverage depth, the richer the sequencing data at that location, and the higher the accuracy and reliability of the sequencing results.

It is important to note that the uniformity and level of coverage depth are affected by a variety of factors, including sequencing depth, sequencing technology, sample quality, etc.
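
Depth of coverage can be illustrated by counting, for each position in a region, how many reads overlap it. The sketch below assumes reads are given as 0-based, half-open `(start, end)` intervals on a single region; the `depth_profile` helper, the toy intervals, and the depth cutoff of 2 are illustrative assumptions.

```python
def depth_profile(reads, region_length):
    """Per-position depth for reads given as half-open (start, end) intervals."""
    depth = [0] * region_length
    for start, end in reads:
        # Clip each read to the region before counting.
        for pos in range(max(0, start), min(end, region_length)):
            depth[pos] += 1
    return depth


reads = [(0, 5), (3, 8), (4, 10)]
depth = depth_profile(reads, region_length=10)
mean_depth = sum(depth) / len(depth)
# Fraction of positions covered by at least two reads.
breadth_2x = sum(1 for d in depth if d >= 2) / len(depth)
```

Even in this tiny example the depth is uneven: it ranges from 1 at the edges to 3 in the middle, with a mean of 1.6.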

Duplication rate

In next-generation sequencing library construction, all methods except PCR-free approaches require PCR amplification. PCR can amplify different regions of the chromosome unevenly, so some sequences are over-amplified. This is an artificially introduced bias. The duplication rate is related to the quality of the library construction reagents and is typically < 10% for human whole-genome sequencing.

Capture rate

Hybridization capture enriches the sequences of interest from a genomic library through probe hybridization. Capture efficiency varies, so the parameter used to evaluate whether this step succeeded is the capture rate: the higher, the better. The capture rate depends on the capture reagent used and differs between reagents.
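
As a rough sketch, the capture rate can be estimated as the fraction of aligned reads that overlap a target region. Definitions vary in practice (some are based on bases rather than reads, or include flanking regions), so the read-based version below is only one possible choice; the intervals and helper names are hypothetical.

```python
def overlaps(read, target):
    """True if two half-open (start, end) intervals share at least one base."""
    return read[0] < target[1] and target[0] < read[1]


def capture_rate(reads, targets):
    """Fraction of reads overlapping any target interval."""
    if not reads:
        return 0.0
    on_target = sum(1 for r in reads if any(overlaps(r, t) for t in targets))
    return on_target / len(reads)


# Toy probe targets and aligned reads on one chromosome.
targets = [(100, 200), (500, 600)]
reads = [(90, 140), (150, 250), (300, 350), (590, 640), (700, 750)]
rate_on_target = capture_rate(reads, targets)
```

Three of the five toy reads touch a target, so the estimated capture rate is 0.6.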

06

Conclusion

High-throughput sequencing involves sample preparation, library construction, PCR amplification, the sequencing run itself, and more. The accuracy and reproducibility of an experiment can only be guaranteed by following standard operating procedures (SOPs). With the continuous emergence and improvement of new sequencing platforms and technologies, high-throughput sequencing has made rapid progress in throughput, quality, speed, and cost, and its range of applications has expanded greatly. In the near future, it is expected that high-throughput sequencing can be carried out anytime and anywhere at low cost.