Evolution of Sequencing Technologies and NGS Limitations
The evolution of sequencing techniques is broadly categorized into first-generation (FGS), next-generation (NGS), and third-generation (TGS) sequencing. FGS, exemplified by the Sanger and Maxam-Gilbert methods, laid the groundwork but could only read short fragments (a few hundred base pairs at a time), which highlighted the need for more advanced methods for complex DNA and RNA regions. NGS technologies, such as pyrosequencing and sequencing-by-synthesis, improved upon FGS mainly through massively parallel throughput and a lower cost per base. However, NGS platforms are commonly referred to as short-read techniques because of their limited read lengths (up to about 600 bases, and usually 75-250 bp), and they present several inherent weaknesses.
These limitations include:
- Inability to study full-length transcript variants, centromere and telomere genomic regions, and gene fusions.
- Difficulty in resolving repetitive regions of the genome, making genetic variations like repeat expansion disorders and structural variants (SVs) challenging to identify. This is because a short read cannot be uniquely mapped when the repeat element it falls in is longer than the read itself. More than half of the human genome consists of repetitive elements such as short tandem repeats (STRs), SINEs, LINEs, and segmental duplications (low copy repeats, LCRs).
- Challenges in characterizing regions with extreme guanine-cytosine (GC) content or multiple homologous elements.
- Difficulty in detecting epigenetically modified bases, because the required PCR amplification step erases native base modifications and also adds cost and time.
- General inability to determine nucleotide-level breakpoints of copy number variants (CNVs) or to differentiate between tandem, dispersed, and inverted duplications.
These shortcomings spurred the development of third-generation sequencing (TGS), also known as long-read sequencing (LRS), to overcome these barriers.
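The mappability limitation in the list above can be illustrated with a toy example. The sketch below is not a real aligner: it uses exact string matching, and the reference sequence, flanks, and read lengths are all hypothetical values chosen for illustration. It only shows that a read lying entirely inside a repeat matches at many positions, while a read long enough to reach unique flanking sequence matches once.

```python
from typing import List


def mapping_positions(reference: str, read: str) -> List[int]:
    """Return every 0-based offset at which `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Step forward by one base so overlapping matches are also found.
        start = reference.find(read, start + 1)
    return positions


# Hypothetical toy reference: unique flank + 20 copies of CAG + unique flank.
LEFT_FLANK = "ATGCGTACCTGA"
RIGHT_FLANK = "TTGACCGTAGGC"
REPEAT = "CAG" * 20
REFERENCE = LEFT_FLANK + REPEAT + RIGHT_FLANK

# A "short read" that lies entirely inside the repeat.
short_read = "CAG" * 5
# A "long read" that spans the whole repeat and reaches into both flanks.
long_read = LEFT_FLANK[-6:] + REPEAT + RIGHT_FLANK[:6]

short_hits = mapping_positions(REFERENCE, short_read)
long_hits = mapping_positions(REFERENCE, long_read)

print(f"short read: {len(short_hits)} candidate positions")
print(f"long read:  {len(long_hits)} candidate position(s)")
```

Because the short read is shorter than the repeat, every repeat-unit offset is an equally good placement, which is the ambiguity the text describes.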
Long-Read Sequencing Technologies: PacBio and ONT
LRS technologies, primarily Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), have significantly simplified genome reconstruction and improved assembly contiguity. They offer several key advantages over short-read sequencing:
- Generation of reads orders of magnitude longer, ranging from 10 kilobases (kb) up to several megabases, with a reported record of 2.3 Mb.
- Real-time data analysis, which means sequencing data is analyzed as it is generated.
- No requirement for PCR amplification steps before sequencing, eliminating PCR-related biases and allowing for direct detection of native base modifications.
- More precise mapping of reads to reference genomes and improved variant detection.
PacBio’s Single Molecule Real-Time (SMRT) Sequencing was the first LRS technology to be widely adopted. It involves immobilizing DNA polymerase in wells called zero-mode waveguides (ZMWs) on a chip and detecting light emission as fluorescently labeled nucleotides are incorporated into circularized DNA templates called SMRTbells. Early continuous long reads (CLR) had lower accuracy (85–92%). However, the introduction of High-Fidelity (HiFi) reads, produced by circular consensus sequencing (CCS), marked a major improvement, offering exceptional accuracy (over 99%, often up to 99.9%) for reads over 10 kb. This is achieved by having the polymerase travel around the circular SMRTbell template multiple times, so that the CCS algorithm can merge the resulting subreads into a high-accuracy consensus. PacBio’s Revio system further enhances HiFi throughput (up to 360 Gb per day) with improved accuracy (Q33) and methylation calling capability.
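The idea behind merging subreads can be sketched with a column-wise majority vote. This is a deliberately simplified illustration, not PacBio's CCS algorithm: it assumes the subreads are already aligned to the same length and contain only substitution errors, whereas real consensus callers also model insertions and deletions. The template and passes below are hypothetical.

```python
from collections import Counter
from typing import List


def majority_consensus(subreads: List[str]) -> str:
    """Column-wise majority vote over equal-length, pre-aligned subreads."""
    if not subreads:
        raise ValueError("need at least one subread")
    length = len(subreads[0])
    if any(len(s) != length for s in subreads):
        raise ValueError("subreads must be pre-aligned to the same length")
    consensus = []
    for column in zip(*subreads):
        # Pick the most frequent base observed at this position.
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Hypothetical passes over one template; three passes each carry one error.
TRUE_TEMPLATE = "ACGTTGCAAGGC"
passes = [
    "ACGTTGCAAGGC",
    "ACGATGCAAGGC",  # substitution at position 3
    "ACGTTGCTAGGC",  # substitution at position 7
    "ACGTTGCAAGGC",
    "TCGTTGCAAGGC",  # substitution at position 0
]

consensus = majority_consensus(passes)
print(consensus == TRUE_TEMPLATE)
```

Since the errors in individual passes fall at different positions, each one is outvoted by the other passes, which is why accuracy rises with the number of passes.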
Oxford Nanopore Technologies (ONT) distinguishes itself by generating reads of hundreds to thousands of kilobases, often outperforming PacBio in maximum read length. ONT’s technology is based on detecting changes in electric current as single-stranded DNA (or RNA) molecules pass through nanopores embedded in a membrane. The deflections in electric current are characteristic of the nucleotides passing through the pore, creating signatures that can be decoded into sequence. Recent advancements, such as the Q20+ platform update and Ligation Sequencing Kit V14, have pushed ONT’s long-read accuracy to over 99%. ONT offers various platforms, from the portable MinION to the high-throughput PromethION, with data yields ranging from 2.8 Gb (Flongle) to 50-200 Gb (PromethION). Ultra-long reads (over 100 kb) from ONT have been crucial for completing the human genome and resolving highly repetitive regions.
Technical Advances and Challenges in LRS
Historically, LRS was held back by limited yield, high error rates, and high cost per base. While early base-calling accuracy was around 85%, it has now reached almost 99% for SMRT and about 95% for single-pass ONT reads, with the newer ONT chemistries noted above exceeding this. The quality of SMRT reads increases with the number of polymerase passes over the same DNA fragment, reaching 99% with multiple passes. ONT read quality, however, depends on the speed at which each base is ratcheted through the nanopore, with a median single-pass accuracy of around 95%.
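Accuracy percentages like those above are often expressed as Phred quality scores, which is where labels such as Q20+ and Q33 come from. The sketch below uses the standard Phred definition, Q = -10 * log10(error probability); the function names are illustrative and not taken from any vendor software.

```python
import math


def accuracy_to_phred(accuracy: float) -> float:
    """Convert per-base accuracy (0 <= accuracy < 1) to a Phred quality score."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in the range [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


def phred_to_accuracy(q: float) -> float:
    """Convert a Phred quality score back to per-base accuracy."""
    return 1.0 - 10.0 ** (-q / 10.0)


# Accuracy values mentioned in the text, expressed on the Phred scale.
for acc in (0.85, 0.95, 0.99, 0.999):
    print(f"{acc:.1%} accuracy is about Q{accuracy_to_phred(acc):.1f}")

# And the reverse direction for the Q labels mentioned in the text.
for q in (20, 33):
    print(f"Q{q} is about {phred_to_accuracy(q):.4%} accuracy")
```

The scale is logarithmic, so moving from 95% to 99% accuracy removes four out of every five errors even though the percentage changes by only a few points.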
Error correction and assembly polishing are vital for LRS data. Hybrid error correction uses high-accuracy short-read data to correct errors in long reads, while self-correction exploits the redundancy within the long-read data itself. Assembly polishing, a refinement step applied after assembly, improves accuracy and eliminates artifacts in the assembled sequence, with tools like Racon and Medaka widely used for Nanopore data.
De Novo Genome Assembly and the T2T Consortium
De novo whole-genome assembly involves piecing together nucleotide sequences to construct complete chromosomes without a reference. A major advantage is avoiding biases from evolutionary or genetic differences between a reference and the sequenced genome. However, it is computationally intensive, especially for complex eukaryotic genomes, and faces challenges with heterozygosity, repetitive elements, and polyploidy. LRS platforms demonstrate improved capabilities in assembling such complex genomes.
The Telomere-to-Telomere (T2T) consortium showcased the power of LRS by achieving the first complete, gapless human genome sequence (the T2T-CHM13 assembly), addressing previously intractable regions. This monumental achievement combined ultra-long nanopore reads, high-fidelity (HiFi) sequencing, and Hi-C data. The T2T-CHM13 assembly yielded a complete human haplotype about 3 billion base pairs long, contributing to the recognition of almost 4,000 new genes and including gapless telomere-to-telomere assemblies for all 22 human autosomes and chromosome X. It also resolved 151 Mbp of previously unknown sequence.
Key areas where LRS tackles previous barriers in assembly and variant detection include:
- Structural Variants (SVs): LRS is most suitable for detecting SVs, especially large genomic alterations like insertions, deletions, inversions, and translocations, which are typically longer than 50 bp and hard to identify with short-read approaches.
- Repetitive Regions: LRS can span long repetitive or otherwise problematic regions, with a reported accuracy of 99.9% in these regions. The T2T consortium successfully sequenced challenging satellite arrays and acrocentric short arms, which were largely missing from previous reference genomes.
- Haplotype Phasing: LRS enables accurate long-range haplotype phasing, assigning sequence data to maternally or paternally inherited chromosomes. This is crucial for resolving inconclusive diagnoses in autosomal recessive conditions and identifying complex rearrangements.
- Homologous Genes and Pseudogenes: LRS can differentiate variants in highly homologous genes or those with pseudogenes, areas where short-read methods struggle due to mappability issues.
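One reason long reads suit SV detection is that a single alignment can contain the whole event. As a minimal sketch, the code below scans a SAM-style CIGAR string for insertions and deletions of at least 50 bp, the size threshold mentioned in the list above. The CIGAR string and start coordinate are hypothetical, and real SV callers go well beyond this by combining evidence from many reads and from split alignments.

```python
import re
from typing import List, Tuple

# One CIGAR operation: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
MIN_SV_LENGTH = 50  # size threshold taken from the text


def large_indels(
    cigar: str, ref_start: int, min_len: int = MIN_SV_LENGTH
) -> List[Tuple[str, int, int]]:
    """Return (type, reference_position, length) for each indel >= min_len."""
    calls = []
    ref_pos = ref_start
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op == "I" and length >= min_len:
            calls.append(("INS", ref_pos, length))
        elif op == "D" and length >= min_len:
            calls.append(("DEL", ref_pos, length))
        # Only these operations advance the position on the reference.
        if op in "MDN=X":
            ref_pos += length
    return calls


# Hypothetical alignment of one long read starting at reference position 10000.
example_cigar = "1200M75I800M3D500M310D2000M"
calls = large_indels(example_cigar, ref_start=10_000)
for sv_type, position, length in calls:
    print(sv_type, position, length)
```

The 3 bp deletion in the example is skipped as a small indel, while the 75 bp insertion and the 310 bp deletion are reported with their reference coordinates.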
Bioinformatics of Long Reads
The rise of LRS necessitated a new generation of bioinformatics tools to handle their unique features, particularly increased read lengths. These tools are critical for base calling, detection of base modifications, variant calling, and genome assembly.
- Base Calling: LRS platforms translate specific electric signals (ONT) or fluorescent light pulses (PacBio) directly into nucleic acid sequences. Many TGS base callers can execute alignment in parallel with base identification.
- Epigenetic Modifications: LRS allows for the direct detection of modified bases (e.g., 6mA, 5mC, 5hmC) without the need for bisulfite conversion or PCR amplification, which were limitations for NGS. ONT detects these based on signal shifts, while PacBio uses altered kinetics of base-pair incorporation.
- Variant Calling: LRS variant callers are built upon de novo assembly, short-read alignment, or long-read mapping approaches. They show overall better performance for spanning repetitive and problematic regions.
- Genome Assembly: LRS has simplified genome reconstruction and improved assembly contiguity. Techniques like scaffolding using genetic markers, optical maps, or linked reads are crucial for resolving fragmented assemblies.
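Assembly contiguity, mentioned in the last item above, is commonly summarized with the N50 statistic: the contig length at which contigs of that length or longer contain at least half of the assembled bases. The sketch below computes it for two hypothetical sets of contig lengths; the numbers are made up for illustration and do not come from any real assembly.

```python
from typing import List


def n50(contig_lengths: List[int]) -> int:
    """Return the N50 of a list of contig lengths."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Hypothetical contig lengths (in kb) for two assemblies of the same 200 kb region.
fragmented_assembly = [50, 40, 30, 30, 20, 20, 10]
contiguous_assembly = [120, 60, 20]

print("fragmented N50:", n50(fragmented_assembly))
print("contiguous N50:", n50(contiguous_assembly))
```

Both assemblies cover the same total length, but the one with fewer, longer contigs has the higher N50, which is the sense in which long reads improve contiguity.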
Clinical Applications of LRS
LRS has profound implications across various clinical fields:
- Rare Disease Diagnosis: LRS helps identify novel pathogenic mutations in Mendelian conditions that remain undiagnosed by NGS. It is particularly effective for short tandem repeat (STR) expansion disorders (e.g., Huntington’s disease, Fragile X syndrome), providing unbiased sizing and sequence determination of neuropathogenic STR sites in a single targeted assay. LRS can resolve the underlying cause of rare genetic conditions like Gitelman syndrome by identifying previously missed intronic variants.
- Oncology: LRS is proving valuable for detecting complex structural variants (SVs) in cancer genomes, including copy-balanced SVs and novel SVs missed by short-read approaches. It can identify fusion transcripts, which are major drivers of carcinogenesis, enabling rapid diagnosis and guiding therapy selection. LRS also supports multilayer analysis of the transcriptome and epigenome, allowing direct detection of DNA methylation landscapes in tumors for classification and prognostic markers.
- Infectious Diseases: The speed and portability of ONT platforms make them ideal for rapid, in situ diagnosis and genomic surveillance of pathogens, as demonstrated during the Ebola outbreak and COVID-19 pandemic. This enables quicker identification of disease sources, monitoring of viral evolution, and tracking transmission chains. LRS also aids in antibiotic resistance research and personalized treatment of diseases like tuberculosis.
- Transplantation: LRS has revolutionized Human Leukocyte Antigen (HLA) typing, allowing highly accurate and comprehensive identification of rare and complex HLA alleles, which are critical for donor-recipient matching and predicting graft rejection. It also helps unravel the causes of graft-versus-host disease (GVHD) and the role of killer-cell immunoglobulin-like receptor (KIR) genes.
- Accessibility and Portability: Portable LRS devices like MinION address health inequalities by enabling genomic testing in logistically challenging or resource-limited settings, reducing reliance on centralized sequencing centers and sample shipment.
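The STR sizing mentioned under rare disease diagnosis relies on a read spanning the entire repeat, so the repeat units can simply be counted. The sketch below counts the longest uninterrupted run of a repeat unit in a single read. The read and its flanking sequences are hypothetical, and a clinical assay would also need to handle sequencing errors and interrupted repeats, which this does not.

```python
import re


def longest_repeat_run(read: str, unit: str) -> int:
    """Return the copy number of the longest uninterrupted run of `unit`."""
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    runs = [len(match.group(0)) // len(unit) for match in pattern.finditer(read)]
    return max(runs, default=0)


# Hypothetical long read: unique flank + 42 CAG units + unique flank.
LEFT_FLANK = "GCCTTCGAGTCCCTCAAGTCCTTC"
RIGHT_FLANK = "CAACAGCCGCCACCG"
read = LEFT_FLANK + "CAG" * 42 + RIGHT_FLANK

repeat_count = longest_repeat_run(read, "CAG")
print("CAG repeat units:", repeat_count)
```

A short read that ends inside the repeat could only give a lower bound on the count, whereas a read anchored in both flanks gives the full size directly.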
Despite these advancements, challenges remain, including the need for high-quality, high-molecular-weight DNA input, higher per-base costs compared to short reads, ongoing refinement of bioinformatics tools, and standardization for routine clinical adoption. Nevertheless, LRS holds the promise to become a mainstream tool in clinical settings, enabling widespread implementation of precision medicine.