Bibliographic and Educational Resources in Cytogenomics

This platform is designed to serve as a comprehensive educational and bibliographic resource for healthcare professionals involved in cytogenomics. Covering a wide range of up-to-date topics within the field, it offers structured access to recent scientific literature and a variety of pedagogical tools tailored to clinicians, educators, and trainees.

Each topic is grounded in a curated selection of recent publications, accompanied by in-depth summaries that go far beyond traditional abstracts—offering clear, clinically relevant insights without the time burden of reading full articles. These summaries act as gateways to the original literature, helping users identify which articles warrant deeper exploration.

In addition to these detailed reviews, users will find a rich library of supplementary materials: topic overviews, FAQs, glossaries, synthesis sheets, thematic podcasts, fully structured course outlines adaptable for teaching, and ready-to-use PowerPoint slide decks. All resources are open access and formatted for easy integration into academic or clinical training programs.

By providing practical, well-structured content, the platform enables members of the cytogenomics community to efficiently update their knowledge on selected topics. It also offers educational materials that are easily adaptable for instructional use.

Long-Read Sequencing

Overview

Long-read sequencing (LRS) represents a significant advancement in genomic technologies, offering the ability to sequence DNA fragments that are orders of magnitude longer than those produced by traditional short-read sequencing (SRS). This capability has revolutionized genomics by providing a more complete and accurate view of complex genomic regions, addressing limitations inherent in SRS.

There are two primary LRS technologies that currently dominate the field: Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT).

  • Pacific Biosciences (PacBio)
    • Single-Molecule Real-Time (SMRT) Sequencing: This was the first LRS technology to achieve widespread deployment. DNA polymerase is immobilized in the wells of a SMRT Cell chip, and DNA templates are circularized into “SMRTbells” using hairpin adapters. As the polymerase incorporates fluorescently labeled nucleotides, the resulting light emissions are detected in real time.
    • Continuous Long Reads (CLR): Earlier SMRT sequencing versions, like CLR, used very large insert sizes (up to hundreds of kilobase pairs), resulting in the circular template being sequenced only a few times. This led to a lower accuracy (typically 85–92%) compared to Illumina short reads (which can reach 99.9%). CLR reads were not ideal for detecting single nucleotide variants (SNVs) or indels and often required combination with other sequencing technologies.
    • High-Fidelity (HiFi) Reads: A major improvement in SMRT sequencing came with HiFi reads. These reads boast exceptional accuracy (over 99%, often up to 99.9%) for reads over 10 kb. HiFi reads are generated by sequencing smaller SMRTbell inserts (10–30 kb) multiple times (7–12 or more passes) using Circular Consensus Sequencing (CCS) algorithms. The PacBio Revio system, the latest platform, significantly increases HiFi read throughput, delivering up to 360 Gb per day (equivalent to ~1300 human whole genomes per year) with improved accuracy (Q33).
  • Oxford Nanopore Technologies (ONT)
    • Nanopore Sequencing: ONT’s technology is based on detecting changes in electric current as DNA or RNA molecules pass through a protein nanopore embedded in a membrane. A motor protein unwinds the DNA, guiding a single strand through the pore, producing a characteristic “squiggle” signal that is translated into DNA sequence by base-calling algorithms.
    • Read Lengths: ONT can generate reads from 10 kb up to hundreds of kilobases, and even several megabases, outperforming PacBio in this regard. Ultra-long reads (>100 kb) have been crucial for completing the human genome by resolving repetitive regions.
    • Accuracy: Historically, ONT raw read accuracy was around 87–98%. However, recent advancements, such as the Q20+ platform update and Ligation Sequencing Kit V14, allow accuracy to reach over 99%. Accuracy is highly dependent on the base-calling algorithm used.
    • Platforms: ONT offers various platforms differing in flow cell capacity, including the MinION (handheld, portable), GridION (desktop), and PromethION (high-throughput, benchtop, with up to 48 flow cells). The Flongle is available for low-throughput applications, and SmidgION is being developed for mobile devices.
    • Adaptive Sampling: ONT implements adaptive sampling, a computational enrichment technique that allows real-time analysis of sequencing data to dynamically adjust parameters, focusing efforts on regions of interest. If a read is not of interest, it can be ejected from the pore, and a new molecule is selected.
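
The accept-or-eject logic behind adaptive sampling can be illustrated with a deliberately simplified Python sketch. It is a toy under stated assumptions, not ONT's implementation: the function names, the k-mer matching rule, the threshold, and the sequences are all hypothetical, and real workflows align the first part of each read against a reference instead of comparing k-mer sets.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def adaptive_decision(read_prefix, target_kmers, k=5, min_fraction=0.5):
    """Toy accept/eject rule: keep the read if enough of the k-mers in its
    first bases are also found in the region of interest."""
    prefix_kmers = kmers(read_prefix, k)
    if not prefix_kmers:
        return "eject"  # prefix shorter than k: nothing to match
    shared = len(prefix_kmers & target_kmers) / len(prefix_kmers)
    return "sequence" if shared >= min_fraction else "eject"


target = "ACGTTGCAAGGCTTAACCGGTATCGA"   # hypothetical region of interest
target_index = kmers(target, 5)

on_target = "TTGCAAGGCTTAAC"    # prefix taken from inside the target
off_target = "GGGGGGGGGGGGGG"   # prefix unrelated to the target

print(adaptive_decision(on_target, target_index))   # sequence
print(adaptive_decision(off_target, target_index))  # eject
```

The point of the sketch is the timing of the decision: it is made from the beginning of the read while the molecule is still in the pore, so off-target molecules cost only a short prefix of sequencing time.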

 

LRS addresses many limitations of SRS, providing several key advantages:

  • Resolution of Challenging Genomic Regions: SRS struggles with highly repetitive elements (e.g., Short Tandem Repeats (STRs), SINEs, LINEs, segmental duplications, homopolymers) and regions with extreme GC content, often leading to assembly gaps or misalignments. LRS can span these problematic regions, enabling more accurate and contiguous genome assemblies. This has been critical for sequencing previously “intractable” regions like centromeres, telomeres, and acrocentric short arms.
  • Accurate Haplotype Phasing: LRS allows for long-range haplotype phasing, assigning alleles to maternal or paternal chromosomes over intervals spanning hundreds of kilobases to megabases. This is crucial for definitively diagnosing autosomal recessive conditions where two variants need to be confirmed on different alleles (in trans) and for understanding complex structural variations.
  • Direct Detection of Epigenetic Modifications: Unlike SRS, LRS can directly detect epigenetic modifications (e.g., DNA methylation like 5-methylcytosine (5mC) and 5-hydroxymethylcytosine (5hmC), and N6-methyladenosine (m6A) in RNA) without the need for bisulfite conversion or PCR amplification, which can degrade DNA and introduce bias. This is achieved by detecting characteristic changes in polymerase kinetics (PacBio) or electric current signals (ONT).
  • Real-time Data Analysis and Portability: ONT platforms, particularly the MinION, offer real-time data analysis and portability, allowing for rapid results (sometimes within hours) and deployment in field settings. This significantly reduces turnaround time compared to traditional sequencing methods.

The unique capabilities of LRS have led to its application across various fields:

  • De Novo Genome Assembly
    • LRS has simplified genome reconstruction and improved assembly contiguity. De novo assembly is often likened to solving an intricate jigsaw puzzle: nucleotide sequences (reads) are pieced together to construct complete chromosomes.
    • Methods: Two main algorithmic approaches are Overlap Layout Consensus (OLC) and De Bruijn Graph (DBG). OLC is favored for long reads due to its compatibility with their characteristics.
    • Hybrid Assembly: Combining short reads (e.g., Illumina) with long reads (e.g., PacBio, ONT) is a powerful strategy to achieve highly accurate and contiguous genome assemblies without excessive cost. Short reads can correct errors in long reads, or long reads can bridge gaps in short-read assemblies.
    • Hi-C Enhanced Assembly: Hi-C data can be integrated to achieve chromosome-scale contiguity, correcting errors, scaffolding contigs, and enabling phased assemblies.
    • Telomere-to-Telomere (T2T) Assembly: LRS, particularly ultra-long nanopore reads and high-accuracy PacBio HiFi reads, has enabled the first complete, gapless human genome assemblies, including previously unresolved centromeres, telomeres, and acrocentric short arms.
  • Rare Disease Diagnosis
    • LRS is crucial for identifying novel pathogenic mutations, particularly complex structural variants (SVs) like large insertions, deletions, and inversions, and repeat expansions that SRS often misses.
    • Examples include detecting repeat expansions in Fragile X syndrome (FMR1), Huntington’s disease (HTT), and other neurological disorders, as well as SVs in genes like PRKAR1A, EYS, BBS9, CLN6, G6PC, and WRN.
    • LRS can resolve complex genomic regions, such as the highly homologous OPN1LW/OPN1MW gene cluster involved in color vision and the alpha/beta hemoglobin gene clusters, which are challenging for SRS.
    • It also aids in differentiating between protein-coding genes and their pseudogenes (e.g., PKD1, IKBKG).
    • Ultra-rapid sequencing using ONT platforms has demonstrated the ability to diagnose rare genetic diseases in critically ill patients within hours, guiding clinical management.
  • Oncology
    • LRS detects, at base-pair resolution, structural variants in cancer genomes that affect oncogenes and tumor suppressor genes and that SRS often misses.
    • It enables the identification of fusion transcripts, which are crucial for distinguishing benign from malignant conditions and sub-classifying neoplasms.
    • LRS can also detect and screen for epigenetic changes in cancer, such as DNA methylation shifts, which are vital in tumorigenesis and can serve as biomarkers for early detection, monitoring, and prognosis.
  • Infectious Diseases and Microbiota
    • The speed and portability of ONT make it ideal for real-time genomic surveillance of pathogens (e.g., Ebola, SARS-CoV-2, monkeypox virus), enabling rapid identification, management, and tracking of disease spread.
    • LRS contributes to antibiotic resistance research by comprehensively characterizing microbial genomes and identifying novel resistance genes, including plasmid-mediated resistance.
    • It is used to identify drug resistance in M. tuberculosis, supporting personalized treatment in remote settings.
    • LRS, particularly MinION, is also used for rapid analysis of microbiota, such as vaginal microbiota, with potential for broader clinical application.
  • Transplantation
    • LRS allows for highly accurate and comprehensive Human Leukocyte Antigen (HLA) typing, identifying rare and complex HLA alleles crucial for donor-recipient matching and predicting transplant outcomes.
    • It helps unravel the causes of graft-versus-host disease (GVHD) by analyzing gene expression patterns in immune cells.
    • LRS is instrumental in elucidating the role of killer-cell immunoglobulin-like receptor (KIR) genes in immune surveillance and regulation, linking specific KIR genotypes to graft rejection.

Despite its immense potential, LRS faces several challenges for widespread adoption, particularly in routine clinical settings:

  • Accuracy Considerations: Historically, LRS had higher error rates compared to SRS. While recent advancements have significantly improved accuracy (PacBio HiFi, ONT Q20+ chemistry), achieving consistent high-quality indel calls from ONT data, for instance, can still be challenging, often requiring very high coverage.
  • Cost and Throughput: LRS generally has a higher cost per base and can be more expensive to obtain the same coverage compared to SRS. While new systems like PacBio Revio aim to mitigate this, the equipment cost is substantial. LRS throughput can also be sensitive to molecular damage during library preparation.
  • DNA Quality and Library Preparation: Producing long reads largely depends on high-quality, high-molecular-weight (HMW) DNA. Obtaining such DNA from clinical samples (e.g., saliva) can be challenging, as most protocols require blood or other invasive tissue collection, and DNA fragmentation can occur during extraction and handling.
  • Bioinformatics Tools: LRS data is qualitatively different from SRS data, requiring tailored analysis tools and bioinformatics approaches. While many tools have been developed (e.g., for base calling, error correction, assembly, variant calling, phasing), a scientific community consensus on standard algorithms and tools is still needed. The large amount of data generated by LRS also means longer bioinformatics analysis time and more expensive hardware.
  • Clinical Implementation:
    • Validation and Stability: Rigorous validation processes are crucial to ensure the reliability and accuracy of LRS data for clinical decision-making. Platform stability and consistency over extended periods are also concerns.
    • Integration into Workflows: Incorporating LRS into existing clinical workflows requires establishing standardized protocols, quality control measures, and optimizing workflow efficiency.
    • Regulatory Frameworks: Regulatory, accreditation, and credentialing frameworks need review and revision to support LRS clinical applications, especially for point-of-care testing.

LRS technologies are continuously evolving, promising more accurate, comprehensive, and insightful genome assemblies.

  • The Human Pangenome Reference Consortium (HPRC) aims to create a collection of human diploid reference genomes representing the true diversity of human genetics, utilizing the LRS technology and computational framework developed for T2T assemblies. This will provide a foundation for understanding the full range of human genomic diversity.
  • Continuous Advancements: Ongoing research and development are focused on improving read accuracy, throughput, and reducing costs. Algorithm refinements guided by genomic insights are expected to yield more accurate results.
  • Increased Clinical Adoption: As costs decrease and accuracy/throughput increase, LRS is expected to become a mainstream tool in clinical settings. Its ability to provide a comprehensive single diagnostic test, reduce sequential testing, and infer phase simultaneously is highly desirable for personalized medicine. It also holds significant potential for addressing health inequities by enabling genomic testing in logistically challenging and remote locations.

In summary, LRS has emerged as a critical tool, transforming genome assembly and enabling the discovery of complex genetic variations and epigenetic modifications previously hidden from SRS, propelling genomics into a new era of understanding and therapeutic potential.

FAQ

LRS technologies, such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), generate DNA reads typically ranging from 10 kilobases (kb) to 100 kb, with a current record of 2.3 megabases (Mb). In contrast, traditional Short-Read Sequencing (SRS) platforms like Illumina generally produce reads up to 600 bases. This extended read length is crucial for generating accurate and contiguous genome assemblies.

LRS offers significant improvements by allowing better detection of structural variations (SVs) larger than approximately 500 bp. It provides better resolution of highly repetitive or non-unique genomic regions and enables accurate long-range haplotype phasing. Additionally, LRS allows for the direct detection of base modifications (epigenetic marks) from sequencing data without prior chemical treatments like bisulfite conversion, which can degrade DNA.

The two dominant LRS technologies are Pacific Biosciences (PacBio) Single Molecule, Real-Time (SMRT) sequencing and Oxford Nanopore Technologies (ONT) nanopore-based sequencing. These technologies have revolutionized genomics by enabling the coverage of long repetitive regions, closing gaps in existing reference assemblies, and facilitating the characterization of structural variations.

PacBio HiFi sequencing, a mode of SMRT sequencing, involves circularizing DNA fragments (10–30 kb) into “SMRTbell” templates. A DNA polymerase sequences these circular templates multiple times in zero-mode waveguides (ZMWs), generating several “subreads”. These subreads are then computationally combined using a circular consensus sequencing (CCS) algorithm to produce highly accurate consensus reads, achieving over 99% accuracy (up to 99.9%). This high accuracy significantly improves variant discovery and access to complex repetitive DNA regions.
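
Why repeated passes raise accuracy can be seen in a minimal Python sketch that takes a majority vote at each position across several subreads. This is only an illustration of the consensus idea, not PacBio's CCS algorithm: it assumes the subreads are already aligned and of equal length, ignores insertions and deletions, and uses made-up sequences and a hypothetical function name.

```python
from collections import Counter


def column_consensus(subreads):
    """Majority vote at each position across pre-aligned, equal-length subreads."""
    if len({len(s) for s in subreads}) != 1:
        raise ValueError("subreads must be pre-aligned to the same length")
    consensus = []
    for column in zip(*subreads):
        # most_common(1) returns the most frequent base in this column
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


passes = [
    "ACGTTAGC",
    "ACGTAAGC",  # error at position 4
    "ACCTTAGC",  # error at position 2
    "ACGTTAGC",
    "ACGTTAGG",  # error at position 7
]
print(column_consensus(passes))  # ACGTTAGC
```

Because the errors in individual passes fall at different positions, each one is outvoted by the other passes, which is the intuition behind combining many subreads of the same molecule.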

ONT differentiates between conventional long reads (10-100 kb) and ultra-long reads (>100 kb, potentially up to several megabases). While early ONT raw read accuracy was around 87-98%, recent advancements, such as the Q20+ platform update and Ligation Sequencing Kit V14, allow accuracy to reach over 99%.
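
Accuracy figures such as "Q20" are Phred-scaled quality values, defined as Q = -10 · log10(error probability). The short Python sketch below converts between the two scales; the function names are illustrative only.

```python
import math


def phred_to_accuracy(q):
    """Per-base accuracy implied by Phred quality Q, where Q = -10 * log10(p_error)."""
    return 1.0 - 10 ** (-q / 10.0)


def accuracy_to_phred(accuracy):
    """Phred quality implied by a per-base accuracy strictly between 0 and 1."""
    return -10.0 * math.log10(1.0 - accuracy)


for q in (10, 20, 30):
    print(f"Q{q}: {phred_to_accuracy(q):.4%} accuracy")
```

On this scale Q20 corresponds to 99% accuracy (one error per 100 bases) and Q30 to 99.9% (one error per 1,000 bases), so each gain of 10 in Q means a tenfold drop in error rate.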

De novo genome assembly is like assembling an intricate jigsaw puzzle, where reads (nucleotide sequences) are pieced together by identifying overlapping regions to construct complete chromosomes. LRS platforms significantly improve this process, especially for genomes containing extensive repetitive elements and high levels of heterozygosity, which are challenging for short-read methods. Common algorithms used are Overlap Layout Consensus (OLC) and De Bruijn Graph (DBG).

Hybrid assembly combines the strengths of short-read and long-read sequencing data. Short reads offer high accuracy for base-level resolution, while long reads help resolve repetitive regions and complex structural variations. This strategy provides highly accurate and contiguous genome assemblies without incurring the high cost of exclusively long-read sequencing.

LRS can span large genomic alterations like insertions, deletions, inversions, and translocations, which are typically longer than 50 bp. Unlike short reads, LRS can resolve these complex patterns that may overlap or be nested, which is crucial for identifying disease-related variants.
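
When a single long read spans an insertion or deletion, the event appears directly in the read's alignment. The Python sketch below scans a CIGAR string (the alignment summary used in SAM/BAM files) for insertions and deletions of at least 50 bp. It is a simplified illustration, not a structural variant caller: the function name and the example alignment are hypothetical, and real callers also use split and clipped alignments and combine evidence across many reads.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference sequence.
REF_CONSUMING = set("MDN=X")


def large_indels(cigar, ref_start, min_size=50):
    """Yield (type, reference_position, length) for insertions and
    deletions of at least min_size bp within one alignment."""
    pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "ID" and length >= min_size:
            yield ("INS" if op == "I" else "DEL", pos, length)
        if op in REF_CONSUMING:
            pos += length


# Hypothetical alignment: a 2 bp deletion, a 320 bp insertion, a 1500 bp deletion.
cigar = "5000M2D3000M320I4000M1500D2500M"
for event in large_indels(cigar, ref_start=100_000):
    print(event)
```

With this example the 2 bp deletion is ignored as a small indel, while the 320 bp insertion and the 1500 bp deletion are reported with their reference coordinates.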

A key advantage of LRS is its ability to directly detect epigenetic modifications on native genomic DNA, such as DNA methylation (e.g., 5-methylcytosine and N6-methyladenine), without the need for chemical pretreatments like bisulfite deamination. This is achieved by detecting characteristic perturbations in sequencing signals (e.g., altered polymerase kinetics in PacBio or distinct current shifts in ONT nanopores).

LRS significantly enhances long-range haplotype phasing, enabling the assignment of sequencing data to maternally or paternally inherited chromosomes over large genomic intervals (hundreds of kilobases to megabases). This ability is critical for making definitive diagnoses in autosomal recessive conditions by determining if two variants are on the same allele (cis) or different alleles (trans), often without requiring additional parental samples.
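
The cis/trans question can be framed as a simple count over reads that span both variant sites. The Python sketch below is a toy under stated assumptions: alleles have already been called on each read, the function name and read counts are invented for illustration, and real phasing tools work across many sites with statistical models.

```python
from collections import Counter


def phase_two_variants(reads):
    """Classify two heterozygous variants as cis or trans.

    Each read is a pair (allele_at_site_1, allele_at_site_2), where "ref"
    or "alt" is the allele observed on one read spanning both sites.
    """
    counts = Counter(reads)
    cis_support = counts[("alt", "alt")] + counts[("ref", "ref")]
    trans_support = counts[("alt", "ref")] + counts[("ref", "alt")]
    if cis_support == trans_support:
        return "undetermined"
    return "cis" if cis_support > trans_support else "trans"


spanning_reads = (
    [("alt", "ref")] * 9     # haplotype carrying only variant 1
    + [("ref", "alt")] * 8   # haplotype carrying only variant 2
    + [("alt", "alt")] * 1   # a single discordant read, e.g. a sequencing error
)
print(phase_two_variants(spanning_reads))  # trans
```

The key requirement is that individual reads are long enough to cover both sites; when they are, the phase can be read from the patient's own data, which is why parental samples are often unnecessary.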

The T2T Consortium aimed to create a complete, gapless human genome sequence. LRS technologies, particularly the increased accuracy of PacBio HiFi and the ultralong reads of ONT, were instrumental in sequencing previously intractable regions like highly repetitive centromeres, pericentromeric regions, and acrocentric short arms. This effort added nearly 200 Mbp (approximately 8%) of previously hidden sequence to the human genome.

LRS is crucial for identifying novel pathogenic mutations in diseases with previously unknown genetic causes. It is particularly effective for diagnosing short tandem repeat (STR) expansion disorders (e.g., Huntington’s disease, Fragile X syndrome), resolving variants in highly homologous genes and pseudogenes (e.g., HBA1/HBA2, PKD1), and detecting complex structural rearrangements missed by SRS. LRS has also enabled ultra-rapid diagnosis for critically ill patients, guiding immediate clinical management.
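
Sizing a repeat expansion becomes straightforward once one read covers the whole repeat together with its flanking sequence. The Python sketch below counts the longest uninterrupted run of a repeat unit in a read. The read is synthetic and the function name is hypothetical; real analyses must also handle sequencing errors and interruptions within the repeat, which this sketch does not.

```python
import re


def longest_repeat_run(read, unit):
    """Return the largest number of back-to-back copies of `unit` in `read`."""
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    runs = [len(m.group(0)) // len(unit) for m in pattern.finditer(read)]
    return max(runs, default=0)


# Synthetic read: flanking sequence around an uninterrupted run of 42 CAG units.
read = "TTGACCTGGAA" + "CAG" * 42 + "CCGCCACCGCC"
print(longest_repeat_run(read, "CAG"))  # 42
```

A short read that ends inside the repeat can only give a lower bound on the repeat count, whereas a spanning long read gives the full length directly.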

In oncology, LRS detects complex structural variants (SVs) in cancer genomes that are often missed by traditional methods, including copy-balanced SVs and novel SVs. It allows for rapid targeted sequencing of frequently mutated genes in hematological malignancies and provides full coverage of transcript sequences for identifying fusion transcripts, which are key drivers of carcinogenesis. LRS also enables direct detection and screening for epigenetic changes like DNA methylation in cancer.

LRS, especially ONT platforms, enables rapid, in-situ identification of infectious pathogens (e.g., Ebola virus, SARS-CoV-2, monkeypox virus), facilitating real-time genomic surveillance and response. It aids in identifying drug resistance (e.g., in Mycobacterium tuberculosis) and characterizing antibiotic resistance mechanisms through hybrid sequencing. LRS can also be used for high-resolution analysis of microbiota.

LRS has revolutionized Human Leukocyte Antigen (HLA) typing, allowing for highly accurate and comprehensive analysis, including the identification of rare and complex HLA alleles crucial for donor-recipient matching. It also plays a role in understanding Graft versus Host Disease (GVHD) by analyzing gene expression patterns in immune cells and elucidating the role of Killer-cell Immunoglobulin-like Receptor (KIR) genes in immune surveillance and transplant outcomes.

Key challenges include the validation of data accuracy and reliability, ensuring the stability and consistency of sequencing platforms, and establishing standardized protocols and quality control measures. There’s a need for tailored bioinformatics tools and a consensus on standard algorithms. The cost per base can still be high compared to large-scale SRS, and obtaining high-quality, high-molecular-weight DNA for library preparation remains a practical hurdle, especially from non-invasive samples.

PacBio HiFi sequencing achieves 99.9% per-base accuracy, with 35x coverage producing high-quality SNV and indel calls (99.5% recall and precision). The PacBio Revio system can deliver up to 360 Gb of HiFi reads per day. ONT’s Q20+ chemistry provides 99% single-molecule accuracy, requiring 40-50x coverage for high-quality SNV calls. Ultra-rapid sequencing with ONT, achieving diagnoses in less than a day (some as fast as 7 hours), is possible on high-throughput platforms like PromethION.
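
The relationship between sequencing yield and depth of coverage in these figures is simple arithmetic, sketched below in Python. The genome size of 3.1 Gb is an assumption introduced for the example and does not come from the figures above; the function names are illustrative.

```python
def mean_coverage(total_bases_gb, genome_size_gb):
    """Mean depth of coverage = sequenced bases / genome size."""
    return total_bases_gb / genome_size_gb


def yield_needed_gb(target_coverage, genome_size_gb):
    """Sequenced bases (in Gb) needed to reach a target mean coverage."""
    return target_coverage * genome_size_gb


GENOME_SIZE_GB = 3.1   # assumed haploid human genome size, not stated in the text

needed = yield_needed_gb(35, GENOME_SIZE_GB)
print(f"35x coverage needs about {needed:.1f} Gb")
print(f"360 Gb would give about {mean_coverage(360, GENOME_SIZE_GB):.0f}x on one genome")
```

Under that assumption, 35x coverage of one human genome needs roughly 108.5 Gb, so a daily yield of 360 Gb corresponds to about three genomes per day at that depth. Actual yields and usable coverage vary by run and sample.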

The portability of devices like ONT’s MinION can overcome the need for centralized sequencing centers, making genomic testing accessible in logistically challenging locations. This enables faster diagnostic turnaround, reduces reliance on sample shipment, and helps address health inequalities by bringing advanced genomic medicine closer to the patient, even in resource-limited communities.

The HPRC aims to create a comprehensive collection of human diploid reference genomes that reflect the true diversity of human genetics. LRS technologies are crucial for this, building on the framework developed by the T2T consortium to provide complete, gapless assemblies across diverse haplotypes, including previously unresolved repetitive regions. This initiative is expected to revolutionize our understanding of human genetic variation on an unprecedented scale.

LRS is expected to become a mainstream tool and standard of care for genomic analysis in the coming years. Continuous advancements in reducing costs, increasing quality, and improving throughput, coupled with the development of more intuitive bioinformatics platforms, will accelerate its adoption. This will enable more accurate, comprehensive, and insightful genome assemblies, fueling discoveries across various disciplines from biology to medicine and supporting personalized medicine.

Bibliography

Espinosa, E., Bautista, R., Larrosa, R., & Plata, O. (2024). Advancements in long-read genome sequencing technologies and algorithms. Genomics, 116, 110842. Available online 11 April 2024. https://doi.org/10.1016/j.ygeno.2024.110842

Szakállas, N., Barták, B. K., Valcz, G., Nagy, Z. B., Takács, I., & Molnár, B. (2024). Can long-read sequencing tackle the barriers, which the next-generation could not? A review. Pathology & Oncology Research, 30, 1611676. Published 16 May 2024. https://doi.org/10.3389/pore.2024.1611676

Warburton, P. E., & Sebra, R. P. (2023). Long-Read DNA Sequencing: Recent Advances and Remaining Challenges. Annual Review of Genomics and Human Genetics, 24, 109–132. First published as a Review in Advance on April 19, 2023. https://doi.org/10.1146/annurev-genom-101722-103045

Conlin, L. K., Aref-Eshghi, E., McEldrew, D. A., Luo, M., & Rajagopalan, R. (2022). Long-read sequencing for molecular diagnostics in constitutional genetic disorders. Human Mutation, 43(11), 1531–1544. https://doi.org/10.1002/humu.24465

Oehler, J. B., Wright, H., Stark, Z., Mallett, A. J., & Schmitz, U. (2023). The application of long-read sequencing in clinical settings. Human Genomics, 17, 73. Accepted 1 August 2023. https://doi.org/10.1186/s40246-023-00522-3

Introduction to Long-Read Sequencing

The emergence of high-throughput sequencing technologies, collectively known as Next Generation Sequencing (NGS), has revolutionized genomics and transcriptomics by generating vast amounts of data in a single run. Historically, the initial draft of the human genome, completed using Sanger Sequencing, cost an exorbitant $3 billion and took over a decade. In stark contrast, NGS platforms can produce millions or even billions of reads in hours or days. While short-read sequencers, such as Illumina’s NovaSeq and HiSeq, produce reads generally up to 600 bases, LRS technologies routinely generate reads ranging between 10 kb and 100 kb, with a current record of 2.3 Mb. This capability allows for the generation of accurate and contiguous genome assemblies, a significant advantage over short-read sequencing.

Despite the notable achievements of short-read sequencing in de novo assembly, its limitations become apparent when attempting to detect long repetitive structures or large structural variants (SVs). Short-read methods are typically limited to identifying rearrangements or insertions/deletions no larger than approximately 500 bp, making larger insertions particularly challenging. LRS excels in this area because its reads can span longer repetitive or problematic regions, and high-accuracy modes such as PacBio HiFi now reach an accuracy of up to 99.9%.

Long-Read Sequencing Technologies: PacBio and ONT

The two leading LRS technologies are Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT). Both have simplified genome reconstruction and significantly improved assembly contiguity. Both were historically held back by limited yield, high error rates, and a high cost per base, but substantial progress in recent years has mitigated these issues, leading to improved accuracy and overall performance.

  • PacBio’s Single Molecule Real-Time (SMRT) Sequencing was the first LRS technology to be widely adopted. It offers two main modes:
    • Continuous Long Reads (CLR): These reads are generated from DNA inserts larger than 30 kb. While CLR reads can be up to 100 kb long, their lower accuracy (typically 85–92%) made them unsuitable for detecting single nucleotide variants (SNVs) or small insertions/deletions (indels) on their own, often requiring combination with other sequencing technologies.
    • High-Fidelity (HiFi) Reads (Circular Consensus Sequencing or CCS): This represents a major advancement, offering exceptional accuracy (over 99%) for reads over 10 kb. HiFi reads are generated from smaller inserts (10–30 kb) that allow the polymerase to pass through the DNA template multiple times. The CCS algorithm merges these multiple subreads to achieve high consensus accuracy, often reported up to 99.9%. HiFi data has had a significant impact on improving variant discovery, reducing assembly costs, and providing access to complex repetitive DNA regions, including human centromeres. The new Revio system further enhances this, reducing run time to 24 hours and supporting four high-density SMRT cells in parallel, leading to a 15x increase in HiFi read throughput (up to 360 Gbases of HiFi reads per day).
  • Oxford Nanopore Technologies (ONT) is distinguished by its ability to generate reads of hundreds to thousands of kilobases in length, often outperforming PacBio in maximum read length. ONT differentiates between conventional long reads (10–100 kb) and ultra-long reads (over 100 kb), which can extend up to several megabases. Recent advancements, like the Q20+ platform update and Ligation Sequencing Kit V14, have pushed ONT’s long-read accuracy to over 99%. ONT reads are highly valuable for spanning complex genomic regions like structural variants and repeats, and for enhancing haplotype phasing. ONT offers various platforms, including the portable MinION, the medium-throughput GridION, and the high-throughput PromethION, which can yield 50–200 Gbases of data. The handheld MinION is already established for portable DNA sequencing, and an even smaller device, SmidgION, is being developed for smartphones.

De Novo Genome Assembly and Strategies

De novo whole-genome assembly is a crucial strategy that involves meticulously piecing together nucleotide sequences (reads) to construct complete chromosomes, akin to solving an intricate jigsaw puzzle. A key advantage is that it avoids biases arising from evolutionary or genetic differences between a reference genome and the sequenced genome. However, this process demands significant time and computational resources, especially for complex eukaryotic genomes. Challenges arise from significant heterozygosity, highly repetitive elements (e.g., LINEs, SINEs, STRs), and polyploid genomes, which can lead to fragmented or misassembled outputs. Notably, LRS platforms demonstrate improved capabilities in assembling genomes containing extensive repetitive elements and high levels of heterozygosity.

Two widely recognized methods for genome assembly are the De Bruijn Graph (DBG) approach and the Overlap-Layout-Consensus (OLC) method.

  • DBG involves breaking down input sequences into k-mers to identify overlaps, constructing a graph where nodes are overlapping regions and k-mers form edges. While scalable, sequencing errors can introduce erroneous k-mers, increasing graph complexity and memory footprint.
  • OLC consists of three steps: overlap identification, layout construction (graph representation), and consensus sequence inference. Its main bottleneck is the all-versus-all pairwise alignment during overlap identification. A significant advantage of OLC is its ability to resolve repetitive regions shorter than the read length. OLC-based assemblers are generally favored for long reads due to their compatibility with the distinct characteristics of these reads.
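
The two approaches can be contrasted in a minimal Python sketch. It shows only the first step of each method (graph construction for DBG, overlap identification for OLC) on three tiny made-up reads; the function names are illustrative, matching is exact, and production assemblers use indexing and error-tolerant alignment instead.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Build a De Bruijn graph: each k-mer is an edge from its
    (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`
    (0 if no exact overlap of at least min_len exists)."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


reads = ["ACGTAC", "GTACGG", "ACGGTT"]

# DBG view: the k-mers of every read become edges between (k-1)-mers.
print(de_bruijn_edges(reads, k=4))

# OLC view: all-versus-all comparison of reads, the costly overlap step.
for a in reads:
    for b in reads:
        if a is not b:
            print(a, "->", b, suffix_prefix_overlap(a, b))
```

The nested loop makes the cost of the OLC overlap step visible: every read is compared with every other read, so the work grows quadratically with the number of reads unless indexing shortcuts are used.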

Several assembly strategies are employed with long reads:

  • Long Read-Only Assembly utilizes only LRS data. Prominent OLC-based assemblers include Canu (with its HiCanu iteration for HiFi reads) and Hifiasm (specifically for HiFi reads, preserving haplotype contiguity). DBG-based assemblers like Shasta are notable for ONT data, offering faster and cheaper human-scale genome assembly. Verkko is an advanced assembler that combines ONT and HiFi reads to address complex repetitive regions, often achieving phased, diploid assemblies with telomere-to-telomere chromosomes with high accuracy (e.g., 99.9997% for the HG002 human genome).
  • Hybrid Assembly combines the strengths of short and long reads, offering a powerful strategy for comprehensive and cost-effective genome reconstructions. Approaches include direct mapping of long reads to a DBG from short reads, de novo assembly of long reads followed by short-read error correction, short-read correction of long reads before joint assembly, or independent short-read assembly followed by long-read linkage to bridge gaps.
  • Hi-C Enhanced Assembly is crucial because LRS alone cannot always achieve chromosome-scale contiguity. Hi-C data improves assemblies by ordering and orienting contigs, correcting misassemblies, linking contigs to chromosomes, and generating phased assemblies. The Telomere-to-Telomere (T2T) consortium leveraged ultra-long nanopore reads, HiFi sequencing, and Hi-C data to achieve the complete sequencing of the human X chromosome from telomere to telomere.

Error Correction and Assembly Polishing

To counter the historical inaccuracies of LRS, various error correction methods have been developed. The choice of method depends on the sequencing sample, study type, and selected software. These methods broadly fall into:

  • Hybrid Error Correction: Leverages highly accurate short-read data to correct errors in long reads, using techniques like alignment of short reads to long reads, De Bruijn Graph exploration, contig generation and alignment, or Hidden Markov Models.
  • Self-Correction: Exploits the redundancy within long-read data itself, primarily through multiple sequence alignment of long reads to derive a consensus sequence or by employing De Bruijn Graphs to anchor and correct regions.
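The idea behind self-correction, deriving a consensus from redundant reads, can be sketched as a column-wise majority vote. This is a simplified illustration that assumes the reads have already been aligned and padded to equal length with "-" for gaps; the function name `majority_consensus` is hypothetical, and real correctors build the alignment themselves and weight votes by quality.

```python
from collections import Counter


def majority_consensus(aligned_reads: list[str]) -> str:
    """Column-wise majority vote over pre-aligned, equal-length reads ('-' = gap)."""
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be pre-aligned to the same length")
    consensus = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":  # a majority gap means the column is dropped
            consensus.append(base)
    return "".join(consensus)


# Five noisy copies of the same region; each error is outvoted by the others.
reads = ["ACGTTGCA", "ACGATGCA", "ACGTTGCA", "AC-TTGCA", "ACGTTGGA"]
print(majority_consensus(reads))  # ACGTTGCA
```

Because random errors rarely hit the same column in most reads, the vote removes them; systematic errors that recur in every read are not removed this way.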

Assembly polishing is a subsequent refinement step that aims to eliminate artifacts and enhance the accuracy of the assembled sequence. Tools like Pilon are effective with paired-end Illumina data, while Racon and Medaka are widely adopted for Nanopore-based polishing, with Medaka specifically designed to correct systematic errors that Racon might miss. Homopolish is an innovative approach for ONT data that rectifies indel errors in homopolymers by leveraging homologous sequences from related genomes.

Assessing Assembly Quality

A thorough evaluation of de novo assembly quality is paramount, focusing on three critical dimensions:

  • Contiguity: Evaluates the size and number of contigs. The most common metric is N50: with contigs sorted from longest to shortest, it is the length of the shortest contig in the smallest set that together covers at least 50% of the total assembly length.
  • Completeness: Determines the content of the contigs, especially gene content. BUSCO (Benchmarking Universal Single-Copy Orthologs), which assesses the presence of highly conserved genes, is commonly used, with a score above 95% considered good.
  • Correctness: Refers to the accuracy of each base pair and the correct order and location of contigs. This is often assessed by comparing against a gold standard reference, using tools like QUAST.
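The N50 contiguity metric can be computed directly from a list of contig lengths. The sketch below is a minimal illustration with a hypothetical function name `n50`: it sorts contigs from longest to shortest and returns the length at which the running total first reaches half of the total assembly length.

```python
def n50(contig_lengths: list[int]) -> int:
    """Length of the contig at which the cumulative sum of the sorted
    (longest-first) contig lengths first reaches half the assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


# Total = 290; the two longest contigs (80 + 70 = 150) already cover half.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```

When the denominator is an estimated genome size rather than the assembly's own total length, the same calculation is usually reported as NG50.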

Future Perspectives

LRS technologies have profoundly transformed genome assembly. The future of LRS promises further algorithm refinements, guided by genomic insights, leading to more accurate and comprehensive results. The synergy of advanced assemblers, robust benchmarking, and cutting-edge sequencing technologies is expected to drive the field toward even more insightful genome assemblies, continuously unraveling the genome’s complexities.

Evolution of Sequencing Technologies and NGS Limitations

The evolution of sequencing techniques is broadly categorized into first-generation (FGS), next-generation (NGS), and third-generation (TGS) sequencing. FGS, exemplified by the Sanger and Maxam-Gilbert methods, laid the groundwork but was limited to short fragments (a few hundred base pairs) and small genomes, which highlighted the need for more advanced methods for complex DNA and RNA regions. NGS technologies, such as pyrosequencing and sequencing-by-synthesis, improved upon FGS with better error rates and more sophisticated results. However, NGS platforms, commonly referred to as short-read techniques due to their limited read lengths (typically up to 600 bases, but generally around 75–250 bp), present several inherent weaknesses.

These limitations include:

  • Inability to study full-length transcript variants, centromere and telomere genomic regions, and gene fusions.
  • Difficulty in resolving repetitive regions of the genome, making genetic variations like repeat expansion disorders and structural variants (SVs) challenging to identify. This is because a short read cannot be uniquely mapped if the overlapping repeat element is longer than its length. More than half of the human genome consists of repetitive elements like Short Tandem Repeats (STR), SINE, LINE, and segmental duplications (Low Copy Repeats – LCR).
  • Challenges in characterizing regions with extreme guanine-cytosine (GC) content or multiple homologous elements.
  • Difficulty in detecting epigenetically modified bases, because the essential PCR amplification step erases native base modifications and also adds cost and time.
  • General inability to determine nucleotide-level breakpoints of copy number variants (CNVs) and differentiate between tandem, dispersed, and inverted duplications.
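The mapping problem in repetitive regions can be made concrete with a toy example. The sketch below uses exact string matching as a stand-in for read alignment, which is a simplifying assumption, and all sequences and the function name `mapping_positions` are invented for illustration. A read shorter than the repeat matches at many positions, while a read that spans the repeat and its unique flanks matches at exactly one.

```python
def mapping_positions(reference: str, read: str) -> list[int]:
    """All start positions where `read` occurs exactly in `reference`."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


repeat = "CAG" * 10                        # 30 bp tandem repeat
reference = "TTGACCA" + repeat + "GGATCCT"  # unique flanks on both sides

short_read = "CAG" * 4                     # 12 bp, shorter than the repeat
long_read = "CCA" + repeat + "GGA"         # spans the repeat plus both flanks

print(len(mapping_positions(reference, short_read)))  # 7 candidate placements
print(mapping_positions(reference, long_read))        # [4], a unique placement
```

The same logic explains why repeat copy number cannot be read off from short reads alone: none of them observes both ends of the repeat at once.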

These shortcomings spurred the development of TGS, also known as LRS, to overcome these barriers.

Long-Read Sequencing Technologies: PacBio and ONT

LRS technologies, primarily Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), have significantly simplified genome reconstruction and improved assembly contiguity. They offer several key advantages over short-read sequencing:

  • Generation of reads orders of magnitude longer, ranging from 10 kilobases (kb) up to several megabases, with a current record of 2.3 Mb.
  • Real-time data analysis, which means sequencing data is analyzed as it is generated.
  • No requirement for PCR amplification steps before sequencing, eliminating PCR-related biases and allowing for direct detection of native base modifications.
  • More precise mapping of reads to reference genomes and improved variant detection.

PacBio’s Single Molecule Real-Time (SMRT) Sequencing was the first LRS technology to be widely adopted. It involves immobilizing DNA polymerase in wells (ZMWs) on a chip and detecting light emission as fluorescently labeled nucleotides are incorporated into circularized DNA templates called SMRTbells. Early CLR (Continuous Long Reads) had lower accuracy (85–92%). However, the introduction of High-Fidelity (HiFi) reads (Circular Consensus Sequencing – CCS) marked a major improvement, offering exceptional accuracy (over 99%, often up to 99.9%) for reads over 10 kb. This is achieved by having the polymerase pass through the SMRTbell template multiple times, allowing the CCS algorithm to merge subreads for high consensus accuracy. PacBio’s Revio system further enhances HiFi throughput (up to 360 Gbases per day) with improved accuracy (Q33) and methylation calling capability.

Oxford Nanopore Technologies (ONT) distinguishes itself by generating reads of hundreds to thousands of kilobases, often outperforming PacBio in maximum read length. ONT’s technology is based on detecting changes in electric current as single-stranded DNA (or RNA) molecules pass through nanopores embedded in a membrane. The deflections in electric current are distinct for each nucleotide, creating unique signatures. Recent advancements, such as the Q20+ platform update and Ligation Sequencing Kit V14, have pushed ONT’s long-read accuracy to over 99%. ONT offers various platforms, from the portable MinION to the high-throughput PromethION, with data yields ranging from 2.8 Gb (Flongle) to 50-200 Gbases (PromethION). Ultra-long reads (over 100 kb) from ONT have been crucial for completing the human genome and resolving highly repetitive regions.
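The accuracy percentages and Q-values quoted in this section (99%, 99.9%, Q20+, Q33) are two views of the same quantity, linked by the standard Phred scale Q = -10 · log10(error probability). The short sketch below converts between them; the function names are hypothetical.

```python
import math


def accuracy_to_phred(accuracy: float) -> float:
    """Convert per-base accuracy (0 <= accuracy < 1) to a Phred quality score."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


def phred_to_accuracy(q: float) -> float:
    """Convert a Phred quality score to per-base accuracy."""
    return 1.0 - 10.0 ** (-q / 10.0)


for acc in (0.85, 0.99, 0.999):
    print(f"accuracy {acc:.3f} -> Q{accuracy_to_phred(acc):.1f}")
print(f"Q33 -> accuracy {phred_to_accuracy(33):.4%}")
```

Each step of 10 on the Q scale is a tenfold drop in error rate, so Q20 corresponds to one error in 100 bases and Q30 to one error in 1,000.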

Technical Advances and Challenges in LRS

Historically, LRS faced limitations such as limited yield, high error rates, and high cost per base. While early base-calling accuracy was around 85%, it has now reached almost 99% for SMRT and 95% for ONT. The quality of SMRT reads increases with the number of passes the polymerase makes over the DNA fragment, reaching 99% with multiple passes. ONT read quality, however, depends on the rate at which each base is ratcheted through the nanopore, with a median single-pass accuracy of around 95%.

Error Correction and Assembly are vital for LRS data. Hybrid error correction uses high-accuracy short-read data to correct errors in long reads, while self-correction exploits redundancy within long-read data itself. Assembly polishing, a refinement step, improves accuracy and eliminates artifacts in the assembled sequence, with tools like Racon and Medaka widely used for Nanopore data.

De Novo Genome Assembly and the T2T Consortium

De novo whole-genome assembly involves piecing together nucleotide sequences to construct complete chromosomes without a reference. A major advantage is avoiding biases from evolutionary or genetic differences between a reference and the sequenced genome. However, it is computationally intensive, especially for complex eukaryotic genomes, and faces challenges with heterozygosity, repetitive elements, and polyploidy. LRS platforms demonstrate improved capabilities in assembling such complex genomes.

The Telomere-to-Telomere (T2T) consortium showcased the power of LRS by achieving the first complete, gapless human genome sequence (T2T-CHM13 assembly), addressing previously intractable regions. This monumental achievement combined ultra-long nanopore reads, high-fidelity (HiFi) sequencing, and Hi-C data. The T2T-CHM13 assembly produced a complete human haplotype of roughly 3 billion base pairs, contributing to the recognition of almost 4,000 new genes and including gapless telomere-to-telomere assemblies for all 22 human autosomes and chromosome X. It also filled in about 151 Mbp of previously unknown sequence.

Key areas where LRS tackles previous barriers in assembly and variant detection include:

  • Structural Variants (SVs): LRS is most suitable for detecting SVs, especially large genomic alterations like insertions, deletions, inversions, and translocations, which are typically longer than 50 bp and hard to identify with short-read approaches.
  • Repetitive Regions: LRS can span long repetitive or problematic regions, boasting an accuracy of 99.9% in this regard. The T2T consortium successfully sequenced challenging satellite arrays and acrocentric short arms, which were largely missing from previous reference genomes.
  • Haplotype Phasing: LRS enables accurate long-range haplotype phasing, assigning sequence data to maternally or paternally inherited chromosomes. This is crucial for resolving inconclusive diagnoses in autosomal recessive conditions and identifying complex rearrangements.
  • Homologous Genes and Pseudogenes: LRS can differentiate variants in highly homologous genes or those with pseudogenes, areas where short-read methods struggle due to mappability issues.

Bioinformatics of Long Reads

The rise of LRS necessitated a new generation of bioinformatics tools to handle their unique features, particularly increased read lengths. These tools are critical for base calling, detection of base modifications, variant calling, and genome assembly.

  • Base Calling: LRS platforms translate specific electric signals (ONT) or fluorescent light pulses (PacBio) directly into nucleic acid sequences. Many TGS base callers can execute alignment in parallel with base identification.
  • Epigenetic Modifications: LRS allows for the direct detection of modified bases (e.g., 6mA, 5mC, 5hmC) without the need for bisulfite conversion or PCR amplification, which were limitations for NGS. ONT detects these based on signal shifts, while PacBio uses altered kinetics of base-pair incorporation.
  • Variant Calling: LRS variant callers are built upon de novo assembly, short-read alignment, or long-read mapping approaches. They show overall better performance for spanning repetitive and problematic regions.
  • Genome Assembly: LRS has simplified genome reconstruction and improved assembly contiguity. Techniques like scaffolding using genetic markers, optical maps, or linked reads are crucial for resolving fragmented assemblies.

Clinical Applications of LRS

LRS has profound implications across various clinical fields:

  • Rare Disease Diagnosis: LRS helps identify novel pathogenic mutations in Mendelian conditions that remain undiagnosed by NGS. It is particularly effective for short tandem repeat (STR) expansion disorders (e.g., Huntington’s disease, Fragile X syndrome), providing unbiased sizing and sequence determination of neuropathogenic STR sites in a single targeted assay. LRS can resolve the underlying cause of rare genetic conditions like Gitelman syndrome by identifying previously missed intronic variants.
  • Oncology: LRS is proving valuable for detecting complex structural variants (SVs) in cancer genomes, including copy-balanced SVs and novel SVs missed by short-read approaches. It can identify fusion transcripts, which are major drivers of carcinogenesis, enabling rapid diagnosis and guiding therapy selection. LRS also supports multilayer analysis of the transcriptome and epigenome, allowing direct detection of DNA methylation landscapes in tumors for classification and prognostic markers.
  • Infectious Diseases: The speed and portability of ONT platforms make them ideal for rapid, in situ diagnosis and genomic surveillance of pathogens, as demonstrated during the Ebola outbreak and COVID-19 pandemic. This enables quicker identification of disease sources, monitoring of viral evolution, and tracking transmission chains. LRS also aids in antibiotic resistance research and personalized treatment of diseases like tuberculosis.
  • Transplantation: LRS has revolutionized Human Leukocyte Antigen (HLA) typing, allowing highly accurate and comprehensive identification of rare and complex HLA alleles, which are critical for donor-recipient matching and predicting graft rejection. It also helps unravel the causes of graft-versus-host disease (GVHD) and the role of killer-cell immunoglobulin-like receptor (KIR) genes.
  • Accessibility and Portability: Portable LRS devices like MinION address health inequalities by enabling genomic testing in logistically challenging or resource-limited settings, reducing reliance on centralized sequencing centers and sample shipment.

Despite these advancements, challenges remain, including the need for high-quality, high-molecular-weight DNA input, higher per-base costs compared to short reads, ongoing refinement of bioinformatics tools, and standardization for routine clinical adoption. Nevertheless, LRS holds the promise to become a mainstream tool in clinical settings, enabling widespread implementation of precision medicine.

Overview and Limitations of Short-Read Sequencing

DNA sequencing has revolutionized medicine, with the human reference genome (GRCh38, also called hg38) serving as a fundamental resource for biomedical interpretation and disease understanding. However, the current hg38 build remains fragmented, containing approximately 151 megabase pairs (Mbp) of unknown sequence or gaps, particularly in pericentromeric and subtelomeric regions, acrocentric short arms, and large repeat arrays. Short-read sequencing, characterized by read lengths of typically 100–300 base pairs (bp), has been the dominant technology due to its widespread use and affordability. Its short read length, however, fundamentally limits the resolution of genomic regions not uniquely spanned by overlapping reads, especially in low-complexity repetitive loci, duplicated regions, tandem arrays, and complex structural variants (SVs), which constitute a significant portion of these gaps. This inability to resolve large structural variations and repetitive DNA has impeded a full understanding of human genomes. Long-read sequencing has emerged as a crucial tool to circumvent these limitations, offering the ability to sequence DNA fragments orders of magnitude longer than SRS, thereby facilitating the resolution of previously underannotated genomic characteristics.

Long-Read Sequencing Technologies

The two principal LRS technologies are single-molecule real-time (SMRT) sequencing, primarily developed by PacBio, and nanopore-based sequencing, from Oxford Nanopore Technologies (ONT).

  • PacBio SMRT Sequencing: This technology involves sequencing by synthesis, utilizing SMRTbell templates (circularized double-stranded DNA with hairpin adapters) distributed into zero-mode waveguides (ZMWs), each containing an immobilized DNA polymerase. Earlier versions, known as Continuous Long Reads (CLRs), had lower accuracy (85–92%) due to fewer passes over very large inserts (up to hundreds of kilobase pairs). A significant advancement was the introduction of high-fidelity (HiFi) sequencing, which produces 10–20 kbp reads with exceptional accuracy (approaching 99.99% or Q40). This high accuracy is achieved by selecting smaller fragments (15–20 kbp), allowing the polymerase to make multiple passes (7–12 or more) around the circular template. These multiple subreads are then computationally combined to generate a highly accurate circular consensus sequence, enabling intramolecular error correction. PacBio HiFi reads have significantly improved structural variant discovery and the deciphering of complex repetitive regions, such as centromeres and acrocentric short arms.
  • Oxford Nanopore Technologies (ONT) Nanopore-based LRS: This method employs a linear double-stranded DNA template, which can extend to several megabase pairs. A motor protein unwinds the DNA, and a single strand passes through a nanopore, causing characteristic disruptions in an electric current that are translated into DNA sequence by computational base-calling algorithms. Standard nanopore sequencing typically yields reads tens to hundreds of kilobase pairs in length with an accuracy of approximately 87–98%, though ongoing improvements in base-calling algorithms and chemistry are pushing accuracy to 99% or higher. Notably, ONT also generates ultralong reads (generally >100 kbp, sometimes several Mbp), which are particularly valuable as genome scaffolds for assembling large, challenging-to-sequence regions like the Y chromosome centromeric region.

Hybrid approaches also combine these technologies, such as using nanopores for electrophoretic loading of ZMW arrays to enhance SMRT sequencing efficiency, or nanopore-coupled DNA polymerases for real-time electronic DNA sequencing.

Improvements in Human Genome Assemblies and Variant Discovery

LRS has been instrumental in enhancing genome resolution and comprehending the full spectrum of genetic variation. It has allowed the detection of a wide array of SVs, including insertions, deletions, inversions, and duplications, often associated with repetitive elements like LINEs and SINEs, which were previously difficult or impossible to identify with SRS. LRS simplifies the analysis and annotation of these variants, making their identification feasible in individual genomes, particularly in regions with repetitive DNA or segmental duplications. This capability is critical for discovering rare protective or pathogenic SVs.

Complementary technologies further enhance LRS-based assemblies:

  • Hi-C technology: Provides long-distance and haplotypic data by cross-linking, digesting, and ligating protein/DNA complexes in close nuclear proximity, aiding in contig linking and assembly scaffolding.
  • Optical genome mapping: Captures high-molecular-weight DNA in microfluidic chambers for sequence-specific nicking and fluorescent labeling, offering long-range readouts of large genome fragments for SV detection and chromosomal aberration identification, and anchoring LRS/SRS assemblies.

Targeted LRS Approaches

While whole-genome LRS is still expensive for routine clinical use, targeted enrichment strategies enhance its utility by increasing coverage and enabling localized assembly or phasing of specific genomic variants at a reduced cost. Methods include:

  • PCR amplification of targeted regions to generate SMRTbell libraries, useful for phasing double mutations or haplotyping. However, PCR can introduce biases and is limited by target duplication and size.
  • Cas9-assisted targeting of chromosome segments (CATCH): An amplification-free method that uses Cas9-targeted fragmentation and separation of high-molecular-weight DNA to capture regions spanning hundreds of kilobase pairs. It’s particularly valuable for unstable short repeat copy expansions.
  • Adaptive sampling: A computational enrichment technique where LRS selectively continues sequencing DNA molecules that align to a predefined region in real-time, ejecting unwanted molecules. This has led to the identification of previously undetected pathogenic variants.
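The accept-or-eject decision at the heart of adaptive sampling can be sketched as follows. This is a conceptual illustration only: it replaces real-time base calling and alignment with a simple k-mer lookup, and the names `kmers` and `adaptive_sampling_decision`, the sequences, and the thresholds are all assumptions made for the example. It is not the interface of any sequencing software.

```python
def kmers(seq: str, k: int) -> set[str]:
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def adaptive_sampling_decision(
    read_prefix: str,
    target_kmers: set[str],
    k: int = 8,
    min_fraction: float = 0.5,
) -> str:
    """Return 'continue' if enough k-mers of the first bases of a read
    are found in the target region, otherwise 'eject'."""
    prefix_kmers = kmers(read_prefix, k)
    if not prefix_kmers:
        return "eject"  # prefix too short to evaluate
    hit_fraction = len(prefix_kmers & target_kmers) / len(prefix_kmers)
    return "continue" if hit_fraction >= min_fraction else "eject"


target_region = "ACGTACGGTCAGTTAGCCATGGATCCGTAACGTTAGC"  # hypothetical target
index = kmers(target_region, 8)

on_target_prefix = target_region[5:25]
off_target_prefix = "TTTTTTTTTTTTTTTTTTTT"

print(adaptive_sampling_decision(on_target_prefix, index))   # continue
print(adaptive_sampling_decision(off_target_prefix, index))  # eject
```

The decision has to be made from only the first part of each molecule, which is why the sketch looks at a prefix rather than the whole read.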

These targeted protocols significantly reduce per-sample costs, making LRS more viable for specific genetic diagnostics.

Epigenomics Applications

LRS offers a distinct advantage in detecting epigenetic modifications on native genomic DNA, such as cytosine or adenine methylation, without the need for bisulfite deamination, which can fragment DNA.

  • SMRT LRS detects methylated nucleotides by observing changes in polymerase kinetics (increased interpulse duration and pulse width).
  • Nanopore LRS decodes modified bases from characteristic deviations in the electrical “squiggle” signal using trained algorithms.

Emerging LRS-based chromatin-profiling methods (e.g., Fiber-seq, SAMOSA, nanoNOMe) further allow mapping and phasing of chromatin organization across kilobase-length genomic regions, including repetitive DNA, providing insights into gene regulation and disease states.

Pathogenic Variant Discovery

LRS is uniquely suited to reveal structural variants and repeat expansions that are challenging for SRS. It has enabled the discovery of rare, disease-causing SVs, such as large deletions in genes like EYS and BBS9/RP9, and the precise identification of long insertions of previously unmapped sequences. LRS also excels in HLA typing, spanning entire HLA class I genes for accurate haplotyping in this complex region. Crucially, LRS can span entire repeat expansions, which are a hallmark of many neurodegenerative diseases (e.g., C9orf72, spinocerebellar ataxia, neuronal intranuclear inclusion disease, myotonic dystrophy type 1). The ability to identify these causative genetic variants in rare undiagnosed diseases is rapidly growing, showcasing the diagnostic power of LRS.
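Because a long read can contain a whole repeat expansion, the number of repeat units can in principle be counted directly within one read. The sketch below is a simplified illustration that assumes an error-free read and exact motif matches; the function name `longest_repeat_run` and the sequences are invented for the example, and dedicated tools use error-tolerant alignment instead.

```python
import re


def longest_repeat_run(read: str, motif: str) -> int:
    """Number of motif copies in the longest uninterrupted tandem run."""
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(read)]
    return max(runs, default=0)


# A read spanning the full repeat plus unique flanking sequence on both sides.
spanning_read = "GATTACA" + "CAG" * 45 + "CCGCCA"
print(longest_repeat_run(spanning_read, "CAG"))  # 45

# An interruption (CAA) splits the tract into two separate runs.
interrupted = "CAG" * 10 + "CAA" + "CAG" * 5
print(longest_repeat_run(interrupted, "CAG"))  # 10
```

Reporting interruptions separately, as in the second case, matters because the text above notes that LRS reveals the internal sequence of a repeat and not only its length.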

The Telomere-to-Telomere Human Reference Genome

A landmark achievement demonstrating the power of LRS is the first complete telomere-to-telomere (T2T) sequence of several human chromosomes, followed by a complete human genome (CHM13). This project leveraged a hybrid approach, combining highly accurate PacBio HiFi circular consensus sequencing reads with ultralong nanopore reads to span and resolve highly homologous tandem arrays that previous technologies struggled with. The CHM13 assembly incorporated 100x SRS data, 30x HiFi LRS data, 120x ultralong nanopore reads, optical genome mapping, Hi-C data, and single-strand sequencing, revealing approximately 200 Mbp (8%) of previously hidden sequence. This included the full span of megabase-sized centromeric and pericentromeric repeat arrays and the highly homologous short arms of acrocentric chromosomes, which were previously represented as gaps in the hg38 reference. A key discovery was the centromere dip region (CDR), a hypomethylated segment within active alpha satellite arrays at centromeres, identified by ultralong nanopore reads, which correlates with the location of CENP-A chromatin, an epigenetic mark critical for centromere function.

Future Perspectives and Challenges

The success of the T2T project paves the way for the Human Pangenome Reference Consortium (HPRC), which aims to create a collection of human diploid reference genomes representing the true diversity of human genetics. This ambitious project will require efficient, haplotype-aware assembly of diploid genomes, especially in complex repetitive regions, and relies on graph-based approaches.

Despite its immense potential, several challenges remain for the routine clinical adoption of LRS, including:

  • Cost: While decreasing, LRS is still more expensive per gigabase compared to high-throughput SRS platforms.
  • Accuracy: While HiFi and newer ONT chemistries offer high accuracy, consistent high-quality indel calls from ONT data, for instance, still present challenges.
  • Throughput: Though improving, variable throughput and sensitivity to molecular damage during library preparation (requiring high-quality, unnicked, high-molecular-weight DNA) are considerations.
  • Bioinformatics: LRS data is qualitatively different, requiring tailored analysis tools and a scientific community consensus on standard algorithms and tools, which is still developing.
  • Integration into clinical workflows: Practical implementation and integration into existing clinical workflows, including sample collection methods, remain areas for optimization.

However, recent innovations, such as PacBio’s Revio system (increasing HiFi throughput by 12.5 times) and ONT’s Kit 14 chemistry (improving accuracy to 99% or higher), are continuously addressing these limitations, driven by increasing demand and inter-platform competition. As costs continue to fall and capabilities improve, LRS is poised to become a mainstream tool in clinical settings, enabling more accurate, comprehensive, and insightful genome assemblies and accelerating personalized medicine.

Despite its existence for over a decade, LRS has faced slow clinical adoption due to historical perceptions of high error rates and prohibitive costs. However, recent technological advancements have dramatically improved accuracy and reduced expenses, indicating a growing role for LRS in routine diagnostics. Its key advantages are superior detection of structural variations (SVs), improved resolution of highly repetitive or non-unique genomic regions, accurate long-range haplotype phasing, and direct detection of base modifications.

Limitations of Current Diagnostic Technologies

While current diagnostic approaches like chromosomal microarray analysis (CMA) and SRS (often referred to as Next Generation Sequencing or NGS) have significantly advanced genetic diagnoses, they possess fundamental limitations. CMA, though a first-tier test for many congenital disorders and excellent for detecting submicroscopic deletions and duplications over 50 kilobase pairs (kb), suffers from uneven probe distribution that limits resolution and prevents nucleotide-level breakpoint determination. It also cannot detect balanced rearrangements or precisely locate and orient duplicated material.

SRS, despite its low cost and high per-base accuracy, struggles with complex genomic regions due to its short read lengths (typically 75–250 base pairs (bp)). This inherent limitation makes it difficult to unambiguously map reads in highly repetitive areas, such as short tandem repeats (STRs), interspersed repeat elements (SINE, LINE, Alu), and segmental duplications (LCRs), which collectively comprise over half of the human genome. Consequently, thousands of exons in clinically relevant genes are inaccessible or highly challenging for SRS, leading to inflated coverage metrics from ambiguously aligned reads and complications in SV detection. Although specialized short-read tools exist for specific variant types (e.g., ExpansionHunter for STRs, Cyrius for pseudogene discrimination), they do not fully overcome these fundamental challenges. Furthermore, SRS often necessitates parental samples for accurate variant phasing, which can pose ethical and practical obstacles to equitable patient care. The fragmented nature of current diagnostic workflows, often requiring multiple orthogonal assays, increases costs, complexity, and the time required to achieve a definitive diagnosis.

Long-Read Sequencing Technologies

LRS technologies, often termed third-generation single-molecule sequencing, produce reads orders of magnitude longer than SRS, ranging from 10 kb to several megabases. The two primary platforms are Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT).

  • PacBio HiFi Sequencing (High-Fidelity): This technology utilizes double-stranded, high-molecular-weight DNA templates that are circularized. Multiple sequencing passes around these circular templates enable the generation of highly accurate consensus long reads (HiFi reads) by correcting the stochastic errors of individual subreads. The PacBio Sequel II platform produces HiFi reads of 10–25 kb with >99.9% per-base accuracy, comparable to Illumina’s SRS. The newer Revio system has significantly boosted throughput, capable of delivering up to 360 Gb of HiFi reads per day, and enhances methylation calling capabilities.
  • Oxford Nanopore Technologies (ONT): This method involves passing single-molecule DNA or RNA through a protein nanopore, detecting characteristic changes in electric current that correspond to the nucleotide sequence. ONT sequencing has no theoretical upper limit to read length, with researchers achieving megabase-long reads. Ultralong sequencing kits allow consistent production of reads with N50 > 50 kb. While earlier ONT versions had higher error rates, recent advancements like the Q20+ chemistry (Kit 14) and R10.4.1 flowcells have improved accuracy to over 99%. ONT also offers advantages in real-time data analysis and portability, with devices like MinION, GridION, and PromethION. Nanopore technology can directly detect base modifications from the electrical “squiggle” signal, eliminating the need for bisulfite treatment and providing crucial epigenetic insights.

Applications of Long-Read Sequencing in Molecular Diagnostics

LRS has proven invaluable in achieving molecular diagnoses for genetic disorders where traditional assays have failed. Its applications span two broad categories: the identification of challenging pathogenic variants and the resolution of haplotype phasing.

  1. Identifying Challenging Variants:
    • Repeat Expansions: LRS can span the entire length of expanded repeats, which are causative in numerous neurodegenerative diseases, providing comprehensive information about their full length and internal interruptions. Early PacBio SMRT sequencing successfully assayed trinucleotide repeats in FMR1 (Fragile X syndrome), and this capability has extended to other genes like ATXN10, C9orf72, DMPK, and HTT. ONT’s adaptive sequencing protocols also accurately detect repeat expansions, simultaneously offering haplotype-resolved sizing and methylation profiling.
    • Homologous Genes and Pseudogenes: LRS effectively addresses the challenge of differentiating variants in highly homologous genomic regions. For example, it can resolve complex deletions and rearrangements in the alpha and beta hemoglobin gene clusters (HBA and HBB)—regions notoriously difficult for SRS due to mappability issues and common structural polymorphisms. Similarly, LRS can distinguish variants within the PKD1 gene from its six pseudogenes, and IKBKG from IKBKGP1, where SRS methods typically yield low sensitivity and high false-positive rates. While older PacBio platforms had some limitations, current HiFi reads and advanced algorithms are expected to resolve these regions with significantly higher sensitivity and specificity.
    • Complex Structural Rearrangements: LRS excels in detecting SVs that are “cryptic” or undetectable by SRS. Examples include L1-mediated insertions in CDKL5 and intricate chromothriptic events involving multiple chromosomes, which were uniquely identified using PacBio sequencing. Nanopore sequencing has also been shown to detect numerous SVs that short-read methods miss in chromothriptic cases.
  1. Long-Range Phasing: A significant advantage of LRS is its ability to infer haplotype phase with high accuracy over extended genomic intervals (hundreds of kilobases to megabases). This is critical for diagnosing autosomal recessive conditions where two heterozygous disease-causing variants must be confirmed to reside on different alleles (in trans). LRS can resolve variant phase within a single read, or through haplotype phasing for larger genes, thereby eliminating the often-required, and sometimes unobtainable, parental samples. Clinical case vignettes demonstrate this utility:
    • In a patient with sensorineural hearing loss, PacBio LRS detected both a missense variant and a 13kb deletion in TRIOBP and definitively showed them to be in cis, rendering them non-diagnostic. This crucial phasing information was previously unresolvable by SRS without parental testing.
    • For another hearing loss patient, LRS elucidated complex findings in STRC, including variants associated with the homologous pseudogene STRCP1. LRS unambiguously mapped reads to the protein-coding gene, identified all variants, and, importantly, confirmed the pathogenic variants were in trans, providing a definitive diagnosis of a gene conversion event without the need for familial testing.
    • In a patient with Glycogen Storage Disease, LRS identified a 7.1 kb deletion in G6PC alongside a single nucleotide variant, enabling precise phase resolution and a diagnosis that had been inconclusive with exome sequencing.
  1. Emerging Applications:
    • Epigenetic Modifications: LRS enables the direct detection of epigenetic modifications, such as cytosine or adenine methylation, on native genomic DNA, bypassing the need for bisulfite treatment which can fragment DNA. PacBio detects methylation via changes in polymerase kinetics, while Nanopore decodes modified bases from characteristic electrical signal deviations. This direct detection offers higher specificity and mapping uniformity.
    • RNA Sequencing: LRS facilitates accurate detection of full-length transcripts and isoforms, revealing novel reading frames, allele-specific expression, and variations in post-transcriptional RNA modifications that are challenging to reconstruct from short reads. This is particularly valuable for identifying chimeric transcripts and gene fusions in cancer, which are recognized as major drivers of carcinogenesis.
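
The cis/trans determination described under Long-Range Phasing above can be illustrated with a minimal sketch. It assumes each long read has already been reduced to the allele it carries at each of two heterozygous sites; the function name, the site labels, and the simple vote-counting rule are illustrative assumptions, not the method used in the cited cases.

```python
def classify_phase(reads, site_a, site_b):
    """Classify two heterozygous variants as 'cis', 'trans', or 'undetermined'.

    reads: list of dicts mapping a site label to 'ref' or 'alt' for every
           site the read covers (hypothetical, pre-processed input).
    """
    same = 0      # read carries alt/alt or ref/ref: supports cis
    opposite = 0  # read carries exactly one alt allele: supports trans
    for read in reads:
        if site_a not in read or site_b not in read:
            continue  # read does not span both sites, so it is uninformative
        if read[site_a] == read[site_b]:
            same += 1
        else:
            opposite += 1
    if same == opposite:
        return "undetermined"
    return "cis" if same > opposite else "trans"


# Hypothetical reads spanning a missense variant ("snv") and a deletion ("del")
cis_reads = [
    {"snv": "alt", "del": "alt"},
    {"snv": "ref", "del": "ref"},
    {"snv": "alt", "del": "alt"},
]
trans_reads = [
    {"snv": "alt", "del": "ref"},
    {"snv": "ref", "del": "alt"},
    {"snv": "alt"},  # covers only one site and is ignored
]
print(classify_phase(cis_reads, "snv", "del"))
print(classify_phase(trans_reads, "snv", "del"))
```

Only reads that span both sites contribute, which is why read length determines the distance over which phase can be resolved directly.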

Challenges and Future Outlook

Despite the clear advantages, several challenges hinder the widespread routine clinical adoption of LRS. The production of high-quality, high-molecular-weight DNA, essential for long reads, can be difficult in clinical settings and from non-invasive samples like saliva. Cost, accuracy, and throughput remain significant considerations. While PacBio HiFi boasts high accuracy (99.9%), and ONT’s Q20+ chemistry approaches 99%, achieving high-quality indel calls from ONT data still demands high coverage. Per-gigabase cost for LRS can still be significantly higher than for SRS in whole-genome sequencing. Furthermore, LRS data is qualitatively distinct from SRS data and requires specialized bioinformatics tools, yet the scientific community has not reached consensus on standardized algorithms. Current LRS workflows can also be lengthy: a full PacBio Sequel II sequencing run, for instance, can take 11 days.

However, continuous innovation is addressing these limitations. ONT’s portable MinION, for example, allows for ultra-rapid sequencing in critical care settings, achieving molecular diagnoses in less than a day, sometimes as quickly as seven hours. While currently more costly, the clinical urgency in such situations often justifies the expense. Hybrid approaches, combining the strengths of LRS and SRS, offer improved accuracy, read length, and cost-effectiveness, proving especially beneficial for de novo genome assembly and analysis of complex genomes.

The authors conclude that LRS holds immense potential to become the most comprehensive diagnostic platform available, capable of overcoming the current paradigm of sequential testing. It promises to reduce diagnostic uncertainty by enabling comprehensive variant detection and simultaneous phase inference, thereby accelerating the time to a definitive diagnosis. As costs continue to fall and both data quality and throughput improve, LRS is widely anticipated to transition into a mainstream tool in clinical settings, thereby greatly advancing personalized medicine.

While SRS, particularly Illumina’s platforms, has been the gold standard for genetic profiling due to its high accuracy and cost-effectiveness, its short read lengths (typically 150 bp) inherently limit its capacity to reconstruct complete genome assemblies, identify complex structural variants (SVs), sequence repetitive regions, or accurately phase alleles. Furthermore, SRS often requires PCR amplification, which can introduce artifacts and prevents the direct detection of native base modifications. These shortcomings motivated the development of LRS, also known as third-generation sequencing (TGS), which generates reads tens of thousands of bases long, operates in real-time, and typically avoids PCR amplification, thereby enabling direct detection of base modifications.

Long-Read Sequencing Technologies

The LRS market is largely dominated by two major technologies: Pacific Biosciences (PacBio) single-molecule real-time (SMRT) sequencing and Oxford Nanopore Technologies (ONT) nanopore sequencing.

  • PacBio SMRT Sequencing: PacBio’s platforms (Sequel and Revio) employ a system where DNA, circularized with hairpin adapters, binds to polymerases. Labeled bases fluoresce upon incorporation, detected by a zero-mode waveguide and a camera. While earlier SMRT versions had an error rate of 1–5%, PacBio’s HiFi (High-Fidelity) reads have significantly improved accuracy (0.1–0.5% error rate, reaching about Q30 at the lower end of that range), comparable to Illumina sequencing, by enabling multiple passes around the circularized DNA to generate a consensus sequence. The latest Revio system boosts throughput to 360 Gb of HiFi reads per day, equivalent to ~1300 human whole genomes per year, and enhances methylation calling capabilities.
  • Oxford Nanopore Technologies (ONT): Nanopore sequencing, conceptualized in the 1990s, involves passing single-molecule polynucleotides through a protein nanopore embedded in a membrane. A motor protein guides the DNA, which unwinds and passes through the pore, disrupting an electric current. Each k-mer of consecutive nucleotides creates a unique disruption in the current, allowing sequence determination. ONT offers devices like MinION, GridION, and PromethION. Recent innovations include adaptive sampling, a computational technique that uses real-time data analysis to focus sequencing on regions of interest, improving efficiency and coverage. ONT also enables direct identification of base modifications (e.g., 5-methylcytosine, N6-methyladenine) by detecting differences in the electrical “squiggle” signal. The V14 chemistry with R10.4.1 pores significantly reduces error rates and improves sequencing accuracy to Q20+.
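
The Revio throughput figure quoted above can be sanity-checked with simple arithmetic. The sketch below assumes a 3.1 Gb human genome, 30× coverage per genome, and an instrument that runs every day of the year; these values, the `utilization` parameter, and the function name are illustrative assumptions rather than figures from the summarized articles.

```python
def genomes_per_year(gb_per_day, genome_size_gb=3.1, coverage=30,
                     days_per_year=365, utilization=1.0):
    """Estimate how many whole genomes a given daily yield supports per year.

    All defaults are assumptions chosen for illustration.
    """
    gb_per_genome = genome_size_gb * coverage  # sequence needed for one genome
    return gb_per_day * days_per_year * utilization / gb_per_genome


estimate = genomes_per_year(360)
print(f"about {estimate:.0f} genomes per year at full utilization")
```

Under these assumptions the estimate comes to roughly 1,400 genomes per year, the same order as the ~1300 quoted; instrument utilization somewhat below 100% would account for the difference.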

LRS was recognized as “Method of the Year 2022” by Nature Methods due to its ability to read genomes, transcriptomes, and epigenomes, opening major opportunities in medical and biological research.

Clinical Applications of LRS

  1. Rare Disease Diagnosis:
    • LRS is crucial for identifying underlying genetic causes and inheritance patterns in rare diseases, where about 50% of suspected Mendelian conditions remain undiagnosed by traditional methods.
    • It excels in detecting short tandem repeat (STR) expansion disorders, which are prevalent among neurological diseases such as Huntington’s disease and fragile X syndrome. A pilot study using nanopore sequencing demonstrated unbiased sizing and sequence determination of all known neuropathogenic STR sites in a single assay.
    • LRS has identified novel pathogenic mutations, such as a 2.2-kb deletion in the PRKAR1A gene causing Carney complex.
    • It has successfully diagnosed conditions like color vision disorders, resolving variants in the highly homologous OPN1LW/OPN1MW gene cluster that SRS struggles with.
    • In Gitelman syndrome, LRS identified a second pathogenic intronic variant in 67% of previously partially diagnosed patients, significantly increasing diagnostic yield.
    • Ultra-rapid nanopore whole-genome sequencing workflows have achieved diagnoses in critically ill patients in an average of eight hours, sometimes as quickly as 5 hours and 2 minutes, guiding critical clinical decision-making.
  1. Oncology:
    • LRS addresses limitations of traditional cancer screening methods (e.g., copy number profiling, karyotyping) by offering better detection of copy-balanced SVs and base-pair resolution for complex SVs that SRS might miss.
    • Studies have used LRS to analyze SVs in tumor suppressor genes like CDKN2A and SMAD4 in pancreatic cancer cell lines, detecting translocations, inversions, and deletions.
    • LRS precisely identified complex rearrangements and oncogene amplifications (e.g., ERBB2) in breast cancer cell lines.
    • A 1-day workflow for CNS tumor diagnosis using MinION technology successfully identified relevant alterations like 1p/19q codeletion and amplifications of cancer-related genes (EGFR, PDGFRA, CDK4).
    • Portable MinION sequencers provide rapid diagnosis (24h) for hematological malignancies (e.g., chronic lymphocytic leukemia, acute myeloid leukemia), detecting SVs in genes like TP53 and ABL1 at reduced costs (approx. USD 200).
    • In fusion transcriptomics, LRS allows full coverage of transcript isoforms. MinION identified BCR-ABL1 fusion transcripts within 15 minutes of sequencing, crucial for rapid acute leukemia treatment decisions.
    • LRS enables detection and screening for epigenetic changes (e.g., DNA methylation shifts) in cancer. Nanopore sequencing can directly detect cytosine methylation without bisulfite treatment, which is prone to DNA degradation and PCR bias. LRS has been used for methylation-based classification of brain tumors, showing potential for clinical implementation.
  1. Infectious Diseases and Microbiota:
    • The speed of ONT platforms makes LRS attractive for in-situ diagnosis of infectious pathogens, enabling rapid response for identification and management of disease sources and spread.
    • Real-time genomic surveillance of pathogens was critical during the Ebola virus outbreak, clarifying evolution patterns and transmission chains.
    • LRS was employed during the COVID-19 pandemic for SARS-CoV-2 genomic surveillance, strengthening responses by enabling point-of-care WGS and fast turnaround, especially in remote settings.
    • It is valuable for monitoring the monkeypox virus (MPXV) outbreak, facilitating sequencing of samples with lower viral loads that are challenging for Illumina platforms.
    • LRS expedites malaria elimination efforts by enabling early examination of parasite genotypes for drug resistance and unraveling disease dynamics at the vector level.
    • It contributes to tuberculosis treatment and diagnosis by cost-effectively identifying drug resistance in M. tuberculosis and enabling phylogenetic reconstruction.
    • Hybrid long-read and short-read sequencing approaches leverage the benefits of both technologies for comprehensive and accurate characterization of microbial genomes, including antibiotic resistance mechanisms and plasmid-mediated resistance genes.
    • LRS helps address antimicrobial resistant (AMR) sexually transmitted infections (STIs) by enabling rapid diagnosis and identification of AMR strains.
    • LRS is promising for microbiota analysis, with examples like high-resolution and rapid differentiation of vaginal microbiota using MinION.
  1. Transplantation:
    • LRS revolutionized human leukocyte antigen (HLA) typing, allowing identification of rare and complex HLA alleles previously difficult to detect with SRS. This aids in donor-recipient matching and personalizing transplantation strategies.
    • It is instrumental in unraveling causes of graft-versus-host disease (GVHD) by performing transcriptome analysis of immune cells, identifying gene expression patterns associated with GVHD development and severity.
    • LRS has elucidated the role of killer-cell immunoglobulin-like receptor (KIR) genes in transplantation, revealing associations between specific KIR genotypes and graft rejection.

Addressing Health Needs

Portable LRS technologies, like MinION, overcome the need for high capital investment and centralized sequencing infrastructure, reducing sample shipment times. This allows for genomic surveillance in resource-limited and remote communities, addressing health inequalities by providing rapid results (under 24 hours, sometimes as quickly as 13–60 minutes). In critical care settings, LRS identifies pathogens in patients with serious infections with high accuracy and timeliness, even when traditional methods fail.

Challenges and Future Outlook

Despite its advantages, several challenges impede widespread clinical adoption of LRS:

  • Data Validation: Rigorous validation processes are needed to ensure the accuracy and reliability of LRS data for clinical decision-making, as some limitations in detecting certain variant types persist.
  • Platform Stability and Reliability: While LRS platforms show high initial accuracy, their long-term stability and consistency in performance need to be ensured.
  • Standardization: There is a crucial need for standardized protocols, quality control measures, and scientific consensus on algorithms and tools for LRS data analysis.
  • Cost: Although the cost per base for ONT MinION has dropped, it can still be higher than that of large Illumina platforms. The initial capital investment for some LRS platforms (e.g., PacBio) remains significant.
  • DNA Quality and Extraction: Producing high-quality, high-molecular-weight DNA, essential for long reads, can be challenging in clinical settings and from non-invasive samples like saliva. DNA fragmentation during extraction and handling is a particular concern in remote centers.
  • Bioinformatics Complexity: LRS data is qualitatively different from SRS, requiring specialized bioinformatics tools, which are still under continuous development.

The authors conclude that LRS holds immense potential to become the most comprehensive diagnostic platform, capable of overcoming the current sequential testing paradigm. It promises to reduce diagnostic uncertainty by enabling comprehensive variant detection and simultaneous phase inference, thereby accelerating definitive diagnoses. Continuous advancements in cost reduction, quality, and throughput are expected to accelerate LRS into mainstream clinical use, greatly advancing personalized medicine.

Summary sheet

Long-Read Sequencing (LRS), also known as third-generation sequencing (TGS), represents a transformative leap in genetic analysis, addressing significant limitations of conventional short-read sequencing (SRS). While SRS typically produces reads of 100-300 base pairs, excelling in accuracy and cost-effectiveness for single nucleotide variants (SNVs) and small insertions/deletions (indels), it struggles with complex genomic regions. These challenges include difficulty in characterizing large structural variants (SVs), resolving highly repetitive or homologous regions (like centromeres and telomeres), accurately phasing alleles, and directly detecting native base modifications. LRS overcomes these by generating significantly longer reads, ranging from tens to hundreds of kilobases, and even megabases, enabling more comprehensive and accurate genome reconstructions.

The LRS market is dominated by two primary platforms:

  • Pacific Biosciences (PacBio) Single Molecule Real-Time (SMRT) Sequencing: This technology initially offered Continuous Long Reads (CLR) with lower accuracy (85-92%). A major advancement came with HiFi (high-fidelity) reads, which achieve exceptional accuracy (>99%, or up to 99.9%) by repeatedly sequencing smaller (10-30kb) circular DNA molecules (SMRTbells) to generate a consensus sequence. The latest Revio system significantly boosts throughput, capable of producing up to 360 Gigabases per day, making whole human genome sequencing more accessible.
  • Oxford Nanopore Technologies (ONT) Nanopore Sequencing: This method measures characteristic disruptions in an electric current as DNA or RNA molecules pass through a protein nanopore. ONT offers reads from 10-100 kilobases, with “ultra-long reads” exceeding 100 kilobases and occasionally reaching several megabases. While initial accuracy was lower (around 87-98%), recent improvements in chemistry (V14) and pore design (R10.4.1) have increased it to over 99% (Q20+). ONT platforms range from portable MinION to high-throughput PromethION systems. A key innovation is adaptive sampling, which allows real-time targeting of specific genomic regions. Nanopore sequencing also uniquely enables direct detection of base modifications (e.g., DNA methylation) without chemical pretreatments.
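
The accuracy percentages and Q scores quoted for both platforms are two expressions of the same quantity: a Phred quality score Q corresponds to a per-base error probability of 10^(-Q/10). A minimal conversion sketch follows; the function names are illustrative.

```python
import math


def accuracy_to_phred(accuracy):
    """Convert per-base accuracy (0 <= accuracy < 1) to a Phred Q score."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in the range [0, 1)")
    error = 1.0 - accuracy
    return -10.0 * math.log10(error)


def phred_to_accuracy(q):
    """Convert a Phred Q score to per-base accuracy."""
    return 1.0 - 10 ** (-q / 10.0)


for acc in (0.87, 0.99, 0.999):
    print(f"{acc:.1%} accuracy is about Q{accuracy_to_phred(acc):.1f}")
```

On this scale 99% accuracy corresponds to Q20 and 99.9% to Q30, while the 87% lower bound quoted for early nanopore reads falls below Q9.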

Clinical Applications

LRS has been recognized as “Method of the Year 2022” by Nature Methods due to its profound utility. Its expanding clinical applications include:

  • Rare Disease Diagnosis: LRS is crucial for identifying genetic causes of rare diseases, especially those involving short tandem repeat (STR) expansions (e.g., Huntington’s disease, Fragile X syndrome), complex SVs, and conditions like Gitelman syndrome or color vision deficiencies that are difficult for SRS to resolve. Its rapid capabilities, especially with nanopore, have enabled ultra-rapid diagnoses for critically ill patients in hours.
  • Oncology: LRS improves the detection of SVs (including copy-balanced and complex rearrangements) and fusion transcripts in cancer genomes, which can aid in diagnosis, prognosis, and treatment selection. It also allows for the analysis of epigenetic changes, such as DNA methylation patterns, for tumor classification.
  • Infectious Diseases & Microbiome: The speed of ONT platforms makes LRS ideal for rapid, in situ diagnosis and genomic surveillance of pathogens (e.g., Ebola, COVID-19, Monkeypox), facilitating real-time response and tracking of drug resistance. It is also valuable for microbiota analysis.
  • Transplantation: LRS has revolutionized Human Leukocyte Antigen (HLA) typing, enabling high-resolution identification of rare and complex alleles, improving donor-recipient matching, and helping understand graft-versus-host disease (GVHD) and KIR gene roles.

Challenges and Future Perspectives

Despite its advantages, the widespread clinical adoption of LRS faces several hurdles:

  • Cost: While decreasing, the cost per base can still be higher than SRS, especially for whole-genome sequencing.
  • DNA Quality and Sample Preparation: LRS requires high-quality, high-molecular-weight DNA, which can be challenging to extract, particularly from non-invasive sources or in remote settings.
  • Accuracy and Validation: Although accuracy is improving, rigorous validation is essential for clinical decision-making, and some limitations in detecting specific variant types persist.
  • Bioinformatics Complexity and Standardization: LRS data requires specialized analysis tools that are continuously evolving, and a consensus on standard algorithms and protocols is still needed for consistent clinical use.

Nevertheless, LRS is poised to become a cornerstone of personalized medicine. Its capacity for comprehensive variant detection, simultaneous phase inference, and direct epigenetic profiling will likely reduce diagnostic uncertainty and accelerate definitive diagnoses. Ongoing advancements in cost reduction, quality, and throughput are expected to accelerate its integration into routine clinical practice, significantly enhancing our understanding of human genetic diversity and disease. The ambition of initiatives like the Human Pangenome Reference Consortium (HPRC) to create comprehensive, telomere-to-telomere assemblies of diverse human genomes further underscores the critical role LRS will play in the future of genomics and medicine.

PowerPoint Slides

Slide 1: The Genomic Revolution: From Sanger to NGS

  • Early sequencing methods, like Sanger Sequencing, were costly and time-consuming, with the initial draft of the human genome taking over a decade and costing approximately $3 billion.
  • Next-Generation Sequencing (NGS) platforms revolutionized the field by generating vast amounts of data more cost-effectively and efficiently, producing millions or billions of reads in hours or days.
  • Short-Read Sequencing (SRS) platforms, such as Illumina, became widely adopted due to their high accuracy (typically >99%) and ability to sequence tens of thousands of genomes annually.
  • Despite these advancements, SRS has inherent limitations that prompted the development of newer technologies capable of reading longer genomic sequences.

Slide 2: Short-Read Sequencing (SRS): Strengths & Limitations

  • SRS technologies like Illumina produce reads that are typically 150 bp long and at most 600 bases.
  • Strengths include high accuracy and cost-effectiveness for detecting single nucleotide substitutions and small insertions/deletions (≤50 bp).
  • Major limitations arise in detecting long repetitive structures or large structural variants (SVs), which are often larger than ~500 bp.
  • SRS struggles with characterizing full-length transcript variants, centromere and telomere regions, gene fusions, and epigenetically modified bases.
  • PCR amplification, often essential for SRS, can introduce biases and artifacts, and makes direct detection of native base modifications impossible.

Slide 3: The Emergence of Long-Read Sequencing (LRS)

  • Long-Read Sequencing (LRS), also known as Third-Generation Sequencing (TGS), emerged to overcome the limitations of SRS technologies.
  • LRS technologies, led by Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), routinely generate DNA reads typically ranging between 10 kilobases (kb) and 100 kb, with a current record of 2.3 Mb.
  • A key advantage of LRS is its ability to generate sequence reads tens of thousands of bases in length, facilitating accurate and contiguous genome assemblies.
  • The LRS process often occurs in real-time and without the need for PCR amplification, which reduces PCR-related biases and allows for direct detection of native base modifications.

Slide 4: Why LRS? Overcoming SRS Limitations

  • LRS significantly enhances de novo genome assembly, especially for genomes with extensive repetitive elements and high heterozygosity, which are challenging for SRS.
  • It provides improved capabilities for covering long repetitive regions and closing gaps in existing reference assemblies, which are often intractable with short reads.
  • LRS facilitates the characterization and detection of large structural variations (SVs), typically >50 bp, that are difficult to identify with short-read methods, especially when they overlap or are nested.
  • LRS enables accurate long-range haplotype phasing, which is crucial for assigning sequencing data to maternally or paternally inherited chromosomes over large genomic intervals.
  • A unique advantage is the direct detection of chemical modifications on native genomic DNA, such as methylation, without the need for chemical pretreatments.

Slide 5: Major LRS Platforms: PacBio & ONT

  • The LRS landscape is primarily defined by two major technologies: Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT).
  • Both platforms have undergone continuous improvements to enhance accuracy, throughput, and portability, while striving to reduce costs.
  • These technologies have revolutionized genomics by enabling the coverage of long repetitive regions, closing gaps in reference assemblies, and characterizing structural variations.
  • The choice between PacBio and ONT depends on specific application requirements, such as the need for high accuracy versus rapid turnaround time and portability.

Slide 6: PacBio SMRT Sequencing: An Overview

  • Pacific Biosciences’ SMRT (Single Molecule, Real-Time) sequencing technology was the first LRS to achieve widespread deployment.
  • It uses high-molecular-weight double-stranded DNA that is size-selected and constructed into circular “SMRTbell” template libraries by ligating single-stranded closed hairpin adapters to their ends.
  • Sequencing occurs in Zero-Mode Waveguides (ZMWs), which are wells containing an immobilized DNA polymerase.
  • Early versions (Continuous Long Reads or CLR) provided typical lengths of 5–60 kb (up to 100 kb), but had a lower accuracy (85–92%), often requiring combination with other technologies for variant detection.

Slide 7: PacBio HiFi Sequencing: Achieving High Accuracy

  • PacBio HiFi (High-Fidelity) sequencing represents a major improvement to SMRT technology, providing exceptional accuracy.
  • It utilizes Circular Consensus Sequencing (CCS) mode, where smaller DNA inserts (typically 10–30 kb) are sequenced multiple times.
  • This multi-pass sequencing allows for intramolecular error correction, resulting in per-base accuracy of over 99% and up to 99.9%.
  • HiFi reads significantly improve variant discovery and reduce assembly costs, offering access to complex repetitive DNA regions, including human centromeres.
  • The new PacBio Revio system further mitigates previous limitations, reducing run time to 24 hours and supporting four high-density SMRT cells in parallel, for a 15x increase in HiFi read throughput (up to 90 Gbase/SMRT Cell or 360 Gbases/day).
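
Why repeated passes raise accuracy can be seen in a toy model. The sketch below assumes each pass miscalls a given base independently with the same probability and that the consensus is a simple majority vote that fails whenever more than half of the passes are wrong. This is a deliberate simplification for illustration, not PacBio's actual CCS algorithm, and the function name is hypothetical.

```python
from math import comb


def majority_error(per_pass_error, passes):
    """Probability that a majority of independent passes miscall a base."""
    if passes % 2 == 0:
        raise ValueError("use an odd number of passes to avoid ties")
    p = per_pass_error
    needed = passes // 2 + 1  # wrong calls required to outvote correct ones
    return sum(
        comb(passes, k) * p**k * (1 - p) ** (passes - k)
        for k in range(needed, passes + 1)
    )


for n in (1, 3, 5, 9):
    print(n, "passes:", f"{majority_error(0.10, n):.5f}")
```

With a 10% per-pass error rate, which lies within the CLR range quoted on the previous slide, three passes give a modeled consensus error of 2.8% and five passes about 0.86%.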

Slide 8: Oxford Nanopore Technologies (ONT): Principles

  • ONT’s method is based on detecting changes in electric current as single-stranded DNA (ssDNA) or RNA molecules pass through protein nanopores.
  • A motor protein guides ssDNA molecules through nanochannels (pores), and the unique disturbances in electric current for each base or k-mer allow for their identification.
  • The entire process occurs in real-time within device-specific flow cells containing thousands of nanopore channels.
  • ONT offers direct sequencing of DNA and RNA, meaning PCR amplification steps are generally not required before sequencing, preserving native nucleic acid strands.

Slide 9: ONT Read Types: Long & Ultra-Long

  • ONT differentiates two main types of data: conventional long reads and ultra-long reads.
  • Conventional long reads typically range from 10 kb to 100 kb. With recent updates (e.g., Q20+ platform, Ligation Sequencing Kit V14, V14 chemistry, R10.4.1 pore), accuracy reaches over 99%.
  • Ultra-long reads (>100 kb) can extend up to several megabases in length, with accuracy similar to conventional long reads.
  • These ultra-long reads were crucial for completing the human genome by enabling the resolution of repetitive regions previously intractable with other technologies.

Slide 10: ONT Platform Versatility: MinION to PromethION

  • ONT offers a range of devices catering to different throughput needs, all utilizing the same core nanopore sensing technology.
  • MinION is a handheld, portable DNA sequencing device, making genomic testing accessible in logistically challenging and resource-limited locations, enabling rapid diagnostic turnaround.
  • GridION offers medium throughput, sharing the same flow cell type as MinION (512 nanopore channels).
  • PromethION is a high-throughput benchtop device, utilizing different flow cells with more channels (2675 nanopores) to achieve significantly higher yields (50-200 Gbases of reads).
  • Flongle is an adapter for low-throughput applications, compatible with MinION and GridION, providing cost-effective and rapid tests with 126 channels.

Slide 11: LRS Technology Evolution & Improvements

  • Significant progress in recent years has mitigated historical limitations of LRS, such as limited yield, high error rates, and high cost per base.
  • Both PacBio and ONT continuously develop new chemistries, software, and hardware updates to improve sequencing precision and reliability.
  • Advances in base-calling algorithms, often leveraging deep learning and neural networks, have substantially improved per-read accuracy for both platforms.
  • Innovations like PacBio’s Revio system and ONT’s Q20+ chemistry (V14) demonstrate a continuous drive toward higher accuracy (99% or higher) and increased throughput.
  • These improvements open new opportunities for LRS to address complex genomic regions and gain insights into diseases and biological processes.

Slide 12: De Novo Genome Assembly: A Jigsaw Puzzle Solved

  • De novo genome assembly is the process of reconstructing a complete genome sequence by piecing together sequencing reads based on identifying overlapping regions, akin to solving a jigsaw puzzle.
  • A key advantage of de novo assembly is that it avoids biases arising from evolutionary differences or genetic diversity inherent when mapping against an existing reference genome.
  • LRS platforms, like PacBio and ONT, demonstrate improved capabilities in assembling genomes containing extensive repetitive elements and high levels of heterozygosity.
  • Long reads can span problematic regions and resolve ambiguities, leading to more contiguous assemblies with reduced gaps and misassemblies compared to SRS.
  • Prominent methods for long-read assembly include Overlap Layout Consensus (OLC) and De Bruijn Graph (DBG) approaches, with OLC often favored for LRS characteristics.
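
The jigsaw analogy and the overlap step of OLC can be made concrete with a deliberately small sketch. It uses exact suffix-prefix matching and a greedy merge of the best-overlapping pair, and it assumes error-free reads from one strand with no read contained in another; production assemblers handle sequencing errors, reverse complements, and repeats very differently. All names are illustrative.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a matching a prefix of b (>= min_len), else 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest exact overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical reads drawn from the toy sequence ATGCGTACGTTAGC
contigs = greedy_assemble(["ATGCGTAC", "GTACGTTA", "GTTAGC"])
print(contigs)
```

The longer the reads, the more likely an overlap extends past a repeat into unique sequence, which is the intuition behind the contiguity gains described on this slide.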

Slide 13: Hybrid Assembly: Combining Strengths for Better Genomes

  • Hybrid assembly is a powerful approach that combines the strengths of short-read (SRS) and long-read (LRS) sequencing data.
  • SRS offers high accuracy for base-level resolution, while LRS helps resolve repetitive regions and complex structural variations.
  • This integration leads to highly accurate and contiguous genome assemblies with reduced gaps and misassemblies, often at a lower cost than LRS-only strategies for the same coverage.
  • Approaches include mapping long reads onto a DBG constructed from short reads, error correcting long reads with short reads before assembly, or using long reads to bridge contigs generated from short reads.
  • Hybrid solutions, such as the combination of ONT and PacBio reads, were successfully applied in projects like the Telomere-to-Telomere (T2T) Consortium, demonstrating their beneficial applications.

Slide 14: Resolving Structural Variants (SVs) with LRS

  • Structural variants (SVs) are large genomic alterations, typically longer than 50 bp, encompassing insertions, deletions, inversions, and translocations.
  • LRS is particularly effective at detecting SVs because its long reads can span these large alterations, which are challenging to identify with short-read methods.
  • Short-read sequencing struggles with SVs due to their length and complexity, especially when they overlap or are nested.
  • LRS has revealed a wide variety of previously underannotated genomic characteristics, including rare pathogenic SVs and repeat expansions.
  • Examples include detecting large deletions in genes like EYS and BBS9, insertions of unmapped sequences, and precise breakpoints of chromosomal translocations.
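
When a long read spans a large deletion or insertion, the event appears directly in the read's alignment. The sketch below scans a SAM-style CIGAR string and reports insertions and deletions of at least 50 bp; the coordinates and CIGAR in the example are hypothetical, and dedicated SV callers additionally combine evidence across many reads and use split alignments.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def find_svs(ref_start, cigar, min_size=50):
    """Return (type, reference position, length) for large indels in one alignment."""
    pos = ref_start
    calls = []
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "D" and length >= min_size:
            calls.append(("DEL", pos, length))
        elif op == "I" and length >= min_size:
            calls.append(("INS", pos, length))
        if op in "MDN=X":
            pos += length  # only these operations advance along the reference
    return calls


# Hypothetical read: 5 kb aligned, a 7.1 kb deletion, then 3 kb aligned
print(find_svs(1000, "5000M7100D3000M2I400M"))
```

The small 2 bp insertion in the example is ignored because it falls below the 50 bp threshold used here to separate SVs from small indels.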

Slide 15: Haplotype Phasing: Unraveling Parental Chromosomes

  • Haplotype phasing is the process of assigning sequencing data to maternally or paternally inherited chromosomes over large genomic intervals.
  • This is a crucial step for definitive diagnoses in autosomal recessive conditions, determining if two variants are on the same allele (cis) or different alleles (trans).
  • LRS significantly enhances long-range haplotype phasing, enabling the assignment of sequence variants with expanded repeats and resolving complex findings.
  • Short-read data typically requires parental testing for phasing, which presents ethical challenges for patients without access to biological parental samples.
  • LRS provides an opportunity for hypothesis-free genome-wide phasing in a single experiment, offering a more effective tool to determine phase across entire genes.

Slide 16: Direct Epigenetic Modification Detection

  • A key advantage of LRS is its ability to directly detect chemical modifications on native genomic DNA, such as DNA methylation (e.g., 5-methylcytosine, N6-methyladenine).
  • This is achieved by characteristic perturbations in sequencing signals (e.g., altered polymerase kinetics in PacBio SMRT, or unique current flow changes in ONT nanopores).
  • LRS avoids the need for chemical pretreatments like bisulfite deamination, which can fragment DNA and interfere with interpretation.
  • This direct detection allows for mapping and phasing of epigenetic signatures across kilobase-sized genomic alleles, providing important functional insights.
  • LRS is particularly valuable for haplotype-resolved mutation and methylation calling, leading to a more comprehensive understanding of genetic variations and their association with phenotypic traits.

Slide 17: LRS in Rare Disease Diagnostics

  • LRS technology plays an important role in discovering novel pathogenic mutations in human diseases with previously unknown genetic causes, where SRS often fails.
  • It excels at identifying pathogenic variants in regions difficult for short reads, such as repeat expansions (e.g., Huntington’s disease, Fragile X syndrome, C9orf72), homologous genes, and genes with pseudogenes.
  • LRS can provide accurate, haplotype-resolved sizing of repeat alleles together with methylation profiling in a single experiment.
  • It enables the detection of large insertions (>50 bp) and complex structural rearrangements that are cryptic to short-read sequencing.
  • LRS has demonstrated increased diagnostic yield by identifying a second pathogenic intronic variant in patients with Gitelman syndrome previously thought to harbor only one variant.
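
Sizing a repeat allele from a read that spans the whole expansion amounts to counting consecutive copies of the repeat motif. The following Python sketch does this with a regular expression; it assumes an error-free read and a pure, uninterrupted repeat, which real data rarely satisfy, and the function name `longest_repeat_run` and the toy sequence are hypothetical.

```python
import re


def longest_repeat_run(read: str, motif: str) -> int:
    """Return the largest number of consecutive copies of `motif` in `read`."""
    # Match one or more back-to-back copies of the motif.
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(read)]
    # A read without the motif yields a run length of 0.
    return max(runs, default=0)


# Toy read: flanking sequence around 5 CAG units.
toy_read = "TTGACC" + "CAG" * 5 + "CCGCCA"
units = longest_repeat_run(toy_read, "CAG")
```

Applied separately to the reads of each haplotype, a count like this gives one allele size per parental chromosome; dedicated tools additionally handle sequencing errors and interrupted repeats, which this sketch ignores.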

Slide 18: LRS in Oncology: Tumor Genomics & Fusion Transcripts

  • LRS has achieved significant results in cancer genomics, proving its clinical potential for diagnosis and therapy selection.
  • It is superior to traditional methods like copy number profiling or karyotyping for detecting copy-balanced SVs and for resolving complex SVs at base-pair resolution.
  • LRS identifies many novel SVs in cancer genomes that short-read approaches miss, suggesting that short-read analyses may underestimate mutational burden.
  • It enables full coverage of transcript sequences for transcript isoform determination, crucial for identifying fusion transcripts that drive carcinogenesis.
  • Real-time LRS platforms like MinION have enabled rapid identification of fusion transcripts (e.g., BCR-ABL1 within 15 minutes) for acute leukemia patients requiring immediate, type-specific treatment.

Slide 19: LRS in Infectious Diseases & Microbiota

  • The speed and portability of ONT platforms make LRS an attractive option for in situ diagnosis of infectious pathogens, facilitating rapid response for identification and management of disease sources and spread.
  • LRS was crucial during the Ebola virus outbreak (2015), providing real-time genomic surveillance in the field and clarifying virus evolution patterns.
  • It has been widely employed during the COVID-19 pandemic for studying SARS-CoV-2 transmission and evolution, offering rapid generation of results with potential real-time data analysis.
  • MinION’s portability and cost-effectiveness make it feasible for malaria elimination efforts and tuberculosis treatment, especially in low-income countries.
  • Hybrid long-read and short-read sequencing combines the strengths of both approaches to overcome their individual limitations in antibiotic resistance research, enabling comprehensive characterization of microbial genomes and identification of novel resistance genes.

Slide 20: LRS in Transplantation: HLA & GVHD

  • LRS has emerged as a valuable tool in transplantation, particularly for understanding the complex interplay between donor and recipient immune systems.
  • It enables highly accurate and comprehensive Human Leukocyte Antigen (HLA) typing, revolutionizing the identification of rare and complex HLA alleles previously hard to detect with SRS.
  • LRS has facilitated the identification of novel HLA alleles, enhancing understanding of HLA diversity and its implications for transplantation immunology, including the risk of acute rejection.
  • It plays a crucial role in unraveling the causes of graft-versus-host disease (GVHD) by performing transcriptome analysis of immune cells, identifying gene expression patterns associated with development and severity.
  • LRS provides comprehensive understanding of Killer-cell Immunoglobulin-like Receptor (KIR) gene diversity and its impact on transplantation outcomes, linking specific KIR genotypes to graft rejection.

Slide 21: Key Challenges in LRS Adoption

  • High cost per sample/base compared to SRS, though rapidly decreasing, remains a barrier for routine clinical use.
  • Requirement for high-quality, high-molecular-weight DNA input for library preparation, which can be challenging to achieve in clinical settings or remote locations.
  • Historically lower per-read accuracy of LRS platforms (though significantly improved) demanded robust error correction methods and affected confidence in single-nucleotide variant calls.
  • Need for specialized bioinformatics tools and algorithms tailored to LRS data, whose qualitative differences from SRS data require different processing approaches.
  • Lack of standardized protocols and quality control measures across different laboratories and platforms can hinder consistent and reliable data generation and analysis.
  • Extensive validation is required before clinical implementation, as diagnostic tests demand high sensitivity and specificity, and limitations must be thoroughly understood and documented.

Slide 22: Addressing Challenges: New Technologies & Algorithms

  • New hardware such as PacBio’s Revio system increases throughput and reduces cost by increasing the number of ZMWs and enabling parallel processing, making HiFi WGS more accessible.
  • ONT’s Q20+ chemistry (Kit 14, also referred to as V14), combined with R10.4.1 flow cells, raises raw-read accuracy to 99% or higher, partially overcoming the low-accuracy limitation.
  • Advances in machine learning algorithms for base and variant calling (e.g., DeepVariant, PEPPER-Margin-DeepVariant for ONT) continue to close the accuracy gap with PacBio HiFi data.
  • Computational enrichment techniques like Adaptive Sampling (ONT) leverage real-time analysis to focus sequencing efforts on regions of interest, improving efficiency and yielding deeper coverage of those regions.
  • Improved DNA extraction and library preparation techniques are being developed to ensure high-quality, high-molecular-weight DNA and reduce fragmentation during sample handling.
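
The link between a "Q20+" label and an accuracy of "99% or higher" follows from the Phred scale, in which a quality score Q corresponds to an error probability of 10^(-Q/10). A small Python sketch of the conversion is given below; the function names are illustrative.

```python
import math


def phred_to_accuracy(q: float) -> float:
    """Convert a Phred quality score to per-base accuracy (1 - error probability)."""
    return 1.0 - 10.0 ** (-q / 10.0)


def accuracy_to_phred(accuracy: float) -> float:
    """Convert per-base accuracy (strictly below 1.0) to a Phred quality score."""
    return -10.0 * math.log10(1.0 - accuracy)


# Print the accuracy that corresponds to a few common quality thresholds.
for q in (10, 20, 30):
    print(f"Q{q}: {phred_to_accuracy(q):.3%} accuracy")
```

On this scale Q20 corresponds to one error in 100 bases (99% accuracy) and Q30 to one error in 1,000 bases (99.9%), so each 10-point gain reduces the error rate tenfold.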

Slide 23: The Human Pangenome Project: The Future of Genomics

  • The Human Pangenome Reference Consortium (HPRC) aims to create a comprehensive collection of human diploid reference genomes that reflect the true diversity of human genetics.
  • LRS technologies are crucial for this initiative, building on the framework of the Telomere-to-Telomere (T2T) consortium to provide complete, gapless assemblies across diverse haplotypes.
  • This project will significantly expand our understanding of human genetic variation by sequencing high-quality genomes from hundreds of individuals selected to maximize global diversity.
  • It seeks to resolve previously unknown gaps and difficult-to-sequence regions such as centromeres, acrocentric short arms, segmental duplications, and satellite arrays that were intractable with older technologies.
  • The complete pangenome dataset has the potential to transform human genomics by capturing variation and diversity on an unprecedented scale, with implications for biomedical research and personalized medicine.

Slide 24: Conclusion: LRS – A Paradigm Shift in Genomics

  • Long-read sequencing has transformed genome assembly and variant discovery, providing unprecedented insight into complex genomic features previously inaccessible.
  • LRS is driving the next generation of complete human reference genomes: it resolves previously intractable repetitive regions and enabled the Telomere-to-Telomere assembly of the human genome.
  • It offers a straightforward method for discovering rare pathogenic structural variants in both basic and clinical research, rapidly emerging as a critical diagnostic tool.
  • As costs continue to decrease and throughput and accuracy increase, LRS is expected to become a mainstream tool and the standard of care for genomic analysis in clinical settings.
  • The synergy of advanced assemblers, robust benchmarking, and cutting-edge LRS technologies will continue to unravel the genome’s secrets, fueling discoveries in diverse fields from biology to medicine and realizing the promise of personalized medicine.