Limitations in Genome Sequencing Technology and Data Analysis

When you understand that the process used to obtain a “viral” genome involves numerous complex steps, each with its own ability to introduce biases, errors, and artefacts that can easily propagate into the final result, it becomes clear how the accuracy and reliability of a “viral” genome can be questioned. There are too many variables to account for, and the errors that are introduced can only be mitigated at best, never eliminated. While the landmines created during library preparation are critical, another concerning area is the reliance on, and the limitations of, the technology used. The fact is that the creation of a “viral” genome is heavily dependent upon technology that is continually becoming outdated and is in constant need of refinement to ensure an “accurate” end product. However, if the technology used is in a continual state of evolution in order to obtain the most accurate results, how can any genome ever be considered a truly accurate representation of the entity it is supposed to represent?

This is a broad area, with many different sequencing platforms and alternative methods, each with its own set of advantages and disadvantages. It would take a book rather than a post to break down the various technologies and processes used. The intention here is to provide an overview of the limitations some of these sequencers have, focusing mostly on Illumina, as it is the most used platform and was instrumental in the creation of the “SARS-CoV-2” genome, as seen here:

“Total RNA was extracted from 200 μl of BALF and a meta-transcriptomic library was constructed for pair-end (150-bp reads) sequencing using an Illumina MiniSeq as previously described. In total, we generated 56,565,928 sequence reads that were de novo-assembled and screened for potential aetiological agents. Of the 384,096 contigs assembled by Megahit, the longest (30,474 nucleotides (nt)) had a high abundance and was closely related to a bat SARS-like coronavirus (CoV) isolate—bat SL-CoVZC45 (GenBank accession number MG772933)—that had previously been sampled in China, with a nucleotide identity of 89.1% (Supplementary Tables 1, 2). The genome sequence of this virus, as well as its termini, were determined and confirmed by reverse-transcription PCR (RT–PCR)10 and 5′/3′ rapid amplification of cDNA ends (RACE), respectively. This virus strain was designated as WH-Human 1 coronavirus (WHCV) (and has also been referred to as ‘2019-nCoV’) and its whole genome sequence (29,903 nt) has been assigned GenBank accession number MN908947.”

https://www.nature.com/articles/s41586-020-2008-3
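
To make the “nucleotide identity” figure in the quoted passage concrete, here is a minimal sketch of how a percent identity can be computed for two sequences that have already been aligned. The function name `percent_identity`, the toy sequences, and the decision to skip gap columns are my own assumptions for illustration; this is not the method or software used in the paper.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of compared columns where two pre-aligned sequences agree.

    Assumption: both strings come from an existing alignment, use '-' for
    gaps, and columns containing a gap are skipped rather than penalized.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    matches = 0
    compared = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gap columns (an illustrative choice)
        compared += 1
        if a == b:
            matches += 1

    if compared == 0:
        raise ValueError("no comparable (non-gap) columns")
    return 100.0 * matches / compared


# Toy example: one mismatch in ten compared columns.
toy_identity = percent_identity("ACGTACGTAC", "ACGTTCGTAC")
```

How gaps are treated changes the resulting number, so an identity percentage only means something alongside the alignment method that produced it.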

In order to focus on the limitations of the sequencing technology, the problems with the ensuing data analysis, and the lack of standardization in the methods, I have included information from a few sources, with a summary at the end. This first source walks through the flow of procedures on the Illumina platform and breaks down some of the issues commonly encountered during sequencing: substitution errors due to background noise that grows with each sequencing cycle; scars left on the nucleotide structure after the blocking group is cleaved, which interact with proteins and decrease the efficiency of the reactions; GC bias introduced during the bridge amplification step (a type of PCR where DNA is repeatedly replicated on a glass flow cell containing complementary oligonucleotides); and the read length limitation, which presents noticeable obstacles especially in de novo (no reference genome) sequencing:

Next-Generation Sequencing: Advantages, Disadvantages, and Future

Second-Generation Sequencing

“In 2000s, the concept of DNA sequencing underwent drastic changes. Particularly, it needs to be mentioned that shotgun sequencing approach introduced during HGP which includes random fragmentation and sequencing of DNA then utilizing computer programs for assembly of different overlapping reads caused an expansion in the perceptions by forming the idea of massively parallel sequencing.”

“General mechanism of reversible terminator chemistry-based sequencing technology consists of three main steps: library preparation, clonal amplification, and sequencing by synthesis. The procedures start with construction of library including DNA fragments with suitable sizes and tagging each fragment by adapter and index sequences. Clonal amplification is performed on a solid surface where primer sequences that are complementary to adapter sequences are immobilized to create clusters representing each unique DNA fragment and provide sufficient signal for during imaging process. Sequencing step involves nucleotide addition by DNA polymerase, washing away unincorporated nucleotides, signal detection, removal of fluorescent and terminator groups, and washing away all remnants. DNA polymerase adds one of four nucleotides labeled with different fluorescent dyes and containing a 3′ blocking group to growing DNA chain. Next, the unincorporated nucleotides are washed away. After signal detection by fluorescent imaging, 3′ blocking group and fluorescent dye is cleaved from nucleotide structure thus DNA polymerase can add new nucleotides in the next cycle. Another washing step is conducted to remove all chemical remnants that may interfere with sequencing reaction later cycles (Bentley et al. 2008).

Sequencing by reversible terminator chemistry is currently the most commonly used NGS technology worldwide. NGS platforms of Illumina Inc. rely on sequencing technology consisting of bridge amplification on solid surfaces (Adessi et al. 2000) developed by Manteia Predictive Medicine and reverse termination chemistry and engineered polymerases (Bennett 2004) developed by Solexa.”

“The flow of procedures for Illumina platforms start with conversion of DNA sample to fragments with acceptable sizes. Library preparation step continues with addition of specific adapter and index sequences of Illumina systems to each DNA fragment. Then, DNA fragments are loaded to a flow cell containing (immobilized to the surface) two types of primer sequences that are complementary to adapters attached to the fragments in library preparation in order to amplify each fragment with a reaction named “bridge amplification.” After binding to the primers on the surface, the complementary sequence is produced and template strand is removed. After that, surface attached DNA strand bends over and anneals to the closest complementary primer, a new strand is synthesized and the replication is repeated. Consequently, millions of clusters consisting of clonally amplified fragments are formed on the flow cell. After removing one type of fragments, DNA polymerase, nucleotides containing 3′ blocking group and a flourophore, and first sequencing primers are added to perform sequencing reaction. DNA polymerase adds suitable nucleotide to the growing chain, unincorporated nucleotides are washed away and using a laser fluorescent attached to the incorporated nucleotide is activated, signal is detected by a CCD camera. After cleavage of blocking group and removal of fluorescent washing step is repeated and continue next cycle. Index sequences are read between two sequencing period. A barcode specific primer is released to the reaction and index sequence of each fragment is determined. To start second read, the synthesized complementary strands are removed with denaturation and bridge amplification is conducted. After amplification, opposite strands of fragments are removed with chemical cleavage and sequencing reaction starts again binding reverse primer (second sequencing primer) and explained steps are followed.”

“Instead, substitution errors are more commonly observed in Illumina systems due to noise background growing each sequencing cycle (Hutchison 2007). Also, after cleavage of blocking group, scars remained on nucleotide structure which eventually caused interaction with proteins and decreased efficiency of sequencing reactions (Chen et al. 2013). Another problem about Illumina systems was GC bias introduced in bridge amplification step (Mardis 2013). These limitations originated from the nature of the method have been reduced with enhancements in its chemistry. Although engineering of DNA polymerase and rearrangement of flow cell channels has provided better accuracy and cluster densities, read length limitation still stays as the main issue for reversible terminator chemistry-based sequencing which presents noticeable obstacles especially in de novo sequencing (Chen et al. 2013).”

https://doi.org/10.1007/978-3-319-31703-8_5

Artist’s rendition of what occurs during the completely unobservable bridge amplification step.

The choice of sequencing platform is crucial to the creation of any genome, and there are many different sequencers to choose from, along with different processes for each. Next-generation sequencing (NGS) platforms are the most widely used, and studies show that these technologies often have systematic defects and introduce their own biases. The following source reiterates that Illumina’s HiSeq, described as the most widely used platform, is prone to GC content bias, and states that the MiSeq has issues with reproducibility and with “low sequence diversity” in the first several cycles of 16S rRNA amplicon sequencing. Any study relying on genomics is limited by the accuracy of the sequencing experiments, as RNA-seq technology can introduce various errors and biases throughout the different steps in the process:

Bias in RNA-seq Library Preparation: Current Challenges and Solutions

4. Sequencing and Imaging

“It is very important for the selection of sequencing platform in RNA-seq experiment. Currently, commercially available NGS platforms include Illumina/Solexa Genome Analyser, Life Technologies/ABI SOLiD System, and Roche/454 Genome Sequencer FLX [61]. These platforms use a sequencing-by-synthesis approach to sort tens of millions of sequence clusters in parallel. Generally, the NGS platform can be classified as either ensemble-based (sequencing multiple identical copies of a DNA molecule) or monomolecular (sequencing a single DNA molecule). Nevertheless, studies have found that sequencing technologies often have systematic defects. For example, when the wrong bases are introduced in the process of template cloning and amplification, substitution bias may appear in platforms such as Illumina and SOLiD®, which limits the utility of data. In addition, studies pointed out that sequence-specific bias may be caused by single-strand DNA folding or sequence-specific changes in enzyme preference [62]. Pacific Biosciences SMRT platform produces long single molecular sequences that are vulnerable to misinsertion from nonfluorescent nucleotides [63, 64]. Besides, the sequencing platform can produce representative biases, that is, some base composition regions (especially those with very high or very low GC composition) are not fully represented, thus leading to bias in the results [65]. Consequently, we will briefly discuss the bias of sequencing platforms, mainly including the Illumina and single-molecule-based platforms. A sum up of suggestions for improvement is presented in Table 4.

Currently, the Illumina HiSeq platform is the most widely used next-generation RNA sequencing technology and has become the standard of NGS sequencing. The platform has two flowcells, each of which provides eight separate channels for sequencing reaction. The sequencing reaction takes 1.5 to 12 days to complete, depending on the total read length of the library. Minoche et al.’s [66] study discovered that the HiSeq platform exists error types of GC content bias. In addition, Illumina released the MiSeq, which integrates NGS instruments and provides end-to-end sequencing solutions using reversible terminator sequencing-by-synthesis technology. The MiSeq instrument is a desktop classifier with low throughput but faster turnaround (generating about 30 million paired-end reads in 24 h). Simultaneously, it can perform on-board cluster generation, amplification, and data analysis in a single run, including base calls, alignment, and variant calling. At the present, MiSeq has become a dominant platform for gene amplification and sequencing in microbial ecology. Nevertheless, various technical problems still remain, such as reproducibility, hence hampered harnessing its true potential to sequence. Furthermore, Fadrosh et al.’s [67] study found that MiSeq 16S rRNA gene amplicon sequencing may arise “low sequence diversity” problems in the first several cycles.

Furthermore, the emergence of single-molecule sequencing platforms such as PacBio makes single-molecule real-time (SMRT) sequencing possible [68]. In this method, DNA polymerase and fluorescent-labeled nucleoside were used for uninterrupted template-directed synthesis. One advantage of SMRT is that it does not include the PCR amplification step, as a consequence avoiding amplification bias. At the same time, this sequencing approach can produce extraordinarily long reads with average lengths of 4200 to 8500 bp, which greatly improves the detection of new transcriptional structures [69, 70], in addition, due to the relatively low cost per run of PacBio’s, which can reduce the cost of RNA-seq. However, PacBio can usually introduce high error rates (∼5%) compared to Illumina and 454 sequencing platform [71]. Due to the fact that it is difficult to the matching erroneous reads to the reference genome, thus the high error rate may be lead to misalignment and loss of sequencing reads. Furthermore, Fichot and Norman’s [72] study showed that PacBio’s sequencing platform can shun enrichment bias of extremely GC/AT.

5. Discussion and Conclusion

At the present, RNA-seq has been widely used in biological, medical, clinical, and pharmaceutical research. However, all these sequencing studies are limited by the accuracy of underlying sequencing experiments, because RNA-seq technology may introduce various errors and biases in sample preparation, library construction, sequencing and imaging, etc.”

https://www.hindawi.com/journals/bmri/2021/6647597/

Conned.

This next source discusses how the many sequencing instruments introduced in the past few years carry significant modifications that could be expected to introduce their own biases into the genome creation process. As they are new, the authors say these technologies still need to be examined to determine their effects on sequencing errors. A critical aspect of any technology is its error rate, which is typically assessed by comparing results across different platforms with multiple replicates. However, even though this is considered the gold standard, different groups see different outcomes with the same technology, and even within the same group there is often variation from experiment to experiment. There may also be a difference between error rates seen under ideal conditions and those that occur “in the wild.” Measuring error is difficult for several reasons. For example, if every discrepancy with a reference genome is labelled an error, real variants will be misclassified as errors. Assuming instead that the majority allele at any position is correct and that any minor alleles are errors can lead to true minor alleles being called errors. Both approaches also lump library preparation errors together with sequencing errors, making the process a difficult and complicated one:

Sequencing error profiles of Illumina sequencing instruments 

“In the past few years, many new sequencing instruments have been introduced. For instance, Illumina has introduced the HiSeq X Ten, with patterned flowcells, NextSeq 500, with 2-dye chemistry and NovaSeq 6000, combining both in an industrial-scale platform (3). While the basic reversible chain-terminator principle remains unchanged, these are significant modifications which could be expected to introduce their own biases. For instance, labeling nucleotides with only two fluorophores means that guanine is detected by the absence of signal (3). Some have reported that this results in overcalling of G’s when artifacts cause signal dropout (4). On the other hand, a controlled study compared HiSeq 2500 and NovaSeq 6000 and indicated a lower error rate in the NovaSeq (5). Evidently, these new technologies beg examination to determine their effects on sequencing errors.

Comparing the error rates of sequencing platforms has been a focus of research since sequencing began. Every new platform has its advantages and disadvantages, with its error rate being one of the most important factors. Typically the error rate is assessed by comparison of results across different platforms with multiple replicates (6). This is the gold standard for showing how the different technologies operate in the same hands. These studies are useful when one is deciding on an instrument to use. But different groups see different outcomes with the same technology. Even within the same group, there is often variation from experiment to experiment (7). And there may be a difference between error rates observed in an ideal scenario versus typical use ‘in the wild’. So, knowing the extent of this variation is important for consumers of sequencing data produced by others. Even researchers choosing technologies for their own data may find it useful to know how much their mileage may vary.

But measuring error is a theoretically difficult task. Some have taken a simple approach, aligning reads to a reference and calling variants as errors (6). But real variants will then be misclassified as errors as well. Instead, one could first perform variant calling, assuming the majority allele at any position is correct and any minor alleles are errors. This will work well for samples that are known to be highly homogeneous, but otherwise there may be true minor alleles which would be mistaken for errors (8). This can be the case for samples of microorganisms, viruses, cancers or organelles. It is difficult to automatically ascertain how homogeneous a sample is, making it a hurdle for an automated survey. Also, at sites with low numbers of reads, it is possible that the error base randomly occurs more often than the true sample base, causing artifacts in error detection. Another issue with both of these approaches is that they detect errors from more than just sequencing. Library preparation steps like polymerase chain reaction (PCR) can also introduce errors. And different preparation techniques can introduce different numbers and types of errors. Both of the error detection methods above will identify both library preparation errors and sequencing errors combined.”

https://academic.oup.com/nargab/article/3/1/lqab019/6193612
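
The two error-measuring approaches described in the quote can be sketched in a few lines. This is a toy illustration only: the function names, the `(start, sequence)` read format, and the example data are my own assumptions, and real pipelines work from aligner output rather than hand-placed reads. In the toy data every read carries an A where the reference has a G at position 2 (standing in for a real variant), and one read has a single odd base at the last position.

```python
from collections import Counter

def mismatch_rate_vs_reference(reads, reference):
    """Approach 1: every base that differs from the reference counts as an error."""
    mismatches = 0
    total = 0
    for start, seq in reads:
        for offset, base in enumerate(seq):
            total += 1
            if base != reference[start + offset]:
                mismatches += 1
    return mismatches / total

def minor_allele_rate(reads, reference_length):
    """Approach 2: the majority base at each position is assumed correct and
    every other base at that position counts as an error."""
    columns = [Counter() for _ in range(reference_length)]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            columns[start + offset][base] += 1
    minor = 0
    total = 0
    for column in columns:
        depth = sum(column.values())
        if depth == 0:
            continue  # no read covers this position
        total += depth
        minor += depth - max(column.values())
    return minor / total

# Hypothetical toy data: reads are (start position, gap-free sequence).
reference = "ACGTACGT"
reads = [
    (0, "ACATAC"),
    (0, "ACATAC"),
    (2, "ATACGT"),
    (2, "ATACGA"),
]

rate_vs_reference = mismatch_rate_vs_reference(reads, reference)
rate_minor_allele = minor_allele_rate(reads, len(reference))
```

On the same reads the two approaches give different rates: the first counts the shared position-2 difference in every read, while the second ignores it because it is the majority base. The last position is covered by only two reads that disagree, so the second approach cannot tell which base is the error, which is the low-coverage problem the authors mention.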

The many sources of errors which combine throughout the process.

While the various errors listed previously for second-generation (2G) technology should be plenty to question the validity of the results, this next source discusses how this technology also offers poor interpretation of homopolymers (runs of consecutive identical bases). Polymerases (the enzymes that synthesize nucleic acid chains) can also incorporate incorrect dNTPs (deoxynucleoside triphosphates, the building blocks added to the growing DNA strand), which results in sequencing errors. The article states that the major disadvantage of all 2G NGS techniques is the need for PCR amplification prior to sequencing, which is associated with PCR bias during library preparation and analysis. It also details the centralized workflow used in the creation of a genome and how even a single step such as mapping to a reference genome requires choosing from among more than 60 mapping tools, with the information needed to pick the most suitable one scattered through publications, source code, manuals, and other documentation. Even with the plethora of tools available, it is stated that there is a constant need for new and improved versions to ensure that sensitivity, accuracy, and resolution can match the rapidly advancing NGS techniques. The massive amount of data generated has also created bottlenecks in managing, analyzing, and storing the datasets. Because of this, strategies are continuously being suggested with the aim of increasing efficiency, reducing sequencing error, maximizing reproducibility, and ensuring correct data management:

An Overview of Next-Generation Sequencing

“2G NGS technologies in general offer several advantages over alternative sequencing techniques, including the ability to generate sequencing reads in a fast, sensitive and cost-effective manner. However, there are also disadvantages, including poor interpretation of homopolymers and incorporation of incorrect dNTPs by polymerases, resulting in sequencing errors. The short read lengths also create the need for deeper sequencing coverage to enable accurate contig and final genome assembly. The major disadvantage of all 2G NGS techniques is the need for PCR amplification prior to sequencing. This is associated with PCR bias during library preparation (sequence GC-content, fragment length and false diversity) and analysis (base errors/favoring certain sequences over others).”
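
Two of the sequence properties named in this quote, GC content and homopolymers, are simple to compute for any given read. The sketch below is a minimal illustration; the function names `gc_fraction` and `longest_homopolymer` and the example sequences are hypothetical, not taken from any of the tools mentioned in the article.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of bases in the sequence that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)

def longest_homopolymer(seq: str):
    """Return (base, run_length) for the longest run of one repeated base.

    On ties, the first such run in the sequence is returned.
    """
    best_base, best_len = "", 0
    run_base, run_len = "", 0
    for base in seq.upper():
        if base == run_base:
            run_len += 1
        else:
            run_base, run_len = base, 1
        if run_len > best_len:
            best_base, best_len = run_base, run_len
    return best_base, best_len

# Toy examples.
example_gc = gc_fraction("GGCCAT")
example_run = longest_homopolymer("ACGGGGTTA")
```

Checks like these only flag regions where the biases described above are more likely; they do not measure how much bias a particular run actually introduced.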

Next-generation sequencing data analysis

“Any kind of NGS technology generates a significant amount of output data. The basics of sequence analysis follow a centralized workflow which includes a raw read QC step, pre-processing and mapping, followed by post-alignment processing, variant annotation, variant calling and visualization.

Assessment of the raw sequencing data is imperative to determine their quality and pave the way for all downstream analyses. It can provide a general view on the number and length of reads, any contaminating sequences, or any reads with low coverage. One of the most well-established applications for computing quality control statistics of sequencing reads is FastQC. However, for further pre-processing, such as read filtering and trimming, additional tools are needed. Trimming bases towards the ends of reads and removing leftover adapter sequences generally improves data quality. More recently, ultra-fast tools have been introduced, such as fastp, that can perform quality control, read filtering and base correction on sequencing data, combining most features from the traditional applications while also running two to five times faster than any of them alone.39

After the quality of the reads has been checked and pre-processing performed, the next step will depend on the existence of a reference genome. In the case of a de novo genome assembly, the generated sequences are aligned into contigs using their overlapping regions. This is often done with the assistance of processing pipelines that can include scaffolding steps to help with contig ordering, orientation and the removal of repetitive regions, thus increasing the assembly continuity.40,41 If the generated sequences are mapped (aligned) to a reference genome or transcriptome, variations compared to the reference sequence can be identified. Today, there is a plethora of mapping tools (more than 60), that have been adapted to handle the growing quantities of data generated by NGS, exploit technological advancements and tackle protocol developments.42 One difficulty, due to the increasing number of mappers, is being able to find the most suitable one. Information is usually scattered through publications, source codes (when available), manuals and other documentation. Some of the tools will also offer a mapping quality check that is necessary as some biases will only show after the mapping step. Similar to quality control prior to mapping, the correct processing of mapped reads is a crucial step, during which duplicated mapped reads (including but not limited to PCR artifacts) are removed. This is a standardized method, and most tools share common features. Once the reads have been mapped and processed, they need to be analyzed in an experiment-specific fashion, what is known as variant analysis. This step can identify single nucleotide polymorphisms (SNPs), indels (an insertion or deletion of bases), inversions, haplotypes, differential gene transcription in the case of RNA-seq and much more. 
Despite the multitude of tools for genome assembly, alignment and analysis, there is a constant need for new and improved versions to ensure that the sensitivity, accuracy and resolution can match the rapidly advancing NGS techniques.

The final step is visualization, for which data complexity can pose a significant challenge. Depending on the experiment and the research questions posed, there are a number of tools that can be used. If a reference genome is available, the Integrated Genome Viewer (IGV) is a popular choice43, as is the Genome Browser. If experiments include WGS or WES, the Variant Explorer is a particularly good tool as it can be used to sieve through thousands of variants and allow users to focus on their most important findings. Visualization tools like VISTA allow for comparison between different genomic sequences. Programs suitable for de novo genome assemblies44 are more limited. However, tools like Bandage and Icarus have been used to explore and analyze the assembled genomes.

Next-generation sequencing bottlenecks

NGS has enabled us to discover and study genomes in ways that were never possible before. However, the complexity of the sample processing for NGS has exposed bottlenecks in managing, analyzing and storing the datasets. One of the main challenges is the computational resources required for the assembly, annotation, and analysis of sequencing data.45 The vast amount of data generated by NGS analysis is another critical challenge. Data centers are reaching high storage capacity levels and are constantly trying to cope with increasing demands, running the risk of permanent data loss.46 More strategies are continuously being suggested with the aim to increase efficiency, reduce sequencing error, maximize reproducibility and ensure correct data management.”

https://www.technologynetworks.com/genomics/articles/an-overview-of-next-generation-sequencing-346532
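
The raw read QC and trimming step in the quoted workflow rests on per-base quality scores. A Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so Q20 means an estimated 1 in 100 chance that the base call is wrong. The sketch below assumes the common Phred+33 FASTQ encoding and uses a deliberately simple rule (drop trailing bases below a threshold); the function names and the threshold of 20 are my own assumptions, and this is not how FastQC or fastp are implemented.

```python
def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score to its estimated error probability."""
    return 10 ** (-q / 10)

def decode_phred33(quality_string: str):
    """Decode a FASTQ quality string, assuming Phred+33 ASCII encoding."""
    return [ord(char) - 33 for char in quality_string]

def trim_3prime(seq: str, quality_string: str, min_q: int = 20):
    """Drop bases from the 3' end until one meets the quality threshold.

    Simplified illustration only; real trimmers typically use sliding
    windows and also remove adapter sequences.
    """
    if len(seq) != len(quality_string):
        raise ValueError("sequence and quality string must be the same length")
    quals = decode_phred33(quality_string)
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quality_string[:end]

# Toy read: four high-quality bases followed by two low-quality ones.
trimmed_seq, trimmed_qual = trim_3prime("ACGTAC", "IIII##")
```

Note that the quality scores are themselves estimates produced by the instrument’s base-calling software, so trimming on them filters by reported confidence rather than by known correctness.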

Lack of Standardization

With the numerous technologies involved, the various steps required of each, and the differing results generated by the same teams using the same equipment and procedures, it really shouldn’t come as a surprise that genomics is facing a reproducibility crisis. A big part of the reproducibility problem is the lack of standardization for the methods and technologies involved in producing a genome. There is no agreed-upon framework to which the genomics industry as a whole adheres in order to ensure that all processes associated with the creation of a genome follow a consistent set of guidelines. This is a known issue that affects the end result of every sequencing experiment, and it is one which has yet to be resolved.

Interestingly, a paper came out in 2014 specifically addressing this lack of standardization and how it affects the genomes produced. The authors attempted to create their own set of standards to apply to “viral” genomics. In order to avoid relying on particular aspects of the different sequencing technologies, the authors made two assumptions:

Standards for Sequencing Viral Genomes in the Era of High-Throughput Sequencing

“Thanks to high-throughput sequencing technologies, genome sequencing has become a common component in nearly all aspects of viral research; thus, we are experiencing an explosion in both the number of available genome sequences and the number of institutions producing such data. However, there are currently no common standards used to convey the quality, and therefore utility, of these various genome sequences.”

“To alleviate any reliance on particular aspects of the different sequencing technologies, we have made two assumptions that should be valid in most viral sequencing projects. The first assumption is a basic understanding of the genomic structure of the virus being sequenced, including the expected size of the genome, the number of segments, and the number and distribution of major open reading frames (ORFs). Fortunately, genome structure is highly conserved within viral groups (6), and although new viruses are constantly being uncovered, the discovery of a novel family or even genus remains relatively uncommon (7). In the absence of such information, the defined standards can still be applied following further analysis to determine genome structure. The second assumption is that the genetic material of the virus being described can be accurately separated from the genomes of the host and/or other microbes, either physically or bioinformatically. Depending on the technology used, it is critical that the potential for cross-contamination of samples during the sample indexing/bar coding process and sequencing procedure be addressed with appropriate internal controls and procedural methods (8).”

https://journals.asm.org/doi/10.1128/mBio.01360-14

According to this paper, in order to standardize the sequencing of “viral” genomes, two things must first be assumed: that there is already a basic understanding of the genomic structure of the “virus” being sequenced, including the expected size of the genome, and that the genetic material of the “virus” can be accurately separated from that of the host and other microbes, either physically or bioinformatically. If you’ve been following along, you will immediately understand my objection to these two assumptions. My position is that no “virus” has ever been purified directly from human fluids and isolated from everything else. On that view, physical separation is off the table, the second assumption rests entirely on bioinformatic separation, and the first assumption is left without purified/isolated particles from which to establish the expected size and structure of any sequenced genome.

Apparently the standards set forth by the authors in 2014 did not take hold, as a 2017 preprint acknowledged the continued lack of standardization in genomics, specifically regarding the laboratory workflows and how this impacts the final results:

Standardization in next-generation sequencing – Issues and approaches of establishing standards in a highly dynamic environment

“Due to the fast development, NGS is currently characterized by a lack of standard operating procedures, quality management/quality assurance specifications, proficiency testing systems and even less approved standards along with high cost and uncertainty of data quality. On the one hand, appropriate standardization approaches were already performed by different initiatives and projects in the format of accreditation checklists, technical notes and guidelines for validation of NGS workflows. On the other hand, these approaches are exclusively located in the US due to origins of NGS overseas, therefore there exists an obvious lack of European-based standardization initiatives. An additional problem represents the validity of promising standards across different NGS applications. Due to highest demands and regulations in specific areas like clinical diagnostics, the same standards, which will be established there, will not be applicable or reasonable in other applications. These points emphasize the importance of standardization in NGS mainly addressing the laboratory workflows, which are the prerequisite and foundation for sufficient quality of downstream results.”

Conclusions

“There is already a distinct number of NGS standardization efforts present; however, the majority of approaches target the standardization of the bioinformatics processing pipeline in the context of “Big Data”. Therefore, an essential prerequisite is the simplification and standardization of wet laboratory workflows, because respective steps are directly affecting the final data quality and thus there exists the demand to formulate experimental procedures to ensure a sufficient final data output quality.”

https://peerj.com/preprints/2771/

Hypothetical “virus” genome….

In September 2020, after the creation of the “SARS-COV-2” genome and the explosion of genomes submitted to the GISAID database, China sought to finally address the lack of standardization in genomics by releasing the High-throughput Sequencer Standards. Even with this release, it is noted that the industry is only now moving toward a semblance of standardization. The first sequenced genome was said to have been done in 1977. This lack of standardization has persisted for nearly 50 years. How can any genome created through varying methods over decades with constantly evolving technologies and techniques ever be considered accurate, especially given that many genomes today rely on references to older genomes made with outdated and less “accurate” technology?

The first high-throughput gene sequencer standard is released! Initiated by the China Procuratorate in conjunction with BGI Manufacturing

“Although the research on high-throughput gene sequencers at home and abroad is becoming more and more extensive and the clinical needs are becoming stronger, there are no mature standards for the research, development and use of sequencing platforms at present, which makes it difficult to control the risks in clinical use. Therefore, it is urgent to introduce a relatively complete set of national standards and even industry standards.”

The gene sequencing industry is moving towards standardization, fully helping to build an ecological civilization for sequencing applications.

“In recent years, gene sequencing technologies and platforms have been continuously upgraded and iterated, and high-throughput gene sequencers have also developed in the direction of portability, speed, and intelligence. The release of “High-throughput Sequencer Standards” will comprehensively promote the standardization and standardization of genetic testing products with high-throughput sequencing technology as the core. On this basis, sequencing costs will be greatly reduced, the ease of use of the system will continue to increase, and the depth and breadth of sequencing applications will gradually be opened, thereby deepening the application of sequencing services in genetic technology and comprehensively building an ecological civilization for sequencing applications.”

The first high-throughput gene sequencer standard is released! Initiated by the China Procuratorate in conjunction with BGI Manufacturing

Non-existent.

In Summary:

  • The shotgun sequencing approach used today was introduced during the Human Genome Project and it included random fragmentation and sequencing of DNA then utilizing computer programs for assembly of different overlapping reads
  • General mechanism of reversible terminator chemistry-based sequencing technology consists of three main steps:
    1. Library preparation
    2. Clonal amplification
    3. Sequencing by synthesis
  • Sequencing by reversible terminator chemistry is currently the most commonly used NGS technology worldwide
  • NGS platforms of Illumina Inc. rely on sequencing technology consisting of bridge amplification on solid surfaces developed by Manteia Predictive Medicine and reversible terminator chemistry and engineered polymerases developed by Solexa
  • Illumina Workflow:
    1. Start with conversion of DNA sample to fragments with acceptable sizes
    2. Library preparation step continues with addition of specific adapter and index sequences of Illumina systems to each DNA fragment
    3. DNA fragments are loaded to a flow cell containing (immobilized to the surface) two types of primer sequences that are complementary to adapters attached to the fragments in library preparation in order to amplify each fragment with a reaction named “bridge amplification”
    4. After binding to the primers on the surface, the complementary sequence is produced and template strand is removed
    5. After that, surface attached DNA strand bends over and anneals to the closest complementary primer, a new strand is synthesized and the replication is repeated
    6. Consequently, millions of clusters consisting of clonally amplified fragments are formed on the flow cell
    7. After removing one type of fragment, DNA polymerase, nucleotides containing a 3′ blocking group and a fluorophore, and the first sequencing primers are added to perform the sequencing reaction
    8. DNA polymerase adds the suitable nucleotide to the growing chain, unincorporated nucleotides are washed away, a laser excites the fluorophore attached to the incorporated nucleotide, and the signal is detected by a CCD camera
    9. After cleavage of the blocking group and removal of the fluorophore, the washing step is repeated and the next cycle begins
    10. Index sequences are read between the two sequencing periods
    11. A barcode-specific primer is released to the reaction and the index sequence of each fragment is determined
    12. To start second read, the synthesized complementary strands are removed with denaturation and bridge amplification is conducted
    13. After amplification, opposite strands of fragments are removed with chemical cleavage and sequencing reaction starts again binding reverse primer (second sequencing primer) and explained steps are followed
  • Substitution errors are more commonly observed in Illumina systems due to noise background growing each sequencing cycle
  • After cleavage of blocking group, scars remained on nucleotide structure which eventually caused interaction with proteins and decreased efficiency of sequencing reactions
  • Another problem about Illumina systems was GC bias introduced in bridge amplification step
  • Read length limitation still stays as the main issue for reversible terminator chemistry-based sequencing which presents noticeable obstacles especially in de novo sequencing
  • The selection of the sequencing platform is very important in an RNA-seq experiment
  • Commercially available NGS platforms include:
    1. Illumina/Solexa Genome Analyser
    2. Life Technologies/ABI SOLiD System
    3. Roche/454 Genome Sequencer FLX
  • These platforms use a sequencing-by-synthesis approach to sort tens of millions of sequence clusters in parallel
  • Studies have found that sequencing technologies often have systematic defects
  • When the wrong bases are introduced in the process of template cloning and amplification, substitution bias may appear in platforms such as Illumina and SOLiD®, which limits the utility of data
  • Studies pointed out that sequence-specific bias may be caused by single-strand DNA folding or sequence-specific changes in enzyme preference
  • The sequencing platform can produce representative biases, that is, some base composition regions (especially those with very high or very low GC composition) are not fully represented, thus leading to bias in the results
  • The Illumina HiSeq platform is the most widely used next-generation RNA sequencing technology and has become the standard of NGS sequencing
  • Minoche et al.’s study found that the HiSeq platform exhibits GC content bias among its error types
  • MiSeq has become a dominant platform for gene amplification and sequencing in microbial ecology
  • Nevertheless, various technical problems still remain, such as reproducibility, which hamper harnessing its true potential to sequence
  • Fadrosh et al.’s study found that MiSeq 16S rRNA gene amplicon sequencing may give rise to “low sequence diversity” problems in the first several cycles
  • The emergence of single-molecule sequencing platforms such as PacBio makes single-molecule real-time (SMRT) sequencing possible
  • One advantage of SMRT is that it does not include the PCR amplification step, as a consequence avoiding amplification bias
  • However, PacBio can usually introduce high error rates (∼5%) compared to Illumina and 454 sequencing platform
  • Because it is difficult to match erroneous reads to the reference genome, the high error rate may lead to misalignment and loss of sequencing reads
  • Fichot and Norman’s study showed that PacBio’s sequencing platform can avoid enrichment bias for extremely GC- or AT-rich sequences
  • All RNA-seq studies are limited by the accuracy of underlying sequencing experiments, because RNA-seq technology may introduce various errors and biases in sample preparation, library construction, sequencing and imaging, etc.
  • Illumina has introduced the HiSeq X Ten, with patterned flowcells, NextSeq 500, with 2-dye chemistry and NovaSeq 6000, combining both in an industrial-scale platform
  • While the basic reversible chain-terminator principle remains unchanged, these are significant modifications which could be expected to introduce their own biases
  • These new technologies beg examination to determine their effects on sequencing errors
  • Every new platform has its advantages and disadvantages, with its error rate being one of the most important factors
  • Typically the error rate is assessed by comparison of results across different platforms with multiple replicates
  • But different groups see different outcomes with the same technology
  • Even within the same group, there is often variation from experiment to experiment
  • There may be a difference between error rates observed in an ideal scenario versus typical use ‘in the wild’
  • Measuring error is a theoretically difficult task
  • Some have taken a simple approach, aligning reads to a reference and calling variants as errors, but real variants will then be misclassified as errors as well
  • One could first perform variant calling, assuming the majority allele at any position is correct and any minor alleles are errors
  • This will work well for samples that are known to be highly homogeneous, but otherwise there may be true minor alleles which would be mistaken for errors
  • It is difficult to automatically ascertain how homogeneous a sample is, making it a hurdle for an automated survey
  • At sites with low numbers of reads, it is possible that the error base randomly occurs more often than the true sample base, causing artifacts in error detection
  • Another issue is that both of the error detection methods above detect more than sequencing errors and will identify both library preparation errors and sequencing errors combined
  • There are disadvantages to 2G sequencing including poor interpretation of homopolymers and incorporation of incorrect dNTPs by polymerases, resulting in sequencing errors
  • The short read lengths also create the need for deeper sequencing coverage to enable accurate contig and final genome assembly
  • The major disadvantage of all 2G NGS techniques is the need for PCR amplification prior to sequencing
  • This is associated with PCR bias during library preparation (sequence GC-content, fragment length and false diversity) and analysis (base errors/favoring certain sequences over others)
  • The basics of sequence analysis follow a centralized workflow which includes:
    1. A raw read QC step
    2. Pre-processing and mapping
    3. Post-alignment processing
    4. Variant annotation
    5. Variant calling
    6. Visualization
  • Assessment of the raw sequencing data is imperative to determine their quality and pave the way for all downstream analyses
  • It can provide a general view on the number and length of reads, any contaminating sequences, or any reads with low coverage
  • After the quality of the reads has been checked and pre-processing performed, the next step will depend on the existence of a reference genome
  • If the generated sequences are mapped (aligned) to a reference genome or transcriptome, variations compared to the reference sequence can be identified
  • There is a plethora of mapping tools (more than 60), that have been adapted to handle the growing quantities of data generated by NGS, exploit technological advancements and tackle protocol developments
  • One difficulty, due to the increasing number of mappers, is being able to find the most suitable one
  • Information is usually scattered through publications, source codes (when available), manuals and other documentation
  • Similar to quality control prior to mapping, the correct processing of mapped reads is a crucial step, during which duplicated mapped reads (including but not limited to PCR artifacts) are removed
  • Despite the multitude of tools for genome assembly, alignment and analysis, there is a constant need for new and improved versions to ensure that the sensitivity, accuracy and resolution can match the rapidly advancing NGS techniques
  • The final step is visualization, for which data complexity can pose a significant challenge
  • The complexity of the sample processing for NGS has exposed bottlenecks in managing, analyzing and storing the datasets
  • One of the main challenges is the computational resources required for the assembly, annotation, and analysis of sequencing data
  • The vast amount of data generated by NGS analysis is another critical challenge
  • More strategies are continuously being suggested with the aim to increase efficiency, reduce sequencing error, maximize reproducibility and ensure correct data management
  • There are currently no common standards used to convey the quality, and therefore utility, of these various genome sequences
  • To alleviate any reliance on particular aspects of the different sequencing technologies, the authors made two assumptions that they felt should be valid in most “viral” sequencing projects:
    1. The first assumption is a basic understanding of the genomic structure of the “virus” being sequenced, including the expected size of the genome, the number of segments, and the number and distribution of major open reading frames (ORFs)
    2. The second assumption is that the genetic material of the “virus” being described can be accurately separated from the genomes of the host and/or other microbes, either physically or bioinformatically
  • Obviously, both of these assumptions would be invalid as there are no purified/isolated “virus” particles
  • Due to the fast development, NGS is currently characterized by a lack of standard operating procedures, quality management/quality assurance specifications, proficiency testing systems and even less approved standards along with high cost and uncertainty of data quality
  • An additional problem represents the validity of promising standards across different NGS applications
  • These points emphasize the importance of standardization in NGS mainly addressing the laboratory workflows, which are the prerequisite and foundation for sufficient quality of downstream results
  • An essential prerequisite is the simplification and standardization of wet laboratory workflows, because respective steps are directly affecting the final data quality and thus there exists the demand to formulate experimental procedures to ensure a sufficient final data output quality
  • There are no mature standards for the research, development and use of sequencing platforms at present, which makes it difficult to control the risks in clinical use
  • It is urgent to introduce a relatively complete set of national standards and even industry standards
  • The gene sequencing industry is moving towards standardization (i.e. not there yet), fully helping to build an ecological civilization for sequencing applications
  • The release of “High-throughput Sequencer Standards” will comprehensively promote the standardization and standardization of genetic testing products with high-throughput sequencing technology as the core
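
To make the majority-allele approach to measuring error described in the summary more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function name `minor_allele_rate`, the `min_depth` threshold, and the toy input (one string of observed bases per position) are hypothetical and do not come from any of the cited papers or from a real pileup format.

```python
from collections import Counter


def minor_allele_rate(pileup_columns, min_depth=10):
    """Estimate an apparent error rate from per-position base observations.

    pileup_columns: list of strings; each string holds the bases that the
    aligned reads show at one position (hypothetical toy input).
    min_depth: positions covered by fewer reads are skipped, because at low
    depth an erroneous base can outnumber the true base by chance.
    """
    mismatches = 0
    total = 0
    for column in pileup_columns:
        depth = len(column)
        if depth < min_depth:
            continue  # too few reads to trust the majority call
        counts = Counter(column)
        majority_count = counts.most_common(1)[0][1]
        # Every base that disagrees with the majority is counted as an error.
        # True minor alleles are counted here as well, which is the
        # limitation noted for heterogeneous samples.
        mismatches += depth - majority_count
        total += depth
    return mismatches / total if total else 0.0


columns = [
    "AAAAAAAAAA",  # depth 10, all reads agree
    "CCCCCCCCCT",  # depth 10, one minor base
    "GGT",         # depth 3, skipped at the default threshold
]
rate = minor_allele_rate(columns)
print(f"apparent error rate: {rate:.3f}")
```

Lowering `min_depth` to 1 pulls the shallow third position into the estimate, which shows how low-coverage sites change the result. The sketch also cannot tell library preparation errors from sequencing errors; both appear as the same minor bases.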

The sequenced genome is only as good as the technology used for it. There are many to choose from, each with advantages and disadvantages, and they all have some systematic defects. For Illumina, this refers to GC content bias, substitution errors, low sequence diversity, read length limitations, and technical problems related to reproducibility. Even if Illumina and other sequencing technologies were 100% accurate, the contamination, bias, artefacts, batch errors, etc. inherent in the processes leading up to the sequencing analysis would be enough to question anything assembled from the data. Adding in the lack of standardization of the methods used along with the technological challenges, it becomes even more apparent that there are too many different technologies, too many different processes, and far too many different variables to be able to say that the end product is a reliable and accurate representation of the nonexistent entity it is supposed to represent.
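
As a small illustration of what checking for GC content bias can look like in practice, the following Python sketch bins reads by their GC fraction. It is a toy example under stated assumptions: the helper names `gc_fraction` and `gc_histogram`, the number of bins, and the example reads are all hypothetical and not taken from any platform's software.

```python
def gc_fraction(seq):
    """Return the fraction of called bases (A, C, G, T) that are G or C."""
    called = [base for base in seq.upper() if base in "ACGT"]
    if not called:
        return 0.0  # no called bases, e.g. a read made only of N
    return sum(base in "GC" for base in called) / len(called)


def gc_histogram(reads, bins=4):
    """Count reads per GC-fraction bin; the last bin includes a fraction of 1.0."""
    counts = [0] * bins
    for read in reads:
        index = min(int(gc_fraction(read) * bins), bins - 1)
        counts[index] += 1
    return counts


# Hypothetical reads, chosen only to populate different bins.
reads = ["ATATATAT", "GCGCGCGC", "ATGCATGC", "AAAACCCC", "GGGGCCCN"]
print(gc_histogram(reads))
```

Comparing such a histogram with the GC distribution expected for the sample is one simple way to see whether very high or very low GC regions are under-represented in the reads.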

