DNA and RNA Shearing/Fragmentation

There are numerous steps that go into the creation of a genome, whether human, animal, bacterial, or “viral.” They are often complex and require various methods and technologies, such as PCR amplification and sequencing machines, in order to obtain the final product. As the process is a very intricate and complicated one, there are multiple ways in which the accuracy and reliability of the genome can become compromised along the way. This can be seen at the very beginning of the process, during the crucial step of shearing the DNA/RNA obtained from tissue or cell cultures into the desired fragment range. This is usually done through either physical or chemical means and is an absolute requirement after extracting the RNA in order to create the optimal library for the sample to be sequenced. The size of the fragments is dependent on the technology that is selected and used. If any problems arise during fragmentation, they can introduce biases, artefacts, and errors that will be propagated and compounded down the line into the final genome.

The information presented below is compiled from various sources and may seem confusing at first. There are many technical details. If it is difficult to understand, that is OK, as the main takeaways are the admissions of the limitations of the technology, the complexity of the sequencing process, and the numerous ways errors, biases, and artifacts are introduced throughout each and every step. Before diving into the different challenges and potential landmines waiting to blow up the sequencing effort, let’s get a quick breakdown of what fragmentation entails.

DNA Fragmentation

This first source explains what fragmentation is and provides some context as to why this seemingly incoherent process of cutting DNA/RNA into small fragments in order to be reassembled is used. It also explains one of the major hurdles often introduced during this process, known as bias:

Non-random DNA fragmentation in next-generation sequencing

“Next Generation Sequencing (NGS) technology is based on cutting DNA into small fragments and their massive parallel sequencing. The multiple overlapping segments termed “reads” are assembled into a contiguous sequence. To reduce sequencing errors, every genome region should be sequenced several dozen times. This sequencing approach is based on the assumption that genomic DNA breaks are random and sequence-independent.”

“The genome of a living organism resembles a bookshelf filled with books – chromosomes containing texts made up of letters – nucleotides. The early methods of deciphering biological sequences were based on the precise excision of a particular DNA fragment and its accurate reading. An alternative and, as it initially seemed, incoherent, sequencing method was proposed back in 1979 [1], whereby the multiple copies of a whole genome DNA were to be broken up into small fragments, which were sequenced and then these sequences (termed “reads”) were assembled into a continuous text based on the overlapping ends. Nevertheless, the explosive development of automated sequencers and advances in computational power determined the present-time dominance of this method, called random shotgun sequencing. Modern sequencing machines are capable of reading hundreds of millions of reads per day, where each read consists of tens or hundreds of nucleotides.

The first step of DNA sequencing in the NGS technology is DNA fragmentation. Samples of purified DNA are sheared into short fragments, using either mechanical methods (e.g., ultrasonication shearing and nebulization) or enzymatic digestion [2]. The fragmented DNA is ligated at both blunt ends of each fragment with specific adaptors, which serve as primer-binding sites for amplification. Then the adaptor-ligated DNA fragments are size-selected through agarose gel electrophoresis or with paramagnetic beads; at this step the ligation duplicates are removed. Subsequently DNA fragments are melted and the single-stranded DNAs are immobilized either on planar solid surfaces of a flow cell (Illumina sequencers), or on the surface of micron-scale beads (454-Roche and SOLiD sequencers), or on ionized spheres (Ion Torrent sequencers) [3]. Template amplification is performed by PCR on solid surface, or by emulsion PCR into separate microreactors, beads or spheres within sequencers. Finally, sequencing is achieved by detecting the emission of light or hydrogen ions from every dot on the solid surface or spheres, during enzymatic attachment of complementary nucleotides to the clusters of identical single-stranded DNA fragments [4].”

“The bias in NGS data has been extensively observed [17–23], but there is no agreement with respect to the sources of the observed bias. Thus, Benjamini and Speed [17] reported the regularities in the GC bias patterns and found that GC content influences fragment count the most. Since both GC-rich and AT-rich fragments were underrepresented in the sequencing results, the authors hypothesized that the most important cause of the GC bias was PCR.”

“We propose to investigate the bias that originated at the fragmentation stage of NGS sequencing procedures. For the first time we venture to show that the bias in NGS reads strongly correlates with the bias produced by sonication of pure restriction DNA fragments, which was shown to be sequence-specific [27]. Hence we report and analyze bias produced by three DNA shearing methods, sonication, nebulization and Covaris and demonstrate that the instances of bias in different fragmentation methods highly correlate with each other and are common for all of the hydrodynamic DNA shearing methods.”

https://www.nature.com/articles/srep04532
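
To see in miniature what “assembling reads based on overlapping ends” means, here is a toy sketch in Python (my own illustration, not code from the paper): it shears copies of a short made-up “genome” at random positions, then greedily merges the resulting fragments by their longest overlaps. Note that the reconstruction only has a chance of succeeding if the breaks are random enough for the fragments to overlap, which is precisely the assumption the paper questions.

```python
import random

def shear(genome, copies=30, min_len=8, max_len=15, seed=1):
    """Break several copies of 'genome' into fragments ('reads') at random positions."""
    rng = random.Random(seed)
    reads = []
    for _ in range(copies):
        i = 0
        while i < len(genome):
            j = min(len(genome), i + rng.randint(min_len, max_len))
            reads.append(genome[i:j])
            i = j
    return reads

def overlap(a, b):
    """Length of the longest suffix of 'a' that is a prefix of 'b'."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_overlap=4):
    """Repeatedly merge the pair of contigs sharing the largest overlap."""
    contigs = list(set(reads))
    while len(contigs) > 1:
        # drop any contig fully contained in another
        contigs = [a for a in contigs
                   if not any(a != b and a in b for b in contigs)]
        best_k, best_pair = 0, None
        for a in contigs:
            for b in contigs:
                if a is not b:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_pair = k, (a, b)
        if best_k < min_overlap:        # no usable overlap left: assembly stalls
            break
        a, b = best_pair
        contigs.remove(a)
        contigs.remove(b)
        contigs.append(a + b[best_k:])  # merge on the shared overlap
    return max(contigs, key=len)

genome = "ATGGCGTACGTTAGCCGATAAGCTTGCAACGTAGGCTA"
contig = greedy_assemble(shear(genome))
print(contig)
print("recovered the original:", contig == genome)
```

Even in this idealized toy, a single repeated stretch longer than the minimum overlap can cause a false join, a hint of how much worse things get with real genomes and non-random breaks.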

As can be gleaned from this first source, the necessity of fragmenting DNA into small pieces in order to sequence them creates the problem of introducing bias. These biases come in several forms, but all are distortions that can throw off genetic predictions and genome accuracy. They can present as deviations from the true estimates or as heterogeneity (diversity in content). In order for the genome to be considered reliable, these biases must be accounted for and corrected. Without doing so, the work will lead to inaccurate interpretations and will not be replicable. However, it is well known that the technology, as it stands today, is unable to meet this requirement:

Characterizing and measuring bias in sequence data

“Ideal whole-genome shotgun DNA sequencing would distribute reads uniformly across the genome and without sequence-dependent variations in quality. All existing sequencing technologies fall short of this ideal and exhibit various types and degrees of bias. Sequencing bias degrades genomic data applications, including genome assembly and variation discovery, which rely on genome-wide coverage.”

“Bias manifests in multiple ways. Coverage bias is a deviation from the uniform distribution of reads across the genome. Similarly, error bias is a deviation from the expectation of uniform mismatch, insertion, and deletion rates in reads across the genome. This paper focuses primarily on coverage bias because it is the most damaging sequencing failure.

Sequencing technologies are vulnerable to multiple sources of bias. Methods based on bacterial cloning and Sanger-chemistry sequencing [8] were subject to many coverage-reducing biases, notably at GC extremes, palindromes, inverted repeats, and sequences toxic to the bacterial host [9–17]. Illumina sequencing [18] has been shown to lose coverage in regions of high or low GC [19–22], a phenomenon also seen in other ‘next-generation’ technologies [3, 6]. PCR amplification during library construction is a known source of undercoverage of GC-extreme regions [20, 21] and similar biases may also be introduced during bridge PCR for cluster amplification on the Illumina flowcell [23]. Illumina strand-specific errors can lead to coverage biases by impairing aligner performance [24]. Ion Torrent [25], like 454 [26], utilizes a terminator-free chemistry that may limit its ability to accurately sequence long homopolymers [4, 27, 28], and may also be sensitive to coverage biases introduced by emulsion PCR in library construction. Complete Genomics [29] also uses amplification along with a complex library construction process.

In addition to sources in the wet lab, bias can be introduced by any of the computational steps in the sequencing pipeline. Signal-processing and base calling limitations could result in under-representation or increased error rates in some locations, as can inaccurate alignment. An inaccurate reference or sample-reference differences can cause coverage or accuracy variations that may be misdiagnosed as sequencing bias. Therefore, detecting bias is only the first step and must be followed by more detailed experiments to assign responsibility to the library preparation, sequencing, or computational stages.”

https://genomebiology.biomedcentral.com/articles/10.1186/gb-2013-14-5-r51

Obviously, bias in the creation of a genome would lead to an inaccurate and unreliable representation of whatever is being sequenced. This problem stems from the very first step in the genome creation process. Fragmentation is required due to the technological limitations and cannot be avoided, so bias must be expected to enter the picture from the very start.
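
To make “coverage bias” a bit more concrete, here is a small self-contained Python sketch (my own illustration; the GC-dependent dropout rule is made up for demonstration, not taken from either paper). Reads are sampled uniformly from a toy reference, but fragments at GC extremes are discarded with higher probability, mimicking the losses described above; counting the surviving reads per window then shows the AT-rich and GC-rich regions sinking below the uniform ideal.

```python
import random

def gc_content(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)

rng = random.Random(0)
# toy reference: a GC-poor third, a balanced random third, a GC-rich third
genome = ("AT" * 250
          + "".join(rng.choice("ACGT") for _ in range(500))
          + "GC" * 250)

READ_LEN, N_READS = 50, 20_000
counts = [0] * (len(genome) - READ_LEN + 1)
for _ in range(N_READS):
    start = rng.randrange(len(genome) - READ_LEN + 1)
    gc = gc_content(genome[start:start + READ_LEN])
    # hypothetical dropout: fragments survive library prep/PCR with a
    # probability that falls off toward GC extremes (the bias being modeled)
    if rng.random() < 1.0 - 2.0 * abs(gc - 0.5):
        counts[start] += 1

# per-window summary: GC content vs. number of surviving reads
WIN = 100
for w in range(0, len(counts), WIN):
    window = genome[w:w + WIN]
    print(f"window {w:4d}  GC={gc_content(window):.2f}  reads={sum(counts[w:w + WIN])}")
```

The uniform ideal would give every window roughly the same read count; instead the GC-extreme windows crater, which is the coverage bias the authors describe, produced here by a deliberately simplistic dropout rule.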

How does the fragmentation that creates this bias actually occur? This next source provides a good overview of the various methods used to fragment DNA:

DNA Fragmentation Methods

“While the preparation of intact high molecular weight DNA and prevention of shearing throughout most workflow steps is important, DNA fragmentation is a necessary step in sample prep for most sequencing platforms. For example, short read sequencing platforms generally rely on fragments of ~300-600 bp, while long read sequencing platforms are compatible with fragments many kb in length.

It is essential to choose a fragmentation method that reliably generates appropriately sized fragments without introducing bias as to the cut sites or base composition of the fragments. There are several options for the DNA fragmentation step, each with its own unique profile of benefits and drawbacks, and the selection of a fragmentation method should be carefully weighed. Methods used include:

Enzyme-based treatments: These fragment DNA by simultaneously cleaving both strands, or by generating nicks on each strand of dsDNA to produce dsDNA breaks. This method is highly flexible and can be used for generation of fragments from low bp to many kb in length.

Acoustic shearing: Short-wavelength, high-frequency acoustic energy is focused on the DNA sample, physically disrupting the DNA molecule, without requiring temperature changes. This method is used for generation of fragments from low hundreds of bp to many kb in length.

Sonication: Specialized sonicators subject DNA to unfocused, longer-wavelength acoustic energy; sonicators require a cooling period between sonication bursts. This method is used for generation of fragments many kb in length.

Centrifugal shearing: DNA can be sheared by the use of centrifugal force to move DNA through a hole of a specific size; the rate of centrifugation determines the degree of DNA fragmentation. This method is used for generation of fragments many kb in length.

Point-sink shearing: A syringe pump generates hydrodynamic shear forces within a tube; the size of the constriction and the flow rate of the liquid determine the DNA fragment size. This method is used for generation of fragments many kb in length.

Needle shearing: The lowest-tech option relies on shearing forces created by passing DNA through a small gauge needle. This method is used for generation of fragments tens of kb in length.”

https://www.neb.com/applications/ngs-sample-prep-and-target-enrichment/dna-fragmentation

The above six methods are the main ones used for DNA fragmentation, but as discussed, there are many biases and errors that can be incorporated into this crucial step and carried over into the subsequent steps required for genome sequencing. This creates sequencing results that are unreproducible and irreplicable. The limitations and difficulties encountered during shearing/fragmentation which can lead to these problems are briefly outlined here:

How it Works: Accurate Shearing of Chromosomal DNA

“Problem: All contemporary, as well as NextGen and Third Generation sequencing methodologies are dependent on the generation of DNA fragments from initial MegaBase long chromosomal DNA molecules. General requirements are that such fragments have to be random and of similar size. Many methods of DNA fragmentation have been developed. Most of these methods include limitations and difficulties including high costs, fragmentation, broad fragment size distribution, or irreversible damage in DNA fragments. In addition, the results can become concentration dependent and/or not highly reproducible.”

https://www.labmanager.com/how-it-works/accurate-shearing-of-chromosomal-dna-19660
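
The “broad fragment size distribution” complaint is easy to demonstrate. Below is a minimal sketch under the simplest possible model of random shearing (every backbone position breaks independently with a small probability, used here as a stand-in for shearing intensity or DNA concentration; the numbers are arbitrary). The resulting fragment lengths follow a roughly geometric distribution, whose standard deviation is about as large as its mean (inherently broad), and whose center shifts with the breakage rate, echoing the concentration dependence noted above.

```python
import random
import statistics

def random_shear(length, p_break, rng):
    """Break a molecule of 'length' bp at each position with probability
    'p_break'; return the resulting fragment lengths."""
    frags, start = [], 0
    for pos in range(1, length):
        if rng.random() < p_break:
            frags.append(pos - start)
            start = pos
    frags.append(length - start)
    return frags

rng = random.Random(42)
for p in (0.001, 0.005, 0.02):    # stand-in for shearing intensity/concentration
    sizes = []
    for _ in range(50):           # 50 molecules of 20 kb each
        sizes.extend(random_shear(20_000, p, rng))
    mean, sd = statistics.mean(sizes), statistics.stdev(sizes)
    # sd ~ mean: the spread of purely random shearing is as wide as its average
    print(f"p_break={p}: mean fragment = {mean:.0f} bp, sd = {sd:.0f} bp")
```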

This next source differentiates between physical fragmentation via ultra-sonication, which is known to cause a loss of DNA, and enzymatic fragmentation, whose effects have not been carefully assessed and which leads to artefacts and biases. It states that the sequencing process involves multiple steps at which biases and errors can be introduced. Even though there have been attempts to minimize these errors, they remain persistent. The authors claim that, due to the persistence of these errors and artefacts, they may be mistaken for mutations and added to the genome.

Sequencing artifacts derived from a library preparation method using enzymatic fragmentation

“DNA fragmentation is a fundamental step during library preparation in hybridization capture-based, short-read sequencing. Ultra-sonication has been used thus far to prepare DNA of an appropriate size, but this method is associated with a considerable loss of DNA sample. More recently, studies have employed library preparation methods that rely on enzymatic fragmentation with DNA endonucleases to minimize DNA loss, particularly in nano-quantity samples. Yet, despite their wide use, the effect of enzymatic fragmentation on the resultant sequences has not been carefully assessed. Here, we used pairwise comparisons of somatic variants of the same tumor DNA samples prepared using ultrasonic and enzymatic fragmentation methods. Our analysis revealed a substantially larger number of recurrent artifactual SNVs/indels in endonuclease-treated libraries as compared with those created through ultrasonication. These artifacts were marked by palindromic structure in the genomic context, positional bias in sequenced reads, and multi-nucleotide substitutions.”

“Genome sequencing using hybridization capture comprises multiple steps, including tissue processing, tissue storage, DNA isolation, DNA fragmentation, probe hybridization, library amplification, sequencing, and informatics analysis [25]. Sequencing errors can be introduced at any of these steps, and nucleotides can be further modified through oxidation during tissue processing, tissue storage, DNA isolation, and DNA fragmentation [246]. Nucleotide incorporation errors can in turn create polymerase reaction biases, affect precise library amplification, and generate sequencing noise [24]. Although substantial efforts have been made to minimize such sequencing noise experimentally, stochastic errors remain persistent [24].

Ultra-sonication has long been a standard method for DNA fragmentation during hybridization capture-based, short-read sequencing. Ultrasonication creates even cuts in the DNA across the entire genome, thereby providing a simple means of controlling fragment size in a non-biased manner [23]. However, the physical scattering of DNA solution during the process often leads to a loss of DNA sample, which can be critical when the sample amount is limited to nano- or picogram quantities, such as found with biopsied tissue fragments. Several commercial library preparation kits are available, including the HyperPlus (KAPA Biosystems), SureSelect QXT (Agilent Technologies), Fragmentase (New England Biolabs), and Nextera Tagmentation (Illumina) kits, each of which uses endonucleases or transposases for DNA fragmentation. Although these kits minimize DNA loss, it remains largely unknown what degree of sequencing errors are caused by the enzymatic fragmentation process.”

“DNA fragmentation is a necessary step in the preparation of nucleic acids, as the quality of the sequencing is contingent on both the randomness of the DNA fragmentation as well as the overlap of the resultant fragments. Furthermore, because fragment size tends to differ across NGS platforms and sequencing runs, efficient control of DNA fragment size is imperative.”

“Given that endonucleases themselves are incapable of incorporating nucleotides into the DNA or causing mutations [19], we speculate that mutations arise after enzymatic fragmentation during the “fill-in process” orchestrated by the DNA polymerase for end repair (“End repair & A-tailing enzyme” prior to adaptor ligation in the HyperPlus kit). Ultra-sonication randomly cleaves DNA molecules at different genomic positions and, therefore, in the subsequent fill-in process, nucleotides are incorporated at different genomic positions in different DNA molecules. Even if an erroneous nucleotide is incorporated into the cleaved sites, the resultant artifact would not be recognized as a mutation, because it would not consistently appear at the same position on different molecules. However, because the HyperPlus endonuclease preferentially cleaves specific sites on the DNA, when an erroneous nucleotide is incorporated, the resultant artifact could be mistakenly recognized as a mutation because it appears repeatedly at the same position on different molecules.”

“We found a substantial number of somatic SNVs/indels in the paired analysis of the six tumor samples using the SureSelect treatment for normal samples and the HyperPlus treatment for tumor samples (SH). We considered that such noise could be avoided by using the same DNA fragmentation method for paired samples (i.e., HH combination), and tested this using samples from two rectal cancer cases. Even though we confirmed a substantial reduction in the number of SNVs/indels using just one fragmentation method, upon careful examination, we detected the persistence of HyperPlus noise among the resultant SNVs/indels from the HH combination; this noise was frequently classified by the algorithm in other pairwise comparisons and characterized by palindromic structure. This finding reinforces our proposal of the risk that persistent errors may be confused with genuine mutations due to their recurrent appearance in a cohort.”

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0227427
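
The mechanism the authors describe, where random cut sites scatter fill-in errors while preferred cut sites stack them at the same positions, can be illustrated with a small simulation (my own sketch; the error rate, cut counts, and site numbers are made-up values, not figures from the paper):

```python
import random
from collections import Counter

rng = random.Random(7)
GENOME_LEN = 10_000
N_MOLECULES = 5_000
ERR_RATE = 0.01   # chance of a polymerase fill-in error at any given cut end
# hypothetical set of sites preferentially cleaved in the 'enzymatic' condition
PREFERRED = sorted(rng.sample(range(GENOME_LEN), 50))

def cut_positions(mode):
    if mode == "random":                # sonication-like: cuts land anywhere
        return rng.sample(range(GENOME_LEN), 5)
    return rng.sample(PREFERRED, 5)     # enzyme-like: cuts recur at hot spots

for mode in ("random", "enzymatic"):
    errors = Counter()
    for _ in range(N_MOLECULES):
        for pos in cut_positions(mode):
            if rng.random() < ERR_RATE:  # erroneous nucleotide filled in at the cut
                errors[pos] += 1
    # an artifact looks like a 'mutation' when it recurs at one position
    recurrent = sum(1 for n in errors.values() if n >= 5)
    print(f"{mode}: positions with >=5 recurring errors = {recurrent}")
```

With the same overall error rate, the random condition virtually never shows the same error five times at one position, while the enzymatic condition does so at many of its preferred sites, which is exactly why such artifacts can masquerade as recurrent mutations.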

RNA Fragmentation

The above methods detailed the fragmentation process as it relates to DNA. When dealing with RNA, there are additional steps that must be carried out in order to be able to sequence a genome. This is done in two different ways. The first method is accomplished by taking RNA and converting it into cDNA, which is DNA synthesized from an RNA template in a reaction catalyzed by the enzyme reverse transcriptase, as carried out through RT-PCR. The same DNA shearing methods cited above are then used to shear the cDNA to create the necessary fragments for sequencing.
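
As a minimal illustration of what that conversion step does (a sketch of the base-pairing logic only; real reverse transcription involves primers, enzyme kinetics, and its own error rate, none of which are modeled here):

```python
# Watson-Crick pairing applied by reverse transcriptase (RNA base -> DNA base)
PAIR = {"A": "T", "U": "A", "G": "C", "C": "G"}

def first_strand_cdna(rna_5to3):
    """Return the first-strand cDNA (written 5'->3') for an RNA template.
    The enzyme reads the template 3'->5', so the cDNA is the reverse
    complement of the RNA, with U pairing as T does."""
    return "".join(PAIR[base] for base in reversed(rna_5to3))

rna = "AUGGCCAUUGUAAUGGGCCGC"
print(first_strand_cdna(rna))   # the complementary DNA strand for this toy RNA
```

Every base in the cDNA is thus an inference from the RNA template, so any mispairing at this step is carried into everything sequenced downstream.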

The second method skips the cDNA conversion and fragments the RNA directly, either chemically, using metal ions such as Mg++ and Zn++ at high temperatures under alkaline conditions, or enzymatically, utilizing RNases. This RNA sequencing process is detailed in this next source, which shows how the extremely complicated workflow for RNA sequencing easily produces bias. This obviously leads to inaccurate and erroneous interpretations:

Bias in RNA-seq Library Preparation: Current Challenges and Solutions

“Although RNA sequencing (RNA-seq) has become the most advanced technology for transcriptome analysis, it also confronts various challenges. As we all know, the workflow of RNA-seq is extremely complicated and it is easy to produce bias. This may damage the quality of RNA-seq dataset and lead to an incorrect interpretation for sequencing result. Thus, our detailed understanding of the source and nature of these biases is essential for the interpretation of RNA-seq data, finding methods to improve the quality of RNA-seq experimental, or development bioinformatics tools to compensate for these biases. Here, we discuss the sources of experimental bias in RNA-seq.”

“However, RNA-seq is a process of extremely intricate, including RNA extraction and purification, library construction, sequencing, and bioinformatics analysis. These processes can inevitably introduce some deviations (Table 1), which influence the quality of RNA-seq datasets and result in their erroneous interpretation. Therefore, understanding these biases is critical to avoiding erroneous interpretation of the data and to realize the full potential of this powerful technology.”

“Generally, the representative workflow of RNA-seq analysis includes the extraction and purification of RNA from cell or tissue, the preparation of sequencing library, including fragmentation, linear or PCR amplification, RNA sequencing, and the processing and analysis of sequencing data (Figure 1). Commonly used NGS platforms, including Illumina and Pacific Biosciences, need PCR amplification during library construction to increase the number of cDNA molecules to meet the needs of sequencing. Nevertheless, the most problematic step in sample preparation procedures is amplification. It is due to the fact that PCR amplification stochastically introduces biases, which can propagate to later cycles [2]. In addition, PCR also amplifies different molecules with unequal probabilities, leading to the uneven amplification of cDNA molecules [34]. Recently, researchers have proposed several different methods in order to reduce PCR amplification, such as PCR-free protocols and isothermal amplification. Nevertheless, these methods are not perfect and still present some artifacts and biases of sequencing. Consequently, understanding these biases is critical to get reliable data and will provide some useful advice to the researcher.”

3.3. RNA Fragmentation

“Currently, RNA is usually fragmented due to read length restriction (<600 bp) of sequencing technologies and the sensitivity of amplification to long cDNA molecules. There are two major approaches of RNA fragmentation: chemical (using metal ions) and enzymatic (using RNase III) [30]. Commonly, RNA is fragmented using metal ions such as Mg++ and Zn++ in high temperatures and alkaline conditions. This method yields more accurate transcript identification than RNase III digestion [31]. This result was also confirmed in Wery et al. [31]. Furthermore, intact RNAs can be reverse transcribed (RT) to cDNA by reverse transcriptase, subsequently was fragmented. Then, the cDNA was fragmented using the enzymatic or physical method. Examples of the enzymatic method include DNase I digestion, nonspecific endonuclease (like NEBNext dsDNA Fragmentase from New England Biolabs), and transposase-mediated DNA fragmentation (Illumina Nextera XT). However, the Tn5 transposase method showed sequence-specific bias [32], which is the preferred method when only small quantities of cDNA are available, since the cDNA fragmentation and adapter ligation are connected in one step [33]. Studies have shown that nonspecific restriction endonucleases indicate less sequence bias and have been shown to perform similarly to the physical methods with respect to cleavage-site sequence bias and coverage uniformity of target DNA [34, 35]. Another advantage of the enzymatic method is that they are easy to automate [36]. The physical method includes acoustic shearing, sonication, and hydrodynamic [17, 37, 38], which also can present nonrandom DNA fragmentation bias [35]. However, the physical cDNA fragmentation method is less amenable to automation than RNA fragmentation. Therefore, the physical method will be replaced by commercially available kits and the enzymatic method.”

5. Discussion and Conclusion

“At the present, RNA-seq has been widely used in biological, medical, clinical, and pharmaceutical research. However, all these sequencing studies are limited by the accuracy of underlying sequencing experiments, because RNA-seq technology may introduce various errors and biases in sample preparation, library construction, sequencing and imaging, etc.”

https://www.hindawi.com/journals/bmri/2021/6647597/

Small RNAs are lost?

The same problems regarding errors and biases relating to the fragmentation of DNA still occur with the fragmentation of cDNA converted from RNA. This is further complicated by the errors/contamination involved in the RT-PCR process used to synthesize the cDNA, which I will post about at a different time. Yet even without this step, using metal ions/RNases directly on the RNA, problems still occur:

Principles of Nucleic Acid Cleavage by Metal Ions

“Often molecular biologists dealing with DNA and RNA believe that the properties of metal ions are simpler than the properties of nucleic acids due to their smaller size. In fact, both nucleic acids and metal ions exhibit considerable complexity in their interactions, and such interactions could affect the chemical and biochemical properties of both parties.

Metal ions are usually required to promote and stabilize functionally active or native conformations of nucleic acids, but they can also trap polynucleotides in inactive conformations (Heilman-Miller et al. 2001). In addition to structural roles, most polyvalent metal ions (M2+) can induce cleavage (i.e., breakage, scission, fragmentation, depolymerization, or rupture) of nucleic acids. These reactions can be either non-specific or dependent on the chemical nature of nucleotide residues, nucleic acid sequence, or secondary and/or tertiary structure. The specificity of these reactions is dependent on both the nucleic acid conformation and metal binding modes as well as the properties of the metal ions.”

“Here, we critically review the fundamental factors that influence the efficacy and specificity of nucleic acid cleavage promoted by metal ions. These (frequently disregarded) factors can have dramatic effects in experiments or might result in artifacts or misinterpretation of experimental data.”

“DNA and RNA differ sharply in the stability of their phosphodiester bonds in aqueous solutions. In general, DNA is thought to be more stable than RNA (Li and Breaker 1999; Thorp 2000; Williams et al. 1999). Indeed, RNA is easily cleaved even in mild alkali solutions while DNA is stable under these conditions. Contamination of solutions by RNases and M2+ ions may result in fast degradation of RNA even at neutral pH and low temperatures.”

https://static1.squarespace.com/static/5ae2118f697a984bb60bd6c3/t/5b2ac74c352f53d7dbf5609e/1529530193348/Dallas_2004.pdf

These last few sources provide further details on the different types of biases and errors which can be introduced during the RNA fragmentation process. These include inaccurate statistical models, positional biases due to RNA degradation and/or the PCR amplification step, sequence-specific biases based on faulty assumptions, and the uncertainty of the biochemistry of many of the involved steps. Just as with DNA fragmentation, contamination, errors, artefacts, and biases are all common to RNA fragmentation as well:

Mixture models reveal multiple positional bias types in RNA-Seq data and lead to accurate transcript concentration estimates

“Despite these advantages, obtaining accurate transcript quantification measurements from RNA-Seq has proven difficult. One of the main reasons for the inaccuracy is the failure of the statistical models used in the derivation of the measurements to properly represent biases inherent in RNA-Seq data. The statistical model of the original version of Cufflinks [2], for instance, assumes that the cDNA fragments generated by RNA-Seq are uniformly distributed along the transcripts. In reality, however, this assumption is rarely fulfilled and quantification measurements by this version of Cufflinks are therefore often inaccurate.

One type of bias affecting transcript quantification from RNA-Seq data is the result of a preference of the fragmentation, i.e. the process that generates cDNA fragments from RNA transcripts, to produce fragments at certain positions within the transcript, e.g. at the start and/or at the end of the transcript [3]. Hence, this type of bias is referred to as positional bias [4]. Positional bias can also be caused by a bias in the RNA itself, for instance, due to RNA degradation which results in a shortening of the RNA. Another kind of bias in RNA-Seq is introduced during ligation, amplification and NGS sequencing [5]. This bias is correlated to the RNA sequence of a transcript and is therefore called sequence specific bias [4].”

https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005515#sec001

Improving RNA-Seq expression estimates by correcting for fragment bias

“Unfortunately, current technological limitations of sequencers require that the cDNA molecules represent only partial fragments of the RNA being probed. The cDNA fragments are obtained by a series of steps, often including reverse transcription primed by random hexamers (RH), or by oligo (dT). Most protocols also include a fragmentation step, typically RNA hydrolysis or nebulization, or alternatively cDNA fragmentation by DNase I treatment or sonication. Many sequencing technologies also require constrained cDNA lengths, so a final gel cutting step for size selection may be included.”

“The randomness inherent in many of the preparation steps for RNA-Seq leads to fragments whose starting points (relative to the transcripts from which they were sequenced) appear to be chosen approximately uniformly at random. This observation has been the basis of assumptions underlying a number of RNA-Seq analysis approaches that, in computer science terms, invert the ‘reduction’ of transcriptome estimation to DNA sequencing [26]. However, recent careful analysis has revealed both positional [7] and sequence-specific [8, 9] biases in sequenced fragments. Positional bias refers to a local effect in which fragments are preferentially located towards either the beginning or end of transcripts. Sequence-specific bias is a global effect where the sequence surrounding the beginning or end of potential fragments affects their likelihood of being selected for sequencing. These biases can affect expression estimates [10], and it is therefore important to correct for them during RNA-Seq analysis.

Although many biases can be traced back to specifics of the preparation protocols (see Figure 2 and [8]), it is currently not possible to predict fragment distributions directly from a protocol. This is due to many factors, including uncertainty in the biochemistry of many steps and the unknown shape and effect of RNA secondary structure on certain procedures [10].”

https://genomebiology.biomedcentral.com/articles/10.1186/gb-2011-12-3-r22
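
A rough sketch of how positional bias breaks the uniform-fragment assumption (my own toy model: “degradation” is simulated as a simple preference for 3'-end fragments, and none of the numbers come from the papers):

```python
import random

rng = random.Random(3)
TX_LEN, FRAG_LEN = 2_000, 200
MAX_START = TX_LEN - FRAG_LEN

def sample_fragment_starts(n, bias_3prime):
    """Sample fragment start positions on one transcript. bias_3prime=0
    gives uniform starts (the model's assumption); larger values skew
    fragments toward the 3' end, as with degraded RNA."""
    starts = []
    while len(starts) < n:
        s = rng.randrange(MAX_START + 1)
        accept = 1.0 if bias_3prime == 0 else (s / MAX_START) ** bias_3prime
        if rng.random() <= accept:
            starts.append(s)
    return starts

# two transcripts with the same true abundance: one intact, one degraded
for label, bias in (("intact", 0), ("degraded", 2)):
    starts = sample_fragment_starts(10_000, bias)
    from_5p_half = sum(1 for s in starts if s < MAX_START / 2)
    print(f"{label}: fragments from the 5' half = {from_5p_half}/10000")
```

A model assuming uniform starts expects roughly half the fragments from each half of the transcript; the degraded sample violates that badly, and any abundance estimate built on the uniform assumption inherits the distortion.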

Mutation or biases, errors, and/or artefacts?

In Summary:

  • Next Generation Sequencing (NGS) technology is based on cutting DNA into small fragments and their massive parallel sequencing
  • This sequencing approach is based on the assumption that genomic DNA breaks are random and sequence-independent
  • An alternative and, as it initially seemed, incoherent, sequencing method was proposed back in 1979, whereby the multiple copies of a whole genome DNA were to be broken up into small fragments, which were sequenced and then these sequences (termed “reads”) were assembled into a continuous text based on the overlapping ends
  • Samples of purified DNA are sheared into short fragments, using either mechanical methods (e.g., ultrasonication shearing and nebulization) or enzymatic digestion
  • Template amplification is performed by PCR on solid surface, or by emulsion PCR into separate microreactors, beads or spheres within sequencers
  • The bias in NGS data has been extensively observed but there is no agreement with respect to the sources of the observed bias
  • Benjamini and Speed hypothesized that the most important cause of the GC bias was PCR
  • The bias in NGS reads strongly correlates with the bias produced by sonication of pure restriction DNA fragments
  • They reported and analyzed bias produced by three DNA shearing methods, sonication, nebulization and Covaris and demonstrated that the instances of bias in different fragmentation methods highly correlate with each other and are common for all of the hydrodynamic DNA shearing methods
  • All existing sequencing technologies fall short of an ideal of distributing reads uniformly across the genome without sequence-dependent variations in quality and instead exhibit various types and degrees of bias
  • Sequencing bias degrades genomic data applications, including genome assembly and variation discovery, which rely on genome-wide coverage
  • Coverage bias is a deviation from the uniform distribution of reads across the genome
  • Error bias is a deviation from the expectation of uniform mismatch, insertion, and deletion rates in reads across the genome
  • Sequencing technologies are vulnerable to multiple sources of bias:
    1. Methods based on bacterial cloning and Sanger-chemistry sequencing were subject to many coverage-reducing biases, notably at GC extremes, palindromes, inverted repeats, and sequences toxic to the bacterial host
    2. Illumina sequencing has been shown to lose coverage in regions of high or low GC, a phenomenon also seen in other ‘next-generation’ technologies
    3. PCR amplification during library construction is a known source of undercoverage of GC-extreme regions and similar biases may also be introduced during bridge PCR for cluster amplification on the Illumina flowcell
    4. Illumina strand-specific errors can lead to coverage biases by impairing aligner performance
    5. Ion Torrent, like 454, utilizes a terminator-free chemistry that may limit its ability to accurately sequence long homopolymers and may also be sensitive to coverage biases introduced by emulsion PCR in library construction
    6. Complete Genomics also uses amplification along with a complex library construction process
  • Signal-processing and base calling limitations could result in under-representation or increased error rates in some locations, as can inaccurate alignment
  • An inaccurate reference or sample-reference differences can cause coverage or accuracy variations that may be misdiagnosed as sequencing bias
  • DNA fragmentation is a necessary step in sample prep for most sequencing platforms
  • It is essential to choose a fragmentation method that reliably generates appropriately sized fragments without introducing bias as to the cut sites or base composition of the fragments
  • The main methods used for fragmentation include:
    1. Enzyme-based treatments
    2. Acoustic shearing
    3. Sonication
    4. Centrifugal shearing
    5. Point-sink shearing
    6. Needle shearing
  • Most of these methods include limitations and difficulties such as:
    1. High costs
    2. Fragmentation
    3. Broad fragment size distribution
    4. Irreversible damage in DNA fragments
  • The results can become concentration dependent and/or not highly reproducible
  • Ultra-sonication has been used thus far to prepare DNA of an appropriate size, but this method is associated with a considerable loss of DNA sample
  • Despite wide use, the effect of enzymatic fragmentation on the resultant sequences has not been carefully assessed
  • An analysis revealed a substantially larger number of recurrent artifactual SNVs/indels in endonuclease-treated libraries as compared with those created through ultrasonication
  • These artifacts were marked by palindromic structure in the genomic context, positional bias in sequenced reads, and multi-nucleotide substitutions
  • Genome sequencing using hybridization capture comprises multiple steps, including:
    1. Tissue processing
    2. Tissue storage
    3. DNA isolation
    4. DNA fragmentation
    5. Probe hybridization
    6. Library amplification
    7. Sequencing
    8. Informatics analysis
  • Sequencing errors can be introduced at any of these steps, and nucleotides can be further modified through oxidation during tissue processing, tissue storage, DNA isolation, and DNA fragmentation
  • Nucleotide incorporation errors can in turn create polymerase reaction biases, affect precise library amplification, and generate sequencing noise
  • Although substantial efforts have been made to minimize such sequencing noise experimentally, stochastic errors remain persistent
  • The physical scattering of DNA solution during the process often leads to a loss of DNA sample
  • Several commercial library preparation kits are available yet although these kits minimize DNA loss, it remains largely unknown what degree of sequencing errors are caused by the enzymatic fragmentation process
  • The quality of the sequencing is contingent on both the randomness of the DNA fragmentation as well as the overlap of the resultant fragments
  • Because fragment size tends to differ across NGS platforms and sequencing runs, efficient control of DNA fragment size is imperative
  • It is speculated that mutations arise after enzymatic fragmentation during the “fill-in process” orchestrated by the DNA polymerase for end repair
  • Ultra-sonication randomly cleaves DNA molecules at different genomic positions and, therefore, in the subsequent fill-in process, nucleotides are incorporated at different genomic positions in different DNA molecules.
  • Because the HyperPlus endonuclease preferentially cleaves specific sites on the DNA, when an erroneous nucleotide is incorporated, the resultant artifact could be mistakenly recognized as a mutation because it appears repeatedly at the same position on different molecules
  • It was considered noise could be avoided by using the same DNA fragmentation method for paired samples yet even though there was a substantial reduction in the number of SNVs/indels using just one fragmentation method, upon careful examination, persistence of HyperPlus noise among the resultant SNVs/indels from the HH combination was detected
  • This noise was frequently classified by the algorithm in other pairwise comparisons and characterized by palindromic structure
  • This finding reinforced the proposal of the risk that persistent errors may be confused with genuine mutations due to their recurrent appearance in a cohort
  • For RNA genomes, the workflow of RNA-seq is extremely complicated and bias is easily produced
  • This may damage the quality of the RNA-seq dataset and lead to an incorrect interpretation of the sequencing result
  • RNA-seq is an extremely intricate process, including:
    1. RNA extraction and purification
    2. Library construction (including fragmentation)
    3. Linear or PCR amplification
    4. Sequencing
    5. Bioinformatics analysis
  • These processes can inevitably introduce some deviations which influence the quality of RNA-seq datasets and result in their erroneous interpretation
  • Therefore, understanding these biases is critical to avoiding erroneous interpretation of the data and to realize the full potential of this technology
  • The most problematic step in sample preparation procedures is amplification due to the fact that PCR amplification stochastically introduces biases, which can propagate to later cycles
  • In addition, PCR also amplifies different molecules with unequal probabilities, leading to the uneven amplification of cDNA molecules
  • RNA is usually fragmented due to read length restriction (<600 bp) of sequencing technologies and the sensitivity of amplification to long cDNA molecules
  • There are two major approaches of RNA fragmentation: chemical (using metal ions) and enzymatic (using RNase III)
  • Commonly, RNA is fragmented using metal ions such as Mg++ and Zn++ in high temperatures and alkaline conditions
  • The Tn5 transposase method, which is preferred when only small quantities of cDNA are available since fragmentation and adapter ligation are combined in one step, showed sequence-specific bias
  • The physical methods include acoustic shearing, sonication, and hydrodynamic shearing, which can also present nonrandom DNA fragmentation bias
  • All sequencing studies are limited by the accuracy of underlying sequencing experiments, because RNA-seq technology may introduce various errors and biases in sample preparation, library construction, sequencing and imaging, etc.
  • Both nucleic acids and metal ions exhibit considerable complexity in their interactions, and such interactions could affect the chemical and biochemical properties of both parties.
  • Most polyvalent metal ions (M2+) can induce cleavage (i.e., breakage, scission, fragmentation, depolymerization, or rupture) of nucleic acids.
  • The specificity of these reactions is dependent on both the nucleic acid conformation and metal binding modes as well as the properties of the metal ions
  • There are several frequently disregarded factors that can have dramatic effects in experiments or might result in artifacts or misinterpretation of experimental data
  • Contamination of solutions by RNases and M2+ ions may result in fast degradation of RNA even at neutral pH and low temperatures
  • Obtaining accurate transcript quantification measurements from RNA-Seq has proven difficult
  • One of the main reasons for the inaccuracy is the failure of the statistical models used in the derivation of the measurements to properly represent biases inherent in RNA-Seq data
  • The statistical model of the original version of Cufflinks, for instance, assumes that the cDNA fragments generated by RNA-Seq are uniformly distributed along the transcript
  • In reality, however, this assumption is rarely fulfilled and quantification measurements by this version of Cufflinks are therefore often inaccurate
  • One type of bias affecting transcript quantification from RNA-Seq data is the result of a preference of the fragmentation, i.e. the process that generates cDNA fragments from RNA transcripts, to produce fragments at certain positions within the transcript, e.g. at the start and/or at the end of the transcript
  • Positional bias can also be caused by a bias in the RNA itself, for instance, due to RNA degradation which results in a shortening of the RNA
  • Another kind of bias in RNA-Seq is sequence specific bias which is introduced during ligation, amplification and NGS sequencing
  • Current technological limitations of sequencers require that the cDNA molecules represent only partial fragments of the RNA being probed
  • The randomness inherent in many of the preparation steps for RNA-Seq leads to fragments whose starting points (relative to the transcripts from which they were sequenced) appear to be chosen approximately uniformly at random
  • This observation has been the basis of assumptions underlying a number of RNA-Seq analysis approaches that, in computer science terms, invert the ‘reduction’ of transcriptome estimation to DNA sequencing
  • Recent careful analysis has revealed both positional and sequence-specific biases in sequenced fragments
  • These biases can affect expression estimates, and it is therefore important to correct for them during RNA-Seq analysis
  • Although many biases can be traced back to specifics of the preparation protocols, it is currently not possible to predict fragment distributions directly from a protocol
  • This is due to many factors, including uncertainty in the biochemistry of many steps and the unknown shape and effect of RNA secondary structure on certain procedures

How many errors, artefacts, and biases will be introduced throughout each and every step? Can they all be caught and accounted for?

It should be obvious that no matter what method is used for the fragmentation of DNA/RNA, biases and errors are an unavoidable occurrence. This leads to many inaccuracies and to the inability to reproduce results. The more steps required to manipulate and craft the sample into something the limited technology can read, the further from reality that sample becomes. The occurrence of biases, errors, and artefacts during fragmentation can start a chain reaction that will entirely demolish any sequencing effort. This is not even taking into account the contamination and potential RNA degradation sure to occur during the extraction procedures performed on the sample prior to fragmentation. If the earliest steps involved in creating a genome hold this much potential for interference with the sequencing process, ultimately leading to unreliable results, what faith can there be that the final product is an accurate representation?

14 comments

  1. I couldn’t begin to understand this, but appreciate the explanation. I recall back about 20-25 years ago, there were a few shows like “Forensic Files” where DNA was used to solve murder mysteries. Seems like it was pretty much cut and dried as far as making up a DNA profile on someone or some form of DNA like blood or skin always using a PCR evaluation. I now wonder how accurate these real life instances were when DNA profiles were created. Certainly there is no basis for using PCR to determine covid cases or to identify any other proposed virus.

    1. I’m not going to claim I even understand it all. There are so many confusing layers to peel on the genomics onion. I think the important thing is to understand that the more processes the material is put through, the less it resembles reality. There are too many variables and factors that can throw off the results and the work is largely not reproducible. It has definitely made me question how accurate and reliable DNA tests truly are. My feelings at the moment are that they are not.

    2. I recall the show “Adam ruins everything” mentioned how DNA testing is wrong half the time, but because the jury and the legal idiots follow “experts”, this is not a testable science…. So why the fk do they act like it is? Ugh, no wonder why I couldn’t stand engineering school, it’s such poppycock.

      I’m of the opinion that any science that needs ridiculous amplification to be “seen”, like quantum theory, dark matter/energy, etc, is a science of imagination trying to explain the visible.
      Might as well believe in aliens or gods that can appear and disappear, no difference there .. just circular reasoning. But I will say that flat earth is nonsense 🤪😂

      Anything goes in this 🦠🤡🌎 (covid clown world)

  2. Hi Mike,
    shotgun sequencing is dubious enough for any organism but the virologists have another fatal problem. We note that in the first paper you cover they state the sample needs to be PURIFIED first.

    “The first step of DNA sequencing in the NGS technology is DNA fragmentation. Samples of purified DNA are sheared into short fragments, using either mechanical methods (e.g., ultrasonication shearing and nebulization) or enzymatic digestion.”

    We know the virologists’ specimens are certainly not purified so they have no idea of the provenance of the sequences before they even start the sequencing process. They refer to their “reference library” of course but we know that none of these reference sequences were demonstrated to be viral in origin either. However, the databases are now flooded with meaningless sequence attributions.

    It is strange to see when metagenomics deteriorates into statistical analysis. Either the genome was in the sample, or it wasn’t. Like a virus: either it exists or it doesn’t – it is not an exercise in probability.

    Keep up the great work, Sam gave you a shout out in her latest video: https://odysee.com/@drsambailey:c/electron-microscopy-and-unidentified-viral-objects:f

    Cheers,
    Mark

    1. Oh wow! Tell her thanks for me! I really appreciate it. 🙂

      You are absolutely spot on. The lack of purification is a huge issue. As they break their genetic soup down to smaller and smaller fragments, there is no way they would be able to determine which piece of the puzzle goes to which of the various puzzles mixed together. It amazes me how they can not see this.

      BTW, I love your articles and wanted to comment but I could not find a way. Is there a comment section?

      The 200 million “viruses” per sneeze was new information to me. How would they not be able to find enough “virus” for purification directly from a sample if someone is projecting out 200 million copies per sneeze? Their lies do not hold up under scrutiny.

      1. Hi Mike,
        we’ve kept comments off the website as people can already comment on YouTube, Odysee, and SubscribeStar. With all the correspondence that comes to Sam it gets too much to follow. We’re happy that you’re running this blog which is building great momentum – hope you’re not overwhelmed with questions too soon!

        Yes, no matter how sick a person is with a “viral” infection they can never find any viruses in them. Even if they collect sneezes from a person all day and then filter/DG centrifuge the sample, apparently we’d see none either…

        Cheers,
        Mark

      2. Your video finally got me to sign up on Odysee so I could comment. 😉

        I’m not overwhelmed with questions but I have been getting a lot of interview requests which has been a tad overwhelming. I was kind of comfortable sitting back in the shadows so to speak but I understand it comes with the territory. If we want this information to get out there, we need to promote it. I’m very focused on getting all of my old posts updated on to the site so it’s just a balancing act at the moment.
