The Challenges in Genome Library Construction

“What is truly revolutionary about molecular biology in the post-Watson-Crick era is that it has become digital…the machine code of the genes is uncannily computer-like.”

-Richard Dawkins

According to the Britannica, a genomic library is essentially “a collection of DNA fragments that make up the full-length genome of an organism. A genomic library is created by isolating DNA from cells and then amplifying it using DNA cloning technology.” Creating the library for the sequencing of a “viral” genome involves multiple steps, each with its own limitations and drawbacks. From extraction and fragmentation through to the conversion of RNA to cDNA, critical issues such as contamination and the introduction of biases and artefacts can creep in at any moment and threaten to derail the entire sequencing process. Any of the introduced errors can be compounded and make their way into the final product, leading to false, erroneous, and unreproducible results. While I have gone over each of these steps individually (the links above), the sources presented below look at library prep as a whole and further flesh out the unavoidable challenges introduced during the creation of a genomic library.

First up is a review of the automation strategies for library preparation. The researchers admit that the process itself is complex and cumbersome and has become something of a bottleneck, meaning that it is holding back progress. To obtain reproducible results, there needs to be an emphasis on standardisation of the process, as challenges occur at each stage of the workflow. Contamination is an inherent problem, and carry-over contamination during sequencing runs can lead to significant errors. On the bioinformatics side, comprehensive analysis solutions are still under development, meaning this is a problem that has not been corrected and doesn’t look to be going away any time soon:

Library preparation for next generation sequencing: A review of automation strategies

“Recently introduced high throughput and benchtop instruments offer fully automated sequencing runs at a lower cost per base and faster assay times. In turn, the complex and cumbersome library preparation, starting with isolated nucleic acids and resulting in amplified and barcoded DNA with sequencing adapters, has been identified as a significant bottleneck. Library preparation protocols usually consist of a multistep process and require costly reagents and substantial hands-on-time. Considerable emphasis will need to be placed on standardisation to ensure robustness and reproducibility.”

“NGS can be roughly divided into the process elements of sample pre-processing, library preparation, sequencing itself and bioinformatics (Fig. 1). Regardless of the underlying principles of the respective sequencing method, all modern sequencing technologies require dedicated sample preparation to yield the sequencing library loaded onto the instrument (Goodwin et al., 2016; Metzker, 2010). Sequencing libraries consist of DNA fragments of a defined length distribution with oligomer adapters at the 5′ and 3′ end for barcoding, as well as the actual sequencing process. After sequencing, the generated data is analysed using bioinformatics.

Reliable and standardised implementation and quality control measures for all stages of the process are crucial in routine laboratory practice (Endrullat et al., 2016; Gargis et al., 2012). Challenges are encountered at each of the aforementioned workflow steps and must be tackled to ensure high quality sequencing results. For instance, the extraction of a sufficient amount of DNA from the sample input without extracting disturbing inhibitors can vary in complexity depending on the sample material (Chiu, 2013; Trombetta et al., 2014). During sequencing runs, carry-over contamination can lead to significant errors (Kircher et al., 2012; Kotrova et al., 2017; Nelson et al., 2014). For bioinformatics, the handling of the extremely large amounts of data generated by high-throughput NGS requires considerable IT resources and comprehensive analysis solutions are still under development (Korneliussen et al., 2014; Naccache et al., 2014; Scholz et al., 2012).

During library preparation, three major challenges can be observed: complexity of protocols, contamination and cost.

The Illumina TruSeq Nano workflow (Illumina, 2019), for example, requires ten steps to attach adapters and barcodes to the nucleotides. The bead-based purification steps in particular, which entail the handling of magnets and magnetic particles by the user, are error-prone and can result in failure of the library preparation (Meyer and Kircher, 2010).

Sample contamination is an inherent problem, as libraries are usually prepared in parallel (Kotrova et al., 2017; Salter et al., 2014). Major sources of contamination are pre-amplifications required for low starting concentration of nucleic acids (Kotrova et al., 2017).”

https://www.sciencedirect.com/science/article/pii/S0734975020300343

This next source provides more detail regarding the issue of bias being introduced during library preparation. Bias refers to the systematic distortion of data caused by the experimental design. The goal during library prep is to create as little bias as possible, since it is entirely unavoidable and can only be mitigated at best. Many of the bias issues can be directly tied to the amplification process, which requires the use of PCR to create more material for sequencing. Thus, it is essential to construct a library in such a way as to increase complexity (the number of unique DNA fragments) and limit PCR amplification, which reduces complexity by increasing the amount of duplicate material:

Library construction for next-generation sequencing: Overviews and challenges

Considerations in NGS library preparation: Complexity, bias, and batch effects

“The main objective when preparing a sequencing library is to create as little bias as possible. Bias can be defined as the systematic distortion of data due to the experimental design. Since it is impossible to eliminate all sources of experimental bias, the best strategies are: (i) know where bias occurs and take all practical steps to minimize it and (ii) pay attention to experimental design so that the sources of bias that cannot be eliminated have a minimal impact on the final analysis.

The complexity of an NGS library can reflect the amount of bias created by a given experimental design. In terms of library complexity, the ideal is a highly complex library that reflects with high fidelity the original complexity of the source material. The technological challenge is that any amount of amplification can reduce this fidelity. Library complexity can be measured by the number or percentage of duplicate reads that are present in the sequencing data (39). Duplicate reads are generally defined as reads that are exactly identical or have the exact same start positions when aligned to a reference sequence (40). One caveat is that the frequency of duplicate reads that occur by chance (and represent truly independent sampling from the original sample source) increases with increasing depth of sequencing. Thus, it is critical to understand under what conditions duplicate read rates represent an accurate measure of library complexity.

Using duplicate read rates as a measure of library complexity works well when doing genomic DNA sequencing, because the nucleic acid sequences in the starting pool are roughly in equimolar ratios. However, RNA-seq is considerably more complex, because by definition the starting pool of sequences represents a complex mix of different numbers of mRNA transcripts reflecting the biology of differential expression. In the case of ChIP-seq the complexity is created by both the differential affinity of target proteins for specific DNA sequences (i.e., high versus low). These biologically significant differences mean that the number of sequences ending up in the final pool are not equimolar.

However, the point is the same—the goal in preparing a library is to prepare it in such a way as to maximize complexity and minimize PCR or other amplification-based clonal bias. This is a significant challenge for libraries with low input, such as with many ChIP-seq experiments or RNA/DNA samples derived from a limited number of cells. It is now technologically possible to perform genomic DNA and RNA sequencing from single cells. The key point is that the level of extensive amplification required creates bias in the form of preferential amplification of different sequences, and this bias remains a serious issue in the analysis of the resulting data. One approach to address the challenge is a method of digital sequencing that uses multiple combinations of indexed adapters to enable the differentiation of biological and PCR-derived duplicate reads in RNA-seq applications (41,42). A version of this method is now commercially available as a kit from Bioo Scientific (Austin, TX).

When preparing libraries for NGS sequencing, it is also critical to give consideration to the mitigation of batch effects (43–45). It is also important to acknowledge the impact of systematic bias resulting from the molecular manipulations required to generate NGS data; for example, the bias introduced by sequence-dependent differences in adaptor ligation efficiencies in miRNA-seq library preparations. Batch effects can result from variability in day-to-day sample processing, such as reaction conditions, reagent batches, pipetting accuracy, and even different technicians. Additionally, batch effects may be observed between sequencing runs and between different lanes on an Illumina flow-cell. Mitigating batch effects can be fairly simple or quite complex. When in doubt, consulting a statistician during the experimental design process can save an enormous amount of wasted money and time.”
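To make the duplicate-read measure from the passage above a little more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: reads are reduced to hypothetical (chromosome, start, strand, tag) tuples, the function name duplicate_rate is mine, and the short "tag" is a simplified stand-in for the indexed-adapter idea the authors mention, not their actual method.

```python
from collections import Counter

def duplicate_rate(reads, use_tag=False):
    """Fraction of reads that repeat an earlier read's key.

    Each read is a hypothetical tuple: (chrom, start, strand, tag).
    With use_tag=False, reads sharing position and strand are duplicates.
    With use_tag=True, they must also share the molecular tag.
    """
    if not reads:
        return 0.0
    keys = [r if use_tag else r[:3] for r in reads]
    counts = Counter(keys)
    # Every read beyond the first with a given key counts as a duplicate.
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(reads)

reads = [
    ("chr1", 100, "+", "AAC"),
    ("chr1", 100, "+", "AAC"),  # same position and tag
    ("chr1", 100, "+", "GTT"),  # same position, different tag
    ("chr1", 250, "+", "AAC"),
    ("chr2", 100, "+", "CGA"),
]

print(duplicate_rate(reads))                # 0.4 (position only)
print(duplicate_rate(reads, use_tag=True))  # 0.2 (position plus tag)
```

The two numbers differ because position alone cannot tell a PCR copy from an independently sampled molecule that happens to start at the same place, which is the caveat the quote raises about duplicate rates and sequencing depth.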

“One problem associated with amplicon sequencing is the presence of chimeric amplicons generated during PCR by PCR-mediated recombination (61). This problem is exacerbated in low complexity libraries and by overamplification. A recent study identified up to 8% of raw sequence reads as chimeric (62). However, the authors were able to decrease the chimera rate down to 1% by quality filtering the reads and applying the bioinformatic tool, Uchime (63). The presence of the PCR primer sequences or other highly conserved sequences presents a technical limitation on some sequencing platforms that utilize fluorescent detection (i.e., Illumina). This can occur with amplicon-based sequencing such as microbiome studies using 16S rRNA for species identification. In this situation, the PCR primer sequences at the beginning of the read will generate the exact same base with each cycle of sequencing, creating problems for the signal detection hardware and software.”

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4351865/
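The low-diversity problem described in the last excerpt (every read showing the same base in the early cycles because of a shared primer) can be illustrated with a short Python sketch. The reads and the four-base "primer" below are made up for the example, and the function name is my own; this is not how any instrument software actually measures diversity.

```python
from collections import Counter

def per_cycle_base_fractions(reads):
    """For each cycle (read position), return the fraction of each base seen."""
    if not reads:
        return []
    length = min(len(r) for r in reads)  # only compare positions all reads have
    fractions = []
    for cycle in range(length):
        counts = Counter(r[cycle] for r in reads)
        total = sum(counts.values())
        fractions.append({base: n / total for base, n in counts.items()})
    return fractions

# Hypothetical amplicon reads: a shared 4-base "primer" (ACGT), then variable sequence.
reads = ["ACGTAGCT", "ACGTCTGA", "ACGTGATC", "ACGTTCAG"]

for cycle, frac in enumerate(per_cycle_base_fractions(reads), start=1):
    dominant = max(frac.values())
    print(cycle, f"{dominant:.2f}")  # 1.00 for cycles 1-4, 0.25 for cycles 5-8
```

In the first four cycles a single base accounts for 100% of the signal, while the later cycles are evenly mixed, which is the contrast the authors say causes trouble for the detection hardware and software.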

Bad batch?

This third source deals with the problem of batch effects during library preparation, a critical issue which is often overlooked. According to Wikipedia (I know…sadly the best definition available), “in molecular biology, a batch effect occurs when non-biological factors in an experiment cause changes in the data produced by the experiment. Such effects can lead to inaccurate conclusions when their causes are correlated with one or more outcomes of interest in an experiment.” While this is a known problem, it is not well understood, and there remain no systematic algorithms to detect these effects nor any standardized guidelines for correcting them. Batch effects can be introduced in many ways throughout the various steps involved in library preparation, and without successful and validated methods for finding and correcting these effects, the data generated is entirely unreliable:

Identifying and mitigating batch effects in whole genome sequencing data

Abstract

Background

“Large sample sets of whole genome sequencing with deep coverage are being generated, however assembling datasets from different sources inevitably introduces batch effects. These batch effects are not well understood and can be due to changes in the sequencing protocol or bioinformatics tools used to process the data. No systematic algorithms or heuristics exist to detect and filter batch effects or remove associations impacted by batch effects in whole genome sequencing data.”

Conclusions

“Researchers currently do not have effective tools to identify and mitigate batch effects in whole genome sequencing data. We developed and validated methods and filters to address this deficiency.”

“Recent reductions in the cost of whole genome sequencing [1] (WGS) have paved the way for large-scale sequencing projects [2]. The rapid evolution of WGS technology has been characterized by changes to library preparation methods, sequencing chemistry, flow cells, and bioinformatics tools for read alignment and variant calling. Inevitably, the changes in WGS technology have resulted in large differences across samples and the potential for batch effects [3, 4].”

“Batch effects in WGS come with the additional complexity of interrogating difficult to characterize regions of the genome, and common approaches such as the Variant Quality Score Recalibration (VQSR) step in GATK [14] and processing samples jointly using the GATK HaplotypeCaller pipeline fail to remove all batch effects. Factors leading to batch effects are ill-understood and can arise from multiple sources making it difficult to develop systematic algorithms to detect and remove batch effects.”

“In addition to removing unconfirmed and likely spurious associations induced by batch effects, researchers must also determine that a batch effect exists. Identifying a method to detect batch effects that have an impact on downstream association analyses is crucial as researchers need to know upfront whether WGS datasets can be combined or if changes in sequencing chemistry will result in sequences that can no longer be analyzed together.”

“Large-scale WGS efforts are thriving, however few guidelines exist for determining whether a dataset has batch effects and, if so, what methods will reduce their impact. We address both these deficiencies and introduce new software (R package, genotypeeval, see Methods for additional details and web link) that can help identify batch effects.”

Mitigating batch effects via filtering

“Large-scale genome-wide association studies using SNP array based data often combined cases and controls obtained from different sources [39,40,41] and this practice continues with WGS based data [19, 20]. Rigorous QC of SNP array based data reduced batch effects in this setting. The sensitivity of WGS technology to differences in library preparation, sequencing chemistry, etc. makes it markedly susceptible to batch effects, however no standard set of guidelines for QC of WGS has been established.”

Discussion

“While sequencing costs are decreasing, many thousands of samples are necessary to have sufficient power to identify novel variants associated with common complex diseases [45]. In order to collect enough cases for diseases, multiple groups often work collaboratively by contributing samples to a consortium. In order to analyze these cases an even greater number of controls are desired [46]. Thus the need to combine samples that have been processed independently is clear, as is the unavoidable introduction of batch effects. These batch effects are subtle and simple filtering e.g. removing variants in “difficult regions” is ineffective. We found that changes in sequencing chemistry related to PCR versus PCR-free workflows strongly contributed to the detectable batch effects in both the Batch GWAS and the RA Batch GWAS.”

“Batch effects in WGS data are not well understood and perhaps because of this, we were not able to find an existing method or develop a novel method that removed all sites impacted by batch effects without impacting the power to detect true associations. While we focused on creating targeted filters that removed a small percent of the genome, in practice these need to be used in conjunction with standard quality control measures (for example removing sites out of Hardy-Weinberg equilibrium), which can result in very stringent filtering. In the case of a severe batch effect, such as the chemistry change present in the RA Batch GWAS, more stringent filtering was necessary even after applying standard quality control and our proposed filters as almost 40,000 UGAs remained after filtering. In order to fully address batch effects, disentangling the impact of changes in sequencing chemistry and bioinformatics processing on association analysis will be necessary.”

“Batch effects will arise as independent groups attempt to combine sequencing data generated and processed from different sources – this collaboration is necessary particularly to attain power to detect new disease-associated variants. Large-scale resources are spent by research, industry, and government organizations creating databases that cannot easily be merged.”

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1756-z
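The authors mention removing sites out of Hardy-Weinberg equilibrium as an example of a standard quality control measure. As a rough illustration of what such a check involves, here is a textbook chi-square version in Python. This is not the method used in the paper or in the genotypeeval package; the function name, the genotype counts, and the 0.05 threshold are all my own assumptions for the sake of the example, and real pipelines typically use exact tests and far stricter cut-offs.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic (1 degree of freedom) for deviation from
    Hardy-Weinberg proportions, given observed genotype counts."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p                        # frequency of allele B
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    # Skip genotype classes with zero expectation to avoid dividing by zero.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

# Illustrative threshold only: chi-square critical value for p = 0.05, 1 d.f.
THRESHOLD = 3.84

# Hypothetical sites: genotype counts (AA, AB, BB) in 100 samples each.
sites = {"site_1": (25, 50, 25), "site_2": (40, 20, 40)}
for name, counts in sites.items():
    stat = hwe_chi_square(*counts)
    verdict = "remove" if stat > THRESHOLD else "keep"
    print(name, stat, verdict)  # site_1 0.0 keep / site_2 36.0 remove
```

A filter like this only flags sites whose genotype proportions look odd; as the quoted discussion admits, it does not by itself identify or remove batch effects.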

The fourth and final source again delves into the issues related to bias introduced during library preparation. It provides a general overview of the ways bias can make its way into each of the steps involved in the creation of the genomic library. The article focuses a great deal on PCR amplification, as it is considered the main source of bias, and even though there are methods aimed at eliminating PCR, they have their own sets of limitations, artefacts, and biases. It is said that most protocols may introduce serious deviations which propagate into later cycles, resulting in erroneous interpretations. The authors state that all sequencing studies are limited by the accuracy of the sequencing experiments, which are all subject to errors and biases introduced during preparation. The researchers break down the sources of bias throughout the process and ultimately conclude that it is a problem that can only be mitigated, as bias cannot be eliminated:

Bias in RNA-seq Library Preparation: Current Challenges and Solutions

“Generally, the representative workflow of RNA-seq analysis includes the extraction and purification of RNA from cell or tissue, the preparation of sequencing library, including fragmentation, linear or PCR amplification, RNA sequencing, and the processing and analysis of sequencing data (Figure 1). Commonly used NGS platforms, including Illumina and Pacific Biosciences, need PCR amplification during library construction to increase the number of cDNA molecules to meet the needs of sequencing. Nevertheless, the most problematic step in sample preparation procedures is amplification. It is due to the fact that PCR amplification stochastically introduces biases, which can propagate to later cycles [2]. In addition, PCR also amplifies different molecules with unequal probabilities, leading to the uneven amplification of cDNA molecules [3, 4]. Recently, researchers have proposed several different methods in order to reduce PCR amplification, such as PCR-free protocols and isothermal amplification. Nevertheless, these methods are not perfect and still present some artifacts and biases of sequencing. Consequently, understanding these biases is critical to get reliable data and will provide some useful advice to the researcher.”
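The claim that unequal amplification compounds over cycles is easy to see with some simple arithmetic. The Python sketch below is my own illustration, not anything taken from the paper: it uses the standard idealised model in which each cycle multiplies the copy number by (1 + efficiency), it ignores the stochastic element the authors describe, and the efficiencies and cycle count are hypothetical values chosen for the example.

```python
def amplify(initial_copies, efficiency, cycles):
    """Expected copy number after `cycles` rounds of PCR.

    Idealised model: each cycle multiplies the pool by (1 + efficiency),
    where efficiency ranges from 0 (no amplification) to 1 (perfect doubling).
    """
    return initial_copies * (1 + efficiency) ** cycles

# Two hypothetical templates that start at exactly the same abundance.
start = 1000
cycles = 15
a = amplify(start, 1.00, cycles)  # doubles every cycle
b = amplify(start, 0.90, cycles)  # slightly less efficient template

print(f"A copies: {a:.0f}")
print(f"B copies: {b:.0f}")
print(f"A:B ratio after {cycles} cycles: {a / b:.2f}")
```

With these made-up numbers, a 10% difference in per-cycle efficiency leaves template A at roughly twice the abundance of template B after 15 cycles, even though the two started out equal.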

2. Sample Preservation and Isolation

“Despite many studies have shown that RNA-seq has many advantages, it is still a rapidly developing biotechnology and faces several challenges. Among them, one often overlooked aspect is the sample preparation process, which may also bring potential variations and deviations on RNA-seq experiment, including RNA isolation, sample processing, library storage time, RNA input level (such as the difference in the number of start-up RNA), and sample cryopreservation (such as fresh or frozen preservation).”

2.2. The Isolation and Extraction of RNA

“High-quality RNA purification is the premise of RNA-seq. However, due to the widespread existence of RNA degrading enzymes (RNases) [10–12], successful isolation of high-quality RNA remains challenging. At the present, the RNA extraction method can be divided into two types, including TRIzol (phenol: chloroform extraction) and Qiagen (silica-gel-based column procedures). These methods were mainly developed to extract long mRNAs and have been based on the assumption that all RNAs are equally purified, when these methods are applied to noncoding RNAs, which may be resulted in RNA degradation [13].”

Library Construction

“After RNA isolation and extraction, the next step is library construction of transcriptome sequencing. Library construction usually begins with the depletion of ribosomal RNA (rRNA) or the enrichment of mRNA enrichment, because most of the total RNA of cellular or tissue is rRNA. For eukaryotic transcriptome, polyadenylated mRNAs are usually extracted by oligo-dT beads, or rRNAs are selectively depleted. Unlikely, prokaryote mRNAs are not stably polyadenylated. Hence, oligo d(T)-mediated messenger enrichment is not suitable; there is only the second option. Then, RNA is usually fragmented to a certain size range by physical or chemical method. The subsequent steps differ among experimental design and NGS platforms. However, studies indicated that most of the protocols currently used for library construction may introduce serious deviations. For example, RNA fragmentation can introduce length biases or errors, subsequently propagating to later cycles. Furthermore, library amplification may also be affected by primer bias, such as primer bias in multiple displacement amplification (MDA) [15], primer mismatch in PCR amplification [16, 17]. As a consequence, it may introduce nonlinear effects and inevitably compromise the quality of RNA-seq dataset, leading to the result of erroneous interpretation. Consequently, in the next section, we will describe and summarize the bias sources of library preparation, including mRNA enrichment, fragmentation, primer bias, adapter ligation, reverse transcription, and especially PCR. A sum up suggestions for improvement is presented in Table 3.”

3.1. Input RNA

“Notwithstanding, RNA-seq can be used to measure transcripts of any sample in principle; it has been a challenge to apply standard protocols to samples with either very low quantity or low quality (partially degraded) input RNA. It is due to the fact that the bias associated with low amounts of input RNA has strong and harmful effects on downstream analysis. If not noticed, this may have significant impact on the subsequent biological interpretation.”

3.2. rRNA Depletion

“rRNAs are very abundant, often constituting 80% to 90% of total RNA. Due to the fact that rRNA sequence rarely arouse people’s interest in RNA-seq experiments, it is necessary to remove rRNA from sample before library construction. The aim is in order to prevent most of the library and the majority of sequencing reads from being rRNA.”

3.3. RNA Fragmentation

“Currently, RNA is usually fragmented due to read length restriction (<600 bp) of sequencing technologies and the sensitivity of amplification to long cDNA molecules. There are two major approaches of RNA fragmentation: chemical (using metal ions) and enzymatic (using RNase III) [30]. Commonly, RNA is fragmented using metal ions such as Mg++ and Zn++ in high temperatures and alkaline conditions. This method yields more accurate transcript identification than RNase III digestion [31]. This result was also confirmed in Wery et al. [31]. Furthermore, intact RNAs can be reverse transcribed (RT) to cDNA by reverse transcriptase, subsequently was fragmented. Then, the cDNA was fragmented using the enzymatic or physical method. Examples of the enzymatic method include DNase I digestion, nonspecific endonuclease (like NEBNext dsDNA Fragmentase from New England Biolabs), and transposase-mediated DNA fragmentation (Illumina Nextera XT). However, the Tn5 transposase method showed sequence-specific bias [32], which is the preferred method when only small quantities of cDNA are available, since the cDNA fragmentation and adapter ligation are connected in one step [33].”

“The physical method includes acoustic shearing, sonication, and hydrodynamic [17, 37, 38], which also can present nonrandom DNA fragmentation bias [35]. However, the physical cDNA fragmentation method is less amenable to automation than RNA fragmentation. Therefore, the physical method will be replaced by commercially available kits and the enzymatic method.”

3.4. Primer Bias

“Commonly, after mRNA is fragmented, which can be reverse transcribed into cDNA by random hexamers. However, studies have been indicated that random hexamer primer can lead to the deviation of nucleotide content of RNA sequencing reads, which also affects the consistency of the locations of reads along expressed transcripts. This may result in low complexity of RNA sequencing data.”

3.5. Adapter Ligation

“Generally, as for the deep sequencing of RNA library preparation, a critical step is the ligation of adapter sequences. The selection of T4 RNA ligase (Rnl1 or Rnl2) or other RNA ligase is very important. Subsequently, the ligation products were amplified by PCR. Or, nucleotide homopolymer sequences were added by poly (A) polymerase [41] or terminal deoxyribonucleotidyl transferase [41] but prevent the unambiguous determination of the termini of the input RNAs. This method has also been widely used in the construction of small RNA library. Recently, studies have shown that adapter ligation introduces a significant but widely overlooked bias in the results of NGS small RNA sequencing.”

3.6. Reverse Transcription

“Currently, the strategies of transcriptome analysis are still to convert RNA to cDNA before sequencing. A known feature of reverse transcriptases is that they tend to produce false second strand cDNA through DNA-dependent DNA polymerase. This may not be able to distinguish the sense and antisense transcript and create difficulties for the data analysis.”

3.7. PCR Amplification

“PCR is a basic tool widely in molecular biology laboratories. In particular, the combination of PCR and NGS sequencing promoted the explosive development of RNA sequence acquisition. However, PCR amplification has been proved to be the main source of artifacts and base composition bias in the process of library construction, which may lead to misleading or inaccurate conclusions in data analysis. Therefore, it is essential to avoid PCR bias, and great efforts have been expended on trying to control and mitigate bias in current.”

5. Discussion and Conclusion

“At the present, RNA-seq has been widely used in biological, medical, clinical, and pharmaceutical research. However, all these sequencing studies are limited by the accuracy of underlying sequencing experiments, because RNA-seq technology may introduce various errors and biases in sample preparation, library construction, sequencing and imaging, etc.”

“Additionally, library construction methods are frequently biased, which is a main concern for RNA-seq data quality. Among them, PCR amplification is the major source of bias. A previous study showed that GC content has a virtual influence on PCR amplification efficiency.”
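Since GC content is singled out here as a factor in amplification efficiency, it may help to see how simply it is defined: the fraction of bases in a sequence that are G or C. The Python sketch below uses made-up fragments and a function name of my own choosing; it shows only the calculation and says nothing about how strongly any given GC value affects amplification.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

# Hypothetical fragments, chosen only to span low, medium, and high GC.
fragments = ["ATATATTA", "ATGCATGC", "GCGCGGCC"]
for frag in fragments:
    print(frag, gc_content(frag))  # 0.0, 0.5, 1.0
```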

“In summary, the major goal of constructing the sequencing library is to minimize the bias. The bias was frequently defined as the systematic distortion of data due to the experimental protocols. Therefore, it is impossible to eliminate all sources of experimental bias. The best strategies are as follows: (i) to understand how the bias is generated and to take measures to minimize it; (ii) to pay attention to the experimental design and minimize the influence of irreducible bias on the final analysis.”

https://www.hindawi.com/journals/bmri/2021/6647597/

In Summary:

  • The complex and cumbersome library preparation, starting with isolated nucleic acids and resulting in amplified and barcoded DNA with sequencing adapters, has been identified as a significant bottleneck (i.e. something that retards or halts free movement and progress)
  • Library preparation protocols usually consist of a multistep process and require costly reagents and substantial hands-on-time
  • Considerable emphasis will need to be placed on standardisation to ensure robustness and reproducibility
  • Regardless of the underlying principles of the respective sequencing method, all modern sequencing technologies require dedicated sample preparation to yield the sequencing library loaded onto the instrument
  • Sequencing libraries consist of DNA fragments of a defined length distribution with oligomer adapters at the 5′ and 3′ end for barcoding, as well as the actual sequencing process
  • Reliable and standardised implementation and quality control measures for all stages of the process are crucial in routine laboratory practice
  • Challenges are encountered at each of the aforementioned workflow steps and must be tackled to ensure high quality sequencing results
  • During sequencing runs, carry-over contamination can lead to significant errors
  • The Illumina TruSeq Nano workflow (Illumina, 2019) requires ten steps to attach adapters and barcodes to the nucleotides
  • The bead-based purification steps in particular, which entail the handling of magnets and magnetic particles by the user, are error-prone and can result in failure of the library preparation
  • Sample contamination is an inherent problem, as libraries are usually prepared in parallel
  • Major sources of contamination are pre-amplifications required for low starting concentration of nucleic acids
  • The main objective when preparing a sequencing library is to create as little bias as possible
  • Bias can be defined as the systematic distortion of data due to the experimental design
  • Since it is impossible to eliminate all sources of experimental bias, the best strategies are:
    1. Know where bias occurs and take all practical steps to minimize it
    2. Pay attention to experimental design so that the sources of bias that cannot be eliminated have a minimal impact on the final analysis
  • The complexity of an NGS library can reflect the amount of bias created by a given experimental design
  • The technological challenge is that any amount of amplification (through RT-PCR) can reduce this fidelity
  • Duplicate reads are generally defined as reads that are exactly identical or have the exact same start positions when aligned to a reference sequence
  • It is critical to understand under what conditions duplicate read rates represent an accurate measure of library complexity
  • RNA-seq is considerably more complex than DNA, because by definition, the starting pool of sequences represents a complex mix of different numbers of mRNA transcripts reflecting the biology of differential expression
  • The goal in preparing a library is to prepare it in such a way as to maximize complexity and minimize PCR or other amplification-based clonal bias
  • The key point is that the level of extensive amplification required creates bias in the form of preferential amplification of different sequences, and this bias remains a serious issue in the analysis of the resulting data
  • It is important to acknowledge the impact of systematic bias resulting from the molecular manipulations required to generate NGS data; for example, the bias introduced by sequence-dependent differences in adaptor ligation efficiencies in miRNA-seq library preparations
  • Batch effects can result from variability in day-to-day sample processing such as:
    1. Reaction conditions
    2. Reagent batches
    3. Pipetting accuracy
    4. Different technicians
  • One problem associated with amplicon sequencing is the presence of chimeric amplicons (sequences formed from two or more biological sequences joined together) generated during PCR by PCR-mediated recombination
  • This problem is exacerbated in low complexity libraries and by overamplification
  • A recent study identified up to 8% of raw sequence reads as chimeric
  • The presence of the PCR primer sequences or other highly conserved sequences presents a technical limitation on some sequencing platforms that utilize fluorescent detection (i.e., Illumina)
  • In this situation, the PCR primer sequences at the beginning of the read will generate the exact same base with each cycle of sequencing, creating problems for the signal detection hardware and software
  • Assembling datasets from different sources inevitably introduces batch effects
  • These batch effects are not well understood and can be due to changes in the sequencing protocol or bioinformatics tools used to process the data
  • No systematic algorithms or heuristics exist to detect and filter batch effects or remove associations impacted by batch effects in whole genome sequencing data
  • The rapid evolution of WGS technology has been characterized by changes to:
    1. Library preparation methods
    2. Sequencing chemistry
    3. Flow cells
    4. Bioinformatics tools for read alignment and variant calling
  • Inevitably, the changes in WGS technology have resulted in large differences across samples and the potential for batch effects
  • Factors leading to batch effects are ill-understood and can arise from multiple sources making it difficult to develop systematic algorithms to detect and remove batch effects
  • Identifying a method to detect batch effects that have an impact on downstream association analyses is crucial as researchers need to know upfront whether WGS datasets can be combined or if changes in sequencing chemistry will result in sequences that can no longer be analyzed together
  • Few guidelines exist for determining whether a dataset has batch effects and, if so, what methods will reduce their impact
  • The sensitivity of WGS technology to differences in library preparation, sequencing chemistry, etc. makes it markedly susceptible to batch effects; however, no standard set of guidelines for QC of WGS has been established
  • Many thousands of samples are necessary to have sufficient power to identify novel variants associated with common complex diseases
  • In order to collect enough cases for diseases, multiple groups often work collaboratively by contributing samples to a consortium
  • The need to combine samples that have been processed independently is clear, as is the unavoidable introduction of batch effects
  • The researchers found that changes in sequencing chemistry related to PCR versus PCR-free workflows strongly contributed to the detectable batch effects in both the Batch GWAS and the RA Batch GWAS
  • Batch effects in WGS data are not well understood and perhaps because of this, the researchers were not able to find an existing method or develop a novel method that removed all sites impacted by batch effects without impacting the power to detect true associations
  • In order to fully address batch effects, disentangling the impact of changes in sequencing chemistry and bioinformatics processing on association analysis will be necessary
  • Batch effects will arise as independent groups attempt to combine sequencing data generated and processed from different sources
  • Large-scale resources are spent by research, industry, and government organizations creating databases that cannot easily be merged
  • Commonly used NGS platforms, including Illumina and Pacific Biosciences, need PCR amplification during library construction to increase the number of cDNA molecules to meet the needs of sequencing
  • The most problematic step in sample preparation procedures is amplification due to the fact that PCR stochastically introduces biases, which can propagate to later cycles
  • PCR also amplifies different molecules with unequal probabilities, leading to the uneven amplification of cDNA molecules
  • Recently, researchers have proposed several methods to reduce PCR amplification, such as PCR-free protocols and isothermal amplification, yet these methods are not perfect and still introduce some sequencing artifacts and biases
  • The sample preparation process can introduce variations and deviations into an RNA-seq experiment, with sources including:
    1. RNA isolation
    2. Sample processing
    3. Library storage time
    4. RNA input level (such as differences in the amount of starting RNA)
    5. Sample cryopreservation (such as fresh or frozen preservation)
  • Due to the widespread existence of RNA degrading enzymes (RNases), successful isolation of high-quality RNA remains challenging
  • The purification methods were mainly developed to extract long mRNAs and are based on the assumption that all RNAs are purified equally; applying these methods to noncoding RNAs may result in RNA degradation
  • Studies indicated that most of the protocols currently used for library construction may introduce serious deviations
  • RNA fragmentation can introduce length biases or errors, subsequently propagating to later cycles
  • Library amplification may also be affected by primer bias, such as primer bias in multiple displacement amplification (MDA) and primer mismatch in PCR amplification
  • As a consequence, it may introduce nonlinear effects and inevitably compromise the quality of the RNA-seq dataset, leading to erroneous interpretation
  • Sources of bias during library preparation include:
    1. mRNA enrichment
      • Input RNA
        1. Bias associated with low amounts of input RNA has strong and harmful effects on downstream analysis
        2. If not noticed, this may have significant impact on the subsequent biological interpretation
      • rRNA depletion
        1. It is necessary to remove rRNA from the sample before library construction
        2. The aim is to prevent most of the library and the majority of sequencing reads from being rRNA
    2. Fragmentation
      • The Tn5 transposase method, which is the preferred method when only small quantities of cDNA are available since cDNA fragmentation and adapter ligation are combined in one step, showed sequence-specific bias
      • Physical methods include acoustic shearing, sonication, and hydrodynamic shearing, which can also present nonrandom DNA fragmentation bias
    3. Primer bias
      • Studies have indicated that random hexamer primers can skew the nucleotide content of RNA sequencing reads, which also affects the consistency of read locations along expressed transcripts
      • This may result in low complexity of RNA sequencing data
    4. Adapter ligation
      • Studies have shown that adapter ligation introduces a significant but widely overlooked bias in the results of NGS small RNA sequencing
    5. Reverse transcription
      • A known feature of reverse transcriptases is that they tend to produce false second strand cDNA through DNA-dependent DNA polymerase activity
      • This may make it impossible to distinguish sense from antisense transcripts and create difficulties for the data analysis
    6. PCR amplification
      • Proven to be the main source of artifacts and base composition bias in the process of library construction, which may lead to misleading or inaccurate conclusions in data analysis
  • All sequencing studies are limited by the accuracy of underlying sequencing experiments, because RNA-seq technology may introduce various errors and biases in sample preparation, library construction, sequencing and imaging, etc.
  • Library construction methods are frequently biased, which is a main concern for RNA-seq data quality
  • Among them, PCR amplification is the major source of bias
  • The major goal of constructing the sequencing library is to minimize the bias
  • It is impossible to eliminate all sources of experimental bias
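
The definition of duplicate reads given in the notes above (reads that are identical or that share the same start position when aligned to a reference) can be turned into a rough duplicate-rate calculation. What follows is only a minimal sketch in Python, assuming the reads have already been aligned and are represented as hypothetical (chromosome, start, strand) tuples. The function name and the example data are made up for illustration, and paired-end reads and real alignment files are not handled.

```python
from collections import Counter


def duplicate_rate(alignments):
    """Return the fraction of reads that repeat an already-seen alignment.

    `alignments` is a list of (chromosome, start, strand) tuples. This is a
    hypothetical, simplified representation, not the format of any real tool.
    """
    if not alignments:
        return 0.0
    counts = Counter(alignments)
    # For each distinct alignment position, every read beyond the first
    # is counted as a duplicate.
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(alignments)


# Invented example: three reads share one start position.
reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 250, "-"),
    ("chr2", 40, "+"),
]
print(f"duplicate rate: {duplicate_rate(reads):.2f}")  # prints "duplicate rate: 0.40"
```

A number like this is only a proxy for library complexity. In RNA-seq, a highly expressed transcript can legitimately produce many reads with the same start position, which is why the notes stress knowing under what conditions the duplicate rate is an accurate measure.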

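The point that PCR amplifies different molecules with unequal probabilities, and that early differences propagate to later cycles, can be illustrated with a toy simulation. This is a sketch under stated assumptions rather than a model of any real protocol: each molecule is copied in each cycle with a fixed per-template probability, and the template names, starting counts, and efficiency values are invented for illustration.

```python
import random


def simulate_pcr(start_counts, efficiencies, cycles, seed=0):
    """Toy model of PCR: every molecule is copied with a fixed probability per cycle.

    `start_counts` maps template name -> starting molecule count and
    `efficiencies` maps template name -> per-cycle copy probability.
    Both are hypothetical inputs chosen for illustration.
    """
    rng = random.Random(seed)
    counts = dict(start_counts)
    for _ in range(cycles):
        for name in list(counts):
            n = counts[name]
            # Each existing molecule independently yields one extra copy
            # with probability efficiencies[name].
            copies = sum(1 for _ in range(n) if rng.random() < efficiencies[name])
            counts[name] = n + copies
    return counts


# Two templates start at equal abundance but amplify with different efficiency.
start = {"template_A": 10, "template_B": 10}
eff = {"template_A": 0.95, "template_B": 0.75}
final = simulate_pcr(start, eff, cycles=10)

total = sum(final.values())
for name, n in final.items():
    print(f"{name}: {n} molecules ({n / total:.0%} of the library)")
```

In this toy model the two templates begin at a 50/50 ratio and end far from it, which is the sense in which an amplified library may no longer mirror the starting pool.
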
“The genome that we decipher in this generation is but a snapshot of an ever-changing document. There is no definitive edition.”

-Matt Ridley, Genome: The Autobiography of a Species in 23 Chapters

It is obvious that every step in the preparation of the sequencing library is littered with bias, artefacts, errors, and batch effects. They openly admit that bias cannot be eliminated and can only be mitigated and that batch effects are inevitable with no way to detect or remove them. Knowing the various issues of contamination, errors, artefacts, and biases inherent in the steps leading to the creation of the genomic library, how can the data generated by these libraries be considered accurate and/or reliable? Every step that the sample goes through further removes it from its natural state. What is ultimately copied as cDNA, sequenced, and compiled into a genome no longer reflects reality. These researchers break down the fluids and materials taken from within us into unrecognizable lab-created cloned fragments and then attempt to put them back together into a construct only observable on a computer screen. These researchers believe that they can obtain valuable insight from “specific” genetic markers within these heavily altered computer-generated theoretical constructs. So far, this belief is nothing but blind religious-like faith as the medical marvels promised by the rise of genomics with the sequencing of the human genome nearly 20 years ago have yet to materialize into a benefit to society. When it is admitted that the various sources of errors in each and every step leading to erroneous and unreliable data cannot be eliminated, it is time to realize that the promise of genomics is a pipe dream built upon fraudulent data.

6 comments

  1. Do I trust the medical community, medical scientific research labs and big pharma to get all this DNA sequencing and data capture right? Do I trust the medical community to do something like CRISPR and make it work to improve the health of mankind? Never in a million years. They can edit all the media they want, but never my DNA. Perhaps in the Star Trek world they created new medical procedures to help people, but as of now, it’s about nothing more than control and big profits.

    1. Yes, it is yet another in a long line of deceptive charades they have running currently. It is indeed wise not to trust those who create the diseases and then sell the “cures.”

    1. I’m not entirely familiar with the process but for biological relatedness, I have heard that they do extensive questionnaires and without those, they are not able to determine anything. In fact, even with questionnaires, the tests are not very accurate as different places can get different results. For criminals, there have been many convictions overturned due to the unreliability of DNA evidence. The accuracy of DNA is a myth.
