“Viral” Genomics: Nothing but Strings of Letters in a Data Bank

“Although all that is terrific, says Calisher, a string of DNA letters in a data bank tells little or nothing about how a virus multiplies, which animals carry it, how it makes people sick, or whether antibodies to other viruses might protect against it. Just studying sequences, Calisher says, is ‘like trying to say whether somebody has bad breath by looking at his fingerprints.’”

Virologist Charles Calisher to Science in 2001

https://www.google.com/amp/s/fdocuments.in/amp/document/virology-old-guard-urges-virologists-to-go-back-to-basics.html

There are many issues with relying solely on genomics for “viruses,” especially since no “virus” has ever been properly purified/isolated or proven pathogenic. This step is absolutely necessary in order to know whether these strings of letters in a database actually exist in reality and have any meaning. As Charles Calisher warned, the study of DNA sequences tells nothing about a “virus,” and without this proof of existence, the random A,C,T,G’s are completely meaningless. Presented below are three sources which highlight what Dr. Calisher warned about regarding the various challenges in attempting to analyze “viral” sequences and the difficulties of using genomics for the characterization/identification of “viruses.”

This first article is from July 2016 and it lays out the many flaws in using genomics for “viral” identification/characterization. It does an excellent job of describing the numerous challenges in analyzing “viral” metagenomes. Keep in mind that with metagenomics, there are no attempts at purification/isolation of any “viruses” at all. They try to sequence genomes from mixed samples containing a multitude of host, bacteria, fungi, and “virus” DNA. It is admitted in this article that the fragments from the “viral” genomes are much less abundant than the other substances within the sample. They discuss the various technological challenges along with the varied quality of the software, and state that most genomic analysis software is not suited for “viral” genomes. These are but a few of the many interesting revelations within this source:

Classification of “viral” sequences is critically dependent upon the quality of curated “viral” databases

Challenges in the analysis of viral metagenomes

“Genome sequencing technologies continue to develop with remarkable pace, yet analytical approaches for reconstructing and classifying viral genomes from mixed samples remain limited in their performance and usability. Existing solutions generally target expert users and often have unclear scope, making it challenging to critically evaluate their performance. There is a growing need for intuitive analytical tooling for researchers lacking specialist computing expertise and that is applicable in diverse experimental circumstances. Notable technical challenges have impeded progress; for example, fragments of viral genomes are typically orders of magnitude less abundant than those of host, bacteria, and/or other organisms in clinical and environmental metagenomes; observed viral genomes often deviate considerably from reference genomes demanding use of exhaustive alignment approaches; high intrapopulation viral diversity can lead to ambiguous sequence reconstruction; and finally, the relatively few documented viral reference genomes compared to the estimated number of distinct viral taxa renders classification problematic. Various software tools have been developed to accommodate the unique challenges and use cases associated with characterizing viral sequences; however, the quality of these tools varies, and their use often necessitates computing expertise or access to powerful computers, thus limiting their usefulness to many researchers. In this review, we consider the general and application-specific challenges posed by viral sequencing and analysis, outline the landscape of available tools and methodologies, and propose ways of overcoming the current barriers to effective analysis.”

“At present we have a limited grasp of the extent of viral diversity present in the environment: the 2014 database release from the International Committee for the Taxonomy of Viruses classified just 7 orders, 104 families, 505 genera, and 3286 species (http://www.ictvonline.org/virustaxonomy.asp); yet, one study estimated that there are at least 320,000 virus species infecting mammals alone (Anthony et al. 2013).”

“For example, a fast sequence classifier might fail entirely to detect a novel strain of a well-characterized virus, and equally might perform well with Illumina sequences yet deliver poor results for data generated with the Ion Torrent platform. Furthermore, results arising from these analyses should be replicable, intelligible, and useful to the end user, with provision for quality control and error management.”

“Methodologically, most genomic sequence analysis software is not well suited for viral genomes. Generic tools that are able to address the challenges posed by viral sequences are often applicable only in limited circumstances. Choosing between approaches is made difficult due to an abundance of disparate yet functionally equivalent methodologies and in general a lack of rigorous benchmarks for viral datasets. While there is much ongoing research in this area, both the sensitive detection of previously characterized viruses and viral discovery remain key challenges open for innovation.”

2. Viral sequence enrichment: physical and in silico approaches

“Within metagenomes the proportion of viral nucleic acids is typically far lower than that of host or other microbes, limiting the amount of signal available for analysis after sequencing. To mitigate this issue, enrichment and amplification approaches are widely used prior to sequencing viral samples. Size filtration or density-based enrichment by centrifugation are two effective methods for increasing virus yield, although such methods may bias the observed composition of viral populations (Ruby, Bellare, and Derisi 2013). Alternatively, PCR amplification may be used to generate an abundance of specific viral sequences present in a sample, a widely used strategy, which was employed in the identification and analysis of MERS coronavirus (Zaki et al. 2012; Cotten et al. 2013, 2014), although effective primer design can be challenging in the presence of high genomic diversity in the target viral species. Conversely, an excess of sequencing coverage can lead to the construction of overly complex and unwieldy de novo assembly graphs in the presence of high genomic diversity, reducing assembly quality. Using in silico normalisation (Crusoe et al. 2015), excess coverage may be reduced by discarding sequences containing redundant information. This approach increases analytical efficiency when dealing with high coverage sequence data, and we have shown that it can benefit de novo assembly of viral consensus sequences. Another in silico strategy for increasing analytical efficiency by discarding unneeded data is to filter sequences from known abundant organisms through alignment with one or more reference genomes using an aligner or specialist tool (approaches reviewed in Daly et al. 2015).”
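To make the “in silico normalisation” idea concrete, here is a minimal Python sketch of the digital normalisation approach cited above (Crusoe et al. 2015): a read whose k-mers have already been seen at sufficient depth carries mostly redundant information and is discarded. The function name, k-mer size, and coverage threshold are illustrative assumptions, not the actual interface of the khmer software.

```python
# A minimal sketch of in silico (digital) normalisation: discard reads
# whose k-mers have already been observed at the target depth.
from collections import defaultdict

def normalize_reads(reads, k=20, target_coverage=20):
    """Keep a read only if its median k-mer count so far is below target."""
    kmer_counts = defaultdict(int)
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # read shorter than k
        counts = sorted(kmer_counts[km] for km in kmers)
        if counts[len(counts) // 2] < target_coverage:  # median coverage
            kept.append(read)
            for km in kmers:  # only kept reads add to the counts
                kmer_counts[km] += 1
    return kept

reads = ["ATGGCGTGCAATTGACCGGTTAAC"] * 50  # toy data: one read repeated 50x
print(len(normalize_reads(reads)))        # 20 kept, 30 discarded as redundant
```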

4. Assembling genomes: de novo and reference-based assembly

“The reconstruction of sequencing reads into full length genes and genomes can be performed by means of either reference-based alignment or de novo assembly, a decision dependent on experimental objectives, read length, quality and data complexity. In reference-based approaches, reads are mapped to similar regions of a supplied template genome, a well-studied and computationally efficient process implemented with a suffix array index of the reference genome. In contrast, de novo assembly is computationally exhaustive but important in cases where either a target genome is poorly characterized or reconstruction of genomes of a priori unknown entities in metagenomes is sought, such as in surveillance studies. For short read data, the increased sequence length afforded by assembly can be necessary to distinguish members of highly conserved gene families from one another. Assembly is also widely used for generating whole genome consensus sequences to facilitate analyses of viral variation, and is a typical starting point for analyses of diverse populations of well-characterized viruses. Even where long reads are available, assembly plays an important role in mitigating the high error rates associated with single molecule sequencing technologies, yielding accurate consensus sequences from inaccurate individual reads.”
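For readers unfamiliar with assembly, the toy sketch below illustrates the core idea of de novo assembly: repeatedly merge the two reads with the longest overlap until no overlaps remain. Real assemblers use de Bruijn graphs and must cope with sequencing errors and uneven coverage; the reads and overlap threshold here are invented for illustration.

```python
# A toy greedy-overlap illustration of de novo assembly; real assemblers
# (e.g. SPAdes) use de Bruijn graphs and handle errors and repeats.
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that matches a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads):
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, 0, 1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j and overlap(a, b) > best_n:
                    best_n, best_i, best_j = overlap(a, b), i, j
        if best_n == 0:  # no overlaps left: the assembly stays fragmented
            break
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads)
                 if k not in (best_i, best_j)] + [merged]
    return reads

# Toy reads tiled across the made-up sequence "ATGGCGTGCAATTGACC"
print(greedy_assemble(["ATGGCGTGC", "GTGCAATTG", "ATTGACC"]))
```

Note that the output is a computed reconstruction from overlaps; nothing in the procedure itself verifies where the input reads came from.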

4.2 De novo assembly for metagenomes

“Typical de novo assemblers are designed to reconstruct genomes with uniform sequencing coverage across their length. This is problematic for metagenomes (including viromes) where coverage typically varies considerably both among different genomes and within individual genomes.”

“Although metagenome assemblers generally outperform single genome assemblers in reconstructing different genomes simultaneously, the complexity of this task stipulates their tendency to collapse variation at or beneath strain level into consensus sequences. Even to this end, their effectiveness may be limited as a consequence of extreme variation within specific RNA virus populations due to mutation and recombination, and low and/or uneven sequencing coverage across a particular genome. Furthermore, it should be noted that de novo assembly is particularly sensitive to the quality of input sequences, meaning that problems during sample extraction, enrichment and library preparation can be highly detrimental to downstream analyses. Of key importance therefore are quality control methods for detecting, and where appropriate correcting, problems associated with contamination (Darling et al. 2014; Orton et al. 2015), primer read-through and low quality reads (reviewed in Leggett et al. 2013).”
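As a concrete illustration of the quality-control step mentioned above, here is a minimal sketch that drops reads whose mean Phred quality falls below a cutoff. Real QC tools (e.g. those reviewed in Leggett et al. 2013) also trim adapters and low-quality tails; the reads, scores, and cutoff below are invented for the example.

```python
# A minimal sketch of read quality filtering by mean Phred score.
def mean_quality(quals):
    return sum(quals) / len(quals)

def filter_reads(reads, min_mean_q=20):
    """reads: list of (sequence, [per-base Phred scores]) tuples."""
    return [(seq, q) for seq, q in reads if mean_quality(q) >= min_mean_q]

reads = [("ACGTACGT", [30, 32, 31, 29, 30, 33, 31, 30]),  # good read, mean ~31
         ("ACGTTTGA", [8, 9, 12, 10, 7, 11, 9, 10])]      # poor read, mean ~9.5
print(len(filter_reads(reads)))  # 1 read survives
```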

5. Haplotype reconstruction in specific viral populations

“Viral genomes and metagenomes comprising high intraspecific variation can be challenging targets for assembly, giving rise to complex assembly graphs and fragmented assemblies. This is often the case for clinical samples from HIV and Hepatitis C patients, in which high rates of mutation and long durations of infection can contribute to extreme population divergence, but can also be observed in environmental samples.”

“A limitation of all of these approaches, however, is their reliance upon a single reference sequence with which to perform the initial alignment, a process which assumes a degree of sequence similarity which may not always be observed in diverse regions, such as regions encoding envelope proteins, of RNA virus genomes.”

6. Sequence classification

“Sequence classification is one of the most studied problems in computational biology, and taxonomic assignment is a key objective of metagenome analysis. All classification methods, to some extent, depend upon detecting similarity between a query sequence and a collection of annotated sequences. Classification may be undertaken using either unassembled reads or the reconstructed contigs arising from the assembly process. The computational requirements of available approaches vary dramatically according to their ability to detect homology in divergent sequences…”

6.1. Sequence similarity searches

“Viral identification approaches typically depend on similarity searches against a database using an aligner such as BLAST (Altschul et al. 1990). Comprehensive databases (e.g. GenBank) or smaller custom databases containing, for example, only viral sequences of interest may be used, although the latter can generate misleading results.”
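The sketch below shows the similarity-search idea in miniature: a query read is scored against a toy “database” by counting shared k-mers, and the best-scoring entry is reported. Real pipelines use BLAST against GenBank or curated viral databases; the sequences and entry names here are hypothetical.

```python
# A minimal sketch of similarity-based classification via shared k-mers.
def kmers(seq, k=8):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

database = {  # hypothetical "curated" entries
    "reference_A": "ATGGCGTGCAATTGACCGGTTAACCGGTA",
    "reference_B": "TTTTCCCCGGGGAAAATTTTCCCCGGGG",
}

def classify(query, db, k=8):
    """Return the database entry sharing the most k-mers with the query."""
    q = kmers(query, k)
    scores = {name: len(q & kmers(seq, k)) for name, seq in db.items()}
    best = max(scores, key=scores.get)
    # Note: a best hit is always returned, even when the absolute score is
    # negligible -- one way a small custom database can mislead.
    return best, scores[best]

print(classify("GCGTGCAATTGACCGG", database))  # ('reference_A', 9)
```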

6.2 Alternatives to similarity searches

“Although exhaustive BLAST-like methods can detect homology in divergent sequences, these methods are in general limited by the relatively few validated viral sequences deposited in public databases, the high diversity within viral families which can obscure relatedness, and the lack of a defined set of core genes common to all viruses that can be used to distinguish species (e.g. the 16S gene for bacteria) (Fancello, Raoult, and Desnues 2012). These features make it difficult to assign similarity thresholds for classification that are applicable to all potential viruses in a sample (Simmonds 2015).”

“A fundamental challenge in the classification of viral sequences with any of these methods remains their limited representation within curated sequence databases. While the rate at which new viruses are being added to NCBI’s RefSeq collection has increased considerably, from a yearly average of 0.34 species/day in 2010 to 2.5 species/day in 2015 (Fig. 3), our documented understanding of the extent of viral diversity remains superficial (Anthony et al. 2013). Reads of true viral origin are therefore liable to be missed in many cases. The rate of database growth also highlights the need to maintain frequently updated search indexes for sequence classification, construction of which often demands specialist servers equipped with hundreds of gigabytes of RAM. Even if up-to-date indexes are maintained inside a public repository, their file sizes are substantial, demanding users have access to a fast internet connection. Consequently, complete outsourcing of sequence classification to remote web services is a compelling prospect for those with adequate internet connections but without powerful computing hardware, increasing scope for conducting analyses with portable computers.”
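Some rough arithmetic puts those growth figures in perspective against the estimate of at least 320,000 mammalian “virus” species (Anthony et al. 2013) quoted earlier:

```python
# Back-of-the-envelope: how long to catalogue the estimated species pool
# at the quoted 2015 RefSeq addition rate? Purely illustrative arithmetic.
estimated_species = 320_000   # Anthony et al. 2013, mammals alone
rate_per_day_2015 = 2.5       # RefSeq additions per day in 2015
years_needed = estimated_species / (rate_per_day_2015 * 365)
print(f"~{years_needed:.0f} years at the 2015 rate")  # ~351 years
```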

“Furthermore, classification of viral sequences is critically dependent upon the quality of curated viral databases such as RefSeq, to which submitting newly discovered sequences can be prohibitively time consuming.”

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5822887/

“Just studying sequences is like trying to say whether somebody has bad breath by looking at his fingerprints.” – Virologist Charles Calisher

This second article is from January 2017. It breaks down the three main methods used to sequence “viruses” and provides the advantages and disadvantages of each. Keep in mind while reading this that the original “SARS-COV-2” genome came from the metagenomic analysis of unpurified/unisolated bronchoalveolar lavage fluid (BALF) taken from a single patient. I detailed the process here:

Creating the “SARS-COV-2” Genome

The three main methods outlined in the article are metagenomics, PCR amplification, and target enrichment. Regardless of what is claimed, none of these methods is suitable for detecting novel “viruses,” especially when they sequence from sources that have not been through proper purification steps (centrifugation, filtration, precipitation, etc.) to remove host DNA, bacteria, fungi, and other potential foreign microorganisms/contaminants from the original sample. As you will see, either the sample contains more than just the targeted “viral” material or the sequence must already be known beforehand.

Clinical and biological insights from viral genome sequencing

“Sequencing viral nucleic acids, whether from cultures or directly from clinical specimens, is complicated by the presence of contaminating host DNA [58]. By contrast, most bacterial sequencing is currently carried out on clinical isolates that are cultured; thus, sample preparation is comparatively straightforward (Table 2 and reviewed in Ref. 59). Currently, genome sequencing of viruses can be achieved by ultra-deep sequencing or through the enrichment for viral nucleic acids before sequencing, either directly or by concentrating virus particles. All of these approaches have their own costs and complexities.”

“Three main methods are currently used for viral genome sequencing: metagenomic sequencing, PCR amplicon sequencing and target enrichment sequencing (Fig. 1).”

“Metagenomics. Metagenomic approaches have been used extensively for pathogen discovery and for the characterization of microbial diversity in environmental and clinical samples [60,61]. Total DNA and/or RNA, including from the host, bacteria, viruses, fungi and other pathogens, are extracted from a sample, and a library is prepared and sequenced by shotgun sequencing or RNA sequencing (RNA-seq). Box 1 explores the diagnostic applications for metagenomics and RNA-seq; for example, in encephalitis of unknown aetiology [62,63,64], for which conventional methods such as PCR are often not diagnostic, metagenomics and RNA-seq have detected viral infections [65,66,67] and other causes [68] of encephalitis. In addition, these methods have been used to sequence the whole genome of some viruses, including Epstein–Barr virus (EBV) [69] and HCV [27]. However, in clinical specimens, the presence of contaminating nucleic acids from the host and commensal microorganisms [58] (Table 2) decreases sensitivity. The proportion of reads that match the target virus genome from metagenomic WGS is often low; for example, 0.008% for EBV in the blood of a healthy adult [70], 0.0003% for Lassa virus in clinical samples [71] and 0.3% for Zika virus in a sample that was enriched for virus particles through filtration and centrifugation [72]. The read depth is often inadequate to detect resistance [27] and the cost is high. Thus, metagenomic sequencing has typically only been carried out on a small number of samples for research purposes [72,73]. The concentration of virus particles (see the Zika virus example above [72]), depletion of host material and/or sequencing to high read depth can increase the amount of virus sequence, but all of these methods add to the cost. The concentration of virus particles from clinical specimens by antibody-mediated pull-down (for example, virus discovery based on complementary DNA (cDNA)–amplified fragment length polymorphism (AFLP), abbreviated to VIDISCA), filtration, ultracentrifugation and the depletion of free nucleic acids, which mostly come from the host, have all been tried [74,75,76,77]; however, these methods may also decrease the total amount of viral nucleic acids so that it is insufficient for preparing a sequencing library. Non-specific amplification methods, such as multiple displacement amplification (MDA), which make use of random primers and Φ29 polymerases, can increase the DNA yield. However, these approaches are time consuming, costly, and may increase the risk of biases, errors and contamination, without necessarily improving sensitivity [78,79]. Moreover, the proportion of host reads often remains high [80].”
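To get a feel for those percentages, here is a rough calculation of how few on-target reads a metagenomic run would yield; the total read count is an assumed run size, not a figure from the paper:

```python
# On-target read counts implied by the proportions quoted above,
# for a hypothetical run of 10 million reads.
total_reads = 10_000_000
for virus, fraction in [("EBV (healthy blood)", 0.008 / 100),
                        ("Lassa (clinical)", 0.0003 / 100),
                        ("Zika (enriched)", 0.3 / 100)]:
    print(f"{virus:20s} -> {total_reads * fraction:,.0f} viral reads")
# EBV: 800 reads; Lassa: 30 reads; Zika (after enrichment): 30,000 reads
```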

“When metagenomic methods are used for pathogen discovery or diagnosis, it is crucial to use appropriate bioinformatic tools and databases that can evaluate whether detected pathogen sequences are likely to be the cause of infection, incidental findings or contaminants. Bioinformatic analyses of large metagenomic datasets require high-performance computational resources.”

“However, incidental findings, both in host and microbial sequences, may also present ethical and even diagnostic dilemmas for clinical metagenomics [82] (see below). A recent example involved a cluster of cases of acute flaccid myelitis that were associated with enterovirus D68 (Ref. 83). The metagenomic data from samples taken from patients showed the presence of alternative pathogens, some of which are treatable, and was debated in formal [84] and informal scientific channels (see Omicsomics blogspot article).”

“PCR amplicon enrichment. An alternative to metagenomic approaches is to enrich the specific viral genome before sequencing. PCR amplification of viral genetic material using primers that are complementary to a known nucleotide sequence has been the most common approach for enriching small viral genomes, such as HIV and influenza virus.”

“Overlapping PCRs combined with NGS have been used to sequence the whole genomes of larger viruses, such as HCMV [93], but this method has limited scalability, as many primers and a relatively large amount of starting DNA are required [93]. This limits the number of suitable samples that are available and also the genomes that can be studied using this method. For example, 8–19 PCR products were required to amplify the genome of Ebola virus [39], and two studies of norovirus needed 14 and 22 PCR products, respectively [86,87]. For clinical applications this is problematic because of the high laboratory workload that is associated with numerous discrete PCR reactions, the necessity for individually normalizing concentrations of different PCR amplicons before pooling, the increasing probability of reaction failure due to primer mismatch (particularly for highly variable viruses), and the high costs of labour and consumables [94]. Therefore, although PCR-based sequencing of viruses as large as 250 kb is technically possible, the proportional relationship between genome size and technical complexity make PCR-based sequencing of viral genomes that are more than 20–50 kb impractical with current technologies, particularly for large multi-sample studies or routine diagnostics. Another consideration is that increasing numbers of PCR reactions require a corresponding increase in sample amount, and this is not always possible as clinical specimens are limited.”
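The scaling problem described above can be shown with simple tiling arithmetic: the number of overlapping amplicons grows linearly with genome size. The amplicon length and overlap below are illustrative choices, not the designs used in the cited studies.

```python
# Overlapping amplicons needed to tile a genome end to end.
import math

def amplicons_needed(genome_len, amplicon_len=2500, overlap=200):
    step = amplicon_len - overlap  # fresh sequence gained per extra amplicon
    if genome_len <= amplicon_len:
        return 1
    return math.ceil((genome_len - amplicon_len) / step) + 1

for name, size in [("Ebola virus", 19_000),
                   ("coronavirus-sized", 30_000),
                   ("HCMV", 235_000)]:
    print(f"{name:18s} ~{size:>7,} nt -> {amplicons_needed(size):>3} amplicons")
# Ebola: 9, coronavirus-sized: 13, HCMV: 103 amplicons under these assumptions
```

Every amplicon is a separate primer pair to design, a reaction to run, and a concentration to normalize before pooling, which is why the quoted passage puts the practical ceiling at roughly 20–50 kb.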

“Highly variable pathogens, particularly those that have widely divergent genetic lineages or genotypes, such as HCV [97] and norovirus, cause problems for PCR amplification, such as primer amplification [27,92] and primer mismatches [86]. Careful primer design may help to mitigate these problems, but novel variants remain problematic.”

“Target enrichment. Methods of target enrichment (also known as pull-down, capture or specific enrichment methods) can be used to sequence whole viral genomes directly from clinical samples without the need for prior culture or PCR [98,99,100]. These methods typically involve small RNA or DNA probes that are complementary to the pathogen reference sequence (or a panel of reference sequences).”

“The lack of a culture step means that the sequences that are obtained are more representative of original virus rather than cultured virus isolates, and there are fewer mutations than in PCR-amplified templates [69,100]. The success of this method depends on the available reference sequences for the virus of interest; specificity increases when probes are designed against a larger panel of reference sequences, as this leads to better capture of the diversity in and between samples. Target enrichment is possible despite small mismatches between template and probe; however, whereas PCR amplification requires only knowledge of flanking regions of a target region, target enrichment requires knowledge of the internal sequence to design probes. However, if one probe fails, internal and overlapping regions may still be captured by other probes [69,100]. Target enrichment is not suitable for the characterization of novel viruses that have low homology to known viruses for which metagenomics, and, in some cases, PCR using degenerate primers, which are a mix of similar but variable primers, may be more appropriate.”
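A minimal sketch of the probe-tiling idea behind target enrichment: overlapping probes are generated across each reference in a panel, so a probe defeated by divergence in one region may be rescued by its overlapping neighbours. The probe length, step, and reference panel here are all invented; real capture panels use much longer baits designed against curated reference sets.

```python
# Tile overlapping capture probes across a panel of reference sequences.
def design_probes(references, probe_len=20, step=10):
    """references: dict of name -> sequence. Returns (name, offset, probe)."""
    probes = []
    for name, seq in references.items():
        for i in range(0, max(1, len(seq) - probe_len + 1), step):
            probes.append((name, i, seq[i:i + probe_len]))
    return probes

panel = {  # hypothetical panel; more references capture more diversity
    "genotype_1": "ATGGCGTGCAATTGACCGGTTAACCGGTAATCGGA",
    "genotype_2": "ATGGCTTGCAATTCACCGGTTTACCGGTAATCGGA",
}
for name, pos, probe in design_probes(panel):
    print(name, pos, probe)
```

As the quote notes, nothing can be captured that was not already represented in the reference panel.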

“Results from this study, in which three different enrichment protocols, two metagenomic methods and one overlapping PCR method were evaluated, showed that metagenomic methods were the least sensitive, yielded the lowest genome coverage for comparable sequencing effort and were more prone to result in incomplete genome assemblies. The PCR method required repeated amplification and was the most likely to miss mixed infections, but when reactions were successful it resulted in the most consistent read depth, whereas read depth was proportional to virus copy number in metagenomics and target enrichment. PCR generated more incomplete sequences for some HCV genotypes (particularly genotype 2) than metagenomics and target enrichment. Target enrichment was the most consistent method to result in full genomes and identical consensus sequences.”

“The issues of sensitivity and contamination are especially important in WGS, because of the risk of both false-negative and false-positive detection of pathogens. Highly sensitive sequencing (whether metagenomic, PCR-based or target enrichment-based) may detect low-level contaminating viral nucleic acids [112,113]. For example, murine leukaemia virus [114,115] and parvovirus-like sequences [116,117] are just two of many contaminants that can come from common laboratory reagents, such as nucleic acid extraction columns [118]. As with other highly sensitive technologies, robust laboratory practices and protocols are required to minimize contamination. It is also important to remember that the detection of viral nucleic acid does not necessarily identify the cause of illness, and it is good practice when using NGS methods for the diagnosis of viral infections to confirm the findings with alternative independent methods that do not rely on testing for nucleic acids. For example, in cases of encephalitis of unknown origin, positive NGS findings can be confirmed through immunohistochemical analysis of the affected tissue [65,119], or identification of the virus by electron microscopy or tissue culture [82].”

“Currently, the choice of method is specific to both the virus and the clinical question. Metagenomic sequencing is most appropriate for diagnostic sequencing of unknown or poorly characterized viruses, PCR amplicon sequencing works well for short viral genomes and low diversity in primer binding sites, and target enrichment works for all pathogen sizes but is particularly advantageous for large viruses and for viruses that have diverse but well-characterized genomes. Two obvious areas of innovation currently exist: methods that can effectively deplete host DNA without affecting viral DNA, and the further development of long-read technologies to achieve the flexibility and competitive pricing of short-read technologies.”

https://www.nature.com/articles/nrmicro.2016.182?WT.feed_name=subjects_bacteria

None are suitable for novel “viruses.”

This third source is from May 2019. It highlights further issues with relying on genomics for “viruses.” It reiterates the problem of low “viral” material compared to host DNA, as well as the coexistence of several “variants” within the same sample. It also points out some of the issues of utilizing PCR for sequencing genomes, as well as the critical importance of selecting the right reference genome:

A complete protocol for whole-genome sequencing of virus from clinical samples: Application to coronavirus OC43

“Obtaining virus genome sequence directly from clinical samples is still a challenging task due to the low load of virus genetic material compared to the host DNA, and to the difficulty to get an accurate genome assembly.”

“However, despite the relatively small size of virus genomes, their sequencing often remains difficult. The small amount of virus genetic material compared to the host nucleic acid decreases viral sequencing output. In addition, one has to deal with the difficulty that several viral variants coexist in a single sample, presenting more or less variable sequences depending on the intrinsic mutation rate of the virus. All these points burden the sequencing and the assembly of the viral genome.”

“Three main methods based on HTS are currently used for viral whole-genome sequencing: metagenomic sequencing, target enrichment sequencing and PCR amplicon sequencing, each showing benefits and drawbacks (Houldcroft et al., 2017). In metagenomic sequencing, total DNA (and/or RNA) from a sample including host but also bacteria, viruses and fungi is extracted and sequenced. It is a simple and cost-effective approach, and it is the only approach not requiring reference sequences. Instead, the other two HTS approaches, target enrichment and amplicon sequencing, both depend on reference information to design baits or primers. The limitation of metagenomic sequencing is that it requires a very high sequencing depth to obtain enough viral genome material. The target enrichment sequencing uses virus-specific capture oligonucleotides to enrich the viral genome preparation before sequencing. This method is more specific than metagenomics sequencing but implies higher costs and a more advanced technical expertise for sample preparation. Finally, the PCR amplicon sequencing is a well-established method consisting in specific viral genome amplification by PCR before sequencing. It is easily applicable to a large number of samples in routine use and so very adequate for clinical samples. The PCR amplification method, compared to the others, is particularly relevant for samples containing very low viral genetic material; it presents several disadvantages, though. The sequence of the virus of interest has to be known and not too variable to be correctly amplified by the set of designed primers. A second pitfall is due to the fact that the PCR cycles can introduce some amplification errors along the sequence which make the assembly step more prone to mistakes. Finally, this method can only be used for small genomes because of the number of PCR reactions which has to be limited.”

“The bioinformatics analysis of virus sequencing data is often based on alignment, or mapping, of reads against a reference sequence followed by the consensus extraction by majority voting. However, the alignment step is known to introduce some biases (Archer et al., 2010; Posada-Cespedes et al., 2017). For example, if the studied virus sequence is divergent from the chosen reference sequence, the reads covering the regions of divergence could not be aligned correctly, which will bias the resulting consensus. Moreover, the mapping step of reads in divergent, repetitive or low complexity regions is a difficult task which has to be carefully examined (Caboche et al., 2014). Finally, the choice of the reference sequence itself is a critical step on which the resulting consensus sequence will strongly depend.”

https://www.sciencedirect.com/science/article/pii/S0042682219300728
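The “consensus extraction by majority voting” step described in the quote above can be sketched in a few lines: at each position of the alignment, the most frequent base among the mapped reads wins, and disagreeing reads are simply out-voted. The aligned reads here are made up and assumed already mapped; real pipelines work from BAM pileups.

```python
# Majority-vote consensus from reads already aligned to a reference.
from collections import Counter

def majority_consensus(aligned_reads):
    """aligned_reads: equal-length strings, '-' where a read has no base."""
    consensus = []
    for column in zip(*aligned_reads):
        bases = [b for b in column if b != "-"]
        # 'N' marks positions with no coverage at all
        consensus.append(Counter(bases).most_common(1)[0][0] if bases else "N")
    return "".join(consensus)

reads = [
    "ACGT-CGA--",
    "ACGTTCGAAC",
    "ACGA-CGTAC",  # disagrees at positions 3 and 7 and gets out-voted
]
print(majority_consensus(reads))  # ACGTTCGAAC
```

The minority bases vanish from the output entirely, and which bases form the majority depends on how the reads were mapped, which in turn depends on the chosen reference, exactly the dependency the quoted passage warns about.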

Without purification/isolation of a “virus,” all of these methods are worthless.

In Summary (2016):

  • Analytical approaches for reconstructing and classifying “viral” genomes from mixed samples remain limited in their performance and usability
  • Notable technical challenges have impeded progress such as:
    1. Fragments of “viral” genomes are typically orders of magnitude less abundant than those of host, bacteria, and/or other organisms in clinical and environmental metagenomes
    2. Observed “viral” genomes often deviate considerably from reference genomes demanding use of exhaustive alignment approaches
    3. High intrapopulation “viral” diversity can lead to ambiguous sequence reconstruction
    4. The relatively few documented “viral” reference genomes compared to the estimated number of distinct “viral” taxa renders classification problematic
  • Various software tools have been developed to accommodate the unique challenges and use cases associated with characterizing “viral” sequences; however, the quality of these tools varies, and their use often necessitates computing expertise or access to powerful computers, thus limiting their usefulness to many researchers
  • At present we have a limited grasp of the extent of “viral” diversity present in the environment
  • A fast sequence classifier might fail entirely to detect a novel strain of a well-characterized “virus,” and equally might perform well with Illumina sequences yet deliver poor results for data generated with the Ion Torrent platform
  • Most genomic sequence analysis software is not well suited for “viral” genomes
  • Generic tools that are able to address the challenges posed by “viral” sequences are often applicable only in limited circumstances
  • Choosing between approaches is made difficult due to an abundance of disparate yet functionally equivalent methodologies and in general a lack of rigorous benchmarks for “viral” datasets
  • The sensitive detection of previously characterized “viruses” and “viral” discovery remain key challenges open for innovation
  • Within metagenomes the proportion of “viral” nucleic acids is typically far lower than that of host or other microbes, limiting the amount of signal available for analysis after sequencing
  • Size filtration or density-based enrichment by centrifugation are two effective methods for increasing “virus” yield, although such methods may bias the observed composition of “viral” populations
  • PCR amplification may be used to generate an abundance of specific “viral” sequences present in a sample, although effective primer design can be challenging in the presence of high genomic diversity in the target “viral” species
  • An excess of sequencing coverage can lead to the construction of overly complex and unwieldy de novo assembly graphs in the presence of high genomic diversity, reducing assembly quality
  • Using in silico (in the computer) normalization, excess coverage may be reduced by discarding sequences containing redundant information
  • Another in silico (in the computer) strategy for increasing analytical efficiency by discarding unneeded data is to filter sequences from known abundant organisms through alignment with one or more reference genomes using an aligner or specialist tool
  • In reference-based approaches, reads are mapped to similar regions of a supplied template genome, a well-studied and computationally efficient process implemented with a suffix array index of the reference genome
  • In contrast, de novo assembly is computationally exhaustive but important in cases where either a target genome is poorly characterized or reconstruction of genomes of a priori unknown entities in metagenomes is sought, such as in surveillance studies
  • Assembly is the process of taking a large number of short DNA sequences and putting them back together to create a representation of the original chromosomes from which the DNA originated (https://www.sciencedirect.com/topics/agricultural-and-biological-sciences/genome-assembly)
  • It is also widely used for generating whole genome consensus sequences to facilitate analyses of “viral” variation, and is a typical starting point for analyses of diverse populations of well-characterized “viruses”
  • Even where long reads are available, assembly plays an important role in mitigating the high error rates associated with single molecule sequencing technologies, yielding accurate consensus sequences from inaccurate individual reads
  • Typical de novo assemblers are designed to reconstruct genomes with uniform sequencing coverage across their length but this is problematic for metagenomes (including “viromes”) where coverage typically varies considerably both among different genomes and within individual genomes
  • The complexity of reconstructing multiple genomes simultaneously drives metagenome assemblers’ tendency to collapse variation at or beneath strain level into consensus sequences
  • Their effectiveness may be limited as a consequence of extreme variation within specific RNA “virus” populations due to mutation and recombination, and low and/or uneven sequencing coverage across a particular genome
  • De novo assembly is particularly sensitive to the quality of input sequences, meaning that problems during sample extraction, enrichment and library preparation can be highly detrimental to downstream analyses
  • “Viral” genomes and metagenomes comprising high intraspecific variation can be challenging targets for assembly, giving rise to complex assembly graphs and fragmented assemblies
  • A limitation of all of the alignment based probabilistic population reconstruction approaches, however, is their reliance upon a single reference sequence with which to perform the initial alignment, a process which assumes a degree of sequence similarity which may not always be observed in diverse regions
  • All classification methods, to some extent, depend upon detecting similarity between a query sequence and a collection of annotated sequences
  • The computational requirements of available approaches vary dramatically according to their ability to detect homology in divergent sequences
  • “Viral” identification approaches typically depend on similarity searches against a database using an aligner such as BLAST
  • Comprehensive databases (e.g. GenBank) or smaller custom databases containing, for example, only “viral” sequences of interest may be used, although the latter can generate misleading results
  • BLAST-like methods are limited by:
    1. The relatively few validated “viral” sequences deposited in public databases
    2. The high diversity within “viral” families which can obscure relatedness
    3. The lack of a defined set of core genes common to all “viruses” that can be used to distinguish species
  • These features make it difficult to assign similarity thresholds for classification that are applicable to all potential “viruses” in a sample
  • A fundamental challenge in the classification of “viral” sequences with any of these methods remains their limited representation within curated sequence databases
  • Our documented understanding of the extent of “viral” diversity remains superficial
  • Reads of true “viral” origin are therefore liable to be missed in many cases
  • Classification of “viral” sequences is critically dependent upon the quality of curated “viral” databases

In Summary (2017):

  • Sequencing “viral” nucleic acids, whether from cultures or directly from clinical specimens, is complicated by the presence of contaminating host DNA
  • Metagenomics:
    1. Total DNA and/or RNA, including from the host, bacteria, “viruses,” fungi and other pathogens, are extracted from a sample, and a library is prepared and sequenced by shotgun sequencing or RNA sequencing
    2. In clinical specimens, the presence of contaminating nucleic acids from the host and commensal microorganisms decreases sensitivity
    3. The proportion of reads that match the target “virus” genome from metagenomic WGS is often low; for example, 0.008% for EBV in the blood of a healthy adult, 0.0003% for Lassa “virus” in clinical samples and 0.3% for Zika “virus” in a sample that was enriched for “virus” particles through filtration and centrifugation
    4. The concentration of “virus” particles from clinical specimens by antibody-mediated pull-down (for example, “virus” discovery based on complementary DNA (cDNA)–amplified fragment length polymorphism (AFLP), abbreviated to VIDISCA), filtration, ultracentrifugation and the depletion of free nucleic acids, which mostly come from the host, have all been tried; however, these methods may also decrease the total amount of “viral” nucleic acids so that it is insufficient for preparing a sequencing library
    5. Methods to increase DNA yield exist, however these approaches are time consuming, costly, and may increase the risk of biases, errors and contamination, without necessarily improving sensitivity and the proportion of host reads often remains high
    6. It is crucial to use appropriate bioinformatic tools and databases that can evaluate whether detected pathogen sequences are likely to be the cause of infection, incidental findings or contaminants
    7. Incidental findings, both in host and microbial sequences, may also present ethical and even diagnostic dilemmas for clinical metagenomics
  • PCR amplicon enrichment:
    1. PCR amplification of “viral” genetic material uses primers that are complementary to a known nucleotide sequence
    2. Overlapping PCRs combined with NGS have been used to sequence the whole genomes of larger “viruses,” but this method has limited scalability, as many primers and a relatively large amount of starting DNA are required which limits the number of suitable samples that are available and also the genomes that can be studied using this method
    3. For clinical applications this is problematic because of the high laboratory workload that is associated with numerous discrete PCR reactions, the necessity for individually normalizing concentrations of different PCR amplicons before pooling, the increasing probability of reaction failure due to primer mismatch (particularly for highly variable “viruses”), and the high costs of labour and consumables
    4. Increasing numbers of PCR reactions require a corresponding increase in sample amount, and this is not always possible as clinical specimens are limited
    5. Highly variable pathogens, particularly those that have widely divergent genetic lineages or genotypes, cause problems for PCR amplification, such as primer amplification and primer mismatches
    6. Careful primer design may help to mitigate these problems, but novel variants remain problematic
  • Target enrichment:
    1. Can be used to sequence whole “viral” genomes directly from clinical samples without the need for prior culture or PCR but these methods typically involve small RNA or DNA probes that are complementary to the pathogen reference sequence (or a panel of reference sequences)
    2. The success of this method depends on the available reference sequences for the “virus” of interest; specificity increases when probes are designed against a larger panel of reference sequences, as this leads to better capture of the diversity in and between samples
    3. Target enrichment requires knowledge of the internal sequence to design probes
    4. Target enrichment is not suitable for the characterization of novel “viruses” that have low homology to known “viruses”
  • In one study, metagenomic methods were the least sensitive, yielded the lowest genome coverage for comparable sequencing effort and were more prone to result in incomplete genome assemblies while the PCR method required repeated amplification and was the most likely to miss mixed infections
  • The issues of sensitivity and contamination are especially important in WGS, because of the risk of both false-negative and false-positive detection of pathogens
  • It is important to remember that the detection of “viral” nucleic acid does not necessarily identify the cause of illness, and it is good practice when using NGS methods for the diagnosis of “viral” infections to confirm the findings with alternative independent methods that do not rely on testing for nucleic acids
  • Methods that can effectively deplete host DNA without affecting “viral” DNA are needed

In Summary (2019):

  • It is admitted that obtaining “virus” genome sequence directly from clinical samples is still a challenging task due to the low load of “virus” genetic material compared to the host DNA, and to the difficulty to get an accurate genome assembly
  • Despite the relatively small size of “virus” genomes, their sequencing often remains difficult
  • The small amount of “virus” genetic material compared to the host nucleic acid decreases “viral” sequencing output
  • Several “viral” variants coexist in a single sample, presenting more or less variable sequences
  • In metagenomic sequencing, total DNA (and/or RNA) from a sample including host but also bacteria, “viruses” and fungi is extracted and sequenced
  • Target enrichment and amplicon sequencing both depend on reference information to design baits or primers
  • The limitation of metagenomic sequencing is that it requires a very high sequencing depth to obtain enough “viral” genome material
  • PCR for sequencing presents several disadvantages such as:
    1. The sequence of the “virus” of interest has to be known and not too variable to be correctly amplified by the set of designed primers
    2. A second pitfall is due to the fact that the PCR cycles can introduce some amplification errors along the sequence which make the assembly step more prone to mistakes
    3. Finally, this method can only be used for small genomes because of the number of PCR reactions which has to be limited
  • The bioinformatics analysis of “virus” sequencing data is often based on alignment, or mapping, of reads against a reference sequence followed by the consensus extraction by majority voting
  • The alignment step is known to introduce some biases
  • If the studied “virus” sequence is divergent from the chosen reference sequence, the reads covering the regions of divergence could not be aligned correctly which will bias the resulting consensus
  • The choice of the reference sequence itself is a critical step from which the resulting consensus sequence will strongly depend

Needless to say, in order for genomics to be a valuable tool for the study of “viruses,” these “viruses” must be shown to exist first. There are too many errors, biases, and variables, and too much reliance on references and consensus, for genomics to prove and/or study something that has not been purified/isolated first.

This is what Charles Calisher warned about. These new tools are pretty cool and fun to look at but the information coming from them means nothing if the tried and true methods of the past are ignored. Modern virology has only strings of A,C,T,G’s in a database with nothing physical backing them up.

Unfortunately for Dr. Calisher, even the old methods of virology he championed were prone to errors and were unable to produce the proof of purified/isolated pathogenic “viruses.”

5 comments

  1. Reading this has me shaking my head. It is unbelievable that virologists can make ANY claims about a virus sequence with a straight face. Let’s just call the whole thing off, as the song goes. It is utterly ridiculous to claim anything from an unpurified sample contaminated with who knows what genetic material and run through de novo assemblers, aligned against a reference. All made up, then wrapped in a highly technical bow of nonsense. “…yielding accurate consensus sequences from inaccurate individual reads.” Yeah, right. How this has gone on so long is incredible. Great work, as always!!


    1. Yes! I don’t know how they can claim these sequences mean anything with a straight face. They admit over and over again to off-target DNA from multiple sources being within the samples. They also admit that purification procedures not only may not get rid of all of these off-target DNA sources, they can also decrease “viral” yields. On top of all of that, they admit detecting nucleic acid does not mean that they identified the cause of illness. These sequences are utterly useless.


  2. Thank you for these gems. So as an engineer with 20-plus years doing Test and Measurement, my first question regards the in silico modeling. Who are the software developers writing the code for the graphics that the scientists use to model, and what is that code based upon? What margin for error is in the code itself? What effects might different operating systems have? Is there a certified standard for these software models? Who validates that a new computer-generated virus model is authentic? So in the make-believe world, let’s pretend everything aligned and a new model virus was deemed authentic. But there is still one more glaring problem, which you already commented on. They still need to show it actually exists by isolating and purifying a reference material sample in the real world for comparison to validate the model. It’s been a breathtaking journey to realize there are people with doctorate degrees who are doing complete pseudoscience, whereby they are creating “ghosts” in the test and measurement methodologies being implemented. The lack of proper control measures, baselines, and recognition of variables being introduced is, in my opinion, incredible for someone with these alleged credentials: clueless, incompetent, grossly negligent, and frankly just worthless in having actual analytical skills.


    1. It is unfortunate but the indoctrination is strong. These people are not taught critical thinking and/or logic. They are taught how to memorize and regurgitate. The system is designed to make one incapable of independent thought. The cognitive dissonance is strong.

      As for the software, this is a huge issue for all the reasons you pointed out, as well as the reproducibility issue. They will get different results using different methods with different software. Nothing is standardized and it seems they are able to do their own thing and submit their genomes to GISAID. There are no attempts at reproducibility nor replication of results. That’s why, as of this moment, there are 4.25 million variants submitted to GISAID. The databases are not properly curated either. There are far too many variables involved to conclude that these random letters in a database mean anything, especially when they do not come from purified/isolated sources.

      In other words, spot on assessment! 🙂

