Genome Contamination: A Widespread Problem

Most clinical specimens and tissue culture samples to be used for viral genome sequencing are usually contaminated with human cells, other microorganisms and naked DNA and RNA from disrupted cells.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2638583/

When talking about the proof for the existence of “viruses,” it is logical to require that the particles believed to be a “virus” are purified (i.e. free of contamination, pollutants, foreign material) and isolated (separated from everything else). Only through the process of purification/isolation of particles believed to be “viruses” can one actually demonstrate that those specific particles exist in reality and are the only possible substance that could potentially be the cause of disease. Only once these purified/isolated particles are shown to exist independently would it then be theoretically possible to obtain an accurate genome from the genetic material obtained from a single source.

Unfortunately for virology, these two logical requirements are never met. “Viruses” are never taken directly from humans and purified free of contamination nor are they ever isolated from everything else. Instead, the samples taken from humans are subjected to the cell culture process which is the exact opposite of purification/isolation as can be seen in the above quote. Human cells, animal cells, various known and unknown microorganisms, naked DNA/RNA, and other sources of “non-viral” material are certain to be within tissue and cell cultured samples.

Without purification/isolation of the sample directly from humans, there is no proof the particles actually exist in humans. The particles claimed to be “viruses” are nothing but a creation of the culturing process resulting from the breakdown and decay of the cells. They exist only as byproducts of lab experimentation in petri dishes. The particles showcased in colorful electron microscope images claimed to be the “virus” could be any of numerous visually identical substances, such as exosomes or multivesicular bodies. There can be no proof of pathogenicity, as any of the toxic additives (antibiotics/antifungals, fetal cow blood, chemicals) which are mixed into the culture can produce disease by themselves.

This lack of purified/isolated particles leads to many issues, including contamination of the supposed genome. Since the cultured material used for sequencing contains potentially billions of fragments of “non-viral” genetic material, both known and unknown, there can be no claim that the RNA used to create a specific “virus” genome came from one source. “Virus” genomes are amalgamations of genetic material from numerous sources cobbled together with the help of computer algorithms. I have included highlights from a few studies that showcase the widespread problem of genome contamination and how it affects the results of sequence analysis from these unpurified sources.

In June 2019, a study came out which identified the problem of human contamination in genomics, a problem that has led to the creation of thousands of fake protein families spread throughout many genomes. The authors state that while the goal is to have complete and accurate reference databases, most reference genomes are not complete and are in fact still “drafts.” These genomes vary in quality amongst species. A major contributor to this varied quality is contamination, which is a common problem that results in false sequences being propagated in future sequencing efforts. The authors believe that this contamination is a widespread problem which will lead to many false-positive results, as human reads will be incorrectly said to be bacterial:

Human contamination in bacterial genomes has created thousands of spurious proteins

“Contaminant sequences that appear in published genomes can cause numerous problems for downstream analyses, particularly for evolutionary studies and metagenomics projects. Our large-scale scan of complete and draft bacterial and archaeal genomes in the NCBI RefSeq database reveals that 2250 genomes are contaminated by human sequence. The contaminant sequences derive primarily from high-copy human repeat regions, which themselves are not adequately represented in the current human reference genome, GRCh38. The absence of the sequences from the human assembly offers a likely explanation for their presence in bacterial assemblies. In some cases, the contaminating contigs have been erroneously annotated as containing protein-coding sequences, which over time have propagated to create spurious protein “families” across multiple prokaryotic and eukaryotic genomes. As a result, 3437 spurious protein entries are currently present in the widely used nr and TrEMBL protein databases. We report here an extensive list of contaminant sequences in bacterial genome assemblies and the proteins associated with them.”

“Ideally, all genomes in reference databases would be complete and accurate (Fraser et al. 2002), but for practical reasons, the vast majority of genomes available today are still “drafts.” A draft genome consists of multiple contigs or scaffolds that are typically unordered and not assigned into chromosomes (Ghurye et al. 2016). A genome is not truly complete or “finished” until every base pair has been determined for every chromosome and organelle, end-to-end, with no gaps. Even the human genome, although far more complete than most other animal genomes, is still unfinished: The current human assembly, GRCh38.p13 (released Feb. 28, 2019), has 473 scaffolds that contain 875 internal gaps. While most of the human sequence has been placed on chromosomes, some highly repetitive regions are underrepresented (Altemose et al. 2014), leading to problems that we discuss below. Draft genomes of other species vary widely in quality as well as contiguity, with some having thousands of contigs and others having a much smaller number.”

“Contamination of genome assemblies with sequences from other species is not uncommon, especially in draft genomes (Longo et al. 2011; Merchant et al. 2014; Delmont and Eren 2016; Kryukov and Imanishi 2016; Lu and Salzberg 2018). In 2011, researchers reported that over 10% of selected nonprimate assemblies in the NCBI and UCSC Genome Browser databases were contaminated with the primate-specific AluY repeats (Longo et al. 2011). Although validation pipelines have improved substantially since then (Tatusova et al. 2016; Haft et al. 2018), some contaminants still remain, as we describe below. Furthermore, when open reading frames (ORFs) in the contaminated contigs get annotated as protein-coding genes, their protein sequence may be added to other databases. Once in those databases, these spurious proteins may in turn be used in future annotation, leading to the so-called “transitive catastrophe” problem where errors are propagated widely (Karp 1998; Salzberg 2007; Danchin et al. 2018). Indeed, one study found that the percentage of misannotated entries in the NCBI nonredundant (nr) protein collection, which is used for thousands of BLAST searches every day, has been increasing over time (Schnoes et al. 2009).

Contamination of genomic sequences can be particularly problematic for metagenomic studies. For example, if a genome labeled as species X contains fragments of the human genome, then any sample containing human DNA might erroneously be identified as also containing species X. Since human DNA is virtually always present in the environment of sequencing laboratories, human contamination is very common in sequencing experiments of all types. Contamination of laboratory reagents with DNA from other organisms can also lead to serious misinterpretations, such as the supposed detection of the novel virus NIH-CQV in hepatitis patients, which was ultimately determined to be a contaminant of nucleic acid extraction kits (Smuts et al. 2014).”

“We demonstrated that human contamination has made its way into 2250 publicly available microbial genomes, primarily bacteria but also archaea and some eukaryotes. In turn, erroneous translations of these contaminants have generated more than 3000 annotated proteins, which now form highly conserved but spurious protein families spanning a broad range of bacterial phyla and some eukaryotic species. All of these genomes and proteins appear in at least one if not several widely used sequence databases. It is possible that additional contaminants might be present, because we did not screen for all possible sources of contamination, such as other human genomic regions, fragments of DNA from nonhuman host organisms, environmental sources, and laboratory vectors.

This widespread contamination creates serious problems for many types of scientific analyses that depend on genome and protein databases. One example where this problem is most acute is the use of metagenomic sequencing to diagnose infections, a rapidly growing clinical application in which human tissues are sequenced to identify a potential pathogen (Wilson et al. 2014; Naccache et al. 2015; Berger and Wilson 2016; Salzberg et al. 2016). In these samples, where the dominant species is human, contamination of even a small fraction of the bacterial genomes in the database will cause numerous false positives, as human reads may appear, incorrectly, to represent bacterial organisms.”

https://pubmed.ncbi.nlm.nih.gov/31064768/
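The kind of screening behind the study above — scanning assembled contigs for stretches of human repeat sequence — can be illustrated with a minimal toy sketch in Python. All sequence names, sequences, and thresholds here are invented for illustration; a real scan aligns against full repeat libraries with much longer seeds and alignment-based scoring:

```python
K = 8  # k-mer length; real screens use much longer alignment seeds

def kmers(seq, k=K):
    """Return the set of k-length substrings of a DNA sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def flag_contaminated_contigs(contigs, repeat_library, min_shared=3):
    """Flag contigs sharing >= min_shared k-mers with any human repeat."""
    repeat_kmers = set()
    for repeat in repeat_library:
        repeat_kmers |= kmers(repeat)
    flagged = []
    for name, seq in contigs.items():
        shared = len(kmers(seq) & repeat_kmers)
        if shared >= min_shared:
            flagged.append(name)
    return flagged

# Toy example: one contig embeds a stretch of a "human repeat".
human_repeats = ["ACGTACGTACGTACGT"]
contigs = {
    "contig_1": "TTTTGGGGCCCCAAAATTTTGGGG",  # unrelated sequence
    "contig_2": "GGGGACGTACGTACGTACGTTTTT",  # embeds the repeat
}
print(flag_contaminated_contigs(contigs, human_repeats))  # ['contig_2']
```

A contig flagged this way would then be inspected or removed before the assembly is deposited — the step the study found was skipped in thousands of RefSeq genomes.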

In July 2020, a Nature article came out providing further evidence of contamination in genomes. It is admitted that gaps, errors and contamination exist within the database despite the reliance on accurate data. As in the previous article, the study reiterates that these problems result in false positives. The researchers decided to investigate this problem by creating an algorithm to search for contamination across many different kingdoms. While they expected to find only a few thousand contaminated sequences, they actually uncovered millions, including within the latest draft version of the human genome:

Contamination in sequence databases

“Biological sequences in public databases are indispensable resources for life science research. Despite our everyday reliance on these databases, there are gaps, errors and contamination in the data. “One of our research efforts for the past several years has been the detection of pathogens in humans by using metagenomic shotgun sequencing,” says Martin Steinegger, who was a member in Steven L. Salzberg’s lab at Johns Hopkins University and is now at the Seoul National University. “Unfortunately, in many cases we have found that contamination within the genome sequences produces false positives.”

This motivated Steinegger and Salzberg to start a project to assess contamination in the GenBank, RefSeq and NR databases. Using recent fast algorithms, they developed a tool called Conterminator that enables searching for contamination across kingdoms and scales linearly. “The version of GenBank we evaluated had a size of 3.3 terabytes and contained 400 million sequences. Aligning them all-against-all would require hundreds of years using classic methods,” says Steinegger. “Our algorithm only required 12 days to process all of GenBank on a single 32-core server.”

They expected to see a few thousand contaminated sequences but ended up with millions. 2,161,746, 114,035 and 14,148 contaminated sequences were detected in GenBank, RefSeq and NR, respectively. “This single most surprising finding was the presence of a piece of a bacterium, Acidithiobacillus thiooxidans, in an alternative scaffold of the current version of the human reference genome (GRCh38),” says Steinegger. “The human genome has been around for such a long time, and so many researchers use it on a daily basis, that we didn’t expect to see any contaminants there.”

Steinegger hopes Conterminator can help researchers who sequence genomes and database managers to detect contamination. As a word of caution to users of genome sequences, “Many of the genomes contain contamination. One particular problem that arises, again and again, is that contamination leads to incorrect claims about horizontal gene transfer,” says Steinegger.”

https://www.nature.com/articles/s41592-020-0895-8
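Why an all-against-all alignment of 400 million sequences is infeasible, while a k-mer-indexed scan is not, can be shown with a toy sketch: each base is touched once when building the index, so runtime grows linearly with total sequence length rather than quadratically with the number of sequences. This is only the general idea, not Conterminator's actual algorithm, and all data below are invented:

```python
from collections import defaultdict

K = 6  # toy k-mer length

def cross_kingdom_kmers(sequences):
    """Map each k-mer to the kingdoms it occurs in; report k-mers seen
    in more than one kingdom as candidate contamination signals."""
    seen = defaultdict(set)  # k-mer -> set of kingdoms
    for name, (kingdom, seq) in sequences.items():
        for i in range(len(seq) - K + 1):
            seen[seq[i:i + K]].add(kingdom)
    return {kmer for kmer, kingdoms in seen.items() if len(kingdoms) > 1}

seqs = {
    "human_chr_frag": ("eukaryote", "AAACCCGGGTTTAAA"),
    "bacterium_ctg":  ("bacterium", "TTGGGTTTAAACCTG"),  # shares a region
    "archaeon_ctg":   ("archaeon",  "CCCCCCCCCCCCCCC"),
}
candidates = cross_kingdom_kmers(seqs)
print(sorted(candidates))  # shared human/bacterial k-mers only
```

Sequences sharing flagged k-mers across kingdoms would then be aligned in full — a tiny fraction of all possible pairs, which is what makes a single-server scan of GenBank tractable.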

In a February 2020 study, the authors also admitted to a widespread problem of contamination in genomics which propagates in the databases and affects the outcomes of studies and analyses. Even though this is a known issue, there is a lack of an accurate assessment of the scope of the problem. Researchers regularly utilize the work of others and are dependent on this work being accurate and reliable. As they rarely check the reliability of the data generated by others, the contaminated sequences spread and accumulate in the databases in a perpetual cycle. One way of checking the work involves the use of reference genomes; however, this relies on the completeness and accuracy of the reference database. This process can only propagate contamination, not correct it. They concluded that their estimation of the problem is most likely an underestimate and that the problem is far worse than the picture presented:

Prevalence and Implications of Contamination in Public Genomic Resources: A Case Study of 43 Reference Arthropod Assemblies 

“Thanks to huge advances in sequencing technologies, genomic resources are increasingly being generated and shared by the scientific community. The quality of such public resources are therefore of critical importance. Errors due to contamination are particularly worrying; they are widespread, propagate across databases, and can compromise downstream analyses, especially the detection of horizontally-transferred sequences. However we still lack consistent and comprehensive assessments of contamination prevalence in public genomic data.”

“Scientists typically re-use sequence data generated by others, and are therefore dependent on the reliability of the available genomic resources. For this reason, the problem of public data quality in molecular biology has long been identified as a crucial issue (Lamperti et al. 1992; Mistry et al. 1993; Binns 1993). The problem is even more acute nowadays with the advent of high-throughput sequencing technologies, when most datasets generated in genomic research are simply not amenable to manual curation by humans. This brings a new challenge to current methodologies in genomic sciences, namely, the development of automated approaches to the detection and processing of errors (e.g., Andorf et al. 2007; Schmieder and Edwards 2011; Parks et al. 2015; Delmont and Eren 2016; Drăgan et al. 2016; Tennessen et al. 2016; Laetsch and Blaxter 2017; Lee et al. 2017).

Data quality issues in genome sequences include sequencing errors, assembly errors and contamination, among other things. Errors due to contamination are particularly worrying for several reasons. First, they can lead to serious mis-interpretations of the data, as illustrated by recent, spectacular examples. Potential problems include mis-characterization of gene content and related metabolic functions (e.g., Koutsovoulos et al. 2016; Breitwieser et al. 2019), improper inference of evolutionary events (e.g., Laurin-Lemay et al. 2012; Simion et al. 2018), and biases in genotype calling and population genomic analyses (e.g., Ballenghien et al. 2017; Wilson et al. 2018). Second, contamination is suspected to be widespread. It occurs naturally in most sequencing projects due to foreign DNA initially present in the raw biological material (e.g., symbionts, parasites, ingested food; Salzberg et al. 2005; Starcevic et al. 2008; Artamonova and Mushegian 2013; Driscoll et al. 2013; Martinson et al. 2014; Cornet et al. 2018), or entering the process in wet labs and sequencing centers (Longo et al. 2011; Salter et al. 2014; Wilson et al. 2018). Third, contamination errors easily propagate across databases in a self-reinforcing vicious circle. If a DNA sequence from species A is initially assigned to the wrong species B due to a contamination of B by A, it is likely to keep its incorrect status for a while, and may even be identified as a contamination of A by B when the genome of A is eventually sequenced (Merchant et al. 2014). Despite all the possible problems stemming from contamination in genomic resources, most studies addressing this issue so far have focused on one particular genome (e.g., tardigrades) and/or one particular source of contaminants (e.g., humans). Only two studies that we are aware of have consistently screened more than one genome assembly. Merchant et al. (2014) focused on the bovine genome but also applied their pipeline to eight randomly drawn draft genomes (five animals, two plants, one fungus), with contrasted results. Cornet et al. (2018) analyzed 440 genomes of Cyanobacteria and uncovered a substantial level of contamination in >5% of these. There is obviously a need for further assessment of the problem of contamination in publicly available genomic data.”

“A straightforward way to identify contamination in a newly sequenced genome is to compare the assembled sequences to existing databases using BLAST-like algorithms. If a sequence’s best match is assigned to a species that is phylogenetically distant from the target organism, then the sequence is annotated as a contaminant. There are several problems with this simple strategy. First, this does not allow one to distinguish contaminants from HGT. Second, this approach is entirely dependent on the correctness of the reference database. A best-BLAST-hit survey can only propagate, not correct, pre-existing taxonomic mis-assignments, as discussed above. Third, such an approach is also dependent on the completeness of the reference database, and on the phylogenetic position of the target organism. If the reference database is imbalanced and dominated by one or a few particular taxa (typically model organisms), then its power to properly discriminate genuine sequences from contaminants will be maximal for newly sequenced organisms closely related to the dominant taxa, and much lower for organisms distantly related to the dominant taxa.”

“In summary, out of 43 published genome assemblies, 15 (i.e., 35%) presented at least some traces of non-metazoan contamination, including four which were substantially contaminated. These figures are likely an underestimation of the actual prevalence of contamination because of the limitation due to incompleteness of reference genomic databases, as discussed above. Moreover, the overall prevalence of contamination is expected to be even higher as we did not consider metazoan contaminants. Yet contamination from wet lab technicians as well as from model organisms extensively used in research facilities (e.g., mouse, zebrafish, …) is likely to occur in any sequencing project. Our results are consistent with recent analyses which uncovered similar level of contamination in published genome assemblies (e.g., Borner and Burmester 2017). In particular, Bemm et al. (BioRxiv: https://doi.org/10.1101/122309) reported from 0 to ca. 5% of bacterial contamination in Ensembl Metazoa genome and identified the bumblebee Bombus impatiens as one of the most highly contaminated assemblies. In addition, our analyses focused on CDS, which are among the most conserved and easy to annotate sequences of a genome, i.e., probably most easily filtered for contamination by assembly pipelines. Therefore the situation regarding contamination is probably even worse as far as non-coding sequences are concerned.”

https://academic.oup.com/g3journal/article/10/2/721/6026299
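The best-BLAST-hit screening strategy the excerpt criticizes can be sketched in a few lines. Hypothetical precomputed hits stand in for real BLAST output here, and the lineage comparison is deliberately crude; as the authors note, this approach cannot distinguish contamination from horizontal gene transfer and inherits any mis-assignments already present in the reference database:

```python
def screen_best_hits(hits, target_lineage):
    """hits: {contig: [(subject_lineage_tuple, bitscore), ...]}.
    Flag a contig as a candidate contaminant when its best-scoring hit
    shares no lineage ranks with the target organism's lineage."""
    flagged = []
    for contig, matches in hits.items():
        if not matches:
            continue  # no hit at all: cannot classify either way
        best_lineage, _ = max(matches, key=lambda m: m[1])
        # "Phylogenetically distant" here = no shared lineage ranks.
        if not set(best_lineage) & set(target_lineage):
            flagged.append(contig)
    return flagged

# Hypothetical target organism and hits for two assembly scaffolds.
target = ("Eukaryota", "Arthropoda", "Insecta")
hits = {
    "scaffold_1": [(("Eukaryota", "Arthropoda", "Insecta"), 950.0),
                   (("Bacteria", "Proteobacteria"), 120.0)],
    "scaffold_2": [(("Bacteria", "Proteobacteria"), 800.0)],
}
print(screen_best_hits(hits, target))  # ['scaffold_2']
```

Note that if `scaffold_2`'s bacterial best hit were itself a human- or insect-contaminated entry, the screen would mislabel it — exactly the self-reinforcing circle the study describes.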

Finally, in a March 2020 study, the authors state that contamination is a well-known problem which leads to inaccurate assessments in basic and clinical research. It is stated that contamination is pervasive and results in false positives and negatives. The role of contamination is rarely considered, nor are the resulting incorrect conclusions about biological phenomena. They found that contamination events are frequent not only in unpurified clinical samples but also in samples considered to be of pure culture. Even sequencing samples with low contamination can result in dozens of errors. Sequencing directly from clinical samples yields the most errors, as these are the most highly contaminated sources. Culture-free sequencing results in high amounts of contamination from the host, and regardless of the source of contamination, the result is non-target sequencing reads that affect the analysis. While the researchers expected contamination to be a major issue when sequencing DNA that has not been extracted from pure cultures or single colonies, as is done with clinical specimens, it was shown that sequencing from pure cultures is not free of contamination either, and that using standard mapping quality parameters is not enough to deal with contaminant reads:

Contaminant DNA in bacterial sequencing experiments is a major source of false genetic variability

Background

“Contaminant DNA is a well-known confounding factor in molecular biology and in genomic repositories. Strikingly, analysis workflows for whole-genome sequencing (WGS) data commonly do not account for errors potentially introduced by contamination, which could lead to the wrong assessment of allele frequency both in basic and clinical research.

Results

We used a taxonomic filter to remove contaminant reads from more than 4000 bacterial samples from 20 different studies and performed a comprehensive evaluation of the extent and impact of contaminant DNA in WGS. We found that contamination is pervasive and can introduce large biases in variant analysis. We showed that these biases can result in hundreds of false positive and negative SNPs, even for samples with slight contamination. Studies investigating complex biological traits from sequencing data can be completely biased if contamination is neglected during the bioinformatic analysis, and we demonstrate that removing contaminant reads with a taxonomic classifier permits more accurate variant calling.”
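The taxonomic filtering step the authors describe — discarding reads that a classifier does not assign to the target organism before variant calling — can be sketched as follows. The read sequences and classifier labels are toy data; a real pipeline would parse the per-read output of a classifier such as Kraken:

```python
def filter_reads(reads, classifications, target_taxon):
    """Keep only reads whose classification matches the target taxon.
    reads: {read_id: sequence}; classifications: {read_id: taxon}.
    Reads with no classification are dropped as unassignable."""
    return {rid: seq for rid, seq in reads.items()
            if classifications.get(rid) == target_taxon}

# Toy data: r2 is a contaminant read classified as human.
reads = {
    "r1": "ACGTACGT",
    "r2": "TTTTCCCC",
    "r3": "GGGGAAAA",
}
labels = {"r1": "M. tuberculosis", "r2": "H. sapiens",
          "r3": "M. tuberculosis"}
clean = filter_reads(reads, labels, "M. tuberculosis")
print(sorted(clean))  # ['r1', 'r3']
```

Only the surviving reads would then be mapped to the reference and passed to the variant caller, which is the "contamination-aware" arrangement the study argues every pipeline needs.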

“While many factors are taken into account when developing SNP calling pipelines, surprisingly, the potential role of contamination is seldomly considered [13]. However, misinterpretation of contaminated data can lead to draw incorrect conclusions about biological phenomena [14, 15].

Genomic databases are known to encompass contaminated sequences, with assembled genomes that can contain large genomic regions from non-target organisms [16, 17]. Strikingly, a recent study revealed that deposited bacterial and archaeal assemblies are contaminated by human sequences that created thousands of spurious proteins [18]. While the potential impact of contaminants has been considered in fields like metagenomics or transcriptomics, most bacterial WGS analysis pipelines lack specific steps aimed to deal with contaminant data. This situation likely originates from the assumptions that microbiological cultures are mostly free of non-target organisms and that even if present, contaminant sequences are unlikely to map to the reference genomes or are removed using standard filter cutoffs. To date, the extent of contamination and its impact in bacterial re-sequencing pipelines has not been comprehensively assessed.”

“We found that contamination events are frequent across bacterial WGS studies and can introduce large biases in variant analysis despite the use of stringent mapping and variant calling cutoffs. Importantly, this is not only true for culture-free sequencing strategies, but also for experiments sequencing from pure cultures. We show that the effect size is not dependent on the amount of contamination and that samples with even low-level contamination can accumulate dozens of errors, particularly for non-fixed SNPs.”

“When looking at the MTB dataset, we also observed contamination to be common across studies (Fig. 1b). As expected, direct sequencing from clinical specimens and early positive mycobacterial growth indicator tubes (MGIT), which are inoculated with primary clinical samples, present higher levels of contamination in terms of both the number of samples contaminated and the proportion of non-target reads within them. Common contaminants for these samples comprise human DNA, and bacteria usually found in oral and respiratory cavities like Pseudomonas, Rothia, Streptococcus, or Actinomyces, and can constitute virtually all reads in some samples. However, as observed for the bacterial dataset, contamination was also detected in studies in which the sequenced DNA came from pure culture isolates. For instance, Bacillus, Negativicoccus, and Enterococcus represented up to 68%, 58%, and 32%, respectively, of different samples from the KwaZulu study. Strikingly, 17 out of 73 MTB samples from the Nigeria study were identified as Staphylococcus aureus (92 to 99% of reads). The high-depth dataset was mostly free of contamination, with the exception of two samples for which 3.32% of A. baumannii and 2.83% of non-tuberculous mycobacteria (NTM) were identified (representing 795,887 and 920,379 reads, respectively).”

“Remarkably, even a 5% of contaminating reads can introduce a large number of false positive vSNPs. As expected, the erroneous calls produced by such small contamination fall mainly in conserved regions. However, in agreement with the results shown in Fig. 4a, spurious SNPs can be called across the genome (Additional file 8: Figure S2).”

“Sequencing directly from clinical specimens is subject to greater alterations in variant analysis (Fig. 5) since this strategy usually yields highly contaminated samples and limited sequencing depth. In these cases, the SNP frequencies are more sensitive to contaminant reads since only few reads can be responsible for a shift in the frequencies that make a position to fall below or above the required thresholds to call a variant (Additional file 7: Figure S1). However, a high sequencing depth does not guarantee an analysis safe of errors either.”
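The threshold effect described in this excerpt — a few contaminant reads pushing a position over a variant-calling cutoff — comes down to simple arithmetic, illustrated below. The 10% frequency cutoff and the read counts are hypothetical, chosen only to show the mechanism:

```python
def allele_frequency(alt_reads, total_reads):
    """Fraction of reads at a position supporting the alternate allele."""
    return alt_reads / total_reads

THRESHOLD = 0.10  # hypothetical minimum frequency to call a variant

# A position covered by 100 genuine reads, 9 supporting the alternate
# allele: 9% sits below the cutoff, so no variant is called.
print(allele_frequency(9, 100) >= THRESHOLD)  # False

# Add 5 contaminant reads that carry the alternate base at the mapped
# position: 14/105 ~ 13.3% now exceeds the cutoff -- a false positive.
print(allele_frequency(9 + 5, 100 + 5) >= THRESHOLD)  # True
```

The same arithmetic works in reverse: contaminant reads carrying the reference base can dilute a genuine variant below the cutoff, producing the false negatives the study also reports.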

Discussion

“In this work, we analyze more than 4000 WGS samples from 14 different pathogenic bacterial species to evaluate the extent and impact of contamination in bacterial WGS studies. We show that presence of sequencing reads from contaminating organisms is frequent, even when sequencing is performed from pure culture isolates (Fig. 1). Beyond inappropriate laboratory practices, there are several potential sources of contamination which depend on different factors such as the type of sample processed and its origin, or the protocols followed for culture, DNA extraction, and sequencing. For instance, Salter et al. demonstrated that contaminating DNA in laboratory reagents can critically impact microbiome analysis from low-biomass samples [19]. Culture-free sequencing approaches for unculturable or slow-growing pathogens, such as T. pallidum or MTB, entail the presence of high amounts of contaminating DNA from the host organism. Other sources unrelated to sample handling are also possible. For example, the S. aureus samples supposed to be MTB from the Nigeria study are most likely an error during data submission to the genomic repository. Regardless of the source of contamination, the shared consequence is the presence of non-target reads in the sequencing files that might impact the results of genomic analysis.

We evaluated such an impact and demonstrate that contaminant reads suppose a pitfall in re-sequencing pipelines, since they are unexpectedly frequent and can have major implications in variant analysis, which is the foundation of many genomic analyses. As expected, contamination is a major issue when sequencing DNA that has not been extracted from pure cultures or single colonies, as is often the case for clinical specimens. However, we show that experiments sequencing from pure cultures are not necessarily free of contamination, and that using standard mapping quality parameters is not enough to deal with contaminant reads. Therefore, bioinformatic pipelines assuming that all the reads successfully mapped are from the target organism might lead to a biased variant analysis. We show that the errors introduced by contamination are very variable among different studies (Table 2; Fig. 3; Fig. 5), which differ not only on the organism being sequenced but also on the sampling source and laboratory protocols.”

“The analyses for MTB reveal a large number of variants introduced by contaminants with downstream consequences when calling vSNPs and fSNPs as well as the wild type. Remarkably, we show that contamination can introduce substantial errors in samples that could be considered “pure” or with high sequencing depths, implying that contamination-aware pipelines will be needed in any circumstance.

Contamination has been recognized as a major source of error in genome assemblies and other fields like metagenomics [1619]. However, the role of contamination in re-sequencing pipelines is usually neglected. Whereas some groups are already aware of this issue, most bacterial re-sequencing pipelines are still lacking contamination-control strategies or, if any, these are rarely detailed in published works.”

https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-020-0748-z

In Summary:

  • Contaminant sequences that appear in published genomes can cause numerous problems for downstream analyses, particularly for evolutionary studies and metagenomics projects
  • A 2019 large-scale scan of complete and draft bacterial and archaeal genomes in the NCBI RefSeq database revealed that 2250 genomes are contaminated by human sequence
  • The contaminant sequences derived primarily from high-copy human repeat regions, which themselves are not adequately represented in the current human reference genome, GRCh38
  • In some cases, the contaminating contigs have been erroneously annotated as containing protein-coding sequences, which over time have propagated to create spurious protein “families” across multiple prokaryotic and eukaryotic genomes
  • As a result, 3437 spurious protein entries are currently present in the widely used nr and TrEMBL protein databases
  • Ideally, all genomes in reference databases would be complete and accurate, but for practical reasons, the vast majority of genomes available today are still “drafts”
  • A genome is not truly complete or “finished” until every base pair has been determined for every chromosome and organelle, end-to-end, with no gaps
  • The human genome, although far more complete than most other animal genomes, is still unfinished
  • Draft genomes of other species vary widely in quality as well as contiguity, with some having thousands of contigs and others having a much smaller number
  • Contamination of genome assemblies with sequences from other species is not uncommon, especially in draft genomes
  • In 2011, researchers reported that over 10% of selected nonprimate assemblies in the NCBI and UCSC Genome Browser databases were contaminated with the primate-specific AluY repeats
  • Furthermore, when open reading frames (ORFs) in the contaminated contigs get annotated as protein-coding genes, their protein sequence may be added to other databases
  • Once in those databases, these spurious proteins may in turn be used in future annotation, leading to the so-called “transitive catastrophe” problem where errors are propagated widely
  • In 2009, Schnoes et al. found that the percentage of misannotated entries in the NCBI nonredundant (nr) protein collection, which is used for thousands of BLAST searches every day, has been increasing over time
  • Contamination of genomic sequences can be particularly problematic for metagenomic studies
  • For example, if a genome labeled as species X contains fragments of the human genome, then any sample containing human DNA might erroneously be identified as also containing species X
  • Since human DNA is virtually always present in the environment of sequencing laboratories, human contamination is very common in sequencing experiments of all types
  • Contamination of laboratory reagents with DNA from other organisms can also lead to serious misinterpretations, such as the supposed detection of the novel “virus” NIH-CQV in hepatitis patients, which was ultimately determined to be a contaminant of nucleic acid extraction kits
  • It was demonstrated that human contamination has made its way into 2250 publicly available microbial genomes, primarily bacteria but also archaea and some eukaryotes
  • Erroneous translations of these contaminants have generated more than 3000 annotated proteins
  • All of these genomes and proteins appear in at least one if not several widely used sequence databases
  • It is possible that additional contaminants might be present such as:
    1. Other human genomic regions
    2. Fragments of DNA from nonhuman host organisms
    3. Environmental sources
    4. Laboratory vectors
  • This widespread contamination creates serious problems for many types of scientific analyses that depend on genome and protein databases
  • One area heavily affected is the use of metagenomic sequencing to diagnose infections, a rapidly growing clinical application in which human tissues are sequenced to identify a potential pathogen
  • In these samples, where the dominant species is human, contamination of even a small fraction of the bacterial genomes in the database will cause numerous false positives, as human reads may appear, incorrectly, to represent bacterial organisms
  • Despite everyday reliance on genomic databases, there are gaps, errors and contamination in the data
  • “Unfortunately, in many cases we have found that contamination within the genome sequences produces false positives.” – Martin Steinegger
  • Using recent fast algorithms, Steinegger and Salzberg developed a tool called Conterminator that enables searching for contamination across kingdoms and scales linearly with input size
  • They expected to see a few thousand contaminated sequences but ended up with millions
  • The single most surprising finding was the presence of a piece of a bacterium, Acidithiobacillus thiooxidans, in an alternative scaffold of the current version of the human reference genome (GRCh38)
  • “Many of the genomes contain contamination. One particular problem that arises, again and again, is that contamination leads to incorrect claims about horizontal gene transfer,” says Steinegger
  • Errors due to contamination are particularly worrying; they are widespread, propagate across databases, and can compromise downstream analyses, especially the detection of horizontally-transferred sequences
  • We still lack consistent and comprehensive assessments of contamination prevalence in public genomic data
  • Scientists typically re-use sequence data generated by others, and are therefore dependent on the reliability of the available genomic resources
  • The problem is even more acute nowadays with the advent of high-throughput sequencing technologies, when most datasets generated in genomic research are simply not amenable to manual curation by humans
  • Data quality issues in genome sequences include sequencing errors, assembly errors and contamination, among other things
  • Errors due to contamination are particularly worrying for several reasons:
    1. They can lead to serious mis-interpretations of the data
    2. Contamination is suspected to be widespread as it occurs naturally in most sequencing projects due to foreign DNA initially present in the raw biological material (e.g., symbionts, parasites, ingested food) or entering the process in wet labs and sequencing centers
    3. Contamination errors easily propagate across databases in a self-reinforcing vicious circle
  • There is obviously a need for further assessment of the problem of contamination in publicly available genomic data
  • There are several problems with the simple strategy of flagging contaminants by their best BLAST hit against a reference database:
    1. This does not allow one to distinguish contaminants from HGT
    2. This approach is entirely dependent on the correctness of the reference database and a best-BLAST-hit survey can only propagate, not correct, pre-existing taxonomic mis-assignments
    3. It is also dependent on the completeness of the reference database, and on the phylogenetic position of the target organism
  • The overall prevalence of contamination uncovered by this study is expected to be even higher as they did not consider metazoan contaminants
  • The situation regarding contamination is probably even worse as far as non-coding sequences are concerned
  • Analysis workflows for whole-genome sequencing (WGS) data commonly do not account for errors potentially introduced by contamination, which could lead to the wrong assessment of allele frequency both in basic and clinical research
  • The study found that contamination is pervasive and can introduce large biases in variant analysis
  • The study showed that these biases can result in hundreds of false positive and negative SNPs, even for samples with slight contamination
  • Studies investigating complex biological traits from sequencing data can be completely biased if contamination is neglected during the bioinformatic analysis, and this study demonstrated that removing contaminant reads with a taxonomic classifier permits more accurate variant calling (who would’ve thought…?)
  • The potential role of contamination is seldom considered
  • Misinterpretation of contaminated data can lead to drawing incorrect conclusions about biological phenomena
  • Genomic databases are known to encompass contaminated sequences, with assembled genomes that can contain large genomic regions from non-target organisms
  • While the potential impact of contaminants has been considered in fields like metagenomics or transcriptomics, most bacterial WGS analysis pipelines lack specific steps aimed to deal with contaminant data
  • This situation likely originates from the assumptions that microbiological cultures are mostly free of non-target organisms and that even if present, contaminant sequences are unlikely to map to the reference genomes or are removed using standard filter cutoffs
  • To date, the extent of contamination and its impact in bacterial re-sequencing pipelines has not been comprehensively assessed
  • They found that contamination events are frequent across bacterial WGS studies and can introduce large biases in variant analysis despite the use of stringent mapping and variant calling cutoffs
  • This was not only true for culture-free sequencing strategies, but also for experiments sequencing from pure cultures
  • The study showed that the effect size is not dependent on the amount of contamination and that samples with even low-level contamination can accumulate dozens of errors, particularly for non-fixed SNPs
  • When looking at the MTB dataset, they also observed contamination to be common across studies
  • Direct sequencing from clinical specimens and early positive mycobacterial growth indicator tubes (MGIT), which are inoculated with primary clinical samples, presented higher levels of contamination in terms of both the number of samples contaminated and the proportion of non-target reads within them
  • Common contaminants for these samples include human DNA and bacteria typically found in the oral and respiratory cavities, and can constitute virtually all reads in some samples
  • However, as observed for the bacterial dataset, contamination was also detected in studies in which the sequenced DNA came from pure culture isolates
  • Even 5% contaminating reads can introduce a large number of false positive vSNPs
  • The erroneous calls produced by such small contamination fall mainly in conserved regions; however, spurious SNPs can be called across the genome
  • Sequencing directly from clinical specimens is subject to greater alterations in variant analysis since this strategy usually yields highly contaminated samples and limited sequencing depth
  • This study showed that presence of sequencing reads from contaminating organisms is frequent, even when sequencing is performed from pure culture isolates
  • Beyond inappropriate laboratory practices, there are several potential sources of contamination which depend on different factors such as:
    1. The type of sample processed and its origin
    2. The protocols followed for culture
    3. DNA extraction
    4. Sequencing
  • Culture-free sequencing approaches for unculturable or slow-growing pathogens, such as T. pallidum or MTB, entail the presence of high amounts of contaminating DNA from the host organism
  • Regardless of the source of contamination, the shared consequence is the presence of non-target reads in the sequencing files that might impact the results of genomic analysis
  • They demonstrated that contaminant reads pose a pitfall in re-sequencing pipelines, since they are unexpectedly frequent and can have major implications in variant analysis, which is the foundation of many genomic analyses
  • As expected, contamination is a major issue when sequencing DNA that has not been extracted from pure cultures or single colonies, as is often the case for clinical specimens
  • Experiments sequencing from pure cultures are not necessarily free of contamination, and using standard mapping quality parameters is not enough to deal with contaminant reads
  • Therefore, bioinformatic pipelines assuming that all the reads successfully mapped are from the target organism might lead to a biased variant analysis
  • They also showed that the errors introduced by contamination are very variable among different studies, which differ not only in the organism being sequenced but also in the sampling source and laboratory protocols
  • Contamination can introduce substantial errors in samples that could be considered “pure” or with high sequencing depths, implying that contamination-aware pipelines will be needed in any circumstance
  • Contamination has been recognized as a major source of error in genome assemblies and other fields like metagenomics
  • However, the role of contamination in re-sequencing pipelines is usually neglected
  • Whereas some groups are already aware of this issue, most bacterial re-sequencing pipelines are still lacking contamination-control strategies or, if any, these are rarely detailed in published works
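The variant-calling bias described in the bullets above can be made concrete with a toy pileup. This is a minimal sketch with made-up numbers, not any study's actual pipeline: assuming a naive caller that reports an alternate allele above a 1% minor-allele-frequency cutoff, 5% contaminating reads are already enough to trigger a spurious non-fixed SNP call at a position where the sequenced isolate is monomorphic.

```python
# Toy illustration (hypothetical depth, counts, and cutoffs): how a small
# fraction of contaminant reads can cross a minor-allele-frequency cutoff
# and produce a spurious "variant" call.

def call_variant(ref_count, alt_count, min_freq=0.01, min_reads=4):
    """Naive variant caller: report the alternate allele if it is seen in
    at least `min_reads` reads and above `min_freq` frequency."""
    total = ref_count + alt_count
    freq = alt_count / total
    return freq >= min_freq and alt_count >= min_reads, freq

# 100x depth at a position where the isolate carries only the reference base.
target_reads = 95        # reads from the sequenced isolate (all carry 'A')
contaminant_reads = 5    # 5% of reads from a contaminant organism ('G' here)

called, freq = call_variant(target_reads, contaminant_reads)
print(f"alt frequency: {freq:.2f}, variant called: {called}")
# The contaminant 'G' allele sits at 5% frequency, well above the 1% cutoff,
# so the pipeline reports a non-fixed SNP that does not exist in the isolate.
```

Removing the contaminant reads first (e.g. with a taxonomic classifier, as the study above did) drops the alternate count to zero and the false call disappears, which is exactly the contamination-aware step most re-sequencing pipelines are said to lack.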

Anyone claiming that the existence of a genome is proof of a purified/isolated “virus” is completely mistaken. The contamination of genomes is admittedly a widespread problem and one that is only getting worse. While it is a well-known issue, contamination of the databases has not been properly assessed nor corrected. The use of inaccurate and incomplete reference sequences has only further propagated the problem into a vicious perpetual cycle of erroneous genomes built upon erroneous genomes.

Due to the problems related to contamination, it is admitted that human DNA is incorrectly classified as bacterial DNA. It is admitted that results from contaminated sequences lead to false positives in both research and clinical settings. It is easy to see how this finding can be applied to “viruses” where normal human DNA/RNA is being used to create the framework of the “virus.” Faulty PCR tests are then designed which look for fragments of a “virus” that are actually normal fragments of human genetic material. Healthy people are then incorrectly labelled as asymptomatically sick for a “virus” never shown to exist other than in the form of a false contamination-created series of sequences. Think this is implausible? Remember that metagenomic sequencing, as done with the unpurified bronchoalveolar lavage fluid (BALF) from one patient, was used to create the “SARS-COV-2” genome. Clinical samples such as BALF have the highest levels of contamination. It is admitted by the WHO that this will result in high levels of host contamination and non-target sequences. Now, reread this passage from the 2019 study once again:

This widespread contamination creates serious problems for many types of scientific analyses that depend on genome and protein databases. One example where this problem is most acute is the use of metagenomic sequencing to diagnose infections, a rapidly growing clinical application in which human tissues are sequenced to identify a potential pathogen (Wilson et al. 2014; Naccache et al. 2015; Berger and Wilson 2016; Salzberg et al. 2016). In these samples, where the dominant species is human, contamination of even a small fraction of the bacterial genomes in the database will cause numerous false positives, as human reads may appear, incorrectly, to represent bacterial organisms.
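The mechanism that passage describes can be sketched with a toy exact-match read classifier. Every sequence and species name below is invented for illustration: if the database entry for “species X” accidentally contains a stretch of human sequence, a read from pure human DNA will hit that entry, and the sample is reported as containing species X.

```python
# Toy read classifier (hypothetical sequences): a read is assigned to every
# database genome that contains it as an exact substring.

database = {
    # A bacterial assembly that accidentally includes a human fragment
    # (the middle chunk) alongside its genuine sequence.
    "species_X": "ATGCGTACGTTAGC" + "GGGTTTCACACAGG" + "CCATGAACGT",
    "species_Y": "TTGACCTAGGCATCAGTACCGGA",
}

human_read = "GGGTTTCACACAGG"  # a read from pure human DNA (made up)

def classify(read, db):
    """Return the names of all genomes whose sequence contains the read."""
    return sorted(name for name, seq in db.items() if read in seq)

print(classify(human_read, database))
# The human read matches "species_X", so a sample containing only human DNA
# is reported as containing species X: a false positive caused purely by
# contamination in the reference database.
```

Real classifiers match k-mers or alignments rather than whole reads, but the failure mode is the same: the result is only as trustworthy as the database it queries.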

“SARS-COV-2” being nothing but fragments of human RNA/DNA doesn’t seem so far-fetched now, does it?
