“Virus” Variants or Sequencing Errors/Artefacts?

ARTEFACTS (Genomics):

“In genetics a result that does not represent the true biological material or function but arises from a technical, often artificial, process.

Artefacts can lead to misleading results from sequencing. To avoid giving patients incorrect results during data analysis and validation, any data that could be an artefact from the sequencing process is thoroughly investigated.”

Artefact (genetics)

When dealing with the question of whether variants of “viruses” exist or not, all one needs to do is go to the evidence for whether “viruses” themselves actually exist. We can’t have variants if the original has never been proven to exist. Reading through any of the original “virus” studies makes it very clear that no “virus” has ever been properly purified/isolated directly from a sick human, nor ever proven pathogenic in a natural way. What will be seen is that random particles are selected as the representation of the desired “virus” from a sea of billions of similar/identical particles created within unpurified cell culture supernatant containing host DNA, animal DNA, antibiotics, fetal bovine serum, added nutrients/chemicals, pollutants, contaminants, and foreign elements.

Knowing that no “virus” has ever been proven to exist, how can virologists claim that there are multiple variants of the non-existent “virus” running around? This all boils down to the “magic” of genomics, which creates theoretical models from these unpurified cell culture soups and pieces them together with the use of computer algorithms/alignments, consensus sequences, and reference databases of other unpurified/unisolated “viral” sequences created in the same way. Any divergence from the original reference genome gets submitted as a variant. The problem, however, is that no two “viral” genomes ever match 100%, so in actuality, every sequenced “virus” is a variant.

According to virologist Vincent Racaniello:

“A virus variant is an isolate whose genome sequence differs from that of a reference virus. No inference is made about whether the change in genome sequence causes any change in the phenotype of the virus. The meaning of variant has become clouded in the era of whole viral genome sequencing, because nearly every isolate may have a slightly different genome sequence. Such is the case for SARS-CoV-2: nearly every sequence from a different person is slightly different.” 

Understanding virus isolates, variants, and strains

The excuse for why “viral” sequences never match.

The reason usually given for these non-identical sequences is that the “virus” integrates with the host cells and hijacks them in order to create more copies of itself. During this process, virologists claim that the “virus” can supposedly mutate and/or create little errors in its genome sequence. Of course, this is all assumed and theorized, as the actual process of a “virus” invading host cells to replicate has never been observed, much like the “virus” itself. However, there are numerous other reasons for these sequencing errors that have nothing to do with “viral” replication/mutation. These include technological limitations, inaccurate software, contamination, artefacts, lack of standardization of methods and quality control, faulty reference genomes, and non-curated databases. More information on these issues can be found here:

“Viral” Genomics: Nothing but Strings of Letters in a Data Bank

Digging a bit deeper, we can find that the actual process of determining variants is referred to as variant calling. This is how it is described by the EMBL-EBI, “the home for big data in biology”:

“Variant calling is the process by which we identify variants from sequence data (Figure 11).

  1. Carry out whole genome or whole exome sequencing to create FASTQ files.
  2. Align the sequences to a reference genome, creating BAM or CRAM files.
  3. Identify where the aligned reads differ from the reference genome and write to a VCF file.”
Figure 11: A CRAM file aligned to a reference genomic region as visualised in Ensembl. Differences are highlighted in red in the reads, and will be called as variants.

Variant identification and analysis
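To make that three-step description concrete, here is a minimal toy sketch (all sequences, positions, and read counts below are invented; a real pipeline works on FASTQ/BAM/VCF files with far more bookkeeping) of what the final step amounts to: pile the aligned reads up on the reference and report any position where the reads’ most common base disagrees with it.

```python
from collections import Counter

# Toy reference and "aligned reads": (0-based start position, read sequence).
# Everything here is invented for illustration.
reference = "ACGTACGTAC"
aligned_reads = [
    (0, "ACGTACGTAC"),
    (0, "ACGAACGTAC"),   # carries an A at reference position 4 (1-based)
    (2, "GAACGTAC"),     # also carries the A at position 4
    (0, "ACGAACGTAC"),
]

# Pile up the bases observed at each reference position.
pileup = {i: Counter() for i in range(len(reference))}
for start, read in aligned_reads:
    for offset, base in enumerate(read):
        pileup[start + offset][base] += 1

# "Call" a variant wherever the most common observed base differs from the reference.
for pos, counts in pileup.items():
    if not counts:
        continue
    alt, depth = counts.most_common(1)[0]
    if alt != reference[pos]:
        # VCF-style record: 1-based position, REF, ALT, supporting read depth
        print(f"pos={pos + 1}\tREF={reference[pos]}\tALT={alt}\tdepth={depth}")
```

Everything downstream, variant lists included, inherits whatever the reference and the alignments got wrong.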

Variant calling is very much intertwined with the accuracy of the reference genome. If the reference genome is inaccurate, every variant built from it will be as well. There are also various technological limitations and challenges with the variant calling software which can lead to labelling sequencing errors and artefacts as variants. This raises the question, how can it be certain that variants are “real” rather than the creation of sequencing errors and artefacts brought about by the technological limitations and various forms of contamination inherent within the sequencing process itself? Let’s take a look at a few sources highlighted below to see if we can break down this variant scam.

PCR amplification errors are one of several factors that lead to variants/mutations which have nothing to do with “recombination/replication.”

This first source is a study from January 2015. It goes into detail on the lack of gold-standard quality controls for identifying variants, the difficulties in distinguishing biological from artificial variants, and the overestimation of “viral” minority variants as a consequence of the required RT-PCR amplification step. It makes the argument that most error-correction algorithms were designed for invariable genomes like the human genome, and that the assumptions made for human genomes cannot apply to “viral” population sequencing, as each variant could represent a unique “viral” variant:

Improved detection of artifactual viral minority variants in high-throughput sequencing data

“High-throughput sequencing (HTS) of viral samples provides important information on the presence of viral minority variants. However, detection and accurate quantification is limited by the capacity to distinguish biological from artificial variation. In this study, errors related to the Illumina HiSeq2000 library generation and HTS process were investigated by determining minority variant frequencies in an influenza A/WSN/1933(H1N1) virus reverse-genetics plasmid pool. Errors related to amplification and sequencing were determined using the same plasmid pool, by generation of infectious virus using reverse genetics followed by in duplo reverse-transcriptase PCR (RT-PCR) amplification and HTS in the same sequence run. Results showed that after “best practice” quality control (QC), within the plasmid pool, one minority variant with a frequency >0.5% was identified, while 84 and 139 were identified in the RT-PCR amplified samples, indicating RT-PCR amplification artificially increased variation.”

“Our study clearly demonstrated that, without additional QC steps, overestimation of viral minority variants is very likely to occur, mainly as a consequence of the required RT-PCR amplification step.

However, it is essential to distinguish true biological variation from artificial variation introduced by the laboratory and sequencing methods used. Over time, several approaches to error-correction have been developed incorporating complicated probabilistic methods. These methods use the read sequence information but ignore the corresponding phred quality scores (Zagordi et al., 2011; Beerenwinkel et al., 2012; Yang et al., 2013). In addition, most error-correction algorithms were designed for invariable genomes like the human genome, and generally assumed that errors were not only random and infrequent, but can be corrected using the majority of the reads that have the correct base (Beerenwinkel et al., 2012; Yang et al., 2013). For usage in viral population sequencing this assumption does not apply as each variant in theory could represent a unique viral variant, and different approaches for viral quasispecies reconstruction have been designed, like local haplotype reconstruction incorporated in Shorah (Zagordi et al., 2010a,b; 2011). However, these methods still rely solely on read sequence information requiring prior QC steps. In addition, while Shorah performs well using long read sequences and relatively small sized datasets, typical for Roche 454 data, it does not scale well to the shorter reads and much larger Illumina datasets as it can only handle up to one million reads, limiting the usability with Illumina sequencers.”

Quality Control of Sequence Reads

“While there is no gold standard for QC, it is generally divided in pre-mapping QC and post-mapping QC steps (see Materials and Methods).”

In this study, after “best practice” QC, 223 influenza genome positions remained which would normally be considered to be “true” biological variants, while 95% were in fact artifacts, hampering reliable interpretation of the sequence data and hindering efficient follow-up studies. These artifacts were not randomly distributed over the influenza virus genome but were shown to be specifically influenced by the sequence context, detectable by a mutational strand bias. The SSE analysis indicated that nearly 90% of the positions with increased MMF were artifacts. Furthermore, shearing or nebulization of amplified DNA during library preparation will lead to biological variant mutations being evenly spread over the sequence read. When a particular fragment containing a mutation is preferentially (over)amplified during library preparation (i.e., resequenced due to low copy number), this random spread would be disrupted and the prevalence of this mutation overestimated.

https://www.frontiersin.org/articles/10.3389/fmicb.2014.00804/full
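To illustrate what a “minority variant frequency” is in this context, here is a toy calculation. The per-position read counts and the positions themselves are invented; only the 0.5% cutoff comes from the study quoted above.

```python
# position -> {base: number of supporting reads}; all counts are hypothetical.
base_counts = {
    101: {"A": 9940, "G": 60},    # 0.6%  minor allele
    202: {"C": 9800, "T": 200},   # 2.0%  minor allele
    303: {"G": 9999, "A": 1},     # 0.01% minor allele (indistinguishable from noise)
}

THRESHOLD = 0.005  # the 0.5% frequency cutoff used in the study

for pos, counts in sorted(base_counts.items()):
    depth = sum(counts.values())
    major = max(counts, key=counts.get)
    for base, n in counts.items():
        if base == major:
            continue
        freq = n / depth
        status = "minority variant" if freq > THRESHOLD else "below threshold"
        print(f"pos {pos}: {major}->{base} at {freq:.4%} ({status})")
```

The study’s point is that the RT-PCR step alone pushed dozens of positions over this kind of cutoff that were not there in the un-amplified plasmid pool.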

Each step of this process, from the cell culture to the finished sequence, is prone to contamination and errors.

In this next study from February 2018, we can see that the variant callers have been extensively benchmarked with inconsistent performances across various studies. The researchers claim that calling variants is made difficult due to the noise in the sequencing reads. They state that even though genetic variants can be grouped into three categories by size, very few variant callers are versatile enough to call all three because they require very different algorithms. This means different strategies with different techniques such as probabilistic modeling and de novo assembly are required to get “accurate” variant calls. On top of this, they admit that sequencing or alignment artefacts may appear to have strong read evidence and can trick the statistical model to pass them off as “real” variants. These revelations are but a few of the many contained within this source:

A review of somatic single nucleotide variant calling algorithms for next-generation sequencing data

“A collection of variant calling pipelines have been developed with different underlying models, filters, input data requirements, and targeted applications. This review aims to enumerate these unique features of the state-of-the-art variant callers, in the hope to provide a practical guide for selecting the appropriate pipeline for specific applications. We will focus on the detection of somatic single nucleotide variants, ranging from traditional variant callers based on whole genome or exome sequencing of paired tumor-normal samples to recent low-frequency variant callers designed for targeted sequencing protocols with unique molecular identifiers. The variant callers have been extensively benchmarked with inconsistent performances across these studies.

Theoretically, all mutations regardless of the variant allele frequency (VAF) or genomic region can be observed given enough read depth. However, calling them with confidence is not trivial due to noise in the reads. Numerous bioinformatics tools have been developed to uncover mutations (variants) from sequencing reads, and such procedures typically consist of three components: read processing, mapping and alignment, and variant calling. First, low quality bases (usually near the 3’ end of reads) and exogenous sequences such as sequencing adapters are trimmed with read processing tools such as Cutadapt [1], NGS QC Toolkit [2], and FASTX-Toolkit. Some targeted sequencing protocols use PCR primers or unique molecular identifiers (UMI) during library preparation. In this case, custom-built read processing scripts may be required to trim and extract these oligonucleotides. Second, the cleaned reads are mapped to where they may come from in the reference genome, and then aligned base-by-base. Commonly used mapping and alignment tools include BWA [3], NovoAlign, and TMAP (for Ion Torrent reads) for DNA sequencing, and splice-aware aligners such as TopHat [4] and STAR [5] for RNA sequencing. PCR de-duplication, indel-realignment, and base quality recalibration can be performed in this step as outlined in the Genome Analysis Toolkits (GATK)’s best practice for variant calling [6,7]. The last step, variant calling, is essentially a process of separating real variants from artifacts stemming from library preparation, sample enrichment, sequencing, and mapping/alignment. It has been a very active research field for years and plenty of variant callers have been developed, many freely available.”

The underlying assumptions are quite different for germline and somatic variant calling algorithms. Germline variants are expected to have 50 or 100% allele frequencies, therefore germline variant calling is essentially to determine which of the three genotypes, AA, AB, or BB, fit the data best [7–10]. Most artifacts are present in low frequency and unlikely to cause trouble, because homozygous reference would be the most likely genotype in this case. But rejecting these artifacts is not as easy in somatic variant calling, because some real variants could also be present in very low frequencies in cases of impure sample, rare tumor subclone, or circulating DNA. Therefore, the biggest challenge of somatic variant calling is to disambiguate low-frequency variants from artifacts, which requires more sensitive statistical modeling and advanced error correction technology.

Genetic variants can be grouped into three categories by size: single nucleotide variant (SNV), insertion and deletion (indel), and structural variant (SV, including copy number variation, duplication, translocation, etc.). Very few variant callers are versatile enough to call all three because they require very different algorithms. For SNV and short indels (typically ≤10bp), the general strategy is to look for non-reference bases from the stack of reads that cover each position. Probabilistic modeling is critical here to infer the underlying genotype or evaluate the odds of variant versus artifacts. For structural variants and long indels, since the reads are too short to span over any variant, the focus is to locate the breakpoints based on the sudden change of read depth or patterns of misalignment with paired end reads. Split-reads and de-novo assembly methods are often used for SV and long indel detection.”
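The “probabilistic modeling” mentioned above can be pictured with a very small sketch of the germline case (the read counts and the 1% error rate are assumptions chosen for illustration, not anyone’s published parameters): given the reads at one site, ask which of the three genotypes AA, AB, or BB explains them best.

```python
from math import log

def genotype_log_likelihoods(n_ref, n_alt, error_rate=0.01):
    """Log-likelihood of observing n_ref reference and n_alt alternate reads
    under each diploid genotype, assuming a fixed per-base error rate."""
    p_alt = {
        "AA": error_rate,        # homozygous reference: alt reads are errors
        "AB": 0.5,               # heterozygous: half the reads carry each allele
        "BB": 1.0 - error_rate,  # homozygous alternate: ref reads are errors
    }
    return {gt: n_alt * log(p) + n_ref * log(1.0 - p) for gt, p in p_alt.items()}

# 28 reference reads and 2 alternate reads: the alt reads are easily explained
# as sequencing error, so homozygous reference (AA) wins.
lls = genotype_log_likelihoods(n_ref=28, n_alt=2)
print(max(lls, key=lls.get))   # -> AA

# 14 reference reads and 16 alternate reads: a heterozygous call (AB) wins.
lls = genotype_log_likelihoods(n_ref=14, n_alt=16)
print(max(lls, key=lls.get))   # -> AB
```

The quoted passage’s point is that this trick only works because germline variants are expected at roughly 50% or 100% of reads; for low-frequency somatic or “viral” variants, artifacts and real variants occupy the same frequency range.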

2.1. Pre-processing

“In general, variant callers consist of three components: pre-processing, variant evaluation, and post-filtering. The main purpose of pre-processing is to keep low-quality reads from entering the variant evaluation procedure. Read quality is typically measured by average base quality score, mapping quality score, and number of mismatches from the reference genome, etc.

2.2. Variant evaluation

“Variant evaluation algorithm is the centerpiece of somatic variant callers and hence the focus of this review.”

2.3. Post-filtering

Sequencing or alignment artifacts may appear to have strong read evidence and trick the statistical model to pass them as real variants. Most variant callers apply a set of filters to identify these artifacts and hence improve the specificity. Strand bias filter, for example, catches artifacts whose reads are only or dominantly observed on one strand, a common error in Illumina reads [19,20]. Strand bias filters rely on the Fisher’s exact test to identify imbalanced strand distribution. A number of filters focus on repetitive regions such as homopolymer, microsatellite, or low complexity regions, which are known to cause false calls due to increased alignment and sequencing errors [21,22]. Hard filters are used in most variant callers, either completely rejecting variants in certain regions or relying on empirical hard thresholds [23].”
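As a concrete picture of the strand-bias filter described above, here is a small sketch using Fisher’s exact test on a 2×2 table of reference/alternate read counts split by sequencing strand. The counts are invented; the test itself is the one the quoted passage names.

```python
from scipy.stats import fisher_exact

# rows: reference reads, alternate reads; columns: forward strand, reverse strand
ref_fwd, ref_rev = 480, 510
alt_fwd, alt_rev = 40, 1   # the alternate allele shows up almost only on one strand

odds_ratio, p_value = fisher_exact([[ref_fwd, ref_rev], [alt_fwd, alt_rev]])

# A tiny p-value means the alt allele's strand distribution is badly imbalanced
# relative to the reference reads: a classic artifact signature, so the call
# would be filtered out rather than reported as a "real" variant.
if p_value < 0.01:
    print(f"strand bias detected (p = {p_value:.2e}): candidate rejected as artifact")
else:
    print(f"no significant strand bias (p = {p_value:.2e})")
```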

7.1. Benchmarking studies

Although most variant callers were published with benchmarking results against other mainstream pipelines of their time, the claimed performance may not be replicated on independent datasets. A number of independent studies to benchmark and compare various somatic variant callers have been published [13,14,50,82-85], but inconsistent performance data and contradicting rankings of the variant callers were reported. The inconsistency of benchmarking results is due to two reasons. First, most variant callers need to be fine-tuned to achieve the expected accuracy on a naive dataset, yet the optimal parameter values are unknown to the tester.

“Second, some variant callers were originally designed for certain types of applications and then published without extensive validation on a wide range of datasets, so their performance may drop in some occasions.

“Traditional model-based variant callers rely heavily on ad-hoc filters to reduce false calls because artifacts are generated in a very complex way that is beyond simple modeling. As a result, a variant caller often contains dozens of parameters and some of them can only be understood or safely tuned by the developers, hampering the practical utility.

However, for somatic variant callers, independent and unbiased benchmarking is still limited by the lack of good validation datasets. Datasets used in recent benchmarking studies include synthetic and semi-synthetic reads, reference standards including GIAB samples and other cell lines, and real tumor-normal pairs. None of these are perfect validation data for reasons discussed above.

https://doi.org/10.1016/j.csbj.2018.01.003

Only 4.5 million versions of the “same virus” so far and growing daily.

Specifically looking at “SARS-COV-2” variants, this study from May 2020 took issue with the early sequencing data, which has led to numerous submitted variants since the start of this “pandemic.” The researchers believe many of the observed mutations stem from contamination and recurrent sequencing errors and artefacts which arise from specific combinations of sample preparation, sequencing technology, and consensus calling approaches. They posted various updates, and in June 2020 the researchers concluded that, in many cases, these apparent mutations are almost certainly sequencing or data processing artefacts:

Issues with SARS-CoV-2 sequencing data

We investigate oddities in the SARS-CoV-2 genome sequences from GISAID. Many putative sequencing issues seem specific to genomic ends and to certain samples, and are easily filtered out. However, many mutations seem to arise many times along the phylogenetic tree (are highly homoplasic), and seem more likely the result of contamination, recurrent sequencing errors, or hypermutability, than selection or recombination. Some homoplasic substitutions seem laboratory-specific, suggesting that they might arise from specific combinations of sample preparation, sequencing technology, and consensus calling approaches.

“Finally, as other groups have already suggested (see e.g. [12]), we recommend filtering out sequences that: have too few resolved characters (our somewhat arbitrary threshold is about 29,400 reference bases), are too diverged (as can be tested using TreeTime), have unusual locally high divergence (as can be tested using ClonalFramML), have missing/incomplete sampling date information, or that are distant from any other sequence in the dataset (we use a custom script to remove all sequences that are at least three substitutions away from any other sequence). We don’t provide a current list as this is quite long and varies as the number of publicly shared SARS-CoV-2 genome sequences increases.
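The first of those filters, dropping genomes with too few resolved characters, is trivial to sketch. The sequences below are invented; the 29,400-base threshold is the one quoted above, and the three-substitution distance rule would be applied on top of this.

```python
MIN_RESOLVED = 29400   # threshold quoted in the post above

# Invented genomes: one fully resolved, one padded out with 2,000 ambiguous N's.
sequences = {
    "seq_A": "ACGT" * 7500,                          # 30,000 resolved bases
    "seq_B": ("ACGT" * 7500)[:-2000] + "N" * 2000,   # only 28,000 resolved bases
}

def resolved_bases(seq):
    """Count unambiguous A/C/G/T characters in a genome sequence."""
    return sum(base in "ACGT" for base in seq)

kept = [name for name, seq in sequences.items() if resolved_bases(seq) >= MIN_RESOLVED]
print(kept)   # -> ['seq_A']; seq_B is filtered out before any phylogenetics is run
```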

“We then ran IQTree v1.6.12 [3] (options -st DNA -m HKY) on the resulting alignment. A few samples have extremely long terminal branches (Figure 2), suggesting either evolutionary events leading to many substitutions (e.g. recombination events or large mutation events), or sequencing/calling artefacts in specific samples.

“To further investigate, and to look into possible recombination events, homoplasies (mutation events seemingly happening multiple times along the phylogeny) and mutational clusters, we ran ClonalFrameML v1.12 [4]. This software found 30 putative recombination events happening at terminal branches, consistent with the long terminal branches in Figure 2. These putative events appear as clusters of substitutions in individual samples and genomic positions (e.g. Figure 3). Given that these mutations are not observed in any other samples, they could represent artefacts in the corresponding sequence.

“The most striking of these putative multi-nucleotide mutations is 28881–28883 (Figure 4), replacing GGG with AAC. This is the only one of these mutations reaching high frequency in the population (776 sequences, or 16.5% of the samples, while 6971–6979 and 13686–13693 only appear in 2 samples each). The three substitutions at 28881–28883 seem to only appear in complete linkage, with the exception of G28881A that also appears in another two samples. (However, these samples were later filtered from the dataset in following steps.) Another interesting aspect is that the 28881–28883 mutation also seems to appear in other non-phylogenetically related samples as within-host polymorphism (Figure 4). This hints at a considerable proportion of cases being either mixed infections (patients infected with multiple strains of SARS-CoV-2) or maybe in some cases contaminated at low frequency.

“In addition to these clusters of mutations, ClonalFrameML also reveals homoplasies, mutations apparently appearing multiple times along the phylogeny. When multiple homoplasies happen on the same branch, they are the hallmark of recombination. However, we only observe isolated homoplasies (Figure 5), suggesting either recurring mutations, or recurrent sequencing/calling artefacts (see e.g. [6]), rather than very short recombination events at given hotspots, as suggested by some previous studies (see e.g. [7-8]).”
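For readers unfamiliar with the term, a homoplasy is simply a change that has to be counted more than once to explain the tree. A toy Fitch-parsimony sketch makes this concrete; the tree and the tip alleles are invented, and a real analysis runs on thousands of genomes.

```python
def fitch(node):
    """Return (allele set, minimum number of state changes) for a subtree.
    A node is either a leaf allele string or a (left, right) tuple."""
    if isinstance(node, str):
        return {node}, 0
    (lset, lcost), (rset, rcost) = fitch(node[0]), fitch(node[1])
    common = lset & rset
    if common:
        return common, lcost + rcost
    return lset | rset, lcost + rcost + 1

# A site where the derived "T" allele shows up in two unrelated parts of the tree.
tips = ("G", "T", "G", "G", "T", "G", "G", "G")
tree = ((("G", "T"), ("G", "G")), (("T", "G"), ("G", "G")))

_, changes = fitch(tree)
print(f"minimum mutation events: {changes} for {tips.count('T')} tips carrying T")
# Needing 2 independent events for one derived allele is the homoplasy signature:
# either a genuinely recurrent mutation, or a recurrent sequencing/calling artefact.
```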

“To test for the possibility that some of these homoplasies might be the result of phylogenetic errors, we masked the most common homoplasies from these datasets one part at a time, to see if other homoplasies would disappear. This had almost no effect.

Another possibility is that these homoplasies might be caused by some of the samples being particularly enriched in recurrent artefacts, rather than artefacts distributed uniformly across all samples.

“We will now discuss some of the most common homoplasies. As mentioned above, G11083T is the most frequent, appearing 679 times and apparently mutating 21 times forward and 7 times reverting to the original T allele. The T allele is observed in different sequencing technologies and different countries. This is a new non-synonymous (L to F) mutation (ORF1a 3606 nsp6 37) and also is considered one of the best candidates for positive selection (https://observablehq.com/@spond/natural-selection-analysis-of-sars-cov-2-covid-19), but this homoplasy has also been interpreted in literature as the result of frequent recombination [7]. It appears in all samples from the Diamond Princess cruise ship. Notably the mutation is next to the longest non-terminal homopolymer in the genome, further extending it from 8 consecutive T’s to 10 (Figure 9). The mutation also appears in samples as a within-host polymorphism, as can be seen from the presence of isolated N’s (17 times) and K’s (9 times) in the alignment (see e.g. Figure 9). We also observe this from read files from the Sequence Read Archive (see next section), where the mutation appears as within-patient polymorphism 16 times, and even more often with less stringent filtering and variant calling. When applying more stringent read filtering, the frequency of the T allele seems to consistently decrease. Considering all of these observations, we think that G11083T might be a particularly frequent mutation or artefact. It is unlikely to be the result of positive selective pressure at the amino acid level, as the mutation seems apparently to revert to the original allele many times, and as the same amino acid substitution L→F would also be obtained with the substitution G11083C, which however we never observed in our data.

As a consequence of the observations above, we suspect frequent homoplasies specific to dataset C could be artefacts (or possibly normal mutations that appear as homoplasic due to phylogenetic errors). These include T13402G, which is a nonsense mutation appearing in 51 sequences; its neighbour 13408, in which all three non-reference alleles are observed; and A4050C, a nonsynonymous mutation appearing in 18 sequences. Others are less frequent, and it is particularly unclear if their homoplasy might be caused by phylogenetic errors; these include T8022G (nonsynonymous, appearing in 5 samples), T28785G and C3130T (appearing only in 4 and 3 samples respectively). Recently the dataset C-specific mutations at positions 24389-24390 have been suggested to be strongly affected by local recombination [9], but this might be a phylogenetic artefact caused by other homoplasies.

To test which of these homoplasies might (also) be actual inherited viral mutations, and which are more likely to be non-inherited sequencing artefacts, we measured the phylogenetic signal present in the most homoplasic and/or lab-specific variants using the methods of Borges et al 2018 [10].”

Most of the homoplasies, in particular those appearing in only a few sequences, show close to no phylogenetic signal (Figure 11), supporting the hypothesis that they are artificial. On the other hand, many homoplasies, including site 11083, show strong phylogenetic signal, suggesting that they originated, at least once, as a true mutational event. Of course, this analysis has substantial limitations, it ignores uncertainty in the tree and noise from possible artefacts within the remaining sites. It also does not tell us if a variant is both a true mutational event as well as the result of recurrent sequencing biases, as possible for position 11083.

“From preliminary analyses of the 8 sequencing datasets of Shen et al 2020 [11] we realized that in order to remove most variants that might emerge from mapping and sequencing artefacts, stricter filters than usual were needed. More specifically, we used fastp (https://github.com/OpenGene/fastp) to:

  1. trim adapters and polyX tails from reads
  2. trim 15 bases from the start and end of each read
  3. only retain reads with high quality sequence (≥Q35) at ≥90% of the read length
  4. remove reads with low quality (<Q30) at the 3’ end
  5. only retain reads of length ≥50 before alignment with bwa-mem (https://github.com/lh3/bwa)

Possible PCR duplicates were removed using MarkDuplicates from Picard tools (https://github.com/broadinstitute/picard), and any reads aligned through hard or soft clipping were removed from the bam file.

We only consider variants at a site if the depth of coverage is ≥5 and the base quality is ≥30.

The strong filters of removing 15bp from each read end and removing clipped reads were useful because some variants were observed at very high frequency in some samples but only at read ends or only in clipped reads. For example, at reference position 10779, an A/T polymorphism was observed in 7 out of 8 samples from Shen et al 2020, usually with A being the minor allele, and once being the majority allele. In the 8th sample, allele A appeared fixed. We observed allele A only in read ends (see Figure 12), and after trimming read ends this polymorphism was no longer detected. This suggests that recurrent artefacts might be present not only at the level of substitutions between consensus sequences, but also as frequent, apparent within-host variants.

“Generally, most samples contain 0–5 variants (median 1, mean 8.2). However, some samples have many more, reaching 564 and 433 in two cases (SRR11494637 and SRR11494664 respectively). Coverage or mixed infection/contamination do not seem the issues. Instead, these samples were mostly made of clipped reads, i.e. reads that only partly map on the reference genome. Removing these reads eliminated these extreme cases (Figure 13), but other samples with extreme numbers of variants can still be found, for example SRR11494662 with 359 within-host variants. As samples with an extreme number of within-host variants persist, a more rigorous filtering procedure might still be required before we can confidently interpret within-host variant calls. We therefore offer a word of caution when attempting to interpret the results of such variant calling methods, and minimally recommend a stringent set of filters (as outlined above), as well as removing samples with more than 2% of “N”s within reads.

The mutational spectrum of within-host variants seems extremely shifted toward G→T variants, even more than when comparing consensus sequences (Figure 6) suggesting that most of these variants might be the result of Illumina sequencing errors and/or sequencing biases, or otherwise that most of these new G→T mutations do not reach fixation due to selective forces. We have not yet investigated possible nanopore sequencing biases from within-host variation data.
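The read-level filters listed above (trim a fixed number of bases from each end, demand a minimum fraction of high-quality bases, drop short reads) can be re-expressed in a few lines of plain Python. The read and quality scores below are invented, and real pipelines apply these rules with tools such as fastp rather than hand-rolled code.

```python
def passes_filters(seq, quals, trim=15, min_len=50, min_q=35, min_frac=0.9):
    """Trim `trim` bases from both ends, then keep the read only if it is still
    at least `min_len` long and at least `min_frac` of its remaining bases
    reach quality `min_q`. Returns the trimmed read, or None if discarded."""
    seq, quals = seq[trim:-trim], quals[trim:-trim]
    if len(seq) < min_len:
        return None
    high_quality = sum(1 for q in quals if q >= min_q)
    if high_quality / len(quals) < min_frac:
        return None
    return seq, quals

# Invented 100-base read: mostly Q37, with a few low-quality bases at the 3' end.
read = "A" * 100
quals = [37] * 95 + [12] * 5
print("kept" if passes_filters(read, quals) else "discarded")   # -> kept

# The same read with a poor-quality second half fails the 90% high-quality rule.
quals = [37] * 50 + [20] * 50
print("kept" if passes_filters(read, quals) else "discarded")   # -> discarded
```

The post’s own example at position 10779, a “polymorphism” that vanished once read ends were trimmed, shows why the authors felt such aggressive filtering was necessary.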

From an update on June 2020:

Background: We and others have identified several variants in SARS-CoV-2 sequencing datasets that have the potential to adversely affect phylogenetic and evolutionary inference. For details, see our previous posts in this thread and bioRxiv preprint [1]. In many cases, these apparent mutations are almost certainly sequencing or data processing artefacts.

https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473

Change in color? A new variant!

In September 2021, a study came out assessing the quality control (or lack thereof) of “SARS-COV-2” genome sequences. The researchers point out that variants can result from PCR amplification errors, coinfection with other “viral” strains, and/or sample cross-contaminations. They believe that in order to ensure the accuracy of “SARS-COV-2” genome sequences, it is essential to identify and distinguish PCR errors as well as potential sample cross-contaminations. They state that the key to the reporting of accurate results is the detection of potential contaminations that could hamper variant calling. Since very few labs report on any sort of quality control that would prevent these errors from occurring, the researchers took it upon themselves to provide some guidelines to minimize erroneous interpretation of low-quality or contaminated data:

Assessment of SARS-CoV-2 Genome Sequencing: Quality Criteria and Low-Frequency Variants

“Although many laboratories worldwide have developed their sequencing capacities in response to the need for SARS-CoV-2 genome-based surveillance of variants, only a few reported some quality criteria to ensure sequence quality before lineage assignment and submission to public databases. Hence, we aimed here to provide simple quality control criteria for SARS-CoV-2 sequencing to prevent erroneous interpretation of low-quality or contaminated data.

Low-frequency variants (<70% of supporting reads) can result from PCR amplification errors, sample cross contaminations, or presence of distinct SARS-CoV2 genomes in the sample sequenced. The number and the prevalence of low-frequency variants can be used as a robust quality criterion to identify possible sequencing errors or contaminations.”

“The expected increase of SARS-CoV-2 genomic data, the potential impact of these results on patient management and public health measures, and the many actors entering the SARS-CoV-2 sequencing business, demand a systematic quality control assessment of these genomes. So far, SARS-CoV-2 genome sequencing has been largely used for epidemiological purposes (7–9) or associated with investigations of nosocomial transmission chains (10). However, most studies, with few exceptions, do not clearly define the quality control criteria used to include or exclude genomic data.

“While SARS-CoV-2 genome sequencing and analysis are performed in many laboratories worldwide for epidemiological purposes, no clear criteria have been proposed to assess the quality of SARS-CoV-2 sequences.

Low-frequency variants: from amplification errors to contaminations.

Key to the reporting of accurate results is the detection of potential contaminations that could hamper variant calling, affecting cluster analyses and leading to inaccurate lineage definition. The presence of low-frequency variants, defined as positions with variable nucleotides in 10% to 70% of mapped reads (11), could reflect such contaminations between samples. However, this can also be observed due to intrahost heterogeneity (21), uncommon coinfection of different viral strains (22), or PCR polymerase errors (23). However, with an estimated 25 SNP per year (19), corresponding to around two mutations per month or one mutation every two transmissions, multiple variants are unlikely to be found within an individual. Furthermore, such mutations arising randomly in the 29,903 bp of the SARS-CoV-2 reference genome should be observed only rarely, if at all, in other genomes sequenced in the same batch.”
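The arithmetic behind that claim is worth spelling out. The one-week serial interval below is an assumption added for illustration; the 25-substitutions-per-year figure is the one quoted above.

```python
snps_per_year = 25                 # substitution rate quoted above
per_month = snps_per_year / 12
days_between_mutations = 365 / snps_per_year
serial_interval_days = 7           # assumed average time between transmissions

print(f"{per_month:.1f} mutations per month")                   # ~2.1
print(f"one mutation every {days_between_mutations:.1f} days")  # ~14.6
print(f"~1 mutation every {days_between_mutations / serial_interval_days:.1f} transmissions")
```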

These unique low-frequency variants likely result from errors during PCR amplification that occur randomly in the SARS-CoV-2 genome and should only rarely be observed in other genomes, as they do not reflect circulating strains (12). However, low-frequency variants present in other genomes of the sequencing run (Fig. 3F) could represent cross-contaminations (24–26).”

For routine analyses, the intrarun prevalence of every variant, including those with low frequency, should be calculated among sequenced samples to distinguish potentially contaminated samples from random incorrect base incorporation during genome amplification. In our setting, if more than three low-frequency variant calls are found in one sample, a manual investigation of the prevalence of these mutations among the other samples of the run is performed. Most samples with several low-frequency variants failed the quality control due to multiple metrics (Fig. S3E). However, some samples harbored a large number of unique low-frequency variants (Table S3), likely as a result of PCR amplification errors. In such cases, the consensus genome sequence can still be used for lineage assignment. Finally, the occurrence of multiple low-frequency variants observed in other samples of the sequencing run is suggestive of a possible cross contamination. This highlights the importance of carefully validating all sequencing metrics. This quality analysis was critical in the following two cases to identify cross-contaminations.

Case 1 included a first nasopharyngeal swab (called sample A) with a CT of 32, identified as B.1.1.7 lineage by genomic analyses. However, it did not show the characteristic S gene dropout signature of this variant by RT-PCR (27) and presented only 10 of the 17 typical mutations present in B.1.1.7 lineage (Table S4). In the second case, we observed an inconsistency between two samples taken concomitantly from the same patient that resulted in two different lineages. One sample was identified as B.1.1.7 (buccal swab, CT of 34; named sample B) and the other sample as B.1.160 (nasopharyngeal swab, CT of 24; named sample C). A close examination of the three sequences showed the presence of several low-frequency variants in samples A (Table S4) and B (Table S5) (46 and 14, respectively) but not in C (Table S6). Further analyses of the prevalence of low-frequency variant calls showed that 20 positions and 9 positions for cases A and B, respectively, were observed in multiple other genomes, which suggested contamination. Low-frequency variants were used here to prevent erroneous lineage assignment.”

“As previously reported (11, 28), variant calling accuracy and sequencing quality strongly depend on the input material, but very few papers explicitly reported their quality control procedures.

We observed an abrupt increase of low-frequency variants in samples with cycle thresholds above 28. In contrast, lower CTs showed no or not more than two low-frequency calls. This strongly supports the important effect of technical factors on the accuracy of variant calls, in accordance with independent preliminary findings suggesting that low-copy-number inputs impacted the allele frequency calls and introduced false intrahost mutations (11, 12). In addition, other authors suggested that a significant number of new mutations reported in only one genome submission are likely the result of contamination or recurrent sequencing errors rather than selection or recombination (30–33). By analyzing the presence and the prevalence of low-frequency variants within sequencing runs or within the whole SARS-CoV-2 population, it is possible to rapidly highlight samples strongly affected by these problems.

In conclusion, we retrospectively analyzed 647 samples over 10 SARS-CoV-2 sequencing runs and identified metrics associated with sequencing quality useful to easily implement quality enhancement measures. To ensure the accuracy of SARS-CoV-2 genome sequences, it is essential to identify and distinguish PCR errors as well as potential sample cross-contaminations.

https://journals.asm.org/doi/10.1128/JCM.00944-21
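The paper’s two quality checks can be sketched together: flag samples with several low-frequency variant calls (10% to 70% of supporting reads), then ask whether those same positions recur in other samples of the run. The samples, positions, and frequencies below are all invented (the position numbers are only borrowed from earlier in this post for familiarity).

```python
from collections import Counter

# sample -> {genome position: fraction of reads supporting the minor allele}
run = {
    "sample_1": {11083: 0.35, 28881: 0.22, 3037: 0.15, 14408: 0.40},
    "sample_2": {11083: 0.30, 28881: 0.25},
    "sample_3": {241: 0.12},
}

def low_frequency_positions(calls, lo=0.10, hi=0.70):
    """Positions whose minor allele sits in the 10-70% 'low-frequency' window."""
    return {pos for pos, freq in calls.items() if lo <= freq <= hi}

# How often each low-frequency position appears across the whole sequencing run.
prevalence = Counter(pos for calls in run.values() for pos in low_frequency_positions(calls))

for sample, calls in run.items():
    flagged = low_frequency_positions(calls)
    if len(flagged) > 3:   # the paper's trigger for manual investigation
        shared = [pos for pos in flagged if prevalence[pos] > 1]
        print(f"{sample}: {len(flagged)} low-frequency variants, "
              f"{len(shared)} also seen elsewhere in the run (possible cross-contamination)")
```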

In Summary:

  • Detection and accurate quantification is limited by the capacity to distinguish biological from artificial variation
  • Results showed that after “best practice” quality control (QC), within the plasmid pool, one minority variant with a frequency >0.5% was identified, while 84 and 139 were identified in the RT-PCR amplified samples, indicating RT-PCR amplification artificially increased variation
  • The study demonstrated that, without additional QC steps, overestimation of “viral” minority variants is very likely to occur, mainly as a consequence of the required RT-PCR amplification step
  • It is essential to distinguish true biological variation from artificial variation introduced by the laboratory and sequencing methods used
  • Most error-correction algorithms were designed for invariable genomes like the human genome, and generally assumed that errors were not only random and infrequent, but can be corrected using the majority of the reads that have the correct base
  • For usage in “viral” population sequencing this assumption does not apply as each variant in theory could represent a unique “viral” variant
  • There is no gold standard for quality control
  • In this study, after “best practice” quality control (their quotation marks, not mine), 223 influenza genome positions remained which would normally be considered to be “true” biological variants, while 95% were in fact artifacts, hampering reliable interpretation of the sequence data and hindering efficient follow-up studies
  • When a particular fragment containing a mutation is preferentially (over)amplified during library preparation (i.e., resequenced due to low copy number) this random spread would be disrupted and the prevalence of this mutation overestimated
  • The variant callers have been extensively benchmarked with inconsistent performances across these studies
  • Theoretically, all mutations regardless of the variant allele frequency (VAF) or genomic region can be observed given enough read depth
  • However, calling them with confidence is not trivial due to noise in the reads
  • Numerous bioinformatics tools have been developed to uncover mutations (variants) from sequencing reads, and such procedures typically consist of three components: read processing, mapping and alignment, and variant calling
  • Variant calling is a process of separating real variants from artifacts stemming from library preparation, sample enrichment, sequencing, and mapping/alignment
  • The underlying assumptions are quite different for germline and somatic variant calling algorithms
  • Rejecting artifacts is not as easy in somatic variant calling, because some “real” variants could also be present in very low frequencies in cases of impure sample, rare tumor subclone, or circulating DNA
  • The biggest challenge of somatic variant calling is to disambiguate low-frequency variants from artifacts, which requires more sensitive statistical modeling and advanced error correction technology
  • Genetic variants can be grouped into three categories by size: single nucleotide variant (SNV), insertion and deletion (indel), and structural variant
  • Very few variant callers are versatile enough to call all three because they require very different algorithms
  • Probabilistic modeling is critical to infer the underlying genotype or evaluate the odds of variant versus artifacts
  • Read quality is typically measured by average base quality score, mapping quality score, and number of mismatches from the reference genome, etc.
  • Sequencing or alignment artifacts may appear to have strong read evidence and trick the statistical model to pass them as “real” variants
  • A number of filters focus on repetitive regions such as homopolymer, microsatellite, or low complexity regions, which are known to cause false calls due to increased alignment and sequencing errors
  • Although most variant callers were published with benchmarking results against other mainstream pipelines of their time, the claimed performance may not be replicated on independent datasets
  • A number of independent studies to benchmark and compare various somatic variant callers have been published, but inconsistent performance data and contradicting rankings of the variant callers were reported
  • The inconsistency of benchmarking results is due to two reasons:
    1. Most variant callers need to be fine-tuned to achieve the expected accuracy on a naive dataset, yet the optimal parameter values are unknown to the tester
    2. Some variant callers were originally designed for certain types of applications and then published without extensive validation on a wide range of datasets, so their performance may drop in some occasions
  • Traditional model-based variant callers rely heavily on ad-hoc filters to reduce false calls because artifacts are generated in a very complex way that is beyond simple modeling
  • As a result, a variant caller often contains dozens of parameters and some of them can only be understood or safely tuned by the developers, hampering the practical utility
  • For somatic variant callers, independent and unbiased benchmarking is still limited by the lack of good validation datasets
  • The researchers investigated oddities in the “SARS-CoV-2” genome sequences from GISAID
  • Many mutations seem to arise many times along the phylogenetic tree (are highly homoplasic), and seem more likely the result of contamination, recurrent sequencing errors, or hypermutability, than selection or recombination
  • Some homoplasic substitutions seem laboratory-specific, suggesting that they might arise from specific combinations of sample preparation, sequencing technology, and consensus calling approaches
  • They recommend filtering out sequences that: have too few resolved characters (they admit their threshold of about 29,400 reference bases is somewhat arbitrary), are too diverged, have unusual locally high divergence, have missing/incomplete sampling date information, or that are distant from any other sequence in the dataset
  • The researchers do not provide a current list of sequences to filter out as it is quite long and varies as the number of publicly shared “SARS-CoV-2” genome sequences increases
  • A few samples they analyzed had extremely long terminal branches suggesting either evolutionary events leading to many substitutions (e.g. recombination events or large mutation events), or sequencing/calling artefacts in specific samples
  • Their software found 30 putative recombination events happening at terminal branches and, given that these mutations are not observed in any other samples, they could represent artefacts in the corresponding sequence
  • The 28881–28883 mutation also seemed to appear in other non-phylogenetically related samples as within-host polymorphism which hints at a considerable proportion of cases being either mixed infections (according to the researchers, these are patients infected with multiple strains of “SARS-CoV-2” yet there is only one supposed strain) or maybe in some cases contaminated at low frequency
  • Quick sidenote on homoplasies: this is an evolutionary biology concept defined as nucleotide identity resulting from a process other than inheritance from a common ancestor
  • They only observed isolated homoplasies suggesting either recurring mutations, or recurrent sequencing/calling artefacts
  • Some of the homoplasies (nucleotide identity resulting from a process other than inheritance from a common ancestor) they observed might be the result of phylogenetic errors
  • Another possibility was that these homoplasies might be caused by some of the samples being particularly enriched in recurrent artefacts, rather than artefacts distributed uniformly across all samples
  • They believe that the G11083T homoplasy might be a particularly frequent mutation or artefact
  • They suspect frequent homoplasies specific to dataset C could be artefacts (or possibly normal mutations that appear as homoplasic due to phylogenetic errors)
  • Tests were performed to determine which of these homoplasies might (also) be actual inherited “viral” mutations, and which are more likely to be non-inherited sequencing artefacts
  • Most of the homoplasies, in particular those appearing in only a few sequences, show close to no phylogenetic signal, supporting the hypothesis that they are artificial
  • They attempt to claim some may be true homoplasies yet offer that this analysis has substantial limitations as it ignores uncertainty in the tree and noise from possible artefacts within the remaining sites
  • It also does not tell if a variant is both a true mutational event as well as the result of recurrent sequencing biases, as possible for position 11083 (trying to have their cake and eat it too)
  • The researchers realized that in order to remove most variants that might emerge from mapping and sequencing artefacts, stricter filters than usual were needed
  • The trimming results suggested that recurrent artefacts might be present not only at the level of substitutions between consensus sequences, but also as frequent, apparent within-host variants
  • Variants in some samples were mostly made of clipped reads, i.e. reads that only partly map on the reference genome
  • They offer a word of caution when attempting to interpret the results of such variant calling methods, and minimally recommend a stringent set of filters
  • The mutational spectrum of within-host variants seems extremely shifted toward G→T variants, even more than when comparing consensus sequences suggesting that most of these variants might be the result of Illumina sequencing errors and/or sequencing biases
  • They had not yet investigated possible nanopore sequencing biases from within-host variation data
  • In June 2020, an update was provided where the researchers along with others identified several variants in “SARS-CoV-2” sequencing datasets that have the potential to adversely affect phylogenetic and evolutionary inference
  • In many cases, these apparent mutations were almost certainly sequencing or data processing artefacts
  • Only a few labs reported some quality criteria to ensure sequence quality before lineage assignment and submission to public databases
  • This study aimed to provide simple quality control criteria for “SARS-CoV-2” sequencing to prevent erroneous interpretation of low-quality or contaminated data
  • Low-frequency variants (<70% of supporting reads) can result from PCR amplification errors, sample cross contaminations, or presence of distinct “SARS-CoV-2” genomes in the sample sequenced
  • Most studies do not clearly define the quality control criteria used to include or exclude genomic data
  • No clear criteria have been proposed to assess the quality of “SARS-CoV-2” sequences
  • The key to the reporting of accurate results is the detection of potential contaminations that could hamper variant calling, affecting cluster analyses and leading to inaccurate lineage definition
  • The presence of low-frequency variants could reflect contamination between samples
  • However, this can also be observed due to intrahost heterogeneity, uncommon coinfection of different “viral” strains, or PCR polymerase errors
  • It is estimated that there are only 25 SNP per year, corresponding to around two mutations per month or one mutation every two transmissions, so they assume multiple variants are unlikely to be found within an individual (even though, at the time of writing, there are nearly 4.5 million variants sequenced and it’s growing daily)
  • There are unique low-frequency variants which are the likely result from errors during PCR amplification that occur randomly in the “SARS-CoV-2” genome
  • Low-frequency variants present in other genomes of the sequencing run could represent cross-contaminations
  • For routine analyses, the intrarun prevalence of every variant, including those with low frequency, should be calculated among sequenced samples to distinguish potentially contaminated samples from random incorrect base incorporation during genome amplification
  • The occurrence of multiple low-frequency variants observed in other samples of the sequencing run is suggestive of a possible cross contamination
  • They present two cases of cross-contamination:
    1. Case 1 included a first nasopharyngeal swab (called sample A) with a Ct of 32, identified as B.1.1.7 lineage by genomic analyses. However, it did not show the characteristic S gene dropout signature of this variant by RT-PCR and presented only 10 of the 17 typical mutations present in B.1.1.7 lineage
    2. In the second case, they observed an inconsistency between two samples taken concomitantly from the same patient that resulted in two different lineages. One sample was identified as B.1.1.7 (buccal swab, Ct of 34; named sample B) and the other sample as B.1.160 (nasopharyngeal swab, Ct of 24; named sample C)
  • It is reiterated again that variant calling accuracy and sequencing quality strongly depend on the input material, but very few papers explicitly reported their quality control procedures
  • They observed an abrupt increase of low-frequency variants in samples with PCR cycle thresholds above 28
  • They claim that this strongly supports the important effect of technical factors on the accuracy of variant calls, in accordance with independent preliminary findings suggesting that low-copy-number inputs impacted the allele frequency calls and introduced false intrahost mutations
  • Other authors have suggested that a significant number of new mutations reported in only one genome submission are likely the result of contamination or recurrent sequencing errors rather than selection or recombination
  • They conclude that to ensure the accuracy of “SARS-CoV-2” genome sequences, it is essential to identify and distinguish PCR errors as well as potential sample cross-contaminations

It should be clear by now that there are numerous factors which can affect the determination of a variant. Even if we were to put aside the lack of purified/isolated “viral” particles, as well as the numerous sources of foreign material and contaminants that will undoubtedly be present in either the BALF or the cell culture supernatant used to sequence a genome, there are various technological limitations, quality control issues, software errors, and other sources of contamination throughout the assembly and variant calling processes that lead to the creation of variant sequences. No two “viral” genomes are alike; they are all variants. It doesn’t matter if virologists claim these mutations in the sequenced genome occur due to replication or recombination, as it is admitted in their own sources that variants can also arise from sequencing errors, artefacts, PCR amplification, contamination, etc. The excuse virologists give for why these variants occur boils down to unobservable theories and assumptions which are used to cover up the fact that they are unable to sequence the exact same “viral” genome twice. Thus we have over 4.5 million “SARS-COV-2” variant sequences as of this writing.

To state a long post simply:

Sequencing artefacts, biases, errors = not sequencing the same “virus” every time = “variants.”
