“SARS-COV-2” and the Slippery Slope of Reference Genomics

When the genome of an organism is sequenced, the DNA is read as many short segments using technology such as Illumina's short-read sequencing platforms. This makes the process of creating the genome fast and cheap. However, a rather challenging problem is that the reads themselves carry no information on how to put the many short segments of DNA back together into an assembled whole. Because of this, the reconstruction of a genome is often done using computer algorithms and prediction programs. Once completed, these finished genomes are put into a database and utilized as reference genomes for future sequencing projects. The two sources below highlight the importance of reference genomes to genomics:

Piecing together the best reference genome

“When researchers create a reference genome, DNA from the organism of interest is first sequenced in short pieces. A big challenge when making a genome assembly is piecing these genome sequence fragments back together correctly. Once reassembled, you’re left with a reference genome. This can be used to answer fundamental questions about biology, disease, and biodiversity.”


A reference standard for genome biology

“Reference genomes are the cornerstone of modern genomics. These high-quality genomes are differentiated from draft genomes by their completeness (low number of gaps), low number of errors, and high percentage of sequence assembled into chromosomes.”

“Sequencing today is largely dominated by Illumina’s high-throughput, short-read (∼100–200 base pairs) technology, which has made the process of decoding genomes much faster and cheaper. But this speed comes with a cost. The millions of short, overlapping reads generated by these instruments represent a complex puzzle that must be pieced together in the correct order and orientation. Computational algorithms are needed to assemble reads into continuous segments of DNA sequence (known as contigs) and to subsequently order and orient these contigs into chains (known as scaffolds), which often contain gaps. To improve scaffolding, additional long-range information is needed from technologies like Hi-C (which chemically cross-links neighboring chromatin domains), optical mapping (which visualizes fluorescent probes bound to single immobilized DNA molecules) and linked reads (sets of barcoded short reads from the same DNA molecule). Even these approaches still fall short when attempting to decipher intractable genomic features such as repetitive DNA, G+C-rich sequence or structural rearrangements that span distances much longer than a short read.”
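To make the "puzzle" described in the quote above concrete, here is a minimal Python sketch of a greedy overlap assembler: it repeatedly merges the two sequences that share the longest suffix-to-prefix overlap until nothing overlaps any more. It is a toy under loud assumptions (error-free reads, a single strand, no long repeats), and the names `overlap`, `greedy_assemble`, and `min_len` are invented for illustration. It is not the algorithm of any particular assembler or of the papers quoted here.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Merge the best-overlapping pair again and again; return the contigs left."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing overlaps any more
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical reads, invented for this example
reads = ["ATGGCGTA", "GCGTACGTT", "ACGTTAGC"]
print(greedy_assemble(reads))  # -> ['ATGGCGTACGTTAGC']
```

Reads that share no overlap of at least `min_len` are left as separate contigs, which is the toy version of the gaps between contigs that the quote mentions.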

“Beyond these practical concerns, there is no definitive method to verify the correctness of the finished product. For some species, even simple information like the number of expected chromosomes is unknown. In most cases, researchers perform several checks to evaluate the quality of a final assembly. The assembly size can be compared with existing genome size estimates for that organism or can be estimated using statistical approaches. Algorithms can be applied to identify all the sequences of set length (called k-mers) in an Illumina library that are likely to be real, and then to work out what fraction of those k-mers are recovered in the new assembly. The final assembly can also be inspected for ‘core’ genes—a set of genes common in species related to the sequenced species.

In the absence of a perfect measure of genome correctness, the definition of a high-quality genome continues to evolve, together with advances in sequencing and assembly.”
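The k-mer check described in the quote above can be sketched in a few lines of Python. The idea, as I read it, is to collect every k-mer that appears in the reads often enough to be "likely real" and then ask what fraction of them also appear in the assembly. The function names, the `min_count` threshold, and the example sequences are assumptions made up for illustration, and the sketch ignores reverse complements, which real tools account for.

```python
from collections import Counter


def kmers(seq: str, k: int) -> set[str]:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_completeness(reads: list[str], assembly: str,
                      k: int = 4, min_count: int = 2) -> float:
    """Fraction of 'trusted' read k-mers that are also found in the assembly.

    A k-mer is trusted if it occurs at least `min_count` times across the
    reads; k-mers seen only once are treated as probable sequencing errors.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    trusted = {km for km, c in counts.items() if c >= min_count}
    if not trusted:
        return 0.0
    return len(trusted & kmers(assembly, k)) / len(trusted)


# Hypothetical data: two reads seen twice each, plus one read with an error
reads = ["ATGGCGTA", "ATGGCGTA", "GGCGTACG", "GGCGTACG", "ATGGCCTA"]
print(kmer_completeness(reads, "ATGGCGTACG"))          # -> 1.0
print(round(kmer_completeness(reads, "ATGGCGTA"), 3))  # -> 0.714
```

Note what this measures: agreement between the assembly and the reads it was built from. It says nothing about whether the reads themselves came from the organism of interest.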


While it is clear that reference genomes are an extremely important part of the fabric of modern genomics, the last of the above sources points out two glaring challenges related to their use:

  1. There is no definitive method to verify the correctness of the finished product
  2. The definition of a high-quality genome continues to evolve

If there is no method to verify the correctness of a finished reference genome and the definition for what constitutes a “high-quality” genome is constantly evolving, how can one be sure that any genome built from a reference genome is ever accurate? In fact, how could any genome ever assembled be considered accurate? In the case of “SARS-COV-2,” the genome was created using multiple reference genomes. One of them was RaTG13, a bat “coronavirus:”

A pneumonia outbreak associated with a new coronavirus of probable bat origin

“Full-length genome sequences were obtained from five patients at an early stage of the outbreak. The sequences are almost identical and share 79.6% sequence identity to SARS-CoV. Furthermore, we show that 2019-nCoV is 96% identical at the whole-genome level to a bat coronavirus.”

“Full-length genome sequences of SARS-CoV BJ01, bat SARSr-CoV WIV1, bat coronavirus RaTG13 and ZC45 were used as reference sequences.”

“We then found that a short region of RNA-dependent RNA polymerase (RdRp) from a bat coronavirus (BatCoV RaTG13)—which was previously detected in Rhinolophus affinis from Yunnan province—showed high sequence identity to 2019-nCoV. We carried out full-length sequencing on this RNA sample (GISAID accession number EPI_ISL_402131). Simplot analysis showed that 2019-nCoV was highly similar throughout the genome to RaTG13 (Fig. 1c), with an overall genome sequence identity of 96.2%.”
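For readers wondering what a figure like "96.2% sequence identity" means in practice, here is a minimal Python sketch. It assumes the two sequences have already been aligned (gaps written as `-`) and counts the share of alignment columns where both carry the same base. This is only one of several definitions of identity in use, the function name is mine, and the short sequences are invented. It is not the Simplot procedure the paper used.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns in which both sequences have the same base.

    Columns where both sequences have a gap are ignored; a gap opposite
    a base counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if not (x == "-" and y == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


# Hypothetical aligned fragments, invented for this example
print(percent_identity("ACGTACGTAC", "ACGTTCGTAC"))  # -> 90.0
print(percent_identity("ACGT-CGT", "ACGTACGT"))      # -> 87.5
```

As plain arithmetic, 96.2% identity over a genome of roughly 30,000 bases still leaves about 0.038 × 30,000 ≈ 1,140 positions that differ.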


So what exactly is RaTG13?

“Bat coronavirus RaTG13 is a SARS-like betacoronavirus that infects the horseshoe bat Rhinolophus affinis. It was discovered in 2013 in bat droppings from a mining cave near the town of Tongguan in Mojiang county in Yunnan, China. As of 2020, it is the closest known relative of SARS-CoV-2, the virus that causes COVID-19.”


This bat “coronavirus” is the closest known relative of “SARS-COV-2.” But what happens if it turns out that RaTG13 doesn’t exist? What if it is an inaccurate and faulty reference genome? What would that mean for “SARS-COV-2,” a “virus” which has a 96.2% identity match and used RaTG13 as a reference genome? This next source lays out many of the problems with the RaTG13 genome:

Scientists claim serious data discrepancies in RaTG13 sequence

“A new preprint published in September 2020 by molecular biologists at the All India Institute of Medical Sciences, New Delhi, and the Indraprastha Institute of Information Technology Delhi discusses the current issues with the bat coronavirus (CoV) strain that is often considered to have very close homology with the above-mentioned virus, concluding that there are inadequate grounds to consider it to be the ancestral pool of SARS-CoV-2.

Many scientists mention the genome sequence of this bat CoVs, RaTG13, as being part of the ancestral descent of the current virus. A recent paper in the journal Nature also mentions its 96.2% homology with SARS-CoV-2, considering it to be a fossil record of a strain whose current existence is doubtful, but which may have been the original pool from which the current virus developed.

The scientists assembled the viral genome from scratch, performed a metagenomic analysis, and looked at data quality. They concluded that the RaTG13 genome had serious issues and all data related to it required a full review.”

“Full experimental details backed up the published genomic sequence of SARS-CoV-2, but not so that of the RaTG13. This is documented by several papers that have shown up the holes in the dataset underlying the published genome of the bat virus.”

“The dataset that has been published in support of the RaTG13 genome, almost 30 kb long, has been found inadequate to reproduce the sequence or the experimental observations based on this dataset. While the dataset is unique and contains much information beyond the fragmented coronavirus sequence, not much is known about how it was generated.”

De novo RaTG13 Assembly Not Possible

“The researchers found that using the available data, they were unable to detect any contiguous sequences larger than 17 kb, using several different settings. Several matching sequences were found, but none over a fifth of the length of the reported sequence. A gap spanning 111 positions was found, and it is unclear on what basis this was filled in the published sequence.

Contamination Likely

The researchers also uncovered proof that DNA contamination is likely to have occurred. For instance, the largest contig contains genetic material with 98% similarity to the full-length mitochondrial sequence of the Chinese rufous horseshoe bat (Rhinolophus sinicus), an unlikely event since a complete assembly of such a sequence is typically interrupted by stop codons.

Secondly, non-adapter-related repetitive sequences were found in most reads, often at the same end of the read, comprising one G-quadruplex sequence and its reverse complement. This is unlikely to happen on the same end of an RNA sample since only one strand is dominant. The researchers say more information about how the experiments were carried out is crucial to rule out the possibility of gross RNA sample contamination by DNA.

Poor Data Quality

The researchers also calculated that the average coverage is 9.73, indicating a low value. This may be why only partial segments of the RaTG13 sequence are assembled. The coverage is only 2 or less for about 3,000 bases, which could markedly impair the accuracy. They draw attention to multiple ambiguous bases in the first end that could prevent de novo assembly, and to many unreliable second end reads as well.

Experimental Procedural Concerns

The significantly large differences in the bacterial content of the two referenced datasets are surprising, say the researchers, since both purport to be from similar sites, fecal and oral samples. One has only 0.65% bacteria, and ~68% Eukaryota, with the rest being unidentified. The other is ~91% bacteria and ~4% Eukaryota. This concern has been raised before.

Again, 0.1% of the first dataset is similar to plant genomes like rice and maize, which is unexpected from bat samples from creatures like the intermediate horseshoe bat Rhinolophus affinis. The researchers attribute this to contamination by possible index hopping because of evidence that the same platform has been used to sequence maize earlier. Multiplex sequencing of maize and the CoV genome of interest could lead to such contamination.

Again, the dataset also contains material identical to that of the Malayan pangolin Manis javanica, a totally different order. This again could be due to index hopping of some fragments for the same reason. This could have misdirected the discussion on the origin of the novel CoV, as some have reported that pangolin CoV genomic sequences also have close homology with that of the former.

Thus, the inference could also be that contamination accounts for the presence of various portions of the RaTG13 in the dataset, accounting for 0.0008% of the total.”

The second run also has sequences resembling another virus accession number, apart from its own accession number. This dataset is supposed to have a separate lane, and index hopping may be supposed not to have occurred here, but cross-contamination still seems to have occurred. The researchers note that this “raises a distinct possibility that sample from previous runs might not have been guarded against either index hopping or cross-contamination.”

This could explain the discrepancies in the earlier dataset. Furthermore, some sequences seem to have been derived from retroviruses such as the greater horseshoe bat Rhinolophus ferrumequinum, but a whole virus could not be assembled.


While most work on the origins of SARS-CoV-2 has focused on the human CoV sequence, the current study shows that equal importance must be given to the other half of the equation, namely, RaTG13, in order to justify giving it a role in the narrative. Secondly, discussions may instead be withheld, while the precise details of the methods used to generate the RaTG13 are awaited. And thirdly, this genome should not be used in further studies until its scientific reliability is established in entirety, by independent researchers with access to the full dataset and methods used for its generation.”
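The coverage numbers quoted above (an average of 9.73, and 2 or less for about 3,000 bases) refer to how many reads overlap each position of the sequence. A minimal Python sketch of that calculation follows. It assumes the reads have already been mapped and are given as hypothetical `(start, end)` intervals; the function names and the tiny example are mine, not the preprint's pipeline.

```python
def depth_per_position(genome_length: int,
                       alignments: list[tuple[int, int]]) -> list[int]:
    """Number of reads covering each position.

    `alignments` holds half-open (start, end) intervals, 0-based, one per
    mapped read. Parts of a read outside the genome are ignored.
    """
    depth = [0] * genome_length
    for start, end in alignments:
        for pos in range(max(start, 0), min(end, genome_length)):
            depth[pos] += 1
    return depth


def coverage_summary(depth: list[int], low: int = 2) -> tuple[float, int]:
    """Return (mean depth, number of positions with depth <= low)."""
    mean_depth = sum(depth) / len(depth)
    low_positions = sum(1 for d in depth if d <= low)
    return mean_depth, low_positions


# Hypothetical 10-base sequence with four mapped reads
depth = depth_per_position(10, [(0, 6), (2, 8), (4, 10), (4, 6)])
print(depth)                    # -> [1, 1, 2, 2, 4, 4, 2, 2, 1, 1]
print(coverage_summary(depth))  # -> (2.0, 8)
```

A position covered by only one or two reads has little redundancy to outvote a sequencing error, which is why the researchers flag the low-coverage stretches.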


Obviously, the results from these researchers carry some very heavy implications for “SARS-COV-2” as the error-filled and unreliable genome for RaTG13 was used as a reference genome during the creation of “SARS-COV-2.” They are almost genetic twins with a 96.2% sequence identity match. If the reference genome is faulty, how accurate could the genome based on it truly be? These next few sources provide some help in finding the answer:

Can I Get a Reference?

“Perhaps one of the biggest drawbacks is the need for a reference genome for comparison with your sequence. If you don’t have one of these to compare your results to, you have no real way of determining what is normal and what is unique about your sample. Good luck identifying an insertion mutation without an unaltered genome to compare to! While de novo sequencing for when a reference is not available is possible, it can lead to more errors since you have nothing to compare to.”

The Good, the Bad and the Expensive of Whole Genome Sequencing

Standards for Sequencing Viral Genomes in the Era of High-Throughput Sequencing

“To alleviate any reliance on particular aspects of the different sequencing technologies, we have made two assumptions that should be valid in most viral sequencing projects. The first assumption is a basic understanding of the genomic structure of the virus being sequenced, including the expected size of the genome, the number of segments, and the number and distribution of major open reading frames (ORFs). Fortunately, genome structure is highly conserved within viral groups (6), and although new viruses are constantly being uncovered, the discovery of a novel family or even genus remains relatively uncommon (7). In the absence of such information, the defined standards can still be applied following further analysis to determine genome structure. The second assumption is that the genetic material of the virus being described can be accurately separated from the genomes of the host and/or other microbes, either physically or bioinformatically. Depending on the technology used, it is critical that the potential for cross-contamination of samples during the sample indexing/bar coding process and sequencing procedure be addressed with appropriate internal controls and procedural methods (8).”

“Additionally, with the current suite of primarily sequence similarity-based pathogen identification tools, the ability to detect novel pathogens is wholly dependent on high-quality reference databases (22). There is a trend toward requiring a complete genome sequence when a description of a novel virus is being published, and we agree that this is a good goal;”


From these two sources, we can see that without an accurate reference genome, there is no way to tell what is normal or unique about a sample, and that errors are more likely to occur without a reference to compare to. This means that the ability to detect novel “viruses” is entirely dependent on having a high-quality reference database with accurate genomes. Without this, they would not know what they have. Yet we already know from previous sources that they have neither a definitive nor a perfect method for determining the correctness of a reference genome, and that what constitutes a “high-quality” genome constantly changes.

Interestingly, the last article offered two assumptions they make when sequencing any “virus.” The first is that they always have a basic understanding of the genome of the “virus.” This is quite the assumption to make. They are claiming that genome structure is conserved within “viral” groups and that it is rare to find new “virus” families. However, without knowing what novel “viruses” are out there as well as what their genetic makeup would be, how could this possibly be true? It has been estimated that the human body hosts some 380 trillion “viruses,” most of which are undiscovered and have unknown genomes. How would they be able to know that the genetic material they are piecing together is not coming from one or more of these undiscovered families?

The second assumption is that the “virus” can be separated completely from the host and/or bacterial genetic material. In other words, they assume that the “virus” is free of off-target genetic material and that whatever is sequenced is completely “viral” in nature. If this has always been the assumption since the first “viral” genome was sequenced, how was it ever confirmed that any material was ever completely “viral” to begin with? In order for that to be an accurate assumption, they would have needed to know all human/bacterial sequences from the very first “viral” genome ever created to be able to accurately separate human from bacterial DNA/RNA in order to determine what is uniquely “viral.” This is obviously not the case as to date there still is not a completed human genome and there are at least an estimated 38 trillion bacteria said to be living inside of us. As with the “virome,” our biome is made up of mostly undiscovered bacteria with unknown genomes. Without purification/isolation, there would always be off-target genetic material in the sample. The WHO has admitted that even with purification, “non-viral” genetic material will be sequenced. This means that they are assuming that human and/or bacterial genetic material is “viral” when in fact it is not.

Putting it all together, the “SARS-COV-2” genome is only as good as its reference genomes. However, the reference genomes are only as good as the reference genomes used to create them. If any or all of the references are faulty, the whole chain of genomes built from them is faulty as well. Judging from the breakdown of RaTG13, the closest relative to “SARS-COV-2,” there are numerous problems with the genome, such as:

  1. The dataset that has been published in support of the RaTG13 genome, almost 30 kb long, has been found inadequate to reproduce the sequence or the experimental observations based on this dataset
  2. De novo assembly was not possible
  3. The researchers uncovered proof that DNA contamination is likely to have occurred
  4. There was poor data quality with multiple ambiguous bases in the first end that could prevent de novo assembly, and many unreliable second end reads as well
  5. There were numerous experimental procedural concerns, especially involving practices that lead to cross-contamination of maize and pangolin DNA being contained within the genome
  6. The researchers concluded that this genome should not be used in further studies until its scientific reliability is established in entirety

At the beginning of this chain of references, there would have had to have been purified/isolated “virus” particles free of any off-target genetic material used in the creation of a completely accurate reference genome. It is clear that this was not the case for RaTG13, the closest genetic relative of “SARS-COV-2.” RaTG13 is an error-filled, contaminated mess with mysterious origins. If RaTG13 is the faulty reference genome it appears to be, then that immediately places the “SARS-COV-2” genome under the same umbrella. If the reference genome is not accurate, there is no way the genome created by using it is accurate either.
