How Accurate and Reliable are Genomes?

One argument people make as proof of “viruses” is the existence of “viral” genomes. They believe that if a genome can be sequenced from unpurified cell culture soup where a “virus” is assumed to exist, this is proof enough that a “virus” physically exists. Looking beyond the irony of claiming that random A,C,G,T’s in a computer database can somehow be used as evidence for the physical existence of an unseen entity, there are numerous reasons to question the reliability and accuracy of genomes. These include, but are not limited to:

  1. The reliance on inaccurate reference genomes
  2. The inability to replicate results
  3. The numerous technological hurdles tied to the technology in use at the time the genome was sequenced
  4. The introduction of biases, errors, and artefacts throughout the sequencing process
  5. The uncurated databases
  6. The various assumptions that are made by the researchers

It is utterly ridiculous to believe that these non-reproducible and error-prone sequences from unpurified cell culture soup can be used as indirect proof of a “virus” when the direct proof, i.e. purified/isolated particles taken directly from sick humans and shown to be pathogenic in a natural way, has yet to be scientifically established.

In order to showcase the various problems associated with the sequencing process and the unreliability of genomes, I have provided information from four different sources. It should be clear afterwards why these random A,C,T,G’s in a computer database cannot be used as proof of “viruses.”

Shattered Assumptions

This first source is a 2008 article by molecular biologist Ulrich Bahnsen. These highlights showcase many of the assumptions that were made early on in genomics which were proven false as newer technology came about. He showed that the genome, once considered a static construct, is actually in a constantly evolving state. A genome is only a snapshot in a moment of time and is not in a fixed state but rather a revolving door of genetic information. Keep in mind while reading this how quickly the previous assumptions by geneticists changed from 2000 to 2008. How many false assumptions have been built upon? What assumptions are they making now which will be obsolete a few years down the line?

Genetics: Genome in Dissolution

“The genome was considered to be the unchangeable blueprint of the human being, which is determined at the beginning of our life. Science must bid farewell to this idea. In reality, our genetic make-up is in a state of constant change.

“Geneticists must abandon their image of a stable genome, in which changes are pathological exceptions. The genome of each individual is in a state of constant transformation. As a result, every organism, every human being, even every cell of the body is a genetic universe in itself.

The first analysis of the human genome was still a lengthy and costly affair, the result – celebrated in 2000 by US President Bill Clinton as the “Book of Life” – a sequence of three billion letters. Since then, new laboratory techniques, with the help of which enormous amounts of data can be generated and analyzed, have generated a flood of new findings on the inner life of the human genome in particular. In the process, the book dissolves before the eyes of the readers. The genome is not a stable text. The state of knowledge also raises basic philosophical questions such as the genetic and thus biophysical identity of the human being – and possibly demands radically different answers. The geneticists have their sights set on a new “human project” – motto: All about the ego.

The latest results show more than ever that humans are a product of genetic processes. But also that these processes are equipped with many degrees of freedom. They form an open system in which by no means everything is predetermined.

After the first genome coding, only a few people suspected this. The experts believed they had understood how a gene looks and functions, which functional principles the human or microbial genome follows. “In retrospect, our assumptions about how the genome works back then were so naive that it is almost embarrassing,” says Craig Venter, who was involved in the project with his company Celera. What was expected was a collection of complicated but understandable recipes for the life processes. Now it becomes clear: The book of life is full of enigmatic prose.

It was only the first climax of the upheaval, when a few months ago the conviction of the genetic uniformity and thus identity of mankind broke down. Until then, the assumption had been that the genetic material of any two people differed only by about one per mille of all DNA building blocks. But the differences in the genetic makeup of humans are in reality so great that science now confirms what the vernacular has long known: “Every man is different. Completely different!”

The previous conviction that each gene usually exists only twice in the genome (once in the paternal, once in the maternally inherited set of chromosomes) is also incorrect. In reality, a great deal of genetic information is subject to a process of duplication and exists in up to 16 copies in the cell nucleus. Various research teams have now discovered such copy number variants (CNVs) in at least 1500 human genes; there are probably many more of these Xerox genes, with each person having a different CNV profile. The explosiveness of the findings is exacerbated by the discovery that the CNV patterns in the genome are by no means stable, the copy number of the genes may decrease or increase, and even the somatic cells of an individual human differ from each other.

The idea that the genome represents a natural constant, a fixed source code of the human being, is now crumbling under the weight of the findings. The US geneticist Matthew Hahn already compared the genome with a revolving door: “Genes constantly come, others go.”

https://telegra.ph/Genetics-Genome-in-Dissolution-11-01

Acceptable Errors?

This next source is from Stanford.edu and it discusses the “accuracy” of the human genome. It details the attempts to achieve greater accuracy as newer technology developed. Throughout the project, there were varying levels of acceptable accuracy with differing acceptable error rates. Whereas the accepted error rate for the first draft was 1 error per 1,000 base pairs, it is now set at 1 error per 10,000 base pairs. The article questions whether human genomes can ever be accurate enough to serve their purpose of providing personalized medical information. With acceptable error rates allowed, that is a very good question.

Accuracy of Human DNA Sequencing

“The Human Genome Project was the culmination of combined efforts from several different research groups including the National Human Genome Research Institute, the Department of Energy and the International Human Genome Sequencing Consortium [1]. The end goal of this project was to produce a sufficiently accurate version of the human genetic code. Our DNA is composed of 23 pairs of chromosomes which contains approximately 30,000 genes which is coded by sets of base pairs (either adenine [A], thymine [T], cytosine [C], or guanine [G]). All in all, the human genome contains approximately 3 billion base pairs. Recent improvements regarding computational analysis have drastically progressed the advancements of DNA sequencing. From a computing aspect, each base pair could be represented by a minimum of 2 bits, which would thus require over 750 megabytes (MB) to store the entire human genome [2]. But just how accurate is DNA sequencing and its data storage techniques? What effect do these inaccuracies have on genomics and their use in pharmacogenetics?

Throughout the course of the Human Genome Project, there have been varying levels of target accuracies that the research institutes have aimed for. In 2000, the first draft was released with an error rate of one error per every 1,000 base pairs. In 2003, the official results were cited to have an error rate of one per every 10,000 base pairs [1]. Currently, this requires going through and sequencing the DNA a total of ten times to achieve that level of accuracy [3]. Known as the Bermuda Standards, the international standard for accuracy is currently held at one error per 10,000 base pairs for the entire contiguous sequence – the DNA is sequenced in parts, and often times, gaps exist between these different parts [4]. Regardless of how accurate this process of sequencing may seem, through the sequencing of the entire human genome, this yields a total of approximately 300,000 base pair errors.

But how significant is a 00.0001% error rate? The Human Genome Project has brought attention to the significance of single nucleotide polymorphisms (SNPs). SNPs are natural DNA sequencing variations of a single nucleotide (A, T, C or G) that occur every 100 to 300 base pairs [5]. The variations caused by SNP can dramatically affect how humans react differently to things such as drugs, vaccines, or diseases. However, because of the inherent and allowable errors for companies such as 23andMe that sequence DNA, their results will certainly sequence some SNPs inaccurately. The problem is that companies like 23andMe expect to use their DNA sequencing results to provide medical advice for the participants and their doctors so that they can better prescribe more accurate drug dosages. However, with over 300,000 base pair errors, how accurate can this medical advice be? If the capabilities and limitations of the human body are sensitive down to the individual nucleotide (as with SNP), can human genome sequencing be reliable enough to serve its purpose as a source for personalized medical information completely dependent on human DNA?”

https://cs.stanford.edu/people/eroberts/courses/cs181/projects/2010-11/Genomics/accuracy.html
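The figures quoted above can be checked with simple arithmetic. The sketch below is only a back-of-the-envelope calculation using the numbers given in the article (3 billion base pairs, 2 bits per base, one error per 1,000 or per 10,000 bases); the function names are my own and are not from the source. Note that one error per 10,000 bases is 0.0001 as a fraction, or 0.01% as a percentage.

```python
# Back-of-the-envelope check of the figures quoted from the Stanford article.
GENOME_BASE_PAIRS = 3_000_000_000  # "approximately 3 billion base pairs"
BITS_PER_BASE = 2                  # four letters (A, C, G, T) fit in 2 bits


def storage_megabytes(base_pairs, bits_per_base=BITS_PER_BASE):
    """Minimum storage in megabytes, taking 1 MB as 10**6 bytes."""
    return base_pairs * bits_per_base / 8 / 1_000_000


def expected_errors(base_pairs, bases_per_error):
    """Number of wrong bases at a rate of one error per `bases_per_error`."""
    return base_pairs // bases_per_error


if __name__ == "__main__":
    print(storage_megabytes(GENOME_BASE_PAIRS))        # 750.0
    print(expected_errors(GENOME_BASE_PAIRS, 1_000))   # 3000000 (2000 draft)
    print(expected_errors(GENOME_BASE_PAIRS, 10_000))  # 300000 (2003 standard)
```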

Defining Accuracy

This third source is from 2021. While it admits that determining the accuracy of sequencing results is difficult due to differences among technologies and genomic regions, it sets out the parameters for defining accuracy in genomics. Accuracy is broken down into read accuracy and consensus accuracy, both of which are heavily influenced by mappability. Even perfect reads can contribute to errors if they are not ordered and placed correctly during assembly. If there are high error rates, it is admittedly impossible to determine whether any discrepancies between the reference genome and the data set are variants or sequencing errors.

Sequencing 101: Understanding Accuracy in DNA Sequencing

“For scientists who utilize DNA sequencing in their research but are not experts in the underlying technology, it can be difficult to determine the accuracy of sequencing results — and even harder to compare accuracy across sequencing platforms. Furthermore, accuracy differs not only between technologies but also across genomic regions as some stretches of the genome are inherently more difficult to read.

It is critically important to understand accuracy in DNA sequencing to distinguish important biological information from sequencing errors.

There are two key types of accuracy in DNA sequencing technologies: read accuracy and consensus accuracy. Read accuracy is the inherent error rate of individual measurements (reads) from a DNA sequencing technology. Typical read accuracy ranges from ~90% for traditional long reads to >99% for short reads and HiFi reads.

Consensus accuracy, on the other hand, is determined by combining information from multiple reads in a data set, which eliminates any random errors in individual reads. Deeper coverage, meaning more reads from which to build a consensus, generally increases the accuracy of results. However, there are still limitations to calling consensus from multiple reads. Consensus calculation is a complicated and computationally expensive process, and it cannot overcome systematic errors. If a sequencing platform consistently makes the same mistake, then it will not be erased by generating more sequencing coverage.

To sidestep this problem, it is common to “polish” long reads that have systematic errors with high accuracy short reads. However, because of their read length, short reads cannot always map to the long reads unambiguously, limiting their ability to improve accuracy. In general, consensus is improved – and vastly simplified – by starting with highly accurate reads with no systematic biases.”
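To make the distinction between random and systematic errors concrete, here is a minimal sketch of a consensus call by per-position majority vote. It assumes toy reads that are already aligned and of equal length, which real data never is; the function name and the example reads are hypothetical and do not come from the source.

```python
from collections import Counter


def consensus(reads):
    """Per-position majority vote over equal-length, pre-aligned reads."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*reads)
    )


TRUTH = "ACGTACGT"

# Random errors: each read is wrong at a different position.
random_error_reads = ["TCGTACGT", "ACGAACGT", "ACGTACGA"]

# Systematic error: every read makes the same mistake at the same position.
systematic_error_reads = ["ACGAACGT", "ACGAACGT", "ACGAACGT"]

if __name__ == "__main__":
    print(consensus(random_error_reads) == TRUTH)      # True
    print(consensus(systematic_error_reads) == TRUTH)  # False
```

In this toy case the vote removes the scattered errors, but no amount of extra coverage changes the outcome when every read carries the same mistake, which is the limitation the article describes.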

Mappability

“The accuracy of a genome assembly goes beyond the accuracy of each individual base. Even perfect reads can contribute to poor accuracy if they are not ordered and oriented correctly in the assembly. This question of where to place the read is called mappability.

Reads containing only a piece of a large structural element, or consisting of highly repetitive sequences, can be very difficult to align, mapping ambiguously to many different locations in a reference. This is where short reads really struggle; because of their size, there is a greater chance that they will not contain enough unique sequence data to anchor them properly in a genome.”
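The mappability problem described above can be illustrated with a toy exact-match search. This is only a sketch under simplifying assumptions (a made-up reference string, error-free reads, exact matching instead of a real aligner), and the names are hypothetical.

```python
def mapping_positions(reference, read):
    """Return every start index where `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


# Toy reference: unique flanks around a short repetitive stretch.
REFERENCE = "GATTACA" + "CACACACA" + "TTGGCC"

short_read = "CACA"          # lies entirely inside the repeat
long_read = REFERENCE[3:18]  # spans the repeat plus unique sequence on both sides

if __name__ == "__main__":
    print(mapping_positions(REFERENCE, short_read))  # [5, 7, 9, 11]
    print(mapping_positions(REFERENCE, long_read))   # [3]
```

The short read fits in four different places, so its true origin cannot be decided from the sequence alone, while the longer read is anchored by the unique sequence on either side of the repeat.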

Phasing

“When exploring diploid or polyploid genomes, phasing means separating the different copies of each chromosome (e.g. maternal and paternal for diploid), known as haplotypes. With sufficient accuracy, the identity of nucleotides at each position in the genome can be compared with a reference sequence to identify SNVs, with a heterozygous locus indicating a difference in sequence between a homologous chromosome pair. This is where the inherent low accuracy of traditional error prone long reads becomes a limitation – with a high error rate, it makes it impossible to decide whether a disagreement between a reference and data set is a variant or a sequence error.”

Sequencing 101: Understanding Accuracy in DNA Sequencing

Works in Progress?

This fourth and final source is from a review published in 2019. In it, the authors discuss the many errors that are encountered during genomic sequencing even in the face of newer technologies and algorithms. Various pitfalls such as biases, artefacts, and errors can crop up at any stage of the process. The authors concede that all genomes have issues and are essentially “works in progress.” The type of technology used and the way in which the genome was assembled ultimately have profound impacts on the accuracy of any genome.

Is reliance on an inaccurate genome sequence sabotaging your experiments?

Abstract

“Advances in genomics have made whole genome studies increasingly feasible across the life sciences. However, new technologies and algorithmic advances do not guarantee flawless genomic sequences or annotation. Bias, errors, and artifacts can enter at any stage of the process from library preparation to annotation. When planning an experiment that utilizes a genome sequence as the basis for the design, there are a few basic checks that, if performed, may better inform the experimental design and ideally help avoid a failed experiment or inconclusive result.

All genome sequences have “issues”

There are many factors that can affect the ultimate genome sequence and annotation that are produced, and both should be considered “works in progress.” An awareness of these factors can inform experimental decisions that may depend upon the accuracy of a particular genome sequence, region, gene, or genes.”

What is the origin of the sample used to generate the genome sequence?

“The origin matters. Did the sample originate from a clone, a mixed population (common with microbes), or possibly a hybrid? Differences between individuals can be single nucleotide polymorphisms (SNPs), but often they involve insertions or deletions (indels) of various sizes, copy number variations (CNV), and even small rearrangements. Hybrids can have dramatic differences between orthologous chromosomes [1]. Genome sequences derived from a heterogenous population, especially when CNVs exist, complicate genome assembly, and often the sequence produced is a composite of the major alleles present in the sequenced sample. Genome sequences derived from clonal laboratory strains are often easier to assemble, but they may not be truly representative of circulating wildtype strains because they are adapted to culture and, if propagated for a long time, may have lost genes or accumulated mutations [2].”

Does the genome have troublesome characteristics?

“Some genome sequences are physically difficult to sequence because of extreme nucleotide bias. The Plasmodium falciparum genome sequence was so AT-rich that specialized sequencing chemistry was developed [3]. Long homopolymeric runs of any base are particularly troublesome for some sequencing technologies [4] and may lead to an incorrect number of nucleotides, resulting in frame-shifts if the sequence is coding. If a putative frameshift interrupts your gene of interest, confirm its presence in your stocks with PCR and Sanger sequencing, ideally, or view the assembly (see Fig 1) before accepting it. If the genome sequence contains numerous repetitive sequences, retrotransposons or mobile elements, or large, highly similar gene families, the genome assembly will be affected (Fig 1), especially if only short-read sequences were used.

Repetitive sequences are a huge challenge for most assembly algorithms. It is well established that multiple variant copies of genes can increase novelty and help a pathogen to survive in the face of immune system pressures [5] and are often related to pathogenesis and virulence [6–8]. Long-read and single-molecule technologies like PacBio and Nanopore [9] can provide verification of tandem copies and in some cases, the number of gene copies. Low-coverage, less accurate, long-molecule reads can be used as a framework upon which shorter-read sequences can be mapped, or long reads, when sufficiently deep, can be used for the complete assembly and provide self-error correction [10–12].

“There is an easy way to assess the quality of your organism’s genome assembly. Map the reads from the sequencing project back to the assembled genome sequence and have a look (Fig 1A) (see the following sections for pointers on how to do this: “How was the genome assembled?”, “How good is the assembly?”, “Was the assembly corrected?”, and “Common challenges and strategies to help”). This quick screen checks for “pile-ups,” the tell-tale indicator for the presence of collapsed repetitive sequence regions in the assessed genome sequence (Fig 1B). Alternatively, a genomic Southern blot utilizing a restriction enzyme that cuts once within your sequence of interest will also reveal additional copies if present. The reference genome assembly for the apicomplexan parasite Toxoplasma gondii ME49 contains several collapsed regions that vary by strain (Fig 1C) [8]. Despite the high quality of this genome sequence and its correspondence to genetic maps, issues related to the number of chromosomes still exist [13, 14].
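The “pile-up” screen described above can be sketched in a few lines. The example below assumes the reads have already been mapped and are given as toy (start, end) intervals; the threshold of twice the median depth is my own assumption for illustration and is not a value taken from the review.

```python
from statistics import median


def coverage_depth(genome_length, alignments):
    """Count how many reads cover each position.

    `alignments` is a list of (start, end) half-open intervals.
    """
    depth = [0] * genome_length
    for start, end in alignments:
        for pos in range(start, end):
            depth[pos] += 1
    return depth


def pile_ups(depth, fold=2.0):
    """Return positions whose depth is at least `fold` times the median depth."""
    typical = median(depth)
    if typical <= 0:
        return []
    return [pos for pos, d in enumerate(depth) if d >= fold * typical]


# Toy data: even 2x coverage everywhere, plus extra reads stacked on 5..9,
# as would happen if several repeat copies were collapsed into one.
alignments = [(s, s + 5) for s in (0, 5, 10, 15)] * 2 + [(5, 10)] * 4

if __name__ == "__main__":
    depth = coverage_depth(20, alignments)
    print(pile_ups(depth))  # [5, 6, 7, 8, 9]
```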

How were the libraries prepared?

Before long-read technologies existed, large distances were covered by biological libraries of different insert sizes in plasmids, bacterial artificial chromosomes (BACs), cosmids, and fosmids. Generation of single reads from each end of a known length (e.g., 10 kb) library insert sequence would suggest that the reads should end up in the assembled genome facing each other and about 10 kb apart. If they are not, it is suggestive of an assembly error. Genome sequences that relied on cloning and biological replication have additional issues that need to be considered. Some sequences simply cannot be cloned; they are toxic to the organism used for cloning and replication and thus, will be missing in the genome sequence produced. Unclonable sequences often contain a few select genes and heterochromatin. The inverse is also true; a DNA sequence from the cloning vector or organism used to construct the library can end up in the assembled target genome sequence.”
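The paired-end consistency check described in this passage (reads from the two ends of a 10 kb insert should face each other and sit about 10 kb apart) can be expressed as a small test. This is a sketch only: the (position, strand) representation and the 20% tolerance are assumptions made for illustration.

```python
def pair_is_consistent(read_a, read_b, expected_insert=10_000, tolerance=0.2):
    """Check that two end reads face each other at roughly the expected distance.

    Each read is a (position, strand) tuple, with strand '+' or '-'.
    """
    left, right = sorted([read_a, read_b])
    facing = left[1] == "+" and right[1] == "-"
    distance = right[0] - left[0]
    close_enough = abs(distance - expected_insert) <= tolerance * expected_insert
    return facing and close_enough


if __name__ == "__main__":
    print(pair_is_consistent((1_000, "+"), (11_000, "-")))  # True
    print(pair_is_consistent((1_000, "+"), (50_000, "-")))  # False, too far apart
    print(pair_is_consistent((1_000, "-"), (11_000, "+")))  # False, facing away
```

A pair that fails this kind of check is what the authors call “suggestive of an assembly error.”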

“High-throughput NGS library preparation plays a critical role with respect to the quality of the genome sequence produced. Many protocols contain amplification steps, which can introduce bias. For example, single cells can be used for genome sequencing but via the application of whole genome amplification (WGA). The approach is powerful when material is limited, but the amplification process is biased, and several different WGA reactions (on different cells or populations of like cells) are necessary to fully identify and remove the amplification bias [15, 16]. It should be noted that bias is rarely removed from the reads submitted to archives, so it is imperative to know if WGA was utilized.

What sequencing platform was used?

Different sequencing platforms have different strengths and weaknesses [9], and they continue to evolve rapidly and often complement each other if several different approaches are applied. Genome sequences assembled with Sanger chemistry will have good quality sequence, but the assembled genome sequence will be affected by the library issues mentioned previously. Genome sequences generated with legacy systems, e.g., 454 and Ion Torrent, will have homopolymer miscount issues. Newer genome sequences will consist of highly accurate Illumina short-read technology, but the assembled sequence, especially if repeats are present, will be incomplete and contain gaps and mis-assemblies unless a hybrid assembly using long-read technologies like PacBio or Oxford Nanopore are utilized.

How was the genome assembled?

Sequence assemblies are of two types: de novo, assembled from scratch, and reference-based. The latter is normally used when an established organismal reference genome already exists and the experimental goal is to determine variation with respect to it. It is not a good approach to detect rearrangements or syntenic breaks, but it is ideal to detect SNPs, some indels, and CNV. Reference-based approaches will not reveal genome features not present in the reference, a significant drawback. Due to the large volume of population studies focused on SNPs, most genome sequence data, sadly, remain as unassembled files of reads.

De novo assemblies are the only option for an organism’s first genome sequence, and when possible, they should be performed as a matter of practice to permit discovery of new features. In the case of eukaryotic genome sequences, especially when the karyotype is unknown and physical maps do not exist, reads can only be partially assembled into contiguous reads, “contigs,” or scaffolds of contigs, containing gaps. Contigs often contain sequences that are fairly unique because repetitive sequences are often “masked” in a de novo assembly because of the issues they cause. As a result, contigs often end at, or are separated by, missing repetitive regions that were not utilized (e.g., masked) or could not be resolved during the assembly. Variation found at the ends of contigs should be treated with caution.

Gaps between contigs that have been ordered and oriented into scaffolds are often indicated by exactly 100 “N’s” to indicate a gap of unknown size. In some cases, scaffolds representative of whole chromosomes are assembled, but these, too, often contain numerous gaps or ambiguous bases (Table 1). Some assemblers also create a scaffold that links together all “leftover” contigs. Beware of this scaffold, often named “scaffold 0,” as the order and orientation of these contigs bears no resemblance to their biological location; it is simply a convenient mechanism to make sure all contigs are available to those using or searching the genome sequence.”
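Since gaps are written into the sequence as runs of “N,” they are easy to locate. The sketch below scans a made-up scaffold string for such runs and flags those of exactly 100, the placeholder length for a gap of unknown size mentioned above; the function name and the toy scaffold are hypothetical.

```python
import re


def find_gaps(scaffold):
    """Return (start, length) for every run of N or n in the scaffold."""
    return [(m.start(), len(m.group())) for m in re.finditer(r"[Nn]+", scaffold)]


# Toy scaffold: one 100-N placeholder gap and one shorter run of ambiguous bases.
scaffold = "ACGT" * 5 + "N" * 100 + "GGCC" * 3 + "N" * 7 + "TTAA"

if __name__ == "__main__":
    gaps = find_gaps(scaffold)
    print(gaps)  # [(20, 100), (132, 7)]
    unknown_size = [gap for gap in gaps if gap[1] == 100]
    print(unknown_size)  # [(20, 100)]
```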

Each type of sequence assembly comes with a set of inherent issues, and most genome sequence projects produce an assortment of leftover reads and contigs that do not assemble. In some cases, these reads can be identified as contamination, an unexpected symbiont, or organellar genome sequence. In other cases, the leftover bits are a tell-tale sign of particular types of assembly errors or unexpected genome sequence variation, e.g., CNV (Fig 1) or high levels of heterozygosity between alleles (especially if a population was sequenced, rather than an individual). New ploidy-aware assembly programs are emerging, and they will assist greatly with several of the issues presented here. For this reason, it is important to know when and how a genome sequence was assembled (Table 1). Remember, a genome sequence can always be reassembled from the archived reads as new algorithms and new, or longer, sequences become available. Many communities are actively reassembling important reference sequences.

Was the genome sequence “corrected,” and if so, how?

“Error-prone long-sequence reads can be corrected prior to assembly using proovread [21]. Correction prior to assembly can facilitate assembly when the error rate is high, e.g., in low-coverage PacBio reads. Assembled genome sequences can also be “polished.” Polishing involves base call correction, and ICORN2 [22] is a popular tool. Polishing is performed using highly accurate Illumina reads mapped back against the final genome assembly. Read correction and polishing are useful and recommended steps, but they are highly dependent on the performance of the aligner, and the end user must be aware that the corrected and polished sequences will represent the most abundant alleles present in the reads. In other words, isoforms and rare variants of repetitive sequences will be “corrected,” i.e., overwritten, in the final assembly by more abundant sequence variants.”

“Gene predictions are genome-assembly dependent, which means if a region is missing, it cannot be annotated. Likewise, if the region is poorly assembled or missing in a reference genome sequence used for orthology, it may end up missing in the genome sequence that is being annotated. A good example is Cryptosporidium. The genome sequence for C. parvum was released in 2004, with a state-of-the-art assembly and annotation for the time [27]. This genome sequence was used as the reference sequence for several additional Cryptosporidium strains and species [28, 29]. This practice can be dangerous, as one of the genome features that facilitates speciation is genome rearrangement, which affects chromosome pairing during reproduction. As there are no genetic systems for many pathogens that can be used to generate a physical map, reference mapping is useful, but it is easy to forget the origins of genome sequence assemblies and annotation created or propagated in this way, so care must be exercised when using reference-mapped genome assemblies as the basis for experiments.”

The gene is annotated as single copy, is it?

“Additional copies of genes can thwart experiments designed to target, clone, delete, or modify a particular gene. The annotation may indicate a single-copy gene, but depending on the technology used to generate your genome sequence, nearly identical copies of genes can become assembled as one gene (short-read only assemblies are most prone to this issue), and slightly divergent gene family members, especially if they are in tandem repeats, often don’t assemble and can be found in the leftover reads or small unassembled contigs (Fig 1).”

The annotation doesn’t describe your gene. Is it really missing from the genome?

“It is easy to be misled on the basis of existing annotation that a gene is missing. Genes can be lost, and they do decay or evolve beyond recognition, but they may also be missing because of a sequence assembly gap. Missing genes, especially if they are part of a gene family, can often be located in the unassembled reads or contigs (Fig 1B). Note that unassembled contigs are often not annotated, so they should be searched using BLASTX (protein against translated nucleotides). The best practice for determining gene loss is to look at a synteny map of the genomic contigs and see if the region of the genome that is expected to contain the gene of interest (based on its location in a close species) is present, conserved, and not rearranged (Fig 1C). Alternatively, the region may be missing from the genome assembly, i.e., a gap relative to the comparator sequence. Misassemblies and gaps can provide the illusion of missing genes, when in reality, they are missing from the assembly, have evolved into pseudogenes, or, in some cases, have been replaced by a horizontal gene transfer located elsewhere in the genome.

Genome sequence gaps have many downstream consequences. The number of genes may be reduced relative to the actual number, and ironically, the number of genes can also be inflated because a portion of the same gene can be found on each side of the gap, resulting in two partial predictions. Small assembly gaps often lead to frameshifts in coding sequences, which, in turn, lead to an artificial increase in the number of pseudogenes, when, in reality, the culprit is an assembly gap. Gaps can also indicate the location of a missing tandem array of genes or repeat sequences that could not be properly assembled (Fig 1C).

Can I trust the annotation?

Some organismal genome sequences are continuously curated by the community or experts and have a good, recent genome annotation (Table 1). However, annotators cannot annotate what does not exist (e.g., gaps). Eukaryotic genome sequences, especially from animal, vector, or plant hosts, are complex, and even with continuous curation, there is much more to be fixed and discovered as new sequence technology, assembly algorithms, and experimental evidence appear. For example, untranslated regions and noncoding RNAs aren’t routinely annotated. All genome sequences and their annotation are “works in progress” and are static representatives of one point in time for a continuously evolving molecule within a genetically diverse population.

Does the annotation affect pathway analyses?

Yes. Studies aimed at drug target discovery often look for a gene that appears to be essential to a pathway. Once discovered, the gene is knocked out, and to everyone’s dismay, it was not essential, and the organism survives in the presence of drug. There are many reasons this may have happened, which range from the ability of the drug to reach the target to the possibility that the assessment of essentiality is flawed. Errors in the annotation or the assembly can also lead to this result. For example, the gene may not be single copy, or the knockout construct behaved oddly and targeted a related or additional gene copy of the target, producing unusual or hard to interpret results. Alternatively, the large proportion of genes of unknown function (as high as 40% in some organisms) encode functions that allow the organism to circumvent the knockout. Much work is still needed on this important class of genes.”

“Some genome sequences will require additional approaches beyond long reads, such as Hi-C (chromatin conformation capture) [35], Chicago library methodologies [36], or optical mapping [37]. Truly difficult genome sequences can be hexaploid (like wheat), have enormous numbers of scaffolds (like Ixodes scapularis, which has >350,000), be littered with highly similar repeat elements (like T. vaginalis), or suffer from extreme heterogeneity and length differences between sister chromosomes (as in the hybrid T. cruzi). Some genome sequences have already been “fixed” with these new technologies, but there is still significant work required to make them as good as they can be. New assemblies and annotations are always needed. It is frustrating when all the naming and numbering changes, but these changes result from progress that will facilitate and inform the basis of much-needed further experimentation.”

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6742220/

In Summary:

  • The genome was considered to be the unchangeable blueprint of the human being, which is determined at the beginning of our life
  • However, science must bid farewell to this idea as in reality, our genetic make-up is in a state of constant change
  • Geneticists must abandon their image of a stable genome, in which changes are pathological exceptions
  • The genome of each individual is in a state of constant transformation
  • In 2000, US President Bill Clinton referred to the first draft of the human genome as the “Book of Life” – a sequence of three billion letters
  • However, newer technology has shown that the book dissolves before the eyes of the readers and that the genome is not a stable text
  • The genetic processes form an open system in which by no means everything is predetermined
  • The experts believed they had understood how a gene looks and functions, which functional principles the human or microbial genome follows
  • “In retrospect, our assumptions about how the genome works back then were so naive that it is almost embarrassing,” says Craig Venter, who was involved in the project with his company Celera
  • The assumption had been that the genetic material of any two people differed only by about one per mille of all DNA building blocks yet the differences in the genetic makeup of humans are in reality so great that science now confirms what the vernacular has long known: “Every man is different. Completely different!”
  • The previous conviction (i.e. assumption) that each gene usually exists only twice in the genome (once in the paternal, once in the maternally inherited set of chromosomes) is also incorrect
  • In reality, a great deal of genetic information is subject to a process of duplication and exists in up to 16 copies in the cell nucleus
  • The explosiveness of the findings is exacerbated by the discovery that the CNV patterns in the genome are by no means stable, the copy number of the genes may decrease or increase, and even the somatic cells of an individual human differ from each other
  • The idea that the genome represents a natural constant, a fixed source code of the human being, is now crumbling under the weight of the findings
  • The US geneticist Matthew Hahn already compared the genome with a revolving door: “Genes constantly come, others go.”
  • Throughout the course of the Human Genome Project, there have been varying levels of target accuracies that the research institutes have aimed for
  • Known as the Bermuda Standards, the international standard for accuracy is currently held at one error per 10,000 base pairs for the entire contiguous sequence – the DNA is sequenced in parts, and often times, gaps exist between these different parts
  • Regardless of how accurate this process may seem, sequencing the entire human genome at this standard yields a total of approximately 300,000 base pair errors
  • SNPs are natural DNA sequence variations of a single nucleotide (A, T, C or G) that occur every 100 to 300 base pairs
  • The variations caused by SNPs can dramatically affect how humans react to things such as drugs, vaccines, or diseases
  • Because of the inherent and allowable errors, companies such as 23andMe that sequence DNA will certainly sequence some SNPs inaccurately
  • With over 300,000 base pair errors, it is questionable how accurate genomes can be
  • The capabilities and limitations of the human body are sensitive down to the individual nucleotide (as with SNP), thus human genome sequencing can not be reliable enough to serve its purpose as a source for personalized medical information completely dependent on human DNA
  • It can be difficult to determine the accuracy of sequencing results — and even harder to compare accuracy across sequencing platforms
  • Accuracy differs not only between technologies but also across genomic regions as some stretches of the genome are inherently more difficult to read
  • There are two types of accuracy in genomics:
    1. Read accuracy
      • The inherent error rate of individual measurements (reads) from a DNA sequencing technology
      • Typical read accuracy ranges from ~90% for traditional long reads to >99% for short reads and HiFi reads
    2. Consensus accuracy
      • Determined by combining information from multiple reads in a data set, which eliminates any random errors in individual reads
      • Consensus calculation is a complicated and computationally expensive process, and it cannot overcome systematic errors
      • If a sequencing platform consistently makes the same mistake, then it will not be erased by generating more sequencing coverage
  • To sidestep this problem, it is common to “polish” long reads that have systematic errors with high accuracy short reads
  • However, because of their read length, short reads cannot always map to the long reads unambiguously, limiting their ability to improve accuracy
  • Even perfect reads can contribute to poor accuracy if they are not ordered and oriented correctly in the assembly
  • This question of where to place the read is called mappability
  • Reads containing only a piece of a large structural element, or consisting of highly repetitive sequences, can be very difficult to align, mapping ambiguously to many different locations in a reference
  • This is where short reads really struggle; because of their size, there is a greater chance that they will not contain enough unique sequence data to anchor them properly in a genome
  • For longer reads with a high error rate, it makes it impossible to decide whether a disagreement between a reference and data set is a variant or a sequence error
  • New technologies and algorithmic advances do not guarantee flawless genomic sequences or annotation
  • Bias, errors, and artifacts can enter at any stage of the process from library preparation to annotation
  • There are many factors that can affect the ultimate genome sequence and annotation that are produced, and both should be considered “works in progress”
  • The origin of the genome matters:
    1. Hybrids can have dramatic differences between orthologous chromosomes
    2. Genome sequences derived from a heterogenous population, especially when CNVs exist, complicate genome assembly, and often the sequence produced is a composite (i.e. made up of various parts or elements) of the major alleles present in the sequenced sample
    3. Genome sequences derived from clonal laboratory strains are often easier to assemble, but they may not be truly representative of circulating wildtype strains because they are adapted to culture and, if propagated for a long time, may have lost genes or accumulated mutations
  • Some genome sequences are physically difficult to sequence because of extreme nucleotide bias
  • Long homopolymeric runs of any base are particularly troublesome for some sequencing technologies and may lead to an incorrect number of nucleotides, resulting in frame-shifts if the sequence is coding
  • If the genome sequence contains numerous repetitive sequences, retrotransposons or mobile elements, or large, highly similar gene families, the genome assembly will be affected, especially if only short-read sequences were used
  • Repetitive sequences are a huge challenge for most assembly algorithms
  • Low-coverage, less accurate, long-molecule reads can be used as a framework upon which shorter-read sequences can be mapped
  • The easy way to assess the quality of your organism’s genome assembly is to map the reads from the sequencing project back to the assembled genome sequence and have a look
  • If a check to see how accurate the genome is relies upon a reference genome, how was the reference genome determined accurate without having its own reference to validate accuracy?
  • The reference genome assembly for the apicomplexan parasite Toxoplasma gondii ME49 contains several collapsed regions that vary by strain and despite the high quality of this genome sequence and its correspondence to genetic maps, issues related to the number of chromosomes still exist
  • Generation of single reads from each end of a known length (e.g., 10 kb) library insert sequence would suggest that the reads should end up in the assembled genome facing each other and about 10 kb apart
  • If they are not, it is suggestive of an assembly error
  • Genome sequences that relied on cloning and biological replication have additional issues that need to be considered:
    1. Some sequences simply cannot be cloned; they are toxic to the organism used for cloning and replication and thus, will be missing in the genome sequence produced
    2. A DNA sequence from the cloning vector or organism used to construct the library can end up in the assembled target genome sequence
  • Many protocols contain amplification steps, which can introduce bias (i.e. any factors that cause distortion of genetic predictions)
  • The amplification process is biased and several different WGA reactions (on different cells or populations of like cells) are necessary to fully identify and remove the amplification bias
  • It should be noted that bias is rarely removed from the reads submitted to archives, so it is imperative to know if WGA was utilized
  • Different sequencing platforms have different strengths and weaknesses:
    1. Genome sequences assembled with Sanger chemistry will have good quality sequence, but the assembled genome sequence will be affected by library issues
    2. Genome sequences generated with legacy systems, e.g., 454 and Ion Torrent, will have homopolymer miscount issues
    3. Newer genome sequences will consist of highly accurate Illumina short-read technology, but the assembled sequence, especially if repeats are present, will be incomplete and contain gaps and mis-assemblies unless a hybrid assembly using long-read technologies like PacBio or Oxford Nanopore is utilized
  • Sequence assemblies are of two types: de novo, assembled from scratch, and reference-based:
    1. Reference-based
      • Normally used when an established organismal reference genome already exists and the experimental goal is to determine variation with respect to it
      • It is not a good approach to detect rearrangements or syntenic breaks, but it is ideal to detect SNPs, some indels, and CNV
      • It will not reveal genome features not present in the reference, a significant drawback
      • Due to the large volume of population studies focused on SNPs, most genome sequence data, sadly, remain as unassembled files of reads
    2. De Novo
      • The only option for an organism’s first genome sequence, and when possible, they should be performed as a matter of practice to permit discovery of new features
      • In the case of eukaryotic genome sequences, especially when the karyotype is unknown and physical maps do not exist, reads can only be partially assembled into contiguous reads, “contigs,” or scaffolds of contigs, containing gaps
      • Contigs often contain sequences that are fairly unique because repetitive sequences are often “masked” in a de novo assembly because of the issues they cause
      • As a result, contigs often end at, or are separated by, missing repetitive regions that were not utilized (e.g., masked) or could not be resolved during the assembly
      • Variation found at the ends of contigs should be treated with caution
  • In some cases, scaffolds representative of whole chromosomes are assembled, but these, too, often contain numerous gaps or ambiguous bases
  • Some assemblers also create a scaffold that links together all “leftover” contigs but beware of this scaffold, often named “scaffold 0,” as the order and orientation of these contigs bears no resemblance to their biological location
  • Each type of sequence assembly comes with a set of inherent issues, and most genome sequence projects produce an assortment of leftover reads and contigs that do not assemble
  • In some cases, these reads can be identified as contamination, an unexpected symbiont, or organellar genome sequence
  • In other cases, the leftover bits are a tell-tale sign of particular types of assembly errors or unexpected genome sequence variation, e.g., CNV or high levels of heterozygosity between alleles (especially if a population was sequenced, rather than an individual)
  • A genome sequence can always be reassembled from the archived reads as new algorithms and new, or longer, sequences become available
  • Many communities are actively reassembling important reference sequences
  • In other words, reference genomes are a constantly evolving set of inaccurate data in need of updating as better technology becomes available
  • Error-prone long-sequence reads can be corrected prior to assembly using proovread
  • Correction prior to assembly can facilitate assembly when the error rate is high, e.g., in low-coverage PacBio reads
  • Assembled genome sequences can also be “polished”
  • Read correction and polishing are useful and recommended steps, but they are highly dependent on the performance of the aligner, and the end user must be aware that the corrected and polished sequences will represent the most abundant alleles present in the reads
  • Isoforms and rare variants of repetitive sequences will be “corrected,” i.e., overwritten, in the final assembly by more abundant sequence variants
  • Gene predictions are genome-assembly dependent, which means if a region is missing, it cannot be annotated
  • Likewise, if the region is poorly assembled or missing in a reference genome sequence used for orthology, it may end up missing in the genome sequence that is being annotated
  • As there are no genetic systems for many pathogens that can be used to generate a physical map, reference mapping is useful, but it is easy to forget the origins of genome sequence assemblies and annotation created or propagated in this way, so care must be exercised when using reference-mapped genome assemblies as the basis for experiments
  • In other words, genomes are only as accurate as their references, which having been made with older technology, are not accurate and therefore any genome based on one will carry over and build on those inaccuracies
  • Additional copies of genes can thwart experiments designed to target, clone, delete, or modify a particular gene
  • The annotation may indicate a single-copy gene, but depending on the technology used to generate your genome sequence, nearly identical copies of genes can become assembled as one gene
  • Slightly divergent gene family members, especially if they are in tandem repeats, often don’t assemble and can be found in the leftover reads or small unassembled contigs
  • It is easy to be misled on the basis of existing annotation that a gene is missing
  • The best practice for determining gene loss is to look at a synteny map of the genomic contigs and see if the region of the genome that is expected to contain the gene of interest (based on its location in a close species) is present, conserved, and not rearranged
  • Alternatively, the region may be missing from the genome assembly, i.e., a gap relative to the comparator sequence
  • Misassemblies and gaps can provide the illusion of missing genes, when in reality, they are missing from the assembly, have evolved into pseudogenes, or, in some cases, have been replaced by a horizontal gene transfer located elsewhere in the genome
  • The number of genes may be reduced relative to the actual number, and ironically, the number of genes can also be inflated because a portion of the same gene can be found on each side of the gap, resulting in two partial predictions
  • Gaps can lead to an artificial increase in the number of pseudogenes
  • Gaps can also indicate the location of a missing tandem array of genes or repeat sequences that could not be properly assembled
  • Annotators cannot annotate what does not exist (e.g., gaps)
  • Eukaryotic genome sequences, especially from animal, vector, or plant hosts, are complex, and even with continuous curation, there is much more to be fixed and discovered as new sequence technology, assembly algorithms, and experimental evidence appear
  • All genome sequences and their annotation are “works in progress” and are static representatives of one point in time for a continuously evolving molecule within a genetically diverse population
  • In other words, no one genome is accurate as it is nothing but a snapshot in time
  • Errors in the annotation or the assembly can cause false results in pharmacological studies
  • For example, the gene may not be single copy, or the knockout construct behaved oddly and targeted a related or additional gene copy of the target, producing unusual or hard to interpret results
  • Alternatively, the large proportion of genes of unknown function (as high as 40% in some organisms) encode functions that allow the organism to circumvent the knockout
  • Some genome sequences have already been “fixed” with these new technologies, but there is still significant work required to make them as good as they can be
  • New assemblies and annotations are always needed
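
The 300,000 figure in the summary above is simple arithmetic, and it can be checked in a few lines of Python. This is only a back-of-the-envelope sketch: the three-billion-letter genome length and the one-error-per-10,000-bases ceiling are the round numbers quoted above, the SNP spacing is the 100 to 300 base pair range quoted above, and the function names are my own.

```python
# Back-of-the-envelope arithmetic using the round numbers quoted in the summary.
# These are assumed inputs, not measurements.

GENOME_LENGTH_BP = 3_000_000_000   # "a sequence of three billion letters"
BASES_PER_ERROR = 10_000           # Bermuda standard: one error per 10,000 base pairs


def expected_errors(genome_length_bp: int, bases_per_error: int) -> int:
    """Number of erroneous bases allowed at one error per `bases_per_error` bases."""
    return genome_length_bp // bases_per_error


def snp_sites(genome_length_bp: int, spacing_bp: int) -> int:
    """Approximate number of SNP sites if one occurs every `spacing_bp` bases."""
    return genome_length_bp // spacing_bp


if __name__ == "__main__":
    print("allowed errors:", expected_errors(GENOME_LENGTH_BP, BASES_PER_ERROR))
    print("SNP sites (one per 300 bp):", snp_sites(GENOME_LENGTH_BP, 300))
    print("SNP sites (one per 100 bp):", snp_sites(GENOME_LENGTH_BP, 100))
```

With these inputs the allowed error count comes out at 300,000, against roughly 10 to 30 million SNP sites across the same three billion bases.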

Genomes are a constantly evolving snapshot of computer-generated data that is heavily influenced by the type of the technology used and limitations inherent at the time. There are numerous steps involved in the sequencing process where biases, artefacts, and errors can crop up. It is claimed that even with newer technologies, errors are still present and accuracy can be difficult to determine. This being the case, how can any genome ever truly be considered accurate and reliable? If newer technology must be used to update the accuracy of genomes regularly, how can older genomes be used as a reference? Reference genomes built from older technology are used as a framework to build the newer genomes from. The inaccuracies and errors from the reference carry over to the new genome created from it which will carry over to any future genomes. It is clear that these non-static constructs are only as “accurate” as the technology utilized at the given time. As newer technology comes along, old theories and assumptions are broken and replaced by newer ones attempting to explain the shifting landscape.
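
The difference between random and systematic errors can be illustrated with a toy simulation. Everything in this sketch is an assumption made for illustration: the 200-base made-up sequence, the 10% random error rate (roughly the figure cited above for traditional long reads), the two positions chosen as systematic miscalls, and the simple majority vote standing in for consensus calling. It is not how any real sequencing pipeline is implemented.

```python
import random
from collections import Counter

BASES = "ACGT"


def wrong_base(base: str) -> str:
    """Return a fixed incorrect base for `base` (the next letter in BASES)."""
    return BASES[(BASES.index(base) + 1) % len(BASES)]


def simulate_read(truth: str, random_error_rate: float,
                  systematic_sites: set[int], rng: random.Random) -> str:
    """Copy `truth`, adding random errors plus the same miscall at systematic sites."""
    read = []
    for i, base in enumerate(truth):
        if i in systematic_sites:
            read.append(wrong_base(base))          # identical mistake in every read
        elif rng.random() < random_error_rate:
            read.append(rng.choice([b for b in BASES if b != base]))
        else:
            read.append(base)
    return "".join(read)


def consensus(reads: list[str]) -> str:
    """Majority vote at each position across equal-length reads."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


def mismatches(a: str, b: str) -> list[int]:
    """Positions at which two equal-length sequences differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def demo(seed: int = 0, length: int = 200, n_reads: int = 30,
         random_error_rate: float = 0.10,
         systematic_sites: tuple[int, ...] = (50, 120)):
    """Return (errors per read, positions where the consensus is still wrong)."""
    rng = random.Random(seed)
    truth = "".join(rng.choice(BASES) for _ in range(length))
    sites = set(systematic_sites)
    reads = [simulate_read(truth, random_error_rate, sites, rng) for _ in range(n_reads)]
    per_read = [len(mismatches(truth, read)) for read in reads]
    return per_read, mismatches(truth, consensus(reads))


if __name__ == "__main__":
    per_read, consensus_errors = demo()
    print("errors per read (min, max):", min(per_read), max(per_read))
    print("consensus errors at positions:", consensus_errors)
```

With these assumed settings, each individual read is wrong at roughly a tenth of its positions, while the majority vote is wrong only at the two positions where every read carries the same miscall. Raising `n_reads` does not remove those two.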

Beyond the problem of technology, in order for a genome to be “accurate,” the thing that is sequenced must be shown to exist in reality first. For human genomes, a person must exist to get the genetic material from in order to get a valid sequence. “Viruses” do not physically exist in a purified/isolated state. “Viral” genomes are taken either from unpurified samples such as bronchoalveolar fluid or from cell culture supernatant created in a lab. These are mixtures of many different known and unknown microbes/organisms as well as various sources of unrelated DNA/RNA. As shown in the Human Genome Project, the genome is not an accurate representation even with known source material from physical entities that exist in reality. It has been through at least 38 revisions since it was released in 2003. As described by Ulrich Bahnsen, the genome is not a static book but is an ever-changing narrative. If it is impossible to get an accurate representation of the human genome after the last 3 decades using known sources of genetic material, what does that say about the genomes of “viruses?”

At the time of this writing, there are 7.3 million “SARS-COV-2” variants running around. No one sequence is identical. Thus, we have two options to believe:

  1. Genomes are a constantly shifting work in progress that seemingly can not be captured in a static state.
  2. Genomes are inaccurate error-prone computer-generated creations.

After reading the laundry list of problems associated with the creation of genomes and the breakdown in the assumptions of a static genome, how reliable and accurate do you believe these “WORKS IN PROGRESS” truly are?

13 comments

  1. This is fascinating! A genome is a chimera! I love that when scientists try to hammer down the boundaries and details in a scientific field it usually serves to open up our awareness to ever more vast frontiers of the unknown.

    1. Brilliant useful reference article. This amplifies what seems obvious at the macro cosmic level, that such levels of complexity only produce more questions than they solve. In his book Biological Relativity, Denis Noble criticises the reductionist view as absurd. Complex ideas are the result of hypothetical imaginings relevant to the scale at which they exist and are perceived. Science has been seduced by technology to the extent that it compares the genetic code with the maths of computer code. But as Noble says, in reality there is no separate computer and programme. The computer is the programme on the biological level.

      1. Yes, the pseudoscience of virology, immunology, and genomics offer complex and confusing (and often contradictory) hypotheses and theories as an answer when it has become nothing but an overly complicated and confusing mess. I believe this is by design to keep people from questioning as it takes so much effort to peel back all of the layers. People have been tricked to believe in these complex technologies, tests, and studies as proof for the ridiculous claims of these pseudosciences. However, the simplest explanation is most often the best one. 😉

      2. This is indeed a brilliant and useful article! Thank you for the book title: Denis Noble’s book, Dance to the Tune of Life: Biological relativity, is now on its way to me. It seems scientists get lost in the thickets of overly complex theories and wander for years in mountains of data. It reminds me of the increasingly complex models made of the universe made when people kept doggedly insisting that the earth was the center of the universe.

  2. Excellent work again – much appreciated. The more you dig, the more it shows a greater margin for error than for realistic evidence. Personally, I can’t get past the first hurdle – that immunology/virology is a human model, based on human ideas that are supposed to reflect our inner cellular/physiological processes, but that model is limited to human observation and interpretation of what happens outside of the natural environment where this cellular activity and accompanying processes are relevant. And it’s only physical observation and interpretation that determines their conclusions. There’s no accounting for any meta-physical symbiosis with the physical in their models. So how can the models possibly be accurate from the outset? They fail on first principles.

  3. Maybe put the summary at the top (or mention there is a summary at the end)? It is tempting to give up because it gets quite technical quickly.

    1. Thanks for the feedback! I typically always put a summary at the end. Sometimes I’m good at mentioning it while I do forget to do so sometimes. I will add that up front so people know.
