The Reproducibility Crisis in Genomics


It has been known since at least 2005 that much of the scientific literature being published is fundamentally flawed, non-reproducible, and/or outright fraudulent. Relating to the (pseudo)science of virology, this crisis extends to the cell culturing techniques used to “isolate” the “virus” as well as the theoretical antibodies used as an indirect method to identify them. Another area closely tied to virology is genomics, hence all the talk about genomes and variants lately with “SARS-COV-2.” Just as with every other area surrounding virology, genomics is also embroiled in a reproducibility crisis itself. I’ve provided highlights from a few studies/articles which help to paint the scope of this problem.

This first highlight is from a report on the Bio-IT World Asia meeting, Marina Bay Sands, Singapore, 6-8 June 2012. It is a brief passage mentioning that access to supporting data and tools, and the accessibility of computational resources, has not kept pace with the growing reliance on computational, data-driven approaches in genomics. This has been singled out as a factor in the reproducibility crisis. I provided this source to show that the problem of reproducibility in genomics specifically has been known as far back as 2012, although it could be argued it has been known since 2005, when John Ioannidis first raised the issue in his essay “Why Most Published Research Findings Are False” (https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124):

Eastern genomics promises

“Of the many critical issues coming out of the data-rich universe that we now find ourselves in, James Taylor (Emory University) focused his talk on what he feels is the main crisis in genomics research: reproducibility. With the life-sciences increasingly reliant on computational and data-driven approaches, access to the supporting data and tools and accessibility in using computational resources has not kept pace.”

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3491378/

Say it isn’t so!

In June 2015, an article by Roger D. Peng, an associate professor of biostatistics at the Johns Hopkins Bloomberg School of Public Health, came out breaking down the reproducibility crisis in science. He gives a great overview of the reproducibility crisis in all of science yet highlights a few problem areas in genomics specifically. The first is a shortage of software to reproducibly perform and communicate data analysis. The second is that the data needed to successfully reproduce genomic results is not always available. Peng argues that, especially in areas where computation is a necessity (e.g. genomics), reproducibility is essentially the only thing an investigator can guarantee about a study:

The reproducibility crisis in science

Making research reproducible

“There are two major components to a reproducible study: that the raw data from the experiment are available; and that the statistical code and documentation to reproduce the analysis are also available. These requirements point to some of the problems at the heart of the reproducibility crisis.

First, there has been a shortage of software to reproducibly perform and communicate data analyses. Recently, there have been significant efforts to address this problem and tools such as knitr, iPython notebooks, LONI, and Galaxy have made serious progress.

Second, data from publications have not always been available for inspection and reanalysis. Substantial efforts are under way to encourage the disclosure of data in publications and to build infrastructure to support such disclosure. Recent cultural shifts in genomics and other areas have led to journals requiring data availability as a condition for publication and to centralised databases such as the US National Center for Biotechnology Information’s Gene Expression Omnibus (GEO) being created for depositing data generated by publicly funded scientific experiments.

One might question whether reproducibility is a useful standard. Indeed, one can program gibberish and have it be perfectly reproducible. However, in investigations where computation plays a large part in deriving the findings, reproducibility is important because it is essentially the only thing an investigator can guarantee about a study. Replicability cannot be guaranteed – that question will ultimately be settled by other independent investigators who conduct their own studies and arrive at similar findings. Furthermore, many computational investigations are difficult to describe in traditional journal papers, and the only way to uncover what an investigator did is to look at the computer code and apply it to the data. In a time where data sets and computational analyses are growing in complexity, the need for reproducibility is similarly growing.”

Public failings

“Yet there is increasing concern in the scientific community about the rate at which published studies are either reproducible or replicable.

This concern gained significant traction with a statistical argument that suggested most published scientific results may be false positives (bit.ly/1PWAhBx). Concurrently, there have been some very public failings of reproducibility across a range of disciplines, from cancer genomics (bit.ly/1PWAC7a), to clinical medicine (bit.ly/1KNc4u6) and economics (bit.ly/1PWBngz) and the data for many publications have not been made publicly available, raising doubts about the quality of data analyses. Compounding these problems is the lack of widely available and user-friendly tools for conducting reproducible research.

Perhaps the most infamous recent example of a lack of replicability comes from Duke University, where in 2006 a group of researchers led by Anil Potti published a paper claiming that they had built an algorithm using genomic microarray data that predicted which cancer patients would respond to chemotherapy [1]. This paper drew immediate attention, with many independent investigators attempting to reproduce its results. Because the data were publicly available, two statisticians at MD Anderson Cancer Center, Keith Baggerly and Kevin Coombes, obtained the data and attempted to apply Potti et al.’s algorithms [2]. What they found instead was a morass of poorly conducted data analyses, with errors ranging from trivial and strange to devastating. Ultimately, Baggerly and Coombes were able to reproduce the (erroneous) analysis conducted by Potti et al., but by then the damage was done. It was not until 2011 that the original study was retracted from Nature Medicine.”

https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/j.1740-9713.2015.00827.x

The typical process for producing unreproducible/irreplicable results.

In 2017, a study was published which investigated reproducibility and provenance tracking in genomics. It made the case that the ability to determine DNA sequences has outrun the ability to store, transmit and interpret this data. It stresses that a key challenge is how to improve the reproducibility of genomic experiments involving complex software environments and large datasets. Many examples are provided showcasing the difficulties different researchers have had with reproducibility because data, code, parameter settings and tool versions were not adequately reported:

Investigating reproducibility and tracking provenance – A genomic workflow case study

“Computational bioinformatics workflows are extensively used to analyse genomics data, with different approaches available to support implementation and execution of these workflows. Reproducibility is one of the core principles for any scientific workflow and remains a challenge, which is not fully addressed. This is due to incomplete understanding of reproducibility requirements and assumptions of workflow definition approaches.”

“Recent rapid evolution in the field of genomics, driven by advances in massively parallel DNA sequencing technologies, and the uptake of genomics as a mechanism for clinical genetic testing, have resulted in high expectations from clinicians and the biomedical community at large regarding the reliable, reproducible, effective and timely use of genomic data to realise the vision of personalized medicine and improved understanding of various diseases. There has been a contemporaneous recent upsurge in the number of techniques and platforms developed to support genomic data analysis [1]. Computational bioinformatics workflows are used extensively within these platforms (Fig. 1). Typically, a bioinformatics analysis of genomics data involves processing files through a series of steps and transformations, called a workflow or a pipeline. Usually, these steps are performed by deploying third party GUI or command line based software capable of implementing robust pipelines.”

“The reproducibility of scientific research is becoming increasingly important for the scientific community, as validation of scientific claims is a first step for any translational effort. The standards of computational reproducibility are especially relevant in clinical settings following the establishment of Next Generation Sequencing (NGS) approaches. It has become crucial to optimize the NGS data processing and analysis to keep at pace with the exponentially increasing genomics data production. The ability to determine DNA sequences has outrun the ability to store, transmit and interpret this data. Hence, the major bottleneck to support the complex experiments involving NGS data is data processing instead of data generation. Computational bioinformatics workflows consisting of various community generated tools [6] and libraries [7, 8] are often deployed to deal with the data processing bottleneck.

Despite the large number of published literature on the use and importance of -omics data, only a few have been actually translated into clinical settings [9]. The committee on the review of -omics-based tests for predicting patient outcomes in clinical trials [10] attributed two primary causes; inadequate design of the preclinical studies and weak bioinformatics rigour, for this limited translation. The scientific community has paid special attention with respect to benchmarking -omics analysis to establish transparency and reproducibility of bioinformatics studies [11]. Nekrutenko and Taylor [12] discussed important issues of accessibility, interpretation and reproducibility for analysis of NGS data. Only ten out of 299 articles that cited the 1000 Genomes project as their experimental approach used the recommended tools and only four studies used the full workflow. Out of 50 randomly selected papers that cited BWA [13] for alignment step, only seven studies provided complete information about parameter setting and version of the tool. The unavailability of primary data from two cancer studies [14] was a barrier to achieve biological reproducibility of claimed results.

Ioannidis et al. [15] attributed unavailability of data, software and annotation details as reasons for non-reproducibility of microarray gene expression studies. Hothorn et al. [16] found that only 11% of the articles conducting simulation experiments provided access to both data and code. The authors reviewing 100 Bioinformatics journal papers [17] claimed that along with the textual descriptions, availability of valid data and code for analysis is crucial for reproducibility of results. Moreover, the majority of papers that explained the software environment, failed to mention version details, which made it difficult to reproduce these studies.”

“Ludäscher et al. [20] reviewed common requirements of any scientific workflow, most of which (such as data provenance, reliability and fault-tolerance, smart reruns and smart semantic links) are directly linked to provenance capture. In addition to workflow evolution [21], prospective (defined as the specification of the work-flow used in an analysis) as well as retrospective (defined as the run time environment of an execution of the work-flow in an analysis) provenance [22] was identified as an essential requirement for every computational process in a workflow to achieve reproducibility of a published analysis and ultimately accountability in case of inconsistent results. Several provenance models have been proposed and implemented to support retrospective and prospective provenance [23–25] but these are seldom used by WMS used in genomic studies. Despite high expectations, various existing WMS [26–30] do not truly preserve all necessary provenance information to support reproducibility – particularly to the standards that might be expected for clinical genomics.

The inability to reproduce and use exactly the same procedures/workflows means that considerable effort and time is required on reproducing results produced by others [12, 16, 17, 31]. At present the consolidation of expertise and best practice workflows that support reproducibility are not mature. Most of the time, this is due to the lack of understanding of reproducibility requirements and incomplete provenance capture that can make it difficult for other researchers to reuse existing work. The sustainability of clinical genomics research requires that reproducibility of results goes hand-in-hand with data production. We, as the scientific community, need to address this gap by proposing and implementing practices that can ensure reproducibility, confirmation and ultimately extension of existing work.”
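To make the distinction between prospective and retrospective provenance in the passage above more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not the model used by the study's authors or by any workflow management system: the class names `StepSpec` and `StepRun`, the tool name `example-aligner`, its version string and its parameters are all hypothetical placeholders.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json
import platform
import sys


@dataclass
class StepSpec:
    """Prospective provenance: the specification of a workflow step."""
    name: str
    tool: str
    tool_version: str
    parameters: dict


@dataclass
class StepRun:
    """Retrospective provenance: the run-time environment of one execution."""
    step: StepSpec
    started_at: str
    python_version: str
    operating_system: str
    exit_code: int


def record_run(step: StepSpec, exit_code: int) -> StepRun:
    """Attach run-time details to the specification that was executed."""
    return StepRun(
        step=step,
        started_at=datetime.now(timezone.utc).isoformat(),
        python_version=sys.version.split()[0],
        operating_system=platform.platform(),
        exit_code=exit_code,
    )


# Hypothetical alignment step; tool name, version and parameters are placeholders.
align = StepSpec(
    name="align",
    tool="example-aligner",
    tool_version="0.0.0-placeholder",
    parameters={"threads": 4},
)
run = record_run(align, exit_code=0)

# Serialise both kinds of provenance together so they can be archived with the results.
print(json.dumps(asdict(run), indent=2))
```

Even a record this small would cover the tool version and parameter settings that, according to the quoted study, most papers fail to report.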

Results and discussion

“The expectation for science to be reproducible is considered fundamental but often not tested. Every new discovery in science is built on already known knowledge, that is, published literature acts as a building block for new findings or discoveries. Using this published literature as a base, the next level of understanding is developed and hence the cycle continues. Therefore, if we cannot reproduce already existing knowledge from the literature, we are wasting a lot of effort, resources and time in doing potentially wrong science [53] resulting in “reproducibility crisis” [54]. If a researcher claims a novel finding, someone else, interested in the study, should be able to reproduce it. Reports are accumulating that most of the scientific claims are not reproducible, hence questioning the reliability of science and rendering literature questionable [55, 56]. The true reproducibility of experiments in different systems has not been investigated rigorously in systematic fashion. For computational work like the one described in this paper, reproducibility not only requires an in depth understanding of science but also data, methods, tools and computational infrastructure, making it a non-trivial task. The challenges imposed by large-scale genomics data demand complex computational workflow environments. A key challenge is how can we improve reproducibility of experiments involving complex software environments and large datasets. Although this question is pertinent to scientific community as a whole [57], here we have focused on genomic workflows.

Reproducibility of an experiment often requires replication of the precise software environment including the operating system, the base software dependencies and configuration settings under which the original analysis was conducted. In addition, detailed provenance information of required software versions and parameter settings used for the workflow aids in the reusability of any workflow. Provenance tracking and reproducibility go hand in hand as provenance traces contribute to make any research process auditable and results verifiable [58]. The variant calling workflows (as our case study) result in genetic variation data that serves to enhance understanding of diseases when translated into a clinical setting resulting in improved healthcare. Keeping in view the critical application of the data generated, it is safe to state that entire process leading to such biological comprehensions must be documented systematically to guarantee reproducibility of the research. However a generalised set of rules and recommendations to achieve this is still a challenge to be met as workflow implementation, storage, sharing and reuse significantly varies depending on the choice of approach and platform used by the researcher.

Conclusion

Reproducibility of computational genomic studies has been considered as a major issue in recent times. In this context, we have characterised workflows on the basis of approach used for their definition and implementation. To evaluate reproducibility and provenance requirements, we implemented a complex variant discovery workflow using three exemplar workflow definition approaches. We identified numerous implicit assumptions interpreted through the practical execution of the work-flow, leading to recommendations for reproducibility and provenance, as shown in Table 1.”

doi: 10.1186/s12859-017-1747-0.
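The study's point about replicating "the precise software environment" can be illustrated with a short sketch. Assuming the analysis is run from Python, something like the following would record the operating system, the interpreter version and the installed package versions next to the results. The function name `capture_environment` is my own, and this is only one possible way of taking such a snapshot, not the approach used in the paper.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment() -> dict:
    """Snapshot the OS, interpreter and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with unreadable metadata
            packages[name] = dist.version
    return {
        "operating_system": platform.platform(),
        "machine": platform.machine(),
        "python_version": sys.version.split()[0],
        "packages": packages,
    }


env = capture_environment()

# Sorted keys keep the snapshot stable, so two snapshots can be compared line by line.
snapshot = json.dumps(env, indent=2, sort_keys=True)
print(snapshot[:300])
```

A snapshot like this does not recreate the environment by itself, but without it a later reader has no way of knowing which versions produced the published result.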

So many assumptions, so little time…

A May 2019 article by the Genetic Literacy Project briefly summarizes a newly released report by the National Academies of Sciences, Engineering, and Medicine, noting that the sciences are still full of fraudulent, poorly designed studies with small sample sizes and embellished findings:

Reproducibility crisis: Is scientific research ‘fundamentally flawed’?

“A new report released [May 2019] by the National Academies of Sciences, Engineering, and Medicine is weighing in on a contentious debate within the science world: the idea that scientific research is fundamentally flawed, rife with published findings that often can’t be reproduced or replicated by other scientists, otherwise known as the replication and reproducibility crisis.

Common issues highlighted by these scientists have included fraudulent, poorly done, or overhyped studies, with embellished findings based on small sample sizes.”

https://geneticliteracyproject.org/2019/05/16/reproducibility-crisis-is-scientific-research-fundamentally-flawed/

The report itself is very long and seems to attempt to somewhat justify the lack of reproducibility even while outlining the various reasons for the lack of consistent results. In any case, I’ve included some passages that are relevant to genomics as well as the link for anyone interested in reading the full seven-chapter report:

Reproducibility and Replicability in Science

“Our definition of reproducibility focuses on computation because of its large and increasing role in scientific research. Science is now conducted using computers and shared databases in ways that were unthinkable even at the turn of the 21st century. Fields of science focused solely on computation have emerged or expanded. However, the training of scientists in best computational research practices has not kept pace, which likely contributes to a surprisingly low rate of computational reproducibility across studies. Reproducibility is strongly associated with transparency; a study’s data and code have to be available in order for others to reproduce and confirm results. Proprietary and nonpublic data and code add challenges to meeting transparency goals. In addition, many decisions related to data selection or parameter setting for code are made throughout a study and can affect the results. Although newly developed tools can be used to capture these decisions and include them as part of the digital record, these tools are not used by the majority of scientists. Archives to store digital artifacts linked to published results are inconsistently maintained across journals, academic and federal institutions, and disciplines, making it difficult for scientists to identify archives that can curate, store, and make available their digital artifacts for other researchers.”

The Extent of Non-Reproducibility in Research

“Reproducibility studies can be grouped into one of two kinds: (1) direct, which regenerate computationally consistent results; and (2) indirect, which assess the transparency of available information to allow reproducibility.

Direct assessments of reproducibility, replaying the computations to obtain consistent results, are rare in comparison to indirect assessments of transparency, that is, checking the availability of data and code. Direct assessments of computational reproducibility are more limited in breadth and often take much more time and resources than indirect assessments of transparency.

The standards for success of direct and indirect computational reproducibility assessments are neither universal nor clear-cut. Additionally, the evidence base of computational non-reproducibility across science is incomplete. Thus, determining the extent of issues related to computational reproducibility across fields or within fields of science would be a massive undertaking with a low probability of success. Notably, however, a number of systematic efforts to reproduce computational results across a variety of fields have failed in more than one-half of the attempts made, mainly due to insufficient detail on digital artifacts, such as data, code, and computational workflow.”

https://www.nap.edu/read/25303/chapter/1
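The report's two kinds of assessment can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names are hypothetical, the file names are made up, and treating "computationally consistent" as a byte-for-byte match of a SHA-256 checksum is a stricter criterion than the report itself requires.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum a file in chunks so large outputs do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def indirect_check(required: list) -> dict:
    """Indirect assessment: are the data and code files available at all?"""
    return {Path(p).name: Path(p).is_file() for p in required}


def direct_check(rerun_output: Path, published_sha256: str) -> bool:
    """Direct assessment: does a re-run match the published output exactly?"""
    return sha256_of(rerun_output) == published_sha256


with tempfile.TemporaryDirectory() as tmp:
    output = Path(tmp) / "variants.txt"          # made-up result file
    missing_code = Path(tmp) / "analysis_code.py"  # never created on purpose

    output.write_text("chr1\t100\tA\tG\n")
    published = sha256_of(output)

    availability = indirect_check([output, missing_code])
    same = direct_check(output, published)

    # Simulate a re-run that produces a different result.
    output.write_text("chr1\t100\tA\tT\n")
    changed = direct_check(output, published)

print(availability, same, changed)
```

The indirect check only confirms that the files exist, which is why the report treats it as an assessment of transparency rather than proof that the results can actually be regenerated.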

$$$ums it up perfectly.

Finally, a study came out in September 2021 offering a “solution” to the still unmet need of improving reproducibility and accuracy in genomics. It admits that researchers are rarely able to reproduce the genomic studies of others. The massive amount of data that is generated and in need of storing has made reproducibility a seemingly impossible task. The study also provides a few factors contributing to this growing problem:

NPARS—A Novel Approach to Address Accuracy and Reproducibility in Genomic Data Science

“Background: Accuracy and reproducibility are vital in science and presents a significant challenge in the emerging discipline of data science, especially when the data are scientifically complex and massive in size. Further complicating matters, in the field of genomic-based science high-throughput sequencing technologies generate considerable amounts of data that needs to be stored, manipulated, and analyzed using a plethora of software tools. Researchers are rarely able to reproduce published genomic studies.

Conclusion: Accuracy and reproducibility in science is of a paramount importance. For the biomedical sciences, advances in high throughput technologies, molecular biology and quantitative methods are providing unprecedented insights into disease mechanisms. With these insights come the associated challenge of scientific data that is complex and massive in size. This makes collaboration, verification, validation, and reproducibility of findings difficult.”

“The term “Data Science” is becoming increasingly associated with data sets massive in size, but there are additional challenges in this rapidly evolving field. Some factors considered to contribute to the challenges include: 1) data complexity, which refers to complicated data circumstances and characteristics, including the quality of data, largeness of scale, high dimensionality, and extreme imbalance; 2) the development of effective algorithms and, common task infrastructures and learning paradigms needed to handle various aspects of data; 3) the appropriate design of experiments; 4) proper translation mechanisms in order to present and visualize analytical results; 5) domain complexities, which refers to expert knowledge, hypotheses, meta-knowledge, etc., in the particular subject matter field (Cao, 2017b).”

“In the field of genomic data science, accuracy and reproducibility remains a considerable challenge due to the sheer size, complexity, and dynamic nature plus relative inventiveness of the quantitative biology approaches. The accuracy and reproducibility challenge does not just block the path to new scientific discoveries, more importantly, it may lead to a scenario where critical findings used for medical decision making are found to be incorrect (Huang and Gottardo, 2013). NPARS has been developed to meet the unmet need of improving accuracy and reproducibility in genomic data science. Currently, a limitation of our system is the requirement of the user to put their data into a standardized format for import into NPARS. These steps are not automated.”

https://www.frontiersin.org/articles/10.3389/fdata.2021.725095/full

In Summary:

  • At the Bio-IT World Asia meeting in 2012 (reported in “Eastern genomics promises”), James Taylor (Emory University) focused his talk on what he feels is the main crisis in genomics research, reproducibility: access to the supporting data and tools and accessibility in using computational resources has not kept pace
  • A 2015 article by Roger D. Peng on the reproducibility crisis states that there are two major components to a reproducible study:
    1. That the raw data from the experiment are available
    2. That the statistical code and documentation to reproduce the analysis are also available
  • There has been a shortage of software to reproducibly perform and communicate data analyses
  • Data from publications have not always been available for inspection and reanalysis
  • Recent cultural shifts in genomics and other areas have led to journals requiring data availability as a condition for publication and to centralised databases such as the US National Center for Biotechnology Information’s Gene Expression Omnibus (GEO) being created for depositing data generated by publicly funded scientific experiments
  • In investigations where computation plays a large part in deriving the findings, reproducibility is important because it is essentially the only thing an investigator can guarantee about a study
  • Many computational investigations are difficult to describe in traditional journal papers, and the only way to uncover what an investigator did is to look at the computer code and apply it to the data
  • A statistical argument suggested most published scientific results may be false positives (bit.ly/1PWAhBx)
  • There have been some very public failings of reproducibility across a range of disciplines, from cancer genomics to clinical medicine and economics, and the data for many publications have not been made publicly available, raising doubts about the quality of data analyses
  • Compounding these problems is the lack of widely available and user-friendly tools for conducting reproducible research
  • In 2006 a group of researchers led by Anil Potti published a paper claiming that they had built an algorithm using genomic microarray data that predicted which cancer patients would respond to chemotherapy
  • Baggerly and Coombes attempted to reproduce the results and found a morass of poorly conducted data analyses, with errors ranging from trivial and strange to devastating
  • While they ultimately reproduced the erroneous results, it was not until 2011 that the original study was retracted from Nature Medicine
  • According to a 2017 BMC Bioinformatics article, reproducibility is one of the core principles for any scientific workflow and remains a challenge, which is not fully addressed
  • This is due to incomplete understanding of reproducibility requirements and assumptions of workflow definition approaches
  • Computational bioinformatics workflows are used extensively within genomic platforms
  • The standards of computational reproducibility are especially relevant in clinical settings following the establishment of Next Generation Sequencing (NGS) approaches
  • The ability to determine DNA sequences has outrun the ability to store, transmit and interpret this data
  • The major bottleneck to support the complex experiments involving NGS data is data processing instead of data generation
  • The committee on the review of -omics-based tests for predicting patient outcomes in clinical trials attributed the limited translation of -omics studies into clinical settings to two primary causes: inadequate design of the preclinical studies and weak bioinformatics rigour
  • Only ten out of 299 articles that cited the 1000 Genomes project as their experimental approach used the recommended tools and only four studies used the full workflow
  • Out of 50 randomly selected papers that cited BWA for alignment step, only seven studies provided complete information about parameter setting and version of the tool
  • The unavailability of primary data from two cancer studies was a barrier to achieve biological reproducibility of claimed results
  • Ioannidis et al. attributed unavailability of data, software and annotation details as reasons for non-reproducibility of microarray gene expression studies.
  • Hothorn et al. found that only 11% of the articles conducting simulation experiments provided access to both data and code
  • The authors reviewing 100 Bioinformatics journal papers claimed that along with the textual descriptions, availability of valid data and code for analysis is crucial for reproducibility of results
  • The majority of papers that explained the software environment, failed to mention version details, which made it difficult to reproduce these studies
  • Provenance was identified as an essential requirement for every computational process in a workflow to achieve reproducibility of a published analysis and ultimately accountability in case of inconsistent results
  • Provenance models exist but these are seldom used by WMS used in genomic studies
  • Despite high expectations, various existing WMS do not truly preserve all necessary provenance information to support reproducibility – particularly to the standards that might be expected for clinical genomics
  • At present the consolidation of expertise and best practice workflows that support reproducibility are not mature
  • Most of the time, this is due to the lack of understanding of reproducibility requirements and incomplete provenance capture that can make it difficult for other researchers to reuse existing work
  • The scientific community must address the gap between reproducibility and data generation by proposing and implementing practices that can ensure reproducibility, confirmation and ultimately extension of existing work (in other words, these practices do not exist and are not being used)
  • The true reproducibility of experiments in different systems has not been investigated rigorously in systematic fashion
  • The challenges imposed by large-scale genomics data demand complex computational workflow environments
  • A key challenge is how to improve reproducibility of experiments involving complex software environments and large datasets
  • Reproducibility of an experiment often requires replication of the precise software environment including the operating system, the base software dependencies and configuration settings under which the original analysis was conducted
  • The entire process leading to such biological comprehensions must be documented systematically to guarantee reproducibility of the research
  • However a generalised set of rules and recommendations to achieve this is still a challenge to be met (as in, not yet solved or implemented) as workflow implementation, storage, sharing and reuse significantly varies depending on the choice of approach and platform used by the researcher
  • Reproducibility of computational genomic studies has been considered as a major issue in recent times
  • Numerous implicit assumptions interpreted through the practical execution of the work-flow were identified which led to recommendations for reproducibility and provenance
  • According to a May 2019 article by the Genetic Literacy Project, the idea that scientific research is fundamentally flawed, rife with published findings that often can’t be reproduced or replicated by other scientists, is known as the replication and reproducibility crisis
  • Common issues highlighted by these scientists have included fraudulent, poorly done, or overhyped studies, with embellished findings based on small sample sizes
  • A few relevant points from the report itself:
    1. Their definition of reproducibility focuses on computation because of its large and increasing role in scientific research
    2. The training of scientists in best computational research practices has not kept pace, which likely contributes to a surprisingly low rate of computational reproducibility across studies
    3. Reproducibility is strongly associated with transparency: a study’s data and code have to be available in order for others to reproduce and confirm results
    4. Proprietary and nonpublic data and code add challenges to meeting transparency goals
    5. Many decisions related to data selection or parameter setting for code are made throughout a study and can affect the results
    6. Tools developed to capture these digital decisions are not used by the majority of scientists
    7. Archives to store digital artifacts linked to published results are inconsistently maintained across journals, academic and federal institutions, and disciplines
    8. They group reproducibility into two categories:
      • Direct, which regenerate computationally consistent results
      • Indirect, which assess the transparency of available information to allow reproducibility
    9. Direct assessments of reproducibility, replaying the computations to obtain consistent results, are rare in comparison to indirect assessments
    10. The standards for success of direct and indirect computational reproducibility assessments are neither universal nor clear-cut
    11. The evidence base of computational non-reproducibility across science is incomplete
    12. A number of systematic efforts to reproduce computational results across a variety of fields have failed in more than one-half of the attempts made, mainly due to insufficient detail on digital artifacts, such as data, code, and computational workflow
  • Finally, a September 2021 Frontiers study states that in the field of genomic-based science, high-throughput sequencing technologies generate considerable amounts of data that need to be stored, manipulated, and analyzed using a plethora of software tools
  • Researchers are rarely able to reproduce published genomic studies
  • The challenge of scientific data that is complex and massive in size makes collaboration, verification, validation, and reproducibility of findings difficult
  • Some factors considered to contribute to the challenges include:
    1. Data complexity, which refers to complicated data circumstances and characteristics, including the quality of data, largeness of scale, high dimensionality, and extreme imbalance
    2. The development of effective algorithms, common task infrastructures, and learning paradigms needed to handle various aspects of data
    3. The appropriate design of experiments
    4. Proper translation mechanisms in order to present and visualize analytical results
    5. Domain complexities, which refers to expert knowledge, hypotheses, meta-knowledge, etc., in the particular subject matter field
  • In the field of genomic data science, accuracy and reproducibility remain a considerable challenge due to the sheer size, complexity, and dynamic nature plus relative inventiveness of the quantitative biology approaches
  • This may lead to a scenario where critical findings used for medical decision making are found to be incorrect
  • The researchers developed NPARS to meet the unmet need of improving accuracy and reproducibility in genomic data science
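
Several of the points above (replicating the precise software environment, capturing parameter decisions, and "direct" assessments that replay a computation) can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not a method taken from any of the sources cited here: the function names (`capture_environment`, `build_record`, `directly_reproduced`) and the `min_quality` parameter are hypothetical, and comparing SHA-256 digests is just one simple way to decide whether a rerun gave "computationally consistent results."

```python
import hashlib
import platform
import sys
from importlib import metadata


def capture_environment(packages=()):
    """Record the OS, the Python version, and the versions of named packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    return {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "packages": versions,
    }


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hex digest of some bytes (file contents, for example)."""
    return hashlib.sha256(data).hexdigest()


def build_record(parameters, input_bytes, output_bytes, packages=()):
    """Bundle everything a later rerun would need to be compared against.

    `parameters` is a plain dict of the analysis settings that were used
    (hypothetical example: {"min_quality": 20}).
    """
    return {
        "environment": capture_environment(packages),
        "parameters": dict(parameters),
        "input_sha256": sha256_of(input_bytes),
        "output_sha256": sha256_of(output_bytes),
    }


def directly_reproduced(original, rerun):
    """True when the rerun used the same input and parameters and
    regenerated byte-identical output (an assumed, strict definition)."""
    keys = ("parameters", "input_sha256", "output_sha256")
    return all(original[key] == rerun[key] for key in keys)


if __name__ == "__main__":
    # Toy data standing in for a sequence file and an analysis result.
    first = build_record({"min_quality": 20}, b"ACGT", b"result-1")
    second = build_record({"min_quality": 20}, b"ACGT", b"result-1")
    print(directly_reproduced(first, second))
```

The environment is deliberately left out of the comparison: two runs on different machines can still count as a direct reproduction, while the recorded environment helps explain a mismatch when the digests differ.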

Genomics is primarily a computational, data-driven “science.” It requires that the technology, software, and data generated be accurate and easily accessible to other researchers so that they can reproduce and confirm the findings. As can be seen from the various sources presented above, the ability to generate data has outpaced both the technology and the ability to store and interpret the massive amounts of data generated. The results of genomic studies are rarely, if ever, reproduced. This is an enormous problem, as reproducibility is the only way to confirm the supposed accuracy of any genomic study. Without this, genomics amounts to nothing but useless random letters in a database.

This highlight from the 2017 study cited above summed up the importance of reproducibility so brilliantly I wanted to share it one more time for emphasis:

The expectation for science to be reproducible is considered fundamental but often not tested. Every new discovery in science is built on already known knowledge, that is, published literature acts as a building block for new findings or discoveries. Using this published literature as a base, the next level of understanding is developed and hence the cycle continues. Therefore, if we cannot reproduce already existing knowledge from the literature, we are wasting a lot of effort, resources and time in doing potentially wrong science resulting in “reproducibility crisis.” If a researcher claims a novel finding, someone else, interested in the study, should be able to reproduce it.

Cell cultures, antibodies, and genomics are at the core of virology. Without them, virology as it is today would not exist. With each of these “sciences” embroiled in a reproducibility crisis, how confident can anyone feel about virology as a whole when it is intimately tied to all three?
