The Epistemological Crisis in Genomics

“Is there a danger, in molecular biology, that the accumulation of data will get so far ahead of its assimilation into a conceptual framework that the data will eventually prove an encumbrance? Part of the trouble is that the excitement of the chase leaves little time for reflection. And there are grants for producing data, but hardly any for standing back in contemplation.” – John Maddox, Nature 335, 11 (1988)

EPISTEMOLOGICAL: relating to the theory of knowledge, especially with regard to its methods, validity, and scope, and the distinction between justified belief and opinion

https://www.lexico.com/en/definition/epistemological

How do we know what we take as “fact” in science? Is it really based on observable phenomena that we can see with our own eyes or do most of the results we claim as knowledge come from computational data analysis prone to biases and errors that are often not discussed and commonly ignored?

The following highlights come from an article by Edward R Dougherty:

“Edward R. Dougherty is an American mathematician, electrical engineer, Robert M. Kennedy ’26 Chair, and Distinguished Professor of Electrical Engineering at Texas A&M University. He is also the Scientific Director of the Center for Bioinformatics and Genomic Systems Engineering”

https://engineering.tamu.edu/electrical/profiles/edougherty.html

In his paper, Dougherty focuses on the crisis of what constitutes scientific knowledge in genomics. He examines the problems that arise when an abundance of data and after-the-fact analysis is allowed to stand in for science even though it fails to constitute valid scientific knowledge. It is a long article and I’m sure I left out some useful information, so I recommend giving the whole article a read sometime. Highlights below:

On the Epistemological Crisis in Genomics

Edward R Dougherty

Abstract

“There is an epistemological crisis in genomics. At issue is what constitutes scientific knowledge in genomic science, or systems biology in general. Does this crisis require a new perspective on knowledge heretofore absent from science or is it merely a matter of interpreting new scientific developments in an existing epistemological framework? This paper discusses the manner in which the experimental method, as developed and understood over recent centuries, leads naturally to a scientific epistemology grounded in an experimental-mathematical duality. It places genomics into this epistemological framework and examines the current situation in genomics. Meaning and the constitution of scientific knowledge are key concerns for genomics, and the nature of the epistemological crisis in genomics depends on how these are understood.

INTRODUCTION

There is an epistemological crisis in genomics. The rules of the scientific game are not being followed. Given the historical empirical emphasis of biology and the large number of ingenious experiments that have moved the field, one might suspect that the major epistemological problems would lie with mathematics, but this is not the case. While there certainly needs to be more care paid to mathematical modeling, the major problem lies on the experimental side of the mathematical-experimental scientific duality. High-throughput technologies such as gene-expression microarrays have led to the accumulation of massive amounts of data, orders of magnitude in excess of what has heretofore been conceivable. But the accumulation of data does not constitute science, nor does the a posteriori rational analysis of data.

The ancients were well aware of the role of observation in natural science. Reason applied to observations, not reason alone, yielded pragmatic knowledge of Nature. This is emphasized by the second century Greek physician Galen in his treatise, On the Natural Faculties, when, in regard to the effects of a certain drug, he refutes the rationalism of Asclepiades when he writes, “This is so obvious that even those who make experience alone their starting point are aware of it… In this, then, they show good sense; whereas Asclepiades goes far astray in bidding us distrust our senses where obvious facts plainly overturn his hypotheses” [1]. For the ancients, the philosophy of Nature might have dealt with principles of unity, ideal forms, and final causes, but natural science was observation followed by rational analysis. This was especially so during the Roman period, as evidenced by their remarkable engineering achievements.”

“Everything begins with the notion of a designed experiment – that is, methodological as opposed to unplanned observation. Rather than being a passive observer of Nature, the scientist structures the manner in which Nature is to be observed. The monumental importance of this change is reflected by the inclusion of the following statement concerning the early modern scientists, in particular, Galileo and Torricelli, by Immanuel Kant in the preface of the second edition of the Critique of Pure Reason:

They learned that reason only perceives that which it produces after its own design; that it must not be content to follow, as it were, in the leading-strings of Nature, but must proceed in advance with principles of judgment according to unvarying laws and compel Nature to reply to its questions. For accidental observations, made according to no preconceived plan, cannot be united under a necessary law… Reason must approach Nature… [as] a judge who compels witnesses to reply to those questions which he himself thinks fit to propose. To this single idea must the revolution be ascribed, by which, after groping in the dark for so many centuries, natural science was at length conducted into the path of certain progress [2].

A good deal of the crisis in genomics turns on a return to “groping in the dark.”

In previous papers, we have considered how the model-experiment duality leads to a contemporary epistemology for computational biology [3], treated the validation of computational methods in genomics [4], and characterized inference validity for gene regulatory networks in the framework of distances between networks [5]. Here we focus on how the experimental method leads to a general scientific epistemology and how contemporary genomic research often fails to satisfy the basic requirements of that epistemology, thereby failing to produce valid scientific knowledge.”

“Even if we were to accept causality in the form of necessary connections, only if all causal factors were known could we predict effects with certainty. In complex situations, such as the regulatory system of a cell, one cannot conceive of taking account of all contributing factors. Model complexity is limited due to several factors, including mathematical tractability, data requirements for inference, computation, and feasible experimental design. Thus, there will be latent variables external to the model affecting the variables in the model and making the model behave stochastically.”
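Dougherty’s point about latent variables can be made concrete with a toy sketch (the example and numbers are mine, not from the paper): a rule that is fully deterministic in Nature looks stochastic as soon as one of its inputs is left out of the model.

```python
import random

random.seed(2)

# Suppose Nature is deterministic: y = x + z.  Our model only observes x;
# z is a latent variable (say, an unmeasured regulator).
def nature(x, z):
    return x + z

# Repeated observations at the same x yield different y's, so a model
# built on x alone must treat y as a random variable, even though
# Nature itself involves no randomness.
observations = [nature(1.0, random.gauss(0, 1)) for _ in range(5)]
print(observations)  # five different outputs for the same input x = 1.0
```

The “noise” here is not measurement error; it is the shadow of the variable the model omits, which is exactly why models of cellular regulation must behave stochastically.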

“The truth of a scientific theory rests with its validation, and a theory is validated independently of the thinking leading to it. No amount of rationalist explanation can validate a theory. Science is not about rationalist explanation, neither in its classic philosophic form of explaining events in terms of natural categories nor its more recent computational form in terms of explaining the data by fitting a model. It is not unusual to hear it said that some theory “explains” some phenomena. One listens to the explanation and it all seems quite reasonable. The explanation fits the data. Consider the following statement of Stephen Jay Gould: “Science tries to document the factual character of the natural world, and to develop theories that coordinate and explain these facts” [19]. Perhaps this statement would have been accurate during medieval times, but not today. While it is true that theories coordinate measurements (facts), it is not the documented measurements that are crucial, but rather the yet-to-be-obtained measurements. Gould’s statement is prima facie off the mark because it does not mention prediction.

Science is not about data fitting. Consider designing a linear classifier. A classifier (binary decision function) is constructed according to some design procedure that takes into account its mathematical structure, the data, and its success at categorizing the data relative to some criterion. The result might be good relative to the assembled data; indeed, the constructed line might even classify the data perfectly. But this linear-classifier model does not constitute a scientific theory unless there is an error rate associated with the line, predicting the error rate on future observations. Of critical importance to the scientific epistemology is that the model, consisting of both classifier and error rate, is valid only to the extent that the reported error rate is accurate. A model is validated neither by the rational thinking behind the design procedure nor its excellent data-fitting performance. Only knowledge of its predictive power provides validity. In practice, the error rate of a classifier is estimated via some error-estimation procedure, so that the validity of the model depends upon this procedure. Specifically, the degree to which one knows the classifier error, which quantifies the predictive capacity of the classifier, depends upon the mathematical properties of the estimation procedure. Absent an understanding of those properties, the results are meaningless.”
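Dougherty’s classifier example can be sketched as follows (a minimal one-dimensional illustration with assumed Gaussian classes; the specific numbers are mine): the fitted decision rule alone is not a scientific model; the model is the classifier together with a reported error rate on future observations, and its validity rests on the accuracy of that estimate.

```python
import random

random.seed(1)

# Two classes with Gaussian feature distributions (means 0.0 and 1.5).
def sample(n, mean):
    return [random.gauss(mean, 1.0) for _ in range(n)]

train0, train1 = sample(50, 0.0), sample(50, 1.5)

# "Design procedure": put the threshold at the midpoint of the class means.
threshold = (sum(train0) / len(train0) + sum(train1) / len(train1)) / 2

def classify(x):
    return 0 if x < threshold else 1

# The scientific content is the predicted error on FUTURE observations,
# here estimated on a large independent test sample.
test0, test1 = sample(5000, 0.0), sample(5000, 1.5)
errors = sum(classify(x) != 0 for x in test0) + sum(classify(x) != 1 for x in test1)
print(f"estimated error rate on future observations: {errors / 10000:.3f}")
```

With ten thousand held-out points the error estimate is trustworthy; the small-sample microarray settings discussed later have no such luxury, which is precisely where the epistemological trouble begins.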

“Let us focus on Intelligibility, which may be the interpretation of explanation that is most often confused with science. If we take intelligibility to mean that the phenomena themselves are grasped by the intellect, then this would imply that Nature is accessible to the human intellect. It is true that the mathematical model (conceptual system) is intelligible, but that is because the mathematical model is constructed by humans in accordance with human intelligibility. But the model does not mirror the physical world. One might argue that what is meant by explanation is mathematical explanation, in the sense that the equations fit the observations. Even if we accept this data-fitting meaning of explanation, it leaves out the fundamental aspect of scientific meaning – prediction.”

“It is not that we are without any understanding whatsoever; as previously noted, we understand the mathematical model. Our knowledge of phenomena resides in the mathematical model, insofar as that knowledge is conceptual. But here we must avoid the danger of slipping into rationalism, mistaking the conceptual system for Nature herself. Scientific knowledge does not stop with reasoning about possibilities and creating a model. It goes further to include a predictive validation methodology and then actual validation. Reichenbach notes that “the very mistake which made rationalism incompatible with science” is “the mistake of identifying [scientific] knowledge with mathematical knowledge” [22]. It is here that we see a great danger lying in Gould’s formulation. Without operational definitions and concomitant experimental protocols for validation, as well as the validation itself, the development of “theories that coordinate and explain” facts quickly drifts into rationalism. Reasoning, either in the form of conceptual categories such as causality or via a mathematical system, is applied to data absent any probabilistic quantification relating to the outcome of future observation. Explanation and opinion replace scientific methodology. Whose reasoning do we trust? A formal validation procedure settles the matter.”

IS GENOMICS UNDERSTANDABLE?

“When he refers to Nature as being absurd, Feynman is not criticizing his understanding of the mathematical systems that allow one to model physical phenomena and to make predictions regarding those phenomena; rather, he is referring to a lack of categorical understanding of the physical phenomena themselves. Light is conceived as neither wave nor particle. Thus, the categorical requirement that it be one or the other is violated. From the Kantian perspective, the object of sensibility cannot be conformed to the categories of understanding and therefore cannot be understood. As a product of the human intellect, a mathematical model is ipso facto understandable. Nature is not a product of the human intellect.”

“Our difficulties of understanding arise because the categories of our ordinary understanding relate to possible sensory experiences. These difficulties extend to genomics. We have no sensory experience with networks of thousands of nonlinearly interacting nodes exhibiting feedback, distributed regulation, and massive redundancy. The reasons for lacking understanding are different from those in physics, but they are compelling in their own way. Nature is absurd from the human perspective because we lack the categories of understanding with which to intuit it – be it physics or biology.

THE CURRENT SITUATION IN GENOMICS

Almost from the onset of the high-throughput microarray era, papers reporting classifiers based on gene-expression features have appeared. There have also been cautionary warnings about the dangers of misapplication of classification methods designed for use with at most hundreds of features and many thousands of sample points to data sets with thousands or tens of thousands of features (genes) and less than one hundred sample points (microarrays) [31, 32]. Keeping in mind the thousands of gene expressions on a microarray, consider a sampling of sample sizes for cancer classification: acute leukemia, 38 [33]; leukemia, 37 [34]; breast cancer, 38 [35]; breast cancer, 22 [36]; follicular lymphoma, 24 [37]; glioma, 50 (but only 21 classic tumors used for class prediction) [38]; and uveal melanoma, 20 [39]. This is a tiny sampling of the host of microarray classification papers based on very small samples and selecting feature sets from among thousands of genes.

Since the foundation of scientific knowledge is prediction, the scientific worth of a classifier depends on the accuracy of the error estimate. If a classifier is trained from sample data and its error estimated, then classifier validity relates to the accuracy of the error estimate, since this estimate quantifies the predictive capability of the classifier. An inability to evaluate predictive power would constitute an epistemological barrier to being able to claim that a classifier model is scientifically sound. Certainly, there are mathematical issues at each step in applying classification to microarray data. Can one design a good classifier given the small samples commonplace in genomics? [40] Can one expect a feature-selection algorithm to find good features under these limitations? [41] These concerns, while important for obtaining useful classifiers, are epistemologically overridden by the concern that the predictive capability, and therefore the scientific meaning, of a designed classifier lies with the accuracy of the error estimate. Except in trivial cases, there has been no evidence provided that acceptable error estimation is possible with so many features and such small samples. Even worse, in many cases studied it has been shown to be impossible [42-45]. Hence, not only have the vast majority of the papers not been shown to possess scientific content, large numbers of them have been shown not to possess scientific content. Braga-Neto writes, “Here, we are facing the careless, unsound application of classification methods to small-sample microarray data, which has generated a large number of publications and an equally large amount of unsubstantiated scientific hypotheses” [40]. The failure of the research community to demand solid mathematical demonstrations of the validity of the classification methods used with the type of data available has resulted in a large number of papers lacking scientific content.

Many epistemological issues in genomics relate to statistics. Mehta et al. write, “Many papers aimed at the high-dimensional biology community describe the development or application of statistical techniques. The validity of many of these is questionable, and a shared understanding about the epistemological foundations of the statistical methods themselves seems to be lacking” [46]. They are calling attention to a lack of sound statistical epistemology, which renders the results meaningless. The point is further emphasized by Dupuy and Simon, who write, “Both the validity and the reproducibility of microarray-based clinical research have been challenged” [47]. To examine the issue, they have reviewed 90 studies, 76% of which were published in journals having impact factor larger than 6. Based on a detailed analysis of the 42 studies published in 2004, they report:

Twenty-one (50%) of them contained at least one of the following three basic flaws: (1) in outcome-related gene finding, an unstated, unclear, or inadequate control for multiple testing; (2) in class discovery, a spurious claim of correlation between clusters and clinical outcome, made after clustering samples using a selection of outcome-related differentially expressed genes; or (3) in supervised prediction, a biased estimation of the prediction accuracy through an incorrect cross-validation procedure [47].

The situation is actually much worse than stated here, since in high-dimensional, small-sample settings, cross-validation error estimation, which is ubiquitous in microarray studies, does not provide acceptable error estimation (as will be illustrated in the following paragraph) [42-45]. Thus, using cross-validation in supervised prediction undermines scientific validity.”
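This failure mode can be demonstrated with a simulation (my own construction, echoing the well-known feature-selection bias result): generate pure noise, select the most “class-separating” features on the full dataset, and only then cross-validate. The reported error is wildly optimistic even though the true error is 50% by construction.

```python
import random

random.seed(0)

n, p, k = 20, 1000, 10          # samples, features, selected features
labels = [i % 2 for i in range(n)]
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]  # pure noise

# FLAW: rank features by class separation using ALL samples...
def separation(f):
    m0 = sum(X[i][f] for i in range(n) if labels[i] == 0) / (n // 2)
    m1 = sum(X[i][f] for i in range(n) if labels[i] == 1) / (n // 2)
    return abs(m0 - m1)

feats = sorted(range(p), key=separation, reverse=True)[:k]

# ...then run leave-one-out cross-validation of a nearest-centroid
# classifier restricted to those pre-selected features.
def centroid(idx, c):
    rows = [i for i in idx if labels[i] == c]
    return [sum(X[i][f] for i in rows) / len(rows) for f in feats]

errors = 0
for i in range(n):
    train = [j for j in range(n) if j != i]
    c0, c1 = centroid(train, 0), centroid(train, 1)
    d0 = sum((X[i][f] - m) ** 2 for f, m in zip(feats, c0))
    d1 = sum((X[i][f] - m) ** 2 for f, m in zip(feats, c1))
    errors += (0 if d0 < d1 else 1) != labels[i]

print(f"LOO error estimate: {errors / n:.2f}  (true error is 0.50: the data are noise)")
```

Selecting the features inside each cross-validation fold, rather than once on the full data, removes this particular bias; leaving it in is flaw (3) in the Dupuy and Simon list above.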

“Experimental design is a key element in drawing statistical conclusions. A properly designed experiment can substantially increase the power of the conclusions, whereas a poorly designed experiment can make it impossible to draw meaningful conclusions. Potter has drawn attention to this issue in the context of high-throughput biological data by distinguishing between mere observation and experimental design, the fundamental distinction between pre-modern and modern science:

Making the observations with new and powerful technology seems to induce amnesia as to the original nature of the study design. It is as though astronomers were to ignore everything they knew both about how to classify stars and about sampling methods, and instead were to point spectroscopes haphazardly at stars and note how different and interesting the patterns of spectral absorption lines were. Nonetheless, I doubt the astronomers would claim to be doing an experiment. This dilettante’s approach to either astronomy or biology has not been in vogue for at least half a century [32].

In fact, it has not been in vogue since Galileo and Torricelli. Are we to return to “groping in the dark?”

In this vein, the ubiquity of data mining techniques is particularly worrisome. These tend to search for patterns in existing data without regard to experimental design or predictive capability. Keller points out the danger of trying to draw grand inferences from patterns found in data. Referring to William Feller’s classic text [52] on probability theory, she writes,

By 1971, the attempt to fit empirical phenomena to such distributions was already so widespread that Feller felt obliged to warn his readers against their overuse….Feller’s emphasis on the logistic curve as ‘an explicit example of how misleading a mere goodness of fit can be’ was motivated precisely by the persistence of such ‘naïve reasoning’ [53].

Data mining is often erroneously identified with pattern recognition when, in fact, they are very different subjects. Pattern recognition can be used as a basis for science because it is based on a rigorous probabilistic framework [54]. On the other hand, all too often, data mining techniques consist of a collection of computational techniques backed by heuristics and lacking any mathematical theory of error, and therefore lacking the potential to constitute scientific knowledge.

While inattention to epistemology in genomic classification is troubling, the situation with clustering is truly astounding. As generally practiced, there is no predictive aspect and hence no scientific content whatsoever. Indeed, Jain et al. state that “clustering is a subjective process,” [55] so that it lacks the basic scientific requirement of inter-subjectivity. In the context of genomics, Kerr and Churchill have asked the epistemological question: “How does one make statistical inferences based on clustering” [56]. Inferences are possible when clustering is put on a sound probabilistic (predictive) footing by recognizing that, whereas the epistemology of classification lies in the domain of random variables, [54] the epistemology of clustering must lie within the framework of random sets [57]. A great deal of study needs to be done in this direction before clustering can practically provide scientific knowledge. In the meantime, so-called “validation indices” are sometimes used to support a clustering result, but these are often poorly correlated to the clustering error and therefore do not provide scientific validation [58].

Epistemological considerations for genomics inexorably point to systems biology. It would seem obvious that systems biology should be based on systems theory, which, as we have discussed, is a direction clearly pointed to a half century ago in the work of Wiener, Rosenblueth, Monod, Waddington, Kauffman, and others. It is the approach taken in genomic signal processing, where both the dynamics of gene regulatory networks and their external control are being pursued within the context of systems theory [59]. Genomic research has mostly taken a different path. Based upon the historical path of genomics, Wolkenhauer goes so far as to virtually cleave genomics from systems biology when he writes,

The role of systems theory in systems biology is to elucidate the functional organization of cells. This is a complementary but very different effort to genomics, biophysics, and molecular biology, whose primary role it has been to discover and characterize the components of the cell – to describe its structural organization. A basic philosophical point systems theory makes is that objects and relations between objects have the same ontological status. Life is a relation among molecules/cells and not a property of any molecule/cell; a cell is built up of molecules, as a house is with stones. A soup of molecules is no more a cell than a plane is a heap of metal [60].

Wolkenhauer is making an empirical observation regarding a widespread inattention to systems theory. Genomics, being the study of multivariate interactions among cellular components, requires systems-based modeling, in particular, the use of nonlinear stochastic dynamical systems, whether these be in the form of differential equations, discrete networks, Markov processes, or some other form of random process. Science and engineering have more than half a century of experience with stochastic systems. Since it is impossible to conceive of modern communication and control systems absent their being grounded in systems theory, it is surely impossible to conceive of meaningful progress in genomics without the use (and extension of) this theory. Of course, there are obstacles. Experiments need to be designed and carried out in a manner suitable for the construction of nonlinear dynamical systems and systems theory needs to be developed in ways appropriate to biological modeling [61]. These are imposing tasks. Nonetheless, based on our long experience with humanly designed systems it is virtually certain that the study of biological systems cannot meaningfully progress without well thought out experiments and deep mathematics.”

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2674806/

In Summary:

  • There is an epistemological crisis in genomics
  • At issue is what constitutes scientific knowledge in genomic science, or systems biology in general
  • Meaning and the constitution of scientific knowledge are key concerns for genomics, and the nature of the epistemological crisis in genomics depends on how these are understood
  • According to Dougherty, the rules of the scientific game are not being followed
  • High-throughput technologies such as gene-expression microarrays have led to the accumulation of massive amounts of data, orders of magnitude in excess of what has heretofore been conceivable, yet the accumulation of data does not constitute science, nor does the a posteriori rational analysis of data

Not-So-Quick Detour on MICROARRAYS:

“A microarray is a laboratory tool used to detect the expression of thousands of genes at the same time. DNA microarrays are microscope slides that are printed with thousands of tiny spots in defined positions, with each spot containing a known DNA sequence or gene. Often, these slides are referred to as gene chips or DNA chips. The DNA molecules attached to each slide act as probes to detect gene expression, which is also known as the transcriptome or the set of messenger RNA (mRNA) transcripts expressed by a group of genes.”

https://www.nature.com/scitable/definition/microarray-202/

What are Microarrays Used for?

“Scientists use DNA microarrays to measure the expression levels of large numbers of genes simultaneously or to genotype multiple regions of a genome.”

https://en.m.wikipedia.org/wiki/DNA_microarray

What is Genotyping?

“Genotyping is the technology that detects small genetic differences that can lead to major changes in phenotype, including both physical differences that make us unique and pathological changes underlying disease.”

https://www.thermofisher.com/us/en/home/life-science/pcr/real-time-pcr/real-time-pcr-learning-center/genotyping-analysis-real-time-pcr-information/what-is-genotyping.html

In other words, microarrays are used to compare genomes in order to detect the genetic differences, including pathological changes, that underlie disease.

How Accurate are Microarrays?

Microarray experiments and factors which affect their reliability

“Analysis of microarray data is however very complex, requiring sophisticated methods to control for various factors that are inherent to the procedures used. In this article we describe the individual steps of a microarray experiment, highlighting important elements and factors that may affect the processes involved and that influence the interpretation of the results.”

Microarray analysis offers a variety of methods allowing, among others, identification of genes which might be significant in a specific cellular response mechanism or a particular gene expression pattern that characterizes a particular disease. To obtain significant results, microarray data need to undergo statistical processing to differentiate between signal changes caused by direct experimental factors and those arising from indirect experimental factors such as the specific methods used, as well as from inaccuracies of the measurements. These processing challenges have led to studies of the compatibility of different microarray platforms [23-28], which is usually achieved by standardizing protocols and data analysis pipelines [29, 30]. Selection of an appropriate statistical method for microarray processing is a significant subject of scientific discussion and, although microarrays have been in use for more than fifteen years, many issues related to data analysis remain unresolved.

The most discussed issues concern the algorithms used for the data normalization [31, 32], whose goal is to eliminate differences between samples that originate from technical aspects of the microarray handling which may confound the biological differences in a given experimental setup. A similar goal underlies methods used for batch-effect removal, a step which is crucial when comparing datasets that originate from different times and laboratories [33]. Other frequently-discussed issues concern the identification of sample differentiating genes [34, 35] and evaluation of noise level in the sample [36], as well as methods to evaluate contamination or damage on the microarray’s surface [37, 38]. The most commonly used microarrays, produced by Affymetrix, are known for additional issues related to their particular design which influence the final results. These include problems resulting from several measurements of expression level for a single gene [39, 40], incorrect assignments of probes to genes [41, 42], incorrect evaluation of the background level and non-specific probe hybridization signals [43], and the effects of distinct probe features on data processing algorithms [44].

The most significant disadvantages of microarrays include the high cost of a single experiment, the large number of probe designs based on sequences of low-specificity, as well as the lack of control over the pool of analyzed transcripts since most of the commonly used microarray platforms utilize only one set of probes designed by the manufacturer. Other weaknesses of microarrays are their relatively low accuracy, precision and specificity [45] as well as the high sensitivity of the experimental setup to variations in hybridization temperature [46], the purity and degradation rate of genetic material [47], and the amplification process [48] which, together with other factors, may impact the estimates of gene expression.”

Conclusions

“Despite successful studies of reproducibility [27] and specificity [97], microarrays have been often subject of criticism as a method which fails to identify relevant information that can be transferred directly into clinical applications [98]. The main reason is that statistical significance often differs from biological relevance due to a very limited number of samples or to the influence of other factors, such as cellular heterogeneity or variability of the morphological features, which are difficult to separate from the studied features.”

“The capabilities of microarray studies are limited, since the measurement of transcript levels provides only a rough estimate of the intracellular conditions at a specific time point, and is affected by a plethora of experiment-specific factors. The process of discovery of new drugs, using expression or genotyping microarrays, is therefore uneven in pace and in some cases might be even misleading.”

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4559324/#Sec1title
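The normalization step the review discusses can be illustrated with quantile normalization, one common approach for removing array-wide technical shifts (this is my generic sketch of the technique, not the pipeline of any cited study; ties between equal values are not handled):

```python
# Quantile normalization: force every array to share the same empirical
# intensity distribution, so that array-wide technical brightness
# differences cannot masquerade as biology.
def quantile_normalize(samples):
    n = len(samples[0])
    # rank positions within each sample (no tie handling in this sketch)
    orders = [sorted(range(n), key=lambda i: s[i]) for s in samples]
    # reference distribution: mean of the k-th smallest values across samples
    reference = [sum(sorted(s)[k] for s in samples) / len(samples)
                 for k in range(n)]
    out = []
    for order in orders:
        norm = [0.0] * n
        for rank, i in enumerate(order):
            norm[i] = reference[rank]
        out.append(norm)
    return out

arrays = [[2.0, 5.0, 3.0], [4.0, 14.0, 8.0]]   # second array globally brighter
print(quantile_normalize(arrays))  # both arrays now share one distribution
```

On this toy input both arrays map to [3.0, 9.5, 5.5]: the within-array rank pattern is preserved, while the global brightness difference between the two arrays is removed.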

Limitations of microarrays

Hybridisation-based approaches are high throughput and relatively inexpensive, but have several limitations which include (6):

  • reliance upon existing knowledge about the genome sequence
  • high background levels owing to cross-hybridisation
  • limited dynamic range of detection owing to both background and saturation signals
  • comparing expression levels across different experiments is often difficult and can require complicated normalisation methods

Beyond all of those weaknesses and limitations, I’m sure the microarray results are “accurate” and “reliable.” Perhaps this is why Dougherty states that the mere gathering and interpretation of data does not constitute science?

End Detour.

  • The ancients were well aware of the role of observation in natural science
  • Reason applied to observations, not reason alone, yielded pragmatic knowledge of Nature
  • Natural science was observation followed by rational analysis
  • Everything begins with the notion of a designed experiment – that is, methodological as opposed to unplanned observation
  • Rather than being a passive observer of Nature, the scientist structures the manner in which Nature is to be observed
  • A good deal of the crisis in genomics turns on a return to “groping in the dark”
  • Dougherty focuses on how the experimental method leads to a general scientific epistemology and how contemporary genomic research often fails to satisfy the basic requirements of that epistemology, thereby failing to produce valid scientific knowledge
  • Even if we were to accept causality in the form of necessary connections, only if all causal factors were known could we predict effects with certainty
  • In complex situations, such as the regulatory system of a cell, one cannot conceive of taking account of all contributing factors
  • Model complexity is limited due to several factors, including:
    1. Mathematical tractability
    2. Data requirements for inference
    3. Computation
    4. Feasible experimental design
  • There will be latent variables external to the model affecting the variables in the model and making the model behave stochastically
  • No amount of rationalist explanation can validate a theory
  • Science is not about rationalist explanation, neither in its classic philosophic form of explaining events in terms of natural categories nor in its more recent computational form of explaining the data by fitting a model
  • Science is not about data fitting
  • A model is validated neither by the rational thinking behind the design procedure nor by its excellent data-fitting performance; only knowledge of its predictive power provides validity
  • Absent an understanding of those properties, the results are meaningless
  • It is true that the mathematical model (conceptual system) is intelligible, but that is because the mathematical model is constructed by humans in accordance with human intelligibility
  • But the model does not mirror the physical world
  • Our knowledge of phenomena resides in the mathematical model, insofar as that knowledge is conceptual
  • But here we must avoid the danger of slipping into rationalism, mistaking the conceptual system for Nature herself as scientific knowledge does not stop with reasoning about possibilities and creating a model
  • It goes further to include a predictive validation methodology and then actual validation
  • When he refers to Nature as being absurd, Feynman is not criticizing his understanding of the mathematical systems that allow one to model physical phenomena and to make predictions regarding those phenomena; rather, he is referring to a lack of categorical understanding of the physical phenomena themselves
  • As a product of the human intellect, a mathematical model is ipso facto understandable, however, Nature is not a product of the human intellect
  • Our difficulties of understanding arise because the categories of our ordinary understanding relate to possible sensory experiences and these difficulties extend to genomics
  • We have no sensory experience with networks of thousands of nonlinearly interacting nodes exhibiting feedback, distributed regulation, and massive redundancy
  • There have been cautionary warnings about the dangers of misapplication of classification methods designed for use with at most hundreds of features and many thousands of sample points to data sets with thousands or tens of thousands of features (genes) and fewer than one hundred sample points (microarrays)
  • Certainly, there are mathematical issues at each step in applying classification to microarray data:
    1. Can one design a good classifier given the small samples commonplace in genomics?
    2. Can one expect a feature-selection algorithm to find good features under these limitations?
  • Except in trivial cases, there has been no evidence provided that acceptable error estimation is possible with so many features and such small samples
  • Even worse, in many cases studied it has been shown to be impossible
  • Not only have the vast majority of the papers not been shown to possess scientific content, but large numbers of them have been shown not to possess scientific content
  • Braga-Neto writes, “Here, we are facing the careless, unsound application of classification methods to small-sample microarray data, which has generated a large number of publications and an equally large amount of unsubstantiated scientific hypotheses”
  • The failure of the research community to demand solid mathematical demonstrations of the validity of the classification methods used with the type of data available has resulted in a large number of papers lacking scientific content
  • Many epistemological issues in genomics relate to statistics
  • Mehta et al. write, “Many papers aimed at the high-dimensional biology community describe the development or application of statistical techniques. The validity of many of these is questionable, and a shared understanding about the epistemological foundations of the statistical methods themselves seems to be lacking”
  • They are calling attention to a lack of sound statistical epistemology, which renders the results meaningless
  • A study by Dupuy and Simon found that twenty-one (50%) of 42 microarray papers contained at least one of the following three basic flaws:
    1. In outcome-related gene finding, an unstated, unclear, or inadequate control for multiple testing
    2. In class discovery, a spurious claim of correlation between clusters and clinical outcome, made after clustering samples using a selection of outcome-related differentially expressed genes
    3. In supervised prediction, a biased estimation of the prediction accuracy through an incorrect cross-validation procedure
  • Cross-validation error estimation, which is ubiquitous in microarray studies, does not provide acceptable error estimation
  • Thus, using cross-validation in supervised prediction undermines scientific validity
  • A properly designed experiment can substantially increase the power of the conclusions, whereas a poorly designed experiment can make it impossible to draw meaningful conclusions
  • Making the observations with new and powerful technology seems to induce amnesia as to the original nature of the study design
  • The ubiquity of data mining techniques is particularly worrisome as these tend to search for patterns in existing data without regard to experimental design or predictive capability.
  • Data mining techniques consist of a collection of computational techniques backed by heuristics and lacking any mathematical theory of error, and therefore lacking the potential to constitute scientific knowledge
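
Dougherty's warnings about small-sample classification and biased error estimation can be reproduced with a simulation. The sketch below (plain Python, simulated noise rather than real microarray data, all sample sizes and the seed arbitrary) commits the common flaw: selecting the most "differentially expressed" gene using ALL samples, and only then cross-validating. On data with no real signal, the apparent accuracy typically comes out well above the 50% chance level:

```python
import random

random.seed(0)

n, p = 20, 2000                      # 20 samples, 2000 "genes" of pure noise
labels = [0] * 10 + [1] * 10
data = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]

def mean(xs):
    return sum(xs) / len(xs)

# FLAW: select the most "differentially expressed" gene using ALL samples...
def score(j):
    a = [data[i][j] for i in range(n) if labels[i] == 0]
    b = [data[i][j] for i in range(n) if labels[i] == 1]
    return abs(mean(a) - mean(b))

best = max(range(p), key=score)

# ...and only then estimate accuracy by leave-one-out CV on that one gene.
def loo_accuracy(j):
    correct = 0
    for held in range(n):
        train = [i for i in range(n) if i != held]
        m0 = mean([data[i][j] for i in train if labels[i] == 0])
        m1 = mean([data[i][j] for i in train if labels[i] == 1])
        x = data[held][j]
        pred = 0 if abs(x - m0) < abs(x - m1) else 1
        correct += (pred == labels[held])
    return correct / n

acc = loo_accuracy(best)
# Typically far above 50%, even though the data contain no signal at all.
print(f"Apparent LOO accuracy on pure noise: {acc:.0%}")
```

An honest estimate would repeat the feature selection inside every cross-validation fold; doing it outside, as in the criticized papers, bakes the noise into the "result."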

Quick Detour on DATA MINING:

“Data mining is the process of finding anomalies, patterns and correlations within large data sets to predict outcomes.”

https://www.sas.com/en_us/insights/analytics/data-mining.html

Data Mining in Genomics

Introduction

There has been a great explosion of genomic data in recent years. This is due to the advances in various high-throughput biotechnologies such as RNA gene expression microarrays. These large genomic data sets are information-rich and often contain much more information than the researchers who generated the data may have anticipated. Such an enormous data volume enables new types of analyses, but also makes it difficult to answer research questions using traditional methods. Analysis of these massive genomic data has several unprecedented challenges:

Challenge 1: Multiple comparisons issue

Analysis of high-throughput genomic data requires handling an astronomical number of candidate targets, most of which are false positives [12].”

Challenge 2: High dimensional biological data

“The second challenge is the high dimensional nature of biological data in many genomic studies [3]. In genomic data analysis, many gene targets are investigated simultaneously, yielding dramatically sparse data points in the corresponding high-dimensional data space. It is well known that mathematical and computational approaches often fail to capture such high dimensional phenomena accurately.”

Challenge 3: Small n and large p problem

“The third challenge is the so-called “small n and large p” problem [2]. Desired performance of conventional statistical methods is achieved when the sample size of the data, namely “n”—the number of independent observations and subjects—is much larger than the number of candidate prediction parameters and targets, namely “p”. In many genomic data analyses this situation is often completely reversed.”
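
The "small n and large p" problem can also be made concrete with a sketch (plain Python, simulated noise, all numbers arbitrary): with only 10 samples but 2,000 noise "genes", some genes will perfectly separate the two groups purely by chance:

```python
import random

random.seed(2)

n_a, n_b, p = 5, 5, 2000   # 10 samples total, 2000 noise "genes"

perfect = 0
for _ in range(p):
    a = [random.gauss(0, 1) for _ in range(n_a)]
    b = [random.gauss(0, 1) for _ in range(n_b)]
    # A noise gene "perfectly separates" the groups if every value in one
    # group exceeds every value in the other, purely by chance.
    if max(a) < min(b) or max(b) < min(a):
        perfect += 1

# Chance of perfect separation per noise gene = 2 * (5! * 5!) / 10! = 1/126,
# so with p = 2000 we expect roughly 16 such genes.
print(perfect)
```

With p much larger than n, apparently clean separations are expected by chance alone, which is exactly why the intuitions behind conventional statistical methods break down in this regime.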

Challenge 4: Computational limitation

“We also note that no matter how powerful a computer system becomes, it is often prohibitive to solve many genomic data mining problems by exhaustive combinatorial search and comparisons [4]. In fact, many current problems in genomic data analysis have been theoretically proven to be of NP (non-polynomial)-hard complexity, implying that no computational algorithm can search for all possible candidate solutions. Thus, heuristic—most frequently statistical—algorithms that effectively search and investigate a very small portion of all possible solutions are often sought for genomic data mining problems. The success of many bioinformatics studies critically depends on the construction and use of effective and efficient heuristic algorithms, most of which are based on the careful application of probabilistic modeling and statistical inference techniques.

Challenge 5: Noisy high-throughput biological data

The next challenge derives from the fact that high-throughput biotechnical data and large biological databases are inevitably noisy because biological information and signals of interest are often observed with many other random or confounding factors. Furthermore, a one-size-fit-all experimental design for high-throughput biotechniques can introduce bias and error for many candidate targets.”

Challenge 6: Integration of multiple, heterogeneous biological data for translational bioinformatics research

The last challenge is the integration of genomic data with heterogeneous biological data and associated metadata, such as gene function, biological subjects’ phenotypes, and patient clinical parameters. For example, multiple heterogeneous data sets including gene expression data, biological responses, clinical findings and outcomes data may need to be combined to discover genomic biomarkers and gene networks that are relevant to disease and predictive of clinical outcomes such as cancer progression and chemosensitivity to an anticancer compound. Some of these data sets exist in very different formats and may require combined preprocessing, mapping between data elements, or other preparatory steps prior to correlative analysis, depending on their biological characteristics and data distributions. Effective combination and utilization of the information from such heterogeneous genomic, clinical and other data resources remains a significant challenge.”

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2253491/

These are just a few problems associated with data mining in relation to genomics that hinder accuracy and reliability. I recommend reading the rest of the review when you have the time. The challenges outlined here offer more evidence as to why the gathering and interpretation of data as done in genomics is not science.

End Detour.

  • While inattention to epistemology in genomic classification is troubling, the situation with clustering is truly astounding
  • As generally practiced, there is no predictive aspect and hence no scientific content whatsoever

Quick Detour on CLUSTERING:

An Introduction to Clustering and different methods of clustering

1. Overview

“Clustering is the task of dividing the population or data points into a number of groups such that data points in the same groups are more similar to other data points in the same group than those in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.

2. Types of Clustering

Broadly speaking, clustering can be divided into two subgroups:

  1. Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or not.
  2. Soft Clustering: In soft clustering, instead of putting each data point into a separate cluster, a probability or likelihood of that data point to be in those clusters is assigned.

3. Types of clustering algorithms

Since the task of clustering is subjective, the means that can be used for achieving this goal are plenty. Every methodology follows a different set of rules for defining the ‘similarity’ among data points. In fact, there are more than 100 clustering algorithms known. But few of the algorithms are used popularly, let’s look at them in detail.”

An Introduction to Clustering and different methods of clustering
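
The hard/soft distinction quoted above can be sketched in a few lines of Python (toy one-dimensional points and fixed centres, nothing genomic; a real algorithm would iterate assignments and centre updates until convergence):

```python
# Hard vs. soft assignment for a single step of a 2-cluster, 1-D example.
# Toy points only; real clustering iterates until convergence.

points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
centres = [1.0, 5.0]

# Hard clustering: each point belongs entirely to its nearest centre.
hard = [min(range(2), key=lambda k: abs(x - centres[k])) for x in points]

# Soft clustering: each point gets a weight for every cluster. Here we
# use simple inverse-distance weights; a Gaussian-mixture responsibility
# would use exp(-d^2) instead.
def soft_weights(x):
    inv = [1.0 / (abs(x - c) + 1e-9) for c in centres]
    s = sum(inv)
    return [w / s for w in inv]

soft = [soft_weights(x) for x in points]
print(hard)      # [0, 0, 0, 1, 1, 1]
print(soft[0])   # almost all weight on cluster 0
```

Note that the choice of distance, weighting scheme, and number of clusters are all made by the analyst, which is precisely the subjectivity Jain et al. point to below.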

In genomics, there are numerous clustering algorithms that can be used. There are too many approaches to list here, but each has its own set of strengths and weaknesses. New algorithms are constantly being created, leading to an overabundance of methods to choose from.

End Detour.

  • Jain et al. state that “clustering is a subjective process,” so that it lacks the basic scientific requirement of inter-subjectivity
  • A great deal of study needs to be done in this direction before clustering can practically provide scientific knowledge
  • In the meantime, so-called “validation indices” are sometimes used to support a clustering result, but these are often poorly correlated with the clustering error and therefore do not provide scientific validation
  • Epistemological considerations for genomics inexorably point to systems biology
  • Genomics, being the study of multivariate interactions among cellular components, requires systems-based modeling, in particular, the use of nonlinear stochastic dynamical systems, whether these be in the form of differential equations, discrete networks, Markov processes, or some other form of random process
  • Experiments need to be designed and carried out in a manner suitable for the construction of nonlinear dynamical systems and systems theory needs to be developed in ways appropriate to biological modeling
  • Dougherty concludes that these are imposing tasks
  • Nonetheless, based on his long experience with humanly designed systems, it is virtually certain that the study of biological systems cannot meaningfully progress without well-thought-out experiments and deep mathematics

Should we be relying on computer algorithms, modeling, and the mining/clustering of massive amounts of data, each with various drawbacks, limitations, and weaknesses, to shape what we are supposed to take as “scientific knowledge?” Seeing that so many different computational methods, each with its own limitations, are needed to interpret the results and paint what is thought to be an “accurate” representation of the data, is it any wonder there is a reproducibility crisis in the world of science and genomics? Dougherty was adamant that the collection, interpretation, and analysis of data is not science. Genomics does not observe a natural phenomenon and create experiments using valid independent and dependent variables in order to determine cause and effect. It creates an artificial phenomenon in a lab using questionable technology. It is concerned with the generation of massive amounts of data and the subjective determination and analysis of that data. This is used to create a construct that is never observed in reality other than as random A, C, T, G’s on a computer screen. While it is clear that the data generated by genomics does not constitute scientific knowledge, there is a strong case that it constitutes scientific fraud.

30 comments

  1. The late Sir John Maddox, editor of Nature saw this coming: “Is there a danger, in molecular biology, that the accumulation of data will get so far ahead of its assimilation into a conceptual framework that the data will eventually prove an encumbrance? Part of the trouble is that excitement of the chase leaves little time for reflection. And there are grants for producing data, but hardly any for standing back in contemplation”. Maddox, J. Nature 335, 11 (1988).

  2. Amazing work Mike! So much in this I have never thought about, nor ever read about, but I have had an intuition for quite some time that all this genetic / genomic “science” might just be a whole heap of rubbish and conjecture. Years ago I spent the money for 23andMe. What did I learn from it? That I am not ApoE4. Then I spent more money on various sites that would crunch my raw data and show me even more characteristics I might have or diseases that might be coming my way. “You might have blue eyes” “You have a propensity for stomach cancer” with some percentage of likelihood attached. I also learned that I am a fast metabolizer of caffeine. Whoopie. SNPs that are said to show this, that or the other with absolutely NO actionable items. It was all worthless. Hucksters taking my $10 or $15 to show me something that was supposed to mean something yet it all meant nothing. The War on Cancer, right? Based on genetics. Nope. Not happening. Then a big international conference some years back where geneticists could not even agree on what a gene is? They argued with no conclusion. This sentence is gold, “Genomics does not observe a natural phenomenon and create experiments using valid independent and dependent variables in order to determine cause and effect.” Bingo! Save your money, and your time. Genetics appears to be a lot like virology. Not science.

    1. I had heard from a few people that genetic companies like 23andMe generate much of their information from questionnaires people fill out beforehand. I have seen stories of people getting wildly different results from different companies. Then there are stories of convictions overturned due to faulty DNA testing. Genomics is definitely not science. It is a scam.

  3. Hi Mike — like Lynn, I just wanted to share my thoughts: Wow, what a mess of data worship! It feels like being in a hall of mirrors just reading about it. The belief in data being a sound measure of reality, or a way to determine the nature of reality, is so deeply believed in. It’s one of the tenets of this “hidden religion” where the tenets themselves are never questioned (they are tenets, after all!) and those that adhere to the religion, and practice it, don’t see it as something to question. “Data IS reality” is the belief, in short. It’s all based on the gene being seen as a fixed and static “thing”, like the body is built up like a mechanical object. So, Bruce Lipton and others who are saying that the genetics, the code, CHANGE constantly are simply ignored.

    Also, the phrase “The map is not the territory” kept ringing in my ears as I read this. Data seems to be a materialist’s substitute for knowledge. Then they want to create a map out of the data. They want a map like they want a business plan, and for the same reasons. They can USE a static map. They can’t USE something that is always changing and morphing in accordance with….. mind, emotions, intention, and the individual’s own volition! Oh, no no no no no. That is not a good business plan. Who cares if it is the truth, it’s not going to fly. They would rather have a map that works as a model for…. a business plan.

    I agree with Lynn that this line is the best: “Genomics does not observe a natural phenomenon and create experiments using valid independent and dependent variables in order to determine cause and effect.” (Wait, did Jordan Grant write that line? lol)

    I would say that the above statement will seem radical and might be seen as “the problem”, but to me it’s just the surface layer of a much larger problem. Not using the standards of “the scientific method” is, to me, the RESULT of deeper problems. It’s like a clue, or sign, or indicator, that they are WAY off not only in the methodology, but in the deeper philosophical underpinnings. The fact that their methodology does not include an independent variable and dependent variable — the fact that they throw all of that under the bus — is like the symptoms of a larger disease. This applies to the lack of real observation, too, I think. All the basics of logic and common sense are tossed out. “We don’t need no stinkin’ observation of reality, we don’t need an independent variable”. But why? How did we get here? Is it just a deterioration of methodology? That deterioration doesn’t just happen in a vacuum.

    The philosophical underpinnings are, of course, all about the dominant paradigm of materialism. At least that is getting closer to the roots of the problem. I think materialism can be summed up as: matter happens. Shit happens. It just happens to be here. lol It’s all random and mindless and if there is mind and consciousness and emotions and stuff like love, it emerged out of matter. Matter is primary. Causal. Well, maybe not causal, the randomness is causal, in this crazy belief system. Randomness created matter, then everything else happens out of that. (Does any scientist actually believe these tenets? One has to wonder.) Matter does NOT emerge out of mind, or consciousness, according to this view. No, it’s the other way around: consciousness emerged out of matter. Matter came first. God, or any concept of God, is in another department across the hall, in religious studies. God is not in the science department.

    But maybe even more basically, I think it all comes down to perception. Which is I guess another way of saying the same thing, because materialism is a way to perceive. So, in order for science to adhere to the materialist view it has to CONFORM to materialism. So, the methods emerge out of the philosophy.

    Obviously, perception is a huge topic. But very basically speaking, perception relates to our senses. It’s interesting to me that the senses are devalued in materialism. One would think they would be prioritized, yes? If the world is simply a material construction, randomly and mindlessly existing without any intelligence behind it all, then wouldn’t our senses be observing a material, objectively observed world? If the body is also an object, and our eyes just SEE A PREEXISTING REALITY, then wouldn’t materialism prioritize and worship our senses? Wouldn’t materialism value the senses, and wouldn’t objectivism maintain an allegiance to the physical senses as the proper and only way to observe the world? But what has happened is that materialism PLAYS BOTH SIDES. (That’s the best business plan.)

    Materialism has led to a devaluing of observation, and rational methodology. Fantasy trumps reason. They want the world to be mechanistic, but they abandon all rational observation in favor of fantasy models created in computers, with no foundational observation, no gold standard, to back up the model. They just make shit up with computers, that’s where this business plan leads. They worship the map and model. But the model is fantasy. So the materialists are believers in fantasy. This seems contradictory on the surface. But it isn’t, because MATERIALISM IS FANTASY. It’s not a material world. Sorry, Madonna. It’s just not.

    They will even worship the fantasy and say “don’t trust your senses”. This is so confusing. So, it’s all matter and we see objectively, but don’t trust your senses? Really?

    What is amazing is that it IS true that our physical senses are not perceiving “all of reality”. Physical reality IS illusionary in those terms. We don’t even know how perception works. Are our eyes creating as they perceive? Is it all subjective? (Yes, I think so.) Do our eyes project instead of receive an objective view? Well, our eyes are NOT independent organs! They are attached to our brains. There’s a clue!

    Our senses deal with a very narrow range of…. um…. data. Perceptual data! Our eyes only see within a very narrow range. Dragonflies see a different world than we do. Different senses.

    Basically, materialism is really about denying the primacy of consciousness, love, mind, and God. It’s the ego’s invention, and the ego LOVES to play both sides, so it can pretend to win. Little does the ego know, the game is much bigger than it can perceive!

    1. Wow! Very deep and beautifully written. I still believe you need to write your own book Carolyn, breaking down the philosophical side of this scam. From what I’ve learned, genomics is nothing but the reliance on data which is generated by (faulty) technology and subjectively interpreted to explain an unseen process. They create the material (data) on which they get to base their fiction. They have sold this belief that modern technology trumps all and that we can now “see” things we were never meant to see. However, what we are shown is merely an illusion. It is just another trick to deceive in order to keep people believing in and dependent upon the lie. Dougherty is one of the few who I have seen who has called out the lack of science involved in genomics. It was refreshing to read.

      And yes, I was channeling Jordan Grant a bit there. 😉

    2. Wonderful Carolyn! “So, it’s all matter and we see objectively, but don’t trust your senses?” I read this and wondered what the definition is of schizophrenia! So many things I never thought about that you have deeply thought about. I thank-you for this! Yes, maybe it all starts with philosophy: how you see the world. Mechanistically, where nothing exists that can’t be seen, but sometimes this “seeing” is faulty (think electron micrographs of particles taken from living beings)? Or it is claimed to exist by measuring, which can be just modeling (de novo assemblers, computer algorithms, In Silico constructs), i.e. making shit up using fancy hi-tech. And to top it off, all these thousands of data points, the AGTCs of genetic codes said to be in these particles? We know basically nothing about these genes that we can be certain of; we understand this after reading Mike’s blog post. What are these particles, and what is their function in the body? And we can throw it all out if genetic codes are a “moving target”, in constant flux due to us being complex biological creatures. Emotions, hopes, dreams, prayer, the future and the past creeping into our minds, all might influence them. I think of the gut, the heart and the mind (thanks Brecht). The gut is survival, might be the source of the most honest feelings we have. Primal. Then the heart, where the spirit lives, and when it is open, all things are possible. Then the mind. Ah, it can be a great trickster! Or an extremely useful tool to organize and triage and make sense of the feelings or perceptions coming from the gut and the heart. I would say materialists live 99% of their time in the mind. Then they use all kinds of word salad and change definitions to suit their purpose and employ technology to create the illusion their theories are reality. Man, what a mess! And this is how you manipulate the entire world with some fairy tale. Because people are distracted, unable or unwilling to even take a look and question the narrative. Or they lack the courage to do so, to stand up (perhaps alone) and say, “Hey, wait a minute!” We are told “trust the science” when there’s absolutely no science being done. As I have said a hundred times since early 2020, the emperor has no clothes.

    1. I have two friends who are really into the scientific method and have told me about many of the problems with quantum theory, as well as the issues relating to much of what we call science today. I’ve been primarily focused on virology but I do love looking at other pseudosciences. Thanks for the link! 🙂

  4. Erwin Chargaff quotes:
    “We do not know what life is, and yet we manipulate it as if it were an inorganic salt solution…”
    “What I see coming is a gigantic slaughterhouse, a molecular Auschwitz, in which valuable enzymes, hormones, and so on will be extracted instead of gold teeth.” — Erwin Chargaff
    ———————
    Nucleic Acid Research, Biology after Hamer
    Genetics has completely failed (genetic engineering does not work, mRNA technology does not work, etc.).
    Only for 10% of our proteins and enzymes do we have some raw building blocks that are similar in all people.
    For 90% of our enzymes and proteins we find no genetic material that would fit into the biochemical model they have.
    If we are not using oxygen in one part of the body, we lose huge amounts of nucleic acids along with some basic material (enzymes and proteins).
    When we use it again, we get it again. Why?
    Because nucleic acid in the form of RNA (famous from the mRNA jabs) comes into existence on its own, building huge amounts of different sequences, digesting, and building up in all varieties. That which fits with our metabolism and is useful stays longer and is finally transformed into its more stable form, DNA. This is why we are able to learn to deal with all kinds of toxic substances, if we are not killed by a huge amount.
    When you change your environment, food, exercise, or season, you change your metabolism.
    Matter follows our soul; the biggest obstacle we have to our healing is fear.
    There is a difference with biological, meaningful fears: they help us to avoid dangers (being burned, cold water, etc.). Fear sustained over a long time is very disturbing and damaging.
    Dr. Hamer’s understanding (the Five Biological Laws) is the basis for a better future and a stable life.
    STEFAN LANKA – ILSEDORA LAKER: BIOLOGY AFTER HAMER PT.2
    https://www.bitchute.com/video/CO31xdhikemG/
    At around minute 31. Rough transcript:
    https://media2-production.mightynetworks.com/asset/37482967/Biology_after_Hamer_-_Nucleic_Acid_Research.pdf?_gl=1*12bzd7m*_ga*MTIzMTkzOTQ4OC4xNjI1NzM1OTcw*_ga_T49FMYQ9FZ*MTY0OTAwNjAzNi44NDMuMS4xNjQ5MDA2MDY5LjA.
