It has become perhaps the most important guiding principle of today’s world of data science: “data is truth.” The statisticians, programmers and machine learning experts who acquire and analyze the vast oceans of data that power modern society are seen as uncovering undeniable underlying “truths” about human society through the power of unbiased data and unerring algorithms. Unfortunately, data scientists themselves too often conflate their work with the search for truth and fail to ask whether the data they are analyzing can actually answer the questions they ask of it. Why can’t data scientists be more like their counterparts in the physical sciences, who see not “universal truths” but rather “current consensus understanding”?
Given the sheer density of statisticians in the data sciences, it is remarkable how poorly the field adheres to statistical best practices like normalizing and characterizing data before analyzing it. Programmers in the data sciences, too, tend to lack the deep numerical methods and scientific computing backgrounds of their predecessors, making them dangerously unaware of the myriad traps that await numerically intensive code.
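To make the normalization point concrete, here is a minimal sketch in Python. The monthly figures are invented purely for illustration and the `normalize` helper is hypothetical; the only point is that a raw count can appear to surge simply because the volume of collected data grew.

```python
# Hypothetical monthly figures, invented for illustration only.
keyword_mentions = [120, 180, 260, 390]          # posts matching a keyword
total_posts = [10_000, 15_500, 23_000, 36_000]   # all posts collected that month


def normalize(counts, totals):
    """Return each count as a share of its period's total volume."""
    return [count / total for count, total in zip(counts, totals)]


shares = normalize(keyword_mentions, total_posts)

# Growth from the first month to the last, raw versus normalized.
raw_growth = keyword_mentions[-1] / keyword_mentions[0]
share_growth = shares[-1] / shares[0]

print(f"raw mentions grew by a factor of {raw_growth:.2f}")
print(f"share of all posts changed by a factor of {share_growth:.2f}")
```

In this made-up series the raw mentions more than triple while the keyword’s share of all posts actually declines, which is the kind of basic characterization a chart of raw counts alone would hide.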
Most importantly, however, somewhere along the way data science became about pursuing “truth” rather than “evidence.”
We see piles of numbers as containing indisputable facts rather than merely a given constructed reality capturing one possible interpretation.
In contrast, the hard sciences are about running experiments to collect evidence, building theories to describe that evidence and arriving at temporary consensus, together with the willingness to allow today’s understanding to be readily upended by new evidence or descriptive theories.
Most importantly, all evidence in the hard sciences is treated as suspect and tainted by the conditions of its collection, requiring triangulation and replication. This is in marked opposition to the data sciences’ habit of relying on single datasets and failing to run even the most basic of characterization tests.
In the sciences, all knowledge is accepted to be temporary, based on the limitations of experimentation, simulation and current theories. Experiments are run to gather evidence to either confirm or contradict current theories. In turn, theories are adjusted to fit the currently available evidence. Experiments that appear to strongly contradict existing understanding are subjected to extensive replication until the preponderance of evidence leaves no other available conclusion but that current theory must be amended to account for this new information.
Even basic “laws” are viewed not as dogmatic undisputed truth, but rather as evidentiary understanding that has withstood all attempts to refute it yet may eventually be replaced by new knowledge.
The hard sciences are replete with disagreements, novel experiments that contradict existing theories and competing theories without an obvious winner. Yet physicists and chemists do not speak of “truth” and “fiction”; they work to gather evidence for or against each possible explanation.
Most importantly, the hard sciences balance three things: the evidence already gathered through experimentation, the design of new experiments to gather evidence that is not yet available, and the theory needed to explain it all.
In contrast, data science has increasingly become about making use of the most easily obtainable data, not the data that best answers the question at hand.
In fact, much of the bias of deep learning comes from the reliance of the AI community on free data rather than paying to create minimally biased data.
Much like deep learning, the broader world of data science has been marred by its fixation on free data, rather than the best data. Look across the output of any major company’s data science division and one will find that most of its analyses are based on whatever data the company already has at hand or can obtain freely from the web or cheaply from vendors.
Few companies step back to ask what the best data for a given question would be, let alone have the resources and budget to create that dataset. Instead, data science divisions are typically asked to produce ever more analyses ever faster and ever more efficiently with ever fewer resources per analysis.
Rather than spend months commissioning and executing a methodologically sound survey instrument to collect consumer feedback about a new product, the modern data sciences division is far more likely to turn to what it knows: running a few keyword searches on Twitter and pasting the resulting graphs into a PDF report.
In fact, few data scientists are even familiar with survey design, let alone understand that keyword searches of tweets may yield a result that bears little resemblance to reality.
The hard sciences do look for new ways of analyzing existing data to answer their fields’ questions, but they are also constantly designing new experiments to gather new data. Data science has become more about analyzing existing data and trying to find some way of connecting it back to the given question. Creating new datasets is typically viewed as out of scope.
Yet, quantitative analysis lends the aura of “truth” organically emerging from massive piles of data examined by unerring machines.
All of the badly biased data, flawed algorithms, random seeds, wildly wrong estimations, confirmation bias and myriad other damaging influences on our results are hidden from the ultimate consumers of those results by the mesmerizing “Apple effect” of beautiful graphs that convey hard certainty in what may amount to nothing more than random guesses based on wrong data analyzed by flawed algorithms with bad parameters.
None of this matters, however, because data science has become about lending false credibility to decisions we’ve already made, rather than seeking out what our data tells us.
In short, we search our data until we can find evidence to support our preordained conclusions, wrapping them in the false security of “big data” and the assumption that from large enough data emerges indisputable “truth.”