It has become perhaps the most important guiding principle of today’s world of data science: “data is truth.” The statisticians, programmers and machine learning experts who acquire and analyze the vast oceans of data that power modern society are seen as uncovering undeniable underlying “truths” about human society through the power of unbiased data and unerring algorithms. Unfortunately, data scientists themselves too often conflate their work with the search for truth and fail to ask whether the data they are analyzing can actually answer the questions they ask of it. Why can’t data scientists be more like their counterparts in the physical sciences, who see not “universal truths” but rather “current consensus understanding”?
Given the sheer density of statisticians in the data sciences, it is remarkable how poorly the field adheres to statistical best practices like normalizing data and characterizing it before analysis. Programmers in the data sciences, too, tend to lack the deep numerical methods and scientific computing backgrounds of their predecessors, making them dangerously unaware of the myriad traps that await numerically intensive codes.
Most importantly, however, somewhere along the way data science became about pursuing “truth” rather than “evidence.”