High-Quality Data Science Demands Independent Verification
James Kobielus | 2013-07-08
Junk science is everywhere. The media scoops up the fruits of this dubious endeavor: excited researchers rush to announce "breakthroughs," "discoveries," and other "findings." Never mind that the findings may never be replicated and thereby confirmed by independent investigators running their own tests, using their own data, and applying their own tools. And never mind that the findings may in fact be repudiated by future investigators who could never replicate them.
High-quality science is far more painstaking. Before the scientific community can coalesce around some dramatic new finding, competing teams of researchers must independently vet the methodologies and replicate the splashy new results. In theory, this canonical scientific process, abetted by the peer review of independent scientific journals, will identify any biases in the original research and allow only the most valid findings to gain broad media exposure. Of course, we all know that's not how it always works in the real world: everybody is chasing scientific glory, including the journals and media channels that prematurely publish flawed research.
High-quality data science should follow the same community-wide process for weeding out error and bias. But--where applied business-oriented data science is concerned--there are other fundamental reasons why bad research doesn't always get corrected.
First off, business-oriented data scientists produce valuable intellectual property--the data, models, and findings--that is generally inaccessible to their peers in other companies. Unless their immediate colleagues attempt to replicate their results--a duplication of effort that their cost/time-conscious bosses may frown on--their errors may go unexposed until it's too late. As for what "too late" can mean, Thomas Speidel, a commenter on one of my previous LinkedIn blog posts on this topic (http://linkd.in/11HYfJQ), expressed it nicely: "[W]hile presenting the wrong ad to a few users may have minimal impact and cost, making a large investment, or giving the wrong treatment to a patient may have quite a different effect."
Second, business-oriented data scientists in the same company, team, and project tend to standardize on the same modeling approaches and use the same (ostensibly high-quality) data. Consequently, even if one data scientist attempts to replicate the other's results, they're likely to replicate the same errors.
The rarely used alternative that would address these issues is, per commentator Dan Rice, to "use extraordinary stacked ensemble averaging across perhaps hundreds of different models built with diverse algorithms." For more general information on ensemble modeling, see Wikipedia's coverage of the topic. IBM SPSS Modeler also offers ensemble modeling features.
But even though ensemble modeling can enhance the replicability of predictive results, it too has its limitations, says Rice (with my interpolations in square brackets): "This does lessen the variability [of model predictive results], but the models [due to their myriad differences in underlying variables, relationships, and algorithms] can no longer be interpreted [in terms of illuminating the underlying cause-effect mechanisms], so you are 'driving blindly'."
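To make Rice's point about variability concrete, here is a minimal Python sketch. It is not the stacked, hundreds-of-models approach he describes: it simply averages three small hand-written models (a least-squares line, a 1-nearest-neighbor model, and a 5-nearest-neighbor model) and measures how much each prediction moves when the training data is redrawn. The synthetic data (y = 2x + 1 plus noise), the model choices, and every function name are assumptions made for illustration.

```python
import random
import statistics


def make_data(rng, n=40):
    """Synthetic data (an assumption): y = 2x + 1 plus Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + 1 + rng.gauss(0, 2) for x in xs]
    return xs, ys


def fit_linear(xs, ys):
    """Ordinary least-squares line; its slope is directly interpretable."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def fit_knn(xs, ys, k):
    """Average of the k training targets whose x is closest to the query."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return statistics.fmean([y for _, y in nearest])

    return predict


def prediction_spread(n_replicates=200, x_query=5.0, seed=0):
    """Std. dev. of each model's prediction at x_query across redrawn datasets."""
    rng = random.Random(seed)
    builders = {
        "linear": fit_linear,
        "1-nn": lambda xs, ys: fit_knn(xs, ys, 1),
        "5-nn": lambda xs, ys: fit_knn(xs, ys, 5),
    }
    preds = {name: [] for name in builders}
    preds["ensemble"] = []
    for _ in range(n_replicates):
        xs, ys = make_data(rng)  # a fresh sample, as if a new team re-ran the study
        member_preds = []
        for name, build in builders.items():
            p = build(xs, ys)(x_query)
            preds[name].append(p)
            member_preds.append(p)
        # The ensemble prediction is the plain average of its members.
        preds["ensemble"].append(statistics.fmean(member_preds))
    return {name: statistics.stdev(vals) for name, vals in preds.items()}


spread = prediction_spread()
for name, sd in spread.items():
    print(f"{name:>8}: sd of prediction = {sd:.3f}")
```

The spread of the averaged prediction can never exceed the average spread of its members, which is the "lessened variability" in Rice's quote. It can still be noisier than the single best member, and unlike the fitted line, the average has no slope or coefficient to inspect, which is the "driving blindly" half of the trade-off.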
In other words, data scientists often face a trade-off. Some modeling approaches can explain the causal relationships quite well but are weak on predictive strength, due perhaps to overfitting the model to historical data. Others, such as the ensemble approach discussed above, may predict quite well, but do so by averaging so many diverse "stacked" models that the underlying causal relationships become opaque.
The best science insists on independent verification of empirical models' explanatory strength and predictive replicability. Businesses should hesitate to base critical decisions on statistical models that promise too-perfect predictions (for reasons no one understands) or provide too-perfect post-hoc explanations (but are useless for understanding what may happen next).
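One way to picture what independent verification buys you is the following minimal Python sketch. It assumes synthetic data (y = 2x + 1 plus noise) and two hypothetical models: a least-squares line and a "memorizer" that simply returns the target of the closest training point. All names and seeds are illustrative assumptions, not anyone's production method. The memorizer looks perfect on the data it was built from, and only a freshly drawn sample reveals how it compares with the simpler model.

```python
import random
import statistics


def make_data(rng, n):
    """Synthetic data (an assumption): y = 2x + 1 plus Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + 1 + rng.gauss(0, 2) for x in xs]
    return xs, ys


def fit_linear(xs, ys):
    """Ordinary least-squares line."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def fit_memorizer(xs, ys):
    """1-nearest-neighbor: returns the target of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return statistics.fmean([(model(x) - y) ** 2 for x, y in zip(xs, ys)])


# The original team's data and an independently drawn sample (separate seeds).
train_x, train_y = make_data(random.Random(1), 100)
fresh_x, fresh_y = make_data(random.Random(2), 500)

results = {}
for name, fit in (("linear", fit_linear), ("memorizer", fit_memorizer)):
    model = fit(train_x, train_y)
    results[name] = (
        mse(model, train_x, train_y),  # error on the data the model was built from
        mse(model, fresh_x, fresh_y),  # error on independently collected data
    )
    print(f"{name:>9}: original-data MSE = {results[name][0]:.2f}, "
          f"fresh-data MSE = {results[name][1]:.2f}")
```

A reviewer who only ever sees the original-data column would pick the memorizer. The comparison on the fresh sample is the step that colleagues sharing the same data and the same modeling habits are least likely to perform.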
To what extent do you think independent verification and replication of data-science results are feasible in business settings?
Connect with me on Twitter @jameskobielus