Counterintuitive Data Science Methods May Yield Keener Analytical Insights
James Kobielus | 2013-08-02
Mathematics is not a hermetic metaphysical pursuit, but rather a field where researchers craft and tweak algorithmic approaches that are suited to various problem domains. The best mathematicians know it's a dead end to develop new approaches that suffer from any or all of these limitations: they have no real-world applications, they consume an inordinate amount of computing resources, or they are so complex that almost no one else understands or knows how to apply them.
The best statistical-analysis algorithms provide tools for collective discovery of quantitative relationships, preferably of an empirical nature where science comes into the picture. However, sometimes the traditional approaches get in the way of data-driven insight extraction. The underlying algorithms can just as easily obscure key quantitative relationships as reveal them. New branches of the mathematical arts often emerge to help scientists see patterns that are otherwise dark. Think of Newton, modern physics, and the pivotal impact of the calculus. Think of Mandelbrot, modern chaos theory, and fractal dimensionality.
As more scientists incorporate big data into their working methods, they're going to re-assess whether the mathematical and statistical algorithms in their data-science toolkits are as effective at peta-scale as in "small data" territory. One key criterion is whether machine-learning algorithms can continue to calculate "good enough" predictions from data at extreme scales. One key way to define "good enough" is "efficiently executable with available big-data platforms in an acceptable timeframe while delivering actionable results."
In that regard, I recently came across an excellent article presenting a new mathematical approach for tuning otherwise "inferior" machine-learning algorithms for big data. Within the context of the article, the author, Brian Dalessandro, essentially defines "inferior" as any algorithm that degrades the quality of training-set data that is used to tune the statistical model.
What was most noteworthy about the discussion was the counterintuitive thrust of the approach: an algorithm that is inferior on one attribute (e.g., data quality) can also be superior on others (e.g., predictive accuracy, efficient linear scaling, cost-effectiveness on big-data platforms). Dalessandro outlines an approach that relies on "stochastic gradient descent" (SGD) and feature-hashing algorithms to reduce the "dimensionality" (i.e., the number of features/variables) being modeled. From a statistical analysis standpoint, the approach increases one type of modeling error ("optimization error") in order to reduce the other types ("estimation error" and "approximation error") that also determine modeling accuracy.
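To make the SGD-plus-hashing idea concrete, here is a minimal sketch in plain Python. It is not Dalessandro's code. The bucket count, learning rate, toy data, and function names (`hash_features`, `sgd_logistic`, `predict`) are all assumptions chosen for illustration, and logistic regression is just one model that SGD can fit.

```python
import hashlib
import math
import random

N_BUCKETS = 2 ** 16  # assumed size of the hashed feature space


def hash_features(tokens, n_buckets=N_BUCKETS):
    """Hashing trick: map string tokens to a sparse {index: value} vector."""
    vec = {}
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h % n_buckets
        # A second bit of the hash picks a sign, so collisions tend to cancel.
        sign = 1.0 if (h >> 63) & 1 == 0 else -1.0
        vec[idx] = vec.get(idx, 0.0) + sign
    return vec


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sgd_logistic(examples, n_buckets=N_BUCKETS, lr=0.1, epochs=5, seed=0):
    """Fit logistic regression with one gradient step per example (SGD).

    examples: list of (tokens, label) pairs with label in {0, 1}.
    """
    rng = random.Random(seed)
    w = [0.0] * n_buckets
    data = [(hash_features(toks, n_buckets), y) for toks, y in examples]
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            p = sigmoid(sum(w[i] * v for i, v in x.items()))
            grad = p - y  # derivative of the log loss w.r.t. the score
            for i, v in x.items():
                w[i] -= lr * grad * v  # update only the features present
    return w


def predict(w, tokens):
    """Predicted probability of label 1 for a bag of tokens."""
    x = hash_features(tokens, len(w))
    return sigmoid(sum(w[i] * v for i, v in x.items()))


# Toy data, invented purely for illustration.
toy = [
    (["great", "movie"], 1),
    (["great", "fun"], 1),
    (["awful", "movie"], 0),
    (["awful", "boring"], 0),
]
weights = sgd_logistic(toy, epochs=50)
```

The weight vector has a fixed length no matter how many distinct tokens show up, and each update touches only the handful of buckets present in one example. That is the memory and scaling appeal. The cost is that colliding tokens share a weight and that SGD never fully converges, which is the "inferior" part of the bargain.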
Dalessandro makes it clear why this algorithmic approach is suited to big data: "By choosing SGD, one introduces more optimization error into the model, but using more data reduces both estimation and approximation errors. If the data is big enough, the trade-off is favorable to the modeler." Essentially, it's favorable to the modeler in analytical problem domains, such as natural language processing, in which the approach's optimization errors are not showstoppers.
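One common way to write down the three error types in that quote is the decomposition used in the large-scale learning literature. The notation below is assumed for illustration and is not taken from Dalessandro's article: E(f) is the expected error of predictor f, f^* is the best possible predictor, f^*_F is the best predictor within the chosen model family F, f_n is the exact minimizer of training error on n examples, and \tilde{f}_n is what the optimizer actually returns.

```latex
\begin{align*}
\underbrace{E(\tilde{f}_n) - E(f^*)}_{\text{excess error}}
  &= \underbrace{E(f^*_{\mathcal{F}}) - E(f^*)}_{\text{approximation error}} \\
  &\quad + \underbrace{E(f_n) - E(f^*_{\mathcal{F}})}_{\text{estimation error}} \\
  &\quad + \underbrace{E(\tilde{f}_n) - E(f_n)}_{\text{optimization error}}
\end{align*}
```

The terms telescope, so the identity holds by construction. Read this way, SGD accepts a larger third term (it stops short of the exact minimizer) in exchange for being cheap enough to process far more examples, which shrinks the estimation term and makes richer model families affordable.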
He also mentions other benefits, such as enabling more complex feature/variable sets to be modeled in constrained memory resources and providing a more privacy-friendly way to store and use personal data. But he also notes a trade-off: the approach introduces more chaos into the modeling results.
Though highly arcane, this is the soul of practical data science: fitting the mathematical, statistical, and algorithmic approaches to the problem at hand and adapting them to the big-data resources at our disposal. Like any engineering discipline, this involves making trade-offs among algorithmic approaches.
It's applied math on the proverbial steroids.
Connect with me on Twitter @jameskobielus