The data analysis and data processing paradigm has transformed considerably since the mid-nineties. Small to medium-size data sets used to be by far the most prevalent, and both the number of variables and the number of subjects were typically modest. Data was expensive, so elaborate methods built around computation-intensive estimation, together with accurate and robust inference, dominated data analysis.
Fast forward to the present. Data is abundant, and faster, more streamlined methods are required to extract the essential information quickly. There are still areas, such as clinical trials, where data remains relatively scarce and expensive and must be examined carefully. In most domains, however, both old and new, the availability of digital data has simply exploded.
How can we even begin to address the issue?
A leap in computing technology is developing in parallel, allowing us to approach data analysis in new ways. Rather than relying on expensive servers with many computing cores and large amounts of memory and disk space, a cluster of commodity computers can be employed instead. The data is distributed across the cluster, so moving it around for central processing is impractical; instead, the computation is brought to the data, an approach made possible by highly parallel and distributed computing solutions such as Hadoop.
The highly parallel and robust MapReduce framework is available not only for data access and manipulation but also for data analysis. Rather than methods requiring sequential, iterative computation, methods that execute in one or a few passes over the data are preferred. As a beneficial side effect, a large number of analyses can be computed together in the same pass.
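To make the single-pass idea concrete, here is a minimal sketch in Python of the MapReduce pattern, assuming a toy data set; the partitioning and function names are invented for illustration and do not come from Hadoop or any IBM product. Each mapper condenses its local partition into sufficient statistics, and the reducer merges them, so the mean and variance of the full data set emerge from a single pass over the data:

    from functools import reduce

    def map_partition(rows):
        # Map step: summarize one local data partition into its
        # sufficient statistics (count, sum, sum of squares).
        n, s, ss = 0, 0.0, 0.0
        for x in rows:
            n += 1
            s += x
            ss += x * x
        return n, s, ss

    def merge_stats(a, b):
        # Reduce step: merging sufficient statistics is just
        # component-wise addition, so it parallelizes trivially.
        return a[0] + b[0], a[1] + b[1], a[2] + b[2]

    # Toy stand-in for data partitions living on three cluster nodes.
    partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]

    n, s, ss = reduce(merge_stats, map(map_partition, partitions))
    mean = s / n
    variance = ss / n - mean * mean  # population variance
    print(mean, variance)  # 5.0 6.666...

Because the merge is associative, the same pattern scales from three lists on one machine to thousands of partitions on a cluster, and nothing ever has to be shipped to a central node except a few numbers per partition.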
More importantly, the problems presented by big data often differ from those considered for smaller data sets. The methods and algorithms of the past often do not scale to the new data sizes, so new methods, combined with parallel computing, come to the rescue when these emerging problems need to be addressed.
Finally, while computation used to be just one important aspect of data analysis, it has become the key to it. Nevertheless, both creative and judicious application of statistical and machine learning methods and new research are essential to sustain the momentum built to date in this emerging discipline.
At this year’s Joint Statistical Meetings in Boston, you can hear more from Damir in his featured session, “Tradeoffs in Big Data Modeling” (Session #9), which begins at 3:20pm in room CC-103. You can also stop by the IBM booth to connect with Damir.
Tradeoffs in Big Data Modeling (Session #9)
Clustering and Feature Selection for Big Data – begins at 3:20pm
Damir Spisic, IBM Advisory Statistician, R&D – Statistics and Data Mining Component