OLAP cubes and data analysis: a marriage of convenience
Delaney Turner 270003RQ8K Delaney.Turner@ca.ibm.com | | Tags:  ibmsoftware
0 Comments | 3,096 Visits
The following is the fourth in a new six-part series on Advanced Data Visualization. Over the next three months, IBM visualization experts will explore new and emerging visual techniques and the underlying technologies you can deploy to better understand your data to transform insights into better business outcomes.
Frank van Ham is a well-known research scientist and an IBM Master Inventor with over a decade in experience in designing and deploying interactive information visualization. Some of his past projects include Many Eyes, a site for collaborative visualization and SequoiaView, a visual disk browser. Dr. van Ham currently works with the IBM Business Analytics division on integrating visualization into IBM's product portfolio.
In this blog post I wanted to highlight some of the potential incompatibilities between the concepts underlying OLAP (online analytical processing) cubes and the general task of data analysis. OLAP cubes are a data storage and retrieval technology that allows us to quickly request aggregate information across many different dimensions. OLAP cubes can do this because they have pre-computed and stored all of these aggregates when the cube was built. In a sense cube technology was invented as a storage optimization over then-standard transactional databases, and allowed us to greatly increase the speed and hence flexibility with which we can inspect high level aggregates of our data. This speed increase is what kick-started the area of interactive business analytics.
However, this optimization also carries one fundamental disadvantage. It ‘locks’ us into the data model we have to define before we can build the cube in the first place. Data doesn’t always have one definite model: In almost all cases the desired view of the data model depends on the current analysis task that a user wants to perform. In data analysis we don’t know which paths we are going to follow, and we certainly don’t know that when we build the data cube. In fact, the person building the data cube and the analyst will very often be two different people altogether! Below I’ll give one concrete example as why cubes are not always the right technology for interactive data analysis.
OLAP cubes require a user to partition all data dimensions into two categories: dimensions and measures. Measures are the numbers that I’m interested in observing because they tell me how my business is doing. Typical examples of measures include my sales numbers, my production costs and so on. Measures are almost always numerical values. Dimensions are usually categorical values that allow me to partition all of my data into different slices. Typical examples of dimensions are division or sales location. Now consider the following example: I run a small online store that sells items to many customers. Whenever customers buy something in my store, they can give their experience a rating from 1-5 stars. When I build the cube for analyzing this data, do I designate ‘rating’ as a measure or as a category? I might want to look at my sales over all ratings, like this:
In this case, rating is a category that I want to use to aggregate my measure (in this case, sales). This will help me detect any relationships between sales and customer satisfaction. But equally valid is the view below, in which I consider rating as a measure and want to view my average ratings over time:
Here, rating is considered a measure because I’m aggregating it over a number of different categories. This will help me detect whether my customer satisfaction is improving over time. If I hadn’t considered both of these use cases at the time I built my cube, I might not be able to answer one of these questions.
Although measures might tell me what my business is doing, they don’t always tell me why it’s doing that. For that, I might need to dig deeper into other attributes of my data. This is exactly the difference between interactive reporting and interactive analysis: interactive reporting tells me what is happening with my business, where analysis attempts to explain me why this is happening. The variables that explain me how my business is doing are usually known in advance, the variables that explain why it’s performing like that are usually not.
This distinction makes OLAP cubes well-suited to interactive reporting, where the information needs of a user are usually known in advance. To optimize data retrieval, we can rely on a relatively fixed data model, with known measures and fixed drill paths. On the other hand, pure data analysis applications require a data storage setup that requires less upfront modeling and can adapt quickly to different types of information needs. Columnar stores and in-RAM databases are some of the current technologies that have filled this gap, and we’ll see much more of them in the future of freeform data analytics.
Continue exploring visual analytics on IBM Many Eyes
Why stop the insight with this article? Visit IBM’s hub of visual analytics, IBM Many Eyes and join over 100,000 like-mined visualization enthusiasts, academia and professionals. The Many Eyes web community democratizes data visualization by providing a simple three step process to create and interact with a visualization using your data set. Then share or embed your visualization across the web or your social network.
Read previous entries in this series: