Guest post from Frank van Ham, Master Inventor, Information Visualization and Visual Interaction Expert
Many years ago I pursued a Computer Science PhD trying to find better ways to give people insight into the structure of large networks. By large network data, I mean data that describes connections between entities: the local network of your friends on Facebook, a very large database of phone calls between different people, relationships between different genes in a genome, or software dependencies in a large code base. Network data can be found in almost any context, but traditionally we haven’t been very good at giving users an overview of what is going on in such a network. Visualization is a critical component here, as it’s very hard for humans to form a mental image of a network’s structure without actually seeing a visual rendition of it. To prove my point, try to form a mental image of this set of network data (without grabbing pen and paper, no cheating!).
The most straightforward way to approach the problem is to create tools that extract the network data from a database and then render it. Arguably this helps the user figure out what is going on, detect high-level structure and perhaps spark further investigation. There are plenty of algorithms that will convert the dataset I described above into a visual representation of a 3x3 grid. Watching these algorithms perform their magic can be mesmerizing, and the images they generate can certainly be captivating in some cases. For example, consider these displays of the Internet backbone (left) and a network of gene interrelations (right).
Captivating maybe, but I’d argue they’re not particularly useful for learning anything about the underlying data itself. The most we can glean from these diagrams is a rough clustering of nodes (assuming the clustering we find is not an artifact of the layout algorithm). The visualization research community (myself included) has developed a number of methods to avoid having to deal with these “hairballs”, such as adjacency matrix representations, clustering methods, visual distortion techniques, alternative layout techniques and data abstraction techniques. All of these make the problem somewhat less glaring, but none of them really solves the core issue, which is mostly one of too much data at the wrong level of detail. And there is no single fixed level of detail: depending on the task at hand, I might need to investigate my data from a different perspective.
So, let’s step back to business analytics for a moment. One of the key interfaces to data in many of today’s business analytics applications is the pivot table. Pivot tables allow us to very quickly aggregate millions and millions of data rows into smaller grids that summarize the data along (typically) two dimensions, plotting an aggregated number at the intersection of different members. Many basic visualizations (though not all of them) are just different renditions of data in a pivot table. For example, here are a number of dashboards generated by IBM Cognos Insight, driven by the crosstab definitions above the table.
IBM Cognos Insight driving visualizations through crosstabs.
Crosstabs for structured data can generally be defined by assigning one categorical set of values to the rows, another categorical set of values to the columns, and designating one numerical attribute to drive the values in the grid. This allows me to generate rollups of my data from different perspectives: I can view total sales amount by business unit and geography, or the total number of products sold by product color and age category.
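The rows-columns-measure recipe above can be sketched in a few lines of pandas; the sales records and column names here are made up purely for illustration.

```python
import pandas as pd

# Hypothetical sales records; the attribute names are illustrative only.
sales = pd.DataFrame({
    "business_unit": ["Hardware", "Hardware", "Software", "Software"],
    "geography":     ["EMEA", "Americas", "EMEA", "Americas"],
    "amount":        [120.0, 80.0, 200.0, 150.0],
})

# One categorical attribute on the rows, one on the columns, and an
# aggregate of a numerical attribute in the cells.
crosstab = sales.pivot_table(index="business_unit",
                             columns="geography",
                             values="amount",
                             aggfunc="sum")
print(crosstab)
```

Swapping the `index`, `columns` or `aggfunc` arguments gives the different perspectives described above without touching the underlying rows.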
Now consider a use case where I’m operating an international airline. I have a very large database of flight bookings from each of my customers. Each booking consists of a number of attributes, including origin airport code, destination airport code, customer id, flight designation, booking date, price and so on. Just as I would with normal business data, I can generate a crosstab that shows me the total revenue booked per destination airport by year, by dragging years to the columns, destination airport to the rows and selecting the sum of the price as the measure to display inside the grid. However, I can also generate a crosstab that gives me the total number of passengers travelling between different airports, by dragging destination to the columns, origin to the rows and selecting a count of customer ids as the value to display. That gets me a weighted matrix of the number of connections between airports, which I can then easily convert into a graph display.
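As a sketch of that conversion, the snippet below builds the passenger-count crosstab from a toy bookings table and reads the non-zero cells back out as weighted edges; the airport codes and customer ids are invented for the example.

```python
import pandas as pd

# Hypothetical bookings table; codes and ids are made up.
bookings = pd.DataFrame({
    "origin":      ["JFK", "JFK", "LAX", "ORD", "ORD"],
    "destination": ["LAX", "ORD", "JFK", "JFK", "LAX"],
    "customer_id": [1, 2, 3, 4, 5],
    "price":       [300.0, 250.0, 320.0, 280.0, 260.0],
})

# Origin on the rows, destination on the columns, a count of customer
# ids in the cells: a weighted adjacency matrix of the flight network.
matrix = bookings.pivot_table(index="origin",
                              columns="destination",
                              values="customer_id",
                              aggfunc="count",
                              fill_value=0)

# Every non-zero cell is a weighted, directed edge of the graph.
edges = [(o, d, int(matrix.loc[o, d]))
         for o in matrix.index for d in matrix.columns
         if matrix.loc[o, d] > 0]
print(edges)
```

The resulting edge list is exactly what a graph layout library expects as input, so the step from crosstab to network display is mechanical.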
Converting a crosstab (left) into a network display (right).
Even though a five-by-five crosstab could be considered small, in many real-world cases the number of rows and columns can run into the hundreds. Often I don’t want to report network data at that level of detail: I want to report on the aggregated number of passengers travelling between states, not between airports. In that case, I can specify that the data on both edges of my crosstab should be aggregated up to the state level, in almost exactly the same way as I would for a regular crosstab. The same goes for weighting the connections by the aggregated revenue for each leg: all I’d have to do is select a different value to display in the cells.
We can roll up networks, just like we can roll up pivot tables.
In the small sample cases above, I’ve reduced the number of nodes in my diagram by aggregating connections up to higher levels in the associated hierarchies. Though the reduction in the sample is small, you can imagine that I could easily aggregate arbitrarily many connections into a much smaller number of aggregated connections. If you compare the two network layouts above you’ll see the aggregation is consistent: if there is a connection between two airports, there will be a connection between their associated states at the higher level. The one disadvantage is that aggregated crosstabs will generally be much denser than their non-aggregated counterparts, which in turn makes layout harder. This can be partially addressed by filtering the resulting cells on a simple weight threshold or on more complex statistical properties.
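The rollup-and-filter idea can be sketched as follows: map each airport to its state, pivot on the state columns instead of the airport columns, and then zero out weak aggregated connections. The airport-to-state mapping, leg data and the 100-passenger threshold are all assumptions made up for the example.

```python
import pandas as pd

# Hypothetical airport-to-state hierarchy and leg-level passenger counts.
state_of = {"JFK": "NY", "LGA": "NY", "LAX": "CA", "SFO": "CA"}
legs = pd.DataFrame({
    "origin":      ["JFK", "LGA", "JFK", "LAX", "SFO"],
    "destination": ["LAX", "LAX", "SFO", "JFK", "LGA"],
    "passengers":  [120, 80, 60, 150, 40],
})

# Roll both edges of the crosstab up to the state level of the hierarchy.
legs["origin_state"] = legs["origin"].map(state_of)
legs["dest_state"] = legs["destination"].map(state_of)
rolled = legs.pivot_table(index="origin_state",
                          columns="dest_state",
                          values="passengers",
                          aggfunc="sum",
                          fill_value=0)

# Filter weak aggregated connections to keep the rolled-up graph sparse;
# the threshold of 100 passengers is arbitrary.
strong = rolled.where(rolled >= 100, 0)
print(strong)
```

Note that every airport-level edge contributes to exactly one state-level cell, which is the consistency property described above: an airport connection always survives as (part of) a state connection before any filtering is applied.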
The idea I’ve sketched here is not new, of course: its core ideas appeared in an existing approach for networks called ‘pivotGraph’, which defines similar rollup operations. I think one powerful, yet underexplored, approach for gaining insight into networks is similar to the one we use for gaining insight into large amounts of transactional data: providing users with a flexible means to generate aggregated summary data along the dimensions they are interested in.
Why stop the insight with this article? Visit IBM’s visualization hub, IBM Many Eyes, and join over 100,000 like-minded visualization enthusiasts, academics and professionals, including additional insights from Frank van Ham and other IBM visualization luminaries.
Frank van Ham is a well-known research scientist and an IBM Master Inventor with over a decade of experience designing and deploying interactive information visualization. Some of his past projects include Many Eyes, a site for collaborative visualization, and SequoiaView, a visual disk browser. Dr. van Ham currently works with the IBM Business Analytics division on integrating visualization into IBM's product portfolio.