Two + Two = Zero
Delaney Turner | Delaney.Turner@ca.ibm.com | Tags: ibmsoftware
The following is the fifth of a new six-part series on Advanced Data Visualization. Over the next three months, IBM visualization experts will explore new and emerging visual techniques and the underlying technologies you can deploy to better understand your data and transform insights into better business outcomes.
Graham Wills is the lead architect for IBM’s visualization engine. He has two decades of experience in research and implementation of visualization systems in areas including statistical models, geographic and temporal visualization, large-scale networks and coordinated views. He has published widely in the field, and his recent book, Visualizing Time, is currently available on Amazon.
No, this isn’t a piece on modulo arithmetic, binary logic or the like. No need to call in the mathematicians and programmers. What we are going to discuss here is visualization design and how different visual features work together.
The figure below is an example of a well-designed chart. It’s a scatter plot of two numeric fields, with color used to encode a third field. When we look at the figure, we immediately notice two important features:
The data for this chart are in fact simulated data. The X and Y locations of the points are what is called “stratified random” – they are random, but only within certain bounds. In this case the data consist of 100 points placed on a 10x10 grid with a bit of random noise added to allow them to move a little. One of the reasons a scatter plot is such a great tool for numeric data is that we have a very good ability to assess distances from a fixed point – in this case, distances from the axes or boundaries of the data space. This chart draws a box around the data space and adds faint grid lines to make those comparisons even easier. That makes it easy to spot relationships between fields used for position on a scatter plot.
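For readers who want to reproduce this kind of data, here is a minimal sketch of “stratified random” placement in Python. The function name, the jitter size of 0.3 grid units and the fixed seed are assumptions made for illustration; the article does not say how much noise was used for the chart.

```python
import random


def stratified_random_points(n_side=10, jitter=0.3, seed=0):
    """Return n_side * n_side (x, y) points: one per grid node, nudged by noise.

    The jitter and seed values are illustrative assumptions, not the
    settings used for the chart in the article.
    """
    rng = random.Random(seed)
    points = []
    for row in range(n_side):
        for col in range(n_side):
            # Start at the grid node, then let the point move a little.
            x = col + rng.uniform(-jitter, jitter)
            y = row + rng.uniform(-jitter, jitter)
            points.append((x, y))
    return points


points = stratified_random_points()
print(len(points))  # one point for each cell of the 10x10 grid
```

Because every point stays close to its own grid node, the points end up more evenly spaced than purely random points would be, which is the regularity the scatter plot makes visible.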
This is a general presentation rule – if you want to allow people to compare numeric values, the two best ways to do so are aligned lengths (bars on a bar chart which all start at the X axis are a simple and powerful example) and aligned distances (like the scatter plot). In the chart we present above, the regularity is immediately apparent. If we tried to use angles or color or something like that to show one of the fields, it would be much harder to spot that regularity.
The second feature of this chart is related to the field we use for color. Color is a seductively powerful way of encoding information. For a lot of human evolutionary history, it has been critical that we can identify items based on color. We have a strong ability to differentiate greens, a pretty good ability to differentiate shades of red and a relatively weak ability to differentiate blues. This is very probably because those are the colors of the foods that we and our ancestors ate. You are here, at least in part, because your great-great-(etc.)-grandparent was able to tell subtle differences between a red/green nutritious plant and a red/green poisonous one.
Color is a “grouping” function – we see colors in groups, not really as a continuous scale. Even if we present people with a chart that goes smoothly from blue to red (say), they will perceive it more in terms of groups of similar-colored items. For this reason, color is an excellent encoding to use for a categorical field. Color also has no natural order. We can impose orders like blue/red, or heat scales, and we can learn scales like those used in maps (blues get darker as the altitude goes below zero, browns get stronger as the altitude increases above the baseline), but these are not natural.
This chart is using color just as an indicator that a point belongs to a given group. This is the simplest and most effective use of color, and so this chart works: it represents the data not only truthfully, but also in a way that fits with our ability to interpret it.
We saw there was a pattern in the way the X and Y fields interacted – they were distributed regularly, more spaced out than we would expect. This chart is also clear in that we do not draw false conclusions about the locations of the groups. The color of the points does not appear to depend on their locations.
So far, so good. Now consider the chart on the left. This chart uses the same X and Y data, but instead of using color to map a field, we use symbol shape. Take a moment to look at the chart and compare it to the previous one.
This chart is effectively the same as the previous one. Although the field used for symbol shape contains different data, it is the same type of data (categorical) and, as in the previous chart, has three groups – a “default” group with most of the data, a group with 4 items (“green” previously, “square” here) and a singleton group (“red” previously, “plus” here).
Symbol shape works in a similar way to color. It is good for categories, has no particular order, and we process it mentally using a “grouping” function. People using charts based on symbols will say things like “the square points are mostly …” or “the bottom-right sector contains points from both plus and square groups”. The same language is used when working with charts using color.
So, off to our third example. Given our success with coding one field with color and another with symbol, it is natural for us to want to use both! If we have two numeric data fields and two categorical fields, it seems pretty clear that we can make a good chart using X and Y for the numeric values and color and shape for the categorical ones.
And, in the figure to the side here, this is exactly what we have done. Again, take a good look at the figure and compare it to the previous ones and see what conclusions can be drawn from it.
A famous quotation (much mangled in repetition) from American journalist H.L. Mencken (pictured at right) is “There is always an easy solution to every human problem--neat, plausible, and wrong”. This chart is not deceitful in purpose; it doesn’t misrepresent the data. It also follows good advice about drawing charts, and each mapping of a categorical field is, taken on its own, very plausible and very neat. In fact the only reason this chart fails is that, at a fundamental level, combining shape and color just doesn’t work.
When we process symbol and color in our brains (or maybe just outside them … I’m not going deeply into the optic processing system in this article), we process them very separately. When we look at the 100 items in the first chart, we instantly spot the unusual colors. If we had a million points, we would do that identification just as fast. Similarly, if we presented a million circles with just four squares and a single plus, we would immediately note and classify those unusual points. What we cannot do is process both at the same time. We cannot spot combinations of color and shape without detailed thought.
The third chart we showed works moderately well for comparing groups of color OR groups of symbol. The presence of the other encoding is distracting, but we can cope with that without much cognitive overhead. But if we wanted to deal with each separately, we could just use two charts more simply and more easily. The promise of combining both mappings in one chart is that we can spot patterns between them. But we cannot.
There is a critical feature of this chart that we cannot immediately spot – to find it we must carefully process each of the unusual items and investigate them sequentially. If you found the following feature in the chart – congratulations! It is not obvious and needs mental work to find. In this chart there is exactly one point that is both an unusual color and an unusual shape – the green square at the center bottom of the chart.
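What the eye has to do item by item, a program can do with a simple filter. The sketch below uses made-up rows that only mirror the group sizes described in this article (four green points and one red, four squares and one plus, with circles as the default shape); they are not the actual chart data, and the label “default” for the common color is a hypothetical name.

```python
from collections import Counter

# Hypothetical (color, shape) rows that match the article's group sizes:
# 4 green + 1 red unusual colors, 4 square + 1 plus unusual shapes.
points = (
    [("default", "circle")] * 91
    + [("green", "circle")] * 3
    + [("red", "circle")]
    + [("default", "square")] * 3
    + [("default", "plus")]
    + [("green", "square")]
)

# Count how often each color/shape combination occurs.
combos = Counter(points)

# Keep only combinations that are unusual in BOTH encodings.
both_unusual = [
    (color, shape)
    for (color, shape) in combos
    if color != "default" and shape != "circle"
]

print(both_unusual)
```

Cross-tabulating the two fields in this way shows the joint group directly, which is exactly the comparison the combined chart fails to make visible.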
This is a critical piece of information. The red point is 1% of the data. If you assume color and shape are independent, then being unusual in color and being unusual in shape each have a 5% probability, and so the combination has a 0.25% probability of happening by chance (5% multiplied by 5%). That makes the green square four times as unusual as the red point. It should be the most important point in the chart, and yet it is not visually obvious that this point is more than “slightly unusual”. This chart contains only 100 data points. The task becomes much more complex as the data and groups increase in number.
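The arithmetic behind that comparison can be checked in a few lines of Python. This is only a sketch of the independence calculation; the variable names are illustrative, and the counts are the group sizes given in the article.

```python
from fractions import Fraction

n_points = 100
unusual_color = 4 + 1  # four green points plus one red point
unusual_shape = 4 + 1  # four squares plus one plus sign

p_color = Fraction(unusual_color, n_points)  # 5%
p_shape = Fraction(unusual_shape, n_points)  # 5%

# If color and shape are independent, the joint probability is the product.
p_both = p_color * p_shape

# The single red point makes up 1% of the data.
p_red = Fraction(1, n_points)

print(f"{float(p_both):.2%}")  # chance of being unusual in both encodings
print(p_red / p_both)          # how many times rarer than the red point
```

Under the same assumption, the expected number of doubly unusual points among 100 is only 0.25, so seeing even one is worth a closer look.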
This article has been a cautionary tale. The human visual system is complex, and perhaps the strongest overall message to take away is that coding four or more fields into one chart is hard. It’s almost certainly best to avoid using two encodings like color/shape/size/orientation/texture for different fields if you have any interest in seeing relationships between those fields. Position, on the other hand, is very good for showing relationships. Maybe if you really need to see four fields, it might be better to use three for position and one for color? As we have seen, using two + two can lead to a chart that rates a solid “zero”.
Continue exploring visual analytics on IBM Many Eyes
Why stop the insight with this article? Visit IBM’s hub of visual analytics, IBM Many Eyes, and join over 100,000 like-minded visualization enthusiasts, academics and professionals. The Many Eyes web community democratizes data visualization by providing a simple three-step process to create and interact with a visualization using your data set. Then share or embed your visualization across the web or your social network.