Some Thoughts on Data Density

Tim Brock / Monday, September 07, 2015

Back in February I wrote about 7 Do's and Don'ts of DataViz. My first "don't" was "Don’t use a chart when a sentence will do" with the accompanying bar chart:

As noted in that article, the bar chart above can be omitted in favor of a simple sentence like "237 of the respondents preferred Product A, while only 112 preferred Product B" without any loss of understanding for the reader.

While I frequently see bar charts with only two bars, I rarely see a scatter plot with only two points. One possibility is that the relative sparseness of a scatter plot emphasizes the tiny size of the dataset in question, so people shy away from them. Conversely, big, blocky bars give the illusion that a chart conveys more information than it really does. Despite appearances, the scatter plot below encodes more information than the bar chart above: each point carries both an x and a y value, whereas each bar carries a single value.

In his influential book "The Visual Display of Quantitative Information", Edward Tufte devotes a whole chapter to the merits of high-density data visualization, the antithesis of the bar chart and scatter plot above. The chapter includes a New York Times review of the weather for 2003 and several examples of small multiples (see my earlier article for an introduction to this format). There's also a whole section on sparklines ("small, high-resolution graphics usually embedded in a full context of words, numbers, images"), a format that he developed.

Here's a very simple sparkline I made, using World Bank data and IgniteUI, embedded in a sentence:

In contrast to our large bar chart and scatter plot with only a couple of data points, the tiny sparkline above encodes 50 years' worth of data. We can see the sharp rise up to 1963, followed by a much slower, bumpier decline. And that all fits in the space of two or three extra words. (Sparklines can work really well in tables too!)
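To give a feel for how a whole series can be squeezed into the space of a word, here is a minimal sketch of a text-only sparkline in Python. It is not how the sparkline above was made (that used Ignite UI and World Bank data); the `sparkline` function and the example values are purely illustrative assumptions, and the Unicode block characters give only eight height levels, far coarser than a real sparkline.

```python
# Eight Unicode block characters, from lowest to highest.
BLOCKS = "▁▂▃▄▅▆▇█"


def sparkline(values):
    """Map a sequence of numbers to one block character per value.

    Hypothetical helper for illustration only: values are scaled
    linearly between the series minimum and maximum.
    """
    values = list(values)
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A flat series: draw every point at the lowest level.
        return BLOCKS[0] * len(values)
    top = len(BLOCKS) - 1
    return "".join(BLOCKS[round((v - lo) / span * top)] for v in values)


if __name__ == "__main__":
    # Made-up values, NOT the World Bank series discussed in the text.
    example = [3, 5, 8, 7, 6, 6, 5, 4, 4, 3]
    print("The series", sparkline(example), "fits inside a sentence.")
```

One character per data point means a 50-value series takes up about as much room as a few words, which is the property the paragraph above describes.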

On the face of it this all sounds great. Sparklines can be a wonderfully efficient use of space and I've already discussed the merits of small multiples. Data-rich maps can also provide us with valuable insight if done well. Furthermore, compact displays of data reduce the need for eye movement, making visual search easier. Nevertheless, I'm not entirely convinced by Tufte's arguments in favor of high-density graphics. I'm certainly not saying they're bad, just that they shouldn't be a primary objective. Ultimately I think it boils down to the key question: what are the benefits of visualizing data?

The whole point of data visualization, at least in my eyes (pun intended), is to aid understanding of data. We can describe a dataset in fine detail in continuous prose. We can use a table with tens of columns and hundreds of rows. But neither will help us "see" the overarching patterns, and we may struggle to pick out the anomalies too. We just don't have the working-memory capabilities. If designed well, charts allow us to overcome these limitations and draw more insight from our data. That's probably not true if there are only two or three data points, but with ten it can be. (While ten data points is not a lot, it allows for 45 pairwise comparisons.)
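The figure of 45 comes from counting unordered pairs: n points allow n(n - 1)/2 pairwise comparisons. As a quick check, here is a small Python sketch; the function name `pairwise_comparisons` is my own, and `math.comb` needs Python 3.8 or later.

```python
from itertools import combinations
from math import comb


def pairwise_comparisons(n):
    """Number of unordered pairs among n data points: n * (n - 1) / 2."""
    return comb(n, 2)


if __name__ == "__main__":
    # Compare the dataset sizes mentioned in the text.
    for n in (2, 3, 10):
        print(n, "points ->", pairwise_comparisons(n), "pairwise comparisons")

    # Cross-check the formula by listing every pair explicitly.
    points = range(10)
    assert len(list(combinations(points, 2))) == pairwise_comparisons(10)
```

Two points give just one comparison and three give three, which is easy enough to hold in your head; by ten points the count has already grown to 45.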

Tufte laments that "[V]ery few statistical graphics achieve the information display rates found in maps" and hopes that "some day statistical graphics will perform as successfully as maps in carrying information". To me this just seems like a really odd way of measuring the success of a graphic. It also offers no consideration of context. Maps are invariably used for exploration; statistical graphics are often used for explanation. The former generally involves showing all the data that might be needed, the latter only the data that is relevant (without misleading through omission, of course). Showing more data than is necessary can muddy the picture rather than enhance it. And sometimes there just isn't any more (relevant) data.

When deciding whether or not to produce a chart from some set of data, you shouldn't be asking yourself: will my chart display enough data? You should be asking yourself whether you or your intended audience will (or might) learn anything useful from your graphical display that they wouldn't get from regular text or a table. If the answer is "no" then you can leave it there. If the answer is "yes" then you probably want to consider whether the extra insight is worth the extra cost in terms of the time and effort you have to expend creating the graphic and, when limited, the space any graphic will take up. If sparklines, small multiples and/or maps are useful then make liberal use of them.