If you browse through a large printed newspaper and pick out all the charts you find, you'll probably come across some of the following: bar chart, timeseries (line chart), pie chart, donut chart, stacked area chart. You may come across the odd scatter plot too. If you've picked up the New York Times on a good day then you might even stumble into a connected scatter plot. What you're highly unlikely to find is a box-and-whisker plot (aka a "boxplot" or a "box plot"). By contrast, they're very popular in scientific circles.
One of the problems with box-and-whisker plots, and one of the reasons I think they haven't made it into the consciousness of the general public via the media, is that they're not particularly intuitive from a visual point of view. You can look at a bar chart and immediately get the idea that the longer the bar, the bigger the number being represented. Similarly, we can look at a pie chart and grasp the part-to-whole concept without much thought or see a sharply downsloping line in a timeseries chart and think "this thing is declining very quickly". But box plots are frequently a mix of rectangles, lines and points. However, I really don't think they're that difficult to understand. And they can be very useful when you have multiple distributions you want to compare. So in this post I'm going to try to demystify them.
Rather than dive straight in with "proper" box-and-whisker plots, I'm going to start with something a little simpler: range bars. The diagram below shows how a range bar might look and labels the salient parts. As you can see, there's not a lot to them. You take a univariate dataset and draw a box to signify the lowest and highest values. Typically, you also add a line to indicate the position of the median relative to the lowest and highest values.
Now a single range bar doesn't tell us very much. Without an accompanying axis all we can tell is whether the median is closer to the highest or lowest value. You don't really need a chart to convey that small amount of information. Stick the range bar on a scale and we can estimate absolute values of all three of these things. But the real power of range bars, and their box plot cousins, is how they enable simple comparisons between different univariate datasets. The chart below illustrates this for five arbitrary datasets I created (the precise details aren't important here), each made up of 100 data points. You could extend this layout to ten or so datasets without much problem.
As a collection, the range bars look like a set of shifted bars from a bar chart. That's basically what they are. The longer the bar the bigger the range of each distribution. But, like I said, the real insight is from comparing bars. We can see, for example, that dataset E has a much larger range than the others and a much lower median. We also see that while the medians for datasets A to D lie (very) roughly halfway between their respective minimum and maximum values, the median for E is much much closer to the minimum.
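If you'd rather see the arithmetic than the picture, here's a minimal Python sketch of the three numbers a range bar encodes. It isn't the code behind the charts in this post: the function name `range_bar` and the toy data are made up purely for illustration.

```python
from statistics import median


def range_bar(values):
    """Return the three numbers a range bar displays: (lowest, median, highest)."""
    return min(values), median(values), max(values)


# Hypothetical toy data, not one of the datasets charted in this post.
sample = [2, 3, 3, 5, 8, 13, 40]

low, mid, high = range_bar(sample)  # (2, 5, 40)

# How far along the bar the median line sits: 0 = at the lowest value,
# 1 = at the highest value.
median_position = (mid - low) / (high - low)
```

Here `median_position` comes out well below 0.5, which is the same kind of lopsidedness described for dataset E: the median sits much closer to the minimum than to the maximum.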
Shortly I'll turn the range-bar plot above into what I'll call a simple box-and-whisker plot. But first, here's a labeled diagram illustrating the important parts of a simple box-and-whisker:
Now the box only covers 50% of the data. Above and below the box we have "whiskers" extending out to the highest and lowest values in the dataset. 25% of data points have a value between the minimum and the bottom of the box, 25% of data points have a value between the top of the box and the maximum. Here's the data I generated earlier displayed as a set of simple box-and-whiskers.
We can now see that the large range seen for dataset E comes mostly from (at most) just a quarter of the data points — the 75th percentile is much closer to the minimum than the maximum.
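The five numbers behind a simple box-and-whisker can be sketched in a few lines of Python. Treat this as an illustration under stated assumptions: `five_number_summary` is a hypothetical helper, the data is invented, and quartile conventions differ slightly between tools (this sketch uses the standard library's "inclusive" method, which may not match whatever produced the charts here).

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, 25th percentile, median, 75th percentile, maximum)."""
    # quantiles(..., n=4) returns the three cut points that split the data
    # into four equally sized groups. The "inclusive" method is an assumption;
    # other tools may interpolate differently.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


# Hypothetical toy data with one very large value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 50]

lowest, q1, med, q3, highest = five_number_summary(sample)
# (1, 3.0, 5.0, 7.0, 50)

box_height = q3 - q1             # the box spans the middle 50% of the data
upper_whisker = highest - q3     # top of the box up to the maximum
lower_whisker = q1 - lowest      # minimum up to the bottom of the box
```

In this toy example the upper whisker is far longer than the box and the lower whisker combined, so most of the range comes from the top quarter of the data, just as with dataset E.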
There are a number of variations of the box-and-whisker plot that attempt to show outliers. The version I see most often (and which I was taught in school) is as follows: rather than the whiskers necessarily extending out to the smallest and largest values, they instead extend out to the smallest/largest values that are up to 1.5 times the interquartile range (IQR) below/above the 25th/75th percentile. The interquartile range is simply the distance between the 25th and 75th percentiles. Still, all that is quite a mouthful and an explanatory diagram certainly helps:
Individual points that fall outside the permitted range for the whiskers are explicitly marked and given the status of "outlier". (I find this a strange use of the term "outlier". In other circumstances "outlier" refers to a data point distant from all other data points. As you'll see below, only a couple of outliers really fit that definition in the datasets we're using here.) The chart below illustrates our 5 datasets using a "typical" box-and-whisker plot.
For datasets A, C, and D there's no change from the simple box-and-whisker since no data point lies more than 1.5 times the IQR from the 25th or 75th percentile. For dataset B there is one point just below this range. Dataset E has four high-lying outliers (two data points are almost on top of each other); despite the maximum value in E being greater than 100, 96% of points lie below 70.
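The 1.5 times IQR rule is easier to follow as code than as a sentence, so here's a hedged Python sketch of it. The helper name `whiskers_and_outliers`, its parameter `k`, and the sample data are all hypothetical, and the quartiles again use the standard library's "inclusive" method as an assumption.

```python
from statistics import quantiles


def whiskers_and_outliers(values, k=1.5):
    """Return (lower whisker end, upper whisker end, sorted outliers)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range: distance between the 25th and 75th percentiles

    # Limits of the permitted range for the whiskers.
    lower_limit = q1 - k * iqr
    upper_limit = q3 + k * iqr

    # Whiskers stop at the most extreme data points still inside the limits,
    # not at the limits themselves.
    inside = [v for v in values if lower_limit <= v <= upper_limit]
    outliers = sorted(v for v in values if v < lower_limit or v > upper_limit)
    return min(inside), max(inside), outliers


# Hypothetical toy data, not one of the datasets charted in this post.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 50]

print(whiskers_and_outliers(sample))  # (1, 8, [50])
```

For this sample the quartiles are 3 and 7, so the IQR is 4 and the permitted range runs from -3 to 13. The value 50 falls outside it and is marked as an outlier, while the upper whisker stops at 8, the largest value inside the range.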
Now that I've (hopefully) demystified box-and-whisker plots, in Part 2 I'm going to use them with some real-world data to illustrate their strengths.