Demystifying Box-and-whisker plots — Part 1

If you browse through a large printed newspaper and pick out all the charts you find, you'll probably come across some of the following: bar chart, timeseries (line chart), pie chart, donut chart, stacked area chart. You may come across the odd scatter plot too. If you've picked up the New York Times on a good day then you might even stumble into a connected scatter plot. What you're highly unlikely to find is a box-and-whisker plot (aka a "boxplot" or a "box plot"). By contrast, they're very popular in scientific circles.

One of the problems with box-and-whisker plots, and one of the reasons I think they haven't made it into the consciousness of the general public via the media, is that they're not particularly intuitive from a visual point of view. You can look at a bar chart and immediately get the idea that the longer the bar, the bigger the number being represented. Similarly, we can look at a pie chart and grasp the part-to-whole concept without much thought, or see a sharply downsloping line in a timeseries chart and think "this thing is declining very quickly". But box plots are frequently a mix of rectangles, lines and points. However, I really don't think they're that difficult to understand. And they can be very useful when you have multiple distributions you want to compare. So in this post I'm going to try to demystify them.

Rather than dive straight in with "proper" box-and-whisker plots, I'm going to start with something a little simpler: range bars. The diagram below shows how a range bar might look and labels the salient parts. As you can see, there's not a lot to them. You take a univariate dataset and draw a box to signify the lowest and highest values. Typically, you also add a line to indicate the position of the median relative to the lowest and highest values.

Now a single range bar doesn't tell us very much. Without an accompanying axis all we can tell is whether the median is closer to the highest or lowest value. You don't really need a chart to convey that small amount of information. Stick the range bar on a scale and we can estimate the absolute values of all three of these things. But the real power of range bars, and their box plot cousins, is how they enable simple comparisons between different univariate datasets. The chart below illustrates this for five arbitrary datasets I created (the precise details aren't important here), each made up of 100 data points. You could extend this layout to ten or so datasets without much problem.

As a collection, the range bars look like a set of shifted bars from a bar chart. That's basically what they are. The longer the bar, the bigger the range of the distribution. But, like I said, the real insight is from comparing bars. We can see, for example, that dataset E has a much larger range than the others and a much lower median. We also see that while the medians for datasets A to D lie (very) roughly halfway between their respective minimum and maximum values, the median for E is much, much closer to the minimum.
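If you'd rather see the numbers than the picture, here's a minimal Python sketch of what a range bar encodes. The function names (`range_bar`, `median_position`) and the two small datasets are my own made-up illustrations, not the data behind the charts in this post.

```python
from statistics import median


def range_bar(values):
    """Return the three numbers a range bar draws: (lowest, median, highest)."""
    if not values:
        raise ValueError("a range bar needs at least one data point")
    return min(values), median(values), max(values)


def median_position(values):
    """Where the median sits inside the bar: 0.0 = at the minimum, 1.0 = at the maximum."""
    low, mid, high = range_bar(values)
    if high == low:
        return 0.5  # all values are equal, so the bar has zero length
    return (mid - low) / (high - low)


# Hypothetical example data, chosen only to show a symmetric and a skewed case.
datasets = {
    "symmetric": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "skewed": [1, 2, 2, 3, 3, 4, 5, 20, 50],
}

for name, values in datasets.items():
    low, mid, high = range_bar(values)
    print(f"{name}: min={low}, median={mid}, max={high}, "
          f"median position={median_position(values):.2f}")
```

A median position near 0.5 corresponds to a median line roughly halfway along the bar; a value close to 0 corresponds to a median hugging the minimum, as with the skewed example.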

Shortly I'll turn the range-bar plot above into what I'll call a simple box-and-whisker plot. But first, here's a labeled diagram illustrating the important parts of a simple box-and-whisker:

Now the box only covers the middle 50% of the data: it runs from the 25th percentile to the 75th percentile. Above and below the box we have "whiskers" extending out to the highest and lowest values in the dataset. 25% of data points have a value between the minimum and the bottom of the box, and 25% of data points have a value between the top of the box and the maximum. Here's the data I generated earlier displayed as a set of simple box-and-whiskers.

We can now see that the large range seen for dataset E comes mostly from (at most) just a quarter of the data points — the 75th percentile is much closer to the minimum than the maximum.
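The five numbers behind a simple box-and-whisker can be computed in a few lines. This is a rough sketch using Python's standard `statistics` module; the function name `five_number_summary` and the example data are my own assumptions, and there are several conventions for calculating quartiles (I've picked the "inclusive" one here), so other tools may place the box edges slightly differently.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, 25th percentile, median, 75th percentile, maximum)."""
    if len(values) < 2:
        raise ValueError("need at least two data points")
    # method="inclusive" is one of several quartile conventions (an assumption
    # of this sketch); others give slightly different box edges.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


# Hypothetical skewed data, not the datasets plotted in this post.
skewed = [1, 2, 2, 3, 3, 4, 5, 20, 50]

low, q1, q2, q3, high = five_number_summary(skewed)
print(f"lower whisker: {low} to {q1}")
print(f"box:           {q1} to {q3} (median at {q2})")
print(f"upper whisker: {q3} to {high}")
```

For the skewed example the upper whisker is far longer than the box and the lower whisker combined, which is the same pattern as a large range coming from the top quarter of the data points.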

There are a number of variations of the box-and-whisker plot that attempt to show outliers. The version I see most often (and which I was taught in school) is as follows: rather than the whiskers necessarily extending out to the smallest and largest values, they extend to the smallest value that is no more than 1.5 times the interquartile range (IQR) below the 25th percentile and to the largest value that is no more than 1.5 times the IQR above the 75th percentile. The interquartile range is simply the distance between the 25th and 75th percentiles. Still, all that is quite a mouthful and an explanatory diagram certainly helps:

Individual points that fall outside the permitted range for the whiskers are explicitly marked and given the status of "outlier". (I find this a strange use of the term "outlier". In other circumstances "outlier" refers to a data point distant from all other data points. As you'll see below, only a couple of outliers really fit that definition in the datasets we're using here.) The chart below illustrates our five datasets using a "typical" box-and-whisker plot.

For datasets A, C, and D there's no change from the simple box-and-whisker since no data point lies more than 1.5 times the IQR from the 25th or 75th percentile. For dataset B there is one point just below this range. Dataset E has four high-lying outliers (two data points are almost on top of each other); despite the maximum value in E being greater than 100, 96% of points lie below 70.
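The 1.5 times IQR rule is easier to follow as code than as a sentence. Here's a minimal sketch, again with Python's standard `statistics` module. The function name `whiskers_and_outliers`, the parameter `k` and the example data are hypothetical, and the "inclusive" quartile method is an assumption on my part; a different quartile convention can move a borderline point in or out of the outlier list.

```python
from statistics import quantiles


def whiskers_and_outliers(values, k=1.5):
    """Return (lower whisker end, upper whisker end, sorted list of outliers).

    Whiskers stop at the most extreme data points that lie within k * IQR
    of the 25th/75th percentiles; anything beyond that is flagged as an outlier.
    """
    if len(values) < 2:
        raise ValueError("need at least two data points")
    # Quartile convention is an assumption of this sketch.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - k * iqr
    upper_limit = q3 + k * iqr
    inside = [v for v in values if lower_limit <= v <= upper_limit]
    outliers = sorted(v for v in values if v < lower_limit or v > upper_limit)
    return min(inside), max(inside), outliers


# Hypothetical data with one high-lying point, not the datasets in this post.
data = [1, 2, 2, 3, 3, 4, 5, 6, 7, 8, 30]

low_end, high_end, outliers = whiskers_and_outliers(data)
print(f"whiskers run from {low_end} to {high_end}; outliers: {outliers}")
```

Note that the whiskers end on actual data points, not on the calculated limits themselves. When no point lies beyond the limits, the result is the same as the simple box-and-whisker: the whiskers reach the minimum and maximum and the outlier list is empty.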

Now that I've (hopefully) demystified box-and-whisker plots, in Part 2 I'm going to use them with some real-world data to illustrate their strengths.


Comments (8)

Tim Brock
on Tue, Jan 26 2016 5:08 AM

Having shown you how to read range bars and box-and-whiskers in Part 1, I now want to use some real

George Abraham
on Thu, Feb 25 2016 9:43 PM

Ah! One of my favorite charts explained. Unfortunately many people are unaware of why this chart works. Do you think there are any obvious limitations? For example, do too few data points dull the appeal of this chart?

Another issue I have noticed is that viewers have a tough time drawing a quick conclusion. Are there any pointers with respect to extracting actionable insights, especially when comparing more than two plots side by side?

Timothy Brock
on Mon, Mar 7 2016 5:36 AM

The main limitation, I think, is that the coarseness can obscure interesting details of the underlying distribution. For example, you won't pick up that a distribution is bimodal. There are a number of variants on box plots (e.g. violin plots and bean plots) that were designed to overcome such limitations, but these are rather more complicated.

Again there are ways of complicating box plots to make certain comparisons more apparent. You may want to take a look at notched box plots. As before, the extra features come at the cost of no longer having as simple an explanation. Sticking to the box and whiskers as presented above, I think one of the best things is just to compare different markers. For example, is the median on one box higher than the 75th percentile or even the upper extreme of another? In the majority of cases this will be telling you you have two very different datasets. With an ordered series of box plots you may also see an upward or downward trend. In this case, a box plot that doesn't follow the trend will stand out. Even when there's no outlier, you can check whether the distributions overlap or not just from box plots. If the distributions don't overlap then the value of whatever you've just plotted may be seen as a good indicator of category. Of course, you can do this with histograms too, but it's a lot less elegant when there is a degree of overlap.

Cinthia Saladino
on Wed, Nov 9 2016 5:35 AM

This is a great explanation.

Merry McAlpine
on Thu, Nov 10 2016 2:19 AM

Good point of view. I am hugely impressed.

Morton Solley
on Mon, Nov 21 2016 4:36 PM

This is really important information. It contains some good tips.

Donny Mendez
on Sun, Jan 15 2017 4:29 AM

This type of information really works for every one of us.

Donny Mendez
on Sun, Jan 15 2017 4:40 AM

These are really awesome plots.