Aspects of Datasets - Part 1

Tim Brock / Monday, July 20, 2015

Whether compiling your own dataset or using somebody else's, there's huge potential for wasting both time and money if the data isn't fit for purpose. This is the first article in a two-part series taking a (largely qualitative) look at evaluating the usefulness of datasets. In this part I will cover relevance, accuracy and precision; part two will cover consistency, completeness and size.


Relevance

It is, of course, obvious that the data being measured must be relevant to the analysis project in question. There's no point in measuring the mean circumference of oranges in your local store if you want to know the unemployment rate in South Australia. In the ideal case we can directly measure the variables we're interested in or survey people from a specific demographic group of concern. Sometimes this simply isn't possible. Maybe we don't have the technology to measure what we'd like to measure, or the funds to obtain the relevant kit or carry out a survey. Legal and ethical considerations can also play a role. In cases such as these the best we can manage is to rely on some form of proxy data.

Proxy data can mean a simple substitution. For example, if we want to know the populations of the nations of the UK we could use the results of the 2011 census. For many cases this freely available four-year-old proxy is likely to be accurate enough. But if your goal is to look at net migration in the UK since 2010, this data alone is of little use.

Proxy data may also be a lot more complex. Because we only have direct measurements of global temperature dating back to the mid-nineteenth century, climate scientists have to use a mix of proxies to extrapolate backwards in time. This includes the width of tree rings, the isotopic composition of huge ice cores and the shape of pollen found in sediment at the bottom of lakes and oceans.

With every step away from direct measurement of variables, the complexity of the analysis and the uncertainty in the results are likely to increase. The critical question to ask each time you have to resort to proxy data is whether the return will be worth the expense. In some cases - like estimating historical global temperatures - it is, but in other cases one might reasonably conclude otherwise.

Accuracy and Precision

In everyday speech "accurate" and "precise" are frequently used as synonyms. However, in scientific circles, and here, they tend to have distinct meanings. A measurement (or collection of measurements) is accurate if the measured value is close to the true value of the quantity. Measurements are precise if measuring the same quantity repeatedly gives little variation in the resultant value. The chart below illustrates these concepts.

In essence, accuracy reflects the degree to which measurements are free from systematic errors. Precision, on the other hand, refers to the size of random errors.
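A quick simulation can make the distinction concrete. The sketch below (the true value, bias, and noise levels are invented for illustration) generates one set of measurements that is accurate but imprecise and another that is precise but inaccurate, then summarises each with its mean, which exposes systematic error, and its standard deviation, which exposes random error.

```python
import random

random.seed(42)
TRUE_VALUE = 100.0  # assumed true value of the quantity (illustrative)

def measure(bias, noise_sd, n=1000):
    """Simulate n measurements with a systematic offset (bias)
    and Gaussian random error (noise_sd)."""
    return [TRUE_VALUE + bias + random.gauss(0, noise_sd) for _ in range(n)]

def summarize(samples):
    """Return the sample mean and sample standard deviation."""
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / (len(samples) - 1)
    return mean, var ** 0.5

# Accurate but imprecise: no systematic offset, large random error
acc_mean, acc_sd = summarize(measure(bias=0.0, noise_sd=5.0))
# Precise but inaccurate: large systematic offset, small random error
pre_mean, pre_sd = summarize(measure(bias=3.0, noise_sd=0.5))

print(f"accurate/imprecise: mean={acc_mean:.2f}, sd={acc_sd:.2f}")
print(f"precise/inaccurate: mean={pre_mean:.2f}, sd={pre_sd:.2f}")
```

The first dataset averages out close to the true value despite its wide spread; the second clusters tightly but around the wrong value.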


Sources of inaccuracy include human bias, methodological flaws and faulty equipment.

It's perhaps too easy for someone to use leading questions to get the survey results they want. But human bias doesn't have to be conscious. At various stages of data collection we might be asked to make implicit judgement calls. These will always be influenced by what we know or think we know. Informed judgements will help us avoid wasting time but also make us prone to going down the path we already assume to be correct. Conversely, data is more likely to be newsworthy and taken note of if it appears to defy expectation. This can lead to publication bias.

Flaws in the methods used to collect raw data and transform it into more meaningful data can also lead to inaccuracies. This seems to have been the case with phone and online polling for the recent general election in the UK, with the final result being a long way from the too-close-to-call scenario that was widely predicted. The post-mortem into exactly what went wrong is still ongoing, but suggestions have included failure to account properly for unrepresentative samples and simply asking the wrong questions.

While electronic and mechanical detection systems may not be afflicted by the biases we are, there's no guarantee they will work as expected throughout the process of data collection. Transportation, heavy use, age, poor handling or a dose of bad luck can lead to detectors breaking or not working as desired. And even if the equipment works perfectly, there's no guarantee a human will use it as intended or that lax recording of results won't lead to miscommunication of accurate output.

Because we tend to measure things that haven't been measured before, it can be difficult to know for certain whether a measurement is accurate. Aside from designing a well-controlled experiment with strict protocols, complementary measurements of the same quantity of interest using different techniques can highlight systematic discrepancies. Another possibility is to use the same methodology to measure something that is already known and to check for inconsistencies.
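That second approach can be sketched as a toy calibration: measure a standard whose true value is already known many times over, and use the average discrepancy to estimate the instrument's systematic error. Every name and number below is invented for illustration.

```python
import random

random.seed(7)
KNOWN_STANDARD = 50.0  # a quantity whose true value is already established

def instrument_reading(true_value, bias=1.2, noise_sd=0.3):
    """Hypothetical instrument with a systematic offset (unknown
    to the experimenter) plus Gaussian random error."""
    return true_value + bias + random.gauss(0, noise_sd)

# Measure the known standard repeatedly; the mean discrepancy
# estimates the systematic error, since random errors average out
readings = [instrument_reading(KNOWN_STANDARD) for _ in range(100)]
estimated_bias = sum(readings) / len(readings) - KNOWN_STANDARD
print(f"estimated systematic error: {estimated_bias:+.2f}")
```

Once estimated this way, a systematic error can be corrected for in subsequent measurements of unknown quantities.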


In most experiments we don't need anything like the level of precision seen in lunar laser ranging experiments, which have measured the distance from the Earth to the Moon to within centimetres, in order to draw meaningful conclusions about whatever we're studying. Nevertheless, precision is something we should be concerned with. Without a grasp of the size of the random errors in our measurements, we can't know whether changes and discrepancies between one measurement and the next are due to real effects or statistical noise. Repeated, seemingly redundant, measurements help us assess how precisely we are really measuring things.
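One standard way to quantify this benefit of repetition is the standard error of the mean, which shrinks with the square root of the number of measurements. The sketch below (true value and noise level invented for illustration) shows the estimated uncertainty on the mean falling as more measurements are taken.

```python
import random

random.seed(1)

def sample_mean_and_sem(n, true_value=10.0, noise_sd=2.0):
    """Simulate n noisy measurements; return the sample mean and
    the standard error of the mean (sample sd / sqrt(n))."""
    xs = [true_value + random.gauss(0, noise_sd) for _ in range(n)]
    mean = sum(xs) / n
    sd = (sum((x - mean) ** 2 for x in xs) / (n - 1)) ** 0.5
    return mean, sd / n ** 0.5

results = {n: sample_mean_and_sem(n) for n in (5, 50, 500)}
for n, (mean, sem) in results.items():
    print(f"n={n:3d}: mean={mean:.2f} +/- {sem:.2f}")
```

The spread of the individual measurements stays roughly constant, but the uncertainty on the mean narrows: quadrupling the number of measurements roughly halves it, which is why seemingly redundant repetition pays off.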