
Interpretation, interpolation and other lies | How data speaks (not) for itself

Why we might always need a trusted interpretation layer over the raw data

Did you know that by the year 2030, 80% of the world will receive their news via social media? Here’s another, even more staggering one: 9 out of 10 people will not fact-check anything they read…

People don’t deal with facts, only with interpretations of facts. The average person on the street does not want to know the nuances in the data; they want punchlines. Punchlines sell; peer-reviewed 50-pagers explaining the detail do not.

image by author

So I often hear, "the data speaks for itself". No sir, it does not.

Both statistics I mentioned in my opening statement are completely fabricated… Chances are you were quite comfortable accepting both of them and reached this paragraph without doubting their accuracy.

When world leaders quote statistics prior to announcing a big decision, I immediately hear Mark Twain in my mind: "Most people use statistics the way a drunkard uses a lamp post, more for support than illumination." Apparently Twain was a fan of statistics, and his point was more about why simple interpretations of statistics will often not suffice¹. Perhaps I am tainted, but from my experience (in my field of expertise) I know how hard it is to interpret data, hence my pessimism about the number of interpretation nodes (read: people who produce their unique spin on the data) the data had to pass through before reaching the top.

Perhaps you are untainted and have a very idealistic view of the world. If so, perhaps you should stop reading, because I myself am undecided on whether enlightenment on this topic is a good thing.

Ok, you have decided to continue. I will try to highlight the broad categories of ways in which data can be misinterpreted.

Whenever you see italics, that is "explainer BOT" giving a brief explanation of the concept being discussed.

Common Statistical errors

There are a few very common statistical errors that might not seem obvious, but once you know what they are, you will spot them easily.

  • Means and medians used selectively

Mean, aka average. The mean or average age of 3 people aged 20, 25 and 80 is (20+25+80)/3 = 41.67 years. The median is the value right in the middle, separating the bottom half from the top half, which in the same example is 25.

You can boast that the average income of a country has increased by 10% ("due to our amazing government policies") to R40 000 pm, when in actual fact the median salary has decreased by 5%. What that probably means (let me interpret for you; don’t worry, you can trust me) is that the rich have gotten richer and the poor have pretty much stayed the same. Simply omitting the median and quoting the mean instead has allowed you to look good without technically lying. The toy example below makes the trick concrete.
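A minimal sketch, using made-up incomes rather than real figures, of how the mean and the median can move in opposite directions on the very same data:

```python
from statistics import mean, median

# Hypothetical monthly incomes (in rands) for a tiny nine-person economy.
last_year = [8_000, 9_000, 10_000, 11_000, 12_000, 13_000, 14_000, 60_000, 190_000]
this_year = [7_500, 8_500, 9_500, 10_500, 11_500, 12_500, 13_500, 90_000, 250_000]

print(f"Mean:   {mean(last_year):>10,.0f} -> {mean(this_year):>10,.0f}")
print(f"Median: {median(last_year):>10,.0f} -> {median(this_year):>10,.0f}")
# The mean rises (the two top earners got much richer) while the median
# falls: both headlines are "true", but they tell opposite stories.
```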

  • Cherry picking supporting data

Cherry picking is when only the subset of the results that best supports the outcome you are trying to achieve is presented. For example, if you want to advocate an investment model, you might quote only the years, let’s say 2015–2020, in which it produced a 20% yield. However, if you look at how the investment strategy performed between 2000 and 2015, it shows a yield of -7%.
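Here is the same trick sketched in code; the yearly returns below are invented purely for illustration:

```python
# Hypothetical yearly returns (%) for an imaginary strategy, 2000-2020.
returns = {
    2000: -15, 2001: -10, 2002: -18, 2003: 2, 2004: -5, 2005: -8,
    2006: 1, 2007: -12, 2008: -25, 2009: 3, 2010: -2, 2011: -6,
    2012: 2, 2013: -5, 2014: -3, 2015: 18, 2016: 22, 2017: 19,
    2018: 21, 2019: 20, 2020: 20,
}

def avg_yield(start, end):
    """Average yearly return over an inclusive range of years."""
    window = [returns[y] for y in range(start, end + 1)]
    return sum(window) / len(window)

print(f"2015-2020: {avg_yield(2015, 2020):+.1f}% p.a.")  # the brochure number
print(f"2000-2015: {avg_yield(2000, 2015):+.1f}% p.a.")  # the omitted history
```

Which window ends up in the brochure is entirely the author’s choice, and the data cannot object.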

  • Biased Samples

An online survey asked people about their general access to the internet. The survey showed unanimously that access to the internet has greatly improved. Uhm… one problem with that… it was an online survey. The sample was biased from the start.
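A five-line simulation of why the sampling frame poisons the result; the 60% access rate below is an invented number:

```python
import random

random.seed(0)

# Hypothetical population in which only 60% actually have internet access.
has_internet = [random.random() < 0.60 for _ in range(100_000)]

# An "online survey" can only reach the people who are already online.
respondents = [person for person in has_internet if person]

print(f"True access rate:   {sum(has_internet) / len(has_internet):.0%}")
print(f"Online survey says: {sum(respondents) / len(respondents):.0%}")
# The survey reports 100% access, because everyone it could reach was online.
```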

Now, in some cases these mistakes are simple human error, but in other cases people will torture the data until it tells them what they want to hear.

In Naked Statistics², Charles Wheelan walks through in more detail what these common mistakes are and how one can spot them easily.

Causation and correlation

Correlation is a mutual relationship or connection between two or more things.

A mutual relationship or connection between two or more things does not imply that one causes the other.

This topic alone has had entire books written about it³, so I will not elaborate in too much detail. What I will do instead is show you how wrong it can get. Look at some of the spurious correlations mentioned in this link⁴.

One of the examples shows a strong correlation between the age of Miss America and "Murders by steam, hot vapors and hot objects". We as humans can immediately sense that this is pure coincidence; a computer algorithm cannot.

Ice cream consumption correlates with sunburn. It takes us a second to realize, oh, it’s probably excess sunshine that causes both.
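A minimal simulation of that confounder at work; the coefficients below are arbitrary, only the causal structure matters (statistics.correlation requires Python 3.10+):

```python
import random
from statistics import correlation  # Python 3.10+

random.seed(1)

# Sunshine is the hidden common cause (the confounder).
sunshine = [random.uniform(0, 10) for _ in range(1_000)]
ice_cream = [2 * s + random.gauss(0, 2) for s in sunshine]  # sunshine drives sales
sunburn = [3 * s + random.gauss(0, 3) for s in sunshine]    # sunshine drives burns

print(f"corr(ice cream, sunburn) = {correlation(ice_cream, sunburn):.2f}")
# Prints a correlation of roughly 0.9, yet neither variable causes the other.
```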

Causation vs. Correlation (image by author)

These were two obvious examples on each side of the spectrum, simply to show that the data does not speak for itself and that, more often than not, correlation is mistaken for causation.

In The AI Delusion⁵, Gary Smith argues that Big Data exacerbates this issue: the more variables you add that could potentially correlate, the more spurious correlations you will inevitably find. The small simulation below shows how quickly pure noise starts to "correlate".
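This is a sketch of Smith’s point, not his own experiment: generate a few hundred unrelated random series and simply look for the strongest correlation among them (again assuming Python 3.10+ for statistics.correlation).

```python
import random
from itertools import combinations
from statistics import correlation  # Python 3.10+

random.seed(2)

# 200 completely unrelated random series of 20 yearly observations each --
# think "age of Miss America", "margarine consumption", and 198 friends.
series = [[random.gauss(0, 1) for _ in range(20)] for _ in range(200)]

best = max(abs(correlation(a, b)) for a, b in combinations(series, 2))
print(f"Strongest correlation among {200 * 199 // 2:,} pure-noise pairs: {best:.2f}")
# With ~20,000 pairs of short series, correlations above 0.7 show up
# by chance alone -- no causation anywhere, just volume.
```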

If we let the data speak for itself, we will proclaim that the divorce rate in Maine is caused by per capita consumption of margarine.⁴

Bad Data

Garbage in, garbage out. Another common misconception is that adding more data will help. Adding more data doesn’t address the quality of the data; you will simply have more bad data.
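A quick illustration of why volume does not cure quality; the miscalibrated sensor below is an invented stand-in for any systematic data problem:

```python
import random
from statistics import mean

random.seed(3)

TRUE_VALUE = 100.0
BIAS = 5.0  # hypothetical systematic error, e.g. a miscalibrated sensor

def biased_sample(n):
    """n noisy readings from a source with a built-in systematic error."""
    return [TRUE_VALUE + BIAS + random.gauss(0, 2) for _ in range(n)]

for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}: estimate = {mean(biased_sample(n)):.2f}")
# The estimate converges ever more precisely... on the wrong answer
# (about 105, not 100). More data shrinks the noise; the garbage stays.
```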

Data scientists, whose role was deemed the sexiest profession of 2020, spend a considerable amount of time looking for other jobs.⁶ Data scientists want to create cutting-edge algorithms that drive insight, but most of the time the data they get is crap, and it takes scrubbing, structuring and massive amounts of effort to continuously provide said insight.

Data quality is critical to the success of data projects, but data does not arrive clean. Ensuring data quality and consistency requires effort and resources; you can’t just let the data speak for itself. If we had let the data speak for itself, it would be speaking Mandarin … backwards … under water.

Ethical bias issues

When we are talking about training data, "bad" data could simply mean that the training data was not well curated and carries some bias. Several recent research studies⁷ demonstrated that popular data sets used to train image-recognition AI included gender biases. For example, men cooking in the kitchen were mistaken for women, because a large portion of the pictures of women in the training data were taken in the kitchen.

Machine learning or AI algorithms require data sets to train on, i.e. training data. The models that fit the data best are then determined and used to predict certain outputs on similar data sets.
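The kind of co-occurrence audit that catches the kitchen bias is almost trivially easy to run; the data set and counts below are invented for illustration:

```python
from collections import Counter

# Toy audit of an invented image data set: each record is the
# (scene, gender) annotation pair attached to one training image.
annotations = (
    [("kitchen", "woman")] * 330
    + [("kitchen", "man")] * 70
    + [("office", "woman")] * 180
    + [("office", "man")] * 420
)

counts = Counter(annotations)
for scene in ("kitchen", "office"):
    w, m = counts[(scene, "woman")], counts[(scene, "man")]
    print(f"{scene:>8}: {w / (w + m):.0%} woman / {m / (w + m):.0%} man")
# A model trained on this set learns "kitchen => woman" as a shortcut, so a
# man cooking gets mislabeled. The check is cheap -- but someone has to run it.
```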

Inaccurate labeling due to biased training data (Photo by Jason Briscoe on Unsplash)

We live in a complex world with different world views and changing moral standards over time. We adapt and change with it. There is still a wide gap between the way we process information vs. how algorithms process information⁵.

Again, entire theses are written on this topic, so I will simply highlight some more bad examples, which should drive home the point that algorithms do not take ethical complexities, consequences, etc. into consideration. An algorithm takes training data and produces an output.

  • Twitter trolls were able to turn Microsoft’s chatbot into a racist in a very short period of time, because there were no ethical checks in place⁸.
  • The UK Covid exam-grading example – the biggest victims were students with high grades from less-advantaged schools, who were more likely to have their scores downgraded, while students from richer schools were more likely to have their scores raised⁹.

I am going to be bold and say that for the foreseeable future, real people will still be involved in AI processes. Why? We need to proactively put countermeasures against built-in biases in place. We as a society can’t unanimously agree on certain moral topics, so leaving it up to machines to decide is in no way going to make the problem go away. I don’t believe we will ever be able to completely remove that human interpretation layer. And if I am wrong … well, in 10 years I will simply ask my personal robot assistant to write a follow-up piece apologising for my short-sightedness, or just instruct it to delete any trace that I ever made such bold statements.

But in the meantime, if we let the data speak for itself … well, do you really want racist chatbots, sexist image recognition and any other …ist type of behaviour?


As a follow-on, a part deux if you like, I would like to elaborate on another set of reasons, which have to do with the grain and semantic meaning of data.

Data is growing rapidly, and so are the ways in which data is extracted from us. I do see mass use cases and benefits for AI, but I also see a growing need for data professionals across the spectrum, from acquisition (you with your Fitbit, sleeping at night) to acting on insight (an AI model suggesting a more comfortable bed). I don’t actually know whether that is a thing; I might need to patent it. The complexity within the entire data pipeline is not decreasing: pipelines are handling more data (volume), changing faster (velocity) and becoming more variable (variety) – stealing the 3 V’s of Big Data, but the analogy fits. In my previous piece, I touched on the need for Data Generalists, individuals who will help span this complex ecosystem.

My take-away message is that you need to trust the source of the interpretation, because without really wrestling with the raw data yourself there is no way for you to know whether it is being used to mislead you. We should wrestle with data, not torture it.

When the same conclusion is drawn from different sources, that is a better indication of truth. If you find yourself quoting a single study out of hundreds, then perhaps you should be asking whether you have fallen prey to one of these interpretation mistakes.

So yes, I do agree with Mark Twain’s "Lies, damned lies, and statistics" – or rather, with my less plagiarized title.

[1] Fisher, A. Mark Twain Was a Stats Fan, Anything Else Is a Damn Lie. Available at: https://aaronjfisher.github.io/mark-twain-was-a-stats-fan.html
[2] Wheelan, C., 2014. Naked Statistics: Stripping the Dread from the Data.
[3] Baker, L., 2018. Correlation Is Not Causation.
[4] Tylervigen.com, 2021. 15 Insane Things That Correlate With Each Other. Available at: http://tylervigen.com/spurious-correlations
[5] Smith, G., 2018. The AI Delusion.
[6] Brooks-Bartlett, J., 2021. Why So Many Data Scientists Are Leaving Their Jobs. Available at: https://towardsdatascience.com/why-so-many-data-scientists-are-leaving-their-jobs-a1f0329d7ea4
[7] Wiggers, K., 2021. Researchers Show That Computer Vision Algorithms Pretrained on ImageNet Exhibit Multiple, Distressing Biases. Available at: https://venturebeat.com/2020/11/03/researchers-show-that-computer-vision-algorithms-pretrained-on-imagenet-exhibit-multiple-distressing-biases/
[8] The Verge, 2021. Twitter Taught Microsoft's Friendly AI Chatbot to Be a Racist Asshole in Less Than a Day. Available at: https://www.theverge.com/2016/3/24/11297050/tay-microsoft-chatbot-racist
[9] Walsh, B., 2021. How an AI Grading System Ignited a National Controversy in the U.K. Axios. Available at: https://www.axios.com/england-exams-algorithm-grading-4f728465-a3bf-476b-9127-9df036525c22.html
