How Stats can Mislead you

SalRite
Towards Data Science
7 min read · Nov 23, 2018


“My stats don’t lie,” said the statistician next door. Well, maybe so, but Alice can use those cool charts and curvy numbers to cook up a lie, not in Wonderland, but in reality!

If you are a Data Scientist, Analyst, Machine Learning Engineer, or simply someone who loves numbers, understanding statistics is vital. But statistics are not absolute truth! Maybe the statistics you see were manipulated. Why? Because they can be! Media and advertising agencies often play with consumer psychology, showcasing only what they want the buyer to perceive. To make the right decision, you need to know whether the stats you were given are lying; after all, you don’t want to lose that million-dollar deal, do you? Let’s have a closer look at how to spot a lie, so that the next time you go to a meeting or read the newspaper, you are not misled by fancy charts.

Ways in which Stats can mislead

1). Statistical significance doesn’t imply practical significance. To put it in plain language, something being statistically significant doesn’t necessarily mean it is feasible or of any practical importance.

In a statistical experiment, a hypothesis test (H0 vs. H1) is framed that evaluates two mutually exclusive statements about the population to determine which statement is best supported by the sample data. The null hypothesis (H0) is the stated assumption that there is no difference in a parameter (mean, variance) between two or more populations, while the hypothesis stating that a difference exists is referred to as the alternative hypothesis (H1).

The significance level, denoted by the Greek letter alpha (α), is the probability of rejecting the null hypothesis when it is true, and it is chosen before the start of an experiment. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference (0.05 is the usual choice, but it ultimately depends on the use case; if your study is critical, you may consider 0.01). Another statistic is the p-value: the probability of obtaining the observed results, or more extreme ones, when the null hypothesis (H0) of the study in question is true.

If p < α, the null hypothesis is rejected, which means the observed result would be unlikely if the null hypothesis were true. A test result is said to be statistically significant when the sample statistic is unusual enough relative to the null hypothesis. Still confused? Read more in the references below, but for now understand that statistical significance doesn’t imply practical significance. For example, your study (even with α = 0.01) may be perfectly statistically significant, yet the effect it detects may be too small to matter, the sample you collected may not adequately represent the population, or the risk involved may be so high that you don’t want even a 1% chance of getting it wrong.
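To make this concrete, here is a minimal sketch in Python (with entirely made-up data) of how a very large sample can turn a negligible difference into a “statistically significant” result. The group names, means, and sample sizes are assumptions chosen for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two hypothetical groups whose true means differ by only 0.05 units
# on a scale of roughly 100 -- a difference too small to matter in practice.
group_a = rng.normal(loc=100.00, scale=10, size=1_000_000)
group_b = rng.normal(loc=100.05, scale=10, size=1_000_000)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"p-value:             {p_value:.4g}")   # typically well below 0.05
print(f"observed difference: {group_b.mean() - group_a.mean():.3f}")
```

With a million observations per group, even a 0.05-unit shift tends to come out “significant”; whether 0.05 units is worth acting on is a practical question the p-value alone cannot answer.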

2). Irrelevant plots: As the saying goes, ‘a picture is worth a thousand words’, and the same holds for the charts and plots that present your statistics.

a. Plot/chart type: The chart type used to represent the data depends on what story you want to tell your audience. For example, a line chart is better for depicting the rise and fall of a stock price, while a bar chart may be appropriate for discrete data such as the sales made by each company in region X. Using, say, a pie chart where a bar chart can do the job may make the whole presentation clumsy, and you may fail to get your point across to your audience.

b. Manipulating the y-axis: The y-axis can be truncated to exaggerate small differences and tell a misleading story.

Other times, the y-axis needs to be truncated to show the data at the necessary scale.

Omitting the baseline and cherry-picking the data are other ways in which a chart can misrepresent the numbers. How you interpret your plot also matters; for example, don’t mistake short-term fluctuations for a trend. The sketch below shows how truncating the y-axis changes the impression a chart gives.
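Here is a small matplotlib sketch, using hypothetical quarterly sales figures, that plots the same data twice: once with the baseline at zero and once with a truncated y-axis.

```python
import matplotlib.pyplot as plt

# Hypothetical quarterly sales figures -- the absolute differences are small.
quarters = ["Q1", "Q2", "Q3", "Q4"]
sales = [1020, 1035, 1028, 1042]

fig, (ax_full, ax_trunc) = plt.subplots(1, 2, figsize=(10, 4))

# Honest view: the y-axis starts at zero, so the bars look nearly identical.
ax_full.bar(quarters, sales)
ax_full.set_ylim(0, 1200)
ax_full.set_title("Baseline at zero")

# Misleading view: truncating the y-axis makes Q4 look dramatically larger.
ax_trunc.bar(quarters, sales)
ax_trunc.set_ylim(1010, 1050)
ax_trunc.set_title("Truncated y-axis")

plt.tight_layout()
plt.show()
```

Neither view is wrong in itself; the problem arises when the truncated version is presented without making the scale clear.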

3). Correlation doesn’t imply causation: Imagine you wake up on a Sunday morning and head towards your sofa, mug of coffee in hand. As you grab the newspaper, you are shocked by a well-presented but misleading piece of information: a chart suggesting that the divorce rate in Maine moves in lockstep with margarine consumption.

Don’t worry; just relax your brain muscles and grasp the fact that ‘correlation doesn’t imply causation’. Although the divorce rate in Maine and margarine consumption are correlated, there is no evidence to indicate that one has caused the other. To read more, see the references below.

Correlation points to a potential relationship, but potential does not mean a causal relationship exists. So the next time you see a fancy chart and someone claiming that X has caused Y, remember not to fall into the trap unless you have solid evidence for it. That doesn’t mean correlation is useless; it is in fact used in many places, for example to assess the strength and direction of the linear relationship between pairs of variables in machine learning.
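The sketch below mimics the famous spurious-correlation example with made-up numbers (not the actual Maine or margarine data): two unrelated series that both happen to decline over the same period will show a strong Pearson correlation purely because they share a time trend.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
years = np.arange(2000, 2015)

# Two invented series that both drift downward over the same years.
divorce_rate = 5.0 - 0.08 * (years - 2000) + rng.normal(0, 0.05, len(years))
margarine_lbs = 8.0 - 0.30 * (years - 2000) + rng.normal(0, 0.20, len(years))

r, p = stats.pearsonr(divorce_rate, margarine_lbs)
print(f"Pearson r = {r:.2f} (p = {p:.4f})")
# A strong correlation appears simply because both series share a trend,
# not because margarine consumption has anything to do with divorce.
```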

4). Simpson’s paradox: Simpson’s paradox, or the Yule–Simpson effect, is a phenomenon in probability and statistics, in which a trend appears in several different groups of data but disappears or reverses when these groups are combined.

A real-life case of the paradox occurred in 1973, when admission rates at the University of California, Berkeley’s graduate schools were investigated. The university was sued by women for the gender gap in admissions:

The results of the investigation were surprising: when each department (law, medicine, engineering, etc.) was looked at separately, women were admitted at a higher rate than men! However, the overall average suggested that men were admitted at a much higher rate than women. Talk about confusing.

When individual departments were examined, admissions were actually slightly biased towards women; the misleading overall average is a classic example of the paradox.
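Here is a small pandas sketch of the same effect with made-up numbers (not the actual 1973 Berkeley figures). Two hypothetical departments each admit women at a higher rate than men, yet the pooled rates reverse because most women applied to the more selective department.

```python
import pandas as pd

# Hypothetical admissions data (illustrative only).
df = pd.DataFrame({
    "department": ["A", "A", "B", "B"],
    "gender":     ["men", "women", "men", "women"],
    "applied":    [800, 100, 200, 900],
    "admitted":   [480, 65, 20, 135],
})

# Per-department admission rates: women are ahead in both A and B.
per_dept = df.assign(rate=df["admitted"] / df["applied"])
print(per_dept[["department", "gender", "rate"]])

# Aggregated rates: the ordering reverses because most women applied
# to department B, which is much harder to get into.
overall = df.groupby("gender")[["applied", "admitted"]].sum()
print(overall["admitted"] / overall["applied"])
```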

5). Sampling: The data collected needs to be of the right size; statistics based on a small sample are usually less accurate. Another thing that can affect data quality is how the survey was conducted. The purpose of sampling is to collect, calculate, and analyze statistics so that we can make inferences about the population from the sample. In the end, we need to be confident that the results adequately represent the population we care about.

While constructing a sample, keep in mind the consistency, diversity, and transparency of the process and the data collected. There are several sampling techniques (random, systematic, stratified, etc.), each with its pros and cons. You can read more about them in the references below.
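As a rough illustration, the sketch below (with an invented customer-spend population) contrasts a convenience sample drawn from one subgroup with a simple random sample of the same size drawn from the whole population.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: 70% of customers spend ~$40/month, 30% spend ~$120.
light_users = rng.normal(40, 5, size=70_000)
heavy_users = rng.normal(120, 15, size=30_000)
population = np.concatenate([light_users, heavy_users])
print(f"true population mean: {population.mean():.1f}")

# Convenience sample: surveying only heavy users badly overestimates spend.
biased_sample = rng.choice(heavy_users, size=200, replace=False)
print(f"biased sample mean:   {biased_sample.mean():.1f}")

# A simple random sample of the whole population lands much closer.
random_sample = rng.choice(population, size=200, replace=False)
print(f"random sample mean:   {random_sample.mean():.1f}")
```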

Conclusion

Stats are just numbers; they are only half of the story you want to tell, and without context they are pretty meaningless. You need to be well versed in the problem you are solving and the data you are handling before drawing any conclusion; in other words, domain knowledge is needed to interpret the stats behind a study. Statistics is also just a tool, and like all tools you need to know when and where to use it: used appropriately, it can help you make multi-million-dollar decisions, and used incorrectly, it can make you lose them. So the next time you see a graph or some stats thrown at you, don’t forget to question the authority, and perhaps collect some more evidence before reaching a conclusion.

I hope I have provided an overview of the basics to consider while analyzing your stats or conducting an experiment. Needless to say, I haven’t gone in depth on any topic; if you wish to explore further, I have provided links to the sources of my little knowledge below.

References

Lecture Computational Thinking & Data Science — 15. Statistical Sins and Wrap Up: https://www.youtube.com/watch?v=mCHwwW_Y5wE&list=PL2gOEt98czRRQ0NKUc3mpvR0v2Mjb1qku&index=15

Statistical vs Practical Significance: https://atrium.lib.uoguelph.ca/xmlui/bitstream/handle/10214/1869/A_Statistical_versus_Practical_Significance.pdf;sequence=7

Simpson’s Paradox: https://www.statisticshowto.datasciencecentral.com/what-is-simpsons-paradox/

Sample size matters: https://askpivot.com/blog/2013/11/05/sample-size-matters/

Sampling techniques: https://blog.socialcops.com/academy/resources/6-sampling-techniques-choose-representative-subset/

Understanding Statistical Significance: http://blog.minitab.com/blog/adventures-in-statistics-2/understanding-hypothesis-tests-significance-levels-alpha-and-p-values-in-statistics

Misleading graphs: https://venngage.com/blog/misleading-graphs/

Correlation vs Causation:

1. https://www.insightsassociation.org/article/correlation-does-not-equal-causation-and-why-you-should-care

2. https://idatassist.com/why-journalists-love-causation-and-how-statisticians-can-help/

