
Measures of variability and z-scores. Why, when and how to use them?

Formulas and examples for variance, standard deviation, coefficient of variation, IQR and z-scores

Photo by Rostislav Uzunov, on Pexels

Content

  1. Introduction
  2. Range
  3. Interquartile range
  4. Variance
  5. Standard deviation
  6. Coefficient of variation
  7. Z-scores
  8. Conclusions

Introduction

Why do measures of variability matter? What can they offer us?

Variability refers to how "spread out" or dispersed data is, that is, how much the scores differ from one another. For example, consider the ages of a particular store’s customers. For a store selling sports clothing and gear, the most frequent buyers might come from a younger age group, so the data is concentrated around those ages. For a supermarket, however, customers might belong in roughly equal measure to different age groups, so the data is more dispersed.

The most common measures of variability, used especially for quantitative and continuous data, are:

  • range
  • quartiles and interquartile range
  • variance and standard deviation

To these, I will add the coefficient of variation, which is useful for comparing variability between different datasets, and z-scores (also called standard scores), which show how far a given score lies from the mean and make it possible to compare scores within the same distribution or across different distributions.

Each measure will come with its corresponding formula and some examples to illustrate its use.

Range

The range is the difference between the maximum score and the minimum score within a data set.

In this example, the range is the difference between 98, the highest score obtained, and 11, the lowest: 98 – 11 = 87. As can be seen, the range is very easy to calculate. However, it is also easily affected by outliers, in this case 11, 25 and 98. That is why we also calculate the interquartile range (IQR).
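As a minimal sketch, the range can be computed in a couple of lines of Python. The list below is hypothetical: only the minimum (11), the value 25 and the maximum (98) come from the text, and the remaining values are made up for illustration.

```python
# Hypothetical test scores: only 11, 25 and 98 are quoted in the article,
# the other values are invented for this sketch.
scores = [11, 25, 57, 64, 70, 73, 75, 77, 79, 85, 98]

# Range = maximum score - minimum score
data_range = max(scores) - min(scores)
print(data_range)  # 87

# Dropping the single lowest score changes the range noticeably,
# which illustrates how sensitive the range is to extreme values.
trimmed = scores[1:]
trimmed_range = max(trimmed) - min(trimmed)
print(trimmed_range)  # 73
```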

Interquartile range

The word quartile comes from quarter: quartiles split the ordered data into four equal parts. The interquartile range is the range of the middle 50% of the scores in a distribution.

IQR = 75th percentile (Q3) – 25th percentile (Q1)

Image caption by the author

To calculate the IQR, one needs to go through the following steps:

  1. Order the data from smallest to largest.
  2. Find the median (Q2), which in the example above is 73.
  3. Calculate the medians of the lower (Q1) and upper half (Q3) of data, which in our case are: Q1 = 57, Q3 = 79. When doing this, do not include the centermost value, Q2 (73, in this case).
  4. Subtract the lower half median Q1 from the upper half median Q3. In the example above, the middle 50% of the scores have a range of 22 points, being contained between a score of 57 and a score of 79.
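The four steps above can be sketched in Python as follows. The function name `iqr_median_of_halves` and the data are my own assumptions: the scores are hypothetical, chosen only so that they reproduce the quartiles quoted in the text (Q1 = 57, Q2 = 73, Q3 = 79).

```python
from statistics import median

def iqr_median_of_halves(values):
    """Return (q1, q2, q3, iqr) using the median-of-halves steps."""
    ordered = sorted(values)              # step 1: order the data
    n = len(ordered)
    half = n // 2
    q2 = median(ordered)                  # step 2: overall median
    lower = ordered[:half]                # step 3: lower half ...
    # ... and upper half; the centermost value is skipped when n is odd
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    q1, q3 = median(lower), median(upper)
    return q1, q2, q3, q3 - q1            # step 4: IQR = Q3 - Q1

# Hypothetical scores, not the original data from the article
scores = [11, 25, 57, 64, 70, 73, 75, 77, 79, 85, 98]
q1, q2, q3, iqr = iqr_median_of_halves(scores)
print(q1, q2, q3, iqr)  # 57 73 79 22
```

Note that statistical libraries often compute quartiles by interpolation instead, so their results may differ slightly from this hand method on small datasets.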

The IQR is sometimes used in market segmentation analysis, and for removing outliers before running k-means clustering.

Variance

Variance measures how much variability there is in a group of scores. It is calculated as the average of the squared deviations of each score from the mean.

Or, to put it differently: variance is the sum of squares (short for the sum of squared deviations from the mean) divided by the number of data points N for a population, or by n – 1 for a sample. The sample version divides by n – 1 rather than n to correct for the fact that the mean is itself estimated from the same data.

Variance formula for population:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
Variance formula for sample:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Even though we can use spreadsheets or other tools to calculate variance, it is best to understand the formulas and what goes into them, so that we know in which circumstances they provide useful insights.
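To make the two definitions concrete, here is a small sketch that computes the variance by hand and checks it against Python's standard `statistics` module. The five data points are hypothetical.

```python
from statistics import mean, pvariance, variance

data = [4, 8, 6, 5, 7]  # hypothetical scores

m = mean(data)                                # mean = 6
sum_of_squares = sum((x - m) ** 2 for x in data)  # 4 + 4 + 0 + 1 + 1 = 10

pop_var = sum_of_squares / len(data)          # population: divide by N
sample_var = sum_of_squares / (len(data) - 1) # sample: divide by n - 1
print(pop_var, sample_var)  # 2.0 2.5

# The library functions agree with the hand calculation
print(pvariance(data) == pop_var, variance(data) == sample_var)  # True True
```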

Standard deviation

Standard deviation (the square root of the variance) is useful because it expresses variability in the same units as the original measure, which makes it easier to interpret than variance. For example, if we measure the variability of heights in a sample of people, variance is expressed in squared units (such as square centimeters), while standard deviation brings us back to centimeters.

Standard deviation formula for population:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
Standard deviation formula for sample:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

The difference between variance and standard deviation can also be understood by looking at the example below, for which I have used the heights of 5 dogs. A variance of 320.213 is expressed in squared inches, so it says little about our sample and is difficult to interpret. The standard deviation, on the other hand, is expressed in inches and is easier to see in context. A standard deviation of about 17.9 inches (the square root of 320.213) shows that we have chosen dogs from different size groups.

Image caption done by the author
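The same comparison can be sketched in Python. The heights below are hypothetical and are not the five dogs from the example above; I treat them as a sample, so the sample formulas (dividing by n – 1) are used.

```python
from math import isclose, sqrt
from statistics import stdev, variance

# Hypothetical dog heights in inches (not the article's data)
heights_in = [8, 12, 18, 24, 28]

var = variance(heights_in)  # sample variance, in square inches
sd = stdev(heights_in)      # sample standard deviation, in inches

print(f"variance: {var:.2f} square inches")     # variance: 68.00 square inches
print(f"standard deviation: {sd:.2f} inches")   # standard deviation: 8.25 inches

# The standard deviation is simply the square root of the variance
print(isclose(sd, sqrt(var)))  # True
```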

Coefficient of variation

The coefficient of variation, or relative standard deviation, is calculated by dividing the standard deviation by the mean. It is used to compare the variability of two or more datasets.

Coefficient of variation formulas for population and sample:
$$CV = \frac{\sigma}{\mu} \qquad CV = \frac{s}{\bar{x}}$$

For the example above, I used house prices from Bucharest, Romania, and London, UK. The prices are expressed in two different currencies, euros and pounds. In this case, the standard deviations cannot be compared directly, because they are expressed in different units and on different scales. The coefficient of variation makes the comparison possible, because it is a unitless ratio.
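Here is a minimal sketch of that comparison. All prices and the exchange rate are hypothetical placeholders rather than the figures from the example, and `coefficient_of_variation` is a helper name I introduce for illustration.

```python
from math import isclose
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample coefficient of variation: s / x-bar (a unitless ratio)."""
    return stdev(values) / mean(values)

# Hypothetical house prices, not real market data
bucharest_eur = [85_000, 120_000, 97_000, 150_000, 110_000]
london_gbp = [420_000, 515_000, 610_000, 480_000, 750_000]

cv_bucharest = coefficient_of_variation(bucharest_eur)
cv_london = coefficient_of_variation(london_gbp)
print(f"Bucharest CV: {cv_bucharest:.1%}")
print(f"London CV:    {cv_london:.1%}")

# Converting pounds to euros at any fixed rate (1.15 is an assumed rate)
# rescales the mean and the standard deviation equally, so the CV is unchanged.
london_eur = [price * 1.15 for price in london_gbp]
print(isclose(cv_london, coefficient_of_variation(london_eur)))  # True
```

Because the currency conversion cancels out in the ratio, the two coefficients can be compared directly without converting either dataset.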

Z-scores

Z-scores or standard scores are especially useful when it comes to putting data into context.

To calculate a z-score, we need the mean and the standard deviation, because we compare the score’s deviation from the mean to the standard deviation. For a score x, we first subtract the mean from it and then divide the result by the standard deviation, as can be seen below. The result tells us how many standard deviations the score lies above or below the mean.

Z-score formula for population:
$$z = \frac{x - \mu}{\sigma}$$
Z-score formula for sample:
$$z = \frac{x - \bar{x}}{s}$$

How to interpret the results?

  • z-score = 0, the score (x) is exactly average;
  • z-score is positive, the score (x) is above average;
  • z-score is negative, the score (x) is below average.

Z-scores can be used to compare scores either within the same distribution or across different distributions.

Let’s now look at an example. The employees of a firm were given an aptitude test, on which they obtained different numbers of points. We can see from the table below that Joanne and Mark obtained the highest numbers of points (91 and 88), while Sam obtained the lowest (49). If we calculate the z-scores, we will also know how far above or below the average each result is.

Image caption by the author
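A short sketch of this calculation in Python follows. Only the scores of Joanne (91), Mark (88) and Sam (49) come from the text; the other two employees and their scores are hypothetical. I also assume the employees form the whole population of interest, so the population standard deviation is used.

```python
from statistics import mean, pstdev

def z_scores(points):
    """Map each name to its z-score: (x - mean) / population std dev."""
    values = list(points.values())
    mu = mean(values)
    sigma = pstdev(values)
    return {name: (x - mu) / sigma for name, x in points.items()}

# Joanne, Mark and Sam are from the article; Ana and Lee are made up
points = {"Joanne": 91, "Mark": 88, "Ana": 70, "Lee": 62, "Sam": 49}

z = z_scores(points)
for name, score in z.items():
    position = "above" if score > 0 else "below"
    print(f"{name}: z = {score:+.2f} ({position} average)")
```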

But what if a second test was given to all employees, and we wish to see, for example, whether those who performed poorly on the first test have improved over time? We look at Sam’s case again and notice that he obtained a higher number of points on his second test. This might make us think that his performance has improved. However, if we calculate the z-scores, we get a different result: relative to his colleagues, his performance has in fact decreased slightly. This can be explained by the fact that the overall performance of the other employees has increased, leading to a higher mean, as well as by the decrease in variability (a lower standard deviation). This is how context can sometimes make a difference.

Image caption by the author
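The effect can be reproduced with a small sketch. Apart from the first-test scores of Joanne, Mark and Sam, every number below is hypothetical, chosen so that the mean rises and the spread shrinks on the second test; the size of the drop in Sam's z-score therefore does not match the original table.

```python
from statistics import mean, pstdev

def z_score(x, values):
    """z-score of x relative to the population described by values."""
    return (x - mean(values)) / pstdev(values)

# Hypothetical results (only 91, 88 and 49 on test 1 come from the article)
test1 = {"Joanne": 91, "Mark": 88, "Ana": 70, "Lee": 62, "Sam": 49}
test2 = {"Joanne": 90, "Mark": 88, "Ana": 78, "Lee": 74, "Sam": 55}

z1 = z_score(test1["Sam"], list(test1.values()))
z2 = z_score(test2["Sam"], list(test2.values()))

# Sam's raw score goes up (49 -> 55), but the mean goes up too and the
# standard deviation goes down, so his z-score gets worse.
print(f"Test 1: points = {test1['Sam']}, z = {z1:.2f}")
print(f"Test 2: points = {test2['Sam']}, z = {z2:.2f}")
print(z2 < z1)  # True
```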

Conclusions

Together with measures of central tendency and correlations, measures of variability and z-scores help us describe data numerically. They are all part of what is called descriptive statistics.

I hope this article will be the start of a series on statistics, as I learn about different concepts and try to explore them better with my own examples. So please let me know if you would like to see certain topics covered, or if you think something should be added to this article.

