Introduction
We’ve all heard the saying, "Variety is the spice of life," and in data, that variety or diversity often takes the form of dispersion.
Data dispersion makes data fascinating by highlighting patterns and insights we wouldn’t have found otherwise. Typically, we use the following as measures of dispersion: variance, standard deviation, range, and interquartile range (IQR). However, we may need to examine dataset dispersion beyond these typical measures in some cases.
This is where the Coefficient of Variation (CV) and the Quartile Coefficient of Dispersion (QCD) provide insights when comparing datasets.
In this tutorial, we will explore the two concepts of CV and QCD and answer the following questions for each of them:
- What are they, and how are they defined?
- How can they be computed?
- How should the results be interpreted?
We will answer these questions through two examples.
Understanding Variability and Dispersion
Whether we’re measuring people’s heights or housing prices, we seldom find all data points to be the same. We wouldn’t expect everyone to be the same height: some people are tall, some average, some short. Data generally varies. To study this variability, or dispersion, we usually quantify it using measures like the range, variance, and standard deviation. These measures of dispersion quantify how spread out our data points are.
However, what if we wish to compare the variability across datasets? For example, what if we want to compare the sales of a jewelry shop and a bookstore? Comparing standard deviations directly won’t work here, as the scales of the two datasets are likely very different.
The CV and QCD are useful indicators of dispersion in this context.
Deep Dive: Coefficient of Variation
The Coefficient of Variation (CV), also known as the relative standard deviation, is a standardized measure of dispersion. It’s often expressed as a percentage and has no units. As a result, the CV is an excellent measure of variability for comparing data on different scales.
Mathematically, the CV is the ratio of the standard deviation to the mean, sometimes multiplied by 100 to express it as a percentage. The formula is as follows:
$$\mathrm{CV} = \frac{\sigma}{\mu}$$
where σ is the standard deviation and μ is the mean.
Let’s use NumPy’s mean and std functions to compute the CV in Python.
import numpy as np

def calc_cv(data_array) -> float:
    """Calculate the coefficient of variation (population std / mean)."""
    return np.std(data_array) / np.mean(data_array)
Now, let’s see how we can use this statistic in an example.
Example 1
Consider the following two datasets showing the monthly sales of a jewelry shop and a bookstore:
- Jewelry shop: The average monthly sales are $10,000, with a standard deviation of $2,000.
- Bookstore: The average monthly sales are $1,000, with a standard deviation of $200.
Let’s generate sample data for both datasets using NumPy.
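The data generation step could look like the following minimal sketch. It assumes normally distributed monthly sales; the seed, the sample size of 1,000, and the names `jewelry_sales`, `bookstore_sales`, and `summarize` are illustrative assumptions, so the output will be close to, but will not exactly reproduce, the figures reported in this example.

```python
import numpy as np

# Assumptions (not from the original notebook): seed, sample size, normality.
rng = np.random.default_rng(seed=42)
n_samples = 1_000

jewelry_sales = rng.normal(loc=10_000, scale=2_000, size=n_samples)
bookstore_sales = rng.normal(loc=1_000, scale=200, size=n_samples)

def summarize(sales: np.ndarray) -> dict:
    """Return the mean, population standard deviation, and CV of a sample."""
    mean, std = np.mean(sales), np.std(sales)
    return {"mean": mean, "std": std, "cv": std / mean}

jewelry_stats = summarize(jewelry_sales)
bookstore_stats = summarize(bookstore_sales)
for name, stats in [("Jewelry shop", jewelry_stats), ("Bookstore", bookstore_stats)]:
    print(f"{name}: mean={stats['mean']:.3f}, std={stats['std']:.3f}, cv={stats['cv']:.3f}")
```

Note that `np.std` computes the population standard deviation by default (`ddof=0`); passing `ddof=1` would give the sample standard deviation, which differs only slightly at this sample size.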
Jewelry Shop:
- Mean = $10119.616
- Standard Deviation = $2015.764
- CV = 0.199 (dimensionless)
Bookstore:
- Mean = $1016.403
- Standard Deviation = $206.933
- CV = 0.204 (dimensionless)
The jewelry shop’s mean and standard deviation are substantially larger than the bookstore’s (a mean of $10,120 and a standard deviation of $2,016, compared with a mean of $1,016 and a standard deviation of $207), yet their CVs are nearly identical (0.199 vs. 0.204, i.e., about 20%).
This means that, relative to their respective average sales, the jewelry shop and the bookstore have essentially the same variability despite the huge differences in their sales volumes (and standard deviations).
This exemplifies the idea of CV as a relative measure of variability and shows how it can be applied to make comparisons between datasets of different scales.
Next, let’s consider another dimensionless measure of dispersion, which is QCD.
Deep Dive: Quartile Coefficient of Dispersion
The Quartile Coefficient of Dispersion (QCD) is another measure of relative dispersion, especially useful when dealing with skewed data or even data with outliers. The QCD focuses on the spread of the middle 50% of a dataset, i.e., the interquartile range (IQR). That’s why QCD is a robust measure.
The QCD is calculated as follows:
$$\mathrm{QCD} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$
where Q1 is the first quartile (the 25th percentile) and Q3 is the third quartile (the 75th percentile).
import numpy as np

def calc_qcd(data_array) -> float:
    """Calculate the quartile coefficient of dispersion."""
    q1, q3 = np.percentile(data_array, [25, 75])
    return (q3 - q1) / (q3 + q1)
Similarly to the CV, the QCD is a unitless metric that may be very helpful for comparing the dispersion of skewed datasets.
The following example demonstrates the idea behind the QCD.
Example 2
Consider two datasets of employee ages from two firms.
- Company A (a startup): Younger workers, some elderly.
- Company B (a well-established company): Older workers, some younger.
Let’s generate sample data for both companies using NumPy.
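One way to produce skewed age data of this kind is sketched below. The shifted gamma distributions, their parameters, the seed, and the sample size are all assumptions chosen for illustration (they are not taken from the original notebook), so the numbers will only roughly resemble the figures reported in this example.

```python
import numpy as np

# Assumptions (not from the original notebook): seed, sample size, and the
# shifted gamma distributions used to create the skew.
rng = np.random.default_rng(seed=0)
n_samples = 10_000  # deliberately large so the quartile estimates are stable

# Company A: mostly young, with a tail of older employees (right-skewed).
ages_a = 21 + rng.gamma(shape=2.0, scale=2.0, size=n_samples)
# Company B: mostly older, with a tail of younger employees (left-skewed).
ages_b = 50 - rng.gamma(shape=2.0, scale=3.0, size=n_samples)

def quartile_summary(ages: np.ndarray) -> dict:
    """Return Q1, Q3, IQR, and QCD of a sample."""
    q1, q3 = np.percentile(ages, [25, 75])
    return {"q1": q1, "q3": q3, "iqr": q3 - q1, "qcd": (q3 - q1) / (q3 + q1)}

summary_a = quartile_summary(ages_a)
summary_b = quartile_summary(ages_b)
for name, s in [("Company A", summary_a), ("Company B", summary_b)]:
    print(f"{name}: Q1={s['q1']:.3f}, Q3={s['q3']:.3f}, IQR={s['iqr']:.3f}, QCD={s['qcd']:.3f}")
```

With these assumed distributions, Company B ends up with the larger IQR while Company A ends up with the larger QCD, which is the pattern this example sets out to illustrate.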
Company A:
- Q1 = 22.840 years
- Q3 = 26.490 years
- IQR = 3.650 years
- QCD = 0.074 (dimensionless)
Company B:
- Q1 = 42.351 years
- Q3 = 47.566 years
- IQR = 5.215 years
- QCD = 0.058 (dimensionless)
Now, let’s plot the distribution of the data along with the boxplot and QCD to visualise the information above.
Company B’s larger IQR (5.215 years vs. 3.650 years) suggests a wider age dispersion. However, the IQR is an absolute measure, and it is influenced by Company B’s generally older workforce (check the boxplots).
On the other hand, Company A has a larger QCD (0.074 vs. 0.058) than Company B, indicating greater age variation relative to its median. The IQR alone doesn’t reveal this.
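The comparison above can be checked by hand from the reported quartiles. The short sketch below simply applies IQR = Q3 - Q1 and QCD = (Q3 - Q1) / (Q3 + Q1) to the values listed for each company; the dictionary names are illustrative.

```python
# Quartiles as reported in Example 2 (years).
quartiles = {
    "Company A": (22.840, 26.490),
    "Company B": (42.351, 47.566),
}

results = {}
for company, (q1, q3) in quartiles.items():
    iqr = q3 - q1
    qcd = iqr / (q3 + q1)
    results[company] = (iqr, qcd)
    print(f"{company}: IQR = {iqr:.3f} years, QCD = {qcd:.3f}")

# Prints:
# Company A: IQR = 3.650 years, QCD = 0.074
# Company B: IQR = 5.215 years, QCD = 0.058
```

The ranking flips between the two measures: Company B has the larger IQR, while Company A has the larger QCD.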
In the next section, we’ll discuss when to use the CV and when to use the QCD.
Discussion
Let’s answer a few questions that you may have.
Why not focus on measures like standard deviation or IQR?
We use the standard deviation and the IQR to quantify dispersion within a dataset. The standard deviation measures the typical distance of data points from the mean. The IQR measures the spread of the middle 50% of the data.
However, these measures may be deceptive when comparing the dispersion of two or more datasets with different units or scales, skewed distributions, or the presence of outliers.
While standard deviation and IQR are useful statistical tools, we occasionally require CV and QCD to conduct fair comparisons.
The CV and QCD both measure and compare variability, although they do so in somewhat different ways. The nature of your data and the kind of variability you want to capture determine which one to use.
When to use CV
The CV is a good way to compare the amount of variation in datasets that have different scales, units, or average values. Because the CV is a relative measure of spread, it shows how large the spread is in proportion to the mean.
The CV is built from the mean and the standard deviation, two measures that are strongly affected by outliers. So, the CV can give a distorted view of spread in datasets that are heavily skewed or contain outliers. Thus, the CV works best with roughly symmetric data without extreme values.
In the sales example, the price ranges of the two shops were very different, so their sales were on very different scales. The jewelry shop has much higher average sales and much more variation in absolute terms. If we used the standard deviation to compare the variability of the two shops, we might wrongly conclude that the jewelry shop’s sales are more variable.
The CV allowed us to compare the variability of sales between the two datasets, regardless of their different scales. If the CV is higher for one category, it means that the sales are more variable relative to the average sales for that category.
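The scale independence of the CV can be illustrated with a small sketch: expressing the same sales in cents instead of dollars multiplies the standard deviation by 100 but leaves the CV unchanged. The sales values below are hypothetical and serve only as an illustration.

```python
import numpy as np

def calc_cv(data_array) -> float:
    """Coefficient of variation: population standard deviation / mean."""
    return np.std(data_array) / np.mean(data_array)

# Hypothetical monthly sales in dollars (illustrative values, not from the post).
sales_usd = np.array([9_500.0, 10_200.0, 8_700.0, 11_300.0, 10_300.0])
sales_cents = sales_usd * 100  # the same data expressed in a different unit

std_usd, std_cents = np.std(sales_usd), np.std(sales_cents)
cv_usd, cv_cents = calc_cv(sales_usd), calc_cv(sales_cents)

print(f"std: {std_usd:.2f} (USD) vs. {std_cents:.2f} (cents)")
print(f"CV:  {cv_usd:.4f} (USD) vs. {cv_cents:.4f} (cents)")
```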
When to use QCD
The QCD uses dataset quartiles, which are less outlier-sensitive. QCD is a robust dispersion measure for skewed distributions or datasets containing outliers. The QCD concentrates on the center 50% of the data, which may better capture dispersion in such datasets.
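To see this robustness in practice, the sketch below adds a single extreme value to a small, hypothetical set of ages and compares how the CV and the QCD respond. The data and the variable names are made up for illustration.

```python
import numpy as np

def calc_cv(data_array) -> float:
    """Coefficient of variation: population standard deviation / mean."""
    return np.std(data_array) / np.mean(data_array)

def calc_qcd(data_array) -> float:
    """Quartile coefficient of dispersion: (Q3 - Q1) / (Q3 + Q1)."""
    q1, q3 = np.percentile(data_array, [25, 75])
    return (q3 - q1) / (q3 + q1)

# Hypothetical ages 22, 23, ..., 31, then the same ages plus one outlier.
ages = np.arange(22, 32, dtype=float)
ages_with_outlier = np.append(ages, 95.0)

cv_base, cv_out = calc_cv(ages), calc_cv(ages_with_outlier)
qcd_base, qcd_out = calc_qcd(ages), calc_qcd(ages_with_outlier)

print(f"CV:  {cv_base:.3f} -> {cv_out:.3f}")
print(f"QCD: {qcd_base:.3f} -> {qcd_out:.3f}")
```

In this sketch the single outlier inflates the CV several times over, while the QCD moves by less than a tenth of its original value.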
In our example, we examined the ages at two companies: a startup (A) with mostly younger employees and a well-established company (B) with mostly older employees. Given their distinct age ranges, the older company has a higher median age and a larger absolute spread. Using the interquartile range (IQR) to compare dispersion might misleadingly suggest higher age variability in the established company, as the IQR measures absolute variability and tends to be larger when the values themselves are larger.
The QCD is more effective because it divides the IQR by the sum of the quartiles, Q1 + Q3, a scale that tracks the median (it is roughly twice the median when the middle of the distribution is symmetric). This enables us to compare age variability between companies on different scales. A higher QCD indicates greater age variability relative to the central values for that company. Therefore, the QCD was chosen for this comparison as it accounts for different scales and potential skew or outliers.
Takeaways
Choosing between CV and QCD depends on the nature of your dataset and analysis goals. Below are key points about both measures:
Coefficient of Variation (CV):
- CV is calculated as the ratio of the standard deviation to the mean.
- CV is dimensionless.
- A higher CV indicates greater variability relative to the mean.
- The CV can be unstable or misleading when the mean is near zero, because it divides by the mean.
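The near-zero-mean caveat can be illustrated with a small sketch. It uses hypothetical temperature readings, first in degrees Celsius (mean close to zero) and then in kelvins (the same readings shifted by 273.15); the values and names are assumptions made for illustration only.

```python
import numpy as np

def calc_cv(data_array) -> float:
    """Coefficient of variation: population standard deviation / mean."""
    return np.std(data_array) / np.mean(data_array)

# Hypothetical temperature readings (illustrative values).
temps_celsius = np.array([-3.0, -1.0, 0.5, 2.0, 2.5])  # mean is 0.2
temps_kelvin = temps_celsius + 273.15                   # same spread, shifted

cv_celsius = calc_cv(temps_celsius)
cv_kelvin = calc_cv(temps_kelvin)

print(f"std (both scales): {np.std(temps_celsius):.3f}")
print(f"CV in Celsius: {cv_celsius:.3f}")
print(f"CV in Kelvin:  {cv_kelvin:.5f}")
```

The spread is identical in both cases, yet the CV in Celsius is enormous simply because the mean happens to sit near zero. This is why the CV is best reserved for quantities measured from a meaningful zero, with a mean well away from it.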
Quartile Coefficient of Dispersion (QCD):
- QCD is based on quartiles.
- QCD is a robust measure (less sensitive to extreme values).
- QCD is dimensionless.
- A higher QCD indicates higher variability of values relative to the median.
- QCD does not fully capture the spread if the distribution’s tails are important.
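The last point about the tails can be made concrete with a short sketch. The two hypothetical datasets below share the same quartiles but differ in their extreme values, so the QCD is identical while the standard deviation is not. The values are made up for illustration.

```python
import numpy as np

def calc_qcd(data_array) -> float:
    """Quartile coefficient of dispersion: (Q3 - Q1) / (Q3 + Q1)."""
    q1, q3 = np.percentile(data_array, [25, 75])
    return (q3 - q1) / (q3 + q1)

# Hypothetical datasets: same middle values, different extremes.
narrow_tails = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0])
heavy_tails = np.array([1.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 500.0])

qcd_narrow, qcd_heavy = calc_qcd(narrow_tails), calc_qcd(heavy_tails)
std_narrow, std_heavy = np.std(narrow_tails), np.std(heavy_tails)

print(f"QCD: {qcd_narrow:.3f} vs. {qcd_heavy:.3f}")
print(f"std: {std_narrow:.2f} vs. {std_heavy:.2f}")
```

If the behavior in the tails matters for the analysis, the QCD should therefore be complemented with a measure that takes the full range of the data into account.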
Conclusion
To sum up, the Coefficient of Variation (CV) and the Quartile Coefficient of Dispersion (QCD) are valuable statistics for examining dispersion in numerical data. The CV excels at comparing data on different scales, while the QCD helps with skewed datasets or datasets that contain outliers. We looked at two examples (with Python code and analysis) to see how this works in practice. Used wisely, both measures can provide useful information for making decisions.
📓 You can find the notebook for this post on GitHub.
Thanks for reading! 📚
UPDATE (August 13, 2023): A previous version of this post included an incorrect plot of two identical histograms in Example 2 (QCD). The image has been updated to reflect the correct information.
I’m a senior data scientist 📊 and engineer, writing about statistics, Machine Learning, Python, and more.
Originally published at https://ealizadeh.com.