
Robust Statistics for Data Scientists Part 1: Resilient Measures of Central Tendency and…

Building a foundation: understanding and applying robust measures in data analysis

Image generated with DALL-E

The role of statistics in Data Science is central, bridging raw data to actionable insights. However, not all statistical methods are created equal, especially when faced with the harsh realities of (messy) real-world data. This brings us to robust statistics, a subfield designed to withstand the anomalies in data that often throw traditional statistical methods off course.

Classical vs. Robust Statistics: A Necessary Shift

While classical statistics have served us well, their susceptibility to outliers and extreme values can lead to misleading conclusions. Enter robust statistics, which aims to provide more reliable results under a wider variety of conditions. This approach is not about discarding outliers without consideration but about developing methods that are less sensitive to them.

Robust statistics is grounded in the principle of resilience. It’s about constructing statistical methods that remain unaffected, or minimally affected, by small deviations from assumptions that traditional methods hold dear. This resilience is crucial in real-world Data Analysis, where perfectly distributed datasets are the exception, not the norm.

Key concepts in robust statistics are outliers, leverage points, and breakdown points.

Outliers and Leverage Points

Outliers are data points that significantly deviate from the other observations in the dataset. Leverage points, particularly in the context of regression analysis, are outliers in the independent variable space that can excessively influence the fit of the model. In both cases, their presence can distort the results of classical statistical analyses.

For instance, let’s consider a dataset where we measure the effect of study hours on exam scores. An outlier might be a student who studied very little but scored exceptionally high, while a leverage point could be a student who studied an unusually high number of hours compared to their peers.

To illustrate, we will simulate a simple dataset with both an outlier and a leverage point and visualise their effects on a linear regression model.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from scipy import stats
from scipy.stats import median_abs_deviation

def create_model_add_point(x_base, y_base, point=None):
    """Extend base data with an optional point and fit a linear model."""
    if point is not None:
        x, y = np.append(x_base, point[0]), np.append(y_base, point[1])
    else:
        x, y = x_base, y_base
    model = LinearRegression().fit(x.reshape(-1, 1), y)
    return x, y, model

# Set seed for reproducibility and simulate base dataset
np.random.seed(42)
x_base = np.random.normal(5, 2, 30)
y_base = 0.5 * x_base + np.random.normal(0, 0.5, 30)

# Prepare datasets with base, outlier, and leverage points
datasets = [
    (x_base, y_base, LinearRegression().fit(x_base.reshape(-1, 1), y_base)),  # Base case
    create_model_add_point(x_base, y_base, (4, 10)),  # Adding an outlier
    create_model_add_point(x_base, y_base, (15, 10.5))  # Adding a leverage point
]

# Plotting setup
plt.rcParams.update({'font.size': 15})
sns.set_palette("deep")
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
titles = ["Original Data", "With Outlier", "With Leverage Point"]
special_points = [(None, None), (4, 10), (15, 10.5)]

palette = sns.color_palette()
plot_info = {"Original Data": {"col": palette[0]},
             "With Outlier": {"col": palette[3]},
             "With Leverage Point": {"col": palette[1]}}

# Plotting data points, regression lines, and special points
for i, ((x, y, model), title, (px, py)) in enumerate(zip(datasets, titles, special_points)):
    x_values = np.linspace(min(x), max(x), 100)
    y_values = model.predict(x_values.reshape(-1, 1))
    axes[0, i].scatter(x, y, color=palette[0], label="Data Points", s=100)
    axes[0, i].plot(x_values, y_values, color="red", label="Linear Fit", lw=3)
    if px is not None:
        axes[0, i].scatter(px, py, color=plot_info[title]["col"], label=title, zorder=5, s=100)
    axes[0, i].set_title(title, fontweight='bold')
    axes[0, i].set_xlabel("Study Hours")
    axes[0, i].set_ylim(-0.5, 11)
    axes[0, i].legend()

axes[0, 0].set_ylabel("Exam Score")

# Plotting intercepts and slopes with adapted colors
bar_colors = [plot_info[title]["col"] for title in titles]
intercepts = [model.intercept_ for _, _, model in datasets]
slopes = [model.coef_[0] for _, _, model in datasets]

axes[1, 0].bar(range(3), intercepts, color=bar_colors)
axes[1, 0].set_title("Estimated intercepts", fontweight='bold')
axes[1, 0].set_xticks(range(3))
axes[1, 0].set_xticklabels(titles)

axes[1, 1].bar(range(3), slopes, color=bar_colors)
axes[1, 1].set_ylim(0.2, 0.7)
axes[1, 1].set_title("Estimated slopes", fontweight='bold')
axes[1, 1].set_xticks(range(3))
axes[1, 1].set_xticklabels(titles)

# Hide the third subplot in the second row as it's not needed
axes[1, 2].axis('off')

plt.tight_layout()
plt.show()
Image by the author

The plots in the upper panels show the impact on the regression lines when these anomalies are introduced. The addition of an outlier significantly skews the regression line upwards by increasing its intercept (see the intercepts barplot, bottom left), demonstrating the outlier’s influence on the model. Conversely, the leverage point pulls the regression line towards itself, increasing the estimated slope (see the slopes barplot) while decreasing the intercept, potentially leading to misleading interpretations of the relationship in the data.

Breakdown Point

The breakdown point of an estimator is the largest proportion of contaminated data the estimator can tolerate before it can be made arbitrarily wrong; the higher the breakdown point, the more robust the estimator.

To demonstrate this concept, let’s compare the mean and median as estimators of central tendency in the presence of increasing outliers.

# Original data
data = np.random.normal(0, 1, 100)

# Function to replace a given proportion of the data with outliers
def introduce_outliers(data, proportion):
    n_outliers = int(len(data) * proportion)
    if n_outliers == 0:  # avoid data[:-0], which would return an empty array
        return data
    outliers = np.random.normal(20, 5, n_outliers)  # Generate extreme values far from the bulk of the data
    return np.concatenate([data[:-n_outliers], outliers])

proportions = np.linspace(0, 0.5, 25)
means = []
medians = []

for proportion in proportions:
    contaminated_data = introduce_outliers(data, proportion)
    means.append(np.mean(contaminated_data))
    medians.append(np.median(contaminated_data))

# Plotting
plt.figure(figsize=(10, 6))
plt.plot(proportions, means, label="Mean", marker="o", linestyle="--", color=palette[3])
plt.plot(proportions, medians, label="Median", marker="x", linestyle="-", color=palette[0])
plt.xlabel("Proportion of Outliers")
plt.ylabel("Value of Estimator")
plt.title("Breakdown Point: Mean vs. Median")
plt.legend()
plt.grid(True)
plt.show()
Image by the author

The plot shows that the median has a high breakdown point (50%), meaning it can handle up to 50% of the data being contaminated before becoming unreliable. In contrast, the mean has a low breakdown point (0%), as even a single extreme outlier can significantly alter its value.
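
To see how little it takes to break the mean, here is a minimal sketch (assuming numpy is imported as np, as above) in which a single observation is replaced by an arbitrarily large value:

# Replace one observation with a gross outlier: the mean is dragged away, the median barely moves
sample = np.arange(1, 11, dtype=float)       # 1, 2, ..., 10
contaminated = sample.copy()
contaminated[-1] = 1e6                       # a single extreme value

print(np.mean(sample), np.median(sample))                # 5.5, 5.5
print(np.mean(contaminated), np.median(contaminated))    # ~100004.5, 5.5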

Altogether, the results from the simulations above underscore the need for robust statistical methods that can mitigate the influence of such anomalies, ensuring more reliable and accurate data analysis in real-world scenarios.

Robust Central Tendency and Spread Measures

Photo by Billy Huynh on Unsplash

Outliers can significantly distort the results obtained from traditional summary measures such as the mean and the standard deviation. Let’s delve into some robust alternatives that offer a clearer picture of the data’s central tendency and spread.

Robust Central Tendency Measures

  1. Median: The median is a robust measure of central tendency that divides a dataset into two equal halves. Unlike the mean, it is unaffected by extreme values, making it a reliable measure in the presence of outliers.
  2. Trimmed Mean: The trimmed mean enhances robustness by removing a specified percentage of the lowest and highest values from the dataset before calculating the mean. This process reduces the influence of outliers.
  3. Winsorized Mean: The Winsorized mean also aims to reduce the effect of outliers but does so by replacing the extreme values with the nearest values in the data, rather than removing them. This approach maintains the original dataset size.
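
To make the mechanics concrete before turning to library functions, here is a minimal manual sketch on a small, made-up sample (the numbers are purely illustrative, and numpy is assumed to be imported as np): the trimmed mean drops the extreme values, while the Winsorized mean clamps them to the nearest retained values.

# Minimal sketch of trimming vs. Winsorizing on a hypothetical sample
x = np.sort(np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 100], dtype=float))
k = int(0.1 * len(x))  # number of observations handled in each tail (here 1)

trimmed = x[k:len(x) - k].mean()                 # drop 1 and 100 -> mean of the middle 8 values (3.25)
winsorized = np.clip(x, x[k], x[-k - 1]).mean()  # clamp 1 -> 2 and 100 -> 5, then average (3.3)
print(trimmed, winsorized)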

Let’s create a synthetic dataset that includes outliers and illustrate how these robust measures compare to the traditional mean.

# Creating a synthetic dataset with outliers
np.random.seed(0)
data = np.random.normal(50, 15, 100)  # Normal distribution
data = np.append(data, [200, 220, 250])  # Adding outliers

# Calculating measures
mean = np.mean(data)
median = np.median(data)
trimmed_mean = stats.trim_mean(data, 0.1)
winsorized_mean = stats.mstats.winsorize(data, limits=[0.1, 0.1]).mean()

# Plotting the histogram and measures
plt.figure(figsize=(10, 6))
plt.hist(data, bins=30, color='skyblue', alpha=0.7, label='Data Distribution')
plt.axvline(mean, color='red', linestyle='--', linewidth=2, label='Mean')
plt.axvline(median, color='green', linestyle='-', linewidth=2, label='Median')
plt.axvline(trimmed_mean, color='blue', linestyle='-.', linewidth=2, label='Trimmed Mean')
plt.axvline(winsorized_mean, color='purple', linestyle=':', linewidth=2, label='Winsorized Mean')

plt.title("Comparison of Central Tendency Measures")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.legend()
plt.show()
Image by the author

In the above example, we generated a normal distribution and introduced a few extreme outliers to simulate a realistic scenario. We then calculated the mean, median, trimmed mean, and Winsorized mean. The histogram of the data, along with vertical lines representing each measure, illustrates how the median, trimmed mean, and Winsorized mean are less influenced by the outliers compared to the traditional mean. The median remains at the center of the data, while the trimmed and Winsorized means adjust to provide a more accurate representation of the central tendency, closer to the median than to the mean distorted by outliers.
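
For a quick numeric check alongside the plot, the four estimates computed above can simply be printed (the exact values depend on the seed set earlier):

# Print the four location estimates for a side-by-side numeric comparison
print(f"Mean: {mean:.2f} | Median: {median:.2f} | "
      f"Trimmed mean: {trimmed_mean:.2f} | Winsorized mean: {winsorized_mean:.2f}")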

Robust Spread Measures

Robust dispersion measures, such as the Median Absolute Deviation (MAD) and Interquartile Range (IQR), are essential in understanding the variability in a dataset, especially when it contains outliers.

  1. Median Absolute Deviation (MAD): The MAD is a robust measure of the variability of a univariate sample. It is calculated as the median of the absolute deviations from the dataset’s median. This measure gives an idea of the spread of the data and is less affected by outliers because it relies on the median rather than the mean.

# Generating a dataset with outliers
np.random.seed(0)
data = np.random.normal(0, 1, 100)  # Normal distribution
data = np.append(data, [5, 5, -5, -5])  # Adding outliers

# Calculating the MAD
mad = median_abs_deviation(data)

# Plotting the data and MAD
plt.figure(figsize=(10, 6))
plt.hist(data, bins=30, color='skyblue', alpha=0.7, label='Data Distribution')
plt.axvline(np.median(data), color='green', linestyle='-', linewidth=2, label='Median')
plt.axvline(np.median(data) + mad, color='red', linestyle='--', linewidth=2, label='MAD')
plt.axvline(np.median(data) - mad, color='red', linestyle='--', linewidth=2)
plt.title("Median Absolute Deviation (MAD)")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.legend()
plt.show()
Image by the author

Here the median is marked by a solid green line, and the MAD is represented by dashed red lines on either side of the median, indicating the spread of the data around the median.
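
Note that the raw MAD is not on the same scale as the standard deviation. When the two need to be compared for roughly normal data, the MAD is conventionally multiplied by about 1.4826 (the normal consistency constant); a minimal sketch, assuming a reasonably recent SciPy version that supports the scale='normal' argument:

# Rescale the MAD so it estimates the standard deviation under normality
mad_scaled = median_abs_deviation(data, scale='normal')  # equivalent to mad * ~1.4826
print(np.std(data, ddof=1), mad_scaled)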

  2. Interquartile Range (IQR): The IQR measures the range between the first quartile (25th percentile) and the third quartile (75th percentile) in a dataset, essentially covering the middle 50% of the data points. The IQR is a robust measure of spread because it is not influenced by extreme values or outliers.

# Calculating the IQR
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Plotting the data and IQR
plt.figure(figsize=(10, 6))
plt.hist(data, bins=30, color='skyblue', alpha=0.7, label='Data Distribution')
plt.axvline(q1, color='blue', linestyle='--', linewidth=2, label='Q1')
plt.axvline(q3, color='red', linestyle='--', linewidth=2, label='Q3')
plt.title("Interquartile Range (IQR)")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.legend()
plt.show()
Image by the author

In this histogram, the first and third quartiles (Q1 and Q3) are marked by dashed lines. The region between these lines represents the IQR and covers the central 50% of the data, highlighting the spread that is less affected by outliers.
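
A common follow-up, sketched below, is to use the quartiles to flag potential outliers with the familiar 1.5 × IQR rule (the same heuristic behind boxplot whiskers); q1, q3, and iqr are reused from the snippet above:

# Flag observations falling outside the 1.5 * IQR fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
flagged = data[(data < lower_fence) | (data > upper_fence)]
print(f"{len(flagged)} potential outliers outside [{lower_fence:.2f}, {upper_fence:.2f}]")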

In summary, both MAD and IQR provide a resilient estimation of data variability in the presence of outliers, making them indispensable tools for robust statistical analysis.

Challenges and Considerations in Choosing the Right Tool

Photo by Barn Images on Unsplash

While robust statistics offer powerful solutions, they are not universally applicable to every scenario. The nature of data distributions and the specific objectives of research inquiries necessitate customised approaches. Below is a concise guide to assist in this selection process.

Median

When to Use:

  • Ideal for skewed distributions or when data contains outliers.
  • Useful for ordinal data, or when the central location matters more than the arithmetic average.

Limits:

  • Does not utilize all data points, which may lead to a loss of information, especially in symmetric distributions without outliers.
  • Not as useful for further statistical analysis that requires mean values, such as computing variance.

Trimmed Mean

When to Use:

  • Suitable for data with outliers, but where the mean is still preferred over the median for retaining more information from the data set.
  • Effective in nearly symmetric distributions with extreme outliers.

Limits:

  • The choice of trimming percentage can be subjective and significantly affects the result.
  • May still be influenced by outliers if the trimming percentage is not chosen appropriately.

Winsorized Mean

When to Use:

  • Appropriate for distributions with outliers, particularly when data points are not to be discarded but extreme values need to be controlled.
  • Useful when the sample size is small, and retaining every data point is important.

Limits:

  • Similar to the trimmed mean, the choice of limits for Winsorization can be arbitrary and influence the result.
  • Can introduce bias if the Winsorization percentage is too high, especially in skewed distributions.

General Suggestions

  • Symmetric Distributions Without Outliers: The mean and standard deviation are typically sufficient.
  • Skewed Distributions or Presence of Outliers: Median, trimmed mean, or Winsorized mean are more reliable. The median is highly robust but less sensitive to small changes in data, making it suitable for highly skewed distributions. The trimmed and Winsorized means offer a compromise, reducing the impact of outliers while retaining more data information than the median.
  • Exploratory Data Analysis: Starting with the median and MAD can provide a quick, outlier-resistant overview of the data. If the data appears mostly symmetric and outlier-free, the mean and standard deviation can be used for a more detailed analysis.
  • Final Analysis: Consider the distribution and presence of outliers. In many cases, a combination of measures may provide the most comprehensive understanding, such as reporting both the mean and median, or using robust estimators alongside traditional measures for comparison.

In summary, the choice of metric should be driven by the data’s characteristics and the specific goals of the analysis. It’s often beneficial to explore multiple measures to gain a comprehensive understanding of the data’s central tendency and dispersion.
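
To make that kind of multi-measure exploration routine, a small helper that reports classical and robust measures side by side can be handy. The sketch below is purely illustrative (robust_summary is a name chosen here, not a library function), reusing the imports from earlier:

# Illustrative helper: classical and robust location/spread measures side by side
def robust_summary(values):
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    return {
        "mean": np.mean(values),
        "median": np.median(values),
        "trimmed_mean_10": stats.trim_mean(values, 0.1),
        "winsorized_mean_10": stats.mstats.winsorize(values, limits=[0.1, 0.1]).mean(),
        "std": np.std(values, ddof=1),
        "mad": median_abs_deviation(values),
        "iqr": q3 - q1,
    }

print(robust_summary(data))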

Forward Look: Robust Correlation, Regression, and Advanced Robust Methods

This introduction has laid the groundwork for robust statistical practices in Data Science. In the next tutorial, we will explore robust methods to analyse relationships between variables, even in the presence of outliers. See you there!

