
Cognitive Biases in Data Science: The Category-Size Bias

A data scientist's guide to outsmarting biases


Imagine you find yourself in a quaint neighborhood with two bakeries. The first is a small, family-owned bakery, warmly nestled on the corner street. The second, however, is a grand three-story establishment, with a sign that showcases its extensive selection and state-of-the-art ovens.

As you embark on your quest for the perfect loaf of bread, you are drawn to the towering bakery. The sheer size and grandeur of the building make an immediate impression, leading you to assume that the larger bakery must surely produce the finest bread.

In this scenario, you’re unknowingly succumbing to a mental tendency known as category size bias, which leads you to believe that the larger bakery is more likely to offer superior bread.

In reality, the size of the bakery doesn’t necessarily correlate with the quality of its bread. The smaller, family-owned bakery may have a closely guarded secret recipe, perfected over generations, while the larger bakery might focus on quantity over artisanal craftsmanship.

This bias reflects our inclination to associate larger categories with better outcomes, even when the specific characteristics within those categories do not align with our assumptions.

Category size bias refers to our inclination to perceive outcomes as more probable when they belong to a larger category as opposed to a smaller one, even when the likelihood of each outcome is equal.
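
To make the definition concrete, here’s a minimal simulation (the raffle setup and numbers are my own illustration, not drawn from a specific study): every individual ticket has the same 1-in-10 chance of winning, yet the bias makes a ticket from the larger red category feel more likely to win.

import numpy as np

rng = np.random.default_rng(0)
n_draws = 100_000

# 10 raffle tickets: 8 red (the large category), 2 blue (the small one).
# Each individual ticket is equally likely to win any given draw.
wins = np.zeros(10)
for _ in range(n_draws):
    wins[rng.integers(10)] += 1

print("a specific red ticket: ", wins[0] / n_draws)   # ~0.10
print("a specific blue ticket:", wins[8] / n_draws)   # ~0.10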

Although the bias is grounded in experimentally validated studies, interpretations of the evidence continue to vary.

In the realm of Data Science, category size bias may manifest via specific assumptions. For instance:

Assumption 1: Larger, more complex models always provide better predictions than smaller models.

Within the context of category size bias, the tendency is to believe that the performance of a Neural Network or an ML model improves with its size or complexity. Consequently, regardless of whether the data or task aligns with the model’s characteristics, there’s often a focus on complex models. This inclination is akin to the bandwagon effect, where newer, more intricate, and renowned algorithms are presumed to be cutting-edge, even for tasks they are not well-suited for. Consider, for instance, using large language models (LLMs) for relatively simple tasks.

For example, extra layers are often added to a neural network even when they barely improve the model; many common tasks can be handled well by networks with just a couple of hidden layers.

Caveat: It’s important to note that the assumption of larger, more complex models yielding better predictions holds true in many cases, especially when dealing with complex tasks that require a high level of precision or problems involving vast and diverse datasets.

Trade-off: The trade-off in such scenarios encompasses both performance and resource consumption. While complex or larger models demand more resources, the increased resource consumption doesn’t automatically translate to superior model performance. Although this may not always be necessary or impactful, in critical problems, making such considerations can indeed make a significant difference.
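
As a rough illustration of this trade-off, here is a minimal sketch using scikit-learn’s MLPClassifier on synthetic data (the layer sizes and dataset are arbitrary choices for demonstration): a much larger network often fails to beat a small one on a simple task while costing more to train.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# A simple two-class task with mild noise
X, y = make_moons(n_samples=2000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Small network: a single 16-unit hidden layer
small = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=42)
small.fit(X_train, y_train)

# Much larger network: five 256-unit hidden layers
large = MLPClassifier(hidden_layer_sizes=(256, 256, 256, 256, 256), max_iter=2000, random_state=42)
large.fit(X_train, y_train)

print("small model accuracy:", small.score(X_test, y_test))
print("large model accuracy:", large.score(X_test, y_test))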

Assumption 2: Overlooking Class Imbalance by Relying Solely on Higher Accuracy

This problem is a more widely acknowledged one. When relying on a metric such as accuracy, there is a heightened risk of overlooking an underlying bias. To illustrate, consider a dataset with 5 instances of a disease among 1000 patients. If a model achieves 99.5% accuracy simply by classifying every instance as negative (since the vast majority of instances are negative), one might assume the model performs well. In reality, it performs inadequately, as it fails to identify a single positive case.

To explain further, let’s look at a basic example. We’ll create 100 random numbers between 0 and 1. If a number is over 0.92, we call it positive; otherwise, it’s negative. We’ll use logistic regression as our model. The code snippet below demonstrates this scenario:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_recall_curve, average_precision_score

# Generate synthetic data
np.random.seed(42)
thresh = 0.92
X = np.random.uniform(0, 1, 100).reshape(-1, 1)
y = (X > thresh).astype(int).ravel()

model = LogisticRegression()
model.fit(X, y)

y_pred = model.predict(X)
y_prob = model.predict_proba(X)[:, 1]

# Calculate performance metrics
accuracy = accuracy_score(y, y_pred)
precision, recall, _ = precision_recall_curve(y, y_prob)
average_precision = average_precision_score(y, y_prob)
f1 = f1_score(y, y_pred)

Although the model achieves a respectable accuracy of 0.92, other metrics such as recall and the F1-score reveal far inferior performance: none of the positive instances are correctly classified as positive.
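
One quick way to confirm this (a small addition to the snippet above) is to print the confusion matrix, where the predicted-positive column comes out empty:

from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes;
# with the model above, the predicted-positive column is all zeros.
print(confusion_matrix(y, y_pred))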

Caveat: There are cases where prioritizing accuracy is reasonable. When class imbalance is minimal and the cost of misclassification is relatively low, accuracy can serve as a sensible primary metric.

Trade-off: The trade-off in this scenario concerns model performance itself, and the stakes can be considerable. However, with minor imbalances or data that has already been rebalanced during preprocessing, the concern is less consequential.

Assumption 3: Equating Larger Datasets with Improved Performance

While it’s often the case that larger datasets bring more features, additional information, and a greater likelihood of realistic predictions, this holds true only up to a point. The amount of data needed depends on the specific problem. For example, classification tasks with a small number of classes or many informative features may do well with smaller datasets. Similarly, tasks involving lower-order function approximation may be satisfied by a more modest dataset.

Let’s understand this with an example. First, we generate synthetic data with roughly 20,000 samples and 10 features, half of them informative, with five clusters per class.

import time

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score

# Generate a synthetic dataset
samples = 20000
X, y = make_classification(n_samples=samples, n_features=10, n_informative=5, n_clusters_per_class=5, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
indices = np.arange(X_train.shape[0])
np.random.shuffle(indices)

Subsequently, we employ Support Vector Classification (SVC) for class separation on a subset of the data, specifically 9,000 instances:

count_small = 9000
X_train_small = X_train[:count_small]
y_train_small = y_train[:count_small]

start_time = time.time()
model_small = SVC(probability=True)
model_small.fit(X_train_small, y_train_small)  # Train on the 9,000-sample subset only

y_pred_small = model_small.predict(X_test)
elapsed_small = time.time() - start_time

accuracy_small = accuracy_score(y_test, y_pred_small)

When measuring the time taken for this task, we observe 6.9590 seconds with an accuracy of 0.8430. Subsequently, we utilize the entire dataset for training the SVC, comprising approximately 16,000 instances for training and 4,000 instances for testing:

start_time = time.time()
model_large = SVC(probability=True)
model_large.fit(X_train, y_train)  # Train on the full training set

y_pred_large = model_large.predict(X_test)
elapsed_large = time.time() - start_time

accuracy_large = accuracy_score(y_test, y_pred_large)

This time, the algorithm takes 20.4149 seconds, about three times as long, yielding an accuracy of 0.8452 – remarkably close despite almost double the training data. Comparing the ROC curves of the two models likewise reveals nearly identical results.
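
To put numbers on that comparison, one can also compute ROC AUC scores directly (a minimal sketch; predict_proba is available because both models were created with probability=True):

from sklearn.metrics import roc_auc_score

prob_small = model_small.predict_proba(X_test)[:, 1]
prob_large = model_large.predict_proba(X_test)[:, 1]

print("AUC, 9,000 training samples: ", roc_auc_score(y_test, prob_small))
print("AUC, 16,000 training samples:", roc_auc_score(y_test, prob_large))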

Caveat: Complex tasks often need more data, but there’s a point where more data doesn’t necessarily mean better results.

Trade-off: While additional data rarely harms the model, the trade-off lies not in performance but in computational cost. This cost can be substantial when irrelevant information is added to the model, emphasizing the importance of data pre-processing and cleaning, particularly in complex tasks.

Assumption 4: Equating Longer, More Complex Algorithms with Superior Performance

There is a common notion that longer, more complex algorithms are superior to their shorter, simpler counterparts, and a more intricate algorithm is sometimes favored even when a simpler one accomplishes the task perfectly well. The notion is not always unfounded; the issue lies in the rationale behind it. If complexity and length are genuinely warranted by the problem, there is no harm in them.

To illustrate this assumption, let’s compare two code blocks. The first function appears quite straightforward:

def is_even_simple(num):
    return num % 2 == 0

On the other hand, the second function involves considerably more work, with linear rather than constant time complexity. Yet the ultimate goal of both functions is exactly the same:

def is_even_complex(num):
    if num < 0:
        num = -num  # handle negatives: -4 is just as even as 4
    if num == 0:
        return True
    else:
        while num >= 2:
            num -= 2
        if num == 0:
            return True
        else:
            return False
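
As a quick sanity check (a small addition for illustration), the two functions can be verified to agree across a range of inputs, negatives included:

# Both implementations should return identical results
for n in range(-10, 11):
    assert is_even_simple(n) == is_even_complex(n)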

I intentionally kept this example simple, but the idea applies equally to more complicated code that over-engineers a specific problem or takes on responsibilities beyond its task.

Caveat: This doesn’t apply to adding unit tests, comments, or improvements that genuinely make the code better. Furthermore, the belief that longer and more complex algorithms are superior is not unfounded in certain contexts. For tasks that inherently require nuanced decision boundaries or involve complex relationships, a more sophisticated algorithm might indeed be necessary.

Trade-off: The cost of this bias often manifests in terms of computational resources. While it might not significantly impact simpler tasks, the expense becomes more pronounced for more complex tasks.

Avoiding Category-Size Bias: Strategies for Awareness and Mitigation

While category-size bias may not be the most detrimental bias, it can lead to a significant drain on resources.

When possible, it’s better to break things down into smaller, simpler tasks and start there; doing so gives a clearer understanding of the underlying problem.

Being aware of our subconscious inclination towards favoring one option over another is key to avoiding the bias. A valuable approach is to challenge assumptions using the Socratic questioning technique.

Understanding Socratic Questioning: A Comprehensive Guide

Another approach is to attempt to prove the hypothesis opposite to the one initially favored.

Taking the time to individually assess data and treat each problem distinctly can provide valuable insights.

Wrapping up…

In conclusion, this post highlighted the influence of category-size bias in the realm of data science. The key takeaway is to stay mindful of the bias and to consistently take the context of the problem into account.

Coming up on the horizon…

Our analyses rarely rely on algorithms and models alone; they are heavily influenced by deeply embedded cognitive biases. This post explored the category-size bias and its impact on decision-making in data science. However, this is just the start. In upcoming posts of this series, I’ll focus on uncovering more cognitive biases and their impact on data research and analytics. From assumptions about causation to the appeal of anecdotal evidence, from confirmation bias to the bandwagon effect, the posts will explore the intersection of human biases and data science.

My goal is to present a more in-depth examination of the biases that can affect our analytical pursuits, and to promote a more nuanced, bias-aware data science practice.

I’ve added some additional resources below for further exploration.

Resources

List of Cognitive Biases and Heuristics – The Decision Lab

The Cognitive Biases List: A Visual Of 180+ Heuristics

The Ladder of Inference – How to Avoid Jumping to Conclusions

