
Efficient Data Valuation with Exact Shapley Values

Produce Better Models with Less Data.

Thoughts and Theory

Photo by Alexander Sinn on Unsplash

This post is an overview of the work done in the paper – ‘Efficient task-specific data valuation for nearest neighbour algorithms’.

Does more data always produce better results? Given the trends in machine learning, you would expect as much. More data seems to be king, given the recent rise of larger and larger machine learning models with exceptional performance.

This pattern is exemplified by advances in machine learning like OpenAI's GPT-3 model, trained on over 45 terabytes of data from books and the internet to support its 175 billion parameters. Yet even this model is dwarfed by Google's new Switch Transformer architecture, with 1.6 trillion parameters. So it appears that bigger and bigger models with more and more data are inevitable.

"Information is the oil of the 21st century, and analytics is the combustion engine" – (Peter Sondergaard, Senior Vice President, Gartner)

If data is so valuable, should we aim to gather more? Why shouldn’t we have more of what is so valuable?

There is a problem with this approach. What if the additional data you gather is irrelevant to your problem? What if it is purely noise? Then the additional data will actively hurt the performance of your model. So what can you do to mitigate this problem?

Well, for one, you can painstakingly look through your data. This is probably one of the most effective approaches, but it is highly time-consuming when you have a lot of data. For computer vision tasks, there are many different techniques you can use; the post linked below shows some exciting techniques for precisely that.

Explaining how I reached the top ranks of the new Data-Centric competition

In many cases, the applicable data is relatively easy to identify. For others, it is an unclear task at best. Often, part of the problem is that the relationships between the data and the outcome are exactly what we're trying to model. In that case, identifying which data is most applicable is an arduous task.

Fortunately, there is a new method that is perfect for this scenario based on Shapley values.

SHAP Values

SHapley Additive exPlanations (SHAP) is a game-theoretic approach to explain the output of any machine learning model. This method is fairly well known, but the attribution is based solely on the features.

The method provides attribution values from the features to the output of the model. SHAP values are based on Shapley values, which determine how to fairly distribute a payout among the players of a cooperative game. For SHAP values, the players in the coalitions are the features.

GitHub – slundberg/shap: A game theoretic approach to explain the output of any machine learning…

SHAP values are incredibly flexible. For example, in computer vision tasks, SHAP values represent the attribution of different pixels to the model’s output. There are many different methods to calculate SHAP values, including a KernelSHAP method which is model-independent.
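As a quick illustration, here is a minimal sketch of feature-level attribution with KernelSHAP. The model and data are toy stand-ins, not the ones used later in this post:

import shap
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Toy regression model, purely for illustration
X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = DecisionTreeRegressor().fit(X, y)

# KernelSHAP is model-independent: it only needs a prediction function and background data
background = shap.sample(X, 50)
explainer = shap.KernelExplainer(model.predict, background)
shap_values = explainer.shap_values(X[:10])  # one attribution per feature, per instance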

For each variation of SHAP values, the attributions always relate to the features of the model. However, there is an alternative: use the instances as the players and calculate an attribution for each instance.

Data Valuation

‘Efficient task-specific data valuation for nearest neighbour algorithms’ is a recent paper providing novel algorithms to calculate exact Shapley values. For the rest of this post, I refer to the Shapley values produced for each instance as Data Shapley Values.

Data Shapley Values are a recent innovation that utilizes Shapley values to determine the attribution of different data instances. The research was motivated by selecting records from a data market for privacy-preserving machine learning. The data market consists of many different medical records, and the data buyer chooses a subset of records to purchase. Because of the cost of the data, the buyer aims to select an optimal subset of patients for their model.

Data Market Problem https://arxiv.org/pdf/1908.08619.pdf (Photo from Paper)

The Shapley values in this problem configuration measure the marginal improvement in utility attributed to each data point, averaged over all possible data subsets.
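In the standard formulation, with D the training set of N points and U the utility (here, performance on the validation set), the Shapley value of point i is its average marginal contribution over all subsets S that do not contain it:

$$s_i = \frac{1}{N} \sum_{S \subseteq D \setminus \{i\}} \binom{N-1}{|S|}^{-1} \left[ U(S \cup \{i\}) - U(S) \right]$$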

The most significant issue for computing Shapley values is the high degree of complexity. Generally, this is on the level of O(2^N) for exact calculations.

However, the researchers have developed a novel algorithm that computes exact Shapley values for KNN classifiers using a KNN utility. The algorithm relies on the fact that the KNN utility satisfies a piecewise utility difference property.

I’ll leave the exact mathematical formulation for the paper and the curious readers. But here is the result. The algorithm performs exact computation with O(N log N) complexity.
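To give a flavour of how this works, here is a rough sketch of the per-test-point recursion as I read it from the paper, for the unweighted KNN classification utility (the function name is my own; the repository linked below contains the authoritative implementation):

import numpy as np

def knn_shapley_single(x_trn, y_trn, x_test, y_test, K):
    # Sort the training points by distance to the test point (nearest first)
    N = len(x_trn)
    alpha = np.argsort(np.linalg.norm(x_trn - x_test, axis=1))
    s = np.zeros(N)
    # Start at the farthest point and recurse back towards the nearest
    s[alpha[N - 1]] = float(y_trn[alpha[N - 1]] == y_test) / N
    for i in range(N - 2, -1, -1):
        match_i = float(y_trn[alpha[i]] == y_test)
        match_next = float(y_trn[alpha[i + 1]] == y_test)
        s[alpha[i]] = s[alpha[i + 1]] + (match_i - match_next) / K * min(K, i + 1) / (i + 1)
    return s  # one Shapley value per training instance, for this single test point

The sort dominates the cost, so each test point takes O(N log N); averaging the resulting values over all validation points gives the per-instance attributions used in the experiments below.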

GitHub – AI-secure/KNN-PVLDB: Official Repo for "Efficient task-specific data valuation for nearest…


Experiments

The structure of the experiments follows a simple format. First, the data is separated into training, validation, and test sets. Then, the exact Shapley values are calculated based on the attribution of the training instances to the validation instances.

Then the training instances are ordered according to their Shapley values. This setup provides the user with the attributions of each instance to the validation set. Next, the user can select only those instances with the highest Shapley value, providing the user with a smaller subset of data.

As this process selects the data that contributes the most to the performance on the validation set, removing instances with low Shapley values often removes the noisiest instances within the data while, at the same time, keeping the most representative examples.

The experiments use the diamonds dataset. This dataset contains almost 54,000 diamonds. The features include diamond attributes such as the carat of the diamond, the cut, the colour, and several other features. Some of the features are categorical, and for these experiments, I transform these features into boolean features for each category.

The target is the price of the diamonds.

Diamonds

I've taken 5,000 instances each for validation and testing. The remainder of the data is used for training.

The aim is to produce a model to predict the price of diamonds using less data. I’ve evaluated the performance with a single decision tree regressor. The data is ordered based on the Shapley value, and the performance is measured with an R² score.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from sklearn.tree import DecisionTreeRegressor
from exact_sp import get_true_KNN, compute_single_unweighted_knn_class_shapley

# Load the diamonds dataset and one-hot encode the categorical features
df = pd.read_csv('diamonds.csv', index_col='index').reset_index(drop=True)
df = pd.get_dummies(df, columns=['cut', 'color', 'clarity'])
x_trn = df.drop(['price'], axis=1).to_numpy()
y_trn = df['price'].astype(float).to_numpy()

# Shuffle, then hold out 5,000 instances each for testing and validation
x_trn, y_trn = shuffle(x_trn, y_trn, random_state=10)
x_tst, y_tst = x_trn[:5000], y_trn[:5000]
x_val, y_val = x_trn[5000:10000], y_trn[5000:10000]
x_trn, y_trn = x_trn[10000:], y_trn[10000:]

K = 1  # number of neighbours used by the KNN utility

Next, the Shapley values are calculated on the validation set. The following chunk of code then sequentially fits a model on smaller and smaller subsets of the training data, removing the instances with the lowest Shapley values first.

Once the model is trained on the subset of training data, the model is evaluated on the test dataset.

# Exact Shapley values of each training instance with respect to the validation set
x_val_knn_gt = get_true_KNN(x_trn, x_val)
val_sp_gt = compute_single_unweighted_knn_class_shapley(x_trn, y_trn, x_val_knn_gt, y_val, K)
g_values = np.mean(val_sp_gt, axis=0)  # average over the validation points

count = int(len(x_trn)*4/5)       # remove at most 80% of the training data
interval = int(count*0.02)        # drop 2% of that amount per step
x = np.arange(0, count, interval)/len(x_trn)

g_r2_scores = []
g_r2_scores_val = []
idxs = np.argsort(g_values)       # ascending: lowest Shapley values first
keep_idxs = idxs.tolist()

for j in range(0, count, interval):
    if len(keep_idxs) == len(x_trn):
        x_train_keep, y_train_keep = x_trn, y_trn
    else:
        x_train_keep, y_train_keep = x_trn[keep_idxs], y_trn[keep_idxs]
    # Fit a decision tree on the remaining data and score it on the test and validation sets
    reg = DecisionTreeRegressor()
    reg.fit(x_train_keep, y_train_keep)
    r2_score = reg.score(x_tst, y_tst)
    r2_score_val = reg.score(x_val, y_val)
    g_r2_scores.append(r2_score)
    g_r2_scores_val.append(r2_score_val)
    # Drop the remaining instances with the lowest Shapley values before the next fit
    keep_idxs = keep_idxs[interval:]
plt.plot(x, g_r2_scores, '-', label='Test Performance', color='olive')
plt.plot(x, g_r2_scores_val, '-', label='Validation Performance', color='blue')
plt.ylabel('R2 Score')
plt.xlabel('Percent of Data Removed')
plt.legend(loc="lower left")
plt.title('Data Valuation with Exact Shapley Values')
plt.show()

The experiment results show that with less data, the model performs better on the test data. The peak of the test performance occurs when almost 50% of the training data is removed.

What is also intriguing is that the model performance on the validation set peaks when over 60% of the training data is removed. Since the instances are removed based on the Shapley values on the validation set, this pattern makes some sense.

Performance Change after Removing Data (Photo by Author)

Conclusion

Despite the increasing availability of massive amounts of data, the quality of that data is crucial. Building your models on low-quality data produces low-quality models.

Utilizing Shapley values on instances offers an alternative approach. You can build your models with fewer instances and get improved performance. The experiments show that even a dramatically smaller subset of data can improve model performance.

Data is the new oil, but the quality of that oil matters.

Consider using data Shapley values in your next machine learning model.


If you're interested in reading articles about novel Data Science tools and understanding machine learning algorithms, consider following me on Medium.

If you’re interested in my writing and want to support me directly, please subscribe through the following link. This link ensures that I will receive a portion of your membership fees.

Join Medium with my referral link – Zachary Warnes

