Dive into Bias Metrics and Model Explainability with Amazon SageMaker Clarify

Emily Webber
Towards Data Science
13 min read · Dec 10, 2020


By Emily Webber and the Amazon SageMaker Clarify Team

Analyze bias in your datasets with Amazon SageMaker Clarify and SageMaker’s Data Wrangler

It’s the end of the quarter. You’ve spent months working on a complex modeling pipeline that synthesizes data across multiple sources, trains and tunes hundreds of models, identifies the optimal configuration, and deploys automatically to an inference pipeline via SageMaker. Now your stakeholders come back to you and are asking why your model predicts the way it does. Do you need to rewrite your entire pipeline? How do you know if there’s something wrong with your training data? How do you explain why your model made a certain prediction?

Enter Amazon SageMaker Clarify. SageMaker Clarify is a fully managed toolkit to identify potential bias within a training dataset or model, explain individual inference results, aggregate these explanations for an entire dataset, integrate with built-in monitoring capabilities to assess production performance, and do all of this across modeling frameworks. Starting in December of 2020, customers can use SageMaker Clarify to analyze their datasets for bias around specific attributes of interest, both before and after training. Customers can also point the feature at a SageMaker-hosted endpoint to view per-prediction feature importances, giving them the granularity they need to better understand their model performance. While SageMaker Clarify is fully available on both notebook instances and SageMaker Studio, the decoupled compute instances per notebook and the ease of use that come with SageMaker Studio provide a seamless user experience.

SageMaker Studio writes a lot of code for you, and we’re taking advantage of that within SageMaker Clarify.

For a framework outlining the intended usage for SageMaker Clarify, see a link here. For an academic paper on fairness measures within finance, check here. You can also walk through a coding example hosted on GitHub here to get a better feel for the bias metrics and functionality. This post should take 20 minutes of your time, but by the end you’ll be one step closer towards picking the right bias metrics for your use case and explaining these to your stakeholders.

It’s important to remember that AWS sees this feature launch as the starting point. As with all AWS services we index heavily on customer feedback to continuously improve our service offerings, both in terms of the basic capabilities and the prescriptive guidance we give to customers to see results.

Bias Metrics and Explainability

First off, let’s explore the two terms bias and explainability. While both serve the same broad goal of building robust, performant, and fair machine learning models, they refer to different methods and different parts of your analysis.

Bias metrics are descriptive statistics that we can compute both before and after training the model. They include measures such as class imbalance, difference in positive proportions, and conditional demographic disparity, and are among almost twenty statistics built into SageMaker Clarify. All of these have slightly different implications for your use case. We need to spend some time studying them, really contemplating what they describe, so we can learn how to select the right ones for the datasets we’re working with and the use cases we are solving for.

Explainability, on the other hand, leverages models themselves to quantify the relative impact of each feature on our prediction results. What this means is that you can use it to understand which specific columns, variables, or attributes in your dataset are driving the final prediction result, both for a specific prediction and for the dataset as a whole. By a wide margin, the most commonly used method for explainability within tabular machine learning is SHAP, a method inspired by game theory that trains much smaller ML models on subsets of features and iteratively investigates their impact on the dataset in question. In SageMaker Clarify we’re introducing a managed implementation of Kernel SHAP, with performance enhancements on runtime, across the entire ML lifecycle.

As we’ll come to find out, there are many different ways of defining bias. A central challenge that we need to learn how to overcome as aspiring data scientists is to master these definitions of bias, figure out which ones we need to apply for different stages in our workflow and our applications, and collaborate with our stakeholders to define the right thresholds for these metrics.

It’s helpful to call out that both bias metrics and explainability are model-agnostic methods. That is to say, they’ll operate on really any type of modeling framework. We just need to make sure the data format and type is supported by SageMaker Clarify!

Pre-Training Metrics

Pre-training metrics are a great first pass on a dataset. We can run a job with SageMaker Clarify to identify key differences in our dataset across various attributes of interest, that is, variables for which we’d like to surface any imbalances in the samples that could lead to a biased model. SageMaker Clarify will produce a downloadable bias report with key statistics, and in Studio we can view all of these generated metrics with a rich visualization. To make it even easier, we also have metric description cards integrated within Studio that surface the intuition behind each metric, walk through examples, and suggest ranges. Many of these cards will even point to the paper proposing that statistic on arxiv.org! If you’re interested, you can jump straight to one commonly referenced paper right here.
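If you prefer to work in code rather than the Studio UI, a pre-training bias job can be launched with the SageMaker Python SDK roughly as follows. This is a minimal sketch: the S3 paths, column names, and facet values are placeholders, and the parameter names reflect the SDK as of this writing.

```python
import sagemaker
from sagemaker import clarify

session = sagemaker.Session()
role = sagemaker.get_execution_role()   # works inside SageMaker; otherwise pass a role ARN

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)

data_config = clarify.DataConfig(
    s3_data_input_path="s3://my-bucket/train/train.csv",  # placeholder path
    s3_output_path="s3://my-bucket/clarify-output/",      # placeholder path
    label="approved",                                     # target column
    headers=["approved", "age", "income", "gender"],      # dataset columns
    dataset_type="text/csv",
)

bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],   # the favorable label value
    facet_name="gender",             # attribute of interest
    facet_values_or_threshold=[0],   # value(s) identifying the disadvantaged group
)

# Computes the pre-training metrics (class imbalance, DPL, KL, JS, CDDL, and more)
clarify_processor.run_pre_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
    methods="all",
)
```

The resulting bias report lands in the S3 output path, and in Studio you can browse the same metrics visually.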

It’s helpful to point out that, in addition to looking at differences across attributes of interest, many of these metrics are looking at this difference in terms of the label. That is to say, to date these are primarily solving for supervised learning.

Class Imbalance

Equation for Class Imbalance: CI = (n_a - n_d) / (n_a + n_d)

Our first metric to pick apart is called class imbalance. Simply put, this metric tells us the difference in the number of data samples we have for the advantaged group, n_a, relative to the disadvantaged group, n_d. The class imbalance statistic falls in the range [-1, +1], with a perfectly balanced dataset giving us a class imbalance of 0. One way to improve this statistic is to downsample your advantaged group, or even to augment your set of disadvantaged samples to improve their representation. Masked transformers might be a nice way of doing that!
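To make the definition concrete, here is a tiny NumPy sketch of the statistic as described above; the helper name and the boolean facet encoding (True marks the disadvantaged group) are my own.

```python
import numpy as np

def class_imbalance(facet):
    """Class imbalance CI = (n_a - n_d) / (n_a + n_d), where `facet` is a
    boolean array marking membership in the disadvantaged group."""
    facet = np.asarray(facet, dtype=bool)
    n_d = facet.sum()        # samples in the disadvantaged group
    n_a = (~facet).sum()     # samples in the advantaged group
    return (n_a - n_d) / (n_a + n_d)

# Example: 800 advantaged vs. 200 disadvantaged samples -> CI = 0.6
print(class_imbalance([False] * 800 + [True] * 200))
```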

Difference in Positive Proportions of True Labels

Equation for DPPL (true labels): DPPL = n_a(1) / n_a - n_d(1) / n_d

This statistic looks at the difference in the ratio of positive cases to total samples between the advantaged and disadvantaged groups. That is to say, we take the number of positive cases for the advantaged group, n_a(1), and compare it to the total number of advantaged samples we have, n_a. Next, we do the same thing for the disadvantaged group, comparing the number of positive cases, n_d(1), to the total number of disadvantaged samples, n_d. This is a valuable metric because it tells us the relative likelihood of members from the advantaged and disadvantaged groups falling into the positive class.

If this statistic is close to 0, then we say that demographic parity has been achieved.
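The same boolean facet encoding works here; a minimal sketch of the statistic, with the helper name and example values my own:

```python
import numpy as np

def diff_positive_proportions(labels, facet):
    """DPPL (true labels) = n_a(1)/n_a - n_d(1)/n_d, using observed labels.
    `labels` holds 1 for the positive outcome, `facet` marks the disadvantaged group."""
    labels = np.asarray(labels)
    facet = np.asarray(facet, dtype=bool)
    pos_rate_a = labels[~facet].mean()   # n_a(1) / n_a
    pos_rate_d = labels[facet].mean()    # n_d(1) / n_d
    return pos_rate_a - pos_rate_d

# 70% positive outcomes for the advantaged group vs. 40% for the
# disadvantaged group -> difference of roughly 0.3
labels = [1] * 7 + [0] * 3 + [1] * 4 + [0] * 6
facet = [False] * 10 + [True] * 10
print(diff_positive_proportions(labels, facet))
```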

Kullback-Leibler & Jensen-Shannon Divergences

Equation for KL Divergence: KL(P_a || P_d) = Σ_y P_a(y) · log[ P_a(y) / P_d(y) ]

These two metrics are a nice pair. They both describe the difference between two label distributions. The first term, P_a, refers to the label distribution of the advantaged group, while P_d refers to the distribution for the disadvantaged group. They’re actually nice equations to get comfortable with, because you’re very likely to see them come up under other circumstances, such as generative deep learning. The Kullback-Leibler divergence is the relative entropy of the distribution of advantaged cases with respect to that of the disadvantaged cases.

Equation for JS Divergence: JS(P_a, P_d) = ½ [ KL(P_a || P) + KL(P_d || P) ], where P = ½ (P_a + P_d)

The Jensen-Shannon divergence builds on this to bring symmetry into the equation, quite literally: we can swap the order of the two distributions and still get the same result.
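Both divergences are short NumPy computations over the per-group label distributions; the small epsilon below is my own guard against zero probabilities, and the example distributions are made up.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """D_KL(P || Q) = sum_y P(y) * log(P(y) / Q(y)) over label values."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    """Jensen-Shannon: symmetric average of KL divergences to the midpoint."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Label distributions (e.g. P(approved), P(denied)) for each group
P_a = [0.7, 0.3]   # advantaged group
P_d = [0.4, 0.6]   # disadvantaged group
print(kl(P_a, P_d), js(P_a, P_d))   # js(P_a, P_d) == js(P_d, P_a)
```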

Conditional Demographic Disparity in Labels

Equation for CDDL: CDDL = Σ_i (n_i / n) · DD_i, where DD_i is the demographic disparity computed within subgroup i

The conditional demographic disparity in labels helps us understand whether the disadvantaged group has a larger proportion of rejected outcomes than accepted outcomes.

This statistic uses the demographic disparity statistic, or DD, which is the difference between the proportion of rejected outcomes and the proportion of accepted outcomes that belong to a given group. The conditional demographic disparity in labels statistic looks at the weighted average of DD across different subgroups, each weighted by its size.
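Here is one way to compute DD and CDDL directly from that description. Treat it as a sketch of the math rather than Clarify’s exact implementation; the conditioning column (`strata`) and the variable names are mine, and the sketch assumes every subgroup contains both accepted and rejected samples.

```python
import numpy as np
import pandas as pd

def demographic_disparity(labels, facet):
    """DD = (share of rejected outcomes belonging to the group)
          - (share of accepted outcomes belonging to the group)."""
    labels = np.asarray(labels)
    facet = np.asarray(facet, dtype=bool)
    rejected, accepted = labels == 0, labels == 1
    return facet[rejected].mean() - facet[accepted].mean()

def cddl(labels, facet, strata):
    """Weighted average of DD over subgroups, each weighted by its size."""
    df = pd.DataFrame({"label": labels, "facet": facet, "strata": strata})
    total = len(df)
    return sum(
        len(g) / total * demographic_disparity(g["label"], g["facet"])
        for _, g in df.groupby("strata")
    )

# Toy example conditioned on a department column
labels = [1, 0, 0, 1, 1, 0, 0, 0]
facet  = [0, 1, 1, 0, 0, 1, 0, 1]
strata = ["science", "science", "arts", "arts", "arts", "science", "science", "arts"]
print(cddl(labels, facet, strata))
```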

There are a variety of additional pre-training metrics to study, but in the interest of time we’ll move onto the post-training metrics.

Post-Training Metrics

Disparate (Adverse) Impact and Difference in Positive Proportions in Predicted Labels

Equation for DPPL: DPPL = q_a - q_d

These two related statistics compare the proportion of cases from the disadvantaged group that are predicted as accepted, q_d, with that of the advantaged group, q_a. The first statistic simply takes the difference between these proportions.

Equation for Disparate Impact: DI = q_d / q_a

In the disparate (adverse) impact statistic, we simply divide the q_d term by the q_a term to assess the magnitude of this difference. There is some legal precedent for considering a result on this statistic in the range of [4/5, 5/4] as being fair.
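Both statistics are a few lines of NumPy once you have predicted labels and group membership; the helper name and example values below are mine.

```python
import numpy as np

def dppl_and_di(pred_labels, facet):
    """q_a and q_d are the predicted acceptance rates for the advantaged
    and disadvantaged groups; DPPL = q_a - q_d, DI = q_d / q_a."""
    pred_labels = np.asarray(pred_labels)
    facet = np.asarray(facet, dtype=bool)
    q_a = pred_labels[~facet].mean()
    q_d = pred_labels[facet].mean()
    return q_a - q_d, q_d / q_a

# Predicted acceptance rates of 60% vs. 45% -> DPPL = 0.15, DI = 0.75,
# which falls outside the commonly cited [4/5, 5/4] range.
preds = [1] * 60 + [0] * 40 + [1] * 45 + [0] * 55
facet = [False] * 100 + [True] * 100
print(dppl_and_di(preds, facet))
```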

Difference in conditional outcome (acceptance and rejection)

Equation for Difference in Conditional Outcome

This statistic looks at the relationship between the actual labels and the predicted labels, and helps us see whether that relationship is the same across the advantaged and disadvantaged groups.

In the case of DCA, we’re looking at this for accepted samples; in the case of DCR, we’re doing the same thing for rejected samples. This helps us pick up on the type of bias that arises when a lender gives more loans to men than the model prescribes, but fewer loans to women than the model prescribes.
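To make the idea concrete, here is one plausible way to compute DCA and DCR. I am assuming the convention that c is the ratio of observed to predicted acceptances (or rejections) within each group, so treat this as a sketch rather than Clarify’s exact formula.

```python
import numpy as np

def conditional_acceptance_rejection(y_true, y_pred, facet):
    """Sketch of DCA and DCR: compare, per group, the ratio of observed to
    predicted acceptances (and rejections), then difference the groups."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    facet = np.asarray(facet, dtype=bool)

    def diff(target):  # target = 1 for acceptances, 0 for rejections
        c_a = (y_true[~facet] == target).sum() / (y_pred[~facet] == target).sum()
        c_d = (y_true[facet] == target).sum() / (y_pred[facet] == target).sum()
        return c_a - c_d

    return diff(1), diff(0)   # (DCA, DCR)

# Toy example with eight applicants
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0]
facet  = [0, 0, 0, 0, 1, 1, 1, 1]
print(conditional_acceptance_rejection(y_true, y_pred, facet))
```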

Recall, precision, and accuracy differences

Equation for Recall Difference: RD = TP_a / (TP_a + FN_a) - TP_d / (TP_d + FN_d)

This term helps us understand how well the model performs in terms of recall within both the advantaged and disadvantaged groups. It may be the case that recall is high only within the advantaged group but low within the disadvantaged group, meaning the model is better at finding positive samples for the advantaged group.

The terms mentioned here come from a confusion matrix — a common table used to identify true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), for a given confidence level within a set of predictions. This is well defined elsewhere, so we won’t define them here.

It’s common to look at similar statistics for precision and accuracy.
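As a concrete reference, here is a small NumPy sketch of the recall difference; the precision and accuracy differences follow the same pattern with their own confusion-matrix formulas. The helper names and example values are mine.

```python
import numpy as np

def recall_difference(y_true, y_pred, facet):
    """RD = recall_a - recall_d, where recall = TP / (TP + FN)
    computed separately inside each group."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    facet = np.asarray(facet, dtype=bool)

    def recall(mask):
        tp = np.sum((y_true == 1) & (y_pred == 1) & mask)
        fn = np.sum((y_true == 1) & (y_pred == 0) & mask)
        return tp / (tp + fn)

    return recall(~facet) - recall(facet)

y_true = [1, 1, 1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
facet  = [0, 0, 0, 0, 1, 1, 1, 1]
print(recall_difference(y_true, y_pred, facet))   # 2/3 - 1/2 ≈ 0.17
```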

Treatment Equality

Equation for Treatment Equality

This statistic looks at the difference in the ratio of false positives to false negatives between the advantaged and disadvantaged groups. You may have cases where both groups have the same level of accuracy, but different ratios of false accepts to false rejects. For example, let’s say 100 men and 50 women apply for a loan. Out of the men, 8 were wrongly denied, while 6 were wrongly approved. For the women, 5 were wrongly denied, but only 2 were wrongly approved. The two error rates are the same (14%), yet the mix of errors is quite different, and a treatment equality statistic brings this to the forefront.
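Plugging the loan example into code makes the disparity easy to see. I have written the ratio the way the paragraph above describes it (false approvals per false denial); Clarify’s formula may orient the ratio the other way, but the gap it surfaces is the same.

```python
# Loan example from above: false negatives are wrongful denials,
# false positives are wrongful approvals.
men   = {"FN": 8, "FP": 6}   # advantaged group
women = {"FN": 5, "FP": 2}   # disadvantaged group

ratio_a = men["FP"] / men["FN"]       # 0.75 wrongful approvals per wrongful denial
ratio_d = women["FP"] / women["FN"]   # 0.40 wrongful approvals per wrongful denial

treatment_equality = ratio_d - ratio_a   # non-zero -> the groups are treated differently
print(ratio_a, ratio_d, treatment_equality)
```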

Counterfactual Flip Test

Equation for Counterfactual Flip Test

This metric looks at pairs of similar candidates across the advantaged and disadvantaged groups, and helps us understand whether or not the model treats them the same. Typically these pairs are chosen using k-nearest neighbors; that is to say, the F+ and F- terms count matched pairs of similar samples, selected via KNN across the two groups, whose predictions flip in the favorable or unfavorable direction.
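Here is a rough sketch of the idea using simple 1-nearest-neighbor matching with scikit-learn; the matching strategy, normalization, and function name are mine, and Clarify’s exact implementation may differ.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def flip_test(X_adv, preds_adv, X_dis, preds_dis):
    """Match each disadvantaged sample to its most similar advantaged
    sample and count how often the model's prediction flips."""
    preds_adv = np.asarray(preds_adv)
    preds_dis = np.asarray(preds_dis)
    nn = NearestNeighbors(n_neighbors=1).fit(X_adv)
    _, idx = nn.kneighbors(X_dis)
    matched = preds_adv[idx.ravel()]
    f_plus = np.sum((preds_dis == 0) & (matched == 1))    # would flip to favorable
    f_minus = np.sum((preds_dis == 1) & (matched == 0))   # would flip to unfavorable
    return (f_plus - f_minus) / len(X_dis)
```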

Believe it or not, this list is really just the beginning of the metrics available in SageMaker Clarify! For an exhaustive list with an even deeper treatment, take a look at our whitepaper.

Explainability

Now, let’s take a quick look at another way of analyzing our data, which is to produce global and local feature importances. This uses a method called SHAP, a model-agnostic approach inspired by game theory. SHAP produces a variety of candidate datasets, each with some features swapped out for baseline values, and sends them to the trained model endpoint. Based on the model’s responses, and the baseline statistics and samples that we provide when we configure the job, SHAP infers the relative importance of each feature.

In SageMaker Clarify we use Kernel SHAP estimation, which treats your model as a closed box when applying the game-theoretic Shapley values framework. A feature’s importance is determined by how much its addition contributes to the model’s prediction, averaged over the possible combinations of features that exclude it. This is valuable because we are not limited in terms of model framework, learning style, software packages, or model size. As long as we can get our model onto a SageMaker endpoint and (as of this writing) we’re using tabular data, we can use that model with SageMaker Clarify.
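Continuing the SDK sketch from the pre-training section (so `clarify_processor` and `data_config` are assumed to already exist), an explainability job can be configured along these lines; the model name, baseline row, and sample count are placeholders, and the parameter names reflect the SageMaker Python SDK as of this writing.

```python
from sagemaker import clarify

model_config = clarify.ModelConfig(
    model_name="my-trained-model",    # a model already created in SageMaker (placeholder)
    instance_type="ml.m5.xlarge",
    instance_count=1,
    accept_type="text/csv",
    content_type="text/csv",
)

shap_config = clarify.SHAPConfig(
    baseline=[[35, 50000, 0]],   # baseline feature values used to "remove" features
    num_samples=100,             # synthetic coalitions generated per explanation
    agg_method="mean_abs",       # how local values aggregate into global importance
)

clarify_processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    explainability_config=shap_config,
)
```

The job produces both per-prediction and aggregated feature importances, which you can then inspect in Studio.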

For a deep dive on Kernel SHAP estimation, check out our whitepaper.

Bias alerts for models in production with SageMaker Clarify

Monitor Bias Drift and Feature Importance Over Time in Deployed Models

Now that we’re talking about endpoints, let’s have a quick discussion about ongoing monitoring of models in production! Last year SageMaker launched SageMaker Model Monitor, which uses Amazon Deequ to learn baseline thresholds and constraints on our training data. That is to say, we pass the entire training dataset up to SageMaker, use the built-in image to learn adequate thresholds for each feature, and make any modifications we like. Then we can enable ongoing monitoring jobs that compare the inference data hitting our endpoint with our learned baseline, alerting us any time they find statistical evidence of data drift.

In SageMaker Clarify, we can monitor our model’s performance on the bias metrics over time. That is to say, even after we’ve deployed a model, we can leverage SageMaker Clarify to help check that the fairness of this model stays consistent with any compliance and regulation needs over time.

The integration with Model Monitor here leverages confidence intervals. Let’s imagine that we compute a confidence interval C for a given fairness metric, along with our pre-defined allowed range A. Remember, the confidence interval is computed from a sample of the live inference data hitting our endpoint, while the pre-defined range comes from our offline training data and is fully customizable.

If there is an overlap between our confidence interval C and our pre-defined range A, we consider it likely that the bias metric falls within our allowed range. However, if the two are disjoint, with no overlap at all, then a bias alert will be raised.
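The overlap check itself is simple enough to sketch in a few lines of Python; the interval endpoints below are made-up numbers, purely for illustration.

```python
def bias_alert(conf_interval, allowed_range):
    """Raise an alert when the bias metric's confidence interval C and the
    pre-defined allowed range A are disjoint (no overlap at all)."""
    c_low, c_high = conf_interval
    a_low, a_high = allowed_range
    overlaps = c_low <= a_high and a_low <= c_high
    return not overlaps   # True -> raise a bias alert

print(bias_alert((0.10, 0.25), (-0.15, 0.15)))   # overlap -> no alert (False)
print(bias_alert((0.20, 0.35), (-0.15, 0.15)))   # disjoint -> alert (True)
```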

With SageMaker Clarify we can execute monitoring jobs that help us view changes in feature importance over time, in addition to monitoring bias drift against a threshold.

Analyze bias metrics against your threshold for models in production with SageMaker Clarify

Leverage Data Wrangler to Analyze Bias

Lastly, with the newly launched SageMaker Data Wrangler, we can now use a simple UI to click through our data connections, transformations, and bias detection. The picture at the top of the post shows how easily we can pull in our dataset and add a step to detect bias within it. This can run within seconds! See the SageMaker Data Wrangler page for more details.

FAQs

  1. How do I pick the right metric for my case? The most important thing to do is consider the implications of what happens when our model is right, and what happens when it is wrong, across as many dimensions as we can think of. Then you want to work backwards from those scenarios and define KPIs you are comfortable with and able to support.
  2. How do I deal with multiple disadvantaged groups at the same time? When running our SageMaker Clarify job we can specify all of the groups we are interested in analyzing in the job config. Then we’ll need to view the results for these groups one by one in SageMaker Studio.
  3. How do I mitigate bias after I’ve identified it? This is a really big topic, but briefly you can attack it by changing up your sampling strategy, applying transformations to dial up or dial down the impact of specific features, moving samples around with clustering, using transformers to generate new samples, and so on. You can also try training multiple models, each “optimizing” on different metrics. Yet another option is to put the metric calculation directly inside of your loss function when defining your model.
  4. What if the metrics I am trying to optimize for conflict with each other? The reality is that this is almost certainly going to happen. The best path forward is to collaborate with our business stakeholders and identify the top few metrics that we really care about, and modify our modeling pipelines accordingly.
  5. What threshold should I use for my statistics? This is another open area of conversation. There are case studies across industries where certain groups have found some thresholds to be helpful, but by and large individual organizations need to determine which thresholds are most applicable for them.
  6. What if I don’t train any models using protected variables? What should I do? The reality is that even if we are not training models using protected variables, we will still likely see the impact of biased datasets due to correlation of those protected groups with other features. That is to say, your model and dataset can easily still exhibit bias, even if we’re not explicitly training on protected variables. If you have access to those variables, we recommend including them in your analysis so you can understand your current state, even if you do not include them in your final models.
  7. If I’m training deep learning models, can I still use SageMaker Clarify? Yes absolutely! The bigger question is what type of data you are using. At launch SageMaker Clarify supports tabular data, again with a view to prioritize the roadmap explicitly based on customer feedback and interest.
  8. Where can I see the bias report in SageMaker? When you run SageMaker Clarify jobs on datasets and models, you can view the results under the Experiments tab. Right-click on unassigned trial components, and then view the trial details for your job.

All images produced by the author

