In this article, I will introduce you to two packages I use when working with imbalanced data (imblearn, although I also use it a lot, is not one of them). One will help you tune your model’s decision threshold, and the second will make your model use the selected threshold when it’s deployed.
Why should you care?
First, as a Data Scientist working with clients every day, the vast majority of problems I solve are imbalanced binary classification problems. If you think about it, a question about your customers, patients, machines, etc. is likely to be a "yes/no" question. And it’s also very likely that you are interested in that question because one of the two answers is less common but more important to you: "will my customer purchase something?", or "will my machine need maintenance in the next X hours?".
There are a lot of fancy techniques out there for imbalanced problems: random under-sampling of the majority class, random over-sampling of the minority class, SMOTE, ADASYN… And there are a lot of good Medium articles to help you master them. This article is about a much simpler but, in my opinion, more prevalent problem in imbalanced contexts: tuning the decision threshold, and "embedding it" into your model.
In the next few sections, I will be using scikit-lego, an awesome set of extensions for sklearn (basically check that package whenever you think "I wish sklearn could do this…") and yellowbrick, a package for Machine Learning Visualizations. They both work perfectly well with sklearn so you can mix and match functions and classes together.
Some disclaimers before we get started: First, I assume that you have some familiarity with sklearn, so I won’t go into the details of each line of code. Also, my goal here is just to get a working pipeline as an example, so I am purposely skipping essential steps of the Data Science process, including checking for collinearity, feature selection, etc.
Building an initial sklearn pipeline
TL;DR: In this section, I remind you why pipelines rock and create one on a sample dataset. If you’re a pipeline pro, jump to the next section!
When you build and deploy Machine Learning models, it is generally a best practice to make them "as end to end as possible". In other words, try to group most of the model-related data transformations into a single object that you call your model. And in the sklearn world, you do this by using a pipeline.
I started using sklearn pipelines as a Data Science best practice. If you don’t use them, you should absolutely read this article from Andreas Mueller: one of the sections explains what problems can arise if you don’t, in particular Data Leakage. As I used them more, I realized that by using pipelines, you also get the benefit of abstracting away more things into what you call "your model", which is awesome for Deployment.
by using pipelines, you also get the benefit of abstracting away more things into what you call "your model"
To make this more concrete, let’s build a pipeline on a sample dataset. I will be using a sample dataset about heart disease which is shipped with sklego.
This dataset is very simple: 9 numerical features (for simplicity, I am keeping ordered categories such as cp as numerical values), 3 binary features (sex, fbs, exang), and 1 categorical feature (thal). The target is binary, and imbalanced: only 27% of the samples have a positive label.
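If you want to follow along, the data can be loaded roughly as follows. This is a minimal sketch: I am assuming the loader is called load_hearts and that the label column is named target, so check the sklego.datasets documentation for your version.

```python
from sklearn.model_selection import train_test_split
from sklego.datasets import load_hearts  # loader name assumed, see sklego.datasets

# Load the heart disease dataset shipped with sklego as a pandas DataFrame
df = load_hearts(as_frame=True)

X = df.drop(columns=["target"])  # label column name assumed to be "target"
y = df["target"]
print(y.mean())  # roughly 0.27: the positive class is the minority

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```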

I build a quick pipeline which is made of two sub-pipelines, one for numerical values (for simplicity, I treat binary and ordered categorical features as numerical), and one for categorical values. There are no missing values in the training data, but just in case they come up in production, I include some simple imputation. I also noticed that some categories of the thal feature seem wrong, so I manually provide the categories I want to one hot encode so that sklearn does not encode the suspicious ones.
In the code below, I am actually already using sklego. Typically, I would write different pipelines for numerical and categorical features, then define lists of features to apply them to, and put everything together using a ColumnTransformer. With sklego, it’s a bit simpler than that: there is a PandasTypeSelector class which can select columns based on their pandas type. I simply put one at the beginning of each of my two pipelines, then merge the two using a FeatureUnion.
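Since the original snippet is not reproduced here, here is a minimal sketch of what that preprocessing block could look like. The thal categories listed below are placeholders (use the ones you actually trust), and depending on your sklego version the selector class may have been renamed.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklego.preprocessing import PandasTypeSelector

# Numerical branch: select numeric columns, impute (just in case), scale
num_pipeline = Pipeline([
    ("select", PandasTypeSelector(include="number")),
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical branch: select string columns, impute, one hot encode.
# Listing the categories explicitly means anything else is ignored
# instead of getting its own column.
cat_pipeline = Pipeline([
    ("select", PandasTypeSelector(include="object")),
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(categories=[["normal", "fixed", "reversible"]],
                             handle_unknown="ignore")),
])

# Merge the two branches into a single preprocessing step
preprocessing = FeatureUnion([
    ("numerical", num_pipeline),
    ("categorical", cat_pipeline),
])
```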
Now that we have the preprocessing part done, all we need is to add the model. It’s a small dataset and a simple problem, so I will be building a simple Logistic Regression. I will tune the regularization parameter C using grid search, and leave the other parameters at their defaults (again, skipping important parts of model building, do not reproduce at home!).
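A hedged sketch of that step, building on the preprocessing object above (the grid of C values is arbitrary):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Full pipeline: preprocessing + a simple Logistic Regression
pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("model", LogisticRegression(max_iter=1000)),
])

# Tune C with a metric that does not depend on the decision threshold
grid = GridSearchCV(
    pipeline,
    param_grid={"model__C": np.logspace(-3, 2, 10)},
    scoring="average_precision",
    cv=5,
)
grid.fit(X_train, y_train)
best_model = grid.best_estimator_
```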
Alright, we now have a trained pipeline with a couple steps including a simple model. Note that for now, we haven’t done anything about the imbalance, but we are ready for the interesting part!
Tuning the Decision Threshold
Let’s look at the model predictions through the lens of the confusion matrix. Note that because we are going to be doing some additional tuning, we do this on the training set (on a real project, I would probably split into train, test and validation sets to do things correctly). Here we start using yellowbrick: although making that kind of chart is easy with matplotlib, as you can see the code is much simpler using the ConfusionMatrix class.
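Something along these lines, reusing the tuned best_model from the grid search above:

```python
from yellowbrick.classifier import ConfusionMatrix

# Wrap the tuned pipeline in a yellowbrick visualizer and score it on the
# training set (see the note above about the train/test/validation split)
cm = ConfusionMatrix(best_model)
cm.fit(X_train, y_train)
cm.score(X_train, y_train)
cm.show()
```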

First, we can see that this sample model is doing OK (the cross-validated Average Precision is 0.78): the precision (when we predict 1, how often do we get it right?) is 0.8 while the recall (how many of the true 1s did we predict?) is 0.67. Now, we ask ourselves: is this the right balance between precision and recall? Do we prefer to under-predict or over-predict our target? In real life, this is when you tie back your Data Science problem to the real underlying Business problem.
Now, we ask ourselves: is this the right balance between precision and recall? Do we prefer to under-predict or over-predict our target? In real life, this is when you tie back your Data Science problem to the real underlying Business problem.
To make this decision, we look at a different chart, and that’s where yellowbrick is really powerful, providing us with the DiscriminationThreshold class. Any binary model typically outputs either a probability or a score that can be used as a proxy for a probability, and the final output that you get (0 or 1) is obtained by applying a threshold to that score. What we can do is define different quantities that depend on that threshold, then continuously move the threshold to see how these quantities increase or decrease. Typically, as you increase the threshold, precision will increase and recall will decrease: for example, if you set the threshold to 0.9, your model will predict 1 only when 1 has a much higher likelihood than 0, so you will predict fewer 1s but have a better chance of being right when you do.
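A sketch of how the visualizer could be set up; the quantiles argument is what sets the interquartile band discussed below, everything else is left at its defaults:

```python
import numpy as np
from yellowbrick.classifier import DiscriminationThreshold

# Re-fits the model on shuffled splits (50 trials by default) and plots
# precision, recall, f1 and queue rate as functions of the threshold
visualizer = DiscriminationThreshold(
    best_model,
    quantiles=np.array([0.25, 0.5, 0.75]),  # median line + interquartile band
    random_state=42,
)
visualizer.fit(X_train, y_train)
visualizer.show()
```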

There is a lot of information on this chart. First, the blue, green and red lines correspond to very common metrics. We already discussed precision and recall, and f1 is the harmonic mean of the two. We can see that the former increases as the latter decreases, and there is a sweet spot where f1 is at its maximum, i.e. where the balance between precision and recall is ideal. Yellowbrick helps us by plotting a dotted line at that ideal threshold.
The purple line corresponds to a different quantity, one that you don’t necessarily learn about when studying Data Science, but which is very important in real-world Data Science use cases. The yellowbrick documentation does a perfect job of describing this metric:
Queue Rate: The "queue" is the spam folder or the inbox of the fraud investigation desk. This metric describes the percentage of instances that must be reviewed. If review has a high cost (e.g. fraud prevention) then this must be minimized with respect to business requirements; if it doesn’t (e.g. spam filter), this could be optimized to ensure the inbox stays clean.
Finally, the error band shows us the uncertainty around these quantities: as you can see in the code snippet above, yellowbrick re-fits the model for us a certain number of times (50, by default), and what we are seeing here are the median and a band corresponding to the interquartile range (which I set manually).
In our case, we can follow yellowbrick’s default behavior, which is to select the best f1 score (this could be changed using the argmax argument of DiscriminationThreshold), meaning that for this particular problem, we want a good balance between precision and recall.
One last thing to note is that when we performed the Grid Search, we selected the best model using average_precision, which is important. If we had used the f1 score, for example, we would have relied on the default threshold only and therefore could have missed a combination of hyperparameters that yields a better f1 score at a different threshold. That is why I made sure to select the model using a metric agnostic to the threshold, and only then tuned the threshold.
Changing our model’s Decision Threshold
In this last section, we go back to the "end to end" discussion. In your current imbalanced models, you probably already perform decision threshold tuning. But what happens next? You now have a model that predicts a probability or a score, and a binary prediction, and you know not to trust that binary prediction because you want to override the default threshold.
But how exactly do you override that threshold? Well, I was stuck on this problem for quite a long time. In a series of models that I put in production, we would store that ideal threshold somewhere, in an environment variable of the deployment environment for example, then call the model’s predict_proba() method and apply the threshold. This is much more painful to maintain than it seems, because you have to worry about two things: the model, and its threshold. Whenever you redeploy a new version of the model, you need to make sure to carry the threshold with it… Crazily, there has been an issue open about this on the sklearn repository for over 5 years!
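For illustration, that manual pattern usually looks something like this (MODEL_THRESHOLD and X_new are hypothetical names):

```python
import os

# The tuned threshold lives outside the model, e.g. in an environment variable
THRESHOLD = float(os.environ.get("MODEL_THRESHOLD", "0.5"))

# Two artifacts to keep in sync at every redeployment: the model and its threshold
proba = best_model.predict_proba(X_new)[:, 1]
predictions = (proba >= THRESHOLD).astype(int)
```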
Well, as I mentioned at the beginning of this article, whenever you think "I wish sklearn did that…", check sklego!
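The original snippet is not reproduced here, but it goes roughly like this, putting together the yellowbrick visualizer attributes and the sklego Thresholder discussed below:

```python
import numpy as np
from sklego.meta import Thresholder

# Position of the best threshold according to the metric yellowbrick optimized
best_index = np.argmax(visualizer.cv_scores_[visualizer.argmax])
best_threshold = visualizer.thresholds_[best_index]

# Wrap the fitted pipeline so that .predict() applies our threshold
final_model = Thresholder(best_model, threshold=best_threshold, refit=False)
final_model.fit(X_train, y_train)
```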
The code snippet above is a bit harder to read, but really useful! First, we extract the best threshold from the yellowbrick visualizer by accessing the underlying cv scores array for our metric of choice (here, visualizer.argmax corresponds to the f1 score) and taking its argmax(). This gives us the position of the best threshold in the visualizer.thresholds_ array. As far as I know (please drop a comment if there’s a better way!), this is the only way to get the best threshold that yellowbrick printed earlier (the dotted line).
Once we have this, we create our final pipeline by simply wrapping best_model in a Thresholder object, a wrapper in charge of applying the specified threshold when we call .predict(), instead of the default 0.5 value. Because we passed refit=False when creating the Thresholder, the .fit() call doesn’t refit the wrapped model (feel free to check the source code to see what’s happening in that call).
And, that’s it! We now have a full pipeline that does all the preprocessing (because people accessing your model shouldn’t have to know that you’re doing scaling or one hot encoding), uses the best threshold for your business problem (because we don’t want to either under- or over-predict our target), and embeds that threshold directly into the model, fulfilling the promise of a more end-to-end model!
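As a quick illustration of what this buys you at deployment time (the file name is just an example):

```python
import joblib

# One artifact to ship: preprocessing, model and threshold travel together
joblib.dump(final_model, "heart_disease_model.joblib")

# Downstream consumers just call .predict(); the tuned threshold is applied for them
model = joblib.load("heart_disease_model.joblib")
print(model.predict(X_test.head()))
```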
EDIT – 2023 edits to this blog post
In an original version of this tutorial, the last code snippet was slightly more complex (I was redefining a new pipeline and only wrapping the estimator, i.e. the last step of that pipeline, with the Thresholder). Since then, the Sklego team has actually made things a little simpler, as shown above, thanks to this PR. In addition, as they mention in the related issue, it’s possible to go further and perform threshold tuning directly while other hyperparameters are being tuned (during GridSearchCV), by leveraging the refit parameter. In order to keep the main structure of the post, I’m keeping threshold tuning as a separate step using yellowbrick (which has other benefits too), but both options work well!