
Introduction
Sometimes the sample data that data scientists are given does not fit what we know about the wider population. For example, let's assume that the Data Science team were given survey data and we noticed that the survey respondents were 60% male and 40% female.
In the real world the UK general population is closer to 49.4% male and 50.6% female (source: https://tinyurl.com/43hpe5e4) and certainly not 60% / 40%.
There could be many explanations for our 60% male sample data. One possibility is that the data collection method might have been flawed. Perhaps the marketing team accidentally hit more males with their marketing campaign causing an imbalance.
If we can establish that the sample data should better reflect the population then we can "stratify" the data. This will involve resampling the sample data so that the proportions match the population (see https://www.investopedia.com/ask/answers/041615/what-are-advantages-and-disadvantages-stratified-random-sampling.asp for more information).
To make matters more complex, multiple feature columns might be involved. For example, a population might break down across a combination of two factors as follows:
- Male undergraduates = 45% of the population
- Female undergraduates = 20% of the population
- Male graduate students = 20% of the population
- Female graduate students = 15% of the population
If our sample data has 70% male undergraduates it will not represent the population.
This can cause problems down the line for machine learning algorithms. If we train our model on sample data that has the wrong proportions, the model is likely to over-fit to the training data, and it is likely to underperform when we run it against real-world or test data that has the right proportions.
This example shows how to resample the sample data so that it reflects the population, which has the potential to improve the accuracy of your machine learning models.
Getting Started
Let's start by importing the required libraries and reading in some data that was downloaded from https://www.kaggle.com/c/credit-default-prediction-ai-big-data/overview
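A minimal sketch of this step might look like the following. The file name train.csv is an assumption (the article does not say what the downloaded file is called), and load_data is a hypothetical helper introduced only for illustration.

```python
import pandas as pd

# Assumption: the Kaggle training file has been downloaded and saved
# locally as "train.csv"; adjust the path to wherever your copy lives.
DATA_PATH = "train.csv"


def load_data(path: str = DATA_PATH) -> pd.DataFrame:
    """Read the credit-default data from a local CSV file into a DataFrame."""
    return pd.read_csv(path)
```

Calling load_data() with the correct path returns the DataFrame that the following steps work on.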

Setting Up The Test Data
To make the example meaningful, I am going to simplify the "Home Ownership" feature so that it has only its two most common values, add a new feature called "Gender" with ~60% "Male" and ~40% "Female", and then take a quick look at the results …
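One way to set this up is sketched below. It assumes that "simplifying" means keeping only the rows whose Home Ownership is one of the two most frequent values, and that Gender is assigned at random. The helper name prepare_test_data, the seed and the small synthetic frame are all illustrative stand-ins, not the article's own code or data.

```python
import numpy as np
import pandas as pd


def prepare_test_data(df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Keep the two most common Home Ownership values and add a random Gender."""
    # Assumption: rows with any other Home Ownership value are dropped.
    top_two = df["Home Ownership"].value_counts().index[:2]
    out = df[df["Home Ownership"].isin(top_two)].copy()

    # Draw Gender at random with roughly 60% "Male" and 40% "Female".
    rng = np.random.default_rng(seed)
    out["Gender"] = rng.choice(["Male", "Female"], size=len(out), p=[0.6, 0.4])
    return out


# Synthetic stand-in for the Kaggle data (illustrative values only).
demo = pd.DataFrame(
    {"Home Ownership": ["Home Mortgage"] * 500 + ["Rent"] * 450 + ["Own Home"] * 50}
)
demo = prepare_test_data(demo)

# Proportions of each value in the two features.
print(demo["Home Ownership"].value_counts(normalize=True))
print(demo["Gender"].value_counts(normalize=True))
```

Because Gender is drawn at random, the observed split is only approximately 60% / 40%, which is consistent with the output shown in the article.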
(Home Mortgage 0.531647
Rent 0.468353
Name: Home Ownership, dtype: float64,
Male 0.601813
Female 0.398187
Name: Gender, dtype: float64)
Preparing to Stratify
In our example we want to resample the sample data to reflect the correct proportions of Gender and Home Ownership.
The first thing we need to do is to create a single feature that contains all of the data we want to stratify on as follows …
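Combining the two columns can be as simple as a string concatenation. The sketch below uses a tiny hand-made frame as a stand-in for the real data; the column name Stratify and the "Gender, Home Ownership" label format follow the output shown in the article.

```python
import pandas as pd

# Tiny stand-in for the prepared data (illustrative rows only).
df = pd.DataFrame(
    {
        "Gender": ["Male", "Male", "Female", "Female", "Male"],
        "Home Ownership": ["Home Mortgage", "Rent", "Home Mortgage", "Rent", "Home Mortgage"],
    }
)

# Join the two features into one label such as "Male, Home Mortgage".
df["Stratify"] = df["Gender"] + ", " + df["Home Ownership"]

# Share of rows that falls into each combined stratum.
print(df["Stratify"].value_counts(normalize=True))
```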
Male, Home Mortgage 0.321737
Male, Rent 0.280076
Female, Home Mortgage 0.209911
Female, Rent 0.188277
Name: Stratify, dtype: float64
So there we have it: a set of proportions in our sample data that we intend to use to train our model. However, we check with our marketing team, who assure us that the population proportions are as follows …
- Male, Home Mortgage = 45% of the population
- Male, Rent = 20% of the population
- Female, Home Mortgage = 20% of the population
- Female, Rent = 15% of the population
… and the two teams agree that the data must be resampled to match these proportions in order to build an accurate model that will work well on real-world data in future.
Stratifying the Data
Below is a function that uses DataFrame.sample to sample exactly the right number of rows with the right values from the source data such that the result will be stratified exactly as specified in the parameters …
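The following is a minimal sketch of such a function, not necessarily the author's original. It assumes that the result should have the same number of rows as the input, and that rows are drawn with replacement so that under-represented strata can be upsampled. The parameter names and the synthetic demo data are illustrative.

```python
import numpy as np
import pandas as pd


def stratify_data(df, column, values, proportions, random_state=None):
    """Resample df so that column takes each of values in the given proportions.

    Assumptions in this sketch: the result has as many rows as df, and rows
    are drawn with replacement so that small strata can be upsampled.
    """
    if len(values) != len(proportions):
        raise ValueError("values and proportions must have the same length")
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")

    total = len(df)
    parts = []
    drawn = 0
    for i, (value, proportion) in enumerate(zip(values, proportions)):
        if i == len(values) - 1:
            n = total - drawn  # last stratum absorbs the rounding remainder
        else:
            n = int(total * proportion)  # round down to whole rows
        stratum = df[df[column] == value]
        parts.append(stratum.sample(n=n, replace=True, random_state=random_state))
        drawn += n
    return pd.concat(parts)


# Synthetic stand-in: 1,000 rows whose strata roughly follow the sample
# proportions reported in the article (32% / 28% / 21% / 19%).
labels = ["Male, Home Mortgage", "Male, Rent", "Female, Home Mortgage", "Female, Rent"]
rng = np.random.default_rng(0)
demo = pd.DataFrame({"Stratify": rng.choice(labels, size=1000, p=[0.32, 0.28, 0.21, 0.19])})

stratified = stratify_data(demo, "Stratify", labels, [0.45, 0.20, 0.20, 0.15], random_state=42)
print(stratified["Stratify"].value_counts(normalize=True))
```

On this 1,000-row demo the resampled frame contains exactly 450, 200, 200 and 150 rows for the four strata. Sampling with replacement means some source rows appear more than once, which is the price of upsampling a stratum beyond its original size.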
Testing
The code below specifies the values and the required proportions for stratifying the data, i.e.:
- Male, Home Mortgage = 45% of the population
- Male, Rent = 20% of the population
- Female, Home Mortgage = 20% of the population
- Female, Rent = 15% of the population
… and takes a look at the newly stratified dataset …

And just to be sure we have the right results, let’s take a look at the overall proportions of our Stratify feature column …
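A check of this kind can be sketched as a small helper that places observed and target proportions side by side. The name compare_proportions and the 20-row demo frame are hypothetical, introduced only for illustration.

```python
import pandas as pd


def compare_proportions(df, column, targets):
    """Return observed vs. target proportions for each value of column."""
    observed = df[column].value_counts(normalize=True)
    report = pd.DataFrame({"observed": observed, "target": pd.Series(targets)})
    # A stratum that is missing from the data has an observed share of zero.
    report["observed"] = report["observed"].fillna(0.0)
    report["difference"] = report["observed"] - report["target"]
    return report


# Illustrative 20-row frame that already matches the target proportions.
demo = pd.DataFrame(
    {
        "Stratify": ["Male, Home Mortgage"] * 9
        + ["Male, Rent"] * 4
        + ["Female, Home Mortgage"] * 4
        + ["Female, Rent"] * 3
    }
)
targets = {
    "Male, Home Mortgage": 0.45,
    "Male, Rent": 0.20,
    "Female, Home Mortgage": 0.20,
    "Female, Rent": 0.15,
}
report = compare_proportions(demo, "Stratify", targets)
print(report)
```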
((6841, 20), (6841, 20))
Male, Home Mortgage 0.449934
Female, Home Mortgage 0.199971
Male, Rent 0.199971
Female, Rent 0.150124
Name: Stratify, dtype: float64
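The proportions above are close to, but not exactly, the 45% / 20% / 20% / 15% targets because each stratum must contain a whole number of rows. One allocation that reproduces these figures is to round every stratum down and give the remainder to the last one. This is an assumption about the method rather than a detail confirmed by the article; the arithmetic is sketched below.

```python
total_rows = 6841  # row count taken from the shapes shown above

targets = {
    "Male, Home Mortgage": 0.45,
    "Male, Rent": 0.20,
    "Female, Home Mortgage": 0.20,
    "Female, Rent": 0.15,
}

labels = list(targets)
counts = {}
for label in labels[:-1]:
    counts[label] = int(total_rows * targets[label])  # round down to whole rows
# Assumption: the last stratum takes whatever rows are left over.
counts[labels[-1]] = total_rows - sum(counts.values())

for label, n in counts.items():
    print(f"{label}: {n} rows, proportion {n / total_rows:.6f}")
```

This gives 3078, 1368, 1368 and 1027 rows, i.e. proportions of 0.449934, 0.199971, 0.199971 and 0.150124.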
Conclusion
We started by noting that flaws in the data collection process can sometimes give sample data different proportions from the known proportions of the population, and that this can lead to over-fitted models that perform poorly when they encounter test or live data with the right proportions.
We went on to explore how stratifying the training data and resampling it to give it the right proportions can resolve the issue and improve performance of the production algorithms.
We then chose a complex example that stratified on two features, feature engineered those two features into a new column and defined a function that performs the calculations and returns a stratified dataset.
Finally we examined the results to make sure the calculations were correct.
The full source code can be found on GitHub:
Thank you for reading!
If you enjoyed reading this article, why not check out my other articles at https://grahamharrison-86487.medium.com/?
Also, I would love to hear from you to get your thoughts on this piece, any of my other articles or anything else related to data science and data analytics.
If you would like to get in touch to discuss any of these topics please look me up on LinkedIn at https://www.linkedin.com/in/grahamharrison1.