
When you work on a data science project with a company, you usually don't have a single fixed test set, as you would in university or research; instead, you keep receiving newly updated samples from the client.
Before applying the machine learning model to a new sample, you need to verify its data quality: the column names, the column types, and the distributions of the fields should match those of the training set and the old test set.
Manually analyzing the data can be time-consuming when the data is dirty and has more than 100 features. Luckily, there is a life-saving Python library called Great Expectations. Did I intrigue you? Let’s get started!
What is Great Expectations?

Great Expectations is an open-source Python library that specializes in three important aspects of managing data:
- validating data by verifying that it respects some important conditions, or expectations
- automating data profiling, so you can test your data quickly without starting from scratch
- generating formatted documents that contain the results of the expectations and validations
In this tutorial, we are going to focus on validating data, which is one of the main issues when dealing with real-world data.
Airbnb listings in Amsterdam
We are going to analyze the Airbnb listings provided by Inside Airbnb, working with data from Amsterdam. The dataset is already split into training and test sets. As you may guess from the name of the dataset, the goal is to predict listing prices. If we just pay attention to the number of reviews, we can notice that it has more variability in the test data than in the training set.

The question that we should ask ourselves is: "What other differences did we miss?" Let’s get started with the library!
Table of contents:
- Requirements
- Load file
- Create Expectations
Requirements
Before installing the library, it’s recommended to install Python 3 and create a virtual environment (a minimal sketch follows the install command below). After you have activated the environment, you can install the library:
pip install great_expectations
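If you haven't created a virtual environment yet, here is a minimal sketch using Python's built-in venv module (the environment name gx_env is just an example):
python3 -m venv gx_env
source gx_env/bin/activate  # on Windows: gx_env\Scripts\activate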
You also need to download the data from Kaggle to follow the tutorial. The files should be placed inside a folder called "data".
Load file
Just like Pandas, the Great Expectations library has an equivalent method to import a CSV file:
import pandas as pd
import great_expectations as gx
test_df = gx.read_csv('data/test.csv')
In case you have other types of data, like JSON, Parquet, or XLSX, the library also wraps the corresponding Pandas readers, and you can wrap an existing DataFrame yourself.
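For example, here is a minimal sketch that reads a file with plain Pandas and then wraps the resulting DataFrame, assuming the same legacy Pandas-style API used in this tutorial (the JSON file name is hypothetical):
import pandas as pd
import great_expectations as gx

# Read the file with plain Pandas first (hypothetical file name)
raw_df = pd.read_json('data/listings.json')
# Wrap the DataFrame so that the expect_* methods become available
json_df = gx.from_pandas(raw_df)
Back to our CSV test set, we can take a quick look at it: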
test_df.head()

This is a fast overview of our test data and the variables we are going to analyze in the next steps.
Create expectations
In this library, expectations are tests that verify the quality of your data. The beauty of the library is that you don’t need to write these checks manually: there are already more than 300 implemented expectations with intuitive names.
1. Check if the new sample has the same columns as before
Let’s suppose that the client has sent us a new sample and we want to check whether it contains the same columns as the training set. There are a lot of ways to do it in Pandas if you ask ChatGPT, but there is a more intuitive method with Great Expectations:
l_train_column_names = ['id', 'neighbourhood', 'room_type', 'price',
                        'minimum_nights', 'number_of_reviews', 'last_review',
                        'availability_365']
test_df.expect_table_columns_to_match_set(column_set=l_train_column_names)
Output:
{
  "success": false,
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "observed_value": [
      "availability_365",
      "last_review",
      "minimum_nights",
      "number_of_reviews",
      "price",
      "room_type"
    ],
    "details": {
      "mismatched": {
        "missing": [
          "id",
          "neighbourhood"
        ]
      }
    }
  }
}
From the result, we can see that the method found most of the columns, except for the fields id and neighbourhood. Since the condition isn’t fully respected, the key "success" is false.
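Since the method returns the result object shown above, you can also react to it programmatically. A minimal sketch, assuming the result object exposes the same success flag and result dictionary as the JSON output:
# Store the validation result instead of just printing it
result = test_df.expect_table_columns_to_match_set(column_set=l_train_column_names)
if not result.success:
    missing = result.result["details"]["mismatched"]["missing"]
    print(f"Columns missing from the new sample: {missing}")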
2. Check if there are no null values in last_review
Missing values are one of the main problems when working with real-world data:
test_df.expect_column_values_to_not_be_null('last_review')
Output:
{
  "success": false,
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "element_count": 1522,
    "unexpected_count": 143,
    "unexpected_percent": 9.395532194480946,
    "unexpected_percent_total": 9.395532194480946,
    "partial_unexpected_list": []
  }
}
From this test, we can see that there are 143 missing values in that column, about 9.4% of the total.
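If a small fraction of missing values is acceptable, many column expectations also take a mostly argument, which makes the expectation succeed when at least that fraction of values passes. A quick sketch:
# Succeeds if at least 90% of last_review values are non-null
# (true here, since only ~9.4% of them are missing)
test_df.expect_column_values_to_not_be_null('last_review', mostly=0.9)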
3. Check if the type of minimum_nights is an integer
It may seem trivial, but you can run into errors if the model was trained on a column with a different type, so this expectation is useful for avoiding wasted time:
test_df.expect_column_values_to_be_in_type_list('minimum_nights', ['int'])
Output:
{
  "success": true,
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "observed_value": "int64"
  }
}
The expectation is respected, as highlighted by "success": true.
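Since the method takes a list of types, you can also accept several of them at once. For instance, a sketch that allows price to be stored as either an integer or a float:
# Passes if the column type matches any entry in the list
test_df.expect_column_values_to_be_in_type_list('price', ['int', 'int64', 'float', 'float64'])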
4. Check if the maximum price is within a specific interval
Earlier we saw that the price has a different distribution in the training and test sets. Let’s investigate whether the maximum price is between 413 and 12000, which correspond respectively to the 90th percentile and the maximum of the training set:
test_df.expect_column_max_to_be_between(column='price', min_value=413, max_value=12000)
Output:
{
  "success": true,
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "observed_value": 7900,
    "element_count": 1522,
    "missing_count": null,
    "missing_percent": null
  }
}
The output tells us that the maximum price is 7900, which lies within that interval.
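If you want to reproduce the two thresholds, you can derive them from the training data with plain Pandas. A sketch, assuming the training file is called data/train.csv:
import pandas as pd

train_df = pd.read_csv('data/train.csv')  # assumed file name
print(train_df['price'].quantile(0.9))    # 90th percentile, ~413
print(train_df['price'].max())            # maximum, 12000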
Final thoughts:
I have just provided an overview of Great Expectations to introduce its main functionality: the expectations. You can explore other declarative statements, or expectations, on the official website. In the next article, I am going to cover the automation of data profiling and the visualization of all the expectations in a single document. You can find the GitHub code here. Thanks for reading! Have a nice day!
Disclaimer: This data set is licensed under Attribution 4.0 International (CC BY 4.0)