DATA ENGINEERING – DATA QUALITY – GREAT EXPECTATIONS SERIES

It doesn’t matter how fast data lands in your spreadsheets or dashboards: if it is not correct, it is useless. Worse, it can drive bad decisions and lead to irreversible repercussions. A robust data quality tool is integral to any data workload to prevent such catastrophes. In this article, I’ll take you through how I’ve used Great Expectations with PySpark to run tests across a series of data transformations.
PySpark as Data Processing Tool
Apache Spark is a popular tool for optimising ETL workloads, using parallel computing in a distributed environment. It is commonly used for batch processing big data, reducing the time it takes to process huge volumes of data while also minimising costs.
PySpark exposes a Python API for Apache Spark. Through the PySpark SQL module, developers can use SQL to process structured or semi-structured data. All you need is Python + SQL to get things going.
See below a simple use case in which files are read from cloud storage and loaded into a DataFrame. The raw data is transformed using SQL and the output is materialised back to cloud storage.

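A rough sketch of that pattern might look like the following; the bucket paths, view name and columns here are placeholders rather than anything from the original post:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("simple-etl").getOrCreate()

# Read raw files from cloud storage into a DataFrame (placeholder path)
raw_df = spark.read.csv("s3a://my-bucket/raw/campaigns/", header=True, inferSchema=True)
raw_df.createOrReplaceTempView("raw_campaigns")

# Transform the raw data using SQL (placeholder columns)
transformed_df = spark.sql("""
    SELECT id, name, upper(country) AS country
    FROM raw_campaigns
    WHERE state IS NOT NULL
""")

# Materialise the output back to cloud storage
transformed_df.write.mode("overwrite").parquet("s3a://my-bucket/curated/campaigns/")
```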
Handling Data Quality
While PySpark does its job as an efficient transformation tool, the ultimate goal of Data Engineering is not just to transform data from its raw form to a consumable form but to ensure that the end product meets the expected quality standards. The data should align with the business rules agreed upon by subject matter experts.
Below are some examples of things we ask about data:
- Is the column mandatory?
- Are these values valid?
- Is the format correct?
- How do we determine that an account is active during a specific period?
- If the column is numeric in format, is it expected to be within a certain range?
The answers to these questions are translated to business rules and are validated against the data.
The role of Great Expectations
Unfortunately, data quality testing doesn’t come out of the box with PySpark. That’s where tools like Great Expectations come into play.
Great Expectations is an open-source tool for unit and integration testing of data. It comes with a predefined list of expectations to validate the data against and allows you to create custom tests as needed. There is more to it than that, but for the sake of this post, these are the features I wish to explore first.
If you’re keen to know more about other Great Expectations features, you can jump to my other post where I have assessed some of its useful features against a business requirement.
Project Exercise
The Goal
- To explore the Kickstarter Campaign dataset downloaded from Kaggle
- To produce a metric that counts the number of successful campaigns per defined category per assessment year
- To use Great Expectations to execute unit and integration tests
The How
- Dataset goes through several layers of transformation from raw form to the final metric output


- For each transformation, the data is validated against a set of expectations

Prerequisites
- The Kickstarter dataset (in this exercise, the file is just stored locally)
- A PySpark environment with the Great Expectations library installed
- Jupyter interface
Note: I did not include instructions on how to set up this environment locally. Let me know in the comments if you’re interested in how I set it up, and I can write a post about it.
Let’s Start!
Note: The code shown below is presented as screenshots, but the full Jupyter Notebook is shared on GitHub.
Raw data exploration
1. To start, let’s import the libraries and start a Spark session.

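A minimal sketch of this step, assuming the legacy great_expectations.dataset API (the exact imports in the original notebook may differ):

```python
from pyspark.sql import SparkSession
from great_expectations.dataset import SparkDFDataset  # legacy dataset API

spark = (
    SparkSession.builder
    .appName("kickstarter-data-quality")
    .getOrCreate()
)
```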
2. Load the file and create a view called "CAMPAIGNS"

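Something along these lines, with the file path being a placeholder for wherever you stored the Kaggle download:

```python
# Load the locally stored Kickstarter file (placeholder path)
raw_df = spark.read.csv("data/kickstarter_campaigns.csv", header=True, inferSchema=True)

# Register a view so later transformations can be written in SQL
raw_df.createOrReplaceTempView("CAMPAIGNS")
```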
3. Explore the Dataset

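A quick way to eyeball the data:

```python
raw_df.printSchema()            # column names and inferred types
raw_df.show(5, truncate=False)  # peek at a few rows
print(raw_df.count())           # total number of records
```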
4. Do data profiling
This can be done using Great Expectations by leveraging its built-in functions to validate data.

SparkDFDataset wraps a PySpark DataFrame and allows you to validate expectations against it.
5. Create an instance of SparkDFDataset for raw_df

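The wrapping itself is a one-liner:

```python
# Wrap the raw DataFrame so expectations can be run against it
raw_gdf = SparkDFDataset(raw_df)
```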
Unit tests on Raw Data
1. Check for mandatory columns
Below are the relevant columns to be used for determining what is in scope for the final metrics.

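A sketch of the check; the column list below is an assumption on my part, and the actual list is in the notebook:

```python
# Columns assumed mandatory for the final metrics (illustrative list)
MANDATORY_COLUMNS = ["id", "name", "main_category", "currency",
                     "country", "state", "launched_at", "deadline"]

for column in MANDATORY_COLUMNS:
    result = raw_gdf.expect_column_to_exist(column)
    print(column, result.success)
```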
2. Mandatory columns should not be null

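Roughly:

```python
# Every mandatory column should be fully populated
for column in MANDATORY_COLUMNS:
    result = raw_gdf.expect_column_values_to_not_be_null(column)
    print(column, result.success)
```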
Seems like we have an outlier!
3. Check for valid date format

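A sketch, assuming the date columns arrive as strings in a 'YYYY-MM-DD HH:MM:SS' layout (the real format depends on the file you downloaded):

```python
# Launch and deadline dates should follow the expected string format
for column in ["launched_at", "deadline"]:
    result = raw_gdf.expect_column_values_to_match_strftime_format(column, "%Y-%m-%d %H:%M:%S")
    print(column, result.success)
```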
We are not getting 100% compliance here.
4. Check for uniqueness

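For example, assuming id identifies a campaign:

```python
# Each campaign id should appear only once in the raw data
result = raw_gdf.expect_column_values_to_be_unique("id")
print(result.success)
```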
Uh oh! This dataset comes with a lot of duplicates!
It seems our raw data is not as clean as we want it to be. But that should be alright at this point, since we still need to filter the data and will eventually calculate the metrics against only the subset that is in scope.
Filter data
1. Define which campaigns are in scope (sketched together with step 2 below)

2. Declare some variables

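Steps 1 and 2 together could look like this; the scope values mirror what the later unit tests check for, while the category list is purely illustrative:

```python
# Scope for the metric: US campaigns, pledged in USD, that ended successfully
IN_SCOPE_COUNTRY = "US"
IN_SCOPE_CURRENCY = "USD"
IN_SCOPE_STATE = "successful"

# Categories considered for the final metric (illustrative subset)
IN_SCOPE_CATEGORIES = ["Technology", "Games", "Design"]
```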
3. Generate reference data for the assessment year
Why do we need reference data to derive the assessment year? In this example, the assessment year starts on the 1st of July and ends on the 30th of June. The table below shows when the period starts and ends for each assessment year.

See below example scenarios:
- If a campaign started on Jan 5, 2017 and ended on June 30, 2017, it is considered active in assessment year 2017.
- If a campaign started on July 5, 2017 and ended on Jan 5, 2018, it is considered active in assessment year 2018.
- If a campaign started on Jan 5, 2017 and ended on Dec 5, 2017, it is considered active in both assessment years 2017 and 2018.
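A minimal way to generate such reference data; the year range is a placeholder to adjust to whatever your dataset covers:

```python
from pyspark.sql import Row

# Assessment year N runs from 1 July of year N-1 to 30 June of year N
assessment_years = [
    Row(assessment_year=year,
        period_start=f"{year - 1}-07-01",
        period_end=f"{year}-06-30")
    for year in range(2015, 2019)
]
assessment_year_df = spark.createDataFrame(assessment_years)
assessment_year_df.createOrReplaceTempView("ASSESSMENT_YEARS")
assessment_year_df.show()
```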
4. Apply Transformation

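A sketch of how that filter-plus-deduplication could be written, continuing with the assumed column names and scope variables from the earlier sketches:

```python
categories_sql = ", ".join(f"'{c}'" for c in IN_SCOPE_CATEGORIES)

filtered_df = spark.sql(f"""
    WITH in_scope AS (
        SELECT c.*, ay.assessment_year,
               ROW_NUMBER() OVER (
                   PARTITION BY c.id, ay.assessment_year
                   ORDER BY c.launched_at DESC
               ) AS rn
        FROM CAMPAIGNS c
        JOIN ASSESSMENT_YEARS ay
          ON to_date(c.launched_at) <= to_date(ay.period_end)
         AND to_date(c.deadline)    >= to_date(ay.period_start)
        WHERE c.country       = '{IN_SCOPE_COUNTRY}'
          AND c.currency      = '{IN_SCOPE_CURRENCY}'
          AND c.state         = '{IN_SCOPE_STATE}'
          AND c.main_category IN ({categories_sql})
    )
    SELECT * FROM in_scope
    WHERE rn = 1  -- keep only the latest record per campaign id and assessment year
""")
filtered_df.createOrReplaceTempView("FILTERED_CAMPAIGNS")
```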
I expected that the dataset would still contain some duplicates, so I did a little trick: the query picks the latest record whenever a campaign id appears more than once.
5. Explore the filtered Dataset

6. Create an instance of SparkDFDataset for filtered_df

Unit tests on Filtered Data
1. Check if main_category is within scope (a combined sketch of checks 1–5 appears after this list)

2. Check that the country is “US”, the currency is “USD” and only successful campaigns are included

3. Check if mandatory columns are present

4. Check for uniqueness

5. Check for valid datetime formats

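A combined sketch of the five checks above, reusing the variables defined in the earlier sketches; the wrapping line simply repeats step 6:

```python
filtered_gdf = SparkDFDataset(filtered_df)  # as created in step 6

# 1. main_category is within the agreed scope
filtered_gdf.expect_column_values_to_be_in_set("main_category", IN_SCOPE_CATEGORIES)

# 2. Only US / USD / successful campaigns remain
filtered_gdf.expect_column_values_to_be_in_set("country", [IN_SCOPE_COUNTRY])
filtered_gdf.expect_column_values_to_be_in_set("currency", [IN_SCOPE_CURRENCY])
filtered_gdf.expect_column_values_to_be_in_set("state", [IN_SCOPE_STATE])

# 3. Mandatory columns are still present and populated
for column in MANDATORY_COLUMNS:
    filtered_gdf.expect_column_to_exist(column)
    filtered_gdf.expect_column_values_to_not_be_null(column)

# 4. One record per campaign id and assessment year
#    (older Great Expectations versions use expect_multicolumn_values_to_be_unique)
filtered_gdf.expect_compound_columns_to_be_unique(["id", "assessment_year"])

# 5. Dates still match the expected string format
filtered_gdf.expect_column_values_to_match_strftime_format("launched_at", "%Y-%m-%d %H:%M:%S")
```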
Our data is now looking neat, so we can proceed with the next step. The metrics are to be calculated based on derived categories, and we’ll produce those categories in the next phase of transformation.
Standardised data
1. Define the categories to use in the final metric

2. Transform data

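I’m guessing at the exact category logic here; as an illustration, the sketch below keeps main_category as the metric category and buckets an assumed usd_pledged column into a pledge category:

```python
standardised_df = spark.sql("""
    SELECT
        id,
        assessment_year,
        main_category AS metric_category,
        CASE
            WHEN usd_pledged < 1000  THEN 'LOW'
            WHEN usd_pledged < 10000 THEN 'MEDIUM'
            ELSE 'HIGH'
        END AS pledge_category
    FROM FILTERED_CAMPAIGNS
""")
standardised_df.createOrReplaceTempView("STANDARDISED_CAMPAIGNS")
```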
3. Explore data

4. Create an instance of SparkDFDataset for standardised_df

Unit tests on Standardised Data
1. Check if the metric and pledge categories are valid (both checks are sketched after step 2)

2. Check if the population is equal to that of the previous dataset

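Both checks might look roughly like this, with the category sets matching the illustrative values used above:

```python
standardised_gdf = SparkDFDataset(standardised_df)

# 1. Derived categories only contain the agreed values
standardised_gdf.expect_column_values_to_be_in_set("metric_category", IN_SCOPE_CATEGORIES)
standardised_gdf.expect_column_values_to_be_in_set("pledge_category", ["LOW", "MEDIUM", "HIGH"])

# 2. No rows were gained or lost relative to the filtered dataset
standardised_gdf.expect_table_row_count_to_equal(filtered_df.count())
```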
Since we have not filtered the dataset any further, the row count is expected to remain the same.
Onto the final step!
Produce the final metrics
1. Define the metric

2. Apply Transformation

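A sketch of the aggregation, continuing from the standardised view above:

```python
successful_campaigns_df = spark.sql("""
    SELECT
        assessment_year,
        metric_category,
        pledge_category,
        COUNT(*) AS successful_campaigns
    FROM STANDARDISED_CAMPAIGNS
    GROUP BY assessment_year, metric_category, pledge_category
""")
```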
3. Explore Data

4. Create an instance of SparkDFDataset for successful_campaigns_df

Unit tests on final metrics
1. Check for uniqueness of the combination of columns (both checks are sketched after step 2)

2. Check if the total sum of campaigns in the final metrics dataset is as expected

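Both checks sketched together, using the column names from the aggregation above:

```python
successful_campaigns_gdf = SparkDFDataset(successful_campaigns_df)

# 1. One row per assessment year / category combination
successful_campaigns_gdf.expect_compound_columns_to_be_unique(
    ["assessment_year", "metric_category", "pledge_category"]
)

# 2. The aggregated counts add back up to the number of records before aggregation
total_before_aggregation = standardised_df.count()
successful_campaigns_gdf.expect_column_sum_to_be_between(
    "successful_campaigns",
    min_value=total_before_aggregation,
    max_value=total_before_aggregation,
)
```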
The aggregate number of campaigns in the final metrics dataset should be neither more nor less than the total number of records before aggregation.
Validation summary
Calling the validate() function on a Great Expectations dataset (SparkDFDataset) returns a summarised result of all the expectations executed against that specific dataset.

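For example, against the raw dataset wrapper from earlier:

```python
summary = raw_gdf.validate()  # works the same on any SparkDFDataset instance
print(summary.success)        # overall pass/fail

# Per-expectation outcomes
for result in summary.results:
    print(result.expectation_config.expectation_type, result.success)
```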
Custom Tests
There are a few more tests that I would like to apply, but I couldn’t find suitable built-in expectations for them. For instance, I would like to validate that the derived assessment year is correct based on launch_at and deadline. Great Expectations can be extended to allow the use of custom functions. I will write a separate post on this, but if you want to check the code, please see the last section of the notebook on GitHub.
Conclusion
I have showcased how Great Expectations can be utilised to check data quality at every phase of data transformation, using a good number of built-in expectations to validate PySpark DataFrames. See the full list in their documentation. I find it convenient to use this tool in notebooks for data exploration. Unit tests also need to be incorporated into the data pipeline to ensure confidence in the data every time it changes. That is what I’ll be exploring next. Stay tuned!