
Ensuring Trustworthy ML Systems With Data Validation and Real-Time Monitoring

Lesson 5: Data Validation for Quality and Integrity using GE. Model Performance Continuous Monitoring.

THE FULL STACK 7-STEPS MLOPS FRAMEWORK

This tutorial represents lesson 5 out of a 7-lesson course that will walk you step-by-step through how to design, implement, and deploy an ML system using MLOps good practices. During the course, you will build a production-ready model to forecast energy consumption levels for the next 24 hours across multiple consumer types from Denmark.

By the end of this course, you will understand all the fundamentals of designing, coding and deploying an ML system using a batch-serving architecture.

This course targets mid/advanced Machine Learning engineers who want to level up their skills by building their own end-to-end projects.

Nowadays, certificates are everywhere. Building advanced end-to-end projects that you can later show off is the best way to get recognition as a professional engineer.


Table of Contents:

  • Course Introduction
  • Course Lessons
  • Data Source
  • Lesson 5: Data Validation for Quality and Integrity using GE. Model Performance Continuous Monitoring.
  • Lesson 5: Code
  • Conclusion
  • References

Course Introduction

At the end of this 7-lesson course, you will know how to:

  • design a batch-serving architecture
  • use Hopsworks as a feature store
  • design a feature engineering pipeline that reads data from an API
  • build a training pipeline with hyperparameter tuning
  • use W&B as an ML Platform to track your experiments, models, and metadata
  • implement a batch prediction pipeline
  • use Poetry to build your own Python packages
  • deploy your own private PyPi server
  • orchestrate everything with Airflow
  • use the predictions to code a web app using FastAPI and Streamlit
  • use Docker to containerize your code
  • use Great Expectations to ensure data validation and integrity
  • monitor the performance of the predictions over time
  • deploy everything to GCP
  • build a CI/CD pipeline using GitHub Actions

If that sounds like a lot, don’t worry. After you cover this course, you will understand everything I said before. Most importantly, you will know WHY I used all these tools and how they work together as a system.

If you want to get the most out of this course, I suggest you access the GitHub repository containing all the lessons’ code. The course is designed for you to quickly read and replicate the code alongside the articles.

By the end of the course, you will know how to implement the diagram below. Don’t worry if something doesn’t make sense to you. I will explain everything in detail.

By the end of Lesson 5, you will know how to use Great Expectations to validate the integrity and quality of your data. Also, you will understand how to implement a monitoring component on top of your ML system.


Course Lessons:

  1. Batch Serving. Feature Stores. Feature Engineering Pipelines.
  2. Training Pipelines. ML Platforms. Hyperparameter Tuning.
  3. Batch Prediction Pipeline. Package Python Modules with Poetry.
  4. Private PyPi Server. Orchestrate Everything with Airflow.
  5. Data Validation for Quality and Integrity using GE. Model Performance Continuous Monitoring.
  6. Consume and Visualize your Model’s Predictions using FastAPI and Streamlit. Dockerize Everything.
  7. Deploy All the ML Components to GCP. Build a CI/CD Pipeline Using Github Actions.
  8. [Bonus] Behind the Scenes of an ‘Imperfect’ ML Project – Lessons and Insights

For more context, check out Lesson 3, which will teach you how to build an inference pipeline using batch architecture and a Feature Store.

Also, Lesson 4 will show you how to orchestrate all the pipelines using Airflow.

This lesson will leverage the above ideas and assume you already understand them.


Data Source

We used a free & open API that provides hourly energy consumption values for all the energy consumer types within Denmark [1].

They provide an intuitive interface where you can easily query and visualize the data. You can access the data here [1].

The data has 4 main attributes:

  • Hour UTC: the UTC datetime when the data point was observed.
  • Price Area: Denmark is divided into two price areas, DK1 and DK2, split by the Great Belt. DK1 is west of the Great Belt, and DK2 is east of it.
  • Consumer Type: the consumer type is the Industry Code DE35, owned and maintained by Danish Energy.
  • Total Consumption: the total electricity consumption in kWh.

Note: The observations have a lag of 15 days! But for our demo use case, that is not a problem, as we can simulate the same steps as we would in real time.

The data points have an hourly resolution. For example: "2023-04-15 21:00Z", "2023-04-15 20:00Z", "2023-04-15 19:00Z", etc.

We will model the data as multiple time series. Each unique (price area, consumer type) tuple represents its own time series.

Thus, we will build a model that independently forecasts the energy consumption for the next 24 hours for every time series.
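To make this concrete, here is a minimal sketch (with hypothetical column names) of how the data splits into independent series:

```python
import pandas as pd

# Hypothetical DataFrame with the four attributes described above.
df = pd.DataFrame(
    {
        "datetime_utc": pd.to_datetime(["2023-04-15 19:00", "2023-04-15 19:00"]),
        "area": [1, 2],  # price area: DK1 / DK2
        "consumer_type": [111, 111],  # DE35 industry code
        "energy_consumption": [45.3, 80.1],  # kWh
    }
)

# Each unique (price area, consumer type) pair is its own time series,
# forecast independently for the next 24 hours.
for (area, consumer_type), series in df.groupby(["area", "consumer_type"]):
    series = series.sort_values("datetime_utc")
    # ... forecast this individual series ...
```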



Lesson 5: Data Validation for Quality and Integrity using GE. Model Performance Continuous Monitoring.

The goal of Lesson 5

At this point, the ML pipeline is implemented and orchestrated. That means we are done, right?

Not quite…

One final step, which will transform you from a good engineer to an excellent one, is adding a component that will allow you to quickly diagnose what is happening in your production system.

During Lesson 5, you will learn 2 main topics that serve a single goal: ensuring your production system works correctly.

1. Data Validation: check that the data generated by the feature engineering (FE) pipeline is valid before ingesting it into the Feature Store.

2. Model Monitoring: continually compute various metrics that reflect the performance of your production model.

I will go into more detail in the Theoretical Concepts & Tools section. Still, as a brief overview, to continually monitor your model’s performance, you will use your old predictions together with the newly gathered ground truth to compute a desired metric, in our case, MAPE (mean absolute percentage error).
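For reference, the MAPE over N prediction/ground-truth pairs is defined as:

```
\text{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{y_i - \hat{y}_i}{y_i} \right|
```

where y_i is the observed energy consumption and ŷ_i is the corresponding prediction.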

For example, you predict the energy consumption values for 24 hours starting from the 1st of June. Initially, you don’t have the data to compute any metrics. But, after 12 hours, you can collect the real energy consumption values. Thus, you now have the ground truth needed to compute the desired metric for the past 12 hours.

After one more hour, you can compute the metric for another data point, and so on…

This is the strategy we will adopt in this tutorial.


Theoretical Concepts & Tools

Data Validation: Data validation refers to the process of ensuring data quality and integrity. What do I mean by that?

As you automatically gather data from different sources (in our case, an API), you need a way to continually validate that the data you just extracted follows a set of rules that your system expects.

For example, you expect that the energy consumption values are:

  • of type float,
  • not null,
  • ≥0.

While you developed the ML pipeline, the API returned only values that respected these rules, or what data people call a "data contract."

But, as you leave your system running in production for 1 month, 1 year, 2 years, etc., you never know what could change in data sources you don’t control.

Thus, you need a way to constantly check these characteristics before ingesting the data into the Feature Store.

Note: To see how to extend this concept to unstructured data, such as images, you can check my Master Data Integrity to Clean Your Computer Vision Datasets article.

Great Expectations (aka GE): GE is a popular tool that lets you easily perform data validation and report the results. Hopsworks has GE support: you can add a GE validation suite to Hopsworks and choose how the system behaves when new data is inserted and the validation step fails. Read more about GE + Hopsworks in [2].

Ground Truth Types: While your model is running in production, you can have access to your ground truth in 3 different scenarios:

  1. real-time: an ideal scenario where you can easily access your target. For example, when you recommend an ad, the consumer either clicks it or not.
  2. delayed: you will eventually access the ground truth, but, unfortunately, it will arrive too late to react in time.
  3. none: you can’t automatically collect any GT. Usually, in these cases, you have to hire human annotators if you need any actuals.

In our case, we are somewhere between #1 and #2. The GT isn’t precisely real-time, but it has a delay of only 1 hour.

Whether a delay of 1 hour is OK depends a lot on the business context, but let’s say that, in your case, it is okay.

As we considered that a delay of 1 hour is OK for our use case, we are in luck: we have access to the GT in real-time(ish).

This means we can use metrics such as MAPE to monitor the model’s performance in real-time(ish).

In scenarios #2 or #3, we would need to use data & concept drift as proxy metrics to compute performance signals in time.

ML Monitoring: ML monitoring is the process of ensuring that your production system works well over time. It also gives you a mechanism to proactively adapt your system, such as retraining your model in time or adapting it to changes in the environment.

In our case, we will continually compute the MAPE metric. Thus, if the error suddenly spikes, you can create an alarm to inform you or automatically trigger a hyperparameter tuning step to adapt the model configuration to the new environment.


Lesson 5: Code

You can access the GitHub repository here.

Note: All the installation instructions are in the READMEs of the repository. Here you will jump straight to the code.

The code within Lesson 5 is located under the following:

Using Docker, you can quickly host everything inside Airflow, so you don’t have to waste a lot of time setting things up.

Directly storing credentials in your git repository is a huge security risk. That is why you will inject sensitive information using a .env file.

The .env.default is an example of all the variables you must configure. It is also helpful to store default values for attributes that are not sensitive (e.g., project name).
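As a purely illustrative sketch (the variable names below are hypothetical; check the repository’s .env.default for the real ones), such a file could look like this:

```
# .env (never commit this file): copy .env.default and fill in your secrets.
FS_API_KEY=<your Hopsworks API key>        # sensitive: keep out of git
WANDB_API_KEY=<your W&B API key>           # sensitive: keep out of git
GOOGLE_CLOUD_PROJECT=energy_consumption    # non-sensitive default
```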


Prepare Credentials

I don’t want to repeat myself too much. You already have step-by-step instructions on how to set up your credentials in the "Prepare Credentials" sections of the previous lessons.

Fortunately, in this article, you don’t have to prepare any additional credentials beyond those from the previous lessons.

Checking the "Prepare Credentials" section of Lesson 4 is a great starting point: it shows you how to prepare all your credentials and tools. Also, check the GitHub repository for additional information.

It will show you how to complete all your credentials in your .env file.

Now, let’s start coding 🔥


Data Validation

The GE suite is defined in the _feature-pipeline/feature_pipeline/etl/validation.py_ file.

In the code below, you defined a GE ExpectationSuite called energy_consumption_suite.

Using the ExpectationConfiguration class, you can add various validation tests. In the following example, 2 tests were added:

  1. Checks that the columns of the table match a given ordered list.
  2. Checks that the number of columns equals 4.
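A minimal sketch of how that can look with the GE core API (a sketch, not the exact repository code):

```python
from great_expectations.core import ExpectationConfiguration, ExpectationSuite

# The suite that Hopsworks will later run on every data insert.
expectation_suite_energy_consumption = ExpectationSuite(
    expectation_suite_name="energy_consumption_suite"
)

# 1. The table's columns must match this exact ordered list.
expectation_suite_energy_consumption.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_table_columns_to_match_ordered_list",
        kwargs={
            "column_list": [
                "datetime_utc",
                "area",
                "consumer_type",
                "energy_consumption",
            ]
        },
    )
)

# 2. The table must have exactly 4 columns.
expectation_suite_energy_consumption.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_table_column_count_to_equal",
        kwargs={"value": 4},
    )
)
```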

Easy and powerful 🔥

Now, let’s take a look at the full validation suite 👇

Using GE, you will check a Pandas DataFrame for the following characteristics:

  1. The columns should be equal to: ["datetime_utc", …, "energy_consumption"].
  2. The DataFrame should have exactly 4 columns.
  3. Column "datetime_utc" should have no null values.
  4. Column "area" should contain only the values 0, 1 or 2.
  5. Column "area" should be of type int8.
  6. Column "consumer_type" should contain only the values 111, …
  7. Column "consumer_type" should be of type int32.
  8. Column "energy_consumption" should have values ≥ 0.
  9. Column "energy_consumption" should be of type float64.
  10. Column "energy_consumption" should have no null values.

As you can see, the quality checks mostly boil down to:

  1. Check the schema of the table.
  2. Check the type of the columns.
  3. Check the values of the columns (different logic for discrete or continuous features).
  4. Check for nulls.
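For illustration, here is how a few of the column-level checks above map to standard GE expectation types, continuing the suite sketched earlier:

```python
from great_expectations.core import ExpectationConfiguration

# Nulls: "datetime_utc" must never be null.
expectation_suite_energy_consumption.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_values_to_not_be_null",
        kwargs={"column": "datetime_utc"},
    )
)

# Discrete values: "area" may only take the values 0, 1 or 2.
expectation_suite_energy_consumption.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_distinct_values_to_be_in_set",
        kwargs={"column": "area", "value_set": [0, 1, 2]},
    )
)

# Continuous values: "energy_consumption" must be >= 0.
expectation_suite_energy_consumption.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_min_to_be_between",
        kwargs={"column": "energy_consumption", "min_value": 0, "strict_min": False},
    )
)

# Types: "energy_consumption" must be float64.
expectation_suite_energy_consumption.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_of_type",
        kwargs={"column": "energy_consumption", "type_": "float64"},
    )
)
```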

You will attach this validation suite to the to_feature_store() loading function of the FE pipeline, found in the _feature-pipeline/feature_pipeline/etl/load.py_ file.

Now, Hopsworks will run the given GE validation suite every time a new DataFrame is inserted into the feature group.

You can choose to reject the new data if the validation suite fails, or to receive an alert and take manual action.
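As an illustration, attaching the suite with the Hopsworks API could look like the sketch below (the feature group name and keys are assumptions, not the exact repository code):

```python
import hopsworks

# Connect to Hopsworks; the API key is read from the environment.
project = hopsworks.login()
feature_store = project.get_feature_store()

# Attaching the expectation suite makes Hopsworks validate every insert.
feature_group = feature_store.get_or_create_feature_group(
    name="energy_consumption_denmark",  # hypothetical name
    version=1,
    primary_key=["area", "consumer_type"],
    event_time="datetime_utc",
    expectation_suite=expectation_suite_energy_consumption,
)

# Validation runs here; depending on the ingestion policy, failing data
# can be rejected or ingested with a warning.
feature_group.insert(df)
```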


ML Monitoring

When it comes to ML monitoring, the hardest part isn’t the code itself but mostly choosing how to monitor your ML models.

Note that dedicated tools such as Evidently or Arize are usually used for ML monitoring. But in this case, I wanted to keep it simple and not add another tool to the series.

But the concepts remain the same, and they are the most crucial part to understand.

In the code snippet below, we did the following:

  1. Loaded the predictions from the GCP bucket. All the predictions are aggregated in the predictions_monitoring.parquet file during the batch prediction step.
  2. Prepared the structure of the predictions DataFrame.
  3. Connected to the Hopsworks Feature Store.
  4. Queried the Feature Store for data within the minimum and maximum predictions datetime edges. This is your GT. You want to get everything available based on your prediction’s datetime window.
  5. Prepared the structure of the GT DataFrame.
  6. Merged the two DataFrames.
  7. Where the GT is available, computed the MAPE metric. For simplicity, you will compute the MAPE aggregated over all the time series.
  8. Wrote the results back to the GCP bucket, which will be loaded & displayed by the frontend.
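To make steps 6 and 7 concrete, here is a minimal sketch of the merge-and-MAPE logic (the column names are assumptions, not the exact repository code):

```python
import pandas as pd
from sklearn.metrics import mean_absolute_percentage_error


def compute_mape(predictions: pd.DataFrame, ground_truth: pd.DataFrame) -> pd.DataFrame:
    """Join the predictions with the available GT and compute one MAPE value
    per timestamp, aggregated over all the time series."""
    merged = predictions.merge(
        ground_truth,
        on=["datetime_utc", "area", "consumer_type"],
        how="inner",  # keep only the timestamps where the GT already exists
        suffixes=("_pred", "_gt"),
    )
    return (
        merged.groupby("datetime_utc")
        .apply(
            lambda group: mean_absolute_percentage_error(
                group["energy_consumption_gt"], group["energy_consumption_pred"]
            )
        )
        .rename("MAPE")
        .reset_index()
    )
```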

The function defined above will act as its own task in the Airflow DAG. It will be called every time the ML pipeline runs. Thus, every hour, it will look for new matches between a prediction and a GT, compute the MAPE metric, and upload it to GCS.
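For illustration, wiring such a task into the DAG with Airflow’s TaskFlow API could look like the sketch below (Airflow 2.x assumed; the DAG and task names are hypothetical):

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@hourly", start_date=datetime(2023, 4, 1), catchup=False)
def ml_pipeline():
    @task
    def batch_predict() -> str:
        # ... run the batch prediction pipeline and upload the
        # aggregated predictions to the GCP bucket ...
        return "predictions_monitoring.parquet"

    @task
    def compute_monitoring(predictions_file: str) -> None:
        # ... load the predictions, match them with the available GT,
        # compute the MAPE metric, and upload the results to GCS ...
        ...

    compute_monitoring(batch_predict())


ml_pipeline()
```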

Read Lesson 6 to see how you can display the results from the GCP bucket in a beautiful UI using Streamlit and FastAPI.


Conclusion

Congratulations! You finished the fifth lesson from the Full Stack 7-Steps MLOps Framework course. It means you are close to knowing how to build an end-to-end ML system using MLOps good practices.

In this lesson, you learned how to:

  • use GE to build a data validation suite that tests your data quality and integrity,
  • understand why ML Monitoring is essential,
  • build your own ML monitoring system to track model performance in real time.

Now that you understand the power of taking control of your data and ML system, you can sleep like a baby at night, knowing that everything is working well, and if not, that you can quickly diagnose the issue.

Check out Lesson 6 to learn how to use your predictions and monitoring metrics from your GCP bucket to build a web app using FastAPI and Streamlit.

Also, you can access the GitHub repository here.


💡 My goal is to help machine learning engineers level up in designing and productionizing ML systems. Follow me on LinkedIn or subscribe to my weekly newsletter for more insights!

🔥 If you enjoy reading articles like this and wish to support my writing, consider becoming a Medium member. Using my referral link, you can support me without extra cost while enjoying limitless access to Medium’s rich collection of stories.



References

[1] Energy Consumption per DE35 Industry Code from Denmark API, Denmark Energy Data Service

[2] Data Validation for Enterprise AI: Using Great Expectations with Hopsworks (2022), Hopsworks Blog

