
Monitoring and Retraining Your Machine Learning Models

With Google Data Studio, lakeFS and Great Expectations

Felipe de Pontes Adachi
Towards Data Science
14 min read · Mar 4, 2021


Photo by Nathan Dumlao on Unsplash

Like everything in life, machine learning models go stale. In a world of ever-changing, non-stationary data, everyone needs to go back to school and refresh their knowledge once in a while, and your model is no different.

Well, we know that retraining our model is important, but when exactly should we do it? If we do it too frequently, we’d end up wasting valuable time and effort, but doing it too seldom would surely hurt our predictions’ quality. Unfortunately, there is no one-size-fits-all answer. Each case should be carefully assessed in order to determine the impact of staleness.

In this article, I’d like to share my approach to monitoring and retraining on a personal project: a fake news detector web application. I’m by no means an expert on the subject, so if you have any suggestions or considerations, please feel free to get in touch.

The Application

Our simple web application is basically a fake news detector: the user enters the URL of a news article, and the system outputs its prediction of whether the article is fake or real.

Gif by author

For every input, the system logs the prediction’s result and additional metadata in a BigQuery table at GCP. That’s the data we’ll use to monitor our model’s performance and, when needed, to retrain it.

This article is divided into two parts: Monitoring and Retraining.

First, I’ll talk about how I used the prediction logs at BigQuery to set up a Google Data Studio dashboard in order to have some updated charts and indicators to assess my text classification model’s health.

In the Retraining section, I’ll show how I approached data versioning to manage my data and model artifacts for each retraining experiment while keeping data quality in mind. To do so, I used tools such as lakeFS, Great Expectations, and W&B. Everything discussed in this part can be found at the project’s repository.

Monitoring

Let’s begin by taking a look at the model_predictions table’s schema:

  • title (STRING): The news article’s title.
  • content (STRING): The news article’s text content.
  • model (STRING): The name of the model that generated the prediction.
  • prediction (STRING): The model’s prediction, either “Real” or “Fake”.
  • confidence (FLOAT): The model’s confidence in its prediction, from 0 to 1.
  • url (STRING): The news article’s URL.
  • prediction_date (DATETIME): Date and time when the prediction was made.
  • ground_truth (STRING): Starts as NULL and can be changed to “Real” or “Fake” during a labeling process.
  • coverage (FLOAT): The percentage of words in the article that are present in the model’s vocabulary.
  • word_count (INTEGER): The number of words in the article.

These fields are all calculated by our application while serving the online prediction request. If you’re interested in how those metrics were calculated in the first place, you can take a look at app.py in the project’s repository.
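
As a rough illustration, the two derived metrics could be computed along these lines. This is a minimal sketch of my own, assuming coverage is stored as a fraction between 0 and 1 and that vocabulary is the set of tokens known to the trained vectorizer; the actual implementation in app.py may differ:

# Hypothetical sketch: derive word_count and coverage from an article's content.
def text_metrics(content, vocabulary):
    words = content.lower().split()
    word_count = len(words)
    known = sum(1 for word in words if word in vocabulary)
    coverage = known / word_count if word_count else 0.0
    return word_count, coverage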

From these fields, we can set up an online dashboard with Google Data Studio to constantly monitor some indicators of our prediction model:

Image by author

Data Studio has a very intuitive interface. In this example, we’re using only one data source, which is the model prediction table at BigQuery. So, from a blank dashboard, we can add a data source by simply clicking the Add Data button at the top, and then selecting the BigQuery option.

I chose not to display charts related to ground_truth, as this field can be frequently empty if there’s no intention of retraining the model in the near future. But if labeling is done regularly, a Confusion Matrix would be a great addition to our dashboard.

In this example, since I don’t have a great number of records yet, we shouldn’t draw any statistical conclusions from these numbers. Hopefully, in the future, with more daily predictions, these charts will become more informative. Nonetheless, let’s go ahead and discuss each of these indicators:

Records

This is simply the number of predictions the application has served until now. To create it, go to Add a Chart at the top and select scorecard. Choose your table as Data Source and Record Count as Metric.

Labeled

This is the percentage of records that have been manually labeled. This is important for the next step of retraining. Without ground truth, you can’t retrain your model.

Like Records, this is also a scorecard. But for this one, you have to create a new calculated field, by clicking ADD A FIELD at the bottom of the scorecard’s DATA tab:

Screenshot by author

Then select the newly created field as your Metric, just like before.

Current Model

The name of the model that made the most recent prediction. This assumes we have only one application and that only one model is serving at a time.

This indicator is actually a Table chart, rather than a scorecard. As Dimension, select model and prediction_date, set Rows per page to 1, and sort prediction_date in Descending order. Then hide everything possible in the STYLE tab.

Predictions

The percentage of Real vs. Fake predictions.

To create it, go to Add a Chart → Pie. Choose prediction as Dimension and Record Count as Metric, and you’re good to go.

Prediction Metrics

This graph shows the average of two values over time, on a weekly basis: confidence and coverage. From it, we should be able to spot unusual changes in the model’s level of confidence in its predictions, and also in how much of each article is present in the trained model’s vocabulary. If coverage trends downward and goes below a predefined threshold, it might be time to retrain the model.

This one’s a Time series. In the Dimension field, the little calendar icon lets you set the period; I chose ISO Year Week. As the first Metric, choose confidence, with AVG as its aggregation. Then go to Add metric and do the same with coverage. In the STYLE tab, you can change both series from Lines to Bars and set their colors.

Word Count Frequency

This chart groups records by their number of words, according to each prediction category. Even though we need more data to get a real grasp of the word count distribution, it seems that Fake news articles are usually shorter than Real ones.

Another strange detail is that there are 5 Fake predictions with a word count between 0–100. That seems a little low for a piece of news, which led me to investigate further. I eventually found out that these records were extracted from the same website, and they all had a parsing error. The content was not the actual news, but rather an error message. This is important, and we should make sure that these kinds of records will never be ingested into our training-test dataset.

That was the closest I got to making a histogram in Data Studio. To create one, go to Add a Chart → Bar and add a new field at the bottom of the DATA tab. In the formula field, enter:

FLOOR(word_count/100)*100

For each record, the word count is divided by 100, rounded down to the nearest integer, and multiplied by 100 again, which puts the records into bins of width 100. I named that field wc_bin. You can use it as the Dimension and Record Count as the Metric. You should also set Sort to Ascending and choose wc_bin.

To split the histograms into Fake and Real, you can go to Add a Filter, right below Sort, and insert a filter like this one:

Screenshot by author

And just do the opposite for the next category.

Retraining

Alright, now we have a way to check on our model’s health. If something looks out of the ordinary, we can start labeling some of the records and take a look at some important classification metrics, such as Precision, Recall, and F-score. If we’re not happy with the results, it’s time to retrain the model.

Overview

The image below is an overview of the retraining process I set up for this project:

Image by author

We have our Base Data, the original dataset we used to train our first prediction model. In addition, the application is constantly feeding us data from the online predictions. So, in order to retrain the model, we can extract this data and do some simple preprocessing, such as removing duplicate news. It is also very important to validate our data and make sure it complies with the assumptions we have regarding its shape and distribution. Then we join both data sources and finally retrain the model. Before replacing the old model, we must evaluate the newly trained model on a test set to make sure of its quality.

Every time we wish to retrain the model, we follow these steps. But if we notice something wrong in the future, how do we debug our model? To do so, we need an orderly way to store our model and data artifacts, along with the model’s performance results and the code that generated it.

Data Versioning with LakeFS

Managing data pipelines in an ML application is not an easy task. Unlike traditional software development, where Git has become the standard for code versioning, data versioning is still in its early stages, and there’s no “definitive” way of doing it.

I decided to treat each retraining process as a single experiment and store the artifacts of each experiment in a single branch. This is a little different from what we’re used to with code versioning because in this case, I don’t expect to merge the branches back into the trunk. They are short-lived branches with the purpose of versioning our artifacts — datasets and models — according to different retraining experiments.

Even though there are a number of MLOps tools that provide data versioning functionality, such as MLflow and W&B, I opted to tackle data versioning at the infrastructure level, by enabling Git-like operations over my object storage. For that purpose, I decided to try out the recently released lakeFS, which lets me add a Git-like engine on top of my existing S3 object storage. That way, my data versioning capability for the project is independent of any tools that I might add or replace in the future.

Preliminary Steps — Deploying lakeFS

You can find the instructions to set up your lakeFS environment here.

I basically had to:

  1. Create a PostgreSQL database on AWS RDS
  2. Configure an S3 bucket for my repository
  • For the policy’s Principal, I created a user and generated access keys for it. I’ll also use this user to authenticate lakeFS to AWS. You can read more about it here and here.
  3. Install lakeFS
  • I chose to install it via Docker, with the following command:
docker run --name lakefs -p 8000:8000 \
  -e LAKEFS_DATABASE_CONNECTION_STRING="<postgres-connection-str>" \
  -e LAKEFS_AUTH_ENCRYPT_SECRET_KEY="<lakefs-secret-key>" \
  -e LAKEFS_BLOCKSTORE_TYPE="s3" \
  -e LAKEFS_GATEWAYS_S3_DOMAIN_NAME="s3.local.lakefs.io" \
  -e LAKEFS_BLOCKSTORE_S3_CREDENTIALS_ACCESS_SECRET_KEY="<s3-secret-key>" \
  -e LAKEFS_BLOCKSTORE_S3_CREDENTIALS_ACCESS_KEY_ID="<s3-access-key>" \
  -e LAKEFS_BLOCKSTORE_S3_REGION="<s3-region>" \
  treeverse/lakefs:latest run

Where postgres-connection-str is the connection string you obtained when creating the PostgreSQL DB, lakefs-secret-key is any randomly generated string (just don’t forget it), s3-secret-key and s3-access-key are the key pair you created for your AWS user earlier, and s3-region is the region of the bucket you created.

4. Setup

At localhost:8000, after setting up a new administrator user and creating the repository, you should be able to see your list of repositories:

Screenshot by author

The Retraining Process

Now that we have everything set up, we need to translate that flowchart into a series of steps and implement it. We’ll basically need to:

  • Get data from online predictions (BigQuery)
  • Clean and assert data quality (Great Expectations)
  • Create a new branch from master (lakeFS)
  • Upload online predictions to a new branch (lakeFS)
  • Get base data from master branch (lakeFS)
  • Join online predictions to base data and split to train-test datasets
  • Upload train-test splits to branch (lakeFS)
  • Train model with the updated dataset
  • Log Experiment/Results (W&B)
  • Upload Model and Vocabulary to branch (lakeFS)
  • Commit changes to branch (lakeFS)

Data Structure

As for our data structure, we’ll define as external data both our original dataset and the data extracted from the web app’s online predictions. The interim folder will keep our data once the external sources are combined and split into the appropriate train-test sets. The model files will also be stored in their own folder.

Image by author

Get Data From Production

First, we need to get the new data from our BigQuery table. To use the Python API for accessing BigQuery, we need GCP credentials in JSON format. After creating a project, you can follow the “creating a service account” instructions in Google’s documentation to download your credentials.

Then, installing google-cloud-bigquery, setting an environment variable pointing to the location of your credentials, and initializing the BigQuery client should be enough for you to query your table. In this case, sunny-emissary-293912 is the name of my project, fakenewsdeploy the name of my dataset, and model_predictions the name of my table.
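
Here’s a minimal sketch of that step; the credentials path, the WHERE clause, and the output file name are assumptions on my part:

import os
from google.cloud import bigquery

# Point the client to the service account key downloaded from GCP (assumed path)
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "gcp_credentials.json"

client = bigquery.Client()

# Fetch the online predictions that already have a ground-truth label (assumed filter)
query = """
    SELECT *
    FROM `sunny-emissary-293912.fakenewsdeploy.model_predictions`
    WHERE ground_truth IS NOT NULL
"""
predictions_df = client.query(query).to_dataframe()
predictions_df.to_csv("online_predictions.csv", index=False)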

Clean And Assert Quality

We should always ensure the quality of the data ingested into our storage. A great way to do that is with Great Expectations. After some very simple cleaning, like removing duplicate news and news with a low word count, we can state some basic assumptions about what we expect from our data.

In this example, we’ll only keep going with our retraining if our data passes some validations. The ground_truth should assume only Fake or Real values, the url value should be unique, and every sample should have non-null content. In addition, we assume that an excessively low coverage must be investigated, as it might be a sign of parsing errors, or maybe content in another language. Finally, the word_count should be above 100, as discussed previously in the Monitoring section of this article.

We can then generate an expectation_suite, which gives us a JSON file describing the validations our data has passed. We should also make sure to store this information for future reference (which we’ll do shortly).
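
Here’s a sketch of how those expectations could be expressed with Great Expectations’ pandas interface. The coverage threshold of 0.2 is an assumption of mine, and the exact expectations in the project may differ:

import great_expectations as ge
import pandas as pd

df = pd.read_csv("online_predictions.csv")
df_ge = ge.from_pandas(df)

# The assumptions discussed above
df_ge.expect_column_values_to_be_in_set("ground_truth", ["Fake", "Real"])
df_ge.expect_column_values_to_be_unique("url")
df_ge.expect_column_values_to_not_be_null("content")
df_ge.expect_column_values_to_be_between("coverage", min_value=0.2, max_value=1.0)  # assumed threshold
df_ge.expect_column_values_to_be_between("word_count", min_value=100)

# Only keep going with the retraining if every expectation passes
results = df_ge.validate()
assert results["success"], "Data validation failed, aborting retraining"

# Save the expectation suite as a JSON file for future reference
df_ge.save_expectation_suite("my_expectation_file.json")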

We are only scratching the surface with Great Expectations here. Check out their website to learn about more of its functionality.

Create new Branch

Now that we trust our data, we can store it in a newly created branch. In the early stages of Model Building/Evaluation for this project, I used W&B to track my experiments. I’ll keep using it during the retraining process, in conjunction with lakeFS. This way, the experiment identifier is the wandb run name, and I’ll use that same name to create the branch:
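
A sketch of what this step looks like; the W&B project name, the repository name, and the connector’s constructor and method signatures are placeholders of mine (the connector itself is sketched at the end of the article):

import wandb
from lakefs_connector import lakefs_conn  # wrapper class sketched at the end of the article

# Start a new W&B run; the threshold will be used later to trigger an alert
run = wandb.init(project="fake-news-detector", config={"threshold": 0.9})

# Use the unique run name as both the experiment identifier and the branch name
lakefs = lakefs_conn(host="localhost:8000",
                     access_key="<lakefs-access-key>",
                     secret_key="<lakefs-secret-key>")
lakefs.create_branch(repo="fake-news-detector", branch=run.name, source="master")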

Two things to point out here. First, while starting the run, I configured a threshold. That’s for a nice W&B feature where I can set up a threshold to be monitored during training. In this case, if the F1-Score of the trained model is below 0.9, W&B will send me an email letting me know. We’ll see the rest of it in the training code.

Another important point is how I’m performing operations on my lakeFS repository. This can be done in several different ways, but I chose to use the Python client. I created a lakefs_connector class to wrap the APIs into functions tailored to my application. The class implementation is shown at the end of this article.

Upload Online Predictions to Branch

Let’s keep going by uploading online_predictions.csv to our branch. We’ll also save my_expectation_file.json to our wandb run.
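
Roughly, that boils down to two calls; the object path inside the repository is an assumption of mine:

# Upload the validated online predictions to the experiment branch
lakefs.upload_object(repo="fake-news-detector", branch=run.name,
                     path="data/external/online_predictions.csv",
                     local_path="online_predictions.csv")

# Attach the expectation suite to the W&B run for future reference
wandb.save("my_expectation_file.json")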

Now we can access our expectation file whenever we need it, and make sure that the data’s state at this particular run complies with our assumptions:

Screenshot by author

Merge Data and Split

Our base data consists of two files: True.csv and Fake.csv, both previously uploaded to our master branch. Let’s append the data from online_predictions.csv to our base data and then split it into train-test datasets:
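
A sketch of that step, assuming the base files share a common set of title/content columns with the online predictions; adjust the column names and repository paths to your own data:

import pandas as pd
from sklearn.model_selection import train_test_split

# Get the base data from the master branch (placeholder connector method)
lakefs.get_object(repo="fake-news-detector", ref="master",
                  path="data/external/True.csv", local_path="True.csv")
lakefs.get_object(repo="fake-news-detector", ref="master",
                  path="data/external/Fake.csv", local_path="Fake.csv")

true_df = pd.read_csv("True.csv")
fake_df = pd.read_csv("Fake.csv")
true_df["label"] = "Real"
fake_df["label"] = "Fake"

# Online predictions, labeled during the monitoring stage
online_df = pd.read_csv("online_predictions.csv").rename(columns={"ground_truth": "label"})

# Keep a common set of columns and join everything together
cols = ["title", "content", "label"]
data = pd.concat([true_df[cols], fake_df[cols], online_df[cols]], ignore_index=True)

# Stratified train-test split
train_df, test_df = train_test_split(data, test_size=0.2, stratify=data["label"], random_state=42)
train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)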

Train Model and Upload Artifacts

The last series of steps is to finally train the model and upload our artifacts to the repository’s branch:
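
In rough terms, this is what the step amounts to; the train_model helper and the repository paths are placeholders, and the actual training code is linked below:

import joblib

# Train the model on the updated dataset (placeholder for the actual training code)
model, vocabulary = train_model(train_df, test_df)

# Persist the model artifacts locally
joblib.dump(model, "model.joblib")
joblib.dump(vocabulary, "vocabulary.joblib")

# Upload the data and model artifacts to the experiment branch
for local_path, repo_path in [
    ("train.csv", "data/interim/train.csv"),
    ("test.csv", "data/interim/test.csv"),
    ("model.joblib", "model/model.joblib"),
    ("vocabulary.joblib", "model/vocabulary.joblib"),
]:
    lakefs.upload_object(repo="fake-news-detector", branch=run.name,
                         path=repo_path, local_path=local_path)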

We’ll upload the train-test splits as well as our model’s files — the actual joblib model and the vocabulary used for our Vectorizer (which is also used to calculate our coverage field).

You can check the whole code for training the model here. I won’t get into specifics, since I already covered it in my previous article. I just want to point out an excerpt of it, related to the alert configuration we set up:
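
The excerpt boils down to something like this (a sketch of mine; the exact code is in the linked training script):

# Send a W&B alert if the F1-Score ends up below the configured threshold
if f1_score < wandb.config.threshold:
    wandb.alert(
        title="Low F1-Score",
        text=f"F1-Score of {f1_score:.3f} is below the threshold of {wandb.config.threshold}",
    )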

f1_score is the F1-Score calculated during the training process. Since we set the threshold to 0.9 previously, if the score for this run ends up below that value, an alert will be sent to the configured destination. In my case, it’s my email:

Screenshot by author

Commit Changes

Now, what is left to do is to commit our changes into our branch:
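
Through the connector, that’s a single call (method name as sketched at the end of the article):

# Commit everything that was uploaded to the experiment branch
lakefs.commit(repo="fake-news-detector", branch=run.name,
              message=f"Retraining experiment {run.name}")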

And we’ll have our data separated according to our retraining experiments, like this:

Screenshot by author

Since the run names are unique, we can easily match the data with our retraining results on our experiments dashboard at W&B:

Screenshot by author

The lakeFS connector

In the code snippets above, we have performed some operations on our repository, such as creating branches, downloading and uploading objects, and committing changes. To do so, I used bravado to generate a dynamic client, as instructed here. This way, we have access to all of the lakeFS APIs as Python commands.

By instantiating the lakefs_conn class, we create a lakeFS client, and do the required repository operations through the object’s methods:
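
The class below is a rough sketch of such a wrapper rather than a drop-in implementation: the operation names (createBranch, uploadObject, getObject, commit) follow the lakeFS OpenAPI spec as documented at the time and may need adjusting to your lakeFS version. The method signatures are the ones I used in the snippets above.

from bravado.client import SwaggerClient
from bravado.requests_client import RequestsClient


class lakefs_conn:
    def __init__(self, host, access_key, secret_key):
        # Authenticate against the lakeFS server and generate a dynamic client
        # from its OpenAPI definition
        http_client = RequestsClient()
        http_client.set_basic_auth(host, access_key, secret_key)
        self.client = SwaggerClient.from_url(
            f"http://{host}/swagger.json",
            http_client=http_client,
            config={"validate_swagger_spec": False},
        )

    def create_branch(self, repo, branch, source="master"):
        # Create a new branch starting from the source branch
        return self.client.branches.createBranch(
            repository=repo, branch={"name": branch, "source": source}
        ).result()

    def upload_object(self, repo, branch, path, local_path):
        # Upload a local file to the given path in the branch
        with open(local_path, "rb") as f:
            return self.client.objects.uploadObject(
                repository=repo, branch=branch, path=path, content=f
            ).result()

    def get_object(self, repo, ref, path, local_path):
        # Download an object from a branch (or commit) to a local file
        content = self.client.objects.getObject(
            repository=repo, ref=ref, path=path
        ).result()
        with open(local_path, "wb") as f:
            f.write(content)

    def commit(self, repo, branch, message):
        # Commit the staged changes on the branch
        return self.client.commits.commit(
            repository=repo, branch=branch, commit={"message": message}
        ).result()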

What’s Next

I wanted to share with you my take on monitoring my model’s performance and retraining it, but this is really just the first step of a long process. As time goes on, the additional data gives us more insights, and we eventually discover better ways to monitor our application.

For example, in the future, we could plot the performance of multiple retrained models and compare them over time, to more accurately assess the impact of time on our predictions. Or, given enough ground truth information, we could add metrics such as precision and recall to our dashboard.

As for the retraining part, much can be improved. As we discover more about our data’s distribution, additional expectations can be added to our assertion stage. The addition of hooks for automatic pre-commit validation is also a natural evolution. Webhooks weren’t yet supported by lakeFS while this article was being written, but they are already available as of the latest release (0.33).

That’s it for now! If you have any feedback or questions, feel free to reach out!

Thank you for reading!
