Zombies & Model Rot (with ML Engine + DataStore)

Zack Akil
Towards Data Science
8 min read · Oct 24, 2018


Don’t leave your models to rot into obscurity

So you’ve deployed your machine learning model to the cloud and all of your apps and services are able to fetch predictions from it, nice! You can leave that model alone to do its thing forever… maybe not. Most machine learning models are modeling something about this world, and this world is constantly changing. Either change with it, or be left behind!

What is model rot?

Model rot, data rot, AI rot, whatever you want to call it, it's not good! Let's say we've built a model that predicts whether a zombie is friendly or not. We deploy it to the cloud and now apps all over the world are using it to help the general public know which zombies they can befriend without getting bitten. Amazing, people seem super happy with your model, but after a couple of months you start getting angry emails from people who say that your model is terrible! Turns out the zombie population mutated! Now your model is out of date, or rotten! You need to update your model and, even better, add a way to keep track of your model's state of rot so that this doesn't happen again.

This is an example of a very sudden case of model rot!

It’s not just zombies that change

Sure, fictional creatures can change, but so can financial markets, residential environments, traffic patterns, weather patterns, the way people write tweets, the way cats look! OK, maybe cats will always look like cats (although give it a few million years and maybe not). The point is that what your models are predicting affects how fast they are going to rot.

It's also important to note that the thing you are predicting doesn't need to change for your model to rot. Maybe the sensor you are using to capture input data gets changed. Anything that negatively affects the performance of your deployed model is effectively causing model rot: either remove the thing causing the reduced performance or update the model (most likely the latter).

Let’s fight model rot (with ML Engine + DataStore)

There's a zombie outbreak; however, it's not as scary as the movies would lead you to believe. Zombies are pretty slow-moving creatures, and a lot of them are just looking for human friends, but some aren't. To help people make the right choice of zombie friends, we developed a model that predicts whether a zombie is friendly or not based on a few characteristics:

We used Scikit-Learn to build a Decision Tree Classifier model. See the notebook for the exact code.
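For context, the training step is only a handful of lines. Here's a minimal sketch of the idea with stand-in data (the real zombie characteristics and labels live in the notebook). ML Engine's scikit-learn runtime expects the exported file to be named model.joblib:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.externals import joblib  # bundled with scikit-learn at the time; newer versions use the standalone joblib package

# stand-in for the real zombie data set: 4 characteristics per zombie,
# label 1 = friendly, 0 = not friendly
X = np.random.rand(500, 4)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)
print('test accuracy:', model.score(X_test, y_test))

# ML Engine's scikit-learn runtime looks for a file named model.joblib
joblib.dump(model, 'model.joblib')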

The plan now is to deploy our model to ML Engine, which will host it for us in the cloud (see how we do this using gcloud commands in a notebook).

First we’ll throw our model into cloud storage:

gsutil cp model.joblib gs://your-storage-bucket/v1/model.joblib

Then create a new ML Engine model, which you can do using the Google Cloud Console UI or using gcloud commands (which you can see here used in a notebook):

ML Engine UI

Then we deploy our Decision Tree model as version 1:

Creating a new version of your model using the Cloud Console UI
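If you'd rather script these steps than click through the console, the equivalent gcloud commands look roughly like this (the model name, bucket path and runtime versions here are illustrative; double-check the flags against the ML Engine docs):

gcloud ml-engine models create zombies --regions us-central1

gcloud ml-engine versions create v1 \
  --model zombies \
  --origin gs://your-storage-bucket/v1/ \
  --framework scikit-learn \
  --runtime-version 1.10 \
  --python-version 3.5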

To make it easier for apps to fetch predictions from our model, we’ll create a public endpoint using Cloud Functions. You can read more about how to do this in this blog post I wrote with my colleague Sara Robinson.
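Under the hood, the function is mostly a thin wrapper around the ML Engine online prediction API. Here's a minimal sketch, assuming an HTTP-triggered Python Cloud Function and the google-api-python-client library (the linked post walks through the full version):

import googleapiclient.discovery

PROJECT = 'your-project-id'  # placeholder
MODEL = 'zombies'

def predict_zombie(request):
    """HTTP Cloud Function: forwards zombie characteristics to ML Engine."""
    instances = request.get_json()['instances']

    service = googleapiclient.discovery.build('ml', 'v1')
    name = 'projects/{}/models/{}'.format(PROJECT, MODEL)

    response = service.projects().predict(
        name=name, body={'instances': instances}).execute()

    return str(response['predictions'])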

here’s the current architecture of our system

OK, our model is deployed and apps can easily fetch predictions from it! Now to monitor for model rot with DataStore!

DataStore?

Imagine a place in the cloud where you can store millions of python dictionaries quickly, and then query them (also quickly). That's DataStore, a fully managed NoSQL database on Google Cloud Platform. If you have previous experience with databases you might be used to carefully planning out exactly what structure of data to store in tables, then experiencing the pain of creating migration scripts to update the structure of your database. None of that nonsense with DataStore. Want to store the following data?

{
"name": "Alex",
"occupation": "Zombr trainer"
}

then do it (using the python client library):

# connect to DataStore
from google.cloud import datastore
client = datastore.Client('your project id')

# create new entity/row
new_person = datastore.Entity(key=client.key('person'))
new_person['name'] = 'Alex'
new_person['occupation'] = 'Zombr trainer'

# save to datastore
client.put(new_person)

Oh wait, you want to start storing people's GitHubs and Twitters? Go for it:

# create new entity/row
new_person = datastore.Entity(key=client.key('person'))
new_person['name'] = 'Zack'
new_person['occupation'] = 'Zombr CEO'
new_person['github'] = 'https://github.com/zackakil'
new_person['twitter'] = '@zackakil'
# save to datastore
client.put(new_person)

and DataStore will say “thank you”:

DataStore’s UI

Using DataStore to collect model feedback

The feedback data that we are going to collect will look like the following:

{
"model": "v1",
"prediction input": [2.1, 1.4, 5.3, 8.0],
"prediction output": 1,
"was correct": False,
"time": "23-10-2018,14:45:23"
}

This data will tell us which version of our model on ML Engine was used to generate the prediction (model), what the input data for the prediction was (prediction input), what prediction the model made (prediction output), whether the prediction was correct, i.e. the actual feedback from the user (was correct), and the time that the feedback was submitted (time).

We’ll use Cloud Functions again to make another web API endpoint, this time to receive the feedback data and store it in DataStore:
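The function itself is only a few lines of DataStore code wrapped in an HTTP handler. Here's a sketch of the idea (an HTTP-triggered Python Cloud Function; the field names match the feedback format above, and the entity kind matches the 'prediction-feedback' query used later):

from datetime import datetime, timezone
from google.cloud import datastore

client = datastore.Client()

def collect_feedback(request):
    """HTTP Cloud Function: stores model feedback as a DataStore entity."""
    feedback = request.get_json()

    entity = datastore.Entity(key=client.key('prediction-feedback'))
    entity['model'] = feedback['model']
    entity['prediction input'] = feedback['prediction input']
    entity['prediction output'] = feedback['prediction output']
    entity['was correct'] = feedback['was correct']
    entity['time'] = datetime.now(timezone.utc)

    client.put(entity)
    return 'feedback saved'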

don’t forget to add “google-cloud-datastore” to the Cloud Function’s requirements.txt

Now our system architecture looks like the following:

the new architecture of our system

The client apps just need to add an intuitive way for users to submit their feedback. In our case it could be a simple 'thumbs up or thumbs down' prompt after the user is presented with a prediction:

You may have come across feedback prompts like this before

Get creative with how you collect feedback

Oftentimes you can infer feedback about your model rather than explicitly requesting it from users like I've done in the Zombr interface. For example, if we see that a user stops using the app immediately after a prediction, we could take that as an indication of a wrong prediction 😬.

Back in reality, a dog adoption agency might have a recommender system for new owners. The rate of successful adoptions made by the model is its own performance feedback. If the agency suddenly sees that the system is making a lot fewer successful matches than usual, they can use that as an indication that the model is rotten and may need updating.

Feedback is collected, now what?

Now we can analyse the feedback data. For any data analysis work I default to using Jupyter Notebooks.

Click here for the full notebook of how I fetch data from DataStore and analyse the feedback.

The important bits of fetching data from DataStore are as follows. First, install the DataStore python client library:

pip install google-cloud-datastore

then you can import it and connect to DataStore:

from google.cloud import datastore

# connect to DataStore
client = datastore.Client('your project id')

# query for all prediction-feedback items
query = client.query(kind='prediction-feedback')

# order by the time field
query.order = ['time']

# fetch the items
# (returns an iterator so we will empty it into a list)
data = list(query.fetch())

The library will automatically convert all of the data into python dictionaries:

print(data[0]['was correct'])
print(data[0]['model'])
print(data[0]['time'])
print(data[0]['prediction input'])
>>> True
>>> v1
>>> 2018-10-22 14:21:02.199917+00:00
>>> [-0.8300105114555543, 0.3990742221560673, 1.9084475908892906, 0.3804372006233603]

Thanks to us saving a "was correct" boolean in our feedback data, we can easily calculate the accuracy of our model from the feedback by looking at the ratio of 'True' values for this field:

number_of_items = len(data)
number_of_was_correct = len([d for d in data if d['was correct']])

print(number_of_was_correct / number_of_items)
>>> 0.84

0.84 is not much rot, given that our model scored ~0.9 accuracy when we first trained it, but that's calculated using all of the feedback data together. What if we do the same accuracy calculation on a sliding window across our data and plot it? (You can see the code for doing this in the analysis notebook.)
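The sliding-window calculation itself is only a few lines. Here's a rough sketch of what the analysis notebook does (the window size is an arbitrary choice; tune it to how much feedback you collect):

import matplotlib.pyplot as plt

window_size = 50  # arbitrary choice

window_accuracy = []
for i in range(len(data) - window_size):
    window = data[i:i + window_size]
    correct = len([d for d in window if d['was correct']])
    window_accuracy.append(correct / window_size)

# data is already ordered by the 'time' field, so this is accuracy over time
plt.plot(window_accuracy)
plt.xlabel('window position (time ordered)')
plt.ylabel('feedback accuracy')
plt.show()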

That’s a big drop in performance for the most recent feedback.

We should investigate further. Let's compare the input data (i.e. the zombie characteristic data) from the times of high accuracy to the times of low accuracy. Good thing we also collected that in our feedback:

blue = correct prediction, red = incorrect prediction

Ah, the data looks completely different. I guess the zombie population has mutated! We need to retrain our model ASAP with new data. Good thing we collected the input data in the feedback: we can use it as the new training data, which saves us having to manually collect a new data set. We can use the prediction the model made (the "prediction output" field) together with the user's feedback (the "was correct" field) to infer the correct label for each new training example:
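In spirit, the label recovery looks something like this (a sketch built on the data list fetched above):

new_X = []
new_y = []

for d in data:
    new_X.append(d['prediction input'])

    if d['was correct']:
        # the model was right, so its prediction is the true label
        new_y.append(d['prediction output'])
    else:
        # the model was wrong, so the true label is the other class
        new_y.append(1 - d['prediction output'])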

See how this code is used in the bottom of the feedback analysis notebook.

With this new data set we can train a new version of our model. This is an identical process to training the initial model but using a different data set (see the notebook), and then uploading it to ML Engine as a new version of the model.

Once it's on ML Engine, you can either set it as the new default version of the zombies model, so that all of your clients automatically have their prediction requests sent to the new model, or you can instruct your clients to specify the version name in their prediction requests:

setting v2 as the default model
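If you'd rather do this from the command line, something along these lines should set the default version (check the exact syntax against the gcloud docs):

gcloud ml-engine versions set-default v2 --model zombies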

If you set the default model to v2 then all prediction requests to "zombies" will go to the v2 version:

PREDICTION REQUEST BODY:
{
"instances":[[2.0, 3.4, 5.1, 1.0]],
"model":"zombies"
}

or your clients can just be more specific:

PREDICTION REQUEST BODY:
{
"instances":[[2.0, 3.4, 5.1, 1.0]],
"model":"zombies/versions/v2"
}

After all that you can sit back and just run the same analysis after some more feedback has been collected:

seems like people find our v2 model helpful

Hopefully this has given you a few ideas on how you can monitor your deployed models for model rot. All of the code used can be found in the github repo:

Reach out to me @ZackAkil with any thoughts/questions on monitoring model rot.

