Quality Assurance for Artificial Intelligence

General thoughts and quality assurance for data

Michael Perlin
Towards Data Science

This is the first post in my series on how to test systems which leverage machine learning.

What you need to know about how AI works

Let’s introduce an example AI project to explain the relevant terms (data science people can skip this part). Imagine you run a bakery and want to know every morning how much bread to bake. For this, you use the historical data you have collected before, such as how much bread you sold on each date.

Predicting from the date alone is hardly feasible, but you notice that your demand also depends on:

  • weather conditions (in cold or rainy weather, fewer people go outside for purchases)
  • month (in summer, your customers go on holidays)
  • day of the week (as you are closed on Sundays, people buy more before and after)
  • some events in the neighbourhood (imagine your bakery is close to a football stadium)

So you add columns with values for every day. You call the column you’re predicting the “target” and the columns you’re using for predictions the “features”.
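
To make the terms concrete, here is a minimal sketch of what such a table could look like in pandas; the column names and values are invented for illustration:

```python
import pandas as pd

# Hypothetical daily records for the bakery example: one row per day,
# the feature columns describe the day, "breads_sold" is the target.
history = pd.DataFrame({
    "date":          ["2019-06-01", "2019-06-02", "2019-06-03"],
    "avg_temp_c":    [18.5, 22.0, 15.0],       # weather condition
    "month":         [6, 6, 6],
    "day_of_week":   ["Sat", "Sun", "Mon"],
    "stadium_event": [True, False, False],     # event in the neighbourhood
    "breads_sold":   [310, 0, 180],            # the target (closed on Sunday)
})

features = history.drop(columns=["date", "breads_sold"])
target = history["breads_sold"]
```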

The question

So you hand your data to someone who calls themselves a data scientist and ask them to create something for predicting the demand, which the data scientist calls a “model”. You can imagine the model as a decision tree that an algorithm builds from your data. After the model is delivered, you need to know whether you can trust it (blindly) and use its predictions in your daily work.

To answer this question, we focus on the methodical part and consider the model itself to be a black box. This should give guidance to QA, managers, developers and infrastructure people who need to ensure the quality of software which uses AI. I’m convinced that an understanding of QA and common sense are sufficient prerequisites for this if you learn the playbook of how AI models are developed. You don’t need to take additional classes in math, statistics or deep learning.

I also want to provide ideas and best practices for data scientists, as they often have to ensure the quality of the data and models by themselves. Especially if your experience comes primarily from ML classes, research or Kaggle challenges, which cover only a part of the model lifecycle, you may be interested in additional QA techniques needed when working in the industry.

AI model lifecycle

From a lifecycle perspective, AI projects usually involve 3 stages:

  • collecting and preparing data
  • training the model
  • deploying the model in production to make predictions, and monitoring it

Thus, QA performs different checks or reviews depending on the stage:

  • Is the data which will be used for model training of sufficient quality? This is the topic of this blog post.
  • Is the built model which goes live of sufficient quality? That will be the topic of the next post.
  • Does the model keep delivering sufficient quality when running in production? I will deal with that in the last post in this series.

Some QA steps come down to a check comparing a metric of your data or model to a predefined value or threshold. Others are rather analyses or reviews, requiring time, manual effort, domain knowledge and common sense.

On automation

The chance that you won’t need to retrain your model once it is live, even if the approach and the tools stay the same, is close to zero. Apart from technological changes and new requirements, there are some reasons specific to AI:

  • You are likely to get more or higher-quality data later. As the amount and quality of your data are what give you the main edge in AI, you won’t let this chance slip.
  • If your challenge is about predictions, you need data from today/this week/this month to predict something for tomorrow/next week/next month. So you need to retrain the model with recent data.
  • As time goes by, the world changes, and with it the reality your model reflected well some time ago. Think about estimating house prices in a changing real estate market.
  • Users of your system will most likely find conditions under which your model makes wrong decisions. Especially if your system is worth cheating (such as the automatic processing of insurance claims), expect bad actors who will try. So you need to adapt your model to eliminate, or at least reduce, the impact.

The frequency of model updates depends on the use case and may range from hours to years. This frequency has a heavy impact on the level of automation you need for the flow from raw data to a new production-ready model. You can implement this automation in different ways, which is a separate topic not covered here. But in any case, you must provide the same level of automation for your QA checks. This way, you make QA an unavoidable step that is properly integrated into the model development process.

Imagine you have a nightly script flow which collects data from different sources, starts the model training, builds the service with the new model and replaces the old version of this service. In this case, the checks for data and model quality must be scripts running as part of this flow, raising a red flag if the defined conditions are not met.
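
As an illustration, such a check can be as simple as a script whose non-zero exit code stops the pipeline when a condition is violated. A minimal sketch; the file name, column name and threshold are invented:

```python
# check_data.py - runs as one step of the nightly flow, before model training.
# A non-zero exit code stops the pipeline, i.e. raises the red flag.
import sys
import pandas as pd

MIN_ROWS = 10_000  # hypothetical threshold for this example

def main() -> int:
    df = pd.read_csv("collected_data.csv")  # output of the data collection step
    problems = []

    if len(df) < MIN_ROWS:
        problems.append(f"only {len(df)} rows collected, expected at least {MIN_ROWS}")
    if df["customer_id"].isna().any():
        problems.append("missing customer_id values found")

    for p in problems:
        print(f"DATA CHECK FAILED: {p}", file=sys.stderr)
    return 1 if problems else 0

if __name__ == "__main__":
    sys.exit(main())
```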

Detecting data collection errors

In very rare cases like our small bakery, all the data used in a model may come from a single source. For companies which are at least middle-sized, the data which is useful for predictions may be spread over multiple systems. In an average e-commerce company, the CRM knows the customers’ demographic data, the payment system knows their buying behaviour, the web tracking system knows what they search for, the inventory system knows which products you offer, and in addition, you want to use weather data from an external service. So most of the data takes a non-trivial, multi-step path before data science touches it. Whether that path involves manual operations or is fully automated with ETL tools, something can go wrong. And the source systems may contain errors as well. So we need a set of “entrance checks” before training the model.

Let us assume the data collection run is complete (we consider it to be a black box, leaving the details to data engineers) and you have a freshly updated table with customers or a folder with images.

The easiest types of checks are statistics-based. Once the data collection process is implemented, verify the correctness of the final results manually, e.g. by tracking selected entries through the source and target systems (if your data engineering colleagues have already done this, you’re lucky). Once that is verified, count the entities which are important for model training (accounts, images, customer reviews, audio files). Introduce thresholds for how many of them must be available to assume that data collection worked without major errors.

Consider calculating further statistics which should not change significantly for your data in the long term: how your accounts are distributed by gender, what the average review length is, what the most frequent image type is, etc. For every statistic, define a threshold or range which will be validated before model training.
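
A minimal sketch of such statistic checks; the file names, column names and ranges below are invented for illustration:

```python
import pandas as pd

accounts = pd.read_csv("accounts.csv")  # hypothetical exports of the collected data
reviews = pd.read_csv("reviews.csv")

# Statistics that should stay roughly stable over time, each with an allowed range.
female_share = (accounts["gender"] == "female").mean()
avg_review_len = reviews["text"].str.len().mean()

assert 0.40 <= female_share <= 0.60, f"gender distribution looks off: {female_share:.2f}"
assert 50 <= avg_review_len <= 500, f"average review length looks off: {avg_review_len:.0f}"
```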

You may want to know how many records (at least as an order of magnitude) are sufficient to train a good enough model. It’s hardly possible to derive this number from the problem description and the data alone. Instead, judge the accuracy of the model after training; we will address that in the next blog post.

Detecting data quality issues

As a next step, think about whether you really want to use all available records in your model. Entries you will most likely fix or get rid of before training the model are:

  • duplicates
  • entries created for testing
  • entries with a wrong date entered unintentionally (like a birth year 1897, instead of 1987)
  • entries covering use cases your model should not serve (like corporate accounts if you want a model for personal accounts only)

Perform some exploratory data analysis. For every column/feature used in your model, you should know what exactly it means and what values it can take. Maybe some documentation will be around, maybe it will even be up to date. But don’t assume everything is correct and aligned with the documentation.

For numeric columns like “age”, build a histogram. For categorical columns like “education”, list all possible values. To discover duplicates, check the uniqueness of technical keys and maybe also of some set of features (like the name and address for a customer).
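
A few lines of pandas are enough for these checks; the column names below are invented for illustration:

```python
import pandas as pd

customers = pd.read_csv("customers.csv")  # hypothetical dataset

# Numeric column: a histogram reveals implausible values such as a birth year of 1897.
customers["age"].hist(bins=50)  # uses matplotlib via pandas plotting

# Categorical column: list all values actually present, including typos and test data.
print(customers["education"].value_counts(dropna=False))

# Duplicates: first by technical key, then by a combination of features.
print(customers["customer_id"].duplicated().sum(), "duplicated IDs")
print(customers.duplicated(subset=["name", "address"]).sum(), "duplicated name/address pairs")
```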

Another, more advanced way to discover strange records is “anomaly detection”. Imagine a customer who is 21 years old, has a PhD and 10 years of work experience. All three data points are fine in themselves, but their combination is very unlikely; we can assume somebody just entered the wrong age. Discovering anomalies like this is a pure machine learning topic with a lot of great papers about it.
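
If you want to try this, one common off-the-shelf starting point is an isolation forest (just one of many possible techniques, not necessarily the right one for your data). A minimal sketch with invented column names:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

customers = pd.read_csv("customers.csv")  # hypothetical dataset
features = customers[["age", "years_of_experience", "education_years"]].dropna()

# Flag the most unusual combinations of otherwise plausible values.
detector = IsolationForest(contamination=0.01, random_state=42)
labels = detector.fit_predict(features)  # -1 marks an anomaly

suspicious = features[labels == -1]
print(suspicious.head())  # e.g. age 21 combined with 10 years of work experience
```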

Searching for explanation

If your data landscape is old enough, you will likely discover a lot of issues. For some of them the explanation will be easy to find or obvious, but most likely not for all of them. A lot of questions like “a record with this date looks strange, should I fix it, filter it out or leave it as it is?” will remain, and the only way I know to answer them is to get up from your chair and go ask some questions. Good candidates to ask are:

  • colleagues who work with edge cases: customer support, operations
  • colleagues with a good knowledge of how the system will be used: customer support (again), QA
  • colleagues who have simply been with the company for a long time

Fixing data quality issues

We will only briefly touch upon this topic, as it is mostly about data science and less about QA. Once you discover an issue, consider addressing it at its origin: a duplicate user record should be fixed directly in the CRM, a wrong aggregation of payments must be repaired in the ETL part. But you should still assume that upstream systems have data issues (one will be fixed, but another will arise when a new feature is implemented), and keep checks and/or fixes for them in place.

If the issue has an explanation but is not repairable (in the short term), which is often the case, you should fix it on your side to prevent the wrong data from impacting your model. The fix usually comes down to one of the following (a minimal sketch follows the list):

  • filtering out the record: e.g. you can remove duplicates by ID
  • enriching missing data points from another source: e.g. a missing gender can be guessed by looking up the first name in a dictionary
  • imputing the missing data point: e.g. if the age is missing, use the median age of the dataset
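
A minimal pandas sketch of these three kinds of fixes; the column names and the name-to-gender lookup are invented for illustration:

```python
import pandas as pd

customers = pd.read_csv("customers.csv")  # hypothetical dataset

# 1. Filter out: remove duplicates by technical key.
customers = customers.drop_duplicates(subset="customer_id")

# 2. Enrich: guess a missing gender from the first name via a lookup table.
name_to_gender = {"anna": "female", "peter": "male"}  # toy dictionary
guessed = customers["first_name"].str.lower().map(name_to_gender)
customers["gender"] = customers["gender"].fillna(guessed)

# 3. Impute: fill a missing age with the median age of the dataset.
customers["age"] = customers["age"].fillna(customers["age"].median())
```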

In any case, cleaning every data issue is not mandatory before proceeding; a subset of your data may do for training as well. So first fix the issues which (1) concern important features, (2) occur frequently and (3) are easy to fix, and then train your model candidate. If it doesn’t show sufficient quality, you can go after the harder issues.

What’s next

Now we’re done with QA for the dataset and can proceed with building a model. QA for models will be the topic of our next post.

Image from Biodiversity Heritage Library
