Continuous quality evaluation for ML projects using GitHub Actions.

How to set up automatic metrics collection for a GitHub repo and how I used it to win a prize in an NLP competition.

Vladimir Chernykh
Towards Data Science

--

In this post, the automatic quality evaluation of Machine Learning (ML) based algorithms is discussed.

Recently my team and I took 2nd place in the Sberbank AIJ Competition 2019, one of the largest ML contests in Russia (100 teams, $45k in prizes). This year the task was to solve the Russian language high school graduation exam, which consists of many different Natural Language Processing (NLP) subtasks, each requiring a separate ML model. This led our team to an urgent need for an automatic quality evaluation system. Based on my production experience, I built such a system, and I'd like to share it here.

All the code from this post is available in the GitHub repository: https://github.com/vladimir-chernykh/ml-quality-cicd

Introduction

In the last few years, the ML field has been growing extremely fast. Because of that growth and the accompanying pressure to do everything as fast as possible, many projects lack testing. At the same time, classical software development is a far more mature area with a lot to learn from (Software 1.0 vs Software 2.0). Proper testing of the product is one such thing. One can learn more about testing and its different types starting from these links [1, 2, 3, 4].

Figure 1: One of the software testing classifications

While classical algorithms and applications are deterministic, most Machine Learning based counterparts are inherently stochastic.

This introduces one more type of test that used to be less relevant: algorithm quality tests. They evaluate how accurately the model solves the underlying task, e.g., predicts stock prices, distinguishes between cats and dogs in images, etc. Of course, approximate and stochastic algorithms have existed for a long time, but with the rise of ML they have become ubiquitous (and many classical algorithms come with theoretical bounds on their quality).

Quality evaluation is crucially important when it comes to production deployment. One needs to know how well the model performs, its scope of applicability, and its weak spots. Only then can the model be used efficiently and benefit the product.

Almost every Data Scientist knows how to evaluate the quality of a model in the development environment, e.g., a Jupyter notebook, the IDE of choice, etc. But for production purposes, it is important not only to know the current quality metrics but also to keep track of how they change over time. There are at least three ways to tackle this:

  • Surf through commit history in code storage (GitHub, BitBucket, etc.)
  • Store results in spreadsheets, wiki pages, readme files, etc.
  • Automatic metrics collection and storage

Each of these options has its usage scenarios. The first two work well when changes to the model are few and infrequent. The last approach is indispensable when the model changes frequently and one wants a stable and convenient way of tracking it. In what follows, the last approach is called continuous.

CI/CD

One of the convenient ways to implement continuous evaluation is to use a Continuous Integration/Continuous Delivery (CI/CD) system. There are many options available on the market: Jenkins, Travis CI, GitHub Actions, and others.

The ultimate goal of all CI/CD systems is to automate builds, testing, merging, deployment, etc. The whole CI/CD pipeline works in conjunction with the code repository and is triggered by a specific event (usually a commit).

Figure 2: CI/CD principal scheme. Find more info in sources [5, 6, 7]

Recently GitHub launched its own CI/CD system called GitHub Actions. Previously I had used Jenkins for several production-grade CI/CD systems. The AIJ competition was a perfect chance to experiment with the new tool, so I decided to give it a try. There are a couple of points about Actions that make it stand out:

  • Inherent integration with GitHub. If one has already been using GitHub repositories for code management then Actions should work seamlessly.
  • GitHub provides on-demand virtual machines for running the whole pipeline. This means there is no need to manage one's own machines and their connection to GitHub Actions. It is still possible to do so if one needs a more powerful VM or has other specific requirements (self-hosted runners were released in beta recently).
  • For private repos there is a usage limit for free accounts (2,000 minutes/month); for public repos there are no limits. More info here. I have used GitHub Actions for two small projects and haven't hit the limit, but it would be easy to do so with bigger projects. In that case, Jenkins/Travis/etc. might be a better alternative.

For a more comprehensive and detailed description, please refer to the official docs.

Task

To illustrate how the system is built and how it works, I will use a toy example and build a model for the Boston House Prices task. In practice, this system was applied to the NLP tasks of the AIJ competition.

Real task: AIJ competition

The real-world task from which this continuous quality evaluation system arose is solving the Russian language exam at the AIJ competition.

Figure 4: AIJ contest evaluation pipeline

The peculiarity of this competition is that there are many completely different subtasks (27), which we ended up solving with different models. Examples of subtasks are:

  • Choose the word (out of several) with incorrect stress
  • Determine where a comma should be inserted
  • Determine in which sentences the dashes are used according to the same rule
  • Determine which statements hold true for the given text
  • Build a mapping between the type of grammatical mistake and the sentence where it occurs

There were 3 different automated metrics (Figure 4), and each subtask was judged by one of them. The final ranking was based on the sum of the metrics across subtasks.

We solved these subtasks one by one and often returned to previously solved ones to improve them. Thus we urgently needed an automatic system that keeps track of metric changes for particular subtasks as well as of the aggregate metric.

I do not describe the exact solution of the AIJ contest here because it is too bulky and complicated, and it would distract from the main point of the article. But knowing the context in which the system was built might help in understanding the system and the decisions made.

Toy task: Boston House Prices

This is a regression problem where one needs to predict the price of a housing unit based on different factors such as the number of rooms, the neighborhood crime rate, etc.

This is one of the "default" classical datasets. It is even shipped directly with the scikit-learn library.

Figure 3: data examples from Boston House Prices dataset

There are 13 features in total (see the detailed description in the notebook) and one target called "MEDV", which stands for "Median value of owner-occupied homes in $1000's".

Figure 5: Model development pipeline for Boston House Prices Dataset and LightGBM model

I will use three different models (plus a baseline) to emulate step-by-step "work" on the task:

  • Mean model (baseline)
  • Random predictions
  • Linear Regression
  • Gradient Boosting over Decision Trees (LightGBM)

In real-world problems, this is equivalent to a continuously improving model whose changes are pushed to the repository.
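
To make the baseline concrete, here is a minimal sketch of what a mean-predicting baseline might look like as a scikit-learn-style estimator (the actual MeanRegressor in the repo may be implemented differently):

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin

class MeanRegressor(BaseEstimator, RegressorMixin):
    """Baseline: always predicts the mean of the training target."""

    def fit(self, X, y):
        # Remember the mean of the target seen during training
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        # Return the same constant value for every input row
        return np.full(len(X), self.mean_)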

One should also define metrics that estimate how good the model is. In this case I use the following (a minimal sketch of these metrics is shown after the list):

  • Root Mean Squared Error (RMSE)
  • Mean Absolute Error (MAE)
  • Mean Absolute Percentage Error (MAPE)
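
A minimal NumPy sketch of these metrics (the repo's metrics.py may implement them differently):

import numpy as np

def rmse(y_true, y_pred):
    # Root Mean Squared Error
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def mae(y_true, y_pred):
    # Mean Absolute Error
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def mape(y_true, y_pred):
    # Mean Absolute Percentage Error, in percent
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)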

All the data preprocessing and model training/evaluation code is available in the corresponding notebook in the GitHub repo. The point of this article is not to:

  1. Build the best model
  2. Explain how to do feature engineering and to develop models

Thus I do not describe it here in detail. The notebook is heavily commented and self-contained, though.

Solution Packaging

The solution is shipped as a Dockerized RESTful web-service. The model is wrapped into a server application using Flask (see this file in the repo for details).

A few things to note here (a minimal sketch of such a server is shown after the list):

  • The order of the features is important. Otherwise, the model will fail to make predictions correctly.
  • There is a ready endpoint which simply returns "OK". One can send a request to it to check whether all the models have been loaded into RAM. If the endpoint does not answer, the models are either still loading or have failed to load.
  • The predict endpoint does the actual job of predicting the price. The input is a JSON with all the necessary features present.
  • The input might contain a list of data instances, i.e., more than one at a time.
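
Below is a minimal sketch of such a Flask server. It is not the exact server.py from the repo: the model loading, the endpoint paths, and the payload schema (a bare JSON list of feature dictionaries) are illustrative assumptions based on the description above.

import pickle

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# Illustrative assumption: the trained model is stored as a pickle next to the app
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# The order of the features matters: it must match the order used during training
FEATURE_ORDER = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

@app.route("/ready")
def ready():
    # If this answers, the models have already been loaded into RAM
    return "OK"

@app.route("/predict", methods=["POST"])
def predict():
    # The body is a JSON list of data instances (feature name -> value dictionaries)
    instances = request.get_json()
    features = pd.DataFrame(instances)[FEATURE_ORDER]
    predictions = model.predict(features)
    return jsonify({"predictions": predictions.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)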

One can start the solution server with local Python by going into ml-quality-cicd/src and running python server.py. Although this is an acceptable method, there is a much better platform-agnostic way to do it using Docker. I created a special Makefile which runs all the necessary commands. To start the solution server (Docker should be installed), run the corresponding Makefile target: the create target, described in the "Evaluation: Automatic via web-service" section below, does the job.

It launches a Docker container with the solution web server available on port 8000 of the host machine. For comprehensive technical instructions, please go to the repo.

One can query the launched web service in any suitable way, e.g., using Python, JS, Go, etc. curl (command-line) and Python examples are sketched below:
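
Since the original snippets are not embedded here, the following is a minimal sketch; the payload schema and the feature values are illustrative, and the equivalent curl call is shown in the comment.

import requests

# Equivalent curl call (single line):
#   curl -X POST http://localhost:8000/predict -H "Content-Type: application/json" \
#        -d '[{"CRIM": 0.00632, "ZN": 18.0, "INDUS": 2.31, "CHAS": 0.0, "NOX": 0.538, "RM": 6.575, "AGE": 65.2, "DIS": 4.09, "RAD": 1.0, "TAX": 296.0, "PTRATIO": 15.3, "B": 396.9, "LSTAT": 4.98}]'

instance = {
    "CRIM": 0.00632, "ZN": 18.0, "INDUS": 2.31, "CHAS": 0.0, "NOX": 0.538,
    "RM": 6.575, "AGE": 65.2, "DIS": 4.09, "RAD": 1.0, "TAX": 296.0,
    "PTRATIO": 15.3, "B": 396.9, "LSTAT": 4.98,
}

# The body is a list, so several instances can be sent in one request
response = requests.post("http://localhost:8000/predict", json=[instance])
print(response.json())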

In both cases the output should be the same:

{"predictions":[21.841831683168305]}

The choice of the solution format is dictated by the AIJ competition. Recently there has been a strong trend in ML competitions toward container-based solutions (Kaggle Deepfake Detection, Sberbank AIJ Competition, etc.).

One can choose any other solution delivery format that is convenient for a particular purpose, e.g., a script launched from the command line with a path to the data and a location to store the results.

Evaluation

For evaluation, I use the hold-out technique and split the whole dataset into train and validation subsets. Cross-validation (CV) could also be used in this particular case because the Boston House Prices models are simple and fast. In real applications, however, a model might take days to train, which makes CV hardly feasible.

The split is done in the training notebook. This step is also shown in Figure 5.

Note how the random state and other parameters are fixed for reproducibility. After that, the train and validation datasets are saved to CSV files so that they can be used later without re-running the code. These files are available in the data folder of the repo.
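
A minimal sketch of such a split (the exact parameters in the training notebook may differ; the 80/20 ratio, the random seed, and the file names are assumptions):

import pandas as pd
from sklearn.datasets import load_boston  # shipped with scikit-learn versions contemporary to this post
from sklearn.model_selection import train_test_split

boston = load_boston()
data = pd.DataFrame(boston.data, columns=boston.feature_names)
data["MEDV"] = boston.target

# Fixed random_state (and other parameters) for reproducibility
train_df, valid_df = train_test_split(data, test_size=0.2, random_state=42)

# Persist the splits so they can be reused without re-running the notebook
train_df.to_csv("data/train.csv", index=False)
valid_df.to_csv("data/validation.csv", index=False)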

Next, there are a few available options for evaluation:

  • Notebook
  • Manual via web-service
  • Automatic via web-service

Let’s break down how to do each of them.

Evaluation: Notebook

In this method, one measures the quality of the model in place: in the same environment where it is developed. This method helps to build a better model by comparing many candidates.

Below is a comparison table for all 3 models plus the baseline (full code in the training notebook):

Figure 6: Models comparison on the validation set done in Jupyter Notebook

In-place evaluation of algorithms is great for rapid development purposes, but it does not make it easy to trace the history of changes (as discussed in the "Introduction").

Evaluation: Manual via web-service

This is an intermediate step on the way to fully automated continuous quality testing.

Here I am going to use the serving interface described in the "Solution Packaging" section above. On one end, there is a web service that accepts a list of data instances in the JSON body of a POST request. On the other end, there are CSV files (train and validation) that contain data instances in rows. One needs to:

  • Get predictions from the web-service for all the data in CSV files
  • Compute metrics for predictions and ground truth values

To run the code in this section, please make sure that the solution server is running, or start it with the Makefile as described in the "Solution Packaging" section.

Note that because the port is fixed at 8000, there can be only one server running at a time. To kill all running servers, one can use the make destroy_all command or stop them manually using docker stop ....

Client

The client solves the first of these two tasks. Its main goals are to:

  • Cast one format (CSV) to another (list of dictionaries)
  • Send a request to a server
  • Receive and process an answer

The full code of the client is available in client.py file in the repo. Here I address only the main points.

Below is the core logic which sends a file to the endpoint in the appropriate format. The other 120 lines in client.py are, in fact, the interface that makes these lines work properly.
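
A minimal sketch of that logic, assuming the list-of-dicts payload described in the "Solution Packaging" section (the actual client.py differs in its interface details):

import pandas as pd
import requests

def send_file(csv_path, endpoint="http://localhost:8000/predict"):
    # Cast the CSV rows into a list of dictionaries understood by the server
    df = pd.read_csv(csv_path)
    # The target column must not be sent to the model (illustrative assumption)
    instances = df.drop(columns=["MEDV"], errors="ignore").to_dict(orient="records")
    # Send the request and parse the answer
    response = requests.post(endpoint, json=instances)
    response.raise_for_status()
    return response.json()["predictions"]  # the server answers with {"predictions": [...]}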

When the answers are returned, one needs to handle and save them.

Note that one row corresponds to one input file, and in this task one file can contain more than one data instance. It is easier to work with a CSV where one row contains a single instance rather than a list of many. Thus the next step is to transform the answers into a more convenient format (sketched below).
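
A sketch of this transformation, reusing the send_file helper from the sketch above (column names other than "path" and "row_number" are assumptions):

import pandas as pd

# One entry per input file: path -> list of predictions for its rows
answers = {path: send_file(path) for path in ["data/train.csv", "data/validation.csv"]}

rows = []
for path, predictions in answers.items():
    for row_number, prediction in enumerate(predictions):
        rows.append({"path": path, "row_number": row_number, "prediction": prediction})

# One row per data instance instead of one row per file
parsed_answers = pd.DataFrame(rows).sort_values(["path", "row_number"]).reset_index(drop=True)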

Figure 7: Parsed and transformed answers dataframe. Here the client was launched on two files simultaneously: train and validation sets. The transformed dataframe contains predictions for both files. All the predictions are the same because the MeanRegressor model is deployed.

Note that this stage fully depends on the data. The transformation should be rewritten accordingly for each new task and data structure.

Metrics computation

After the answers have been received and stored in the appropriate format, one needs to join the ground truth values to the answers. This allows computing all the metrics both instance-wise and globally. This functionality is covered by the metrics.py file in the repo.

Note that the assignment is done without an explicit join because the parsed_answers dataframe is sorted by the "path" and "row_number" fields.

Once the ground truth values are available, one can compute all the instance-wise metrics, which are MSE, MAE, and MAPE in this case.
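
A sketch of this enrichment under the same assumptions as above (the "MEDV" target column and the metric column names are illustrative):

import pandas as pd

# Ground truth rows concatenated in the same sorted ("path", "row_number") order as parsed_answers
ground_truth = pd.concat(
    [pd.read_csv(path) for path in sorted(parsed_answers["path"].unique())],
    ignore_index=True,
)

# Plain assignment works because both frames follow the same ordering
parsed_answers["target"] = ground_truth["MEDV"].values

# Instance-wise metrics
errors = parsed_answers["prediction"] - parsed_answers["target"]
parsed_answers["mse"] = errors ** 2
parsed_answers["mae"] = errors.abs()
parsed_answers["mape"] = (errors / parsed_answers["target"]).abs() * 100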

Figure 8: Transformed answers enriched by the ground truth labels and instance-wise computed metrics

This enriched table makes it possible to analyze the error distribution and find the instances with the highest/lowest error.

Finally, let's compute the global metrics by averaging the instance-wise metrics across files.

Figure 9: Global metrics averaged across files

Note that the model tested here is MeanRegressor. One can compare the validation quality to the corresponding line in Figure 6 and verify that they are the same.

Evaluation: Automatic via web-service

Everything discussed in the previous section is implemented in the client folder in the repo. The goal now is to automate this process and wrap it into the CI/CD pipeline.

Automation is done by means of a Makefile. I already used it above to launch the web service; now let's take a closer look at its main details. There are a few targets in this Makefile that I'd like to discuss:

  • Runtime variables. Note how the host port is chosen dynamically (based on the UNIX timestamp at the moment of launch). This is done to allow multiple servers to run at the same time.
  • build and push are the targets that prepare (build and push) the Docker image. The image is quite standard and can be found in the dockers folder.
  • create starts the Docker container with the solution server. The resource limits are set to 200 MB of RAM and 1 CPU. Note how the CPU limit is capped by the total number of CPU cores: Docker fails to launch a container if one asks for more CPUs than are available on the host machine.
  • evaluator launches the client and the metrics computation discussed in the "Evaluation: Manual via web-service" section. The code can be found in the client folder. Note that the launch happens in a separate Docker container (started from the same Docker image, though) with the network shared with the host machine. This allows the evaluation container to access the solution server.
  • destroy stops and removes the solution server. Note how the logs are dumped and printed to the output in case of errors.

One can combine all the steps described above and launch the evaluation with a single command, make evaluate, which triggers create, evaluator, and destroy sequentially. The output is a set of evaluation artifacts available under the client/reports/<timestamp>_<data folder name> folder.

Figure 10: Evaluation artifacts

All three files were discussed previously in the “Evaluation: Manual via web-service” section and should be clear by now.

The last step to complete the fully automated system is to tie the described evaluation process to a CI/CD system.

CI/CD: GitHub Actions

The GitHub Actions platform is used as the CI/CD engine. For more details about GitHub Actions and CI/CD in general, please go to the "CI/CD" section of this post. The official docs are also available here. In what follows, I use three approaches. The difference between them is where and how the CI/CD pipeline steps are performed:

  • GitHub machine
  • Remote machine using SSH
  • Remote machine using self-hosted runners

GitHub machine

This is the easiest option to use. GitHub provides users with a virtual machine (VM) which spins up automatically every time a specific event triggers the pipeline. GitHub manages the whole lifecycle of the VM, so the user does not have to bother with it.

All the pipeline definitions should follow a few rules:

  • Be written in YAML format
  • Be stored in the .github/workflows folder inside the repo
  • Follow the syntax rules

Let’s build a workflow that performs the evaluation process and prints the results. The specification is located in the evaluate_github.yml file in the repo.

The pipeline is triggered on each push to the master branch that changes the specified files. The VM runs ubuntu-18.04 and comes with a predefined list of installed packages. Other systems are also available (see the full list here).

The first step executes the publicly available action called checkout (more details here), which checks out the code from the repo to the VM. The next step performs the evaluation; this is done using make evaluate, as discussed previously. Finally, the metrics are printed to stdout. A sketch of such a workflow is shown below.
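
This is a hedged sketch only; the actual evaluate_github.yml in the repo may differ in names, path filters, and the way metrics are printed.

name: evaluate-github

on:
  push:
    branches:
      - master
    paths:              # illustrative path filter
      - "src/**"
      - "client/**"

jobs:
  evaluate:
    runs-on: ubuntu-18.04
    steps:
      # Check out the repo code to the GitHub-managed VM
      - uses: actions/checkout@v1
      # Run the evaluation pipeline described above
      - name: Evaluate
        run: make evaluate
      # Print the resulting metrics to stdout (exact artifact names are an assumption)
      - name: Show metrics
        run: cat client/reports/*/*.csv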

As soon as the YAML config is pushed to the repo, the pipeline is triggered on each suitable event. The results of pipeline runs can be found under the "Actions" tab of the repo.

Figure 11: workflow run interface and results for GitHub provided VM

The successful pipeline run is shown in Figure 11. As one can notice, the metrics are the same as in Figures 6 and 9.

GitHub-managed VMs are a great choice for workflows that do not require many resources. All VMs have only 2 CPUs, 7 GB of RAM, and 14 GB of disk space (link). That is fine for a toy task like Boston House Prices prediction, but for the AIJ contest we quickly ran out of resources, which made these VMs impossible to use for evaluation. For many ML applications, these resources are not sufficient either. This leads to the solution described in the next subsection.

Remote machine using SSH

Assume now that there is a workstation one wants to use to execute the workflow. How can this be accomplished?

One can use the VM spun up by GitHub for any purpose, not only for running the pipeline steps directly. In addition, GitHub allows users to store secrets and use them during workflow execution. A secret is any piece of information that is stored securely on GitHub and can be passed to the VM.

A combination of these two ideas makes it possible to connect to the remote workstation using SSH and execute all the necessary steps there.

To do that, let's first add three secrets to GitHub:

  • EVAL_SSH_HOST, the IP address of the remote machine
  • EVAL_SSH_USER, a username on the remote machine
  • EVAL_SSH_KEY, a private SSH key corresponding to the provided user and remote machine
Figure 12: secrets management UI provided by GitHub

After initializing all the necessary secrets one can use them in the GitHub Actions pipeline definition. The code for the described workflow is available in the evaluate_remote_ssh.yml file in the repo.

The trigger conditions and the initial checkout step are the same as in the previous example. The next step is to initialize the SSH private key. Note how the secret is passed to an environment variable using the special context placeholder ${{ secrets.<SECRET_NAME> }}. Once the SSH credentials are initialized, one can make an archive with the code on the GitHub VM and load it to the specified path on the remote machine using the scp command. Note that the archive is named after the commit hash GITHUB_SHA, which is available by default as an environment variable (see here for the full list of default env vars). After that, the evaluation is carried out by running make evaluate, and the results are copied back from the remote machine to the GitHub VM. A sketch of such a workflow is shown below.
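
This sketch only illustrates the core steps; the actual evaluate_remote_ssh.yml differs in details such as paths, step names, and the exact shell commands.

name: evaluate-remote-ssh

on:
  push:
    branches:
      - master

jobs:
  evaluate:
    runs-on: ubuntu-18.04
    steps:
      - uses: actions/checkout@v1
      # Initialize the SSH private key from the repository secret
      - name: Init SSH key
        env:
          SSH_KEY: ${{ secrets.EVAL_SSH_KEY }}
        run: |
          mkdir -p ~/.ssh
          echo "$SSH_KEY" > ~/.ssh/id_rsa
          chmod 600 ~/.ssh/id_rsa
      # Archive the code (named after the commit hash), evaluate remotely, copy results back
      - name: Evaluate on remote machine
        env:
          SSH_HOST: ${{ secrets.EVAL_SSH_HOST }}
          SSH_USER: ${{ secrets.EVAL_SSH_USER }}
        run: |
          tar czf "$GITHUB_SHA.tar.gz" .
          scp -o StrictHostKeyChecking=no "$GITHUB_SHA.tar.gz" "$SSH_USER@$SSH_HOST:~/"
          ssh -o StrictHostKeyChecking=no "$SSH_USER@$SSH_HOST" \
            "mkdir -p $GITHUB_SHA && tar xzf $GITHUB_SHA.tar.gz -C $GITHUB_SHA && cd $GITHUB_SHA && make evaluate"
          scp -o StrictHostKeyChecking=no -r "$SSH_USER@$SSH_HOST:~/$GITHUB_SHA/client/reports" ./reports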

Figure 13: workflow run results using an SSH connection to the remote machine

Remote machine using self-hosted runners

Remote connection via SSH solves the problem of insufficient GitHub VM resources. But it also requires many extra commands compared with GitHub VM execution (31 lines vs. 5 lines). To resolve this issue and get the best of both worlds, one might use the recently released beta version of self-hosted runners.

The core idea is to expose the remote machine to GitHub Actions for pipeline execution using a special runner application (more info on installation here).

Figure 14: adding self-hosted runners

After establishing the connection, one can use the new runner natively in the pipelines via the runs-on instruction of the YAML configuration file.

A sketch of the pipeline for the self-hosted runner is shown below.
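
The actual YAML file in the repo differs in details; the dashboard-machine secret names and the artifact paths here are assumptions.

name: evaluate-self-hosted

on:
  push:
    branches:
      - master

jobs:
  evaluate:
    # Run the job on the custom machine registered earlier as a self-hosted runner
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v1
      # Clean the environment before the run: nobody manages it for us anymore
      - name: Clean before
        run: make clean
      - name: Evaluate
        run: make evaluate
      # Upload artifacts to the dashboard machine, named after the commit hash
      # (assumes SSH access to that machine is already configured on the runner)
      - name: Upload results
        env:
          DASH_HOST: ${{ secrets.DASH_SSH_HOST }}
          DASH_USER: ${{ secrets.DASH_SSH_USER }}
        run: scp -r client/reports "$DASH_USER@$DASH_HOST:~/dashboard/data/$GITHUB_SHA"
      # Clean the environment after the run as well
      - name: Clean after
        if: always()
        run: make clean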

Note how the runs-on: self-hosted instruction tells GitHub Actions to run the job on the custom remote machine added previously (Figure 14).

The YAML configuration is almost the same as the one for the GitHub-managed VM except for a few steps:

  • Cleaning the environment before and after the pipeline execution. It has to be done because nobody handles the environment for us anymore (previously it was done by GitHub), and it might contain artifacts from previous builds. The make clean command does the work.
  • Uploading artifacts to storage. Here I use a custom (possibly separate) machine defined by the DASH_SSH secrets in GitHub. I store the results in a separate folder where each file is named after the commit hash it corresponds to. One can use any other storage (AWS S3, GCP Storage, etc.) and file format.
Figure 15: CI/CD pipeline execution results via self-hosted runner

Results

All the evaluations performed in this post are consistent and give the same metrics for the same MeanRegressor baseline model.

Now let's change the model sequentially to RandomRegressor, LinearRegression, and LGBMRegressor to see how the metrics change. I implemented a special dashboard for this purpose using the Dash Python library; the code details are in the repo. The dashboard is launched manually on the dedicated remote machine specified by the DASH_SSH secrets in GitHub.

The folder tracked by the dashboard is the one where the self-hosted CI/CD pipeline stores its results. The dashboard itself is available on port 7050.

Figure 16: commits with different models. They simulate the real progressive work on the model.

Each commit shown in Figure 16 triggers all 3 described CI/CD pipelines. All of the executions succeeded, which is indicated by the green mark near each commit. The self-hosted runner also pushes the results to the dashboard, which is shown in Figures 17 and 18.

Figure 17: dashboard with a table of results for the MAE metric. The green row shows that the quality of the last model is better than the baseline (otherwise the row would be colored red).
Figure 18: graph for one of the rows of the results table. Note how the model name matches the commit hash from Figure 16.

The dashboard allows users to conveniently track the progress in history and compare metrics across different subsets/subtasks.

For the real-world AIJ contest, the dashboard looked as follows:

Figure 19: evaluation table for AIJ contest. Each row corresponds to one subtask of the exam.
Figure 20: model progress for one of the tasks.

Conclusion

In this post, I described my perspective on building continuous quality evaluation systems and provided all the necessary code and steps to reproduce it on one’s own.

The system built in this post is already good enough for small/medium-size projects or ML contests. It might also be improved in many ways, e.g.:

  • More reliable storage for logs and metrics
  • Better visualization system
  • Logs analytics
  • etc.

The described continuous quality evaluation workflow can easily be ported to any other CI/CD engine by rewriting config files into the appropriate tool-specific format.

GitHub Actions has its pros and cons. The big advantage is the tight integration with GitHub and the availability of on-demand GitHub-managed VMs. This is convenient if one already uses GitHub and totally unacceptable if not. The main drawback is the limited amount of build time (2,000 minutes/month) for private repositories. For big projects, this leads to the need for a paid premium account or another CI/CD tool. If faced with the minutes limit, I would probably choose the latter option.

In my opinion, GitHub Actions is a good option for small/medium-scale projects that need an easy-to-set-up CI/CD engine and do not want to care much about the infrastructure behind it. For bigger projects and more fine-grained control, one might choose other CI/CD platforms.

P.S. I would like to thank my teammates (German Novikov, Oleg Alenkin, Alexander Anisimov) in the AIJ contest for their contribution to the solution.
