
Docker for data scientists – Part 2

A complete guide to developing a machine learning model, writing an API, packaging it in Docker, running it anywhere, and sharing it with ease.

Docker for data scientists – an end-to-end guide (photo by Rémi Boudousquié on Unsplash).

Overview

In Docker for data scientists – Part 1 we covered some of the basics around using Docker and got a simple Hello World! example up and running. If you missed that post and want to run through the code, you can find it here:

Docker for data scientists – Part 1

Our aim in this post is to move on to an example of how a data scientist or machine learning engineer might want to use Docker to share, scale, and productionise their projects.

We’ll work our way through the following topics:

  • Model serving – develop and deploy an API for a simple model
  • Serialising models – storing and sharing models in Python
  • APIs with Flask – a very simple example of a prediction API
  • Images – packaging everything up and running in a container
  • Repositories – share your work
  • Machine learning at scale – introducing Docker Swarm and Kubernetes
  • Next steps – DataOps, MLOps, and machine learning engineering

Model serving – develop a model and an API, and serve them using Docker

Example: a random forest classifier for the Iris dataset

To keep things simple, we’ll build a simple model for the well-known Iris dataset. We’ll write a straightforward API allowing users to call that model to predict on an array. Then, I’ll walk through the process of packaging and sharing the whole thing with Docker.

Get data

First, let’s get the data. As it’s a super-common demo dataset, we can import it directly from scikit-learn within our script using:
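
A minimal sketch of that import, assuming scikit-learn is installed in your environment:

from sklearn.datasets import load_iris

# Load the bundled Iris dataset into feature and target arrays
iris = load_iris()
X, y = iris.data, iris.target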

Or, if you want the .csv you can get it from GitHub using the terminal with:
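
The exact URL isn’t shown here, but one commonly mirrored copy of the dataset lives in the seaborn-data repository, so a command along these lines should work:

curl -o iris.csv https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv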

It should look something like this:

Downloading the Iris dataset as a .csv from GitHub (image by author).

Note: If your terminal is PowerShell or the Anaconda PowerShell Prompt on Windows, this may not work. By default, PowerShell aliases curl to its Invoke-WebRequest command, which behaves differently. Either install cURL with Chocolatey (choco install curl) or use whatever method you normally use for getting datasets.

Creating the model

We are going to quickly spin up a random forest model. Create a new folder called scripts and, in it, a new file called train_model.py in which we’ll write some simple code for building the model, as follows:
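
Something along these lines will do the job – a sketch that loads the data from scikit-learn, holds out a test set, and reports the two numbers shown in the screenshot below (the split sizes and hyperparameters are just illustrative):

# train_model.py
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the Iris data and hold out a test set
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# Fit a random forest with out-of-bag scoring enabled
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf.fit(X_train, y_train)

# Report the out-of-bag score and the mean accuracy on the held-out set
print(f"OOB score: {rf.oob_score_:.3f}, mean accuracy: {rf.score(X_test, y_test):.3f}")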

The output from our simple model creation script giving out-of-bag score and mean accuracy (image by author).

We could go further by looking at the confusion matrix and the ROC curve, and by tweaking the features and hyperparameters. I won’t here, to keep the example simple – but please take the time to tweak this for your application.

Serialising models

Once we have a model we’re happy with, we want to store it. There are a few methods for achieving this. Here we’re just going to use Pickle to keep things simple. Pickle allows us to convert between Python in-memory objects and streams of bytes – for transfer over networks or storage on a local filesystem.

Pickle syntax is quite straightforward:
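
In rough terms it looks like this – my_object is just a placeholder for whatever you want to store:

import pickle

# Serialise an in-memory object to a file on disk
with open("my_object.pkl", "wb") as f:
    pickle.dump(my_object, f)

# ...and load it back into memory later
with open("my_object.pkl", "rb") as f:
    my_object = pickle.load(f)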

For a good introduction to Pickle check out Manu Kalia’s post here:

Don’t Fear the Pickle: Using Pickle.dump and Pickle.load

Now, to pickle our model, either modify the train_model.py file we already have, or create a new file with a name like build_model.py and put the modified code to build and serialise the model in there.

If you’re modifying train_model.py, add the import at the top and replace the print statement:
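
A sketch of that change, writing the fitted model out as rf-model.pkl (the filename used in the production folder later on):

import pickle  # add this alongside the other imports

# Replace the print statement with a pickle dump of the fitted model
with open("rf-model.pkl", "wb") as f:
    pickle.dump(rf, f)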

Run the script again and you should now see the pickled model on your filesystem. To test this out, let’s make a new file called load_model.py in which we’ll load the model from the serialised object and run a prediction. The code will look something like this:
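
A sketch of load_model.py – the sample values are illustrative and must be in the same column order the model was trained on:

# load_model.py
import pickle

# Load the serialised random forest from disk
with open("rf-model.pkl", "rb") as f:
    model = pickle.load(f)

# Predict on a single sample (same column order as the training data)
sample = [[5.0, 2.0, 3.5, 1.0]]
print(model.predict(sample))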

This should return the value [1] like this:

The output from our script testing the prediction from the pickled model (image by author).

There are a lot of advantages to this approach, and you should consider using it in your workflows if you aren’t already. You can now score new data with the model without rerunning the training code, and you can store and share versions of your model.

APIs with Flask

Now that we have our model, let’s put a quick Flask API around it. We’re just going to add a simple score endpoint that’ll allow users to send an array of feature values and get a prediction in return.

We won’t delve into error handling or anything here – you can do a lot with Flask and I’d highly recommend taking some time to learn it as adding APIs to your projects is both big and clever! A good place to start is this post by Bharat Choudhary which shows how to build a UI and everything for the Iris dataset:

Deployment of Machine learning project using Flask

Create a new file called app.py and let’s build a simple API that lets us adjust the input parameters – in this case, by changing the values after each of the = signs in the request URL. Our code will look like this:
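
A sketch of app.py, assuming rf-model.pkl sits alongside it, that the query-string order below matches the column order used at training time, and that the app listens on port 5001 to line up with the docker run mapping used later:

# app.py
import pickle

from flask import Flask, request

app = Flask(__name__)

# Load the pickled random forest once, at startup
with open("rf-model.pkl", "rb") as f:
    model = pickle.load(f)


@app.route("/score")
def score():
    # Read the four feature values from the query string
    petal_length = request.args.get("petal_length", type=float)
    petal_width = request.args.get("petal_width", type=float)
    sepal_length = request.args.get("sepal_length", type=float)
    sepal_width = request.args.get("sepal_width", type=float)

    # Assemble the features in the same column order the model was trained on
    features = [[petal_length, petal_width, sepal_length, sepal_width]]

    # Return the prediction as a string, e.g. "[1]"
    return str(model.predict(features))


if __name__ == "__main__":
    # Bind to all interfaces so the app is reachable from inside a container
    app.run(host="0.0.0.0", port=5001)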

Now, if we run app.py we can test that this returns the same prediction as earlier by entering the following URL into our browser:

http://localhost:5001/score?petal_length=5.0&petal_width=2.0&sepal_length=3.5&sepal_width=1.0

Your terminal and browser should now look something like this:

The Flask prediction wrapper for the Iris random forest model up and running, giving the same result for the test set as seen earlier (image by author).

Images – package and run

Now the model and app are built and working, we’re ready to package them up using Docker. Firstly, we’ll need to create a requirements.txt to capture the packages Python will need to run the app. The easiest way to get the pip-style format that pip install expects inside the container is:

pip freeze > requirements.txt

(If you manage packages with conda, note that conda list --export writes conda’s name=version=build format, which pip can’t install from, so pip freeze is the safer option here.) Trimmed down to just the packages app.py needs, the file should look like the following – make sure the version numbers match your own:

Flask==1.1.2
numpy==1.19.4
scikit-learn==0.23.2

To keep things simple, if you’ve been working through all of the examples, create a new folder called production. In here, create the Dockerfile – remember this is a text file with the name Dockerfile and no extension. Then add a final copy of app.py, requirements.txt, and rf-model.pkl. The folder should look like this:

production
├─── app.py
├─── Dockerfile
├─── requirements.txt
└─── rf-model.pkl

The contents of the Dockerfile are fairly straightforward, as we can utilise existing Python images from Docker Hub. The Dockerfile will look like this:
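
A sketch of a Dockerfile along those lines, assuming the python:3.8 base image mentioned below and an app listening on port 5001:

# Start from an official Python base image
FROM python:3.8

# Set the working directory inside the container
WORKDIR /app

# Install the Python dependencies first to make the most of layer caching
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy the app and the pickled model into the image
COPY app.py rf-model.pkl ./

# Expose the port the Flask app listens on and start the app
EXPOSE 5001
CMD ["python", "app.py"]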

With this all in place we can build our image using:

docker build -t iris-rf-app .

If you haven’t already downloaded the python:3.8 image, Docker will do so automatically. Overall this took just two minutes on my machine. Your terminal should look something like this:

Docker downloading the base image and sequentially building our Iris Flask app on top of it. All in taking about two minutes (image by author).

We can now check the image exists with docker images and run it with:

docker run -d -p 5001:5001 iris-rf-app

Testing it with our test URL from before, we get the same results. The output from the terminal and browser can be seen below:

Running the docker image and checking the output with our test URL as before (image by author).

We now have a working version of our model, with a simple API, that can be run anywhere we can run a Docker container. If we really wanted to, we could ramp up our API and start using this approach to build end-to-end machine learning and data science pipelines.

Repositories – sharing your work

Sharing Docker images is also really straightforward. Once you’ve signed up for an account at hub.docker.com, you can upload your images to the public repository straight from the command line.

First, you need to log in so your instance of Docker knows your credentials:

docker login -u <your_docker_id> --password <your_docker_password>

Once logged in you can now upload images to your repo. The only change we need to make is to add tags. This is done in the build stage and follows the format:

docker build -t <your_docker_id>/<image_name>:<image_version> .
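
For example, reusing the image name that appears later in this post, a tagged build might look like:

docker build -t asroka/iris-rf-app:1.0 .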

If you leave <image_version> blank Docker appends the latest tag to the image. Rebuilding my image gives the following:

Rebuilt Iris Flask app with my Docker ID, name, and tag (image by author).

Then you just need to push the image to the repo using:

docker push <your_docker_id>/<image_name>

Example output of the Docker push command (image by author).

As shown below, our image is now publicly viewable by navigating to:

hub.docker.com/repository/docker/<your_docker_id>/<image_name>
Our Iris Flask app publicly available on Docker Hub (image by author).

If you want to give it a try, you can navigate to my repo shown above and pull and run that image using:

docker pull asroka/iris-rf-app:1.0
docker run -d -p 5001:5001 asroka/iris-rf-app:1.0

There are also plenty of ways of hosting private Docker repositories – perfect for sharing within your organisation or team. I won’t touch on them in detail, but do check out the documentation here as a starter.

Machine learning at scale

So far we’ve looked at a very simple case, but the approach still has many uses, allowing data scientists and machine learning engineers to package up, share, and run projects anywhere Docker will run.

There’s a lot more to Docker – we haven’t touched on multi-container applications. It’s possible to define and spin up your entire application stack – including databases, servers, load balancers, etc. – with Docker and Docker Compose. Once you get the hang of these approaches, you can start templating and creating complex building blocks for your data pipelines.
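
As a flavour of what that looks like, here is a minimal, illustrative docker-compose.yml that would run the Iris API we pushed above – the image name and port mapping are assumptions carried over from the earlier example:

version: "3.8"
services:
  iris-api:
    image: asroka/iris-rf-app:1.0
    ports:
      - "5001:5001"

Running docker-compose up then brings the service up, and further services – a database, a reverse proxy, and so on – can be added alongside it.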

If you’re interested in a deeper dive, the book Docker: Up & Running is great.

Managing many Docker containers at once can get complex and cumbersome. What if your app is in huge demand – how do you go from a handful of users to thousands? Luckily, there are tools available for just that.

Kubernetes

Developed by Google, Kubernetes is an open-source container orchestration platform that automates the management, deployment, scaling, and networking of containers.

I’ll be covering Kubernetes for data scientists and machine learning engineers in a future post – so watch this space.

Docker Swarm

Another option is Docker Swarm. I won’t be covering Docker Swarm as Kubernetes does the job for me but if you’re interested I’d highly recommend checking out this post by Gabriel Tanner to get started:

The Definitive Guide to Docker Swarm

Conclusion

This post has enough to get you up and running with Docker and deploying a machine learning model into production. With practice, you’ll begin to see the applications in your workflow. There’s a lot you can do. Some thoughts for the next steps follow.

Sharing entire project workspaces

Coming from a consultancy background, I found Docker was a great way to share a finalised project with a customer without the hassle of environment and dependency management. To get a start on this, check out Cookiecutter Docker Science – a great tool to add to your toolbox.

JupyterLab

Setting up JupyterLab can sometimes be cumbersome – ensuring credentials and configurations are all in the right place. It’s possible to wrap all of this up in Docker containers, making it easy to spin up new JupyterLab environments on a range of machines.
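
As one example (the image name here is an assumption, not part of this post’s project), the community Jupyter Docker Stacks images let you spin up a containerised notebook/JupyterLab environment with a single command:

docker run -it --rm -p 8888:8888 jupyter/datascience-notebook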

MLOps, DataOps, and machine learning engineering

Machine learning engineering is a relatively new field, and the tools and practices are still forming. It looks like containers are here to stay, acting as a fundamental building block of many workflows. The fields of MLOps and DataOps are fascinating and well worth further exploration.

This discussion of MLOps by Cristiano Breuel really works as a first touchpoint:

ML Ops: Machine Learning as an Engineering Discipline

And for an introduction to DataOps, see this great post by DataKitchen:

DataOps is NOT Just DevOps for Data

I hope this was useful. If you have any feedback, please reach out at:

If there are any other topics you’re interested in me writing about please let me know!

Please note, this post does contain an affiliate link.

