Machine Learning Service for Real-Time Prediction

Build a production-ready Machine Learning service for real-time prediction using FastAPI

Danylo Baibak
Towards Data Science


Table of Contents:

  1. Introduction
  2. Machine Learning model
  3. REST API
  4. Prediction endpoint
  5. Local development vs production

1. Introduction

There are a number of patterns for using Machine Learning (ML) models in a production environment, such as offline, real-time, and streaming. In this article, we will take a detailed look at how to use ML models for online prediction.

You can find a dozen articles on “How to build a REST API for ML”. The problem is that almost all of them treat the topic very superficially. In this article, we will take a closer look at the differences between local and production environments, how to update a model without redeploying, and how to handle security concerns.

For real-time forecasting, it is common to use a REST API: such a service makes predictions on demand, usually via an HTTP call [1]. To build the REST service, we will use the FastAPI framework. FastAPI is a modern, high-performance, batteries-included Python web framework that is well suited to building RESTful APIs. It can handle both synchronous and asynchronous requests and has built-in support for data validation and JSON serialization [2].

The whole project is Dockerized and is available in the DanilBaibak/real-time-ml-prediction repository on GitHub. Feel free to play around with it and investigate the code directly.

2. Machine Learning model

Before we start talking about the REST API, we need an ML model. In this article, we will use the Boston house prices dataset from the sklearn datasets [3]. It contains 506 instances and 13 numeric and categorical features. We need to solve a regression problem: predict the price of a house based on its properties.

The ML pipeline will be pretty standard. We will use Ridge Regression [4] as our ML model. An important note: remember to fix the random state so that you can reproduce your results.
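
As a rough illustration, the training script could look like the sketch below. The pipeline steps, the ml_pipelines folder, and the artifact file name are assumptions rather than the repository's exact code; also note that load_boston was removed in scikit-learn 1.2, so reproducing it requires an older release.

```python
# Minimal training sketch: load the data, fit a Ridge pipeline, persist the artifact.
# The ml_pipelines folder and the file name are illustrative assumptions.
from pathlib import Path

import joblib
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; use an older release to reproduce
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # fix the random state for reproducible results
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", Ridge()),
])
pipeline.fit(X_train, y_train)
print("R^2 on the test set:", pipeline.score(X_test, y_test))

Path("ml_pipelines").mkdir(exist_ok=True)
joblib.dump(pipeline, "ml_pipelines/model_v1.joblib")
```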

Now, for a real application, it is not enough just to train the model and save it in a folder. We also need:

  1. To track the version of the model. Ideally, we want a model training history in the database.
  2. It is also good practice to save the model metadata after each training. In the long term, this information can reveal critical points in the life cycle of the model. We have already discussed the idea of monitoring model metadata in another article [5]. A minimal sketch of such version tracking follows this list.
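
One minimal way to do this is a table with one row per training run. The sketch below uses SQLite from the Python standard library; the table name, columns, and metadata fields are hypothetical and only illustrate the idea, not the schema used in the repository.

```python
# Hypothetical version-tracking table: one row per training run, with metadata as a JSON blob.
import json
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("ml_service.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS model_versions (
        version    INTEGER PRIMARY KEY AUTOINCREMENT,
        trained_at TEXT NOT NULL,
        metadata   TEXT NOT NULL
    )
""")

metadata = {"model": "Ridge", "n_samples": 506, "r2_test": 0.71}  # illustrative values only
conn.execute(
    "INSERT INTO model_versions (trained_at, metadata) VALUES (?, ?)",
    (datetime.now(timezone.utc).isoformat(), json.dumps(metadata)),
)
conn.commit()
conn.close()
```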

3. REST API

Once you’ve prepared an ML model, you can start building the service. As mentioned earlier, we will build a REST API using the FastAPI framework.

3.1 Project structure

  • The server.py file is the entry point for our REST API service.
  • We will save the ML artifacts in the ml_pipelines folder.
  • The models folder stores the Pydantic schemas, which define the properties and types used to validate request data [6] (a minimal schema sketch follows this list).
  • The script scripts/train.py trains the ML model.
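
To illustrate what lives in the models folder, a pair of request/response schemas might look like the sketch below; the file name and field names are assumptions, while the real schemas in the repository describe the 13 Boston housing features.

```python
# models/house.py (hypothetical file name): Pydantic schemas for request validation and responses
from typing import List

from pydantic import BaseModel


class HouseFeatures(BaseModel):
    features: List[float]  # the 13 numeric features of one house, in the dataset's column order


class PredictionResult(BaseModel):
    prediction: float        # predicted house price
    pipeline_version: int    # version of the ML pipeline that produced the prediction
```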

3.2 API endpoints

Typically, an ML REST service has one main endpoint: “POST /predict”. It receives a list of features for an item and returns a prediction. We’ll look at this endpoint in more detail below.

One more useful endpoint can be “GET /health”. The idea behind it comes from microservices best practices: quickly return the operational status of the service and an indication of its ability to connect to downstream dependencies [7]. The model version can be useful information as part of the health check output (see the sketch after this list):

  • Checking whether the latest version of the ML pipeline exists;
  • If you have QA engineers on your team, they can report issues with a reference to the model version.
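
A minimal health check along these lines could look like the following sketch. It reads the hypothetical model_versions table introduced earlier; the helper and table names are assumptions, not necessarily the repository's code.

```python
# GET /health: report that the service is up and which ML pipeline version it can see
import sqlite3
from typing import Optional

from fastapi import FastAPI, HTTPException

app = FastAPI()


def get_latest_version() -> Optional[int]:
    # hypothetical helper: read the newest version from the model_versions table sketched earlier
    conn = sqlite3.connect("ml_service.db")
    row = conn.execute("SELECT MAX(version) FROM model_versions").fetchone()
    conn.close()
    return row[0] if row else None


@app.get("/health")
def health():
    latest_version = get_latest_version()
    if latest_version is None:
        raise HTTPException(status_code=503, detail="No trained ML pipeline available")
    return {"status": "ok", "pipeline_version": latest_version}
```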

4. Prediction endpoint

Now we come to the main part: the prediction endpoint. The common algorithm is quite simple, and you can find it in many articles (a minimal sketch follows the list of steps):

  1. Receive a request;
  2. Load the model;
  3. Make a prediction;
  4. Send a response;
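
Translated into FastAPI, the naive version of these four steps might look roughly like this (the schema and the artifact path are the illustrative ones from the earlier sketches):

```python
# Naive POST /predict: the pipeline is loaded from disk on every single request
from typing import List

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class HouseFeatures(BaseModel):
    features: List[float]


@app.post("/predict")
def predict(body: HouseFeatures):                           # 1. receive a request
    pipeline = joblib.load("ml_pipelines/model_v1.joblib")  # 2. load the model (disk I/O on every call)
    prediction = pipeline.predict([body.features])[0]       # 3. make a prediction
    return {"prediction": float(prediction)}                # 4. send a response
```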

This approach can work, depending on the volume of traffic. Let’s look at how we can improve it.

4.1 Model caching

Latency is one of the key requirements for a REST service, and loading the model from disk on each request is time-consuming. We can introduce a local cache: it allows us to load the model once, so subsequent requests use the cached model, which speeds up the response time.
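
A small in-process cache keyed by the pipeline version is enough for a sketch of this idea; the repository may implement the caching differently.

```python
# Simple in-process cache: a pipeline version is read from disk only on the first request that needs it
import joblib

_pipeline_cache = {}  # version -> fitted pipeline


def get_pipeline(version: int):
    if version not in _pipeline_cache:
        _pipeline_cache[version] = joblib.load(f"ml_pipelines/model_v{version}.joblib")
    return _pipeline_cache[version]
```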

4.2 Load a new version of the model

For a production application, we will need to retrain the ML pipeline at some point, and ideally this should happen automatically: you set up a cron job to retrain your model, and the REST service loads the latest ML pipeline as soon as it’s ready.

We have already prepared a separate table for storing the ML pipeline version, and we have the local cache: this should be enough to implement loading the latest version of the model. We need to update the prediction endpoint in the following way:
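
Building on the helpers sketched above (get_latest_version and get_pipeline, both hypothetical), the updated endpoint could check the latest version on every request and load a new pipeline only when the version changes:

```python
# POST /predict that always serves the newest pipeline version recorded in the database
@app.post("/predict")
def predict(body: HouseFeatures):
    version = get_latest_version()    # newest version written by the training cron job
    pipeline = get_pipeline(version)  # cache hit for a known version, one disk load for a new one
    prediction = pipeline.predict([body.features])[0]
    return {"prediction": float(prediction), "pipeline_version": version}
```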

Once the new ML pipeline is trained and stored, and its version is saved in the database, the API will start using the newest model. This approach also works if you run multiple service instances behind a load balancer [8].

4.3 Storing predictions

For a production ML application, it is not enough just to make a prediction. Two important questions remain:

  1. How will you keep track of the quality of the predictions?
  2. How do you organize data collection for future model training?

To answer both questions, you need to start storing predictions. But if we simply add logic that saves the prediction on every request, it will affect latency. FastAPI helps us solve this problem: one of its most exciting features is that it supports asynchronous code out of the box using the Python async/await keywords [9].

We need to add the async keyword to the function that saves data to the DB and to the endpoint function, and then await the call that saves the prediction. Now we can store historical data with minimal impact on latency.

The final version of the prediction endpoint will look like this:
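
Roughly, and still using the hypothetical helpers and SQLite tables from the earlier sketches, it could be assembled like this. The predictions table is assumed to exist, and a real project would likely use a proper async database driver instead of pushing a blocking INSERT into a worker thread.

```python
# Final sketch: cached pipeline, latest version from the DB, prediction stored without blocking the event loop
import asyncio
import json
import sqlite3


async def save_prediction(features, prediction, version):
    # hypothetical predictions table; the blocking INSERT runs in a worker thread
    def _insert():
        conn = sqlite3.connect("ml_service.db")
        conn.execute(
            "INSERT INTO predictions (features, prediction, pipeline_version) VALUES (?, ?, ?)",
            (json.dumps(features), prediction, version),
        )
        conn.commit()
        conn.close()

    await asyncio.to_thread(_insert)


@app.post("/predict")
async def predict(body: HouseFeatures):
    version = get_latest_version()
    pipeline = get_pipeline(version)
    prediction = float(pipeline.predict([body.features])[0])
    await save_prediction(body.features, prediction, version)  # keep history for monitoring and retraining
    return {"prediction": prediction, "pipeline_version": version}
```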

5. Local development vs production

To deploy our service to production, we need to dockerize our code. In theory, we just need to wrap our application in Docker according to the documentation [10] and we are done, but there are still a couple of important points to pay attention to.

5.1 Local development

Even after deployment, you still need to continue working on the project. As you remember, we use a database to store the version of the ML pipeline and predictions. We will use Docker Compose to set up the DB and apply migrations.

For local development, we will also run FastAPI in debug mode: the server restarts automatically each time you change your code. For this, we need to start the application slightly differently.
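
For example, if the application is started programmatically, uvicorn's auto-reload could be enabled like this (the host and port are assumptions; the repository may start the server from the command line instead):

```python
# Local entry point: run the API with auto-reload so the server restarts on every code change
import uvicorn

if __name__ == "__main__":
    uvicorn.run("server:app", host="0.0.0.0", port=80, reload=True)
```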

Please note that we use a separate Dockerfile for this. You can also have a separate requirements.txt file for local development. This gives you a convenient working environment without bloating the production Docker image.

5.2 API documentation

FastAPI provides an API documentation engine out of the box. If you visit http://localhost/docs, you will see the automatic interactive API documentation (provided by Swagger UI) [11]. This can be useful for local or staging environments, but depending on your project, you might not want an open documentation endpoint in production. It’s easy to implement:
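
One way to do it, assuming an environment variable such as ENVIRONMENT distinguishes production from other setups (the variable name is an assumption), is to configure the documentation URLs when creating the app:

```python
# Disable the interactive docs in production; elsewhere, serve them under a non-default URL
import os

from fastapi import FastAPI

if os.getenv("ENVIRONMENT") == "production":  # hypothetical environment variable
    app = FastAPI(docs_url=None, redoc_url=None, openapi_url=None)
else:
    app = FastAPI(docs_url="/api-docs", redoc_url=None)
```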

It can also be a good idea to change the default URL for the documentation.

5.3 Security requirements

Let me draw your attention one more time to the fact that information which is useful in a local or staging environment can be dangerous in production. Going back to the health check endpoint: displaying the model version can be very helpful for bug tracking, but it may be unacceptable in production. Depending on your project, you might prefer to hide this information.

Summary

  1. Latency is one of the important requirements for a REST service. Cache the ML pipeline and use asynchronous code whenever possible.
  2. Properly organized storage of ML pipeline versions is key to updating the pipeline without redeployment.
  3. Displaying the version of the ML pipeline as part of the health check endpoint can be useful for bug tracking during development.
  4. Build a secure application! Block access to sensitive information.
  5. Using docker-compose, a separate Dockerfile, and even a separate requirements.txt file allows convenient local development without bloating the production Docker image.

You can find the complete project in the repository DanilBaibak/real-time-ml-prediction with instructions on how to set up and use it.
