Introduction
You’ve spent a lot of time on EDA, carefully crafted your features, tuned your model for days and finally have something that performs well on the test set. Now what? Now, my friend, we need to deploy the model. After all, any model that stays in the notebook has a value of zero, regardless of how good it is.
It might feel overwhelming to learn this part of the Data Science workflow, especially if you don’t have a lot of software engineering experience. Fear not, this post’s main purpose is to get you started by introducing one of the most popular frameworks for deployment in Python – Flask. In addition, you’ll learn how to containerise the deployment and measure its performance, two steps that are frequently overlooked.
What is "deployment" anyway?
First things first, let’s clarify what I mean by deployment in this post. ML deployment is the process of taking a trained model and integrating it into a production system (the server in the diagram below), making it available for use by end-users or other systems.

Keep in mind that in reality, the deployment process is much more complicated than simply making the model available to end-users. It also involves service integration with other systems, selection of an appropriate infrastructure, load balancing and optimisation, and robust testing of all of these components. Most of these steps are out of scope for this post and should ideally be handled by experienced software/ML engineers. Nevertheless, it’s important to have some understanding of these areas, which is why this post will cover containerisation, inference speed testing, and load handling.
Setup
All the code can be found in this GitHub repo. I’ll show fragments from it, but make sure to pull it and experiment with it – that’s the best way to learn. To run the code you’ll need Docker, flask, fastapi, and locust installed. There might be some additional dependencies to install, depending on the environment you’re running this code in.
Project Overview
To make the learning more practical, this post will show you a simple demo deployment of a loan default prediction model. The model training process is out of scope for this post, so an already trained and serialised CatBoost model is available in the GitHub repo. The model was trained on the pre-processed U.S. Small Business Administration dataset (CC BY-SA 4.0 license). Feel free to explore the data dictionary to understand what each of the columns means.
This project focuses mostly on the serving part, i.e. making the model available to other systems. Hence, the model will actually be deployed on your local machine, which is fine for testing but suboptimal for the real world. Here are the main steps that the Flask and FastAPI deployments will follow:
- Create API endpoint (using Flask or FastAPI)
- Containerise the application (endpoint) using Docker
- Run the Docker image locally, creating a server
- Test the server performance

Sounds exciting, right? Well, let’s get started then!
What is Flask?
Flask is a popular and widely adopted web framework for Python due to its lightweight nature and minimal installation requirements. It offers a straightforward approach to developing REST APIs that are ideal for serving Machine Learning models.
The typical workflow for Flask involves defining a prediction HTTP endpoint and linking it to specific Python functions that receive data as input and generate predictions as output. This endpoint can then be accessed by users and other applications.
Create Flask App
If you’re interested in simply creating a prediction endpoint, it’s going to be quite simple. All you need to do is deserialise the model, create the Flask application object, and specify the prediction endpoint with the POST method. You can find more information about POST and other methods here.
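For reference, here is a minimal sketch of what such an app.py can look like. The model filename, the feature handling, and the response key are assumptions on my part, so check the repo for the exact code.

```python
# app.py – minimal sketch of a Flask prediction service.
# The model filename and response key are assumptions; see the repo for the real code.
import pandas as pd
from catboost import CatBoostClassifier
from flask import Flask, request, jsonify

app = Flask(__name__)

# Deserialise the trained CatBoost model
model = CatBoostClassifier()
model.load_model("loan_catboost_model.cbm")  # assumed filename

@app.route("/predict", methods=["POST"])
def predict():
    # Read the json payload with the loan application attributes
    data = request.get_json()
    # Transform the single application into a one-row DataFrame
    features = pd.DataFrame([data])
    # Probability of default (positive class)
    proba = model.predict_proba(features)[0, 1]
    return jsonify({"default_probability": float(proba)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8989)
```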
The most important part of the code above is the predict function. It reads the json input, which in this case is a set of attributes describing a loan application. It then takes this data, transforms it into a DataFrame, and passes it through the model. The resulting probability of default is then formatted back into json and returned. When this app is deployed locally, we can get a prediction by sending a request with json-formatted data to the http://0.0.0.0:8989/predict URL. Let’s try it out! To launch the server, we can simply run the Python file with the command below.
python app.py

When this command runs, you should see a message that your app is running at the [http://0.0.0.0:8989/](http://0.0.0.0:8989/) address. For now, let’s ignore the big red warning and test the app. To check if the app is working as expected, we can send a test request (loan application data) to the app and see if we get a response (default probability prediction) in return.
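A simple way to do this is with the requests library. The feature names below are illustrative placeholders, so use a real row from the pre-processed dataset when you try it yourself.

```python
# Send a test loan application to the local Flask server.
# The feature names/values below are illustrative placeholders.
import requests

sample_application = {
    "Term": 84,
    "NoEmp": 5,
    "NewExist": 1,
    # ...the remaining model features go here
}

response = requests.post("http://0.0.0.0:8989/predict", json=sample_application)
print(response.status_code, response.json())
```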
If you managed to get a response with a probability – congrats! You’ve deployed the model using your own computer as a server. Now, let’s kick it up a notch and package your deployment app using Docker.
Containerise Flask App
Containerisation is the process of encapsulating your application and all of its dependencies (including Python) into a self-contained, isolated package that can run consistently across different environments (e.g. locally, in the cloud, on your friend’s laptop, etc.). You can achieve this with Docker: all you need to do is correctly specify the Dockerfile, build the image, and then run it. The Dockerfile gives the instructions for building your container, e.g. which version of Python to use, which packages to install, and which commands to run. There’s a great video tutorial about Docker if you’re interested in finding out more.
Here’s how it can look for the Flask application above.
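This is a minimal sketch, assuming the dependencies are listed in a requirements.txt next to app.py; the repo’s actual Dockerfile may use a different base image or file layout.

```dockerfile
# Sketch of a Dockerfile for the Flask app; base image and file names are assumptions.
FROM python:3.10-slim

WORKDIR /app

# Install Python dependencies first to take advantage of layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code and the serialised model
COPY . .

EXPOSE 8989

CMD ["python", "app.py"]
```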
Now, we can build the image using the docker build command.
docker build -t default-service:v01 .
The -t flag gives you the option to name your Docker image and provide a tag for it, so this image’s name is default-service with a tag of v01. The dot at the end is the PATH argument that needs to be provided – the location of your model, application code, etc. Since I assume that you’re building this image in the directory with all the code, PATH is set to ., which means the current directory. It might take some time to build this image, but once it’s done, you should be able to see it when you run docker images.
Let’s run the Dockerised app using the following command:
docker run -it --rm -p 8989:8989 default-service:v01
The -it flag makes the Docker container run in an interactive mode, meaning that you’ll be able to see the logs in the shell and stop the container when needed using Ctrl+C. --rm ensures that the container is automatically removed when you stop it. Finally, -p makes the ports from inside the Docker container available outside of it. The command above maps port 8989 from within Docker to the localhost, making our endpoint available at the same address.
Test Flask App
Now that our model is successfully deployed using Flask and the deployment container is up and running (at least locally), it’s time to evaluate its performance. At this point, our focus is on serving metrics such as response time and the server’s capability to handle requests per second, rather than ML metrics like RMSE or F1 score.
Testing Using Script
To obtain a rough estimate of response latency, we can write a script that sends several requests to the server and measures the time (usually in milliseconds) it takes for the server to return a prediction. However, the response time is not constant, so we need to measure the median latency to estimate how long users usually wait for a response, and the 95th percentile latency to capture the worst-case scenarios.
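As a sketch, such a script can look roughly like this; the payload and the number of requests are assumptions, and the repo’s measure_response.py may differ in the details.

```python
# Rough latency measurement against the local server.
# Payload fields and the request count are assumptions.
import time

import numpy as np
import requests

URL = "http://0.0.0.0:8989/predict"
sample_application = {"Term": 84, "NoEmp": 5, "NewExist": 1}  # illustrative payload

latencies_ms = []
for _ in range(300):
    start = time.perf_counter()
    requests.post(URL, json=sample_application)
    latencies_ms.append((time.perf_counter() - start) * 1000)

print(f"Median latency: {np.percentile(latencies_ms, 50):.1f} ms")
print(f"95th percentile latency: {np.percentile(latencies_ms, 95):.1f} ms")
```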
The full script lives in measure_response.py, so we can simply run this Python file to measure these latency metrics.
python measure_response.py

The median response time turned out to be 9 ms, but the worst-case scenario is more than 10x that. Whether this performance is satisfactory is up to you and the product manager, but at least now you’re aware of these metrics and can work further to improve them.
Testing Using Locust
Locust is a Python package designed to test the performance and scalability of web applications. We’re going to use Locust to generate a more advanced testing scenario, since it allows you to configure parameters like the number of users (i.e. loan applicants) per second.
First things first, the package can be installed by running pip install locust in your terminal. Then, we need to define a test scenario that specifies what our imaginary user will do with our server. In our case it’s quite straightforward – the user will send us a request with the (json-formatted) information about their loan application and will receive a response from our deployed model.
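A sketch of what such an app_test.py can look like is shown below; the payload is again an illustrative placeholder.

```python
# app_test.py – Locust scenario sketch; the payload fields are placeholders.
from locust import HttpUser, task

sample_application = {"Term": 84, "NoEmp": 5, "NewExist": 1}

class LoanApplicationUser(HttpUser):
    @task
    def predict(self):
        # Each simulated user repeatedly sends a loan application
        # and receives a default probability in return.
        self.client.post("/predict", json=sample_application)
```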
As you can see, the Locust task is very similar to the test ping that we did above. The only difference is that it needs to be wrapped in a class that inherits from locust.HttpUser, and the performed task (send data and get a response) needs to be decorated with @task.
To start load testing, we simply need to run the command below.
locust -f app_test.py
When it launches, you’ll be able to access the testing UI at http://0.0.0.0:8089, where you’ll need to specify the application’s URL, the number of users, and the spawn rate.

A spawn rate of 5 with 100 users means that every second, 5 new users will start sending requests to your app until their number reaches 100. This means that at its peak, our app will need to handle 100 requests per second. Now, let’s click the Start swarming button and move to the charts section of the UI. Below I’m going to present the results for my machine, but they’ll certainly be different from yours, so make sure to run this on your own as well.

You’ll see that as the traffic builds up, your response time will get slower. There will be some occasional peaks as well, so it’s important to understand when they happen and why. Most importantly, Locust helps us understand that our local server can handle 100 requests per second with a median response time of ~250 ms.
We can keep stress testing our app and identify the load that it cannot manage. For this, let’s increase the number of users to 1000 to see what happens.

Looks like the breaking point of my local server is ~180 concurrent users. This is an important piece of information that we were able to extract using Locust.
Summary
Good job on getting this far! I hope this post has provided you with a practical and insightful introduction to model deployment. By following this project or adapting it to your specific model, you should now have a solid understanding of the essential steps involved. Specifically, you’ve learned how to create REST API endpoints for your model using Flask, containerise them with Docker, and systematically test these endpoints using Locust.
In the next post, I’ll be covering FastAPI, BentoML, cloud deployment and much more so make sure to subscribe, clap, and leave a comment if something is unclear.