Taking Machine Learning Into Production

I’m a big advocate for learning by doing, and it just so happens that this is probably the best way to learn machine learning. If you’re a Machine Learning Engineer (and possibly a Data Scientist), you may never quite feel fulfilled when a project ends at the model evaluation phase of the Machine Learning Workflow, as a typical Kaggle competition would – and no, I have nothing against Kaggle; I think it’s a great platform for improving your modeling skills.
The next step is to put the model into production, a topic that is generally left out of most courses on Machine Learning.
Disclaimer: This article was written using notes from the Deployment of Machine Learning Models course on Udemy.
Formats To Serve ML Models
Once a Machine Learning model has been trained, the next step is to come up with a method to persist it. For example, we can serialize the model object with pickle – see the code below.
import pickle
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# loading the data set
dataset = load_breast_cancer(as_frame=True)
df = pd.DataFrame(data=dataset.data)
df["target"] = dataset.target
# Separating into X and y
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
# splitting training and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, shuffle=True, random_state=24)
# training logistic regression model
lr = LogisticRegression(C=0.01)
lr.fit(X_train, y_train)
# serializing the model
lr_dump = pickle.dumps(lr)
# making predictions with the model
clf = pickle.loads(lr_dump)
y_preds = clf.predict(X_test)
Note: when writing code to take a Machine Learning model into production, an engineer would modularize the code above into separate training and inference scripts to abide by software engineering best practices.
The training script ends at the point where the model is dumped into a pickle file, whereas the inference script begins once the model has been loaded to make predictions on new instances.
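As a rough sketch of that split (the file names and the predict helper are illustrative, not part of the original code):
# train.py – illustrative training script (same data and model as above)
import pickle
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

dataset = load_breast_cancer(as_frame=True)
X, y = dataset.data, dataset.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, shuffle=True, random_state=24)
lr = LogisticRegression(C=0.01).fit(X_train, y_train)
# the training script ends once the model is dumped to a pickle file
with open("model.pkl", "wb") as f:
    pickle.dump(lr, f)

# inference.py – illustrative inference script
import pickle
# the inference script begins by loading the persisted model
with open("model.pkl", "rb") as f:
    clf = pickle.load(f)

def predict(new_instances):
    # new_instances: rows with the same feature columns used during training
    return clf.predict(new_instances)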
Other methods of serving a model include:
- MLflow – MLflow provides a common serialization format that integrates with various machine learning frameworks in Python
- Language-agnostic exchange formats (e.g. ONNX, PFA, and PMML)
It’s always good to be aware of the other options available, since there are some downsides to the popular pickle (or joblib) formats.
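As a rough illustration (assuming joblib and MLflow are installed; the paths are arbitrary), the same trained model could be persisted with either of these instead of pickle:
import joblib
import mlflow.sklearn

# joblib – similar to pickle, but more efficient for NumPy-heavy objects
joblib.dump(lr, "model.joblib")
clf = joblib.load("model.joblib")

# MLflow – saves the model in its common format, alongside metadata files
mlflow.sklearn.save_model(lr, "model_dir")
clf = mlflow.sklearn.load_model("model_dir")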
1 Model Embedded in Application

- Pre-Trained: Yes
- On-the-Fly Predictions: Yes
In this scenario, the trained model is embedded in the application as a dependency. For example, the model can be installed into the application with a pip installation, or the trained model can be pulled into the application at build time from file storage (e.g. AWS S3).
An example of this would be a Flask application used to predict the value of a property. The Flask application would serve an HTML page that acts as an interface for collecting information about the property a user would like an estimated value for. The Flask application would take those details as inputs, forward them to the model to make a prediction, and return the prediction to the client.
In the example above, the predictions are returned to the user’s browser; however, we can vary this method to embed the model on a mobile device.
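A minimal Flask sketch of the embedded approach is below; the route, form handling, and the model.pkl file are assumptions for illustration.
import pickle
import pandas as pd
from flask import Flask, request

app = Flask(__name__)

# the trained model is a dependency of the application itself,
# loaded once at startup (e.g. installed via pip or pulled from S3 at build time)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # collect the property details submitted through the HTML form
    features = pd.DataFrame([request.form.to_dict()], dtype=float)
    estimate = model.predict(features)[0]
    # return the estimated value straight back to the user's browser
    return {"estimated_value": float(estimate)}

if __name__ == "__main__":
    app.run()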
This approach is much simpler than the others, but there is a simplicity-flexibility trade-off. For instance, to update the model, the entire application would have to be redeployed (on a mobile device, a new version would need to be released).
2 Dedicated Model API

- Pre-Trained: Yes
- On-the-Fly Predictions: Yes
In this architecture, the trained machine learning model becomes a dependency of a separate Machine Learning API service. Extending the property-valuation Flask example above, when the form is submitted to the Flask application server, that server makes another call – possibly using REST, gRPC, SOAP, or messaging (e.g. RabbitMQ) – to a separate microservice dedicated to Machine Learning and exclusively responsible for returning the prediction.
Differing from the embedded model approach, this method trades simplicity for flexibility. Since we have to maintain a separate service, there is increased complexity with this architecture, but there is more flexibility since model deployments are now independent of the main application deployments. Additionally, the model microservice and the main server can be scaled separately to deal with higher volumes of traffic or to serve other applications.
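A rough sketch of the two sides is below, assuming a REST call made with requests; the service address and JSON contract are made up for illustration.
# ml_service.py – the dedicated Machine Learning microservice
import pickle
from flask import Flask, request, jsonify

ml_app = Flask(__name__)

with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@ml_app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()  # e.g. {"features": [[...], ...]}
    preds = model.predict(payload["features"]).tolist()
    return jsonify(predictions=preds)

# main_app.py – the main application forwards the form data over REST
import requests

def get_property_estimate(property_details):
    response = requests.post(
        "http://ml-service:5001/predict",  # assumed service address
        json={"features": [property_details]},
    )
    return response.json()["predictions"][0]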
3 Model Published as Data

- Pre-Trained: Yes
- On-the-Fly Predictions: Yes
In this architecture, the training process publishes the trained model to a streaming platform (e.g. Apache Kafka), and the application consumes it at runtime instead of at build time, subscribing to any subsequent model updates.
The recurring simplicity-flexibility trade-off appears here once again. Maintaining the infrastructure required for this architecture demands much more engineering sophistication; however, ML models can be updated without any applications needing to be redeployed, because the model is ingested at runtime.
To extend our property-valuation example, the application would consume the model from a dedicated topic on the designated streaming platform (e.g. Apache Kafka) – see the sketch below.
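A heavily simplified sketch using kafka-python follows; the topic name, broker address, and the assumption that the model is published as a pickled payload are all illustrative.
import pickle
from kafka import KafkaConsumer

# subscribe to the topic where the training process publishes new models
consumer = KafkaConsumer(
    "property-valuation-model",            # assumed topic name
    bootstrap_servers=["localhost:9092"],  # assumed broker address
    value_deserializer=pickle.loads,
)

model = None
for message in consumer:
    # each message carries a freshly trained model; swapping it in at runtime
    # means the application never needs to be redeployed for a model update
    model = message.value
In practice, this consumer would run on a background thread so the application can keep serving requests while waiting for model updates.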
4 Offline Predictions

- Pre-Trained: Yes
- On-the-Fly Predictions: No
This is the only asynchronous approach we will explore. Predictions are triggered and run asynchronously, either by the application or as a scheduled job. The predictions are collected and stored, and the application then uses this store to serve predictions via a user interface.
Many in industry have moved away from this architecture, but it is much more forgiving in the sense that predictions can be inspected before being returned to a user. We therefore reduce the risk of our ML system making errors, since predictions are not made on the fly.
With regard to the simplicity-flexibility trade-off, this system also compromises simplicity for more flexibility.
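A sketch of a scheduled batch job (e.g. run nightly via cron) is below; the file names are assumptions for illustration.
# batch_predict.py – illustrative offline prediction job
import pickle
import pandas as pd

with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# score the batch of new instances collected since the last run
batch = pd.read_csv("new_properties.csv")
batch["predicted_value"] = model.predict(batch)

# persist the predictions so they can be inspected before the
# application serves them through its user interface
batch.to_csv("predictions.csv", index=False)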
Wrap Up
Sometimes, merely doing some analysis, building multiple models, and evaluating them can get quite boring. If that is the case for you, then learning to put machine learning models into production could be the next step, and it’s a formidable skill to have in your toolbox. To emphasize: there is no such thing as a "best" system architecture for your model deployment; there is only the set of trade-offs that best meets your system’s requirements.
Thank You for Reading!