Machine learning model deployment generally falls into three broad categories:
- Real-time inference: typically this involves hosting a machine learning model as an endpoint on a web server. Applications can then send data over HTTPS and receive the model's predictions back within a short period of time (a minimal endpoint sketch follows this list).
- Batch inference: on a regular basis (triggered by a schedule or by events such as data landing in a data lake/data store), resources are spun up and a machine learning model is deployed to predict on the new data that is now available in the data lake/data store.
- Model deployment onto the edge: instead of requiring input data to be sent to a back end, the model's predictions are computed on edge devices. Think IoT and web applications.
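To make the real-time option concrete, here is a minimal sketch of hosting a model behind an HTTP endpoint with Flask and scikit-learn. The model file, feature layout and port are assumptions for illustration, not a prescribed setup.

```python
# Minimal real-time inference endpoint sketch.
# Assumes a pre-trained scikit-learn model saved as "model.joblib" (hypothetical).
import joblib
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")  # hypothetical model artifact

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body such as {"features": [[5.1, 3.5, 1.4, 0.2]]}
    payload = request.get_json()
    features = np.array(payload["features"])
    predictions = model.predict(features)
    return jsonify({"predictions": predictions.tolist()})

if __name__ == "__main__":
    # In production this would sit behind a proper WSGI server and HTTPS.
    app.run(host="0.0.0.0", port=8080)
```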
When deciding how to deploy machine learning model(s) into production, several factors need to be taken into consideration:
- Latency: How quickly does an application/user require the results of the model prediction?
- Data Privacy: Are there any issues or concerns with sending data to a back end?
- Network connectivity: Some deployment options require access to the internet/network. If the environment in which the model needs to be deployed has limited or no internet/network connectivity, the choice of options narrows.
- Cost: certain deployment options will be more costly than others. Think of having a server online 24/7 to serve predictions. What will it cost to operate and maintain that server?
Latency
Imagine that you have created a web/mobile application that provides a machine learning service, yielding predictions based on a user's input. It is a far better user experience if the user obtains the model's results shortly after providing their input, rather than waiting several hours or until they receive a notification that the model has finally made a prediction. In this instance, deploying the machine learning model to a real-time inference pipeline or onto the edge is the better fit.
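As a rough illustration of what "shortly after providing some input" looks like from the application side, the snippet below calls a hypothetical real-time endpoint over HTTPS and measures the round-trip time; the URL and payload are placeholders.

```python
import time
import requests

# Hypothetical real-time endpoint; replace with your own URL and payload.
ENDPOINT_URL = "https://example.com/predict"
payload = {"features": [[5.1, 3.5, 1.4, 0.2]]}

start = time.perf_counter()
response = requests.post(ENDPOINT_URL, json=payload, timeout=5)
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"Prediction: {response.json()} (round trip: {elapsed_ms:.0f} ms)")
```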
On the other hand, if a machine learning model is used to enrich data as part of a data pipeline, and the enriched data is not needed immediately (perhaps new data only arrives once a day), it may be more cost effective to score the new data using a batch inference pipeline.
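A batch inference job can be as simple as a script that runs on a schedule, scores whatever new data has landed, and writes the enriched results back. A minimal sketch follows; the paths, column names and model artifact are assumptions.

```python
import joblib
import pandas as pd

# Hypothetical locations in a data lake/data store
# (reading/writing S3 assumes s3fs and credentials are configured).
INPUT_PATH = "s3://my-data-lake/raw/new_records.parquet"
OUTPUT_PATH = "s3://my-data-lake/enriched/scored_records.parquet"

model = joblib.load("model.joblib")  # pre-trained model artifact

# Load the new batch, score it, and write the enriched data back.
df = pd.read_parquet(INPUT_PATH)
df["prediction"] = model.predict(df[["feature_1", "feature_2"]])
df.to_parquet(OUTPUT_PATH, index=False)
```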
Data Privacy
When deploying models to a real-time or batch inference pipeline, the input data needs to be transmitted to a back end, which is often a cloud environment managed by the creators of the machine learning model. Imagine sending data that could be personally identifiable information, or having concerns about which country the data will be stored in. If there are any data privacy concerns, such methods may not be feasible.
One way to work around this issue is to deploy models onto edge devices. This requires the edge device to have sufficient computing power to host the model and perform the calculations required to predict on the input data at the edge. The drawback of deploying onto edge devices is that compute resources are typically constrained; as a result, the model architecture may need to be simpler in order to save on resources, which can lead to a drop in the model's predictive power.
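As an example of running inference directly on a constrained edge device, the sketch below uses a (typically quantised) TensorFlow Lite model so the raw data never leaves the device; the model file and input shape are assumptions.

```python
import numpy as np
import tensorflow as tf

# Load a (typically quantised) TFLite model onto the device.
interpreter = tf.lite.Interpreter(model_path="model.tflite")  # hypothetical file
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Prepare a single input sample matching the model's expected shape and dtype.
sample = np.random.rand(*input_details[0]["shape"]).astype(input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], sample)

# Run inference locally -- the input data never leaves the device.
interpreter.invoke()
prediction = interpreter.get_tensor(output_details[0]["index"])
print(prediction)
```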
Network Connectivity
Deploying models to a real-time or batch inference pipeline will generally require some form of network connectivity capable of transmitting input data from the source to the back end. If the source of the input data has limited or no network connectivity, solving the business use case may mean deploying the model closer to the source of the data.
Cost
Different methods of deployment come with different costs. When deploying to a real-time inference pipeline, the endpoint needs to be available 24/7. In addition, you may need to scale compute resources up during heavy-load periods and down during low-load periods. Managing this scaling will potentially also require a team to operate the real-time inference pipeline.
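On AWS, for instance, scaling a SageMaker real-time endpoint up and down can be automated with Application Auto Scaling. The sketch below is illustrative only: the endpoint and variant names, capacity limits and target value are assumptions.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Hypothetical endpoint/variant name.
resource_id = "endpoint/my-endpoint/variant/AllTraffic"

# Register the endpoint variant's instance count as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Scale in and out based on invocations per instance (target tracking).
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```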
Batch inference pipelines are generally cheaper, as you only need to spin up resources for the time required to predict on the new batch of data. Depending on the volume of data and how quickly you need the model to score it, you can determine the amount of resources (number and type of nodes) you would like to utilise (for an example batch inference pipeline using AWS, please check out this article I wrote previously: https://towardsdatascience.com/aws-sagemaker-batch-transform-design-pattern-fa1d60618fa8).
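Taking SageMaker Batch Transform as an example, the number and type of nodes are specified when the transform job is created. A sketch, assuming an existing SageMaker model and hypothetical S3 locations:

```python
from sagemaker.transformer import Transformer

# Hypothetical model name and S3 paths.
transformer = Transformer(
    model_name="my-trained-model",
    instance_count=2,               # number of nodes
    instance_type="ml.m5.xlarge",   # type of node
    output_path="s3://my-bucket/batch-output/",
)

# Spin up resources, score the new batch of data, then tear down.
transformer.transform(
    data="s3://my-bucket/batch-input/",
    content_type="text/csv",
    split_type="Line",
)
transformer.wait()
```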
In terms of operational cost, edge deployment may be the cheapest option, as you are utilising the resources of the edge device. The costs tend to be more indirect: the cost of potentially deploying an inferior model into production that has lower predictive power than other models, the cost of managing the deployment of the model to a fleet of edge devices, and the cost of evaluating the performance of the model on the edge (how do you collect and track inferences over time on edge devices?).
Machine Learning Deployment Options in AWS
- Real time inference (see the sketch after this list): https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipeline-real-time.html
- Batch inference: https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipeline-batch.html
- Edge deployment: may require cross compilation using SageMaker Neo (https://docs.aws.amazon.com/sagemaker/latest/dg/neo-edge-devices.html), deployment using Greengrass (https://docs.aws.amazon.com/greengrass/v1/developerguide/what-is-gg.html), and/or model management using SageMaker Edge Manager (https://docs.aws.amazon.com/sagemaker/latest/dg/edge.html)
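As a rough sketch of the real-time option on AWS, the SageMaker Python SDK can deploy a packaged model as an HTTPS endpoint. The container image, model artifact, role, endpoint name and payload below are all placeholders, not a definitive recipe.

```python
from sagemaker.model import Model
from sagemaker.serializers import CSVSerializer

# Hypothetical container image, model artifact and IAM role.
model = Model(
    image_uri="<inference-container-image-uri>",
    model_data="s3://my-bucket/model/model.tar.gz",
    role="<sagemaker-execution-role-arn>",
)

# Deploy the model as a real-time HTTPS endpoint.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="my-realtime-endpoint",
    serializer=CSVSerializer(),
)

# Applications can now send requests to the endpoint.
print(predictor.predict("5.1,3.5,1.4,0.2"))
```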
Machine Learning Deployment Options in Azure
- Real time inference (see the sketch after this list): https://docs.microsoft.com/en-us/azure/machine-learning/how-to-deploy-and-where?tabs=azcli
- Batch inference: https://docs.microsoft.com/en-us/azure/machine-learning/tutorial-pipeline-batch-scoring-classification
- Edge deployment (in preview as at April 12, 2021): https://docs.microsoft.com/en-us/azure/iot-edge/tutorial-deploy-machine-learning?view=iotedge-2020-11
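And a comparable sketch of the real-time option on Azure, using the Azure ML Python SDK (v1) to deploy a registered model to Azure Container Instances. The workspace config, model name, scoring script, environment file and service name are assumptions for illustration.

```python
from azureml.core import Environment, Workspace
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

# Connect to an existing Azure ML workspace (a local config.json is assumed).
ws = Workspace.from_config()

# Hypothetical registered model, scoring script and environment definition.
model = Model(ws, name="my-registered-model")
env = Environment.from_conda_specification("inference-env", "environment.yml")
inference_config = InferenceConfig(entry_script="score.py", environment=env)

# Deploy to Azure Container Instances as a real-time web service.
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)
service = Model.deploy(ws, "my-realtime-service", [model],
                       inference_config, deployment_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)
```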