for advanced analytics systems

Modern advanced analytics systems rely extensively on Machine Learning (ML) models, which power complex business-level decision-making day in and day out. As the reality around the business changes constantly, those models need to be invalidated and retrained, if not at the same pace, then at least on a regular basis. A crucial part of this discussion is how to serve models efficiently with little or no expert intervention. In this article, we present the concepts and a few solution approaches needed to build an efficient model serving system. The design and implementation of such a system are essential for the continuous realization of year-round advanced analytics capabilities.
Packaging Machine Learning Models
In ML-based analytics systems, models are created and then stored in specific file formats. These files are later moved between platforms, loaded into memory, and used for prediction jobs. This requires an effective serialization and deserialization format that allows model files to be handled efficiently and portably. Unfortunately, the landscape of model file formats is fragmented: different frameworks use different formats. The following list provides a few examples.
- Scikit-Learn: pickle (.pkl)
- TensorFlow: protocol buffer (.pb)
- Keras: HDF5 (.h5)
- Spark ML: MLeap (.zip)
- PyTorch: PyTorch file format (.pt)
There are standardization efforts. Many Python-based ML libraries support the pickle format. Some libraries also support the Predictive Model Markup Language (.pmml) format, which has largely fallen out of favor. The Open Neural Network Exchange (.onnx) format is supported by several major libraries, but it has yet to reach wide coverage.
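As a minimal illustration of this fragmentation, the sketch below serializes a scikit-learn model with pickle. The resulting file can only be deserialized by a compatible Python/scikit-learn environment, which is exactly the portability problem the standardization efforts try to address. The file name and toy dataset are illustrative.

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a small model on a toy dataset (illustrative only).
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialize to the pickle format (.pkl) typically used with scikit-learn.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Deserialize later, possibly on another machine -- this requires a
# compatible Python and scikit-learn version, hence the portability issue.
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored.predict(X[:3]))
```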
To deal with such a fragmented space, modern model management platforms, such as MLflow, support a wide range of serialization/deserialization methods. In fact, when a model is logged in MLflow, it creates a directory containing the model file along with a YAML file named MLmodel that provides the information needed to deserialize and load the model.
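As a minimal sketch of this, assuming MLflow's scikit-learn flavor, the snippet below logs a model; the artifact directory it produces contains the serialized model together with the MLmodel YAML descriptor, and the model can be reloaded generically through the pyfunc flavor. Names and the toy dataset are illustrative.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=10).fit(X, y)

# Logging the model creates an artifact directory containing the serialized
# model plus the MLmodel YAML file that tells MLflow how to load it back.
with mlflow.start_run() as run:
    mlflow.sklearn.log_model(model, artifact_path="model")
    model_uri = f"runs:/{run.info.run_id}/model"

# Any consumer can reload the model via the generic pyfunc flavor,
# without knowing which library produced it.
loaded = mlflow.pyfunc.load_model(model_uri)
print(loaded.predict(X[:3]))
```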
Architectures for Model Serving

How models should be served depends a lot on the nature of the model application system. From an architecture design perspective, there are two types: batch and online serving. First, models can be applied to batch jobs, where a large volume of data is used to predict a large number of target values. These jobs can tolerate a certain amount of delay, which can go up to days. Second, models can be applied to online jobs, where only a small dataset is used to predict a small number of target values. These jobs can tolerate only very small delays, typically no more than a few seconds.
Figure 2 presents three different high-level architectures that tackle the model serving problem for such applications. The first two tackle batch jobs, whereas the last one tackles online applications. In the case of batch jobs, a model needs to be loaded into memory in the execution environment where the application code runs. Figure 2(a) shows a design where the model is downloaded into the memory of the application’s execution environment when the application requests it. Once the model has loaded, the prediction tasks in the application can start. This is primarily suitable for cases where the model is rather small, refreshed frequently, or both. If the model is large, loading it each time a request is made may violate the service level agreements, even relaxed ones. In such cases, baking the model into the application’s execution environment prior to any execution is desirable. That can be done by putting the model in a filesystem mounted in the execution environment as part of the model deployment process, before the scheduled run of the application. Figure 2(b) shows such a setup. Finally, Figure 2(c) presents a simplified view of an online serving architecture, where the model remains in the vicinity of the serving application, and interaction with the application is limited to dataset requests and responses.
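As a rough sketch of the difference between Figures 2(a) and 2(b), assuming a batch job written in Python: in the on-demand variant the model is fetched from remote storage when the job starts, while in the baked-in variant it is simply read from a filesystem path that the deployment process has already populated. The `download_model_bytes` helper and the paths are hypothetical placeholders.

```python
import io
import os

import joblib  # commonly used to (de)serialize scikit-learn models


def load_model_on_demand():
    """Figure 2(a): fetch the model from remote storage at request time."""
    # download_model_bytes() is a hypothetical helper that pulls the
    # serialized model from blob storage, a model registry, etc.
    raw = download_model_bytes("models/churn/latest.pkl")
    return joblib.load(io.BytesIO(raw))


def load_model_baked_in():
    """Figure 2(b): read the model from a filesystem mounted into the
    execution environment before the scheduled run."""
    model_path = os.environ.get("MODEL_PATH", "/mnt/models/churn/latest.pkl")
    return joblib.load(model_path)
```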
Implementing Model Serving System

Figure 3 presents a high-level architecture of a model serving system. REST (or other types of) APIs play an essential role in these architectures. For Python-based designs, Flask and FastAPI are very good choices. The API should be integrated with a model storage system where the (packaged) model files are managed. There are plenty of storage options, including network filesystems (Azure Files), databases (Azure SQL Database), and blob storage (Azure Blob Storage); other cloud vendors offer comparable solutions. Between the API and the storage layer, the serving system may need a caching layer; Redis is a popular database system for managing caches. In front of the API, a reverse proxy may be placed to handle requests and responses conveniently; Nginx is a well-known platform for implementing such a component. The whole solution may be deployed on a Kubernetes service, such as Azure Kubernetes Service. In such a design, the API component may be deployed using a replica set, which enables efficient load balancing.
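The sketch below, assuming FastAPI with a Redis cache and a joblib-serialized scikit-learn model, illustrates how the API and caching layers of Figure 3 might fit together; the endpoint name, cache key scheme, and model path are illustrative choices.

```python
import json
from typing import List

import joblib
import redis
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Model baked into the serving image/filesystem (illustrative path).
model = joblib.load("/mnt/models/churn/latest.pkl")

# Redis acts as the caching layer between the API and the storage layer.
cache = redis.Redis(host="localhost", port=6379, db=0)


class PredictionRequest(BaseModel):
    features: List[float]


@app.post("/predict")
def predict(request: PredictionRequest):
    # Use the feature vector itself as the cache key (illustrative choice).
    key = json.dumps(request.features)
    cached = cache.get(key)
    if cached is not None:
        return {"prediction": float(cached), "cached": True}

    prediction = float(model.predict([request.features])[0])
    cache.set(key, prediction, ex=3600)  # cache the result for one hour
    return {"prediction": prediction, "cached": False}
```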
Building a model serving system using the approaches described above is suitable for teams with solid engineering capabilities. Such capability may be available in central platform teams supporting ML product teams. The challenge is not developing such a system, which is a time-limited effort, but maintaining and improving it over time. Teams delivering business value need to prioritize value-generating tasks much higher than maintenance and improvement work on the system. For such teams, it is better to adopt a platform that takes this effort away, if not completely, then at least partially. For example, teams can reduce the overhead of the above design by integrating a model serving framework, such as Seldon-core. This choice allows sufficient customization while serving a model, e.g., A/B testing, without requiring a lot of effort. Teams with vanilla model serving needs should consider implementing the serving system using a more batteries-included Platform as a Service, such as Azure Machine Learning or MLflow hosted in Azure Databricks.
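As an illustration of how little code a framework like Seldon-core expects from the team, the sketch below follows the general shape of Seldon-core's Python language wrapper, where a plain class exposing a predict method is wrapped into a ready-made microservice. The class name and model path are illustrative, and details may vary between Seldon-core versions.

```python
import joblib


class ChurnModel:
    """A model class in the shape expected by Seldon-core's Python wrapper;
    the wrapper turns it into a REST/gRPC microservice, so the team writes
    no API, caching, or proxy code themselves."""

    def __init__(self):
        # Illustrative path; the image build would place the model here.
        self.model = joblib.load("/mnt/models/churn/latest.pkl")

    def predict(self, X, features_names=None):
        # Called by the wrapper for each prediction request.
        return self.model.predict(X)
```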
Continuous Deployment
Continuous deployment of ML models essentially means automatically putting a model into a serving environment for operational use. We avoid the discussion of how to select a model for operational use; that discussion belongs to the continuous integration of ML models. For the sake of simplicity, assume that problem is solved.
If you are using MLflow for other model management activities, such as tracking, packaging, and lifecycle changes, using MLflow serving may be the easiest way to get started. If the model needs to be served under a better SLA (e.g., cheaper or faster) or in more complex scenarios (e.g., A/B testing), consider replacing it with, or integrating, a more robust framework, such as Seldon-core.
While it is easier to put together a few scripts to set up an expert-driven approach, a more ambitious idea is to adopt a closed-loop system that uses infrastructure as code and pipelines, with little or no involvement of human experts. See Figure 4 for an overview of such a system. The core of the system is a continuous deployment pipeline, represented by the green boxes. The pipeline retrieves the chosen model in its serialized form from the model registry, builds a (Docker) container image, pushes the image to a container registry, and updates the serving microservice with the new image. Such a pipeline can be developed as code using workflow management platforms, such as Airflow or Kubeflow, or DevOps platforms, such as Azure Pipelines. The pipeline can be triggered by listening for events that mark that a model has been trained and selected for use. The training and subsequent selection of the model can be triggered when the performance of the model degrades or when the model has expired. In the former case, a monitoring system can generate such alerts; in the latter case, a human expert can initiate the process following a schedule.
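A minimal sketch of such a pipeline as code, assuming Airflow and a Kubernetes deployment named churn-model behind the serving microservice; the model URI, registry and image names, and shell commands are illustrative placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Triggered by an external event (e.g., a monitoring alert or an expiry
# schedule) that marks a newly selected model version.
with DAG(
    dag_id="model_continuous_deployment",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # run only when triggered
    catchup=False,
) as dag:

    # 1. Retrieve the chosen, serialized model from the model registry.
    fetch_model = BashOperator(
        task_id="fetch_model",
        bash_command=(
            "mlflow artifacts download "
            "--artifact-uri {{ dag_run.conf['model_uri'] }} --dst-path ./model"
        ),
    )

    # 2. Build the serving container image and push it to a container registry.
    build_and_push = BashOperator(
        task_id="build_and_push_image",
        bash_command=(
            "docker build -t myregistry.azurecr.io/churn-model:{{ ds_nodash }} . && "
            "docker push myregistry.azurecr.io/churn-model:{{ ds_nodash }}"
        ),
    )

    # 3. Update the serving microservice to use the new image.
    update_service = BashOperator(
        task_id="update_serving_microservice",
        bash_command=(
            "kubectl set image deployment/churn-model "
            "churn-model=myregistry.azurecr.io/churn-model:{{ ds_nodash }}"
        ),
    )

    fetch_model >> build_and_push >> update_service
```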

Robust Continuous Deployment
For high availability, more robust deployment strategies can be considered. Taking inspiration from web service deployment, the following strategies can be adopted:
- Blue-Green deployment: The new deployment (blue) is deployed in parallel to the old deployment (green), with both sharing identical setups. A limited share of the traffic is routed to the blue deployment. Once the blue deployment achieves an acceptable SLA under this limited load, the green deployment is retired and the blue deployment becomes the new green one.
- Canary deployment: The blue and green deployments run in parallel, but the traffic share to the blue one progressively increases over time based on predefined constraints. Such a deployment can roll back to the old model for the full share of traffic if those constraints are violated.
A Kubernetes service together with a Seldon-core microservice is a recommended solution. Adopting such a platform is easier than implementing the full functionality ourselves.
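Conceptually, the progressive traffic shift in a canary deployment boils down to weighted routing between the old and new model services, which platforms like Seldon-core implement for you. The toy sketch below, with hypothetical endpoint URLs, only illustrates the idea.

```python
import random

import requests

OLD_MODEL_URL = "http://green-model:8000/predict"  # current (green) deployment
NEW_MODEL_URL = "http://blue-model:8000/predict"   # candidate (blue) deployment


def route_prediction(payload: dict, canary_share: float) -> dict:
    """Send a share of traffic to the new deployment; the share is increased
    step by step (e.g., 0.05 -> 0.25 -> 1.0) while SLA constraints hold, and
    dropped back to 0.0 to roll back if they are violated."""
    url = NEW_MODEL_URL if random.random() < canary_share else OLD_MODEL_URL
    response = requests.post(url, json=payload, timeout=2)
    return response.json()
```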
Remarks
All the building blocks needed to implement a closed-loop continuous deployment system are there for us to grab. Platform services that bring all of these together for an easy realization of continuous deployment of ML models are still in their early days. However, public cloud platforms, such as Google Cloud, AWS, and Azure, as well as ML/Big Data platform services, such as Databricks and Neptune AI, are making significant strides toward realizing this vision. Be sure to check the releases from these platforms on a regular basis to see how far they have come.
It is also interesting to see which of the available choices you prefer and would consider in your advanced analytics journey. Let us know your experience and views on this matter.