The Most Fundamental Layer of MLOps — Required Infrastructure

Getting the infrastructure right for implementing MLOps solutions

YUNNA WEI
Towards Data Science


In my previous post, I discussed the three key components of building an end-to-end MLOps solution: data and feature engineering pipelines, ML model training and retraining pipelines, and ML model serving pipelines. You can find the article here: Learn the core of MLOPS — Building ML Pipelines. At the end of that post, I briefly noted that the complexity of MLOps solutions can vary significantly from one project to another, depending on the nature of the ML project and, more importantly, on the underlying infrastructure required.

Therefore, in today's post, I will explain how the level of infrastructure required determines the complexity of an MLOps solution, and categorize MLOps solutions into different levels accordingly.

More importantly, in my view, categorizing MLOps into different levels makes it easier for organizations of any size to adopt MLOps. The reason is that not every level of MLOps requires large-scale online inference infrastructure like Kubernetes, parallel and distributed data processing frameworks like Apache Spark, or low-latency streaming data pipeline solutions like Spark Structured Streaming and Apache Flink. Organizations with small-scale data sets and batch-inference ML projects therefore do not need to recruit people with these specialized skills or set up complex storage and compute infrastructure; they can still do MLOps properly with their existing skill sets and much simpler infrastructure.

For each level, I will share reference architectures and implementation guidance in future blogs. Please feel free to follow me on Medium if you want to be notified when these new blogs are published.

First, let’s talk about the infrastructure required to run an end-to-end MLOps solution, for each of the 3 key components:

  • Data and Feature Engineering Pipeline;
  • ML Model Training Pipeline;
  • ML Model Inference Pipeline;

I will cover the potential infrastructure requirements for each component, and then categorize MLOps solutions into different levels based on the required infrastructure.


Infrastructure Required for Data and Feature Engineering Pipelines

Depending on the data volume and data latency, the infrastructure required to run data and feature engineering pipelines is as follows:

  • Level 1 — When the data volume can be handled by a single machine and the data latency is at batch frequency, the required infrastructure can be as simple as a local laptop or a virtual machine on the public cloud. Additionally, you can leverage cloud platform-as-a-service (PaaS) offerings such as AWS Batch, AWS Lambda, or Azure Functions to simplify infrastructure management even further. A minimal code sketch of this level follows the list below.
  • Level 2 — When the data volume cannot be handled by a single machine and requires parallel and distributed data processing, but the data latency can still remain at batch frequency, the required infrastructure needs to go beyond a single machine to a compute cluster, in order to install and manage distributed computing frameworks like Apache Spark. Apache Spark is an open-source solution, so organizations can run their own compute clusters and use open-source Spark to manage their data and feature engineering pipelines. However, most still choose a managed service, such as Databricks, as the underlying data infrastructure for large-scale data and feature engineering workloads. Public cloud providers also have service offerings for Spark, such as AWS EMR and GCP Dataproc.
  • Level 3 — In the first two scenarios, the data latency remains at batch level. However, when the data latency needs to be very low, a quite different set of infrastructure is required: at a minimum, an event-driven message queue and a streaming engine. To achieve much lower latency, a message queue service that captures streaming data on the fly, instead of persisting it to a storage system first, is generally required. For message queue services, there are open-source solutions, such as Apache Kafka, as well as commercial managed services, like Azure Event Hubs, AWS Kinesis Data Streams, and Confluent. Beyond a message queue service, a robust streaming engine is also necessary in order to achieve low latency for downstream data consumption. Open-source streaming engines include Apache Spark Structured Streaming, Apache Flink, and Apache Beam. Of course, there are also commercial offerings for the streaming engine, such as Databricks, AWS Kinesis Data Analytics, and GCP Dataflow.
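
To make Level 1 concrete, below is a minimal sketch of a batch feature engineering job that can run on a laptop, a single virtual machine, or a serverless service such as AWS Lambda. The file names, column names, and feature logic are hypothetical and only illustrate the pattern.

```python
# A minimal Level 1 sketch: a batch feature engineering job that fits on a
# single machine. File names, columns, and feature logic are hypothetical.
import pandas as pd

def build_features(raw_path: str, output_path: str) -> pd.DataFrame:
    """Read raw transaction data, derive simple features, and persist them."""
    raw = pd.read_csv(raw_path, parse_dates=["transaction_date"])

    # Example feature: aggregate spend per customer over the batch window.
    features = (
        raw.groupby("customer_id")
           .agg(total_spend=("amount", "sum"),
                txn_count=("amount", "count"),
                last_txn=("transaction_date", "max"))
           .reset_index()
    )

    # Persist features so the training and inference pipelines can reuse them.
    features.to_parquet(output_path, index=False)
    return features

if __name__ == "__main__":
    build_features("raw_transactions.csv", "customer_features.parquet")
```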

As you can see, the infrastructure needed to run data and feature engineering pipelines can vary significantly depending on data volumes and data latency requirements. The same is true for both the ML model training pipeline and the ML model inference pipeline. This is why it is critical to be clear about the infrastructure level, to avoid the impression (or misconception) that MLOps is always daunting and complex. MLOps can also be quite straightforward at certain levels, which I will explain later in this blog. Now let's continue with the infrastructure required to run ML model training pipelines.

Infrastructure Required for ML Model Training Pipelines

Depending on the training data size and the required time (SLA) to have a trained model ready for use in a production environment, the infrastructure for model training can be divided as follows:

  • Level 1 — When the training data fits in the memory of a single machine and the total training time does not exceed the SLA required for the production environment, a single machine for model training is sufficient. Depending on the format of the training data, a GPU machine may be required. For example, if your training data is structured and numeric, a CPU machine is generally enough; however, if your training data is unstructured, like images, the preferred training infrastructure is a GPU machine.
  • Level 2 — When the training data is too big to fit in the memory of a single machine, or when it fits but training takes longer than the required SLA, companies need to spin up training clusters to do parallel and distributed ML model training across multiple nodes. However, running distributed training on multiple nodes introduces a host of new complexities, like scheduling tasks across multiple machines, transferring data efficiently, and recovering from machine failures. Fortunately, there are open-source libraries that handle these extra complexities and keep training jobs relatively simple for data scientists, even when the jobs need to be distributed. These include Ray, for scaling Python ML workloads; Horovod, a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet; and Dask, for scaling Python libraries, including Python ML libraries. A sketch of this fan-out pattern with Ray follows this list.

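To illustrate the Level 2 fan-out pattern mentioned above, here is a minimal sketch using Ray's task API to distribute independent training runs (for example, one per hyperparameter setting) across a cluster. It is a simplified illustration rather than a full distributed training setup; distributing a single large training job is what frameworks such as Ray Train or Horovod provide on top of this idea. The dataset and model choices are arbitrary placeholders.

```python
# A minimal sketch of fanning training work out across a Ray cluster.
# Each remote task trains and evaluates one candidate model independently.
import ray
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

ray.init()  # starts Ray locally; on a real cluster you would pass the cluster address

@ray.remote
def train_and_score(n_estimators: int) -> tuple[int, float]:
    """Train one candidate model and return its cross-validated accuracy."""
    X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    score = cross_val_score(model, X, y, cv=3).mean()
    return n_estimators, score

# Each call becomes a task that Ray schedules onto available cluster nodes.
futures = [train_and_score.remote(n) for n in (50, 100, 200, 400)]
results = ray.get(futures)
print(max(results, key=lambda r: r[1]))
```
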
I am going to publish a separate blog on distributed training. Please feel free to follow me if you want to be notified when it is published.

As is well known, ML is an extremely dynamic field. To run a model training job, data scientists need to install quite a few open-source libraries, including Pandas, NumPy, Matplotlib, Seaborn, Plotly, scikit-learn, TensorFlow, Keras, PyTorch, MLflow, and so on. Therefore, most public cloud vendors and specialized data+AI vendors (like Databricks) provide pre-configured ML runtimes that include all these popular ML libraries, saving data scientists substantial time installing and maintaining them. As a result, most organizations build their ML training infrastructure by leveraging cloud services. Popular ML services on the cloud include AWS SageMaker, Azure Machine Learning Workspace, GCP Vertex AI, and the Databricks Machine Learning Runtime.
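
As a small illustration of how these libraries fit together in a training pipeline, here is a minimal sketch of a single-machine training job that logs parameters, metrics, and the trained model with MLflow and scikit-learn. The dataset and hyperparameters are arbitrary placeholders; on a managed ML runtime such as those listed above, these libraries typically come pre-installed.

```python
# A minimal sketch of a single-machine training run tracked with MLflow.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 3}
    model = GradientBoostingRegressor(**params).fit(X_train, y_train)

    # Log the hyperparameters and an evaluation metric for this run.
    mlflow.log_params(params)
    mlflow.log_metric("test_mse", mean_squared_error(y_test, model.predict(X_test)))

    # Log the model artifact so the inference pipelines can load it later.
    mlflow.sklearn.log_model(model, artifact_path="model")
```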

Infrastructure Required for ML Model Inference Pipelines

Depending on the model inference frequency and the volume of inference requests, the infrastructure required to run ML model inference pipelines is as follows:

  • Level 1 — When the model inference frequency is batch and the inference data volume can be handled by a single machine, the trained model can be loaded onto a single machine for batch predictions by calling its predict function on the data, which is generally stored as a Pandas data frame;
  • Level 2 — When the model inference frequency is batch but the data volume cannot be handled by a single machine, there is a need to set up a cluster and leverage distributed computing frameworks, like Apache Spark. For example, a trained ML model can be loaded as a Spark User Defined Function (UDF) and the UDF applied to a Spark data frame for parallel model predictions (a sketch of this pattern follows this list).
  • Level 3 — When the model inference needs to be low-latency and the data volume is quite large, streaming inference becomes necessary. Similar to Level 2, a compute cluster is required; additionally, a streaming engine is needed for model predictions in order to meet the low-latency requirement. In this case, the popular streaming engines are Apache Spark Structured Streaming and Apache Flink.
  • Level 4 — When the model inference is online, which means the model is generally packaged as a REST API endpoint, but the API request volume is small enough to be handled by a single machine, the required infrastructure is generally a single-node CPU virtual machine on the cloud. Public cloud providers and data/AI vendors all have managed services for this type of model serving. For example, Databricks offers serverless model serving endpoints, where customers do not have to worry about setting up serving infrastructure; all they need to do is instantiate a model endpoint. Others have similar offerings.
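
Here is a minimal sketch of the Level 2 pattern referenced above: loading a trained model as a Spark UDF with MLflow and applying it to a Spark data frame for distributed batch scoring. The model URI and storage paths are hypothetical placeholders and assume a model has already been logged to MLflow.

```python
# A minimal sketch of distributed batch inference with a Spark UDF.
# The model URI and storage paths below are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import struct
import mlflow.pyfunc

spark = SparkSession.builder.appName("batch-inference").getOrCreate()

# Wrap a previously logged MLflow model as a Spark UDF for parallel scoring.
predict_udf = mlflow.pyfunc.spark_udf(
    spark, model_uri="models:/churn_model/Production"  # hypothetical model URI
)

features = spark.read.parquet("s3://my-bucket/customer_features/")  # hypothetical path
scored = features.withColumn(
    "prediction",
    predict_udf(struct(*features.columns))  # the model is applied partition by partition
)
scored.write.mode("overwrite").parquet("s3://my-bucket/predictions/")
```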

One note before we get to Level 5: online inference is different from streaming inference, where a trained ML model is still loaded as a Python function instead of being served as a REST endpoint. Both are low-latency, but online inference is expected to be real-time.

  • Level 5 — When the model inference is online and the API request volume is large scale (meaning the queries-per-second (QPS) is overwhelmingly large for one single endpoint), there is a need to set up cluster infrastructure like Kubernetes for distributed inference. The popular approach is to package the trained model as a container image and register it in a container registry, like AWS Elastic Container Registry, Azure Container Registry, or GCP Container Registry. These registered model images are then pulled and deployed onto Kubernetes for large-scale, distributed model inference. Each public cloud has its own offering for a managed Kubernetes service. A sketch of serving a model behind a REST endpoint, relevant to both Levels 4 and 5, follows below.
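
To close the inference section, here is a minimal sketch of online inference with FastAPI, one common way (among many) to package a model as a REST endpoint. For Level 4, this app can run on a single virtual machine; for Level 5, the same app would be built into a container image and deployed to Kubernetes. The model file and feature schema are hypothetical placeholders.

```python
# A minimal sketch of an online inference endpoint with FastAPI.
# The model file and feature schema are hypothetical placeholders.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # loaded once at startup and reused for every request

class Features(BaseModel):
    total_spend: float
    txn_count: int

@app.post("/predict")
def predict(features: Features) -> dict:
    # A scikit-learn style model expects a 2D array of feature rows.
    prediction = model.predict([[features.total_spend, features.txn_count]])
    return {"prediction": float(prediction[0])}

# Run locally with: uvicorn app:app --host 0.0.0.0 --port 8000
# (assuming this file is saved as app.py)
```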

Conclusion

So far, we have covered the different levels of infrastructure required for each of the three key pipelines of a complete end-to-end MLOps solution. It is clear that infrastructure complexity varies a lot between levels.

In the next blog, I will categorize MLOps into different levels based on infrastructure complexity and implementation patterns. For each level, I will also share reference architectures and code samples, which will cover other pieces of an MLOps solution, such as orchestration, model versioning, data versioning, drift detection, data quality checks, and monitoring.

I hope you have enjoyed reading this blog. Please feel free to follow me on Medium if you want to be notified when there are new blogs published.

If you want to see more guides, deep dives, and insights around modern and efficient data+AI stack, please subscribe to my free newsletter — Efficient Data+AI Stack, thanks!

Note: Just in case you haven’t become a Medium member yet and want to get unlimited access to Medium, you can sign up using my referral link! I’ll get a small commission at no cost to you. Thanks so much for your support!

