From Data Platform to ML Platform

How Data/ML platforms evolve and support complex MLOps practices

ming gao
Towards Data Science

--

Data/ML has been one of the most popular topics in our tech landscape. I want to share my understanding of Data/ML platforms and how those platforms evolve from basic to complex. Finally, I will try my best to cover MLOps, a set of principles for managing ML projects.

For more about who I am, here is my LinkedIn.

Start of the Journey: Online Service + OLTP + OLAP

At the start, data infrastructure can be fairly simple: analytical queries might be sent to a read replica of an online OLTP database, or an OLAP database might be set up to serve as a data warehouse.

Here is what the infrastructure might look like:

Image by the author

There is nothing wrong with these systems as long as they fulfil business requirements. All systems that fulfil our business needs are good systems. If they are simple, that is even better.

At this stage, there are multiple ways of doing data analysis:

  1. Simply submit queries to the OLTP database’s replica node (not recommended).
  2. Enable CDC (Change Data Capture) on the OLTP database and ingest the change logs into the OLAP database. When it comes to the ingestion service for CDC logs, you can choose based on the OLAP database you have selected. For example, Flink data streaming with CDC connectors is one way to handle this, and many enterprise services come with their own suggested solutions, e.g. Snowpipe for Snowflake. It is also recommended to load data from the replica node to preserve the CPU/IO bandwidth of the master node for online traffic.
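
For illustration, here is a minimal PyFlink sketch of such a CDC ingestion path. It assumes the MySQL CDC and JDBC connector jars are on the Flink classpath; the hostnames, credentials and table schemas below are placeholders, not a reference setup.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: the OLTP table's change log, read from the replica node.
t_env.execute_sql("""
    CREATE TABLE orders_cdc (
        order_id BIGINT,
        amount DECIMAL(10, 2),
        PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH (
        'connector' = 'mysql-cdc',
        'hostname' = 'oltp-replica.internal',
        'port' = '3306',
        'username' = 'cdc_reader',
        'password' = '***',
        'database-name' = 'shop',
        'table-name' = 'orders'
    )
""")

# Sink: a table in the OLAP database, reachable over JDBC.
t_env.execute_sql("""
    CREATE TABLE orders_olap (
        order_id BIGINT,
        amount DECIMAL(10, 2),
        PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:postgresql://olap.internal:5432/dw',
        'table-name' = 'orders'
    )
""")

# Continuously replicate changes from the OLTP replica into the OLAP table.
t_env.execute_sql("INSERT INTO orders_olap SELECT * FROM orders_cdc")
```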

At this stage, ML workloads might still be running in your local environment. You can set up a Jupyter notebook locally, load structured data from the OLAP database, and then train your ML model locally.
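
A hedged, minimal sketch of that notebook workflow, assuming the OLAP database has a SQLAlchemy dialect; the connection string, table and columns are made up for illustration.

```python
import pandas as pd
from sqlalchemy import create_engine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical OLAP endpoint; replace with your own connection string.
engine = create_engine("postgresql://analyst:***@olap.internal:5432/dw")

# Load structured features straight out of the OLAP database.
df = pd.read_sql("SELECT age, total_spend, churned FROM user_features", engine)

X_train, X_test, y_train, y_test = train_test_split(
    df[["age", "total_spend"]], df["churned"], test_size=0.2
)

# Train and evaluate entirely on the local machine.
model = RandomForestClassifier().fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```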

The potential challenges of this architecture include, but are not limited to:

  • It is hard to manage unstructured or semi-structured data with an OLAP database.
  • OLAP databases can suffer performance regressions in massive data processing (more than a terabyte of data for a single ETL task).
  • Lack of support for various compute engines, e.g. Spark or Presto. Most compute engines do support connecting to an OLAP database through a JDBC endpoint, but parallel processing will be badly limited by the IO bottleneck of the JDBC endpoint itself.
  • The cost of storing massive data in an OLAP database is high.

You might already know the direction to take: build a data lake! Bringing in a data lake does not necessarily mean you need to completely sunset the OLAP database. It is still common to see companies keep the two systems co-existing for different use cases.

Data Lake: Storage-Compute Separation + Schema on Read

A data lake allows you to persist unstructured and semi-structured data and perform schema-on-read. It lets you reduce cost by storing large data volumes in a specialised storage solution and spinning up compute clusters on demand. It further allows you to manage TB/PB-scale datasets effortlessly by scaling up the compute clusters.

Here is how your infrastructure might look:

Image by the author

This is an oversimplified diagram, of course. The actual implementation of a data lake can be much more complicated.

Many cloud providers now have well-established storage solutions for data lakes, e.g. AWS S3 and Azure ADLS. There are still a lot of tasks to be done on top of those storage solutions. For example, there should be a Hive metastore to manage your table metadata and a DataHub to provide data visibility. There are also challenging topics like fine-grained permission control in the data lake and data lineage analysis (e.g. Spline).

To maximise the value and efficiency of your data lake, you should carefully choose the file format and average file size for each layer of your data lake.

Image by the author

The general tips are:

  • Avoid small files: small files are one of the major causes of high storage cost and poor performance in a data lake (see the compaction sketch after this list).
  • Balance latency, compression ratio and performance: a low-latency data lake table with a file format like Hudi might not give you the best compression ratio, while large ORC files with a high compression ratio might give you a performance nightmare. Choose the file format wisely based on the table’s usage pattern, latency requirements and table size.
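
As a concrete example of the first tip, here is a hedged PySpark compaction sketch; the bucket paths are placeholders, and the partition count would normally be derived from table size divided by your target file size rather than hard-coded.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-events").getOrCreate()

# Read a partition full of small files...
df = spark.read.parquet("s3://datalake/silver/events/dt=2023-09-01/")

# ...and rewrite it as fewer, larger files. 64 is a placeholder; aim for
# output files around your target size (e.g. ~128 MB each).
(
    df.repartition(64)
      .write.mode("overwrite")
      .parquet("s3://datalake/silver/events_compacted/dt=2023-09-01/")
)
```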

There are some well-established SaaS/PaaS providers like Databricks which offer decent data lake (or lakehouse, nowadays) solutions. You can also explore ByteHouse for a unified big data analysis experience.

On the ML side, teams might start exploring well-established ML frameworks like TensorFlow and PyTorch in remote environments. Furthermore, trained ML models can be deployed to the production environment for online model inference. Both TensorFlow and PyTorch come with serving solutions, e.g. TensorFlow Serving and TorchServe.
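
As a quick illustration of online inference, here is a minimal client call against a TensorFlow Serving REST endpoint; the host, model name and input shape are placeholders.

```python
import requests

# TensorFlow Serving exposes REST predictions at /v1/models/<name>:predict.
resp = requests.post(
    "http://model-serving.internal:8501/v1/models/churn_model:predict",
    json={"instances": [[35.0, 420.5], [51.0, 88.0]]},  # one list per input row
    timeout=2.0,
)
resp.raise_for_status()
print(resp.json()["predictions"])  # e.g. [[0.12], [0.87]]
```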

However, our journey does not stop here. We might now face the following challenges:

  • Lack of real-time metric and feature management, which is critical for online ML model serving.
  • Lack of model performance monitoring.

Let’s level up our game further.

Real-time Data/ML Infra: Data River + Data Streaming + Feature Store + Metric Server

Building real-time data infrastructure is usually a joint effort across multiple departments of a company. The initial rationale for building a data river is usually not the data/ML system, but to let micro-services scale further by removing synchronous calls. Instead, micro-services gain efficiency by communicating through a message broker like Kafka (at the cost of a lower consistency level).

The overall architecture might look like this.

Image by the author

With data available in the data river (e.g. Kafka), we can build data streaming pipelines to process real-time data. That data can be used directly in the online feature store or synced to a metric server like Pinot. The metric server can further process/aggregate those metric points into more useful model performance metrics and business metrics. You can also adopt a streaming database like RisingWave, which can join/aggregate streaming data with SQL syntax.
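
As a sketch of the streaming-SQL option: RisingWave speaks the PostgreSQL wire protocol, so a plain Postgres client works. The topic, broker and schema below are placeholders, and the exact source syntax may differ across RisingWave versions.

```python
import psycopg2

# RisingWave's default port is 4566; it accepts standard Postgres clients.
conn = psycopg2.connect(host="risingwave.internal", port=4566,
                        user="root", dbname="dev")
conn.autocommit = True
cur = conn.cursor()

# Ingest click events from the data river (Kafka).
cur.execute("""
    CREATE SOURCE IF NOT EXISTS clicks (user_id BIGINT, ts TIMESTAMP)
    WITH (
        connector = 'kafka',
        topic = 'clicks',
        properties.bootstrap.server = 'kafka.internal:9092'
    ) FORMAT PLAIN ENCODE JSON
""")

# A continuously maintained per-user click count over the last hour.
cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS clicks_1h AS
    SELECT user_id, COUNT(*) AS clicks
    FROM clicks
    WHERE ts > NOW() - INTERVAL '1 hour'
    GROUP BY user_id
""")
```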

For building the data streaming pipelines themselves, Flink is quite popular. You can also use Flink with CDC connectors to extract data from the OLTP database and sink it to message brokers and the data lake.

There should be an online feature store backed by a key-value database like ScyllaDB or AWS DynamoDB. The online feature store helps you enrich the request sent to the model serving service with a feature vector associated with a certain reference ID (user ID, product UUID), as sketched after the diagram below. It greatly decouples the backend service team, who build micro-services, from the ML engineering team, who build ML models. It allows ML engineers to roll out new ML features with new ML models independently (the model serving API signature exposed to micro-services remains the same when you update the feature vector).

Image by the author
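
Here is a minimal sketch of that enrichment step, assuming a DynamoDB-backed feature store; the table name, key schema and feature layout are hypothetical.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
feature_table = dynamodb.Table("online-user-features")  # hypothetical table

def enrich_request(user_id: str, request: dict) -> dict:
    """Attach the user's feature vector so callers only need a reference ID."""
    item = feature_table.get_item(Key={"user_id": user_id}).get("Item", {})
    request["feature_vector"] = item.get("feature_vector", [])
    return request

# The micro-service passes only the reference ID; features stay an ML concern.
payload = enrich_request("user-42", {"model": "churn_model"})
```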

The book Designing Machine Learning Systems discusses model stacking (see also Jen Wadkin’s Medium post about model stacking). It is quite common to use model stacking in model serving as well. An orchestrator is required when you want to stack heterogeneous models together, e.g. stacking PyTorch and TensorFlow models. You can potentially make your orchestrator even more sophisticated by applying dynamic weights based on model performance when routing requests to different models.
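
Here is a hedged sketch of such an orchestrator, blending two heterogeneous serving endpoints with weights that could be refreshed from the metric server; the URLs are placeholders, and in practice each serving framework has its own request/response schema rather than the uniform one assumed here.

```python
import requests

# Weights could be updated dynamically based on live model performance.
MODELS = [
    {"url": "http://torch-model.internal:8080/predictions/churn", "weight": 0.6},
    {"url": "http://tf-model.internal:8501/v1/models/churn:predict", "weight": 0.4},
]

def stacked_predict(instance: list) -> float:
    """Blend predictions from all stacked models by their current weights."""
    blended, total_weight = 0.0, 0.0
    for model in MODELS:
        resp = requests.post(model["url"], json={"instances": [instance]}, timeout=2.0)
        resp.raise_for_status()
        score = resp.json()["predictions"][0]  # assumes one scalar per instance
        blended += model["weight"] * score
        total_weight += model["weight"]
    return blended / total_weight

print(stacked_predict([35.0, 420.5]))
```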

Now we have a complicated system. It looks pretty cool, but it carries new challenges:

  • The system’s technical debt will soar if it is left unmanaged.
  • High cognitive load for ML engineers.

That’s probably when you need to think about how MLOps can help you.

MLOps: Abstraction, Observability and Scalability

MLOps is not a specific solution. It is more like a set of principles for managing ML systems. Unlike a typical software project, ML systems are greatly affected by data shift, and data dependency management is not an easy task. The paper Hidden Technical Debt in Machine Learning Systems describes those challenges in detail. Therefore, an MLOps-driven ML platform must provide:

  • Data change monitoring and data quality monitoring (a tiny drift-check sketch follows this list).
  • ML feature management across offline and online environments.
  • Reproducible ML pipelines that fulfil experimental-operational symmetry.
  • Concise ML pipeline configuration that abstracts away infrastructure details.
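
To make the first point concrete, here is a tiny drift-check sketch using the population stability index (PSI); the 0.2 alert threshold is a common rule of thumb, not part of any standard.

```python
import numpy as np

def psi(baseline: np.ndarray, live: np.ndarray, bins: int = 10) -> float:
    """Larger PSI means the live distribution has drifted from the baseline."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    live = np.clip(live, edges[0], edges[-1])  # fold outliers into edge bins
    b_pct = np.histogram(baseline, bins=edges)[0] / len(baseline) + 1e-6
    l_pct = np.histogram(live, bins=edges)[0] / len(live) + 1e-6
    return float(np.sum((l_pct - b_pct) * np.log(l_pct / b_pct)))

baseline = np.random.normal(0.0, 1.0, 10_000)  # feature at training time
live = np.random.normal(0.5, 1.0, 10_000)      # same feature in production
if psi(baseline, live) > 0.2:
    print("alert: feature drift detected")
```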

The article MLOps: Continuous delivery and automation pipelines in machine learning highlights the importance of experimental-operational symmetry. It also describes MLOps automation levels, from level 0 through level 1 to level 2. I really like the graph from that doc and will borrow it to explain what level-1 MLOps looks like.

To scale such MLOps practices in your organisation, you need to provide concise ML pipeline configuration that abstracts infrastructure implementation details away from ML engineers. By doing this, platform engineers also gain the flexibility to upgrade the ML platform without causing too much disruption to platform users. You can consider using configuration files like YAML to describe ML pipelines and rely on your ML pipeline controllers to translate them into actual workloads.
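
Here is a minimal sketch of what such a configuration and its controller-side handling might look like; the schema (stages, resources, quality gate) is invented for illustration and does not represent any real platform's format.

```python
import yaml

# A concise, infra-free pipeline spec an ML engineer might own.
PIPELINE_SPEC = """
name: churn-training
schedule: "0 2 * * *"
stages:
  - name: extract-features
    sql: sql/churn_features.sql
  - name: train
    framework: pytorch
    resources: {gpus: 1, memory: 16Gi}
  - name: validate
    min_auc: 0.85   # quality gate before promotion
"""

spec = yaml.safe_load(PIPELINE_SPEC)

# A real controller would reconcile each stage into actual workloads
# (Spark jobs, training pods, validation checks); here we just walk the spec.
for stage in spec["stages"]:
    print(f"would schedule stage '{stage['name']}' of pipeline '{spec['name']}'")
```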

So let’s reorganise the real-time data/ML infrastructure in the following diagram to highlight how MLOps shapes our platforms.

Image by the author

To give you a better idea of what the ML pipelines might look like, here are possible abstraction examples for each stage of an ML pipeline. The following graph is only meant to help you understand what the configuration might look like. It does not represent any actual implementation, nor does it cover all required aspects.

A general idea of the configurations in an ML pipeline. Image by the author

Kubernetes is a popular solution for orchestrating ML workloads (or perhaps all workloads nowadays). You can use CRDs to provide concise interfaces between users and the platform. In the article My thinking of Kubebuilder, I shared some of my thinking from building CRDs with kubebuilder; a sketch of a hypothetical custom resource follows the diagram below.

Image by the author
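
To make the CRD idea concrete, here is a hedged sketch of submitting a hypothetical MLPipeline custom resource with the official Kubernetes Python client; the group, kind and spec fields are invented, and the CRD itself would be defined by the platform team (e.g. scaffolded with kubebuilder).

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster

# A hypothetical custom resource; the platform's controller reconciles it
# into actual training workloads.
pipeline = {
    "apiVersion": "mlplatform.example.com/v1alpha1",
    "kind": "MLPipeline",
    "metadata": {"name": "churn-training", "namespace": "ml-team"},
    "spec": {
        "schedule": "0 2 * * *",
        "stages": [{"name": "train", "framework": "pytorch", "gpus": 1}],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="mlplatform.example.com",
    version="v1alpha1",
    namespace="ml-team",
    plural="mlpipelines",
    body=pipeline,
)
```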

Of course, I didn’t cover many important sub-topics, including but not limited to:

  • Hyperparameter Optimization
  • Distributed training architectures

What Next

You can see that MLOps only gives a known mission a proper name. It is far from a finished job. What I shared is an opinionated strategy for implementing an MLOps platform. Even with that, the bar for creating high-quality ML products is still high, and the effort of collecting, processing and mining data is still heavy.

Besides those remaining challenges, I also want to share the trends I have observed in the ML landscape. It is surely not a complete list, given how fast this domain evolves.

  • Serverless: We have pushed ML’s value too far out of reach because the foundation of an ML platform is usually a data platform. It is like forcing users to buy computers to engage with social media platforms when we are already in the mobile era. Serverless data services and data engines are addressing this challenge. Many service providers are exploring their own serverless solutions to lower the bar of adoption, e.g. Databricks, Snowflake, ByteHouse. Companies can start building their ML products right after bootstrapping data warehouses, data lakes, or lakehouses.
  • AI-driven feature engineering: Well, AI can do everything now, can’t it?
  • MaaS trends: More powerful Model-as-a-Service offerings will pop up. Companies will be able to leverage ML power directly, without even building their own ML services, and enjoy a great lift to their business.

As we have all noticed, the ML space evolves fast. At this very moment, as I type this, this article might already be outdated; more ideas have already popped up and been translated into reality. Please do let me know what you think about MLOps, or where I should further my learning. Let’s keep up the pace together!
