
Why does your Machine Learning Ops feel inadequate?

Causes of fragmentation in machine learning processes, and why many practices are still bespoke.

Opinion

Every time I design a new microservice, there is a TODO checklist that I check against. An early item in this list is about setting up CI/CD for the new service. I think about what the artifacts are and how they get formed at various stages. I consider how it eventually gets deployed to production and how it gets monitored post-deployment.

This situation is not unique to me. Modern software engineering practices implement well-thought-out CI/CD principles as the norm. However, when I started the MLOps journey for our company, I realized that CI/CD of the kind practiced in software engineering is not a standard in data science practice.

MLOps is very bespoke (Photo by Jeswin Thomas on Unsplash)

It was almost hard to believe, because I thought human-error-induced model quality problems were pervasive in the industry. Is it because there aren’t enough quality and automation considerations going into a typical ML process? Or does it need an entirely new style of ops toolkit that hasn’t truly been figured out yet? As I discovered later, it turned out to be a combination of both, and of several other factors.

Here is a summary of those findings.

  • Many data science operations are plainly immature – We can’t ignore the obvious. Several data science operations aren’t yet at a scale where the necessary thought goes into Ops. A company I encountered sends every new model from the data scientist’s local Jupyter notebook over Slack to the data engineering team, who copy it into their production environment. While this is an incredibly naive practice that overlooks several quality aspects, the rest of this company’s software operations were solid. This practice tells me how overlooked MLOps really is.
  • New models need breathing room before getting bogged down with MLOps – A new ML model is often built to handle an entirely new problem that requires new frameworks around inference to run in production. The building process goes through a lot of experimentation and evaluation, and only upon demonstrating the business success of the model is an inference framework built around it. As a result, the deployment process is often an afterthought. And whenever this framework gets formulated, it is a very custom process.
  • Too many frameworks without a clear winner – Data science practices are highly custom, stringing together several ML frameworks for various sections of the pipeline. And there isn’t a single winner among the top ML frameworks yet (for example, in deep learning: PyTorch, MXNet, TensorFlow). The existence of several frameworks makes process standardization incredibly challenging for an external vendor.
  • Advanced ML pipelines are sometimes just not implemented – This includes capabilities like monitoring your models in production for drift, automated retraining using live data, enabling centralized data cleaning and feature engineering, etc. Even though these are valuable tools for any model running in production, the engineering effort behind them is so large that many orgs simply do not put in the effort.
  • Cloud AutoML is not enough by itself – Cloud services that support AutoML (SageMaker, Cloud AutoML, DataRobot) already come with strong CI/CD capabilities built in. I’m not just talking about automated algorithm selection: any software that can automate the data science pipeline (feature engineering, data processing, algorithm selection, inference, etc.) must have a reasonable level of CI/CD and engineering behind it. The software to operationalize the model is often built around a standardized inference template, usually provided by the same AutoML framework. In fact, you could make the case that if you are working with some form of AutoML, you already have a sort of CI/CD. But often this is not enough: CI/CD is only one part of a reliable MLOps process, you are still relying on an individual to get the deployment right, and enough checks and balances aren’t there (see the sketch below).
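
To make “checks and balances” concrete, here is a minimal sketch of a promotion gate that could run in CI before a candidate model is shipped. It assumes a scikit-learn-style model serialized with joblib and a frozen hold-out set; the file names, the “label” column, and the accuracy threshold are illustrative placeholders, not part of any particular framework.

```python
# Minimal promotion gate: refuse to ship a candidate model that fails basic checks.
# File names, the "label" column, and the threshold are illustrative placeholders.
import sys

import joblib
import pandas as pd
from sklearn.metrics import accuracy_score

MIN_ACCURACY = 0.85                        # hypothetical quality bar agreed with the business
CANDIDATE_PATH = "candidate_model.joblib"  # artifact produced by the training job
HOLDOUT_PATH = "holdout.csv"               # frozen validation set, never used in training


def main() -> int:
    model = joblib.load(CANDIDATE_PATH)        # does the artifact even deserialize?
    holdout = pd.read_csv(HOLDOUT_PATH)
    features = holdout.drop(columns=["label"])
    labels = holdout["label"]

    predictions = model.predict(features)      # does inference run end to end?
    score = accuracy_score(labels, predictions)
    print(f"holdout accuracy: {score:.3f}")

    if score < MIN_ACCURACY:                   # does it clear the agreed quality bar?
        print("candidate rejected: below minimum accuracy")
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

A non-zero exit code fails the pipeline, so a model only reaches production by clearing the same gate every time, instead of relying on whoever happens to do the deployment that day.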


Also, Data Scientists aren’t Software Engineers.

Software engineering has several guardrails in place because of the underlying assumption that humans make mistakes. This is well understood and theorized, and years’ worth of processes have been built around it. However, the machine learning industry is still playing catch-up to this fundamental idea.

Data scientists have a different thought process than software engineers. Developers, every day, are concerned with the operational aspects of their code. They build quality gates and checks all the way into production.

On the other hand, data scientists delve into complicated math and the analysis of data. Code quality, pipelines, and automated testing are not exactly top of mind when they are building a new model. Most of the time, it is not even fair to ask that of them.

However, to get MLOps right, we need classic software engineering in the mix. This work often gets masqueraded under the title of Data Engineer, but with MLOps gaining popularity, the role needs more nuance: today it is mostly framed around scaling the data that goes into the pipeline (ETL, Spark, etc.), not around the ML model-related problems. For example, this person has to think about:

  • How to create a reproducible data cleaning and feature engineering pipeline that is shared between model training and model inference (a minimal sketch follows below)
  • How to track the experiments and which metrics need to be aggregated to a dashboard
  • How to monitor models in production by tracking drift, and how to enable live retraining using new data

These are challenging problems that need both data science and software engineering thought processes. Various specialized roles that study each of these problems at scale need to be incorporated into the title ‘Data Engineer’ or its equivalent.
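
To make the first of these concrete, here is a minimal sketch of a reproducible preprocessing-plus-model pipeline, assuming scikit-learn and joblib; the column names, the estimator, and the artifact path are illustrative, not a prescription.

```python
# Preprocessing and model packaged as one artifact, so training and inference
# can never drift apart. Column names, estimator, and paths are illustrative.
import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["age", "income"]          # hypothetical feature columns
CATEGORICAL = ["country"]
ARTIFACT = "model_pipeline.joblib"   # single artifact shared by training and serving


def build_pipeline() -> Pipeline:
    """Data cleaning, feature engineering, and the model as one object."""
    preprocess = ColumnTransformer([
        ("num", Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())]), NUMERIC),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])
    return Pipeline([("preprocess", preprocess), ("model", RandomForestClassifier(random_state=0))])


def train(features: pd.DataFrame, labels: pd.Series) -> None:
    """Training job: fit the full pipeline and persist it as a single artifact."""
    pipeline = build_pipeline()
    pipeline.fit(features, labels)
    joblib.dump(pipeline, ARTIFACT)


def predict(features: pd.DataFrame):
    """Inference service: reload the artifact; imputation, scaling, and encoding
    are applied exactly as they were learned at training time."""
    pipeline = joblib.load(ARTIFACT)
    return pipeline.predict(features)
```

Because the transforms are fitted once and serialized together with the model, the inference side cannot quietly reimplement the cleaning logic in a slightly different way, which is where much training/serving skew comes from.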

Afterthought

The solution to everything I talked about seems apparent – a solid MLOps pipeline for your machine learning process. When we reached this point ourselves, we set forth in search of a provider who offered MLOps out of the box. After a few months of investigation that proved futile, we had to build it ourselves to fit our use case – just like everyone else I talked about.

It does what it is designed to do. But are we ever going to reach a point where we can open source it? Probably not.

CI/CD for ML practice (Image by Author)


