Design Considerations for Model Deployment Systems

An engineer’s guide to understanding cross-functional requirements for deploying Machine-Learning Models

Chaoyu Yang
Towards Data Science


Photo by Kelly Sikkema on Unsplash

The impact of machine learning grows more widespread every day, powering applications like product recommendations, fraud detection, and conversational AI. Data Science teams are no longer just sharing business insights with decision-makers via dashboards or presentations. More often than not, the full potential of ML is only realized when models are put into end applications.

However, deploying machine learning models remains one of the most time-consuming engineering challenges. As the engineer tasked with building the model deployment pipeline, it’s critical to understand that this is a cross-functional effort and requires collaborating with multiple organizational stakeholders.

In this article, we will dive into the key requirements of model deployment systems coming from three stakeholders: the Data Science team, the DevOps team, and the Product team. Then we'll break down the main design considerations (marked with 💡) for meeting those requirements, along with practical recommendations.

The Data Scientist

The Data Science team may be the team with the most (and likely most challenging) requirements to take into consideration — after all, it’s their models you’re bringing to fruition. Their primary goal is to build accurate and impactful models.

#1: Seamless transition from training environments

Data Scientists develop their models with specialized tools, such as Jupyter notebooks, different types of ML training frameworks, and experiment management platforms — environments that may be unfamiliar to many software engineers. This makes the transition of model artifacts and feature extraction code to production one of the most error-prone steps in the deployment pipeline.

💡 A mature model deployment system must be able to easily integrate with multiple ML frameworks and model development environments, giving data science teams the flexibility to choose whichever tools they find suitable for their model training needs.
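As a minimal sketch of what this looks like in practice — here using BentoML's framework-specific save_model APIs, with a hypothetical model name and toy training code — the downstream deployment workflow stays the same regardless of which framework produced the model:

```python
# A minimal sketch: models trained with different frameworks are saved into one
# common model store, so the deployment pipeline downstream stays the same.
# The model name and training code here are hypothetical examples.
import bentoml
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
sklearn_model = RandomForestClassifier().fit(X, y)

# Save a scikit-learn model into the local BentoML model store.
bento_model = bentoml.sklearn.save_model("iris_classifier", sklearn_model)
print(bento_model.tag)  # e.g. iris_classifier:<auto-generated version>

# The same pattern applies to other frameworks the team may prefer,
# e.g. bentoml.pytorch.save_model(...) or bentoml.xgboost.save_model(...).
```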

#2: The ability to continuously redeploy new models

Data Science teams understand that models don't remain static after being deployed to production — concept drift and data drift occur over time, requiring models to be retrained and updated periodically. Your Data Science team may also want to incorporate new features, update model architectures, or experiment with new ML frameworks. Deployment systems should let data science teams ship new models to production easily, and do so with confidence.

💡 You’ll need a way to manage the code, as well as the model and library dependencies all together. In addition, all of these pieces must be versioned and deployed together for reproducibility and a smooth rollback (if necessary).

This requirement also introduces a versioning challenge: the more models that are shipped, the harder it is to keep track of important details like how each model was trained, who has access, and who is responsible for monitoring to ensure quality.

💡 A model registry to record and recall specific details will simplify operational overhead related to managing multiple versions of a model.
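To make that concrete, here is a sketch of recording ownership and training details alongside each saved model, using BentoML's local model store as a lightweight registry (the label and metadata keys below are hypothetical, and the exact API may vary across BentoML versions):

```python
# A sketch of recording ownership and training details alongside a model version,
# using BentoML's local model store as a lightweight registry.
# The label and metadata keys/values below are hypothetical.
import bentoml
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

bentoml.sklearn.save_model(
    "fraud_detector",
    model,
    labels={"owner": "risk-ds-team", "stage": "staging"},
    metadata={"training_dataset": "transactions_2023_q1", "validation_auc": 0.93},
)

# Later, anyone on the team can list all versions and see how each was produced.
for m in bentoml.models.list("fraud_detector"):
    print(m.tag, m.info.labels, m.info.metadata)
```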

#3: Visibility of model performance

The ability to understand how a model is performing once it's live is critical. Consider this example: when a new model for recommending products results in fewer sales than before, it may be a sign that the new model's predictions are less accurate. Your system should help detect this and roll back to a previous version as soon as possible.

Concept and data drift also occur over time as habits and trends change. When this happens, data science teams need to investigate to determine if a model needs to be retrained.

💡 Your ML serving solution should make it easy for data scientists to retrieve and analyze predictions in real-time. Simple solutions like CloudWatch, DataDog or Grafana Loki may suffice for certain cases, but more complex cases may require specialized ML monitoring solutions from companies like WhyLabs, Fiddler.ai, or Arize.
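A minimal, tool-agnostic sketch of what this can look like: emit each prediction as a structured JSON log line that a log agent (CloudWatch, Datadog, Grafana Loki, and so on) can forward to whichever backend you use for analysis. The field and feature names below are hypothetical.

```python
# A minimal sketch: log every prediction as one structured JSON line so that a
# log agent (CloudWatch, Datadog, Grafana Loki, ...) can forward it for analysis.
# The field and feature names below are hypothetical.
import json
import logging
import time
import uuid

prediction_logger = logging.getLogger("prediction_log")
prediction_logger.setLevel(logging.INFO)
prediction_logger.addHandler(logging.StreamHandler())  # stdout, where agents pick it up

def log_prediction(model_version: str, features: dict, prediction: float) -> None:
    record = {
        "event": "prediction",
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
    }
    prediction_logger.info(json.dumps(record))

# Example usage inside a prediction endpoint:
log_prediction("recommender:v42", {"user_tenure_days": 120, "cart_value": 35.0}, 0.87)
```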

#4: Easy access to features

The viability of an ML model depends not only on its raw input data but also on the transformations that create additional derived features. In online prediction scenarios, these transformations often need to be performed in real time as the data is ingested. The key challenges are defining a common interface for retrieving feature data and handling feature transformation in the same way for both training and online serving.

💡 For most feature transformations, embedding the code directly in the serving pipeline is preferred, as it ensures the model and feature-processing code are always on the same version. However, when working with features that require online aggregation over time-series data, computing them with low latency is challenging. For these use cases, you may consider a dedicated feature store that exposes a common interface for both the online serving and training pipelines to access features. A good feature store can also help coordinate rollouts of new features, or new versions of a feature, so that they stay synchronized with your models.
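To make the idea of a common interface concrete, here is a hypothetical sketch — not any particular feature store's API — where both the training pipeline and the online service depend on the same abstraction, so features are computed and retrieved consistently:

```python
# A hypothetical sketch of a shared feature-retrieval interface used by both the
# training pipeline (historical features) and the online service (fresh features).
# Method names and feature keys are illustrative, not a specific product's API.
from typing import Any, Mapping, Protocol, Sequence

class FeatureStore(Protocol):
    def get_online_features(
        self, entity_ids: Sequence[str], feature_names: Sequence[str]
    ) -> Mapping[str, Mapping[str, Any]]:
        """Low-latency lookup used by the prediction service."""
        ...

    def get_historical_features(
        self, entity_ids: Sequence[str], feature_names: Sequence[str], as_of: str
    ) -> Mapping[str, Mapping[str, Any]]:
        """Point-in-time-correct lookup used when building training datasets."""
        ...

# The serving code only depends on the interface, never on how features are computed:
def predict_for_user(store: FeatureStore, model, user_id: str) -> float:
    feats = store.get_online_features([user_id], ["orders_last_7d", "avg_basket_value"])
    row = [feats[user_id]["orders_last_7d"], feats[user_id]["avg_basket_value"]]
    return float(model.predict([row])[0])
```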

The DevOps Engineer

You’ll likely be working very closely with the DevOps team to deploy ML models. This team is primarily responsible for the underlying technical infrastructure that powers your company’s software products, overseeing critical areas like the stability and security of the whole operation.

#1: Deploying models on preferred platforms

Model serving workloads usually rely on other existing services managed by the DevOps team for provisioning resources, retrieving features, streaming logs, and monitoring. With specific cloud or on-premise environments, deployment patterns, and tooling already set up, it’ll be important to deploy new ML services in a similar way to ease the maintenance burden.

💡 Your prediction service should be able to work with different logging, monitoring, and provisioning tools. Additionally, it should be optimized to run in many different forms depending on your use case — whether it’s as a real-time service running on Kubernetes, a batch inference job with Spark, or a serverless deployment in the cloud, provisioned via Terraform or CloudFormation.
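As one illustration of the batch case, the same kind of model artifact that powers a real-time API can be scored offline with Spark — here is a minimal sketch using a pandas UDF, with hypothetical column names and a toy model trained inline:

```python
# A minimal sketch: reusing a trained model for batch inference on Spark via a
# pandas UDF. Column names are hypothetical; in a real pipeline the model would
# be loaded from the model store rather than trained inline.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

spark = SparkSession.builder.appName("batch-scoring").getOrCreate()

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X[:, :2], (y == 0).astype(int))

@pandas_udf("double")
def score(sepal_length: pd.Series, sepal_width: pd.Series) -> pd.Series:
    features = pd.concat([sepal_length, sepal_width], axis=1).values
    return pd.Series(model.predict_proba(features)[:, 1])

df = spark.createDataFrame(
    [(5.1, 3.5), (6.7, 3.0)], schema="sepal_length double, sepal_width double"
)
df.withColumn("prediction", score("sepal_length", "sepal_width")).show()
```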

#2: Ensuring reliability and maintenance of the service

Once the model is deployed, the reliability of the service is extremely important to the DevOps team.

💡 DevOps teams should have the ability to monitor the service, reproduce identical deployments to create testing environments, and incorporate CI testing.

For monitoring, tracing, and alerting: Model serving workloads should make it easy to integrate with systems like Prometheus, OpenTelemetry, CloudWatch, or Datadog. This allows the DevOps team to monitor these new ML services in the same way they monitor other existing services.
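As a minimal sketch of the monitoring piece — using the prometheus_client library, with hypothetical metric and label names — a prediction service can expose the same kind of request counters and latency histograms the DevOps team already scrapes from other services:

```python
# A minimal sketch: exposing request counts and latency as Prometheus metrics
# from a prediction service, so it can be scraped like any other service.
# Metric, label, and feature names are hypothetical.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

PREDICTION_REQUESTS = Counter(
    "prediction_requests_total", "Total prediction requests", ["model_version"]
)
PREDICTION_LATENCY = Histogram(
    "prediction_latency_seconds", "Prediction latency in seconds", ["model_version"]
)

def predict(features):
    time.sleep(random.uniform(0.005, 0.02))  # placeholder for real model inference
    return 0.5

def handle_request(features, model_version: str = "v1"):
    PREDICTION_REQUESTS.labels(model_version=model_version).inc()
    with PREDICTION_LATENCY.labels(model_version=model_version).time():
        return predict(features)

if __name__ == "__main__":
    start_http_server(8000)  # metrics are exposed at http://localhost:8000/metrics
    while True:
        handle_request({"amount": 42.0})
```

From the latency histogram, dashboards and alerts (for example, on P99 latency) can be built in the same tools the team already uses.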

For reproducibility: To minimize the chances of failed deployments in the future, the system should be able to reproduce identical deployments to create testing environments. Being able to easily revert all systems to a previous state in the case of a production outage gives the DevOps team peace of mind. The GitOps flow is a standard, declarative way of describing the desired state of a deployment and automatically reproducing the entire stack if needed.

For example, the code sample below sketches a model deployment specification on Kubernetes with Yatai and BentoML:
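Treat the exact apiVersion and field names as illustrative — they vary across Yatai releases, so consult the Yatai documentation for the current schema. The deployment name and resource values here are hypothetical.

```yaml
# Illustrative sketch of a Yatai BentoDeployment custom resource; the exact
# apiVersion and field names vary across Yatai releases.
apiVersion: serving.yatai.ai/v2alpha1
kind: BentoDeployment
metadata:
  name: fraud-detector
  namespace: yatai
spec:
  bento: fraud_detector:latest      # the packaged model + serving code to deploy
  autoscaling:
    minReplicas: 2
    maxReplicas: 10
  resources:
    requests:
      cpu: "500m"
      memory: "512Mi"
    limits:
      cpu: "1000m"
      memory: "1Gi"
```

Because the desired state lives in a declarative file like this, it can be version-controlled and applied by a GitOps tool such as Argo CD or Flux to recreate the exact same deployment in another environment.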

For CI and testing: Unexpected behavior due to changes in the serving logic or model prediction can be extremely disruptive, which is why CI testing is a must-have to catch issues before they hit production.

💡 Your model deployment solution should allow for integration with CI pipelines that can test the entire serving pipeline. Inline automated tests should not only evaluate business logic in the code but also verify that key model prediction metrics have not drifted out of acceptable bounds.
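As a sketch of what such a check might look like — the model name, holdout dataset path, and accuracy threshold below are hypothetical and would come from your own pipeline — a test run in CI can load the candidate model and fail the build if a key metric regresses:

```python
# A sketch of a CI test (e.g. run with pytest) that fails the build if the
# candidate model's accuracy drops below an agreed threshold. The model name,
# dataset path, and threshold are hypothetical.
import bentoml
import pandas as pd
from sklearn.metrics import accuracy_score

ACCURACY_THRESHOLD = 0.90

def test_model_accuracy_has_not_regressed():
    model = bentoml.sklearn.load_model("fraud_detector:latest")
    holdout = pd.read_csv("tests/data/holdout.csv")  # labeled holdout set kept with the repo
    predictions = model.predict(holdout.drop(columns=["label"]))
    accuracy = accuracy_score(holdout["label"], predictions)
    assert accuracy >= ACCURACY_THRESHOLD, (
        f"Model accuracy {accuracy:.3f} fell below the {ACCURACY_THRESHOLD} threshold"
    )
```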

#3: Securing access to the model

Securing access to ML models is equally important as securing access to other critical parts of the business, like customer data or your company’s IP.

💡 ML prediction services, just like any other online service, should apply common security best practices with regard to authentication and access. We see many users deploying prediction services behind Istio; this type of service mesh can be configured for more complex authentication and routing scenarios. It's also worth noting that ML services in particular can present new types of security challenges, such as adversarial examples intended to cause models to make mistakes.

The Product Manager

Your product team is an important stakeholder — after all, they're responsible not only for ML's business impact, but also for all of the economic, social, and legal issues around ML. Their considerations for ML deployment systems mainly fall into two camps: proving the value of the model and delivering a final product with a great user experience.

Note: "product manager" here generally means the product owner — whoever in your organization is in charge of the success of the ML project and constantly advocates for end users as well as business value.

#1: Quickly validate the value of the model

The product team will want to bring a functioning prototype to market as soon as possible in order to validate its value. Because of this time-to-market requirement, many ML experts believe ML projects should start on a smaller scale at first, then adjust and refine over time.

💡 When deploying a model to production, skip the more advanced features initially. Once the model proves viable, everything can be fine-tuned for the best results.

#2: Meet the latency requirements

Prediction latency is a key consideration for the Product team because it has a direct impact on the end user's experience (and therefore, on revenue). Consider this example: a video game company using an ML model to match competing players of similar skill needs matchmaking to be near instantaneous — if it runs too slowly, frustrated users may quit.

💡 Depending on the use case, there may be different latency SLAs (Service Level Agreements) that the model serving system needs to take into consideration. For some time-sensitive, mission-critical use cases, like ad targeting, the ML serving system needs to deliver a prediction in real time with P99 latency under 50 ms. For others, like a fraud detection model for issuing credit cards, a few minutes will suffice.

#3: Reduce operational costs

Large-scale machine learning is an extremely compute-intensive workload, potentially involving thousands of machines operating at the same time. This comes with costs in terms of energy, maintenance, and cloud spend for your organization. An ML model only makes sense for a business if the value it delivers outweighs the cost of running it. That's why product teams want to ensure the value a model delivers while optimizing for cost.

💡 A solution that optimizes for horizontal and vertical scaling out-of-the-box ensures that costs are minimized at the start. Different use cases may require one or the other.

Vertical scale and resource utilization: If prediction traffic requires specialized hardware acceleration or can be served more cost-effectively by scaling vertically, you'll want to make sure that the software is optimized for the hardware you've chosen. Specifically, you'll want to make sure it supports common performance best practices such as adaptive batching of prediction requests, properly distributing load to all available CPU cores, and utilizing GPUs or custom accelerators where needed. This not only helps minimize latency for individual requests, but also reduces the overall amount of compute required by using resources efficiently.
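For instance, a serving framework may let you mark a model's predict method as batchable so that concurrent requests are grouped into micro-batches automatically. Here is a sketch using BentoML's signatures option — the model name is hypothetical, and the exact options depend on the framework version:

```python
# A sketch of enabling adaptive batching for a saved model, so the serving
# runtime can group concurrent requests into micro-batches on CPU or GPU.
# The model name is hypothetical; exact options depend on the BentoML version.
import bentoml
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)
model = GradientBoostingClassifier().fit(X, y)

bentoml.sklearn.save_model(
    "iris_classifier",
    model,
    signatures={"predict": {"batchable": True, "batch_dim": 0}},
)

# At serving time, the runner created from this model can merge concurrent
# requests along the batch dimension before calling predict().
runner = bentoml.sklearn.get("iris_classifier:latest").to_runner()
```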

Horizontal Auto-scaling: If prediction traffic varies throughout a specified time period (like a food delivery service with surges during lunch time or event-related spikes around Black Friday and Christmas, for example), the solution should be able to automatically scale horizontally in order to balance latency requirements and resource utilization.

Serverless and scale-to-zero: If prediction traffic is limited to just a handful of predictions each month, you may want to consider a serverless architecture like Knative or AWS Lambda. These solutions can scale down to 0 when no requests are in progress, saving on cost.

Conclusion

Data Science, DevOps, and Product are all teams that have a stake in the success of an ML project. The diversity of needs across these teams creates challenges that are very different from those of the typical software deployment process. While it may not be necessary to address all of these considerations in the very first iteration of an ML model deployment system, it is helpful to know what may lie ahead so that proper planning and research can be done to ensure the greatest likelihood of success.

A bit of background: my name is Chaoyu Yang, and I'm the creator of the open source model serving framework BentoML. Previously, I helped build Databricks' unified data science platform in its early days. If you are interested in learning more about this topic or being part of the conversation, join our community Slack, where hundreds of ML practitioners gather to discuss all things MLOps.
