Bridging the Gap: MLOps Linking ML Research and Production

Rashid Kazmi, Ph.D.
Towards Data Science
15 min read · Feb 9, 2022


(Image source: istockphoto.com)

Introduction

Machine learning models are highly advantageous; however, their potential for generating maximum value is constrained without the knowledge of how to deploy them effectively. The systematic planning of a machine learning project’s lifecycle serves as a crucial workflow for ensuring successful implementation. When constructing a machine learning system, careful consideration of the machine learning project lifecycle facilitates comprehensive planning of all necessary steps.

For instance, in our scenario, we employ computer vision techniques to examine phones as they emerge from a manufacturing assembly line, aiming to identify any defects. In the provided illustration, the phone on the left lacks any visible scratches. However, if there were cracks or other imperfections present, a computer vision algorithm would ideally be capable of identifying such anomalies. As part of the quality control process, the algorithm may enclose the identified defect within a bounding box. By utilizing a dataset comprising images of scratched phones, it becomes possible to train a computer vision algorithm integrated within a network to detect similar defects. Nevertheless, to successfully deploy this system into production, certain actions need to be taken.

Figure 1: Example computer vision project to inspect phones coming off a manufacturing line to see if there are defects on them.

An example deployment scenario involves the use of an edge device that resides within the smartphone manufacturing factory. This edge device is equipped with inspection software that captures an image of each phone, examines it for scratches, and then makes a determination regarding its acceptability. This process, commonly referred to as automated visual defect inspection, is frequently employed in manufacturing facilities.

The inspection software operates by controlling a camera to capture an image of the smartphone as it emerges from the production line. Subsequently, an API call is made to transmit this image to a prediction server. The role of the prediction server is to receive these API calls, analyze the image, and generate a decision regarding the presence of defects in the phone. The resulting prediction is then sent back to the inspection software. Finally, based on this prediction, the inspection software can implement the appropriate control decision, determining whether the phone should proceed along the manufacturing line or be flagged as defective.
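To make this flow concrete, here is a minimal sketch of what the inspection software's call to a prediction server could look like. The endpoint URL, payload format, and response fields (`defect`, `bounding_box`) are illustrative assumptions rather than details from the original system.

```python
import base64
import requests

# Hypothetical endpoint exposed by the prediction server on the factory network.
PREDICTION_URL = "http://prediction-server.local:8080/v1/inspect"

def inspect_phone(image_path: str, timeout_s: float = 1.0) -> bool:
    """Send one phone image to the prediction server; return True if the phone passes."""
    with open(image_path, "rb") as f:
        payload = {"image": base64.b64encode(f.read()).decode("ascii")}

    # The prediction server analyzes the image and returns a defect decision.
    response = requests.post(PREDICTION_URL, json=payload, timeout=timeout_s)
    response.raise_for_status()
    result = response.json()  # assumed shape: {"defect": true/false, "bounding_box": [...]}

    # The inspection software turns the prediction into a control decision:
    # let the phone proceed, or flag it as defective.
    return not result["defect"]
```

In practice the inspection software would also handle timeouts and retries so the line keeps moving if the prediction server is briefly unavailable.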

ML Infrastructure Requirements

In recent years, machine learning models have garnered significant attention, with substantial advancements observed in the field. However, deploying these models into production requires more than just the machine learning code itself. A machine learning system typically encompasses a neural network or another algorithm designed to learn a function that maps inputs to outputs.

Upon examining a machine learning system in a production setting, it becomes apparent that the machine learning code, represented by the orange rectangle, constitutes only a fraction of the entire project’s codebase [1]. In fact, the machine learning code often accounts for only a small portion of the total, perhaps 5–10% or even less.

Figure 2: Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small orange box in the middle. The required surrounding infrastructure is vast and complex [1].

One of the underlying factors contributing to the challenge of transitioning from a proof of concept (POC) model implemented in a Jupyter notebook to a fully deployed production system is the substantial amount of additional work involved. This gap between the POC and production stages is commonly referred to as the “proof of concept to production gap.” A significant portion of this gap primarily stems from the considerable effort required to develop the extensive codebase beyond the initial machine learning model code.

In addition to the machine learning code itself, numerous other components must be constructed to enable a functional and robust production deployment. These components often revolve around effective data management, encompassing tasks such as data collection, data verification, and feature extraction. Furthermore, once the system is operational, it becomes essential to establish mechanisms for monitoring the incoming data and performing data analysis.

The Machine Learning Project Lifecycle: From Planning to Deployment

The Machine Learning Project Lifecycle encompasses a systematic approach to planning and executing a machine learning project. This lifecycle comprises several key steps that guide the progression of the project. The initial step is scoping, wherein the project’s objectives are defined, the specific application of machine learning is determined, and the variables of interest (X and Y) are identified. Once the project scope is established, the next step is data collection or acquisition, which encompasses defining the required data, establishing a baseline, and labelling and organizing the data.

Following data acquisition, the model training phase ensues. Within this phase, the model selection and training processes are conducted, accompanied by error analysis activities. Notably, machine learning projects often involve a high degree of iteration. During the error analysis stage, it is common to revisit and update the model or return to the earlier phases to acquire additional data if deemed necessary.

Figure 3: MLOps (Machine Learning Operations) is an emerging discipline, and comprises a set of tools and principles to support progress through the ML project lifecycle [2].

Before deploying the system, an important step is conducting a thorough error analysis, which may include final checks or audits to ensure satisfactory system performance and reliability for the intended application. It is crucial to acknowledge that deploying the system for the first time marks only the halfway point in achieving optimal performance. The subsequent half of valuable insights is typically gained through the active usage of the system and learning from real-time traffic.

Challenges in Deploying Machine Learning Models

Deploying a machine learning model involves two broad categories of challenges. The first category encompasses challenges related to the underlying machine learning techniques and statistical aspects of the model. This may involve addressing issues such as model accuracy, overfitting, selecting appropriate algorithms, and fine-tuning hyperparameters. Successful deployment requires addressing these challenges to ensure optimal performance and reliable predictions.

The second category pertains to the software engineering aspects of deployment. This involves considerations such as scalability, system integration, compatibility with existing software infrastructure, and the ability to handle large volumes of data efficiently. Effective deployment necessitates addressing these software engineering issues to ensure a robust and scalable solution.

Navigating Concept Drift and Data Drift in Deployments

Concept drift and data drift present significant challenges in many deployments. Concept drift refers to the situation where the underlying concepts or relationships within the data change over time, while data drift refers to changes in the data distribution itself. This poses a critical question: What happens if the data used to train a deployed system undergoes changes?

For instance, in the previous manufacturing example, a learning algorithm was trained to detect scratches on smartphones under specific lighting conditions. However, if the lighting conditions in the factory change, it represents an example of data distribution change. Recognizing and understanding how the data has changed is essential in determining whether the learning algorithm needs to be updated accordingly.

Data changes can occur in different ways. Some changes transpire gradually, such as the evolution of the English language, where new vocabulary is introduced at a relatively slow rate. On the other hand, abrupt changes can also arise, leading to sudden shocks to the system.

Tackling Data and Concept Drift in Credit Card Fraud Systems during COVID-19

The outbreak of the COVID-19 pandemic brought forth unforeseen challenges for credit card fraud systems. The purchasing patterns of individuals underwent sudden and significant changes during this period. Many individuals who previously engaged in limited online shopping transitioned to making a substantial number of online transactions. Consequently, the manner in which credit cards were utilized experienced a sudden shift, thereby causing disruptions in numerous anti-fraud systems. The abrupt transformation in data distribution prompted machine learning teams to swiftly gather new data and retrain their systems to adapt to this novel data landscape.

In the context of data changes, it is worth noting that the terminology employed to describe these shifts may not always exhibit complete consistency. The term “data drift” is sometimes used to refer to alterations in the input distribution (x). For instance, if a previously unknown politician or celebrity suddenly garners widespread recognition and receives considerably more mentions, it signifies a change in the input distribution. On the other hand, “concept drift” denotes modifications in the desired mapping from the input (x) to the output (y). For instance, prior to the COVID-19 pandemic, certain unexpected online purchases made by a particular user may have triggered an alert for potential fraud. However, following the onset of the pandemic, these same purchases might no longer warrant concern or flagging. Consequently, the criteria for flagging suspicious activities might have required adjustment.
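As a rough illustration of how a change in the input distribution (x) might be detected, the sketch below compares a live feature sample against a reference sample from training time using a two-sample Kolmogorov-Smirnov test. The feature, threshold, and data are assumptions made for illustration; detecting concept drift (a change in the mapping from x to y) additionally requires monitoring labelled outcomes.

```python
import numpy as np
from scipy.stats import ks_2samp

def input_has_drifted(reference: np.ndarray, live: np.ndarray, alpha: float = 0.01) -> bool:
    """Flag data drift in a single input feature using a two-sample KS test."""
    _, p_value = ks_2samp(reference, live)
    return p_value < alpha  # a tiny p-value suggests live data no longer matches training data

# Illustrative example: transaction amounts before vs. during the pandemic.
rng = np.random.default_rng(0)
training_amounts = rng.lognormal(mean=3.0, sigma=0.8, size=5000)
pandemic_amounts = rng.lognormal(mean=3.6, sigma=1.0, size=5000)  # shifted spending pattern

if input_has_drifted(training_amounts, pandemic_amounts):
    print("Input distribution has drifted; consider gathering new data and retraining.")
```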

Software Engineering Challenges in System Deployment

In addition to addressing data-related changes, successful deployment of a system requires effectively managing software engineering issues [3]. When implementing a prediction service that takes queries (x) and generates corresponding predictions (y), various design choices come into play in developing the software component. To assist with decision-making in managing software engineering challenges, the following checklist of questions can be helpful.

One crucial decision in developing your application revolves around determining whether real-time predictions or batch predictions are necessary. For instance, in the case of a speech recognition system where prompt response is required within half a second, real-time predictions are imperative. On the other hand, consider systems designed for hospitals that analyze patient records, such as electronic health records, during an overnight batch process to identify relevant information. In this scenario, running the analysis as a batch process once per night is sufficient. The requirement of either real-time or batch predictions significantly impacts the implementation strategy, influencing whether the software needs to be designed for rapid responses within milliseconds or if it can perform extensive computations overnight.
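The sketch below contrasts the two modes under stated assumptions: a placeholder `predict` function stands in for the trained model, the 500 ms budget comes from the speech example above, and the nightly job simply loops over all records.

```python
import time

def predict(record):
    """Placeholder for the trained model's prediction."""
    return {"flagged": False}

# Real-time serving: every request must come back quickly (roughly half a second for speech).
def handle_realtime_request(record):
    start = time.perf_counter()
    result = predict(record)
    latency_ms = (time.perf_counter() - start) * 1000
    if latency_ms > 500:
        raise RuntimeError(f"Real-time latency budget exceeded: {latency_ms:.0f} ms")
    return result

# Batch serving: process all of yesterday's patient records in one overnight run.
def run_nightly_batch(records):
    return [predict(r) for r in records]
```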

Unleashing Deployment Potential: Strategies for Success

When deploying a learning algorithm, a haphazard “turn it on and hope for the best” approach is generally not advisable due to the potential risks involved. Distinctions can be drawn between the initial deployment and subsequent maintenance and update deployments. Let us delve into these distinctions in more detail.

The first deployment scenario arises when introducing a new product or capability that was not previously offered. Consider the example of offering a novel speech recognition service. In such cases, a common design pattern involves initially directing a modest amount of traffic to the service and gradually increasing it over time. This gradual ramp-up strategy enables careful monitoring and assessment of the system’s performance, ensuring a smooth transition to full-scale deployment.

Another frequent deployment use case emerges when a task that was previously performed by humans is now intended to be automated or assisted by a learning algorithm. For instance, in a factory setting, if human inspectors were responsible for scrutinizing smartphone scratches, incorporating a learning algorithm for task automation or assistance becomes feasible. The existence of prior human involvement in the task offers additional flexibility in determining the deployment strategy.

By adopting appropriate deployment patterns suited to each scenario, the risks and challenges associated with system deployment can be mitigated effectively.

Shadow Mode Deployment for Learning Algorithms

Consider the scenario of visual inspection, where human inspectors traditionally examine smartphones for defects, particularly scratches. Now, the goal is to introduce a learning algorithm to automate and assist in this inspection process. When transitioning from human-driven inspection to algorithmic augmentation, a commonly employed deployment pattern is known as shadow mode deployment.

In shadow mode deployment, a machine learning algorithm is deployed alongside the human inspector, running in parallel. During this initial phase, the outputs of the learning algorithm are not used to make any decisions within the factory setting. Instead, human judgment remains the primary determinant. For instance, if the human inspector deems a smartphone defect-free and the learning algorithm concurs, the decisions align. Similarly, if the human inspector identifies a substantial scratch defect and the learning algorithm agrees, their judgments are consistent. However, there may be instances where the human inspector recognizes a minor scratch defect while the learning algorithm erroneously categorizes the phone as acceptable.

Figure 4: The ML system shadows the human inspector and runs in parallel. The ML system’s output is not used for any decisions during this phase; its outputs are sampled and its predictions verified against the human’s judgments.

The essence of shadow mode deployment lies in its ability to gather data on the learning algorithm’s performance and compare it against human judgment. By operating in parallel with human inspectors without actively influencing decisions, this deployment approach enables comprehensive evaluation and verification of the learning algorithm’s efficacy. Embracing shadow mode deployment proves highly effective in assessing learning algorithm performance before granting it decision-making authority.
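A minimal sketch of shadow mode is shown below, assuming a placeholder `model_predict` function and a simple JSON log; the point is that the human decision is always the one returned, while both decisions are logged for later comparison.

```python
import json
import logging

logging.basicConfig(filename="shadow_mode.log", level=logging.INFO)

def model_predict(image) -> str:
    """Placeholder for the learning algorithm; returns 'ok' or 'defect'."""
    return "ok"

def inspect_in_shadow_mode(image, human_decision: str) -> str:
    """Run the model in parallel with the human inspector; only the human decision is acted on."""
    model_decision = model_predict(image)

    # Log both decisions so agreement with the human can be measured offline.
    logging.info(json.dumps({
        "human": human_decision,
        "model": model_decision,
        "agree": human_decision == model_decision,
    }))

    # During shadow mode the model's output never drives the factory decision.
    return human_decision
```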

Canary Deployment: Gradual and Monitored Algorithmic Decision-Making

Canary Deployment, a commonly employed deployment pattern, serves as a method for introducing a learning algorithm to real-world decision-making scenarios. This approach involves initially rolling out the algorithm to a limited portion of the overall traffic, typically around 5% or even less. By confining the algorithm’s influence to a small fraction of the traffic, any potential mistakes made by the algorithm would only impact this subset. This setup offers an opportunity to closely monitor the system’s performance and gradually increase the percentage of traffic directed towards the algorithm as confidence in its capabilities grows.

Figure 5: Roll out to small fraction (say 5%) of traffic initially. Monitor system and ramp up traffic gradually.

The implementation of canary deployment facilitates the early detection of issues before they can lead to significant consequences, such as disruptions in a factory or other relevant operational contexts where the learning algorithm is deployed.
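A canary rollout can be as simple as the routing sketch below, where a configurable fraction of requests (here an assumed 5%) is sent to the new model while everything else stays on the proven one.

```python
import random

CANARY_FRACTION = 0.05  # start small; increase as confidence in the new model grows

def old_model_predict(x):
    return "ok"   # placeholder for the current production model

def new_model_predict(x):
    return "ok"   # placeholder for the newly trained model

def route_request(x):
    """Send a small, monitored slice of traffic to the new model."""
    if random.random() < CANARY_FRACTION:
        return new_model_predict(x)  # canary path: any mistakes affect only ~5% of traffic
    return old_model_predict(x)
```

In practice, predictions and errors on the canary slice are watched on a dashboard before the fraction is ramped up.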

Blue-Green Deployment: Seamlessly Transitioning to New Versions with Rollback Capability

Blue-green deployment is another commonly used deployment pattern. Consider again the factory setting, where camera software captures phone images and passes them to a software router that forwards them to a visual inspection system. The existing prediction software is referred to as the “blue” version, while the newly implemented learning algorithm represents the “green” version. During a blue-green deployment, the router initially sends the images to the blue version for decision-making.

Figure 6: Blue-green deployment provides an easy way to enable rollback.

To transition to the new version, the router is configured to stop sending images to the old version and switch over to the new one. The implementation of a blue-green deployment involves running the old prediction service, the blue version, alongside the new prediction service, the green version. By swiftly redirecting traffic from the old service to the new one, a blue-green deployment offers the advantage of easy rollback if necessary. It is worth noting that the conventional approach switches all of the traffic over to the new version at once, although a more gradual switchover is also possible.
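Conceptually, the router in a blue-green setup can be reduced to a single switch, as in the sketch below; the service URLs are hypothetical, and in a real system the switch would live in a load balancer or service mesh rather than application code.

```python
class BlueGreenRouter:
    """Keeps both prediction services running and routes all traffic to one of them."""

    def __init__(self, blue_url: str, green_url: str):
        self.urls = {"blue": blue_url, "green": green_url}
        self.active = "blue"  # start on the old (blue) version

    def switch_to_green(self):
        self.active = "green"  # cut traffic over to the new version

    def rollback(self):
        self.active = "blue"   # easy rollback: point traffic back at the old version

    def target_url(self) -> str:
        return self.urls[self.active]

router = BlueGreenRouter("http://blue-inspection.local/predict",
                         "http://green-inspection.local/predict")
router.switch_to_green()  # deploy the new version
router.rollback()         # revert instantly if problems appear
```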

Degrees of Automation in Deployment for Human-AI Collaboration

When considering the deployment of a system, it is beneficial to shift away from a binary perspective of deploy or not deploy and instead focus on determining the appropriate level of automation. MLOps tools can assist in implementing deployment patterns, or alternatively, they can be manually implemented. For instance, in the case of visual inspection of smartphones, different degrees of automation can be considered.

At one end of the spectrum is a purely human-driven system with no automation. A slightly automated approach involves running the system in a shadow mode, where learning algorithms generate predictions but do not directly impact factory operations. Increasing the level of automation, an AI-assisted approach could provide guidance to human inspectors by highlighting regions of interest in images. User interface design plays a crucial role in facilitating human assistance while achieving a higher degree of automation.

Partial automation represents an even greater level of automation, wherein the learning algorithm’s decision-making is relied upon unless it is uncertain. In cases of uncertainty, the decision is deferred to a human inspector. This approach, known as partial automation, proves beneficial when the performance of the learning algorithm is not yet sufficient for complete automation.
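Partial automation boils down to a confidence check, sketched below with a placeholder model and an assumed threshold; the threshold value would be tuned for the specific application.

```python
CONFIDENCE_THRESHOLD = 0.9  # assumed cut-off between automated and human-reviewed decisions

def model_predict_with_confidence(image):
    """Placeholder: returns (label, confidence) from the learning algorithm."""
    return "defect", 0.72

def partial_automation_decision(image, ask_human) -> str:
    """Use the model's decision when it is confident; otherwise defer to the human inspector."""
    label, confidence = model_predict_with_confidence(image)
    if confidence >= CONFIDENCE_THRESHOLD:
        return label             # confident enough: automate the decision
    return ask_human(image)      # uncertain: the human inspector makes the call
```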

Figure 7: Design a system thinking about what is the appropriate degree of automation. You can choose to stop before getting to full automation.

Beyond partial automation lies full automation, where the learning algorithm assumes responsibility for every decision without human intervention. Deployment applications often span a spectrum, gradually transitioning from relying solely on human decisions to relying solely on AI system decisions. In domains such as consumer Internet applications (e.g., web search engines, online speech recognition systems), full automation is necessary due to scale and feasibility. However, in contexts like factory inspections, human-in-the-loop deployments are often preferred, finding the optimal design point between human and AI involvement.

Monitoring Machine Learning Systems: Metrics and Dashboards for Performance Insights

To ensure that a machine learning system meets performance expectations, monitoring is crucial. The standard approach involves using a dashboard to track the system’s performance over time. Depending on the specific application, different metrics may be monitored through dedicated dashboards; for instance, server load or the fraction of non-null outputs could be tracked separately.

When determining what to monitor, it is recommended to gather input from the team and brainstorm potential issues that might arise. Once identified, suitable statistics or metrics should be devised to detect those problems. For instance, if concerns exist regarding user traffic spikes leading to overloaded services, monitoring server loads can serve as a relevant metric.

During the initial design of monitoring dashboards, it is acceptable to include a wide range of metrics and subsequently refine the set over time by eliminating those that prove to be less useful. Various examples of metrics commonly used in different projects include software metrics (e.g., memory, compute, latency, throughput, server load) that assess the health of the software implementation supporting the prediction service or other related components. Additionally, selecting metrics to monitor the statistical health and performance of the learning algorithm itself is advisable.

In certain cases, coarse metrics like click-through rate (CTR) can be employed, particularly in web search applications, to ensure the overall system’s robustness. Such output metrics aid in identifying changes in the learning algorithm’s output (y) or any significant shifts occurring downstream, such as alterations in user behavior like switching to typing.

Since input and output metrics are application-specific, MLOps tools generally require configuration to accurately track these metrics for a given application.
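As a minimal illustration of tracking such metrics without any particular MLOps tool, the sketch below wraps the prediction call and records one software metric and one output metric; the in-memory store stands in for a real dashboard or time-series database.

```python
import time
from collections import defaultdict

metrics = defaultdict(list)  # stand-in for a dashboard / time-series database

def record(name: str, value: float):
    metrics[name].append(value)

def monitored_predict(model, x):
    """Wrap the prediction service to record software and statistical metrics."""
    start = time.perf_counter()
    y = model(x)
    record("latency_ms", (time.perf_counter() - start) * 1000)  # software metric
    record("null_output", 1.0 if y is None else 0.0)            # output / statistical metric
    return y
```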

The Iterative Journey: Unveiling the Parallelism of ML Modeling and Deployment

Deployment is iterative in much the same way as machine learning modelling. Just as modelling involves creating a model, training it with data, conducting error analysis, and making improvements, deployment should also be approached iteratively. The initial deployment and the implementation of monitoring dashboards mark the beginning of this iterative process.

Figure 8: Iterative process to choose the right set of metrics to monitor.

A running system provides an opportunity to gather real user data and traffic, enabling performance analysis. This analysis feeds into the update and refinement of the deployment, while continuous monitoring of the system persists. Once a set of metrics to monitor is selected, it is customary to establish thresholds for triggering alarms or notifications. These thresholds serve as indicators to identify potential issues or anomalies, such as exceeding server load or a significant fraction of missing values. Over time, metrics and thresholds can be adapted to ensure they effectively flag relevant cases of concern.
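Threshold-based alerting can start as simply as the check below; the metric names and threshold values are assumptions, and in practice both are revisited as the team learns which values genuinely indicate trouble.

```python
# Assumed thresholds; these are adapted over time rather than fixed once.
THRESHOLDS = {
    "server_load": 0.90,         # fraction of capacity
    "missing_value_rate": 0.05,  # fraction of inputs with missing values
}

def check_alerts(current_metrics: dict) -> list:
    """Compare the latest metric values against thresholds and collect alert messages."""
    alerts = []
    for name, limit in THRESHOLDS.items():
        value = current_metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"{name}={value:.2f} exceeds threshold {limit:.2f}")
    return alerts

print(check_alerts({"server_load": 0.95, "missing_value_rate": 0.01}))
# -> ['server_load=0.95 exceeds threshold 0.90']
```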

If performance problems arise, related to either the accuracy or efficiency of the learning algorithm, appropriate actions need to be taken. This may involve updating the model itself or addressing specific algorithmic issues. Consequently, many machine learning models require ongoing maintenance and occasional retraining to sustain optimal performance.

Monitoring the Flow: Surveillance of Complex AI Pipelines

Many AI systems consist of intricate pipelines encompassing multiple steps, rather than a single machine learning model serving as a prediction service. To effectively monitor such machine learning pipelines, understanding how they are constructed is crucial. Let’s consider an example involving user profiles derived from clickstream data, which capture key attributes of users, including a prediction of car ownership.

Figure 9: Clickstream data showing what users click on is used to build a user profile that captures key attributes of each user, which in turn feeds product recommendations.

The user profile is typically built using a learning algorithm to predict whether a user owns a car, among many other attributes. This comprehensive user profile serves as input for a recommendation system, where another learning algorithm generates product recommendations based on this understanding of the user. However, if there are changes in the clickstream data, such as alterations in the input distribution leading to a higher percentage of unknown labels in the user profile, it can impact the input to the recommendation system, potentially affecting the quality of the product recommendations.

When dealing with a machine learning pipeline, these cascading effects can make the system complex to track and manage. Even so, an increase in the percentage of unknown labels should trigger alerts prompting updates to the recommendation system, so that it continues to generate high-quality product recommendations.
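For the pipeline above, a concrete monitoring signal could be the fraction of "unknown" attribute values coming out of the user-profile step, sketched below with an assumed attribute name and threshold.

```python
def unknown_label_rate(user_profiles: list, attribute: str = "owns_car") -> float:
    """Fraction of user profiles where the attribute could not be predicted."""
    unknown = sum(1 for profile in user_profiles if profile.get(attribute) == "unknown")
    return unknown / max(len(user_profiles), 1)

UNKNOWN_RATE_THRESHOLD = 0.20  # assumed; set from the pipeline's historical behaviour

profiles = [{"owns_car": "yes"}, {"owns_car": "unknown"}, {"owns_car": "unknown"}]
if unknown_label_rate(profiles) > UNKNOWN_RATE_THRESHOLD:
    print("Unknown-label rate is rising upstream; review the recommender's inputs and retrain if needed.")
```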

Summary

As the machine learning (ML) community gains substantial experience with live systems, a disconcerting trend has emerged: developing and deploying ML systems can be fast and cost-effective, but maintaining them over the long term is difficult and expensive. In addition to the ML code itself, there are numerous components involved, particularly those related to data management, such as data collection, verification, and feature extraction. Once the system is deployed, monitoring the incoming data and performing analysis becomes essential. Successful deployment of ML systems therefore necessitates the construction of many components beyond the model itself, and understanding the significance of these surrounding software pieces is crucial for achieving successful production deployments.

To address these complexities, we have walked through a valuable framework that organizes the workflow of a machine learning project and systematically outlines its life cycle. We have delved deep into the complete life cycle of a machine learning project and anticipate that this framework will greatly benefit all your future machine learning deployments.

👋 Thank you for reading! If you’ve enjoyed my work, don’t forget to give it a thumbs up and follow me on Medium. Your support will inspire me to provide more valuable content to the Medium community! 😊

References:

[1] Sculley, D., Holt, G., Golovin, D., Davydov, E., & Phillips, T. (2015). Hidden technical debt in machine learning systems. NeurIPS. https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf

[2] DeepLearning.AI

[3] Katsiapis, K., Karmarkar, A., Altay, A., Zaks, A., Polyzotis, N., … Li, Z. (2020). Towards ML Engineering: A brief history of TensorFlow Extended (TFX). http://arxiv.org/abs/2010.02013

[4] Paleyes, A., Urma, R.-G., & Lawrence, N. D. (2020). Challenges in deploying machine learning: A survey of case studies. http://arxiv.org/abs/2011.09926

