
How to design an MLOps architecture in AWS

A guide for developers and architects, especially those who are not specialized in machine learning, to design an MLOps architecture for…

Introduction

According to Gartner’s findings, only 53% of machine learning (ML) projects progress from proof of concept (POC) to production. Often there is a misalignment between the strategic objectives of the company and the machine learning models built by data scientists. A lack of communication between DevOps, security, legal, IT, and the data scientists can make it challenging to push the model into production. Finally, the team might find it difficult to maintain models in production while pushing out new ones. These challenges have led to the rise of MLOps, which brings the principles of DevOps, such as continuous integration and continuous delivery (CI/CD), automation, and collaboration, to the machine learning lifecycle – development, deployment, and monitoring.

In this article, I will dive into the following:

  • Various steps in the machine learning process
  • Different MLOps components and why they are necessary, without diving too deep into details that only data scientists need to know
  • MLOps architecture diagrams based on the size and maturity of the organization
  • General suggestions on starting the MLOps journey

Typical machine learning process

Let’s start first by understanding the steps involved in the machine learning process.

Machine learning process – Image by Author

A machine learning process has the following components:

  1. Business problem and machine learning problem statement: We start the process by identifying the business problem and agreeing that machine learning is the right solution for it. The proposed machine learning solution should produce a measurable business outcome.
  2. Data collection, integration, and cleaning: In this step, data scientists/data engineers collect data, integrate it from different sources, and clean and transform it to make it consumption-ready. They might also divide the data into three datasets – training, validation, and testing – using a ratio such as 80–10–10 or 60–20–20. The training set is used directly for training the model, the validation set provides unseen examples for evaluating the model’s performance during tuning, and the test set is a final set of unseen records for checking the model’s real-life performance.
  3. Data analysis and visualization: Data scientists then perform exploratory data analysis (EDA) to understand the structure and quality of the data. This step is necessary to spot discrepancies and patterns in the data and to form new hypotheses.
  4. Feature engineering: In this step, data is selected, combined, and manipulated to create new variables using statistical or machine learning approaches, for example by applying a log transform, scaling, or normalization. Together, the new features contribute towards improving model performance.
  5. Model training and parameter tuning: Once the new features are available, data scientists train various ML models and tune hyperparameters to meet the desired performance metrics.
  6. Model evaluation and deployment: In this step, the best-performing model is selected and deployed to production. Deploying the model means it is ready for consumption (prediction).
  7. Monitoring and debugging: A machine learning model starts going stale as soon as it is deployed to the real world, so it must be regularly re-trained with updated data.
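The 80–10–10 split from step 2 can be sketched in plain Python. In practice you would more likely use scikit-learn’s `train_test_split` or a SageMaker processing job; `split_dataset` here is just an illustrative helper:

```python
import random

def split_dataset(records, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle records deterministically, then cut them into
    train / validation / test slices (e.g. 80-10-10)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]

# 100 toy records -> 80 training, 10 validation, 10 test
train, val, test = split_dataset(list(range(100)))
```

Fixing the shuffle seed matters: it keeps the split reproducible, so a re-trained model is evaluated on the same held-out records as its predecessor.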

Every data scientist follows this process more or less, and those who are new to their machine learning journey perform most of the above steps manually.

To illustrate what I mean, let us look at the architecture diagram without MLOps.

Machine learning architecture without MLOps – Image by Author

Here we have a pretty standard data science setup. The data scientists have been given access to an AWS account with SageMaker Studio, in which they use Jupyter notebooks to develop their models. They start by pulling data from various data sources, such as S3 or Athena, and then use different machine learning techniques to create a model. The model is stored in S3 as a model artifact and deployed as a SageMaker endpoint. The endpoint is exposed to the world through an API Gateway.
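The consumer-facing side of this setup is a single call to the SageMaker runtime. A minimal sketch with boto3 is shown below; the endpoint name is hypothetical, and the call itself requires AWS credentials, so it is shown but not executed here:

```python
import json

def build_csv_payload(features):
    """Serialize one feature vector as a CSV line, the input format
    many built-in SageMaker algorithms expect at inference time."""
    return ",".join(str(f) for f in features)

def get_prediction(endpoint_name, features):
    """Call a deployed SageMaker endpoint (sketch only; needs AWS
    credentials and a live endpoint such as 'churn-model-endpoint')."""
    import boto3  # imported lazily so build_csv_payload stays testable offline
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=build_csv_payload(features),
    )
    return json.loads(response["Body"].read())
```

In the architecture above, a function like `get_prediction` would live behind the API Gateway (for example inside a Lambda function) rather than in the notebook itself.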

Challenges of machine learning project without MLOps

While this looks good for doing proof of concept, this setup has the following challenges:

  1. Changes rely on manual labour: Any change to the machine learning model requires manual work by the data scientists. It might involve re-running the Jupyter notebook cells or updating the SageMaker endpoint with the latest version of the model. This becomes cumbersome and difficult to scale if the code extends over multiple Jupyter notebooks.
  2. No versioning of the code: The code that the data scientists produce sits in the Jupyter notebooks. These notebooks are difficult to version and automate.
  3. No feedback loop: There is no automatic feedback loop in the process. If the quality of the model deteriorates, you will only find out through complaints from disgruntled customers.

These are some of the challenges that can be avoided by adopting MLOps.

MLOps architecture for small teams (1–3 data scientists)

MLOps can be complicated, and you don’t have to adopt all the features immediately. You can start with a minimal MLOps setup and gradually adopt more as your team grows or matures.

Let’s take the same machine learning setup explained above and introduce elements of MLOps into it. The architecture described below is suitable for a small company or a team of 1–3 data scientists.

MLOps architecture for small teams – Image by Author

In this architecture:

  1. Data scientists start with a SageMaker notebook instance. They version the code using CodeCommit (or any Git-based code repository) and version the environment for training the machine learning model using Docker containers stored in Elastic Container Registry (ECR). By versioning the code, environment, and model artifacts, you improve the reproducibility of the model and encourage collaboration within the team.
  2. You can use Step Functions or another workflow tool, such as Airflow or SageMaker Pipelines, to automate the re-training of the model. The re-training pipeline built by the data scientists uses the versioned code and environment to perform data pre-processing, model training, and model verification, and saves the model artifacts in S3. The pipeline can utilize various services, such as Glue jobs, EMR jobs, and Lambda functions, to complete its flow, and it can be automated using schedule-based event rules in EventBridge.
  3. The model and its versions are managed through a model registry service such as the SageMaker Model Registry. It stores the model and its metadata, such as hyperparameters, evaluation metrics, and bias & explainability reports, and allows you to view, compare, and approve or reject each model version. The actual model artifacts are stored in S3, and the model registry sits on top as an additional layer.
  4. Finally, the deployment of the model is automated using a Lambda function, which is triggered as soon as the model is approved in the SageMaker Model Registry. The Lambda function fetches the approved model from S3 and updates the version behind the SageMaker endpoints without downtime. These SageMaker endpoints are connected to Lambda and API Gateway to serve the consumer application and have an auto-scaling group attached to handle unexpected spikes in requests. You can further improve the deployment process by using a canary deployment: a small share of user requests is diverted to the new model first, and any error triggers a CloudWatch alarm to inform the data scientists. Over time, the share of requests sent to the new model increases until it receives 100% of the traffic.
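The approval-triggered deployment in step 4 can be sketched as a small Lambda handler. The event field names below are illustrative; check the actual EventBridge payload emitted for SageMaker model package state changes in your account, and note that the deployment call itself is elided:

```python
def extract_approved_model(event):
    """Pull the model package ARN out of a model-state-change event.
    Field names ('ModelApprovalStatus', 'ModelPackageArn') are
    illustrative assumptions about the EventBridge payload."""
    detail = event.get("detail", {})
    if detail.get("ModelApprovalStatus") != "Approved":
        return None
    return detail.get("ModelPackageArn")

def handler(event, context):
    """Lambda entry point: deploy a newly approved model version."""
    model_arn = extract_approved_model(event)
    if model_arn is None:
        return {"deployed": False}
    import boto3  # lazy import keeps the parsing logic testable offline
    sm = boto3.client("sagemaker")
    # Sketch: create a model + endpoint config for the approved package,
    # then update the existing endpoint in place so traffic shifts
    # without downtime. Resource names would be account-specific.
    ...
    return {"deployed": True, "model": model_arn}
```

Keeping the event-parsing logic separate from the AWS calls makes the handler easy to unit-test without credentials, which is useful once this Lambda becomes part of your release path.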

This architecture versions the code and the model, and automates re-training and deployment. It provides scalability through auto-scaling of the SageMaker endpoints and flexibility through canary deployments. As the data science team grows, however, we can introduce a few more elements of MLOps.

MLOps architecture for medium and large teams

This architecture extends the small-team MLOps setup into a multi-account configuration, with an emphasis on quality checks of the model running in production.

MLOps architecture for medium and large teams – Image by Author

In this architecture:

  1. Data scientists adopt a multi-account approach in which they develop the model in the development account.
  2. As in the small MLOps setup, data scientists start with a SageMaker notebook and version the code with CodeCommit and the environment with ECR. They then create a re-training pipeline using Step Functions, with steps for model training, verification, and artifact saving in S3. The versioning of the model is handled by the SageMaker Model Registry, which allows the user to accept or reject the model.
  3. The deployment of the model again involves SageMaker endpoints and auto-scaling groups connected to Lambda and API Gateway, which allow users to submit inference requests. However, these components sit in different AWS accounts. A multi-account strategy is recommended because it separates business units, makes it easy to define restrictions for production workloads, and provides a fine-grained view of the cost incurred by each component of the architecture.
  4. The multi-account strategy involves setting up a staging account alongside the production account. A new model is first deployed to the staging account and tested, and then deployed to the production account. This deployment happens automatically via CodePipeline in the development account, which is triggered by the event generated when a model version is approved in the model registry.
  5. It is imperative to monitor changes in the behaviour or accuracy of the models running in production. We enable data capture on the endpoints in the staging and production accounts, which records the incoming requests and outgoing inference results in S3 buckets. This captured data usually needs to be combined with labels or other data in the development account, so we use S3 replication to move it into an S3 bucket there. To tell whether the behaviour of the model or the data has changed, we need something to compare against, which is where the model baseline comes in. During training, we can generate a baseline dataset that records the expected behaviour of the data and the model, and then use SageMaker Model Monitor to compare the two datasets and generate a report.
  6. The final step in this architecture is to take action based on the monitoring report. When a significant change is detected, we can send an event that triggers the re-training pipeline.
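The decision in step 6 boils down to comparing live statistics against the baseline and emitting an event when they diverge. SageMaker Model Monitor runs a much richer statistical comparison; the mean-shift check below is a deliberately simplified stand-in, and the event source and detail type are made-up names for illustration:

```python
def drift_detected(baseline_stats, live_stats, threshold=0.2):
    """Flag drift when any feature's mean shifts by more than
    `threshold` relative to the baseline mean. A crude stand-in for
    Model Monitor's per-feature constraint checks."""
    for feature, base_mean in baseline_stats.items():
        live_mean = live_stats.get(feature, base_mean)
        denom = abs(base_mean) or 1.0  # avoid dividing by a zero baseline
        if abs(live_mean - base_mean) / denom > threshold:
            return True
    return False

def trigger_retraining(pipeline_arn):
    """Emit a custom EventBridge event that a rule can match to start
    the re-training pipeline. Sketch only: requires AWS credentials,
    and `pipeline_arn` is hypothetical."""
    import json, boto3  # lazy import keeps drift_detected testable offline
    events = boto3.client("events")
    events.put_events(Entries=[{
        "Source": "mlops.model-monitor",
        "DetailType": "ModelDriftDetected",
        "Detail": json.dumps({"pipeline": pipeline_arn}),
    }])
```

Routing the decision through an event, rather than calling the pipeline directly, keeps the monitoring and re-training components decoupled, which matters in the multi-account setup described above.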

Departing Thoughts

MLOps is a journey. You don’t have to use every feature of the most complicated architectural design immediately. You can start with basic versioning and automation, explore the features presented above, categorise them according to the needs of your business, and adopt them as they are needed. The architecture designs described above are not the only way to implement MLOps, but I hope they provide some inspiration to you as an architect.

If you found my article helpful, please leave me a comment.
