Scaling Enterprise MLOps with Modern Cloud Operations

Step-by-step guide to scaling an enterprise ML platform on AWS.

Joint post with Nivas Durairaj

The Gutenberg printing press was revolutionary in its time. Suddenly, publishers could print thousands of pages per day, compared with a few handwritten ones. It enabled the rapid dissemination of knowledge across Europe and helped usher in the Renaissance.

Photo by Henk Mul on Unsplash

Today, large enterprises need to deliver hundreds of ML projects to their business, and to do so in a secure and governed manner. To accelerate ML delivery, they need to provision ML environments with guardrails in minutes: a printing press that gives ML teams quick access to working environments and the scaffolding to operationalize their solutions.

I recently published a guide on how to do that on AWS. Here we will put it into practice: I will share, in 3 steps, how you can modernize your cloud operations to scale ML delivery.

Image by author

Walkthrough overview

We will tackle this in 3 steps:

  • We will first set up our ML platform foundations with AWS Control Tower and AWS Organizations. We will adopt a multi-account strategy, with each ML project operating in a separate account.
  • Then, we will enable self-service provisioning of templated ML environments with AWS Service Catalog and Amazon SageMaker. ML teams will be able to provision approved environments in their accounts in minutes.
  • Finally, we will see how ML teams can launch and access their governed ML environments.

Prerequisites

To follow this post, make sure you have covered the following:

  1. If AWS Control Tower is new to you, visit Using AWS Control Tower to Govern Multi-Account AWS Environments.
  2. Read Setting up secure, well-governed machine learning environments on AWS. We will apply the concepts it presents.
  3. Get familiar with Enabling self-service provisioning of Amazon SageMaker Studio resources. For self-service, we will reuse its approach and Service Catalog portfolio.

Step 1: Enabling ML projects with modern cloud operations

First, we want ML teams to get access to a secure and compliant AWS account every time they start a new project. Here, we keep it simple and create one AWS account per project.

Image by author: We will use Control Tower to automate account provisioning under a Workloads OU. The accounts will automatically inherit the governance we apply to the OU.

Setting up your Landing Zone and creating a Workloads OU

Navigate to the AWS Control Tower console to set up your landing zone. See Getting started with AWS Control Tower for details on how to launch one.

Image by author: I create the Workloads OU in the OU configuration page

Once launched, it should take about half an hour for the process to finish.

Image by author: You should now see the foundational OU (Security) and the additional OU (Workloads) in Control Tower

Note – The log archive account under the Security OU can act as a consolidation point for log data gathered from all the accounts under the Workloads OU. Your security, operations, audit, and compliance teams can use it to support regulatory requests.

Applying Control Tower guardrails and SCPs for ongoing governance

You can set up Control Tower guardrails to provide ongoing governance for your overall AWS environment, and service control policies (SCPs) to cap the permissions available to all accounts under the Workloads OU.

For illustrative purposes, we will use the same example SCP as in this blog. It prevents ML teams from launching SageMaker resources in their accounts, unless a VPC subnet is specified:

Images by author: Navigate to AWS Organizations and create a new SCP. I called mine "sagemaker-enforce-vpc".
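
If you prefer to script this step, here is a minimal sketch of what such an SCP could look like, built as a Python dictionary. The list of denied actions and the statement ID are my assumptions for illustration and may differ from the policy in the referenced blog. The `org_client` argument is expected to be a boto3 Organizations client, and the OU ID is a placeholder you would look up in your own organization.

```python
import json

# Actions denied unless the request names VPC subnets. This list is an
# assumption for illustration; adjust it to the resources you govern.
DENIED_WITHOUT_VPC = [
    "sagemaker:CreateHyperParameterTuningJob",
    "sagemaker:CreateProcessingJob",
    "sagemaker:CreateTrainingJob",
    "sagemaker:CreateModel",
]


def build_enforce_vpc_scp(actions=DENIED_WITHOUT_VPC):
    """Return an SCP document denying the actions when no subnet is given."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenySageMakerWithoutVpc",  # hypothetical statement ID
                "Effect": "Deny",
                "Action": list(actions),
                "Resource": "*",
                # "Null": "true" matches requests where the key is absent.
                "Condition": {"Null": {"sagemaker:VpcSubnets": "true"}},
            }
        ],
    }


def create_and_attach_scp(org_client, ou_id, name="sagemaker-enforce-vpc"):
    """Create the SCP and attach it to the OU. Returns the policy ID.

    `org_client` is expected to be boto3.client("organizations"),
    created in the management account.
    """
    response = org_client.create_policy(
        Name=name,
        Description="Deny SageMaker jobs and models without VPC subnets",
        Type="SERVICE_CONTROL_POLICY",
        Content=json.dumps(build_enforce_vpc_scp()),
    )
    policy_id = response["Policy"]["PolicySummary"]["Id"]
    org_client.attach_policy(PolicyId=policy_id, TargetId=ou_id)
    return policy_id
```

Because the policy is attached to the Workloads OU rather than to individual accounts, any project account created under the OU later inherits it without further action.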

Creating project accounts with the account factory

You can now create new accounts on demand, using the Control Tower Account Factory.

Image by author: In my case I create MLProjectA and MLProjectB accounts under the Workloads OU.

The account factory is a Service Catalog product so you can create accounts through the UI. When scaling, you may create them programmatically.
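
As a rough sketch of the programmatic route, the snippet below provisions the Account Factory product through the Service Catalog API. The parameter keys are the ones the Account Factory form showed me at the time of writing, so treat them as an assumption and check them against your product version. The product and artifact IDs are placeholders, and `sc_client` is expected to be a boto3 Service Catalog client.

```python
def account_factory_parameters(account_name, account_email, ou_name,
                               sso_email, sso_first_name, sso_last_name):
    """Build the parameter list the Account Factory product asks for.

    The keys below are assumed from the Account Factory form; verify
    them against the product version in your landing zone.
    """
    values = {
        "AccountName": account_name,
        "AccountEmail": account_email,
        "ManagedOrganizationalUnit": ou_name,
        "SSOUserEmail": sso_email,
        "SSOUserFirstName": sso_first_name,
        "SSOUserLastName": sso_last_name,
    }
    return [{"Key": key, "Value": value} for key, value in values.items()]


def create_project_account(sc_client, product_id, artifact_id, **account):
    """Provision one account and return the Service Catalog record ID.

    `sc_client` is expected to be boto3.client("servicecatalog").
    """
    response = sc_client.provision_product(
        ProductId=product_id,
        ProvisioningArtifactId=artifact_id,
        ProvisionedProductName=account["account_name"],
        ProvisioningParameters=account_factory_parameters(**account),
    )
    return response["RecordDetail"]["RecordId"]
```

A call for the first project could pass account_name="MLProjectA" and ou_name="Workloads", together with an email address unique to that account.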

Managing user authentication and permissions

Next, you need to manage user authentication and permissions in the accounts. Here I use AWS SSO (now AWS IAM Identity Center) to manage those, and you can follow the process in this video. Feel free to use the identity provider of your choice:

Images by author: I created 2 users: Jon Doe (in the DataScientists group) and Mike Smith (in the MLEngineers group). I also created custom permission sets for users to have when logged into their accounts.

SageMaker provides service-specific resources, actions, and condition context keys you can add to the permission sets. Also see this page for managing permissions to other AWS services.

You can create permission sets for different ML project personas. Here are a few examples:

  • Data scientist: They can experiment with ML approaches.
  • ML engineer: They can handle CI/CD, model monitoring, and artifacts.
  • ML platform engineer (admin): Administrator.
  • Audit and compliance team: They have read access to the Log Archive account, where they can verify the compliance of workloads.

Image by author: As admin, I gave Jon Doe access to the MLProjectA account. As Jon Doe, I can go to the SSO login page and see which project accounts I have access to.
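
To make the persona idea more concrete, here is a small sketch that generates an inline policy per persona. The persona names and the action lists are hypothetical and deliberately short. They are not the permission sets I used above, and a real permission set would need more (for example S3 access and iam:PassRole for the SageMaker execution role).

```python
import json

# Hypothetical persona-to-action mapping, for illustration only.
PERSONA_ACTIONS = {
    "DataScientist": [
        "sagemaker:CreatePresignedDomainUrl",
        "sagemaker:CreateTrainingJob",
        "sagemaker:CreateProcessingJob",
        "sagemaker:Describe*",
        "sagemaker:List*",
    ],
    "MLEngineer": [
        "sagemaker:CreatePipeline",
        "sagemaker:StartPipelineExecution",
        "sagemaker:CreateModelPackage",
        "sagemaker:CreateMonitoringSchedule",
        "sagemaker:Describe*",
        "sagemaker:List*",
    ],
    "Auditor": [
        "cloudtrail:LookupEvents",
        "s3:GetObject",
        "s3:ListBucket",
    ],
}


def inline_policy_for(persona):
    """Return an IAM policy JSON string for the persona.

    Raises KeyError if the persona is not in PERSONA_ACTIONS.
    """
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": persona + "Access",
                "Effect": "Allow",
                "Action": PERSONA_ACTIONS[persona],
                # Broad on purpose to keep the sketch short; scope this
                # down to specific ARNs in a real permission set.
                "Resource": "*",
            }
        ],
    }
    return json.dumps(policy, indent=2)
```

The resulting JSON can be pasted into the inline policy of a custom permission set in the console. Remember that the SCP on the Workloads OU still applies on top of whatever a permission set allows.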

You should now have multi-account foundations in your ML platform. When a new ML project starts, you can create a new AWS account with guardrails, and provide users with access to it. The process takes only a few minutes.


Step 2: Self-serving templated ML environments

Now that your ML teams have access to accounts within minutes, they need to access working environments, and scaffolding to operationalize their solutions.

We will create a Service Catalog portfolio in the Control Tower management account, and share it with the ML project accounts.

Images by author: We create a portfolio of governed products in the management account (left), and share it with an ML project account (right)

Creating a Service Catalog portfolio in the management account

For this you can reuse the approach from Enabling self-service provisioning of Amazon SageMaker Studio resources. It will allow you to automate the deployment of SageMaker products using the AWS Service Catalog Factory.

Image by author: Service Catalog portfolio with templated SageMaker resources

Note – For illustrative purposes, I put the Service Catalog Factory in the Control Tower management account. In real life, your ML platform team may have dedicated accounts to build, test, and deploy products and portfolios.

Sharing the portfolio with the ML project accounts

Now we will share the Service Catalog portfolio with all accounts under the Workloads OU.

Image by author

The process is straightforward, and you can follow the steps in this video:

Image by author: I share the portfolio with the Workloads OU.
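
The same share can be created through the API. Below is a minimal sketch, assuming `sc_client` is a boto3 Service Catalog client in the account that owns the portfolio and that sharing with AWS Organizations has already been enabled for Service Catalog. The portfolio and OU IDs are placeholders.

```python
def share_portfolio_with_ou(sc_client, portfolio_id, ou_id):
    """Share a portfolio with every account under an organizational unit.

    `sc_client` is expected to be boto3.client("servicecatalog").
    Returns the token identifying the share operation.
    """
    response = sc_client.create_portfolio_share(
        PortfolioId=portfolio_id,
        OrganizationNode={"Type": "ORGANIZATIONAL_UNIT", "Value": ou_id},
    )
    return response["PortfolioShareToken"]
```

Sharing at the OU level is what lets new project accounts pick up the portfolio without a per-account share.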

As an ML platform admin, you can log into the project accounts and accept the portfolio.

Images by author: You will need to input the portfolio ID present in the Management Account.

Then, you can provide ML teams with access to the imported portfolio in their account.

Image by author: Here I give access to the portfolio to users having DataScientist or MLEngineer roles.
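
The two admin steps above (accepting the shared portfolio, then granting access) could be scripted along these lines. This is a sketch under assumptions: `sc_client` is a boto3 Service Catalog client created in the project account, and the role ARNs are placeholders for the IAM roles your permission sets create in that account.

```python
def import_and_grant(sc_client, portfolio_id, principal_arns):
    """Accept an organization-shared portfolio, then grant roles access.

    `sc_client` is expected to be boto3.client("servicecatalog") in the
    project account. Returns the number of principals associated.
    """
    sc_client.accept_portfolio_share(
        PortfolioId=portfolio_id,
        PortfolioShareType="AWS_ORGANIZATIONS",
    )
    for arn in principal_arns:
        sc_client.associate_principal_with_portfolio(
            PortfolioId=portfolio_id,
            PrincipalARN=arn,
            PrincipalType="IAM",
        )
    return len(principal_arns)
```

Here `portfolio_id` is the portfolio ID from the management account, the same value you would type into the console.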

From now on, creating new ML project accounts should take no more than a few minutes. Updates you make to the Service Catalog portfolio will be reflected automatically in project accounts, allowing you to continuously deploy new products.


Step 3: Launching an ML environment in a project account

Now for the easy part! We will use one of our SSO users to log into the MLProjectA account and launch SageMaker Studio.

Image by author: As Jon Doe, I log into the MLProjectA account and go to the Service Catalog console page.
Images by author: Using Service Catalog, I launch a new SageMaker Studio domain (a one-off step) and create a user profile for myself.
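
For teams that prefer scripts to clicks, the launch can also be sketched with the Service Catalog API. The product name, version name, and parameters passed in are hypothetical and depend on what your portfolio contains. `sc_client` is expected to be a boto3 Service Catalog client using the logged-in user's credentials.

```python
import time

# Record statuses after which polling can stop.
TERMINAL_STATUSES = ("SUCCEEDED", "FAILED")


def launch_product(sc_client, product_name, artifact_name,
                   provisioned_name, parameters=None):
    """Launch a product by name and return the provisioning record ID."""
    response = sc_client.provision_product(
        ProductName=product_name,
        ProvisioningArtifactName=artifact_name,
        ProvisionedProductName=provisioned_name,
        ProvisioningParameters=[
            {"Key": key, "Value": value}
            for key, value in (parameters or {}).items()
        ],
    )
    return response["RecordDetail"]["RecordId"]


def wait_for_record(sc_client, record_id, poll_seconds=10,
                    max_polls=60, sleep=time.sleep):
    """Poll the record until it finishes or max_polls is reached.

    Returns the last status seen, which may be non-terminal on timeout.
    """
    status = "CREATED"
    for _ in range(max_polls):
        detail = sc_client.describe_record(Id=record_id)["RecordDetail"]
        status = detail["Status"]
        if status in TERMINAL_STATUSES:
            break
        sleep(poll_seconds)
    return status
```

The same two functions work for the user profile product: launch the domain once per account, wait for it to succeed, then launch one profile per user.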

Note – For illustrative purposes, the example Studio domain will look for the default VPC with a public subnet. I get one with the aws ec2 create-default-vpc CLI command. In the real world, you will want the Studio domain to run in a private subnet.
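
If you script your setup, an equivalent of that CLI command could look like the sketch below. It first checks whether a default VPC already exists in the region, and only creates one if none is found. `ec2_client` is expected to be a boto3 EC2 client, and as above this is for demos only.

```python
def ensure_default_vpc(ec2_client):
    """Return the default VPC ID for the region, creating it if missing.

    `ec2_client` is expected to be boto3.client("ec2").
    """
    found = ec2_client.describe_vpcs(
        Filters=[{"Name": "is-default", "Values": ["true"]}]
    )["Vpcs"]
    if found:
        return found[0]["VpcId"]
    # No default VPC in this region yet, so ask EC2 to create one.
    return ec2_client.create_default_vpc()["Vpc"]["VpcId"]
```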

Image by author: I can now access a SageMaker Studio environment in a few clicks! 🚀
Image by author: I can now access a SageMaker Studio environment in a few clicks! 🚀

Conclusion

A multi-account strategy and self-service provisioning of governed ML environments allow enterprises to scale their ML delivery. Together, they let ML teams start working in approved environments within minutes.

In this post, I have shared how an ML platform team can quickly provision secure, well-governed ML environments with AWS Control Tower, AWS Organizations, AWS Service Catalog, and Amazon SageMaker.

To go further, you can visit the AWS Management and Governance Lens and Industrializing an ML platform with Amazon SageMaker Studio.

