Comparing Cloud MLOps Platforms, From a Former AWS SageMaker PM

How do the two big clouds compare today for ML platform tooling? What are the features that matter?

Alex Chung
Towards Data Science


Companies applying machine learning (ML) across the organization have a string of tools stitched together as their ML platform. As organizations scale, every ML engineer, ML architect, and CIO team should re-evaluate that architecture, especially as the big-name cloud vendors make their annual announcements. As a former Amazon SageMaker Senior Product Manager, I’m going to give an overview of the current GCP Vertex versus SageMaker landscape and how I think about the different tools.

MLOps is rarely a single monolithic system. Instead, it’s composed of a number of smaller tooling stages, what I call the “MLOps Big Eight”: data collection, data processing, feature engineering, data labeling, model design, model training, model optimization, and model deployment and monitoring.

Image by Author. The “Big Eight” steps, from the open-source foundation Social Good Technologies.

Introduction to ML platform approaches

Large cloud vendors have built “end-to-end” ML platforms. A data scientist using Amazon SageMaker can pull data from their data warehouse, write model code, and deploy to production without leaving the tool suite. While the cloud vendors are the most talked-about way to get started, startups such as Dataiku, DataRobot, C3.ai, and H2O.ai aim to solve the same challenges.

The alternative to an end-to-end platform is a “best-in-breed” tool, which requires a vendor to focus on becoming the thought-leading product in a single domain. While both GCP Vertex and Seldon have model-serving capabilities, a proficient ML engineer will discover that Seldon’s product has features, such as inference graphs and native Kubernetes deployment, that many customer use cases require. End-to-end ML platforms typically take 15–36 months to catch up on feature parity, but their products launch much sooner than that to build thought leadership and gather product feedback.
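To make the inference-graph idea concrete, here is a minimal sketch of a Seldon Core SeldonDeployment that chains a transformer in front of a model, applied with the official Kubernetes Python client. The names, container images, and namespace are illustrative placeholders, and it assumes a cluster that already has Seldon Core installed.

```python
# Minimal sketch: a Seldon Core inference graph (transformer -> model).
# All names, images, and the namespace are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()  # assumes kubeconfig for a cluster running Seldon Core

seldon_deployment = {
    "apiVersion": "machinelearning.seldon.io/v1",
    "kind": "SeldonDeployment",
    "metadata": {"name": "fraud-detector", "namespace": "models"},
    "spec": {
        "predictors": [{
            "name": "default",
            "replicas": 1,
            "componentSpecs": [{
                "spec": {
                    "containers": [
                        {"name": "preprocess", "image": "myrepo/preprocess:0.1"},
                        {"name": "classifier", "image": "myrepo/classifier:0.1"},
                    ]
                }
            }],
            # The inference graph: requests pass through the transformer
            # before reaching the model.
            "graph": {
                "name": "preprocess",
                "type": "TRANSFORMER",
                "children": [{"name": "classifier", "type": "MODEL", "children": []}],
            },
        }]
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="machinelearning.seldon.io",
    version="v1",
    namespace="models",
    plural="seldondeployments",
    body=seldon_deployment,
)
```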

At this point, there’s a significant amount of overlap between the major cloud vendor platforms. In fact, many of the net-new announcements from Google I/O 2021 were the features SageMaker launched at re:Invent 2020. What differentiates Google’s strategy from other cloud vendors is that a number of open-source MLOps projects originated from Google Brain, and GCP now offers them as managed services. Vertex AI Pipelines is a managed Kubeflow Pipelines service, the Vertex Metadata API is nearly identical to MLMD (ML Metadata), and Vertex also has APIs for hosting TensorBoard training artifacts. This approach to building products means customers have portability: the same tools are available in open source for them to run on AWS or on-premises.
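As a rough illustration of that portability, the sketch below defines a pipeline with the open-source Kubeflow Pipelines SDK (kfp v2), compiles it, and submits the compiled artifact to Vertex AI Pipelines. The project, region, and bucket are placeholders; the same compiled file could instead be submitted to a self-managed Kubeflow Pipelines cluster.

```python
# Minimal sketch: one Kubeflow Pipelines definition, runnable on managed
# Vertex AI Pipelines or a self-hosted Kubeflow cluster. Assumes kfp v2 and
# google-cloud-aiplatform are installed; project/bucket names are placeholders.
from kfp import compiler, dsl


@dsl.component
def say_hello(name: str) -> str:
    return f"Hello, {name}!"


@dsl.pipeline(name="hello-pipeline")
def hello_pipeline(name: str = "world"):
    say_hello(name=name)


# Compile to a portable pipeline spec.
compiler.Compiler().compile(hello_pipeline, package_path="hello_pipeline.json")

# Option A: run on the managed service (GCP Vertex AI Pipelines).
from google.cloud import aiplatform

aiplatform.init(project="my-gcp-project", location="us-central1")  # placeholders
aiplatform.PipelineJob(
    display_name="hello-pipeline",
    template_path="hello_pipeline.json",
    pipeline_root="gs://my-bucket/pipeline-root",  # placeholder bucket
).run()

# Option B: submit the same compiled spec to an open-source Kubeflow
# Pipelines cluster instead, e.g. via
# kfp.Client(host="<kfp-endpoint>").create_run_from_pipeline_package("hello_pipeline.json")
```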

Comparison of Amazon SageMaker versus GCP Vertex

Every couple of months I spend time re-evaluating my 18-month MLOps industry roadmap. This roadmap covers what I consider the key features that should satisfy 90% of what enterprises need to train and serve models. It’s assembled from the nuggets of product feature requests I hear in conversations with ML engineers, CIOs, and other PMs, and from what I’ve observed in my own work. I put these features into a list and describe the current ML offerings against it.

A comparison of GCP Vertex and SageMaker across commonly required and best-in-breed features.

ML platforms offer two classes of features: required “table stakes” features and “best-in-breed” features.

The best-in-breed features are highlighted in bold. Teams that are newer to ML will find the table-stakes features sufficient, letting them move quickly to deploy models. Both GCP Vertex and Amazon SageMaker have invested enough to clear this minimum hurdle.

However, those features alone are not sufficient to run models at scale in production. The best-in-breed features, such as inference graphs, matter to ML teams that already have many models in their organization. Every stage of the Big Eight has its own set of best-in-breed features to consider.

Cloud platforms are still maturing their best-in-breed features, even when they have a public product around them today. While GCP is more honest and labels Vertex Metadata and its other best-in-breed tools pre-GA, they really aren’t usable at scale (yet). I highlighted on Twitter that Vertex Metadata (experiment tracking) doesn’t have a Python SDK. SageMaker has similar deficiencies. One example is its feature store, which lacks an entity model for organizing the saved features. It takes deeper testing to uncover the technical debt that is still there.
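To show what that looks like in practice, here is a minimal sketch of creating a SageMaker Feature Store feature group: features hang off a flat record identifier (here customer_id) rather than a richer entity model. The group name, S3 bucket, and IAM role are hypothetical placeholders.

```python
# Minimal sketch: a SageMaker Feature Store feature group. Features are keyed
# by a flat record identifier; there is no higher-level entity model.
# The group name, S3 bucket, and IAM role ARN are hypothetical placeholders.
import time

import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()

df = pd.DataFrame(
    {
        "customer_id": ["c-001", "c-002"],
        "avg_order_value": [42.0, 17.5],
        "event_time": [time.time()] * 2,
    }
).astype({"customer_id": "string"})  # object dtype is rejected by the SDK

fg = FeatureGroup(name="customer-features", sagemaker_session=session)
fg.load_feature_definitions(data_frame=df)  # infer feature names and types
fg.create(
    s3_uri="s3://my-bucket/feature-store",  # offline store location (placeholder)
    record_identifier_name="customer_id",   # the only grouping concept available
    event_time_feature_name="event_time",
    role_arn="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    enable_online_store=True,
)
```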

Enterprise teams that need the advanced features in bold should be ready to build their own solutions for their specific use cases or write “glue” code to alternatives (open-source tools or other vendors). This glue is a concept I’ve heard investors call ML orchestration, and it’s a new class of MLOps tools that will emerge.
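As a trivial example of such glue, the sketch below pulls the model artifact produced by a SageMaker training job so it can be handed off to a best-in-breed serving tool outside AWS. The training job name is a placeholder, and the hand-off step is left abstract.

```python
# Minimal sketch of "glue" code between an end-to-end platform and a
# best-in-breed tool: fetch a SageMaker training job's model artifact
# for deployment elsewhere. The job name is a hypothetical placeholder.
import boto3

sm = boto3.client("sagemaker")
job = sm.describe_training_job(TrainingJobName="my-training-job")
model_uri = job["ModelArtifacts"]["S3ModelArtifacts"]  # s3://.../model.tar.gz

# From here, glue code would download or repackage the artifact and register
# it with the serving platform of your choice (e.g. a Seldon deployment).
print(f"Model artifact ready for hand-off: {model_uri}")
```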

Takeaways

Doing evaluations like this exposes the gap between what a CIO or VP of engineering needs to build to solve current challenges and what they could buy if they had no MLOps platform team at all. The list of needs is always growing, and the sufficient features don’t come from the end-to-end ML platforms alone. For more complex enterprises with power users, the exercise brings clarity to the future of their ML platform in the classic build-versus-buy decision. For startups and smaller companies, an MLOps platform engineering team isn’t cost-effective, and using SageMaker or Vertex will pass muster.

After AWS, I started a non-profit MLOps software foundation called Social Good Tech to address the interoperability of tools. I plan to write more on how to think through architecting ML platforms. Follow along for my analysis of the MLOps industry and the debates evolving in the market.


MLOps in Enterprises. Sharing experiences from building internal products at AWS SageMaker, Facebook, and Lyft. www.awchung.com