Interpreting the Business Considerations of MLOps

An assessment of the real-world constraints of cloud migrations

Mathieu Lemay
Towards Data Science

--

Photo by Kateryna Babaieva from Pexels.com

Let’s imagine Company A. Company A is a typical mid-tier industry leader. They already have a data science team, a very successful ML deployment, and a strong data infrastructure.

Additionally, compared to their competitors, they already have experts in both Cloud and data engineering. Their leadership team has even committed to a “Cloud-first” strategy while giving the teams flexibility in how they achieve that objective. Their IT security and data privacy teams have already been involved to ensure that no client PII will leak, so the ML team has carte blanche on future ML deployments.

Company A is presently looking to improve the quality of its client relationships by increasing the automation of its on-web interactions (both image and NLP-based), so there will be more models interacting with clients in the foreseeable future. Thanks to the success of their first WidgetBot, both the ML team and management want to continue down this path.

Their main Cloud provider, Awooz, does provide virtual machines with GPU compute capability. They’ve recently had a bad experience with Gupson, and so the team is a bit on edge regarding runaway costs, limited capabilities, and vendor lock-in.

Rather than committing to a turn-key MLOps/AutoML platform, they’d like to experiment with building their stack so that they can control each of the modules being deployed, an advantage for maintenance and internal knowledge transfer across team members.

They need to consider:

  • The sequence in which to automate their pipelines;
  • How much automation to add to their existing deployment; and
  • Where to control costs to avoid surprise bills.

How should Company A go about designing its machine learning systems to create the most client value (i.e., new models over time) while managing operational costs?

Backgrounder: CapEx and OpEx

Capital Expenditures (CapEx) and Operational Expenditures (OpEx) are the two main forward budgeting mechanisms for a corporation. CapEx is usually reserved for large-scale asset acquisitions (buildings and vehicles), whereas OpEx relates to day-to-day expenses (such as payroll and software subscriptions).

A comparison of CapEx and OpEx. (By the author.)

CapEx allows for a one-time purchase price, while OpEx absorbs costs over time. On the flip side, CapEx needs budgetary approval to greenlight the endeavor, while OpEx gnaws away at departmental profitability.

For Company A’s situation, “CapEx vs. OpEx” is a good proxy for the on-premise vs. Cloud IT infrastructure decision. Should they rent nodes from a Cloud provider, or buy and manage their own on-premise equipment?

Due to Cloud GPU instance pricing, however, Company A’s planned OpEx would be significantly higher than its competitors’. According to Cloud-GPUs.com, an NVIDIA V100-based virtual machine can easily cost $2,000 per month. With IT purchases amortized over a 3-year lifetime value cycle, buying equivalent hardware outright pays for itself in only a few months. Under these conditions, why consider Cloud at all?
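A back-of-the-envelope sketch of that payback claim, assuming an illustrative $10,000 purchase price for a comparable on-premise V100 server (the $2,000/month rental figure is from the article):

```python
# Payback period for buying a GPU server outright versus renting a
# comparable Cloud instance. The purchase price is an assumption for
# illustration; the rental rate is the article's figure.

purchase_price = 10_000    # assumed on-premise V100 server cost (USD)
monthly_rental = 2_000     # Cloud V100 instance cost (USD/month)
lifetime_months = 36       # 3-year IT lifetime value cycle

payback_months = purchase_price / monthly_rental
total_rental_cost = monthly_rental * lifetime_months

print(f"Payback period: {payback_months:.0f} months")
print(f"Rental cost over 3 years: ${total_rental_cost:,}")
```

At these (assumed) prices, the hardware pays for itself in 5 months, while renting for the full 3-year cycle would cost $72,000 in nominal terms.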

Sidebar: What drives a CapEx vs OpEx decision?

Although there are many factors, a CapEx/OpEx decision for acquiring a comparable product typically boils down to budget availability and cost of capital. Budget availability is usually signed off on a yearly basis (quarterly in a few cases).

Cost of capital, analogous to a company’s Internal Rate of Return (IRR), is the interest rate at which a company borrows against itself, serving as a proxy for opportunity cost. Large and/or public organizations typically have a budgetary IRR of 10% to 15%, depending on the industry and financial situation. Because of this rate, expenses can only be correctly compared through their Net Present Value (NPV) or Net Future Value (NFV).

The case for renting equipment over time rather than buying it upfront is that the Net Present Value (the perceived cost today) of the OpEx payments is lower than their total nominal value (the sum of all payments).
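The sidebar’s argument can be made concrete with a small discounting sketch. Assuming a 12% annual cost of capital (within the 10-15% range above) and the $2,000/month rental over a 3-year cycle:

```python
# NPV of renting at $2,000/month for 36 months, discounted at a 12%
# annual cost of capital with monthly compounding, versus the nominal
# total. Figures are illustrative.

monthly_rate = 0.12 / 12
payments = [2_000] * 36

npv = sum(p / (1 + monthly_rate) ** (t + 1) for t, p in enumerate(payments))
nominal = sum(payments)

print(f"Nominal total: ${nominal:,}")
print(f"NPV at 12%/yr: ${npv:,.0f}")
```

The same $72,000 stream of payments is worth roughly $60,000 in today’s dollars, which is why a CFO can rationally prefer renting even when the nominal total looks worse.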

The Value of MLOps

First, should Company A even consider increasing their migration to the Cloud? They have a functional team, a solid pipeline of activities, and by any measure some reliable successes that can simply be repeated. They’re growing at above-market rates, so they’re clearly firing on all cylinders.

The main issue at their growth stage is that compute requirements will scale linearly: more models mean more compute, and more compute means more IT management with its associated capital burden. Improving their product could well mean the end of their profitability if they execute their roadmap by purchasing more on-premise servers.

Bad, meet Worse

If IT acquisitions are bad, full Cloud engagements are equally unhealthy. Research and development servers are computationally insatiable due to the exploratory nature of data investigation and model testing. They are complex systems to set up and end up costing a fortune mostly from being idle.

Some unhealthy Cloud practices are:

  • Always-on GPU instances. When the training environment is set up, it is rare to see teams de-provision all of their hard work.
  • Multiple GPU instances. Similar to always-on instances, resource sharing is poor across members of larger teams, so multiple instances are required during working hours.
  • Manual installation of libraries. Every new instance requires a kabuki ritual of library installations, and only after everything is set up can the project actually start.

Note: I’m the first person to agree that instance pausing, job scheduling, and instance cloning are surefire ways to avoid these issues. But I’m also the first person to warn my clients about these risks and then gingerly offer these recommendations again once the first few projects have gone awry. Internal communication usually reveals unclear expectations among team members, hence the need for automation.

Not “Either/Or”

Instead of forcing a choice between the two budgeting strategies, there is actually a spectrum of opportunities in the transition phase.

Investing in automation reduces operational costs over time. (By the author.)

Usually, as the number of deployed models grows, the cost of retraining, packaging, and deploying them grows linearly. But if automation grows in step to tame the complexity of these additional models, then the budget stays under control while value increases.
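A toy cost model makes the trade-off explicit. All figures here are illustrative assumptions, not benchmarks:

```python
# Toy model: without automation, monthly operational cost grows linearly
# with the number of deployed models; a one-time automation investment
# cuts the marginal cost per model. Figures are illustrative.

manual_cost_per_model = 5_000      # monthly ops cost per model, manual
automated_cost_per_model = 1_000   # monthly ops cost per model, automated
automation_investment = 40_000     # one-time cost to build the automation

def monthly_cost(n_models: int, automated: bool) -> int:
    per_model = automated_cost_per_model if automated else manual_cost_per_model
    return n_models * per_model

# Savings of $4,000/model/month mean the $40,000 investment is repaid
# after 10 model-months; every model after that is pure margin.
for n in (2, 5, 10):
    print(n, monthly_cost(n, automated=False), monthly_cost(n, automated=True))
```

The point of the curve in the figure is exactly this: the automated line starts higher (the investment) but its slope is flatter, so it wins as soon as the model count grows.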

The Best of Both: Incremental Migration

In Company A’s case, they have the advantage of hindsight: they already have a successful deployment. Using this first success as a springboard for automation use-cases means that if done correctly, every future model will be deployed faster and cheaper.

A sequential MLOps implementation strategy. (By the author.)

In this eventual migration to Cloud, a hybrid Cloud/on-premise deployment provides the ultimate balance of cost management and flexibility. An end-to-end functional (not automated) pipeline is already present with the first delivery.

The next stage should not focus on accelerating data science or improving machine learning performance; instead, deployment is where the team’s automation energy should go. This includes everything from automatically provisioning virtual machines to pulling the latest models from the registry. Deployment should come first because any automation built upstream of it will need to be rebuilt to accommodate deployment verification and QA requirements.
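The “pull the latest model from the registry” step can be sketched in a few lines. The registry below is a hypothetical in-memory list; a real deployment would query an actual model registry service instead:

```python
# Minimal sketch of resolving "the latest production model" from a
# registry. The registry structure and model names are hypothetical
# stand-ins for a real registry API.

from operator import itemgetter

registry = [
    {"name": "widgetbot", "version": 1, "stage": "archived"},
    {"name": "widgetbot", "version": 2, "stage": "production"},
    {"name": "widgetbot", "version": 3, "stage": "staging"},
]

def latest_model(entries, name, stage="production"):
    """Return the highest-version entry for a model in a given stage."""
    candidates = [e for e in entries if e["name"] == name and e["stage"] == stage]
    if not candidates:
        raise LookupError(f"no {stage} model named {name!r}")
    return max(candidates, key=itemgetter("version"))

deployable = latest_model(registry, "widgetbot")
print(deployable)
```

Encoding this selection rule in code, rather than in a team member’s head, is precisely the kind of deployment automation that keeps QA requirements from being an afterthought.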

A client-facing product requires a lot of effort to ensure quality and reliability. The packaging of the inference models is therefore the likeliest thing to break at this step.

From there, using the same logic, automating the upstream steps, working your way back to data science, is how your team will create the most value in a given amount of time.

Chicken or the Egg?

Although it’s easy to preach that an MLOps Cloud transition should focus on cost-effective, scalable efforts, every client we work with is unique with their own history and set of challenges. Helping them unstick their internal transition projects is how we end up creating long-term value.

Here are a few general recommendations that we feel are applicable for almost every team:

  • Maximize your existing infrastructure. Keep training on-premise and inference in the Cloud. If you have GPUs on-site, then get every ounce of training out of them. They’re a sunk cost, yes, but already installed and operational. No need to move the most computationally expensive step to the Cloud just yet.
  • Deploy automation activities by modules and stages, not by projects. The more you can reuse code across steps, the more you’ll be able to scale on future projects.
  • Build your provisioning automation scripts as early as possible. Although it seems like it should happen later, this gives your team the confidence to de-provision training and inference instances as soon as possible with no productivity loss.
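The de-provisioning half of that last recommendation reduces to a simple policy: stop anything that has been idle too long. A sketch of that decision logic, with the instance records and idle threshold as hypothetical placeholders for what a real Cloud SDK would report:

```python
# Sketch of an idle-instance reaper's decision logic. The fleet records
# and the 30-minute threshold are illustrative; a real script would pull
# instance state from the Cloud provider's API and call its stop method.

IDLE_LIMIT_MINUTES = 30

def instances_to_stop(instances, idle_limit=IDLE_LIMIT_MINUTES):
    """Return the ids of running instances idle longer than the limit."""
    return [
        inst["id"]
        for inst in instances
        if inst["state"] == "running" and inst["idle_minutes"] >= idle_limit
    ]

fleet = [
    {"id": "gpu-1", "state": "running", "idle_minutes": 45},
    {"id": "gpu-2", "state": "running", "idle_minutes": 5},
    {"id": "gpu-3", "state": "stopped", "idle_minutes": 120},
]

print(instances_to_stop(fleet))
```

Running a policy like this on a schedule is what gives the team the confidence to de-provision aggressively: a stopped instance can always be re-provisioned by the same scripts.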


Matt Lemay, P.Eng (matt@lemay.ai) is the co-founder of lemay.ai, an international enterprise AI consultancy, and of AuditMap.ai, an internal audit platform.