7 DevOps skills for Machine Learning Operations

Lessons learned from successful MLOps implementation

Ricardo Mendes
Towards Data Science

Photo by Pietro Jeng on Unsplash

DevOps meets Machine Learning

MLOps has been a hot topic in 2021, with many people talking about it and companies aiming to implement it. The reason is clear: MLOps brings agile software development principles to machine learning projects, which means shorter release cycles and higher quality standards.

From the technology standpoint, the main pieces for successful MLOps implementation are available: the ability to train and serve ML models in containers, plenty of data pipeline orchestration tools, automated testing frameworks, and mature DevOps practices.

Having the technology pieces in hand does not mean success, though. Building MLOps teams is challenging due to the roles typically involved: Data Scientists, Machine Learning Engineers, Data Engineers, DevOps Engineers, and management staff. Experience shows that people in these roles do not necessarily speak the same language, and someone should be responsible for connecting the dots.

In a recent MLOps engagement, I was primarily focused on DevOps-related tasks. We achieved our milestones as a team, and I noticed DevOps Engineers can help in many ways, not only in technical matters but also in promoting a better teamwork experience. I don’t mean they are more important than others. But because DevOps is a cross-cutting role in the team, DevOps Engineers can take part, learn, and share knowledge from end to end, advocating for agile and engineering best practices while delivering the “operations” part of MLOps.

DevOps Engineers who want to work with Machine Learning Operations will undoubtedly leverage their expertise from “regular” projects. As you’ll see next, they should also be open to learning new stuff, especially cloud services for Big Data, Data Analytics, and AI/ML.

GitLab, Terraform, and Google Cloud

The experience I share in the present blog post comes from implementing MLOps with GitLab, Terraform, and Google Cloud managed services such as BigQuery, Cloud Storage, Vertex Feature Store, Vertex Pipelines, Vertex Models, and Vertex Endpoints.

We used Google Cloud’s blog post on Continuous delivery and automation pipelines in machine learning as a reference guide. Still, given that MLOps is a language-, framework-, platform-, and infrastructure-agnostic practice, I hope people using other tools can also benefit from this experience.

The seven skills

1. Source Code Management

This is an elementary skill for DevOps Engineers. They must know Git concepts such as branching, merging, and rebasing, along with the related commands. They should also know widely used workflows such as Gitflow and GitHub flow, or be able to design custom workflows that better fit the team’s needs.

2. Continuous Integration and Delivery

CI/CD knowledge is also mandatory in a DevOps Engineer’s toolbelt. It helps when working through a variety of topics with other team members, including but not limited to how to:

  • Decide on the tools to support the project’s CI/CD needs, such as code review support, container and other artifact registries, and the ability to integrate with third-party services (cloud providers, for instance);
  • Organize the repositories in a way that data preparation code, AI/ML code, CI/CD code, and infrastructure-related code are synchronized but managed independently;
  • Ensure the code is suitable for automated testing;
  • Set up containers to run the code in isolated environments;
  • Split the backlog into small tasks to be delivered continuously in short cycles.
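To make the "suitable for automated testing" point concrete, here is a minimal sketch, assuming a Python code base tested with pytest. The function `fill_missing_with_mean` is a hypothetical data preparation step invented for illustration, not code from the project described here. What matters is the shape: a small function with explicit inputs and outputs, and no hidden dependency on files or cloud services, is easy to check in a CI job.

```python
from typing import List, Optional, Sequence


def fill_missing_with_mean(values: Sequence[Optional[float]]) -> List[float]:
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("at least one known value is required")
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def test_fill_missing_with_mean() -> None:
    # pytest collects functions named test_* and reports failed assertions.
    assert fill_missing_with_mean([1.0, None, 3.0]) == [1.0, 2.0, 3.0]
```

Saved in a file whose name starts with `test_`, this runs with a plain `pytest` command inside the CI container.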

3. Infrastructure as Code

Automation is critical for MLOps. Having the infrastructure provisioned through code allows the team to easily replicate resources in the development, QA, and production environments. It also keeps an entire history of the changes applied to all environments, making troubleshooting and rollback tasks easier.

I recommend adopting infrastructure as code from the early days of an MLOps project. IaC tools such as Terraform may become complex and error-prone when handling resources that are also managed manually.

It is noteworthy that MLOps teams usually rely upon managed cloud services for data processing, storage, and model training, which means infrastructure provisioning in this context is not strictly related to networks or virtual machines but to buckets, database-like services, and cloud schedulers.
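The toy sketch below illustrates why mixing manual changes with infrastructure as code gets confusing. It is not how Terraform works internally (Terraform keeps its own state and talks to provider APIs); it only models the general idea of comparing a declared, desired state with what actually exists. The resource names and attributes are made up for illustration.

```python
from typing import Dict, List, Tuple

Resource = Dict[str, str]


def plan(desired: Dict[str, Resource], actual: Dict[str, Resource]) -> List[Tuple[str, str]]:
    """Return the (action, name) pairs needed to make `actual` match `desired`."""
    actions: List[Tuple[str, str]] = []
    for name in sorted(desired.keys() | actual.keys()):
        if name not in actual:
            actions.append(("create", name))
        elif name not in desired:
            actions.append(("delete", name))
        elif desired[name] != actual[name]:
            actions.append(("update", name))
    return actions


# Hypothetical resources declared in code.
desired = {
    "bucket/training-data": {"location": "EU"},
    "scheduler/daily-training": {"schedule": "0 3 * * *"},
}
# Same resources as they exist now, after someone edited the schedule by hand.
actual = {
    "bucket/training-data": {"location": "EU"},
    "scheduler/daily-training": {"schedule": "0 5 * * *"},
}

for action, name in plan(desired, actual):
    print(action, name)
```

In this toy model, the hand-edited schedule shows up as an update, so the next automated run would silently revert the manual change. That is the kind of surprise that managing everything through code from the start avoids.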

4. Programming languages

DevOps Engineers are usually responsible for instrumenting the containers that run automated tests or end-to-end pipelines created by Data or Machine Learning Engineers. Although they are not required to know their peers’ programming languages in depth, it helps to know at least the CLI tools the team uses most (e.g., pytest for Python) and how external dependencies are managed (e.g., the relationship between pip install and requirements.txt for Python). Such knowledge helps DevOps Engineers find elementary errors in the CI/CD pipeline and fix them on their own.
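For Python, the relationship is simple: `pip install -r requirements.txt` installs the packages listed in that file. The sketch below shows the kind of elementary check this knowledge enables, assuming the file contains only exact pins in the `name==version` form (real requirements files allow much more). The helper names and the package versions are illustrative.

```python
from typing import Dict, List


def parse_pinned_requirements(text: str) -> Dict[str, str]:
    """Map package name to pinned version for simple `name==version` lines."""
    pins: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blank lines and anything not pinned exactly
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins: Dict[str, str], installed: Dict[str, str]) -> List[str]:
    """Describe every pinned package that is missing or at a different version."""
    problems: List[str] = []
    for name, wanted in sorted(pins.items()):
        found = installed.get(name)
        if found is None:
            problems.append(f"{name}: not installed (wanted {wanted})")
        elif found != wanted:
            problems.append(f"{name}: installed {found}, wanted {wanted}")
    return problems


requirements = """
pytest==7.4.0
pandas==2.0.3  # data preparation
"""
# Illustrative snapshot of what the container image actually has installed.
installed = {"pytest": "7.4.0", "pandas": "1.5.3"}

for problem in find_mismatches(parse_pinned_requirements(requirements), installed):
    print(problem)
```

Here the check reports that pandas is installed at a different version than the one pinned, a typical cause of "works on my machine" failures in CI. In a real job, the installed versions could come from `importlib.metadata.version` instead of a hard-coded dictionary.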

Build-automation tools and scripting languages such as Make and Bash help the team streamline repetitive tasks. Bash is a must-have, and Make is somewhere between nice-to-have and must-have for DevOps Engineers who aim to work on MLOps.

Pro tip: when working with Vertex Feature Store in its early days, we realized the related Terraform module was a WIP, missing the ability to create some resources. One of our DevOps Engineers then needed to code a Python script to bypass this limitation, creating the resources through API calls. Your team may face similar scenarios when working with recently launched tools. That said, getting to know a programming language that allows you to interact with third-party services might be a plus.
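A script of this kind often follows a "create only what is missing" pattern, so it can be rerun safely from a CI/CD job. The sketch below shows that pattern only; it is not the script our team wrote and it does not call any real Google Cloud API. The `list_existing` and `create` callables are hypothetical stand-ins for whatever client calls your service offers, and the resource names are made up.

```python
from typing import Callable, Iterable, List


def ensure_resources(
    wanted: Iterable[str],
    list_existing: Callable[[], Iterable[str]],
    create: Callable[[str], None],
) -> List[str]:
    """Create every wanted resource that does not exist yet; return what was created."""
    existing = set(list_existing())
    created: List[str] = []
    for name in wanted:
        if name in existing:
            continue  # already there, so rerunning the script changes nothing
        create(name)
        existing.add(name)
        created.append(name)
    return created


# In-memory stand-in for a remote service, for demonstration only.
fake_service = {"customers"}

created = ensure_resources(
    wanted=["customers", "products"],
    list_existing=lambda: list(fake_service),
    create=fake_service.add,
)
print(created)
```

Passing the API calls in as functions keeps the logic testable without network access, which matters when the script itself runs inside the CI/CD pipeline.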

5. Custom containers

Lucky are the folks who always find ready-to-use container images. In real-world projects, we usually need to build custom images that fit specific use cases (I wrote about this in a previous blog post). It means you are expected to understand Docker concepts and best practices.

6. Pipeline orchestration

There are two main types of pipelines when working on MLOps: one for CI/CD and another for machine learning. CI/CD pipelines are orchestrated by GitLab CI, GitHub Actions, CircleCI, or similar tools, which are well documented, so I’m not going to repeat their docs here.

The machine learning pipelines, however, may vary considerably from project to project. They usually comprise data preparation, model training, model validation, model deployment, and model monitoring. Unlike CI/CD pipelines, which are triggered on Git events, ML pipelines are usually triggered on a schedule (daily, for instance). Depending on the technologies used, each step may turn into a sub-pipeline. A DevOps Engineer must understand how the team decides to orchestrate its pipelines in order to provide all the resources needed for successful execution.

For example, if the team uses Apache Airflow (including managed versions such as Cloud Composer or Astronomer), the DevOps job might be as simple as triggering the machine learning DAG. The orchestration code, in this case, is usually built by Data Engineers. But if orchestration tools like Airflow are not an option (trust me, it happens!), the DevOps Engineer might be in charge of provisioning cron-like schedulers, serverless functions, and other pieces intended to run the machine learning steps in the appropriate order.
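Whatever the tooling, "appropriate order" boils down to a dependency graph: each step runs only after the steps it depends on. As a minimal sketch of that idea, the snippet below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+) to order the typical steps listed above. The step functions are placeholders; in a real setup each one might invoke a serverless function or a managed job.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set

# Each step maps to the set of steps that must finish before it starts.
dependencies: Dict[str, Set[str]] = {
    "data_preparation": set(),
    "model_training": {"data_preparation"},
    "model_validation": {"model_training"},
    "model_deployment": {"model_validation"},
    "model_monitoring": {"model_deployment"},
}


def run_pipeline(
    dependencies: Dict[str, Set[str]],
    steps: Dict[str, Callable[[], None]],
) -> List[str]:
    """Run every step after its dependencies and return the execution order."""
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        steps[name]()
    return order


def make_step(name: str) -> Callable[[], None]:
    # Placeholder step that only announces itself.
    return lambda: print(f"running {name}")


steps = {name: make_step(name) for name in dependencies}
order = run_pipeline(dependencies, steps)
```

A scheduler would then call `run_pipeline` once a day. Declaring dependencies instead of hard-coding the sequence also makes it easier to add a step later, or to split one step into a sub-pipeline.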

7. Teamwork

Photo by Antonio Janeski on Unsplash

Last but not least, a few words on teamwork. As mentioned before, DevOps is a cross-cutting role in MLOps teams, so DevOps Engineers are in touch with all other roles. To do their job, they need to talk with other engineers to learn more about the programming languages, tools, and services they are using. They also need to understand how each step of the ML pipeline interacts with the others (e.g., what data they share and the order in which they are executed).

Additionally, remember that some teammates may not be familiar with agile and DevOps practices.

Teamwork is therefore a crucial skill for DevOps Engineers. To thrive, they are expected to put themselves in their teammates’ shoes from time to time, empathize, keep in mind that not everyone is on the same page when it comes to automation readiness (especially in new teams), and pursue effective communication.

Wrapping up

I know I’ve covered a lot of stuff. Nevertheless, it does not mean everyone must master all the skills. It’s common to see teammates sharing their expertise to increase DevOps adoption.

Longer-term, the DevOps mindset should be shared among all team members to get better results. In addition, the culture should be organically spread across the team so everyone sees the benefits of higher quality standards and shorter delivery cycles.

At the end of the day, all team members are responsible for a successful MLOps implementation.

Thanks for reading so far!
