Machine Learning doesn’t occur in a vacuum, so why develop it in one?

A review of the ML/DS development ecosystem: from a Software Engineer’s perspective

Robbie Anderson
Towards Data Science
9 min read · Apr 20, 2021



FYI: Everything I talk about concerning Machine Learning is also true of any Data Science work; while a Data Scientist’s work might be focussed internally, if it’s used to make decisions for the business then it should be reliable and dependable.

Software development has come a long way in the last 30 years, with thousands of tools developed to aid in the complex task of writing code. While many of these tools have faded into obscurity, some, such as git, have become staples of software development worldwide. Many of these tools focus on abstracting away complexity so developers can concentrate on writing application-specific code. Think back to the early 2000s, when it often took days to deploy and scale a database appropriately, compared to the three clicks it takes to create a near-infinitely scalable DynamoDB table today. Why? The rise of Cloud Service Providers has allowed most companies to shed infrastructure management entirely, with developers building business logic on top of these services, free from worrying about whether they’ll have enough server racks. These providers have abstracted away both the hardware and the infrastructure layers to accelerate cloud application development. This acceleration, from years to months in many cases, has powered thousands of startups, leading to the veritable boom of new apps and services we are currently experiencing. Cloud is no longer just for the big players but is accessible to any developer.

How does this relate to Machine Learning and Data Science? They are currently at the same point Cloud development was at before cloud providers existed. With no service that reduces complexity across the board, engineers are required not only to understand the fundamentals of their discipline but also to develop, host, and maintain the code they build. This slows down development drastically and limits machine learning at scale to the large tech companies with the budgets to afford an infrastructure team.

This isn’t to say there aren’t libraries out there to help: some fantastic tools are cropping up, such as MLflow, Google Colab, and Feast. But they aren’t all-encompassing; each fulfils specific criteria and no more.

To illustrate this, let’s say you have a model developed in Python and running in a Jupyter Notebook, the standard go-to development tool, and you want to ship it to end consumers. How do you do it? Export the model as a zip? What if you want to retrain the model in the future? Who writes the code to utilize the model? How is that deployed? Where are the results stored? I think it’s best summarised by a diagram produced by the Metaflow team [1].

A typical workflow from a Data Scientist [1]

As a Cloud Engineer by trade, I look at workflows like this and cringe. Any CD pipeline should complete as quickly and with as few steps as possible so that development time is maximized. If your deployment workflow takes 5 minutes, you can run five times fewer test iterations than someone whose workflow takes 1 minute. Machine Learning is hard; let’s give engineers as much time as possible to work through problems.

So far, I’ve insulted a common workflow and provided no answers, so let’s define some criteria, and look at how we can improve!

Criteria for a successful workflow

  • A development environment that matches production* as closely as feasible
  • A fast and repeatable deployment cycle for testing in development environments
  • Easy access to both the logs and results produced
  • Easily version controlled

How about an example task to illustrate the requirements defined above? Let’s say our ML engineer Geralt works for a company that sells burgers online and has been tasked with building a recommender system that suggests different combinations of burgers to customers, based on daily data updates the company receives. This job needs to be developed and then pushed to production, where it will run daily for each customer and dump that day’s recommendations into a database.
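
To make the shape of the task concrete, here’s a minimal sketch of what that daily job might look like. Everything here is hypothetical: load_daily_orders and RecommenderModel stand in for Geralt’s real data access and model code, and SQLite stands in for the company’s actual database.

    # daily_recommendations.py - a hypothetical sketch of Geralt's daily job.
    # load_daily_orders and RecommenderModel are illustrative stand-ins, and
    # SQLite stands in for whatever database the company actually uses.
    import json
    import sqlite3
    from datetime import date

    def run_daily_job(db_path="recommendations.db"):
        orders = load_daily_orders(date.today())    # hypothetical daily data pull
        model = RecommenderModel.load("model.pkl")  # hypothetical trained model
        conn = sqlite3.connect(db_path)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS recs (customer_id TEXT, day TEXT, items TEXT)"
        )
        for customer_id, history in orders.items():  # one order history per customer
            items = model.recommend(history, n=3)    # top-3 burger combos
            conn.execute(
                "INSERT INTO recs VALUES (?, ?, ?)",
                (customer_id, date.today().isoformat(), json.dumps(items)),
            )
        conn.commit()
        conn.close()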

A development environment that mimics production

Now, this might seem a little strange to anyone just getting into Machine Learning/Data Science work, but it’s crucial when working across different environments and platforms. Whenever you integrate code into a production system for the first time there are always issues: missing dependencies, different OS versions, and unexpected bottlenecks, all of which cause bugs. If you can develop in an environment that is as near to production as possible, you can eliminate these issues during the development phase. If Geralt develops locally and then pushes to a remote instance to run his code on a schedule, he may encounter issues that could break the production recommender system. If he had instead developed his code on a clone of the remote instance (don’t develop on production systems directly!), he might have avoided these issues entirely, reducing his workload and the changes required to get the system into production.

Sticking to this requirement also means that development code doesn’t have to be rewritten to deploy it into production. Imagine Geralt developing a Jupyter Notebook, creating tests, and ensuring everything works well, just to tear it up and rewrite it all so it can be easily containerised. It just doesn’t make sense. If Geralt had worked in an environment mimicking production, he could have reused his tried-and-tested development code (if you’re interested in why rewriting working code is such a bad idea, see [2]).

While this requirement is important, it also shouldn’t get in the way of developers having a quick iteration cycle between tests — this is where our next requirement comes in.

Recommendations: Try to recreate production environments locally. Containers are a great tool for this, as they bundle their dependencies and run the same way everywhere. If you cannot use containers, allowing for limited local testing before easily pushing to remote locations is vital. Jupyter Notebooks aren’t a good tool here, as it’s likely you’ll have to rewrite portions of the code to run it in production!
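
As a sketch of what this can look like in practice, a small script that builds the production image and runs the test suite inside it turns “test in a production-like environment” into a one-command habit. This assumes Docker is installed, a Dockerfile mirroring production sits at the repository root, and pytest is available inside the image; the image tag is arbitrary.

    # test_in_container.py - a minimal sketch: build the production image and
    # run the test suite inside it, so tests execute in a production-like
    # environment. Assumes Docker is installed, a Dockerfile sits at the repo
    # root, and pytest is available in the image.
    import subprocess

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)  # stop immediately if any step fails

    if __name__ == "__main__":
        run(["docker", "build", "-t", "recommender:dev", "."])
        run(["docker", "run", "--rm", "recommender:dev", "pytest", "-q"])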

A fast and repeatable deployment process for testing

While I touched on this earlier, let’s revisit what it means for machine learning. This process is how code is “deployed” to its development environment (be that locally or elsewhere) and run to test it. In Geralt’s example, this would be pushing his code to the cloned remote instance (mimicking production) and running it to produce some test recommendations on an example dataset. If this process takes lots of manual steps, it may pull Geralt out of a state of concentration [3] or lead to mistakes, slowing down progress. This is what we want to avoid.

As part of this, software pipelines often include unit testing to ensure that the code functions as expected. While you might not be able to test the models being developed directly, there is still great value in testing all internal logic and outputs to ensure they meet the criteria outlined. In Geralt’s case, he could have the best recommender system in the world, but if his code doesn’t write the values to the database correctly then it’s useless to his company.
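
That glue code is plain Python, so it’s cheap to test even when the model isn’t. A minimal sketch, where format_recommendations stands in for Geralt’s real serialisation logic:

    # test_output_format.py - a minimal sketch. The model itself may be hard
    # to unit-test, but the code that shapes its output for the database is
    # plain Python. format_recommendations stands in for Geralt's real
    # serialisation logic.
    import pytest

    def format_recommendations(customer_id, items):
        """Shape one customer's recommendations into a database row."""
        if not items:
            raise ValueError("every customer needs at least one recommendation")
        return {"customer_id": str(customer_id), "items": [str(i) for i in items]}

    def test_row_has_expected_shape():
        row = format_recommendations(42, ["classic", "veggie", "bbq"])
        assert row == {"customer_id": "42", "items": ["classic", "veggie", "bbq"]}

    def test_empty_recommendations_are_rejected():
        with pytest.raises(ValueError):
            format_recommendations(42, [])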

Taking all this into account, when designing a workflow, it might be good to think about it in three stages.

Stage 1: Exploration -> Locally

Exploring sample datasets to get a feel for the work that might be required. This is best done locally, as it involves lots of fast iterations, for example, formatting and marshalling the data. Geralt wouldn’t want to spend 5 minutes waiting for his code to deploy just to print out the contents of a data frame.
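
In practice this stage is often nothing more elaborate than pulling a small extract and poking at it; a sketch, where orders_sample.csv and its column name are hypothetical:

    # explore.py - a quick local look at a small sample of the daily data,
    # with no deployment step in the loop. orders_sample.csv is a
    # hypothetical extract; the column name is illustrative.
    import pandas as pd

    sample = pd.read_csv("orders_sample.csv", parse_dates=["ordered_at"])
    print(sample.head())                   # eyeball a few raw rows
    print(sample.describe(include="all"))  # quick summary of every column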

Stage 2: Bulk development -> Development environment

Most of the meaty development of any model will involve slower iterations, as the codebase may be significant by this point and changes more complex to make. For Geralt, this is when he’s developing and training his model. As part of this, there will be a need for a robust deployment process so he can consistently test his work, possibly including unit tests to ensure that the inputs/outputs stay within specification. Ideally, this would also be conducted in the production-like environment previously discussed, to minimize issues later on.

Stage 3: Production scheduling -> Production environment

To deploy to production, the code should have to pass a rigorous set of tests, including scalability tests to ensure that the model maintains accuracy as its input grows. This process should run automatically, along with the actual deployment of the code into production scheduling.

Recommendations: Follow the staged process outlined above, utilizing tools such as Metaflow or Kubeflow to minimize disruption to developers when deploying code.
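
To give a flavour of what this buys you, here is a minimal Metaflow sketch. The step bodies are placeholders for Geralt’s real logic, but the structure is genuine Metaflow: the same file runs locally with “python recommender_flow.py run” and can later be scheduled in production without a rewrite.

    # recommender_flow.py - a minimal Metaflow sketch; the step bodies are
    # placeholders for Geralt's real logic. Run locally with
    # `python recommender_flow.py run`.
    from metaflow import FlowSpec, step

    class RecommenderFlow(FlowSpec):

        @step
        def start(self):
            self.orders = ["order data would be loaded here"]  # placeholder
            self.next(self.train)

        @step
        def train(self):
            # placeholder for real training; artifacts assigned to self are stored
            self.model_summary = f"trained on {len(self.orders)} records"
            self.next(self.end)

        @step
        def end(self):
            print(self.model_summary)

    if __name__ == "__main__":
        RecommenderFlow()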

Easy access to both logs and results produced by the codebase

I don’t think this is too controversial, but it certainly needs to be highlighted. When developing, you need fast access to any outputs created by your codebase so you can easily debug if necessary. If your workflow involves pushing code to a remote machine, you need to make sure it’s easy to access the logs generated; we’re looking at improving processes here, not making them less efficient!

This also applies to any output, not just debugging information: Geralt can’t be sure that his code has written the correct values to the database without being able to check them.

Recommendation: Again, tools such as Metaflow are great for this, as they provide instant feedback for debugging and store the results of all variables for further analysis. Avoid mechanisms such as SSHing into machines to read logs, as they require manual steps that are error-prone and more taxing on a developer’s concentration.
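
For instance, Metaflow’s client API lets you pull the artifacts of any past run from a Python shell or notebook, with no SSH involved. This sketch assumes the RecommenderFlow from earlier has been run at least once:

    # inspect_results.py - fetch artifacts from the latest successful run of
    # the flow sketched above. Every value assigned to self in a step is
    # stored by Metaflow and retrievable here.
    from metaflow import Flow

    run = Flow("RecommenderFlow").latest_successful_run
    print(run.id, run.finished_at)  # which run, and when it finished
    print(run.data.model_summary)   # the artifact saved in the train step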

Easily version controlled

I’d challenge anyone to find a traditional Software team that doesn’t use some sort of version control for their codebase. And there is good reason for it: it’s an invaluable tool for storing, reverting, combining, and analyzing any code.

Not only is it convenient as a backup for code, but it’s also really useful for keeping track of code versions, something critical when tracking bugs. If you are deploying your code to run remotely on a schedule, it’s often hard to know which version of your code is running. Have you ever run something, got an output, and thought: that’s not right, did it even deploy correctly? Let me retry and see if that fixes it. Utilising version control within a Continuous Deployment pipeline can help with this [4], as it deploys your code to the remote location only when you commit, so you can look back at the last commit and be sure which version of your code is running. From what I’ve learned, uncertainty in a deployment pipeline is a recipe for disaster and easily leads to misdiagnosed bugs.

The last point I’d like to make here is team integration. If Geralt had a teammate also working on his codebase, a version control system such as git would make it easier for them to collaborate without getting in each other’s way. It would allow them to submit pull requests, review each other’s code, and deploy independently of each other.

Recommendation: Use git! If you also have the time, use GitHub Actions to run basic analysis, testing, and deployment of your code. No more manually dropping zips into S3!
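
As a sketch of that final step, this is the kind of script a GitHub Actions job could run on every commit instead of a manual upload. The bucket and key names are hypothetical; tagging the artifact with the commit SHA means you can always tell which version is live.

    # deploy.py - a minimal sketch of an automated deploy step. Bucket and
    # key names are hypothetical; the git SHA in the key ties every artifact
    # back to the commit that produced it.
    import subprocess
    import boto3

    sha = subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()
    subprocess.run(["git", "archive", "-o", "release.zip", "HEAD"], check=True)
    boto3.client("s3").upload_file(
        "release.zip", "my-model-bucket", f"releases/{sha}.zip"
    )
    print(f"deployed commit {sha}")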

So, as a TL;DR, what are the takeaways here? ML doesn’t occur in a vacuum. It will always have to consume data from other services and write data elsewhere. This means it needs to be as stable and reliable as the services that surround it, and the tried and tested way of achieving that is using good software development principles to build a dependable codebase.

For direct recommendations, I’d say look at tools such as Metaflow and Kubernetes/Kubeflow, which implement several of the criteria above, such as using containers to minimise the changes needed when switching to production. If these are out of your reach, use bash to automate the deployment of your code, even if this is just locally, so that others can pick up your code, run one command, and get started. While setting up these processes might take some initial development time, they will pay dividends later in a project and prove invaluable for other developers utilising your work in the future.

I’ve only briefly touched on some of the tooling that I believe could be used; in future blog posts I will compare some of the tools named here to see which is most effective for developers across the board.

*Production in this context is defined as wherever ML/DS work is run to produce value for the business; generally, this is scheduled to provide insight to the business or a service to clients on a regular basis.

References

[1] Netflix, Why Metaflow (2019), https://docs.metaflow.org/introduction/why-metaflow

[2] Joel Spolsky, Things You Should Never Do, Part I (2000), https://www.joelonsoftware.com/2000/04/06/things-you-should-never-do-part-i/

[3] Kendra Cherry, The Psychology of Flow (2021), https://www.verywellmind.com/what-is-flow-2794768

[4] GitLab, 4 Benefits of CI/CD (2019), https://medium.com/@gitlab/4-benefits-of-ci-cd-efc3d6b9d09d
