
Refactoring Machine Learning Projects

Going from good (or bad) to great

Photo by Todd Quackenbush on Unsplash [1].

As data scientists, we spend a lot of our mental energy on building new tools, pipelines, and workflows from scratch. However, as our work matures, we all eventually find ourselves facing situations where we need to adapt something that has already been built, instead of starting fresh. Having skills in refactoring then becomes vitally important.

Picture this: your company has a machine learning pipeline in production, and its predictions are gradually losing accuracy. Business stakeholders are complaining. They look to you to "fix it". Now what? It’s time for a refactor!

What are we starting with?

Some code, functional(ish), that is somehow not meeting all the needs of users. For our purposes, we are talking about a machine learning model or pipeline.

Who wrote it?

You or somebody else. This may or may not matter – the you of six months ago can be nearly as much a stranger to your current mind as someone totally different. It’s really hard to remember what we did in the past and why, unless we always write perfect docs, which nobody does.

Why refactor?

Either it has always needed some work (which is common and normal when we are developing minimum viable products), or it was great but needs and circumstances have changed, so you need to make it functional again.

You may think that it would be easier to start over from scratch – this is possible, but likely would end up being more work. A good refactor ought to save time, because you will be able to build upon the good bones of the original work. If the original work is simply unsalvageable, that’s a different problem.

So, let’s talk about some of the specific ML examples where this situation might show up.


Scenarios

New Maintainer

This may mean you are taking on a project someone else built, or someone else is taking on one of yours. And it doesn’t necessarily mean a refactor is necessary, but it’s a great opportunity to evaluate and do one if it would be valuable. Information is going to be transferred to a new person, and that new person may have ideas for how to improve the existing tool!

Scaling

Models frequently start out with minimal expectations of performance. It may be that your pipeline was serving one or two users an hour, or ingesting new data weekly, but now it needs to serve twenty users a minute, or ingest data hourly. This sort of shift in the scale of expectations indicates a refactor is in order.

Moving from R&D to Production

If you built a model as a demonstration of what’s possible, moving it to a production implementation is certainly going to require some changes. You’re going to have to ensure that the model integrates with the other elements of the pipeline that it will depend on or feed into, and you need to make sure it’s performant at scale as well.

Model Drift

Model drift occurs when some factor in the environment of the model’s data changes, and the model’s expectations about the relationships in the data are no longer correct. This might mean your model thinks Variable A is linearly related to Outcome B, but something about Outcome B has shifted, so that relationship no longer holds. Your model won’t know why, but suddenly its predictions of Outcome B will be wrong. This means it’s time for you to review the model, examine those relationships, and perhaps retrain or refactor the model to better reflect the world around it.
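As a rough sketch of how you might catch this (the feature arrays and the `feature_drifted` helper here are hypothetical, not from any particular pipeline), a two-sample Kolmogorov–Smirnov test can flag when a feature’s live distribution has wandered away from the training distribution:

```python
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(train_values, live_values, alpha=0.01):
    """Flag drift when the live distribution differs from training."""
    statistic, p_value = ks_2samp(train_values, live_values)
    return p_value < alpha

# Simulated example: training data vs. a live stream whose mean has shifted.
rng = np.random.default_rng(42)
train = rng.normal(loc=0.0, scale=1.0, size=500)
live_shifted = rng.normal(loc=3.0, scale=1.0, size=500)

feature_drifted(train, train)         # identical data: no drift flagged
feature_drifted(train, live_shifted)  # shifted mean: drift flagged
```

Running a check like this per feature on a schedule turns “predictions are gradually losing accuracy” from a stakeholder complaint into an alert you see first.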

Change of Source Data

This happens a lot in business – the data collection system or data engineering pipelines upstream from your model change, and now perhaps data is measured differently, or comes in different volume, or features are added or taken away. This can all impact your model’s performance, and may call for a refactor. Even in the case of just having new features available, if your model isn’t using them but could gain improved performance by adding them, you should consider refactoring to incorporate this new information.
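A cheap guard here is a schema diff between what the model expects and what the upstream pipeline now delivers. A minimal sketch (the column names and the `check_schema` helper are made up for illustration):

```python
import pandas as pd

# Hypothetical set of features the deployed model was trained on.
EXPECTED_FEATURES = {"age", "income", "region"}

def check_schema(df: pd.DataFrame) -> dict:
    """Report columns the model expects but lost, and new ones it could use."""
    incoming = set(df.columns)
    return {
        "missing": sorted(EXPECTED_FEATURES - incoming),
        "new": sorted(incoming - EXPECTED_FEATURES),
    }

# Upstream changed: "region" disappeared and "postal_code" appeared.
upstream = pd.DataFrame(columns=["age", "income", "postal_code"])
report = check_schema(upstream)
# report["missing"] -> ["region"], report["new"] -> ["postal_code"]
```

The “missing” list tells you what broke; the “new” list is your cue to consider a refactor that incorporates fresh information.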

???

Finally, there’s also the situation where "it doesn’t work" and no one’s sure why. This happens more than you might expect, particularly when the original author of the project is no longer present. All these situations illustrate the key importance of documentation in machine learning, because a situation where the code and model are understood is vastly preferable to this.

So, you’re in front of a machine learning project and one of these situations applies to you. Now what?


Time

You need to be aware of the business and practical limitations around you. Ideally, you’ll have time and resources to thoughtfully conduct this refactor and produce a result that is resilient and will continue to work for a significant amount of time.

Of course, we all know this is not always the case. You might be on a huge time crunch, and need this project to be working again as fast as possible. If that’s your situation, then you might be better off making a small band-aid fix to the project to get things functioning.

For example, if the data stream feeding the model has changed, maybe figure out the minimum adjustment you need to do so the model can produce reasonable results. A full refactor would involve examining the new data stream, comparing it to the original, perhaps retraining the model using new data options, and designing the new pipeline. But a band-aid fix might mean finding the new data channels closest to the original features, formatting/renaming them to suit the model expectations, and gluing that together. It’s not ideal, but it might work for the short term.
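In pandas terms, that band-aid can be as small as a rename-and-reorder adapter between the new stream and the old model. A minimal sketch, assuming hypothetical channel names and a hypothetical mapping (your real mapping comes from comparing the two streams):

```python
import pandas as pd

# Hypothetical mapping from new upstream channel names to the feature
# names the trained model still expects.
CHANNEL_TO_FEATURE = {
    "cust_age_years": "age",
    "household_income_usd": "income",
}

def band_aid_adapt(raw: pd.DataFrame, feature_order: list) -> pd.DataFrame:
    """Rename new channels to old feature names and restore column order."""
    adapted = raw.rename(columns=CHANNEL_TO_FEATURE)
    # Selecting by the old order raises a KeyError if a feature is still missing,
    # which is better than silently feeding the model the wrong columns.
    return adapted[feature_order]

raw = pd.DataFrame({"household_income_usd": [52000], "cust_age_years": [34]})
model_input = band_aid_adapt(raw, feature_order=["age", "income"])
```

Glue code like this buys time; the full refactor still needs to happen.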

From here, though, let’s assume you have time to refactor conscientiously. How to proceed?


Planning

One thing you should NOT do is dive right into writing code. Instead, think about what this refactor is about and what you want to achieve.

Set goals to frame the task and avoid mission creep.

Why do you want to refactor this? What is the measure of success of this refactor? How will you know when it’s done?

A model can have many kinds of improvement: shorter code, cleaner code, more pythonic code, faster predictions, easier to use, more readable, more interpretable, dimensionality reduction, faster training… you get the idea. Pick one or two, but don’t expect that you’ll write the perfect pipeline or model. Think seriously about what you can do, not just what would be cool or what you’d like to do in the ideal world. Things will come up, priorities will shift, and you need to consider what’s realistic.

Look at the big picture.

Small issues – little nagging problems – are often the easiest to see, but if you’re going in to refactor a piece of code or a project, a small problem may be indicative of a systemic issue that you could fix. In public health, we might call this root cause analysis – don’t just treat the symptom, but look at the determinants that make this the case.

For example, if one channel in your data stream has changed and broken things, don’t assume that all your other data channels are still fine! They might be, or they might have silently changed and need to be handled differently.

Understand the functionality.

Even though the current project is not adequate for whatever reason, it still has history. Unlike starting a project from scratch, in a refactor there are preexisting expectations for exactly how a project ought to behave. Output may need to be in a specific format, and decisions may have been made to accommodate the behavior of the original code. This doesn’t mean those original behaviors are good, of course – but you’ve got to contend with user expectations and the reality of the organization as well as any technical challenges. Development doesn’t live in a vacuum; it’s got to work for the human beings around it.

All this leads to the question of breaking changes. If you’re really sure the original behavior is bad, or that it can’t survive the improvements you need to make, then you’re talking about making a breaking change. The behavior your user expects will no longer be the case, and their old code is very likely to need changes. You’re basically creating a cascading need for more refactoring. This may be the right thing to do! But you need to be very, very sure and realistic about what this means for people besides you.

Document it!

People who know me know I am all about documentation. I just think it’s exceptionally important for our work to be really useful. There are two ways this comes into play for refactoring.

First, is the original project documented? Is it clear what all the elements do/did, and why? If not, your first step in a refactor is going to be understanding the original code deeply. Read it as though you were planning to document it (or do it!), and you’ll actually understand the code. This is essential for your refactor being successful.

Second, your refactor needs to be well documented so that people can use it, and so that it will have lasting power. This is just true of code projects in general, but in this case your code is replacing something at least somebody previously understood, so the new replacement is going to need to be easy to slot in and adapt to. Additionally, your refactor may need to prove it’s good enough or adds value, and documentation is a great method for that.


Doing the Job

At this point, you know what you want to achieve, you know what the original project did, and you know what went wrong. You’re ready to write code! Because of the planning you’ve engaged in, this ought to be a lot easier. As you work, keep a few more considerations in mind.

  • Retain as much as you can of the original project. Don’t reinvent the wheel if someone already wrote something you can use. At the same time, cut what’s not useful. Don’t be a code pack-rat.
  • Try to create stylistic consistency. If the original code uses a particular code style convention, either keep that, or adapt the whole project to your new preference. The result of a refactor is ideally not a Frankenstein’s Monster of different code conventions, but a cohesive whole that others can understand easily.
  • Don’t take away functionality without having a replacement, or a good reason/explanation for users. Instead of chopping away at the main branch, develop on a parallel branch until your work is ready for users and has been reviewed. If you decide a breaking change is needed, as we discussed earlier, then make that clear and set people’s expectations appropriately.
  • If you can, add tests or quality control elements to your refactor, so that future breaks or problems are more easily identified. You are now the expert on what this code ought to do, so you are best qualified to write tests of that functionality. This sets future-you up for success and easier work down the road.
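Those tests don’t have to be elaborate – even plain assertions encoding the output contract help. A minimal sketch (the `predict` stand-in and the specific contract checks are invented for illustration; your real checks come from what the pipeline is supposed to guarantee):

```python
import numpy as np

def predict(features):
    """Stand-in for the refactored model's predict step (hypothetical)."""
    features = np.asarray(features, dtype=float)
    # Toy logistic score so the contract checks below have something to run on.
    return 1.0 / (1.0 + np.exp(-features.sum(axis=1)))

def check_prediction_contract(preds, n_rows):
    """Encode what the pipeline ought to produce, so future breaks surface early."""
    assert preds.shape == (n_rows,), "one score per input row"
    assert np.all((preds >= 0.0) & (preds <= 1.0)), "scores are probabilities"
    assert not np.any(np.isnan(preds)), "no NaNs leaked through"

batch = [[0.2, -1.0], [3.0, 0.5]]
preds = predict(batch)
check_prediction_contract(preds, n_rows=len(batch))
```

Run in CI or as a pre-deployment smoke test, checks like these catch the silent failures that otherwise show up as stakeholder complaints.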

Getting Done

Data scientists can often get stuck in projects because they are pursuing perfection. One of the biggest lessons I’ve learned in Data Science in business is that there’s no shame in getting a minimum viable product out the door, especially when you are able to generate business value rapidly. You don’t have to quit working on improvements because you got the essentials completed, but you also can’t drag out the project forever.

This is all very true of refactors as well. As we discussed in the planning section, you will not be able to make this project perfect in a single refactor – perfect code doesn’t exist. Code is always a work in progress, and models are the same. No matter how well you refactor, there are always possibilities that the environment around your model will shift, and a new refactor will be required. Don’t let this paralyze you from getting good work done.

Instead, make sure you add as much versatility and documentation to your refactor as you can, so that the next refactor (which will almost certainly have to happen) is easier and faster – give your future self that gift.


Conclusion

I hope this helps you envision and carry out your next Machine Learning refactor with confidence! Planning ahead thoughtfully will make the whole endeavor easier for you and your users, now and in the future.

Disclaimer: I’m a Senior Data Scientist at Saturn Cloud – a platform enabling easy to use parallelization and scaling for Python with Dask. If you’d like to know more about Saturn Cloud, visit us at www.saturncloud.io.

Quick plug: If you have needs around productionalizing and/or scaling your ML workflow, Saturn Cloud is here to help. We offer a platform that takes away the drudgery of scaling your models, gives you easy access to multi GPU clusters and Dask, and lets you do what you do best – machine learning. We also offer support with refactoring and scaling models for our enterprise customers. Contact us today to learn more!

