Building Better ML Systems —
Chapter 1: Every Project Must Start with a Plan

About the ML project lifecycle, design docs, business value, and requirements. About starting small and failing fast.

Published in Towards Data Science · 9 min read · Apr 20, 2023

Many data scientists and ML engineers leave university with a false picture of what their day-to-day work will look like. They expect it to resemble their studies:

Trying cool state-of-the-art algorithms on fixed, relatively clean datasets and selecting the best one in terms of accuracy (Expectations).

You don’t need:

- To think about the business value and a never-ending list of requirements.
- (Most likely) to collect, label, and clean the dataset. In some cases, even the train/validation/test split is already done for you.
- To thoroughly evaluate your model, check for biases, and conduct A/B tests.
- To deploy the model to thousands (or millions) of users and ensure it is up and running 99.9% of the time.
- To monitor the model, catch any drops in accuracy, and retrain it as needed.
- To collect new data immediately after deploying a previous version and start working on a new, hopefully better, model.

True, you don’t need to think about any of this during a research or study project. But in real-life projects, all of it becomes crucial.

The main difference between research and a real-life project is that:

In real life, many users use your model in every imaginable and unimaginable way and expect it to always work quickly, accurately, and fairly. Users’ behavior changes continuously, and epidemics and wars may happen. Meanwhile, your company tries to earn a profit by delivering what users want and to build a competitive advantage by applying Machine Learning in ways that no one else has tried or succeeded at before (Reality).

Throughout this series, you will learn that building better Machine Learning systems requires thinking of them as systems: paying enough attention to each component and to the relationships between components.

This tutorial will be useful for Data Scientists, Machine Learning Engineers, and Team and Tech Leads (or those who aspire to become one). Do not expect the series to be comprehensive. Still, it will help you lay a strong foundation in ML system design, fill any gaps, and explore topics that are less familiar to you. Along the way, I will provide links to numerous excellent posts, papers, and books.

Without further ado, let’s begin!

Every project must start with a plan.

Below is the Machine Learning project lifecycle. Make yourself comfortable. First, you understand the task and determine what needs to be done. Then, you collect, label, and clean the data. Next, you move on to modeling. After that, you evaluate the models and select the best one. Finally, you deploy the model and monitor its performance.

*Image 1. Life Cycle of a Machine Learning Project. Image by Author.*

Is this the end? No, it is only the beginning.

While monitoring, you may discover that the model is not working well for some subset of users, or its accuracy is deteriorating over time, so you start again: understand the problem -> get data -> model and evaluate -> deploy.

Or during model evaluation, you may find that the model is not good enough to deploy, and so you start again: understand what is not working and how to improve it -> collect more data -> do more modeling -> evaluate to (hopefully) get better results this time.

(If this is your first time learning about the Machine Learning project lifecycle, I recommend checking out Anton Morgunov’s post: The Life Cycle of a Machine Learning Project: What Are the Stages?)

So there are two important things to understand:

- Building a Machine Learning system is an iterative process that continues until the model is removed from production. (No rest for the wicked)
- Image 1 provides a simplified version of how a Machine Learning system is developed. In reality, you do not move smoothly and sequentially from stage to stage. Something can go wrong (and usually does) at each stage, which can set you back one or more steps, or even throw you back to the beginning. (Welcome to the real world)

*Image 2.* ***Realistic*** *Life Cycle of a Machine Learning Project. Image by Author.*

Those with engineering backgrounds may wonder: What is the difference between Machine Learning projects and traditional software development? Where are the tests, builds, and releases? Thank you for asking.

The truth is that a Machine Learning project is a special kind of software engineering project, so all the software engineering best practices you can think of are welcome in ML projects. With that said, let me introduce you to a truly realistic life cycle of a software project with a Machine Learning component:

*Image 3. Truly* ***Realistic*** *Life Cycle of a Software project with a Machine Learning component. Image by Author.*

To take control of this chaos, each project must start with a plan.

(Read MLOps: Machine Learning Life Cycle by Satish Chandra Gupta to learn more about the ML software development lifecycle.)

Before spending thousands of dollars on data annotation and weeks and weeks on Machine Learning model development, there are four things you need to do. Let’s call it the “pre-coding” stage. So, close your PyCharm for now, as all you need is a Google document, your brain, and Zoom.

1. Estimate the business value of the ML project.

Any commercial company’s goal is to earn more money or provide a better customer experience… in order to earn more money. With this simple axiom in mind, convince your boss, C-level management, and stakeholders that the current ML project is a good investment.

Ideally, you need to provide some rough numbers on how the ML model will increase the company’s revenue or user engagement, decrease request processing time, and so on. Be creative here, turn off your perfectionism, and do not hesitate to ask your colleagues from the finance and marketing departments for help.

(Keep in mind that this metric will later be used to assess the project, so be realistic about what you promise to deliver.)
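A back-of-the-envelope calculation is usually enough at this stage. The sketch below shows one hypothetical way to turn rough guesses into a first-year net value and a payback period. The `ValueEstimate` class, its field names, and every number in it are assumptions made up for illustration, not figures from a real project.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ValueEstimate:
    """Rough yearly business case for an ML project (all inputs are guesses)."""
    yearly_gain_low: float   # pessimistic extra revenue or savings per year
    yearly_gain_high: float  # optimistic extra revenue or savings per year
    build_cost: float        # one-off cost: annotation, development, setup
    yearly_run_cost: float   # recurring cost: hosting, monitoring, retraining

    def net_first_year(self) -> Tuple[float, float]:
        """Return the (pessimistic, optimistic) net value for the first year."""
        total_cost = self.build_cost + self.yearly_run_cost
        return (self.yearly_gain_low - total_cost,
                self.yearly_gain_high - total_cost)

    def payback_months(self) -> Optional[float]:
        """Months needed to recover the build cost in the pessimistic case.

        Returns None when the pessimistic gain does not even cover the
        running cost, meaning the project would never pay back.
        """
        monthly_net = (self.yearly_gain_low - self.yearly_run_cost) / 12
        if monthly_net <= 0:
            return None
        return self.build_cost / monthly_net


# Hypothetical numbers, for illustration only.
estimate = ValueEstimate(
    yearly_gain_low=120_000,
    yearly_gain_high=300_000,
    build_cost=60_000,
    yearly_run_cost=24_000,
)
low, high = estimate.net_first_year()
print(f"First-year net value: {low:,.0f} to {high:,.0f}")
print(f"Payback (pessimistic): {estimate.payback_months():.1f} months")
```

With these made-up inputs, the first-year net value ranges from 36,000 to 216,000, and the pessimistic case recovers the build cost in 7.5 months. Giving a range rather than a single number makes it easier to stay realistic about what you promise.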

2. Collect the requirements.

Once no one doubts that the ML model is necessary, start collecting requirements.

Each domain is specific and each project is unique, so there is no exhaustive list of requirements to refer to. Trust your experience and collaborate with your colleagues.

Here’s a helpful tip: Come up with a list of generic questions (I’ll share mine below) and just ask. Start the conversation, and as you discuss, more project-specific questions will naturally arise.

- How much data do we have? How are we going to label it?
- What should model latency be?
- Where will the model be deployed — cloud or on-premises? What are the instance specifications?
- Are there any requirements for data privacy and model explainability?
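If it helps to keep track of the answers, the generic questions above can be captured in a small structure that also reports what is still unknown. This is only a sketch: the `ProjectRequirements` class and its field names are hypothetical, and a shared document or spreadsheet works just as well.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ProjectRequirements:
    """Answers to the generic questions; None means 'not answered yet'."""
    labeled_examples: Optional[int] = None        # how much data do we have?
    labeling_plan: Optional[str] = None           # how are we going to label it?
    max_latency_ms: Optional[float] = None        # what should model latency be?
    deployment_target: Optional[str] = None       # "cloud" or "on-premises"
    instance_spec: Optional[str] = None           # e.g. CPU/GPU type and memory
    privacy_constraints: Optional[str] = None     # data privacy requirements
    explainability_needed: Optional[bool] = None  # is explainability required?

    def open_questions(self) -> List[str]:
        """Names of the requirements that nobody has answered yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Hypothetical state after a first meeting with stakeholders.
reqs = ProjectRequirements(
    labeled_examples=5_000,
    max_latency_ms=200,
    deployment_target="cloud",
)
print(reqs.open_questions())
# ['labeling_plan', 'instance_spec', 'privacy_constraints', 'explainability_needed']
```

The list of open questions doubles as an agenda for the next conversation, and project-specific fields can be added as they come up.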

Just because a task can be solved with machine learning does not mean it should be. At this point, I suggest that you reconsider whether a pure software engineering approach or a basic rule-based approach may be a suitable solution. Here are some posts that can help you with that:
- When to Use Machine Learning by Amazon
- Four Situations Where you Should Not use Machine Learning by Svenja Szillat

There is no Machine Learning without data. It may seem obvious, but unfortunately, in my career, I’ve seen too many companies make the same mistake: they want AI, but their datasets are small, dirty, or missing important features. The great post “The AI Hierarchy of Needs” by Monica Rogati encourages us to think of AI as the top of a pyramid of needs, with data collection, storage, and cleaning at the foundation.

*Image 4. The AI Hierarchy of Needs. Adapted from an image by* *Monica Rogati in “The AI Hierarchy of Needs”*.

3. Start small and fail fast.

Even if your goal is to create an ML system that serves millions of users per day, it’s wise to start with something much, much smaller:

- PoC (Proof of Concept). Manually retrieve data from your data stores, quickly iterate through a couple of algorithms in a Jupyter Notebook, and finally prove (or reject) the hypothesis that you can train a machine learning model with satisfactory accuracy on the data you have. During the PoC stage, you’ll also understand what is needed to deploy and scale the model.
- MVP (Minimum Viable Product). Assuming the PoC stage was successful, you now create a product with only the main functionality and release it to users. In a Machine Learning project, this means rolling out the model to a segment of the users and assessing whether it brings the expected business value.
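Rolling out to a segment of users can be as simple as a deterministic bucketing rule. The sketch below shows one common way to do it, hashing the user ID; it is not a method prescribed here. The `in_rollout` function, the salt value, and the user IDs are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, percent: float, salt: str = "mvp-model-v1") -> bool:
    """Deterministically decide whether a user belongs to the rollout segment.

    The same user always gets the same answer for a given salt, so their
    experience stays consistent between requests.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # bucket in 0..9999
    return bucket < percent * 100          # e.g. 5% -> buckets 0..499


# Hypothetical user base: expose roughly 5% of users to the new model.
users = [f"user-{i}" for i in range(10_000)]
exposed = [u for u in users if in_rollout(u, percent=5)]
print(f"{len(exposed)} of {len(users)} users see the model")
```

Because the assignment depends only on the user ID and the salt, no lookup table is needed, and raising `percent` later keeps every already-exposed user in the segment. Changing the salt reshuffles the segment for a new experiment.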

Once you realize that an idea is not working out, abandon it with a clear conscience and move on to the next one. This is much easier to do when you haven’t already spent years of work or hundreds of thousands of dollars. Keeping the cost of failure low is a key factor in the success of a project.

(To explore this topic further, read POC vs MVP: What to Choose to Build a Great Product by Dmitry Chekalin.)

4. Write a design document.

A design document in software engineering is a description of the software system’s architecture — its overall structure, its individual components, and the interactions between them. It can take an arbitrary form and structure, be formal or informal, high-level or detailed (it is up to a team to decide). During the implementation phase of software development, the design document serves as a blueprint for developers to follow.

Writing one is a best practice in software engineering, and as I mentioned earlier, all software engineering best practices are welcome in ML projects.

My personal reasons for loving design docs are:

- Writing triggers the thought process. Writing a design doc is like implementing the project at a high level: you do not actually code, but you still make decisions on data, algorithms, and infrastructure. You consider all scenarios and evaluate trade-offs, which means you save time and money later by avoiding dead ends.
- Design docs simplify synchronization and collaboration within a team. The document is shared among team members so that they can review it, familiarize themselves with the system design, and start discussions if needed. No one is left out, and everyone is encouraged to contribute.

If you are ready to start writing a design doc, here is a template for machine learning systems proposed by Eugene Yan. Feel free to modify it and adapt it to your project needs.

If you would like to learn more about design documents as a concept, check out these posts:
- How to Write Design Docs for Machine Learning Systems by Eugene Yan
- Design Docs at Google by Malte Ubl

Conclusion

In this chapter, we learned that every project must start with a plan because ML systems are too complex to implement in an ad hoc manner. We reviewed the ML project lifecycle, discussed why and how to estimate a project’s business value and how to collect requirements, and then reevaluated with a cool head whether ML is truly needed. We learned how to start small and fail fast using concepts like “PoC” and “MVP”. And finally, we talked about the importance of design documents during the planning stage.

In the next posts, you will learn about data collection and labeling, model development, experiment tracking, online and offline evaluation, deployment, monitoring, retraining, and much, much more. All of this will help you build better Machine Learning systems.

The next chapter is already available:

Building Better ML Systems. Chapter 2: Taming Data Chaos.

