Research guidelines for Machine Learning projects

On the importance of having a simple goal but multiple ways to reach it.

Fran Pérez
Towards Data Science


Machine Learning projects can be delivered in two stages. The first stage, called Research, is about answering the question: can we build a machine learning model out of this pile of data that serves the needs of the client? Its deliverable is a proof of concept or a feasibility study. The second stage, called Development, is about committing to deliver a machine learning product, and its deliverable is that product.

During the Research stage, there are two aspects I consider especially relevant when project resources are limited and time is short: first, the importance of defining a scope for your problem; and second, the importance of testing your model in short iterations.

Scope

One of the most difficult tasks in this stage is to scope your efforts. For example, think about how you get your data: you can choose between working with live data (connecting to a live database) or simply reading data from files. Working with databases means you need access to infrastructure, have to deal with authentication, and so on, which, depending on the company, can delay the start of the project; on the other hand, you can request a dump of the data and start working the next day.
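As a minimal sketch of the file-based route, assuming the client sent a CSV dump (the file name and column names here are hypothetical):

```python
import pandas as pd

# Read the client's data dump instead of querying a live database.
# "customers_dump.csv" and "signup_date" are hypothetical names.
df = pd.read_csv("customers_dump.csv", parse_dates=["signup_date"])

# Quick sanity checks before any modelling work starts.
print(df.shape)
print(df.dtypes)
print(df.isna().mean().sort_values(ascending=False).head())
```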

The good (and bad) thing about the scoping exercise is that it narrows down all your possible actions. In a way this is good, because you strip away everything that is not necessary for delivering a proof of concept:

  • Focus on a single problem. If the problem is too big, split the big problem into many small problems (divide and conquer). For example, if your problem is world-wide, re-scope it to be continent-wide or just focus on a single country. You can also reduce complexity by reframing the timespan of your problem (for example, focus on a specific date range: last year, last decade, etc.).
  • The scope is not set in stone, and you should be agile about it. As you delve into the business domain and gain more experience wrangling the data, there will be times when you shift your scope. This is one of the reasons I use Kanban instead of Scrum during research: it is more flexible.
  • Until the moment you have the zero model (explained below), you will not know if you have enough data. But I will tell you a secret: in machine learning, you can never have too much data, but you might have biased or unrepresentative data, which can bring issues into your inferences. If you have too much data, you can always downsample your dataset; but if you have too few samples, upsampling (augmentation) techniques might add unwanted variance to your dataset. Bottom line: if you have several candidate scopes to choose from, go for the biggest one (size-wise).
  • The problem should be manageable on a single computer. The exception is deep learning models, for which you will need to add one or more GPUs to the machine. You can use either your own machine or a virtual machine inside the client's VNET (recommended when the data is sensitive). If you need a Spark cluster to transform your data, think again about the scope of your problem.
  • Agree with the client on the acceptance criteria. This point is about negotiating what the project's deliverable will be. Focus: this is not an ML product (yet), but a proof of concept. So, for example, do you really need a blue-green deployment? Would you rather deploy the model as a REST API using Flask (see the sketch after this list)? Or does it work to just save the model's predictions in an Excel file, to be reviewed by the business expert later?
  • The acceptance criteria are one of the drivers of your backlog (Jupyter Notebook or Python module, Docker or local deployment, project log, model artifact, model predictions, etc.). Eventually, the metric and/or condition criteria (how to evaluate the goodness of your solution) should also be agreed with the client (more on this in the following section).
  • The focus should boost simplicity, because simplicity reduces time and effort. The aim of the research stage is to answer the feasibility question of the ML product as soon as possible. Keep this in mind when deciding on your scope and acceptance criteria.
  • The proof of concept is different from a throw-away prototype. All the effort invested in this stage should feed into a future product. Most of the time, the data scientist's experience is what makes the difference in re-using not only the knowledge gained but the source code and infrastructure as well. The machine learning engineer should help address these issues (for example, by enforcing best practices).
  • If it’s not clear enough, I repeated five times the word “focus” in this section (now this is the sixth). Does it make sense now?
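As an illustration of the lightweight deployment option mentioned above, here is a minimal sketch of serving a model as a REST API with Flask. The model file name and the request format are assumptions for the example, not a prescription:

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# "model.pkl" is a hypothetical artifact produced during research.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()  # e.g. {"features": [[1.0, 2.0, 3.0]]}
    prediction = model.predict(payload["features"])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```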

Model testing

A little disclaimer: most people think building the model is the most complex part of an ML project. It is not, because there is a good chance that the ML algorithm you need is already implemented in a library like sklearn, h2o or pycaret. And I'm not even going to mention AutoML techniques or the ML libraries available in languages like R or Julia.

Stick to the basics; they will solve 90% of your projects. Once you understand the basics, you can dare to jump ahead and use more complex ML/DL algorithms. It is easier to grasp the theory and fundamentals of the algorithms once you have used them, so do not think you need to fully understand them before using them. This is the learning technique used by Jeremy Howard in his courses, which I consider a fantastic way to learn for non-PhD people like myself.
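To make "the basics" concrete, here is a minimal sklearn sketch: a preprocessing-plus-model pipeline of the kind that covers most problems. The toy dataset and column names are invented for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data; replace with your own dataset.
X = pd.DataFrame({
    "age":     [25, 40, 31, 58, 22, 47, 36, 61],
    "income":  [30_000, 72_000, 48_000, 90_000, 28_000, 80_000, 52_000, 95_000],
    "country": ["ES", "US", "ES", "DE", "US", "DE", "ES", "US"],
})
y = [0, 1, 0, 1, 0, 1, 0, 1]

# Minimal feature engineering: scale numbers, one-hot encode categories.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"]),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))
```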

So, if building the model is not the hardest part, what is? All the other pieces around the ML model: feature engineering, serving the model, and so on.

Hidden Technical Debt in Machine Learning Systems

Most of these other parts fall beyond the scope of this post (and some of them pertain to the Development stage that follows the Research stage).

Iterative approach

During iteration zero of your model, my recommendation is to build your model and collect its predictions as soon as possible. Go for the quickest win: for example, reduce your feature engineering to only converting non-numerical features to numerical ones, and build a simple model. The ideal result is that this zero model's predictions beat a random choice or a summary statistic (mean, median, etc.), which is the most basic prediction (no machine learning involved, just pure arithmetic), as in the sketch below.
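As a sketch of that comparison: sklearn ships Dummy estimators that implement exactly this kind of no-ML baseline, so the bar the zero model has to clear can be computed in a few lines (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# No-ML baseline: always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Zero model: the quickest win, no tuning, minimal feature engineering.
zero_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("baseline accuracy:  ", baseline.score(X_test, y_test))
print("zero model accuracy:", zero_model.score(X_test, y_test))
```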

PS: Do not feel down if you do not get there on the first try. This is just the first step of a long journey.

Hypothesis testing

The model obtained during the first half of iteration zero is what I call the zero model. During the second half of iteration zero, you will build a better model by updating the training dataset (for example, pre-processing the data more aggressively), optimizing your model's hyperparameters, and so on. I call this second model the null model.

During the following iteration(s), you will build the alternative model that plays the counterpart of the null model. This procedure is similar to hypothesis testing: you figure out whether you are on the right track by comparing models against each other. The way I work, I try to complete at least two iterations, testing two different models. But as the saying goes, "the more, the merrier", and you are only limited by the time at your disposal (that is, by your scope).
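A minimal sketch of such a comparison, assuming a classification problem with synthetic data; the choice of LogisticRegression as the null model and GradientBoostingClassifier as the alternative is just an example:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=42)

# Null model: the improved model from iteration zero.
# Alternative model: the challenger built in the next iteration.
candidates = [
    ("null", LogisticRegression(max_iter=1000)),
    ("alternative", GradientBoostingClassifier(random_state=42)),
]

for name, model in candidates:
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```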

Metrics

Depending on the ML problem you are solving, you will need a metric. For example, for a regression problem you can use RMSE, and for classification problems you can use accuracy. Among other things, metrics are used to track the progress of the ML models, so you can measure how a change in the data or the model's hyperparameters affects the model's predictions. Your experiments should always be metrics-driven. Choosing the right metric is as important as choosing the model itself, and you should know that "there is no such thing as a free lunch": you will need to experiment and check what works better for your problem.
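For example, both metrics can be computed with sklearn (the numbers are toy values for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Regression: RMSE (root mean squared error); toy values.
y_true_reg = [3.0, 5.0, 2.5]
y_pred_reg = [2.8, 5.4, 2.0]
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))

# Classification: accuracy; toy values.
y_true_clf = [0, 1, 1, 0]
y_pred_clf = [0, 1, 0, 0]
acc = accuracy_score(y_true_clf, y_pred_clf)

print(f"RMSE: {rmse:.3f}, accuracy: {acc:.2f}")
```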

Most of the time, the stakeholder will not know anything about RMSE (or any other ML metric), but you will be responsible for showing them that the chosen metric aligns with the business goal. Here you have two options: a) as mentioned before, explain the meaning of the ML metric and prove its relevance, or b) develop a parallel business metric. A business metric is an indicator expressed in business-domain units. For example, for a recommender in an online clothing store, a business metric could evaluate how often the model recommends products of the same gender as the person they are recommended to. A business metric is easier for stakeholders to understand and, eventually, it will become an important driver of your model.
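A sketch of how such a business metric could be computed; the table of recommendations and its column names are hypothetical:

```python
import pandas as pd

# Hypothetical log of served recommendations; column names are made up.
recs = pd.DataFrame({
    "customer_gender": ["F", "F", "M", "M"],
    "product_gender":  ["F", "M", "M", "M"],
})

# Business metric: share of recommendations whose product gender
# matches the customer's gender.
gender_match_rate = (recs["customer_gender"] == recs["product_gender"]).mean()
print(f"gender match rate: {gender_match_rate:.0%}")
```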

Each iteration/experiment needs to be recorded. The bare minimum you need to record is the result of your training: the metrics. Jupyter Notebooks can serve as a simple log for this purpose. From there, you can move to more elegant solutions that let you track not only the metrics but also the data used for training and the generated model (along with the hyperparameters used to obtain it): MLflow, Weights & Biases, etc.
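For instance, a minimal MLflow sketch (the run name, parameters and metric values are placeholders):

```python
import mlflow

# Placeholders: the run name, parameters and metric values are invented.
with mlflow.start_run(run_name="iteration-1"):
    mlflow.log_param("model", "logistic_regression")
    mlflow.log_param("max_iter", 1000)
    mlflow.log_metric("accuracy", 0.87)
    # mlflow.sklearn.log_model(model, "model")  # also store the artifact
```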

There are three outcomes of the Research stage:

Research stage outcomes

  • Promising results, but not enough to meet the condition criteria: the best option is to extend the proof of concept; to improve your model's results, you can try changing the data pre-processing and/or the machine learning model.
  • The model meets the condition criteria to move to Development (the best result): the best course of action is to build the actual product starting from the proof of concept and, in parallel, keep refining the current proof of concept.
  • The current model doesn't meet the condition criteria (for example, because of a lack of data, or because a better Machine Learning algorithm to model the problem is missing): you can put the case on hold until there is a change in the case's context.

If you are interested in these kinds of issues, I recommend this reading: Managing Machine Learning Projects, from Amazon's Machine Learning University.

The next post will be more technical, and I will delve into some tools and techniques I use during the Research stage. Stay tuned.
