
As some of you may have gathered by now, I am all about working efficiently and finding smart solutions to everyday problems. Being a data scientist helps me dissect the fundamental steps of a work pipeline and write articles that actually help the reader.
I already touched upon the topic of properly structuring your machine learning project, as it increases productivity and efficiency, especially when working in a team. Today I want to dedicate this article to a checklist of 6 items that I have found quite effective in directing my Data Science work.
I want to thank Andriy Burkov for his Machine Learning Engineering book. I’ve taken inspiration from it for this article, and it’s a great read overall – go through it if you haven’t already.
Let’s begin.
1. Use and Maintain a Feature Schema File
The schema file keeps track of what your features are, how they behave, and their general properties. It is useful because it keeps the whole team updated on what is fed to the model at every stage of development, and it gives the team a formal pattern to follow when debugging the model.
Create a file that records at least the following for each feature:
- its name
- its type (categorical, numerical, …)
- the minimum and maximum values allowed
- the sample mean and variance
- whether zeros are allowed
- whether undefined (missing) values are allowed
Feel free to create any kind of file to hold this information. Here’s an example of a schema.json file that holds this info for a synthetic dataset.
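The sketch below shows one way such a file could look and be generated; the two features, their statistics, and the property keys are all hypothetical, so adapt them to your own dataset:

```python
import json

# Hypothetical schema for a synthetic dataset: feature names, statistics,
# and property keys are illustrative, not a fixed standard.
schema = {
    "age": {
        "type": "numerical",
        "min": 18,
        "max": 95,
        "mean": 42.3,
        "variance": 118.6,
        "allows_zero": False,
        "allows_undefined": False,
    },
    "plan_type": {
        "type": "categorical",
        "allowed_values": ["basic", "premium", "enterprise"],
        "allows_undefined": True,
    },
}

# write the schema next to your data so the whole team can reference it
with open("schema.json", "w") as f:
    json.dump(schema, f, indent=2)
```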
2. Go Through Your README.md File
I just love writing my thoughts down before and during my ML project. This is all done in my README.md file, which is the first file I create in my repo.
It is essential that all your reasoning goes into this file so that the team (and even your future self) stays on the same page. It is easy to commit to an idea only to find, a couple of hours later, that it didn’t quite make sense given the project’s brief. Going through your notes and project assets will help you crystallize your intentions and be more efficient.
The way I structure my README file is not standard, but it generally follows this pattern:
- I write about the goal and the general idea behind it
- I list out the possible methodologies that can be used to tackle the problem
- I list out pros and cons of each one, commenting as much as possible on each approach
- I list out the challenges of the project and what I would need in terms of resources and knowledge to solve the problem right now
- I describe the impact the project has on the business
The fifth point helps with post-project storytelling. If I can document each step and talk through my process, I will have a great story to tell the stakeholders. Communication is as important as your analytical skills.
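As a sketch, a README.md that follows this pattern might start out like this (the project and the section names are just my convention, not a standard):

```markdown
# Churn Prediction (hypothetical project)

## Goal and General Idea
Predict which customers are likely to leave next quarter, so retention
campaigns can be targeted earlier.

## Candidate Methodologies
1. Logistic regression
2. Gradient boosting

## Pros and Cons
- Logistic regression: simple and interpretable, but may underfit
- Gradient boosting: usually stronger performance, but harder to explain

## Challenges and Required Resources
- At least two years of labeled historical data
- Domain knowledge on how the business defines churn

## Business Impact
- Even a small reduction in churn translates into retained revenue
```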
3. Set an Achievable Performance Level
Always talk to the stakeholders to understand their expectations about your work: what performance level they expect to see and what their threshold of satisfaction is. That threshold is the minimum performance you should aim to achieve.
To get an idea of how your model could perform, take these heuristics into consideration:
- if a human can do the same job without much effort, it is reasonable to expect that the model can perform at a similar level
- if you feed the model high-quality data, i.e. data that contains information relevant to your task, it is reasonable to expect that the model will perform well
- if a piece of software can achieve good results without ML, it is reasonable to expect that the model can perform at least at that level
- if another ML algorithm achieves good results on a similar dataset, it is reasonable to expect that your model can perform at a similar level
4. Choose One (and only one) Performance Metric
This is closely related to the performance level. You should assign your model a performance metric before training. For instance, a regression task could call for RMSE (root mean squared error) or MAE (mean absolute error).
You must choose the performance metric that makes the most sense for your problem. Choose one and only one performance metric and stick with it. Compare and track different models to understand how the metric changes among them. I talk about Model Selection and metric evaluation here.
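As a minimal sketch with scikit-learn (the synthetic data and the candidate models are illustrative), tracking a single metric – RMSE here – across models could look like this:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# synthetic regression data, just for illustration
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(random_state=0),
}

# one metric (RMSE) for every candidate, so comparisons stay apples-to-apples
for name, model in models.items():
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
    print(f"{name}: RMSE = {rmse:.2f}")
```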
5. Define a Baseline
You should always compare your models against a baseline. What counts as a baseline is largely a matter of choice, for instance:
- your baseline can be human-based: your model is compared to human performance on the same task
- your baseline can be a random prediction: the algorithm picks a random value from the training labels
- your baseline can follow a specific rule: for a classification task it returns the most frequent class, while for a regression task it returns the average y value
- your baseline can be a simple, rudimentary model
If your model can perform better than your baseline, then you know you are providing value to your company / team.
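A minimal sketch with scikit-learn, whose dummy estimators implement exactly the rule-based baselines above (the dataset and the model choice are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# rule-based baseline: always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

print(f"baseline accuracy: {baseline.score(X_test, y_test):.3f}")
print(f"model accuracy:    {model.score(X_test, y_test):.3f}")
```

For regression, DummyRegressor(strategy="mean") plays the same role as the average-y baseline.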
6. Split the Data into Three Parts
Kaggle and the many great content creators on the web have covered this aspect thoroughly, but it still deserves your full attention before training.
Make sure your data is split into three parts: a train set, a validation set, and a test set. Here are the differences:
- the train set is used to train your model
- the validation set is not seen during training; it is used to compare algorithms and tune their parameters
- the test set is not seen by the model either; it is used to evaluate the complete pipeline at the very end
I like to use this metaphor:
Your model is like a child at school. The kid studying in class is your model learning from the train set, the kid doing homework exercises is your model being tested on the validation set, and the kid sitting the final exam is your model on the test set.
Remember that your validation and test sets must come from the same statistical distribution as your train set; otherwise you’d be training your model on the wrong data. It’s like the kid studying a chapter he will never be tested on.
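As a sketch, a three-way split can be built from two calls to scikit-learn’s train_test_split; the 70/15/15 proportions are just an example:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# synthetic classification data, just for illustration
X, y = make_classification(n_samples=1000, random_state=0)

# first carve out 15% as the final test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0
)

# then split the remainder: 15/85 ≈ 0.176 of it becomes the validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.176, stratify=y_rest, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # roughly 700 / 150 / 150
```

Stratifying on y keeps the class distribution consistent across the three sets, which is one easy way to respect the same-distribution rule above.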
Conclusion
Here’s a TL;DR:
- Use a schema file to keep track of your features and their properties
- Store all the relevant information in your README.md file throughout all of the project’s phases
- Set an achievable performance level – talk to your stakeholders to understand their expectations
- Choose one and only one performance metric
- Compare your models to a well-known baseline to understand the value you are adding to the mix
- Be sure you are splitting the data correctly into train, validation, and test sets
I use this mental map all the time. I hope this will help you as well. Share your steps too – I am always looking to improve and refine my own processes 🙂