
It began as a mistake…
Starting to tackle a new Machine Learning problem is so exciting! That's when the inner nerd really awakens in us and we might think to ourselves: "Finally a new dataset where I can try out this new Neural Network architecture I saw in a Medium post, and maybe I can implement the workflow with this new ML library everybody keeps talking about lately, and…"
Sometimes, when we start working on a new ML task it feels like we can be children again, doesn’t it? We can be curious, explore, try out new things, potentially fail, learn how to walk again and in the end impress our colleagues, customers or boss with this cool new approach!
I had this motivational boost many times during my early years as a data scientist, but very often and very soon it resulted in a lot of frustration, stress and self-doubt, and I started asking myself: "Wouldn't it have been easier to just apply a simpler model using a framework I am already familiar with?" My model wasn't working properly, the new library turned out not to have all the features I needed to tackle my problem, and each iteration to improve accuracy took forever because my model was so complex and the amount of data was simply too big! Sound familiar? If not, maybe this article is not for you, but if you have ever experienced similar problems, I have a couple of tips for you!
How to avoid stress?
If something is not working as it should and time is short, frustration, stress and self-doubt are a natural reaction. But how do we avoid this?
Option 1: Never try out new things, stick to what you already know and don’t take a risk!
Sounds easy enough, right? But it also sounds like you are a robot with zero innovation and no passion for your job! You will probably get frustrated very quickly, since progress is what drives us, right? So this option doesn't really seem feasible. Anything else?
Option 2: Start simple, establish a baseline, gradually add complexity and prioritize!
Sounds nice and all, but what does this mean? Well, let us step through these instructions one by one, and if you decide to go for this option when you start working on your next Machine Learning task, you will hopefully finish with a great result and most importantly be happy while doing it!

Start simple
Maybe you have been there: a couple of days or weeks have passed since you started working on the exciting new ML task, and you are asked to present your preliminary findings. However, all you can present so far are some exploratory data plots without proper axis labels (because who has time for that?) and some non-working code, thanks to a new error message that popped up this morning, and you could swear it wasn't there yesterday evening! Even if the code were running, you wouldn't feel comfortable presenting the results, because the accuracies are so low and you don't really know why. How to avoid this?
Establish a baseline
The first thing you want to do when you start working on a new problem is to establish some kind of baseline! A baseline can be multiple things:
1. A simple model
Are you working on a binary classification problem? Perfect! Then simply start with a logistic regression! The key message here is: don't be ashamed to start off with the simplest method possible! I actually highly advise you to do so! Why? Because it gives you a good starting point and a reference for your next step. Start with as little complexity as possible and see where it works and where it doesn't. You will get a better understanding of your data that will help you in future steps. Furthermore, you will have a first experience of success and a benchmark to compare your future models against.
2. Rule-based approach (Quick and dirty implementation)
Finding a suitable simple model to start with can sometimes be hard. Imagine you want to write a spam filter for emails: which method and model do you choose as a baseline? Well, why not simply start with a simple rule-based approach? Do a quick exploratory analysis, find features and patterns that seem highly predictive, and then manually write down a series of if-else statements that classify your data. This can, for example, be based on the average text length (word count), the frequency of certain words (e.g. "credit card"), etc. Getting to a fast and simple baseline is important! You will get a feeling of accomplishment and set a first bar that you can try to beat in the next iteration; the key philosophy here is continuous improvement! A minimal sketch of both kinds of baseline follows right after this list.
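To make this concrete, here is a minimal sketch of both kinds of baseline for the spam example. It assumes scikit-learn; the toy corpus, keyword list and variable names are purely illustrative and not taken from any real project.

```python
# A minimal baseline sketch (illustrative): a hand-written rule-based spam
# filter and a logistic regression on simple word counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Tiny toy corpus, purely for illustration (1 = spam, 0 = not spam).
texts = [
    "win a free credit card today", "claim your free prize now",
    "credit card offer just for you", "you are our lucky winner",
    "free vacation if you reply now", "exclusive credit card deal inside",
    "meeting notes from yesterday", "can we move the call to 3pm",
    "please review the attached report", "lunch on friday?",
    "quarterly numbers look good", "draft of the slides attached",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=0, stratify=labels
)

# Baseline 1: a handful of if-else rules found by quick exploration.
def rule_based_spam(text: str) -> int:
    suspicious = ("credit card", "free", "winner", "prize")
    return int(any(token in text.lower() for token in suspicious))

rule_preds = [rule_based_spam(t) for t in X_test]
print("rule-based accuracy:", accuracy_score(y_test, rule_preds))

# Baseline 2: the simplest reasonable model, logistic regression on word counts.
vectorizer = CountVectorizer()
model = LogisticRegression(max_iter=1000)
model.fit(vectorizer.fit_transform(X_train), y_train)
lr_preds = model.predict(vectorizer.transform(X_test))
print("logistic regression accuracy:", accuracy_score(y_test, lr_preds))
```

Either of these gives you a number to beat in the next iteration, and both are cheap enough to throw away without regret.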
In my experience, a good baseline usually already gets you to about 90% of the performance of the best possible model (depending on the use case, of course). In my opinion, getting to that 90% fast, with an easy solution that I can interpret, is more impressive than spending days or weeks fine-tuning a neural network that is hard to explain, slower to run and only performs 2% better.
Once you have a baseline established, you have something to present to your colleagues and clients. You will have a better understanding of the data and the underlying problem, and it will take some pressure off you because you already have a first working solution! Great! Try to get some feedback from your colleagues and clients to see how this compares to their experiences and expectations. These insights will prove very valuable in the next iteration, when you try to improve your model. If your colleagues or clients do not have any valuable input, try to find literature and compare your first results to existing solutions. Before you start working on an improved version of the model, make sure you are working with the right amount of data (read below)!
Subset your data
Another pit I dug for myself a couple of times in the past was performing my tests and modelling on the entire data set. No, I am not talking about the importance of a training and test set here, but about the importance of downsizing! Having a lot of data is (usually) a great position to be in as a data scientist, but when you start working on your problem, don't make the mistake of using all of it from the beginning.
Every record in your data set makes each iteration of improvement and each test run slower. Start off by setting up your machine learning workflow and model architecture on as few data records as possible, just to have a running proof of concept, even if the model performance (in terms of accuracy) is poor!
Once your pipeline is working, increase your data set size continuously and see if the model performance increases. If it does not, your pipeline very likely has some unforeseen errors. Not working with the whole data set will let you iterate and debug much faster!
Subset your training set into a small, manageable dataset. If you work with structured tabular data, 30 samples per class are enough to get you started. If you work with unstructured data such as images, maybe even 1 or 2 images per class can be enough to set up a proof of concept.
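As an illustration, here is one way to pull a small stratified subset and then grow it while watching the score. This is only a sketch: it uses scikit-learn's iris data as a stand-in for your real dataset, and the column name "label" is made up.

```python
# Illustrative sketch: train on a small stratified subset per class, then grow
# the subset and check whether accuracy moves. Iris stands in for real data.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={"target": "label"})

train_df, test_df = train_test_split(
    df, test_size=0.3, random_state=0, stratify=df["label"]
)

for n_per_class in (5, 15, 30):
    # Stratified subset: n samples from each class of the training data.
    subset = train_df.groupby("label").sample(n=n_per_class, random_state=0)
    model = LogisticRegression(max_iter=1000)
    model.fit(subset.drop(columns="label"), subset["label"])
    preds = model.predict(test_df.drop(columns="label"))
    acc = accuracy_score(test_df["label"], preds)
    print(f"{n_per_class:>2} samples per class -> accuracy {acc:.2f}")
```

If the score does not improve as the subset grows, that is usually a hint that something in the pipeline, not the model, is broken.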
Gradually add complexity

Once you have your baseline, it is time to gradually improve your model! The most important thing is: do this gradually! Do not change too many parameters at once! This one is really hard to follow, I know! To be honest, I hardly ever adhere to this principle myself, but here is why we should:
While we are moving to more and more complex models, it is important to understand why the increased complexity works!
More complexity means more moving parts. The more changes we make at once, the harder it is to keep track of what change actually caused the improvement! "Was it the new feature I added or the hyperparameter I adjusted?"
By changing too much too fast, we will likely end up with a sort of black box that seems to work just fine but simultaneously loses explainability, which hinders future improvement. A strategy that many data scientists seem to follow is to increase the model's complexity and parameters until the model starts overfitting. Then they pull the handbrake by implementing some kind of regularization to prevent this. We should, however, always strive to understand how the complexity we added improved the model and whether it is really needed.
I found a quote on the ML@CMU blog that sums this up:
"When designing a new model, we should be constantly aware that we are not building a complex model for the sake of its complexity but for its expressiveness. Our inability to address this issue on both the intuitive and theoretical level is an intellectual debt that we will eventually need to pay off."
Prioritize what to work on

In the literature, this is often referred to as "finding the bottleneck". Machine learning models usually have a lot of screws that we can turn to tune and improve performance. Due to a lack of understanding or wrong prioritization, however, many data scientists focus on the wrong parameters and waste a lot of time improving the wrong thing.
The funniest thing, however, is that we often know that the screw we are turning is not the most important one, but we keep fiddling around with (unimportant) details just to keep ourselves busy, so we don't have to engage with the bigger problems.
So how do we find this one magic screw? One way is to avoid overly complex workflows (see the section above). When our model is simple, we usually have a good understanding of why it is not working and what to do to improve its performance. However, when things get more complex, we lose this overview and start guessing, turning, pushing and pulling with the expectation that something will eventually happen that solves our problems! Try to avoid this "technique", since it very often results in frustration and an even more complex model. Instead, try to spend some time on error analysis!
Error analysis
When you don't know why your model is performing poorly, look into the data points that were misclassified by your model (for classification) or that show the largest residuals (for regression). Sometimes these observations are outliers, but very often (if you look at enough data points) you will observe a pattern that gives you a better understanding of what's happening: maybe the misclassified data points all belong to the same class? If you have multiple data sources, maybe they all come from a certain sensor that returned flawed data points? Maybe they show more noise than the others?
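A minimal sketch of this kind of error analysis, assuming you already have predictions, true labels and some metadata such as a sensor id (all column names and values below are made up):

```python
# Illustrative error-analysis sketch: collect the misclassified samples and
# look for patterns in their class labels and metadata.
import pandas as pd

# Stand-in for your real evaluation results.
results = pd.DataFrame({
    "y_true": [0, 1, 1, 0, 1, 0, 1, 1],
    "y_pred": [0, 0, 1, 0, 0, 1, 1, 0],
    "sensor": ["A", "B", "B", "A", "B", "A", "A", "B"],
})

errors = results[results["y_true"] != results["y_pred"]]
print(f"{len(errors)} of {len(results)} samples misclassified")

# Do the errors cluster in one class?
print(errors["y_true"].value_counts())

# Do the errors cluster in one data source?
print(errors["sensor"].value_counts())
```

For regression, the same idea applies: sort by absolute residual and inspect the worst rows instead of the misclassified ones.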
Furthermore, it is important to know how well a human would perform on these tasks, or more generally what the maximum expected accuracy for this task is (e.g. based on the literature). If you have a reference for what is theoretically possible in terms of accuracy, it will help you prioritize correctly.

Take-home messages for a happier life as a Data Scientist
- Start with a simple model as a baseline. This helps to understand the data better and gives you a first feeling of accomplishment.
- Don’t be afraid to use simple methods to get you started. They are a great benchmark for future improvements.
- Gradually progress towards more complexity. Only change one parameter at a time.
- Do not add unnecessary complexity. Only add complexity if you can explain why it is needed.
- Subset your data; this enables you to iterate on your model more quickly and fix bugs faster.
- Prioritize what to work on!
- Analyze your model’s performance and assess priorities by thorough error analysis.
Further material:
- ML@CMU blog, Baselines. https://blog.ml.cmu.edu/2020/08/31/3-baselines/
- Emmanuel Ameisen, Building Machine Learning Powered Applications. https://christophergs.com/machine%20learning/2019/03/30/deploying-machine-learning-applications-in-shadow-mode/
- Andrew Ng, Introduction to Machine Learning in Production (Coursera, DeepLearning.AI). https://www.coursera.org/learn/introduction-to-machine-learning-in-production