Any Data Science project, like any sufficiently large project, is made up of many sources of complexity. You may first need to connect to a database, read the data, clean it, process it for your models, implement the model, run the training, save the model, deploy it, and so on. Each step can take a lot of time, and hours can pass between starting work on the project and having visual proof that it works. You are blind to most of the bugs you are currently writing because you get no quick feedback on your mistakes. Imagine you have spent considerable time processing the data and modeling, but your models don't seem to work well. How do you debug them? If your data and model are complex, it will be very difficult: you may already have hundreds of lines of code, and it's a mess.
This is even harder in Data Science than in software engineering, because your code can fail silently, producing imprecise predictions instead of crashing.
There is a simpler, more efficient way to attack any data science project:
1. Create extremely simple data
You may think this is extra work but, in reality, you can implement it in less than five minutes and it will save you hours of debugging your model on complex data. You want to train your model on data that is compatible with your problem but in the simplest possible form. If you're training a classifier, you can construct the features so that the class is an obvious function of them.
For example, I recently worked on anomaly detection. The data was big and complex; I knew it would take me hours of processing before modeling, and that it would be hard to manipulate and debug. So, I created extremely simple data: I generated points on the unit sphere in R⁵ (points of R⁵ with norm 1) as "normal" data, and points of R⁵ with norm 1.3 as anomalies. It was extremely fast to code (roughly 1 minute) and let me verify that my model could solve that very easy problem.
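A toy dataset like this really does take only a few lines. Here is a minimal sketch with NumPy, using the norms 1 and 1.3 from the example (the function and variable names are illustrative, not from the original project):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sphere(n_points, dim=5, radius=1.0):
    """Sample points uniformly on the sphere of the given radius in R^dim.

    A standard trick: draw Gaussian vectors, then rescale them to the
    desired norm. The direction of a Gaussian vector is uniform on the sphere.
    """
    x = rng.normal(size=(n_points, dim))
    return radius * x / np.linalg.norm(x, axis=1, keepdims=True)

normal_data = sample_sphere(1000, radius=1.0)  # "normal" points, norm 1
anomalies = sample_sphere(50, radius=1.3)      # anomalous points, norm 1.3
```

Any reasonable anomaly detector should separate these two sets perfectly; if it doesn't, the bug is in your code, not in the data.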
2. Create the simplest version of your model
When writing code, accept that, no matter how cautious and skilled you are, you will write bugs; it's simply too hard not to. There is a well-known idea in software engineering that junior programmers expect to, at some point, become skilled enough that they no longer write bugs, while more senior programmers know that bugs are part of the job and that it's more important to learn to be resilient to them than to expect to write none at all.
Now, if you know there will be bugs, you want your model to be as simple as possible so that you can debug it quickly and easily. If your final model is expected to have many large and diverse layers, find a way to build a smaller one. Your model needs just enough capacity to solve the problem posed by the super simple data of step 1, and no more.
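For the toy anomaly problem above, such a minimal model could be sketched as a one-hidden-layer MLP in PyTorch (the class name and layer sizes are illustrative assumptions, not the original code):

```python
import torch
from torch import nn

class TinyClassifier(nn.Module):
    """Deliberately minimal: one small hidden layer, just enough capacity
    to separate the two spheres from step 1."""

    def __init__(self, in_dim=5, hidden=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # one logit: normal vs. anomaly
        )

    def forward(self, x):
        return self.net(x)
```

Note that a single linear layer would not be enough here: two concentric spheres are not linearly separable, so one small hidden layer is the minimal capacity for this particular toy problem.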
3. Train your model and iterate
I use PyTorch Lightning in all my Deep Learning projects because it makes prototyping extremely easy and fast. In roughly 15 minutes, I can implement super simple data, a first iteration of the model, and a PyTorch Lightning module, so that within a few minutes I get feedback on the whole training pipeline. From there, there are two options. Very rarely, everything works as expected and you can increase the complexity of the data or the model to get closer to your real problem. Most of the time, though, you'll face a few errors and problems; they will be easy to solve because you prepared for them to be. You solve them, and then you can start increasing the complexity of your model or your data.
Conclusion
Now that you have the whole training pipeline working, you can iterate until you have a working model on the real data. There are two sources of productivity gain in this approach:
- The bugs you wrote are much more localized because you got much more feedback. You can think of your project timeline as sliced into many small portions: whenever you hit an error, you know the bug appeared in the current (small) portion, which makes it much easier to spot.
- The bugs coming from complex interactions between the real data and the real model are split into multiple smaller problems that are easier to solve.