
Starting a New Machine Learning Model


Photo by Brian Matangelo on Unsplash

As many of my readers will know, I recently joined a new organization! project44 is building incredibly valuable logistics visibility and transparency tools for companies that move stuff from place to place – meaning, most companies! If you produce, transport, or sell any tangible product you want to know where it is, where it’s going, and when it’s supposed to arrive, and project44 is here to help with all those things and more. (And we’re hiring for Data Science and many other roles! Let me know if you want to learn more.)

Thanks to the great data science team already in place, I was able to immediately get started on a high-value modeling project when I arrived – a position every data scientist loves to be in. On reflection, I realized that my ability to hit the ground running and feel comfortable diving in was due to some unwritten lessons I'd absorbed from past experience, and those are what I'm sharing with you here. I'm not at liberty to get into deep details about what the model is, but I can still give you some good tips.

The features are the thing

You might feel like pulling a sample and running a model on day 1 – this is a mistake. There is almost zero chance that your data is clean, formatted, and organized in a way that is conducive to a model worth a damn. Garbage in, garbage out! Your model is only as good as your feature engineering, and in most cases I'll spend at least three times as long on feature engineering as on building the model itself. Your features are the most important part of building a model, and a cleverly designed feature can improve a model far more than any amount of hyperparameter tuning.

This can mean expanding your view of what a feature is – a form of what we in the social sciences call measurement validity. In short: what is the underlying mechanism or concept you want to put in your model? That concept is what needs to be translated into a numeric feature, and the translation may be more complicated than you think. Multiple features are frequently required to fully cover the concept you intend to use, and there may still be aspects you have no sensible way to measure or quantify. Don't just accept the columns in your table or dataset as given – really think about what they represent and how they relate.
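To make that concrete, here's a hypothetical sketch (not from the actual project – the concept, field names, and thresholds are all invented for illustration): suppose the underlying concept is "shipment timeliness." No single raw column measures it, but several numeric features derived from two timestamp columns can cover it together.

```python
from datetime import datetime

def timeliness_features(promised_at: str, arrived_at: str) -> dict:
    """Derive several features that together approximate the concept
    of 'shipment timeliness' from two raw timestamp strings."""
    promised = datetime.fromisoformat(promised_at)
    arrived = datetime.fromisoformat(arrived_at)
    delay_hours = (arrived - promised).total_seconds() / 3600
    return {
        # signed magnitude of lateness (negative means early)
        "delay_hours": delay_hours,
        # simple binary on-time indicator
        "was_late": int(delay_hours > 0),
        # context of the arrival: Saturday/Sunday delivery
        "arrived_weekend": int(arrived.weekday() >= 5),
    }

features = timeliness_features("2023-03-01T12:00", "2023-03-01T18:00")
# e.g. {'delay_hours': 6.0, 'was_late': 1, 'arrived_weekend': 0}
```

Notice that no one of these columns captures "timeliness" on its own – the magnitude, the binary flag, and the arrival context each measure a different facet of the concept, which is exactly the measurement-validity point above.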

What are we actually doing?

I've gotten ahead of myself! Let's return to the beginning, where you've been assigned a modeling task and have free rein. We're going to spend the first weeks of the project on feature engineering, taking the expansive view I described above. How do we actually implement that in practice? In my experience, you have to start by researching the problem. Until you know a good bit about what the outcome is, what data is available, the limitations on that data, and so on, you don't know which features to even try to build (or which underlying concepts matter). Understanding the context and situation is not optional – it's necessary to get things right.

As you study the problem, construct a plan of the areas you think might be promising. This lends itself rather well to tickets or stories if your team uses a kanban or agile-style approach. If you discover an area that context or theory suggests might be a big influence on your outcome, set it aside for a research deep dive, but don't do the dive just yet. Keep going until you have a robust set of deep dives to do, document the general idea of each one, then work through them systematically. And set yourself time bounds, so that no single area becomes a rabbit hole – these areas need to be researched well, but you still have other stuff to do!

Draw, draw, draw

This might just be a "me" thing, but when I'm learning a new subject-matter area in preparation for modeling, I make endless visualizations to help me grok the thing. Graphs, plots, maps, diagrams – I find these incredibly useful for developing my understanding so I can later comprehend how my features relate to the outcome and to each other. Check your intuition about the things you're building, and test your assumptions! Visualizations are a great way to do that. For the model I'm building in my new role, I've so far used pen and paper, bokeh, Altair, and geopandas to make visuals, and I'm still only working on the first few features.

80/20 Rule

As I’ve heard from many wise bosses, including my current one, there’s a tradeoff between pursuing perfection on this problem, and getting on to the next high priority thing that needs doing. The idea behind 80/20 is that you’ll get 80% of the results with 20% of your effort, then you’ll have to grind out 80% of the work to get that last marginal 20% of improvement. (The metaphor falls apart a little when we realize that perfect model performance is not going to happen, but you get my drift.)

If you're lucky like us, there are tons of problems to solve and data to solve them with, and the only thing standing in the way is your time and staff capacity. This means, for us, that we need a good model that meets the needs of customers, but we do not need a "perfect" model. Incremental improvement after the development of an MVP (minimum viable product) is natural in software development, and it's a framing data science also frequently uses.

Your models will always have some misses and some blind spots or edge cases they fail on – that’s okay! In fact, it’s inevitable. You as a data scientist need to understand and intuit when you have reached a model that does the job, and learn to let it go at that point. You’ll come back to it later to improve it, and no one is expecting perfection. (If they are, that is a problem with them and perhaps with the mentality around data science at your org, but it is not a problem with your model.) Going back to work on that MVP is almost certainly less urgent for your business than getting another MVP out the door for the next major problem.

Once all your high-priority tasks or problems have models, you're freed up to go back and take a swing at retraining or making incremental improvements to your existing models. That's what it looks like when you're a mature data science team maintaining a stable of models, and it's a different problem than the one we're discussing here.

Conclusion

This is of course just a short set of tips for getting started on your modeling process – once you get going, you'll discover lots of interesting things, such as idiosyncrasies in the data, innovative new ways to generate features, and perhaps totally unexpected challenges to your preconceived assumptions. That's the fun of the job! Embrace the problem-solving, exploratory nature of machine learning, and you'll end up with models you can be proud of.
