How To Structure Machine Learning Projects

And Make Machine Learning Algorithms Work

Published in

Towards Data Science

5 min read · Dec 1, 2018

This article will not tell you which machine learning algorithms to learn, nor will it explain the nitty-gritty of the models.

If you’re looking for that material, I strongly recommend checking out my previous article, which covers how to choose online courses, which courses to take, and which books to read for a deeper understanding.

Instead, this article shows how you can make machine learning algorithms actually work for your projects, and how to structure those projects so that you don’t spend an unnecessarily long time optimizing your models in the wrong direction.

Machine Learning Yearning by Andrew Ng

Regardless of whether you’re a beginner or an expert in data science, chances are (and I mean 99%) you have heard of him.

His most famous Coursera course, Machine Learning, is a treasure to many students around the world. I have always been fascinated by his ability to break down complicated concepts into simpler pieces, especially for beginners in machine learning.

He also wrote a book, Machine Learning Yearning, which serves as a practical guide for those who are interested in machine learning.

And it’s FREE!

Grab it here. Once you’ve signed up for the mailing list, you’ll receive a draft of each chapter as it is finished.

UPDATE:

If you’re unable to see the link or sign up for the mailing list to get the draft, please get the FREE copy here from my Google Drive: https://drive.google.com/file/d/1q81NaLyN8WY8-BYyxSXpZioTkZa6974X/view?usp=sharing

So many machine learning books out there. Why this book?

Take an example: say you want to build a neural network that classifies images into different categories.

However, the accuracy of your neural network is not good enough and your team is required to meet the desired accuracy within a deadline.

Stressed. So you and your team start brainstorming ideas to improve the model. For instance:

Get more training data
Collect more diverse training data: Images with different settings and background for different categories
Increase the complexity of the model: More units, hidden layers
Keep tuning the model’s parameters for optimum settings
Reduce the learning rate of the algorithm (longer time needed)
Try adding regularization to the model
…

The good news is: If you choose the right directions, your model will be able to meet the required accuracy (or beyond) within the timeframe.

The bad news is: If you choose the wrong directions, you might end up wasting months (or even years) of development time, only to realize that you’ve made a wrong decision. Not good.

Some technical AI classes will give you a hammer; this book teaches you how to use the hammer.
—Andrew Ng

How do you proceed to make the most out of the model and achieve the optimum result?

You see, learning how to set direction for your team and make strategic decisions in the first place is important, and it often requires years of experience.

Therefore, this book is meant to make machine learning algorithms work for your projects and company by prioritizing the most promising directions, diagnosing errors in a complex machine learning system, improving your team’s productivity and so much more.

A sneak preview of this book

1. Setting up development and test sets

Training set: the data you run your learning algorithm on.
Dev (development) set: the data you use to tune parameters, select features, and make other decisions regarding the learning algorithm. Sometimes also called the hold-out cross validation set.
Test set: the data you use to evaluate the performance of the algorithm, but not to make any decisions regarding what learning algorithm or parameters to use.
This chapter shows readers which data distribution to use for the dev and test sets, how large those sets should be, and which metrics to optimize.
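
To make the three sets concrete, here is a minimal sketch in plain Python. The function name split_dataset, the 15% fractions and the fixed seed are my own assumptions for illustration, not recommendations from the book. The one idea it does try to reflect is that the dev and test sets are drawn from the same shuffled pool, so they share a distribution.

```python
import random


def split_dataset(examples, dev_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle once, then carve out the test, dev and training sets.

    The fractions and seed are illustrative assumptions, not fixed rules.
    """
    if dev_fraction + test_fraction >= 1:
        raise ValueError("dev and test fractions must leave room for training data")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # same pool for dev and test
    n_test = int(len(shuffled) * test_fraction)
    n_dev = int(len(shuffled) * dev_fraction)
    test_set = shuffled[:n_test]
    dev_set = shuffled[n_test:n_test + n_dev]
    train_set = shuffled[n_test + n_dev:]
    return train_set, dev_set, test_set


# Stand-in data: 1,000 example IDs instead of real images.
train_set, dev_set, test_set = split_dataset(range(1000))
print(len(train_set), len(dev_set), len(test_set))
```

You would tune on dev_set only and touch test_set just for the final evaluation.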

2. Basic error analysis

Evaluating multiple ideas in parallel during error analysis
Cleaning up mislabeled dev and test set examples
How big should the Eyeball and Blackbox dev sets be?
This chapter shows readers how to build a first simple machine learning system quickly, then iterate through error analysis to find the clues that point to the most promising directions in which to invest time.
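
Evaluating several ideas in parallel can be as simple as a tally. The sketch below assumes you have looked at a handful of misclassified dev set images by hand and noted which error categories apply to each one. The category names and the five reviewed images are made up for illustration.

```python
from collections import Counter

# Hypothetical notes from manually reviewing misclassified dev set images.
# Each entry lists the error categories that apply to one image.
reviewed_errors = [
    {"blurry"},
    {"mislabeled"},
    {"blurry", "dark background"},
    {"dark background"},
    {"blurry"},
]


def tally_error_categories(reviews):
    """Return (category, count, share of reviewed errors), most common first."""
    counts = Counter(tag for tags in reviews for tag in tags)
    total = len(reviews)
    return [(tag, n, n / total) for tag, n in counts.most_common()]


for tag, n, share in tally_error_categories(reviewed_errors):
    print(f"{tag}: {n} of {len(reviewed_errors)} errors ({share:.0%})")
```

A category that covers a large share of the reviewed errors is a candidate for where to spend time first. One image can fall into several categories, so the shares do not have to add up to 100%.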

3. Bias and Variance

Bias vs Variance tradeoff
Techniques for reducing bias and variance
This chapter explains bias and variance clearly and concisely. It stresses the importance of identifying whether a model is underfitting or overfitting, and it teaches readers some useful techniques for reducing bias and variance.
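
As a rough sketch of how this diagnosis can work in practice: treat the training set error (above some assumed best achievable error) as bias, and the gap between dev set error and training set error as variance. The function names, the optimal_error parameter and the example numbers below are my own assumptions for illustration.

```python
def diagnose(train_error, dev_error, optimal_error=0.0):
    """Split dev set error into rough bias and variance components.

    optimal_error is an assumed estimate of the best achievable error
    (for example, human-level error); it defaults to 0.
    """
    avoidable_bias = train_error - optimal_error
    variance = dev_error - train_error
    return avoidable_bias, variance


def suggest_focus(train_error, dev_error, optimal_error=0.0):
    """Name the larger component and one matching idea from the brainstorm list."""
    avoidable_bias, variance = diagnose(train_error, dev_error, optimal_error)
    if avoidable_bias >= variance:
        return "bias: consider a more complex model (more units or hidden layers)"
    return "variance: consider more training data or added regularization"


# Two made-up scenarios: (training error, dev error).
for train_error, dev_error in [(0.15, 0.16), (0.01, 0.11)]:
    print(train_error, dev_error, "->", suggest_focus(train_error, dev_error))
```

The first scenario fits the training set poorly (underfitting), so the bias ideas come first. The second fits the training set well but generalizes badly (overfitting), so the variance ideas come first.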

4. Learning curves

Plotting training error and learning curves
Interpreting learning curves: High bias
Interpreting learning curves: Other cases
This chapter explains why learning curves are so important in understanding the performance of models and how to use learning curves to make decisions based on the desired level of performance.
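
A learning curve is just training error and dev error measured at several training set sizes. Here is a minimal sketch that computes those numbers without any plotting library. The learning_curve helper and the toy model (a least-squares line through the origin fitted to synthetic y = 2x + noise data) are assumptions for illustration only. In a real project you would pass in your own fit and error functions.

```python
import random


def learning_curve(train_set, dev_set, sizes, fit, error):
    """For each training set size, return (size, training error, dev error)."""
    points = []
    for m in sizes:
        subset = train_set[:m]
        model = fit(subset)
        points.append((m, error(model, subset), error(model, dev_set)))
    return points


# --- Toy stand-ins below (assumptions, not from the book) ---

def fit_slope(examples):
    """Least-squares slope for a line through the origin."""
    numerator = sum(x * y for x, y in examples)
    denominator = sum(x * x for x, _ in examples)
    return numerator / denominator


def mean_squared_error(slope, examples):
    return sum((y - slope * x) ** 2 for x, y in examples) / len(examples)


rng = random.Random(0)


def make_examples(n):
    """Synthetic (x, y) pairs with y = 2x plus Gaussian noise."""
    examples = []
    for _ in range(n):
        x = rng.uniform(1, 10)
        examples.append((x, 2 * x + rng.gauss(0, 1)))
    return examples


sizes = [2, 5, 20, 100, 400]
curve = learning_curve(make_examples(400), make_examples(200), sizes,
                       fit_slope, mean_squared_error)

for m, train_err, dev_err in curve:
    print(f"m={m:4d}  training error={train_err:.3f}  dev error={dev_err:.3f}")
```

Reading the rows top to bottom gives you the two curves. If both errors level off above the performance you want, adding more data alone is unlikely to help.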

5. Comparing to human-level performance

6. Training and testing on different distributions

7. Debugging inference algorithms

8. End-to-end deep learning

9. Error analysis by parts

Final Thoughts

So you may be wondering now: Why are the remaining chapters above empty?

The answer is that I’m still in the process of reading the book. I will definitely finish it soon! 😃

And to be honest with you, after reading the first four chapters of the book I have already learned so much and discovered some of the useful techniques that I’d otherwise not have realized!

Most importantly, the book is not heavily technical, and each section is only one to two pages long.

Thank you for reading. I hope that sharing my takeaways gives you a brief overview of the book and how you can benefit from it.

Ultimately, the book’s practical advice will teach you how to structure your machine learning projects and make your models work for you, your team and the company.

As always, if you have any questions or comments, feel free to leave your feedback below, or you can always reach me on LinkedIn. Till then, see you in the next post! 😄

About the Author

Admond Lee is currently the Co-Founder/CTO of Staq — the #1 business banking API platform for Southeast Asia.

