Step by step guide to explaining your ML project during a data science interview.

With a bonus sample script at the end that lets you show off your tech skills discreetly!

Dr. Varshita Sher
Towards Data Science


This is Part 2 of the Interview Question series that I recently started. In Part 1, we talked about another important data science interview question pertaining to scaling your ML model. Be sure to check that out!


Interviews can be intimidating, but explaining a project you put your blood and sweat into shouldn’t be!

A typical open-ended question that often comes up during interviews (both first and second round) is related to your personal (or side) projects. This question can take on many forms, for instance:

  • Can you walk us through a recent project you completed?
  • Can you tell us about a time you were part of a challenging project?
  • What are some interesting projects you have worked on?

And trust me when I say this, this question is the best thing that can happen to you during an interview. It lets you steer the conversation in your favor and focus on topics/ML frameworks/algorithms you are confident about!

In this article, we will decode how to pick an interesting project, how to structure your answer so that you don’t miss any important detail, and which buzzwords should definitely be part of your answer. All this in 11 easy steps!

Step 1: Selecting a project.

It goes without saying: while picking a project to demonstrate your technical prowess, make sure it resonates with the company you are applying to.

For instance, for an e-commerce company, I would go with a retail dataset, for a fintech company I would choose a loan application dataset, and for a healthcare company I would prefer to pick Covid-19 or a breast cancer dataset. The trick is to pick a project based on your target audience. I swear by Kaggle to provide good quality datasets along with some analysis notebooks to get you started.

Also, it is a good idea to have a few end-to-end projects from different sectors under your belt.

Step 2: Explaining the data source.

Begin your explanation by specifying where you got the data to work with.

It could be that the data was provided to, or collected by, you at your last company. Maybe you did the project for fun and extracted the required data via Kaggle. You can even mention that it was some open-source data freely available on the net. Perhaps you mined the data (ethically, of course) using third-party APIs (this happens a lot with Twitter data). Whatever the case, make sure you reveal the source of your data. Additionally, give a brief overview of some of the columns/features in your dataset.

Step 3: Explain your objective behind this project.

Specify what it is that you were trying to achieve with this project.

It could be a classification problem to separate approved vs. rejected loan applications, a regression problem to predict house prices, a cold-start problem for a recommender system, or a clustering problem to find similar users for targeted advertising. Absolute clarity in terms of explaining what the project was about is paramount from an interviewer’s perspective.

Step 4: Preparing your dataset.

This is where you talk about data cleaning, data wrangling, handling outliers, multicollinearity, duplicate removal, feature engineering, feature normalization, etc., AND the techniques you used to handle each of them.

The idea is to mention the actual techniques you used for each of the data preparation steps mentioned above. For instance, explicitly state that observations with a Cook’s distance of more than 3 times the mean were considered outliers. VIF (Variance Inflation Factor) values exceeding 10 were regarded as indicating multicollinearity. One-hot encoding (or label encoding) was used to handle categorical data. Numerical data was scaled/normalized to ensure all features are on the same scale. An 80:20 train-test split was done to ensure there is no data leakage. A t-SNE plot was used to check for visible separation between classes (in the case of a classification problem).
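To make this concrete, here is a minimal sketch of what some of these preparation steps might look like in code. The dataset, column names, and target below are made up purely for illustration:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("loans.csv")               # hypothetical dataset
num_cols = ["income", "loan_amount"]        # hypothetical numeric features
cat_cols = ["car_type"]                     # hypothetical categorical feature

# Outliers: drop rows whose Cook's distance exceeds 3x the mean
# (a throwaway OLS fit is used only to obtain the influence measures)
ols = sm.OLS(df["loan_amount"], sm.add_constant(df[["income"]])).fit()
cooks = ols.get_influence().cooks_distance[0]
df = df[cooks <= 3 * cooks.mean()]

# Multicollinearity: flag numeric features whose VIF exceeds 10
vif = {col: variance_inflation_factor(df[num_cols].values, i)
       for i, col in enumerate(num_cols)}
print("VIF:", vif)

# One-hot encode categoricals, then split 80:20 and scale,
# fitting the scaler on the training set only to avoid leakage
df = pd.get_dummies(df, columns=cat_cols)
train, test = train_test_split(df, test_size=0.2, stratify=df["approved"], random_state=42)
train, test = train.copy(), test.copy()
scaler = StandardScaler().fit(train[num_cols])
train[num_cols] = scaler.transform(train[num_cols])
test[num_cols] = scaler.transform(test[num_cols])
```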

Step 5: State the KPIs or Performance Metrics

What is it that was most important to your research problem — accuracy, precision, recall, false positives, false negatives, etc?

Every ML model has some metric that you are trying to optimize. It can vary from person to person, problem to problem, stakeholder to stakeholder, and even sector to sector. A healthcare data scientist will have to ensure their model produces few false negatives, as a missed cancer diagnosis could cost a patient their life. On the other hand, a system installed in a mall to detect shoplifters has to worry about too many false positives, as each one means a great deal of embarrassment for an otherwise innocent shopper.

Also, in the case of a classification problem, make sure you specify whether your dataset was balanced or imbalanced, mainly because your choice of performance metric depends heavily on the distribution of classes in your dataset.
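For instance, computing these metrics with scikit-learn takes only a few lines; the toy labels below stand in for a real (imbalanced) test set:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]   # imbalanced toy ground truth
y_pred = [0, 0, 0, 1, 0, 0, 0, 0, 1, 1]   # toy model predictions

# Break the errors down: which type (FP or FN) hurts more for your problem?
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"FP={fp}, FN={fn}")

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))   # a common default for imbalanced classes
```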

Step 6: Baseline model

How would you know the model you created is any better (than the ones already existing out there)?

This is one step that is often overlooked but is immensely important. I have seen so many people fumble when the interviewer's follow-up question is: so tell me, why do you think your model is any good?

It is important to have a baseline that you can compare your final model against. More often than not, you can pick a baseline through a quick Google search: what's the highest accuracy achieved on the MNIST dataset? What's the accuracy of Netflix’s recommendation system, Cinematch? If you cannot find a baseline for your field/problem, you can always create one yourself.

A baseline model is one that is simple to set up and has reasonable chances of providing decent results. For instance, in the case of time series forecasting of ice-cream sales, my baseline (read: stupid) model may make predictions for tomorrow solely using data from today.
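Such a persistence baseline takes only a few lines; the ice-cream sales numbers below are invented purely for illustration:

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error

sales = pd.Series([120, 135, 150, 160, 90, 110, 140], name="daily_sales")  # toy data

# Persistence ("naive") baseline: predict tomorrow's sales as today's value
baseline_pred = sales.shift(1).dropna()
actual = sales.iloc[1:]

print("baseline MAE:", mean_absolute_error(actual, baseline_pred))
# Any real forecasting model now has a concrete number to beat.
```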

Step 7: Explain the training process

Explain why a particular algorithm was selected.

Always begin by spot-checking several algorithms using cross-validation, then select the one with the highest value of the performance metric (specified in Step 5). In my experience, it always comes down to two or three algorithms that vary only slightly in their performance. Of those, I personally tend to go with an ensemble model like Random Forest or XGBoost (especially since ensembles tend to increase prediction accuracy by combining the predictions of multiple models).

But hey, that's my reasoning! You might be comfortable selecting linear regression as your go-to model, especially since it is so easy to explain to non-technical stakeholders. All in all, do remember to share your model-selection decisions with your interviewer.
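A rough sketch of the spot-checking step, assuming the F1 metric from Step 5; the candidate models and the synthetic dataset are placeholders, not a prescription:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic, mildly imbalanced stand-in for a real dataset
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "RandomForest": RandomForestClassifier(random_state=42),
}

# 5-fold cross-validated F1 for each candidate; the best one or two move forward to tuning
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name:20s} mean F1 = {scores.mean():.3f}")
```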

Step 8: Explain the model tuning process

How did you increase the accuracy of your model? What challenges did you face during this process?

The trick is to talk about how you improved accuracy while preventing a high-variance (i.e., overfitting) problem. Talk about hyperparameter tuning using grid search or randomized search. Also, specify whether you used some sort of oversampling/undersampling technique to balance your dataset. Emphasize the use of pipelines in your ML workflow to avoid data leakage and the subsequent overfitting. Explain how you used learning curves to keep track of the loss function over time.
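As a hedged sketch, here is what tuning inside a leakage-safe pipeline might look like, using SMOTE from the imbalanced-learn package together with grid search; the parameter grid and synthetic data are illustrative only:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Because SMOTE sits inside the pipeline, resampling is applied only to the
# training folds during cross-validation, so the validation folds stay untouched.
pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("rf", RandomForestClassifier(random_state=42)),
])

param_grid = {
    "rf__n_estimators": [100, 300],
    "rf__max_depth": [None, 5, 10],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```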

As for the challenges, these are unique to one’s personal experience. I will explain my biggest challenge in the sample script at the end.

Step 9: Model deployment process

Is your model sitting in your Jupyter Notebook or out in the world for everyone else to enjoy as well?

Quite often, interviewers are looking for potential data science recruits who know how to wrap up their model in a nice little container and present it to the world. This could be in the form of a web app or an API. I would highly recommend going that extra mile and learning how to do one of them. Complete beginners can check out this article on how to deploy models as APIs using the Flask framework.
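For example, a minimal Flask wrapper around a pickled model could look something like the sketch below; the file name, route, and feature payload are assumptions, not code from the linked article:

```python
# app.py: a minimal sketch of serving a trained model as a JSON API
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

with open("model.pkl", "rb") as f:   # hypothetical pickled scikit-learn model
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()                      # e.g. {"features": [35, 12000, 0.4]}
    prediction = model.predict([payload["features"]])
    return jsonify({"prediction": int(prediction[0])})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```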

Step 10: Prepare some backup questions!

If you were to do something differently, what would that be? If I were to change condition X in this project, how would your approach change?

Questions like the ones above are bound to be asked once you finish explaining your project. And most of the time, that is a good sign! It means the interviewer was listening and is genuinely interested in knowing more about your work. Try to think of a few things you would want to do differently for your project: for instance, I would try to get access to unbiased data (say, data with equal representation of males and females), I would experiment with stacked models, I would re-assess my confusion matrix at different classification thresholds, etc.
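The last of those ideas, re-assessing the confusion matrix at different classification thresholds, is quick to demonstrate; here is a small sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# Predicted probabilities for the positive class
probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# Sweep the decision threshold and watch the FP/FN trade-off shift
for threshold in (0.3, 0.5, 0.7):
    preds = (probs >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, preds).ravel()
    print(f"threshold={threshold:.1f}: FP={fp}, FN={fn}")
```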

Step 11: Remember to breathe!

Don’t rush to finish! Enjoy the process and weave a beautiful story for your audience.

I know how overwhelming it can get to condense a project that took you 3–4 weeks into a 90-second elevator pitch. But it is not about how much work you did, it’s about how much of it you can convey effectively to the interviewer. So breathe….


And now to tie it all up with a sample script!!!

BONUS: Sample answer script to this question

Q — So, can you tell us about any projects you did recently?

A- For sure. I did quite a few, especially given the lockdown, but I think I will pick my favorite, simply because it cleared up a lot of data science basics for me. During my Ph.D. internship, I was providing consultations to a company that specialized in giving loans for second-hand cars. They were trying to automate the process of assessing incoming loan applications in the shortest time possible. So basically, I was dealing with a binary classification problem, but with imbalanced classes: the number of approved applications was far higher than the number of rejected ones.

Since the data was imbalanced, I decided to use F1 as the performance metric, as it balances precision and recall and thus penalizes both false positives and false negatives. The dataset was made available by the company itself and contained features such as age, car type, loan amount, deposit, credit score, etc. As part of the exploratory data analysis phase, I handled outliers using Cook’s distance, checked multicollinearity using VIFs, removed duplicates, imputed missing values using KNNImputer, and performed an 80:20 split on the data. Categorical and numerical features were one-hot encoded and normalized, respectively. All the analyses were performed using SkLearn.

For the baseline model, I chose a simple model that predicts the application outcome using only the applicant’s credit score. With that, I was able to achieve an F1 score of 56%. For model selection, I began by spot-checking a few algorithms (SVM, LR, KNN, NN, and RF); RF gave the best cross-validation scores among them. I proceeded with hyperparameter tuning using GridSearchCV and was able to achieve an F1 score of 74% on the test set. I also tried to rebalance the classes using SMOTEENN (from the imbalanced-learn library), but that didn’t help much in improving performance. What did help was doing some feature reduction using the variable importance plot from RF, and finally my F1 score was around 82%.
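The resampling and feature-reduction steps mentioned here might look roughly like the sketch below; it uses toy data, and the real project's features are not reproduced:

```python
import numpy as np
from imblearn.combine import SMOTEENN
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=42)

# Rebalance the classes with SMOTEENN (SMOTE oversampling followed by Edited Nearest Neighbours cleaning)
X_res, y_res = SMOTEENN(random_state=42).fit_resample(X, y)

# Feature reduction: keep only the features a fitted RF ranks as most important
rf = RandomForestClassifier(random_state=42).fit(X_res, y_res)
top_k = np.argsort(rf.feature_importances_)[::-1][:10]   # indices of the 10 most important features
X_reduced = X_res[:, top_k]
print("kept features:", top_k)
```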

I was satisfied with this and decided to test my model on a previously unseen held-out test set given by the company. I was surprised to find it did not perform as well as I expected. I then referred to a few blog articles to figure out what I was doing wrong. Apparently, one-hot encoding categorical features is bad for tree-based models, mainly because we create many sparse binary features and, from the splitting algorithm’s point of view, they are all independent. As a result, continuous variables are automatically given higher importance and chosen near the top of the tree to make a split. To alleviate this problem, I switched from SkLearn to the H2O framework, which does not require categorical features to be one-hot encoded. While my F1 score remained roughly the same, my H2O model generalized better on the unseen dataset.
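As a rough illustration of that switch, H2O estimators can consume categorical columns directly once they are marked as factors; the file and column names below are made up:

```python
import h2o
from h2o.estimators import H2ORandomForestEstimator

h2o.init()

# Hypothetical loan-application data; H2O handles string columns as categoricals natively
frame = h2o.import_file("loan_applications.csv")
frame["approved"] = frame["approved"].asfactor()      # mark the binary target as categorical

train, test = frame.split_frame(ratios=[0.8], seed=42)
predictors = [c for c in frame.columns if c != "approved"]

model = H2ORandomForestEstimator(ntrees=200, seed=42)
model.train(x=predictors, y="approved", training_frame=train)

# The performance report includes F1 (at the best threshold) among other metrics
print(model.model_performance(test))
```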

Ending Note

So there you have it: a detailed guide to showcasing your ML project. Make sure no important detail is left out of your answer. As for the buzzwords I talked about in the beginning, here are a few that are sure to impress your interviewer:

  • Hyperparameter tuning
  • Cross-validation
  • ML workflow pipelines
  • Bias-variance tradeoff
  • Over/under sampling
  • Ensemble models
  • Scaling, normalization, one-hot encoding
  • Feature reduction, feature engineering
  • Performance metric, confusion matrix

I enjoy writing step-by-step beginner’s guides, how-to tutorials, interview questions, decoding terminology used in ML/AI, etc. If you want full access to all my articles (and others on Medium), then you can sign up using my link here.
