
4 types of Data Science Interview Questions – Part1

Real questions with a strong ML focus

Conquer the Data Science Interview – Part1

In the previous post we spoke about conquering the coding round in DS interviews. In this post, I share actual questions asked in DS interviews and what each question is assessing.

Being interviewed can be overwhelming, but not when one is prepared [Photo by Sam McGhee on Unsplash]

DS interviews can be a saga of multiple rounds. It’s common for most companies to conduct 2–3 technical rounds, often followed by a Senior Director/hiring manager round. In some companies each round is an elimination round; in others, every round is meant to get an unbiased assessment of the candidate from multiple interviewers, so no single interview makes or breaks your journey towards that coveted job. It’s good to ask the Recruiter (HR) about the interview process and the rationale behind it.

Whatever the case, multiple rounds mean you can be assured of being tested on all aspects of the DS project pipeline. Broadly, there are the following types of questions:

1. Resume / Project based

2. ML Proficiency check – algorithm details

3. Case-study based questions

4. Metrics

We’ll cover types 1 and 2 in this post, and the remaining two in the next post.


Know your resume like the back of your hand [Source]

Resume / Project based

The basic requirement of any interview is that you are well versed with everything written in your resume – because, hey, it’s what you say you’ve worked on. This can be the strongest part of your interview.

You’ll be asked to explain any project from your resume. As you explain the project, the interviewer will delve deeper into some of its aspects, or they may add a twist and check how you adapt to complexity.

Example 1: Questions that delve deep

Project – Customer Churn prediction using GBM

Candidate(C): Spoke about data sources, creation of the dependent variable

Interviewer(I): What was the distribution of the 2 classes in the dependent variable?

C: 4% – 1s, remaining 0s

I: How did you deal with class imbalance?

C: Data interventions (Sampling strategies), algorithm interventions (Using class weights), and choosing the right reporting metric (precision/recall over accuracy)

I: What is the advantage of majority under-sampling over minority over-sampling? Did you try other techniques like SMOTE/ADASYN?

C: Explains the reason and moves on to the classifier – a GBM classifier

I: Which cost function does GBM use? How does it work? Other than GBM what else did you try?

I: How do you choose the best Precision/Recall cut-off?

and so on..
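To make this exchange concrete, here is a minimal sketch of two of the interventions the candidate mentions: up-weighting the minority class via sample weights and choosing a precision/recall cut-off from the precision–recall curve. The data is synthetic, and the ~4% positive rate, feature count and hyperparameters are illustrative assumptions, not the project’s actual setup.

```python
# Sketch: class imbalance via sample weights + precision/recall cut-off selection.
# All data and settings below are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic churn-like data with roughly 4% positives
X, y = make_classification(n_samples=20_000, n_features=20,
                           weights=[0.96, 0.04], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Algorithm intervention: up-weight the minority class
pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
weights = np.where(y_train == 1, pos_weight, 1.0)
gbm = GradientBoostingClassifier(random_state=42)
gbm.fit(X_train, y_train, sample_weight=weights)

# Reporting metric: pick the probability cut-off that maximises F1
# instead of relying on the default 0.5 threshold or raw accuracy.
proba = gbm.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = np.argmax(f1[:-1])  # the last precision/recall pair has no threshold
print(f"best threshold={thresholds[best]:.3f}, "
      f"precision={precision[best]:.2f}, recall={recall[best]:.2f}")
```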

Example 2: Explore all offshoots of the project

Project – Topic modelling using SVD

C: Introduces the use case and text pre-processing.

I: What is the difference between Stemming and lemmatization – what is used when?

C : Explains

I: How big was the data? What were the dimensions of the matrix used for decomposition? Any problems encountered using SVD?

C: SVD takes a long time to decompose, so we used another variant of it – FastSVD (developed by Facebook)

I: Can you explain how FastSVD works? What aids in speeding up the process?

C : Explains by going into linear algebra.

I : Ok, now that you have the topics, how did you validate the topics?

C: Manual + Topic coherence, Topic Perplexity

I: How did you decide the number of topics?

C: Using a scree plot of the importance of the singular values with respect to the number of topics

I: Explain the system architecture used in this project. Was the processing real-time or in batches? How did the backend system handle this?

and so on..
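For reference, here is a minimal sketch of what such a pipeline could look like: TF-IDF followed by a truncated SVD (scikit-learn’s randomized solver stands in here for the faster SVD variant the candidate mentions), a scree plot of the singular values to pick the number of topics, and top terms per topic for the manual validation step. The toy corpus and the choice of 4 components are placeholders, not the project’s data.

```python
# Sketch: topic extraction with truncated (randomized) SVD on a TF-IDF matrix.
# The corpus and number of components are illustrative placeholders.
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "dogs and cats make great pets",
    "the stock market rallied today",
    "investors bought shares in the market",
    "the striker scored a late goal",
    "the match ended with a dramatic goal",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)                  # (n_docs, n_terms) sparse matrix

svd = TruncatedSVD(n_components=4, algorithm="randomized", random_state=42)
doc_topic = svd.fit_transform(X)               # document-topic weights

# Scree plot: look for the "elbow" in singular value importance
plt.plot(range(1, len(svd.singular_values_) + 1), svd.singular_values_, marker="o")
plt.xlabel("component")
plt.ylabel("singular value")
plt.show()

# Top terms per topic, for the manual validation step
terms = tfidf.get_feature_names_out()
for i, comp in enumerate(svd.components_):
    top = comp.argsort()[::-1][:5]
    print(f"topic {i}:", ", ".join(terms[j] for j in top))
```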

Example 3: Add complexity to an existing problem

Project – Anomaly detection using log data

C: Explains the need for the use case and the data available.

I: What is the difference between an anomaly and an outlier?

C: An outlier is an extreme value that is still produced by the same mechanism that generated the rest of the data. An anomaly, however, is so extreme that it makes you suspect it was generated by a different mechanism altogether.

I: What other anomaly detection techniques did you explore? Can you reformulate your solution using Isolation Forest?

You get the trend…
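As an aside, a minimal sketch of the Isolation Forest reformulation the interviewer suggests, using scikit-learn. The feature matrix here is synthetic; in the actual project it would be built from the parsed log data, and the contamination rate is an assumed value.

```python
# Sketch: anomaly detection with Isolation Forest on synthetic "log" features.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(1000, 5))   # typical behaviour
anomalies = rng.normal(loc=6.0, scale=1.0, size=(10, 5))  # extreme behaviour
X = np.vstack([normal, anomalies])

iso = IsolationForest(n_estimators=200, contamination=0.01, random_state=42)
labels = iso.fit_predict(X)     # -1 for anomalies, 1 for normal points
scores = iso.score_samples(X)   # lower score = more anomalous

print("flagged as anomalous:", (labels == -1).sum())
```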


[Source]

Machine Learning Proficiency check – Algorithm details

In this type of question, you can expect to go deep into the workings of an algorithm. You can use mathematics, statistics, geometry, or anything else that works for you to explain an algorithm’s inner workings.

Examples:

1. What is the difference between RF and GBM? (This is a pretty common question, and for good reason: many business use cases are still being solved as classification problems, and for tabular data, tree-based models still perform quite well. It is also an entry question into the workings of tree-based algorithms, as you’ll see below.)

2. Why is a RF better than a Decision Tree?

3. How does RF/GBM assign a probability score to new data?

4. What is the Loss function used in GBM?

5. Explain the cross-entropy loss function. Why does it use log? (See the sketch after this list.)

6. How does RF calculate variable importance?

7. The variable importance of logistic regression offers an advantage over that of RF. Could you tell what it is? (Hint – it’s to do with p-values)

8. What methods do you use to detect overfitting?

9. Post deployment, the model trained on one distribution encounters another distribution. What will you do?

10. When would you use an LSTM over a simple RNN?

11. Can you explain how back propagation works?
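As an illustration of question 5, here is a small sketch of binary cross-entropy computed by hand; the log term is what makes confident wrong predictions expensive, which is the intuition behind using log rather than a linear penalty. The probability vectors below are made up purely for illustration.

```python
# Sketch: binary cross-entropy by hand, showing why the log matters.
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-15):
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 0])
confident_right = np.array([0.95, 0.05, 0.90, 0.10])
confident_wrong = np.array([0.05, 0.95, 0.10, 0.90])

print(binary_cross_entropy(y_true, confident_right))  # small loss
print(binary_cross_entropy(y_true, confident_wrong))  # loss blows up
```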


Conclusion

In this set of questions you saw a strong technical focus. In the next article, you’ll notice a business focus in the form of case-study and metric-based questions. Continue reading at –

4 types of Data Science Interview Questions – Part2

