The Data Science Interview Blueprint

After my Data Science Manager offer with Deliveroo was rescinded, just months after I had started preparing to leave my cosy consultancy job, I didn't have much of a safety net to fall back on and couldn't afford to be unemployed for long. I'll share everything that helped me land two Data Scientist offers with Facebook, in the hope that it might help one of you who finds themselves in the unfortunate place I was in a few months ago.

Leon Chlon
Towards Data Science


1. Organisation is Key

I've interviewed at Google (and DeepMind), Uber, Facebook and Amazon for roles that fall under the "Data Scientist" umbrella, and this is the typical interview structure I've observed:

  1. Software Engineering
  2. Applied Statistics
  3. Machine Learning
  4. Data Wrangling, Manipulation and Visualisation

Nobody expects graduate-level competency in all of these topics, but you need to know enough to convince your interviewer that you're capable of delivering if they offer you the job. How much you need to know depends on the job spec, but in this increasingly competitive market, no knowledge is wasted.

I recommend using Notion to organise your job prep. It's extremely versatile, and it lets you apply the principles of Spaced Repetition and Active Recall to nail down the key topics that come up time and time again in Data Scientist interviews. Ali Abdaal has a great tutorial on note-taking with Notion to maximise your learning during the interview process.

I ran through my Notion notes over and over, especially right before each interview. This ensured that key topics and definitions were loaded into my working memory, so I didn't waste precious time "ummmmmm"-ing when hit with a question.

2. Software Engineering

Not all Data Scientist roles will grill you on the time complexity of an algorithm, but all of them will expect you to write code. Data Science isn't one job, but a collection of jobs that attracts talent from a variety of industries, including the software engineering world. You're therefore competing with people who know the ins and outs of writing efficient code, and I would recommend spending at least 1–2 hours a day in the lead-up to your interview practising the following concepts:

  1. Arrays
  2. Hash Tables
  3. Linked Lists
  4. Two-Pointer based algorithms
  5. String algorithms (interviewers LOVE these)
  6. Binary Search
  7. Divide and Conquer Algorithms
  8. Sorting Algorithms
  9. Dynamic Programming
  10. Recursion

DO NOT LEARN THE ALGORITHMS OFF BY HEART. This approach is useless, because the interviewer can question you on any variation of the algorithm and you will be lost. Instead, learn the strategy behind how each algorithm works. Learn what time and space complexity are, and why they are so fundamental to building efficient code.

LeetCode was my best friend during interview preparation and is well worth the $35 per month in my opinion. Your interviewers only have so many algorithm questions to sample from, and the site covers a host of algorithm concepts, flagging the companies that are known to have asked each question in the past. There's also a great community that discusses each problem in detail, which helped me through the myriad "stuck" moments I encountered. If the $35 price tag is too steep, LeetCode has a free tier with a smaller question bank, as do HackerRank and GeeksforGeeks, which are other great resources.

Attempt each question yourself, even if it's a brute-force approach that takes ages to run. Then look at the model solution, work out what the optimal strategy is, and read up on why it's optimal. Ask yourself questions like "why is Quicksort O(n log n) on average but O(n²) in the worst case?" and "why do two pointers and one for loop make more sense than three for loops?"
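
To make the two-pointer point concrete, here is a minimal sketch (my own illustration, not a specific LeetCode problem) of finding a pair in a sorted array that sums to a target: the two-pointer version makes one O(n) pass where the brute force checks O(n²) pairs.

```python
def pair_sum_brute_force(nums, target):
    """O(n^2): check every pair of indices."""
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return (i, j)
    return None


def pair_sum_two_pointers(nums, target):
    """O(n): nums must be sorted; move whichever pointer brings the sum closer."""
    lo, hi = 0, len(nums) - 1
    while lo < hi:
        s = nums[lo] + nums[hi]
        if s == target:
            return (lo, hi)
        if s < target:
            lo += 1  # sum too small: advance the left pointer
        else:
            hi -= 1  # sum too large: retreat the right pointer
    return None


print(pair_sum_two_pointers([1, 2, 4, 7, 11, 15], 15))  # (2, 4): 4 + 11
```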

3. Applied Statistics

Data science leans heavily on applied statistics, and how heavily depends on the role you've applied for. Where do we use applied statistics? It pops up just about anywhere we need to organise, interpret and derive insights from data.

I studied the following topics intensively before my interviews, and you can bet your bottom dollar that I was grilled on each one:

  1. Descriptive statistics (What distribution does my data follow? What are the modes of the distribution, the expectation, the variance?)
  2. Probability theory (Given that my data follows a Binomial distribution, what is the probability of observing 5 paying customers in 10 click-through events? See the sketch after this list.)
  3. Hypothesis testing (forming the basis of any question on A/B testing; T-tests, ANOVA, chi-squared tests, etc.)
  4. Regression (Is the relationship between my variables linear? What are potential sources of bias? What are the assumptions behind the ordinary least squares solution?)
  5. Bayesian inference (What are some advantages/disadvantages versus frequentist methods?)
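
As a minimal sketch of the probability question in item 2, here is how you might compute it in Python (the conversion probability p = 0.3 is my own illustrative assumption; the question as asked would give you p):

```python
from scipy.stats import binom

# P(exactly 5 paying customers in 10 click-throughs), assuming each
# click-through converts independently with probability p = 0.3.
p_exact = binom.pmf(k=5, n=10, p=0.3)

# P(5 or more paying customers), via the survival function.
p_at_least = binom.sf(4, n=10, p=0.3)

print(f"P(X = 5)  = {p_exact:.4f}")     # ~0.1029
print(f"P(X >= 5) = {p_at_least:.4f}")  # ~0.1503
```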

If you think this is a lot of material, you are not alone. I was massively overwhelmed by the volume of knowledge expected in these kinds of interviews and by the plethora of information on the internet that could help me. Two invaluable resources come to mind from when I was revising for interviews:

  1. Introduction to Probability and Statistics, an open course on everything listed above including questions and an exam to help you test your knowledge.
  2. Machine Learning: A Bayesian and Optimization Perspective by Sergios Theodoridis. This is more a machine learning text than a specific primer on applied statistics, but the linear algebra approaches outlined here really help drive home the key statistical concepts on regression.

The way you're going to remember this stuff isn't through memorisation; you need to solve as many problems as you can get your hands on. Glassdoor is a great repository of the sorts of applied stats questions typically asked in interviews. The most challenging interview I had by far was with G-Research, but I really enjoyed studying for their exam, and their sample exam papers were fantastic resources for testing how far my applied statistics revision had come.

4. Machine Learning

Now we come to the beast, the buzzword of our millennial era, and a topic so broad that it can be easy to get so lost in revision that you want to give up.

The applied statistics part of this study guide will give you a very strong foundation for machine learning (which is basically just applied applied statistics, written in fancy linear algebra), but certain key concepts came up over and over again during my interviews. Here is a (by no means exhaustive) set of concepts, organised by topic:

Metrics — Classification

  1. Confusion Matrices, Accuracy, Precision, Recall (Sensitivity)
  2. F1 Score
  3. TPR, TNR, FPR, FNR
  4. Type I and Type II errors
  5. AUC-ROC Curves
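
These all fall out of the confusion matrix, so it's worth being able to compute them both by hand and with a library. A quick sketch with made-up labels (scikit-learn assumed):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Toy ground truth, hard predictions and predicted scores (illustrative only).
y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred  = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3, 0.95, 0.25])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TPR = {tp / (tp + fn):.2f}, FPR = {fp / (fp + tn):.2f}")
print(f"accuracy  = {accuracy_score(y_true, y_pred):.2f}")
print(f"precision = {precision_score(y_true, y_pred):.2f}")
print(f"recall    = {recall_score(y_true, y_pred):.2f}")
print(f"F1        = {f1_score(y_true, y_pred):.2f}")        # harmonic mean of precision and recall
print(f"AUC-ROC   = {roc_auc_score(y_true, y_score):.2f}")  # needs scores, not hard labels
```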

Metrics — Regression

  1. Total sum of squares, explained sum of squares, residual sum of squares
  2. Coefficient of determination and its adjusted form
  3. AIC and BIC
  4. Advantages and disadvantages of RMSE, MSE, MAE, MAPE
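
These quantities fit together as TSS = ESS + RSS (for OLS with an intercept) and R² = 1 − RSS/TSS, with the adjusted form penalising extra predictors. A minimal numpy sketch on toy data:

```python
import numpy as np

# Toy data (illustrative only): y roughly linear in x, plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 2.0, size=x.size)

# Ordinary least squares fit of a straight line.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

n, p = len(y), 1                    # n observations, p predictors
tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
rss = np.sum((y - y_hat) ** 2)      # residual sum of squares
ess = tss - rss                     # explained sum of squares (OLS with intercept)

r2 = 1 - rss / tss
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```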

Bias-Variance Tradeoff, Over/Under-Fitting

  1. K Nearest Neighbours algorithm and the role of k in the bias-variance trade-off (see the sketch after this list)
  2. Random Forests
  3. The asymptotic property
  4. Curse of dimensionality
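
To see item 1 concretely: small k gives a flexible, high-variance fit; large k gives a smoother, high-bias one. A scikit-learn sketch on synthetic data (the values of k are my own choices):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification problem (illustrative only).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# k = 1 tends to overfit (low bias, high variance); very large k underfits.
for k in (1, 5, 25, 125):
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X, y, cv=5).mean()
    print(f"k = {k:3d}: mean CV accuracy = {score:.3f}")
```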

Model Selection

  1. K-Fold Cross Validation
  2. L1 and L2 Regularisation
  3. Bayesian Optimization
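
In practice the first two combine: you pick a regularisation strength by K-fold cross-validation. A minimal scikit-learn sketch (the alpha grid is my own illustrative choice):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

# Toy regression problem (illustrative only).
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

# L2 (Ridge) shrinks coefficients; L1 (Lasso) shrinks and zeroes them.
for alpha in (0.01, 0.1, 1.0, 10.0):
    for name, model in (("Ridge", Ridge(alpha=alpha)), ("Lasso", Lasso(alpha=alpha))):
        score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
        print(f"{name} alpha={alpha:5.2f}: mean CV R^2 = {score:.3f}")
```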

Sampling

  1. Dealing with class imbalance when training classification models
  2. SMOTE for generating pseudo-observations for an underrepresented class (see the sketch after this list)
  3. Class imbalance in the independent variables
  4. Sampling methods
  5. Sources of sampling bias
  6. Measuring Sampling Error
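
For item 2, a sketch assuming the imbalanced-learn package (imblearn) is installed:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# A roughly 9:1 imbalanced binary problem (illustrative only).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

# SMOTE synthesises new minority-class points by interpolating between
# a minority point and its nearest minority-class neighbours.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))  # classes are now balanced
```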

Hypothesis Testing

This really comes under applied statistics, but I cannot stress enough the importance of learning about statistical power; it's enormously important in A/B testing. For example, you should be able to work out the sample size needed to detect a given effect, as in the sketch below.
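
A sketch using statsmodels (the effect size of 0.2 is my own illustrative choice):

```python
from statsmodels.stats.power import TTestIndPower

# Sample size per group needed to detect a standardised effect of 0.2
# (Cohen's d) with 80% power at a 5% significance level, two-sided t-test.
n = TTestIndPower().solve_power(effect_size=0.2, alpha=0.05, power=0.8, ratio=1.0)
print(f"required sample size per group: {n:.0f}")  # ~394
```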

Regression Models

Ordinary Linear Regression, its assumptions, estimator derivation and limitations are covered in significant detail in the sources cited in the applied statistics section. Other regression models you should be familiar with are:

  1. Deep Neural Networks for Regression
  2. Random Forest Regression
  3. XGBoost Regression
  4. Time Series Regression (ARIMA/SARIMA)
  5. Bayesian Linear Regression
  6. Gaussian Process Regression

Clustering Algorithms

  1. K-Means
  2. Hierarchical Clustering
  3. Dirichlet Process Mixture Models

Classification Models

  1. Logistic Regression (the most important one; revise it well, and see the sketch after this list)
  2. Multinomial Logistic Regression (for multi-class problems)
  3. XGBoost Classification
  4. Support Vector Machines
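
Since logistic regression is the one interviewers probe hardest, here is a minimal sketch of fitting one and reading its outputs (toy data; scikit-learn assumed):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy binary classification data (illustrative only).
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Logistic regression models the log-odds as a linear function of the features:
# log(p / (1 - p)) = w . x + b
clf = LogisticRegression().fit(X_tr, y_tr)
print("coefficients (log-odds per unit of each feature):", clf.coef_.round(2))
print("test accuracy:", round(clf.score(X_te, y_te), 3))
print("P(y = 1) for the first test row:", round(clf.predict_proba(X_te[:1])[0, 1], 3))
```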

It's a lot, but much of the content will be trivial if your applied statistics foundation is strong enough. I would recommend knowing the ins and outs of at least three different classification/regression/clustering methods, because an interviewer can always ask (and has asked me before) "what other methods could we have used, and what are their advantages and disadvantages?". This is a small subset of the machine learning knowledge out there, but if you know these important examples, the interviews will flow a lot more smoothly.

5. Data Manipulation and Visualisation

"What are some of the steps for data wrangling and data cleaning before applying machine learning algorithms?"

Given a new dataset, the first thing you'll need to prove is that you can perform an exploratory data analysis (EDA). Before you learn anything else, realise that there is one path to success in data wrangling: Pandas. The Pandas library, used correctly, is the most powerful tool in a data scientist's toolbox. The best way to learn how to use Pandas for data manipulation is to download many, many datasets and learn to do the following set of tasks as confidently as you make your morning cup of coffee.

One of my interviews involved downloading a dataset, cleaning it, visualising it, performing feature selection, building and evaluating a model all in one hour. It was a crazy hard task, and I felt overwhelmed at times, but I made sure I had practiced building model pipelines for weeks before actually attempting the interview, so I knew I could find my way if I got lost.

Advice: The only way to get good at all this is to practice, and the Kaggle community has an incredible wealth of knowledge on mastering EDAs and model pipeline building. I would check out some of the top ranking notebooks on some of the projects out there. Download some example datasets and build your own notebooks, get familiar with the Pandas syntax.

Data Organisation

There are three sure things in life: death, taxes, and getting asked to merge datasets and then perform groupby and apply tasks on the merged result. Pandas is INCREDIBLY versatile at this, so practice, practice, practice; a minimal sketch of the drill follows.
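
Here is that merge-then-groupby drill in Pandas (the two toy tables are my own invention):

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "amount": [10.0, 25.0, 5.0, 8.0, 12.0, 30.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["UK", "US", "UK"],
})

# Merge the two tables on their shared key (a SQL-style left join).
merged = orders.merge(customers, on="customer_id", how="left")

# Group by region, then aggregate.
summary = merged.groupby("region")["amount"].agg(["count", "sum", "mean"])

# An apply/transform task: each order's share of its region's total.
merged["share_of_region"] = merged.groupby("region")["amount"].transform(
    lambda s: s / s.sum()
)
print(summary)
```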

Data Profiling

This involves getting a feel for the "meta" characteristics of the dataset, such as the shape and description of the numerical, categorical and date-time features in the data. You should always be seeking to answer questions like "how many observations do I have?", "what does the distribution of each feature look like?" and "what do the features mean?". This kind of profiling early on can help you reject irrelevant features from the outset, such as categorical features with thousands of levels (names, unique identifiers), and means less work for you and your machine later on (work smart, not hard, or something like that).
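
In Pandas, most of this profiling is a handful of one-liners (the CSV path below is a placeholder for whatever dataset you've downloaded):

```python
import pandas as pd

df = pd.read_csv("your_dataset.csv")  # placeholder path

print(df.shape)       # how many observations and features do I have?
print(df.dtypes)      # which features are numerical, categorical, date-time?
print(df.describe())  # summary statistics for the numerical features
print(df.nunique())   # levels per feature: thousands of levels in a
                      # categorical column (names, IDs) is a red flag
```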

Data Visualisation

Here you are asking yourself: "what does the distribution of my features even look like?" A word of advice: if you didn't learn about boxplots in the applied statistics part of this study guide, then learn about them now, because you need to be able to identify outliers visually; we can discuss how to deal with them later on. Histograms and kernel density estimation plots are extremely useful tools for looking at the properties of each feature's distribution.

We can then ask "what does the relationship between my features look like?", in which case Python's seaborn package has very nifty tools like pairplot, and a visually satisfying heatmap for correlation plots.
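
A short seaborn sketch covering all of the plots above (using seaborn's built-in "tips" demo dataset so the snippet is self-contained):

```python
import matplotlib.pyplot as plt
import seaborn as sns

df = sns.load_dataset("tips")  # small built-in demo dataset

# One feature's distribution: histogram + KDE, and a boxplot for outliers.
sns.histplot(df["total_bill"], kde=True)
plt.show()
sns.boxplot(x=df["total_bill"])
plt.show()

# Pairwise relationships and a correlation heatmap across numerical features.
sns.pairplot(df.select_dtypes("number"))
plt.show()
sns.heatmap(df.select_dtypes("number").corr(), annot=True, cmap="coolwarm")
plt.show()
```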

Handling Null Values, Syntax Errors and Duplicate Rows/Columns

Missing values are a certainty in any dataset, and they arise from a multitude of different factors, each contributing bias in its own unique way. There is a whole field of study on how best to deal with missing values (and I once had an interview where I was expected to know individual methods for missing value imputation in detail). Check out this primer on ways of handling null values.

Syntax errors typically arise when a dataset contains manually-entered information, such as from a form. This can lead us to erroneously conclude that a categorical feature has many more levels than are actually present, because "Hot", "hOt" and "hot\n" are all considered unique levels. Check out this primer on handling dirty text data.

Finally, duplicate columns are of no use to anyone, and having duplicate rows could lead to overrepresentation bias, so it’s worth dealing with them early on.
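
A Pandas sketch of all three cleanup steps on a made-up dirty table:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "temp":  ["Hot", "hOt", "hot\n", "Cold", None, "Cold"],
    "value": [1.0, 2.0, np.nan, 4.0, 5.0, 4.0],
})

# Normalise messy text so "Hot", "hOt" and "hot\n" collapse to one level.
df["temp"] = df["temp"].str.strip().str.lower()

# Handle nulls: impute the numeric column with its median, drop rows
# where the categorical value is missing (one strategy of many).
df["value"] = df["value"].fillna(df["value"].median())
df = df.dropna(subset=["temp"])

# Remove exact duplicate rows and any duplicated columns.
df = df.drop_duplicates()
df = df.loc[:, ~df.columns.duplicated()]
print(df)
```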

Standardisation or Normalisation

Depending on the dataset you're working with and the machine learning method you decide to use, it may be useful to standardise or normalise your data so that the different scales of different variables don't negatively impact your model's performance.
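
A scikit-learn sketch of both options (toy matrix; remember to fit the scaler on training data only):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on wildly different scales (illustrative only).
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])

X_std = StandardScaler().fit_transform(X)  # standardise: zero mean, unit variance
X_nrm = MinMaxScaler().fit_transform(X)    # normalise: rescale to [0, 1]

print(X_std)
print(X_nrm)
# In practice, fit the scaler on the training set only, then transform the test set.
```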

There's a lot here to go through, but honestly it wasn't the "memorise everything" mentality that helped me so much as the confidence that learning as much as I could instilled in me. I must have failed many interviews before the formula "clicked", and I realised that none of these things are esoteric concepts that only an elite few can master; they're just tools you use to build incredible models and derive insights from data.

Best of luck on your job quest. If you need any help at all, please let me know, and I will answer emails and questions when I can.
