How to ace a Data Science Interview

A Comprehensive Toolkit that boosts your Competitive Edge

Sultan Al Awar
Towards Data Science


Introduction

As the volume, velocity, and variety of data increase every day, the demand for qualified data scientists and analysts rises as organizations seek to utilize and unlock the value of this data. However, based on my experience sitting on interview panels and networking with people, I believe that many potential candidates struggle to pass the interview stage because they are not fully aware of the sample questions they should prepare beforehand. For this reason, in this tutorial, I aim to present and discuss the most relevant theoretical and practical questions that a candidate might encounter.

I should note that this article is a toolkit which gives you a basic understanding of various data science concepts and machine learning tools; however, you need to dig deeper into the topics discussed and further augment your knowledge to become completely confident while tackling the interview questions.

Section One: Theoretical Part

The interviewer will want to test your theoretical knowledge in areas such as statistics, machine learning, and the tech stack, among many others. Thus, the first section demonstrates the most relevant theoretical questions across seven core categories of the data science field.

I) Statistics

  1. State the difference between population and sample: a population is the set of all the individuals of interest, whereas a sample is a subset selected to represent the population in a research study.
  2. What is hypothesis testing? A statistical inference tool that uses sample data to determine whether a statement about a population parameter should or should not be rejected. The null hypothesis, denoted by H0, is a tentative assumption about a population parameter, and the alternative hypothesis, denoted by H1, is the opposite of what is stated in the null hypothesis.
  3. What does the p-value indicate? In hypothesis testing, it is the smallest significance level at which the null hypothesis can be rejected. The lower the p-value, the stronger the evidence against H0: we reject H0 whenever the p-value falls below the chosen significance level (a quick demonstration follows this list).
  4. What is the difference between Type I and Type II errors? A Type I error, called a false positive, is the error of rejecting H0 when it is true, whereas a Type II error, known as a false negative, is the error of retaining H0 when it is false.
  5. What does it mean if two random variables X and Y are independent? It means that knowing X tells you nothing about Y and vice versa.
  6. What is the Central Limit Theorem? It states that, for a sufficiently large sample size, the sample mean is approximately normally distributed regardless of the distribution of the original population: its mean tends to the population mean, and its variance equals the population variance divided by the sample size.
  7. Explain the difference between correlation and covariance. Covariance measures the direction of the joint linear relationship of two variables, while correlation standardizes covariance to capture both the strength and direction of the relationship on a scale from -1 to 1.
  8. What is the law of large numbers? When we repeat an experiment a large number of times, the average result gets very close to the expected result; in other words, the sample average converges to the population mean.
  9. How can outlier values be treated? We can replace them with the mean or median, standardize and scale the data, apply a log transformation, or even drop the values.
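
To make the hypothesis-testing vocabulary above concrete, here is a minimal sketch using scipy.stats; the sample values, the hypothesized mean of 40, and the 5% significance level are all assumptions for illustration.

from scipy import stats

# Hypothetical sample of customer ages; H0: the population mean age is 40
sample = [36, 42, 51, 39, 47, 44, 38, 50, 41, 45]

t_stat, p_value = stats.ttest_1samp(sample, popmean=40)
print(f't = {t_stat:.2f}, p-value = {p_value:.3f}')

alpha = 0.05  # chosen significance level
if p_value <= alpha:
    print('Reject H0 at the 5% significance level.')
else:
    print('Cannot reject H0 at the 5% significance level.')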

II) Machine Learning

10. List the differences between supervised and unsupervised learning. In short, supervised learning trains on labelled data to predict a known target, as in regression and classification, whereas unsupervised learning finds structure in unlabelled data, as in clustering and dimensionality reduction.

11. What is a confusion matrix? An N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. It compares the actual target values with those predicted by the model.

12. Explain the differences between accuracy, precision, recall, and F1-score. Accuracy is the number of classifications a model correctly predicts divided by the total number of predictions made. Precision tells us how many of the predicted positive cases are actually positive (TP/(TP+FP)). Recall tells us how many of the actual positive cases the model predicts correctly (TP/(TP+FN)). F1-score is the harmonic mean of precision and recall.
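
As a quick illustration, the metrics can be computed with scikit-learn's metrics module; the y_true and y_pred vectors below are made-up labels for the example.

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical ground-truth and predicted labels for a binary classifier
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))  # rows: actual, columns: predicted
print('Accuracy :', accuracy_score(y_true, y_pred))
print('Precision:', precision_score(y_true, y_pred))  # TP / (TP + FP)
print('Recall   :', recall_score(y_true, y_pred))     # TP / (TP + FN)
print('F1-score :', f1_score(y_true, y_pred))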

13. Discuss the difference between Decision Tree and Random Forest algorithms. A Decision Tree is a supervised algorithm mainly used for regression and classification: it breaks a data set down into smaller and smaller subsets by splitting it into sub-regions and predicting the average value (regression) or majority class (classification) at each leaf node. A Random Forest trains an ensemble of trees on bootstrap resamples of the training data (sampling with replacement) and aggregates the trees' votes for a final prediction.
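
A minimal comparison of the two, sketched with scikit-learn on the built-in iris data set; the hyperparameters are illustrative defaults, not tuned choices.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data set for illustration
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print('Single tree accuracy  :', tree.score(X_test, y_test))
print('Random forest accuracy:', forest.score(X_test, y_test))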

14. Define Logistic Regression and give two business use cases. It is a binary classifier that estimates the probability of an instance belonging to a certain class and makes predictions accordingly. It can be used to predict which customers are likely to churn, or to identify promising prospects in a marketing campaign.
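
A minimal churn-style sketch with scikit-learn; the two features (monthly spend and tenure in months) and the labels are invented purely for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features: [monthly_spend, tenure_months]; label 1 = churned
X = np.array([[20, 2], [85, 30], [15, 1], [90, 36], [30, 5], [70, 24]])
y = np.array([1, 0, 1, 0, 1, 0])

model = LogisticRegression().fit(X, y)
# Columns of predict_proba follow the sorted class labels: [P(0), P(1)]
print(model.predict_proba([[25, 3]]))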

15. What is the bias-variance trade-off and how do you manage it? Bias is an error introduced by oversimplifying the algorithm used, and it leads to under-fitting. Variance is an error introduced by an overly complex algorithm: the model performs very well on the training set but poorly on the test set, which leads to overfitting. The bias-variance trade-off attempts to minimize these two sources of error through methods such as cross-validation, dimensionality reduction, and feature selection.

16. What is overfitting and how do you combat it? It occurs when the model performs well on the training set but fails to generalize to unseen test data. To overcome this, we can enlarge the training set, perform feature selection, apply regularization techniques or early stopping, or use cross-validation.
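
Cross-validation is the simplest of these to demonstrate: instead of trusting a single train/test split, we average the score over several folds. Here is a minimal sketch on scikit-learn's built-in iris data.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Five-fold cross-validation: train on four folds, score on the fifth, rotate
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print('Mean accuracy:', scores.mean(), '+/-', scores.std())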

III) Deep Learning

17. What is Deep Learning and why did it gain popularity in recent times? It tries to mimic the functioning of the human brain, using many layers of neurons to progressively extract higher-level features from the data fed to the neural network. It has gained popularity because of the increase in the amount of data generated and the growth in the computational resources required to run these models.

IV) Natural Language Processing

18. What are the use cases of NLP? It helps computers understand human language through tasks such as speech recognition, sentiment analysis, text summarization, text classification, translation, question answering, chatbots, and named entity recognition.

19. Provide the difference between bag-of-words and TF-IDF. Bag-of-words represents text using raw word frequencies without context or order, whereas TF-IDF measures word importance by multiplying a term's frequency by its inverse document frequency, which down-weights common, uninformative terms.
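
Both representations can be sketched in a few lines with scikit-learn's vectorizers; the two-document corpus is a made-up example.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['the cat sat on the mat', 'the dog ate the food']  # toy corpus

bow = CountVectorizer().fit(docs)
print(bow.get_feature_names_out())
print(bow.transform(docs).toarray())  # bag-of-words: raw counts, no order

tfidf = TfidfVectorizer().fit(docs)
# Terms that occur in every document ('the') receive a lower idf weight
print(tfidf.transform(docs).toarray().round(2))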

V) Python Libraries

20. Define the pandas library and data frames. The pandas library is built on NumPy and provides easy-to-use data structures and data analysis tools for Python. A data frame is the primary two-dimensional data structure in pandas, with columns of potentially different types.
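
A minimal sketch of a data frame with mixed column types; the values below are made up.

import pandas as pd

# Hypothetical customer data with string, integer, and float columns
df = pd.DataFrame({
    'account_type': ['basic', 'premium', 'basic'],
    'customer_age': [34, 52, 41],
    'talk_time': [12.5, 30.0, 7.25],
})

print(df.dtypes)  # each column keeps its own type
print(df.groupby('account_type')['talk_time'].mean())  # quick aggregation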

21. What is scikit-learn? It is an open source Python library that employs a unified interface for various efficient tools for predictive data analysis including: machine learning, pre-processing, cross-validation, and visualization algorithms.

VI) Tech Stack

22. What is Git? It is a version control system that allows users to keep track of changes to source code. Websites such as GitHub and GitLab let a data team push, pull, and review codebases using Git.

23. What is Docker? It ensures portability and reproducibility while developing data science solutions: team members can create and deploy isolated environments (containers) for running applications together with their dependencies.

24. What is Apache Airflow? It is an open-source tool that can be employed to write and schedule workflows in Python and monitor processes such as model training or data scraping.

25. What is an API? It is an acronym for Application Programming Interface; it acts as a middleman between any two machines or applications that want to communicate with each other under an agreed set of rules for a specified task.
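
In Python, calling an API usually comes down to a few lines with the requests library; the endpoint and query parameter below are hypothetical placeholders.

import requests

# Hypothetical endpoint; any public JSON API follows the same pattern
response = requests.get(
    'https://api.example.com/v1/customers',
    params={'account_type': 'premium'},  # query-string parameters
    timeout=10,
)
response.raise_for_status()  # raise an error for 4xx/5xx responses
data = response.json()       # parse the JSON body into Python objects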

VII) Other Applications

26. Identify the goal of A/B testing and give a use case. It examines user experience through randomized tests with two variants. It can be implemented to figure out the best online promotional and marketing strategies for a business.

27. Define Recommender System (RS) and mention its types. It is a suggestion tool which evaluates the alternatives a platform may offer its users. There are two main types: Content Filtering RS, which recommends items based on what the user liked or found useful in the past, and Collaborative Filtering RS, which recommends items based on what other users with similar tastes liked or found useful.
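
The collaborative idea can be sketched with nothing more than NumPy: compute how similar users' rating vectors are and borrow recommendations from the nearest neighbours. The ratings matrix below is invented for illustration.

import numpy as np

# Hypothetical user-item ratings (rows: users, columns: items, 0 = unrated)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
])

# Cosine similarity between user 0 and every user (including itself)
norms = np.linalg.norm(ratings, axis=1)
sims = ratings @ ratings[0] / (norms * norms[0])
print(sims)  # user 1 is the closest match, so their liked items are good candidates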

Section Two: Practical Part

The interviewer will also be keen to test your technical expertise and proficiency in programming languages such as Python, SQL, and R. Thus, in this section, I will present and discuss some coding examples that you might see in the interview.

I- SQL Coding Exercise

The first hands-on example that I will present is about SQL, a programming language used to communicate with relational databases. The task is to extract the number of customers, average age, number of calls, and total talk time in 2022, segmented by account type, keeping only segments with an average age of at least 40. Let us assume we have two tables called accounts and calls.

select
    k.account_type,
    count(distinct k.account_id) as number_customers,
    round(avg(k.customer_age), 0) as avg_age,
    count(distinct c.call_id) as number_calls,
    sum(c.talk_time) as total_talk_time
from accounts as k
join calls as c
    on k.account_id = c.account_id
where cast(k.start_date as date) >= cast('2022-01-01' as date)
group by 1 -- account_type
having avg(k.customer_age) >= 40
order by 2 -- number_customers

Herein are the building blocks that I would like you to clearly focus on and understand:

1. Select, which simply states that we want to extract the variable names: account_type, account_id, etc.

2. Count and distinct, which allow us to count the unique (distinct) account ids to get the total number of customers; the same approach applies for calls, where we count the unique call ids to get the total number of calls.

3. Count, sum, and avg are aggregate functions employed to derive insights from the data.

4. From identifies the name of the table.

5. Join enables us to merge the data of two tables based on a common key. There are different types of joins you should know: left, right, inner, and (full) outer. For example, a full outer join returns all records with a match in either the left (table1) or the right (table2) table, whereas an inner join returns only the records with matching values in both tables (see the pandas sketch after this list). There is also a union operation that stacks the rows of two queries.

6. Where identifies the condition used to filter the data. For instance, in the above example, we specify that we need only the data from a certain date onwards (the beginning of 2022).

7. Group by is used to aggregate by category, which is account_type in our case.

8. Having, which filters rows from the aggregated results. In our example, we keep only the groups whose average age is at least 40, as requested.

9. Order by, which sorts the results in ascending (asc, the default) or descending (desc) order.
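
If joins are easier to picture in code, the same semantics can be reproduced with pandas merge; the two mini tables below are hypothetical and mirror the accounts and calls tables from the query above. Note how the how= argument maps directly onto SQL's inner/left/right/outer joins.

import pandas as pd

# Hypothetical mini tables mirroring accounts and calls
accounts = pd.DataFrame({'account_id': [1, 2, 3],
                         'account_type': ['basic', 'premium', 'basic']})
calls = pd.DataFrame({'call_id': [10, 11],
                      'account_id': [1, 1],
                      'talk_time': [4, 6]})

inner = accounts.merge(calls, on='account_id', how='inner')  # only matching keys
outer = accounts.merge(calls, on='account_id', how='outer')  # all keys from both sides
print(inner)
print(outer)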

II- Python Coding Exercise

The task will be to develop a function that acts as a login gateway for your account. It should ask you to input your password; if it matches the correct password within three attempts, you will be able to log in. Otherwise, your access should be blocked.

def user_login(password):
    # Allow up to three attempts to enter the correct password
    i = 0
    while i < 3:
        user_input = input('Input the password: ')
        if user_input == password:
            print('You have successfully logged in.')
            break
        if i == 2:  # third and final failed attempt
            print('You are not the authorized user.')
        i += 1
    return
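
You can test the function with a quick call; the password argument here is a hypothetical example.

user_login('secret123')  # prompts for input; 'secret123' is a made-up password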

Herein are the building blocks that I would like you to focus on and understand:

1- The use and structure of the function: in Python, a function packages together a collection of statements that you can execute whenever you want in your code. The major parts of a function are: the function name, arguments, body, and return statement.

2- The while statement enables iteration: a block of code is repeatedly executed while some condition is True. Once the condition evaluates to False, the next section of the code is executed. To end a while loop prematurely, the break statement is used. In our example, the while loop asks the user to input the password within a certain number of attempts; when it receives the correct password, the loop stops (break statement), otherwise it continues until the final allowed attempt, which blocks access automatically.

In terms of Python knowledge, you should also be proficient in pandas operations and methods, as well as data wrangling techniques. You can check my other article about Pandas in Practice, and I am adding a link to the Data Camp Pandas Cheat Sheet: Data Wrangling in Python.

Conclusion and Final Thoughts:

Finally, in this tutorial I demonstrated the most relevant theoretical questions that a candidate might face in an interview, whether in statistics, machine learning, natural language processing, or the tech stack. I also presented and explained two hands-on coding exercises that depict some of the important notions and techniques a candidate should be confident and capable of using in an interview. However, I highly recommend extending your preparation and conducting additional reading to ensure you cover and grasp all the crucial knowledge needed to ace an interview. For this reason, I am including below some suggestions for books and online cheat sheets that complement this toolkit.

I wish you the best in your preparation and career endeavours. Get in touch if you need any mentoring or support to secure a data scientist or analyst position.

Additional Resources:

You can find PDF versions of these books on Library Genesis:

  1. The Art of Statistics: Learning from Data by David Spiegelhalter
  2. Python for Data Analysis by Wes McKinney
  3. SQL for Data Analysis: Advanced Techniques for Transforming Data into Insights by Cathy Tanimura
  4. Machine Learning with PyTorch and Scikit-Learn by Sebastian Raschka, Yuxi (Hayden) Liu, and Vahid Mirjalili

I also suggest reviewing the Data Science Cheat Sheets on the Data Camp website which are also a beneficial source to leverage your expertise in Python/SQL Programming, Machine Learning and AI.

