
If I Were to Hire a Data Scientist, I Would Ask These 2 Questions

They lead to a conversation that reveals many important things

Photo by Vincent van Zalinge on Unsplash

I’m not currently in a position to hire a data scientist, but I hope to be someday. That said, I have been on the other side of the interview table several times.

In data scientist interviews, I have answered questions on a variety of topics, from SQL and Python to machine learning and Bayes’ theorem. Some of them were really challenging, but others did not seem very useful to me.

Now that I work as a data scientist, I sometimes think about what I would ask if I were to hire a data scientist.

My primary goal in doing this is not to find the best questions to ask or find the best data scientist among the candidates. Instead, I aim to become a better data scientist. Such a thought process helps me improve my skills and motivates me to learn new stuff. Since Data Science is still evolving, it is of crucial importance to learn continuously.

Let’s start with the first question.

Question 1: You are assigned to create a model to solve a supervised learning problem, either regression or classification. Which algorithm would you choose and why?

The first thing that comes to your mind is probably that it depends on many things. You are absolutely right. Thus, we are not trying to find a single algorithm as an answer.

The purpose of this question is rather to start a discussion about machine learning algorithms in general. However, at some point during the discussion, I would like the candidate to explain the following:

  • There is a trade-off between prediction accuracy and model interpretability.

Let’s elaborate on this. In general, as the flexibility of an algorithm increases, it tends to give more accurate results. For instance, gradient boosted decision trees (GBDT) are much more flexible than linear regression and often outperform it in terms of accuracy.

However, the strong performance of GBDT comes at the price of interpretability. What this means is that we have a limited idea of how a decision is made. The algorithm can calculate feature importance values, but these give us only a vague understanding of which features play a key role.

On the other hand, linear regression has very limited flexibility but offers high interpretability. With a linear regression model, we obtain a comprehensive understanding of the effects of each feature on the predictions.
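To make the interpretability point concrete, here is a minimal sketch of a one-feature linear regression fitted from scratch. The data (a made-up relationship between `sizes` and `prices`) and the helper name `fit_simple_ols` are assumptions for illustration only, not something from a real dataset or library.

```python
import random


def fit_simple_ols(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form solution for one feature: slope = cov(x, y) / var(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: price = 2 + 3 * size + Gaussian noise
rng = random.Random(0)
sizes = [rng.uniform(0, 10) for _ in range(200)]
prices = [2.0 + 3.0 * s + rng.gauss(0, 1) for s in sizes]

slope, intercept = fit_simple_ols(sizes, prices)

# The fitted slope is the whole explanation of the model's behaviour:
# each extra unit of size changes the predicted price by `slope`.
print(f"predicted price = {intercept:.2f} + {slope:.2f} * size")
```

The fitted coefficient can be read directly as "one more unit of the feature changes the prediction by this much". A boosted ensemble of hundreds of trees offers no comparably simple reading.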

Therefore, the choice of algorithm depends on what we want to achieve. If we do not care about interpretability and only want to achieve good results, we can go with flexible algorithms. An example of this could be stock price prediction. We are usually only interested in getting high accuracy.

When we work on a task where the focus is on why a prediction is made, we should choose interpretable models.

Machine learning is used not only in low-risk environments such as recommender systems but also for critical tasks such as cancer prediction, drug tests, and so on. In these cases, we definitely want to know why a decision is made.

This question is essentially related to interpretable machine learning or explainable AI. Here is a great book by Christoph Molnar if you’d like to learn more about interpretable machine learning.

Interpretable Machine Learning


Question 2: What are bias and variance in machine learning?

This is also an open-ended question, and the goal is to see if the candidate understands bias and variance, what they mean for a machine learning model, and the trade-off between them.

Variance is a measure of how sensitive a model is to the training data. Predictions of a model with high variance might change significantly as a result of small variations in the training set. This is not desired. We want our predictions not to vary too much between different training sets. Thus, we try to avoid models with high variance.

Bias is an indication of using a very simple model to approximate a complicated problem. For instance, using a linear regression model to fit a non-linear relation is likely to result in high bias. We cannot get a good estimate of the target variable by using a model with high bias. Thus, we also try to avoid models with high bias.

But, how can we achieve a model with low bias and low variance?

The error of a predictive model is basically the difference between the predictions and actual values. These errors are composed of two main parts which are the reducible and irreducible errors.

We can improve a model by focusing on the reducible error, which can be expressed as the sum of the variance and the squared bias of the predictions. Variance and squared bias are both non-negative, so, in the optimal case, we aim for a model with low bias and low variance. However, this is usually a highly challenging task, and there is a trade-off between them.
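The decomposition described above can be written out as follows. The notation is an assumption on my part, borrowed from the usual textbook setup: the data follow y = f(x) + ε, the fitted model is f-hat, (x_0, y_0) is an unseen test point, and the expectation is taken over training sets and noise.

```latex
\[
\mathrm{E}\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \underbrace{\mathrm{Var}\left(\hat{f}(x_0)\right)
  + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2}_{\mathrm{reducible}}
  + \underbrace{\mathrm{Var}(\varepsilon)}_{\mathrm{irreducible}}
\]
```

Since every term on the right is non-negative, the expected error can never fall below Var(ε), no matter which model we choose.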

In general, as the flexibility of a method increases, the variance tends to increase and the bias tends to decrease. For instance, GBDT-based algorithms such as XGBoost and LightGBM are likely to have very low bias but high variance.

As the flexibility increases, bias tends to decrease faster than the variance increases, up to a certain point. After that, if we keep increasing the flexibility, we will not gain much in terms of bias but the variance will increase dramatically.

The key to creating a robust and accurate model is to find that optimal point.
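The trade-off can be seen in a small simulation. The sketch below is only an illustration under stated assumptions. It uses k-nearest-neighbours regression as a stand-in for "a method with adjustable flexibility" (a smaller `k` means a more flexible model), a sine curve as the hypothetical true relationship, and made-up noise levels. It estimates squared bias and variance at a single test point by refitting on many freshly drawn training sets.

```python
import math
import random
import statistics


def true_f(x):
    """Hypothetical true relationship between feature and target."""
    return math.sin(2 * math.pi * x)


def knn_predict(train, x0, k):
    """Average the targets of the k training points closest to x0."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


def bias_variance(k, x0=0.25, n_train=50, n_repeats=300, noise_sd=0.3, seed=0):
    """Estimate (squared bias, variance) of the k-NN prediction at x0."""
    rng = random.Random(seed)  # same seed -> same training sets for every k
    preds = []
    for _ in range(n_repeats):
        # Draw a fresh training set from the same data-generating process
        train = []
        for _ in range(n_train):
            x = rng.random()
            train.append((x, true_f(x) + rng.gauss(0, noise_sd)))
        preds.append(knn_predict(train, x0, k))
    bias_sq = (statistics.mean(preds) - true_f(x0)) ** 2
    variance = statistics.pvariance(preds)
    return bias_sq, variance


# Small k = very flexible model, large k = very rigid model
results = {k: bias_variance(k) for k in (1, 5, 25, 50)}
for k, (bias_sq, variance) in results.items():
    print(f"k={k:2d}  bias^2={bias_sq:.4f}  variance={variance:.4f}  "
          f"sum={bias_sq + variance:.4f}")
```

In this setup, the most flexible setting (`k=1`) should show the largest variance and the most rigid one (`k=50`, which averages every training point) the largest squared bias, with the smallest sum somewhere in between. That in-between value is the "optimal point" mentioned above.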

There will also be cases where the relationship between the features and the target variable is linear. In those cases, linear models will have little or no bias, so they can outperform more advanced and complicated models.


These are the questions that I would definitely ask a data scientist candidate. They lead to a conversation that reveals the candidate’s knowledge and understanding of many important concepts in machine learning and statistics.

I strongly suggest thinking about your own questions because it is both a good mental exercise and an opportunity to learn.

You can become a Medium member to unlock full access to my writing, plus the rest of Medium. If you already are, don’t forget to subscribe if you’d like to get an email whenever I publish a new article.

Join Medium with my referral link – Soner Yıldırım

Thank you for reading. Please let me know if you have any feedback.

