
New Amazon Data Scientist Interview Practice Problems for 2021

New Year, New Interview Questions

Photo by Piotr Cichosz on Unsplash

I know the number of interview question articles out there is getting a little saturated, but for the sake of those trying to find up-to-date questions and answers, I've decided to write this one. I took a number of interview questions that were asked in recent months and attempted to give concise answers to help you understand the underlying concepts.

With that said, let’s dive into it!


Q: A couple has 2 kids where they know that one is a boy. What’s the probability of the other kid being a boy?

This isn't a trick question. If we know which specific kid is a boy (say, the older one), the sex of the other kid is independent of that fact, so the probability is one-half. You may be thinking of Leonard Mlodinow's version of the question, where all we know is that at least one of the two kids is a boy; there the answer is one-third, but that is a different question.
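To make the distinction concrete, here is a quick enumeration (assuming boys and girls are equally likely and independent) that contrasts the two interpretations:

```python
import itertools

# All equally likely two-child families, ordered as (first child, second child)
families = list(itertools.product(["B", "G"], repeat=2))

# Interpretation 1: we know a specific child (say, the first) is a boy
specific = [f for f in families if f[0] == "B"]
print(sum(f[1] == "B" for f in specific) / len(specific))              # 0.5

# Interpretation 2 (Mlodinow's question): we only know at least one child is a boy
at_least_one = [f for f in families if "B" in f]
print(sum(f == ("B", "B") for f in at_least_one) / len(at_least_one))  # 0.333...
```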


Q: Explain what the p-value is.

If you Google what the p-value is, you'll get something like: "it's the probability of obtaining results at least as extreme as the observed results of a statistical hypothesis test, assuming that the null hypothesis is correct."

It's quite wordy, but it's wordy for a reason: the p-value has a very particular meaning and is heavily misunderstood.

While not as complete a definition, a simpler way to put it is that the p-value is the likelihood of the observed statistic occurring due to chance alone, given the sampling distribution under the null hypothesis.

The significance level, alpha, sets the standard for how extreme the data must be before the null hypothesis can be rejected; the p-value indicates how extreme the data actually is.
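As a minimal illustration (with made-up numbers), here is how a one-sample t-test produces a p-value that gets compared against alpha:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical sample; null hypothesis: the true mean is 0
sample = rng.normal(loc=0.3, scale=1.0, size=50)

t_stat, p_value = stats.ttest_1samp(sample, popmean=0)

alpha = 0.05  # the standard for "extreme enough", chosen before looking at the data
print(f"t = {t_stat:.2f}, p-value = {p_value:.3f}")
print("Reject the null" if p_value < alpha else "Fail to reject the null")
```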


Q: There are 4 red balls and 2 blue balls. If you pick 2 balls, what's the probability that both are the same color?

The answer is equal to the probability that both are red plus the probability that both are blue. Assume that the picks are made without replacement.

  • Probability of 2 reds = (4/6)*(3/5) = 2/5 or 40%
  • Probability of 2 blues = (2/6)*(1/5) = 1/15 or about 6.7%

Therefore the probability that the two balls are the same color is 7/15, or approximately 46.7%.
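A quick sanity check of that arithmetic, both with sequential draws and with combinations:

```python
from math import comb

red, blue = 4, 2
total = red + blue

# Sequential draws without replacement
p_two_red = (red / total) * ((red - 1) / (total - 1))     # (4/6)*(3/5) = 0.40
p_two_blue = (blue / total) * ((blue - 1) / (total - 1))  # (2/6)*(1/5) ≈ 0.067
print(p_two_red + p_two_blue)                             # ≈ 0.467

# Same answer via combinations: ways to pick 2 of the same color / ways to pick any 2
print((comb(red, 2) + comb(blue, 2)) / comb(total, 2))    # 7/15 ≈ 0.467
```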


Q: Explain the C value in SVM.

To give context, a Support Vector Machine (SVM) is a supervised classification technique that finds a hyperplane, or boundary, between the two classes of data that maximizes the margin between them. There are many planes that can separate the two classes, but only one plane maximizes the margin, or distance, between the classes.

The C parameter indicates the extent to which you want your SVM to have a soft margin vs a hard margin. Large values of C result in a harder margin that does a better job of correctly classifying the training points, but at the cost of a smaller margin. Conversely, smaller values of C result in a softer margin that creates a larger-margin hyperplane even if it misclassifies more points. Generally, a softer margin is used to prevent overfitting.
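A minimal sketch of this trade-off with scikit-learn's SVC on toy data (the dataset and the C values are arbitrary choices):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy data just to illustrate the effect of C
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, random_state=0)

for C in [0.01, 1, 100]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C: softer margin, tolerates misclassified training points
    # Large C: harder margin, fits the training points more aggressively
    print(f"C={C}: training accuracy = {clf.score(X, y):.3f}")
```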


Q: We have two models, one with 85% accuracy, one 82%. Which one do you pick?

If we only care about the accuracy of the model, then we would choose the one with 85%. But if an interviewer were to ask this, it would probably be a good idea to get more context, i.e. what the model is trying to predict. This will give us a better idea of whether the evaluation metric should indeed be accuracy or another metric like recall or F1 score.
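For instance, here is a hypothetical imbalanced problem (the numbers are made up) where accuracy looks fine but recall and F1 tell a very different story:

```python
from sklearn.metrics import accuracy_score, recall_score, f1_score

# Hypothetical labels: 90 negatives, 10 positives (e.g., fraud detection)
y_true = [0] * 90 + [1] * 10
# A model that catches only 2 of the 10 positives
y_pred = [0] * 90 + [1, 1] + [0] * 8

print(accuracy_score(y_true, y_pred))  # 0.92 -- looks great
print(recall_score(y_true, y_pred))    # 0.20 -- misses most positives
print(f1_score(y_true, y_pred))        # ~0.33
```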


Q: What is the difference between bagging and boosting?

Bagging, also known as bootstrap aggregating, is the process in which multiple models of the same learning algorithm are trained on bootstrapped samples of the original dataset. Then, as in a random forest, a vote is taken across all of the models' outputs.

Bagging Process

Boosting is a variation of bagging where each individual model is built sequentially, iterating on the previous one. Specifically, any data points that are misclassified by the previous model are emphasized in the following model. This is done to improve the overall accuracy of the ensemble. Here's a diagram to make more sense of the process:

Image Created by Author

Once the first model is built, the misclassified points are emphasized, in addition to a second bootstrapped sample, when training the second model. Then the ensemble (models 1 and 2) is used against the test dataset, and the process continues.
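A minimal sketch of both ideas using scikit-learn's BaggingClassifier and AdaBoostClassifier on synthetic data (the base learner, ensemble size, and dataset are arbitrary choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
base = DecisionTreeClassifier(max_depth=3, random_state=0)

# Bagging: identical learners trained in parallel on bootstrapped samples, then voted
bagging = BaggingClassifier(base, n_estimators=50, random_state=0)

# Boosting: learners trained sequentially, each focusing on the previous one's mistakes
boosting = AdaBoostClassifier(base, n_estimators=50, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```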


Q: What is the Naive Bayes Algorithm?

Naive Bayes is a popular classifier used in data science. The idea behind it is driven by Bayes' Theorem, which for an output y and input features X can be written as:

P(y|X) = P(X|y) * P(y) / P(X)

In plain English, this equation is used to answer the following question: "What is the probability of y (my output variable) given X (my input variables)?" And because of the naive assumption that the features are independent given the class, you can say that:

P(y|x1, ..., xn) = P(x1|y) * P(x2|y) * ... * P(xn|y) * P(y) / P(x1, ..., xn)

As well, by removing the denominator, we can then say that P(y|X) is proportional to the right-hand side.

Therefore, the goal is to find the class with the maximum proportional probability.
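A minimal sketch of this in practice, using scikit-learn's GaussianNB on the Iris dataset (any labeled dataset would do):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# GaussianNB assumes the features are conditionally independent given the class
# and predicts the class with the highest (proportional) posterior probability.
model = GaussianNB().fit(X_train, y_train)
print(model.score(X_test, y_test))
```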

_Check out my article "A Mathematical Explanation of Naive Bayes" if you want a more in-depth explanation!_


Q: How would you improve a classification model that suffers from low precision?

The first thing that I would do is ask a couple of questions:

  • Is it possible that the quality of the negative data that was collected is worse than the quality of the positive data? If so, the next step would be to understand why that is the case and how it can be resolved.
  • Is the data imbalanced? Low precision can be a sign that the data is imbalanced and needs to be addressed, for example by oversampling or undersampling.

The other thing that I would do is take advantage of algorithms that are well suited to imbalanced data, namely gradient-boosted tree algorithms (XGBoost, CatBoost, LightGBM).
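As a rough sketch (on synthetic imbalanced data, with arbitrary model choices), here is one way to compare the precision of a single tree against a boosted-tree model; a resampling step would slot in before the fit:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data: roughly 5% positives
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "boosted trees": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    preds = model.fit(X_train, y_train).predict(X_test)
    print(name, round(precision_score(y_test, preds, zero_division=0), 3))
```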


In case you missed it, here are some of last year's interview practice problems that I think are really good questions!


Q: If there are 8 marbles of equal weight and 1 marble that weighs a little bit more (for a total of 9 marbles), how many weighings are required to determine which marble is the heaviest?

Image created by author

Two 'weighings' would be required (see parts A and B above):

  1. You would split the nine marbles into three groups of three and weigh two of the groups. If the scale balances (alternative 1), you know that the heavy marble is in the third group. Otherwise, you take the heavier of the two weighed groups (alternative 2).
  2. Then you would repeat the same step, but with three groups of one marble instead of three groups of three.
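For fun, here is a small sketch that mimics a balance scale and confirms that the two-weighing strategy finds the heavy marble wherever it sits (the names and structure are purely illustrative):

```python
def find_heavy(marbles):
    """Return the index of the single heavier marble among 9, using two weighings."""
    def weigh(left, right):
        # Acts like a balance scale: 1 if left is heavier, -1 if right is, 0 if balanced
        l, r = sum(marbles[i] for i in left), sum(marbles[i] for i in right)
        return (l > r) - (l < r)

    groups = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

    # Weighing 1: compare two groups of three; the heavier (or left-out) group wins
    result = weigh(groups[0], groups[1])
    group = groups[0] if result > 0 else groups[1] if result < 0 else groups[2]

    # Weighing 2: compare two marbles from that group; the heavier (or left-out) one wins
    result = weigh([group[0]], [group[1]])
    return group[0] if result > 0 else group[1] if result < 0 else group[2]

# Check every possible position of the heavy marble
for heavy in range(9):
    weights = [1.0] * 9
    weights[heavy] = 1.1
    assert find_heavy(weights) == heavy
```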

Q: What is overfitting?

Overfitting is an error where the model 'fits' the training data too well, resulting in a model with high variance and low bias. As a consequence, an overfit model will predict new data points inaccurately even though it has high accuracy on the training data.
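A minimal sketch of what this looks like in practice, comparing an unconstrained decision tree with a depth-limited one on noisy synthetic data (all of the choices here are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy data so that memorizing the training set doesn't generalize
X, y = make_classification(n_samples=300, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, depth in [("unconstrained", None), ("max_depth=3", 3)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    # A large gap between training and test accuracy is the signature of overfitting
    print(name,
          "train:", round(tree.score(X_train, y_train), 2),
          "test:", round(tree.score(X_test, y_test), 2))
```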


Q: How would the change of prime membership fee affect the market?

I'm not 100% sure about the answer to this question, but I'll give it my best shot!

Let’s take the instance where there’s an increase in the prime membership fee – there are two parties involved, the buyers and the sellers.

For the buyers, the impact of an increase in the Prime membership fee ultimately depends on the buyers' price elasticity of demand. If the price elasticity is high, then a given increase in price will result in a large drop in demand, and vice versa. Buyers who continue to pay the membership fee are likely Amazon's most loyal and active customers, and they are also likely to place a higher value on Prime-eligible products.

Sellers will also take a hit, as buyers now face a higher cost to access Amazon's basket of products. That being said, some products will take a harder hit than others: premium products that Amazon's most loyal customers purchase, like electronics, would likely not be affected as much.


Thanks for Reading!

What I love about these interview practice problems is that they serve two purposes: 1) they help you learn about concepts that you aren't familiar with, and 2) they provide a nice refresher for concepts that you've already learned but haven't touched in a while.

Hopefully, this will help with your preparation and your journey to becoming a data scientist! And if this wasn't enough, you can check out the article below 😉

As always, I wish you the best in your endeavors!


120+ Data Scientist Interview Questions and Answers You Should Know in 2021

Terence Shin

