Netflix is one of the most elite tech companies in the world, so it’s no surprise that its data science interview questions are challenging. Below are several questions that have previously been asked in Netflix’s data science interviews, along with my attempts at answering them.
Q: Why is Rectified Linear Unit a good activation function?

The Rectified Linear Unit, also known as the ReLU function, is often a better activation function than the sigmoid and tanh functions because it lets gradient descent converge faster. Notice in the image that for the sigmoid and tanh functions, when x (or z) is very large in magnitude, the slope is very small, which slows gradient descent significantly. This is not the case for the ReLU function, whose slope is exactly 1 for every positive input.
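To make the saturation argument concrete, here is a minimal sketch that evaluates the slope of each activation at a few inputs. The helper names (`sigmoid_grad`, `tanh_grad`, `relu_grad`) and the sample inputs are my own, chosen for illustration only.

```python
import numpy as np

def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - np.tanh(z) ** 2

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return (z > 0).astype(float)

# Hypothetical inputs: one small, two large.
z = np.array([0.5, 5.0, 10.0])

print("sigmoid:", sigmoid_grad(z))  # shrinks toward 0 as z grows
print("tanh:   ", tanh_grad(z))     # shrinks toward 0 even faster
print("relu:   ", relu_grad(z))     # stays at 1 for every positive z
```

Note that ReLU's slope is 0 for negative inputs, so the advantage applies to the positive side of the function.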
Q: What is the use of regularization? What are the differences between L1 and L2 regularization?
Both L1 and L2 regularization are methods used to reduce overfitting to the training data. Ordinary least squares minimizes the sum of the squared residuals, which can result in low bias but high variance.

L2 regularization, also called ridge regression, minimizes the sum of the squared residuals plus lambda times the slope squared. This additional term is called the ridge regression penalty. It increases the bias of the model, making the fit worse on the training data, but also decreases the variance. If you replace the ridge regression penalty with lambda times the absolute value of the slope, you get lasso regression, or L1 regularization. L2 has a stable, unique solution (for lambda greater than zero) but only shrinks coefficients toward zero. L1 can shrink coefficients to exactly zero, which makes it useful for feature selection, but its solution is less stable and may not be unique, for example when features are highly correlated.
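As a rough illustration of the two penalties, the sketch below fits a single slope with no intercept, where both problems have a simple closed form. The function names and the toy data are assumptions made for this example, not part of any library.

```python
import numpy as np

def ridge_slope(x, y, lam):
    """Slope b minimizing sum((y - b*x)**2) + lam * b**2 (no intercept)."""
    return np.dot(x, y) / (np.dot(x, x) + lam)

def lasso_slope(x, y, lam):
    """Slope b minimizing sum((y - b*x)**2) + lam * abs(b) (no intercept).

    Setting the derivative to zero gives a soft-thresholded solution.
    """
    s = np.dot(x, y)
    return np.sign(s) * max(abs(s) - lam / 2.0, 0.0) / np.dot(x, x)

# Hypothetical toy data with a slope of roughly 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8])

for lam in (0.0, 10.0, 100.0):
    print(f"lambda={lam:6.1f}  ridge={ridge_slope(x, y, lam):.4f}  "
          f"lasso={lasso_slope(x, y, lam):.4f}")
```

With lambda at 0 both reduce to ordinary least squares. As lambda grows, the ridge slope keeps shrinking but stays nonzero, while the lasso slope is clipped to exactly zero once lambda is large enough.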
Q: What is the difference between online and batch learning?
Batch learning, also known as offline learning, is when the model is trained on the entire available dataset at once. This is the type of learning that most people are familiar with, where you source a dataset and build a model on the whole of it, and retrain from scratch when new data arrives.
Online learning, on the other hand, is an approach that ingests data one observation at a time. Online learning is data-efficient because the data is no longer required once it is consumed, which technically means that you don’t have to store your data.
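Here is a minimal sketch of online learning, assuming a simple linear model updated by stochastic gradient descent. The generator `observation_stream`, the function `fit_online`, and the true parameters (slope 3, intercept 1) are all made up for illustration.

```python
import numpy as np

def observation_stream(n, seed=0):
    """Yield n (x, y) pairs one at a time, as a stand-in for live data."""
    rng = np.random.default_rng(seed)
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 3.0 * x + 1.0 + rng.normal(scale=0.01)  # hypothetical true model
        yield x, y

def fit_online(stream, learning_rate=0.1):
    """Update weight and bias after every single observation."""
    w, b = 0.0, 0.0
    for x, y in stream:
        error = (w * x + b) - y          # prediction error on this observation
        w -= learning_rate * error * x   # gradient step for the weight
        b -= learning_rate * error       # gradient step for the bias
        # The observation is not stored; it is discarded after this step.
    return w, b

w, b = fit_online(observation_stream(5000))
print(w, b)
```

A batch learner would instead collect all 5000 observations first and fit on them together.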
Q: How would you handle NULLs when querying a data set? Are there any other ways?
There are a number of ways to handle null values including the following:
- You can omit rows with null values altogether
- You can replace null values with a measure of central tendency (mean, median, mode) or with a new category (e.g. ‘None’)
- You can predict the null values based on other variables. For example, if a row has a null value for weight, but it has a value for height, you can replace the null value with the average weight for that given height.
- Lastly, you can leave the null values as they are if you are using a machine learning model that handles null values automatically.
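The first three options can be sketched directly in SQL. The example below uses Python's built-in `sqlite3` module with an in-memory table; the `people` table and its rows are hypothetical, and the imputation here matches on exact height rather than fitting a predictive model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, height REAL, weight REAL)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("a", 170.0, 65.0), ("b", 170.0, None), ("c", 180.0, 80.0),
     ("d", 180.0, None), ("e", 170.0, 75.0)],
)

# Option 1: omit rows with a NULL weight.
complete_rows = conn.execute(
    "SELECT name, weight FROM people WHERE weight IS NOT NULL ORDER BY name"
).fetchall()

# Option 2: replace NULLs with the overall mean (AVG ignores NULLs).
mean_filled = conn.execute(
    "SELECT name, COALESCE(weight, (SELECT AVG(weight) FROM people)) "
    "FROM people ORDER BY name"
).fetchall()

# Option 3: replace NULLs with the mean weight of rows with the same height.
group_filled = conn.execute(
    "SELECT p.name, COALESCE(p.weight, "
    "  (SELECT AVG(q.weight) FROM people q WHERE q.height = p.height)) "
    "FROM people p ORDER BY p.name"
).fetchall()

print(complete_rows)
print(mean_filled)
print(group_filled)
```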
Q: How do you prevent overfitting and complexity of a model?
For those who don’t know, overfitting is a modeling error in which a function fits the training data too closely, resulting in high levels of error when new data is introduced to the model.
There are a number of ways that you can prevent overfitting of a model:
- Cross-validation: Cross-validation is a technique used to assess how well a model performs on new, independent data, which lets you detect overfitting and choose a simpler model when needed. The simplest example is a holdout split, where you divide your data into training data and testing data, build the model on the training data, and evaluate it on the testing data. In k-fold cross-validation, this is repeated k times so that each fold serves as the testing data once.
- Regularization: Overfitting often occurs when a model is too complex, for example when it includes high-degree polynomial terms. Regularization reduces overfitting by penalizing large coefficients, which discourages that complexity.
- Reduce the number of features: You can also reduce overfitting by simply reducing the number of input features. You can do this by manually removing features, or you can use a technique called Principal Component Analysis, which projects higher dimensional data (e.g. 3 dimensions) to a smaller space (e.g. 2 dimensions).
- Ensemble Learning Techniques: Ensemble techniques combine many models into a single stronger one. Bagging averages many models trained on resampled data to reduce variance, while boosting builds weak learners sequentially so that each one corrects the errors of its predecessors. As a result, ensembles tend to overfit less than a single complex model.
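To show how cross-validation exposes overfitting, here is a minimal k-fold sketch using polynomial fits from NumPy. The helpers `train_mse` and `kfold_mse`, the synthetic linear data, and the choice of degrees 1 and 9 are assumptions for illustration.

```python
import numpy as np

def train_mse(x, y, degree):
    """Mean squared error of a polynomial fit on the data it was trained on."""
    coeffs = np.polyfit(x, y, degree)
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

def kfold_mse(x, y, degree, k=5, seed=0):
    """Average held-out mean squared error over k shuffled folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        pred = np.polyval(coeffs, x[test_idx])
        errors.append(np.mean((pred - y[test_idx]) ** 2))
    return float(np.mean(errors))

# Hypothetical data: a straight line plus noise.
rng = np.random.default_rng(42)
x = np.linspace(-1.0, 1.0, 30)
y = 2.0 * x + rng.normal(scale=0.3, size=x.size)

for degree in (1, 9):
    print(f"degree={degree}  train MSE={train_mse(x, y, degree):.4f}  "
          f"5-fold MSE={kfold_mse(x, y, degree):.4f}")
```

The higher-degree fit always matches the training data at least as well, so training error alone cannot reveal overfitting. The held-out error is the number to compare, and on data like this it will typically favor the simpler model.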
Q: How would you design an experiment for a new feature we’re thinking about? What metrics would matter?

I would conduct an A/B test to determine if the introduction of a new feature results in a statistically significant improvement in a given metric that we care about. The metrics chosen depend on the goal of the feature. For example, a feature may be introduced to increase conversion rates, web traffic, or retention rates.
First I would formulate my null hypothesis (feature X will not improve metric A) and my alternative hypothesis (feature X will improve metric A).
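Next, I would create my control and test groups through random assignment. The t-test accounts for sample size when computing significance, but a sample that is too small may lack the power to detect a real effect, so ideally I would estimate the required sample size up front with a power analysis based on the smallest improvement I care about.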
Once I collect my data, depending on the characteristics of my data, I’d then conduct a t-test, Welch’s t-test, chi-squared test, or a Bayesian A/B test to determine whether the differences between my control and test group are statistically significant.
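As one possible sketch of the final step, the example below runs a one-sided two-proportion z-test on conversion counts using only the standard library. I am swapping this in for the tests named above because it needs no extra packages; for a 2x2 table it is equivalent to a chi-squared test without continuity correction. The function name and the counts are hypothetical.

```python
import math

def two_proportion_z_test(conv_control, n_control, conv_test, n_test):
    """One-sided z-test of H1: the test group's conversion rate is higher."""
    p_control = conv_control / n_control
    p_test = conv_test / n_test
    # Pooled rate under the null hypothesis of no difference.
    pooled = (conv_control + conv_test) / (n_control + n_test)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_test))
    z = (p_test - p_control) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 200 of 2000 control users converted, 250 of 2000 test users.
z, p = two_proportion_z_test(200, 2000, 250, 2000)
print(f"z = {z:.3f}, one-sided p-value = {p:.4f}")
```

If the p-value falls below the significance level chosen before the experiment (commonly 0.05), I would reject the null hypothesis that the feature does not improve the metric.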