Nail data science interviews with confidence, part 1
Machine learning questions usually make up a large portion of data science interviews. Positions such as data scientist and machine learning engineer require candidates to have a comprehensive understanding of machine learning models and to be familiar with conducting analysis using them. While discussing your projects with the interviewer demonstrates your understanding of certain models, you should also expect fundamental machine learning questions about model selection, feature selection, feature engineering, model evaluation, etc. In this article, I will go over 20 machine learning questions and explain how I would answer them during interviews.
Model Specifics
1, What are supervised machine learning problems, and what are unsupervised machine learning problems?
You can easily distinguish them by checking whether there are target values, or labels, to predict. Supervised machine learning maps data to target values, so the model uses features extracted from the data to predict those targets. For example, using linear regression to predict housing prices, or using logistic regression to predict whether a person will default on their debts.
Unsupervised machine learning problems have no target values to predict; instead, the model learns to uncover general patterns in the data. For example, clustering data points based on their similarity, or reducing dimensionality based on feature variance.
2, What is a classification problem, and what is a regression problem?
Both classification and regression problems are supervised machine learning problems, so both have target values. Classification problems have discrete target values that stand for classes; in a binary classification problem, there are only a positive class and a negative class. Regression problems have continuous target values to predict, such as housing prices or waiting times.
3, What are the parameters and hyper-parameters for a machine learning model?
Parameters are learned during the fitting process of the model, while hyper-parameters are set beforehand, either by default or by searching over candidate values, for example with GridSearchCV. Take ridge regression as an example: the parameters are the coefficients of the features, while the hyper-parameter is the α that specifies the level of regularization in the model.
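As a minimal sketch (using scikit-learn and made-up data, so the numbers are only illustrative), the coefficients below are the parameters learned during fitting, while alpha is the hyper-parameter chosen by grid search:

```python
# Coefficients (parameters) are learned by fit(); alpha (hyper-parameter) is chosen by grid search.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 2.0, 0.0, 0.0, -1.0]) + rng.normal(scale=0.1, size=100)

search = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)

print("best hyper-parameter alpha:", search.best_params_["alpha"])
print("fitted parameters (coefficients):", search.best_estimator_.coef_)
```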
4, What is the cost function of logistic regression?
Logistic regression uses cross-entropy as the cost function:
J(θ) = −(1/m) Σ_j [ y_j log f(X_j) + (1 − y_j) log(1 − f(X_j)) ]
The cross-entropy cost function simultaneously penalizes uncertainty and incorrect predictions, and incorrect predictions made with high confidence contribute the largest penalties to the sum. For example, when y_j = 0 and your model predicts f(X_j) = 0.9, that observation contributes −log(1 − 0.9) = −log(0.1) to the cost; as the predicted probability approaches 1, the penalty grows toward infinity.
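A quick numeric sketch in plain NumPy (the cross_entropy helper below is just for illustration) makes the penalty behavior concrete:

```python
import numpy as np

def cross_entropy(y_true, y_prob):
    # average of -[y*log(p) + (1-y)*log(1-p)] over all observations
    y_true, y_prob = np.asarray(y_true, float), np.asarray(y_prob, float)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(cross_entropy([0], [0.9]))    # -log(0.1) ≈ 2.30: confident and wrong
print(cross_entropy([0], [0.999]))  # ≈ 6.91: the penalty keeps growing toward infinity
print(cross_entropy([0], [0.1]))    # -log(0.9) ≈ 0.11: confident and right
```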
5, What is SVM, and what is the support vector?
Support Vector Machine(SVM) is a supervised machine learning algorithm that is usually used in solving binary classification problems. It can also be applied in multi-class classification problems and regression problems.
The support vectors are the data points that lie closest to the separating hyperplane. They are the most difficult data points to classify. Moreover, the support vectors are the elements of the training set that would change the position of the dividing hyperplane if removed. I have an article that explains more concepts about SVM.
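A minimal scikit-learn sketch on toy blob data shows how to inspect the support vectors after fitting:

```python
# After fitting a linear SVM, the support vectors are exposed via support_vectors_.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("number of support vectors per class:", clf.n_support_)
print("points that pin down the separating hyperplane:")
print(clf.support_vectors_)
```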
6, What are Gradient Descent and Stochastic Gradient Descent?
Each machine learning model has a cost function J (θ_0, θ_1,…θ_n), where θs are the parameters. To find the optimal parameters during fitting, we are solving an optimization problem:
min J (θ_0, θ_1,…θ_n)
w.r.t θ_0, θ_1,…θ_n
Gradient Descent solves this problem by taking first-order iterations:
θ_i := θ_i − ɳ ∂J/∂θ_i, for i = 0, 1, …, n
It starts with random values of the θs and keeps updating them based on the first-order partial derivatives: when a partial derivative is positive, we decrease the corresponding θ, and vice versa. When the partial derivatives reach zero, or get close enough to zero, the iterations stop at a local/global minimum. ɳ is the learning rate: when it is small, convergence takes longer; when it is too big, the cost function may not decrease at every iteration and may even diverge.
Stochastic Gradient Descent is an optimization method that considers each training observation individually, instead of all at once (as ordinary gradient descent would). Instead of calculating the exact gradient of the cost function, it uses each observation to estimate the gradient and then takes a step in that direction. While each individual observation provides a poor estimate of the true gradient, given enough iterations the parameters converge to a good estimate. Because it needs to consider only a single observation at a time, stochastic gradient descent can handle data sets too large to fit in memory.
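Here is a minimal NumPy sketch of both variants for linear regression with a mean squared error cost; the data, learning rates, and iteration counts are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(200), rng.normal(size=(200, 2))]   # add an intercept column
y = X @ np.array([1.0, 3.0, -2.0]) + rng.normal(scale=0.1, size=200)

def batch_gradient_descent(X, y, eta=0.1, n_iters=500):
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = 2 / len(y) * X.T @ (X @ theta - y)     # exact gradient over all rows
        theta -= eta * grad                           # step against the gradient
    return theta

def stochastic_gradient_descent(X, y, eta=0.01, n_epochs=20):
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):             # one observation at a time
            grad = 2 * X[i] * (X[i] @ theta - y[i])   # noisy estimate of the gradient
            theta -= eta * grad
    return theta

print(batch_gradient_descent(X, y))        # both should land near [1, 3, -2]
print(stochastic_gradient_descent(X, y))
```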
7, How to choose K for K-means?
We choose the number of clusters to define in the K-means algorithm beforehand, and the K value is determined both technically and practically.
First, we can plot the Elbow curve, which measures distortion (the average of the squared distances from the cluster centers) or inertia (the sum of squared distances of samples to their closest cluster center) against K. Note that distortion and inertia always decrease as K increases, and if K equals the number of data points, they drop to zero. We use the Elbow curve to check how fast the value decreases and choose the K at the "Elbow point", where the decrease becomes substantially slower.
Practically speaking, we need to choose a K that is easy to interpret and practically feasible. For example, if your company only has the resources (labor and capital) to handle three customer categories, then you have to set K to three.
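A minimal sketch of the Elbow curve with scikit-learn's KMeans on synthetic blob data (the number of true clusters here is made up):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")   # look for the "elbow" where the drop slows down
plt.xlabel("K (number of clusters)")
plt.ylabel("inertia")
plt.title("Elbow curve")
plt.show()
```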
8, What is online learning?
Online learning means updating an existing fit with new data, rather than re-fitting the whole model. It is usually applied in two scenarios. One is when your data arrives sequentially and you want to adjust your model incrementally to accommodate the new data. The other is when your data is too large to train on all at once; you can then use Stochastic Gradient Descent or feed the data in batches, depending on the model you are using.
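A minimal sketch of online learning in scikit-learn: estimators that implement partial_fit can be updated batch by batch as new data arrives (the batches here are simulated, and the "log_loss" name assumes a recent scikit-learn version):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss")   # logistic regression fitted with SGD
classes = np.array([0, 1])               # all classes must be declared on the first call

rng = np.random.default_rng(0)
for _ in range(10):                      # pretend each loop iteration is a new batch
    X_batch = rng.normal(size=(100, 3))
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)

print(model.coef_, model.intercept_)
```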
Model Evaluation
9, What is the difference between under-fitting and over-fitting?
Underfitting is when your model is not complex enough to learn the patterns in the data, and over-fitting is when your model is too complicated and picks up the noise rather than the patterns. When underfitting, your model will perform poorly on both the training set and the test set, and you need to include more features or use a more complicated model. When over-fitting, the model will perform very well on the training set, but it will not generalize to new data, which means it will perform badly on the test set. You then need to use a simpler model, drop some features, or reduce over-fitting through regularization, bagging, or dropout.
10, What is the trade-off between bias and variance?
Bias measures how far your model's predictions are from the true values on average, so it is a measure of underfitting. Variance measures how much your model has fit the noise in the data, so it is a measure of over-fitting. The trade-off is that making a model more complex typically reduces bias but increases variance, and vice versa; we aim for the level of complexity that minimizes the total error.
11, What is regularization, and what is the difference between L1 and L2 regularization?
We usually use regularization in linear models to control over-fitting. Regularization adds a penalty on the size of the parameters to the cost function when fitting the model. It therefore forces the model to choose fewer features or to shrink the features' parameters, which reduces the chance of over-fitting, especially when there are a lot of features.
L1 regularization adds the absolute values of the parameters to the cost function, while L2 regularization adds the squares of the parameters. In linear regression, L1 regularization gives Lasso regression and L2 regularization gives Ridge regression. L1 regularization can drive the parameters of uninformative features exactly to zero, so it can be used for feature selection. L2 regularization cannot reduce any parameter to exactly zero, but it can shrink them to very small values. Also, since L2 uses squared values, it heavily punishes "outliers", i.e., very large parameters. L1 regularization suits models with fewer features, each with a large or medium effect, while L2 regularization suits models with many features, each with a small effect.
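A minimal sketch comparing the two on the same made-up data, where only the first two features matter: Lasso (L1) tends to drive the uninformative coefficients exactly to zero, while Ridge (L2) only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)  # 4 features are pure noise

print("Lasso (L1):", Lasso(alpha=0.1).fit(X, y).coef_.round(3))   # noise coefficients -> zero
print("Ridge (L2):", Ridge(alpha=10.0).fit(X, y).coef_.round(3))  # noise coefficients -> small, not zero
```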
12, How to interpret L2 regularization from a Bayesian point of view?
From a Bayesian point of view, the parameters are determined by:
p(β | y, X) ∝ p(y | X, β) p(β)
where p(β | y, X) is the posterior distribution, p(β) is the prior distribution, and p(y | X, β) is the likelihood function. If we ignore the prior distribution and only maximize the likelihood function to estimate β, we have no regularization. When we make assumptions about the prior distribution, we are adding regularization, which means we put limits on which values of β can be chosen for the model. For L2 regularization, we add the assumption that β follows a normal distribution with mean equal to zero.
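A short derivation sketch (standard maximum a posteriori reasoning, assuming Gaussian noise with variance σ² and a zero-mean Gaussian prior with variance τ²) shows where the L2 penalty comes from:

```latex
\hat{\beta}_{\text{MAP}}
  = \arg\max_{\beta}\, p(y \mid X, \beta)\, p(\beta)
  = \arg\min_{\beta}\, \big[-\log p(y \mid X, \beta) - \log p(\beta)\big]

% with y = X\beta + \epsilon, \ \epsilon \sim \mathcal{N}(0, \sigma^2 I)
% and \beta_i \sim \mathcal{N}(0, \tau^2):

\hat{\beta}_{\text{MAP}}
  = \arg\min_{\beta}\, \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2,
  \qquad \lambda = \frac{\sigma^2}{\tau^2}
```

A Laplace (double-exponential) prior on β would give the L1 penalty instead.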
For more information about Bayesian statistics, you can read another article of mine.
13, How do you evaluate regression models and classification models? (also considering their practical effectiveness)
To evaluate a model, we need to assess its performance both technically and practically. Technically speaking, depending on the scenario, we use MSE, MAE, RMSE, etc., to evaluate regression models, and accuracy, recall, precision, F score, and AUC to evaluate classification models. I have an article describing how to choose metrics for evaluating classification models:
The Ultimate Guide of Classification Metrics for Model Evaluation
On the practical side, we need to evaluate whether the model is ready to deploy, and we use business metrics in this case. If we are improving an old model, we can simply compare the technical metrics of the old and new models to see whether the new one performs better. If there is no existing model to compare against, you need to define "good performance" with business metrics, for example, how much is lost when we follow the model's wrong predictions, which depends heavily on the business scenario. If sending ads is cheap, a model with low precision may still be acceptable; if sending ads is expensive, we need higher precision.
14, How to evaluate linear regression models?
There are several ways to evaluate linear regression models. We can use metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). Note that if you do not want outliers to dominate your evaluation, you should use MAE rather than MSE. In addition to these metrics, we can also use R-Squared or adjusted R-Squared (R²). R-Squared compares the model you built with a baseline model that always predicts the mean of y:
R² = 1 − SS_res / SS_tot = 1 − Σ(y_i − ŷ_i)² / Σ(y_i − ȳ)²
If your model is worse than the baseline model, R-Squared can be less than zero. Adjusted R-Squared corrects R-Squared for the number of features your model uses to make predictions: if adding a feature does not improve the model more than would be expected by chance, adjusted R-Squared decreases.
Note that MAE and MSE are both difficult to interpret without context because they depend on the scale of the data. However, because R-Squared has a fixed range, a value close to 1 always means that the model is fitting the data fairly well.
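A minimal sketch computing these metrics with scikit-learn on a tiny made-up example:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.0, 8.0, 11.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)   # 1 - SS_res / SS_tot; negative if worse than the baseline

print(f"MAE={mae:.3f}, MSE={mse:.3f}, RMSE={rmse:.3f}, R2={r2:.3f}")
```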
15, What are bagging and boosting? Why do we use them?
Bagging is training ensemble models in parallel. We train a collection of models of the same type on randomly selected sub-samples (drawn with replacement) and randomly selected features. The final prediction combines all the models' predictions: for classification problems it takes the majority vote, while for regression problems it takes the average of the individual predictions. Bagging is usually used to combat over-fitting, and Random Forest is a great example.
Boosting is training models sequentially. It builds a series of models, where each model is trained on data re-weighted to focus on the observations the previous models got wrong. The individual predictions are then combined in a weighted average at the end. Boosting is a technique that fights underfitting, and Gradient Boosted Decision Trees are a great example.
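A minimal sketch with scikit-learn, putting one bagging ensemble (Random Forest) and one boosting ensemble (Gradient Boosted Trees) side by side on toy data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

bagging = RandomForestClassifier(n_estimators=200, random_state=0)       # trees trained in parallel
boosting = GradientBoostingClassifier(n_estimators=200, random_state=0)  # trees trained sequentially

print("Random Forest CV accuracy:", cross_val_score(bagging, X, y, cv=5).mean().round(3))
print("Gradient Boosting CV accuracy:", cross_val_score(boosting, X, y, cv=5).mean().round(3))
```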
16, What cross-validation technique should you use on a time series data set?
Standard cross-validation techniques split the data into folds regardless of their order, or even shuffle the data first, which is undesirable for time series analysis. The order of time series data matters: we do not want to train on future data and test on past data. Instead, we need to preserve the order and only train on the past.
There are two methods: "sliding window" and "forward chaining". In both, we preserve the order of the data and slice it into folds. With a sliding window, we train on fold 1 and test on fold 2, then train on fold 2 and test on fold 3, and so on until we have tested the last fold. With forward chaining, we train on fold 1 and test on fold 2, then train on folds 1+2 and test on fold 3, then train on folds 1+2+3 and test on fold 4, continuing until we have tested the last fold.
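As a minimal sketch, scikit-learn's TimeSeriesSplit implements the forward-chaining scheme (an expanding training window); passing max_train_size makes it behave more like a sliding window:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)   # 12 time-ordered observations

for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    print("train:", train_idx, "test:", test_idx)
# train: [0 1 2 3]       test: [4 5]
# train: [0 1 2 3 4 5]   test: [6 7]  ... and so on, always training on the past
```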
Data Preparation
17, What is data normalization, and why?
Data normalization (or scaling) makes all continuous features have a more consistent range of values. For each feature, we subtract the feature mean and divide by its standard deviation or range. The goal is to put all continuous features on the same scale. Data normalization is useful in at least three circumstances (a short sketch follows the list):
1, for algorithms that use Euclidean distance, such as K-means and KNN: features on different scales distort the distance calculation.
2, for algorithms that optimize with gradient descent: features on different scales make it harder for gradient descent to converge.
3, for dimensionality reduction (PCA), which finds the combinations of features that carry the most variance: without scaling, features with large units dominate the variance.
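A minimal sketch with scikit-learn: StandardScaler standardizes each feature to zero mean and unit variance (MinMaxScaler would rescale to a fixed range instead):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])          # two features on very different scales

X_scaled = StandardScaler().fit_transform(X)
print(X_scaled)                       # each column now has mean 0 and unit variance
```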
18, How to deal with missing data?
The answer depends heavily on the specific scenarios, but here are some choices:
1, delete rows/columns with missing values when the size of the data set will not shrink substantially, unless filtering them out would bias the sample.
2, use the mean/median/mode to replace the missing values (see the sketch after this list): this can be problematic because it reduces the variance of the feature and ignores its correlation with other features.
3, predict the missing values by building an interpolator or by modeling them from the other features.
4, treat missingness as a separate feature: sometimes values are missing for specific reasons that can be informative for the analysis.
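A minimal sketch of options 2 and 4 with pandas and scikit-learn (the income column is made up):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"income": [40_000, np.nan, 65_000, 52_000, np.nan]})

df["income_missing"] = df["income"].isna().astype(int)   # option 4: keep a missingness flag
df["income"] = SimpleImputer(strategy="median").fit_transform(df[["income"]]).ravel()  # option 2

print(df)
```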
19, How to deal with imbalanced data set?
An imbalanced data set makes standard classification metrics, like accuracy, misleading. There are several ways to deal with an imbalanced data set:
1, choose different metrics to evaluate the model depending on the problem: F score, recall, precision, etc.
2, drop some observations from the larger class: down-sample the majority class by randomly throwing away some of its data.
3, increase the number of observations in the smaller class: up-sample the minority class, either by making multiple copies of its data points (which can cause over-fitting) or by creating synthetic data with a method such as SMOTE, where the existing minority-class points are used to create new, similar points (a short sketch follows this list).
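A minimal sketch of up-sampling the minority class with sklearn.utils.resample on made-up labels (SMOTE, from the separate imbalanced-learn package, would create synthetic points instead of copies):

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"feature": range(100), "label": [0] * 90 + [1] * 10})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

minority_upsampled = resample(minority, replace=True,          # copies drawn with replacement
                              n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_upsampled])

print(balanced["label"].value_counts())   # 90 observations of each class
```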
20, What is dimensionality reduction? Why and how?
Dimensionality reduction is reducing the dimensionality (the number of features) of the data before fitting the model.
The curse of dimensionality is the primary reason for conducting dimensionality reduction. It says that as dimensionality increases, the data becomes sparser in each dimension, and higher dimensionality makes the model easier to overfit.
Depending on the problem, there are different techniques for dimensionality reduction. The most straightforward is principal component analysis (PCA), an unsupervised machine learning algorithm that projects the data onto a smaller set of uncorrelated components that retain the most variance.
In text analysis, you usually need to reduce dimensionality because the vectorized matrix is sparse and very wide; for example, HashingVectorizer maps tokens into a fixed number of features, bounding the dimensionality before the data is fit into a model.
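A minimal PCA sketch with scikit-learn on the built-in digits data: scale the features, then keep enough components to explain, say, 95% of the variance:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)           # 64 pixel features per image
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=0.95)                  # keep components explaining 95% of the variance
X_reduced = pca.fit_transform(X_scaled)

print(X.shape, "->", X_reduced.shape)         # far fewer than 64 components remain
```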
These are the twenty machine learning questions to prepare for interviews. You can use them as a checklist when preparing for interviews, or as a study guide to better understand machine learning fundamentals. I have written other interview guides for questions in statistics:
22 Statistics Questions to Prepare for Data Science Interviews
Some practice questions in probability:
12 Probability Practice Questions for Data Science Interviews
Case study questions:
Structure Your Answers for Case Study Questions during Data Science Interviews
Behavioral questions:
Thank you for reading! Lastly, don’t forget to:
- Check these other articles of mine if interested;
- Subscribe to my email list;
- Sign up for Medium membership;
- Or follow me on YouTube and watch my most recent YouTube video: