Recently, I was asked by some new graduate students how to prepare for Data Scientist interviews. Generally, Data Scientist or Machine Learning Engineer interviews consist of three parts: Maths, Stats, and Machine Learning algorithms. Some useful preparation materials are suggested at the end of this article. Here, we will go through five frequently asked questions, with answers, to give you a sense of how to answer Data Scientist interview questions.
- What are weights and biases?
- Have you used the ReLU function? Can you explain how it works?
- How is a floating-point number stored in 32-bit computer memory?
- What are Gradient Descent and Stochastic Gradient Descent? And what is the difference between them?
- What is Backpropagation?
What are weights and biases?

Weights and biases are the learnable parameters of your model. Their values are randomly initialized before training and then updated automatically by TensorFlow's optimizer during training. The following code is an example of updating weights and biases with gradient descent:
# One training step: gradient descent with learning rate 0.02 nudges
# every trainable variable (all weights and biases) to reduce the loss
train_step = tf.train.GradientDescentOptimizer(0.02).minimize(cross_entropy)
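For context, here is a minimal sketch (TensorFlow 1.x style; the shapes are illustrative assumptions, not from the original) of how such weights and biases are typically declared before training:
import tensorflow as tf

# Weights are randomly initialized; biases usually start at zero.
# Both are tf.Variables, so the optimizer can update them during training.
W = tf.Variable(tf.random_normal([784, 10]), name='weights')
b = tf.Variable(tf.zeros([10]), name='bias')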
Have you used the ReLU function? Can you explain how it works?
ReLU, which stands for Rectified Linear Unit, is a simple non-linear activation function: f(x) = max(0, x). Any negative input is set to zero, while positive inputs pass through unchanged.


ReLU is widely applied in Neural Networks and Deep Learning. It works like a switch: when a unit's input is negative, the unit is "switched" off and outputs zero; when the input is positive, the unit is "switched" on and passes the input through. TensorFlow provides ReLU and its variants (Noisy ReLU, Leaky ReLU, ELU) through the tf.nn module. For example, the following creates a convolutional layer (for a CNN) with tf.nn.relu as its activation:
import tensorflow as tf

conv_layer = tf.layers.conv2d(inputs=input_layer,
                              filters=64,
                              kernel_size=[3, 3],
                              padding='same',
                              activation=tf.nn.relu)
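To see the behaviour directly, here is a quick sketch (TensorFlow 1.x session style; the input values are chosen purely for illustration):
import tensorflow as tf

# tf.nn.relu zeroes out negatives and passes positives through
x = tf.constant([-2.0, -0.5, 0.0, 1.5])
with tf.Session() as sess:
    print(sess.run(tf.nn.relu(x)))  # [0.  0.  0.  1.5]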
How is a floating-point number stored in 32-bit computer memory?
Yes, this is a "low-level" question that requires candidates to have some background knowledge of computer hardware. In the IEEE 754 standard, the bits of a single-precision (binary32) floating-point number are laid out as follows:

- sign (1 bit): determines the sign of the number
- exponent (8 bits): stored with a bias of 127, so we subtract 127 from the stored value to recover the actual exponent
- fraction (23 bits): the digits after the binary point; there is an implicit leading bit with value 1 that is not stored
Therefore, we can deduce that the real value encoded is value = (-1)^sign × 1.fraction × 2^(exponent - 127).
Here is how to convert the base-10 real number 12.875 into the IEEE 754 binary32 format.
- 12 = 8 + 4 = 1100 in binary
- .875 = 0.5 + 0.25 + 0.125 = 2^-1 + 2^-2 + 2^-3 = .111 in binary
So the binary representation of 12.875 is 1100.111, which normalizes to 1.100111 × 2^3.
In this case, the number is positive, so the sign bit is 0; the exponent is 3, and adding the bias 127 gives 130 = 1000 0010; the fraction is 100111 padded with zeros to 23 bits: 10011100000000000000000. From these we can form the resulting 32-bit IEEE 754 binary32 representation of 12.875: 0-10000010-10011100000000000000000.
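You can sanity-check this result in Python with the standard struct module (a quick verification sketch, not part of the original walk-through):
import struct

# Pack 12.875 as a big-endian binary32 float, then reinterpret the
# same 4 bytes as an unsigned 32-bit integer to inspect the raw bits.
bits = struct.unpack('>I', struct.pack('>f', 12.875))[0]
print('{:032b}'.format(bits))    # 01000001010011100000000000000000

sign     = bits >> 31            # 0
exponent = (bits >> 23) & 0xFF   # 130, i.e. 3 + 127
fraction = bits & 0x7FFFFF       # 0b10011100000000000000000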
What are Gradient Descent and Stochastic Gradient Descent? And what is the difference between them?
Imagine you are walking down a hill from the peak: what is your next step? Gradient Descent (GD) and Stochastic Gradient Descent (SGD) are two common ways of calculating the next step from the current position.
GD and SGD are optimizers used in Deep Learning; they train a model by slowly nudging its weights (the parameters of the model) toward better results. The key difference is how much data each step uses: GD computes the gradient over the complete training set, while SGD estimates it from a tiny random sample. Each SGD step is therefore much cheaper but noisier, and SGD compensates by taking many more of them. In practice, if you train and compare a fully-connected network with both, the SGD-trained model typically reaches comparable or better accuracy with less training time. You can progressively train deeper and more accurate models in TensorFlow with tf.contrib.keras.optimizers.SGD.
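To make the difference concrete, here is a minimal NumPy sketch of both update rules for a simple linear-regression loss (the data, learning rate, and batch size are illustrative assumptions, not from the original article):
import numpy as np

np.random.seed(0)
X = np.random.randn(1000, 3)               # 1000 samples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * np.random.randn(1000)

lr = 0.1

# Gradient Descent: each step uses the gradient over ALL samples
def gd_step(w):
    grad = X.T @ (X @ w - y) / len(y)
    return w - lr * grad

# Stochastic Gradient Descent: each step uses a small random mini-batch
def sgd_step(w, batch_size=32):
    idx = np.random.randint(0, len(y), batch_size)
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ w - yb) / batch_size
    return w - lr * grad

w = np.zeros(3)
for _ in range(100):
    w = sgd_step(w)                        # many cheap, noisy steps
print(w)                                   # close to true_w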
What is Backpropagation?
Backpropagation is commonly used by gradient-based optimizers to adjust the weights of neurons in multi-layer neural networks.

Backpropagation involves the following steps:
- When an input vector is presented to the network, it is propagated forward through the network, layer by layer, until it reaches the output layer.
- The loss function calculates the difference (called "error") between the network output and its expected output.
- The resulting error value is calculated for each of the neurons in the output layer.
- The error values are then propagated from the output back through the network, until each neuron has an associated error value that reflects its contribution to the original output.
- Backpropagation uses these error values to calculate the gradient of the loss function.
- This gradient is fed to the optimization method, which in turn uses it to update the weights, in an attempt to minimize the loss function.
Often, normalizing the input vectors can improve model performance.
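As an illustration of these steps, here is a minimal NumPy sketch of backpropagation for a tiny one-hidden-layer network with a squared-error loss (the architecture and numbers are my own assumptions for demonstration):
import numpy as np

np.random.seed(0)
x = np.random.randn(4)                    # input vector
t = np.array([1.0, 0.0])                  # expected output

W1, b1 = 0.1 * np.random.randn(8, 4), np.zeros(8)   # hidden layer
W2, b2 = 0.1 * np.random.randn(2, 8), np.zeros(2)   # output layer
lr = 0.1

for step in range(200):
    # 1. Forward pass: propagate the input layer by layer
    h = np.maximum(0, W1 @ x + b1)        # hidden layer (ReLU)
    y = W2 @ h + b2                       # network output
    loss = 0.5 * np.sum((y - t) ** 2)     # squared-error loss

    # 2. Backward pass: propagate error values toward the input
    dy = y - t                            # error at the output layer
    dW2, db2 = np.outer(dy, h), dy
    dh = W2.T @ dy                        # error at the hidden layer
    dh[h <= 0] = 0                        # gradient of ReLU
    dW1, db1 = np.outer(dh, x), dh

    # 3. Optimization step: use the gradients to update the weights
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(loss)                               # should be near zero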
This article has gone through five frequently asked questions with answers for your reference. For preparation, I would recommend the book "Deep Learning" by Goodfellow, Bengio, and Courville.
- Math & Stats: Chapters 2, 3, and 4 are enough to prepare for (or revise) the theoretical questions in such interviews
- Machine Learning algorithms: Chapter 5, around 60 pages, is concise and digestible
If you finish the Deep Learning bible, I would also recommend "Mathematics for Machine Learning" and "Data Science Question Answer".
Interviews are always a good way to find the missing pieces in your Data Scientist puzzle; as data scientists, we might call it Evaluation or Measurement. The expectation is that you perform well, not perfectly. Study and prepare as best as you can, and put your best effort forward.
You’re all set. Good luck!