Deep learning and neural networks can get really complicated. When it comes to Data Science interviews, however, there are only so many concepts that interviewers test. After going through hundreds and hundreds of data science interview questions, I compiled 10 deep learning concepts that came up the most often.
In this article, I’m going to go over these 10 concepts, what they’re all about, and why they’re so important.
With that said, here we go!
1. Activation Functions
If you don’t already have a basic understanding of neural networks and their structures, I would first check out my article, "A Beginner-Friendly Explanation of How Neural Networks Work."
Once you have a basic understanding of neurons/nodes, an activation function is like a light switch – it determines whether a neuron should be activated or not.

There are several types of activation functions, but the most popular is the Rectified Linear Unit function, also known as the ReLU function. It’s known to be a better activation function than the sigmoid and tanh functions because it allows gradient descent to converge faster. With sigmoid and tanh, when x (or z) is very large or very small, the slope of the curve becomes very small, which slows gradient descent significantly (the vanishing gradient problem). This, however, is not the case for the ReLU function, whose slope stays constant for positive inputs.
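To make the comparison concrete, here’s a minimal NumPy sketch of sigmoid, tanh, and ReLU alongside their derivatives. Notice how the sigmoid and tanh gradients shrink toward zero for large |z|, while the ReLU gradient stays at 1 for any positive input:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)

z = np.array([-10.0, -1.0, 0.5, 10.0])
print(sigmoid_grad(z))  # gradients vanish for large |z|
print(tanh_grad(z))     # same problem
print(relu_grad(z))     # 0 for negatives, 1 for positives
```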
2. Cost Function
A cost function for a neural network is similar to a cost function that you would use for any other machine learning model. It’s a measure of how "good" a neural network is with regard to the values that it predicts compared to the actual values. The cost function is inversely proportional to the quality of a model – the better the model, the lower the cost function, and vice versa.
The purpose of a cost function is to give you a value to optimize. By minimizing the cost function of a neural network, you’ll obtain the optimal weights and parameters of the model, thereby maximizing its performance.
There are several commonly used cost functions, including the quadratic cost, cross-entropy cost, exponential cost, Hellinger distance, Kullback-Leibler divergence, and more.
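As a quick example, here’s a small NumPy sketch of two of those cost functions, the quadratic (mean squared error) cost and the binary cross-entropy cost, evaluated on a handful of predictions; the exact averaging conventions vary between frameworks:

```python
import numpy as np

def mse_cost(y_true, y_pred):
    # Quadratic (mean squared error) cost
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy_cost(y_true, y_pred, eps=1e-12):
    # Binary cross-entropy; eps avoids log(0)
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

y_true = np.array([1.0, 0.0, 1.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.7, 0.6])
print(mse_cost(y_true, y_pred))            # lower is better
print(cross_entropy_cost(y_true, y_pred))  # lower is better
```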
3. Backpropagation
Backpropagation is an algorithm that is closely tied to the cost function. Specifically, it is the algorithm used to compute the gradient of the cost function. It has gained a lot of popularity and use due to its speed and efficiency compared to other approaches.
Its name stems from the fact that the calculation of the gradient starts with the gradient of the final layer of weights and moves backwards to the gradient of the first layer of weights. Consequently, the error at layer k is dependent on the next layer k+1.
Generally, backpropagation works as follows (a toy example is sketched after the list):
- Calculate the forward phase for each input-output pair
- Calculate the backward phase for each pair
- Combine the individual gradients
- Update the weights based on the learning rate and the total gradient
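As a toy illustration of those four steps, the sketch below pushes a single input-output pair through a tiny network with one hidden layer, computes the gradients of a squared-error cost with the chain rule, and updates the weights with a learning rate. Real frameworks automate all of this, so treat it purely as an illustration (the network sizes and values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# One input-output pair and a tiny 2-3-1 network
x = np.array([0.5, -1.0])            # input
y = np.array([1.0])                  # target
W1 = rng.normal(0, 0.1, (3, 2))      # hidden-layer weights
W2 = rng.normal(0, 0.1, (1, 3))      # output-layer weights
lr = 0.1                             # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward phase
z1 = W1 @ x
a1 = sigmoid(z1)
y_hat = W2 @ a1
cost = 0.5 * np.sum((y_hat - y) ** 2)
print(cost)

# Backward phase: gradients flow from the output layer back to the first layer
delta2 = y_hat - y                         # error at the output layer
grad_W2 = np.outer(delta2, a1)
delta1 = (W2.T @ delta2) * a1 * (1 - a1)   # error at the hidden layer
grad_W1 = np.outer(delta1, x)

# Update the weights based on the learning rate and the gradients
W2 -= lr * grad_W2
W1 -= lr * grad_W1
```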
Brilliant has a ‘brilliant’ (hah) article on backpropagation which I strongly recommend.
4. Convolutional Neural Networks

A Convolutional Neural Network (CNN) is a type of neural network that takes an input (usually an image), assigns importance to different features of the image, and outputs a prediction. What makes CNNs better than standard fully connected feedforward networks is that they better capture the spatial (pixel) dependencies throughout the image, meaning they can understand the composition of an image better.
For those who are interested, CNNs use a mathematical operation called convolution. Wikipedia defines convolution as a mathematical operation on two functions that produces a third function expressing how the shape of one is modified by the other. Thus, CNNs use convolution in place of general matrix multiplication in at least one of their layers.
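To give a feel for the operation itself, here’s a small NumPy sketch of a 2D convolution (strictly, the cross-correlation most deep learning libraries implement) that slides a 3×3 edge-detection kernel over a tiny image:

```python
import numpy as np

def conv2d(image, kernel):
    # "Valid" cross-correlation, which deep learning libraries call convolution
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.zeros((6, 6))
image[:, 3:] = 1.0                     # left half dark, right half bright
edge_kernel = np.array([[-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0]])
print(conv2d(image, edge_kernel))      # responds strongly at the vertical edge
```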
TLDR: CNNs are a type of neural network that is mainly used for image classification.
5. Recurrent Neural Networks

A Recurrent Neural Network (RNN) is another type of neural network that works exceptionally well with sequential data due to its ability to ingest inputs of varying sizes. RNNs consider both the current input as well as the previous inputs they were given, which means that the same input can technically produce a different output depending on the inputs that came before it.
Technically speaking, RNNs are a type of neural network in which connections between the nodes form a directed graph along a temporal sequence, allowing them to use their internal memory (hidden state) to process variable-length sequences of inputs.
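To make that concrete, here’s a minimal NumPy sketch of a vanilla RNN cell stepping through a short sequence; notice how the hidden state (the network’s internal memory) depends on both the current input and everything that came before it:

```python
import numpy as np

rng = np.random.default_rng(0)

input_size, hidden_size = 4, 8
W_xh = rng.normal(0, 0.1, (hidden_size, input_size))   # input-to-hidden weights
W_hh = rng.normal(0, 0.1, (hidden_size, hidden_size))  # hidden-to-hidden weights
b_h = np.zeros(hidden_size)

sequence = rng.normal(size=(5, input_size))  # 5 time steps, 4 features each
h = np.zeros(hidden_size)                    # internal memory (hidden state)

for x_t in sequence:
    # The new hidden state depends on the current input AND the previous state
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)

print(h)  # final hidden state summarizes the whole sequence
```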
TLDR: RNNs are a type of neural network that is mainly used for sequential or time-series data.
6. Long Short-Term Memory Networks
Long Short-Term Memory (LSTM) networks are a type of Recurrent Neural Network that addresses one of the shortfalls of regular RNNs: they only have short-term memory.
Specifically, if a sequence is too long, i.e. if there is a lag greater than 5–10 steps, RNNs tend to dismiss information that was provided in the earlier steps. For example, if we fed a paragraph into an RNN, it may overlook information provided at the beginning of the paragraph.
Thus LSTMs were created to resolve this issue.
You can read about LSTMs in more detail here but for the sake of this article, I’m only providing a high-level summary.
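If you’d like to see what one looks like in code, here’s a minimal sketch using PyTorch’s built-in nn.LSTM layer (assuming you have PyTorch installed; the sizes are arbitrary). The key point is that the layer returns both the per-step outputs and the final hidden and cell states, and it’s the cell state that carries the longer-term memory:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(4, 20, 8)          # batch of 4 sequences, 20 steps, 8 features each
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([4, 20, 16]) - hidden state at every time step
print(h_n.shape)     # torch.Size([1, 4, 16]) - final hidden state
print(c_n.shape)     # torch.Size([1, 4, 16]) - final cell state (the longer-term memory)
```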
7. Weight Initialization
The point of weight initialization is to make sure that a neural network doesn’t converge to a trivial solution.
If the weights are all initialized to the same value (e.g., zero), then each unit will receive exactly the same signal and every layer will behave as if it were a single cell.
Therefore, you want to randomly initialize the weights near zero, but not exactly zero. This is what the stochastic optimization algorithm used to train the model expects.
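As a quick illustration, here’s a small NumPy sketch contrasting an all-zero initialization with small random values near zero (a He-style scaled Gaussian is used here as one common choice):

```python
import numpy as np

rng = np.random.default_rng(42)
n_in, n_out = 256, 128

# Bad: every unit gets identical weights, so every unit learns the same thing
W_zeros = np.zeros((n_out, n_in))

# Better: small random values near zero (here, a He-style scaled Gaussian)
W_random = rng.normal(loc=0.0, scale=np.sqrt(2.0 / n_in), size=(n_out, n_in))

print(W_zeros.std(), W_random.std())  # 0.0 vs roughly sqrt(2 / n_in)
```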
8. Batch vs. Stochastic Gradient Descent
Batch gradient descent and stochastic gradient descent are two different methods used to compute the gradient.
Batch gradient descent simply computes the gradient using the whole dataset. It is much slower, especially with larger datasets, but works better on convex or smooth error surfaces.
With stochastic gradient descent, the gradient is computed using a single training sample at a time. Because of this, each update is computationally faster and cheaper. As a trade-off, however, the gradient estimates are noisy, so SGD tends to bounce around near the optimum rather than settling exactly on it – it finds a good solution, but not necessarily the optimal one.
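Here’s a toy sketch that fits a single weight by least squares both ways, using one batch update over the whole dataset versus one update per sample (the data and learning rate are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=100)
y = 3.0 * X + rng.normal(scale=0.1, size=100)  # true weight is about 3

lr = 0.05

# Batch gradient descent: one update per pass, using the whole dataset
w_batch = 0.0
for _ in range(50):
    grad = np.mean(2 * (w_batch * X - y) * X)   # gradient over all samples
    w_batch -= lr * grad

# Stochastic gradient descent: one update per sample
w_sgd = 0.0
for _ in range(50):
    for x_i, y_i in zip(X, y):
        grad = 2 * (w_sgd * x_i - y_i) * x_i    # gradient from a single sample
        w_sgd -= lr * grad

print(w_batch, w_sgd)  # both end up near 3, but SGD's path is much noisier
```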
9. Hyper-parameters
Hyper-parameters are the variables that define the network structure and govern how the network is trained. Common hyper-parameters include the following (an illustrative configuration is shown after the list):
- Model architecture parameters such as the number of layers, number of hidden units, etc…
- The learning rate (alpha)
- Network weight initialization
- Number of epochs (defined as one cycle through the whole training dataset)
- Batch size
- and more…
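In practice, these usually end up collected in a configuration object before training. Here’s a purely illustrative Python dictionary, with arbitrary example values, showing the kind of hyper-parameters you might tune:

```python
# Purely illustrative hyper-parameter configuration; values are arbitrary examples
hyperparameters = {
    "num_hidden_layers": 3,
    "hidden_units_per_layer": 128,
    "learning_rate": 1e-3,        # alpha
    "weight_init": "he_normal",   # network weight initialization scheme
    "num_epochs": 20,             # full passes through the training dataset
    "batch_size": 64,
}
```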
10. Learning Rate
The learning rate is a hyper-parameter used in neural networks that controls how much the model is adjusted in response to the estimated error each time the model weights are updated.
If the learning rate is too low, your model will train very slowly, as only minimal updates are made to the weights on each iteration. Thus, it would take many updates to reach the minimum point.
If the learning rate is set too high, the weight updates are so drastic that the loss function exhibits undesirable divergent behavior, and the model may fail to converge.
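To see the effect, here’s a tiny sketch that minimizes f(w) = w² with gradient descent under three different learning rates: one too low, one reasonable, and one high enough to diverge:

```python
def gradient_descent(lr, steps=20, w0=5.0):
    # Minimize f(w) = w**2, whose gradient is 2w
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(gradient_descent(lr=0.001))  # too low: barely moves from the start point
print(gradient_descent(lr=0.1))    # reasonable: ends up close to the minimum at 0
print(gradient_descent(lr=1.1))    # too high: the updates overshoot and diverge
```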
Thanks for Reading!
And that’s all! I hope this helps you with your interview prep. Having a strong understanding of these ten concepts will serve as a solid base for further learning in the realm of Deep Learning.
As always, I wish you the best in your endeavors!
Not sure what to read next? I’ve picked another article for you:
Ten SQL Concepts You Should Know for Data Science Interviews
Terence Shin
- If you enjoyed this, follow me on Medium for more
- Sign up for my email list here!
- Let’s connect on LinkedIn
- Interested in collaborating? Check out my website.