
Introduction
In this article, I’m going to talk a little bit about the theory behind deep learning models. I am studying the Deep Learning textbook by Ian Goodfellow et al. and I found a really cool idea in there that I’m going to share. I’ll start with a brief explanation of Maximum Likelihood Estimation and then show you that when you use the MSE (Mean Squared Error) loss function, you are actually using Cross Entropy! Don’t worry if this idea seems weird now; I’ll explain it. And please do not be afraid of the math and mathematical notation that follows! I am actually a medical student and I do not have a rigorous math background, but I started studying books and taking courses to self-study the math topics I needed. So, rest assured that if I can understand them, you definitely can as well! I will explain everything from the point of view of a non-math person and do my best to give you the intuitions as well as the actual math. So, let’s get started!
MSE is Cross Entropy at heart!
I know this may sound weird at first because if you are like me – starting deep learning without a rigorous math background and trying to use it mostly in practice – MSE is tied in your mind to regression tasks and cross entropy to classification tasks (binary or multi-class classification). We are also kind of right to think of them (MSE and cross entropy) as two completely distinct animals, because many academic authors and also deep learning frameworks like PyTorch and TensorFlow use the word "cross-entropy" only for the negative log-likelihood (I’ll explain this a little further on) of a binary or multi-class classifier (e.g. after a sigmoid or softmax activation function); however, according to the Deep Learning textbook, this is a "misnomer".
So, what is happening behind the scenes that makes these two not so different? That is what this article is about. But before that, I’ll explain the idea of maximum likelihood estimation to make sure we are on the same page!
Maximum Likelihood Estimation
When we are training a neural network, we are actually learning a complicated probability distribution, P_model, with a lot of parameters, that can best describe the actual distribution of the training data, P_data. So, we want to find the best model parameters, θ (theta): the ones that maximize the probability the model assigns to the whole training set X. You can see this in math:
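$$\theta_{ML} = \underset{\theta}{\arg\max}\; p_{model}(X; \theta) = \underset{\theta}{\arg\max}\; \prod_{i=1}^{m} p_{model}\big(x^{(i)}; \theta\big)$$

Figure 1: maximum likelihood estimation: the parameters that maximize the probability the model assigns to the whole training set X.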

where x⁽ⁱ⁾ denotes the i-th training example, and you have m of them in total. If you don’t know the big notation that looks like the number pi (Π), don’t worry: it just says to multiply all the probabilities that come after it.
Because of numerical issues (namely, underflow), we actually maximize the logarithm of the formula above instead. Thanks to the log, the product turns into a sum of log probabilities, which is a really nice property of the logarithm:
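$$\theta_{ML} = \underset{\theta}{\arg\max}\; \sum_{i=1}^{m} \log p_{model}\big(x^{(i)}; \theta\big)$$

Figure 2: the log-likelihood version: the product becomes a sum of log probabilities.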

You might ask: here I am talking about maximizing something, but in deep learning frameworks we actually minimize a loss function; so, what is going on? Well, when you want to maximize something, you can simply minimize the negative of that expression and you are good to go! But there is another way to think about it.
I said that when training a neural network, we are trying to find the parameters of a probability distribution that is as close as possible to the distribution of the training set. Conveniently, in the math world there is a notion known as the KL Divergence, which tells you how far apart two distributions are: the bigger this metric, the further apart the two distributions. This is the formula for the KL Divergence:
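$$D_{KL}\big(P_{data} \,\|\, P_{model}\big) = \mathbb{E}_{x \sim P_{data}}\big[\log P_{data}(x) - \log P_{model}(x)\big]$$

Figure 3: the KL Divergence between the data distribution and the model distribution.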

where P_data is the distribution of your training set and P_model is the model we are trying to train. In the best case, where the two distributions are exactly the same, the KL Divergence is zero, and our goal when training the neural net is to minimize it. As we cannot change the logarithm of P_data, the only thing we can modify is P_model, so we try to minimize the negative log probability (likelihood) under our model, which is exactly the well-known Cross Entropy:
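$$-\,\mathbb{E}_{x \sim P_{data}}\big[\log P_{model}(x)\big]$$

Figure 4: the cross entropy: the negative log-likelihood of the model under the data distribution, which is the only part of the KL Divergence we can actually minimize.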

Okay! I explained all of this to get to the main point: how on earth can MSE be the same as this formula? There is no log in MSE! What’s going on here? Go ahead to the next section to see how.
Showing how MSE relates to cross entropy
Imagine we want to do a simple linear regression where we predict y from the input variable x using our model parameters θ. As the previous sentence suggests, this is actually a conditional probability, the probability of y given x:
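$$p\big(y \mid x; \theta\big)$$

Figure 5: the conditional probability of y given x, parameterized by θ.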

Here is the interesting part. As I mentioned before, when building a neural net we are trying to find a distribution (by finding its parameters) that best describes the distribution of the training set. When solving this particular problem of linear regression, we can make an assumption about the distribution we want to find. Specifically, we assume that this distribution is a Gaussian (a normal, or the bell curve) and, after limiting ourselves to this family, we then try to find its parameters, which as you know are its mean and variance. In fact, we only look for the best mean and choose a constant variance:
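$$p\big(y \mid x; \theta\big) = \mathcal{N}\big(y;\ \hat{y}(x, w),\ \sigma^2\big)$$

Figure 6: the model distribution: a Gaussian whose mean is predicted by the network and whose variance is fixed.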

where the big and beautiful N shows that this is a Gaussian distribution and ŷ (pronounced y hat) gives our prediction of the mean by taking in the input variable x and the weights w (which we will learn during training); as you see, the variance is constant and equal to σ². If you do not already know it (which is completely okay!), this is the Gaussian distribution formula:
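$$\mathcal{N}\big(x;\ \mu, \sigma^2\big) = \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Figure 7: the Gaussian (normal) probability density function.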

where μ is the mean and σ² is the variance.
Now that we have our P_model, we can easily optimize it using the Maximum Likelihood Estimation I explained earlier:
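$$\theta_{ML} = \underset{\theta}{\arg\max}\; \sum_{i=1}^{m} \log p\big(y^{(i)} \mid x^{(i)}; \theta\big)$$

Figure 8: maximum (conditional) likelihood estimation for our supervised regression problem.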

Compare this to Figure 2 or Figure 4 to see that it is the exact same thing, except that here the probability is conditioned on the input, since this is a supervised problem. So, here we are actually using Cross Entropy!
We know that the conditional probability in Figure 8 is equal to the Gaussian distribution whose mean we want to learn. So, we can replace the conditional probability with the formula in Figure 7, take its natural logarithm, and then sum over the obtained expression. The cool thing happens here, all because of the neat properties of logarithms. Remember that products turn into sums and divisions (which are really products) turn into differences. Also, the (natural) logarithm of the exponential (exp) is exactly the expression inside it. I highly recommend that, before looking at the next figure, you try this yourself: take the logarithm of the expression in Figure 7 and compare it with Figure 9 (you need to replace μ and x⁽ⁱ⁾ in Figure 7 with the appropriate variables):
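$$\sum_{i=1}^{m} \log p\big(y^{(i)} \mid x^{(i)}; \theta\big) = -\,m \log \sigma \;-\; \frac{m}{2}\,\log(2\pi) \;-\; \sum_{i=1}^{m} \frac{\big(\hat{y}^{(i)} - y^{(i)}\big)^2}{2\sigma^2}$$

Figure 9: the log-likelihood after substituting the Gaussian formula; the sum of squared errors (the MSE, up to constants) appears on the right.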

This is what you get if you take the logarithm and replace those variables. I hope you noticed the interesting thing that just happened in this step: the emergence of the MSE (Mean Squared Error) on the right-hand side of Figure 9! Isn’t that cool? Of course, there are some other terms in the figure, but they do not matter much, because they are all constants and will not be learned. So, when you minimize the MSE (which is what we actually do in regression), you are actually maximizing this whole expression, i.e. maximizing the log likelihood!
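If you like to see this relationship in numbers rather than symbols, here is a minimal sketch in Python (the toy data, the candidate slopes, and the value of σ are made up purely for illustration). It checks that, for a fixed σ, the Gaussian negative log-likelihood of a linear model’s predictions equals the MSE scaled by m/(2σ²) plus a constant, so both are minimized by the same parameters:

```python
import numpy as np

# Toy regression data: y = 3*x + Gaussian noise (values chosen only for illustration).
rng = np.random.default_rng(0)
m = 100
x = rng.uniform(-1, 1, size=m)
y = 3.0 * x + rng.normal(0, 0.5, size=m)

sigma = 0.5  # the constant variance we assume for p(y | x)

def mse(w):
    """Mean squared error of the linear model y_hat = w * x."""
    y_hat = w * x
    return np.mean((y_hat - y) ** 2)

def gaussian_nll(w):
    """Negative log-likelihood under a Gaussian with mean w*x and fixed variance sigma^2."""
    y_hat = w * x
    return (m * np.log(sigma)
            + 0.5 * m * np.log(2 * np.pi)
            + np.sum((y_hat - y) ** 2) / (2 * sigma ** 2))

# The NLL should always equal m/(2*sigma^2) * MSE + a constant that does not depend on w.
const = m * np.log(sigma) + 0.5 * m * np.log(2 * np.pi)
for w in [1.0, 2.0, 3.0, 4.0]:
    reconstructed = m / (2 * sigma ** 2) * mse(w) + const
    print(f"w={w:.1f}  MSE={mse(w):.4f}  NLL={gaussian_nll(w):.2f}  "
          f"NLL from MSE={reconstructed:.2f}")
```

Running it, the two right-hand columns match exactly for every candidate slope, and both the MSE and the NLL are smallest at the true slope: the two losses only differ by a rescaling and a constant shift.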
I personally found it amazing that when we are using the MSE, we are actually using Cross Entropy, under the important assumption that our target distribution is Gaussian.
Why is this important?
You may ask why this is important to know. We just aim to solve the linear regression problem, so why bother learning these things?
I believe it’s important because it makes explicit all the assumptions we are making when using a neural net to solve a problem. It helps us understand the behavior of our model and the mistakes it makes. For example, the MSE loss is used in a task named Super Resolution, in which (as the name suggests) we try to increase the resolution of a small image as well as possible to get a visually appealing result. If we use the MSE loss alone, the final image will be really blurry and not appealing. Knowing that MSE is a kind of Cross Entropy where we assume that our target distribution is Gaussian, we can easily see why our model produces these blurry images: it assumes that pixel values are normally distributed and always picks the values which have the highest probability, i.e. the values from the middle of the bell curve. So, it rarely uses the values which would make the image sharp and appealing, because they are far from the middle of the bell curve and have really low probabilities. Our model becomes conservative in the sense that whenever it doubts which value it should pick, it picks the most probable one, which makes the image blurry!
Final Words
I hope this article has given you a good understanding of some of the theory behind deep learning and neural nets. As I mentioned, I’m actually a medical student and I am self-studying this math (which I love!). Therefore, if there are any mistakes I’m making, I will be really glad to know about and fix them; so, please feel free to leave a comment below to let me know. I would also be really happy to hear from you and know whether this article has helped you. So, again, please share your comments, suggestions, etc. in the comments. Good luck!