
Nowadays we can learn about domains that were once reserved for academic communities. From Artificial Intelligence to Quantum Physics, we can browse an enormous amount of information available on the Internet and benefit from it.
However, this availability has some drawbacks. We need to watch out for a huge number of unverified sources full of factual errors (a topic for a whole different discussion). What’s more, we can get used to getting answers easily by googling them. As a result, we often take those answers for granted and use them without really understanding them.
The process of discovering things on our own is an important part of learning. Let’s take part in such an experiment and calculate the derivatives behind the Gradient Descent algorithm for Linear Regression.
A little bit of introduction
Linear Regression is a statistical method that can be used to model the relationship between variables [1, 2]. It’s described by a line equation:
f(x) = Θ₀ + Θ₁x
We have two parameters, Θ₀ and Θ₁, and an independent variable x. Given a set of data points, we can find the optimal parameters that fit the line to our data set.

Ok, now the Gradient Descent [2, 3]. It is an iterative algorithm that is widely used in Machine Learning (in many different flavors). We can use it to automatically find optimal parameters of our line.
To do this, we need to optimize an objective function defined by this formula:
Q(Θ₀, Θ₁) = Σⱼ (f(xʲ) − yʲ)² = Σⱼ (Θ₀ + Θ₁xʲ − yʲ)²
In this function, we iterate over each point (xʲ, yʲ) from our data set. Then we calculate the value of the function f for xʲ and the current theta parameters (Θ₀, Θ₁). We take the result, subtract yʲ, square it, and add it to the sum.
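As a quick illustration, the objective function can be written in a few lines of Python. This is a minimal sketch; the function names `f` and `objective` and the plain-list data format are my own choices, not part of the original formulas.

```python
def f(x, theta0, theta1):
    """The line: f(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def objective(xs, ys, theta0, theta1):
    """Sum of squared errors over all data points (x_j, y_j)."""
    return sum((f(x, theta0, theta1) - y) ** 2 for x, y in zip(xs, ys))

# Points lying exactly on y = 1 + 2x give a zero objective:
print(objective([0, 1, 2], [1, 3, 5], theta0=1, theta1=2))  # → 0
```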
Then in the Gradient Descent formula (which updates Θ₀ and Θ₁ in each iteration), we can find these mysterious derivatives on the right-hand side of the equations:
Θ₀ := Θ₀ − α · ∂Q/∂Θ₀
Θ₁ := Θ₁ − α · ∂Q/∂Θ₁
(α is the learning rate, which controls the step size)
These are derivatives of the objective function Q(Θ). There are two parameters, so we need to calculate two derivatives, one for each Θ. Let’s move on and calculate them in 3 simple steps.
Step 1. Chain Rule
Our objective function is a composite function. We can think of it as having an "outer" function and an "inner" function [1]. To calculate the derivative of a composite function, we follow the chain rule:
(g(h(x)))′ = g′(h(x)) · h′(x)
In our case, the "outer" part raises everything inside the brackets (the "inner" function) to the second power. According to the rule, we need to multiply the derivative of the "outer" function by the derivative of the "inner" function. It looks like this:
∂Q/∂Θ = Σⱼ (u²)′ · ∂u/∂Θ,  where u = Θ₀ + Θ₁xʲ − yʲ
Step 2. Power Rule
The next step is calculating the derivative of a power function [1]. Let’s recall the power rule formula:
(xⁿ)′ = n · xⁿ⁻¹
Our "outer" function is simply an expression raised to the second power. So we put 2 in front of the whole formula and leave the rest as is (2 − 1 = 1, and an expression raised to the first power is simply that expression).
After the second step we have:
∂Q/∂Θ = Σⱼ 2(Θ₀ + Θ₁xʲ − yʲ) · ∂/∂Θ (Θ₀ + Θ₁xʲ − yʲ)
We still need to calculate the derivative of the "inner" function (the right side of the formula). Let’s move to the third step.
Step 3. The derivative of a constant
The last rule is the simplest one. It is used to determine a derivative of a constant:
(c)′ = 0, for any constant c
Since a constant means no change, the derivative of a constant is equal to zero [1]. For example, if f(x) = 4, then f′(x) = 0.
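The three rules can be sanity-checked numerically with a small Python sketch. This is my own illustration: the finite-difference helper and the step size `h` are arbitrary choices, not part of the article's method.

```python
def numerical_derivative(g, x, h=1e-6):
    """Approximate g'(x) with a central finite difference."""
    return (g(x + h) - g(x - h)) / (2 * h)

# Chain + power rule: d/dx (3x - 5)^2 = 2(3x - 5) * 3; at x = 2 that's 6
g = lambda x: (3 * x - 5) ** 2
print(round(numerical_derivative(g, 2.0), 4))  # → 6.0

# Derivative of a constant: d/dx 4 = 0
c = lambda x: 4
print(round(numerical_derivative(c, 2.0), 4))  # → 0.0
```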
Having all three rules in mind, let’s break the "inner" function down:
Θ₀ + Θ₁xʲ − yʲ
The tricky part of our Gradient Descent objective function is that x is not a variable here. x and y are constants that come from the data set points. Since we are looking for the optimal parameters of our line, Θ₀ and Θ₁ are the variables. That’s why we calculate two derivatives: one with respect to Θ₀ and one with respect to Θ₁.
Let’s start by calculating the derivative with respect to Θ₀. It means that Θ₁ will be treated as a constant.
∂/∂Θ₀ (Θ₀ + Θ₁xʲ − yʲ) = 1 · Θ₀⁰ + 0 − 0 = 1
You can see that the constant parts were set to zero. What happened to Θ₀? As it’s a variable raised to the first power (a¹ = a), we applied the power rule, which left Θ₀ raised to the power of zero. Any number raised to the power of zero is equal to 1 (a⁰ = 1). And that’s it! The derivative with respect to Θ₀ is equal to 1.
Finally, we have the whole derivative with respect to Θ₀:
∂Q/∂Θ₀ = Σⱼ 2(Θ₀ + Θ₁xʲ − yʲ)
Now it’s time to calculate the derivative with respect to Θ₁. This means that we treat Θ₀ as a constant.
∂/∂Θ₁ (Θ₀ + Θ₁xʲ − yʲ) = 0 + xʲ · 1 · Θ₁⁰ − 0 = xʲ
By analogy with the previous step, Θ₁ was treated as a variable raised to the first power, and applying the power rule reduced Θ₁ to 1. However, Θ₁ is multiplied by xʲ, so we end up with a derivative equal to xʲ.
The final form of the derivative with respect to Θ₁ looks like this:
∂Q/∂Θ₁ = Σⱼ 2(Θ₀ + Θ₁xʲ − yʲ) · xʲ
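We can cross-check both derivatives numerically with a finite difference, here for a single data point. The helper names and the sample values below are arbitrary choices of mine, used only for the check.

```python
def q_term(theta0, theta1, x, y):
    """One term of the objective: (theta0 + theta1*x - y)^2."""
    return (theta0 + theta1 * x - y) ** 2

def d_theta0(theta0, theta1, x, y):
    """Derivative of one term with respect to theta0 (derived above)."""
    return 2 * (theta0 + theta1 * x - y)

def d_theta1(theta0, theta1, x, y):
    """Derivative of one term with respect to theta1 (derived above)."""
    return 2 * (theta0 + theta1 * x - y) * x

x, y, t0, t1, h = 3.0, 7.0, 0.5, 1.5, 1e-6
num0 = (q_term(t0 + h, t1, x, y) - q_term(t0 - h, t1, x, y)) / (2 * h)
num1 = (q_term(t0, t1 + h, x, y) - q_term(t0, t1 - h, x, y)) / (2 * h)
print(round(num0, 4), d_theta0(t0, t1, x, y))  # → -4.0 -4.0
print(round(num1, 4), d_theta1(t0, t1, x, y))  # → -12.0 -12.0
```

The numerical estimates agree with the formulas we derived, which is a good sign we didn’t slip anywhere.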
Complete Gradient Descent recipe
We calculated the derivatives needed by the Gradient Descent algorithm! Let’s put them where they belong:
Θ₀ := Θ₀ − α · Σⱼ 2(Θ₀ + Θ₁xʲ − yʲ)
Θ₁ := Θ₁ − α · Σⱼ 2(Θ₀ + Θ₁xʲ − yʲ) · xʲ
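The complete recipe can be sketched as a short Python loop. This is a toy implementation under my own assumptions: the learning rate `alpha` (the α above) and the iteration count are values I picked for this example, not prescriptions.

```python
def gradient_descent(xs, ys, alpha=0.01, steps=5000):
    """Fit theta0, theta1 by repeatedly stepping against the gradient."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(steps):
        # The two derivatives we calculated, summed over the data set:
        grad0 = sum(2 * (theta0 + theta1 * x - y) for x, y in zip(xs, ys))
        grad1 = sum(2 * (theta0 + theta1 * x - y) * x for x, y in zip(xs, ys))
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Points generated from y = 1 + 2x; the fit should recover ~(1, 2):
t0, t1 = gradient_descent([0, 1, 2, 3], [1, 3, 5, 7])
print(round(t0, 3), round(t1, 3))  # → 1.0 2.0
```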
By doing this exercise, we gain a deeper understanding of the formula’s origins. We don’t take it as a magic incantation found in an old book; instead, we actively go through the process of analyzing it. We break the method down into smaller pieces and realize that we can finish the calculations ourselves and put it all together.
From time to time, grab a pen and paper and solve a problem by hand. Pick an equation or method you already use successfully and try to gain deeper insight by decomposing it. It will give you a lot of satisfaction and spark your creativity.
Bibliography:
1. K.A. Stroud, Dexter J. Booth, Engineering Mathematics, ISBN: 978-0831133276
2. Joel Grus, Data Science from Scratch, 2nd Edition, ISBN: 978-1492041139
3. Josh Patterson, Adam Gibson, Deep Learning, ISBN: 978-1491914250