Beginner: Cost Function and Gradient Descent

Image credit: “Machine Learning and Human Bias”, Google. https://www.youtube.com/watch?v=59bMh59JQDo

There are two ways to tell a story: one is the hard way, where you are expected to meet the standards of the speaker or writer, and the other is the one where the writer or speaker makes sure the audience understands the story, no matter how naive it may sound. Here I have tried to stay in the latter category, while giving the reader an intuitive understanding of the cost function and of how gradient descent reduces it.

Note: I am assuming that the reader is familiar with the 2-D and 3-D coordinate planes.

Introduction

Let’s say you are given three points (1,1), (2,2), (3,3) that can be represented on a 2-D plane, meaning they can be plotted on the x-y plane as shown in the picture below.

(a) The three data points plotted on the x-y plane.

Now, somebody asks you to fit a line as close as possible to all the points already available to you. A fitted line means a line that passes as closely as it possibly can to all the points in the 2-D plane. The equation that such a line follows looks something like this:

h(x) = θ0 + θ1·x

Here θ0 is the intercept of the line, and θ1 is the slope of the line. The intercept is the value where the line crosses the y-axis, and the slope indicates how much a one-unit change in x changes the value of y.
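A minimal sketch of that equation in Python (the function name hypothesis and the use of NumPy are my own illustrative choices, not from the article):

```python
import numpy as np

def hypothesis(x, theta0, theta1):
    """Predicted y for input x, given intercept theta0 and slope theta1."""
    return theta0 + theta1 * x

# The middle line described below: intercept 0, slope 1
x = np.array([1.0, 2.0, 3.0])
print(hypothesis(x, 0.0, 1.0))  # -> [1. 2. 3.]
```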

Following that equation, let’s say we came up with three different lines with three different slopes and intercepts, as given below: an upper line, a middle line and a lower line, each with its own θ0 and θ1.

(b)

The values of these graphically shown lines, in tabular form:

Line          θ0 (intercept)   θ1 (slope)
Upper line    1.5              1.25
Middle line   0                1
Lower line    1.25             0.75

It is pretty obvious that the middle line matches all three points that were shown in graph (a), but the upper line and the lower line do not exactly match those three points. What does that mean, then? Let’s see…

Cost Function

The cost function basically measures how far away your predicted line is from the actual points you were already given. In other words, you had some points already given to you; you then predicted some values of θ0 and θ1 and, using those, drew a line on the graph. After doing that you realize the new line doesn’t exactly touch all three data points you already had, so now you calculate how far away the original points are from your predicted line. That is exactly what the cost function computes. The formula is as follows:

J(θ0, θ1) = (1/2m) · Σ from i=1 to m of ( h(x^i) − y^i )²

Let’s break it down and see what that means.

The first term, 1/2m, is a constant, where m is the number of data points we already have (in our case, 3). Then we have a summation sign, which means that for each value of the subscript i we keep adding the result. The term h(x^i) is the output of our hypothesis for the i-th point, in other words the prediction of the line h(x) = θ0 + θ1·x at x^i, and y^i is the value of the data point we already had. Each difference is squared so that positive and negative errors do not cancel out and larger errors are penalized more. The index i simply runs over the data points whose differences we are adding up. To make the picture clearer, let’s look at the examples:
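Here is a minimal Python sketch of that formula (the helper name cost is an illustrative choice):

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """J(theta0, theta1) = (1/2m) * sum over i of (h(x^i) - y^i)^2."""
    m = len(x)                           # number of data points, 3 in our example
    predictions = theta0 + theta1 * x    # h(x^i) for every point
    return np.sum((predictions - y) ** 2) / (2 * m)

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
print(cost(0.0, 1.0, x, y))  # middle line -> 0.0
```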

Middle line

The original points are (1,1), (2,2), (3,3).

θ0 and θ1 are predicted to be 0 and 1 respectively. Using the hypothesis equation we drew a line and now want to calculate the cost. The line we drew passes through the exact same points as we were already given. So our hypothesis values h(x) are 1, 2, 3 and the values of y^i are also 1, 2, 3.
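Plugging these values into the cost function:

$$J(0,\,1) = \frac{1}{2 \cdot 3}\Big[(1-1)^2 + (2-2)^2 + (3-3)^2\Big] = 0$$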

We can see that the cost function gives us zero for the middle line. The red line below is our hypothesized line and the black dots are the points we had.

Upper Line

θ0 and θ1 are predicted to be 1.5 and 1.25 respectively. Meaning that the intercept is 1.5 on the y-axis, and for each unit change in x the hypothesis h(x) changes by 1.25 on the y-axis.

With that, we calculate our h(x) values at x = 1, 2 and 3 as 2.75, 4 and 5.25. And y^i (the original data points) remains the same: (1, 2, 3). Using the cost function, we get the following value:
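Plugging these values into the cost function:

$$J(1.5,\,1.25) = \frac{1}{2 \cdot 3}\Big[(2.75-1)^2 + (4-2)^2 + (5.25-3)^2\Big] = \frac{12.125}{6} \approx 2.02$$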

Lower Line

The values of θ0 and θ1 for the lower line are 1.25 and 0.75 respectively. That means the line intercepts the y-axis at 1.25, and for each unit change in the value of x the hypothesis h(x) changes by 0.75.

We already know that the values of the original points y are (1, 2 and 3), and the values of our predicted points h(x) at x = 1, 2 and 3 are 2, 2.75 and 3.5. Now, using the cost function, we can calculate the cost as shown below.
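Plugging these values into the cost function:

$$J(1.25,\,0.75) = \frac{1}{2 \cdot 3}\Big[(2-1)^2 + (2.75-2)^2 + (3.5-3)^2\Big] = \frac{1.8125}{6} \approx 0.30$$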

Reducing the Cost Function

Remember the values of θ0 and θ1 that were predicted above: since they were just guesses, only the middle one was perfect. In a real-life scenario, finding a perfect value of θ0 and θ1 by guessing is next to impossible. However, you do have the power to manipulate the values of θ0 and θ1 so that, for any given set of points (x1, x2, x3, …, xn), you find the line with the lowest value of the cost function.

Let’s come up with a different set of values: θ0 = 0 and θ1 = 1.42. The line this creates will look something like the one shown below.

But, since you need to reduce your cost, you need to create a line that fits those 3 points. Something like this:

So you use a cost-reduction method called gradient descent. Keep in mind that by gradually adjusting the values of θ0 and θ1 you move from that red line up there to the black line below. See below for a better understanding.

Since we know that changing the values of θ0 and θ1 changes the orientation of the line, to reach a line that fits those three points as closely as possible we adjust all the thetas (in our case just two) bit by bit, in such a way that we reach the minimum value of the cost function. Gradient descent does this with the following update rule:

θj := θj − α · (∂/∂θj) J(θ0, θ1)

The term alpha (α) is the learning rate: it controls the magnitude of the adjustment you make on each step. θj here represents each individual theta you have in your solution, so you run this update for all the thetas, which in our case are two, but could be three, four or ten depending on the problem at hand.

So, the top line in the picture above had certain values of θ0 and θ1. Using the update rule, you adjust all the thetas in your equation by a step scaled by α, and your predicted line moves a bit closer to the data. Then you run the rule again, look at the new line, calculate the cost, and get ready for the next iteration. Then you adjust the thetas once more, look at the line again and calculate the cost. You keep doing that until your line reaches the point where the cost is minimum, which in our case is the line through the values 1, 2, 3: a perfect line with cost zero against the original data points. The breakdown of the formula is shown in the picture below.
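A rough sketch of these repeated updates in Python, assuming a learning rate of 0.1 and a fixed number of iterations (both are illustrative choices, not values from the article):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iterations=1000):
    """Fit theta0 and theta1 by repeatedly stepping against the gradient of the cost."""
    m = len(x)
    theta0, theta1 = 0.0, 1.42               # starting guesses (the red line above)
    for _ in range(iterations):
        predictions = theta0 + theta1 * x    # h(x^i) for every point
        errors = predictions - y             # h(x^i) - y^i
        grad0 = np.sum(errors) / m           # partial derivative of J w.r.t. theta0
        grad1 = np.sum(errors * x) / m       # partial derivative of J w.r.t. theta1
        theta0 -= alpha * grad0              # update both thetas together
        theta1 -= alpha * grad1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
print(gradient_descent(x, y))  # approaches (0.0, 1.0), the perfect middle line
```

With each pass through the loop the cost shrinks, which is exactly the iteration described above.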

Summary:

1- You have some data points.

2- Using them, you pick some starting values for the thetas and draw a line using the hypothesis equation.

3- You calculate the cost using the cost function, which measures how far the line you drew is from the original data points.

4- You see that the cost function gives you some value that you would like to reduce.

5- Using gradient descent, you adjust the values of the thetas, with the step size controlled by alpha.

6- With the new set of theta values, you calculate the cost again.

7- You keep repeating step 5 and step 6, one after the other, until you reach the minimum value of the cost function (see the sketch after this list).
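Putting the seven steps together in one self-contained Python sketch (the learning rate and iteration count are again assumed for illustration):

```python
import numpy as np

def cost(theta0, theta1, x, y):
    # Step 3: J = (1/2m) * sum over i of (h(x^i) - y^i)^2
    return np.sum((theta0 + theta1 * x - y) ** 2) / (2 * len(x))

x = np.array([1.0, 2.0, 3.0])      # step 1: the data points we have
y = np.array([1.0, 2.0, 3.0])
theta0, theta1 = 0.0, 1.42         # step 2: a starting guess for the thetas
alpha = 0.1                        # learning rate

print("cost before:", cost(theta0, theta1, x, y))  # step 4: a cost we want to reduce

for _ in range(1000):              # steps 5-7: repeat the updates until the cost settles
    errors = theta0 + theta1 * x - y
    theta0 -= alpha * np.sum(errors) / len(x)
    theta1 -= alpha * np.sum(errors * x) / len(x)

print("cost after:", cost(theta0, theta1, x, y))   # close to zero for the fitted line
```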
