Today, we will delve into three crucial concepts in Machine Learning: Linear Regression, Cost Function, and Gradient Descent. These concepts form the foundation of many machine learning algorithms. I initially decided against writing an article on these topics because they are so widely covered. However, I changed my mind because understanding them is essential for understanding more advanced topics like Neural Networks (which I plan to tackle in the near future). To keep things manageable and organized, this series is divided into two parts.
So make yourself comfortable, grab a cup of coffee, and get ready to embark on a magical journey of machine learning.
As with any machine learning problem, we begin with a specific question we want to answer. In this case, our friend Mark is considering selling his 2400 feet² house and has come to us for assistance in determining the most appropriate price to list it at.

Intuitively, we start by looking for comparable houses in our friend’s neighborhood. After a little digging, we find a list of three nearby houses and see how much they sold for. Of course, a typical dataset would have thousands or even tens of thousands of data points, but we’ll keep it simple with just these three houses.

Let’s plot this data:

By examining the data, the price of a house appears to be related to its size in a linear fashion. To model this relationship, we can use an ML technique called Linear Regression. This involves drawing a line on a scatter plot that best represents the pattern of the data points. Our model might look like this:

Now using this line, we can say that a house that’s 2400 feet² should sell for…

…~$260,000. And boom. That’s the answer.
Now the big question: how do we determine the best-fitting line for our data?
I could’ve drawn a line that’s a little off like this:

Or, even worse, like this:

And we can clearly see that they don’t fit our data nearly as well as our first line does.
To figure out the best line, the first thing we need to do is mathematically represent what a bad line looks like. So let’s take this "bad" line: according to it, a 2000 feet² house should sell for ~$140,000, whereas we know it actually sold for $300,000:

It is also significantly different from all the other values:

On average, this line is ~$94,000 off (($50,000 + $160,000 + $72,000) / 3).
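That averaging step can be checked in a few lines of Python, using the three per-house gaps quoted above:

```python
# Gaps between the bad line's predictions and the actual sale
# prices for the three houses, as quoted above.
errors = [50_000, 160_000, 72_000]

# Average of the gaps: how far off the line is, on average.
average_error = sum(errors) / len(errors)
print(f"${average_error:,.0f}")  # -> $94,000
```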
Here’s a better line:

This line is an average of ~$44,000 off, which is much better. This $44,000 is called the cost of using this line. The cost measures how far off the line is from the real data: the best line is the one that deviates from the real data the least, i.e., the one with the lowest cost. To find that line, we need a cost function.
Cost Function
Above, we used the Mean Absolute Error (MAE) cost function to measure how far the actual house prices deviate from the predicted prices. It calculates the average of how far off the actual house prices (denoted y, as they sit on the y-axis) are from the predicted house prices (denoted ŷ). We represent MAE mathematically like this:

NOTE: Absolute values are used in the calculation of MAE because they ensure that the difference between predicted and actual values is always positive, regardless of whether the prediction is high or low. This allows for a fair comparison of error across different predictions, as positive and negative differences would cancel out if not taken absolute.
Depending on the ML algorithm and problem at hand, there are various types of cost functions that can be employed. For our problem, instead of using the MAE, we will employ a commonly used method, the Mean Squared Error (MSE), which calculates the average of the squares of the difference between the predicted house price and the actual house price.
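As a sketch, both cost functions can be written in a few lines of Python (the function names and the toy inputs here are my own, not from the article):

```python
def mae(actual, predicted):
    """Mean Absolute Error: average of |y - y_hat| over all points."""
    return sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean Squared Error: average of (y - y_hat)^2 over all points."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual)

# Toy example (prices in thousands of dollars): one house predicted
# at $140k that actually sold for $300k.
print(mae([300], [140]))  # -> 160.0
print(mse([300], [140]))  # -> 25600.0
```

Note how squaring makes the MSE punish large errors much more heavily than the MAE does, which is one reason it is so commonly used.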

Ultimately, the goal with any cost function is to minimize its value, i.e., to reduce the cost as much as possible.
Equation of the Line
Before diving deeper into linear regression, let’s take a step back and review the basics. Here’s an example of a line: y = 1 + 2x
The first number, called the intercept, tells us how high the line should be at the start.

And the second one tells us the angle (or, in technical terms, the slope) of the line:
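A quick way to see what these two numbers do is to evaluate y = 1 + 2x at a few points (a tiny sketch of my own, not from the article):

```python
intercept, slope = 1, 2  # the line y = 1 + 2x

for x in [0, 1, 2, 3]:
    y = intercept + slope * x
    print(x, y)  # the line starts at height 1 and rises by 2 per unit of x
```

The output (1, 3, 5, 7) shows the intercept setting the starting height at x = 0 and the slope adding 2 for every unit step in x.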

Now that we understand how the equation works, we just need to determine the optimal values of these two parameters – the slope and the intercept – to get the best-fitting line for our linear regression problem. To make things even simpler, let’s assume that we somehow magically already have the value of the slope, 0.069.
So the equation of our linear regression line is:

To get the predicted price of any house of a certain size, all we need to do is plug in the values of the intercept and desired house size. For instance, for a house of size 1000 feet² with intercept 0…

…we get a predicted house price of $69,000. So all we need to do now to get our linear regression model is to find the optimal value for the intercept.
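The prediction step can be sketched as a one-line function (prices in thousands of dollars; the function name is mine, not the article's):

```python
def predict_price(size_sqft, intercept, slope=0.069):
    """Predicted price, in thousands of dollars, for a house of a given size."""
    return intercept + slope * size_sqft

# A 1000 ft² house with intercept 0:
print(predict_price(1000, intercept=0))  # -> 69.0, i.e. a predicted price of $69,000
```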
One option (which we will soon find to be quite tedious and not very fun) is to use brute force, where we repeatedly guess the value of the intercept, draw a LR line, and calculate the MSE. Just for the sake of experimentation, let’s try this approach for a moment.
Start by guessing a random value of the intercept (let’s start with 0) and plotting the LR line:

Then we calculate the MSE of this line:

To gain a visual understanding, let’s plot the intercept value and the corresponding MSE on a graph:

Next, we’ll test another value for the intercept (let’s say 25), plot the corresponding line, and calculate the MSE.

We can continue this process with different values of the intercept (= 0, 25, 50, 75, 100, 125, 150, and 175) until we end up with a graph that looks like this:

From the points plotted on the graph, we can see that the MSE is lowest when the intercept is set to 100. However, there may be another intercept value near 100 that results in an even lower MSE. A slow and painful way to find the minimal MSE is to plug and chug a bunch more intercept values, as shown below:
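The plug-and-chug approach can be sketched as a small grid search. The three data points below are hypothetical stand-ins (the article's exact dataset isn't given), with prices in thousands of dollars and the slope fixed at 0.069 as above:

```python
# Hypothetical stand-in dataset: sizes in ft², prices in thousands of dollars.
sizes = [1000, 2000, 3000]
prices = [150, 300, 295]

SLOPE = 0.069  # the "magically known" slope from earlier

def mse(intercept):
    """Mean squared error of the line y = intercept + SLOPE * x on the data."""
    residuals = [p - (intercept + SLOPE * s) for s, p in zip(sizes, prices)]
    return sum(r ** 2 for r in residuals) / len(residuals)

# Brute force: try each candidate intercept and keep the one with the lowest MSE.
candidates = [0, 25, 50, 75, 100, 125, 150, 175]
best = min(candidates, key=mse)
print(best)
```

With this toy data the grid minimum lands at intercept = 100, echoing the article's plot – but, exactly as noted above, nothing guarantees that some intercept between the grid points isn't even better, which is what motivates gradient descent.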

Despite our efforts, we cannot be certain that we have found the lowest possible MSE value. The process of testing multiple intercept values is both tedious and inefficient. Fortunately, gradient descent can help solve this problem by finding the optimal solution in a more efficient and effective way. And this is exactly what we will explore in the second part of this series!
You can connect with me on LinkedIn or email me at shreya.statistics@gmail.com to send me questions and suggestions for any other algorithms that you want illustrated!