
Just like Mr. Miyagi taught young Daniel LaRusso karate through repetitive simple chores, which ultimately transformed him into the Karate Kid, mastering foundational algorithms like linear regression lays the groundwork for understanding the most complex of AI architectures such as Deep Neural Networks and LLMs.
Through this deep dive into the simple yet powerful linear regression, you will learn many of the fundamental parts that make up the most advanced models built today by billion-dollar companies.
What is Linear Regression?
Linear regression is a simple mathematical method used to understand the linear relationship between a dependent variable (what you want to predict) and a single or group of independent variables (the factors you believe affect the value of the dependent variable).
For example, if you wanted to predict the price of a diamond, you might use the number of carats as the main factor. The dependent variable (price) depends on one or more independent variables (carats).
The linear relationship part means that as your factor(s) change, your dependent variable changes at a consistent rate (the slope, m), as determined by the straight-line formula **y = mx + c**.
Suppose we see five diamond prices on the market:
- 2 carat diamond for $2k
- 4 carat diamond for $4k
- 6 carat diamond for $4k
- 8 carat diamond for $4k
- 8 carat diamond for $5k
(I know, the market is a bit messed up! This inconsistency happens often in real-world data, and regression helps us find an overall trend.)
If we plot this data on a graph, we can use a linear regression model to draw a "line of best fit" (we will later see how it's calculated). This line minimises the overall difference between the actual data points and the values it predicts, and it helps us estimate new values – for instance, the price of a 9-carat diamond.

In this example, the line of best fit is represented by the equation: Price = 0.5 × (Carats) + 1
Using this equation, the estimated price of a 9-carat diamond is: Price = 0.5 × 9 + 1 = 5.5 (thousands of dollars).
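The same arithmetic in a couple of lines of Python, using the slope and intercept from the line of best fit above:

```python
# Line of best fit from the diamond example: Price = 0.5 * Carats + 1
carats = 9
price = 0.5 * carats + 1
print(price)  # 5.5 (thousands of dollars)
```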
Keep in mind that in real-life situations, additional factors such as clarity, cut, and colour could also affect the price. These factors can be added as independent variables in the regression model for more accurate predictions.
The Linear Regression Formula
Y = β0 + β1X1 + β2X2 + … + βnXn
This formula is the same as what you saw before in the diamonds example, just a bit fancier. In our example, since the number of independent variables is one (n=1), the formula we used is **Y = β0 + β1X1**, where Y is the price, β0 = 1, β1 = 0.5, and X1 is the carats.
In another case, if we had three independent variables (n=3), we would use: **Y = β0 + β1X1 + β2X2 + β3X3**
Note that despite the added independent variables, the formula still resembles a line. The only difference is that we add dimensions to the space in which this line lives. When we plotted the line of best fit earlier, it was a 2D graph because we only had one independent variable (carats). If we had two independent variables (e.g., carats and clarity), the graph would be in 3D. With three or more independent variables, the ‘line’ lives in an even higher-dimensional space (which we can’t easily visualise but can still calculate).
It’s important to notice that Y refers to the actual, observed value (like the real diamond prices in the market). However, when you’re using the model to make predictions, the output is Ŷ (read as "y hat"). This distinction reminds us that Ŷ is only an estimate – it’s what the model thinks the value should be, not necessarily the true value.
The parameters (β) are the values that the model adjusts (or learns) during training to capture the relationship between the independent variables X and the dependent variable Y. We will now go through the process to understand how these values are calculated, which is the same process neural networks and large language models use to "learn".
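To make the role of the parameters concrete, here is a minimal Python sketch of how a prediction is computed once the βs are known. The coefficient and feature values below are made up purely for illustration (pretending we had carat, clarity and cut scores); in a real model the βs are learned during training, as described next.

```python
# Ŷ = β0 + β1*X1 + β2*X2 + ... + βn*Xn
def predict(betas, features):
    # betas[0] is the intercept (β0); the remaining betas pair with each feature
    y_hat = betas[0]
    for beta, x in zip(betas[1:], features):
        y_hat += beta * x
    return y_hat

# Hypothetical parameters and features (e.g. carats, clarity score, cut score)
betas = [1.0, 0.5, 0.2, 0.1]
features = [9, 3, 2]
print(predict(betas, features))  # Ŷ ≈ 6.3 under these made-up numbers
```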
Parameter Learning
We require a few things to be able to adjust the parameters and achieve accurate predictions.
- Training Data – this data consists of input and output pairs. The inputs are fed into the model and, during training, the parameters are adjusted in an attempt to output the target (true) value. We already saw this earlier: those 5 diamond prices were our training data.
- Cost function – also known as the loss function, this is a mathematical function that measures how well a model’s prediction matches the target value.
- Training Algorithm – a method used to adjust the parameters of the model to minimise the error as measured by the cost function.
Let’s go over a cost function and training algorithm that can be used in linear regression.
Cost Function: Mean Squared Error (MSE)
Mean Squared Error is a widely used cost function in regression tasks, where the goal is to predict a continuous number. It measures how far the predictions are from the actual values by calculating the average of the squared differences between them. The goal is to make this difference as small as possible. When we work to reduce the value of the cost function, we call this value the "loss." This is how it's calculated:
MSE = (1/n) × Σ (Ŷ − Y)²
- Calculate the difference between the predicted value, Ŷ, and the target value, Y.
- Square this difference – ensuring all errors are positive and also penalising large errors more heavily.
- Sum the squared differences for all data samples.
- Divide the sum by the number of samples, n, to get the average squared error.
Later we will see a more practical example, but consider this one for now. We have three samples in our training data and we want to calculate the MSE.

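As a short Python sketch of that calculation, here are three made-up samples (the target and prediction values are illustrative only):

```python
# Mean Squared Error over three hypothetical samples
targets = [3.0, 5.0, 7.0]      # Y: the true values
predictions = [2.5, 5.5, 8.0]  # Ŷ: what the model predicted

squared_errors = [(y_hat - y) ** 2 for y_hat, y in zip(predictions, targets)]
mse = sum(squared_errors) / len(targets)
print(mse)  # average of 0.25, 0.25 and 1.0 -> 0.5
```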
Good models will have a low MSE, because the differences between the targets and the predictions are small. Poor models, on the other hand, will have a large MSE, because those differences are larger.
Training Algorithm: Gradient Descent
Gradient descent is like hiking down a hill in the dark with the goal of reaching the lowest point – the global minimum – where the cost (error) is smallest.
The cost function is what tells us how good or bad the current model’s predictions are. It gives us a "score" (called loss) that measures how far off our predictions are from the actual target values. Randomly changing the model’s parameters (the things it adjusts to learn) won’t help us consistently improve.
Instead, we use the gradient of the cost function, which tells us the slope or direction of steepest descent. Think of the gradient as a guide pointing us downhill. By following this direction, we can update the parameters to reduce the cost, meaning our predictions get closer to the actual targets.

The size of each step (called the learning rate) is very important:
- If the steps are too big, we might overshoot the lowest point and never find it.
- If the steps are too small, we’ll move very slowly and waste time, or even get stuck on a small dip (a local minimum) instead of finding the lowest point overall.
The key is to take steps that are just the right size so we can efficiently and accurately reach the global minimum.
Gradient Descent Formula
θnew = θold − α × (∂J/∂θ), where α is the learning rate and J is the cost function.
In the diamonds example, θ could be β0 or β1. The gradient is the partial derivative of the cost function with respect to θ, or in simpler terms, it is a measure of how much the cost function changes when the parameter θ is slightly adjusted.
A large gradient indicates that the parameter has a significant effect on the cost function, while a small gradient suggests a minor effect. The sign of the gradient indicates the direction of change for the cost function. A negative gradient means the cost function will decrease as the parameter increases, while a positive gradient means it will increase.
So, in the case of a large negative gradient, what happens to the parameter? The negative sign in front of the learning rate cancels with the negative sign of the gradient, resulting in an addition to the parameter. And since the gradient is large, we add a large amount. The parameter is therefore adjusted substantially, reflecting its greater influence on reducing the cost function.
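Written out in Python, a single update step might look like the sketch below; the gradient and learning-rate values are arbitrary numbers chosen just to show the mechanics:

```python
# One gradient descent update: θ_new = θ_old - learning_rate * gradient
theta = 0.0           # current parameter value
learning_rate = 0.01  # step size (a hyper-parameter)
gradient = -8.0       # a large negative gradient for this parameter

theta = theta - learning_rate * gradient
print(theta)  # 0.08 -> the parameter increased, which reduces the cost
```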
Practical Example
Let’s take a look at the prices of the sponges the Karate Kid used to wash Mr. Miyagi’s car. If we wanted to predict their price based on their width and height, we could model it using linear regression.
We can start with these three training data samples.

Now, let’s use the Mean Square Error (MSE) as our cost function J, and linear regression as our model.
J = (1/n) × Σ ((β0 + β1X1 + β2X2) − Y)²
Notice we have substituted Ŷ in the MSE formula with the linear regression formula, as we will use linear regression to make predictions. The linear regression formula uses X1 and X2 for width and height respectively; there are no other independent variables, since our training data doesn’t include any more. That is the assumption we make in this example: that the width and height of the sponge are enough to predict its price.
Now, the first step is to initialise the parameters, in this case to 0. This gives us some starting values with which to make predictions. They will not be optimal (often they will be pretty bad), but they give us a starting point.
We can then feed the independent variables into the model to get our predictions, Ŷ, and check how far these are from our target Y.

Right now, as you can imagine, the parameters are not very helpful. But we are now ready to use the gradient descent algorithm to update them into more useful ones. First, we need to calculate the partial derivative of the cost function with respect to each parameter, which requires some calculus, but luckily we only need to do this once in the whole process. Recall that the partial derivative for each parameter tells us how the loss changes when we change that parameter slightly.
∂J/∂β0 = (2/n) × Σ (Ŷ − Y)
∂J/∂β1 = (2/n) × Σ (Ŷ − Y) × X1
∂J/∂β2 = (2/n) × Σ (Ŷ − Y) × X2
With the partial derivatives, we can substitute in the values from our errors to calculate the gradient of each parameter.

Notice there wasn’t any need to calculate the MSE itself, as it’s not directly used in the process of updating parameters; only its derivative is. It’s also immediately apparent that all the gradients are negative, meaning that all the parameters can be increased to reduce the loss. The next step is to update the parameters using a learning rate, which is a hyper-parameter, i.e. a configuration setting in a machine learning model that is specified before the training process begins. Unlike model parameters, which are learned during training, hyper-parameters are set manually and control aspects of the learning process. Here we arbitrarily use 0.01.

This has been the final step of our first iteration in the process of gradient descent. We can use these new parameter values to make new predictions and recalculate the MSE of our model.

The new parameters yield predictions closer to the true sponge prices and a much lower MSE, but there is a lot more training left to do. If we iterate through the gradient descent algorithm 50 times, this time using Python instead of doing it by hand – since Mr. Miyagi never said anything about coding – we will reach the following values.

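For readers who want to follow along, here is a sketch of what that Python training loop could look like. The training samples are hypothetical, generated from the true parameters [1, 2, 3] mentioned below (i.e. Price = 1 + 2 × Width + 3 × Height), and the variable names are my own, so the exact numbers will differ from the run described in this article.

```python
# Gradient descent for linear regression: Price ≈ β0 + β1*Width + β2*Height
# Hypothetical training data generated from true parameters [1, 2, 3]
widths  = [1.0, 2.0, 3.0]
heights = [2.0, 1.0, 2.0]
prices  = [1 + 2 * w + 3 * h for w, h in zip(widths, heights)]  # targets Y

b0, b1, b2 = 0.0, 0.0, 0.0  # initialise the parameters to 0
learning_rate = 0.01
n = len(prices)

for step in range(50):  # the number of iterations is another hyper-parameter
    # Predictions Ŷ and errors (Ŷ - Y) for every sample
    preds = [b0 + b1 * w + b2 * h for w, h in zip(widths, heights)]
    errors = [p - y for p, y in zip(preds, prices)]

    # Partial derivatives of the MSE with respect to each parameter
    grad_b0 = (2 / n) * sum(errors)
    grad_b1 = (2 / n) * sum(e * w for e, w in zip(errors, widths))
    grad_b2 = (2 / n) * sum(e * h for e, h in zip(errors, heights))

    # Gradient descent update: θ = θ - learning_rate * gradient
    b0 -= learning_rate * grad_b0
    b1 -= learning_rate * grad_b1
    b2 -= learning_rate * grad_b2

print(b0, b1, b2)  # the parameters move towards the true values over time
```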
Eventually we arrived at a pretty good model. The true values I used to generate those numbers were [1, 2, 3], and after only 50 iterations, the model’s parameters came impressively close. Extending the training to 200 steps (the number of steps is another hyper-parameter) with the same learning rate allowed the linear regression model to converge almost perfectly to the true parameters, demonstrating the power of gradient descent.
Conclusions
Many of the fundamental concepts that make up the complicated martial art of Artificial Intelligence, like cost functions and gradient descent, can be thoroughly understood just by studying the simple "wax on, wax off" tool that linear regression is.
Artificial intelligence is a vast and complex field, built upon many ideas and methods. While there’s much more to explore, mastering these fundamentals is a significant first step. Hopefully, this article has brought you closer to that goal, one "wax on, wax off" at a time.