
Linear Regression – The Basic Building Blocks! (Part-1)

An introduction to the need for, and the essential concepts of, one of the most widely used data science techniques.


The What?

As human beings, we love to find relations between things – consciously or unconsciously, we all do it. Data scientists do it on data collected by big organizations to solve business problems, whereas a common person might do it while walking down a street, mentally calculating how fast to walk to reach the destination by a certain time. Other common examples: budgeting for a shopping trip based on what you need, or suggesting a dress to your friends based on their taste.

The key aspect is that you are predicting (or estimating) something based on a few other explicitly or implicitly known factors. For example, in the above, the length of the street and your timing are factors when you estimate your desired walking speed; similarly, what you need to buy on a shopping trip is a factor that helps you predict your budget, and your friends’ taste is a factor that helps you suggest them a dress. We all, knowingly or unknowingly, look for relationships and predict (or estimate) things. That’s what we do. Day in and day out!

We all estimate or predict things – consciously or unconsciously. We might as well use a combination of art and science to do it well.

Regression is a mathematical/statistical procedure to find out the relationship between a ‘dependent‘ variable and a few ‘independent‘ variables (or factors affecting the dependent variable).

  • The ‘dependent’ variable is also called the outcome variable (often denoted by y)
  • Independent variables are also called predictors, features etc. (often denoted by X)

The word ‘linear‘ in Linear Regression refers to there being a linear relationship between the dependent variable and the set of independent variables.

The Why?

Linear Regression is one of the very first algorithms data scientists/analysts learn, as it is a building block of many more advanced algorithms. It has numerous use cases; some of the common ones are below:

  1. The most common one is to predict the values of a variable using a bunch of independent variables (or factors/predictors).
  2. Understand which factors (i.e. our predictors / independent variables) are significant or insignificant in predicting the dependent variable.
  3. Understand whether factors (i.e. our predictors / independent variables) have a direct or inverse relation with the dependent variable.
  4. Understand the extent of influence of factors on the dependent variable.

The Origin

To understand a Linear Regression model, let’s first understand the need for linear regression and then move on to what we are trying to do.

The need for estimation or prediction

As we now know, we need predictions (or estimates) on a daily basis. Let’s pick an example: predicting the property value in your area. To do this, you collect data for 10 flats in your area and note down their property values. Table-1 below captures this.

Now, if you are asked to provide an estimate for property value in your area, just based on the data you have collected (i.e. just the ‘Property Value’ for now), how would you do it?

→ Yes, you are thinking on the right track. You would take the average, or mean, of the ‘Property Value’ across your actual data points. It comes out to be £1,800,000. So, you would say that the average Property Value in your area is £1.8m, and this would be your prediction for any new property for which you do not have an actual value.
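Since Table-1's actual figures aren't reproduced in the text, the sketch below uses ten hypothetical property values chosen so that they average £1.8m, matching the article's baseline:

```python
# Baseline prediction: the mean of the observed property values.
# These ten values are hypothetical stand-ins for Table-1,
# chosen so that their mean comes out to £1.8m.
property_values = [1.2e6, 1.4e6, 1.5e6, 1.6e6, 1.7e6,
                   1.9e6, 2.0e6, 2.1e6, 2.2e6, 2.4e6]

mean_value = sum(property_values) / len(property_values)
print(f"Baseline prediction: £{mean_value:,.0f}")  # Baseline prediction: £1,800,000
```

This single number would then be reported as the estimate for any new flat in the area.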

This is correct, although it does not take into account any other factor – for example, the square footage (or area) of the flats. The ‘Property Value’ surely varies based on how big the flat is. But in this case, based on just one variable (i.e. ‘Property Value’), your best estimate is £1.8m, as seen below.

Actual vs. Predicted – Deviations: Residuals or Error terms

Please do note that if we were to give this estimate of £1.8m for the flats for which we have actual data, we would be off from the actual property values, as depicted in chart-2 below. That is, there is a deviation between our predicted (or estimated) property value and the actual property value (the predicted value in this case being the average Property Value).

The best estimates are the ones that are closest to reality. This implies that the purpose of a good regression model is to minimize this deviation, or variability, between the predicted and the actual values.

Now we know that the values we predict for the dependent variable will always differ from the actual values in our data, and understanding this deviation between predicted and actual values helps us judge how good our prediction is.

Going back to our example, our prediction for ‘Property Value’ is the average value of £1.8m (as we are basing it on just one variable, ‘Property Value’, for now). If you look at chart-2, this deviation is represented by the yellow lines.

This deviation seen between actual vs. predicted values is known as the error terms or residuals,

i.e. Actual – Predicted = Error terms OR Residuals
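The residual definition above can be sketched in a few lines of Python (the values here are hypothetical, and the prediction is the flat mean, as in the article):

```python
# Residuals (error terms): actual minus predicted, per data point.
# Hypothetical actual property values; predicted = mean, as in Model-1.
actuals = [1.2e6, 1.5e6, 1.7e6, 2.0e6, 2.6e6]
predicted = sum(actuals) / len(actuals)  # £1.8m for these numbers

residuals = [y - predicted for y in actuals]
# Negative where we over-predict, positive where we under-predict.
print(residuals)
```

Note that when the prediction is the mean, the residuals always sum to zero – which motivates the summary metrics discussed next.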

Summarizing the Deviation / Residuals – Actual vs. Predicted:

As Linear Regression aims at better predictions, we need to understand the deviation between actual and predicted values to know how good the predictions are. In other words, we need a summarized metric for the residuals (or error terms) to understand how close our predictions are to reality (i.e. the actual values).

  1. One quick way to summarize our predictions is to take the sum of the error terms across all data points. In table-3 below, you can see the error terms calculated as the difference between y-actual and y-predicted (y-predicted, in this case, being the average value of y, i.e. £1.8m).

    • In some data points, the error will be positive while in others it will be negative; taking the sum cancels these out. Hence, the sum of error terms (or residuals) is not a good metric to depict the deviation between actual vs. predicted.
  2. A better way to summarize this is to take the absolute difference between actual and predicted values (to negate the effect of error terms being negative or positive) and then take their sum. This is known as the Sum of Absolute Errors.

    • In table-3 below, you can see the absolute value of the error terms; taking their sum gives £9.10m. This takes care of the issue of error terms cancelling each other out due to being positive and negative.
  3. The Sum of Absolute Errors is a good way to depict the deviation between actual vs. predicted, but it gives the same weightage to all residuals. That is, a high and a low residual get the same weightage. An improvement on this is to penalize the data points where the actual and predicted values are further apart (i.e. penalize high error terms more). This can be achieved by squaring the differences across all data points and then taking their sum. This is known as the ‘Sum of Squared Errors/Residuals‘ (Sum of Squares is also known as ‘SS’).
    • In table-3 below, you can see the squared errors; taking their sum gives £10.5m. Please note that this takes care of the issue of some error terms cancelling each other AND also penalizes higher error terms. SS is a good way to depict the deviation between actual vs. predicted.

It is important to get our head around SS as this would be used to explain further concepts in Linear Regression.
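The three summary metrics above can be sketched as follows (the data is hypothetical, not the article's table-3, so the figures differ from £9.10m and £10.5m):

```python
# Three ways to summarize residuals, with predicted = mean (Model-1).
actuals = [1.2e6, 1.5e6, 1.7e6, 2.0e6, 2.6e6]
predicted = sum(actuals) / len(actuals)
residuals = [y - predicted for y in actuals]

sum_errors = sum(residuals)                        # cancels to 0 - uninformative
sum_abs_errors = sum(abs(e) for e in residuals)    # Sum of Absolute Errors
sum_sq_errors = sum(e ** 2 for e in residuals)     # Sum of Squared Errors (SS)

print(sum_errors, sum_abs_errors, sum_sq_errors)
```

Squaring makes the £800k miss contribute 64 times as much as the £100k miss, which is exactly the "penalize large errors more" behaviour described above.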

Can we improve this estimate/prediction?

Improving the prediction means minimizing deviation between actual vs. predicted values. That is, getting our predictions as close to the actual values as possible, which means minimizing the Sum of Squared Error/Residuals.

One way would be to collect data for the square footage of the flats (Area of flats) and vary our prediction of Property Value based on that. That would surely be more accurate than a flat prediction of £1.8m for all types of flats (from small flats to big multi-storey villas).

This is us scratching the surface of Linear Regression, as it seems logical that there is a linear relationship between ‘Property Value’ and ‘Area of flats’, and we can use one to predict (or estimate) the other (‘Area of flats’ is used to estimate/predict the ‘Property Value’).

Now, let’s suppose that when you were collecting ‘Property Value’ data for the 10 flats, you also managed to collect the square footage (or area) of the flats. Now, there are two variables at play, as seen in table-2:

  • Property Value – Value of the property (which needs to be predicted)
  • Area of flats – Square Footage of the flat (which can be used to predict ‘Property Value’)

Please note that the relationship we are trying to establish is that ‘Property Value’ is dependent on ‘Square Footage Area’, not the other way around as that would not make sense (area of a property would be generally fixed, the property value would depend on it) i.e. Property Value → f (Area of flat)

Now, to understand this relationship let’s plot these data points on two axes; let’s use the Y-axis for ‘Property Value’ and the X-axis for ‘Area of flats’.

Now, before doing any analysis, let’s eyeball chart-3. It shows that Property Value increases with an increase in Area; it seems the two are linearly related.

Let’s consider Model-1 to be the scenario where we report the average Property Value as our prediction for all flats, and Model-2 to be the scenario wherein we use ‘Area of a Flat’ to make our predictions.

  • Model-1: Predicted Property Value = Average(Property Value)
  • Model-2: Predicted Property Value → f (Area of flat)

It makes intuitive sense that our average estimate/prediction of £1.8m can improve if we know the area of the flat whose property value we are trying to estimate/predict i.e. by knowing the area of flats we would be able to make predictions for property value closer to the actual values.

→ The variability (or deviation) in actual vs. predicted values in model-1 (with predicted defined as average) will reduce when we introduce area of flats in the mix (i.e. Model-2) as we will get predictions that are closer to actual.

This is what a Linear regression does.

Linear Regression helps us estimate/predict the value of one variable (known as dependent variable) when we know the values of one or more predictor variables (also known as independent variables).

The aim of Linear Regression model is to minimize the variability in the predicted vs. actual values by using independent variables, i.e. minimize the Sum of squares Residuals.

How good we are at predicting the y variable can be determined by comparing the Sum of Squared Residuals (representing actual vs. predicted) between Model-1 (with predicted y = average of y) and our new model where y → f(X). → If Model-2 is significant, it will give predictions closer to the actual values, which means it will "eat up" some of the variability/deviation from our base Model-1. The model aims to minimize the Sum of Squared Residuals (representing the variability/deviation) and will literally "fit" the data better.
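This Model-1 vs. Model-2 comparison can be sketched with numpy, using hypothetical (area, value) pairs; np.polyfit with deg=1 computes the least-squares straight line:

```python
import numpy as np

# Hypothetical data: area in sq ft, property value in GBP.
area = np.array([500.0, 650.0, 700.0, 900.0, 1250.0])
value = np.array([1.2e6, 1.5e6, 1.7e6, 2.0e6, 2.6e6])

# Model-1: predict the mean for every flat.
ss_model1 = np.sum((value - value.mean()) ** 2)

# Model-2: predict from area via the least-squares line.
b1, b0 = np.polyfit(area, value, deg=1)  # slope, intercept
ss_model2 = np.sum((value - (b0 + b1 * area)) ** 2)

# Model-2's SS is smaller: it "eats up" variability left by Model-1.
print(ss_model1, ss_model2)
```

Because the least-squares line can always fall back to the flat mean (slope = 0), its Sum of Squared Residuals can never exceed Model-1's.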

Linear Regression – The model

The Linear Regression model is a mathematical equation that models the value of a dependent variable with respect to one or more independent variables. The general form of Linear Regression:

Dependent variable is a function of independent variables : y → f (X)

i.e. y = B0 + B1X1 + B2X2 + B3X3 + … – Equation (1)

where y is the dependent variable, and X1, X2, X3, … are the independent variables helping us estimate/predict y.

Coming back to our Property Value prediction problem,

Property Value (predicted) = B0 + B1 (Area of flat) – Equation (2) (please note that we have just one independent variable here, i.e. Area of flat)

Equation (2) clearly represents a straight line in a 2-D plane, with the predicted Property Value on the Y-axis and the Area of flat on the X-axis. This means linear regression results in a straight line, implying that any of the lines in chart-4 could be our regression line (please see chart-4 below).
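Equation (2) is simple enough to express directly in code; the coefficients below are illustrative placeholders, not values fitted from real data:

```python
# Equation (2) in code: predicted Property Value = B0 + B1 * (Area of flat).
b0 = 500_000.0  # intercept B0 (hypothetical)
b1 = 1_500.0    # slope B1 in GBP per sq ft (hypothetical)

def predict_value(area_sqft: float) -> float:
    """Predicted Property Value for a flat of the given area."""
    return b0 + b1 * area_sqft

print(predict_value(800))  # 500,000 + 1,500 * 800 = 1,700,000.0
```

Each candidate (b0, b1) pair corresponds to one of the straight lines in chart-4; regression is the search for the pair that fits best.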

Note that all these lines predict y (i.e. Property Value) for a given value of X (Area of flat), and all of them have some deviation from the actual y values (i.e. the actual Property Values from your data). As stated earlier, the intention of modelling is to minimize this deviation of predicted values from actual values.

Among all these lines, there would be a line that will give predictions or estimates of Y which are closest to the actual values of Y (i.e. actual property value). This is called the line of best fit.

The Line of Best Fit is a straight line through our data points which best expresses the linear relationship between our dependent variable and independent variables.

Simple Linear Regression is really a comparison of two models:

  • One where the independent variable isn’t used at all, i.e. the average of y is our prediction (the average of y is the regression line)
  • One where y → f(X), and the regression line minimizes the variability in actual vs. predicted (this is called the line of best fit)

Summary!

  • We all do regression daily in some shape or form to predict something or the other.
  • We aim to predict as close to actuality, i.e. minimize the deviation between actual vs. predictions.
  • Linear Regression: y = B0 + B1X1 + B2X2 + …
  • Errors / Residuals = y(actual) – y(predicted).
  • SS (Sum of Squares) is a good way to capture deviation both within a variable and between two variables.
  • Line of Best Fit → Best fitting regression line, the aim is to minimize the deviation between actual vs. predicted, i.e. minimize Sum of squared Residuals.


.. Keep learning and keep growing!

