
This article is a continuation of a series that last covered Bayesian statistics.
Volatility is inherent within the stock market, but perhaps the rise of "meme stocks" is something that is beyond traditional comprehension. This article is not about such "stonks" – the ironic misspelling of stocks that encapsulates the new investing zeitgeist amongst Millennials (broadly speaking). Rather, I was pondering the fabulous rollercoaster ride of $GME as a timely example of something to apply linear regression to; one can apply a linear regression model to a given stock to predict its future price. This article and subsequent articles will expand upon the concepts of linear regression, several methods associated with it, and the errors used for model analysis, as well as touching on residuals.
But what is regression? The goal of regression is to produce a model that represents the best fit for some observed data. A better, more specific way to state this is that regression is used to predict or explain numeric data – smarter people than me have been applying such models to financial data (like $GME) for decades. But if you want to predict whether or not somebody is going to come down with a disease, whether somebody will be wearing black, or whether an asset will appreciate or depreciate in value, those would not be regressions; those would be classifications. A regression is either an explanatory model or a predictive model for numeric data. For the most part, we will be dealing with models where we can describe a function – that is, we can describe the relationship between one or more input variables and a single output variable (for linear regression).
History
Regression is based on the method of least squares, or the method of minimum mean square error. The ideas around least squares and averaging errors have evolved over nearly three centuries. The first known publication of a ‘Method of Averages’ was by the German astronomer Tobias Mayer in 1750. Pierre-Simon Laplace published a similar method in 1788. The first publication of the method of least squares was by the French mathematician Adrien-Marie Legendre in 1805.
As a historical aside, it is very likely that the German physicist and mathematician Carl Friedrich Gauss developed the method of least-squares as early as 1795, but he did not publish this method until 1809. Legendre firmly opposed any notion that Gauss had used the method earlier than the 1805 publishing date.
Francis Galton, a cousin of Charles Darwin, coined the term regression in 1886. Galton was interested in determining which traits of plants and animals, including humans, could be said to be inherited. While Galton invented a modern form of regression, it was Karl Pearson who put regression and multiple regression on a firm mathematical footing. He created the world’s first university statistics department at UCL. Pearson’s 1898 publication proposed a method of regression as we understand it today.

Numerous others have expanded the theory of regression over the century since Pearson’s pivotal paper. Notably, Joseph Berkson published the logistic regression method in 1944, one of the first classification algorithms. In recent times the explosion of interest in Machine Learning (ML) has led to a rapid increase in the number and types of regression models.
Types of regression
There are a variety of types of regressions including many that are nonlinear and have absolutely nothing to do with what I will cover; they are used to predict numeric outcomes but use different underlying technologies. Some common regression types are:
- Linear – most common
- Logistic – binary outcomes (success/failure)
- Polynomial – when a straight line is not the best fit
- Stepwise, Ridge, and Lasso (common in ML)
Least-squares linear regression
Least-squares regression is the type I will be expounding on today. A word of caution: most people conflate this with linear regression, but the two need not be the same. Linear regression just means that you are doing something using a linear collection of parameters, where each parameter is associated with one variable. There are a variety of other ways to do regressions, and those would not use such linear collections of parameters.

We have our linear regression, which is most commonly used to describe a straight line like the picture above – but that would also be true for logistic and polynomial regression, and probably also for most cases of stepwise, ridge, and lasso. So there are two definitions of linear – and it will be confusing.
The one definition of linear is a line that goes through your data. Said another way, a line is fitted to your data. The other definition of linear is when you have a linear collection of parameters, with each parameter being associated with one variable. Said differently: the linearly predictable behavior of the dependent (target) value based on the explanatory variables.
This is a fancy way of saying:
y = x0 + a·x1 + b·x2 + c·x3
In the example above the parameters a, b, c have a linear association with the input variables x1, x2, and x3 – x0 is the offset (intercept). When I say variables, that means that they are inputs to a regression.
The least-squares linear regression was derived with the help of linear algebra and some calculus. I believe that understanding linear models is the basis for understanding the behavior of a whole bunch of statistical and ML models. In fact, I believe that about half of all machine learning models will use some of this form of linear regression, at least in their starting points.
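As a minimal sketch of that machinery (illustrative toy data and variable names of my own choosing, not a derivation), NumPy can solve the least-squares problem for such a linear collection of parameters directly:
# a hedged illustration: solving least squares with NumPy's lstsq
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 3))  # columns play the role of x1, x2, x3
y = 2.0 + 1.5*X[:, 0] - 0.7*X[:, 1] + 0.3*X[:, 2] + rng.normal(0, 1, 50)

# prepend a column of ones so the first fitted parameter is the offset (x0)
A = np.column_stack([np.ones(len(X)), X])

# lstsq minimizes the sum of squared errors ||A @ params - y||^2
params, *_ = np.linalg.lstsq(A, y, rcond=None)
print('fitted offset and parameters (x0, a, b, c):', params)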
Model selection: 11 dots

A very reasonable and acceptable way to do a regression for some data (like the above example) is just to eyeball it and draw your line through it. If you want to call that a regression, that is fine. I doubt it can be called a linear regression, but it is certainly a regression, and a very valid one. Now we are going to be talking about regressions where we can show mathematical rules for going from data points to a line of best fit.

Earlier I defined linear regression; now I will show an example of a quadratic equation that is still linear the way we have defined it:
y = x0 + a·x1 + b·x1²
equivalently:
y = x0 + a·x1 + b·x2, where x2 = x1·x1
What we can do is say, "okay", y is not the output from a function that takes in one variable, namely x1, but rather y is the output from a function that takes in two variables; the first one being x1, the second one being x1². In fact, we can just call x1 times x1 the variable x2.
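To make that concrete, here is a small hedged sketch (toy data and names are my own) that fits a quadratic by handing a plain linear model two input columns, x1 and x2 = x1·x1:
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x1 = np.linspace(-3, 3, 30)
y = 1.0 + 2.0*x1 + 0.5*x1**2 + rng.normal(0, 0.5, x1.size)  # quadratic signal plus noise

# treat x1 and x1**2 as two separate input variables
X = np.column_stack([x1, x1**2])

model = LinearRegression().fit(X, y)
print('offset (x0):', model.intercept_)
print('parameters for x1 and x2 = x1^2:', model.coef_)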

We can have non-straight lines fit a linear equation! Extending from this we can basically go to any polynomial. The following graph has the same data as the quadratic example, but a different line of best fit.

The only thing really changing is the line that we are drawing through the points. In both cases, we are using what we call a linear relationship – well, people who are in this regression business (mathematicians/physicists) call it such. They know full well that this is not a straight line. But they claim that in higher dimensions it is actually straight; calling it not a line but a hyperplane.
The hyperplane is not curved in those dimensions. For the fourth-order fit, the dimensions would be the variables:
x1, x2 = x1², x3 = x1³, x4 = x1⁴
This is adjacent to the mathematical field of differential geometry, very relevant to particle physicists and Einstein’s General Relativity, in general (haha)! Physicists will say that the path of a hit baseball is a straight line. In high school, we learned that this path is a parabola, but in curved space-time the path is a straight-line geodesic (a geodesic being a "straight" line in a curved landscape). Geodesics you are likely familiar with are the great circles on a globe, such as the equator and the lines of longitude.
So here the relevant question being asked is: which of these three preceding models is correct? The first-order linear one? The second-order linear one? The fourth-order linear one? Well, the better question is which of these three is more useful, and that naturally requires that we know what the model is being used for.
The general rule is, and this is a pretty important rule, that the simpler the model, the more useful it is. Remember this heuristic! Later I will write about regularization, which is just a way of making models simpler. When you simplify your linear model (well, any model, but the context here is linear models) it is not just easier to understand; more importantly, it tends to work better!
No data are perfect; errors will always exist. Another useful heuristic to remember is "all models are wrong, but some are useful!" The fact is that some of the data above could just be errors. The 11 dots above are the training data. When we look at new data, it might turn out, or probably will turn out, that they will not follow the funny curve of our higher-order model. Chances are that the "hyperplane" model is an example of what we call overfitting. The reverse process, the process that undoes overfitting, is called regularization. As such, the balance of probability says that our first model is the most useful model to conduct further analysis on.
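A quick way to see this trade-off for yourself is to fit first-, second-, and fourth-order polynomials to a small noisy sample and compare how they do on points they were not trained on. This is only a sketch with made-up points (not the 11 dots in the figure):
import numpy as np

rng = np.random.default_rng(2)
x_train = np.linspace(0, 10, 11)                 # 11 training "dots"
y_train = 0.8*x_train + rng.normal(0, 1.0, 11)   # roughly linear signal plus noise
x_new = np.linspace(0.5, 9.5, 11)                # unseen points from the same process
y_new = 0.8*x_new + rng.normal(0, 1.0, 11)

for degree in (1, 2, 4):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train)**2)
    new_mse = np.mean((np.polyval(coeffs, x_new) - y_new)**2)
    print(f'degree {degree}: training MSE {train_mse:.2f}, new-data MSE {new_mse:.2f}')
The higher-order fits typically shave a little off the training error without reliably improving on the new points.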
Terminology
- Response (Dependent) variable: the variable of primary interest, the variable we are trying to predict or explain. In our Jupyter notebook examples below, the response is the lefthand side of the equals sign – ‘y’ is DEPENDENT on the values on the righthand side of the equals sign
- Explanatory (Independent) variable: a variable that attempts to explain the observed outcomes of the response variable. In an equation, if you change the ‘y’ (dependent) variable, you are not changing the righthand side ‘x’ variables – hence they are INDEPENDENT of changes in ‘y’
There are two types of parameters in linear models:
- y-intercept, referred to above as the offset
- slope, rise over run, change in y divided by change in x. If there are multiple inputs, then we have multiple slopes (we call these partial slopes)
As an aside, I want to mention that data are often not linear, and as such the data need to be transformed before applying regression analysis. An important such transformation is the log transformation for dealing with nonlinear data.

This raw data is not linear; however, we can make it appear linear by applying a logarithmic transformation, here log base 10:

The reason for doing the log transformation is that the errors are now likely to be consistent throughout, which is very probably going to be better. It is possible to fit an exponential curve directly to the data instead, but then the errors would not be consistent; they would not be of the same scale – which makes downstream analysis much harder.
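As a hedged illustration of that idea (synthetic exponential data of my own making), taking log base 10 of y turns an exponential trend into a straight line that ordinary least squares handles comfortably:
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
x = np.linspace(0, 5, 60)
y = 2.0 * 10**(0.4 * x) * rng.lognormal(0, 0.1, x.size)  # exponential growth with multiplicative noise

log_y = np.log10(y)  # linearize: log10(y) is roughly log10(2) + 0.4*x

model = LinearRegression().fit(x.reshape(-1, 1), log_y)
print('slope on the log scale (expect ~0.4):', model.coef_[0])
print('intercept on the log scale (expect ~0.30):', model.intercept_)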
Error terminology
Before explaining least-squares regression, we need to understand what the errors are:

The error (E) is the vertical distance between each point and the regression line. We select the line with the smallest sum of squared differences between each point and the line.

The line above is a pretty good fit for the five dots, but it is not perfect. What we want to know is how much error there is in the dependent variable. The epsilon term is added to bridge the gap between our predicted value (y_hat), which is where the red dot follows the green line and meets the regression line, and the actual observed value. The error is the difference between the predicted and observed values. This is further explained in the code analysis below.
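Written out explicitly (my notation, matching the description above), the predicted value and the error for each observation are:
\hat{y}_i = b + m\,x_i \qquad\text{and}\qquad \varepsilon_i = y_i - \hat{y}_i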
With a little bit of calculus and linear algebra, one could try to find the line that minimizes the sum of the errors over all possible regression lines. That is incredibly laborious to calculate. If instead you minimize the sum of the squares of the errors, it turns out that there are beautiful, analytical solutions. And this is why we do least-squares fitting. So we do not try to minimize those dotted lines, or the sum of those dotted lines; we try instead to minimize the sum of their squares. So what is the sum of squares?

The squares are now summed. The smallest total area of these squares is what we believe to be our best fit line. The advantage of this method, besides the beautiful analytical solutions, is that it penalizes larger errors – it punishes outliers. In the example above, say that the largest error is 6 times larger than the smallest error. In least-squares regression that error is weighted 36 (6 × 6) times more than the smallest square.

We let the computer solve the equation above to find the least-squares line by playing around with the slope and intercept. The predicted (theoretical) values are subtracted from the actual (observed) values, squared, and summed.
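A brute-force sketch of "playing around with the slope and intercept" (toy data and grid ranges of my own choosing; real solvers use the closed-form least-squares solution rather than a grid search):
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 40)
y = 1.5 + 2.0*x + rng.normal(0, 1.0, x.size)

best = (None, None, np.inf)
for intercept in np.linspace(-5, 5, 101):
    for slope in np.linspace(0, 4, 101):
        predicted = intercept + slope*x
        sse = np.sum((y - predicted)**2)  # sum of squared errors for this candidate line
        if sse < best[2]:
            best = (intercept, slope, sse)

print('best intercept, slope, sum of squares found:', best)
print('closed-form check, np.polyfit returns [slope, intercept]:', np.polyfit(x, y, 1))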
Python example: regression model
The following code can be run in a Jupyter Notebook to help get a feel and a visualization for least-squares linear regression.
N.B. we use synthetic data because we have a lot of control over what happens. Synthetic data is not always the best.
# relevant imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# synthetic data: points between 0 and 10 with standard deviation of
# the errors of 1 (std deviation of y should be around 3, the square
# root of 10)
n_points = 80
x_start, x_end = 0, 10
y_start, y_end = 0, 10
y_sd = 1 # std deviation of the errors
# data columns, the y values are the errors added to the x values
x_data = np.linspace(x_start, x_end, n_points)
y_error = np.random.normal(loc=0, scale=y_sd, size=n_points)
y_data = np.linspace(y_start, y_end, n_points) + y_error
# create dataframe
syn_data = pd.DataFrame({'x':x_data, 'y':y_data})
syn_data.head()

Let us now visualize the data – note the strong linearity of the data. Remember that the point of this exercise is to answer the question "if we know the value of X can we predict the value of Y?" We believe that by drawing the line of best fit that if someone gives us an X, say an X that we do not already have, then we can figure out what the Y value would be.
plt.plot(syn_data['x'], syn_data['y'], 'ko') # ko = black points
plt.grid(True)
plt.xlabel('x')
plt.ylabel('y')
plt.title('x vs y')
plt.show()

Building linear regression model
Following a simple template, we can build our model. You are likely familiar with one common package that can do this: scikit-learn. A quirk of sklearn is that it works best with NumPy arrays; it is just what it is used to. The following template reshapes and converts the data for model training and evaluation:
# standard lin reg template
# imports
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# create numpy arrays for the x (independent) and y (dependent) variables
x = syn_data.x.values.reshape(-1,1)
y = syn_data.y.values.reshape(-1,1)
# model initialization
regression_model = LinearRegression()
# train the model (fit data)
regression_model.fit(x, y)
y_predicted = regression_model.predict(x)
# model evaluation
rmse = np.sqrt(mean_squared_error(y, y_predicted))  # RMSE is the square root of the MSE
r2 = r2_score(y, y_predicted)
# printing values
print('Slope:' ,regression_model.coef_)
print('Intercept:', regression_model.intercept_)
print('Root mean squared error: ', rmse)
print('R2 score: ', r2)
# plotting values
plt.scatter(x, y, s=10)
plt.xlabel('x')
plt.ylabel('y')
plt.plot(x, y_predicted, color='r')
plt.show()

The statistical values are explained in greater detail below when we evaluate the model’s performance.
Plot residuals
The following is an important tool in understanding how good your regression model is – a residual plot. What is happening is that for every point, the difference between the observed and predicted values is plotted.
import seaborn as sns
sns.residplot(x=x.squeeze(), y=y.squeeze())  # newer seaborn versions require keyword arguments
plt.show()

The residuals are the errors, which we plotted earlier. The x-axis shows the x values; the y-axis shows the discrepancy between the actual y values and the predicted y values – the errors.
from sklearn import linear_model
# from sklearn.linear_model import LinearRegression
# initialize the model (note: this rebinds the name linear_model from the module to the model instance)
linear_model = linear_model.LinearRegression()
# define x and y
x_input = syn_data['x'].values.reshape(n_points, 1)
y_output = syn_data['y'].values.reshape(n_points, 1)
# fit the model to the existing data with lm.fit(x,y)
linear_model.fit(x_input, y_output)
# fit model to make predictions
y_pred = linear_model.predict(x_input)
# plot
plt.scatter(x_input, y_output)
plt.plot(x_input, y_pred, linewidth=2)
plt.grid(True)
plt.xlabel('x')
plt.ylabel('y')
plt.title('x vs y')
plt.show()
# model parameters
print('Intercept: {0:.5f}'.format(linear_model.intercept_[0]))
print('Slope : {0:.5f}'.format(linear_model.coef_[0][0]))

Statsmodels package
Another, older Python package is statsmodels (imported below as sm). The following code duplicates the scikit-learn work. Note the subtle differences between the two with respect to how the outputs are displayed.
N.B. The statistics are the same, but the presentations and the robustness thereof differ between the packages. sklearn is the preferred package to learn these days.
import statsmodels.formula.api as sm
# ols = ordinary least squares
ols_model = sm.ols(formula = 'y ~ x', data=syn_data)
# fit the model
results = ols_model.fit()
print('Intercept: {0:.5f}'.format(results.params.Intercept))
print('Slope : {0:.5f}'.format(results.params.x))

# add predicted to dataframe
syn_data['predicted'] = y_pred
# add residuals to dataframe
syn_data['resids'] = y_output - y_pred
# verify data is as expected
syn_data.head()

# coefficients of line of best fit
m = linear_model.coef_[0]
b = linear_model.intercept_
print('m = {}'.format(m[0]))
print('b = {}'.format(b[0]))

Interpreting the parameters
It is very important to know how to interpret the parameters (your results will be different because of the random seed in creating the data).
- y-intercept (b): when x is zero, y is -0.13…
- slope (m): When we increase x by 1 then we expect that y will go up by 1.03….
Simple enough, I bet most readers will have a strong grasp on these basics, but it is important to refresh everyone and to build up from a solid and universal foundation.
Model summary statistics
This is where statsmodels shines – the beautiful results summary. It natively shows a lot more than we need to understand least-squares regression at the moment. The residuals are also plotted after the OLS report.
import statsmodels.formula.api as sm
import seaborn as sns
ols_model = sm.ols(formula = 'y ~ x', data=syn_data)
results = ols_model.fit()
# print slope (m) and y-intercept (b)
print('Intercept, Slope : {}'.format(results.params))
print('\nSSE, SST, SSR, and RMSE:')
mean_y = np.mean(y_output)
sst = np.sum((y_output - mean_y)**2)
ssr = np.sum((y_pred - y_output)**2)
sse = np.sum((y_pred - mean_y)**2)
print('Sum of Squares Explained (SSE): {}'.format(sse))
print('Total Sum of Squares (SST): {}'.format(sst))
print('Sum of Squared Residuals (SSR): {}'.format(ssr))
print('RMSE: {}'.format(np.sqrt(results.mse_resid)))
# print linear regression statistics
print('\n', results.summary())

# plot histogram of the residuals
sns.histplot(results.resid, kde=True)  # distplot is deprecated in recent seaborn; histplot with kde=True is the replacement
plt.xlabel('Residual')
plt.ylabel('Frequency')
plt.title('Residual Histogram')
plt.show()

The first thing to note about our residuals plotted above is that they are roughly normal. This is to be expected because the synthetic data was created to be such, so this is a bit of a self-fulfilling prophecy. The y data were created by adding normally distributed errors with a standard deviation of one to an evenly spaced baseline.
Evaluation of regression models
Now that we have built a regression model, we can quantitatively evaluate the performance of our regression model. The evaluation of regression models is based on measurements of the errors.
If we want to know how well we did, we must consider what a good benchmark would be. A common easy benchmark is to see if we can predict better than just the mean of the target variable.
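A small sketch of that benchmark, reusing the y_output and y_pred arrays from the earlier cells (so run those first): compare the RMSE of the fitted line with the RMSE of simply predicting the mean of y everywhere.
import numpy as np

# baseline: predict the mean of y for every observation
baseline_pred = np.full_like(y_output, np.mean(y_output))
baseline_rmse = np.sqrt(np.mean((y_output - baseline_pred)**2))

# model: the least-squares line fitted earlier
model_rmse = np.sqrt(np.mean((y_output - y_pred)**2))

print('baseline (mean) RMSE:', baseline_rmse)
print('regression RMSE:', model_rmse)  # should be noticeably smaller for our data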
We define the following terms: y_i are the observed values, ŷ_i are the predicted values from the regression line, and ȳ is the mean of the observed values.
Now we need to define the following error metrics:
- Sum of Squared Residuals (SSR): SSR = Σ(y_i − ŷ_i)². This is the value that the method of least squares minimizes with the best fit line. The goal of regression is to minimize this quantity.
- Sum of Squared Total (SST): SST = Σ(y_i − ȳ)². This is a measure of how bad a prediction the target mean would be.
- Sum of Squared Explained (SSE): SSE = Σ(ŷ_i − ȳ)² = SST − SSR. You can also think of this as a measurement of how much better we are explaining the variation than the mean.
Notice that SST = SSR + SSE
- This formula is only true if you apply it to your training data with a least-squares fit! If you apply it to test data, the formula no longer holds. It is true exactly in the cases where your coefficient of determination (COD) is the same as your Pearson’s R², the squared correlation coefficient. They are not the same in general! They coincide only for a least-squares fit evaluated on your training data, not when you use test data or a non-least-squares fit. Outside of those cases SSR can become larger than SST, leading to a negative R²!
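A brief sketch of why this matters (a synthetic train/test split of my own construction): evaluating R² on data the model was not fit to can make SSR exceed SST and push R² below zero.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(5)
x_train = rng.uniform(0, 10, 30).reshape(-1, 1)
y_train = 1.0 + 0.5*x_train.ravel() + rng.normal(0, 1.0, 30)

model = LinearRegression().fit(x_train, y_train)

# on the training data the identity SST = SSR + SSE holds and R^2 lies in [0, 1]
print('train R^2:', r2_score(y_train, model.predict(x_train)))

# on test data drawn from a different relationship, R^2 can go negative
x_test = rng.uniform(0, 10, 30).reshape(-1, 1)
y_test = 5.0 - 0.5*x_test.ravel() + rng.normal(0, 1.0, 30)
print('test R^2:', r2_score(y_test, model.predict(x_test)))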

We can now talk about the Root Mean Squared Error (RMSE). The reason we do this is that the sum of squared errors (and the related mean squared error) is in squared units of y. To convert the error back to the units of y, we take the square root of the mean squared error:
RMSE = √(SSR / n) (up to a degrees-of-freedom correction), where n is the number of observations.
Remember that the goal of regression is to minimize the residual error, SSR. Specifically, we wish to explain as much of the variance in the original data as possible with our model. We can quantify this idea with the coefficient of determination (COD), also known as R²:
R² = SSE / SST = 1 − SSR / SST
so as SSR shrinks toward zero, R² approaches 1.
R² is the fraction of the variance of the original data explained by the model. A model that perfectly explains the data has R² = 1. A model which does not explain the data at all has R² = 0; it does no better than simply predicting the mean of y for every point. You can still draw a line through such data, it just would not explain anything.
However, R² is not perfect:
- R² is not bias adjusted for degrees of freedom
- More importantly, there is no adjustment for the number of model parameters. As the number of model parameters increases, SSR will generally decrease. Without an adjustment, you will get a false sense of model performance
We address this by defining Adjusted R² (R²adj):
R²adj = 1 − (SSR / DFR) / (SST / DFT)
where DFR = n − p − 1 is the residual degrees of freedom and DFT = n − 1 is the total degrees of freedom, with n the number of observations and p the number of model parameters (excluding the intercept). This can be rewritten as:
R²adj = 1 − (1 − R²) · (n − 1) / (n − p − 1)
Now we have a metric that is penalized as more parameters are added, rewarding simplicity – remembering from earlier that simpler models are preferred to complex models.
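A minimal sketch of that adjustment (the function name and inputs are my own), computing R²adj from R², the number of observations n, and the number of parameters p:
def adjusted_r2(r2, n, p):
    # adjusted R^2 for n observations and p parameters (excluding the intercept)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# the same raw R^2 looks less impressive as parameters are added
print(adjusted_r2(0.90, n=80, p=1))   # simple model
print(adjusted_r2(0.90, n=80, p=10))  # many-parameter model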
Conclusion
We have covered many regression concepts so far:
- The simple regression equation of a line is y = b + m·x + ε, with intercept b, slope m, and error term ε
- Use the Python model object linear_model.LinearRegression() to initialize the model
- Use the fit method to fit the model to the data
- Use the predict method to compute scores (predicted values) for the dependent variable (y)
- Pull the model parameters – m (slope) with coef_ and b (y-intercept) with intercept_
- The residuals are the difference between the y_output and the y_predicted
- Use the statsmodels and seaborn libraries to get summary statistics and make diagnostic plots; sklearn is preferable to the sm package, as it is simpler to use and more robust
- What residuals are and why we plot them
We evaluated the results from the OLS Regression Results report produced by the summary method:
- SSR (Sum of the Squared Residuals) is the difference in error from the regression line (try to minimize)
- SST (Sum of the Squared Total) is the difference to the target mean
- SSE (Sum of the Squared Explained) is the variation of the regression line to the mean
- RMSE (Root Mean Squared Error) is the square root of the mean squared residual (SSR divided by the degrees of freedom)
- R² is the fraction of the variance of the original data explained by the model, want a number close to 1
- SST = SSR + SSE is ONLY true for the cases where your coefficient of determination (COD) is the same as your Pearson’s R², the squared correlation coefficient. For least-squares regression evaluated on training data it is true. Outside of this case, such as on test data, the formula does not hold and SSR can even exceed SST
In my next article, I cover multivariate regression and then bootstrapping with regression models!
Find me on LinkedIn
Physicist cum Data Scientist – Available for new opportunities | SaaS | Sports | Start-ups | Scale-ups