
This article is a continuation of my previous one on Linear Regression.
It is worth reiterating the error formulae in least-squares regression from that article.

The sum of squared error (SSE) measures the discrepancy between the theoretical value on the line and the actual observed value (the blue dots in the plot). Square each of those discrepancies and sum them up; that total is the SSE.

The sum of squares regression (SSR) says how much variance the model accounts for: it measures the spread between the predicted values and the mean. A large SSR does not by itself mean the model is good or bad. But if the SSR is substantially larger than the sum of squares of the discrepancies (SSE), then the model is a good one.
The sum of squares total (SST) is sometimes equal to SSR + SSE. The first "sometimes" is that you have to use least-squares regression to begin with, which makes sense given that we are summing up squares. The other, more important condition, and the one people forget all the time, is that SST = SSR + SSE only holds for the training data. It is not true for the test data, because on test data the SSE can be arbitrarily large, even larger than SST. SST here is based only on the training data, while the SSE is computed on your test data.
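As a quick numeric check of that identity, here is a minimal sketch (with hypothetical toy data, not from this article) that fits a least-squares line with numpy and computes all three sums of squares:
import numpy as np
# hypothetical training data and an actual least-squares fit to it
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
slope, intercept = np.polyfit(x, y, 1)
y_pred = slope * x + intercept
sse = np.sum((y - y_pred) ** 2)         # squared discrepancies from the line
ssr = np.sum((y_pred - y.mean()) ** 2)  # squared distances of predictions from the mean
sst = np.sum((y - y.mean()) ** 2)       # total squared variation in y
print(np.isclose(sst, ssr + sse))       # True, because this is the training data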

Therefore it is entirely possible to get a negative value for R². As we all know, R² could not be negative if it were truly the square of something, but the usual formula, R² = 1 − SSE/SST, is not the square of Pearson's r or anything like that.
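To make that concrete, here is a small sketch using that definition with a hypothetical held-out set and a deliberately terrible set of predictions:
import numpy as np
# hypothetical held-out data and deliberately bad predictions for it
y_test = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([10.0, 9.0, 8.0, 7.0])
sse = np.sum((y_test - y_pred) ** 2)
sst = np.sum((y_test - y_test.mean()) ** 2)
print(1 - sse / sst)  # negative, because SSE is larger than SST here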
Another accuracy measure of the line is the Root Mean Squared Error (RMSE). The closely related residual standard error (RSE) adjusts for the degrees of freedom used by the fit (one more than the ordinary standard deviation in simple regression), and is what statsmodels reports:

import numpy as np
# sum of squared errors
sse = np.sum((y_pred - y_actual)**2)
# root mean squared error, m is the number of training samples
rmse = np.sqrt(sse / m)
# statsmodels reports the degrees-of-freedom-adjusted version as results.mse_resid
Assumptions
Here are some important assumptions of linear regression.
- Linear relationship between the dependent variable and independent variables
- Measurement error is random
- Residuals are homoscedastic
The primary assumption is that the residuals are homoscedastic. Homoscedasticity means that the residuals are roughly the same size throughout; they do not suddenly get larger. This is often not the case. What do you do then? If you can apply one of the usual tricks, like transforming the y values before you do any linear regression, then great! Otherwise, what can be done when the assumption is not valid? This is the dirty secret: people do nothing. They just pretend their residuals are homoscedastic in spite of the fact that they are heteroscedastic, meaning the error is not consistent throughout.

N.B: The errors do not need to be normally distributed to be homoscedastic, they just need to be roughly the same size throughout. They do, however, need to be centered about zero.
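Rather than just pretending, one way to actually check this assumption is the Breusch-Pagan test that ships with statsmodels. A minimal sketch on hypothetical data whose noise deliberately grows with x:
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
# hypothetical data, heteroscedastic on purpose: the noise grows with x
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 200)
y = 2 * x + rng.normal(scale=0.5 * x)
model = sm.OLS(y, sm.add_constant(x)).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print(lm_pvalue)  # a small p-value is evidence against homoscedasticity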
Outliers

Linear regressions are susceptible to outliers. In the first image there is a nice linear relationship, with the line going through the data points perfectly. The second is not, because the single data point up in the top right is pulling the regression line way up. Outliers like this can really be a problem, and you can identify them in a variety of ways, for example with leverage and Cook’s Distance.
Linear regression fits a line that passes through the mean of the x values and the mean of the y values. The line pivots around that point according to the pull of each data point. Points that are farther from the mean pull harder on the slope – this is leverage. Another way to quantify the ‘pull’ of each point is to refit the line without that point and see how much the parameters move: this is Cook’s Distance.
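Both quantities are available directly from a fitted statsmodels model; here is a hedged sketch on hypothetical data with one extreme point added on purpose (the influence_plot used later in the article visualises the same thing):
import numpy as np
import statsmodels.api as sm
# hypothetical data plus one extreme point acting as the outlier
rng = np.random.default_rng(1)
x = np.append(rng.normal(size=49), 10.0)
y = np.append(3 * x[:49] + rng.normal(size=49), 80.0)
model = sm.OLS(y, sm.add_constant(x)).fit()
influence = model.get_influence()
leverage = influence.hat_matrix_diag   # how far each point's x is from the mean of x
cooks_d, _ = influence.cooks_distance  # how much the fit moves if that point is removed
print(leverage.argmax(), cooks_d.argmax())  # the extreme point should top both lists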
Coefficients and p-values
When the regression line is linear the regression coefficient is the constant that represents the rate of change of one variable as a function of the other – it is the slope of the regression line.
- The p-values for the coefficients indicate whether these relationships are statistically significant.
- The t statistic is the coefficient divided by its standard error (see the sketch below).
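A minimal sketch of where these numbers live in statsmodels, using a tiny hypothetical dataset just to have something to fit:
import numpy as np
import statsmodels.api as sm
# tiny hypothetical dataset, only to illustrate where the statistics live
rng = np.random.default_rng(42)
X = sm.add_constant(rng.normal(size=(30, 1)))
y = X @ np.array([1.0, 2.0]) + rng.normal(size=30)
model = sm.OLS(y, X).fit()
print(model.params / model.bse)  # t statistic = coefficient / standard error
print(model.tvalues)             # the same thing, computed by statsmodels
print(model.pvalues)             # p-value for each coefficient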
Checking the residuals
This is important! You should always check for trends in the residuals.

There are more data points on the right than on the left, and that is fine; it is just the shape of the data. The plot shows the predicted values on the x-axis and the residuals, the error from the predicted value, on the y-axis.
- Residuals = Observed (actual data) – Predicted (the corresponding value on the regression line)
Positive values for the residual (on the y-axis) indicate that the prediction was too low, and negative values indicate that the prediction was too high. A zero means that the guess was exactly correct!
THERE SHOULD BE NO PATTERN IN THE RESIDUALS!
The question is: are the residuals closer to the dotted line on the right-hand side than they are on the left-hand side overall? Or, even worse, do you see more residuals beneath the dotted line than above it? The dotted line is zero, where the residual is zero. The residual, again, is the discrepancy between your predicted y value and your actual y value. The residuals should all hover around zero, because the whole point of having a predicted y value is to explain your data, and the predicted y value should generally be close to the actual y value.
When this is not the case, there is a trend in your residuals. Sometimes an obvious trend in the residuals is an opportunity: you can then fit another model to the residuals alone to capture it.
Always plot the residuals to check for trends. Check the residuals versus y and see whether they are correlated; the higher the correlation, the worse the fit. If the residuals are highly correlated with y, it means that as y gets larger, your residuals get larger, and this happens a lot. Testing the normality of the residuals is also important.
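A hedged sketch of both checks, on hypothetical data where a straight line is deliberately the wrong model:
import numpy as np
from scipy import stats
# hypothetical example: a straight-line fit to data that is actually quadratic
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = x ** 2 + rng.normal(size=200)
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
print(np.corrcoef(residuals, y)[0, 1])  # correlation of residuals with y; higher is worse
print(stats.shapiro(residuals))         # Shapiro-Wilk test of normality of the residuals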
A common sign that your residuals are heteroscedastic is "fan-shaped" errors, where the errors are larger on the right-hand side than on the left-hand side.

Again, there should be no trends apparent in the residuals. Oftentimes, though, the residuals are assumed to be homoscedastic when they are not, like the fan shape below. The third outcome, the nonlinear one, is really bad and cannot be assumed to be homoscedastic: the error trend is parabolic.

A nonlinear error trend like this is an indication that the data has a relationship not captured by the parameters used; the residuals still have information in them. It looks as if we could fit a polynomial to these residuals. Or we could simply add the polynomial term to our linear model in the first place, and overall we would end up with smaller errors.
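A minimal sketch of that idea, on hypothetical data with a quadratic relationship: the polynomial term is added as an extra feature, so the model stays linear in its parameters:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# hypothetical data with a quadratic relationship
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(100, 1))
y = 1.5 * x[:, 0] ** 2 + rng.normal(scale=0.5, size=100)
# add x and x**2 as features; the model remains linear in its parameters
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
model = LinearRegression().fit(X_poly, y)
residuals = y - model.predict(X_poly)  # should show no parabolic trend any more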
Multiple linear regression

As mentioned in my previous article, why not make the variables in the last formula linear by treating each power as a different variable, like z? With these new variables we have a linear setup, and the equation becomes a multiple linear regression.

One hot encoding vs dummy variables
This is commonly misunderstood. One-hot encoding (OHE) is one way of doing things; the dummy-variable encoding for linear regression is another. A dummy variable is basically the same as OHE, except that you keep only n-1 of the columns. Gender is not one-hot encoded when all you have is a column labeled "gender" with 1 meaning "male" and 0 meaning "female"; that is actually a dummy variable. It is not one-hot encoding because there is no column for female, only a column for male. So the column should not be called gender, because there is no gender called one or zero; it should be called is_male.
With dummy variables you drop a column; with OHE you never drop columns. That is the definition of one-hot encoding, which is not originally a data science term – it comes from a very different area of engineering. Using dummy variables is statistically preferable because you avoid the collinearity between the encoded columns, although in practice that collinearity will often not cause problems anyway. Using OHE can make the model easier to interpret. Remember: with OHE you never drop columns; if you drop one, you are using dummy variables.
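A quick sketch of the difference with pandas, using a hypothetical gender column; drop_first=True turns the one-hot columns into dummy variables:
import pandas as pd
# hypothetical example column, just to illustrate the difference
df = pd.DataFrame({'gender': ['male', 'female', 'female', 'male']})
ohe = pd.get_dummies(df['gender'])                       # one column per category (OHE)
dummies = pd.get_dummies(df['gender'], drop_first=True)  # n-1 columns (dummy variables)
print(ohe.columns.tolist())      # ['female', 'male']
print(dummies.columns.tolist())  # ['male'] – really an is_male column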
Multiple linear regression in python
With the following example, we want to predict the stock price index, just like everyone else.
N.B: For simplicity, and this is a big no-no, this model is fit and evaluated on the same data – there is no train/tune/test split.
# imports
from pandas import DataFrame
from sklearn import linear_model
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
# create dataframe from stock data
Stock_Market = {'Year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],
'Month': [12, 11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],
'Interest_Rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
'Unemployment_Rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
'Stock_Index_Price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719]
}
# inputs
df = DataFrame(Stock_Market,
columns=['Year','Month','Interest_Rate','Unemployment_Rate','Stock_Index_Price'])
# check that the data is well imported
df.head()

Identify the variables that are interesting and scale them. Scale your data to be safe! In spite of what anyone tells you, you will not get a different outcome from scaling your data – this can be proven mathematically, as long as you ignore floating-point issues. If you have very large or very small numbers, you get floating-point inaccuracy, and that can lead to a lot of problems; you get very large and very small numbers by not scaling in the first place. If you scale everything to begin with, then everything looks the same: everything sits between roughly minus two and plus two, centered around zero. The reason is that most people scale with what is called a standard scaler, based on the normal (Gaussian) distribution: you subtract away the mean and then divide by the standard deviation.
Another benefit of scaling is that if you use an iterative method, it will generally converge faster on scaled data. But as long as you have a linear system, there is a mathematical proof (this is a so-called convex problem) that there is a specific solution in all cases, regardless of data scaling. In principle, you will not get more or less accuracy by scaling – but floating-point inaccuracies are a thing to watch out for!
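For reference, here is a minimal sketch of what the standard scaler used in the next code block is doing – nothing more than subtracting the column mean and dividing by the column standard deviation:
import numpy as np
def standard_scale(x):
    # subtract the column mean, divide by the column standard deviation
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0)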
We keep "Interest_Rate" and "Unemployment_Rate" for further analysis:
x = df[['Interest_Rate','Unemployment_Rate']]
y = df['Stock_Index_Price']
# scale the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(x)
x = scaler.transform(x)
from sklearn.linear_model import LinearRegression
regression_model = LinearRegression()
# Fit the data(train the model)
regression_model.fit(x, y)
# predict
y_predicted = regression_model.predict(x)
print('Intercept: \n', regression_model.intercept_) # pull out the intercept
print('Coefficients: \n', regression_model.coef_) # pull out the coefficients

This output includes the intercept and coefficients to build the multiple linear regression equation.
N.B: We scaled the data, so the coefficients above reflect that. Nonetheless, there is a correlation between higher interest rates and stock prices rising, and a smaller correlated effect of prices rising as unemployment falls. Because the features were scaled, the coefficients are on roughly the same scale and are therefore directly comparable.
Stock_Index_Price = (Intercept) + (Interest_Rate coef)*X1 + (Unemployment_Rate coef)*X2
Plugging in the fitted numbers:
Stock_Index_Price = (1070.083) + (118.233)*X1 + (-80.815)*X2
# prediction with sklearn
New_Interest_Rate = 2.75
New_Unemployment_Rate = 5.3
# create a scaled feature array and make the prediction
feature_array = scaler.transform([[New_Interest_Rate, New_Unemployment_Rate]])
print('Predicted Stock Index Price: \n', regression_model.predict(feature_array))

Stats models
Same analysis, but this time with sm. Again, I stress that the more important package to know is sklearn.
# with statsmodels
X = sm.add_constant(x) # adding a constant
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
print_model = model.summary()
print(print_model)

# Summary graphs:
import scipy.stats as stats
import statsmodels.api as statsmodels
from statsmodels.graphics.regressionplots import *
import seaborn as sns
import numpy as np
sns.histplot(model.resid, kde=True)  # histplot replaces the deprecated distplot
plt.grid(True)
plt.xlabel('Residual')
plt.ylabel('Frequency')
plt.title('Residual Histogram')
plt.show()
# Residuals vs Fitted Values
residuals = model.resid  # residuals of the fitted statsmodels model
fitted_vals = model.predict(X)  # predictions from the fitted model
plt.plot(fitted_vals, residuals, 'o')  # plot fitted values against residuals
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs. Fitted Values')
plt.show()
# create Q-Q plot to ensure normality
statsmodels.qqplot(residuals, stats.norm, fit=True, line='45')
# leverage plot (Cook's Distance), using influence_plot from statsmodels.graphics.regressionplots
influence_plot(model)




Conclusion
We have explored linear models in depth and run some Jupyter Notebook code cells to explore least-squares linear regression. Remember that "linear" refers to the parameters – in the stock example above the data is multivariate: we used interest rates AND unemployment rates to predict a numeric outcome, the stock price GIVEN those inputs. The importance of homoscedasticity of the residuals was emphasized, as were, albeit briefly, leverage and outliers.
In my next article, I will expand these concepts and explore using bootstrap with regression models!
Find me on Linkedin