Hello and welcome to this FULL IN-DEPTH, and very long, overview of Regression Analysis in Python! In this deep dive, we will cover Least Squares, Weighted Least Squares; Lasso, Ridge, and Elastic Net Regularization; and wrap up with Kernel and Support Vector Machine Regression! Although I'd like to cover some advanced Machine Learning models for regression, such as random forests and neural networks, their complexity demands its own future post! In this post I will approach Regression Analysis from two sides: Theoretical and Application. From the Theoretical side I will introduce the algorithms at a basic level and derive their base solutions, while on the Application side I will use sklearn in Python to actually apply these models to a real-life dataset!
Table of Contents
- What is Regression?
- Our Data Set – Medical Cost
- How to Measure Error?
- Least Squares Solution (MLR)
- Interpretation of the Model
- Weighted Least Squares (WLS)
- How to deal with Overfitting – Regularization
- How to deal with Underfitting – Kernel Regression
- Support Vector Machines
- Conclusion
What is Regression?
In the realm of Machine Learning, tasks are often split into four major categories: Supervised Learning, Unsupervised Learning, Semi-Supervised Learning, and Reinforcement Learning. Regression falls into the domain of Supervised Learning, where the goal is to learn or model a function that maps a set of inputs to a set of outputs. In Supervised Learning, the set of outputs is commonly called the dependent variable in statistics or the target variable in the Machine Learning community. When this target variable is discrete the task is commonly called Classification, and when it is continuous the task is commonly called Regression. In this way, Regression is simply trying to predict a continuous target variable given a set of inputs.
Our Data Set – Medical Cost
To give some application to the theoretical side of Regression Analysis, we will be applying our models to a real dataset: Medical Cost Personal. This dataset is derived from Brett Lantz's textbook, Machine Learning with R; all of the datasets associated with the textbook are royalty free under the Database Contents License (DbCL) v1.0.
This dataset contains 1338 medical records of different individuals, recording a few metrics: age, sex, bmi, number of children, whether or not they smoke, and the region they live in. The goal is to use these features to predict the person's 'charges', i.e., their medical cost.
Because this will already be a long post, I will not go over the exploratory analysis and pre-processing steps in detail; the short version is that the categorical features are encoded as dummy variables and the target is transformed, as described below.
Here is the code on how to load in the dataset, split it into the feature and target variables, as well as partition it into a testing and training set:
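Since the original snippet is not reproduced here, below is a minimal sketch of what this step might look like. It assumes the data is saved locally as insurance.csv, that the categorical columns are named sex, smoker, and region, and that an 80/20 split is used; all of those details are assumptions, not something fixed by the post.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Medical Cost Personal dataset (assumed to be saved locally as insurance.csv)
df = pd.read_csv("insurance.csv")

# One-hot encode the categorical features so the linear models can use them
df = pd.get_dummies(df, columns=["sex", "smoker", "region"])

# Separate the features from the target and log-transform the heavily skewed charges
X = df.drop(columns=["charges"])
y = np.log(df["charges"])

# Hold out 20% of the records as the testing set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```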
The reason why we split our dataset into training and testing sets is that we train our model on the training set and then evaluate how well it generalizes to data it has not seen before, the testing set. In addition, I've also had to perform a Logarithm transformation on our target variable, as it follows a heavily skewed distribution. Under some principal assumptions of Least Squares, Y needs to follow a Normal distribution, which will be explained later. For now, heavily skewed positive distributions can be made to follow a roughly normal distribution through either a Logarithm or BoxCox transformation.
How to Measure Error?
In the Machine Learning community there has been a lot of research and debate on the best way to measure error. For Regression, most error measurements are derived from a concept found in Linear Algebra known as norms. A norm is a measurement of how 'big' a tensor/matrix/vector is. If norms measure how big a tensor is, then the goal of a Machine Learning model is to minimize the norm of the difference between our expected output and the predicted output! In mathematical format, the p-norm of a vector x is commonly defined to be:

$$\|x\|_p = \left(\sum_{i} |x_i|^p\right)^{1/p}$$
Where p is a parameter that changes the measurement. Below are some of the most common p-norms:

$$\|x\|_1 = \sum_i |x_i| \qquad \|x\|_2 = \sqrt{\sum_i x_i^2} \qquad \|x\|_\infty = \max_i |x_i|$$
How do we use these norms to help us measure error? Well, we can use them to measure the difference between our model's prediction, f(x), and the actual target variable, y. Thus, error is the difference between the prediction and the actual value, quantified as a single number using a norm:

$$\text{Error} = \|y - f(x)\|_p$$
The two most common error measurements in Machine Learning are Mean Squared Error (MSE) and Mean Absolute Error (MAE):

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - f(x_i)\right)^2$$

$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - f(x_i)\right|$$
In the Machine Learning models we'll look into today, MSE is chosen as the measurement to quantify error due to the convex nature of squaring the error – in layman's terms, numerical methods have an easier time minimizing squared errors than absolute errors because the derivative of the absolute value is undefined at zero.
There exists only one problem with the error measurements described above: they do not explain how well the model performs relative to the target value, only the size of the error. Does a large error mean a poor model? Does a small error mean a good model? A good model can have an extremely large MSE, while a poor model can have a small MSE if the variation of the target variable is small. For example, suppose we have two different lines for two different datasets. The prediction for the dataset on the left has a lower MSE than the one on the right; does that mean the model on the left is better? I'm guessing that you'd say the prediction line on the right is better than the one on the left despite having a higher MSE, as the dataset on the right has higher variation within the Y variable.


The problem with only using MSE or MAE is that they do not take into account the variation of the target variable. If the target variable has a lot of variance, as in the dataset on the right, then the MSE will naturally be higher. A popular metric used to take the variation of the target variable into account is the Coefficient of Determination, commonly called R Squared:

$$R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_i \left(y_i - f(x_i)\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2}$$
As we can see from above, R Squared is one minus the ratio of the Residual Sum of Squares (RSS) to the Total Sum of Squares (TSS). R Squared ranges over (-infinity, 1], and its interpretation is the percentage of the variation of the target variable that the model explains. For example, suppose a model has an R Squared value of 0.88; then that model explains approximately 88% of the variability of the target variable. Hence, larger R Squared values are more desirable, as the model explains a greater percentage of the target variable's variation. However, if the RSS of the model is larger than the TSS, then the R Squared metric will be negative, which means that the error of the model outweighs the variance of the target, aka the model sucks.
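All three of these metrics are available in sklearn. Here is a tiny sketch showing them side by side on a pair of hypothetical actual/predicted arrays (the numbers are made up purely for illustration):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical actual and predicted values, just to show the three metrics side by side
y_true = np.array([3.1, 2.4, 5.8, 4.0, 6.2])
y_pred = np.array([2.9, 2.7, 5.5, 4.3, 5.9])

print("MSE:", mean_squared_error(y_true, y_pred))   # average squared residual
print("MAE:", mean_absolute_error(y_true, y_pred))  # average absolute residual
print("R^2:", r2_score(y_true, y_pred))             # 1 - RSS/TSS
```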
Least Squares Solution (MLR)
Now that we've defined our error measurement, it's time to introduce our first classic Machine Learning model, Least Squares! As with most of the models to be discussed, Least Squares works off the assumption that the dependent/target variable is a linear combination of the feature variables (assuming k features):

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$$
The goal of each coefficient is to act as the slope for its respective input variable, and the intercept acts as the value the target variable takes when all input variables are zero. With this strong assumption above, our goal is to find accurate estimates of the coefficients:

$$y = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \dots + \hat{\beta}_k x_k + \epsilon$$
Because our beta estimates are not going to be exact, we will have an error term, epsilon. This can be written in matrix format as the following:

$$y = X\beta + \epsilon$$
To derive the estimated coefficients for beta, there are two main derivations; I will give both. First, using simple matrix manipulation (ignoring the error term and multiplying both sides by X transpose):

$$X^T y = X^T X \beta \;\;\Longrightarrow\;\; \hat{\beta} = \left(X^T X\right)^{-1} X^T y$$
The second derivation is the most common: minimizing the Expectation of the squared difference using gradients. In statistics, Expectation is commonly defined to be the weighted mean of a random variable, so the loss we want to minimize is the expected squared error:

$$J(\beta) = E\left[\left(y - f(x)\right)^2\right] = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - x_i^T\beta\right)^2$$
Which can be converted into matrix format:

$$J(\beta) = \left(y - X\beta\right)^T\left(y - X\beta\right)$$
Then, we can find the gradient of J, set it equal to zero, and find the analytical solution for beta!

$$\nabla_\beta J = -2X^T\left(y - X\beta\right) = 0 \;\;\Longrightarrow\;\; \hat{\beta} = \left(X^T X\right)^{-1} X^T y$$
As we can see, both methodologies lead to the exact same solution! However, if you've been paying close attention, we've made three big assumptions: Y is distributed Normally; X^T*X is invertible; and the expected value of epsilon is zero with constant variance. Although these assumptions are sometimes violated in practice, the Least Squares model still performs well! I hope you now understand why we had to perform a logarithmic transformation on our target variable to achieve Normality!
The time complexity for standard Least Squares is O(k³): inverting an n by n matrix is an O(n³) operation, but the matrix we actually invert, X^T*X, is k by k, where k is the number of features/columns.
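Before turning to sklearn, here is a tiny numpy sketch of the closed-form solution on made-up synthetic data, just to make the formula concrete; the variable names and the true coefficients (2 and 3) are arbitrary choices:

```python
import numpy as np

# Tiny synthetic example: y = 2 + 3x plus noise
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=100)
y_demo = 2 + 3 * x_demo + rng.normal(0, 1, size=100)

# Design matrix with a column of ones for the intercept
X_demo = np.column_stack([np.ones_like(x_demo), x_demo])

# Closed-form solution: beta = (X^T X)^(-1) X^T y (solve() avoids an explicit inverse)
beta = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)
print(beta)  # should land close to [2, 3]
```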
Now that we've discussed the theoretical background for Least Squares, let's apply it to our problem! We can use the LinearRegression object from the sklearn library to implement our Least Squares solution!
To assess our model, we can look at the MSE and R² value both on the testing and training dataset:
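A hedged sketch of what fitting and scoring might look like, reusing the X_train/X_test split from the earlier loading sketch:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Fit ordinary Least Squares on the training partition
lr = LinearRegression()
lr.fit(X_train, y_train)

# Report MSE and R^2 on both partitions
for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = lr.predict(X_part)
    print(name, "MSE:", mean_squared_error(y_part, pred), "R^2:", r2_score(y_part, pred))
```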


As we can see from the barplot on the left hand side, the MSE for both the training and testing sets is extremely low, only around 0.19; however, if you remember, the target variable underwent a Logarithmic Transformation, meaning that this MSE is not as small as it looks given the scale of the log-transformed target. Therefore, a better measurement is the R² value, which we can see from the barplot on the right hand side is only decent. Our model explains approximately 79% of the variability of the target variable for the training set and roughly 76% for the testing set. Although this result is reasonable for some scenarios, for this simple dataset this is a sign of underfitting, when a model performs poorly at predicting the target variable. We will cover some methodologies on how to fix this problem later.
Interpretation of the Model
One big advantage of Linear Regression over some other Regression models is its simplicity and explanatory power. Each beta coefficient can be assessed to explain how the model is achieving its predictions. Machine Learning models like this are known as White Box Methods, meaning that it is apparent how the model arrives at its output. On the other hand, Machine Learning models where the process by which the prediction was produced cannot be traced are known as Black Box Methods.
For example, here are our beta coefficient values:

Unfortunately, because we scaled the target variable using a logarithm, the coefficient values explain the log of the target. To combat this, we can rescale the coefficients by the inverse of the logarithm, exponentiation. Here are our exponentiated beta coefficients:

The interpretation of an exponentiated beta coefficient is the multiplicative (percentage) change in the target variable. For example, when the person is a smoker, their medical cost increases by 116.8% ((2.168–1)·100). On the other hand, if the person is not a smoker then their medical cost will decrease by 53.9% ((0.461–1)·100). Essentially, any exponentiated beta coefficient larger than one implies a percentage increase in medical costs, while any coefficient smaller than one implies a percentage decrease. In addition, we can see that the largest exponentiated beta coefficient belongs to smokers, meaning that variable, whether the person is a smoker, has the largest influence on medical cost of all the variables.
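As a sketch of how these raw and exponentiated coefficients could be pulled out of the fitted model (assuming the lr object and the one-hot encoded feature names from the earlier sketches):

```python
import numpy as np
import pandas as pd

# Pair each feature with its raw and exponentiated coefficient
coef_table = pd.DataFrame({
    "beta": lr.coef_,
    "exp_beta": np.exp(lr.coef_),  # undoes the log scale of the target
}, index=X_train.columns)

print(coef_table.sort_values("exp_beta", ascending=False))
```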
Let’s suppose that we did not perform a logarithmic transformation, how would we interpret the beta coefficients then?

The interpretation is then in terms of the unit scale of the target variable. Because our target variable was measured in dollars, we can see that if a person is a smoker, then their medical cost increases by $11,907, or decreases by $11,907 if they are not a smoker. For age, we can see that as the person grows older, their medical cost increases by $264 per year of age. If we look at sex, the coefficients are the same, meaning that medical cost does not change based on whether the person is male or female. Hopefully you can see the power of interpreting the coefficients of Least Squares Regression.
Weighted Least Squares Solution
One of the main assumptions made under Least Squares is that the errors, epsilon, are Normally distributed with mean zero and constant variance:

$$\epsilon \sim N\left(0, \sigma^2 I\right)$$
One can check this assumption by plotting the residuals, f(x)−y, versus the actual values, in what is commonly called a residual plot. Here is our residual plot from our previous model on the training sample:
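A minimal matplotlib sketch of how such a plot could be produced, assuming the fitted lr model and training split from the earlier sketches:

```python
import matplotlib.pyplot as plt

# Residuals (prediction minus actual) against the actual log-charges on the training set
residuals = lr.predict(X_train) - y_train

plt.scatter(y_train, residuals, alpha=0.4)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Actual log(charges)")
plt.ylabel("Residual")
plt.title("Residual plot (training set)")
plt.show()
```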


As we can see from above, our residual errors have neither a mean of zero nor constant variance, and they are highly non-linear. In addition, we can see that the squared residuals show a slight upward trend as the target variable approaches its max value. Residual errors with non-constant variance are called heteroscedastic. One way to combat heteroscedasticity is through Weighted Least Squares. Weighted Least Squares is like standard Least Squares; however, each observation is weighted by its own unique value. In this way, observations with larger weights are favored by the model over those with smaller weights. To give an example of the power of adding weights, below we have two prediction lines, one unweighted and one weighted. As we can see from the weighted prediction, the instances with higher weights get a better fit, as the model gravitates toward fitting the prediction line to those points more than to instances with lesser weights.


Now it is time for the derivation of the Weighted Least Squares solution. To start off, we want to minimize the expectation of the weighted residual error:

$$J(\beta) = \sum_{i=1}^{n} w_i\left(y_i - x_i^T\beta\right)^2$$
This can be converted to matrix format:

$$J(\beta) = \left(y - X\beta\right)^T W \left(y - X\beta\right)$$
Now the partial derivative can be found and set to zero to find the analytical solution:

$$\nabla_\beta J = -2X^T W\left(y - X\beta\right) = 0 \;\;\Longrightarrow\;\; \hat{\beta} = \left(X^T W X\right)^{-1} X^T W y$$
As we can see from above, the solution is extremely similar to that of Linear Regression, except with a diagonal matrix W, containing the weights for each instance.
One of the main powers of WLS is the ability to weight different instances to give preference within the model, either to create homoscedasticity (constant variance) or to better model certain records. However, one of the main drawbacks of WLS is determining the weights. In our problem, we want to fix our residuals to have constant variance. Below I have depicted the four major types of residual variances found in practice. The top left showcases the ideal, where the variance is constant with a mean of zero. The top right showcases how the residual variance grows with y, revealing a 'MegaPhone' type distribution. The bottom left depicts non-linear residuals, revealing that our model lacks the complexity to capture the association. Lastly, the bottom right showcases a binomial residual variance. WLS is commonly used only when a binomial or MegaPhone type residual plot is found, as non-linear residuals can only be fixed by the addition of non-linear features.




A common solution for Binomial and MegaPhone residuals is to make the weights equal their squared residual error:

As we can see, this intuitively makes sense: we weight instances based on how large their error is.
In sklearn, this is simple: just create another model and pass the extra sample_weight argument to fit:
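A hedged sketch, reusing the earlier lr fit and split, with the weights chosen as the squared residuals of the unweighted model (one possible scheme, per the discussion above):

```python
from sklearn.linear_model import LinearRegression

# One possible weighting scheme: the squared residuals of the unweighted fit
weights = (lr.predict(X_train) - y_train) ** 2

# sample_weight lets each training record contribute its own weight to the squared error
wls = LinearRegression()
wls.fit(X_train, y_train, sample_weight=weights)
print("test R^2:", wls.score(X_test, y_test))
```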
If you were to test these weights, the residual plot would look similar, with similar R² and MSE scores; this is because our residual variance is highly non-linear:

For this situation, we only have one solution: add non-linear terms in the hope of explaining the variance. We will cover this with Kernel Regression, but first we need to discuss Regularization.
How to deal with Overfitting – Regularization
Suppose we trained a Linear Regression model on a given dataset, and during its application and deployment we found out that it was performing extremely poorly, despite having good MSE and R² scores on the training data; this is known as overfitting – when the metrics on the Testing Dataset are much worse than the Training dataset. To give an example:

As we can see from above, we have a linear trend of points; however, if we were to fit a 10th Degree Polynomial, we could artificially drive the MSE to zero (and the R² to one) on our training dataset. Despite this, we can see intuitively that the model will generalize poorly when new data is seen. Regularization works by adding a Penalty Term to the loss function that penalizes the parameters of the model; in our case for Linear Regression, the beta coefficients.
There are two main types of Regularization when it comes to Linear Regression: Ridge and Lasso. First, let's start off with Ridge Regression, commonly called L2 Regularization, as its penalty term squares the beta coefficients to obtain the magnitude. The idea behind Ridge Regression is to penalize large beta coefficients. The loss function that Ridge Regression tries to minimize is the following:

$$J(\beta) = \sum_{i=1}^{n}\left(y_i - x_i^T\beta\right)^2 + \lambda\sum_{j=1}^{k}\beta_j^2$$
As we can see from above, the loss function is exactly the same as before, except now with the addition of the penalty term. The parameter lambda scales the penalty. For example, if lambda=0, then the function is the same as in Least Squares; however, as lambda grows larger the model will tend toward underfitting, as the beta coefficients are penalized toward zero. Let's look at a simple example below:

As we can see from above, when using a 20th Degree Polynomial model to approximate the points with lambda=0, we have no penalization and exhibit extreme overfitting in the blue line. As we start to increase lambda to 0.5, denoted by the orange line, we start to truly model the underlying distribution; however, we can see that when lambda=100, in the purple line, our model starts to become a straight line, leading to underfitting as the penalty term forces the coefficients to zero. Choosing the lambda value in practice is performed either on a Validation Set or through Cross-Validation: we retrain our model on the training dataset with different lambda values, and whichever one performs best on the Validation Set is chosen as the final model for the Testing dataset.
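As a hedged sketch of what cross-validated tuning could look like in sklearn (note that sklearn names the penalty alpha rather than lambda, and the grid and fold count below are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# sklearn calls the lambda penalty "alpha"; try a log-spaced grid with 5-fold cross-validation
alphas = np.logspace(-3, 3, 25)
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X_train, y_train)
print("best lambda:", ridge_cv.alpha_)
```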
Mathematically speaking, our loss function can be transformed into matrix form:

$$J(\beta) = \left(y - X\beta\right)^T\left(y - X\beta\right) + \lambda\beta^T\beta$$
Where beta can be solved for as previously, by finding the gradient and setting it to zero:

$$\hat{\beta} = \left(X^T X + \lambda I\right)^{-1} X^T y$$
Now that we've discussed the mathematical theory behind Ridge Regression, let's apply it to our dataset. In practice, one would want to tune the lambda value on a validation set, not the testing set, in order to get a good estimate of the generalization error; however, I am going to do it on the testing set in order to save some room:
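A minimal sketch of that sweep, again assuming the earlier split; the lambda grid is an arbitrary choice:

```python
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Sweep the penalty strength (sklearn's "alpha" plays the role of lambda)
for lam in [0.001, 0.1, 1, 10, 100]:
    ridge = Ridge(alpha=lam)
    ridge.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, ridge.predict(X_train))
    test_mse = mean_squared_error(y_test, ridge.predict(X_test))
    print(f"lambda={lam}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```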

As we can see from above, as we increase our lambda value, our error on the training and testing sets increases drastically; in addition, it appears that the minimum error on the testing set is around lambda=0. If you remember correctly, our dataset was not suffering from overfitting but from underfitting! Therefore, it makes no sense to use regularization, which is why our testing error is getting worse instead of better! I just wanted to show how one could use Ridge Regression if your model were exhibiting overfitting!
The next regularization method to be covered is Lasso, which is commonly called L1 Regularization, as its penalty term is built off the absolute value of the beta coefficients:

$$J(\beta) = \sum_{i=1}^{n}\left(y_i - x_i^T\beta\right)^2 + \lambda\sum_{j=1}^{k}\left|\beta_j\right|$$
Notice that the only difference between Ridge and Lasso Regularization is that Ridge squares the beta coefficients while Lasso takes the absolute value. The main difference between the two is that Ridge penalizes the size of the beta coefficients, whereas Lasso will drive some of the beta coefficient values to zero, leading to feature selection.
These types of penalty terms can often be rewritten as a constrained problem:

$$\min_\beta \sum_{i=1}^{n}\left(y_i - x_i^T\beta\right)^2 \quad \text{subject to} \quad \sum_{j=1}^{k}\beta_j^2 \le \lambda \;\;\text{(Ridge)} \quad \text{or} \quad \sum_{j=1}^{k}\left|\beta_j\right| \le \lambda \;\;\text{(Lasso)}$$
Because Ridge Regression squares the beta coefficients, plotting the constraint region leads to a circle, whereas Lasso leads to a square. If we were to plot two beta coefficient values (called w1 and w2 in the graph below) against each other, we might end up with the following:

The red line represents the range of values that the two coefficients can take on: as the value of w1 increases, the value of w2 decreases. When we plot our L1 norm constraint, |w1|+|w2| ≤ lambda, we see it denoted by the dotted square. Wherever this square intersects the red line is the chosen value for the coefficients, which we can see would cause w1 to have a value of zero. On the other hand, when we plot our L2 norm constraint, w1²+w2² ≤ lambda, we get a circle, denoted by the dotted circle. Wherever this circle intersects the red line is the chosen value for the coefficients, which we can see are both small nonzero values for w1 and w2.
Unfortunately, finding the analytical solution for beta in Lasso Regularization is difficult using matrix calculus, as the gradient of the absolute value operation is undefined at zero; therefore, numerical methods like Coordinate Descent are often utilized. Because of the complex nature of these algorithms I will not detail the math. In Python, Lasso Regression can be performed as follows:
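A hedged sketch with sklearn's Lasso, again reusing the earlier split; the lambda grid is an arbitrary choice, and the zeroed-coefficient count is printed to show the feature-selection effect:

```python
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

# Sweep the L1 penalty; large values drive coefficients to exactly zero
for lam in [0.001, 0.01, 0.1, 1, 15]:
    lasso = Lasso(alpha=lam)
    lasso.fit(X_train, y_train)
    n_zeroed = int((lasso.coef_ == 0).sum())
    test_mse = mean_squared_error(y_test, lasso.predict(X_test))
    print(f"lambda={lam}: test MSE={test_mse:.3f}, zeroed coefficients={n_zeroed}")
```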

As we can see from above, as we increase our lambda value, our error on the training and testing sets increases drastically, eventually converging around lambda=15. The reason the error converges is that our lambda value becomes too large for the model and drives all the beta coefficients to zero. If you remember correctly, our dataset was not suffering from overfitting but from underfitting! Therefore, it makes no sense to use regularization, which is why our testing error is getting worse instead of better! I just wanted to showcase how you could use Lasso Regression if your model were exhibiting overfitting!
The last Regularization technique I am going to introduce is Elastic Net, which came about to harmonize Ridge and Lasso, as Ridge penalizes large coefficients whereas Lasso drives coefficients to zero. The idea behind Elastic Net is to create a penalty that will both perform feature selection and minimize the size of the weights. There are many different versions of Elastic Net; here are the two most common:

As one can see, the penalty term is a combination of Ridge and Lasso, each with its own lambda value to control how much each penalty term affects the model.
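In sklearn, Elastic Net is parameterized slightly differently: one overall strength (alpha) plus an L1/L2 mixing ratio (l1_ratio) rather than two separate lambdas. A hedged sketch, with arbitrary hyperparameter values:

```python
from sklearn.linear_model import ElasticNet

# alpha is the overall penalty strength, l1_ratio mixes the L1 and L2 terms
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X_train, y_train)
print("test R^2:", enet.score(X_test, y_test))
```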
How to deal with Underfitting – Kernel Regression
We've discussed what to do when a model starts to overfit, but what about when a model underfits? In our example dataset thus far, our model has shown various signs of underfitting: non-linear residuals and a poor R² value on a relatively simple dataset. The most common way to deal with underfitting is to utilize a Kernel. A Kernel is a density function that satisfies three main properties:

$$K(u) \ge 0 \;\;\text{for all } u, \qquad \int K(u)\,du = 1, \qquad K(-u) = K(u)$$
Kernel Regression is often called a non-parametric regression technique by the Statistics Community. The premise behind using a Kernel is that if we map our input variables to a higher dimension, then the problem can often be more easily classified or predicted. The easiest example of how this works in practice is a simple classification problem. Suppose we have two groups that we wish to separate using only a line/hyperplane. We can see down below that no line will be able to separate the two groups:


However, if we were to transfer the data into a higher dimension (as we can see on the right hand side), now there exists a hyperplane capable of classifying the data.
To give an example for regression, suppose we only have one feature variable, X, where the target variable Y is equal to X³. We can see below in the bottom left picture that a linear model will fail to accurately represent this data. However, if instead we project our feature variable, X, to a higher dimension, X³, then we can see that our linear model fits a perfect line.


I hope now I've convinced you of the power of projecting our feature variables to a higher dimension! However, the question becomes, how do we do so? First, we need to define a function, commonly denoted as phi, that maps our variables to a higher dimensional input space. In Kernel Regression, this is performed through Kernel Functions. One of the most popular and basic kernels is the Polynomial kernel, which simply raises the feature variables to a power. Let's take, for example, the simple case where we have only two variables, x1 and x2, and we want to map them to a higher dimensional space using a polynomial kernel with power 2 (one standard choice for the degree-2 mapping):

$$\phi(x) = \phi\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) = \begin{bmatrix} x_1^2 \\ \sqrt{2}\,x_1 x_2 \\ x_2^2 \end{bmatrix}$$
As we can see from above, we mapped our original data, x1 and x2, to a higher dimension using the phi function with a polynomial power of 2. The only problem is that the size of the mapped feature space grows with the power of the polynomial, on the order of O(k^p). We can avoid this cost through the Kernel Trick. First, let's rewrite our loss/error metric using phi(x). Note that Kernel Regression utilizes Ridge Regression, as the coefficients tend to be extremely large, which is why this method is commonly called Kernel Ridge Regression:

$$J(\beta) = \sum_{i=1}^{n}\left(y_i - \phi(x_i)^T\beta\right)^2 + \lambda\|\beta\|^2 \;\;\Longrightarrow\;\; \hat{\beta} = \frac{1}{\lambda}\sum_{i=1}^{n}\left(y_i - \phi(x_i)^T\hat{\beta}\right)\phi(x_i) = \sum_{i=1}^{n}\alpha_i\,\phi(x_i)$$
We can see that the derivation of beta is actually recursive, meaning the optimal beta is a function of itself. However, if we were to plug this beta back into our error metric we get:

As you can see, when we rewrite our loss function with this new beta value, we get terms of the form phi(x_i)·phi(x_j), where computing phi(x) is an O(k^p) operation, which makes this a very time consuming procedure. However, the Trick is that:

$$K(x_i, x_j) = \phi(x_i)\cdot\phi(x_j)$$

That is, the kernel function K can be evaluated directly from the original inputs x_i and x_j, without ever computing the high dimensional mapping phi.
To give a concrete example, let's apply this to our previous kernel function, a polynomial with power of two:

$$\phi(x)\cdot\phi(z) = x_1^2 z_1^2 + 2x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = \left(x_1 z_1 + x_2 z_2\right)^2 = \left(x\cdot z\right)^2 = K(x, z)$$
As we can see from above, the Kernel Trick is the fact that the dot product of two data points in the high dimensional mapping is the same as applying the kernel function directly to the original pair of points! This saves a lot of time and computational resources! Now, we can substitute this back into our loss function to get:


As we can see from above, the format of the loss function is very similar to Least Squares, except with K playing the role of X and alpha playing the role of beta. To find the optimal beta, we first find the optimal solution for alpha, then plug that into beta!

Recall that the time complexity for standard Least Squares is O(k³); for Kernel Regression the matrix we must work with, K, is n by n, so the time complexity becomes O(n³), which is extremely computationally heavy when there is a lot of data! A common solution is to simply sample data from the total dataset so that n stays small.
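Before moving on, here is a tiny numpy check of the Kernel Trick identity from the polynomial example above. It assumes the homogeneous degree-2 polynomial kernel K(x, z) = (x·z)² and uses two made-up points:

```python
import numpy as np

# Explicit degree-2 feature map for two features (homogeneous polynomial kernel)
def phi(v):
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

a = np.array([1.5, -2.0])
b = np.array([0.5, 3.0])

mapped = phi(a) @ phi(b)   # dot product taken in the mapped space
kernel = (a @ b) ** 2      # kernel applied directly to the original dot product
print(mapped, kernel)      # both print 27.5625
```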
Another hyperparameter that needs to be tuned for Kernel regression is choosing which Kernel Function to utilize. The three most common are the following:

As we can see from above, each Kernel Function will have its own set of hyperparameters to tune, adding to the complexity.
Now that we've discussed the theoretical background, let's apply Kernel Ridge Regression to our problem! For this example I will only showcase the Polynomial Kernel, as it is the most common. Because Kernel Ridge also has a lambda/penalty term, I will show the influence of increasing the penalty term on the testing dataset. Note that in practice one would want to do this on a validation set, not the testing set.
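A hedged sketch of that sweep with sklearn's KernelRidge, reusing the earlier split; the degree and lambda grid are arbitrary choices, and sklearn again calls the penalty alpha:

```python
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import r2_score

# Degree-2 polynomial kernel; sklearn's "alpha" plays the role of lambda
for lam in [0.01, 0.1, 1, 10]:
    kr = KernelRidge(kernel="poly", degree=2, alpha=lam)
    kr.fit(X_train, y_train)
    print(f"lambda={lam}: "
          f"train R^2={r2_score(y_train, kr.predict(X_train)):.3f}, "
          f"test R^2={r2_score(y_test, kr.predict(X_test)):.3f}")
```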

As we can see from the plot above, increasing the penalty term lambda actually decreases the R² value on both the testing and training sets; however, in practice this might not be the case, so always test out different regularization values. Immediately, however, we can see that using Kernel Regression increased the R² on the testing dataset from 0.76 to 0.83, meaning that our model now explains approximately 83% of the variability of the target variable, a little bit better than 76%. Now let's examine the residual plot on the training dataset in comparison with the standard Linear Regression:


As we can see from above, our residual plot for Kernel Ridge (on the right hand side) definitely evens out the variance of the residuals to a constant for Y values between 7 and 9; however, the extremely large residuals for Y values between 9 and 10.5 still exist; indicating that our model is underfitting some of those points.
One of the downfalls of Kernel Regression is that the interpretability of the model is lost, as the coefficients now correspond not to the feature variables but to the data observations, since the prediction for new data is given by:

$$\hat{y}_{new} = K\left(X_{new}, X_{train}\right)\alpha = \sum_{i=1}^{n}\alpha_i\,K\left(x_{new}, x_i\right)$$
As we can see, for a new prediction we form a new kernel matrix, K, from the dot products between the new data and the data the model was trained upon, multiplied by the alpha vector holding the coefficients. Due to this high dimensional mapping, the interpretability of how the model achieved its results from the feature variables alone is lost, making Kernel Regression a Black Box Method.
However, you might be thinking to yourself: if Kernel Regression is a black box method because the high dimensional projection is summed up into a single value between data instances, why don't we manually project our feature space ourselves? For example, suppose we have a feature space with three variables and project it to a second degree polynomial:

Now we've projected our initial data dimension to a higher dimension, allowing us to perform ridge regression to obtain white-box beta coefficients! However, the problem is that our Least Squares derivation assumes (X^T*X) is invertible, which requires the columns of X to be linearly independent; the new projected columns are built directly from the original columns and tend to be highly correlated with them, so X^T*X becomes ill-conditioned or nearly singular. In this way, the more higher dimensional terms we add, the more likely the inverse is unstable or does not exist. Kernel Regression sidesteps this problem, as it works with the dot products between data instances, which we assume are sampled independently.
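If you do want to try the manual route, here is a hedged sketch using sklearn's PolynomialFeatures together with Ridge, whose penalty keeps the nearly collinear expanded columns under control; the degree and penalty value below are arbitrary choices:

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Explicitly expand the features to degree 2 (squares plus pairwise interactions),
# then rely on the ridge penalty to keep the nearly collinear columns in check
poly_ridge = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    Ridge(alpha=1.0),
)
poly_ridge.fit(X_train, y_train)
print("test R^2:", poly_ridge.score(X_test, y_test))
```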
Support Vector Machines
For our last method in this deep dive into Regression Analysis, we will look at a close counterpart to Kernel Ridge Regression, Support Vector Machines (SVMs). To give the basic intuition behind SVMs, let's switch over to the objective of classification, where we want to find a decision boundary to classify two groups and we have three possible models:

The problem is that all three decision boundaries correctly classify all points, so which one is better? The ideal model would be the red line, as it is not too close to class 1 or class 2. SVMs solve this problem by adding a margin about the decision boundary; the points that lie on this margin are commonly called the Support Vectors:

By adding this margin, our model has the ability to 'feel' out the data and find a decision boundary that minimizes the error within the margin. There are two types of SVMs, Soft Margin and Hard Margin. Hard Margin forces the model to find a decision boundary such that no data instance falls inside the margin, whereas Soft Margin allows instances to fall inside it. Hard Margin only works on linearly separable data and is extremely sensitive to outliers, therefore Soft Margin is the most common type of SVM.
The width of this margin is commonly denoted as epsilon. The error function for Support Vector Regression is similar to that of Least Squares, in that it assumes the target variable is a linear combination of the feature variables:

$$f(x) = x^T\beta + \beta_0$$
However, the construction of the loss/error function is different than before: we want to minimize beta to ensure Flatness, meaning we want small beta coefficients so that no feature variable's coefficient becomes too large and leads to overfitting. In addition, we also want each residual error to be smaller than the margin width, denoted as epsilon:

$$\min_\beta \;\frac{1}{2}\|\beta\|^2 \quad \text{subject to} \quad \left|y_i - f(x_i)\right| \le \epsilon \;\;\text{for all } i$$
However, the problem is that a model satisfying this condition might not exist for the given epsilon (Hard Margin), leading to a surrogate formulation using slack variables (called Soft Margin):

$$\min_{\beta,\,\xi,\,\xi^*} \;\frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^{n}\left(\xi_i + \xi_i^*\right) \quad \text{subject to} \quad y_i - f(x_i) \le \epsilon + \xi_i, \;\; f(x_i) - y_i \le \epsilon + \xi_i^*, \;\; \xi_i, \xi_i^* \ge 0$$
Unfortunately, the mathematics used to solve this problem is no longer as easy as finding a derivative and setting it equal to zero, but involves quadratic programming. Because of this complexity, I am going to skip the math for the final solution.
Because SVMs utilize the data matrix X, non-linear mappings can be applied through kernel functions to achieve non-linear regression surfaces. I am going to skip the math behind this as it gets messy and complicated; however, the idea is the same as mentioned above for Kernel Ridge. The nice thing is that the Kernel Trick still applies here as well, saving time and computation. Now that we've talked about the theoretical side of SVMs, let's apply it to our problem!
As with Kernel Ridge Regression, there are a whole host of possible Kernel Functions to use; this time I am going to test three: Polynomial, RBF, and Linear. In addition, there are two more important hyperparameters that SVM needs, C and epsilon. Epsilon is the margin width and C is a regularization term. In practice, usually only the regularization term, C, is changed, as changing the margin width can quickly lead to poor results. Here we have the three kernels with default parameters at various C values evaluated on the testing set.
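A hedged sketch of that comparison with sklearn's SVR, reusing the earlier split; the C grid is an arbitrary choice and epsilon is left at sklearn's default:

```python
from sklearn.svm import SVR
from sklearn.metrics import r2_score

# Compare the three kernels over a range of C values, scoring on the testing set
for kernel in ["linear", "poly", "rbf"]:
    for C in [0.1, 1, 10, 100]:
        svr = SVR(kernel=kernel, C=C)  # epsilon left at sklearn's default of 0.1
        svr.fit(X_train, y_train)
        print(f"{kernel}, C={C}: test R^2={r2_score(y_test, svr.predict(X_test)):.3f}")
```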

As we can see, for this particular dataset, increasing the C value improves the R² value on the testing set for almost all three kernels. We can visibly see that the RBF kernel performs the best, so let's examine its results at C=100 a little more in depth:

As we can see, our R² on the testing dataset was better than Least Squares, explaining 81% of the variability of the target variable, but not quite as good as Kernel Ridge Regression with a Polynomial Kernel. However, note that this might not always occur in practice. One could examine the residual plot for this model but it would be very similar to the ones before as the R² is so similar.
Unfortunately, as with Kernel Ridge Regression, because SVMs find their coefficients based off kernels instead of the feature variables, the interpretation on how the model achieved its prediction is lost, making SVM a black box method.
Conclusion
If you’ve made it this far, congratulations! I hope you’ve learned a lot about Regression in the realm of Data Science and Machine Learning!
As a quick recap, we introduced our first model, Least Squares, which simply assumed that the target variable was a linear combination of the feature variables, where the goal was to find these coefficients. The problem that arose was that Least Squares is built off a few assumptions, namely that the errors have constant variance and a mean of zero. In practice this was violated, as assessing a residual plot revealed the non-linearity of the residuals. Assuming the residuals follow a particular trend, such as a binomial or megaphone shape, Weighted Least Squares can be utilized to build a model that satisfies these assumptions. One of the many pros of Least Squares and its derivatives is its white-box nature, meaning the model's predictions can be explained directly by the coefficients of the feature variables.
In the situation where our model had low training error but yet high test error, we needed to include regularization to prevent overfitting. We discussed three of the most common types of regularization: Ridge, Lasso, and Elastic Net. Ridge regularization shrinks the values of the coefficients while Lasso drives some coefficients to zero, and Elastic Net seeks to harmonize the two.
On the other end of the spectrum is underfitting, where our model had both high training and testing errors. In order to fix this problem, we projected our feature space to a higher dimension using kernel functions, in the hope that a prediction plane would be able to fit the data. This was performed through two methods, Kernel Ridge Regression and Support Vector Machines. The difference between the two is the formulation of the error/loss function, where SVMs also include a margin of error to minimize. However, the problem with these higher dimensional mapping models is that the interpretation of how the model achieved its prediction in terms of the feature variables is lost, making them both black-box methods.
In practice, there is no single best model to utilize. If you want to present your model to business stakeholders with no background in Machine Learning, then using Least Squares or Weighted Least Squares to explain the importance of the different features is beneficial, as both are white-box methods; you can then report the scores of the black-box methods, as they tend to perform better.
All in all, I hope you have learned a lot about the topics discussed above, both theoretically and for application!