
When Outliers are Significant: Weighted Linear Regression

Methods for weighted regression that incorporate significant outliers

Daniel Kulik
Towards Data Science
8 min read · Oct 21, 2022


Outliers are often mischievous. They can disrupt an otherwise simple regression by presenting themselves as data that is just as important as the rest, often skewing the fitted model. A straightforward approach is to use outlier detection methods to remove them from the dataset before fitting a model. But this has its caveats. Sometimes outliers are significant and essential to building a model that represents all of the observations/measurements. That, however, leads straight back to the problem of outliers skewing the fitted model. So, what can be done? You guessed it: weighted regression!

Weighted regression is defined as “a generalization of linear regression where the covariance matrix of errors is incorporated in the model”. In simple terms, this means that not all data points are equal in the eyes of the data scientist, and this inequality should be reflected in the fitted model. The imbalance can be addressed with a few techniques, ranging from inserting a new binary column that flags outliers to giving each data point its own weight of importance relative to the rest of the dataset.

The art of weighting data can often be an ambiguous one. Which data points are more important than others? Why? And what weight of importance should they receive? These are all questions a data scientist should ask when applying this method. A simple approach would be to use an outlier-robust regression model, such as the Huber regressor, to do the trick. However, many more advanced methods exist for weighting data, some using prior knowledge of the data itself and others applying more sophisticated statistical techniques. This article focuses on weighting the data prior to regression by employing both outlier detection and thresholding methods.

To get started, let's load and prepare the data we will use to fit our regression model. The tips dataset is an example dataset available in the Seaborn library from its online repository and is part of a collection of case studies for business statistics [1]. It consists of 244 tips received by a single waiter at a restaurant over a period of a few months. The dataset has 6 explanatory variables (X), with “tip” as the response variable (y). Apart from “total_bill” and “size”, the explanatory variables are categorical. In order to use these categorical variables for regression, the dataset requires some preparation: dummy variables (representing the presence or absence of a qualitative attribute) must be created for each of them. At the same time, let's take a look at how the dataset is distributed.
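A minimal sketch of this preparation, assuming pandas' get_dummies with drop_first=True (which yields the 8 explanatory variables mentioned below) and a Seaborn scatter plot colored by day; the article's exact preprocessing and plotting code may differ:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the tips dataset from Seaborn's online repository
tips = sns.load_dataset("tips")

# Create dummy variables for the categorical columns; drop_first=True
# avoids redundant columns and leaves 8 explanatory variables in total
X = pd.get_dummies(
    tips.drop(columns="tip"),
    columns=["sex", "smoker", "day", "time"],
    drop_first=True,
)
y = tips["tip"]

# Visualize how tip relates to total bill across days
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="day")
plt.show()
```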

Scatter plot of Tips dataset. Image by author and reproduced from Seaborn examples.

We now have 8 explanatory variables which we can use for regression. From the distribution plot we can see that, using the variable “total_bill” alone, there is a visible linear relationship between it and “tip”. However, something interesting happens at around the 25 “total_bill” mark: the data no longer follows the same linear relationship as before. One might argue that two or even three separate linear models are needed to describe this dataset. However, this article's focus is on a single weighted regression fit, not piecewise modelling. From what we can observe, Saturday and Sunday dinners seem to deviate the most from the “total_bill” to “tip” linearity. Since we want to fit a single linear model to the data, these data points may need to be weighted differently from the rest in order not to skew the predicted results. Before that, let's create a baseline regression model against which we can compare our differently weighted regression models.
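A sketch of this baseline fit, assuming scikit-learn's LinearRegression with the model scored on the full dataset (the original code is not shown, but this setup matches the kind of figures quoted below):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Fit an ordinary (unweighted) least-squares model on all 8 variables,
# reusing X and y from the preparation step above
baseline = LinearRegression()
baseline.fit(X, y)

y_pred = baseline.predict(X)
print(f"R-squared: {r2_score(y, y_pred):.4f}")
print(f"MSE: {mean_squared_error(y, y_pred):.4f}")
```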

Without weighting the data, we get a linear model with an R-squared score of 0.4699 and a mean squared error of 1.011. As mentioned during the quick data analysis above, a single linear model is too simplistic for this dataset. But we shall continue along this path and see what can be done by adding weights to the regression model.

First, let's talk about what types of weighting can be used in regression modelling.

  • Continuous weightings: each data point has a unique weight that follows some probability distribution function (e.g. a Gaussian distribution).
  • Discrete weightings: specific data points or ranges of data points have discrete weights assigned to them based on certain conditions (e.g. inlier/outlier).
  • Piecewise continuous weightings: a combination of both continuous and discrete weightings.

Before further explaining and demonstrating each weighting method's application in regression, it is worth visualizing these weightings first. For this, example functions will be used where the x-axis follows normalized (0–1) outlier decision scores. These scores estimate the likelihood that a value is an outlier relative to the rest of the dataset: the higher the decision score, the more likely it is an outlier. I will not go into further depth on likelihood estimation and its mathematical applications in weighted regression, but would rather refer you to Weighted Linear Regression if you wish to know more. On the y-axis we will have our weightings. So, let's visualize this!
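Here is one way such weighting functions could be generated. The specific functional forms and scale parameters below are illustrative guesses, not the exact curves from the original figure:

```python
import numpy as np
import matplotlib.pyplot as plt

# Normalized outlier decision scores: 0 = inlier-like, 1 = outlier-like
scores = np.linspace(0, 1, 200)

# Three illustrative continuous weightings
gaussian = np.exp(-0.5 * (scores / 0.3) ** 2)
exponential = np.exp(-3 * scores)
logistic = 1 / (1 + np.exp(10 * (scores - 0.5)))

# A discrete (binary) weighting and a piecewise continuous weighting:
# full weight below a threshold, then a fixed low weight or a Gaussian decay
discrete = np.where(scores < 0.5, 1.0, 0.1)
piecewise = np.where(scores < 0.4, 1.0,
                     np.exp(-0.5 * ((scores - 0.4) / 0.2) ** 2))

fig, (top, bottom) = plt.subplots(2, 1, figsize=(6, 6), sharex=True)
for w, label in [(gaussian, "Gaussian"), (exponential, "exponential"),
                 (logistic, "logistic")]:
    top.plot(scores, w, label=label)
for w, label in [(discrete, "discrete"),
                 (piecewise, "piecewise continuous")]:
    bottom.plot(scores, w, label=label)
for ax in (top, bottom):
    ax.set_ylabel("weight")
    ax.legend()
bottom.set_xlabel("normalized outlier decision score")
plt.show()
```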

Top: three types of continuous weighting functions. Bottom: a discrete and a piecewise continuous weighting function. Image by author, reproduced from Examples weights.

While that was a great visualization, what does it all mean, how were these weighting functions generated, and how can we apply them to regression? From what we can see above, we have quite a few weighting options to choose from. So, let's start with the continuous weighting type. We will use a Gaussian function as weights for the weighted regression on the tips dataset. To do so, we first need to find the outlier decision scores for the dataset. This will be done using the kernel density estimation (KDE) outlier detection method available from PyOD.
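A sketch of this step, assuming PyOD's KDE detector with its default settings, min–max normalization of the decision scores, and a Gaussian weighting whose bandwidth (sigma = 0.3) is an illustrative choice rather than the value used in the article:

```python
import numpy as np
from pyod.models.kde import KDE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Fit the KDE outlier detector on the prepared features and take
# its raw decision scores (higher = more outlier-like)
detector = KDE()
detector.fit(X.astype(float))
raw_scores = detector.decision_scores_

# Normalize the decision scores to the 0-1 range
scores = (raw_scores - raw_scores.min()) / (raw_scores.max() - raw_scores.min())

# Gaussian weighting: likely inliers (low scores) keep weights near 1,
# likely outliers keep small but non-zero weights
sigma = 0.3  # illustrative bandwidth; tune for your data
gauss_weights = np.exp(-0.5 * (scores / sigma) ** 2)

# scikit-learn's LinearRegression accepts per-sample weights directly
continuous_model = LinearRegression()
continuous_model.fit(X, y, sample_weight=gauss_weights)

y_pred = continuous_model.predict(X)
print(f"R-squared: {r2_score(y, y_pred):.4f}")
print(f"MSE: {mean_squared_error(y, y_pred):.4f}")
```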

By using a continuous weighting function, we get a linear model with an R-squared score of 0.4582 and a mean squared error of 1.033. You may notice that this performance is worse than that of the unweighted linear model! So why would a regression model with a worse R-squared score be the better choice? Did we just waste our time? And what have we achieved by using weights in the regression model? The simple answer is: this is to be expected…

To explain this more broadly: while it is true that the overall model performance decreased, the purpose of weighting the data was to assign more importance to data that was more likely to occur or be measured. This allows outliers that are still significant within the data to contribute to the model, while having only a minor importance to the overall fit. A lower model performance is therefore not an indication of poor fitting. Rather, it indicates that the way we measure our model's performance should now shift.

Well, that is all nice and dandy, but how can our weighted model's performance now be accurately evaluated? This brings us to the second type of weighting: discrete weighting. Let me explain. Since the weights are discrete, and in our example case binary, a clear distinction is made between inliers and outliers. With this distinction, a new subset arises from the original dataset against which to better evaluate our model performance metrics: the inliers.

To do this we will threshold the outlier decision scores using PyThresh, a library for thresholding outlier scores and a project that I am openly involved in. So, let's apply discrete weighting to a weighted regression model.
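A sketch using PyThresh; the KARCH thresholder and the outlier weight of 0.1 are illustrative choices, as the article does not specify which thresholder or outlier weight was used:

```python
from pythresh.thresholds.karch import KARCH

# Threshold the raw KDE decision scores into binary labels
# (0 = inlier, 1 = outlier)
thres = KARCH()
labels = thres.eval(raw_scores)

# Discrete weighting: inliers count fully, outliers are down-weighted
# but still contribute to the fit
disc_weights = np.where(labels == 0, 1.0, 0.1)

discrete_model = LinearRegression()
discrete_model.fit(X, y, sample_weight=disc_weights)

y_pred = discrete_model.predict(X)
print(f"R-squared: {r2_score(y, y_pred):.4f}")
print(f"MSE: {mean_squared_error(y, y_pred):.4f}")
print(f"Inliers: {np.mean(labels == 0):.1%}, outliers: {np.mean(labels == 1):.1%}")
```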

By using a discrete weighting function, we get a linear model with an R-squared score of 0.4606 and a mean squared error of 1.028. The ratio of inliers was 77.5% and of outliers 22.5%. If we now evaluate the weighted models against the baseline model on the inliers only (a sketch of this comparison follows the list below), this is what we get:

  • Baseline: R-squared = 0.3814 & Mean squared error = 0.5910
  • Continuous weighting: R-squared = 0.3925 & Mean squared error = 0.580
  • Discrete weighting: R-squared = 0.3966 & Mean squared error = 0.5763
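A sketch of this inlier-only comparison, reusing the fitted models and the inlier labels from the earlier snippets:

```python
# Score each fitted model on the inliers only (labels == 0)
inliers = labels == 0
X_in, y_in = X[inliers], y[inliers]

for name, model in [("Baseline", baseline),
                    ("Continuous weighting", continuous_model),
                    ("Discrete weighting", discrete_model)]:
    pred = model.predict(X_in)
    print(f"{name}: R-squared = {r2_score(y_in, pred):.4f}, "
          f"MSE = {mean_squared_error(y_in, pred):.4f}")
```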

From this we can see that, in terms of fitting the entire dataset, the unweighted model still performs the best. With regard to the inliers, however, it performs the worst. This means that while the unweighted model did perform better overall, it is now possibly biased towards outliers, reducing its predictive accuracy on inliers.

This finding, however, must be taken with a grain of salt. We are data scientists after all, and being skeptical of our own biases when interpreting models is also important. What is apparent from the performance metrics is that weighting the dataset may remove outlier bias. However, to reiterate, a single linear model, and perhaps even the outlier detection method, are not well suited for this dataset. So, what has been achieved from this exercise? Well, we have seen that significant outliers can be included in a regression model without making them equally as important as inliers. And even though they contribute to the final model fit, being significant and all, their importance with respect to predictions has been adjusted accordingly.

Applying the correct weightings to your data during regression becomes just as important as the dataset itself. It is difficult to say with absolute confidence that a perfect set of weights for your data exists, but hopefully the examples above make this task a little easier when trying to include outliers. One should be mindful of the implications of allowing outliers to remain during fitting and always choose the approach that represents the data best. Weighted regression is one of many ways to achieve this, and it is a valuable asset.

I will not include an example of piecewise continuous weighting; with the continuous and discrete weighting examples above, it should hopefully be straightforward to implement as a combination of the two.

In closing, I hope this aids your data science skills and becomes a powerful tool for handling significant outliers. All the best with your endeavors in the world of data!

[1] Bryant, P. G. and Smith, M. (1995). Practical Data Analysis: Case Studies in Business Statistics. Homewood, IL: Richard D. Irwin Publishing.

