Solving regression problems by combining statistical learning with machine learning

Using Airbnb Seattle Price Prediction as an example

Andrea Xue
Towards Data Science



Statistical learning and machine learning are usually deemed two separate camps in the world of data science. People in the first camp tend to focus more on statistical significance, while those in the second care more about model prediction performance. In reality, though, the two approaches are complementary and work hand in hand to solve a data science problem.

Most data science projects fall into one of two main categories: a regression problem (when the target variable is continuous/numerical) or a classification problem (when the target variable is discrete/categorical). This post will focus solely on how to approach a regression problem by combining statistics and machine learning, step by step.

Linear regression is usually the very first model we learn in any data science course. It’s simple, straightforward, and best of all it produces interpretable coefficients. Later on, we learn more sophisticated and higher-performing tree-based algorithms such as Random Forest and Gradient Boosted Trees, which can solve both regression and classification problems with more flexibility and fewer assumptions about the underlying features.

The tree-family algorithms are so good that they become our go-to models, but they can also make us lazier, since they need very little feature preprocessing. Working with a basic algorithm like linear regression, however, offers a great opportunity to learn how to select and massage the data, a skill I believe to be more important than running the models themselves.

Goal

In this post, I will guide you through a regression project using the latest version of the Airbnb Seattle dataset from Kaggle. It is intended for readers with basic knowledge of regression algorithms. The purpose is to illustrate the entire workflow when you are given a dataset without any explicit goal or direction: you are expected to define your own question of interest and figure out a roadmap to solve it.

The main questions of interest for this project are:

  1. Predict Airbnb listing prices using the available information about individual listings in Seattle.
  2. Explore the important factors that affect the prices of Airbnb listings in Seattle.

Like any typical data science project, my approach to the problem I just proposed follows the flowchart below. It is an iterative process through all the steps rather than a one-way tunnel.

data science project workflow

Datasets Overview

First, let’s take a look at the data. There are 3 datasets available on Kaggle:

Listing.csv: contains a comprehensive list of attributes about the place, the host and the average review score

Calendar.csv: includes listing id and the price and availability for each day

Review.csv: includes listing id, reviewer id, and detailed comments

For this project, since my goal is to predict listing prices, I will only use the listing dataset. The review dataset is great for sentiment analysis to predict listing ratings, but I will leave that part for a future post.

There are 93 columns in total in the listing data. It’s hard to examine all the columns at once with a single DataFrame head call, so I will split them by data type and check them in chunks.
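A minimal sketch of this inspection step, assuming the listing data has been loaded into a pandas DataFrame named listings (a name chosen here for illustration):

```python
import pandas as pd

# hypothetical file name from the Kaggle download
listings = pd.read_csv('listings.csv')

# split the columns by data type so they can be reviewed a few at a time
numeric_cols = listings.select_dtypes(include='number').columns
object_cols = listings.select_dtypes(include='object').columns
print(f'{len(numeric_cols)} numeric columns, {len(object_cols)} object columns')

# peek at the object columns in chunks of 10 to keep the output readable
for i in range(0, len(object_cols), 10):
    print(listings[object_cols[i:i + 10]].head())
```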

A couple of things caught my eye while glancing through the columns. First of all, a few features that should be numerical are currently of the “object” data type, such as ‘weekly_price’, ‘monthly_price’, ‘security_deposit’, ‘cleaning_fee’, ‘extra_people’, etc. Here is what they look like right now. I will convert them to numerical in the data cleaning step later.

current string format of some numerical features

I also noticed two feature groups that might contain very similar information: one is about the listing neighborhood, the other is about review ratings. Let’s examine them one by one.

unique values of some categorical features

It seems that “neighbourhood_cleansed” is almost identical to “neighborhood” except that the latter contains NaN while the former doesn’t. I will discard the “neighborhood” column in favor of “neighbourhood_cleansed” since the latter is a cleaner version and doing so won’t cause any information loss.

NaN count of the neighborhood feature group

For the review feature group, “review_scores_rating” seems to be a weighted sum of the other six individual rating categories. I suspect there is strong collinearity in this feature group, so I could use just the overall rating to represent the rest. I’d like to test that theory later.

Exploratory Analysis

1.1 Do super hosts always have higher ratings than regular Airbnb hosts? Do super hosts also charge differently than the regular hosts?

The first hypothesis I wanted to check is the difference in review rating and price between regular hosts and super hosts. The violin charts below show the review score and listing price distributions for super hosts, regular hosts, and all hosts combined.

The review score rating shows an evident distinction between super hosts and regular hosts: the majority of super host ratings cluster around 90–100, compared to only 61% of ratings above 90 for regular hosts, owing to a long tail of lower ratings. The price distribution, on the other hand, looks about the same for both types of hosts, except for a few rare, extremely high price outliers among the regular hosts.

1.2 What are the average listing prices of different property types by neighborhood?

The original property types in the dataset are fairly scattered, so I cleaned them up by grouping them into 4 main categories. From the bar chart below we can get an idea of how average listing prices vary across different neighborhoods and property types.

It seems that prices of apartment listings vary much less across neighborhoods compared to the other three property groups. Among all the neighborhoods, Downtown and Magnolia are generally the most expensive for the most popular property types: apartment, house, townhouse, and condo.

This chart also confirms that neighborhood and property type definitely have an impact on the listing prices.

1.3 Is there multicollinearity among the review feature group?

I divided “review_scores_rating” by 10 to bring it to the same 1–10 scale as the rest of the features in this group. Here is how their distributions overlap with each other. They all look heavily skewed, with most of the mass piled up at the top of the scale and a long tail toward the lower ratings.

review feature groups histogram

To further test the collinearity theory, the easiest way is to run a simple linear regression and see whether “review_scores_rating” can be represented as a linear combination of the other six review features.
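A minimal sketch of that check, regressing “review_scores_rating” on the six sub-scores (column names as they appear in the listing data) and looking at the R-squared:

```python
from sklearn.linear_model import LinearRegression

sub_scores = ['review_scores_accuracy', 'review_scores_cleanliness',
              'review_scores_checkin', 'review_scores_communication',
              'review_scores_location', 'review_scores_value']

# keep only rows where every review column is present for this quick check
reviews = listings[sub_scores + ['review_scores_rating']].dropna()

lr = LinearRegression()
lr.fit(reviews[sub_scores], reviews['review_scores_rating'])

# an R-squared close to 1 means the overall rating is almost a linear
# combination of the six sub-scores, i.e. strong collinearity in this group
print(lr.score(reviews[sub_scores], reviews['review_scores_rating']))
```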

And clearly, the answer is yes. So I can be confident in using the overall rating to represent the others without losing much information, while reducing multicollinearity in the linear model.

Simple Data Cleaning & Feature Selection

Since my goal is to predict the listing price and understand the differences between regular hosts and super hosts, I want to only include attributes that are related to the listings and the hosts. Below are the features I decided to keep.

Numerical features

As we found in the exploratory analysis, some numeric columns are currently in string format. The columns below should be converted to float type by stripping the “%” and “$” signs from the values.
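Here is a minimal sketch of that conversion; the exact lists of dollar- and percent-formatted columns are assumptions based on the columns mentioned above:

```python
# columns stored as strings like "$1,000.00"
dollar_cols = ['price', 'weekly_price', 'monthly_price',
               'security_deposit', 'cleaning_fee', 'extra_people']
# columns stored as strings like "98%"
percent_cols = ['host_response_rate', 'host_acceptance_rate']

for col in dollar_cols:
    listings[col] = (listings[col]
                     .str.replace('$', '', regex=False)
                     .str.replace(',', '', regex=False)
                     .astype(float))

for col in percent_cols:
    listings[col] = listings[col].str.rstrip('%').astype(float)
```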

After this simple cleaning step, they now look right as numerical features.

Now that the numerical features have been taken care of, let’s take a look at the correlation heatmap.
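A quick sketch of how the heatmap can be drawn with seaborn:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# correlation matrix over the numerical features only
corr = listings.select_dtypes(include='number').corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap='coolwarm', center=0)
plt.title('Correlation heatmap of numerical features')
plt.show()
```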

Clearly, there are two main correlated groups in the heatmap. One group is about the listing facilities, like the number of bedrooms and bathrooms and the number of people the listing can accommodate. I want to keep all the features in this group because they all convey fundamental information about the listing itself. The other correlated group is about the review scores, as discussed before. I decided to keep only the overall review score, “review_scores_rating”, and drop the rest of the review columns.

Categorical features

Before I go ahead and dummy-code all the categorical variables, tripling the number of features, I want to see how well they explain the variability of the listing price, to decide whether I want to use all of them in the model. This step is optional, but I found it especially helpful when the dataset contains many categorical features with dozens of levels.

You may have read about the three feature selection techniques in other articles on Medium: the filtering method using statistical tests, the wrapper method, and the embedded method using machine learning algorithms like LASSO and Random Forest.

Since I want to test categorical features against a continuous target variable, the easiest statistical filtering method is the ANOVA test. For those who are confused about which statistical test can be used for which type of data, as I used to be, this is a very useful reference.

source: Analytics Vidhya

The code below loops through each categorical feature in the data and calculates the F statistic and P-value from a one-way ANOVA test against the target variable “price”.
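The original gist is not embedded here; below is a sketch of an equivalent loop using scipy’s one-way ANOVA, where the list of categorical columns is my assumption of what was tested:

```python
import pandas as pd
from scipy import stats

categorical_cols = ['neighbourhood_group_cleansed', 'neighbourhood_cleansed',
                    'property_type', 'room_type', 'bed_type',
                    'host_is_superhost', 'instant_bookable']

results = []
for col in categorical_cols:
    # group the listing prices by each level of the categorical feature
    groups = [grp['price'].dropna().values
              for _, grp in listings.groupby(col)]
    groups = [g for g in groups if len(g) > 1]   # skip near-empty levels
    f_stat, p_value = stats.f_oneway(*groups)
    results.append({'feature': col, 'F statistic': f_stat, 'P-value': p_value})

anova_results = pd.DataFrame(results).sort_values('F statistic', ascending=False)
print(anova_results)
```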

ANOVA test results for categorical features

The F statistic in ANOVA is defined as the ratio of the between-group variance to the within-group variance.

If the P-value is higher than 0.05 (the rule-of-thumb significance level) and the F statistic is close to 1, the within-group variance of the price (within each categorical level) is about the same as the between-group variance (between different categorical levels). In that case we can say the categorical feature is independent of the continuous variable (here, price) and has no predictive power, because there is no observed variation in price between categorical levels.

Conversely, the higher the F statistic, the larger the differences in average price between the levels of the categorical variable, the more variance in price the categorical feature can explain, and the more power it has in predicting the price.

From the ANOVA test results above, we can first eliminate the bottom two features because their P-values are too high to be considered statistically significant. Given their small F statistics, we can also conclude there is very little variability in price between the two levels of host_is_superhost and instant_bookable.

There are also two features related to the neighbourhood. From the exploratory analysis above, it is obvious that neighborhood affects the listing price. But since both features contain very similar information, except that neighbourhood_cleansed is at a more granular level, I want to keep only one of them. The F statistic gives a pretty clear idea about which one to choose: I will keep neighbourhood_group_cleansed because it has the higher F score.

After finishing up the feature selection, let’s take a look at the missing values in the dataset.

count of missing values

square_feet has over 97% NaN values, so I will drop it. For the rest of the numerical features, I will impute the NaN values inside the machine learning pipeline later to avoid data leakage into the test set. The final step is converting the categorical features into dummy variables. Now that the dataset is ready, let’s go straight to the modeling.
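A minimal sketch of these last two preparation steps; the list of categorical columns kept is my assumption based on the ANOVA results above, and the DataFrame is assumed to contain only the hand-picked features at this point:

```python
# drop the feature that is almost entirely missing
listings = listings.drop(columns=['square_feet'])

# one-hot encode the remaining categorical features; numerical NaNs are left
# alone so they can be imputed inside the model pipeline later
categorical_cols = ['neighbourhood_group_cleansed', 'property_type', 'room_type']
df = pd.get_dummies(listings, columns=categorical_cols)

X = df.drop(columns=['price'])
y = df['price']
```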

Machine Learning

Model 1: Linear regression on original price

First I want to start with the simplest model, linear regression, to see what the baseline looks like without any treatment of the features.

I used the handy regressors package to get the model coefficients’ P-values.
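That package is not shown here; a statsmodels OLS fit gives a comparable summary with coefficient p-values. The train/test split parameters below are my own illustrative choices:

```python
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# statsmodels needs an explicit intercept; fill remaining NaNs with the
# training medians just for this quick baseline
X_train_sm = sm.add_constant(X_train.fillna(X_train.median()))

ols = sm.OLS(y_train, X_train_sm.astype(float)).fit()
print(ols.summary())   # adjusted R-squared, coefficients and p-values
```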

Without any feature standardization, the adjusted R-squared for the training set is about 0.62. But if we look at the p-values, almost all of them are too large to be statistically significant. The only features with p < 0.05 are “accommodates”, “bedrooms”, “bathrooms”, and “reviews_per_month”. That means only this handful of variables contributes to the model performance, even though most features have large coefficients.

Thoughts about Improvement

Right now there are 51 features, excluding the target variable. Since I hand-picked the features at the very beginning based only on my intuition about what is relevant to the listing price, I cannot be sure that all of them actually are. To keep the linear model from getting “distracted” by irrelevant features, which usually causes overfitting, I will apply regularization to the linear regression model. Regularization penalizes large coefficients in exchange for a simpler model, reducing variance in the absence of sufficient training data.

L1 (LASSO) regularization has the dual advantage of preventing overfitting and performing feature selection, because its penalty term tends to shrink the coefficients of less relevant features to zero while minimizing the cost function (the SSE plus the L1 penalty). Also, given that we have a small dataset (only 3,818 listings in total), I want to leverage cross-validation during training to make the model’s generalization performance more robust.

Last but not least, the dataset contains numerical features that vary widely in scale. So I need two preprocessing steps between the train-test split and running the model: first, fill in the missing values in the numerical variables with the median, and second, standardize the values. With a sklearn pipeline, these sequential steps can be combined into one.

Model 2: LASSO regression on original price

LASSO regression model pipeline
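The original pipeline gist is not reproduced here; a minimal sketch of what it could look like in scikit-learn follows, where the list of numerical feature names is only an illustrative subset:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LassoCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# original numerical features (illustrative subset); the dummy columns are
# excluded so they are neither median-imputed nor standardized
numeric_cols = ['accommodates', 'bathrooms', 'bedrooms', 'beds',
                'security_deposit', 'cleaning_fee', 'extra_people',
                'review_scores_rating', 'reviews_per_month']

preprocess = ColumnTransformer(
    transformers=[('num', Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('scale', StandardScaler())]), numeric_cols)],
    remainder='passthrough')   # dummy variables pass through untouched

lasso_pipe = Pipeline([
    ('preprocess', preprocess),
    ('lasso', LassoCV(cv=5, random_state=42))])

lasso_pipe.fit(X_train, y_train)
print(lasso_pipe.score(X_test, y_test))   # test-set R-squared
```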

Note that this time I only standardized the numerical features and left the dummy variables intact. The reason I treated them differently is that I care not only about prediction accuracy but also want to interpret the coefficients of the dummy variables in an intuitive way later. If your purpose is only to predict and you don’t care about the interpretability of the coefficients, you may not need to treat them separately.

M2 Model performance measured by R-squared:

R-squared measures the goodness of fit. It is the proportion of the variance of the dependent variable that is explained by the independent variables in a model.

It is not much different from the last linear regression. But I’m more curious about the coefficients.

Lasso regression coefficients

Evidently, LASSO has effectively pushed some feature coefficients to zero like the ones in the middle.

M2 Coefficient Interpretation

Since all the numerical features in the training data have been standardized, a one-unit change in any of the numerical features is no longer in its original scale but in standard deviations. We did not standardize the categorical variables, so their unit of change is unchanged.

So we can say that if the number of people the listing accommodates increases by 1 standard deviation (about 2 people), the price of the listing will increase by $23 on average, holding everything else constant. Or, if the review score rating increases by 1 standard deviation (6.6 on the 100-point scale), the price will go up by $27.5 on average. For categorical variables, the average price of a listing in the Downtown area is $33 higher than that of a similar listing elsewhere.

Model 3: LASSO regression on log-transformed price

Now let’s take a look at the distribution of the target variable, price. It clearly has a long-tailed shape. A natural log transformation can bring it close to a normal distribution, which can be valuable both for making patterns in the data more interpretable and for helping to meet the assumptions of inferential statistics.

left: original price, right: log-transformed price

Run the same pipeline again but this time on np.log(y_train).
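Sketched under the same assumptions as the pipeline above:

```python
import numpy as np

# fit on the log of the price; predictions come back on the log scale
lasso_pipe.fit(X_train, np.log(y_train))
print(lasso_pipe.score(X_test, np.log(y_test)))   # R-squared on the log scale

# transform back to dollars when actual price predictions are needed
price_pred = np.exp(lasso_pipe.predict(X_test))
```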

M3 Model Performance:

The test set performance improved by 12%, so log-transforming the target variable did help the prediction.

M3 Coefficient Interpretation

We fitted the model in this form: Log(Y) ~ scaled(numerical variables) + dummy variables. The general interpretation for a log-transformed target is: a 1-unit increase in X leads to a (exp(coefficient)-1)*100% change in Y, holding everything else constant.

Since the numerical data is standardized, 1 unit still means 1 standard deviation. Using “bathrooms” as an example, a 1-std increase in the number of bathrooms (about 0.6 bathrooms) raises the listing price by 24% ((exp(0.22)-1)*100%) on average.

The bottom feature on the chart, the Lake City area, is the most negatively correlated with the price. If a listing is in the Lake City area, its price will be 36% ((exp(-0.45)-1)*100%) lower than that of a comparable non-Lake City listing on average, holding everything else constant.

For more details on coefficient interpretation in the context of standardized independent or dependent variables, I recommend reading these two articles: article1, article2. Both provide excellent, detailed explanations.

Model 4: Random Forest Regressor on log-transformed price

Since the linear models did not produce very satisfying R-squared results, I want to switch to tree-based models, in case the relations between the predictors and the price are not linear.

With tree-based models, I don’t need to standardize the features, because decision trees choose the best split based only on criteria like information gain or decrease in impurity (for regression trees, variance reduction), so scaling the features would not change the splits. I will fit a Random Forest Regressor, still on the log-transformed price, to compare its performance with the previous model, and use GridSearchCV to find its optimal parameters.
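A minimal sketch of that search; the parameter grid values below are illustrative choices rather than the ones used in the original run:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# trees don't need feature scaling, but the numerical NaNs still need imputing
rf_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('rf', RandomForestRegressor(random_state=42))])

param_grid = {
    'rf__n_estimators': [100, 300],
    'rf__max_depth': [None, 10, 20],
    'rf__min_samples_leaf': [1, 3, 5]}

grid = GridSearchCV(rf_pipe, param_grid, cv=5, scoring='r2', n_jobs=-1)
grid.fit(X_train, np.log(y_train))

print(grid.best_params_)
print(grid.score(X_test, np.log(y_test)))   # test-set R-squared on log price
```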

M4 Model Performance

The test set R-squared score improved by 3% over the 69% achieved by LASSO on the same log-transformed price.

The residual vs. fitted value plot below shows how far the predicted prices are from the actual prices. Unfortunately, the residuals do not look randomly scattered around the blue line; there are some outliers with very large errors.
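A quick sketch of how such a residual plot can be produced from the fitted grid search:

```python
import matplotlib.pyplot as plt
import numpy as np

fitted = grid.predict(X_test)            # predictions on the log-price scale
residuals = np.log(y_test) - fitted

plt.scatter(fitted, residuals, alpha=0.4)
plt.axhline(0, color='blue')             # residuals should scatter around this line
plt.xlabel('Fitted log price')
plt.ylabel('Residual')
plt.title('Residuals vs. fitted values')
plt.show()
```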

M4 Random Forest Feature Importance

The Random Forest model finds a different set of most important features: room type, bedrooms, and bathrooms are the three dominant ones. Let’s take a look at how room type affects the price.

There is a clear pricing tier among the 3 room types of listings; the average price follows this order: shared room < private room < entire home/apt. There are also some extremely high-priced outliers that belong to the entire home category. No wonder room type = entire home/apt is identified as the top feature by the random forest.

Final Thoughts

After several attempts with different models, the R-squared score on the test set is still not above 0.8. I think a couple of things could be causing this, the biggest one being that the data sample is too small (only about 3,000). It is hard to draw a generalized pattern from such a small sample, which also leads to overfitting on the training set for the random forest.

One way to fix this is to collect Airbnb listing data for other cities to increase the sample size. I could also try a gradient boosting tree model for better prediction accuracy once there is sufficient data.

Statistical learning and machine learning are both indispensable for addressing regression problems. While machine learning provides us with more sophisticated models for prediction, statistical tests are useful for feature selection, multicollinearity detection, and assessing the statistical significance of regression coefficients. As data scientists, we are expected to know both sides really well.

Thanks for reading! As the learning never stops, I will keep updating this post with more results as I progress through the iterations. Stay tuned.
