
How to visualize and predict the prices of houses using PCA/TSNE and XGBoost and Bayesian Optimization

How to use PCA, TSNE, XGBoost, and finally Bayesian Optimization to predict the price of houses!

Image by author

Would you like to have a look at a Machine Learning pipeline from start to finish? To figure out what each of the algorithms in the title is all about? Then keep on reading! The purpose of this project was to explore how and why to use each of these tools, and I applied them to a classic Kaggle dataset with information about houses. Read on if you’d like to know what each of these incredible algorithms can do, and learn how to apply them yourself by having a look at this accompanying Jupyter Notebook!

The Dataset and the Problem

I worked with a Kaggle competition dataset with "79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa". In turn, the task was to predict the sale price of houses based on these 79 explanatory variables. Thus, we have a regression problem on our hands.

Data Cleaning

My approach to data cleaning was pretty straightforward. First, I noted that of the 79 explanatory variables, 6 had a lot of data missing: the columns representing them were more than 90% empty. Thinking that filling these in with the mean or the mode, or trying to infer the real values from the data, might add bias, I decided to drop these columns instead. Then, seeing that very few of the other columns had missing values, I again decided in favor of dropping rather than inferring: I dropped all rows with at least one missing value. That removed around 60 rows out of roughly 1,400, a tradeoff I thought was worth it.

Evidently, one concern is whether the missing data is correlated with some important predictive factor. For example, what if the person recording the data skipped the second page of the form for a whole day while surveying a specific neighborhood? We would then be dropping a lot of data from that neighborhood, and that data could have been important for prediction.

Finally, I used dummy variables to encode the categorical variables, expanding the roughly 70 remaining variables into around 200 columns, many of them binary indicators.
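As a rough sketch (in pandas; the missing-value threshold and file name are assumptions about the notebook, not copied from it), the cleaning pipeline looks something like this:

```python
import pandas as pd

# Load the Kaggle training data (assumed file name from the competition download).
df = pd.read_csv("train.csv")

# Drop columns that are mostly empty (here, more than 90% missing).
missing_fraction = df.isna().mean()
df = df.drop(columns=missing_fraction[missing_fraction > 0.9].index)

# Drop the few remaining rows that still have at least one missing value.
df = df.dropna()

# One-hot encode the categorical variables into binary indicator columns.
df = pd.get_dummies(df)

# Split the target (sale price) from the explanatory variables.
y = df["SalePrice"]
X = df.drop(columns=["SalePrice"])
```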

Principal Component Analysis (PCA)

What is It, and When Do We Use It?

We use PCA when we want to reduce the number of variables (i.e. the number of dimensions) that encode our data. In this case, we had 79 explanatory variables, and we can reduce that down to a smaller number like 10. How does PCA accomplish this? In short, it finds the vectors that account for most of the variance of our data. The algorithm creates one vector (or principal component) at a time, and each is optimized for maximum variance. This means finding vectors that maximize the sum of the squared distances of the data's projections onto that vector. Still not clear? I don't blame you; this stuff is complicated. Check out this excellent video by StatQuest on YouTube for an intuitive, step-by-step explanation.

PCA Decomposition

79 explanatory variables, or 200 after encoding dummies, aren't necessarily too many for most popular regression algorithms. That said, I wanted to figure out (a) whether we could speed up the training of a regression algorithm by decomposing the original data with PCA and (b) whether the principal components themselves could tell us something about the variance and which variables seem to matter most. Therefore, I used sklearn's decomposition library to decompose our data into 10 principal components.
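A minimal sketch of that step with scikit-learn, reusing the X matrix from the cleaning sketch above (note that no standardization is applied here, which is consistent with a large-scale variable like lot area dominating the first component):

```python
from sklearn.decomposition import PCA

# Decompose the dummy-encoded design matrix into ten principal components.
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X)

# Fraction of the total variance explained by each component
# (the numbers behind the scree plot).
print(pca.explained_variance_ratio_)
```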

Image by author. Figure 1: print output corresponding to the 'scree plot' of the PCA analysis: it shows how much of the variance is explained by each of the first ten principal components. Evidently, the first component is by far the most important and explains most of the variance.
Image by author. Figure 2: the same information as above, represented as a scree plot. This shows the relative importance of each principal component: most are relatively insignificant after the first!

This breakdown matters because it lets us investigate not only how much each of the first few principal components explains, but also what each component is made of, i.e. which of the original variables contribute most to that eigenvector. Each subcomponent of the first principal component, for example, corresponds to one of the 200 original variables, specifically to the weight and direction with which that variable contributes to the overall component. Taking the first principal component, I therefore ranked its subcomponents by magnitude (from high to low):
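One way to produce that ranking, reusing the fitted pca object from the sketch above (a sketch, not necessarily the notebook's exact code):

```python
import pandas as pd

# Loadings (subcomponents) of the first principal component, ranked by absolute weight.
pc1 = pd.Series(pca.components_[0], index=X.columns)
ranked = pc1.reindex(pc1.abs().sort_values(ascending=False).index)
print(ranked.head(10))
```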

Image by author. Figure 3: the value of each subcomponent, with the name of the original variable. This shows how important each variable was for the first principal component. We can see that Lot Area is by far the most important one, carrying around 99 times more weight than the above-ground living area (second place).

Lot area seems to be, by a long margin, the most important descriptive factor in the first principal component. This makes a lot of sense intuitively: consider that almost every other factor of a house is impacted by how big the lot is. Larger lots have pools, larger buildings, more rooms, larger garages, etc. I then compared this to the second principal component, which by itself explains much less of the variance at 0.4% versus the 98% of the first principal component:

Image by author. Figure 4: This is the value of the subcomponents for the second principal component. It shows that the most important variables in this principal component are the above-ground living area, the total basement surface area, and the first floor surface area.

Since the first principal component already accounts for the lot's size and its correlates, this second principal component captures variation due to things like GrLivArea (the above-ground living area). Either way, size still seems to be a critical factor in explaining this data!

Categorization: How Well Do the First Two Principal Components Explain Housing Prices?

A second investigation I undertook with PCA was as follows: if I categorize the sale prices of houses into 5 tiers, ranging from Tier 1 (<100,000 USD) to Tier 5 (up to 500,000 USD), and plot each datum according to its first two principal components, how well does PCA separate the different pricing categories? I decided to categorize prices to make the graph easier to read.

Knowing this can help us determine how useful PCA is for this problem, i.e. how well it describes the spread of the data across price levels. Note that we originally had around 200 variables; only with a dimensionality reduction technique like PCA can we actually visualize this data in two dimensions.
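A sketch of that plot, reusing y and X_pca from above (the tier boundaries below other than the <100,000 USD cut-off are illustrative guesses; the article does not list the exact edges):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Bin sale prices into five tiers (edges other than the first are assumptions).
tiers = pd.cut(y, bins=[0, 100_000, 200_000, 300_000, 400_000, float("inf")],
               labels=[1, 2, 3, 4, 5])

# Scatter the first two principal components, colored by price tier.
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=tiers.astype(int), cmap="viridis", s=10)
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.colorbar(label="Price tier")
plt.show()
```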

Image by author. Figure 5: plotting each house as per its two first principal components. The color of each dot corresponds to its price tier. We can see how we seem to have a cluster at PCA Component 1 = 0 whose colors roughly follow the tiers as we increase the value of PCA Component 2: they go from cheapest (blue, tier 1), to a few expensive dots (purple, tier 5) as we move up.

This graph shows how PCA helps us identify a datum based on its most descriptive factors. The graph's main cluster is, on the one hand, too dense to allow easy (or linear) separation between categories. Still, we can see that blue dots sit mostly at the lower part, and that we progress through red, yellow, green, and a few purples as expected. Interestingly, while the first principal component is largely responsible for the spread (visible in how wide the range of values along PCA 1 is), it is PCA 2 that better follows the gradation of colors/sale prices. Going back to our analysis of the subcomponents, principal component 1 was (a) responsible for most of the explained variance and (b) composed strongly of the lot area. It seems, then, from the scatter plot that the lot area explains a lot of the variance but not so much of the price variation. In other words, lot area drives much of how houses differ, yet it does not by itself predict how much a house costs. This makes sense: without factors like location and the composition of the building, which are more relevant to the price, we lose important context. The second principal component, which leans more on the building area, does a better job of separating the price categories.

Decomposing With TSNE

While PCA works well for linearly separable data, TSNE promises to do a better job of separating data that is not linearly separable while still mapping each point onto only two or three values. TSNE is a complex algorithm that computes a pairwise similarity between the inputs and a pairwise similarity between the low-dimensional points in the embedding, and then tries to make these two similarity distributions match as closely as possible (Derksen, 2019). In other words, it compares each pair of data points and uses a Gaussian kernel to weight their similarity based on the distance between the two points across their different variables. The algorithm then clusters as much as possible by pushing similar points together and dissimilar points further apart. That explanation might not sound obvious either, so if you'd like to know the inner workings of the algorithm, check out this YouTube video.

Ultimately, just as with PCA, TSNE leaves us with two values that identify each datum. Given that our PCA analysis showed data that do not look clearly linearly separable, I went ahead and ran TSNE as well.
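A sketch of the t-SNE embedding with scikit-learn (running it on the PCA output rather than the full ~200-column matrix is a common speed-up and an assumption here, as is the perplexity value):

```python
from sklearn.manifold import TSNE

# Embed each house into two dimensions with t-SNE.
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_pca)
```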

I then created a very similar plot to figure 5, above, in order to visualize whether we can separate the data clearly:

Image by author. Figure 6: plotting the houses based on the two TSNE components.

As we can see, we can identify individual points more clearly than before, and we can even determine some clusters. However, there still seems to be a lot of overlap! If we were to use something like multiple linear regression, we might expect bad results: the clusters do not look linearly separable. Transforming this data by a kernel might help, as we would do if we wanted to classify these points using a support vector machine, for instance. Still, there is likely too much overlap for kernelization to be effective. Looking at this graph, it seems that regression will be better if we use algorithms that are capable of dealing with nonlinear separation, and I decided to use an algorithm that fits that description well and is known for its high performance.

XGBoost – Extreme Gradient Boosted Trees for Regression

How they Work

Decision trees are quite intuitive: they work by creating splits that lead you to specific predictions based on different factors. The trees in XGBoost take into consideration the past prediction for a given datum and create a new tree that splits the existing data as well as possible to maximize the 'gain' in prediction (splits that don't add enough gain, based on a threshold set by a hyperparameter, are pruned to prevent overfitting). Gradient boosted trees build several models, each new tree factoring in the predictions of the previous trees with the goal of minimizing the remaining prediction error. Once trained, the algorithm adds up the predictions of all the built trees to make its final regression. Sounds complicated? Check out this YouTube video.

Strengths of XGBoost over Other Classifiers/Regressors

  • Interpretability of results: we can visualize the final trees to determine which variables split the tree and how the thresholds worked.
  • Speed and accuracy: XGBoost is quick and accurate compared to older algorithms! We can see that in the graph below, which compares both predictive power and training time:
Image by author. Figure 7: from (Morde, 2019), a comparison of the performance of four common classification algorithms.

Building my XGBoost Model

To build my XGBoost model, I first converted my data to a special Python object called a DMatrix, a data format optimized for how XGBoost works. I also had to define some starting values for the model's hyperparameters; my approach was to pick relatively arbitrary values within the ranges recommended in the documentation.
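A sketch of that baseline (the train/test split, the starting hyperparameter values, and the number of boosting rounds are illustrative assumptions):

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hold out a test set, then wrap the data in XGBoost's optimized DMatrix format.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Starting hyperparameters: plausible values within the documented ranges.
params = {
    "objective": "reg:squarederror",
    "colsample_bytree": 0.3,
    "learning_rate": 0.1,
    "max_depth": 5,
    "alpha": 10,
}

model = xgb.train(params, dtrain, num_boost_round=70)
preds = model.predict(dtest)
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, preds)):,.0f} USD")
```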

With no optimization of the hyperparameters, I obtained an RMSE of 33,897 USD. Given that the mean sale price is around 189,932 USD, with a standard deviation of 83,000 USD, this RMSE is neither too large nor as small as we'd like it to be. To improve it, I did some very simple hyperparameter optimization.

Each of the hyperparameters ("colsample_bytree", "learning_rate", "max_depth", "alpha", "num_boost_round") defines a different aspect of how XGBoost runs, such as the depth of the trees, the number of nodes created, and how many trees are built. For each one, I tried several values while keeping the others constant, using k-fold cross-validation, and plotted the RMSE for each candidate value, one hyperparameter at a time. Evidently, this process is very slow, as it requires us to systematically run cross-validation at each candidate value, and it is not great at identifying interactions between hyperparameters. Below is the code I used for cross-validation and for testing different options for each hyperparameter:
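In outline, the sweep looks something like this, shown here for the learning rate (the candidate values, number of folds, and number of boosting rounds are assumptions; params and dtrain come from the baseline sketch above):

```python
import matplotlib.pyplot as plt
import xgboost as xgb

# Try several learning rates while holding the other hyperparameters fixed,
# estimating the RMSE at each value with 5-fold cross-validation.
candidate_rates = [0.01, 0.03, 0.05, 0.07, 0.1, 0.2, 0.3, 0.6]
cv_rmse = []
for lr in candidate_rates:
    trial_params = {**params, "learning_rate": lr}
    cv_results = xgb.cv(trial_params, dtrain, num_boost_round=70,
                        nfold=5, metrics="rmse", seed=42)
    cv_rmse.append(cv_results["test-rmse-mean"].min())

plt.plot(candidate_rates, cv_rmse, marker="o")
plt.xlabel("Learning rate")
plt.ylabel("Cross-validated RMSE")
plt.show()
```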

This produced the following graph:

Image by author. Figure 8: how Learning Rate changes the RMSE of the cross-validation.

As we can see, once we get past a learning rate of around 0.07, the RMSE stops improving; in fact, it worsens very slightly as the learning rate approaches 0.6. Based on this graph, I chose 0.08 as my estimate for that hyperparameter.

In the same way as for the graph above, I tested each hyperparameter in turn and came up with a new set of hyperparameters:

With this new model, my RMSE was 32,389 USD, a modest improvement of around 4%.

Bayesian Optimization of Hyperparameters

The process above seems a bit too unrefined, though: wouldn't it be better to have an automated 'guess and check' algorithm that learns from past guesses and makes better and better new ones in terms of reducing our RMSE? Well, that is exactly what Bayesian Optimization can do. In practice, we can treat Bayesian Optimization as a bit of a black box: we give it ranges for each hyperparameter (like "colsample_bytree", "learning_rate", etc.), an objective function (in this case, a function that returns the cross-validation RMSE for a tentative set of hyperparameters), and the variable to be optimized (the RMSE), and it outputs the best solutions it has found. In that sense, other algorithms like gradient descent and genetic algorithms can perform practically the same task. Why Bayesian optimization, then?

Bayesian optimization is a fascinating algorithm because it proposes new tentative values based on the probability of finding something better. The algorithm therefore has two components: the first is a model that estimates the probability of a particular score given the hyperparameters; the second is the objective function, which is used to verify and update this model. The Bayesian optimization library I used relies on Gaussian Processes to model the RMSE as a function of the hyperparameters: with each guess, the Gaussian Process gains more information about how well its predictions did and updates to accommodate the new data point. The model can then suggest a new, potentially high-reward guess that could be a new minimum.

Image by Will Koehrsen on Towards Data Science. Figure 9: what the Gaussian Process model used by Bayesian Optimization might look like. I annotated a figure produced by (Koehrsen, 2018).

Figure 9 above shows how the Gaussian Process for Bayesian Optimization works: Gaussian Processes create a posterior over functions, giving us a range of probable solutions for predicting the variable f(x). Each time we get a new data point, our Gaussian Process can update itself to reduce the uncertainty (the shaded grey areas) about the value of the variable at that point. Have a look at the evaluations (black dots): the uncertainty, i.e. the shaded grey areas, shrinks as we get close to them. Our Bayesian algorithm, however, thrives on uncertainty: it targets areas of our function that are high in uncertainty and potentially high in reward as well (a metaphor for life?). The dashed green circle, for example, marks a prime target for our Bayesian Optimization algorithm to suggest a new tentative set of hyperparameters, since there is a lot of uncertainty about the value of our variable (f(x), in our case the RMSE) at that point.

Applying this algorithm in Python is quite straightforward because we already have a library for it:
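A sketch using the bayes_opt package (one common choice of library; the article does not name the one it used), keeping the fcv and dict_cv names described below. The bounds and iteration counts are assumptions, and dtrain comes from the baseline sketch above:

```python
import xgboost as xgb
from bayes_opt import BayesianOptimization

def fcv(learning_rate, max_depth, colsample_bytree, alpha):
    """Objective: cross-validated RMSE for a tentative set of hyperparameters.

    bayes_opt maximizes its objective, so the RMSE is returned negated
    (a common convention; how the original notebook handles this is an assumption).
    """
    trial_params = {
        "objective": "reg:squarederror",
        "learning_rate": learning_rate,
        "max_depth": int(max_depth),  # the optimizer proposes floats
        "colsample_bytree": colsample_bytree,
        "alpha": alpha,
    }
    cv_results = xgb.cv(trial_params, dtrain, num_boost_round=70,
                        nfold=5, metrics="rmse", seed=42)
    return -cv_results["test-rmse-mean"].min()

# Ranges to explore for each hyperparameter (illustrative bounds).
dict_cv = {
    "learning_rate": (0.01, 0.3),
    "max_depth": (3, 10),
    "colsample_bytree": (0.3, 1.0),
    "alpha": (0, 20),
}

optimizer = BayesianOptimization(f=fcv, pbounds=dict_cv, random_state=42)
optimizer.maximize(init_points=5, n_iter=25)
print(optimizer.max)  # best (negated) RMSE and the hyperparameters that produced it
```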

Note here that the function 'fcv' is our objective function: it takes in a set of tentative hyperparameters, runs k-fold cross-validation, and returns the RMSE value for that set (negated in the sketch above, since the library maximizes its objective). The dictionary dict_cv holds the ranges I want to test for each hyperparameter, and the last two lines call the Bayesian Optimization routine (pretty simple, isn't it?). The algorithm then runs for n_iter iterations, (potentially) finding better and better results each time. Finally, I used the best set of hyperparameters to train and test my XGBoost model: interestingly, I got an RMSE of 31,357.06 USD, slightly better than the model above! It might be that we are reaching a point of diminishing returns with hyperparameter optimization: maybe around 30k USD is as good as our predictions will get with this dataset!

How does the Model Make Predictions?

One further advantage of the XGBoost algorithm and its Python package is that we can visualize quite clearly how the decision making happens for a new input. The graph below shows one of the trees from my trained model, illustrating exactly how the decision tree works:
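A sketch of that visualization with the xgboost plotting helpers (rendering requires the graphviz package; model is the trained booster from the sketches above):

```python
import matplotlib.pyplot as plt
import xgboost as xgb

# Render one of the booster's trees (the first one here).
xgb.plot_tree(model, num_trees=0)
plt.gcf().set_size_inches(30, 15)
plt.show()
```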

Image by author. Figure 10: one of the 70 decision trees built by XGBoost. At every node we can see the decision criteria, and at the leaves the predicted price. Also note that this is only one tree; XGBoost adds the outputs of several trees together for its regression, as an ensemble.

How Important Was Each Factor in Predicting Price?

Another great tool XGBoost gives us is the ability to see which features were the most important in the regression. Below is a visualization of the most important features and their "F score", which is a measure of importance.
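A sketch of that plot (model is the trained booster from above; the cut-off of 28 features mirrors the figure below):

```python
import matplotlib.pyplot as plt
import xgboost as xgb

# Plot feature importances; the default importance type ('weight') counts how often
# a feature is used to split, which is what the 'F score' axis reports.
xgb.plot_importance(model, max_num_features=28)
plt.tight_layout()
plt.show()
```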

Image by author. Figure 11: the relative importance of the 28 most important features for predicting the sale price.

Interestingly, the importances reported by XGBoost are in line with what I expected from the PCA decomposition! Whereas lot area dominated PCA's first component, the scatter plot suggested that sale price is better captured by the second component, whose most important subcomponent was the above-ground living area. It therefore makes sense that this is, in fact, the most important factor in modulating sale price according to XGBoost!

Conclusion: How to Price a House in Ames, Iowa

Qualitatively, we can say that if you're going to put a price on a house, you should look first at the above-ground built area, then at the overall quality of the house and the lot area, and finally at the different surface areas (which seems like pretty intuitive, obvious advice).

Quantitatively, we now have a model capable of predicting house prices with an expected error of around 30,000 USD. This means that if we find a house for sale, run our XGBoost predictor, and see that it values the house considerably more than 30,000 USD above the asking price, we should probably buy it and re-sell!

References

Derksen, L. (2019, April 29). Visualising high-dimensional datasets using PCA and t-SNE in Python. Retrieved October 18, 2020, from https://towardsdatascience.com/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b

Koehrsen, W. (2018, July 02). A Conceptual Explanation of Bayesian Hyperparameter Optimization for Machine Learning. Retrieved December 15, 2020, from https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f

Morde, V. (2019, April 08). XGBoost Algorithm: Long May She Reign! Retrieved October 18, 2020, from https://towardsdatascience.com/https-medium-com-vishalmorde-xgboost-algorithm-long-she-may-rein-edd9f99be63d

