My First Kaggle Competition

Using Random Forests to predict Housing Prices

Utkarsh Chawla
Towards Data Science


Introduction

I recently stumbled upon this article by Rachel Thomas describing the many advantages of blogging, and, lo and behold, here I am with my first article.

In this article, I will share my experience of participating in my first ever Kaggle competition. I completed fast.ai’s Machine Learning for Coders MOOC, and I hoped to apply the knowledge gained from that course in this competition.

Overview

We will be working on the Housing Price Prediction competition.
The description says:

This is a perfect competition for data science students who have completed an online course in machine learning and are looking to expand their skill set before trying a featured competition.

The problem is simple: we have to predict the sale prices (dependent variable) of residential homes in Ames, Iowa. We are provided with 79 explanatory features (independent variables) describing (almost) every aspect of each house.

We will be using a Random Forest for this problem, which is an ensemble of decision trees. A random forest builds multiple decision trees and averages their predictions to get a more accurate and stable result.

Looking at the Data

A preview of the training data (full-size image: https://i.imgur.com/BEtQD00.png)

Most of these columns and their values don’t make much sense at first glance, but Kaggle provides a brief description in the Data section of the competition. You can also look at the data_description.txt resource for a detailed description of every column and its possible values.
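
If you want to follow along, here is a minimal sketch for loading and peeking at the data. The 'data/train.csv' path is my assumption, mirroring the 'data/test.csv' path used for the test set later on.

import pandas as pd

# load the training data and look at the first few rows
df_raw = pd.read_csv('data/train.csv')
print(df_raw.shape)   # number of rows and columns
df_raw.head()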

Evaluation Metric

The evaluation metric is the RMSLE (Root Mean Squared Logarithmic Error), so it makes sense to take the log of the dependent variable (SalePrice): minimizing plain RMSE on the log-transformed target is then equivalent to minimizing RMSLE on the original prices.

df_raw.SalePrice = np.log(df_raw.SalePrice)
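
To make that equivalence concrete, here is a small sketch with my own helper names (not from the original notebook): RMSE computed on log prices is exactly RMSLE computed on the raw prices, which is presumably also what the rmse helper used later in get_oob does.

import numpy as np

def rmsle(y_true, y_pred):
    # RMSLE on the original (untransformed) prices
    return np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))

def rmse(y_true, y_pred):
    # plain RMSE; on log-transformed prices this equals the RMSLE above
    return np.sqrt(np.mean((y_pred - y_true) ** 2))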

Data Interpretation

You might have noticed that the first six rows of the “Alley” column are all NaN values (which indicate missing data for the respective rows). So let’s find out the percentage of missing values for every feature in the dataset.
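
A quick way to compute this (a sketch; the notebook’s exact code isn’t shown here), assuming the raw dataframe is df_raw as above:

# percentage of missing values per column, largest first
nan_pct = df_raw.isnull().mean().sort_values(ascending=False) * 100
print(nan_pct[nan_pct > 0])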

Columns with missing data

We have five features with a NaN percentage greater than 50. It is highly probable that these features won’t provide any useful insight into the dataset, so we could consider removing them right away. But we are going to look at feature importance soon enough anyway, so let’s keep them for now and let the random forest make this decision for us.

Data Cleaning

In order to feed this dataset to the random forest, we have to convert all the non-numerical data into numerical data. The non-numerical columns represent categorical data, which comes in two types:

Ordinal: where the order of the categories matters.
Nominal: where the order doesn’t matter.

There are two ways to convert these into numbers:

Label encoding: replace the categories with integer labels ranging from 0 to (number of categories - 1). This implicitly makes the categorical data ordinal in nature.

One-hot encoding: split one categorical column into multiple columns (one per category), filled with 1s and 0s depending on which category each row belongs to.
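
Both are easy to try in pandas. A minimal sketch on a toy column modeled on LotShape (illustration only, not the notebook’s code):

import pandas as pd

df = pd.DataFrame({'LotShape': ['Reg', 'IR1', 'IR1', 'Reg', 'IR2']})

# Label encoding: each category becomes an integer code
df['LotShape_label'] = df['LotShape'].astype('category').cat.codes

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(df['LotShape'], prefix='LotShape')
print(df.join(one_hot))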

For more information on Label vs One-hot encoding, you can visit this article.

Most of the columns are ordinal anyway, so we will go with label encoding for this dataset (I tried one-hot encoding later as well, and it increased my RMSLE by 0.01). The train_cats and proc_df functions are taken from the fastai repo on GitHub; you can view the source code and documentation if you want to.

We also replaced the missing values in numerical columns with the median of the respective column, and added a {name}_na column which specifies whether the value was missing.
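
Roughly, that amounts to the following in plain pandas (a sketch of the idea, not fastai’s actual proc_df implementation):

def fix_missing_numeric(df, col):
    # add a {col}_na flag, then fill the missing values with the median
    if df[col].isnull().any():
        df[col + '_na'] = df[col].isnull()
        df[col] = df[col].fillna(df[col].median())

# e.g. LotFrontage has missing values in this dataset
fix_missing_numeric(df_raw, 'LotFrontage')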

Looking at the data, I found that no feature directly provides the age of the house, so I added that, along with another feature that gives the total living area.
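
Something along these lines (the exact formula for TotalLivingSF isn’t shown here; summing the above-ground living area and the basement area is one plausible definition, consistent with the correlations we will see later, so treat it as an assumption):

def transform(df):
    # age of the house at the time of sale
    df['AgeSold'] = df['YrSold'] - df['YearBuilt']
    # total living area -- assumed definition: above-ground living area
    # plus total basement area
    df['TotalLivingSF'] = df['GrLivArea'] + df['TotalBsmtSF']

transform(df_raw)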

Splitting into Training and Validation set

One last thing before building our random forest is to divide our dataset into two parts: the training set and the validation set. Possibly the most important idea in machine learning is that of having separate training and validation sets. As motivation, suppose you don’t divide up your data but instead use all of it, and suppose you end up with lots of parameters:

The error for the pictured data points is lowest for the model on the far right (the blue curve passes through the red points almost perfectly), yet it’s not the best choice. Why is that? If you were to gather some new data points, they most likely would be closer to the curve in the middle graph.

Splitting our raw dataset into a training and a validation set

This illustrates how using all our data can lead to overfitting, and a validation set helps diagnose that problem. The score of your model on the validation set represents how well your model will do in the real world, on data it has never seen before. So the validation set is simply a held-out subset of your data that tells you how generalizable your model is.
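
In code, this is the same split that appears in the later snippets (a sketch; df and y are the outputs of proc_df on the training data, with y being the log of SalePrice):

from sklearn.model_selection import train_test_split

# hold out 25% of the data for validation, with a fixed seed for reproducibility
X_train, X_val, y_train, y_val = train_test_split(df, y,
                                                  test_size=0.25,
                                                  random_state=11)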

Training the Model

We are finally ready to build the random forest. We will be using Parfit for optimizing our hyper-parameters.

The best hyper-parameters turned out to be:

min_samples_leaf: 1
max_features: 0.4

Using these optimized parameters for the random forest, we got an RMSLE of 0.1258 on the validation set.
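
Putting it together, a sketch of the training call in plain scikit-learn, using the hyper-parameters above, the 300 estimators mentioned below, and the rmse helper defined earlier (Parfit is only needed for the search itself, so this is my reconstruction rather than the exact notebook code):

from sklearn.ensemble import RandomForestRegressor

m = RandomForestRegressor(n_estimators=300, min_samples_leaf=1,
                          max_features=0.4, n_jobs=-1, random_state=11)
m.fit(X_train, y_train)

# since y is log(SalePrice), this RMSE is the RMSLE on the original prices
print(rmse(y_val, m.predict(X_val)))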

Visualizing Random Forests

Let’s look at one tree (estimator) from our bag of trees in the random forest.

One tree from our forest

This tree has been visualized up to depth = 2. Every node splits the data into two halves, chosen so that the weighted average of the MSEs of the two halves is as low as possible. The MSE keeps decreasing as we go further down, because at every node the tree picks the best available split point.
The tree first divides the data on the basis of TotalLivingSF, and its children are then split on AgeSold and TotalLivingSF respectively, and so on.
We have 300 estimators in our forest, and we average the predictions of all of them to make a final prediction.
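
If you want to reproduce such a plot, scikit-learn can draw a single estimator directly. A sketch (however the figure above was produced, sklearn.tree.plot_tree, available in scikit-learn 0.21 and later, gives a similar picture):

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

fig = plt.figure(figsize=(16, 8))
plot_tree(m.estimators_[0],                  # the first tree in the forest
          max_depth=2,                       # only show the top of the tree
          feature_names=list(X_train.columns),
          filled=True)
plt.show()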

Feature Selection

Next, we are going to dive into feature importance to remove redundant features, and to find out which features are responsible for giving the most insight into our data.

The function rf_feat_importance returns a pandas DataFrame with the column names and their respective feature importances. We then use the plot_fi function to plot a horizontal bar graph of the feature importances of all the columns.

def rf_feat_importance(m, df):
    # pair every feature with its importance, most important first
    return pd.DataFrame({'cols': df.columns, 'imp': m.feature_importances_}
                       ).sort_values('imp', ascending=False)

def plot_fi(fi):
    # horizontal bar chart of the feature importances
    return fi.plot('cols', 'imp', 'barh', figsize=(20,12), legend=False)

fi = rf_feat_importance(m, df)
plot_fi(fi[:50])
Feature importance bar graph

The graph shows that the importance decreases roughly exponentially, so there is no need to keep features that provide almost no insight into our dataset. We will remove all features with imp < 0.005, re-train the model on the reduced data, and check the RMSLE.

tokeep = fi[fi.imp >= 0.005].cols            # keep only features with importance >= 0.005
dfToKeep = df[tokeep]
X_train, X_val, y_train, y_val = train_test_split(dfToKeep, y,
                                                  test_size=0.25, random_state=11)
m.fit(X_train, y_train)

The RMSLE after this was 0.12506, a slight improvement over the previous 0.1258, and the number of features dropped from 79 to 25. Generally speaking, removing redundant columns should not make things worse; if it makes the RMSLE worse, the columns were not redundant after all.

Removing redundant columns might even make the model a little better: if you think about how these trees are built, the algorithm has fewer candidate features to evaluate when deciding what to split on.

There is a good chance of building a better tree with fewer features, though it is not going to change the outcome by much. It does, however, let us focus on the features that matter the most.

Correlation-based Feature Selection

Now we are going to look at the correlations between the features using Spearman’s rank correlation coefficient. If two features provide the same insight, it is wise to get rid of one of them, because it is essentially redundant. This technique is also helpful in finding links between multiple features.

We will plot a dendrogram, which visualizes a hierarchical clustering of the features.

import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy as hc

# Spearman rank correlation between the remaining features
corr = np.round(scipy.stats.spearmanr(dfToKeep).correlation, 4)
corr_condensed = hc.distance.squareform(1 - corr)   # convert to a condensed distance matrix
z = hc.linkage(corr_condensed, method='average')
fig = plt.figure(figsize=(16,15))
dendrogram = hc.dendrogram(z, labels=dfToKeep.columns, orientation='left', leaf_font_size=16)
plt.show()
Dendrogram showing the correlations between the remaining features

The earlier two features merge in the dendrogram, the more correlated they are. Based on this, we can see five pairs of features that are highly correlated with each other.

These are:
1.) GrLivArea and TotalLivingSF.
2.) GarageArea and GarageCars.
3.) 1stFlrSF and TotalBsmtSF.
4.) GarageYrBlt and YearBuilt.
5.) FireplaceQu and Fireplaces.

Both features in each pair provide similar insight, so it is wise to remove one feature from every pair. But which of the two should be removed? Let’s find out.

To decide which feature of each pair to remove, I first calculated a baseline score (including all 25 features) and then removed these ten features one at a time. After removing each feature, we retrain the model, recalculate the score, and compare it with the baseline to see how the removal of that particular feature affects our error.

def get_oob(df, y):
    # despite the name, this retrains the forest (same hyper-parameters as before)
    # and returns the RMSE on a held-out validation split; oob_score is off
    m = RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                              max_features=0.4, max_leaf_nodes=None,
                              min_impurity_decrease=0.0, min_impurity_split=None,
                              min_samples_leaf=1, min_samples_split=2,
                              min_weight_fraction_leaf=0.0, n_estimators=300,
                              n_jobs=-1, oob_score=False, random_state=11,
                              verbose=0, warm_start=False)
    X_train, X_val, y_train, y_val = train_test_split(df, y,
                                                      test_size=0.25, random_state=11)
    m.fit(X_train, y_train)
    return [rmse(m.predict(X_val), y_val)]

get_oob(dfToKeep, y)

We get our baseline score to be 0.12506.

Now we will start removing those 10 features and recalculate the score.
Each line of output below shows the feature (column) that was removed and the RMSLE after retraining the model without it.

for c in ['GrLivArea', 'TotalLivingSF', '1stFlrSF', 'TotalBsmtSF', 'GarageYrBlt',
          'YearBuilt', 'GarageArea', 'GarageCars', 'Fireplaces', 'FireplaceQu']:
    print(c, get_oob(dfToKeep.drop(c, axis=1), y))

GrLivArea [0.12407075391075345]
TotalLivingSF [0.12487110485019964]
1stFlrSF [0.12658563342527962]
TotalBsmtSF [0.12518884830074287]
GarageYrBlt [0.12475983278616651]
YearBuilt [0.12672934344370876]
GarageArea [0.12412925519107317]
GarageCars [0.12538764179293327]
Fireplaces [0.1258181905119676]
FireplaceQu [0.12630040930065195]

From each pair, we drop the feature whose removal hurts the least: getting rid of GrLivArea, GarageYrBlt, or GarageArea individually brings the error below the 0.12506 baseline, and dropping TotalBsmtSF barely changes it, while removing either of the Fireplace features makes things worse, so we keep both of those. So we can get rid of these four redundant features.

to_drop = ['GrLivArea','GarageYrBlt','TotalBsmtSF','GarageArea']
get_oob(dfToKeep.drop(to_drop,axis=1),y)
[0.12341214604541835]

Our RMSLE dropped from 0.12506 to 0.12341 after getting rid of the redundant features (it was 0.1258 with the original, full feature set).

dfToKeep.drop(to_drop,inplace=True,axis=1)
len(dfToKeep.columns)
21

We are now left with only 21 features, down from the original 79. This is our final feature set.
We should now merge our training and validation sets and retrain the model on the combined data.

columns = dfToKeep.columns          # the 21 features that survived feature selection
dfTrainAndVal = df[columns]
print(len(dfTrainAndVal.columns))
m.fit(dfTrainAndVal, y)

Submitting Predictions

Now we will generate our predictions for the test set, but before that we need to apply the same transformations to it.

df_test = pd.read_csv('data/test.csv')
transform(df_test)                        # add "ageSold" and "TotalLivingSF" to the set
train_cats(df_test)                       # convert string columns to pandas categoricals
df_test, _, _ = proc_df(df_test, na_dict=nas)   # reuse the training set's na_dict
Id = df_test.Id
df_test = df_test[columns]                # keep only the final 21 features
ans = np.stack((Id, np.exp(m.predict(df_test))), axis=1)   # undo the log transform
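
The last step is writing the predictions to a CSV in Kaggle’s submission format (a short sketch; the Id and SalePrice column names come from this competition’s sample submission, and the 'submission.csv' filename is just my choice):

submission = pd.DataFrame({'Id': Id.astype(int),
                           'SalePrice': np.exp(m.predict(df_test))})
submission.to_csv('submission.csv', index=False)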

This model gave us a score of 0.12480 on the Kaggle test set, which corresponds to a rank of 1338 out of 4052 on the leaderboard, putting us in the top 34%.

Ending Remarks

I am currently learning about the second major type of decision-tree ensemble, namely boosting. I will follow up this article with an implementation built around a Gradient Boosting Regressor instead of a Random Forest and see how that affects the feature importances and the final score.

Any advice and suggestions will be greatly appreciated.

I’d like to thank Jeremy Howard and Rachel Thomas for making these extraordinary MOOCs. I would recommend everyone check out fast.ai; along with machine learning, they also have courses on deep learning and computational linear algebra.
