Home Value Prediction

Predicting real estate value using machine learning algorithms

Nate Jermain
Towards Data Science

How do companies like Zillow offer price estimates for homes that are not for sale? They collect data on the characteristics of each property and use machine learning algorithms to make predictions. In this article, I’ll demonstrate a similar analysis using a data set included in Kaggle’s “House Prices” competition.

Exploratory Data Analysis

First, let's take a look at the response variable "Sale Price". It's positively skewed; most houses sold for between $100,000 and $250,000, but some sold for substantially more.

Figure 1: Observed sale price
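
For reference, here's a minimal sketch of how Figure 1 could be reproduced, assuming Kaggle's train.csv has been read into a pandas DataFrame (the variable names and plot settings are illustrative; the original loading and plotting code isn't shown):

import pandas as pd
import matplotlib.pyplot as plt

train_raw = pd.read_csv('train.csv')  # Kaggle "House Prices" training data

# histogram of raw sale prices; the long right tail reflects the positive skew
train_raw['SalePrice'].hist(bins=50)
plt.xlabel('Sale Price ($)')
plt.ylabel('Number of homes')
plt.show()

print(train_raw['SalePrice'].skew())  # well above 0, confirming positive skew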

The data set contains 80 features that describe characteristics of the property, including the number of bathrooms, basement square footage, year built, garage square footage, etc. The heat map (Figure 2) shows the correlation between each feature and the response variable "SalePrice". This gives us information about how important each feature is for predicting Sale Price and indicates where there may be multicollinearity. Not surprisingly, the overall quality of the home ("OverallQual") is highly correlated with Sale Price. In contrast, the year the home was sold ("YrSold") has little correlation with Sale Price.

Figure 2: Heat map showing the correlation among features and sale price
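
Such a heat map could be generated with seaborn; a quick sketch, reusing the illustrative train_raw frame from above (the figure size and color map are assumptions, not the original settings):

import seaborn as sns
import matplotlib.pyplot as plt

# correlation matrix over the numerical features, including SalePrice
corr = train_raw.select_dtypes(include='number').corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap='coolwarm', square=True)
plt.show()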

Data Cleaning

Dealing with NAs

There are lots of NAs in this data set; some features are almost entirely NAs, while many others have just a few.
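
A quick way to see which features are affected (a sketch, assuming df is the working DataFrame used in the snippets below):

# count missing values per feature and list the worst offenders first
na_counts = df.isnull().sum().sort_values(ascending=False)
print(na_counts[na_counts > 0])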

We can remove features that offer little information such as Utilities.

df.Utilities.describe()

All but one property is assigned the "AllPub" category for Utilities, so we can just remove that feature. Due to the lack of variation, the feature has little correlation with our response Sale Price (Figure 2), so we're not that worried about losing it.

df = df.drop(['Utilities'], axis=1)

Few of the NAs are random; the lack of information usually tells us something about the record itself rather than reflecting a collection error. For example, an NA for GarageType probably means there isn't a garage on the property. This data set contains both categorical and continuous features pertaining to garages. For properties with NAs in those features, we can fill the categorical ones with "None" and the continuous ones with 0, indicating a lack of garage space.

# Garage categorical features to 'None'
for i in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    df[i] = df[i].fillna('None')

# Garage continuous features to 0
for i in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    df[i] = df[i].fillna(0)

NAs for other features don't have such a clear explanation for the missing information. In this case, we can look at how frequently each value occurs and choose the most probable one. Let's look at the frequency distribution for the "MSZoning" feature, which describes the zoning classification.

Figure 3: Frequency of zoning classification
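
The counts behind Figure 3, along with the number of NAs, can be checked directly; a minimal sketch using the same df:

# frequency of each zoning class (NaNs are excluded from value_counts)
print(df['MSZoning'].value_counts())

# number of missing entries for this feature
print(df['MSZoning'].isnull().sum())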

The classification for residential low density (RL) is by far the most common. A pragmatic approach to addressing the four NAs in this feature is to simply replace them with "RL", the mode.

df.MSZoning = df['MSZoning'].fillna(df['MSZoning'].mode()[0])

Data Transformation

To maximize the performance of our model, we want to normalize our features and response variable. As we saw in Figure 1, our response variable is positively skewed. By applying a log transformation, Sale Price now resembles a normal distribution (Figure 4).

import numpy as np

resp = np.log1p(resp)  # transform by log(1 + x)

Figure 4: Log transformed response variable Sale Price

We’ll have to check all the continuous features for skew as well.

import pandas as pd

# identify numerical features
num_feats = df.dtypes[df.dtypes != 'object'].index

# quantify skew
skew_feats = df[num_feats].skew().sort_values(ascending=False)
skewness = pd.DataFrame({'Skew': skew_feats})

# keep features with skew greater than 0.75 in magnitude
skewness = skewness[abs(skewness) > 0.75].dropna()
skewed_features = skewness.index

Skewness is going to vary a lot across all these features we want to transform. A Box-Cox transformation provides a flexible way of handling features that may each require a different adjustment. The boxcox function estimates the optimal lambda value (a parameter of the transformation) and returns the transformed feature.

# add one to all skewed features, so we can log transform if needed
df[skewed_features] += 1

# conduct the Box-Cox transformation
from scipy.stats import boxcox

# apply to each of the skewed features
for i in skewed_features:
    df[i], lmbda = boxcox(df[i], lmbda=None)

One-Hot Encoding

Finally, we’ll need to one-hot encode (or dummy code) our categorical variables so they can be interpreted by the model.

df=pd.get_dummies(df)

Modeling

We’re going to fit two widely applied machine learning models to the training data and evaluate their relative performance using cross-validation.

Random Forest Regressor

To ensure our random forest regressor has attributes that maximize its predictive capability, we're going to optimize the hyperparameter values. We want to estimate the optimal values for:

n_estimators: number of trees in the forest

max_features: maximum number of features to consider at each split

max_depth: maximum depth of each tree (the maximum number of levels of splits)

min_samples_split: minimum number of samples required to split a node

min_samples_leaf: minimum number of samples required at each leaf node

bootstrap: whether the data set is bootstrapped or whether the whole data set is used for each tree

n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
max_features = ['auto', 'sqrt', 'log2']
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]

grid_param = {'n_estimators': n_estimators,
              'max_features': max_features,
              'max_depth': max_depth,
              'min_samples_split': min_samples_split,
              'min_samples_leaf': min_samples_leaf,
              'bootstrap': bootstrap}

If we used GridSearchCV from scikit-learn to identify the optimal hyperparameters, we would be evaluating 6,480 candidate models and 32,400 fits with five-fold cross-validation. That would be very computationally expensive, so instead we'll use RandomizedSearchCV, which evaluates a specified number of candidate models (n_iter) with hyperparameters randomly sampled from our defined parameter space. We're going to do k-fold cross-validation using five folds.
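
As a quick sanity check on those counts, we can multiply the sizes of the hyperparameter lists defined above (a sketch, not part of the original analysis):

from functools import reduce

# 10 * 3 * 12 * 3 * 3 * 2 = 6,480 candidate models in the full grid
n_candidates = reduce(lambda total, values: total * len(values), grid_param.values(), 1)
print(n_candidates)      # 6480
print(n_candidates * 5)  # 32400 fits with five-fold cross-validation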

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# the model prior to hyperparameter optimization
RFR = RandomForestRegressor(random_state=1)

# randomized search over the parameter space with five-fold cross-validation
RFR_random = RandomizedSearchCV(estimator=RFR, param_distributions=grid_param, n_iter=500, cv=5, verbose=2, random_state=42, n_jobs=-1)
RFR_random.fit(train, resp)
print(RFR_random.best_params_)

Now we have a model with attributes best suited for our data.

Best_RFR = RandomForestRegressor(n_estimators=1000, min_samples_split=2, min_samples_leaf=1, max_features='sqrt', max_depth=30, bootstrap=False)

We want a precise measurement of how the home prices predicted by the model differ from the actual prices of the homes sold. We'll calculate the root mean squared error (RMSE) for the model through k-fold cross-validation. Given five folds, we'll use the mean RMSE across the five sets of model fits.

from sklearn.model_selection import KFold, cross_val_score

n_folds = 5

def rmse_cv(model):
    kf = KFold(n_folds, shuffle=True, random_state=42)
    rmse = np.sqrt(-cross_val_score(model, train, resp, scoring="neg_mean_squared_error", cv=kf))
    return rmse.mean()

rmse_cv(Best_RFR)

The random forest model does fairly well, with a mean RMSE of 0.149.

Let's try another model to see if we can obtain better predictions.

Gradient Boosting Regressor

We’ll conduct the same evaluation using RandomizedSearchCV to identify the optimal hyperparameters. The gradient boosting regressor we’ll use from “xgboost” has the following hyperparameters we’ll want to optimize:

n_estimators: number of trees

subsample: percentage of samples per tree

max_depth: maximum number of levels in each tree

min_child_weight: minimum sum of weights of all observations required in a child

colsample_bytree: percentage of features used per tree

learning_rate: learning rate or step size shrinkage

gamma: minimum reduction of the cost function required to make a split

n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
subsample = [.6, .7, .8, .9, 1]
max_depth = [int(x) for x in np.linspace(10, 50, num=10)]
min_child_weight = [1, 3, 5, 7]
colsample_bytree = [.6, .7, .8, .9, 1]
learning_rate = [.01, .015, .025, .05, .1]
gamma = [.05, .08, .1, .3, .5, .7, .9, 1]

rand_param = {'n_estimators': n_estimators,
              'subsample': subsample,
              'max_depth': max_depth,
              'colsample_bytree': colsample_bytree,
              'min_child_weight': min_child_weight,
              'learning_rate': learning_rate,
              'gamma': gamma}

Using the same approach employed for the random forest model, we’ll run the randomized hyperparameter search using k-fold cross-validation.

from xgboost import XGBRegressor

Boost = XGBRegressor()
Boost_random = RandomizedSearchCV(estimator=Boost, param_distributions=rand_param, n_iter=500, cv=5, verbose=2, random_state=42, n_jobs=-1)
Boost_random.fit(train, resp)

We can now calculate the RMSE for the tuned model and compare xgboost’s performance to the random forest model.

Best_Boost = XGBRegressor(subsample=.7, n_estimators=1600, min_child_weight=3, max_depth=41, learning_rate=.025, gamma=.05, colsample_bytree=.6)

# evaluate rmse
rmse_cv(Best_Boost)

Our gradient boosting regression model outperformed the random forest model, with an RMSE of 0.131.

Making Final Predictions

I took a pragmatic approach to modeling in this analysis; there are additional techniques that can marginally increase prediction accuracy, such as stacking or applying a suite of alternate models (e.g., Lasso, ElasticNet, KernelRidge).
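
For instance, one of those alternatives can be dropped into the same rmse_cv comparison used above. Here's a minimal sketch with scikit-learn's Lasso (the alpha value is an arbitrary assumption for illustration, not a tuned result):

from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# scale features before applying the L1 penalty; alpha chosen only for illustration
lasso = make_pipeline(RobustScaler(), Lasso(alpha=0.0005, random_state=1))
rmse_cv(lasso)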

We’ll just apply the best model from this analysis (gradient boosting regression) to the test set and evaluate its performance.

# fit the tuned model to the training data
Best_Boost.fit(train, resp)

# invert the log(1 + x) transform on the predictions with expm1
ypred = np.expm1(Best_Boost.predict(test))

# make a data frame to hold the predictions, and submit to Kaggle
sub = pd.DataFrame()
sub['Id'] = test['Id']
sub['SalePrice'] = ypred
sub.to_csv('KaggleSub.csv', index=False)

The gradient boosting regression model achieved an RMSE of 0.1308 on the test set, not bad!

Conclusion

We can make reasonable predictions about the price a house will sell for based on characteristics of the property. Key steps include assigning appropriate values for NAs, normalizing variables, optimizing hyperparameters for candidate models, and choosing the best model.

I appreciate any feedback and constructive criticism. The code associated with this analysis can be found at github.com/njermain.
