Linear Regression in Python; Predict The Bay Area’s Home Prices

Published in

Towards Data Science

8 min readOct 25, 2017

Motivation

In order to predict The Bay area’s home prices, I chose the housing price dataset that was sourced from Bay Area Home Sales Database and Zillow. This dataset was based on the homes sold between January 2013 and December 2015. It has many characteristics of learning, and the dataset can be downloaded from here.

Data Preprocessing

import pandas as pd
sf = pd.read_csv('final_data.csv')
sf.head()

There are several features that we do not need, such as “info”, “z_address”, “zipcode”(We have “neighborhood” as a location variable), “zipid” and “zestimate”(This is the price estimated by Zillow, we don’t want our model to be affected by this), so, we will drop them.

sf.drop(sf.columns[[0, 2, 3, 15, 17, 18]], axis=1, inplace=True)sf.info()

The data type of “zindexvalue” should be numeric, so let’s change that.

sf['zindexvalue'] = sf['zindexvalue'].str.replace(',', '')
sf['zindexvalue'] = sf['zindexvalue'].convert_objects(convert_numeric=True)sf.lastsolddate.min(), sf.lastsolddate.max()

(‘01/02/2013’, ‘12/31/2015’)

The house sold period in the dateset was between January 2013 and December 2015.

I now use the describe() method to show the summary statistics of the numeric variables.

sf.describe()

The count, mean, min and max rows are self-explanatory. The std shows the standard deviation, and the 25%, 50% and 75% rows show the corresponding percentiles.

To get a feel for the type of data we are dealing with, we plot a histogram for each numeric variable.

%matplotlib inline
import matplotlib.pyplot as plt
sf.hist(bins=50, figsize=(20,15))
plt.savefig("attribute_histogram_plots")
plt.show()

Figure 1. *A histogram for each numerical variable*

Some of the histograms are a little bit right skewed, but this is not abnormal.

Let’s create a scatter plot with latitude and longitude to visualize the data:

sf.plot(kind="scatter", x="longitude", y="latitude", alpha=0.2)
plt.savefig('map1.png')

Now let’s color code from the most expensive to the least expensive areas:

sf.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4, figsize=(10,7),
    c="lastsoldprice", cmap=plt.get_cmap("jet"), colorbar=True,
    sharex=False)

This image tells us that the most expensive houses sold were in the north area.

The variable we are going to predict is the “last sold price”. So let’s look at how much each independent variable correlates with this dependent variable.

corr_matrix = sf.corr()
corr_matrix["lastsoldprice"].sort_values(ascending=False)

The last sold price tends to increase when the finished sqft and the number of bathrooms go up. You can see a small negative correlation between the year built and the last sold price. And finally, coefficients close to zero indicate that there is no linear correlation.

We are now going to visualize the correlation between variables by using Pandas’ scatter_matrix function. We will just focus on a few promising variables, that seem the most correlated with the last sold price.

from pandas.tools.plotting import scatter_matrixattributes = ["lastsoldprice", "finishedsqft", "bathrooms", "zindexvalue"]
scatter_matrix(sf[attributes], figsize=(12, 8))
plt.savefig('matrix.png')

The most promising variable for predicting the last sold price is the finished sqft, so let’s zoom in on their correlation scatter plot.

sf.plot(kind="scatter", x="finishedsqft", y="lastsoldprice", alpha=0.5)
plt.savefig('scatter.png')

Figure 5. Finished sqft vs. Last Sold Price

The correlation is indeed very strong; you can clearly see the upward trend and that the points are not too dispersed.

Because each house has different square footage and each neighborhood has different home prices, what we really need is the price per sqft. So, we add a new variable “price_per_sqft”. We then check to see how much this new independent variable correlates with the last sold price.

sf['price_per_sqft'] = sf['lastsoldprice']/sf['finishedsqft']
corr_matrix = sf.corr()
corr_matrix["lastsoldprice"].sort_values(ascending=False)

Unfortunately, the new price_per_sqft variable shows only a very small positive correlation with the last sold price. But we still need this variable for grouping neighborhoods.

There are 71 neighborhoods in the data, and we are going to group them.

len(sf['neighborhood'].value_counts())

The following steps cluster the neighborhood into three groups: 1. low price; 2. high price low frequency; 3. high price high frequency.

freq = sf.groupby('neighborhood').count()['address']
mean = sf.groupby('neighborhood').mean()['price_per_sqft']
cluster = pd.concat([freq, mean], axis=1)
cluster['neighborhood'] = cluster.indexcluster.columns = ['freq', 'price_per_sqft','neighborhood']cluster.describe()

These are the low price neighborhoods:

cluster1 = cluster[cluster.price_per_sqft < 756]
cluster1.index

Index([‘Bayview’, ‘Central Richmond’, ‘Central Sunset’, ‘Crocker Amazon’,
‘Daly City’, ‘Diamond Heights’, ‘Excelsior’, ‘Forest Hill’,
‘Forest Hill Extension’, ‘Golden Gate Heights’, ‘Ingleside’,
‘Ingleside Heights’, ‘Ingleside Terrace’, ‘Inner Parkside’,
‘Inner Richmond’, ‘Inner Sunset’, ‘Lakeshore’, ‘Little Hollywood’,
‘Merced Heights’, ‘Mission Terrace’, ‘Mount Davidson Manor’,
‘Oceanview’, ‘Outer Mission’, ‘Outer Parkside’, ‘Outer Richmond’,
‘Outer Sunset’, ‘Parkside’, ‘Portola’, ‘Silver Terrace’, ‘Sunnyside’,
‘Visitacion Valley’, ‘West Portal’, ‘Western Addition’,
‘Westwood Highlands’, ‘Westwood Park’],
dtype=’object’, name=’neighborhood’

These are the high price and low frequency neighborhoods:

cluster_temp = cluster[cluster.price_per_sqft >= 756]
cluster2 = cluster_temp[cluster_temp.freq <123]
cluster2.index

Index([‘Buena Vista Park’, ‘Central Waterfront — Dogpatch’, ‘Corona Heights’, ‘Haight-Ashbury’, ‘Lakeside’, ‘Lone Mountain’, ‘Midtown Terrace’,
‘North Beach’, ‘North Waterfront’, ‘Parnassus — Ashbury’, ‘Presidio Heights’, ‘Sea Cliff’, ‘St. Francis Wood’, ‘Telegraph Hill’, ‘Twin Peaks’], dtype=’object’, name=’neighborhood’)

These are the high price and high frequency neighborhoods:

cluster3 = cluster_temp[cluster_temp.freq >=123]
cluster3.index

Index([‘Bernal Heights’, ‘Cow Hollow’, ‘Downtown’, ‘Eureka Valley — Dolores Heights — Castro’, ‘Glen Park’, ‘Hayes Valley’, ‘Lake’, ‘Lower Pacific Heights’, ‘Marina’, ‘Miraloma Park’, ‘Mission’, ‘Nob Hill’, ‘Noe Valley’, ‘North Panhandle’, ‘Pacific Heights’, ‘Potrero Hill’, ‘Russian Hill’, ‘South Beach’, ‘South of Market’, ‘Van Ness — Civic Center’, ‘Yerba Buena’],
dtype=’object’, name=’neighborhood’)

We add a group column based on the clusters:

def get_group(x):
    if x in cluster1.index:
        return 'low_price'
    elif x in cluster2.index:
        return 'high_price_low_freq'
    else:
        return 'high_price_high_freq'
sf['group'] = sf.neighborhood.apply(get_group)

After performing the above pre-processing, we do not need the following columns anymore: “address, lastsolddate, latitude, longitude, neighborhood, price_per_sqft”, so, we drop them from our analysis.

sf.drop(sf.columns[[0, 4, 6, 7, 8, 13]], axis=1, inplace=True)
sf = sf[['bathrooms', 'bedrooms', 'finishedsqft', 'totalrooms', 'usecode', 'yearbuilt','zindexvalue', 'group', 'lastsoldprice']]
sf.head()

Our data looks perfect!

But before we build the model, we need to create dummy variables for these two categorical variables: “usecode” and “group”.

X = sf[['bathrooms', 'bedrooms', 'finishedsqft', 'totalrooms', 'usecode', 'yearbuilt', 
         'zindexvalue', 'group']]
Y = sf['lastsoldprice']

n = pd.get_dummies(sf.group)
X = pd.concat([X, n], axis=1)m = pd.get_dummies(sf.usecode)
X = pd.concat([X, m], axis=1)drops = ['group', 'usecode']
X.drop(drops, inplace=True, axis=1)X.head()

This is what our data looks like after creating dummy variables:

Train and Build a Linear Regression Model

from sklearn.cross_validation import train_test_splitX_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Done! We now have a working Linear Regression model.

Calculate R squared:

y_pred = regressor.predict(X_test)
print('Linear Regression R squared": %.4f' % regressor.score(X_test, y_test))

Linear Regression R squared: 0.5619

So, in our model, 56.19% of the variability in Y can be explained using X. This is not that exciting.

Calculate root-mean-square error (RMSE)

import numpy as np
from sklearn.metrics import mean_squared_error
lin_mse = mean_squared_error(y_pred, y_test)
lin_rmse = np.sqrt(lin_mse)
print('Linear Regression RMSE: %.4f' % lin_rmse)

Linear Regression RMSE: 616071.5748

Our model was able to predict the value of every house in the test set within $616071 of the real price.

Calculate mean absolute error (MAE):

from sklearn.metrics import mean_absolute_errorlin_mae = mean_absolute_error(y_pred, y_test)
print('Linear Regression MAE: %.4f' % lin_mae)

Linear Regression MAE: 363742.1631

Random Forest

Let’s try a more complex model to see whether results can be improved — the RandomForestRegressor:

from sklearn.ensemble import RandomForestRegressorforest_reg = RandomForestRegressor(random_state=42)
forest_reg.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion=’mse’, max_depth=None, max_features=’auto’, max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=10, n_jobs=1, oob_score=False, random_state=42,
verbose=0, warm_start=False)

print('Random Forest R squared": %.4f' % forest_reg.score(X_test, y_test))

Random Forest R squared”: 0.6491

y_pred = forest_reg.predict(X_test)
forest_mse = mean_squared_error(y_pred, y_test)
forest_rmse = np.sqrt(forest_mse)
print('Random Forest RMSE: %.4f' % forest_rmse)

Random Forest RMSE: 551406.0926

Much better! Let’s try one more.

Gradient boosting

from sklearn import ensemble
from sklearn.ensemble import GradientBoostingRegressor
model = ensemble.GradientBoostingRegressor()
model.fit(X_train, y_train)

GradientBoostingRegressor(alpha=0.9, criterion=’friedman_mse’, init=None, learning_rate=0.1, loss=’ls’, max_depth=3, max_features=None, max_leaf_nodes=None, min_impurity_split=1e-07,
min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, presort=’auto’, random_state=None, subsample=1.0, verbose=0,warm_start=False)

print('Gradient Boosting R squared": %.4f' % model.score(X_test, y_test))

Gradient Boosting R squared”: 0.6616

y_pred = model.predict(X_test)
model_mse = mean_squared_error(y_pred, y_test)
model_rmse = np.sqrt(model_mse)
print('Gradient Boosting RMSE: %.4f' % model_rmse)

Gradient Boosting RMSE: 541503.7962

These are the best results we have so far, so, I would consider this is our final model.

Feature Importance

We have used 19 features (variables) in our model. Let’s find out which features are important and vice versa.

feature_labels = np.array(['bathrooms', 'bedrooms', 'finishedsqft', 'totalrooms', 'yearbuilt', 'zindexvalue', 
                           'high_price_high_freq', 'high_price_low_freq', 'low_price', 'Apartment', 'Condominium', 'Cooperative', 
                          'Duplex', 'Miscellaneous', 'Mobile', 'MultiFamily2To4', 'MultiFamily5Plus', 'SingleFamily', 
                           'Townhouse'])
importance = model.feature_importances_
feature_indexes_by_importance = importance.argsort()
for index in feature_indexes_by_importance:
    print('{}-{:.2f}%'.format(feature_labels[index], (importance[index] *100.0)))

The most important features are finished sqft, zindex value, number of bathrooms, total rooms, year built and so on. And the least important feature is Apartment, which means that regardless of whether this unit is an apartment or not, does not matter to the sold price. Overall, most of these 19 features are used.

Your Turn!

Hopefully, this post gives you a good idea of what a machine learning regression project looks like. As you can see, much of the work is in the data wrangling and the preparation steps, and these procedures consume most of the time spent on machine learning.

Now it’s time to get out there and start exploring and cleaning your data. Try two or three algorithms, and let me know how it goes.

Source code that created this post can be found here. I would be pleased to receive feedback or questions on any of the above.