Data cleaning and feature engineering in Python

Building better machine learning models for predicting San Francisco housing prices

Scott Johnson
Towards Data Science


Housing price data provides a great introduction to machine learning. Anybody who has bought a house or even rented an apartment can easily understand the features: more space, and more rooms, generally lead to a higher price.

So it ought to be easy to develop a model, but sometimes it isn’t, not because machine learning is hard but because data is messy. Also, the exact same house in different neighborhoods of the same city, even just a mile apart, may sell for a significantly different price. The best way to deal with this is to engineer the data so that the model can handle such variation.

Since finding data can be the hardest problem in machine learning, we will use a sample set from another data science project on GitHub: housing prices in San Francisco, mostly from the last few years, scraped from San Francisco Chronicle home sale listings. The data set can be found here: https://github.com/RuiChang123/Regression_for_house_price_estimation/blob/master/final_data.csv

First, we’ll load the data from a local copy of the file.

import pandas as pd
housing = pd.read_csv("final_data.csv")

Now, let’s take a look at a few graphs of this data set, starting with a plot of the last sold price against the total number of rooms.

import matplotlib.pyplot as plt
x = housing['totalrooms']
y = housing['lastsoldprice']
plt.scatter(x,y)
plt.show()

That single point in the far lower-right corner is an outlier. Its value is so extreme that it skews the entire graph, so much that we cannot even see any variation on the main set of data. This will distort any attempt to train a machine learning algorithm on this data set. We need to look more closely at this data point and consider what to do with it. If we sort the data by the total number of rooms, one of our axes above, it should stick out.

housing['totalrooms'].sort_values()

Here are the results:

7524        1.0
11223       1.0
3579        1.0
2132        1.0
5453        1.0
2827        1.0
         ...
2765       23.0
8288       24.0
9201       24.0
6860       24.0
4802       26.0
8087       26.0
11083      27.0
2601       28.0
2750       28.0
10727      28.0
11175      33.0
8300       94.0
8967     1264.0
Name: totalrooms, Length: 11330, dtype: float64

Indeed, that data point does stick out. It is the very last value in the list: a house that supposedly has 1,264 rooms! That is very suspicious, especially since the plot shows it having a fairly low price. At the very least it is wildly inconsistent with the rest of the data. The same may be true of the preceding value, which shows 94 rooms. We can take a closer look at these two houses with the following commands, pulling them up by their numeric identifiers.

First let’s look at the house which supposedly has 1,264 rooms:

housing.iloc[[8967]]

This query shows something even more suspicious, which is that the “finishedsqft” field is also 1264.0. In other words, this is clearly just an error, probably on data entry — when the original data set was created, somebody accidentally used the same value for both finishedsqft and totalrooms.
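
As a side note, if we suspected this kind of copy/paste error appears elsewhere, we could scan for it directly. The snippet below is a quick sanity check on the housing frame we already loaded, flagging any row whose room count exactly equals its square footage; for real houses this should essentially never happen, so any matches are likely data-entry errors.

# Flag rows where totalrooms exactly equals finishedsqft; matches are almost
# certainly data-entry errors rather than real 1,264-room mansions.
suspects = housing[housing['totalrooms'] == housing['finishedsqft']]
print(suspects[['totalrooms', 'finishedsqft', 'lastsoldprice']])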

Now, let’s take a look at the value just preceding it, with 94 rooms:

housing.iloc[[8300]]

This home, which supposedly has 94.0 rooms, has only two bedrooms and two bathrooms! Again, this is an error. It is not clear how it crept in, but we can be pretty certain that if we go to this house, it does not have 94 rooms. We will need to eliminate these two data points, but first let’s take another look at a graph, this time of finishedsqft:

x = housing['finishedsqft']
y = housing['lastsoldprice']
plt.scatter(x,y)
plt.show()

There is another outlier in the lower-right. Let’s take a closer look at this data:

housing['finishedsqft'].sort_values()

Here are the results:

1618         1.0
3405         1.0
10652        1.0
954          1.0
11136        1.0
5103         1.0
916          1.0
10967        1.0
7383         1.0
1465         1.0
8134       243.0
7300       244.0
          ...
9650      9699.0
8087     10000.0
2750     10236.0
4997     27275.0
Name: finishedsqft, Length: 11330, dtype: float64

First, unexpectedly, there are ten houses listed at 1.0 square feet. This is clearly wrong. Note that these were impossible to see in the graph; we had to look at the actual values. Additionally, the results show the largest house at 27,275.0 square feet. It turns out this is a house with only 2.0 bedrooms and 2.0 bathrooms, so the square footage is almost certainly a mistake, or at the very least an extreme outlier. Let’s eliminate all of these outliers and take another look at the graph.

# Drop the rows with the bogus square footage or room counts identified above.
housing = housing.drop([1618, 3405, 10652, 954, 11136, 5103, 916, 10967, 7383, 1465, 8967, 8300, 4997])
x = housing['finishedsqft']
y = housing['lastsoldprice']
plt.scatter(x,y)
plt.show()

This is looking much better. There still may be some outliers in here, and we could investigate them more closely if we really wanted to, but there are no single data points in this view that are distorting the graph, and probably none (that we can see) that would distort a machine learning model.
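
If we wanted to hunt for any remaining outliers more systematically, rather than by eye, one common option is an interquartile-range filter. The snippet below is a minimal sketch of that idea applied to the housing frame we already have; the 1.5 multiplier is a conventional rule of thumb, not a value tuned to this data set.

# Flag values far outside the middle 50% of finishedsqft (1.5 IQRs beyond the quartiles).
q1 = housing['finishedsqft'].quantile(0.25)
q3 = housing['finishedsqft'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
suspects = housing[(housing['finishedsqft'] < lower) | (housing['finishedsqft'] > upper)]
print(suspects[['finishedsqft', 'lastsoldprice']])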

Now that we have cleaned the data, we need to do some feature engineering. This involves transforming the values in the data set into numeric values that machine learning algorithms can use.

Take the “lastsolddate” value, for example. In the current data set, this is a string in the form of “mm/dd/yyyy.” We need to change this into a numeric value, which we can do with the following Pandas command:

# Parse the date string and convert it to a Unix timestamp in seconds.
housing['lastsolddateint'] = pd.to_datetime(housing['lastsolddate'], format='%m/%d/%Y').astype('int')
housing['lastsolddateint'] = housing['lastsolddateint']/1000000000
# Keep only rows where the conversion produced a value.
housing = housing[housing['lastsolddateint'].notnull()]

Now let’s create a checkpoint for our data so that we can refer back to it later.

clean_data = housing.copy()

Additionally, there are a number of fields that we cannot or should not use, so we will eliminate them.

I prefer to create functions to do this sort of work that we might do again and again, as we will see below, in order to simplify the code as we try out different hypotheses.

We remove the columns in remove_list for a number of reasons. Some of them are text values we just cannot do much with (info, address, z_address, zipcode, zpid). The latitude and longitude fields might be useful in some form, but for this example they would just complicate things; there is no reason not to experiment with them in the future, though. The zestimate and zindexvalue fields were actually produced by other data science techniques (probably from Zillow), so using them would be cheating! Finally, we will drop usecode (e.g. house, condo, mobile home), which could be quite useful, but we will leave it out of this example.

def drop_geog(data, keep = []):
    remove_list = ['info', 'address', 'z_address', 'longitude', 'latitude', 'neighborhood', 'lastsolddate', 'zipcode', 'zpid', 'usecode', 'zestimate', 'zindexvalue']
    for k in keep:
        remove_list.remove(k)
    data = data.drop(remove_list, axis=1)
    data = data.drop(data.columns[data.columns.str.contains('unnamed', case=False)], axis=1)
    return data

housing = drop_geog(housing)

Now that we have cleaned up the data, let’s take a look at how a few algorithms manage using it. We will use scikit-learn.

First, we need to split the data into training and testing sets, again using a function that we can reuse later. This ensures that when we evaluate the model, we are actually testing it on data it has never seen before.

from sklearn.model_selection import train_test_split

def split_data(data):
    y = data['lastsoldprice']
    X = data.drop('lastsoldprice', axis=1)
    # Returns (X_train, X_test, y_train, y_test)
    return train_test_split(X, y, test_size=0.2, random_state=30)

housing_split = split_data(housing)

Let’s try Linear Regression first.

from math import sqrt
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV
import numpy as np

from sklearn.linear_model import LinearRegression

def train_eval(algorithm, grid_params, X_train, X_test, y_train, y_test):
    regression_model = GridSearchCV(algorithm, grid_params, cv=5, n_jobs=-1, verbose=1)
    regression_model.fit(X_train, y_train)
    y_pred = regression_model.predict(X_test)
    print("R2: \t", r2_score(y_test, y_pred))
    print("RMSE: \t", sqrt(mean_squared_error(y_test, y_pred)))
    print("MAE: \t", mean_absolute_error(y_test, y_pred))
    return regression_model

train_eval(LinearRegression(), {}, *housing_split)

This train_eval function can be used with any scikit-learn algorithm, for both training and evaluation; this consistency is one of the great benefits of scikit-learn. The first line of the function takes a set of hyperparameters that we want to evaluate against. In this case, we pass in {} so that the model just uses its default hyperparameters. The second and third lines of the function do the actual work, fitting the model and then running a prediction on it. The print statements then show some statistics that we can evaluate. Let’s see how we fared.

R2: 0.5366066917131977
RMSE: 750678.476479495
MAE: 433245.6519384096

The first score, R², also known as the coefficient of determination, is a general evaluation of the model showing the proportion of the variation in the target that can be explained by the features. In general, a higher R² value is better than a lower one. The other two statistics are the root mean squared error (RMSE) and the mean absolute error (MAE); these can only be judged in comparison with the same statistic computed for other models. That said, an R² of 0.53, with errors in the many hundreds of thousands of dollars (for houses that mostly cost one or two million), is not great. We can do better.
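
For reference, all three of these statistics are easy to compute by hand from the test labels and predictions. The sketch below, using NumPy, is only meant to make the definitions concrete; manual_metrics is a hypothetical helper, not part of the tutorial’s pipeline.

import numpy as np

def manual_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    residuals = y_true - y_pred
    # R^2: one minus the ratio of unexplained variance to total variance.
    r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    # RMSE: the square root of the average squared error, in dollars here.
    rmse = np.sqrt(np.mean(residuals ** 2))
    # MAE: the average absolute error, also in dollars.
    mae = np.mean(np.abs(residuals))
    return r2, rmse, mae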

Let’s see how a few other algorithms perform. First, K-Nearest Neighbors (KNN).

from sklearn.neighbors import KNeighborsRegressor
knn_params = {'n_neighbors' : [1, 5, 10, 20, 30, 50, 75, 100, 200, 500]}
model = train_eval(KNeighborsRegressor(), knn_params, *housing_split)

If Linear Regression is mediocre, KNN is terrible!

R2: 0.15060023694456648
RMSE: 1016330.95341843
MAE: 540260.1489399293

Next we will try a Decision Tree.

from sklearn.tree import DecisionTreeRegressor

tree_params = {}
train_eval(DecisionTreeRegressor(), tree_params, *housing_split)

This is even worse!

R2: 0.09635601667334437
RMSE: 1048281.1237086286
MAE: 479376.222614841

Finally, let’s look at Random Forest.

from sklearn.ensemble import RandomForestRegressor

forest_params = {'n_estimators': [1000], 'max_depth': [None], 'min_samples_split': [2]}
forest = train_eval(RandomForestRegressor(), forest_params, *housing_split)

This one is a bit better, but we can still do better.

R2: 0.6071295620858653
RMSE: 691200.04921061
MAE: 367126.8614028794

How do we improve on these results? One option is to try other algorithms; there are many, and some will do better. But we can also fine-tune our results by getting our hands dirty in the data with feature engineering.

Let’s reconsider some of the features that we have in our data. Neighborhood is an interesting field. The values are things like “Potrero Hill” and “South Beach.” These cannot simply be ordered (say, from most expensive to least expensive neighborhood), or at least, doing so would not necessarily produce better results. But we all know that the same house in two different neighborhoods will have two different prices. So we want this data. How do we use it?

Python’s Pandas library gives us a simple tool for creating a “one-hot encoding” of these values. This takes the single “neighborhood” column and creates a new column for each value that appears in it. For each of these new columns (with header names like “Potrero Hill” and “South Beach”), if a row of data has that value for the neighborhood in the original column, the new column is set to 1; otherwise it is set to 0. The machine learning algorithms can now learn a weight associated with each neighborhood, which is applied when a data point is in that neighborhood (the value for that column is 1) and not otherwise (it is 0).
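
As a toy illustration of what this looks like, here is a hypothetical three-row frame run through get_dummies; the output has one column per neighborhood value, marked with 1 (or True, in newer versions of Pandas) in the column matching each row’s original value.

import pandas as pd

# A hypothetical mini data set, just to show the shape of the encoding.
toy = pd.DataFrame({'neighborhood': ['Potrero Hill', 'South Beach', 'Potrero Hill']})
print(pd.get_dummies(toy['neighborhood']))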

First, we need to retrieve our check-pointed data, this time keeping the “neighborhood” field.

housing_cleaned = drop_geog(clean_data.copy(), ['neighborhood'])

Now we can create a one-hot encoding for the “neighborhood” field.

one_hot = pd.get_dummies(housing_cleaned['neighborhood'])
housing_cleaned = housing_cleaned.drop('neighborhood',axis = 1)

We will hold onto the “one_hot” value and add it later. But first, we have to do two more things. We need to split the data into a training set and a test set.

(X_train, X_test, y_train, y_test) = split_data(housing_cleaned)

For our final step, we need to scale and center the data.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)
X_train[X_train.columns] = scaler.transform(X_train[X_train.columns])
X_train = X_train.join(one_hot)
X_test[X_test.columns] = scaler.transform(X_test[X_test.columns])
X_test = X_test.join(one_hot)

housing_split_cleaned = (X_train, X_test, y_train, y_test)

Let’s unpack this step a bit.

First, we apply StandardScaler(). This scales and centers the data by subtracting the mean of the column and dividing by the standard deviation of the column, for every value in each column. This standardizes the data, giving each column a mean of zero and a standard deviation of one. Scaling matters because some fields range up to about 10,000, such as “finishedsqft,” while others range only up to about 30, such as the number of rooms. Putting them all on the same scale means that one feature does not arbitrarily play a bigger role than the others just because it happens to have larger values. For some machine learning algorithms, as we will see below, this is critical to getting even a half decent result.
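
To make the arithmetic concrete, the transform is simply (value minus column mean) divided by column standard deviation. Here is a minimal, self-contained sketch on a tiny synthetic frame (the numbers are made up); mean_ and scale_ are the attributes a fitted StandardScaler exposes for exactly these quantities.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny synthetic frame, just to demonstrate what StandardScaler computes.
X_demo = pd.DataFrame({'finishedsqft': [800.0, 1500.0, 2200.0], 'totalrooms': [3.0, 5.0, 8.0]})
scaler_demo = StandardScaler().fit(X_demo)

# The transform is (value - column mean) / column standard deviation.
manual = (X_demo - scaler_demo.mean_) / scaler_demo.scale_
assert np.allclose(manual, scaler_demo.transform(X_demo))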

Second, it is important to note that we “fit” the scaler only on the training features, X_train. That is, we take the mean and standard deviation of the training data, fit the scaler object with these values, and then transform both the training data and the test data using that fitted scaler. We do not want to fit the scaler on the test data, as that would leak information from the test set into the trained algorithm. We could end up with results that appear better than they are (because the algorithm has, in effect, already seen the test data) or worse than they are (because the test data would be scaled relative to its own distribution rather than to the training data’s).
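
One alternative way to enforce this discipline, not used in this article but worth knowing about, is scikit-learn’s Pipeline: wrapping the scaler and the model together means the scaler is refit only on whatever data .fit() receives, including within each cross-validation fold. A minimal sketch, assuming unscaled training features:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# The pipeline fits the scaler during .fit() and reuses that fitted scaling
# at prediction time, so the test set never influences the scaling parameters.
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=10))
# pipe.fit(X_train, y_train)
# predictions = pipe.predict(X_test)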

Now, let’s rebuild our models with the newly engineered features.

model = train_eval(LinearRegression(), {}, *housing_split_cleaned)

Now, under Linear Regression, the simplest algorithm we have, the results are already better than anything we saw previously.

R2: 0.6328566983301503
RMSE: 668185.25771193
MAE: 371451.9425795053

Next is KNN.

model = train_eval(KNeighborsRegressor(), knn_params, *housing_split_cleaned)

This is a huge improvement.

R2: 0.6938710004544473
RMSE: 610142.5615480896
MAE: 303699.6739399293

Decision Tree:

model = train_eval(DecisionTreeRegressor(), tree_params,*housing_split_cleaned)

Still pretty bad, but better than before.

R2: 0.39542277744197274
RMSE: 857442.439825675
MAE: 383743.4403710247

Finally, Random Forest.

model = train_eval(RandomForestRegressor(), forest_params, *housing_split_cleaned)

Again, a decent improvement.

R2: 0.677028227379022
RMSE: 626702.4153226872
MAE: 294772.5044353021

There is certainly far more that can be done with this data, from additional feature engineering to trying other algorithms. But the lesson from this short tutorial is that seeking more data or poring over the literature for better algorithms may not always be the right next step. It may be better to get the absolute most you can out of a simpler algorithm first, not only for comparison but because data cleaning may pay dividends down the road.

Finally, in spite of its simplicity, K-Nearest Neighbors can be quite effective, so long as we treat it with the proper care.
