Simple House Price Predictor using ML through TensorFlow in Python

Alec Cunningham
Towards Data Science
9 min read · Dec 16, 2018


The real estate profession is moving into the 21st century, and as you can imagine, home listings are flooding the internet. If you have ever looked at buying a home, renting an apartment, or just wanted to see what the most expensive home in town is (we have all been there), then chances are you have been to Zillow, Realtor.com, Redfin, or Homesnap. If you go to Zillow and search for homes near you, you will see a listing like this:

It’s an aesthetic listing for sure, and like all classified ads, it has an asking price; in this case, $379,900. But if you scroll further down you will see a tab titled “Home Value,” and expanding the window will give you a “Zestimate.”

A Zestimate is what Zillow has predicted the value of the house to be; it is their best guess. Zillow gives this definition: "A Zestimate home valuation is Zillow's estimated market value. It is not an appraisal. Use it as a starting point to determine a home's value."

But how does Zillow guess the price so accurately? The difference between the asking price and the Zestimate is only $404. But surely this isn’t done manually, right? Zillow has over 110 million homes in its database (not all currently on the market) and it is simply not feasible to perform these estimations by hand. You might then assume that it is some form of algorithm, and you would be right. But even a traditional algorithm would likely underperform and be unfeasibly complicated given the complexity of home valuation. Home values depend on location, number of bathrooms, square footage, the number of floors, garages, pools, neighboring values, etc. I think you get the point. This is where the topic of this article comes into play, machine learning!

In this article I am going to walk you through building a simple house price prediction tool using a neural network in Python. Get a coffee, open up a fresh Google Colab notebook, and let's get going!

Step 1: Selecting the Model

Before we start telling the computer what to do, we need to decide what kind of model we are going to use. We need to first ask ourselves what the goal is. In this situation we have some inputs about the house (location, number of bathrooms, condition, etc.) and we need to produce an output: price. This is a numeric output, which means we can express it on a continuous scale (more on that later). Given these parameters, we can choose to use a neural network to perform regression. TensorFlow, a Google machine learning framework, is a great base on which to build such a model.
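Purely to build intuition, here is a minimal sketch of what "a neural network doing regression" looks like, written with the tf.keras API (this is not the model we build later, and the input size of 10 features is a made-up placeholder):

import tensorflow as tf

# A tiny regression network: a hidden layer ending in a single linear
# output node, trained to minimize mean squared error against the true price.
sketch = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(1)  # one continuous output: the predicted price
])
sketch.compile(optimizer='adam', loss='mse')

Later in this article we will build the real model with TensorFlow's DNNRegressor instead, but the idea is the same: numeric features in, one continuous number out.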

If you are not familiar with how neural networks work (a basic understanding will help in following along here), simply Google it or find a YouTube video; there are great resources out there.

Step 2: Gathering the Data

The dataset we are going to use comes from Kaggle.com. If you are unfamiliar with Kaggle, this is a great opportunity to venture on over there and check it out! In essence, it is a website that houses datasets for data science applications as well as hosts competitions (often with sizable cash prizes). In fact, Zillow hosted a competition to help improve Zestimate. I am not kidding when I tell you that the cash prize was $1,200,000. Someday…

Anyway, go to Kaggle.com and create an account; it only takes a second and it is free. Find "Edit Profile" and navigate down to "Create New API Token." This will create a file called "kaggle.json" which you will download. This file contains your username and API key, so do not give it to anyone and do not edit the contents (if you do lose it or alter it by accident, you can expire the old one and get a new one).

Navigate over to the “Competitions” tab of Kaggle and search for “House Prices: Advanced Regression Techniques.” You will see that there is a “Data” tab, which is where we will be pulling the data from.

Side Note: You will need to accept the terms and conditions associated with the competition to download the data, but this doesn't require you to actually compete, so not to worry.

Now we have everything we need to get this bread… I mean data.

Step 3: Building the model

The environment in which we are going to build this model is Google Colab, a free online python notebook environment. Type colab.research.google.com into your browser and start a new Python 3 notebook.

For those who have not used a notebook before, you will run each of the cells by hitting Run before you move on to the next one.

Let's begin by installing and importing our dependencies:

import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

TensorFlow is the machine learning framework we will use, pandas will serve as our dataframe, numpy will assist in data manipulation, matplotlib is our data visualization tool, and sklearn will give us a means to scale our data.

This next command will install the Kaggle API which we will use in conjunction with our kaggle.json file to import the data directly to the environment.

!pip install kaggle

The next line uses a built-in Colab file tool which allows us to upload "kaggle.json" to the notebook. Simply execute the following command and use the button that appears to upload the file.

from google.colab import files
files.upload()

The Kaggle API needs that file to be in a specific location for the authentication process. Just trust me on this one. Execute these commands to create the directory, place the file, and restrict its permissions.

!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

Now we are ready to import our data using the API! Go to the data tab on the Kaggle competition page and press the following button, which will copy the specific command you need right to your clipboard:

You can, of course, just copy the code below, but this is how you would get the data from other datasets/competitions. Important Note: you need to place the "!" in front of the command in Colab; you would not need it if you were running locally.

!kaggle competitions download -c house-prices-advanced-regression-techniques

If you ran the line and received “Error 403: Forbidden” then you likely did not accept the terms and conditions of the Kaggle competition.

Execute the following command to see the names of the files in your current directory (what we just downloaded):

!ls

Now for a larger bit of code. What this next cell is doing is reading the .csv file and building a dataframe to house it. We can use this dataframe to interact with our data. This code takes the dataframe, removes the 'Id' column (which we don't need here), and then separates the data into two separate dataframes: one for categorical values and one for continuous values.

#Build the dataframe for train data
train=pd.read_csv('train.csv',encoding='utf-8')
train = train.drop(['Id'], axis=1)
train_numerical = train.select_dtypes(exclude=['object'])
train_numerical.fillna(0,inplace = True)
train_categoric = train.select_dtypes(include=['object'])
train_categoric.fillna('NONE',inplace = True)
train = train_numerical.merge(train_categoric, left_index = True, right_index = True)

So why did we split the data into numerical and categorical columns? This is because they are two separate data types. One consists of numeric data that exists on a continuous spectrum and the other contains strings that are associated with a category. We need to treat them differently here.
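As a quick toy illustration of how select_dtypes does the separating (the two rows are just example values, using two real columns from this dataset):

import pandas as pd

toy = pd.DataFrame({'LotArea': [8450, 9600], 'Neighborhood': ['CollgCr', 'Veenker']})
print(toy.select_dtypes(exclude=['object']))  # continuous/numeric columns only
print(toy.select_dtypes(include=['object']))  # categorical string columns only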

Side Note: If you ever want to take a quick look at the contents of a dataframe, just call the head() function as seen below.

train.head()

Run these next commands to isolate and remove outliers. Sklearn's IsolationForest will flag the outlying rows, and dropping them makes the learning process easier because the remaining data points better represent typical cases.

from sklearn.ensemble import IsolationForest
clf = IsolationForest(max_samples = 100, random_state = 42)
clf.fit(train_numerical)
y_noano = clf.predict(train_numerical)
y_noano = pd.DataFrame(y_noano, columns = ['Top'])
y_noano[y_noano['Top'] == 1].index.values
train_numerical = train_numerical.iloc[y_noano[y_noano['Top'] == 1].index.values]
train_numerical.reset_index(drop = True, inplace = True)
train_categoric = train_categoric.iloc[y_noano[y_noano['Top'] == 1].index.values]
train_categoric.reset_index(drop = True, inplace = True)
train = train.iloc[y_noano[y_noano['Top'] == 1].index.values]
train.reset_index(drop = True, inplace = True)
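If you are curious how aggressive this was, here is an optional sanity check (not part of the original model) of how many rows were flagged and dropped:

# IsolationForest.predict returns 1 for inliers and -1 for outliers
print("Rows dropped as outliers:", (y_noano['Top'] == -1).sum())
print("Rows remaining:", train.shape[0])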

This next bit of code takes the dataframe, converts it into a matrix, and applies what is known as a MinMaxScaler. This process scales the values down to a specified range to make training easier. For example, a list of numbers from 100 to 1000 could be converted to a range of 0 to 1, with 0 representing 100 and 1 representing 1000.

col_train_num = list(train_numerical.columns)
col_train_num_bis = list(train_numerical.columns)
col_train_cat = list(train_categoric.columns)
col_train_num_bis.remove('SalePrice')
mat_train = np.matrix(train_numerical)
mat_new = np.matrix(train_numerical.drop('SalePrice',axis = 1))
mat_y = np.array(train.SalePrice)
prepro_y = MinMaxScaler()
prepro_y.fit(mat_y.reshape(-1, 1))  # -1 lets numpy infer the number of rows
prepro = MinMaxScaler()
prepro.fit(mat_train)
prepro_test = MinMaxScaler()
prepro_test.fit(mat_new)
train_num_scale = pd.DataFrame(prepro.transform(mat_train),columns = col_train_num)
train[col_train_num] = pd.DataFrame(prepro.transform(mat_train),columns = col_train_num)
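To see the 100-to-1000 example above concretely, here is a throwaway demo of MinMaxScaler (the numbers are made up purely for illustration):

from sklearn.preprocessing import MinMaxScaler
import numpy as np

demo = np.array([[100.], [550.], [1000.]])
print(MinMaxScaler().fit_transform(demo))  # [[0.], [0.5], [1.]]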

The following code will hash the categorical features into numerical inputs that our model can understand. Hashing is a topic for another post; give it a Google search if you are curious.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
COLUMNS = col_train_num
FEATURES = col_train_num_bis
LABEL = "SalePrice"
FEATURES_CAT = col_train_cat
engineered_features = []
for continuous_feature in FEATURES:
    engineered_features.append(
        tf.contrib.layers.real_valued_column(continuous_feature))
for categorical_feature in FEATURES_CAT:
    sparse_column = tf.contrib.layers.sparse_column_with_hash_bucket(
        categorical_feature, hash_bucket_size=1000)
    engineered_features.append(tf.contrib.layers.embedding_column(
        sparse_id_column=sparse_column, dimension=16, combiner="sum"))
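If you want a rough feel for what "hashing into buckets" means here, the toy sketch below mimics the idea (this is not TensorFlow's actual hash function, so the ids will not match what the model uses internally):

# Each category string gets mapped to one of hash_bucket_size integer ids;
# the embedding column then learns a 16-dimensional vector per id.
def toy_hash_bucket(value, hash_bucket_size=1000):
    return hash(value) % hash_bucket_size

print(toy_hash_bucket("CollgCr"))  # some id between 0 and 999
print(toy_hash_bucket("Veenker"))  # almost certainly a different id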

Now we will isolate the input and output variables and then split them into a training and test set. A common rule of thumb is 80% train and 20% test, which I have done below (test_size=0.2). The result is a set of inputs and outputs for both training and testing.

# Build the training set and the prediction set
training_set = train[FEATURES + FEATURES_CAT]
prediction_set = train.SalePrice
# Split the train and prediction sets into test train sets
x_train, x_test, y_train, y_test = train_test_split(
    training_set[FEATURES + FEATURES_CAT],
    prediction_set, test_size=0.2, random_state=42)
y_train = pd.DataFrame(y_train, columns = [LABEL])
training_set = pd.DataFrame(x_train, columns = FEATURES + FEATURES_CAT).merge(y_train, left_index = True, right_index = True)
y_test = pd.DataFrame(y_test, columns = [LABEL])
testing_set = pd.DataFrame(x_test, columns = FEATURES + FEATURES_CAT).merge(y_test, left_index = True, right_index = True)
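An optional quick check (not part of the original model) that the 80/20 split came out as expected:

print("Training rows:", training_set.shape[0])
print("Testing rows:", testing_set.shape[0])  # roughly a quarter of the training count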

Now we can combine the continuous and categorical features back together and then construct the model framework by calling the DNNRegressor function and passing in the features, hidden layers, and desired activation function. Here we are using three layers, each with a decreasing number of nodes. The activation function is "relu", but try using "leaky relu" or "tanh" to see if you get better results!

training_set[FEATURES_CAT] = training_set[FEATURES_CAT].applymap(str)
testing_set[FEATURES_CAT] = testing_set[FEATURES_CAT].applymap(str)

def input_fn_new(data_set, training = True):
    continuous_cols = {k: tf.constant(data_set[k].values) for k in FEATURES}
    categorical_cols = {k: tf.SparseTensor(
        indices=[[i, 0] for i in range(data_set[k].size)],
        values = data_set[k].values,
        dense_shape = [data_set[k].size, 1]) for k in FEATURES_CAT}
    # Combines the dictionaries of the categorical and continuous features
    feature_cols = dict(list(continuous_cols.items()) + list(categorical_cols.items()))
    if training == True:
        # Converts the label column into a constant Tensor.
        label = tf.constant(data_set[LABEL].values)
        # Outputs the feature columns and labels
        return feature_cols, label
    return feature_cols

# Builds the Model Framework
regressor = tf.contrib.learn.DNNRegressor(feature_columns = engineered_features,
                                          activation_fn = tf.nn.relu,
                                          hidden_units=[250, 100, 50])
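If you want to experiment with the activations mentioned above, the only change is the activation_fn argument when constructing the regressor. For example, a variation you could try (not part of the main walkthrough):

regressor_tanh = tf.contrib.learn.DNNRegressor(feature_columns = engineered_features,
                                               activation_fn = tf.nn.tanh,  # or tf.nn.leaky_relu
                                               hidden_units=[250, 100, 50])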

Executing the following function will begin the training process! It will take a few minutes, so get a stretch in!

Step 4: Training the Model

regressor.fit(input_fn = lambda: input_fn_new(training_set) , steps=10000)

Let’s visualize the results! This block of code will import our data visualization tool, calculate the predicted values, grab the actual values, and then plot them against each other.

Step 5: Evaluating the Model and Visualizing the Results

import matplotlib.pyplot as plt
import matplotlib
import itertools

ev = regressor.evaluate(input_fn=lambda: input_fn_new(testing_set, training = True), steps=1)
loss_score = ev["loss"]
print("Final Loss on the testing set: {0:f}".format(loss_score))
reality = pd.DataFrame(prepro.inverse_transform(testing_set.select_dtypes(exclude=['object'])), columns = COLUMNS).SalePrice
y = regressor.predict(input_fn=lambda: input_fn_new(testing_set))
predictions = list(itertools.islice(y, testing_set.shape[0]))
predictions = pd.DataFrame(prepro_y.inverse_transform(np.array(predictions).reshape(-1, 1)))
matplotlib.rc('xtick', labelsize=30)
matplotlib.rc('ytick', labelsize=30)
fig, ax = plt.subplots(figsize=(15, 12))
plt.style.use('ggplot')
plt.plot(predictions.values, reality.values, 'ro')
plt.xlabel('Predictions', fontsize = 30)
plt.ylabel('Reality', fontsize = 30)
plt.title('Predictions x Reality on dataset Test', fontsize = 30)
ax.plot([reality.min(), reality.max()], [reality.min(), reality.max()], 'k--', lw=4)
plt.show()

Not bad! To get better results, try changing the activation function, the number of layers, or the size of the layers. Perhaps try another model entirely. This is not a huge dataset, so we are limited by the amount of information we have, but these techniques and principles can be transferred to larger datasets or more complex problems.

Feel free to contact me regarding any questions, comments, concerns, or suggestions.

I would also like to give a shoutout to Julien Heiduk, whose model this is a reduction of. Go check out his Kaggle here: https://www.kaggle.com/zoupet
