
Predict House Prices with Machine Learning using Python

Regression model trained on 1,883 properties

Image source: Greg Carter Barrister & Solicitor (Fair Use)

Property valuation is an imprecise science. Individual appraisers and valuers bring their own experience, metrics and skills to a job. Consistency is difficult, with UK and Australian-based studies suggesting valuations between two professionals can differ by up to 40%. Crikey!

Perhaps a well-trained machine could perform this task in place of a human, with greater consistency and accuracy.

Let’s prototype this idea and train some ML models using data about a house’s features, costs and neighbourhood profile to predict its value. Our target variable – property price – is numerical, hence the ML task is regression. (For a categorical target, the task becomes classification.)

Watch the YouTube version here.

We’ll use a dataset from elitedatascience.com that simulates a portfolio of 1,883 properties belonging to a real-estate investment trust (REIT). There are 26 columns. Here’s a small snippet:

Snapshot of the original dataset.

The steps are:

  1. EDA & data-processing: explore, visualise and clean the data.
  2. Feature engineering: leverage domain expertise and create new features.
  3. Model training: we’ll train and tune some tried-and-true regression algorithms, such as ridge and lasso regression.
  4. Performance evaluation: we’ll look at common regression metrics like the R²-score and mean absolute error (MAE).
  5. Deployment: batch-run the model, or have data engineers / ML engineers build an automated pipeline around it.

New to AI or ML? Check out my explainer articles [here](https://medium.com/swlh/differential-equations-versus-machine-learning-78c3c0615055) and here.


1. Data exploration and processing

Exploratory data analysis (EDA) helps us understand the data and provides ideas and insights for data cleaning and feature engineering. Data cleaning prepares the data for our algorithms while feature engineering is the magic sauce that will really help our algorithms draw out the underlying patterns from the dataset. Remember:

Better data always beats fancier algorithms!

We start by loading some standard Data Science Python packages into JupyterLab.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score, mean_absolute_error
import pickle

Import the dataset:

df = pd.read_csv('real_estate_data.csv')

Here’s a snapshot of our dataframe again. The shape is (1,883, 26).

Snapshot of the original dataset.

The target variable is tx_price, which represents the price at which the property was last sold.

The remaining 25 columns are features:

  • Property characteristics: tx_year – year when it was last sold, property_type (single-family vs. apartment/condo/townhouse).
  • Property costs: monthly property_tax and insurance.
  • Property features: beds, baths, sqft (floor area), lot_size (incl. outside area), year_built & basement (yes/no).
  • Neighbourhood lifestyle: number of restaurants, groceries, nightlife, cafes, shopping, arts_entertainment, beauty spas & active_life (gym, yoga studios etc.) within 1 mile radius.
  • Neighbourhood demographic: median_age, married (%), college_grad (%).
  • Neighbourhood schools: num_schools (within district) & median_school (median score out of 10 of public schools in district).

1.1 Numerical features

A list of our numerical features can be obtained with the code:

df.dtypes[df.dtypes!='object']

We’ll skip that here. Instead, let’s head straight for some histograms for all our numerical features:

df.hist(figsize=(20,20), xrot=-45)

This looks mostly fine. I’ve left a few comments in red.

Distribution of target variable (tx_price):

We have a right-tailed distribution, which can also be seen by looking at a violin plot.

sb.violinplot(data=df, x='tx_price')
df.tx_price.median()
Output: 392000

The median US house price in October 2020 was $325,000, so the houses amassed in this REIT portfolio appear to be a bit more uptown on average.

Missing values:

df.select_dtypes(exclude=['object']).isnull().sum()
Output:
tx_price              0
beds                  0
baths                 0
sqft                  0
year_built            0
lot_size              0
basement              223
restaurants           0
groceries             0
nightlife             0
cafes                 0
shopping              0
arts_entertainment    0
beauty_spas           0
active_life           0
median_age            0
married               0
college_grad          0
property_tax          0
insurance             0
median_school         0
num_schools           0
tx_year               0

A closer examination of the basement feature suggests that properties without a basement were encoded as NaN rather than 0. So this is really a labelling quirk rather than genuinely missing data. We’ll convert the NaNs to 0 for our algorithms.

df.basement.fillna(0, inplace=True)

_Note: If these NaN’s were genuine missing values, we should create an indicator variable basement_missing (with value 1 when basement=NaN) before converting the NaNs in basement to 0._
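In code, that hypothetical approach might look like the snippet below (a sketch only, since in our dataset the NaNs really mean “no basement”):

# Hypothetical sketch: only appropriate when the NaNs are genuinely missing
df['basement_missing'] = df.basement.isnull().astype(int)  # 1 where basement was NaN
df.basement.fillna(0, inplace=True)                        # then fill as before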

Outliers: The histogram for lot_size suggests we have some pretty large mansion estates!

df.lot_size.sort_values(ascending=False)
Output:
102     1220551
1111     436471
1876     436035
1832     436035
1115     435600
Name: lot_size, dtype: int64

Actually, there’s just one property with a lot size way bigger than the others. Let’s regard it as an outlier and filter it out for our modelling.

df = df[df.lot_size < 500000]

Here’s a correlation heatmap for our numerical features.

# correlations between the numeric features only
corr = df.corr(numeric_only=True)
# mask out upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# heatmap
sb.heatmap(corr*100, 
           cmap='RdBu_r', 
           annot = True, 
           mask = mask)

The triangle block of red suggests that the neighbourhood lifestyle features correlate strongly with one another. For instance, active_life, beauty_spas, cafes, nightlife and restaurants are all highly correlated. This multicollinearity can hurt the performance and interpretability of linear models, which work best when the features aren’t strongly correlated with each other.
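To make those relationships explicit, one quick option (a small sketch, not part of the original workflow) is to list the feature pairs whose absolute correlation exceeds some threshold:

# List numeric feature pairs with |correlation| above 0.6 (threshold chosen arbitrarily)
corr = df.corr(numeric_only=True)
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep upper triangle only
pairs = upper.stack().sort_values(ascending=False)
print(pairs[pairs.abs() > 0.6])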


1.2 Categorical features

Our categorical variables can be listed with the code:

df.dtypes[df.dtypes=='object']

These are: property_type, exterior_walls and roof.

The classes for each of these categorical features can be listed with:

df.property_type.value_counts()
Output:
Single-Family                    1080
Apartment / Condo / Townhouse     802
Name: property_type, dtype: int64
df.exterior_walls.value_counts()
Output:
Brick                  686
Siding (Alum/Vinyl)    503
Metal                  120
Combination            107
Wood                    72
Wood Siding             49
Brick veneer            48
Stucco                  26
Other                   10
Concrete                 8
Concrete Block           7
Block                    7
Asbestos shingle         6
Rock, Stone              5
Masonry                  3
Wood Shingle             2
Name: exterior_walls, dtype: int64
df.roof.value_counts()
Output:
Composition Shingle      1179
Asphalt                   132
Shake Shingle              55
Other                      49
Wood Shake/ Shingles       30
Gravel/Rock                30
Roll Composition           12
Asbestos                    9
Slate                       9
Composition                 5
asphalt                     5
Metal                       4
composition                 4
shake-shingle               3
Built-up                    2
asphalt,shake-shingle       1
Name: roof, dtype: int64

From this, we can clean up the classes a bit. We’ll

  • merge together sparse classes (those with too few observations);
  • merge classes with similar meanings (e.g. subsume Concrete and Block into the more general Concrete Block class); and
  • fix up labelling errors (e.g. composition should be Composition).

df.exterior_walls.replace(['Wood Siding', 'Wood Shingle', 'Wood'],
                           'Wood', inplace=True)
df.exterior_walls.replace('Rock, Stone', 'Masonry', inplace=True)
df.exterior_walls.replace(['Concrete','Block'], 'Concrete Block',
                            inplace=True)
df.exterior_walls.replace(['Concrete Block', 'Stucco', 'Concrete',
                           'Block', 'Masonry', 'Other', 
                           'Asbestos shingle', 'Rock, Stone'],
                           'Other', inplace=True)
df.roof.replace(['Composition', 'Wood Shake/ Shingles', 
                 'Composition Shingle'], 'Composition Shingle',
                  inplace=True)
df.roof.replace(['Other', 'Gravel/Rock', 'Roll Composition',
                 'Slate', 'Built-up', 'Asbestos', 'Metal'], 'Other',
                  inplace=True)
df.roof.replace(['asphalt','asphalt,shake-shingle',
                 'shake-shingle'], 'Shake Shingle', inplace=True)
df.roof.replace('composition', 'Composition', inplace=True)

Missing values:

df.select_dtypes(include=['object']).isnull().sum()
Output:
property_type       0
exterior_walls    223
roof              353
dtype: int64

We’ll want to tell our algorithms that these values are Missing. This is more instructive than simply removing the rows.

for feat in df.select_dtypes(include=['object']):
    df[feat] = df[feat].fillna("Missing")

Let’s plot some bar graphs of our three categorical features.

for feat in df.dtypes[df.dtypes=='object'].index:
    sb.countplot(y=feat, data=df)
    plt.show()   # draw one bar chart per categorical feature

Finally, categorical features must be one-hot encoded for our algorithms. We’ll do that now.

df = pd.get_dummies(df, columns = ['exterior_walls', 
                                   'roof', 
                                   'property_type'])

1.3 Segmentations

Segmentations combine both numerical and categorical features.

Let’s segment all our categorical variables (property_type, exterior_walls and roof) by our target tx_price. This will provide a detailed look at what might drive property values.

# Note: run these segmentations before the one-hot encoding step above,
# while the categorical columns still exist
for feat in df.dtypes[df.dtypes=='object'].index:
    sb.boxplot(data=df, x='tx_price', y=feat)
    plt.show()

Let’s also look at how property and neighbourhood features differ between single-family homes and apartment/condo/townhouse by using a groupby.

# As above, run this before one-hot encoding, while property_type is still a single column
df.groupby('property_type').agg(['mean','median'])

In summary, single-family houses are more expensive and larger, while apartment/condo/townhouse dwellings are closer to entertainment/shops and attract younger residents. This isn’t too surprising. Family homes tend to be in suburbia, while apartments and higher density living tend to be closer to downtown.


2. Feature engineering

A lot of feature engineering rests on domain expertise. If you have a subject matter expert (SME) on real estate to provide guidance, you’ll have a better chance of engineering some awesome feature that will really make your modelling shine.

Here, we’ll engineer four new features:

  • two_and_two: Properties with two bedrooms and two bathrooms. Why? Through domain expertise, you might know that these properties are popular with investors. This is an indicator variable.
  • during_recession: The US property market went into recession between 2010 and 2013. Thus being in this time period might influence pricing to a substantial degree.
  • property_age: This one is kind of common sensical. Newer properties should attract a higher price, right? (Answer: yes but not always.)
  • school_score: We define this interaction feature to be the product of num_schools and median_school, which gives a single score representing the quality of schooling in the area.

The following code creates these new features.

df['two_and_two'] = ((df.beds == 2) & (df.baths == 2)).astype(int)
df['during_recession'] = ((df.tx_year >= 2010) & 
                          (df.tx_year <= 2013)).astype(int)
df['property_age'] = df.tx_year - df.year_built
df.drop(['tx_year', 'year_built'], axis=1, inplace=True)
df['school_score'] = df.num_schools * df.median_school

The new property_age feature arguably supersedes the original tx_year and year_built, which is why we drop them above.

Also, out of interest, we can see that roughly 9% of properties have two beds / two baths, and 26% were sold during the housing recession of 2010–13:

df.two_and_two.mean()
df.during_recession.mean()
Outputs:
0.09458023379383634
0.2635494155154091

Analytical base table: The dataset after applying all of these data cleaning steps and feature engineering is our analytical base table. This is the data on which we train our models.

Our ABT has 1,863 properties and 40 columns. Recall our original dataset had just 26 columns! Generally, ABTs have more columns than the original dataset because of:

  • one-hot encoding, where a new column is created for every class in every categorical feature; and
  • feature engineering.
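A quick check of the dataframe confirms the new shape:

print(df.shape)             # (rows, columns) of the analytical base table
print(df.columns.tolist())  # eyeball the one-hot encoded and engineered columns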

On a related note, a problem in ML is the curse of dimensionality, where your ABT has too many columns/features. This is a major problem in deep learning, where data processing can result in ABTs with thousands of features or more. Principal component analysis (PCA) is a dimensionality-reduction technique where high dimensional correlated data is transformed to a lower dimensional set of uncorrelated components, referred to as principal components (PC). The good news is that the lower dimensional PCs capture most of the information in the high dimensional dataset.
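For a flavour of what that looks like in scikit-learn, here’s a minimal PCA sketch (illustrative only; we don’t use PCA in this project):

from sklearn.decomposition import PCA

# Standardise the features, then project onto the top 10 principal components
X_scaled = StandardScaler().fit_transform(df.drop('tx_price', axis=1))
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_.cumsum())  # cumulative share of variance captured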

I plan to write an article on unsupervised learning techniques, including PCA. Keep your eyes peeled!



3. Modelling

We’re going to train five tried-and-true regression models across three families:

  • regularised linear regression (Ridge, Lasso & Elastic Net)
  • random forests
  • gradient-boosted trees

First, let’s split our analytical base table.

y = df.tx_price
X = df.drop('tx_price', axis=1)

We’ll then split into training and test sets.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

We’ll set up a pipeline object to train. This will streamline our model training process.

pipelines = {
    'lasso' : make_pipeline(StandardScaler(),
              Lasso(random_state=123)),
    'ridge' : make_pipeline(StandardScaler(),
              Ridge(random_state=123)),
    'enet' :  make_pipeline(StandardScaler(),
              ElasticNet(random_state=123)),
    'rf' :    make_pipeline(
              RandomForestRegressor(random_state=123)),
    'gb' :    make_pipeline(
              GradientBoostingRegressor(random_state=123))
}

We also want to tune the hyperparameters for each algorithm.

For all three regularised regressions, we’ll tune alpha (L1 & L2 penalty strengths), along with the l1_ratio for Elastic Net (i.e. weighting between L1 & L2 penalties).

lasso_hyperparameters = {
    'lasso__alpha' : [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10]}
ridge_hyperparameters = {
    'ridge__alpha' : [0.001, 0.005, 0.01, 0.1, 0.5, 1, 5, 10]}
enet_hyperparameters = { 
    'elasticnet__alpha': [0.001, 0.005, 0.01, 0.05, 0.1, 1, 5, 10], 
    'elasticnet__l1_ratio' : [0.1, 0.3, 0.5, 0.7, 0.9]}

For our random forest, we’ll tune the number of estimators (n_estimators), the max number of features to consider during a split (max_features), and the min number of samples required at a leaf (min_samples_leaf).

rf_hyperparameters = {
     'randomforestregressor__n_estimators' : [100, 200],
     'randomforestregressor__max_features' : ['auto', 'sqrt', 0.33],
     'randomforestregressor__min_samples_leaf' : [1, 3, 5, 10]}

For our gradient-boosted tree, we’ll tune the number of estimators (n_estimators), learning rate, and the maximum depth of each tree (max_depth).

gb_hyperparameters = {
      'gradientboostingregressor__n_estimators' : [100, 200],
      'gradientboostingregressor__learning_rate' : [0.05, 0.1, 0.2],
      'gradientboostingregressor__max_depth' : [1, 3, 5]}

Finally, we’ll fit and tune our models. Using GridSearchCV we can train all of these models with cross-validation on all of our declared hyperparameters with just a few lines of code!

# Collect the hyperparameter grids so the loop can look them up by model name
hyperparameters = {'lasso': lasso_hyperparameters, 'ridge': ridge_hyperparameters,
                   'enet': enet_hyperparameters, 'rf': rf_hyperparameters,
                   'gb': gb_hyperparameters}

fitted_models = {}
for name, pipeline in pipelines.items():
    model = GridSearchCV(pipeline, 
                         hyperparameters[name], 
                         cv=10, 
                         n_jobs=-1)
    model.fit(X_train, y_train)
    fitted_models[name] = model

4. Evaluation

I’ve written a dedicated article on popular Machine Learning metrics, including the ones used below.

4.1 Performance scores

We’ll start by printing the cross-validation scores. This is the average performance across the 10 hold-out folds and is a way to get a reliable estimate of the model performance using only your training data.

for name, model in fitted_models.items():
    print( name, model.best_score_ )
Output:
lasso 0.3085486180300333
ridge 0.3165464682513239
enet 0.34280536738492506
rf 0.4944720180590308
gb 0.48797200970900745

Moving onto the test data, we’ll output the R²-score and mean absolute error (MAE).

The R²-score represents the proportion of the variance in the target that the model explains. It tops out at 1 (and can go negative for very poor models); an R² of 1 means the model’s predictions match tx_price perfectly.

The MAE is the average absolute error between the predictions and the actuals, expressed in the same units as the target (dollars here).
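As a quick illustration of what the two metrics mean, here they are on made-up numbers (not our model’s output):

y_true = np.array([300000, 420000, 390000])
y_pred = np.array([310000, 400000, 395000])
# MAE = mean(|y_true - y_pred|);  R² = 1 - SS_residual / SS_total
print(mean_absolute_error(y_true, y_pred))  # 11666.67
print(r2_score(y_true, y_pred))             # ~0.93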

for name, model in fitted_models.items():
    pred = model.predict(X_test)
    print(name)
    print('--------')
    print('R^2:', r2_score(y_test, pred))
    print('MAE:', mean_absolute_error(y_test, pred))
    print()
Output:
lasso
--------
R^2: 0.4088031693011063
MAE: 85041.97658598644

ridge
--------
R^2: 0.4092637562314514
MAE: 84982.89969349177

enet
--------
R^2: 0.40522476546064634
MAE: 86297.65161608408

rf
--------
R^2: 0.5685576834419455
MAE: 68825.53227240045

gb
--------
R^2: 0.5410951822821564
MAE: 70601.60664940192

The winning algorithm is the random forest, with the highest R² score of 0.57 and the lowest MAE. We can actually do better than this.

Remember earlier we removed the tx_year and year_built features after engineering property_age? It turns out that was the wrong choice. Including them would have given the model a big performance bump to R² = 0.81. Moreover, leaving out a few of the highly correlated neighbourhood profile features (i.e. active_life, beauty_spas, cafe, nightlife and restaurants) would have increased performance even more. This highlights the importance of both feature engineering and feature selection.
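Here’s a rough sketch of how you might test that variation yourself (assuming df_alt is a hypothetical ABT rebuilt from the raw CSV that keeps tx_year and year_built, and choosing a few of the correlated lifestyle columns to drop):

# Sketch only: df_alt keeps tx_year and year_built; drop a few correlated lifestyle features
X_alt = df_alt.drop(['tx_price', 'beauty_spas', 'cafes', 'nightlife'], axis=1)
y_alt = df_alt.tx_price
X_tr, X_te, y_tr, y_te = train_test_split(X_alt, y_alt, test_size=0.2, random_state=1234)
model = GridSearchCV(pipelines['rf'], rf_hyperparameters, cv=10, n_jobs=-1)
model.fit(X_tr, y_tr)
print(r2_score(y_te, model.predict(X_te)))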

FYI, here are the hyperparameters of the winning random forest, tuned using GridSearchCV.
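They can be read straight off the fitted grid-search object (a quick aside, not in the original notebook):

rf_winner = fitted_models['rf'].best_estimator_.named_steps['randomforestregressor']
print(fitted_models['rf'].best_params_)  # the winning hyperparameter combination
print(rf_winner)                         # the tuned forest itself, shown below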

RandomForestRegressor(bootstrap=True, 
                      criterion='mse',
                      max_depth=None,
                      max_features='auto',
                      max_leaf_nodes=None,
                      min_impurity_decrease=0.0,
                      min_impurity_split=None,
                      min_samples_leaf=10, 
                      min_samples_split=2,
                      min_weight_fraction_leaf=0.0,
                      n_estimators=200, 
                      n_jobs=None,
                      oob_score=False, 
                      random_state=123,
                      verbose=0, 
warm_start=False)

4.2 Feature importances

Consider the following code.

# Grab the winning random forest from inside the fitted grid-search pipeline
winning_model = fitted_models['rf'].best_estimator_.named_steps['randomforestregressor']

coef = winning_model.feature_importances_
ind = np.argsort(-coef)
for i in range(X_train.shape[1]):
    print("%d. %s (%f)" % (i + 1, X.columns[ind[i]], coef[ind[i]]))

# horizontal bar chart, most important feature at the top
x = range(X_train.shape[1])
y = coef[ind][:X_train.shape[1]]
plt.title("Feature importances")
ax = plt.subplot()
plt.barh(x, y, color='red')
ax.set_yticks(x)
ax.set_yticklabels(X.columns[ind])
plt.gca().invert_yaxis()

This will print a list of features ranked by importance and a corresponding bar plot.

1. insurance (0.580027)
2. property_tax (0.148774)
3. sqft (0.033958)
4. property_age (0.031218)
5. during_recession (0.027909)
6. college_grad (0.022310)
7. lot_size (0.020546)
8. median_age (0.016930)
9. married (0.015506)
10. beauty_spas (0.013840)
11. active_life (0.011257)
12. shopping (0.010523)
13. school_score (0.010032)
14. restaurants (0.009975)
15. median_school (0.007809)
16. baths (0.007009)
17. cafes (0.005914)
18. groceries (0.005578)
19. nightlife (0.004049)
20. arts_entertainment (0.003944)
21. beds (0.003364)
22. exterior_walls_Siding (Alum/Vinyl) (0.001808)
23. exterior_walls_Brick (0.001348)
24. roof_Composition Shingle (0.001239)
25. roof_Missing (0.000778)
26. roof_Shake Shingle (0.000750)
27. exterior_walls_Missing (0.000632)
28. num_schools (0.000616)
29. exterior_walls_Metal (0.000578)
30. basement (0.000348)

The top two predictors by far are

  • cost of monthly homeowner’s insurance and
  • monthly property tax.

This is not entirely surprising, as the insurance premium is often calculated by the insurer based on the cost of replacing the building. This requires – surprise surprise – a good valuation of the worth of the building. Similarly, the amount of property tax is usually tied to the property’s worth.

The next two strongest predictors are the size of the property (sqft) and how old it is (property_age). Larger and newer properties tend to fetch more in the market, hence these results conform to expectations too.


5. Communication & Deployment

Image by ThisisEngineering RAEng (Fair Use)

An executable version of this model (.pkl) can be saved from the Jupyter notebook:

with open('final_model.pkl', 'wb') as f:
    pickle.dump(fitted_models['rf'].best_estimator_, f)

The REIT could pre-process new housing data before feeding it into the trained model. This is called a batch run.
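A minimal sketch of such a batch run (assuming new_properties.csv is a hypothetical file that has already been through the same cleaning and feature-engineering steps as the training data):

# Load the saved model and score a fresh batch of pre-processed properties
with open('final_model.pkl', 'rb') as f:
    model = pickle.load(f)

new_df = pd.read_csv('new_properties.csv')   # hypothetical file, prepared like the ABT (minus tx_price)
predictions = model.predict(new_df)
pd.DataFrame({'predicted_price': predictions}).to_csv('valuations.csv', index=False)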

Once the modelling is done, you need to sell the impact of your data science work.

This means communicating your findings and explaining the value of your models and insights.

Here, data storytelling is crucial to career success.

With business sign-off, the REIT might want to deploy the model into a production environment by engaging data engineers and machine learning engineers.

These data specialists would build an automated pipeline around our model, ensuring that fresh property data can be pre-processed with our cleaning and feature engineering logic, with predictions pushed downstream to decision-makers on an automated and regular basis.

The model would now be in a production environment – taken care of by a 24/7 operations team – that can serve and scale your model to thousands of consumers or more.

Read my deep dive into the realities of working with data at large organisations.


Final comments

We started with a business problem: a company in the business of buying, holding and selling a large portfolio of investment properties for investors (aka a REIT) wanting to bring consistency and improved accuracy to their property valuations.

We trained a winning random forest model on a load of historical data comprising over 1,800 past property transactions.

The REIT can run new data on our trained .pkl file on a manual basis, or an automated pipeline could be built by their engineering department.

Our model was a regression model, where the target variable is numerical.

The other side of the supervised learning coin is classification, where the target variable is categorical. Over here, I trained a binary classification model that predicts whether an employee is at risk of leaving a company.

You’d then sell the value of your work to stakeholders, and perhaps work with the wider organisation to deploy it into a production environment.

Find me on Twitter & YouTube [here](https://youtube.com/@col_shoots), [here](https://youtube.com/@col_invests) & here.

My Popular AI, ML & Data Science articles

  • AI & Machine Learning: A Fast-Paced Introduction – here
  • Machine Learning versus Mechanistic Modelling – here
  • Data Science: New Age Skills for the Modern Data Scientist – here
  • Generative AI: How Big Companies are Scrambling for Adoption – here
  • ChatGPT & GPT-4: How OpenAI Won the NLU War – here
  • GenAI Art: DALL-E, Midjourney & Stable Diffusion Explained – here
  • Beyond ChatGPT: Search for a Truly Intelligence Machine – here
  • Modern Enterprise Data Strategy Explained – here
  • From Data Warehouses & Data Lakes to Data Mesh – here
  • From Data Lakes to Data Mesh: A Guide to Latest Architecture – here
  • Azure Synapse Analytics in Action: 7 Use Cases Explained – here
  • Cloud Computing 101: Harness Cloud for Your Business – here
  • Data Warehouses & Data Modelling – a Quick Crash Course – here
  • Data Products: Building a Strong Foundation for Analytics – here
  • Data Democratisation: 5 ‘Data For All’ Strategies – here
  • Data Governance: 5 Common Pain Points for Analysts – here
  • Power of Data Storytelling – Sell Stories, Not Data – here
  • Intro to Data Analysis: The Google Method – here
  • Power BI – From Data Modelling to Stunning Reports – here
  • Regression: Predict House Prices using Python – here
  • Classification: Predict Employee Churn using Python – here
  • Python Jupyter Notebooks versus Dataiku DSS – here
  • Popular Machine Learning Performance Metrics Explained – here
  • Building GenAI on AWS – My First Experience – here
  • Math Modelling & Machine Learning for COVID-19 – here
  • Future of Work: Is Your Career Safe in Age of AI – here
