How to go from bias to buyer

Mathias Schläffer
Towards Data Science
12 min read · Jan 25, 2019


Photo by Arvid Malde on Unsplash

There are many articles on estimating house prices, and many articles about web scraping. While both play a part in this article, I also want to focus on the typical biases that are so tempting to fall for even, or especially, when using data for decision support.

I recently left the beautiful city of Amsterdam and a beloved, bright little apartment in a great neighborhood to live a parasitic life in the freezing cold basement of my in-laws in Oslo. Not exactly the makings of a great success story.
But since I knew this was coming, I started web scraping the local real estate market months in advance, to keep my parasitic existence as short as possible, find my own home again, and do so with (the illusion of) superior information.

I have used web scraping in the past to collect Google News headlines, which was fairly easy because the structure of the website I was accessing was so simple. This time the structure was a bit more complicated, so let me explain the website first:
The website finn.no is the Norwegian go-to site for pretty much everything. It is a second-hand marketplace just as much as a real estate or job market. In the real estate section, the website presents about 50 thumbnails of apartments per page, and you can click through several dozen pages.

Typical structure of the real estate website

The thumbnails contain little more information than the size and price of a housing unit. Only once you follow the link behind a thumbnail to the ad's own page do you get all the relevant information, so I had to set up the web scraper in two rounds.

The first part collects the "href", or link, from every thumbnail on every page and writes them into a simple list. In the second part, a loop iterates through the collected links, accessing the specific web addresses one by one and collecting all the details from each ad page, where the more interesting information is stored.
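
To make the first round concrete, here is a minimal sketch of such a link collector. The search URL, query parameters and the CSS class of the thumbnail links are assumptions for illustration, not the actual finn.no markup:

import requests
from bs4 import BeautifulSoup

ad_links = []
for page in range(1, 51):
    # the search URL, query parameters and link class are illustrative placeholders
    response = requests.get('https://www.finn.no/realestate/homes/search.html',
                            params={'location': 'Oslo', 'page': page})
    soup = BeautifulSoup(response.text, 'html.parser')
    for thumbnail in soup.find_all('a', class_='ads__unit__link'):
        ad_links.append(thumbnail.get('href'))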

Detailed information after accessing an individual ad

So much for the high-level structure. I used the BeautifulSoup package for the web scraping and collected the attributes with many try/except-Exception-pass statements. These work pretty much as they sound: the code tries to process a certain command, and if it does not work, it simply moves on. This is perfect when working with content that varies slightly, such as web pages that share a very similar structure but do not all feature exactly the same elements. No construction year specified? No problem, the code moves on without errors.

I collected the detailed information in a nested dictionary. On the highest level, the href works as the key and the value is the collection of attributes. This collection consists again of multiple keys (attribute names) and values (attribute values).
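
A sketch of the second round could look like the following, again with placeholder markup and attribute names rather than the real finn.no structure:

import requests
from bs4 import BeautifulSoup

details = {}
for link in ad_links:
    soup = BeautifulSoup(requests.get(link).text, 'html.parser')
    attributes = {}
    try:
        # the CSS class is a placeholder for whatever element holds the price
        price_text = soup.find('div', class_='price').get_text()
        attributes['price'] = int(''.join(c for c in price_text if c.isdigit()))
    except Exception:
        pass
    try:
        # facilities such as "Balkong" collected as {attribute: 1} pairs (markup is assumed)
        for facility in soup.find('ul', class_='facilities').find_all('li'):
            attributes[facility.get_text(strip=True)] = 1
    except Exception:
        pass
    details[link] = attributes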

At the top in the back is the high-level key (the web address) and its value. In the front, this value is expanded, showing the lower-level key-value pairs.

Dictionaries are flexible in format and convert very quickly into a neat data frame, using the high-level key as an index and the lower-level keys as columns. They are also handy when collecting non-numeric attributes and converting them into dummy variables. Such attributes can easily be collected as a key-value pair like {balcony: 1}. Later on, when converting the dictionary into a data frame, the key turns into a separate column, and the data points that don't contain the balcony key-value pair get an empty entry. With a one-liner, pandas' df.fillna(0) command, we can swap these empty entries for a 0 and instantly have a binary dummy column indicating the presence or absence of a balcony.
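
In pandas, that conversion takes only a couple of lines (the column names are just the attribute keys collected above):

import pandas as pd

# rows are indexed by the ad's href, columns named after the collected attribute keys
df = pd.DataFrame.from_dict(details, orient='index')
# attributes an ad never mentioned (e.g. no balcony) come out empty; treat them as absent
df = df.fillna(0)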

I ran the web scraper over all apartments in Oslo once, then added a filter on "new today" and set up a task scheduler to run it daily, extending the dataset day by day with around 50–100 new data points. Oslo is not a huge city and winter is a slow period for the real estate market, but after a few months the dataset contained about 5,500 observations. Enough to get some insight and better information on the market.

Once you have found an interesting ad on Finn, you can see average m² prices of housing units sold in the same area over time and check how your chosen unit compares to this average.

Finn shows average prices per m² over time in an area. The green dot displays the housing unit you inspect.

It is tempting to take this as an indicator of whether a house is relatively cheap or expensive, and I have had realtors tell me that a certain flat has a m² price "which is lower than the average in that neighborhood". But if you see that the average price is 75 000 NOK per m² and the unit in question is offered at 65 000 NOK per m², it does not necessarily mean it is a "steal". Usually the price per m² is higher for smaller apartments and lower for larger ones. So it could be that the average displayed for an area contains many small apartments with a very high price per m². If the house you are looking at with a "below average" m² price is significantly larger than the average, it could still be more expensive per m² than comparably large apartments in the same area. Let's see if m² prices are truly decreasing with size in my dataset:
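
One way to produce such a picture is a quick scatter of the price per m² against size ('price' and 'size_m2' are my placeholder column names from the scraping step):

import matplotlib.pyplot as plt

# price per m² against size; column names follow the scraped attributes above
df['price_per_m2'] = df['price'] / df['size_m2']
df.plot.scatter(x='size_m2', y='price_per_m2', alpha=0.3)
plt.show()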

Price per m² by size

The trend seems especially pronounced for small apartments up to about 45 m², and it seems almost impossible to find a 25 m² studio for less than 100 000 NOK per m² (at the time of writing, almost 10 000 euros). But small apartments are also often clustered in the most attractive locations, so naturally their m² prices are higher for that reason alone. Looking at the same picture within a single area reduces the impact of location differences and shows that the effect persists: smaller apartments feature higher m² prices. Everyone knows that comparing averages is tricky, but it is worth thinking about the variables at play to avoid crude misjudgments.

I discovered another very handy feature of my database: while exploring the data, I noticed apartments that had been uploaded more than once, with an adjusted price. I jumped to the conclusion that this would tell me the direction of the general market trend. As we can see, almost all of these were adjusted downwards, with a big spike around -5%!
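
One way to surface these cases is to match re-uploaded ads on their address and compare the latest price to the first one; in this sketch, 'address' and 'scrape_date' are placeholder column names for the scraped address and the day of collection:

import matplotlib.pyplot as plt

# ads that share an address are treated as re-uploads (an assumption; any stable identifier would do)
dupes = df[df.duplicated(subset='address', keep=False)]
changes = (dupes.sort_values('scrape_date')
                .groupby('address')['price']
                .agg(lambda p: (p.iloc[-1] - p.iloc[0]) / p.iloc[0]))
changes.plot.density()
plt.show()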

Density plot of the price adjustments done on re-uploaded ads.

What does it say about the entire market that apartments are re-uploaded with a lower price? Was I right to infer, for a brief moment, that prices are going down? No, I was not. Imagine a distribution of apartments ranging from underpriced to overpriced. The underpriced apartments sell quickly, maybe even above the initial asking price. The middle of the spectrum, the units with a fair asking price, also sell without much trouble, so both the underpriced and the fairly priced part of the spectrum disappear from sight. We only observe the units that were priced too high and need to try again with a more competitive price tag; we never see the underpriced ones re-uploaded with the higher price they were actually sold for. A form of survivorship bias is at play here, not necessarily the collapse of the Norwegian real estate market.

I still think this is useful information to have if you are interested in a particular apartment and observe that it did not sell at a certain price and came back cheaper. Imagine going to a viewing for such an apartment: you already know that the market rejected it at a certain price, so you would probably not make a bid in the area of the first, unsuccessful price. It might also give a tiny bit of negotiation power to know that the seller side is potentially growing impatient.

After a few weeks I had found an ad for an apartment I really liked. It seemed relatively cheap, but with only the average m² price shown on Finn.no, how could I be certain? I thought a predictive model trained on my dataset and predicting the price of the apartment of my interest might help. After cleaning the data, it is always a good idea to explore it a bit and look for features that can be transformed or dropped for a better fit and performance. Let's have a first look at the general relationship between m² and the price of housing in Oslo, with a linear fit laid on top of it.

We observe the expected positive relationship and a common attribute of scatterplots of the housing market: the variance in prices increases with the size of the unit. While the estimator in a regression model with such a spreading shape remains unbiased and predictive models would still be workable, there is a simple way to account for this phenomenon (called heteroscedasticity): instead of modeling a linear relationship between size and price, we can take the logarithm of both variables.
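
The transformation itself is quick with numpy (again using my placeholder column names; price_log becomes the target used further below):

import numpy as np
import matplotlib.pyplot as plt

df['price_log'] = np.log(df['price'])      # log price becomes the new target
df['size_log'] = np.log(df['size_m2'])     # same transformation for the size
df.plot.scatter(x='size_log', y='price_log', alpha=0.3)
plt.show()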

Looking at the logarithmic scatterplot, we see that the spread of the scatter has decreased.

Next, I had to select meaningful features from the wide range of collected attributes. I have seen people use scatterplots for feature selection: they explore an alleged causality between the explanatory and the target variable by eyeballing the plot. Sometimes the conclusion is that a feature has no impact on the target variable because the graph shows a horizontal line fit. But a scatterplot is simply a two-dimensional representation of a multi-dimensional relationship and, as such, it doesn't tell the whole story. Imagine this example: let's say we have houses of different sizes in different locations. The smallest one is directly in the popular city center. The others are located at increasing distance from the city center and also get increasingly larger with distance. The negative effect of the increasing distance to the center perfectly cancels out the price increase that the additional floor space would bring. If you plot these observations with the price on the y-axis and the size on the x-axis, you could observe a horizontal line and wrongly conclude that size has absolutely no impact on price. We preach that correlation does not necessarily imply causation, but the absence of correlation also does not imply the absence of underlying causal effects. A plot shows only two dimensions in isolation and gives no information on how features interact with each other.
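
The point is easy to reproduce with a small synthetic example (the numbers below are made up purely for illustration): if the size effect and the distance effect offset each other, size and price are almost uncorrelated, yet a model that includes both variables recovers a strong size effect.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
size = rng.uniform(30, 120, 500)                          # floor space in m²
distance = (size - 30) / 10 + rng.normal(0, 0.5, 500)     # km to the center, grows with size
price = 50_000 * size - 500_000 * distance + rng.normal(0, 100_000, 500)
toy = pd.DataFrame({'size': size, 'distance': distance, 'price': price})

print(toy[['size', 'price']].corr())          # correlation of size and price is close to zero
model = LinearRegression().fit(toy[['size', 'distance']], toy['price'])
print(model.coef_)                            # yet both effects are recovered: roughly +50 000 and -500 000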

To investigate such interactions, a correlation matrix is often used. In my case, however, it is important to keep in mind how the data came into existence. Not every observation had a value for every feature on the Finn website; the realtors simply listed whatever they thought was worth listing. Seeing that many variables appear frequently with the exact same wording, there might be a pre-selection of attributes they can choose from, but many "exotic" features appear only once for a specific ad, suggesting they were entered manually. Some seemingly informative variables therefore turn out to be unreliable and show a misleading degree of correlation with other features. Some ads, for example, stated that the house or apartment is "central", and a major driver of prices is of course the area it is situated in, or as every realtor ever preaches: "location, location, location!"

But not every housing unit that is centrally located also received this attribute in the dataset, only those for which the realtor wanted to emphasize this quality in particular.
Better than relying on such a weak feature is to create zip code dummy variables from the address of the housing unit. We cannot list each individual factor that makes a certain location attractive, and we certainly cannot collect them all from Finn, but we somehow know or accept that some post codes are just generally good locations. Zip code dummies capture the entire unexplained attractiveness of a location, which includes being central, having great schools or a low crime rate. Since post codes are usually geographically clustered (0251 is next to 0252 and so on), we use a higher level, say the first three digits, to group post codes and make sure we have enough observations per post code dummy. To get the post code dummies, you can use the very handy get_dummies function:

df = pd.concat([df, pd.get_dummies(df.post_code3)], axis=1)
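
The post_code3 column itself can be derived from the address string; a minimal sketch, assuming the scraped address contains a four-digit Oslo post code and is stored in a column named 'address':

# e.g. "Exempelgata 5, 0555 Oslo" -> "0555" -> "055" (address format is an assumption)
df['post_code3'] = df['address'].str.extract(r'(\d{4})')[0].str[:3]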

It is also important to keep in mind that ads are ads (duh) and as such only list positive attributes, sometimes conveniently leaving out the ones that might be perceived negatively (ads for ground-floor apartments often "forgot" to state which floor they were on). There is very little I can do about this. But even without this complication, two apartments with equal attributes in the data do not have to be alike. No attribute in my data captures whether the view out of the windows is blocked by the outside wall of the neighboring building, or whether the bathroom looks like a half-rotten fungus swinger party from the 70s. The data captures many things, but aesthetics are not among them. So the prediction will not take the condition of an apartment into account either.

The continuous variables, on the other hand, are quite reliable, as they consist of the most basic information.

From left to right: floor space, price (in NOK), running costs, floor, energy label (converted from letters), construction year.

Of the categorical variables, I found the most useful to be: dummies for the type of housing and ownership, zip codes, and the presence of a balcony, garden, fireplace (very common and popular in Norway → 25% of all observations!), elevator, garage or parking spot. I also kept some more "exotic" dummies if they were stated frequently enough (in more than 10% of observations) and were not meaningless, like "basic access to the sewage system". These variables include, for example, "janitor service".
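
Checking whether a dummy is frequent enough is a one-liner per column; a small sketch with illustrative column names and the 10% cut-off from above:

dummy_columns = ['balcony', 'garden', 'fireplace', 'elevator', 'janitor_service']  # illustrative names
dummy_share = df[dummy_columns].mean()                 # share of ads featuring each attribute
keep = dummy_share[dummy_share > 0.10].index
df = df.drop(columns=[c for c in dummy_columns if c not in keep])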

I ran two different models: a linear regression in log specification, and a gradient boosting model. I separated the one observation I wanted to predict before training the models, so that the observation is truly "new". For the linear model it is not necessary to do a train-test split, as overfitting is not a concern if you restrict the functional form of your estimator.
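
Setting the ad of interest aside is straightforward, since the data frame is indexed by the ad's href (the variable names here are my own):

target_href = 'https://www.finn.no/realestate/homes/ad.html?finnkode=...'  # the ad I wanted to predict
ad_for_pred = df.loc[[target_href]].drop(columns='price_log')              # features of the "new" observation
df = df.drop(index=target_href)                                            # remove it from the training data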

from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

# target is the log price, all remaining columns serve as features
Y = df.price_log
X = df.drop(columns='price_log')

regressor = LinearRegression()
regressor.fit(X, Y)
print('Linear Regression R squared: %.4f' % regressor.score(X, Y))
mse_linear = mean_squared_error(Y, regressor.predict(X))
print("MSE: %.4f" % mse_linear)
regressor.predict(ad_for_pred)

For the gradient boosting model I tried a few parameter settings, but did not spend too much time optimizing them, as the prediction did not change significantly with each change. The R² on the test set was a bit higher than the R² of the linear model on the entire set (0.89 vs 0.85), and the mean squared error decreased (0.012 vs 0.014). The model is easy to set up:

from sklearn.model_selection import train_test_split
from sklearn import ensemble

# hold out 30% of the data to evaluate the boosted trees on unseen observations
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
params = {'n_estimators': 500, 'max_depth': 4, 'min_samples_split': 30,
          'min_samples_leaf': 10, 'learning_rate': 0.1, 'loss': 'ls'}
regressor_gr_boost = ensemble.GradientBoostingRegressor(**params)
regressor_gr_boost.fit(X_train, y_train)
mse_gb = mean_squared_error(y_test, regressor_gr_boost.predict(X_test))
print("MSE: %.4f" % mse_gb)
print('Gradient Boosting training set R squared: %.4f' % regressor_gr_boost.score(X_train, y_train))
print('Gradient Boosting test set R squared: %.4f' % regressor_gr_boost.score(X_test, y_test))
regressor_gr_boost.predict(ad_for_pred)
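
Since both models were trained on the log price, the raw predictions have to be exponentiated before they can be compared with the asking price; a small sketch, with an illustrative asking price:

import numpy as np

asking_price = 3_500_000                                            # price listed on Finn (illustrative)
predicted = np.exp(regressor_gr_boost.predict(ad_for_pred))[0]      # back from log NOK to NOK
print('Prediction vs asking price: %+.1f%%' % (100 * (predicted / asking_price - 1)))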

According to the linear prediction, the apartment of my interest was worth about 5% more than the asking price listed on Finn; the gradient boosting model even showed an 8% difference. The exact prediction is of minor importance in this case and certainly not correct to the euro, given the data quality flaws discussed before. But we get a clear indication that the apartment is at the low end of the price range given its features.

I decided to go to the viewing to see if the apartment was indeed great value for money. It was located on a lively street with many shops and cafés, but also some traffic and even a tram line, which was probably a reason for the lower price. I still thought it was relatively cheap, as my prediction gave me a sense of the prices I would have to expect if I were to look for a comparable apartment on a quieter side street. In fact, I ended up buying it (true story, not just for the narrative of this article). The data analysis did not take the decision of which apartment to buy off my shoulders, but it provided decision support. And when I take out a loan of a few hundred thousand euros, I'll happily take any support I can get.
