A Cartographic Exploration of Housing in Amsterdam

Mike O'Connor
Towards Data Science
8 min read · Aug 22, 2018


I spent some time last week playing around with house price data I pulled from the web, and got some really good feedback on the first post in this series. One of the most common requests was to add some context to the story these numbers tell. Data is much easier to understand when it’s put in terms we are all familiar with, so plotting my findings on a map seemed like a reasonable next step.

Making Maps

This is what the city of Amsterdam looks like if you were to peel away everything except the boundaries between postcodes:

The city government does an excellent job of providing openly accessible data like this for a variety of use cases. If you’re curious, I highly recommend taking a peek at what they’ve got to offer. Combining the data from their site with a more familiar Google Maps base layer yields a much clearer picture:
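For anyone who wants to reproduce these maps, here’s a minimal sketch of the boundary plot in R, assuming the postcode polygons were downloaded from the portal as GeoJSON (the file name is my assumption):

```r
library(sf)
library(ggplot2)

# Read the postcode polygons (assumed file name)
postcodes <- st_read("amsterdam_postcodes.geojson")

# Draw just the outlines: everything peeled away except the boundaries
ggplot(postcodes) +
  geom_sf(fill = NA, color = "grey30") +
  theme_void()
```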

Geocoding

There are a handful of free ways to convert an address into a latitude and longitude. My preferred method uses R and the Google Maps API, which Google offers for free to hobbyists like me. You’re limited to geocoding 2,500 addresses per day before it starts to cost you. Fortunately, our data is only 1,800 rows, so no problem. This is what our houses look like when you place them on a map:
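In case it’s useful, the geocoding step looks roughly like this with the ggmap package. This is a sketch: `listings.csv` and the `address` column are assumptions about my data, and newer ggmap versions also want register_google(key = "...") called first:

```r
library(ggmap)

houses <- read.csv("listings.csv", stringsAsFactors = FALSE)

# geocode() sends each address to the Google Maps API and returns
# a data frame of lon/lat pairs in the same order
coords <- geocode(houses$address, source = "google")
houses$lon <- coords$lon
houses$lat <- coords$lat
```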

It’s not particularly thrilling to look at, so as an easy first step, let’s color the points by asking price:
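Here’s a sketch of how that plot might look in code, continuing from the geocoded `houses` data frame (the `price` column is an assumption about my data):

```r
library(ggmap)

# qmplot() fetches a basemap and overlays the points in one call
qmplot(lon, lat, data = houses, color = price,
       maptype = "toner-lite", legend = "right") +
  scale_color_viridis_c(name = "Asking price (€)")
```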

This doesn’t really tell us much, because houses cost lots of money all over the city. We can see there might be a pocket of really expensive homes in the center near Vondelpark, but we don’t see any major themes emerge beyond that. Let’s look at everyone’s favorite metric, price per square meter, and see if a clearer picture begins to emerge:
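The metric itself is just a ratio of two columns, so the plot only needs a new color aesthetic (again a sketch, with column names assumed):

```r
# Derive the metric, then swap it in as the color
houses$price_per_sqm <- houses$price / houses$sqm

qmplot(lon, lat, data = houses, color = price_per_sqm,
       maptype = "toner-lite", legend = "right") +
  scale_color_viridis_c(name = "Price per m² (€)")
```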

Interestingly, that same pocket of very expensive homes appears again, but this time we see a big swath of the city center get brighter as well. These homes are mostly around the central rings of the old city, and the places nearest to Vondelpark. This makes sense, because the more desirable and historic parts of town would also be more expensive to buy at any size.

But does this difference in price per square meter explain all of the difference between our bright spot and the rest of town? Perhaps another factor is that these homes are simply larger. We can find out by plotting the number of rooms each property has. Again, we see the same bright spot, just on the south side of Vondelpark:

We can conclude from these three plots that the expensive part of town is also the place where we find the largest homes (by room count) and the highest listed prices per square meter. It also highlights why we chose to use the size and number of rooms to predict price in our models. Fancy.

The Coefficients

We care about the coefficients because they tell us about the relationship between our predictors and the price. We already know that price varies geographically, so let’s plot what that geographic variation looks like in our model. A mixed-effects model gives us a different estimate for each postcode, so we can compare neighboring areas against one another. For each postcode, we get a mixture of the fixed and random effects, which gives us our model coefficients (b1, b2, and Intercept) as follows:

Price = b1*SQM + b2*Rooms + Intercept
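In R, a model like this might be fit with lme4; here’s a sketch of the setup, with column names assumed. The random slopes and intercepts by postcode give each area its own b1, b2, and Intercept, pooled toward the city-wide fixed effects:

```r
library(lme4)

# Random intercept and slopes for sqm and rooms, varying by postcode
fit <- lmer(price ~ sqm + rooms + (1 + sqm + rooms | postcode),
            data = houses)

# coef() blends the fixed and random effects into one coefficient
# set per postcode
head(coef(fit)$postcode)
```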

First, let’s look at the coefficient for price per square meter (b1):

Consistent with the dot plots we saw earlier, it looks like the 1075 and 1071 postcodes are the priciest areas in all of Amsterdam. Houses in this part of town are impressive, and very expensive. Similarly, the Jordaan and Haarlemmerbuurt are expensive per square meter. This is where the postcard photos are taken, with crooked old houses lining the canals. The outskirts of town paint a different picture, with the lowest prices found in Zuidoost and Noord. These parts of town are newer, less visited by tourists, and quite a bit more affordable as a result.
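Mapping a coefficient is then a matter of joining the per-postcode estimates onto the boundary polygons from earlier; this is a sketch, assuming both tables share a `postcode` key:

```r
library(sf)
library(ggplot2)

per_pc <- coef(fit)$postcode
b1 <- data.frame(postcode = rownames(per_pc), b1 = per_pc$sqm)

# Shade each postcode polygon by its estimated value per square meter
ggplot(merge(postcodes, b1, by = "postcode")) +
  geom_sf(aes(fill = b1)) +
  scale_fill_viridis_c(name = "b1 (€ per m²)") +
  theme_void()
```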

When looking at the value of an additional room (b2), however, we see an interesting pattern emerge:

Increasing the number of rooms is associated with the expected increase in a home’s price, but only in some places. The same pricey postcodes we saw previously actually have a negative (green) coefficient for the rooms variable. Holding everything else constant, an additional room appears to bring down the value of a house in these areas.

Why might this be? One possible explanation is that the expensive parts of town are occupied by fancy people, and those fancy people prefer to have larger, fancier rooms in their homes. Imagine two identical properties of 200 sqm, one with 7 rooms and one with 8. The property with 8 must, by definition, have smaller rooms (25 sqm each on average, versus roughly 29). Bigger rooms, it seems, are better in these areas.

On the other hand, the city center lights up yellow-orange; there, an additional room in a 100 sqm house could add as much as 40 thousand euros to its listing price. This could be for many reasons, but I think it might be due to tourism. This part of town is where every visitor wants to go, and finding places to put all those people is expensive. Maybe it’s good to be able to fit lots of people in a small space?

After accounting for the value of an added square meter of space and the number of rooms, the intercept is what we add at the end to get the total listing price. When it is positive, we can assume that a house of any size will be worth something to a prospective buyer. Conversely, a negative intercept implies that the square-meter and room terms must contribute enough to make up for it. In this case, we see that the fringes of town all have intercepts at or above about 100k euros, which makes up for the fact that the model estimates their value per square meter to be much lower.

On the other hand, the expensive parts of town have negative intercepts. To overcome this, the houses must be larger and have much higher per-square-meter coefficients. This is consistent with what we saw when we plotted the homes earlier.
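To make that offset concrete, here’s a toy calculation with made-up coefficients (not actual model estimates):

```r
b1        <- 9000     # euros per square meter (hypothetical)
b2        <- -20000   # euros per additional room (hypothetical)
intercept <- -150000  # negative intercept (hypothetical)

# A 120 sqm, 4-room home in this imaginary postcode:
b1 * 120 + b2 * 4 + intercept
# 1,080,000 - 80,000 - 150,000 = 850,000 euros
```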

A note on model performance:

We established a relationship between our different variables in the last post, and have begun to develop a more complete picture of the state of housing in Amsterdam.

We have already illustrated how different neighborhoods command vastly different prices, and shown that price is a function of both a house’s size and its geographic location. The mixed-effects models used in this project are useful because they allow flexibility in the way the model approximates reality. By virtue of their design, we get a tailor-made estimate in each postcode, while still building a model that accounts for information from throughout the city.

To illustrate this, I generated some plots color-coded by the amount of error the models produce in their predictions. Recall that the Ordinary Least Squares (OLS) model did not account for any information about a home’s location (postcode), and assumed that the importance of size and rooms is uniform throughout the entire city. We know this is a naive assumption, and we can see how poorly the model performs as a result: the Mean Absolute Percentage Error (MAPE) is all over the place, varying from around 10% to more than 130%.
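The calculation behind these error maps boils down to something like this sketch, with the OLS baseline and column names as assumptions:

```r
# The naive baseline: one set of coefficients for the whole city
ols <- lm(price ~ sqm + rooms, data = houses)

# Mean absolute percentage error, grouped by postcode
mape_by_postcode <- function(model, data) {
  ape <- abs(data$price - predict(model, data)) / data$price * 100
  tapply(ape, data$postcode, mean)
}

mape_by_postcode(ols, houses)
```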

Gray areas indicate postcodes where MAPE exceeds 50%

Clearly the model has learned to base its predictions on the bulk of the homes it sees, which are mostly located towards the city center. That really hurts the reliability of its predictions for houses on the edges of town. Areas that are not shaded are not represented in our data, so we can’t estimate errors for them.

If, however, we assume that each neighborhood is different and allow the model to flex a bit, its predictions become much more accurate. The resulting improvement is obvious when we plot the errors in the same way as before:
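Under the same assumptions, the comparison is just the helper from above applied to the mixed-effects fit; predict() on an lmer model uses each postcode’s own coefficients by default:

```r
# Per-postcode errors for the mixed-effects model
mape_by_postcode(fit, houses)
```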

It’s worth mentioning that neither of these models was intended for prediction, nor validated against a holdout set. As such, they are much better suited for analyzing coefficients than for making predictions or forecasts. Anyone interested in building a predictive model should properly cross-validate it against ground truth. I do, however, think this is an interesting illustration of the flexibility and interpretability of the approach.

Going Forward:

I’m not quite sure where to go from here. I will certainly continue to search for a home in Amsterdam. I love this city, and the people who live here. The challenge continues to be finding actual sale prices for property here. Until that happens, I might continue by adding some other publicly available data to the models in an attempt to improve performance. ¯\_(ツ)_/¯

For now, I’ve put the bulk of my scraper code in a GitHub repository for anyone curious to try it themselves. It’s ugly, but it works. This post also involved quite a bit of processing and plotting work in R; I’ll try to get those scripts posted there too, once I find time to make them more readable.
