SE Apartment Project

A Data Science approach to Stockholm's Apartment Prices

Turning data into insights and predictors with Machine Learning

Gustaf Halvardsson
Towards Data Science
7 min read · Dec 22, 2019


The goal of this story and a recap of the last story.

In the previous story, we explored how we could acquire and clean data. While the cleaning and filtering took a lot longer than expected, we ended up with a sparkly clean dataset. To jog your memory, this is our dataset now:

Sample of 5 out of 14 171 apartment listings with 9 different features.

Splitting the data for validation.

We have roughly 14 000 entries in our dataset, which will serve us well when training our ML model. To train the model, we need a lot of data, so-called training data. But to check the model's performance, we also need data it has never seen, so-called validation data. The balance here is important: if we have too few rows of training data, our model will not have enough to learn from and will be inaccurate. On the flip side, if we do not have enough rows of validation data, we can't be as certain of our model's accuracy when we later test it. Note: We could also use Cross-Validation, but because we have a good amount of data, the gain would be minimal.

For these reasons, I chose a 75/25 split of the data, meaning 75% of the rows go towards training and 25% towards validation.
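As a minimal sketch, the 75/25 split above could be done with scikit-learn's train_test_split. The function name split_listings and the feature list are my own illustrative choices, not code from the project:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical feature names mirroring the dataset described in the article.
FEATURES = ["Date", "Size", "NearbyPOIs", "Rooms", "Latitude", "Longitude"]

def split_listings(df: pd.DataFrame, target: str = "Price"):
    """Split the listings into 75% training and 25% validation data."""
    X = df[FEATURES]
    y = df[target]
    # test_size=0.25 gives the 75/25 split; a fixed seed makes it reproducible.
    return train_test_split(X, y, test_size=0.25, random_state=42)
```

With ~14 000 rows, this leaves roughly 10 500 rows for training and 3 500 for validation.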

Building the Machine Learning Model.

The first step in training my Machine Learning (ML) model is deciding which features to include. From intuition and previous knowledge of apartments, I decided to choose: ‘Date’, ‘Size’, ‘NearbyPOIs’, ‘Rooms’, ‘Latitude’, and ‘Longitude’. Choosing the right features to include and exclude will be very important for our model's performance, so we will return to tune this later. But let's start with this and see where we end up.

How to model it

This problem is a regression task and is probably very prone to overfitting. Therefore I ultimately decided to use a Random Forest Regressor. Because it's an ensemble method (a method that averages the predictions of many decision trees), it works very well to counter overfitting and usually produces very good results with little tuning of its hyperparameters.

Because it's a decision-tree model, we also do not need to normalize the data to prevent certain features from taking precedence over others.

We now train the model with our training data and test it against the validation data.
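The training step could be sketched like this. The synthetic data below only stands in for the real training split, and the hyperparameter values are illustrative defaults, not the tuned values used in the project:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the training split: two feature columns
# (think Size and Rooms) and a price-like target driven mostly by size.
rng = np.random.default_rng(0)
X_train = rng.uniform(20, 120, size=(500, 2))
y_train = X_train[:, 0] * 50_000 + rng.normal(0, 10_000, size=500)

# 100 trees, fixed seed for reproducibility; no feature scaling needed
# since random forests are invariant to monotone feature transformations.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
```

After fitting, model.predict() on the validation features gives the price estimates we score below.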

I wrote a function that calculates the accuracy as Mean Absolute Error or Mean Absolute Percentage Error. Follow the links if you want to learn more, but the short story for both measurements is that the closer to 0, the better.
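A minimal sketch of such an accuracy function, written from the standard definitions of the two metrics rather than the project's exact code:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: average absolute deviation, here in SEK."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs(y_true - y_pred))

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error: average relative error, in percent."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
```

For example, predictions of 110 and 190 against true prices of 100 and 200 give an MAE of 10 and a MAPE of 7.5%.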

Hmm, not exactly a great predictor for apartment pricing. What this means is that our model's predictions will be, on average, 24% or over 1 000 000 SEK off. We can definitely improve this. So how do we optimize it?

Feature Engineering and Optimizations

Using our trained model, I plotted the feature importances to discover which features are best to include.

import pandas as pd

# Plot the importance of each feature in the trained model as horizontal bars.
feat_importances = pd.Series(trainedModel.feature_importances_, index=trainFeatures.columns)
feat_importances.nlargest(numOfFeatures).plot(kind='barh')
Which feature has the biggest impact on the price. The higher the value, the bigger the impact. This means Size affects the pricing the most.

I ultimately decided to choose ‘Size’, ‘NearbyPOIs’, ‘Latitude’, and ‘Longitude’ because, as the graph shows, they had the biggest impact on the Price.

I left out Rent, since a situation where we know the Rent but not the Price, or vice versa, did not seem necessary to cover.

There are a number of further optimizations that I applied here, but they are out of scope for this article. After these optimizations, we ended up with the following performance:

Shows how close the predictions were to actual pricing (in SEK). X-Axis: What the prediction model predicted the price would be. Y-Axis: Actual Price for the apartment. Each dot represents one apartment listing and the red line is the linear fit of all predictions.

Much better: our model can now predict the sale price to within, on average, 390 000 SEK or 8.4% of the actual price.

Something interesting to note here is that this accuracy uses only 4 features: ‘Size’, ‘NearbyPOIs’, ‘Latitude’, and ‘Longitude’. Since NearbyPOIs, Latitude, and Longitude are all derived from the address, we in reality only need 2 pieces of information. Why is this impressive? It means our model only needs an apartment's Size and Address to reach an accuracy of 91.6%.

Why don't we just give the model all of the features? The more the better, right?
No, doing this usually causes overfitting: the model learns the training data too well, so noise and outliers are also baked into its predictions, rather than it learning the general formula we are looking for.

Conclusion: Turning Data into Insights.

At this point, we have clean and organized data and an ML model that predicts the sale price. To conclude this study, I want to summarize the useful insights this gave us.

Let's start with the actual data. This is a summary of all the rows of data we have. It shows, among other things, the mean, standard deviation, and max and min values in our data for all 9 features.

Summary of 14 171 rows of data.

With the assumption that the data available from Hemnet is representative of the apartment market in Stockholm, we can conclude that for apartments in Stockholm:

  • The median price is 3.9 million SEK.
  • The median rent is 2 800 SEK.
  • The median number of rooms is 2.

Furthermore, we can compare the relationship between all the features (their correlations) to learn even more how each feature affects each other, not just the pricing.

Shows how much each feature depends on each other feature. For example, Size correlates perfectly with itself, so its value is 1 (green). The closer the value is to -1 (red) or 1 (green), the stronger the dependency.
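A correlation matrix like the one above can be computed directly with pandas. This is a minimal sketch; the tiny DataFrame below is illustrative, not the real dataset:

```python
import pandas as pd

def correlation_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """Pairwise Pearson correlations between numeric features, in [-1, 1]."""
    return df.corr(numeric_only=True)

# Illustrative data: Rooms roughly tracks Size, as in the article's findings.
listings = pd.DataFrame({
    "Size": [30, 45, 60, 75, 90],
    "Rooms": [1, 2, 2, 3, 4],
    "Rent": [2100, 2600, 3100, 3600, 4100],
})
corr = correlation_matrix(listings)
```

Each diagonal entry is exactly 1 (a feature correlated with itself), and off-diagonal entries show how strongly two features move together.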

From this graph, we can see things that may not be surprising, such as Rooms being very dependent on Size. But we can also conclude some things we may not have known before about apartments in Stockholm:

  • The longitude is significantly dependent on NearbyPOIs. Meaning: the number of interesting places nearby (restaurants, bars, schools, etc., that are interesting according to Yelp) depends on whether you live in west or east Stockholm, where east Stockholm has more NearbyPOIs.
  • PricePerKvm is twice as dependent on Longitude as on Latitude. Meaning: if you live in east Stockholm, your apartment is likely to be more expensive.

There are, of course, tons more connections to be made, but these were my personal favorites.

Lastly, I would like to return to the feature-importance graph, which showed how much of an impact each feature has on Price.

Which feature has the biggest impact on the price. The higher the value, the bigger the impact. This means Size affects the pricing the most.

The insight I found most surprising is just how small an impact Number of Rooms has on the pricing. Meaning:

  • The pricing of an apartment is not significantly different whether it has an extra room or two. It is mostly the Size in Square Meters that matters in this context.

We can also conclude from this graph that Size, NearbyPOIs, Latitude, and Longitude are the features that affect the Price the most. Three of these are entirely dependent on the location. Meaning:

  • Location is everything and will dramatically affect the pricing.

To show just how much the pricing varies across Stockholm, I decided to visualize it.

Shows the geographical positions of the most expensive apartments. X-Axis: Latitude, Y-Axis: Longitude, Z-Axis: Sales Price / Square Meters (SalePricePerKvm)
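A 3D scatter plot like the one described above can be sketched with matplotlib. The coordinates below are randomly generated stand-ins for the real listings, with a price gradient toward the east to mimic the pattern the article describes:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np

# Illustrative coordinates roughly spanning Stockholm; not the real data.
rng = np.random.default_rng(1)
lat = rng.uniform(59.2, 59.4, 200)
lon = rng.uniform(17.9, 18.2, 200)
price_per_kvm = 60_000 + 400_000 * (lon - 17.9)  # price rising toward the east

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(lat, lon, price_per_kvm, c=price_per_kvm, cmap="viridis")
ax.set_xlabel("Latitude")
ax.set_ylabel("Longitude")
ax.set_zlabel("SalePricePerKvm (SEK)")
```

Coloring the points by price makes the most expensive areas stand out as the yellow peaks in the plot.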

As you can see, a pattern does exist. The highest point (yellow) has its coordinates in the middle of Östermalm.

Key Takeaways from this case study

For apartments in Stockholm:

  • The median price is 3.9 million SEK.
  • The median rent is 2 800 SEK.
  • The median number of rooms is 2.
  • The number of interesting places nearby (restaurants, bars, schools, etc. that are interesting according to Yelp) depends on whether you live in west or east Stockholm, where east Stockholm has more interesting places.
  • If you live in east Stockholm, your apartment is likely to be more expensive.
  • The pricing of an apartment is not significantly different whether it has an extra room or two if it does not add more square meters.

Read the previous story here:
Acquiring and cleaning data with Web Scraping.

All source code is available here:
https://github.com/gustafvh/Apartment-ML-Predictor-Stockholm_-with-WebScraper-and-Data-Insights
