SE Apartment Project

A Data Science approach to Stockholm's Apartment Prices

Acquiring and cleaning data with Web Scraping in Python

Gustaf Halvardsson
Towards Data Science
7 min read · Dec 19, 2019


Project Background

Stockholm is considered one of the most overpriced housing markets in the world, so I figured it would serve as an interesting case study with a lot of interesting data available. The goal is to transform a large portion of that data into useful insights that are easy to understand, and to build a model that predicts the sale price of future apartment listings.

I also decided to approach this problem the way a Data Scientist would, with some of the headaches and challenges a real situation presents. This means I want to challenge myself to transform the data that is available into clean data that can actually be used.

Note: Usually this role would include being presented with a specific goal that would benefit the business. In this case, however, I have chosen a very general approach where I do not yet know what patterns I might be looking for.

“A data scientist is someone who can obtain, scrub, explore, model, and interpret data, blending hacking, statistics, and machine learning. Data scientists not only are adept at working with data, but appreciate data itself as a first-class product.” — Hilary Mason, founder, Fast Forward Labs.

Read the next part of this project here:
Turning data into insights and predictors with Machine Learning.

All source code is available here:
https://github.com/gustafvh/Apartment-ML-Predictor-Stockholm_-with-WebScraper-and-Data-Insights

Project Goals

1. To gather real-life data that is available online and collect, clean and filter it into a dataset of both good quality and quantity. No shortcuts with squeaky-clean Kaggle datasets this time.

2. Based on the data gathered, build a Machine Learning model which, using relevant features such as size, number of rooms, and location, predicts the sale price accurately enough to actually be considered useful.

3. Turn all data I’ve collected into insights that hopefully teach me something surprising or useful about the Stockholm housing market.

Special tools used

  • Python: The entire project is written in Python because of its extensive library support in this area.
  • Selenium (Python): Popular end-to-end testing framework, used here to run the Web Scraping program in Chrome.
  • Pandas & Numpy: Python libraries for easy-to-use data structures and data analysis tools.
  • Yelp API: API for retrieving nearby points of interest as well as coordinates for a given address.
  • Seaborn & Matplotlib: Data visualization libraries for Python, used to present the results visually.

Acquiring Good and Plentiful Data

To even begin this project, I need good data. Good data in this sense means it fulfills two main requirements: quality and quantity. For quality, it is important that the data has attributes (features) that are relevant to what I want to know and predict. If I want to predict the sale price of an apartment, good features would, for example, be the size of the apartment, its location, the number of rooms, when it was sold, etc.

All of this data is available on Hemnet, a Swedish apartment listing site, so that is where I decided to source the data.

A very important note about web scraping: in my case, I will in no way use this as a commercial product or for any monetary gain. Always respect other people's and companies' data, and always double-check what rules apply to web scraping so you do not infringe on copyright.

Demo of how my Selenium Web Scraper works by automatically controlling an instance of Chrome.

In short, the Web Scraper works by opening a Chrome browser window, navigating to Hemnet's Stockholm listings, collecting all apartment listings on that page, and then moving to the next page to repeat the process.
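
A minimal sketch of that loop is shown below. The CSS selector and the next-page link text are placeholders rather than Hemnet's actual markup, and the parsing of each listing card into a row is omitted:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.hemnet.se/salda/bostader')  # Hemnet's sold-listings search

rows = []
for page in range(50):  # Hemnet caps a search query at 50 result pages
    # Placeholder selector: grab every listing card on the current page
    for card in driver.find_elements(By.CSS_SELECTOR, '.sold-property-listing'):
        rows.append(card.text)
    # Placeholder link text: move on to the next result page
    driver.find_element(By.LINK_TEXT, 'Nästa').click()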

Some problems I encountered and luckily solved:

  • Hemnet restricts listings to 50 pages per search query. I used this as an opportunity to get an even spread of data across apartment sizes. I split the scraping job into 20 query segments, each corresponding to a specific apartment size range, and told the Scraper to fetch each one. For example, one segment is all apartments between 20 and 25 square meters. This resulted in me getting:
    50 listings per page * 50 pages * 20 segments = 50 000 listings
    Now I also have a better data-spread than before so that turned out great!
segments = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 250]
# One scraping run per size segment (20-25 m², 25-30 m², ...), appended to one DataFrame
for i in range(0, (len(segments)-1)):
    driver = initCrawler(segments[i], segments[i+1])
    apDataNew = getRowsFromHnet(driver, 50)
    apData = apData.append(apDataNew, ignore_index=True)
  • Google Ad banners. These are tricky since their classes are dynamic and can cause the Scraper to misclick when it tries to click the ‘Next Page’ button. Unfortunately, this sometimes caused the Scraper to fetch the same page several times because the next-page button was not clicked properly. Because of this, I ended up with 15 000 listings instead of 50 000 after removing duplicates. However, I did not dig deeper into this, since 15 000 listings are luckily still more than enough data to move forward with.
  • Privacy Policy pop-up in an iframe. It pops up every time since the scraper runs in incognito mode, and it needs to be clicked away. It turned out to be implemented in an iframe, which complicated things, but this was solved with Selenium's frame- and window-handling tools (see the sketch after this list).
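
For the last point, the fix amounts to switching the driver's context into the iframe before clicking, and then switching back out. A small sketch, with an assumed iframe selector and button text:

from selenium.webdriver.common.by import By

# Enter the consent iframe (selector is an assumption), accept, and return to the main document
consent_frame = driver.find_element(By.CSS_SELECTOR, 'iframe[id*="consent"]')
driver.switch_to.frame(consent_frame)
driver.find_element(By.XPATH, "//button[contains(., 'Godkänn')]").click()
driver.switch_to.default_content()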

Adding features to our data with the help of Yelp

With the help of Yelp's API, we can get a lot of details back for a given location. By sending an address we can, for example, get the coordinates (latitude and longitude) of that location. This is very useful since we can now turn a string-based address like “Hamngatan 1” into something more quantifiable: two numbers. This makes the location a usable feature and later creates the possibility to geometrically compare apartment locations.

Additionally, we also receive a Points-of-Interest feature to add to our apartment dataset. This is defined as the number of places Yelp deems worthy of listing within 2 km of the apartment. Why is that relevant? It can be an indicator of how centrally the apartment is located, which may be reflected in the sale price. Adding this feature may therefore help the accuracy of our model later on.

response = yelp_api.search_query(location=adress, radius=2000, limit=1)
return (latitude, longitude, pointsOfInterestsNearby)
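
Fleshed out slightly, the lookup might look like the sketch below. It assumes the yelpapi client and the standard fields of Yelp's business-search response (region.center for the geocoded coordinates, total for the number of matches within the radius); the function name is my own.

from yelpapi import YelpAPI

yelp_api = YelpAPI('YOUR_API_KEY')  # placeholder key

def getLocationFeatures(adress):
    # One query per apartment: Yelp geocodes the address and counts places within 2 km
    response = yelp_api.search_query(location=adress, radius=2000, limit=1)
    latitude = response['region']['center']['latitude']
    longitude = response['region']['center']['longitude']
    pointsOfInterestsNearby = response['total']
    return (latitude, longitude, pointsOfInterestsNearby)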

Cleaning, Scrubbing and Filtering the data

This is the part where I think I learned the most about applying a problem to the real world, since data is never as nicely formatted and clean as you would like, and this time was no different. First, I started by cleaning the data to reformat it:

  • Change the date order and make it numerical. I used a Python dictionary to convert the month name into its numerical value.
# 6 december 2019 --> 20191206
monthToNum = {'januari': '01', 'februari': '02', 'mars': '03', 'april': '04', 'maj': '05', 'juni': '06', 'juli': '07', 'augusti': '08', 'september': '09', 'oktober': '10', 'november': '11', 'december': '12'}
df.Date = [apartment.split(' ')[2] + monthToNum[apartment.split(' ')[1]] + apartment.split(' ')[0].zfill(2) for apartment in df.Date]
  • Make all features numerical (floats) except the Address and Broker.
  • Trim sale price. Convert from “Slutpris 6 250 000 kr” to “6250000”.
  • Remove wrongly acquired coordinates. In a few instances, the Yelp API thought an address was in a completely different part of Sweden and returned those coordinates. These could skew the data, so if an apartment had coordinates significantly outside the range of Stockholm, the entire row was dropped.
# Keep only rows whose coordinates fall inside a Stockholm bounding box
df = df.loc[df['Longitude'].between(17.6, 18.3)]
df = df.loc[df['Latitude'].between(59.2, 59.45)]
  • Add a new combined feature, PricePerKvm. Since the price is strongly tied to the size of the apartment, comparing prices directly between apartments of different sizes would not show how fairly priced they are. To compare prices directly, I created a new feature, PricePerKvm, which is simply the price divided by the size in square meters (see the snippet below).
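
In pandas this is a one-liner. The column names here are assumptions about the dataset rather than the exact ones used:

# Price per square metre, to compare apartments of different sizes fairly (column names assumed)
df['PricePerKvm'] = df['SalePrice'] / df['SquareMeters']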

Drumroll, please: Our Final dataset

After a lot of cleaning, scrubbing, and filtering we ended up with a dataset containing 14 171 apartment listings with 9 different features.

[Sample of 5 out of 14 171 entries]

In this part, I learned that a large majority of the time in a Data Science project goes into gathering and cleaning the data; timewise, this part took almost 80% of the total project. Real-life data is rarely clean, filtered, or structured the way you would like. It's also not always uniform, so for the next project I would like to take on the challenge of combining different types of datasets into one.

When it comes to this dataset, it should be more than enough data for our next part — building a predictive Machine Learning model and also turning the data into insights!

Read the next story in this project here:
Turning data into insights and predictors with Machine Learning.

All source code is available here:
https://github.com/gustafvh/Apartment-ML-Predictor-Stockholm_-with-WebScraper-and-Data-Insights
