Web Scraping Apartment Listings in Stockholm

Using Python for scraping and MS SQL server for storing

Christopher Furu
Towards Data Science


My partner and I have sold our apartment and are looking for a new one. The majority of people searching for a new apartment manually browse https://www.hemnet.se/, which to me seems tedious and exhausting. So I thought: why not use my Python knowledge to programmatically find the most suitable apartments? That is what I did, and I built a very simple web scraper to find objects that are up for sale.

Photo by Adam Gavlák on Unsplash

Table of Contents

1. The web scraping part
2. Visualizing data from sold objects
3. Remarks
4. Further work

1. The web scraping part

Below follows a pair of functions that handle HTML objects via the library BeautifulSoup. Please keep in mind that these functions could be written in a more professional and optimized way, but the scope of this project was simply to get hold of the data fast, in order to close an "optimal" apartment deal.

1.1 Scraping functions

First we need to handle the BeautifulSoup object, which we do in a standard Python function that I have chosen to name "Booli_ScrapeObjects". The function takes two input arguments (page and object_info), where page is the URL and object_info is a list that collects the interesting information for each object (it is later turned into a Pandas DataFrame). The function works on the premise that the HTML structure of the website stays static. The function follows:

# Imports used throughout the scraping part
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

def Booli_ScrapeObjects(page, object_info):
    request = requests.get(page)
    soup = BeautifulSoup(request.text, 'lxml')
    links = soup.select('a[href*="/annons/"]')

    for j, row in enumerate(links):
        # List price, price per m² and rent are found in the link's text
        info = row.contents[5].text.split("\n")
        while '' in info:
            info.remove('')
        info[0] = info[0].strip(" kr")
        info[1] = info[1].strip(" kr/m²")
        info[2] = info[2].strip(" kr/mån")
        object_info.append(info)
        try:
            info.insert(0, "https://www.booli.se" + links[j]["href"])
            # Fetch address, number of rooms and m² from the object's own page
            request_apartment = requests.get(info[0])
            soup_apartment = BeautifulSoup(request_apartment.text, 'lxml')
            address = soup_apartment.findAll(
                'span', class_='property__header__street-address')
            address = address[0].contents[0].strip("\n\t\t")
            info.append(address)
            size = soup_apartment.findAll(
                'span', class_='property__base-info__title__size')
            size = size[0].contents[0].strip("\n\t").split(",")
            rooms = size[0].strip(" rum")
            m2 = size[1].strip(" m²")
            info.append(rooms)
            info.append(m2)
        except:
            info.insert(0, "Unknown")  # Link
            info.append("Unknown")     # Address
            info.append("Unknown")     # Rooms
            info.append("Unknown")     # m2
            info.append("Unknown")     # Estimate
            continue
    return object_info
Print screen from Booli's webpage, showing the number of objects per page.

The above function scrapes the objects found on a given page. If more than 38 objects are found, they are split across multiple pages (see the picture to the left). This is handled by the function below, which takes the same URL argument and calculates how many pages there are, based on the premise that each page contains at most 38 objects.

def Booli_findNumberOfPagesData(url):
    request = requests.get(url)
    soup = BeautifulSoup(request.text, 'lxml')
    data = soup.findAll('div', class_='search-list__pagination-summary')
    numberOfObjectsPerPage = 38
    try:
        # The pagination summary ends with "... av <total number of objects>"
        numberOfObjects = int(data[0].text[
            -(len(data[0].text) - 3 - data[0].text.rfind("av")):])
    except:
        numberOfObjects = 0
    numberOfPages = int(np.ceil(numberOfObjects / numberOfObjectsPerPage))
    return numberOfPages, numberOfObjects

With the two functions above we can write a new function that loops through each region's URL and collects the information for every object. The function follows:

def loopThroughRegions(data_url, m2_max, m2_min, maxListPrice, minListPrice):
    object_info = []
    region = []
    length = [0]
    base_url = ("https://www.booli.se/{}/{}/?maxListPrice={}&maxLivingArea={}"
                "&minListPrice={}&minLivingArea={}&objectType=L%C3%A4genhet"
                "&page={}&upcomingSale=")
    for index, row in data_url.iterrows():
        # First results page for the region
        url = base_url.format(row["Region"], row["RegionID"], maxListPrice,
                              m2_max, minListPrice, m2_min, 1)
        object_info = Booli_ScrapeObjects(url, object_info)
        numberOfPages, numberOfObjects = Booli_findNumberOfPagesData(url)
        # Remaining pages, if the region holds more than 38 objects
        for page in range(2, numberOfPages + 1):
            url = base_url.format(row["Region"], row["RegionID"], maxListPrice,
                                  m2_max, minListPrice, m2_min, page)
            object_info = Booli_ScrapeObjects(url, object_info)
        length.append(len(object_info))
        # Duplicate the region name once per object scraped for this region
        for i in range(0, length[len(length) - 1] - length[len(length) - 2]):
            region.append(row["Region"])
    return object_info, region
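
To make the flow concrete, here is a minimal sketch of how the function could be called and how the output could be turned into the Pandas DataFrame used further down (I call it result, matching the variable in the database code below). The example regions, region IDs, search limits and column names are placeholders of my own, not values from the original project:

import pandas as pd

# Placeholder region table; use the region names and IDs found in Booli's own URLs
data_url = pd.DataFrame({"Region": ["vasastan", "ostermalm"],
                         "RegionID": [1, 2]})

object_info, region = loopThroughRegions(data_url,
                                         m2_max=80,
                                         m2_min=40,
                                         maxListPrice=6000000,
                                         minListPrice=2000000)

# Column names are assumptions, chosen to match the cleaning function below
result = pd.DataFrame(object_info,
                      columns=["Link", "List price", "Price per m2", "Rent",
                               "Address", "Number of rooms", "m2"])
result["Region"] = region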

In order to store the Pandas DataFrame in a regular Microsoft SQL Server database we need to clean the data. For example, we would like to rewrite objects that have 1½ rooms as 1.5, and so on. A simple function that performs this cleaning process can be found below:

def cleaningData(object_info):
    for index, row in object_info.iterrows():
        # "30+5" m² (living area + extra area) is summed into a single number
        if row["m2"].find("+") != -1:
            m2s = row["m2"].split("+")
            newM2 = int(m2s[0]) + int(m2s[1])
            object_info.at[index, "m2"] = newM2
        # "1½" rooms becomes 1.5 (.at replaces the deprecated set_value)
        if row["Number of rooms"].find("½") != -1:
            rooms = row["Number of rooms"].split("½")
            if rooms[0] == "":
                newRooms = 0.5
            else:
                newRooms = float(rooms[0]) + 0.5
            object_info.at[index, "Number of rooms"] = newRooms
        # A dash means no rent; otherwise strip whitespace from the number
        if row["Rent"].find("—") != -1:
            newRent = 0
            object_info.at[index, "Rent"] = newRent
        else:
            newRent = "".join(row["Rent"].split())
            object_info.at[index, "Rent"] = newRent
    return object_info

1.2 The database storing part

Finally we have all the data in a structured format, so now it's time to store it in a convenient way, namely in a database. We have all the data stored in a Pandas DataFrame that looks like this:

Pandas Data Frame containing apartment data.

So we need to construct a database table that can store each column. This can easily be done via the following SQL code (given that you use MS SQL Server, though the syntax is similar for other databases):

Query for creating the table “UpcomingSales”.
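
Since the query itself is only shown as an image above, here is a rough sketch of what such a table definition might look like, kept as a string that can later be executed through the pyodbc cursor introduced below. The column names and data types are my assumptions and should be matched to your own DataFrame; they are not the exact ones from the original query:

create_table_query = """
CREATE TABLE [Booli].[UpcomingSales] (
    [Link]          NVARCHAR(255),
    [ListPrice]     INT,
    [PricePerM2]    INT,
    [Rent]          INT,
    [Address]       NVARCHAR(255),
    [NumberOfRooms] FLOAT,
    [m2]            FLOAT,
    [Region]        NVARCHAR(255),
    [DateInserted]  DATE
)
"""
# Assumes the [Booli] schema already exists; run once via cursor.execute(create_table_query)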

Now let's write a simple function that connects to my local "on-prem" database using the pyodbc package. The function follows:

import pyodbc

def mssql_connect(server, database, driver):
    cnxn = pyodbc.connect('DRIVER=' + driver +
                          ';SERVER=' + server +
                          ';DATABASE=' + database +
                          ';Trusted_Connection=yes')
    cursor = cnxn.cursor()
    return cnxn, cursor

Using this function combined with a query that inserts the data row by row, we finally reach our goal (the query could preferably be written as a simple merge statement):

# SQL input parameters
from datetime import date

pyodbc.pooling = False
server = 'server'
database = 'database'
driver = '{ODBC Driver 13 for SQL Server}'
# No username/password needed here since the connection uses Trusted_Connection
cnxn, cursor = mssql_connect(server, database, driver)

data = result.values.tolist()
for i, item in enumerate(data):
    # Insert the row only if the same link has not already been inserted today
    insert_query = ("IF NOT EXISTS (SELECT * FROM [Booli].[UpcomingSales] " +
                    "WHERE [Link] = '" + str(item[0]) + "' " +
                    "AND [DateInserted] = '" + str(date.today()) + "') " +
                    "BEGIN " +
                    "INSERT INTO [Booli].[UpcomingSales] " +
                    "VALUES ('" + str(item[0]) + "'," +
                    str(item[1]) + "," +
                    str(item[2]) + "," +
                    str(item[3]) + ",'" +
                    str(item[4]) + "'," +
                    str(item[5]) + "," +
                    str(item[6]) + ",'" +
                    str(item[7]) + "','" +
                    str(date.today()) + "') " +
                    "END")
    cursor.execute(insert_query)

# Cleanup
cnxn.commit()
cursor.close()
cnxn.close()

Now we can obtain the result by querying the database with a simple select statement:

Select-query to the table that holds the object data.
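
If you prefer to pull the rows back into Python, a minimal sketch (reusing the mssql_connect function above, with the same placeholder connection parameters) could look like this:

import pandas as pd

cnxn, cursor = mssql_connect(server, database, driver)
upcoming = pd.read_sql("SELECT * FROM [Booli].[UpcomingSales]", cnxn)
print(upcoming.head())
cnxn.close()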

2. Visualizing data from sold objects

It is easy to adjust the code to scrape sold objects instead of objects up for sale. I won't go over how to do it; instead I will show some simple visualizations of the scraped data for sold objects. First off we query the database to obtain the data, and the result is stored in a Pandas DataFrame named "data". The Python code for fetching the data looks like this:

query = "SELECT 
* \
FROM \
[Booli].[SoldObjects]"
data = pd.read_sql(query, cnxn)

The output looks like this:

Further, we split the rent into different price sections using the built-in pandas function "cut":

bins = [0, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 10000]
labels = ['-500', '500-1000', '1000-1500', '1500-2000', '2000-2500',
          '2500-3000', '3000-3500', '3500-4000', '4000-10000']
data["Rent categories"] = pd.cut(
    data['Rent'],
    bins=bins,
    include_lowest=True,
    labels=labels
)

Now we can plot the data to get an understanding of how the price varies between rent categories over time.
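
A plotting sketch along those lines is shown below. The sold-date and sold-price column names ("SoldDate" and "SoldPrice") are assumptions about the SoldObjects table, so adjust them to your own schema:

import pandas as pd
import matplotlib.pyplot as plt

# Assumed columns: "SoldDate" (date of sale) and "SoldPrice" (final price)
data["SoldDate"] = pd.to_datetime(data["SoldDate"])
mean_price = (data
              .groupby([data["SoldDate"].dt.to_period("M"), "Rent categories"])["SoldPrice"]
              .mean()
              .unstack("Rent categories"))
mean_price.plot(figsize=(10, 6),
                title="Mean sold price per rent category over time")
plt.ylabel("Mean sold price [SEK]")
plt.show()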

As intuition tells you, the lower the rent, the higher the price. Now let's take a closer look at how prices differ between regions in Stockholm.

For those of you who live in Stockholm it probably won't come as a surprise that Östermalm has the highest mean selling price, far above the overall mean.

Now let's have a look at each region's distribution of [price/m²], together with its mean,

and then at the distribution of [price/m²] for the whole sample:

As we can see, the mean [price/m²] is 85 000 [SEK] for the whole sample.

Lastly, we take a look at the mean [price/m²] for each region and number of rooms.

As mentioned before, the sample lacks data for apartments in the upper "# rooms" range, and therefore the mean is not representative of the actual market price. Just to make the point clear, let's count the objects per region and "# rooms".
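
The counting itself can be done with a couple of lines like these, assuming the DataFrame has "Region" and "Number of rooms" columns as in the cleaning step:

# Number of sold objects per region and number of rooms
counts = (data.groupby(["Region", "Number of rooms"])
              .size()
              .unstack("Number of rooms", fill_value=0))
print(counts)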

As the plot shows, in the best of worlds we can maybe say that "# rooms" = 1, 2, 3 and 4 have representative means, whereas the means for "# rooms" > 4 are highly distorted due to the few data points.

3. Remarks

First off, the Python code can most surely be written more efficiently, but as pointed out, the purpose of this project was not to write production-ready code for scraping apartments; this was just a one-off project for a specific purpose. With that said, one clear optimization for the scraping part would be to search for multiple areas in one request instead of one area at a time.

URL-structure for multiple areas.

4. Further work

One could implement a simple regression model with "price per square meter" as the response variable, Y, to find out the contribution of each independent variable X. The model could also be used as a price estimator given a set of values for the independent variables.
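
A minimal sketch of such a model, using scikit-learn; the feature and response column names below are assumptions, and the columns are assumed to already be numeric:

import pandas as pd
from sklearn.linear_model import LinearRegression

# Assumed columns: "m2", "Number of rooms", "Rent", "Region" and "Price per m2"
X = pd.get_dummies(data[["m2", "Number of rooms", "Rent", "Region"]],
                   columns=["Region"])
y = data["Price per m2"]

model = LinearRegression()
model.fit(X, y)

# Each coefficient indicates the variable's contribution to price per m²
for name, coef in zip(X.columns, model.coef_):
    print(name, round(coef, 1))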

Thank you for reading. If you have any questions, please contact me via cfuru@kth.se, and feel free to add me on LinkedIn.
