Web Scraping

Is the Real Estate Market in Italy Disrupted due to the COVID-19 Outbreak?

Milan case study — Web Scraping kijiji.it owned by eBay

Yasser Elsedawy
Towards Data Science
9 min read · Apr 17, 2020


How COVID-19 will impact the real estate market in Italy is hard to predict. The real estate industry is being hit by the coronavirus, and it’s going to get worse before it gets better. The effects on real estate will vary by sector and market, and the extent of the effects will depend upon the duration of the economic shutdown.

The sectors of real estate that have been hit hardest so far are hotels, restaurants, bars and other entertainment destinations followed closely by retail and housing.

In this project we investigate the trend in the residential real estate sector in Milan, Italy, before and after the coronavirus outbreak, taking Feb 21st as the dividing date. The residential sector covers the buying, selling and rental of properties used as homes or for non-professional purposes. It comprises single-family homes, apartments, condominiums, planned unit developments, and more.

We will scrape kijiji.it for all the apartment rental postings available on the platform from January 1st until April 15th. Kijiji is an online classified advertising service, launched in February 2005, and a fully owned subsidiary of eBay. Kijiji websites are available for more than 100 cities in Canada and Italy.

Let’s get started.

Note: If you are interested in the code, please visit my profile on github.

We managed to collect 461 rental properties. The first thing to do is clean the data: we remove any postings without a price and any postings dated before January. We also strip the euro sign and anything else that prevents a column from being parsed as a number rather than text.
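The cleaning steps above could look roughly like the following sketch. The column names (`price`, `area`, `date`), the sample rows and the string formats are assumptions made for illustration; the real scraped schema lives in the author's repository and may differ.

```python
import pandas as pd

# Hypothetical raw rows; real column names and formats may differ.
raw = pd.DataFrame({
    "price": ["1.200 €", "850 €", None, "700 €"],
    "area": ["80 mq", "55 mq", "60 mq", "45 mq"],
    "date": ["2020-02-10", "2020-03-05", "2020-03-01", "2019-12-20"],
})


def clean_listings(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Keep digits only, then convert to numbers. This assumes whole-euro
    # prices where "." is a thousands separator.
    for col in ("price", "area"):
        digits = out[col].str.replace(r"\D", "", regex=True)
        out[col] = pd.to_numeric(digits, errors="coerce")
    out["date"] = pd.to_datetime(out["date"], errors="coerce")
    # Drop postings without a price and postings dated before January 1st.
    out = out.dropna(subset=["price"])
    out = out[out["date"] >= pd.Timestamp("2020-01-01")]
    return out.reset_index(drop=True)


listings = clean_listings(raw)
print(listings)
```

Using `errors="coerce"` turns anything that still cannot be parsed into a missing value, so it is dropped together with the postings that had no price to begin with.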

Now it’s time to work on the outliers. Using the interquartile range rule, we removed 25 values in total.
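As a reminder of how the interquartile range rule works, here is a minimal sketch. The helper name `remove_iqr_outliers`, the conventional multiplier of 1.5 and the sample prices are assumptions; the article does not say which columns or multiplier were used.

```python
import pandas as pd


def remove_iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[df[column].between(lower, upper)]


# Made-up prices: 9000 sits far above the rest and should be dropped.
sample = pd.DataFrame({"price": [600, 700, 800, 900, 1000, 1100, 9000]})
filtered = remove_iqr_outliers(sample, "price")
print(len(sample) - len(filtered), "rows removed")
```

The same helper can be applied to the area column by passing a different `column` argument.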

Data Exploring & Visualization

Before splitting the dataset into before and after the coronavirus outbreak, let’s first explore the whole dataset to get familiar with what we have.

  • Checking the price mean and the correlation between price and area
Price mean = 955.4214876033058

          price      area
price  1.000000  0.261033
area   0.261033  1.000000

The average price across all the apartments is about 955€. The correlation between price and area is only 0.26, which is weak and worth investigating.
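The mean and the correlation matrix above are the kind of output pandas produces. A minimal sketch, assuming a dataframe with `price` and `area` columns (the rows here are made up):

```python
import pandas as pd

# Made-up listings; `price` (EUR per month) and `area` (square meters)
# are assumed column names.
listings = pd.DataFrame({
    "price": [700, 900, 1100, 1500, 800],
    "area": [40, 75, 60, 110, 90],
})

price_mean = listings["price"].mean()
# DataFrame.corr() computes the Pearson coefficient by default.
corr_matrix = listings[["price", "area"]].corr()
price_area_corr = corr_matrix.loc["price", "area"]

print(f"Price mean = {price_mean}")
print(corr_matrix)
```

For scale, a Pearson coefficient of 0.26 corresponds to r² ≈ 0.07, so in a simple linear fit area alone would account for only about 7% of the variance in price.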

  • Now, let’s explore the change in mean price and area for each location in Milan.

The most expensive areas are (Bonola / Molino Dorino / Lampugnano) and (San Siro / Fiera), while the cheapest ones are (Città Studi / Lambrate) and, very strangely, (Centro). We will investigate this in the next step, when we split the data.

Note: the Milano area appears when a user submits a posting without specifying a zone; the posting is then automatically assigned the municipality name.
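The per-location averages can be obtained with a group-by. This is only a sketch: the `location` column name and the rows are assumptions, not the scraped data.

```python
import pandas as pd

# Made-up rows; zone names follow the article, values do not.
listings = pd.DataFrame({
    "location": ["Centro", "Centro", "San Siro / Fiera", "San Siro / Fiera", "Milano"],
    "price": [600, 800, 1200, 1400, 1000],
    "area": [50, 70, 80, 100, 90],
})

# Mean price and mean area per zone, most expensive zone first.
by_location = (
    listings.groupby("location")[["price", "area"]]
    .mean()
    .sort_values("price", ascending=False)
)
print(by_location)
```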

  • Now let’s split the dataset.

We have 85 online rental postings before the outbreak while, strangely, we have more than triple that number after it, over a comparable time period. This can indicate a huge drop in sales.
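The split itself is a date comparison against Feb 21st. A minimal sketch, assuming a parsed `date` column; whether postings dated exactly on the cutoff count as "after" is an assumption made here.

```python
import pandas as pd

OUTBREAK_DATE = pd.Timestamp("2020-02-21")  # dividing date used in the article

# Made-up postings.
listings = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-15", "2020-02-20", "2020-02-21", "2020-03-30"]),
    "price": [1200, 1100, 900, 850],
})

# Assumption: postings dated on the cutoff day itself count as "after".
before = listings[listings["date"] < OUTBREAK_DATE]
after = listings[listings["date"] >= OUTBREAK_DATE]

print(len(before), len(after))
print(before["price"].mean(), after["price"].mean())
```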

  • Checking average prices before and after the outbreak, along with the correlation.
Price mean before = 1141.8235294117646
Price mean after = 898.4280575539568

Average prices before the outbreak are higher, as expected.

Let’s check and print the correlation and explain it.

Before the outbreak:
          price      area
price  1.000000  0.555945
area   0.555945  1.000000

After the outbreak:
          price      area
price  1.000000  0.176422
area   0.176422  1.000000

A positive relationship exists between price and area before the outbreak, with a correlation of 0.56, while after the outbreak the correlation falls to 0.18, which is weak and suggests that people’s pricing behavior has changed. It’s also worth mentioning that a correlation of 0.56 is not that strong: it shows that in a lot of cases tenants can be charged much more than what they get, and that rental prices depend on more than these two features.
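One way to get the price/area correlation for each period in a single pass is to label every posting and group by that label. This is a sketch with made-up rows; the `period` column is a hypothetical helper, not something from the original dataset.

```python
import numpy as np
import pandas as pd

# Made-up postings; the "before" rows are perfectly linear on purpose.
listings = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-01-10", "2020-01-25", "2020-02-05",
        "2020-03-01", "2020-03-15", "2020-04-01", "2020-04-10",
    ]),
    "price": [800, 1000, 1200, 900, 700, 1100, 800],
    "area": [40, 50, 60, 40, 90, 60, 70],
})

# Hypothetical helper column marking each posting's period.
cutoff = pd.Timestamp("2020-02-21")
listings["period"] = np.where(listings["date"] < cutoff, "before", "after")

# Pearson correlation between price and area within each period.
corr_by_period = {
    period: group["price"].corr(group["area"])
    for period, group in listings.groupby("period")
}
print(corr_by_period)
```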

  • Change in prices by location before and after the outbreak.

Here we can clearly see the change in trends. Before the outbreak, prices in the center were the highest, along with the San Siro area (~1200). After the outbreak, prices dropped to almost half in the center and in the Loreto area, reaching ~600, with small to medium drops in other areas. However, we can notice an increase in prices in areas like Bonola / Molino Dorino / Lampugnano.

  • Change in prices based on the mean area for each location before the outbreak.

We would normally expect the same price for the same features in the same location, as we see in (Baggio / Forze Armate / Quinto Romano) or in the center. However, there is a lot of price variation in areas like (Porta Genova / Navigli / Corso Italia) or (San Siro / Fiera). That explains the low correlation from before, and we should expect higher variation in the after data.

  • Change in prices based on the mean area for each location after the outbreak.

As expected, prices are disrupted. This can be due to fear of the country-wide lockdown as workers stay home, and to business shutdowns, quarantines and curfews. Huge numbers of layoffs will lead to further contraction in consumer spending, which will force landlords to decrease their prices. Some have already started to adapt to the new situation and some are still offering the same prices as usual. Keep in mind also that the market for short-term renting is slowing down massively.

Clustering

In this part we will try to divide the market into distinct subsets of apartments based on their area and price.

We will use MinMaxScaler(feature_range = (0, 1)), which transforms each value in a column proportionally into the range [0, 1]. We use it as the first-choice scaler for a feature because it preserves the shape of the original distribution (no distortion).
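To make the scaling step concrete, here is a minimal sketch with scikit-learn. The article does not name the clustering algorithm, so `KMeans` is an assumption used purely for illustration, and the data points are made up.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Made-up (price in EUR, area in square meters) pairs forming two obvious groups.
features = np.array([
    [500.0, 30.0], [550.0, 35.0], [600.0, 40.0],
    [1800.0, 100.0], [2000.0, 110.0], [2200.0, 120.0],
])

# Rescale each column to [0, 1]: x' = (x - min) / (max - min).
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(features)

# Assumption: KMeans stands in for the article's unnamed clustering algorithm.
# The article reports four clusters; two are used here to suit the tiny sample.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(scaled)
print(labels)
```

Scaling matters here because price and area are measured on very different scales; without it, a distance-based method would be driven almost entirely by price.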

  • Let’s visualize the clusters before the outbreak

Cluster 0: prices(1500–2250)€, area(60–105)mq. This cluster has small apartments with high prices; those are usually concentrated in expensive neighborhoods or in the center.

Cluster 1: prices(625–1500)€, area(35–86)mq. This is the medium cluster: small apartments here can cost more than those in cluster 3 for the sake of a better neighborhood.

Cluster 2: prices(1000–2200)€, area(100–130)mq. Price and area go together in this cluster: big apartments with high prices.

Cluster 3: prices(375–1100)€, area(20–70)mq. This cluster also shows a positive relationship: small apartments in a low price range, usually in areas far from the center.

  • Let’s join the location to the cluster data and print prices clustered based on the location, also for the data before the outbreak.

The clusters are the same as before, only here we can see which areas of the city each cluster’s postings come from. Cluster 0, as predicted, has apartments in the center. We can also assume from these results that the Milano segment is associated with the center and expensive neighborhoods.
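Relating clusters to locations can be done by keeping the cluster label next to each posting and tabulating. A sketch with made-up rows; the `cluster` column is assumed to hold the labels from the clustering step.

```python
import pandas as pd

# Made-up postings with an assumed `cluster` label column.
clustered = pd.DataFrame({
    "location": [
        "Centro", "Centro", "Milano",
        "Baggio / Forze Armate / Quinto Romano",
        "Baggio / Forze Armate / Quinto Romano",
    ],
    "cluster": [0, 0, 0, 3, 3],
    "price": [1900, 2100, 1700, 600, 700],
})

# Rows: locations; columns: cluster labels; cells: number of postings.
counts = pd.crosstab(clustered["location"], clustered["cluster"])
# Mean price for each (cluster, location) pair.
mean_price = clustered.groupby(["cluster", "location"])["price"].mean()

print(counts)
print(mean_price)
```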

  • Doing the same procedures for the data after the outbreak

Cluster 0: prices(100–1000)€, area(20–70)mq. This cluster has small apartments in a very low price range, usually in areas far from the center.

Cluster 1: prices(100–1000)€, area(80–125)mq. This cluster has bigger apartments in the same low price range, which suggests that some landlords may have had to lower their prices after the outbreak.

Cluster 2: prices(750–1500)€, area(40–86)mq. This is the medium cluster: small apartments here can cost more than those in cluster 0 for the sake of a better neighborhood.

Cluster 3: prices(1200–2250)€, area(55–150)mq. This cluster covers a wide range from small apartments to bigger ones, all with high prices; those are usually concentrated in expensive neighborhoods or in the center.

In this last representation, cluster 3, as shown before, has the highest prices. However, the highest prices are no longer in the center as they were before the outbreak; they are in areas like Bicocca and Città Studi, concentrated near universities, while prices in the center show a huge drop.

Exploring with ArcGIS

For reference: https://github.com/Esri/arcgis-python-api

We take the original dataframe after fixing the date column and manually insert the coordinates from Google Maps, then delete the unidentified places labeled ‘~Altre zone’.
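The preparation step before mapping could be sketched in pandas as follows. The coordinates below are rough, illustrative values and not the ones used in the article; `ZONE_COORDS` and the column names are hypothetical.

```python
import pandas as pd

# Approximate, illustrative coordinates only (latitude, longitude).
ZONE_COORDS = {
    "Centro": (45.4642, 9.1900),
    "San Siro / Fiera": (45.4781, 9.1240),
}

# Made-up postings.
listings = pd.DataFrame({
    "location": ["Centro", "San Siro / Fiera", "~Altre zone"],
    "price": [1200, 1100, 800],
})

# Delete postings whose zone cannot be placed on a map.
located = listings[listings["location"] != "~Altre zone"]

# Attach latitude and longitude by zone name.
coords = pd.DataFrame.from_dict(
    ZONE_COORDS, orient="index", columns=["latitude", "longitude"]
)
located = located.merge(coords, left_on="location", right_index=True, how="left")
print(located)
```

A dataframe in this shape, with one latitude and one longitude column per posting, is a common starting point for mapping libraries.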

  • We split the data again and then print our first map. This map shows the properties before (cyan) and after (red) the outbreak.
  • Let’s visualize the spatial density of the properties (houses) using a heat map.

First, for the data from before

Second, for the data after

  • Now, let’s visualize spatial distribution by price.

First for the data from before the outbreak.

Second for the data from after the outbreak.

  • You get the idea. Now let’s do the same and visualize the spatial distribution by area.

Conclusion

The price range went down in comparison with the dataset from before the outbreak. A drop of about 500€ shows the fall in prices, and it may not be as big as the drop in sales.

Our data and analysis suggest that market behaviour has changed: some landlords have already started to lower the rent for their apartments. The data also showed that the number of postings after the outbreak is more than triple the number before, which may indicate a huge drop in sales.

Property renting and buying platform Idealista.com reported in early April that the effects of the coronavirus crisis had yet to be noticed on the market, although it warned that the next quarter will probably shed more light on the situation.

According to their data from the first quarter of 2020, house prices across Italy dropped by 0.4 percent, with the average price for second-hand properties now standing at €1,699 per square meter (annual decrease of 2 percent).

Future work

If you like the idea of investigating market trends, having more features for each apartment is definitely recommended. Using dummy variables to turn the location into a measurable variable can be an interesting idea. Collecting more data after a couple of months to take a deeper look is also recommended.
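As a pointer for the dummy-variable idea, here is a minimal sketch using `pandas.get_dummies`. The rows and the `loc` prefix are made up for illustration.

```python
import pandas as pd

# Made-up postings.
listings = pd.DataFrame({
    "location": ["Centro", "Bicocca", "Centro"],
    "price": [1200, 900, 1000],
    "area": [70, 55, 60],
})

# One 0/1 column per zone, so location can enter a correlation or a model.
encoded = pd.get_dummies(listings, columns=["location"], prefix="loc", dtype=int)
print(encoded)
```

Each posting then carries a 1 in the column of its own zone and 0 elsewhere, which lets location be used alongside price and area in numeric analyses.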

If you want to view the full code and to know how to scrape the data yourself or to download the dataset in this project, check out my profile on github.

Thank you:)

Note from the editors: Towards Data Science is a Medium publication primarily based on the study of data science and machine learning. We are not health professionals or epidemiologists, and the opinions of this article should not be interpreted as professional advice.
