Hyderabad Housing Prices.

Saiteja Mahadev
Towards Data Science
7 min read · Nov 10, 2019


Photo by Shiv Prasad on Unsplash

Every day I hear stories from my friends and colleagues at the office about their hard time finding a flat or apartment for rent in Hyderabad: the flats are not open to bachelors, the rent in a given area is too high, or the flat is too small for the price. I wanted to get a real picture of these problems, so I decided to use my Python skills to collect and analyze data about houses available for rent from one of the leading housing-rental websites.

I have divided the current task into 3 parts.

  • Data Collection using Web-Scraping.
  • Data Preprocessing.
  • Data Analysis.

Data Collection.

Photo by Luca Bravo on Unsplash

Web-Scraping is often used to collect data when no ready-made data set is available for analysis. First of all, we are going to collect all the search results from one of the leading real estate websites in Hyderabad. The website we are going to collect data from is magicbricks.com. The website is great and all the data is neatly arranged; however, you are free to choose a website of your choice and you will be able to adapt the code very easily.

Before we actually begin scraping, we need to decide which features to collect from the website. In the current example, we will be collecting:

  • Number of Bedrooms.
  • Number of Bathrooms.
  • Type of Furnishing.
  • Tenants Preferred.
  • Area of the House in sqft.
  • Locality of the House.
  • Price (Rent).

First of all, let’s import all the dependencies,

import re                                    # Regular expressions library
import pandas as pd                          # Data preprocessing library
from bs4 import BeautifulSoup                # Used to parse data out of HTML
from urllib.request import urlopen, Request  # Used to fetch a given URL

Some websites automatically block any kind of scraping, so we define a header to pass along with the request, which makes our queries to the website look like they are coming from an actual browser.

headers = ({'User-Agent':
'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})

Next, we query the website and read back the raw HTML code of the page.

# url holds the magicbricks search-results page we want to scrape
request = Request(url, headers=headers)
response = urlopen(request)
html = response.read()

The output would look something like the raw HTML below; it needs to be passed to Beautiful Soup to get a structured, parseable version of the HTML.

HTML output (html):

b'\r\n\r\n\r\n\r\n\r\n\r\n\r\n<!DOCTYPE html>\r\n<html>\r\n\t<head>\r\n\t\t<meta charset="UTF-8">\r\n\t\t<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"> \r\n\t\t<meta name="viewport" content="width=device-width, initial-scale=1">\r\n\t\t<meta name="google-signin-client_id" content="614944930403-u1s9tmqebd273mutmdhmsngnhvivgd6a.apps.googleusercontent.com">\r\n\t\t\r\n\t\t<link rel="dns-prefetch preconnect" href="//cdn.staticmb.com" />\r\n\t\t<link rel="dns-prefetch preconnect" href="//img.staticmb.com" />\r\n\t\t<link rel="dns-prefet'

html_soup = BeautifulSoup(html, 'html.parser')

Beautiful Soup output (html_soup):

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="614944930403-u1s9tmqebd273mutmdhmsngnhvivgd6a.apps.googleusercontent.com" name="google-signin-client_id"/>
<link href="//cdn.staticmb.com" rel="dns-prefetch preconnect"/>
<link href="//img.staticmb.com" rel="dns-prefetch preconnect"/>
<link href="//b.scorecardresearch.com" rel="dns-prefetch preconnect"/>
<link href="//www.googleadservices.com" rel="dns-prefetch preconnect"/>
<link href="//stats.g.doubleclick.net" rel="dns-prefetch preconnect"/>
<link href="//ase.clmbtech.com" rel="dns-prefetch preconnect"/>
<link href="//ingestor.magicbricks.com" rel="dns-prefetch preconnect"/>
<link href="//maps.googleapis.com" rel="dns-prefetch preconnect"/>
<link href="//www.googletagmanager.com" rel="dns-prefetch preconnect"/>
<title>Properties in Hyderabad</title>

After we get the code, we need to navigate through the HTML to find out exactly where the required features live in the source. We can check this in the browser: to find the position of a particular object in the HTML document, such as the price of a property, right-click it and select Inspect.

The entire webpage is divided into chunks, and each chunk is placed in a div container with a class name (in the current example it is flex relative clearfix m-srp-card__container). Each such chunk holds all the required features of a single house. We can scan the entire page, collect all such chunks into a list, and iterate over them to extract the feature set for every house.

page_container = html_soup.find_all('div', class_ = 'flex relative clearfix m-srp-card__container' )

After collecting the webpage data as a list of containers, we can see that the data for each field is stored in <span> tags inside each of the page_containers. We can write a script that collects all the span tags of a container, extracts the data, and loops over the entire page_containers list, as below.

span_containers = page_container[3].find_all('span')
span_containers[1].text

Output: ' \n \t \t\t\t \t \t\t\t16,000 \t \t\t'

span_containers[1].text.replace('\n','')

Output: ' \t \t\t\t \t \t\t\t16,000 \t \t\t'

The logic remains the same for all the fields except Locality, which has to be pulled out of a longer string and is a little tricky to get. The raw data looks like the sample below; we want the string between "in" and "2000 sqft".

'    3 BHK Apartment  for rent in SMR Vinay Harmony County, Appa junction 2000 sqft  '

So, I have used regular expressions to find the pattern that starts with "in" and ends with a number (\d). This pattern is not uniform across the data, though: some samples end with a comma or a space instead of a number, so I have used three patterns to extract the locality.

Locality_1 = Locality_0.text.replace('\n', ' ')
Locality_2 = re.search(r'in(.+?)\d', Locality_1)
if Locality_2 is None:
    Locality_2 = re.search(r'in(.+?),', Locality_1)
if Locality_2 is None:
    Locality_2 = re.search(r'in(.+?) ', Locality_1)
# Pull the matched locality text out of the match object
Locality = Locality_2.group(1).strip() if Locality_2 else None
Output:
YMCA Circle, Narayanguda
Amarnath Residency
Nanakram Guda
My Home Avatar, Narsingi, Outer Ring Road
Nizampet
Kachiguda, NH
Miyapur, NH
Dukes Galaxy, Banjara Hills, NH
Machabollaram, Medchal Road

In a similar way, each of the remaining features can be extracted from the span_containers as shown above and saved to a .csv file for further processing; a rough sketch of the full loop follows, and the final output .csv file looks like the image below.
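To give a concrete picture, here is a minimal sketch of that loop, continuing from the page_container list and the pandas import above. The span index used for the rent (1) is only the position seen in the example, so it has to be re-checked in the browser's inspector for the live page.

records = []
for card in page_container:
    spans = card.find_all('span')
    texts = [s.text.replace('\n', ' ').strip() for s in spans]

    # spans[1] held the rent in the example above; bedrooms, bathrooms,
    # area, furnishing, tenants preferred and locality are picked out of
    # the same list in the same way.
    rent = texts[1] if len(texts) > 1 else None
    records.append({'Price': rent, 'raw_spans': texts})

# Save the raw records; the preprocessing step below cleans them up.
pd.DataFrame(records).to_csv('data.csv', index=False)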

House Prices Data.

The complete working code for the above, along with the data.csv file, can be found in my GitHub repository saitejamahadev/hyderabad_housing_data.

Data Preprocessing.

Photo by Mika Baumeister on Unsplash

Data preprocessing is an important step in any data science project. While extracting the data we tried to keep it clean and free of random values, but if we look closely at the features of the data set we find that some garbage and missing values still crept in during scraping.

Unique Bedroom values
Unique Bathroom values
Unique Furnishing type and Tenants preferred values

We can see the incorrect or garbage values embedded in the feature set by calling pandas' unique() method, as shown above. I have only listed a few features, but almost every feature of the data set has one irregularity or another that needs to be addressed.
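For reference, those unique-value lists come straight from pandas; a quick check looks something like the sketch below (the column names here are my assumption and should be matched to the header of the scraped .csv file).

import pandas as pd

data = pd.read_csv('data.csv')   # the file produced by the scraping step

# The unique values of each column expose any garbage picked up while scraping.
for col in ['Bedrooms', 'Bathrooms', 'Furnishing', 'Tenants']:
    print(col, ':', data[col].unique())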

For some features the garbage is merely attached to the required value, and we can extract the correct value from the sample. For others the data is collected incorrectly; Furnishing type, for instance, can only be "Furnished", "Semi-Furnished", or "Unfurnished", not the other values shown above.

This issue can be treated as a missing-values problem, and there are many classic ways to handle it in data preprocessing depending on the business case. In the current use case I substitute the mode (or mean) of the feature, on the assumption that in a given area people are unlikely to rent out a house to bachelors where families are preferred, and vice versa.
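As a rough illustration of that substitution, the sketch below continues with the data frame loaded above: out-of-vocabulary Furnishing values are treated as missing and then filled with the column mode (again, the column name is an assumption).

valid_furnishing = ['Furnished', 'Semi-Furnished', 'Unfurnished']

# Anything outside the three valid labels is treated as a missing value ...
data.loc[~data['Furnishing'].isin(valid_furnishing), 'Furnishing'] = None

# ... and replaced with the most frequent (mode) value of the column.
data['Furnishing'] = data['Furnishing'].fillna(data['Furnishing'].mode()[0])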

If we observe the samples, we can see that '\t' is embedded in almost all of them; removing it and then collapsing the leftover runs of whitespace cleans most of the samples. The lines of code below achieve that.

# price here is the row index used while looping over the samples;
# strip the embedded tabs first, then remove the leftover whitespace.
data.Bathrooms[price] = re.sub(r'\t', '', data.Bathrooms[price])
data.Bathrooms[price] = re.sub(r'\s+', '', data.Bathrooms[price])

Sample code showing how different features are handled in preprocessing.
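The same tab-stripping idea can also be applied column-wise with pandas string methods instead of a row-by-row loop; the sketch below is only an illustration and the column names are assumptions.

# Strip tabs and newlines, collapse leftover whitespace and trim each value.
for col in ['Bedrooms', 'Bathrooms', 'Price']:
    data[col] = (data[col].astype(str)
                          .str.replace(r'[\t\n]', '', regex=True)
                          .str.replace(r'\s+', ' ', regex=True)
                          .str.strip())

# Rents such as '16,000' still contain commas; drop them and convert to numbers.
data['Price'] = pd.to_numeric(data['Price'].str.replace(',', ''), errors='coerce')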

Finally, the cleaned data can be saved to a new file; it would look like the one shown below.
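With pandas this is a one-liner (the output filename is arbitrary).

# Persist the cleaned data for the analysis step.
data.to_csv('house_prices_cleaned.csv', index=False)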

House prices data after preprocessing.

Note: The data preprocessing techniques used in the House_data_preprocessing method are specific to the type of data collected and the irregularities present in it. The code needs slight modifications when working with a different web page.

Data Analysis.

After the data has been cleaned, we can use Python data visualization tools such as Matplotlib to visualize it and answer questions such as:

  • What is the common Furnishing style of the flats available for rent?
  • What type of Tenants are generally preferred by Landlords?
  • How many Bedrooms are generally available for rent?
  • What is the distribution of Furnishing type vs Tenants preferred?

The code snippet below answers these questions.

Data Visualization.
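As a minimal sketch of how such plots can be produced with Matplotlib, assuming the cleaned file house_prices_cleaned.csv and the hypothetical column names Furnishing, Tenants and Bedrooms used in the preprocessing sketches above:

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('house_prices_cleaned.csv')

# Simple bar charts of the categorical features.
for col in ['Furnishing', 'Tenants', 'Bedrooms']:
    data[col].value_counts().plot(kind='bar', title=col)
    plt.tight_layout()
    plt.show()

# Furnishing type vs tenants preferred as a grouped bar chart.
pd.crosstab(data['Furnishing'], data['Tenants']).plot(kind='bar')
plt.tight_layout()
plt.show()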
  • What is the common Furnishing style of the flats available for rent?
Furnishing Type
  • What type of Tenants are generally preferred by Landlords?
Tenants Preferred
  • How many Bedrooms are generally available for rent?
Bedrooms Distribution
  • What is the distribution of Furnishing type vs Tenants preferred?
Furnishing Type vs Tenants Preferred

Note: Please visit saitejamahadev/Hyderabad_House_Prices_Analysis for the complete working code of the above data preprocessing and analysis steps.


