Predicting Airbnb prices with machine learning and location data

A case study using data from the City of Edinburgh, Scotland

Graciela Carrillo
Towards Data Science




As part of the IBM Data Science Professional Certificate, we get to have a go at our very own Data Science Capstone, where we get a taste of what it's like to solve problems and answer questions like a data scientist. For my assignment, I decided to do yet another project that looks into the relationship between Airbnb prices and their determinants. Yes, there are several very cool ones, like Laura Lewis's here. I would not have been able to do mine without reading and understanding hers (and her code), so kudos! However, since I'm all about transportation research, I added a little touch of geospatial analysis by looking into locational features as possible predictors. This post explains a bit of the project background, data collection, cleaning and pre-processing, modeling, and a quick wrap-up.

  1. Project background and aim
  2. The data
  3. Data cleaning and exploration
  4. The location feature
  5. Building the models
  6. Improving the models
  7. Conclusions and recommendations

For the complete notebook with all the code, you can check out the repo on my GitHub.

Project background and aim

If for some reason you don't already know, Airbnb is an internet marketplace for short-term home and apartment rentals. It allows you to, for example, list your home for rent for a week while you're away, or rent out your empty bedroom. One challenge that Airbnb hosts face is determining the optimal nightly rent price. In many areas, guests are presented with a good selection of listings and can filter by criteria like price, number of bedrooms, room type, and more. Since Airbnb is a market, the amount a host can charge is ultimately tied to market prices. The search experience on Airbnb looks like this:

Airbnb Edinburgh home page at https://www.airbnb.co.uk/s/Edinburgh--United-Kingdom/all?_set_bev_on_new_domain=1571135536_%2B1HjOBOK%2FivIgihM

Although Airbnb provides hosts with general guidance, there are no easy-to-access methods to determine the best price to rent out a space. There is third-party software available, but for a hefty price (for an example of available software, click here).

One method could be to find a few listings similar to the place that will be up for rent, average their listed prices and set our price to that average. But the market is dynamic, so we would want to update the price frequently, and this method can become tedious.

Another issue? We're probably going to miss the competitive advantages that our listing has over the surrounding listings, like how near it is to a grocery store, a pub and other extra services, or maybe even how awesome our photos are compared to theirs.

So, this project uses several listing features to try and predict price, with the added bonus of a predictor based on space: the property’s proximity to certain venues. This allows the model to put an implicit price on things such as living close to a bar, pub or a supermarket.

The data

Airbnb doesn't release any data to the public, but a separate group named Inside Airbnb scrapes and compiles publicly available information about many cities' listings from the Airbnb website. For this project, I used their data set for the city of Edinburgh, Scotland, scraped on July 21, 2019. It contains information on all Edinburgh Airbnb listings that were live on the site on that date (over 14,000).

The data has certain limitations. The most noticeable one is that it records the advertised price rather than the actual price paid by previous customers. More accurate data is available for a fee on sites like AirDNA.

Each row in the data set is a listing available for rental on Airbnb's site for the specific city (the observations). The columns describe different characteristics of each listing (the features).

Some of the more important features this project will look into are the following:

  • accommodates: the number of guests the rental can accommodate
  • bedrooms: number of bedrooms included in the rental
  • bathrooms: number of bathrooms included in the rental
  • beds: number of beds included in the rental
  • price: nightly price for the rental
  • minimum_nights: minimum number of nights a guest can stay for the rental
  • maximum_nights: maximum number of nights a guest can stay for the rental
  • number_of_reviews: number of reviews that previous guests have left

To model the spatial relationship between Airbnb rental prices and property proximity to certain venues, I used the Foursquare API to access the city's venues, and OpenStreetMap (OSM) for the street network.

Data cleaning and exploration

In short, the original dataset contained 14,014 Airbnb listings and 106 features, but I dropped a bunch. For example, some of those are free-text variables, like the host's description of the property and all the written reviews. Including that data would have required Natural Language Processing, which was beyond the scope of this project, so those features were dropped. Several columns only contained one category, so they were dropped too. Most listings don't offer the experiences feature, so that was also a goner.

A complete step-by-step account of this section, including how and why I cleaned the data the way I did, is in the notebook on my GitHub, or in this report without all the code.

An interesting exploratory data analysis is there too. One of the ‘no sh*t Sherlock’ findings is that there’s clear seasonality when it comes to listings entering the market.

Clear seasonality in the middle of the year

Every year, there is a peak in hosts joining around the middle of the year (summer), and the lowest points are the beginning and the end of each year. In August, during the Edinburgh Fringe Festival, room rental prices rise considerably. It is an extremely popular event, and much of the available rental property gets taken up by the number of people who attend each year. Edinburgh has lots of other events throughout the year, mainly concentrated between April and December, with world-class artists performing throughout these months.

For the rest of the year, Edinburgh is uniquely placed as the cultural capital of not just Scotland, but the UK. There is stable demand for accommodation beyond tourism, because Edinburgh also serves as the UK's second business capital after London.

In terms of changes in prices over time, the shape of the price distribution for Airbnb listings in Edinburgh has shifted over the last 10 years. The mean price in 2010 was £107.33 and the median £115.0, whereas the mean price in 2018 (the last complete year of data) was £104.55 and the median £79.0. The median has fallen far more than the mean, which points to growth at the expensive end of the market pulling the mean up relative to the median. Basically, owners of expensive accommodation rapidly discovered the benefits of using Airbnb (mainly less tax and fewer responsibilities, but that's a story for another day).

Outliers have been increasing over time

The location feature

Now for the cool stuff (for me). When looking for accommodation, being near certain tourist areas is important. But also, knowing that you will have a grocery store or a supermarket within walking distance can be a plus. A lot of renters like the fact that you can cook your own meals (and save some ££) when you get an Airbnb instead of a hotel room.

So, proximity to certain venues, such as restaurants, cafes and shops could help us predict price.

Most Airbnb listings are centred around Edinburgh's Old Town area, which is consistent with its huge draw for tourists, especially during the annual Fringe festival.

Most Airbnb listings in Edinburgh are near the Old Town area

The distribution of prices also shows that the most expensive listings are in the Old Town area.

Price spatial distribution

To get information on specific venues, I used the latitude and longitude of Edinburgh's neighbourhoods to download a list of venues per neighbourhood via Foursquare's API.

The latitude and longitude came from a publicly available list of neighbourhoods in the form of a geojson file. I could have retrieved the latitude and longitude from the original Airbnb listings dataset, but I was already working with this file, so I decided to just use that. Afterwards, I merged them by neighbourhood.

First, I had to load the geodata:
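With GeoPandas this is a one-liner (a minimal sketch; the geojson file name is an assumption):

```python
import geopandas as gpd

# Load the neighbourhood boundaries from the geojson file
map_df = gpd.read_file('neighbourhoods.geojson')
map_df.head()
```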

map_df is a GeoDataFrame. To retrieve the venues per neighbourhood, I extracted the latitude and longitude (x and y) from the point objects in the GeoSeries. The following code returns x and y as separate columns within the original map_df.
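Something along these lines (a sketch; representative_point() handles both point and polygon geometries, returning a point guaranteed to lie within each shape):

```python
# Reduce each geometry to a single point, then store x (longitude)
# and y (latitude) as separate columns in map_df
points = map_df.geometry.representative_point()
map_df['x'] = points.x
map_df['y'] = points.y
```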

Then we create the API request URL, make the GET request and extract the information for each venue near our neighbourhoods. This very useful function, provided to us during the Data Science course, loops through all neighbourhoods to retrieve the data we want:
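A sketch of that function, assuming CLIENT_ID, CLIENT_SECRET and VERSION hold your Foursquare API credentials (the radius and limit defaults are assumptions):

```python
import requests
import pandas as pd

def getNearbyVenues(names, latitudes, longitudes, radius=500, limit=100):
    """Query the Foursquare 'explore' endpoint for each neighbourhood."""
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        url = ('https://api.foursquare.com/v2/venues/explore'
               f'?client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}'
               f'&v={VERSION}&ll={lat},{lng}&radius={radius}&limit={limit}')
        items = requests.get(url).json()['response']['groups'][0]['items']
        # Keep the venue name, coordinates and primary category
        venues_list.extend(
            (name, lat, lng,
             v['venue']['name'],
             v['venue']['location']['lat'],
             v['venue']['location']['lng'],
             v['venue']['categories'][0]['name'])
            for v in items)
    return pd.DataFrame(venues_list, columns=[
        'Neighbourhood', 'Neighbourhood Latitude', 'Neighbourhood Longitude',
        'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category'])
```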

Now we run the above function on each neighbourhood and create a new dataframe called edinburgh_venues.
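Assuming the name column in map_df is called 'neighbourhood' and using the x/y columns from above:

```python
# One API call per neighbourhood, collected into a single dataframe
edinburgh_venues = getNearbyVenues(names=map_df['neighbourhood'],
                                   latitudes=map_df['y'],
                                   longitudes=map_df['x'])
```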

Once we have this, we can look at the different venues around each neighbourhood. To begin with, there were 182 unique venue categories. They were pretty messy though. So I cleaned the retrieved data, grouped the rows by neighbourhood and took the mean of each one-hot-encoded category, which gives the frequency of occurrence of each category per neighbourhood.
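The grouping step is the usual one-hot-plus-mean pattern (a sketch):

```python
# One-hot encode the venue categories; the per-neighbourhood mean of each
# dummy column is the frequency of occurrence of that category
onehot = pd.get_dummies(edinburgh_venues[['Venue Category']], prefix='', prefix_sep='')
onehot['Neighbourhood'] = edinburgh_venues['Neighbourhood']
edinburgh_grouped = onehot.groupby('Neighbourhood').mean().reset_index()
```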

This helped me find the 5 most common venues per neighbourhood:
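With a small helper in the spirit of the course notebooks (a sketch):

```python
def return_most_common_venues(row, num_top_venues=5):
    # Drop the neighbourhood name, sort the category frequencies,
    # and keep the names of the top categories
    row_categories = row.iloc[1:]
    return row_categories.sort_values(ascending=False).index.values[:num_top_venues]
```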

And save it to a pandas dataframe to have a look and decide which venues are of interest:
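Roughly like so (the column labels are illustrative):

```python
num_top = 5
columns = ['Neighbourhood'] + [f'{i + 1}. Most Common Venue' for i in range(num_top)]

venues_sorted = pd.DataFrame(columns=columns)
venues_sorted['Neighbourhood'] = edinburgh_grouped['Neighbourhood']
for ind in range(edinburgh_grouped.shape[0]):
    venues_sorted.iloc[ind, 1:] = return_most_common_venues(
        edinburgh_grouped.iloc[ind], num_top)
venues_sorted.head()
```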

Five most common venues per neighbourhood (first five rows)

Looking closely at the data, it was clear that the most common venues were Hotel, Pub, Grocery Store/Supermarket, and Cafe/Coffee Shop, with Bar, Bus Stop and Indian Restaurant close behind. Restaurants were divided into subcategories, so I aggregated them into one big Restaurant category. I thought it unlikely that having a hotel nearby affects price, because Airbnb listings are supposed to be in a different category of short-term rental and offer quite different benefits compared to hotels. So I didn't consider that category of venues.

Armed with this data, I began the process of measuring the proximity to venues we have been talking about. We do this by analysing accessibility.

Accessibility is the ease of reaching destinations. It is a very important concept in transportation, with a wide body of literature available. Accessibility shapes people's ability to access opportunities and reach their full potential as human beings. That is why accessibility measures are often used as proxies for spatial and transport inequality.

To keep it simple, for this project, I measured accessibility as access to venues or Points of Interest (POIs) downloaded from Foursquare. Pubs, Restaurants, Cafes and Supermarkets/Grocery Stores are our POIs (a dataset of 441 POIs).

To help with the accessibility analysis, Pandana is the Python library of choice. Pandana is a handy network-analysis library that lets you pass Pandas dataframes into a network graph and maps graph-level analyses onto fast underlying C operations. All of this is to say: it's much faster than traditional Python-based graph tools.

To start the analysis, we have to get the street network data from the OSM API using the location data of Edinburgh (the bounding box).
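Pandana's OSM loader wraps this in a single call (a sketch; the bounding-box coordinates are approximate values for Edinburgh):

```python
from pandana.loaders import osm

# Download the pedestrian street network inside an approximate bounding
# box for Edinburgh: lat_min, lng_min, lat_max, lng_max
network = osm.pdna_network_from_bbox(55.87, -3.33, 56.00, -3.08)
```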

A network with 51,553 nodes is then downloaded from OSM after running the code. When saving it to HDF5, certain nodes can be excluded, which is useful for refining a network so that it includes only validated nodes. I used the low_connectivity_nodes() method to identify nodes that may not be connected to the larger network and excluded them.
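Roughly like this (the impedance and count thresholds are assumptions):

```python
# Flag nodes that can't reach at least 10 other nodes within 10 km of
# network distance, then drop them when persisting the network to HDF5
lcn = network.low_connectivity_nodes(impedance=10000, count=10, imp_name='distance')
network.save_hdf5('edinburgh_network.h5', rm_nodes=lcn)
```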

Finally, we calculate accessibility to the POIs:
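The Pandana pattern is: register the POIs against the network, then query the nearest ones (a sketch; pois stands in for the 441-venue dataframe from Foursquare, and the search distance is an assumption):

```python
# Register all POIs as a single category, searching up to 1 km of network
# distance for up to 10 of them from each node
network.set_pois(category='all', maxdist=1000, maxitems=10,
                 x_col=pois['lon'], y_col=pois['lat'])

# Distance in meters from every network node to its 10 nearest POIs
access = network.nearest_pois(distance=1000, category='all', num_pois=10)
```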

The resulting data frame looks like this:

Network distance from the node to each of the 10 POIs

And what better way to understand what all these numbers are about than with some nice visualisations (what is geospatial analysis without them, really?). First, let's look at the distance to the nearest POI of any type, since I'm not interested in accessibility differences between venue types:
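A plain Matplotlib scatter of the network nodes does the job (a sketch; the original notebook's plotting code may differ):

```python
import matplotlib.pyplot as plt

# Colour every network node by the distance to its single nearest venue
fig, ax = plt.subplots(figsize=(10, 8))
sc = ax.scatter(network.nodes_df.x, network.nodes_df.y,
                c=access[1], s=1, cmap='viridis_r', vmax=500)
plt.colorbar(sc, label='meters to nearest venue')
ax.set_axis_off()
plt.show()
```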

We can see that there are some zones where people have to walk more than 500 meters to reach the nearest amenity, whereas Edinburgh’s Old Town has walking distances of less than 100 meters on average.

The map shows the walking distance in meters from each network node to the nearest restaurant, bar, cafe, pub or grocery shop. But a better indicator of accessibility might be having access to a large number of amenities. So instead of the nearest, I plotted accessibility to the fifth-nearest amenity:
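Same plot, swapping in the fifth column of the accessibility dataframe:

```python
# Colour nodes by the distance to the 5th-nearest venue instead of the 1st
fig, ax = plt.subplots(figsize=(10, 8))
sc = ax.scatter(network.nodes_df.x, network.nodes_df.y,
                c=access[5], s=1, cmap='viridis_r', vmax=1000)
plt.colorbar(sc, label='meters to 5th-nearest venue')
ax.set_axis_off()
plt.show()
```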

This time it is even more noticeable that the Old Town and city centre area of Edinburgh have better accessibility.

For this project, I'm hypothesizing that access to restaurants, shops, cafes, bars and pubs is important for Airbnb users. So I'm weighting all POIs equally and using the distance to the fifth-nearest venue as a compound measure of accessibility. This gives a clearer picture of which neighbourhoods are most walkable than plotting just the distance to the single nearest venue.

Here’s the code to set the distance to the fifth nearest amenity as a compound measure of accessibility:
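In essence (a sketch; listings and its coordinate column names are assumptions):

```python
# Snap each Airbnb listing to its nearest network node, then look up that
# node's distance to the 5th-nearest venue as the listing's accessibility score
listings['node_id'] = network.get_node_ids(listings['longitude'], listings['latitude'])
listings['score'] = access[5].loc[listings['node_id']].values
```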

So there it is. A location feature has been created and then added to the dataset for modelling.

Building the models

A lot of data preparation happened here, which of course is on the repo. I won't go into the details, but in summary: I cleaned the data, checked for multicollinearity and dealt with problematic features, log-transformed the price (which made it more normally distributed), one-hot encoded the categorical features, standardised everything with StandardScaler(), and finally split the data into train and test sets.
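In code, the preparation boils down to something like this (a condensed sketch; df and categorical_cols stand in for the cleaned dataset and its categorical columns):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Log-transform the target, one-hot encode categoricals, split, then scale
df['price'] = np.log1p(df['price'])
df = pd.get_dummies(df, columns=categorical_cols)
X, y = df.drop(columns='price'), df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```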

Two models were used:

  • A Spatial Hedonic Price Model (OLS regression), with LinearRegression from the scikit-learn library
  • A gradient boosting model, with XGBRegressor from the XGBoost library

The evaluation metrics used were mean squared error (for loss) and r-squared (for goodness of fit).

I chose to run a Hedonic Model Regression because they are commonly used in real estate appraisal and real estate economics. Ideally, Lagrange multiplier tests should be conducted to verify if there is spatial lag in the dependent variable and therefore a spatial lag model is preferred for estimating a Spatial HPM (see this post for spatial regression using Pysal). However, for the purposes of this project, I’m only using a conventional OLS model for hedonic price estimation that includes spatial and locational features, but not a spatial lag that accounts for spatial dependence.

So, the first group of explanatory variables comprises the listing characteristics (accommodates, bathrooms, etc.). The second group, based on spatial and locational features, comprises Score, the network distance to the fifth-nearest venue we computed with Pandana, and neighbourhood membership (1 if the listing belongs to the specified neighbourhood, 0 otherwise).

Here’s my code for the Spatial Hedonic Price Model:
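Stripped to its essentials, it looks like this (a sketch reusing the split above):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Ordinary least squares on the listing + location features
lm = LinearRegression().fit(X_train, y_train)
y_pred = lm.predict(X_test)
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('R^2: ', r2_score(y_test, y_pred))
```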

And results:

So the features explain approximately 51% of the variance in the target variable. Interpreting the mean_squared_error value is somewhat more intuitive than the r-squared value. As the square root of a variance, RMSE can be interpreted as the standard deviation of the unexplained variance, and it has the useful property of being in the same units as the response variable. In this case, though, the response variable is log-transformed, so the interpretation is not in the original units but closer to a percentage deviation. Lower values of RMSE indicate better fit, and RMSE is a good measure of how accurately the model predicts the response. We can see this relationship graphically with a scatter plot:

Actual vs Predicted Price for Spatial Hedonic Regression Model

I also tried using Ridge Regularization to decrease the influence of less important features. Ridge Regularization is a process which shrinks the regression coefficients of less important features. It takes a parameter alpha, which controls the strength of the regularization.

I experimented by looping through a few different values of alpha to see how this changed the results.
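Something like this (the alpha grid is illustrative):

```python
from sklearn.linear_model import Ridge

# Sweep the regularisation strength and compare test-set performance
for alpha in [0.01, 0.1, 1, 10, 100]:
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    pred = ridge.predict(X_test)
    print(f'alpha={alpha}: '
          f'RMSE={np.sqrt(mean_squared_error(y_test, pred)):.4f}, '
          f'R^2={r2_score(y_test, pred):.4f}')
```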

Here’s what I got:

Yeah… not much happened. The RMSE and r² values stay almost identical as the alpha value increases, which means the prediction doesn't improve substantially with the ridge regression model.

Moving on to the Gradient boosted decision trees model.

XGBoost (eXtreme Gradient Boosting) is an implementation of gradient boosted decision trees designed for speed and performance. It is a very popular algorithm that has recently been dominating applied machine learning for structured data.

It was expected that this model would provide the best achievable accuracy, plus a measure of feature importance, compared to the hedonic regression. And it did.

Here’s the code for the XGBoost model:
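In outline (the hyperparameters here are placeholders, not the tuned values from the notebook):

```python
from xgboost import XGBRegressor

# Gradient boosted trees on the same train/test split
xgb = XGBRegressor(n_estimators=500, learning_rate=0.05,
                   max_depth=5, random_state=42)
xgb.fit(X_train, y_train)
pred = xgb.predict(X_test)
print('RMSE:', np.sqrt(mean_squared_error(y_test, pred)))
print('R^2: ', r2_score(y_test, pred))
```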

And the results:

With this model, the features explain approximately 65% of the variance in the target variable, and it has a smaller RMSE than the SHP regression model.

Apart from its superior performance, a benefit of using ensembles of decision tree methods like gradient boosting is that they can automatically provide estimates of feature importance from a trained predictive model. For more detailed information on how feature importance is calculated in boosted decision trees, have a look at this answer on Stack Overflow.
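XGBoost ships a one-liner for this (a sketch):

```python
from xgboost import plot_importance
import matplotlib.pyplot as plt

# Rank and plot the top features by importance in the trained model
fig, ax = plt.subplots(figsize=(8, 10))
plot_importance(xgb, ax=ax, max_num_features=20)
plt.show()
```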

XGBoost Model Feature importance

As you can see, a good number of features have a feature importance of 0 in this XGBoost regression model.

The top 10 most important features are:

  1. Whether the rental is the entire home/flat (room_type_Entire home/apt)
  2. How many people the property accommodates (accommodates)
  3. The type of property (property_type_Other)
  4. The number of bathrooms (bathrooms)
  5. How many days are available to book out of the next 90 (availability_90)
  6. The number of reviews (number_of_reviews)
  7. The cancellation policy being moderate (cancellation_policy_moderate)
  8. How many other listings the host has (host_listings_count)
  9. The minimum night stays (minimum_nights)
  10. The maximum nights stay (maximum_nights)

The most important feature for predicting price is whether the rental is the entire flat. By a lot. It makes sense: the asking price is higher if the offer is for the entire flat/house. The second is how many people the property accommodates, which also makes sense. It's the first thing you would look for when you search for a place.

The funny thing though is that location features are not in the top ten. Belonging to a certain neighbourhood and Score (accessibility measure) are of relatively low importance compared to the top 3 features. However, Review_Scores_Location_9/10 almost made it to the top ten. Pretty interesting.

It seems price is affected by previous guests' positive ratings of the location rather than by the actual accessibility of the property with respect to the main points of interest.

This could also be because Edinburgh is a small and walkable city with good transportation services. So maybe location is not a major problem in reaching main tourist attractions and amenities.

Improving the models

The feature importance graph produced by the XGBoost model showed that a lot of the review columns are of relatively low importance. So I dropped them and ran the models again without those columns.
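For instance (the column prefix is an assumption about how the review features are named):

```python
# Drop the low-importance review columns and rebuild the train/test split
review_cols = [c for c in X.columns if c.startswith('review_scores')]
X_reduced = X.drop(columns=review_cols)
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X_reduced, y, test_size=0.2, random_state=42)
```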

Here are the results:

And like with the Ridge Regression attempt, not much happened here either. Both Spatial Hedonic Regression and XGBoost perform almost exactly the same without the additional review columns.

Conclusions and Recommendations

If we are to choose a model to predict price, it would be the XGBoost model without the review columns. It performs better than both Spatial Hedonic Regression models, does just as well as the first XGBoost model, and is less computationally expensive. Still, it only explains about 66% of the variation in price, which means 34.44% remains unexplained.

A bunch of different variables that are not included could explain the rest of the variance.

For example, given the importance of customer reviews of the listing in determining price, a better understanding of the reviews could very likely improve the prediction. One could use Sentiment Analysis. A score between -1 (very negative sentiment) and 1 (very positive sentiment) can be assigned to each review per listing property. The scores are then averaged across all the reviews associated with that listing and the final scores can be included as a new feature in the model (see here for an example).
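As a sketch of how that could work with NLTK's VADER analyser (the reviews dataframe and its columns are assumptions):

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-off download
sia = SentimentIntensityAnalyzer()

# VADER's 'compound' score already lies in [-1, 1]
reviews['sentiment'] = reviews['comments'].apply(
    lambda text: sia.polarity_scores(str(text))['compound'])

# Average across all reviews of a listing to get one feature per listing
listing_sentiment = reviews.groupby('listing_id')['sentiment'].mean()
```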

We could also include image quality as a feature. Using Difference-in-Difference with deep learning and supervised learning analyses on an Airbnb panel dataset, researchers found that units with verified photos (taken by Airbnb’s photographers) generate additional revenue per year on average (see here).

I still think accessibility plays an important role, even though it wasn’t so clear this time. I used the OSM pedestrian network for the accessibility analysis, but using a multi-modal graph network would have been the ideal thing to do. This way I would’ve measured accessibility in terms of the whole transport network and not only in terms of walkability. Maybe the feature importance would have been higher on the list.

And that's that. This was a very fun project to work on.

If you somehow managed to go through the whole thing and are reading this, thank you for your time! Any feedback is much appreciated.
