Airbnb Pricing Recommender

Using machine learning to identify optimal listing prices

Abishek Gollapudi
Towards Data Science



As part of my data science program at Metis, I set out to design a better pricing system to help Airbnb hosts optimize their listing prices. In this article, I explain the motivation behind the model and the approach I took to train an optimal regression model.

Background

If you are a host, you’ve probably already heard of Airbnb’s Smart Pricing, a tool that automatically adjusts prices in response to demand and various other factors. But while Smart Pricing takes factors such as booking history into account, it is designed to maximize occupancy rate and can suggest lower prices than hosts would like.

Therefore, I wanted to design a tool trained on active listings to suggest optimal prices while also allowing hosts to compare the recommended price against prices of similar listings.

Pricing itself is very personal as some hosts seek to offer an affordable and budget-friendly experience while others may be more extravagant and seek to provide a super luxury experience. Instead of just suggesting a price to a host, I wanted hosts to be able to compare the recommended price against their direct competition in order to better help them make a decision on their listing price.

Objectives

  • Identify actionable features that hosts can use to improve marketability
  • Explore geographic locations of active listings
  • Create an interpretable regression model to allow hosts to understand the factors behind their suggested price

My Approach

  1. Gather data from Inside Airbnb
  2. Pre-process data and identify optimal listings
  3. Feature Engineering / Exploratory data analysis
  4. Regression modeling and evaluation
  5. Identify similar listings using NearestNeighbors

The Data

Unfortunately, Airbnb doesn’t publish any open datasets, but Inside Airbnb is an independent project that scrapes publicly available information on Airbnb listings in major cities around the world. For the scope of this project, I chose to focus on listings in Tokyo, Japan.

The data has some limitations: I only used listings scraped in September 2019, and while I wanted to model seasonality with a time series, Inside Airbnb did not have a full year of data for Tokyo.

Data Processing

Something to keep in mind is that modeling prices on all the scraped listings doesn’t make much sense. Many of these listings may be priced poorly or inactive, and a model trained on them wouldn’t necessarily suggest optimal prices. Therefore, the first step in data cleaning was defining what a “good” listing is and filtering out sub-optimal ones. One key quantity to track was occupancy, which wasn’t a feature included in the dataset.

The San Francisco Model

The San Francisco model is Inside Airbnb’s occupancy model, designed to estimate how often an Airbnb listing is rented out. In short, it does the following:

  • Estimate the number of bookings by assuming a 50% review rate (bookings = reviews ÷ 0.5)
  • Define an average length of stay per city (3 days in most cities)
  • Multiply estimated bookings by the average length of stay to estimate occupancy (booked days per month)

You can read more about this occupancy model here.
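Under these assumptions, the estimate reduces to a one-line calculation. Here is a minimal sketch (the function name and defaults are my own; the 50% review rate and 3-day stay are the model’s published assumptions):

```python
def estimated_booked_days(reviews_per_month, review_rate=0.5, avg_stay_days=3.0):
    """Estimate booked days per month from review counts,
    following the San Francisco model's assumptions."""
    bookings_per_month = reviews_per_month / review_rate  # 50% review rate
    return bookings_per_month * avg_stay_days

# A listing with 2 reviews/month -> ~4 bookings -> ~12 booked days/month
print(estimated_booked_days(2))  # 12.0
```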

So, how can we identify active, marketable listings?

  • Listing has been reviewed in the last 6 months
  • Listing has more than 5 reviews in the last 12 months
  • Estimated number of booked days per month is more than 7 days
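These three filters translate into a few boolean masks. A sketch with pandas (the column names and toy values are illustrative stand-ins for fields derivable from Inside Airbnb’s listings file):

```python
import pandas as pd

# Toy listings frame; real columns come from Inside Airbnb's listings.csv
listings = pd.DataFrame({
    "id": [1, 2, 3],
    "months_since_last_review": [2, 9, 1],
    "number_of_reviews_ltm": [12, 3, 8],      # reviews in last 12 months
    "est_booked_days_per_month": [10, 4, 12],
})

# Apply the three activity filters from above
active = listings[
    (listings["months_since_last_review"] <= 6)
    & (listings["number_of_reviews_ltm"] > 5)
    & (listings["est_booked_days_per_month"] > 7)
]
print(active["id"].tolist())  # listing 2 is filtered out
```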

Distribution After Processing

Price Distribution

Taking a look at the distribution of our target variable, price, reveals that 76.69% of listings are priced under $150. This matters later on: I found that splitting at a $150 threshold and training two models, one for listings above it and one for those below, improved model performance.
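The two-model split can be sketched like this, on synthetic data (the features and prices below are made up; the real inputs come from the processed dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))                    # stand-in listing features
price = 120 + 40 * X[:, 0] + rng.normal(scale=15, size=400)

# Fit one model per price segment instead of a single global model
THRESHOLD = 150
low, high = price < THRESHOLD, price >= THRESHOLD
model_low = LinearRegression().fit(X[low], price[low])
model_high = LinearRegression().fit(X[high], price[high])
```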

Handling Amenities

The number of amenities a listing offers definitely plays a role in its price. Inside Airbnb’s dataset has an amenities feature listing everything a listing offers, but a single string of amenities isn’t useful for regression modeling.

Thus, I parsed the amenities string for each listing and created a new feature column for each possible amenity, with a boolean value denoting whether a listing offers it. The drawback is that this greatly increases the dimensionality of the dataset, but we can drop less informative amenities later by examining p-values and applying lasso regularization.
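A minimal sketch of that parsing with pandas (the toy amenities strings mimic the dataset’s brace-delimited format):

```python
import pandas as pd

# Toy example; Inside Airbnb stores amenities as one string per listing
df = pd.DataFrame({
    "amenities": ['{TV,Wifi,"Hot water"}', '{Wifi,Kitchen}']
})

# Strip the braces and quotes, then one-hot encode each amenity
cleaned = (df["amenities"]
           .str.strip("{}")
           .str.replace('"', "", regex=False))
dummies = cleaned.str.get_dummies(sep=",")
print(sorted(dummies.columns))  # ['Hot water', 'Kitchen', 'TV', 'Wifi']
```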

Popular Tourist Attractions

Another factor that comes into play is proximity to tourist attractions. While Tokyo has an excellent public transportation system, I wanted to explore whether distance to certain tourist attractions factors into how listings are priced. The five attractions I chose are:

  1. Tokyo Imperial Palace
  2. Ginza Shopping District
  3. Sensoji Temple
  4. Ueno Park
  5. Tokyo Skytree

In fact, plotting the locations of the listings that remained after pre-processing revealed that most are relatively close to these tourist spots.
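Distance-to-attraction features can be computed with the haversine formula. A sketch (coordinates are approximate, and the listing location is a made-up point near Asakusa):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))  # Earth radius ~6371 km

ATTRACTIONS = {
    "Tokyo Imperial Palace": (35.6852, 139.7528),
    "Tokyo Skytree": (35.7101, 139.8107),
}

# One distance feature per attraction for a hypothetical listing
listing = (35.7148, 139.7967)
features = {name: haversine_km(*listing, *loc)
            for name, loc in ATTRACTIONS.items()}
```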

Modeling and Evaluation

Linear Regression

I chose linear regression for its interpretability, so hosts can easily understand the factors driving a prediction. During the modeling phase, many irrelevant features were dropped due to high p-values in OLS or multicollinearity. The mean absolute error (MAE) rounds out to about $18. I also employed ridge and lasso regression to prevent overfitting, but neither method had a noticeable effect on the final MAE score.
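A sketch of this modeling step on synthetic data (real features come from the processed dataset; the alpha value and coefficients here are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed listing features and prices;
# two of the five features are deliberately uninformative
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ np.array([30.0, 15.0, 0.0, 0.0, 5.0]) + 100 + rng.normal(scale=10, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ols = LinearRegression().fit(X_tr, y_tr)
mae = mean_absolute_error(y_te, ols.predict(X_te))

# Lasso shrinks the uninformative coefficients toward zero
lasso = Lasso(alpha=1.0).fit(X_tr, y_tr)
print(round(mae, 2), lasso.coef_.round(1))
```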

Finally, let’s look at actionable features that influenced the model’s predictions. As you can see below, becoming a Superhost and providing dishes and silverware for guests go a long way toward improving a listing’s marketability and driving up its price.

XGBoost

In addition to linear regression, I used XGBoost since it can find a non-linear fit to the data. As expected, XGBoost outperformed linear regression, with a mean absolute error of $14.23. But while XGBoost performs better, it is less interpretable. Tools such as SHAP can add interpretability to an XGBoost model, but from a host’s perspective, linear regression weights are still easier to understand than a SHAP plot.
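To illustrate why a boosted-tree model can beat a linear one on non-linear pricing data, here is a dependency-free sketch using scikit-learn’s GradientBoostingRegressor as a stand-in for XGBoost (all data is synthetic, with a deliberately non-linear target):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(600, 3))
# Non-linear target: a linear model cannot capture the x1**2 term
y = 50 * np.sin(X[:, 0]) + 20 * X[:, 1] ** 2 + rng.normal(scale=5, size=600)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
lin_mae = mean_absolute_error(
    y_te, LinearRegression().fit(X_tr, y_tr).predict(X_te))
gbt_mae = mean_absolute_error(
    y_te, GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr).predict(X_te))
print(gbt_mae < lin_mae)  # boosting captures the non-linearity
```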

SHAP Forceplot (Single Observation)

Finding Similar Listings

One of my original objectives was to give hosts similar listings against which to compare amenities, prices, and so on. For this, I used the NearestNeighbors algorithm to find the closest listings in feature space. Since the original dataset includes a listing URL feature, indexing the dataset for the closest listings also retrieves their URLs, letting hosts visit the Airbnb pages of those listings.
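A sketch of the similar-listings lookup (the feature matrix and URLs below are made up; in the project, the URLs come from the dataset’s listing URL column):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
features = rng.normal(size=(100, 6))  # scaled listing features
urls = [f"https://airbnb.com/rooms/{i}" for i in range(100)]  # hypothetical

# Query the first listing; neighbor 0 is the listing itself, so skip it
nn = NearestNeighbors(n_neighbors=4).fit(features)
_, idx = nn.kneighbors(features[:1])
similar_urls = [urls[i] for i in idx[0][1:]]
print(similar_urls)  # three closest listings in feature space
```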

Conclusion

Overall, this project was a fun exercise in exploring the characteristics of popular Airbnb listings in Tokyo. Note, though, that other factors such as seasonality, demand, and day of the week also affect price. While the dataset I used didn’t offer a good way to incorporate them, next steps would be to use a time series model to address seasonality and to find a way to model demand.

Thank you for reading!

Project repository can be found here.
