Photo Credit: Pixabay

Building a Content-Based Recommender System for Hotels in Seattle

How to use hotel descriptions to recommend similar hotels.


The cold start problem is a well-known and well-researched problem for recommender systems, in which the system is not able to recommend items to users. It arises in three different situations: new users, new products, and new websites.

Content-based filtering is one method that addresses this problem. The system first uses the metadata of new products when creating recommendations, while visitor actions are secondary for a certain period of time. It recommends a product to a user based on the category and description of the product.

Content-based recommendation systems can be used in a variety of domains, including recommending web pages, news articles, restaurants, television programs, and hotels. The advantage of content-based filtering is that it doesn’t have a cold-start problem for items: if you have just launched a new website or added new products, they can be recommended right away.

Let’s assume we are starting a new online travel agency (OTA). We have signed up thousands of hotels that are willing to sell on our platform, and we are starting to see traffic from website users, but we don’t have any user history. Therefore, we are going to build a content-based recommendation system that analyzes hotel descriptions to identify hotels that are of particular interest to the user.

We would like to recommend hotels based on the hotels that a user has already booked or viewed, using cosine similarity. We recommend the hotels with the largest similarity to the ones the user previously booked, viewed, or showed interest in. Our recommender system therefore depends heavily on defining an appropriate similarity measure. Finally, we select a subset of hotels to display to the user, or determine the order in which to display them.
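To make the similarity measure concrete, here is a minimal sketch of cosine similarity between two term-count vectors. The function name and the example vectors are illustrative assumptions, not part of the original notebook, which relies on scikit-learn later on.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # An all-zero vector has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical term counts for two descriptions over the same vocabulary.
print(cosine_similarity([1, 2, 0], [2, 4, 0]))
print(cosine_similarity([1, 0, 0], [0, 1, 0]))
```

Vectors that point in the same direction score close to 1 regardless of their length, which is why a long and a short description that use the same vocabulary can still be judged similar.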

The Data

It’s very hard to find publicly available hotel description data, so I collected it myself from each hotel’s homepage for over 150 hotels in the Seattle area, including downtown business hotels, boutique hotels, bed and breakfasts, airport business hotels, inns near the universities, motels in the middle of nowhere, and so on. The data can be found here.

# Standard library
import re
import random

# Data handling
import numpy as np
import pandas as pd

# Text processing and modeling
from nltk.corpus import stopwords
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Plotting (on plotly >= 4, plotly.plotly moved to the chart_studio package:
# use `import chart_studio.plotly as py` instead)
import cufflinks
import plotly.figure_factory as ff
import plotly.graph_objs as go
import plotly.plotly as py
from plotly.offline import iplot

# Notebook display settings
from IPython.core.interactiveshell import InteractiveShell
pd.options.display.max_columns = 30
InteractiveShell.ast_node_interactivity = 'all'
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='solar')

# Load the hotel descriptions
df = pd.read_csv('Seattle_Hotels.csv', encoding="latin-1")
df.head()
print('We have', len(df), 'hotels in the data')
Table 1

Let’s have a look at a few hotel name and description pairs.

print_description.py
print_description(10)
Figure 1
print_description(100)
Figure 2
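The body of print_description is not shown above, so here is a minimal sketch of what such a helper could look like. It assumes the data has a `name` column next to the `desc` column used later in this post; the toy rows and the `format_description` helper are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the real data; both rows are made up for illustration.
toy = pd.DataFrame({
    "name": ["Example Inn", "Sample Suites"],
    "desc": ["A quiet inn near the university.", "Modern suites close to downtown."],
})


def format_description(data, index):
    """Return the description and name stored at a given row label."""
    row = data.loc[index]
    return f"{row['desc']}\nName: {row['name']}"


def print_description(index, data=toy):
    """Print one hotel's description followed by its name."""
    print(format_description(data, index))


print_description(1)
```

With the real data loaded, passing `data=df` would print the hotel at that row label instead of the toy rows.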

EDA

Token (vocabulary) Frequency Distribution Before Removing Stop Words

unigram_distribution.py
Figure 3
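The plotting code for these distributions is not reproduced here. As a rough sketch of the counting step behind them, the helper below uses CountVectorizer to total n-gram frequencies across all descriptions. The function name `top_ngrams`, its parameters, and the sample sentences are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


def top_ngrams(corpus, n=20, ngram_range=(1, 1), stop_words=None):
    """Return the n most frequent n-grams in corpus as (term, count) pairs."""
    vectorizer = CountVectorizer(ngram_range=ngram_range, stop_words=stop_words)
    bag_of_words = vectorizer.fit_transform(corpus)
    # Sum each column to get the total count of every term over all documents.
    totals = np.asarray(bag_of_words.sum(axis=0)).ravel()
    counts = [(term, int(totals[col])) for term, col in vectorizer.vocabulary_.items()]
    # Most frequent first; ties are broken alphabetically for a stable order.
    counts.sort(key=lambda pair: (-pair[1], pair[0]))
    return counts[:n]


# Made-up sentences, only to show the call.
sample = ["pike place market", "near pike place market", "the space needle"]
print(top_ngrams(sample, n=3, ngram_range=(3, 3)))
```

Changing `ngram_range` to `(2, 2)` or `(3, 3)` gives bigrams or trigrams, and passing `stop_words='english'` removes scikit-learn's built-in English stop words before counting, which covers the before and after variants in this section.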

Token (vocabulary) Frequency Distribution After Removing Stop Words

unigram_distribution_stopwords_removed.py
Figure 4

Bigrams Frequency Distribution Before Removing Stop Words

bigrams_distribution.py
Figure 5

Bigrams Frequency Distribution After Removing Stop Words

bigrams_distribution_stopwords_removed.py
Figure 6

Trigrams Frequency Distribution Before Removing Stop Words

trigrams_distribution.py
Figure 7

Trigrams Frequency Distribution After Removing Stop Words

trigrams_distribution_stopwords_removed.py
Figure 8

Everyone knows Seattle’s Pike Place Market, but it is way more than a public farmers market. It is a vibrant historical tourist attraction made up of hundreds of farmers, craftspeople, and small businesses. The hotel industry thrives on location, and tourists look for a hotel as close as possible to downtown and/or the must-visit attractions of the city. Therefore, every hotel that is not too far from the market brags about it.

Hotel Description Word Count Distribution

# Count whitespace-separated words in each description
df['word_count'] = df['desc'].apply(lambda x: len(str(x).split()))
desc_lengths = list(df['word_count'])
print("Number of descriptions:", len(desc_lengths),
      "\nAverage word count", np.average(desc_lengths),
      "\nMinimum word count", min(desc_lengths),
      "\nMaximum word count", max(desc_lengths))
word_count_distribution.py
Figure 9

Many hotels use their descriptions to full potential and know how to write captivating descriptions that appeal to travelers’ emotions and drive direct bookings. Their descriptions may be longer than others.

Text Preprocessing

The text is pretty clean, so we don’t have a lot to do, but we clean it anyway just in case.

description_preprocessing.py
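The preprocessing script itself is not shown, so the following is only a sketch of a typical cleaning step for this kind of text: lowercase, replace punctuation with spaces, drop leftover symbols, and remove stop words. The regular expressions, the function name `clean_text`, and the tiny stop word set are assumptions. Given the imports at the top, the full NLTK list from `stopwords.words('english')` would be the natural replacement for the toy set.

```python
import re

# Punctuation that usually separates words: replace with a space.
REPLACE_BY_SPACE_RE = re.compile(r"[/(){}\[\]|@,;]")
# Anything that is not a lowercase letter, digit, space, or one of # + _ is dropped.
BAD_SYMBOLS_RE = re.compile(r"[^0-9a-z #+_]")
# Tiny illustrative stop word set (an assumption, not the NLTK list).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "at"}


def clean_text(text):
    """Lowercase, strip punctuation and symbols, and remove stop words."""
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(" ", text)
    text = BAD_SYMBOLS_RE.sub("", text)
    return " ".join(word for word in text.split() if word not in STOPWORDS)


print(clean_text("Located in the heart of Seattle, WA!"))
```

Applied to the data, this would look something like `df['desc_clean'] = df['desc'].apply(clean_text)`, where `desc_clean` is a hypothetical column name.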

Modeling

  • Create a TF-IDF matrix of unigrams, bigrams, and trigrams for each hotel.
  • Compute similarity between all hotels using sklearn’s linear_kernel (equivalent to cosine similarity in our case, because TfidfVectorizer L2-normalizes each row by default).
  • Define a function that takes in hotel name as input and returns the top 10 recommended hotels.
hotel_rec_model.py
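The model script is not reproduced here, so the steps above can be sketched as follows. This is not necessarily the notebook's exact code: the toy hotels, the `top_n` parameter, and the module-level `names` and `similarities` variables are assumptions made to keep the example self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy stand-in for the cleaned hotel data; names and text are made up.
toy = pd.DataFrame({
    "name": ["Airport Inn A", "Airport Inn B", "Downtown Suites"],
    "desc": [
        "free airport shuttle close to the airport terminal",
        "airport shuttle and parking near the airport terminal",
        "walk to pike place market from downtown",
    ],
})

# TF-IDF over unigrams, bigrams, and trigrams, one row per hotel.
vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 3), stop_words="english")
tfidf_matrix = vectorizer.fit_transform(toy["desc"])

# Rows are L2-normalized by default, so the dot product is the cosine similarity.
similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
names = list(toy["name"])


def recommendations(name, top_n=10):
    """Return up to top_n hotel names most similar to the named hotel."""
    idx = names.index(name)
    ranked = np.argsort(-similarities[idx])  # most similar first
    return [names[i] for i in ranked if i != idx][:top_n]


print(recommendations("Airport Inn A", top_n=2))
```

The seed hotel is excluded from its own results, since every hotel has similarity 1 with itself. Precomputing the full similarity matrix is cheap for a few hundred hotels, but it grows quadratically with the number of hotels.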

Recommendations

Let’s make some recommendations!

recommendations('Hilton Seattle Airport & Conference Center')

A good test of whether our similarity measure works is whether the content-based recommender returns airport hotels when the seed is an airport hotel.

We can also ask Google. The following are recommended by Google for “Hilton Seattle Airport & Conference Center”:

Figure 10

Three out of four recommended by Google were also recommended by us.

The following are recommended by TripAdvisor for “Hilton Seattle Airport & Conference Center”:

Figure 11

Not bad either.

Try a bed & breakfast.

recommendations("The Bacon Mansion Bed and Breakfast")

The following are recommended by Google for “The Bacon Mansion Bed and Breakfast”:

Figure 12

Cool!

The following are recommended by TripAdvisor for “The Bacon Mansion Bed and Breakfast”. I was not impressed by these.

Figure 13

The Jupyter notebook can be found on GitHub. If you prefer, there is also an nbviewer version.

Have a productive week!
