Predict Where a New User Will Book Their First Travel Experience

User Intention Prediction, Learning to Rank

Susan Li

Published in Towards Data Science · 10 min read · Nov 22, 2018

At the heart of the Airbnb site is its search. And one of the most central features of its search is search ranking.

Airbnb personalizes the search results it displays, showing the listings and experiences it predicts will be best for each user.

This article explores two data sets released by Airbnb, with the purpose of predicting the first country in which a new Airbnb user books a trip. More specifically, we want to predict the top 5 travel destination countries for each user, in decreasing order of relevance.

We want to achieve a level of personalization by inferring guest preferences based on their demographic information and session activities, as guests plan their trips by engaging with listings and making inquiries.

The Data

The data sets can be downloaded from Kaggle, and we will use three files:

train_users_2.csv
test_users.csv
sessions.csv

Each user is described by 16 features:

id: user id
date_account_created: the date of account creation
timestamp_first_active: timestamp of the first activity; note that it can be earlier than date_account_created or date_first_booking because a user can search before signing up
date_first_booking: date of first booking
gender
age
signup_method
signup_flow: the page a user came to sign up from
language: international language preference
affiliate_channel: what kind of paid marketing
affiliate_provider: where the marketing is e.g. google, craigslist, other
first_affiliate_tracked: the first marketing the user interacted with before signing up
signup_app
first_device_type
first_browser
country_destination: this is the target variable we are to predict

Each web session is described by 6 features:

user_id: to be joined with the column ‘id’ in users table
action
action_type
action_detail
device_type
secs_elapsed

Users Exploration

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

train_users = pd.read_csv('train_users_2.csv')
test_users = pd.read_csv('test_users.csv')
print("There were", train_users.shape[0], "users in the training set and", test_users.shape[0], "in the test set.")
print("In total there were", train_users.shape[0] + test_users.shape[0], "users.")

Figure 1

We will explore all the users in the training and test data sets.

df = pd.concat((train_users, test_users), axis=0, ignore_index=True)

Missing Data

The gender and first_browser columns contain the placeholder “-unknown-”, which we will replace with NaN. Also, the test set has no information in the date_first_booking column, so we will drop this feature.

After that, we can see that a lot of data is missing.

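# Assign the result back instead of calling replace(..., inplace=True) on a
# column accessor, which newer pandas versions may silently ignore
df['gender'] = df['gender'].replace('-unknown-', np.nan)
df['first_browser'] = df['first_browser'].replace('-unknown-', np.nan)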
df.drop('date_first_booking', axis=1, inplace=True)
df.isnull().sum()
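Raw counts are hard to compare across columns, so it can help to look at the share of missing values per column instead. Here is a minimal sketch on a tiny made-up frame standing in for the real df; the missing_share helper name is mine and is not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_share(frame):
    """Return the percentage of missing values per column, largest first."""
    share = frame.isnull().mean() * 100
    return share.sort_values(ascending=False)

# Toy stand-in for the users table (values are made up for illustration)
toy = pd.DataFrame({
    "gender": ["-unknown-", "MALE", "FEMALE", "-unknown-"],
    "age": [np.nan, 38.0, 56.0, 42.0],
    "signup_method": ["facebook", "basic", "basic", "facebook"],
})
toy["gender"] = toy["gender"].replace("-unknown-", np.nan)

print(missing_share(toy))
```

On the real data, calling missing_share(df) would give the same information as df.isnull().sum(), just expressed as percentages.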

Users’ Age

df.age.describe()

The maximum age is 2014, which is not possible. It seems that some users entered a year instead of their age. The minimum age of 1 is just as implausible. According to the terms of service, the official minimum age is 18, but in practice this isn’t enforced. If you’re under 18, have a credit or debit card, and are reasonably respectful and mature, you won’t have any problems using Airbnb.

df.loc[df['age'] > 1000]['age'].describe()

df.loc[df['age'] < 18]['age'].describe()

So, we will first correct every age that was entered as a year, then set limits for the age, and set outliers to NaN.

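# Ages above 1000 are treated as years and converted to an age
df_with_year = df['age'] > 1000
df.loc[df_with_year, 'age'] = 2015 - df.loc[df_with_year, 'age']
# Ages outside the 16 to 95 range are set to NaN
df.loc[df.age > 95, 'age'] = np.nan
df.loc[df.age < 16, 'age'] = np.nan
df['age'].describe()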

Looks more reasonable. We can now visualize users’ age.

plt.figure(figsize=(12,6))
sns.distplot(df.age.dropna(), rug=True)
sns.despine()

As expected, the most common age of Airbnb users is between 25 and 40 years old.

Users’ Gender

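plt.figure(figsize=(12,6))
# 'M' is a placeholder for missing gender (not "male"); it is labelled "NaN" on the axis
df["gender"] = df['gender'].fillna('M')
# Fix the bar order so it always matches the tick labels below
sns.countplot(data=df, x='gender', order=['M', 'MALE', 'FEMALE', 'OTHER'])
plt.xticks(np.arange(4), ("NaN", "Male", "Female", "Other"))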
plt.ylabel('Number of users')
plt.title('Users gender distribution')
sns.despine()

The gender of approx. 45% of users is missing. Among users who did state a gender, there is no significant difference between the number of female and male users on Airbnb’s platform.

Travel Destination Country

This is what we will predict for the test data.

plt.figure(figsize=(12,6))
sns.countplot(x='country_destination', data=df)
plt.xlabel('Destination Country')
plt.ylabel('Number of users')
sns.despine()

Nearly 60% of users did not book any destination (NDF). The most booked country is the US (nearly 30% of all users booked in the US), which is not surprising given that all users in the data set are from the US. We can say that US travelers in the data set are more likely to travel within the US.

From now on, we will only study users who made at least one reservation.

plt.figure(figsize=(12,6))
df_without_NDF = df[df['country_destination']!='NDF']
sns.boxplot(y='age' , x='country_destination',data=df_without_NDF)
plt.xlabel('Destination Country')
plt.ylabel('Age of Users')
plt.title('Country destination vs. age')
sns.despine()

There was no significant age difference among users who booked different destinations. However, users who booked Great Britain tended to be a little older than users who booked Spain and Portugal.

Users’ Signup

plt.figure(figsize=(12,6))
df_without_NDF = df[df['country_destination']!='NDF']
sns.countplot(x='signup_method', data = df_without_NDF)
plt.xlabel('Signup method')
plt.ylabel('Number of users')
plt.title('Users sign up method distribution')
sns.despine()

Over 70% of all bookers in the data used the basic email method to sign up with Airbnb, fewer than 30% used their Facebook account, and only approx. 0.26% used their Google account.
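Shares like these can be read off directly with value_counts(normalize=True) rather than estimated from the bar chart. A small sketch on made-up data (the toy frame and its proportions are assumptions for illustration, not the Airbnb numbers):

```python
import pandas as pd

# Toy stand-in for the bookers table: 7 basic, 2 facebook, 1 google
toy_bookers = pd.DataFrame({
    "signup_method": ["basic"] * 7 + ["facebook"] * 2 + ["google"],
})

# normalize=True returns fractions of the total instead of raw counts
shares = toy_bookers["signup_method"].value_counts(normalize=True) * 100
print(shares.round(2))
```

The same call works for any of the categorical columns explored in this section.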

plt.figure(figsize=(12,6))
sns.countplot(x='country_destination', data = df_without_NDF, hue = 'signup_method')
plt.xlabel('Destination Country')
plt.ylabel('Number of users')
plt.title('Users sign up method vs. destinations')
sns.despine()

Among users who made at least one booking, most signed up with Airbnb through the basic email method, no matter which country they were travelling to.

plt.figure(figsize=(12,6))
sns.countplot(x='signup_app', data=df_without_NDF)
plt.xlabel('Signup app')
plt.ylabel('Number of users')
plt.title('Signup app distribution')
sns.despine()

Over 85% of all bookers in the data set signed up on Airbnb’s website, and over 10% signed up with iOS.

plt.figure(figsize=(12,6))
sns.countplot(x='country_destination', data=df_without_NDF, hue='signup_app')
plt.xlabel('Destination Country')
plt.ylabel('Number of users')
plt.title('Destination country based on signup app')
sns.despine()

Travelers to the US used a wider variety of signup apps than travelers to the other countries. To see this more clearly, we remove the US.

df_without_NDF_US = df_without_NDF[df_without_NDF['country_destination']!='US']
plt.figure(figsize=(12,6))
sns.countplot(x='country_destination', data=df_without_NDF_US, hue='signup_app')
plt.xlabel('Destination Country')
plt.ylabel('Number of users')
plt.title('Destination country based on signup app without the US')
sns.despine()

Signing up through the Airbnb website is the most common signup route among bookers of every destination country in the data.

Affiliate

plt.figure(figsize=(12,6))
sns.countplot(x='affiliate_channel', data=df_without_NDF)
plt.xlabel('Affiliate channel')
plt.ylabel('Number of users')
plt.title('Affiliate channel distribution')
sns.despine()

plt.figure(figsize=(20,6))
sns.countplot(x='affiliate_provider', data=df_without_NDF)
plt.xlabel('Affiliate provider')
plt.ylabel('Number of users')
plt.title('Affiliate provider distribution')
sns.despine()

Approx. 65% of bookers signed up directly, without any affiliate program, and over 23% signed up through Google’s affiliate program. However, if you remember, only 0.26% of bookers signed up with their Google accounts.

First

plt.figure(figsize=(12,6))
sns.countplot(x='first_affiliate_tracked', data=df_without_NDF)
plt.ylabel('Number of users')
plt.title('First affiliate tracked distribution')
sns.despine()

plt.figure(figsize=(18,6))
sns.countplot(x='first_device_type', data=df_without_NDF)
plt.xlabel('First device type')
plt.ylabel('Number of users')
plt.title('First device type distribution')
sns.despine()

plt.figure(figsize=(18,6))
sns.countplot(x='country_destination', data=df_without_NDF, hue='first_device_type')
plt.ylabel('Number of users')
plt.title('First device type vs. country destination')
sns.despine()

Around 60% of bookers used Apple devices, particularly in the US.

plt.figure(figsize=(18,6))
sns.countplot(x='country_destination', data=df_without_NDF_US, hue='first_device_type')
plt.ylabel('Number of users')
plt.title('First device type vs. country destination without the US')
sns.despine()

However, outside of the US, Windows desktop is far more common. In particular, there was little difference in usage between Mac desktop and Windows desktop in Canada and Australia.

plt.figure(figsize=(20,6))
sns.countplot(x='first_browser', data=df_without_NDF)
plt.xlabel('First browser')
plt.ylabel('Number of users')
plt.title('First browser distribution')
plt.xticks(rotation=90)
sns.despine()

Almost 30% of all the bookers in the data used the Chrome browser.

Users’ Preferred Language

plt.figure(figsize=(12,6))
sns.countplot(x='language', data=df_without_NDF)
plt.xlabel('language')
plt.ylabel('Number of users')
plt.title('Users language distribution')
sns.despine()

The vast majority of bookers prefer English, which is no surprise given that most of the users in the data set are from the US.

plt.figure(figsize=(12,6))
sns.countplot(x='language', data=df_without_NDF_US)
plt.xlabel('language')
plt.ylabel('Number of users')
plt.title('Users language distribution without the US')
sns.despine()

Without the US, English is still the most preferred language. Interestingly, Chinese is the second most preferred language for bookers in the data.

Dates

To visualize the time series, we first need to convert the date columns to the datetime type.

# Work on a copy so the assignments below do not raise SettingWithCopyWarning
df_without_NDF = df_without_NDF.copy()
df_without_NDF['date_account_created'] = pd.to_datetime(df_without_NDF['date_account_created'])
df_without_NDF['timestamp_first_active'] = pd.to_datetime((df_without_NDF.timestamp_first_active // 1000000), format='%Y%m%d')

plt.figure(figsize=(12,6))
df_without_NDF.date_account_created.value_counts().plot(kind='line', linewidth=1.2)
plt.xlabel('Date')
plt.title('New account created over time')
sns.despine()

plt.figure(figsize=(12,6))
df_without_NDF.timestamp_first_active.value_counts().plot(kind='line', linewidth=1.2)
plt.xlabel('Date')
plt.title('First active date over time')
sns.despine()

The pattern looks similar between date account created and date first active, as it should be. From these two plots, we can see how fast Airbnb has grown between 2014 and 2015.

df_2013 = df_without_NDF[df_without_NDF['timestamp_first_active'] > pd.to_datetime(20130101, format='%Y%m%d')]
df_2013 = df_2013[df_2013['timestamp_first_active'] < pd.to_datetime(20140101, format='%Y%m%d')]
plt.figure(figsize=(12,6))
df_2013.timestamp_first_active.value_counts().plot(kind='line', linewidth=2)
plt.xlabel('Date')
plt.title('First active date 2013')
sns.despine()

Zooming in on 2013, we see that there were several peak months for Airbnb bookers, such as July, August and October, while December is the least active month. In addition, activity follows a regular pattern, with peaks and troughs at roughly equal intervals.

User Session Exploration

sessions = pd.read_csv('sessions.csv')
print("There were", len(sessions.user_id.unique()), "unique user IDs in the session data.")

Figure 28

Action Type

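The action type column contains both NaN and “-unknown-”, so we will change “-unknown-” to NaN.

# Assign the result back rather than relying on inplace=True on a column accessor
sessions['action_type'] = sessions['action_type'].replace('-unknown-', np.nan)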
sessions.action_type.value_counts()

Action

sessions.action.value_counts().head(10)

Action Detail

sessions.action_detail.value_counts().head(10)

Device type

plt.figure(figsize=(18,6))
sns.countplot(x='device_type', data=sessions)
plt.xlabel('device type')
plt.ylabel('Number of sessions')
plt.title('device type distribution')
plt.xticks(rotation=90)
sns.despine()

This affirms the previous finding about users: the most common device types among Airbnb users are Apple products.

Sessions of Users who had made bookings

From the previous analysis, we know which users made bookings through the Airbnb platform, so we want to explore these bookers’ session data. Are they different from non-bookers?

Bookers Action Types

booker_session = pd.merge(df_without_NDF, sessions, how = 'left', left_on = 'id', right_on = 'user_id')
booker_session.action_type.value_counts()

Bookers Top Actions

booker_session.action.value_counts().head(10)

Unfortunately, there was no significant difference in actions between bookers and all users.

Data Preprocessing & Feature Engineering

train_users = pd.read_csv('train_users_2.csv')
test_users = pd.read_csv('test_users.csv')
df = pd.concat((train_users, test_users), axis=0, ignore_index=True)
df.drop('date_first_booking', axis=1, inplace=True)

Date time features

Cast date time columns to a proper date time format.
Split dates into day, week, month, year.
Get the difference (time lag) between the date on which the account was created and the date on which the user was first active.
Lastly, we drop columns we do not need anymore.

datetime_features
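The steps above can be sketched roughly as follows. This is my own minimal reconstruction, not the original script: the weekday_* names match the categorical feature list used later in the article, but the other column names, the time_lag name and the toy rows are assumptions.

```python
import pandas as pd

def add_datetime_features(frame):
    """Sketch of the date-time steps; most output column names are assumptions."""
    out = frame.copy()
    out["date_account_created"] = pd.to_datetime(out["date_account_created"])
    # timestamp_first_active is stored as an integer such as 20090319043255
    out["timestamp_first_active"] = pd.to_datetime(
        out["timestamp_first_active"].astype(str), format="%Y%m%d%H%M%S"
    )
    for prefix, col in [("account_created", "date_account_created"),
                        ("first_active", "timestamp_first_active")]:
        out["day_" + prefix] = out[col].dt.day
        out["weekday_" + prefix] = out[col].dt.weekday
        out["month_" + prefix] = out[col].dt.month
        out["year_" + prefix] = out[col].dt.year
    # Time lag in whole days between first activity and account creation
    out["time_lag"] = (
        out["date_account_created"] - out["timestamp_first_active"].dt.normalize()
    ).dt.days
    # The raw date columns are no longer needed once the parts are extracted
    return out.drop(columns=["date_account_created", "timestamp_first_active"])

# Made-up rows in the same format as the users table
toy = pd.DataFrame({
    "id": ["a", "b"],
    "date_account_created": ["2010-06-28", "2011-05-25"],
    "timestamp_first_active": [20090319043255, 20110525170748],
})
features = add_datetime_features(toy)
print(features)
```

Weekday is encoded as an integer here (Monday is 0), which is why the article later treats the weekday columns as categorical.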

Age features

Convert year to age, set limits to age, and fill NaNs with -1.

age_features
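A minimal sketch of this step, reusing the thresholds from the exploration section (reference year 2015, limits of 16 and 95). Whether the original script uses exactly these values is an assumption, and the clean_age name is hypothetical.

```python
import numpy as np
import pandas as pd

def clean_age(age, reference_year=2015, low=16, high=95):
    """Convert years to ages, blank out implausible ages, fill NaN with -1."""
    age = age.astype(float).copy()
    # Values above 1000 are assumed to be years typed in by mistake
    as_year = age > 1000
    age[as_year] = reference_year - age[as_year]
    # Anything outside [low, high] is treated as missing
    age[(age < low) | (age > high)] = np.nan
    return age.fillna(-1)

toy_age = pd.Series([25, 1949, 2014, 5, 110, np.nan])
cleaned = clean_age(toy_age)
print(cleaned.tolist())
```

Filling with -1 keeps "missing" as its own distinct value, which tree-based models can split on.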

User session action features

There were 10,567,737 recorded sessions from 135,483 unique users in the data.

We will group by user_id and count the number of times each action, action type and action detail is made by each user. Doing groupby.agg(len) is roughly twice as fast as groupby.size(), so I am using groupby.agg(len). For device type, we group by user_id and sum up the total secs_elapsed for each user. Finally, we add a new column named “most_used_device” holding each user’s most used device.

sessions_features
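Here is a rough sketch of what these session features could look like. It is not the original script: I use groupby.size() for brevity where the article uses groupby.agg(len), and the output column naming and the toy rows are my own assumptions.

```python
import pandas as pd

def session_count_features(sessions):
    """Count actions per user and find each user's most used device."""
    parts = []
    for col in ["action", "action_type", "action_detail"]:
        # One column per distinct value, holding how often the user produced it
        counts = sessions.groupby(["user_id", col]).size().unstack(fill_value=0)
        counts.columns = [f"{col}_{value}" for value in counts.columns]
        parts.append(counts)
    # Total seconds per (user, device); the device with the largest total wins
    device_secs = sessions.groupby(["user_id", "device_type"])["secs_elapsed"].sum()
    most_used = device_secs.unstack(fill_value=0).idxmax(axis=1)
    parts.append(most_used.rename("most_used_device"))
    out = pd.concat(parts, axis=1)
    out.index.name = "user_id"
    return out.reset_index()

# Made-up sessions for two users
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2"],
    "action": ["search", "search", "show", "show"],
    "action_type": ["click", "click", "view", "view"],
    "action_detail": ["view_search_results", "view_search_results", "p3", "p3"],
    "device_type": ["Mac Desktop", "iPhone", "iPhone", "Windows Desktop"],
    "secs_elapsed": [100.0, 400.0, 50.0, 30.0],
})
features = session_count_features(toy)
print(features)
```

The result has one row per user, so it can be merged onto the users table by user id.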

User sessions secs_elapsed features

We will extract summary statistics from the secs_elapsed feature for each user: sum, mean, min, max, median and variance. If the sum of secs_elapsed is greater than 86,400 seconds, we consider it a day pause; if it is greater than 300,000 seconds, we consider it a long pause; and if it is shorter than 3,600 seconds, we consider it a short pause. After that, we merge the sessions data with the user data.

The following scripts were borrowed from Kaggle.

secs_elapsed
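To give a feel for these features, here is a minimal sketch rather than the borrowed Kaggle script itself. It counts, per user, how many individual secs_elapsed values cross each threshold; that reading of the pause features is an assumption on my part, and thresholding the per-user sum instead would be a one-line change. The column names and toy data are also made up.

```python
import pandas as pd

def secs_elapsed_features(sessions):
    """Per-user summary statistics and pause counts from secs_elapsed."""
    grouped = sessions.groupby("user_id")["secs_elapsed"]
    stats = grouped.agg(["sum", "mean", "min", "max", "median", "var"])
    stats.columns = ["secs_elapsed_" + name for name in stats.columns]
    # Assumption: a pause is one secs_elapsed value crossing the threshold
    stats["day_pauses"] = grouped.apply(lambda s: (s > 86400).sum())
    stats["long_pauses"] = grouped.apply(lambda s: (s > 300000).sum())
    stats["short_pauses"] = grouped.apply(lambda s: (s < 3600).sum())
    return stats.reset_index()

# Made-up sessions and users
toy_sessions = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "secs_elapsed": [100.0, 90000.0, 400000.0, 10.0, 20.0],
})
toy_users = pd.DataFrame({"id": ["u1", "u2", "u3"]})

features = secs_elapsed_features(toy_sessions)
# Left merge keeps users without any session; their features are NaN
merged = toy_users.merge(features, how="left", left_on="id", right_on="user_id")
print(merged)
```

Users with no session rows end up with NaN in every session feature after the merge, which is why the later steps fill NaN with -1.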

Encoding categorical features

categorical_features = ['gender', 'signup_method', 'signup_flow', 'language','affiliate_channel', 'affiliate_provider', 'first_affiliate_tracked', 'signup_app', 'first_device_type', 'first_browser', 'most_used_device', 'weekday_account_created', 'weekday_first_active']
df = pd.get_dummies(df, columns=categorical_features)

Normalized Discounted Cumulative Gain (NDCG)

NDCG is a normalization of the Discounted Cumulative Gain (DCG) measure. NDCG is a family of ranking measures widely used in practice. In particular, NDCG is very popular for evaluating Web search.

Several excellent papers on NDCG can be found here and here.

Our evaluation metric is NDCG@k where k=5. So we score the top 5 predictions for each user, then take the average over all users.

ndcg_score
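To make the metric concrete, here is a small self-contained sketch for the case where each user has exactly one true destination, as in this problem. It is not the scoring script used in the notebook; the function names and example predictions are mine.

```python
import numpy as np

def ndcg_at_k(predicted, truth, k=5):
    """NDCG@k for one user with a single correct label.

    With exactly one relevant item the ideal DCG is 1, so the score is
    1 / log2(rank + 1) when the truth appears at 1-based `rank`, else 0.
    """
    for rank, label in enumerate(predicted[:k], start=1):
        if label == truth:
            return 1.0 / np.log2(rank + 1)
    return 0.0

def mean_ndcg_at_k(predictions, truths, k=5):
    """Average the per-user NDCG@k scores."""
    return float(np.mean([ndcg_at_k(p, t, k) for p, t in zip(predictions, truths)]))

# Made-up ranked predictions for three users
predictions = [
    ["US", "NDF", "FR", "IT", "GB"],      # truth in first place
    ["NDF", "US", "other", "FR", "IT"],   # truth in second place
    ["NDF", "US", "other", "FR", "IT"],   # truth not in the top 5
]
truths = ["US", "US", "ES"]
score = mean_ndcg_at_k(predictions, truths)
print(round(score, 4))
```

A correct first guess scores 1, a correct second guess about 0.63, and a miss scores 0, so the ordering of the five guesses matters as much as which countries are included.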

Cross Validation with Xgboost

We will use the training data for cross validation.
We will fill NaN with -1.
We set the general parameters, tree booster parameters and learning task parameters as follows, and our evaluation metric is mlogloss (multiclass logloss). A detailed guide on how to set Xgboost parameters can be found on its official website.

xgb_cv
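For readers who want a starting point, the setup described above might look something like the sketch below. The parameter values are placeholders I picked for illustration, not the ones used in the notebook, and the helper names (make_params, run_cv, top_k_labels) are hypothetical.

```python
import numpy as np

def make_params(num_class):
    """Hypothetical settings; the values used in the article are not shown here."""
    return {
        # general parameters
        "booster": "gbtree",
        # tree booster parameters (placeholder values)
        "max_depth": 6,
        "eta": 0.1,
        "subsample": 0.8,
        "colsample_bytree": 0.8,
        # learning task parameters
        "objective": "multi:softprob",
        "num_class": num_class,
        "eval_metric": "mlogloss",
    }

def run_cv(X, y, num_class, nfold=5, num_boost_round=50):
    """Cross-validate on training data; X is a DataFrame, y holds integer class codes."""
    import xgboost as xgb  # imported lazily so the other helpers load without xgboost
    dtrain = xgb.DMatrix(X.fillna(-1), label=y)
    return xgb.cv(
        make_params(num_class), dtrain,
        num_boost_round=num_boost_round, nfold=nfold,
        stratified=True, early_stopping_rounds=10, seed=0,
    )

def top_k_labels(probabilities, classes, k=5):
    """Turn an (n_users, n_classes) probability matrix into the k most likely labels per user."""
    order = np.argsort(-probabilities, axis=1)[:, :k]
    return np.asarray(classes)[order]

# Made-up class probabilities for one user
classes = ["NDF", "US", "FR", "IT", "GB", "ES"]
probs = np.array([[0.50, 0.30, 0.02, 0.07, 0.06, 0.05]])
top5 = top_k_labels(probs, classes)
print(top5.tolist())
```

xgb.cv returns a table with one row per boosting round, including the mean test mlogloss, while top_k_labels shows how class probabilities from a multi:softprob model can be turned into the ranked top 5 needed for NDCG@5.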

From the above scripts, the highest average test NDCG score we have achieved is 0.827582.

Jupyter notebook can be found on Github. Enjoy the rest of the week!