Predicting destination countries for new users of Airbnb

Tanmayee W
Towards Data Science
11 min read · Dec 2, 2018

--

I found this interesting challenge posted by Airbnb on Kaggle 3 years ago. But it is never too late to get your hands dirty with a stimulating data challenge! :)

This Kaggle challenge presents a problem to predict which country will be a new user’s booking destination. For the purpose of this challenge, we will make use of three datasets provided by Airbnb.

  1. Training set
  2. Testing set
  3. Sessions data

Let us understand what a user profile looks like in the training and testing datasets.

There are 16 features to describe each user, which are as follows:

  1. id: user id
  2. date_account_created: the date of account creation
  3. timestamp_first_active: timestamp of the first activity, note that it can be earlier than date_account_created or date_first_booking because a user can search before signing up
  4. date_first_booking: date of first booking
  5. gender
  6. age
  7. signup_method
  8. signup_flow: the page a user came to sign up from
  9. language: international language preference
  10. affiliate_channel: what kind of paid marketing
  11. affiliate_provider: where the marketing is e.g. google, craigslist, other
  12. first_affiliate_tracked: what’s the first marketing the user interacted with before signing up
  13. signup_app
  14. first_device_type
  15. first_browser
  16. country_destination: this is the target variable we are to predict

Similarly, there are 6 features to describe each web session of a user which are as follows:

  1. user_id: to be joined with the column ‘id’ in users table
  2. action
  3. action_type
  4. action_detail
  5. device_type
  6. secs_elapsed

I have divided the analysis for this challenge into two parts —

I. Exploratory analysis

II. Predictive Modelling

Let us first get started with exploring the datasets.

I. Exploratory analysis

# Importing the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

Let us load the data now.

train_users = pd.read_csv("train_users_2.csv")
test_users = pd.read_csv("test_users.csv")
print("There were", train_users.shape[0], "users in the training set and", test_users.shape[0], "users in the test set.")
print("In total there were", train_users.shape[0] + test_users.shape[0], "users.")

Now we will explore all the users in the train and test sets.

df = pd.concat((train_users, test_users), axis = 0, ignore_index = True, sort = True)

Checking the null values.

display(df.isnull().sum())

We see that there are null values in the columns ‘age’, ‘country_destination’, ‘date_first_booking’, ‘first_affiliate_tracked’.

We check the unique values in each column to identify whether any missing data is hidden behind placeholder values. We find that there are unknowns in the columns ‘gender’ and ‘first_browser’.

df.gender.unique()
df.first_browser.unique()

We replace the unknowns with NaNs and drop the column ‘date_first_booking’ as it does not feature any values in our test set.

df.gender.replace("-unknown-", np.nan, inplace=True)
df.first_browser.replace("-unknown-", np.nan, inplace=True)
df.drop("date_first_booking", axis = 1, inplace=True)

Users’ age

Let us now check summary statistics of age to find if there are any apparent anomalies.

df.age.describe()

Maximum age is 2014 which is not possible. It looks like the users have inadvertently filled in the year instead of their age. Also, the minimum age of 1 looks absurd.

df.loc[df['age']>1000]['age'].describe()
df.loc[df['age']<18]['age'].describe()

We correct the mistakenly filled ages and then set age limits (18 — lower bound and 95 — upper bound).

df_with_year = df['age'] > 1000
df.loc[df_with_year, 'age'] = 2015 - df.loc[df_with_year, 'age']
df.loc[df_with_year, 'age'].describe()
df.loc[df.age > 95, 'age'] = np.nan
df.loc[df.age < 18, 'age'] = np.nan
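
The age-cleaning steps above can be sketched on a small toy series. The helper name clean_age and the sample values are hypothetical; the reference year 2015 and the bounds 18 and 95 are the ones used in this analysis.

```python
import numpy as np
import pandas as pd

# Toy ages: plausible values, one birth year, and two out-of-range entries.
ages = pd.Series([25.0, 1985.0, 2.0, 110.0, np.nan, 40.0])

def clean_age(age, reference_year=2015, lower=18, upper=95):
    """Convert birth years to ages, then blank out values outside [lower, upper]."""
    age = age.copy()
    is_year = age > 1000  # entries that look like a year rather than an age
    age[is_year] = reference_year - age[is_year]
    age[(age < lower) | (age > upper)] = np.nan
    return age

cleaned = clean_age(ages)
print(cleaned.tolist())
```

Wrapping the logic in a function makes it easy to apply the same rules to any other age column without repeating the masks.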

Visualizing the users’ ages.

plt.figure(figsize=(12,6))
sns.distplot(df.age.dropna(), rug=True)
plt.show()
Fig.1: Plot visualizing the age bracket of the users

As we see in Fig.1, most of the user base falls in the age bracket of 20 to 40 as expected.

Users’ gender

plt.figure(figsize=(12,6))
sns.countplot(x='gender', data=df)
plt.ylabel('Number of users')
plt.title('Users gender distribution')
plt.show()
Fig.2: Plot visualizing gender distribution of the users

As we see in Fig.2, there is no significant difference between the male and female users.

Travel destination

This is our target variable for the prediction problem.

plt.figure(figsize=(12,6))
sns.countplot(x='country_destination', data=df)
plt.xlabel('Destination Country')
plt.ylabel('Number of users')
plt.show()
Fig.3: Plot showing popular destinations among users

As Fig.3 shows, nearly 60% of the users did not end up booking any trip; these users are represented by NDF. Among the users who did book, the most popular destination is the US (about 30% of all users), which is plausible given that the user population in this problem is from the US. Thus, my inference is that US travelers tend to travel within the US itself.
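
The percentages read off the plot can also be computed directly. Here is a minimal sketch on hypothetical toy labels; on the real data, the series would be the ‘country_destination’ column of the training set.

```python
import pandas as pd

# Hypothetical toy labels standing in for the country_destination column.
destinations = pd.Series(
    ['NDF', 'NDF', 'NDF', 'US', 'US', 'FR', 'NDF', 'US', 'NDF', 'other'])

# Share of every class, including NDF (no booking).
shares = destinations.value_counts(normalize=True)

# Shares among the users who actually booked.
booked = destinations[destinations != 'NDF']
booked_shares = booked.value_counts(normalize=True)

print(shares)
print(booked_shares)
```

Looking at both tables side by side separates two questions: how many users book at all, and where the bookers go.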

We will now analyze only those users who made at least one booking.

plt.figure(figsize=(12,6))
df_without_NDF = df[df['country_destination']!='NDF']
sns.boxplot(y='age' , x='country_destination',data=df_without_NDF)
plt.xlabel('Destination Country')
plt.ylabel('Age of users')
plt.title('Country destination vs. age')
plt.show()
Fig.4: Plot showing choice of destination countries varies across ages

Fig. 4 shows that there is no significant age difference among the users booking trips to the destinations displayed in the graph. However, the users booking trips to Great Britain seem to be relatively older than those booking trips to Spain and the Netherlands.

plt.figure(figsize=(12,6))
sns.countplot(x='signup_method', data = df_without_NDF)
plt.xlabel('Signup Method')
plt.ylabel('Number of users')
plt.title('Users sign up method distribution')
plt.show()
Fig.5: Plot showing sign up method distribution

Fig.5 shows that, out of all the users who made at least one booking, nearly 70% used the basic method (email) to sign up with Airbnb.

plt.figure(figsize=(12,6))
sns.countplot(x='country_destination', data = df_without_NDF, hue = 'signup_method')
plt.xlabel('Destination Country')
plt.ylabel('Number of users')
plt.title('Users sign up method vs. destinations')
plt.legend(loc='upper right')
plt.show()
Fig.6: Plot showing distribution of sign-up methods vs. destinations

Fig. 6 tells us that, among the users who made at least one booking, most used the email method to sign up with Airbnb irrespective of the destination country booked.

plt.figure(figsize=(12,6))
sns.countplot(x='signup_app', data=df_without_NDF)
plt.xlabel('Signup app')
plt.ylabel('Number of users')
plt.title('Signup app distribution')
plt.show()
Fig.7: Plot showing signup app distribution

Most of the bookers signed up using Airbnb’s website. The next largest group signed up using iOS (Fig.7).

plt.figure(figsize=(12,6))
sns.countplot(x='country_destination', data=df_without_NDF, hue='signup_app')
plt.xlabel('Destination Country')
plt.ylabel('Number of users')
plt.title('Destination country based on signup app')
plt.legend(loc = 'upper right')
plt.show()
Fig.8: Plot showing distribution of destination countries based on signup app

We see that the users who booked US destinations signed up through a variety of apps. For the other countries, it looks like users prefer to sign up using Airbnb’s website only (Fig.8).

Affiliate

plt.figure(figsize=(12,6))
sns.countplot(x='affiliate_channel', data=df_without_NDF)
plt.xlabel('Affiliate channel')
plt.ylabel('Number of users')
plt.title('Affiliate channel distribution')
plt.show()
Fig.9: Plot showing distribution of affiliate channels

We see that nearly 70% of the users came to Airbnb’s website directly without any affiliate involvement (Fig.9).

plt.figure(figsize=(18,6))
sns.countplot(x='first_device_type', data=df_without_NDF)
plt.xlabel('First device type')
plt.ylabel('Number of users')
plt.title('First device type distribution')
plt.show()
Fig.10: Plot showing distribution of first device type used by users

From Fig.10, it seems that the most popular device that users use to first access Airbnb’s website is Mac desktop (40%) followed by Windows desktop (30%).

plt.figure(figsize=(18,6))
sns.countplot(x='country_destination', data=df_without_NDF, hue='first_device_type')
plt.ylabel('Number of users')
plt.title('First device type vs. country destination')
plt.legend(loc = 'upper right')
plt.show()
Fig.11: Plot showing distribution of first device type vs. country destination

From Fig.11, irrespective of the destination country booked, Mac Desktop emerges as the clear favourite device for the users to access Airbnb’s website. Its lead seems to be the largest in the US. Windows Desktop follows close on its heels.

df_without_NDF_US = df_without_NDF[df_without_NDF['country_destination']!='US']
plt.figure(figsize=(18,6))
sns.countplot(x='country_destination', data=df_without_NDF_US, hue='first_device_type')
plt.ylabel('Number of users')
plt.title('First device type vs. country destination without the US')
plt.legend(loc = 'upper right')
plt.show()
Fig.12: Plot showing distribution of first device type vs. destinations (excluding US)

Outside of the US too, Apple devices seem more popular than the Windows devices (Fig.12).

plt.figure(figsize=(20,6))
sns.countplot(x='first_browser', data=df_without_NDF)
plt.xlabel('First browser')
plt.ylabel('Number of users')
plt.title('First browser distribution')
plt.xticks(rotation=90)
plt.show()
Fig.13: Plot showing distribution of first browsers used by users

As expected, 30% of the bookers used the Chrome browser to access the Airbnb website. The next favourite seems to be the Safari browser (Fig.13).

Users’ Preferred Language

plt.figure(figsize=(12,6))
sns.countplot(x='language', data=df_without_NDF)
plt.xlabel('language')
plt.ylabel('Number of users')
plt.title('Users language distribution')
plt.show()
Fig.14: Plot showing user language distribution

Almost all the users’ language preference is English. This is reasonable as our population for the problem comes from the US.

Dates

To visualize how bookings vary across months and years, let us first convert the date columns to datetime type.

df_without_NDF['date_account_created'] = pd.to_datetime(df_without_NDF['date_account_created'])
df_without_NDF['timestamp_first_active'] = pd.to_datetime((df_without_NDF.timestamp_first_active)//1000000, format='%Y%m%d')
plt.figure(figsize=(12,6))
df_without_NDF.date_account_created.value_counts().plot(kind='line', linewidth=1.2)
plt.xlabel('Date')
plt.title('New account created over time')
plt.show()
Fig.15: Plot showing trend of user account creation

There is a huge jump in account creation after 2014. Airbnb has grown by leaps and bounds after 2014 (Fig.15).

#Creating a separate dataframe for the year 2013 to analyse it further.
df_2013 = df_without_NDF[df_without_NDF['timestamp_first_active'] > pd.to_datetime(20130101, format='%Y%m%d')]
df_2013 = df_2013[df_2013['timestamp_first_active'] < pd.to_datetime(20140101, format='%Y%m%d')]
plt.figure(figsize=(12,6))
df_2013.timestamp_first_active.value_counts().plot(kind='line', linewidth=2)
plt.xlabel('Date')
plt.title('First active date 2013')
plt.show()
Fig.16: Plot showing trend of first activity of users in 2013

If we look at the month-wise activity of the users, the peak months were July, August and October. On the other hand, the least active month was December (Fig.16).
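
Reading peak months off a daily line plot can be tricky, so a monthly count is a useful cross-check. This is a minimal sketch on hypothetical dates; on the real data, the series would be the ‘timestamp_first_active’ column of df_2013.

```python
import pandas as pd

# Hypothetical first-active dates for a handful of users.
first_active = pd.Series(pd.to_datetime(
    ['2013-07-04', '2013-07-19', '2013-08-02', '2013-12-25', '2013-07-30']))

# Count users per calendar month (1 = January, ..., 12 = December).
monthly_counts = first_active.dt.month.value_counts().sort_index()
busiest_month = monthly_counts.idxmax()

print(monthly_counts)
print("Busiest month:", busiest_month)
```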

Users’ session data exploratory analysis

Loading the sessions data.

sessions = pd.read_csv("sessions.csv")
print("There were", len(sessions.user_id.unique()),"unique user ids in the sessions data.")

Checking null values.

display(sessions.isnull().sum())

Checking unknowns.

sessions.action_type.unique()

We see that there are NaNs and unknown in column ‘action_type’ so we convert all unknowns to NaNs.

sessions.action_type.replace('-unknown-', np.nan, inplace=True)
sessions.action.value_counts().head(10)
sessions.action_type.value_counts().head(10)
sessions.action_detail.value_counts().head(10)
plt.figure(figsize=(18,6))
sns.countplot(x='device_type', data=sessions)
plt.xlabel('Device type')
plt.ylabel('Number of sessions')
plt.title('Device type distribution')
plt.xticks(rotation=90)
plt.show()
Fig.17: Plot showing device type distribution

As we discovered earlier when exploring the users data, the most popular device to access Airbnb seems to be Mac Desktop (Fig. 17). Let us now look at what session behavior looks like for users who made at least one booking.

session_booked = pd.merge(df_without_NDF, sessions, how = 'left', left_on = 'id', right_on = 'user_id')
#Let us see what all columns we have for session_booked
session_booked.columns

Let us look at the top 5 actions that bookers usually do.

session_booked.action.value_counts().head(5)

II. Predictive Modelling

#Importing necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Loading the data sets and doing some data preprocessing along the way.

train_users = pd.read_csv('train_users_2.csv')
test_users = pd.read_csv('test_users.csv')
df = pd.concat((train_users, test_users), axis=0, ignore_index=True)
df.drop('date_first_booking', axis=1, inplace=True)

Feature engineering

df['date_account_created'] = pd.to_datetime(df['date_account_created'])
df['timestamp_first_active'] = pd.to_datetime((df.timestamp_first_active // 1000000), format='%Y%m%d')
# .dt.weekday_name was removed in pandas 1.0; .dt.day_name() returns the same weekday names
df['weekday_account_created'] = df.date_account_created.dt.day_name()
df['day_account_created'] = df.date_account_created.dt.day
df['month_account_created'] = df.date_account_created.dt.month
df['year_account_created'] = df.date_account_created.dt.year
df['weekday_first_active'] = df.timestamp_first_active.dt.day_name()
df['day_first_active'] = df.timestamp_first_active.dt.day
df['month_first_active'] = df.timestamp_first_active.dt.month
df['year_first_active'] = df.timestamp_first_active.dt.year

Calculating the time lag variables.

df['time_lag'] = (df['date_account_created'] - df['timestamp_first_active'])
# Convert the timedelta to a whole number of days
df['time_lag'] = df['time_lag'].dt.days
df.drop( ['date_account_created', 'timestamp_first_active'], axis=1, inplace=True)
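
To make the time lag feature concrete, here is a minimal sketch on a hypothetical two-row frame. The column names match the data set, but the values are made up for illustration.

```python
import pandas as pd

# Hypothetical rows: the first user signed up 4 days after first activity,
# the second signed up on the same day.
toy = pd.DataFrame({
    'date_account_created': ['2014-01-05', '2014-03-01'],
    'timestamp_first_active': [20140101093000, 20140301120000],
})

created = pd.to_datetime(toy['date_account_created'])
# Integer division by 1,000,000 drops the HHMMSS part and leaves YYYYMMDD.
first_active = pd.to_datetime(
    (toy['timestamp_first_active'] // 1000000).astype(str), format='%Y%m%d')

# Difference in whole days between account creation and first activity.
toy['time_lag'] = (created - first_active).dt.days
print(toy['time_lag'].tolist())
```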

Let us fill -1 in place of NaNs in the age column.

df['age'] = df['age'].fillna(-1)

Let us group by user_id and count the number of actions, action_types, and action details for each user.

#First we rename the column user_id as just id to match the train and test columns
sessions.rename(columns = {'user_id': 'id'}, inplace=True)
action_count = sessions.groupby(['id', 'action'])['secs_elapsed'].agg(len).unstack()
action_type_count = sessions.groupby(['id', 'action_type'])['secs_elapsed'].agg(len).unstack()
action_detail_count = sessions.groupby(['id', 'action_detail'])['secs_elapsed'].agg(len).unstack()
device_type_sum = sessions.groupby(['id', 'device_type'])['secs_elapsed'].agg(sum).unstack()
sessions_data = pd.concat([action_count, action_type_count, action_detail_count, device_type_sum],axis=1)
sessions_data.columns = sessions_data.columns.map(lambda x: str(x) + '_count')
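# Device per user: note that max() picks the alphabetically last device_type,
# which is used here as a stand-in for the most used device
sessions_data['most_used_device'] = sessions.groupby('id')['device_type'].max()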
print('There were', sessions.shape[0], 'recorded sessions in which there were', sessions.id.nunique(), 'unique users.')
secs_elapsed = sessions.groupby('id')['secs_elapsed']
# Named aggregation (pandas >= 0.25): each keyword becomes an output column.
# Passing a dict of new names to agg() is no longer supported in pandas 1.0+.
secs_elapsed = secs_elapsed.agg(
    secs_elapsed_sum='sum',
    secs_elapsed_mean='mean',
    secs_elapsed_min='min',
    secs_elapsed_max='max',
    secs_elapsed_median='median',
    secs_elapsed_std='std',
    secs_elapsed_var='var',
    day_pauses=lambda x: (x > 86400).sum(),
    long_pauses=lambda x: (x > 300000).sum(),
    short_pauses=lambda x: (x < 3600).sum(),
    session_length=np.count_nonzero
)
secs_elapsed.reset_index(inplace=True)
sessions_secs_elapsed = pd.merge(sessions_data, secs_elapsed, on='id', how='left')
df = pd.merge(df, sessions_secs_elapsed, on='id', how = 'left')
print('There are', df.id.nunique(), 'users from the entire user data set that have session information.')
#Encoding the categorical features
categorical_features = ['gender', 'signup_method', 'signup_flow', 'language','affiliate_channel', 'affiliate_provider', 'first_affiliate_tracked', 'signup_app', 'first_device_type', 'first_browser', 'most_used_device', 'weekday_account_created', 'weekday_first_active']
df = pd.get_dummies(df, columns=categorical_features)
df.set_index('id', inplace=True)
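
The per-user action counts built above are easier to follow on a small example. This is a sketch on hypothetical session rows; it uses size() and fill_value=0 for readability, whereas the code above counts with agg(len) and leaves missing combinations as NaN.

```python
import pandas as pd

# Hypothetical session log: one row per recorded action.
sessions_toy = pd.DataFrame({
    'id': ['u1', 'u1', 'u1', 'u2', 'u2'],
    'action': ['search', 'search', 'show', 'show', 'book'],
    'secs_elapsed': [10.0, 20.0, 5.0, 7.0, 30.0],
})

# One row per user, one column per action, each cell holding a count.
action_count = (sessions_toy.groupby(['id', 'action'])
                .size()
                .unstack(fill_value=0)
                .add_suffix('_count'))
print(action_count)
```

Each distinct action becomes its own feature column, so a user’s whole browsing history is summarised in a single row that can be merged with the users table on ‘id’.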

Splitting train and test

#Creating train dataset
train_df = df.loc[train_users['id']]
train_df.reset_index(inplace=True)
train_df.fillna(-1, inplace=True)
#Creating target variable for the train dataset
y_train = train_df['country_destination']
train_df.drop(['country_destination', 'id'], axis=1, inplace=True)
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
encoded_y_train = label_encoder.fit_transform(y_train) #Transforming the target variable using labels
#We see that the destination countries have been successfully encoded now
encoded_y_train
#Creating test set
test_df = df.loc[test_users['id']].drop('country_destination', axis=1)
test_df.reset_index(inplace=True)
id_test = test_df['id']
test_df.drop('id', axis=1, inplace=True)
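
For readers new to LabelEncoder, here is a minimal sketch of the round trip between country codes and integer labels. The toy labels are hypothetical; the real target has 12 classes.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy target values.
labels = np.array(['US', 'NDF', 'FR', 'US', 'NDF'])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)  # classes are numbered in sorted order

# inverse_transform maps the integers back to the original country codes.
decoded = encoder.inverse_transform(encoded)

print(encoder.classes_)
print(encoded)
print(decoded)
```

The same fitted encoder has to be kept around, because it is needed later to turn predicted class indices back into country codes.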

Removing duplicates from train and test if any.

duplicate_columns = train_df.columns[train_df.columns.duplicated()]
duplicate_columns

We find that there are duplicate columns, so we remove them.

#Removing the duplicates 
train_df = train_df.loc[:,~train_df.columns.duplicated()]

Similarly, treating the duplicates in the test set.

test_df.columns[test_df.columns.duplicated()]
test_df = test_df.loc[:,~test_df.columns.duplicated()]
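
Duplicate column names most likely arise here because the action, action_type and action_detail columns can share the same value names, which then collide once the count tables are concatenated; that is my assumption rather than something verified above. The de-duplication idiom itself can be sketched on a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame in which the column name 'a' appears twice.
wide = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'a'])

# duplicated() flags every repeat of a name after its first occurrence.
dup_mask = wide.columns.duplicated()
deduped = wide.loc[:, ~dup_mask]

print(list(wide.columns[dup_mask]))
print(deduped)
```

Only the first occurrence of each name is kept, so the values of the later duplicates are dropped rather than merged.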

Training the model and making predictions

We use XGBoost model for the given prediction problem.

import xgboost as xgb
xg_train = xgb.DMatrix(train_df, label=encoded_y_train)
#Specifying the hyperparameters
params = {'max_depth': 10,
'learning_rate': 1,
'n_estimators': 5,
'objective': 'multi:softprob',
'num_class': 12,
'gamma': 0,
'min_child_weight': 1,
'max_delta_step': 0,
'subsample': 1,
'colsample_bytree': 1,
'colsample_bylevel': 1,
'reg_alpha': 0,
'reg_lambda': 1,
'scale_pos_weight': 1,
'base_score': 0.5,
'missing': None,
'nthread': 4,
'seed': 42
}
num_boost_round = 5
print("Train an XGBoost model")
gbm = xgb.train(params, xg_train, num_boost_round)

Making predictions on the test set

y_pred = gbm.predict(xgb.DMatrix(test_df))

Selecting top 5 destinations for each userid.

ids = []  #list of ids
cts = [] #list of countries
for i in range(len(id_test)):
    idx = id_test[i]
    ids += [idx] * 5
    cts += label_encoder.inverse_transform(np.argsort(y_pred[i])[::-1])[:5].tolist()
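
The loop above can also be written in vectorized form. This is a minimal sketch with hypothetical class names, probabilities and user ids standing in for label_encoder.classes_, y_pred and id_test.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the encoder classes, predictions and test ids.
classes = np.array(['FR', 'NDF', 'US', 'other', 'IT', 'GB'])
probs = np.array([[0.05, 0.50, 0.30, 0.08, 0.03, 0.04],
                  [0.10, 0.20, 0.40, 0.15, 0.06, 0.09]])
user_ids = np.array(['u1', 'u2'])

k = 5
# Sort each row by descending probability and keep the first k class indices.
top_k = np.argsort(-probs, axis=1)[:, :k]

# One row per (user, country) pair, most likely country first.
submission = pd.DataFrame({
    'id': np.repeat(user_ids, k),
    'country': classes[top_k].ravel(),
})
print(submission)
```

Sorting all rows at once avoids the Python-level loop, which can matter when there are many test users.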

Creating a dataframe with users and their top five destination countries.

predict = pd.DataFrame(np.column_stack((ids, cts)), columns=['id', 'country'])

Creating the final prediction csv.

predict.to_csv('prediction.csv',index=False)

We have successfully predicted destination countries for Airbnb’s new users!

Next Steps:

  1. Since we know which destination countries are more popular with the users, Airbnb can implement targeted marketing. This means focusing marketing strategies for these specific countries on the users identified in the above exercise.
  2. Airbnb can plan ahead which countries it should scout more to get accommodation providers on board, as it can clearly see the users’ inclination to visit those countries.
  3. Depending on the choice of the destination country of a particular user, Airbnb can possibly think of similar destination countries (in terms of climate, topography, choices of recreation etc.) to offer as other viable travel options to that user.
  4. This analysis offers a detailed picture of what the user profile looks like. Airbnb can leverage it to experiment with new marketing strategies or to brainstorm about what changes in demand could follow in the coming years.

If you enjoyed reading my analysis, do send me claps! :)

Source code is hosted on GitHub.
