Last year's first-place prize, but believe me, it looks the same each year :) (photo source: DSG FB group)

Why You Should Not Code 30 Hours in a Row

Denis Vorotyntsev
9 min read · Oct 16, 2018


In the last week of September, I participated in the final of Data Science Game 2018 (DSG), the most famous international Machine Learning and Data Science competition for talented students. This year, 26 teams from all over the world came to Paris to challenge themselves. The task for the event is always a Kaggle-like competition but limited in time: competitors have only 30 hours to achieve the best score.

My team and I represented Finland at this event. For the first time in the history of DSG, Finland reached the final and got into the top 10 of the final leaderboard. This post is about the task of the competition, the challenges we faced during the event, a description of our solution, and what we learned during our 30-hour coding sprint.

Task and Data

The task was given by Cdiscount, a French e-commerce website with a broad offer: a wide range of products including, among others, cultural goods, high-tech, IT, household appliances, personal appliances, and food. Cdiscount provided competitors with a data set consisting of navigation tracking elements and a data set describing the hierarchy of product categories.

To respect users’ privacy, Cdiscount deliberately placed the challenge at the session level without any kind of user identification. Therefore, all types of personal data had been removed from the data set. In addition, all product-related information had been encrypted to ensure equity between all participants (French-speaking and others).

Description of provided data

We were asked to predict the probability that a purchase action would occur between the last available observation and the end of the session (for a given user, a session is defined as a succession of events; it ends when the time between two successive events exceeds 30 minutes). The task was binary classification; the competition metric was log loss.

The illustration shows how the target was built

Our Solution

At the beginning of the competition, the organizers provided three data sets: train, test, and the hierarchy of products. The last one was a little buggy, and the organizers were trying to fix it during the event, so we decided not to use it at all. The train data set consisted of 1.3M rows, which represented about 130k sessions. In the test set, we had information about 90k sessions. The classes were imbalanced: only 9.4% of all sessions present in the train data set led to a purchase.

The task was quite difficult for a number of reasons: the proposed data set was pretty raw (we could not run a simple fit-predict without some pre-processing), each session contained a sequence component (the data was a sequence of actions from the user), and the vast majority of features were anonymized. However, we had faced similar cases before, so we knew what to do. The data set consisted of several types of features, so we processed each type differently. Let’s discuss them in detail.

Time Features

Duration was the only time feature in the data set. It measured the time from the start of the session to the opening of the current page. The absolute value was not meaningful for models; consequently, we decided to calculate how much time a user spent on the current page and on the previous one. Then, we worked with those two features as numerical ones.
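
A minimal pandas sketch of this idea (the sid and duration column names follow the data description above; everything else is illustrative):

# duration = time from the start of the session to the opening of the current page
df = df.sort_values(['sid', 'duration'])
# time spent on the current page: gap to the next page opening within the same session
df['time_on_page'] = df.groupby('sid')['duration'].shift(-1) - df['duration']
# time spent on the previous page
df['time_on_prev_page'] = df.groupby('sid')['time_on_page'].shift(1)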

String Features

There were a number of string features in this contest. They represented different things: the type of page (and a simplified type of page), ids of various goods on the page or in the user’s basket, filters that the user applied, carousel ids, a search line, and a site id.

The most common way to deal with those features is to combine them all together and work with them as text features (to be precise, work with them using tools that are commonly used in NLP tasks). We applied TF-IDF or Count Vectorizer to the joined strings of each session; ‘or’ in this case means that we picked a single strategy (not both at the same time) for each string column. When we worked with multiple models, we tested various combinations of pre-processing. Also, we used different sets of TF-IDF and Count Vectorizer hyperparameters for each model. We did this mostly because we wanted our models to give us diverse predictions (i.e. models that ‘make errors’ in different places) so that we could stack or blend them later.

Besides, we made a couple of numerical features from those string ones: the number of values in each string column for each session, the number of unique values in each string column for each session, etc.
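
As an illustration, joining one string column into a single ‘document’ per session and vectorizing it might look like this (the column name, min_df, and the rest of the hyperparameters are assumptions):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# one space-joined document per session for a given string column
docs = (df.groupby('sid')['basket_ids']
          .apply(lambda x: ' '.join(x.dropna().astype(str))))

# TF-IDF for one model, Count Vectorizer for another, to get diverse predictions
tfidf_features = TfidfVectorizer(min_df=5).fit_transform(docs)
count_features = CountVectorizer(min_df=5).fit_transform(docs)

# simple numerical features derived from the same strings
n_values = df.groupby('sid')['basket_ids'].count()
n_unique = df.groupby('sid')['basket_ids'].nunique()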

Categorical Features

We decided to take the most frequent value in each string column and work with it as a categorical feature. What’s more, we added the first, first two, last, and last two actions in the form of categories.
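
A sketch of how such categorical features can be derived (column names are illustrative):

# most frequent value of a string column within each session
most_frequent = (df.groupby('sid')['page_type']
                   .agg(lambda x: x.value_counts().index[0]))

# first / last actions of the session as categories
first_action = df.groupby('sid')['page_type'].first()
last_action = df.groupby('sid')['page_type'].last()
first_two = df.groupby('sid')['page_type'].apply(lambda x: '_'.join(x.astype(str).iloc[:2]))
last_two = df.groupby('sid')['page_type'].apply(lambda x: '_'.join(x.astype(str).iloc[-2:]))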

We used several strategies to deal with categorical features: label encoding, one-hot encoding, and mean encoding. Many teams used Bayesian mean encoding (see this Kaggle kernel to get a better understanding of the idea), which tries to regularize the probabilities of rare categories. Based on my experience, it is not the best way to perform mean encoding. The right way is the following: we divide the train data set into k folds; the k-th fold is used for validation, while the other k-1 folds are used to calculate the mean target for all levels of the category, which is then transferred to the test data set and the validation part. Then, the k-1 folds are divided into t folds, and the same operation is repeated (k and t are some numbers; we used k=10 and t=5). With this strategy, we do not need any type of regularization, and we often achieve better scores. However, this approach requires more computational time.

The concept of mean encoding: classes will become more separable with it
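
A rough sketch of this scheme for a single categorical column (the fold counts follow the text; the inner t-fold repetition is only indicated in a comment to keep it short, and the choice to encode the test set with full-train means is an assumption):

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def mean_encode(train, test, col, target, k=10, t=5):
    train, test = train.copy(), test.copy()
    encoded = pd.Series(np.nan, index=train.index)
    for tr_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=42).split(train):
        # means computed on the k-1 folds are transferred to the k-th (validation) fold
        fold_means = train.iloc[tr_idx].groupby(col)[target].mean()
        encoded.iloc[val_idx] = train.iloc[val_idx][col].map(fold_means).values
        # inside these k-1 folds, the same procedure is repeated with t inner folds
        # to produce leak-free encodings for the models trained on them
    train[col + '_mean_enc'] = encoded.values
    # the test set is encoded with means computed on the full train set
    test[col + '_mean_enc'] = test[col].map(train.groupby(col)[target].mean())
    return train, test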

Numerical Features

The best way to deal with numerical features in this competition was to aggregate them by sid and some categorical columns and then calculate various statistics:

import pandas as pd
import numpy as np
from scipy import stats

cat_columns_list = []  # all categorical columns
num_columns_list = []  # all numerical columns

# nearly all sids contained at least one NaN in each numerical feature,
# so we use NaN-aware aggregation functions
functions_dict = {
    'max': np.nanmax,
    'min': np.nanmin,
    'ptp': lambda x: np.nanmax(x) - np.nanmin(x),
    'mean': np.nanmean,
    'std': np.nanstd,
    'mean_div_std': lambda x: np.nanmean(x) / (np.nanstd(x) + 1),  # std == 0 sometimes
    'median': np.nanmedian,
    'skew': lambda x: stats.skew(x, nan_policy='omit'),
}

num_features_dfs = []
for cc in cat_columns_list:
    for fc in num_columns_list:
        for fnc_name, fnc_f in functions_dict.items():
            grb_df = df[['sid', cc, fc]].groupby(['sid', cc]).agg(fnc_f).reset_index()
            new_col_name = fc + '_grouped_by_' + cc + '_agg_' + fnc_name
            grb_df = grb_df.rename({fc: new_col_name}, axis=1)
            piv_df = grb_df.pivot(index='sid', columns=cc, values=new_col_name)
            num_features_dfs.append(piv_df)

Sequence Prediction Feature

As mentioned above, the data for each session contained the sequence of pages and some actions from the user (making a search, adding filters, putting items in the basket, and so on). The most common way to deal with such sequences is to use RNN or LSTM models to predict the target variable, but we did not do that. We thought that because of the varying length of each session (both in terms of the number of pages and the time spent on each page), those models would give us a poor score (disclaimer: we were right; competitors who tried those models reported low scores with this approach).

To capture the sequential nature of the data, we did a completely different thing instead. For each session, we knew the sequence of page types that the user had visited. We built a model that predicted the next page, given the information about the previous n pages (we tried different n; n=3 was the best one according to our validation score). We combined the train and test data sets together and made OOF predictions (Grouped KFold strategy) for each session; then, we calculated the accuracy of our predictions for each session. This feature was used for predicting the target variable, i.e. whether a user would purchase something or not. In the end, this feature was among the most important ones.

The illustration of the new feature distribution
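
A rough sketch of how such a feature can be built (the n=3 window, the grouped OOF scheme, and the per-session accuracy follow the description above; the classifier choice, fold count, and column names such as page_type are assumptions):

import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.model_selection import GroupKFold

n = 3  # number of previous pages used to predict the current one
seq = pd.concat([train_df, test_df], ignore_index=True).sort_values(['sid', 'duration'])
for i in range(1, n + 1):
    seq['prev_%d' % i] = seq.groupby('sid')['page_type'].shift(i)

feature_cols = ['prev_%d' % i for i in range(1, n + 1)]
X = seq[feature_cols].apply(lambda c: c.astype('category').cat.codes)
y = seq['page_type']

# out-of-fold predictions, keeping all rows of one session in the same fold
oof_pred = np.empty(len(seq), dtype=object)
for tr_idx, val_idx in GroupKFold(n_splits=5).split(X, y, groups=seq['sid']):
    clf = LGBMClassifier()
    clf.fit(X.iloc[tr_idx], y.iloc[tr_idx])
    oof_pred[val_idx] = clf.predict(X.iloc[val_idx])

# per-session accuracy of the next-page model becomes a feature for the main task
seq['next_page_correct'] = (oof_pred == y.values).astype(int)
page_predictability = seq.groupby('sid')['next_page_correct'].mean()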

Models

We tried to use feedforward neural networks in this competition, but they gave us a worse score than tree-based models. We employed a net with l layers and n neurons in each layer (we tried various values for l and n; the best result was achieved with l=6 and n=256, but it was still too bad to work with). For categorical features, we tried OHE and embedding layers. After each layer, we used a dropout layer. After the event, other participants told us that NNs did not help much in this task.
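
For the record, a minimal Keras sketch of that kind of architecture (the framework, the dropout rate, and all dimensions are illustrative; our real feature counts are under NDA):

from tensorflow.keras import layers, Model

n_numeric = 100               # hypothetical number of numerical features
cat_cardinalities = [50, 20]  # hypothetical number of levels per categorical feature

num_in = layers.Input(shape=(n_numeric,), name='numeric')
cat_inputs, cat_embeddings = [], []
for i, card in enumerate(cat_cardinalities):
    inp = layers.Input(shape=(1,), name='cat_%d' % i)
    emb = layers.Flatten()(layers.Embedding(card, 8)(inp))
    cat_inputs.append(inp)
    cat_embeddings.append(emb)

x = layers.Concatenate()([num_in] + cat_embeddings)
for _ in range(6):                               # l = 6 hidden layers
    x = layers.Dense(256, activation='relu')(x)  # n = 256 neurons per layer
    x = layers.Dropout(0.3)(x)                   # dropout after each layer
out = layers.Dense(1, activation='sigmoid')(x)

model = Model(inputs=[num_in] + cat_inputs, outputs=out)
model.compile(optimizer='adam', loss='binary_crossentropy')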

In our final solution, we trained several tree-based models on various subsets of features (with different preprocessing strategies). We used CatBoost and LightGBM with hyperparameters that we tuned ourselves. XGBoost was slow to train, and we did not have much time left, so we did not use it. Our final submission was a blend of our top 5 models based on a weighted score of the public leaderboard and validation:

w_score = (log_loss_public * n_samples_public + log_loss_cv * n_samples_cv) / (n_samples_public + n_samples_cv)
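
As an illustration, ranking models by this weighted score and averaging the best ones could look like the snippet below (the simple average and all variable names are assumptions, not our exact blending code):

import numpy as np

def w_score(log_loss_public, log_loss_cv, n_samples_public, n_samples_cv):
    return (log_loss_public * n_samples_public + log_loss_cv * n_samples_cv) / \
           (n_samples_public + n_samples_cv)

# models: a list of dicts, one per trained model (hypothetical structure)
models_sorted = sorted(models, key=lambda m: w_score(m['ll_public'], m['ll_cv'],
                                                     n_samples_public, n_samples_cv))
blend = np.mean([m['test_pred'] for m in models_sorted[:5]], axis=0)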

Learned Lessons and Results

We were in the top 1-5 of the public leaderboard for most of the competition, but then something went wrong. We worked for 30 hours in a row with only 1.5 hours of sleep, which was a clear mistake on our part. We felt extremely tired closer to the deadline and could not do our best: our ideas became shallow, and we introduced a couple of bugs into our code. In the end, we were outperformed by teams who decided to spend time in bed rather than in front of the monitor, or those who worked with a 2-by-2 scheme: two were sleeping while the remaining two were working. I think that was our main fault, and it is mostly why we ended up in 8th position. We lost all our tactical advantage because of a strategic mistake.

Next time, we will pay more attention to time management.

Final leaderboard

Nevertheless, I’m happy with the results. I participated in Data Science Game for the first time, while a couple of teams had already participated twice. The challenge that we faced was tough, but we did not give up. In the end, it made us stronger: personally, I found a number of areas where I could improve my machine learning competition pipelines (robust feature engineering, multiple information outputs during pre-processing) and got a lot of new ideas to try on new data in upcoming challenges (mostly related to feature selection).

In the end, I would like to say thanks to all the organizers of this great event; thanks to Microsoft for free credits on Azure and to Cdiscount for an interesting task. Special thanks to all who participated in DSG 2018; without such strong competitors, it would not have been as interesting to participate.

It was a good battle to be a part of. I will do my best to participate in Data Science Game 2019.

Our team: Dmitrii Mikheev, Denis Vorotyntsev, Margarita Pirumyan, Nikita Ashikhmin

More info on DSG: Data Science Game website

P.S.: I can publish neither the data nor the code due to the NDA policy :(
