Ad Demand Forecast with CatBoost & LightGBM

Predict demand for an online classified ad, Feature engineering

Published in Towards Data Science, Dec 31, 2018

Avito.ru is a Russian classified advertisements website with sections devoted to general goods for sale, jobs, real estate, personals, cars for sale, and services.

It is the most popular classifieds site in Russia and the third biggest classifieds site in the world, after Craigslist and the Chinese website 58.com. In December 2016, it had more than 35 million unique monthly visitors. On average, Avito.ru’s users post more than 500,000 new ads daily, and the site carries about 30 million active listings.

We would like to help Avito.ru predict demand for an online advertisement based on its features, such as category, title, image, its context (geographically where it was posted) and historical demand for similar ads in similar contexts. With this information, Avito can inform its sellers how best to optimize their listings and give some indication of how much interest they should realistically expect to receive.

The Data

The training data contains the following features:

item_id - Ad id.
user_id - User id.
region - Ad region.
city - Ad city.
parent_category_name - Top level ad category as classified by Avito's ad model.
category_name - Fine grain ad category as classified by Avito's ad model.
param_1 - Optional parameter from Avito's ad model.
param_2 - Optional parameter from Avito's ad model.
param_3 - Optional parameter from Avito's ad model.
title - Ad title.
description - Ad description.
price - Ad price.
item_seq_number - Ad sequential number for user.
activation_date - Date ad was placed.
user_type - User type.
image - Id code of image. Ties to a jpg file in train_jpg. Not every ad has an image.
image_top_1 - Avito's classification code for the image.
deal_probability - The target variable. This is the likelihood that an ad actually sold something. This feature’s value can be any float from zero to one.

import numpy as np
import pandas as pd

df = pd.read_csv('train_avito.csv', parse_dates = ['activation_date'])
df.head()

Since deal_probability is our target variable, we would like to look into it in more detail.

import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('display.float_format', '{:.2f}'.format)
plt.figure(figsize = (10, 4))
n, bins, patches = plt.hist(df['deal_probability'], 100, facecolor='blue', alpha=0.75)
plt.xlabel('Deal probability')
plt.xlim(0, 1)
plt.title('Histogram of deal probability')
plt.show();

plt.figure(figsize = (10, 4))
plt.scatter(range(df.shape[0]), np.sort(df['deal_probability'].values))
plt.xlabel('index')
plt.ylabel('deal probability')
plt.title("Deal Probability Distribution")
plt.show();

Almost one million ads have a deal probability of 0, which means they did not sell anything, and a few ads have a probability of 1, which means they did sell something. The remaining ads have a probability between 0 and 1.

CatBoost

Developed by Yandex researchers and engineers, CatBoost is a machine learning algorithm that uses gradient boosting on decision trees. It is available as an open source library. Having heard many good things about CatBoost, we will give it a try.

Feature engineering

Missing values

The following shows the number of missing values in each feature.

null_value_stats = df.isnull().sum()
null_value_stats[null_value_stats != 0]

We decided to fill these missing values with -999. Because this value lies outside the features' observed distributions, the model can easily distinguish missing entries and take them into account.

df.fillna(-999, inplace=True)
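
As a rough illustration of the sentinel idea, here is a minimal sketch on a made-up two-column frame (the values and the toy name are hypothetical, not Avito data). It is a variant of the one-line fill above rather than the same call: numeric columns get -999, while text columns get the string "-999" so that each column keeps a single type.

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and one text column, each with a missing value
toy = pd.DataFrame({
    "price": [100.0, np.nan, 250.0],
    "param_1": ["a", None, "b"],
})

# Split columns by type so each sentinel matches its column's type
num_cols = toy.select_dtypes(include="number").columns
other_cols = toy.columns.difference(num_cols)

filled = toy.copy()
filled[num_cols] = filled[num_cols].fillna(-999)        # numeric sentinel
filled[other_cols] = filled[other_cols].fillna("-999")  # string sentinel

print(filled)
```

The sentinel only works as a "missing" flag if it cannot occur as a real value, so it is worth checking the minimum of each numeric column before choosing it.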

Date time features

We create several new date-time features from the original activation_date column and then drop that column.

df['year'] = df['activation_date'].dt.year
df['day_of_year'] = df['activation_date'].dt.dayofyear
df['weekday'] = df['activation_date'].dt.weekday
# .dt.week was removed in pandas 2.0; isocalendar().week gives the ISO week number
df['week_of_year'] = df['activation_date'].dt.isocalendar().week.astype(int)
df['day_of_month'] = df['activation_date'].dt.day
df['quarter'] = df['activation_date'].dt.quarter
df.drop('activation_date', axis=1, inplace=True)
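
To make the derived columns concrete, here is a minimal sketch of the same decomposition on two made-up dates (a Wednesday and the following Sunday). It assumes pandas 1.1 or later, where the ISO week number is read through dt.isocalendar().week; the toy frame is hypothetical.

```python
import pandas as pd

# Two illustrative dates: Wednesday 2017-03-15 and Sunday 2017-03-19
toy = pd.DataFrame(
    {"activation_date": pd.to_datetime(["2017-03-15", "2017-03-19"])}
)

dates = toy["activation_date"].dt
toy["year"] = dates.year
toy["day_of_year"] = dates.dayofyear
toy["weekday"] = dates.weekday                               # Monday=0 ... Sunday=6
toy["week_of_year"] = dates.isocalendar().week.astype(int)   # ISO week number
toy["day_of_month"] = dates.day
toy["quarter"] = dates.quarter

print(toy.drop(columns="activation_date"))
```

Both dates fall in the same ISO week, so week_of_year is identical for the two rows while weekday differs.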

Our features are of different types: some are numeric, some are categorical, and some are text, such as title and description. We can treat these text features as categorical features.

categorical = ['item_id', 'user_id', 'region', 'city', 'parent_category_name', 'category_name', 'param_1', 'param_2', 'param_3', 'title', 'description', 'item_seq_number', 'user_type', 'image', 'image_top_1']

We do not need to encode the categorical features, because CatBoost supports both numerical and categorical features. However, we do need to identify the indices of the categorical features.

X = df.drop('deal_probability', axis=1)
y = df.deal_probability

def column_index(df, query_cols):
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter=sidx)]

categorical_features_indices = column_index(X, categorical)
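
For readers who prefer a built-in, the name-to-position lookup can also be sketched with pandas' Index.get_indexer. This is an alternative to the helper above, not the method used in the article; the toy columns are made up.

```python
import pandas as pd

# Empty toy frame; only the column order matters here
toy = pd.DataFrame(columns=["price", "region", "city", "year"])

# Positions of the requested columns, in the order requested
cat_idx = toy.columns.get_indexer(["city", "region"])
print(cat_idx)

# Names that are not columns come back as -1
missing_idx = toy.columns.get_indexer(["not_a_column"])
print(missing_idx)
```

One practical difference: get_indexer flags an unknown name with -1, whereas a searchsorted-based lookup returns an insertion point, which can silently point at the wrong column or raise an IndexError.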

CatBoost Model Training

from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=42)
model = CatBoostRegressor(iterations=50, depth=3, learning_rate=0.1, loss_function='RMSE')
model.fit(X_train, y_train, cat_features=categorical_features_indices, eval_set=(X_valid, y_valid), plot=True)

This basic model gives a fair solution, and the training and validation errors stay closely in sync. We will tune the model parameters and features to try to improve the results.

CatBoost Model Tuning

iterations is the maximum number of trees that can be built.
learning_rate scales the gradient step.
depth is the depth of each tree. It can be any integer up to 16 when training on CPU.
We use RMSE as the evaluation metric.
bagging_temperature defines the settings of the Bayesian bootstrap; the higher the value, the more aggressive the bagging. We do not want it high.
We use the overfitting detector so that, if overfitting occurs, CatBoost can stop training earlier than the training parameters dictate. The overfitting detector type is “Iter”.
metric_period is the frequency, in iterations, at which the values of objectives and metrics are calculated.
od_wait is the number of iterations to keep training after the iteration with the optimal metric value. Here the model is considered overfitted and training stops after 100 such iterations.
eval_set is the validation dataset used for the overfitting detector, best-iteration selection, and monitoring changes in the metrics.
use_best_model defaults to True if a validation set is provided (the eval_set parameter is defined) and at least one label value in that set differs from the others. The saved model is then cut off at the iteration with the best validation metric.

CatBoost.py
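
A minimal sketch of how the parameters described above might be wired together. Only RMSE, od_type="Iter" and od_wait=100 come from the text; every other number, the catboost_params name and the fit_tuned_catboost helper are assumptions chosen for illustration. The catboost import sits inside the helper so the parameter dictionary can be read without the library installed.

```python
# Illustrative settings; only RMSE, od_type and od_wait are taken from the text
catboost_params = {
    "iterations": 700,           # assumed: upper bound on the number of trees
    "learning_rate": 0.01,       # assumed
    "depth": 8,                  # assumed: any integer up to 16 on CPU
    "eval_metric": "RMSE",
    "random_seed": 42,           # assumed
    "bagging_temperature": 0.2,  # assumed: kept low, as the text advises
    "od_type": "Iter",           # overfitting detector type
    "od_wait": 100,              # iterations to wait after the best metric value
    "metric_period": 75,         # assumed: evaluate metrics every 75 iterations
}


def fit_tuned_catboost(X_train, y_train, X_valid, y_valid, cat_features):
    """Hypothetical helper: train a CatBoostRegressor with the settings above."""
    from catboost import CatBoostRegressor

    model = CatBoostRegressor(**catboost_params)
    model.fit(
        X_train, y_train,
        cat_features=cat_features,
        eval_set=(X_valid, y_valid),  # drives the overfitting detector
        use_best_model=True,          # keep trees up to the best iteration
    )
    return model
```

With the variables from the earlier split, a call would look like fit_tuned_catboost(X_train, y_train, X_valid, y_valid, categorical_features_indices).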

CatBoost Feature Importance

fea_imp = pd.DataFrame({'imp': model.feature_importances_, 'col': X.columns})
fea_imp = fea_imp.sort_values(['imp', 'col'], ascending=[True, False]).iloc[-30:]
fea_imp.plot(kind='barh', x='col', y='imp', figsize=(10, 7), legend=None)
plt.title('CatBoost - Feature Importance')
plt.ylabel('Features')
plt.xlabel('Importance');

The CatBoost feature importance ranking shows that price has the most significant impact on deal probability. The date-time features, on the other hand, have minimal impact on deal probability.

LightGBM

LightGBM is a fast, distributed, high performance gradient boosting framework based on decision tree algorithms. It is under the umbrella of the DMTK project of Microsoft.

We will train a LightGBM model to predict deal probabilities. We follow a feature engineering process similar to the one used for the CatBoost model and, in addition, encode the categorical features.

Feature Engineering

lightGBM_feature_engineering.py
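
As one possible way to do the extra encoding step, here is a minimal sketch that turns string categories into non-negative integer codes with pandas' category dtype. The toy frame, the "missing" placeholder and the choice of pandas codes (rather than, say, a scikit-learn encoder) are assumptions for illustration, not necessarily what the script above does.

```python
import pandas as pd

# Made-up rows; column names follow the Avito schema
toy = pd.DataFrame({
    "region": ["Moscow", "Kazan", "Moscow", None],
    "user_type": ["Private", "Company", "Private", "Shop"],
    "price": [400.0, 1500.0, 90.0, 30.0],
})
categorical_cols = ["region", "user_type"]

encoded = toy.copy()
for col in categorical_cols:
    # Give missing values their own label, then map each label to an integer
    # code (categories are sorted, so codes follow alphabetical order)
    encoded[col] = encoded[col].fillna("missing").astype("category").cat.codes

print(encoded)
```

Codes assigned this way depend on the labels present, so training and test data should be encoded together or with a shared category list.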

Convert data into LightGBM dataset format. This is mandatory for LightGBM training.

import lightgbm as lgb

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=42)

# LightGBM dataset formatting
lgtrain = lgb.Dataset(X_train, y_train,
                feature_name=feature_names,
                categorical_feature = categorical)
lgvalid = lgb.Dataset(X_valid, y_valid,
                feature_name=feature_names,
                categorical_feature = categorical)

LightGBM Model Training

num_leaves is the main parameter controlling the complexity of the tree model. When tuning num_leaves, we should keep it smaller than 2^(max_depth) (225).
We use max_depth to limit how deep the trees can grow.
For better accuracy, we use a small learning_rate with a large num_iterations.
To speed up training and deal with overfitting, we set feature_fraction=0.6, that is, LightGBM selects 60% of the features before training each tree.
We set verbosity = -1, which suppresses LightGBM's log output except for fatal errors.
early_stopping_rounds = 500: the model trains until the validation score stops improving. The validation score needs to improve at least once every 500 rounds for training to continue.
verbose_eval = 500: the evaluation metric is printed every 500 boosting stages.

lightGBM_model_training.py
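
A minimal sketch of a training call that follows the settings described above. Only feature_fraction=0.6, verbosity=-1 and the two 500-round settings come from the text; the remaining numbers, the lgbm_params name and the fit_lightgbm helper are assumptions. It also assumes a recent LightGBM release, where early stopping and periodic logging are passed as callbacks instead of the early_stopping_rounds and verbose_eval arguments. The lightgbm import sits inside the helper so the parameters can be inspected without the library installed.

```python
# Illustrative settings; only feature_fraction and verbosity are taken from the text
lgbm_params = {
    "objective": "regression",
    "metric": "rmse",
    "max_depth": 15,         # assumed
    "num_leaves": 270,       # assumed; kept well below 2 ** max_depth
    "learning_rate": 0.02,   # assumed: small step paired with many rounds
    "feature_fraction": 0.6,
    "verbosity": -1,
}


def fit_lightgbm(lgtrain, lgvalid, num_boost_round=16000):
    """Hypothetical helper: train on lgtrain with early stopping on lgvalid."""
    import lightgbm as lgb

    return lgb.train(
        lgbm_params,
        lgtrain,
        num_boost_round=num_boost_round,  # assumed upper bound on rounds
        valid_sets=[lgvalid],
        valid_names=["valid"],
        callbacks=[
            lgb.early_stopping(stopping_rounds=500),  # stop after 500 rounds without improvement
            lgb.log_evaluation(period=500),           # print the metric every 500 rounds
        ],
    )
```

The returned booster plays the role of lgb_clf in the feature importance plot that follows.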

LightGBM Feature Importance

fig, ax = plt.subplots(figsize=(10, 7))
lgb.plot_importance(lgb_clf, max_num_features=30, ax=ax)
plt.title("LightGBM - Feature Importance");

It’s not surprising to see price still at the very top. Interestingly, item_seq_number has the most significant impact on deal probability in the LightGBM model, whereas in the CatBoost model it is only the 12th feature.

That’s it for today. The Jupyter notebook can be found on GitHub. Happy New Year!

Reference: Kaggle