Ad Demand Forecast with CatBoost & LightGBM
Predict demand for an online classified ad, Feature engineering
We would like to help Avito.ru predict demand for an online advertisement based on its features, such as category, title, and image, its context (where it was posted geographically), and historical demand for similar ads in similar contexts. With this information, Avito can advise its sellers on how to best optimize their listings and give them some indication of how much interest they should realistically expect to receive.
The Data
The training data contains the following features:
- item_id: Ad id.
- user_id: User id.
- region: Ad region.
- city: Ad city.
- parent_category_name: Top level ad category as classified by Avito's ad model.
- category_name: Fine grain ad category as classified by Avito's ad model.
- param_1: Optional parameter from Avito's ad model.
- param_2: Optional parameter from Avito's ad model.
- param_3: Optional parameter from Avito's ad model.
- title: Ad title.
- description: Ad description.
- price: Ad price.
- item_seq_number: Ad sequential number for user.
- activation_date: Date ad was placed.
- user_type: User type.
- image: Id code of image. Ties to a jpg file in train_jpg. Not every ad has an image.
- image_top_1: Avito's classification code for the image.
- deal_probability: The target variable. This is the likelihood that an ad actually sold something. Its value can be any float from zero to one.
import numpy as np
import pandas as pd

df = pd.read_csv('train_avito.csv', parse_dates = ['activation_date'])
df.head()
Since deal_probability is our target variable, we would like to look into it in more detail.
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('display.float_format', '{:.2f}'.format)
plt.figure(figsize = (10, 4))
n, bins, patches = plt.hist(df['deal_probability'], 100, facecolor='blue', alpha=0.75)
plt.xlabel('Deal probability')
plt.xlim(0, 1)
plt.title('Histogram of deal probability')
plt.show();
plt.figure(figsize = (10, 4))
plt.scatter(range(df.shape[0]), np.sort(df['deal_probability'].values))
plt.xlabel('index')
plt.ylabel('deal probability')
plt.title("Deal Probability Distribution")
plt.show();
Almost one million ads have a deal probability of 0, which means they did not sell anything, and a few ads have a probability of 1, which means they did sell something. The remaining ads have a probability between 0 and 1.
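The counts behind a statement like this can be checked directly rather than read off the plots. The following is a minimal sketch on a toy series standing in for df['deal_probability']; the values are made up for illustration and are not taken from the Avito data.

```python
import pandas as pd

# Toy stand-in for df['deal_probability']; values are illustrative only.
deal_probability = pd.Series([0.0, 0.0, 0.0, 0.12, 0.5, 0.8, 1.0, 0.0])

# Count the two extremes and the share of ads that sold nothing.
n_zero = int((deal_probability == 0).sum())
n_one = int((deal_probability == 1).sum())
share_zero = n_zero / len(deal_probability)

print(f"zero: {n_zero}, one: {n_one}, share of zeros: {share_zero:.2%}")
```

On the real data, the same three lines applied to df['deal_probability'] would give the exact number of ads at each extreme.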
CatBoost
Developed by Yandex researchers and engineers, CatBoost is a machine learning algorithm that uses gradient boosting on decision trees. It is available as an open source library. Having heard many good things about CatBoost, we will give it a try.
Feature engineering
Missing values
Below is the number of missing values in each feature that has any.
null_value_stats = df.isnull().sum()
null_value_stats[null_value_stats != 0]
We decided to fill these missing values with -999. Because this value lies outside the features' distributions, the model can easily distinguish filled-in values from real ones and take them into account.
df.fillna(-999, inplace=True)
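One side effect of a single fillna(-999) is that text columns end up holding a mix of strings and the integer -999. A possible variation, sketched below on a toy frame, fills numeric columns with the number and all other columns with the string "-999" so that each column keeps one type. This split is an assumption made for illustration, not what the code above does; it may help because CatBoost expects categorical values to be strings or integers.

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and one text column, each with a gap.
toy = pd.DataFrame({
    "price": [100.0, np.nan, 250.0],
    "param_1": ["shoes", None, "phones"],
})

SENTINEL = -999
num_cols = toy.select_dtypes(include="number").columns
other_cols = toy.columns.difference(num_cols)

filled = toy.copy()
# Numeric columns get the numeric sentinel, text columns its string form.
filled[num_cols] = filled[num_cols].fillna(SENTINEL)
filled[other_cols] = filled[other_cols].fillna(str(SENTINEL))

print(filled)
```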
Date time features
We create several new date time features from the original activation_date column before dropping it.
df['year'] = df['activation_date'].dt.year
df['day_of_year'] = df['activation_date'].dt.dayofyear
df['weekday'] = df['activation_date'].dt.weekday
# .dt.week was removed in pandas 2.0; isocalendar().week replaces it
df['week_of_year'] = df['activation_date'].dt.isocalendar().week.astype(int)
df['day_of_month'] = df['activation_date'].dt.day
df['quarter'] = df['activation_date'].dt.quarter
df.drop('activation_date', axis=1, inplace=True)
Our features are of different types: some are numeric, some are categorical, and some are text, such as title and description. We can treat these text features as categorical features too.
categorical = ['item_id', 'user_id', 'region', 'city', 'parent_category_name', 'category_name', 'param_1', 'param_2', 'param_3', 'title', 'description', 'item_seq_number', 'user_type', 'image', 'image_top_1']
We will not need to encode the categorical features, because CatBoost supports both numerical and categorical features. However, we do need to identify the indices of the categorical features.
X = df.drop('deal_probability', axis=1)
y = df.deal_probability

def column_index(df, query_cols):
    # Return the positional index of each name in query_cols
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter=sidx)]

categorical_features_indices = column_index(X, categorical)
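The same lookup can also be sketched with pandas' own Index.get_indexer, which returns -1 for any name it cannot find and so makes a misspelled column easy to catch. The toy frame and column list below are placeholders for X and categorical.

```python
import pandas as pd

# Toy stand-ins for X and the categorical column list.
toy_X = pd.DataFrame(columns=["item_id", "price", "region", "title"])
toy_categorical = ["item_id", "region", "title"]

# get_indexer maps each name to its position, or -1 if it is absent.
indices = toy_X.columns.get_indexer(toy_categorical)
if (indices == -1).any():
    raise KeyError("some categorical columns are missing from the frame")

print(indices.tolist())
```

Unlike the searchsorted version, this fails loudly on an unknown column name instead of returning a wrong position.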
CatBoost Model Training
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = CatBoostRegressor(iterations=50, depth=3, learning_rate=0.1, loss_function='RMSE')
model.fit(X_train, y_train, cat_features=categorical_features_indices, eval_set=(X_valid, y_valid), plot=True);
A basic model gives a fair solution, and the training and testing errors are pretty much in sync. We will try tuning the model parameters and features to improve the results.
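To compare training and testing error outside the CatBoost plot, RMSE can be computed by hand. The helper below is a minimal sketch; rmse is a name introduced here for illustration, and the toy arrays stand in for y_valid and model.predict(X_valid).

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values standing in for real labels and predictions.
toy_true = [0.0, 0.0, 1.0, 0.5]
toy_pred = [0.1, 0.0, 0.7, 0.5]

print(round(rmse(toy_true, toy_pred), 4))
```

Calling it once with the training split and once with the validation split gives the two numbers to compare.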
CatBoost Model Tuning
- iterations is the maximum number of trees that can be built when solving machine learning problems.
- learning_rate is used for reducing the gradient step.
- depth is the depth of the tree. Any integer up to 16 when using CPU.
- We calculate RMSE as the metric.
- bagging_temperature defines the settings of the Bayesian bootstrap; the higher the value, the more aggressive the bagging is. We do not want it high.
- We will use the overfitting detector, so if overfitting occurs, CatBoost can stop the training earlier than the training parameters dictate. The type of the overfitting detector is “Iter”.
- metric_period is the frequency of iterations at which to calculate the values of objectives and metrics.
- od_wait: consider the model overfitted and stop training after the specified number of iterations (100) since the iteration with the optimal metric value.
- eval_set is the validation dataset for the overfitting detector, best iteration selection and monitoring metrics’ changes.
- use_best_model=True if a validation set is input (the eval_set parameter is defined) and at least one of the label values of objects in this set differs from the others.
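The parameters above could be wired together roughly as follows. This is a sketch only: apart from od_type "Iter" and od_wait of 100, every value is a placeholder chosen for illustration rather than the setting behind the results in this post, and fit_tuned_catboost is a hypothetical helper name.

```python
# Values other than od_type='Iter' and od_wait=100 are illustrative
# placeholders, not the settings used for the results in this post.
catboost_params = {
    "iterations": 1000,
    "learning_rate": 0.05,
    "depth": 8,                  # any integer up to 16 on CPU
    "eval_metric": "RMSE",
    "bagging_temperature": 0.2,  # kept low on purpose
    "od_type": "Iter",
    "od_wait": 100,
    "metric_period": 50,
    "random_seed": 42,
}

def fit_tuned_catboost(X_train, y_train, X_valid, y_valid, cat_features):
    """Fit a CatBoostRegressor with the overfitting detector enabled."""
    # Imported here so the parameter dict can be inspected without catboost.
    from catboost import CatBoostRegressor

    model = CatBoostRegressor(**catboost_params)
    model.fit(
        X_train, y_train,
        cat_features=cat_features,
        eval_set=(X_valid, y_valid),
        use_best_model=True,
    )
    return model
```

With the variables defined earlier, a call would look like model = fit_tuned_catboost(X_train, y_train, X_valid, y_valid, categorical_features_indices).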
CatBoost Feature Importance
fea_imp = pd.DataFrame({'imp': model.feature_importances_, 'col': X.columns})
fea_imp = fea_imp.sort_values(['imp', 'col'], ascending=[True, False]).iloc[-30:]
fea_imp.plot(kind='barh', x='col', y='imp', figsize=(10, 7), legend=None)
plt.title('CatBoost - Feature Importance')
plt.ylabel('Features')
plt.xlabel('Importance');
The CatBoost feature importance ranking shows that the attribute “price” has the most significant impact on deal probability. On the other hand, date time features have minimal impact on deal probability.
LightGBM
LightGBM is a fast, distributed, high performance gradient boosting framework based on decision tree algorithms. It is under the umbrella of the DMTK project of Microsoft.
We will train a LightGBM model to predict deal probabilities. We will go through a feature engineering process similar to the one we used for the CatBoost model; in addition, we will also encode the categorical features.
Feature Engineering
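The encoding step for the categorical features is not shown here. One possible sketch, assuming plain integer codes from pandas.factorize (a choice made for illustration, not necessarily the encoder used for the results in this post), on a toy frame:

```python
import pandas as pd

# Toy frame; column values are made up for illustration.
toy = pd.DataFrame({
    "region": ["Moscow", "Kazan", "Moscow", "Sochi"],
    "user_type": ["Private", "Company", "Private", "Shop"],
    "price": [100.0, 250.0, 90.0, 400.0],
})
toy_categorical = ["region", "user_type"]

encoded = toy.copy()
for col in toy_categorical:
    # factorize assigns 0, 1, 2, ... in order of first appearance.
    codes, uniques = pd.factorize(encoded[col].astype(str))
    encoded[col] = codes

print(encoded)
```

If a separate test set is scored later, the same mapping has to be applied to it, for example by encoding the train and test frames together.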
Convert data into LightGBM dataset format. This is mandatory for LightGBM training.
import lightgbm as lgb

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=42)

# LightGBM dataset formatting
# feature_names is assumed here to be the column names of X
feature_names = X.columns.tolist()
lgtrain = lgb.Dataset(X_train, y_train,
                      feature_name=feature_names,
                      categorical_feature=categorical)
lgvalid = lgb.Dataset(X_valid, y_valid,
                      feature_name=feature_names,
                      categorical_feature=categorical)
LightGBM Model Training
- num_leaves is the main parameter to control the complexity of the tree model. When tuning num_leaves, we should keep it smaller than 2^(max_depth) (225).
- We use max_depth to limit how deep the tree grows.
- For better accuracy, we use a small learning_rate with a large num_iterations.
- To speed up training and deal with overfitting, we set feature_fraction=0.6, that is, selecting 60% of the features before training each tree.
- Set verbosity = -1 to limit LightGBM's own log output to fatal errors only.
- early_stopping_rounds = 500: the model will train until the validation score stops improving. The validation score needs to improve at least once every 500 rounds to continue training.
- verbose_eval = 500: an evaluation metric is printed every 500 boosting stages.
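Put together, the settings above might look like the sketch below. Only feature_fraction of 0.6, verbosity of -1 and the two values of 500 come from the list; the remaining numbers are placeholders, and train_lgbm is a hypothetical helper name. Older LightGBM releases accept early_stopping_rounds and verbose_eval as arguments to lgb.train, while LightGBM 4.0 and later expect the equivalent callbacks, which is the form used here.

```python
# Values other than feature_fraction=0.6, verbosity=-1 and the two 500s are
# illustrative placeholders, not the settings behind the results in this post.
lgbm_params = {
    "objective": "regression",
    "metric": "rmse",
    "max_depth": 10,
    "num_leaves": 200,        # kept below 2 ** max_depth
    "learning_rate": 0.02,    # small rate, paired with many rounds
    "feature_fraction": 0.6,
    "verbosity": -1,
}

def train_lgbm(lgtrain, lgvalid, num_boost_round=5000):
    """Train with early stopping and periodic metric logging."""
    # Imported here so the parameter dict can be inspected without lightgbm.
    import lightgbm as lgb

    return lgb.train(
        lgbm_params,
        lgtrain,
        num_boost_round=num_boost_round,
        valid_sets=[lgvalid],
        valid_names=["valid"],
        callbacks=[
            lgb.early_stopping(stopping_rounds=500),  # stop after 500 rounds without improvement
            lgb.log_evaluation(period=500),           # print the metric every 500 rounds
        ],
    )
```

With the datasets built earlier, lgb_clf = train_lgbm(lgtrain, lgvalid) would produce the booster that the feature importance plot expects.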
LightGBM Feature Importance
fig, ax = plt.subplots(figsize=(10, 7))
lgb.plot_importance(lgb_clf, max_num_features=30, ax=ax)
plt.title("LightGBM - Feature Importance");
It’s not surprising to see that price still ranks near the top. It is interesting that item_seq_number has the most significant impact on deal probability in the LightGBM model, whereas in the CatBoost model it ranks only 12th.
That’s it for today. The Jupyter notebook can be found on GitHub. Happy New Year!
Reference: Kaggle