In anomaly detection, one of the most tedious problems is dealing with class imbalance. Our role as data scientists is, first, to detect the patterns responsible for abnormal behavior and, second, to develop ad-hoc ML models that overcome the imbalance and return the best possible results.
In this post we take advantage of the imbalance instead. By building a careful pipeline and applying a few tricks, we'll be able to boost the performance of our ML model.
In detail, we analyze a conversion forecasting problem, but the approach is not limited to this task. Similar imbalanced classification problems include fraud detection, churn prediction, remaining life estimation, toxicity detection and so on.
THE DATASET
I found a good benchmark for our analysis on Kaggle, taken from a past competition (Homesite Quote Conversion), which makes everything more realistic and interesting.
This dataset represents the activity of a large number of customers who are interested in buying policies. Each entry corresponds to a potential customer and the QuoteConversion_Flag indicates whether the customer purchased a policy.
The provided features (296 in total) are anonymized and provide a rich representation of the prospective customer and policy. They include specific coverage information, sales information, personal information, property information, and geographic information. Our task is to predict QuoteConversion_Flag.
We have to manage 48,894 conversions against 211,859 non-conversions: about 23% as many positives as negatives, or roughly 18.8% of the 260,753 quotes overall. As you can see, it's a clearly imbalanced problem!

THE VALIDATION STRATEGY
In my opinion, when we face a classification problem like this one, the first thing to do is to reason about temporal dependencies. Here, the temporal pattern is given by the ‘Original_Quote_Date’ column. Keeping time features in mind is important for building a robust validation scheme that reflects reality as closely as possible: we want a model that can forecast future conversions. So the best way to build a strong and stable solution is to split the data into train and test according to the quote date. I sorted the original data by time and used the first 70% (from 2013-01-01 to 2014-08-25) as train and the latest 10% (from 2015-03-09 to 2015-05-18) as test. I deliberately left out the 20% in between (from 2014-08-25 to 2015-03-09) because I want a gap between train and test, and I'm curious to see how well the model forecasts further into the future.
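A minimal sketch of this chronological split, assuming a pandas DataFrame that holds the ‘Original_Quote_Date’ column. The helper name `time_split` and the toy data are hypothetical; only the 70/20/10 proportions come from the description above.

```python
import numpy as np
import pandas as pd

def time_split(df, date_col="Original_Quote_Date", train_frac=0.7, test_frac=0.1):
    """Sort by date, keep the earliest rows as train and the latest rows as test.

    The rows in between are left out on purpose, as a gap between train and test.
    """
    df = df.sort_values(date_col).reset_index(drop=True)
    n = len(df)
    n_train = int(round(n * train_frac))
    n_test = int(round(n * test_frac))
    train = df.iloc[:n_train]
    test = df.iloc[n - n_test:]
    return train, test

# Toy data standing in for the real quotes (hypothetical values).
rng = np.random.default_rng(0)
dates = pd.date_range("2013-01-01", "2015-05-18", freq="D")
toy = pd.DataFrame({
    "Original_Quote_Date": dates[rng.integers(0, len(dates), size=1000)],
    "QuoteConversion_Flag": rng.integers(0, 2, size=1000),
})

train_df, test_df = time_split(toy)
print(len(train_df), len(test_df))
print(train_df["Original_Quote_Date"].max(), test_df["Original_Quote_Date"].min())
```

Because the rows are sorted before slicing, every training quote is dated no later than any test quote, which is the property the validation scheme relies on.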
With a strong validation pipeline, we are already at a good point. What we want to do now is compare two different architectures: a classical one, consisting of a simple cross-validated LogisticRegression, and a more sophisticated one, made of multiple cross-validated LogisticRegressions on which we apply negative downsampling. That is, we randomly sample a portion of the entries from the majority class (no conversion) so that it matches the size of the minority class (conversion).
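Negative downsampling as described above can be sketched in a few lines of pandas. This is an illustration under assumptions: the function name `downsample_majority` and the toy DataFrame are hypothetical, and the minority class is detected from the label counts rather than hard-coded.

```python
import numpy as np
import pandas as pd

def downsample_majority(df, target_col, seed):
    """Return all minority rows plus an equally sized random sample of majority rows."""
    counts = df[target_col].value_counts()
    minority_label = counts.idxmin()
    minority = df[df[target_col] == minority_label]
    majority = df[df[target_col] != minority_label]
    majority_sample = majority.sample(n=len(minority), random_state=seed)
    # sort_index restores the original row order of the kept rows
    return pd.concat([minority, majority_sample]).sort_index()

# Toy imbalanced data with roughly 20% positives (hypothetical values).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "QuoteConversion_Flag": (rng.random(1000) < 0.2).astype(int),
})

balanced_a = downsample_majority(toy, "QuoteConversion_Flag", seed=0)
balanced_b = downsample_majority(toy, "QuoteConversion_Flag", seed=1)
print(balanced_a["QuoteConversion_Flag"].value_counts())
```

Changing the seed keeps every minority row but swaps which majority rows are retained, so each seed yields a slightly different balanced training set.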
THE MODEL
Now that we have introduced everything we need, let's put it to work. I don't focus much on feature engineering and model selection; I prefer to illustrate a general approach that is customizable in every aspect.
First I implement our baseline model. It consists of a cross-validation scheme where, in every fold, I reduce the original features to the first 50 principal components with PCA and fit a LogisticRegression. I use stratified k-fold cross-validation (StratifiedKFold in scikit-learn) because it preserves the class proportions in every fold, which helps with imbalance, and should generalize well on our data. Training with 5 folds, I achieve an average AUC of 0.837 on the out-of-fold splits and a final AUC of 0.773 on test, a clear sign of overfitting on the training data!
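The baseline can be sketched roughly as follows. This is not the original notebook: the data are synthetic (via `make_classification`), the number of PCA components is reduced to fit the toy feature count, the scaling step is assumed to match the second architecture, and the AUC is computed once over all out-of-fold predictions rather than averaged per fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (about 20% positives) standing in for the real features.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           weights=[0.8, 0.2], random_state=0)

skf = StratifiedKFold(n_splits=5)
oof = np.zeros(len(y))

for in_index, oof_index in skf.split(X, y):
    # Scaler and PCA are fitted on the in-fold rows only, so nothing leaks
    # from the held-out rows into the preprocessing.
    model = make_pipeline(
        StandardScaler(),
        PCA(n_components=5, random_state=0),  # the post keeps 50 components
        LogisticRegression(C=0.1, solver="lbfgs", max_iter=1000),
    )
    model.fit(X[in_index], y[in_index])
    oof[oof_index] = model.predict_proba(X[oof_index])[:, 1]

oof_auc = roc_auc_score(y, oof)
print("Out-of-fold AUC:", round(oof_auc, 3))
```

Wrapping the three steps in one pipeline is a design choice for the sketch: it guarantees that the same fitted scaler and PCA are reused when scoring the held-out rows.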

In the same way, the recall value for the minority class is extremely low and unsatisfactory.
Our second architecture is also based on a cross-validation scheme, in which I apply dimensionality reduction and fit LogisticRegressions as before. It differs from the previous case in these aspects:
- I choose a simple K-fold cross-validation procedure without shuffle;
- Before the cross-validation loop, I reduce the initial train dataset by undersampling the majority class (the retained entries are chosen at random);
- When the entire cross-validation procedure ends, I store the results and repeat everything… I'm not mad, I repeat everything (for a given number of times), changing the entries that form the sample of the undersampled class, i.e. I change the sampling seed;
- In the end, I average all the results obtained across seeds and calculate performance (stacking would also suit well).
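import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

# Assumed to be defined earlier (not shown): X_train, y_train, Time_train,
# X_test, features, seeds, n_splits, skf (the fold splitter) and the
# zero-initialized result arrays yoof, pred and AUC.

for s, seed in enumerate(seeds):

    # Keep every conversion (minority class)
    train_pos = X_train.loc[y_train == 1, features].copy()
    train_pos['QuoteConversion_Flag'] = 1
    train_pos['Quote_Date'] = Time_train[y_train == 1]

    # Draw an equally sized random sample of non-conversions (majority class)
    train_neg = X_train.loc[y_train == 0, features].copy()
    train_neg['QuoteConversion_Flag'] = 0
    train_neg['Quote_Date'] = Time_train[y_train == 0]
    train_neg = train_neg.sample(len(train_pos), random_state=seed)

    # Restore chronological order before splitting into folds
    train = pd.concat([train_pos, train_neg]).sort_values('Quote_Date')
    y = train.QuoteConversion_Flag.values
    train = train.drop(['Quote_Date', 'QuoteConversion_Flag'],
                       axis=1).reset_index(drop=True)

    for fold, (in_index, oof_index) in enumerate(skf.split(train, y)):
        print(fold+1, 'FOLD --- SEED', seed)

        scaler = StandardScaler()
        pca = PCA(n_components=50, random_state=seed)

        # Fit scaler and PCA on the in-fold rows only
        y_in, y_oof = y[in_index], y[oof_index]
        X_in = train.iloc[in_index, :]
        X_in = scaler.fit_transform(X_in)
        X_in = pca.fit_transform(X_in)
        X_oof = train.iloc[oof_index, :]
        X_oof = scaler.transform(X_oof)
        X_oof = pca.transform(X_oof)

        model = LogisticRegression(C=0.1, solver="lbfgs", max_iter=1000)
        model.fit(X_in, y_in)

        # Out-of-fold predictions, plus test predictions accumulated over folds
        yoof[oof_index, s] = model.predict_proba(X_oof)[:, 1]
        pred[:, s] += model.predict_proba(
            pca.transform(scaler.transform(X_test[features])))[:, 1]

        print('AUC', roc_auc_score(y_oof, yoof[oof_index, s]))
        AUC[s] += roc_auc_score(y_oof, yoof[oof_index, s])

        del model, pca, scaler

# Average the accumulated test predictions and AUCs over the folds
pred = pred/n_splits
AUC = AUC/n_splits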
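The last step of the list above, averaging the results obtained with different seeds, can be illustrated with a simplified sketch. It assumes synthetic data and, to stay short, fits one model per seed without the inner cross-validation loop and without PCA; the array name `pred` mirrors the listing, everything else is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic imbalanced data (about 20% positives), split into train and test.
X, y = make_classification(n_samples=3000, n_features=20, n_informative=8,
                           weights=[0.8, 0.2], random_state=0)
X_train, y_train = X[:2000], y[:2000]
X_test, y_test = X[2000:], y[2000:]

seeds = [11, 22, 33, 44, 55]
pred = np.zeros((len(X_test), len(seeds)))

pos_idx = np.flatnonzero(y_train == 1)
neg_idx = np.flatnonzero(y_train == 0)

for s, seed in enumerate(seeds):
    # A different random subset of the majority class for every seed
    rng = np.random.default_rng(seed)
    sampled_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    idx = np.sort(np.concatenate([pos_idx, sampled_neg]))

    model = LogisticRegression(C=0.1, solver="lbfgs", max_iter=1000)
    model.fit(X_train[idx], y_train[idx])
    pred[:, s] = model.predict_proba(X_test)[:, 1]

# One AUC per seed, then the AUC of the seed-averaged prediction
per_seed_auc = [roc_auc_score(y_test, pred[:, s]) for s in range(len(seeds))]
blend = pred.mean(axis=1)
blend_auc = roc_auc_score(y_test, blend)
print("Per-seed AUC:", np.round(per_seed_auc, 3))
print("Averaged AUC:", round(blend_auc, 3))
```

A plain mean across seeds is the simplest way to combine the models; a stacked meta-model fitted on the per-seed columns would be the alternative mentioned in the list.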
The average out-of-fold AUC is 0.852 (higher than before), with a corresponding AUC on test of 0.818. The confusion matrix also improves.

This means that we were able to improve performance and reduce overfitting at the same time. These results are possible for two main reasons:
- We took advantage of repeated downsampling, because every change of seed introduces a little bit of diversity in the training data;
- We chose a more realistic cross-validation scheme, because our data follows a temporal pattern that makes the present (train) different from the future (test). K-fold without shuffle reproduces this dependency during training, lowering overfitting.
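The effect of the unshuffled K-fold can be seen on a tiny example. Assuming the rows are already sorted by quote date (here simply numbered 0 to 19), `KFold` with `shuffle=False` holds out contiguous blocks of time, while a shuffled splitter mixes dates across folds. The row count and variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 20  # rows assumed to be already sorted by quote date
X = np.arange(n_rows).reshape(-1, 1)

# Unshuffled K-fold: every held-out block is a contiguous slice of time
ordered_blocks = [oof_index for _, oof_index in
                  KFold(n_splits=5, shuffle=False).split(X)]

# Shuffled K-fold: held-out rows are scattered across the whole period
shuffled_blocks = [oof_index for _, oof_index in
                   KFold(n_splits=5, shuffle=True, random_state=0).split(X)]

for fold, (ordered, shuffled) in enumerate(zip(ordered_blocks, shuffled_blocks)):
    print(fold + 1, "ordered:", ordered, "shuffled:", shuffled)
```

With the ordered splitter, each out-of-fold block covers a distinct period, so the fold scores also show how the model behaves on time windows it has not seen.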
SUMMARY
In this post, I built a valid pipeline to deal with imbalanced data and with differences between train and test. As you can see, this is a common scenario in real-life problems, and it also shows up in a lot of Kaggle competitions. The first thing we can do is define a good validation scheme and try to extract value from sampling techniques. This allows us to achieve better performance and reduce computation time.