Turn and Face the Strange

How to leverage anomaly detection methods to improve your supervised learning

Evie Fowler
Towards Data Science


Photo by Stefan Fluck on Unsplash

Traditional predictive analytics offers two paradigms through which to view most problems: point estimation and classification. Modern data science is largely concerned with the latter, framing many questions in terms of categorization (think of how an insurer might seek to identify which clients will generate high costs, rather than predicting the cost for each client, or how a marketer might care more about which ads will return positive ROI than about the specific ROI of each ad). Accordingly, data scientists have developed, and are familiar with, a number of classification methods, from logistic regression to tree- and forest-based methods to neural nets. There is a hitch, though: many of these methods work best on data with roughly balanced outcome classes, which real-world applications rarely deliver. In this piece, I'll show how to use anomaly detection methods to mitigate the problems that imbalanced outcome classes create in supervised learning.

Let’s imagine that I’m planning a trip out of my home city of Pittsburgh, Pennsylvania. I’m not picky about where I go, but I’d really like to avoid travel hiccups like a canceled, diverted, or even severely delayed flight. A classification model could help me identify which flights are likely to experience problems, and Kaggle has some data that could help me build one.

I begin by reading in my data and developing my own definition of a bad flight: anything canceled, diverted, or arriving 30 minutes or more late.

import pandas as pd
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.ensemble import GradientBoostingClassifier, IsolationForest
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# read in data
airlines2022 = pd.read_csv('myPath/Combined_Flights_2022.csv')
print(airlines2022.shape)
# (4078318, 61)

# subset by my target departure city
airlines2022PIT = airlines2022[airlines2022.Origin == 'PIT']
print(airlines2022PIT.shape)
# (24078, 61)

# combine cancellations, diversions, and 30+ minute delays into one Bad Flight outcome
airlines2022PIT = airlines2022PIT.assign(arrDel30 = airlines2022PIT['ArrDelayMinutes'] >= 30)
airlines2022PIT = (airlines2022PIT
.assign(badFlight = 1 * (airlines2022PIT.Cancelled
+ airlines2022PIT.Diverted
+ airlines2022PIT.arrDel30))
)
print(airlines2022PIT.badFlight.mean())
# 0.15873411412908048

About 15% of flights fall into my “bad flights” category. That’s not low enough for this to count as an anomaly detection problem in the traditional sense, but it is low enough that supervised methods might not perform as well as I’d hope. Nevertheless, I’ll get started by building a simple gradient boosted tree model to predict whether a flight will experience the type of problem I’d like to avoid.

To begin I need to identify which features I’d like to use in my model. For the sake of this example I’ll select just a few promising-looking features to model with; in reality, feature selection is a very important part of any data science project. Most of the features available here are categorical and need to be encoded as part of this data prep stage; the distance between cities needs to be scaled.

# categorize columns by feature type
toFactor = ['Airline', 'Dest', 'Month', 'DayOfWeek'
, 'Marketing_Airline_Network', 'Operating_Airline']
toScale = ['Distance']

# drop fields that don't look helpful for prediction
airlines2022PIT = airlines2022PIT[toFactor + toScale + ['badFlight']]
print(airlines2022PIT.shape)
# (24078, 8)

# split original training data into training and validation sets
train, test = train_test_split(airlines2022PIT
, test_size = 0.2
, random_state = 412)
print(train.shape)
# (19262, 8)
print(test.shape)
# (4816, 8)

# manually scale distance feature
mn = train.Distance.min()
rng = train.Distance.max() - train.Distance.min()
train = train.assign(Distance_sc = (train.Distance - mn) / rng)
test = test.assign(Distance_sc = (test.Distance - mn) / rng)
train.drop('Distance', axis = 1, inplace = True)
test.drop('Distance', axis = 1, inplace = True)

# make an encoder
enc = make_column_transformer(
(OneHotEncoder(min_frequency = 0.025, handle_unknown = 'ignore'), toFactor)
, remainder = 'passthrough'
, sparse_threshold = 0)

# apply it to the training dataset
train_enc = enc.fit_transform(train)

# convert it back to a Pandas dataframe for ease of use
train_enc_pd = pd.DataFrame(train_enc, columns = enc.get_feature_names_out())

# encode the test set in the same way
test_enc = enc.transform(test)
test_enc_pd = pd.DataFrame(test_enc, columns = enc.get_feature_names_out())

The development and tuning of a tree-based model could easily be its own post, so I won’t get into it here. I’ve used the feature importance ratings of an initial model to do some reverse feature selection, and tuned the model from there. The resulting model performs decently at identifying late, canceled, or diverted flights.
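For reference, here is a minimal sketch of how a low-importance list like the one below could be generated. It assumes a first-pass model fit with default settings on all of the encoded features, plus an arbitrary importance cutoff of 0.01; it is a sketch of the idea, not the exact selection and tuning process used for the final model.

# sketch: fit a first-pass model on every encoded feature, then collect
# the features whose importance falls below an assumed cutoff of 0.01
X0 = train_enc_pd.drop('remainder__badFlight', axis = 1)
y0 = train_enc_pd['remainder__badFlight']
gbt0 = GradientBoostingClassifier(random_state = 412)
gbt0.fit(X0, y0)
imp = pd.Series(gbt0.feature_importances_, index = X0.columns)
print(imp[imp < 0.01].sort_values().index.tolist())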

# feature selection - drop low importance terms
lowimp = ['onehotencoder__Airline_Delta Air Lines Inc.'
, 'onehotencoder__Dest_IAD'
, 'onehotencoder__Operating_Airline_AA'
, 'onehotencoder__Airline_American Airlines Inc.'
, 'onehotencoder__Airline_Comair Inc.'
, 'onehotencoder__Airline_Southwest Airlines Co.'
, 'onehotencoder__Airline_Spirit Air Lines'
, 'onehotencoder__Airline_United Air Lines Inc.'
, 'onehotencoder__Airline_infrequent_sklearn'
, 'onehotencoder__Dest_ATL'
, 'onehotencoder__Dest_BOS'
, 'onehotencoder__Dest_BWI'
, 'onehotencoder__Dest_CLT'
, 'onehotencoder__Dest_DCA'
, 'onehotencoder__Dest_DEN'
, 'onehotencoder__Dest_DFW'
, 'onehotencoder__Dest_DTW'
, 'onehotencoder__Dest_JFK'
, 'onehotencoder__Dest_MDW'
, 'onehotencoder__Dest_MSP'
, 'onehotencoder__Dest_ORD'
, 'onehotencoder__Dest_PHL'
, 'onehotencoder__Dest_infrequent_sklearn'
, 'onehotencoder__Marketing_Airline_Network_AA'
, 'onehotencoder__Marketing_Airline_Network_DL'
, 'onehotencoder__Marketing_Airline_Network_G4'
, 'onehotencoder__Marketing_Airline_Network_NK'
, 'onehotencoder__Marketing_Airline_Network_WN'
, 'onehotencoder__Marketing_Airline_Network_infrequent_sklearn'
, 'onehotencoder__Operating_Airline_9E'
, 'onehotencoder__Operating_Airline_DL'
, 'onehotencoder__Operating_Airline_NK'
, 'onehotencoder__Operating_Airline_OH'
, 'onehotencoder__Operating_Airline_OO'
, 'onehotencoder__Operating_Airline_UA'
, 'onehotencoder__Operating_Airline_WN'
, 'onehotencoder__Operating_Airline_infrequent_sklearn']
lowimp = [x for x in lowimp if x in train_enc_pd.columns]
train_enc_pd = train_enc_pd.drop(lowimp, axis = 1)
test_enc_pd = test_enc_pd.drop(lowimp, axis = 1)

# separate potential predictors from outcome
train_x = train_enc_pd.drop('remainder__badFlight', axis = 1); train_y = train_enc_pd['remainder__badFlight']
test_x = test_enc_pd.drop('remainder__badFlight', axis = 1); test_y = test_enc_pd['remainder__badFlight']
print(train_x.shape)
print(test_x.shape)

# (19262, 25)
# (4816, 25)

# build model
gbt = GradientBoostingClassifier(learning_rate = 0.1
, n_estimators = 100
, subsample = 0.7
, max_depth = 5
, random_state = 412)

# fit it to the training data
gbt.fit(train_x, train_y)

# calculate the probability scores for each test observation
gbtPreds1Test = gbt.predict_proba(test_x)[:,1]

# use a custom threshold to convert these to binary scores
obsRate = train_y.mean() # observed rate of bad flights in the training data, used to set the cutoff
gbtThresh = np.percentile(gbtPreds1Test, 100 * (1 - obsRate))
gbtPredsCTest = 1 * (gbtPreds1Test > gbtThresh)

# check accuracy of model
acc = accuracy_score(gbtPredsCTest, test_y)
print(acc)
# 0.7742940199335548

# check lift
topDecile = test_y[gbtPreds1Test > np.percentile(gbtPreds1Test, 90)]
lift = sum(topDecile) / len(topDecile) / test_y.mean()
print(lift)
# 1.8591454794381614

# view confusion matrix (predictions are passed first here, so rows are predicted class and columns are actual class)
cm = (confusion_matrix(gbtPredsCTest, test_y) / len(test_y)).round(2)
print(cm)
# [[0.73 0.11]
# [0.12 0.04]]

But could it be better? Perhaps there is more to be learned about flight patterns using other methods. An isolation forest is a tree-based anomaly detection method. It works by iteratively selecting a random feature from the input dataset, and a random split point along the range of that feature. It continues building a tree this way until each observation in the input dataset has been split into its own leaf. The idea is that anomalies, or data outliers, are different from other observations, and so it is easier to isolate them with this pick and split process. Thus, observations that are isolated with just a few rounds of pick and split are considered anomalous, and those that can’t be separated from their neighbors quickly are not.
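Before applying it to the flight data, here is a small toy illustration of that intuition; the two-dimensional cluster and the single far-away point are made up purely for the example.

# toy example: one extreme point among a tight cluster gets isolated after
# far fewer random splits, so its score_samples value (lower means more
# anomalous) is noticeably more negative than the others
toy_rng = np.random.default_rng(412)
X_toy = np.vstack([toy_rng.normal(0, 1, size = (100, 2)), [[8, 8]]])
toy_isf = IsolationForest(random_state = 412).fit(X_toy)
print(toy_isf.score_samples(X_toy)[-1]) # the far-away point: most negative score
print(toy_isf.score_samples(X_toy)[:3]) # typical points: scores closer to zero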

The isolation forest is an unsupervised method, so it can’t be used to identify a particular kind of anomaly of the data scientist’s own choosing (e.g. canceled, diverted, or very late flights). Still, it can be useful for identifying observations that are different from others in some unspecified way (e.g. flights on which something is different).

# build an isolation forest
isf = IsolationForest(n_estimators = 800
, max_samples = 0.15
, max_features = 0.1
, random_state = 412)

# fit it to the same training data
isf.fit(train_x)

# calculate the anomaly score of each test observation (lower values are more anomalous)
isfPreds1Test = isf.score_samples(test_x)

# use a custom threshold to convert these to binary scores
isfThresh = np.percentile(isfPreds1Test, 100 * (obsRate / 2))
isfPredsCTest = 1 * (isfPreds1Test < isfThresh)

Combining the anomaly scores with the supervised model scores provides additional insight.

# combine predictions, outlier flags, and actual outcomes
comb = pd.concat([pd.Series(gbtPredsCTest), pd.Series(isfPredsCTest), pd.Series(test_y)]
, keys = ['Prediction', 'Outlier', 'badFlight']
, axis = 1)
comb = comb.assign(Correct = 1 * (comb.badFlight == comb.Prediction))

print(comb.mean())
#Prediction 0.159676
#Outlier 0.079942
#badFlight 0.153239
#Correct 0.774294
#dtype: float64

# better accuracy in majority class
print(comb.groupby('badFlight').agg(accuracy = ('Correct', 'mean')))
# accuracy
#badFlight
#0.0 0.862923
#1.0 0.284553

# more bad flights among outliers
print(comb.groupby('Outlier').agg(badFlightRate = ('badFlight', 'mean')))

# badFlightRate
#Outlier
#0 0.148951
#1 0.202597

There are a few things to note here. One is that the supervised model is better at predicting “good” flights than “bad” flights. This is a common dynamic in rare event prediction, and why it’s important to look at metrics like precision and recall on top of simple accuracy. More interesting is the fact that the “bad flight” rate is about 1.4 times as high among flights the isolation forest has classified as anomalous as among the rest (20.3% versus 14.9%). That’s in spite of the fact that the isolation forest is an unsupervised method and is identifying atypical flights in general, rather than flights that are atypical in the particular way I’d like to avoid. This seems like valuable information for the supervised model. The binary outlier flag is already in a good format to use as a predictor in my supervised model, so I will feed it in and see if it improves model performance.
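As a quick aside, precision and recall for the first model can be checked directly from the same predictions; the extra import below is the only piece not already loaded at the top of the script.

# precision and recall for the first supervised model
from sklearn.metrics import precision_score, recall_score
print(precision_score(test_y, gbtPredsCTest)) # share of flagged flights that actually went bad
print(recall_score(test_y, gbtPredsCTest)) # share of bad flights the model flagged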

# build a second model with outlier labels as input features
isfPreds1Train = isf.score_samples(train_x)
isfPredsCTrain = 1 * (isfPreds1Train < isfThresh)

# min-max scale the continuous anomaly score as an alternative input
# (the binary flag is what actually gets fed to the model below)
mn = isfPreds1Train.min(); rng = isfPreds1Train.max() - isfPreds1Train.min()
isfPreds1SCTrain = (isfPreds1Train - mn) / rng
isfPreds1SCTest = (isfPreds1Test - mn) / rng

train_2_x = (pd.concat([train_x, pd.Series(isfPredsCTrain)]
, axis = 1)
.rename(columns = {0:'isfPreds1'}))
test_2_x = (pd.concat([test_x, pd.Series(isfPredsCTest)]
, axis = 1)
.rename(columns = {0:'isfPreds1'}))

# build model
gbt2 = GradientBoostingClassifier(learning_rate = 0.1
, n_estimators = 100
, subsample = 0.7
, max_depth = 5
, random_state = 412)

# fit it to the training data
gbt2.fit(train_2_x, train_y)

# calculate the probability scores for each test observation
gbt2Preds1Test = gbt2.predict_proba(test_2_x)[:,1]

# use a custom threshold to convert these to binary scores
gbtThresh = np.percentile(gbt2Preds1Test, 100 * (1 - obsRate))
gbt2PredsCTest = 1 * (gbt2Preds1Test > gbtThresh)

# check accuracy of model
acc = accuracy_score(gbt2PredsCTest, test_y)
print(acc)
#0.7796926910299004

# check lift
topDecile = test_y[gbt2Preds1Test > np.percentile(gbt2Preds1Test, 90)]
lift = sum(topDecile) / len(topDecile) / test_y.mean()
print(lift)
#1.9138477764819217

# view confusion matrix
cm = (confusion_matrix(gbt2PredsCTest, test_y) / len(test_y)).round(2)
print(cm)
#[[0.73 0.11]
# [0.11 0.05]]

Including outlier status as a predictor in the supervised model does in fact improve its performance, raising top-decile lift from about 1.86 to 1.91 and nudging accuracy up as well. It seems that being “strange” in an undefined way is sufficiently correlated with my desired outcome to provide predictive power.

There are limits to how useful this quirk is, of course. The boost certainly won’t appear in every imbalanced classification problem, and the approach isn’t particularly helpful if explainability is very important to the final product. Still, this alternative framing can provide helpful insights on a variety of classification problems, and is worth a try.
