Automatic Machine Learning in Fraud Detection Using H2O AutoML

Machine Learning Automation in Finance

Yuefeng Zhang, PhD
Towards Data Science


Machine learning has many applications in finance such as security, process automation, loan/insurance underwriting, credit scoring, trading, etc. [1][2]. Financial fraud is one of the major concerns in financial security [1][2]. To fight the increasing risk of financial fraud, machine learning has been actively applied to fraud detection [3][4].

There are many technical challenges in applying machine learning to fraud detection. One of the major difficulties is that the dataset tends to be highly skewed and imbalanced in terms of positive and negative classes [3][4]. As shown in [3], getting decent fraud detection/prediction results in this case typically requires both domain expertise and a large amount of manual work for data exploration, data preprocessing, feature engineering, model selection, training, evaluation, etc.

To address these challenges, H2O [6] provides a user-friendly automatic machine learning module, called AutoML [7], that can be used by non-experts.

In this article, I use the same highly skewed and imbalanced synthetic financial dataset from Kaggle [5] as in [3] to demonstrate how AutoML [7] simplifies machine learning for fraud prediction compared with the machine learning method in [3].

1. Machine Learning without Automation

This section summarizes the key points of the machine learning method without automation in [3] to establish a baseline for comparison with the H2O automatic machine learning method in Section 2.

1.1 Data Exploration

In [3], extensive data exploration and analysis work is performed to understand which data records and features are required, and which of those can be dropped without significant impact on machine learning modeling and fraud prediction. This type of work tends to require domain expertise, as shown in [3]. The major results in [3] can be summarized as follows (a short pandas sketch to verify these findings appears after the list):

  • feature type

The response class isFraud (0-No, 1-Yes) is set only when the feature type has the value CASH_OUT or TRANSFER. Thus the only relevant data records are those with a type of CASH_OUT or TRANSFER [3].

  • feature isFlaggedFraud

The feature isFlaggedFraud is set in only 16 data records/samples [3] out of millions of data records in total, so we can drop this feature without significant impact on the results of modeling and fraud prediction.

  • features nameOrig and nameDest

As pointed out in [3], the features nameOrig and nameDest are meaningless since they don’t encode merchant accounts in the expected way, and thus can be dropped.
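These exploration findings are straightforward to verify directly on the raw data. The snippet below is a minimal pandas sketch, assuming the Kaggle CSV [5] has been downloaded to the path used later in Section 2.1:

import pandas as pd

df = pd.read_csv('./data/PS_20174392719_1491204439457_log.csv')

# The fraud class is a tiny fraction of all records (highly imbalanced data).
print('fraud fraction = {:.4%}'.format(df.isFraud.mean()))

# Fraud occurs only in TRANSFER and CASH_OUT transactions.
print(df.loc[df.isFraud == 1, 'type'].value_counts())

# isFlaggedFraud is set in only a handful of records (16 per [3]),
# so dropping it loses almost no information.
print('isFlaggedFraud count = {}'.format(df.isFlaggedFraud.sum()))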

1.2 Data Preprocessing

The dataset is preprocessed as follows in [3] based on the results of data exploration.

  • Extracting data records of type TRANSFER or CASH_OUT

According to the results of data exploration, fraud only occurs when the transaction type is either TRANSFER or CASH_OUT. Thus only those data records are extracted from the raw dataset for model training and fraud prediction in [3]. In addition, the less useful features nameOrig, nameDest, and isFlaggedFraud are dropped:

X = df.loc[(df.type == 'TRANSFER') | (df.type == 'CASH_OUT')]
X = X.drop(['nameOrig', 'nameDest', 'isFlaggedFraud'], axis = 1)
  • Imputing missing values

As described in [3], a destination account balance of zero is a strong indicator of fraud. Thus the account balance should not be imputed with a statistic value, or with a value drawn from a distribution and then adjusted for the amount transacted, because doing so would make fraudulent transactions appear genuine. To avoid this issue, a destination account balance of 0 is replaced with -1 in [3], which makes it more amenable to machine learning algorithms for fraud detection:

X.loc[(X.oldBalanceDest == 0) & (X.newBalanceDest == 0) & (X.amount != 0), ['oldBalanceDest', 'newBalanceDest']] = -1

In addition, as pointed out in [3], the data also has several transactions with zero balances in the originating account both before and after a non-zero transaction amount. The fraction of such transactions is much smaller for fraudulent transactions (0.3%) than for genuine ones (47%) [3]. Similarly to the treatment of a zero destination account balance, an originating account balance of 0 is replaced with a null value (np.nan) rather than an imputed numeric value, to separate the fraudulent transactions from the genuine ones in [3]:

import numpy as np

X.loc[(X.oldBalanceOrig == 0) & (X.newBalanceOrig == 0) & (X.amount != 0), ['oldBalanceOrig', 'newBalanceOrig']] = np.nan

1.3 Feature Engineering

As described in [3], a zero balance in the destination or originating account can help differentiate between fraudulent and genuine transactions. This motivates the author to create the following two new features recording the balance error in the originating and destination accounts for each transaction [3]. These new features are important for obtaining the best performance from the machine learning algorithm adopted in [3]:

X['errorBalanceOrig'] = X.newBalanceOrig + X.amount - X.oldBalanceOrig
X['errorBalanceDest'] = X.oldBalanceDest + X.amount - X.newBalanceDest

1.4 Model Selection

The first approach considered in [3] for model selection is to balance the imbalanced data by undersampling the majority class before applying a machine learning algorithm. The disadvantage of undersampling is that a model trained in this way may not perform well on real-world, unseen imbalanced data, since almost all of the imbalance information is discarded during model training.
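For illustration only (this is not the code used in [3]), undersampling the majority class could look like the sketch below, where X and Y are assumed to be the feature DataFrame and binary label Series from Section 1.2:

# Keep all fraud records and a same-sized random sample of genuine records,
# discarding the rest of the majority class.
fraud_idx = Y[Y == 1].index
genuine_idx = Y[Y == 0].sample(n = len(fraud_idx), random_state = 5).index
balanced_idx = fraud_idx.union(genuine_idx)
X_under, Y_under = X.loc[balanced_idx], Y.loc[balanced_idx]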

The second approach considered in [3] is to oversample the minority class. The author tried various types of anomaly-detection and supervised learning approaches.
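Oversampling the minority class can be done, for example, with SMOTE from the imbalanced-learn package. This is a generic sketch, not necessarily the specific oversampling technique tried in [3]:

from imblearn.over_sampling import SMOTE

# SMOTE synthesizes new minority-class (fraud) samples by interpolating
# between existing fraud records until both classes are balanced.
# It requires numeric inputs, so the categorical feature type is assumed
# to have been numerically encoded first.
X_over, Y_over = SMOTE(random_state = 5).fit_resample(X, Y)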

As reported in [3], after many experiments the author concluded that the best result is obtained by applying the XGBoost machine learning algorithm to the original, unbalanced dataset.

1.5 Model Training and Evaluation

The dataset is split into two parts as below in [3], 80% for model training and 20% for model testing:

from sklearn.model_selection import train_test_split

# randomState is a fixed random seed defined in [3]
trainX, testX, trainY, testY = train_test_split(X, Y, test_size = 0.2, random_state = randomState)

The hyperparameters of the selected XGBClassifier model are set as follows:

from xgboost import XGBClassifier

# Weight the positive (fraud) class by the ratio of negative to positive samples.
weights = (Y == 0).sum() / (1.0 * (Y == 1).sum())
clf = XGBClassifier(max_depth = 3, scale_pos_weight = weights, n_jobs = 4)

The model training and testing are performed in [3] as below:

probabilities = clf.fit(trainX, trainY).predict_proba(testX)

The AUPRC (Area Under the Precision-Recall Curve), rather than the conventional AUC, is used to evaluate the model performance:

from sklearn.metrics import average_precision_score

print('AUPRC = {}'.format(average_precision_score(testY, probabilities[:, 1])))
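To see why AUPRC is the more informative metric here, it can be printed side by side with the conventional AUC. On data this imbalanced, the AUC is dominated by the correct ranking of the overwhelming majority of genuine transactions and can look deceptively high, while the AUPRC is far more sensitive to performance on the rare fraud class:

from sklearn.metrics import roc_auc_score

# Compare with the ROC AUC on the same test predictions.
print('AUC = {}'.format(roc_auc_score(testY, probabilities[:, 1])))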

2. Automatic Machine Learning

As described in the previous section and in [3], in order to get decent fraud prediction results from highly skewed and imbalanced data, extensive domain knowledge and carefully crafted manual work are required for data exploration, data preprocessing, feature engineering, model selection, training, evaluation, etc.

This section demonstrates how to use H2O AutoML [7] to reduce the amount of manual work through automatic machine learning, including but not limited to the following:

  • automatic data preprocessing (e.g., handling missing data)
  • automatic feature engineering
  • automatic model selection
  • automatic model training
  • automatic model evaluation

H2O [6] is based on a client-server cluster architecture. The H2O server needs to be started before any other activities can begin:

import h2o
from h2o.automl import H2OAutoML
h2o.init()
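By default, h2o.init() starts a local server using all available cores and a default memory allotment. Both can be adjusted through the standard nthreads and max_mem_size options (the values below are illustrative, not the ones used in this article):

# Optional: control the local H2O server's resources at startup.
h2o.init(nthreads = -1, max_mem_size = '4G')  # nthreads = -1 uses all cores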

If the H2O server starts successfully on a local machine, h2o.init() prints a cluster status summary (cluster version, number of nodes, available memory, etc.).

2.1 Data Loading

Once the synthetic financial dataset from Kaggle [5] has been downloaded onto an H2O server machine, the dataset can be loaded onto the H2O server as follows:

df = h2o.import_file('./data/PS_20174392719_1491204439457_log.csv')
df.head(10)

A summary description of the dataset can be obtained as below:

df.describe()

The data type of the response class isFraud is set as categorical (i.e., factor) since it is binary (0-No, 1-Yes):

factorslist = ['isFraud']
df[factorslist] = df[factorslist].asfactor()

2.2 Data Preprocessing

To be comparable with the method in [3], only the data records of type TRANSFER or CASH_OUT are extracted from the original dataset for model training and fraud prediction, and the insignificant features nameOrig, nameDest, and isFlaggedFraud are dropped:

# Note that H2OFrame filtering uses the element-wise | operator, not Python's or.
df1 = df[(df['type'] == 'TRANSFER') | (df['type'] == 'CASH_OUT')]
y = 'isFraud'
x = df1.columns
x.remove(y)
x.remove('nameOrig')
x.remove('nameDest')
x.remove('isFlaggedFraud')

2.3 Model Selection and Training

To be comparable with the method in [3], the extracted dataset is split into two parts as follows, 80% for model training and 20% for model testing:

train, test = df1.split_frame([0.8])
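Note that split_frame produces a random split. For a reproducible split, a seed can also be passed (not done in the original run):

# Reproducible 80/20 split (illustrative; the original run did not set a seed).
train, test = df1.split_frame(ratios = [0.8], seed = 1)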

An H2O AutoML estimator is created with the hyperparameter max_models set to 10, which caps the number of base models to train (stacked ensembles are not counted against this limit):

aml = H2OAutoML(max_models = 10, seed = 1)
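Beyond max_models, H2OAutoML accepts several other useful options. The sketch below shows a few of them with illustrative values (availability of individual options, such as ranking the leaderboard by AUCPR, depends on the H2O version):

# A more constrained AutoML run (illustrative values, not used in this article).
aml = H2OAutoML(max_models = 10,
                max_runtime_secs = 3600,           # overall time budget
                exclude_algos = ['DeepLearning'],  # skip selected algorithm families
                sort_metric = 'AUCPR',             # rank leaderboard by AUPRC
                seed = 1)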

The AutoML training is then launched with default settings as follows:

aml.train(x = x, y = y, training_frame = train)

2.4 Model Evaluation

  • Viewing top list of trained models

Once the H2O AutoML training is done, the corresponding leaderboard can be used to show the list of trained models in decreasing order of AUC (not AUPRC):

lb = aml.leaderboard
lb.head(rows=lb.nrows)

We can see from the leaderboard output that H2O AutoML automatically selected and trained 12 different models, including stacked ensemble models. The leading model is XGBoost_3_AutoML_20191113_110031.

  • Obtaining and evaluating the leading model

The model on the top of the list of trained models can be obtained as follows:

leader_model = aml.leader

Note that this leading model is the best in terms of model training AUC (not AUPRC) scores.

A comprehensive summary of the leading model testing performance can be obtained as follows:

leader_model.model_performance(test)

The performance summary shows that the leading model achieved a testing AUPRC (i.e., pr_auc) score of 0.988.

The following code plots the feature importance of the leading model:

leader_model.varimp_plot()

  • Obtaining and evaluating the model with the best training AUPRC

However, the model testing AUPRC, rather than AUC, is used in [3] for model performance evaluation. In order to make a fair comparison, we need to obtain and evaluate the trained model with the best testing AUPRC. To this end, we can first obtain and evaluate the model with the best training AUPRC, and then compare its testing AUPRC performance with that of the leading model to determine which model should be selected.

As described before, the H2O AutoML leaderboard displays the list of trained models in decreasing order of training AUC, not AUPRC. To find the model with the best training AUPRC, the following code shows the trained models in decreasing order of training AUPRC:

import pandas as pd

# Collect each trained model's training AUPRC and sort in decreasing order.
model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:, 0])
model_auprc_map = {'Model Id': [], 'AUPRC': []}
for mid in model_ids:
    model = h2o.get_model(mid)
    model_auprc_map['Model Id'].append(mid)
    model_auprc_map['AUPRC'].append(model.pr_auc(train=True))
model_auprc_df = pd.DataFrame(model_auprc_map)
model_auprc_df.sort_values(['AUPRC'], ascending=False, inplace=True)
model_auprc_df.head(20)

The output shows that the model with the best training AUPRC (0.937) is XGBoost_2_AutoML_20191113_110031.

The model on the top of the list can be obtained as follows:

best_auprc_model_id = model_auprc_df.iloc[0, 0]
best_auprc_model = h2o.get_model(best_auprc_model_id)

Then, a comprehensive summary of model testing performance can be obtained:

best_auprc_model.model_performance(test)

The performance summary shows that the model with the best training AUPRC (i.e., pr_auc) achieved a testing AUPRC score of 0.975.

The following code plots the feature importance of the model with the best training AUPRC score:

best_auprc_model.varimp_plot()

The testing results show that the leading model (testing AUPRC of 0.988) outperformed the model with the best training AUPRC score (testing AUPRC of 0.975). Thus the leading model should be selected.

2.5 Comparison with Machine Learning without Automation

The major machine learning activities of H2O AutoML [7] and of the machine learning method without automation in [3] can be compared as follows.

As described in Section 1 and [3], domain expertise and a large amount of manual work are required for handling missing data, feature engineering, model selection, training, evaluation, etc. All of those types of work are done automatically in AutoML [7] without human intervention. In addition, the model hyperparameter setting in AutoML [7] is much simpler compared with the machine learning method in [3].

Regarding model testing performance, however, the machine learning method in [3] achieved a higher testing AUPRC score of 0.997 compared with the score of 0.988 with AutoML.

The main advantage of AutoML is that it can be used by non-experts to achieve quite decent fraud detection/prediction results on complicated datasets such as the highly skewed and imbalanced dataset used in this article.

Summary

In this article, as in [3], I used the same highly skewed and imbalanced synthetic financial dataset from Kaggle [5] to demonstrate the capability of H2O AutoML [7] to enable non-experts to apply machine learning to financial fraud detection. This was achieved through automation of data preprocessing, feature engineering, model selection, model training, and model evaluation. A decent model testing AUPRC score of 0.988 was achieved.

As described in Section 2.5, a higher model testing AUPRC score of 0.997 was achieved in [3] by manually crafting the methods of data preprocessing, feature engineering, model selection, training, etc. In such cases it may be justified to favor the user-defined machine learning method over AutoML, depending on business requirements. I noticed that H2O provides a more powerful end-to-end automatic machine learning toolset called H2O Driverless AI [8]. This toolset has a Bring Your Own Recipes capability that enables users to plug-and-play their own methods for data preprocessing, feature engineering, modeling, etc.

A Jupyter notebook with all of the source code in this article is available on GitHub [9].

References

[1]. K. Didur, Machine learning in finance: Why, what & how

[2]. D. Faggella, Machine Learning in Finance — Present and Future Applications

[3]. A. Joshua, Predicting Fraud in Financial Payment Services

[4]. R. Pierre, Detecting Financial Fraud Using Machine Learning: Winning the War Against Imbalanced Data

[5]. Synthetic Financial Datasets For Fraud Detection

[6]. H2O.ai

[7]. H2O AutoML

[8]. H2O Driverless AI

[9]. Y. Zhang, Jupyter notebook in Github

DISCLOSURE STATEMENT: © 2019 Capital One. Opinions are those of the individual author. Unless noted otherwise in this post, Capital One is not affiliated with, nor endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are property of their respective owners.
