The world’s leading publication for data science, AI, and ML professionals.

Machine learning for credit card fraud detection

A Jupyter book for reproducible research


Machine learning for credit card fraud detection is one of those fields where most of the published research is, unfortunately, not reproducible. Real-world transaction data cannot be shared for confidentiality reasons, but we also believe that authors often do not make enough effort to provide their code and make their results reproducible.

We just released the first five chapters of a book on the topic, which aims to take a first step towards improving the reproducibility of research in this domain: https://fraud-detection-handbook.github.io/fraud-detection-handbook. The book is the result of ten years of collaboration between the Machine Learning Group, University of Brussels, Belgium, and Worldline. Forthcoming chapters will address more advanced topics, such as class imbalance, feature engineering, deep learning, and interpretability.

This preliminary release is in Jupyter Book format, making all the experiments and results reproducible, and is available under an open-source license. Sections that include code are Jupyter notebooks, which can be executed either locally or in the cloud using Google Colab or Binder. The source code for the book (which includes the text, pictures, and code) is available on GitHub:

Fraud-Detection-Handbook/fraud-detection-handbook

The intended audience is students and professionals interested in the specific problem of credit card fraud detection from a practical point of view. More generally, we think the book will also be of interest to data practitioners and data scientists dealing with machine learning problems that involve sequential data and/or imbalanced classification.

In this blog post, we summarize the currently published content and its added value to the data science community. The first five chapters bring the following main contributions:

  • Simulator of synthetic transaction data: The book proposes a simulator for transaction data that allows the creation of synthetic transaction datasets of varying complexity. In particular, the simulator makes it possible to vary the degree of class imbalance (low proportion of fraudulent transactions), generates both numerical and categorical variables (with categorical features that have a very high number of possible values), and features time-dependent fraud scenarios.
  • Reproducibility: The book is a Jupyter Book, which allows readers to interactively execute or modify the sections that contain code. Together with the synthetic data generator, this makes all the experiments and results presented in the book reproducible.
  • State-of-the-art review: The book synthesizes the recent surveys on the topic of Machine Learning for credit card fraud detection (ML for CCFD). It highlights the core principles presented in these surveys and summarizes the main challenges of fraud detection systems.
  • Evaluation methodology: A major contribution of the book is a detailed presentation and discussion of the performance metrics and validation methodologies that can be used to assess the efficiency of fraud detection systems.

In the following, we highlight how the book makes it easy to reproduce experiments and to design new ones.

Jupyter book format

With the reproducibility of experiments as a key driver for this book, a Jupyter Book appeared better suited than a traditional printed format.

First, all the sections of this book that include code are Jupyter notebooks, which can be executed in the cloud using Binder or Google Colab via the rocket (launch) icon at the top of each page. Notebooks may also be run offline by cloning the book repository.

Second, the open-source nature of the book – fully available on a public GitHub repository – allows readers to open discussions on the book content thanks to GitHub issues, or to propose amendments or improvements with pull requests.

Transaction data simulator

In the past, the Machine Learning Group (MLG) contributed to narrowing this reproducibility gap by publishing a credit card fraud detection dataset on Kaggle, thanks to its collaboration with Worldline. The dataset has been widely used in the research literature and on Towards Data Science, and is one of the most upvoted and downloaded datasets on Kaggle (more than 8 million views and 300K downloads since 2016).

Nevertheless, this dataset has some limitations, in particular its obfuscated features and limited time span. To address these limitations, the book proposes a simple simulator that generates transaction data in their simplest form. In essence, a payment card transaction consists of an amount paid to a merchant by a customer at a certain time. The six main features that summarise a transaction are therefore:

  1. The transaction ID: a unique identifier for the transaction.
  2. The date and time: the date and time at which the transaction occurs.
  3. The customer ID: the identifier for the customer; each customer has a unique identifier.
  4. The terminal ID: the identifier for the merchant (or, more precisely, the terminal); each terminal has a unique identifier.
  5. The transaction amount: the amount of the transaction.
  6. The fraud label: a binary variable, with value 0 for a legitimate transaction and value 1 for a fraudulent transaction.

These features are referred to as TRANSACTION_ID, TX_DATETIME, CUSTOMER_ID, TERMINAL_ID, TX_AMOUNT, and TX_FRAUD. The simulator generates transaction tables, as illustrated below.
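For illustration, such a transaction table can be represented as a pandas DataFrame. The values below are made up for the sake of the example and are not taken from the book's datasets:

```python
import pandas as pd

# Illustrative (made-up) transaction table with the six baseline features
transactions = pd.DataFrame(
    {
        "TRANSACTION_ID": [0, 1, 2],
        "TX_DATETIME": pd.to_datetime(
            ["2018-04-01 00:00:31", "2018-04-01 00:02:10", "2018-04-01 00:07:56"]
        ),
        "CUSTOMER_ID": [596, 4961, 2],
        "TERMINAL_ID": [3156, 3412, 1365],
        "TX_AMOUNT": [57.16, 81.51, 146.00],
        "TX_FRAUD": [0, 0, 0],  # 0 = legitimate, 1 = fraudulent
    }
)

print(transactions)
```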

The details of the simulator are described in Chapter 3, Section 2.

Baseline simulated dataset

A baseline simulated dataset, generated using the simulator, is used throughout the book to experimentally assess machine learning techniques and methodologies. It highlights most of the issues that practitioners of fraud detection face with real-world data. In particular, it features class imbalance (less than 1% of fraudulent transactions), a mix of numerical and categorical features (with categorical features involving a very large number of values), non-trivial relationships between features, and time-dependent fraud scenarios. This baseline dataset contains around 1.8 million transactions, involves 5,000 customers and 10,000 terminals, and spans a period of 183 days (6 months of transactions).

Given that date/time and categorical features typically cannot be used directly as inputs to prediction models, feature engineering is used to transform the baseline features into numerical ones, using methods that include binary encoding, aggregation, and frequency encoding. After preprocessing, the dataset contains fifteen input variables. It is worth noting that this preprocessing is qualitatively similar to the one applied to the Kaggle dataset. A major added value of the simulated dataset is that users may now start from the raw transaction data, which contain the customer and terminal IDs, and explore other feature engineering methods.
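To give a flavor of these transformations, here is a simplified sketch of binary encoding, frequency encoding, and aggregation. The function and column names below are ours, chosen for illustration; the book's actual feature engineering (Chapter 3) is more elaborate, for instance using sliding time windows for the aggregates:

```python
import pandas as pd

def add_datetime_features(df):
    """Binary encoding of the transaction date/time."""
    df = df.copy()
    # 1 if the transaction occurs on a weekend, 0 otherwise
    df["TX_DURING_WEEKEND"] = (df["TX_DATETIME"].dt.weekday >= 5).astype(int)
    # 1 if the transaction occurs at night (here, before 7am), 0 otherwise
    df["TX_DURING_NIGHT"] = (df["TX_DATETIME"].dt.hour <= 6).astype(int)
    return df

def add_id_features(df):
    """Frequency encoding of the terminal ID and a customer spending aggregate."""
    df = df.copy()
    # Frequency encoding: number of transactions seen at each terminal
    df["TERMINAL_ID_FREQ"] = df.groupby("TERMINAL_ID")["TRANSACTION_ID"].transform("count")
    # Aggregation: average amount spent by each customer
    df["CUSTOMER_AVG_AMOUNT"] = df.groupby("CUSTOMER_ID")["TX_AMOUNT"].transform("mean")
    return df
```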

The baseline datasets are made available on GitHub, allowing their reuse by simply cloning the corresponding repositories.

Standard template for experiments

Functions that are used recurrently in the book, such as loading data, fitting and assessing prediction models, or plotting results, have been gathered in a single notebook, available in the reference section (called shared_functions).

The setting up of an environment for running experiments is as simple as:

  1. Downloading and including the shared functions and,
  2. Cloning the repository containing simulated data.

This is achieved with the following lines of code:
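The original gist is not reproduced here, but the two setup steps can be sketched as follows. The exact file path of the shared functions and the name of the data repository are assumptions based on the book's GitHub organization; check the book's reference section for the actual locations:

```python
import subprocess

# Assumed locations (based on the Fraud-Detection-Handbook GitHub organization)
SHARED_FUNCTIONS_URL = (
    "https://raw.githubusercontent.com/Fraud-Detection-Handbook/"
    "fraud-detection-handbook/main/Chapter_References/shared_functions.py"
)
DATA_REPOSITORY = "https://github.com/Fraud-Detection-Handbook/simulated-data-transformed"

def setup_environment():
    """Download the shared functions and clone the simulated-data repository."""
    subprocess.run(["curl", "-sO", SHARED_FUNCTIONS_URL], check=True)
    subprocess.run(["git", "clone", DATA_REPOSITORY], check=True)

# In a notebook, the shared functions can then be loaded with:
#   %run shared_functions.py
```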

Designing a baseline fraud detection system

The design of a baseline fraud detection system typically consists of three main steps:

  1. Defining a training set (historical data) and a test set (new data). The training set is the subset of transactions that are used for training the prediction model. The test set is the subset of transactions that are used to assess the performance of the prediction model.
  2. Training a prediction model: This step consists of using the training set to find a prediction model able to predict whether a transaction is genuine or fraudulent. For this task, we rely on the Python sklearn library, which provides easy-to-use functions to train prediction models.
  3. Assessing the performance of the prediction model: The performance of the prediction model is assessed using the test set (new data).

Data loading and splitting

Let us first start with the loading of data and their splitting into a training and test set. This can be achieved with:
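The original code snippet is not reproduced here; a simplified sketch of the split is given below. The book's actual shared function handles the delay period more carefully (it accounts for cards detected as compromised during that period), whereas this version only partitions transactions by date:

```python
import pandas as pd

def get_train_test_split(transactions_df, start_date,
                         train_days=7, delay_days=7, test_days=7):
    """Split transactions into a training week and a test week,
    setting aside a delay period in between (simplified sketch)."""
    start = pd.Timestamp(start_date)
    train_end = start + pd.Timedelta(days=train_days)
    delay_end = train_end + pd.Timedelta(days=delay_days)
    test_end = delay_end + pd.Timedelta(days=test_days)

    # First week: training set
    train_df = transactions_df[
        (transactions_df["TX_DATETIME"] >= start)
        & (transactions_df["TX_DATETIME"] < train_end)
    ]
    # Third week: test set (the delay period in between is discarded)
    test_df = transactions_df[
        (transactions_df["TX_DATETIME"] >= delay_end)
        & (transactions_df["TX_DATETIME"] < test_end)
    ]
    return train_df, test_df
```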

The code above loads three weeks of data. The first week is used as the training set and the last week as the test set. The second week, referred to as the delay period, is set aside. The figure below plots the number of transactions per day, the number of fraudulent transactions per day, and the number of fraudulent cards per day. Note that the number of transactions per day is around 10,000 (in the figure, the daily transaction counts are divided by 50 so that the fraud counts remain visible on the same scale). The dataset contains on average 0.7% fraudulent transactions.

Model training and assessment

The training and assessment of a prediction model can then be achieved by (1) defining the input and output features, (2) training a model using any classifier provided by the Python sklearn library, and (3) assessing its performance.

Using the shared functions, the implementation comes down to a few lines of code, as illustrated in the following gist:
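The original gist is not reproduced here; the sketch below conveys the idea. It uses a shallow decision tree and the AUC ROC as an example metric, and a hypothetical helper name of our own, whereas the book's shared functions also report other metrics discussed in its evaluation chapter:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

def fit_and_assess(train_df, test_df, input_features, output_feature="TX_FRAUD"):
    """Train a classifier on the training set and return its AUC ROC
    on the test set (simplified sketch)."""
    # (1) Input and output features, (2) model training
    classifier = DecisionTreeClassifier(max_depth=2, random_state=0)
    classifier.fit(train_df[input_features], train_df[output_feature])
    # (3) Assessment on the test set: probability of fraud for each transaction
    predictions = classifier.predict_proba(test_df[input_features])[:, 1]
    return roc_auc_score(test_df[output_feature], predictions)
```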

The resulting decision tree is displayed in the notebook.

The example above is covered in greater detail in this notebook, which also provides implementations of fraud detection systems using logistic regression, random forests, and gradient boosting prediction models. Binder or Google Colab may be used to reproduce the results and explore your own modeling strategies.

Conclusion

We hope this preliminary release will be of interest to practitioners working on fraud detection, and we welcome feedback and suggestions for improvement.

Yann-Aël Le Borgne and Gianluca Bontempi

Acknowledgments

This book is the result of ten years of collaboration between the Machine Learning Group, University of Brussels, Belgium and Worldline. The collaboration was made possible thanks to Innoviris, the Brussels Region Institute for Research and Innovation, through a series of grants which started in 2012 and ended in 2021.
