
Real-time Fraud Detection With Machine Learning

As our lives and finances move from the physical to the digital world, real-time fraud detection takes centre stage.


Photo by Bermix Studio on Unsplash

Unlike our parents and grandparents, we live and breathe in the digital world. Initially, it was discussions on online forums, then chats and emails, and now most of our lives and financial transactions are carried out digitally.

As the stakes get higher, it is no longer enough to detect fraud after the event. Imagine someone who has a few pieces of confidential information about your bank account or credit card being able to execute a fraudulent transaction. Banks and insurance companies need tools and techniques to detect fraud in real time and take appropriate action.

We humans lose our ability to interpret and visualise data as we move beyond three-dimensional space.

Today a financial transaction involves hundreds of parameters: transaction amount, past transaction trends, GPS location of the transaction, transaction time, merchant name and so on. We need to consider many parameters to detect anomalies and fraud in real time.

The isolation forest algorithm implemented in Scikit-Learn can help identify fraud in real time and avoid financial loss. In this article, I will discuss, step by step, the process of detecting a fraudulent transaction with machine learning.


Step 1: We need to import the packages we are going to use. We will use "make_blobs" to generate our test data and will measure the accuracy of the fitted model with accuracy_score.

from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.ensemble import IsolationForest

Step 2: In real life, we fit the model on millions or billions of past transactions and hundreds of parameters. In this article, we will consider a hundred samples and four features to understand the core concept and the process.

X, y = make_blobs(n_samples=[4, 96], centers=[[5, 3, 3, 10], [9, 3, 6, 11]],
                  n_features=4, random_state=0, shuffle=True)

The array X holds the values of the four parameters for a hundred records, and y stores whether each is a fraudulent or a normal transaction.
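As a quick sanity check (not part of the original walkthrough), we can confirm the shape of the generated data and the class balance before fitting the model; NumPy is assumed to be available:

import numpy as np

# X: 100 transactions with 4 features each; y: 0 = fraud, 1 = normal
print(X.shape)          # (100, 4)
print(np.bincount(y))   # [ 4 96] -> 4 fraud and 96 normal transactions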

Step 3: We will use 300 base estimators (trees) in the ensemble and draw 10 samples from the dataset to train each base estimator.

clf = IsolationForest(n_estimators=300, max_samples=10,
                      random_state=0, max_features=4,
                      contamination=0.1).fit(X)

Also, we will use all four feature values for the model (the "max_features" parameter). In real projects, feature engineering determines the importance of each parameter and establishes the list of features on which the model should be based. I will not discuss the details of feature engineering in this article and will cover it later in a separate article. The IsolationForest model is then fitted on the sample dataset.
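To see how the fitted model ranks transactions before any hard labelling, we can inspect the anomaly scores it assigns with scikit-learn's decision_function. This is a minimal sketch beyond the article's original code; the lower the score, the more easily the point is isolated, i.e. the more suspicious it is:

import numpy as np

# Anomaly score for every training transaction: negative values fall on the
# anomaly side of the threshold implied by the contamination setting
scores = clf.decision_function(X)
print(scores.min(), scores.max())

# The ten most suspicious transactions (lowest scores)
print(np.argsort(scores)[:10])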

We set the value of the "contamination" parameter based on the proportion of anomalies in the historical data and the cost of missing an anomaly versus raising false alarms. Let us say the proportion of fraudulent transactions in the historical dataset is 5% (0.05) and the transactions are very high stakes. In such a scenario, we may set the contamination value between 0.25 and 0.35. Setting the contamination value at 5 to 7 times the anomaly proportion in the historical records helps ensure that no rogue transaction is wrongly classified as normal. Of course, setting a contamination value well above the anomaly proportion also increases the number of false alarms. If the stakes are lower, we may afford to miss a few fraudulent transactions and instead reduce false alarms with a lower contamination value.
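To make this trade-off concrete, the small experiment below (an illustration, not part of the article's original code) refits the model with a few contamination values and counts how many of the hundred sample transactions get flagged; a higher contamination flags more transactions, catching more fraud at the cost of more false alarms:

# Effect of the contamination setting on how many transactions are flagged
for contamination in (0.04, 0.1, 0.2):
    model = IsolationForest(n_estimators=300, max_samples=10, max_features=4,
                            random_state=0, contamination=contamination).fit(X)
    flagged = (model.predict(X) == -1).sum()
    print("contamination={c}: {n} transactions flagged".format(c=contamination, n=flagged))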

Step 4: In the code below, the fitted IsolationForest model predicts whether each transaction is a fraudulent or a normal transaction. IsolationForest predicts an anomaly as "-1" and a normal transaction as "1". In our sample test dataset, fraud transactions are coded as "0" and normal transactions as "1".

y_pred = clf.predict(X)
y_pred[y_pred == -1] = 0

To compare the model's predictions with the actual classification from the sample dataset, we relabel the predicted fraud transactions from "-1" to "0".

Step 5: As the fraud transactions are now labelled "0" in both the sample and the predicted set, we can compare the prediction accuracy of the model directly with the accuracy_score function.

fraud_accuracy_prediction = round(accuracy_score(y, y_pred), 2)
print("The accuracy to detect fraud is {accuracy} %".format(accuracy=fraud_accuracy_prediction * 100))

It seems the model identified the fraudulent transactions with 93% accuracy. The prediction accuracy may not look good enough at first glance, but remember that, as the stakes are higher, we are fine with a few false alarms (false positives). These false alarms sacrifice prediction accuracy, but it is better to be ultra-safe than to miss a few fraudulent transactions.

Step 6: We will use the confusion matrix to look deeper into the predictions.

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y, y_pred))

Out of the total 100 transactions in the sample dataset, the model identified all four true fraud transactions.

The model labelled seven genuine transactions as fraud (false alarms) due to the contamination (safety factor) parameter of 0.1. We set the contamination value higher than the actual proportion of fraud transactions in the historical data because it is better to be safe than sorry when the stakes are higher.
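Beyond raw accuracy, precision and recall on the fraud class make this trade-off explicit. A short sketch (fraud is labelled "0" in our data, so we pass pos_label=0):

from sklearn.metrics import precision_score, recall_score

# Treat the fraud label "0" as the positive class
precision = precision_score(y, y_pred, pos_label=0)   # share of flagged transactions that are truly fraud
recall = recall_score(y, y_pred, pos_label=0)         # share of true frauds that were caught
print("Fraud precision: {p:.2f}, fraud recall: {r:.2f}".format(p=precision, r=recall))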

Step 7: We write a small function to detect whether a new transaction is fraudulent in real time. It feeds the parameter values of the new transaction into the trained model to ascertain the authenticity of the transaction.

def frauddetection(trans):
    # Predict on a single transaction: -1 means anomaly (suspected fraud), 1 means normal
    transaction_type = clf.predict([trans])
    if transaction_type[0] < 0:
        print("Suspect fraud")
    else:
        print("Normal transaction")

Step 8: Various transaction parameters are collected at the time of the new transaction.

frauddetection([7, 4, 3, 8])
frauddetection([10, 4, 5, 11])

The authenticity of a new transaction is ascertained by calling the function defined earlier with its transaction parameters.
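In production we may prefer a graded anomaly score to a hard label, so that borderline transactions can be routed for manual review instead of being blocked outright. The sketch below extends the idea using decision_function; the review threshold of 0.05 is an arbitrary value chosen for illustration, not a recommendation:

def fraud_score(trans, review_threshold=0.05):
    # Lower scores mean the transaction is easier to isolate, i.e. more anomalous
    score = clf.decision_function([trans])[0]
    if score < 0:
        return score, "Suspect fraud"
    elif score < review_threshold:
        return score, "Borderline - send for manual review"
    return score, "Normal transaction"

print(fraud_score([7, 4, 3, 8]))
print(fraud_score([10, 4, 5, 11]))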

I have simplified a few things, such as the number of features in a transaction, the number of historical transactions used to fit the model, and feature engineering, to explain the core concept. We have seen how the isolation forest algorithm can help detect fraudulent transactions in real time.

If you would like to know how we can perform feature engineering with exploratory data analysis, then read the article on Advanced Visualisation for Exploratory Data Analysis (EDA).

