
Categorizing Fraudulent Credit Card Transactions

Experimenting with various classifier models and taking a deeper dive into XGBoost

Photo by Karolina Grabowska of Pexels.com

Intro:

As a student of the Metis Data Science Bootcamp, I chose to explore fraudulent card transactions for my second individual project. The project had three requirements: query data from a PostgreSQL database with SQL, develop a classification model, and create an interactive visualization. I found the PostgreSQL and interactive visualization components fairly straightforward, so the bulk of this post covers the development of a usable model; I've included some insights and tips for the other two components as well. Also, check out the GitHub Repository for all of the code behind the project.

Background:

After looking for some interesting data to analyze, I came across this competition on Kaggle. An e-commerce payment solutions company, Vesta Corporation, put together a few .csv files containing card transaction information: the transaction amount, card type, browser used for the purchase, some other basic fields, and over 300 features that were engineered by the company but left undefined in the data description. The target was contained in a column named "isFraud", which marked fraudulent transactions with a 1 and valid transactions with a 0. With the .csv files in hand, it was time to load them into a PostgreSQL database and turn this information into insights.

Before I dig in, I would like to lay out the project:

  • SQL Skills (PostgreSQL, SQLAlchemy, psycopg2)
  • Classification Modeling (Logistic Regression, XGBoost, Random Forest, and more)
  • Interactive Visualization (Streamlit)

Step 1: Using PostgreSQL 🗄

As I stated earlier, a requirement of this project was to use SQL to query the data. Since the data came as .csv files, I first read them into DataFrames so I could load them into a PostgreSQL database.

The code is below, and I'll walk through what it all means. First, I read the .csv files into DataFrames and then created an engine to connect to the local PostgreSQL database that I named ‘project3’. This SQLAlchemy engine is the key to the whole block of code: it lets the Jupyter notebook talk to PostgreSQL and create tables in a given database. The .csv files contain a variety of types (objects, ints, floats, etc.), and it’s impressive how pandas and SQLAlchemy interpret the DataFrame and create columns with corresponding datatypes. It also makes this process a lot faster than building the tables from the command line (the bigger .csv file has over 300 columns).

import pandas as pd
from sqlalchemy import create_engine

# First, turn those csv files into DataFrames.
train_identity = pd.read_csv('../Project-3/train_identity.csv')
train_transaction = pd.read_csv('../Project-3/train_transaction.csv')

# Create a connection to the project3 PostgreSQL database.
engine = create_engine('postgresql://[USERNAME]:[PASSWORD]@localhost:5432/project3')

# Turn the DataFrames into tables in the PostgreSQL database.
train_identity.to_sql('train_ident',
                      engine,
                      if_exists='replace',  # overwrite the table if it already exists
                      index=False,
                      chunksize=500)        # insert 500 rows at a time

train_transaction.to_sql('train_trans',
                         engine,
                         if_exists='replace',
                         index=False,
                         chunksize=500)

That’s the bulk of the SQL component of the project. I won’t go into any more detail here, but below is the more involved SQL query that joins both tables so the data can be investigated further.

mastercard_full = pd.read_sql(
    """
    SELECT *
    FROM train_trans
    LEFT JOIN train_ident
        ON train_trans."TransactionID" = train_ident."TransactionID"
    WHERE train_trans.card4 = 'mastercard'
    """,
    con=engine)

Step 2: Classification Modeling 🎯

With the data successfully stored in a PostgreSQL database and queried back into a DataFrame, I started with some basic EDA and quickly identified my main challenge: class imbalance.

Image by Author

The issue with class imbalance comes down to how you measure the success of the model. If I only cared about accuracy, I could predict every transaction as valid and score 97% accuracy. To the untrained eye, 97% accuracy looks very desirable, but that model catches 0% of fraudulent transactions. That 0% is the recall for the fraud class. Recall answers the question: of all the fraudulent transactions, what percentage were correctly predicted by the model? By contrast, the recall for valid transactions would be 100%, since every valid transaction was correctly predicted as valid. With class imbalance identified as the issue, and the goal being to predict fraudulent transactions, my metric for success became recall on fraudulent transactions combined with overall accuracy.
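To make that concrete, here is a minimal sketch (not from the project notebook) using scikit-learn on toy labels with a similar imbalance; it shows how a "predict everything as valid" baseline earns high accuracy while catching no fraud.

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Toy labels with roughly the same imbalance as the dataset: 3% fraud (1), 97% valid (0).
y_true = np.array([1] * 3 + [0] * 97)
y_pred = np.zeros_like(y_true)  # naive "model": every transaction predicted valid

print(accuracy_score(y_true, y_pred))             # 0.97 -- looks great
print(recall_score(y_true, y_pred, pos_label=1))  # 0.0  -- no fraud caught
print(recall_score(y_true, y_pred, pos_label=0))  # 1.0  -- every valid transaction caught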

So, how did I address the class imbalance? There are a few ways to tackle it, but I chose to focus on oversampling. The three methods I explored were RandomOverSampler, SMOTE, and ADASYN. Each of these methods takes the minority class and oversamples it until it is balanced with the majority class. RandomOverSampler randomly duplicates data points in the minority class. Synthetic Minority Oversampling Technique (SMOTE) creates new minority-class points by interpolating between a point and its k-nearest neighbors, making it a more advanced oversampling technique. Lastly, Adaptive Synthetic Sampling (ADASYN) creates new minority-class points according to their density distribution, generating more samples where the minority class is harder to learn.

Implementing these oversampling techniques is very simple, as shown below. It is important to note that only the training data is oversampled; the model is then fit on the resampled training set, while the test set is left untouched so the evaluation reflects the real class distribution.

from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN

# RandomOverSampler
ros = RandomOverSampler(random_state=0)
X_tr_sam, y_tr_sam = ros.fit_resample(X_train, y_train)
# SMOTE
X_smoted, y_smoted = SMOTE(random_state=42).fit_resample(X_train, y_train)
# ADASYN
X_adasyn, y_adasyn = ADASYN(random_state=42).fit_resample(X_train, y_train)

Another common way to evaluate binary classification problems is the ROC curve and the area under that curve (AUC). Below are three sets of ROC curves corresponding to the different oversampling methods and the classifier models I wanted to test. Random Forest and XGBoost are clearly the strongest models. Both are ensemble methods, which speaks to how combining many weaker models can produce a better model as a whole.

Image by Author
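As a rough illustration of how a comparison like this can be run (a sketch, not the project’s exact notebook code): fit each candidate model on the oversampled training data and score it by ROC AUC on the untouched test split. Here X_tr_sam and y_tr_sam come from the RandomOverSampler step above, while X_test and y_test are assumed to be the held-out split.

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'XGBoost': XGBClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_tr_sam, y_tr_sam)              # train on the oversampled data
    probs = model.predict_proba(X_test)[:, 1]  # probability of the fraud class
    print(f'{name}: ROC AUC = {roc_auc_score(y_test, probs):.3f}')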

These visuals and metrics gave me a clearer picture for moving forward, and I decided to home in on the XGBoost classifier with random oversampling. It is worth noting that XGBoost performed slightly worse than the Random Forest classifier here; I chose XGBoost anyway because so many Kagglers hype it up, and I wanted to dive deeper into this specific type of model.

As for features and feature engineering, it’s probably best to view the notebook for the details. With XGBoost, I was excited to tune hyperparameters to increase the ROC AUC score and, in turn, the model’s recall. To find good hyperparameters, I looked through the documentation for the specific parameters to tune. The parameters max_depth, min_child_weight, and gamma control model complexity, while colsample_bytree and subsample control how the model handles noise. Using GridSearchCV, I found optimal values for each of these parameters. GridSearchCV is a very compute-intensive process, so I tuned the parameters in separate batches to ease the strain on my computer; with more resources, it would be better to search over all of the parameters at once.

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

xgb = XGBClassifier()

# Choose the parameters to tune
cv_params = {'max_depth': [4, 5, 6], 'min_child_weight': [1, 2, 3], 'gamma': [0, 1, 2]}

# Run the grid search
xgbc1 = GridSearchCV(estimator=xgb, param_grid=cv_params, scoring='roc_auc')
xgbc1.fit(X_tr_sam, y_tr_sam)

# Observe the best parameters
xgbc1.best_params_
# {'gamma': 2, 'max_depth': 6, 'min_child_weight': 1}

# Repeat with the remaining parameters, fixing the values found above
cv_params = {'subsample': [0.5, 0.75, 1], 'colsample_bytree': [0.5, 0.75, 1]}
fix_params = {'gamma': 2, 'max_depth': 6, 'min_child_weight': 1}
xgbc = GridSearchCV(estimator=XGBClassifier(**fix_params), param_grid=cv_params, scoring='roc_auc')
xgbc.fit(X_tr_sam, y_tr_sam)
xgbc.best_params_
# {'colsample_bytree': 1, 'subsample': 0.75}

With the data oversampled (to fight class imbalance), the features selected and engineered, and the model’s parameters tuned, it was time to look at the outcome. Below is a graphic combining the ROC curve with a confusion matrix, another easy way to view the results of a classification model. First, the AUC score is 0.96, which is better than any of the models shown in the ROC curves above. The confusion matrix makes it easy to see where the model works well and where it falls short. Given the goal of this project, correctly predicting 1,589 out of 1,949 fraudulent transactions seems very good, but what else can be done?
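Before answering that, here is a hedged sketch of how an evaluation like this might be produced in code (the plotting of the combined figure is omitted). best_xgb is a stand-in for the tuned classifier, refit with the parameters found by the grid searches above.

from sklearn.metrics import confusion_matrix, roc_auc_score
from xgboost import XGBClassifier

# Refit XGBoost with the tuned hyperparameters on the oversampled training data.
best_xgb = XGBClassifier(gamma=2, max_depth=6, min_child_weight=1,
                         colsample_bytree=1, subsample=0.75)
best_xgb.fit(X_tr_sam, y_tr_sam)

probs = best_xgb.predict_proba(X_test)[:, 1]
preds = best_xgb.predict(X_test)

print('ROC AUC:', roc_auc_score(y_test, probs))
print(confusion_matrix(y_test, preds))  # rows = actual (valid, fraud), columns = predicted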

Most, if not all, classification models output the probability behind each prediction. There is usually a probability threshold that sets the cut-off for a positive prediction. For XGBoost, I could not find a way to tune the probability threshold within the model itself, but using…

xgb.predict_proba(X_test)

… outputs an array of class probabilities, where the second column gives the probability that each transaction is fraudulent. I used this prediction-probability array as the basis for the Streamlit app.
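Here is a minimal sketch of applying a custom threshold to that probability array (assuming xgb is the fitted classifier from the snippets above and X_test/y_test are the held-out split); lowering the threshold flags more transactions as fraud, trading precision for recall.

from sklearn.metrics import precision_score, recall_score

probs = xgb.predict_proba(X_test)[:, 1]   # probability that each transaction is fraud

threshold = 0.3                            # lower than the default 0.5 cut-off
preds = (probs >= threshold).astype(int)   # flag anything above the threshold as fraud

print('Recall:   ', recall_score(y_test, preds))
print('Precision:', precision_score(y_test, preds))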

Part 3: Streamlit App 🖥

Streamlit is an amazing tool for creating apps when you’re new to coding. Since I only started coding heavily this year, with a strong focus on Python, jumping into advanced Flask, HTML, D3, or any of the more in-depth development tools seemed out of scope for this short classification project. Luckily, Streamlit is very easy to use for a Python programmer, and the resulting application looks professional.

The code for the app is in this Python file, but I’ll run through the basic concept of the resulting application. If a bank were to use this model, I wanted to provide an aid for choosing the model’s probability threshold. With a very low threshold, all fraudulent transactions would be correctly flagged, but many valid transactions would be falsely flagged as fraud. With a higher threshold, more fraudulent transactions would slip through as valid. This app lets the bank see what would happen at different thresholds and choose the one that suits its needs.

The first graph shows recall and precision versus the threshold value, with a gold line marking the chosen threshold. Below that are the recall and the amount saved by catching fraudulent transactions, and further down is a confusion matrix, providing an additional visual aid for settling on the optimal probability threshold.
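To give a flavour of how such an app fits together, here is a stripped-down Streamlit sketch, not the project’s actual app file; the file names and column names are placeholders, and it assumes the fraud probabilities and true labels were saved to disk beforehand.

import pandas as pd
import streamlit as st
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Placeholder inputs: saved fraud probabilities and true labels from the model above.
probs = pd.read_csv('fraud_probabilities.csv')['prob_fraud'].values
y_true = pd.read_csv('true_labels.csv')['isFraud'].values

st.title('Fraud Probability Threshold Explorer')
threshold = st.slider('Probability threshold for flagging fraud', 0.0, 1.0, 0.5, 0.01)

# Re-classify at the chosen threshold and show the resulting metrics.
preds = (probs >= threshold).astype(int)
st.write('Recall:', round(recall_score(y_true, preds), 3))
st.write('Precision:', round(precision_score(y_true, preds), 3))
st.write('Confusion matrix (rows = actual, columns = predicted):')
st.write(pd.DataFrame(confusion_matrix(y_true, preds),
                      index=['valid', 'fraud'], columns=['valid', 'fraud']))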

Conclusion: 📗

Class imbalance is a common issue in classification problems, and oversampling is a great way to combat it, whether the dataset is large or small. Python also makes it easy to compare multiple models with just a bit of code and then tune hyperparameters with GridSearchCV. These techniques are important for data scientists to learn early on because meaningful model improvements can be made quickly with such basic knowledge. Now I feel the need to go back to my previous project, tune hyperparameters, and test more models to strengthen its results. Regardless, here are some basic takeaways from this project:

  • Learn SQL!
  • Always take time to explore the data
  • Develop an MVP early on and build upon that model
  • Streamlit makes it easy to create apps
  • Give the stakeholders an easily understood solution or interpretation

Check out the GitHub Repository for more information on this project – I really like making my notebooks easy to follow. Also, reach out if you have any questions or comments.


I’ve really been enjoying my data science journey and want to continue sharing my learning experiences. Feel free to check out the blog on my second project, and come back for more updates as I continue to traverse the world of data science.

Reach out: LinkedIn | Twitter

