An overview of precision and recall and an evaluation of the trade-off between the two metrics for building credit card fraud detection models
Real-world datasets rarely have equal representation of each class. Dealing with imbalanced data is a problem data scientists face constantly. When evaluating tradeoffs for binary classification, one must decide whether the priority is precision or recall. All code can be found in this notebook.

What are precision and recall?
Precision is the number of true positives divided by the sum of true positives and false positives, the top row in the figure above. A high precision means most of the results returned are relevant, with few irrelevant results mixed in. Recall, on the other hand, is the number of true positives divided by the sum of true positives and false negatives, the left column in the figure above. A high recall indicates the model successfully identifies most of the relevant results rather than mislabeling them as irrelevant. Depending on the use case, one must evaluate whether to prioritize precision or recall.
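As a concrete illustration, here is a minimal sketch of both calculations on a toy set of labels; the labels are made up purely for the example, and scikit-learn's built-in metrics are shown alongside the raw formulas:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels purely for illustration: 1 = relevant (positive), 0 = irrelevant (negative)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision = TP / (TP + FP): of everything we flagged, how much was actually relevant
# Recall    = TP / (TP + FN): of everything relevant, how much we actually flagged
print("precision:", tp / (tp + fp), precision_score(y_true, y_pred))
print("recall:   ", tp / (tp + fn), recall_score(y_true, y_pred))
```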
When should you prioritize recall?
For example, when testing patients for COVID-19 it is extremely important to capture as many positive cases as possible to understand the prevalence of the virus within a given area. It is very dangerous to diagnose someone as not having the virus when in fact they do, because they can unknowingly spread the disease to others. In the opposite case, if a healthy person is diagnosed as having the virus, the penalty is that they self-isolate unnecessarily for a few days. False negatives are much more harmful than false positives here, so we have to prioritize recall.
When should you prioritize precision?
On the other hand, when Netflix recommends content to its users, it doesn’t matter much if a series the user might like never appears in the list of suggestions. What is consequential is a high rate of suggestions the user has no interest in. The ratio of true positives to the sum of true positives and false negatives, or recall, isn’t very important here. The recommender drives value by consistently delivering suggestions the user will enjoy, so precision is the priority in this scenario: nearly every suggestion needs to be a good prediction for the feature to stay useful.

Credit card fraud is the unauthorized use of a credit or debit card to make purchases. Credit card companies have an obligation to protect their customers’ finances, so they employ fraud detection models to identify unusual financial activity and freeze a user’s credit card if transaction activity is out of the ordinary for a given individual. The penalty for mislabeling a fraudulent transaction as legitimate is having a user’s money stolen, which the credit card company typically reimburses. On the other hand, the penalty for mislabeling a legitimate transaction as fraud is freezing the user out of their finances and leaving them unable to make payments. There is a delicate tradeoff between these two consequences, and we will discuss how to handle it when training a model.
In this example, I will evaluate the precision-recall tradeoff using a credit card fraud detector. In the dataset used, 99.83% of transactions were authorized by the user and 0.17% were fraudulent. The dataset is very heavily imbalanced, so we must weigh the merits of prioritizing precision versus recall when developing a model. This is a binary classification problem, with 0 being authorized and 1 being fraudulent. A model that simply predicts 0 for every example will return 99.83% accuracy, which would normally be excellent for a binary classifier but is meaningless here. Although this naïve model achieves high classification accuracy, it does nothing to solve the problem, so we will evaluate precision and recall instead. We need to protect users’ finances by flagging as many fraudulent transactions as possible while not mislabeling so many legitimate transactions that users face the inconvenience of having their cards declined. A quick note about the features in this dataset: they are heavily masked and scaled so as not to reveal any personal or financial information about the users, so they are not very intuitive.
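To make the imbalance concrete, a short sketch along these lines checks the class split and the accuracy of the always-predict-0 baseline. It assumes the anonymized data is loaded from a CSV with a binary Class column, as in the public credit card fraud dataset this description suggests; the file name is a placeholder.

```python
import pandas as pd

# Assumes the anonymized credit card data with a binary "Class" column
# (0 = authorized, 1 = fraud); "creditcard.csv" is a placeholder file name.
df = pd.read_csv("creditcard.csv")

# Class proportions: roughly 0.9983 for class 0 and 0.0017 for class 1
print(df["Class"].value_counts(normalize=True))

# A "model" that always predicts 0 already scores ~99.83% accuracy,
# which is why accuracy alone tells us nothing useful here.
naive_accuracy = (df["Class"] == 0).mean()
print(f"always-predict-0 accuracy: {naive_accuracy:.4f}")
```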
One of the key problems we need to address in building our model is sampling. A random split into training and testing data would leave roughly 0.17% of the training set as class 1, so the fraud class is badly underrepresented and the model is likely to learn to predict class 0 for nearly everything. This can be addressed by sampling a much more balanced representation of both classes for training. The more we sample from class 0, the more likely our model is to predict class 0 for unseen data. I use a ratio of 2 to 1, as opposed to 1 to 1, for authorized to fraudulent transactions in order to limit the number of false positives. As mentioned earlier, false positives are harmful because they prevent customers from using their credit cards. This 2 to 1 ratio is a reasonable tradeoff because it gives strong representation to fraudulent transactions without giving them too much weight. Most solutions to this problem use a more even split of training samples, but they understate the problem of false positives. Freezing users out of their finances is extremely harmful and needs to be given more consideration when developing algorithms to predict credit card fraud.
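A minimal sketch of this undersampling step might look like the following; the helper function and its arguments are illustrative rather than the notebook's exact code.

```python
import pandas as pd

# Hypothetical helper: build a 2:1 authorized-to-fraud training sample by
# undersampling class 0 rather than training on the raw, imbalanced split.
def make_undersampled_train(df, ratio=2, label_col="Class", seed=42):
    fraud = df[df[label_col] == 1]
    authorized = df[df[label_col] == 0].sample(n=ratio * len(fraud), random_state=seed)
    # Combine and shuffle so the classes are interleaved
    return pd.concat([authorized, fraud]).sample(frac=1, random_state=seed)

# train_df = make_undersampled_train(df)  # df loaded as in the snippet above
```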
For a model, I will use a bagging classifier with AdaBoost as the meta-estimator, itself built on decision trees as the base estimator. A bagging classifier is an ensemble estimator that fits base classifiers on random subsets of the original training data and then "votes" by predicting the most commonly returned result across those base estimators. Here the base estimators for bagging are AdaBoost classifiers built on decision trees, which fit a classifier on the data and then fit additional copies of the classifier with the weights of incorrectly classified instances increased, so that subsequent classifiers focus on the hard examples. Combining boosting and bagging gives better performance than plain decision trees. Essentially, we are training multiple self-correcting classifiers on small, random subsets of the data.
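A rough scikit-learn sketch of this setup is below. The tree depth and number of boosting rounds are illustrative guesses rather than the notebook's exact settings, and recent scikit-learn versions use the estimator keyword where older releases used base_estimator.

```python
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# AdaBoost over shallow decision trees; max_depth and n_estimators are
# illustrative values, not necessarily those used in the notebook.
boosted_trees = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
)

# Bag 10 boosted models, each fit on a random subset of ~150 rows drawn from
# the 2:1 training sample (so roughly 100 authorized and 50 fraud per subset).
model = BaggingClassifier(
    estimator=boosted_trees,
    n_estimators=10,
    max_samples=150,
    random_state=42,
)

# model.fit(train_df.drop(columns="Class"), train_df["Class"])
```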


Each data subset we use when fitting our bagging classifier consists of roughly 100 valid and 50 fraudulent transactions, and we repeat this process 10 times to train the classifier. To see how I did this, please refer to this notebook. Choosing small, unrepresentative subsamples is effective for training the model in this case, but we need to understand how it performs in the real world. Our test data sampling must therefore differ significantly from our training data sampling and be representative of the actual use case; failing to do so limits our ability to understand how the model works in practice. I use the same roughly 99.8% and 0.2% representation for valid and fraudulent transactions in the test set. Our confusion matrix yields a recall of 0.83 and a precision of 0.24: 83% of actual fraudulent transactions were flagged as fraud, while only 0.58% of legitimate transactions were incorrectly flagged. That is an excellent result for the problem at hand, since we catch a high percentage of fraudulent transactions while not harming our users with false positives.
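A sketch of that evaluation step, assuming test_X and test_y come from a held-out split that preserves the real-world class balance:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# test_X / test_y are assumed to come from a held-out split that keeps the
# real-world ~99.8% / ~0.2% class balance.
preds = model.predict(test_X)

tn, fp, fn, tp = confusion_matrix(test_y, preds).ravel()
print("recall:             ", recall_score(test_y, preds))     # share of fraud caught
print("precision:          ", precision_score(test_y, preds))  # share of flags that are fraud
print("false positive rate:", fp / (fp + tn))                  # share of legit transactions flagged
```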
Let’s contrast this method with another: the XGBoost classifier. XGBoost also employs boosting (gradient boosting rather than AdaBoost), similar in spirit to the previous model, but is heavily optimized for speed and efficiency.
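A comparable XGBoost setup, trained on the same 2:1 subsamples, might look like this; the hyperparameters are illustrative defaults rather than the notebook's exact settings.

```python
from xgboost import XGBClassifier

# Hyperparameters are illustrative defaults, not the notebook's exact settings.
xgb_model = XGBClassifier(
    n_estimators=100,
    max_depth=4,
    learning_rate=0.1,
)

# xgb_model.fit(train_df.drop(columns="Class"), train_df["Class"])
# Evaluate on the same representative test set as above to compare the
# recall gain against the increase in false positives.
```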


We train this model on the same subsamples as above and see very different results. Here, recall rises to 0.87, meaning we catch nearly 87% of fraudulent transactions, a slight improvement over the previous model. However, this comes at a very large cost: out of all authorized transactions, 3.3% are incorrectly flagged as fraud. This means roughly 1 in 30 legitimate transactions will be declined, greatly limiting our users’ ability to go about their day and access their finances. Does the inconvenience to our customers outweigh the gain of catching a few extra fraudulent transactions before they happen? I don’t think so, but the question is open to debate.
Conclusion
The precision-recall tradeoff is a challenging problem data scientists have to solve when working with imbalanced data. Some use cases should prioritize precision while others should prioritize recall; there is no universal right or wrong answer. In our use case of fraudulent credit card transactions, I give a little more weight to precision than most models do because I think the consequences of false positives are understated.