
Turning Lending Club’s Worst Loans into Investment Gold

We Use Machine Learning to Mine Profit From Lending Club's Junkiest Loans

So shiny…

This is a writeup of a machine learning project I completed. In this post I hope to:

  • Describe my algorithm for predicting loan defaults.
  • Use the algorithm to construct a portfolio of clean loans that earns an above average return.
  • Introduce and explain ROC curves, precision, and recall.

_You can find the code I used to run the analysis on my GitHub._


Lending Club, one of the original peer-to-peer lenders and one-time fintech darling (though not anymore), is an interesting business. They make money by connecting people who want to borrow money with those who are willing to lend it. Lending Club adds value to the process by screening out the riskiest borrowers and using their proprietary algorithm to assign a grade (and interest rate) to all the loan applicants that make it past their filters.

We are interested in them today because they offer something that very few other investment assets offer currently – a juicy interest rate. For those of you that follow financial trends, you know that the Federal Reserve (America’s central bank) has pushed yields down to historically low levels and kept them there since the Financial Crisis (2008). Check it out in the chart below:

The One Year Treasury Bill Rate is Pretty Low These Days

The net result of this low interest rate monetary policy was a decline in yields (yield is another way of saying interest rate) across the risk spectrum. All yields from mortgage rates to the interest rates on high yield debt (loans to companies with high levels of debt relative to their income) compressed to historical lows as investment managers bought anything and everything that could earn them a decent return.

If you are interested in investing in something that pays you a regular interest rate these days here is your menu of options (see chart below). Your bank account earns you a negative return after inflation and U.S. Treasuries barely beat inflation. Going further out the risk curve into various types of corporate debt doesn’t help much either. But what’s that over there?

Inflation Adjusted Yields for Various Investment Assets

The pink bar really jumps out, right? "Lending Club High Yield" is a weighted average of the yields on Lending Club’s D, E, F, and G rated loans (where A is the highest and G is the lowest). These junk loans (finance industry parlance for risky loans) offer a much juicier yield than their higher rated (A, B, and C) counterparts. Average yields for A, B, and C rated loans are around 12 percentage points lower than yields for junk loans!


The Problem

So what’s the catch? The catch is that these junk loans have extremely high rates of default.

Approximately 28% of the junk loans I looked at defaulted! (My dataset was every 36 month loan originated by Lending Club in 2015)

The chart below shows how this massive default rate impacts the 15% yield we thought we were going to earn. The defaults dropped us from an inflation adjusted yield of 15% to a mere 2%! The 2% return includes recoveries – money owed that is extracted from the borrower after they have already defaulted.

After Defaults, Lending Club Junk Loans Yield Very Little (All Yields are Inflation Adjusted)

But There is Still Hope!

All is not lost. If we can build a classification model that reliably predicts which loans will go bad, then we can focus our investments in the junk loans that our model deems least likely to default. First let’s take a step back and answer the question, "What is a classification model?"

Classification is a popular objective of machine learning algorithms – we want to know what class (a.k.a. group) an observation belongs to. The ability to precisely group observations is really useful for various business applications such as predicting whether a particular user will buy a product or (as we are trying to do here) forecasting whether a given loan will default or not.

If the paragraph above sounds familiar, that’s because I took it almost verbatim from an earlier blog post. In that post, I wrote extensively about the random forest classifier – the algorithm we will now use to classify each loan into either likely to default or NOT likely to default.

Please read that post if you want to go deeper into how random forest works. But here is the TLDR – the random forest classifier is an ensemble of many uncorrelated decision trees. The low correlation between trees creates a diversifying effect allowing the forest’s prediction to be on average better than the prediction of any individual tree and robust to out of sample data.

The random forest algorithm employs the following two tricks to reduce correlation between trees – bagging (bootstrap aggregation) and feature randomness (check out my post on random forest for more).
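In scikit-learn terms, those two tricks map directly onto the classifier's parameters. Here is a minimal sketch on toy data (the real model was trained on the Lending Club features, not this synthetic set):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Toy stand-in for the loan data; the real model used 158 features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# bootstrap=True enables bagging (each tree trains on a bootstrap sample);
# max_features="sqrt" adds feature randomness, so each split considers
# only a random subset of the features.
forest = RandomForestClassifier(
    n_estimators=500, max_features="sqrt", bootstrap=True, random_state=42
)
forest.fit(X, y)

# Probability of the positive class (default) for each loan.
probs = forest.predict_proba(X)[:, 1]
```

Averaging across many weakly correlated trees is what gives the forest its robustness relative to a single decision tree.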


Feature Selection

The loan data and features that I used to build my model came from Lending Club’s website. I downloaded the .csv file containing data on all 36 month loans underwritten in 2015. If you play with their data without using my code, make sure to carefully clean it to avoid data leakage. For example, one of the columns represents the collections status of the loan – this is data that definitely would not have been available to us at the time the loan was issued.

As expected with data about loans, most of the features are related to the borrower’s personal and financial characteristics:

  • Home ownership status
  • Marital status
  • Income
  • Debt to income ratio
  • Credit card loans
  • Characteristics of the loan (interest rate and principal amount)

Since I had around 20,000 observations to work with, I felt comfortable using 158 features (including a few custom ones – ping me or check out my code if you would like to know the details) and relied on properly tuning my random forest to protect me from overfitting.
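The leakage-scrubbing step might look like the sketch below; the column names are illustrative stand-ins, not the exact Lending Club field names:

```python
import pandas as pd

# Toy frame standing in for the Lending Club CSV; the leakage columns
# here are illustrative examples of post-origination information.
loans = pd.DataFrame({
    "int_rate": [13.5, 22.9, 17.1],
    "loan_amnt": [10000, 15000, 8000],
    "collections_status": ["none", "in_collections", "none"],  # known only after issuance
    "recoveries": [0.0, 812.5, 0.0],                           # known only after default
    "defaulted": [0, 1, 0],
})

# Drop anything that would not have been known at origination,
# plus the target itself.
leakage_cols = ["collections_status", "recoveries"]
X = loans.drop(columns=leakage_cols + ["defaulted"])
y = loans["defaulted"]
```

The general rule: if a column could only be populated after the loan was issued, the model must never see it.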


Model Selection Using ROC Curves

Even though I make it seem like random forest and I are destined to be together, I did consider other models too. The ROC curve below shows how these other models stack up against our beloved random forest (as well as guessing randomly, the 45 degree dashed line).

ROC Curves for the Various Classification Models I Tried (Validation Data)

Wait, what is a ROC Curve you say? I’m glad you asked because I wrote an entire blog post on them!

In case you don’t feel like reading that post (so saddening!), here is the slightly shorter version – the ROC Curve tells us how good our model is at trading off between benefit (True Positive Rate) and cost (False Positive Rate). Let’s define what these mean in terms of our current business problem.

The Confusion Matrix
  • True Positive Rate, also known as recall, is the ratio of True Positives to Actual Defaults. In our confusion matrix (to the left), it is the green box divided by the sum of the green and yellow boxes. It tells us what percentage of Actual Defaults we are correctly classifying with our model.
  • False Positive Rate is the ratio of False Positives to Actual NO Defaults. In our matrix, it is the red box divided by the sum of the red and blue boxes. It tells us the percentage of clean loans that we are incorrectly classifying as defaults.
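Both rates fall straight out of the confusion matrix. A quick sketch with made-up labels (1 = default):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])  # 1 = actual default
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])  # 1 = predicted default

# ravel() flattens the 2x2 matrix into (TN, FP, FN, TP).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)  # recall: share of actual defaults we caught
fpr = fp / (fp + tn)  # share of clean loans wrongly flagged as defaults
```

Here the model catches 2 of 3 actual defaults (TPR of two thirds) while wrongly flagging 1 of 5 clean loans (FPR of 20%).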

The key is to recognize that while we want a nice, big number in the green box – increasing True Positives comes at the expense of a bigger number in the red box as well (more False Positives).

Let’s see why this occurs. For each loan, our random forest model spits out a probability of default. But what constitutes a default prediction? A predicted probability of 25%? What about 50%? Or maybe we want to be extra sure so 75%? The answer is it depends.

The probability cutoff that decides whether an observation belongs to the positive class or not is a hyperparameter that we get to choose.

This means that our model’s performance is actually dynamic and varies depending on what probability cutoff we choose. If we pick a really high cutoff probability such as 95%, then our model will classify only a small number of loans as likely to default (the values in the red and green boxes will both be low). But the flip-side is that our model captures only a small percentage of the actual defaults – or in other words, we suffer a low True Positive Rate (value in yellow box much bigger than value in green box).

The reverse situation occurs if we choose a really low cutoff probability such as 5%. In this case, our model would classify many loans to be likely defaults (big values in the red and green boxes). Since we end up predicting that most of the loans will default, we are able to capture the vast majority of the actual defaults (high True Positive Rate). But the consequence is that the value in the red box is also very large so we are saddled with a high False Positive Rate.
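A quick sketch of how the cutoff moves those counts, using made-up predicted probabilities:

```python
import numpy as np

# Hypothetical predicted default probabilities for ten loans.
probs = np.array([0.03, 0.08, 0.15, 0.22, 0.35, 0.41, 0.55, 0.68, 0.80, 0.96])

# A high cutoff flags few loans as defaults; a low cutoff flags many.
for cutoff in (0.95, 0.50, 0.05):
    flagged = int((probs >= cutoff).sum())
    print(f"cutoff {cutoff:.2f}: {flagged} of {len(probs)} loans flagged as likely defaults")
```

At a 95% cutoff only one loan is flagged; at a 5% cutoff nine of the ten are.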

How a ROC Curve is Generated

Back to ROC Curves

Wow, that was a longer than expected digression. We are finally ready to go over how to read the ROC curve.

The chart to the left visualizes how each line on the ROC curve is drawn. For a given model and cutoff probability (say random forest with a cutoff probability of 99%), we plot it on the ROC curve by its True Positive Rate and False Positive Rate. After we do this for all cutoff probabilities, we produce one of the lines on our ROC curve.

But what does that actually mean?

Each step to the right represents a decrease in cutoff probability – with an accompanying increase in false positives. So we want a model that picks up as many true positives as possible for each additional false positive (cost incurred).

That’s why the more the model exhibits a hump shape, the better its performance. And the model with the largest area under the curve is the one with the biggest hump – and therefore the best model.
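In practice we rarely sweep the cutoffs by hand – scikit-learn's roc_curve does it for us. A sketch on synthetic scores (not the actual loan model):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)            # 1 = default
# Noisy scores that are mildly informative about the label.
scores = 0.3 * y_true + rng.uniform(0, 1, size=500)

# roc_curve sweeps every cutoff and returns one (FPR, TPR) pair per cutoff;
# plotting tpr against fpr draws the curve, and roc_auc_score gives its area.
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)
```

Because the scores carry some signal, the AUC comes out comfortably above the 0.5 of random guessing.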

Whew, finally done with the explanation! Going back to the ROC curve above, we find that random forest with an AUC of 0.61 is our best model. A few other interesting things to note:

Average Interest Rate and Investor Return for Lending Club Loans (2007 to Q1/2019)
  • Compounding the modest AUC is the fact that Lending Club’s riskier loans UNDERperform their safer loans, anathema to adherents of Modern Portfolio Theory (left chart). The implication here is that Lending Club may not be charging riskier borrowers a high enough interest rate. My hunch is that pressure to hit revenue targets is forcing them to charge lower interest rates in order to attract more borrowers.
  • The naive Bayes classifier produces the same results as guessing randomly. At some point I will write about this model, but I was surprised by how badly it did.

Why Random Forest?

Lastly, I wanted to expound a bit more on why I ultimately chose random forest. It’s not enough to just say that its ROC curve scored the highest AUC, a.k.a. Area Under Curve (logistic regression’s AUC was almost as high). As data scientists (even when we are just starting out), we should seek to understand the pros and cons of each model. And how these pros and cons change based on the type of data we are analyzing and what we are trying to achieve.

I chose random forest because all of my features exhibited very low correlations with my target variable. Thus, I felt that my best chance for extracting some signal out of the data was to use an algorithm that could capture more subtle and non-linear relationships between my features and the target. I also worried about over-fitting since I had a lot of features – coming from finance, my worst nightmare has always been turning on a model and seeing it blow up in spectacular fashion the second I expose it to truly out of sample data. Random forests offered the decision tree’s ability to capture non-linear relationships and its own unique robustness to out of sample data.


Feature Importances

The random forest implementation in scikit-learn has a handy "feature_importances_" attribute. If we check it out for our model we find that the three most important features are:

  1. Interest rate on the loan (pretty obvious, the higher the rate the higher the monthly payment and the more likely a borrower is to default)
  2. Loan amount (similar to previous)
  3. Debt to income ratio (the more indebted someone is, the more likely that he or she will default)
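A sketch of pulling those importances out of a fitted model; the data and column names below are toy stand-ins for the real loan features:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Toy data with named columns standing in for a few of the loan features.
X, y = make_classification(n_samples=500, n_features=5, n_informative=3, random_state=7)
cols = ["int_rate", "loan_amnt", "dti", "income", "home_ownership"]
X = pd.DataFrame(X, columns=cols)

forest = RandomForestClassifier(n_estimators=200, random_state=7).fit(X, y)

# feature_importances_ sums to 1; sorting shows the strongest features first.
importances = pd.Series(forest.feature_importances_, index=cols).sort_values(ascending=False)
```

Each importance measures how much that feature reduced impurity across all the splits in the forest, so the values are relative rather than absolute.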

Precision and Recall

OK, we have our model now so it’s finally time to construct a portfolio. It’s also time to answer the question we posed earlier, "What probability cutoff should we use when deciding whether or not to classify a loan as likely to default?"

Let’s define two key terms:

Precision: When our model classifies a loan as likely to default, what is the probability that it actually defaults?

A model that emphasizes precision has a high probability cutoff and incurs more false negatives.

Recall: Out of all the loans that actually defaulted, how many did our model flag as likely to default?

A model that emphasizes recall has a low probability cutoff and incurs more false positives.

A critical and somewhat overlooked part of classification is deciding whether to prioritize precision or recall. This is more of a business question than a data science one and requires that we have a clear idea of our objective as well as how the costs of false positives compare to those of false negatives.

I would strongly argue for recall. Remember that our objective is to find a clean set of loans that we can invest in with confidence. A false negative (we predict NO default but the loan defaults) costs us real money while a false positive (predict default but the loan does not) is just an opportunity cost – so even if a loan appears just a little bit dodgy, we should throw it away. There will always be more loans to choose from so we should be very selective when deciding whether or not to invest our hard earned money.

Let’s use a probability cutoff of 20%.
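Applying that 20% cutoff might look like this sketch (the labels and probabilities are made up for illustration):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])  # 1 = actual default
probs = np.array([0.9, 0.6, 0.3, 0.1, 0.5, 0.25, 0.15, 0.1, 0.05, 0.02])

# A low 20% cutoff flags anything even slightly dodgy as a default.
y_pred = (probs >= 0.20).astype(int)

precision = precision_score(y_true, y_pred)  # flagged loans that truly default
recall = recall_score(y_true, y_pred)        # actual defaults we caught
```

With these toy numbers the low cutoff buys us high recall (we catch 3 of the 4 real defaults) at the price of lower precision (2 of the 5 flagged loans were actually clean).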

Finally, the Investment Results

I have a secret to confess – at the start of our analysis I hid away approximately 6,500 loans so that we can now use them to perform a true out of sample test. Let’s see how we do.

Out of the 6,500 loans in our test set, 28% defaulted. But out of the roughly 450 loans that our model classifies as safe (unlikely to default), only 15% of them defaulted. We were able to decrease the frequency of default by almost 50%!

That’s cool and all but how does it look when we translate it into returns? Lending Club recommends that investors buy and hold a portfolio of at least 100 loans (for diversification). So let’s use a Monte Carlo simulation to randomly select 100 loans from the entire test set and 100 loans from our clean set (the ones picked by our model) over and over again (we will run 5,000 simulations) and see how we do.
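The simulation boils down to repeated sampling with replacement. A sketch with hypothetical per-loan returns (the real simulation draws from the actual test-set loans rather than these synthetic distributions):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical per-loan returns: the full test set vs. the model-picked set.
all_loans = rng.normal(0.06, 0.25, size=6500)   # noisier, lower average return
clean_loans = rng.normal(0.12, 0.20, size=450)  # model-picked subset

def simulate(returns, n_loans=100, n_sims=5000):
    """Average return of randomly drawn 100-loan portfolios."""
    picks = rng.choice(returns, size=(n_sims, n_loans), replace=True)
    return picks.mean(axis=1)

all_portfolios = simulate(all_loans)
clean_portfolios = simulate(clean_loans)
```

Comparing the two distributions of 5,000 portfolio returns is what produces the chart below.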

Return Comparison – Model Picked Loans vs. All Loans

We successfully doubled the earned return from 6% for the entire test set to 12% when investing in only the loans chosen by our model! Victory!

Conclusion

It looks like we were able to successfully predict which loans were more likely to default, avoid them, and ultimately earn significantly higher returns.

We also learned more about random forests, how to interpret ROC curves, and how precision and recall considerations impact our model’s probability cutoff.

If you enjoyed this, please check out some of my other posts (links below). Cheers!


Links

More from yours truly on Data Science:

Understanding how random forest works

Understanding how logistic regression works

My data science boot camp experience

More from yours truly on Investing:

Do stocks provide an excess return over cash?

What the next stock market downturn might look like

