P2P Lending Platform Data Analysis: Exploratory Data Analysis in R — Part 1

An Exploration over Prosper Loan Data

Lorna Yen
Towards Data Science

The peer-to-peer (P2P) lending industry has been thriving in recent years. Thousands of investors are making a profit through these platforms, and thousands of borrowers are getting money more easily. Although these platforms provide credit scores and basic borrower information to keep lending safe, thousands of people are still at risk of losing money.

Even Prosper, a leading peer-to-peer lending platform, has suffered from credit risk issues in the past. After its restructuring and the launch of a new credit rating system, however, its credit risk improved. That made me want to find out the story behind the scenes. Here I will explore the Prosper data set and look for patterns in borrower properties and the different rating types, and for how they are linked to defaulted or completed loans.

I will also share some ideas about what motivates the choice of variables to explore. After all, in such a huge data set, starting from some basic domain knowledge makes it easier to dig out the valuable features.

The Prosper loan Data

This Prosper data set was provided by Udacity as part of the Data Analyst Nanodegree (last updated on 03/11/2014) and can be downloaded here. It contains 81 variables and 113,937 observations, one per loan listing from 2005 to 2014. The variables can be roughly classified into four categories:

  • Loan Status : The status of the loan listing, such as Cancelled, Charged off, Completed, Current, Defaulted, Final Payment In Progress, Past Due.
  • Borrower Data : Basic properties of borrowers, such as income, occupation, employment status, etc.
  • Loan Data : Basic properties of the loan, such as the length of the loan (term), Borrower APR, etc.
  • Credit Risk Metrics : Metrics that measure the risk of loans, such as Credit Grade, Prosper Score, bank card utilization, etc.

For variable definitions, see the link.

Initial Question of Interest

I have never used a P2P lending platform before, but I have always been curious about why investors can consistently earn a profit from these risky loans.

My first guess is that the credit assessment metrics on P2P lending platforms have real reference value, especially on well-performing platforms like Prosper or LendingTree. The reason is that on a P2P lending platform we cannot get much information about borrowers; the only reference sources are the metrics provided by the platform itself. These metrics give investors more reliable information and more opportunities to earn money. My question therefore becomes: what are the properties of the rating metrics on a good P2P lending platform like Prosper, and how are they linked to defaulted and completed loans? If I can find features that show trends with defaulted and completed loans, those features may play a key role in telling good loans from bad ones.

Before exploring the loan data to find answers, let’s go over some basic properties of P2P lending platforms.

P2P lending Platform: Prosper Outlook

In general, a P2P lending platform is very different from a traditional lending channel such as a bank, which always evaluates a loan using the borrower’s credit score from an independent credit reporting agency. Loans on P2P lending platforms, however, tend to be high risk. If they were rated only through a credit reporting agency, they would nearly always be rated as high risk, which would distort investors’ judgment on the platform. So there may be risk metrics that differ from traditional credit metrics and that can evaluate P2P loans properly.

Let’s check out the Prosper page to see their risk rating metric from an investor’s perspective. If we look at the loan listing page, we can see a “Rating” column on the left side of each listing, along with other basic information such as Category or Amount. So when an investor decides whether to invest in a loan, the “Rating” metric is the main assessment of credit risk on the platform.

So what is the Rating, and how has it performed? I went through Prosper’s Form S-1, Annual Report and Wiki entry, and made a summary:

Before 2009, the major credit risk metric displayed to investors was Credit Grade, which was based on the borrower’s credit score from an independent credit reporting agency. But loans on Prosper did not perform very well at that time. After a temporary shutdown requested by the SEC and a restructuring, Prosper launched a new credit risk metric in July 2009: Prosper Rating, which came with stricter credit guidelines for borrowers. Loan performance since then shows that Prosper’s loan default rate has been significantly reduced.

It seems that Prosper Rating performed better than the old Credit Grade metric, which evaluates a loan the way banks do. This leads me to compare the Loan Status before and after 2009 in the Prosper loan data.

Some tips for learning the background of the data…

Before proceeding to the next part, I want to share some tips on how to quickly pick up the basic domain knowledge behind a target data set.

For me as an experienced investment banker, quickly understanding basic domain knowledge and summarizing it is a major part of the daily job. When getting to know a brand new industry, it is very useful to find a company’s public filings, such as its SEC filings. The best-known filing is the Annual Report. If the company of interest is not publicly listed, it is also feasible to look up the filings of listed leading companies in the same industry.

The following are the major sources for quickly getting to know a new industry and the history of a company or industry:

  • Listed document: Form S-1, Form 10-K, Annual Report, etc. They provide basic company background and history, industry information and competition, an introduction to the main products and services, financial performance, etc. They can easily be found on the company’s Investor Relations (IR) web page if the company has gone through an initial public offering (IPO).
  • Industry Report: Industry reports offer plenty of information on industry trends and players. Well-known sources include IBISWorld, IDC, MarketResearch.com, etc. Note that most industry report sources require a paid account, but they usually provide a report summary from which we can get some basic information.
  • Statistical source: Useful for investigating how a quantity performs over a period. Most listed documents provide financial reports. If you want a more integrated statistical source, the one I recommend most is Statista.
  • Wikipedia and Google Search: Panacea for nearly everything.

Loan Performance before and after 2009

In the data set, I defined HighRisk loans as loans that are PastDue, Chargedoff or Defaulted, and Completed loans as loans that are Completed, FinalPaymentInProgress or Cancelled.
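The grouping just described can be sketched in base R. This is only a minimal illustration on a few made-up rows: the column names LoanStatus and ListingCreationDate and the exact status spellings are assumed to match the Prosper CSV, the helper classify_status is hypothetical, and the July 2009 cutoff date is an assumption based on the relaunch mentioned earlier.

```r
# Toy rows standing in for the real data; values are illustrative only.
loans <- data.frame(
  LoanStatus = c("Completed", "Chargedoff", "Defaulted", "Current",
                 "Past Due (1-15 days)", "FinalPaymentInProgress",
                 "Cancelled", "Completed"),
  ListingCreationDate = c("2007-03-01", "2008-06-15", "2007-11-20",
                          "2013-09-02", "2013-01-10", "2012-05-05",
                          "2006-02-01", "2011-08-30"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse raw statuses into the two groups defined above.
# Statuses outside both groups (e.g. Current) become NA.
classify_status <- function(status) {
  high_risk <- grepl("^Past Due", status) |
    status %in% c("Chargedoff", "Defaulted")
  completed <- status %in% c("Completed", "FinalPaymentInProgress", "Cancelled")
  ifelse(high_risk, "HighRisk", ifelse(completed, "Completed", NA_character_))
}

loans$StatusGroup <- classify_status(loans$LoanStatus)

# Split at the July 2009 relaunch (assumed cutoff date).
cutoff <- as.Date("2009-07-01")
loans$Period <- ifelse(as.Date(loans$ListingCreationDate) < cutoff,
                       "Before 2009", "After 2009")

# Share of each group within each period (each row sums to 1).
closed <- loans[!is.na(loans$StatusGroup), ]
shares <- prop.table(table(closed$Period, closed$StatusGroup), margin = 1)
shares
```

On the real data, the same table would be the input to the before/after bar chart; the toy shares printed here carry no meaning of their own.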

The bar chart above shows that the proportion of high risk loans decreased after 2009, from about 37% to 30%.

Let’s compare each level of Credit Grade and Prosper Rating, from HR to AA (high to low risk), against Loan Status. I want to check how Prosper Rating and Credit Grade assess both bad and good loans.

The percentage of High Risk loans falls as the risk level decreases, for both Prosper Rating and Credit Grade: the better the rating, the lower the percentage of High Risk loans. We can also see that High Risk loans overall (in green) actually decreased after Prosper Rating was launched.
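A comparison like this boils down to the share of High Risk loans within each rating level. The sketch below shows one way to compute it in base R on made-up rows; the columns Rating and StatusGroup are hypothetical names (in the Prosper CSV the rating would come from CreditGrade before 2009 and from the Prosper Rating column afterwards, which is an assumption about the file layout).

```r
# Hypothetical closed loans: rating level and outcome group (illustrative only).
rated <- data.frame(
  Rating = c("AA", "AA", "A", "A", "C", "C", "C", "HR", "HR", "HR"),
  StatusGroup = c("Completed", "Completed", "Completed", "HighRisk",
                  "Completed", "HighRisk", "Completed", "HighRisk",
                  "HighRisk", "Completed"),
  stringsAsFactors = FALSE
)

# Order the levels from highest to lowest risk, as in the text.
rating_levels <- c("HR", "E", "D", "C", "B", "A", "AA")
rated$Rating <- factor(rated$Rating, levels = rating_levels, ordered = TRUE)

# Share of HighRisk loans within each rating level.
# Levels with no loans in this toy sample come back as NA.
high_risk_share <- tapply(rated$StatusGroup == "HighRisk", rated$Rating, mean)
round(high_risk_share, 2)
```

Keeping the rating as an ordered factor means the result is already sorted from HR to AA, which is the order a bar chart would need.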

I further grouped the loans by credit level, from AA to HR (low to high risk), for both HighRisk and Completed loans:

The chart above shows that the number of loans with good ratings decreased after 2009 in both Completed and High Risk loans, which implies that Prosper conducted stricter loan auditing after 2009. Furthermore, High Risk loans decreased overall compared with before 2009, as shown in the previous plot, although High Risk loans rated D and E still increased after 2009.

It can be inferred that:

  • Prosper conducted stricter loan auditing after 2009.
  • Prosper Rating performed better at assessing high risk loans than the Credit Grade applied before 2009.

Components of Prosper Rating

We have seen from the Prosper data that Prosper Rating performed well. So how is the Prosper Rating determined?

According to this page, Prosper Rating is determined by the Estimated Loss Rate, which in turn is determined by two scores: 1) a custom Prosper Score and 2) a credit score from a consumer credit reporting agency (like Experian). So I will look further into Prosper Score and Credit Score to see how they make Prosper Rating more accurate than Credit Grade.

1. Prosper Score

According to the Prosper website, Prosper Score was built using historical Prosper data to assess the risk of Prosper borrower listings. It ranges from 1 to 11, with 11 being the lowest risk and 1 the highest.

The graph above shows that Prosper Score has a roughly bell-shaped distribution in the Prosper data, with spikes at scores 4, 6 and 7 and fewer loans at both the lowest-risk and highest-risk ends.

Grouping Prosper Score by Loan Status, we can see that the distribution is left-skewed for completed loans, which means completed loans mostly sit at good scores. For high risk loans, however, Prosper Score is bell-shaped. Compared with the left-skewed shape of Prosper Rating shown before, Prosper Score on its own seems less able to detect high risk loans.

The other component of Prosper Rating is the credit score from a reporting agency. In this data set, the variables related to this score are CreditScoreRangeLower and CreditScoreRangeUpper. I created a new variable, CreditScoreAverage, by averaging the two, to serve as a representative variable for the credit score.
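As a minimal sketch, the averaged variable can be built in one line of base R. The rows below are made up for illustration; only the three variable names come from the text above.

```r
# Toy rows with the two range columns; values are illustrative only.
scores <- data.frame(
  CreditScoreRangeLower = c(640, 700, NA, 520),
  CreditScoreRangeUpper = c(659, 719, NA, 539)
)

# Midpoint of the reported range; rows with a missing bound stay NA.
scores$CreditScoreAverage <-
  (scores$CreditScoreRangeLower + scores$CreditScoreRangeUpper) / 2

scores
```

Because the arithmetic is vectorised, a missing lower or upper bound simply propagates as NA rather than raising an error.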

2. Credit Score Average

Before 2009, Prosper did not allow individuals with a credit score (Experian Scorex PLUS) below 520 to post listings on the platform. After 2009, Prosper raised the minimum credit score to 640, although in some cases it allowed a minimum of 600 if the borrower had previously completed a Prosper loan. So I split the graph into two periods and limited the minimum score on the x-axis to 510 and 630 respectively, to exclude the outliers from special cases.

CreditScoreAverage is right-skewed both before and after 2009, with most loans between 610 and 670 before 2009 and between 670 and 710 after 2009. Overall, the average credit score after 2009 was clearly higher than before 2009. The reason is that Prosper did set a higher threshold on borrowers’ credit scores after 2009, which also matches the observation in the previous section on Prosper Rating.
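The period-specific lower limits on the score (510 and 630) can be applied with a small lookup vector before summarising. The sketch below uses made-up scores; the Period column and the min_score lookup are hypothetical names, and the median is just one convenient summary, not necessarily the one used for the plots.

```r
# Toy scores per period; values are illustrative only.
credit <- data.frame(
  CreditScoreAverage = c(489.5, 529.5, 649.5, 609.5, 649.5, 689.5, 709.5),
  Period = c("Before 2009", "Before 2009", "Before 2009",
             "After 2009", "After 2009", "After 2009", "After 2009"),
  stringsAsFactors = FALSE
)

# Lower limits from the text: 510 before 2009, 630 after 2009.
min_score <- c("Before 2009" = 510, "After 2009" = 630)

# Keep only loans at or above the limit for their own period.
kept <- credit[credit$CreditScoreAverage >= min_score[credit$Period], ]

# Typical score per period after trimming the special cases.
period_median <- tapply(kept$CreditScoreAverage, kept$Period, median)
period_median
```

Period is kept as a character column on purpose: indexing min_score with a factor would use the factor's integer codes instead of the names.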

But how do Credit Score Average and Prosper Score make the Prosper Rating more accurate? Does Credit Score Average distinguish between Completed and High Risk loans before and after 2009? I grouped the Credit Score Average by loan status (Completed and High Risk) before and after 2009:

Comparing Completed and High Risk loans, the graphs above show nearly identical right-skewed distributions in both periods. It seems that CreditScoreAverage does not help to tell Completed loans from High Risk loans either before or after 2009, apart from the change of threshold.

It turns out that if Prosper had used only the credit score for auditing, then under the stricter assessment after 2009 (a higher threshold), most borrowers’ credit scores would have fallen in the high-risk tiers, even for loans with a high probability of being completed. However, since Prosper also brought in Prosper Score, Prosper Rating measures risk much better and discriminates much better between completed and high risk loans.

Investigation so far…

Let’s make a brief summary. After 2009, Prosper applied the Prosper Score so that Prosper Rating could discriminate better between bad loans and completed loans, on top of a stricter threshold on the bureau score. So we can say that Prosper Score plays an important role in the Prosper Rating metric.

Let’s use the data to check this assumption:

The graphs above show a slightly positive trend between Prosper Rating and Prosper Score, with the values of Prosper Score fairly concentrated within each Prosper Rating. By comparison, the variance of Credit Score Average within each Prosper Rating is somewhat larger. It seems that Prosper puts more weight on Prosper Score than on Credit Score Average in its own Prosper Rating model.
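One hedged way to put numbers on this comparison is to standardise both scores (they live on very different scales) and then look at the spread within each rating, alongside a rank correlation with the ordered rating. The rows below are made up and deliberately constructed to mirror the pattern described, so they illustrate the calculation only and are not a result; the helper z and the Rating column are hypothetical names.

```r
# Hypothetical post-2009 loans, built to mimic the described pattern.
post <- data.frame(
  Rating = c("HR", "HR", "HR", "C", "C", "C", "AA", "AA", "AA"),
  ProsperScore = c(2, 3, 4, 5, 6, 7, 9, 10, 11),
  CreditScoreAverage = c(649.5, 709.5, 669.5, 689.5, 649.5, 729.5,
                         709.5, 769.5, 829.5),
  stringsAsFactors = FALSE
)

# Put both scores on a common scale (z-scores over all loans).
z <- function(x) (x - mean(x)) / sd(x)

# Standard deviation of each standardised score within each rating.
spread <- data.frame(
  ProsperScoreSD = tapply(z(post$ProsperScore), post$Rating, sd),
  CreditScoreSD = tapply(z(post$CreditScoreAverage), post$Rating, sd)
)
spread

# Rank correlation of each score with the rating ordered from HR to AA.
rating_rank <- match(post$Rating, c("HR", "E", "D", "C", "B", "A", "AA"))
cor_prosper <- cor(rating_rank, post$ProsperScore, method = "spearman")
cor_credit <- cor(rating_rank, post$CreditScoreAverage, method = "spearman")
c(ProsperScore = cor_prosper, CreditScoreAverage = cor_credit)
```

A smaller within-rating spread and a higher rank correlation for one score would be consistent with that score carrying more weight in the rating, though neither number reveals the actual model.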

Next step: Lifting the Veil of Prosper Score

So what are the major elements of Prosper Score? I went through Prosper’s annual reports from 2010 and 2013 and found some information about Prosper Score:

Prosper Score is built to estimate the likelihood that a loan will go 61+ days past due. Unlike a credit score from a credit reporting agency, which is based on a much broader population, Prosper Score is based on a more precise picture drawn from the smaller population of the lending platform.

Interesting. I infer that if Prosper measured borrower credit only through a traditional credit bureau, it would in fact be assessing loans much like a bank or other official lending institution. Prosper Score considers borrower behaviors that are unique to the platform population. Such a custom assessment may be more suitable for the lending platform market, because it is built specifically on the Prosper borrower and applicant population. After all, the lending platform offers borrowers an additional channel when they cannot borrow from a bank, which measures credit in a stricter way. The lending platform also spreads the risk across many investors, which makes its way of measuring risk very different.

Hence, I searched for the major elements of Prosper Score. I found from different sources, like the website or this one, that Prosper Score has been composed of different sets of elements over time. To avoid making the report too long, I am not going to explore all the related features in the Prosper data. Instead, I will choose some variables that I think are important and that are also covered by the variable lists from these sources.

In the next part, I will explore the major features that may be related to Prosper Score and that may make Prosper Rating more discriminating in evaluating the quality of loans.

Note: For more detailed exploration results, see my report on RPubs and the code on GitHub!
