Utilizing your Data Science Project (Part 1)

Putting a Lending Club machine learning model into production

Tim Copeland
Towards Data Science


[Image: the graphical user interface for my LendingClub dashboard application]

One of the strongest trends in the data science industry over the past few years is an increased emphasis on deploying machine learning models in a production environment. Employers are expecting more than just feature engineering and modeling. Your ability to perform at least some basic software engineering tasks could make or break the job search process.

I’m currently in the midst of said job search process. Especially since I come from a non-tech background, I didn’t want to limit myself to just being able to fit, predict, and score models without being able to step outside that sandbox. In most real-world applications (no pun intended), a model will be part of a larger product. Building a simple program that puts your model into production introduces you to ideas that are unique to a production environment. In my case, I also wanted to prove I could learn a new library or use case that I wasn’t given an explicit curriculum for.

There is another, even more obvious reason why you might want to put your model into production: you get to use it, or share it with others! I live in a state where investors are not allowed to use LendingClub, and I wanted the option to share my model with friends or family who could use it, whether they knew Python or not.

For those unfamiliar, LendingClub is a peer-to-peer lending marketplace. Essentially, borrowers apply for loans and are assigned an interest rate by LendingClub. Individual investors then choose loans to fund, raising capital for a loan much like a crowdfunding campaign. As an investor, your returns vary based on the loans you choose (both their interest rates and their default rates). Therefore, if you can better predict which borrowers will pay back their loans, you can expect better investment returns.

For one of my personal projects, I built a model that reliably selected a portfolio of loans that beat the market average. After my model was completely fine-tuned (i.e. the "data science" part was complete), there were five main steps to putting the model into production.

Saving/serializing your model

When you’re presenting a model in a Jupyter notebook, it doesn’t really matter how long the model takes to run. In a production environment, however, speed usually matters. Even in cases where we are constantly receiving new data, it is rare that a model will need to be retrained right away. Depending on the use case, training might happen only once, at a certain time every day, when triggered by an external event, or on some other schedule entirely. For our purposes, training the model once, while we’re developing our application, will suffice.

In order to utilize our model in a functional manner, we need to save the results for future use. This is also known as serialization. A simple way to do this in Python is with the pickle library.

From the library documentation:

The pickle module implements binary protocols for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy.

All we need to know is that this library gives us an easy way to save a fitted model, so that we can load it when our application starts, instead of having to fit a potentially time-consuming model every time we want to run our program.

I set up two Python programs. The first contains all the code to train the model that my app is going to use. The last thing this program does is save the model for future use.

import pickle

# Fit the model, then serialize it to disk for later use
rf.fit(X, y)
pickle.dump(rf, open('rf.sav', 'wb'))

Then, in our user-facing program, when we want to reload our model:

# Load the serialized model when the application starts
rf = pickle.load(open('rf.sav', 'rb'))

This loads an instance of our model into Python and lets us work with it just as if we had trained it in the current session.

Loading live data using the LendingClub API

This step will vary greatly depending on the specific problem you are dealing with. It might involve scraping a web page, loading data put into a database, or loading customer information for a customer viewing your web page. In this case, we want to utilize our model to let us know which currently fundable loans are likely to give us the highest return on investment. In order to do that, we need a live source of data.

Luckily, LendingClub provides an easy-to-use API that we can communicate with using the requests library, which makes HTTP requests simple and lets you interact with web pages and APIs in just a few lines of code. You can find the documentation here.

The first two lines of code specify the headers and parameters for our request. The apikey variable is set when a user logs in (and enters their API key), and Content-Type dictates the format of the data the API returns. LendingClub lists new loans at four different times throughout the day; the showAll parameter tells the API whether to return all fundable loans or only the loans from the most recent listing.

header = {'Authorization': apikey, 'Content-Type': 'application/json'}
params = {'showAll': 'True'}

The following code makes a request from the Lending Club API:

import requests

url = 'https://api.lendingclub.com/api/investor/v1/loans/listing'
resp = requests.get(url, headers=header, params=params)

Finally, we extract loan information from the response and format it as a pandas data frame.

import pandas as pd

loans = resp.json()['loans']
loans = pd.DataFrame.from_dict(loans)

We now have live data from LendingClub in a familiar format.
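For reuse inside the app, these steps can also be wrapped in a small helper function. The sketch below uses the same endpoint, headers, and parameters shown above; the function name and the raise_for_status() check are my own additions.

import pandas as pd
import requests

def fetch_listed_loans(apikey, show_all=True):
    """Return currently fundable LendingClub loans as a pandas DataFrame."""
    url = 'https://api.lendingclub.com/api/investor/v1/loans/listing'
    header = {'Authorization': apikey, 'Content-Type': 'application/json'}
    params = {'showAll': str(show_all)}
    resp = requests.get(url, headers=header, params=params)
    resp.raise_for_status()  # surface HTTP errors immediately, not downstream
    return pd.DataFrame.from_dict(resp.json()['loans'])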

Processing live data

I won’t go too in-depth here, but it’s worth noting that if you care about accurate predictions, any pre-processing steps you performed on the training and testing data must be performed on the live data as well. This includes outlier removal, transformations, and so on.

This does have implications for some decisions we make during data cleaning and feature engineering. For example, a log transformation is dramatically simpler to reproduce identically in production than a Box-Cox transformation, which also requires the lambda parameter fitted on the training data to be saved and reloaded.
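As a hypothetical sketch of what this step can look like (the column name and transformation are illustrative, not my actual feature set):

import numpy as np
import pandas as pd

def preprocess(loans):
    """Apply the same transformations used on the training data to live loans."""
    processed = loans.copy()
    # A log transform only needs the same formula re-applied at prediction time...
    processed['log_annual_inc'] = np.log1p(processed['annualInc'])
    # ...whereas a Box-Cox transform would also need the fitted lambda persisted
    # and reloaded, just like the model itself.
    return processed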

Predicting repayment probability with the model

Procedurally, this isn’t too different from predicting probabilities for a test data set. However, this will be the first time you unleash the model on live data, for which you don’t yet know the outcome. You’re predicting the future! This is the moment you’ve been working towards through the many months of tutorials or classwork, so embrace it!
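In code, the step boils down to a single predict_proba call on the processed live loans. The feature list below is a stand-in for whatever columns the model was actually trained on:

# Score the live loans with the loaded model and rank them by the predicted
# probability of full repayment (assumed here to be class 1).
feature_cols = ['log_annual_inc', 'dti', 'intRate']  # illustrative feature list
processed = preprocess(loans)
loans['repay_prob'] = rf.predict_proba(processed[feature_cols])[:, 1]
top_loans = loans.sort_values('repay_prob', ascending=False)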

Implementing the model with a user interface

This is another section that will vary dramatically depending on the use case. Some models, like a recommendation system, will affect how an application behaves. My case is a little simpler. I’m just displaying predicted probabilities along with live loan information.

My first program was simply a script that printed a data frame with model results to the terminal. This still forced me to think about many of the lessons above, and I think it’s a perfectly worthwhile exercise even if that’s as far as one takes it. However, I also wanted to build a graphical user interface (what most people think of as a program or app). I felt it would make the program much easier to share with others, and my inability to create an app was a gap I wanted to remedy.

I settled on PyQt5 for building the interface. The process was fairly involved, and I write about it in more detail in part two of this article. Several online walk-throughs, along with the documentation, made it possible to pick up. Qt Designer, which lets you lay out a window visually and generates code for you, was also incredibly helpful.
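For a flavor of what displaying scored loans in PyQt5 involves, here is a minimal sketch of a window with a results table. It is not the layout of my actual dashboard (part two covers that), and the sample data is made up:

import sys
import pandas as pd
from PyQt5.QtWidgets import (QApplication, QMainWindow, QTableWidget,
                             QTableWidgetItem)

class LoanWindow(QMainWindow):
    def __init__(self, scored_loans):
        super().__init__()
        self.setWindowTitle('LendingClub Loan Dashboard')
        # One table cell per value in the scored-loans DataFrame
        table = QTableWidget(len(scored_loans), len(scored_loans.columns))
        table.setHorizontalHeaderLabels(list(scored_loans.columns))
        for i, row in enumerate(scored_loans.itertuples(index=False)):
            for j, value in enumerate(row):
                table.setItem(i, j, QTableWidgetItem(str(value)))
        self.setCentralWidget(table)

if __name__ == '__main__':
    # Made-up rows standing in for the live, scored loans from the earlier steps
    scored_loans = pd.DataFrame({'loanAmount': [10000, 4000],
                                 'intRate': [13.5, 8.2],
                                 'repay_prob': [0.91, 0.97]})
    app = QApplication(sys.argv)
    window = LoanWindow(scored_loans)
    window.show()
    sys.exit(app.exec_())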

Conclusion

As an applicant, productizing your model shows that you are more versatile than someone who only knows how to evaluate models in a Jupyter notebook. You’ve actually thought about the real-world uses of your model, and potentially weighed trade-offs that you wouldn’t have for a more theoretical project. Creating a graphical user interface shows the ability to learn a new library and branch out beyond the typical data science toolbox.

Depending on your specific project, there might also be a serious tangible benefit to putting your model into an end product. I’m definitely looking forward to developing my app and model further so I can share them with friends and family. If you are currently seeking a job in data science, or happen to have a fun side project that would benefit from actually being used, you should consider doing the same!

If you’re interested in the basics of how I set up the graphical user interface (GUI) for this program, check out part two here.
