Exploring Machine Learning Solutions for Credit Scoring Farmers in Kenya

Emmanuel Blonkowski
Towards Data Science
14 min read · Jun 10, 2020


Image by SunCulture, used with permission.

Introduction

This article covers a data science project for SunCulture, a solar energy startup based in Nairobi, Kenya, providing solar water pumps, televisions, and lighting on credit to small farmers.

It highlights the different paths that we explored during the project, what worked and what didn’t, and what we’ve learned along the way. Our hope is that the information contained in this article will help other companies working on financing in Africa learn from our experience.

Small-scale farmers in Africa have long been underserved by the traditional banking sector and have very few financing options. But how can a lender know whether they will pay back? Traditional credit scoring uses past credit history as the main indicator, but for many SunCulture customers this information is not available.

We proposed to use a combination of Internet-of-Things (IoT) data sources, including soil sensors and pump usage from the irrigation system: data that was already available to the company.

Over a period of 7 months, I was tasked with exploring the solution space for credit scoring using sensor data. The idea was to create an algorithm that takes a farmer's IoT sensor data as input and outputs the probability that he or she will default. In technical terms, this is a time series classification problem.

Time series data have always been of major interest to financial services, and now, with the rise of real-time applications, other areas such as retail and programmatic advertising are turning their attention to applications driven by time series data.

In the last couple of years, several key players in cloud services have released new products for processing time series data. It is therefore of great interest to understand the role and potential of Machine Learning in this rising field.

We'll talk about a failed first attempt, which led us to explore the solution space. As the number of trials grew, we found the need to organize experiments and manage datasets. Finally, at the end of the mission, we realized the importance of dealing with a salient feature of our data: the low number of defaults. I'll start by introducing the farmers, since I was lucky enough to meet them in the field:

Meeting the farmers

I was looking for missions online when I got in contact with Charles Nichols, CTO and Co-Founder of SunCulture.

Since I happened to be based in Kenya at the time, they invited me to visit their headquarters and sent me to the mountain region of Nanyuki to meet some of their clients (farmers).

The goal of this field trip was to understand the issues farmers were facing and see opportunities for technical solutions.

Alex Gitau helped me understand the basics of farming. Five good criteria for a farm's success are:

  • Soil fertility, as measured by the nutrients present
  • Water characteristics
  • Water availability in terms of volume
  • The size of the land
  • The market value of the crops grown.

I spoke to Patrick Ngetha, a new farmer, who explained his financial issues to me. There is a lot of variability and uncertainty in both market prices and harvest yields, caused for example by pests. Revenues also come in irregularly, only a few times a year.

The main costs of his operation were buying seeds and paying workers. Farmers need to buy high-quality seeds from companies, otherwise the vegetables don't find their market.

The workers need to be paid at harvest time, that is, before the money from sales comes in, which further stresses the cash flow.

He had the potential to grow his farm, which was then about half an acre, but he lacked the funding.

He had experimented with different crops, and Alex Gitau, the SunCulture agronomist and my guide, was confident that he had the skills necessary to take his activity to the next level. However, he was not able to get funding from the banks.

It was very rewarding to go into the field and realize that the project had great potential for social impact, so I was very motivated to start on our first model.

A Rocky start with our first architecture: TimeNet

TimeNet is a deep learning model introduced by P. Malhotra et al. in the paper TimeNet: Pre-trained deep recurrent neural network for time series classification. It is an unsupervised method: in layman's terms, it converts the raw sensor data into a much smaller summary vector. We then use a classic machine learning method, in our case logistic regression, to do the final classification based on that summary.
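To make the two-stage setup concrete, here is a minimal sketch in Python. The `encode` function below is only a stand-in for the pretrained TimeNet encoder (the real one is a recurrent autoencoder trained separately); only the overall pattern of summary vector plus logistic regression reflects what we did.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def encode(series):
    # Stand-in for the pretrained TimeNet encoder: here we just use
    # per-channel mean and standard deviation as the "summary vector".
    # The real encoder is a recurrent autoencoder producing an embedding.
    return np.concatenate([series.mean(axis=0), series.std(axis=0)])

def fit_default_classifier(series_list, defaulted):
    # series_list: one (timesteps, channels) array of sensor data per farmer
    # defaulted:   1 if the farmer defaulted, 0 otherwise
    X = np.stack([encode(s) for s in series_list])
    clf = LogisticRegression(max_iter=1000, class_weight="balanced")
    clf.fit(X, defaulted)
    return clf

# Scoring a new farmer:
# p_default = clf.predict_proba(encode(new_series)[None, :])[0, 1]
```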

Sounds simple? Well, we had a lot of issues.

The main one was getting the model to actually converge. It happened frequently that, in the middle of a training run that was going quite well, the model suddenly "forgot everything". This is a common problem when training recurrent neural networks and is known as gradient explosion.

In order to solve this issue we tried scaling up our compute capabilities and started training on multiple GPU instances. After a whole month and about 15 experiments we finally gave up, as this setup had not been able to correctly classify a single loan default.
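For readers who hit the same problem: the standard guard against gradient explosion is to clip the gradient norm at every training step. The snippet below is a generic PyTorch illustration of that guard on a toy recurrent classifier, not our actual training loop.

```python
import torch
import torch.nn as nn

class TinyRNN(nn.Module):
    # Toy recurrent classifier, only to make the example self-contained.
    def __init__(self, n_sensors=4, hidden=32):
        super().__init__()
        self.rnn = nn.GRU(n_sensors, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):          # x: (batch, time, sensors)
        _, h = self.rnn(x)
        return self.head(h[-1]).squeeze(-1)

model = TinyRNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

x = torch.randn(8, 500, 4)             # dummy batch of sensor series
y = torch.randint(0, 2, (8,)).float()  # dummy default labels

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
# Rescale gradients whose global norm exceeds 1.0 so a single bad batch
# cannot blow up the weights ("gradient explosion").
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```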

Why didn’t TimeNet work? It was too much of a research model and there was not enough data to learn from.

Exploring the solution space

We then moved on to the models proposed in Time Series Classification from Scratch with Deep Neural Networks: A Strong Baseline by Zhiguang Wang et al. This is a more popular article (as can be seen from its roughly 500 citations) and it comes with code for the three models it introduces. Also, training these models is very simple, which meant convergence would not be an issue.

We started training with their ResNet model. Unfortunately, the first runs didn't succeed, most probably because of the length of our time series, which were longer than most of those explored in the paper.

Convolutional models like ResNet have a "pattern size" called the receptive field, which is the length of the longest pattern they can possibly detect. For example, if the receptive field is 24 hours, then the model can only detect daily patterns and will not be able to detect a weekly regularity.

The receptive field of the default ResNet came out to a few dozen hours on our data, which we felt was insufficient. This turned out to be the key observation: we finally got our first working model by enlarging the receptive field to about a week (using strided convolutions).
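To make the receptive field argument concrete, here is a small back-of-the-envelope calculator. The kernel sizes, strides, and hourly sampling rate below are illustrative assumptions, not our actual configuration.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, first layer first.
    Returns the receptive field in input time steps."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the field by (k - 1) input jumps
        jump *= stride              # striding spreads out the samples the next layer sees
    return rf

# Illustrative stacks of four conv layers with kernel size 7:
plain   = [(7, 1)] * 4              # no striding
strided = [(7, 2)] * 4              # stride 2 at every layer

print(receptive_field(plain))       # 25 time steps
print(receptive_field(strided))     # 91 time steps
# At one reading per hour, 25 steps is about a day while 91 steps is about
# four days: striding is what pushes the model toward weekly patterns.
```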

This model was about 4 times more specific than a dummy baseline.

The next period, from November to January, was focused on trying as many different architectures as we could.

We tried 14 different models. Some were state-of-the-art, like InceptionTime, some were more classic machine learning, like Dynamic Time Warping (DTW), and some were our own, like ConvRNN.

Layer layout of the ConvRNN model: the convolution outputs are piped directly into the RNN layer.

ConvRNN is a deep learning architecture consisting of a few convolutional layers followed by a recurrent layer. All the layers are arranged sequentially, as shown in the diagram above. It is an end-to-end model: the input is the sensor data and the output of the last layer is the default probability.

Convolutions take a time series as input and also produce a time series as output. The output length is divided by the stride; for example, a stride of 2 will halve the length.

Since several strided layers are applied in sequence, we can achieve a very large reduction in the time series length. In some sense, the convolutions act as smart "sub-samplers".

This reduction in size is important if the sensor data time series are long. Recurrent neural networks can only effectively be trained on short time series, on the order of hundreds of points.

The recurrent layer would diverge if trained on the original data, so the convolutions are needed to shorten the time dimension.

Finally, we noted that the receptive field of this model is theoretically infinite. This is the advantage of the recurrent layer, which is able to combine and accumulate information over long ranges. As noted in the ResNet discussion, a large receptive field allows us to understand a farmer's behavior at time scales ranging from days to seasons.
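Here is a minimal PyTorch sketch of the ConvRNN idea as described above. The channel counts, kernel sizes, and strides are illustrative guesses, not the configuration we ran.

```python
import torch
import torch.nn as nn

class ConvRNN(nn.Module):
    """Strided convolutions sub-sample the series, a GRU accumulates
    information over the shortened sequence, and a linear head outputs
    the default probability. Hyper-parameters here are illustrative."""
    def __init__(self, n_sensors=4, channels=32, hidden=64):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(n_sensors, channels, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
        )  # three stride-2 layers make the time dimension roughly 8x shorter
        self.rnn = nn.GRU(channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                  # x: (batch, time, sensors)
        z = self.convs(x.transpose(1, 2))  # -> (batch, channels, time / 8)
        _, h = self.rnn(z.transpose(1, 2)) # GRU over the shortened sequence
        return torch.sigmoid(self.head(h[-1])).squeeze(-1)  # default probability

model = ConvRNN()
x = torch.randn(2, 4096, 4)   # two farmers, 4096 hourly readings, 4 sensors
print(model(x).shape)         # torch.Size([2])
```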

This model had the best performance among our deep learning architectures. The original idea came from Charles Nichols. We have not yet seen this architecture used in the literature.

The lesson we learned about choosing an architecture is not to hesitate to try something new. Even if something has not been done before, it can still work well. That said, you might still want to start your project with an existing model for simplicity.

Organizing experiments

For each of the 14 architectures we had different runs and configurations. At the end of the project, the total number of individual runs was easily in the thousands. A big part of a data science project is managing them efficiently.

Each experiment, whether it brings an improvement in the metrics or not, needs to be documented so that we can use it to guide our future decisions. Some of these experiments were quite long, up to a few days, hence we saved all the key information to avoid having to rerun them later.

At first I used an online wiki system to document the experiments. At some point we realized that a significant portion of my time was spent documenting the experiments rather than running them. We then started using a service called Weights & Biases, which made the whole process much more productive.

A screenshot from Weights & Biases. Here we can see the training curves of different models plotted on the same timeline, which is useful for drawing comparisons.

There were also other issues at this point: processing new data that kept coming in, versioning datasets, and maintaining high data quality. The main tension was between two needs: doing things quickly, which requires fixed processes, and exploring the solution space, which requires flexibility.

Managing Data

We’ll go over three main aspects of data management: data versioning, data processing and data debugging.

Data Version Control, or DVC, is a very useful tool for managing datasets. It allows revision control of data, just like Git does for code.

Every update creates a new version while keeping all previous versions accessible. If a dataset update proves to be a mistake, we can recover a previous version and cancel the changes, so it acts as a safety net.

Working in a distributed environment makes data management more difficult. In our case, we were using up to 5 different machines at the same time, including laptops and cloud instances.

This is common in machine learning where the bulk of work is done on a local machine, but training is done on powerful cloud instances. The challenge is to synchronize the data.

We set up Azure Blob Storage which, through DVC, centralized all the data. It is like an online hard drive that can be accessed from anywhere, as long as you have the credentials. It stored not just the current data but all of its revisions. This blob storage was essentially our data library.

This allowed machines to synchronize and modify the global library by using a system of push and pull. Typically, once a worker machine has finished processing a dataset it pushes its updates to the central storage. Then, a training machine can pull the latest data from the central storage. DVC makes these operations very easy to do.
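As a concrete illustration, DVC also exposes a Python API for reading a dataset pinned to a specific revision; the repository URL, file path, and tag below are placeholders, not our actual setup.

```python
import io
import pandas as pd
import dvc.api

# Placeholder repo, path, and revision: the point is that a training run can
# pin itself to an exact, reproducible version of the dataset tracked by DVC.
raw = dvc.api.read(
    path="data/processed/sensor_daily.csv",
    repo="https://github.com/example/farm-iot-data",
    rev="dataset-v3",        # Git tag or commit identifying the data version
)
df = pd.read_csv(io.StringIO(raw))
print(df.shape)
```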

We used scripts for data processing because they are easier to reuse and maintain than notebooks. Scripts can be called directly from the command line, like any UNIX utility. They operate "file-to-file": both the input dataset and the resulting processed data are written to disk, which simplifies debugging.
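A skeleton of such a file-to-file script is shown below; the column names and the cleaning steps are invented for the example.

```python
#!/usr/bin/env python
"""Example file-to-file processing script: reads a raw CSV from disk and
writes a cleaned CSV back to disk. Column names are illustrative."""
import argparse
import pandas as pd

def main():
    parser = argparse.ArgumentParser(description="Clean raw pump sensor data")
    parser.add_argument("input_csv")
    parser.add_argument("output_csv")
    args = parser.parse_args()

    df = pd.read_csv(args.input_csv, parse_dates=["timestamp"])
    # Illustrative cleaning: drop duplicate readings, sort by sensor and time.
    df = (df.drop_duplicates(subset=["sensor_id", "timestamp"])
            .sort_values(["sensor_id", "timestamp"]))
    df.to_csv(args.output_csv, index=False)

if __name__ == "__main__":
    main()
```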

For data debugging, we realized that unit testing alone is not sufficient: it is easy to miss data errors if we do not actually plot the data. At the same time, data testing is useful for checking the basic sanity of the data, such as outliers and formats. So we decided to use both formal testing and visualizations.

The natural tool for this is the Jupyter notebook. Each one would contain a few plots, such as distribution histograms, and display a few days of data for a randomly chosen sensor. This is very useful as many errors can be detected visually. It would also show basic statistics. In a sense, a notebook could be used as a quick "identity card" for a particular data asset.

Notebooks would also contain tests. Typically, we test for outliers and possible issues in the data. If we expect the data to have a particular property, for example the average is 0, then we write a test for it.
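A few representative checks of the kind our notebooks carried are sketched below; the column names, the file path, and the thresholds are made up for illustration.

```python
import pandas as pd

df = pd.read_csv("data/processed/sensor_daily.csv")   # illustrative path

# Format checks: required columns are present, identifiers are not missing.
assert {"sensor_id", "timestamp", "pump_usage"} <= set(df.columns)
assert df["sensor_id"].notna().all()

# Property checks, e.g. a standardised column should be centred on zero.
assert abs(df["pump_usage_scaled"].mean()) < 0.01

# Outlier check: daily pump hours should stay in a physically plausible range.
assert df["pump_usage"].between(0, 24).all(), "pump hours outside [0, 24]"
```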

Tests are interesting in that they operate on the whole dataset, whereas visualization typically can only display detailed information about a few samples.

Finally, notebooks would be automatically converted to Markdown format and then saved to the wiki for documentation. This process can be made very efficient by using Papermill to execute the notebooks on the fly.
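A sketch of that automation: Papermill executes the report notebook against a chosen dataset, and nbconvert renders the executed notebook to Markdown for the wiki. The paths and the parameter name are placeholders.

```python
import papermill as pm
import nbformat
from nbconvert import MarkdownExporter

# Execute the data-quality notebook, passing the dataset to inspect as a
# parameter (paths and parameter name are placeholders).
pm.execute_notebook(
    "notebooks/data_report.ipynb",
    "reports/data_report_run.ipynb",
    parameters={"dataset_path": "data/processed/sensor_daily.csv"},
)

# Convert the executed notebook to Markdown so it can be published to the wiki.
nb = nbformat.read("reports/data_report_run.ipynb", as_version=4)
body, _ = MarkdownExporter().from_notebook_node(nb)
with open("reports/data_report_run.md", "w") as f:
    f.write(body)
```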

Dealing with the low number of defaults

In the last two months of the project, we focused on the issue of the low number of loan defaults. We had only a few dozen defaults in a test set of thousands of customers. This is a different issue from class imbalance, which refers to a low ratio of defaults; here the problem was a low total number. We did, however, have plenty of non-defaults available.

Deep learning needs a lot of high quality data. It’s very likely that the models could do much better with default numbers in the thousands. All the examples where deep learning shines, such as image recognition, use huge datasets.

We explored day-by-day training. The idea is for the model to look at a single day of data and output whether it belongs to a good or a bad payer. After a prediction is made for each day, we aggregate the outputs into a global score.

The main advantage of this technique is that for every farmer, we have many days of data. In essence, the number of default examples is multiplied by the number of days in the dataset, so we had default examples in the thousands, which is much better suited to deep learning.
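A minimal sketch of the aggregation step, assuming the per-day model has already produced a default probability for every (farmer, day) pair. The simple mean used here is one possible aggregation, not necessarily the exact rule we used.

```python
import pandas as pd

# One row per farmer-day with the per-day model's output (toy numbers).
daily_scores = pd.DataFrame({
    "farmer_id": ["A", "A", "A", "B", "B"],
    "day": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-01", "2020-01-02"],
    "p_default": [0.10, 0.20, 0.15, 0.70, 0.65],
})

# Aggregate per-day probabilities into one global score per farmer.
# The mean is one simple choice; a median or trimmed mean are alternatives.
farmer_scores = daily_scores.groupby("farmer_id")["p_default"].mean()
flagged_bad = farmer_scores > 0.5       # threshold chosen for illustration
print(farmer_scores)
print(flagged_bad)
```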

The downside is that one day of data might not contain enough information to score a farmer.

Thanks to the high volume of good payers in the data, the model became very good at detecting them: it reached a specificity of 99% (versus 95% for a dummy baseline) and a sensitivity of 30%.

The model was apparently able to recognize daily habits of most serious farmers.

It is very difficult to determine “intuitively” who will pay back and who will not.

The time series data showed us that some users used their pump only once, yet always paid back on time. Similarly, some used their pump very regularly over a long period, yet defaulted.

Together with Justin Nguyen, a junior data scientist, I analyzed the main statistical features, such as the average and variance of pump usage.

We considered a total of 36 features. Surprisingly, these basic features brought very little classification power, with only a 30% increase in performance over a dummy baseline.

This means that the distinction between good payers and bad payers comes from usage patterns that are too subtle to be described by simple feature engineering.

Here the farmers are plotted in feature space. We observe very little separation between the good and bad payers. The two axes are the first two principal components of a PCA.

This data exploration also revealed an interesting pattern: farmers who did not fill in certain survey fields were more likely to default. In other words, missing information can improve the prediction and should be explicitly fed to the model.

We also explored other options, such as self-supervised learning. It is a way to use the data without using the labels (default or non-default). Since the quantity of labels was our main issue, it was a good fit, and it gave us interesting results in terms of training time and performance stability.
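One common form of self-supervised pretraining on time series is to train an autoencoder to reconstruct the unlabelled sensor data and then reuse its encoder inside the classifier. The PyTorch sketch below illustrates that generic pattern; it is not our exact setup.

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Generic 1D convolutional autoencoder for self-supervised pretraining
    on unlabelled sensor series (illustrative, not our exact setup)."""
    def __init__(self, n_sensors=4, channels=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_sensors, channels, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv1d(channels, channels, 7, stride=2, padding=3), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(channels, channels, 8, stride=2, padding=3), nn.ReLU(),
            nn.ConvTranspose1d(channels, n_sensors, 8, stride=2, padding=3),
        )

    def forward(self, x):                 # x: (batch, sensors, time)
        return self.decoder(self.encoder(x))

model = ConvAutoencoder()
x = torch.randn(8, 4, 1024)               # unlabelled series: no defaults needed
loss = nn.functional.mse_loss(model(x), x)
loss.backward()
# After pretraining on the full unlabelled dataset, the encoder can be reused
# as the front end of the (label-hungry) default classifier.
```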

Finally, both self-supervised learning and day-by-day training have the advantage of giving us more stable metrics. Since the models are evaluated on tens of thousands of days, their performance estimates are statistically very meaningful. This is especially important when the models go into production.

What we learned about handling a low number of defaults: use specialized techniques such as day-by-day training and self-supervised learning.

Results

We try to convey the quality of our results by comparing them with a baseline, and then show how they can be applied.

Evaluating machine learning projects is always a challenge. Metrics values have little significance by themselves because they depend on the data used. For example, 90% accuracy could be either a very strong performance or a very poor one, depending on the difficulty of the task.

To give a more accurate picture, we discuss our model results by contrasting them to a simplistic baseline model that always predicts the same class.

This way we have a good payer baseline that always predicts that a farmer will pay back, no matter the sensor readings. Similarly, the bad payer baseline always predicts default.

Note that some readers may object that this baseline is too simple. However, it is very difficult to detect "bad payers" just from looking at the sensor data, so this naive baseline is actually close to human-level performance. Furthermore, we also tried slightly more sophisticated baselines, but they did not perform significantly better.

We evaluate the models in terms of their specificity and sensitivity. Specificity is the ratio of true negatives over all actual negatives. Intuitively, a high specificity (close to 100%) means that there are few false alarms.

Sensitivity is the ratio of true positives over all actual positive cases. Intuitively, a high sensitivity means that there are few false negatives (missed cases).
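For concreteness, here is how the two metrics are computed from the confusion matrix; the label vectors are made up.

```python
import numpy as np

def specificity_sensitivity(y_true, y_pred):
    """Arrays of 0/1 labels, where 1 is the positive class (e.g. 'bad payer')."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    specificity = tn / (tn + fp)   # close to 1 means few false alarms
    sensitivity = tp / (tp + fn)   # close to 1 means few missed positives
    return specificity, sensitivity

# Toy example with made-up labels.
y_true = [1, 0, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0, 0, 1]
print(specificity_sensitivity(y_true, y_pred))   # (0.8, 0.666...)
```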

Bad payer classification results
Good payer classification results

We can apply these results to credit attribution. Clearly, the risk of a loan increases with both the probability of default and the amount of the loan.

The global probability of default is 5%, but the models are able to spot groups of farmers where this probability is significantly higher or lower. We can then use this information to correspondingly adapt the loan amounts.

Let's give an example. If the good payer classifier labels a new farmer as a "good payer", then we know that the likelihood of default is 1% (since the specificity is 99%), which is 5 times lower than for the rest.

With this input we know that the risk is low and it makes sense to attribute a larger loan to this farmer. Of course, giving out larger loans translates into larger returns for the company.

Similarly, we can use the bad payer model to reduce the loan amounts for farmers flagged as "bad payers", which will in turn reduce our losses on average.

How many farmers could be impacted by a loan increase or decrease? Since the sensitivity is around 30%, we can estimate that it would affect around a third of all clients.

Conclusion

Bringing lending to small-scale African farmers would allow them to grow economically and improve their lives. This segment makes up a significant proportion of the continent's population, so a lot of people could potentially benefit from such a solution.

Credit scoring from sensor data is difficult because the data is only indirectly predictive of financial stability. The particularities of our data were the long time series and the low number of defaults. All these aspects have to be taken into account to build an effective solution. The three main technical takeaways from this project are:

  • Starting with popular research articles that come with code
  • Taking the time to find an efficient way to organize experiments, from data preparation to documentation
  • Understanding the main characteristics of the data: a state-of-the-art model will only perform well on data similar to what it was designed for.

I am thankful to SunCulture and Charles Nichols for giving me this opportunity and hope that this report will turn out to be beneficial for others. We also express our thanks to Microsoft Airband for giving us free Azure credits.

