
Predicting Mobile Financial Service Adoption with Machine Learning

A project walkthrough for building multi-class classification models for customer identification and targeting

Photo by William Felker on Unsplash

Introduction

Mobile money in Africa has rapidly evolved from its traditional role as a payment service to a gateway for millions on the continent to gain access to an ever-increasing array of financial products and services. For banks and other traditional financial service providers, future profitability will greatly depend on their ability to form partnerships with mobile carriers and accurately target subscribers on the network with financial service offerings that are relevant.

There is a compelling business argument for effective customer targeting and cross-selling: for banks, digital channels with a high uptake boost low-cost deposit mobilization and increase lending capacity; for mobile carriers, digital financial product offerings that meet subscriber needs deepen engagement and increase retention.

In this post, I will explore how machine learning can be used to classify individuals into one of four categories based on the types of financial services they are most likely to use. This is an example of multi-class classification, where an algorithm learns a mapping from a set of input features to a categorical target variable that takes on more than two values.

The Data

The dataset for this project was originally prepared for the Mobile Money And Financial Inclusion In Tanzania Challenge and was made available by Zindi. The training data contained 36 socioeconomic and demographic attributes on approximately 7,100 individuals across Tanzania. The target variable described the types of financial services used by an individual, grouped into four mutually exclusive categories:

0. No_financial_services: Individuals who do not use mobile money, do not save, do not have credit, and do not have insurance

1. Other_only: Individuals who do not use mobile money, but do use at least one of the other financial services (savings, credit, insurance)

2. Mm_only: Individuals who use mobile money only

3. Mm_plus: Individuals who use mobile money and also use at least one of the other financial services (savings, credit, insurance)

To enrich the training data, a geospatial mapping of all financial access points in the country (ATMs, bank branches, mobile money agents, and so on) was provided. Regional, demographic, economic and other contextual data on ArcGIS was also available to create additional input features.

The original code can be accessed through this GitHub link and the data can be found here.

Exploratory Data Analysis

EDA is an important first step that allows us to get a feel for the data and to form initial hypotheses about the functional form of the relationship between the input features and the target.

Reading the Data

First, read the data and rename the raw feature codes with more intuitive column names for easier analysis.
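A minimal sketch of this step (the file name and the column mapping below are illustrative, not taken from the original repo):

import pandas as pd

# Load the training data (file name is an assumption)
df = pd.read_csv('training.csv')
# Rename a few raw feature codes to more intuitive names (mapping is illustrative)
df = df.rename(columns={'Q1': 'Age', 'Q2': 'Gender', 'Q3': 'Marital_Status'})
df.head()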

Image by Author

Most features appear to be categorical. Let’s verify this by looking at the number of unique values per feature:

df.nunique().sort_values(ascending=False).head(10)
ID                   7094
Latitude             7056
Longitude            7055
Age                    85
Goods_Sold             11
Services_Provided      11
Educ                    8
Last_Received           7
Last_Sent               7
Employer_Type           7
dtype: int64

All but four of the features (ID, Latitude, Longitude and Age) appear to be categorical, a mixture of nominal and ordinal data. Check for missing values:

df.isnull().sum().sum()
0

Preprocessing the Data

When dealing with categorical attributes, category levels that contain few training examples add little information value and unnecessarily contribute to a large overall number of features once we apply one-hot encoding. More on this later.

First, drop all input features with a single category level that contains more than 95% of the training examples:

# Exclude the target 'Service_Type' from the filter
for col in df.columns.difference(['Service_Type']):
    if df[col].value_counts(normalize=True).max() > 0.95:
        df.drop(columns=col, inplace=True)

Five features were eliminated based on this criterion. We still have features with category levels that have very few data points. Here’s an example:

df['Goods_Sold'].value_counts()
1     3300
-1    2712
6      329
3      223
5      169
7      117
4       88
2       58
8       51
9       26
10      21
Name: Goods_Sold, dtype: int64

For each of the remaining categorical input features, select all category levels with fewer than 100 training examples and bucket them under ‘other’:

# New arbitrary category level
new_level = 'other'
# Min number of examples per category level
floor = 100
# Categorical input features. Exclude identifiers, continuous features and the target
categorical = list(df.columns.difference(['ID', 'Latitude', 'Longitude', 'Age', 'Service_Type']))
for col in categorical:
    all_levels = df[col].value_counts().index
    sparse_levels = df[col].value_counts()[df[col].value_counts() < floor].index
    # Map sparse levels to 'other' and leave the rest unchanged
    combine_levels = {x: new_level if x in sparse_levels else x for x in all_levels}
    df[col] = df[col].map(combine_levels)

The transformed feature:

df['Goods_Sold'].value_counts()
1        3300
-1       2712
6         329
other     244
3         223
5         169
7         117
Name: Goods_Sold, dtype: int64

Visualizing the Data

Now that we’re done with some basic preprocessing of the data let’s take a look at the frequency distribution of the target:

Image by Author

‘Mm_plus’ accounts for 44% of the observations while ‘Mm_only’ accounts for only 11%, so there is class imbalance. Whether or not to adopt a resampling strategy to achieve an even target distribution will depend on a number of factors. For one, downsampling might increase our ability to accurately predict each target class, but it results in the loss of valuable training examples that help predict the majority class. We’ll come back to this in our discussion on modeling.

Now let’s explore feature interactions. We will start with mobile phone ownership and the target:

Image by Author

As we might expect, mobile phone ownership is a good predictor of mobile money use.

This is informative, but with 31 features we need a more efficient way to analyze interactions. Pearson’s correlation is not appropriate here: it is computed by normalizing the expected value of the product of the deviations of two variables from their respective means, and since there is no quantitative relationship between nominal category levels, those means are not meaningful. Cramér’s V measures the degree of association between two categorical features and is based on Pearson’s chi-squared statistic. We will use the bias-corrected formula for Cramér’s V from Wikipedia to generate a heatmap of pairwise associations. A great discussion by Shaked Zychlinski on different association measures, with Python implementations, can be found in this post.

The function for computing Cramér’s V, with some minor edits:

import math
import pandas as pd
from scipy.stats import chi2_contingency

def cramersv(x, y):
    # Contingency table of the two categorical variables
    c_matrix = pd.crosstab(x, y)
    r, k = c_matrix.shape
    n = c_matrix.values.sum()
    chi2, p, dof, e = chi2_contingency(c_matrix)
    phi2 = chi2/n
    # Bias-corrected phi-squared and table dimensions
    phi_hat2 = max(0, phi2 - (k-1)*(r-1)/(n-1))
    k_hat = k - math.pow(k-1, 2)/(n-1)
    r_hat = r - math.pow(r-1, 2)/(n-1)
    v_hat = math.sqrt(phi_hat2/(min(r_hat, k_hat)-1))
    return v_hat
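With the function in hand, a pairwise association matrix can be assembled and passed to a heatmap. This is a sketch under the assumption that the categorical list from the preprocessing step is still available:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise Cramer's V for every combination of categorical features and the target
cols = categorical + ['Service_Type']
assoc = pd.DataFrame(np.zeros((len(cols), len(cols))), index=cols, columns=cols)
for a in cols:
    for b in cols:
        assoc.loc[a, b] = cramersv(df[a], df[b])
sns.heatmap(assoc, cmap='viridis')
plt.show()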

The heatmap:

Image by Author

Mobile phone ownership and recency and frequency of financial transactions are strongly associated with the target, ‘Service_Type’, and with each other. Some recency and frequency measures are perfectly associated with each other which suggests that there is scope for further feature elimination.

Enriching the Data

Regional Location

Let’s test our intuition that a greater proportion of individuals who do not use financial services will tend to live in rural regions in Tanzania like Ruvuma and Tabora as opposed to urban centers like Dar-es-Salaam.

First, use reverse geocoding with ArcGIS to map individuals’ regional location from their approximate latitude and longitude coordinates:

from arcgis.gis import GIS
from arcgis.geocoding import reverse_geocode

gis = GIS()
df_region = pd.DataFrame()
region = []
error = []
for i in range(len(df)):
    try:
        # reverse_geocode expects [longitude, latitude]
        results = reverse_geocode([df['Longitude'].iloc[i], df['Latitude'].iloc[i]])
        region.append(results['address']['Region'])
        error.append(0)
    except Exception:
        # Record a failed lookup and keep the row aligned
        region.append(None)
        error.append(1)
df_region['Region'] = region
df_region['Error'] = error

Then plot the distribution of the target by region:
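One way to produce this plot (a sketch; the original plotting code is not shown here):

import matplotlib.pyplot as plt

# Share of each service type within each region
plot_df = pd.crosstab(df_region['Region'], df['Service_Type'], normalize='index')
plot_df.plot(kind='barh', stacked=True, figsize=(8, 10))
plt.xlabel('Share of individuals')
plt.show()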

Image by Author

As expected, residing in a remote region like Unguja appears to increase the likelihood that an individual will fall into the ‘No_financial_services’ category.

Proximity To Financial Infrastructure

Now let’s test the theory that greater proximity to financial access points (e.g. a bank branch) is associated with higher levels of financial inclusion and vice versa. For each type of access point (e.g. ATM, third-party payment provider) we will use the haversine formula from Wikipedia’s great-circle distance article to compute the distance between an individual’s location and the nearest access point from their respective latitude and longitude coordinates. Here is the function:

import numpy as np

def haversine(coord1, lat_2, lon_2):
    lat_1, lon_1 = coord1
    r = 6371  # Earth's mean radius in km
    phi1 = np.radians(lat_1)
    lambda1 = np.radians(lon_1)
    phi2 = np.radians(lat_2)
    lambda2 = np.radians(lon_2)
    delta_phi = phi2 - phi1
    delta_lambda = lambda2 - lambda1
    # Haversine formula: hav(theta) = (1 - cos(theta))/2
    d = 2*r*np.arcsin(np.sqrt((1 - np.cos(delta_phi))/2 +
        np.cos(phi1)*np.cos(phi2)*(1 - np.cos(delta_lambda))/2))
    return d

The routine to calculate the minimum distance to each access point type:

# Create a dataframe to store computed distances
df_points = pd.DataFrame()
# Individual lat/lon coords
indiv = df[['Latitude', 'Longitude']].apply(tuple, axis=1).values
# Lat/lon coords of each access point type
access_points = {'mma': df_mma, 'bank': df_bank, 'atm': df_atm, 'sacco': df_sacco,
                 'tppp': df_tppp, 'pos': df_pos, 'bus': df_bus, 'mfi': df_mfi, 'post': df_post}
# Compute min distance to each access point type
for i, point_type in access_points.items():
    min_distance = []
    for j in range(len(indiv)):
        # Vectorized distances from one individual to every access point of this type
        distance = haversine(indiv[j], point_type.latitude.values, point_type.longitude.values)
        min_distance.append(np.amin(distance))
    df_points[i] = min_distance

Next, plot the distribution of the distance from each individual to the nearest mobile money agent, grouped by target class:

Image by Author

Individuals in the ‘No_financial_services’ and ‘Other_only’ categories tend to live farther away from mobile money agents than those in ‘Mm_plus’ and ‘Mm_only’. Notice how the distributions for ‘Mm_plus’ and ‘Mm_only’ are heavily skewed to the right.

Feature Encoding

One-Hot Encoding

We saw that much of our dataset consists of nominal features with some ordinal data. Now that we’ve cut down the number of categorical features and combined sparse category levels, we’ll use pandas’ pd.get_dummies to convert them to dummy variables, remembering to pass the argument drop_first=True to avoid multicollinearity.
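A minimal sketch of this step, assuming categorical still holds the names of the remaining categorical columns:

# One-hot encode the categorical features, dropping the first level of each
df_encoded = pd.get_dummies(df, columns=categorical, drop_first=True)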

Discretization

For continuous features, such as the distance to the nearest financial access point and ‘Age’, we will apply equal frequency binning to handle skew and outliers.
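For example, with pandas’ qcut (the number of bins and the column names here are assumptions):

# Equal frequency binning: each bin holds roughly the same number of observations
df['Age_binned'] = pd.qcut(df['Age'], q=5, labels=False, duplicates='drop')
df_points['mma_binned'] = pd.qcut(df_points['mma'], q=5, labels=False, duplicates='drop')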

Target Encoding

Location attributes that we obtained by reverse geocoding such as ‘District’ are high-cardinality nominal attributes:

df_mean.District.value_counts().describe()
count    169.000000
mean      41.976331
std       26.654525
min        7.000000
25%       23.000000
50%       35.000000
75%       53.000000
max      143.000000
Name: District, dtype: float64

One-hot encoding ‘District’ would increase the dimensionality of the data by 168 features. With limited training examples this may lead to overfitting. Instead, we will use the M-estimate encoder from the Category Encoders library to compute the average value of the target for each ‘District’ and replace each ‘District’ with the computed average. To mitigate overfitting, M-estimate uses additive smoothing to blend each district’s average with the average over all districts. For districts with few training examples, M-estimate can be parametrized to rely more heavily on the overall average and vice versa. Here we will set ‘m’ to 25, slightly above the lower quartile of frequency counts for ‘District’:

import category_encoders as ce

encoder = ce.MEstimateEncoder(m=25)
# target_labels holds the three binary target columns (see the note below)
for label in target_labels:
    df_mean['Mean_' + label] = encoder.fit_transform(df_mean.District, df_mean[label])

Image by Author

Note that since this is a multi-class problem, we first one-hot encoded the target into four binary columns. We then dropped one column and mean encoded ‘District’ with the remaining three.
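A sketch of that prior step (variable names are assumptions):

# One-hot encode the four-class target and drop one column to avoid redundancy
target_dummies = pd.get_dummies(df_mean['Service_Type'], drop_first=True)
df_mean = pd.concat([df_mean, target_dummies], axis=1)
target_labels = target_dummies.columns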

Modeling

Now we’ll examine the relative performance of three well known classifiers on our final dataset: Support Vector Machines, Logistic Regression and Random Forests.

SVMs classify observations by searching for optimal hyperplanes in n-dimensional space that separate positive and negative examples. When classes are not linearly separable SVMs use kernel functions to map the original input space into a higher dimensional feature space in which to search for an optimal decision boundary.

Logistic regression assumes a linear relationship between the input features and the log-odds of class membership. Optimal feature weights maximize the log likelihood over the training data.

Random Forest algorithms construct multiple decision trees, each of which is trained on a randomly selected subset of the training data, and then combine the predictions of individual trees to produce more accurate and stable class predictions.

We will perform k-fold cross-validation with k = 10 folds for each model to assess overall performance. Cross-validation results are more reliable because a single train-test split yields only one statistic, which may vary greatly depending on the observations selected for the training and test sets.

from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

svm_clf = LinearSVC()
lr_clf = LogisticRegression()
rf_clf = RandomForestClassifier(random_state=5)

kfold = KFold(n_splits=10, random_state=5, shuffle=True)
moments = ['Mean', 'Stdv']
models_df = pd.DataFrame(index=moments)
df_accuracy = pd.DataFrame()
models = {'SVM': svm_clf, 'LR': lr_clf, 'RF': rf_clf}
for k, v in models.items():
    stats = []
    # 10-fold cross-validated accuracy, expressed as a percentage
    cv_result = cross_val_score(v, features, labels, cv=kfold, scoring='accuracy')*100
    df_accuracy[k] = cv_result
    stats.append(cv_result.mean())
    stats.append(cv_result.std())
    models_df[k] = stats

Results

The performance metric employed by Zindi for this challenge was accuracy score. Here is a box plot of the distribution of results over the 10 folds:

Image by Author

The mean and standard deviation of the scores:

Image by Author

Based on these results, even though SVM and logistic regression achieve similar overall accuracy, at 68.3% and 68.5% respectively, SVM would be preferable for deployment because it exhibits lower variability in performance.

Baseline Accuracy

To establish a baseline score we will use the ZeroR classifier. This model simply predicts the majority class for all examples. This benchmark strategy yields a 44% accuracy.
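The same baseline can be reproduced with scikit-learn’s DummyClassifier (a sketch, not the original code):

from sklearn.dummy import DummyClassifier

# ZeroR: always predict the most frequent class ('Mm_plus')
zero_r = DummyClassifier(strategy='most_frequent')
baseline = cross_val_score(zero_r, features, labels, cv=kfold, scoring='accuracy')
print(baseline.mean())  # roughly 0.44, the frequency of the majority class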

Class Imbalance

With imbalanced datasets accuracy is not always an appropriate performance measure. In cases of extreme class imbalance even poor models can achieve a high score. To get a feel for how we did on each of the class labels let’s examine the confusion matrix for the SVM model:
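A sketch of how these diagnostics can be generated from a held-out test set (the split parameters shown here are assumptions):

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=5, stratify=labels)
svm_clf.fit(X_train, y_train)
y_pred = svm_clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))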

Image by Author

The classification report:

Image by Author

The confusion matrix shows that ‘Mm_only’ is almost always misclassified as ‘Mm_plus’, the majority class label. Recall for ‘Mm_only’ is almost nil. The model also incorrectly classifies almost 50% of ‘No_financial_services’ as ‘Other_only’.

Returning to our earlier discussion on the trade-offs involved with imbalanced datasets, we note that the objective of the challenge was to accurately predict ‘Mm_plus’, the majority class label. This is analogous to a business strategy that prioritizes a specific customer segment. Here we have a 78% precision and a recall of 91%, significantly outperforming the other class labels – a satisfactory result for the task at hand.

What if we were equally concerned about correctly classifying each target label? With limited training data or extreme class imbalance the loss of observations from downsampling is often too costly. An alternative solution is upsampling. SMOTE (Synthetic Minority Oversampling Technique) randomly selects observations from the minority class and uses their nearest neighbors in the feature space to synthetically generate new observations. Here is the implementation on the training set after performing an 80%-20% train-test split on the data:

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=5)
# fit_resample replaces the deprecated fit_sample in current imbalanced-learn
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
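Retraining the SVM on the balanced training data and evaluating on the untouched test set (a sketch, assuming the split and imports from the earlier step):

# Fit on the resampled training set, evaluate on the original test set
svm_clf.fit(X_train_smote, y_train_smote)
print(classification_report(y_test, svm_clf.predict(X_test)))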

The classification report after running the SVM model on the test set:

Image by Author

Recall for ‘Mm_only’ has increased dramatically. It has also risen modestly for ‘No_financial_services’. However, accuracy has degraded from 68.3% to 64.1%. As expected, precision for ‘Mm_plus’ increased to 87% while recall decreased substantially after the model’s majority class bias was mitigated.

Ultimately, the decision as to whether or not to represent each target class in proportion to its overall frequency will depend on the relative penalty associated with misclassifying one type of potential customer for another.

Conclusion

The goal of this post was to provide an end-to-end view of how we can use supervised learning to refine customer targeting and cross-selling strategies for mobile network operators and financial service providers. Support Vector Machines, Logistic Regression and Random Forests each showed promising results in the task of classifying individuals into mutually exclusive categories representing varying levels of financial inclusion. Still, there are a variety of ways in which the model performance could be enhanced including hyperparameter tuning and additional feature engineering.

Thanks for reading!

References

[1] T. Cook, Why Fintech matters: Reflecting On FSD Kenya’s Work (2018), FSD Kenya Blog

[2] Digital Access: The Future Of Financial Inclusion In Africa (2018), International Finance Corporation

[3] Cramér’s V (2020), Wikipedia, Wikimedia Foundation

[4] S. Zychlinski, The Search for Categorical Correlation (2018), Towards Data Science

[5] Haversine formula (2020), Wikipedia, Wikimedia Foundation

[6] W. McGinnis et al., Category Encoders: a scikit-learn-contrib package of transformers for encoding categorical data (2018), The Journal of Open Source Software

[7] Additive Smoothing (2020), Wikipedia, Wikimedia Foundation

[8] N. Chawla et al., SMOTE: Synthetic Minority Over-sampling Technique (2002), Journal of Artificial Intelligence Research

