Who are the unbanked in Morocco?

Insights from a 2017 World Bank survey using supervised learning (Part One)

Recently, the Moroccan government sent monthly allowances to citizens who have been economically affected by COVID-19. The massive operation was coordinated through mobile phones: people entered their information and the type of assistance they were eligible for, and then received a PIN code that they later used to withdraw money from an ATM.

My grandfather, who was a beneficiary of the program, called me excitedly after he retrieved his money from the bank. He had never used an ATM before and was amazed at how the machine handed him the cash without him needing to interact with a human.

I was perplexed: Why was my 80-year-old grandfather – whose excitement about the internet, smartphones & WhatsApp has long waned – suddenly excited by an ATM? Truth is, it had never occurred to me that my grandpa had never had a bank account, a debit card, a loan, or any contact with a banking institution at all!

So I set out to understand the millions of Moroccans who, like my grandfather, have never had a banking product. I knew that the official estimate of the "banked" share of the population hovered around 56% as of 2017 (Source: Bank al Maghrib), but I wanted to understand the micro-economic characteristics of the unbanked population: Who are they?

Luckily, the World Bank conducted a detailed Global Financial Inclusion Survey in 2017 and made the data available to the public (and it’s free!). The survey sampled ~5100 Moroccans and asked a wide range of questions about their demographics and usage of banking products. The analysis below is derived from that data. I do have some doubts about the quality of the data and its conclusions, which I present at a later stage of this blog post.


1. Descriptive Statistics

Of the ~5100 sampled individuals in the Morocco survey, only 28% reported having had a banking product in 2017. While the official statistic of the banked population was close to 56% in that year, this sample – supposedly representative of the Moroccan population – suggests a much lower statistic. More on my doubts on the survey in a later paragraph. For now, let’s accept the sampling and the data collection methodologies as they are.

The share of the banked population depends largely on the demographic characteristics of the sample population. Below I show a set of graphs that describe these characteristics.

The first significant difference is at the gender level. As the graph below shows, men are more than 2.5 times as likely as women to have a banking product. Indeed, only 17% of the surveyed women had a banking product, compared to 45% of the men.

Next, when we group survey respondents into age bins of equal size (each bin on this graph contains the same number of people), we see only a mild difference between age groups, except for those aged 15–25. This suggests that factors other than age might be more important.

Those who are in the workforce are 2.5 times more likely to have a banking product than those out of the workforce (whether unemployed, retired or at school).

When we look at the economic status of the respondents, we see that those in the top 20% of income are ~3.5 times more likely than those in the bottom 20% of income to have a banking product. Again, I think it’s strange that only 50% of those in the top income quintile have a bank account, but we’ll take this data as is for now.

Finally, the most striking difference is the one explained by education: those with a tertiary education or more are almost 4 times as likely as those with only primary education to have a bank account. As you’ve probably guessed, the sample disproportionately contains respondents with only primary education.

Of course, in real life, demographic characteristics like these do not exist in a vacuum. Indeed, they interact with one another: a woman in the bottom 20% of income and out of the workforce is different from a woman in the top 40% of income and in the workforce. The next set of graphs expands on the above to try to understand whether certain combinations of characteristics are more striking than others in explaining access to banking products in Morocco.

The graph below tells us that the "banking gap" between genders in Morocco becomes smaller as the level of education attained rises: from ~3x at the primary education level to ~1.5x at the tertiary education level.

Interestingly, the graph below shows that with high educational attainment, even those in the lowest income quintile tend to have a banking product: 67% of those in the bottom 20% of income report having one. This suggests that education might be more important than income level in determining access to banking products.

To understand the relative importance of each of these characteristics on determining who is likely to have a bank account, I will explore machine learning techniques in the next sections.

(Random thought: If you’re reading this diligently, now is a good time to take a break and listen to this Cheb Khaled song)


2. Who is likely to have a bank account in Morocco? A machine learning approach

I use a logistic regression model and a decision tree model to understand the relative importance of each of the characteristics presented above.

A. Logistic Regression

Logistic regression is a classification method that predicts whether an outcome will happen or not (e.g., will you own a banking product?) given a set of predictors (e.g., demographic characteristics). For more details on the methodology, read this Towards Data Science post.

a. Data prep and selecting predictors

The premise of supervised learning (of which logistic regression is just one method) is building a model that can learn to predict a given outcome as accurately as possible.

To accomplish this, we need a training dataset (which the model will use to learn) and a testing dataset (on which the model will apply what it has just learnt). I thus divided my data into two random samples: a training sample (75% of the ~5100 respondents) and a testing sample (the remaining 25%).
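A minimal sketch of such a 75/25 split, written in Python for illustration only (the analysis in this post was done in R). The helper name `train_test_split`, the seed, and the stand-in respondent IDs are assumptions, not the code behind the post:

```python
import random

def train_test_split(rows, train_share=0.75, seed=42):
    """Shuffle a copy of `rows` and cut it into a training and a testing list."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    cut = round(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the ~5100 survey respondents (IDs only, no real data).
respondents = list(range(5100))
train, test = train_test_split(respondents)
print(len(train), len(test))
```

Shuffling before cutting is what makes the two samples random; every respondent lands in exactly one of them.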

The next step was to choose the set of predictors that would go into the model. The descriptive statistics in the first section gave me an idea of the predictors that would likely build a good model. After some tuning and trying out different predictor combinations, I landed on the following predictors:

Predict has_banking_product as a function of gender + education_level + the interaction of gender and education + employment_status + income_quintile + age
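To make the formula above more concrete, here is a rough Python sketch of how categorical predictors and the gender-education interaction could be turned into numeric columns before fitting. The column names, category labels, and reference groups are assumptions made for illustration; R's glm() does this encoding internally from the formula.

```python
def encode(person):
    """Turn one respondent (a dict) into the numeric columns behind the formula."""
    female = int(person["gender"] == "female")
    secondary = int(person["education_level"] == "secondary")
    tertiary = int(person["education_level"] == "tertiary")
    row = {
        "female": female,          # reference group: male
        "secondary": secondary,    # reference group: primary education
        "tertiary": tertiary,
        # Interaction terms: non-zero only for women at that education level.
        "female_x_secondary": female * secondary,
        "female_x_tertiary": female * tertiary,
        "in_workforce": int(person["employment_status"] == "in_workforce"),
        "age": person["age"],
    }
    # Income quintiles 2-5 as dummy columns; quintile 1 is the reference group.
    for q in range(2, 6):
        row[f"income_q{q}"] = int(person["income_quintile"] == q)
    return row

# Hypothetical respondent, not taken from the survey.
example = {"gender": "female", "education_level": "tertiary",
           "employment_status": "in_workforce", "income_quintile": 5, "age": 34}
encoded = encode(example)
print(encoded)
```

The interaction columns are what allow the effect of education to differ between men and women, which is the pattern the descriptive graphs hinted at.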

b. Running and evaluating the model

I built my model using the standard glm() function for logistic regression in R, and I used it to generate a "predicted" column within my testing dataset. There are many ways to evaluate whether a model is good:

(i) Accuracy – how often the "predicted" column matches the "actual" column within my testing dataset: 78% in my model.

(ii) Confusion matrix – This is a table that shows the breakdown of the predicted outcome vs. the actual outcome. It is a simple way to see what we predicted to be true when in fact it is false, and what we predicted to be false when in fact it is true.

You can see from this confusion matrix that 15% of the time, my model predicted a person to have a banking product when in reality they did not. Similarly, 7% of the time, the model predicted a person to not have a banking product when in reality they did have one. The rest of the time (78%), the prediction was correct; this is the accuracy mentioned above.
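For readers who want to see the mechanics, the sketch below counts the four cells of a confusion matrix and derives accuracy from them. It is plain Python on a tiny made-up list of outcomes, not the survey data or the R code used for this post; the function names are hypothetical.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count each (actual, predicted) pair; 1 = has a banking product."""
    counts = Counter(zip(actual, predicted))
    return {
        "true_positive": counts[(1, 1)],
        "false_positive": counts[(0, 1)],  # predicted banked, actually not
        "false_negative": counts[(1, 0)],  # predicted unbanked, actually banked
        "true_negative": counts[(0, 0)],
    }

def accuracy(matrix):
    """Share of predictions that match the actual outcome."""
    correct = matrix["true_positive"] + matrix["true_negative"]
    return correct / sum(matrix.values())

# Toy data for illustration only.
actual    = [1, 0, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 1, 0, 0, 0, 0]
matrix = confusion_matrix(actual, predicted)
print(matrix, accuracy(matrix))
```

In the testing data described above, the false-positive cell would hold 15% of cases and the false-negative cell 7%, leaving 78% in the two "correct" cells.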

More on confusion matrices here.

So how do we interpret all of this?! Well, both measures above suggest that the model is not perfect, and misses predictions 22% of the time in the testing data. This could be because the data is missing crucial characteristics that also influence whether a person has a banking product. These characteristics could be other demographics not included in the survey, such as whether the person lives in an urban or rural area, the type of job they have, whether they have dependents, their income level, etc.

c. Interpreting the model

If we assume the 78% accuracy above as "good enough", we can now look at the results of the model to try to understand the relative importance of the demographic characteristics in influencing whether a person will have a banking product.

What the graph above tells us is what the descriptive statistics intuitively guided us through: the most important predictor of having a banking product is whether someone has a tertiary education, followed by whether they belong to the top 20% of income, followed by whether they are in the workforce.

How to interpret this graph in more detail? The points in the graph are the probability estimates predicted by the model, the bars show the standard errors (a measure of the uncertainty around each estimate), and the asterisks mark the estimates that are statistically significant (unlikely to be due to chance alone). One way that has helped me interpret regression estimates in the past is to think about two identical people. In this example, pretend we have just cloned a person. However, one of the clones has a tertiary education and the other does not; all other characteristics are the same. The model above suggests an 89% probability of having a banking product for the clone with a tertiary education, relative to her clone. What is crucial in regression interpretation is to keep all other characteristics constant when looking at the effect of a single variable.
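The "clone" comparison can be sketched numerically. A logistic regression works on the log-odds scale, and the inverse logit function turns log-odds into a probability. The coefficients below are hypothetical values picked for illustration, not the fitted estimates from this post:

```python
import math

def inv_logit(log_odds):
    """Convert log-odds into a probability between 0 and 1."""
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical coefficients on the log-odds scale (NOT the model's real estimates).
INTERCEPT = -1.5
COEF_TERTIARY = 2.0

def predicted_probability(has_tertiary, other_terms=0.0):
    """Probability of owning a banking product; `other_terms` is held constant."""
    return inv_logit(INTERCEPT + COEF_TERTIARY * has_tertiary + other_terms)

# Two clones: identical in everything except tertiary education.
clone_without = predicted_probability(0)
clone_with = predicted_probability(1)
print(round(clone_without, 2), round(clone_with, 2))
```

Because `other_terms` is the same for both clones, any difference between the two probabilities comes from the tertiary education coefficient alone, which is exactly the "all else equal" reading described above.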


B. Decision Tree Model

A decision tree is a supervised learning algorithm that classifies the data into an outcome (true or false) based on a set of predictors. Decision trees repeatedly partition the data into binary subsets (e.g., is male or female, is employed or not, age: above 45 or below 45) until no further useful partition is possible. More on them here.
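How does a tree decide which predictor to split on? One common criterion is Gini impurity: the tree picks the split whose two subsets are the "purest" in terms of outcome. The sketch below shows that idea in Python on six made-up respondents; it illustrates the principle and is not a reimplementation of rpart(). All names and data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 outcomes (0 = perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def split_impurity(rows, feature):
    """Weighted Gini impurity after splitting `rows` on a binary feature."""
    left = [r["banked"] for r in rows if r[feature] == 0]
    right = [r["banked"] for r in rows if r[feature] == 1]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy respondents for illustration only.
rows = [
    {"in_workforce": 1, "male": 1, "banked": 1},
    {"in_workforce": 1, "male": 0, "banked": 1},
    {"in_workforce": 1, "male": 1, "banked": 1},
    {"in_workforce": 0, "male": 1, "banked": 0},
    {"in_workforce": 0, "male": 0, "banked": 0},
    {"in_workforce": 0, "male": 0, "banked": 1},
]

# The feature with the lowest impurity after the split becomes the first split.
best = min(["in_workforce", "male"], key=lambda f: split_impurity(rows, f))
print(best)
```

The same procedure is then repeated inside each subset, which is how the tree grows its branches.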

a. Selecting predictors

Similarly to the above, I select the following predictors and run a decision tree model on them (I removed the continuous age variable and the interaction between education and gender to keep things simple).

Predict has_banking_product as a function of gender + education_level + employment_status + income_quintile

b. Running and evaluating the model

I built my model using the rpart() function in R, and I used it to generate a "predicted" column within my testing dataset. Let’s evaluate how good the model is by looking at the indicators below:

(i) Accuracy: also 78%.

(ii) Confusion matrix:

Compared to the logistic regression model, you can see that the proportions for each bucket are very similar, suggesting that the two models did a similar job predicting the outcome.

c. Interpreting the model

The visual representation of a decision tree model is... surprise... a TREE!

That’s a big tree! Trees are not easily interpretable, but we can try to make some sense of this one. Let’s look at the end points (colored here in red and green, and conveniently called leaves) – they contain three numbers within them:

  • Whether the outcome is 1 (has a banking product) or 0 (does not have a banking product). The outcomes 1 are green, the outcomes 0 are red.
  • The probability that the above outcome will occur. For example, in leaf number 1, on the far left, that figure is 0.11.
  • The percentage of the sample that has been categorized into this leaf. Again, if you look at leaf number 1, that figure is 39% (this is to say that out of the 5,100 respondents, the model predicts around 1,989 will belong to leaf number 1)

To determine which set of characteristics has the most influence on whether someone has a banking product, let’s look at the leaves with the highest probabilities of achieving the outcome. In the graph above, let’s look at leaf 16, which has a probability of 0.84:

  • On leaf 16, let’s look at how we got from the top of the tree to the leaf by following the path traced by the tree. The tree is first split by employment status (in workforce on the right, outside workforce on the left), then by education level (Tertiary on the right, Secondary on the left), and then by income level (Richest 20% on the right, the rest on the left). What this tells us is that, for you to have an 84% chance of having a bank account, you have to be in the workforce, have a tertiary education and be in the richest 20% income level.
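Read this way, a leaf is just a chain of yes/no conditions. The path to leaf 16 can be written as a simple rule, sketched here in Python; the field names and category labels are assumptions, and only the three splits and the 0.84 probability come from the tree described above.

```python
LEAF_16_PROBABILITY = 0.84  # probability reported for leaf 16 in the tree

def in_leaf_16(person):
    """True if a respondent follows the path described for leaf 16."""
    return (
        person["employment_status"] == "in_workforce"   # first split
        and person["education_level"] == "tertiary"     # second split
        and person["income_quintile"] == 5              # third split: richest 20%
    )

# Hypothetical respondents, not taken from the survey.
a = {"employment_status": "in_workforce", "education_level": "tertiary", "income_quintile": 5}
b = {"employment_status": "in_workforce", "education_level": "secondary", "income_quintile": 5}
print(in_leaf_16(a), in_leaf_16(b))
```

Every other leaf can be read the same way, as a different combination of answers to the same questions.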

While it is hard to go through each leaf to try to understand how we got to it, a simpler way to read decision trees is to look at the first few splits. Usually, those can be interpreted as the most important predictors. In this case, we can see that employment status, gender and education level are the strongest predictors. This is consistent with the findings of the regression model as well.

(Random thought: maybe now is a good time to listen to this song I really like)


3. Conclusion and a note on limitations

The descriptive and predictive methodologies outlined in this post both point to certain demographic characteristics having more importance than others in determining whether a Moroccan will own a banking product. Namely, those are employment status, income level and education level (specifically, having a tertiary degree). The predictive analyses even suggest that gender alone might not have the outsize effect we initially thought about when looking at the outcome distribution by gender. Indeed, it is more the fact that Moroccan women are less likely than men to have a tertiary degree, be at the top quintile of income, or be in the workforce, that is influencing their ability to own a banking product.

The models presented above, and their conclusions, are limited by their good but not great accuracy (both around 78%). We would need to include more variables or fine-tune existing ones to reach a higher predictive power, which in turn could help us explain things better.

Finally, I wanted to include a note on the World Bank data on which this analysis is based. Specifically, I wanted to verify that the dataset is indeed representative of the population. Below, I compare the proportions within the dataset and those published at the macro level (Source: World Bank macro data):

  • Proportion of women in dataset (59%) vs. in overall population (50%)
  • Proportion of tertiary degree holders in dataset (7%) vs. in overall population (35%)
  • Proportion of people in the workforce in dataset (39%) vs. in overall population (48%)
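The size of each gap is easy to compute. The short Python sketch below simply subtracts the population share from the dataset share for the three figures listed above; the variable names are my own, and the numbers are the ones quoted in the list.

```python
# (share in dataset, share in overall population), in percent, as listed above.
shares = {
    "women": (59, 50),
    "tertiary degree holders": (7, 35),
    "people in the workforce": (39, 48),
}

# Gap in percentage points: positive means over-represented in the dataset.
gaps = {name: sample - population for name, (sample, population) in shares.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:+d} percentage points")
```

A positive gap means the group is over-represented in the dataset, a negative gap that it is under-represented.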

If we believe that the dataset is somewhat skewed and does not reflect the reality of the Moroccan population, then the results unfortunately cannot be generalized to the entire population. This sampling bias could explain why the dataset reports only 28% of Moroccans as having a bank account, while official statistics from the Central Bank put that number at 56% in 2017. I might have missed something in my analysis, but the official World Bank report also reports the same numbers I reported here. If you are a stats expert reading this and think I missed something crucial, please get in touch 🙂


4. Final thoughts

Now that I have explored what makes a Moroccan more likely to own a banking product, I understand why my grandfather was so excited at how the ATM handed him his COVID-19 stimulus cash.

Next, I hope to take a closer look at this "unbanked" population and try to understand it better: can we use unsupervised learning techniques (e.g., clustering) to group the population according to characteristics beyond demographics? Stay tuned for my next post 🙂

As usual, if you’ve read all the way to here: Choukran, Thank you, Gracias, Merci. If you are Moroccan, let me know if this is worth translating.

PS. Read Part Two here.

