Designing and developing machine learning models and A/B testing experiments to reduce disengaged users in FinTech
FinTech represents the digital disruption of the finance industry. Startups like Wise (formerly known as TransferWise), Monzo, and many others started with remittance and money transfer services and have grown into borderless digital banks. Competition in FinTech is getting tougher, as prominent players like Samsung, Walmart, and Google have established their own e-wallet services.
To survive the competition, FinTech institutions pursue different strategies. Some have expanded their services to embrace cryptocurrencies, while others work together in symbiotic relationships. One example is the integration of Wise and Google Pay, enabling their user bases to send money overseas. As in almost any B2C business model, one common strategy is to retain the user base by keeping current users happy.
This exercise aims to demonstrate a practical machine learning approach to this issue, starting from problem definition through to intervention:
- Defining engagement
- Predicting engagement (data pipeline, feature extraction, and model training using pySpark and Scikit-Learn)
- Model evaluation
- Designing intervention (contextual recommender system using ALS)
- Designing A/B Testing
- Evaluating and interpreting the results
Data and Codebase
The data is a structured, synthesized dataset in the form of a denormalized table representing basic information about financial transactions. It contains a detailed transaction log (inbound, outbound, fees, etc.) for each customer over a 19-month timeframe, including transaction amount, timestamp, status, and basic demographic and location information. The main goal of this exercise is to demonstrate methods and approaches for addressing the issue, and less about the data itself.

Data, notebooks, and complete codebase are available at: https://github.com/kristiyanto/fintech-user-engagement.
Part I: Defining and Predicting Engagement
Defining Engagement
Engagement can translate into many different meanings. One of the most common approaches to defining engagement in this scenario is checking whether a user has made a transaction within a certain amount of time. For example, an engaged user is a user who has used the service at least once in the past 14 days, with 14 days as the heartbeat.
While straightforward, this one-size-fits-all approach may miss particular groups of users. Naturally, FinTech users use the service in different ways. Some may make occasional or monthly remittances to send money to family abroad. Others may use the service daily with debit cards or receive inbound transfers as a merchant.

A better approach is to learn user behavior patterns from the historical data. Instead of the total number of transactions over a period of time, counting the days between transactions provides a better signal. Each individual has a different threshold, and a user is marked as disengaged once that threshold is breached.
In this case, a disengaged user is a user who has not made any transaction for longer than usual. A monthly user, for example, is considered a drop-off user if it has been more than a month since their last transaction, while a weekly user is considered a drop-off user if it has been more than a week. Since patterns can change over time, another variation of this solution is to apply a cutoff date (e.g., only consider data from the last 6 months).
Machine Learning Interpretation
Early detection is vital to ensure that there is enough time to perform an intervention. Once a user drops off, it is usually too late. Using the historical data, we can train a machine learning model to make predictions. There are various machine learning interpretations applicable to this problem, for example:
- Time-series forecasting or regression: Forecast the number of activities a user or a group of users will make in a period
- Survival model or regression: Predict when the next action will occur at the individual level
- Classification: Predict if a user will be a dropout user at the individual level
However, they also come with limitations: forecasting at the individual level can be a demanding task if the sole purpose is to identify outliers. Similarly, predicting the next transaction with good accuracy may require an elaborate machine learning solution such as deep learning (RNN/LSTM).
In this particular case, the third option (classification) is simple enough to execute and iterate on. The goal is to predict whether a user will be categorized as a disengaged user next month. The model is designed to run monthly, providing a one-month window to perform the intervention and evaluation.
y = end_of_month(days_between_transaction > 90th_percentile_of_days_between_transaction)
ŷ = next_month(y)
Limitation: Adequate information is needed to learn users' behavior patterns. Users with fewer than ten lifetime activities are excluded from both training and testing; a different approach can be designed to address these users. Ten is an arbitrary number and can be adjusted based on business needs or on the data (e.g., the population average).
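For reference, below is a minimal pandas sketch of this labeling rule, under the assumption of a transaction table with user_id and transaction_ts columns (both names are illustrative); the production version belongs in the pyspark pipeline.

```python
# Minimal sketch of the per-user drop-out label; column names are hypothetical.
import pandas as pd

def label_dropouts(transactions: pd.DataFrame, month_end: pd.Timestamp) -> pd.Series:
    """Flag users whose gap since their last transaction exceeds their own
    90th percentile of days between transactions, computed as of month_end."""
    history = transactions[transactions["transaction_ts"] <= month_end]
    ordered = history.sort_values("transaction_ts")
    gaps = ordered.groupby("user_id")["transaction_ts"].diff().dt.days
    threshold = gaps.groupby(ordered["user_id"]).quantile(0.90)
    days_since_last = (month_end - ordered.groupby("user_id")["transaction_ts"].max()).dt.days
    return (days_since_last > threshold).rename("is_dropOut")

# y     = label_dropouts(transactions, this_month_end)
# y_hat = label_dropouts(transactions, next_month_end)  # next month's label is the target
```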
Machine Learning Pipeline

This solution uses two data pipelines: a) a Spark pipeline for data ingestion, clean-up, feature extraction, and data exploration. The input of this pipeline is the raw data file, and the output is a structured features dataset that can be used for various downstream tasks, including dashboarding, model training, and the intervention and A/B testing (Part II of this manuscript).
b) a Scikit-Learn pipeline for model-specific data tasks. The input is the feature data (the output of the Spark pipeline), and the output is the trained model/prediction. This pipeline feeds in the features and performs additional data pre-processing, imputation, feature selection (both univariate and multivariate), and model training.
Feature Extraction
Although the training data is small enough to fit on a single machine, feature computation uses pyspark to ensure that the pipeline can run at a larger scale. Features are assembled in a pyspark custom transformer and executed in a Spark pipeline.
Features are extracted into logical buckets (Historical activity, time and recency, demographics, etc.), which can be easily expanded or used for other purposes such as business insight (BI) reporting.
Below is an example of how the feature extraction pipeline is assembled. The output of this pipeline can be saved as an artifact for easier deployment or versioning.
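A minimal sketch of such an assembly is shown below; the custom transformer and the column names (user_id, transaction_ts) are illustrative placeholders rather than the exact ones in the codebase.

```python
# Sketch of a pyspark custom transformer assembled into a feature pipeline.
from pyspark.ml import Pipeline, Transformer
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
import pyspark.sql.functions as F


class RecencyFeatures(Transformer, DefaultParamsReadable, DefaultParamsWritable):
    """Adds a days-since-last-transaction feature for each user."""

    def _transform(self, df):
        last_tx = df.groupBy("user_id").agg(
            F.max("transaction_ts").alias("last_transaction_ts")
        )
        return df.join(last_tx, on="user_id", how="left").withColumn(
            "days_since_last_transaction",
            F.datediff(F.current_date(), F.col("last_transaction_ts")),
        )


# Additional transformers (historical activity, demographics, ...) are appended as stages.
feature_pipeline = Pipeline(stages=[RecencyFeatures()])
# features = feature_pipeline.fit(raw_df).transform(raw_df)
# feature_pipeline.write().overwrite().save("artifacts/feature_pipeline")  # versionable artifact
```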
Preprocessing
The pre-processing task for each feature is determined by its functionality and datatype. Categorical features are converted into one-hot encodings, and no data imputation is necessary. Non-destructive data pre-processing is done within pyspark to increase downstream usability (BI or ML). This means the data is preserved as close to the original as possible, with minimal manipulation such as imputation or filtering.
Additional data processing used only for ML, such as log transforms or standardization, is done in a scikit-learn pipeline (more on this below).
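As an illustration, a short pyspark sketch of the non-destructive categorical encoding (the country column is a placeholder); the original columns are kept alongside the encoded ones so the output remains usable for both BI and ML.

```python
# Categorical columns are indexed and one-hot encoded; original columns are preserved.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

encode_categoricals = Pipeline(stages=[
    StringIndexer(inputCol="country", outputCol="country_idx", handleInvalid="keep"),
    OneHotEncoder(inputCols=["country_idx"], outputCols=["country_ohe"]),
])
# encoded = encode_categoricals.fit(features).transform(features)
```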

A correlation matrix of these features provides a quick diagnostic to prevent multicollinearity that may hurt the model. Scanning the chart, user_transcCurrMonth_count seems to have medium or strong correlations with many other features. These strong correlations make sense: features such as the number of transfers or the total amount are eventually reflected in user_transcCurrMonth_count.
In contrast, only one variable/feature, user_isDropOutCurrentMonth, is highly correlated with is_dropOut_nextMonth, with a 0.97 correlation. The correlation indicates that disengaged users this month are likely to remain disengaged next month. For this reason, user_isDropOutCurrentMonth is dropped from the feature list. This also suggests that early prevention is crucial; users who drop off this month will likely stay dropped off in the following months.
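A small sketch of how this diagnostic could be automated, assuming the extracted features have been collected into a pandas DataFrame named features:

```python
# Flag feature pairs whose absolute Pearson correlation exceeds a threshold.
import numpy as np
import pandas as pd

def highly_correlated_pairs(features: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    corr = features.corr(numeric_only=True).abs()
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # report each pair only once
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Pairs such as (user_isDropOutCurrentMonth, is_dropOut_nextMonth) would surface here,
# prompting a drop: features = features.drop(columns=["user_isDropOutCurrentMonth"])
```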
Model Training and Baseline
As the model is scheduled to run monthly in batch mode, forward-chaining cross-validation is fitting, mimicking the real-world scenario. Random forest is a great starting point, given the size, shape, and distribution of the data (data exploration was done separately and is not included in this manuscript).
Although pyspark comes with random forest built-in, the model training uses scikit-learn for convenience. When needed, the pipeline can easily be written as a pyspark pipeline using both built-in and custom transformers.
The pipeline consists of additional pre-processing steps, such as:
- performing standardization for some selected features
- converting some features into logarithmic
- univariate feature selection by removing homogeneous features
- multivariate feature selection by using random forest feature selection
Subsequently, grid search cross-validation tunes hyper-parameters to find the best configuration for each step of the pipeline.
As a baseline, a stratified dummy model is used: a strategy that randomly assigns predictions based on the training label distribution.
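A minimal sketch of this training setup is below. It simplifies a few details: the log transform and scaler are applied to all numeric features rather than a selected subset (a ColumnTransformer could target specific columns), and X and y are assumed to be time-ordered so that forward-chaining splits mimic the monthly batch runs.

```python
# Sketch of the scikit-learn pipeline with forward-chaining CV and a dummy baseline.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.feature_selection import VarianceThreshold, SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV

pipeline = Pipeline([
    ("log", FunctionTransformer(np.log1p, feature_names_out="one-to-one")),  # log-transform counts
    ("scale", StandardScaler()),                                             # standardization
    ("univariate", VarianceThreshold()),                                     # drop homogeneous features
    ("multivariate", SelectFromModel(RandomForestClassifier(n_estimators=100))),
    ("clf", RandomForestClassifier()),
])

param_grid = {"clf__n_estimators": [100, 300], "clf__max_depth": [None, 10]}

search = GridSearchCV(
    pipeline, param_grid,
    cv=TimeSeriesSplit(n_splits=5),   # forward-chaining cross-validation
    scoring="roc_auc",
)
baseline = DummyClassifier(strategy="stratified")  # random predictions from label distribution
# search.fit(X, y); baseline.fit(X, y)
```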
Model Evaluation and Diagnostic

Random forest achieved an AUC of 0.9 on training data and 0.94 on testing data. Over each iteration, the divergence between training and testing scores is consistently low, indicating that the model is not overfitting. As more data became available, almost all metrics improved. The model also outperformed the baseline model in all iterations.
Feature Importance: Feature importances are extracted at each training iteration to understand how the model makes decisions.

The graph above shows that user_atmPrevMonth_count consistently ranked as one of the best predictors of user engagement, indicating that ATM usage is an essential part of the service.
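A short sketch of how the importances could be pulled out of the fitted pipeline from the previous sketch, assuming search has already been fit on a DataFrame of named features:

```python
# Extract feature importances from the best pipeline found by the grid search.
import pandas as pd

best = search.best_estimator_
surviving_names = best[:-1].get_feature_names_out()   # names left after the selection steps
importances = pd.Series(best.named_steps["clf"].feature_importances_, index=surviving_names)
print(importances.sort_values(ascending=False).head(10))
```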
Deployment Strategy: Both pyspark and scikit-learn pipelines can easily be saved as a pickle or an artifact. Storing, versioning, and deploying these artifacts are fully supported by MLflow and other ML deployment frameworks.
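For instance, a minimal MLflow sketch for logging the trained scikit-learn pipeline as a versioned artifact (the run name and metric value are illustrative):

```python
# Log the fitted pipeline and its test metric to MLflow for versioning and deployment.
import mlflow
import mlflow.sklearn

with mlflow.start_run(run_name="engagement-rf-monthly"):
    mlflow.log_metric("test_auc", 0.94)
    mlflow.sklearn.log_model(search.best_estimator_, artifact_path="model")

# The monthly batch job can later reload it with:
# model = mlflow.sklearn.load_model("runs:/<run_id>/model")
```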
Part II: Designing Intervention and A/B Testing Experiment
It is not possible to make everyone happy, and finding out the common reasons a user is happy (or unhappy) with the service is a vast space to explore. However, we can improve certain parts of the service and encourage the current user base to fully utilize all the services and products offered. As a start, we will use one of the insights from the previous exercise: the model that predicts disengagement ranked ATM usage during the last month as the most critical factor.
Further investigation reveals that only 54% of the user base has used the ATM service at least once. Ensuring that users are aware of and fully utilize all the services is low-hanging fruit. The hypothesis is that increasing ATM usage will reduce drop-out rates.
There may be many different approaches to increase ATM usage:
- Reviewing and evaluating partnerships with ATM network providers, aiming to expand the service strategically
- Reducing ATM transaction fees
- Increasing awareness of the existing ATM services
- Offering promotions or incentives to encourage users who are not using the service enough to transition
Any combination of the solutions above may apply. However, given complex business contracts, costs, and other considerations, solutions 1 and 2 are expensive and can be risky in terms of ROI. Using the already existing platform (app or web), solutions 3 and 4 are more feasible with little or no risk.
One strategy for increasing awareness is persuading users that the service is available. This can be done through advertisements, such as email campaigns or in-app notifications.
Recommender systems are ubiquitous in almost any data-driven company. The main goal of a recommender system is to use data to generate effective recommendations, where effective can translate to more personalized, prescriptive, useful, or any combination of these.
As one of the most actively researched areas, recommender systems offer many different techniques, from one-line-of-code approaches to rocket-science complexity. For this particular case, the problem can be articulated in different scenarios:
- Rule-based/expert system: For example, recommend the service (or give incentives) to everyone who has not used the ATM service.
- Collaborative filtering: For example, users who often use ATMs in New York also often use ATMs in Cancun. Therefore, promote ATM usage in Cancun to New Yorkers. Common techniques include matrix factorization (ALS, SVD, NMF) and its variations.
- Neighbourhood/similarity: For example, if a user often uses ATMs in New York, they will most probably also use ATMs in London (because New York and London are similar). Therefore, offer promotions (discounts/incentives) to these users for transactions in London. Common techniques include k-means, LSH, or other similarity-based models, ranking, or information retrieval techniques.
- Deep learning/other complex models: For example, if a user uses an ATM in New York today, the next city will most likely be Washington DC. Therefore, offer promotions (discounts/incentives) to these users for transactions in Washington DC. Common techniques include LSTM, RNN, or their variations.
- Hybrid recommender systems: any combination of the above.
One challenging aspect of building a recommender system is narrowing down the nuances and the context. Questions like: what should be done if there is no previous data available to learn from (the cold start problem)? What metrics are necessary, and how should we measure them? How do these metrics relate to the key performance indicators (KPIs)? What if it gives an incorrect recommendation? These are all difficult to answer, as there are no right or wrong answers. Addressing these questions is an area of its own, deserves its own discussion, and requires a good understanding of the domain and the business model.
Given the data available, collaborative filtering (option 2) is combined with a rule-based recommender (option 1). This configuration is reasonable for a pilot study and simple enough to execute. The outcome may serve as a benchmark for future models or improvements.
Contextual Recommender System
Context is essential in recommender systems. Given the day of the week as a context, the drinks, movies, and items we consume on a Monday night may not be the same as on a Friday night. Adding recency, weather, geolocation, preferences, or other information as context may make recommendations more personalized, but also more complex to design. Finding which context is practical requires substantial experimentation.
To start, we will use the month of the year as the context. Since the data is limited (there is no geodata or merchant information), we can reframe the problem and scope it down: for each user, rank the top five cities where ATM services will likely be used and provide promotional rates for these cities, with the month of the year as the context. For example, users who usually utilize ATMs in New York are more likely to use an ATM in Florida in February but in Chicago in August.
Designing Contextual Recommendation using ALS
Alternating Least Squares (ALS) is one of the common techniques used for collaborative filtering. Like other collaborative filtering techniques, ALS decomposes the original matrix (users and interactions) into matrices U and V, such that the multiplication of the two matrices produces a close approximation of the original matrix. What makes ALS different from the others is how it learns and approximates U and V.
While some other algorithms use gradient descent or other methods, ALS uses a method similar to Ordinary Least Squares (OLS) in regression problems. ALS usually performs better on noisy data, such as when the ranking is inferred (implicit feedback). A simplified explanation of how ALS learns:
- It starts by initializing U and V pseudo-randomly.
- Hold U fixed and learn V by performing a variation of OLS for each row of V, with the rows of U as the features and the corresponding column vector from the original matrix as the label.
- Similar to step 2, except the roles switch: hold V fixed and learn each row of U, with the rows of V as the features and the corresponding row vector from the original matrix as the label.
- Iterate until convergence or until the maximum number of iterations is reached.
Spark ML provides ALS out of the box, so we don't have to implement the algorithm from scratch. Additionally, collaborative filtering results in big, sparse matrices, and Spark ML can easily scale.

Similar to other collaborative filtering techniques, ALS takes users and interactions as the input. One possible approach to adding context is simply splitting the interactions by context. This approach is not recommended for extensive contexts, as it further increases the size and sparsity of the matrix.
ALS is evaluated by comparing the original matrix and the reconstructed matrix. The comparison can be calculated by measuring the mean absolute error (MAE) or the root mean square error (RMSE), which penalizes large errors more heavily. The ALS model for this exercise yields an RMSE of 2.4 and an MAE of 1.8.
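Below is a minimal Spark ML sketch of this setup. The interaction DataFrame and its columns (user_idx, city_month_idx, atm_count) are illustrative: context is injected by treating each (city, month) pair as a distinct item, and ATM counts are used as implicit feedback.

```python
# Contextual ALS: each (city, month) pair is encoded as one item id.
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

als = ALS(
    userCol="user_idx",
    itemCol="city_month_idx",
    ratingCol="atm_count",
    implicitPrefs=True,         # counts are implicit feedback, not explicit ratings
    coldStartStrategy="drop",   # skip unseen users/items when evaluating
    rank=10,
    regParam=0.1,
)
# model = als.fit(train_df)
# predictions = model.transform(test_df)
# rmse = RegressionEvaluator(metricName="rmse", labelCol="atm_count",
#                            predictionCol="prediction").evaluate(predictions)
# mae = RegressionEvaluator(metricName="mae", labelCol="atm_count",
#                           predictionCol="prediction").evaluate(predictions)
# top5 = model.recommendForAllUsers(5)    # top five city-month recommendations per user
```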
Designing A/B Testing
In scientific research (commonly referred to as Randomized Controlled Trials, RCTs) and in industry, A/B testing is the gold-standard methodology for studying cause and effect. Generally, A/B testing comprises:
- Splitting a cohort of people randomly
- Applying the treatment to one of the groups for a period of time
- Evaluating whether there is any difference in the outcome between groups
- Deciding on the next step

For each step above, there are many different variations of techniques and scenarios to consider:
Step 1: Cohort selection
Q: Who are eligible users? A: In this context, it only makes sense to target users who are on the verge of dropping out. Giving cashback to everyone may hurt revenue, and there is no point in offering rewards to users who no longer use the service. In this case, we only consider users who are predicted to be drop-out users and who have still used the service at least once in the past 60 days.
Q: How many users do we need for the study? A: The number of samples can be calculated using power analysis before the study is conducted. The calculation requires: a) a metric for both cohorts (e.g., the total number of drop-out users), b) the percentage difference anticipated for group B, c) alpha (the type 1 error/false positive rate, usually 5% or lower), and d) power (one minus the type 2 error/false negative rate, ideally 80% or higher). Parameters b, c, and d can be adjusted; the more definitive the desired result, the larger the sample size usually needed.
More detailed information about power analysis is available here.
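For illustration, a minimal power-analysis sketch using statsmodels; the baseline and target drop-out rates below are hypothetical placeholders.

```python
# Estimate the per-group sample size needed to detect a drop in the drop-out rate.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.20   # assumed current monthly drop-out rate (group A)
target_rate = 0.17     # anticipated rate for the treatment group (group B)

effect_size = proportion_effectsize(baseline_rate, target_rate)
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,   # type 1 error rate
    power=0.80,   # 1 - type 2 error rate
    ratio=1.0,    # equally sized groups
)
print(f"Required sample size per group: {n_per_group:.0f}")
```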
Q: How do we make sure groups A and B are comparable? A: One technique is stratified random sampling. Strata can be decided based on the user's demographic, geographic, or other data points, especially suspected confounding factors. Testing (parametric or non-parametric) on sample sizes, stratum distributions, and KPIs/measurement metrics also needs to be performed to ensure that both groups are similar.
More in-depth article on how to design the cohort sizes and selection is available here.
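A small sketch of stratified assignment and a balance check, where the eligible DataFrame and its segment column are hypothetical placeholders:

```python
# Split eligible users into control (A) and treatment (B) while preserving strata,
# then check that the stratum distributions are comparable across groups.
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.model_selection import train_test_split

eligible = pd.DataFrame({
    "user_id": range(1000),
    "segment": ["monthly", "weekly", "daily", "merchant"] * 250,  # stratum label
})

group_a, group_b = train_test_split(
    eligible, test_size=0.5, stratify=eligible["segment"], random_state=42
)

combined = pd.concat([group_a.assign(group="A"), group_b.assign(group="B")])
contingency = pd.crosstab(combined["group"], combined["segment"])
_, p_value, _, _ = chi2_contingency(contingency)
print(f"Chi-square p-value for stratum balance: {p_value:.3f}")
```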
Step 2: Experimentation
Q: What needs to be done differently across groups? A: Nothing needs to be done for the control group (group A), except recording the metrics.

For the treatment group (group B), promotional messages (email or push notifications) are sent at specific intervals. The recommender system produces a list of cities personalized for each individual based on their historical data, with the month of the year as the context. Users are given promotional rates: no fees (in the form of cashback) for ATM transactions performed in these cities during the study period.
Note that this cashback could become a confounding factor in the results, i.e., when a user is eager to use the ATM solely because of the rewards. A follow-up study, cashback vs. no cashback, may need to be conducted to evaluate this factor. A practical example is to conduct A/A testing on this cohort, comparing the promotion period (with cashback) and the period after it (without rewards), to see if there is any significant difference.
Q: How long does the study need to be conducted? A: There is no silver bullet for this. A reasonable period should be long enough to evaluate the shift in behavior. In this case, since the model and KPIs are calculated monthly, the study can be conducted over a quarter.
Step 3: Results and Evaluation
Q: How do we know if the study is a success? A: A metric needs to be pre-defined prior to the study; in this case, the number of drop-out users in each cohort. Step 1 ensures that the metrics are comparable before the study begins. During the study period, the metrics are recorded or calculated periodically (e.g., daily or weekly). At the end of the study, these metrics are evaluated using statistical methods (e.g., a paired t-test if the metric differences are normally distributed, or a Wilcoxon signed-rank test otherwise; regression analysis to study the interaction of each variable). These statistical analyses conclude the outcome of the study.
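As a sketch of this final evaluation, assuming weekly drop-out counts were recorded for each group over the study period (the numbers below are hypothetical):

```python
# Compare weekly drop-out counts between control (A) and treatment (B).
import numpy as np
from scipy.stats import shapiro, ttest_rel, wilcoxon

group_a = np.array([42, 45, 39, 44, 41, 43, 40, 46, 44, 42, 45, 43])  # control, per week
group_b = np.array([38, 40, 35, 39, 36, 37, 33, 38, 36, 34, 35, 33])  # treatment, per week

differences = group_a - group_b
_, normality_p = shapiro(differences)

if normality_p > 0.05:
    stat, p_value = ttest_rel(group_a, group_b)   # paired t-test on weekly pairs
else:
    stat, p_value = wilcoxon(group_a, group_b)    # non-parametric fallback
print(f"p-value: {p_value:.4f}")
```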
Step 4: Follow-up Actions
Q: What if the study is a failure? A: A failed study - that is, when there is no significant difference between the control and treatment groups - is normal, even when each step is perfectly executed. The resulting data can provide cues for better hypotheses and valuable pointers, for example, whether a longer period is needed for the study.
Q: What to do if the study is a success? A: The insights from the analysis can be an input to promote changes in the organization. Breaking the results down by group, geolocation, or other logical categories may reveal valuable insights, which provide sound evidence for making the product better. Additional follow-up studies may also be undertaken to improve the product further.
Data, notebooks, and complete codebase are available at: https://github.com/kristiyanto/fintech-user-engagement.