One of the most critical processes you should learn for any data science problem.

What you’ll learn:
- Develop the critical thinking skills required for feature engineering
- Feature engineering for an anti-money laundering algorithm
Introduction
What is Feature Engineering? [1]
A feature is a numeric representation of raw data. In structured data, features are the independent variables from which a dependent variable is predicted. Features already present in a dataset are commonly known as data fields, while those created through domain knowledge are known as candidate variables or expert variables. The process of encoding information into a new variable is known as feature engineering.
Why do we need more Features?
A machine learning model’s performance is directly tied to how accurately its independent features capture the right information about the problem at hand. Consequently, we should create as many candidate variables as we can, so that we can later select the most important ones for our model and thereby enhance its performance. Creating new features, however, is a tedious job that requires a good understanding of the problem along with some domain knowledge. In this article, I will walk through an example of creating various candidate variables for an anti-money laundering machine learning model.
We will start by understanding the problem and then apply that knowledge to perform feature engineering, following the steps described below.
STEP 1
Understanding the problem

Money laundering is the illegal process of turning "dirty" money (money obtained from illegal activities such as selling drugs) into "clean" money (legitimate money), either through an obscure sequence of banking transfers or through commercial transactions.
The three broad stages of money laundering are:
Placement – The stage in which the "dirty" money is introduced into the legitimate financial system. The most common way of achieving this is smurfing, which involves sending small amounts of money, each below anti-money laundering reporting thresholds, to multiple bank accounts, from which the funds are later returned to the original sender.
Layering – The second and one of the most complex stages, which involves making the money as hard to trace as possible by moving it further away from its source. The money is deliberately transferred so fast that the bank cannot detect it.
Integration – The final stage involves putting the "clean" money back into the economy. One of the most common ways is to buy property in the name of a shell company, which makes the purchase look like a legitimate transaction.
Because of space constraints, I have given only a broad definition of the problem for demonstration. In practice, you should research the problem properly by reading research papers, patents, and other relevant material.
STEP 2
Break down the problem into smaller fragments for effective variable creation [2]
After researching the problem, you should write down all the insights you have developed. For instance, here are a few that follow directly from Step 1:
- Substantial increases in cash deposits of any individual or business without apparent cause
- Deposits subsequently transferred within a short period out of the account and to a destination not normally associated with the customer
- Accounts dominated by cash transactions rather than using cheques or letters of credit
- A large number of individuals making payments into the same account without an adequate explanation
- Large cash withdrawals from a brand-new account, or from an account which has just received an unexpected large credit from abroad
The better your understanding of the problem, the more insights you will gain, and the better the features you can build to enhance your model’s performance. All the insights above should therefore be accounted for while creating candidate variables, to inject more information into the model.
STEP 3
Understand the Dataset
To progress further, let’s assume we have a hypothetical dataset with the following data fields (and their descriptions) for developing an anti-money laundering model. In a real scenario, data scientists working for banks can readily obtain data with such fields. You can also take a look at publicly available datasets.
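As a minimal sketch, such a hypothetical dataset could be set up in pandas as follows. All field names and values here are illustrative assumptions, not a real banking schema:

```python
import pandas as pd

# Hypothetical AML transactions dataset. All field names and values are
# illustrative assumptions, not a real banking schema.
transactions = pd.DataFrame({
    "Transaction_date": pd.to_datetime(
        ["2014-02-05", "2014-02-05", "2014-02-06", "2014-02-09"]
    ),
    "Origin_acct":      ["4586524", "4586524", "7723411", "4586524"],
    "Destination_acct": ["9912003", "9912003", "4586524", "1120775"],
    "Transaction_type": ["cash", "cash", "wire", "cheque"],
    "Amount":           [49.2, 49.2, 910.0, 15.5],
})

print(transactions)
```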

STEP 4
Building Candidate Variables
Here comes the most interesting, the most crucial, and the most difficult part of any data science problem: feature engineering.
4.1. Concatenate two or more different data fields to form a new categorical variable
For our first set of variables, you can think of joining two or more data fields together to make a new variable. To understand this, let’s make pairs of "Origin_acct", "Destination_acct", and the "Transaction_type" as shown in the table below:

In the table above, you can see a new column named "Origin_acct-Destination_acct" containing the concatenated values of its two constituent data fields. The other new columns are formed in the same way.
Why?
Concatenating "Origin_acct" with "Destination_acct" helps in policing "smurfing", where multiple intermediate accounts each transfer small amounts back to a single recipient many times. Additionally, as discussed earlier, these criminals prefer cash transactions over instruments such as cheques and bills of exchange. Concatenating with "Transaction_type" therefore gives the algorithm another dimension for learning about the nature of transactions, such as whether the number of cash transactions for a particular account has increased (discussed in 4.2). Such activity is far from normal, and you will see how the numerical candidate variables linked to these concatenated fields (discussed later) help point us in the right direction.
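As a minimal sketch, using the hypothetical field names assumed throughout this article, the concatenated variables can be built in pandas like this:

```python
import pandas as pd

# Hypothetical transaction records; field names and values are assumptions.
df = pd.DataFrame({
    "Origin_acct":      ["4586524", "7723411"],
    "Destination_acct": ["9912003", "4586524"],
    "Transaction_type": ["cash", "wire"],
})

# New categorical variables built by joining existing fields with a separator.
df["Origin_acct-Destination_acct"] = (
    df["Origin_acct"] + "-" + df["Destination_acct"]
)
df["Origin_acct-Destination_acct-Transaction_type"] = (
    df["Origin_acct-Destination_acct"] + "-" + df["Transaction_type"]
)

print(df["Origin_acct-Destination_acct"].tolist())
# → ['4586524-9912003', '7723411-4586524']
```

Each concatenated column can then be treated like any other categorical feature, for example as the entity over which the frequency and amount variables of sections 4.2 and 4.3 are computed.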
4.2. Frequency Candidate Variables [1]
The frequency variables encode the number of transactions made by each entity (shown in the figure), which helps capture information such as an increase in the number of transactions for a particular pair of accounts, a possible signal of suspicious activity. The figure below shows different combinations of frequency variables. For example, you can calculate the number of times the Origin_acct was used on the same day (0), in the last 1 day, in the last 3 days, and so on.

A high value over a short time period suggests something abnormal in the behaviour of that account. In the table below you can see what the frequency variable looks like for one of the origin accounts:

The column Origin_frequency_0 starts at 1 (assuming the account is being used for the first time) and remains 1 on the next row because 05/02/2014 is the first day the account is seen. You can deduce how the other numbers were calculated in the same way.
4.3. Amount Variables
The amount variables capture the average, maximum, median, and total amount transacted from each account over the past 0, 1, 3, 7, 14, and 30 days (0 indicates the same day). They help track the third stage, Integration, in which a large sum of money is withdrawn from a bank account without any adequate reason, possibly to buy property, and thus help the model identify abnormal transaction amounts. For instance, one column would contain the total amount transacted by a destination account over the last 3 days. Other combinations can be formed similarly, as shown below:

The table below shows a pair of amount variables:

In the table above, the column "Origin_acct-total_Amount_3_days" contains the total amount transacted by origin account #4586524 over the past 3 days. This is why the total remains 98.4 in the last row: the account was not used in the last 3 days. The other column divides the amount transacted on the same day by the total amount over the last 3 days.
4.4. Time-since Variables [1]
These variables are very handy for capturing how fast transactions are taking place. A time-since variable measures the time between the current transaction and the previous time the account was used. The faster the subsequent transactions for a single entity, the higher the probability of fraud; hence these variables help track the second stage, Layering. The following table shows an example of a time-since variable:

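A sketch of a time-since variable in pandas, with hypothetical data; the first transaction of an account has no previous use to compare against, so the value is missing there:

```python
import pandas as pd

df = pd.DataFrame({
    "Transaction_date": pd.to_datetime(["2014-02-05", "2014-02-06", "2014-02-12"]),
    "Origin_acct": ["4586524"] * 3,
}).sort_values("Transaction_date")

# Days elapsed since the same origin account was last used; NaN for the
# account's first recorded transaction.
df["Origin_days_since_last_txn"] = (
    df.groupby("Origin_acct")["Transaction_date"].diff().dt.days
)

print(df["Origin_days_since_last_txn"].tolist())
```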
4.5. Velocity-change candidate Variables [1]
This last set of variables tracks sudden changes in the normal behaviour of an account by comparing the number of transactions, or the amount transferred, over the past day (0 & 1 day) against the same quantity over a longer period (7, 14, & 30 days). Following [1], the formula can be written as:

velocity change = (number or amount of transactions over the past 0 or 1 day) ÷ (average daily number or amount of transactions over the past 7, 14, or 30 days)
Hence, if there is an unexpected change in the number of transactions or in the average amount for an account, our model will be able to learn that change. The following table shows an example of velocity-change variables:

Summary
You saw how we were able to encode more and more information about the problem through candidate variables, using only the given data fields and no external data. To summarise, you learned the following variables and the information each encodes:
Concatenated Variables – Helped in linking the origin account, destination account, and transaction type with each other, which assisted in tracking smurfing and large cash withdrawals
Frequency Variables – Helped in learning how frequently an account is used
Amount Variables – Helped in learning the magnitude of transaction amounts
Time-since Variables – Helped in learning the speed of transactions
Velocity-change Variables – Helped in identifying sudden changes in the behaviour of accounts
I know the problem discussed above seems specific to fraud detection models but, trust me, it will help you develop the critical thinking skills required for creating expert variables for any data science problem. I hope you found it helpful and worth reading. Cheers!
References
[1] Gao, J.X., Zhou, Z.R., Ai, J.S., Xia, B.X. and Coggeshall, S. (2019) Predicting Credit Card Transaction Fraud Using Machine Learning Algorithms. Journal of Intelligent Learning Systems and Applications, 11, 33–63. https://doi.org/10.4236/jilsa.2019.113003
[2] Guideline on Combating Money Laundering and Terrorist Financing. https://www.imolin.org/doc/amlid/Trinidad&Tobago_Guidlines%20on%20Combatting%20Money%20Laundering%20&%20Terrorist%20Financing.pdf