Feature engineering

Outstanding features can be reused across many applications.
In my earlier article, I pointed out the five fundamental domains of feature engineering: the statistics, time, ratio, crossing, and geo-location domains.
To add value on top of that, today we will focus on customer-level features. The customer is the entity we deal with most; we often reach out to customers individually through various campaigns.
We can also use customer-level features to describe the persona of the whole portfolio.
In this article, I will illustrate code snippets that you can use in your machine learning model right away.
- We will write the code in Python with the pandas library, the most well-known library for data manipulation.
- I genuinely recommend that you take a look at and bookmark 🔖 the pandas documentation here, in case you need further information about any snippet.
- Let's start your analysis with these three lines:

```python
import pandas as pd
import numpy as np
import datetime
```

Shall we?
Statistics 📐
Here, we use mock-up data with the following schema.

The data could be any type of transaction, ranging from mobile usage behavior to grocery purchases to website browsing traffic, as long as each record belongs to a customer.
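Here is a minimal sketch that builds such mock-up data and aggregates it, assuming hypothetical columns named cust_id, txn_amt, and txn_dt (your own column names will differ), and using the imports above:

```python
# Hypothetical mock-up transactions; the column names are assumptions for illustration.
txn = pd.DataFrame({
    "cust_id": ["A", "A", "A", "B", "B", "C"],
    "txn_amt": [120.0, 80.0, 200.0, 50.0, 60.0, 300.0],
    "txn_dt": ["2020-01-05", "2020-02-11", "2020-03-20",
               "2020-01-15", "2020-02-28", "2020-03-01"],
})

# Aggregate to one row per customer with common statistical features.
statistical_features = (
    txn.groupby("cust_id")["txn_amt"]
       .agg(["count", "sum", "mean", "median", "min", "max", "std"])
       .reset_index()
)
```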
As a result, you get the statistical_features data frame with a unit of analysis at the customer level.
- Note that some aggregations can return a NaN value (for example, the std of a customer with a single transaction).

Time 🕑
We can see one of the DateTime columns here: txn_dt.
PROPERTY
Firstly, we have to ensure that the data type of this column is datetime64[ns] by using the pd.to_datetime function.
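A minimal sketch, continuing with the txn frame from the snippet above:

```python
# Ensure txn_dt is datetime64[ns]; txn is the mock-up frame from the previous snippet.
txn["txn_dt"] = pd.to_datetime(txn["txn_dt"])
print(txn["txn_dt"].dtype)  # datetime64[ns]
```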

Then, we can extract properties of the DateTime column with the following snippet.
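For example, continuing with the same frame (the derived column names are my own):

```python
# Extract calendar properties from the datetime column via the .dt accessor.
txn["txn_month"] = txn["txn_dt"].dt.month
txn["txn_day"] = txn["txn_dt"].dt.day
txn["txn_weekday"] = txn["txn_dt"].dt.dayofweek             # Monday=0 ... Sunday=6
txn["is_weekend"] = txn["txn_weekday"].isin([5, 6]).astype(int)
```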

Finally, we can aggregate data to the customer level.
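One possible aggregation, again with assumed feature names:

```python
# One row per customer: share of weekend transactions and the most active month.
time_features = txn.groupby("cust_id").agg(
    weekend_txn_ratio=("is_weekend", "mean"),
    most_active_month=("txn_month", lambda s: s.mode().iloc[0]),
).reset_index()
```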

Other good features are Recency and Month On Book (MOB).
RECENCY 🏃
Recency is the number of days since the customer's last activity with us.
Firstly, we group by customer ID and find the max_txn_dt of each customer, then subtract it from the current date.
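A minimal sketch of that calculation, reusing the txn frame from above:

```python
# Latest transaction date per customer.
recency = (
    txn.groupby("cust_id")["txn_dt"].max()
       .rename("max_txn_dt")
       .reset_index()
)

# Days between today and the latest transaction.
current_date = pd.Timestamp(datetime.date.today())
recency["recency_days"] = (current_date - recency["max_txn_dt"]).dt.days
```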

The result of the Recency calculation is one value, in days, per customer.

MONTH ON BOOK 📆
Also, for the Month On Book, we assume that we have a start_member_dt for each customer, like the following table.

We can then find the number of months the customer has been using our products.
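A sketch under the assumption that start_member_dt is already a datetime column:

```python
# Hypothetical membership start dates.
members = pd.DataFrame({
    "cust_id": ["A", "B", "C"],
    "start_member_dt": pd.to_datetime(["2018-06-01", "2019-11-15", "2020-01-20"]),
})

# Month On Book: whole calendar months between the start date and today.
today = pd.Timestamp.today()
members["mob"] = (
    (today.year - members["start_member_dt"].dt.year) * 12
    + (today.month - members["start_member_dt"].dt.month)
)
```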

The idea of Recency and Month On Book can be broken down into sub-categories, such as the latest activity date and start date of each product.
⭐️ Insight
From my experience in creating propensity models, the MOB always ranks high in the overall feature importance. If you've never tried it, you should!
Ratio ➗
We can only compare things that are on the same scale. That's why we need ratio features.

We can then find the ratio of each customer's spending to their income.
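A minimal sketch with hypothetical spending and income columns:

```python
# Hypothetical monthly spending and income per customer.
customers = pd.DataFrame({
    "cust_id": ["A", "B", "C"],
    "spending_amt": [9_000, 15_000, 30_000],
    "income": [10_000, 50_000, 100_000],
})

# The utilization ratio puts every customer on the same scale.
customers["utilization_ratio"] = customers["spending_amt"] / customers["income"]
```

Here, customer A spends the least in absolute terms but has the highest utilization ratio (0.9).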

You can see that a customer with a low spending amount can still have a high utilization ratio, which leads to different actions in the marketing campaign.
Crossing ✖️
We can cross categorical columns with each other, or with binned integer/numeric values.
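For example, assuming hypothetical gender and age columns:

```python
# Hypothetical customer attributes to cross.
profile = pd.DataFrame({
    "cust_id": ["A", "B", "C"],
    "gender": ["M", "F", "F"],
    "age": [23, 41, 35],
})

# Bin the numeric column first, then concatenate the categories into one crossed feature.
profile["age_bin"] = pd.cut(profile["age"], bins=[0, 30, 45, 120],
                            labels=["young", "mid", "senior"])
profile["gender_x_age"] = profile["gender"] + "_" + profile["age_bin"].astype(str)
```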

Then we can put it into the model and treat it as categorical data.

You may need to do one-hot encoding or label encoding before feeding it to the model.
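For instance, with pandas' built-in get_dummies, continuing with the profile frame above:

```python
# One-hot encode the crossed feature before feeding it to the model.
encoded = pd.get_dummies(profile["gender_x_age"], prefix="gx")
model_input = pd.concat([profile[["cust_id"]], encoded], axis=1)
```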

Geo-location 📌
This is the longest one here. Suppose we have customer geo-locations and points of interest like the following.


We can find the distance between locations by using the Haversine formula [1]. A worked example is provided in this Stack Overflow answer [2].
With a minor fix to that function, we can use it to calculate the distance in kilometers, as sketched below. You can validate the calculation logic with this website [3].
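Here is a minimal sketch of such a Haversine function and its usage; the coordinates and point of interest are purely illustrative:

```python
def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers via the Haversine formula [1]."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371 * np.arcsin(np.sqrt(a))  # mean Earth radius ≈ 6371 km

# Hypothetical customer locations and one point of interest.
geo = pd.DataFrame({
    "cust_id": ["A", "B"],
    "lat": [13.7563, 18.7883],
    "lon": [100.5018, 98.9853],
})
poi_lat, poi_lon = 13.7367, 100.5232

geo["dist_to_poi_km"] = haversine_km(geo["lat"], geo["lon"], poi_lat, poi_lon)
```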

⭐️ Insight
These features usually show up at the top of the feature importance, especially for models related to a specific place (e.g., propensity for health insurance and how often the customer visits a hospital).
In summary,
We illustrated code snippets with mock-up results so you can start creating your own feature engineering process. I've accumulated these snippets over time, and I always thought that if a resource like this had existed, my life would have been more comfortable.
Besides, you can apply these concepts with other libraries/frameworks such as pyspark. The differences would be in the functions/methods and data types you use in the manipulation process.
Note that, as you can see in the geo-location section, in some cases someone has already built a useful function for us.
We don't have to re-invent the wheel; we only need to understand what it does and how to apply it to our use case.
That's what I do all the time: I google every day while coding, and so does everyone else. Don't waste your time memorizing every function/method.
If you have helpful resources similar to this one, please feel free to share them with us in the comment below.
References
[1] https://en.wikipedia.org/wiki/Haversine_formula
[2] https://stackoverflow.com/questions/59736682/find-nearest-location-coordinates-in-land-using-python
[3] https://www.nhc.noaa.gov/gccalc.shtml