
Feature Engineering for Any Business – Tutorial with Pandas

Starter code snippets for building powerful machine learning features

Feature engineering

Image by: Pathairush Seeda

Outstanding features can be used across many applications.

In my earlier article, I pointed out the 5 fundamental domains of feature engineering: statistics, time, ratio, crossing, and geo-location.

To build on that point, today we will focus on customer-level features. The customer is the entity we deal with and talk about most often, and we often reach customers individually through various campaigns.

We can also use customer-level features to describe the personas that make up the whole portfolio.

In this article, I will walk through code snippets that you can put to use in your machine learning model right away.

  • We will write the code in Python with the pandas library, one of the best-known libraries for data manipulation.
  • I recommend bookmarking 🔖 the pandas documentation in case you need further information about any of the snippets.
  • Let’s start the analysis with these three lines:
import pandas as pd
import numpy as np
import datetime
  • Shall we?

Statistics 📐

Here, we use mock-up data with the following schema.

Mockup transaction data set

The data could be any type of transaction, from mobile usage behavior to grocery purchases or website browsing, as long as each record can be tied to a customer.
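A minimal sketch of this step, assuming a transaction table with hypothetical customer_id and txn_amt columns (the original mock-up schema is not reproduced here). It groups the transactions by customer and computes a handful of common statistics:

```python
import pandas as pd

# Hypothetical mock-up transactions; the column names are assumptions.
txn = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "txn_amt": [100.0, 250.0, 50.0, 80.0, 120.0, 300.0],
})

# One row per customer, one column per statistic.
statistical_features = (
    txn.groupby("customer_id")["txn_amt"]
    .agg(["count", "sum", "mean", "min", "max", "std"])
    .add_prefix("txn_amt_")
    .reset_index()
)
print(statistical_features)
```

Customer 3 has only one transaction, so the sample standard deviation for that row is NaN.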

As a result, you get the statistical_features data frame, with the customer as the unit of analysis.

  • Note that some aggregations can return NaN, for example the standard deviation of a customer who has only one transaction.
statistical features in the customer level

Time 🕑

Our data has one DateTime column, txn_dt.

PROPERTY

First, we have to make sure the data type of this column is datetime64[ns] by using the pd.to_datetime function.

Mockup data set for the time features

Then, we can extract the properties of the DateTime column.
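Here is a sketch of the conversion and the property extraction, using the .dt accessor. The mock-up rows and the new column names (txn_month, txn_day, txn_dayofweek, is_weekend) are my own assumptions for illustration:

```python
import pandas as pd

# Hypothetical mock-up data; txn_dt starts out as plain strings.
txn = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "txn_dt": ["2020-01-03", "2020-01-04", "2020-02-10"],
})

# Convert the strings to a proper datetime column first.
txn["txn_dt"] = pd.to_datetime(txn["txn_dt"])

# Extract date properties through the .dt accessor.
txn["txn_month"] = txn["txn_dt"].dt.month
txn["txn_day"] = txn["txn_dt"].dt.day
txn["txn_dayofweek"] = txn["txn_dt"].dt.dayofweek  # Monday=0 ... Sunday=6
txn["is_weekend"] = txn["txn_dayofweek"] >= 5
print(txn)
```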

Date properties features

Finally, we can aggregate data to the customer level.
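One possible way to do the aggregation is with named aggregations in groupby. The feature names below (txn_cnt, weekend_txn_cnt, active_months, weekend_txn_ratio) are hypothetical, so pick the ones that make sense for your business:

```python
import pandas as pd

# Hypothetical mock-up data with date properties already extracted.
txn = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "txn_dt": pd.to_datetime(
        ["2020-01-03", "2020-01-04", "2020-01-05", "2020-02-10"]
    ),
})
txn["txn_month"] = txn["txn_dt"].dt.month
txn["is_weekend"] = txn["txn_dt"].dt.dayofweek >= 5

# Roll the transaction-level properties up to one row per customer.
time_features = (
    txn.groupby("customer_id")
    .agg(
        txn_cnt=("txn_dt", "count"),
        weekend_txn_cnt=("is_weekend", "sum"),
        active_months=("txn_month", "nunique"),
    )
    .reset_index()
)
time_features["weekend_txn_ratio"] = (
    time_features["weekend_txn_cnt"] / time_features["txn_cnt"]
)
print(time_features)
```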

Aggregate date properties feature to a customer level

Two other useful features are Recency and Month On Book (MOB).

RECENCY 🏃

Recency is the number of days since the customer’s last activity with us.

First, we group by customer id and find the max_txn_dt of each customer, then subtract it from the current date.
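A sketch of the two steps, assuming the same hypothetical txn table. I fix the reference date (current_dt) so the result is reproducible; in practice you might use today’s date instead:

```python
import pandas as pd

# Hypothetical mock-up data.
txn = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "txn_dt": pd.to_datetime(["2020-01-03", "2020-01-05", "2020-02-10"]),
})

# Fixed reference date (an assumption) instead of the real current date.
current_dt = pd.Timestamp("2020-03-01")

# Step 1: latest transaction date of each customer.
recency = (
    txn.groupby("customer_id")["txn_dt"]
    .max()
    .rename("max_txn_dt")
    .reset_index()
)

# Step 2: days between the reference date and the latest transaction.
recency["recency_days"] = (current_dt - recency["max_txn_dt"]).dt.days
print(recency)
```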

latest transaction date of each customer

Here is the result of the Recency calculation.

Recency calculation result

MONTH ON BOOK 📆

For Month On Book, we assume that we have a start_member_dt for each customer, as in the following table.

Mockup data set for Month On Book feature

We can then find the number of months that the customer has been using our products.
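A minimal sketch, assuming MOB is counted in whole calendar months (the day of the month is ignored). That convention, the mob column name, and the fixed current_dt are my assumptions, and other definitions are equally valid:

```python
import pandas as pd

# Hypothetical mock-up membership data.
members = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "start_member_dt": pd.to_datetime(
        ["2019-01-15", "2019-12-31", "2020-03-01"]
    ),
})

# Fixed reference date (an assumption) instead of the real current date.
current_dt = pd.Timestamp("2020-03-01")

# Calendar-month difference: 12 * year gap + month gap.
members["mob"] = (
    (current_dt.year - members["start_member_dt"].dt.year) * 12
    + (current_dt.month - members["start_member_dt"].dt.month)
)
print(members)
```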

Month On Book feature result

The idea of Recency and Month On Book can be broken down into sub-categories, such as the latest date and the start date of each product.

⭐️ Insight

In my experience building propensity models, MOB always ranks high in the overall feature importance. If you’ve never tried it, you should!

Ratio ➗

We can only compare things fairly when they are on the same scale. That’s why we need ratio features.

Mockup data set for the ratio feature

We can then compute a ratio for each customer relative to their income.
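As a sketch, assume the numerator is a spending amount. The column names (spending_amt, income, spending_to_income_ratio) are hypothetical. I also treat a zero income as missing, so the division yields NaN instead of infinity:

```python
import numpy as np
import pandas as pd

# Hypothetical mock-up data; the column names are assumptions.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "spending_amt": [5_000.0, 20_000.0, 1_000.0],
    "income": [10_000.0, 100_000.0, 0.0],
})

# Replace zero income with NaN so the ratio is NaN rather than inf.
customers["spending_to_income_ratio"] = (
    customers["spending_amt"] / customers["income"].replace(0, np.nan)
)
print(customers)
```

Customer 1 spends less than customer 2 in absolute terms but has the higher ratio (0.5 versus 0.2).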

Calculation of ratio feature

Notice that a customer with a low spending amount can still have a high utilization ratio, which may call for a different action in a marketing campaign.

Crossing ✖️

We can cross categorical columns with each other, and the same idea also works for integer or numeric values.
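A simple way to cross two categorical columns is to concatenate their values. The columns below (gender, age_group) are hypothetical stand-ins for whatever categories you have:

```python
import pandas as pd

# Hypothetical mock-up data; the column names are assumptions.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "gender": ["F", "M", "F"],
    "age_group": ["20-29", "30-39", "30-39"],
})

# Cross the two categories into a single combined category.
customers["gender_x_age_group"] = (
    customers["gender"] + "_" + customers["age_group"]
)
print(customers)
```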

Mockup data set for crossing feature

We can then feed the crossed feature into the model and treat it as categorical data.

Crossing feature result

You may need to apply one-hot encoding or label encoding before feeding it to the model.
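For one-hot encoding, pd.get_dummies is one option. This sketch assumes a crossed column named gender_x_age_group, which is a hypothetical name:

```python
import pandas as pd

# Hypothetical crossed feature from the previous step.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "gender_x_age_group": ["F_20-29", "M_30-39", "F_30-39"],
})

# One indicator column per distinct crossed category.
encoded = pd.get_dummies(
    customers, columns=["gender_x_age_group"], prefix="cross"
)
print(encoded)
```

Each row has exactly one indicator switched on, and the other columns (here customer_id) pass through unchanged.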

One hot encoding for crossing feature

Geo-location 📌

This is the longest section. Suppose we have each customer’s geo-location and a point of interest, as follows.

Mockup data set for geolocation feature

We can find the distance between locations by using the Haversine formula [1]. A worked example is provided in this Stack Overflow question [2].

With a minor fix to the function, we can use it to calculate the distance in kilometers. The result is as follows. You can validate the calculation logic with this online calculator [3].
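The sketch below is my own vectorized version of the Haversine formula with NumPy, not the exact function from the Stack Overflow answer. The function name haversine_km, the column names, the coordinates, and the mean Earth radius of 6371 km are all assumptions for illustration:

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (
        np.sin(dlat / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Hypothetical customer locations and a single point of interest.
geo = pd.DataFrame({
    "customer_id": [1, 2],
    "cust_lat": [13.7563, 18.7883],
    "cust_lon": [100.5018, 98.9853],
    "poi_lat": [13.7563, 13.7563],
    "poi_lon": [100.5018, 100.5018],
})

geo["distance_km"] = haversine_km(
    geo["cust_lat"], geo["cust_lon"], geo["poi_lat"], geo["poi_lon"]
)
print(geo)
```

Because the function works on whole columns at once, there is no need to loop over the rows.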

The calculation result of the geolocation feature

⭐️ Insight

This feature usually shows up near the top of the feature importance, especially for models tied to a specific place (e.g., the propensity for health insurance and how often you go to the hospital).

In summary,

We walked through code snippets with mock-up results to help you start your own feature engineering process. I’ve accumulated these snippets over time, and I think my life would have been more comfortable if a resource like this had existed.

Besides, you can apply the same concepts with other libraries and frameworks such as pyspark. The differences would be the functions/methods and data types you use in the manipulation process.

Note that, as you saw in the geo-location section, in some cases someone has already built a useful function for us.

We don’t have to reinvent the wheel. We only need to understand what it does and how to apply it to our use cases.

That’s what I do all the time. I google every day when I’m coding, and so does everyone else. Don’t waste your time memorizing each function/method.

If you have helpful resources similar to this one, please feel free to share them with us in the comments below.

References

[1] https://en.wikipedia.org/wiki/Haversine_formula

[2] https://stackoverflow.com/questions/59736682/find-nearest-location-coordinates-in-land-using-python

[3] https://www.nhc.noaa.gov/gccalc.shtml

Related articles

5 Basic Fundamental of Feature Engineering for Any Business

