From Numerical to Categorical

Three ways to bin numeric features

Andrew Engel
Towards Data Science



Binning numerical features, that is, grouping them based on the interval the original value falls into, can improve model performance. This can happen for several reasons. First, the bins can be defined from domain knowledge to help the model identify the patterns it is looking for. Second, data always contains measurement error, and binning can reduce its impact.

There are three common approaches beyond domain-knowledge-based binning: equal width, equal frequency, and a k-means approach. (The k-means approach is outside the scope of this article.) Equal-width binning takes the range of the numerical variable and divides it into equal-sized intervals. The width of every interval is identical, but the count of observations in each bin can vary widely; an example is dividing the ages of individuals into five- or ten-year buckets. Equal-frequency binning divides the feature so that each bin contains roughly the same number of observations. In the age example, if most individuals are in their twenties and thirties, ten- or even five-year buckets can produce bins that lack usefulness; binning by frequency separates these common ages into distinct bins that are more useful to the model.
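To make the contrast concrete, here is a minimal sketch (with synthetic ages skewed toward the twenties and thirties, standing in for real data) comparing the two strategies with pandas:

```python
import numpy as np
import pandas as pd

# Synthetic ages: most individuals are in their twenties and thirties
rng = np.random.default_rng(0)
ages = pd.Series(np.clip(rng.normal(30, 12, 1000), 0, 90))

# Equal width: identical interval sizes, very uneven counts per bin
eq_width = pd.cut(ages, bins=4)

# Equal frequency: uneven interval sizes, equal counts per bin
eq_freq = pd.qcut(ages, q=4)

print(eq_width.value_counts().sort_index())
print(eq_freq.value_counts().sort_index())
```

With this skewed sample, the middle equal-width buckets absorb most of the observations, while qcut places exactly a quarter of the data in each bin.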

Binning in pandas

Using weather data extracted from the database with the open-source package RasgoQL,

import rasgoql

# rql is a RasgoQL connection, e.g. created with rasgoql.connect()
dataset = rql.dataset('Table Name')
df = dataset.to_df()

equal-width bins can easily be created using the cut function from pandas. In this case, four equal-width bins are created.

import pandas as pd

df['HIGH_TEMP_EQ_BINS'] = pd.cut(df.DAILY_HIGH_TEMP, bins=4,
                                 labels=False, include_lowest=True)

Similarly, qcut can be used to create bins with approximately equal counts in each.

df['HIGH_TEMP_FQ_BINS'] = pd.qcut(df.DAILY_HIGH_TEMP, q=4,
                                  precision=1, labels=False)
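Both cut and qcut can also return the cutoffs they computed by passing retbins=True, which is useful whenever those edges need to be recorded. A short self-contained sketch (synthetic temperatures stand in for DAILY_HIGH_TEMP):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df.DAILY_HIGH_TEMP (for illustration only)
temps = pd.Series(np.random.default_rng(1).normal(70, 15, 500))

# retbins=True returns the computed cutoffs alongside the binned values
binned, edges = pd.qcut(temps, q=4, labels=False, retbins=True)

print(edges)  # five edges defining the four bins
```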

Binning with scikit-learn

These same bins can also be created using scikit-learn’s KBinsDiscretizer preprocessing transformer. To create equal-width bins, we set the strategy to ‘uniform’.

from sklearn.preprocessing import KBinsDiscretizer

est = KBinsDiscretizer(n_bins=4, encode='ordinal',
                       strategy='uniform')
df['HIGH_TEMP_SK_EQ_BINS'] = est.fit_transform(
    df[['DAILY_HIGH_TEMP']])

Similarly, setting strategy to ‘quantile’ will create approximately equal-frequency bins.

est = KBinsDiscretizer(n_bins=4, encode='ordinal',
                       strategy='quantile')
df['HIGH_TEMP_SK_FQ_BINS'] = est.fit_transform(
    df[['DAILY_HIGH_TEMP']])
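One advantage of the scikit-learn approach is that the fitted discretizer keeps its learned cutoffs in bin_edges_ and can label new observations consistently via transform. A minimal self-contained sketch (synthetic temperatures stand in for DAILY_HIGH_TEMP):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic stand-in for df[['DAILY_HIGH_TEMP']] (for illustration only)
rng = np.random.default_rng(2)
X = rng.normal(70, 15, size=(500, 1))

est = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='quantile')
est.fit(X)

# The learned cutoffs live on the fitted object
print(est.bin_edges_[0])

# The same fitted object labels new observations consistently;
# values outside the training range are clipped into the end bins
X_new = np.array([[40.0], [70.0], [100.0]])
labels = est.transform(X_new).ravel()
print(labels)
```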

In either case, taking a model to production means rewriting these transformations for the production environment. For the pandas versions, the cutoffs will need to be extracted from cut or qcut and hard-coded into the production code. For scikit-learn, either the cutoffs will need to be extracted and the entire transformation rewritten to use them, or the fitted preprocessing objects will need to be saved and reloaded into the production environment.
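As a sketch of what that hard-coding can look like (the cutoff values below are hypothetical), extracted edges can be replayed in production with np.digitize; in the database, the same logic would become a CASE expression:

```python
import numpy as np

# Hypothetical cutoffs, as previously extracted during development
# from pd.cut/qcut (retbins=True) or KBinsDiscretizer.bin_edges_
edges = np.array([32.0, 55.0, 72.0, 89.0, 105.0])  # 5 edges -> 4 bins

def assign_bin(values, edges):
    """Map raw values to 0-based bin labels using fixed cutoffs.

    Only the interior edges are used, so values outside the original
    training range are clipped into the first or last bin.
    """
    return np.digitize(values, edges[1:-1], right=True)

new_temps = np.array([30.0, 60.0, 90.0, 110.0])
print(assign_bin(new_temps, edges))  # -> [0 1 3 3]
```

Using right=True makes the intervals right-closed, matching the default behavior of pd.cut.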

Beyond the work of refactoring modeling code into production code, this approach can be slow at best and infeasible at worst when working with vast amounts of data. With massive datasets, substantial time is wasted simply waiting for the data to transfer to the modeling environment, and if the data is large enough, it may not fit into memory for pandas, so not all of it may be available to these transformations.

Binning in the modern data stack

By leveraging the open-source Python package RasgoQL, both of these issues can be avoided. First, because RasgoQL creates the bins directly in the database, it works with data of any size. Second, although the bins are created and examined from Python, the underlying SQL code is saved in the database. This means that as new data arrives, it is automatically available with the bins applied.

To create the equal width bins, the RasgoQL function bin can be called with a type of ‘equalwidth’.

eq_bin = dataset.bin(type='equalwidth',
                     bin_count=4,
                     column='DAILY_HIGH_TEMP')
eq_bin.save(table_name="HIGH_TEMP_EWB")

Alternatively, setting type to ‘ntile’ will create equal count bins.

fq_bin = dataset.bin(type='ntile',
                     bin_count=4,
                     column='DAILY_HIGH_TEMP')
fq_bin.save(table_name="HIGH_TEMP_NTB")

At this point, these datasets are published back to the database and available for use in both a modeling and production environment.

While binning the data directly in pandas or with scikit-learn’s binning functions is easy to implement in traditional Python machine-learning workflows, when working with big data stored in a database, converting to a SQL approach has advantages in speed and ease of promotion to production. SQL-based feature engineering allows faster processing, an easier path to production, and reuse of features across multiple projects. However, most data scientists are more comfortable in Python (or R) than in SQL, and, for many calculations, the SQL is complicated. The open-source package RasgoQL allows data scientists to continue working in Python while the calculations execute in the database.

If you want to check out RasgoQL, the documentation can be found here and the repository here.


Head of Customer-Facing Data Science at Weav AI. Developer of pgetu to bring tsfresh to Postgres. Previously at DataRobot, SAS, and HP.