
According to Forbes, data scientists and machine learning engineers spend around 60% of their time prepping data before training machine learning models. A large chunk of that time is spent on feature engineering.
Feature engineering is the process of transforming and creating features that can be used to train machine learning models. It is crucial to building accurate models, but it is often challenging and very time-consuming.
Feature engineering involves imputing missing values, encoding categorical variables, transforming and discretizing numerical variables, removing or censoring outliers, and scaling features, among others.
In this article, I discuss Python implementations of feature engineering for machine learning. I compare the following open-source Python libraries: Scikit-learn, Feature-engine, and Category encoders.
I will show the code to perform:
- Missing data imputation
- Categorical encoding
- Variable transformation
- Discretization
The feature engineering pipeline
Most feature engineering techniques learn parameters from the data. For example, to impute data with the mean, we obtain the mean from the training set. To encode categorical variables, we define mappings of strings to numbers, utilizing the training data as well.
Many open source Python packages have the functionality to learn and store the parameters to engineer the features, and then retrieve them to transform the data.
In particular, Scikit-learn, Feature-engine and Category encoders share the method fit to learn parameters from the data and the method transform to modify the data.
Pandas also has a lot of tools for feature engineering and data prepping. However, it lacks the functionality to store learned parameters. For this reason, we won’t talk about pandas in this article.
Python libraries for Feature Engineering
Scikit-learn, Feature-engine and Category encoders share the fit and transform functionality to learn parameters from data, and then transform the variables.
There are, however, some differences among the packages in terms of i) their output, ii) their input, and iii) their versatility.

Output: NumPy array vs Pandas dataframe
Feature-engine and Category Encoders return pandas dataframes. Scikit-learn returns NumPy arrays instead.
NumPy arrays are optimized for numerical computation and are therefore more efficient for machine learning algorithms. Pandas dataframes are better suited for data analysis and visualization.
Often, we want to understand how the feature engineering transformations affect the variable distribution and their relationships with other variables. Pandas is a great tool for data analysis and visualization, and thus, libraries that return Pandas dataframes are inherently more data analysis "friendly".
If we choose to work with Scikit-learn, we may need to add a line of code or two to transform the NumPy arrays into Pandas dataframes to continue with data visualization.
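For example, here is a minimal sketch, with hypothetical toy data, of how we might wrap the NumPy output of a Scikit-learn transformer back into a dataframe:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# toy dataframe with missing values (hypothetical data for illustration)
X_train = pd.DataFrame({'LotFrontage': [60.0, np.nan, 80.0],
                        'MasVnrArea': [0.0, 150.0, np.nan]})

imputer = SimpleImputer(strategy='median')

# SimpleImputer returns a NumPy array; wrap it back into a dataframe
X_train_t = pd.DataFrame(
    imputer.fit_transform(X_train),
    columns=X_train.columns,
    index=X_train.index,
)

Note that recent Scikit-learn releases (1.2 and later) can also return dataframes directly through the set_output(transform='pandas') API.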
Input: Data slice vs full data set
Data scientists apply different feature engineering methods to different variable subsets.
For example, we would only impute variables with missing data, and not the entire data set. We would apply certain imputation methods to numerical variables and other methods to categorical variables.
The Python libraries offer the possibility to select the variables that we want to transform.
With Feature-engine and Category encoders, we select the variables to transform within the transformer.
With Scikit-learn, we need to use a special transformer to slice the dataset into the desired group of variables. We can do this by using the ColumnTransformer or Feature-engine’s [SklearnWrapper](https://feature-engine.readthedocs.io/en/latest/wrappers/Wrapper.html). The beauty of using Feature-engine’s SklearnWrapper is that the output is a pandas dataframe!
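As an illustration, here is a minimal sketch of the ColumnTransformer approach, with hypothetical toy data and column names; each transformer receives only the columns listed next to it:

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# hypothetical dataframe with a numerical and a categorical variable
X = pd.DataFrame({'LotFrontage': [60.0, np.nan, 80.0],
                  'Alley': ['Grvl', np.nan, 'Pave']})

# apply a different imputer to each subset of columns
ct = ColumnTransformer(transformers=[
    ('impute_num', SimpleImputer(strategy='median'), ['LotFrontage']),
    ('impute_cat', SimpleImputer(strategy='most_frequent'), ['Alley']),
])

X_t = ct.fit_transform(X)  # NumPy array with the transformed columns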
Versatility
Sometimes, we don’t know which transformation technique returns the most predictive variable. Should we do equal-width or equal-frequency discretization? Should we impute with the mean, median, or an arbitrary number?
Most Scikit-learn transformers are centralized, meaning that one transformer can carry out different transformations. For example, we can apply 3 discretization techniques simply by changing the parameters of Scikit-learn’s KBinsDiscretizer(). Feature-engine, on the other hand, offers 3 different transformers for discretization.
The same is true for imputation: by changing the parameters of the SimpleImputer(), we can perform different imputation techniques with Scikit-learn, whereas Feature-engine has several imputation transformers, each of which can perform at most 2 imputation variations.
With Scikit-learn, we can therefore easily run a grid search over the parameters of the feature engineering transformers. With Feature-engine, we need to decide beforehand which transformation we want to use.
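As a minimal sketch of that idea (with synthetic data for illustration), we can place the KBinsDiscretizer() in a pipeline and grid search over its parameters:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# synthetic data, for illustration only
rng = np.random.RandomState(0)
X = rng.uniform(0, 100, size=(200, 2))
y = 0.5 * X[:, 0] + rng.normal(size=200)

pipe = Pipeline([
    ('disc', KBinsDiscretizer(encode='ordinal')),
    ('model', Lasso()),
])

# treat the discretization technique and the number of bins as hyperparameters
grid = GridSearchCV(
    pipe,
    param_grid={
        'disc__strategy': ['uniform', 'quantile', 'kmeans'],
        'disc__n_bins': [5, 10],
    },
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)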
In the rest of the blog, I will compare the implementation of missing data imputation, categorical encoding, mathematical transformation, and discretization among Scikit-learn, Feature-engine and Category encoders.
Missing data Imputation
Imputation consists of replacing missing data with statistical estimates of the missing values. There are multiple missing data imputation methods, each of which serves a different purpose.

If you want to learn more about these techniques, their advantages and limitations and when we should use them, check out the course "Feature engineering for Machine Learning".
Scikit-learn and Feature-engine support many imputation procedures for numerical and categorical variables.
Both libraries contain functionality for most common imputation techniques:
- Mean and median imputation
- Frequent category imputation
- Imputation with an arbitrary value
- Adding a missing indicator
Feature-engine can additionally perform:
- Random sample imputation
- Complete case analysis
- Imputation with values at the extremes of the distribution
Scikit-learn, on the other hand, offers multivariate imputation by chained equations (MICE) through its IterativeImputer().
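The IterativeImputer() is still flagged as experimental and must be enabled explicitly. A minimal sketch with hypothetical toy data:

import numpy as np
import pandas as pd
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer

# toy data with missing values (hypothetical values for illustration)
X = pd.DataFrame({'LotFrontage': [60.0, np.nan, 80.0, 70.0],
                  'LotArea': [8450.0, 9600.0, 11250.0, np.nan]})

imputer = IterativeImputer(random_state=0)

# each variable with missing data is estimated from the remaining variables
X_t = imputer.fit_transform(X)  # returns a NumPy array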
Feature-engine transformers can automatically identify numerical or categorical variables, depending on the imputation method. With Feature-engine, we will not inadvertently add a string when we impute numerical variables, or a number to categorical ones. With Scikit-learn, we need to select the variables to modify beforehand.
Scikit-learn’s SimpleImputer() can perform all of these imputation techniques just by adjusting the strategy and fill_value parameters, which gives us the freedom to run a grid search over imputation techniques, as shown in the code implementation in Scikit-learn’s documentation. Feature-engine instead has at least 5 different imputation transformers.
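A minimal sketch of that idea, with synthetic data, treating the imputation strategy as a hyperparameter inside a pipeline:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# synthetic data with randomly inserted missing values, for illustration only
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 2.0, 0.5]) + rng.normal(size=200)
X[rng.uniform(size=X.shape) < 0.2] = np.nan

pipe = Pipeline([
    ('imputer', SimpleImputer(fill_value=-1)),
    ('model', Ridge()),
])

# search over mean, median and arbitrary-value imputation
grid = GridSearchCV(
    pipe,
    param_grid={'imputer__strategy': ['mean', 'median', 'constant']},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)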
In the next paragraphs, we will carry out median imputation and then imputation with the most frequent category.
Mean/Median Imputation
For median imputation, Feature-engine offers the MeanMedianImputer(), and Scikit-learn offers the SimpleImputer().
Feature-engine’s MeanMedianImputer() automatically selects all numerical variables in the training data set. Scikit-learn’s SimpleImputer(), on the other hand, will try to transform all variables in the data set, and it will raise an error during execution if categorical variables are present.
Feature-engine
Below, we see the implementation of the MeanMedianImputer() using the median as the imputation method. Mean imputation can be implemented by simply replacing "median" with "mean" in the imputation_method parameter.
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.imputation import MeanMedianImputer

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0
)

# set up the imputer
median_imputer = MeanMedianImputer(
    imputation_method='median',
    variables=['LotFrontage', 'MasVnrArea']
)

# fit the imputer
median_imputer.fit(X_train)

# transform the data
train_t = median_imputer.transform(X_train)
test_t = median_imputer.transform(X_test)
Feature-engine returns the original dataframe, where only the numerical variables were modified. For more details visit the MeanMedianImputer() documentation.
Scikit-learn
With the SimpleImputer(), we can also specify the mean or median imputation method through its parameters:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0
)

# Set up the imputer
median_imputer = SimpleImputer(strategy='median')

# fit the imputer
median_imputer.fit(X_train[['LotFrontage', 'MasVnrArea']])

# transform the data
X_train_t = median_imputer.transform(
    X_train[['LotFrontage', 'MasVnrArea']]
)
X_test_t = median_imputer.transform(
    X_test[['LotFrontage', 'MasVnrArea']]
)
As we can see above, Scikit-learn requires that we slice the dataframe before passing it to the imputation transformer, whereas this step was not required with Feature-engine.
The result of the preceding code block is a NumPy array with the 2 numerical variables that were imputed.
Frequent Category Imputation
Frequent category imputation consists of replacing missing values in categorical variables with the most frequent category of the variable.
Feature-engine
The CategoricalImputer() replaces missing data in categorical variables with the mode, that is, the most frequent category, if we set the imputation_method parameter to ‘frequent’.
We can indicate the variables to impute, as we do below; otherwise, the imputer will automatically select and impute all categorical variables in the training data set.
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.imputation import CategoricalImputer

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0
)

# set up the imputer
imputer = CategoricalImputer(
    imputation_method='frequent',
    variables=['Alley', 'MasVnrType']
)

# fit the imputer
imputer.fit(X_train)

# transform the data
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)
The result is a dataframe with the original variables, where the indicated ones were imputed.
Scikit-learn
The SimpleImputer() can also perform frequent category imputation, by setting the imputation strategy to "most_frequent".
Note that the SimpleImputer()’s "most_frequent" strategy operates over both numerical and categorical variables, so we need to be careful to pass only the variables we intend to impute.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0
)

# set up the imputer
mode_imputer = SimpleImputer(strategy='most_frequent')

# fit the imputer
mode_imputer.fit(X_train[['Alley', 'MasVnrType']])

# transform the data
X_train_t = mode_imputer.transform(
    X_train[['Alley', 'MasVnrType']]
)
X_test_t = mode_imputer.transform(
    X_test[['Alley', 'MasVnrType']]
)
The output of the preceding code block is a NumPy array with 2 columns containing the imputed variables.
Categorical encoding
Machine learning models require input data in a numerical format. Thus, data scientists need to convert categorical variables into numbers. This procedure is called categorical variable encoding.

There are many ways to encode categorical variables. The encoding method we choose is driven entirely by the data and the business problem; how we represent and engineer these features can have a major impact on the performance of the model.
Scikit-learn, Feature-engine and Category encoders offer a wide range of categorical encoders. All three libraries offer commonly used encoders, such as one-hot encoding and ordinal encoding, the latter of which we demonstrate below.
Feature-engine and Category Encoders also offer target-based encoding methods such as target mean encoding and weight of evidence.
Overall, Category Encoders is the front runner in the field of categorical encoding, offering the widest arsenal of encoding techniques, many of which were originally derived from scientific publications.
The Category Encoders transformers support both NumPy arrays and pandas dataframes as input, are fully compatible with Scikit-learn functionality, and can be used within pipelines. In addition to the more commonly implemented encoders mentioned above, Category Encoders also offers special use-case encoders, such as the binary, hashing and BaseN encoders.
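For instance, here is a minimal sketch, with hypothetical toy data, of a target-based encoder from Category Encoders used inside a Scikit-learn pipeline:

import pandas as pd
from category_encoders import TargetEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# toy data, for illustration only
X = pd.DataFrame({'cabin': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B'],
                  'pclass': ['1', '3', '1', '2', '3', '1', '2', '3']})
y = pd.Series([1, 0, 1, 0, 1, 1, 0, 0])

# the encoder replaces each category with a smoothed mean of the target
pipe = Pipeline([
    ('encoder', TargetEncoder(cols=['cabin', 'pclass'])),
    ('model', LogisticRegression()),
])
pipe.fit(X, y)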
In the following paragraphs, we will compare the implementation of Ordinal encoding among the 3 Python open source libraries.
Ordinal encoding
Ordinal encoding replaces the categories with digits. For a categorical variable with n unique categories, ordinal encoding will replace the categories with integers from 0 to n-1.
Feature-engine
Feature-engine’s OrdinalEncoder() works only with categorical variables. A list of variables to encode can be indicated, or the encoder will automatically select all categorical variables in the train set.
If we select "arbitrary" as the encoding method, the encoder will assign numbers in the sequence in which the labels appear in the variable (i.e., first come, first served).
If we select "ordered", the encoder will assign the numbers according to the mean of the target value for each label, so that the order of the assigned digits follows the order of the target mean across the categories.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.encoding import OrdinalEncoder

# Load dataset
def load_titanic():
    data = pd.read_csv(
        'https://www.openml.org/data/get_csv/16826755/phpMYEkMl'
    )
    data = data.replace('?', np.nan)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['embarked'].fillna('C', inplace=True)
    return data

data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'],
    test_size=0.3,
    random_state=0
)

# set up the encoder
encoder = OrdinalEncoder(
    encoding_method='arbitrary',
    variables=['pclass', 'cabin', 'embarked']
)

# fit the encoder
encoder.fit(X_train, y_train)

# transform the data
train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)
The output of the precedent code block is the original pandas dataframe where the selected categorical variables were transformed into numerical.
Scikit-learn
Scikit-learn’s OrdinalEncoder() requires the input to be sliced to the categorical variables we want to encode. During the encoding, the numbers are simply assigned following the alphabetical order of the labels.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Load dataset
def load_titanic():
    data = pd.read_csv(
        'https://www.openml.org/data/get_csv/16826755/phpMYEkMl'
    )
    data = data.replace('?', np.nan)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['embarked'].fillna('C', inplace=True)
    return data

data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'],
    test_size=0.3,
    random_state=0
)

# set up the encoder
encoder = OrdinalEncoder()

# fit the encoder
encoder.fit(
    X_train[['pclass', 'cabin', 'embarked']],
    y_train
)

# transform the data
train_t = encoder.transform(
    X_train[['pclass', 'cabin', 'embarked']]
)
test_t = encoder.transform(
    X_test[['pclass', 'cabin', 'embarked']]
)
The output of the preceding code block is a NumPy array with 3 columns, corresponding to the encoded variables.
Category encoders
Category encoders’ OrdinalEncoder() allows us to specify the variables/columns to transform as a parameter. An optional mapping dictionary can be passed as well, for cases where we know that there is a true order to the classes. Otherwise, the classes are assumed to have no true order and numbers are assigned to the labels at random.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from category_encoders.ordinal import OrdinalEncoder

# Load dataset
def load_titanic():
    data = pd.read_csv(
        'https://www.openml.org/data/get_csv/16826755/phpMYEkMl'
    )
    data = data.replace('?', np.nan)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['embarked'].fillna('C', inplace=True)
    return data

data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'],
    test_size=0.3,
    random_state=0
)

# set up the encoder
encoder = OrdinalEncoder(cols=['pclass', 'cabin', 'embarked'])

# fit the encoder
encoder.fit(X_train, y_train)

# transform the data
train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)
Transformation
Data scientists transform numerical variables with various mathematical functions, e.g., the logarithm, powers and the reciprocal, with the general aim of obtaining a more "Gaussian"-looking distribution.

Scikit-learn offers the FunctionTransformer() which, in principle, can apply any function defined by the user. It takes the function as an argument, either as a NumPy method, or as a lambda function.
Feature-engine, instead, supports mathematical transformations through dedicated transformers, such as the LogTransformer() and the ReciprocalTransformer().
When it comes to "automatic" transformations, both Scikit-learn and Feature-engine support the Yeo-Johnson and Box-Cox transformations. Scikit-learn centralizes both within the PowerTransformer(), selected through the ‘method’ argument, whereas Feature-engine has 2 individual transformers, one for Yeo-Johnson and one for Box-Cox.
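Below is a minimal sketch of both approaches with hypothetical toy data; the dedicated Feature-engine transformer is assumed to be the YeoJohnsonTransformer() available in recent versions of the library:

import pandas as pd
from sklearn.preprocessing import PowerTransformer
from feature_engine.transformation import YeoJohnsonTransformer

# toy data; Yeo-Johnson also accepts zero and negative values
X = pd.DataFrame({'LotArea': [8450.0, 9600.0, 11250.0, 0.0],
                  'GrLivArea': [1710.0, 1262.0, 1786.0, 1500.0]})

# Scikit-learn: one transformer, the transformation is chosen via 'method'
X_sk = PowerTransformer(method='yeo-johnson').fit_transform(X)  # NumPy array

# Feature-engine: a dedicated Yeo-Johnson transformer
X_fe = YeoJohnsonTransformer(variables=['LotArea', 'GrLivArea']).fit_transform(X)  # dataframe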
Feature-engine raises an error if a transformation is not mathematically possible, for example log(0) or the reciprocal of 0, whereas Scikit-learn will introduce NaNs or infinite values instead, requiring a sanity check afterwards.
In the next paragraphs, we will compare the implementation of the logarithmic and Box-Cox transformations between the packages. For the demonstrations, we use the house prices data set from Kaggle.
Logarithmic Transformation
The logarithmic transformation consists of applying the logarithm to the variable values.
Feature-engine
Feature-engine’s LogTransformer() applies the natural logarithm or the base-10 logarithm to numerical variables. It only works with positive values; if the variable contains a 0 or a negative value, the transformer will raise an error.
As with all Feature-engine’s transformers, the LogTransformer() allows us to select the variables to transform. A list of variables can be passed as an argument, or alternatively, the transformer will automatically select and transform all numerical variables.
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.transformation import LogTransformer

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0
)

# set up the variable transformer
tf = LogTransformer(variables=['LotArea', 'GrLivArea'])

# fit the transformer
tf.fit(X_train)

# transform the data
train_t = tf.transform(X_train)
test_t = tf.transform(X_test)
Scikit-learn
Scikit-learn applies the logarithmic transformation through its FunctionTransformer() by passing the logarithmic function as a NumPy method into the transformer, as shown below.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import FunctionTransformer

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0
)

# set up the variable transformer
tf = FunctionTransformer(np.log)

# fit the transformer
tf.fit(X_train[['LotArea', 'GrLivArea']])

# transform the data
train_t = tf.transform(X_train[['LotArea', 'GrLivArea']])
test_t = tf.transform(X_test[['LotArea', 'GrLivArea']])
Box-Cox Transformation
The Box-Cox transformation is a method of transforming non-normal variables by using a transformation parameter λ.
Feature-engine
The BoxCoxTransformer() applies the Box-Cox transformation to numerical variables and works only with non-negative variables.
A list of variables to modify can be passed as an argument, or the BoxCoxTransformer() will automatically select and transform all numerical variables.
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.transformation import BoxCoxTransformer

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0
)

# set up the variable transformer
tf = BoxCoxTransformer(variables=['LotArea', 'GrLivArea'])

# fit the transformer
tf.fit(X_train)

# transform the data
train_t = tf.transform(X_train)
test_t = tf.transform(X_test)
This transformer uses scipy.stats.boxcox under the hood and returns the result as a pandas dataframe.
Scikit-learn
Scikit-learn offers both Box-Cox and Yeo-Johnson transformation through its PowerTransformer(). Box-Cox requires the input data to be strictly positive values.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0
)

# set up the variable transformer
tf = PowerTransformer(method="box-cox")

# fit the transformer
tf.fit(X_train[['LotArea', 'GrLivArea']])

# transform the data
train_t = tf.transform(X_train[['LotArea', 'GrLivArea']])
test_t = tf.transform(X_test[['LotArea', 'GrLivArea']])
As with all Scikit-learn transformers, the results are returned as a NumPy array.
Discretization
Discretization partitions continuous numerical variables into discrete, contiguous intervals that span the full range of the variable’s values. Discretization is often used to improve the signal-to-noise ratio of a variable and to reduce the influence of outliers.

Scikit-learn offers KBinsDiscretizer() as a centralized transformer, through which we can do equal-width, equal-frequency, and k-means discretization. With the KBinsDiscretizer() we can optimize the model through grid search over all discretization techniques.
With Feature-engine, the discretization procedures are implemented through separate transformers. Feature-engine supports
- Equal-width discretization
- Equal-frequency discretization
- Discretization with decision trees
- Arbitrary discretization
Additionally, Scikit-learn allows us to one-hot encode the bins straight away, just by setting the ‘encode’ parameter. With Feature-engine, if we wish to treat the bins as categories, we can follow the discretization transformer with any of the categorical encoders, as in the sketch below.
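As a minimal sketch of that idea, with synthetic data, we can ask the EqualFrequencyDiscretiser() to return the bins as object type (using its return_object parameter, available in recent Feature-engine versions) and then one-hot encode them with Feature-engine’s OneHotEncoder():

import numpy as np
import pandas as pd
from feature_engine.discretisation import EqualFrequencyDiscretiser
from feature_engine.encoding import OneHotEncoder
from sklearn.pipeline import Pipeline

# synthetic numerical variable, for illustration only
X = pd.DataFrame({'GrLivArea': np.random.RandomState(0).uniform(500, 4000, 100)})

# discretize into 5 bins returned as object, then one-hot encode the bins
pipe = Pipeline([
    ('disc', EqualFrequencyDiscretiser(q=5, variables=['GrLivArea'], return_object=True)),
    ('ohe', OneHotEncoder(variables=['GrLivArea'])),
])
X_t = pipe.fit_transform(X)  # dataframe with one binary column per bin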
In the following paragraphs, we will compare the implementation of equal frequency discretization between the packages.
Equal Frequency Discretization
This type of discretization sorts the variable values into a predefined number of contiguous intervals, each containing roughly the same proportion of observations. The interval limits are normally the percentiles.
Feature-engine
EqualFrequencyDiscretiser() sorts the numerical variable values into contiguous intervals of equal proportion of observations, where the interval limits are calculated according to percentiles.
The number of intervals into which the variable should be divided is determined by the user. The transformer can return the variable as either numeric or object (numeric being the default).
As with all Feature-engine transformers, a list of variables can be indicated, or the discretizer will automatically select all numerical variables in the train set.
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.discretisation import EqualFrequencyDiscretiser

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0
)

# set up the discretisation transformer
disc = EqualFrequencyDiscretiser(
    q=10,
    variables=['LotArea', 'GrLivArea']
)

# fit the transformer
disc.fit(X_train)

# transform the data
train_t = disc.transform(X_train)
test_t = disc.transform(X_test)
When it fits the data, the EqualFrequencyDiscretiser() finds the interval boundaries for each variable. It then transforms the variables by sorting the values into those intervals, and returns a pandas dataframe.
Scikit-learn
Scikit-learn can implement equal-frequency discretization through its KBinsDiscretizer() transformer, by setting the "strategy" parameter to "quantile".
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0
)

# set up the discretisation transformer
disc = KBinsDiscretizer(n_bins=10, strategy='quantile')

# fit the transformer
disc.fit(X_train[['LotArea', 'GrLivArea']])

# transform the data
train_t = disc.transform(X_train[['LotArea', 'GrLivArea']])
test_t = disc.transform(X_test[['LotArea', 'GrLivArea']])
By default, the output is one-hot encoded into a sparse matrix. This behaviour can be changed with the "encode" parameter, for example by setting it to "ordinal" to return ordinal bin numbers instead, as shown below.
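For example, a minimal sketch with synthetic data that returns ordinal bin numbers as a dense NumPy array instead:

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# synthetic data, for illustration only
X = np.random.RandomState(0).uniform(500, 4000, size=(100, 2))

# 'ordinal' returns the bin number (0-9) per observation as a dense NumPy array
disc = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')
X_t = disc.fit_transform(X)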
Wrapping up
Feature engineering is an essential component of end-to-end data science and machine learning pipelines. It is an iterative process that every data scientist should master in order to optimize model performance. Feature engineering is very time-consuming, and the small efficiencies gained by knowing the advantages and quirks of each Python library will definitely add up throughout your workflow.
References
- Feature Engineering for Machine Learning – Online Course
- Python Feature Engineering Cookbook – Book
- Feature-engine: Python library for Feature Engineering
- Preprocessing data with Scikit-learn
Related articles
This article is the eighth in a series of articles on feature engineering for Machine Learning. You can learn more about how data scientists preprocess their data at the following links: