
Streamlining Feature Engineering Pipelines with Feature-engine

Learn about the challenges encountered when developing machine learning pipelines for deployment, and how open source software can help mitigate them.


Streamlining Feature Engineering Pipelines – Image from Pixabay, no attribution required

In many organizations, we create machine learning models that process a group of input variables to output a prediction. Some of these models predict, for example, the likelihood of a loan being repaid, the probability of an application being fraudulent, whether a car should be repaired or replaced after an accident, whether a customer will churn, and much more.

The raw data collected and stored by organizations is almost never suitable to train new machine learning models, or to be consumed by existing models to produce predictions. Instead, we perform an extensive amount of transformation before the variables can be utilized by machine learning algorithms. This collection of variable transformations is commonly referred to as feature engineering.

In this article, we describe the most common challenges encountered when building and deploying feature engineering and machine learning pipelines, and how open source software, including a new Python library called Feature-engine, can help mitigate some of these challenges.

What is feature engineering?

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. There are several aspects of raw data that need to be addressed. For example, data is very often missing, or variable values take the form of strings instead of numbers. Many machine learning libraries, like Scikit-learn, are unable to process missing values or strings, so we need to transform them into numerical values. In addition, some models tend to work better when the variables show certain characteristics, like a normal distribution or similar scales. Thus, we transform variables to give them these characteristics.

Feature engineering includes data transformation procedures to tackle all of these aspects, including imputation of missing data, encoding of categorical variables, transformation or discretization of numerical variables, and setting features on similar scales. In addition, feature engineering involves feature creation, by combining existing features into new variables, extracting information from dates, aggregating transaction data, or deriving features from time series, text, or even images.
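To make this concrete, here is a minimal sketch, using only pandas, of two of the transformations described above, median imputation and frequency encoding, applied to a hypothetical toy dataset (the column names and values are made up for illustration):

```python
import pandas as pd

# Toy dataset showing two common issues: missing values and a
# categorical variable stored as strings.
data = pd.DataFrame({
    "age": [25, None, 47, 31],
    "income": [40000, 52000, None, 61000],
    "city": ["London", "Paris", "London", "Madrid"],
})

# Imputation: replace missing numerical values with the median.
for col in ["age", "income"]:
    data[col] = data[col].fillna(data[col].median())

# Encoding: replace each category with its frequency in the data.
city_frequency = data["city"].value_counts(normalize=True)
data["city"] = data["city"].map(city_frequency)

print(data)
```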

In summary, feature engineering includes every aspect of data transformation and new data creation required to return datasets fit to train, and be consumed by, machine learning models. I have described the different feature engineering techniques extensively elsewhere. More details and code implementations can be found in the course "Feature Engineering for Machine Learning" and the book "Python Feature Engineering Cookbook".

Feature engineering is repetitive and time consuming

According to a survey published by Forbes, data scientists and machine learning engineers spend around 60% of their time cleaning and organizing data for analysis and machine learning; in other words, on data transformation and feature engineering.

Feature engineering can be quite repetitive, and data scientists tend to carry out the same types of transformations on the various data sources used to create the different machine learning models. Most data sources present the same challenges: lack of information or missing data, variables stored as strings instead of numbers, distributions that are not optimal for the performance of the machine learning model we intend to build, or variables on different scales. Thus, we find ourselves doing the same types of data transformations over and over, model after model. If we start coding from scratch every time, the whole process becomes extremely inefficient.

We should strive to minimize repetition and duplicated code, and to optimize reliability, in order to improve performance and efficiency and reduce the time data scientists spend on data processing. If data scientists spend less time cleaning and pre-processing data, they can spend more time creating innovative solutions, tackling more interesting challenges, and learning and developing in other areas.

Feature engineering when working in teams

Feature engineering is very repetitive, and we tend to carry out the same transformations on several of our data sources. Coding these transformations from scratch every time, or copying and pasting code from one notebook to another, is very inefficient and poor practice.

In scenarios like these, working in teams brings the additional complication that we could end up with different code implementations of the same techniques in both the research and production environments, each developed by a different team member. This is not only inefficient, but it also makes it very difficult to track what has been done within the team. We end up with multiple sources or versions of the same code, with little, if any, code standardization, versioning, and testing.

It is generally preferable to create or use tools that can be shared across the team. Common tools have several advantages: we don't need to rewrite our pipelines from scratch in every project, and they facilitate knowledge sharing. When all members can read from and use existing libraries, they learn from each other, and develop their skills, and those of their colleagues, faster.

It is also worth considering that, as data scientists research and develop machine learning models, code testing and unit testing are often omitted or forgotten. With procedural programming, frequently used within Jupyter notebooks, inputs derived from previous commands tend to be re-engineered and re-assigned over and over, and then used to train the machine learning models. This makes it hard to track transformations and code dependencies, which can themselves contain bugs, leading to bug propagation and difficulty in debugging.

Ad-hoc processes for feature engineering are not reproducible

Reproducibility is the ability to duplicate a machine learning model exactly, such that, given the same raw data as input, both models return the same output. This is, ultimately, what we aim to achieve between the models we develop in the research environment and those we deploy in the production environment.

Refactoring the feature engineering pipelines developed in the research environment to add unit and integration tests in the production environment is extremely time consuming, and it provides new opportunities to introduce bugs, or to find bugs introduced during model development. More importantly, refactoring code that achieves the same outcome but was written by different developers is extremely inefficient. Ultimately, we want to use the same code in our research and production environments, to minimize the deployment timeline and maximize reproducibility.

The surge of open source libraries for feature engineering

In the last few years, a growing number of open source Python libraries have begun to support feature engineering as part of the machine learning pipeline. Among these, Featuretools supports an exhaustive array of functions to work with transaction data and time series; Category encoders supports a comprehensive selection of methods to encode categorical variables; and Scikit-learn and Feature-engine support a wide range of transformations, including imputation, categorical encoding, discretization, and mathematical transformations, among others.

Open source projects simplify the development of machine learning pipelines

Using open source software helps data scientists reduce the time they spend on feature transformation, improves code-sharing standards across the team, and allows the use of well-versioned and tested code, minimizing deployment timelines while maximizing reproducibility. In other words, open source allows us to use the same code, with clear versions, in both the research and production environments, thereby removing or minimizing the amount of code that needs to be refactored to be put into production.

Why is using well-established open source projects more efficient? For multiple reasons. First, open source projects tend to be thoroughly documented, so it is clear what each piece of code intends to achieve. Second, well-established projects have been widely adopted and approved by the community, giving us peace of mind that the code is of good quality and will be maintained and improved in the years to come. Third, open source packages are extensively tested to prevent the introduction of bugs and maximize reproducibility. Fourth, packages are clearly versioned, so we can navigate to newer, or previous, implementations of the code to obtain the desired results. Fifth, open source packages can be shared, facilitating the adoption and spread of knowledge. And finally, utilizing open source packages removes the task of coding from our hands, dramatically improving team performance, reproducibility, and collaboration.

An extensive discussion on why we should use open source can be found in this article.

Feature-engine helps streamline our feature engineering pipelines

Feature-engine is an open source Python library that I created to simplify and streamline the implementation of an end-to-end feature engineering pipeline. Feature-engine was originally designed to accompany the course Feature Engineering for Machine Learning, but has now been adopted by the community and has a growing number of contributors to the code base.

Feature-engine preserves Scikit-learn functionality, with the methods fit() and transform() to learn parameters from the data and then transform it. Keep in mind that many feature engineering techniques need to learn parameters from the data, like statistical values or encoding mappings, in order to transform it. This Scikit-learn-like functionality, with the fit and transform methods, makes Feature-engine easy to use.
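A minimal sketch of this fit/transform interface, using Feature-engine's MeanMedianImputer on a hypothetical toy dataset (assuming a recent release, where the imputation transformers live in the feature_engine.imputation module):

```python
import pandas as pd
from feature_engine.imputation import MeanMedianImputer

train = pd.DataFrame({"age": [25, None, 47, 31],
                      "income": [40000, 52000, None, 61000]})
test = pd.DataFrame({"age": [None, 52.0],
                     "income": [45000.0, None]})

# fit() learns the median of each variable from the training data.
imputer = MeanMedianImputer(imputation_method="median",
                            variables=["age", "income"])
imputer.fit(train)
print(imputer.imputer_dict_)  # {'age': 31.0, 'income': 52000.0}

# transform() replaces missing values with the learned medians,
# in the training set and in any new data.
train_t = imputer.transform(train)
test_t = imputer.transform(test)
```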

Feature-engine includes multiple transformers to impute missing data, encode categorical variables, discretize or transform numerical variables, and remove outliers, providing an exhaustive battery of feature engineering transformations.

Some key characteristics of Feature-engine transformers are that i) they allow the selection of the variable subsets to transform directly at the transformer, ii) they take in a dataframe and return a dataframe, facilitating both data exploration and model deployment, and iii) they automatically recognize numerical and categorical variables, thus applying the right pre-processing to the right feature subsets, as illustrated in the sketch below.
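Here is a short sketch of points ii) and iii), using CountFrequencyEncoder on a made-up dataframe (again assuming a recent release, where the encoders live in feature_engine.encoding): when no variables are specified, the encoder finds the categorical variables by itself, and the output remains a pandas dataframe.

```python
import pandas as pd
from feature_engine.encoding import CountFrequencyEncoder

df = pd.DataFrame({
    "city": ["London", "Paris", "London", "Madrid"],
    "age": [25, 38, 47, 31],
})

# With variables=None (the default), the encoder automatically
# selects the categorical variables ("city") and leaves the
# numerical variable ("age") untouched.
encoder = CountFrequencyEncoder(encoding_method="frequency")
df_t = encoder.fit_transform(df)

print(type(df_t))  # a pandas DataFrame, ready for exploration
print(df_t)
```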

Feature-engine transformers can be assembled within a Scikit-learn pipeline, making it possible to store an entire machine learning pipeline in a single object that can be saved and retrieved at a later stage for batch scoring, or placed in memory for live scoring.
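A minimal sketch of such a pipeline, saved and reloaded with joblib (the data and file name are made up for illustration, and recent Feature-engine and Scikit-learn releases are assumed):

```python
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from feature_engine.encoding import OrdinalEncoder
from feature_engine.imputation import MeanMedianImputer

# Toy training data with one numerical and one categorical variable.
X_train = pd.DataFrame({
    "age": [25, None, 47, 31, 52, 38],
    "city": ["London", "Paris", "London", "Madrid", "Paris", "Madrid"],
})
y_train = pd.Series([0, 1, 0, 1, 1, 0])

# The whole feature engineering plus model sequence in one object.
pipe = Pipeline([
    ("imputer", MeanMedianImputer(imputation_method="median")),
    ("encoder", OrdinalEncoder(encoding_method="ordered")),
    ("model", LogisticRegression()),
])
pipe.fit(X_train, y_train)

# Persist the entire pipeline as a single object...
joblib.dump(pipe, "pipeline.joblib")

# ...and retrieve it later for batch or live scoring.
reloaded = joblib.load("pipeline.joblib")
print(reloaded.predict(X_train))
```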

Feature-engine is being actively developed and welcomes feedback from users and contributions from the community. More details on how to use Feature-engine can be found in its documentation and in this article:

Feature-engine: A new open source Python package for feature engineering
