What is Feature Engineering?

A brief introduction to a very broad machine learning concept

Max Steele (they/them)
Towards Data Science


It can be difficult to find any sort of consensus on what “feature engineering” specifically refers to. My goal for this post is to provide an introduction to this very broad, yet fundamental aspect of building successful machine learning (ML) models for new and aspiring data scientists. We’ll cover the difference between a variable and a feature, why feature engineering is important, and when you might want to engineer features. In future posts, I will walk through some basic examples of how to use Python, Pandas, and NumPy to engineer features.

Image by Kevin Ku via Unsplash

So What is Feature Engineering?

Some people consider feature engineering to include the data scrubbing that gets your data into a format usable by ML algorithms. This includes things like dealing with missing or null values, handling outliers, removing duplicate entries, encoding non-numerical data, and transforming and scaling variables. I tend to think of those things as important preprocessing steps, but mostly separate from feature engineering. I say “mostly” because (like almost everything else in data science) exploring, cleaning, and engineering features should be treated as part of an iterative process. Each of these steps tends to inform the others.
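
To make those preprocessing steps concrete, here is a minimal sketch in pandas. The DataFrame, its column names, and its values are all hypothetical, invented purely for illustration, and each step maps to one of the tasks listed above:

```python
import pandas as pd

# Hypothetical raw data; column names and values are invented for illustration.
df = pd.DataFrame({
    "age": [34, None, 52, 34],
    "income": [52000, 61000, None, 52000],
    "city": ["Denver", "Boston", "Denver", "Denver"],
})

df = df.drop_duplicates()                               # remove duplicate entries
df["age"] = df["age"].fillna(df["age"].median())        # fill missing values
df["income"] = df["income"].fillna(df["income"].median())
df = pd.get_dummies(df, columns=["city"])               # encode non-numerical data

# scale numeric columns to zero mean and unit variance
num_cols = ["age", "income"]
df[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()
```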

To me, feature engineering is focused on using the variables you already have to create additional features that are (hopefully) better at representing the underlying structure of your data. Feature engineering is a creative process that relies heavily on domain knowledge and the thorough exploration of your data. But before we go any further, we need to step back and answer an important question.

What are Features?

A feature is not just any variable in a dataset. A feature is a variable that is important for predicting your specific target and addressing your specific question(s). For example, a variable commonly found in many datasets is some form of unique identifier. This identifier may represent a unique individual, building, transaction, etc. Unique identifiers are very useful because they allow you to filter out duplicate entries or merge data tables, but unique IDs are not useful predictors. We wouldn’t include such a variable in our model because the model would simply memorize the training data, instantly overfitting without learning anything useful for predicting future observations. Thus a unique ID is a variable, but not a feature.
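
Here is a minimal, hypothetical sketch of that situation: the ID column earns its keep during data wrangling but gets dropped before modeling.

```python
import pandas as pd

# Hypothetical transactions table; all names and values are invented.
df = pd.DataFrame({
    "transaction_id": ["a1", "b2", "c3"],   # unique per row
    "amount": [19.99, 5.49, 102.00],
    "is_fraud": [0, 0, 1],
})

# The ID is great for deduplicating and joining tables...
df = df.drop_duplicates(subset="transaction_id")

# ...but it is not a feature: a model would just memorize each ID.
X = df.drop(columns=["transaction_id", "is_fraud"])
y = df["is_fraud"]
```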

So a feature can be thought of as a potentially useful predictor.
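
And sometimes the most useful predictors are ones you build yourself. As a minimal sketch (with another hypothetical dataset), two raw variables can be combined into a single feature that represents the underlying quantity far more directly:

```python
import pandas as pd

# Hypothetical health dataset; both columns are invented for illustration.
df = pd.DataFrame({
    "weight_kg": [70.0, 88.5, 61.2],
    "height_m": [1.75, 1.82, 1.60],
})

# Neither raw variable captures body composition on its own, but their
# familiar ratio (BMI = weight / height^2) often makes a better predictor.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
```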

What is the Purpose of Feature Engineering?

Very rarely, when you get your hands on a dataset, do you feel like you have all the information you could possibly want to tackle your problem. So what can you do if you can’t go collect more data or measure additional variables? You can engineer features.

Really what this means is applying domain knowledge to figure out how to use the information you do have in new ways to improve model performance. And you won’t fully understand the information you have until you start exploring your data. This is why it’s difficult to find very specific resources covering the topic of feature engineering. It is so dependent on

  • the domain you’re working in,
  • your specific problem or task within that domain,
  • the variables you already have,
  • and your ability to generate additional information.

Nobody can give you a step-by-step guide on which features you should engineer and how. Sorry.

Creating additional features that better emphasize the trends in your data has the potential to boost model performance. After all, the quality of any model is limited by the quality of the data you feed into it. Just because the information is technically already in your dataset does not mean a machine learning algorithm will be able to pick up on it. Important information can get lost amidst the noise and competing signals in a large feature space. Thus, in some ways, feature engineering is like trying to tell a model what aspects of the data are worth focusing on. This is where your domain knowledge and creativity as a data scientist can really shine!

When Should You Engineer Features?

As you explore the data you already have, here are a few questions to keep at the back of your mind (the sketch after this list shows what the first four might look like in code):

  1. Is it possible to gain information or reduce noisy signals by representing the same variable in a different way?
  2. Do any of the variables have important threshold values that are not explicitly reflected in how the variables are currently represented?
  3. Can any of the variables be decomposed into two or more variables that would provide useful information?
  4. Can any of the variables be combined in some way to become more informative than the sum of their parts?
  5. Do you have information that would allow you to scrape or otherwise obtain useful external data?
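
To ground the first four questions, here is a minimal sketch with a hypothetical housing dataset. All column names, values, and the 1978 lead-paint cutoff are illustrative assumptions, not prescriptions, and question 5 is omitted since it depends on external data sources:

```python
import numpy as np
import pandas as pd

# Hypothetical housing sales; everything here is invented for illustration.
df = pd.DataFrame({
    "sale_date": pd.to_datetime(["2020-01-15", "2020-06-30", "2020-11-02"]),
    "price": [250000, 425000, 310000],
    "sqft": [1400, 2600, 1850],
    "year_built": [1925, 1998, 1972],
})

# 1. Represent the same variable differently: log-transform a skewed price.
df["log_price"] = np.log(df["price"])

# 2. Make an important threshold explicit: homes built before 1978 may
#    contain lead paint (a domain-knowledge cutoff in US housing).
df["pre_1978"] = (df["year_built"] < 1978).astype(int)

# 3. Decompose one variable into several: a date hides month and
#    day-of-week signals.
df["sale_month"] = df["sale_date"].dt.month
df["sale_dayofweek"] = df["sale_date"].dt.dayofweek

# 4. Combine variables: price per square foot is often more informative
#    than price or size alone.
df["price_per_sqft"] = df["price"] / df["sqft"]
```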

If you answer “yes” to any of these questions, taking some time to engineer features is likely a useful endeavor.

In Conclusion

Feature engineering, like so many things in data science, is an iterative process. Investigating, experimenting, and doubling back to make adjustments are crucial. The insights you stand to gain into the structure of your data and the potential improvements to model performance are usually well worth the effort. Plus, if you’re relatively new to all this, feature engineering is a great way to practice working with and manipulating DataFrames! So stay tuned for future posts covering specific examples (with code) of how to do just that.
