
A data analysis project always begins with a dataset. This may have been delivered by the customer, found publicly on sites like Kaggle.com, or created by us and our team.
In any of these cases, the dataset will have a structure that varies according to the type of phenomenon it describes, and a certain number of columns will make up that structure.
During the development of our project, the team will be interested in several aspects of this dataset:
- How representative of the phenomenon is the actual data?
- What are the data types of our columns?
- How many rows and columns are there?
And many others. It is important to answer these questions because they help define the scope of our team’s investigation.
They help us understand how much EDA (exploratory data analysis) to do, what we can predict and how to do it (with machine learning or other statistical methods) and how to structure a data preprocessing plan.
I have touched on some of these issues in dedicated articles. You can read them here:
- Building Your Own Dataset: Benefits, Approach, and Tools
- Exploratory Data Analysis in Python – A Step-by-Step Process
- How To Structure Your Machine Learning Project
An analysis pipeline usually goes through several steps, one of which is called feature engineering.
During this phase, the analyst modifies, transforms and adds columns (also called features, dimensions, variables and so on) to the dataset with the aim of enriching the information about the phenomenon being described. Usually this is done in preparation for a subsequent modeling phase.
For example, if we are studying a time series, one feature engineering activity is to create additional columns from the date present in the series, extracting the year, month, day, whether the day falls on a weekend, and so on.
The hypothesis is that the target variable (if we are talking about supervised learning) can be modeled better if we add more information about it: in this case we enrich the series by engineering some temporal features.
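As a quick illustration, here is a minimal pandas sketch of this kind of temporal feature engineering; the `date` and `sales` column names are placeholders, not taken from a real project.

```python
import pandas as pd

# Toy time series; in practice df would be your own dataset.
df = pd.DataFrame({
    "date": pd.date_range("2022-01-01", periods=10, freq="D"),
    "sales": range(10),
})

# Derive temporal features from the date column.
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day
df["day_of_week"] = df["date"].dt.dayofweek              # Monday=0 ... Sunday=6
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)

print(df.head())
```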
But sometimes this process can give rise to some problems and these can be difficult to recognize. Such problems often lead to overfitting the training set.
I’ve already written about overfitting and why it’s one of the biggest obstacles in machine learning. I recommend that you read this article if you are interested in finding out more about it.
In this article I will write specifically about the issues that can arise when we go too far with feature engineering, meaning when we add too much information to our dataset.
Why do feature engineering?
The motivation for feature engineering is to increase the availability of valuable information and train the model by including this new "perspective" on the data.
This interpretation is sound. Wanting to add information is generally an activity to be pursued, but only if we are sure the data is actually useful to the model.
As I often write and publish on the web, our first concern should be to collect high quality data, and as much of it as possible, in order to best represent the phenomenon we want to study and model.
By "high quality" I mean data that contains examples that are as close as possible to the observable reality.
We want to be sure that the answers to our questions are contained in our dataset.
Doing feature engineering, therefore, can help us answer these questions in an analysis phase or help our model find an easier way to predict the target.
Why is it a problem to have a large number of features?
The unwary analyst, however, could run into problems by adding too many new variables to the dataset.
Some of these might actually be useful, while others might even hinder the model’s ability to generalize.
A model that receives potentially irrelevant information could have a hard time generalizing the phenomenon.
More formally, an analyst who adds too many columns to the dataset risks mistakes such as:
- adding irrelevant information, which lowers the signal-to-noise ratio
- increasing the complexity of the phenomenon to be mapped
- introducing confounds into the dataset
- ending up with more columns than rows
Each of these aspects can negatively affect our model’s ability to produce usable results. Let’s see how.
Add irrelevant information (lowering the signal-to-noise ratio)
The more representative the examples in our dataset are of the phenomenon, the higher the signal-to-noise ratio will be. We always want to maximize this ratio so that the sample aligns as closely as possible with the population from which the data was collected.
Each example in our dataset should be described by features that are representative of the phenomenon we want to study.
If this is not the case, the example is essentially useless. However, this uselessness is not filtered out by our model, which will try to learn the best way to map X to y using this noisy information, and that degrades the model’s performance.
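To make this concrete, here is a minimal, hypothetical sketch: we take a synthetic regression dataset, append columns of pure noise to imitate over-zealous feature engineering, and compare cross-validated scores. The dataset and model are illustrative choices, not an experiment from this article.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic data where all 5 features carry signal.
X, y = make_regression(n_samples=300, n_features=5, n_informative=5,
                       noise=10, random_state=0)

# Append 50 columns of pure noise; the cross-validated score will
# typically drop somewhat compared to the clean feature set.
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 50))])

model = RandomForestRegressor(n_estimators=200, random_state=0)
print("R2, informative features only:", cross_val_score(model, X, y, cv=5).mean())
print("R2, with 50 noise columns:    ", cross_val_score(model, X_noisy, y, cv=5).mean())
```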
Increase the complexity of the phenomenon to be mapped
The more complex the phenomenon we want to model, the more difficult it will be for the model to find a function that describes that behavior.
What do I mean by complexity? In this case I simply mean the number of columns. Each column describes something about the phenomenon, and a richer description naturally brings a growth in complexity.
Sometimes a phenomenon is inherently complex to model (such as weather forecasts) – in this case we need to be able to understand how much the columns we add further complicate the problem.
Adding confounds
A confound is a variable that "confuses" the model. Variables that are related to each other but have no real impact on the target (spurious correlations and the like) can mislead the model in the training phase.
In this case, the model erroneously learns that the variable K exerts a big impact on the target; when the model is then deployed in production, its performance falls far short of what was seen in training.
Again, the solution is the same: carefully study the nature of the variable added to the dataset, thinking about how this might actually help the model.
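As a small, hypothetical illustration of how spurious correlations appear, the sketch below shows that a completely unrelated random variable can correlate noticeably with the target in a small sample, while the correlation fades toward zero on a larger one.

```python
import numpy as np

rng = np.random.default_rng(42)

# A target and an unrelated random variable K, first with few rows, then many.
y_small, k_small = rng.normal(size=20), rng.normal(size=20)
y_large, k_large = rng.normal(size=20_000), rng.normal(size=20_000)

# With only 20 rows, chance alone often produces a non-trivial correlation;
# with 20,000 rows the spurious relationship hovers near zero.
print("corr on 20 rows:    ", np.corrcoef(y_small, k_small)[0, 1])
print("corr on 20,000 rows:", np.corrcoef(y_large, k_large)[0, 1])
```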
Have more columns than rows
This is a somewhat borderline scenario, but it can happen if the analyst uses external libraries to do feature engineering. For example, if you want to do feature engineering on a time series made up of financial candles, technical analysis libraries can append hundreds of additional columns to the dataset.
Columns: things to learn. Rows: examples to learn from.
Basically, my advice is to always avoid the scenario where we have more columns than rows.
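A simple, hypothetical guard is to check the dataset’s shape before training; the 50×300 frame below just mimics a candle dataset after a technical analysis library has appended hundreds of indicator columns.

```python
import numpy as np
import pandas as pd

# Pretend dataset: 50 candles, 300 engineered columns.
df = pd.DataFrame(np.random.rand(50, 300))

n_rows, n_cols = df.shape
if n_cols >= n_rows:
    print(f"Warning: {n_cols} columns for only {n_rows} rows - "
          "more things to learn than examples to learn from.")
```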
How to identify the most important features?
We have seen how an overly liberal feature engineering process can damage our model.
By contrast, the process of identifying which features to include in our dataset is called feature selection.
Feature selection helps us isolate the variables that contribute most to model performance.
There are various approaches to feature selection. This article will not go into detail on the topic, but I can point you to a piece on Boruta, a feature selection library for Python. This library is easy to use and allows us to calculate the importance of each feature in the dataset. An important feature should certainly be included in the training set, while others could be removed or transformed.
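For reference, here is a minimal sketch of how BorutaPy is typically used; the synthetic dataset and the random forest estimator are illustrative choices.

```python
import numpy as np
from boruta import BorutaPy
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification problem: 20 features, only 5 of them informative.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

forest = RandomForestClassifier(n_jobs=-1, max_depth=5, random_state=0)
selector = BorutaPy(forest, n_estimators="auto", random_state=0)
selector.fit(X, y)  # BorutaPy works on numpy arrays

print("Selected feature indices:", np.where(selector.support_)[0])
print("Feature rankings:        ", selector.ranking_)
```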
Conclusion
Feature engineering is certainly an activity we must always think about, but we should take care not to overdo it.
It is easy to think that a certain variable can have a positive impact on the performance of the model – the best way to test such a hypothesis is through experimentation. Boruta helps in this regard: when adding a new variable we can use this library (together with other approaches) to estimate the relevance of that feature to the results.
If we do not want to study the features with a selection process, the advice is to avoid the pitfalls mentioned above and to evaluate how the model’s performance changes with the gradual insertion of more variables.
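A minimal sketch of that incremental approach, again on a synthetic dataset with illustrative model choices, could look like this:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# 12 features, only the first 4 carry signal (shuffle=False keeps them first).
X, y = make_regression(n_samples=400, n_features=12, n_informative=4,
                       noise=15, shuffle=False, random_state=1)

# Add one column at a time and watch the cross-validated score.
model = Ridge()
for k in range(1, X.shape[1] + 1):
    score = cross_val_score(model, X[:, :k], y, cv=5).mean()
    print(f"first {k:2d} features -> mean R2: {score:.3f}")
```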
If you want to support my content creation activity, feel free to follow my referral link below and join Medium’s membership program. I will receive a portion of your investment and you’ll be able to access Medium’s plethora of articles on Data Science and more in a seamless way.
Till next time! 👋
Recommended Reads
For those interested, here is a list of books I recommend for each ML-related topic. These are ESSENTIAL books in my opinion and have greatly impacted my professional career. Disclaimer: these are Amazon affiliate links. I will receive a small commission from Amazon for referring these items to you. Your experience won’t change and you won’t be charged more, but it will help me scale my business and produce even more content around AI.
- Intro to ML: Confident Data Skills: Master the Fundamentals of Working with Data and Supercharge Your Career by Kirill Eremenko
- Sklearn / TensorFlow: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurelien Géron
- NLP: Text as Data: A New Framework for Machine Learning and the Social Sciences by Justin Grimmer
- Sklearn / PyTorch: Machine Learning with PyTorch and Scikit-Learn: Develop machine learning and deep learning models with Python by Sebastian Raschka
- Data Viz: Storytelling with Data: A Data Visualization Guide for Business Professionals by Cole Knaflic
Useful Links (written by me)
- Learn how to perform a top-tier Exploratory Data Analysis in Python: Exploratory Data Analysis in Python – A Step-by-Step Process
- Learn the basics of TensorFlow: Get started with TensorFlow 2.0 – Introduction to deep learning
- Perform text clustering with TF-IDF in Python: Text Clustering with TF-IDF in Python