Data Science Buzzwords: Feature Engineering

There is a lot of vocabulary to learn in Data Science; here's a quick summary of Feature Engineering in under 5 minutes.

Dr Shruti Turner
Towards Data Science


Feature Engineering is one of those terms that, on the surface, seems to mean exactly what it says: you want to rework the data you have or create something new from it.

Okay, fine…but what does that actually mean in real life when you’re sitting in front of your data set and wondering what to do?


What makes up Feature Engineering?

The term encompasses a range of methods, each with its own family of sub-methods. I'm just going to cover some of the main ones to give you an idea of what Feature Engineering involves, with some indication of widely used techniques.

Encoding — I think this is one of the simplest and most commonly used aspects of Feature Engineering. In fact, I was doing this long before I realised Feature Engineering as a whole existed! Encoding transforms your categorical data into numerical data. This is necessary because computational algorithms deal with numbers, not words. A couple of the simpler methods are Label Encoding and One-Hot Encoding.
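As a quick illustration, here's a minimal sketch of both encoders using scikit-learn on a made-up colour column (the data is invented, and the sparse_output argument assumes scikit-learn 1.2 or later):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# A toy data set with one categorical column (hypothetical data).
df = pd.DataFrame({"colour": ["red", "blue", "green", "blue"]})

# Label encoding: each category becomes an integer (0, 1, 2, ...).
label_enc = LabelEncoder()
df["colour_label"] = label_enc.fit_transform(df["colour"])

# One-hot encoding: one binary column per category.
onehot_enc = OneHotEncoder(sparse_output=False)
onehot = onehot_enc.fit_transform(df[["colour"]])
onehot_df = pd.DataFrame(onehot, columns=onehot_enc.get_feature_names_out(["colour"]))

print(df.join(onehot_df))
```

One thing to note: label encoding implies an ordering (0 < 1 < 2), so one-hot encoding is often the safer choice for categories with no natural order.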

Feature Generation — Feature Generation or Feature Construction is where you manually create new features from the ones you already have in your data. This might be something as simple as joining two columns together, or something that takes a bit more code, like calculating how many entries there were in a certain time period before the one in question. Again, you might do this sort of thing anyway, just because it makes your life a bit easier. There are several benefits of Feature Generation (see the sketch after this list):

  1. It can give you one column of data that captures what two columns do. Used in conjunction with encoding, you can look at the specifics of your data in more detail, e.g. if you have a column for car manufacturer and a column for car colour, you can combine these into a single manufacturer_colour column. This may cut down the number of features in your final model (yay efficiency!) without losing information.
  2. You can surface information at a glance and use it to simplify your data, e.g. a calculation of the time since an event or the number of previous entries, making the data clearer to work with.
  3. You can simplify your data. For instance, if you have hundreds of different countries in your data, many occurring only once or twice, you might want to create a new feature that gives the relevant continent instead. That way, you have a higher number of instances of fewer categories, which might be more meaningful for your analysis.
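To make those three ideas concrete, here's a small sketch using pandas on a hypothetical car sales data set (the column names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical car sales data set.
df = pd.DataFrame({
    "manufacturer": ["Ford", "Ford", "Toyota"],
    "colour": ["red", "blue", "red"],
    "sale_date": pd.to_datetime(["2021-01-05", "2021-03-20", "2021-02-14"]),
})

# 1. Join two columns into one combined feature.
df["manufacturer_colour"] = df["manufacturer"] + "_" + df["colour"]

# 2. Time since an event: days elapsed since the first sale in the data.
df["days_since_first_sale"] = (df["sale_date"] - df["sale_date"].min()).dt.days

# 3. Count of previous entries for the same manufacturer (entries before this one).
df = df.sort_values("sale_date")
df["previous_sales_by_manufacturer"] = df.groupby("manufacturer").cumcount()

print(df)
```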

Feature Extraction — This is where you create features automatically, directly from the raw data. It's normally used for image, audio or text data (though it's possible to use it for numerical data when there's loads of it!). You use Feature Extraction to reduce the complexity of the data for analysis, reducing its dimensionality, which allows your model to run more quickly. Common methods are Principal Component Analysis and unsupervised clustering algorithms. Points 2 and 3 from the list above are also relevant here.
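As an illustration, here's a minimal Principal Component Analysis sketch using scikit-learn on randomly generated data (the shapes and the choice of 10 components are arbitrary, just for the example):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical high-dimensional numerical data: 200 samples, 50 features.
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(200, 50))

# Reduce to 10 principal components; each new feature is a linear
# combination of the originals, ordered by the variance it explains.
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (200, 10)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```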

Feature Selection — This is a super important part of Feature Engineering: once you've made a whole load of new features, you need to work out which ones are actually useful to you. You can do this with some trial and error, but it is usually more efficient to use a built-in function. A simple example is Univariate Feature Selection, e.g. SelectKBest, which selects the given number of features (you decide how many) that have the strongest relationship with the dependent variable. There are lots of options to make sure you're using the right algorithm for your data.
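Here's a short sketch of Univariate Feature Selection with scikit-learn's SelectKBest (the data set is synthetic, and k=5 is just an illustrative choice):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical data: 100 samples, 20 features, only 5 actually informative.
X, y = make_classification(n_samples=100, n_features=20,
                           n_informative=5, random_state=0)

# Keep the k features with the strongest univariate relationship
# to the dependent variable (ANOVA F-test for classification).
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                    # (100, 5)
print(selector.get_support(indices=True))  # indices of the chosen features
```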

Take Home Message

There are lots of ways you can manipulate your data into a format the computer is happy with, simplify it, or even create new variables for analysis. There is no one "right way": a lot of it is an art form and comes down to experience. Different solutions will work more effectively for different data sets.

Helpful Resources

I have linked to Python documentation throughout the article so you can look at the parts of interest in more detail, perhaps with a view to implementing the methods in your own code.

Kaggle — Feature Engineering Short Course. This course isn't aimed at beginners to programming or Machine Learning, and it only briefly covers three aspects of Feature Engineering. Personally, I don't think it's the best course for in-depth learning, but it's a good place to start if you want to dip your toe into the area.
