
5 Best Practices for Feature Engineering in Machine Learning Projects

Improving your Machine Learning Experience

Photo by Daniel Chekalov on Unsplash

When approaching a new Machine Learning problem, there is no way of knowing from the outset what the solution will be until a variety of experiments have been tried and tested. Over time, practitioners have applied many different techniques and observed what tends to work and what does not across the majority of Machine Learning projects. From this experience, a set of best practices has emerged for the Feature Engineering step within a Machine Learning pipeline.


Mind you, each of these best practices may or may not improve your solution for a specific problem, but they are unlikely to be detrimental to the final output.

1 Generate Simple Features

When you first begin the modeling process, attempt to generate as many simple features as possible. By simple, I mean features that do not take long to code. For instance, instead of training a Word2vec model, start with a simple Bag-of-Words, which generates thousands of features with minimal code. The idea is to use anything measurable as a feature in the beginning, since there is no reliable way to know in advance whether a feature, or a combination of features, will be useful for prediction.
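A minimal Bag-of-Words sketch using scikit-learn (the example documents below are made up):

```python
# Bag-of-Words: one count feature per unique token, with minimal code.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

vectorizer = CountVectorizer()       # builds the vocabulary automatically
X = vectorizer.fit_transform(docs)   # sparse (n_docs, vocabulary_size) matrix

print(X.shape)
print(vectorizer.get_feature_names_out())
```

On a real corpus this produces thousands of columns in a few lines of code, which is exactly the kind of cheap, measurable feature set to start with.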

2 IDs can be Features (When they are Required)

It may sound silly to add IDs to the feature set, since a unique ID on its own probably wouldn’t contribute much to the model’s ability to generalize. However, including IDs enables practitioners to create a model that has one behavior in the general case and different behaviors in specific cases.

For instance, say we want to make a prediction about a location based on features that describe it. By including location IDs in the feature set, we can add more training examples for locations in general while still training the model to behave differently for specific locations.
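A minimal sketch of how this might look, assuming a toy DataFrame (the column names and values here are made up):

```python
# One-hot encode a location ID so the model can learn location-specific
# behavior on top of the shared, general features.
import pandas as pd

df = pd.DataFrame({
    "location_id": ["loc_1", "loc_2", "loc_1", "loc_3"],
    "temperature": [21.0, 18.5, 22.3, 19.1],
    "humidity": [0.40, 0.55, 0.38, 0.60],
})

features = pd.get_dummies(df, columns=["location_id"])
print(features.head())
```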

3 Reduce Cardinality (When Possible)

As a general rule of thumb, if we have a categorical feature with many unique values (say more than 12), we should only use this feature if we want the model to behave differently depending on it. For instance, there are 50 states in the US, so you may consider using a feature called "States" if you’d like the model to behave one way in California and another way in Florida.

On the other hand, if we do not need a model that behaves differently depending on the "States" feature then it would be better for us to reduce the cardinality of the "States" feature.

We will cover techniques to do this in another article.
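In the meantime, here is a minimal sketch of one common approach: collapsing infrequent categories into a single "Other" bucket (the data and frequency threshold below are purely illustrative):

```python
# Reduce cardinality by grouping rare categories into an "Other" bucket.
import pandas as pd

df = pd.DataFrame({"state": ["CA", "CA", "FL", "NY", "TX", "WY", "VT"]})

counts = df["state"].value_counts()
keep = counts[counts >= 2].index  # the threshold here is arbitrary

df["state_reduced"] = df["state"].where(df["state"].isin(keep), "Other")
print(df["state_reduced"].unique())  # ['CA' 'Other']
```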

4 Be Cautious about Counts

There are some instances where counts remain roughly within the same bounds as time goes on, as is the case with Bag-of-Words (BoW) counts when document lengths do not grow or shrink over time.

There are, however, instances where counts can cause problems. Take, for instance, a scenario where we have a feature that counts the number of calls a user has made since subscribing to a service. If the business providing the subscription service has been around for a long time, some subscribers will have joined long before the most recent ones, and are therefore likely to have made far more calls than a recent subscriber.

Values that are rare today may become frequent in the future as the data grows; therefore, it is important to reevaluate such features regularly.
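One simple mitigation is to normalize the count by each user’s tenure; a minimal sketch, with made-up column names:

```python
# Normalize a raw call count by subscription tenure so that long-standing
# and recent subscribers are on a comparable scale.
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2],
    "total_calls": [1200, 30],
    "months_active": [60, 3],
})

df["calls_per_month"] = df["total_calls"] / df["months_active"]
print(df)  # 20.0 vs 10.0 calls per month, bounded by behavior, not account age
```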

5 Do Feature Selection (When Necessary)

Here are some reasons that justify performing feature selection, which should only be done when it’s absolutely necessary (a short example follows the list):

  • The model has to be explainable so it’s better to keep the most important features
  • There are strict hardware requirements
  • There isn’t much time to perform lots of experiments and/or to rebuild the model for a production environment
  • There’s an expected distribution shift between multiple model training rounds
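When one of these reasons applies, here is a minimal sketch of univariate feature selection with scikit-learn (the synthetic data below stands in for a real feature matrix):

```python
# Keep the k features with the strongest univariate relationship to the
# target; synthetic data is used purely for illustration.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=50, random_state=0)

selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)  # (200, 10)
```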

Wrap Up

By no means have we covered an exhaustive list of best practices; as mentioned in the title, I only covered 5. The curious reader should consider purchasing [Machine Learning Engineering](https://geni.us/ML-engineering), the book that heavily inspired this article. Machine Learning Engineering was written by Andriy Burkov, the author of The Hundred-Page Machine Learning Book, and I highly recommend it to anyone seeking to improve their Machine Learning skills.

Note: By clicking on the book links above, you will be directed to Amazon via my affiliate link. I’ve also integrated geo-linking, so if you are not in the UK, you will automatically be directed to your local Amazon store.

Thanks for Reading!

If you enjoyed this article, connect with me by subscribing to my FREE weekly newsletter. Never miss a post I make about Artificial Intelligence, Data Science, and Freelancing.

Related Articles

A Peek at Data Sampling Methods

Tackling Different Types of Bias in Data Projects

7 Common Gotcha’s of Data Projects

