Feature Engineering for Machine Learning (1/3)

Part 1: Data Preprocessing

Wing Poon
Towards Data Science


Image by Pete Linforth from Pixabay

The era of Deep Learning has popularized end-to-end machine learning, wherein raw data goes into one end of the pipeline and predictions come out the other. This has certainly produced inference speedups in some domains, especially in computer-vision pipelines, as evidenced, for example, by the higher frame-rates of Single-Shot Detectors compared with two-stage models that generate region proposals before performing detection. The ability of sophisticated models to automatically extract features has made it possible to trade off computational resources to save on human resources.

This approach has meant that machine-learning practitioners have been increasingly churning through models rather than honing their data. But there comes a point when the easy gains have been reaped, and doubling the model size can only eke out a tiny bit of performance improvement. This is when hand-crafting features can have a better payoff.

“Applied machine learning is basically feature engineering”

— Andrew Ng

In part, the tradeoff between automatic and hand-crafted features has been made possible by the richness, high dimensionality and abundance of visual data. When working with scarce or less feature-rich data (an all-too-common situation for data scientists asked to produce predictions from just a dozen or so features), feature engineering is essential, both to tease out all the available ‘signal’ present in the limited data and to overcome the limitations of popular machine-learning algorithms, for example their difficulty in separating data based on multiplicative or divisive feature interactions.

When competing in Kaggle competitions, the best teams don’t win simply on account of model selection, ensembling and hyperparameter tuning. They win, to a significant degree, on their ability to engineer new features, sometimes seemingly out of thin air, though more often than not born of a combination of truly understanding the data (bringing domain knowledge), supplementing it with auxiliary data, and the dogged, creative (more an art than a science) but tedious trial-and-error work of constructing and testing new features to see what works.

In this multi-part series, we’ll go over the three parts of a complete Feature Engineering pipeline:

  1. Data Preprocessing
  2. Feature Generation
  3. Feature Selection

These three steps are performed in order, but sometimes there’s ambiguity as to whether a certain technique constitutes data preprocessing, feature extraction or feature generation. We won’t get hung up on semantics here; instead, we’ll focus on surveying the gamut of techniques any good machine-learning practitioner or data scientist can bring to bear on a project.

The intent of this series is to raise awareness of these techniques, which are sometimes forgotten in the era of deep learning and billion-parameter models, so that you keep them in the back of your mind, and to point out library functions that can greatly facilitate their use. Describing the inner workings of each technique would require an article unto itself, and many such articles can be found here on Towards Data Science.

1. Data Preprocessing

1.1 Data Cleaning

“Garbage In, Garbage Out.”

During EDA, one of the first steps should be to check for and remove constant features. But surely the model can discover that on its own? Yes and no. Consider a Linear Regression model in which a non-zero weight has been learned for a constant feature. That term then serves as a secondary ‘bias’ term and seems harmless enough … but not if the ‘constant’ feature was constant only in our training data, and (unbeknownst to us) later takes on a different value in our production/test data.
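As a quick sanity check, a minimal sketch for flagging and dropping constant columns might look like this (the toy DataFrame is for illustration only):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [7, 7, 7], "c": ["x", "y", "x"]})  # toy example

# Columns with at most one unique value carry no signal in the training data;
# dropna=False also catches all-NaN columns.
constant_cols = [col for col in df.columns if df[col].nunique(dropna=False) <= 1]
df = df.drop(columns=constant_cols)  # drops column "b"
```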

Another thing to be on the lookout for is duplicated features. This may not be blatantly obvious with categorical data, as it can manifest as different label names being assigned to the same attribute across different columns, e.g. one feature uses ‘XYZ’ to denote a categorical class that another feature denotes as ‘ABC’, perhaps because the columns were culled from different databases or departments. pd.factorize() can help identify whether two features are synonymous.
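Here's a small sketch of that check, using two hypothetical columns (dept_a and dept_b) that label the same attribute differently:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"dept_a": ["XYZ", "XYZ", "DEF", "XYZ"],
                   "dept_b": ["ABC", "ABC", "GHI", "ABC"]})  # hypothetical columns

# factorize() assigns integer codes in order of first appearance, so two columns
# that are pure relabelings of each other produce identical code sequences.
codes_a, _ = pd.factorize(df["dept_a"])
codes_b, _ = pd.factorize(df["dept_b"])
print(np.array_equal(codes_a, codes_b))  # True -> the two features are synonymous
```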

Next up: redundant and highly-correlated features. Multicollinearity can cause model coefficients to be unstable and highly sensitive to noise. Besides their storage and computational costs, redundant features weaken the effectiveness of other features once weight regularization is taken into account, making the model more prone to noise. pd.DataFrame.corr() can be used to identify correlated features.
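A common sketch for dropping one of each pair of highly-correlated numeric columns (assuming a DataFrame df; the 0.95 threshold is an arbitrary choice, not a rule):

```python
import numpy as np

# Absolute pairwise correlations between the numeric columns.
corr = df.select_dtypes("number").corr().abs()

# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df = df.drop(columns=to_drop)
```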

Duplication may occur not just across columns, but also across rows. Such sample duplication can cause data imbalance and/or over-fitting during training. pd.DataFrame.duplicated() will return a series with a True value for each duplicated row beyond its first occurrence.
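And for duplicated rows, again assuming a pandas DataFrame df:

```python
# Boolean mask: True for every repeat of an earlier row.
dup_mask = df.duplicated()
print(f"{dup_mask.sum()} duplicated rows found")

# Keep only the first occurrence of each row.
df = df.drop_duplicates().reset_index(drop=True)
```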

1.2 Data Shuffling

It’s important to differentiate between shuffling during preprocessing versus during training.

During preprocessing it’s important to shuffle your dataset prior to splitting it into train/validation/test subsets. For small or highly-imbalanced datasets (e.g. in anomaly, fraud or disease detection), use the stratify option of sklearn.model_selection.train_test_split() to ensure a consistent distribution of your minority targets across all your subsets. pd.DataFrame.sample(frac=1.0) can be used to easily shuffle your dataframe.
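A stratified split might look like the sketch below (the DataFrame df and its binary ‘target’ column are assumptions for illustration):

```python
from sklearn.model_selection import train_test_split

# Shuffle the whole DataFrame once (frac=1.0 returns all rows in random order).
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)

X = df.drop(columns=["target"])
y = df["target"]

# stratify=y keeps the minority-class proportion consistent across the subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```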

For training, most ML frameworks will shuffle the data for you, but it’s important to understand whether it’s doing a one-time shuffle, i.e. only when loading the dataset, or if it’s doing so continually on a batch-by-batch basis. The latter is preferred for obtaining the lowest training loss, but can result in slower training since the mini-batches of data cannot be cached and reused for the next epoch.

1.3 Data Imputation

This is a big topic! Missing features should be thoughtfully dealt with.

Missing features may not be immediately obvious, so merely using pd.DataFrame.isna().sum(axis=0) may not surface all of them. Missing fields may be denoted by a special (non-null) string or numerical value (e.g. “-”, 0 or -999). The dataset you’re using may also have been preprocessed and pre-imputed by someone else already. Fortunately, missing features can often be detected by plotting the histogram of each feature: unusual outlier spikes indicate the use of special values, while a spike in the middle of the distribution is a sign that mean/median imputation has already been performed.
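A quick way to hunt for such sentinel values is to eyeball each feature's distribution, e.g. (matplotlib, a DataFrame df and a hypothetical ‘age’ column assumed):

```python
import matplotlib.pyplot as plt

print(df.isna().sum())                  # explicit NaNs per column
print(df["age"].value_counts().head())  # a suspicious spike at 0 or -999 stands out here

df["age"].hist(bins=50)                 # outlier spikes hint at special "missing" codes
plt.show()
```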

How to impute missing fields is the next question. The most common and straightforward method is to substitute the mode (for categorical features), mean (for numerical features without a lot of outliers) or median (where outliers significantly skew the mean) for the missing value.

Even so, especially if you think the feature is an important one, do not blindly substitute the mean / median / mode of the entire dataset. To illustrate, in the Titanic dataset several passengers are missing their ages. Rather than impute all such passengers with the mean age of all passengers on the ship, we can get a more precise estimate by noting that mean (and median) age differed considerably between passenger classes.

Table 1: Median Passenger Age by Passenger Class and Gender

In fact, since there are no missing values for passenger sex in this dataset, you can go one step further when imputing and substitute the median age based on both (i) the passenger class and (ii) the gender of each sample that is missing an age value.
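A minimal sketch of this group-wise imputation, assuming the standard Kaggle Titanic column names (Age, Pclass, Sex):

```python
# Global median imputation (the blunt approach):
# df["Age"] = df["Age"].fillna(df["Age"].median())

# Group-wise imputation: fill each missing age with the median of passengers
# sharing the same passenger class and sex.
df["Age"] = df["Age"].fillna(
    df.groupby(["Pclass", "Sex"])["Age"].transform("median")
)
```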

For a time series, one should never impute missing samples using mean / median substitution, as this will invariably cause abrupt changes in the series that are unrealistic. Instead, impute using value repetition, or interpolation. Signal-processing denoising methods like median-filtering or low-pass zero-phase filtering can also be used to fill small gaps in the training data; but bear in mind that non-causal methods cannot be used during production unless a delayed output from the model is acceptable.
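With pandas this might look like the following sketch (a Series ts with a DatetimeIndex is assumed):

```python
# Value repetition: carry the last observed value forward (optionally cap the gap length).
ts_ffill = ts.ffill(limit=3)

# Linear interpolation between the neighbouring observed samples.
ts_interp = ts.interpolate(method="linear")

# For irregularly-spaced timestamps, interpolate in proportion to elapsed time.
ts_interp_time = ts.interpolate(method="time")
```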

An alternative is to not impute at all and instead add a binary flag to allow the downstream learning algorithm to learn on its own how to handle such conditions. The disadvantage of doing so is that you may have to add many more such low-signal features if the missing values are spread out over many features. Note that XGBoost handles NaN out of the box, so there’s no need to add missing-data columns when using it. In general, Decision Trees can natively handle special-value label-encoding of missing values, by setting them to lower/upper extreme values that decision nodes can then easily split upon.
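A sketch of the indicator-flag approach (hypothetical ‘Age’ column), plus the equivalent scikit-learn shortcut:

```python
from sklearn.impute import SimpleImputer

# Manual version: record where the value was missing, then impute.
df["Age_was_missing"] = df["Age"].isna().astype(int)
df["Age"] = df["Age"].fillna(df["Age"].median())

# scikit-learn version: add_indicator=True appends the binary missing-flag columns.
imputer = SimpleImputer(strategy="median", add_indicator=True)
age_imputed = imputer.fit_transform(df[["Age"]])
```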

Another popular method is to run k-NN to impute missing values. Using a neural network for imputation is another popular route. For example, an auto-encoder can be trained to reproduce the training data with input dropout, and once trained, its outputs can be used to predict missing feature values. Using ML methods to learn imputations can be tricky however, since it’s difficult to assess how the imputation model’s hyperparameters (e.g. the value of k) are affecting the final model’s performance.
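For instance, scikit-learn's KNNImputer fills each missing entry from the k nearest rows (a numeric feature matrix X_num is assumed; scale first, since the method is distance-based):

```python
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# StandardScaler ignores NaNs when fitting and passes them through on transform.
X_scaled = StandardScaler().fit_transform(X_num)

# Each NaN is replaced by the mean of that feature over the 5 nearest neighbours.
X_imputed = KNNImputer(n_neighbors=5).fit_transform(X_scaled)
```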

Regardless of the imputation method, where possible, always perform Feature Generation (discussed in the second part of this series) prior to data imputation, as this will allow for more accurate calculation of generated feature values, particularly when special-value encoding is used for imputation. Knowing that a relevant feature value is missing leaves the door open for special-handling when new features are being generated.

1.4 Feature Encoding

Ordinal features may have integer values, but they differ from numeric features in that although ordinal values obey the transitive comparison relation, they don’t abide by the arithmetic rules of subtraction or division.

Ordinal features, such as star ratings, often have a highly non-linear ‘true’ mapping, corresponding to a strongly-bipolar distribution. For example, the quantitative ‘difference’ between a 4-star and a 5-star rating is oftentimes small, and considerably smaller than half of the difference between a 2-star and a 4-star rating, which can confuse linear models.

As such, linear models may benefit from a more linear remapping of ordinal value assignments, but tree-based models on the other hand are able to handle such nonlinearity intrinsically. It’s better to leave ordinal encodings as a single quantitative feature (i.e. ‘column’) rather than dummifying, as dummifying decreases the Signal-to-Noise Ratio of other predictive features (e.g. when parameter regularization is applied).
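If you do remap for a linear model, an illustration-only sketch could look like the following (the ‘stars’ column and the mapped values are made up, not a recommendation):

```python
# Hypothetical remapping that compresses the 4-vs-5 star gap and stretches the low end.
star_map = {1: 0.00, 2: 0.15, 3: 0.45, 4: 0.85, 5: 1.00}
df["stars_remapped"] = df["stars"].map(star_map)
```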

Handling of categorical features depends on whether your model is tree-based. Tree-based models can use label-encoding (i.e. fixed strings or integers denoting class membership) and don’t need further preprocessing. Non-tree methods require that categorical features be one-hot encoded, which can be performed using either pd.get_dummies() or sklearn.preprocessing.OneHotEncoder(). Avoid dropping the first column (i.e. don’t specify drop='first') unless you’re dealing with a binary category (i.e. use drop='if_binary'), since dropping creates more problems than it solves.
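A quick sketch of both routes, with hypothetical ‘color’ (multi-category) and ‘is_member’ (binary) columns; sparse_output requires scikit-learn ≥ 1.2 (older versions use sparse=False):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# pandas route: expands 'color' into one 0/1 column per category.
df_dummies = pd.get_dummies(df, columns=["color"])

# scikit-learn route: drop='if_binary' collapses two-category features to one column
# while leaving multi-category features fully one-hot encoded.
enc = OneHotEncoder(drop="if_binary", sparse_output=False)
onehot = enc.fit_transform(df[["color", "is_member"]])
print(enc.get_feature_names_out())
```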

An alternative to one-hot encoding is to use, or supplement with, frequency-encoding based on the target, a technique more commonly known as target (or mean) encoding. This involves calculating the normalized frequency with which samples in each category take a given target value. E.g. assign a value of 0.4 to a category if 40% of the samples in that category have a target value of one. Doing so is helpful when the categorical feature is correlated with the target value.
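A minimal sketch, with hypothetical ‘city’ and binary ‘target’ columns; note the statistics are computed on the training split only, to avoid leaking the target into validation/test data:

```python
# Mean of the target per category, learned from the training set only.
city_means = train.groupby("city")["target"].mean()

train["city_enc"] = train["city"].map(city_means)
# Cities unseen during training fall back to the global training mean.
test["city_enc"] = test["city"].map(city_means).fillna(train["target"].mean())
```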

Yet another way to encode categorical features is to use categorical embeddings. This is particularly suitable for high-cardinality categorical features, such as ZIP codes or products. As with word embeddings, these embeddings are learned using a dense neural network. Analysis of results shows that these continuous embeddings are also meaningful for clustering and visualization, while reducing overfitting.
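As a rough sketch of the idea in PyTorch (an assumption; the rest of this article uses pandas/scikit-learn), each label-encoded category indexes a trainable dense vector:

```python
import torch
import torch.nn as nn

# 10,000 possible ZIP codes, each mapped to a trainable 16-dimensional vector.
zip_embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=16)

zip_ids = torch.tensor([940, 8231])   # label-encoded ZIP codes for two samples
vectors = zip_embedding(zip_ids)      # shape (2, 16); learned during model training
```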

A final point on handling categorical features is what to do if an unseen category is encountered in the validation/test subset. In this case, treat it the same way you would a missing feature, or assign it the reserved category of ‘Unknown’.
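With scikit-learn, handle_unknown='ignore' encodes an unseen category as an all-zeros row, which is one defensible choice (sketch below; ‘city’ is again a hypothetical column):

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
enc.fit(train[["city"]])

# A city never seen during fit() becomes a row of all zeros rather than raising an error.
encoded_test = enc.transform(test[["city"]])
```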

1.5 Numerical Features

Scaling or normalization is not required for tree-based methods, but is otherwise essential for achieving low training loss, fast convergence, and in order for weight regularization to work properly. Choose one or the other.

Scaling (sklearn.preprocessing.MinMaxScaler()) is fast and often all that is required for image data. Min-max scaling, however, can be significantly impacted by outliers (even just a single one!) or by missing values that are encoded as values outside the normal feature range.

Normalization (sklearn.preprocessing.StandardScaler()) is more robust to outliers, as long as their proportion is small (i.e. the outliers do not significantly skew the mean and standard deviation).
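Whichever you choose, fit the scaler on the training split only and reuse its statistics elsewhere, e.g. (X_train and X_test assumed):

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

scaler = StandardScaler()                        # or MinMaxScaler() for [0, 1] scaling
X_train_scaled = scaler.fit_transform(X_train)   # learn mean/std from the training data
X_test_scaled = scaler.transform(X_test)         # apply the *same* statistics to test data
```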

Numerical features can often benefit from transformations. The log transformation, np.log(1 + x), is a very strong transformation that is particularly helpful when a feature follows a power-law relationship, or when there’s a long tail of outliers. The square-root transformation, np.sqrt(x), is milder and can serve as a useful intermediate transform to try. The Box-Cox transform, scipy.stats.boxcox(), allows one to smoothly transition from a nearly linear to a highly non-linear transfer function via its lambda parameter, and is often used to turn a skewed distribution (e.g. COVID test-positivity as a function of age) into a more normal one, which is the underlying assumption made by many traditional ML algorithms such as Naive Bayes, Logistic Regression and Linear Regression. [NOTE: all of the above transformations assume x >= 0; Box-Cox additionally requires strictly positive values.]
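In code (x is an assumed non-negative numeric array):

```python
import numpy as np
from scipy import stats

x_log = np.log1p(x)     # equivalent to np.log(1 + x); strong compression of large values
x_sqrt = np.sqrt(x)     # milder compression

# boxcox() estimates lambda by maximum likelihood when it isn't supplied;
# the input must be strictly positive, hence the small shift for zero values.
x_boxcox, fitted_lambda = stats.boxcox(x + 1e-6)
```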

1.6 Geolocation Features

Tree-based methods can sometimes benefit from rotational transformations on geolocation coordinates. For example, South-of-Market (SoMa) is a neighborhood in San Francisco that is bordered by Market Street. However, Market St. doesn’t run North-South or East-West, but rather SW-NE. Decision trees will thus have difficulty cleanly segmenting out this neighborhood (when tree depth is bounded, as is often necessary), since trees can only draw partition lines that are axis-parallel (i.e. to either the Longitude or Latitude axes). The difficulty with performing rotational transformations though is choosing the pivot point to rotate the coordinates around, as there may not be a globally-optimum one.

One solution is to perform a cartesian to polar-coordinate transformation. The city of Paris for example is divided into arrondissements that are laid out roughly in a circular manner around the city center. Alternatively, clustering can be performed and a new categorical feature created to indicate which cluster a point belongs to — after which polar coordinates can be used to encode each point’s location from its cluster centroid/nucleus.
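The sketch below illustrates both ideas on hypothetical lat/lon columns: a rotation about a chosen pivot, and polar coordinates measured from each point's cluster centroid (treating lat/lon as planar is a simplification that's tolerable at city scale):

```python
import numpy as np
from sklearn.cluster import KMeans

xy = df[["lat", "lon"]].to_numpy()

# 1. Rotate by 45 degrees around a pivot so tilted street grids become axis-aligned.
pivot = xy.mean(axis=0)                       # or a hand-picked landmark
theta = np.deg2rad(45)
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
df[["lat_rot", "lon_rot"]] = (xy - pivot) @ rot.T

# 2. Cluster, then express each point in polar coordinates about its own centroid.
km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(xy)
df["cluster"] = km.labels_
delta = xy - km.cluster_centers_[km.labels_]
df["dist_to_centroid"] = np.hypot(delta[:, 0], delta[:, 1])
df["angle_to_centroid"] = np.arctan2(delta[:, 1], delta[:, 0])
```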

1.7 Time Features

Time in a dataset may appear as UTC/GMT time rather than local-time, even when all events are exclusive to a single (and different) time-zone. Even when timestamps are given in local time, it may be beneficial to perform time-shifting on time features.

For example, ride-sharing fares tend to be higher ‘later’ in the day (especially Fri/Sat/Sun) due to a combination of higher demand and lower supply. However, as any late-night reveler knows, the partying doesn’t stop at midnight. Because of the clock roll-over, it’s therefore difficult for a model to discern that 00:30H is ‘later’ than, or even ‘close’ to, 23:30H. If, however, time-shifting is performed so that the tricky 23:59H → 00:00H wrap-around happens during the time of day with the least activity, linear models will regress better while tree-based models will require fewer branching levels.
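As a sketch, assuming a datetime column named pickup_time and that roughly 04:00 is the quietest hour (an assumption you would verify on your own data):

```python
# Shift the clock so the 23:59 -> 00:00 wrap-around occurs at ~04:00 instead,
# keeping late-night activity contiguous on the new axis.
df["hour_shifted"] = (df["pickup_time"].dt.hour - 4) % 24
```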

1.8 Text

Image by Gerd Altmann from Pixabay

Text is arguably the feature that requires the most preprocessing. We begin with transformations, of which there are many.

Lowercasing is necessary for simple models, and it is the most basic form of text normalization, producing a stronger and more robust signal (due to higher word-frequency counts) while reducing vocabulary size. For more advanced pretrained language models (LMs) that come with their own tokenizers, it’s best to let the LM perform tokenization on the original text, as capitalization (e.g. with bert-base-cased) may help the model perform better on sentence parsing, named-entity recognition (NER) and part-of-speech (POS) tagging.

Stemming and Lemmatization provide the same benefits as lowercasing, but are more complicated and take far longer to run. Stemming pares down words using fixed rules that do not take into account a word’s context and usage within a sentence, e.g. university gets truncated to univers, but so does universal. Lemmatization takes context into account and requires the use of large LMs; it is therefore slower than stemming, but much more accurate, e.g. university and universal remain separate root words. Given the compute resources readily available today, the authors of spaCy argue that accuracy matters most in production, and the library therefore only supports lemmatization.
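The difference is easy to see side by side (this sketch assumes NLTK and spaCy are installed, with the en_core_web_sm model downloaded via python -m spacy download en_core_web_sm):

```python
from nltk.stem import PorterStemmer
import spacy

stemmer = PorterStemmer()
print(stemmer.stem("university"), stemmer.stem("universal"))  # both collapse to 'univers'

nlp = spacy.load("en_core_web_sm")
doc = nlp("The universities offered universal access")
print([token.lemma_ for token in doc])  # keeps 'university' and 'universal' distinct
```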

Contractions Expansion is yet another form of text normalization with the same motivations. Here can’t and you’ll get expanded to can not and you will. Simple count/frequency-based models benefit from this standardization, but LMs like those in spaCy handle contractions natively through their tokenization rules (can’t gets tokenized into ca followed by n’t).

Text normalization aims to convert text and symbols to a canonical form so that models can learn better. Such rules convert abbreviations (e.g. BTW), misspellings (e.g. ‘definately’), diacritics (e.g. café/naïve → cafe/naive), emoticons (e.g. <grin> | :-) | ;) | 🙂 → <SMILE>, ROFL | LOL | LMAO | 😆 → <LAUGH>), etc. Without text normalization, many NLP models will have a rough time making sense of Twitter and social-media posts!

Next up is filtering: stopwords, punctuation, numbers, extra whitespace, emojis (when not performing sentiment analysis), HTML tags (e.g. <br>), HTML escape characters (e.g. ‘&amp;’, ‘&nbsp;’), URLs, hashtags (e.g. #blackfriday) and mentions (e.g. @xyz).
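A rough cleaning function along these lines might look like the following (the regex patterns are illustrative, not exhaustive):

```python
import html
import re

def clean_text(text: str) -> str:
    text = html.unescape(text)                 # '&amp;' -> '&', '&nbsp;' -> non-breaking space
    text = re.sub(r"<[^>]+>", " ", text)       # HTML tags such as <br>
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)       # @mentions and #hashtags
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())  # punctuation and other symbols
    return re.sub(r"\s+", " ", text).strip()   # collapse extra whitespace
```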

The final preprocessing step is tokenization. Simple tokenization can be performed using python regular-expressions, but NLP frameworks provide dedicated tokenizers that are written in C/C++ or Rust that are much more performant. When using pretrained LMs, such as those from Hugging Face, it’s crucial to use the exact same tokenizer (and weights) as that which is used by the LM when it was trained.
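With Hugging Face, loading the tokenizer by the same checkpoint name guarantees that match, e.g.:

```python
from transformers import AutoTokenizer

# Loads the exact vocabulary and rules used when bert-base-cased was pretrained.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

print(tokenizer.tokenize("Feature engineering still matters!"))
# e.g. ['Feature', 'engineering', 'still', 'matters', '!']
```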

1.9 Images

Preprocessing images is nowadays less common given the rise in popularity of CNNs. However, on resource-constrained hardware without a GPU, or when high frame rates are desired, these techniques form the bread and butter of traditional computer-vision processing pipelines.

Color-Space conversion can provide a slight performance boost with simple CNNs. This study, for example, found that converting the CIFAR-10 dataset to the L*a*b color-space resulted in a ~2% improvement in classification accuracy, but the best results came from using multiple color-spaces concurrently. It has been found that color-spaces with separate chrominance and luminance channels (e.g. YUV) are helpful for picture colorization and style transfer. I’ve personally found that specialized color transformations tailored for specific domains (e.g. Haematoxylin-Eosin-DAB for histopathology) can be particularly helpful in boosting model performance.

Histogram equalization, especially adaptive histogram equalization (e.g. CLAHE), is often performed on medical imaging to improve visual contrast. It’s particularly helpful when illumination is uneven across the image and distinguishing features are small relative to the entire image frame, such as detecting cancerous tissue in mammography images. In retinal fundus imaging where there’s high variability in image quality during image acquisition, performing non-uniform illumination correction has been shown to improve grading accuracy.
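With OpenCV, CLAHE is a two-liner (the image path is hypothetical):

```python
import cv2

gray = cv2.imread("mammogram.png", cv2.IMREAD_GRAYSCALE)   # hypothetical file

# clipLimit bounds contrast amplification; tileGridSize sets the local regions.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
equalized = clahe.apply(gray)
```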

Traditional CV feature-extraction techniques include Local Binary Patterns (LBP), Histograms of Oriented Gradients (HOG), and Gabor filter banks. There’s a long track record of successfully using these feature extractors with non-CNN-based models. Merging these older techniques (e.g. learnable Gabor filters) with standard convolutional layers can result in faster convergence and better accuracy on some datasets (e.g. Dogs-vs-Cats).
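For instance, scikit-image exposes both LBP and HOG directly (a 2-D grayscale array named image is assumed):

```python
from skimage.feature import hog, local_binary_pattern

# Histogram of Oriented Gradients: a single flattened descriptor per image.
hog_features = hog(image, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

# Local Binary Patterns: a per-pixel texture code, usually summarized as a histogram.
lbp = local_binary_pattern(image, P=8, R=1, method="uniform")
```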

This concludes Part I of this series on feature engineering. In Part II, we’ll turn our attention to Feature Generation, where we’ll look at extracting and synthesizing brand new features. This is truly where art meets science, and is what separates a Kaggle Grandmaster from a novice!


Deep-Learning Engineer. I’ve worked on TD-fNIRS neuro-imaging at Kernel, LiDAR sensors at Quanergy, Stadia gaming at Google, and presently de novo drug design.