Feature Engineering Techniques

Mapping raw data to machine learning features

Samuel Ozechi
Towards Data Science


Feature engineering is one of the key steps in developing machine learning models. This involves any of the processes of selecting, aggregating, or extracting features from raw data with the aim of mapping the raw data to machine learning features.

Mapping raw data to feature vectors. (Image by author).

The type of feature engineering process applied to a dataset depends on its datatype. Numerical and non-numerical data are the two basic datatypes found in raw data; these can be further subdivided into discrete, continuous, categorical, text, image and temporal data types.

This post focuses on feature engineering numerical, categorical and text data. It also considers ways of handling missing data and deriving new features to improve model performance.

The numerical and non-numerical columns of a dataset that has been read into a Pandas dataframe can be separated according to the datatypes of its columns, like so:

Getting numerical and non-numerical columns in a dataframe. (Image by Author).
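
One way to do this, sketched minimally below, is with Pandas' select_dtypes (the train.csv path is illustrative):

import pandas as pd

# Read the raw Housing price data into a dataframe (path is illustrative).
df = pd.read_csv('train.csv')

# Numerical columns hold integer or float values.
numerical_cols = df.select_dtypes(include='number').columns.tolist()

# Non-numerical columns hold everything else (strings, objects, etc.).
non_numerical_cols = df.select_dtypes(exclude='number').columns.tolist()

print(numerical_cols)
print(non_numerical_cols)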

Numerical data is a general term for features that hold numerical values which can be arranged in a logical order. It is often easier to work with numerical features in machine learning because numerical formats (integers, floats, etc.) can be ingested directly by machine learning algorithms.

For example, let's look at some of the numerical columns of the popular Housing price dataset.

Sample numerical columns of the Housing price dataset. (Image by Author).

The numerical features of a raw dataset are often expressed on different scales. In the example above, columns such as "FullBath" and "HalfBath" have small scales (<10) while others such as "1stFlrSF" and "2ndFlrSF" have much larger scales (>1000). It is good practice to standardize the input numerical data to a similar range so that the algorithm does not assign more weight to the features with larger scales during training.

Numerical data can be scaled using Sklearn's MinMaxScaler or StandardScaler. The MinMaxScaler scales the numerical data to the range 0 to 1, while the StandardScaler scales it to have zero mean and unit variance, thereby standardizing the numerical input.
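
As a rough sketch (assuming df is the Housing price dataframe from the earlier snippet):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# A few of the numerical columns shown above.
numerical_features = df[['FullBath', 'HalfBath', '1stFlrSF', '2ndFlrSF']]

# MinMaxScaler rescales each column to the range 0 to 1.
minmax_scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(numerical_features),
    columns=numerical_features.columns)

# StandardScaler rescales each column to zero mean and unit variance.
standard_scaled = pd.DataFrame(
    StandardScaler().fit_transform(numerical_features),
    columns=numerical_features.columns)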

Scaled numerical data using MinMaxScaler. (Image by Author).

Non-numerical data includes categorical and text data, which typically have to be encoded into numbers before being ingested by machine learning models.

Categorical data may sometimes be expressed as numbers, but it should not be confused with numerical data, as its values represent different classes rather than quantities. Below is a dataframe of some categorical columns of the Housing price dataset.

Sample columns of the Housing price dataset showing Categorical features. (Image by Author).

To avoid the algorithm treating categorical data that happen to be expressed as numbers as numerical features during training, it is often preferable to one-hot encode categorical features rather than label encode them, since label encoding may trick machine learning algorithms into assuming algebraic relationships among categories.

Unlike label encoding, where numerical values are simply assigned to represent the respective categories, one-hot encoding converts each categorical value into a new column and assigns it a binary value of 1 or 0, thereby representing each value as a binary vector. This prevents the algorithm from assuming a numerical ordering or relationship among categories (such as assuming that a boy > a girl because they were label encoded as 1 and 0 respectively), and thereby improves the model's ability to learn the relationships within the data.

Following the previous example, below is a subset of the categorical columns from the Housing price dataset.

Sample categorical columns of the Housing price dataset. (Image by Author).

Rather than representing each category in these columns with a numerical value (0, 1, 2, 3, …) as in label encoding, it is better to one-hot encode the features, so that the presence of a category in a datapoint is denoted with one (1) and its absence with zero (0). This can be easily done using Scikit Learn's OneHotEncoder.
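
A minimal sketch is shown below; the column names are illustrative picks from the Housing price dataset, df is the dataframe from earlier, and the sparse_output argument requires a recent version of scikit-learn (older versions use sparse=False instead):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Illustrative categorical columns; check the names against your own data.
categorical_features = df[['MSZoning', 'Street', 'LotShape']]

encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded = encoder.fit_transform(categorical_features)

# get_feature_names_out() returns one column name per category.
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())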

One hot encoded features of the categorical column. (Image by Author).

The result shows one-hot encoded features that can now be ingested into models. Notice that the output dataframe contains more dimensions (columns), whose values (1.0, 0.0) indicate the presence or absence of each category in each datapoint. Each original categorical column has been expanded into as many columns as it has categories.

Text data must also be represented as numerical values that machine learning algorithms can recognize, through a process known as vectorization. Some common text vectorization techniques are word counts, term frequency–inverse document frequency (TF–IDF) and word embeddings.

Word count is a simple vectorization technique that represents text data as numerical values according to how frequently each word occurs in the data. Consider the list of texts below:

text_data = ['Today is a good day for battle',
'Battle of good against evil',
'For the day after today is Monday',
'Monday is a good day for good']

To vectorize the text data above with the word count technique, a sparse matrix is formed by recording the number of times each word appears in each text. This can be achieved using Scikit Learn's CountVectorizer.
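
As a minimal sketch, reusing the text_data list above:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(text_data)  # sparse word-count matrix

# One row per text, one column per word, each value a word count.
count_df = pd.DataFrame(counts.toarray(),
                        columns=vectorizer.get_feature_names_out())
print(count_df)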

Output dataframe of word count vectorization for the sample text data. (Image by Author).

The output dataframe shows the vectorized features of the input text data, which can now be ingested into algorithms. While this technique achieves the goal of vectorization, it is suboptimal for machine learning because it only considers the frequency of words in the data. A better alternative is the term frequency–inverse document frequency (TF–IDF) method, which considers both the frequency of the words and their importance in the text data.

In TF-IDF, the importance of each word in the text data is inferred from its inverse document frequency (IDF). The inverse document frequency is a statistical measure of how relevant a word is to a text within a collection of texts. The IDF for each word is calculated as:

idf = ln[(1 + N) / (1 + df)] + 1
where:
idf = inverse document frequency
N = number of texts in the data (N = 4 in our example)
df = the document frequency of the word
The df is the number of texts in which a particular word appears; for instance, the word "good" appears four (4) times in the data but has a df of 3, as it appears in 3 texts (once in two of the texts and twice in another).
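
For instance, with the smoothed formula above (which matches the default behaviour of Sklearn's TfidfVectorizer), the idf of the word "good" is ln[(1 + 4) / (1 + 3)] + 1 = ln(1.25) + 1 ≈ 1.22.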

An important aspect of the TF-IDF approach is that it assigns low scores to words that are either abundant or rare in the text data, on the assumption that they are of less importance for finding patterns in the data. This is usually helpful in building efficient models, since very common words such as 'the', 'is', 'are' and 'of', as well as very rare words, are mostly of little or no help in recognizing patterns in text data. Sklearn's TfidfVectorizer makes it easy to vectorize text data in this way.
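
As a rough sketch, again reusing the text_data list from before:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
scores = tfidf.fit_transform(text_data)

# Inverse document frequency learned for each word.
idf_df = pd.DataFrame({'word': tfidf.get_feature_names_out(),
                       'idf': tfidf.idf_})

# TF-IDF weighted features, one row per text (Sklearn's defaults).
tfidf_df = pd.DataFrame(scores.toarray(),
                        columns=tfidf.get_feature_names_out())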

Output dataframe showing the inverse document frequencies for the words in the data. (Image by Author).

The output shows that the text data has been vectorized according to the TF-IDF values of the words and can then be ingested into algorithms for learning. A disadvantage of the TF-IDF approach is that, unlike word embeddings, it does not take account of words that share similar meanings; nevertheless, it is still a good fit for vectorizing text data.

Another common feature engineering technique for improving the performance of machine learning models is deriving new features from existing ones. This can be done by mathematically combining the input features in some way, thereby transforming them before training.

In the Housing price dataset, for instance, features such as the square feet per room or the total number of rooms in a house are often more indicative of the target price than individual features such as the number of bedrooms or bathrooms. In cases where such useful features are not present in the data, they can be derived by mathematically combining the input features, as sketched below.
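
As a hedged sketch (the column names follow the Housing price dataset but should be checked against your own data, and df is the dataframe from earlier):

# Combine existing columns into more informative derived features.
df['TotalBath'] = df['FullBath'] + df['HalfBath']
df['TotalSF'] = df['1stFlrSF'] + df['2ndFlrSF']

# Square feet per above-ground room (assumes a 'TotRmsAbvGrd' column exists).
df['SFPerRoom'] = df['TotalSF'] / df['TotRmsAbvGrd']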

Missing data is another common problem when working with raw data in machine learning. Dealing with missing data is an important aspect of feature engineering, since raw data is often incomplete. Handling missing values usually involves dropping the datapoints with missing values altogether, replacing missing values with one of the measures of central tendency (mean, median or mode), or using matrix completion.
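
A minimal sketch of these options is shown below; the 'LotFrontage' column is only an illustrative example, and numerical_cols is the list of numerical columns from the first snippet:

from sklearn.impute import SimpleImputer

# Option 1: drop datapoints (rows) that contain any missing value.
df_dropped = df.dropna()

# Option 2: fill missing values with a measure of central tendency.
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].mean())

# Option 3: impute all numerical columns in one reusable step.
imputer = SimpleImputer(strategy='median')
df[numerical_cols] = imputer.fit_transform(df[numerical_cols])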

After carrying out the steps outlined above according to the data type, your raw data is transformed into feature vectors that can be passed to machine learning algorithms for the training phase.

Summary: Feature engineering involves the processes of mapping raw data to machine learning features. Which feature engineering process to apply depends on the type of data. Common feature engineering processes include scaling numerical data, label or one-hot encoding categorical data, vectorizing text data, deriving new features and handling missing data. Proper feature engineering makes raw data suitable for ingestion into machine learning models and improves their performance.
