Building a One Hot Encoding Layer with TensorFlow

How to create a Custom Neural Network layer to One Hot Encode categorical input features in TensorFlow

George Novack
Towards Data Science


One Hot Encoding is a common way of preprocessing categorical features for machine learning models. This type of encoding creates a new binary feature for each possible category and assigns a value of 1 to the feature of each sample that corresponds to its original category. It’s easier to understand visually: in the example below, we One Hot Encode a color feature which consists of three categories (red, green, and blue).

One Hot Encoding a simple categorical feature (Image by author)

scikit-learn offers the OneHotEncoder class out of the box to handle categorical inputs using One Hot Encoding. Simply create an instance of sklearn.preprocessing.OneHotEncoder, fit the encoder on the input data (this is where the OneHotEncoder identifies the possible categories in the DataFrame and updates some internal state, allowing it to map each category to a unique binary feature), and finally call one_hot_encoder.transform() to One Hot Encode the input DataFrame. The great thing about the OneHotEncoder class is that, once it has been fit on the input features, you can continue to pass it new samples, and it will encode the categorical features consistently.

One Hot Encoding using scikit-learn’s OneHotEncoder
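The gist with this code is not reproduced here; a minimal sketch of the workflow, assuming a pandas DataFrame with a color column:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Fit the encoder so it learns the set of possible categories,
# then transform the column into one binary feature per category
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(df[["color"]])
encoded = encoder.transform(df[["color"]]).toarray()

# Categories are sorted alphabetically: blue, green, red
print(encoder.categories_)
```

Because the encoder keeps its fitted state, passing it a new sample later produces columns in exactly the same order.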

One Hot Encoding for TensorFlow Models

Recently, I was working with some categorical features that were being passed as inputs to a TensorFlow model, so I decided to try to find a “TensorFlow-native” way of One Hot Encoding.

After a lot of searching, I mostly came across two suggestions for how to do this:

Just use scikit-learn’s OneHotEncoder

We already know it works, and One Hot Encoding is One Hot Encoding, right? So why bother doing it with TensorFlow?

While this is a valid suggestion that works well for simple examples and demonstrations, it can lead to some complications in scenarios where you plan to deploy your model as a service so that it can perform inference in a production environment.

To take a step back, one of the big benefits of using OneHotEncoder with scikit-learn models is that you can include it, along with the model itself, as a step in a scikit-learn Pipeline, essentially bundling the One Hot Encoding logic (and potentially other preprocessing) and the inference logic into a single deployable artifact.

scikit-learn Pipeline with One Hot Encoding and prediction logic

So back to our TensorFlow scenario: if you were to use OneHotEncoder to preprocess input features for a TensorFlow model, you would have some additional complexity to deal with, because you would either have to:

  1. Duplicate the One Hot Encoding logic anywhere that the model is used for inference.
  2. Or, deploy both the fitted OneHotEncoder and the trained TensorFlow model as separate artifacts, and then ensure that they are used properly and kept in sync by all applications that use the model.

Use the tf.one_hot Operation

This was the other suggestion I came across. The tf.one_hot operation takes a list of category indices and a depth (for our purposes, essentially the number of unique categories), and outputs a One Hot Encoded Tensor.

The tf.one_hot Operation
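A minimal sketch of the operation; the index-to-category mapping in the comment is an assumption for illustration:

```python
import tensorflow as tf

# Indices standing in for categories, e.g. red=0, green=1, blue=2
indices = [0, 1, 2, 1]

# depth is the number of unique categories, i.e. the number of
# binary columns in the resulting Tensor
one_hot = tf.one_hot(indices, depth=3)
print(one_hot)
```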

You’ll notice a few key differences, though, between OneHotEncoder and tf.one_hot in the example above:

  • First, tf.one_hot is simply an operation, so we’ll need to create a Neural Network layer that uses this operation in order to include the One Hot Encoding logic with the actual model prediction logic.
  • Second, instead of passing in the string categories (red, green, and blue), we’re passing in a list of integers. This is because tf.one_hot does not accept the categories themselves, but instead accepts a list of indices for the One Hot Encoded features (notice that category index 0 maps to a 1x3 list where column 0 has value 1 and the others have value 0).
  • Third, we have to pass in a unique category count (or depth). This value determines the number of columns in the resulting One Hot Encoded Tensor.

So, in order to include One Hot Encoding logic as part of a TensorFlow model, we’ll need to create a custom layer that converts string categories into category indices, determines the number of unique categories in our input data, then uses the tf.one_hot operation to One Hot Encode the categorical features. We’ll do all of this next.

Creating a Custom Layer

The first order of business is to convert string categories into integer indices in a consistent way (e.g. the string blue should always be converted to the same index).

Enter TextVectorization

The experimental TextVectorization layer can be used to standardize and tokenize sequences of strings, such as sentences, but for our use case, we’ll simply convert individual string categories into integer indices.

Adapting the TextVectorization Layer to the color categories

We specify output_sequence_length=1 when creating the layer because we only want a single integer index for each category passed into the layer. Calling the adapt() method fits the layer to the dataset, similar to calling fit() on the OneHotEncoder. After the layer has been fit, it maintains an internal vocabulary of unique categories and maps them consistently to integer indices. You can view the layer’s vocabulary by calling get_vocabulary() after it has been fit.
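A sketch of adapting the layer to the color categories (in recent TensorFlow versions the layer has graduated out of the experimental namespace to tf.keras.layers.TextVectorization):

```python
import tensorflow as tf

# output_sequence_length=1: each input is a single category, not a sentence
vectorizer = tf.keras.layers.TextVectorization(output_sequence_length=1)
vectorizer.adapt(["red", "green", "blue"])

# The vocabulary reserves index 0 for padding and index 1 for
# out-of-vocabulary tokens, so the real categories start at index 2
print(vectorizer.get_vocabulary())

indices = vectorizer(tf.constant([["red"], ["blue"]]))
```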

The OneHotEncodingLayer Class

Finally, we can now create the Class that will represent a One Hot Encoding Layer in a neural network.

Custom Layer Class for One Hot Encoding categorical features

The class inherits from PreprocessingLayer so that it inherits the base adapt() method. When this layer is initialized, a TextVectorization layer is also initialized, and when adapt() is called, the TextVectorization is fit on the input data, and two class attributes are set:

  • self.depth is the number of unique categories in the input data. This value is used when calling tf.one_hot to determine the number of resulting binary features.
  • self.minimum is the minimum index output by the TextVectorization layer. This value is subtracted from the indices at runtime to ensure that the indices passed to tf.one_hot fall in the range [0, self.depth-1] (e.g. if the TextVectorization outputs values in the range [2, 4], we will subtract 2 from each value so that the resulting indices are in the range [0, 2]).

The get_config() method allows TensorFlow to save the state of the layer when the model is saved to disk. The values from the layer’s config will be passed to the layer’s __init__() method when the model is loaded into memory. Notice that we’re explicitly setting the vocabulary, depth, and minimum whenever these values are passed in.
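The full class from the gist is not reproduced here; the following is a sketch based on the description above. It inherits from tf.keras.layers.Layer rather than PreprocessingLayer to stay version-agnostic, so adapt() is defined directly:

```python
import tensorflow as tf


class OneHotEncodingLayer(tf.keras.layers.Layer):
    def __init__(self, vocabulary=None, depth=None, minimum=None, **kwargs):
        super().__init__(**kwargs)
        self.vectorization = tf.keras.layers.TextVectorization(
            output_sequence_length=1, vocabulary=vocabulary
        )
        self.depth = depth
        self.minimum = minimum

    def adapt(self, data):
        # Fit the vocabulary, then record the index range so that the
        # indices passed to tf.one_hot start at 0
        self.vectorization.adapt(data)
        indices = self.vectorization(tf.constant([[str(d)] for d in data]))
        self.minimum = int(tf.reduce_min(indices))
        self.depth = int(tf.reduce_max(indices)) - self.minimum + 1

    def call(self, inputs):
        indices = self.vectorization(inputs)
        one_hot = tf.one_hot(indices - self.minimum, depth=self.depth)
        # Drop the length-1 sequence dimension: (batch, 1, depth) -> (batch, depth)
        return tf.reshape(one_hot, (-1, self.depth))

    def get_config(self):
        # Persist the fitted state so the layer can be rebuilt from config
        config = super().get_config()
        config.update({
            "vocabulary": self.vectorization.get_vocabulary(
                include_special_tokens=False
            ),
            "depth": self.depth,
            "minimum": self.minimum,
        })
        return config


encoder = OneHotEncodingLayer()
encoder.adapt(["red", "green", "blue", "green"])
encoded = encoder(tf.constant([["red"], ["blue"]]))
```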

Using the Custom Layer

Now we can try the new layer out in a simple Neural Network.

Simple Neural Network with One Hot Encoding

This simple network just accepts a categorical input, One Hot Encodes it, then concatenates the One Hot Encoded features with the numeric input feature. Notice I’ve added a numeric id column to the DataFrame to illustrate how to split categorical inputs from numeric inputs.
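A sketch of wiring such a layer into a small functional model. To keep the snippet self-contained it uses the built-in StringLookup layer, which can emit one-hot vectors directly via output_mode="one_hot", as a stand-in for the custom layer:

```python
import tensorflow as tf

# Stand-in for the custom One Hot Encoding layer: StringLookup with
# output_mode="one_hot" maps strings straight to one-hot vectors
# (it reserves one extra column for out-of-vocabulary values)
lookup = tf.keras.layers.StringLookup(
    vocabulary=["red", "green", "blue"], output_mode="one_hot"
)

color_input = tf.keras.Input(shape=(1,), dtype="string", name="color")
id_input = tf.keras.Input(shape=(1,), name="id")

# One Hot Encode the categorical input, then concatenate it
# with the numeric input feature
one_hot = lookup(color_input)
features = tf.keras.layers.Concatenate()([one_hot, id_input])
output = tf.keras.layers.Dense(1)(features)

model = tf.keras.Model(inputs=[color_input, id_input], outputs=output)
```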

And that’s it! We now have a working Neural Network layer that can One Hot Encode categorical features! We can also save this model as a JSON config file, deploy it, and reload it into memory to perform inference. Notice that it One Hot Encodes the color category in the same way as before, so we know that the subsequent layers of our model will be provided the same features in the same order as they appeared during training.

Saving and loading model from config
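A sketch of the save/load round trip that get_config() enables, again using the built-in StringLookup as a stand-in, with an explicit vocabulary so the config carries it:

```python
import tensorflow as tf

lookup = tf.keras.layers.StringLookup(
    vocabulary=["red", "green", "blue"], output_mode="one_hot"
)

# TensorFlow calls get_config() when saving a model's architecture and
# feeds the result back through from_config() when loading it again
config = lookup.get_config()
restored = tf.keras.layers.StringLookup.from_config(config)

original_out = lookup(tf.constant([["green"]]))
restored_out = restored(tf.constant([["green"]]))
```

Because the vocabulary travels in the config, the restored layer encodes categories in exactly the same column order as the original.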

You can find a notebook containing all of the code examples here: https://github.com/gnovack/tf-one-hot-encoder/blob/master/OneHotEncoderLayer.ipynb

Thanks for reading! Feel free to leave any questions or comments below.
