
Feature embeddings are one of the most important steps when training neural networks on tabular data. Unfortunately, this technique is seldom taught outside of natural language processing (NLP) settings and is consequently almost completely ignored for structured datasets. Skipping this step can lead to significant drops in model accuracy, and it has fed the false belief that gradient boosted methods like XGBoost are always superior for structured dataset problems. Not only do embedding-enhanced neural networks often beat gradient boosted methods, but both types of model can see major improvements when these embeddings are extracted and reused. This article will answer the following questions:
- What are feature embeddings?
- How are they used with structured data?
- If they are so powerful, why are they not more common?
- How are embeddings implemented?
- How do I use these embeddings to enhance other models?
Feature Embeddings Explained
Neural networks have difficulty with sparse categorical features. Embeddings are a way to compress those features into a smaller, denser representation that improves model performance. Before discussing structured datasets, it's helpful to understand how embeddings are typically used. In natural language processing settings, you are usually dealing with dictionaries of thousands of words. These dictionaries are one-hot encoded into the model, which mathematically is the same as having a separate column for every possible word. When a word is fed into the model, the corresponding column will show a one while all other columns will show zeros. This leads to an incredibly sparse dataset. The solution is to create an embedding.
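To make the contrast concrete, here is a minimal sketch (the vocabulary size and embedding width are made-up numbers): instead of a 10,000-wide one-hot vector, a Keras Embedding layer maps each word ID to a short dense vector.
import tensorflow as tf

vocab_size = 10_000  # hypothetical dictionary size
emb_dim = 32         # dense representation size
embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=emb_dim)
# one word id in, a dense 32-value vector out, shape (1, 32)
vector = embedding(tf.constant([42]))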

An embedding will essentially group words with similar meanings based on the training text and return their location in the embedding space. So, for example, ‘fun’ might have a similar embedding value to words like ‘humor’, ‘dancing’, or ‘Machine Learning’. In practice, neural networks perform far better on these dense, representative features.
Embeddings with Structured Data

Structured datasets also often contain sparse categorical features. In the example customer sales table above, we have zip code and store ID. Because there may be hundreds or thousands of unique values for these columns, using them directly would create the same performance issues noted in the NLP problem above. So why not use embeddings the same way?
The problem is that we are now dealing with more than one feature: in this case, two separate sparse categorical columns (zip code AND store ID) as well as other powerful features like sales totals. We cannot simply feed all of our features through a single embedding. We can, however, train the embeddings in the first layer of the model and feed the normal features in alongside them, as sketched below. Not only does this transform zip code and store ID into useful features, but the other useful features are no longer diluted by thousands of sparse columns.
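Conceptually, the first layer looks something like the following sketch, written with the Keras functional API. The integer-encoded zip_code_idx and store_id_idx inputs, the category counts, and the three-column numeric block are all assumptions for illustration; the feature-column approach used later in this article achieves the same result.
import tensorflow as tf

n_zip_codes, n_stores = 1000, 250  # hypothetical category counts

# categorical inputs pass through small, trainable embeddings
zip_in = tf.keras.Input(shape=(1,), name='zip_code_idx')
store_in = tf.keras.Input(shape=(1,), name='store_id_idx')
zip_vec = tf.keras.layers.Flatten()(tf.keras.layers.Embedding(n_zip_codes, 6)(zip_in))
store_vec = tf.keras.layers.Flatten()(tf.keras.layers.Embedding(n_stores, 4)(store_in))

# numeric features (e.g. sales totals) are fed in as-is
numeric_in = tf.keras.Input(shape=(3,), name='numeric_features')

# embeddings and numeric features sit side by side in the first layer
combined = tf.keras.layers.Concatenate()([zip_vec, store_vec, numeric_in])
hidden = tf.keras.layers.Dense(64, activation='relu')(combined)
model = tf.keras.Model([zip_in, store_in, numeric_in], tf.keras.layers.Dense(1)(hidden))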

Why Embeddings Are Ignored
In the largest ML-focused companies, this technique is absolutely used. The problem is that the vast majority of data scientists outside of those major companies have never heard of using embeddings this way. Why is that? While I would not say these methods are overly difficult to implement, they are above the complexity level of your typical online course or specialization. Most aspiring machine learning practitioners simply never learn how to merge an embedding with other non-categorical features. As a result, features like zip code and store ID are often simply dropped from the model. But these are important features!
Some of the feature value can be captured through techniques like mean encoding, but the improvements are often marginal. This has led to a trend of skipping neural networks altogether because gradient boosted methods handle these categorical features better out of the box. But as mentioned above, embeddings can improve both types of model, as will be seen in the next section.
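For comparison, mean encoding collapses each category into a single number: the average of the target for that category. A minimal pandas sketch, assuming a hypothetical training frame train with a 'zip_code' column and a numeric 'sales' target:
# mean (target) encoding: replace each zip code with the average
# sales observed for that zip code in the training data
zip_means = train.groupby('zip_code')['sales'].mean()
train['zip_code_mean_enc'] = train['zip_code'].map(zip_means)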
How to Implement Embeddings
The most difficult part of this process is getting familiar with TensorFlow datasets. While they are nowhere near as intuitive as pandas data frames, they are a great skill to learn if you ever plan on scaling your models to massive datasets or want to build a more complex network.
For this example, we will use the hypothetical customer sales table above. Our goal is to predict the target month's sales. For simplicity and brevity, we will skip the feature engineering steps and start with pre-split pandas data frames. For larger datasets, you likely would not start with a data frame, but that is a topic for another article.
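If you want to follow along, something like the following produces the pre-split frames. The sales_df table, its column names, and the 'monthly_sales' target are hypothetical stand-ins for the table above:
from sklearn.model_selection import train_test_split

X = sales_df[['gender', 'age', 'previous_sales', 'zip_code', 'store_id']]
y = sales_df['monthly_sales']
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)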
The first step is to convert these data frames into TensorFlow datasets:
import tensorflow as tf

trainset = tf.data.Dataset.from_tensor_slices(
    (dict(X_train), y_train)).batch(32)
valset = tf.data.Dataset.from_tensor_slices(
    (dict(X_val), y_val)).batch(32)
One thing to note is that these TensorFlow datasets, and the transformations applied to them, are not stored in memory the way a pandas data frame is. They are essentially a pipeline that the data will pass through batch by batch, allowing the model to train efficiently on datasets too large to fit into memory. That is why we feed in a dictionary of the feature columns rather than the raw arrays. Notice that we are also defining the batch size now rather than at training time as you normally would with the Keras API.
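A quick way to confirm the pipeline is wired up correctly is to pull a single batch and inspect it (this snippet assumes the trainset defined above):
# take one batch: features come through as a dict of tensors,
# labels as a single tensor of shape (batch_size,)
for features, labels in trainset.take(1):
    print(list(features.keys()))
    print(labels.shape)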
Next, we will want to create a list of all unique values for zip code and store ID. These will be used for creating and extracting the embeddings later.
zip_codes = X_train['zip_code'].unique()
store_ids = X_train['store_id'].unique()
Now we can define the data pipelines using TensorFlow feature columns. Depending on the types of features in your table, there are many options to choose from. Please check out TensorFlow’s feature_column documentation for more information.
# numeric features being fed into the model:
feature_columns = []
feature_columns.append(
    tf.feature_column.numeric_column('gender'))
feature_columns.append(
    tf.feature_column.numeric_column('age'))
feature_columns.append(
    tf.feature_column.numeric_column('previous_sales'))
# categorical columns using the lists created above:
zip_col = tf.feature_column.categorical_column_with_vocabulary_list(
    'zip_code', zip_codes)
store_col = tf.feature_column.categorical_column_with_vocabulary_list(
    'store_id', store_ids)
# create an embedding column from each categorical column:
zip_emb = tf.feature_column.embedding_column(zip_col, dimension=6)
store_emb = tf.feature_column.embedding_column(store_col, dimension=4)
# add the embeddings to the list of feature columns
feature_columns.append(zip_emb)
feature_columns.append(store_emb)
# create the input layer for the model
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
Notice that in the embedding step we had to specify the number of dimensions. This refers to how many features we want to reduce the categorical column down to. The rule of thumb is to set the embedding dimension to roughly the 4th root of the number of unique categories (e.g. 1,000 unique zip codes reduce down to ~6 embedding columns), but this is another hyperparameter that can be tuned in your model.
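As a rough sketch, that rule of thumb can be computed directly from the vocabulary lists created earlier (the minimum of 2 is just a sensible floor, not an official guideline):
# ~4th root of the number of categories, e.g. 1000 zip codes -> 6
zip_emb_dim = max(2, round(len(zip_codes) ** 0.25))
store_emb_dim = max(2, round(len(store_ids) ** 0.25))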
Now let us build a simple model:
model = tf.keras.models.Sequential()
model.add(feature_layer)
model.add(tf.keras.layers.Dense(units=512,activation='relu'))
model.add(tf.keras.layers.Dropout(0.25))
# add any layers that you want here
model.add(tf.keras.layers.Dense(units=1))
# compile and train the model (mean absolute error is a more
# meaningful metric than accuracy for this regression target)
model.compile(loss='mse', optimizer='adam', metrics=['mae'])
model.fit(trainset, validation_data=valset, epochs=20, verbose=2)
Congratulations, you have now trained a model with embeddings! Now, how do we extract those embeddings to feed to other models? Simply grab the weights from the model:
zip_emb_weights = model.get_weights()[1]
store_emb_weights = model.get_weights()[0]
Notice that the order of the embedding layers can shift around, so check that the shape of each weight matrix matches the number of unique values defined above to ensure that you are grabbing the correct layer (a quick check is sketched below). Then save the weights to a data frame.
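A minimal version of that check, assuming the embedding dimensions of 6 and 4 chosen earlier:
# find the weight matrices whose shapes match the embeddings we built
for i, w in enumerate(model.get_weights()):
    if w.shape == (len(zip_codes), 6):
        print(f'weights[{i}] is the zip code embedding')
    elif w.shape == (len(store_ids), 4):
        print(f'weights[{i}] is the store id embedding')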
# create column names for the embeddings (6 for zip code, 4 for store id)
zip_emb_cols = ['zip_emb' + str(i) for i in range(1, 7)]
store_emb_cols = ['store_emb' + str(i) for i in range(1, 5)]
# create a pandas data frame:
zip_emb_df = pd.DataFrame(columns=zip_emb_cols,
    index=zip_codes, data=zip_emb_weights)
store_emb_df = pd.DataFrame(columns=store_emb_cols,
    index=store_ids, data=store_emb_weights)
# finally, save the data frames to csv or other format
zip_emb_df.to_csv('zip_code_embeddings.csv')
store_emb_df.to_csv('store_id_embeddings.csv')
Conclusion
Now that the embeddings are stored, we can merge them back into the original dataset, as shown in the sketch below. We can even merge them into other datasets that share the same categorical features. I have yet to find a case where these enhanced datasets have not boosted the accuracy of ALL models utilizing them. Give it a try, and I promise this process will become a part of your standard machine learning workflow.
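For reference, the merge itself is a single pandas join. A minimal sketch, assuming a frame that still contains the raw 'zip_code' column:
import pandas as pd

# load the saved embeddings and join them on the zip code column
zip_emb_df = pd.read_csv('zip_code_embeddings.csv', index_col=0)
X_train_enhanced = X_train.merge(
    zip_emb_df, how='left', left_on='zip_code', right_index=True)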