
Using neural networks with embedding layers to encode high cardinality categorical variables

How can we use categorical features with thousands of different values?

There are multiple ways to encode categorical features. If no ordered relation between the categories exists, one-hot-encoding is a popular candidate (i.e. adding a binary feature for every category), alongside many others. But one-hot-encoding has some drawbacks – which can be tackled by using embeddings.

Slightly helpful illustration, image done by author using draw.io

One drawback is that it does not work well for high cardinality categories: it produces very large/wide and sparse datasets, and a lot of memory and regularization is needed due to the sheer number of features. Moreover, one-hot-encoding does not exploit the relationships between the categories. Imagine you have animal species as a feature, with values like domestic cat, tiger and elephant. There are probably more similarities between domestic cat and tiger than between domestic cat and elephant, which is something a smarter encoding could take into account. In very practical scenarios these categoricals can be things like customers, products or locations – very high cardinality categories with relevant relationships between the individual observations.

One way to tackle these problems is to use embeddings; a popular example of embeddings are word embeddings in NLP problems. Instead of encoding the categories using a huge binary space we use a smaller dense space. And instead of manually encoding them, we only define the size of the embedding space and then let the model learn a useful representation. For our animal species example we could retrieve representations like [0.5, 0, 0] for domestic cat, [1, 0.4, 0.1] for tiger, [0, 0.1, 0.6] for elephant, [-0.5, 0.5, 0.4] for shark, and so on.
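
To make the contrast concrete, here is a toy comparison of the two encodings for the animal example; the embedding vectors are simply the illustrative numbers from above, in a real model they would be learned:

import pandas as pd
animals = pd.DataFrame({'species': ['domestic cat', 'tiger', 'elephant', 'shark']})
# one-hot: one binary column per category, wide and sparse
pd.get_dummies(animals['species'])
# embedding: a small dense vector per category, learned by the model
embedding = {
    'domestic cat': [0.5, 0.0, 0.0],
    'tiger':        [1.0, 0.4, 0.1],
    'elephant':     [0.0, 0.1, 0.6],
    'shark':        [-0.5, 0.5, 0.4],
}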

In the following post we will build a neural network using embeddings to encode the categorical features. Moreover, we will benchmark the model against a very naive linear model without categorical variables and against a more sophisticated regularized linear model with one-hot-encoded features.

A toy example

Let us look at a generated toy problem. Imagine we repeatedly buy different products from various suppliers and we want to predict their size. Now let us assume every product comes labelled with a supplier_id and a product_id, identifiers for the supplier and the product itself. Let us also assume the items have some obvious measurements/features x1 and x2, like price and weight, and some secret, unmeasurable features s1, s2 and s3, from which the size could theoretically be computed like this (s3 has no impact on the size):

y = f(price, weight, s1, s2, s3) 
  = price + s1 + weight * s2

The problem is that we do not know the secret features s1, s2 and s3, and we cannot measure them directly, which is actually a pretty common situation in machine learning. But we have a bit of leverage here, because we have the product and supplier ids – there are just too many of them to one-hot-encode them and use them straightforwardly. Let us also assume from experience we know that products from different suppliers have different sizes, so it is reasonable to assume that items from the same supplier have very similar secret properties.

y = g(supplier_id, product_id, price, weight)

The question is: can our model learn the relation g above just from the product and supplier ids, even if there are hundreds of thousands of different ones? The answer is yes, if we have enough observations.

Let us look at a small dataset to get a better picture. We generate 300 samples with 3 different values in s1 and 6 different values in s2 (remember that s3 has no impact) and visualize the obvious impact the secret properties have on the relationship between price, weight and size.
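
The helper generate_secret_data used below is not listed in this post; a hypothetical sketch that is consistent with the formula above and with the describe() output further below (the exact distributions, offsets and noise terms of the original are unknown) could look like this:

import numpy as np
import pandas as pd
def generate_secret_data(n, s1_bins, s2_bins, s3_bins, seed=123):
    # assumed implementation: integer secret features, two measurable
    # features and a response following y = price + s1 + weight * s2
    rng = np.random.RandomState(seed)
    s1 = rng.randint(1, s1_bins + 1, n)
    s2 = (rng.randint(0, s2_bins, n) - s2_bins // 3).astype(float)  # range assumed
    s3 = rng.randint(1, s3_bins + 1, n)  # has no impact on y
    price = rng.normal(3, 1, n)
    weight = rng.normal(2, 0.5, n)
    y = price + s1 + weight * s2 + rng.normal(0, 0.1, n)  # noise level assumed
    return pd.DataFrame({'s1': s1, 's2': s2, 's3': s3,
                         'price': price, 'weight': weight, 'y': y}).round(3)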

import seaborn as sns
import matplotlib.pyplot as plt
data = generate_secret_data(n=300, s1_bins=3, s2_bins=6, s3_bins=2)
data.head(10)
##   s1   s2  s3  price  weight       y
## 0  1  0.0   2  1.269   2.089   4.055
## 1  3  2.0   1  2.412   1.283   9.764
## 2  2  1.0   2  3.434   1.010   8.230
## 3  1  3.0   1  4.493   1.837  12.791
## 4  3 -2.0   2  4.094   2.562   3.756
## 5  1  2.0   2  1.324   1.802   7.714
## 6  1  2.0   1  2.506   1.910   9.113
## 7  3 -2.0   1  3.626   1.864   4.685
## 8  2  1.0   1  2.830   2.064   8.681
## 9  1  2.0   1  4.332   1.100   9.319
g = sns.FacetGrid(data, col='s2', hue='s1', col_wrap=3, height=3.5);
g = g.map_dataframe(plt.scatter, 'weight', 'y');
plt.show()
Different weight to size (y) relations per s2 property
g = sns.FacetGrid(data, col='s2', hue='s1', col_wrap=3, height=3.5);
g = g.map_dataframe(plt.scatter, 'price', 'y');
plt.show()
Different price to size (y) relations per s2 property

The toy example as it would appear in (a kind of) reality

But now to the problem: we do not know the features s1, s2 and s3, only the product and supplier ids. To simulate how this dataset would appear in the wild, let us introduce a hash function that we will use to obfuscate our unmeasurable features and to generate product and supplier ids.

import hashlib
def generate_hash(*args):
    s = '_'.join([str(x) for x in args])
    return hashlib.md5(s.encode()).hexdigest()[-4:]
generate_hash('a', 2)
## '5724'
generate_hash(123)
## '4b70'
generate_hash('a', 2)
## '5724'
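
The generate_data helper used in the remainder of the post is not listed either; a hypothetical sketch that obfuscates the secret features with hashed ids (consistent with the cardinality comments further below, where the cardinality of product_id is roughly S1_BINS * S3_BINS and that of supplier_id roughly S2_BINS * S3_BINS) could be:

def generate_data(n, s1_bins, s2_bins, s3_bins):
    # assumed implementation: generate the secret data, then replace the
    # secret features by ids derived from them via the hash function above
    df = generate_secret_data(n, s1_bins, s2_bins, s3_bins)
    df['product_id'] = [generate_hash(a, b) for a, b in zip(df['s1'], df['s3'])]
    df['supplier_id'] = [generate_hash(a, b) for a, b in zip(df['s2'], df['s3'])]
    return df[['product_id', 'supplier_id', 'price', 'weight', 'y']]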

We can now generate our data with obfuscated properties, replaced by product ids:

data = generate_data(n=300, s1_bins=4, s2_bins=1, s3_bins=2)
data.head(10)
##   product_id supplier_id  price  weight      y
## 0       7235        a154  2.228   2.287  4.470
## 1       9cb6        a154  3.629   2.516  8.986
## 2       3c7e        0aad  3.968   1.149  8.641
## 3       4184        0aad  3.671   2.044  7.791
## 4       4184        0aad  3.637   1.585  7.528
## 5       38f9        a154  1.780   1.661  4.709
## 6       7235        a154  3.841   2.201  6.040
## 7       efa0        0aad  2.773   2.055  4.899
## 8       4184        0aad  3.094   1.822  7.104
## 9       4184        0aad  4.080   2.826  8.591

We still see that different products tend to have different values, but we cannot easily compute the size value from the product id any more:

sns.relplot(
    x='price',
    y='y',
    hue='product_id',
    sizes=(40, 400),
    alpha=1,
    height=4,
    data=data
);
plt.show()
Price to size (y) relations per product id – image done by author

Let us generate a bigger dataset now. To be able to fairly compare our embedding model with a more naive baseline, and to be able to validate our approach, we will assume rather small values for our categorical cardinalities S1_BINS, S2_BINS, S3_BINS.

If S1_BINS, S2_BINS, S3_BINS >> 10000 the benchmark models will run into memory issues and will perform poorly.

from sklearn.model_selection import train_test_split
N = 100000
S1_BINS = 30
S2_BINS = 3
S3_BINS = 50
data = generate_data(n=N, s1_bins=S1_BINS, s2_bins=S2_BINS, s3_bins=S3_BINS)
data.describe()
# cardinality of product_id is approx. S1_BINS * S3_BINS,
# cardinality of supplier_id is approx. S2_BINS * S3_BINS
##             price      weight           y
## count  100000.000  100000.000  100000.000
## mean        3.005       2.002      17.404
## std         0.997       0.502       8.883
## min        -1.052      -0.232      -2.123
## 25%         2.332       1.664       9.924
## 50%         3.004       2.002      17.413
## 75%         3.676       2.341      24.887
## max         7.571       4.114      38.641
data.describe(include='object')
##        product_id supplier_id
## count      100000      100000
## unique       1479         149
## top          8851        0d98
## freq          151        1376

We will now split the data into features and response, and into train and test sets.

x = data[['product_id', 'supplier_id', 'price', 'weight']]
y = data[['y']]
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=456)

Building benchmark models: naive and baseline

Let us first assemble a very naive linear model, which only tries to estimate the size from price and weight and therefore performs quite poorly:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
naive_model = LinearRegression()
naive_model.fit(x_train[['price', 'weight']], y_train);
y_pred_naive = naive_model.predict(x_test[['price', 'weight']])
mean_squared_error(y_test, y_pred_naive)
## 77.63320758421973
mean_absolute_error(y_test, y_pred_naive)
## 7.586725358761727

The poor performance, with an error close to the overall variance of the response, is not surprising if we look at the correlations between price, weight and the response size, ignoring the ids:

sns.pairplot(data[['price', 'weight', 'y']].sample(1000));
plt.show()
Correlations between price, weight and size (y) – all over the place – image done by author

For a better benchmark we can one-hot-encode the categorical features and standardize the numeric data, using sklearn's ColumnTransformer to apply these transformations to different columns. Due to the number of features we will use ridge regression instead of ordinary linear regression to keep the coefficients small (but non-zero, unlike with lasso, which would mean losing information about specific categories).

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
def create_one_hot_preprocessor():
    return ColumnTransformer([
        ('one_hot_encoder', OneHotEncoder(sparse=False, handle_unknown='ignore'), ['product_id', 'supplier_id']),
        ('standard_scaler', StandardScaler(), ['price', 'weight'])]
    )

The one-hot preprocessor spreads the categories into separate columns and makes the data wide; additionally, the numerical features are standardized to zero mean and unit variance.
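
The ridge baseline itself is not shown in this post, but its fitted pipeline (baseline_pipeline) and its predictions (y_pred_baseline) are used further below; a minimal sketch with default parameters (the regularization strength actually used is unknown) could look like this:

from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
baseline_pipeline = Pipeline(steps=[
    ('one_hot_preprocessor', create_one_hot_preprocessor()),
    ('ridge', Ridge())
])
baseline_pipeline.fit(x_train, y_train);
y_pred_baseline = baseline_pipeline.predict(x_test)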

Building a neural net with embedding layers, including preprocessing and unknown category handling

Now let us build a neural net model with embedding layers for our categoricals. To feed them to the embedding layers we first need to map the categorical variables to integer sequences, i.e. to integers from the ranges [0, #product ids] and [0, #supplier ids], respectively.

Since sklearn's OrdinalEncoder cannot handle unknown values as of now, we need to improvise. Unknown categories can occur when splitting the test data randomly or when new data shows up in the wild at prediction time. We therefore use a simple implementation of an encoder which can handle unknown values (working with dataframes instead of arrays, not optimized for speed): we essentially use an ordered dictionary as a hash map to map the values to positive integer ranges, where unknown values get mapped to 0 (we need to map to non-negative integers to conform with the embedding layer later on).
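
The ColumnEncoder is not listed in this post; a minimal sketch of such an encoder (an assumption based on the description above, not the author's exact implementation) could look like this:

from collections import OrderedDict
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnEncoder(BaseEstimator, TransformerMixin):
    # encodes the object columns of a dataframe as positive integers,
    # unknown values are mapped to 0
    def fit(self, X, y=None):
        self.columns_ = X.select_dtypes(include='object').columns
        self.maps_ = {
            col: OrderedDict(zip(sorted(X[col].unique()),
                                 range(1, X[col].nunique() + 1)))
            for col in self.columns_
        }
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.columns_:
            # values not seen during fit become 0
            X[col] = X[col].map(self.maps_[col]).fillna(0).astype(int)
        return X

    def inverse_transform(self, X):
        X = X.copy()
        for col in self.columns_:
            inverse = {v: k for k, v in self.maps_[col].items()}
            # the unknown token 0 has no inverse and becomes missing
            X[col] = X[col].map(inverse)
        return X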

The encoder can now encode and decode our data:

ce = ColumnEncoder()
ce.fit_transform(x_train)
##        product_id  supplier_id  price  weight
## 17414         941          104  2.536   1.885
## 54089         330          131  3.700   1.847
## 84350         960          122  3.517   2.341
## 68797         423           77  4.942   1.461
## 50994         617          138  4.276   1.272
## ...           ...          ...    ...     ...
## 55338         218          118  2.427   2.180
## 92761         528           10  1.705   1.368
## 48811         399           67  3.579   1.938
## 66149         531          126  2.216   2.997
## 30619        1141           67  1.479   1.888
## 
## [75000 rows x 4 columns]
ce.inverse_transform(ce.transform(x_train))
##       product_id supplier_id  price  weight
## 17414       a61d        b498  2.536   1.885
## 54089       36e6        e41f  3.700   1.847
## 84350       a868        d574  3.517   2.341
## 68797       4868        80cf  4.942   1.461
## 50994       69f3        eb54  4.276   1.272
## ...          ...         ...    ...     ...
## 55338       2429        cc4a  2.427   2.180
## 92761       5c02        0ec5  1.705   1.368
## 48811       426c        7a7d  3.579   1.938
## 66149       5d45        dc6f  2.216   2.997
## 30619       c73d        7a7d  1.479   1.888
## 
## [75000 rows x 4 columns]
x_train.equals(ce.inverse_transform(ce.transform(x_train)))
## True

It can also handle unknown data by mapping it to the zero-category:

import pandas as pd
unknown_data = pd.DataFrame({
    'product_id': ['!§$%&/()'],
    'supplier_id': ['abcdefg'],
    'price': [10],
    'weight': [20],
})
ce.transform(unknown_data)
##    product_id  supplier_id  price  weight
## 0           0            0     10      20
ce.inverse_transform(ce.transform(unknown_data))
##   product_id supplier_id  price  weight
## 0       None        None     10      20

To feed the data to the model we need to split the input so it can be passed to the different layers, essentially into X = [X_embedding1, X_embedding2, X_other]. We can do this using a transformer again, this time working with numpy arrays, since the StandardScaler returns arrays.
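
The EmbeddingTransformer is likewise not listed here; a minimal sketch that splits the array into the embedding columns and the remaining columns (again an assumption, not the author's exact code) could be:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class EmbeddingTransformer(BaseEstimator, TransformerMixin):
    # splits X into [X[:, [col]] for col in cols] plus the remaining columns
    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = np.asarray(X)
        embedding_cols = [X[:, [col]] for col in self.cols]
        other_cols = np.delete(X, self.cols, axis=1)
        return embedding_cols + [other_cols]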

emb = EmbeddingTransformer(cols=[0, 1])
emb.fit_transform(x_train.head(5))
## [array([['a61d'],
##        ['36e6'],
##        ['a868'],
##        ['4868'],
##        ['69f3']], dtype=object), array([['b498'],
##        ['e41f'],
##        ['d574'],
##        ['80cf'],
##        ['eb54']], dtype=object), array([[2.5360678952988436, 1.8849677601403312],
##        [3.699501628053666, 1.8469279753798342],
##        [3.5168780519630527, 2.340554963373134],
##        [4.941651644756232, 1.4606898248596456],
##        [4.27624682317603, 1.2715509823965785]], dtype=object)]

Let’s combine those two now, and fit the preprocessor with the training data to encode the categories, perform the scaling and bring it into the right format:

from sklearn.pipeline import Pipeline
def create_embedding_preprocessor():
    encoding_preprocessor = ColumnTransformer([
        ('column_encoder', ColumnEncoder(), ['product_id', 'supplier_id']),
        ('standard_scaler', StandardScaler(), ['price', 'weight'])
    ])
    embedding_preprocessor = Pipeline(steps=[
        ('encoding_preprocessor', encoding_preprocessor),
        # careful here, column order matters:
        ('embedding_transformer', EmbeddingTransformer(cols=[0, 1])),
    ])
    return embedding_preprocessor
embedding_preprocessor = create_embedding_preprocessor()
embedding_preprocessor.fit(x_train);

If we feed this data to the model now it will not be able to learn anything reasonable for unknown categories, i.e. categories that did not exist in the training data x_train when we did the fit. So once we try to make predictions for those we might receive unreasonable estimates.

One way to tackle this out-of-vocabulary problem is to set some random training observations to unknown categories. That way the transformation will encode these with 0, the token for unknowns, during training, which allows the model to learn something close to the mean for unknown categories. With more domain knowledge we could also pick any other category as the default, instead of sampling at random.

import numpy as np
# random state assumed here, the original seed is not shown
np_random_state = np.random.RandomState(456)
# vocab sizes
C1_SIZE = x_train['product_id'].nunique()
C2_SIZE = x_train['supplier_id'].nunique()
x_train = x_train.copy()
n = x_train.shape[0]
# set a fair share to unknown
idx1 = np_random_state.randint(0, n, int(n / C1_SIZE))
x_train.iloc[idx1, 0] = '(unknown)'
idx2 = np_random_state.randint(0, n, int(n / C2_SIZE))
x_train.iloc[idx2, 1] = '(unknown)'
x_train.sample(10, random_state=1234)
##       product_id supplier_id  price  weight
## 17547       7340        6d30  1.478   1.128
## 67802       4849        f7d5  3.699   1.840
## 17802       de88        55a0  3.011   2.306
## 36366       0912        1d0d  2.453   2.529
## 27847       f254        56a6  2.303   2.762
## 19006       2296   (unknown)  2.384   1.790
## 34628       798f        5da6  4.362   1.775
## 11069       2499        803f  1.455   1.521
## 69851       cb7e        bfac  3.611   2.039
## 13835       8497        33ab  4.133   1.773

We can now transform the data:

x_train_emb = embedding_preprocessor.transform(x_train)
x_test_emb = embedding_preprocessor.transform(x_test)
x_train_emb[0]
## array([[ 941.],
##        [ 330.],
##        [ 960.],
##        ...,
##        [ 399.],
##        [ 531.],
##        [1141.]])
x_train_emb[1]
## array([[104.],
##        [131.],
##        [122.],
##        ...,
##        [ 67.],
##        [126.],
##        [ 67.]])
x_train_emb[2]
## array([[-0.472, -0.234],
##        [ 0.693, -0.309],
##        [ 0.51 ,  0.672],
##        ...,
##        [ 0.572, -0.128],
##        [-0.792,  1.976],
##        [-1.53 , -0.229]])

Time to build the neural network! We have 3 inputs (2 embedding inputs and 1 normal input). The embedding inputs are each passed through an Embedding layer and flattened, then concatenated with the normal input. The following hidden layers consist of Dense and Dropout, followed by a final Dense layer with linear activation.
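
The create_model helper is not shown in the post; a sketch of an architecture matching the description above and the three-dimensional embedding weights extracted later could look as follows (the hidden layer size, dropout rate and optimizer are assumptions):

from tensorflow.keras import layers, models

def create_model(embedding1_vocab_size, embedding2_vocab_size, embedding_dim=3):
    input1 = layers.Input(shape=(1,), name='embedding1_input')
    input2 = layers.Input(shape=(1,), name='embedding2_input')
    input3 = layers.Input(shape=(2,), name='other_input')
    # one embedding per categorical input, flattened to a dense vector
    emb1 = layers.Embedding(embedding1_vocab_size, embedding_dim,
                            name='embedding1')(input1)
    emb2 = layers.Embedding(embedding2_vocab_size, embedding_dim,
                            name='embedding2')(input2)
    x = layers.Concatenate()([layers.Flatten()(emb1),
                              layers.Flatten()(emb2),
                              input3])
    x = layers.Dense(32, activation='relu')(x)
    x = layers.Dropout(0.2)(x)
    output = layers.Dense(1, activation='linear')(x)
    model = models.Model(inputs=[input1, input2, input3], outputs=output)
    model.compile(optimizer='adam', loss='mse')
    return model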

import tensorflow as tf
model = create_model(embedding1_vocab_size=C1_SIZE+1, 
                     embedding2_vocab_size=C2_SIZE+1)
tf.keras.utils.plot_model(
    model,
    to_file='../../static/img/keras_embeddings_model.png',
    show_shapes=True,
    show_layer_names=True,
)
Neural network with embedding layers, image done by author using tensorflow.keras.utils.plot_model
num_epochs = 50
model.fit(
    x_train_emb,
    y_train,
    validation_data=(x_test_emb, y_test),
    epochs=num_epochs,
    batch_size=64,
    verbose=0,
);

If you see an error like InvalidArgumentError: indices[37,0] = 30 is not in [0, 30), you have chosen the wrong vocabulary size: according to the docs it should be the largest integer index of the embedded values plus 1.

As we can see, the model performs comparably to, and even slightly better than, the baseline linear model on our linear problem, though its true strength will only come into play once we blow up the cardinality of the categoricals or add non-linearity to the response:

y_pred_emb = model.predict(x_test_emb)
mean_squared_error(y_pred_baseline, y_test)
## 0.34772562548572583
mean_squared_error(y_pred_emb, y_test)
## 0.2192712407711225
mean_absolute_error(y_pred_baseline, y_test)
## 0.3877675893786461
mean_absolute_error(y_pred_emb, y_test)
## 0.3444112401247173

We can also extract the weights of the embedding layers (the first row contains the weights for the zero-label):

weights = model.get_layer('embedding1').get_weights()
pd.DataFrame(weights[0]).head(11)
##         0      1      2
## 0  -0.019 -0.057  0.069
## 1   0.062  0.059  0.014
## 2   0.051  0.094 -0.043
## 3  -0.245 -0.330  0.410
## 4  -0.224 -0.339  0.576
## 5  -0.087 -0.114  0.324
## 6   0.003  0.048  0.093
## 7   0.349  0.340 -0.281
## 8   0.266  0.301 -0.275
## 9   0.145  0.179 -0.153
## 10  0.060  0.050 -0.049

These weights could potentially be persisted and used as features in other models (pre-trained embeddings). Looking at the first 10 categories we can see that values of supplier_id with similar mean responses y have similar weights:

column_encoder = (embedding_preprocessor
  .named_steps['encoding_preprocessor']
  .named_transformers_['column_encoder'])
data_enc = column_encoder.transform(data)
(data_enc
  .sort_values('supplier_id')
  .groupby('supplier_id')
  .agg({'y': np.mean})
  .head(10))
##                   y
## supplier_id        
## 1            19.036
## 2            19.212
## 3            17.017
## 4            15.318
## 5            17.554
## 6            19.198
## 7            17.580
## 8            17.638
## 9            16.358
## 10           14.625

Moreover the model behaves reasonably on unknown data: its prediction is similar to the one of our very naive baseline model and also close to a simple conditional mean:

unknown_data = pd.DataFrame({
    'product_id': ['!%&/§(645h'],
    'supplier_id': ['foo/bar'],
    'price': [5],
    'weight': [1]
})
np.mean(y_test['y'])
## 17.403696362314747
# conditional mean
idx = x_test['price'].between(4.5, 5.5) & x_test['weight'].between(0.5, 1.5)
np.mean(y_test['y'][idx])
## 18.716701011038868
# very naive baseline
naive_model.predict(unknown_data[['price', 'weight']])
## array([[18.905]])
# ridge baseline
baseline_pipeline.predict(unknown_data)
## array([[18.864]])
# embedding model
model.predict(embedding_preprocessor.transform(unknown_data))
## array([[19.045]], dtype=float32)

Wrap up

We have seen how we can leverage embedding layers to encode high cardinality categorical variables, and depending on the cardinality we can also tune the dimension of our dense feature space for better performance. The price for this is a much more complicated model compared to a classical ML approach with one-hot-encoding. If a classical model is preferred, the category weights can still be extracted from the embedding layer and used as features in a simpler model, thereby replacing the one-hot-encoding step.


Originally published at https://blog.telsemeyer.com.

