
Data Disruptions to Elevate Entity Embeddings

Injecting random values during neural network training can help you get more from your categoricals

Photo by dylan nolte on Unsplash

Today I will discuss a stochastic regularization method that improves the generalizability of entity embeddings in neural network models. I use a data generator to randomly inject a reserved value into selected inputs during training, helping the model learn how to deal with unseen codes.

Performance improvements are especially dramatic for hierarchical categorical features. Randomization helps models leverage higher-level group information to compensate for unseen lower-level codes.

Adding noise, removing information, or otherwise messing with data is often used to increase model robustness and reduce overfitting [1,2]. Here, it’s used to help a model learn what to do with missing categorical information. I examine one public test dataset, comparing unmodified data with randomization done two ways.

When unseen codes matter, randomly injecting values helps a model generalize.

Using a data generator to shuffle the injected values, so that each mini-batch sees different scenarios, performs better than static data modification.

A caveat is that overfitting can occur when a coding hierarchy is unrelated to the problem; I test random groupings of codes and see performance drops for randomized training data.

Background

Categorical Data

A categorical feature represents a category, rather than a numeric value. Everyday examples include gender, vehicle make, T-shirt size, and US zip code.

When categorical features have a lot of possible levels ("high cardinality"), both modeling and analytics become tricky. US zip codes are an example of a high-cardinality categorical, with approximately 40,000 values.

Many high-cardinality categoricals can be organized in a hierarchical manner. For example, you might (approximately) group zip codes into counties, counties into states, states into regions, etc. Hierarchies are almost always available for coding systems that are owned by industry groups or government agencies, for example US Bureau of Labor Statistics job title codes or ICD diagnosis codes. Unseen codes can be very important here, as systems are periodically updated.

Entity Embeddings

In a neural network model, entity embeddings create lower-dimensional representations of features that can take many discrete values [3]. Entity embeddings assign a numeric vector to each level of the categorical feature. These vectors are initialized to random values, but values are updated during training. The final, trained vectors (embeddings) have the additional advantage of providing a distance measure for categorical feature levels.

Methods

My models predict defaults for a public (CC BY 4.0) dataset related to loans from the US Small Business Administration [4–5]. As described previously [6–7], I select a subset of features reflecting general firmographics, including the high-cardinality NAICS feature, which represents industry [8]. I drop rows with missing industry (which are mostly from older loans). I do a 70/15/15 train-test-validation split, plus set aside 10% of NAICS codes as a holdout set to analyze model results on unseen codes.
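As a minimal sketch of this setup (assuming the loans are already loaded into a pandas DataFrame `df` with a `NAICS` column; column names here are hypothetical, not necessarily those in [11]):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Drop rows with missing industry codes (mostly older loans)
df = df.dropna(subset=["NAICS"])

# Set aside 10% of NAICS codes entirely, to test on unseen codes later
rng = np.random.default_rng(42)
codes = df["NAICS"].unique()
holdout_codes = rng.choice(codes, size=int(0.1 * len(codes)), replace=False)
holdout = df[df["NAICS"].isin(holdout_codes)]
seen = df[~df["NAICS"].isin(holdout_codes)]

# 70/15/15 train/test/validation split on the remaining rows
train, rest = train_test_split(seen, test_size=0.30, random_state=42)
test, val = train_test_split(rest, test_size=0.50, random_state=42)
```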

Neural network models are built with TensorFlow/Keras. Early stopping and the tanh activation are used for all models. NAICS information is incorporated into the model using Keras Embedding layers [9]. Embedding layers take integer inputs; I use scikit-learn’s OrdinalEncoder [10] to map NAICS codes to integers (more on this later).
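A sketch of how such a model might be wired together (layer sizes and the numeric feature count are illustrative, not the exact architecture from [11]):

```python
from tensorflow import keras

# Illustrative dimensions; input_dim must cover every integer code,
# including any values reserved for missing/unseen (discussed below)
n_naics_levels = 1500  # hypothetical count of encoded NAICS levels
embed_dim = 8          # hypothetical embedding width

naics_in = keras.Input(shape=(1,), name="naics")
naics_vec = keras.layers.Embedding(input_dim=n_naics_levels,
                                   output_dim=embed_dim)(naics_in)
naics_vec = keras.layers.Flatten()(naics_vec)

numeric_in = keras.Input(shape=(10,), name="numeric")  # other firmographics
x = keras.layers.Concatenate()([naics_vec, numeric_in])
x = keras.layers.Dense(32, activation="tanh")(x)
out = keras.layers.Dense(1, activation="sigmoid")(x)

model = keras.Model([naics_in, numeric_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[keras.metrics.AUC(curve="PR", name="pr_auc")])
```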

Code for this project can be found in [11]. Table data is also available there; I use images of tables in this post for readability.

Injecting "Unseen" Codes Rescues Overfitting

I previously showed that entity embeddings perform poorly on unseen codes for this dataset, and that randomly replacing NAICS encodings in the training dataset with the value used to represent missing codes improved performance [6]. But that solution felt sloppy to me.

Thinking about this more, I suspected it would be better to move the random replacements downstream into training, shuffling which cases are modified for each batch. This way, the model will, over time, see most of the training data. It might also be possible to tune the amount of randomization to strike the right balance.

Neural networks are flexible: it’s straightforward to write and use a custom data generator, which selects a different random sample for modification in each mini-batch. I chose "1" as the code representing unseen values; I replace actual NAICS encodings with this value during randomization.
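Below is a minimal sketch of such a generator, built on keras.utils.Sequence (the actual implementation is in [11]; class and variable names here are illustrative):

```python
import numpy as np
from tensorflow import keras

UNSEEN_CODE = 1  # reserved integer representing an unseen NAICS code

class InjectionSequence(keras.utils.Sequence):
    """Serves mini-batches with a random fraction of NAICS encodings
    replaced by the reserved 'unseen' code; a fresh random sample is
    drawn every time a batch is requested."""

    def __init__(self, naics, numeric, y, batch_size=256, inject_frac=0.10):
        super().__init__()
        self.naics, self.numeric, self.y = naics, numeric, y
        self.batch_size, self.inject_frac = batch_size, inject_frac
        self.rng = np.random.default_rng()

    def __len__(self):
        return int(np.ceil(len(self.y) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        naics = self.naics[sl].copy()
        mask = self.rng.random(len(naics)) < self.inject_frac
        naics[mask] = UNSEEN_CODE  # inject "unseen" codes on the fly
        return {"naics": naics, "numeric": self.numeric[sl]}, self.y[sl]
```

Passing an instance of this class to model.fit() applies a fresh random mask to every batch served.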

Here, I set 10% of the training NAICS to the unseen code, in two different ways. Fixed randomization modifies a random 10% of the training data prior to fitting. Shuffle randomization uses the data generator to modify a different 10% sample for each mini-batch. Only the training data is modified; validation, test, and NAICS holdout data remain unchanged and are the same across tests.

Table 1. Model performances measured for a randomly-selected dataset ("test" column), and unseen "holdout" NAICS codes (right column). Including entity embeddings with no data modifications (second row) leads to a performance decrease for unseen codes. Randomly injecting encodings into the training data reverses this (third and fourth rows). "Fixed randomization" modifies the data prior to fitting. "Shuffle randomization" uses a data generator to select cases for modification at each training step. Table by author.

Table 1 contains model performance, measured by precision-recall area under the curve (PR-AUC). Incorporating NAICS via entity embeddings is a big win for randomly selected test data. However, in the absence of data modification, performance is below the baseline for the unseen NAICS codes (the "holdout" sample). Interestingly, injecting random values for NAICS recovers baseline performance!

It’s great to reduce overfitting for entity embeddings, but there’s more. NAICS codes have a hierarchical structure; can data modification help leverage that?

Randomization Lets the Hierarchy Shine Through

As discussed above, some high-cardinality categoricals are organized in a hierarchical manner. NAICS codes, which I use here, are maintained by the US government and contain a 5-level classification of establishment type. Examples are shown in Table 2.

Table 2. Examples of NAICS codes illustrating their hierarchical structure. Copied from [7]. The low-level codes are bucketed into more general groups of varying specificity. The lowest-level, 6-digit National Industry code can take ~1,200 values, whereas there are only 21 codes at the highest (sector) level. Table by author.

I add features based on the code structure in Table 2 to the model. Table 3 shows tests for models that incorporate the base NAICS (6 digits), industry group (4 digits), subsector (3 digits), and sector, using entity embeddings for each feature.
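Because each level of the hierarchy is (approximately) a prefix of the 6-digit code, these features can be derived by string slicing. A sketch, assuming string-typed codes in the DataFrame from the earlier sketches (note that a few official sectors span several 2-digit prefixes, e.g. Manufacturing, so an exact sector assignment may need a lookup table):

```python
# Each hierarchy level is a prefix of the 6-digit NAICS code (Table 2)
df["naics_industry_group"] = df["NAICS"].str[:4]  # 4-digit industry group
df["naics_subsector"] = df["NAICS"].str[:3]       # 3-digit subsector
df["naics_sector"] = df["NAICS"].str[:2]          # 2-digit sector (approx.)
```

Each derived feature then gets its own Embedding layer, just like the base code.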

Table 3. Effect of including features based on the NAICS code hierarchy in models. The top two rows of Table 3 are copied from Table 1 for comparison. Without data modification, including hierarchical features increases performance (row 3). Using data randomization leads to an even larger increase (rows 4–5). Table by author.

Table 3 shows that hierarchy features enhance performance in the absence of data modification. But data randomization takes it to the next level. When data modifications are used in conjunction with entity embeddings, performance for unseen codes is comparable to codes used in training!

Shuffled Randomization Beats Fixed

In Tables 1 and 3, I see pretty similar results from injecting random unseen codes into data, whether I use a fixed data modification or shuffle during batching.

Are both data modification methods equivalent? Intuitively, it feels like shuffling should be better. Fixed modification drops information from the training data; additionally, fewer combinations are used to train with injected values. Figure 1 shows the effects of downsampling the training data only (validation, test, and NAICS holdout data are unchanged).

Figure 1. Model performance vs. volume of training data. A. Results measured on randomized test data for models that include NAICS entity embeddings without data modification (blue line), fixed randomization (green line), and shuffled randomization via a data generator (red line). B. Similar to A, but on a dataset containing unseen NAICS codes. C. Similar to A, but additional features representing higher levels of the NAICS hierarchy are added to the model. D. Similar to C, but on a dataset containing unseen NAICS codes. Image by author.

In Figure 1, the plots in the right column show distinct differences in performance for unseen codes. The unmodified data performance is lowest. Data modification improvements are most dramatic when the NAICS hierarchy is used (Fig 1D).

For all the comparisons in Figure 1, if I just count the number of times each data treatment "wins" (has the highest PR-AUC), I get this:

Table 4. Summary of results in Figure 1. For each point in the curves, I select a "winner" as the data treatment with the highest performance. The table shows the number of wins by method. Table by author.

Out of 36 comparisons, shuffle randomization is the winner in 24. Although fixed randomization is often better than no modification, it seldom beats shuffled randomization.

Below ~100,000 training cases, the curves in Figure 1 drop sharply; in this range, I don’t have enough data to make a good prediction. Unmodified data seems to win at these low case counts.

Overfitting is a Risk for Some Hierarchies

Some NAICS encodings used in tree-based models are sensitive to the details of categorical coding systems; the wrong hierarchy leads to overfitting [6,7]. I didn’t think neural networks would have the same issue but decided to test it.

Instead of the standard NAICS hierarchy, I grouped codes randomly and retried the models. Entity embeddings were included for the base NAICS and for random groups with roughly the same number of levels as the standard NAICS hierarchy.
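A sketch of how such random groupings might be built (the group count shown is the sector-level cardinality from Table 2; other levels would be matched the same way):

```python
import numpy as np

rng = np.random.default_rng(0)
codes = df["NAICS"].unique()  # DataFrame from the earlier sketches

def random_grouping(n_groups):
    """Assign each NAICS code to a random group, matching the
    cardinality of one level of the real hierarchy."""
    return dict(zip(codes, rng.integers(0, n_groups, size=len(codes))))

# e.g., a fake "sector" with the same number of levels as the real one
df["random_sector"] = df["NAICS"].map(random_grouping(21))
```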

Table 5. Effects of using a random NAICS hierarchy for different data treatments. Results are compared to models with only the lowest-level NAICS code for each treatment. "NAICS only" results are copied from Table 1. Table by author.

Unfortunately, Table 5 shows overfitting with random groups, across data treatments (guess I was wrong). Sadly, the effects are worst with shuffled random values, especially if you consider the performance drop relative to the NAICS-only model.

Data modification opens a window to useful information that can enhance predictions for unseen codes, but bad as well as good stuff can come in the window.

When the hierarchy is meaningful to the problem at hand, using it with random injection is very powerful. But an unrelated grouping can be counterproductive. Therefore, it’s necessary to be thoughtful about the coding system.

How Much Randomization is Enough?

For situations where randomization is useful, how can we decide how many cases to modify? There may be quantitative answers to this question, but for this blog, I chose a 10% injection rate just because it felt right. Let’s test some different injection rates, using the data generator to shuffle cases; a sketch of the sweep follows, and results are in Figure 2.
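A sketch of that sweep, reusing the InjectionSequence generator sketched earlier (build_model and the data arrays are hypothetical placeholders):

```python
from tensorflow import keras

results = {}
for frac in [0.0, 0.01, 0.05, 0.10, 0.25, 0.50, 0.80, 0.95]:
    model = build_model()  # hypothetical factory returning a fresh, compiled model
    train_seq = InjectionSequence(naics_tr, numeric_tr, y_tr, inject_frac=frac)
    early = keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
    model.fit(train_seq, validation_data=val_data, epochs=100,
              callbacks=[early], verbose=0)
    # Evaluate on the untouched test and unseen-code holdout sets
    results[frac] = {"test": model.evaluate(*test_data, verbose=0),
                     "holdout": model.evaluate(*holdout_data, verbose=0)}
```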

Figure 2. Model performance for the shuffled randomization method, as a function of the fraction of cases modified. A. Performance on models that include only the lowest-level NAICS code. B. Performance on models that include the lowest-level NAICS code plus features based on the hierarchy. Image by author.

For NAICS without higher-level features (Figure 2A), there is a wide range of effective randomization levels. Performance on the test data decreases above ~80% modification rates. Of course, high levels of randomization mean NAICS codes are unavailable for training.

For the holdout data in Figure 2A, the low-rate end of the curve shows a sharp increase. Not many values need to be modified to correct the overfitting; even 1% seems to work! In addition, performance does not drop off at high injection rates. This is expected because the modification only reduces overfitting; no NAICS information is available for unseen codes anyway.

Figure 2B shows results from models that include multiple levels of the coding hierarchy. For the test dataset, there’s a slow decrease in performance above injection rates of maybe 40–50%. I think what is happening is that, when too many lower-level codes are masked in training, the model starts to rely on the less specific (and less predictive) higher-level groups.

The holdout dataset in Figure 2B shows the same sharp increase at low rates as the holdout data in Figure 2A. But there is also a decrease at high injection rates (above ~80%). Now the model is using (higher-level) NAICS information and needs sufficient volume.

Photo by Annie Spratt on Unsplash

Is this Method Good?

Since using a generator beats static modification of training data, I’ll discuss strengths and weaknesses of that strategy.

Performance on seen codes for a single categorical: GOOD. At least for this one test dataset, the performance on the random test data is unchanged or maybe even improved under most conditions. With a high level of randomness, there can be performance loss, but there is a wide range of injection rates that work.

Performance on unseen codes for a single categorical: GOOD. For NAICS-only models, injection of unseen values reverses the overfitting, and even improves performance above the baseline.

Ability to leverage a relevant code hierarchy: GREAT. When the hierarchy is used, model performance for unseen codes is comparable to seen codes! It’s a huge step up from the XGBoost techniques I tried on similar data [7].

Bias avoidance when a hierarchy is irrelevant: BAD. This is one place random injection falls short. Performance can drop if a coding hierarchy doesn’t relate to the response.

Low feature engineering burden: GOOD. I first tried static random injection of codes for unseen values in a previous blog [6]. But upstream modification just felt wrong to me. Using the generator feels much better. The data generator prevents total loss of information in the training data, plus simulates a broader range of scenarios with missing codes.

Ease of implementation: OK. The generator itself is very straightforward; feel free to copy and improve my code. In my ideal world, it might be nice if Keras (PyTorch, etc.) had randomization built in, maybe as a parameter to the embedding layer.

Things are a little less pleasing upstream with integer encoding, which is needed for input to the embedding layer. I used scikit-learn’s OrdinalEncoder, which doesn’t make it easy to reserve a specific value for unseen or missing codes. I got around this by wrapping OrdinalEncoder in a class to add 2 to the encoded values so I could reserve 0 for missing and 1 for unseen codes. Note that if you don’t do this, you would need a different code to represent missing/unseen for each categorical, and you’d need your custom data generator to inject these disparate values into the corresponding features. You also need to count the number of levels in training before the encoder is fit.
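A simplified sketch of that wrapper (not the exact class from [11]; note that the encoded_missing_value parameter requires scikit-learn ≥ 1.1):

```python
from sklearn.preprocessing import OrdinalEncoder

class ShiftedOrdinalEncoder:
    """Wraps OrdinalEncoder so 0 is reserved for missing and 1 for
    unseen codes, with real levels starting at 2."""

    def __init__(self):
        # -2 -> missing, -1 -> unseen; adding 2 maps them to 0 and 1
        self._enc = OrdinalEncoder(handle_unknown="use_encoded_value",
                                   unknown_value=-1,
                                   encoded_missing_value=-2)

    def fit(self, X):
        self._enc.fit(X)
        # Level counts for sizing Embedding input_dim, reserved codes included
        self.n_levels_ = [len(c) + 2 for c in self._enc.categories_]
        return self

    def transform(self, X):
        return (self._enc.transform(X) + 2).astype(int)
```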

Tentative verdict: Pretty good! I think the method is promising for many scenarios, but some thought and testing would be required when using code hierarchies. However, I only test one dataset and one coding system; examining results on more datasets and different coding systems would be nice.

It may be that other datasets or coding systems won’t work as well. In some contexts, it might be necessary to tune the injection fraction, or to vary the fraction by feature.

What About Missing Values?

Missing values can be encoded with the same value as used for unseen, or a different value. If missing and unknown map to the same code, and your dataset contains sporadic missing values, use of a randomizer may not be necessary, as combinations with unknown NAICS would "naturally" be present.

However, in many datasets, including this one, that strategy could be problematic. For the SBA loans data, missing values occur for very old loans, which are different in many ways from current loans. In this case, using the missing code for unseens could bias the results badly. In addition, you would need to worry about having enough missing values for the method to work, and the rate would not be tunable. The generator might provide a more versatile solution.

Final Thoughts

It’s no secret that industry data is not really like datasets used in coursework, competitions, research papers, etc. In my opinion, one difference is the number of high-cardinality coding systems occurring in the wild. Actually, there are entire careers and companies built around coding systems. Government-generated codes are quickly co-opted by industry and used in contexts quite different from their original purpose.

Some standard coding systems have been like a rock in my shoe; I feel that more can be done with them, but I haven’t had the opportunity to explore. Often, only general groupings are considered. I’m excited to be able to explore different options for these. Randomization during training seems like it may be very useful in my work.

Next up, I hope to do a deeper dive into visualizations and analytics for the NAICS embeddings. Thanks for reading!

References

[1] Devansh, Using Randomness Effectively in Deep Learning (2022), Medium.

[2] R. Moradi, R. Berangi and B. Minaei, A survey of regularization strategies for deep models (2020), Artificial Intelligence Review 53:3947–3986

[3] C. Guo and F. Berkhahn, Entity Embeddings of Categorical Variables (2016) arXiv:1604.06737

[4] M. Li, A. Mickel and S. Taylor, Should This Loan be Approved or Denied?: A Large Dataset with Class Assignment Guidelines (2018), Journal of Statistics Education 26 (1). (CC BY 4.0)

[5] M. Toktogaraev, Should This Loan be Approved or Denied? (2020), Kaggle. (CC BY-SA 4.0)

[6] V. Carey, Exploring Hierarchical Blending in Target Encoding (2024), Towards Data Science.

[7] V. Carey, No Label Left Behind: Alternative Encodings for Hierarchical Categoricals (2024), Towards Data Science.

[8] United States Census, North American Industry Classification System.

[9] Keras 3 API documentation, Embedding layer (2024)

[10] Scikit learn documentation, OrdinalEncoder (2024)

[11] V. Carey, GitHub Repository, https://github.com/vla6/Blog_naics_nn

