Getting Started
At the very beginning of this millennium, when my hair was a lot darker, I wrote a little article with a pretty long name, entitled "A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems". It was a straightforward article that I decided to write, driven by a very practical need for a method to deal with data types that were hard to plug into Machine Learning (ML) models. At the time, I was working on ML models to detect fraudulent e-commerce transactions. Therefore, I was dealing with very "sparse" categorical variables, like ZIP codes, IP addresses, or SKUs. I could not find an easy way to "preprocess" such variables, except for the traditional one-hot encoding, which didn't scale well to situations where one deals with hundreds or even thousands of unique values. A decision tree method popular at the time, the C5.0 algorithm by R. Quinlan, provided the capability to group individual values into sets as part of the tree generation process. However, C5.0 was a proprietary algorithm, and therefore the method used to come up with what it called "discrete values subsets" was not disclosed.
Thus, I came up with my own, very pragmatic method: I replaced the original categorical value with the expected value of the target, given the variable's value. That's a fancy way to say that, for binary classification problems, I used the proportion of true targets Y observed in the training data when the input variable X assumed the value Xi. The method extends well to regression and multi-class classification problems too: just replace the input with the average of the target when X=Xi for regression models, or use N-1 probability encodings for N-class classification problems. That was it, or almost it, and your intractable categorical variable easily becomes a much more mellow and informative continuous variable.
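To make the basic idea concrete, here is a minimal sketch in pandas; the data and the column names ("zip", "fraud") are made up purely for illustration:

```python
import pandas as pd

# Toy training data: a high-cardinality categorical input and a binary target.
train = pd.DataFrame({
    "zip": ["94089", "94089", "94089", "10001", "10001", "60601"],
    "fraud": [1, 0, 1, 0, 0, 1],
})

# Basic target encoding: replace each category with the mean of the target
# observed for that category in the training data. For a binary target this
# is the empirical P(Y=1 | X=Xi); for regression it is the conditional mean.
encoding = train.groupby("zip")["fraud"].mean()
train["zip_encoded"] = train["zip"].map(encoding)

print(train)
```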
What complicates things is that when dealing with very high-cardinality input variables, many values may have a relatively small sample size. Therefore, simply trusting the statistics of the target could mean trouble (namely overfitting trouble). For this reason, I looked up some literature and adopted a smoothing approach whereby the estimate of the target given X=Xi gets blended with the prior probability of the target (or the baseline). The blending is controlled by a parameter that depends on the sample size. The larger the sample, the more the estimate is weighted toward the value of the target given X=Xi; the smaller the sample, the more the estimate is weighted toward the overall baseline for the target.
In formula form, the blending can be expressed as S(Xi) = Lambda(Ni) * P(Y | X=Xi) + (1 - Lambda(Ni)) * P(Y), where Ni is the sample size for X=Xi, and the Lambda function is some monotonically increasing function of the sample size, taking values between 0 and 1.
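One possible implementation of this blending is sketched below, using a sigmoid-shaped Lambda; the parameters k and f, as well as the column names, are illustrative choices rather than prescribed values:

```python
import numpy as np
import pandas as pd

def blended_target_encoding(df, cat_col, target_col, k=20, f=10):
    """Blend per-value target statistics with the global prior.

    Lambda(Ni) is a monotonically increasing function of the sample size Ni;
    here it is a sigmoid centered at k, with slope controlled by f (both are
    illustrative choices, not prescribed values).
    """
    prior = df[target_col].mean()                            # P(Y), the overall baseline
    stats = df.groupby(cat_col)[target_col].agg(["mean", "count"])
    lam = 1.0 / (1.0 + np.exp(-(stats["count"] - k) / f))    # Lambda(Ni), between 0 and 1
    blended = lam * stats["mean"] + (1.0 - lam) * prior      # the blending formula
    return df[cat_col].map(blended).fillna(prior)

# Usage on a toy DataFrame with made-up column names:
train = pd.DataFrame({
    "zip": ["94089"] * 3 + ["10001"] * 2 + ["60601"],
    "fraud": [1, 0, 1, 0, 0, 1],
})
train["zip_encoded"] = blended_target_encoding(train, "zip", "fraud", k=2, f=1)
print(train)
```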
The article appeared in a secondary publication, the ACM SIGKDD Explorations Newsletter, which gratified me a lot. Still, after that, nobody besides its author seemed to notice it much (the destiny of the long tail of peer-reviewed articles). But as the years passed, and Data Science reached its third wave of interest over the past 5 years, I noticed that a few people were reaching out to me with questions about the article. Then, thanks to Google Scholar, I realized that, all of a sudden, the article was being cited by other authors in the ML literature – not a ton of citations, of course. Still, somehow a few people were reading it, almost 20 years after its publication!
Digging a little more, I discovered that people had implemented the algorithm in Python packages and that even H2O seems to support this type of transformation. People even gave it a name, or a couple actually: "Target Encoding," or "Impact Encoding" – it sure sounds better than "A preprocessing scheme for…".
Anyhow, I am glad that ultimately what I wrote turned out helpful to some people. In the meantime, I have been using this method myself in various use cases and sharing it with my peers and colleagues. I also reviewed some of the work that cites my 2001 article to understand in what context this prior work was referenced, discovering that:
- An open-source ML algorithm named CatBoost, developed by Yandex, utilizes target statistics to represent categorical variables, but takes an alternative approach to smoothing and overfitting avoidance.
- The proposed target-dependent encoding is considered by many authors the baseline approach to dealing with high-cardinality categorical variables, including in a recent book by Kuhn and Johnson entitled "Feature Engineering and Selection".
- Various applied Data Science papers (in social media, e-commerce, credit scoring, etc.) refer to the article because it was their chosen method for treating categorical variables.
As much as I was amazed to find so many references to the use of target-dependent encoding, I noticed that none of the papers I was able to read, nor the various open-source implementations, seem to leverage one important suggestion made in the paper: the application to categorical variables with an implicit hierarchical structure. That's yet another fancy name for something straightforward, namely that a high-cardinality categorical variable often has a natural, logical way to group values "up". There are many examples of such variables in ML applications:
- ZIP Codes: in the US, the postal codes are not assigned to regions randomly, of course. Similar ZIP codes generally cover neighboring or nearby areas. Thus, if we consider all the ZIP codes starting with the same 4 digits (ZIP4), we would be grouping together neighboring codes, and we can go to ZIP3 as well.
- IP Addresses: While there are potentially 4,294,967,296 different IPv4 addresses (the actual number is smaller due to reserved blocks), they are designed with a built-in hierarchical structure: ZZZ.YYY.XXX.NNN. Generally, they are allocated in contiguous blocks, so we can leverage their natural grouping structure as well.
- Industry Codes: The NAICS system provides a very detailed 6-digit classification scheme for business types, which is very useful in various types of ML applications. While the number of unique codes is huge, NAICS codes are also hierarchical: NAICS5, NAICS4, and up to NAICS2 represent broader and broader business categories (for example, NAICS2 = 51 indicates the Information sector, which includes publishers, while 22 represents Utilities). A short sketch of how such parent codes can be derived follows this list.
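Here is what the roll-up looks like for the three examples above; the specific values are made up and used only to show the string manipulation involved:

```python
# Deriving coarser "parent" codes from the raw values (illustrative values only).
zip5 = "94089"
zip4, zip3 = zip5[:4], zip5[:3]          # "9408", "940"

ip = "192.168.42.17"
ip24 = ".".join(ip.split(".")[:3])       # "192.168.42" (the /24 block)
ip16 = ".".join(ip.split(".")[:2])       # "192.168"    (the /16 block)

naics6 = "511120"                        # a 6-digit NAICS-style code
naics4, naics2 = naics6[:4], naics6[:2]  # "5111", "51"
```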
There are many other examples of categorical variables that can easily be "rolled up" to a lower level of granularity and used in a model. However, the point here is not simply to reduce cardinality by "lowering" the number of unique values using the hierarchical structure; with target encoding, things get somewhat more interesting…
Earlier, I referred to the "blending" formula, which controls the weighting between the "value statistics" (namely the target statistics when X=Xi) and the target's prior probability. We use the unconditional prior probability of the target simply because we assume there is no better alternative when P(Y|X=Xi) is unreliable due to a small sample size. However, when we have a hierarchical structure we can leverage, we can look at the target statistics for the "parent group." Thus, if the sample size for ZIP code 94089 is too small, we can factor in the target statistics for all records with ZIP codes of the form 9408x, which will hopefully provide a more reliable estimate – and if not, what about considering 940xx? This hierarchical version of the Target Encoding method (sketched in code after the list below) provides a couple of advantages:
- Variable, data-driven granularity: Going back to the ZIP code case, we may observe that some ZIP5 codes in our dataset have a ton of data, while others are very sparsely represented. One may be tempted to consider ZIP4 or even ZIP3 aggregation to address the many sparse ZIP5 codes; however, that would cause an unnecessary blurring of information for the highly populated ZIP codes. With the proposed approach, the level of aggregation is effectively variable – high-frequency values will use target statistics specific to their value, while others will be grouped with related (by the hierarchy) codes, thus using broader group statistics.
- Higher differentiation: With its standard formulation, Target Encoding may produce a continuous variable (in actuality, a discretized continuous variable), for which a relatively large number of cases are mapped to the prior probability of the target, which is a "neutral" value. This is particularly true for categorical variables with highly skewed distributions of values or long tails. This is because the target estimate for low-frequency values will be heavily weighted toward the same number, namely the target’s prior probability. However, when we have a hierarchical structure available for the categorical variable, using this variation, we will end up with a much more distributed representation since each sparse value will be blended with a potentially different target estimate.
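To make the hierarchical fallback concrete, here is a sketch of one possible implementation, assuming the parent groups can be derived by truncating the code (as with ZIP codes). The sigmoid form of Lambda, its parameters k and f, and the column names are illustrative choices on my part, not the paper's prescription:

```python
import numpy as np
import pandas as pd

def lam(n, k=20, f=10):
    """A monotonically increasing weight between 0 and 1; a sigmoid is one possible choice."""
    return 1.0 / (1.0 + np.exp(-(n - k) / f))

def hierarchical_target_encoding(df, col, target, prefix_lengths=(3, 4, 5), k=20, f=10):
    """Blend target statistics level by level, from the coarsest parent group
    down to the full value, falling back to the parent-group estimate (rather
    than the global prior) whenever a level is sparsely populated."""
    estimate = pd.Series(df[target].mean(), index=df.index)  # start from the global prior
    for p in prefix_lengths:                                  # coarse -> fine (e.g., ZIP3 -> ZIP4 -> ZIP5)
        key = df[col].str[:p]
        grp = df.groupby(key)[target]
        level_mean = key.map(grp.mean())                      # target statistics at this level
        level_count = key.map(grp.count())                    # sample size at this level
        w = lam(level_count, k, f)
        estimate = w * level_mean + (1.0 - w) * estimate      # blend with the parent estimate
    return estimate

# Toy usage with made-up columns: the sparse code 94087 borrows strength from
# its better-populated ZIP4/ZIP3 neighbors instead of the global prior.
train = pd.DataFrame({
    "zip": ["94089", "94089", "94089", "94086", "94086", "94087", "10001"],
    "fraud": [1, 1, 0, 1, 1, 0, 0],
})
train["zip_encoded"] = hierarchical_target_encoding(train, "zip", "fraud", k=2, f=1)
print(train)
```

Note how, at each level, the blend falls back to the estimate from the level above it rather than jumping straight to the overall prior.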
In the fast-moving world of Data Science, where superlatively complex and powerful new algorithms, devouring massive amounts of computing power, seem to be the dominant trend, the art and science of inventive and creative Feature Engineering seems to be less and less relevant. I often run into online material, or ML models built by passionate yet junior Data Scientists, where minimal effort is spent creating "better predictors" – the algorithm will figure it out, is the common thinking. Don't worry about creating sophisticated, domain-driven predictors for your model – give the raw data to the algorithm, and the ideal internal representation will be created via learning. Indeed, Deep Learning has shown that this is very true when the input data has a spatio-temporal structure – neural network models do indeed create very "deep" spatial features and abstractions. But unfortunately, not all data is made up of images, sounds, or videos. Nor do we always have a zillion data points available to train these types of models. Thus, in many cases, I still believe that a model's performance can be heavily dependent on its creator's savviness in crafting "good predictors" from raw data. But, I have been proven wrong before.
Thus, in conclusion, thanks to all of those who discovered my little paper with the long title from 2001, and do not forget to check out Section 4, where the hierarchical version of target encoding is explained – definitely worth it, in some use cases.