
Practical Guide to Entity Resolution – part 3

Featurization and blocking key generation

Photo by Honey Yanibel Minaya Cruz on Unsplash

This is part 3 of a mini-series on Entity Resolution. Check out part 1 and part 2 if you missed them.

What is featurization and blocking and why does it matter?

In the context of ER, featurization means transforming existing columns into derived features that can inform whether disparate records refer to the same thing. Blocking means selecting a targeted subset of features to use as self-join keys to efficiently create potential match candidate pairs. These two steps are grouped together in our discussion because blocking keys are often derived from features.

Good featurization enables efficient blocking as well as good match accuracy downstream, making it a critical piece of the ER process. It is also the least well-defined step, and the one where a lot of creativity can be injected into the process.

Good blocking can dramatically improve the efficiency of the ER process, and make it possible to scale up to many input sources and large datasets. As we discussed in part 1, the universe of potential candidate pairs grows as N² with the number of records. However, not all of those candidate pairs are worth evaluating. For example, learning quickbooks 2007 and superstart fun with reading writing are clearly not the same thing and should not be included in the candidate pairs. One way to address this is to use a more specific blocking key, such as the exact normalized name string of the product, to create candidate pairs. This way only products that have the exact same normalized name string will be included in the candidate pairs. However, this is clearly too restrictive and will miss potential matches like learning quickbooks 2007 and learning qb 2007. Good blocking key selection is crucial to effectively balance this trade-off between efficiency and a low false negative rate.
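To make the self-join idea concrete, a minimal sketch of blocking-based candidate pair generation might look like the following (the DataFrame df and the id and blocking_key column names are assumptions for illustration):

from pyspark.sql import functions as F

# Self-join on a blocking key so that only records sharing the key become
# candidate pairs; the id comparison keeps each unordered pair only once
candidate_pairs = (
  df.alias("a")
    .join(df.alias("b"), F.col("a.blocking_key") == F.col("b.blocking_key"))
    .where(F.col("a.id") < F.col("b.id"))
)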

How to implement featurization and blocking?

Featurization, perhaps even more so than the previous steps in ER, is dependent on the type of source data and the desired downstream comparison algorithm. We won’t try to provide an exhaustive set of possible techniques here, but instead focus on an illustrative implementation for our example use case of Amazon vs Google products. In the normalized dataset, we have 4 columns to work with in terms of featurization – name, description, manufacturer, price. 3 of the 4 columns are text based, and that is where the bulk of the identifying information lives, so we will focus most of our featurization efforts there. There are many ways to featurize text data, but at a high level they generally boil down to

  1. Tokenization – e.g. breaking sentences into words
  2. Token standardization – e.g. stemming or lemmatization
  3. Embedding – e.g. TF-IDF, word embedding, sentence embedding

For our particular example, we will use a basic TF-IDF model, where we tokenize the text, remove English stop words, and apply TF-IDF vectorization. We will also use the Universal Sentence Encoder model from TensorFlow Hub to transform name and description strings into 512-dimensional vectors.

By applying these two transformations to the text columns name, description, and manufacturer, we end up with a number of rich features that we can use to evaluate potential matches between candidate pairs, as well as to help select blocking keys. Specifically, the TF-IDF vector and the sentence encoding vector can be very useful in generating good blocking keys.

As we discussed above, a good blocking key is something that balances specificity and false negative rate. The full name of a product is specific, but also has a very high false negative rate. We can remedy this by choosing individual words within the name or description of a product. It is important to note, however, that some tokens are better than others for the purposes of blocking. For example, in learning quickbooks 2007, quickbooks is intuitively the best blocking key, followed by 2007, followed by learning. This is because quickbooks is the most meaningful keyword in terms of defining the product, 2007 is specific to the version of the product, and learning is a pretty generic descriptor. Conveniently, this is also precisely what TF-IDF tries to systematically measure, and the normalized TF-IDF token weights for learning quickbooks 2007 are quickbooks: 0.7, 2007: 0.51, learning: 0.5. As you can see, the TF-IDF weights of the tokens are a decent proxy metric for prioritizing and selecting good blocking keys.

Similarly, the sentence encoding vector weights also tell us which dimensions in the latent vector space are most salient to the meaning of the sentence, and they can be used to select blocking keys. However, in the case of sentence encoding, the dimension with the highest weight does not directly map to a word token, and is therefore less interpretable.
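As a purely hypothetical illustration (this is not part of the pipeline used here), the indices of the largest-magnitude embedding dimensions could serve as coarse blocking keys:

import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Hypothetical variant: treat the indices of the 3 largest-magnitude dimensions
# of a sentence embedding as coarse blocking keys
@udf(returnType=ArrayType(StringType()))
def top_embedding_dims(v):
  if v is None:
    return []
  weights = np.abs(v.toArray())
  top_idx = np.argsort(weights)[-3:][::-1]
  return ["dim_%d" % i for i in top_idx]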

Example PySpark code that implements the above is sketched below
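This is a condensed sketch rather than a definitive implementation. It assumes a normalized DataFrame df with text columns name, description, and manufacturer (nulls filled as empty strings); helper names such as tfidf_stages and top_tokens_udf are illustrative, and the sentence encodings from step 3 are added separately with the encode_sentence UDF unpacked further below.

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

def tfidf_stages(col):
  # tokenize -> remove English stop words -> term frequencies -> TF-IDF weights
  return [
    Tokenizer(inputCol=col, outputCol=f"{col}_tokens"),
    StopWordsRemover(inputCol=f"{col}_tokens", outputCol=f"{col}_filtered"),
    CountVectorizer(inputCol=f"{col}_filtered", outputCol=f"{col}_tf"),
    IDF(inputCol=f"{col}_tf", outputCol=f"{col}_tfidf"),
  ]

text_cols = ["name", "description", "manufacturer"]
model = Pipeline(stages=[s for c in text_cols for s in tfidf_stages(c)]).fit(df)
featurized = model.transform(df)

def top_tokens_udf(vocabulary, k=5):
  # return the k tokens with the highest TF-IDF weight in a sparse vector
  @F.udf(returnType=ArrayType(StringType()))
  def top_tokens(v):
    if v is None:
      return []
    top = sorted(zip(v.indices, v.values), key=lambda p: -p[1])[:k]
    return [vocabulary[int(i)] for i, _ in top]
  return top_tokens

# each fitted CountVectorizer holds the vocabulary that maps vector indices back to tokens
name_vocab = model.stages[2].vocabulary
desc_vocab = model.stages[6].vocabulary
mfr_vocab = model.stages[10].vocabulary

# one array of blocking keys per record, so a record can block on several keys
blocked = (
  featurized
  .withColumn("name_keys", top_tokens_udf(name_vocab)(F.col("name_tfidf")))
  .withColumn("description_keys", top_tokens_udf(desc_vocab)(F.col("description_tfidf")))
  .withColumn("manufacturer_keys", top_tokens_udf(mfr_vocab)(F.col("manufacturer_tfidf")))
  .withColumn("blocking_keys", F.array_distinct(F.concat(
    F.col("name_keys"), F.col("description_keys"), F.col("manufacturer_keys"))))
)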

The code above leverages the pyspark.ml library to

  1. Apply TF-IDF transformations to the name, description, and manufacturer columns
  2. Pick out the top 5 tokens based on TF-IDF token weight as blocking keys
  3. Use the Universal Sentence Encoder to transform name and description into a 512-dimensional vector representation. We left out manufacturer here because it is sparse and also does not have much semantic meaning as an English sentence. Note that we did not actually generate blocking keys from the encoding, but an approach similar to the one used for the TF-IDF tokens can easily be applied here as well
  4. Combine the top name tokens, top description tokens, and top manufacturer into an array of blocking keys for each record. It is important to note that each record can have more than one blocking key. This, again, is aimed at reducing the false negative rate on potential matches.

The user-defined function that applies the Universal Sentence Encoder is worth quickly unpacking

import tensorflow_hub as hub
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT

# magic function to load model only once per executor
MODEL = None
def get_model_magic():
  global MODEL
  if MODEL is None:
      MODEL = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
  return MODEL

# Spark UDF that encodes a single string into a 512-dimensional dense vector
@udf(returnType=VectorUDT())
def encode_sentence(x):
  model = get_model_magic()
  emb = model([x]).numpy()[0]
  return Vectors.dense(emb)

What exactly is this doing, and why is it necessary? TensorFlow Hub provides a pre-trained model that helps users transform text into a vector representation. The model itself is not distributed, which means that in order to leverage it in a distributed computation framework like Spark, we have to wrap it in a user-defined function and make it available on every executor node. However, we do not want the expensive load call to execute each time we transform a piece of text. We only want it to load once per executor, and then be reused for all the tasks that are distributed to that node. The get_model_magic method is essentially a trick to enable this.
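With the UDF in place, applying it is just a matter of adding columns to the featurized DataFrame (the column and variable names here are assumptions based on the earlier sketch):

from pyspark.sql import functions as F

# add 512-dimensional sentence encodings for the two text-heavy columns
encoded = (
  featurized
  .withColumn("name_encoding", encode_sentence(F.col("name")))
  .withColumn("description_encoding", encode_sentence(F.col("description")))
)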

Now that we have our features and blocking keys, we are ready to tackle candidate pair generation and match scoring.

