
This is part 5 of a mini-series on Entity Resolution. Check out part 1, part 2, part 3, and part 4 if you missed them.
In most real-world ER use cases, there is no ground truth for which candidate pairs should match and which should not. The only way to achieve good matching accuracy is to introduce human judgement into an iterative learning loop and incrementally improve the scoring algorithm. In this post, we will discuss how to set up this learning loop.
At a high level, the process looks like this:
- Sample from the output of the initial naive scoring function
- Manually inspect and label the sampled candidate pairs
- Evaluate scoring accuracy by comparing the manual labels against the score distribution
- Improve the decision criteria by introducing additional comparison features where warranted
- Feed the labels and features into a classification algorithm to learn the optimal scoring function
- Repeat until acceptable level of accuracy is achieved

Sampling from the initial scoring function and manually examining the output is fairly straightforward, but there are a few practical things worth noting:
- Definite matches and definite non-matches are not worth examining. Practically, this means pairs that score 0 (no match on any feature) or 1 (a perfect match across all features) should be filtered out.
- It is worth thinking through the best visual layout to make human review and labeling as easy as possible. There are tools like Snorkel to help with this type of workflow. In our example, we chose a relatively low-lift approach: stacking the corresponding features of each candidate pair into a combined string value so they are easier to compare side by side.
- There may be use cases where you want to use stratified sampling. For example, you may want to focus more on candidate pairs between records from different source systems and spend less time reviewing candidate pairs from the same source. You may also want to weight different ranges of the similarity score differently. We have not done this in the example code, but it is not difficult to implement using the PySpark method `sampleBy`, as sketched below.
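A minimal sketch of what that could look like, assuming a `candidate_pairs` DataFrame with an `overall_sim` column; the similarity bands and sampling fractions are made up for illustration.

```python
from pyspark.sql import functions as F

# Bucket the similarity score into coarse bands to stratify on.
candidates_banded = candidate_pairs.withColumn(
    "sim_band",
    F.when(F.col("overall_sim") < 0.3, "low")
     .when(F.col("overall_sim") < 0.7, "mid")
     .otherwise("high"),
)

# Sample heavily from the ambiguous middle band, lightly from the extremes.
review_sample = candidates_banded.sampleBy(
    "sim_band",
    fractions={"low": 0.01, "mid": 0.10, "high": 0.02},
    seed=42,
)
```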
Example outputs from the sampled candidate pairs
Looking at the sampled output, we can immediately see that the simplistic scoring function we used is far from perfect. One of the highest scoring candidate pairs, between `adobe creative suite cs3 web standard upsell` and `adobe creative suite cs3 web standard [mac]`, is clearly not a match. In fact, there are quite a few examples of this type of mismatch, where the upgrade version of a product looks very similar to the original product. One potential approach to address this is to introduce another feature that indicates whether a product is an expansion / upgrade / upsell rather than the full featured version.
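For illustration, such a flag could be derived with a simple keyword rule. This is a minimal sketch rather than part of the original example: the keyword list, the `candidate_pairs` DataFrame, and the `name_a` / `name_b` column names are all assumptions.

```python
import re
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Hypothetical helper: flag listings whose name looks like an upgrade /
# upsell / expansion SKU rather than the full product. Keyword list is
# illustrative only.
@F.udf(IntegerType())
def is_upgrade(name):
    if name is None:
        return 0
    return 1 if re.search(r"\b(upgrade|upsell|expansion)\b", name.lower()) else 0

# New comparison feature: do the two listings in a candidate pair agree on
# the upgrade flag? (Column names name_a / name_b are assumptions.)
pairs_with_flag = candidate_pairs.withColumn(
    "upgrade_flag_match",
    (is_upgrade(F.col("name_a")) == is_upgrade(F.col("name_b"))).cast("int"),
)
```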
However, before doing that, we can feed the existing features and the manually entered `human_label` into an ML classification algorithm that can learn how to optimally combine the individual similarity metrics into a scoring function. The example PySpark code that implements this is below.
A few practical things worth noting:
- Even with good blocking strategies, a large portion of the candidate pairs can be clear non-matches. It is useful to filter them out before applying the model. This reduces the imbalance between the match and non-match classes, which can negatively affect model efficacy. In our example, we did this by filtering out any pairs below a threshold of 0.06. This threshold was chosen somewhat arbitrarily by manually checking the sampled candidate pairs, and can be adjusted as needed.
- It is helpful to supplement the human labels with additional labels based on programmatic rules, where possible. The more labeled data the algorithm has to learn from, the better it will tend to do. In our example, we have chosen to label candidate pairs with an `overall_sim` of 1 or a `name_tfidf_sim` of 1 as programmatic matches. The intuition here is that if the name matches exactly, we can be pretty confident that the two listings are the same product. On the other hand, we have chosen to label candidate pairs with an `overall_sim` of less than 0.12 as programmatic non-matches. This again is chosen somewhat arbitrarily based on looking at the sampled output. These thresholds can and should be tuned as part of the iteration.
- We have chosen to use scikit-learn's implementation of the random forest classifier for our example use case, because of the more flexible APIs around hyper-parameter tuning and evaluation that it offers. But for larger scale use cases that will not fit into memory on a single node, it may be necessary to use the PySpark ML libraries, which implement a number of classification algorithms that integrate nicely with the rest of the PySpark ecosystem.
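To make the training step concrete, here is a minimal sketch, assuming the filtered candidate pairs and their labels have been collected into a pandas DataFrame named `labeled_df`. The `overall_sim`, `name_tfidf_sim`, and `human_label` columns and the 0.06 / 0.12 thresholds come from the discussion above; the remaining feature column names are hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# labeled_df: pandas DataFrame of candidate pairs with similarity features
# and a human_label column (1 = match, 0 = non-match, NaN = unreviewed).
feature_cols = ["name_tfidf_sim", "description_tfidf_sim", "price_sim", "overall_sim"]

# Drop clear non-matches up front to reduce class imbalance
# (the 0.06 threshold comes from manually inspecting the sampled pairs).
labeled_df = labeled_df[labeled_df["overall_sim"] >= 0.06].copy()

# Supplement human labels with programmatic labels at the extremes.
labeled_df.loc[
    (labeled_df["overall_sim"] == 1) | (labeled_df["name_tfidf_sim"] == 1),
    "human_label",
] = 1
labeled_df.loc[labeled_df["overall_sim"] < 0.12, "human_label"] = 0

# Train only on rows that have a human or programmatic label.
train_df = labeled_df.dropna(subset=["human_label"])
X, y = train_df[feature_cols], train_df["human_label"].astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
print("cross-validated F1:", cross_val_score(clf, X, y, cv=5, scoring="f1").mean())
clf.fit(X, y)
```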
Once the model is tuned and we have decent cross-validation metrics, we will want to apply the model to score the broader universe of all candidate pairs and examine the output match probabilities against human judgement. The example PySpark code is below.
It is worth noting the use of `pandas_udf`, which can achieve much better performance than standard UDFs by leveraging more efficient serialization via Apache Arrow and more efficient vectorized computation via NumPy. To learn more about `pandas_udf`, I recommend reading through the examples here.
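As an illustration of this scoring step, here is a sketch that broadcasts the fitted classifier and wraps it in a Spark 3 style `pandas_udf`. It assumes an active SparkSession `spark`, the `clf` model and `candidate_pairs` DataFrame from the sketches above, and the same hypothetical feature column names.

```python
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Ship the fitted scikit-learn model to the executors once, rather than
# serializing it with every task closure.
clf_broadcast = spark.sparkContext.broadcast(clf)

@F.pandas_udf(DoubleType())
def match_probability(name_sim: pd.Series, desc_sim: pd.Series,
                      price_sim: pd.Series, overall_sim: pd.Series) -> pd.Series:
    # Each call receives an Arrow batch of rows as pandas Series, so the
    # model scores many pairs in a single vectorized predict_proba call.
    features = pd.DataFrame({
        "name_tfidf_sim": name_sim,
        "description_tfidf_sim": desc_sim,
        "price_sim": price_sim,
        "overall_sim": overall_sim,
    })
    return pd.Series(clf_broadcast.value.predict_proba(features)[:, 1])

scored_pairs = candidate_pairs.withColumn(
    "match_probability",
    match_probability("name_tfidf_sim", "description_tfidf_sim",
                      "price_sim", "overall_sim"),
)
```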

From the new samples, we can see that the model is doing a much better job at picking out pairs that match, even where the initial naive scoring function suggests low similarity. With this, we have effectively closed the learning loop with an updated scoring function and an updated sample from the new scoring distribution. Based on the results of further comparison against human judgement, we may want to add additional features, target the labeling efforts, update the model / scoring function, and repeat the process a few more times, until the desired level of accuracy is achieved.
Check out the final part on entity generation.