Practical Guide to Entity Resolution — part 6

Generating entities

Yifei Huang
Towards Data Science


Photo by Michael Dziedzic on Unsplash

This is the final part of the mini-series on entity resolution. Check out part 1, part 2, part 3, part 4, and part 5 if you missed them.

The final output of ER is a data structure that has a unique identifier for each resolved entity, as well as mappings between that entity identifier and the identifiers of the corresponding data records in the disparate source systems. This is relatively straightforward to do by:

  1. eliminating the non-matching candidate pairs using the tuned model-based scoring function. This typically means picking a cut-off threshold for the match probability after the scoring iteration. We have chosen 0.5 for our example use case based on manual examination, but this value can vary depending on the datasets, features, model, and use case
  2. re-generating the graph using only the “strong” edges that exceed the match probability threshold
  3. creating the component (or entity) mapping via the connected components algorithm
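
The three steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the pipeline used in this series: the `scored_pairs` data, the record ids, and the `resolve_entities` helper are all hypothetical, and a small union-find stands in for whatever graph library's connected components routine you would use at scale.

```python
from collections import defaultdict

# Hypothetical scored candidate pairs: (record_id_a, record_id_b, match_probability).
scored_pairs = [
    ("src1:1", "src2:7", 0.92),
    ("src2:7", "src3:4", 0.71),
    ("src1:2", "src2:9", 0.18),
    ("src1:3", "src3:5", 0.64),
]

MATCH_THRESHOLD = 0.5  # the cut-off chosen in the text; tune it for your own data


def resolve_entities(pairs, threshold):
    """Map each record id to an entity id via connected components (union-find)."""
    parent = {}

    def find(x):
        # Walk up to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as the root so entity ids are deterministic.
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a

    for a, b, score in pairs:
        # Every record becomes a node, even if none of its edges survive.
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        # Steps 1 and 2: only "strong" edges above the threshold enter the graph.
        if score > threshold:
            union(a, b)

    # Step 3: the root of each record labels its connected component (entity).
    return {record: find(record) for record in parent}


entity_of = resolve_entities(scored_pairs, MATCH_THRESHOLD)

# Invert the mapping to list the source records behind each entity.
entities = defaultdict(list)
for record, entity_id in sorted(entity_of.items()):
    entities[entity_id].append(record)
```

Here `entity_of` plays the role of the mapping table, and the entity id is simply the smallest record id in each component. Note that the two records joined only by the 0.18 edge end up as separate single-record entities; whether to keep such singletons is a design choice that depends on the application.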

And that’s it! We now have a mapping table that connects the disparate data records into unified entities. From here, it is typically beneficial to create canonical entity tables by selecting or generating canonical metadata (e.g. name, description, etc.) based on prioritizing or combining individual metadata values from each of the connected records. The specific approach here will depend heavily on the desired application and workflow, so we won’t delve too deep into it.
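
As one possible illustration of the "prioritizing" option, the sketch below fills each canonical field from the most trusted source that has a value for it. Everything here is an assumption made for the example: the `record_to_entity` mapping, the sample `records`, the `SOURCE_PRIORITY` order, and the `build_canonical` helper are hypothetical, and a real application may prefer a different rule (most recent value, longest description, majority vote, and so on).

```python
# Hypothetical mapping table (record id -> entity id) and per-record metadata.
record_to_entity = {"src1:1": "e1", "src2:7": "e1", "src3:4": "e1", "src1:3": "e2"}
records = {
    "src1:1": {"source": "src1", "name": "Acme Corp", "description": None},
    "src2:7": {"source": "src2", "name": "ACME Corporation", "description": "Anvil maker"},
    "src3:4": {"source": "src3", "name": None, "description": "Makes anvils and more"},
    "src1:3": {"source": "src1", "name": "Globex", "description": None},
}

# Assumed trust order: earlier sources win when several have a value.
SOURCE_PRIORITY = ["src2", "src1", "src3"]
FIELDS = ["name", "description"]


def build_canonical(record_to_entity, records, priority, fields):
    """Build one canonical row per entity, field by field, by source priority."""
    rank = {source: i for i, source in enumerate(priority)}
    canonical = {}
    # Visit records from the most trusted source to the least trusted one.
    ordered = sorted(record_to_entity, key=lambda r: (rank[records[r]["source"]], r))
    for record_id in ordered:
        row = canonical.setdefault(record_to_entity[record_id], dict.fromkeys(fields))
        for field in fields:
            # Fill a field only if no higher-priority record already supplied it.
            if row[field] is None and records[record_id][field] is not None:
                row[field] = records[record_id][field]
    return canonical


canonical = build_canonical(record_to_entity, records, SOURCE_PRIORITY, FIELDS)
```

With this sample data, entity `e1` takes both its name and description from `src2`, while `e2` keeps an empty description because none of its records provide one.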

This concludes the mini-series on entity resolution. It is not an exhaustive study by any stretch, but hopefully provides a practical overview of the core concepts and implementation steps to help you get started on your own application. I welcome any comments and suggestions you might have.
