
What is Entity Resolution (ER)?
Entity resolution (ER) is the process of creating systematic linkage between disparate data records that represent the same thing in reality, in the absence of a join key. For example, say you have a dataset of products listed for sale on Amazon, and another dataset of products listed for sale on Google. They may have slightly different names, somewhat different descriptions, maybe similar prices, and totally different unique identifiers. A human can likely look at records from these two sources and determine whether they refer to the same product, but this does not scale across all records. The number of potential pairs to examine is A × G, where A is the number of records in the Amazon dataset and G is the number of records in the Google dataset. If there were 1,000 records in each dataset, there would be 1 million pairs to check, and that's just for two fairly small datasets. The complexity grows quickly with more and larger datasets. Entity resolution addresses this challenge and derives canonical entities in a systematic and scalable way.
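To make the combinatorics concrete, here is a minimal PySpark sketch of the brute-force comparison space; the DataFrames, column names, and generated rows are hypothetical stand-ins for the two product catalogs, not real data.

```python
# A minimal sketch of the brute-force comparison space described above.
# The DataFrames and columns here are hypothetical stand-ins.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("er-pair-explosion").getOrCreate()

amazon_df = spark.createDataFrame(
    [(f"a{i}", f"amazon product {i}") for i in range(1000)],
    ["amazon_id", "title"],
)
google_df = spark.createDataFrame(
    [(f"g{i}", f"google product {i}") for i in range(1000)],
    ["google_id", "title"],
)

# With no join key, every Amazon record has to be compared against every
# Google record: A x G candidate pairs.
all_pairs = amazon_df.crossJoin(google_df)
print(all_pairs.count())  # 1,000,000 pairs for two 1,000-record datasets
```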
Why should I care about ER?
Unlocking value from data often depends on data integration. A holistic data picture, created by integrating different pieces of data about the entity of interest, is much more effective at informing decision making than a partial or incomplete view. For example, a warranty analyst at a car manufacturer can diagnose a claim much more effectively if she has an integrated view of all the data concerning the part in question, from the supply chain, to its assembly into the vehicle, through the sensor readings generated during operation. Similarly, a venture investor can be much more effective at selecting the best investment if she has an integrated view of all the data on potential companies, from the quality of the team, to its traction in the market and the buzz surrounding it. There are also numerous workflows in law enforcement, counter-terrorism, and anti-money laundering, to name a few, that require an integrated data asset. However, the unfortunate reality with data (especially third-party data) is that it is often the exhaust of specific applications or processes, and it was never designed to be used in conjunction with other datasets to inform decision making. As such, datasets containing different facets of information about the same real-world entity are often siloed in different places, have different shapes, and do not have consistent join keys. This is where ER can be particularly useful: it systematically integrates the disparate data sources and provides a holistic view of the entity of interest.
How do I ER?
At a high level, there are 5 core steps to ER:
- Source normalization – cleaning data and harmonizing different sources into a common schema of features (columns) that will be used for evaluating potential matches
- Featurization and blocking key generation – creating features for blocking keys, which are targeted tokens that are likely shared between matching records, to constrain the search space from N² to something more computationally manageable (a rough PySpark sketch of this and the next two steps appears after this list)
- Generate candidate pairs – using blocking keys to create candidate pairs. This is essentially a self join on the blocking key, and is typically implemented using graph data structures, with individual records as nodes and potential matches as edges
- Match scoring and iteration – deciding which candidate pairs actually match via a match scoring function. This can be rules-based, but is typically better implemented as a learning algorithm that can adapt and optimize over non-linear decision boundaries
- Generate entities – eliminating non-matching edges in the graph, and generating resolved entities and the associated mapping to individual source records
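As a preview, below is a simplified PySpark sketch of steps 2 through 4. It assumes a single normalized DataFrame `records` (step 1 already done) with `record_id` and `name` columns; the token-based blocking key, the Jaccard similarity score, and the 0.4 threshold are illustrative assumptions, not the approach the later posts will necessarily take.

```python
# A simplified sketch, not a production implementation. Token-based blocking
# and the Jaccard score below are illustrative choices only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("er-blocking-demo").getOrCreate()

records = spark.createDataFrame(
    [
        ("a1", "apple iphone 12 64gb black"),
        ("g7", "apple iphone 12 (64 gb) black"),
        ("a2", "samsung galaxy s21 phantom gray"),
        ("g9", "galaxy s21 5g by samsung"),
    ],
    ["record_id", "name"],
)

# Step 2 - featurization and blocking keys: emit one row per name token, so
# any two records that share a token land in the same block.
tokens = records.withColumn(
    "token", F.explode(F.split(F.lower(F.col("name")), "\\s+"))
)
blocked = tokens.select("record_id", "token")

# Step 3 - candidate pairs: self-join on the blocking key. The inequality
# keeps each unordered pair once and drops self-matches. These pairs are the
# edges of the match graph (records are the nodes).
candidate_pairs = (
    blocked.alias("l")
    .join(blocked.alias("r"), F.col("l.token") == F.col("r.token"))
    .where(F.col("l.record_id") < F.col("r.record_id"))
    .select(
        F.col("l.record_id").alias("src"),
        F.col("r.record_id").alias("dst"),
    )
    .distinct()
)

# Step 4 - a toy rules-based match score: Jaccard similarity of name tokens.
# A learned model would typically replace this in practice.
token_sets = tokens.groupBy("record_id").agg(
    F.collect_set("token").alias("token_set")
)
scored = (
    candidate_pairs
    .join(token_sets.withColumnRenamed("record_id", "src")
                    .withColumnRenamed("token_set", "src_tokens"), on="src")
    .join(token_sets.withColumnRenamed("record_id", "dst")
                    .withColumnRenamed("token_set", "dst_tokens"), on="dst")
    .withColumn(
        "jaccard",
        F.size(F.array_intersect("src_tokens", "dst_tokens"))
        / F.size(F.array_union("src_tokens", "dst_tokens")),
    )
)
matches = scored.where(F.col("jaccard") > 0.4).select("src", "dst", "jaccard")
matches.show()
```

In this toy example, blocking already prunes the pair space: only records sharing at least one name token become candidate pairs, and the simple Jaccard rule then keeps the two intuitive matches (a1/g7 and a2/g9).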
Over the next few posts, I will dive into each of the steps above with a concrete example, in-depth explanations, and PySpark code examples showing how to implement them at scale.
Hope this was a useful discussion. Feel free to reach out if you have comments or questions. Twitter | Linkedin | Medium
Check out part 2 on source normalization