
This is part 2 of a mini-series on Entity Resolution. Check out part 1 if you missed it.
Part 2 of this series focuses on the source normalization step of entity resolution, using the Amazon-GoogleProducts dataset obtained here as a running example to illustrate the ideas and implementation. The rest of the series will also refer to this example. The dataset contains two distinct product catalogs whose records refer to the same physical products. The goal is to systematically figure out which records are duplicative and actually describe the same entity.
This is what the example data looks like:
Google products

Amazon products

And, concretely, we want to be able to systematically match records like apple shake 4.1 digital compositing software for mac os x effects software in the Google dataset to apple shake 4.1 visual effects (mac) in the Amazon dataset. But first, we must start with source normalization.
What is source normalization and why does it matter for ER?
In the context of ER, source normalization means:
- Creating a unified dataset that encompasses all of the disparate sources – this is important because it allows you to scale to an arbitrary number of input sources while maintaining consistent downstream processing and matching.
- Cleaning the data by standardizing data types (e.g. making sure nulls are actually nulls and not the string "null"), unifying format (e.g. lower-casing everything), and removing special characters (e.g. brackets and parentheses) – this eliminates extraneous differences and can dramatically improve matching accuracy.
Here is example PySpark code that implements normalization for the two product catalogs. Note that it is meant to be illustrative, not exhaustive.
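A minimal sketch of the schema-unification part might look like the following. The column names (id, name, title, description, manufacturer, price) reflect how the Amazon-GoogleProducts files are commonly distributed, while the file paths and variable names are illustrative assumptions:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("er-source-normalization").getOrCreate()

# Load the two catalogs (paths are placeholders for wherever the CSVs live)
google = spark.read.csv("GoogleProducts.csv", header=True, escape='"')
amazon = spark.read.csv("Amazon.csv", header=True, escape='"')

# Map each catalog onto a shared schema and tag every record with its source,
# so records remain uniquely identifiable even if raw IDs clash across sources.
google_unified = google.select(
    F.lit("google").alias("source"),
    F.col("id").alias("source_id"),
    F.col("name"),
    F.col("description"),
    F.col("manufacturer"),
    F.col("price"),
)

amazon_unified = amazon.select(
    F.lit("amazon").alias("source"),
    F.col("id").alias("source_id"),
    F.col("title").alias("name"),   # Amazon calls the product name "title"
    F.col("description"),
    F.col("manufacturer"),
    F.col("price"),
)

unified = google_unified.unionByName(amazon_unified)
```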
The schemas in this case are already pretty similar, so it is relatively straightforward to create the unified schema. Note that you will want to include a source and a source id column to uniquely identify records in case there are ID clashes across the input sources.
Cleaning and standardizing the data is a bit more involved, and the specific pre-processing approach will heavily depend on the type of data and on the comparison algorithm you intend to use. In this particular example, because the available metadata is primarily text and natural language, the comparison will likely need to lean on string similarity and NLP techniques. It is therefore important to normalize string columns like name and description, and to remove extraneous characters that carry little to no meaning.
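Building on the unified DataFrame from the sketch above, the text-cleaning step could look something like this. The specific regex and the list of columns are assumptions; the real pre-processing should be tuned to the data and the downstream matching approach:

```python
from pyspark.sql import functions as F

def clean_text(col):
    # Lower-case, strip special characters, collapse repeated whitespace, trim.
    c = F.lower(col)
    c = F.regexp_replace(c, r"[^a-z0-9\s]", " ")  # drop brackets, parentheses, punctuation
    c = F.regexp_replace(c, r"\s+", " ")          # collapse runs of whitespace
    return F.trim(c)

normalized = (
    unified
    # Treat literal "null" / empty strings as real nulls
    .replace(["null", "NULL", ""], None, subset=["name", "description", "manufacturer"])
    .withColumn("name", clean_text(F.col("name")))
    .withColumn("description", clean_text(F.col("description")))
    .withColumn("manufacturer", clean_text(F.col("manufacturer")))
)
```

Applied to a raw product name, this is the kind of transformation that produces the before/after comparison shown next.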
To illustrate the normalization process, here is an example comparison of a product name before and after the cleaning step:
compaq comp. secure path v3.0c-netware wrkgrp 50 lic ( 231327-b22 )
vs
compaq comp secure path v3 0c netware wrkgrp 50 lic 231327 b22
With our datasets normalized, we are ready to tackle featurization and blocking.