
Performing Deduplication with Record Linkage and Supervised Learning

Identifying duplicate records with a machine-learning approach


Photo by Valentino Funghi on Unsplash

Introduction

Most data is recorded manually by humans, and more often than not it is never reviewed or synchronized, so mistakes such as typos slip in. Think for a second: have you ever filled out the same form twice, but with a slight difference in your address? For example, you submitted a form like the image below:

Sample Name & Address Details Fill Up (Image by Author)

Notice that the details actually refer to the same person "Jane" with the same "Address". Many organizations deal with data like this: the records are clearly duplicates representing the same entity, but the values are not exactly equal. A function such as Pandas’ "drop_duplicates" will therefore not identify these records as duplicates, because the values are not an exact match. The solution to such messy data is to perform Deduplication with Record Linkage. Record Linkage determines whether records match and represent the same entity (Person / Company / Business) by comparing the records across different sources.

In this article, we will explore the usage of Record Linkage combined with Supervised Learning to classify records as duplicate or not duplicate. Below are the topics that we will be covering in this article:

Table of Contents: 
- What is Record Linkage? 
- Understand our Data Set
- Applying Record Linkage Process
(a) Pre-processing
(b) Indexing
(c) Comparison & Similarity
(d) Supervised Learning (Classification) 
- Conclusion

What is Record Linkage?

Record Linkage refers to the method of identifying and linking records that correspond to the same entity (Person, Business, Product, …) within one or across several data sources. It searches for possible duplicate records and links them together so they can be treated as a single record, which also makes it possible to avoid data redundancy.

When unique identifier variables are present in the data sets (identification numbers, hash codes, etc.), the process of linking the same entity is simple. However, there are cases where unique identifiers are not present in the data sets, and we then need to identify good candidate variables for detecting duplicates and pair records up on them (e.g. State, Last Name, Date of Birth, Phone No) – we will understand more about this when we perform the Indexing step.

We will be using the Python Record Linkage Toolkit library, which provides the tools and functions required for performing record linkage and deduplication. The installation and import of the Record Linkage Toolkit are shown below.
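
A minimal sketch of that setup (the package name on PyPI is recordlinkage):

# Install the toolkit once, e.g. with: pip install recordlinkage

import recordlinkage
from recordlinkage.datasets import load_febrl2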

Understand our Data Set

For this tutorial, we will be using the public data set available under the Python Record Linkage Toolkit that was generated by the Febrl project (Freely Extensible Biomedical Record Linkage). There are four sets of data available, but we will be using the 2nd data set – FEBRL 2. Let’s import the data set from the sub-module recordlinkage.datasets.

Loading data set – FEBRL 2 (Image by Author)
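
A sketch of the loading step:

df = load_febrl2()
print(len(df))  # 5000 records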

The data set is returned in the format of a "DataFrame", and we can see that it has a total of 5000 records. Based on the source of this data set from Febrl, there are 4000 original records and 1000 duplicates in this table.

Let’s get a better understanding of the data types present in our table using:

Data Type Info (Image by Author)
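
A sketch of that check using pandas’ standard info call:

df.info()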

From the result above, all columns in our data set have the same data type – "object" (also known as "String") – and none of them contain null values.

Moving on, we should also take a basic look at the statistical summary of our data set with the following function:

Statistical Summary (Image by Author)
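
A sketch of the summary call; for string columns, pandas reports count, unique, top, and freq:

df.describe()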

From the statistical summary results, we can quickly see that the unique counts for surname and given_name are below 5000, which indicates that the same person could have multiple records in the data set with different addresses/street numbers/states, etc.

An example of duplicate records from our data set will look like this:

Sample Duplicate Records from Data Set (Image by Author)

Notice that in this sample pair of known duplicates, the differences are in "surname", "address_2", and "suburb", each off by only a few characters. Our goal is to identify and flag records such as this sample as duplicates.

Now that we have a basic understanding of our data set, let’s understand and apply the Record Linkage process to deduplicate the data set and classify the records correctly.

Applying Record Linkage Process

Record Linkage Process (Image by Author)

(a) Pre-processing

This step is important, as standardizing the data into the same format increases the chances of identifying duplicates. Depending on the values in the data, pre-processing steps can include:

  • Lowercase / Uppercase

This is the easiest and most important step of text pre-processing: standardize your text data set to all lowercase or all uppercase. In the example below, we convert the text in our data set to uppercase.

Pre-processing: Uppercase (Image by Author)
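
A minimal sketch of the conversion, assuming the free-text columns of FEBRL 2 listed here:

# Uppercase every free-text column (column names from the FEBRL 2 data set)
for col in ["given_name", "surname", "address_1", "address_2", "suburb", "state"]:
    df[col] = df[col].str.upper()
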
  • Stopwords removal

Stop words are common words that are removed so that more weight falls on the informative parts of the text. In a complete sentence, stop words are "the", "a", "and", etc. For company names, stop words could be "Co", "Corp", "Inc", "Company", "Limited", etc. For people’s names, stop words could be "Mr", "Mrs", "Ms", "Sir", etc. For addresses, stop words could be "Street", "St", "Place", "Rd", "Road", etc.

For our data set, there are no stop words to remove from the names, but there are stop words that we can remove from the address field "address_1". In the example below, we remove common stop words noticed in the data set.

Pre-processing: Stopwords Removal (Image by Author)
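
A sketch of the removal; the exact stop-word list is an assumption based on the address terms mentioned above:

# Strip common address stop words (hypothetical list), then tidy the whitespace
stopword_pattern = r"\b(STREET|ST|ROAD|RD|PLACE|PL|AVENUE|AVE|CRESCENT|CRES)\b"
df["address_1"] = (
    df["address_1"]
    .str.replace(stopword_pattern, "", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)
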
  • Postcode Clean Up

Postcode clean-up is done by removing symbols that may have been included, such as "-", "+", or blank spaces. (Commonly, this clean-up is done on phone numbers, but since we do not have a phone number in our data set, we shall apply similar logic to the postcode.)

Sample Clean Up on Postcode (Image by Author)

The example below shows the clean-up done for "Postcode" where only numeric values are kept.

Pre-processing: Postcode Clean Up (Image by Author)
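
A one-line sketch that keeps only numeric characters:

# Drop every non-digit character from the postcode
df["postcode"] = df["postcode"].str.replace(r"[^0-9]", "", regex=True)
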
  • Removal of Irrelevant Symbols

Special symbols do not help in identifying similarities in text and should be cleaned up. The example below shows the clean-up done to remove irrelevant symbols in the address fields.

Pre-processing: Removal of Irrelevant Symbols (Image by Author)
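
A sketch of the symbol removal, assuming the text has already been uppercased:

# Keep only letters, digits, and spaces in the address fields
for col in ["address_1", "address_2"]:
    df[col] = df[col].str.replace(r"[^A-Z0-9 ]", "", regex=True)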

(b) Indexing

Now that our data set has been pre-processed and can be considered clean, we need to create pairs of records (also known as candidate links). Once the pairs are created, similarities are calculated to determine whether a pair of records should be considered a match/duplicate. The Python Record Linkage Toolkit provides an indexing module that simplifies the creation of these record pairs.

There are several indexing techniques available for record linkage – such as:

  • Full Index

A Full Index is created from all possible combinations of record pairs in the data set. Using a Full Index carries a data-volume risk, as the number of pairs grows quadratically with the number of records. For example, based on our data set of 5000 records, a total of 12497500 pairs are created using the Full Index.

Index Approach: Full (Image by Author)
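
A sketch using the toolkit’s Index API:

# Build every possible pair: n * (n - 1) / 2 = 12497500 for 5000 records
indexer = recordlinkage.Index()
indexer.full()
full_pairs = indexer.index(df)
print(len(full_pairs))
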
  • Blocking

Indexing by blocking is a good alternative to the Full Index, as record pairs are only produced within the same block (records sharing a common value). By blocking on a particular column, the number of record pairs can be greatly reduced. For example, by blocking on the column "state", only records from the same state are paired with each other, and a total of 2768103 pairs are created – far fewer than with the Full Index.

Index Approach: Block (Image by Author)
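
A sketch of blocking on the "state" column:

# Only records with an identical "state" value are paired
indexer = recordlinkage.Index()
indexer.block("state")
block_pairs = indexer.index(df)
print(len(block_pairs))  # 2768103 on this data set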

However, do take note that having fewer record pairs is not always better: actual matches can be missed if two records are duplicates but one contains a typo in the "state" value.

  • Sorted Neighbourhood

Indexing by Sorted Neighbourhood is another alternative that pairs records with nearby values. For example, the following records are paired up because of the similarity in the column "surname" – Laundon and Lanyon.

Index Approach: Sorted Neighbourhood sample pair (Image by Author)

A total of 75034 pairs are created using the Sorted Neighbourhood index – fewer than with both the Full Index and the Block Index. (The count also depends on the values in the selected column.)

Index Approach: Sorted Neighbourhood (Image by Author)
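
A sketch of the Sorted Neighbourhood index; the window size of 3 is the toolkit’s default and an assumption here, and the pair count varies with it:

# Sort records by "surname" and pair records that fall within the sliding window
indexer = recordlinkage.Index()
indexer.sortedneighbourhood("surname", window=3)
snn_pairs = indexer.index(df)
print(len(snn_pairs))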

In this tutorial, we will index our data set with a combination of two approaches which are index by "Blocking" and index by "Sorted Neighbourhood".

Why did I choose to use more than one index approach?

  • Using "Full Index" will provide us all possible matches of record pairs but will result in a huge growth in the total number of records.
  • Therefore, using Index by "Blocking" or "Sorted Neighbourhood" is able to resolve the issue of having a huge growth in the total number of records.
  • However, just by using the Index by "Blocking" or "Sorted Neighbourhood" approach, there are chances of missing out on actual matches. So why not reduce the possibility of missing out on actual match records by combining both approaches and still have a lesser volume of records compared to Full Index!

The command below appends the record pairs created by "Blocking" and "Sorted Neighbourhood".

Appending Both Index Approaches (Image by Author)
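
A sketch of the combination; a pandas MultiIndex union also drops pairs produced by both indexers:

# Merge the two candidate sets, removing duplicated pairs
pairs = block_pairs.union(snn_pairs)
print(len(pairs))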

(c) Comparison & Similarity

Now that we have our record pairs generated, we want to compare each pair and build a comparison vector of similarity scores. The image below shows similarity scores calculated on the column "given_name" for the indexed pairs. For example, the pairing of "rec-712-dup-0" and "rec-2778-org" has a low similarity score of 0.46667 on the given_name.

Sample Comparison Vector (Image by Author)

Comparisons can be made with many different methods to compute similarity values for strings, numeric values, or dates. In our scenario, where we are calculating similarity scores for string values, we can use algorithms such as:

  • Jaro-Winkler
  • Levenshtein
  • Longest Common Substring (LCS)
  • Jaccard

Let’s proceed to compute the similarity score for the different columns we have in our dataset.

Comparison Vector output (Image by Author)
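
A sketch of this step with the toolkit’s Compare API; the method chosen for each column follows the explanation below, and comparing date_of_birth exactly is an assumption:

# Build a Compare object with one similarity feature per column
compare = recordlinkage.Compare()
compare.string("given_name", "given_name", method="jarowinkler", label="given_name")
compare.string("surname", "surname", method="jarowinkler", label="surname")
compare.string("address_1", "address_1", method="jarowinkler", label="address_1")
compare.string("address_2", "address_2", method="jarowinkler", label="address_2")
compare.string("suburb", "suburb", method="jarowinkler", label="suburb")
compare.string("state", "state", method="jarowinkler", label="state")
compare.string("street_number", "street_number", method="levenshtein", label="street_number")
compare.string("postcode", "postcode", method="levenshtein", label="postcode")
compare.exact("date_of_birth", "date_of_birth", label="date_of_birth")

# Compute the comparison vectors for every candidate pair
features = compare.compute(pairs, df)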

Notice that the similarity function used in this example is either "Jaro-Winkler" or "Levenshtein". The Jaro-Winkler similarity score gives more importance to the beginning of the string, so this algorithm is used for features such as name, address, state, etc. The Levenshtein similarity score is based on the number of single-character edits needed to turn one string into the other, so this algorithm is used for features such as street number, postcode, etc. (There are also many other similarity functions that can be explored, such as "cosine", "damerau_levenshtein", etc.) Now that we have our similarity features created, we can proceed to the next step, which is building a supervised learning model.

(d) Supervised Learning (Classification)

In this section, we will train a model to classify duplicates and non-duplicates based on the data set provided. But before we can train the model, we need a "label" column (target variable) in our data set so that the model knows which pairs are duplicates and which are not.

When loading the data with "return_links=True", the known duplicate record pairs are returned as well.

df, links = load_febrl2(return_links=True)
Return "True Duplicate" Record Pairs (Image by Author)
Return "True Duplicate" Record Pairs (Image by Author)

We can also compute a comparison vector for the true duplicate pairs to get an overall view of how high their similarity scores are, and convert this set of pairings into a data frame format for the next step.

duplicate_pairs_vectors = compare.compute(links, df)
Comparison Vector output for "True Duplicate" Record Pairs (Image by Author)

From the vector output, we can observe that duplicate pairs tend to have a high similarity score for most of the features.

The following steps are some ETL processing to create the column "Label" on our data set: if a pairing is found among the known duplicate pairs, it is labeled "1", otherwise "0" (Duplicate = 1, Not Duplicate = 0).

Data Set with column "Label" added
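
A sketch of that labeling, assuming "features" is the comparison-vector DataFrame computed earlier and that the pair ordering in its index matches the ordering in "links":

# Mark a candidate pair as 1 if it appears among the known duplicate pairs
features["Label"] = features.index.isin(links).astype(int)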

After labeling the data set, notice that there are 1901 pairs of duplicates and 2824073 pairs of non-duplicates, which also indicates that most of the indexed pairings do not represent the same entity.

With a set of labeled data, we can begin training a supervised learning model to classify the record pairs as "duplicate" or "not duplicate". In this example, I will train an XGBoost model to perform the classification. Below are the commands for importing the model libraries and splitting the data set into train and test sets.

Test set grouping (Image by Author)
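
A sketch of the split; the 40% test size and the stratification are assumptions chosen to reproduce roughly the 760 duplicate pairs mentioned below:

from sklearn.model_selection import train_test_split

X = features.drop(columns=["Label"])
y = features["Label"]

# Stratify so the rare duplicate class is present in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)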

Looking at the test set distribution, we have 760 pairs of duplicates for the model to predict. Next, we can train the XGBoost model and apply it to the test set to classify record pairs as "duplicate" or "not duplicate".
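
A minimal sketch of the training and prediction, with default hyperparameters as an assumption:

from xgboost import XGBClassifier

# Train on the comparison vectors, then predict duplicate (1) / not duplicate (0)
model = XGBClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)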

Let’s view the output for the record pairs that the model classifies as "duplicates" (prediction = 1).

Model classify records as "Duplicate" (Image by Author)
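
One way to pull those pairs out of the test set, as a sketch:

# Keep only the test pairs the model flags as duplicates
predicted_duplicates = X_test[predictions == 1]
print(predicted_duplicates.head())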

Next, let’s cherry-pick the first two pairs and look at the actual records to see where they differ.

Sample Records – Classify as "Duplicate" (Image by Author)

From the sample records, notice that for the first pairing the differences are in both address fields, whereas for the second pairing the differences are in the street number and address fields. It looks like the model is able to classify the records as duplicates even when some of their values differ.

Congrats! We have completed building a model to identify duplicates in our data set.

Conclusion

In this article, we learned how to combine record linkage with supervised learning to perform deduplication: records are indexed into pairs, the pairs are compared to calculate similarity scores, and a model is trained on those scores. Do take note, however, that this exercise is meant to illustrate the deduplication process, and the values in this data set are simple. Real-world data are often messier and more complicated than this example.


Thanks for reading my article!


