
Addressing The John Smith Problem

Using Fuzzy Logic to Identify Non-Matching Duplicates

Special thanks to Kailash Awati.

Image recreated by Author, inspired by LiberalDictionary

1. Introduction

Many databases contain duplicate data, especially where manual data entry is involved. To clean the data and resolve unnecessary duplicates, the messy records must first be identified and rectified. However, many duplicates are non-matching: that is, duplicate records that differ by, for example, spelling errors. Identifying these duplicates is challenging in the SQL database language, because SQL relies on exact matching (a consequence of Relational Database theory). It is therefore necessary to look to other methods of identifying non-matching duplicates, and this is where Fuzzy Matching can be used.

Fuzzy matching works off the notion of edit distance, which is essentially the minimum number of operations required to transform one string into another. One of the most commonly used edit distance metrics is Levenshtein distance, which is "the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other". Other metrics include the Hamming distance, the Jaccard distance, the Jaro-Winkler distance, the Longest Common Substring distance, and the Cosine distance. For the purposes of this article, the Optimal String Alignment (OSA) distance is used, which is a variant of the Damerau-Levenshtein distance.

To illustrate OSA, the following example is given:

bear -> bear    = 0 steps (0 characters changed)
bear -> beat    = 1 step  (1 character changed)
bear -> meat    = 2 steps (2 characters changed)
bear -> tearing = 4 steps (1 character changed, 3 characters added)
bear -> ear     = 1 step  (1 character removed)

As seen, the number of changes needed to transform one string into another is the number of steps taken. This logic will be used to match records in a given database.
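These step counts can be reproduced directly with the stringdist package (loaded in the Set Up section below); the method="osa" argument selects the Optimal String Alignment distance used throughout this article.

# Reproduce the step counts above
library(stringdist)
stringdist("bear", c("bear", "beat", "meat", "tearing", "ear"), method="osa")
>  [1] 0 1 2 4 1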

2. Set Up

2.1. Load Packages

The first step is to load the necessary packages.

# Load required packages
library(tidyverse)  #Various tools
library(rprojroot)  #Find project root (provides find_rstudio_root_file())
library(readxl)     #Read in Excel files
library(readr)      #Read in data
library(magrittr)   #Great piping tools
library(kableExtra) #Pretty tables
library(stringdist) #For Edit Distance algorithms
library(DBI)        #Testing SQL
library(RSQLite)    #Testing SQL
library(tictoc)     #Check code run-time

2.2. Load Data

The next step is to load the data. The JohnSmiths data is ‘dummy data’, used to illustrate the example of fuzzy matching. The original data can be found on Eight2Late.

# Load files from Excel
dat_JohnSmiths <- find_rstudio_root_file() %>% 
   paste0("/Data/", "many_john_smiths.xlsx") %>% 
   read_excel() %>% 
   data.frame()

2.3. Check Data

Having loaded the spreadsheet, the details of the JohnSmiths dataframe appear as follows:

>  Name: dat_JohnSmiths
>  Type: data.frame
>  Dims: 10 x 8
>  Size: 4.9 Kb
>  Field Counts:
>    field           distinct count distinct%
>  1 CustomerID      10       10    1.0      
>  2 FirstName        4       10    0.4      
>  3 LastName         2       10    0.2      
>  4 Title            4       10    0.4      
>  5 AddressLine1    10       10    1.0      
>  6 AddressSuburb    5       10    0.5      
>  7 AddressPostcode  6       10    0.6      
>  8 Phone           10       10    1.0

3. Using SQL

When using SQL to check the data, the following script was used. Note that it joins on the FirstName, LastName, AddressPostcode, and CustomerID fields. The query returns all of the data, because every row matches itself exactly, meaning it is unable to identify any of the non-matching duplicates. Exact-matching SQL therefore cannot be used for this purpose.

/* Check for duplicates between the two tables. */
SELECT *
FROM JohnSmiths t1
WHERE EXISTS (
    SELECT 'x'
    FROM JohnSmiths t2
    WHERE t1.FirstName=t2.FirstName
        AND t1.LastName=t2.LastName
        AND t1.AddressPostcode=t2.AddressPostcode
        AND t1.CustomerID=t2.CustomerID
    )
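For completeness, here is a minimal sketch of how this query could be tested from R, using the DBI and RSQLite packages loaded earlier (the in-memory database is an assumption for illustration):

# Run the SQL query against an in-memory SQLite copy of the data
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "JohnSmiths", dat_JohnSmiths)
dbGetQuery(con, "
    SELECT *
    FROM JohnSmiths t1
    WHERE EXISTS (
        SELECT 'x'
        FROM JohnSmiths t2
        WHERE t1.FirstName=t2.FirstName
            AND t1.LastName=t2.LastName
            AND t1.AddressPostcode=t2.AddressPostcode
            AND t1.CustomerID=t2.CustomerID
        )
")  # Returns all 10 rows: each record only matches itself
dbDisconnect(con)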

4. Using R

4.1. Overview

To find duplicates via Fuzzy Matching, the following steps are followed (a compact sketch of the core steps is given after this list):

  1. Concatenate the key fields into a single text string for each row.
  2. Calculate the string lengths of each record.
  3. Create the String Distance Matrix (using the [stringdistmatrix()](https://www.rdocumentation.org/packages/stringdist/versions/0.9.5.5/topics/stringdist) function from the [stringdist](https://www.rdocumentation.org/packages/stringdist/versions/0.9.5.5) package).
  4. Normalise the String Distance Matrix to a value between 0 and 1:
     4.1. Create a pair-wise vector of the maximum length of each pair of strings.
     4.2. Divide each distance by the length of the longer string.
  5. Create a similarity matrix.
  6. Identify each pair of records which are above a similarity threshold.
  7. Display the resulting records.
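As a compact sketch, steps 2 to 5 can be condensed into a single helper (the function name similarity_matrix is hypothetical; the walkthrough below performs each step explicitly):

# Compact sketch of steps 2-5 (helper name is hypothetical)
similarity_matrix <- function(strings, method="osa") {
    distances <- stringdistmatrix(strings, method=method)    # Step 3: distance matrix
    max_lengths <- combn(str_length(strings), 2, FUN=max)    # Steps 2 & 4.1: pairwise max lengths
    round(1 - distances/max_lengths, 2) %>% as.matrix()      # Steps 4.2 & 5: similarity matrix
}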

4.2. Walkthrough

The first step is to concatenate the strings together. This is done using the paste0() function. The resulting vector appears as follows (each element is placed on a new line for convenience).

# Create concatenated vector
vec_Strings <- dat_JohnSmiths %>% 
    mutate(strings=paste0(FirstName
                         ,LastName
                         ,AddressLine1
                         ,AddressPostcode
                         ,AddressSuburb
                         ,Phone
                         )) %>% 
    select(strings) %>% 
    pull()
# Review
vec_Strings %>% cat(sep="\n")
>  JohnSmith12 Acadia Rd9671Burnton1234 5678
>  JhonSmith12 Arcadia Road967Bernton1233 5678
>  JSmith12 Acadia Ave867`1Burnton1233 567
>  JohnSmith13 Kynaston Rd9671Burnton34561234
>  JohnSmithNA9671Burnton34561233
>  JohnS12 Kinaston Road9677Bernton34561223
>  JonSmith13 Kinaston Rd9761Barnston36451223
>  JohnSmith12 Aracadia St9761Brenton12345666
>  JohnSmith13 Acacia Ave8961Burnside67231231
>  JohnSmith12 Kingsford Rd9671Burnton89624328

Next, the string lengths must be calculated for each record, using the [str_length()](https://www.rdocumentation.org/packages/stringr/versions/1.4.0/topics/str_length) function from the [stringr](https://www.rdocumentation.org/packages/stringr/versions/1.4.0) package.

# Create vector of lengths
vec_Lengths <- str_length(vec_Strings)
# Review the vector
vec_Lengths %>% print()
>   [1] 41 43 39 42 30 40 42 42 42 43

Next, the String Distance Matrix is calculated, using the [stringdistmatrix()](https://www.rdocumentation.org/packages/stringdist/versions/0.9.5.5/topics/stringdist) function from the stringdist package. Note the use of the Optimal String Alignment algorithm, via the argument method="osa". This function saves the values in a 1-dimensional array (a dist object), but when printed it displays the lower triangle of a distance matrix.

# Create distance matrix
vec_Distances <- stringdistmatrix(vec_Strings, method="osa")
# Review
vec_Distances %>% print()
>      1  2  3  4  5  6  7  8  9
>  2   7                        
>  3  10 13                     
>  4  15 21 24                  
>  5  18 25 25 15               
>  6  22 21 28 12 17            
>  7  20 23 26  9 21 14         
>  8  10 13 17 20 22 25 22      
>  9  19 22 19 21 23 29 23 22   
>  10 17 22 25 13 22 19 16 22 24
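As a quick structural check, the object can be inspected directly: with 10 records, the dist object holds 10 × 9 / 2 = 45 pairwise values.

# Inspect the 'dist' object (1-D storage of the lower triangle)
class(vec_Distances)
>  [1] "dist"
length(vec_Distances)
>  [1] 45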

Next, it is necessary to calculate the maximum length of each pair of records. This can be achieved with the [combn()](https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/combn) function from the [utils](https://www.rdocumentation.org/packages/utils/versions/3.6.2) package. This function takes the following arguments:

  1. The vector of lengths to be iterated over.
  2. A numeric value for the number of elements to choose. The value 2 indicates it will match pairs of records.
  3. The function which will be applied to each pair. In this instance, the max() function is chosen.
  4. A logical value for whether to simplify the result to a single vector (TRUE) or to return a list (FALSE).

The resulting vector appears as follows.

# Create Pairwise Vector
vec_PairwiseMax <- combn(vec_Lengths, 2, FUN=max, simplify=TRUE)
# Review
vec_PairwiseMax
>   [1] 43 41 42 41 41 42 42 42 43 43 43 43 43 43 43 43 43 42 39 40 42 42 42
>  [24] 43 42 42 42 42 42 43 40 42 42 42 43 42 42 42 43 42 42 43 42 43 43

Next, the normalised distance matrix is calculated.

# Calculate normalised values
vec_NormalisedDistances <- vec_Distances/vec_PairwiseMax
# Review
vec_NormalisedDistances
>          1      2      3      4      5      6      7      8      9
>  2  0.1628                                                        
>  3  0.2439 0.3023                                                 
>  4  0.3571 0.4884 0.5714                                          
>  5  0.4390 0.5814 0.6410 0.3571                                   
>  6  0.5366 0.4884 0.7000 0.2857 0.4250                            
>  7  0.4762 0.5349 0.6190 0.2143 0.5000 0.3333                     
>  8  0.2381 0.3023 0.4048 0.4762 0.5238 0.5952 0.5238              
>  9  0.4524 0.5116 0.4524 0.5000 0.5476 0.6905 0.5476 0.5238       
>  10 0.3953 0.5116 0.5814 0.3023 0.5116 0.4419 0.3721 0.5116 0.5581

Next, the Normalised Distances are coerced into a matrix. Once this is created, it is important to ensure that duplicates are not counted twice. To do this, the upper triangle and the diagonal are all set to zero.

# Create Similarity Matrix
mat_SimilarityScore <- round(1-vec_NormalisedDistances, 2) %>% as.matrix()
# Make the upper triangle all zeros. This is to avoid double-counting duplicates.
mat_SimilarityScore[upper.tri(mat_SimilarityScore)] <- 0
# Make the diagonal all zeros. This is to ensure that the same string does not get matched to itself.
diag(mat_SimilarityScore) <- 0
# Review
mat_SimilarityScore
>        1    2    3    4    5    6    7    8    9 10
>  1  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00  0
>  2  0.84 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00  0
>  3  0.76 0.70 0.00 0.00 0.00 0.00 0.00 0.00 0.00  0
>  4  0.64 0.51 0.43 0.00 0.00 0.00 0.00 0.00 0.00  0
>  5  0.56 0.42 0.36 0.64 0.00 0.00 0.00 0.00 0.00  0
>  6  0.46 0.51 0.30 0.71 0.57 0.00 0.00 0.00 0.00  0
>  7  0.52 0.47 0.38 0.79 0.50 0.67 0.00 0.00 0.00  0
>  8  0.76 0.70 0.60 0.52 0.48 0.40 0.48 0.00 0.00  0
>  9  0.55 0.49 0.55 0.50 0.45 0.31 0.45 0.48 0.00  0
>  10 0.60 0.49 0.42 0.70 0.49 0.56 0.63 0.49 0.44  0

For point of reference, the details of this similarity matrix are printed as follows.

>  Name: mat_SimilarityScore
>  Type: matrix
>  Dims: 10 x 10
>  Size: 2.6 Kb

Next, the values above a given threshold need to be identified. To achieve this, a custom function is written.

# Add function for identifying scores above a given threshold.
tlargest <- function(matrix, threshold, print=FALSE) {
    #' @title Return Matrix Values Above A Threshold
    #' @description Subset a matrix to return values that are above a specified threshold.
    #' @note Will also print the count of how many records were matched. Necessary due to the ambiguity of using a threshold value.
    #' @param matrix matrix. A matrix, which is the matrix to be sub-set.
    #' @param threshold numeric. A single numeric value, corresponding to the threshold, above which values are to be returned.
    #' @param print logical. Whether or not to print a summary of how many records were matched.
    #' @return A matrix containing columns:
    #' - 'row' : A numeric value corresponding to the row index of the returned value.
    #' - 'col' : A numeric value corresponding to the column index of the returned value.
    #' - 'val' : A numeric value which is the similarity score between the 'row' and 'col' values.

    # Validations:
    if (!is.matrix(matrix)) {stop("'matrix' must be a matrix")}
    if (!is.numeric(matrix)) {stop("'matrix' must be numeric")}
    if (!length(threshold)==1) {stop("'threshold' must be atomic")}
    if (!is.numeric(threshold)) {stop("'threshold' must be numeric")}
    if (!between(threshold, 0, 1)) {stop("'threshold' must be between 0 and 1")}

    # Determine the indices to be returned
    sort <- sort(matrix, decreasing=TRUE)
    order <- order(matrix, decreasing=TRUE)
    order <- order[sort>=threshold]

    # Generate the matrix of indices to be returned
    position <- arrayInd(order, dim(matrix), useNames=TRUE)
    position <- set_colnames(position, c("rec1", "rec2"))
    position <- cbind(position, val=matrix[order])

    # Print subset number
    if (print==TRUE) {
        cat("Selected threshold : ", threshold, "n"
           ,"Matching records   : ", nrow(position), "n"
           ,"Total records      : ", nrow(matrix), "n"
           ,"Matching percent   : ", nrow(position)/nrow(matrix)
           ,sep=""
           )
    }

    # Return
    return(position)

}

When applied, this function generates a matrix as displayed below.

# Generate matching records
mat_MatchingRecords <- tlargest(matrix = mat_SimilarityScore
                               ,threshold = 0.7
                               ,print = TRUE
                               )
>  Selected threshold : 0.7
>  Matching records   : 8
>  Total records      : 10
>  Matching percent   : 0.8

# Review
mat_MatchingRecords
>       rec1 rec2  val
>  [1,]    2    1 0.84
>  [2,]    7    4 0.79
>  [3,]    3    1 0.76
>  [4,]    8    1 0.76
>  [5,]    6    4 0.71
>  [6,]    3    2 0.70
>  [7,]    8    2 0.70
>  [8,]   10    4 0.70

Once the matching records have been identified, they need to be extracted from the original table. For this, a second custom function is written. It takes the original dataframe along with the matrix of matched indices, then loops through the matrix to extract the relevant records from the dataframe.

find_MatchingRecords <- function(dataframe, sim_matrix, print=FALSE) {
    #' @title Find Matching Records
    #' @description Use 'sim_matrix' to find matching records from 'dataframe'.
    #' @note The `sim_matrix` contains the similarity matrix, generated from `stringdistmatrix()` and the `nlargest()` (or `tlargest()`) functions.
    #' @param dataframe dataframe. The dataframe from which the matching records will be extracted.
    #' @param sim_matrix matrix. The matrix containing the attributes for finding the matching records.
    #' @param print logical. Whether or not to print the output as a text string.
    #' @return A data.frame containing the score and the records that have been matched. Also, this is printed to the console.

    # Validations
    if (!is.data.frame(dataframe)) {stop("'dataframe' must be a dataframe")}
    if (!is.matrix(sim_matrix)) {stop("'sim_matrix' must be a matrix.")}
    if (!is.double(sim_matrix[[2]])) {stop("'sim_matrix' must be a numeric matrix.")}
    if (c("rec1","rec2") %>% is_in(colnames(sim_matrix)) %>% all() %>% not()) {stop("'sim_matrix' must contain two columns named: 'rec1' and 'rec2'.")}

    # Set output
    str_return <- NULL
    dat_return <- data.frame("Score"=NA_real_
                            ,"Index1"=NA_real_
                            ,"Record1"=NA_character_
                            ,"Index2"=NA_real_
                            ,"Record2"=NA_character_
                            ,stringsAsFactors=FALSE
                            )

    # Determine number of iterations
    iterations <- sim_matrix %>% nrow()

    # Loop through number of iterations
    for (i in 1:iterations) {

        # Extract score
        sim_score <- sim_matrix %>% 
            extract(i, 3)

        # Extract the first matching index
        rec1_index <- sim_matrix %>%
            extract(i,1)

        # Extract first matching record
        rec1_record <- sim_matrix %>% 
            extract(i,1) %>% 
            extract(dataframe, ., ) %>% 
            as.character() %>% 
            paste(collapse=" ")

        # Extract the second matching index
        rec2_index <- sim_matrix %>% 
            extract(i,2)

        # Extract second matching record
        rec2_record <- sim_matrix %>% 
            extract(i,2) %>% 
            extract(dataframe, ., ) %>% 
            as.character() %>% 
            paste(collapse=" ")

        # Build return
        str_return %<>% paste0("\n") %>% 
            paste0("Score: ", sim_score, "\n") %>% 
            paste0("Record 1: (Index ", rec1_index, ") ", rec1_record, "\n") %>% 
            paste0("Record 2: (Index ", rec2_index, ") ", rec2_record, "\n") 
        dat_return[i,] <- c(sim_score, rec1_index, rec1_record, rec2_index, rec2_record)

    }

    # Print str_return
    if (print==TRUE) {
        cat(str_return)
    }

    # Return dat_return
    return(dat_return)

}

Lastly, when this function is applied, the data appears as follows.

# Generate data
dat_MatchingRecords <- find_MatchingRecords(dataframe = dat_JohnSmiths
                                           ,sim_matrix = mat_MatchingRecords
                                           ,print = TRUE
                                           )
>  Score: 0.84
>  Record 1: (Index 2) 2 Jhon Smith Mr 12 Arcadia Road Bernton 967 1233 5678
>  Record 2: (Index 1) 1 John Smith Mr 12 Acadia Rd Burnton 9671 1234 5678
>  
>  Score: 0.79
>  Record 1: (Index 7) 7 Jon Smith Mr. 13 Kinaston Rd Barnston 9761 36451223
>  Record 2: (Index 4) 4 John Smith Mr 13 Kynaston Rd Burnton 9671 34561234
>  
>  Score: 0.76
>  Record 1: (Index 3) 3 J Smith Mr. 12 Acadia Ave Burnton 867`1 1233 567
>  Record 2: (Index 1) 1 John Smith Mr 12 Acadia Rd Burnton 9671 1234 5678
>  
>  Score: 0.76
>  Record 1: (Index 8) 8 John Smith Dr 12 Aracadia St Brenton 9761 12345666
>  Record 2: (Index 1) 1 John Smith Mr 12 Acadia Rd Burnton 9671 1234 5678
>  
>  Score: 0.71
>  Record 1: (Index 6) 6 John S Dr. 12 Kinaston Road Bernton 9677 34561223
>  Record 2: (Index 4) 4 John Smith Mr 13 Kynaston Rd Burnton 9671 34561234
>  
>  Score: 0.7
>  Record 1: (Index 3) 3 J Smith Mr. 12 Acadia Ave Burnton 867`1 1233 567
>  Record 2: (Index 2) 2 Jhon Smith Mr 12 Arcadia Road Bernton 967 1233 5678
>  
>  Score: 0.7
>  Record 1: (Index 8) 8 John Smith Dr 12 Aracadia St Brenton 9761 12345666
>  Record 2: (Index 2) 2 Jhon Smith Mr 12 Arcadia Road Bernton 967 1233 5678
>  
>  Score: 0.7
>  Record 1: (Index 10) 10 John Smith Dr 12 Kingsford Rd Burnton 9671 89624328
>  Record 2: (Index 4) 4 John Smith Mr 13 Kynaston Rd Burnton 9671 34561234

The resulting dataframe appears as follows:

# Review
dat_MatchingRecords %>% 
    kable() %>% 
    kable_styling(bootstrap_options=c("striped", "bordered", "condensed")
                 ,full_width=FALSE
                 ,position="left"
                 ) %>% 
    (function(x){
        x %>% save_kable("Images/JohnSmithsMatching.png")
        x %>% return()
    })

As seen, this method has identified these duplicates to a reasonable level of accuracy.

4.3. Wrap Up

Therefore, when working to identify duplicates within a database, 'Fuzzy Matching' is a useful way to measure the similarity between different sets of records.

However, considering that the data provided in this example is fictitious, it is useful to perform the same process on real-world data.

5. Real Example

Using business addresses scraped from the web, the same method was applied.

5.1. Pre Processing

The data was loaded from an Excel spreadsheet AddressData.xlsx.

# Load Data
dat_Addresses <- find_rstudio_root_file() %>% 
    paste0("/Data/", "AddressData.xlsx") %>% 
    read_excel(col_types=c("text")) %>% 
    data.frame()

The details of the data appear as follows:

>  Name: dat_Addresses
>  Type: data.frame
>  Dims: 19388 x 8
>  Size: 4.2 Mb
>  Field Counts:
>    field        distinct count distinct%
>  1 COMPANY       6320    19388 0.3259748
>  2 ADDRESS       8646    19388 0.4459459
>  3 CITY          2459    19388 0.1268310
>  4 STATE            8    19388 0.0004126
>  5 ZIP           1264    19388 0.0651950
>  6 ISOCNTRYCODE     1    19388 0.0000000

Then some pre-processing is performed on the Addresses, to standardise the fields and reduce the size of the data.

# Mutate data.
# Steps:
# 1. Remove special characters
# 2. Remove numeric values (Street numbers, etc.)
# 3. Remove additional white spaces
# 4. Capitalise
# 5. Select distinct.
dat_AddressesClean <- dat_Addresses %>% 
    mutate_all(str_replace_all, "[[:punct:]]", "") %>% 
    mutate_at(c("COMPANY", "ADDRESS", "CITY", "STATE"), str_replace_all, "[0-9]", "") %>% 
    mutate_all(str_squish) %>% 
    mutate_all(str_to_upper) %>% 
    distinct() %>% 
    arrange(ISOCNTRYCODE, STATE, CITY, ADDRESS)
# Check result
check_ObjectDetails(dat_AddressesClean) %>% cat()
>  Name: dat_AddressesClean
>  Type: data.frame
>  Dims: 9945 x 6
>  Size: 1.5 Mb
>  Field Counts:
>    field        distinct count distinct%
>  1 COMPANY      6279     9945  0.6313725
>  2 ADDRESS      6267     9945  0.6301659
>  3 CITY         2344     9945  0.2356963
>  4 STATE           8     9945  0.0008044
>  5 ZIP          1264     9945  0.1270990
>  6 ISOCNTRYCODE    1     9945  0.0000000

Lastly, the data must be pasted together into a single vector, as displayed below. Now the data is ready for Fuzzy Matching.

# Create vector
vec_Addresses <- dat_AddressesClean %>% 
    unite("vec", sep="") %>% 
    pull()
# Review
vec_Addresses %>%
    head(10) %>%
    cat(sep="\n")
>  MELBOURNE FACADES PTY LTDANU ELLERY CRESACTONACT2601AU
>  IC FORMWORK SERVICES PTY LTDCLUNIES RDACTONACT2601AU
>  RICHARD CROOKES CONSTRUCTIONS PTYDALEY RDACTONACT2601AU
>  STOWE AUSTRALIA PTY LIMITEDELLERY CRESACTONACT2600AU
>  MELBOURNE FACADES PTY LTDELLERY CRESACTONACT2601AU
>  CONSTRUCTION CONTROL HOLDINGSSULLIVANS CREEK RDACTONACT2601AU
>  QUATTRO BUILDING SERVICES PTY LTDSULLIVANS CREEK RDACTONACT2601AU
>  IC FORMWORK SERVICES PTY LTDMORNINGTON ST AMANDA STAMAROOACT2914AU
>  RENROW STEEL PTY LTDCOPPER CRESBEARDACT2620AU
>  OPTIMUM BRICK BLOCKLAYINGCOPPER CRESBEARDACT2620AU

5.2. Process

The processing for the address data has been combined into a single chunk, and processed accordingly. Key points to note:

  • The initial input vector was 9,945 elements long, while the distances vector was 49,446,540 elements long.
  • The distances vector was 377.2 Mb in size, while the similarity matrix was 755.8 Mb.
  • The time taken was 258.7 seconds. Note that this varies depending on the computational power of the processor.
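These sizes follow directly from the pairwise arithmetic: with n = 9,945 strings there are n × (n - 1) / 2 = 9,945 × 9,944 / 2 = 49,446,540 pairs. At 8 bytes per double, the distance vector therefore needs 49,446,540 × 8 bytes ≈ 377 Mb, while the full 9,945 × 9,945 similarity matrix needs 9,945² × 8 bytes ≈ 755 Mb, matching the object details printed below.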
# Start the clock
tic("\n\nTime to Process Data")
# Calculate the lengths
vec_AddLengths <- str_length(vec_Addresses)
# Calculate the distances
vec_AddDistances <- stringdistmatrix(vec_Addresses, method="osa")
# Calculate the max lengths
vec_AddPairwiseMax <- combn(vec_AddLengths, 2, max, simplify=TRUE)
# Normalise the distances
vec_AddNormalisedDistances <- vec_AddDistances/vec_AddPairwiseMax
# Add object details to the output
str_Output <- c("vec_Addresses", "vec_AddLengths", "vec_AddDistances", "vec_AddPairwiseMax", "vec_AddNormalisedDistances") %>% 
    check_ObjectDetails() %>% 
    paste(sep="\n")
# Create the similarity matrix
mat_AddSimilarityScore <- round(1-vec_AddNormalisedDistances, 2) %>% as.matrix()
# Add matrix to output
str_Output %<>% paste(check_ObjectDetails(mat_AddSimilarityScore), sep="\n\n")
# Tidy the matrix
mat_AddSimilarityScore[upper.tri(mat_AddSimilarityScore)] <- 0
diag(mat_AddSimilarityScore) <- 0
# Re-add matrix to output
str_Output %<>% paste(check_ObjectDetails(mat_AddSimilarityScore), sep="\n\n")
# Print the output
cat(str_Output)
# Stop the clock and print the time taken
toc()
>  Name: vec_Addresses
>  Type: character
>  Dims: 1 x 9945
>  Size: 1.2 Mb
>  
>  Name: vec_AddLengths
>  Type: integer
>  Dims: 1 x 9945
>  Size: 38.9 Kb
>  
>  Name: vec_AddDistances
>  Type: dist
>  Dims: 1 x 49446540
>  Size: 377.2 Mb
>  
>  Name: vec_AddPairwiseMax
>  Type: array
>  Dims: 1 x 49446540
>  Size: 188.6 Mb
>  
>  Name: vec_AddNormalisedDistances
>  Type: dist
>  Dims: 1 x 49446540
>  Size: 377.2 Mb
>  
>  Name: mat_AddSimilarityScore
>  Type: matrix
>  Dims: 9945 x 9945
>  Size: 755.8 Mb
>  
>  Name: mat_AddSimilarityScore
>  Type: matrix
>  Dims: 9945 x 9945
>  Size: 755.8 Mb
>  
>  Time to Process Data: 258.7 sec elapsed

As a result, the records can now be matched. Note that the number of matching records was 2,185 (approximately 22% of the total).

# Find matches
mat_AddMatchingRecords <- tlargest(matrix = mat_AddSimilarityScore
                                  ,threshold = 0.8
                                  ,print = TRUE
                                  )
>  Selected threshold : 0.8
>  Matching records   : 2185
>  Total records      : 9945
>  Matching percent   : 0.2197
# Find records
dat_AddMatchingRecords <- find_MatchingRecords(dataframe = dat_AddressesClean
                                              ,sim_matrix = mat_AddMatchingRecords
                                              )

Once these matching records have been identified, they can be returned. For simplicity, only the top 10 matches at each similarity score are shown in the chunks below.

# View 0.99
dat_AddMatchingRecords %>% 
    filter(Score==0.99) %>% 
    head(10) %>% 
    print_MatchingRecords() %>% 
    cat()
>  Score: 0.99
>  Record 1: (Index 685) AUTOMATIC FIRE PROTECTION DESIGN THE NORTHERN RD BERKSHIRE PARK NSW 2765 AU
>  Record 2: (Index 684) AUTOMATIC FIRE PROTECTION DESIGN THE NORTHERN RD BERKSHIRE PARK NSW 2756 AU
>  
>  Score: 0.99
>  Record 1: (Index 1062) EJ CONSTRUCTIONS NSW PTY LIMITED ROBERTSON RD CENTENNTIAL PARK NSW 2021 AU
>  Record 2: (Index 1061) EJ CONSTRUCTIONS NSW PTY LIMITED ROBERTSON RD CENTENNIAL PARK NSW 2021 AU
>  
>  Score: 0.99
>  Record 1: (Index 2050) CPB DRAGADOS SAMSUNG JOINT VENTURE A COMMERCIAL RD KINGSGROVE NSW 2208 AU
>  Record 2: (Index 2049) CPB DRAGADOS SAMSUNG JOINT VENTURE A COMMERCIAL RD KINGSFROVE NSW 2208 AU
>  
>  Score: 0.99
>  Record 1: (Index 4593) CULLEN STEEL FABRICATIONS PTY LTD GATE WHARF HICKSON RD WALSH BAY NSW 2060 AU
>  Record 2: (Index 4592) CULLEN STEEL FABRICATIONS PTY LTD GATE WHARF HICKSON RD WALSH BAY NSW 2065 AU
>  
>  Score: 0.99
>  Record 1: (Index 5677) BM ALLIANCE COAL OPERATIONS PL CY HAY POINT OPERATIONS HAY POINT QLD 4740 AU
>  Record 2: (Index 5676) BM ALLIANCE COAL OPERATIONS PL CY HAY POINT OPERATIONS HAY POINT QLD 4744 AU

When selecting threshold=0.99, the algorithm has matched nearly perfectly, with only single character differences between the records.

# View 0.98
dat_AddMatchingRecords %>% 
    filter(Score==0.98) %>% 
    head(10) %>% 
    print_MatchingRecords() %>% 
    cat()
>  Score: 0.98
>  Record 1: (Index 25) KENNARDS HIRE PTY LIMITED OATLEY CT BELCONNEN ACT 2617 AU
>  Record 2: (Index 24) KENNARDS HIRE PTY LIMITED OATLEY CT BELCONNEN ACT 2616 AU
>  
>  Score: 0.98
>  Record 1: (Index 152) ABS FACADE PTY LTD SHEPPARD ST HUME ACT 2620 AU
>  Record 2: (Index 149) ABS FACADE PTY LTD SHEPARD ST HUME ACT 2620 AU
>  
>  Score: 0.98
>  Record 1: (Index 229) CRYSTAL POOLS PTY LTD DAVE MCINNES RD STROMLO ACT 2611 AU
>  Record 2: (Index 228) CRYSTAL POOLS PTY LTD DAVE MCINNES RD STOMLO ACT 2611 AU
>  
>  Score: 0.98
>  Record 1: (Index 410) RICHARD CROOKES CONSTRUCTIONS PTY PRINCESS HWY ARNCLIFFE NSW 2205 AU
>  Record 2: (Index 403) RICHARD CROOKES CONSTRUCTIONS PTY PRINCES HWY ARNCLIFFE NSW 2205 AU
>  
>  Score: 0.98
>  Record 1: (Index 875) ODK CONSTRUCTION PTY LTD PARRAMATTA RD BURWOOD NSW 2134 AU
>  Record 2: (Index 478) ODK CONSTRUCTION PTY LTD PARRAMATTA RD B URWOOD NSW 2134 AU
>  
>  Score: 0.98
>  Record 1: (Index 574) EODO PTY LTD HAVANNAH ST BATHURST NSW 2795 AU
>  Record 2: (Index 570) EODO PTY LTD HAVANNAH ST BATHRUST NSW 2795 AU
>  
>  Score: 0.98
>  Record 1: (Index 749) LIN BETTY BUILDING GROUP PTY LTD A OXFORD ST BONDI JUNCTION NSW 2026 AU
>  Record 2: (Index 748) LIN BETTY BUILDING GROUP PTY LTD A OXFORD ST BONDI JUNCTION NSW 2022 AU
>  
>  Score: 0.98
>  Record 1: (Index 927) LAING O ROURKE AUSTRALIA CARDIGAL LN CAMPERDOWN NSW 2050 AU
>  Record 2: (Index 926) LAING OROURKE AUSTRALIA CARDIGAL LN CAMPERDOWN NSW 2050 AU
>  
>  Score: 0.98
>  Record 1: (Index 1055) CJ SJ O KEEFFE BUILDING PTY LTD LOT NELSON ROAD CATTAI NSW 2756 AU
>  Record 2: (Index 1054) CJ SJ OKEEFFE BUILDING PTY LTD LOT NELSON ROAD CATTAI NSW 2756 AU
>  
>  Score: 0.98
>  Record 1: (Index 1083) METRO CHATSWOOD JHCPBG JV MOWBRAY RD CHATSWOOD NSW 2067 AU
>  Record 2: (Index 1081) METRO CHATSWOOD JHCPBG JV MOWBRAY RD CHATSWOOD NSW 2057 AU

When selecting threshold=0.98, the algorithm has also matched nearly perfectly, again with only single-character differences between the records.

# View 0.8
dat_AddMatchingRecords %>% 
    filter(Score==0.8) %>% 
    head(10) %>% 
    print_MatchingRecords() %>% 
    cat()
>  Score: 0.8
>  Record 1: (Index 67) IC FORMWORK SERVICES PTY LTD CHALLIS ST DICKSON ACT 2602 AU
>  Record 2: (Index 2) IC FORMWORK SERVICES PTY LTD CLUNIES RD ACTON ACT 2601 AU
>  
>  Score: 0.8
>  Record 1: (Index 43) IC FORMWORK SERVICES PTY LTD CNR OF HOBART PL LONDON CCT CANBERRA ACT 2601 AU
>  Record 2: (Index 36) IC FORMWORK SERVICES PTY LTD CNR CONSTITUTION AVE AND LONDON CCT CANBERRA ACT 2601 AU
>  
>  Score: 0.8
>  Record 1: (Index 141) TS ANDREW TAYLOR SAWMILL CCT HUME ACT 2620 AU
>  Record 2: (Index 139) TS ALEX GARNER SAWMILL CCT HUME ACT 2620 AU
>  
>  Score: 0.8
>  Record 1: (Index 187) ACT INTERIORS HUDDART CT MITCHELL ACT 2911 AU
>  Record 2: (Index 175) ACT INTERIORS BROOKS ST MITCHELL ACT 2911 AU
>  
>  Score: 0.8
>  Record 1: (Index 292) TS DHAWAL NANDA BOURKE RD ALEXANDRIA NSW 2015 AU
>  Record 2: (Index 270) TS GEORGE NAJJAR BOURKE RD ALEXANDRIA NSW 2015 AU
>  
>  Score: 0.8
>  Record 1: (Index 295) SCHAI DARREN BOURKE RD ALEXANDRIA NSW 2015 AU
>  Record 2: (Index 278) SMITH STEVEN BOURKE RD ALEXANDRIA NSW 2015 AU
>  
>  Score: 0.8
>  Record 1: (Index 302) ABS FACADE PTY LTD BOURKE RD ALEXANDRIA NSW 2015 AU
>  Record 2: (Index 281) DELTA PTY LTD BOURKE RD ALEXANDRIA NSW 2015 AU
>  
>  Score: 0.8
>  Record 1: (Index 311) CPB CONTRACTORS PTY LTD BOURKE RD ALEXANDRIA NSW 2015 AU
>  Record 2: (Index 297) FLEXEM CONSTRUCTION PTY LTD BOURKE RD ALEXANDRIA NSW 2015 AU
>  
>  Score: 0.8
>  Record 1: (Index 300) CHAN PETER BOURKE RD ALEXANDRIA NSW 2015 AU
>  Record 2: (Index 299) LOCANO PAULO BOURKE RD ALEXANDRIA NSW 2015 AU
>  
>  Score: 0.8
>  Record 1: (Index 307) PETER CHAN BOURKE RD ALEXANDRIA NSW 2015 AU
>  Record 2: (Index 305) PAULO LOCANO BOURKE RD ALEXANDRIA NSW 2015 AU

Scores above 0.9 have likewise matched near-perfect duplicates, with only single-character differences between the records. However, when selecting threshold=0.8, the algorithm has made false matches. For example:

  • Records 43 and 36 may actually be matching duplicates, because they’re both for IC FORMWORK SERVICES on LONDON CCT.
  • Records 141 and 139 have matched different businesses on the same street in the same suburb. This is a false match.
  • Records 187 and 175 have matched the same business on different streets in the same suburb. This is a false match.

Therefore, discretion is advised when using a threshold value less than 0.9.

6. Conclusion

In conclusion, Fuzzy Matching can successfully identify duplicates to a reasonable level of accuracy. Strings with only a single-character difference receive a near-perfect matching score (≈0.99). However, when the threshold is set to ~0.8, the logic begins to match different businesses on the same street in the same suburb, or the same business on different streets in the same suburb, and so on. This is an implementation risk, so the threshold must be chosen carefully.

Moreover, due to the substantially large objects created during processing, appropriate data segmentation is advisable, so as to avoid causing computational issues.
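As a minimal sketch, one such segmentation approach (known as 'blocking' in the record-linkage literature) is to compute similarities only within each STATE, rather than across all records at once. The choice of STATE as the blocking field is an assumption; any reliable, low-cardinality field would work.

# Hypothetical blocking sketch: compute similarity matrices within each STATE
list_Blocks <- split(dat_AddressesClean, dat_AddressesClean$STATE)
# Keep only blocks with at least two records to compare
list_Blocks <- Filter(function(block) {nrow(block) > 1}, list_Blocks)
# Apply the same distance/normalisation logic within each block
list_Similarity <- lapply(list_Blocks, function(block) {
    strings <- block %>% unite("vec", sep="") %>% pull()
    distances <- stringdistmatrix(strings, method="osa")
    max_lengths <- combn(str_length(strings), 2, FUN=max)
    round(1 - distances/max_lengths, 2) %>% as.matrix()
})

Because string distance is quadratic in the number of records, splitting the 9,945 rows into the eight state blocks reduces the pairwise count substantially, at the cost of never comparing records across blocks.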

Also note, data input validation is always better than retrospective data cleaning. This solution should be implemented in instances where address data is supplied by external parties (such as via an EDI connection). If an incoming record matches an existing one with a similarity score of >0.99, it is reasonable to auto-amend the data; however, if the score is ~0.8, it would be better for the record to be flagged for review by an employee.
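As a rough sketch, that triage logic could look as follows (the thresholds and the function name triage_match are illustrative assumptions, not part of the method above):

# Hypothetical triage rule, applied to an incoming record's best similarity score
triage_match <- function(score) {
    if (score > 0.99) {
        "auto-amend to the matched record"
    } else if (score >= 0.8) {
        "flag for review by an employee"
    } else {
        "accept as a new record"
    }
}
# Example usage
triage_match(0.84)
>  [1] "flag for review by an employee"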

7. References

Awati 2019, ‘Tackling the John Smith Problem – deduplicating data via fuzzy matching in R’, <https://eight2late.wordpress.com/2019/10/09/tackling-the-john-smith-problem-deduplicating-data-via-fuzzy-matching-in-r/>.

Wikipedia 2019, ‘Cosine similarity’, <https://en.wikipedia.org/wiki/Cosine_similarity>.

Wikipedia 2019, ‘Edit distance’, <https://en.wikipedia.org/wiki/Edit_distance>.

Wikipedia 2019, ‘Hamming distance’, <https://en.wikipedia.org/wiki/Hamming_distance>.

Wikipedia 2019, ‘Jaccard index’, <https://en.wikipedia.org/wiki/Jaccard_index>.

Wikipedia 2019, ‘Jaro–Winkler distance’, <https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance>.

Wikipedia 2019, ‘Levenshtein distance’, <https://en.wikipedia.org/wiki/Levenshtein_distance>.

Wikipedia 2019, ‘Longest common substring problem’, <https://en.wikipedia.org/wiki/Longest_common_substring_problem>.

Wikipedia 2019, ‘Optimal string alignment distance’, <https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance#Optimal_string_alignment_distance>.

8. Post Script

Acknowledgements: This report was compiled with some assistance from others. Acknowledgements go to:

  1. Kailash Awati, whose original article (Tackling the John Smith Problem) was used as the basis for this article.

Publications: This report is also published on the following sites:

  1. RPubs: RPubs/chrimaho/AddressingTheJohnSmithProblem
  2. GitHub: GitHub/chrimaho/AddressingTheJohnSmithProblem
  3. Medium: Medium/chrimaho/AddressingTheJohnSmithProblem

Change Log: This publication was modified on the following dates:

  1. 04/Jul/2020: Original Publication date.
