Hybrid Fuzzy Name Matching

Aviad Atlas
Towards Data Science
7 min read · Nov 13, 2018


Introduction

My workplace works with large-scale databases that, among other things, contain data about people. For each person in the DB we have a unique identifier, composed of the person’s first name, last name, and zip code. We hold ~500MM people in our DB, which can accumulate duplicates whenever there is a slight change in a person’s name. For example, Rob Rosen and Robert Rosen (with the same zip code) will be treated as two different people. I want to note that if we receive the same person an additional time, we simply update the record’s timestamp, so that case needs no deduping. In addition, I would like to credit my co-worker Jonathan Harel, who assisted me in the research for this project.

The Problem

I cleaned the DB in several different ways. I am going to describe the most interesting one, which deduped the DB very well for this case. Here we only try matching between two identifiers that hold the same zip code. For a substantial part of the DB we hold gender and age data, but very often this info is missing, so I am basically left with just the actual names. How can I be sure that two names belong to the same person?

Machine Learning to the Rescue

This is easier said than done. What exactly are the features that can be fed into a Machine Learning model here?

Some features are relatively intuitive, and some had to be researched thoroughly. At first I was obviously thinking of some sort of string similarity (for typos, etc.), and perhaps of cases in which one name is a nickname of the other (e.g. Ben/Benjamin). I performed data exploration, looking to confirm these intuitions or find other ideas for name similarities. This was a good start, but I needed more. After much research I arrived at the following list of cases of similarity between names:

[Image: Name Similarities]

Feature Extraction

Assuming I have two people who hold the same zip code, I want to score how “close” they are. As previously said, in some cases I have age and/or gender data, but very often I don’t; where present, these are of course also fed to the model as features. For each of the similarity ideas above I then needed to extract a corresponding feature (hence the name I chose: “Hybrid”). A combined sketch after the list below shows how these features can come together for a single pair:

  • For the nicknames I collected multiple large lists of names and their nicknames, and built a Python dictionary from them. Given two first names, first_name1 and first_name2, together with the nickname dictionary, the following function creates a binary feature that is 1 if one person’s name is a nickname of the other’s:
def is_nickname(first_name1, first_name2, nickname_dict):
    # Look up each name's known nicknames (None if the name isn't a key)
    first_name1_nicknames = nickname_dict.get(first_name1)
    first_name2_nicknames = nickname_dict.get(first_name2)
    # The feature fires if either name appears among the other's nicknames
    if first_name1_nicknames and first_name2 in first_name1_nicknames:
        return 1
    if first_name2_nicknames and first_name1 in first_name2_nicknames:
        return 1
    return 0
  • Ideas e. to k. are also features, each based on a scripted test that checks whether the case holds (or “relatively holds”, as we will see in the following code) for a given comparison. For example, the following code scores the similarity between the components of the two full names:
import re
from collections import Counter
import editdistance
from pyjarowinkler.distance import get_jaro_distance

# Out-of-order components ('Diaz | Carlos Alfonzo', 'Carlos Alfonzo | Diaz')
# / first_name_last_name_swap ('jesse | jones', 'jones | jesse')
full_name1_splits = re.split(r"\W+|_", full_name1)
full_name2_splits = re.split(r"\W+|_", full_name2)
out_of_order_components_score = (
    len(Counter(full_name2_splits) & Counter(full_name1_splits))
    / float(max(len(full_name1_splits), len(full_name2_splits))))

# Jaro-Winkler and plain Levenshtein string similarities
jaro_score = get_jaro_distance(name1, name2)
levenshtein_distance_score = editdistance.eval(name1, name2)
  • For phonetic similarity I settled on the NYSIIS and Double Metaphone algorithms. The idea behind these algorithms is that they create a phonetic encoding for English words; I then take a string distance between the two encodings (specifically the Levenshtein distance here). Double Metaphone, for example, outputs a primary encoding and a secondary encoding. This is how the encodings look for the names Catherine and Kathryn:
[Image: Double Metaphone Algorithm Example]
  • To extract the phonetic similarity feature using the NYSIIS algorithm (which produces a single encoding per name), I used the following code:
import editdistance
import fuzzy
nysiis_score = editdistance.eval(fuzzy.nysiis(name1), fuzzy.nysiis(name2))
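
To see how these pieces can come together, here is a minimal sketch (not the actual production code) that assembles a feature dictionary for a single candidate pair. It reuses is_nickname and the libraries above; build_features itself, the assumption that the first name component comes first, and the -1 missing-value markers are my illustrative choices:

import re
from collections import Counter
import editdistance
import fuzzy
from pyjarowinkler.distance import get_jaro_distance

# Double Metaphone encoder; returns [primary, secondary] (secondary may be None)
dmeta = fuzzy.DMetaphone()

def _enc_to_str(enc):
    # fuzzy may return bytes on Python 3; normalize to str for editdistance
    if enc is None:
        return ""
    return enc.decode() if isinstance(enc, bytes) else enc

def build_features(full_name1, full_name2, nickname_dict,
                   age1=None, age2=None, gender1=None, gender2=None):
    splits1 = re.split(r"\W+|_", full_name1)
    splits2 = re.split(r"\W+|_", full_name2)
    return {
        # Binary nickname feature; assumes the first component is the first name
        "is_nickname": is_nickname(splits1[0], splits2[0], nickname_dict),
        # Share of name components the two full names have in common
        "out_of_order_components":
            len(Counter(splits1) & Counter(splits2))
            / float(max(len(splits1), len(splits2))),
        # Plain string similarities
        "jaro": get_jaro_distance(full_name1, full_name2),
        "levenshtein": editdistance.eval(full_name1, full_name2),
        # Phonetic similarities: string distance between the encodings
        "nysiis": editdistance.eval(fuzzy.nysiis(full_name1),
                                    fuzzy.nysiis(full_name2)),
        "double_metaphone": editdistance.eval(
            _enc_to_str(dmeta(full_name1)[0]),
            _enc_to_str(dmeta(full_name2)[0])),
        # Age/gender are often missing; -1 is an assumed missing-value marker
        "age_diff": abs(age1 - age2)
            if age1 is not None and age2 is not None else -1,
        "same_gender": int(gender1 == gender2) if gender1 and gender2 else -1,
    }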

Does anyone have some labels?

You might have been wondering by now: “Is this a classification problem? If so, where did he get labeled data from?”

Well, I don’t… I actually labeled the data myself. I extracted cases that are a match (e.g. Jennifer Williams / Jenny Williams; labeled as 1), cases that are a “close” match but not a real one (e.g. Don Anderson / Daniel Anderson; labeled as 0), and added a large random sample from the data for labeling. The “close” matches allowed me to build a robust model that differentiates very well between real matches and matches that are close but not actual matches. This wasn’t exactly a pleasure, but it made this project feasible :)
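
For illustration only (the pairs below are made up, nickname_dict is the dictionary built earlier, and build_features is the sketch from the previous section), the labeled set can then be turned into training matrices like this:

import pandas as pd

# Hypothetical hand-labeled pairs: clear matches (1) and "close"
# non-matches (0); the random-sample rows are omitted here
labeled_pairs = pd.DataFrame(
    [("Jennifer Williams", "Jenny Williams", 1),
     ("Don Anderson", "Daniel Anderson", 0)],
    columns=["name1", "name2", "label"])

# Turn each pair into a feature row
train_features = pd.DataFrame(
    [build_features(p.name1, p.name2, nickname_dict)
     for p in labeled_pairs.itertuples(index=False)])
y_train = labeled_pairs["label"]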

Building a model

I am now ready to train a model. Of course, I split the data into train and test sets (further splitting the train set for Hyperparameter Optimization). What most concerned me here is the precision. Let’s recall (pun intended :) ) what precision is:

precision = true positives / (true positives + false positives) (Wikipedia)

The reason is that matching two people who aren’t really the same person is much worse than missing a match between two people who actually are. Beyond internal use, the data is exported to data partners, so for business reasons we prefer having as few false positives as possible.

I decided to go for models that do not need any feature scaling, so I mainly tried Random Forest, GBM and XGBoost. I also performed Hyperparameter Optimization, using sklearn’s GridSearchCV:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Number of trees
n_estimators = [int(x) for x in np.linspace(start=5, stop=30, num=5)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(3, 20, num=3)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the parameter grid (GridSearchCV searches it exhaustively)
param_grid = {'n_estimators': n_estimators,
              'max_features': max_features,
              'max_depth': max_depth,
              'min_samples_split': min_samples_split,
              'min_samples_leaf': min_samples_leaf,
              'bootstrap': bootstrap}

grid_search = GridSearchCV(RandomForestClassifier(random_state=0),
                           param_grid=param_grid, scoring='precision',
                           n_jobs=-1, verbose=5)
grid_search.fit(train_features, y_train)

Notice that you can make GridSearchCV optimize for the precision score by setting the scoring argument, as done above.
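
Once the search finishes, the best estimator can be evaluated on the held-out test set. A short sketch, where test_features and y_test are assumed to be built the same way as the training data:

from sklearn.metrics import classification_report, precision_score

best_model = grid_search.best_estimator_
y_pred = best_model.predict(test_features)

# Precision is the metric that matters most for this use case
print("precision:", precision_score(y_test, y_pred))
print(classification_report(y_test, y_pred))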

Initial Results

After running the optimized model for the first time, I got a precision score of 0.85 on the test set. This is nice, but I was aiming to be close to perfect. Since my model outputs a probability, I tried finding the optimal decision threshold for raising the precision. There is a trade-off here, since the recall drops as the threshold rises: I could push the threshold very high, but I would then end up with close to zero matches.
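
One way to explore this trade-off is sklearn’s precision_recall_curve, which evaluates precision and recall at every candidate threshold over the predicted probabilities. A minimal sketch, reusing best_model from above:

import numpy as np
from sklearn.metrics import precision_recall_curve

# Probability of the positive ("same person") class on the test set
match_probs = best_model.predict_proba(test_features)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_test, match_probs)

# Inspect, e.g., the first threshold whose precision reaches 0.99 and
# check how much recall is left at that operating point
target_precision = 0.99
hits = np.where(precisions[:-1] >= target_precision)[0]
if hits.size:
    i = hits[0]
    print(thresholds[i], precisions[i], recalls[i])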

I decided to analyze what my model got wrong, checking whether something links most or perhaps all of the false positives. I found many cases in which the age feature had too much of an effect (note: the data was collected over the past few years, so the ages aren’t expected to be identical), for example:

[Image: False-Positive Example]

So what could I do here? I taught the model that it had made a mistake. How? Just like in the real world, I told the model again and again that it was wrong: I added to the training set a vast number of labeled cases in which the age is similar and one of the names is identical (as seen above). This idea can be considered similar to the concept of Active Learning.
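
In code, this amounts to mining such hard negatives and retraining. A rough sketch; candidate_pairs, feature_cols and the exact selection rule are hypothetical stand-ins for the criteria described above:

import pandas as pd

# Hypothetical pool of labeled candidate pairs with precomputed features.
# Keep the non-matches where the ages are close and one name component is
# identical: exactly the pattern behind the false positives
hard_negatives = candidate_pairs[
    (candidate_pairs["label"] == 0)
    & (candidate_pairs["age_diff"] <= 2)
    & (candidate_pairs["same_last_name"] == 1)]

# Append them to the training set and refit, so the model sees these
# "no-no" cases again and again
train_features = pd.concat([train_features, hard_negatives[feature_cols]])
y_train = pd.concat([y_train, hard_negatives["label"]])
best_model.fit(train_features, y_train)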

Final Results

This had a huge impact on the model: I managed to raise the precision on the test set to 0.99, while holding a recall of 0.8.

When running this on the whole DB, the model found ~50MM matches, deduping the DB by 10%! I of course didn’t go over all of these matches, but I randomly sampled a good few thousand and found the precision there to be about 0.99 as well. Here are some cool examples of matches made:

[Image: Matches Achieved]
