Notes from Industry

Deduplication of customer data using fuzzy-scoring

Handling complications in customer data using Python

Shreepada Rao
Towards Data Science
6 min read · Apr 25, 2021


Abstract:

For any organization, data management is a challenge throughout the data life cycle of planning, collection, processing, storage, management, analysis, visualization and, finally, interpretation. Successfully managing and protecting data requires a lot of planning and collaboration between teams to move it from one stage to the next. A few prominent data processing steps are wrangling, compression, deduplication and encryption. For most customer-centric organizations, arriving at a unique list of customers is one of the biggest challenges, not because of storage cost reduction, but because understanding their customers is one of their biggest assets.

Hence, irrespective of the size of the corporation, a lot of time and effort is spent on getting this data right, either through in-line deduplication or post-processing deduplication [1]. This article walks through post-processing deduplication using a fuzzy scoring method with Python and relevant packages.

Challenges:

The data is collected through various methods, with or without standard data collection forms, and in various places, but is consolidated in one place. These lists are often compiled by third parties without standard formatting and can contain duplicate or otherwise ‘dirty’ data. The data goes back a long way and contains inaccuracies and gaps.

Anyone attempting to compare lists compiled through different collection methods is quickly confronted with several data concerns:

1. Unclear addresses: Depending on when, where and for what primary purpose the data was collected, it can contain non-structured and non-specific addresses, postal codes and even names.

2. Transliteration problems: Inconsistencies when transliterating addresses with non-Roman characters into English text introduce variation in addresses and names.

3. Contact information complications: Customers generally have multiple email addresses and phone numbers, and the problem explodes in B2B settings, where a single business will have many phone numbers and email addresses.

4. Most repeated values: Commonly repeated values are the result of incorrect data collection, occurring especially under name, email and phone number, and each scenario must be handled appropriately.

Data normalization:

An essential step before starting deduplication is making sure the data is structured, aligned and in a consistent format. Hence it’s sensible to follow the steps below (a short sketch follows the list):

1. Lowercase: One of the simplest and most effective steps is to convert everything to lowercase so that names and addresses become comparable.

2. Abbreviations: Maintain consistency for abbreviated words by building a repository, either obtained from the internet or custom-built from the data being processed. Example: Street is also represented as St., St, Str., Str, etc.

3. Missing values: Assess the missing values per record; if a record has a significant number of missing values, drop it from further processing, as its quality will be too low to contribute to deduplication scoring.

4. Incorrect country names: The Python package ‘iso3166’ contains a list of all countries and helps identify records with incorrect country names.

5. Incorrect postal codes: ‘pgeocode’ is a Python library that contains legitimate postal codes for all countries and can be used to validate the postal codes in an address. There are also several paid APIs for the same purpose that help get the postal code right.
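A minimal sketch of the normalization steps above is shown below. The helper names and the abbreviation map are illustrative, not part of the original pipeline; the iso3166 and pgeocode packages are the ones mentioned in the list.

import re
import numpy as np
import pandas as pd
import pgeocode
from iso3166 import countries

# Illustrative abbreviation repository; in practice this list is much larger
ABBREVIATIONS = {r'\bst\.?\b': 'street', r'\bstr\.?\b': 'street', r'\brd\.?\b': 'road'}

def normalize_text(value):
    """Lowercase a name/address and expand common abbreviations."""
    if pd.isnull(value):
        return np.nan
    value = str(value).lower().strip()
    for pattern, replacement in ABBREVIATIONS.items():
        value = re.sub(pattern, replacement, value)
    return value

def is_valid_country(name):
    """True if iso3166 recognises the country name or code."""
    try:
        countries.get(name)
        return True
    except KeyError:
        return False

def is_valid_postal_code(code, country_code):
    """True if pgeocode resolves the postal code for the given country."""
    nomi = pgeocode.Nominatim(country_code)
    return not pd.isnull(nomi.query_postal_code(str(code)).place_name)

Applying normalize_text to the name and street columns before scoring keeps variants such as ‘St.’ and ‘Street’ from being treated as different tokens.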

Validation of phone numbers and email addresses:

Apart from identifying repeated phone numbers, an additional check is to assess their quality and standardize the format with a ‘+’ or ‘0’ or (country code) prefix. The script below standardizes millions of phone numbers in a few seconds. Further, more precise validation can be done through paid APIs, which help with validation during in-line or post-processing deduplication.

import numpy as np
import pandas as pd

def check_phone(phone, cc):
    """Standardize a phone number so it starts with '+<country calling code>'."""
    # Missing phone or calling code cannot be standardized
    if pd.isnull(phone) or pd.isnull(cc):
        return np.nan
    try:
        # Placeholder values such as 0 are treated as missing
        if int(phone) < 1:
            return np.nan
    except (TypeError, ValueError):
        pass

    phone = str(phone)
    cc = str(cc)
    if len(phone) <= 4:                  # too short to be a valid number
        return np.nan

    cc_len = len(cc)
    if phone[:1] == '+':                 # already carries a '+' prefix
        return phone
    if phone[:1] == '0':                 # leading '0' instead of '+'
        if phone[1:1 + cc_len] == cc:
            return phone.replace('0', '+', 1)
        return phone.replace('0', '+' + cc, 1)
    if phone[:cc_len] == cc:             # starts with the bare calling code
        return '+' + phone
    return '+' + cc + phone              # prepend '+<calling code>'

# Function caller (in a Jupyter notebook; %time reports the wall time taken)
%time account[['calling_code', 'contact_Phone']].apply(lambda x: check_phone(x.contact_Phone, x.calling_code), axis=1)

Definition of Duplication:

Defining what constitutes duplication in the data is an important aspect of the process. Depending on the problem being solved, the definition of duplication will change. For typical customer data this will generally be name, postal address, phone number(s) and email address. These form the set of fields to score and help identify duplication. A few fields, such as phone number or email, give a clear indication of duplication; on the other hand, duplication in name or address can actually represent distinct individuals. Hence such cases should go through a semi-automated/automated verification process.
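As a minimal illustration of the clear-indication fields, exact matches on normalized contact columns can be flagged directly with pandas before any fuzzy scoring. The column names here (‘contact_Phone’, ‘contact_Email’) are assumptions following the phone-standardization step above.

# Rows sharing an identical normalized phone number or email address are
# near-certain duplicates; filter out missing values so NaNs do not match
phone_dups = account[account['contact_Phone'].notna()
                     & account.duplicated(subset=['contact_Phone'], keep=False)]
email_dups = account[account['contact_Email'].notna()
                     & account.duplicated(subset=['contact_Email'], keep=False)]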

Finally, the scoring:

Scoring starts with a self-join of the table on city/postal code, and then fuzzy logic is used to score the remaining columns such as name, email and phone numbers. Scoring with the right cutoff gives the probable list of duplicates in the data, with the rest discarded. As always, there is no magical number for a definite cutoff, and it takes a few iterations to arrive at a number between 0 and 100 that defines the right cutoff for the data. In addition, it’s better to have distinct cutoffs for each column, as in the code below.
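Before running the full pipeline, it helps to see what fuzz.token_sort_ratio produces and to define the per-column cutoffs referenced by joiner() and identifier() below. The import assumes the fuzzywuzzy package, and the cutoff values are illustrative starting points to be tuned over a few iterations.

from fuzzywuzzy import fuzz  # or: from thefuzz import fuzz

# token_sort_ratio lowercases and sorts the words, returning a 0-100 similarity
print(fuzz.token_sort_ratio('Acme Corporation Ltd', 'ltd acme corporation'))  # 100: same tokens, different order
print(fuzz.token_sort_ratio('Acme Corporation', 'Acme Corp'))                 # high, but below 100

# Distinct cutoffs per column, used by the scoring code below
name_match_cutoff = 85
street_match_cutoff = 80
city_match_cutoff = 75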

(Figure: resulting table after the self-join)
# Self-joining data based on the blocking parameter
# Preparation of data for fuzzy scoring
def joiner(cntry, file, mcol):
    df_string_all = account[(account.ShippingCountryCode == cntry)]
    df_string_all = df_string_all[address_match_columns]
    global dup
    dup = pd.DataFrame()
    # Process the country in chunks of roughly 1,000 records (at least one chunk)
    parts = max(round(df_string_all.shape[0] / 1000), 1)
    start_time = time.time()
    if cntry == 'ie':  # for 'ie', block on city instead of postal code
        mcol = mcol.replace('PostalCode', 'City')
    print(cntry.upper(), mcol)
    total_uni = len(df_string_all[mcol].unique())
    unique_col_value = df_string_all[mcol].unique()
    rem = ['xxxxx', 'Nan', '', 'NAN', 'nan', np.nan]
    unique_col_value = [uni for uni in unique_col_value if uni not in rem]

    for i in range(1, parts + 1):
        # Slice of unique city/postal-code values handled in this chunk
        my_list = unique_col_value[int(round(np.divide(1, parts) * total_uni * (i - 1))):
                                   int(round(np.divide(1, parts) * total_uni * i))]
        df_string = df_string_all[(df_string_all[mcol].isin(my_list))]
        # Self-join on the blocking column; suffixes distinguish the two sides
        df_string = df_string.merge(df_string, on=mcol, how='left', suffixes=('1', '2'))
        col_list = df_string.columns.sort_values().drop(['Id1', 'Id2']).drop(mcol)
        # Keep each pair only once and drop self-pairs
        df_string = df_string[(df_string.Id1 < df_string.Id2)]
        even = col_list[::2]
        odd = col_list[1::2]
        # Pre-filter: keep only pairs whose names already score above the cutoff
        df_string = df_string[(df_string[['Name1', 'Name2']].apply(
            lambda x: fuzz.token_sort_ratio(x['Name1'], x['Name2']), axis=1) > name_match_cutoff)]
        if df_string.shape[0] > 0:
            dup = dup.append(identifier(df_string, even, odd, mcol))
        del df_string
    del df_string_all
    end_time = time.time()
    print('Time taken for : ', cntry.upper(), mcol, round((end_time - start_time) / 60, 2), ' minutes')
    print('Duplicates for : ', cntry.upper(), mcol, dup.shape)
    return dup

The block below identifies duplicates in the data based on the cutoff levels defined by the user:

def identifier(df_string, col_even, col_odd, case):
    # Fuzzy-score every pair of matching columns (e.g. Name1 vs Name2)
    for i in col_even:
        for j in col_odd:
            if i[:-1] == j[:-1]:
                new_col = i[:-1] + '_score'
                df_string[new_col] = df_string.apply(
                    lambda x: fuzz.token_sort_ratio(x[i], x[j]), axis=1)
                # A missing value on either side means no evidence, so score 0
                df_string[new_col] = df_string.apply(
                    lambda x: 0 if (pd.isnull(x[i]) | pd.isnull(x[j])) else x[new_col], axis=1)

    col_score = [k for k in df_string.columns if 'score' in k]
    street_score = [k for k in col_score if 'Street' in k]
    city_score = [k for k in col_score if 'City' in k] + [k for k in col_score if 'Post' in k]

    # Apply distinct cutoffs depending on which column the pairs were blocked on
    if case == 'Name':
        duplicate_con = df_string[((df_string[street_score] > street_match_cutoff).sum(axis=1) > 0) &
                                  ((df_string[city_score] > city_match_cutoff).sum(axis=1) > 0)]
    elif case == 'BillingStreet':
        duplicate_con = df_string[((df_string[city_score] > city_match_cutoff).sum(axis=1) > 0) &
                                  (df_string['Name_score'] > name_match_cutoff)]
    else:
        duplicate_con = df_string[(df_string['Name_score'] > name_match_cutoff) &
                                  ((df_string[street_score] > street_match_cutoff).sum(axis=1) > 0) &
                                  ((df_string[city_score] > city_match_cutoff).sum(axis=1) > 0)]

    if duplicate_con.shape[0] > 0:
        duplicate_con['2Final_Score'] = round(duplicate_con[col_score].mean(axis=1))
        duplicate_con['1Match_Case'] = case
        # The blocking column is identical on both sides, so it scores 100 by definition
        duplicate_con[case + '1'] = duplicate_con[case]
        duplicate_con[case + '2'] = duplicate_con[case]
        duplicate_con[case + '_score'] = 100
        duplicate_con = duplicate_con.drop(columns=case)

    return duplicate_con

Function caller

country = list(account.ShippingCountryCode.unique())
country = [e for e in country if e not in (['nan', np.nan])]
duplicate = pd.DataFrame()
duplicate_indi = pd.DataFrame()
start = time.time()
for cntr in country:
    file_name = 'account_' + cntr
    # Score on both postal-code columns for every country
    for cols in ['MailingPostalCode', 'PostalCode']:
        duplicate = duplicate.append(joiner(cntr, file_name, cols))

end = time.time()
print('Total Time taken:', round((end - start) / 60, 2), ' minutes')

Connect with me for interesting conversations through LinkedIn: www.linkedin.com/comm/mynetwork/discovery-see-all?usecase=PEOPLE_FOLLOWS&followMember=shreepadahs

References:

[1] https://www.datacore.com/blog/inline-vs-post-process-deduplication-compression/
