Introduction
Having worked at an economics lab for the past two years, I have become familiar with various concepts from social science research. In particular, PII, which stands for Personally Identifiable Information, shows up in all kinds of individual-level data collected through applications, surveys and administrative operations. PII is especially valuable when researchers want to link different datasets together and track the movements or activities of individuals across data sources or time spans, which allows for more dynamic and in-depth analysis. However, PII is sensitive in itself and hence needs to be treated with prudence. That is why data containing PII is often handled separately under rigorous security protocols. This article will walk through some good coding practices and example code for parsing out PII using STATA, the go-to software for a lot of economists.
Use of Locals and Globals
Although STATA is not a general-purpose programming language, good coding practices still apply. Using locals and globals makes managing file paths more convenient. Say you hard code the path of a directory that contains the files you need. If that path ever changes, you would have to edit every line of code that used it. This is where locals and globals come into play.
Locals can be anything from scalars to variables to lists of variables and strings, stored in local memory. Here, local memory means that a local defined within a script is only valid within that script; it cannot be referenced from another script. Globals, on the other hand, can be referred to globally, regardless of which script they are called from.
Let’s assume there is a data set you want to read in called "pii_data.dta", located in "O:/josh/data". You can define the data directory path in the following two ways:
local data_dir "O:/josh/data"
global data_dir "O:/josh/data"
use "`data_dir'/pii_data.dta"   // read in data using the local path
use "$data_dir/pii_data.dta"    // read in data using the global path
Please pay attention to the differences in syntax between locals and globals: STATA locals are wrapped in a back-tick (`) and an apostrophe (’), while globals use a dollar sign ($).
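Locals are not limited to file paths, either. Here is a small illustration (the variable names are hypothetical) of a local that stores a whole list of variables:
local name_vars "FIRSTNAME MIDDLENAME LASTNAME"   // a local storing a list of variable names
describe `name_vars'                              // the local expands wherever a varlist is expected
Keep in mind that this local disappears once the script (or interactive session) that defined it ends, whereas a global sticks around until you exit STATA or remove it with macro drop.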
Define a Schema
Before parsing PII out of the data, you should define a "schema". Defining a schema here means determining what types of PII we will be extracting and in which variables the information will be stored. This depends on what the research project you are working on needs to get out of the PII. Is a full name enough, or does it have to be separated into individual columns (e.g. first, middle, last name and suffix)? Are you only interested in the age of each individual, or do you need the exact year, month and day of birth?
In this example, we will assume the data has full name, DOB (date of birth), address and zip code information. We will define the schema to be the following variables:
name_first | name_middle | name_last | name_suffix | dob_yyyy | dob_mm | dob_dd | addr_bldnum | addr_str | addr_city | addr_st | addr_zip
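Because the parsing code below fills these schema variables in with replace statements, it helps to create them up front as empty placeholders. Here is a minimal sketch of doing so, assuming the schema above (string variables for the name and address pieces, numeric ones for the date-of-birth components):
* string components of the schema
foreach v in name_first name_middle name_last name_suffix addr_bldnum addr_str addr_city addr_st addr_zip {
    gen `v' = ""
}
* numeric date-of-birth components
foreach v in dob_yyyy dob_mm dob_dd {
    gen `v' = .
}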
Parsing Names
Names can come in many different formats. In some cases, the full name arrives in a single column. In others, it comes in separate columns, each representing a portion of the full name (e.g. last name). Personally, I have found it rare for suffixes to be extracted into their own column. That is unfortunate, because additional information such as a suffix helps entity resolution (i.e. the process of linking data points across disparate data sets and giving them the same unique individual identifier when those observations share similar PII, usually via a probabilistic matching algorithm) distinguish two otherwise similar records as different individuals or entities.
Assume the data came with three name-related columns: LASTNAME, FIRSTNAME and MIDDLENAME. We want to transform these three columns so that they fit our schema, which has four name-related columns. The STATA code for doing so is as follows:
gen n_last = strtrim(stritrim(lower(LASTNAME)))
gen n_first = strtrim(stritrim(lower(FIRSTNAME)))
gen n_mid = strtrim(stritrim(lower(MIDDLENAME)))
foreach var of varlist n_* {
    replace `var' = subinstr(`var', "-", " ", .)
    replace `var' = strtrim(stritrim(subinstr(`var', ".", "", .)))
    /* check for suffix */
    gen `var'_suff = regexs(0) if regexm(`var', "(1|11|111|[ ]i|ii|iii|[ ]iv|[ ]v|[ ]v[i]+|jr|sr|[0-9](nd|rd|th)*)$")
    replace `var' = strtrim(stritrim(subinstr(`var', `var'_suff, "", 1))) if !missing(`var'_suff)
}
gen suffix = ""
foreach var of varlist n_*_suff {
    replace suffix = `var' if missing(suffix) & !missing(`var')
}
drop n_*_suff
* keep a raw copy of the original name fields for reference
gen name_raw = strtrim(stritrim(lower(FIRSTNAME + " " + MIDDLENAME + " " + LASTNAME)))
replace name_first = n_first
replace name_middle = n_mid
replace name_last = n_last
replace name_suffix = strtrim(suffix)
The first three lines are straightforward: they clean up things like extra white space in each of the name variables and copy them into newly generated variables to work with. Note that strtrim (which removes leading and trailing blanks) and stritrim (which collapses consecutive internal blanks into one) are useful functions for cleaning up string variables in STATA.
Next comes a loop, which iterates over the three variables we just created.
The first two lines inside the loop (foreach var ...) remove the special characters "-" and ".". It is often advisable to remove them before doing anything else because they can get in the way of things like regular expressions being applied effectively.
The next part of the loop uses regular expressions, which you may have seen frequently in other programming languages. Here, we are trying to grab any suffixes that are mixed into each name column. It would be great if suffixes appeared in only one column consistently, but that is rarely the case. This is why we loop through the first, middle and last name columns to see if any suffixes exist in any of them. Welcome to the universe of real-world data, where there is rarely any consistency! The regular expression aims to grab typical suffixes such as 2nd, 3rd, 4th, jr, sr and roman numerals. The regexm function in STATA returns 1 if the variable passed as the first argument matches the regular expression pattern in the second argument and returns 0 otherwise; regexs(0) then returns the actual substring that matched, which is what we store in each _suff variable.
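To see how regexm and regexs work together, here is a quick interactive check using a made-up name and a simplified suffix pattern:
di regexm("smith jr", "(jr|sr)$")   // displays 1 because "smith jr" ends with "jr"
di regexs(0)                        // displays "jr", the substring the previous regexm matched
di regexm("smith", "(jr|sr)$")      // displays 0; regexs(0) should not be trusted after a failed match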
DOB Parsing
Parsing Date of Birth (DOB) is quite simple in STATA thanks to its built-in date functions. Assuming the DOB column is a string containing year, month and day (in that order), we can easily parse it like the following:
replace dob_yyyy = year(date(DOB, "YMD"))
replace dob_mm = month(date(DOB, "YMD"))
replace dob_dd = day(date(DOB, "YMD"))
Note that the date function first converts the date string into a STATA date value, and the year, month and day functions then extract the relevant components. If the DOB variable is not stored as a string, you would have to convert it to a string before running the code above. Also notice that the second argument of the date function is the format of the date (e.g. YMD, MDY, ...), so make sure you are specifying the right one. Real-world data, however, often has multiple formats mixed in one column, which requires careful vetting before parsing.
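As a rough sketch of both situations (assuming a hypothetical numeric DOB stored as yyyymmdd, plus a second date format mixed into the same column):
tostring DOB, replace format(%8.0f)                          // convert a numeric yyyymmdd value into a string
gen dob_clean = date(DOB, "YMD")                             // try the expected format first
replace dob_clean = date(DOB, "MDY") if missing(dob_clean)   // fall back to a second format where the first one failed
format dob_clean %td                                         // display the result as a human-readable date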
Parsing Address
Typical addresses come in the following format:
2301 County Street, Ann Arbor, MI 40000 apt 220
where it starts with a street address, followed by city, state, zip code and more granular information such as an apartment number. If the data comes with these components already separated into different columns, mapping them into the address schema we defined earlier is easier. But assuming the data came with the full address in one column, the STATA code for parsing the address information is as follows:
gen raw_addr = ADDRESS
replace raw_addr = strtrim(stritrim(lower(raw_addr)))
gen bldnum = strtrim(regexs(0)) if regexm(raw_addr, "^([0-9]+)[ ]")
gen street = raw_addr
replace street = strtrim(stritrim(subinstr(street, bldnum, "", 1))) if !missing(bldnum)
replace ZIP = "" if ZIP == "0"   // zip codes that are just 0 are probably missing zip codes that have been filled in with 0s by the agency
gen addr_raw = raw_addr
replace addr_bldnum = bldnum
replace addr_str = street
replace addr_city = strtrim(stritrim(lower(CITY)))
replace addr_st = strtrim(stritrim(lower(STATE)))
replace addr_zip = ZIP
replace addr_zip = substr(addr_zip, 1, 5)   // keep only the first five digits in case ZIP+4 style values are present
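Once the fields are filled in, a quick eyeball check of a few rows (a minimal sketch) helps confirm the parsing behaves as expected:
list addr_bldnum addr_str addr_city addr_st addr_zip in 1/5, abbreviate(12)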
Similar to name parsing, we make use of regular expressions to parse addresses. Note that STATA's classic regex functions (regexm, regexs, regexr) support a more limited syntax than what you may be used to in other programming languages (shortcuts like \d are not available, for example), while the newer Unicode functions such as ustrregexm follow ICU regular expression syntax, which is much closer to what this Python documentation for regular expressions covers, so that is still a good reference for diving further into it.
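For instance, both of the lines below should display 1, but the second relies on ICU-style syntax that only the Unicode functions understand:
di regexm("48109", "^[0-9]+$")      // classic regexm: character classes, +, *, ? and anchors
di ustrregexm("48109", "^\d{5}$")   // ustrregexm: ICU syntax with \d shortcuts and {n} quantifiers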
Hope this helped! Please consider following me for more articles in the future!
About the Author
Data Scientist. Working as a research associate at the Criminal Justice Administrative Records System (CJARS) economics lab at the University of Michigan. Former Data Science Intern at Spotify Inc. (NYC). Incoming PhD student in Informatics. He loves sports, working out, cooking good Asian food, watching K-dramas, making / performing music and, most importantly, worshiping Jesus Christ. Check out his website!