How to Quickly Anonymize Personal Names in Python

Photo by Julian Hochgesang on Unsplash.

Eventually, most data scientists will handle datasets with personal information. Personnel data is highly sensitive, and aggregation of such data can reveal privileged information about an individual or an organization. The Federal Trade Commission’s (FTC) guide for Protecting Personal Information elaborates further [1]:

[If] sensitive data falls into the wrong hands, it can lead to fraud, identity theft, or similar harms [resulting in] losing your customers’ trust and perhaps even defending yourself against a lawsuit.

Thus, data scientists who fail to safeguard personnel data will have short-lived careers. Fortunately, several options exist within Python to anonymize names and easily generate fake personnel data. Follow the examples below and download the complete Jupyter notebook with code examples at the linked GitHub page.

The Scenario – Test Scores

First, let’s create a scenario. Suppose a professor has a set of test scores from a recent exam, but wishes to obscure the student names when discussing exam trends with the class. To set up this scenario, the Python library Faker lets us generate fake data, including names [2]. Generating a notional dataset is simple:

# Import Faker
from faker import Faker
faker = Faker()
# Test fake data generation
print("The Faker library can generate fake names. By running 'faker.name()', we get:")
# Print the name explicitly so it also appears outside a notebook
print(faker.name())

This code should provide a fake name, which will change with each execution. An example output is below:

Screenshot by author.

That only gets us one name, however; to generate an example dataframe of ten students with ten random test scores, run the following code:

# Import pandas and NumPy
import numpy as np
import pandas as pd
# Create a list of fake names
fake_names = [faker.name() for _ in range(10)]
df = pd.DataFrame(fake_names, columns=['Student'])
# Generate random test scores from 50 to 99 (the upper bound is exclusive)
df['TestScore'] = np.random.randint(50, 100, df.shape[0])
# Optional - Export to CSV
df.to_csv('StudentTestScores.csv', index=False)

The resultant dataframe is:

Screenshot by author.

This will serve as the student test results data. For the scenario, the randomly generated names above represent the real names of the students.

1. Anonymization via AnonymizeDF

AnonymizeDF is a Python library capable of generating fake data, including names, IDs, numbers, categories, and more [3]. Here’s an example code block to generate fake names:

# Anonymize DF
from anonymizedf.anonymizedf import anonymize
# Wrap the dataframe in an anonymizer object
anon = anonymize(df)
# AnonymizeDF can generate fake names
anon.fake_names("Student")

The resultant output is:

Screenshot by author.

AnonymizeDF can also create fake identifications. An example follows:

anon.fake_ids("Student")
Screenshot by author.

AnonymizeDF can also create categories. This feature takes the column name and adds a number to it. An example follows:

anon.fake_categories("Student")
Screenshot by author.

AnonymizeDF provides a powerful set of options for data scientists looking to obscure and anonymize user names, and is easy to use. But there are alternatives for those seeking other options.

2. Anonymization via Faker

Similar to AnonymizeDF, Faker is a Python library that will generate fake data ranging from names to addresses and more [4]. Faker is very easy to use:

# Import Faker
from faker import Faker
faker = Faker()
# Seed the generator so the fake names are reproducible
Faker.seed(4321)
# Map each unique real name to a fake name
dict_names = {name: faker.name() for name in df['Student'].unique()}
df['New Student Name'] = df['Student'].map(dict_names)
Screenshot by author.
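
One caveat with a dictionary built this way is that a random name generator can, in principle, return the same name twice, in which case two students would share a pseudonym. The sketch below shows one way to guard against that. The helper `build_pseudonym_map`, the stand-in generator, and the sample names are hypothetical and not part of Faker; the helper accepts any zero-argument callable that returns a name.

```python
import random

def build_pseudonym_map(real_names, make_name, max_attempts=1000):
    """Map each distinct real name to a distinct pseudonym.

    make_name is any zero-argument callable that returns a candidate
    pseudonym. Candidates that were already handed out, or that match
    a real name, are rejected and redrawn.
    """
    real = set(real_names)
    used = set()
    mapping = {}
    for name in sorted(real):  # sorted for a repeatable assignment order
        for _ in range(max_attempts):
            candidate = make_name()
            if candidate not in used and candidate not in real:
                break
        else:
            # Every attempt collided, so give up rather than loop forever
            raise RuntimeError("could not find an unused pseudonym")
        used.add(candidate)
        mapping[name] = candidate
    return mapping

# Stand-in generator with a deliberately tiny pool so collisions occur
rng = random.Random(4321)
pool = ["Alex Reed", "Sam Cole", "Kim Park", "Lee Moss"]

def tiny_generator():
    return rng.choice(pool)

students = ["Ann Lee", "Bob Ray", "Cy Dunn", "Ann Lee"]
pseudonyms = build_pseudonym_map(students, tiny_generator)
print(pseudonyms)
```

With Faker installed, passing `faker.name` as `make_name` should work the same way. Faker also documents a `unique` proxy (for example `faker.unique.name()`) that serves a similar purpose.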

Faker also has some unique capabilities, such as creating a fake address. For example:

print(faker.address())
Screenshot by author.

3. Custom Built Word Scrambler

In addition to third-party libraries, homebuilt solutions are also an option. These range from word scramblers to functions that replace names with random words or numbers. An example scrambler function is below:

# Scrambler
from random import shuffle
# Create a scrambler function
def word_scrambler(word):
    word = list(word)
    shuffle(word)
    return ''.join(word)

Apply this function to the dataframe with the following code:

df['ScrambledName'] = df.Student.apply(word_scrambler)
df['ScrambledName'] = df['ScrambledName'].str.replace(" ","")

This yields the following dataframe:

Screenshot by Author.

There are some limitations to this approach. First, an individual with prior knowledge of the student names could deduce who is who based on the capitalized letters in the scrambled name, which represent the initials. Second, the scrambled letters are not as clean in appearance or as interpretable as a pseudonym. Further customization could scrub capital letters or generate random numbers in place of names; the most appropriate choice depends on the scenario and needs of the customer.
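
As a rough illustration of those two customizations, the sketch below lowercases names before scrambling and, separately, assigns each distinct name a random number. The function names, the ID range, and the sample names are assumptions made for this example.

```python
import random

def scramble_lowercase(name, rng):
    """Shuffle the letters of a name after removing spaces and case."""
    letters = [ch for ch in name.lower() if not ch.isspace()]
    rng.shuffle(letters)
    return "".join(letters)

def assign_random_ids(names, rng, low=1000, high=9999):
    """Give every distinct name a distinct random integer ID."""
    distinct = sorted(set(names))
    # sample() draws without replacement, so no two names share an ID
    ids = rng.sample(range(low, high + 1), k=len(distinct))
    return dict(zip(distinct, ids))

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
names = ["Maria Lopez", "John Smith", "Maria Lopez"]

scrambled = [scramble_lowercase(n, rng) for n in names]
id_map = assign_random_ids(names, rng)
ids = [id_map[n] for n in names]
print(scrambled)
print(ids)
```

Note that a lowercase scramble is still an anagram of the real name, so it hides initials but not the letters themselves. The random IDs carry no information about the name, at the cost of needing a stored mapping to reverse them.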

4. Putting it All Together: Anonymize, Clean, and De-Anonymize the Data Frame

Once a technique is chosen, applying it to a dataframe, cleaning the frame, and storing a "key" is quite simple. Consider the original dataframe of student test scores from the beginning:

Screenshot by author.

Let’s use AnonymizeDF to create anonymous names. The following code block will:

  • Generate fake names.
  • Create a CSV "Key" containing the real and fake names.
  • Drop the real names from the original dataframe.
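  • Present a clean, anonymized dataframe that is structurally indistinguishable from the original.

# Create the Fake Student Names
anon = anonymize(df)
anon.fake_names('Student')
# Create a "Key" holding the real and fake names
dfKey = df[['Student', 'Fake_Student']]
# index=False keeps the row index out of the CSV
dfKey.to_csv('key.csv', index=False)
# Overwrite the real names with the fake names
df = df.assign(Student = df['Fake_Student'])
# Drop the now-redundant fake name column
df = df.drop(columns='Fake_Student')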

The output is the following dataframe:

Screenshot by author.

De-anonymizing the data is a simple matter of loading the CSV "Key" and mapping the fake student names back to the original student names:

# Load in the decoder key
dfKey = pd.read_csv('key.csv')
# Return to the original Data
df['Student'] = df['Student'].map(dfKey.set_index('Fake_Student')['Student'])

This is what it looks like in Jupyter Notebook (notebook downloadable at the linked GitHub page):

Screenshot by author.
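
To check that a key really does recover the original data, the round trip can be tested on a small frame. The sketch below uses hypothetical names and a hand-written key in place of AnonymizeDF output, and an in-memory buffer in place of key.csv. It also checks that the key is one-to-one, since a repeated fake name would make the reverse lookup ambiguous.

```python
import io

import pandas as pd

# Hypothetical data standing in for the student test scores
original = pd.DataFrame({
    "Student": ["Ann Lee", "Bob Ray", "Cy Dunn"],
    "TestScore": [88, 72, 95],
})
key = pd.DataFrame({
    "Student": ["Ann Lee", "Bob Ray", "Cy Dunn"],
    "Fake_Student": ["Pat Hill", "Jo West", "Max Vane"],
})

# The key must be one-to-one, otherwise the reverse lookup is ambiguous
key_is_one_to_one = key["Student"].is_unique and key["Fake_Student"].is_unique

# Anonymize: swap real names for fake ones
anonymized = original.copy()
anonymized["Student"] = anonymized["Student"].map(
    key.set_index("Student")["Fake_Student"]
)

# Round-trip the key through CSV text (the buffer replaces key.csv)
buffer = io.StringIO()
key.to_csv(buffer, index=False)
buffer.seek(0)
loaded_key = pd.read_csv(buffer)

# De-anonymize and compare against the original names
restored = anonymized.copy()
restored["Student"] = restored["Student"].map(
    loaded_key.set_index("Fake_Student")["Student"]
)
roundtrip_ok = restored["Student"].tolist() == original["Student"].tolist()
print(key_is_one_to_one, roundtrip_ok)
```

Because the key file reverses the anonymization, it should be stored as carefully as the original names.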

Conclusion

It is inevitable that a data scientist will encounter datasets with personal information, the safeguarding of which is critical for protecting individuals and organizations. The simple anonymization techniques highlighted above provide a means to quickly generate fake data as placeholders to protect individuals.

However, for certain datasets, simply anonymizing a name might be insufficient. Other data points such as addresses or personal attributes could allow a third party to reconstruct the identity associated with an observation. Thus, more complex datasets will require advanced anonymization and de-identification techniques, and in some cases synthetic data may be the best route for conducting analysis while protecting personal information.
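
One quick way to see this risk is to count how many records are unique on a combination of the remaining attributes. The sketch below is a minimal illustration with made-up records; the column names and the helper `unique_combinations` are assumptions for this example.

```python
from collections import Counter

# Hypothetical records whose names have already been replaced
records = [
    {"name": "Pat Hill", "zip": "20001", "birth_year": 2001},
    {"name": "Jo West", "zip": "20001", "birth_year": 2001},
    {"name": "Max Vane", "zip": "20002", "birth_year": 1999},
    {"name": "Sky Dale", "zip": "20003", "birth_year": 2001},
]

def unique_combinations(rows, columns):
    """Return the attribute combinations that occur in exactly one row."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]

exposed = unique_combinations(records, ["zip", "birth_year"])
print(exposed)
```

Here two of the four records are the only ones with their ZIP code and birth year, so anyone who already knows those two facts about a person could single out that row despite the fake name.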

References:

[1] Federal Trade Commission, Protecting Personal Information: A Guide for Business (2016).

[2] Faker PyPI, Faker 13.0 (2022), Python Package Index.

[3] Anonymize DF, Anonymizedf 1.0.1 (2022), Python Package Index.

[4] Faker PyPI, Faker 13.0 (2022), Python Package Index.

