Data Anonymization ETL on Azure using Presidio

Create data pipelines to remove private data from text

Avishay Balter
Towards Data Science



Data anonymization, or de-identification, is a crucial part of certain systems and a core requirement for many organizations. Scenarios enabled by the ability to mask, hash, or even replace PII entities with mocked ones in text and images include moving data to the cloud or to partners, generating test data from real data, and storing data for AI/ML processing.

The following article provides a walkthrough of how to use Azure and Presidio to create a fully functional ETL process that moves a set of documents from one location to another while scrubbing, replacing, hashing, or encrypting the PII entities in the text.

Presidio and the ecosystem

Some readers may already be aware of Presidio, the PII identification and anonymization library maintained by Microsoft’s Commercial Software Engineering group as an open source project. The team recently released a new version, V2.0, with the key goals of simplifying the evaluation of Presidio and enabling integration into data pipelines.

As part of the V2.0 release, the Presidio team provides different deployment samples, such as Kubernetes (the native and only deployment option in Presidio V1), Azure App Service (which the team uses for its internal V2.0 development and production environments), and others.

This post focuses on how to integrate Presidio with Azure Data Factory using the built-in “Data Anonymization with Presidio” template, which the team developed in collaboration with the Azure Data Factory product group. It is meant for data engineers who already have an instance of Azure Data Factory and a storage account whose content they would like to anonymize of PII data.

Additionally, the Presidio ecosystem includes the presidio-research repository, where the team maintains advanced scenarios that use Presidio. For instance, the presidio_evaluator library uses Presidio to generate fake data based on an analysis of real data.

The Building Blocks of the Azure ETL

The scenario covered in this post takes a generic input data set, stored in an Azure Storage blob container, and uses Azure services to move that set to another location, where the content of the files has been anonymized of all PII entities.

The Azure services which are used are:

· Azure Data Factory — A cloud-based ETL service that hosts the transformation pipeline.

· Azure Storage — Provides the ETL persistence layer.

· Azure Key Vault — Stores the Azure Storage access keys used by the ETL in a secure way.

· Azure App Service — Hosts Presidio as an HTTP REST endpoint.

The following diagram displays the relationship between the parts of the ETL system where Presidio is used as an HTTP endpoint.

Data Anonymization ETL with Presidio as an HTTP endpoint

1. Read data set from source.

2. Get an access key for Azure Storage from Azure Key Vault (see the sketch after these steps).

3. Send the text value of each document in the set to be anonymized by Presidio.

4. Save the anonymized text to a randomly named text file on Azure Blob Storage.
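Outside of Data Factory, step 2 maps to a simple secret lookup. The following is a minimal Python sketch of the equivalent call using the azure-keyvault-secrets SDK; the vault URL and secret name are hypothetical placeholders, not values taken from the template:

    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    # Authenticate with the ambient Azure identity (az login, managed identity, etc.)
    credential = DefaultAzureCredential()

    client = SecretClient(
        vault_url="https://my-presidio-kv.vault.azure.net",  # hypothetical vault name
        credential=credential,
    )

    # Fetch the storage SAS token the pipeline uses to write the output blobs
    sas_token = client.get_secret("storage-sas-token").value  # hypothetical secret name

In the pipeline itself this lookup is performed by the GetSASToken activity described later, so no code is required; the sketch only shows what that step does.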

The input to the data anonymization pipeline is a set of documents that contain PII text, such as:

Text input file
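For illustration, a hypothetical input document might look like this (the name and contact details are invented):

    Hello, my name is David Johnson. I live in Kiryat Yam.
    You can reach me at 212-555-0199 or at davidjohnson@example.com.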

An output file from the pipelines should look like this:

Text output file
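Presidio’s default anonymizer replaces each detected entity with its entity-type placeholder, so for the hypothetical input above the output would look roughly like this:

    Hello, my name is <PERSON>. I live in <LOCATION>.
    You can reach me at <PHONE_NUMBER> or at <EMAIL_ADDRESS>.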

Provisioning the Azure services

If you already have an Azure Data Factory instance, browse to the template gallery from the Azure Data Factory UI and search for the template named “Data Anonymization with Presidio”. Follow the instructions on the screen to provision the prerequisite infrastructure and to connect your Azure Storage input container.

Data Anonymization with Presidio Template

If you do not have an instance of Azure Data Factory, you can follow the Presidio end-to-end samples, which showcase two modes of integrating Presidio into Data Factory (as an HTTP endpoint and as a Databricks Spark job). These samples also provision Azure Data Factory and set up the required pipeline without using the Data Anonymization with Presidio template.

Running the sample

From the Azure Data Factory UI, open the pipeline named Anonymize.

Anonymize Pipeline Activities


  • GetFileList — Gets the list of files from the source container.
  • FilterFiles — Filters directories out of the list so that only files are processed.
  • ForEachFile — A For-Each loop with an execution clause for each document in the array.
  • GetSASToken — Gets the SAS token from Azure Key Vault; it is used later for writing to the target blob container.
  • LoadFileContent — Loads the content of a text file into a dataset.
  • PresidioAnalyze — Sends the text to the Presidio analyzer endpoint (see the sketch after this list).
  • PresidioAnonymize — Sends the response from the Presidio analyzer to the Presidio anonymizer endpoint.
  • UploadBlob — Saves the anonymized response from Presidio to a text file on the target Azure Blob Storage.
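To make the two Presidio activities concrete, here is a minimal Python sketch of the same two HTTP calls the pipeline performs. The App Service URLs are placeholders, and the request shapes follow the Presidio V2 REST API:

    import requests

    analyzer_url = "https://my-presidio-analyzer.azurewebsites.net"      # placeholder
    anonymizer_url = "https://my-presidio-anonymizer.azurewebsites.net"  # placeholder

    text = "Hello, my name is David Johnson."

    # PresidioAnalyze: detect PII entities in the text
    analyzer_results = requests.post(
        f"{analyzer_url}/analyze",
        json={"text": text, "language": "en"},
    ).json()

    # PresidioAnonymize: replace the detected entities (default replace operator)
    anonymized = requests.post(
        f"{anonymizer_url}/anonymize",
        json={"text": text, "analyzer_results": analyzer_results},
    ).json()

    print(anonymized["text"])  # e.g. "Hello, my name is <PERSON>."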

Hit the debug button and fill in the following pipeline parameters:

  • SourceStore_Location — Source container name.
  • DestinationStore_Name — Target account name.
  • DestinationStore_Location — Target container name. Defaults to presidio, a container created when the ARM template was provisioned.
  • KeyVault_Name — Azure Key Vault name.
  • Analyzer_Url — Analyzer App Service URL.
  • Anonymizer_Url — Anonymizer App Service URL.
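For example, a debug run against a hypothetical environment could look like this (all values other than the presidio default are placeholders):

    SourceStore_Location: documents-in
    DestinationStore_Name: mystorageaccount
    DestinationStore_Location: presidio
    KeyVault_Name: my-presidio-kv
    Analyzer_Url: https://my-presidio-analyzer.azurewebsites.net
    Anonymizer_Url: https://my-presidio-anonymizer.azurewebsites.net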

As the pipeline runs, you should see files being created in the target blob container, with their content anonymized of PII.
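If you prefer to verify the run from code rather than from the portal, the following sketch uses the azure-storage-blob Python SDK to list the output files; the connection string is a placeholder:

    from azure.storage.blob import ContainerClient

    # Connect to the target container the pipeline writes to
    container = ContainerClient.from_connection_string(
        conn_str="<your-storage-connection-string>",  # placeholder
        container_name="presidio",
    )

    # List the anonymized output files created by the pipeline
    for blob in container.list_blobs():
        print(blob.name)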

Conclusions

Presidio V2.0 brings big changes, and a great opportunity for the Presidio community to leverage the platform in more ways than before.

The sample detailed in this post can be easily modified to fit other ETL scenarios, for instance:

  • Change the anonymization method — You can change the anonymization mechanism used in this sample from the simple replace implementation to any of the supported anonymizers, either as the default anonymizer for all entity types or per entity type. You can also encrypt PII data and reopen it later using decrypt (see the sketch after this list).
  • Other input/output — Read a JSON array from another source, such as MongoDB or Elasticsearch, and move the scrubbed data to another instance.
  • Hybrid workflows — Use the preview of hybrid Azure Data Factory workloads with on-premises Hortonworks Hadoop clusters to anonymize data before it leaves the organization’s premises.
  • Stream data — Change the notebook to use Spark streaming and implement an analyze-only pipeline that runs on in-flight messages while they travel from one event source to another.
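As an example of the first item, here is a hedged sketch of an anonymize request that hashes every entity by default but encrypts person names. The URL and encryption key are placeholders, and the operator names follow the Presidio V2 anonymizer API:

    import requests

    anonymizer_url = "https://my-presidio-anonymizer.azurewebsites.net"  # placeholder

    payload = {
        "text": "Hello, my name is David Johnson.",
        "analyzer_results": [
            # In the pipeline this list comes from the analyzer response
            {"entity_type": "PERSON", "start": 18, "end": 31, "score": 0.85}
        ],
        "anonymizers": {
            "DEFAULT": {"type": "hash"},                               # hash everything by default
            "PERSON": {"type": "encrypt", "key": "WmZq4t7w!z%C&F)J"},  # placeholder AES key
        },
    }

    response = requests.post(f"{anonymizer_url}/anonymize", json=payload).json()
    print(response["text"])

Values encrypted this way can later be restored with the anonymizer’s deanonymize operation, using the same key.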

We invite you to join the growing Presidio community by contributing samples, documentation, bug fixes, or issues, or simply to contact the team at presidio@microsoft.com if you have questions or would like to talk to us about your high-scale scenario using Presidio. We’d love to hear from you!
