Data Privacy
GDPR principles directly affect the storage, processing, and use of personal data in your Data Analytics platform.
The General Data Protection Regulation, or GDPR as it is known across the world, is enough to worry anyone who collects the personal data of customers for their business. However, it goes well beyond the famous "right to be forgotten" that comes to mind the moment people hear the word GDPR.
This regulation lays down rules relating to the protection of natural persons with regard to the processing of personal data and rules relating to the free movement of personal data.
As per the definition provided in Article 4(1) of the GDPR:
‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person;

In this multi-part blog series, I present my interpretation of the official GDPR text in the context of an Enterprise Data and Analytics platform, and what it would mean for the Architecture and implementation of such a platform. Let’s look at how I interpret the principles of GDPR.
Article 5 states that personal data should be processed lawfully, fairly, and in a transparent manner in relation to the data subject. Collection and processing of data are limited by purpose: only data relevant to your defined purpose should be collected. The accuracy of the data has to be ensured, with mechanisms in place to keep the data up to date and to correct or erase inaccurate data. Once the purpose has been met, the data should no longer be stored in a form that identifies the data subject, except for the archiving, research, and statistical purposes the regulation allows. You also need to ensure that data is stored securely and is protected against unauthorized access, processing, loss, or damage.
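To make the storage limitation idea concrete, here is a minimal Python sketch of a retention check. Everything in it is an assumption made for illustration: the `StoredRecord` fields, the purposes, and the retention periods in `RETENTION_DAYS` are hypothetical, and the actual periods would come from your own legal basis and policies, not from this code.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Hypothetical retention periods per processing purpose, in days.
RETENTION_DAYS = {"billing": 365, "marketing": 30}


@dataclass
class StoredRecord:
    record_id: str
    purpose: str
    # Date on which the purpose was fulfilled; None while it is still active.
    purpose_met_on: Optional[date]


def is_due_for_erasure(record: StoredRecord, today: date) -> bool:
    """Return True once the purpose is met and the retention period has passed."""
    if record.purpose_met_on is None:
        return False
    # Unknown purposes get no grace period in this sketch.
    keep_for = timedelta(days=RETENTION_DAYS.get(record.purpose, 0))
    return today > record.purpose_met_on + keep_for
```

A scheduled job could run such a check over the platform and hand the flagged records to an erasure or archiving process.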
Following the principles of Article 5, I would design an Architecture for my Data platform with a strong focus on the following points:
- A metadata-driven design with a strong Data Catalog that identifies what is stored in the Data platform and where it was sourced from, together with definitions, ownership, and a detailed lineage from each reporting data point back to its true source.
- The Data platform should encrypt data at rest. In addition, it should provide detailed logging and auditing of data access and processing, with Role-based access control implemented. Any sensitive data through which a data subject can be identified directly or indirectly, such as name, identification number, location data, online identifier, or any information specific to the physical, physiological, genetic, mental, economic, cultural, or social identity of the data subject, should be masked or anonymized using a Data anonymization framework.
- Accuracy of data should be handled at the source system, that is, the place where the contract with the data subject is captured, serviced, and maintained. It is the responsibility of both the data subject, to pass on changes or report inaccuracies, and the system owner, to act on them. It is important that any such change or inaccuracy is captured as soon as possible, that the data platform is updated accordingly, and that corrections are handled with proper data audit and logging. The Data platform should therefore have either a batch or a real-time Data Extraction and Ingestion framework to source such changes from the source systems.
- A strong focus should be put on setting up a Data Quality (DQ) framework with a proactive feedback loop that highlights any data inaccuracy or anomaly early in the chain. To me, it makes a lot of sense to include Data Observability (DO) as part of the Data Quality framework, to monitor the freshness, volume, distribution, schema, and lineage of the data being brought into the data platform. A combination of DQ and DO will help bring greater transparency to the storage and processing of the data in your data platform.
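As an illustration of the masking point in the list above, the following is a minimal Python sketch that replaces identifying fields with a keyed hash (HMAC-SHA256) from the standard library. The field names in `SENSITIVE_FIELDS` and the helper names are hypothetical; in a real platform the list of sensitive columns would more likely come from the Data Catalog. Note that this is pseudonymization rather than full anonymization: anyone holding the key can recompute the mapping, so the output should still be treated as personal data.

```python
import hashlib
import hmac

# Hypothetical column names flagged as identifying; illustrative only.
SENSITIVE_FIELDS = {"name", "national_id", "email"}


def pseudonymize(value: str, key: bytes) -> str:
    """Return the HMAC-SHA256 digest of the value as a 64-character hex string."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_record(record: dict, key: bytes) -> dict:
    """Return a copy of the record with sensitive fields replaced by pseudonyms."""
    masked = {}
    for field, value in record.items():
        if field in SENSITIVE_FIELDS and value is not None:
            masked[field] = pseudonymize(str(value), key)
        else:
            masked[field] = value
    return masked
```

Because the same key and input always produce the same digest, masked values can still be used to join datasets, which is often what an analytics platform needs. The key itself would have to be stored and access-controlled outside the data platform.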
GDPR does not necessarily mean that you cannot store any personal data. It means that you should be in control of your data and able to tell your data subject and the relevant supervisory authority what information is stored within your data systems for a particular data subject. You should be able to secure access to the data and be responsible for its accuracy. From what I understand, an overarching and strong Data Governance framework is essential to building GDPR-compliant data platforms.
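One way to picture being "in control" is a catalog that can report where personal data lives. The sketch below is only an assumption about how such metadata might be modelled in Python: `CatalogEntry`, its fields, and `personal_data_inventory` are hypothetical names, not the API of any particular catalog product.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class CatalogEntry:
    dataset: str            # e.g. a table or file set in the platform
    column: str
    source_system: str      # where the data was sourced from
    owner: str
    is_personal_data: bool  # flag maintained by data governance


def personal_data_inventory(catalog: Iterable[CatalogEntry]) -> Dict[str, List[str]]:
    """Group the columns flagged as personal data by the dataset that holds them."""
    inventory: Dict[str, List[str]] = {}
    for entry in catalog:
        if entry.is_personal_data:
            inventory.setdefault(entry.dataset, []).append(entry.column)
    return inventory
```

With an inventory like this, a request about a particular data subject can be narrowed to the datasets and columns that actually hold personal data, instead of searching the whole platform.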
In the next post of this series, we will look into the rights of the data subject and their impact on the Architecture and design of your Data platform.
References
General Data Protection Regulation (GDPR), Official Legal Text, gdpr-info.eu
Barr Moses, "Data Observability: The Next Frontier of Data Engineering", Towards Data Science