
The Ethics of Data Collection

Is your data ethically sourced?

So you’re ready to collect some data and start modeling, but how can you be sure that your data has been ethically sourced?

CW: I talk about mental health and suicide prevention in a section below.

Photo by Giammarco Boscaro on Unsplash

The Current Data Protection Landscape

The Health Insurance Portability and Accountability Act, known as HIPAA, was passed in 1996 to protect sensitive, identifying personal health data after medical treatment. The goal was strict "need to know" sharing of medical data unless the patient signed a consent form for a particular use. There were some exceptions in the interest of the "common good", including gunshot and stab wounds, crime-related injuries, possible abuse cases, and infectious diseases.

A later supplement, the Omnibus Final Rule of 2013, updated HIPAA to add heavier financial penalties for organizations that violate the law, patients’ rights to access their electronic records, and the inclusion of genetic data under HIPAA protection. As Dr. Weisse notes, while complete control over access to personal medical records is the "holy grail of privacy rights advocates", our current systems of medical administration and insurance make this impossible.

While these laws are necessary and work in theory, in practice they have led to great confusion on both sides of the patient-physician relationship. Additionally, like much of the legislation governing emerging technologies (see facial recognition, Siri always listening, …), HIPAA is woefully inadequate for covering technologies not yet established or imagined.

The recent European Union legislation, the General Data Protection Regulation (GDPR), goes much further in protecting personal data. There has been much discussion over the efficacy of the law, but there is no doubt that it is one of the most stringent data protection laws in the world. Unlike HIPAA or other US data protection laws, the GDPR requires organizations to use the highest possible privacy settings by default and limits data processing to six lawful bases, including consent, vital interests, and legal obligation.

Furthermore, no data can be collected until explicit consent for that purpose has been given, and that consent can be retracted at any time. This means that a single Terms of Service agreement cannot give a company free rein over a user’s data indefinitely. Organizations that violate the GDPR face heavy fines, up to 20 million euros or 4% of the previous year’s total revenue, whichever is higher. As an example, British Airways faced a proposed fine of 183 million pounds after poor security led to a skimming attack targeting 500,000 of its customers.


Where These Measures Fall Short

Facebook’s Suicide Algorithm

In 2017, Facebook began scraping users’ social media content without consent in order to build a Suicide Prevention tool after a series of live-streamed suicides. Setting aside the non-consensual collection, one would think that assessments of mental health, depression, and suicidal ideation would be classified as sensitive health information, right? Well, according to HIPAA, because Facebook is not a healthcare organization, it is not subject to the field’s regulations.

This is a clear, yet at the time understandable, miss. When HIPAA was written, it seemed reasonable that only healthcare organizations would have access to this kind of protected health information (PHI). With the advent of sophisticated Artificial Intelligence and tech giants with near-limitless resources, private, non-healthcare organizations are now attempting to innovate in the medical field without direct oversight.

What makes the implications concrete are the 3,500 cases of Facebook contacting law enforcement after their system flagged a user as suicidal. In one case, law enforcement even sent the user’s personal information to the New York Times, a clear breach of privacy.

The European Union’s GDPR effectively banned Facebook’s collection methods as explicit permission is required from users in order to collect mental health information. While Facebook’s program does have the potential for good, the next steps for its ethical, effective use are ambiguous.

23andMe’s Genetic Data

Another case of regulation falling short is the popular genetic and ancestry testing company 23andMe – again not subject to HIPAA – and its selling of users’ genetic information to pharmaceutical companies. There is a real risk of insurance companies using customers’ genetic data to identify pre-existing conditions before any symptoms emerge. This practice has been outlawed in some situations, for health insurance specifically, but not for life or disability insurance.

Some ethically ambiguous situations have already emerged from this practice. One example is Huntington’s Disease, a late-onset brain disorder caused by a single defective gene. The Huntington’s Disease Society of America has an entire guide on choosing whether to get genetically tested, because, while it is technically illegal for insurance companies to use the results, there is always a risk that this information could be misused.


The Future and You

As technology continues to stride forward, we will inevitably hear more stories of regulatory misses. It is vital that governments stay current with the implications of emerging innovations and with how to protect citizens’ data privacy in a world increasingly devoid of it.

As a data scientist, you must be cognizant of how your data is collected and utilized. Here is a short list of questions to ask of yourself and your model:

  1. Consent: users must give explicit consent for each and every new usage of their personal data. This is a legal dependency in some jurisdictions, but a good practice in all cases.
  2. Transparency: especially in cases with concrete repercussions, can you explain how your model and data process is arriving at a decision?
  3. Accountability: evaluate the potential harm of a model and work to limit said harm. What is the potential for the model to be misinterpreted, both in good and bad faith?
  4. Anonymity: how will a user’s identifying information be protected throughout all stages of the data science process? Who, at any point, has access to this data? Does identifying data even need to be in the dataset? If not, remove it.
  5. Bias: what steps have been taken to understand the potential bias in a data set? Could even missing values be a proxy for bias? See Redlining. A minimal sketch of points 4 and 5 follows this list.
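
To make the anonymity and bias checks concrete, here is a minimal pandas sketch. The DataFrame and its columns (`name`, `email`, `zip_code`, `income`, `group`) are hypothetical stand-ins, not data from any of the cases above: the idea is simply to drop direct identifiers the model does not need, then check whether missing values cluster in one group, which can itself act as a proxy for bias.

```python
import pandas as pd

# Hypothetical user dataset -- all column names and values are illustrative only.
df = pd.DataFrame({
    "name": ["A. Smith", "B. Jones", "C. Lee", "D. Patel"],
    "email": ["a@example.com", "b@example.com", "c@example.com", "d@example.com"],
    "zip_code": ["10001", "60601", "94105", "73301"],
    "age": [34, 51, 28, 45],
    "income": [52000.0, None, 61000.0, None],
    "group": ["urban", "rural", "urban", "rural"],
})

# Anonymity (point 4): drop direct identifiers the model does not need.
identifying_columns = ["name", "email", "zip_code"]
model_df = df.drop(columns=identifying_columns)

# Bias (point 5): check whether missingness clusters in one group --
# a missing value can itself act as a proxy for an under-served population.
missing_by_group = model_df.groupby("group")["income"].apply(lambda s: s.isna().mean())
print(missing_by_group)
# group
# rural    1.0
# urban    0.0
# Every rural record is missing income here, which would warrant a closer look.
```

In a real project you would likely go further: hash or tokenize quasi-identifiers such as postal codes rather than keeping them in cleartext, and run the missingness check across every sensitive feature, not just one.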

Sources

[1] T. Truyen, W. Luo, D. Phung, S. Gupta, S. Rana, et al., A framework for feature extraction from hospital medical data with applications in risk prediction (2014), BMC Bioinformatics 15: 425–434.

[2] D. Wade, Ethics of collecting and using healthcare data: Primary responsibility lies with the organisations involved, not ethical review committees (2007), The BMJ 334: 1330–1331.

[3] S. Mann, J. Savulescu, and B. Sahakian, Facilitating the ethical use of health data for the benefit of society: Electronic health records, consent and the duty of easy rescue (2016), Philosophical Transactions of the Royal Society 374: 1–17.

[4] A. Weisse, HIPAA: a flawed piece of legislation (2014), Baylor University Medical Center Proceedings 27 (2): 163–165.

