Safety Considerations For Working With Data

Published in

Towards Data Science

6 min read · Apr 11, 2020

Introduction

When working on data projects, scientists and analysts are often focussed on so many different elements that security can easily be forgotten. However, when security is forgotten, it might backfire, big time. A security breach can have many undesirable consequences, like loss of data, hijacked cloud resources, legal problems, hacking, clients leaving and much more.

This all sounds a bit dramatic, but luckily there are very simple steps you can implement that already prevent a host of issues. Since I get a lot of questions about this, I have decided to bundle a few key points into a post.

Some Considerations

Always Use Two-Factor Authentication

Two-factor authentication (or 2FA) is a simple principle where your credentials are not limited to “something you know” (your password), but extended with “something you are” (biometric data) and/or “something you have” (phone/token/…).

2FA is usually very easy to implement and, given the current technology (FaceID, smartwatches,…), it is rather hassle free. Implementing 2FA will greatly limit the risk of unauthorized access, since even a stolen password is useless without the second factor. If you don’t already have 2FA in place on your most important accounts (email, cloud storage etc.), I would strongly advise you to set it up today.
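To make the “something you have” factor a bit more concrete, here is a minimal sketch of how the six-digit codes in an authenticator app can be computed, following the published HOTP/TOTP schemes (RFC 4226 and RFC 6238). The function names and the shared secret are assumptions for illustration only. For anything real, rely on the 2FA offered by your provider or on a maintained library rather than on a snippet like this.

```python
import hashlib
import hmac
import struct
import time


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password (RFC 4226) for a given counter value."""
    # Sign the 8-byte big-endian counter with the shared secret.
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low 4 bits of the last byte pick an offset.
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10**digits).zfill(digits)


def totp(secret: bytes, at=None, step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password (RFC 6238): HOTP over 30-second windows."""
    now = time.time() if at is None else at
    return hotp(secret, int(now // step), digits)


def verify(secret: bytes, submitted: str, at=None) -> bool:
    """Check a submitted code using a constant-time comparison."""
    return hmac.compare_digest(totp(secret, at), submitted)
```

The phone and the server both hold the same secret, so a stolen password alone is not enough: an attacker would also need the code that is valid for the current 30-second window.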

Do Not Store Data On a Local Machine

I get it, storing data on a local machine is easy and fast. However, the owner of the data (client, business partner,…) will not be happy with you storing the data locally. There is a saying in IT security: “Physical Access = Game Over.” Although the reality is more nuanced, a stolen or lost laptop or drive can have serious consequences. For this reason, cloud storage with the appropriate security measures in place is preferred.

Aside from major platforms like AWS and Azure, there are plenty of other options for storing your data in the cloud.

If you have no alternative to storing your data locally, you should always encrypt your drive in addition to locking your computer.

Physical access should be highly restricted. (Image source: pexels.com)

Use A Password Manager

I am often surprised by the number of people who are still sceptical of password managers, even though they are more often than not the safer choice. Password managers store thousands of complex passwords, autofill them for you and generate new ones whenever you need them.

Beyond my work as a data professional, using a password manager has also greatly improved my personal convenience. For example, nowadays you need a user account for almost any webshop. My password manager simply suggests a difficult and unique password for the website and fills it in every time I visit. If the website were hacked and my information leaked, it would not mean much, since that password was only ever used for that one site. Don’t forget to add 2FA to your password manager itself.
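To give an idea of what “difficult and unique” means in practice, the sketch below generates a random password with Python’s standard `secrets` module. It only illustrates the principle and is not how any particular password manager works. The function name, the default length and the character set are my own assumptions.

```python
import secrets
import string

# Assumed character set: letters, digits and punctuation.
ALPHABET = string.ascii_letters + string.digits + string.punctuation


def generate_password(length: int = 20) -> str:
    """Return a random password drawn from a cryptographically secure source."""
    if length < 4:
        raise ValueError("length must be at least 4")
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        # Retry until every character class is present at least once.
        if (any(c.islower() for c in candidate)
                and any(c.isupper() for c in candidate)
                and any(c.isdigit() for c in candidate)
                and any(c in string.punctuation for c in candidate)):
            return candidate
```

Calling `generate_password()` once per website gives every account its own password, which is exactly what limits the damage when a single site is breached.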

Apply IP Restrictions To Your DB or Notebooks

tHiS one sIMpLe Trick prEVENTS Data THEft, HAckeRS Hate HIm!

Jokes aside, when hosting a database like MySQL or a Jupyter Notebook on a server, you can easily add IP restrictions. This means the service can only be accessed from certain locations like your home, office, client etc. Even though it is not foolproof, it is a very simple and highly effective way to prevent unwanted access.

Be careful though: most consumer internet plans come with a dynamic IP address. Applying IP restrictions to a dynamic IP might result in locking yourself out.
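In practice you would normally configure IP restrictions in the firewall or security settings of your hosting platform. Purely to illustrate the idea, here is a small sketch of an allowlist check with Python’s standard `ipaddress` module. The network ranges are reserved documentation addresses and `is_allowed` is a hypothetical helper, not part of MySQL or Jupyter.

```python
import ipaddress

# Hypothetical allowlist; these are documentation-only address ranges.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),    # e.g. the office network
    ipaddress.ip_network("198.51.100.17/32"),  # e.g. a single home address
]


def is_allowed(client_ip: str) -> bool:
    """Return True only if the client address falls inside an allowed network."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        # Anything that is not a valid IP address is rejected.
        return False
    return any(address in network for network in ALLOWED_NETWORKS)
```

A single address (`/32`) is the entry that breaks when a dynamic IP changes, which is how people end up locking themselves out.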

Keep Track

“To measure is to know.” Keeping track is an often disregarded aspect of data security, yet its importance cannot be discounted. By keeping track, I mean:

who knows the password?
when were the passwords last changed?
who has IP access?
are logins (and failed attempts) being logged?
…

Keeping track will not only help prevent a loss of data (for example, by revoking access for those who no longer need it), but it may also help diagnose what went wrong in the event of data loss.
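Logging logins and failed attempts does not need much. The sketch below uses Python’s standard `logging` module. The logger name `access_audit` and the `record_login` helper are assumptions for illustration, and where the log ends up (file, central log service,…) depends on your own setup.

```python
import logging

# Hypothetical logger name; attach handlers to it wherever your app starts.
audit_log = logging.getLogger("access_audit")


def record_login(user: str, source_ip: str, success: bool) -> None:
    """Write one audit line per login attempt, failed attempts as warnings."""
    level = logging.INFO if success else logging.WARNING
    outcome = "success" if success else "failure"
    audit_log.log(level, "login user=%s ip=%s outcome=%s",
                  user, source_ip, outcome)


if __name__ == "__main__":
    # Example configuration: timestamped lines on the console.
    logging.basicConfig(
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
        level=logging.INFO,
    )
    record_login("alice", "203.0.113.42", success=True)
    record_login("alice", "198.51.100.9", success=False)
```

With timestamps on every line, a log like this answers both “who got in?” and “who tried?” when something goes wrong.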

Do Not Hardcode Passwords

Do not, I repeat, do not hardcode passwords. Not once, not briefly, not internally, never. Even though the risk of hardcoding is limited when you have implemented the previous steps (e.g. 2FA, IP restrictions), there is a good chance you haven’t, and then it will likely backfire. It often goes like this:

Coder hardcodes database passwords because it’s easier. For now there seems to be no risk, because the code is only for internal use and the password will be removed later.
Coder forgets to remove it.
The code ends up on a private GitHub or in some web script.
A crawler picks it up.
Chaos ensues…

I am lucky to have never experienced this myself, but I have heard plenty of stories. There is, however, one good thing about hardcoding passwords: you will only do it once. Once a crawler picks it up and wreaks havoc on your database, cloud instance or web server, you are unlikely to repeat the mistake.
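One common alternative to hardcoding is to read credentials from environment variables, so that they never appear in the code itself. The sketch below assumes variables named `DB_USER`, `DB_PASSWORD` and `DB_HOST`. Those names and both helper functions are hypothetical, and a dedicated secrets manager is another option for larger setups.

```python
import os


def get_required_env(name: str) -> str:
    """Return the value of an environment variable or fail loudly."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


def load_db_credentials() -> dict:
    """Collect database credentials without storing them in the code."""
    return {
        "user": get_required_env("DB_USER"),
        "password": get_required_env("DB_PASSWORD"),
        # The host is not secret, so a default is acceptable here.
        "host": os.environ.get("DB_HOST", "localhost"),
    }
```

If the script ends up somewhere public, a crawler only finds variable names, not the password. Failing loudly on a missing variable also avoids the temptation to add a “temporary” default password.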

Understand The Risk

For the last part, I will be a bit less strict. When it comes to security, it’s all about understanding the risk. Of course, ideally, you would implement all these measures and many more, but it might not be worth the effort. You need to distinguish between a simple personal project and a script for a big client with confidential data. The measures should match the risk, and it is better to overestimate the risk than to underestimate it.

Lastly, you also need to understand the difference between data and insights. Most of the time (but not always!) raw data requires the highest degree of protection, whereas insights can be shared with fewer restrictions. I am emphasizing this difference because it is often difficult for non-technical clients to comply with your security standards. If the insights don’t carry the same risk as the data itself, you can judge for yourself whether you can simply email them or share them via other channels. Data often loses its confidentiality once rows are aggregated.
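As a small illustration of the difference between data and insights, the sketch below turns a few made-up rows with customer names into per-region averages. The rows, field names and the `average_by` helper are all invented for this example.

```python
from collections import defaultdict
from statistics import mean

# Made-up raw data: each row identifies an individual customer.
rows = [
    {"customer": "Customer A", "region": "North", "order_value": 120.0},
    {"customer": "Customer B", "region": "North", "order_value": 80.0},
    {"customer": "Customer C", "region": "South", "order_value": 200.0},
    {"customer": "Customer D", "region": "South", "order_value": 300.0},
]


def average_by(rows, group_key, value_key):
    """Aggregate rows into one average per group, dropping all other fields."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {group: round(mean(values), 2) for group, values in groups.items()}


insight = average_by(rows, "region", "order_value")
```

The result only contains regions and averages, no customer names. Keep the “but not always!” in mind though: a group with only one or two rows can still reveal an individual value.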

Conclusion

“Data is the new oil,” and as a data professional, you will often be responsible for storing and safekeeping a lot of this valuable oil. While it might sound daunting to carry this responsibility, there are a lot of simple ways to reduce the risk. Implementing some or all of the steps described in this post will make it very difficult for those who are not supposed to access your data to do so. However, do keep in mind:

No measure is 100% guaranteed.
Your security is only as strong as its weakest point.

The list of considerations in this article is not exhaustive. There are plenty of other measures you could (and maybe should) implement, such as encryption, protection against injection vulnerabilities, hashing and many more.

When you are dealing with data of truly great importance or confidentiality, I suggest you consult a certified IT auditor who can guide you through the steps you ought to implement.

About me: My name is Bruno and I work as a data scientist, currently specializing in Power BI. You can connect with me via my website: https://www.zhongtron.me