As you may or may not know, a large portion of Data Science is working with bad data.
I had a lot of fun writing this, so hopefully you get a good kick out of it too. Here are 10 instances where I had to work with extremely messy data. I’m sure many of you will be able to relate to a lot of these points!
1) USA, US, or United States?
Problem: I made this the first point because I think it’s something that many of us can relate to. I’ve never understood why an application would let users spell their country however they want instead of giving them a searchable list, because it inevitably leads to this problem.
I once worked with geographical data and had to deal with differently spelled countries, e.g., United States, USA, US, United States of America.
Solution: We created a mapping table to solve the problem, but that meant that we had to constantly update it to address any new variations that came into the system.
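To give a rough idea of what that mapping looked like, here’s a minimal sketch in pandas. The variants and the fallback behaviour are just assumptions for illustration, not our actual table:

```python
# A minimal sketch of the mapping-table idea (variants are illustrative).
import pandas as pd

country_map = {
    "usa": "United States",
    "us": "United States",
    "united states": "United States",
    "united states of america": "United States",
}

df = pd.DataFrame({"country": ["USA", "US", "United States of America", "Canada"]})

# Normalize casing/whitespace, then map; unmapped values fall back to the raw
# input so new variations can be spotted and added to the table later.
cleaned = df["country"].str.strip().str.lower()
df["country_clean"] = cleaned.map(country_map).fillna(df["country"])
print(df)
```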
2) Is the first day of the week Sunday or Monday?
Problem: This was one of the most frustrating problems I’ve ever encountered – there were two different tables with information aggregated by week, and each table covered a different metric.
The issue was that one table assumed that the first day of the week was Sunday while the other table assumed that the first day of the week was Monday, which meant that these two tables were incomparable.
Imagine not being able to properly compare your revenue to your costs because you can’t match time periods.
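To make the mismatch concrete, here’s a tiny sketch with made-up dates and numbers showing how the same daily data lands in different weekly buckets depending on which day the week starts:

```python
# Same daily data, bucketed by Sunday-start vs Monday-start weeks.
import pandas as pd

daily = pd.DataFrame(
    {"revenue": [100, 200, 300]},
    index=pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-08"]),  # Sun, Mon, Sun
)

# 'W-SAT' weeks end on Saturday (i.e. start on Sunday); 'W-SUN' weeks end on
# Sunday (i.e. start on Monday). The same rows land in different weekly totals.
print(daily.resample("W-SAT").sum())
print(daily.resample("W-SUN").sum())
```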
Solution: There wasn’t.
3) Missing IDs for users
I think one of the first things that you learn in data architecture or SQL is the concept of primary keys, so I found this one quite mind-boggling.
Problem: Believe it or not, there were multiple instances where I had to work with user data that didn’t have reliable primary keys for each user. In other words, I had to work with tables where IDs were missing for some users. To make it worse, there wasn’t any other identifier that we could’ve used, e.g., emails.
Solution: There wasn’t really a way to work around this aside from raising it with the data engineering team, so that was that. And if you think that’s bad, it gets even better…
4) Joining tables WITHOUT using primary and foreign keys, Part 1
Normally, if you want to join two tables, there’s typically a column that is in both tables that can be used to join them. For example, user ID might be in each table, allowing you to perform a join.
Problem: There was a time where I had to join two different tables without primary and foreign keys.
Solution: We ended up joining the two tables by computing a similarity score via Euclidean distance on a cross-joined table and matching the rows that were most similar to each other.
Everything about that felt wrong, but we didn’t have any other option.
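For anyone curious, here’s a rough sketch of that kind of approach. The tables, columns, and values are hypothetical, and in practice you’d also want to scale the features before computing distances:

```python
# Cross-join two keyless tables, score every pair by Euclidean distance over a
# few shared numeric columns, and keep the closest match for each row of A.
import numpy as np
import pandas as pd

table_a = pd.DataFrame({"age": [25, 40], "sessions": [10, 3], "spend": [120.0, 45.0]})
table_b = pd.DataFrame({"age": [39, 26], "sessions": [4, 11], "spend": [50.0, 118.0]})

features = ["age", "sessions", "spend"]

pairs = table_a.reset_index().merge(table_b.reset_index(), how="cross", suffixes=("_a", "_b"))
diffs = pairs[[f + "_a" for f in features]].to_numpy() - pairs[[f + "_b" for f in features]].to_numpy()
pairs["distance"] = np.sqrt((diffs ** 2).sum(axis=1))

# Best match per row of table_a.
best = pairs.loc[pairs.groupby("index_a")["distance"].idxmin()]
print(best[["index_a", "index_b", "distance"]])
```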
5) 50,000 does not equal 50,105
Problem: I’m sure this is also pretty relatable – working with inconsistent data. For example, one table may have told me that our monthly revenue was $50,000 but another table with similar information might have said that our monthly revenue was $50,105.
This is such a simplified example, but it can be incredibly frustrating when working with data that doesn’t reconcile because you don’t know what to trust!
Solution: Our short-term solution was to use the table that was more commonly used throughout the company. The long-term solution was actually figuring out why the numbers didn’t match and essentially rebuilding the pipelines.
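For illustration, here’s a tiny sketch of what comparing the two tables might look like (the table names and numbers are made up):

```python
# Flag months where the two sources disagree on revenue.
import pandas as pd

finance_table = pd.DataFrame({"month": ["2023-01", "2023-02"], "revenue": [50000, 61200]})
product_table = pd.DataFrame({"month": ["2023-01", "2023-02"], "revenue": [50105, 61200]})

merged = finance_table.merge(product_table, on="month", suffixes=("_finance", "_product"))
merged["diff"] = merged["revenue_finance"] - merged["revenue_product"]

# Any non-zero diff means the two pipelines disagree and needs investigating.
print(merged[merged["diff"] != 0])
```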
6) Psh. Who cares about logic
Problem: One of the worst things to deal with is working with data where the logic is compromised. To give an example, imagine there’s a record that shows that a given user uninstalled the same app twice without re-installing it in between.
Solution: I couldn’t do much on my end, aside from raising a red flag and getting the data engineers to dive into it. Sometimes, that’s all you can really do!
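As an example of the kind of sanity check that surfaces this, here’s a small sketch with made-up event data that flags back-to-back uninstalls for the same user and app:

```python
# Find users/apps with two consecutive 'uninstall' events and no 'install' in between.
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "app_id": ["a", "a", "a", "a", "a"],
    "event": ["install", "uninstall", "uninstall", "install", "uninstall"],
    "ts": pd.to_datetime(["2023-01-01", "2023-01-05", "2023-01-09", "2023-01-02", "2023-01-06"]),
})

events = events.sort_values(["user_id", "app_id", "ts"])
prev_event = events.groupby(["user_id", "app_id"])["event"].shift()

# Rows where the previous event for the same user/app was also an uninstall.
broken = events[(events["event"] == "uninstall") & (prev_event == "uninstall")]
print(broken)
```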
7) Do I use Table A, B, or C? Yes.
Problem: I’ve surprisingly had this problem more than once, where I had to work with data that was spread across several different tables and sources – for example, one table lived in a Google Sheet, while the others were ingested into Google BigQuery from the web application itself.
Solution: This wasn’t too big of a problem, but it just meant that we had to make sure that we didn’t forget about the Google Sheet every time that we wanted to work with this data.
8) Missing data? Just impute the mean, right?…
Everyone’s dealt with missing data, but not all missing data can be addressed the same way. More important than handling missing data is understanding why the data was missing in the first place.
Handling missing data in a work environment is nothing like Kaggle. When it comes to Kaggle, you simply take the data for what it is, but when it comes to the real world, it’s a big problem if you notice missing data that shouldn’t be missing.
Problem: We wanted to get installation metrics for various applications, so we built the data pipelines, only to find out that data was missing for half of the applications we wanted metrics for.
Solution: Long story short, we rebuilt the pipelines several times using different sources of data, only to find out that the problem was related to how an installation got tracked – in essence, half of the applications had an installation button on the website, while the other half had to be installed through a completely different process, and only the ones with the installation button got tracked. So that had to get addressed in order to track what we wanted.
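As a simple illustration (with made-up data) of why profiling the missingness mattered more than imputing it, a check like the one below makes the pattern obvious: the gaps line up with how installs were tracked.

```python
# Before imputing anything, look at WHICH apps are missing install counts.
import numpy as np
import pandas as pd

installs = pd.DataFrame({
    "app": ["app_a", "app_b", "app_c", "app_d"],
    "install_count": [1200, np.nan, 950, np.nan],
    "has_install_button": [True, False, True, False],
})

# Share of missing values per tracking method. Missingness lines up perfectly
# with how installs are tracked: a structural gap in the pipeline, not
# something a mean imputation would fix.
print(installs.groupby("has_install_button")["install_count"].apply(lambda s: s.isna().mean()))
```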
9) Typos, typoes, tpyos
Problem: At one point, my entire job consisted of cleaning and manipulating data that was inputted by clients, and that meant **** TYPOS. It was so bad that some clients spelled their own names wrong, some entered their emails without the @ symbol, and so forth.
Solution: I ended up building some functions to make sure that things like emails, postal codes, and provinces were valid. That addressed approximately 80% of the inconsistencies. As for the remaining 20%, that had to be fixed manually; it couldn’t be helped.
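To give a flavour of what those helpers looked like, here’s a simplified sketch. The patterns below are rough, assume Canadian postal codes and two-letter province codes, and are nowhere near as thorough as the real thing:

```python
# Rough validation helpers for client-entered fields (illustrative only).
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
POSTAL_RE = re.compile(r"^[A-Za-z]\d[A-Za-z]\s?\d[A-Za-z]\d$")  # e.g. 'M5V 2T6'
PROVINCES = {"AB", "BC", "MB", "NB", "NL", "NS", "NT", "NU", "ON", "PE", "QC", "SK", "YT"}

def is_valid_email(value: str) -> bool:
    return bool(EMAIL_RE.match(value.strip()))

def is_valid_postal_code(value: str) -> bool:
    return bool(POSTAL_RE.match(value.strip()))

def is_valid_province(value: str) -> bool:
    return value.strip().upper() in PROVINCES

print(is_valid_email("no-at-symbol.example.com"))  # False: missing the @
print(is_valid_postal_code("M5V 2T6"))             # True
print(is_valid_province("ontario"))                # False: expects the 2-letter code
```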
10) Joining tables WITHOUT using primary and foreign keys, Part 2
FYI this involved using Microsoft Excel, but it still deals with bad data!
Problem: Again, there was a time that I had to join two tables together by the names of the financial securities, but they didn’t follow the same naming conventions and they also didn’t have unique identifiers.
What made this worse than Part 1 was that it happened way before I learned about similarity scores, so there was only one solution.
Solution: I had to manually match the rows with each other. I sorted each table alphabetically and moved rows around whenever there was a mismatch.
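For what it’s worth, knowing what I know now, even a basic string-similarity match would have saved a lot of that manual work. Here’s a small sketch with made-up security names:

```python
# Pair up differently-named securities using simple string similarity.
from difflib import get_close_matches

table_a_names = ["Apple Inc.", "Microsoft Corporation", "Alphabet Inc Class A"]
table_b_names = ["APPLE INC", "MICROSOFT CORP", "ALPHABET INC CL A"]

for name in table_a_names:
    # Compare in lowercase so casing differences don't dominate the score.
    match = get_close_matches(name.lower(), [n.lower() for n in table_b_names], n=1, cutoff=0.5)
    print(name, "->", match)
```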
Thanks for Reading!
Hopefully this gives you a clearer idea of the variety of problems that exist in the real world. These 10 examples are just a snippet of the different types of scenarios and issues that you might run into.
As you may have noticed, some problems will be easier to solve than others. In some cases, the problems will be completely outside of your control, which can lead to frustration, but know that that’s a normal part of the job.
I hope you had as much fun reading this as I did writing this! Let me know in the comments any other examples of bad data that you had to deal with.