10 Examples of Awful Data that I had to work with as a Data Scientist

A glimpse into the frustrating dark side of working with data

Terence Shin, MSc, MBA
Towards Data Science

Photo by Tim Gouw on Unsplash

As you may or may not know, a large portion of data science is working with bad data.

I had a lot of fun writing this, so hopefully you get a good kick out of it too. Here are 10 instances where I had to work with extremely messy data. I’m sure many of you will be able to relate to a lot of these points!

1) USA, US, or United States?

Problem: I made this the first point because I think many of us can relate to it. I never understood why an application would let users spell their country however they want instead of giving them a searchable list, because free-text entry leads to exactly this problem.

I once worked with geographical data and had to deal with differently spelled countries, e.g. United States, USA, US, and United States of America.

Solution: We created a mapping table to solve the problem, but that meant that we had to constantly update it to address any new variations that came into the system.
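
To make the idea concrete, here is a minimal sketch of what such a mapping table could look like in Python. This is not the code we actually used; the names COUNTRY_MAP, normalize_key, canonical_country, and unmapped_variants are all made up for illustration, and the table only holds the four spellings mentioned above.

```python
from typing import Iterable, List, Optional

# Hypothetical mapping table: cleaned-up variant -> canonical country name.
# In practice this would probably live in a database table, not in code.
COUNTRY_MAP = {
    "united states": "United States",
    "usa": "United States",
    "us": "United States",
    "united states of america": "United States",
}


def normalize_key(raw: str) -> str:
    """Lowercase, drop periods, and collapse whitespace so 'U.S.A.' matches 'usa'."""
    return " ".join(raw.replace(".", "").lower().split())


def canonical_country(raw: str) -> Optional[str]:
    """Return the canonical name, or None if the spelling is not mapped yet."""
    return COUNTRY_MAP.get(normalize_key(raw))


def unmapped_variants(values: Iterable[str]) -> List[str]:
    """Collect the raw spellings that still need an entry in the mapping table."""
    return sorted({v for v in values if canonical_country(v) is None})
```

Something like unmapped_variants is where the "constantly update it" part shows up: run it on each new batch of data, and whatever it returns is a spelling that someone has to add to the table by hand.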

2) Is the first day of the week Sunday or Monday?


Responses (7)


That all sounds just too familiar

--

I chuckled while reading. Great article!

--

Solution: There wasn’t really a way to work around this aside from addressing it to the data eng team, so that was that.

This is the solution to all the problems in this blog … a data engineering pipeline that addresses these data gremlins so you, the data scientist, can focus on what you do best.

--