When Harvard Business Review called data scientist the "sexiest job of the 21st century," I doubt "sexy" included the mundane task of data cleaning. While many believe data science is all about using machine learning algorithms to build models and drive business impact, data cleaning is just as essential a part of being a data scientist.
To those unfamiliar with the data cleaning process, it can be as simple as checking for missing values, fixing formatting, and standardizing values – but this only scratches the surface. Today I want to discuss my experience using a real-life example that goes beyond the basics to give you an idea of what to expect as a data scientist.
Background
I had just started at a new company and was asked to calculate the visitor-to-signup rate by marketing channel because this information wasn’t available in Google Analytics. As a new hire, I was unfamiliar with the data, and after some preliminary research I concluded there was sufficient data to proceed with the analysis. I realized I had spoken too soon once I started looking at the data in more detail and encountered a few stumbling blocks.
Solving The Missing Link
My first problem was connecting visitors to signups. There was no common key linking website visitors to user signups: visitors were assigned a cookie ID, and if they signed up they were assigned a user ID, but the signup table didn’t store the cookie ID 😕
Before going back to marketing to inform them this task was impossible, I decided to explore alternatives to link the data. I reviewed all of the data fields between the two sets of data to identify a common key I could use and came upon IP address. This wasn’t a perfect solution since one visitor could have multiple IP addresses but marketing had no idea what the signup rates were by channel. They needed directional insights to decide where to focus their efforts in the short term until engineering fixed the link between visitors and signups. Otherwise, they could end up wasting time optimizing a poor-performing channel.
To make sure IP address was a viable alternative, I checked that the overall visitor-to-signup rate I calculated was similar to the rates in other marketing reports. If the rates were way off, I would’ve had to tell marketing the analysis wasn’t possible until the data was fixed.
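As a sketch of this linkage in pandas: the join on IP address and the sanity-check rate calculation might look like the following, where every table name, column name, and value is a hypothetical stand-in for the real data, not the actual schema I worked with.

```python
import pandas as pd

# Hypothetical visitor and signup data; all names and values here
# are illustrative assumptions.
visitors = pd.DataFrame({
    "cookie_id": ["c1", "c2", "c3", "c4"],
    "ip_address": ["1.1.1.1", "2.2.2.2", "3.3.3.3", "4.4.4.4"],
    "channel": ["organic", "paid", "organic", "email"],
})
signups = pd.DataFrame({
    "user_id": [101, 102],
    "ip_address": ["1.1.1.1", "3.3.3.3"],
})

# Left join on IP address, the improvised common key.
linked = visitors.merge(signups, on="ip_address", how="left")

# Overall visitor-to-signup rate, used as a sanity check against
# the numbers in existing marketing reports.
overall_rate = linked["user_id"].notna().mean()

# Directional rate by channel, the metric marketing actually asked for.
rate_by_channel = linked.groupby("channel")["user_id"].apply(
    lambda s: s.notna().mean()
)
```

If `overall_rate` had been far from the figures in existing reports, that would have been the signal to stop and wait for a proper visitor-to-signup link.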
Takeaway:
- Data in the real world is imperfect and you may need to come up with alternatives to provide directional insights in the short term until a long-term fix is available. In my case, I had to use IP address to link visitors to users.
- Check that your calculated rates are similar to existing reports before proceeding. It’s possible your proposed solution will generate incorrect results and you’ll need to wait for the data to be fixed.
- Communicate data limitations to stakeholders to set the expectation that the insights are directional and should be recalculated after the data is fixed.
Data Discovery and Removing Duplicates
The next step was to create an IP address to user ID mapping table. This sounded simple in theory until I looked at the data and discovered there were two user tables and each one had duplicate user ID records. In addition, the user tables had different coverage, meaning one user ID could be in two tables or just one.
First I had to dedupe the user table to make sure there was just one user ID record per table. I reviewed the duplicate records for one user ID at a time to assess what values were different. I noticed the earliest user record matched the visitor IP address on the signup date. I proceeded to dedupe the user data by selecting the earliest record by create date.
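The dedupe step above can be sketched in pandas as follows; the table, column names, and dates are hypothetical, and the business rule is the one described above: keep the earliest record per user ID.

```python
import pandas as pd

# Hypothetical user table with duplicate user_id rows; names and
# values are illustrative assumptions.
users = pd.DataFrame({
    "user_id": [101, 101, 102],
    "ip_address": ["1.1.1.1", "9.9.9.9", "3.3.3.3"],
    "created_at": pd.to_datetime(["2023-01-01", "2023-02-15", "2023-01-05"]),
})

# Keep the earliest record per user_id, since that record held the
# IP address matching the visitor on the signup date.
deduped = (
    users.sort_values("created_at")
         .drop_duplicates(subset="user_id", keep="first")
)
```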
Next, I combined the two user tables with a full outer join by user ID to create the complete list of users. I ran another duplicate check and removed records where an IP address mapped to multiple users, because I couldn’t definitively tell which user ID corresponded to that visitor record.
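A minimal pandas sketch of this combine-and-filter step, assuming two already-deduped user tables; the table names, columns, and values are invented for illustration:

```python
import pandas as pd

# Two deduped user tables with different coverage; one user (102)
# appears in both, others in just one. All names are assumptions.
historical = pd.DataFrame({
    "user_id": [101, 102],
    "ip_address": ["1.1.1.1", "2.2.2.2"],
})
recent = pd.DataFrame({
    "user_id": [102, 103, 104],
    "ip_address": ["2.2.2.2", "3.3.3.3", "3.3.3.3"],
})

# Full outer join by user_id to build the complete user list.
merged = historical.merge(recent, on="user_id", how="outer",
                          suffixes=("_hist", "_new"))
merged["ip_address"] = merged["ip_address_hist"].fillna(merged["ip_address_new"])

# Drop IP addresses that map to more than one user ID -- they can't
# be attributed to a single visitor with confidence.
mapping = merged[["user_id", "ip_address"]]
mapping = mapping[~mapping.duplicated(subset="ip_address", keep=False)]
```

Here the shared IP `3.3.3.3` knocks out users 103 and 104, leaving only unambiguous IP-to-user mappings.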
Takeaway:
- For data with multiple sources, run a coverage check to determine if you need to create a new table. I noticed one table only contained historical data while new users were updated in another table. This meant I had to create a new table to get the complete list of users created.
- Review values of duplicate records to identify differences, if any, to determine the business logic to apply for deduping. In my case, I took the earliest record because that contained the right IP address to match to visitors. It’s possible all values are the same and you can dedupe the table by selecting any record.
- Always check for duplicates in source tables and new ones created from source data. You may end up calculating the wrong results if there are duplicates you didn’t catch. If I kept the records where one IP address mapped to multiple users I might have reported the wrong signup rates by channel.
Apply Business Logic To Generate New Data
After joining visitors to users, I needed to group by marketing channel. Unfortunately, there was no marketing channel field available and I had to derive it from other fields in the visitor data. I researched how Google Analytics defined marketing channels and used similar business logic to align with existing marketing reports.
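As a sketch of that kind of derivation, here is a simplified rule set in the spirit of Google Analytics’ default channel grouping. The field names and the exact rules are my own illustrative assumptions, not GA’s full definitions or the logic I actually shipped.

```python
import pandas as pd

# Simplified channel rules inspired by GA's default channel grouping;
# field names and rule details are illustrative assumptions.
def derive_channel(source: str, medium: str) -> str:
    if medium == "organic":
        return "Organic Search"
    if medium in ("cpc", "ppc", "paidsearch"):
        return "Paid Search"
    if medium == "email":
        return "Email"
    if source == "(direct)" and medium in ("(none)", "(not set)"):
        return "Direct"
    return "Other"

visits = pd.DataFrame({
    "source": ["google", "google", "(direct)", "newsletter"],
    "medium": ["organic", "cpc", "(none)", "email"],
})
visits["channel"] = [
    derive_channel(s, m) for s, m in zip(visits["source"], visits["medium"])
]
```

Reusing definitions stakeholders already know means the derived channels line up with the reports they read every day.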
Takeaway:
- If you’re unsure of the business logic to derive values, research existing reports stakeholders use and apply similar definitions. This will help them make sense of your numbers as they’re already familiar with definitions used in existing reports.
Final Thoughts
Tableau defines data cleaning as "the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset." In my case, I used IP address to fix the link from visitors to signups, removed duplicates, and applied business logic to fill in missing channel data. While data cleaning isn’t as sexy as building machine learning models, it’s an essential part of being a data scientist, and I hope you now have a better idea of what to expect in the real world.
You might also like…
How to Present Machine Learning Results to Non-Technical People