Although there are many resources for learning technical skills, few go in-depth on how to clean data, an essential skill for data analysts. You might think you can apply the same set of rules to every dataset, but that’s not always the case. Today I’d like to share what I’ve learned over the years as a data analyst about how to approach cleaning different types of data for analysis and reporting.
Numeric Values
By numeric values, I mean those useful for data analysis and reporting. A good rule of thumb is to ask whether the average of the field would be useful. For example, the average of a numeric order number field is meaningless, but the average revenue amount is useful.
Numbers Stored as Numeric Fields
For numbers saved in numeric fields, apply the following cleaning rules:
- Calculate the minimum, maximum, median, 99th percentile, and average. If the minimum is negative but values should be zero or higher, replace the negatives with zero where applicable. In the sample sales data, the median was $800 while the average was $20,560, a large gap. If the median and average, or the maximum and the 99th percentile, differ substantially, I check for outliers, especially if I’m unfamiliar with the data. Reporting average sales including the single $100,000 outlier would give $20,560 instead of the $560 you’d get by excluding it. This is why it’s good to check for outliers and exclude them, especially if you plan to report averages or use the data to build machine learning models, since outliers can skew model results. Prioritize your time on the most important fields; you may have dozens to check, and reviewing all of them would be too time-consuming. (A query sketch of these checks follows this list.)
- Count missing and non-missing values. If the missing values outnumber the non-missing, the field may not be usable in your analysis. For example, if you have 1,000 records and 900 are missing, the 100 non-missing values probably won’t be useful. If the missing count is smaller, replace missing values with zero where applicable. If missing means something other than zero, replace it with a value you deem more appropriate, such as the median or average (after removing outliers). For example, I once had a day of missing DAU (daily active user) data. Since DAU varied with the time of year, I replaced the missing day with the average of the prior 7 days.
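To make these checks concrete, here is a minimal sketch in BigQuery SQL. It assumes a hypothetical sales table with a numeric sales_amount column; both names are placeholders for your own schema.

```sql
-- Summary statistics plus missing/non-missing counts for a numeric field.
-- APPROX_QUANTILES(x, 100) returns 101 values (the 0th through 100th
-- percentiles), so OFFSET(50) is the median and OFFSET(99) is the 99th.
SELECT
  MIN(sales_amount)                               AS min_amount,
  MAX(sales_amount)                               AS max_amount,
  APPROX_QUANTILES(sales_amount, 100)[OFFSET(50)] AS median_amount,
  APPROX_QUANTILES(sales_amount, 100)[OFFSET(99)] AS p99_amount,
  AVG(sales_amount)                               AS avg_amount,
  COUNTIF(sales_amount IS NULL)                   AS missing_count,
  COUNTIF(sales_amount IS NOT NULL)               AS non_missing_count
FROM sales;
```

A big gap between median_amount and avg_amount, or between p99_amount and max_amount, is the outlier signal described above.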
Numbers Stored as String Fields
Numeric values can be loaded into a database as string values when they contain characters that must be removed before the values can be converted to numbers. To identify these kinds of string fields, the rule of thumb again is to ask whether the average would be useful.
Examples of these kinds of string values are sales amounts that include a currency symbol, such as $100.55, or numbers containing commas, such as 1,000. In these instances, remove the extra characters (delete the currency sign to get 100.55 and the comma to get 1000) and then convert the string value to a number, as sketched below. After the string values are converted to numeric, apply the data cleaning rules noted above for numbers saved in numeric fields.
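A minimal sketch of the conversion in BigQuery SQL, assuming a hypothetical raw_sales table with a string amount_text column (both names are placeholders):

```sql
-- Strip currency symbols and commas, then cast the string to a number.
-- SAFE_CAST returns NULL instead of erroring on values that still can't
-- be parsed, which makes any leftover bad rows easy to find.
SELECT
  amount_text,
  SAFE_CAST(REGEXP_REPLACE(amount_text, r'[$,]', '') AS NUMERIC) AS amount
FROM raw_sales;
```

Filtering for rows where amount is NULL but amount_text is not quickly surfaces any values the cleanup pattern missed.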
String Values
For string value fields, apply the following cleaning rules:
- Check for duplicates by the table’s primary key. Primary keys are typically unique identifiers, such as a customer ID, or a combination of fields that makes the row unique in a table. When running a data analysis, you want to ensure there are no duplicates in your data sources, because duplicates can cause you to report incorrect numbers. (A duplicate-check query is sketched after this list.)
- Count the number of null and non-null values. Determine whether null values can be replaced with a string value. Sometimes I replace null values with ‘N/A’ or ‘None’ for reporting purposes, but it depends on your requirements.
- Count the number of rows for each unique string value and sort by descending row count. Look for string values with the same meaning and standardize them to a single value.
In the sample SQL query below, I count the number of rows for each item_name value and sort by descending count. Since you may have hundreds of unique string values, focus on the unique values with the highest row counts and clean those first; changing values that cover only a few rows takes too much time and probably won’t impact the results of your analysis.
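A reconstruction of that query, assuming a hypothetical items table (the table name is a placeholder):

```sql
-- Count rows per unique item_name, most frequent values first.
SELECT
  item_name,
  COUNT(*) AS item_count
FROM items
GROUP BY item_name
ORDER BY item_count DESC;
```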
The SQL results show an item_count of 33 for apples, 19 for apple, and 17 for Apples. Apples, apples, and apple can all be standardized to apple. Likewise, strawberries and strawberry can be standardized to strawberry. After data cleaning, the results show an item_count of 69 for apple and 26 for strawberry.
- Search for and remove special characters as needed. This is common in non-English text or free-form text, where tab and carriage-return characters may be loaded into the database as part of the text. Most string fields don’t contain special characters, but if you get no rows returned when querying for a particular string value, special characters may be the cause.
- Date or datetime values can be loaded as string values. Confirm the values in the field all use the same date format, then convert the field to a date or datetime type as applicable so that date functions can be used. (This conversion is included in the sketches below.)
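Sketches of the duplicate check, null replacement, special-character removal, and date conversion above in BigQuery SQL, assuming a hypothetical orders table with customer_id, item_name, and a string order_date column in YYYY-MM-DD format (all names are placeholders):

```sql
-- 1) Duplicate check by primary key: any row returned is a duplicate.
SELECT customer_id, COUNT(*) AS row_count
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 1;

-- 2) Replace nulls with a display value for reporting.
SELECT IFNULL(item_name, 'N/A') AS item_name_display
FROM orders;

-- 3) Strip tab, carriage-return, and newline characters from free text.
SELECT REGEXP_REPLACE(item_name, r'[\t\r\n]', '') AS item_name_clean
FROM orders;

-- 4) Convert a date stored as a string into a DATE value.
-- SAFE.PARSE_DATE returns NULL for strings that don't match the format,
-- which surfaces rows with inconsistent date formats.
SELECT SAFE.PARSE_DATE('%Y-%m-%d', order_date) AS order_date_parsed
FROM orders;
```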
Date and Datetime Values
Apply the following data cleaning rules for tables with date or datetime fields:
- Check for duplicates in tables where there should be only one row per date, or per combination of the date and other fields that make the row unique. For example, a daily active user (DAU) table counts distinct active users per day and should have only one row per date. Another example is revenue by country, where the date and country combined make the row unique in the revenue table. I once inherited an ETL job that didn’t delete existing data for the date it was updating. I accidentally re-ran the job for the same date and introduced duplicates into the revenue table. I caught this only because I ran the duplicate check; otherwise I would’ve overreported revenue to my stakeholders.
- When joining multiple tables with datetime values, confirm the datetime values are in the same timezone. I once had to create a report showing the number of user upgrades in the 24 hours after an email was sent. The email table datetime was in Central Time, but the upgrade table datetime was in UTC (Coordinated Universal Time), which at the time was 5 hours ahead. If I had just checked 24 hours from the email sent datetime in Central Time, I would’ve missed 5 hours of users who upgraded after the email was sent. Fortunately, many databases have a SQL function such as CONVERT_TIMEZONE to convert a datetime between timezones; check your database’s documentation to confirm. (Both checks are sketched below.)
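Both checks sketched in BigQuery SQL, assuming a hypothetical revenue table with revenue_date and country columns and an emails table with a sent_at TIMESTAMP column (all names are placeholders). BigQuery has no CONVERT_TIMEZONE; its DATETIME() function takes a timezone argument instead:

```sql
-- 1) Duplicate check where date + country should be unique:
-- any row returned means the combination appears more than once.
SELECT revenue_date, country, COUNT(*) AS row_count
FROM revenue
GROUP BY revenue_date, country
HAVING COUNT(*) > 1;

-- 2) Convert a UTC timestamp to Central Time before comparing it
-- against datetimes stored in Central Time.
SELECT
  sent_at                              AS sent_at_utc,
  DATETIME(sent_at, 'America/Chicago') AS sent_at_central
FROM emails;
```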
Final Thoughts
While data cleaning is not the favorite task of most data analysts, myself included, I hope the approaches I’ve covered help you the next time you need to clean data for a project. Happy data cleaning!
Note: All SQL queries above were run on the BigQuery sandbox, which is free to anyone with a Google account.