I work as a software engineer and data scientist, with my code living in both notebooks and software packages. If you haven't heard it yet: stop and think before you code. The same concept applies to data cleaning. I have found it valuable to step back from the initial work and think about the problem at hand and the data you will be cleaning. It is also a great idea to consider the end-use case for the data. Do you need it in a report or dashboard? Will the data be used by many or by one? How often will you need to clean this data? With that, I would like to introduce 25 common questions I consider before tackling data cleaning.
Data Ingestion
When I start a project, the first thing I consider is how I will ingest the data before I start cleaning it. Depending on where the data comes from, I may need to perform different cleaning steps. Suppose the data comes from another team; in that case, it may already be clean before I ingest it, leaving less preprocessing for me to do before working with it. A minimal ingestion sketch follows the questions below.
1. Do you need to ingest the data and then clean it, or is it cleaned at the source?
2. If you are reading in files that contain your data, can you clean and overwrite the file, or will you need to save it somewhere else to keep the raw file separate?
3. Do you need to save your cleaned data, or can you keep it in a dataframe and save only the output of your analysis?
4. Do you need a backup of the data somewhere?
5. What happens if the files become corrupted as you are cleaning them? Are you prepared to start over?
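To make questions 2 through 4 concrete, here is a minimal sketch with pandas. The file paths are hypothetical; the point is that the raw file is never overwritten, so it doubles as a backup if cleaning goes wrong.

```python
import pandas as pd

# Hypothetical paths: the raw file stays untouched, and the cleaned
# version is written to a separate location.
RAW_PATH = "data/raw/survey.csv"
CLEAN_PATH = "data/clean/survey.csv"

# Ingest the raw data without modifying the source file.
df = pd.read_csv(RAW_PATH)

# ... cleaning steps go here ...

# Save the cleaned copy separately; if anything becomes corrupted
# mid-cleaning, you can re-read the raw file and start over.
df.to_csv(CLEAN_PATH, index=False)
```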
Empty or Null Fields
After I have received the data, the next step is to understand how to handle empty or NULL values. It is rarely as simple as values appearing plainly as empty or NULL in your dataset. Instead, users will add obscure large values like -9999 or 9999, other characters, or words. It is good to take a step back and understand how users encode empty values in the dataset before deciding how you will clean them out. Once you know how these values are represented, you can begin removing them or imputing values where needed; a short sketch follows the questions below.
6. Are there values that you can treat as empty or NULL, such as -1, -9999, 9999, or other characters?
7. Will these values be imputed as a string, numeric, or category?
8. Will you drop values that are empty or NULL?
9. Can these values be empty or NULL and the data still make enough sense to drive valuable action? If a value is missing, can you still provide actionable insights?
10. Can you work with those who created the data to develop a standard for what is considered empty or NULL?
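As a minimal sketch of questions 6 through 8, assuming pandas and a toy dataframe, sentinel values can be normalized to real NaNs before you decide whether to impute or drop:

```python
import numpy as np
import pandas as pd

# Toy data where -9999, -1, and "?" stand in for missing values.
df = pd.DataFrame({
    "age": [34, -9999, 28, -1],
    "city": ["Austin", "?", "Reno", "Boston"],
})

# Normalize the sentinels to real NaN values first.
df = df.replace({-9999: np.nan, -1: np.nan, "?": np.nan})

# Then decide per column: impute a numeric value...
df["age"] = df["age"].fillna(df["age"].median())

# ...or drop rows where a key field is missing.
df = df.dropna(subset=["city"])
```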
Text Fields
The next thing I look for is text fields. I typically use text fields either as categories, discussed next, or as plain text that will be displayed or used for additional information. But what happens if you are using your text fields for modeling? You may need to consider different cleaning or natural language processing techniques, such as stemming, lemmatization, or removing filler words. I find text fields harder to work with, as there can be variations in spelling, acronyms, mistyped information, and more. A small cleaning sketch follows these questions.
11. Are there spelling mistakes in the column that you need to consider?
12. Can a word or abbreviation be spelled multiple ways?
13. Could there be more than one abbreviation for the same thing?
14. Do you need to extract data from the text field? If so, will you use regular expressions, stemming, lemmatization, filler-word removal, white-space stripping, etc.?
15. Do you have timestamp or other numeric columns that are read in as strings but should be another data type?
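Here is a small pandas sketch of questions 11 through 13 and 15, using made-up state abbreviations; for heavier work like stemming or lemmatization, you would reach for a library such as NLTK or spaCy:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["TX", "tx ", "Texas", "CA"],
    "created": ["2021-01-05", "2021-02-11", "2021-03-20", "2021-04-02"],
})

# Normalize case and white space so spelling variants collapse together.
df["state"] = df["state"].str.strip().str.lower()

# Map multiple abbreviations or spellings of the same thing to one value.
df["state"] = df["state"].replace({"tx": "texas", "ca": "california"})

# Timestamps that load as strings should be parsed into real datetimes.
df["created"] = pd.to_datetime(df["created"])
```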
Categories and Booleans
Categories and booleans are the two easiest data types to deal with when cleaning data, as they tend to have less variation in their columns. For categories, I look at the unique values listed for a given column to better understand what I am working with. Then I can bring that information back to a subject matter expert (SME) as needed to get definitions for categories that may be unclear, such as numeric codes or single-letter values that map to a definition. A short sketch follows the questions below.
16. Will you keep your categories in a human-readable format or convert them using one-hot encoding?
17. Do you have too many or too few categories?
18. Do you have duplicate categories? Duplication can appear due to misspellings or added white space such as 'R' and 'R '.
19. Should you add another category if there are items that do not fit into the rest?
20. How will you represent boolean values: 0 and 1, or True and False? Pick one scheme and use it for all boolean columns for consistency.
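As a short sketch of questions 16, 18, and 20 with pandas and a toy movie-rating column:

```python
import pandas as pd

df = pd.DataFrame({
    "rating": ["R", "R ", "G", "pg"],
    "active": ["yes", "no", "yes", "no"],
})

# Collapse duplicate categories caused by stray white space or casing,
# so 'R' and 'R ' become a single category.
df["rating"] = df["rating"].str.strip().str.upper()

# Inspect the unique values before bringing unclear ones to an SME.
print(df["rating"].unique())

# Normalize booleans to one scheme across all columns.
df["active"] = df["active"].map({"yes": True, "no": False})

# One-hot encode the category if the model needs numeric inputs.
encoded = pd.get_dummies(df, columns=["rating"])
```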
Numeric Fields
Lastly, there are numeric fields. A common task when looking over numeric fields is reviewing the summary statistics and distribution of each column. This quick analysis helps me understand how the data is structured and whether any noticeable outliers exist. After looking at this data, you may also need to consider the units associated with your numeric values. Did you assume the units of specific columns, but something seems wrong? Double-checking units can save you trouble before you move forward with your work. A brief sketch follows these questions.
21. Does the distribution of each data column seem appropriate, or do you need to investigate an issue?
22. Do you need to know what units the data represents, such as feet vs. inches or Celsius vs. Fahrenheit? Will the difference matter to your end calculations?
23. Is the column all numeric, or are there other values to clean out from the data?
24. Have you converted each column to a consistent data type, such as int or float? Will this conversion affect your end-use case for the data?
25. Is your data continuous or categorical? How will that data be used?
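A brief sketch of questions 21 through 23 with pandas, using a made-up height column that loads as strings and contains both a stray "n/a" and an outlier:

```python
import pandas as pd

df = pd.DataFrame({
    "height_in": ["72", "65", "n/a", "840"],
    "temp_c": [21.0, 23.5, 19.8, 22.1],
})

# Coerce non-numeric entries to NaN instead of raising an error.
df["height_in"] = pd.to_numeric(df["height_in"], errors="coerce")

# Summary statistics quickly surface outliers like the 840-inch height.
print(df["height_in"].describe())

# Convert units explicitly so downstream calculations agree.
df["height_ft"] = df["height_in"] / 12.0
df["temp_f"] = df["temp_c"] * 9.0 / 5.0 + 32.0
```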
Summary
Data cleaning and preprocessing can take up much of the time during data science projects. As you ingest the data, consider how much cleaning may have already been done before you received it and how much work remains for you to do. As you look at your data, the first thing to consider is how you will handle NULL or empty values: can you clean them out, or will you need to impute values in their place? Once you have made that determination, you can look at the actual data points in each column and construct a plan to clean those columns as you see fit. Columns can include categories, booleans, numeric values, and text fields, and each requires different considerations as you clean and preprocess the data before modeling.
What types of questions do you ask as you clean data?