
Data Cleaning. The process of identifying, correcting, or removing inaccurate raw data for downstream purposes. Or, more colloquially, an unglamorous yet wholly necessary first step toward an analysis-ready dataset. Data cleaning may not be the sexiest task in a data scientist’s day, but never underestimate its ability to make or break a statistically driven project.
To elaborate, think of data cleaning instead as the preparation of a blank canvas that the brushstrokes of exploratory data analysis and statistical modeling will soon bring to life. If your canvas isn’t cleaned and properly fitted to your project aims from the start, interpretations of your art will remain muddled no matter how beautifully you paint.
It is the same with data science projects. If your data is poorly prepped, unreliable results can plague your work no matter how cutting-edge your statistical artistry may be. Which, for anyone who translates data into company or academic value for a living, is a terrifying prospect.
As the age-old saying goes: garbage in, garbage out.
Unfortunately, real-world data cleaning can be an involved process. Much of preprocessing is data-dependent, with inaccurate observations and patterns of missing values often unique to each project and its method of data collection. This holds especially true when data is entered by hand (data verification, anyone?) or is the product of unstandardized free responses (think scraped tweets or observational data from fields such as conservation and psychology).
However, "involved" doesn’t have to translate to "lost." Yes, every data frame is different. And yes, data cleaning techniques are dependent on personal data-wrangling preferences. But, rather than feeling overwhelmed by these unknowns or unsure of what really constitutes as "clean" data, there are a few general steps you can take to ensure your canvas will be ready for statistical paint in no time.
TL;DR: Data cleaning can sound scary, but invalid findings are scarier. The following are a few tools and tips to help keep data cleaning steps clear and simple.
Let’s get started.
Enter R
R is a wonderful tool for dealing with data. Packages like those in the tidyverse make complex data manipulation nearly painless and, as the lingua franca of statistics, R is a natural starting point for many data scientists and social science researchers (like myself). That said, it is by no means the only tool for data cleaning. It’s just the one we’ll be using here.
For alternative data cleaning tools, check out these articles for Python, SQL, and language-neutral approaches.
Next, Enter a Simple Checklist
No matter how useful R is, your canvas will still be poorly prepped if you miss a staple data cleaning step. To keep it as simple as possible, here is a checklist of best practices you should always consider when cleaning raw data:
- Familiarize yourself with the data set
- Check for structural errors
- Check for data irregularities
- Decide how to deal with missing values
- Document data versions and changes made
Don’t worry if these steps are still a bit hazy. Each one will be covered in greater detail using the example dataset below.
Lastly, Enter a Real-World Example
A toolbox and checklist are cool, but real-world applications of both are where true learning occurs.
Throughout the following, we’ll go over each of the data cleaning checklist steps in sequential order. To demonstrate these steps, we’ll be using the "Mental Health in Tech Survey" dataset currently available on Kaggle, along with related snippets of R code.
Step 1: Familiarize yourself with the data set
An important "pre-data cleaning" step is domain knowledge. If you’re working on a project related to the sleep patterns of Potoos in the Amazon but have no idea what a Potoo actually is, chances are you aren’t going to have a good grasp on what your variables mean, which ones are important, or what values might need some serious cleaning.
The short of it: Be sure to read up or ask an expert before diving into data cleaning if your variables don’t make sense to you at first.
To stay within my own realm of clinical psychology, I’ll be using the aforementioned "Mental Health in Tech Survey" dataset, a series of 2016 survey responses aiming to capture mental health statuses and workplace attitudes among tech employees. For more insight on the data, check out the original source here.
Knowing your dataset well, from file size to data types, is another crucial step before hands-on data cleaning. Nothing is more annoying than realizing a central feature is cluttered with noise or discovering a shortage of RAM partway through your analyses.

To avoid this, we can take some quick, initial steps to determine what will probably need extra attention. To determine the size of the data file before opening it, we can use:
#check the file size (in bytes) before reading the data into memory
file.info("~/YourDirectoryHere/mental-heath-in-tech-2016_20161114.csv")$size
The data file we’re utilizing is 1,104,203 bytes (roughly 1.1 MB), not a big data venture by any means. RAM shortages most likely won’t be an issue here.
#read in the data and take an initial look at its structure
df <- read.csv("~/YourDirectoryHere/mental-heath-in-tech-2016_20161114.csv")
str(df)

From the output, we can also see that the data frame consists of 1433 observations (rows) and 63 variables (columns). Each variable’s name and data type are also listed. We’ll come back to this information in Step 2.
Quick-and-dirty methods like the ones above can be an effective way to initially familiarize yourself with what’s on hand. But keep in mind these functions are just the tip of the exploratory data analysis (EDA) iceberg. Check out this resource for a sneak peek of EDA in R beyond what’s covered here.
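For instance, a few more base R one-liners worth running up front (a quick sketch; none of these modify the data):
#dimensions, a glance at the first rows, and per-column missing counts
dim(df)
head(df)
colSums(is.na(df))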
Step 2: Check for structural errors
Now that we have a feel for the data, we’ll evaluate the data frame for structural errors. These include entry errors such as faulty data types, non-unique ID numbers, mislabeled variables, and string inconsistencies. If there are more structural pitfalls in your own dataset than the ones covered below, be sure to include additional steps in your data cleaning to address the idiosyncrasies.
a) Mislabeled variables: View all variable labels with the names() function. Our example dataset has long labels that will be difficult to call in the code to come. We can modify them with dplyr’s rename() like so:
#rename a lengthy survey-question column to something easier to call (rename() is from dplyr)
library(dplyr)
df <- df %>% rename(employees = How.many.employees.does.your.company.or.organization.have.)
colnames(df)[2]

b) Faulty data types: These can be determined by either the str() function utilized in Step 1 or the more explicit typeof() function. There are several incorrect data types in this dataset, but let’s continue using the "employees" variable to demonstrate how to identify and update these errors:
typeof(df$employees)

"Character" is returned but the variable should in fact be a factor with 5 levels: "6–25", "26–100", "100–500", "500–1000", and "More than 1000". We can use the as.factor() function to change the data type accordingly:
df$employees <- as.factor(df$employees)
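Note that as.factor() orders levels alphabetically by default. If the natural ordering of the categories matters for later tables or models, here is a quick sketch using factor() with the five response strings listed above:
#set the factor levels in their natural order
df$employees <- factor(df$employees, levels = c("6–25", "26–100", "100–500", "500–1000", "More than 1000"))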
c) Non-unique ID numbers: This particular dataset doesn’t have ID labels; respondents are instead identified by row number. If ID numbers were included, however, we could remove duplicates with the duplicated() function or dplyr’s distinct() function like so:
#with duplicated()
df <- df[!duplicated(df$ID_Column_Name), ]
#with distinct()
df <- df %>% distinct(ID_Column_Name, .keep_all = TRUE)
d) String inconsistencies: This includes typos, capitalization errors, misplaced punctuation, or similar character data errors that might interfere with data analysis.
Take for instance our "gender" column.
#view every unique response to the gender item
unique(df$gender)

The output goes on; there are in fact 72 unique responses in total. As we can see, variation arises from inconsistent capitalization and term abbreviation. To unify these responses, we can use regular expressions in combination with gsub() to identify common character patterns and convert all female-identifying responses to the dummy-coded value "1" and all male-identifying responses to "0".
df$gender <- gsub("(?i)F|(?i)Female", "1", df$gender)
df$gender <- gsub("(?i)M|(?i)Male", "0", df$gender)
Regular expressions vary greatly according to string data. Check out this regular expression cheat sheet for R here for more insight on how to use them.
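If you prefer the tidyverse toolkit, the same recoding can be sketched with stringr and dplyr (the patterns below are simplified assumptions and would need extending to cover all 72 responses):
#a tidyverse alternative to the gsub() calls above
library(stringr)
df <- df %>% mutate(gender = case_when(
  str_detect(gender, regex("^f(emale)?$", ignore_case = TRUE)) ~ "1",
  str_detect(gender, regex("^m(ale)?$", ignore_case = TRUE)) ~ "0",
  TRUE ~ gender))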
Also, beware of missing values erroneously represented by the character string "NA" rather than true NA values. Fix such instances with the following code:
#convert literal "NA" strings in the gender column to real NA values
df <- df %>% mutate(gender = na_if(gender, "NA"))
Step 3: Check for data irregularities
Next, we’ll evaluate the dataset for irregularities, which consist of accuracy concerns like invalid values and outliers. Again, these are two common pitfalls in messy data frames, but be aware of irregularities specific to your own data.
a) Invalid values: These are responses that don’t make logical sense. For example, the first question in our dataset ("Are you self-employed?") should align with the second ("How many employees does your company or organization have?"). If there is a "1" in the first column indicating that the individual is self-employed, there should be an "NA" in the second column as he or she doesn’t work for a company.
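A sketch of how to flag such inconsistencies (assuming the first survey column has been renamed to "self_employed", in the same fashion as "employees" earlier):
#flag self-employed respondents who nonetheless reported a company size
which(df$self_employed == 1 & !is.na(df$employees))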

Another common example is age. Our dataset consists of responses from tech employees, meaning any reported age older than 80 or younger than 15 is likely an entry error. Let’s take a look:
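A base R sketch for eyeballing the extremes:
#inspect the distribution of reported ages
summary(df$age)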

It is safe to say that a 3-year-old and a 323-year-old did not complete an employee survey. To remove the invalid entries, we can use the following code:
#drop rows with out-of-range ages (filter() safely returns all rows if nothing matches)
df <- df %>% filter(is.na(age) | (age >= 15 & age <= 80))

b) Outliers: This is a topic with much debate. Check out the Wikipedia article for an in-depth overview of what can constitute an outlier.
After a little feature engineering (check out the full data cleaning script here for reference), our dataset has 3 continuous variables: age, the number of diagnosed mental illnesses each respondent has, and the number of believed mental illnesses each respondent has. To get a feeling for how the data is distributed, we can plot histograms for each variable:
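A minimal base R sketch of those histograms (using the variable names above):
#distribution checks for the three numeric variables
hist(df$age, main = "Age", xlab = "Years")
hist(df$total_dx, main = "Diagnosed mental illnesses", xlab = "Count")
hist(df$total_dx_belief, main = "Believed mental illnesses", xlab = "Count")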

Both "total_dx" and "total_dx_belief" are heavily skewed. If we wanted to mitigate the impact of extreme outliers, there are 3 common ways to do so: delete the outlier, replace the value (aka Winsorize), or do nothing.
Delete the observation: Locate and remove the observation with the extreme value. This is common when dealing with extreme values that are clearly the result of human entry error (like the 323-year value previously entered in our "age" column). However, be careful when this is not the case as deleting observations can lead to a loss of important information.
Winsorize: When an outlier is negatively impacting your model assumptions or results, you may want to replace it with a less extreme maximum value. In Winsorizing, values outside a predetermined percentile of the data are identified and set to said percentile. The following is an example of 95% Winsorization with our dataset:
#looking at the number of values above the 95th percentile
sum(df$total_dx > quantile(df$total_dx, .95))
#Winsorize() is from the DescTools package; by default it caps values at the 5th and 95th percentiles
library(DescTools)
df <- df %>% mutate(wins_total_dx = Winsorize(total_dx))
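A quick before-and-after comparison confirms the cap took hold (the Winsorized maximum should sit at the original 95th percentile):
#compare the raw and Winsorized distributions
summary(df$total_dx)
summary(df$wins_total_dx)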
Do nothing: Yep, just… do nothing. This is the best approach if the outliers, although extreme, hold important information relevant to your project aims. This is the approach we’ll take with our "total_dx" variable, as the number of reported mental illnesses for each respondent has the potential to be an important predictor of tech employees’ attitudes towards mental health.
An additional note: It may be the era of big data, but small sample sizes are still a stark reality for those within clinical fields, myself included. If this is also your reality, take extra care with outliers as their effect on the sample mean, standard deviation, and variance increases as sample size decreases.
Step 4: Decide how to deal with missing values
I’ll cut straight to the chase here: There is no single "best" way to deal with missing values in a data set.

This can sound daunting, but understanding your data and subject domain (recall Step 1?) comes in handy here. If you know your data well, chances are you’ll have a decent idea of which method will best apply to your specific scenario.
Most of our dataset’s NA values are due to dependent responses (i.e. if you respond "Yes" to one question, you can skip the following) rather than human error. Most of this missingness is explained by the diverging response patterns of questions directed at self-employed versus company-employed respondents. After splitting the dataset into two frames (one for company-employed respondents and one for self-employed respondents), we can calculate the total missing values for the company-employed dataset:
#total missing values in the data frame
sum(is.na(df))
#percent missing values per variable
apply(df, 2, function(col) sum(is.na(col))/length(col))

It may look like a lot of missing values but, upon further inspection, the only columns with missing values are those directed at self-employed respondents (for instance, "Do you have medical coverage (private insurance or state-provided) which includes treatment of mental health issues?"). Missing values in these variables are to be expected in our company-employed dataset, as those respondents are instead covered by company policy.
Which leads us to the first option:
a) Remove the variable. Delete the column with the NA value(s). In projects with large amounts of data and few missing values, this may be a valid approach. It is also acceptable in our case, where the self-employed variables add no significant information to our company-employed dataset.
However, if you’re dealing with a smaller dataset and/or a multitude of NA values, keep in mind removing variables can result in a significant loss of information.
b) Remove the observation. Delete the row with the NA value. Again, this may be an acceptable approach in large projects but beware of the potential loss of valuable information. To remove observations with missing values, we can easily employ the dplyr library again:
#identifying the rows with NAs (margin 1 applies anyNA() row-wise)
rownames(df)[apply(df, 1, anyNA)]
#removing all observations with NAs
df_clean <- df %>% na.omit()
c) Impute the missing value. Substitute NA values with inferred replacement values. We can do so using the mean, median, or mode of a given variable like so:
#mean-impute numeric columns only (mean() is undefined for non-numeric columns)
for(i in 1:ncol(df)){
  if(is.numeric(df[[i]])){
    df[is.na(df[[i]]), i] <- mean(df[[i]], na.rm = TRUE)
  }
}
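For categorical columns, a mode-imputation sketch (impute_mode() is a hypothetical helper, not a packaged function):
#replace NAs in a categorical column with its most frequent value
impute_mode <- function(x){
  x[is.na(x)] <- names(which.max(table(x)))
  x
}
df$employees <- impute_mode(df$employees)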
We can additionally impute continuous values using predictive methods such as linear regression, or impute categorical values using methods like logistic regression or ANOVA. Multiple imputation with libraries such as MICE can also be used with either continuous or categorical data. When implementing these methods, be aware that the results can be misleading if there is no relationship between the missing value and dataset attributes. You can learn more about these techniques and their related R packages here.
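As a concrete illustration, here is a minimal multiple-imputation sketch with the mice package (the m and seed values are arbitrary choices):
#create 5 multiply-imputed datasets and extract the first completed one
library(mice)
imp <- mice(df, m = 5, seed = 123)
df_complete <- complete(imp, 1)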
KNN imputation offers yet another viable alternative for imputing either continuous or categorical missing values, but keep in mind it can be time-consuming and highly dependent on the chosen k value.
#imputing missing values with the caret package's knn method
library(caret)
df_preprocess <- preProcess(df %>% dplyr::select(primary_role),
                            method = c("knnImpute"),
                            k = 10,
                            knnSummary = mean)
df_impute <- predict(df_preprocess, df, na.action = na.pass)
d) Use algorithms that support missing values. Some algorithms will break when fed NA values; others won’t. If you want to keep the NAs in your dataset, consider algorithms that can process missing values, such as XGBoost (which handles NAs natively) or implementations of k-Nearest Neighbors and linear regression designed to accommodate them. This decision will also depend strongly on long-term project aims.
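As a sketch of that last option, the classic xgboost() interface accepts NAs directly in its feature matrix (the feature and outcome columns below are hypothetical stand-ins):
#NAs in X are handled natively via learned default split directions
library(xgboost)
X <- as.matrix(df[, c("age", "total_dx")])   #hypothetical numeric feature subset
y <- df$treatment_sought                     #hypothetical 0/1 outcome column
fit <- xgboost(data = X, label = y, nrounds = 25, objective = "binary:logistic", verbose = 0)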
Step 5: Document data versions and changes made
Let’s say it loud and clear for the folks in the back: Good research is reproducible research. If you or a third party cannot reproduce the same clean dataset from the same raw dataset you used, you (or anyone else) cannot validate your findings.
Clear documentation is a crucial facet of good data cleaning. Why did you make the changes that you did? How did you do them? What version of the raw data did you use? These are all important questions you need to be able to answer. Tools like R Markdown and RPubs can seamlessly weave documentation into your R project. Check them out if you’re not already familiar. Your future self will thank you.
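One lightweight habit worth adopting (a sketch; the date-stamped file name is just a suggested convention):
#save the cleaned data with a version stamp and log the software environment
write.csv(df_clean, paste0("mental_health_tech_clean_", Sys.Date(), ".csv"), row.names = FALSE)
sessionInfo()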
For those of you who made it this far, thanks for reading!
Throughout the post, we clarified the essential data cleaning steps and potential ways to approach them in R with the help of a simple checklist and real-dataset application. For further insight, you can find the full R script at my GitHub repo here.
And remember, although this guideline is an effective anchor, data cleaning is a process strongly shaped by the dataset and the long-term statistical aims at hand. Don’t be afraid to get creative with it! The canvas can quickly become a work of art itself 🎨