
If you work as a data analyst, the probability that you’ve come across a dataset that caused you a lot of trouble due to its size or complexity is high. Most data analysts today rely on a combination of several visualization and spreadsheet tools to make sense of the data around them, but the "curse of scattered files" still stands – particularly in big companies.
But, as we leave behind the first two decades of the millennium, we are witnessing huge growth in the creation of new data sources – not only do data analysts need to make sense of data produced within the organization (and with organizations trying to be more data-savvy, the amount of data produced and stored grows exponentially), but sometimes they are also asked to make sense of external data, extraneous to the company. This diversity calls for new ways to approach problems that are not solvable with old tools.
Sure, Microsoft Excel is one of the best tools to analyze data due to its democratization and usability, but as soon as you pass a certain number of rows your ability to gather insights and make sense of raw data gets pretty limited. Analyzing large files with Excel has two main problems:
- The hard limit on the number of rows and columns (https://support.microsoft.com/en-us/office/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3).
- The impact on productivity of a buggy environment (slowness, crashes, etc.) – if you have ever lost part of your analysis to an Excel crash you know what I’m talking about. Yes, the "uh-oh, here’s the white screen!" moment.
It’s not that Excel is a bad tool (I really think it is great), it’s just that it’s not tailored for large files – and yet most people, particularly in large companies, still rely on it for every kind of data task. Data visualization and self-service Business Intelligence tools have been tackling this problem in some way – but getting access to the tools you need makes you overly dependent on budget or politics.
Luckily, the adoption of open source tools surged during the past decade (particularly in the data analytics space, where R and Python shine) as, on top of having huge communities, they have a smooth learning curve and are pretty accessible to most people.
Being able to code in these languages can help analysts improve their overall productivity and capacity to analyze data – but sometimes they get frustrated when they can’t load spreadsheet files properly into R or Python, and end up sticking with spreadsheet software.
In this article we are going to explore how to build and think about functions that prepare our spreadsheet files for analysis – the flow of the pipeline is the same one that I teach in my R Programming course on Udemy.
The dataset
We are going to use the FBI Crime Data Set, which has this general look:

As we can see, we have several columns that convey almost the same information, such as nominal values and rates. If we try to ingest this data directly into R, we are probably going to get some weird results due to the additional information present in the columns (references, for example), as any tool will have a hard time targeting the "real data" in this spreadsheet (and remember that it makes sense for Excel to carry this information, as it is a user-facing tool).
As an example, the footnote will probably be included as a row somewhere in your data – this makes your data really hard to analyze – but you can fix most of these problems with a bit of data wrangling skill!
Importing it into R
To load this Excel file into R we will rely on the readxl (https://readxl.tidyverse.org/) library from the tidyverse:
library(readxl)
# Load the Data
crime_data <- read_xls("./data/FBI Crime Data.xls")
Looking at the head of the file that we just loaded:

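If you are following along in the console, a quick look at the first rows shows the same problems:
head(crime_data, 10)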
Ugh, look at that! Our first two rows are probably useless and the third row contains our column names. We also have some problems with some years (look at row index #8 with the year 20015 – this happened because the superscript 5 was read as a normal digit).
The bottom of the table does not look good either:

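Again, the last rows can be inspected in the console with tail:
tail(crime_data, 10)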
Some of the descriptions and metadata we had on the Excel sheet ended up being read as rows – this metadata is useful for someone looking at the Excel file, but not for someone trying to analyze this data consistently in a data analysis tool.
These are common data wrangling problems that data scientists and analysts face every time they want to analyze a new data set – particularly one from a less structured source.
We have some work on our hands, so let’s get this done!
Starting with Column Names
This one is pretty easy – our column names are on the third row, so we can just set them using the colnames function:
colnames(crime_data) <- crime_data[3,]
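Note that read_xls returns a tibble, so crime_data[3,] is a one-row tibble rather than a plain character vector – if you prefer to make the coercion explicit, an equivalent variant is:
# Same result, but the conversion to a character vector is explicit
colnames(crime_data) <- unlist(crime_data[3, ])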
Our general table gets this look:

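As an aside, readxl can also skip the leading rows at read time – but the skip value below is a hard-coded assumption about this particular sheet’s layout, which is exactly the kind of rule the next section tries to avoid:
# Hard-codes the sheet layout – works for this file only
crime_data_alt <- read_xls("./data/FBI Crime Data.xls", skip = 3)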
Automating Filtering
To build interesting and robust data pipelines we have to come up with systematic rules that can accommodate future changes in the files we pass through our pipeline.
We could definitely subset the years by using a vector to subset the first column in the dataset, looking for the values 1997, 1998… 2016 (the latest year we have on this file) – but what if someone gives us a file with data up to 2017? Or a file that contains 1996?
If we had our years hard-coded into the pipeline, we would be building hard-coded rules – these are never a good option because they apply to this file and this file only (or a file with exactly the same structure).
With a new file containing data for more than the 20 years available in our table, your pipeline would not be able to filter them – so, looking at the file below, can you come up with a rule that makes it possible to subset the table, no matter the number of rows with yearly information?

One nice rule we can come up with is to only keep the rows whose value in the first column can be converted to numeric. This is an example of a good "data pipeline rule", as it prevents errors or loss of information in the future.
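To see why this works, look at how as.numeric behaves on a mix of values (the strings here are just illustrative):
as.numeric(c("1997", "2016", "Population", "Violent crime"))
# [1] 1997 2016   NA   NA
# Warning message: NAs introduced by coercion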
If we try to convert the first column of this dataset into a numeric type, elements that are not numbers will be returned as NAs (Not Available) – we can then filter those NAs out, efficiently retaining the rows we are interested in. We will rely on sapply to convert our first column into numeric, storing the result in a new column:
# Coerce the first column to numeric – non-numeric values become NA
crime_data['converted_index'] = sapply(crime_data[,1], as.numeric)
Our new converted_index column will have the following look:

Notice how the NAs were introduced by coercion when R couldn’t convert our column to numeric.
Filtering these rows out will give us the following table – a short summary of the first 11 rows and first 9 columns:

How did we filter these NAs out? We were able to achieve that with the following code:
# Keep only the rows where the last column (converted_index) is not NA
crime_data_filter = crime_data[!is.na(crime_data[, ncol(crime_data)]), ]
With this rule we select every row of our dataframe where the last column is not NA – as our converted_index column was created inside the pipeline, it would also be safe to do the following (which yields the same result):
crime_data_filter = crime_data[!is.na(crime_data[, 'converted_index']), ]
In this case, having the column converted_index hard-coded is not problematic – the converted index was created inside our pipeline and the rules that created it are flexible to the input, so there’s less risk.
Correcting the Year Column
We still need to get rid of weird values like 20015 – this happened because R read the superscript 5, which points to a reference in the Excel file (and that makes sense, in that system), as a number.
Luckily, the Year column is still a character, so we can apply substring directly and retrieve only the first 4 characters of each year (they are digits, but R treats them as a string because of the column’s type):
crime_data_filter$Year = substr(crime_data_filter$Year, 1,4)
Notice how I am rewriting the Year column with only the first 4 digits of each year – the extra digits that R wrongly assumed to be part of the year are left out, as we want.
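As a quick sanity check on the offending value:
substr("20015", 1, 4)
# [1] "2001"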
We are able to get the following data frame:

Automating Column Selection
As we have seen, several columns convey the same information (each rate column is just the corresponding count normalized by population) – as we did with row filtering, we can also do some automated column filtering. Can you think of a rule that will select the non-rate columns in the table, without relying on hard-coded rules?
Let’s inspect the names of our columns with:
names(crime_data_filter)
This is the output:

Looking closely, every column that has "rate" in its name should be removed from our table. Instead of hard-coding the names or indexes of the columns, we can rely on the grepl function. Using:
grepl( "rate" , names(crime_data_filter))
This function returns a vector of TRUE and FALSE values – TRUE at the indexes where the column name contains "rate" and FALSE otherwise:

We can now rely on this vector to filter out these columns of our data frame – using indexers. We will also rewrite our original object:
crime_data_filter = crime_data_filter[, !grepl("rate", names(crime_data_filter))]
Converting our data to Numeric
We’re almost done! Let’s look at our crime_data_filter structure:

Most of our columns are still characters (a problem carried along since the beginning of the pipeline, as the blank cells in the Excel file made R assume these were character columns), even though we only see numbers in them!
Let’s convert these columns to numeric – we can’t apply as.numeric directly to our object (it only works on vectors, matrices or arrays) – we need to rely on the apply family of functions!
sapply(crime_data_filter, as.numeric)
As sapply does not return a data frame, we will wrap this call in the data.frame function to get one:
pipeline_table <- data.frame(sapply(crime_data_filter, as.numeric))
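A quick str call on the resulting pipeline_table should now show every column as numeric:
str(pipeline_table)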
Conclusion
Now that we have a really clean table, ready to be analyzed, we can encapsulate every instruction we have applied into a function that can be reused:
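Here is a minimal sketch of what that function could look like, following exactly the steps above – the function name, argument names and defaults are illustrative assumptions, not something fixed by the original file:
library(readxl)
# Sketch of a reusable pipeline – prepare_crime_data, header_row and index_col
# are illustrative names/defaults assumed from this file's layout
prepare_crime_data <- function(path, header_row = 3, index_col = 1) {
  raw <- read_xls(path)
  # Column names live on the header row
  colnames(raw) <- unlist(raw[header_row, ])
  # Keep only rows whose first column can be coerced to numeric
  converted_index <- suppressWarnings(as.numeric(unlist(raw[, index_col])))
  clean <- raw[!is.na(converted_index), ]
  # Strip the footnote superscripts that were glued onto the years
  clean$Year <- substr(clean$Year, 1, 4)
  # Drop the redundant rate columns
  clean <- clean[, !grepl("rate", names(clean))]
  # Convert everything to numeric and return a data frame
  data.frame(sapply(clean, as.numeric))
}
# Usage: the whole pipeline in one call
pipeline_table <- prepare_crime_data("./data/FBI Crime Data.xls")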
In conclusion, knowing how to build data pipelines is an essential part of analyzing data in systems that rely on code – particularly in a world where more and more professionals are accessing data using R and Python, it’s incredibly important to understand how to build error-proof rules for reading data files.
This lecture is taken from my R Programming course, available on the Udemy platform – the course is suitable for beginners and people who want to learn the fundamentals of R Programming. The course also contains more than 50 coding exercises that enable you to practice as you learn new concepts.