
Objectives
- Download, extract, and load complex Excel files from the web into R
- Clean, wrangle, and filter the data efficiently
- Use custom functions to simplify our plotting
- Learn to use the stack(), grep(), substr(), unique(), ggplot(), stat_smooth(), and facet_wrap() functions in R
- Use the tidyverse, lubridate, and readxl libraries to make our lives easier
Introduction
Have you ever had a data set with hundreds of columns that repeat and just need a simple way to get some useful plots of the data? Well, you are in the right place!
In this project, I will show you the simplest and most effective tips and tricks that I have discovered to get ggplot to cooperate in R. Realistically, about 80% of the time spent on data analysis goes into just getting the data and cleaning it up enough to be usable [1]. That leaves the other 20% for the fun and (potentially) glorious part of the job, where we get to make some pretty graphs and interpret the results. Nobody really likes spending 80% of their time finding and cleaning data, so hopefully this project can help you cut down on the amount of time you spend mashing those buttons.
For your convenience, I have made an R Markdown file (.rmd) available on my GitHub at this link if you would rather take a look through the notated code and run it cell by cell or clone it to your machine.
To get started, make a new R Markdown file in your RStudio, use a regular .R file, or use a cloud version of R. We’re not going to be too fancy here, so you should not have any issues. I am assuming that you can get an R environment working either in RStudio or in the cloud. Let me know if you are having problems in the responses at the end, and I will do what I can to help you troubleshoot the problem!
Let’s get started!
The Code
Load Libraries and Get the Data
First, we need to load tidyverse, lubridate, and readxl into our R environment. If you get an error about the packages not being recognized, uncomment (remove the #) before the install.packages() line(s) that you need.
Here’s the code:
# If you need to install the packages, uncomment the following 3 lines:
# install.packages("tidyverse")
# install.packages("lubridate")
# install.packages("readxl")
# Load Libraries
require(tidyverse)
require(lubridate)
require(readxl)
It would also benefit you to set your working directory to a folder that contains your file and has room to download a ~3MB .xlsx file. That can be done with the setwd() command, or by using the Files pane in RStudio to navigate to the folder location and setting the working directory from the More drop-down menu as shown below:
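If you would rather set the working directory in code than through the RStudio menus, a one-liner like the following works (the path below is just a placeholder; substitute your own folder):

```r
# Set the working directory to the folder where the data should live
# NOTE: this path is a placeholder, use your own
setwd("~/projects/euro-indicators")

# Confirm it worked
getwd()
```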

Now we need to download the data. We are using some economic indicator data from the European Commission that has monthly data for each European country going back as far as 1985, which amounts to over 480 rows and 280 columns. The link to the web page can be found here [2] or in the RMD file from my GitHub if you want to explore the EC’s website a bit more to learn about the data.
What we want is the "Seasonally Adjusted Data (total sector)" ZIP file. In R, we assign the link to the file to the url variable, create a temporary file, download the Excel sheet, and unzip the data so we can use it.
Here’s the code:
# Define URL to the first link, "Seasonally Adjusted Data (total sector)" ZIP file
url <- 'https://ec.europa.eu/economy_finance/db_indicators/surveys/documents/series/nace2_ecfin_2011/main_indicators_sa_nace2.zip'
# Create temp file, download, and unzip
temp <- tempfile()
download.file(url, temp)
unzip(temp)
Here’s the output:

Note that the red text is not an error; it is just telling you that R is doing its thing.
With the data downloaded and unzipped, we need to get it into our session in a way that makes R happy. Feel free to open the file in Excel (do not use Numbers on Mac or OpenOffice because of encoding issues!) on your computer if you like. We only need the MONTHLY sheet, which we can specify in our import code.
Here’s the code:
# Create EuroData for Raw Data
# We only need the data in the MONTHLY sheet in the excel file. There is a lot of other data in here, which you can explore on your own.
EuroData <- read_excel("main_indicators_nace2.xlsx", sheet = "MONTHLY")
Here’s the output:

Again, the red text is normal. R is mad that some of the columns are completely empty. In the Excel sheet, blank columns separate some of the data for readability. We’ll deal with that soon in the cleaning process.
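If you want to see exactly which columns R was complaining about, here is a quick sanity check (nothing here is required; the cleaning steps later remove these columns anyway):

```r
# List the columns that contain nothing but NAs
# (these are the blank spacer columns from the Excel formatting)
empty_cols <- names(EuroData)[colSums(!is.na(EuroData)) == 0]
empty_cols
```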
If we take a look at the newly created EuroData object in RStudio, we can see several formatting issues.

Clean the Data
One nice property of this data set is that the first column is the date to which every subsequent column is indexed. Before we do anything crazy, let’s quickly change the column name to be more useful to us.
Here’s the code:
# rename first column to Date
names(EuroData)[1] <- "Date"
Here’s the output:

To make our data a little easier to process, let’s change it to a tibble for a bit.
Here’s the code:
# make data a tibble
EuroData <- as_tibble(EuroData)
Fix the Date
Next, let’s fix the date column using the lubridate package to format the column as a consistent year-month-day system while also converting it to the right data type.
Here’s the code:
# Fix date column
EuroData$Date <- ymd(EuroData$Date)
Use grep()
Now, let’s get into the nitty-gritty details of cleaning. Several columns still have names in the "…#" format (R’s placeholder names for the blank columns), which is not helpful to us. We can use the grep() function to look for a pattern in some data. What I did here is make the remove_columns object a vector of the results where grep() looks for any column name in the EuroData object that contains one or more digits.
Here’s the code:
# grep the "...#" columns out
remove_columns <- c(grep("[0-9]+", names(EuroData), value = T))
Here’s the output (type "remove_columns" into the Console):

Remove Useless Columns
Since we have a vector of 35 column names that are just placeholders with no data, let’s remove them from the EuroData object. The code below takes the original EuroData object and keeps only the columns that are not in the remove_columns vector, which gives us an EuroData object with only the columns we want. Sure, it’s a double negative, but it’s simple and efficient.
Here’s the code:
# actually remove the useless columns
EuroData <- EuroData[,!(names(EuroData) %in% remove_columns)]
Here’s the output:

Mutate Data Types
We now have to deal with a quirk of the data. It turns out that all those numbers in the other 245 columns are actually stored as characters rather than numbers, which means that we cannot do any calculations until we fix that. Thankfully, we can use the mutate_if() function to find any columns that are characters and convert them to numeric data types.
Here’s the code:
# convert data to numeric
EuroData <- EuroData %>%
  mutate_if(is.character, as.numeric)
Here’s the output:

Again, don’t worry about the red text. R is upset about introducing NAs, but these are warning messages rather than error messages, so it doesn’t actually matter. You can just close them by clicking the little x on the right side. Everything is fine, I promise!
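If the wall of warnings still makes you nervous, one hedged way to confirm the conversion behaved is to check that everything except the Date column is now numeric:

```r
# Every column except the first (Date) should now be numeric
all(sapply(EuroData[-1], is.numeric))
```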
Get Country Abbreviations
The next step is a bit of preemptive preparation. We need to get the country abbreviations for reasons that will become clear in the next couple of steps. To do this, I use the substr() function to get the first two characters of the column names from columns 2–246 of the EuroData object, then wrap that in the unique() function to remove duplicates. All of that work goes inside the c() function to make the results a vector.
Here’s the code:
# get country abbreviations
country_abbvs <- c(unique(substr(names(EuroData[2:246]), 1, 2)))
Here’s the output (type "country_abbvs" into the Console):

Stack the Data
Now it is finally time to stack the data and change its structure. I am converting back to a data frame, explicitly naming the Date column, and using the stack() function on all but the Date column to reshape our data in a way that will make our lives much easier when plotting the data with ggplot() down the line. Let’s take a look at the data now.
Here’s the code:
# Stack the data to make ggplot happy
EuroData <- data.frame(Date = EuroData$Date, stack(EuroData[2:246]))
Here’s the output:

Notice that we went from having 246 columns down to 3. All we have is a long, skinny table of values and identifiers indexed to dates. At first, it probably seems a little strange to do this. I did it because of how ggplot() takes data. When we make the plots, you will see how easy it is to automatically generate the right plots with stacked data. If you want to try to do this with the original wide data set, have fun, but be prepared for a lot of frustration and wasted time.
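If stack() is new to you, here is a minimal, self-contained sketch of what it does, using made-up data (the column names are invented for illustration):

```r
# Toy wide table: one Date column plus two "country" columns
wide <- data.frame(Date = as.Date(c("2020-01-01", "2020-02-01")),
                   DE.INDU = c(1.2, 1.5),
                   FR.INDU = c(0.8, 0.9))

# stack() collapses the value columns into a values/ind pair,
# and data.frame() recycles Date to match the new length
long <- data.frame(Date = wide$Date, stack(wide[2:3]))
long
```

The two wide columns become four rows: the values, plus an ind factor recording which column each value came from.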
Create Custom Plotting Function
When used judiciously, custom functions in R can save a ton of time. I made one called country_plot() that uses ggplot() to automatically subset the data to remove NA values, set the aesthetics (the aes() function) correctly, draw a line chart, fit a linear model that adds a best-fit line to each panel (with the standard-error shading removed for readability), and use the facet_wrap() function to put all the subplots on a single plot. I will show you how to use it after the last couple of steps. Feel free to modify it if you want.
Here’s the code:
# Create Plotting function
country_plot <- function(x) {
  ggplot(data = subset(x, !is.na(values)), aes(x = Date, y = values)) +
    geom_line() +
    stat_smooth(method = "lm", se = FALSE) +
    facet_wrap(. ~ ind, ncol = 4)
}
Generate Each Country’s Data
There are a lot of countries in Europe. Let’s use a for loop to filter, subset, and create data objects for each country.
I know for loops in R get a bad rap, but this one works and is easy to read, so we’re using it here for simplicity’s sake. We set the loop to run from 1 to the length of the country_abbvs vector because we will be looping through all 35 abbreviations. On each pass, data_filter collects the identifiers for the current country’s columns, country_subset_data filters the entire EuroData object down to just those identifiers, and assign(), with help from the paste0() function, binds the country abbreviation to its corresponding data.
Here’s the code:
# Subset each country's data
for(i in 1:length(country_abbvs)) {
  # create generic filter for each country
  data_filter <- unique(grep(country_abbvs[i], EuroData$ind, value = T)) # This works
  # filter for each country
  country_subset_data <- EuroData %>%
    filter(ind %in% data_filter)
  # create object names and pass in each country's data
  assign(paste0(country_abbvs[i], "_data"), country_subset_data)
}
Here’s the output (look at "Environment" in RStudio):

Here’s the output (click on one of the objects to see its structure):

Notice here that a few objects have more data than they should. This is a result of how the filtering works: grep() matches the country abbreviation anywhere in the identifier, not just at the start, so some abbreviations also pick up identifiers belonging to other countries. All but 4 came out right, so all we need to do is remove the excess data from those before we start plotting.
Here’s the code:
# Get rid of the weirdness in EE, ES, SE, and SI
EE_data <- EE_data %>%
  filter(substr(ind, 1, 2) == "EE")
ES_data <- ES_data %>%
  filter(substr(ind, 1, 2) == "ES")
SE_data <- SE_data %>%
  filter(substr(ind, 1, 2) == "SE")
SI_data <- SI_data %>%
  filter(substr(ind, 1, 2) == "SI")
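To see why those four needed fixing, remember that grep() matches its pattern anywhere in the string, not just at the start. A tiny made-up example (the identifiers are invented for illustration):

```r
# "SE" matches the Swedish identifier AND any identifier that
# happens to contain "SE" somewhere else in its name
ids <- c("SE.INDU.COF", "BG.SERV.COF", "SI.INDU.COF")
grep("SE", ids, value = TRUE)
# returns both "SE.INDU.COF" and "BG.SERV.COF"
```

Filtering on substr(ind, 1, 2) instead anchors the comparison to the first two characters, which is why it cleans up the stragglers.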
Plot the Data
Finally! We can enjoy the fruits of our labor. Plotting the data for any country is super easy. We can use our custom country_plot() function with any of the country data objects to plot all seven indicators contained in the data set. Let’s take a look at a few of them.
Here’s the code:
# EU - European Union
country_plot(EU_data)
Here’s the output (EU as a whole, Germany, and UK):



In the RMD file on my GitHub, I have all the countries ready to plot with the click of a button.
Conclusion
If you care about European economic indicators, go ahead and read into what these numbers mean. The purpose of this project was to show a method that has saved me a ton of time to take large and complex sets of data then quickly plot that data in a way that makes sense. I hope this helps you save some time and generate more effective data visualizations!
References
[1] G. Press, Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says (2016), https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/?sh=2074b16c6f63
[2] European Commission, Time Series (2020), https://ec.europa.eu/info/business-economy-euro/indicators-statistics/economic-databases/business-and-consumer-surveys/download-business-and-consumer-survey-data/time-series_en