R Tutorial
Are you curious about delving into the world of R programming? While Python remains the dominant choice amongst the data science community, with approximately 60% of developers using it in 2022¹, there are instances where R may pop up now and again. That’s because R is optimized for statistics and data. If you, like me, have a foundation in Python but now encounter job listings and internal company tasks that demand R skills, this article aims to break that down. We will explore the fundamental distinctions between Python and R and wrap the project into a data cleaning and visualization tutorial to ensure a smooth transition to R.
Note: If you have a keen interest in green technology and electric vehicles, the tutorial includes some interesting visuals that showcase the popularity of electric and hybrid vehicles in Canada, so feel free to skip ahead to the tutorial section to explore these visuals and associated analyses firsthand!
A brief breakdown of R
R is an open-source programming language with a reputation for being used primarily in statistical modelling and data visualization. Originally developed in 1993 by statisticians Robert Gentleman and Ross Ihaka, R was designed to handle statistical analysis and data transformation tasks. It still maintains its reputation as a statistics-focused language. However, thanks to a vast library of over 18,000 packages, it has also evolved over the years to support a wide range of projects and applications beyond data science.
When it comes to setup and application, R is commonly used within the RStudio environment, which is both free and straightforward to install. You can find an installation guide here. Now that we have covered some initial explanations, let’s move onto our cheatsheet transition guide from Python to R.
Exploring the main differences between Python and R
While it is impossible to capture all the nuances between Python and R in one image, the diagram below provides a good initial overview of the key differences between the two programming languages:
Please note that this diagram is not exhaustive and does not encompass all distinctions between Python and R. For a more detailed and comprehensive breakdown tailored to your specific projects, MIT has a good conversion resource here.
To summarize the key differences between Python and R, let’s highlight a couple of items:
- Syntax: Python adopts a more straightforward and concise syntax, whereas R’s syntax tends to involve heavier use of parentheses, brackets, and symbols. This can make R code initially appear more complex, but we will explore this later on in the tutorial.
- Data Manipulation: Python relies more on external libraries like NumPy and pandas for complex data manipulation tasks. In contrast, R often provides built-in functions and features specifically tailored for data manipulation.
We can explore these differences in practice to gain a more complete understanding of how Python and R contrast. Let’s move on to the tutorial section, where we will go through some simple data cleaning and transformation and then build visuals from the data.
Our R package breakdown
Before we begin, let’s get familiar with the R packages we will be working with:
- `tidyverse`: This package was created to follow the principles of tidy data (as the name suggests) and contains many essential packages. Amongst them, `dplyr` is popular for its capabilities in data manipulation and transformation, and `ggplot2` offers a powerful suite of tools for data visualization.
- `sqldf`: This R package allows you to perform SQL queries on R data frames, providing a more convenient way to apply SQL syntax for data manipulation and analysis within the R environment.
Tutorial: Electric Vehicle Licenses in Canada
In this tutorial, we will focus on examining the popularity of light and zero-emission vehicles following the launch of the Canadian Federal Government’s Incentives for Zero-Emission Vehicle Program (iZEV), a national program that offers financial rebates to Canadians who purchase electric vehicles, including plug-in hybrid vehicles. Luckily for us, we have access to the Government of Canada’s data spanning from the program’s inception in 2019 up until March 2023.
The analysis will be separated into two sections: data loading and cleaning, which makes some light comparisons between Python and R, followed by the data analysis. First, let’s outline the questions we want our visuals to answer, with the primary focus on understanding shifts in popularity over time:
- How has the number of vehicles registered under the iZEV program evolved over the years?
- What changes have occurred in automaker brand preferences since the implementation of the program?
- Which vehicle models have experienced the most significant increases and decreases in popularity?
To keep the tutorial concise and focused on the Python to R transition, the analysis will be kept fairly broad. However, we will show the final data visualizations to complete the picture of iZEV licenses registered in Canada so far. For a more in-depth analysis, including the complete R code, additional markdown, and visuals, as well as a link to the dataset, please refer to my GitHub repository _here_.
Data Loading and Cleaning
To begin, we will install the R packages we mentioned earlier and then load them into our R environment:
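The original code block for this step is not reproduced here, so the following is a minimal sketch of one way to do it. It assumes a CRAN mirror is configured and only installs packages that are not already present; the helper name `to_install` is ours, not from the original project.

```r
# Packages used throughout the tutorial
pkgs <- c("tidyverse", "sqldf")

# Install only what is missing (assumes a CRAN mirror is configured)
to_install <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(to_install) > 0) install.packages(to_install)

# Attach the packages for the current session
library(tidyverse)  # attaches dplyr, tidyr, stringr, ggplot2 and others
library(sqldf)
```

Installation only needs to happen once per machine, whereas the `library()` calls are needed in every new R session.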
To load our data we will use the `<-` operator to assign values to variables, as opposed to the `=` operator commonly used in Python. After that, we can quickly use the `dim` function to retrieve the number of rows and columns of our loaded data set, which is comparable to `numpy.shape()` or the `shape` attribute of a pandas DataFrame in Python.
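As a rough illustration of assignment and `dim`, here is a sketch that reads a tiny inline CSV rather than the real government file. The column names and rows are made up for the example, and the file path in the comment is a placeholder.

```r
# Toy stand-in for the iZEV export; columns and rows are illustrative only
csv_text <- "Year,Province,Vehicle_Make,Vehicle_Make_and_Model
2019,Quebec,Tesla,Tesla Model 3
2019,Ontario,Toyota,Toyota Prius Prime
2022,Quebec,Hyundai,Hyundai IONIQ 5"

# `<-` assigns the result of read.csv() to the name `df`
# (for a real file you would pass a path, e.g. read.csv("path/to/file.csv"))
df <- read.csv(text = csv_text)

# Rows first, then columns: 3 rows and 4 columns for this toy data
dim(df)
```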
In R, you can explore the first rows of a dataframe using the `head()` function. This is similar to the `df.head()` method in Python’s pandas library. Here’s an example of how to accomplish this in R:
The `%>%` operator helps to chain the operations. In the code above it lets R know to take the `df` dataframe and then show the first 5 lines, where you would have an output like the one below:
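A minimal sketch of the pipe with `head()` on a made-up data frame could look like this. Note that `head()` returns 6 rows when no count is given, so the 5 is passed explicitly.

```r
library(dplyr)  # makes the %>% pipe available

# Made-up rows standing in for the loaded data
df <- data.frame(
  Year = 2016:2022,
  Vehicle_Make = c("A", "B", "C", "D", "E", "F", "G")
)

# Read as: take df, then keep the first 5 rows (same as head(df, 5))
first_rows <- df %>% head(5)
first_rows
```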
After obtaining an initial overview of the data frame and shape, right away we can see some irrelevant columns which can be removed. At the same time, we can revise some lengthy column names for easier reference later on.
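Dropping and renaming columns can be sketched with `dplyr`'s `select()` and `rename()`. The column names below are hypothetical and will not match the real dataset exactly.

```r
library(dplyr)

# Hypothetical raw columns; the real export uses different names
df <- data.frame(
  Calendar.Year = c(2019, 2022),
  Vehicle.Make.and.Model = c("Tesla Model 3", "Hyundai IONIQ 5"),
  Country = c("Canada", "Canada"),
  Notes = c("", "")
)

clean_df <- df %>%
  select(-Country, -Notes) %>%                       # drop irrelevant columns
  rename(Year = Calendar.Year,                       # new_name = old_name
         Vehicle_Make_and_Model = Vehicle.Make.and.Model)

names(clean_df)
```

In `rename()` the new name goes on the left of the `=`, which is the reverse of the `{old: new}` dictionary that pandas expects.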
All of these steps can be replicated in Python using the `drop()`, `rename()` and `map()` functions.
The next steps of the data cleaning process often entail removing nulls and duplicate rows. For this particular dataset, however, we will only remove nulls. We have many duplicate rows, but in the absence of unique row identifiers (i.e. license ID) we risk losing valuable data if we remove duplicates, so we need to trust that the row inputs are correct. Here’s how you would remove nulls in R:
The equivalent in Python for dropping rows with nulls is `dropna()`.
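One way to drop rows containing nulls is `drop_na()` from `tidyr`, sketched here on toy data. Base R's `na.omit()` would give a similar result, and we are not claiming either is the call used in the original project.

```r
library(tidyr)

# Toy data with one complete row and two rows containing NA
clean_df <- data.frame(
  Year = c(2019, NA, 2022),
  Vehicle_Make = c("Tesla", "Toyota", NA)
)

# Keep only rows with no missing values in any column
no_nulls <- drop_na(clean_df)
nrow(no_nulls)
```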
This final step is in preparation for the final question on vehicle make and model popularity, where we can make car model naming conventions in the `Vehicle_Make_and_Model` column consistent. For example, we would consider ‘Hyundai Ioniq PHEV’ the same as ‘Hyundai Ioniq Plug-In hybrid’. We can do this by creating a list and referring to it with the `str_replace_all` function.
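As a sketch of the idea, `str_replace_all()` from `stringr` accepts a named character vector in which each name is the pattern to find and each value is its replacement. The single mapping below is taken from the example in the text; a real mapping would contain more entries.

```r
library(stringr)

models <- c("Hyundai Ioniq PHEV", "Hyundai Ioniq Plug-In hybrid", "Tesla Model 3")

# name = pattern to find (treated as a regular expression), value = replacement
replacements <- c("Hyundai Ioniq Plug-In hybrid" = "Hyundai Ioniq PHEV")

cleaned <- str_replace_all(models, replacements)
cleaned
```

Because the names are treated as regular expressions, any model name containing special characters such as parentheses or `+` would need escaping.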
Great, that’s some simple data cleaning out of the way. Now on to the interesting part: the analysis!
Data Analysis
The following visuals will explore each of the three questions we outlined at the beginning and explain how `ggplot` differs from `matplotlib`. Let’s take a look at our first question:
1. How has the number of vehicles registered under the iZEV program evolved over the years?
Here, we need to visualize years and counts of licenses registered, where the R `dplyr` package can be used to reshape the `clean_df` data frame into a suitable format. Each row of the data set counts as a vehicle entry, so `summarise(total=n())` is required to obtain total row counts:
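The counting step can be sketched as below, using a toy `clean_df` with only a `Year` column (an assumed column name).

```r
library(dplyr)

# Each row represents one registered vehicle (toy data)
clean_df <- data.frame(Year = c(2019, 2019, 2022, 2022, 2022))

yearly_totals <- clean_df %>%
  group_by(Year) %>%        # one group per year
  summarise(total = n())    # n() counts the rows in each group

yearly_totals
```

This is roughly what `df.groupby("Year").size()` would do in pandas.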
When it comes to plotting data, there are some differences between R’s `ggplot` and Python’s plotting libraries. In `ggplot`, a layering syntax is used, where different components are added using the `+` operator.
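To make the layering idea concrete, here is a small sketch. The yearly totals are made-up placeholder numbers, and the choice of a bar chart and the styling are our assumptions rather than the original plot.

```r
library(ggplot2)

# Placeholder totals, not the real iZEV figures
yearly_totals <- data.frame(
  Year = c(2019, 2020, 2021, 2022),
  total = c(10, 14, 19, 25)
)

p <- ggplot(yearly_totals, aes(x = Year, y = total)) +  # data and axis mapping
  geom_col(fill = "steelblue") +                        # layer: bars
  labs(title = "Licenses registered per year",          # layer: labels
       x = "Year", y = "Licenses") +
  theme_minimal()                                       # layer: theme
```

Calling `print(p)`, or simply typing `p` in the console, renders the chart. Each `+` adds one component to the plot object without changing the ones before it.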
Let’s compare this with how you might code the same plot using `matplotlib` in Python:
Overall, the length of code is fairly similar between the two, but with R the code looks more condensed thanks to the `+` operators. Now, let’s show this visual:
I also included a second plot illustrating the breakdown of iZEV recipients by province (Full R code can be found in my GitHub Repository here):
Observations
So what can we see from these two visuals? There was an overall increase in the number of zero-emission vehicles registered under the iZEV program, from 33,611 licenses in 2019 to 57,564 licenses in 2022, supporting the growing transition to electric vehicles in Canada. Note: the EV market represents a small portion of overall passenger vehicle registrations at ~5%⁴.
Breaking this out by province, we see Québec accounted for the largest share of licenses, surpassing that of BC, Ontario and Other provinces combined. This is likely due in part to stronger financial incentives, as Québec offers an additional rebate of up to $8,000 on top of the Federal government scheme (in contrast, BC offers only up to $3,000). In addition, a clear mandate to the utility company Hydro-Québec has aided the EV charger infrastructure in the province, helping to ease driver concerns about where to recharge.
2. What changes have occurred in automaker brand preferences since the implementation of the program?
To address our second question, we want to examine changes in popularity among automakers. Instead of focusing on absolute totals as we did in the first question, we will explore relative changes. By analyzing proportions, we can compare the performance of different automakers on a similar scale, allowing for more meaningful comparisons.
Now, bear with me here since there is a fair amount of R code before we arrive at our next visual, but what we are ultimately going for here are subplots by brand showing proportional change year over year. All of the data we have thus far only represents absolute counts, so we need to calculate ‘per 1,000 vehicles sold’ for the proportional scale.
To start, we will create a table that shows vehicle brand by year and count, which we can carry out using the `sqldf` R package:
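A sketch of such a query is shown below. The column names `Vehicle_Make` and `Year` and the toy rows are assumptions, and the result name `query_vehicle_counts` is borrowed from later in the text without knowing its exact original definition.

```r
library(sqldf)

# Toy data: one row per registered vehicle
clean_df <- data.frame(
  Year = c(2019, 2019, 2019, 2022, 2022),
  Vehicle_Make = c("Tesla", "Tesla", "Toyota", "Tesla", "Hyundai")
)

# sqldf lets us refer to the data frame by name inside the SQL string
query_vehicle_counts <- sqldf("
  SELECT Vehicle_Make, Year, COUNT(*) AS total
  FROM clean_df
  GROUP BY Vehicle_Make, Year
  ORDER BY Vehicle_Make, Year
")

query_vehicle_counts
```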
Next, we want to have each year split out into its own column, where we can use the `pivot_wider` function (fairly similar to the `pivot` function in Python).
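A minimal sketch of the reshape, using made-up counts and an assumed `total_` prefix for the new year columns:

```r
library(tidyr)

# Long format: one row per brand and year (made-up counts)
counts_long <- data.frame(
  Vehicle_Make = c("Tesla", "Tesla", "Toyota", "Toyota"),
  Year = c(2019, 2022, 2019, 2022),
  total = c(300, 320, 150, 90)
)

# Wide format: one row per brand, one column per year
counts_wide <- pivot_wider(
  counts_long,
  names_from = Year,        # values of Year become column names
  values_from = total,      # cells are filled from total
  names_prefix = "total_"   # avoids column names that start with a digit
)

names(counts_wide)
```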
After that, we want to calculate the `per_1K` licenses for each year, which we can do by taking each vehicle count by brand, dividing it by the total vehicles registered and multiplying this by 1,000.
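The calculation described above can be sketched with `mutate()`. The counts are invented so that the arithmetic is easy to check by hand, and the column names are assumptions.

```r
library(dplyr)

# Made-up counts per brand and year
counts_wide <- data.frame(
  Vehicle_Make = c("Tesla", "Toyota", "Hyundai"),
  total_2019 = c(300, 150, 50),    # sums to 500
  total_2022 = c(320, 90, 390)     # sums to 800
)

per_1k <- counts_wide %>%
  mutate(
    per_1K_2019 = total_2019 / sum(total_2019) * 1000,
    per_1K_2022 = total_2022 / sum(total_2022) * 1000
  )

per_1k
```

For Tesla in 2019 this gives 300 / 500 × 1,000 = 600 per 1,000 licenses. Each `per_1K` column sums to 1,000, which is a handy sanity check.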
Now we want to calculate the difference between 2022 and 2019, looking only at complete years in terms of proportional change.
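A sketch of the difference step, again on invented per-1,000 values and with a hypothetical column name `change_2019_2022`:

```r
library(dplyr)

# Invented per-1,000 values for two complete years
per_1k <- data.frame(
  Vehicle_Make = c("Tesla", "Toyota", "Hyundai"),
  per_1K_2019 = c(600, 300, 100),
  per_1K_2022 = c(400, 112.5, 487.5)
)

changes <- per_1k %>%
  mutate(change_2019_2022 = per_1K_2022 - per_1K_2019) %>%  # positive = gained share
  arrange(desc(change_2019_2022))                           # biggest gains first

changes
```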
This next step re-pivots the years and the `per_1K` columns that we calculated into one long pivot table to help prepare these for our graphs. After that, we will join the absolute counts and the `per_1K` counts into one long pivot table.
`query_vehicle_counts` has everything we need already, so we just need to join these two:
Lastly, we want to rank each vehicle by its total for each year, which is where we can reuse `sqldf` and use window functions to do this easily.
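The re-pivot and join might look roughly like this. The contents of `query_vehicle_counts` and the wide table are toy stand-ins, and `per_1k_wide`, `per_1k_long` and `combined` are names chosen for this sketch.

```r
library(dplyr)
library(tidyr)

# Toy wide table of per-1,000 values
per_1k_wide <- data.frame(
  Vehicle_Make = c("Tesla", "Toyota"),
  per_1K_2019 = c(600, 400),
  per_1K_2022 = c(700, 300)
)

# Toy absolute counts in long format
query_vehicle_counts <- data.frame(
  Vehicle_Make = c("Tesla", "Tesla", "Toyota", "Toyota"),
  Year = c(2019, 2022, 2019, 2022),
  total = c(3, 7, 2, 3)
)

# Back to long format: one row per brand and year
per_1k_long <- per_1k_wide %>%
  pivot_longer(
    cols = starts_with("per_1K_"),
    names_to = "Year",
    names_prefix = "per_1K_",   # strip the prefix so only the year remains
    values_to = "per_1K"
  ) %>%
  mutate(Year = as.numeric(Year))  # match the type of Year in the counts table

# Attach the per-1,000 values to the absolute counts
combined <- query_vehicle_counts %>%
  left_join(per_1k_long, by = c("Vehicle_Make", "Year"))

combined
```

Converting `Year` back to numeric matters here, because `left_join()` will refuse to match a character column against a numeric one.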
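A ranking query with a window function could be sketched as follows. This assumes the SQLite version bundled with your `RSQLite` installation supports window functions (SQLite 3.25 or later), and `year_rank` is a column name chosen for this example.

```r
library(sqldf)

# Toy totals per brand and year
combined <- data.frame(
  Vehicle_Make = c("Tesla", "Toyota", "Hyundai", "Tesla", "Toyota", "Hyundai"),
  Year = c(2019, 2019, 2019, 2022, 2022, 2022),
  total = c(30, 20, 5, 35, 10, 40)
)

# RANK() restarts at 1 within each Year partition
ranked <- sqldf("
  SELECT *,
         RANK() OVER (PARTITION BY Year ORDER BY total DESC) AS year_rank
  FROM combined
")

ranked
```

The same ranking can be done in `dplyr` with `group_by(Year)` followed by `mutate()` and `min_rank(desc(total))` if you would rather stay out of SQL.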
Finally, to plot out how proportions of vehicles have changed over time we can use subplots (in `ggplot` this is done with `facet_wrap`, which is similar to `subplots` in the Python `matplotlib` library).
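A bare-bones sketch of faceting by brand, with toy values and without the styling of the original visual:

```r
library(ggplot2)

# Toy per-1,000 values by brand and year
combined <- data.frame(
  Vehicle_Make = rep(c("Tesla", "Toyota"), each = 2),
  Year = rep(c(2019, 2022), times = 2),
  per_1K = c(600, 700, 400, 300)
)

p <- ggplot(combined, aes(x = Year, y = per_1K)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ Vehicle_Make) +   # one panel per brand
  labs(x = "Year", y = "Licenses per 1,000")
```

Unlike `matplotlib`, where each axis is usually filled in a loop, `facet_wrap` splits the data into panels from a single plot definition.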
Here’s our visual:
Observations
The subplots above show us that, when comparing proportions, Tesla has the largest share of cars, accounting for 300 out of 1,000 electric vehicles (EVs) on the road in Canada from 2019–2022. In the past couple of years it has lost some market share as newer players such as Audi, Jeep, Mazda and Polestar have arrived on the Canadian market.
3. Which vehicle models have experienced the most significant increases and decreases in popularity?
Our last set of questions focuses on the popularity of specific vehicle models, where we can examine the changes in proportions of models purchased between 2019–2022. The code for this analysis is fairly similar to what we used for the previous question, exchanging `Vehicle_Make` for `Vehicle_Make_and_Model` (again, for a more detailed step-by-step guide please refer to my GitHub link here).
Let’s code these out and explore why we might be seeing these patterns:
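As a rough sketch of picking out and plotting the largest increases, the following uses placeholder model names and invented changes. `change_per_1K` is a hypothetical column holding the 2022 minus 2019 difference per 1,000 licenses.

```r
library(dplyr)
library(ggplot2)

# Placeholder models and invented changes per 1,000 licenses
model_changes <- data.frame(
  Vehicle_Make_and_Model = c("Model A", "Model B", "Model C", "Model D", "Model E"),
  change_per_1K = c(40, -25, 10, -5, 18)
)

# Keep the three largest increases
top_increases <- model_changes %>%
  slice_max(change_per_1K, n = 3)

# Horizontal bars, ordered by size of the change
p <- ggplot(top_increases,
            aes(x = reorder(Vehicle_Make_and_Model, change_per_1K),
                y = change_per_1K)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Change per 1,000 licenses")
```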
Observations
Based on the graph above we can see that the Hyundai IONIQ 5 saw the biggest boost in popularity, where for every 1,000 licenses there were 83 more purchased in 2022 versus 2019. In percentage terms, the model’s share rose by 8.3 percentage points over this period. It is important to note that the majority of the models that saw growth in popularity are SUVs, as is evident from the top five listed above. This aligns with the preferences of the North American market, where consumers have awaited larger electric vehicle options, shifting away from the smaller sedan formats that were previously dominant, such as the Tesla Model 3.
Let’s move onto coding and plotting out the model decreases:
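For the decreases, one option is `slice_min()`, which is the mirror image of `slice_max()`. The sketch below uses placeholder model names and invented values, with `change_per_1K` again a hypothetical column name.

```r
library(dplyr)

# Placeholder models and invented changes per 1,000 licenses
model_changes <- data.frame(
  Vehicle_Make_and_Model = c("Model A", "Model B", "Model C", "Model D", "Model E"),
  change_per_1K = c(40, -25, 10, -5, 18)
)

# Keep the two most negative changes
top_decreases <- model_changes %>%
  slice_min(change_per_1K, n = 2)

top_decreases
```

By default both functions keep ties, so the result can contain more than `n` rows when several models share the same change; passing `with_ties = FALSE` caps it at `n`.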
Observations
Whilst still in the top 5 ranks of most popular automakers, the Toyota Prius Prime has seen the biggest drop in popularity, with 114 fewer licenses (per 1,000) from 2019 to 2022. The drop may be due to supply issues and raised prices. In addition, because it is a plug-in hybrid model, its iZEV financial incentives are reduced: you can only get half of the full available rebate.
The Tesla Model 3 also saw a decrease in popularity, with a decline of 64 per 1,000 licenses, but it is worth noting that Tesla remains a dominant automaker in the market when looking at absolute totals.
Closing Thoughts
In conclusion, we have embarked on a mini data analysis project to explore the coding differences between Python and R. Hopefully, this has made R feel a little more approachable and has lent some inspiration to create compelling visuals on your own!
As a final, friendly reminder, for access to the full R code, please check out my GitHub repository here. Happy Coding!
All images unless otherwise noted are by the author.
References
- SlashData, State of the Developer Nation, Q1 2023
- M. Omar, 101 Data Science with Munira: Getting started with R and RStudio, Medium, Jun 2020
- A. Amidi and S. Amidi, Data manipulation R-Python conversion guide, MIT.edu
- Statistics Canada, Automotive Statistics, March 2023