
Learning R data types


Understanding the most critical step of mastering a programming language

Photo by Markus Spiske on Unsplash

In this article, I share my own experience of learning the data types in the R programming language. We may argue about which programming language is best, but I think everyone would agree that understanding its data types is the most fundamental and critical step in mastering a language.

All the code is available at https://github.com/frankligy/exercise_codes/tree/master/R


First, R has no scalar type, which means the smallest unit in R is a vector. For instance, "John" is a vector equivalent to c("John"). Likewise, a scalar value such as 100 is actually a vector of length 1, c(100). Don’t worry about what this odd-looking c() means; it is just the R function that creates a vector, and I use it here to emphasize that every scalar value is actually a vector. With that understood, it is natural to start with the vector data types.
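A quick way to convince yourself of this is to treat a single value like any other vector. The snippet below is a minimal sketch; the variable names are only illustrative.

```r
# A "scalar" in R is just a vector of length 1 (variable names are illustrative)
name <- "John"
value <- 100

is.vector(name)            # TRUE
length(name)               # 1
identical(value, c(100))   # TRUE
value[1]                   # indexing works like on any other vector
```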

Vector

What is a vector? In R, it is simply a sequence of values of the same data type. For example, c(1,2,3,4,5) is a vector, or more specifically a numeric vector, since all the elements are numeric. Using this concept, you can easily guess what a character vector looks like: something like c("John", "Mary", "Bill"). To give a comprehensive list of the vector data types, we have:

1. Numeric vector

We can further divide numeric vectors into three types:

a. Integer vector

b. Double vector (numbers with a fractional part; it is called double because R stores these values as 64-bit double-precision floating-point numbers by default)

c. Complex vector (complex numbers with a real part and an imaginary part)

2. Character vector

As mentioned above, it will be called character vector if all elements are characters.

3. Raw vector

Raw vectors are rarely used; a raw vector holds its elements as raw bytes. Since it is rarely encountered, we will skip this type and focus only on the frequently used types.

4. Logical vector

As the name suggests, it means a vector whose elements are all boolean values, namely, TRUE or FALSE.
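To see these types in practice, base R’s typeof() reports how a vector is stored. The sketch below uses made-up values; the last lines illustrate that c() coerces mixed inputs to a single common type, which is why a vector always has exactly one type.

```r
typeof(1L)            # "integer"
typeof(1.5)           # "double"
typeof(1 + 2i)        # "complex"
typeof("John")        # "character"
typeof(as.raw(255))   # "raw"
typeof(TRUE)          # "logical"

# Mixing types in c() coerces everything to one common type
mixed <- c(1, "John", TRUE)
typeof(mixed)         # "character"
```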

Next, I would like to discuss the most important operations on a vector, starting with how a vector can be constructed.

To create a vector, you can use the c() function as mentioned above. Alternatively, you can use the colon operator: 1:5 produces the same values as c(1,2,3,4,5), although it stores them as integers rather than doubles. To index a vector, you can simply use numeric indices. For example, let’s set x=1:5; then x[2] returns the second element of the vector, which is 2. x[-2] instead is a complementary index that returns all elements except the second one, so the result is c(1,3,4,5). The second way to index a vector is with a logical vector: x[c(TRUE, FALSE, TRUE, TRUE, TRUE)] achieves the same effect as x[-2]. In addition, it is worth noting that a vector can have associated names. To add a name to each element, you can use names(x) <- c("first", "second", "third", "fourth", "fifth"). This gives us the third way to index a vector, via its associated names: x[c("first")] returns the element named "first", which is 1.

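# Run the following code in the R console to reinforce what we covered
v1 <- c(1,2,3)
names(v1) <- c("first","second","third")
v1[c(1,2)]                # positive indices: first and second elements
v1[c(-3)]                 # negative index: everything except the third
v1[c(TRUE,TRUE,FALSE)]    # logical index: keep where TRUE
v1[c("first","second")]   # index by name
class(v1)                 # "numeric"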
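One detail worth checking for yourself: the colon operator and c() give the same values but not the same storage type. The sketch below reuses the x = 1:5 example from the text; y is just an illustrative name.

```r
x <- 1:5
y <- c(1, 2, 3, 4, 5)

all(x == y)       # TRUE: same values
typeof(x)         # "integer"
typeof(y)         # "double"
identical(x, y)   # FALSE: the storage types differ

names(x) <- c("first", "second", "third", "fourth", "fifth")
x[-2]                                  # drops the second element
x[c(TRUE, FALSE, TRUE, TRUE, TRUE)]    # same result as x[-2]
x["first"]                             # the element named "first", value 1
```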

Factor: integer vector with labels

You might have heard of this R-specific data type for storing categorical data. Essentially, a factor is just an integer vector with labels. To illustrate that, let’s look at the following example:

# Run this snippet in R console
gender <- factor(c("male","female","female","female","male"))
names(gender) <- c("first","second","third","fourth","fifth")
levels(gender)
nlevels(gender)
as.integer(gender)

Here we first create a variable called gender using the factor() function. We can then give names to its elements because, as I said, a factor is just an integer vector. Unlike a primitive vector, however, a factor carries labels that you can access with the levels() function, and you can obtain the number of categories with nlevels(). You can also easily coerce a factor into its underlying integer codes using as.integer().
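The "integer vector with labels" view can be checked directly. The sketch below rebuilds the gender factor and adds a second, hypothetical factor called sizes to show a common pitfall: as.integer() returns the internal codes, not the labels.

```r
gender <- factor(c("male", "female", "female", "female", "male"))

typeof(gender)       # "integer": the underlying storage
class(gender)        # "factor"
levels(gender)       # "female" "male" (sorted alphabetically by default)
as.integer(gender)   # 2 1 1 1 2: positions in levels(gender)

# Pitfall with number-like labels (sizes is a hypothetical example)
sizes <- factor(c("10", "20", "10"))
as.integer(sizes)                  # 1 2 1: the codes
as.integer(as.character(sizes))    # 10 20 10: the labels as numbers
```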

List: Vector whose elements can be of different types and lengths

A list is another special kind of vector. You can view it as a separate data type, but I found it beneficial to relate it to the primitive vector data types. Again, I will use an example to illustrate the commonalities and differences between lists and vectors:

# Run this snippet in R console
a_list <- list(
  c(1,2,3,4),
  matrix(1:12,nrow=3)
)
names(a_list) <- c("item1","item2")
# vector based index method, will return a sublist
a_list[c(1)]
a_list[c(TRUE,FALSE)]
a_list[c("item1")]
# list unique index method, will return the content for certain item
a_list$item1  
a_list[["item1"]]

unlist(a_list)

First, we construct a list variable a_list, which holds a vector c(1,2,3,4) and a matrix of dimension [3,4]. We can assign names to a list because, as I said, a list is a vector, and a vector can have a name for each element. To access an item in a list, you can use the vector-based methods, but they return a sublist. If you want the content of a slot directly, consider using the list-specific index methods: the dollar sign or double brackets. Finally, to convert an arbitrary list to a vector, we can use the unlist() function.
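The difference between the two indexing styles is easiest to see by checking the class of what comes back. This is a minimal sketch that rebuilds the same a_list with names set at construction time; flat is an illustrative variable name.

```r
a_list <- list(item1 = c(1, 2, 3, 4), item2 = matrix(1:12, nrow = 3))

class(a_list["item1"])     # "list": single brackets keep the list wrapper
class(a_list[["item1"]])   # "numeric": double brackets return the content
identical(a_list$item1, a_list[["item1"]])   # TRUE

flat <- unlist(a_list)
length(flat)               # 16 = 4 + 12
```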

Matrix and Array

I put matrix and array together to compare them directly. Everything we have discussed so far is one-dimensional data, so what if there is more than one dimension? The matrix data type handles two-dimensional data, and its structure can be seen from the way it is created. When calling the matrix() function, we first specify the data that we would like to fill the matrix with; in the example below, we fill the matrix with 1:12. Note that R by default fills in values column-wise, which means it first fills the first column of the matrix, then moves to the second column, and so on. This differs from, for example, NumPy in Python, which is row-wise by default. You can change this behavior by setting byrow=TRUE. Next, we specify the dimensions of the matrix: with nrow=4, ncol is automatically set to 3 (12/4=3). Finally, we can set the dimnames (rownames and colnames) of the matrix. At the end, you can access all the attributes of this matrix.

The following example is adapted from Richard Cotton’s book Learning R; all credit goes to the author.

a_matrix <- matrix(
  1:12,
  nrow=4,   # ncol is inferred as 12/4 = 3
  dimnames=list(
    c("one","two","three","four"),
    c("ein","zwei","drei")
  )
)
dim(a_matrix)
nrow(a_matrix)
ncol(a_matrix)
length(a_matrix)  # the product of each dimension
rownames(a_matrix)
colnames(a_matrix)
dimnames(a_matrix)  # a list
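To make the column-wise default concrete, here is a small sketch comparing the two fill orders on a 2 x 3 matrix; by_col and by_row are illustrative names.

```r
by_col <- matrix(1:6, nrow = 2)                 # default: fill column by column
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)   # fill row by row

by_col[1, ]   # 1 3 5
by_row[1, ]   # 1 2 3

# Filling by row equals transposing a column-filled 3 x 2 matrix
all(by_row == t(matrix(1:6, nrow = 3)))   # TRUE
```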

When the data has more than two dimensions, the array data type comes into play. An array simply stores n-dimensional data. To read the dimensions of an array, note that the dim vector lists rows first, then columns, then the higher dimensions, so the highest dimension is the last element of dim. This differs from NumPy in Python, where a stack of two 4×3 matrices has shape (2, 4, 3), with the highest dimension first. Another difference from a matrix is that it is recommended to use dimnames rather than rownames and colnames directly, because the latter only cover the first two dimensions of an n-dimensional array.

For both matrices and arrays, the length() function returns the product of all dimensions. In the matrix example it is 12, and in the array example it is 24.

The following example is adapted from Richard Cotton’s book Learning R; all credit goes to the author.

three_d_array <- array(
  1:24,
  dim=c(4,3,2),
  dimnames=list(
    c("one","two","three","four"),
    c("ein","zwei","drei"),
    c("un","deux")
  )
)
dim(three_d_array)
dimnames(three_d_array)
length(three_d_array)
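Indexing follows the same order as dim: row, column, then slice. The sketch below uses an unnamed array called arr (an illustrative name) with the same shape as above, so the values can be traced by hand from the column-wise fill.

```r
arr <- array(1:24, dim = c(4, 3, 2))   # 4 rows, 3 columns, 2 slices

arr[2, 3, 1]      # 10: row 2, column 3 of the first slice
arr[1, 1, 2]      # 13: the second slice starts after the first 12 values
dim(arr[, , 1])   # 4 3: one slice is an ordinary matrix
length(arr)       # 24 = 4 * 3 * 2
```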

DataFrame

This is probably the most famous data type in R, and it is widely used in every field and application. I want to start by understanding what it is: a data frame can be viewed either as a matrix whose columns may have different data types, or as a list in which each item is a column and each column is itself a vector (or factor) of the same length.

With that understood, we use a simple example to review how to construct a data frame and access its dimnames:

a_data_frame <- data.frame(
  x=letters[1:5],
  y=rnorm(5),
  z=runif(5)>0.5
)
rownames(a_data_frame)
colnames(a_data_frame)
dimnames(a_data_frame)
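Both views of a data frame, as a list of columns and as a matrix-like table, can be checked in a few lines. The sketch below uses fixed values instead of random numbers so the results are reproducible; df and its contents are illustrative.

```r
df <- data.frame(
  x = c("a", "b", "c"),
  y = c(0.5, 1.5, 2.5),
  z = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE   # keep x as character regardless of R version
)

is.list(df)         # TRUE: a data frame is a list of columns
sapply(df, class)   # x: "character", y: "numeric", z: "logical"
df$y                # list-style access to one column
df[["y"]]           # same column
df[2, "y"]          # matrix-style access: row 2, column y -> 1.5
dim(df)             # 3 3
```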

There are many other data frame operations that are beyond the scope of this article. I only intend to illustrate the underlying relationships among the primitive R data types and to show what they look like and what they consist of. Additional R packages such as reshape2, dplyr, and the wider tidyverse usually provide more elegant solutions for data frame operations than base R.

S3 and S4 object

R supports object-oriented programming (OOP). According to Richard Cotton’s book Learning R, there are six different OOP systems available in R, namely S3 (a lightweight system for overloading functions), S4 (a fully featured OOP system), Reference Classes (described there as a modern replacement for S4), proto (a lightweight wrapper for prototype-based programming), R.oo (an extension of S3), and OOP (a precursor to Reference Classes, now obsolete). In my own experience, I have mostly encountered S3 and S4 objects, so I will focus on these two OOP systems.

From a standard OOP perspective, a custom class or object has its own attributes and methods (functions). In the following example, I construct an S3 object by first creating a list and then using the class() function to set the name of its class. We can then use the dollar sign to access each attribute (item in the list). Defining methods, however, looks slightly strange compared to other OOP systems: you first define a generic function getGPA that calls UseMethod(), then a default method getGPA.default as the fallback, and finally the class-specific method getGPA.student.

The following snippet is adapted from https://data-flair.training/blogs/object-oriented-programming-in-r/ and they own all the credit.

# S3 object
s <- list(name="John",age=29,GPA=4.0)
class(s) <- "student"
s$age  # how to access S3 object's attribute
# generic function: dispatches on the class of obj
getGPA <- function(obj){
   UseMethod("getGPA")
}
# fallback method for objects without a more specific method
getGPA.default <- function(obj){
   cat("This is a generic function\n")
}
# method for objects of class "student"
getGPA.student <- function(obj){
   cat("Total GPA is", obj$GPA,"\n")
}
getGPA(s)
attributes(s)
# $names
# [1] "name" "age"  "GPA"
# $class
# [1] "student"
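To see how S3 picks a method, it helps to call the same generic on objects with and without a matching class. The sketch below uses a hypothetical generic called describe (not part of the original example) that returns strings instead of printing, so the dispatch result is easy to inspect.

```r
# Generic: only decides which method to call (describe is a hypothetical name)
describe <- function(obj) {
  UseMethod("describe")
}

# Fallback used when no class-specific method exists
describe.default <- function(obj) {
  "an object without a describe method"
}

# Method for objects whose class attribute contains "student"
describe.student <- function(obj) {
  paste(obj$name, "has a GPA of", obj$GPA)
}

s <- structure(list(name = "John", GPA = 4.0), class = "student")
describe(s)              # "John has a GPA of 4"
describe(1:3)            # "an object without a describe method"
inherits(s, "student")   # TRUE
```

The method name is simply the generic’s name, a dot, and the class name, which is how UseMethod() finds it.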

S4 objects differ from S3 objects in how they are constructed: in S4 we use the setClass() function to create an S4 class and the new() function to instantiate the corresponding S4 object. The @ symbol is used to access attributes (slots). S4 also has its own syntax for registering functions: setGeneric() declares a generic and setMethod() attaches a method to a class.

The following snippet is adapted from https://data-flair.training/blogs/object-oriented-programming-in-r/ and they own all the credit.

# S4 object
Agent <- setClass(  
    "Agent",  
    slots = c(location = "numeric", velocity = "numeric", active = "logical"),  
    prototype = list(location = c(0.0,0.0), active = TRUE, velocity = c(100.0,0.0))
)
a <- new("Agent", location=c(1,5), active = FALSE, velocity = c(9,0))
is.object(a)
isS4(a)
slotNames(a)
slot(a,"location") <- c(1,5)
a@location  # how to access S4 object's attribute
setGeneric("getLocation", function(object){
    standardGeneric("getLocation")
})
setMethod("getLocation", signature(object="Agent"), function(object){
    object@location
})
showMethods("getLocation")
showMethods(class="Agent") 
getLocation(a)
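One practical consequence of declaring slot types in setClass() is that S4 checks them when an object is created. The sketch below uses a hypothetical Point class and a hypothetical generic norm2 (neither is part of the original example) to show the type check and a method defined with setMethod().

```r
library(methods)  # usually attached already; made explicit here

# Point and norm2 are hypothetical names used only for illustration
setClass("Point", slots = c(x = "numeric", y = "numeric"))

p <- new("Point", x = 1, y = 2)
p@x   # 1

# Supplying a character value for a numeric slot raises an error
bad <- tryCatch(
  new("Point", x = "one", y = 2),
  error = function(e) "rejected"
)
bad   # "rejected"

setGeneric("norm2", function(object) standardGeneric("norm2"))
setMethod("norm2", "Point", function(object) sqrt(object@x^2 + object@y^2))
norm2(new("Point", x = 3, y = 4))   # 5
```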

That’s it. I hope this is of some help. Please let me know if I made any mistakes above, since I am still learning R as well. If you want to check the code, please refer to https://github.com/frankligy/exercise_codes/tree/master/R

