Introduction to Data Frames in R

Many datasets are stored as data frames. Learn how to create a data frame, select interesting parts of a data frame, and order a data frame according to certain variables.

Linda Ngo
Towards Data Science

What is a data frame?

Recall that for matrices all elements must be the same data type. However, when doing market research, you often have questions such as:

  • ‘Are you married?’ or other ‘yes/no’ questions (logical)
  • ‘How old are you?’ (numeric)
  • ‘What is your opinion on this product?’ or other ‘open-ended’ questions (character)

The output, namely the respondents’ answers to the questions above, is a data set containing different data types. In practice, it is more common to work with data sets that mix data types than with data sets that hold just one.

A data frame has the variables of a data set as columns and the observations as rows.

Here is an example of a built-in data frame in R.
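As a quick sketch, and assuming the built-in data set meant here is mtcars (the one inspected later in this article), the following checks that it really is a data frame and reports its size and variable names:

```r
# mtcars ships with base R (datasets package), so nothing needs to be loaded
is_df <- is.data.frame(mtcars)   # TRUE if mtcars is a data frame
size <- dim(mtcars)              # rows (observations), then columns (variables)
variables <- names(mtcars)       # the variable (column) names

is_df
size
variables
```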

Taking a Look at the Data Set

Working with large data sets is not uncommon. When working with (extremely) large data sets and data frames, you must first develop a clear understanding of the structure and main elements of the data set. Therefore, it can often be useful to show only a small part of the entire data set.

To do this in R, you can use the functions head() or tail(). The head() function shows the first part of the data frame. The tail() function shows the last part. Both functions print a top line called the ‘header’, which contains the names of the different variables in the data set.
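A minimal sketch of both functions, assuming the built-in mtcars data set as the example. By default each call returns six rows; the n argument changes that:

```r
first_rows <- head(mtcars)          # first 6 rows by default
last_rows <- tail(mtcars)           # last 6 rows by default
first_three <- head(mtcars, n = 3)  # n sets how many rows to return

first_rows
last_rows
first_three
```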

Taking a Look at the Structure

Another method to get a rapid overview of the data is the str() function. The str() function shows the structure of the data set. For a data frame it gives the following information:

  • The total number of observations (e.g. 32 car types)
  • The total number of variables (e.g. 11 car features)
  • A full list of the variable names (e.g. mpg, cyl, …)
  • The data type of each variable (e.g. num )
  • The first observations

When you receive a new data set or data frame, applying the str() function is often the first step. It is a great way to get more insight into the data set before deeper analysis.

To investigate the structure of mtcars, use the str() function.
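The call itself is a one-liner. In this sketch the two extra lines simply cross-check the counts that str() reports in its first line of output:

```r
# Compact overview: observations, variables, types, and first values
str(mtcars)

# The same counts, retrieved directly
n_obs <- nrow(mtcars)    # number of observations (rows)
n_vars <- ncol(mtcars)   # number of variables (columns)
```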

Creating a Data Frame

Let’s construct a data frame that describes the main characteristics of the eight planets in our solar system. Suppose the main features of a planet are:

  • The type of planet (Terrestrial or Gas Giant)
  • The planet’s diameter relative to the diameter of the Earth.
  • The planet’s rotation across the sun relative to that of the Earth.
  • If the planet has rings or not (TRUE or FALSE).

Some research shows that the following vectors are necessary: name, type, diameter, rotation, and rings. The first element in each of these vectors corresponds to the first observation.

# Definition of vectors
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",
"Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

To construct a data frame, use the data.frame() function. As arguments, the vectors from before must be passed: they will become the different columns of the data frame. Because every column has the same length, the vectors you pass should also have the same length. Remember though, it is possible that they contain different types of data.

To construct the planets data frame, we will use the data.frame() function and pass the vectors name, type, diameter, rotation, and rings as arguments.

# Create a data frame from the vectors
planets_df <- data.frame(name, type, diameter, rotation, rings)

The data frame has 8 observations and 5 variables.

Let’s investigate the structure of the new data frame. Recall that we can use the str() function to accomplish this.
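A sketch of that check, repeating the vector definitions so it runs on its own. One caveat: in R 4.0.0 and later, name and type show up as character (chr) columns, while older versions convert them to factors by default.

```r
# Definition of vectors (same as above)
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",
          "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

# Create the data frame and inspect its structure
planets_df <- data.frame(name, type, diameter, rotation, rings)
str(planets_df)
```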

Selection of Data Frame Elements

Similar to vectors and matrices, you select elements from a data frame using square brackets []. By using a comma, you can indicate what to select from the rows and columns respectively. For example:

  • my_df[1,2] selects the value at the first row and second column in my_df.
  • my_df[1:3, 2:4] selects rows 1 to 3 and columns 2 to 4 in my_df.

Selecting all elements of a row or column works exactly as it does for matrices. For example, my_df[1,] selects all elements of the first row, and my_df[,1] selects all elements of the first column.
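The same patterns can be sketched on the planets data frame from the previous section (redefined here so the snippet runs on its own). The variable names on the left are only illustrative:

```r
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",
          "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
planets_df <- data.frame(name, type, diameter, rotation, rings)

single_value <- planets_df[2, 3]    # row 2, column 3: the diameter of Venus
block <- planets_df[1:3, 2:4]       # rows 1 to 3, columns 2 to 4
first_row <- planets_df[1, ]        # every column of the first row
first_column <- planets_df[, 1]     # every row of the first column
```

Note that selecting a single column returns a plain vector, while selecting a row or a block returns a smaller data frame.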

For you to try

Select the diameter of Mercury from the planets_df data frame (this is the same data frame we used earlier). This is the value at the first row and the third column. Next, select all data on Mars (the fourth row).

# Definition of vectors
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",
"Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
# Create a data frame from the vectors
planets_df <- data.frame(name, type, diameter, rotation, rings)

Solution

# Print out diameter of Mercury (row 1, column 3)
planets_df[1,3]
# Print out data for Mars (entire fourth row)
planets_df[4,]

Selection of Data Frame Elements (2)

Instead of using numerics to select elements of a data frame, you can also use variable names to select columns of a data frame.

For example, suppose you want to select the first three elements of the type column. Using the first method, you would use

planets_df[1:3,2]

A possible disadvantage of this approach is that you have to know (or look up) the column number of type, which can get difficult if there are a lot of variables. Thus, it is often easier to use the variable name:

planets_df[1:3, "type"]

For you to try

Select and print out the first 5 values in the "diameter" column of planets_df.

# Definition of vectors
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",
"Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
# Create a data frame from the vectors
planets_df <- data.frame(name, type, diameter, rotation, rings)

Solution

# Select first 5 values of diameter column
planets_df[1:5, "diameter"]

Only planets with rings

Often, it is helpful to select a specific variable from a data frame. To do this, you’ll want to select an entire column. Suppose you want to select all elements of the third variable, diameter. You can use either of the following commands:

planets_df[,3]
planets_df[,"diameter"]

However, there is a short-cut. If your columns have names, you can use the $ sign:

planets_df$diameter
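As a small check, the three forms can be compared with identical(). This sketch rebuilds only the two columns it needs, so the data frame here is a cut-down stand-in for planets_df with diameter still in a named column:

```r
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
small_df <- data.frame(name, diameter)   # stand-in with diameter as column 2

by_position <- small_df[, 2]             # select by column number
by_name <- small_df[, "diameter"]        # select by column name
by_dollar <- small_df$diameter           # select with the $ shortcut

identical(by_position, by_name)    # TRUE
identical(by_name, by_dollar)      # TRUE
```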

For you to try

Select the rings variable from the planets_df data frame we created earlier and store the results as rings_vector. Print out rings_vector to see if you got it right.

# Definition of vectors
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",
"Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
# Create a data frame from the vectors
planets_df <- data.frame(name, type, diameter, rotation, rings)

Solution

# Select the rings variable from planets_df
rings_vector <- planets_df$rings

# Print out rings_vector
rings_vector

Only planets with rings (2)

Suppose you want to use R to help you get data on the planets in our solar system that have rings. Up above, we created the rings_vector which contains the following:

> rings_vector
[1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE

We see that the first four observations (or planets) don’t have rings (FALSE), but the remaining four do (TRUE).

To get a nice overview of the names of these planets, their diameters, etc., we’ll use rings_vector to select the data for the four planets with rings.

# Select all columns for planets with rings
planets_df[rings_vector,]

Here we’ve learned how to select a subset from a data frame (planets_df) based on whether or not a certain condition was true (rings or no rings), extracting all relevant data.
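The same logical vector can also be negated or combined with a column selection. The following sketch (variable names are illustrative) picks the planets without rings, and then only the names of the planets with rings:

```r
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",
          "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
planets_df <- data.frame(name, type, diameter, rotation, rings)
rings_vector <- planets_df$rings

# ! flips TRUE and FALSE, so this keeps the planets without rings
no_rings_df <- planets_df[!rings_vector, ]

# Rows chosen by the logical vector, columns chosen by name
ringed_names <- planets_df[rings_vector, "name"]

no_rings_df
ringed_names
```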

Only planets with rings but shorter

We can do exactly the same thing as what we did above by using the subset() function as a short-cut. The subset function returns subsets of vectors, matrices, or data frames which meet conditions.

subset(my_df, subset = some_condition)

The subset function takes in a couple of arguments. The first argument, my_df, specifies the data set to be subsetted. The second argument is a logical expression indicating elements or rows to keep: missing values are taken as false. By adding the second argument, you give R the necessary information and conditions to select the correct subset.

Previously, to get the subset of planets with rings, we used rings_vector.

# Select the rings variable from planets_df
rings_vector <- planets_df$rings
# Select all columns for planets with rings
planets_df[rings_vector,]

We can get the exact same results without needing the rings_vector by using the subset() function.

subset(planets_df, subset = rings)
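The sketch below compares the two approaches and illustrates the remark that missing values are taken as false. The survey_df data frame is hypothetical and exists only for this illustration:

```r
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",
          "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
planets_df <- data.frame(name, type, diameter, rotation, rings)

# Both approaches keep the same four rows
with_subset <- subset(planets_df, subset = rings)
with_brackets <- planets_df[planets_df$rings, ]
all.equal(with_subset, with_brackets)

# Hypothetical data frame with one missing age, for illustration only
survey_df <- data.frame(id = 1:3, age = c(25, NA, 40))
kept_by_subset <- subset(survey_df, subset = age > 30)   # NA counts as FALSE: row dropped
kept_by_brackets <- survey_df[survey_df$age > 30, ]      # NA condition yields a row of NAs

kept_by_subset
kept_by_brackets
```

With complete data the two forms are interchangeable. When the condition can be NA, subset() is the more forgiving choice.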

For you to try

Select the subset of planets from planets_df that have a diameter smaller than Earth. You’ll need to use the subset() function. Hint: Because the diameter variable is a relative measure of the planet’s diameter with respect to the planet Earth, your condition is diameter < 1.

# Definition of vectors
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",
"Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
# Create a data frame from the vectors
planets_df <- data.frame(name, type, diameter, rotation, rings)

Solution

# Select planets with diameter < 1
subset(planets_df, subset = diameter < 1)

The output shows that three planets have a smaller diameter than Earth: Mercury, Venus, and Mars.

Sorting

In data analysis, it is often helpful to sort data according to a certain variable in the data set. This can be done with the help of the function order() in R.

When applied to a vector, this function returns the positions (indices) of the elements in ascending order of their values: the index of the smallest element first, then the index of the second smallest, and so on. For example:

> a <- c(100, 10, 1000)
> order(a)
[1] 2 1 3

In this example, 10, which is the second element in a, is the smallest element, so 2 comes first in the output of order(a). 100, which is the first element in a, is the second smallest element, so 1 comes second. Finally, the third and last element in a, 1000, is the largest element, so 3 comes last in the output of order(a).

With the output of order(a), we can reshuffle a so that we get an output that is sorted:

> a[order(a)]
[1] 10 100 1000
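Because order() returns positions rather than ranks, it helps to try a vector where the two differ. The vector b below is made up for this illustration; rank() and the decreasing argument are both part of base R:

```r
b <- c(30, 10, 20)

positions <- order(b)                 # 2 3 1: where the smallest, middle, largest values sit
ranks <- rank(b)                      # 3 1 2: the rank of each element in place
sorted_b <- b[positions]              # 10 20 30
descending <- order(b, decreasing = TRUE)   # 1 3 2: largest value first

positions
ranks
sorted_b
descending
```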

Sorting your data frame

One useful thing we can do with the order() function is sorting our data frame. Suppose we want to sort our data frame by diameter. That is, the data frame is rearranged such that it starts with the smallest planet and ends with the largest one.

We’ll first have to use the order() function on the diameter column of planets_df. Let’s store the results as positions.

# Use order() to create positions.
positions <- order(planets_df$diameter)

Next, we’ll reshuffle planets_df with the positions vector as row indexes inside square brackets. We are going to keep all columns.

# Use positions to sort planets_df
planets_df[positions, ]

And voilà, we now have a data frame sorted by diameter.
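To start with the largest planet instead, one option is the decreasing argument of order(). A minimal sketch, with illustrative variable names:

```r
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",
          "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
planets_df <- data.frame(name, type, diameter, rotation, rings)

# Ascending order, as above
smallest_first <- planets_df[order(planets_df$diameter), ]

# Descending order: largest planet first
largest_first <- planets_df[order(planets_df$diameter, decreasing = TRUE), ]

smallest_first
largest_first
```

The original planets_df is left unchanged in both cases; assign the result back to planets_df if the sorted order should be kept.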

Notes

All images, unless specified, are owned by the author. The banner image was created using Canva.

Currently pursuing a degree in Computer Science. Passionate about learning new things.