Visualizing Trends of Multivariate Data in R using ggplot2

How to simply visualize data with multiple dependent and independent variables.

Generated by word clouds

Data visualization, the art of representing data through graphical elements, is an important part of any research or data analysis project. Visualization is essential both in Exploratory Data Analysis and in demonstrating the results of a study. Good visualizations help tell the story of your data and communicate the important messages of your analyses, since they allow us to quickly see trends and patterns, find outliers, and gain insight.

Often, it is either not easy to find the type of visualization that best describes your data, or it is not easy to find simple tools for generating the plots of interest. For example, visualizing trends in high-dimensional datasets with multiple dependent and independent variables (and perhaps with interactions between them) can be tedious.

In this post, I introduce one of the techniques for visualizing 3D data that I have found very effective. This will be helpful if you are new to [R](https://www.r-project.org/) or if you have never used the ggplot2 library in R. ggplot2 has several built-in functions and capabilities that bring the flexibility needed for presenting complex data.

Additionally, this post includes some examples of how to adjust the function settings to create publication-ready figures and save them at high resolution. You can also find all the code in one place on my GitHub page.

Outline:

Part 1: Data explanation and preparation

Part 2: Visualize 3D data using _facet_grid()_ function

Part 3: Visualize 3D data with other ggplot2 built-in functions

Part 4: Visualize data with multiple dependent variables

Part 1: Data explanation and preparation

Here, I will be using an in-house dataset from a real-world problem in which we want to characterize noise in images of a set of subjects with respect to three key parameters set at the time of image acquisition. We won’t get into the details of these parameters and, for simplicity, we name them Var1, Var2, and Var3. The noise in an image is calculated as a standard deviation and is named SD.

Var1: Categorical at four levels of 100, 50, 25, 10
Var2: Categorical at three levels of k1, k2, k3
Var3: Categorical at three levels of st0.6, st1.0, st2.0
SD: Continuous in the range of (0, 500)

There are 4x3x3 = 36 combinations of these parameters. Combinations of all these parameters exist in the dataset, meaning that for each subject, there exist 36 images. Each image is at a different combination of the parameter set (see table below).
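
As a quick sanity check on the count, the 36 combinations can be enumerated with base R's expand.grid(). This is only an illustrative sketch: the data frame name param_grid is hypothetical, and the "Var1_Var2_Var3" label format is an assumption inferred from how Parameter_Set is split on "_" later in this post.

```r
# Enumerate every combination of the three acquisition parameters.
param_grid <- expand.grid(
  Var1 = c(100, 50, 25, 10),
  Var2 = c("k1", "k2", "k3"),
  Var3 = c("st0.6", "st1.0", "st2.0"),
  stringsAsFactors = FALSE
)

# Assumed label format: the three levels joined by "_"
param_grid$Parameter_Set <- paste(
  param_grid$Var1, param_grid$Var2, param_grid$Var3, sep = "_"
)

nrow(param_grid)  # 36 rows, one per parameter set
```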

# set csv_file name
# set working directory
# set output_path
# import libraries
library(ggplot2)
library(stringr)
library(ggpubr)

Read the data and prepare the data columns. One example of preparation is type conversion: here Var1 is a categorical variable with four levels. After reading the data, we first convert this column to numeric values and then to an ordered factor, so that its levels appear in increasing order in our plots.

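# Read data
data <- read.csv(csv_file, header = TRUE)
# Prepare data: convert Var1 to numeric values, then to an ordered factor
# so that its levels are sorted numerically (10, 25, 50, 100) in the plots
data$Var1 <- as.numeric(as.character(data$Var1))
data$Var1 <- ordered(data$Var1)
data$Parameter_Set <- factor(data$Parameter_Set)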

A few rows of the data. Each subject (identified by a unique ID) has 36 images, one for each of the 36 parameter sets.
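
The conversion step deserves a small illustration, because calling as.numeric() directly on a factor returns the internal level codes rather than the values. The sketch below uses a toy vector (var1_factor is a hypothetical name, not part of the dataset) to show the difference and how ordered() then sorts the levels numerically.

```r
# A toy factor holding the four Var1 levels as text
var1_factor <- factor(c("100", "50", "25", "10"))

# Pitfall: this returns the internal level codes, not the values
as.numeric(var1_factor)                # 2 4 3 1

# Going through as.character() recovers the actual numbers
var1_numeric <- as.numeric(as.character(var1_factor))
var1_numeric                           # 100 50 25 10

# ordered() sorts numeric input, so the levels come out in increasing order
var1_ordered <- ordered(var1_numeric)
levels(var1_ordered)                   # "10" "25" "50" "100"
```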

For the purpose of visualization, we will take the mean of noise over the images in each parameter set:

# Vectors that collect one value per parameter set
mean_SD = c()
mean_intensity = c()
Var1 = c()
Var2 = c()
Var3 = c()
for (ParamSet in levels(data$Parameter_Set)){
 # all images acquired with this parameter set
 data_subset = data[which(data$Parameter_Set==ParamSet),]
 mean_SD = c(mean_SD, mean(data_subset$SD))
 mean_intensity = c(mean_intensity, mean(data_subset$intensity))
 # the parameter set label is split on "_" into its Var1, Var2, Var3 parts
 parts = str_split(ParamSet, "_")[[1]]
 Var1 = c(Var1, parts[1])
 Var2 = c(Var2, parts[2])
 Var3 = c(Var3, parts[3])
}
Mean_DF = data.frame(Parameter_Set = levels(data$Parameter_Set), Var1 = Var1, Var2 = Var2, Var3 = Var3, mean_SD = mean_SD, mean_intensity = mean_intensity)

A few sample rows from Mean_DF. Note: this data frame has 36 rows, one for each of the 36 parameter sets.

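####### Prepare the Mean_DF dataframe
data <- Mean_DF
data$Var3 <- factor(data$Var3)
# as.character() guards against Var1 being stored as a factor
# (data.frame() converts strings to factors by default in R < 4.0)
data$Var1 <- as.numeric(as.character(data$Var1))
data$Var1 <- ordered(data$Var1)
data$Parameter_Set <- factor(data$Parameter_Set)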
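
For readers who prefer to avoid an explicit loop, the same per-parameter-set means could probably be computed with base R's aggregate(). The following is a sketch on synthetic data, not the in-house dataset: the toy data frame, its random values, and the three subjects are all made up for illustration, and it assumes Var1, Var2, and Var3 are available as separate columns.

```r
set.seed(42)

# Synthetic stand-in: 3 subjects x 36 parameter sets
toy <- expand.grid(
  ID   = 1:3,
  Var1 = c(100, 50, 25, 10),
  Var2 = c("k1", "k2", "k3"),
  Var3 = c("st0.6", "st1.0", "st2.0")
)
toy$SD        <- runif(nrow(toy), min = 0, max = 500)
toy$intensity <- runif(nrow(toy), min = 0, max = 1000)

# Mean of both dependent variables within each parameter combination
mean_df <- aggregate(cbind(SD, intensity) ~ Var1 + Var2 + Var3,
                     data = toy, FUN = mean)
names(mean_df)[names(mean_df) == "SD"]        <- "mean_SD"
names(mean_df)[names(mean_df) == "intensity"] <- "mean_intensity"

# Same preparation as above: Var1 as an ordered factor
mean_df$Var1 <- ordered(mean_df$Var1)

nrow(mean_df)  # 36
```

The result has one row per combination, like Mean_DF, but keeps Var1, Var2, and Var3 as columns from the start instead of recovering them by splitting a label.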

Part 2: Visualize 3D data using facet_grid() function

Let’s see the plot

Finally, we are going to visualize the 3D data of the mean noise in images across the three parameters! The [facet_grid()](https://ggplot2.tidyverse.org/reference/facet_grid.html) function in the ggplot2 library is the key function that allows us to plot the dependent variable across all possible combinations of multiple independent variables. ggplot2 lets you add layers and settings to a plot with the ‘+’ operator. Below we add facet_grid(), geom_point(), labs(), and ylim():

ggplot(data, aes(y = mean_SD, x = Var1)) +
  geom_point(size = 3) +
  facet_grid(Var3 ~ Var2, scales = "free_x", space = "free") +
  labs(title = " SD across all parameter sets", x = "Var1 ", y = "Mean SD") +
  ylim(0, 500)

It seems that a lot is going on in this plot, so let’s go over it: we have a plot with nine blocks. Each block corresponds to one combination of a Var2 level and a Var3 level (three levels of Var2 times three levels of Var3 gives nine blocks). Var3 is shown on the rows and Var2 is shown on the columns. In each block, we have Var1 on the horizontal axis and the dependent variable, mean_SD, on the vertical axis within the range of 0–500 (determined by ylim()).
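
Since the in-house dataset is not available to readers, here is a minimal self-contained sketch of the same nine-block layout. The demo data frame and its random mean_SD values are synthetic and purely illustrative; only the column names follow this post.

```r
library(ggplot2)
set.seed(1)

# Synthetic stand-in for the 36-row data frame of means
demo <- expand.grid(
  Var1 = c(10, 25, 50, 100),
  Var2 = c("k1", "k2", "k3"),
  Var3 = c("st0.6", "st1.0", "st2.0")
)
demo$mean_SD <- runif(nrow(demo), min = 0, max = 500)
demo$Var1 <- ordered(demo$Var1)

# Var3 levels define the rows of panels, Var2 levels the columns
p <- ggplot(demo, aes(x = Var1, y = mean_SD)) +
  geom_point(size = 3) +
  facet_grid(Var3 ~ Var2) +
  labs(title = "SD across all parameter sets", x = "Var1", y = "Mean SD") +
  ylim(0, 500)

print(p)
```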

The benefit of using the facet_grid() function here is that, as our data gets more complex and includes a larger number of variables, it remains intuitive to unpack the data and visualize the trend of the dependent variable at different combinations of the independent variables. For example, by comparing the block at the top right (k3, st0.6) with the block at the bottom left (k1, st2.0), we see a clear difference. At (k1, st2.0), mean_SD is constant when Var1 changes from 10 to 100, but at (k3, st0.6), mean_SD shows substantial variation when Var1 changes. As you can tell, this plot can show you the interactions between the independent variables and their individual impact on the dependent variable.

If we want to plot against only two of the parameters (Var1 and Var3), the input of facet_grid() changes as follows:

...+facet_grid(Var3 ~ . , scales = "free_x",space = "free")+ ...

In facet_grid(), it can help to set the scales and space arguments to "free" (or "free_x" / "free_y") instead of the default value of "fixed", so that the axis scales and the sizes of the panels can vary according to the range of the data in each row and column.
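
The effect of these two arguments is easiest to see when the panels do not share the same categories. The sketch below uses a small made-up data frame (demo, group, item, and value are all hypothetical names) in which group A has two items and group B has five.

```r
library(ggplot2)

# Made-up data: the two groups have different items on the x axis
demo <- data.frame(
  group = c("A", "A", "B", "B", "B", "B", "B"),
  item  = c("a1", "a2", "b1", "b2", "b3", "b4", "b5"),
  value = c(3, 5, 2, 4, 6, 1, 7)
)

# Default ("fixed"): both panels show all seven items and have equal width
p_fixed <- ggplot(demo, aes(x = item, y = value)) +
  geom_point(size = 3) +
  facet_grid(. ~ group)

# Free x scale and free x space: each panel shows only its own items,
# and panel width is proportional to the number of items it holds
p_free <- ggplot(demo, aes(x = item, y = value)) +
  geom_point(size = 3) +
  facet_grid(. ~ group, scales = "free_x", space = "free_x")

print(p_fixed)
print(p_free)
```

When every panel contains the same x categories, as with Var1 in this post, the two versions should look nearly identical, so the choice mostly matters for unbalanced data.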

If you are planning to publish your results or present at a conference, you have to pay attention to the format of your plot: the fonts, sizes, and colors of the text, the colors and sizes of the points and lines, and so on. Here is some help on how to make the above plot look nicer. We simply add the theme() function to adjust details of the axes and the strip labels of the blocks.

ggplot(data, aes(y = mean_SD, x = Var1)) +
  geom_point(size = 4) +
  facet_grid(Var3 ~ Var2, scales = "free_x", space = "free") +
  labs(title = " SD across all parameter sets", x = "Var1 ", y = "Mean SD") +
  ylim(0, 500) +
  theme(
    axis.text = element_text(size = 16, face = "bold"),
    axis.text.x = element_text(colour = "blue3"),
    axis.title.y = element_text(size = 18, face = "bold", colour = "black"),
    axis.title.x = element_text(size = 18, face = "bold", colour = "blue3"),
    strip.text = element_text(size = 18, colour = "blue3", face = "bold")
  )

If you want to save your R plots at high resolution you can use the following piece of code.

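# mytitle is the name of the output file
myplot <- ggplot(data, aes(y = mean_SD, x = Var1))+ ...
tiff(mytitle, units="in", width=8, height=5, res=300)
# print() is needed to draw the plot when this runs inside a script or function
print(myplot)
dev.off()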
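
An alternative that avoids opening and closing a graphics device by hand is ggplot2's own ggsave() function. The sketch below is illustrative only: demo_plot is a toy plot, and the output file name and temporary directory are placeholders rather than values used in this post.

```r
library(ggplot2)

# Toy plot standing in for the real figure
demo_plot <- ggplot(data.frame(x = 1:4, y = c(2, 4, 3, 5)), aes(x = x, y = y)) +
  geom_point(size = 3)

# Placeholder output location; replace with your own path
out_file <- file.path(tempdir(), "demo_plot.png")

# 8 x 5 inches at 300 dpi; the file type is inferred from the extension
ggsave(filename = out_file, plot = demo_plot,
       width = 8, height = 5, units = "in", dpi = 300)
```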

Part 3: Visualize 3D data with other ggplot2 built-in functions

Let’s suppose we don’t want to use facet_grid():

Another way of plotting high-dimensional data is simply to assign different colors and line types to the different levels of Var2 and Var3. In the following code, we assign different colors to different levels of Var2 and different line types to different levels of Var3.

Here we will create three data frames, one for each value of Var2. We will generate a plot for each data frame and use ggarrange() to combine the plots into one figure.

In the following plot, we intentionally adjusted the background colors and grid lines (panel.background and panel.grid are the relevant options in the theme() function).

Comparing the plot above with the plot with nine blocks (via facet_grid()), we can see that each has its own pros and cons: one combines the data into one diagram, while the other shows panels of the 3D data that allow for examining the data individually and separately. I myself prefer the version with panels of parameter sets.

Here is the code for the plot above:

#separate data frames based on Var2
data_k1 = data[which(data$Var2 == "k1"),]
data_k2 = data[which(data$Var2 == "k2"),]
data_k3 = data[which(data$Var2 =="k3"),]
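
The rest of the plotting code is not reproduced here, so the following is only a sketch of how the three per-level plots could be built and combined with ggarrange() from ggpubr. It runs on synthetic data; the helper plot_one_level(), the chosen colors, and the layout options are assumptions for illustration, not the original figure's settings.

```r
library(ggplot2)
library(ggpubr)
set.seed(7)

# Synthetic stand-in for the 36-row data frame of means
demo <- expand.grid(
  Var1 = c(10, 25, 50, 100),
  Var2 = c("k1", "k2", "k3"),
  Var3 = c("st0.6", "st1.0", "st2.0")
)
demo$mean_SD <- runif(nrow(demo), min = 0, max = 500)
demo$Var1 <- ordered(demo$Var1)

# Hypothetical helper: one plot per Var2 level, drawn in a single color,
# with the line type distinguishing the levels of Var3
plot_one_level <- function(df, level, colour) {
  ggplot(df[df$Var2 == level, ],
         aes(x = Var1, y = mean_SD, group = Var3, linetype = Var3)) +
    geom_line(colour = colour) +
    geom_point(colour = colour, size = 3) +
    ylim(0, 500) +
    labs(title = paste("Var2 =", level), x = "Var1", y = "Mean SD")
}

var2_levels  <- c("k1", "k2", "k3")
var2_colours <- c("red3", "blue3", "darkgreen")
plots <- Map(function(lv, col) plot_one_level(demo, lv, col),
             var2_levels, var2_colours)

# Place the three plots side by side with one shared legend
combined <- ggarrange(plotlist = plots, ncol = 3, nrow = 1,
                      common.legend = TRUE, legend = "bottom")
print(combined)
```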

Part 4: Visualize data with multiple dependent variables

Let’s suppose we want to visualize the trend of our data not only by the noise in the image but also by the intensity values of the image. We will now have a second dependent variable called intensity. In Part 1, alongside "mean_SD", we calculated the mean value of intensity ("mean_intensity") for each parameter set.

First, let’s take a look at this simple plot of "mean_SD" against Var1, where the range of "mean_intensity" values is shown via a color gradient (see the legend).

Code for the plot above:

ggplot(data, aes(x = Var1, y = mean_SD, group = interaction(Var3, Var2))) +
  geom_point(size = 3, aes(color = mean_intensity)) +
  geom_line(lwd = 1.5, aes(color = mean_intensity)) +
  scale_colour_gradient(low = "yellow", high = "red3") +
  labs(x = "Var1 ", y = "Mean_SD") +
  theme(
    strip.text = element_text(size = 20, face = "bold", colour = "red3"),
    axis.text = element_text(size = 20, face = "bold"),
    axis.title = element_text(size = 20, face = "bold", colour = "red3"),
    legend.text = element_text(size = 10, face = "bold"),
    panel.grid = element_line(colour = "white"),
    panel.background = element_rect(fill = "gray64"),
    legend.background = element_rect(fill = "gray64")
  )

Now we can use this idea to once again generate a plot that visualizes high-dimensional data with blocks and demonstrate impact of three independent variables on two dependent variables!

ggplot(data, aes(y = mean_SD, x = Var1)) +
  geom_point(size = 4, aes(color = mean_intensity)) +
  facet_grid(Var3 ~ Var2, scales = "free_x", space = "free") +
  scale_colour_gradient(low = "yellow", high = "red3") +
  labs(title = " SD across all parameter sets", x = "Var1 ", y = "Mean SD") +
  ylim(0, 500) +
  theme(
    axis.text = element_text(size = 16, face = "bold"),
    axis.text.x = element_text(colour = "blue3"),
    axis.title.y = element_text(size = 18, face = "bold", colour = "black"),
    axis.title.x = element_text(size = 18, face = "bold", colour = "blue3"),
    strip.text = element_text(size = 18, colour = "blue3", face = "bold"),
    panel.grid = element_line(colour = "grey"),
    panel.background = element_rect(fill = "gray64")
  )

Clear visualization is instrumental in obtaining insight from data. Understanding patterns and interactions is especially hard in high-dimensional data. R and its libraries, such as ggplot2, provide a useful framework for researchers, data enthusiasts, and engineers to explore data and perform knowledge discovery.

R and ggplot2 have many more capabilities for creating insightful visualizations, so I invite you to explore these tools. I hope that this brief tutorial has helped you get familiar with a powerful plotting tool for data analysis.

Nastaran Emaminejad

_Follow me on Twitter: @N_Emaminejad and LinkedIn: Nastaran_Emaminejad_

