Data visualization is a powerful tool for exploratory data analysis. We can use it to reveal the underlying structure within data or the relationships among variables. An overview of basic descriptive statistics can also be obtained from a data visualization.
Data visualization is of crucial importance in the field of data science, so there are many libraries and packages in this domain. Although they differ in syntax and methods for creating visualizations, the ultimate goal is the same: to explore and understand the data.
In this article, we will explore a medical cost dataset using the ggplot2 library of the R programming language. The dataset is good for practice because it contains a mixture of variables with different data types.
What we cover in this article can also be considered as a practical guide for the ggplot2 library. I will try to clearly explain the logic behind generating the plots so that you can apply it to other tasks and datasets.
I use RStudio, which is a highly popular IDE for the R programming language. The first step is to import the libraries and read the csv file that contains the dataset.
> library(ggplot2)
> library(readr)
> insurance <- read_csv("Downloads/datasets/insurance.csv")

The dataset contains personal information about the customers of an insurance company and how much they are charged for insurance.
In ggplot2, we first create a coordinate system as a base layer using the ggplot function. Then we add layers by specifying the type of plot and the variables to be plotted. We can also add further instructions to make the visualizations more informative.
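The layering idea can be sketched step by step. The demo data frame below is made up for illustration; only its column names mirror the columns used in this article, and the title text is an arbitrary choice.

```r
library(ggplot2)

# Made-up rows for illustration; not taken from the insurance dataset.
demo <- data.frame(
  bmi     = c(21.5, 27.3, 30.1, 35.8),
  charges = c(3200, 8700, 15400, 39000)
)

# Step 1: the base layer, a coordinate system bound to the data.
base <- ggplot(demo)

# Step 2: add a layer by choosing a plot type and the variables to plot.
with_points <- base + geom_point(mapping = aes(x = bmi, y = charges))

# Step 3: add further instructions, here a title.
final <- with_points + labs(title = "Charges vs. bmi (demo data)")

print(final)
```

Each step returns a plot object, so intermediate results such as base can be stored in a variable and reused with different layers.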
The target variable in our dataset is the charges column so it is better to start with exploring this column. We can use a histogram to visualize the distribution.
A histogram divides the value range of a continuous variable into discrete bins and counts the number of observations in each bin.
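To make the binning step concrete, here is a minimal base R sketch that counts the observations per bin by hand. The values in x and the bin edges are made up for illustration and are not taken from the insurance dataset; geom_histogram picks its own bin edges unless told otherwise.

```r
# Made-up values standing in for a continuous variable (not the real charges).
x <- c(1200, 3400, 4100, 8900, 9500, 12800, 15300, 21000, 27500, 39000)

# Four equal-width bins covering 0 to 40000 (the edges are an assumption).
edges <- seq(0, 40000, by = 10000)

# cut() assigns each value to a bin; table() counts the values in each bin.
bins <- cut(x, breaks = edges, include.lowest = TRUE, dig.lab = 5)
bin_counts <- table(bins)

print(bin_counts)
```

The heights of the bars in a histogram correspond to counts like these, one bar per bin.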
> ggplot(insurance) + geom_histogram(mapping = aes(x=charges), color='blue', fill='lightblue')
- We pass the data to the ggplot function which creates a coordinate system as the base layer.
- The geom_histogram function adds a layer on this coordinate system by plotting a histogram according to the given parameters. We use the mapping parameter to specify the columns to be plotted. The color and fill parameters are related to the visual properties of the plot.

The charges are mostly less than 20000. As the value gets higher, the number of observations decreases. Let’s make this plot more informative by separating smokers and non-smokers.
> ggplot(insurance) + geom_histogram(mapping = aes(x=charges, color=smoker), bins = 15)
We pass the column used as a separator to the color parameter inside the aes function. Note that this is different from the color parameter we used previously, which was outside the aes function and set a single fixed color.
The bins parameter is used to specify the number of bins in the histogram.
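The difference between the two uses of color can be seen side by side in the following sketch. The demo data frame is made up for illustration (only the column names match the dataset), and bins = 3 is chosen just to suit the tiny sample.

```r
library(ggplot2)

# Made-up rows for illustration; not taken from the insurance dataset.
demo <- data.frame(
  charges = c(2100, 4500, 7800, 16500, 23000, 38000),
  smoker  = c("no", "no", "no", "yes", "yes", "yes")
)

# Setting: color is outside aes(), so every bar gets the same blue outline.
p_fixed <- ggplot(demo) +
  geom_histogram(mapping = aes(x = charges), bins = 3,
                 color = "blue", fill = "lightblue")

# Mapping: color is inside aes() and tied to a column, so the outline
# color varies by group and a legend is added.
p_mapped <- ggplot(demo) +
  geom_histogram(mapping = aes(x = charges, color = smoker), bins = 3,
                 fill = "white")

print(p_fixed)
print(p_mapped)
```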

We clearly see that smokers are charged more for insurance than non-smokers.
The name ggplot comes from "grammar of graphics".
We can further separate the charges column by adding other categorical variables such as sex and region.
The ggplot2 library provides the facet_grid function to generate a grid of subplots.
> t <- ggplot(insurance) + geom_histogram(mapping = aes(x=charges, color=smoker), bins=15, fill='white')
> t + facet_grid(rows = vars(sex), cols = vars(region))
The first line creates a histogram by separating the smokers and non-smokers as in the previous example. In the second line, we add two additional dimensions in terms of rows and columns. The variables used as separators are specified in the facet_grid function.

We do not observe a remarkable difference between regions but we see that males are more likely to smoke than females.
There are different types of plots for investigating the relationships among variables. One of them is the scatter plot, which is usually preferred for comparing two numerical variables.
Let’s create a grid of plots that map the relationship between the charges and bmi (body mass index). We will use the smoker and children columns as separators.
> t <- ggplot(insurance) + geom_point(mapping = aes(x=charges, y=bmi, color=smoker))
> t + facet_grid(rows = vars(children))

The geom_point function generates a scatter plot. We don’t see a noticeable correlation between the charges and bmi columns. However, the smokers and non-smokers are clearly separated.
Another commonly used type of visualization is the box plot. It provides an overview of the distribution of a variable by showing how values are spread out in terms of quartiles and outliers.
The following code generates a box plot of the bmi column. We will distinguish observations according to the categories in the region column.
> ggplot(insurance) + geom_boxplot(mapping = aes(y=bmi, color=region))

The line in the middle of the box shows the median value. The box spans the first to the third quartile, so a taller box means the values are more spread out.
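The quantities a box plot draws can be computed by hand in base R. The bmi values below are made up for illustration and are not taken from the insurance dataset. The 1.5 * IQR rule for outliers is the default used by geom_boxplot.

```r
# Made-up bmi values for illustration (not the real data).
bmi <- c(18.5, 22.0, 24.3, 26.1, 27.8, 29.4, 31.0, 33.5, 36.2, 47.0)

# Lower edge of the box, median line, and upper edge of the box.
q <- quantile(bmi, probs = c(0.25, 0.5, 0.75))

# Height of the box: the interquartile range.
box_height <- IQR(bmi)

# Values further than 1.5 * IQR from the box are drawn as individual outliers.
lower_fence <- q[["25%"]] - 1.5 * box_height
upper_fence <- q[["75%"]] + 1.5 * box_height
outliers <- bmi[bmi < lower_fence | bmi > upper_fence]

print(q)
print(outliers)
```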
The median bmi of the people in the southeast region is distinctly higher than the median in the other regions. The northeast and northwest are quite similar.
Visualizations can also be used to check the size of different categories in terms of the number of observations. Considering only the average values can mislead us, so it is better to also know the number of observations in each category.
The geom_count function of ggplot2 can be used to visualize two discrete variables. For instance, a plot that displays the sizes of the categories in the region and sex columns is generated as below.
> ggplot(insurance) + geom_count(mapping = aes(x=region, y=sex, color=region)) + labs(title = "Number of Observations in Each Region")
We have also added a title using the labs function. It also allows adding labels for x-axis and y-axis.
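As a sketch of the axis labels mentioned above, the x and y arguments of labs can be set alongside the title. The demo data frame and the label texts are made up for illustration; the table call shows the same counts that geom_count encodes as point sizes.

```r
library(ggplot2)

# Made-up rows for illustration; not taken from the insurance dataset.
demo <- data.frame(
  region = c("northeast", "northeast", "northwest",
             "southeast", "southeast", "southeast"),
  sex    = c("female", "male", "female", "female", "male", "male")
)

# Cross-tabulation of the two discrete variables.
counts <- table(demo$region, demo$sex)
print(counts)

# Same plot type as above, with axis labels added through labs().
p <- ggplot(demo) +
  geom_count(mapping = aes(x = region, y = sex, color = region)) +
  labs(title = "Number of Observations in Each Region",
       x = "Region", y = "Sex")

print(p)
```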

The southeast clearly outweighs the other regions. Another interesting finding is that there are more females than males in the northwest region whereas the number of males is higher than females in the northeast.
Conclusion
We have seen how data visualization can be used to explore a dataset. There are, of course, many more plots we can generate to further investigate the relationships among the variables.
As the complexity and dimensionality of a dataset increase, we use more complex visualizations in exploratory data analysis. However, the fundamental types of plots and techniques are likely to remain the same.
Thank you for reading. Please let me know if you have any feedback.