
How Popular is Your Name? Mini Data Viz Challenge in R

A quick and easy project to practice data visualization using ggplot in R while finding out the popularity of names in the United States.

Photo by Pablo Gentile on Unsplash

I have dabbled in R for several years but have only recently started to gain traction. When trying to learn R, I kept following the same pattern: see an impressive use of R, get renewed energy to learn, get overwhelmed.

Because it can do so much, R is a fantastic tool but also intimidating to learn. For those who are starting out on their R learning journey, I encourage you to start by learning just one thing in R. Your thing could be learning how to merge two datasets together or how to calculate descriptive statistics efficiently. For me, it was learning how to do data visualizations. While I still (often) get frustrated, I have found it to be much more satisfying to be able to immediately see the incremental changes visually as I update code.

So, for those of you who want to try your hand at visualizing data in R, welcome! You can follow along step by step, writing the code yourself, or you can skip down to the bottom and copy and paste my code. All you will need to do is add the name you are interested in visualizing!

  1. Install and load packages
  2. Plot all names (intro to ggplot)
  3. Plot "your" name (start of mini-project)
  4. Plot your name against all names
  5. Plot multiple names to compare (bonus viz)

Full code for each of the plots created in steps 3, 4, and 5.

***This article assumes you already have R and RStudio installed. If you aren’t there yet, here is a great lesson. If you aren’t ready to make that jump, this is a fun tutorial that goes through the same dataset we will be using below and allows you to do everything right from your browser – no installing required!

Without further ado, let’s get visualizing!


Step 1: Install and load packages

# If you haven't used the packages before, you will need to install them first
install.packages("tidyverse")
install.packages("babynames")
# Load packages
library(tidyverse)
library(babynames)

The babynames package has a data frame provided by the Social Security Administration with: year, sex, name, n (number of instances), and prop (number of instances of the given name and sex in that year divided by total applicants). Unfortunately, this data only has binary male/female as sex options. This data set includes every name with at least 5 instances in a given year! Wow!
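Before plotting anything, it can help to take a quick look at the data frame. The snippet below is only a minimal sketch, assuming the two packages above are installed; `glimpse()` comes from dplyr, which is loaded as part of tidyverse.

```r
library(tidyverse)
library(babynames)

# Column names and types, plus the first few values of each column
glimpse(babynames)

# The first and last year covered by the data
range(babynames$year)

# The values that appear in the sex column
unique(babynames$sex)
```

Checking the column names here makes the later `aes(x = year, y = n)` and `aes(x = year, y = prop)` mappings easier to follow.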

Step 2: Plot all names

When starting out in data viz in R, you will most likely begin with the ggplot2 package. Once you understand the structure, you can quickly change your plots however you want!

<data> %>% ggplot() + <geom_function>(mapping = aes(<mappings>))

Let’s look at this one step at a time:

<data> %>%

You will replace "data" with the name of the data frame that you are using. The babynames package has already created a data frame for you, but if you are using your own data, you will need to create a data frame first. The "%>%" is called a pipe operator, and it can be read as "then." So I might read that first line and say to myself, "Okay, we are going to take the "data" data frame and then we are going to…"
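To make the "then" reading concrete, here is a small sketch showing that piping a data frame into a function is the same as passing it as the first argument. `head()` is used here only as a convenient illustration and is not part of the plotting template.

```r
library(tidyverse)
library(babynames)

# Without the pipe: babynames is passed as the first argument of head()
without_pipe <- head(babynames, 3)

# With the pipe: "take babynames, then show the first 3 rows"
with_pipe <- babynames %>% head(3)

# Compare the two results; both calls select the same rows
identical(without_pipe, with_pipe)
```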

<data> %>% ggplot() +

Now we are adding the ggplot() function from the ggplot2 package. This tells R that you want to create a visualization. You can try running just what you have so far, and you will notice that there is only a blank rectangle. We haven’t told R how we want things plotted yet. So "we are going to take the "data" data frame and then we are going to create some kind of plot."

babynames %>%
  ggplot()
No, this is not an error. We expect this first plot to be blank.

So this is probably not the fun vizzies we were expecting but stay with me!

<data> %>% ggplot() + <geom_function>(mapping = aes(<mappings>))

Now things are getting interesting! We are now going to pick the geom_function that we want. There are lots of pre-made geom_functions that make it fast and easy to create the plot of your choosing. For example: geom_point() adds a layer of points, geom_bar() adds a bar graph, geom_line() creates a line graph, etc. See all of the available geoms here.

Finally, you are going to map your aesthetics, and this is where things start coming together! Inside aes() you will set your x and y variables, and you can also set things like color and size.

babynames %>%
  ggplot() +
  geom_point(mapping = aes(x = year, y = n))

Keep in mind this is a GIANT dataset so it will probably take a few moments to fully plot.

There is a dot for each year of each name of both male and female.
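If the full plot takes too long to draw while you are experimenting, one possible workaround is to plot a random subset of rows first. This is only a sketch: `babynames_sample` is a name made up for this example, the sample size of 50,000 is arbitrary, and `slice_sample()` assumes dplyr 1.0.0 or later.

```r
library(tidyverse)
library(babynames)

set.seed(42)  # make the random sample reproducible

# Keep a random subset of rows so the plot draws faster.
# 50,000 is an arbitrary size chosen for illustration only.
babynames_sample <- babynames %>%
  slice_sample(n = 50000)

babynames_sample %>%
  ggplot() +
  geom_point(mapping = aes(x = year, y = n))
```

The sampled plot will not show every name, so switch back to the full data frame once the code looks right.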

Step 3: What about MY name?

Now we can see all names but what about getting to just our name (or whatever name we are interested in)?

To do this, we can create a variable with the name we are interested in looking at. We can also specify if we want to look at only males or females. I will be using my name (Jenna) and female ("F") but feel free to replace those with what you are interested in!

myname <- "Jenna"
mysex <- "F"
babynames %>%
  filter(name == myname, sex == mysex) %>%
  ggplot() +
  geom_point(mapping = aes(x = year, y = n))
A dot for the number of people named "Jenna" with female designated sex each year.
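The same filtered data can also be drawn with a different geom. As a sketch of how little needs to change, the version below swaps geom_point() for geom_line() and adds axis labels with labs(). The object name `name_plot`, the label text, and the color are choices made for this example.

```r
library(tidyverse)
library(babynames)

myname <- "Jenna"
mysex <- "F"

name_plot <- babynames %>%
  filter(name == myname, sex == mysex) %>%
  ggplot() +
  # Same x and y mappings as before, drawn as a line instead of points.
  # color is set outside aes(), so it applies to the whole line.
  geom_line(mapping = aes(x = year, y = n), color = "#013175") +
  labs(x = "Year", y = "Number of babies", title = paste("Babies named", myname))

# Print the plot
name_plot
```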

And just like that! We have accomplished our mission! That is really all you need to start creating plots in R. You can get all kinds of fancy and over-the-top customized, but you also don’t have to. When you are first starting out, celebrate these wins! You wanted to make a plot and you made a plot. Well done.

But, in case you are one of those kids who likes extra-credit, let’s see what else we can do.

Step 4: Plot our name against all names

We are going to be plotting the whole dataset again so it might be a good time to make some coffee or pour some wine while this runs.

mynameis <- "Jenna"
mysexis <- "F"

# Rows for the chosen name and sex only
myname <- babynames %>%
  filter(name == mynameis, sex == mysexis)

# Start the x-axis 5 years before the name first appears
mynameminyear <- min(myname$year) - 5
maxyear <- max(babynames$year)

babynames %>%
  filter(year > mynameminyear) %>%
  ggplot() +
  geom_point(mapping = aes(x = year, y = prop), alpha = 0.2, color = "gray") +
  geom_point(data = myname, mapping = aes(x = year, y = prop), alpha = 0.8, color = "#013175") +
  # the below is just formatting, not required!
  theme_minimal() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank(),
        axis.title = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  # paste() already separates its arguments with a space
  ggtitle(paste("Popularity of the name", mynameis, "from", mynameminyear, "to", maxyear))
The name Jenna and sex female plotted against all other names.
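Step 4 plots prop (the share of applicants) on the y-axis rather than n (the raw count). If you want to read the peak off as a number instead of eyeballing it from the plot, something like the following sketch could work. `peak` is a name made up for this example, and `slice_max()` assumes dplyr 1.0.0 or later.

```r
library(tidyverse)
library(babynames)

mynameis <- "Jenna"
mysexis <- "F"

myname <- babynames %>%
  filter(name == mynameis, sex == mysexis)

# The row (or rows, if there is a tie) where this name made up
# the largest share of applicants
peak <- myname %>%
  slice_max(order_by = prop, n = 1)

peak
```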

Step 5: Plot multiple names to compare

Because there is nothing like the spice of competition to get the learning juices flowing, you can highlight multiple names at once to see how you compare. I will plot three names together but feel free to add on more!

Start by choosing a name and sex for three people you are wanting to compare. I did myself, my partner, and my brother. Remember that the name should be in quotes and you can select either "M" or "F" for sex.

name_one <- "Jenna"
sex_one <- "F"
name_two <- "Melissa"
sex_two <- "F"
name_three <- "Jeffrey"
sex_three <- "M"

Okay, we have set our variables, time to plot!

firstname <- babynames %>%
  filter(name == name_one, sex == sex_one)

secondname <- babynames %>%
  filter(name == name_two, sex == sex_two)

thirdname <- babynames %>%
  filter(name == name_three, sex == sex_three)

legendcolors <- c("name_one" = "#219EBC", "name_two" = "#FB8500", "name_three" = "#023047")
babynames %>%
  ggplot() +
  geom_point(mapping = aes(x = year, y = prop), alpha = 0.1, color = "gray") +
  geom_point(data = firstname, mapping = aes(x = year, y = prop, color = "name_one"), alpha = 0.8) +
  geom_point(data = secondname, mapping = aes(x = year, y = prop, color = "name_two"), alpha = 0.8) +
  geom_point(data = thirdname, mapping = aes(x = year, y = prop, color = "name_three"), alpha = 0.8) +

  # The below is formatting and not required!
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank(),
        axis.title = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  ggtitle("Who has the most popular name?") +
  scale_color_manual(name = "Name", values = legendcolors)

We did it! And dang it, it looks like I (name_one) am the loser when it comes to overall popularity, although it looks like I might have beaten out the other two in the early 2000s. Females named Melissa (name_two) have been around a long time and saw a surge in popularity in the 1990s.


And now, there are endless ways you can completely customize this visual to make it your own. Choose your own adventure! Some fun things to try:

  • Change the colors used (here is a guide to customizing colors in ggplot)
  • Highlight additional names
  • Annotate different points on the graph
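For highlighting additional names, one option that avoids writing a new geom_point() layer per name is to keep the name and sex pairs in a small lookup table and map color to the name column. This is a sketch of an alternative, not the approach used in step 5. `people`, `highlighted`, and `comparison_plot` are names made up for this example, and "Taylor" is an arbitrary fourth name added for illustration.

```r
library(tidyverse)
library(babynames)

# Hypothetical lookup table: one row per name/sex pair to highlight
people <- tribble(
  ~name,     ~sex,
  "Jenna",   "F",
  "Melissa", "F",
  "Jeffrey", "M",
  "Taylor",  "F"
)

# Keep only the rows of babynames that match a name/sex pair in `people`
highlighted <- babynames %>%
  semi_join(people, by = c("name", "sex"))

comparison_plot <- babynames %>%
  ggplot() +
  geom_point(mapping = aes(x = year, y = prop), alpha = 0.1, color = "gray") +
  # Mapping color to the name column gives one legend entry per name
  geom_point(data = highlighted, mapping = aes(x = year, y = prop, color = name), alpha = 0.8) +
  ggtitle("Who has the most popular name?")

# Print the plot (this draws the whole dataset, so it may take a moment)
comparison_plot
```

With this layout the legend shows the actual names, and adding another person only takes one more row in `people`. Note that if the same name appears with both sexes, the two would share a color unless you map color to a combined name and sex label.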

Full Code

Option 1: One Name Only

Option 2: One Name vs. All Names

Option 3: Compare Three Names vs. All Names

I can’t wait to see what you all come up with! Keep knocking your head against the wall, it gets so much better!

Jenna Eagleson

My background is in Industrial-Organizational Psychology and I have found my home in People Analytics. Data viz is what makes my work come to life. I mostly use Power BI but I occasionally foray into Tableau and other tools. I would love to hear more about your journey! Reach me by commenting here or on Twitter or LinkedIn.

