Data Exploration and Visualization with R & ggplot

Visualizing Game Genres and Platforms from the IGN Database

Finn Qiao
Towards Data Science

--

Given my recent foray into R and ggplot, it seemed appropriate to take a break from the usual Python Jupyter notebooks. I have chosen the IGN dataset from Kaggle for a quick data exploration and visualization.

After loading the tidyverse package, which contains useful packages like ggplot2 and dplyr, we begin by reading in the CSV file. By setting stringsAsFactors to FALSE, we make sure that character string variables are not read in as factors.
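As a minimal sketch of this step, the snippet below reads a CSV with base R's read.csv and stringsAsFactors = FALSE. The file here is a tiny made-up stand-in written to a temporary path so the example runs without the Kaggle download; only the column names are taken from this post.

```r
# Write a tiny stand-in CSV so the example runs without the Kaggle file.
# Column names follow the variables mentioned in this post; the rows are made up.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "title,platform,score,genre,release_year",
  "Game A,PC,9.0,\"Adventure, Platformer\",2008",
  "Game B,Wii,7.5,Action,2007"
), csv_path)

# stringsAsFactors = FALSE keeps text columns as character vectors, not factors
ign <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect the column types
str(ign)
```

With the real data, csv_path would simply point at the downloaded file.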

A quick look at the dataframe reveals some interesting variables, in particular title, platform, score, genre, and release_year.

The list of top-rated titles doesn’t seem too interesting, as there are multiple titles with scores of 9 and 10. Let’s instead look at the average review scores based on particular groupings.

Game Genres

As there are 112 unique genres, let’s just take the top 10 genres.
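One way to get such a top 10 with dplyr is sketched below, assuming the ranking is by mean review score. The toy data frame and the names top_genres, mean_score, and n_games are illustrative only.

```r
library(dplyr)

# Toy data standing in for the IGN dataframe (values are made up)
ign <- data.frame(
  genre = c("Action", "Action", "Puzzle", "Adventure, Platformer", "Puzzle"),
  score = c(8.0, 6.0, 9.0, 7.5, 8.0),
  stringsAsFactors = FALSE
)

# Mean score per genre, highest first, keeping at most 10 rows
top_genres <- ign %>%
  group_by(genre) %>%
  summarise(mean_score = mean(score), n_games = n()) %>%
  arrange(desc(mean_score)) %>%
  head(10)

print(top_genres)
```

Keeping n_games next to the mean makes it easier to spot genres whose high average rests on only a handful of titles.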

Are all of these genres necessary? The division here might be too granular. Quite a few genres, like ‘Adventure, Platformer’ and ‘Adventure, Episodic’, share the same primary genre of ‘Adventure’. I thus wrote a function to get the primary genre of each game by taking the first word of the genre variable.
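The original function is not reproduced here, so the following is only a sketch of the idea. The helper name get_main_genre is hypothetical, and it assumes sub-genres are separated by a comma and that empty genre strings should become NA.

```r
# Hypothetical helper: return the first word of each genre string
get_main_genre <- function(genre) {
  first <- trimws(sub(",.*$", "", genre))      # keep the text before the first comma
  first <- sub("[[:space:]].*$", "", first)    # keep only the first word
  first[which(first == "")] <- NA_character_   # treat empty genres as missing
  first
}

genres <- c("Adventure, Platformer", "Adventure, Episodic", "Action", "")
main_genre <- get_main_genre(genres)
print(main_genre)
```

Applied to the full dataframe, this would be something like ign$main_genre <- get_main_genre(ign$genre).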

The new main_genre variable now includes a much more manageable 31 unique genres. There seems to be one named NA though.

There are 36 rows with missing genre values. As this is a really small sample of the 18,000+ observations and it would take unnecessary effort to label them manually, we drop these observations with NA genre values.
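Counting and dropping the missing genres could look roughly like this; the toy data frame is made up and main_genre is the column described above.

```r
library(dplyr)

# Toy data with one missing main_genre (values are made up)
ign <- data.frame(
  title = c("Game A", "Game B", "Game C"),
  main_genre = c("Action", NA, "Puzzle"),
  stringsAsFactors = FALSE
)

# How many rows have no genre?
n_missing <- sum(is.na(ign$main_genre))

# Keep only the rows where the genre is present
ign <- ign %>% filter(!is.na(main_genre))
```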

Let’s double check here to make sure there aren’t any NA values in the other variables. The apply function can be used here to run a function on every column of the ign dataframe. Here, a value of FALSE would indicate that there are no NA values in that particular column.
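A small sketch of that column-wise check on made-up data: passing 2 as the second argument tells apply to work over columns rather than rows.

```r
# Toy data with a missing score (values are made up)
ign <- data.frame(
  title = c("Game A", "Game B"),
  score = c(8, NA),
  stringsAsFactors = FALSE
)

# TRUE means the column contains at least one NA, FALSE means it has none
na_check <- apply(ign, 2, function(col) any(is.na(col)))
print(na_check)
```

sapply(ign, anyNA) gives the same answer without first converting the dataframe to a matrix.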

With the newly reduced genres, let’s look at the mean review score distribution again, this time with all the genres. We see that compilation games have dropped quite far back while hardware games remain far and away the highest rated games.

Let’s visualize this again with a box plot.
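A box plot of scores by genre can be sketched with geom_boxplot as below. The data is made up, and ordering the genres by median score and flipping the axes are my own choices for readability, not necessarily what the original chart did.

```r
library(ggplot2)

# Toy scores for three genres (values are made up)
ign <- data.frame(
  main_genre = rep(c("Action", "Puzzle", "Sports"), each = 4),
  score = c(6.0, 7.5, 8.0, 9.5, 7.0, 7.2, 7.4, 7.6, 3.0, 5.5, 8.0, 9.0)
)

# One box per genre, ordered by median score, with genre labels on the y axis
p <- ggplot(ign, aes(x = reorder(main_genre, score, FUN = median), y = score)) +
  geom_boxplot() +
  coord_flip() +
  labs(x = "Main genre", y = "Review score")

print(p)
```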

It seems that while adult, baseball, and hardware game review scores fall within a narrow range, almost all the other genres show far more variance and many more outliers.

What are hardware games though?

Apparently they refer to VR hardware and are only represented by two entries. Given the nascent VR industry, this paucity of observations seems to be appropriate.

Given that the hardware category only has two observations, which are the genres with the most games?

Given that a bar chart with 30 genres looked rather cramped in the chart of mean review scores above, let’s try a lollipop chart here by combining geom_point and geom_segment.
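A lollipop chart is a dot at each value with a thin line running back to zero. The sketch below builds one from made-up genre counts; genre_counts and n_games are illustrative names.

```r
library(dplyr)
library(ggplot2)

# Toy data: one row per game (values are made up)
ign <- data.frame(
  main_genre = c("Action", "Action", "Action", "Puzzle", "Sports", "Sports"),
  stringsAsFactors = FALSE
)

# Count games per genre and order the genres by that count
genre_counts <- ign %>%
  count(main_genre, name = "n_games") %>%
  mutate(main_genre = reorder(main_genre, n_games))

# geom_segment draws the stick from 0 to the count, geom_point draws the head
p <- ggplot(genre_counts, aes(x = main_genre, y = n_games)) +
  geom_segment(aes(xend = main_genre, y = 0, yend = n_games)) +
  geom_point(size = 3) +
  coord_flip() +
  labs(x = "Main genre", y = "Number of games")

print(p)
```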

There seems to be an overwhelming number of action games, with over 5000 entries, or around 27% of all entries.

Game Platforms

There are 59 unique platforms. Let’s look at the mean review scores across the top 10.

With the exception of Macintosh, that’s quite a throwback list.

What about the top 10 platforms by the number of games?

Are there games that can be played on multiple platforms? By grouping the initial data frame by the game title, we find the number of platforms that support a particular title.
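A sketch of that grouping on made-up data, using n_distinct to count how many different platforms each title appears on. The name platforms_per_title is illustrative.

```r
library(dplyr)

# Toy data: the same title can appear once per platform (values are made up)
ign <- data.frame(
  title = c("Game A", "Game A", "Game A", "Game B", "Game C", "Game C"),
  platform = c("PC", "Wii", "Xbox 360", "PC", "PC", "Wii"),
  stringsAsFactors = FALSE
)

# Number of distinct platforms per title, most widely released first
platforms_per_title <- ign %>%
  group_by(title) %>%
  summarise(n_platforms = n_distinct(platform)) %>%
  arrange(desc(n_platforms))

print(platforms_per_title)
```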

Given that there are also so many platforms that have fallen out of favor in recent years, does the number of games in the IGN database reflect the changing popularity of the platforms?

Let’s chart a seasonal plot across all platforms.

1970 seems to be too far back, so 1996 seems like a reasonable starting point for this plot. The data past 2010 shows an odd drop in the number of titles overall. As this might be a data quality issue, we focus on the 15 years from 1996 to 2010.

To prevent cluttering the graph, we limit data points to platforms with over 10 titles in any given year.
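The two filters described above can be sketched as follows. I am assuming the threshold applies to each platform-year pair, so that only points with more than 10 titles are kept. The data is made up, and yearly and n_titles are illustrative names.

```r
library(dplyr)
library(ggplot2)

# Toy data: one row per title (values are made up)
ign <- data.frame(
  platform = c(rep("PC", 12), rep("PC", 11), rep("Wii", 15), rep("Wii", 14),
               rep("DS", 4), rep("PC", 3)),
  release_year = c(rep(2007, 12), rep(2008, 11), rep(2007, 15), rep(2008, 14),
                   rep(2008, 4), rep(2011, 3)),
  stringsAsFactors = FALSE
)

# Restrict to 1996-2010, count titles per platform and year,
# then keep only the platform-year points with more than 10 titles
yearly <- ign %>%
  filter(release_year >= 1996, release_year <= 2010) %>%
  count(platform, release_year, name = "n_titles") %>%
  filter(n_titles > 10)

# One line per platform across the years
p <- ggplot(yearly, aes(x = release_year, y = n_titles, colour = platform)) +
  geom_line() +
  geom_point() +
  labs(x = "Release year", y = "Number of titles")

print(p)
```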

Some interesting trends of note:

  1. The decline in games listed for each PlayStation generation coincides with the rise in games for the next PlayStation generation.
  2. 2010 is the first year that mobile and tablet games, such as Android and iPad games, appear on the plot, possibly signaling the start of the adoption of mobile games.
  3. The annual number of PC games stays fairly stable throughout.
  4. The massive spike in Wii games around 2007–2008 was quite unexpected, though a look at Google Trends corroborates the subsequent dip in both interest and the number of game titles for the console.

It was really interesting to pick up a new language for visualizing data, and I am definitely looking forward to more effective data visualizations with these new tools in upcoming posts!

The R Markdown code can be found here, and feel free to connect on LinkedIn too!
