Analyzing Video Games Data in R

Hamza Rafiq
Towards Data Science
11 min read · Sep 9, 2019


Being a gamer myself, I had a lot of fun analyzing this dataset. It was actually built by merging two different datasets, one from Tidytuesday and one from Kaggle: the Tidytuesday dataset alone didn't have enough information for an interesting analysis, so I merged the two to get more interesting data and insights out of it. This is also my first article in which I apply some machine learning using the Tidymodels meta-package.

Loading Libraries

library(tidyverse) ## For data wrangling and visualization
library(lubridate) ## To work with dates
library(ggpubr) ## Extra visualizations and themes
library(patchwork) ## Patch visualizations together
library(hrbrthemes)## Extra themes and formatting
library(ggalt) ## Extra visualizations
library(vapoRwave) ## Retro themes
library(extrafont) ## Extra fonts
Loading Data

video_games <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-07-30/video_games.csv") %>%
mutate(release_date = as.Date(release_date, "%b %d, %Y")) %>%
distinct(game, developer, publisher, .keep_all = TRUE)
metacritic_games <- read_csv("C:\\Users\\ACER\\Downloads\\metacritic-games-stats-20112019\\metacritic_games.csv",trim_ws = TRUE)
Cleaning and merging the datasets together
games_final <- metacritic_games %>%
filter(platform=="PS4") %>%
inner_join(video_games,by = "game") %>%
mutate(owners=parse_number(owners,trim_ws = TRUE)) %>%
mutate(publisher = case_when(str_detect(publisher,pattern = "Warner Bros|WB")~"Warner Brothers",TRUE~publisher)) %>%
select(-c(release_date.x,developer.y,number_players,number,metascore.y))
  • After reading in the video_games data, I used the mutate function to convert the release_date column to date type
  • Since some games had multiple developers and genres, I used the distinct function to get a unique game per row
  • For this analysis, I am only interested in PS4 games. That is why I filtered the metacritic_games dataset to only PS4 games before doing an inner_join with the video_games dataset
  • Used the parse_number function to extract a numeric value from the owners range for each game; parse_number keeps the first number it finds, which is the lower bound of the range (see the snippet after this list)
  • Since Warner Bros. appeared under multiple publisher names, I used the case_when function to lump them together as one
  • Lastly, I deselected the duplicate columns
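
To make these two cleaning steps concrete, here is a minimal sketch on made-up values (the owners string below is a hypothetical example of the dataset's “lower .. upper” format):

library(tidyverse)

## parse_number() drops everything before and after the first number it finds,
## so a range string collapses to its lower bound
parse_number("20,000,000 .. 50,000,000")
#> [1] 2e+07

## case_when() lumps every publisher matching "Warner Bros" or "WB" into one label
tibble(publisher = c("Warner Bros. Interactive", "WB Games", "Ubisoft")) %>%
  mutate(publisher = case_when(
    str_detect(publisher, pattern = "Warner Bros|WB") ~ "Warner Brothers",
    TRUE ~ publisher))
#> publisher: Warner Brothers, Warner Brothers, Ubisoft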

Further Trimming

After some investigation, I realized that there was too much junk in the data, so to do an insightful analysis at a game level, I decided to only look at the top 50 publishers based on the total number of people who own their games

Calculating top 50 publishers

top_50_publishers <- games_final %>% 
group_by(publisher) %>%
summarise(Owners=sum(owners)) %>%
top_n(50)
Filtering games_final based on top 50 publishers

top_50_filtered <- games_final %>%
semi_join(top_50_publishers)
  • First, I group by publisher in the games_final dataset to get total owners for each publisher and then use the top_n function to get the top 50 publishers based on owners
  • Then I use the semi_join function between the games_final and the top_50_publishers datasets: it keeps only the rows of games_final whose publisher appears in top_50_publishers, without adding any columns from it (a toy example follows this list)
  • This dataset is a lot smaller than the original but it allows me to only look at mainstream games
  • Do note that this filtered data only runs up to 2018, in case you are wondering why some of your favorite recent games like God of War, Spider-Man, or Red Dead Redemption 2 aren't on this list
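
If semi_join is unfamiliar, here is a tiny sketch on made-up data (the tibbles below are hypothetical): it filters the left table to rows that have a match in the right table, without adding any of its columns:

library(tidyverse)

games <- tibble(game = c("A", "B", "C"),
                publisher = c("P1", "P2", "P1"))
top_pubs <- tibble(publisher = "P1")

## Keeps rows of games whose publisher appears in top_pubs;
## unlike inner_join, no columns from top_pubs are added
games %>% semi_join(top_pubs, by = "publisher")
#> games A and C remain, B is dropped
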
This is what the final dataset looks like

Who are the Top 50 Publishers?

top_50_filtered %>%
filter(!is.na(publisher)) %>%
group_by(publisher) %>%
summarise(owners=sum(owners)) %>%
ggplot(aes(reorder(publisher,owners),owners))+
geom_bar(fill="#8B2E8B",stat = "identity",color="black")+
coord_flip()+
geom_text(aes(label=paste0(round(owners/1000000,digits = 1)," ","M")),color="white",fontface="bold",hjust=1)+
new_retro()+
scale_color_hotlineBling()+
labs(x=" ")+
scale_y_comma()+
theme(legend.position = "none",axis.text.y = element_text(colour = "white",family = "SF Alien Encounters"))
  • Started by removing NA publishers
  • Performed the same group by function to get total owners against each publisher
  • Used ggplot and geom_bar to make a standard barplot and used the coord_flip function to flip the axes. Note that the reorder function within ggplot lets us plot the bars sorted from largest to smallest
  • Since the owners variable is in the millions, I divided it by a million and concatenated an “M” at the end using paste0 to label total owners for each publisher without cluttering the plot
  • There are a lot of familiar faces in there, like Ubisoft and Rockstar Games, but honestly I did not know about Digital Extremes, sitting at the top
  • They are actually responsible for a really popular free-to-play online role-playing shooter called Warframe, which came out in 2013

What is the relationship between User Score and Metascore?

top_50_filtered %>% 
ggplot(aes(user_score,metascore.x))+
geom_point(color="green")+
geom_text(aes(label=game),check_overlap = TRUE,color="turquoise",size=3)+
vapoRwave::new_retro()+
scale_color_newRetro()+
labs(title = "Relationship between Userscore and Metascore",x="USERSCORE",y="METASCORE")
  • There is a general positive relationship between user score and Metascore. Games like The Witcher 3, GTA V, Rayman Legends, and Rocket League scored high on both, and deservedly so
  • The games that are encircled are the ones that were hyped but failed to meet expectations. Although I am surprised that Shadow of War had such a low user score, that game was sick!
  • One game that caught my attention is Mad Max. It received lukewarm reviews from critics, but the users liked it. I actually never played it because of those reviews, but this might change my mind

Top 30 games by User Score

top_50_filtered %>%
top_n(30,user_score) %>%
ggplot(aes(reorder(game,user_score),user_score))+
geom_lollipop(color="white")+
coord_flip()+
geom_text(aes(label=user_score),color="white",hjust=-1)+
new_retro()+
scale_y_continuous(limits = c(0,100))+
labs(x=" ",title = "Top 30 Games by User Score")
  • Used the top_n function to filter out the top 30 games based on user_score
  • reorder function within ggplot to sort the plot from largest to smallest values
  • geom_lollipop to make a lollipop chart
  • Not surprised to see The Witcher 3 at the top. The game was a masterpiece
  • Although I am surprised to see The Evil Within 2 in second place. It was a really good game, but I didn't realize it would score so high, beating games like GTA V
  • Middle Earth: Shadow of Mordor, the predecessor to Shadow of War, scored a lot better with the users. I personally liked Shadow of War better, but both were good games
  • I am surprised to see Batman: Arkham Knight lower than expected. It is one of my all-time favorite games on the PS4 and, in my opinion, the best in the Batman Arkham trilogy. I think the low score could be attributed to people being annoyed by the Batmobile mechanics and underwhelming boss fights

Top 30 Games by Metascore

top_50_filtered %>%
top_n(30,metascore.x) %>%
ggplot(aes(reorder(game,metascore.x),metascore.x))+
geom_lollipop(color="light blue")+
coord_flip()+
geom_text(aes(label=metascore.x),color="light blue",hjust=-1)+
new_retro()+
scale_y_continuous(limits = c(0,100))+
labs(x=" ",y=" ",title = "Top 30 Games by Meta Score")
  • Same code as before. Just replaced user score with metascore
  • You can clearly see that a lot of the games have shifted when we move from user score to Metascore. Using the anti_join function, we can see which games present in the top 30 by user score are not in the top 30 by Metascore
top_50_filtered %>%
top_n(30,user_score) %>%
anti_join(top_50_filtered %>%
top_n(30,metascore.x))
  • The anti_join function here automatically joins on the columns the two data frames share (including game) and shows the top 30 user-score games that are not present in the top 30 games by Metascore (a toy example follows this list)
  • Most of these games do not have a major difference between their user score and Metascore, with the exception of The Evil Within 2 and Mad Max
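
As a quick illustration on made-up data, anti_join is the complement of semi_join: it keeps the rows of the left table that have no match in the right table:

library(tidyverse)

by_user <- tibble(game = c("The Witcher 3", "Mad Max", "The Evil Within 2"))
by_meta <- tibble(game = c("The Witcher 3", "GTA V"))

## Rows of by_user with no matching game in by_meta
by_user %>% anti_join(by_meta, by = "game")
#> Mad Max and The Evil Within 2 remain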

User Score Distribution by Publisher

top_50_filtered %>%
filter(!is.na(publisher)) %>%
ggplot(aes(reorder(publisher,user_score,FUN = mean),user_score))+
geom_boxplot(color="purple")+
geom_hline(yintercept = 80,color="white")+
coord_flip()+
new_retro()+
labs(x=" ")
  • Plotting a boxplot to visualize the distribution of user score for the top 50 publishers
  • Used the reorder function to sort the boxplots by the average user score
  • Added a vertical line at 80 as a threshold for what is considered a great game
  • I can see why Ubisoft scores so low here. Their recent games in the Far Cry and Assassin’s Creed series have been pretty underwhelming
  • Rockstar never disappoints. Every game they make is a masterpiece
  • Activision is all over the place. Maybe people are finally getting tired of the Call of Duty series
  • Capcom is really low. This is probably due to releasing an incomplete Street Fighter V game and charging full price for it. But 2019 has been an amazing year for Capcom with games like Monster Hunter, Resident Evil 2: Remake and Devil May Cry 5 being huge hits for them

Metascore Distribution by Publisher

top_50_filtered %>%
filter(!is.na(publisher)) %>%
ggplot(aes(reorder(publisher,metascore.x,FUN = mean),metascore.x))+
geom_boxplot(color="green")+
geom_hline(yintercept = 80,color="white")+
coord_flip()+
new_retro()+
labs(x=" ",y="Metascore")

It seems that publishers do a lot better on Metascore than on their user scores

Longest Games by Average Playtime

top_50_filtered %>%
top_n(30,average_playtime) %>% ggplot(aes(reorder(game,average_playtime/60),average_playtime/60))+
geom_lollipop(color="purple")+
coord_flip()+
geom_text(aes(label= round(average_playtime/60)),color="white",hjust=-1)+
vapoRwave::new_retro()+
scale_y_continuous(limits = c(0,40))+
labs(x=" ",y="Average playtime in 2 weeks - (hrs)")
  • The average playtime for each game measures, in minutes, how long users played it on average over a two-week period
  • Used top_n function to only look at the top 30 longest games
  • When plotting, divided average_playtime by 60 to convert it into hours
  • geom_text to label the average_playtime in hrs
  • Most of the top longest games are open-world like Just Cause 4, Far Cry 5, Mad Max, Assassin’s Creed, GTA V, and Witcher 3
  • Surprised to see Prey at number 3. I didn't play it, but I don't think it was open-world

Percentage Positive Reviews by Users

top_50_filtered %>%  mutate(percentage_positive_users=positive_users/(positive_users+negative_users+neutral_users),
percentage_positive_critics =positive_critics/(positive_critics+negative_critics+neutral_critics)) %>%
filter(positive_users>=10,percentage_positive_users>=0.5) %>%
top_n(30,percentage_positive_users) %>% ggplot(aes(reorder(game,percentage_positive_users),percentage_positive_users))+
geom_lollipop(color="white")+
coord_flip()+
labs(x=" ")+
new_retro()
  • Used the mutate function to create two new columns: percentage_positive_users and percentage_positive_critics
  • Since there are some games that received very few total scores, I used the filter function to show only those that had at least 10 positive user scores and an overall percentage of positive users ≥ 50%
  • The rest is the same code that we used for the previous lollipop charts
  • Wow! Surprised to see GTA V so low
  • Mad Max and Dying Light are among those games that were really good but flew under the radar
  • Sword Art Online received horrendous reviews from the critics, but it looks like the users liked it

Percentage Positive Reviews by Critics

top_50_filtered %>%  mutate(percentage_positive_users=positive_users/(positive_users+negative_users+neutral_users),
percentage_positive_critics =positive_critics/(positive_critics+negative_critics+neutral_critics)) %>%
filter(positive_critics>=10,percentage_positive_critics>=0.5) %>%
top_n(30,percentage_positive_critics) %>% ggplot(aes(reorder(game,percentage_positive_critics),percentage_positive_critics))+
geom_lollipop(color="white")+
coord_flip()+
labs(x=" ")+
new_retro()
  • NBA 2K16 and NBA 2K17 received good reviews from the critics but low scores from the users

Honestly, I have not heard of the top 4 games, but they aren't considered AAA games due to their small scale

Predictive Analysis

Predicting the Metascore is not practical since it is calculated from a formula over the numerical variables provided in the dataset. Predicting the user score is more interesting. For this, I will be using the Tidymodels meta-package. It is a collection of packages that lets you create a seamless data pipeline: preprocessing your data, tuning your models, fitting them, making predictions, and measuring your model's performance, all while adhering to the “tidy” principles of the Tidyverse

Data Preparation

library(tidymodels)

ratings_data <- top_50_filtered %>% 
mutate(percentage_positive_users=positive_users/(positive_users+negative_users+neutral_users),
percentage_negative_users =negative_users/(positive_users+negative_users+neutral_users),
percentage_positive_critics =positive_critics/(positive_critics+negative_critics+neutral_critics),
percentage_negative_critics =negative_critics/(positive_critics+negative_critics+neutral_critics)) %>%
drop_na()
Splitting the data

split <- initial_split(ratings_data,prop = 0.6)
train_games <- split %>% training()
test_games <- split %>% testing()
  • Used the mutate function to create new variables that will be used as predictors for user_score
  • Drop any missing values
  • Use the initial_split function from the rsample library to create a split object for our train and test sets
  • Chain it with the training and testing functions from rsample to create our train and test datasets (a quick sanity check follows this list)
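
As a quick sanity check (a sketch, assuming the objects created above), the row counts should reflect the 60/40 split:

nrow(train_games) / nrow(ratings_data) ## roughly 0.6
nrow(test_games) / nrow(ratings_data)  ## roughly 0.4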

Creating Preprocessing pipeline

Creating Recipe

norm_recipe <-
recipe(user_score~percentage_positive_critics+percentage_negative_critics+percentage_positive_users+
percentage_negative_users+price+genre+rating+publisher,data = train_games) %>%
step_naomit(all_numeric(),all_nominal()) %>%
step_dummy(all_nominal(),one_hot = TRUE) %>%
step_normalize(all_predictors(),na_rm = TRUE) %>%
step_corr(all_predictors()) %>%
prep()
Applying Recipe on Train and Test sets

train_prepped <- juice(norm_recipe)
test_prepped <- norm_recipe %>% bake(test_games)
  • Using the Recipes package, I am creating a “recipe”, which is basically a pipeline of how I want to process the variables in my data
  • We start by defining the relationship between the dependent and independent variables
  • The “step” functions are the data transforming functions
  • step_naomit removes missing values from both nominal and numeric variables
  • step_dummy does one-hot encoding on all the categorical variables
  • step_normalize normalizes all the numeric predictors
  • step_corr removes any independent variables that have a high correlation with each other
  • prep function basically finalizes and prepares the recipe
  • juice function, when applied on norm_recipe, gives us the transformed version of the training set
  • bake function chained together with norm_recipe and applied on the test data gives us the transformed version of the test set (a quick inspection snippet follows this list)
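
To see what the recipe actually did, you can inspect it and the transformed data; a sketch, assuming the objects above (tidy and glimpse come with the packages already loaded):

## List the preprocessing steps the recipe applies, in order
tidy(norm_recipe)

## Peek at the transformed training set: one-hot dummy columns,
## normalized numeric predictors, highly correlated columns dropped
glimpse(train_prepped)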

Modeling

ranger <- rand_forest(trees = 100,mode = "regression") %>% 
set_engine("ranger") %>%
fit(user_score~.,data=train_prepped)
  • To define our model, we use the Parsnip package to specify a random forest regression model
  • Set the engine to ranger
  • Use the fit function to define our formula and supply the transformed training data

Predicting

Converting NAs to 0

test_prepped <- test_prepped %>% 
mutate_all(~replace(.,is.na(.),0))

Making predictions
ranger %>%
predict(test_prepped) %>%
bind_cols(test_games) %>%
select(game,.pred,user_score)
Measuring Model Accuracy

ranger %>%
predict(test_prepped) %>%
bind_cols(test_games) %>%
metrics(truth=user_score,estimate=.pred)
  • Before making predictions on the prepped test set, I had to convert some NAs to 0, otherwise the random forest model would throw an error
  • Finally, we use the ranger model to predict on our prepped test set and join the predictions with the original test data to see the predicted and actual user scores for each game
Sample of predicted vs actual user scores
  • Once we have made predictions, I used the metrics function from the yardstick package to calculate the model's accuracy by supplying it both actual and predicted user scores
  • rsq measures how much of the variation in our dependent variable (user_score) can be explained by the independent variables in our dataset. In our case, it is not too bad considering the type of data we had and how small the dataset was
  • It could be improved by hyperparameter tuning and cross-validation, but that is beyond the scope of this article. These features are still under development in Tidymodels as of now, so there is no seamless way to integrate them into our analysis (a rough sketch of a manual approach follows below)
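
For the curious, here is a rough sketch of how cross-validation could be wired up by hand with rsample, re-using the recipe and model specification from above. This is illustrative only, not an official Tidymodels workflow; the fold count and the NA-to-0 hack are my own assumptions:

library(tidymodels)

## 5-fold cross-validation on the modeling data
folds <- vfold_cv(ratings_data, v = 5)

cv_rmse <- map_dbl(folds$splits, function(s) {
  ## Re-estimate the recipe on each fold's analysis set to avoid leakage
  rec <- prep(norm_recipe, training = analysis(s), fresh = TRUE)
  model <- rand_forest(trees = 100, mode = "regression") %>%
    set_engine("ranger") %>%
    fit(user_score ~ ., data = juice(rec))
  ## Bake the held-out fold and zero out NAs, as we did for the test set
  holdout <- bake(rec, assessment(s)) %>%
    mutate_all(~replace(., is.na(.), 0))
  predict(model, holdout) %>%
    bind_cols(holdout) %>%
    rmse(truth = user_score, estimate = .pred) %>%
    pull(.estimate)
})

mean(cv_rmse) ## average out-of-fold RMSE across the 5 folds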

Conclusion

I had a lot of fun doing an end-to-end analysis on this data, and being able to apply some basic machine learning using the Tidymodels package was good practice. However, it is important not to draw any major conclusions from this data, as we looked at a very small subset of the original data sources. I think this analysis could be made even more interesting if we gathered more data on all the different video games.
