Pokemon Generation Gap? (Python Data Analysis) — Part 1: cleaning, EDA

I have been learning Python for data analysis through online courses and the occasional YouTube video. I know I have only taken the first step, but I couldn’t resist the urge to put what I have learned to the test. I strongly believe that knowledge is not yours until you actually apply it. I don’t have that “Beautiful Mind” type of brain; I can’t understand things instantly just by watching. Considering my current level of Python fluency, I will start with some simple, basic analysis.

I am not a Pokemon geek, but I did play Pokemon Go. Unfortunately I was one of those early leavers. I think I got up to level 25, and that was it. I did watch some episodes of the TV series, but just by chance. So, why Pokemon? I wanted to start somewhere light, not too serious and not too complicated. While I was looking through Kaggle’s datasets, I found a Pokemon dataset! The size of the data is manageable for me, it doesn’t contain too many columns, and each column is easy enough for me to understand. And Pokemons are cute…

I first looked through all the kernels for the Pokemon datasets: many cool analyses, some too advanced for me, some I could understand. So I started with a simple question: “Is there any difference between Pokemons of different generations?” (There are 7 different generations spanning from 1996 to 2017.)

First, I downloaded the dataset and saved it on my computer so that I could read and inspect it in Python.

import pandas as pd
df = pd.read_csv('Pokemon.csv')
df.head()

By looking at the first 5 entries of the data, I already saw two problems. First, the ‘#’ column has duplicates: it seems that a Pokemon’s Mega-evolved form has been given the same ID number as its original form (Mega-evolution is an ultimate form of a Pokemon, not available to all Pokemons). But a Pokemon before and after Mega-evolution is certainly different. Not only do its stats change, but also its looks. For example, the third entry Venusaur’s Total is 525, while the fourth entry Mega Venusaur’s Total is 625. The looks are as below.

Left: Venusaur, Right: Mega Venusaur (and No! they are not the same, take a closer look)

The second problem is that the fourth entry, Mega Venusaur, has a duplicated “Venusaur” at the front of its Name column value: “VenusaurMega Venusaur”. This needs to be dealt with.
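
Both problems can also be checked across the whole table instead of by eye. Below is a minimal sketch, assuming a tiny hand-typed stand-in for the first five rows (`sample` is a hypothetical frame, not the real file; in the actual analysis you would run the same calls on `df`).

```python
import pandas as pd

# Hypothetical stand-in for the first five rows of Pokemon.csv
sample = pd.DataFrame({
    '#': [1, 2, 3, 3, 4],
    'Name': ['Bulbasaur', 'Ivysaur', 'Venusaur',
             'VenusaurMega Venusaur', 'Charmander'],
    'Total': [318, 405, 525, 625, 309],
})

# Every row whose ID is shared with at least one other row
dup_ids = sample[sample['#'].duplicated(keep=False)]
print(dup_ids)

# How many rows repeat an ID that already appeared earlier
n_repeats = int(sample['#'].duplicated().sum())
print(n_repeats)
```

With `keep=False`, pandas marks both the original and the repeated row, which makes it easy to see the two forms side by side.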

Looking at just the first five entries didn’t give me a full picture of the data. So, to see the broader picture, I called the “info” method.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 13 columns):
# 800 non-null int64
Name 800 non-null object
Type 1 800 non-null object
Type 2 414 non-null object
Total 800 non-null int64
HP 800 non-null int64
Attack 800 non-null int64
Defense 800 non-null int64
Sp. Atk 800 non-null int64
Sp. Def 800 non-null int64
Speed 800 non-null int64
Generation 800 non-null int64
Legendary 800 non-null bool
dtypes: bool(1), int64(9), object(3)
memory usage: 75.9+ KB

I can see there are 800 entries, and the column dtypes are “object”, “int64” or “bool”, meaning string, integer or boolean values. The “Type 2” variable has only 414 non-null entries, which means there are 386 missing values. Missing values need to be handled cautiously. There might be a reason why they are missing, and you might find some useful insight by figuring out that reason. Sometimes the missing part might even distort the whole data. But in this case, it is just because some Pokemons do not have a secondary type. For my analysis, I won’t be using the “Type 2” attribute, so I will just leave it as it is.
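
The missing-value count can also be read directly rather than by subtracting from 800. Here is a small sketch, assuming a hypothetical three-row frame (`sample`) in place of the real data. On the full dataset, the same `isna().sum()` call should report 386 for “Type 2”, matching 800 − 414.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: one dual-type and two single-type Pokemons
sample = pd.DataFrame({
    'Name': ['Bulbasaur', 'Charmander', 'Squirtle'],
    'Type 1': ['Grass', 'Fire', 'Water'],
    'Type 2': ['Poison', np.nan, np.nan],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# The rows behind the missing values: Pokemons without a secondary type
single_type = sample[sample['Type 2'].isna()]
print(single_type['Name'].tolist())
```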

Now it’s time for some EDA (Exploratory Data Analysis)! There are two aspects of EDA: numerical summarisation, and visual methods such as graphs. In pandas, summary statistics can be extracted with a single method, “describe”.

df.describe()

I can see the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%) and maximum all in one table, and now I have a better understanding of the data. (By default, the “describe” method only summarises numerical columns, so the columns with string or boolean values, ‘Name’, ‘Type 1’, ‘Type 2’ and ‘Legendary’, are not shown here.) Even though it is easy to see what each column means in this dataset, sometimes it is very helpful to read the original documentation of the dataset. According to Alberto Barradas, who uploaded the dataset to Kaggle, the descriptions are as below.

#: ID for each pokemon
Name: Name of each pokemon
Type 1: Each pokemon has a type, this determines weakness/resistance to attacks
Type 2: Some pokemon are dual type and have 2
Total: sum of all stats that come after this, a general guide to how strong a pokemon is
HP: hit points, or health, defines how much damage a pokemon can withstand before fainting
Attack: the base modifier for normal attacks (eg. Scratch, Punch)
Defense: the base damage resistance against normal attacks
SP Atk: special attack, the base modifier for special attacks (e.g. fire blast, bubble beam)
SP Def: the base damage resistance against special attacks
Speed: determines which pokemon attacks first each round
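
The description of ‘Total’ is easy to verify: add up the six stat columns and compare. This is a minimal sketch on a hypothetical three-row stand-in (`sample`); on the real data the same lines would run against `df`.

```python
import pandas as pd

# Hypothetical stand-in rows with the same column names as the dataset
sample = pd.DataFrame({
    'Name': ['Bulbasaur', 'Venusaur', 'Mega Venusaur'],
    'Total': [318, 525, 625],
    'HP': [45, 80, 80],
    'Attack': [49, 82, 100],
    'Defense': [49, 83, 123],
    'Sp. Atk': [65, 100, 122],
    'Sp. Def': [65, 100, 120],
    'Speed': [45, 80, 80],
})

stat_cols = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']

# Row-wise sum of the six individual stats
stat_sum = sample[stat_cols].sum(axis=1)

# Rows where the recorded Total disagrees with the sum (empty if all agree)
mismatches = sample[stat_sum != sample['Total']]
print(len(mismatches))
```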

I guess the ‘Generation’ and ‘Legendary’ variables were added later. To help your understanding, ‘Generation’ tells which generation a Pokemon belongs to, and ‘Legendary’ tells whether the Pokemon is of legendary class or not.
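
Since ‘Legendary’ is a boolean and ‘Name’ and the type columns are strings, none of them show up in the default “describe” table. Here is a minimal sketch of how they could be summarised too, assuming a tiny hand-typed stand-in (`sample` is hypothetical): passing `exclude=[np.number]` asks pandas to describe only the non-numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with one numeric, two string and one boolean column
sample = pd.DataFrame({
    'Name': ['Bulbasaur', 'Charmander', 'Mewtwo'],
    'Type 1': ['Grass', 'Fire', 'Psychic'],
    'Total': [318, 309, 680],
    'Legendary': [False, False, True],
})

# Default behaviour: only the numeric column is summarised
numeric_summary = sample.describe()
print(numeric_summary)

# Non-numeric columns get count / unique / top / freq instead
other_summary = sample.describe(exclude=[np.number])
print(other_summary)
```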

Let’s move on. I first want to tackle the two problems I saw when I called “df.head()”: the duplicates in the ‘#’ column and the strange names in the ‘Name’ column.

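import re  # regex library, used in the substitution below
# drop the '#' column, whose IDs are shared between a Pokemon and its Mega form
df = df.drop(['#'], axis=1)
# keep only the part starting at "Mega": "VenusaurMega Venusaur" -> "Mega Venusaur"
df.Name = df.Name.apply(lambda x: re.sub(r'(.+)(Mega.+)', r'\2', x))
df.head()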

OK. I dropped the ‘#’ column entirely, and also removed the repeated Pokemon name in front of its Mega-evolved form. It looks better now. Let’s look at the end of the entries this time by calling the “tail” method.

df.tail()

Oh wait. Something doesn’t look right. Entries 797 and 798 seem to have a duplicate at the front of the name. For your information, Hoopa is the right name, not HoopaHoopa, and they look like below.

Left: Hoopa Confined, Right: Hoopa Unbound

I’d better tidy up those names for Hoopa. So I wrote another line of code to fix just the two Hoopa entries.

df.Name = df.Name.apply(lambda x: re.sub(r'(HoopaHoopa)(.+)','Hoopa'+r'\2',x))
df.tail()

To be honest, in order to figure out if there were any other strange names to fix, I also opened the csv file and scanned through the ‘Name’ column for all 800 entries. I guess I can’t do that for 10,000 or 100,000 entries, and I might need to think about a smarter way to tackle this. But anyway, I found that only Pokemons with Mega-evolved forms or Hoopa have this kind of problem in their ‘Name’ column values.
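
One possible smarter way, sketched here as a rough heuristic rather than a complete solution: a fused name such as “VenusaurMega Venusaur” has a lowercase letter directly followed by an uppercase one, so a regex can flag candidates for a manual look. The names below are a hypothetical hand-picked sample; on the real data you would run this on `df.Name`, and anything it flags still needs to be checked by eye.

```python
import pandas as pd

# Hypothetical sample of names, some fused and some fine
names = pd.Series([
    'Venusaur',
    'VenusaurMega Venusaur',
    'Mr. Mime',
    'HoopaHoopa Confined',
    'Ho-oh',
])

# Flag names where a lowercase letter is immediately followed by an
# uppercase letter, which suggests two words glued together
fused = names[names.str.contains(r'[a-z][A-Z]')]
print(fused.tolist())
```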

Finally, I am ready to move on to some visual EDA. Since I want to find out how Pokemons differ from generation to generation, it is a good idea to group the entries by different generations and see how they look on plots.

The ‘Total’ column is the sum of all the stat values, and I think it is a good indicator of the overall stats of a Pokemon. So, let’s see how the ‘Total’ values are distributed across Pokemons within the same generation.
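
Before plotting, the same grouping can be summarised numerically. This is a small sketch with made-up stand-in values (`sample` is hypothetical); on the real data the `groupby` line would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in: a few Total values in two generations
sample = pd.DataFrame({
    'Generation': [1, 1, 1, 2, 2],
    'Total': [318, 405, 525, 318, 405],
})

# One row per generation with count, mean, median and standard deviation
by_gen = sample.groupby('Generation')['Total'].agg(
    ['count', 'mean', 'median', 'std']
)
print(by_gen)
```

A table like this gives the centre and spread per generation, but not the shape of each distribution, which is where a plot comes in.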

According to Seaborn’s documentation, “A violin plot plays a similar role as a box and whisker plot. It shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared”. The strength of violin plots is that you can see the shape of the distribution from the graph. With box plots, you cannot tell whether the distribution is unimodal or multimodal.

Anatomy of Violin Plot
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.violinplot(x='Generation',y='Total',data=df)

Generations 1 and 3 look more dispersed than the others, while Generation 4 looks the shortest, with two distinct peaks. From the plot, at least, it looks like there might be some significant difference between generations, but it is too early to tell.

While I was looking for definitions and explanations of violin plots, I found out one interesting thing: while violin plots display more information, they can be noisier than a box plot. Hmmm, is that so? I think I will have to do a box plot and see how they differ.

I think the post is getting too long. I will stop here and continue in the second part.


Below is the link to the Gist I created with the code above. Feel free to check it out, and if you have any suggestions or feedback, please let me know.