
Machine learning is a branch of Data Science focused on making computers act without being explicitly programmed [1]. Recently, machine learning has given us self-driving cars, precise web search, practical speech recognition and a massively improved understanding of genetics. Machine learning is so relevant that you probably use it many times a day without noticing it. So, it is not a surprise that machine learning expertise is on the rise in multiple industries [2].
However, machine learning is not an easy topic to learn. It requires a decent level of maths and statistics, let alone programming skills. Still, to reach a point where you are familiar and comfortable working with machine learning, you will have to start exposing yourself to the topic as often as possible. But it can be difficult to find a straightforward example for beginners.
Most of the time, students have to deal with the theoretical aspect of Machine Learning and particular parts of the problem (e.g. terminology used in the financial market when predicting stock prices). As a result, students might get overwhelmed by deciphering industries’ particularities to understand the machine learning problem at hand. For this reason, I have created a Kaggle dataset using a student-friendly example of machine learning.
Why Pokémons?
Because it is simple and less complicated than any other real-life example you will encounter whilst studying Data Science and machine learning. Pokémons are fictional creatures, each having unique characteristics, abilities and skills [3]. Pokémons are trained to battle against each other (just like boxing or karate). Because of the combination of their features, some of them are more likely to win battles. However, a few species of Pokémons are so powerful that they can beat most of their opponents. These rare and powerful creatures are called Legendary Pokémons.
With that in mind, we will apply machine learning algorithms to each species’ characteristics, abilities and skills to predict whether a Pokémon is legendary. So, now, check the step-by-step below.

Machine Learning Step-by-Step
Ideally, you need two datasets. The first one, called the train dataset, will be a list of Pokémons and their characteristics. It will also classify whether each species of Pokémon is legendary. In machine learning terms, a legendary Pokémon will be a ‘1’, whereas a non-legendary Pokémon will be a ‘0’ (zero). Our machine learning model will look at this list and cross-check if there are similarities among data points of legendary and non-legendary Pokémons. You will be training your computer to identify legendary Pokémons, hence the name of the dataset.
The second dataset, called a testing dataset, should be a separate file containing a similar list of Pokémons. However, the testing dataset does not show whether a Pokémon is legendary. Once we have sorted our algorithm on the training dataset, we will test it on the testing dataset and see how well our model predicts legendary Pokémons.
The result will be a percentage; for example, 81.3% certainty that a Pokémon species is legendary. Spoiler alert: it is almost certain that you will never reach 100% certainty on real-life cases, so don’t be discouraged going forward. Just do your best.
Preparing the data
To make your learning process more efficient, I have cleaned the data, so you don’t have to waste time sorting out columns and null values. Here is how I did it:
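As a rough sketch of that cleaning step, here is what it might look like with pandas. The rows and column names below are invented stand-ins that mirror the ones used later in this article (the real data comes from the Kaggle CSV):

```python
import pandas as pd

# Hypothetical raw slice of the dataset; the real file lives on Kaggle.
raw = pd.DataFrame({
    "Name": ["Bulbasaur", "Mewtwo", "Rattata"],
    "Type_1": ["Grass", "Psychic", None],   # a null value to clean up
    "Attack": [49, 110, 56],
    "Defense": [49, 90, 35],
    "Catch_Rate": [45, 3, 255],
    "isLegendary": [False, True, False],
})

# Drop rows with null values and convert the target to the 0/1 labels
# described above (1 = legendary, 0 = non-legendary).
clean = raw.dropna().copy()
clean["isLegendary"] = clean["isLegendary"].astype(int)
print(clean)
```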
Data Analysis
In this section, we will analyse the relationship between different features associated with ‘isLegendary’. A heatmap of correlation between various components might be helpful:
- Positive numbers = Positive correlation, i.e. increase in one feature will increase the other feature & vice-versa.
- Negative numbers = Negative correlation, i.e. increase in one feature will decrease the other feature & vice-versa.
In our case, we focus on which features have a strong positive or negative correlation with the legendary feature.
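A minimal sketch of how such a correlation check can be computed with pandas (the numbers below are invented; the real values come from the Kaggle dataset):

```python
import pandas as pd

# Toy numeric slice of the dataset, just enough to compute correlations.
df = pd.DataFrame({
    "Attack":      [49, 110, 56, 100, 52],
    "Defense":     [49, 90, 35, 120, 43],
    "Catch_Rate":  [45, 3, 255, 3, 255],
    "isLegendary": [0, 1, 0, 1, 0],
})

# Correlation of every feature with the target column
corr = df.corr()["isLegendary"].sort_values(ascending=False)
print(corr)

# The full heatmap can then be drawn with seaborn, e.g.:
# import seaborn as sns; sns.heatmap(df.corr(), annot=True)
```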

It looks like features such as ‘Attack’, ‘Defense’ and ‘Catch_Rate’ have a high correlation with being legendary. These features might be useful for our model later, so let’s keep them in mind.
Now, take a closer look at two other features that have strong correlations with isLegendary.
It seems that most legendary Pokémons are also a Flying-type, followed by the Dragon-type. There are no legendary Poison, Fighting or Bug types. Still, the Type_1 feature can be useful to predict legendary Pokémons. Like Type_1, Type_2 can be helpful to predict legendary Pokémons.
Feature Extraction
In this step of the process, we want to extract only features (mentioned above) that will be useful to our machine learning model. Also, we must make it easier for our model to read the data points and make the necessary calculations. So, we have to transform each dataset from this…

To this…

Although there is no secret to this step, there is no need to go through the whole data preparation process here on Medium. So, I have outlined it for you on my Kaggle page. Now, we can move on to the fun bit, the machine learning.
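The core of that transformation is one-hot encoding: categorical columns such as Type_1 cannot be fed to the classifiers directly, so each category becomes its own 0/1 column. A minimal sketch, with made-up rows:

```python
import pandas as pd

# Two invented rows with the categorical columns used in this article
df = pd.DataFrame({
    "Name": ["Charizard", "Moltres"],
    "Type_1": ["Fire", "Fire"],
    "Type_2": ["Flying", "Flying"],
    "isLegendary": [0, 1],
})

# One-hot encode the type columns: 'Type_1' becomes 'Type_1_Fire', etc.
encoded = pd.get_dummies(df, columns=["Type_1", "Type_2"])
print(encoded.columns.tolist())
```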
Machine Learning
Finally, we have reached our machine learning step. There are many algorithms out there, but we will apply only some of the most used classification algorithms to predict legendary Pokémons:
- Logistic Regression
- k-Nearest Neighbor (KNN)
- Decision Tree
- Random Forest
- Naive Bayes (GaussianNB)
Here is the training and testing procedure:
First, we will define datasets and train these classifiers with our training data. Also, import the necessary modules.
Second, using the trained classifier, we will predict the Legendary outcome from the test data.
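The two steps above can be sketched with scikit-learn as follows. The feature rows and labels here are tiny made-up arrays standing in for the real feature matrix and the ‘isLegendary’ column:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

# Invented rows: [Attack, Defense, Catch_Rate]; 1 = legendary
X_train = np.array([[49, 49, 45], [110, 90, 3], [56, 35, 255], [100, 120, 3]])
y_train = np.array([0, 1, 0, 1])
X_test = np.array([[105, 100, 3]])          # an unseen, legendary-looking row

classifiers = {
    "Logistic Regression": LogisticRegression(),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

predictions = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                # first: train the classifier
    predictions[name] = clf.predict(X_test)  # second: predict the outcome
    print(name, predictions[name])
```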
Logistic Regression
Logistic regression is a model where the dependent variable (DV) is categorical. It covers cases of a binary dependent variable, that is, one that can take only two values, ‘0’ and ‘1’. These cases represent outcomes such as win/lose, dead/alive and legendary/not-legendary.
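A minimal sketch of that binary setup: a single made-up feature (a ‘total stats’ number, not a column from the real dataset) predicting the 0/1 legendary label, with the model returning a probability for class ‘1’:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: higher total stats -> more likely legendary
X = np.array([[300], [320], [350], [600], [640], [680]])
y = np.array([0, 0, 0, 1, 1, 1])           # the binary dependent variable

model = LogisticRegression().fit(X, y)
proba = model.predict_proba([[650]])[0, 1]  # probability of class '1'
print(model.predict([[650]]), round(proba, 3))
```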
k-Nearest Neighbours
The k-nearest neighbours algorithm (kNN) is one of the simplest machine learning algorithms, used for both classification and regression. In kNN classification, the output is a class membership: a data point is assigned to the class most common among its nearest neighbours. k is the number of neighbours considered, so it must be a positive integer (usually a small number). If k = 1, then the object is simply assigned the class of its single nearest neighbour. In other words:
‘if it has a tail, four legs, and it barks, then it is likely to be a dog.’
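A small sketch of the majority-vote idea, with invented [Attack, Defense] rows; with k = 3, the query point takes the class of its three closest rows:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Invented rows: three non-legendary, three legendary
X = np.array([[49, 49], [52, 43], [56, 35], [110, 90], [100, 120], [134, 95]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
# The three nearest neighbours of [105, 100] are all legendary rows
print(knn.predict([[105, 100]]))
```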
Decision Tree
Decision trees are flowchart-like structures in which each internal node represents a ‘test’ on a certain attribute. In our case, for example: is a Pokémon a Dragon type or an Electric type? Each branch represents the test’s outcome, and each leaf node represents a decision after taking into account all attributes. Thus, each path from the root to a leaf represents a classification rule.
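You can see those root-to-leaf rules directly by printing a fitted tree. A sketch on invented [Attack, Catch_Rate] rows:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented rows: non-legendary Pokémons have high catch rates
X = np.array([[49, 45], [56, 255], [110, 3], [100, 3]])
y = np.array([0, 0, 1, 1])

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Each printed path from the root to a leaf is one classification rule
rules = export_text(tree, feature_names=["Attack", "Catch_Rate"])
print(rules)
```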
Random Forest
Random forests are an ensemble learning method for classification that works by constructing several decision trees. Random forests output the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for a single decision tree’s downside of overfitting the training set.
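A minimal sketch of that voting behaviour, reusing the same invented [Attack, Catch_Rate] style of rows: many trees are fitted on random subsets, and the mode of their votes becomes the prediction:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Invented rows: 1 = legendary
X = np.array([[49, 45], [56, 255], [52, 255], [110, 3], [100, 3], [134, 3]])
y = np.array([0, 0, 0, 1, 1, 1])

# 100 decision trees; the majority vote (the mode) is the final class
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict([[120, 3]]))
```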
Gaussian Naive Bayes
The Naive Bayes is a simple probabilistic classifier based on the famous Bayes’ theorem with strong (naive) independence assumptions between the features.
Bayes’ theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event [4]. In other words, Bayes’ theorem is a way of finding a probability when we know certain other probabilities. For example, suppose COVID-19 is related to age. In that case, a person’s age can be used to assess the probability that the person has COVID-19 more accurately than an assessment made without knowing their age.
All naive Bayes classifiers assume that a particular feature’s value is independent of the value of any other feature, given the class variable. For instance, a fruit may be considered an orange if it is orange, round, and about 7 cm in diameter. A naive Bayes classifier considers each feature to contribute independently to the probability that this fruit is an orange, regardless of any possible correlations between the colour, roundness, and diameter features.
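In the Gaussian variant, each feature is modelled per class with an independent normal distribution. A sketch on invented [Attack, Defense] rows:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Invented rows: 1 = legendary
X = np.array([[49, 49], [56, 35], [52, 43], [110, 90], [100, 120], [134, 95]])
y = np.array([0, 0, 0, 1, 1, 1])

# Each feature gets an independent Gaussian per class; Bayes' theorem
# then combines them into a class probability
nb = GaussianNB().fit(X, y)
print(nb.predict([[105, 100]]), nb.predict_proba([[105, 100]]).round(3))
```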
Finally, we will calculate the accuracy score (in percentage) of each trained classifier.
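The scoring itself is one call to scikit-learn’s accuracy_score. The label arrays below are invented stand-ins for the real test labels and a classifier’s predictions:

```python
from sklearn.metrics import accuracy_score

y_true = [0, 0, 1, 1, 0, 1, 0, 0]   # actual isLegendary labels (made up)
y_pred = [0, 0, 1, 0, 0, 1, 0, 1]   # a classifier's predictions (made up)

# Fraction of correct predictions, reported as a percentage
acc = accuracy_score(y_true, y_pred)
print(f"{acc * 100:.1f}%")
```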

We can see that the Decision Tree and Random Forest classifiers have the highest accuracy score. However, we choose the Random Forest Classifier as it can limit overfitting, whereas the Decision Tree classifier does not. So, next time someone creates a new Pokémon, we can apply our Random Forest Classifier to predict with 86.2% certainty whether the new creature is legendary.
Conclusions
Nowadays, machine learning is all around us. It will continue to provide innovative solutions to difficult classification and prediction problems. Indeed, it is not an easy topic to learn. Still, it can be even more daunting in most cases because of the real-life examples, which require some degree of familiarity with the industry (e.g. the stock market). Pokémons come in handy for data scientists and machine learning students who want to understand how classification works at the most basic level. Experts might argue that 86.2% accuracy is not enough. Still, this article’s primary goal was to take you through the rationale behind a machine learning project. I hope you feel encouraged to take the Pokémon challenge. Ultimately, these challenges will make you more comfortable working with machine learning. So, are you ready?
Thanks for reading. Here are some articles you might like:
The Perfect Python Cheatsheet for Beginners
Increase Productivity: Data Cleaning using Python and Pandas
References:
[1] Coursera https://www.coursera.org/learn/machine-learning
[2] KD Nuggets https://www.kdnuggets.com/2020/11/rise-machine-learning-engineer.html
[3] Pokémons https://www.pokemon.com/uk/pokedex/
[4] Bayes’ Theorem https://en.wikipedia.org/wiki/Bayes%27_theorem