
Exploratory Data Analysis with Python 101


Switching sides from a useR to a Pythonista

As a data scientist, I recently took the chance, left my R tidyverse comfort zone, and embarked on an adventure to explore the unknown depths of Python, NumPy, scikit-learn, and pandas.

Photo by Matthew Sleeper on Unsplash

To document my adventure of exploring the Python data science ecosystem, I started this series of stories. Its purpose is, firstly, to create an online reference page for myself and, secondly, to help other fellow R data scientists who want to embark on a similar journey and explore the possibilities of Python, or simply to support other people who are getting started with data analysis in Python.

In this story we will dive right into our first adventure and descend to the very bottom of the North Atlantic Ocean to explore the Titanic dataset with Python. If you want to tag along and practice your data science skills on the same dataset, please visit Kaggle and the Titanic Challenge, which, by the way, is a great place to start your data science career.


Reading data

Reading in data with Python couldn’t be more similar to R. There are, of course, many ways to do it, but one of the most popular options is the read_csv function from the pandas module. This function takes your csv file and reads it directly into a pandas DataFrame, which is the go-to data structure for tabular data in Python.

import pandas as pd
data = pd.read_csv("/kaggle/input/titanic/train.csv")
data

If you read your data inside a Jupyter notebook, you can type out the variable that you used to store the data and it will display the first and last couple of lines of the dataframe. If you only want a glimpse of the first couple of observations, use data.head(6), where the number inside the brackets corresponds to the number of lines that will be returned.
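To see what .head() and its counterpart .tail() return, here is a minimal sketch on a toy dataframe (a hypothetical stand-in, since the real Titanic csv lives on Kaggle):

```python
import pandas as pd

# Toy stand-in for the Titanic data (the real file lives on Kaggle)
toy = pd.DataFrame({
    "PassengerId": range(1, 11),
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
})

first_six = toy.head(6)  # first 6 rows
last_two = toy.tail(2)   # last 2 rows
print(first_six)
```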

Nice! Our data is in Python! Well that was easy… But how else can we get some more insights into our data?

Exploring data (Part I: tables)

There is a set of methods in pandas that provide you with some more information on your dataset; they are described in more detail below.

.info()

The first method is data.info(), which is very similar to the R function str(data) and returns information on all columns, their datatypes, and missing values:

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

You can see the column index, its name, the number of non-null (i.e. not missing) values, and the datatype. The dataframe has 891 rows in total, and the columns Age, Cabin, and Embarked seem to have some missing data.
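If you would rather compute the missing-value counts directly instead of reading them off the .info() printout, isna() combined with sum() does the trick; a minimal sketch on toy data:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps, mimicking the Age and Cabin columns above
toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, "C123"],
})

# isna() marks missing cells; sum() counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```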

.describe()

The second option that you can use to explore your data in a tabular way is data.describe() which is similar to the R function summary(data).

data.describe()
Image by Author

This function returns descriptive statistics on your data, such as the count, mean value, and standard deviation, as well as the quartiles, median, min, and max values.

.value_counts()

You can see that only the numeric columns were used to create the table. If you want some more insight into your non-numeric columns (in this case Cabin, Embarked, Ticket, Name, Sex), you can use the value_counts() method as follows:

data["Sex"].value_counts()
male      577
female    314
Name: Sex, dtype: int64
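If you want a summary of all non-numeric columns at once instead of calling value_counts() per column, describe() also accepts an include argument; a small sketch on toy data:

```python
import pandas as pd

# Toy frame mixing text and numeric columns
toy = pd.DataFrame({
    "Sex":      ["male", "female", "male", "male"],
    "Embarked": ["S", "C", "S", "S"],
    "Fare":     [7.25, 71.28, 7.92, 53.10],
})

# For object columns, describe() reports count, number of unique
# values, the most frequent value (top), and its frequency (freq)
obj_summary = toy.describe(include="object")
print(obj_summary)
```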

Let’s go one step further and group attributes to get aggregate statistics per group (similar to dplyr’s group_by() or SQL’s GROUP BY). If, for example, we want to know which gender had the higher probability of surviving the Titanic disaster, we could do the following:

data[["Sex", "Survived"]].groupby(["Sex"], as_index = False).mean()

And we can clearly see that about 74% of women survived, while only roughly 18% of men were lucky enough to get out alive.
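groupby() is not limited to a single statistic: the .agg() method lets you compute several aggregates per group in one call, roughly what a dplyr summarize() with multiple expressions does. A sketch on toy data:

```python
import pandas as pd

# Toy frame with the same column names as the Titanic data
toy = pd.DataFrame({
    "Sex":      ["male", "female", "male", "female"],
    "Survived": [0, 1, 0, 1],
    "Age":      [22.0, 38.0, 35.0, 27.0],
})

# Named aggregation: one output column per (source column, function) pair
stats = toy.groupby("Sex").agg(
    survival_rate=("Survived", "mean"),
    mean_age=("Age", "mean"),
    max_age=("Age", "max"),
)
print(stats)
```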

Correlation analysis – .corr()

The dataset is not too large, so we can compute the correlation coefficients on the entire dataset with the corr() method as follows:

data.corr(numeric_only=True)  # numeric_only is required in pandas >= 2.0; older versions silently dropped text columns
Image by Author

Note how Python uses a method to compute the result, whereas R usually takes a function such as cor(data) to return your values.

The command above returns the correlation coefficients for each pair of numerical attributes. If our dataset were larger, or if we were only interested in the correlations of one variable, we could do the following, which subsets the correlation matrix and sorts the coefficients:

data.corr(numeric_only=True)["Parch"].sort_values(ascending=False)
Parch          1.000000
SibSp          0.414838
Fare           0.216225
Survived       0.081629
Pclass         0.018443
PassengerId   -0.001652
Age           -0.189119
Name: Parch, dtype: float64

This is basically as far as tabular exploratory data analysis gets us. It’s time to pull out the big guns and create some plots!

Exploring data (Part II: plots)

In the first part we saw which methods we can use to explore our data in an informative but rather boring (tabular) manner. In this part we will go one step further and explore our data with some graphs.

If you come from R, you probably fell in love with ggplot2. I am not going to lie: the Python plotting ecosystem is very different from R’s and deserves an article of its own. To help you find your way around the basic plotting functions in Python that will support your exploratory data analysis, I collected a set of useful functions, which are listed below. Spoiler alert: Seaborn is awesome! But it will take you some time to get used to its grammar if you come from R. If you don’t like change, and switching to Python syntax and indentation is already enough for you, I have good news: check out plotnine, which is basically ggplot for Python with the good old R syntax.

Scatter matrix

Let’s stay in the area of correlation analysis and create scatter plots, which can tell us a little bit more about the relationships between our features by plotting them against each other. Here the pandas scatter_matrix() function comes in handy:

from pandas.plotting import scatter_matrix
scatter_matrix(data)
Image by Author

As can be seen above, each numerical attribute was plotted against every other. By default, the diagonal doesn’t show a scatter plot but rather a histogram of each variable, which comes in quite handy; you can change this behaviour in the function call. Although we do not have many features, the graph is almost unreadable. Let’s therefore zoom in a little and only display 3 features of our choice, which makes the plot much more readable:

attributes = ["Age", "Fare", "Parch"]
scatter_matrix(data[attributes], figsize=(12, 8))
Image by Author

If you want to zoom in even more to see the relationship of two variables in more detail, you can do:

data.plot(kind="scatter", x="Fare", y="Age", alpha=0.2)
Image by Author

By setting the alpha value to 0.1–0.2, the data points do not mask each other out and the density of observations becomes visible. In the plot above, for example, we can see that most passengers travelled with a fare under $100 and were between about 15 and 50 years old.

sns.heatmap()

I know it has been a lot of correlation analysis, but before we move on I would like to present one more method, based on the seaborn module, which enables you to create correlation plots: sns.heatmap():

import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(data[["Age", "Sex", "SibSp", "Parch", "Pclass", "Fare", "Embarked", "Survived"]].corr(numeric_only=True), annot=True)
plt.show()
Image by Author

This plot is much more readable than the graphs and tables above due to its color coding of correlation coefficients.

sns.factorplot()

Remember when we looked into Survived and Sex by grouping the "Sex" variable above? There is an easy way to do this graphically via seaborn’s factorplot() function (note that in newer seaborn versions factorplot() was renamed to catplot(), and its size argument to height).

sns.catplot(x="Sex", y="Survived", data=data, kind="bar", height=3)  # catplot()/height= replace the removed factorplot()/size=
plt.show()
Image by Author

This is essentially a plotted version of the code below, which returns the mean value (and variance) per group.

data[["Sex", "Survived"]].groupby(["Sex"], as_index = False).mean()
data[["Sex", "Survived"]].groupby(["Sex"], as_index = False).var()

sns.catplot()

If you are interested in plotting the counts of categorical variables then catplot() from seaborn is the function to go to:

sns.catplot(x="Survived", col="Sex", col_wrap=4,
 data=data,
 kind="count", height=2.5, aspect=.8)
plt.show()
Image by Author

As opposed to the previous plot, this function returns the counts per category in a FacetGrid. To get more information on this very powerful plot type and other seaborn plotting functions, visit the seaborn website. Since FacetGrids are an integral part of seaborn, we will cover them in more detail at the end of this article.

hist()

Another quick way to get a feel for the type of data you are dealing with is to plot histograms for each numerical variable. The hist() method can be applied either to a single attribute at a time or to the entire dataset at once:

data.hist(bins=20, figsize=(10,10))
plt.show()
Image by Author

Very similar to the hist() function in R, right?

But again, remember: in Python you mostly work with methods that you apply to an object with the your_data.method_name() syntax, whereas in R you usually apply functions like function_name(your_data).
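The contrast looks like this in practice (a toy illustration, not Titanic-specific):

```python
import pandas as pd

toy = pd.DataFrame({"x": [3, 1, 2]})

# Python/pandas style: operations are mostly methods on the object ...
sorted_method = toy["x"].sort_values().tolist()

# ... while plain R-style function calls exist too, e.g. the built-in len()
n_rows = len(toy)

print(sorted_method, n_rows)
```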

sns.distplot()

Another nice way of plotting distributions of variables is the seaborn distplot function, which returns a histogram together with a kernel density estimate. Note that distplot has since been deprecated in favour of displot() and histplot():

sns.histplot(data["Age"], kde=True)  # histplot(kde=True) replaces the deprecated distplot()
plt.show()
Image by Author

sns.FacetGrid

FacetGrid() from seaborn is the last function for today. It helps you create facet plots (basically grouping your variables by categories and plotting them all in one figure). If, for example, you want to see the distribution of Age by Survived (as a factor), you could do the following:

g = sns.FacetGrid(data, row="Survived")
g.map(sns.histplot, "Age", bins=25)  # histplot replaces the deprecated distplot
plt.show()
Image by Author

FacetGrids are very powerful and you can use them with basically every seaborn plot type. For more examples, please check out the awesome seaborn documentation.

Conclusions and remedies that will make life easier for R users when starting with EDA in Python

Photo by milan degraeve on Unsplash
  • As you can see, the function names and calls for basic exploratory data analysis tasks in R and Python are very similar. The most confusing part for many is remembering when, in Python, to apply a method to a dataset and when to call a function instead, but just keep practicing until you get it right 😉
  • If you feel as lost with plotting in Python as I did in the beginning, have a look at the plotnine module, which has the same syntax as ggplot2 in R.
  • If you miss piping %>% dplyr-style, I also have good news for you: check out the dfply module, which, like dplyr, allows chaining of multiple operations with pipe operators in Python.
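Plain pandas also gets you surprisingly close to a dplyr pipeline out of the box via method chaining, without any extra module; a sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex":      ["male", "female", "male", "female"],
    "Survived": [0, 1, 1, 1],
})

# Roughly: data %>% filter(Survived == 1) %>% count(Sex)
result = (
    toy
    .loc[lambda df: df["Survived"] == 1]   # filter rows
    .groupby("Sex")                        # group
    .size()                                # count per group
)
print(result)
```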

Final overview on basic EDA functions in Python and their R equivalent:

Reading data

#Reading data in Python
import pandas as pd
pd.read_csv("data.csv")
---
#Reading data in R
library(readr)
read_csv("data.csv")

Exploring data

#Exploring data in Python
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
---
#Exploring data in R
str(data)
Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame':    891 obs. of  12 variables:
 $ PassengerId: num  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : num  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : num  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
 $ Sex        : chr  "male" "female" "female" "female" ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : num  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : num  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : chr  NA "C85" NA "C123" ...
 $ Embarked   : chr  "S" "C" "S" "S" ...
 - attr(*, "spec")=
  .. cols(
  ..   PassengerId = col_double(),
  ..   Survived = col_double(),
  ..   Pclass = col_double(),
  ..   Name = col_character(),
  ..   Sex = col_character(),
  ..   Age = col_double(),
  ..   SibSp = col_double(),
  ..   Parch = col_double(),
  ..   Ticket = col_character(),
  ..   Fare = col_double(),
  ..   Cabin = col_character(),
  ..   Embarked = col_character()
  .. )

Summarise data

#Summarise data in Python
data.describe()
Image by Author
#Summarise data in R
summary(data)
PassengerId       Survived          Pclass          Name          
 Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
 1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
 Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
 Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
 3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
 Max.   :891.0   Max.   :1.0000   Max.   :3.000                     

     Sex                 Age            SibSp           Parch       
 Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
 Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
 Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
                    Mean   :29.70   Mean   :0.523   Mean   :0.3816  
                    3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
                    Max.   :80.00   Max.   :8.000   Max.   :6.0000  
                    NA's   :177                                     
    Ticket               Fare           Cabin             Embarked        
 Length:891         Min.   :  0.00   Length:891         Length:891        
 Class :character   1st Qu.:  7.91   Class :character   Class :character  
 Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
                    Mean   : 32.20                                        
                    3rd Qu.: 31.00                                        
                    Max.   :512.33

Value counts

# Value Counts in Python
data["Sex"].value_counts()
male      577
female    314
Name: Sex, dtype: int64
---
# Value Counts in R (dplyr)
data %>% count(Sex) #or
data %>% group_by(Sex) %>% tally()
Sex     n
<chr> <int>
female 314
male   577

Aggregate statistics

#Data aggregation in Python
data[["Sex", "Survived"]].groupby(["Sex"], as_index = False).mean()
---
#Data aggregation in R (dplyr)
data %>% group_by(Sex) %>% summarize(mean_survival=mean(Survived))

Leave me a comment if you have any questions!

Cheers, Martin
