Switching sides: from a useR to a Pythonista
As a data scientist, I recently took the chance, left my R tidyverse comfort zone, and embarked on an adventure to explore the unknown depths of Python, NumPy, scikit-learn and pandas.

To document my adventure of exploring the Python data science ecosystem, I started this series of stories. Its purpose is, firstly, to create an online reference page for myself and, secondly, to help fellow R data scientists who want to embark on a similar journey and explore the possibilities of Python, or simply to support other people who are getting started with data analysis in Python.
In this story we dive right into our first adventure and descend to the very bottom of the North Atlantic Ocean to explore the Titanic dataset with Python. If you want to tag along and practice your data science skills on the same dataset, visit Kaggle and the Titanic challenge, which, by the way, is a great place to start your data science career.
Reading data
Reading in data with Python couldn’t be more similar to R. Of course there are many ways, but one of the most popular options is the read_csv function from the pandas module. It takes your CSV file and reads it directly into a pandas DataFrame, the go-to data structure for tabular data in Python.
import pandas as pd
data = pd.read_csv("/kaggle/input/titanic/train.csv")
data
If you read your data inside a Jupyter notebook, you can simply type out the variable that stores the data and it will display the first and last couple of lines of the DataFrame. If you only want a glimpse of the first couple of observations, use data.head(6), where the number inside the parentheses corresponds to the number of lines that will be returned.
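A minimal sketch of both options (the row counts are arbitrary):
data.head(6) # first 6 rows of the DataFrame
data.tail(3) # the counterpart: last 3 rows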

Nice! Our data is in Python! Well, that was easy… But how else can we get some more insights into our data?
Exploring data (Part I: tables)
There is a set of methods in pandas that provide more information on your dataset; they are described in more detail below.
.info()
The first method is data.info(), which is very similar to the base R function str(data) and returns information on all columns, their datatypes and missing values:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
You can see the column index, its name, the count of non-null (non-missing) values and the datatype. The DataFrame has 891 rows in total, and the columns Age, Cabin as well as Embarked seem to have some missing data.
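If the missing values are what you are mainly after, a common pandas idiom (not shown in the output above, but standard pandas) is to count the NaN values per column directly:
# Count missing (NaN) values per column
data.isnull().sum()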
.describe()
The second option that you can use to explore your data in a tabular way is data.describe(), which is similar to the R function summary(data).
data.describe()

This function returns descriptive statistics on your data, such as the count, mean value, standard deviation as well as the quartiles, median, min and max values.
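Note that describe() only summarises the numeric columns by default. If you also want a summary of the object columns (count, number of unique values, most frequent value and its frequency), you can pass the standard include parameter:
# Summarise only the non-numeric (object) columns
data.describe(include=["object"])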
.value_counts()
As you can see, only the numeric columns were used to create the table above. For more insights into a single non-numeric column (in this case Cabin, Embarked, Ticket, Name or Sex), the value_counts() method comes in handy:
data["Sex"].value_counts()
male      577
female    314
Name: Sex, dtype: int64
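One caveat: value_counts() silently drops missing values by default. Since we already know from data.info() that Embarked has two missing entries, we can make them visible with dropna=False:
# Include NaN as its own category in the counts
data["Embarked"].value_counts(dropna=False)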
Let’s go one step further and group attributes to get aggregate statistics per group (similar to dplyr’s group_by() or SQL’s GROUP BY). If we want to know, for example, which gender had the higher probability of surviving the Titanic disaster, we can do the following:
data[["Sex", "Survived"]].groupby(["Sex"], as_index = False).mean()

And we can clearly see that 74% of women survived, while only roughly 18% of men were lucky enough to get out alive.
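A closely related alternative (standard pandas, though not used in the grouped example above) is pd.crosstab, which tabulates the two variables against each other; normalize="index" turns the counts into the same survival rates per gender:
# Survival rates per gender as a cross table
pd.crosstab(data["Sex"], data["Survived"], normalize="index")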
Correlation analysis – .corr()
The dataset is not too large, so we can compute the correlation coefficients on the entire dataset with the corr() method as follows:
data.corr(numeric_only=True) # needed in pandas >= 2.0; older versions silently drop non-numeric columns

Note how Python uses a method to compute the results, whereas R usually takes a function such as cor(data) to return your values.
The command above returns the correlation coefficients for each combination of numerical attributes. If our dataset were larger, or if we were only interested in the correlations of one variable, we could do the following, which subsets the correlation matrix and sorts the coefficients:
data.corr(numeric_only=True)["Parch"].sort_values(ascending=False)
Parch          1.000000
SibSp          0.414838
Fare           0.216225
Survived       0.081629
Pclass         0.018443
PassengerId   -0.001652
Age           -0.189119
Name: Parch, dtype: float64
This is basically as far as tabular exploratory data analysis gets us. It’s time to pull out the big guns and create some plots!
Exploring data (Part II: plots)
In the first part we saw which methods we can use to explore our data in an informative but rather boring (tabular) manner. In this part we will go one step further and explore our data with some graphs. If you come from R, you probably fell in love with ggplot2. I am not going to lie: the Python plotting ecosystem is very different from R’s and deserves an article of its own. To help you find your way around the basic plotting functions in Python that will support your exploratory data analysis, I collected a set of useful functions, which are listed below. Spoiler alert: Seaborn is awesome! But it will take you some time to get used to its grammar if you come from R. If you don’t like change, and switching to Python’s syntax and indentation is already enough for you, I have good news: check out plotnine, which is basically ggplot2 for Python with the good old R syntax.
Scatter matrix
Let’s stay in the area of correlation analysis and create scatter plots, which can tell us a little more about the relationships between our features by plotting them against each other. Here the pandas scatter_matrix() function comes in handy:
from pandas.plotting import scatter_matrix
scatter_matrix(data)

As can be seen above, each numerical attribute was plotted against every other. By default the diagonal doesn’t show a scatter plot but rather a histogram of each variable, which comes in quite handy (you can change this behaviour in the function call). Although we do not have many features, the graph is almost unreadable. Let’s therefore zoom in a little and only display three features of our choice, which makes the plot much more readable:
attributes=["Age", "Fare", "Parch"]
scatter_matrix(data[attributes], figsize=(12,8)

If you want to zoom in even more to see some more detail in the relationship of two variables you can do:
data.plot(kind="scatter", x="Fare", y="Age", alpha=0.2)

By setting the alpha value to 0.1–0.2, the data points do not mask each other out and the density of observations becomes visible. In the plot above, for example, we can see that most passengers travelled with a fare of under $100 and were between roughly 15 and 50 years old.
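As a small extension (my own addition, not part of the original plot), pandas scatter plots can also colour the points by a third column via the c and cmap parameters, for example to overlay survival on the Fare/Age relationship:
# Colour each point by the Survived column (0 or 1)
data.plot(kind="scatter", x="Fare", y="Age", alpha=0.2, c="Survived", cmap="viridis")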
sns.heatmap()
I know it has been a lot of correlation analysis, but before we move on I would like to present one more option based on the seaborn module, sns.heatmap(), which lets you plot a correlation matrix:
import seaborn as sns
import matplotlib.pyplot as plt
# numeric_only=True is needed in recent pandas; non-numeric columns (Sex, Embarked) are dropped unless encoded first
cols = ["Age", "Sex", "SibSp", "Parch", "Pclass", "Fare", "Embarked", "Survived"]
sns.heatmap(data[cols].corr(numeric_only=True), annot=True)
plt.show()

This plot is much more readable than the graphs and tables above due to its color coding of correlation coefficients.
sns.factorplot()
Remember when we looked into Survived and Sex by grouping on the "Sex" variable above? There is an easy way to do this graphically via the seaborn factorplot() function (renamed to catplot(), with size renamed to height, in seaborn 0.9 and later).
sns.factorplot(x="Sex", y="Survived", data=data, kind="bar", size=3) # in newer seaborn: sns.catplot(..., height=3)
plt.show()

This is essentially a plotted version of the code below, which returns the mean value (and variance) per group.
data[["Sex", "Survived"]].groupby(["Sex"], as_index = False).mean()
data[["Sex", "Survived"]].groupby(["Sex"], as_index = False).var()
sns.catplot()
If you are interested in plotting the counts of categorical variables, then catplot() from seaborn is the go-to function:
sns.catplot(x="Survived", col="Sex", col_wrap=4,
data=data,
kind="count", height=2.5, aspect=.8)
plt.show()

As opposed to the previous plot, this function returns the counts per category in a FacetGrid. To get more information on this very powerful plot type and other seaborn plotting functions, visit the seaborn website. Since FacetGrids are an integral part of seaborn, we will cover them in more detail at the end of this article.
hist()
Another quick way to get a feel for the type of data you are dealing with is to plot histograms for each numerical variable. The hist() method can be applied either to a single attribute at a time or to the entire dataset at once:
data.hist(bins=20, figsize=(10,10))
plt.show()

Very similar to the hist() function in R, right?
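And as promised, the same method also works on a single column at a time, e.g. for Age (the bin count is arbitrary):
data["Age"].hist(bins=20)
plt.show()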
But again, remember: in Python you mostly work with methods that you apply to an object with the your_data.method_name() syntax, whereas in R you usually apply functions like function_name(your_data).
sns.distplot()
Another nice way of plotting distributions of variables is by using the seaborn distplot() function, which returns a histogram together with a kernel density estimate:
sns.distplot(data["Age"])
plt.show()
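One caveat: distplot() has been deprecated in newer seaborn releases. If your seaborn version complains, the closest equivalent is histplot() with a KDE overlay:
sns.histplot(data["Age"], kde=True)
plt.show()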

sns.FacetGrid
FacetGrid() from seaborn is the last function for today. It helps you create facet plots (basically grouping your variables by categories and plotting them all in one figure). If you, for example, want to see the distribution of Age by Survived (as a factor), you can do the following:
g = sns.FacetGrid(data, row="Survived")
g.map(sns.distplot, "Age", bins=25) # with newer seaborn, map sns.histplot instead
plt.show()

FacetGrids are very powerful and you can use them with basically every seaborn plot type. For more examples, please check out the awesome seaborn documentation.
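To illustrate (a hypothetical example of my own, not one of the original plots), the same grid mechanism works just as well with a scatter plot faceted by two categorical variables:
# Facet Age vs. Fare by Sex (columns) and Survived (rows)
g = sns.FacetGrid(data, col="Sex", row="Survived")
g.map(plt.scatter, "Fare", "Age", alpha=0.3)
plt.show()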
Conclusions and remedies that will make life easier for R users when starting with EDA in Python

- As you can see, the function names and calls for basic exploratory data analysis tasks in R and Python are very similar. The most confusing part for many is remembering when Python wants a method applied to a dataset and when it wants a function call instead, but just keep practicing until you get it right 😉
- If you feel as lost with plotting in Python as I did in the beginning, have a look at the plotnine module, which has the same syntax as ggplot2 in R (see the sketch after this list).
- If you miss piping %>% dplyr-style, I also have good news for you: check out the dfply module, which, like dplyr, allows chaining multiple operations with pipe operators in Python (also sketched below).
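A minimal sketch of both modules, assuming they are installed (the column choices are arbitrary):
# ggplot2-style plotting with plotnine
from plotnine import ggplot, aes, geom_histogram
ggplot(data, aes(x="Age")) + geom_histogram(bins=20)

# dplyr-style piping with dfply; X is a placeholder for the piped DataFrame
from dfply import X, group_by, summarize
data >> group_by(X.Sex) >> summarize(mean_survival=X.Survived.mean())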
Final overview on basic EDA functions in Python and their R equivalent:
Reading data
#Reading data in Python
import pandas as pd
pd.read_csv("data.csv")
---
#Reading data in R
library(readr)
read_csv("data.csv")
Exploring data
#Exploring data in Python
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
---
#Exploring data in R
str(data)
Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 891 obs. of 12 variables:
$ PassengerId: num 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : num 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : num 3 1 3 1 3 3 1 3 3 2 ...
$ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
$ Sex : chr "male" "female" "female" "female" ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : num 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : num 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : chr NA "C85" NA "C123" ...
$ Embarked : chr "S" "C" "S" "S" ...
- attr(*, "spec")=
.. cols(
.. PassengerId = col_double(),
.. Survived = col_double(),
.. Pclass = col_double(),
.. Name = col_character(),
.. Sex = col_character(),
.. Age = col_double(),
.. SibSp = col_double(),
.. Parch = col_double(),
.. Ticket = col_character(),
.. Fare = col_double(),
.. Cabin = col_character(),
.. Embarked = col_character()
.. )
Summarise data
#Summarise data in Python
data.describe()

#Summarise data in R
summary(data)
  PassengerId       Survived          Pclass          Name          
 Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
 1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
 Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
 Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
 3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
 Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
     Sex                 Age            SibSp           Parch       
 Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
 Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
 Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
                    Mean   :29.70   Mean   :0.523   Mean   :0.3816  
                    3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
                    Max.   :80.00   Max.   :8.000   Max.   :6.0000  
                    NA's   :177                                     
    Ticket               Fare            Cabin             Embarked        
 Length:891         Min.   :  0.00   Length:891         Length:891        
 Class :character   1st Qu.:  7.91   Class :character   Class :character  
 Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
                    Mean   : 32.20                                        
                    3rd Qu.: 31.00                                        
                    Max.   :512.33                                        
Value counts
# Value Counts in Python
data["Sex"].value_counts()
male      577
female    314
Name: Sex, dtype: int64
---
# Value Counts in R (dplyr)
data %>% count(Sex) #or
data %>% group_by(Sex) %>% tally()
Sex         n
<chr>   <int>
female    314
male      577
Aggregate statistics
#Data aggregation in Python
data[["Sex", "Survived"]].groupby(["Sex"], as_index = False).mean()
---
#Data aggregation in R (dplyr)
data %>% group_by(Sex) %>% summarize(mean_survival=mean(Survived))
Leave me a comment if you have any questions!
Cheers, Martin
Further material:
For all of you coming from R to Python, these links might be useful:
- Henze, M., R & Python Rosetta Stone: EDA with dplyr vs pandas. https://heads0rtai1s.github.io/2020/11/05/r-python-dplyr-pandas/
- Mitsa, T., Python And R for Data Wrangling: Compare Pandas and Tidyverse Code Side-By-Side, and Learn Speed-Up Tips. https://towardsdatascience.com/python-and-r-for-data-wrangling-examples-for-both-including-speed-up-considerations-f2ec2bb53a86
- Pandey, P., From ‘R vs Python’ to ‘R and Python’. https://towardsdatascience.com/from-r-vs-python-to-r-and-python-aa25db33ce17
- Weseley, G., Tidying Up Pandas. https://towardsdatascience.com/tidying-up-pandas-4572bfa38776
- Frei, L., How to Use ggplot2 in Python. https://towardsdatascience.com/how-to-use-ggplot2-in-python-74ab8adec129
- Akinkunle, A., dplyr-style Data Manipulation with Pipes in Python. https://towardsdatascience.com/dplyr-style-data-manipulation-with-pipes-in-python-380dcb137000
General Python Documentation:
- Python Seaborn Documentation: https://seaborn.pydata.org/