Gaining a solid understanding of Pandas series
Agenda
The two central data structures of Pandas are Series and DataFrame. This post is an attempt to have a proper understanding of Pandas series.
The foundation of a DataFrame is a Series. The docstring of DataFrame defines a DataFrame as:
Can be thought of as a dict-like
container for Series objects
Many operations on dataframe return series instance. It is thus essential that we have a solid understanding of Series.
What is a series
A Pandas series
can be conceptualised in two ways. It can be envisioned as a single column of tabular data. It can also be envisioned as a single row of tabular data.
Let’s assume there is a database table called accounting
which stores revenue
and expenses
across different years.
year revenue expense
2017 1000 800
2018 1200 900
2019 1500 1100
The collection of all the revenues is a Series
, i.e [1000, 1200, 1500]
is a series. The series is a collection of same attribute of different objects/rows/instances.
Series
can also be thought of as a single row of a the table, i.e [2017, 1000, 800]
is a series too. In this case the series is a collection of different attributes of a single object.
Creating a series
In a real world scenario, you will rarely create a series. You will probably be provided with a csv or some other data source from where you will read the data. pandas has utilities to read data from these data sources.
The read data will be a pandas dataframe. You will extract some series out of the dataframe and operate on the series.
This is an easy task though. For now, let’s explicitly create a series
A Series
can be created using pandas.Series
.
Let’s try it on ipython shell.
In [2]: revenue = pd.Series([1000, 1200, 1500, 800, 900], index=['2014', '2015', '2016', '2017', '2018'])
index
keyword argument assigns label to different values of series. These labels come handy in retrieving data from series.
Indexing a series
Let’s get information out of this series. Let’s retrieve the revenue for the year 2016.
In [8]: revenue['2016']
Out[8]: 1500
Series also has two methods for indexing, they are .loc
and .iloc
.
.loc
does label based indexing.
In [9]: revenue.loc['2016']
Out[9]: 1500
.iloc
does integer based indexing. 2016 is the third entry, i.e indexed 2.
In [10]: revenue.iloc[2]
Out[10]: 1500
revenue.iloc['2016']
would fail with a TypeError
since iloc
expects integer based indexing. Since 2016
is in quotes and is a string, hence it would cause a TypeError
.
revenue.loc[2]
would fail with an KeyError
since loc
expects label based indexing. As the index keys are '2014', '2015' and '2016'
, and 2
isn’t present in index key, hence it would cause a KeyError
.
Similarly revenue['2017']
would also cause a KeyError
.
Slicing a series
We can slice a series using the same syntax as list slicing. We want to get a slice from 2014 to 2016.
In [17]: revenue['2014': '2016']
Out[17]:
2014 1000
2015 1200
2016 1500
dtype: int64
Notice that the returned data structure is a pandas.Series
too.
In [18]: type(revenue['2014': '2016'])
Out[18]: pandas.core.series.Series
If you want to use integer positions for slicing, you will have to use .iloc
. Integer positions can only be used with an iloc
. It cannot be used with loc
.
In [20]: revenue.iloc[0:3]
Out[20]:
2014 1000
2015 1200
2016 1500
dtype: int64
The returned object is again a Series.
If we want information for multiple years, we will have to pass a list to the indexing syntax. Let’s get revenue for years 2014 and 2017.
In [23]: revenue[['2014', '2017']]
Out[23]:
2014 1000
2017 800
dtype: int64
This was interesting where series slicing is much more concise than list slicing.
Say you have list which looks like the following:
revenue_list = [1000, 1200, 1500]
In case we want a single value, we can say revenue_list[0]
. However there is no list slicing syntax for getting multiple non contiguous values. We can’t say revenue_list[[0, 2]]
. Instead we would have to say [revenue_list[0], revenue_list[2]
.
But with a series, we could just say revenue[['2014', '2017']]
.
We could have used revenue.loc[['2014', '2017']]
too if we wanted to keep it explicit.
2014 is the first entry and 2017 is the fourth entry in the series. We could also use revenue.iloc[[0, 3]]
.
In [25]: revenue.iloc[[0, 3]]
Out[25]:
2014 1000
2017 800
dtype: int64
As you should have noticed, slicing a series always returns another series.
Getting labels of the series
We want a list of all the years for which we have revenue.
In [93]: revenue.index
Out[93]: Index(['2014', '2015', '2016', '2017', '2018'], dtype='object')In [94]: list(revenue.index)
Out[94]: ['2014', '2015', '2016', '2017', '2018']
As we saw a Series
object has an attribute called index
which returns the labels.
Filtering a series
We want to get all years where revenue is greater than 1000.
In [50]: revenue[revenue > 1000]
Out[50]:
2015 1200
2016 1500
dtype: int64
We will understand how it works under the hood in next section.
Filtering using a list of booleans
List of booleans is the underlying mechanism which enables filtering on series.
We want to get revenue for 2014 and 2017. 2014 is positioned 0 and 2017 is positioned 3.
We will have to create a list of same length as revenue and set 0th and 3rd element as True.
In [75]: l = [True, False, False, True, False] # Set 0th and 3rd element TrueIn [76]: revenue[l]
Out[76]:
2014 1000
2017 800
dtype: int64
We passed a list of booleans to indexing syntax and it filtered the series to return another series. The returned series only contains values where the corresponding element was True in the boolean list.
Instead of list, we could create a series of length 5 with boolean value and use the boolean series with revenue
.
In [78]: boolean_series = pd.Series([True, False, False, True, False], index=['2014', '2015', '2016', '2017', '2018'])In [79]: revenue[boolean_series]
Out[79]:
2014 1000
2017 800
dtype: int64
Since we wanted to use the boolean series with revenue, we had to ensure that the indexes of both series match.
Let’s see what revenue > 1000
give us:
In [80]: revenue > 1000
Out[80]:
2014 False
2015 True
2016 True
2017 False
2018 False
dtype: bool
It returns us a series which is similar to boolean_series we explicitly created.
Since it’s a series, we could directly use this with revenue
and do revenue[revenue > 1000]
. This is exactly what we did in the last section on filtering.
More filtering
We want to get all the years where revenue is greater than 1000 but less than 1300.
In [68]: revenue[(revenue > 1000) & (revenue < 1300)]
Out[68]:
2015 1200
dtype: int64
Arithmetic operation on series
Suppose we realise that each year’s revenue was off by 100. We want to increase each year’s revenue by 100 to make up for it.
In [81]: revenue + 100
Out[81]:
2014 1100
2015 1300
2016 1600
2017 900
2018 1000
dtype: int64
Notice how each value of the series increased by 100.
Any arithmetic operation on series is applied to all the values of the series.
Ordering on series
We want to sort the revenues in ascending order.
In [87]: revenue.sort_values()
Out[87]:
2017 800
2018 900
2014 1000
2015 1200
2016 1500
dtype: int64
We want to sort the revenues in descending order.
In [88]: revenue.sort_values(ascending=False)
Out[88]:
2016 1500
2015 1200
2014 1000
2018 900
2017 800
dtype: int64
We want to know the year which had maximum revenue.
In [92]: revenue.sort_values(ascending=False).index[0]
Out[92]: '2016'
revenue.sort(ascending=False)
returns an ordered series. And since a series has attribute index
, we could use it and get the label of year with maximum revenue.
There is a better method to achieve this though. We will see idxmax()
in next section.
Aggregation on series
Let’s find out total revenue across all the years.
In [96]: revenue.sum()
Out[96]: 5400
Series has utilities to easily find the average, maximum and minimum too.
In [102]: revenue.mean()
Out[102]: 1080.0In [103]: revenue.max()
Out[103]: 1500In [104]: revenue.min()
Out[104]: 800
We want to know the year which had maximum revenue.
In [106]: revenue.idxmax()
Out[106]: '2016'
Grouping and aggregation on series
Let’s assume there could be multiple rows for revenue for a particular year.
In [2]: revenue_with_multiple_entries = pd.Series([1000, 1200, 1500, 800, 900, 500], index=['2014', '2015', '2016', '2017', '2018', '2017'])In [3]: print(revenue_with_multiple_entries)
2014 1000
2015 1200
2016 1500
2017 800
2018 900
2017 500
dtype: int64
We have multiple rows for revenue of 2017. We want to know the total revenue across different years.
We can achieve it by doing a groupby
and then doing a sum
.
In [4]: revenue_with_multiple_entries.groupby(revenue_with_multiple_entries.index).sum()
Out[4]:
2014 1000
2015 1200
2016 1500
2017 1300
2018 900
dtype: int64
Most common value of series
Let’s create a series of names of students in a class.
In [6]: names = pd.Series(['steve jobs', 'bill gates', 'mark twain', 'charles darwin', 'charles dickens', 'mark zuckerberg', 'charles darwin'])In [7]: print(names)
0 steve jobs
1 bill gates
2 mark twain
3 charles darwin
4 charles dickens
5 mark zuckerberg
6 charles darwin
You should have noticed that we did not provide explicit index
to the series. So an integer label was automatically assigned.
Let’s find the most common name.
In [10]: names.value_counts()
Out[10]:
charles darwin 2
bill gates 1
mark zuckerberg 1
mark twain 1
steve jobs 1
charles dickens 1
String operation on series
Let’s find all the names which start with mark
.
In [16]: names[names.str.startswith('mark')]
Out[16]:
2 mark twain
5 mark zuckerberg
dtype: object
names.str.startswith('mark')
returns a boolean array which we are passing to indexing syntax to filter the series.
A pandas series has str
attribute which allows executing string operations.
apply method
Series has a method called apply
which allows doing advanced filtering.
We want to find all names having surname as darwin
.
In [27]: names[names.apply(lambda x: x.split(' ')[-1] == 'darwin')]
Out[27]:
3 charles darwin
6 charles darwin
dtype: object
apply
method expects a function to be passed to it. That’s why we passed it a lambda function.
Calling names.apply() applies the lambda function to all values of the series. The return value of .apply()
is a series
of booleans.
In [26]: names.apply(lambda x: x.split(' ')[-1] == 'darwin')
Out[26]:
0 False
1 False
2 False
3 True
4 False
5 False
6 True
dtype: bool
We used the boolean series
to filter the original series.
Thank you for reading so far! Before you go:
- 👏 Clap for the story if it helped :)
- 📰 View more content from me More Stories
- 🔔 Stay connected: Twitter | LinkedIn