Gaining a solid understanding of Pandas series

Akshar Raaj
Towards Data Science
8 min readAug 4, 2019

--

Photo by Stone Wang on Unsplash

Agenda

The two central data structures of Pandas are Series and DataFrame. This post is an attempt to have a proper understanding of Pandas series.

The foundation of a DataFrame is a Series. The docstring of DataFrame defines a DataFrame as:

Can be thought of as a dict-like
container for Series objects

Many operations on dataframe return series instance. It is thus essential that we have a solid understanding of Series.

What is a series

A Pandas series can be conceptualised in two ways. It can be envisioned as a single column of tabular data. It can also be envisioned as a single row of tabular data.

Let’s assume there is a database table called accounting which stores revenue and expenses across different years.

 year   revenue   expense
2017 1000 800
2018 1200 900
2019 1500 1100

The collection of all the revenues is a Series, i.e [1000, 1200, 1500] is a series. The series is a collection of same attribute of different objects/rows/instances.

Series can also be thought of as a single row of a the table, i.e [2017, 1000, 800] is a series too. In this case the series is a collection of different attributes of a single object.

Creating a series

In a real world scenario, you will rarely create a series. You will probably be provided with a csv or some other data source from where you will read the data. pandas has utilities to read data from these data sources.

The read data will be a pandas dataframe. You will extract some series out of the dataframe and operate on the series.
This is an easy task though. For now, let’s explicitly create a series

A Series can be created using pandas.Series.

Let’s try it on ipython shell.

In [2]: revenue = pd.Series([1000, 1200, 1500, 800, 900], index=['2014', '2015', '2016', '2017', '2018'])

index keyword argument assigns label to different values of series. These labels come handy in retrieving data from series.

Indexing a series

Let’s get information out of this series. Let’s retrieve the revenue for the year 2016.

In [8]: revenue['2016']
Out[8]: 1500

Series also has two methods for indexing, they are .loc and .iloc.

.loc does label based indexing.

In [9]: revenue.loc['2016']
Out[9]: 1500

.iloc does integer based indexing. 2016 is the third entry, i.e indexed 2.

In [10]: revenue.iloc[2]
Out[10]: 1500

revenue.iloc['2016'] would fail with a TypeError since iloc expects integer based indexing. Since 2016 is in quotes and is a string, hence it would cause a TypeError.

revenue.loc[2] would fail with an KeyError since loc expects label based indexing. As the index keys are '2014', '2015' and '2016' , and 2 isn’t present in index key, hence it would cause a KeyError.
Similarly revenue['2017'] would also cause a KeyError.

Slicing a series

We can slice a series using the same syntax as list slicing. We want to get a slice from 2014 to 2016.

In [17]: revenue['2014': '2016']
Out[17]:
2014 1000
2015 1200
2016 1500
dtype: int64

Notice that the returned data structure is a pandas.Series too.

In [18]: type(revenue['2014': '2016'])
Out[18]: pandas.core.series.Series

If you want to use integer positions for slicing, you will have to use .iloc. Integer positions can only be used with an iloc. It cannot be used with loc.

In [20]: revenue.iloc[0:3]
Out[20]:
2014 1000
2015 1200
2016 1500
dtype: int64

The returned object is again a Series.

If we want information for multiple years, we will have to pass a list to the indexing syntax. Let’s get revenue for years 2014 and 2017.

In [23]: revenue[['2014', '2017']]
Out[23]:
2014 1000
2017 800
dtype: int64

This was interesting where series slicing is much more concise than list slicing.
Say you have list which looks like the following:

revenue_list = [1000, 1200, 1500]

In case we want a single value, we can say revenue_list[0] . However there is no list slicing syntax for getting multiple non contiguous values. We can’t say revenue_list[[0, 2]] . Instead we would have to say [revenue_list[0], revenue_list[2] .
But with a series, we could just say revenue[['2014', '2017']].

We could have used revenue.loc[['2014', '2017']]too if we wanted to keep it explicit.

2014 is the first entry and 2017 is the fourth entry in the series. We could also use revenue.iloc[[0, 3]].

In [25]: revenue.iloc[[0, 3]]
Out[25]:
2014 1000
2017 800
dtype: int64

As you should have noticed, slicing a series always returns another series.

Getting labels of the series

We want a list of all the years for which we have revenue.

In [93]: revenue.index
Out[93]: Index(['2014', '2015', '2016', '2017', '2018'], dtype='object')
In [94]: list(revenue.index)
Out[94]: ['2014', '2015', '2016', '2017', '2018']

As we saw a Series object has an attribute called index which returns the labels.

Filtering a series

We want to get all years where revenue is greater than 1000.

In [50]: revenue[revenue > 1000]
Out[50]:
2015 1200
2016 1500
dtype: int64

We will understand how it works under the hood in next section.

Filtering using a list of booleans

List of booleans is the underlying mechanism which enables filtering on series.

We want to get revenue for 2014 and 2017. 2014 is positioned 0 and 2017 is positioned 3.

We will have to create a list of same length as revenue and set 0th and 3rd element as True.

In [75]: l = [True, False, False, True, False] # Set 0th and 3rd element TrueIn [76]: revenue[l]
Out[76]:
2014 1000
2017 800
dtype: int64

We passed a list of booleans to indexing syntax and it filtered the series to return another series. The returned series only contains values where the corresponding element was True in the boolean list.

Instead of list, we could create a series of length 5 with boolean value and use the boolean series with revenue.

In [78]: boolean_series = pd.Series([True, False, False, True, False], index=['2014', '2015', '2016', '2017', '2018'])In [79]: revenue[boolean_series]
Out[79]:
2014 1000
2017 800
dtype: int64

Since we wanted to use the boolean series with revenue, we had to ensure that the indexes of both series match.

Let’s see what revenue > 1000 give us:

In [80]: revenue > 1000
Out[80]:
2014 False
2015 True
2016 True
2017 False
2018 False
dtype: bool

It returns us a series which is similar to boolean_series we explicitly created.

Since it’s a series, we could directly use this with revenue and do revenue[revenue > 1000]. This is exactly what we did in the last section on filtering.

More filtering

We want to get all the years where revenue is greater than 1000 but less than 1300.

In [68]: revenue[(revenue > 1000) & (revenue < 1300)]
Out[68]:
2015 1200
dtype: int64

Arithmetic operation on series

Suppose we realise that each year’s revenue was off by 100. We want to increase each year’s revenue by 100 to make up for it.

In [81]: revenue + 100
Out[81]:
2014 1100
2015 1300
2016 1600
2017 900
2018 1000
dtype: int64

Notice how each value of the series increased by 100.

Any arithmetic operation on series is applied to all the values of the series.

Ordering on series

We want to sort the revenues in ascending order.

In [87]: revenue.sort_values()
Out[87]:
2017 800
2018 900
2014 1000
2015 1200
2016 1500
dtype: int64

We want to sort the revenues in descending order.

In [88]: revenue.sort_values(ascending=False)
Out[88]:
2016 1500
2015 1200
2014 1000
2018 900
2017 800
dtype: int64

We want to know the year which had maximum revenue.

In [92]: revenue.sort_values(ascending=False).index[0]
Out[92]: '2016'

revenue.sort(ascending=False) returns an ordered series. And since a series has attribute index, we could use it and get the label of year with maximum revenue.

There is a better method to achieve this though. We will see idxmax() in next section.

Aggregation on series

Let’s find out total revenue across all the years.

In [96]: revenue.sum()
Out[96]: 5400

Series has utilities to easily find the average, maximum and minimum too.

In [102]: revenue.mean()
Out[102]: 1080.0
In [103]: revenue.max()
Out[103]: 1500
In [104]: revenue.min()
Out[104]: 800

We want to know the year which had maximum revenue.

In [106]: revenue.idxmax()
Out[106]: '2016'

Grouping and aggregation on series

Let’s assume there could be multiple rows for revenue for a particular year.

In [2]: revenue_with_multiple_entries = pd.Series([1000, 1200, 1500, 800, 900, 500], index=['2014', '2015', '2016', '2017', '2018', '2017'])In [3]: print(revenue_with_multiple_entries)
2014 1000
2015 1200
2016 1500
2017 800
2018 900
2017 500
dtype: int64

We have multiple rows for revenue of 2017. We want to know the total revenue across different years.

We can achieve it by doing a groupby and then doing a sum.

In [4]: revenue_with_multiple_entries.groupby(revenue_with_multiple_entries.index).sum()
Out[4]:
2014 1000
2015 1200
2016 1500
2017 1300
2018 900
dtype: int64

Most common value of series

Let’s create a series of names of students in a class.

In [6]: names = pd.Series(['steve jobs', 'bill gates', 'mark twain', 'charles darwin', 'charles dickens', 'mark zuckerberg', 'charles darwin'])In [7]: print(names)
0 steve jobs
1 bill gates
2 mark twain
3 charles darwin
4 charles dickens
5 mark zuckerberg
6 charles darwin

You should have noticed that we did not provide explicit index to the series. So an integer label was automatically assigned.

Let’s find the most common name.

In [10]: names.value_counts()
Out[10]:
charles darwin 2
bill gates 1
mark zuckerberg 1
mark twain 1
steve jobs 1
charles dickens 1

String operation on series

Let’s find all the names which start with mark.

In [16]: names[names.str.startswith('mark')]
Out[16]:
2 mark twain
5 mark zuckerberg
dtype: object

names.str.startswith('mark') returns a boolean array which we are passing to indexing syntax to filter the series.

A pandas series has str attribute which allows executing string operations.

apply method

Series has a method called apply which allows doing advanced filtering.

We want to find all names having surname as darwin.

In [27]: names[names.apply(lambda x: x.split(' ')[-1] == 'darwin')]
Out[27]:
3 charles darwin
6 charles darwin
dtype: object

apply method expects a function to be passed to it. That’s why we passed it a lambda function.

Calling names.apply() applies the lambda function to all values of the series. The return value of .apply() is a seriesof booleans.

In [26]: names.apply(lambda x: x.split(' ')[-1] == 'darwin')
Out[26]:
0 False
1 False
2 False
3 True
4 False
5 False
6 True
dtype: bool

We used the boolean series to filter the original series.

Thank you for reading so far! Before you go:

--

--