PyTrix Series

PyTrix is a weekly series where I present cool things that can be done in Python that are useful for Data Scientists. In previous weeks of the PyTrix Series, we focused on working with NumPy arrays, specifically indexing and slicing…
Pandas is a fast, powerful, flexible and easy-to-use open source library built on top of Python for data manipulation and analysis. Like NumPy, it is one of the de facto libraries for Data Science. Many people most likely started their Data Science journey by learning Pandas, and if you haven’t, then I’d highly suggest that you read the documentation (after reading this story, of course).
For access to the code used in this post on GitHub, click the link below.
Choice for Indexing
Many people learning Pandas find its methods for multi-axis indexing hard to understand, which is fair since there are a number of ways to do object selection:
- .loc is primarily label based, meaning that it selects rows (or columns) with explicit labels from the index. Therefore, a KeyError will be raised if the items are not found.
- .iloc is primarily integer position based (from 0 to length-1 of the axis), meaning that rows (or columns) are selected by their position in the index. It takes integers, although it may also be used with a boolean array. If the indexer is out of bounds then an IndexError will be raised.
- [] is primarily for selecting out lower-dimensional slices.
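To make the three options above concrete, here is a minimal sketch on a small made-up frame. The name demo and its values are my own illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame: string row labels, so labels and positions differ.
demo = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    index=["x", "y", "z"],
    columns=["A", "B", "C", "D"],
)

by_label = demo.loc["y"]      # the row whose label is "y"
by_position = demo.iloc[1]    # the second row, whatever its label is
column = demo["B"]            # a lower-dimensional slice: one column as a Series

# A label that is not in the index raises KeyError with .loc
try:
    demo.loc["missing"]
    loc_error = None
except KeyError:
    loc_error = "KeyError"

# A single position that is out of bounds raises IndexError with .iloc
try:
    demo.iloc[10]
    iloc_error = None
except IndexError:
    iloc_error = "IndexError"
```

Here .loc["y"] and .iloc[1] happen to pick the same row, but only because "y" sits at position 1.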
Note: The Python and NumPy indexing operators [] and attribute operator . provide quick and easy access to pandas data structures across a wide range of use cases. This makes interactive work intuitive, as there’s little new to learn if you already know how to deal with Python dictionaries and NumPy arrays. However, since the type of the data to be accessed isn’t known in advance, directly using standard operators has some optimization limits. For production code, we recommend that you take advantage of the optimized pandas data access methods exposed in this chapter.
- An extract from the Pandas documentation
Attribute access with [ ]
If you have been following the PyTrix Series, or if you have some familiarity with Python indexing behavior, then you will know that the primary method used for indexing lower-dimensional slices is [].
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 5),
                  columns=["col_1", "col_2", "col_3", "col_4", "col_5"])
df

# access all elements in "col_1"
df["col_1"]
>>>> 0 0.32
1 0.74
2 0.14
3 0.79
4 0.58
Name: col_1, dtype: float64
Above we see that using [] to index a Pandas DataFrame returns a Series corresponding to the name of the column. If the data is a Series, a scalar will be returned. For example, I will use the Series returned above to show that we can get a scalar like so:
a = df["col_1"]
a[1]
>>>> 0.79390631428754759
This is the most basic form of indexing. We may also pass a list of columns to [] to select columns in that order.
df[["col_2", "col_3", "col_4"]]
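As a quick sketch of the difference between a single column name and a list of names, the snippet below uses a small hypothetical frame (demo is my own name, not from the notebook). A single name gives back a Series, while a list gives back a DataFrame with the columns in the order requested.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with three columns
demo = pd.DataFrame(np.arange(6).reshape(2, 3),
                    columns=["col_1", "col_2", "col_3"])

single = demo["col_1"]              # one name -> Series
subset = demo[["col_3", "col_1"]]   # list of names -> DataFrame, in the order given
one_col_frame = demo[["col_1"]]     # a one-element list still gives a DataFrame
```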

I will not go into too much depth with this functionality, since it is very much like indexing and slicing in ordinary Python and NumPy, except that when slicing by label both the start and end bounds are included.
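Here is a small sketch of how the slice bounds behave, assuming a made-up Series and frame with string labels (s and frame are hypothetical names). Slicing by position follows the usual Python rule and excludes the end, while slicing by label includes it.

```python
import pandas as pd

# Hypothetical data with string labels
s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])
frame = pd.DataFrame({"v": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

by_position = s.iloc[0:2]        # positions 0 and 1, end excluded
by_label = s.loc["a":"c"]        # labels "a", "b" and "c", end included

rows_by_position = frame[0:2]    # [] with an integer slice selects rows by position
rows_by_label = frame["a":"c"]   # [] with a label slice selects rows, end included
```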
Selection by Label
Pandas uses a strict inclusion-based protocol so that the many methods for purely label-based indexing function correctly: every label asked for must be in the index, and providing a label that is not in the index will raise a KeyError.
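The sketch below shows the strict behaviour with a list of labels, assuming pandas 1.0 or later (older versions only warned). The frame demo and the list wanted are hypothetical, and using Index.intersection to keep only the labels that exist is just one possible workaround, not something the original post covers.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with labels "a", "b", "c"
demo = pd.DataFrame(np.arange(6).reshape(3, 2),
                    index=["a", "b", "c"], columns=["A", "B"])
wanted = ["a", "c", "z"]   # "z" is not in the index

# One missing label in the list is enough to raise KeyError
try:
    demo.loc[wanted]
    raised = False
except KeyError:
    raised = True

# Keep only the labels that really exist before selecting
present = demo.index.intersection(wanted)
safe = demo.loc[present]
```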
For this section I will be creating a new Pandas DataFrame
new_df = pd.DataFrame(np.random.randn(10, 4), index=list("abcdefghij"), columns=list("ABCD"))
new_df

The .loc attribute is the primary access method when selecting by label. Valid inputs into .loc are as follows:
- A single label (Note that a label may be an integer which does not refer to the integer position along the index)
# cross-section using a single label
new_df.loc["a"]
>>>> A 0.952954
B -1.310324
C -1.376740
D 0.276258
Name: a, dtype: float64
- A list or array of labels
# A list or array of labels
new_df.loc[["a", "d", "i"]]

- A slice with labels (Note that both the start and end bounds are included in the slice, contrary to Python slicing)
# slice with labels
new_df.loc["a":"d", "B":]

- Boolean array
# a boolean array
new_df.loc["j"] > 0.5
>>>> A False
B False
C True
D True
Name: j, dtype: bool
# another way to use a boolean array
new_df.loc[:, new_df.loc["j"] > 0.5]
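Two points from the list above are easy to miss, so here is a minimal sketch with a hypothetical frame (demo is my own name). It shows that an integer label is not the same as an integer position, and that a boolean Series built from a column can filter rows as well as columns.

```python
import pandas as pd

# Hypothetical frame whose integer labels (10, 20, 30) are not positions
demo = pd.DataFrame({"value": [0.2, 0.8, 0.6]}, index=[10, 20, 30])

by_label = demo.loc[10]       # the row labelled 10
by_position = demo.iloc[0]    # the row at position 0 (the same row here)

# 0 is a valid position but not a label, so .loc rejects it
try:
    demo.loc[0]
    zero_is_label = True
except KeyError:
    zero_is_label = False

# A boolean Series built from a column filters the rows
high_rows = demo.loc[demo["value"] > 0.5]
```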

Selection by Position
Pandas uses 0-based indexing that follows the semantics of Python and NumPy slicing, so the start bound is included and the end bound is excluded. There are a variety of methods that can be used to access elements by position using purely integer-based indexing. Trying to use a non-integer, even a valid label, will raise an error (the documentation names IndexError, although some versions raise TypeError for a single label).
# using same df but new index values
new_df.index = list(range(0, 10))
new_df

The .iloc attribute is the primary access method when selecting by position. Valid inputs into .iloc are as follows:
- An integer
# cross section using an integer position
new_df.iloc[3]
>>>> A -0.650225
B 1.571667
C -1.204090
D 0.637101
Name: 3, dtype: float64
- A list or array of integers
# list or array of integers
new_df.iloc[[1, 5, 2, 3]]

# another example
new_df.iloc[[1, 2, 3, 4], [0, 1, 2]]

- A slice object with integers
# integer slicing
new_df.iloc[:3]

# integer slicing by column
new_df.iloc[:, :3]

- A boolean array.
new_df.iloc[1:] > 0.5
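The comparison above produces a boolean frame. To use a boolean array as the indexer itself, a minimal sketch could look like the following. The frame demo is hypothetical, and converting the mask with to_numpy() is my own suggestion, because as far as I know .iloc rejects a boolean Series passed directly.

```python
import pandas as pd

# Hypothetical frame
demo = pd.DataFrame({"A": [0.1, 0.9, 0.4, 0.7], "B": [1, 2, 3, 4]})

# Turn the boolean Series into a plain NumPy array before handing it to .iloc
mask = (demo["A"] > 0.5).to_numpy()
selected = demo.iloc[mask]   # keeps the rows where the mask is True
```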

Slices that go out of bounds are handled gracefully, which can result in an empty axis.
new_df.iloc[10:13]

However, if we attempt to use a single indexer that is out of bounds, this will raise an IndexError. Additionally, if we attempt to use a list of indexers of which any element is out of bounds, this will also raise an IndexError.
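The out-of-bounds rules can be checked with a short sketch. The frame demo and the helper raises_index_error are hypothetical names that I introduce only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 10 rows (positions 0 to 9)
demo = pd.DataFrame(np.arange(20).reshape(10, 2), columns=["A", "B"])

empty = demo.iloc[10:13]    # slice fully out of bounds -> empty frame
partial = demo.iloc[8:13]   # slice partly out of bounds -> rows 8 and 9 only

def raises_index_error(indexer):
    """Return True if demo.iloc[indexer] raises IndexError."""
    try:
        demo.iloc[indexer]
    except IndexError:
        return True
    return False

single_oob = raises_index_error(10)         # single out-of-bounds position
list_oob = raises_index_error([1, 2, 10])   # list with one out-of-bounds position
```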
Note: Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a View versus Copy.
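As a rough sketch of what avoiding chained assignment looks like, the snippet below sets values with a single .loc call instead of two [] calls in a row. The frame demo is hypothetical, and the chained form is left in a comment because its behaviour differs between pandas versions.

```python
import pandas as pd

# Hypothetical frame
demo = pd.DataFrame({"A": [1, -2, 3], "B": [0, 0, 0]})

# Chained form to avoid (it may modify a temporary copy instead of demo):
#   demo[demo["A"] > 0]["B"] = 1

# Single .loc call: row mask and column label in one indexing operation
demo.loc[demo["A"] > 0, "B"] = 1
```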
Selection by Callable
All of the above attributes can accept what is called a callable as an indexer. The requirement for the callable is that it must be a function with one argument that returns a valid output for indexing.
# selection by a callable using []
new_df[lambda df: df.columns[3]]
>>>> 0 0.276258
1 -0.617174
2 0.229638
3 0.637101
4 0.977468
5 0.401624
6 -0.659852
7 0.984220
8 0.947080
9 1.182526
Name: D, dtype: float64
This returns a Series of the column at index 3 (in this scenario, the column at index 3 is "D").
# selection by a callable using .loc
new_df.loc[:, lambda df: ["A", "B"]]

In this example we return all rows and columns A and B using the .loc attribute. Below we will do the exact same thing using .iloc.
# selection by a callable using .iloc
new_df.iloc[:, lambda df: [0, 1]]
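Callables become more useful in a method chain, where the intermediate frame has no name to refer to. Here is a minimal sketch under that assumption. The frame demo and the column C are hypothetical and do not come from the original notebook.

```python
import pandas as pd

# Hypothetical frame
demo = pd.DataFrame({"A": [1, -2, 3, -4], "B": [10, 20, 30, 40]})

# The lambda passed to .loc receives the frame produced by assign(),
# so it can filter on the new column C without a temporary variable.
result = (
    demo.assign(C=lambda d: d["A"] * d["B"])
        .loc[lambda d: d["C"] > 0, ["A", "C"]]
)
```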

Wrap Up
Pandas is an extremely useful tool for Data Scientists. It is worth going through the documentation to see what other cool things are in this package, as there are so many things that I did not touch on in this post, for example how to use these operations on pd.Series.
Thank you for taking your time to go through this story. If you enjoyed it you can find more articles like this one in the PyTrix Series.
If you’d like to reach out to me personally, I am most active on LinkedIn and would be more than happy to connect with you!