In this article, I want to quickly show a few useful pandas methods and functions that can come in handy in your daily work. To manage expectations: this is not an article covering the basics of pandas, and there is no particular theme to the methods. Without further ado, let’s start!
1. hasnans
There are many ways of inspecting whether a Series/DataFrame contains missing values, including dedicated libraries such as missingno. A simple way to check if a column of a DataFrame contains missing values could look as follows:
df["column"].isna().any()
Alternatively, we can use the hasnans attribute of a pd.Series for slightly less typing:
df["column"].hasnans
One downside is that this attribute is only available on a Series, not a DataFrame, though we can easily use it in combination with a list comprehension:
[df[col].hasnans for col in df.columns]
Naturally, we lose the column names with this approach, so I would say the primary use case of hasnans is quick checks on single columns.
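As a minimal sketch (the column names and values here are made up), a dict comprehension keeps the column names while still using hasnans:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1, 2, np.nan], "b": [1, 2, 3]})

# hasnans is a Series attribute, so we check it column by column
missing = {col: df[col].hasnans for col in df.columns}
print(missing)  # {'a': True, 'b': False}
```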
2. transform
The transform method is used for applying transformations to an entire DataFrame or different transformations to selected columns. However, what makes this method stand out is that it returns an object of the same shape as the input.
That is why transform can be a big time-saver when dealing with aggregates. Normally, if we wanted to have some aggregate values (think sum or mean per group) in the original DataFrame, we would create a new, smaller DataFrame using groupby and then merge it back to the original DataFrame. That is a bit time-consuming and produces unnecessary intermediate objects.
With transform, we can do it in one go! Let’s create a DataFrame with some shop IDs and their sales over multiple days (for simplicity, we skip the dates).
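A sketch of such a DataFrame (the shop IDs and sales figures are made up):

```python
import pandas as pd

# toy data: shop IDs and daily sales, dates skipped for simplicity
df = pd.DataFrame({
    "shop_id": [1, 1, 2, 2, 3, 3],
    "sales": [10, 20, 30, 40, 50, 60],
})
print(df)
```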
We can easily add the sum of sales per shop using the following snippet:
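A sketch of that snippet, with made-up shop/sales data. transform returns a Series aligned with the original index, so we can assign it straight back as a new column:

```python
import pandas as pd

df = pd.DataFrame({
    "shop_id": [1, 1, 2, 2, 3, 3],
    "sales": [10, 20, 30, 40, 50, 60],
})

# per-shop sum, broadcast back to every row of the original frame
df["total_sales"] = df.groupby("shop_id")["sales"].transform("sum")
print(df)
```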
transform definitely simplifies the process. One more advanced use case of transform is aggregate filtering. For example, we want to extract the rows concerning shops which sold more than 75 units of goods in total, ideally without creating a new column in the output DataFrame. We can do so as follows:
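A sketch of the filtering step (same made-up shop/sales data as before):

```python
import pandas as pd

df = pd.DataFrame({
    "shop_id": [1, 1, 2, 2, 3, 3],
    "sales": [10, 20, 30, 40, 50, 60],
})

# keep rows from shops whose total sales exceed 75 units,
# without materializing the aggregate as a column
filtered = df[df.groupby("shop_id")["sales"].transform("sum") > 75]
print(filtered)
```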
3. merge_asof
In principle, this function is similar to a standard left join; however, we match on the nearest key rather than on equal keys. This will become clearer with an example.
We also need to know that there are 3 directions of joining using merge_asof (the definitions are adapted from the documentation):
backward – for each row in the left DataFrame we select the last row in the right DataFrame whose on key is less than or equal to the left’s key. This is the default one.
forward – for each row in the left DataFrame we select the first row in the right DataFrame whose on key is greater than or equal to the left’s key.
nearest – for each row in the left DataFrame we select the row in the right DataFrame whose on key is closest in absolute distance to the left’s key.
Another thing to take into consideration is that both DataFrames must be sorted by the key.
I will not try to reinvent the wheel and will use the example provided in the documentation. Here, we have two DataFrames, one containing trade information and the other one containing quotes. Both contain detailed timestamps (including milliseconds).
Then, we merge the two DataFrames within 2ms between the quote time and the trade time:
This results in the following DataFrame:
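A condensed sketch of that example, adapted from the pandas documentation (the timestamps and prices below are a shortened, made-up subset):

```python
import pandas as pd

trades = pd.DataFrame({
    "time": pd.to_datetime([
        "2016-05-25 13:30:00.023",
        "2016-05-25 13:30:00.038",
        "2016-05-25 13:30:00.048",
    ]),
    "ticker": ["MSFT", "MSFT", "GOOG"],
    "price": [51.95, 51.95, 720.77],
})
quotes = pd.DataFrame({
    "time": pd.to_datetime([
        "2016-05-25 13:30:00.023",
        "2016-05-25 13:30:00.030",
        "2016-05-25 13:30:00.041",
        "2016-05-25 13:30:00.049",
    ]),
    "ticker": ["MSFT", "MSFT", "MSFT", "GOOG"],
    "bid": [51.95, 51.97, 51.99, 720.50],
})

# both frames must already be sorted by the "on" key;
# tolerance limits the (default, backward) search to 2 milliseconds
merged = pd.merge_asof(
    trades, quotes,
    on="time", by="ticker",
    tolerance=pd.Timedelta("2ms"),
)
print(merged)
```

Note how the second and third trades get no matching quote: the closest earlier quote is more than 2 ms away (or does not exist), so the bid column is NaN for those rows.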
It is easiest to understand the logic of the merge by tracing a few rows of the output back to the two input DataFrames.
4. insert
This one is a simple yet handy method. We can use it to insert a column at a specific location in a DataFrame. The most common use case is adding a column with extra information for easier analysis of the output. For our own and our stakeholders’ convenience, such information is most useful as one of the first columns. We can do so with the following snippet:
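A minimal sketch (the column name and label below are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# insert a descriptive column at position 0, i.e. as the first column
df.insert(0, "source", "experiment_1")
print(df.columns.tolist())  # ['source', 'a', 'b']
```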
5. explode
explode is useful when we want to expand lists within a DataFrame into multiple rows. Imagine the following case – we have two IDs and multiple values stored within a list for each of them. We would like to have a row for each ID-value combination.
df.explode("values", ignore_index=True)
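A self-contained sketch of that example (the IDs and values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["a", "b"],
    "values": [[1, 2, 3], [4, 5]],
})

# one row per id-value pair; ignore_index produces a fresh 0..n-1 index
exploded = df.explode("values", ignore_index=True)
print(exploded)
```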
6. str
When working with columns containing strings, we can use the str accessor, which gives us a variety of handy string methods that are applied to the entire column in a vectorized manner. For example, we can easily split a column containing full addresses into separate components. First, let’s create the DataFrame.
Then, we can use the str.split method to create the new columns.
Another possibility is to use the str.replace method for modifying the strings.
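A sketch of both steps (the addresses and the resulting column names are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "address": ["123 Main St, Springfield, 12345",
                "456 Oak Ave, Shelbyville, 67890"],
})

# split each full address into separate columns on ", "
parts = df["address"].str.split(", ", expand=True)
parts.columns = ["street", "city", "zip_code"]
print(parts)

# str.replace modifies the strings across the whole column at once
df["address"] = df["address"].str.replace("St", "Street", regex=False)
print(df)
```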
7. read_clipboard
This one can be especially useful when you frequently work with Excel/Google Sheets or receive data in such formats. Using read_clipboard, we can load data from the source sheet into pandas via our computer’s clipboard, bypassing the need to first save the file as .csv or .xlsx. Naturally, this is mostly useful for small, ad-hoc analyses.
Suppose we have a simple table in Google Sheets. We select the range, press cmd+c (or ctrl+c on Windows), and then easily load the data into Python using the snippet below:
pd.read_clipboard()
which returns the copied range as a pandas DataFrame.
8. eval
The eval method uses string expressions to efficiently compute operations on DataFrames. According to the documentation:
The following arithmetic operations are supported: +, -, *, /, **, %, // (python engine only) along with the following boolean operations: | (or), & (and), and ~ (not). Additionally, the 'pandas' parser allows the use of and, or, and not with the same semantics as the corresponding bitwise operators.
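A minimal sketch (column names are made up) – eval parses the string expression and, by default, returns a new DataFrame with the computed column:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# compute a new column from a string expression
result = df.eval("c = a + b")
print(result)
```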
9. squeeze
Remember the case when you wanted to extract a scalar value from a DataFrame, and after some filtering, you used .values[0] to access the final output? And then you probably thought that there must be a nicer way to do it. Indeed there is – the squeeze method.
First, let’s define a DataFrame:
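A reconstruction consistent with the output shown below (the exact values are an assumption):

```python
import pandas as pd

# columns "a" and "b"; row with a == 3 sits at index 2 with b == 6.0
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
```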
Then, we can try to access a particular value in the DataFrame, for example:
df.loc[df["a"] == 3, "b"]
And the output looks as follows:
2 6.0
Name: b, dtype: float64
For a simple print this is OK, but it might cause issues when we want to assign this value to a variable or place it in another DataFrame.
The elegant solution to get the scalar value is to add the squeeze method at the end:
df.loc[df["a"] == 3, "b"].squeeze()
which returns:
6.0
Before you go
You can find the code used for this article on my GitHub. Also, any constructive feedback is welcome. I would be very happy to hear about some pandas functionalities that make your work easier! You can reach out to me on Twitter or in the comments.