
Pandas Minesweeper 101

Look out for these hurdles before diving deep into pandas

Photo by Markus Spiske on Unsplash

Pandas is a flexible library that forms an integral part of a data scientist’s toolkit. Much of its popularity in Python comes from its powerful vectorization, where an operation is applied to an entire array at once. This results in less run time than an equivalent operation written as a loop. Also, most pandas commands are quite straightforward, and pandas has really good documentation with clear examples.

Even though pandas has many great features that are powerful enough to handle millions of rows of data, there are hidden mines that we need to be aware of before stepping in and writing code.

When I first started with this library, I was barely aware of these roadblocks, which easily slip through the cracks but can lead to errors or wrong results. Hence, I have compiled this article to share some of the mines to look out for so that we can sweep them 🙂

Importance of Indexing

When dealing with millions of data points, and especially when filtering to operate on a portion of the dataset, it is important to keep track of the index. Pandas uses the index as a lookup and aligns on index labels when performing operations.

But some operations change the index, and it is good to be aware of them.

💣 Mine 1 – merge(): When joining two dataframes on columns, pandas merge ignores the indexes of the inputs and returns a result with a fresh default index. This behaviour is easy to miss in the docs. Hence, it’s always good to be cautious about the index of the resulting dataframe, and if we want to perform any operations (arithmetic, aggregate, etc.) against the original data, we have to persist the old dataframe’s index and apply it to the new one. In this way, we can avoid index mismatch problems, which are quite annoying to spot when dealing with 1000 lines of code and 100 million data points.
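As a rough illustration of this mine, the sketch below merges two small dataframes on a column and shows the original index being replaced. The frames `left` and `right` and their columns are made up for this example; carrying the index through `reset_index()` and `set_index()` is one possible way to persist it, not the only one.

```python
import pandas as pd

# Hypothetical data: `left` has a meaningful, non-default index.
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]}, index=[10, 20, 30])
right = pd.DataFrame({"key": ["c", "a", "b"], "y": [30, 10, 20]})

# A column-on-column merge returns a fresh 0..n-1 index.
merged = left.merge(right, on="key", how="left")
print(merged.index.tolist())

# Arithmetic aligns on index labels, so none of the labels match here.
misaligned = left["x"] + merged["y"]
print(misaligned.isna().all())

# One way to persist the old index: move it into a column, merge, restore it.
restored = (
    left.reset_index()
    .merge(right, on="key", how="left")
    .set_index("index")
)
aligned = left["x"] + restored["y"]
print(aligned.tolist())
```

With the index restored, the element-wise addition lines up row by row instead of producing missing values.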

💣 Mine 2 – Beware of index-based warnings: Pandas often emits a warning ⚠️ when a filtered lookup (for example, Boolean filtering with .loc or .iloc) forces it to realign the index of the dataframe. It is always good to understand these warnings completely before moving our code to production.

To give one such example, I was recently doing Boolean indexing and came across the warning UserWarning: Boolean Series key will be reindexed to match DataFrame index. Here’s an example of Boolean indexing:

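import pandas as pd
df = pd.DataFrame(data={'values': [1, 56, 3, 4, 5, 98, 100]})
# Use df['values']: df.values is the DataFrame attribute that returns a NumPy array
condition = df['values'] > 10
# Mark only the rows where the condition is True
df.loc[condition, 'marker'] = 'Yes'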

If the indexes of condition and df are not the same, pandas tries to align them and warns us so that we can make sure our code logic is right.
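To see the mismatch in action, here is a minimal sketch in which the Boolean mask carries the same labels as the dataframe but in a different order. Reversing the mask is just a convenient way to force a mismatch for this example, and the exact warning text may differ between pandas versions.

```python
import warnings
import pandas as pd

df = pd.DataFrame({"values": [1, 56, 3, 4, 5, 98, 100]})

# Same labels as df, but reversed, so the two indexes no longer match.
mask = (df["values"] > 10).iloc[::-1]

# Record any warnings raised while filtering with the misordered mask.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    selected = df[mask]

for w in caught:
    print(w.category.__name__, w.message)

# pandas aligned the mask by label, so the matching rows are still selected.
print(selected["values"].tolist())

# Aligning the mask explicitly makes the intent obvious.
aligned_mask = mask.reindex(df.index)
df.loc[aligned_mask, "marker"] = "Yes"
```

Reindexing the mask to `df.index` before using it states the alignment explicitly instead of leaving it to pandas.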

Knowing when to use Inplace

💣 Mine 3 – Many operations in pandas accept a parameter called inplace. Setting it to True applies the change directly to the original dataframe; leaving it as False (the default) returns a new dataframe with the change applied.

When inplace=True is set, the following typically happens:

  • Pandas creates a copy of the original data
  • Performs the requested computation (dropping a column or row, dropna, etc.) on the copy
  • Assigns the result back to the original data
  • Deletes the copy

Even though, at a high level, inplace=True seems to avoid creating a copy of the dataframe in memory, it still creates one to do the necessary computation, as described in the flow of events above.

Also, the code becomes harder to debug if pandas raises a SettingWithCopyWarning ⚠️, which often happens when inplace is used on a filtered slice of a dataframe.

But in some operations, inplace=True can save computational time. Hence, it is good to understand the volume and scope of the dataset before using the inplace parameter.
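A small sketch of the two calling styles, using dropna on a made-up dataframe. The point to notice is that the inplace call returns None, so assigning its result to a variable is a common slip.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", "z"]})

# Default (inplace=False): a new dataframe is returned, df is left untouched.
cleaned = df.dropna()
rows_before = len(df)

# inplace=True: df itself is modified and the call returns None.
returned = df.dropna(inplace=True)

print(rows_before, len(cleaned), len(df), returned)
```

Reassigning the result (`df = df.dropna()`) has the same effect and keeps the code chainable.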

Datetime Mishaps

When I first started with pandas, I found handling DateTime fields a little tricky. Two things need care: the timezone, and the dtype of the column. Before performing any arithmetic on the field, such as adding hours, days, or months, it first has to be converted from an integer (or string) into a DateTime format. Failing to do so can throw errors that were difficult to track down in the beginning 🤖
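As a hedged sketch of that conversion, the example below assumes the dates arrive as integer Unix timestamps in seconds. The column names, the sample values, and the choice of UTC and Asia/Kolkata are all illustrative.

```python
import pandas as pd

# Hypothetical data: event times stored as integer Unix timestamps (seconds).
df = pd.DataFrame({"event_ts": [1609459200, 1609545600]})

# Convert the integers to a datetime column before doing any arithmetic.
df["event_time"] = pd.to_datetime(df["event_ts"], unit="s")

# Date arithmetic now works with Timedelta and DateOffset.
df["plus_2h"] = df["event_time"] + pd.Timedelta(hours=2)
df["plus_1_month"] = df["event_time"] + pd.DateOffset(months=1)

# Make the timezone explicit, then convert to another zone if needed.
df["event_time_utc"] = df["event_time"].dt.tz_localize("UTC")
df["event_time_ist"] = df["event_time_utc"].dt.tz_convert("Asia/Kolkata")

print(df.dtypes)
```

Localizing to a timezone explicitly avoids mixing naive and timezone-aware values later on.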

There was one particular issue for which I couldn’t find much help on the internet. I will describe the scenario in detail and explain how I handled it.

💣 Mine 4 – The task was to filter on a condition and fill date values only on the filtered rows. After finishing it, I found that some of the values in the datetime column were numbers. The integer values appeared sporadically: some rows had correct date values and some had numbers. When I searched for this problem on the internet, the suggestion was to convert the dtype of the resulting column to DateTime. But this didn’t solve my issue.

Later I realized that the problem was with the column initialization. I had initialised the column as

df['date_field'] = np.nan  # float NaN gives the column a float dtype

Instead, it should be initialised as

df['date_field'] = pd.NaT
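The difference between the two initialisations shows up in the column dtype. The sketch below uses a made-up dataframe and a made-up fill date; it only checks dtypes and does not try to reproduce the integer values described above, since that behaviour may depend on the pandas version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"values": [1, 56, 3, 4, 5, 98, 100]})

# Initialising with NaN creates a float column; NaT creates a datetime column.
df["as_float"] = np.nan
df["date_field"] = pd.NaT
print(df.dtypes)

# Fill dates only on the filtered rows; the column stays a datetime column.
mask = df["values"] > 10
df.loc[mask, "date_field"] = pd.Timestamp("2021-06-01")  # hypothetical date
print(df["date_field"].dtype, df["date_field"].notna().sum())
```

Rows that do not match the filter keep NaT, the datetime equivalent of a missing value.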

Missing brackets

💣 Mine 5 – Some pandas commands are methods and need () even when no arguments are passed. If we forget the brackets, pandas does not throw an error; it prints the bound method together with the data, which can be confusing.

[IN] df['values'].sample
[OUT] <bound method NDFrame.sample of 0      1
1     56
2      3
3      4
4      5
5     98
6    100
Name: values, dtype: int64>
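A quick way to tell the two apart is to check whether the result is callable. The sketch below uses sum instead of sample only so that the output is deterministic.

```python
import pandas as pd

df = pd.DataFrame({"values": [1, 56, 3, 4, 5, 98, 100]})

# Without brackets we get the method object itself, not a result.
method = df["values"].sum
print(callable(method))

# With brackets the method actually runs.
total = df["values"].sum()
print(total)
```

If a result prints as `<bound method ...>`, the brackets are missing.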

Conclusion

When dealing with a huge volume of data with a lot of disparities, it is always good to understand the underlying code logic completely. If pandas gives a warning, it’s worth checking why it appears and how to resolve it rather than ignoring it.

At a high level, we can deal with any of these mishaps by gradually getting familiar with the framework 🤓

