
Learning Pandas by Examples

A compendium of useful, interesting, and inspirational uses of the Python Pandas library

Photo by Tolga Ulkan on Unsplash

Let’s talk about the Pandas package.

When you browse Stack Overflow or read blogs on Towards Data Science, have you ever come across a super elegant solution (maybe just one line) that could replace dozens of lines of your own code (for loops, helper functions)?

I run into that situation a lot, and my reaction is often, "Wow, I didn’t know this function could be used this way, TRULY amazing!" Different people get excited about different things, but I bet you have had these moments too if you have ever worked in applied data science.

However, one thing that puzzles me is that there is no single place or repository that records these inspirational moments together with their real-world examples. That is why I decided to start a GitHub repository dedicated to collecting interesting and impressive uses of the Pandas library, the kind that make you want to shout out loud!

Here’s the link to the repository:

https://github.com/frankligy/pandas_by_examples

Now I will show you two concrete examples from my own work and explain why I think a repository like this would be helpful.


60% of my Pandas coding errors come from overlooking "dtype"

dtype is an attribute of every Pandas data frame column, Series, and Index object. It is usually determined automatically, so I tend to forget that this hidden property exists, which leads to a lot of unexpected errors.

For instance, let’s create a data frame:

import pandas as pd

# The third value in every column is a string
df = pd.DataFrame({'col1': [1, 2, 'first'],
                   'col2': [3, 4, 'second'],
                   'col3': [5, 6, 'third']})

Then I delete the third row because its values are all strings, and I only need numeric values to plot a heatmap.

df = df.iloc[:-1,:]

Now I should be able to draw the heatmap using the df data frame:

import seaborn as sns
sns.heatmap(df)

But we get an error:

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

You may wonder why. Let’s have a look at the dtype of a column in df:

df['col1'].dtype
# dtype('O')

It is "object" instead of "int", even though all the remaining values in this data frame are integers. The reason is that the dtype was inferred from the original data frame, where the string in the third row forced the dtype of each column to become "object". Removing the last row does not make Pandas re-infer the dtype, so it stays "object".
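To check this behavior yourself, here is a minimal sketch. The variable names `trimmed` and `inferred` are mine, and `infer_objects()` is shown only as one possible way to ask Pandas to re-run type inference; it is not the fix used in this article, which is `astype`.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 'first'],
                   'col2': [3, 4, 'second'],
                   'col3': [5, 6, 'third']})

# Drop the last row; the remaining values are all Python ints
trimmed = df.iloc[:-1, :]
print(trimmed.dtypes)   # every column is still reported as object

# Ask Pandas to re-infer a better dtype for the object columns
inferred = trimmed.infer_objects()
print(inferred.dtypes)  # the columns are now an integer dtype
```

The point is that slicing never triggers type inference on its own; you have to request the conversion explicitly.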

We change the dtype to int and draw the heatmap again:

df = df.astype('int')
sns.heatmap(df)
The input data frame columns should be numeric

This is an easy trap to fall into, and I think it is worth highlighting somewhere, so I created an example that shows the importance of checking and specifying the dtype when using Pandas. It is a very basic example, but it captures the critical idea: pay attention to the dtype of your columns.

https://github.com/frankligy/pandas_by_examples/blob/main/examples/3_Learning_dtype.ipynb

Let’s see another example:

How to convert two columns to a dictionary?

This is a real-world problem I recently encountered, simplified a bit here. Imagine we have a data frame like this:

data frame we have

I want to get a Python dictionary like this:

Python dictionary I want to get

To put this in context, the real data frame is a giant one with hundreds of thousands of rows, so we definitely want an automated solution.

You can achieve it using only one line of code:

df.groupby(by='nationality')['names'].apply(
    lambda x: x.tolist()).to_dict()
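Here is a small, self-contained sketch of that one-liner, split into its intermediate steps. Only the column names `names` and `nationality` come from the code above; the four rows of data and the variable names are hypothetical and made up for illustration.

```python
import pandas as pd

# Hypothetical data; only the column names come from the article
df = pd.DataFrame({
    'names': ['John', 'Li', 'Mary', 'Wang'],
    'nationality': ['USA', 'China', 'USA', 'China'],
})

# Step 1: group the rows by nationality and select the names column
grouped = df.groupby(by='nationality')['names']

# Step 2: turn each group of names into a plain Python list,
# giving a Series indexed by nationality
lists_per_group = grouped.apply(lambda x: x.tolist())

# Step 3: convert that Series into a dictionary
result = lists_per_group.to_dict()
print(result)
# {'China': ['Li', 'Wang'], 'USA': ['John', 'Mary']}
```

Each nationality becomes a key, and the value is the list of names that share it, in their original row order.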

How does it work? I have step-by-step instructions in my GitHub example:

https://github.com/frankligy/pandas_by_examples/blob/main/examples/5_columns2dict.ipynb

I also paste it here:

convert two columns to the dictionary

Conclusion

This article simply aims to give you a sense of why I want to create a repository like this to store impressive use cases of the Python Pandas library. I will keep updating it and adding examples that I encounter in my daily work. If you like this initiative, I would really appreciate your contributions as well: just file a pull request and I will merge your examples into the repository. I hope this repository can become a place that both programming beginners and intermediate data scientists enjoy checking regularly and find convenient.

Repository link:

https://github.com/frankligy/pandas_by_examples

Thanks for reading! If you like this article, follow me on Medium. Thank you so much for your support. Connect with me on Twitter or LinkedIn, and please let me know if you have any questions or what kind of Pandas tutorials you would like to see in the future!

