Let’s talk about the Pandas package.
When you browse Stack Overflow or read blogs on Towards Data Science, have you ever come across a super elegant solution (maybe just one line) that could replace dozens of lines of your own code (for loops, helper functions)?
I have run into that situation a lot, and I was often like, "Wow, I didn’t know this function could be used this way. TRULY amazing!" Different people get excited about different things, of course, but I bet you have had these moments too if you have ever worked in applied data science.
However, one thing that puzzles me is that there is no single place or repository to store and record these inspirational moments and the real-world examples that go with them. That’s why I want to take the initiative and build a GitHub repository focused on collecting these interesting and impressive usages of the Pandas library, the kind that make you want to shout out!
Here’s the link to the repository:
https://github.com/frankligy/pandas_by_examples
Now I will show you two concrete examples from my own work and explain why I think a repository like this would be helpful.
60% of my Pandas coding errors come from overlooking "dtype"
dtype is a special attribute of every Pandas data frame column, Series, and Index object. It is usually determined automatically, so I tend to forget that this hidden property exists, which results in a lot of unexpected errors.
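As a quick illustration of where dtype lives, here is a minimal sketch. The data frame and its column names are made up for this example and are not taken from the repository.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [0.5, 1.5, 2.5],
                   'c': ['x', 'y', 'z']})

col_dtypes = df.dtypes        # one dtype per column, returned as a Series
a_dtype = df['a'].dtype       # dtype of a single column (a Series)
index_dtype = df.index.dtype  # the Index object carries a dtype as well

print(col_dtypes)
print(a_dtype, index_dtype)
```

Printing df.dtypes right after loading or building a data frame is a cheap habit that tends to surface this kind of surprise early.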
For instance, let’s create a data frame:
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 'first'],
                   'col2': [3, 4, 'second'],
                   'col3': [5, 6, 'third']})
Then I delete the third row because its values are all strings, and I only need numeric values to plot a heatmap.
df = df.iloc[:-1,:]
Now I can draw the heatmap using the df data frame:
import seaborn as sns
sns.heatmap(df)
And we get an error:
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
You may wonder why. Let’s have a look at the dtype of df:
df['col1'].dtype
# dtype('O')
It is an "object" type rather than "int", even though all the remaining values in this data frame are integers. The reason is that the dtype was inferred when the original data frame was created (the string in the third row forced the dtype of each column to become "object"). Removing the last row does not trigger a new inference, so the dtype does not change automatically.
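To make this concrete, the sketch below rebuilds the same data frame and checks both the Python type of each remaining value and the column dtypes after slicing. The variable names trimmed, value_types, and dtypes_after are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 'first'],
                   'col2': [3, 4, 'second'],
                   'col3': [5, 6, 'third']})
trimmed = df.iloc[:-1, :]  # drop the row that holds the strings

# Each remaining value is a plain Python int...
value_types = trimmed['col1'].map(type).tolist()

# ...but every column still carries the dtype inferred at construction.
dtypes_after = trimmed.dtypes

print(value_types)
print(dtypes_after)
```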
We change the dtype to int and draw the heatmap again:
df = df.astype('int')
sns.heatmap(df)
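Besides astype, pandas offers other ways to repair an "object" column. The sketch below shows two that I believe fit this case, infer_objects and pd.to_numeric. It is an alternative under the same toy data, not the approach used in the repository example.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 'first'],
                   'col2': [3, 4, 'second'],
                   'col3': [5, 6, 'third']}).iloc[:-1, :]

# Option 1: ask pandas to re-infer a better dtype for object columns.
inferred = df.infer_objects()

# Option 2: convert column by column; errors='coerce' turns anything
# non-numeric into NaN instead of raising an exception.
coerced = df.apply(pd.to_numeric, errors='coerce')

print(inferred.dtypes)
print(coerced.dtypes)
```

astype('int') fails loudly if a stray string is still present, while errors='coerce' silently produces NaN, so which one is preferable depends on whether you want to be told about bad values.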

I think this is an easy trap to fall into and one worth highlighting somewhere, so I created an example to show the importance of specifying the dtype when using pandas. It is a very basic example, but it captures the critical idea of paying attention to the dtype of your variables.
https://github.com/frankligy/pandas_by_examples/blob/main/examples/3_Learning_dtype.ipynb
Let’s see another example:
How to convert two columns to a dictionary?
This is a real-world example I recently encountered, simplified a bit here. Imagine we have a data frame like this:

I want to get a Python dictionary like this:

In the real setting, this is a giant data frame with hundreds of thousands of rows, so we definitely want an automated way to do this.
You can achieve it using only one line of code:
df.groupby(by='nationality')['names'].apply(
    lambda x: x.tolist()).to_dict()
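Here is a small end-to-end sketch of that one-liner. Only the column names nationality and names come from the code above. The rows are hypothetical and stand in for the real data.

```python
import pandas as pd

# Hypothetical rows; only the column names match the one-liner above.
df = pd.DataFrame({
    'nationality': ['USA', 'USA', 'China', 'China', 'UK'],
    'names': ['Alice', 'Bob', 'Chen', 'Dan', 'Eve'],
})

# Group the names by nationality, collect each group into a list,
# then turn the resulting Series into a plain Python dictionary.
result = df.groupby(by='nationality')['names'].apply(
    lambda x: x.tolist()).to_dict()

print(result)
```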
How does it work? I have step-by-step instructions in my GitHub example:
https://github.com/frankligy/pandas_by_examples/blob/main/examples/5_columns2dict.ipynb
And I paste them here:
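Roughly, the one-liner can be read as four steps. The breakdown below is my own sketch on hypothetical rows, with intermediate variable names chosen only for illustration. The notebook linked above remains the reference walkthrough.

```python
import pandas as pd

# Hypothetical rows, for illustration only
df = pd.DataFrame({
    'nationality': ['USA', 'USA', 'China', 'China', 'UK'],
    'names': ['Alice', 'Bob', 'Chen', 'Dan', 'Eve'],
})

# Step 1: split the rows into groups, one per nationality.
grouped = df.groupby(by='nationality')

# Step 2: keep only the 'names' column within each group.
names_by_group = grouped['names']

# Step 3: turn each group's names into a list. The result is a Series
# indexed by nationality whose values are Python lists.
lists = names_by_group.apply(lambda x: x.tolist())

# Step 4: convert that Series to a dictionary (index -> value).
as_dict = lists.to_dict()

print(lists)
print(as_dict)
```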

Conclusion
This article is really just meant to give you a sense of why I want to create a repository like this to store impressive use cases of the Python Pandas library. I will keep updating it and adding examples that I encounter in my daily work. If you agree with my initiative, I would really appreciate your contributions as well: simply file a pull request and I will merge your examples into the repository. I hope this repository can become a place that both programming beginners and intermediate data scientists enjoy checking every day, and that it makes their work more convenient.
Repository link:
https://github.com/frankligy/pandas_by_examples
Thanks for reading! If you like this article, follow me on Medium. Thank you so much for your support. Connect with me on Twitter or LinkedIn, and please let me know if you have any questions or what kind of pandas tutorials you would like to see in the future!