
A hidden gem: df.select_dtypes()

A micro-post about a Pandas function from 2014 that had been flying under the radar for me.

How many times have you gone to feed a pandas DataFrame into a utility function from another library, only for it to fail because of object columns? Maybe it was a plot from seaborn (we’re looking at you, sns.clustermap()).

>>> sns.clustermap(
        df, 
        method='ward', 
        cmap="YlGnBu", 
        standard_scale=1
    )
TypeError: unsupported operand type(s) for -: 'str' and 'str'

So I check the data type of each column, build a list of the numeric ones (here, just the float64 columns), and pass the filtered DataFrame into the function that expects only numeric columns.

>>> numeric_cols = [
        col 
        for col, dtype 
        in df.dtypes.items() 
        if dtype=='float64'
    ]
>>> print(numeric_cols)
['col1', 'col2']
>>> sns.clustermap(
        df[numeric_cols], 
        method='ward', 
        cmap="YlGnBu", 
        standard_scale=1
    )
Data from E-GEOD-20108 - Responses to ethanol in haploid and diploid strain | Image by Author

As I was investigating some gene expression data from yeast yesterday, it got me thinking: there has to be a better way. Filtering a DataFrame’s columns by data type should be common enough to be a one-liner! Well, it turns out it is. Even though I thought I had crawled through the documentation enough times to be across the pandas API, there’s always a new hidden gem!

Pandas DataFrames have a built-in method called select_dtypes() which does exactly what I wanted. Rewriting the above code to use this new (to me) function looks like the following (notice that select_dtypes() takes a list, so you can filter for multiple data types at once):

>>> sns.clustermap(
        df.select_dtypes(['number']), 
        method='ward', 
        cmap="YlGnBu", 
        standard_scale=1
    )
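
To see what select_dtypes() actually keeps and drops, here is a minimal sketch on a toy DataFrame. The frame and its column names are made up for illustration and are not the yeast data from this post; only the include and exclude parameters of select_dtypes() are taken from the pandas API.

```python
import pandas as pd

# Hypothetical toy frame; the column names are invented for this example.
df = pd.DataFrame(
    {
        "gene": ["YAL001C", "YAL002W", "YAL003W"],  # strings
        "count": [12, 7, 30],                       # integers
        "expr_a": [0.5, 1.2, 3.4],                  # float64
        "expr_b": [1.1, 0.9, 2.2],                  # float64
        "is_control": [True, False, False],         # bool
    }
)

# The manual filter from earlier only matches float64, so it skips "count".
float_only = [col for col, dtype in df.dtypes.items() if dtype == "float64"]

# include="number" keeps every numeric dtype, integers and floats alike.
numeric_cols = list(df.select_dtypes(include="number").columns)

# exclude= works the other way round: keep everything that is not numeric.
non_numeric_cols = list(df.select_dtypes(exclude="number").columns)

# include accepts a list, so several kinds of dtype can be selected at once.
number_or_bool = list(df.select_dtypes(include=["number", "bool"]).columns)

print(float_only)
print(numeric_cols)
print(non_numeric_cols)
print(number_or_bool)
```

On this toy frame the float64-only list misses the integer column, while include="number" picks it up. Boolean columns are not treated as numbers, so they need to be asked for explicitly if you want them.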

I’m sure most of you are already well across this, since it has been around for a while (I did a git blame on the pandas repo to see when it was added: 2014 🤦). Anyway, I hope someone finds this as helpful as I have. Happy coding!

pandas.DataFrame.select_dtypes – pandas 1.3.4 documentation

ArrayExpress

seaborn.clustermap – seaborn 0.11.2 documentation

