Style Pandas Dataframe Like a Master

Eyal Trabelsi
Towards Data Science
5 min readSep 6, 2019

--

Photo by billow926 on Unsplash

What is styling and why care?

The basic idea behind styling is to leverage visual aids like color and format, in order to communicate insight more efficiently.

One of the most common ways of visualizing a dataset is using a table. Tables allow your data consumers to gather insight by reading the underlying data. For example, you may find yourself in scenarios where you want to provide your consumers access to the underlying data using a table.

In this article, you’ll learn how to add visualization to a pandas dataframe by using pandas styling and options/settings. The Pandas documentation itself is pretty comprehensive, but if you’re looking for a slightly friendlier introduction, I think you came to the right place. I will use kaggle’ “San Fransisco Salaries dataset” as an example, as always we start by loading the dataset using pandas.

Pandas code to load the dataset and some basic data munging:

df = pd.read_csv('Salaries.csv')\
.replace('Not Provided', np.nan)\
.astype({"BasePay":float, "OtherPay":float})
This is the rendered dataframe of “San Fransisco Salaries

Pandas Options/Settings API

Pandas have an options system that lets you customize some aspects of its behavior, here we will focus on display-related options. You may have experienced the following issues when using when you rendered the data frame:

  • There’re too many columns/rows in the dataframe and some columns/rows in the middle are omitted on display.
    For example, if one would like to show at most 7 rows and at most 7 columns one would:
  • Columns containing long texts get truncated and columns containing floats display too many / too few digits only on display.

Pandas Styling API

As we mentioned pandas also have a styling system that lets you customize some aspects of its the rendered dataframe, using CSS. You write a “style functions” that take scalars, DataFrame or Series, and return like-indexed DataFrames or Series with CSS "attribute: value" pairs for the values.

The most straightforward styling example is using currency symbols when working with currency values. For instance, in our data some of the columns (BasePay, OtherPay, TotalPay, and TotalPayBenefit) are currency values, so we would like to add dollar signs and commas. This can be done using the style.formatfunction:

Pandas code to render dataframe with formating of currency columns

df.head(10).style.format({"BasePay": "${:20,.0f}", 
"OtherPay": "${:20,.0f}",
"TotalPay": "${:20,.0f}",
"TotalPayBenefits":"${:20,.0f}"})

These styling functions can be incrementally passed to the Styler which collects the styles before rendering, thus if we want to add a function that format the EmployeeName and companyTitle as well, this can be done using another style.formatfunction:

Pandas code to render dataframe that also formats some columns to lower case

df.head(10).style.format({"BasePay": "${:20,.0f}", 
"OtherPay": "${:20,.0f}",
"TotalPay": "${:20,.0f}",
"TotalPayBenefits":"${:20,.0f}"})\
.format({"JobTitle": lambda x:x.lower(),
"EmployeeName": lambda x:x.lower()})

Seems a lot better now, but let’s take it a step forward the Index here doesn’t add any real information, we can use the hide_index function to suppresses the display of the index using the following code snippet:

Pandas code to render the formatted dataframe without the index

df.head(10).style.format({"BasePay": "${:20,.0f}", 
"OtherPay": "${:20,.0f}",
"TotalPay": "${:20,.0f}",
"TotalPayBenefits":"${:20,.0f}"})\
.format({"JobTitle": lambda x:x.lower(),
"EmployeeName": lambda x:x.lower()})\
.hide_index()

Having this type of flexibility when it comes to rendering our dataset is pretty powerful and useful, but that simply put NOT ENOUGH. You can apply conditional formatting, the visual styling of a DataFrame depending on the actual data within. The simplest example is the builtin functions in the style API, for example, one can highlight the highest number in green and the lowest number in color:

Pandas code that also highlights minimum/maximum values

df.head(10).style.format({"BasePay": "${:20,.0f}", 
"OtherPay": "${:20,.0f}",
"TotalPay": "${:20,.0f}",
"TotalPayBenefits":"${:20,.0f}"})\
.format({"JobTitle": lambda x:x.lower(),
"EmployeeName": lambda x:x.lower()})\
.hide_index()\
.highlight_max(color='lightgreen')\
.highlight_min(color='#cd4f39')

Another useful function is background_gradientwhich can highlight the range of values in a column. In addition, the cmap argument allows us to choose a color palette for the gradient. The matplotlib documentation lists all the available options (seaborn has some options as well).

Pandas code that also adds a background gradient

df.head(10).style.format({"BasePay": "${:20,.0f}", 
"OtherPay": "${:20,.0f}",
"TotalPay": "${:20,.0f}",
"TotalPayBenefits":"${:20,.0f}"})\
.format({"JobTitle": lambda x:x.lower(),
"EmployeeName": lambda x:x.lower()})\
.hide_index()\
.background_gradient(cmap='Blues')

One can even use styler.set_properties when the style doesn’t actually depend on the values. In this example, we will render our dataset with a black background and with green color for the text itself.

Pandas code to render the formatted dataframe in the same way for each cell.

df.head(10).style.set_properties(**{'background-color': 'black',                                                   
'color': 'lawngreen',
'border-color': 'white'})

But if we are honest, most of the time we would like to change the visualization attributes depending on the values and what we want to emphasis, we can use one of the following to help reach our goal:

  • Styler.applymap(func) for element-wise styles.
  • Styler.apply(func, axis=0) for column-wise styles.
  • Styler.apply(func, axis=1) for row-wise styles.
  • Styler.apply(func, axis=None) for tablewise styles.

The first example is Highlighting all negative values in a dataframe.

Pandas code to render the formatted dataframe with changed font color if the value is a string

df.head(10).style.format({"BasePay": "${:20,.0f}", 
"OtherPay": "${:20,.0f}",
"TotalPay": "${:20,.0f}",
"TotalPayBenefits":"${:20,.0f}"})\
.format({"JobTitle": lambda x:x.lower(),
"EmployeeName": lambda x:x.lower()})\
.hide_index()\
.applymap(lambda x: f”color: {‘red’ if isinstance(x,str) else ‘black’}”)

At last the pandas styling API also supports more advanced styling like drawing bar charts within the columns, we will introduce here the bar function and some of the parameters to configure the way it is displayed in the table:

df.head(10).style.format({"BasePay": "${:20,.0f}", 
"OtherPay": "${:20,.0f}",
"TotalPay": "${:20,.0f}",
"TotalPayBenefits":"${:20,.0f}"})\
.format({"JobTitle": lambda x:x.lower(),
"EmployeeName": lambda x:x.lower()})\
.hide_index()\
.bar(subset=["OtherPay",], color='lightgreen')\
.bar(subset=["BasePay"], color='#ee1f5f')\
.bar(subset=["TotalPay"], color='#FFA07A')

Conclusion

The pandas style API and the options API are really useful when you get towards the end of your data analysis and need to present the results to others. There are a few tricky components to string formatting so hopefully, the items highlighted here are useful to you.

--

--

Data architect at bigabid with a passion for performance, scale, python, machine learning and making software easier to use.