The world’s leading publication for data science, AI, and ML professionals.

Make Your Pandas DataFrame Output Report-Ready

Instruction of Pandas style and Sparklines Library

Photo by 12019 on Pixabay
Photo by 12019 on Pixabay

As a Data Scientist or Analyst who use Python as the primary programming language, I believe you must have use Pandas a lot. It is very frequently for us to output out pandas data frame in the Jupyter notebooks.

However, have you ever think that we can actually let the data frame visualise itself? In other words, for some simple visualisation purposes, we don’t need Matplotlib or other visualisation libraries. The Pandas data frame output can be visualised like an Excel spreadsheet with complicated styles and with very easy code definition.

In this article, I’ll introduce the style package in the Pandas library which is known by relatively fewer people than its data processing methods. Also, there are some interesting libraries that support more IN-LINE visualisations of the Pandas data frame. In the last section, I’ll also introduce one of them – Sparklines.

About Pandas Style Package

Photo by Free-Photos on Pixabay
Photo by Free-Photos on Pixabay

It is known that Pandas data frames can be output in iPython/Jupyter notebook that is automatically rendered in HTML with CSS styles. This is definitely an amazing feature because the presentation is very nice even if we just simply print it.

But wait, it makes use "HTML + CSS". Yes, Pandas also allows us to customise the CSS styles to make it even looks nicer. This is achieved by the "style" API.

Style – pandas 1.1.2 documentation

We can simply call df.style to get the Styler object of a data frame, and then add the styles we want. Now, let’s see what we can do.

Formatting the Outputs

Photo by geralt on Pixabay
Photo by geralt on Pixabay

Of course, we can always format the data itself such as df.round(2) to round all the numerical values with 2 decimals. However, there are some benefits to do that using Pandas styles. For example, we don’t actually change the value, but only the presentation, so that we didn’t lose the precision.

Let’s create a random data frame first.

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10, 2)*100)

Then, lets output the data frame with specific formats.

df.style.format("{:.2f}")

BTW, if you’re not that familiar with Python Format Specification Mini-Language, which is the expression I used {:.2f}, you can check out the official documentation here (highly recommend).

Format Specification Mini-Language – Python documentation

Highlight Cells with Background and Text Colours

Photo by YeriLee on Pixabay
Photo by YeriLee on Pixabay

I know, formatting is not cool enough. With CSS we can easily do a lot of things, such as changing background colours and Text Colours.

For example, for the same data frame above-shown, we want to highlight the positive numbers and native numbers separately. If you have some basic knowledge about CSS, or just simply Google it, you will know the below properties to set the background colour and text colour of HTML table cells.

background-color: red; color: white

Let’s write a function to colour the table cells.

def highlight_number(row):
    return [
        'background-color: red; color: white' if cell <= 0 
        else 'background-color: green; color: white'
        for cell in row
    ]

If the value of the cell is negative, we use red colour as the background, otherwise use green colour if it is positive. Since the colour might be kind of dark, we also want to change the text colour to white.

Then, we can apply the function to the data frame.

df.style.apply(highlight_number)

Super cool! Now it is very obvious that we got negative and positive numbers, and they are distinguished very well.

Chaining the Style Functions

Photo by TanteTati on Pixabay
Photo by TanteTati on Pixabay

So far, we’re only adding one type of style each time. In fact, once we called df.style, it returns the Styler object of the data frame. The Styler object supports chaining the style functions. Let’s have a look at another example which is more complex.

Let’s say, we want to add the following styles to the original data frame.

  1. Highlight the negative number in red and positive number in green.
  2. Format the number in currency.
  3. Add a caption to the table using set_caption().
  4. Make the cell padding larger so that it looks not that squeezed.
  5. Add white borders between the cells to improve the presentation.

Yes, we can do all of these in one go, by using the chain expression.

df.style 
  .apply(highlight_number) 
  .format('${0:,.2f}') 
  .set_caption('A Sample Table') 
  .set_properties(padding="20px", border='2px solid white')

Feels like the table can be directly used in some business report 🙂

Built-in Highlighting Functions

Photo by PublicDomainPictures on Pixbay
Photo by PublicDomainPictures on Pixbay

Don’t know CSS, but still want to show off? Yes, Pandas style also provides some built-in functions, which are very cool but very easy to use.

Highlighting Functions

Let’s use the same data frame for the demonstration.

# Make a nan value for demo purposes
df.at[1, 1] = None
# Add styles
df.style 
  .highlight_null('lightgray') 
  .highlight_max(color='lightgreen') 
  .highlight_min(color='pink')

See, it is very easy to highlight the NULL values, Min and Max values using whatever colours you prefer. I would recommend using light colours because the text colour always gonna be black.

Gradient Coloured Background

Pandas style also support using cmap to colour the cell background in gradient colours. This is very useful when we want to visualise the numeric data in scales.

df = pd.DataFrame(np.random.randn(10, 2))
df.style 
  .background_gradient(cmap='Blues')

So, the background colour is gradient depends on the values. Please be noted that "Blues" is one of the cmap that that Pandas supports. If you wonder what are others that are supported, the following documentation page from Matplotlib is a good reference.

Choosing Colormaps in Matplotlib – Matplotlib 3.1.0 documentation

In-line Bar Chart

This is another super cool feature that is built-in. It can generate bars as in the background of each cell to indicate their values. Let’s use the above data frame.

df.style.bar()

Of course, we’re not satisfied by the default style of the bar chart. Let’s improve the presentation.

df.style 
  .format('{:.2f}') 
  .bar(align='mid', color=['#FCC0CB', '#90EE90']) 
  .set_caption('A Sample Table with Bar Chart') 
  .set_properties(padding="15px", border='2px solid white', width='300px')

Can you believe this is still the "Pandas dataframe" that you use to be very familiar with? 🙂

Let me simply explain the parameters in the bar() method. The color parameter supports either a single string or a tuple, when it is a tuple, the first colour will be used for colouring the negative values and the second colour is for the positive. Since we’re using two colours, we need to set the bar to be aligned in the middle of the cell.

Sparklines – An in-line histogram

Photo by Myriams-Fotos on Pixabay
Photo by Myriams-Fotos on Pixabay

I could stop here, but I want to show off another library I found which is very interesting called Sparklines. I wouldn’t say it’s very nice looking, but the idea is really cool.

You can install the library using pip.

pip install sparklines

Then, let’s import the library and create another sample data frame for demonstration purposes.

from sparklines import sparklines
df = pd.DataFrame({
    'a': np.random.normal(0, 1, 100),
    'b': np.random.normal(1, 1, 100),
    'c': np.random.normal(0, 3, 100)
})

The feature of the library Sparklines is simple. it can generate bar charts using Unicode strings such as ▁, ▂, ▃, until █. To make sure the bars in correct orders and make sense as a histogram, we need to prepare the values using NumPy to generate the histogram values first.

def sparkline_dist(data):
    hist = np.histogram(data, bins=10)[0]
    dist_strings = ''.join(sparklines(hist))
    return dist_strings

For each column, we can generate the histogram using sparkline.

[sparkline_dist(df[col]) for col in df.columns]

Finally, we can put the strings along with other statistics to make a better report.

df_stats = df.agg(['mean', 'std']).transpose()
df_stats['histogram'] = sl_list
df_stats.style 
  .format('{:.2f}', subset=['mean', 'std']) 
  .set_caption('A Sample Table with Sparklines Distributions') 
  .set_properties(padding="15px", border='2px solid white')

Done!

Summary

Photo by pasja1000 on Pixabay
Photo by pasja1000 on Pixabay

In this article, I have demonstrated to you all the major methods in the Pandas style package. We can format the values, colour the backgrounds, improve the presentation with customised CSS properties. There are also many built-in functions that can be used out-of-box to create a "report-ready" data frame output.

Join Medium with my referral link – Christopher Tao

If you feel my articles are helpful, please consider joining Medium Membership to support me and thousands of other writers! (Click the link above)


Related Articles