The world’s leading publication for data science, AI, and ML professionals.

Going Beyond value_counts(): Creating Visually Engaging Frequency Tables With Only 3 Lines Of Code…

Some simple yet helpful lifehacks you probably didn't know

Image by Author inspired by Loveascii
Image by Author inspired by Loveascii

While the easiest way to create a frequency table for a Series object in Python Pandas library is by applying the value_counts() method, the result of this operation looks rather basic. We can make it more informative by adjusting the method’s boolean parameters normalize, sort, ascending, and dropna, or grouping the values (if they are numeric) into bins. However, the options here are quite limited, so to visually enhance the resulting frequency table, we can consider some simple yet helpful lifehacks, such as method chaining, text customization, adding the % symbol to each frequency value, and using the power of pretty-printing.

In this article, we’ll experiment with a Kaggle dataset containing information about age, sex, location, education, etc., for 60K users of OkCupid dating app 💞 . For our purposes, however, we’ll use only the data on the user statuses.

1. Method chaining

To start with, let’s create a basic frequency table for the user statuses:

import pandas as pd
profiles = pd.read_csv("okcupid_profiles.csv", na_values="unknown")
s = profiles["status"]
s.value_counts()
Output:
single            55697
seeing someone     2064
available          1865
married             310
Name: status, dtype: int64

Even though this table clearly shows the overall trend, it would be more convenient to see this information in relative frequencies. Let’s assign True to the normalize parameter and the resulting frequency table, which is a Series by itself, to a variable called s_norm.

(Side note: hereinafter, we’re going to leave all the other parameters of the value_counts() method by default, meaning that we’ll consider only frequency tables sorted in descending order and excluding missing values. For our purposes, using or not those parameters doesn’t matter.)

s_norm = s.value_counts(normalize=True)
s_norm
Output:
single            0.929275
seeing someone    0.034437
available         0.031117
married           0.005172
Name: status, dtype: float64

All the frequencies are converted into fractions now, but we’d prefer to see them in percentages, so let’s multiply the Series by 100:

s_pct = s_norm.mul(100)
s_pct
Output:
single            92.927456
seeing someone     3.443673
available          3.111652
married            0.517218
Name: status, dtype: float64

We don’t really need such precision of the percentages. Moreover, let’s imagine that we don’t need decimal points at all:

s_pct_rounded = s_pct.round()
s_pct_rounded
Output:
single            93.0
seeing someone     3.0
available          3.0
married            1.0
Name: status, dtype: float64

Unfortunately, rounding up to the whole part left us with so-called nullable integers (i.e. float numbers with a decimal part equal to 0). To fix it, we can use either convert_dtypes() or astype(int):

s_pct_int = s_pct_rounded.convert_dtypes()
s_pct_int
Output:
single            93
seeing someone     3
available          3
married            1
Name: status, dtype: Int64

Let’s now expand the whole expression for s_pct_int displaying all the methods we chained:

s_pct_int = profiles['status'].value_counts(normalize=True).mul(100).round().convert_dtypes()

2. Adding a table title

Without knowing the context, from the table above, it can be not clear that the numbers represent percentages and not absolute frequencies. Let’s add a title to the table using the f-string formatting:

print(f"OkCupid user statuses, %n{s_pct_int}")
Output:
OkCupid user statuses, %
single            93
seeing someone     3
available          3
married            1
Name: status, dtype: Int64

A good idea, especially if we print out several frequency tables at once, would be to visually highlight the title from the rest of the table, e.g. making it bold. To do so, we can use ANSI escape code sequences. In particular, to print a string in bold in Python, we should add the sequence 33[1m before the string and 33[0m – after it:

print(f"33[1mOkCupid user statuses, %33[0mn{s_pct_int}")
Output:
OkCupid user statuses, % 
single            93 
seeing someone     3 
available          3 
married            1 
Name: status, dtype: Int64

3. Adding the % symbol to each frequency value

In the example above, we added the % symbol to the table title. What if we want to add it to each frequency value instead? A workaround here is to make a list of frequency Values with the % symbol added to each of them and then create a Series from this list. For making the list, we can use the following for-loop:

lst_with_pct_symb = []
for value in s_pct_int.values:
    lst_with_pct_symb.append(f"{value}%")
lst_with_pct_symb
Output:
['93%', '3%', '3%', '1%']

or, more concisely, a list comprehension:

lst_with_pct_symb = [f"{value}%" for value in s_pct_int.values]

Now, let’s create and print out an updated frequency table. This time, we’ll remove the % symbol from the title:

s_with_pct_symb = pd.Series(lst_with_pct_symb, index=s_pct_int.index)
print(f"33[1mOkCupid user statuses33[0mn{s_with_pct_symb}")
Output:
OkCupid user statuses
single            93%
seeing someone     3%
available          3%
married            1%
dtype: object

4. Pretty-printing table

Finally, we can pretty-print the frequency table. For this purpose, we’re going to use the to_markdown() pandas method that requires installation (not necessarily importing) of the tabulate module (pip install tabulate).

Important: to display the results correctly, the to_markdown() method has to be used only inside the print() command.

Let’s show again the user status frequency table called s_with_pct_symb, this time as a real table, using the basic syntax of to_markdown(). As earlier, we’ll add a title in bold and apply f-string formatting. For consistency, hereinafter, we’ll assign the frequency table to be displayed to a variable called S":

S = s_with_pct_symb
print(f"33[1mOkCupid user statuses33[0mn{S.to_markdown()}")
Output:
OkCupid user statuses
|                | 0   |
|:---------------|:----|
| single         | 93% |
| seeing someone | 3%  |
| available      | 3%  |
| married        | 1%  |

The first thing we may want to do is to remove an automatically created table header (redundant in our case anyway). To do so, we have to assign an empty list or string to an optional parameter called headers:

print(f"33[1mOkCupid user statuses33[0mn{S.to_markdown(headers=[])}")
Output:
OkCupid user statuses
|:---------------|:----|
| single         | 93% |
| seeing someone | 3%  |
| available      | 3%  |
| married        | 1%  |

In the tabulate documentation, we can find some other parameters to tune. However, they are mostly related to DataFrame objects. In our case, given that we have a Series and also remembering that the frequency values actually became strings after adding the % symbol, we have a much smaller choice.

Let’s play with the parameters tablefmt and stralign. The first of them defines the table format and can have one of the following values: plain, simple, github, grid, fancy_grid, pipe, orgtbl, jira, presto, pretty, psql, rst, etc. For example, the table format that we saw earlier is called pipe, the default one for the to_markdown() pandas method. Curious that for the tabulate package itself, the default table format is simple. As for the second parameter, stralign, it serves for overriding the default string data alignment (which is left). Possible options here are right and center.

print(f"33[1mOkCupid user statuses33[0mn"
      f"{S.to_markdown(headers=[], tablefmt='fancy_grid', stralign='right')}")
Output:
OkCupid user statuses
╒════════════════╤═════╕
│         single │ 93% │
├────────────────┼─────┤
│ seeing someone │  3% │
├────────────────┼─────┤
│      available │  3% │
├────────────────┼─────┤
│        married │  1% │
╘════════════════╧═════╛

Let’s now turn our attention to the frequency table called s_pct_int that we created earlier. To remind its syntax and how it looks like:

s_pct_int = profiles['status'].value_counts(normalize=True).mul(100).round().convert_dtypes()
s_pct_int
Output:
single            93
seeing someone     3
available          3
married            1
Name: status, dtype: Int64

To create a markdown table for s_pct_int, it makes sense to add a corresponding table header. Also, let’s try a new value for the tablefmt parameter:

S = s_pct_int
print(f"33[1mOkCupid user statuses33[0mn"
      f"{S.to_markdown(headers=['STATUS', '%'], tablefmt='github')}")
Output:
OkCupid user statuses
| STATUS         |   % |
|----------------|-----|
| single         |  93 |
| seeing someone |   3 |
| available      |   3 |
| married        |   1 |

However, here we have good news: for numeric frequency values, we can use the floatfmt parameter that allows custom formatting for float numbers. It means that the last 2 methods in the method chaining above (round() and convert_dtypes()) are redundant and can be removed. It leaves us with the earlier-created frequency table s_pct:

s_pct = profiles['status'].value_counts(normalize=True).mul(100)
s_pct
Output:
single            92.927456
seeing someone     3.443673
available          3.111652
married            0.517218
Name: status, dtype: float64

Let’s display its markdown representation:

S = s_pct
print(f"33[1mOkCupid user statuses33[0mn"
      f"{S.to_markdown(headers=['STATUS', '%'], tablefmt='github', floatfmt='.0f')}")
Output:
OkCupid user statuses
| STATUS         |   % |
|----------------|-----|
| single         |  93 |
| seeing someone |   3 |
| available      |   3 |
| married        |   1 |

Due to the correct number formatting, we obtained the same table as the previous one.

Attention: the floatfmt parameter doesn’t work in combination with the table format pretty.

Practical takeaways

Despite the whole walkthrough above took a lot of iterations and descriptions, below we’ll find the final code solutions for 4 different versions of our frequent table for profiles['status'], all in %:

  • 2 simple tables with/without the % symbol,
  • 2 pretty-printed tables with/without the % symbol and with/without a table header.

Each solution takes at most 3 lines of simple code and produces a visually efficient representation of the frequency table in interest.

S = profiles['status'].value_counts(normalize=True).mul(100).round().convert_dtypes()
print(f"33[1mOkCupid user statuses, %33[0mn{S}")
Output:
OkCupid user statuses, %
single            93
seeing someone     3
available          3
married            1
Name: status, dtype: Int64

s_pct_int = profiles['status'].value_counts(normalize=True).mul(100).round().convert_dtypes()
S = pd.Series([f"{value}%" for value in s_pct_int.values], index=s_pct_int.index)
print(f"33[1mOkCupid user statuses33[0mn{S}")
Output:
OkCupid user statuses
single            93%
seeing someone     3%
available          3%
married            1%
dtype: object

S = profiles['status'].value_counts(normalize=True).mul(100)
print(f"33[1mOkCupid user statuses33[0mn"
      f"{S.to_markdown(headers=['STATUS', '%'], tablefmt='github', floatfmt='.0f')}")
Output:
OkCupid user statuses
| STATUS         |   % |
|----------------|-----|
| single         |  93 |
| seeing someone |   3 |
| available      |   3 |
| married        |   1 |

s_pct_int = profiles['status'].value_counts(normalize=True).mul(100).round().convert_dtypes()
S = pd.Series([f"{value}%" for value in s_pct_int.values], index=s_pct_int.index)
print(f"33[1mOkCupid user statuses33[0mn{S.to_markdown(headers=[], tablefmt='fancy_grid')}")
Output:
OkCupid user statuses
╒════════════════╤═════╕
│ single         │ 93% │
├────────────────┼─────┤
│ seeing someone │ 3%  │
├────────────────┼─────┤
│ available      │ 3%  │
├────────────────┼─────┤
│ married        │ 1%  │
╘════════════════╧═════╛

Conclusion

In this article, we discussed some simple but powerful approaches to improving frequency table layout and overall readability. They all have the value_counts() pandas method as a core part, but all go beyond it and result in more impactful representations. What’s more, each of the suggested solutions, in its final form, requires at most 3 lines of code.

I hope you enjoyed reading my article and found it helpful. Thanks for reading to everyone, and good luck to those who use OkCupid dating app 😉💘


You can find interesting also these articles:

5 Pandas Methods You’ve Never Used… And You Didn’t Lose Anything!

An Unconventional Yet Convenient Matplotlib Broken_Barh Function And When It Is Particularly…

2 Efficient Ways of Creating Fancy Pictogram Charts in Python


Related Articles