
While the easiest way to create a frequency table for a Series object in Python Pandas library is by applying the value_counts()
method, the result of this operation looks rather basic. We can make it more informative by adjusting the method’s boolean parameters normalize
, sort
, ascending
, and dropna
, or grouping the values (if they are numeric) into bins. However, the options here are quite limited, so to visually enhance the resulting frequency table, we can consider some simple yet helpful lifehacks, such as method chaining, text customization, adding the %
symbol to each frequency value, and using the power of pretty-printing.
In this article, we’ll experiment with a Kaggle dataset containing information about age, sex, location, education, etc., for 60K users of OkCupid dating app 💞 . For our purposes, however, we’ll use only the data on the user statuses.
1. Method chaining
To start with, let’s create a basic frequency table for the user statuses:
import pandas as pd
profiles = pd.read_csv("okcupid_profiles.csv", na_values="unknown")
s = profiles["status"]
s.value_counts()
Output:
single 55697
seeing someone 2064
available 1865
married 310
Name: status, dtype: int64
Even though this table clearly shows the overall trend, it would be more convenient to see this information in relative frequencies. Let’s assign True
to the normalize
parameter and the resulting frequency table, which is a Series by itself, to a variable called s_norm
.
(Side note: hereinafter, we’re going to leave all the other parameters of the value_counts()
method by default, meaning that we’ll consider only frequency tables sorted in descending order and excluding missing values. For our purposes, using or not those parameters doesn’t matter.)
s_norm = s.value_counts(normalize=True)
s_norm
Output:
single 0.929275
seeing someone 0.034437
available 0.031117
married 0.005172
Name: status, dtype: float64
All the frequencies are converted into fractions now, but we’d prefer to see them in percentages, so let’s multiply the Series by 100:
s_pct = s_norm.mul(100)
s_pct
Output:
single 92.927456
seeing someone 3.443673
available 3.111652
married 0.517218
Name: status, dtype: float64
We don’t really need such precision of the percentages. Moreover, let’s imagine that we don’t need decimal points at all:
s_pct_rounded = s_pct.round()
s_pct_rounded
Output:
single 93.0
seeing someone 3.0
available 3.0
married 1.0
Name: status, dtype: float64
Unfortunately, rounding up to the whole part left us with so-called nullable integers (i.e. float numbers with a decimal part equal to 0). To fix it, we can use either convert_dtypes()
or astype(int)
:
s_pct_int = s_pct_rounded.convert_dtypes()
s_pct_int
Output:
single 93
seeing someone 3
available 3
married 1
Name: status, dtype: Int64
Let’s now expand the whole expression for s_pct_int
displaying all the methods we chained:
s_pct_int = profiles['status'].value_counts(normalize=True).mul(100).round().convert_dtypes()
2. Adding a table title
Without knowing the context, from the table above, it can be not clear that the numbers represent percentages and not absolute frequencies. Let’s add a title to the table using the f-string formatting:
print(f"OkCupid user statuses, %n{s_pct_int}")
Output:
OkCupid user statuses, %
single 93
seeing someone 3
available 3
married 1
Name: status, dtype: Int64
A good idea, especially if we print out several frequency tables at once, would be to visually highlight the title from the rest of the table, e.g. making it bold. To do so, we can use ANSI escape code sequences. In particular, to print a string in bold in Python, we should add the sequence 33[1m
before the string and 33[0m
– after it:
print(f"33[1mOkCupid user statuses, %33[0mn{s_pct_int}")
Output:
OkCupid user statuses, %
single 93
seeing someone 3
available 3
married 1
Name: status, dtype: Int64
3. Adding the %
symbol to each frequency value
In the example above, we added the %
symbol to the table title. What if we want to add it to each frequency value instead? A workaround here is to make a list of frequency Values with the %
symbol added to each of them and then create a Series from this list. For making the list, we can use the following for-loop:
lst_with_pct_symb = []
for value in s_pct_int.values:
lst_with_pct_symb.append(f"{value}%")
lst_with_pct_symb
Output:
['93%', '3%', '3%', '1%']
or, more concisely, a list comprehension:
lst_with_pct_symb = [f"{value}%" for value in s_pct_int.values]
Now, let’s create and print out an updated frequency table. This time, we’ll remove the %
symbol from the title:
s_with_pct_symb = pd.Series(lst_with_pct_symb, index=s_pct_int.index)
print(f"33[1mOkCupid user statuses33[0mn{s_with_pct_symb}")
Output:
OkCupid user statuses
single 93%
seeing someone 3%
available 3%
married 1%
dtype: object
4. Pretty-printing table
Finally, we can pretty-print the frequency table. For this purpose, we’re going to use the to_markdown()
pandas method that requires installation (not necessarily importing) of the tabulate module (pip install tabulate
).
Important: to display the results correctly, the to_markdown()
method has to be used only inside the print()
command.
Let’s show again the user status frequency table called s_with_pct_symb
, this time as a real table, using the basic syntax of to_markdown()
. As earlier, we’ll add a title in bold and apply f-string formatting. For consistency, hereinafter, we’ll assign the frequency table to be displayed to a variable called S
":
S = s_with_pct_symb
print(f"33[1mOkCupid user statuses33[0mn{S.to_markdown()}")
Output:
OkCupid user statuses
| | 0 |
|:---------------|:----|
| single | 93% |
| seeing someone | 3% |
| available | 3% |
| married | 1% |
The first thing we may want to do is to remove an automatically created table header (redundant in our case anyway). To do so, we have to assign an empty list or string to an optional parameter called headers
:
print(f"33[1mOkCupid user statuses33[0mn{S.to_markdown(headers=[])}")
Output:
OkCupid user statuses
|:---------------|:----|
| single | 93% |
| seeing someone | 3% |
| available | 3% |
| married | 1% |
In the tabulate documentation, we can find some other parameters to tune. However, they are mostly related to DataFrame objects. In our case, given that we have a Series and also remembering that the frequency values actually became strings after adding the %
symbol, we have a much smaller choice.
Let’s play with the parameters tablefmt
and stralign
. The first of them defines the table format and can have one of the following values: plain
, simple
, github
, grid
, fancy_grid
, pipe
, orgtbl
, jira
, presto
, pretty
, psql
, rst
, etc. For example, the table format that we saw earlier is called pipe
, the default one for the to_markdown()
pandas method. Curious that for the tabulate package itself, the default table format is simple
. As for the second parameter, stralign
, it serves for overriding the default string data alignment (which is left
). Possible options here are right
and center
.
print(f"33[1mOkCupid user statuses33[0mn"
f"{S.to_markdown(headers=[], tablefmt='fancy_grid', stralign='right')}")
Output:
OkCupid user statuses
╒════════════════╤═════╕
│ single │ 93% │
├────────────────┼─────┤
│ seeing someone │ 3% │
├────────────────┼─────┤
│ available │ 3% │
├────────────────┼─────┤
│ married │ 1% │
╘════════════════╧═════╛
Let’s now turn our attention to the frequency table called s_pct_int
that we created earlier. To remind its syntax and how it looks like:
s_pct_int = profiles['status'].value_counts(normalize=True).mul(100).round().convert_dtypes()
s_pct_int
Output:
single 93
seeing someone 3
available 3
married 1
Name: status, dtype: Int64
To create a markdown table for s_pct_int
, it makes sense to add a corresponding table header. Also, let’s try a new value for the tablefmt
parameter:
S = s_pct_int
print(f"33[1mOkCupid user statuses33[0mn"
f"{S.to_markdown(headers=['STATUS', '%'], tablefmt='github')}")
Output:
OkCupid user statuses
| STATUS | % |
|----------------|-----|
| single | 93 |
| seeing someone | 3 |
| available | 3 |
| married | 1 |
However, here we have good news: for numeric frequency values, we can use the floatfmt
parameter that allows custom formatting for float numbers. It means that the last 2 methods in the method chaining above (round()
and convert_dtypes()
) are redundant and can be removed. It leaves us with the earlier-created frequency table s_pct
:
s_pct = profiles['status'].value_counts(normalize=True).mul(100)
s_pct
Output:
single 92.927456
seeing someone 3.443673
available 3.111652
married 0.517218
Name: status, dtype: float64
Let’s display its markdown representation:
S = s_pct
print(f"33[1mOkCupid user statuses33[0mn"
f"{S.to_markdown(headers=['STATUS', '%'], tablefmt='github', floatfmt='.0f')}")
Output:
OkCupid user statuses
| STATUS | % |
|----------------|-----|
| single | 93 |
| seeing someone | 3 |
| available | 3 |
| married | 1 |
Due to the correct number formatting, we obtained the same table as the previous one.
Attention: the floatfmt
parameter doesn’t work in combination with the table format pretty
.
Practical takeaways
Despite the whole walkthrough above took a lot of iterations and descriptions, below we’ll find the final code solutions for 4 different versions of our frequent table for profiles['status']
, all in %:
- 2 simple tables with/without the
%
symbol, - 2 pretty-printed tables with/without the
%
symbol and with/without a table header.
Each solution takes at most 3 lines of simple code and produces a visually efficient representation of the frequency table in interest.
S = profiles['status'].value_counts(normalize=True).mul(100).round().convert_dtypes()
print(f"33[1mOkCupid user statuses, %33[0mn{S}")
Output:
OkCupid user statuses, %
single 93
seeing someone 3
available 3
married 1
Name: status, dtype: Int64
s_pct_int = profiles['status'].value_counts(normalize=True).mul(100).round().convert_dtypes()
S = pd.Series([f"{value}%" for value in s_pct_int.values], index=s_pct_int.index)
print(f"33[1mOkCupid user statuses33[0mn{S}")
Output:
OkCupid user statuses
single 93%
seeing someone 3%
available 3%
married 1%
dtype: object
S = profiles['status'].value_counts(normalize=True).mul(100)
print(f"33[1mOkCupid user statuses33[0mn"
f"{S.to_markdown(headers=['STATUS', '%'], tablefmt='github', floatfmt='.0f')}")
Output:
OkCupid user statuses
| STATUS | % |
|----------------|-----|
| single | 93 |
| seeing someone | 3 |
| available | 3 |
| married | 1 |
s_pct_int = profiles['status'].value_counts(normalize=True).mul(100).round().convert_dtypes()
S = pd.Series([f"{value}%" for value in s_pct_int.values], index=s_pct_int.index)
print(f"33[1mOkCupid user statuses33[0mn{S.to_markdown(headers=[], tablefmt='fancy_grid')}")
Output:
OkCupid user statuses
╒════════════════╤═════╕
│ single │ 93% │
├────────────────┼─────┤
│ seeing someone │ 3% │
├────────────────┼─────┤
│ available │ 3% │
├────────────────┼─────┤
│ married │ 1% │
╘════════════════╧═════╛
Conclusion
In this article, we discussed some simple but powerful approaches to improving frequency table layout and overall readability. They all have the value_counts()
pandas method as a core part, but all go beyond it and result in more impactful representations. What’s more, each of the suggested solutions, in its final form, requires at most 3 lines of code.
I hope you enjoyed reading my article and found it helpful. Thanks for reading to everyone, and good luck to those who use OkCupid dating app 😉💘
You can find interesting also these articles:
5 Pandas Methods You’ve Never Used… And You Didn’t Lose Anything!
An Unconventional Yet Convenient Matplotlib Broken_Barh Function And When It Is Particularly…
2 Efficient Ways of Creating Fancy Pictogram Charts in Python