Your daily life data analysis
As a data scientist or analyst, your job is to produce reports that contain insights for business decisions. A report can be built with tools such as Microsoft Excel or SAP, or customized with a programming language such as SAS, R, or Python. The result can be sent to stakeholders by internal email or published through a centralized dashboard.
Like many others, I am a data analyst who uses Python to make reports and presentations every day. My usual assignment is an ad-hoc analysis that has to be ready within 2–3 hours for presentation to the management team.
To get the result I want, I start my Jupyter notebook kernel and quickly write some code to produce the numbers. After that, I may put the result into Microsoft PowerPoint with some footnotes and send it to my supervisor, who presents it and makes an important decision before the end of the day.
One pain point is that, under this time limit, I have to digest the information, write the code to produce the result, and put it into Microsoft PowerPoint in a presentable format.
Unfortunately, the programming language you use may not, by default, format the output in a way that looks good to the management team, e.g., putting thousands separators in numbers or avoiding scientific notation for large numbers.
If you submit your report without taking care of those aspects, the management team may complain a lot about it and, sometimes, throw it in the garbage without even glancing at it. That is very annoying after the time and effort you put into it.

To solve the problem, you can export the result from your programming language to Microsoft Excel and change the format as you want. Excel is a good tool for this purpose. The bad part is that you have to do it manually. What if we could automate it as part of the programming process? It would be great, wouldn’t it?
As-is
Let’s look at the data frame I produced for this example. It shows the revenue amounts that the company needs. As you can see, this is the default output of a pandas data frame. Without any configuration, this is what you get.
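For the two cosmetic issues mentioned earlier (thousands separators and scientific notation), pandas exposes a display option, `display.float_format`. The sketch below is a minimal illustration with made-up numbers; the `sample` data frame and its column name are assumptions, not the data used in this article.

```python
import pandas as pd

# Hypothetical figures, for illustration only
sample = pd.DataFrame({"revenue": [1234567.891, 12000000000.0]})

# option_context applies the setting only inside the `with` block,
# so the rest of the notebook keeps the pandas defaults
with pd.option_context("display.float_format", "{:,.2f}".format):
    formatted = repr(sample)

print(formatted)
```

To apply the same format to the whole notebook, `pd.set_option("display.float_format", "{:,.2f}".format)` sets it globally.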

One comment I always receive from my supervisor or chief officer is:
"Can you make it more readable and easier to compare?"
One solution is to divide the numbers by a million and put the unit above the table instead. You also have to remember to keep it consistent across your presentation. What if there are 100 tables you have to reproduce? It’s tough, right?
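The divide-by-a-million approach could be sketched as follows, with made-up numbers. The `UNIT` constant and the `to_millions` helper are hypothetical names introduced here; the point is that keeping the scale in one place makes it easier to stay consistent across many tables.

```python
import pandas as pd

# Hypothetical revenue figures, for illustration only
df = pd.DataFrame({"revenue": [1_250_000.0, 48_300_000.0, 730_000.0]})

UNIT = 1_000_000  # one shared constant keeps every table on the same scale

def to_millions(frame, decimals=2):
    """Return a copy of `frame` with every value expressed in millions."""
    return (frame / UNIT).round(decimals)

in_millions = to_millions(df)

# state the unit once, above the table, instead of in every cell
print("Unit: million")
print(in_millions)
```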
I found out that you can fix this programmatically. I spent a lot of time collecting the following code snippets from the internet. Many thanks to Stack Overflow! I think sharing them could benefit anyone who faces the same problems as I do. You will spend less time on cosmetic comments and can concentrate on the validity of the content.
How to Improve? You may ask.
Human readable format
The most frequent comment I receive is: can you round the numbers and add a suffix such as M for million or K for thousand? This makes your table look better and removes unnecessary information from the reader’s eyes. Much of the time, we don’t need this much precision to decide where to go.
Here is the function to convert the numbers in your pandas data frame to the format you want.
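def human_readable_format(value, pos=None):
    '''
    Convert a number in a data frame to a human-readable format.
    The `pos` argument is only needed by the matplotlib ticker formatter.
    '''
    assign_unit = 0
    units = ['', 'K', 'M', 'B']
    # use abs() so negative numbers are scaled too, and stop at the
    # largest unit so the index never runs past the end of the list
    while abs(value) >= 1_000 and assign_unit < len(units) - 1:
        value /= 1_000
        assign_unit += 1
    return f"{value:.2f} {units[assign_unit]}".strip()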

Tada! Here is the result you will get. It’s a lot easier to read, right?
The drawback of this function is that it converts your numbers to strings, which means you lose the ability to sort the data frame numerically. The problem can be solved by sorting the values first and applying the format afterwards.
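A minimal sketch of the sort-first, format-later order, using made-up customers and amounts (the data frame below is an assumption for illustration). A compact copy of the function is repeated so the snippet runs on its own.

```python
import pandas as pd

def human_readable_format(value, pos=None):
    units = ['', 'K', 'M', 'B']
    assign_unit = 0
    while abs(value) >= 1_000 and assign_unit < len(units) - 1:
        value /= 1_000
        assign_unit += 1
    return f"{value:.2f} {units[assign_unit]}".strip()

# Hypothetical data, for illustration only
df = pd.DataFrame({
    "customer": ["A", "B", "C"],
    "value_9": [950.0, 2_400_000.0, 31_000.0],
})

# 1) sort while the column is still numeric
ranked = df.sort_values("value_9", ascending=False).reset_index(drop=True)
# 2) only then convert the numbers to display strings
ranked["value_9"] = ranked["value_9"].map(human_readable_format)

print(ranked)
```

If the numbers need to stay numeric, another option is `df.style.format({"value_9": human_readable_format})`, which changes only how the values are displayed and leaves the underlying data untouched.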
You can save the result to an Excel or CSV file and put it in PowerPoint. My go-to method is usually to take a screenshot and put it directly into the presentation.
This snippet saves me a lot of time when reproducing multiple tables, because when you get a comment from your supervisor, you have to refresh all of them. Suppose you have 100 tables in the presentation. It’s a nightmare for people who do it manually, table by table.
Also, we can use this formatter in a matplotlib graph as well. I think matplotlib would be your first choice for plotting if you use the pandas library for your data analysis.

You can format the y-axis of the graph with the same human-readable format as your table:
import matplotlib.ticker as mticker
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
df['value_9'].plot(ax=ax)
# format the y-axis tick labels with the function defined above
ax.yaxis.set_major_formatter(
    mticker.FuncFormatter(human_readable_format)
)
It looks much more compelling.
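The `pos` argument is there because `FuncFormatter` calls the function with two arguments, the tick value and the tick position. A minimal sketch, repeating a compact copy of the function so it runs on its own:

```python
import matplotlib.ticker as mticker

def human_readable_format(value, pos=None):
    units = ['', 'K', 'M', 'B']
    assign_unit = 0
    while abs(value) >= 1_000 and assign_unit < len(units) - 1:
        value /= 1_000
        assign_unit += 1
    return f"{value:.2f} {units[assign_unit]}".strip()

formatter = mticker.FuncFormatter(human_readable_format)

# matplotlib calls the formatter as formatter(tick_value, tick_position);
# the position is ignored here, but the signature must accept it
label = formatter(1_500_000, 0)
print(label)
```

Because the position is optional, the same function works both as a tick formatter and as a plain cell formatter for a data frame.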

Highlight cell
Sometimes you need to point out an important number, trend, or message in the table. You have a logical rule in mind, such as highlighting the month with the maximum collection amount. The number can vary with the underlying transactions in the data. If you want to highlight it dynamically, you have to do it programmatically.
This is the second thing I use the most to make my tables look better. It helps you convey the message and improves your storytelling by emphasizing what is important over the rest.
def highlight_max_value(series):
    # get the True or False status of each value in the series
    boolean_mask = series == series.max()
    # return the orange text color where the boolean mask is True
    res = ["color: orange" if max_val else '' for max_val in boolean_mask]
    return res

df.style.apply(highlight_max_value)
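To see what `Styler.apply` expects from such a function, it can help to call it on a small Series directly. This is a sketch with made-up monthly numbers (the `collection` Series is an assumption): the function receives one column at a time and must return one CSS string per value.

```python
import pandas as pd

def highlight_max_value(series):
    boolean_mask = series == series.max()
    return ["color: orange" if is_max else "" for is_max in boolean_mask]

# Hypothetical monthly collection amounts, for illustration only
collection = pd.Series([120, 340, 290], index=["Jan", "Feb", "Mar"])

# one CSS string per row, empty where nothing should be highlighted
styles = highlight_max_value(collection)
print(styles)
```

By default `Styler.apply` works column by column; passing `axis=1` applies the rule row by row, and `subset=` limits it to selected columns.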

Sometimes this also makes it easier to find the underlying trend in your data. You can’t notice a pattern in a ton of data without arranging it properly.
Less is more
The last tip is not to add something interesting to your data frame or plot but to remove things from it. Sometimes less is more. The fewer components in your data frame or graph, the better the message is conveyed. The reader or recipient absorbs only what they need to absorb.

You can change a lot of things here, and then it will look like this.
import matplotlib.ticker as mticker
import matplotlib.pyplot as plt
import seaborn as sns

# Prepare data set
revenue = df[['value_9']].copy()
revenue['pct'] = revenue['value_9'] * 100 / revenue['value_9'].sum()
revenue = revenue.sort_values('pct', ascending=False).reset_index(drop=True)
revenue['cumsum_pct'] = revenue['pct'].cumsum()

# enlarge font size for the graph
sns.set_context('talk')

# plot the bar chart to show the revenue amount
fig, ax = plt.subplots(figsize=(9, 6))
revenue['value_9'].plot.bar(ax=ax)
ax.yaxis.set_major_formatter(mticker.FuncFormatter(human_readable_format))
ax.set_title('Revenue generated')

# plot cumulative revenue in percentage on a second y-axis
# to show the impact of the first 3 customers
ax2 = ax.twinx()
revenue['cumsum_pct'].plot(ax=ax2, color='orange')
ax2.yaxis.set_major_formatter(mticker.PercentFormatter())
sns.despine()

By arranging the data and adding some information to it, you get a more intuitive chart to support decisions. For example, we can see that the first 3 customers alone account for more than 80% of our revenue, so keeping a good relationship with them matters more than anything else.
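The same reading can also be computed instead of read off the chart. The sketch below uses made-up revenue numbers and a hypothetical 80% threshold to count how many of the largest customers are needed to reach that share.

```python
import pandas as pd

# Hypothetical revenue per customer, for illustration only
revenue = pd.DataFrame({"value_9": [150.0, 500.0, 40.0, 250.0, 60.0]})

# rank customers from largest to smallest, then accumulate their share
revenue = revenue.sort_values("value_9", ascending=False).reset_index(drop=True)
revenue["cumsum_pct"] = revenue["value_9"].cumsum() * 100 / revenue["value_9"].sum()

threshold = 80  # assumed cut-off, in percent

# idxmax on a boolean Series returns the first row where the condition holds
first_row = (revenue["cumsum_pct"] >= threshold).idxmax()
n_customers = int(first_row) + 1

print(f"{n_customers} customers cover at least {threshold}% of revenue")
```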
Summary
In this new era, data analysts use programming languages to produce reports and presentations. This saves a lot of time on manual tasks, but there is more complicated stuff to take care of, as mentioned above. It’s a trade-off.
I hope the tips and tricks I shared today will be helpful in some way. I spent time finding these code snippets and adapting them to my work. Having them consolidated in one place where I can always look back at them is convenient. You can come back to this article too!
All the code in this article can be found here!
Pathairush Seeda