
Python for Excel Users – Part 1

Getting started with handling data using pandas

This is part one of a tutorial series for everyone who wants to get started working in Python instead of Excel, e.g. for automating tasks or improving speed and scalability. Or simply for the curious ones.

You’re at the right spot when:

  • you use Excel to combine various sheets or tables (using formulas like Index, Match, or VLookup)
  • use simple mathematical operations like sum and mean
  • use conditional aggregation functions (e.g. sum of transactions per category) or pivot tables

This blog post won’t cover everything you need to know to dump Excel, nor is that the goal of the series. Part 1 is meant to get you started and to shift your mindset from Excel to Python/pandas.

While Excel has its place and is fast and convenient, I hope this blog post series helps you explore other ways of working with data.

You can find this blog post as a Jupyter Notebook as well as the data used at our HOSD Mentoring GitHub repo.

Goal of this Tutorial

We will explore and combine different sales data tables like we would in Excel. The tables are:

  • Orders
  • Order_lines
  • Products

More details about the data in a minute.

Requirements

While some basic experience with Python helps, you don’t have to be a programmer or data scientist to follow this tutorial. Ideally, you have heard of pandas and spent some time with Python yourself, just enough to feel comfortable following along without getting discouraged. Try to follow along and read up on anything we don’t cover. You can always come back here.

1. The Situation

Source: Pixabay https://pixabay.com/photos/office-startup-business-home-office-594132/

Imagine you just started in the sales department of a huge Brazilian online retailer. It’s your first day. You set up your computer, got a great double espresso from your company’s barista, a.k.a. the old coffee machine in the break room, and you’re ready to get started!

Unfortunately, it’s summertime and everyone on your team is on vacation. This leaves you with a lot of questions and time to explore the company data yourself. Great!

A few of the questions in your head are:

  • 🏬 How big is the company actually, measured by number of transactions?
  • 💵 Company size in terms of revenue per year?
  • 👗 How many products do we offer per Category?
  • 📦 How successful are our deliveries?

You know that you can answer these questions simply by using and combining the three tables Orders, Order_lines, and Products in Excel using standard formulas and maybe a pivot table.

But since no one’s around and starting a new job marks a new era in your career, you want to do it differently this time. Why not try out this cool thing called Python that everyone on LinkedIn talks about? Yeah, let’s do that!

2. Getting started – the Data

The data used in this tutorial is an adapted and simplified version of the Brazilian E-Commerce Public Dataset by Olist.

In Excel, you would do something like opening a new Excel document, going to Data > Get Data > From Text, searching for the .csv file, and clicking through the Text Import Wizard (choice of delimiter, file encoding, data format, etc.). Or you would just double-click the .csv file and hope Excel figures everything out correctly by itself (ouch!).

But now we’re in Python land and things are a bit different (but not harder!).

To work with a .csv file, we will read it as a dataframe (a fancy name for a table) using the pandas package. We save the imported tables in variables with descriptive names. By convention, dataframe names end with _df.

# setup notebook and load tables
import pandas as pd

orders_df = pd.read_csv("data/Orders.csv")
order_lines_df = pd.read_csv("data/Order_lines.csv")
products_df = pd.read_csv("data/Products.csv")

Great! We’ve just loaded the three tables into our python program.

Did we?

Unlike in Excel, the imported data is not displayed immediately. Python is quite silent if you don’t explicitly tell it what you want to see.

After you’ve imported a .csv file using pandas’ read_csv() function and saved it as something_df, you can display the first 5 rows using something_df.head(5).

That way we can briefly see what the table looks like and which columns we are dealing with. Let’s do that for our three tables.

Orders lists all transactions, when they happened, and whether the order was fulfilled successfully. Each order has an order_id (unique per order), a customer_id (unique per customer), an order_status, and a purchase date.

# Preview of the Orders table
orders_df.head(5)

However, orders_df does not contain the contents of the individual orders. For that we need to take a look at the Order_lines table and connect the two tables via the order id. In order_lines_df, each row refers to one transaction item, including quantity, price, and freight value.

# Preview of Order_lines table
order_lines_df.head(5)

The product category is not part of this table yet, so we need to combine the Products table (products_df) with order_lines_df later in case we want to analyze category-level data. For now there are only two columns in products_df: the product id and the associated product category.

# Preview of Products table
products_df.head(5)

3. Answering our Questions with Data

Source: Pixabay https://pixabay.com/photos/audit-chart-graph-hand-writing-3229739/

Now that we have everything loaded into our notebook, we can start answering our questions with data.

3.1 🏬 How big is the company actually, measured by number of transactions?

Since we are working in the sales department, when we talk about company size here we mean the number of transactions. For that we can start by simply counting the number of rows in orders_df using Python’s len() function. This is similar to clicking on the column in Excel and reading the count at the bottom of the screen.

len(orders_df)
99441

Alternatively, we can use pandas’ pd.unique() function on the order_id column, which returns the unique values of order_id. In this case it shouldn’t make a difference, since order ids are unique by nature.

However, if we used this approach on a column with duplicates (like customer_id), then simply counting all rows with len() would give the wrong answer. So let’s do it properly: use pd.unique() to get the unique order ids and then apply len() to the result.

Note: You can access a column of a table using table_df["column_name"]

len(pd.unique(orders_df["order_id"]))
99441
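
By the way, pandas Series also have a built-in .nunique() method that does the same in a single call; a quick sketch:

# count unique order ids directly with the nunique() method
orders_df["order_id"].nunique()
99441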

We are surprised by the large number of transactions at first, but then we notice two things:

  • Not all orders have been delivered. We should only take a look at successfully delivered orders using order_status.
  • These transactions happened over three years, from 2016 to 2018. We should take a look at the number of transactions per year.

Let’s start by filtering out all orders that have not been delivered successfully. To find out how many order_status options there are, we would click on the filter of that column in Excel.

It’s convenient that Excel shows us only unique values but as we just learned, we can achieve the same by using pd.unique() on the order_status column.

pd.unique(orders_df["order_status"])
array(['delivered', 'invoiced', 'shipped', 'processing', 'unavailable',
       'canceled', 'created', 'approved'], dtype=object)

In Excel we would simply de-select all options except ‘delivered’ to filter our table. Excel will check each row and only display the ones where order_status equals ‘delivered’.

We can tell Python to do the same. If our computer were a person, we would tell her: Please take the orders_df table and show me only the rows where the column "order_status" of the orders_df table is equal to "delivered". Luckily, our computer is not a person and needs way less text to do the same.

Translated to Python, it’s:

orders_df[orders_df["order_status"]=="delivered"]

We should save the filtered result to a new dataframe so we can keep working with it. Adding .copy() gives us an independent table instead of a view of orders_df, which avoids pandas’ SettingWithCopyWarning when we add new columns later.

# filter and copy so we can safely modify the result later
delivered_orders_df = orders_df[orders_df["order_status"]=="delivered"].copy()

Next, we want to break our transactions down by year. We will convert the order_purchase_timestamp column into a datetime column, extract the year from it, and save it to a new column called "order_year".

This is similar to using Excel’s Year() formula.

# the data type of this column is not yet datetime
delivered_orders_df["order_purchase_timestamp"].dtypes
dtype('O')
# convert to datetime
delivered_orders_df["order_purchase_timestamp"] = pd.to_datetime(delivered_orders_df["order_purchase_timestamp"])
delivered_orders_df["order_purchase_timestamp"].dtypes
dtype('<M8[ns]')
# extract the year from datetime and save to new column
delivered_orders_df["order_year"] = delivered_orders_df["order_purchase_timestamp"].dt.year
# show the new table
delivered_orders_df.head(5)

Great! Now we have everything we need to count the delivered orders per year. In Excel we would filter by year, count the rows, and write the result to a new summary table.

More advanced Excel users would skip this manual task by using a pivot table with Year as a column and count values of order_id.

We can achieve the same by using pandas’ groupby and agg functions. The syntax might look intimidating for such a simple task, but the function is quite versatile and powerful in ways that Excel’s pivot can only dream of.

First, we select the table to aggregate (what), then we define the column to group by (where), followed by the aggregation function (how).

delivered_orders_df.groupby(by=["order_year"], as_index=False).agg({'order_id':'count'})

3.2 💵 Company size in terms of revenue per year?

Next to the number of transactions, we would also like to find out the revenue. As stated above, we need to combine delivered_orders_df with order_lines_df using order_id as the key. Furthermore, we notice that the order_lines table has information on the number of order items and the price, but no revenue column yet.

Hence, this is what we need to do:

  • create a column "line_revenue" in order_lines_df with the values of order_item_id * price
  • create an aggregated table that sums up line_revenue for each order as order_revenue
  • combine the aggregated table with delivered_orders_df and sum up order_revenue per year

In Excel, we would create a new column, use a pivot table, combine the pivot table with the other filtered table using something like VLOOKUP, build yet another pivot on that resulting table, …

Let’s not do that. It’s not a complex task, but we can already see why we should switch over to Python for tasks like this. Once you get used to the syntax and process, everything becomes easier and scales better.

order_lines_df.head()
# create new column and save the result of order_item_id * price
order_lines_df["line_revenue"] = order_lines_df["order_item_id"] * order_lines_df["price"]
order_lines_df.head()
# aggregate the table: sum up line_revenue to get the total value per order id
order_lines_agg_df = order_lines_df.groupby(by=["order_id"], as_index=False).agg({"line_revenue":"sum"})
# rename the summed column to order_revenue, since that's what it now represents
order_lines_agg_df = order_lines_agg_df.rename(columns={"line_revenue":"order_revenue"})
order_lines_agg_df.head(5)
delivered_orders_df.head(5)

Now we want to combine the two tables above, order_lines_agg_df and delivered_orders_df. What we do is similar to Excel’s VLOOKUP() formula in terms of the result; however, we merge the two tables using order_id as the key. The Python code is straightforward and reads almost like prose.

After that we repeat what we did above when we calculated the number of transactions per year, but this time we sum up order_revenue.

# merge the two tables to get revenue per order
delivered_orders_merged_df = pd.merge(left=delivered_orders_df, right=order_lines_agg_df, how="left", on="order_id")
delivered_orders_merged_df.head(5)
revenue_per_year_df = delivered_orders_merged_df.groupby(by=["order_year"], as_index=False).agg({"order_revenue":"sum"})
revenue_per_year_df.columns = ["order_year", "total_revenue"]
revenue_per_year_df

This looks wrong, right? Not really. This is called scientific notation, where very big or very small numbers are displayed in a compact power-of-ten form.

4.649029e+4 means 4.649029 × 10⁴, which is 4.649029 × 10,000 = 46,490.29. You can change the way Jupyter Notebook displays numbers using the following code snippet.

# display float numbers up to the second decimal point
pd.set_option('display.float_format', lambda x: '%.2f' % x)
revenue_per_year_df

Great! It seems like our company is quite alive. Using this approach we can calculate all sorts of things, like the number of customers per year or the average order value. The ratio of transactions to customers might be of interest too, since a company can be at risk when its revenue depends on a small number of highly valuable customers. What if they switch to our competitor?
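
For instance, here is a minimal sketch of how we could compute the unique customers and the average order value per year, reusing delivered_orders_merged_df and the column names from above:

# sketch: unique customers and average order value per year
customers_per_year_df = delivered_orders_merged_df.groupby(by=["order_year"], as_index=False).agg({"customer_id":"nunique", "order_revenue":"mean"})
customers_per_year_df.columns = ["order_year", "unique_customers", "avg_order_value"]
customers_per_year_df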

These are great questions to investigate once you’re through this tutorial. But first let’s get back to our other questions.

3.3 👗 How many products do we offer per Category?

Since you already know how to count the unique values of a column, this will be quite easy now. Above, we counted the number of unique order_ids in orders_df. Now we count the unique product_ids in products_df to get a general idea of the scope of our offering, i.e. how many different products we offer.

products_df.head()
# count unique products in our database
len(pd.unique(products_df["product_id"]))
32951

Since we also want to know the number of products per category, we replicate the approach from above, where we calculated the number of transactions per year. Instead of "order_year" we now group by "product_category_name", and instead of counting "order_id" we count "product_id".

# count products per category
products_per_category_df = products_df.groupby(by=["product_category_name"], as_index=False).agg({"product_id":"count"})
products_per_category_df.columns = ["product_category_name", "product_id_count"]
products_per_category_df.head(5)
# number of categories
len(products_per_category_df)
73

Wow, we offer almost 33k products in 73 categories? What are the top 5 categories in terms of number of products?

Like in Excel, we simply have to sort the table by product_id_count.

# sort by product_id_count descending
products_per_category_df.sort_values(by=["product_id_count"], ascending=False).head(5)

That is probably the point where you find out that knowing Portuguese is quite helpful when working for a Brazilian company…

3.4 📦 How successful are our deliveries?

In 3.1 we created a new table with only successfully delivered orders. If we count the unique order ids of that table and divide that number by the count of unique order ids in the big unfiltered orders_df table, then we get the percentage of successful deliveries.

unique_orderids_total = len(pd.unique(orders_df["order_id"]))
unique_orderids_delivered = len(pd.unique(delivered_orders_df["order_id"]))

# calculate ratio and round to second decimal point
round(unique_orderids_delivered/unique_orderids_total,2)
0.97

97% delivery success is something our company can be really proud of.
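
As a side note, if you ever want the full breakdown across all statuses instead of a single ratio, pandas’ value_counts() gets you there in one line; a quick sketch:

# share of each order_status across all orders, as fractions
orders_df["order_status"].value_counts(normalize=True).round(2)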

4. Summary

Now that we’ve learned that, thanks to pandas, working in Python is a real alternative to Excel, we got a bit carried away and calculated all sorts of KPIs based on aggregated data from the three original tables. A quick look at our watch tells us what our stomach already knew: it’s lunchtime!

But before we head down to the company canteen, a.k.a. the vending machine, we want to save what we did and send it to our colleagues. How do we do that?

We can simply export dataframes as csv files and send them to others who prefer to work in Excel.

# Export dataframe as csv for Excel
products_per_category_df.to_csv("output/products_per_category_df.csv", index=False)

Today we learned a few basic but powerful functions that allow us to work in Python instead of Excel. Here’s a summary.

  • pd.read_csv() can be used to import .csv files as dataframes
  • table_df.head() shows us a preview of that table as output
  • to access a column of table_df, we type table_df["column_name"]
  • len(pd.unique(table_df["column_name"])) gives us the number of unique values in that column
  • we can multiply two columns and save the result in a new column using table_df["new"] = table_df["col1"] * table_df["col2"]
  • Instead of Excel’s pivot table, we can aggregate values per category, e.g. count the number of transactions per year or sum up the revenue per order, using table_df.groupby(by=["category"], as_index=False).agg({"col3":"sum"})
  • Instead of using VLOOKUP() in Excel we can merge two tables using pd.merge(left=table_1_df, right=table_2_df, how="left", on="key_column")
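
To tie these together, here’s a tiny self-contained sketch that chains the functions above. The data is made up for illustration and stands in for a table you would normally import with pd.read_csv():

import pandas as pd

# made-up example data standing in for an imported .csv
table_df = pd.DataFrame({
    "category": ["a", "a", "b"],
    "col1": [1, 2, 3],
    "col2": [10.0, 20.0, 30.0],
})

# preview the first rows
table_df.head(5)
# number of unique values in a column
len(pd.unique(table_df["category"]))
# multiply two columns and save the result in a new column
table_df["new"] = table_df["col1"] * table_df["col2"]
# aggregate per category, like a pivot table
agg_df = table_df.groupby(by=["category"], as_index=False).agg({"new":"sum"})
agg_df.columns = ["category", "new_sum"]
# merge the aggregate back onto the original table, like VLOOKUP
merged_df = pd.merge(left=table_df, right=agg_df, how="left", on="category")
merged_df.head(5)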

I suggest you keep going and play around with this dataset, or your own, a bit further. Whenever you don’t know how to do something, the best approach is to check the pandas documentation or to google your specific question ("python datetime from string" or "python pivot table") and see how others do it. Just don’t get discouraged and keep on learning!

The beginning is always tough, but stick with it and you will be way more effective and faster than any Excel pro. And in case you need some general guidance on whether Data Analytics and Data Science could be a career path for you, we at HOSD Mentoring offer free one-to-one mentoring.

You can book a session with one of our mentors with just a click on our website. Don’t be shy 🙂

Furthermore, we would like to announce that this and many other tutorials will be turned into a live webinar with Q&As at the end of the session. To find out more about our offerings or to get in touch with us, make sure to head over to our HOSD LinkedIn page.

Keep on learning!

Jonas

HOSD Founder and Data Analyst Mentor

