Exploratory Data Analysis on E-Commerce Data

To discover interesting transactional patterns of different customers and countries

Published in

Towards Data Science

8 min readSep 22, 2018

In general explanation, data science is nothing more than using advanced statistical and machine learning techniques to solve various problems using data. Yet, it’s easier to just dive into applying some fancy machine learning algorithms —and Voila! You got the prediction — without first understanding the data.

This is exactly where the importance of Exploratory Data Analysis (EDA) (as defined by Jaideep Khare) comes in which, unfortunately, is a commonly undervalued step as part of the data science process.

EDA is so important for 3 reasons (at least) as stated below:

Make sure business stakeholders ask the right questions — often by exploring and visualizing data — and validate their business assumptions with thorough investigation
Spot any potential anomalies in data to avoid feeding wrong data to a machine learning model
Interpret the model output and test it’s assumptions

There you have it. Now that we have already understood the “WHAT and WHY” aspects of EDA, let’s examine a dataset together and go through the “HOW” that will eventually lead us to discover some interesting patterns, as we’ll see in the next section.

We’ll focus on the overall workflow of EDA, visualization and its results. For technical reference, please refer to my notebook on Kaggle anytime you want to have a more detailed understanding of the codes.

To give a brief overview, this post is dedicated to 5 sections as follow:

Context of Data
Data Cleaning (a.k.a data preprocessing)
Exploratory Data Analysis
Results
Conclusion

Let’s get started and have fun!

Context of Data

In this post, we’ll investigate the E-Commerce dataset obtained from Kaggle. Before dealing with the dataset, let’s try to understand what it is about to give us a better understanding of its context.

In short, the dataset consists of transactional data with customers in different countries who make purchases from an online retail company based in the United Kingdom (UK) that sells unique all-occasion gifts. The information is summarized as below:

Company — UK-based and registered non-store online retail
Products for selling — Mainly all-occasion gifts
Customers — Most are wholesalers (local or international)
Transactions Period — 1st Dec 2010–9th Dec 2011 (One year)

Data Cleaning

We all know data in real world is messy (including Kaggle!) and thus, let’s spend some time to clean the data to the format we need. Below is a snapshot of what the original data looks like after loading the dataset into a dataframe.

As intuitive as the variables (column names) may sound, let’s take a step further by understanding what each variable means:

InvoiceNo (invoice_num): A number assigned to each transaction
StockCode (stock_code): Product code
Description (description): Product name
Quantity (quantity): Number of products purchased for each transaction
InvoiceDate (invoice_date): Timestamp for each transaction
UnitPrice (unit_price): Product price per unit
CustomerID (cust_id): Unique identifier each customer
Country (country): Country name

NOTES → Product price per unit is assumed to follow the same currency throughout our analysis

So far, so good. We see that there are some missing values for Customers ID and Description. The rows with any of these missing values will therefore be removed.

By understanding the data in a more descriptive manner, we notice two things:

Quantity has negative values
Unit Price has zero values (FREE items?)

Interesting…

At this stage, we’ll just remove Quantity with negative values — this notebook explains what negative values mean — and Unit Price with zero values will be explained in the later part.

To calculate the total money spent on each purchase, we simply multiply Quantity with Unit Price:

amount_spent = quantity * unit_price

Finally, we add a few columns that consist of the Year_Month, Month, Day and Hour for each transaction for analysis later. The final dataframe will look like this:

Exploratory Data Analysis

Highest number of orders and money spent on purchases

Top 5 customers with most number of orders

Top 5 customers with highest money spent

In E-Commerce world, we often want to know which customers — where they come from—place the most orders and spend the most money as they drive the sales of companies.

From the results we observe that most orders are made in the UK and customers from Netherlands spend the highest amount of money in their purchases.

How many orders (per month)?

Overall, we consider that the company receives the highest number of orders in November 2011 since we do not have the full month of data for December 2011.

How many orders (per day)?

Surprisingly, there are no transactions on Saturday throughout the whole period ( 1st Dec 2010–9th Dec 2011). Reasons behind are left for discussion as the dataset and its context are limited.

We also spot a trend where the number of orders received by the company tends to increases from Monday to Thursday and decrease afterward.

How many orders (per hour)?

In terms of hours, there are no transactions after 8:00pm until the next day at 6:00am.

Besides, we notice that the company receives the highest number of orders at 12:00pm. One of the reasons could be due to the fact that most customers make purchases during lunch hour between 12:00pm — 2:00pm.

Discover transactional patterns for Unit Price

Before we move our attention to the zero values (FREE items) of unit price, we make a boxplot to check the distribution of the unit price for all products.

We observe that 75% of the data has unit price of less than 3.75 dollars — which indicates most products are relatively cheap. Only minority of them has high prices per unit (Again, we assume each price per unit follows the same currency).

Well… FREE items for purchase? YES, maybe…

Frequency of giving out FREE items for different months

From the plot, the company tends to give out FREE items for purchases occasionally each month (except June 2011).

However, it is not clear what factors contribute to giving out the FREE items to the particular customers. More in-depth analysis could be done for further explanation. Let me know if you have found out the reasons behind! 😏

Discover transactional patterns for each Country

Top 5 countries with most number of orders

Number of orders in each country **(with UK)**

Number of orders in each country **(without UK)**

As expected, the company receives the highest number of orders in the UK (since it is a UK based company).

To better discern the trend, UK is removed for clearer comparison among other countries. As a result, the TOP 5 countries (including UK) that place the highest number of orders are as below:

United Kingdom
Germany
France
Ireland (EIRE)
Spain

Top 5 countries with highest money spent

Total money spent by each country **(with UK)**

Total money spent by each country **(without UK)**

As the company receives the highest number of orders from customers in the UK, it is natural to see that customers in the UK spend the most on their purchases.

Same as before, UK is removed for clearer comparison among other countries. The TOP 5 countries (including UK) that spend the most money on purchases are as below:

United Kingdom
Netherlands
Ireland (EIRE)
Germany
France

Results from EDA

The customer with the highest number of orders comes from the United Kingdom (UK)
The customer with the highest money spent on purchases comes from Netherlands
The company receives the highest number of orders from customers in the UK (since it is a UK-based company). Therefore, the TOP 5 countries (including UK) that place the highest number of orders are as follow → United Kingdom, Germany, France, Ireland (EIRE), Spain
As the company receives the highest number of orders from customers in the UK (since it is a UK-based company), customers in the UK spend the most on their purchases. Therefore, the TOP 5 countries (including UK) that spend the most money on purchases are as follow → United Kingdom, Netherlands, Ireland (EIRE), Germany, France
November 2011 has the highest sales. The month with the lowest sales is undetermined as the dataset consists of transactions until 9th December 2011 in December
There are no transactions on Saturday between 1st Dec 2010 — 9th Dec 2011
The number of orders received by the company tends to increases from Monday to Thursday and decrease afterward
The company receives the highest number of orders at 12:00pm. Possibly most customers made purchases during lunch hour between 12:00pm — 2:00pm
The company tends to give out FREE items for purchases occasionally each month (Except June 2011). However, it is not clear what factors contribute to giving out the FREE items to the particular customers

Conclusion

Awesome!

Simply by performing EDA on the dataset we’ve identified some interesting results. Of course, the results don’t just stop here. They can always be used to validate business assumptions (if any) and interpret a machine learning model’s output and so much more!

Remember. Creativity is your limit when doing EDA. And it really depends on your business understanding, curiosity to ask interesting questions to challenge and validate assumptions, as well as your intuition.

Thank you for reading. Hopefully by showing the overall workflow of EDA, visualization and its results, EDA will become less intimidating to you and you’ll be more interested in getting your hands dirty next time.

As always, if you have any questions or comments feel free to leave your feedback below or you can always reach me on LinkedIn. Till then, see you in the next post! 😄

About the Author

Admond Lee is currently the Co-Founder/CTO of Staq — the #1 business banking API platform for Southeast Asia.

Want to get free weekly data science and startup insights?

Join Admond’s email newsletter — Hustle Hub, where every week he shares actionable data science career tips, mistakes & learnings from building his startup — Staq.

You can connect with him on LinkedIn, Medium, Twitter, and Facebook.