Exploratory Data Analysis on E-Commerce Data

To discover interesting transactional patterns of different customers and countries

Admond Lee
Towards Data Science
8 min readSep 22, 2018

(Source)

In general explanation, data science is nothing more than using advanced statistical and machine learning techniques to solve various problems using data. Yet, it’s easier to just dive into applying some fancy machine learning algorithms —and Voila! You got the prediction — without first understanding the data.

This is exactly where the importance of Exploratory Data Analysis (EDA) (as defined by Jaideep Khare) comes in which, unfortunately, is a commonly undervalued step as part of the data science process.

EDA is so important for 3 reasons (at least) as stated below:

  1. Make sure business stakeholders ask the right questions — often by exploring and visualizing data — and validate their business assumptions with thorough investigation
  2. Spot any potential anomalies in data to avoid feeding wrong data to a machine learning model
  3. Interpret the model output and test it’s assumptions

There you have it. Now that we have already understood the “WHAT and WHY” aspects of EDA, let’s examine a dataset together and go through the “HOW” that will eventually lead us to discover some interesting patterns, as we’ll see in the next section.

We’ll focus on the overall workflow of EDA, visualization and its results. For technical reference, please refer to my notebook on Kaggle anytime you want to have a more detailed understanding of the codes.

To give a brief overview, this post is dedicated to 5 sections as follow:

  1. Context of Data
  2. Data Cleaning (a.k.a data preprocessing)
  3. Exploratory Data Analysis
  4. Results
  5. Conclusion

Let’s get started and have fun!

Context of Data

(Source)

In this post, we’ll investigate the E-Commerce dataset obtained from Kaggle. Before dealing with the dataset, let’s try to understand what it is about to give us a better understanding of its context.

In short, the dataset consists of transactional data with customers in different countries who make purchases from an online retail company based in the United Kingdom (UK) that sells unique all-occasion gifts. The information is summarized as below:

  • Company — UK-based and registered non-store online retail
  • Products for selling — Mainly all-occasion gifts
  • Customers — Most are wholesalers (local or international)
  • Transactions Period — 1st Dec 2010–9th Dec 2011 (One year)

Data Cleaning

(Source)

We all know data in real world is messy (including Kaggle!) and thus, let’s spend some time to clean the data to the format we need. Below is a snapshot of what the original data looks like after loading the dataset into a dataframe.

As intuitive as the variables (column names) may sound, let’s take a step further by understanding what each variable means:

InvoiceNo (invoice_num): A number assigned to each transaction
StockCode (stock_code): Product code
Description (description): Product name
Quantity (quantity): Number of products purchased for each transaction
InvoiceDate (invoice_date): Timestamp for each transaction
UnitPrice (unit_price): Product price per unit
CustomerID (cust_id): Unique identifier each customer
Country (country): Country name

NOTES → Product price per unit is assumed to follow the same currency throughout our analysis

Check missing values

So far, so good. We see that there are some missing values for Customers ID and Description. The rows with any of these missing values will therefore be removed.

Descriptive statistic of data

By understanding the data in a more descriptive manner, we notice two things:

  1. Quantity has negative values
  2. Unit Price has zero values (FREE items?)

Interesting…

At this stage, we’ll just remove Quantity with negative values — this notebook explains what negative values mean — and Unit Price with zero values will be explained in the later part.

To calculate the total money spent on each purchase, we simply multiply Quantity with Unit Price:

amount_spent = quantity * unit_price

Finally, we add a few columns that consist of the Year_Month, Month, Day and Hour for each transaction for analysis later. The final dataframe will look like this:

Final dataframe

Exploratory Data Analysis

(Source)

Highest number of orders and money spent on purchases

Top 5 customers with most number of orders
Top 5 customers with highest money spent

In E-Commerce world, we often want to know which customers — where they come from—place the most orders and spend the most money as they drive the sales of companies.

From the results we observe that most orders are made in the UK and customers from Netherlands spend the highest amount of money in their purchases.

How many orders (per month)?

Number of orders for different months

Overall, we consider that the company receives the highest number of orders in November 2011 since we do not have the full month of data for December 2011.

How many orders (per day)?

Number of orders for different days

Surprisingly, there are no transactions on Saturday throughout the whole period ( 1st Dec 2010–9th Dec 2011). Reasons behind are left for discussion as the dataset and its context are limited.

We also spot a trend where the number of orders received by the company tends to increases from Monday to Thursday and decrease afterward.

How many orders (per hour)?

Number of orders for different hours

In terms of hours, there are no transactions after 8:00pm until the next day at 6:00am.

Besides, we notice that the company receives the highest number of orders at 12:00pm. One of the reasons could be due to the fact that most customers make purchases during lunch hour between 12:00pm — 2:00pm.

Discover transactional patterns for Unit Price

Descriptive statistics of Unit Price
Boxplot for Unit Price

Before we move our attention to the zero values (FREE items) of unit price, we make a boxplot to check the distribution of the unit price for all products.

We observe that 75% of the data has unit price of less than 3.75 dollars — which indicates most products are relatively cheap. Only minority of them has high prices per unit (Again, we assume each price per unit follows the same currency).

Well… FREE items for purchase? YES, maybe…

Frequency of giving out FREE items for different months

From the plot, the company tends to give out FREE items for purchases occasionally each month (except June 2011).

However, it is not clear what factors contribute to giving out the FREE items to the particular customers. More in-depth analysis could be done for further explanation. Let me know if you have found out the reasons behind! 😏

Discover transactional patterns for each Country

Top 5 countries with most number of orders

Number of orders in each country (with UK)
Number of orders in each country (without UK)

As expected, the company receives the highest number of orders in the UK (since it is a UK based company).

To better discern the trend, UK is removed for clearer comparison among other countries. As a result, the TOP 5 countries (including UK) that place the highest number of orders are as below:

  • United Kingdom
  • Germany
  • France
  • Ireland (EIRE)
  • Spain

Top 5 countries with highest money spent

Total money spent by each country (with UK)
Total money spent by each country (without UK)

As the company receives the highest number of orders from customers in the UK, it is natural to see that customers in the UK spend the most on their purchases.

Same as before, UK is removed for clearer comparison among other countries. The TOP 5 countries (including UK) that spend the most money on purchases are as below:

  • United Kingdom
  • Netherlands
  • Ireland (EIRE)
  • Germany
  • France

Results from EDA

  1. The customer with the highest number of orders comes from the United Kingdom (UK)
  2. The customer with the highest money spent on purchases comes from Netherlands
  3. The company receives the highest number of orders from customers in the UK (since it is a UK-based company). Therefore, the TOP 5 countries (including UK) that place the highest number of orders are as follow → United Kingdom, Germany, France, Ireland (EIRE), Spain
  4. As the company receives the highest number of orders from customers in the UK (since it is a UK-based company), customers in the UK spend the most on their purchases. Therefore, the TOP 5 countries (including UK) that spend the most money on purchases are as follow → United Kingdom, Netherlands, Ireland (EIRE), Germany, France
  5. November 2011 has the highest sales. The month with the lowest sales is undetermined as the dataset consists of transactions until 9th December 2011 in December
  6. There are no transactions on Saturday between 1st Dec 2010 — 9th Dec 2011
  7. The number of orders received by the company tends to increases from Monday to Thursday and decrease afterward
  8. The company receives the highest number of orders at 12:00pm. Possibly most customers made purchases during lunch hour between 12:00pm — 2:00pm
  9. The company tends to give out FREE items for purchases occasionally each month (Except June 2011). However, it is not clear what factors contribute to giving out the FREE items to the particular customers

Conclusion

(Source)

Awesome!

Simply by performing EDA on the dataset we’ve identified some interesting results. Of course, the results don’t just stop here. They can always be used to validate business assumptions (if any) and interpret a machine learning model’s output and so much more!

Remember. Creativity is your limit when doing EDA. And it really depends on your business understanding, curiosity to ask interesting questions to challenge and validate assumptions, as well as your intuition.

Thank you for reading. Hopefully by showing the overall workflow of EDA, visualization and its results, EDA will become less intimidating to you and you’ll be more interested in getting your hands dirty next time.

As always, if you have any questions or comments feel free to leave your feedback below or you can always reach me on LinkedIn. Till then, see you in the next post! 😄

About the Author

Admond Lee is currently the Co-Founder/CTO of Staqthe #1 business banking API platform for Southeast Asia.

Want to get free weekly data science and startup insights?

Join Admond’s email newsletter — Hustle Hub, where every week he shares actionable data science career tips, mistakes & learnings from building his startup — Staq.

You can connect with him on LinkedIn, Medium, Twitter, and Facebook.

Sign up to discover human stories that deepen your understanding of the world.

Free

Distraction-free reading. No ads.

Organize your knowledge with lists and highlights.

Tell your story. Find your audience.

Membership

Read member-only stories

Support writers you read most

Earn money for your writing

Listen to audio narrations

Read offline with the Medium app

Published in Towards Data Science

Your home for data science and AI. The world’s leading publication for data science, data analytics, data engineering, machine learning, and artificial intelligence professionals.

Written by Admond Lee

Co-Founder & CTO @ Staq | Building the universal API to help fintech companies access financial data from SMEs across Southeast Asia 🚀

Responses (2)

What are your thoughts?

Wow.. That was a better written about the data analysis and E-commerce.
You can also check this link out for more course about Data science and analytics

--

this notebook

Could you please fix this link to <localhost>?

--