A UK-based online retail store has captured sales data for its products for the period November 2016 to December 2017. The organization sells gifts primarily through its online platform. Most customers purchase directly for themselves, while some small businesses buy in bulk and resell to other customers through the retail outlet channel.
The project objective is to find significant customers for the business: those who make high-value purchases of their favorite products. The organization wants to roll out a loyalty program to these high-value customers once the segments have been identified, so we will use a clustering methodology to segment the customers into groups.
Data Wrangling
We start with the provided dataset, which we first examine to see whether any wrangling and cleaning is required.
Since the dataset has a large number of rows, it makes sense to inspect its structure and check for missing values. If values are missing, we should identify which columns are affected and whether this will be a problem for our analysis.
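A minimal sketch of this check: the column names (`CustomerID`, `Description`, etc.) are assumptions about the dataset's schema, and the rows below are toy stand-ins for the real file.

```python
import pandas as pd

# Toy stand-in for the retail data; in the project the real file would be
# loaded instead, e.g. df = pd.read_csv("online_retail.csv")
df = pd.DataFrame({
    "CustomerID": [17850.0, None, 13047.0, None],
    "Description": ["MUG", "LANTERN", None, "CLOCK"],
    "Quantity": [6, 2, 8, 3],
    "UnitPrice": [2.55, 3.39, 4.25, 1.85],
})

# Overall shape, then missing values counted per column
print(df.shape)
missing = df.isna().sum()
print(missing)
```

On the real data, `df.info()` gives the same overview (row count, dtypes, non-null counts) in one call.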
We filter the table to see where the bulk of the NaN values are and how they affect the output. With the help of SQL we then aggregate the data to find out how many purchases each customer made, as well as the total revenue they generated for the company. Since our main goal is to find the biggest spenders, comparing customers' bulk ordering is the most direct way to identify the high flyers.
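The same per-customer aggregation can be sketched in pandas as a SQL-style GROUP BY; the column names and values here are illustrative assumptions, not the actual schema.

```python
import pandas as pd

# Hypothetical transaction rows standing in for the real dataset
df = pd.DataFrame({
    "CustomerID": [101, 101, 102, 103, 102],
    "Quantity": [5, 3, 10, 1, 2],
    "UnitPrice": [2.0, 4.0, 1.5, 20.0, 3.0],
})
df["Revenue"] = df["Quantity"] * df["UnitPrice"]

# Equivalent of: SELECT CustomerID, SUM(Quantity), SUM(Revenue) GROUP BY CustomerID
per_customer = (
    df.groupby("CustomerID")
      .agg(Purchases=("Quantity", "sum"), Revenue=("Revenue", "sum"))
      .reset_index()
)
print(per_customer)
```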
After the table is built and all the NaN values are grouped together, we drop them to avoid bias in the histograms of the categorical columns. That way, all customers are more evenly represented and we avoid outliers attributable to customers with missing information.
After that, we create a scatter plot of revenue against items bought per customer. The majority of sales sit near the origin, with a few exceptions: some customers buy huge quantities that generate little revenue, while others buy a handful of items with massive revenue. To make sure our clusters represent the data fairly, we cap the top revenue and quantity values.
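One way to apply such a cap is a quantile cutoff; the 95th percentile used here is an assumed threshold, and the numbers are toy values with one deliberately extreme customer.

```python
import pandas as pd

# Assumed per-customer summary; the fourth row is an extreme outlier
per_customer = pd.DataFrame({
    "Purchases": [8, 12, 1, 500, 9, 7],
    "Revenue": [22.0, 21.0, 20.0, 30000.0, 25.0, 18.0],
})

# Cap both axes at the 95th percentile so a handful of extreme
# customers do not dominate the clustering space
q_cap = per_customer["Purchases"].quantile(0.95)
r_cap = per_customer["Revenue"].quantile(0.95)
trimmed = per_customer[
    (per_customer["Purchases"] <= q_cap) & (per_customer["Revenue"] <= r_cap)
]
print(len(trimmed))
```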
Exploratory Analysis
To find the clusters, we will use the K-means algorithm.
K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined distinct non-overlapping subgroups (clusters) where each data point belongs to only one group. It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far) as possible. It assigns data points to a cluster such that the sum of the squared distance between the data points and the cluster’s centroid (arithmetic mean of all the data points that belong to that cluster) is at the minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.
The K-means algorithm works as follows:
- Specify number of clusters K.
- Initialize centroids by first shuffling the dataset and then randomly selecting K data points for the centroids without replacement.
- Keep iterating until there is no change to the centroids, i.e. the assignment of data points to clusters isn't changing.
- Compute the sum of the squared distance between data points and all centroids.
- Assign each data point to the closest cluster (centroid).
- Compute the centroids for the clusters by taking the average of all the data points that belong to each cluster.
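The steps above can be sketched as a minimal NumPy implementation (a toy illustration of the algorithm, not the project's actual code):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means following the steps above."""
    rng = np.random.default_rng(seed)
    # Initialise centroids: k random points chosen without replacement
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        new_centroids = np.array(
            [X[labels == j].mean(axis=0) for j in range(k)]
        )
        # Stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
labels, centroids = kmeans(X, k=2)
print(labels)
```

In practice one would use `sklearn.cluster.KMeans`, which adds smarter initialisation and multiple restarts.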
To find the optimal number of clusters K, we will use the "elbow method". The method consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use. The same method can be used to choose the number of parameters in other data-driven models, such as the number of principal components to describe a data set.
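A sketch of the elbow computation, assuming scikit-learn and synthetic data with three obvious groups standing in for the customer features; in the write-up, the resulting `inertias` curve is what gets plotted against K.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data with three well-separated groups
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=c, scale=0.3, size=(30, 2))
    for c in ([0, 0], [5, 5], [0, 5])
])

# Inertia (within-cluster sum of squares) for each candidate K;
# the "elbow" is where the curve stops dropping sharply
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)
print(inertias)
```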
Next, we create a NumPy array from the columns of the dataset and compute the clusters based on the optimal K we found. Once done, we plot the dataframe coloured by the chosen clusters, together with the centers we calculated. To make the results stand out, we use bright colours so the edges of each cluster are clearly visible.
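A sketch of that fitting step, with an assumed [purchases, revenue] feature matrix of toy values; the commented `plt.scatter` line shows one way the coloured plot could be drawn in a notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumed per-customer features: [purchases, revenue] per row (toy values)
X = np.array([
    [8, 22], [12, 21], [1, 20], [9, 25],
    [200, 4000], [210, 3900],
], dtype=float)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster index per customer
print(km.cluster_centers_)  # centers to overlay on the scatter plot

# In a notebook the clusters could be drawn with e.g.:
# plt.scatter(X[:, 0], X[:, 1], c=km.labels_, cmap="tab10")
```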
We will also make a dendrogram as a sanity check. A dendrogram is a diagram that shows the hierarchical relationship between objects and is most commonly created as an output of hierarchical clustering. Its main use is to work out the best way to allocate objects to clusters. For a more in-depth description of how and why we use dendrograms, including the math behind them, feel free to follow the hyperlink for more information.
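A minimal sketch of building the hierarchy with SciPy; Ward linkage is an assumed choice here, and `no_plot=True` is used only so the layout can be inspected programmatically (drop it in a notebook to draw the tree).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy points forming two tight pairs
X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 4.9]])

# Ward linkage builds the merge hierarchy the dendrogram visualises
Z = linkage(X, method="ward")

# Each row of Z records one merge: the two clusters joined, the
# merge distance, and the size of the new cluster
tree = dendrogram(Z, no_plot=True)
print(tree["ivl"])  # leaf order along the bottom of the tree
```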
Results
Once the clusters are found, we tag each customer with the number of the cluster they belong to, showing which ones are among the highest grossing. By filtering the records for that specific cluster number, we can create a list of top-tier customers to look after. This is also useful for checking, in the coming years, whether new customers move into the cluster in question; the dynamics of the existing customers can help tune the list of high flyers over time.
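The tagging and filtering step can be sketched as follows; the cluster labels and revenue figures are assumed toy outputs of the clustering above.

```python
import pandas as pd

# Assumed result of clustering: each customer tagged with a cluster index
per_customer = pd.DataFrame({
    "CustomerID": [101, 102, 103, 104],
    "Revenue": [22.0, 21.0, 4000.0, 3900.0],
    "Cluster": [0, 0, 1, 1],
})

# Find the cluster with the highest mean revenue, then list its members
top_cluster = per_customer.groupby("Cluster")["Revenue"].mean().idxmax()
vips = per_customer.loc[
    per_customer["Cluster"] == top_cluster, "CustomerID"
].tolist()
print(vips)
```

Rerunning this on next year's data would show whether new customers have entered the high-value cluster.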
Conclusion
Thank you for taking the time to read this analysis; I hope it was interesting and provided some insight into clustering and exploratory analysis. All the code for the project is available on my GitHub, so please check it out, and if you have any comments or notes, feel free to share them. After all, we cannot learn what we think we already know.