Decision Trees for Online Shopping Analysis

Chathuranga Siriwardhana
Towards Data Science
6 min read · Oct 26, 2019


Nowadays, online shopping platforms like Amazon, eBay and AliExpress are hugely popular. These websites let sellers offer their products to a large number of customers, and since many delivery services are connected to the platforms, customers buy from many different countries. Unlike in traditional shops, each seller's ratings and reputation are displayed directly on the platform. Sellers therefore let customers return purchased items if they don't like the product or if the item is defective, and some refund the whole amount if a customer complains that an item was not delivered within the promised period. Some customers misuse these facilities to defraud sellers, causing sellers on these platforms a considerable loss of profit. Let's discuss how we can spot these types of customers by developing a simple machine learning model: a decision tree.

Have a look at this Medium post on decision trees if you are not familiar with them. For a quick recap, a decision tree is a machine learning model that encodes the conditions on which we categorize data (for a classification problem). As an example, think of a simple situation where a man is happy if the weather is sunny or he is on vacation. This scenario is modelled below; note that you can use the weather and vacation status to predict the man's happiness with this model.

A simple Decision Tree model
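To make the idea concrete in code, here is a tiny, hypothetical sketch of the same model using sklearn's DecisionTreeClassifier; the boolean encoding of the two features and the toy data are invented for illustration.

from sklearn.tree import DecisionTreeClassifier

# Toy data: [is_sunny, on_vacation] -> happy (1) or not happy (0)
# The man is happy if the weather is sunny OR he is on vacation
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, 1, 0]

toy_tree = DecisionTreeClassifier().fit(X, y)
print(toy_tree.predict([[0, 1]]))  # not sunny, but on vacation -> [1]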

The online shopping item-return problem

I obtained a dataset containing details of an online shopping platform from a data analytics company while competing in a datathon (data hackathon). The dataset is encoded so that customer and seller details are not exposed, but it includes plenty of (encoded) data on the sellers, customers and products. One of the main tasks of the datathon was to find item-return patterns: the attributes of customers who return items often, the months of the year in which returns peak, and the details of frequently returned items and their sellers. The full dataset covers one complete year. Have a look at the dataset schema to get an understanding of the data.

Dataset schema

Dataset loading and Preprocessing

The dataset was huge: although it covered only one year, it was approximately 25GB, written as 4 .csv files (one file per quarter of the year). Even on a computer with 16GB of memory and an SSD, the files were too large to handle at once. Therefore, the pandas Python package was used to read the dataset in chunks.

import pandas as pd

# Read the dataset in chunks of 1,000,000 records at a time
chunks = pd.read_csv('dataset/DataSet01.csv', chunksize=1000000)
for chunk in chunks:
    df = chunk
    ## Perform task on dataframe; df

Note that the chunksize parameter indicates that the given .csv file is read 1,000,000 records at a time. Preprocessing the data, training machine learning models (here, decision trees) and testing them can all be performed per chunk.

As for preprocessing, records with missing attributes were discarded, since millions of complete records remained. A subset of the given attributes was then selected for the item-return analysis by observing their correlations with returns. Have a look at this GitHub repository for more details and code on the correlation analysis; a short sketch is also given after the attribute list below. Finally, the following attributes were selected by correlation analysis and domain knowledge.

selected_atributes = ['ONLINE_STORE_CATEGORY', 'SIZE_CDE', 'CLOR_CDE', 'SELLING_PRICE', 'PRODUCT_CLASS_02', 'FMALE_IND', 'MRYD_IND', 'BRAND_CODE', 'EDUC_LVL_NBR', 'MRYD', 'AGE', '2017_M7_PURCHASE_AMT', '2017_M7_RETURNED_AMT', 'PRODUCT_CLASS_01', 'PRODUCT_CLASS_02']

Note that these attributes include the category, size, colour and brand of the item; the gender, age, marital status and education level of the customer; and the purchase and return amounts of past months.
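As a rough sketch of this preprocessing step (assuming, as in the rest of the article, that RETURN_INDICATION is the target column and that the candidate attributes are numeric), one chunk could be handled like this:

# Sketch only: drop incomplete records, then rank attributes by their
# correlation with the return indication
# (numeric_only requires pandas >= 1.5; older pandas skips non-numeric columns by default)
df = chunk.dropna()
correlations = df.corr(numeric_only=True)['RETURN_INDICATION']
print(correlations.sort_values(ascending=False))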

Building the Decision Tree model

Let’s build our decision tree model using sklearn’s DecisionTreeClassifier.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Load selected attributes and return indication as X and y
X = df[selected_atributes]
y = df.RETURN_INDICATION

# Split dataset into training set and test set: 70% training and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Train the model
model = DecisionTreeClassifier()
model = model.fit(X_train, y_train)

# Test the model
y_pred = model.predict(X_test)

# Accuracy calculation
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Keeping all the hyperparameters at their default values, I obtained an accuracy of 89%. You can also tune the decision tree by changing the hyperparameters listed in the official documentation.
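For instance, a tuned tree might look like the following sketch; the specific values here are illustrative assumptions, not values tuned on this dataset.

# All hyperparameter values below are assumptions for illustration
model = DecisionTreeClassifier(
    criterion='entropy',    # split on information gain instead of Gini impurity
    max_depth=10,           # limit tree depth to reduce overfitting
    min_samples_leaf=100,   # require at least 100 records in each leaf
    random_state=1,
)
model = model.fit(X_train, y_train)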

Visualizing the model

Now that we have built our decision tree, we want to see the conditions (decisions) on which a customer record is categorized as returning or not. Decision tree visualization is a great way of understanding these conditions. Let's use the plot_tree function in sklearn.tree to generate the tree.

from sklearn import tree
import matplotlib.pyplot as plt

tree.plot_tree(model, max_depth=5, filled=True)
plt.show()

Note that max_depth=5 means only the first 5 depth levels of the tree are visualized. Our tree is a very complex one, so plotting the full tree can take a huge amount of time and memory.

You can use the sklearn.tree.export_text function to export the tree as text. This way, the full tree can be generated easily.

from sklearn.tree import export_text

# Export the full tree structure as plain text
r = export_text(model)
print(r)

Go to the GitHub repository to see the generated plot and text structure of the decision tree.

Storing and re-using the model

You can use pickle to save the model.

import pickle

# Save the trained model to disk
with open('finalized_model.sav', 'wb') as f:
    pickle.dump(model, f)

And to load the model back from the dumped file:

# Load the model back from the file
with open('finalized_model.sav', 'rb') as f:
    loaded_model = pickle.load(f)

Predicting with the model

To classify a given customer record as returning or non-returning, we can use the predict method of the sklearn tree model. Note that you first have to load the same attributes, in the same order, as in the model-building step. Let's predict on the test data from the train/test split we made earlier.

y_pred_loaded = loaded_model.predict(X_test)

This returns a list of predictions (item-return indications) that can be compared with the actual return indications to evaluate our model.

print("Accuracy:", metrics.accuracy_score(y_test, y_pred_loaded))
>>> 0.96
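If you assemble the feature frame yourself, for example for a single new customer, keep the same attributes in the same order as in training. A minimal sketch using one row of the test set:

# Hypothetical single-record prediction; X_test already has the selected
# attributes in the training order
one_customer = X_test.iloc[[0]]            # double brackets keep it a DataFrame
print(loaded_model.predict(one_customer))  # e.g. [0] -> predicted not to return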

More importantly, we can use this model to predict on unseen data. As mentioned in the item-return problem section above, the dataset consists of 4 .csv files, and we used the 1st file to train our model. Let's use the 4th file to predict the return indication. Note that we are again using pandas to load the data in chunks.

selected_atributes = ['ONLINE_STORE_CATEGORY', 'SIZE_CDE', 'CLOR_CDE', 'SELLING_PRICE', 'PRODUCT_CLASS_02', 'FMALE_IND', 'MRYD_IND', 'BRAND_CODE', 'EDUC_LVL_NBR', 'MRYD', 'AGE', '2017_M7_PURCHASE_AMT', '2017_M7_RETURNED_AMT', 'PRODUCT_CLASS_01', 'PRODUCT_CLASS_02']

chunks = pd.read_csv('dataset/DataSet04.csv', chunksize=1000000)
i = 0
for chunk in chunks:
    i = i + 1
    if i > 10:
        break  # evaluate on the first 10 chunks only
    df = chunk

    # Load features and target separately
    X_test = df[selected_atributes]
    y_test = df.RETURN_INDICATION

    y_pred = loaded_model.predict(X_test)
    print("Accuracy for chunk", i, metrics.accuracy_score(y_test, y_pred))
>>> Accuracy for chunk 1 0.865241
>>> Accuracy for chunk 2 0.860326
>>> Accuracy for chunk 3 0.859471
>>> Accuracy for chunk 4 0.853036
>>> Accuracy for chunk 5 0.852454
>>> Accuracy for chunk 6 0.859550
>>> Accuracy for chunk 7 0.869302
>>> Accuracy for chunk 8 0.866371
>>> Accuracy for chunk 9 0.867436
>>> Accuracy for chunk 10 0.89067

In the testing step, we got 96% accuracy, but here the model only reaches the mid-80s. This is because the model has overfitted to the seasonal variations in the first quarter of the year (recall that we trained our model using only the 1st of the 4 .csv files), so it doesn't capture the seasonal variations in the last quarter. Training the model on data from all 4 .csv files should resolve this issue: you can still load the data in small chunks from all 4 files and train on the combined sample, as sketched below.
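Since sklearn's decision trees cannot be trained incrementally, one practical way to do this is to subsample each quarterly file so the combined training frame fits in memory. A minimal sketch, assuming the file names below and a 10% sample per chunk (both are assumptions):

# Sketch: build a training set that covers the whole year
files = ['dataset/DataSet01.csv', 'dataset/DataSet02.csv',
         'dataset/DataSet03.csv', 'dataset/DataSet04.csv']

samples = []
for path in files:
    for chunk in pd.read_csv(path, chunksize=1000000):
        # Keep a 10% random sample of each chunk (sampling rate is an assumption)
        samples.append(chunk.dropna().sample(frac=0.1, random_state=1))

full_year = pd.concat(samples)
X = full_year[selected_atributes]
y = full_year.RETURN_INDICATION

model = DecisionTreeClassifier().fit(X, y)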

Check out the code at this GitHub repository. I hope you find the article useful.
