Visualizing E-commerce Survey Data

Data wrangling and visualization using Python and Pandas from SurveyMonkey data

Photo by Luke Chesser on Unsplash

I had a final project in my Probability & Statistics class with Dr Dimitri Mahayana, which was about analyzing e-commerce survey data in Indonesia. Each student collected data from several respondents, which added up to around 1,600 responses. The survey consists of more than 50 questions, some about personal data and others about personal behaviour in online shopping.

In this post, I use the data from 2019. I am going to show you how data wrangling and visualization can give us more insight into the data. We are going to look at personal data, behavioural data and combinations of the two to get more information.


Dataset explanation

The survey used SurveyMonkey, a popular survey platform, for data collection. However, the raw data from SurveyMonkey turns out to be quite messy, with lots of unnamed columns and NaN values, so it takes some time to clean up and analyze. As you can see in the image below, the questions are in the column names, and the options are in the first row. Thankfully, we can use Pandas, which is quite versatile for data wrangling.

Raw data from SurveyMonkey (Image by Author)

Reorganizing the data

Data wrangling is not an easy job, especially if you are dealing with lots of data. Cleaning can be exhausting because it is an iterative process that you repeat until you get what you need. So, I rearranged my data into a list of questions which I can access individually. For example, I can type in ‘Q32’ to access question number 32. Here is what the data looks like after rearrangement. (You can see my code on my GitHub.)

Preprocessed data (Image by Author)

You can see how the questions and sub-questions moved into multi-index columns, and all the answers are in rows below it. Also, you may have noticed that there are still a lot of NaN values as SurveyMonkey treats answer options as numbers. Let’s say there are 2 options, A and B. When a respondent chooses A, SurveyMonkey fills column A with 1, and column B with NaN. We will deal with that later when visualizing the data.
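The rearrangement described above can be sketched roughly as follows. This is a minimal illustration on a made-up four-column export, not necessarily the code used for this project: the question texts, the `tidy_survey` helper and the tiny `raw` table are all assumptions for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of a raw export: question text sits in the header,
# follow-up columns are "Unnamed: n", and the first row holds the options.
raw = pd.DataFrame(
    {
        "What is your gender?": ["Male", 1, np.nan, 1],
        "Unnamed: 1": ["Female", np.nan, 1, np.nan],
        "Which platforms do you use?": ["Marketplace", 1, 1, np.nan],
        "Unnamed: 3": ["Instagram", np.nan, 1, 1],
    }
)


def tidy_survey(raw: pd.DataFrame) -> pd.DataFrame:
    """Move questions and options into two-level columns (illustrative helper)."""
    header = pd.Series(raw.columns.astype(str))
    # Blank out the "Unnamed: n" placeholders, then carry each question forward.
    questions = header.where(~header.str.startswith("Unnamed")).ffill()
    options = raw.iloc[0].astype(str)
    answers = raw.iloc[1:].reset_index(drop=True)
    answers.columns = pd.MultiIndex.from_arrays(
        [questions.to_numpy(), options.to_numpy()], names=["question", "option"]
    )
    # Answers arrive as mixed objects; coerce them to numbers (NaN stays NaN).
    return answers.apply(pd.to_numeric, errors="coerce")


tidy = tidy_survey(raw)
gender = tidy["What is your gender?"]  # one column per answer option
```

Selecting a question by its top-level label returns a small table with one column per option, which is what makes it convenient to pull out a single question at a time.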

Because there are lots of questions in this survey, I analyze only some of them. I want to see some personal information such as gender, occupation and monthly income. Besides that, I am going to analyze respondents’ behaviour, such as their preferred online shopping platform, buying preferences and so on. After that, I will combine some of the features to get more insight into the data.


Personal data

First of all, I would like to visualize the number of respondents by gender. The pie chart below represents the gender distribution among respondents. The number of females is slightly higher than the number of males, by a margin of roughly 8%. Notice that when we visualize the data, we count the values in each column and ignore all NaN values.

Respondents' genders (Image by Author)
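Counting values per column while ignoring NaN is what `DataFrame.count()` does. Here is a minimal sketch on invented data; the five made-up respondents and the variable names are assumptions, and the real survey numbers differ.

```python
import numpy as np
import pandas as pd

# Invented single-choice answers: 1 marks the chosen option, NaN the other.
gender = pd.DataFrame(
    {
        "Male": [1, np.nan, np.nan, 1, np.nan],
        "Female": [np.nan, 1, 1, np.nan, 1],
    }
)

# count() tallies the non-NaN cells in each column, so NaN is skipped for free.
counts = gender.count()

# For a single-choice question the counts add up to the number of respondents,
# so dividing by their sum gives each option's share in percent.
shares = counts / counts.sum() * 100
```

With matplotlib installed, `counts.plot.pie()` would be one way to turn these counts into a pie chart.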

Next, we visualize the occupation distribution to understand our respondents’ background. The pie chart below illustrates that more than two-thirds of respondents are students. The second-largest group is private employees, followed by housewives and other occupations.

Respondents' occupations (Image by Author)

Lastly, we take a look at their monthly income in Indonesian rupiah. The question offers four income categories: under Rp 2 million, Rp 2 to 5 million, Rp 5 to 10 million and more than Rp 10 million a month. The bar chart suggests that almost one thousand respondents have a monthly income under Rp 2 million (equivalent to around US$ 135).

Respondents' monthly income range (Image by Author)

Behavioural data

Now, we move on to behavioural data. The first one is the distribution of platforms which respondents use for online shopping. The bar chart illustrates that most of them use a marketplace for online shopping. There are several marketplaces in Indonesia, for instance, Tokopedia, Shopee and Bukalapak. Online delivery (such as Gojek and Grab) comes second, with more than 800 respondents. Instagram also has a significant share, as roughly 25% of respondents (400 out of 1,600) use it for online shopping.

Online shopping platform used by respondents (Image by Author)
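Because a respondent can tick several platforms, a share such as "400 out of 1,600" is computed against the number of respondents rather than the total number of ticks. A small sketch with invented answers (the four respondents and the column names are assumptions):

```python
import numpy as np
import pandas as pd

# Invented multi-select answers: each row is a respondent, and a respondent
# may mark more than one platform.
platforms = pd.DataFrame(
    {
        "Marketplace": [1, 1, np.nan, 1],
        "Online delivery": [1, np.nan, 1, np.nan],
        "Instagram": [np.nan, np.nan, 1, np.nan],
    }
)

n_respondents = len(platforms)

# Users per platform, largest first (what a sorted bar chart would show).
users = platforms.count().sort_values(ascending=False)

# Percentage of respondents using each platform. The percentages can sum to
# more than 100 because the question allows multiple answers.
share = users / n_respondents * 100
```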

Since most respondents use a marketplace for online shopping, I want to dive deeper into the distribution of online stores and marketplaces. The bar chart below shows how many people chose each online store as their favourite. Gojek, Shopee and Tokopedia are the top three favourite online platforms, while few respondents chose Elevenia, Blanja and Matahari Mall.

Favorite shopping platforms (Image by Author)

Based on my experience, I use Gojek mostly for food and Tokopedia for other things. So, I think it might be interesting to take a look at people’s preferences when buying certain items. The bar chart below indicates the number of people who prefer each online method when buying certain items, ranging from computers to phone credit. For each category, people can buy items via a marketplace, an official online store, online delivery or social media.

Preferred online shopping method based on items (Image by Author)

Overall, the marketplace seems to be the most preferred method when buying items in general. Most people use the marketplace when buying items like cosmetics and beauty products, fashion, hobbies, electronics and phone credit. However, around 75% of people choose online delivery when buying food and beverages. Also, the official online store is the favourite method when purchasing computers and phones. For groceries, online delivery and the marketplace seem to be the most popular methods.
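Figures like "75% choose online delivery for food" come from normalizing the counts within each item category. One possible way to do that with two-level columns is sketched below; the item and method labels and the four invented respondents are assumptions, not the survey data.

```python
import numpy as np
import pandas as pd

# Invented answers with two-level columns: (item category, shopping method).
columns = pd.MultiIndex.from_product(
    [["Food", "Computer"], ["Marketplace", "Official store", "Online delivery"]],
    names=["item", "method"],
)
nan = np.nan
answers = pd.DataFrame(
    [
        [1, nan, nan, 1, nan, nan],  # marketplace for both items
        [nan, nan, 1, nan, 1, nan],  # delivery for food, official store for computer
        [nan, nan, 1, nan, 1, nan],
        [nan, nan, 1, nan, 1, nan],
    ],
    columns=columns,
)

# Non-NaN answers per (item, method), reshaped to one row per item.
table = answers.count().unstack(level="method")

# Share of each method within an item category (each row sums to 1).
share = table.div(table.sum(axis=1), axis=0)

# Most preferred method per item category.
top_method = share.idxmax(axis=1)
```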


Combining personal data and behavioural data

Not only can we look at each feature on its own, but we can also combine two (or more) features to get more information about the data. The stacked bar chart below indicates the proportion of males and females on each platform. The proportion of females is significantly higher on Instagram, Line and WhatsApp. In contrast, there is no stark difference between the number of males and females on other platforms.

Preferred platform based on genders (Image by Author)
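Combining two questions in the 1/NaN layout amounts to filtering respondents by one question and counting answers in the other. The sketch below does this for gender and platform on six invented respondents; all names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Invented answers for six respondents, in the 1/NaN layout described earlier.
gender = pd.DataFrame(
    {"Male": [1, nan, nan, 1, nan, 1], "Female": [nan, 1, 1, nan, 1, nan]}
)
platforms = pd.DataFrame(
    {
        "Marketplace": [1, 1, nan, 1, 1, nan],
        "Instagram": [nan, 1, 1, nan, 1, 1],
    }
)

# For each gender, keep only those respondents and count their platform answers.
# The result has one row per platform and one column per gender.
counts = pd.DataFrame(
    {g: platforms[gender[g].notna()].count() for g in gender.columns}
)

# Turn counts into proportions within each platform (each row sums to 1).
proportions = counts.div(counts.sum(axis=1), axis=0)
```

With matplotlib installed, `proportions.plot.bar(stacked=True)` would be one way to draw the result as a stacked bar chart.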

Last but not least, I break down the three most popular online stores by monthly income. The stacked bar chart shows that people with a monthly income of less than Rp 2 million dominate all of the platforms, because around 60% of the respondents are in this category. Nonetheless, we can still observe that each platform has roughly the same income distribution. Based on this data, monthly income does not seem to heavily influence shopping platform preferences.

Preferred online shopping method based on items (Image by Author)
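To check whether one large group is simply dominating the raw counts, it helps to compare the income mix within each store against the overall income mix. The sketch below uses `pd.crosstab` on invented data and assumes the answers have first been converted to one label per respondent, which differs from the 1/NaN layout above; the store and income labels are illustrative only.

```python
import pandas as pd

# Invented long-format data: one favourite store and one income band per person.
df = pd.DataFrame(
    {
        "store": ["Gojek"] * 4 + ["Shopee"] * 3 + ["Tokopedia"] * 3,
        "income": [
            "<2M", "<2M", "2-5M", "5-10M",  # Gojek
            "<2M", "<2M", "2-5M",           # Shopee
            "<2M", "<2M", "2-5M",           # Tokopedia
        ],
    }
)

# Overall share of each income band across all respondents.
overall = df["income"].value_counts(normalize=True)

# Income mix within each store: normalize="index" makes every row sum to 1.
by_store = pd.crosstab(df["store"], df["income"], normalize="index")
```

If the rows of `by_store` look similar to `overall`, the dominance of the lowest income band in the chart mostly reflects the make-up of the sample rather than a store-specific preference.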

Conclusion

Data wrangling and visualization are not always fun, but they can give us more insight into the data. After cleaning and visualizing, we have some information regarding personal and behavioural data. In addition, combining features tells us more than either feature does alone.


You can find my code for this project here.


About the Author

Alif Ilham Madani is an aspiring Data Science and Machine Learning enthusiast who is passionate about gaining insight from others. He is majoring in Electrical Engineering at one of the top universities in Indonesia, Institut Teknologi Bandung.

If you have any topics to be discussed, you may connect with Alif via LinkedIn and Twitter @_alifim.

