
I had a final project in my Probability & Statistics class with Dr Dimitri Mahayana that involved analyzing e-commerce survey data in Indonesia. Each student collected data from several respondents, for a total of around 1,600 responses. The survey consists of more than 50 questions. Some of them cover personal data and others cover online shopping behaviour.
In this post, I am using data from the year 2019. I am going to show you how data wrangling and visualization can give us more insight into the data. We are going to look at personal data, behavioural data, and combinations of the two to get more information.
Dataset explanation
The survey used SurveyMonkey, a popular survey platform, for data collection. However, the raw data from SurveyMonkey turns out to be quite messy, with lots of unnamed columns and NaN values, so it takes some time to clean up and analyze. As you can see in the image below, the questions are in the column names and the answer options are in the first row. Thankfully, we can use Pandas, which is quite versatile for data wrangling.

Reorganizing the data
Data wrangling is not an easy job, especially if you are dealing with lots of data. It can be exhausting, because cleaning is an iterative process that you repeat until you get what you need. So I rearranged my data into a list of questions that I can access individually. For example, I can type in ‘Q32’ to access question number 32. Here is what the data looks like after rearrangement. (You can see my code on my GitHub.)

You can see how the questions and sub-questions moved into multi-index columns, with all the answers in the rows below them. You may also have noticed that there are still a lot of NaN values, because SurveyMonkey treats answer options as numbers. Let’s say there are two options, A and B. When a respondent chooses A, SurveyMonkey fills column A with 1 and column B with NaN. We will deal with that later when visualizing the data.
Because there are lots of questions in this survey, I only analyze some of them. I want to see some personal information such as gender, occupation and monthly income. Besides that, I am going to analyze respondents’ behaviour, such as their preferred online shopping platform and buying preferences. After that, I will combine some of the features to get more insight into the data.
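The rearrangement described above can be sketched roughly as follows. This is not the code from my repository, just a minimal illustration on a toy table: the question texts, option names and the `Unnamed: N` column labels are made-up stand-ins for what a raw export might look like.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a raw export (hypothetical questions and answers).
# Question text sits in the header; the unnamed columns belong to the
# question on their left. The first data row holds the answer options.
raw = pd.DataFrame(
    [
        ["Male", "Female", "Marketplace", "Instagram"],
        [1, np.nan, 1, np.nan],
        [np.nan, 1, 1, 1],
        [np.nan, 1, np.nan, 1],
    ],
    columns=["Q1 Gender", "Unnamed: 1", "Q2 Platform", "Unnamed: 3"],
)

# Blank out the unnamed headers, then forward-fill the question text.
header = pd.Series(raw.columns)
questions = header.where(~header.str.startswith("Unnamed")).ffill()
codes = questions.str.split().str[0]   # "Q1 Gender" -> "Q1"
options = raw.iloc[0]                  # first row = answer options

# Drop the options row and attach (question, option) as multi-index columns.
tidy = raw.iloc[1:].reset_index(drop=True)
tidy.columns = pd.MultiIndex.from_arrays(
    [codes.to_numpy(), options.to_numpy()], names=["question", "option"]
)
tidy = tidy.astype(float)              # answers are 1.0 or NaN

print(tidy["Q1"])                      # all sub-columns of question 1
```

With this layout, `tidy["Q2"]` returns every option column of question 2 at once, which is what makes per-question access convenient.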
Personal data
First of all, I would like to visualize the number of respondents by gender. The pie chart below represents the gender distribution among respondents. Female respondents slightly outnumber male respondents, by a margin of roughly 8%. Notice that when we visualize the data, we count the values in each column and ignore all NaN values.
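Counting values per column while ignoring NaN can be done with `DataFrame.count()`. The sketch below uses a tiny made-up gender table, not the survey data, and the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical answers to a gender question: 1 = chosen, NaN = not chosen.
gender = pd.DataFrame({
    "Male":   [1, np.nan, np.nan, 1, np.nan],
    "Female": [np.nan, 1, 1, np.nan, 1],
})

counts = gender.count()               # count() skips NaN values
share = counts / counts.sum() * 100   # percentage of respondents per option

print(counts)
print(share.round(1))

# With matplotlib installed, the same counts can feed a pie chart:
# counts.plot.pie(autopct="%1.1f%%")
```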

Next, we visualize the occupation distribution to understand our respondents’ backgrounds. The pie chart below illustrates that more than two-thirds of respondents are students. Private employees are the second-largest group, followed by housewives and other occupations.

Lastly, we take a look at their monthly income in Indonesian Rupiah. The question offers four income categories: under 2 million, between 2 and 5 million, between 5 and 10 million, and more than 10 million a month. The bar chart shows that almost one thousand respondents have a monthly income under Rp 2 million (equivalent to around US$135).

Behavioural data
Now, we move on to behavioural data. The first chart is the distribution of platforms that respondents use for online shopping. The bar chart illustrates that most of them use a marketplace. There are several marketplaces in Indonesia, for instance Tokopedia, Shopee and Bukalapak. Online delivery (such as Gojek and Grab) comes second, with more than 800 respondents. Instagram also has a significant share, as roughly 25% of respondents (400 out of 1,600) use it for online shopping.
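Because a respondent can pick several platforms, a share such as "400 out of 1,600" is computed against the number of respondents rather than the total number of picks. Here is a minimal sketch with made-up answers; the platform columns are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical multi-select answers: each row is one respondent.
platforms = pd.DataFrame({
    "Marketplace":     [1, 1, 1, np.nan],
    "Online delivery": [1, np.nan, 1, np.nan],
    "Instagram":       [np.nan, np.nan, 1, np.nan],
})

respondents = len(platforms)                           # denominator: people, not picks
usage = platforms.count().sort_values(ascending=False)
usage_pct = usage / respondents * 100

print(usage_pct)
```

The percentages can add up to more than 100, which is expected for a multi-select question.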

Since most respondents use a marketplace for online shopping, I want to dive deeper into the distribution of online stores and marketplaces. The bar chart below shows how many people chose each online store as their favourite. Gojek, Shopee and Tokopedia are the top three, while few people use Elevenia, Blanja or Matahari Mall.

In my own experience, I use Gojek mostly for food and Tokopedia for other things. So I think it might be interesting to take a look at people’s preferences when buying certain items. The bar chart below indicates the number of people who prefer each online method for each item category, ranging from computers to phone credit. For each category, people can buy items via a marketplace, an official online store, online delivery or social media.

Overall, the marketplace seems to be the most preferred method for buying items in general. Most people use a marketplace when buying cosmetics and beauty products, fashion, hobby items, electronics and phone credit. However, around 75% of people choose online delivery when buying food and beverages. The official online store is the favourite method for purchasing computers and phones. For groceries, online delivery and the marketplace are the two most popular methods.
Combining personal data and behavioural data
Besides looking at each feature on its own, we can also combine two (or more) features to get more information about the data. The stacked bar chart below indicates the proportion of males and females on each platform. The proportion of females is significantly higher on Instagram, Line and WhatsApp. In contrast, there is no stark difference between the number of males and females on the other platforms.
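One way to combine the two features is to collapse the gender columns into a single label per respondent and then group the platform answers by that label. The data and column names below are invented for illustration, and the sketch assumes every respondent answered the gender question.

```python
import numpy as np
import pandas as pd

# Hypothetical answers for six respondents: 1 = chosen, NaN = not chosen.
gender = pd.DataFrame({
    "Male":   [1, np.nan, np.nan, 1, np.nan, 1],
    "Female": [np.nan, 1, 1, np.nan, 1, np.nan],
})
platforms = pd.DataFrame({
    "Marketplace": [1, 1, np.nan, 1, 1, 1],
    "Instagram":   [np.nan, 1, 1, np.nan, 1, 1],
})

# One gender label per respondent: the column that is not NaN.
gender_label = gender.notna().idxmax(axis=1)

# Count platform users per gender, then put platforms on the rows.
by_platform = platforms.notna().groupby(gender_label).sum().T

# Normalise each platform row so the male and female shares sum to 1.
proportion = by_platform.div(by_platform.sum(axis=1), axis=0)

print(proportion)

# With matplotlib installed, a stacked bar chart follows directly:
# proportion.plot.bar(stacked=True)
```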

Last but not least, I break down the three most popular online stores by monthly income. The stacked bar chart shows that people with a monthly income below Rp 2 million dominate all of the platforms, because around 60% of the respondents are in this category. Nonetheless, we can still observe that each platform has roughly the same income distribution. Based on this data, monthly income does not appear to heavily influence shopping platform preferences.
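To compare income distributions across platforms of different sizes, it helps to normalise each platform to proportions. The sketch below uses `pd.crosstab` with `normalize="index"` on a small made-up table. It assumes one favourite store per respondent, which may not match how the actual question was asked.

```python
import pandas as pd

# Hypothetical long-format answers: one row per respondent.
answers = pd.DataFrame({
    "favourite": ["Gojek", "Gojek", "Shopee", "Shopee",
                  "Tokopedia", "Tokopedia", "Gojek", "Shopee"],
    "income":    ["< 2M", "2-5M", "< 2M", "< 2M",
                  "< 2M", "5-10M", "< 2M", "2-5M"],
})

# Rows: platform, columns: income bracket, values: share within the platform.
dist = pd.crosstab(answers["favourite"], answers["income"], normalize="index")

print(dist.round(2))
```

If the rows of `dist` look alike, the platforms share a similar income mix, regardless of how many users each platform has.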

Conclusion
Data wrangling and visualization are not always fun, but they can give us more insight into the data. After cleaning and visualizing, we have learned something about both the personal and the behavioural data. Combining features adds further insight.
You can find my code for this project here.
About the Author
Alif Ilham Madani is an aspiring Data Science and Machine Learning enthusiast who is passionate about gaining insight from others. He is majoring in Electrical Engineering at one of the top universities in Indonesia, Institut Teknologi Bandung.
If you have any topics to be discussed, you may connect with Alif via LinkedIn and Twitter @_alifim.