
Game of Phones.

Amazon user smartphone reviews and a couple of visualisations to see which brands people love the most.

Using R and ‘ggplot2’ to see what Amazon user reviews can tell us about the best smartphone to buy

Photo by Marianne Krohn on Unsplash

With the huge growth in web services and the data explosion of the last decade, the amount of review data on the internet is staggering. Websites such as IMDb and Yelp have created entire ecosystems of comparisons and reviews of products, films, restaurants and individuals. Most internet-based retailers, such as Amazon and eBay, have also incorporated reviews into their purchasing platforms to create a customer-centric hierarchy of the best products available.

Thousands, if not millions, of these reviews are strewn across social media platforms such as Twitter and Facebook, as well as retail websites and review sites. There is a wealth of publicly available data on these sites, and huge scope to delve into this raw, unstructured data for insights that could be lucrative for business, market research and innovation. Most reviews are also associated with a rating. A review typically consists of a title and a short paragraph describing the subject being reviewed, followed by a rating, usually out of five or ten.


There is a lot of raw consumer data about smartphones online. Rather than relying on expert reviewers to decide on the best phone, thousands of customer reviews of the same phone could serve as a more balanced litmus test of end-user opinion.

Data

The datasets used for this project are available from Kaggle here.

The full dataset, once merged, contains 82,815 entries and is broken down into the following columns:

brand : Product Brand : Categorical variable with 10 categories
title : Product Title : The name of the product being reviewed
url : Product URL : A URL link to buy the product
image : Product Image URL : A URL link to an image of the product
rating : Product Avg. Rating : The average rating of the product from the users
reviewUrl : Product Review Page URL : A URL to the product review page
totalReviews : Product Total Review : The number of total reviews of the product
prices : Product Prices : A range of potential prices for the product

name : Reviewer Name
rating : Reviewer Rating (scale 1 to 5)
date : Review Date : The date when the review was made
verified : Valid Customer : Whether or not the customer is a verified Amazon customer
title : Review Title : The title of the review
body : Review Content : The main content of the review
helpfulVotes : Helpful Feedbacks : How much feedback has been given to the review

Cleaning & Wrangling

The first stage in the initial cleaning process was to remove the columns deemed unnecessary for the task. The image, url, name, review URL, body and helpful votes columns were removed. The title column (which had already been merged) and the user rating column were also removed.

Next, the ‘prices’ variable was inspected. Because of the high number of NAs, and because many rows held a range of prices rather than a single value, a decision was made not to include this variable in the study.
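The column removal above can be sketched in base R roughly as follows. The tiny data frame is made-up toy data, the price string format is an assumption, and only a few of the dropped columns are shown; the code on my Github page may well do this differently.

```r
# Toy stand-in for the product table (values are invented for illustration).
items <- data.frame(
  asin   = c("B001", "B002", "B003"),
  brand  = c("BrandA", "BrandB", "BrandC"),
  image  = c("img1", "img2", "img3"),
  url    = c("u1", "u2", "u3"),
  prices = c(NA, "$99.00,$120.00", NA),  # assumed format: NA or a range
  stringsAsFactors = FALSE
)

# Share of missing prices, one way to justify dropping the variable.
na_share <- mean(is.na(items$prices))

# Drop the columns that are not needed for the analysis.
drop_cols <- c("image", "url", "prices")
items <- items[, !(names(items) %in% drop_cols), drop = FALSE]
```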

The first step in the initial wrangling process was to take the two datasets and perform a ‘left join’ to create one merged dataset. This was done using the ‘ASIN’ ID as the key for the merge.
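As a minimal sketch of that left join, assuming the key column is called `asin` and the reviews table is on the left (neither detail is stated above), base R's `merge()` with `all.x = TRUE` does the job. `dplyr::left_join()` would be an equally reasonable choice. The data below is toy data.

```r
# Toy stand-ins for the two Kaggle tables; `asin` is an assumed column name.
items <- data.frame(
  asin  = c("B001", "B002"),
  brand = c("BrandA", "BrandB"),
  stringsAsFactors = FALSE
)
reviews <- data.frame(
  asin   = c("B001", "B001", "B002", "B999"),
  rating = c(5, 3, 4, 2),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every row of the left table (reviews), even when no
# matching product exists; unmatched rows get NA in the product columns.
merged <- merge(reviews, items, by = "asin", all.x = TRUE)
```

Here the review for the unknown product `B999` survives the join with `NA` as its brand, which is the behaviour that distinguishes a left join from an inner join.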

The next step was to combine the title of each review with the actual content of the review to create a super column which contained both the title and the body of the review.
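Building the combined column can be as simple as the sketch below. The column name `text` is a hypothetical choice for illustration, and the two reviews are invented.

```r
# Toy reviews; `title` and `body` follow the column list above.
reviews <- data.frame(
  title = c("Great phone", "Battery issues"),
  body  = c("Works perfectly.", "Dies by lunchtime."),
  stringsAsFactors = FALSE
)

# paste() joins the two columns row by row, separated by a single space.
reviews$text <- paste(reviews$title, reviews$body)
```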

Visualisations Theme

The project covered many brands of phone, so a colour scheme was needed that would respect the brand integrity of all phones while expressing the diversity of the brands. It was decided that a spread of colours, similar to a rainbow, would be used as a theme across the three visualisations to keep them consistent with each other and to highlight the visualiser’s neutrality with regard to the information.

The visualisations are presented from the macro to the micro. The first visualisation focuses on a general overview of the reviews: what are the most common words and sentiments expressed about phones?

The next visualisation takes a step deeper and focuses on how users feel about specific brands. I wanted to show which brand users felt was the top one on the market.

The Humble Word Cloud

Some additional cleaning was needed to make something worth paying attention to. Common English ‘stop words’ (words that do not contribute to understanding the sentiment, such as conjunctions and pronouns) and punctuation were removed from the set, as they are not useful for analysing sentiment in this case. A term document matrix (TDM) for the top ten words was created to assess which words appeared most frequently. From the TDM it was clear that more words needed to be added to the stop word list, as some obvious words regarding phones were appearing frequently. For example, the word ‘phone’ was the most popular (not surprising, considering these are phone reviews).
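Here is a rough sketch of that stop word and TDM step, assuming the ‘tm’ package is used (the text above does not name the package). The three reviews are invented, and ‘phone’ is added to the stop word list as described.

```r
library(tm)

# Invented reviews standing in for the real review text.
texts <- c(
  "Great phone, great battery!",
  "The battery died after a week.",
  "Love this phone: battery lasts all day"
)

corpus <- VCorpus(VectorSource(texts))
# Lowercase first, because the built-in stop word list is lowercase.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
# Standard English stop words plus the domain-specific word 'phone'.
corpus <- tm_map(corpus, removeWords, c(stopwords("english"), "phone"))

# Term document matrix: one row per term, one column per review.
tdm <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
top_ten <- head(freq, 10)
```

Inspecting `top_ten` is what reveals words like ‘phone’ that carry no information for this dataset and belong on the stop word list.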

Image by Author

A sample of the first ten thousand reviews was used (memory errors caused by the large volume of data made it impossible to include all of the reviews). The columns then needed to be converted to a text file so they could be manipulated appropriately. The next step was to turn all unnecessary characters in the document into whitespace.

From here, all of the text was converted to lowercase, both to standardise the content and because it looks more visually appealing in this form.
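The sampling and character clean-up can be sketched in base R like this. Which characters count as ‘unnecessary’ is not spelled out above, so the pattern below (anything that is not a letter, digit or space) is an assumption, as is the `text` column name.

```r
# Invented reviews; `text` is a hypothetical name for the combined column.
reviews <- data.frame(
  text = c("Great phone!!! 10/10", "Battery life: poor :("),
  stringsAsFactors = FALSE
)

# Keep at most the first ten thousand reviews to stay within memory.
sample_text <- head(reviews$text, 10000)

# Replace every character that is not alphanumeric or whitespace with a space.
cleaned <- gsub("[^[:alnum:][:space:]]", " ", sample_text)

# Standardise case.
cleaned <- tolower(cleaned)
```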

Insight

What ideas and sentiments are most prevalent when people review phones? This viz is intended to be the overall view of the phone reviews from Amazon and also an overall view of the dataset content.

The ‘random order’ parameter was set to false so it could be easily understood which words are the most frequent in reviews. The max number of words was set to 200 so as not to overload the graph and words needed to appear just once to be potentially included.

The ‘Dark2’ colour scheme was chosen with a limit of eight colours. This fits in with the rainbow colour scheme set at the beginning and also minimises the mental load of trying to understand a graph with too many colours.
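Put together, those settings correspond to a call along the following lines, assuming the ‘wordcloud’ and ‘RColorBrewer’ packages. The word frequencies here are invented placeholders, not results from the dataset.

```r
library(wordcloud)
library(RColorBrewer)

# Placeholder frequencies for illustration only.
freq <- c(battery = 30, screen = 22, camera = 18, great = 15, price = 9)

set.seed(42)  # the layout is partly random, so fix the seed for repeatability
wordcloud(
  words        = names(freq),
  freq         = freq,
  min.freq     = 1,      # a word only needs to appear once to be eligible
  max.words    = 200,    # cap the cloud so it does not become unreadable
  random.order = FALSE,  # most frequent words are plotted first, in the centre
  colors       = brewer.pal(8, "Dark2")  # eight-colour 'Dark2' palette
)
```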

This visualisation relates to an overall view of the content of the reviews and aims to answer questions about general sentiment and use of language with regard to phones in general.

A deeper look…

Image by Author

The first cleaning step for this visualisation was to standardise the name of the ‘title’ variable, as invisible characters in the name were reducing functionality.

Wrangling:

For this one, it was necessary to find the brands with the best ratings, so a new variable was created holding one row per brand, with a value equal to the average of all that brand’s ratings.

It was also important to use the ‘reorder’ function to display the ratings from lowest to highest, so that the best brand is clear at a glance.
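A minimal sketch of the per-brand average and the reordered bar chart is shown below. The brands and ratings are invented, and using base R's `rainbow()` for the fill colours is my assumption about how the rainbow theme could be applied.

```r
library(ggplot2)

# Invented ratings; one row per review.
reviews <- data.frame(
  brand  = c("BrandA", "BrandA", "BrandB", "BrandB", "BrandC"),
  rating = c(5, 4, 3, 2, 4),
  stringsAsFactors = FALSE
)

# One row per brand, holding the mean rating across that brand's rows.
brand_ratings <- aggregate(rating ~ brand, data = reviews, FUN = mean)

# reorder() sorts the brands by average rating, lowest to highest.
p <- ggplot(brand_ratings,
            aes(x = reorder(brand, rating), y = rating, fill = brand)) +
  geom_col() +
  scale_fill_manual(values = rainbow(nrow(brand_ratings))) +
  labs(x = "Brand", y = "Average rating") +
  theme(legend.position = "none")
```

Printing `p` draws the chart. Without `reorder()` the bars would appear in alphabetical order, which makes the ranking much harder to read.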

Insight

This viz tries to answer questions about the top brands from the Amazon reviews. Which brands do user reviews identify as the best? A bar chart of all the brands with their corresponding average ratings was plotted, and the rainbow colour scheme was incorporated.

Hmmm, interesting…..

There are some cool findings and insights here, and there’s tons more to be discovered by looking into the dataset further.

One thing I thought about while completing this project was fairness and representation in Data Science. Due to the nature of the exploration, if I created my own phone brand, the ‘i-Alan’, and gave myself a single five-star review, then I would have the highest-rated phone on Amazon and would be at the top of my own bar chart.

All of the R code for this project is available over here on my Github page. 🙂

