Twitter Sentiment Analysis & Algorithm Comparison for Uber & Ola Using ‘R’

Analysis and comparative study of machine learning algorithms

Photo by Markus Winkler on Unsplash

This article presents our research on understanding public opinion of the Uber and Ola cab services across India using machine learning.

Important Points

  1. Twitter is a social networking site with a large number of active users. Hashtagged content is widely popular on Twitter, so the platform holds large amounts of data in which users tweet their reviews.
  2. The two most popular cab services in India today are Uber and Ola, each with a vast number of users from different lifestyles.
  3. Sentiment analysis of these tweets is used to understand what opinion people have about a product or service.
  4. In this research, we propose two models for sentiment analysis, based on Naïve Bayes and Support Vector Machine (SVM).
  5. The system uses the R statistical programming language to generate outputs.

Introduction

Language is a powerful tool for expressing emotions. Sentiment analysis is a form of text mining that helps a business understand the social sentiment people hold about its brand or product. It uses natural language processing and data mining techniques to solve real-world problems. Insights drawn from user reviews can also help a business improve: a brand can judge how successful a newly launched product is and how customers react to the product or service, that is, whether they are satisfied or not.

In this paper, tweets are treated as people's reviews and are divided into two sentiments, positive and negative. Because users tweet in the languages they are most comfortable with, many tweets contain text that is difficult to clean. The datasets used in this paper are ‘UBER’ and ‘OLA’.

The project uses R, a statistical programming language for computing and data analysis. We chose it because it offers a range of packages for analyzing and understanding data, for example e1071. The paper uses two machine learning algorithms, SVM (Support Vector Machine) and Naïve Bayes. Both are supervised classification algorithms that assign data to categories, and both were selected because they give good results for text classification.

Literature Review

The advent of the internet has made it easy to share information widely. Today, information is available on various social media platforms, where people post their reviews and suggestions. These reviews can be studied to analyze market trends.

Twitter is one such platform. It was founded in 2006 by Jack Dorsey, Noah Glass, Biz Stone, and Evan Williams, and has since grown to a community of 300 million active users and 145 million daily active users.

These numbers explain why many businesses have started paying close attention to data collected from Twitter. Its users come from a wide range of social interests and domains, which adds to the breadth of the community. With this breadth and the available technology, researchers began running sentiment analysis on data gathered from Twitter. In the 2005 paper "Tapping into the Power of Text Mining", Weiguo Fan, Linda Wallace, Stephanie Rich, and Zhongju Zhang explained how text can be mined for useful information by establishing relationships between words. This shows the power of text and how data drawn from it can reveal such relationships. Researchers then started extracting data from Twitter to see how this widely available data could be put to use to infer opinions from tweets. Earlier work by Alexander Pak and Patrick Paroubek, "Twitter as a corpus for sentiment analysis and opinion mining" (2010), shed further light on how Twitter sentiment can help in forming an opinion.

Motivation

Twitter sentiment analysis has already been widely showcased in other domains, such as movie review systems and disease prediction. We felt that the domain of cab services could benefit greatly from the same kind of analysis. Uber has about 110 million users and fulfills about 17 million rides per day (as of May 2019), with about 5 million users in India (as of August 2017). Ola has about 150 million users and about 2 million rides each day, with about 23.9 million users in India (as of November 2019). These figures add to the uniqueness of the datasets. With this much data available, the two companies are sitting on mountains of user data. These large numbers made us curious to understand people's reviews of the cab services on Twitter and how sentiment analysis can help interpret them for further improvements.

Methodology

The flowchart below shows the sentiment analysis process.

Sentiment Analysis Flowchart - Image by Author

The algorithms considered for classification are SVM and Naïve Bayes.

SVM (Support Vector Machine) Classification Algorithm

The Support Vector Machine can be described as a binary classifier. It attempts to find a hyperplane that separates two classes of data by the largest margin. SVM supports two types of separation: linear and non-linear. Non-linear classification is done with the help of the kernel trick. The kernel plays an important role because, when the classes cannot be separated by a linear margin in 2-D, the data has to be lifted into a higher dimension, such as 3-D, where a plane can separate it. Given two classes, shown as circles and squares, SVM creates the hyperplane that divides them with the maximum margin.
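The idea of lifting data into a higher dimension can be sketched in a few lines. The example below is written in Python purely for illustration (the project itself uses R), and it is not an SVM: the sample points, the feature map, and the threshold are all assumptions chosen by hand, whereas a real SVM would learn the maximum-margin plane and apply the kernel implicitly.

```python
# Illustrative sketch: lifting 2-D points into 3-D so a flat plane can separate them.
# The sample points and the threshold are made-up assumptions for this example.

def lift(point):
    """Map (x, y) to (x, y, x^2 + y^2), an explicit polynomial feature map."""
    x, y = point
    return (x, y, x * x + y * y)

# The "circle" class sits near the origin and the "square" class surrounds it,
# so no straight line in 2-D can split the two groups.
circles = [(0.0, 0.5), (0.5, 0.0), (-0.4, -0.3), (0.2, -0.6)]
squares = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5), (1.8, 1.2)]

def classify(point, threshold=2.0):
    """In the lifted space, the plane z = threshold separates the classes."""
    _, _, z = lift(point)
    return "circle" if z < threshold else "square"

circle_predictions = [classify(p) for p in circles]
square_predictions = [classify(p) for p in squares]
print(circle_predictions)
print(square_predictions)
```

After the lift, the third coordinate alone is enough to tell the two groups apart, which is what a non-linear kernel achieves without computing the new coordinates explicitly.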

Naïve Bayes Classification Algorithm

Naïve Bayes is also a classification algorithm, based on Bayes' theorem. It is not a single algorithm but a family of algorithms that estimate the probability of an event occurring. The principle they share is that every pair of features being classified is assumed to be independent of each other. Bayes' theorem, on which Naïve Bayes is built, gives the probability of A given B as P(A|B) = P(B|A) × P(A) / P(B). Because the features are treated as independent, the algorithm considers the contribution of each feature separately, which sets it apart from other algorithms. As Figure 3 shows, the conditional probability of B given that A has already occurred is multiplied by the probability of A and divided by the probability of B.
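To make the independence assumption concrete, here is a minimal sketch of a word-count Naïve Bayes classifier. It is written in Python for illustration and is not the project's R code. The four training tweets are invented, and the add-one smoothing and the helper names `fit` and `predict` are assumptions of this sketch.

```python
import math
from collections import Counter

# Toy training tweets, invented for illustration (not from the Uber or Ola datasets).
train = [
    ("great ride friendly driver", "positive"),
    ("quick pickup great service", "positive"),
    ("driver cancelled late pickup", "negative"),
    ("rude driver bad service", "negative"),
]

def fit(samples):
    """Count class frequencies and per-class word frequencies."""
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for text, label in samples:
        class_counts[label] += 1
        words = text.split()
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict(text, model):
    """Pick the class with the highest log posterior, using add-one smoothing."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        # log P(class): the prior
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.split():
            if word in vocab:  # words never seen in training are skipped
                # log P(word | class), treated as independent of the other words
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

model = fit(train)
print(predict("great friendly service", model))
print(predict("late rude driver", model))
```

The denominator P(B) from Bayes' theorem is the same for every class, so the sketch leaves it out and compares only the numerators.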

The Classification Process Flowchart - Image by Author

The steps in the flowchart are explained below in numbered form:

  1. Positive and negative tweet data is created.
  2. The total number of tweets in the data frame is 3000.
  3. It is divided into 2 separate data frames.
  4. The larger one is the training data frame, which is 70% of 3000.
  5. The other is the testing data frame, which is 30% of 3000.
  6. Once the data is split, the training data is used to train the different classification algorithms.
  7. The trained models are then evaluated on the testing data to check what accuracy they achieve.
  8. The accuracy is then explained with the help of the confusion matrix.
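The 70/30 split in these steps can be sketched as follows. This is a Python illustration rather than the project's R code. The placeholder tweets, the fixed seed, and the `train_test_split` helper are assumptions made for the example.

```python
import random

# Placeholder rows standing in for 3000 labeled tweets; the labels here are synthetic.
tweets = [(f"tweet {i}", "positive" if i % 2 == 0 else "negative") for i in range(3000)]

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it at the training fraction."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train_rows, test_rows = train_test_split(tweets)
print(len(train_rows), len(test_rows))  # 2100 900
```

Shuffling before cutting keeps positive and negative tweets mixed in both parts, and the fixed seed makes the split repeatable.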

Experimental Results

In this paper, we reviewed people's sentiments using tweets extracted from Twitter. These sentiments helped in understanding how people perceive each company. Finally, word clouds were generated for Uber and Ola, displaying the words that were used most frequently.
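A word cloud is driven by word frequencies, so the counting step behind it can be sketched as below. This is a Python illustration, not the project's R code. The sample tweets, the short stop-word list, and the cleaning rules are assumptions, and drawing the cloud itself is left out.

```python
import re
from collections import Counter

# Made-up sample tweets; the stop-word list is a small illustrative assumption.
sample_tweets = [
    "Great ride with @Uber today, driver was friendly! https://t.co/xyz",
    "Driver cancelled again... #uber please fix this",
    "Friendly driver and clean car, great ride",
]
STOP_WORDS = {"with", "was", "and", "this", "again", "please", "today"}

def tokenize(tweet):
    """Lowercase the tweet, drop URLs and @mentions, and keep alphabetic words."""
    tweet = re.sub(r"https?://\S+|@\w+", " ", tweet.lower())
    return [w for w in re.findall(r"[a-z]+", tweet) if w not in STOP_WORDS]

word_freq = Counter(word for tweet in sample_tweets for word in tokenize(tweet))
print(word_freq.most_common(3))
```

The words with the highest counts are the ones a word cloud would draw largest.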

The secondary purpose of this paper was to categorize the data (tweets).

The data was categorized into 2 categories:

  1. Positive
  2. Negative

The data was then partitioned, with 70% used for training and the remaining 30% used for testing. After training, each model was tested to check how well it had learned. To interpret the results, accuracy was calculated from the confusion table.

The formula for accuracy is

Accuracy = (a + b) / (a + b + c + d)

Confusion Matrix

Confusion Matrix for Positive & Negative Tweets - Image by Author

The above table shows the confusion matrix, from which accuracy is calculated using the values in the respective cells.
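As a worked example of the accuracy formula, the Python sketch below assumes that a and b are the counts of correctly classified positive and negative tweets and that c and d are the misclassified counts. The four numbers are hypothetical and are not the results reported in this article.

```python
# Hypothetical counts for a 900-tweet test set; they are not the article's results.
# a, b: tweets classified correctly (assumed: positive and negative respectively)
# c, d: tweets classified incorrectly
a, b, c, d = 420, 390, 50, 40

def accuracy(a, b, c, d):
    """Share of correct predictions: (a + b) / (a + b + c + d)."""
    return (a + b) / (a + b + c + d)

acc = accuracy(a, b, c, d)
print(f"{acc:.2%}")  # 90.00%
```

With these counts, 810 of the 900 test tweets are classified correctly, which gives an accuracy of 90%.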

Overall Accuracy on the 3000-Tweet Datasets for Uber & Ola

Accuracy for Uber & Ola - Image by Author

The figure above shows the overall accuracy of the two algorithms on the two datasets, calculated with the help of a confusion matrix. SVM performed well on both datasets, but Naïve Bayes outperformed it on both.

Visual Representation of Algorithms - Image by Author

Correct & Incorrect Tweets based on Datasets

Bifurcation of Correct & Incorrect Tweets - Image by Author

The figure above shows the correctly and incorrectly classified tweets for the two datasets, Uber and Ola, under the two algorithms, SVM and Naïve Bayes.

Conclusion

The focus of this paper is to compare the accuracy of the two classification algorithms and to understand people's sentiments with the help of sentiment analysis. The two algorithms were compared on the sentiment classification of tweets. The experimental data showed that the classifier trained with Naïve Bayes yielded better results on the Uber dataset, and likewise on the Ola dataset. Naïve Bayes was dominant in both cases, with an accuracy of 86.65% for Uber and 73.64% for Ola. Thus, Naïve Bayes can be said to be the better algorithm for classifying the Uber and Ola datasets. Finally, people's sentiments in the tweets are shown with the help of a word cloud, a visual representation of the words used most often in the tweets, which helps us understand what people want to convey. This can help an organization understand its customers and improve its business through an understanding of sentiment.

References

[1] Lopamudra Dey, Sanjay Chakraborty, Anuraag Biswas, Beepa Bose, Sweta Tiwari. "Sentiment Analysis of Review Datasets Using Naïve Bayes’ and K-NN Classifier", International Journal of Information Engineering and Electronic Business, 2016.

[2] P. Kalaivani, "Sentiment Classification of Movie Reviews by supervised machine learning approaches", Indian Journal of Computer Science and Engineering (IJCSE), ISSN: 0976–5166, Vol. 4, No. 4, Aug-Sep 2013.

[3] "Progress in Computing, Analytics and Networking", Springer Science and Business Media LLC, 2018.

[4] Manish N. Tibdewal, Swapnil A. Tale. "Multichannel detection of epilepsy using SVM classifier on EEG signal", 2016 International Conference on Computing Communication Control and automation (ICCUBEA), 2016.

[5] Weiguo Fan, Linda Wallace, Stephanie Rich, and Zhongju Zhang, "Tapping into the Power of Text Mining", Journal of ACM, Blacksburg, 2005.

Before You Go

Research Paper: https://sersc.org/journals/index.php/IJFGCN/article/view/17896

Code: https://github.com/yashindulkar/Ola-Sentiment-Analysis-using-R

Code: https://github.com/yashindulkar/Uber-Sentiment-Analysis-using-R

Special Thanks to my Co-Author Abhijitpatil

