Sentiment analysis of an online store independent of pre-processing

In this article, the main goal is to examine how far it is possible to cut down on the need for text pre-processing.

What do you do when pre-processing is difficult or time-consuming?

In this article, the main goal is to examine how far it is possible to cut down on the need for text pre-processing. In the following, I will explain briefly; you can find the details here:

Persian sentiment analysis of an online store independent of pre-processing using convolutional…

A convolutional network architecture for sentiment classification. Reference: 10.7717/peerj-cs.422/fig-1
1. Abstract

Sentiment analysis plays a key role for companies, especially stores, and increasing the accuracy of determining customers’ opinions about products helps them maintain their competitive position. We intend to analyze users’ opinions on the website of the largest online store in Iran, Digikala. However, the Persian language is unstructured, which makes the pre-processing stage very difficult, and this is the main problem of sentiment analysis in Persian. What exacerbates this problem is the lack of available libraries for Persian pre-processing, since most libraries focus on English. To tackle this, approximately 3 million Persian reviews were gathered from the Digikala website using web-mining techniques, and the fastText method was used to create a word embedding. It was assumed that this would dramatically cut down on the need for text pre-processing, since the skip-gram method considers the position of words in the sentence and the words’ relations to each other. Another word embedding was created using TF-IDF in parallel with fastText to compare their performance. In addition, the results of Convolutional Neural Network (CNN), BiLSTM, Logistic Regression, and Naïve Bayes models were compared. As a significant result, we obtained 0.996 AUC and 0.956 F-score using fastText and CNN. This article not only demonstrates to what extent it is possible to be independent of pre-processing but also achieves better accuracy than other research done in Persian. Avoiding complex text pre-processing is also important for other languages, since most text pre-processing algorithms have been developed for English and cannot be used for other languages. Thanks to its high accuracy and independence from pre-processing, the created word embedding has other applications in Persian besides sentiment analysis.
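As a point of reference for the TF-IDF baseline mentioned above, the weighting itself is simple enough to sketch in a few lines of pure Python. This is an illustrative toy, not the paper's code, and the tiny example documents are made up:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Hypothetical tokenized reviews (translated for readability).
docs = [["good", "phone", "good"], ["bad", "battery"], ["good", "battery"]]
w = tf_idf(docs)
```

Note how "phone", which appears in only one document, outweighs the more frequent but less discriminative "good" in the first review; that is the behavior the paper relies on when feeding TF-IDF features to Naïve Bayes and Logistic Regression.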

2. Methodology

A flowchart of the steps taken. Reference: 10.7717/peerj-cs.422/fig-4

In this article, we seek to analyze the sentiment of customer reviews on the website of the largest and best-known online store in Iran (Digikala). At first, linguistic problems were taken into account as a significant challenge. There are several problems in Persian text pre-processing, such as the use of slang, the use of letters from other languages (especially Arabic), and the lack of a clear boundary between phrases. To tackle these problems, we employed fastText, because we wanted to examine whether the use of this method can reduce the need for data pre-processing and make language processing easier. In the following, we will inspect this assumption and compare the obtained results with other algorithms and other reports. Another severe limitation was that deep learning models require an immense dataset, but most of the available datasets in Persian are so small that they cannot be used in deep models. Thus, a rich and immense dataset had to be extracted from the Digikala website, which was done with web-mining methods. It should be noted that this article seeks to achieve the following goals:

Investigating the reduction of the need for text pre-processing by implementing methods such as fastText, whether in Persian language processing or in other languages;

Gathering a comprehensive customer-review dataset covering various types of digital goods, in order to create a general word embedding for a wide range of tasks related to digital goods;

Sentiment analysis of the Digikala website’s reviews with high accuracy, even compared to other research.
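The assumption behind the first goal — that skip-gram captures word positions and relations — comes down to how training pairs are generated: each word learns to predict its neighbors within a context window. A minimal, dependency-free sketch of that pair generation (the paper itself trains full fastText embeddings on the 3-million-review corpus, and fastText additionally breaks words into subword n-grams, which is exactly what helps with unnormalized Persian word forms):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs as in the skip-gram model."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical review, translated for readability.
sentence = "this phone has a great battery".split()
pairs = skipgram_pairs(sentence, window=2)
```

Because every pair encodes which words co-occur near each other, two spelling variants of the same Persian word end up surrounded by the same contexts (and, via subword n-grams, share most of their representation), so the embedding places them close together without explicit normalization.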

Having access to a large dataset with richness and content integrity is indispensable for training a deep model. Most datasets available for training deep models and for sentiment analysis are in English. To collect a rich dataset, web-mining methods were used and the reviews on the Digikala website, which are in Persian, were extracted. Reviews posted by buyers express their level of satisfaction with their purchase and with product features. After submitting their reviews, buyers could choose between the "I suggest" and "I do not suggest" options. These two options were extracted and used as labels for the sentiment analysis problem. Our goal was to analyze the opinions of users of the Digikala website, so we extracted the data of the section related to digital goods using web-mining libraries such as Beautiful Soup. Beautiful Soup is a Python package for parsing XML and HTML documents, and it is useful for web scraping. In this way, the digital-goods reviews of the Digikala website were extracted, a total of 2,932,747 reviews.
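The paper uses Beautiful Soup for this step; to keep the sketch below dependency-free, the same extraction idea is shown with Python's stdlib `html.parser` instead. The page structure and the `review` class name are hypothetical placeholders, not Digikala's real markup:

```python
from html.parser import HTMLParser

class ReviewExtractor(HTMLParser):
    """Collect the text inside <div class="review"> elements (hypothetical markup)."""
    def __init__(self):
        super().__init__()
        self.in_review = False
        self.reviews = []

    def handle_starttag(self, tag, attrs):
        # Enter "review mode" when a review container opens.
        if tag == "div" and ("class", "review") in attrs:
            self.in_review = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_review = False

    def handle_data(self, data):
        # Keep only non-empty text that appears inside a review container.
        if self.in_review and data.strip():
            self.reviews.append(data.strip())

# A made-up page fragment standing in for a fetched product page.
page = '<div class="review">Great phone</div><p>ad</p><div class="review">Weak battery</div>'
parser = ReviewExtractor()
parser.feed(page)
```

With Beautiful Soup the same extraction would be roughly `[d.get_text() for d in soup.find_all("div", class_="review")]`; the stdlib version just makes the state machine explicit.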

3. Results

TF-IDF and fastText methods were used to extract the features. The BiLSTM and CNN models used the fastText output, while the Naïve Bayes and Logistic Regression models used the TF-IDF output, and their accuracy was compared in Table 4. According to this table, the BiLSTM and CNN models are more accurate than the others, and CNN gave the best results. As expected, due to the use of the fastText method, the need for data pre-processing was reduced. In other words, stemming and normalization methods did not affect the final result. To examine this more closely, we chose the CNN model as the best model and performed the sentiment analysis process once with the pre-processing steps and once without them. The AUC and F-score were 0.9943 and 0.9291 before pre-processing, and 0.9944 and 0.9288 after pre-processing. The results can be seen in Table 5; in that table, "before pre-processing" means before the stemming and normalization steps. In other words, the method used to create the word embedding can map the same words to the same region of the space without the need to standardize letters and without the need to identify the original root of words. In contrast to pre-processing, the use of the pseudo-labeling method significantly improved the results: after pseudo-labeling, the AUC and F-score improved to 0.996 and 0.956.
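The pseudo-labeling step mentioned above can be sketched as a loop: train on the labeled reviews, predict the unlabeled ones, and add back only the high-confidence predictions as new training examples. Everything below is illustrative — the toy word-count classifier and the 0.8 threshold are stand-ins for the paper's actual CNN and its chosen confidence cutoff:

```python
from collections import Counter

def train(labeled):
    """Count word occurrences per class from (tokens, label) pairs."""
    counts = {0: Counter(), 1: Counter()}
    for tokens, label in labeled:
        counts[label].update(tokens)
    return counts

def predict(counts, tokens):
    """Return (label, confidence) by comparing per-class word scores."""
    scores = {c: sum(counts[c][t] for t in tokens) for c in (0, 1)}
    total = scores[0] + scores[1]
    if total == 0:
        return 0, 0.0
    label = max(scores, key=scores.get)
    return label, scores[label] / total

def pseudo_label(labeled, unlabeled, threshold=0.8):
    """Add confidently predicted unlabeled samples to the training set."""
    counts = train(labeled)
    augmented = list(labeled)
    for tokens in unlabeled:
        label, conf = predict(counts, tokens)
        if conf >= threshold:
            augmented.append((tokens, label))
    return augmented

# Hypothetical tokenized reviews: label 1 = "I suggest", 0 = "I do not suggest".
labeled = [(["great", "battery"], 1), (["weak", "screen"], 0)]
unlabeled = [["great", "screen", "great"], ["weak"]]
augmented = pseudo_label(labeled, unlabeled)
```

Here only the unambiguous review (`["weak"]`) clears the threshold and joins the training set; the mixed one is left out, which is the point of thresholding — noisy pseudo-labels would otherwise degrade the retrained model.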

Performance of different models based on AUC and F-measure, and performance of the CNN model in different situations based on AUC and F-measure. Reference: 10.7717/peerj-cs.422/table-4 and 10.7717/peerj-cs.422/table-5

The suggested model achieved better results than previous models that used pre-processing methods in Persian sentiment analysis. For instance, some researchers introduced pre-processing algorithms and succeeded in enhancing the results of machine learning algorithms (Saraee & Bagheri, 2013). In that research, the F-score of the proposed pre-processing algorithms, employing Naïve Bayes as the classifier, is 0.878. In another study, various alternatives for pre-processing and classifier algorithms were examined, and the best result was achieved with an SVM classifier, with an F-score of 0.915 (Asgarian, Kahani & Sharifi, 2018). Also, some studies attempted to utilize state-of-the-art deep models in a way that reduces the dependency on pre-processing and avoids complex steps (Roshanfekr, Khadivi & Rahmati, 2017); the F-scores of the BiLSTM and CNN algorithms in that research are 0.532 and 0.534. All of the mentioned articles focus on two-class sentiment analysis of digital-goods reviews in Persian, the same as this article. A comparison of the results in this paper with other research and other common algorithms indicates that not only has the dependence on data pre-processing been eliminated but the accuracy has also increased significantly.

4. Conclusion

A dataset of approximately 3 million reviews was extracted from the digital-goods section of the Digikala website, the largest online store in Iran. Basic pre-processing methods were used to modify the words and tokenize them. Due to the lack of labels for a large part of the dataset, the pseudo-labeling method was employed, which improved the results. Data balancing was also performed using random over-sampling. Persian data pre-processing was found to be difficult, so the fastText method was used to reduce the need for data pre-processing and to develop the word embedding. The embeddings were employed as the input to the BiLSTM and CNN models. Using the suggested model, not only are the obtained results very desirable and much more accurate in Persian compared to other reports, but there are also no complications related to data pre-processing. The effect of stemming and normalization on the output was evaluated, and it was revealed that the proposed method is not dependent on data pre-processing. Finally, besides the comparison of machine learning and deep learning methods in sentiment analysis, the TF-IDF and fastText methods were compared for creating the word embedding. The best result was associated with fastText and CNN. The main achievement of this model is the reduction of the need for data pre-processing. Data pre-processing in English is convenient and accurate due to the extensive text pre-processing libraries. However, in other languages, data pre-processing is very complicated because of the lack of proper libraries. The suggested model proved that this need is largely solvable (AUC = 0.996) and that the pre-processing steps can be reduced to preliminary tokenization. Avoiding complex text pre-processing is also important for other languages, since most text pre-processing algorithms have been developed for English and cannot be used for other languages.
In other words, the steps taken can be implemented for other languages to achieve the same results independently of pre-processing. Moreover, due to its high accuracy, the created word embedding can be used in other text analysis problems, especially those related to online digital goods.

GitHub:

mosiomohsen/persian-sentiment-analysis-using-fastText-word-embedding-and-pseudo-labeling
