
Identification of Complaint Relevant Posts on Social Media

Preliminary work towards a platform-agnostic semi-supervised approach for complaint detection

Through this blog, I aim to explain the work done in the paper "Semi-Supervised Iterative Approach for Domain-Specific Complaint Detection in Social Media", accepted at the 3rd Workshop on e-Commerce and NLP at the Association for Computational Linguistics (ACL) 2020. This venue brings together research including, but not limited to, computational linguistics, cognitive modeling, information extraction, and semantics. Our work is one of the first attempts to leverage the discursive social media landscape to list complaints and identify grievances. We demonstrate the utility of our approach by evaluating it over transport-related services on Twitter. This post gives a brief overview of the motivation, methodology, and applications of this research; more technical details can be found in the paper. Our team eagerly looks forward to any suggestions and improvements regarding this work.


Motivation

Social media has lately become one of the primary venues where users express their opinions about various products and services. These opinions are extremely useful in understanding users' perceptions and sentiment about these services. They are also valuable for identifying potential defects and are critical to the execution of downstream customer service responses. Public-sector services like transport and logistics are strongly affected by public opinion and form a critical aspect of a country's economy. Often, businesses rely on social media to ascertain customer feedback and to initiate a response. Therefore, automatic detection of user complaints on social media could prove beneficial to both the clients and the service providers.

A social media user tagging relevant authorities with their grievances.
Transportation-related companies having significant market shares.

Traditionally, listing complaints involves social media users tagging the relevant authorities in their posts. However, this approach comes with a set of drawbacks that reduce its utility. The prevalence of posts where the concerned authorities are tagged is low compared to those where they are not. Additionally, social media platforms are plagued with redundancy, where posts are rephrased or structurally morphed before being re-posted. Also, the vast amount of inevitable noise makes it hard to identify posts that may require immediate attention.


Our Contribution

To build such detection systems, we could employ supervised approaches, which would typically require a large corpus of labeled training samples. However, as discussed, labeling social media posts that capture complaints about a particular service is challenging. Prior work in event detection has demonstrated that simple linguistic indicators (phrases or n-grams) can be useful in the accurate discovery of events in social media. Though user complaints are not the same as events and are closer to a speech act, we posit that similar indicators can be used for complaint detection. To pursue this hypothesis, we propose a semi-supervised iterative approach to identify social media posts that complain about a specific service. In our experimental work, we started with an annotated set of 326 samples of transportation complaints, and after four iterations of the approach, we collected 2,840 indicators and over 3,700 tweets. We annotated a random sample of 700 tweets from the final dataset and observed that over 47% of the samples were actual transportation complaints. We also characterize the performance of basic classification algorithms on this dataset. In doing so, we also study how different linguistic features contribute to the performance of a supervised model in this domain.


Methodology and Approach

Our proposed approach begins with a large corpus of transport-related tweets and a small set of annotated complaints. We use this labeled data to create a set of seed indicators that drive the rest of our iterative complaint detection process.

Data Collection

We focused our experimentation on the period from November 2018 to December 2018. Our first step towards creating a corpus of transport-related tweets was to identify linguistic markers related to the transport domain. To this end, we scraped random posts from transport-related web forums. These forums involve users discussing their grievances and raising awareness about a wide array of transportation-related issues. We processed this data to extract words and phrases (unigrams, bigrams, and trigrams) with high tf-idf scores, and then had human annotators prune them further to remove duplicates and irrelevant items, leaving a final list of 75 domain-relevant phrases.
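To make this step more concrete, here is a minimal sketch of such a phrase-extraction step using scikit-learn. The sample posts, the choice of aggregating tf-idf by taking each n-gram's maximum score across posts, and the cutoff of 200 candidates are illustrative assumptions rather than the exact pipeline used in the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Illustrative stand-ins for the scraped forum posts.
forum_posts = [
    "the bus service on this route is always delayed during peak hours",
    "toll fee collection at the station causes long queues every morning",
    "drivers ignore passenger complaints about broken air conditioning",
]

# Extract unigrams, bigrams, and trigrams, as described above.
vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
tfidf = vectorizer.fit_transform(forum_posts)

# Score each n-gram by its maximum tf-idf value across posts and keep the top
# candidates, which human annotators would then prune to the final phrase list.
scores = np.asarray(tfidf.max(axis=0).todense()).ravel()
top_idx = scores.argsort()[::-1][:200]
candidate_phrases = [vectorizer.get_feature_names_out()[i] for i in top_idx]
print(candidate_phrases[:10])
```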

We used Twitter's public streaming API to query for tweets that contained any of the 75 phrases over the chosen time range. We then excluded non-English tweets and any tweets with fewer than two tokens. This resulted in a collection of 19,300 tweets, which we will refer to as corpus C. We chose a random sample of 1,500 tweets from this collection for human annotation and employed two human annotators to identify traffic-related complaints among them. The annotation details are mentioned in the manuscript. In cases where the annotators disagreed, the labels were resolved through a discussion. After the disagreements were resolved, the final seed dataset had 326 samples of traffic-related complaints. We will refer to this set as Ts.
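As a rough illustration of this filtering step, the snippet below keeps only English tweets with at least two tokens that contain one of the query phrases. The dictionary fields and the tiny phrase list are illustrative stand-ins, not the exact schema returned by the streaming API.

```python
# Illustrative subset of the 75 transport-related phrases.
transport_phrases = ["toll fee", "bus service", "metro station"]

# Illustrative stand-ins for tweets returned by the streaming API.
raw_tweets = [
    {"text": "The bus service on route 42 has been stuck for an hour", "lang": "en"},
    {"text": "ok", "lang": "en"},
    {"text": "le métro est encore en retard", "lang": "fr"},
]

def keep_tweet(tweet, phrases):
    text = tweet["text"].lower()
    if tweet.get("lang") != "en":           # drop non-English tweets
        return False
    if len(text.split()) < 2:               # drop tweets with fewer than two tokens
        return False
    return any(p in text for p in phrases)  # must contain at least one query phrase

corpus_C = [t for t in raw_tweets if keep_tweet(t, transport_phrases)]
print(len(corpus_C))  # 1 for this toy example
```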

Sample tweet image identified as complaint relevant from the curated dataset. Personally identified information removed from the image.

Iterative Algorithm

Our proposed iterative approach is summarized in the figure. First, we use the seed data Ts to build a set of linguistic indicators I for complaints. We then use these indicators to get potential new complaints Tl from the corpus C. We then merge Ts and Tl to build our new dataset. We then use this new dataset to extract a new set of indicators Il. The indicators are combined with the original indicators I to extract the next version of Tl. This process is repeated until we can no longer find any new indicators.

Iterative Complaint Detection Algorithm.
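In code, the overall loop can be sketched as follows. The helpers extract_indicators and match_tweets are hypothetical placeholders for the indicator extraction and lookup described in the next subsection; only the control flow mirrors the algorithm above.

```python
def iterative_complaint_detection(T_s, C, extract_indicators, match_tweets):
    """Grow the complaint dataset until no new indicators are found.
    T_s (seed complaints) and C (transport corpus) are sets of tweet texts."""
    I = set(extract_indicators(T_s))       # seed indicators from annotated data
    T = set(T_s)                           # current complaint dataset
    while True:
        T_l = set(match_tweets(C, I))      # potential new complaints from C
        T = T | T_l                        # merge with the existing dataset
        I_l = set(extract_indicators(T))   # indicators from the merged dataset
        new_indicators = I_l - I
        if not new_indicators:             # converged: no new indicators
            break
        I = I | new_indicators
    return T, I
```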

Extracting Linguistic Indicators

As shown in the algorithm, extracting linguistic indicators (n-grams) is one of the most important steps in the process. These indicators are critical to identifying tweets that are most likely domain-specific complaints. We employ two different approaches for extracting these indicators. For seed data, Ts, which is annotated, we just select n-grams with the highest tf-idf scores. In our experimental work, Ts had 326 annotated tweets. We identified 50 n-grams with the highest tf-idf scores to initialize I. Some examples included terms like problem, station, services, toll-fee, reply, fault, provide information, driver, district, and passenger.

When extracting indicators from Tl, which is not annotated, it is possible that there are frequently occurring phrases that are not necessarily indicative of complaints. These phrases could lead to concept drift in subsequent iterations. To avoid such digressions, we use a measure of domain relevance when selecting indicators, defined as the ratio of the frequency of an n-gram in Tl to its frequency in Tr, where Tr is a collection of randomly chosen tweets that do not intersect with C. We defined Tr as a random sample of 5,000 tweets from a different time range than that of C.
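A minimal sketch of this domain-relevance score is given below; the whitespace tokenization and the guard against zero counts in Tr are simplifying assumptions.

```python
from collections import Counter

def ngram_counts(tweets, n):
    """Count space-joined n-grams over whitespace-tokenized tweets."""
    counts = Counter()
    for tweet in tweets:
        tokens = tweet.lower().split()
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts

def domain_relevance(ngram, T_l, T_r):
    """Ratio of the n-gram's frequency in T_l to its frequency in the random set T_r."""
    n = len(ngram.split())
    freq_l = ngram_counts(T_l, n)[ngram]
    freq_r = ngram_counts(T_r, n)[ngram]
    return freq_l / max(freq_r, 1)  # avoid division by zero if the n-gram is absent from T_r
```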

Our iterative approach converged in four rounds, after which it did not extract any new indicators. After four iterations, this approach chose 3,732 tweets and generated 2,840 unique indicators. We also manually inspected the indicators chosen during the process. We observed that only indicators with a domain relevance score greater than 2.5 were chosen for subsequent iterations.

Examples of some strong and weak indicators. The numbers in brackets denote the respective domain relevance score.
Frequency of indicators and tweets collected after each iteration.

We chose a random set of 700 tweets from the final complaints dataset T and annotated them manually to help us assess its quality. The annotation guidelines are discussed in the manuscript, and we employed the same annotators as before. The annotators obtained a high agreement score of kappa = 0.83. After resolving the disagreements, we observed that 332 tweets were labeled as complaints, which accounts for 47.4% of the sampled 700 tweets. This demonstrates that nearly half the tweets selected by our semi-supervised approach were traffic-related complaints, a significantly higher proportion than in the random sample annotated to create the seed data Ts, where only 21.7% were actual complaints.
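For reference, the agreement score reported above can be computed with scikit-learn's implementation of Cohen's kappa; the labels below are toy values rather than the actual annotations.

```python
from sklearn.metrics import cohen_kappa_score

# Toy binary labels from the two annotators (1 = complaint, 0 = not a complaint).
annotator_1 = [1, 0, 1, 1, 0, 1, 0, 0]
annotator_2 = [1, 0, 1, 0, 0, 1, 0, 0]

print(cohen_kappa_score(annotator_1, annotator_2))
```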


Modelling

We conducted a series of experiments to understand whether we can automatically build simple machine learning models to detect complaints. These experiments also helped us evaluate the quality of the final dataset. Additionally, this experimental work studies how different types of linguistic features contribute to the detection of social media complaints. For these experiments, we used the annotated sample of 700 posts as a test dataset. We built our training dataset by selecting another 2,000 posts from the original corpus C and annotating them as well. To evaluate the predictive strength of machine learning algorithms, we used various linguistic features, which can be broadly broken down into four groups.

(i) Semantic: The first group of features is based on simple semantic properties such as n-grams, word embeddings, and part of speech tags.

(ii) Sentiment: The second group of features is based on pre-trained sentiment models or lexicons.

(iii) Orthographic: The third group of features uses orthographic information such as hashtags, user mentions, and intensifiers.

(iv) Request: The last group of features again uses pre-trained models or lexicons associated with requests, which are a closely related speech act.

For experimentation purposes, we used either a quantitative or a normalized score for the complete tweet from each of the pre-trained models or lexicons. More details about prior literature with regard to these types of features can be found in the paper.
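As an illustration of what such features might look like, the snippet below computes a few of them for two toy tweets: unigram counts (semantic), a VADER compound score standing in for the pre-trained sentiment models (sentiment), and counts of hashtags, mentions, and intensifiers (orthographic). The request-based features follow the same pattern with a request lexicon and are omitted here; none of this reproduces the exact lexicons or normalization used in the paper.

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("vader_lexicon", quiet=True)  # lexicon required by VADER

tweets = [
    "@citymetro why is the 8am train cancelled AGAIN?? #fail",
    "Smooth ride on the new bus route this morning, thanks @citybus",
]

# (i) Semantic: unigram counts.
unigrams = CountVectorizer(ngram_range=(1, 1)).fit_transform(tweets)

# (ii) Sentiment: one compound score per tweet from a pre-trained model (VADER here).
sia = SentimentIntensityAnalyzer()
sentiment = [sia.polarity_scores(t)["compound"] for t in tweets]

# (iii) Orthographic: counts of hashtags, user mentions, and a toy intensifier list.
intensifiers = {"very", "really", "again", "so"}
orthographic = [
    (
        t.count("#"),
        t.count("@"),
        sum(w.lower().strip("?!.,") in intensifiers for w in t.split()),
    )
    for t in tweets
]

print(unigrams.shape, sentiment, orthographic)
```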


Results

We trained a logistic regression model for complaint detection using each one of the features described. The best performing model is based on unigrams, with an accuracy of 75.3%. There is not a significant difference in the performance of different sentiment models. It is also interesting to observe that simple features like the counts of varying pronoun types and counts of intensifiers have strong predictive ability. Overall, we observe that most of the features studied here have some ability to predict complaints.
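A minimal sketch of such a unigram-based classifier in scikit-learn is shown below; the toy training data and the hyperparameters (l1_ratio, max_iter) are illustrative and are not tuned to reproduce the reported numbers.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: tweet texts with 1 = complaint, 0 = not a complaint.
train_texts = [
    "the metro station lift has been broken for a week",
    "lovely sunset on the commute home today",
    "toll fee doubled and nobody can provide information why",
    "grabbed coffee near the bus stop this morning",
]
train_labels = [1, 0, 1, 0]

# Unigram features + logistic regression with elastic net regularization
# (the saga solver is required for the elasticnet penalty in scikit-learn).
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 1)),
    LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5, max_iter=5000),
)
model.fit(train_texts, train_labels)

print(model.predict(["why is there no reply from the bus service about the fault"]))
```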

Predictive accuracy and F1-score associated with different types of features. The classifier used was Logistic Regression with Elastic Net regularization, as it gave the best performance compared to its counterparts.

Potential Use-Cases of this Research

The utility of the proposed architecture is multi-fold:

(i) We believe that our work could be a first step towards improving complaint-relevant downstream tasks such as chat-bot development, automatic query resolution tools, and low-cost gathering of public opinion about services.

(ii) Our methodology could help linguists understand the language used in criticism and complaints from a lexical or semantic point of view.

(iii) The proposed approach is highly flexible, as it can be extended to other domains simply by changing the lexicons used in the seed data.

(iv) The iterative nature of the architecture reduces human intervention, and hence any unintentional bias, during the training phase. It also makes the system robust to lexical variations in posts over time.

(v) Since the approach is semi-supervised, it reduces the dependence on a large number of pre-labeled samples for complaint detection and also mitigates the class imbalance problem that is highly prevalent in supervised approaches.

Conclusion and Future Work

As part of this work, we presented an iterative semi-supervised approach for the automatic detection of complaints. Complaint resolution is a significant part of the product improvement initiatives of various product-based companies; hence we believe that our proposed method could be effectively leveraged to obtain a low-cost assessment of public opinion or to route grievances to the appropriate platforms. We manually validated the usefulness of the proposed approach and observed a significant improvement in the collection of complaint-relevant tweets. In the future, we aim to deploy clustering mechanisms for isolating event-relevant tweets of diverse nature. We also plan to use additional metadata context and the conversational nature of tweets to improve system performance. Our team eagerly looks forward to any feedback or suggestions regarding the paper; please feel free to reach out to any of the authors. I hope that this post motivates other young researchers like me to take up a relevant social problem and harness the potential of Artificial Intelligence and Data Science to solve it.

Relevant Links

Semi-Supervised Iterative Approach for Domain-Specific Complaint Detection in Social Media

midas-research/transport-complaint-detection

