Sentiment & Influencers

Network Analytics meets Sentiment Analysis

Rosaria Silipo
Towards Data Science
Dec 19, 2018


A few years ago, we started a debate about whether the loudest customers really are as important as everybody, the customers themselves included, seems to think. Customer care usually reacts fastest to the loudest complainers. Is that the best use of resources? And how can we identify the complainers who are actually worth the time?

Happy and disgruntled users are easily identifiable via sentiment analysis of their forum posts. The degree of influence of each user can also be measured via an influence score. There are many influence scores available; a widely adopted one is the centrality index. The idea of this use case is to combine the sentiment measure with the influence score to identify those disgruntled customers or users with a high degree of influence. Support time and resources should then be redirected to the most influential unhappy customers or users.

The dataset

The original use case referred to the launch of a new product and aimed at collecting opinions from the beta users. Since it is impossible to share the original dataset due to the company’s privacy policy, we replaced it with a publicly available similar dataset: the Slashdot News Forum.

Slashdot (sometimes abbreviated as “/.”) is a social news website, founded in 1997, that originally focused on science and technology news. Users can post news and stories about diverse topics and receive online comments from other users.

The Slashdot dataset collects posts and comments for a number of sub-forums, such as Science Fiction, Linux, Astronomy, etc. Most users posted or commented under their username, while some participated anonymously. The biggest sub-forum revolves around politics and contains about 140,000 comments on 496 articles from a total of about 24,000 users. For the purposes of this use case, we focus on the “Politics” sub-forum.

Users in the Slashdot dataset are not strictly customers. However, when talking about politics, we can identify the political topic as the product and measure the user reactions as we would for a product.

Each new post is assigned a unique thread ID. Title, subdomain, user, date, main topic, and body all refer to this thread ID. A new data row is created for each comment, containing the comment title, user, date, and body, together with the thread ID, title, user, date, and body of the seed post. In Figure 1 you can see the seed post data on the left and the data for the corresponding comments on the right. Notice that multiple comments might refer to the same seed post.

Figure 1. Slashdot dataset. Data from the seed post on the left; data from the related comments on the right.

The workflow

In the analysis, we took all non-anonymous users into consideration. Thus, the first step is to remove all data rows where the username is “anonymous”, empty, or too long, or where the post ID is missing. This happens in the “Preprocessing” metanode.

Figure 2. The upper part of the final workflow, referred to as “Network creation and analysis,” calculates influence scores. The lower part, labeled “Text processing and sentiment analysis”, measures the sentiment of each forum user. This workflow is available on KNIME EXAMPLES Server under: 08_Other_Analytics_Types/04_Social_Media/02_NetworkAnalytics_meets_TextProcessing.

Influence Scores

We want to find out who the most influential users are by investigating the connections across users. So, the goal is to build a network object that represents the user interactions.

The first step is to prepare the edge table as the basis for the network. An edge table has a source column, here the authors of the posts, and a target column, here the authors of the reference posts. The edge table is built by the “Create edge table” metanode in the upper branch of the final workflow shown in Figure 2. There, a left outer join matches all post authors (source) with all reference authors (target), if any. A GroupBy node then counts the number of occurrences of each connection from source to target, and all auto-connections, i.e. users answering themselves, are filtered out.
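Outside of KNIME, the same preparation could be sketched in a few lines of pandas. This is only a rough analogue of the metanode: the column names and the toy data are made up, and the join with the reference posts is assumed to have already happened, so each row simply pairs a comment author with the author of the post it refers to.

```python
import pandas as pd

# Toy data: each row pairs a comment author (source) with the author
# of the referenced seed post (target). Column names are hypothetical.
interactions = pd.DataFrame({
    "source": ["alice", "bob", "alice", "carol", "carol"],
    "target": ["bob", "alice", "bob", "bob", "carol"],
})

# Count the number of occurrences of each source -> target connection.
edge_table = (
    interactions.groupby(["source", "target"])
    .size()
    .reset_index(name="weight")
)

# Filter out auto-connections, i.e. users answering themselves.
edge_table = edge_table[edge_table["source"] != edge_table["target"]]
print(edge_table)
```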

The edge table is now ready to be transformed into a network object. The Object Inserter node turns the source and target users into nodes and connects them via edges, with the number of connection occurrences as the edge value.
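As a point of reference, this step could be reproduced outside of KNIME with networkx. The sketch below is not the Object Inserter node itself; it assumes the hypothetical edge table from the previous snippet.

```python
import pandas as pd
import networkx as nx

# Hypothetical edge table: one row per source -> target connection,
# with the number of occurrences as the edge weight.
edge_table = pd.DataFrame({
    "source": ["alice", "alice", "bob", "carol"],
    "target": ["bob", "carol", "alice", "bob"],
    "weight": [2, 1, 1, 1],
})

# Build a directed, weighted graph: users become nodes, connections
# become edges carrying the occurrence count as their value.
G = nx.from_pandas_edgelist(
    edge_table,
    source="source",
    target="target",
    edge_attr="weight",
    create_using=nx.DiGraph,
)
print(G.edges(data=True))
```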

After that, the metanode named “Extract largest component” splits the network into its connected components. The sub-networks are then sorted by their total number of edges and nodes, and only the largest sub-network is kept for further analysis. Finally, a second Network Analyzer node calculates the hub and authority scores.
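The component extraction can be illustrated on the same kind of networkx graph: split the network into its (weakly) connected components and keep only the largest one. The sketch below sorts components by node count only, which is a simplification of the metanode’s edge-and-node criterion.

```python
import networkx as nx

# Toy directed graph made of two disconnected pieces.
G = nx.DiGraph()
G.add_edges_from([("alice", "bob"), ("bob", "carol"), ("dave", "erin")])

# Keep only the largest weakly connected component.
largest = max(nx.weakly_connected_components(G), key=len)
G_main = G.subgraph(largest).copy()
print(sorted(G_main.nodes()))  # ['alice', 'bob', 'carol']
```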

The Network Analyzer node provides a great summary of social media activity. It calculates a number of different statistics on a network graph, at both node and edge level. Such statistical measures try to establish the importance of each node and edge based on the number of its connections, their weights, its neighbors, the distance to its neighbors, and other similar parameters. Two of these importance measures are the hub and authority scores.

The concept of hubs and authorities, as described in https://nlp.stanford.edu/IR-book/html/htmledition/hubs-and-authorities-1.html , is rooted in web pages. For broad-topic searches, there are two primary kinds of relevant web pages:

  • Authoritative sources of information on the topic (authorities)
  • Hand-compiled lists of links to authoritative web pages on the topic (hubs).

Hubs are not in themselves authoritative sources of topic-specific information, but rather direct you to more authoritative pages. The hub/authority score calculation relies on hub pages to discover the authority pages.

To calculate the hub and authority scores, the Network Analyzer node relies on the implementation of the HITS algorithm in the JUNG (Java Universal Network/Graph) Framework.
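networkx ships a HITS implementation as well, which can serve as a small illustration of what the node computes; the toy graph below is invented, with an edge u -> v meaning that user u commented on a post by user v.

```python
import networkx as nx

# Toy interaction graph: u -> v means user u commented on a post by v.
G = nx.DiGraph()
G.add_edges_from([
    ("alice", "dave"), ("bob", "dave"), ("carol", "dave"),
    ("alice", "erin"), ("bob", "erin"),
])

# HITS returns two dictionaries: hub scores and authority scores.
hubs, authorities = nx.hits(G, max_iter=100, normalized=True)
print("hub scores:      ", hubs)
print("authority scores:", authorities)  # dave and erin score as authorities
```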

Sentiment Analysis

Now we want to measure the sentiment, i.e. quantify each forum user in terms of positivity and negativity rather than authority.

The lower branch of the workflow in Figure 2 extracts the list of documents for each forum user from the posts and comments they have written. At the same time, it imports two lists of English words, one with negative words and one with positive words, according to the MPQA Subjectivity Lexicon. Words in all documents are tagged as positive or negative by the two Dictionary Tagger nodes, depending on whether they match any of the words in these two lists. Untagged words are considered neutral.

Each positive word is assigned a +1 value, each negative word a -1 value, each neutral word a 0 value. By summing up all word values across all documents written by each user, we calculate the user sentiment score.
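Stripped of the KNIME text processing nodes, the scoring rule itself is just a word count. The sketch below is a bare-bones stand-in: the two word lists are tiny placeholders for the MPQA Subjectivity Lexicon, and the documents are made up.

```python
# Tiny stand-ins for the MPQA positive and negative word lists.
positive_words = {"good", "great", "excellent", "love"}
negative_words = {"bad", "terrible", "hate", "broken"}

def sentiment_score(documents):
    """Sum +1 / -1 / 0 word values over all documents of one user."""
    score = 0
    for doc in documents:
        for word in doc.lower().split():
            word = word.strip(".,!?")
            if word in positive_words:
                score += 1
            elif word in negative_words:
                score -= 1
            # untagged words count as neutral (0)
    return score

user_docs = {
    "alice": ["This release is great", "I love the new UI"],
    "bob": ["Terrible update", "Everything is broken, I hate it"],
}
scores = {user: sentiment_score(docs) for user, docs in user_docs.items()}
print(scores)  # {'alice': 2, 'bob': -3}
```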

Note that the user sentiment score is calculated here from absolute word counts, without taking the total number of words into account. For corpora with longer documents, i.e. with larger differences in document length, relative frequencies might be more suitable.

Finally, forum users with a sentiment score above (average + standard deviation) are considered positive; forum users with a sentiment score below (average - standard deviation) are considered negative; all users in between are considered neutral. Positive users are color coded green, negative users red, and neutral users gray.
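In code, this thresholding boils down to a few lines; the scores below are invented for illustration.

```python
import statistics

# Hypothetical user sentiment scores.
scores = {"alice": 12, "bob": -15, "carol": 1, "dave": -2, "erin": 3}

avg = statistics.mean(scores.values())
std = statistics.stdev(scores.values())

def label(score):
    if score > avg + std:
        return "positive"   # color coded green
    if score < avg - std:
        return "negative"   # color coded red
    return "neutral"        # color coded gray

labels = {user: label(s) for user, s in scores.items()}
print(labels)  # alice -> positive, bob -> negative, the rest neutral
```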

Putting it all together

To put it all together, a Joiner node joins the authority and hub scores with the sentiment score by author.

A Scatter Plot (JavaScript) node, inside the wrapped metanode “Scores and Sentiment on Scatter Plot”, plots the forum users by hub score on the y-axis, authority score on the x-axis, and sentiment score as color.
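The join-and-plot step can be mimicked with pandas and matplotlib; the scores and labels below are invented, and the static plot is of course not interactive like the KNIME view.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical per-author scores coming out of the two branches.
network_scores = pd.DataFrame({
    "author": ["alice", "bob", "carol", "dave"],
    "authority": [0.8, 0.1, 0.3, 0.6],
    "hub": [0.2, 0.7, 0.4, 0.1],
})
sentiment = pd.DataFrame({
    "author": ["alice", "bob", "carol", "dave"],
    "label": ["neutral", "negative", "positive", "neutral"],
})

# Join the two score tables by author, like the Joiner node.
scores = network_scores.merge(sentiment, on="author", how="inner")

# Authority on the x-axis, hub on the y-axis, sentiment as color.
colors = {"positive": "green", "negative": "red", "neutral": "gray"}
plt.scatter(scores["authority"], scores["hub"], c=scores["label"].map(colors))
plt.xlabel("authority score")
plt.ylabel("hub score")
plt.title("Forum users by authority, hub, and sentiment")
plt.show()
```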

Notice that the loudest complainers, in red, actually have very little authority and therefore cannot be considered influencers. This plot thus seems to go against the common belief that you should listen to and pamper the most aggressive complainers. Notice also that the most authoritative users are actually neutral. This neutrality could well be one of the reasons why other users trust them.

The scatter plot view produced by the Scatter Plot (JavaScript) node is interactive. By clicking the “Select mode” button at the top of the view, it is possible to select single points on the scatter plot with a single click, or groups of points by drawing a rectangle around them.

Figure 3. Authors as points on a scatter plot with authority score on the x-axis and hub score on the y-axis. Authors with a positive sentiment score, i.e. sentiment score > (average + std dev), are color coded green. Authors with a negative sentiment score, i.e. sentiment score < (average - std dev), are color coded red. Authors with a sentiment score in between are labelled as neutral and depicted in gray. In the upper right corner are the buttons for zooming and selection. The circled button enables point/author selection. The bigger point in the plot is the one selected by a single click.

The final workflow can be seen in Figure 2 and is located on the KNIME EXAMPLES server in: 08_Other_Analytics_Types/04_Social_Media/02_NetworkAnalytics_meets_TextProcessing

So, how did we do?

Posts and connections in a forum can be analyzed by reducing them to numbers, like sentiment measures or influence scores. In this blog post, they have been reduced to sentiment scores via text processing on the one hand and to authority/hub scores via network graph analytics on the other.

Both representations produce valuable information. However, the combination of the two has proven invaluable when trying to isolate the most positive and authoritative users for reward, and the most negative and authoritative critics for damage control.

Acknowledgements

This blog post is the summary of a project run with Phil Winters, Kilian Thiel, and Tobias Kötter. More details are available in the KNIME whitepaper “Creating Usable Customer Intelligence from Social Media Data: Network Analytics meets Text Mining”.


Rosaria has been mining data since her master’s degree, through her doctorate and the job positions that followed. She is now a data scientist and KNIME evangelist.