From Words to Wisdom

A Review of Text Mining Use Cases

Rosaria Silipo
Towards Data Science
19 min read · Dec 4, 2020


Text mining is a very rich branch of data science, filled with extremely useful techniques. It allows us to understand the topic of a conversation and, therefore, to summarize it; to quantify the sentiment of the speakers and reply in the most appropriate tone; to recognize entities hidden in texts; and even to generate free text and automatic answers. All of that is possible with just a few nodes from the Textprocessing extension for KNIME Analytics Platform.

In the Evangelism group at KNIME, we have worked for years on text mining applications. Now the moment has come to collect the results and present a short review of the many use cases where text mining can be, and has been, successfully applied. Let the text mining journey start!

1. Build a Word Cloud
2. Sentiment Analysis
3. Network Analysis
4. Document Classification & Topic Detection
5. Named Entity Recognition
6. Influence or Sentiment?
7. Topic Detection and Network Analysis
8. Free Text Generation
9. Automatic Translation

1. Build a Word Cloud

The first application, which even a beginner can build in just a few steps, produces a word cloud from a text corpus around a given topic. Word clouds of politicians' speeches or social media conversations are particularly fashionable.

Among social media platforms, the easiest one to connect to, in terms of restrictions and API services, is certainly Twitter. We are writing in 2020 and the only hashtag in everybody’s mind is #COVID19. So, a beginner application in text mining could be a workflow that builds a word cloud from tweets around #COVID19.

KNIME Analytics Platform offers a Twitter API extension, including a few very versatile nodes to connect to a Twitter account and to search, tweet, and retweet via the official Twitter API (Figure 1). Free Twitter accounts allow you to download only the previous week's tweets around a certain hashtag, up to a maximum of 10,000 tweets. For most hashtags, 10,000 tweets are a fairly representative number; even for more popular hashtags, they should provide a reasonable statistical sample of sentiment, topic, and other text mining features.

Figure 1. Nodes from the KNIME Twitter API extension (“Image by Author”)

To collect last week’s tweets around #COVID19, we use a Twitter API Connector node and a Twitter Search node. Authentication to the API service is carried out in the Twitter API Connector node, while the hashtags and the fields we want to import are set in the configuration window of the Twitter Search node. Partial results (just a few data rows) of this query, including username, profile image, tweet text, tweet ID, and tweet time, can be seen in Figure 2.

Figure 2. Tweets resulting from a Twitter search around #COVID19 (“Image by Author”)
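Outside KNIME, the same query can be sketched in a few lines of Python. The snippet below is a minimal, hypothetical example using the tweepy library against the Twitter API v2 recent-search endpoint; the bearer token and the selected fields are assumptions, not part of the original workflow.

```python
import tweepy

# Assumption: a bearer token from a (hypothetical) Twitter developer account
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# Recent search covers roughly the previous week, matching the free-tier limits
response = client.search_recent_tweets(
    query="#COVID19 -is:retweet",
    tweet_fields=["created_at", "author_id"],
    max_results=100,  # maximum per request; paginate for more tweets
)

for tweet in response.data:
    print(tweet.created_at, tweet.text)
```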

After some processing of the date and time field and some color coding for morning, afternoon, evening, and night tweets, the word cloud is built using the component named "Interactive Tag Cloud". This component revolves around the Tag Cloud node and produces an interactive view of the word cloud created from the texts at its input port, where words are color coded according to the time of tweeting. The interactive view has been extended with a data table of tweets, with synchronized interactions between the two items: clicking a word in the word cloud shows, on the right, the tweets where that word appears.

Figure 3. Video showing actions from the interactive view of the component “Interactive Tag Cloud” (“Image by Author”)

The video in Figure 3 shows some of the interactions and synchronizations available in the composite view of the “Interactive Tag Cloud” component. For example, at the very beginning, you can see that after clicking the word “coronavirus” tweeted in the afternoon (color red), all 21 related original tweets appear in the table on the right. And so on, for all words in the word cloud.

The workflow implementing this whole task (connecting to Twitter, searching tweets, processing the time of day, and finally producing the interactive visualization) is shown in Figure 4 and can be downloaded from the page Interactive Tag Cloud from Twitter Search on the KNIME Hub.

Figure 4. Workflow “Interactive Tag Cloud from Twitter Search” available on the KNIME Hub (“Image by Author”)

This simple case study involved a connection to Twitter. Of course, it would also work with any other social media platform for which we have access rights, or with any other text corpus in the appropriate shape to feed the "Interactive Tag Cloud" component.
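For readers who prefer code to nodes, a static (non-interactive) word cloud can be approximated in Python with the wordcloud library. This is only a sketch: the tweet list is a placeholder for a real corpus, and the linked table view of the KNIME component is not reproduced.

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Placeholder tweets; in practice, use the texts returned by the Twitter search
tweets = ["covid19 cases rising again", "stay home stay safe covid19",
          "new covid19 vaccine trial results"]

# The library tokenizes the joined text and sizes words by frequency
wc = WordCloud(width=800, height=400, background_color="white")
wc.generate(" ".join(tweets))

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```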

2. Sentiment Analysis

Let us now move to a more complex case study and try to perform some kind of classification on the documents in a text corpus. Sentiment analysis is a very popular technique to explore the tone of texts. It is popular because, on the one hand, it is relatively easy to implement and, on the other, it can produce useful information: about the user experience for product marketing, for example, or the mood of the crowd ahead of the next elections.

Indeed, it is possible to extract the tone of a text, from positive to negative across a whole range of predefined sentiment states. Let's keep it simple for this case study and consider only two sentiment states: positive and negative. The task, then, is to classify an input text into either the positive or the negative class, as shown in Figure 5.

Figure 5. Classification of texts into the positive or negative class (“Image by Author”)

Putting together a sentiment analysis application is more complicated than cleaning up words for a word cloud; just a few nodes probably won't be enough. There are two main approaches to sentiment analysis in data science: lexicon based and machine learning based [1]. Depending on the task, the tolerance for errors, and the available text data, one or the other approach is advisable.

Lexicon based: In this case, words in a dictionary are labeled as positive or negative and are used to tag all words in the text documents. A score is assigned to each positive word and another score to each negative word; the total sum of the scores in a text document then defines its sentiment class, based on a threshold decision. The threshold value is usually derived from the statistics of the dataset. Further complexity can be introduced with a more granular scale of sentiment labels, as well as with grammar rules, for example for negations or subordinate clauses. Of course, the more complex the rules, the more reliable the classification; on the other hand, the simpler the process, the faster the sentiment assignment. This approach requires a reliable dictionary in the corpus language, usually distributed by universities or by companies of linguistic experts.
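As a minimal sketch of the idea, here is the scoring step in Python, with a toy dictionary standing in for a real sentiment lexicon and a zero threshold as a simplifying assumption:

```python
# Toy lexicon: a real one would come from a proper linguistic resource
LEXICON = {"good": 1, "great": 2, "happy": 1, "bad": -1, "awful": -2, "sad": -1}

def sentiment(text, threshold=0):
    # Very crude tokenization: lowercase, split on whitespace, strip punctuation
    score = sum(LEXICON.get(word.strip(".,!?"), 0)
                for word in text.lower().split())
    return "positive" if score > threshold else "negative"

print(sentiment("What a great and happy day"))  # positive
print(sentiment("An awful, sad experience"))    # negative
```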

Machine learning based: Here, we proceed as in a classical data science project. We need a labeled corpus, where each text document has previously been labeled by an expert with a sentiment class. Based on this dataset, we build a training set and a test set: the training set is used to train a machine learning model to recognize sentiment, and the test set is used to evaluate the trained model. This approach is generally more accurate than the lexicon-based approach; however, it needs a labeled corpus, which is not always easily available.
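In Python, the same pipeline could be sketched with scikit-learn. The four labeled documents below are placeholders for a real expert-labeled corpus, and the TF-IDF plus logistic regression combination is just one reasonable choice of encoding and algorithm.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Placeholder labeled corpus; a realistic one has thousands of documents
texts = ["I love this product", "Terrible service",
         "Great experience", "Would not recommend"]
labels = ["positive", "negative", "positive", "negative"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, stratify=labels, random_state=42)

# Encode documents as TF-IDF vectors, then train a classifier on the training set
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)

# Evaluate the trained model on the held-out test set
print(accuracy_score(y_test, model.predict(X_test)))
```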

Within the machine learning based approach, a few solutions have been developed that use deep learning, the current frontier in the field of sentiment analysis. Long Short-Term Memory (LSTM) networks have been successfully applied to this text mining task [2], in an attempt to exploit the sequential behavior of words in texts.

Figure 6. Workflow snippets implementing lexicon-based, machine learning based, and deep learning based approaches for sentiment analysis (“Image by Author”)

In all solutions, whatever approach they are based on, a big part of the application consists of acquiring the text documents, cleaning them, tagging them, and transforming or encoding them [3], so as to feed a threshold-based rule system or the machine learning algorithm of choice [4].

Four solutions for sentiment analysis are available for free on the KNIME Hub, using a dictionary-based approach, a decision tree, a deep learning architecture with Keras, and a deep learning architecture with BERT, respectively.

3. Network Analysis

When we talk about text mining, it is inevitable to also talk about social media. Indeed, together with the texts exchanged in conversations, the network of users (posting, commenting, influencing, and following) is often investigated to learn more about connections, groups, and upcoming trends.

There are many ways to represent a network graphically. A common one is a graph [5]. In a graph, social media users are represented as nodes, connected via edges that represent the amount of their interactions. The more active a user, the bigger the node; the more frequent the interaction, the thicker the corresponding edge in the graph.

Note that a network representation is mainly applied to users in social media, but it could be applied to any other kind of object, as long as the objects share some kind of directional, measurable exchange.

We have applied this network representation to the tragedy of "Romeo and Juliet". The characters are the nodes in the graph, and the number of dialogues between two characters defines the weight of the corresponding edge (Figure 7). Romeo and Juliet are the biggest, most central nodes in the graph, as they are the main characters in the plot. The largest stream of interactions happens between Romeo and Juliet, as expected. However, considerable exchanges also take place between Romeo and his friend Mercutio, and between Romeo and his rival Tybalt. Also notice the clear separation between the Capulet characters (in pink) and the Montague characters (in blue): apart from the two protagonists, hardly any Capulets talk to Montagues and vice versa, reflecting the hostility between the two families, which is to be expected if you are familiar with the plot of the tragedy.
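The same kind of graph can be sketched in Python with networkx. The dialogue counts below are made up for illustration and are not taken from the actual workflow.

```python
import networkx as nx
import matplotlib.pyplot as plt

# Hypothetical dialogue counts between pairs of characters
dialogues = [("Romeo", "Juliet", 25), ("Romeo", "Mercutio", 10),
             ("Romeo", "Tybalt", 6), ("Juliet", "Nurse", 12),
             ("Juliet", "Capulet", 8), ("Romeo", "Montague", 4)]

G = nx.Graph()
G.add_weighted_edges_from(dialogues)

# Node size proportional to total interactions, edge width to dialogue count
sizes = [50 * G.degree(n, weight="weight") for n in G.nodes()]
widths = [G[u][v]["weight"] / 3 for u, v in G.edges()]
nx.draw(G, with_labels=True, node_size=sizes, width=widths)
plt.show()
```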

The workflow reading the whole tragedy of Romeo and Juliet and summarizing it via a network graph is also available on the KNIME Hub, in the "Will They Blend?" series: Will they Blend? Epub meets JPG, Romeo meets Juliet.

Figure 7. Graph Representation of the network of characters in “Romeo and Juliet” (“Image by Author, images inside nodes authorized for publication by the Konstanz Theater”)

The central positions in the graph are occupied by the most prominent characters/users/nodes. From this observation, the centrality index is defined [6]: users in a social network with a high centrality index are considered the influencers of the network. The Network Analyzer node in KNIME Analytics Platform provides a number of measures of the network at its input port, including the centrality index.
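Outside KNIME, networkx offers several centrality measures. The sketch below uses betweenness centrality on a toy version of the character graph, as one common choice; the Network Analyzer node computes its own set of indices, which may differ.

```python
import networkx as nx

G = nx.Graph([("Romeo", "Juliet"), ("Romeo", "Mercutio"), ("Romeo", "Tybalt"),
              ("Juliet", "Nurse"), ("Juliet", "Capulet")])

# Betweenness centrality: how often a node lies on shortest paths between others
for node, score in sorted(nx.betweenness_centrality(G).items(),
                          key=lambda item: -item[1]):
    print(f"{node}: {score:.2f}")
```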

A second, less common option for network visualization is the chord diagram. In a chord diagram, all users are represented as arcs on a circle, the circle being the network. Arcs are connected to each other via wider or narrower ribbons, representing the interactions. We drew one year of interactions on Twitter (tweets and retweets) around #KNIMESummit2019 via a chord diagram (Figure 8). Twitter is the circle; users are arcs on the circle. The longer the arc, the more active the user. @KNIME (grey arc) is the most active tweeter around #KNIMESummit2019; @paolotamag (in green) and @DMR_Rosaria (in blue) follow as the second and third most active tweeters. Parts of the grey @KNIME arc are transferred, as retweeted tweets, to @paolotamag and other users. Indeed, the colors of the ribbons connecting one user to the others show the portion of retweeted tweets.

Figure 8. Chord diagram around #KNIMESummit2019 on Twitter (“Image by Author”)

The workflow connecting to the tweet database, importing the tweets from the past year around #KNIMESummit2019, counting the numbers of tweets and retweets by user, and shaping the results to be consumed by the JavaScript function drawing the chord diagram is available on the KNIME Hub for download.

Notice that KNIME Analytics Platform has no dedicated node for chord diagrams. In this workflow, the chord diagram was plotted via a short JavaScript snippet inside the Generic JavaScript View node.

4. Document Classification & Topic Detection

Other interesting pieces of information that we can extract from text documents are the topics, which can be used to classify documents, understand a conversation, or summarize a text. To implement this, we can proceed via document classification or via topic detection.

Document classification relies on a labeled dataset, where a class has been assigned to each text document. This is the case, for example, of a "ham vs. spam" classification: in a review or email dataset, the task aims at separating useful, genuine comments/emails (ham) from generic spam.

After partitioning the dataset into a training set and a test set, we train a supervised machine learning algorithm on the training set and evaluate it on the test set. The workflow in this case includes text document import, cleaning, encoding, and training and evaluating the algorithm. The workflow Document Classification: Model Training and Deployment can be downloaded from the KNIME Hub.

Topic Detection, in comparison, refers to unsupervised procedures running on unlabeled datasets. A widely used unsupervised procedure for topic extraction is Latent Dirichlet Allocation (LDA) [7]. LDA is a generative probabilistic model that discovers clusters of keywords in the text corpus and uses them to describe frequent corpus topics. The KNIME node implementing the LDA algorithm is the Topic Extractor (Parallel LDA) node. This node takes a text corpus as input and produces a number n of topics, each one described with a number m of keywords, at the output port. n and m are parameters to set in the LDA node’s configuration window. Each keyword, and therefore each topic, receives a weight to quantify its reliability.
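As a rough Python equivalent of the Topic Extractor node, here is an LDA sketch with scikit-learn. The five mini-documents are placeholders, and n and m are set to small values just for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder mini-corpus; a real run needs many more documents
docs = ["romeo loves juliet", "death of tybalt in verona",
        "the lady and the gentleman", "a young villain",
        "love and death in verona"]

# Bag-of-words encoding, then LDA with n topics
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)

# Print the m top keywords per topic, ranked by the fitted word-topic weights
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-4:][::-1]]  # m = 4 here
    print(f"Topic {i}: {', '.join(top)}")
```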

For this use case on topic detection, too, we have a little example: summarizing the previously mentioned tragedy "Romeo and Juliet" by Shakespeare with n=3 topics, each described by m=10 keywords. After passing the whole text of the tragedy through the LDA algorithm, we end up with the three topics reported in the word cloud in Figure 9. The topic in red describes the tragedy primarily as a story of love (primary keyword) and death (secondary keyword); the topic in blue as a story of ladies and gentlemen; and, to a much lesser extent, the green topic reports it as a story of youth and villains.

The workflow Topic Detection LDA: Summarizing Romeo and Juliet is available on the KNIME Hub. I guess that with this summarization and with the network of characters depicted in Figure 7, we are now prepared for the English test.

Figure 9. Results of topic extraction from the “Romeo and Juliet” tragedy (“Image by Author”)

5. Named Entity Recognition

Another classic problem where text mining turns out to be useful is named entity recognition. Here the goal is easy to state: recognize a named entity (like a city) and tag it as such, that is, with tag=CITY; or recognize all diseases and tag them as DISEASE (Figure 10).

The task really does seem easy: if we had a list of all diseases, we could just run a term search and tag all terms appropriately according to the list. However, such a list is often not available, and compiling it for this purpose can be cumbersome and time-consuming. But what if we trained a machine learning model to do the tagging for us? The nodes from the StanfordNLP NE family in the KNIME Textprocessing extension train a conditional random field model to tag terms in documents.

Figure 10. Tagging diseases (“Image by Author”)

A workflow that trains a model to learn disease tags from a corpus of biomedical literature is available in Fun with Tags on the KNIME Hub.
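The KNIME solution trains a conditional random field; as a quick Python illustration of the tagging step, the sketch below uses a pretrained spaCy model instead, which is a different, swapped-in technique. The small general-purpose model only knows standard tags like GPE or DATE; a DISEASE tag would require training a custom model on a labeled biomedical corpus, exactly as the workflow does.

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("The first cases were reported in Paris in January.")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)  # e.g., Paris -> GPE, January -> DATE
```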

6. Influence or Sentiment?

Topic detection, sentiment analysis, and network analysis: shall we try to combine them?

The first case study we worked on combined sentiment analysis and network analysis. The question to answer was: among the happy and unhappy users, the influencers and the followers, which ones should we address first?

To answer this question, we worked on the text documents (posts and comments) in the Slashdot political forum. Here, we measured each user's sentiment on the one hand and their degree of influence (influencer vs. follower) on the other.

Figure 11. Combining sentiment analysis and network analysis, to determine the negative influencers (“Image by Author”)

For the sentiment analysis, we used the lexicon-based approach, since a sentiment-labeled dataset was not available. We quantified each user's sentiment through the average sentiment score of his/her text documents. Based on the histogram of the sentiment scores across the whole dataset, users were considered positive if their sentiment score fell above a positive threshold, negative if it fell below a negative threshold, and neutral if in between.
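As a small sketch of this aggregation and thresholding step, assuming a table with one lexicon-based score per document (the data and the thresholds below are made up):

```python
import pandas as pd

# Hypothetical document-level sentiment scores, one row per post/comment
df = pd.DataFrame({"user": ["ann", "ann", "bob", "bob", "cal"],
                   "score": [0.8, 0.4, -0.6, -0.2, 0.05]})

# Average score per user, then threshold into positive / neutral / negative
user_score = df.groupby("user")["score"].mean()
POS, NEG = 0.3, -0.3  # thresholds derived, e.g., from the score histogram

label = user_score.apply(
    lambda s: "positive" if s > POS else "negative" if s < NEG else "neutral")
print(label)  # ann: positive, bob: negative, cal: neutral
```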

For the network analysis, we calculated an influencer score and a follower score for each user with the Network Analyzer node, based on the number of posts and the number of received comments.

You can see the two branches of the workflow in Figure 11: network mining in the top part and sentiment analysis in the lower part of the workflow. A simplified version of this workflow named Network Analytics meets Text Processing can be downloaded from the KNIME Hub.

The workflow identified three groups of users in terms of sentiment (negative, positive, and neutral) and two types of users in terms of networking (leaders and followers). In Figure 12, all users are reported as points in a scatter plot, with the follower score on the x-axis and the leader score on the y-axis, colored by their prevalent sentiment: negative in red, positive in green, and neutral in grey.

This study allowed us to identify the negative influencers, who are not necessarily the loudest negative users, and to address them first. Notice in the plot that the most influential users were actually neutral in the tone of their posts.

Figure 12. Influencers and followers as points in a scatter plot, colored according to their prevalent sentiment (“Image by Author”)

7. Topic Detection and Network Analysis

Another dataset for which we combined two different kinds of analysis was the Hillary Clinton email corpus. This long-debated set of emails was made public a few years back; you can also find it on Kaggle.

You can use a Document Viewer node to visualize the content and sender of each text document (Figure 13). However, manually investigating the emails one by one is unrealistic. So, we decided to proceed with automatic clustering of the email topics and an interactive visualization of the network of senders and recipients for a selected topic.

Figure 13. The Document Viewer node allows us to visualize the documents from a corpus one at a time (“Image by Author”).

Without going into politics, we wanted to explore the content of the emails (the topics) and the relations between the sender (Hillary Clinton) and the receivers. Therefore, the whole corpus was passed through the LDA algorithm to discover the topics of the conversations, and a network graph was built to represent the corresponding email connections.

With the LDA algorithm, we searched for n=10 topics, each represented by m=5 keywords. The list of topics with their representation is shown in the table below.

Table 1. 10 topics found in Hillary Clinton's email corpus

This list of topics has been displayed as a radio button list on a web page on the KNIME WebPortal through a component view (Figure 14). From this list, you can select a topic and then investigate the corresponding network of email senders and recipients. Of all those topics, let's choose, for example, the topic "women, world, people, united, security". After clicking the "Next" button, the application moves to the next page to visualize the corresponding network of email exchanges (Figure 15). This was (and still is) a very hot and widely shared topic, and it therefore has a large network of exchanged emails, in which Hillary Clinton's email address (the node in blue) is the most centrally located. The other topics in the corpus have much smaller networks.

Figure 14. List of topics to choose from, as extracted by the LDA algorithm applied to Hillary Clinton's email corpus (“Image by Author”)
Figure 15. Network of email exchanges around the topic “women, world, people, united, security” (“Image by Author”)

The deployment workflow that builds the web application to run on the KNIME WebPortal is shown in Figure 16. Notice that two components — “Value Selection” and “Plot and Network Visualization” — produce the web pages in the web application. Indeed, their composite views translate into web pages once the workflow is running on the KNIME WebPortal. This workflow and its training counterpart are both available for download on the KNIME Hub: 01_Topic Detection Analysis_Training and 02_Topic Detection Analysis_Deployment_WebPortal.

Figure 16. Deployment workflow for the analysis of Hillary Clinton's email corpus (“Image by Author”)

8. Free Text Generation

The latest trend in text mining stems from the deep learning branch of machine learning: the application of Long Short-Term Memory (LSTM) units. LSTM units are a special type of neural unit that deals particularly well with ordered sequences, such as words in a text.

LSTMs have been used successfully to address sentiment analysis problems (see the section above). However, the most interesting application of LSTMs in the field of text mining has been free text generation. To conclude this review, we will cover this use case with two examples: free text generation and machine translation. Let's start in this section with the case study on free text generation.

In free text generation, we train a machine to literally speak freely. Usually a trigger sentence is provided at the beginning, whereupon we expect the model to produce free text in the required amount and in the language and style of the training set. Free text generation can be used in chatbots or to automatically generate journal articles, new books, or new movie scripts.

We have worked on a few case studies here as well. They all follow a similar approach: an LSTM-based deep learning network (Figure 17) is fed with tensors of the n past one-hot encoded characters and trained on sequences of characters from a text corpus to predict the next character. Small variations across case studies include the size of the dictionary (and therefore the size of the one-hot encoded character vectors), the number n of past characters, the number of LSTM states, and the trigger sentence in deployment.

Figure 17. LSTM based neural network for free text generation (“Image by Author”)

Neural networks and deep learning networks can be trained and applied in KNIME Analytics Platform using the KNIME Deep Learning Keras Integration. This integration applies the familiar KNIME GUI to the Keras libraries for deep learning, making the whole deep learning framework easier to use. There are nodes dedicated to implementing the layers of the neural architecture and nodes dedicated to training, monitoring, and applying the resulting network [8].
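In plain Keras code, a comparable many-to-one architecture might look like the sketch below. The dictionary size, window length, and number of LSTM units are assumptions, since the actual values vary across the case studies.

```python
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.models import Sequential

vocab_size = 70  # assumption: size of the character dictionary
n_chars = 100    # assumption: number n of past characters in the input window

# Many-to-one network: a window of one-hot encoded characters in,
# a probability distribution over the next character out
model = Sequential([
    LSTM(256, input_shape=(n_chars, vocab_size)),
    Dense(vocab_size, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()
```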

In Figure 18 you can see the brown nodes in the upper left corner that build the network architecture and, in the lower right corner, the green node named Keras Network Learner that trains the architecture on examples from the training set.

Figure 18. Workflow training an LSTM based neural network to predict the next character based on the input sequence of characters (“Image by Author”)

Figure 19 shows the deployment workflow. Here, a recursive loop starts from the trigger sentence, predicts the next character, removes the oldest character from the input sequence and appends the newly predicted one, predicts the next character, and so on.

Figure 19. Workflow deploying a trained neural network within a recursive loop. Triggered by an initial sentence, the recursive loop creates the new input sequence of characters to predict the next character at each iteration (“Image by Author”)
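In plain Python, the recursive deployment loop might look like this sketch, assuming a trained model like the one above and a seed sequence already converted to character indices:

```python
import numpy as np

def generate(model, seed_indices, n_new, vocab_size):
    """Predict characters one at a time, sliding the input window each step."""
    window, generated = list(seed_indices), []
    for _ in range(n_new):
        # One-hot encode the current window of character indices
        x = np.zeros((1, len(window), vocab_size))
        x[0, np.arange(len(window)), window] = 1.0
        # Pick the most probable next character (sampling is a common alternative)
        next_idx = int(np.argmax(model.predict(x, verbose=0)[0]))
        generated.append(next_idx)
        # Drop the oldest character and append the prediction
        window = window[1:] + [next_idx]
    return generated
```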

With some variations, we have applied this approach to generate rap songs, Shakespeare-like text, fairy tales, and product names. The training set and the related parameters were, of course, customized to the task.

Now, while showing the resulting rap songs is always a sensitive matter due to the sometimes not socially acceptable content, showing the resulting Shakespeare-like text is usually safer. Thus, I will show here only the results we obtained when training the network on Shakespeare plays ("Othello", "King Lear", "Much Ado About Nothing"). In our example, the trigger sentence was the most boring sentence you can find in the English language: the start of a software license agreement. Let's see whether the network was able to improve on the boredom and move to a more Shakespearean style of text (Figure 20).

The network obviously gets confused about which theater character belongs to which text and mixes them all together. However, the English resembles Shakespeare's style, and sometimes it even makes sense semantically. Notice that, since the topic of the trigger sentence is legal, the Shakespeare-like text keeps going with the theme and invokes thieves, traitors, and honesty.

Two similar workflows, working on the task of generating fairy tales, are available on the KNIME Hub: Generate Text Using a Many-To-One LSTM Network (Training) and Generate Text Using a Many-To-One LSTM Network (Deployment). You can download and readapt them to generate your kind of free text.

Figure 20. Shakespeare-like free text generated by an LSTM based neural network, trained on Shakespeare texts (“Image by Author”)

9. Automatic Translation

A special case of free text generation is machine translation. In this case, you have a sentence in the source language as input and the model is supposed to produce the correct translated sentence in the target language as output. See Figure 21 for an example from English (source language) to Italian (target language).

Figure 21. Neural Machine Translation from English to Italian (“Image by Author”).

For this task, the neural architecture for free text generation shown in the previous section is not sufficient and must be extended. We need a many-to-many neural structure that accepts a sequence of many words as input and produces a sequence of many words as output, in the source and target language respectively. Here we adopted an encoder-decoder architecture [8], where both encoder and decoder are LSTM-based neural networks. Example workflows implementing neural translation from English to German can be found on the KNIME Hub: Neural Machine Translation from English to German: Training Workflow and Neural Machine Translation from English to German: Deployment Workflow.
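A bare-bones Keras sketch of such an encoder-decoder training network is shown below. The latent dimension and vocabulary sizes are assumptions, and the inference loop, in which the decoder reuses its own predictions, is omitted for brevity.

```python
from tensorflow.keras.layers import Dense, Input, LSTM
from tensorflow.keras.models import Model

latent_dim = 256                    # assumption: size of the LSTM state
src_vocab, tgt_vocab = 8000, 9000   # assumptions: vocabulary sizes

# Encoder: read the one-hot encoded source sentence, keep only the final states
enc_inputs = Input(shape=(None, src_vocab))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_inputs)

# Decoder: generate the target sentence, initialized with the encoder states
dec_inputs = Input(shape=(None, tgt_vocab))
dec_seq = LSTM(latent_dim, return_sequences=True)(
    dec_inputs, initial_state=[state_h, state_c])
outputs = Dense(tgt_vocab, activation="softmax")(dec_seq)

model = Model([enc_inputs, dec_inputs], outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```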

10. Summary & Conclusions

We have reached the end of this review of text mining case studies: from simple word clouds to machine translation, passing through document classification, topic detection, sentiment analysis, and network analysis.

All solutions have been implemented using the KNIME Textprocessing extension. The last two solutions, involving deep learning networks, also require the KNIME Deep Learning Keras Integration. All case studies are linked to the corresponding workflows on the KNIME Hub.

You are now empowered with all you need: the nodes for text processing and the descriptions of the tasks. Just add your enthusiasm and start your own personal journey through text mining.

11. References

[1] R. Silipo, K. Melcher, “Sentiment Analysis: What’s with the Tone?”, InfoQ, Nov 27, 2018.

[2] K. Jain, “Sentiment analysis using deep learning”, Analytics Vidhya, Aug 7, 2020.

[3] R. Silipo, “Text Encoding: A Review”, Data Science Central, Feb 11, 2019.

[4] V. Tursi, R. Silipo, “From Words to Wisdom”, KNIME Press, 2019.

[5] “Network Theory”, Wikipedia.

[6] S. P. Borgatti, “Centrality and Network Flow”, Social Networks, 27: 55–71, 2005.

[7] D.M. Blei, A.Y. Ng, M.I. Jordan, “Latent Dirichlet Allocation“, Journal of Machine Learning Research 3: 993–1022, 2003.

[8] K. Melcher, R. Silipo, “Codeless Deep Learning with KNIME”, Packt Publishing, 2020.


Rosaria has been mining data since her master's degree, through her doctorate, and in her job positions after that. She is now a data scientist and KNIME evangelist.