Leveraging NLP to Gain Insights in Social Media, News & Broadcasting

Defining a structured use case roadmap in social media analysis for government entities and private organizations

George Regkas
Towards Data Science


Natural Language Processing (NLP) is a subfield of cognitive science and Artificial Intelligence concerned with the interactions between computers and human language. It focuses on processing and analyzing natural language data, with the broad goal of making machines as capable as human beings at understanding language. The objective here is to showcase various NLP capabilities such as sentiment analysis, speech recognition, and relationship extraction. Challenges in natural language processing include topic identification, natural language understanding, and natural language generation.

Social media analytics is the ability to gather and find meaning in data collected from social channels to support business decisions, and to measure the performance of actions based on those decisions through social media. As an example, consumers in the Middle East and North Africa are among the most active users of social media platforms. The region’s large youth population and high mobile penetration rate have made it an ideal market for companies.

In this article, we show how private and government entities can use a structured use case roadmap to generate insights with NLP techniques, e.g. in the social media, newsfeed, user review and broadcasting domains.

IBM Approach in Social Media Analytics

IBM highlights that with the prevalence of social media: “News of a great product can spread like wildfire. And news about a bad product — or a bad experience with a customer service rep — can spread just as quickly. Consumers are now holding organizations to account for their brand promises and sharing their experiences with friends, co-workers and the public at large.”

Social media analytics helps government entities and companies to address these experiences and use them to:

  1. Spot trends related to offerings and brands
  2. Understand conversations — what is being said and how it is being received
  3. Derive customer sentiment towards products and services
  4. Gauge response to social media and other communications
  5. Identify high-value features for a product or service
  6. Uncover what competitors are saying and its effectiveness
  7. Map how third-party partners and channels may affect performance

IBM Data Science Capabilities in Social Media

The first step for effective social media analytics is developing an objective. Goals can range from increasing revenue to pinpointing service issues. From there, topics or keywords can be selected and parameters such as date range can be set. Sources also need to be specified — responses to YouTube videos, Facebook conversations, Twitter arguments, Amazon product reviews, comments from news sites.

  • Natural language processing and machine learning technologies identify entities and relationships in unstructured data — information not pre-formatted to work with data analytics. Virtually all social media content is unstructured. These technologies are critical to deriving meaningful insights.
  • Segmentation is a fundamental need in social media analytics. It categorizes social media participants by geography, age, gender, marital status, parental status and other demographics. It can help identify influencers in those categories. Messages, initiatives and responses can be better tuned and targeted by understanding who is interacting on key topics.
  • Behavior analysis is used to understand the concerns of social media participants by assigning behavioral types such as user, recommender, prospective user and detractor. Understanding these roles helps develop targeted messages and responses to meet, change or deflect their perceptions.
  • Sentiment analysis measures the tone and intent of social media comments. It typically involves natural language processing technologies to help understand entities and relationships to reveal positive, negative, neutral or ambivalent attributes.
  • Share of voice analyzes prevalence and intensity in conversations regarding brand, products, services, reputation and more. It helps determine key issues and important topics. It also helps classify discussions as positive, negative, neutral or ambivalent.
  • Clustering analysis can uncover hidden conversations and unexpected insights. It makes associations between keywords or phrases that appear together frequently and derives new topics, issues and opportunities. The people that make baking soda, for example, discovered new uses and opportunities using clustering analysis.
  • Dashboards and visualizations (charts, graphs, tables and other presentation tools) summarize and share social media analytics findings, a critical capability for communicating and acting on what has been learned. They also enable users to grasp meaning and insights more quickly and to look deeper into specific findings without advanced technical skills.

Text Preprocessing Activity for NLP Modeling

During text preprocessing, the following stages are performed (a short code sketch follows the list):

  • Tokenization is the process of splitting the text into sentences and the sentences into words, typically lowercasing the words and removing punctuation.
  • Stemming is a crude method for cataloging related words; it essentially chops letters off the end of a word until the stem is reached. This works reasonably well in most cases, although the English language has many exceptions.
  • Lemmatization looks beyond word reduction and considers a language’s full vocabulary to apply a morphological analysis to words. For example, words in the third person are changed to first person, and verbs in past and future tenses are changed into the present tense. Each word is reduced to its lemma, i.e. its dictionary root form.
  • Normalization is a process that converts a list of words to a more uniform sequence. This is useful in preparing text for later processing. By transforming the words to a standard format, other operations are able to work with the data and will not have to deal with issues that might compromise the process. For example, converting all words to lowercase will simplify the searching process.
  • Word Embedding or Vectorization is an NLP methodology for mapping words or phrases from a vocabulary to corresponding vectors of real numbers, which are then used to find word predictions, word similarities and semantics. The process of converting words into numbers is called Vectorization.
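To make these stages concrete, below is a minimal sketch using NLTK, one of several libraries that implement them. The sample sentence and the choice of the Porter stemmer are illustrative assumptions, not part of any specific pipeline.

```python
# Minimal preprocessing sketch with NLTK; assumes nltk is installed and
# the 'punkt' and 'wordnet' resources are available.
import string

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

text = "Saudi Arabia is opening its doors to foreign tourists."

# Tokenization: split into words, lowercase, drop punctuation.
tokens = [t.lower() for t in nltk.word_tokenize(text) if t not in string.punctuation]

# Stemming: crude suffix chopping ("opening" -> "open").
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

# Lemmatization: vocabulary-aware reduction to the dictionary form.
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in tokens]

print(tokens)
print(stems)
print(lemmas)
```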

Text Representation Models in NLP

In this section, we will outline the most popular text representation models used in word and sentence vectorization, though we will not deep dive into technical details. The simple feature extraction methods currently used with text data are:

1. Bag-of-Words

The bag-of-words model is simple to understand and implement and has seen great success in problems such as language modeling and document classification.

It is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

The bag-of-words model is commonly used in methods of document classification where the (frequency of) occurrence of each word is used as a feature for training a classifier.

We will show how the Bag-of-Words model works in the text vectorization space. Let’s consider the following three headlines from Middle East news articles:

  1. Saudi Arabia issues first permanent residencies to foreigners
  2. Saudi Arabia Is Opening Its Doors to Foreign Tourists for the First Time
  3. Revealed the impact of Riyadh Season on Saudi tourism

Then, for each word, the frequency of that word in the corresponding document is recorded:

Term Frequency Table in Articles

The above table depicts the training features, containing the term frequency of each word in each document. This is called the bag-of-words approach, since only the number of occurrences, and not the sequence or order of words, matters in this approach.
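As an illustration, the same term-frequency matrix can be produced with scikit-learn’s CountVectorizer; note that the exact vocabulary it builds (lowercased, with its default tokenization) may differ slightly from the hand-built table above.

```python
# Bag-of-words sketch with scikit-learn, applied to the three headlines above.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Saudi Arabia issues first permanent residencies to foreigners",
    "Saudi Arabia Is Opening Its Doors to Foreign Tourists for the First Time",
    "Revealed the impact of Riyadh Season on Saudi tourism",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # 3 x vocabulary-size term-frequency matrix

print(vectorizer.get_feature_names_out())
print(X.toarray())  # each row holds one document's word counts
```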

2. TF-IDF

TF: Term Frequency, which measures how frequently a term occurs in a document. Since every document is different in length, it is possible that a term will appear many more times in long documents than in shorter ones. Thus, the term frequency is often divided by the document length (i.e. the total number of terms in the document) as a way of normalization:

TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)

TF-IDF weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, it is known that certain terms, such as “is”, “of”, and “that”, may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

IDF(t) = log(total number of documents / number of documents containing term t)
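A minimal TF-IDF sketch with scikit-learn follows; note that scikit-learn uses a smoothed variant of the IDF formula above, so the exact weights differ slightly from a by-hand computation.

```python
# TF-IDF sketch with scikit-learn, reusing the three headlines.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Saudi Arabia issues first permanent residencies to foreigners",
    "Saudi Arabia Is Opening Its Doors to Foreign Tourists for the First Time",
    "Revealed the impact of Riyadh Season on Saudi tourism",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Terms appearing in all three documents ("saudi") get a low IDF weight,
# while rarer terms ("residencies", "tourism") are scaled up.
for term, weight in zip(vectorizer.get_feature_names_out(), X.toarray()[0]):
    if weight > 0:
        print(f"{term}: {weight:.2f}")
```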

3. Word Embeddings with Word2Vec

The Word2Vec model is used for learning vector representations of words called “word embeddings”. This is typically done as a preprocessing step, after which the learned vectors are fed into a discriminative model to generate predictions and perform all sorts of interesting things. It captures the semantic meaning of words. I will analyze Word2Vec in an exhaustive manner in my next blog post.
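As a small taste before that post, here is a minimal Word2Vec sketch with gensim. The toy corpus is an illustrative assumption; meaningful embeddings require training on a very large corpus.

```python
# Word2Vec sketch with gensim (4.x API); the tiny corpus only shows the API.
from gensim.models import Word2Vec

sentences = [
    ["saudi", "arabia", "issues", "permanent", "residencies"],
    ["saudi", "arabia", "opens", "doors", "to", "foreign", "tourists"],
    ["riyadh", "season", "boosts", "saudi", "tourism"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

print(model.wv["tourism"][:5])         # first 5 values of the 50-d embedding
print(model.wv.most_similar("saudi"))  # nearest words in embedding space
```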

Source: https://www.infoq.com/presentations/nlp-practitioners/?itm_source=presentations_about_Natural-Language-Processing&itm_medium=link&itm_campaign=Natural-Language-Processing

Indicative Data & AI Use Case Roadmap

Below you can see an outline of a structured use case roadmap that can be customized to government entities and private organizations taking advantage of Natural Language Processing and Artificial Intelligence to gain significant insights from social media, newsfeed and broadcasting content.

1. Topic Modeling & Text Classification

In natural language processing, there is a hierarchy of lenses through which we can extract meaning: from words to sentences to paragraphs to documents. At the document level, one of the most useful ways to understand text is by analyzing its topics. The process of learning, identifying and extracting these topics across a collection of documents is called topic modeling.

There are several existing algorithms you can use to perform topic modeling. The most common are Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF). In this section I give an overview of these techniques without getting into technical details.

Latent Semantic Analysis, or LSA, is one of the foundational techniques in topic modeling. The core idea is to take a matrix of what we have — documents and terms — and decompose it into a separate document-topic matrix and a topic-term matrix. The first step is generating our document-term matrix.

A frequently used methodology in topic modeling, the Latent Dirichlet Allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s presence is attributable to one of the document’s topics. LDA is an example of a topic model and belongs to the machine learning toolbox and in wider sense to the artificial intelligence toolbox.

Non-negative matrix factorization (NMF) can be applied to topic modeling, where the input is a term-document matrix, typically TF-IDF normalized. It is derived from multivariate analysis and linear algebra, where a matrix A is factorized into (usually) two matrices W and H, with the property that all three matrices have no negative elements.

Source: https://www.researchgate.net/figure/Conceptual-illustration-of-non-negative-matrix-factorization-NMF-decomposition-of-a_fig1_312157184
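To illustrate, here is a minimal scikit-learn sketch that fits both LDA and NMF on a toy corpus; the documents and the choice of two topics are assumptions for demonstration, and real use needs far more data and tuned hyperparameters.

```python
# Topic-modeling sketch: LDA on raw counts, NMF on a TF-IDF matrix.
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "Saudi Arabia issues first permanent residencies to foreigners",
    "Saudi Arabia is opening its doors to foreign tourists",
    "Revealed the impact of Riyadh Season on Saudi tourism",
    "Oil prices rise after OPEC production cuts",
    "OPEC agrees to extend oil production cuts",
]

# LDA works on raw term counts ...
counts = CountVectorizer(stop_words="english")
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts.fit_transform(docs))

# ... while NMF is typically applied to a TF-IDF matrix.
tfidf = TfidfVectorizer(stop_words="english")
nmf = NMF(n_components=2, random_state=0)
nmf.fit(tfidf.fit_transform(docs))

# Print the top words per topic for each model.
for name, model, vec in [("LDA", lda, counts), ("NMF", nmf, tfidf)]:
    terms = vec.get_feature_names_out()
    for i, weights in enumerate(model.components_):
        top = [terms[j] for j in weights.argsort()[-4:][::-1]]
        print(f"{name} topic {i}: {top}")
```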

Identifying topics is beneficial for various purposes, such as clustering documents, organizing online content for information retrieval, and making recommendations. Multiple content providers and news agencies use topic models to recommend articles to readers. Similarly, recruiting firms use them to extract job descriptions and map them to candidate skill sets.

Media companies and media regulators can take advantage of topic modeling capabilities to classify topics and content in news media, identifying relevant topics, currently trending topics and spam news. In the chart below, the IBM team has applied a natural language classification model to identify relevant, irrelevant and spam news.

Clustering of documents based on relevance and non-relevance to a specific topic. Source: IBM Data Science Elite

Topic modeling helps in exploring large amounts of text data, finding clusters of words, similarity between documents, and discovering abstract topics. As if these reasons weren’t compelling enough, topic modeling is also used in search engines wherein the search string is matched with the results.

2. Sentiment Analysis

Sentiment analysis refers to identifying the sentiment orientation (positive, neutral or negative) of written or spoken language. A more granular alternative approach offers greater precision in the polarity analysis and aims to identify emotions in expressions (e.g. happiness, sadness, frustration, surprise). This use case aims to develop a sentiment analysis methodology and visualization that can provide significant insight into the levels of sentiment across various source types and characteristics.

Today, businesses want to know what buyers say about their brand and how they feel about their products. However, with all of the “noise” filling our email, social and other communication channels, listening to customers has become a difficult task. In this guide to sentiment analysis, you’ll learn how a machine learning-based approach can provide customer insight on a massive scale and ensure that you don’t miss a single conversation.
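One lightweight way to prototype such listening is NLTK’s VADER analyzer, a rule-based sentiment model tuned for social media text; the sample posts below are invented for illustration.

```python
# Sentiment sketch with NLTK's VADER; requires the vader_lexicon resource.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
analyzer = SentimentIntensityAnalyzer()

posts = [
    "Absolutely loving the new service, great job!",
    "Worst customer support experience I have ever had.",
    "The store opens at 9 am.",
]

for post in posts:
    scores = analyzer.polarity_scores(post)
    # 'compound' is a normalized score in [-1, 1]; thresholding it yields
    # the positive / neutral / negative orientation described above.
    print(f"{scores['compound']:+.2f}  {post}")
```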

Sentiment Analysis Dashboard. Source: IBM Watson Analytics

3. Named Entity Recognition

Named Entity Recognition is the process of recognizing information units in unstructured text: names, including person, organization and location names, and numeric expressions, including time, date, money and percent expressions. The goal is to develop practical, domain-independent techniques that automatically detect named entities with high accuracy.

Named entity recognition (NER) is probably the first step towards information extraction: it seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations and locations, expressions of time, quantities, monetary values, percentages, etc. NER is used in many fields of Natural Language Processing (NLP), and it can help answer many real-world questions, such as:

  • Which public figures, governments, countries or private organizations were mentioned in the news article?
  • Were specified products mentioned in complaints or reviews?
  • Does the tweet contain the name of a person? Does the tweet contain this person’s location?
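A minimal sketch of how such questions can be answered with spaCy is shown below; it assumes the small English model has been installed via `python -m spacy download en_core_web_sm`, and the sentence paraphrases the Gulf News headline used in this article.

```python
# NER sketch with spaCy's small English model.
import spacy

nlp = spacy.load("en_core_web_sm")

text = ("Saudi Arabia issued its first permanent residency permits "
        "to foreigners in November 2019.")

doc = nlp(text)
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# Expected entities include "Saudi Arabia" (GPE) and "November 2019" (DATE),
# though the exact output may vary by model version.
```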

The following example illustrates how named entity recognition works on an excerpt from the article mentioned above.

News Article Source: https://gulfnews.com/world/gulf/saudi/saudi-arabia-issues-first-permanent-residencies-to-foreigners-1.1573622585766
Excerpt from the Gulf News Article regarding Permanent Residency in Saudi Arabia
Named Entity Recognition in the Gulf News Article
Named Entity Classification in the Gulf News Article

4. Part-of-Speech Tagging (POS)

In corpus linguistics, part-of-speech tagging (POS tagging), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech based on both its definition and its context — i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs etc.

The following example shows how POS tagging can be applied to a specific sentence to extract parts of speech, identifying pronouns, verbs, nouns, adjectives etc.

Part-of-Speech Identification and Relationship Extraction
Part-of-Speech Tagging in a News Article
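For a programmatic view, here is a minimal POS tagging sketch with spaCy, using the same small English model as in the NER example; the sentence is again drawn from the article headlines.

```python
# POS tagging sketch with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("Saudi Arabia is opening its doors to foreign tourists.")
for token in doc:
    # pos_ is the coarse tag (VERB, NOUN, ...), tag_ the fine-grained one.
    print(f"{token.text:10} {token.pos_:6} {token.tag_}")
```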

5. Relationship Extraction & Textual Similarity

Text mining collects and analyzes structured and unstructured content in documents, social media, comments, newsfeeds, databases, and repositories. The use case can leverage a text analytics solution for crawling and importing content, parsing and analyzing content, and creating a searchable index. Semantic analysis describes the process of understanding natural language (the way that humans communicate) based on meaning and context. It analyzes the context in the surrounding text, and it analyzes the text structure, to accurately disambiguate the meaning of words that have more than one definition.

This technique identifies the relationships present between pairs of named entities previously identified in the given text. Semantic textual similarity deals with determining how similar two pieces of text are. This can take, for example, the form of assigning a score or of a classification (similar vs. different).
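A minimal similarity sketch follows, using cosine similarity between TF-IDF vectors. This captures lexical overlap only; embedding-based models would be needed for deeper semantic similarity, and the 0.5 threshold is an arbitrary assumption.

```python
# Textual similarity sketch: cosine similarity of TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pair = [
    "Saudi Arabia issues first permanent residencies to foreigners",
    "Saudi Arabia is opening its doors to foreign tourists",
]

X = TfidfVectorizer().fit_transform(pair)
score = cosine_similarity(X[0], X[1])[0, 0]

print(f"similarity score: {score:.2f}")
print("similar" if score > 0.5 else "different")  # score -> classification
```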

Textual Relationship Example: Age Query. Source: IBM

6. Content Analytics for Video Broadcasting

Content analytics is an NLP-driven approach to clustering videos (e.g. on YouTube) into relevant topics based on user comments. The most frequently used technique is LDA topic modeling with bag of words, as discussed above; it is an unsupervised learning technique that treats documents as bags of words.

  • It assumes each document was generated by picking a set of topics, and then picking a set of words for each topic. It then tries to identify which word belongs to which topic in a probabilistic manner.
  • For each word in a document, it assumes that its topic assignment is wrong while every other word is assigned the correct topic, and it updates the assignment accordingly. The final output is a classification and ranking of video IDs into specific topics.

Another way to approach this use case is with a technique called Singular Value Decomposition (SVD). The singular value decomposition is a factorization of a real or complex matrix that generalizes the eigendecomposition of a square normal matrix to any m×n matrix via an extension of the polar decomposition. The SVD methodology includes the text-preprocessing stage and the term-frequency matrix described above.
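As an illustrative sketch, scikit-learn’s TruncatedSVD applies this decomposition directly to a sparse term-document matrix of comments; the toy comments below are assumptions for demonstration.

```python
# SVD-based topic extraction (LSA) on a TF-IDF matrix of video comments.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

comments = [
    "great video very helpful tutorial",
    "helpful tutorial thanks for sharing",
    "the music in this video is amazing",
    "amazing soundtrack love the music",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(comments)

svd = TruncatedSVD(n_components=2, random_state=0)
comment_topics = svd.fit_transform(X)  # each comment mapped into 2 latent topics

terms = tfidf.get_feature_names_out()
for i, weights in enumerate(svd.components_):
    top = [terms[j] for j in weights.argsort()[-3:][::-1]]
    print(f"latent topic {i}: {top}")
```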

In the typical model output below, we can see the most dominant topics after applying clustering to the video comments, and how these topics are associated with each other in the intertopic distance map. The horizontal bar chart on the right side shows the most relevant terms for a specific topic category based on their frequency of appearance.

Topic clustering results: intertopic distance map. Source: IBM

7. Topic Trend Detection & Root Cause Analysis

In text analytics, an anomaly or outlier can be an outlier post, an irregular comment or even a spam newsfeed item that does not seem relevant to the rest of the data.

Root Cause Analysis (RCA) is the process of identifying factors that cause defects or quality deviations in the manufactured product. Common examples of root cause analysis in manufacturing include methodologies such as the Fishbone diagram. To perform RCA using machine learning, we need to be able to detect that something is out of the ordinary, or in other words, that an anomaly or an outlier is present. The root cause analysis process is outlined in the following diagram.

Source: https://medium.com/oceanize-geeks/root-cause-analysis-a992b01685b2

The machine learning model is trained to analyze topics in regular social media feeds, posts and reviews. An outlier can take the form of any pattern of deviation in the amplitude, period, or synchronization phase of a signal when compared to normal newsfeed behavior.

The algorithm forms a prediction based on the current behavioral pattern of the anomaly. If the predicted values exceed the threshold confirmed during the training phase, an alert is sent.

To perform root cause analysis with machine learning, we also need to be able to detect what is trending. Trend analysis in text mining is the method of deriving new, previously unseen knowledge from unstructured, semi-structured and structured textual data. It aims to detect spikes of events and topics in terms of frequency of appearance in specific sources or domains. This gives significant insight for detecting spam and fraudulent news and posts.

The objective is to analyze newsfeeds, comments and/or social media posts and try to identify items in the stream of events that appear with high frequency or do not seem to fit with the rest of the general context, e.g. spam, fraudulent or trending posts, or changes in topic trends.

The basic structure of an autoencoder. Source: https://www.compthree.com/blog/autoencoder/

One of the most successful techniques in this domain is the use of autoencoders for outlier topic detection. The autoencoder is an unsupervised artificial neural network, and one of its main uses is detecting outliers, i.e. observations that “stand out” from the norm of a dataset. If the model is trained on a given dataset, outliers will have a higher reconstruction error, so they are easy to detect with this neural network.
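Below is a minimal sketch of this idea with Keras: posts are encoded as TF-IDF vectors, a small autoencoder learns to reconstruct them, and the post with the highest reconstruction error is flagged. The corpus, network size and training settings are all illustrative assumptions.

```python
# Autoencoder-based outlier detection sketch (tf.keras).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow import keras

posts = [
    "new phone battery lasts all day great camera",
    "camera quality is great and battery life is solid",
    "battery drains fast but the camera is decent",
    "WIN FREE MONEY CLICK THIS LINK NOW",  # spam outlier
]

X = TfidfVectorizer().fit_transform(posts).toarray().astype("float32")
dim = X.shape[1]

# Small dense autoencoder with a low-dimensional bottleneck.
autoencoder = keras.Sequential([
    keras.layers.Input(shape=(dim,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(dim, activation="sigmoid"),
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=200, verbose=0)

# Reconstruction error per post; the spam post typically scores highest.
errors = np.mean((X - autoencoder.predict(X, verbose=0)) ** 2, axis=1)
for post, err in zip(posts, errors):
    print(f"{err:.5f}  {post}")
```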

8. NLP Model Insights & Visualization

Visually representing the content of an NLP model or a text exploratory analysis is one of the most important tasks in the field of text mining. From a data science and NLP point of view, we not only explore the content of documents from different aspects and at different levels of detail, but also summarize a single document, show the words and topics, detect events, and create storylines. In many cases, there are gaps between visualizing unstructured (text) data and structured data. For example, many text visualizations do not represent the text directly; they represent an output of a natural language processing model, e.g. word counts, character lengths, or word sequences.

Single-variable or univariate visualization is the simplest type of visualization which consists of observations on only a single characteristic or attribute. Univariate visualization includes histogram, bar plots and line charts.

In the chart below we can see the distribution of sentiment polarity, on a scale of -1 to 1, for customer reviews based on recommendations.

Distribution of sentiment polarity on recommendation for movie reviews Source: https://towardsdatascience.com/a-complete-exploratory-data-analysis-and-visualization-for-text-data-29fb1b96fb6a

Wordclouds are a popular way of displaying how important words are in a collection of texts. Basically, the more frequent the word is, the greater space it occupies in the image. One of the uses of Word Clouds is to help us get an intuition about what the collection of texts is about.

Let’s suppose you want to build a text classification system. If you want to see the different frequent words in the different categories, you can build a Word Cloud for each category and see the most popular words inside each one.
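A minimal sketch with the `wordcloud` package is shown below; the category text is an invented stand-in for the concatenated documents of one category.

```python
# Word cloud sketch: word size reflects frequency in the category text.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

category_text = (
    "tourism visa travel tourists visa tourism Riyadh season "
    "entertainment events tourism travel"
)

wc = WordCloud(width=600, height=300, background_color="white").generate(category_text)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```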

Wordcloud showing the significance of each word in the text. Source: IBM

Looking at the most frequent words in each topic, we get a sense that we may not reach any degree of separation across the topic categories. In other words, we could not separate review text by department using topic modeling techniques.

In summary, the objective of this article was to give an overview of potential areas where NLP can provide a distinct advantage and actionable insights. The list will be enhanced with additional use cases in the future.

Disclaimer: The views expressed here are partly those of the article’s author(s) and may or may not represent the views of IBM Corporation. Part of the content of this blog is copyrighted with all rights reserved, unless otherwise noted as belonging to IBM Corporation (e.g. photos, images).
