
Extractive Text Summarization for Patch Management

Identifying relevant discussions on social media and using only the applicable content for key phrase extraction


By Darci Taylor, Josh Kilts, and Jim Sferas

Photo by Angelica Ribeiro on Unsplash

Introduction

This project was developed to give IT analysts and system administrators guidance on whether a newly released patch is likely to cause problems, based on the issues other IT analysts report in the early days after the release. It has two components: the "Volume" and the "Chirp".

The Volume (see Figure 1) is a value of 0, 1, 2, 3, or 4, displayed as a volume bar to indicate the "loudness" of the chatter on various social media platforms. A value of 3 or 4 is considered to be trending high enough to warrant a Chirp (see Figure 4), which is essentially a key phrase extraction of the collected and processed social media content. For these high-volume patches, we do the phrase extraction with PyTextRank, an open-source implementation of TextRank, a graph-based algorithm derived from PageRank (an algorithm Google uses to rank web pages). The Chirp is then composed of the top phrases the algorithm returns, representing the initial "word on the street" about a patch.

The following sections describe how each of these elements is derived.

Patch Chatter Volume

We begin by scanning social media platforms such as Twitter and Reddit daily for any comments (tweets, submissions, etc.) that mention a patch, keying off the patch ID. We then collect the text of the comments, along with engagement signals including retweets, upvotes, likes, and downvotes. If a patch attracts enough popularity or chatter, the comments about it are aggregated and scored. The score itself is an internal calculation, based on summaries of platform-specific features such as likes and upvotes.

Figure 1. Patch Chatter Volume as seen by the IT Analyst [image by author]

The score also accounts for the source of the comment, to reduce the bias introduced by patch release "announcements" from popular authors (such as Microsoft Help). This step is needed because such posts often mention a long list of patches at once (e.g. "Latest Releases") and say nothing about a user’s experience with any of them, yet because the author has a large following, they can set off irrelevant discussion (see Figure 2). An example would be "Microsoft releases October patch update for patches kb1234, kb4567, …", followed by several comments such as "Great. Here we go again." A flurry of activity like this can drown out the content that carries more meaning, such as "Has anyone had problems with kb1234?" followed by "kb1234 is causing me a Blue Screen of Death."
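The actual score is an internal calculation, so the following is only a minimal sketch of the idea described above: sum engagement per patch, and down-weight comments from authors who mention many patches at once. The field names (`author`, `patch_id`, `likes`, `upvotes`, `retweets`), the threshold, and the weight are all hypothetical, and downvotes are left out because the article does not say how they are used.

```python
from collections import defaultdict


def score_patches(comments, announcement_threshold=3, announcement_weight=0.1):
    """Aggregate a per-patch chatter score from engagement counts.

    Illustrative only: the threshold and weight are made-up values,
    not the ones used in the real scoring.
    """
    # Count how many distinct patches each author mentions.
    patches_by_author = defaultdict(set)
    for comment in comments:
        patches_by_author[comment["author"]].add(comment["patch_id"])

    scores = defaultdict(float)
    for comment in comments:
        # Base engagement: the comment itself plus its likes, upvotes, retweets.
        engagement = (
            1
            + comment.get("likes", 0)
            + comment.get("upvotes", 0)
            + comment.get("retweets", 0)
        )
        # Authors who mention many patches look like release announcements,
        # so their contribution is down-weighted.
        if len(patches_by_author[comment["author"]]) >= announcement_threshold:
            engagement *= announcement_weight
        scores[comment["patch_id"]] += engagement
    return dict(scores)
```

With this weighting, a widely liked announcement that lists three patches adds far less to each patch's score than its raw like count would suggest, while comments from individual users count in full.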

Figure 2. This is a graphical representation of patches and the associated chatter about them on social media. The blue nodes indicate the patches being discussed (larger nodes represent more discussion) whilst the grey and orange nodes represent the people commenting about them. Orange nodes have a higher than normal number of edges (connections between people and patches), showing that these folks are discussing multiple patches, which is indicative of a release announcement. These announcements artificially inflate the score of the patch and are therefore down-weighted since they offer no substantive content as to the user experience for a patch. [image by author]

Once the scores are created, we use them to calculate "ranks". Scores are bucketed into 5 ranks (0–4), based on boundaries set by incremental growth of upper and lower scores by day (accounting for the increase in daily accumulation). See Figure 3.

Figure 3. Each thin line represents a cumulative score for a patch over time. X-axis is days since discussions about the patch began on a social media platform, and the thick lines represent the bounds. Scores aggregate over the first few weeks, when discussions are taking place, then eventually stop aggregating when people have nothing left to say about the patch. [image by author]

A chatter volume rank of 0 indicates the patch has not been mentioned at all, whereas a rank of 1, 2, or 3 falls within the bounds displayed in Figure 3, sectioned off into percentage splits of the scores. A rank of 4 falls above the top bound and indicates a good deal of interaction from people on social media with the comments. A rank of 4 could imply anything from issues and warnings to the critical status or relevance of a patch. The volume is merely indicative of the magnitude of chatter that is detected on an aggregate of social media platforms.
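To make the bucketing concrete, here is a small sketch under stated assumptions. The article does not give the actual bounds or percentage splits, so the `DAILY_BOUNDS` values and the equal-thirds split below are hypothetical, as are the function names.

```python
# Hypothetical (lower, upper) score bounds, indexed by days since the first
# mention of a patch. They grow over time, like the thick lines in Figure 3.
DAILY_BOUNDS = [(1, 10), (2, 25), (4, 45), (6, 60)]


def chatter_rank(score, lower, upper):
    """Map a cumulative score to a rank from 0 to 4 for one day's bounds."""
    if score <= 0:
        return 0  # the patch has not been mentioned at all
    if score > upper:
        return 4  # above the top bound
    # Assumption: the band between the bounds is split into equal thirds.
    fraction = (score - lower) / (upper - lower)
    if fraction < 1 / 3:
        return 1
    if fraction < 2 / 3:
        return 2
    return 3


def rank_on_day(score, day):
    """Look up the bounds for a given day (clamped to the last known day)."""
    lower, upper = DAILY_BOUNDS[min(day, len(DAILY_BOUNDS) - 1)]
    return chatter_rank(score, lower, upper)
```

Because the bounds rise with each day, the same cumulative score can rank as a 4 shortly after release and drop to a lower rank once more chatter would normally have accumulated.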

Patch Chirp

If a patch gets ranked as a 3 or 4, a Text Summarization process (which includes standard NLP text pre-processing) is triggered to provide key phrase extraction of the social media content collected for it. The method involved is an unsupervised, extractive summarization technique called TextRank. The specific library used is PyTextRank, which is a Python implementation of TextRank as a spaCy pipeline extension.

Figure 4. Patch Chirp text summarization as seen by the IT Analyst [image by author]

The TextRank algorithm, a graph-based vertex (node) ranking algorithm, begins by tokenizing the text and annotating the tokens with part-of-speech tags. The vertices are a subset of the tokens, chosen based on their part of speech. An edge is added to the graph for each instance where two words co-occur within a window of N words (this is represented by the adjacency matrix in Figure 5). The ranking algorithm then runs on the graph until convergence, yielding a final score for each node. The algorithm is inspired by PageRank, which was developed by Larry Page, one of the founders of Google, to rank websites. Adjacency between words in the text is analogous to a web page transition probability. PyTextRank builds this graph as a lemma graph that represents links among candidate words and their supporting language: it tokenizes the text, tags parts of speech, applies co-occurrence windowing, and then uses NetworkX to compute TextRank on the graph.
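As a rough illustration of these steps, the sketch below builds a co-occurrence graph and applies the TextRank update from Mihalcea and Tarau until the scores stop changing. It is not PyTextRank's implementation: a small hand-written stopword list stands in for the part-of-speech filter, there is no lemmatization, and the function names, window size, and tolerance are illustrative choices.

```python
import re
from collections import defaultdict

# Stand-in for the part-of-speech filter (an assumption for this sketch).
STOPWORDS = {
    "a", "an", "and", "are", "has", "have", "i", "is", "it", "me",
    "my", "of", "or", "the", "this", "to", "with",
}


def cooccurrence_graph(text, window=3):
    """Link candidate words that appear within `window` consecutive tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        if word in STOPWORDS:
            continue
        for other in tokens[i + 1 : i + window]:
            if other not in STOPWORDS and other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def textrank(graph, damping=0.85, tol=1e-6, max_iter=100):
    """Iterate the TextRank score update on an undirected, unweighted graph."""
    if not graph:
        return {}
    scores = {node: 1.0 for node in graph}
    for _ in range(max_iter):
        new_scores = {}
        for node in graph:
            # Each neighbour shares its score equally among its own neighbours.
            incoming = sum(scores[nbr] / len(graph[nbr]) for nbr in graph[node])
            new_scores[node] = (1 - damping) + damping * incoming
        delta = max(abs(new_scores[n] - scores[n]) for n in graph)
        scores = new_scores
        if delta < tol:
            break
    return scores
```

Sorting the returned dictionary by value gives the candidate keywords in rank order. Words that co-occur with many other well-connected words end up with the highest scores.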

This graph algorithm is based on eigenvector centrality, which computes the "centrality", or approximate importance, of each node in a graph. It’s a measure of the "influence" of a node in a network based on global information drawn from the entire graph. For a keyword/phrase extraction application such as the Patch Chirp, the task is to automatically identify a set of terms or phrases that best represent the collection of comments made about a patch.
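For readers who want to see eigenvector centrality in isolation, here is a minimal power-iteration sketch on a plain adjacency matrix. The function name and tolerance are illustrative, and the iteration uses A + I rather than A, a common trick that leaves the eigenvectors unchanged while avoiding oscillation on bipartite graphs.

```python
import math


def eigenvector_centrality(adjacency, tol=1e-10, max_iter=1000):
    """Approximate the dominant eigenvector of a symmetric adjacency matrix."""
    n = len(adjacency)
    x = [1.0 / math.sqrt(n)] * n
    for _ in range(max_iter):
        # One power-iteration step with (A + I): y = x + A x.
        y = [
            x[i] + sum(adjacency[i][j] * x[j] for j in range(n))
            for i in range(n)
        ]
        # Rescale to unit length so the values stay bounded.
        norm = math.sqrt(sum(v * v for v in y))
        y = [v / norm for v in y]
        if sum(abs(a - b) for a, b in zip(x, y)) < tol:
            return y
        x = y
    return x


# Small example graph: node 0 is connected to nodes 1, 2, and 3,
# and nodes 1 and 2 are also connected to each other.
A = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
centrality = eigenvector_centrality(A)
```

In this example node 0 receives the highest centrality, and node 3, whose only connection is to node 0, receives the lowest. Multiplying `A` by the returned vector gives back the same vector scaled by a constant, which is the eigenvector property shown in Figure 5.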

Figure 5. The centrality matrix is an eigenvector of the adjacency matrix. (source: https://demonstrations.wolfram.com/NetworkCentralityUsingEigenvectors/)

Once these centrality scores are computed, they are used in post-processing to create the phrases: the Patch Chirps. Words with high ranks are marked in the text, and sequences made up of highly ranked words are collapsed into a multi-word phrase. For example, take a comment such as "this patch is causing installation issues". If both "installation" and "issues" are selected as keywords by the TextRank algorithm, they are collapsed into the phrase "installation issues" due to their adjacency and rank.
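The collapsing step can be sketched in a few lines. This is an illustration of the idea rather than PyTextRank's own post-processing; the function name and the whitespace tokenization are assumptions.

```python
def collapse_keywords(tokens, keywords):
    """Merge runs of adjacent keyword tokens into multi-word phrases."""
    phrases = []
    current = []
    for token in tokens:
        if token.lower() in keywords:
            # Extend the current run of highly ranked words.
            current.append(token)
        else:
            # A non-keyword ends the run; emit it if it is non-empty.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


comment = "this patch is causing installation issues".split()
print(collapse_keywords(comment, {"installation", "issues"}))
# prints ['installation issues']
```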

References:

1- Rada Mihalcea, Paul Tarau, "TextRank: Bringing Order into Text" (2004), _Empirical Methods in Natural Language Processing_

2- https://pypi.org/project/pytextrank/

3- https://demonstrations.wolfram.com/NetworkCentralityUsingEigenvectors/

Thank you to Tony Workman for feedback about the article.

