As applied to systems that monitor IT infrastructure and business processes, NLP algorithms can be used to solve text classification problems and to build various dialogue systems. This article briefly describes the NLP methods used in the AIOps microservices of the Monq platform.
1. Introduction
In purely theoretical terms, the ultimate goal of all natural language processing (NLP) algorithms is to create artificial intelligence (AI) capable of understanding human language, where "understand" means both "grasp the meaning" (text analysis) and "make meaningful statements" (text synthesis). This goal is still very far away: to genuinely understand a living language, an AI agent would need the full breadth of knowledge about the world around it, as well as the ability to interact with that world, i.e. it would have to be a "really thinking" agent. So for now, in practical terms, natural language processing can be viewed as a set of algorithmic methods for extracting useful information from text data.
There is a fairly wide range of practical tasks where natural language processing is needed:
- machine translation of texts from foreign languages,
- automatic annotation of text (summarization),
- classification of texts by categories (spam / non-spam, news classification, sentiment analysis, etc.),
- dialogue systems (chat bots, question-answer systems),
- recognition of named entities (finding proper names of people, companies, locations, etc. in the text).
As applied to systems that monitor IT infrastructure and business processes, NLP algorithms can be used to solve text classification problems and to build various dialogue systems. This article briefly describes the natural language processing methods used in the AIOps microservices of the Monq platform for hybrid IT monitoring, in particular for analyzing events and logs streamed into the system.

2. Converting text to numeric vector representations
Mathematical algorithms work with numbers, so in order to apply the mathematical apparatus to processing and analyzing text data, words and sentences must first be converted into numerical vectors, preferably while preserving semantic relationships and word order, i.e.:
- a numeric vector should reflect the content and structure of the text,
- words and sentences that are similar in meaning should have similar values of vector representations.
Currently, there are two main approaches for transforming text, more precisely a set of texts (or, in NLP terminology, a corpus of documents), into vector representations:
- topic modeling – several types of statistical models for finding hidden (latent) topics in a document corpus: probabilistic latent semantic analysis (PLSA), latent Dirichlet allocation (LDA),
- various models of contextual representation of words based on the distributive hypothesis: neural network models Word2Vec, GloVe, Doc2Vec, fastText and some others.
The vector representations produced by these algorithms make it easy to compare texts, search for similar ones, categorize and cluster texts, and so on.
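For example, once two texts have been mapped to numeric vectors, their closeness can be estimated with the cosine measure; below is a minimal illustration with made-up vectors (the values are purely hypothetical):
import numpy as np

def cosine_similarity(u, v):
    ## cosine of the angle between two vectors: close to 1.0 - similar texts, close to 0.0 - unrelated
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

vecA = np.array([0.35, 0.50, 0.15])  ## hypothetical vector representation of document A
vecB = np.array([0.30, 0.55, 0.15])  ## hypothetical vector representation of document B
print(cosine_similarity(vecA, vecB))  ## a value close to 1.0 means the documents are similar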
About 10–15 years ago, only experts could take part in natural language processing projects, since this required serious knowledge of mathematics, machine learning and linguistics. Now developers can use many ready-made tools and libraries to solve NLP problems. In Python in particular, a fairly wide range of NLP capabilities for training various models is provided by the combination of two modules, nltk and gensim – the NLP functionality of the Monq platform is largely (though not exclusively) built on them.
3. Text data preprocessing for model training
Preprocessing text data is an important step in building various NLP models – the GIGO principle ("garbage in, garbage out") holds here more than anywhere else. The main stages of text preprocessing are tokenization, normalization (stemming or lemmatization) and removal of stop words. Methods for extracting phrases that commonly co-occur (in NLP terminology, n-grams or collocations) and compiling a dictionary of tokens are often included here as well, but we treat them as a separate stage.
Tokenization is the splitting of text into text units, tokens, which can be single words, phrases or whole sentences. In this context, a document is a collection of tokens belonging to one semantic unit (for example, a sentence or a paragraph), and a corpus is the entire collection of documents. In the process of tokenization the text is:
- broken down into sentences,
- cleared of punctuation,
- converted to lowercase,
- split into tokens (most often words, but sometimes syllables or combinations of letters).
Normalization is the reduction of words to their standard morphological form, using either stemming – cutting a word down to its stem ("argued" – "argu", "fishing" – "fish") – or lemmatization – resolving a word to its canonical dictionary form ("is" – "be", "written" – "write"). For Russian, lemmatization is preferable, and as a rule two different lemmatization algorithms have to be used – one for Russian (in Python the pymorphy2 module can be used for this) and one for English.
Removal of stop words is clearing the text of words that carry no useful information. These are most often common words, pronouns and function words (prepositions, articles, conjunctions). In Python, the nltk module itself provides stop-word lists for different languages, and somewhat larger sets are provided by the dedicated stop-words module – for completeness, different stop-word lists can be combined. Quite often, first names and patronymics are also added to the list of stop words.
Here is a relatively complete example of Python code for preprocessing a collection of texts in Russian and English (the output is a list of tokenized documents):
import re
from nltk.tokenize.punkt import PunktSentenceTokenizer
sentTokenizer = PunktSentenceTokenizer()
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
import pymorphy2
morph = pymorphy2.MorphAnalyzer()
import langid
from nltk.corpus import stopwords
from stop_words import get_stop_words
langid.set_languages(['en', 'ru'])  ## restrict language detection to English and Russian

stopWordsEn = set().union(get_stop_words('en'), stopwords.words('english'))
stopWordsRu = set().union(get_stop_words('ru'), stopwords.words('russian'))
stopWords = list(set().union(stopWordsEn, stopWordsRu))
stopWords.sort()

textCollection = ['The main stages of text preprocessing include tokenization methods, normalization methods, and removal of stop words.', 'Text tokenization is the splitting of text into text units. During the tokenization process, the text is first divided into sentences.', 'Tokenization in Python is the most primary step in any natural language processing program. ', 'We have imported re library and used "w+" for picking up words from the expression.']

textCollTokens = []
for text in textCollection:  ## loop over the collection of texts
    sentList = [sent for sent in sentTokenizer.tokenize(text)]  ## split the text into sentences
    tokens = [word for sent in sentList for word in tokenizer.tokenize(sent.lower())]  ## lowercase and split into word tokens
    lemmedTokens = []
    for token in tokens:  ## lemmatize each token with the lemmatizer for its detected language
        if langid.classify(token)[0] == 'en':
            lemmedTokens.append(lemmatizer.lemmatize(token))
        elif langid.classify(token)[0] == 'ru':
            lemmedTokens.append(morph.parse(token)[0].normal_form)
    goodTokens = [token for token in lemmedTokens if token not in stopWords]  ## drop stop words
    textCollTokens.append(goodTokens)
4. Extraction of n-grams and compilation of a dictionary of tokens
Extraction of stable phrases from a text corpus (n-grams or collocations, for example "New York", "Central Bank", etc.) and their use as single tokens in NLP models is a fairly standard way to increase the quality of such models. There are several algorithms for detecting collocations in a collection of texts, based on various statistics of word co-occurrence in that collection. We will not go into the pros and cons of these algorithms here, but will simply use the methods for extracting bigrams and trigrams provided by the gensim module:
from gensim.models.phrases import Phrases, Phraser
bigrams = Phrases(textCollTokens, min_count=1, threshold=5)  ## finding bigrams in the collection
trigrams = Phrases(bigrams[textCollTokens], min_count=2, threshold=5)  ## finding trigrams
bigramPhraser = Phraser(bigrams)  ## setting up parser for bigrams
trigramPhraser = Phraser(trigrams)  ## parser for trigrams
docCollTexts = []
for doc in textCollTokens:
    docCollTexts.append(trigramPhraser[bigramPhraser[doc]])
The final stage of text data preprocessing is compiling a dictionary of tokens (taking into account all found n-grams) for the given collection of texts. As a rule, in order to get into the dictionary, a token must meet some additional criteria that filter out "noise" and "background" – in our case (the code example follows below):
- the token must appear in the collection at least a certain number of times (parameter no_below),
- the token must not occur in more than half of the texts in the collection (parameter no_above).
from gensim import corpora
textCollDictionary = corpora.Dictionary(docCollTexts)
textCollDictionary.filter_extremes(no_below=1, no_above=0.5, keep_n=None)
Thus, after preprocessing, the collection of texts is transformed into a list of tokenized (including n-grams) documents, and in addition we have a dictionary of tokens for the given body of texts as well as "trained" bigram and trigram parsers. The last three objects must be saved to disk for later use, since they are an important part of the NLP model under construction (for convenience, the token dictionary is also saved as a text file):
bigramPhraser.save('bigramPhraser.pkl')
trigramPhraser.save('trigramPhraser.pkl')
textCollDictionary.save('textCollDictionary.pkl')
textCollDictionary.save_as_text('textCollDictionary.txt')
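Later (for example, in another microservice) these objects can be restored from disk with the corresponding load methods; a minimal sketch, assuming the files saved above are available:
from gensim.models.phrases import Phraser
from gensim import corpora
bigramPhraser = Phraser.load('bigramPhraser.pkl')  ## restore the bigram parser
trigramPhraser = Phraser.load('trigramPhraser.pkl')  ## restore the trigram parser
textCollDictionary = corpora.Dictionary.load('textCollDictionary.pkl')  ## restore the token dictionary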
5. Topic modeling with Latent Dirichlet Allocation (LDA)
As mentioned above, one of the approaches to transforming a collection of texts into vector representations is topic modeling. In our platform, we use the latent Dirichlet allocation (LDA) method for this. We will not describe in detail how LDA processes text internally (the original article with a description is here), but we will try to outline the basic principles of the method:
- each document in the collection is described by a set of latent (hidden) topics,
- a topic is a set of words with certain weights, or probabilities (a multinomial probability distribution over the words of the given dictionary),
- a document is a random independent sample of words (a bag of words) generated by the latent set of topics (see the formula right after this list),
- matrices that describe the frequency of words occurring within a specific document and within the entire collection are built,
- using Gibbs sampling – an algorithm for generating random samples of a set of variables from some joint distribution (in this case involving the Dirichlet distribution) – word-to-topic and topic-to-document matrices are found that reproduce the given corpus of texts most "correctly",
- as the output for each document from the collection, the LDA algorithm defines a topic vector with its values being the relative weights of each of the latent topics in the corresponding text.
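In a compact form, these assumptions amount to the standard LDA factorization: the probability of observing word w in document d is modeled as p(w | d) = Σ_t p(w | t) · p(t | d), where the sum runs over the latent topics t, and the algorithm fits the word-topic distributions p(w | t) and the topic-document distributions p(t | d) so that this mixture reproduces the observed corpus as well as possible (the notation here is generic and not tied to any particular implementation).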
The essence of the method is illustrated by the following picture, which, loosely speaking, could be obtained after running the LDA method on a corpus of Russian fairy tales. At the output, the LDA algorithm would match the "Tale of the Fisherman and the Fish" with the topic vector T = (0.35, 0.5, 0.15), where 0.35 is the weight of topic 1, 0.5 the weight of topic 2, and 0.15 the weight of topic 3. Note that the LDA method does not interpret topics in any way and does not offer a generic name for any of them – topics are simply sets of words numbered by an index. Sometimes one or several words with the maximum weights within a given topic are taken as a simple "meaningful" identifier of that topic.

Using the gensim module, it is very easy to train an LDA model in Python:
from gensim import models
import numpy as np
textCorpus = [textCollDictionary.doc2bow(doc) for doc in docCollTexts]
nTopics = 4
ldamodel = models.ldamodel.LdaModel(textCorpus, id2word=textCollDictionary, num_topics=nTopics, passes=10)
ldamodel.save('ldaModel')
textTopicsMtx = np.zeros(shape=(len(textCorpus), nTopics), dtype=float)
for k in range(len(textCorpus)):  ## make the matrix of docs to topic vectors
    for tpcId, tpcProb in ldamodel.get_document_topics(textCorpus[k]):
        textTopicsMtx[k, tpcId] = tpcProb
In this example, we train an LDA model, save it to disk, and also build a matrix of topic vectors for all documents in the corpus (a matrix of topic representations of the collection of texts), which can later be used to solve specific problems of categorizing and clustering texts. This is how the topic matrix looks for our simple example:

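With the trained model and the saved preprocessing objects, the topic vector of a new incoming document can be computed in the same way; here is a minimal sketch, assuming the new text has already been tokenized and lemmatized (as in section 3) into the hypothetical list newDocTokens:
newDocTokens = ['text', 'tokenization', 'step', 'natural', 'language', 'processing']  ## hypothetical preprocessed document
newDocBow = textCollDictionary.doc2bow(trigramPhraser[bigramPhraser[newDocTokens]])  ## apply the n-gram parsers and convert to bag of words
newDocTopicVec = np.zeros(nTopics, dtype=float)
for tpcId, tpcProb in ldamodel.get_document_topics(newDocBow):  ## fill in the weights of the latent topics
    newDocTopicVec[tpcId] = tpcProb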
There are several modules in Python to visualize the topics themselves and their composition. Here’s an example of code to visualize LDA results using the wordcloud and matplotlib modules:
import matplotlib.pyplot as plt
from wordcloud import WordCloud
cloud = WordCloud(background_color='white', width=2500, height=1800, max_words=5, colormap='tab10',prefer_horizontal=1.0)
topics = ldamodel.show_topics(num_topics=nTopics, num_words=5, formatted=False)
fig, axes = plt.subplots(1, nTopics, figsize=(8, 3), sharex=True, sharey=True)
for i, ax in enumerate(axes.flatten()):
    topicWords = dict(topics[i][1])  ## word -> weight dictionary for topic i
    cloud.generate_from_frequencies(topicWords, max_font_size=300)
    ax.imshow(cloud)
    ax.set_title('Topic ' + str(i), fontdict=dict(size=16))
    ax.axis('off')
plt.subplots_adjust(wspace=0, hspace=0)
plt.show()
As a result, we get the following picture for our example:

Since in this example the collection of texts is just a set of separate sentences, the topic analysis in effect singled out a separate topic for each sentence (document), although it attributed the sentences in English to one topic.
In principle, the LDA algorithm does not have many tuning parameters. The main one is the number of latent topics, which must be set in advance, and choosing the optimal value for it is a very nontrivial task. In our practice, we follow a simple rule: we take the size of the dictionary of tokens built on the given corpus of texts and divide it by a number between 10 and 20 (an average number of words in a single topic) – the resulting value is passed to the LDA algorithm.
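In code, this rule of thumb could look roughly as follows (the divisor 15 is just an illustrative value from the 10–20 range):
dictSize = len(textCollDictionary)  ## number of tokens in the dictionary
nTopics = max(2, dictSize // 15)  ## rough heuristic: dictionary size divided by an average topic size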
6. Contextual representation of words in Word2Vec and Doc2Vec models
The matrix of topic vectors produced for a collection of texts by the LDA procedure constitutes the first part of the full vector representation of the text corpus; the second part is formed from semantic vectors, or contextual representations.
The concept of semantic vectors is based on the distributive hypothesis: the meaning of a word is not what sounds and letters it consists of, but what words it most often occurs among, i.e. the meaning of a word is not stored somewhere within it, but is concentrated in its possible contexts. Basically, the semantic vector of a word shows how often it appears next to other words.
The simplest version of semantic vectors is a matrix of how frequently words are used in one context, i.e. at a distance of no more than n words from each other (n=10 is often used). If you take, for example, an issue of Reader’s Digest magazine and count how often different pairs of words fall within the same context window of a certain size, you get a table, a part of which might look like this:

Each row of numbers in this table is a semantic vector (contextual representation) of words from the first column, defined on the text corpus of the Reader’s Digest magazine.
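For illustration only, such a co-occurrence table could be counted for a tokenized corpus with a few lines of code (the window size and the use of textCollTokens from section 3 are purely illustrative):
from collections import defaultdict

def cooccurrence_counts(tokenizedDocs, window=10):
    ## count how often pairs of words occur within 'window' tokens of each other
    counts = defaultdict(int)
    for doc in tokenizedDocs:
        for i, word in enumerate(doc):
            for neighbor in doc[i + 1:i + 1 + window]:
                counts[(word, neighbor)] += 1
                counts[(neighbor, word)] += 1
    return counts

coocc = cooccurrence_counts(textCollTokens, window=10)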
If we carry out such a procedure on a text corpus of the entire language (for example, this one), then two significant drawbacks of this approach immediately become visible:
- the table is too large, since its size is determined by the size of the dictionary, which runs to tens of thousands of words,
- the table will be filled mostly with zeros.
Since the 1960s, various techniques have been proposed to reduce the dimensionality of the word co-occurrence matrix (singular value decomposition, principal component analysis, various types of filtering), but no significant breakthrough came of them. The breakthrough happened in 2013, when a group of researchers from Google proposed using the Word2Vec neural network architecture to obtain semantic vectors. Once again, we will not describe in detail how the Word2Vec neural network works (the original article is here, and a more accessible explanation can be found [here](https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-understanding-word2vec-e0128a460f0f)), but will limit ourselves to the main points:
- the neural network is trained on a large collection of documents to predict which words are most likely to be found next to each other (and the network is by no means deep – only two layers),
- there are two training modes: CBOW (continuous bag of words, faster) and skip-gram (more accurate for rare words) – see the sketch after this list,
- the dimension of the output vectors is several hundred (typically n=300),
- after training, the matrix of weights from the input layer to the hidden layer automatically gives the desired semantic vectors for all words.
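As a sketch only, a Word2Vec model could be trained on our small tokenized collection with gensim as follows (the corpus is far too small for the resulting vectors to be meaningful, and the parameters are purely illustrative):
from gensim.models import Word2Vec
w2vModel = Word2Vec(sentences=textCollTokens, vector_size=100, window=10, min_count=1, sg=1, epochs=10)  ## sg=1 - skip-gram mode, sg=0 - CBOW
wordVector = w2vModel.wv['text']  ## semantic vector of the token 'text' (assuming it is in the vocabulary)
similarWords = w2vModel.wv.most_similar('text')  ## tokens with the closest semantic vectors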
A further development of the Word2Vec approach is the Doc2Vec neural network architecture, which defines semantic vectors for entire sentences and paragraphs. Essentially, an additional abstract token is inserted at the beginning of the token sequence of each document and takes part in the training of the neural network. After training, the semantic vector corresponding to this abstract token contains a generalized meaning of the entire document. Although this procedure may look like a sleight of hand, in practice semantic vectors from Doc2Vec do improve the characteristics of NLP models (though, of course, not always).

Again we use the implementation of the Doc2Vec algorithm from the gensim module:
from gensim.models import Doc2Vec
d2vSize = 5
d2vCorpus = [models.doc2vec.TaggedDocument(text, [k]) for k, text in enumerate(docCollTexts)]
d2vModel = Doc2Vec(vector_size=d2vSize, min_count=1, epochs=10, dm=1)
d2vModel.build_vocab(d2vCorpus)
d2vModel.train(d2vCorpus, total_examples=d2vModel.corpus_count, epochs=d2vModel.epochs)
d2vModel.save('doc2vecModel')
textD2vMtx = np.zeros(shape=(len(textCorpus), d2vSize), dtype=float)
for docId in range(len(d2vCorpus)):
    doc2vector = d2vModel.infer_vector(d2vCorpus[docId].words)
    textD2vMtx[docId, :] = doc2vector
Here we train a Doc2Vec model, save it to disk, and then, using the trained model, build a matrix of semantic vectors for all documents. This is how the matrix looks for our simple example:

Looking at this matrix, it is rather difficult to interpret its content, especially compared with the topic matrix, where everything is more or less clear. But poor interpretability is a characteristic feature of neural network models; the main thing is that they improve the results.
By combining the matrices produced by the LDA and Doc2Vec algorithms, we obtain a matrix of full vector representations of the collection of documents (in our simple example the matrix size is 4×9). At this point, the task of transforming the text data into numerical vectors can be considered complete, and the resulting matrix is ready for further use in building NLP models for categorizing and clustering texts.
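The combination itself is just a horizontal concatenation of the two matrices, for example:
fullVectorMtx = np.hstack((textTopicsMtx, textD2vMtx))  ## 4 x (4 + 5) matrix of full vector representations
print(fullVectorMtx.shape)  ## (4, 9)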
7. Summary
In this article, we have looked at examples of using several Python libraries for processing text data and transforming it into numerical vectors. In the next article, we will describe a specific example of using the LDA and Doc2Vec methods to solve the problem of automatic clustering of primary events in the hybrid IT monitoring platform Monq.