
Multinomial Naïve Bayes for Document Classification and Natural Language Processing (NLP)

A multinomial naïve Bayes implementation in Python 3.8 with NumPy and NLTK

Photo by Arthur V. Ratz

Introduction

Naïve Bayes is a probabilistic approach to building data classification models. It comes in several variants, is widely used as an alternative to distance-based K-means clustering and decision-tree forests, and treats classification in terms of the probability (the "likelihood") that data belongs to a specific class. The two most common variants are the Gaussian and multinomial naïve Bayes models.

The multinomial model makes it possible to classify data that cannot easily be represented numerically. Its main advantage is its significantly reduced complexity: it can perform classification with small training sets and does not need to be continuously re-trained.

In this story, readers will learn about the multinomial naïve Bayes classification algorithm and its implementation in Python 3.8 using the latest NumPy and NLTK libraries.

Multinomial Naïve Bayes Classifiers

Multinomial naïve Bayes is widely used for assigning documents to classes based on statistical analysis of their contents. It provides an alternative to "heavy" AI-based semantic analysis and drastically simplifies the classification of textual data.

The classification aims to assign fragments of text (i.e. documents) to classes by determining the probability that a document belongs to the same class as other documents on the same subject.

Each document consists of multiple words (i.e. terms) that contribute to an understanding of its contents. A class is a tag shared by one or more documents that refer to the same subject.

Documents are labeled with one of the existing classes by performing a statistical analysis that tests the hypothesis that a document's terms have already occurred in other documents from a particular class. Such occurrences increase the probability that the document belongs to the same class as the documents already classified.

Given: the samples 𝑺 intended to be classified. Each sample in 𝑺 is defined as a string that occurred in one or more documents from a class 𝑪. The classification is performed based on the documents 𝑫 that have already been classified.
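For concreteness, the notation used throughout can be summarized as follows (a sketch consistent with the algorithm section later in this story):

$$S = \{s_1, \dots, s_n\}, \qquad D = \{d_1, \dots, d_z\}, \qquad C = \{C_1, \dots, C_m\}$$

where 𝑺 is the sample split into 𝒏 terms, 𝑫 the corpus of 𝒛 already-classified documents, and 𝑪 the set of 𝒎 classes.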

To perform the classification, the terms in 𝑺 are represented by a vector 𝑾. Each feature 𝒘ᵢ of 𝑾 is the occurrence frequency of the corresponding 𝒊-th term in the documents from class 𝑪, so each feature vector 𝑾 is a term/class occurrence-frequency histogram:

Terms Occurrence Frequencies Histogram | Image by the author
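As an illustrative sketch (not the article's listing), such a per-class term-frequency histogram and a feature vector 𝑾 for a sample can be built with the standard library alone; the toy corpus and class names below are made up:

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of already-classified documents: (text, class) pairs.
corpus = [
    ("the cat sat on the mat", "pets"),
    ("dogs and cats are pets", "pets"),
    ("stocks fell on the news", "finance"),
]

# Per-class term occurrence frequencies: the histogram behind the vector W.
class_term_counts = defaultdict(Counter)
for text, label in corpus:
    class_term_counts[label].update(text.lower().split())  # NLTK tokenization could be used here

# Feature vector W of the sample's terms against the class "pets".
sample_terms = "the cat and the dog".lower().split()
w = [class_term_counts["pets"][term] for term in sample_terms]
print(w)  # occurrence counts of each sample term in the "pets" documents
```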

The classification is performed by estimating the probabilities 𝒑(𝒘ᵢ | 𝑪), 𝒊=𝟏..𝒏, that the terms in 𝑺 occurred in the documents 𝑫 from class 𝑪.

The multinomial model relies solely on evaluating a probability-based decision function derived from Bayes' theorem.

The Bayesian probability 𝒑(𝑪ₖ | 𝑾 ) is computed as follows:

Bayesian Probability Formulae
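For reference, Bayes' theorem in this notation reads:

$$p(C_k \mid W) = \frac{p(C_k)\, p(W \mid C_k)}{p(W)}$$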

The main idea behind naïve Bayes is that all features in 𝑾 contribute independently to the probability that 𝑺 belongs to 𝑪ₖ.

Alternatively, the posterior 𝒑(𝑪ₖ | 𝑾 ) is expressed as:

The posterior 𝒑(𝑪ₖ | 𝑾 ) Formulae | Image by the author
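With the naïve independence assumption and the evidence omitted, this proportional form is:

$$\Pr(C_k \mid W) \;\propto\; p(C_k)\,\prod_{i=1}^{n} p(w_i \mid C_k)$$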

The estimated prior 𝒑(𝐶ₖ) and likelihood 𝒑(𝑾 | 𝐶ₖ) contribute proportionally to the outcome that 𝐶ₖ is the class of 𝑺.

However, only the numerator of this fraction affects the estimate of the posterior 𝐏𝐫(𝐶ₖ | 𝑾): the evidence 𝒑(𝑾) scales 𝐏𝐫(𝐶ₖ | 𝑾) equally for every 𝐶ₖ and can simply be omitted from the estimate.

The multinomial model is used for document classification under the assumption that 𝑺 is represented by the feature vector 𝑾. Each feature 𝒘ᵢ is the count (i.e. frequency) with which the 𝒊-th term from 𝑺 occurred in the documents 𝑫 already assigned to one of the classes 𝑪.

In its original form, the likelihood 𝐏𝐫(𝑾 | 𝐶ₖ) for each class 𝐶ₖ is estimated as:

Multinomial Naïve Bayes Classifier | Image by the author
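In the article's notation, the standard multinomial likelihood takes the form:

$$\Pr(W \mid C_k) = \frac{\left(\sum_{i=1}^{n} w_{ki}\right)!}{\prod_{i=1}^{n} w_{ki}!}\;\prod_{i=1}^{n} p_{ki}^{\,w_{ki}}$$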

The leading factor in this expression is a multinomial coefficient: a quotient whose numerator is the factorial of the sum of all features 𝑤ₖᵢ ∈ 𝑾, 𝒊=𝟏..𝒏, and whose denominator is the product of the factorials of the individual features 𝑤ₖᵢ.

The numerator counts all possible orderings in which the terms of 𝑺 could have occurred in the documents 𝑫 from class 𝑪ₖ; the overall number of such outcomes equals the factorial of the sum of the features 𝑤ₖᵢ ∈ 𝑾 mentioned above. In turn, the denominator, the product of the factorials 𝑤ₖᵢ!, discounts the orderings that merely permute identical terms within the corpus of documents 𝑫.

These computations are reminiscent of Legendre's formula for the prime factorization of factorials, and the expression as a whole is a multinomial (multivariate binomial) distribution. When the denominator is large enough, this coefficient decreases significantly, which normally makes it possible to exclude features that occur in all documents of the corpus 𝑫.

The likelihood of 𝑺 given 𝑪ₖ is then the product of the term probabilities 𝒑ₖᵢ, each raised to the power of the corresponding feature 𝑤ₖᵢ:

Likelihood That The 𝒊-th Term 𝑤ᵢ Belongs To Class 𝐶ₖ | Image by the author
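In practice, the per-term probabilities 𝒑ₖᵢ are usually estimated from the class term counts with additive (Laplace) smoothing; this is a common choice, assumed here rather than taken from the original listing:

$$p_{ki} \equiv p(w_i \mid C_k) = \frac{\mathrm{count}(w_i, C_k) + 1}{\sum_{j} \mathrm{count}(w_j, C_k) + |V|}$$

where |V| is the vocabulary size.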

The multinomial model above is applied to compute the probability 𝒑(𝑾 | 𝑪ₖ) that 𝑺 is from 𝑪ₖ.

In this case, however, the goal is to compute the probability that 𝑪ₖ is the class of 𝑺. The posterior 𝐏𝐫(𝐶ₖ | 𝑾) is easily computed by representing the multinomial model in log-space:

Multinomial Bayes Classifier in Log-Space | Image by the author
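Written out in the article's notation, the standard log-space form is:

$$\log \Pr(C_k \mid W) \;\propto\; \log p(C_k) + \sum_{i=1}^{n} w_i \,\log p(w_i \mid C_k)$$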

Since the likelihood 𝒑(𝑾 | 𝐶ₖ) is a product of probabilities 𝒑ₖᵢ ≡ 𝒑(𝑤ᵢ | 𝐶ₖ), it is represented on the logarithmic scale: the product becomes a sum of logarithms, which avoids the numerical underflow that occurs when many small probabilities are multiplied.

In the equation above, 𝗹𝗼𝗴(𝒑) is the natural logarithm of probability 𝒑, evaluated as:

Probability Natural Logarithm 𝗹𝗼𝗴(𝒑) | Image by the author

Here 𝗹𝗼𝗴(𝒑) is defined piecewise: it is taken as 𝟏.𝟎 when 𝒑 is greater than or equal to 𝟏.𝟎, and as the natural logarithm 𝐥𝐧(𝒑) otherwise.

The decision about the class 𝐶ₛ of 𝑆 is made such that:

Probability-Based Classification Formulae | Image by the author
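In standard notation, this decision rule is the argmax over the classes:

$$C_s = \underset{k=1,\dots,m}{\arg\max}\; \Pr(C_k \mid W)$$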

The class 𝐶ₛ is determined as the existing class 𝐶ₖ for which 𝐏𝐫(𝐶ₖ | 𝑾) is the maximum. Since the log-space value of 𝐏𝐫(𝐶ₖ | 𝑾) is always negative, this is the value with the smallest absolute value, i.e. the one closest to zero.

Multinomial Bayes Classification Algorithm

Let 𝑺 be an input string, 𝑫 a corpus of 𝒛 documents, and 𝑪 a set of 𝒎 classes.

Compute the class 𝑪ₛ of sample 𝑺 as follows (a toy numeric sketch of the posterior computation follows the steps below):

  1. Split sample 𝑺 into a set of 𝒏 terms.
  2. For each 𝒌-th class 𝑪ₖ, 𝒌=𝟭..𝒎, do the following:
  • Compute the vector 𝑾 of 𝒏 features 𝒘ₖᵢ ∈ 𝑾, where 𝒘ₖᵢ is the frequency with which the corresponding 𝒊-th term occurs in the documents from 𝑪ₖ.
  • Evaluate the prior 𝒑(𝑪ₖ) as the probability that a document from the corpus belongs to class 𝑪ₖ.
  • Compute the posterior 𝐏𝐫(𝑪ₖ | 𝑾) by adding the prior 𝒑(𝑪ₖ) to the sum of the probabilities 𝒑(𝒘ᵢ | 𝑪ₖ) of each term 𝒘ᵢ given 𝑪ₖ:
The Posterior 𝐏𝐫(𝑪ₖ | 𝑾) Formulae | Image by the author
  3. Determine the class 𝑪ₛ of 𝑺 for which 𝐏𝐫(𝑪ₖ | 𝑾), 𝒌=𝟭..𝒎, is the maximum:
The Class 𝑪ₛ Of 𝑺 Computation | Image by the author
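To make the steps concrete, here is a toy computation of the log-space posterior for a single hypothetical class; all counts are made up, and the Laplace smoothing of the term probabilities is an assumption of this sketch:

```python
import math

# Made-up counts for one hypothetical class C_k and a 3-term sample S.
docs_in_class, total_docs = 4, 10            # for the prior p(C_k)
sample_term_counts_in_class = [6, 2, 0]      # occurrences of each sample term in C_k's documents
total_terms_in_class, vocab_size = 40, 25    # for Laplace-smoothed p(w_i | C_k)

log_posterior = math.log(docs_in_class / total_docs)            # log prior
for count in sample_term_counts_in_class:
    p_term = (count + 1) / (total_terms_in_class + vocab_size)  # smoothed term probability
    log_posterior += math.log(p_term)                           # add log p(w_i | C_k)

print(log_posterior)  # repeat per class; the value closest to zero wins
```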

This algorithm’s complexity 𝜎 is evaluated as:

The Multinomial Bayesian Classifier Complexity (𝜎) | Image by the author

where 𝒛 is the total number of documents in 𝑫, 𝒏 the number of terms in sample 𝑺, and 𝒎 the number of classes in 𝑪.
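Given these variables and the structure of the algorithm (for each of the 𝒎 classes, each of the 𝒏 terms is counted across the 𝒛 documents), a plausible form of the complexity, taken here as an assumption rather than the exact original expression, is:

$$\sigma = O(m \cdot n \cdot z)$$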

Semi-Supervised Learning

Semi-supervised learning makes it possible to increase the multinomial model's performance. It also improves the quality of classification by training the model on the corpus of documents that have already been classified.

The semi-supervised learning algorithm is intuitively simple and is formulated as follows (a rough code sketch follows these steps):

  1. Compute the class 𝑪ₛ of 𝑺 using the multinomial model discussed above.
  2. Append 𝑺, labeled with 𝑪ₛ, to the corpus of documents 𝑫.
  3. Re-evaluate the classification model.

This process is repeated to classify each new sample 𝑺.
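As a rough illustrative sketch of this loop (the scoring inside classify_sample is a crude placeholder rather than the full multinomial posterior; a fuller classifier sketch appears in the code section below):

```python
from collections import Counter, defaultdict

# Deliberately tiny stand-in for the multinomial model: per-class term counts.
def rebuild_model(corpus):
    model = defaultdict(Counter)
    for text, label in corpus:
        model[label].update(text.lower().split())
    return model

def classify_sample(model, text):
    # Crude scoring placeholder (not the full multinomial posterior).
    terms = text.lower().split()
    return max(model, key=lambda c: sum(model[c][t] for t in terms))

def classify_stream(new_samples, corpus):
    model = rebuild_model(corpus)               # initial model from labeled documents
    for sample in new_samples:
        label = classify_sample(model, sample)  # 1. compute the class C_s of S
        corpus.append((sample, label))          # 2. append S labeled with C_s to D
        model = rebuild_model(corpus)           # 3. re-evaluate the classification model
    return corpus
```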

The semi-supervised learning process:

Semi-Supervised Learning Process | Image by the author

The process illustrated above makes it possible to perform classification, assigning samples to a finite set of classes, in a manner similar to the expectation-maximization (EM) algorithm.

Using The Code

The code implementing the multinomial classification algorithm in Python 3.8 with the NumPy and NLTK libraries is listed below:
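As a minimal, self-contained sketch of the algorithm described above (not the author's original listing; the class name, toy corpus, and Laplace smoothing are illustrative assumptions, and NLTK tokenization or NumPy vectorization could be substituted where noted):

```python
import math
from collections import Counter, defaultdict


class MultinomialNBClassifier:
    """A minimal multinomial naive Bayes text classifier (toy sketch)."""

    def __init__(self):
        self.class_doc_counts = defaultdict(int)       # documents per class
        self.class_term_counts = defaultdict(Counter)  # term frequencies per class
        self.vocabulary = set()
        self.total_docs = 0

    def fit(self, labeled_docs):
        """Train (or incrementally re-train) on (text, class) pairs."""
        for text, label in labeled_docs:
            terms = text.lower().split()  # NLTK's word_tokenize could be used instead
            self.class_doc_counts[label] += 1
            self.class_term_counts[label].update(terms)
            self.vocabulary.update(terms)
            self.total_docs += 1

    def _log_posterior(self, terms, label):
        """log p(C_k) + sum of log p(w_i | C_k) with Laplace smoothing."""
        log_post = math.log(self.class_doc_counts[label] / self.total_docs)
        total_terms = sum(self.class_term_counts[label].values())
        v = len(self.vocabulary)
        for term in terms:
            p = (self.class_term_counts[label][term] + 1) / (total_terms + v)
            log_post += math.log(p)
        return log_post

    def predict(self, text):
        """Return the class with the maximum log-space posterior, plus all scores."""
        terms = text.lower().split()
        scores = {c: self._log_posterior(terms, c) for c in self.class_doc_counts}
        return max(scores, key=scores.get), scores

    def update(self, text, label):
        """Semi-supervised step: append the newly classified sample to the corpus."""
        self.fit([(text, label)])


if __name__ == "__main__":
    corpus = [
        ("the cat sat on the mat", "pets"),
        ("dogs and cats make wonderful pets", "pets"),
        ("stocks fell sharply on the market news", "finance"),
        ("investors bought shares after the earnings report", "finance"),
    ]
    classifier = MultinomialNBClassifier()
    classifier.fit(corpus)

    label, scores = classifier.predict("my dog chased the cat")
    print(label, scores)                                # predicted class and per-class scores
    classifier.update("my dog chased the cat", label)   # semi-supervised re-training
```

Running the sketch trains on the toy corpus, predicts a class for a new sample, and then appends the newly labeled sample to the model, mirroring the semi-supervised loop described earlier.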


Conclusion

The multinomial Bayesian model is an efficient alternative to the well-known K-means clustering and decision-tree algorithms, capable of classifying data that are normally not easy to quantify. For example, it can be used as part of ANN-based text classification models that learn to assign texts to classes based on a summary of their contents inferred by the multinomial Bayesian classifier.

Unlike similar AI and machine learning (ML) methods used for content-based text classification, multinomial Bayesian classifiers are entirely a data mining approach that can predict classes for texts introduced to the model without continuous training. However, to prevent the early-convergence and cold-start issues encountered in multinomial models, it is recommended to use the semi-supervised learning algorithm, training the model and simultaneously using it for prediction.


For a demonstration of the multinomial Bayes classifier, please download and run the project below.

