
‘Text Classification’ is a Machine Learning technique that analyses text and then organizes or categorizes it based on patterns or structure. It has many applications in the world of artificial intelligence, such as news article analysis, hate speech identification and gender classification. In this article I use ‘Text Classification’ with Natural Language Processing (NLP) in Python to analyze Hindu religious verses and categorize them. Before we delve deeper into the technical side of Python, let’s quickly see what data we will be working with.
The ‘Sahasranama’, literally 1000 names (where ‘sahasra’ means 1000 and ‘nama’ means names), is a hymn of praise offered to God in Hinduism. The _[Lalitha Sahasranama](https://vignanam.org/english/sree-lalita-sahasra-namavali.html)_, in praise of Goddess Durga, and the _[Vishnu Sahasranama](https://stotranidhi.com/en/sri-vishnu-sahasra-namavali-in-english/)_, in praise of God Maha Vishnu, are two such slokas (verses), each praising its deity with 1000 different names. I took the data for our analysis from the ‘Lalitha Sahasranama’ and ‘Vishnu Sahasranama’ pages linked above, cleaned it up to remove ‘Om’ and ‘namah’, and saved it as _text files_.
Importing Libraries:
The first step is to import what we need for our analysis. As you can see from the code below, there are 3 imports, drawn from two libraries: NLTK and Python's built-in random module.
```python
# Import required libraries
from nltk.classify import accuracy  # For accuracy predictor
import random  # For randomly shuffling the names
import nltk  # Natural Language Toolkit
```
- NLTK – the Natural Language Toolkit
- NLTK Accuracy – for showing the accuracy of our predictions
- Random – for shuffling the names randomly before they are split into training and testing sets
Loading Data and Labeling:
Each file is opened with the open() function in ‘read’ mode, and its contents are read into a list using the _readlines()_ function.
The next step is to assign the label ‘Devi’ or ‘Vishnu’ to each verse, according to the file it was read from. This label will be used to classify the verses later.

The random.shuffle() function is then used to shuffle the combined list randomly, so that verses with both labels are spread evenly across the training and testing sets created later.
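The loading, labeling and shuffling steps could be sketched roughly as follows. This is a minimal sketch rather than the original code: the helper names `read_lines` and `label_and_shuffle`, the whitespace stripping, and the optional `seed` parameter are all assumptions made for illustration.

```python
import random


def read_lines(path):
    """Open a text file in read mode and return its lines as a list."""
    with open(path, "r", encoding="utf-8") as handle:
        return handle.readlines()


def label_and_shuffle(devi_lines, vishnu_lines, seed=None):
    """Attach a label to every name, combine both lists, and shuffle them."""
    # Strip the trailing newline left by readlines() and skip blank lines.
    labeled_names = [(line.strip(), "Devi") for line in devi_lines if line.strip()]
    labeled_names += [(line.strip(), "Vishnu") for line in vishnu_lines if line.strip()]
    # A fixed seed (an assumption, not part of the article) makes the shuffle repeatable.
    random.Random(seed).shuffle(labeled_names)
    return labeled_names
```

With the two cleaned text files saved locally, this would be called as `label_and_shuffle(read_lines("lalitha.txt"), read_lines("vishnu.txt"))`, where both file names are placeholders.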
Feature Extraction:
We define a function that extracts features from each verse passed to it from the list. Here the only feature is the last 2 characters of the verse, and this is what the model is trained on.
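A feature extractor of this kind might look like the sketch below. NLTK classifiers expect each item's features as a dictionary; the function name `name_features` is the one used later in this article, while the key `last_two` and the whitespace stripping are assumptions.

```python
def name_features(name):
    """Return a feature dictionary holding the last two characters of a name."""
    # Strip surrounding whitespace so a trailing newline is not counted as a character.
    cleaned = name.strip()
    return {"last_two": cleaned[-2:]}
```

Keeping the extractor this small makes it easy to try other features later, for example the last three characters, without touching the rest of the code.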

Create model using Naïve Bayes Classifier:

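The Naïve Bayes classifier is based on Bayes' theorem, which gives the probability of an event A given that a second event B has already occurred: P(A|B) = P(B|A) × P(A) / P(B). It is called ‘naïve’ because it assumes that the features are independent of one another given the label.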
The first step is to split the data into a training set (first 1000 verses) and a test set (last 1000 verses).
The nltk.classify.accuracy() function reports the fraction of verses that the classifier labels correctly. It takes the classifier and either the training set or the testing set as arguments.
The show_most_informative_features(n) function highlights the most informative n features of the available verses based on the feature extraction.
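The split, training and accuracy steps described above could be put together as in this sketch. The wrapper `train_and_evaluate` and its parameters are hypothetical. The NLTK calls it relies on are `nltk.NaiveBayesClassifier.train()` and `nltk.classify.accuracy()`.

```python
import nltk
from nltk.classify import accuracy


def train_and_evaluate(labeled_names, feature_fn, train_size=1000):
    """Train a Naive Bayes classifier and return it with both accuracy scores."""
    # Convert every (name, label) pair into a (feature dictionary, label) pair.
    feature_sets = [(feature_fn(name), label) for name, label in labeled_names]
    # The first train_size items are used for training, the remainder for testing.
    train_set = feature_sets[:train_size]
    test_set = feature_sets[train_size:]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    return classifier, accuracy(classifier, train_set), accuracy(classifier, test_set)
```

Calling `classifier.show_most_informative_features(5)` on the returned classifier then prints the five most informative features.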
The output of the classifier accuracy and ‘show_most_informative_features’ is shown below. The classifier has an accuracy of almost 97%. We also see that verses corresponding to God Maha Vishnu typically end with ‘ya’ while those corresponding to Goddess Durga typically have ‘ai’ as the last 2 characters.

Checking Classification Output:

Let’s check the classification output and compare it to the labels that were assigned earlier to the verses. To do this, we create an empty list, classify each verse with classifier.classify(name_features(name)), and add the verse to the list when the model output matches the label that was originally assigned. If we are interested in seeing the verses that were mismatched instead, we can simply change the ‘if’ condition to look for outputs where model_output != god. A sample of the correct classification output is shown here.
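The comparison loop might be written along these lines. This is a sketch under assumptions: the function `collect_outputs` and its `want_correct` flag are hypothetical, while `model_output`, `god` and `classifier.classify()` follow the names used in this article.

```python
def collect_outputs(classifier, labeled_names, feature_fn, want_correct=True):
    """Collect (label, model output, name) for matching or mismatching verses."""
    results = []
    for name, god in labeled_names:
        model_output = classifier.classify(feature_fn(name))
        # want_correct=True keeps matches; want_correct=False keeps mismatches.
        if (model_output == god) == want_correct:
            results.append((god, model_output, name))
    return results
```

Passing `want_correct=False` has the same effect as changing the condition to `model_output != god`.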
The entire code is available for download from the _GitHub repository_.
References:
- Classification (bgu.ac.il) – Excellent example on gender classification using NLTK
- Python | Gender Identification by name using NLTK – GeeksforGeeks
- Naive Bayes Classifier. What is a classifier? | by Rohith Gandhi | Towards Data Science
- nltk.classify package – NLTK 3.6.2 documentation
- Python Programming Tutorials
- Python’s Natural Language Tool Kit (NLTK) Tutorial part – 3 | by Ishan Dixit | Medium