
Analyzing Hindu Verses with NLP

Classifying 'Vishnu' and 'Devi' sloka using Python

Photo by Boudhayan Bardhan on Unsplash

‘Text Classification’ is a Machine Learning technique used to analyse text and then organize or categorize it based on patterns or structure. It has many applications in the world of artificial intelligence, such as news article analysis, hate speech identification and gender classification. In this article I use ‘Text Classification’ with Natural Language Processing (NLP) in Python to analyze Hindu religious verses and categorize them. Before we delve into the technical side, let’s quickly see what data we will be working with.

The ‘Sahasranama’, literally 1000 names (where ‘sahasra’ means 1000 and ‘nama’ means names), is a hymn of praise offered to God in Hinduism. The ‘[Lalitha Sahasranama](https://vignanam.org/english/sree-lalita-sahasra-namavali.html)’, in praise of Goddess Durga, and the ‘[Vishnu Sahasranama](https://stotranidhi.com/en/sri-vishnu-sahasra-namavali-in-english/)’, in praise of God Maha Vishnu, are two such slokas (verses), each praising its deity with 1000 different names. I took the data for our analysis from the two pages linked above, cleaned it up to remove ‘Om’ and ‘namah’, and saved it as _text files_.

Importing Libraries:

The first step is to import the required libraries for our analysis. As you can see from the code below, we use three imports: the NLTK package, its accuracy function, and Python’s built-in random module.

#Import required libraries
from nltk.classify import accuracy #For accuracy predictor
import random #For randomly shuffling the names
import nltk #Natural Language toolkit

NLTK – for Natural Language Tool Kit

NLTK Accuracy – For showing accuracy of our prediction

Random – for shuffling the names randomly for our training and testing sets

Loading Data and Labeling:

The files are opened with the open() function in ‘read’ mode, and their contents are assigned to a list using the _readlines()_ function.

The next step is to assign a label ‘Devi‘ and ‘Vishnu‘ to the corresponding verse that was read. This label will be used to classify the verses later.
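The loading and labeling steps might look roughly like the sketch below. The helper names read_names() and label_names() are assumptions made for illustration; the article's own code is only visible in the screenshot.

```python
def read_names(path):
    """Read a text file with one name per line and return the names as a list."""
    with open(path, "r", encoding="utf-8") as handle:
        lines = handle.readlines()
    # Strip the trailing newlines and drop any blank lines left after cleaning.
    return [line.strip() for line in lines if line.strip()]


def label_names(names, label):
    """Pair every name with its label, e.g. ('vishvaya', 'Vishnu')."""
    return [(name, label) for name in names]
```

With hypothetical file names, the two labeled lists could then be built as label_names(read_names("lalitha.txt"), "Devi") and label_names(read_names("vishnu.txt"), "Vishnu").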

Output of the label assignment - Screenshot from Jupyter Notebook

The random.shuffle() function then shuffles the combined list so that the ‘Devi’ and ‘Vishnu’ verses are mixed and both the training and test sets receive verses from each label.
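A small sketch of the shuffling step, using a handful of illustrative names rather than the full dataset. The variable names and the fixed seed are assumptions added so that the example is repeatable.

```python
import random

# Illustrative (name, label) pairs standing in for the full labeled lists.
devi = [("srimaharajnyai", "Devi"), ("devakaryasamudyatayai", "Devi")]
vishnu = [("vishvaya", "Vishnu"), ("bhavaya", "Vishnu")]

labeled_names = devi + vishnu   # all Devi names first, then all Vishnu names
random.seed(42)                 # assumption: fixed seed so the order is repeatable
random.shuffle(labeled_names)   # shuffles the list in place and returns None
```

Without the shuffle, slicing the combined list in half would put every ‘Devi’ name in one half and every ‘Vishnu’ name in the other.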

Feature Extraction:

We define a function to extract features from the verses that are passed from the list. The defined function extracts the last 2 characters to develop a model.
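Such a feature extractor can be very short. In the sketch below, the function name name_features() follows the call used later in the article, while the dictionary key "last_two" and the strip() call are assumptions.

```python
def name_features(name):
    """Return a feature dictionary holding the last two characters of a name."""
    # strip() guards against the trailing newline that readlines() keeps.
    return {"last_two": name.strip()[-2:]}
```

NLTK classifiers expect each item as a dictionary of feature names and values, which is why the two characters are wrapped in a dict rather than returned as a plain string.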

Feature extraction from Verse List - Screenshot from Jupyter Notebook

Create model using Naïve Bayes Classifier:

Image courtesy: Towards Data Science – Naive Bayes Classifier. What is a classifier? | by Rohith Gandhi | Towards Data Science

The Naïve Bayes classifier is based on Bayes’ theorem, which gives the probability of an event A given that a second event B has already occurred. It is called ‘naïve’ because it assumes the features are independent of one another given the label.
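Written out, the theorem and the way a Naïve Bayes classifier applies it look as follows. The symbols are assumptions chosen for this illustration: A and B are the two events, while f_1 to f_n denote the features. In this article there are two labels (‘Devi’ and ‘Vishnu’) and a single feature, the last two characters of the verse.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}

P(\text{label} \mid f_1, \dots, f_n) \;\propto\; P(\text{label}) \prod_{i=1}^{n} P(f_i \mid \text{label})
```

The classifier picks the label for which the right-hand side of the second expression is largest.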

The first step is to split the data into a training set (first 1000 verses) and a test set (last 1000 verses).
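The split itself is plain list slicing. The sketch below uses placeholder integers instead of real feature sets, and the helper name split_featuresets() is hypothetical.

```python
def split_featuresets(featuresets, n_train=1000):
    """Split an already shuffled list into (training set, test set)."""
    return featuresets[:n_train], featuresets[n_train:]


# Stand-in data: 2000 placeholder items instead of real feature sets.
featuresets = list(range(2000))
train_set, test_set = split_featuresets(featuresets)
```

Because the split is positional, it relies on the list having been shuffled beforehand.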

The nltk.classify.accuracy() function reports the accuracy of the classifier’s predictions; it takes the classifier and either the training set or the test set as arguments.

The classifier’s show_most_informative_features(n) method lists the n most informative features of the available verses based on the feature extraction.
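Training, accuracy and the informative-features report can be tried end to end on a toy sample. This is a sketch under stated assumptions: the names below are a few illustrative entries rather than the article's 1000-verse sets, and the "last_two" feature key is assumed.

```python
import nltk
from nltk.classify import accuracy


def name_features(name):
    """Assumed feature extractor: the last two characters of the name."""
    return {"last_two": name[-2:]}


# Illustrative names only; the article trains on 1000 shuffled verses.
train_names = [
    ("vishvaya", "Vishnu"), ("vashatkaraya", "Vishnu"),
    ("bhavaya", "Vishnu"), ("dhatre", "Vishnu"),
    ("srimatre", "Devi"), ("srimaharajnyai", "Devi"),
    ("chidagnikundasambhutayai", "Devi"), ("devakaryasamudyatayai", "Devi"),
]
test_names = [("bhutabhavanaya", "Vishnu"), ("srimatsimhasaneshvaryai", "Devi")]

train_set = [(name_features(name), god) for name, god in train_names]
test_set = [(name_features(name), god) for name, god in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set)

# accuracy() takes the classifier and a list of (features, label) pairs.
train_accuracy = accuracy(classifier, train_set)
test_accuracy = accuracy(classifier, test_set)
print(train_accuracy, test_accuracy)

# Prints a report of the most informative feature values; the content
# depends entirely on the training data.
classifier.show_most_informative_features(5)
```

On a sample this small the report says little; the ranking only becomes meaningful on the full dataset.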

The output of the classifier accuracy and ‘show_most_informative_features’ is shown below. The classifier has an accuracy of almost 97%. We also see that verses corresponding to God Maha Vishnu typically end with ‘ya’, while those corresponding to Goddess Durga typically have ‘ai’ as the last two characters.

Most Informative Features of our Dataset - Screenshot from Jupyter Notebook

Checking Classification Output:

Checking classification output - Screenshot from Jupyter Notebook

Let’s check the classification output against the labels that were assigned to the verses earlier. To do this, we create an empty list, run each verse through the model with classifier.classify(name_features(name)), and keep the verses whose output matches the original label. To see the mismatches instead, we can simply change the ‘if’ condition to model_output != god. A sample of the correctly classified output is shown here.
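The comparison loop could be sketched as below. The names model_output, god and name_features() follow the text above. The helper collect_outputs(), its matched flag and the toy training data are assumptions for illustration.

```python
import nltk


def name_features(name):
    """Assumed feature extractor: the last two characters of the name."""
    return {"last_two": name[-2:]}


# Illustrative labeled names, not the article's dataset.
labeled_names = [
    ("vishvaya", "Vishnu"), ("vashatkaraya", "Vishnu"), ("bhavaya", "Vishnu"),
    ("srimaharajnyai", "Devi"), ("chidagnikundasambhutayai", "Devi"),
    ("devakaryasamudyatayai", "Devi"),
]
classifier = nltk.NaiveBayesClassifier.train(
    [(name_features(name), god) for name, god in labeled_names]
)


def collect_outputs(classifier, labeled_names, matched=True):
    """Return (name, god, model_output) triples that match, or mismatch, the label."""
    results = []  # the empty list that collects the selected verses
    for name, god in labeled_names:
        model_output = classifier.classify(name_features(name))
        if (model_output == god) == matched:
            results.append((name, god, model_output))
    return results


correct = collect_outputs(classifier, labeled_names)
mismatched = collect_outputs(classifier, labeled_names, matched=False)
```

Passing matched=False plays the role of switching the condition to model_output != god.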

The entire code is available for download from the _GitHub repository_.

References:

  1. Classification (bgu.ac.il) – Excellent example on gender classification using NLTK
  2. Python | Gender Identification by name using NLTK – GeeksforGeeks
  3. Naive Bayes Classifier. What is a classifier? | by Rohith Gandhi | Towards Data Science
  4. nltk.classify package – NLTK 3.6.2 documentation
  5. Python Programming Tutorials
  6. Python’s Natural Language Tool Kit (NLTK) Tutorial part – 3 | by Ishan Dixit | Medium
