Who does not like libraries and bookshelves? They store the objects of knowledge and the doors to all the stories told. Those in charge of ordering books are the librarians, who after years of training know what type of texts people read and hence where to store them. Traditionally, this ordering or “classification” is done alphabetically, so librarians can easily find books by, say, Edgar Allan Poe under section P. But what about literary movements? Had these librarians been blind to the author’s name, with only the raw text remaining, they would have been forced to go through each text to determine its literary movement and classify it properly. We know, for example, that Poe is considered a Romantic (more about literary movements here), but we know this because we have been told so and because we have learned to identify the keywords, semantic forms and literary shapes that make up Romanticism.

Here, I embark on the quest of creating (training) a librarian algorithm, if you wish to call it that, that can classify with relatively good performance texts from three different literary movements: Romanticism, Realism, and Surrealism. For this I will use a Python 3 environment with typical machine learning libraries and a not so typical but rather amazing pre-trained deep learning model from TensorFlow Hub that can encode any given sentence into a numerical representation (a sentence embedding) in 16 different languages! What is even more amazing is that this transformation needs no special text processing: you simply pass the text in its original form (in any of the supported languages, of course) and the model computes its embedding vector.

I hope that by the end of the text you will have a better idea of how to apply this algorithm, how to use it in your own project and how to improve on the code I wrote. I will now describe the methods and results in detail, so hold tight!

The first thing we need to do is collect some quotes from these three categories. For this I chose Goodreads, an incredible website where you can search for quotes from thousands of authors. For this analysis I stored (manually, with plain copy/paste and no scraping) around 170 quotes in total in four different languages, with English being by far the most common. Some quotes were in French, some in Spanish and a very few in Portuguese. I could have used only quotes in English, but I wanted to test the flexibility of this embedding algorithm.

I then created three files, one for each movement: Romanticism.txt, Realism.txt and Surrealism.txt. Each contains the name of the author followed by eight quotes per author, with 8, 6 and 7 authors in each movement respectively (a total of 168 quotes). How many authors and quotes I used was a completely arbitrary choice, but the class imbalance is on purpose, in case you wonder.

You can find the complete notebook and quote files here if you wish to try it yourself.

For the entire pipeline, you need to import these modules so make sure to have them installed (pip install package would usually do the trick for most cases):

# TensorFlow, tf-hub and tf-sentencepiece, all needed to calculate
# embeddings. Check this discussion to install compatible versions.
# At the time of this writing, a compatible mix of packages is tf
# v1.13.1, hub v0.5.0 and sentencepiece v0.1.82.1:
import tensorflow as tf
import tensorflow_hub as hub
import tf_sentencepiece
# numpy and random:
import numpy as np
import random
# sklearn and imblearn packages:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedShuffleSplit
from imblearn.under_sampling import EditedNearestNeighbours
# Visualization:
import seaborn as sns
import matplotlib.pyplot as plt
import networkx as nx

Then, the first pre-processing step is of course to load the data. For this I used the following lines:

# Read each file into a list of lines: one line per author, name first, then the quotes.
# encoding="utf-8" is an assumption, since some quotes contain accented characters.
with open("Romanticism.txt", "r", encoding="utf-8") as f_rom:
    list_all_rom = f_rom.readlines()
with open("Realism.txt", "r", encoding="utf-8") as f_rea:
    list_all_rea = f_rea.readlines()
with open("Surrealism.txt", "r", encoding="utf-8") as f_sur:
    list_all_sur = f_sur.readlines()

where each of these lists looks like this:

['"Author1","quote1","quote2","quote3"\n', '"Author2..."\n']

We can then merge all these quotes for each literary movement with this function:
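The original merge_list is not reproduced in this text. A minimal sketch, assuming each line holds a quoted author name followed by that author's quoted, comma-separated quotes (as in the example above), might look like this. Dropping the author name and skipping empty fields are assumptions of the sketch:

```python
import csv


def merge_list(lines):
    """Flatten lines of the form '"Author","quote1","quote2",...' into one list of quotes."""
    quotes = []
    # csv.reader handles the double-quoted fields and the trailing newline of each line.
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        # row[0] is assumed to be the author name; the rest are that author's quotes.
        quotes.extend(field.strip() for field in row[1:] if field.strip())
    return quotes
```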

and then simply run this to merge them all into one large list:

merged_list_rom = merge_list(list_all_rom)
merged_list_rea = merge_list(list_all_rea)
merged_list_sur = merge_list(list_all_sur)
merged_list_all = merged_list_rom + merged_list_rea + merged_list_sur

Here, merged_list_all is a collection of quotes of all movements (ordered) that looks like this (you might recognize the author and possibly the movement of some quotes if you are a good librarian!):

'I have absolutely no pleasure in the stimulants in which I sometimes so madly indulge...',
'Once upon a midnight dreary, while I pondered, weak and weary, Over many a quaint...',
'This is what you shall do; Love the earth and sun and the animals, despise riches, give alms to every...',
...
...
...
'Avec ce coeur débile et blême Quand on est l'ombre de soi-même Comment se pourrait-il comment Comment se ... ,'
...
x 168

These quotes are the inputs for our embedding analysis. Now comes the fun part! The first thing is to load the pre-trained deep neural network I talked about before, the Universal Sentence Encoder Multilingual (Yang et al., 2019), from TensorFlow Hub and initialize its session with:
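The loading code is not reproduced in this text. A minimal sketch for the TensorFlow 1.x versions listed above, following the usual hub.Module pattern, could look like this. The function name load_encoder, the returned tuple and the module URL are assumptions of this sketch, so check the module page on TensorFlow Hub for the exact handle:

```python
def load_encoder(module_url="https://tfhub.dev/google/universal-sentence-encoder-multilingual/1"):
    """Build a TF1 graph around the hub module; return (session, text_input, embedded_text)."""
    # Imported inside the function so the heavy dependencies load only when it is called.
    import tensorflow as tf
    import tensorflow_hub as hub
    import tf_sentencepiece  # noqa: F401  (registers the SentencePiece ops the module uses)

    graph = tf.Graph()
    with graph.as_default():
        # Placeholder that receives a batch of raw sentences.
        text_input = tf.placeholder(dtype=tf.string, shape=[None])
        embed = hub.Module(module_url)  # assumed module handle, see the note above
        embedded_text = embed(text_input)
        init_op = tf.group([tf.global_variables_initializer(), tf.tables_initializer()])
    graph.finalize()

    session = tf.Session(graph=graph)
    session.run(init_op)
    return session, text_input, embedded_text
```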

I recommend downloading and saving the network locally first. By default the model is cached in /tmp (on Linux at least), so otherwise it has to be downloaded again each time you restart your computer. The download takes about 1 min with fast internet anyway. You can read more about how to do this here.

Up until now we have been preparing our environment to do actual machine learning. The first real machine learning code is the following, which computes and stores all sentence embeddings and also calculates the similarity (inner product) of all possible quote pairs (in embedding space, of course). The latter will come in handy later on.
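The function is not reproduced in this text. A minimal sketch of the idea follows, with one deliberate difference: instead of reading the session from the global scope, it takes the embedding step as an argument. That argument, embed_fn, is a hypothetical callable that maps a list of strings to an array of shape (n_quotes, embedding_dim):

```python
import numpy as np


def similarity_matrix(quotes, embed_fn):
    """Return (similarity matrix, embeddings) for a list of sentences."""
    # One embedding vector per quote, shape (n_quotes, embedding_dim).
    embeddings = np.asarray(embed_fn(quotes), dtype=float)
    # Inner product of every pair of embedding vectors, shape (n_quotes, n_quotes).
    similarity = np.inner(embeddings, embeddings)
    return similarity, embeddings
```

With a TF1 session, embed_fn could be something like lambda q: session.run(embedded_text, feed_dict={text_input: q}), where session, embedded_text and text_input are assumed names for the session and graph tensors. When the embedding vectors have unit length, the inner product equals the cosine similarity.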

Note that this function uses the previously loaded session, so make sure it is initialized in your Python environment. Then we feed this function our merged quote list:

sM_All, e_All = similarity_matrix(merged_list_all)

Also notice that e_All is an array with dimensions quotes x 512 (512 is the embedding vector size of this pre-trained model), while sM_All is the semantic similarity matrix we will use at the end of this analysis. Because what we want to do is classify text, we are missing one important piece of the puzzle: a class array. We know the ordering of the rows in e_All: the first chunk is Romanticism, followed by Realism and finally Surrealism (so, three movements). We also know how many quotes we have per movement from the lengths of the merged lists, so this:

classes = np.asarray([1 for i in range(len(merged_list_rom))] + \
[2 for i in range(len(merged_list_rea))] + [3 for i in range(len(merged_list_sur))])

creates a class list that looks something like: [1,1,1,1, … 2,2,2,2,2,….3,3,3,3], with the correct number of instances per class. Now we are ready to build a classifier pipeline to train our librarian algorithm and do cross-validation to test how good it is. For this, I wrote this simple function that adds the flexibility of providing as input any classifier:
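The function is not reproduced in this text. A minimal sketch, assuming repeated stratified shuffle splits and a macro-averaged F1 per split, might look like this. The parameters n_splits, test_size, seed and sampler, and the choice of macro averaging, are assumptions of the sketch and not the original settings. It calls metrics.f1_score through the module so that reusing the name f1_score for the returned scores, as the call further down does, cannot break it:

```python
import numpy as np
from sklearn import metrics
from sklearn.base import clone
from sklearn.model_selection import StratifiedShuffleSplit


def class_pipeline(features, class_ground, clf, n_splits=10, test_size=0.2,
                   seed=0, sampler=None):
    """Cross-validate clf; return (predictions, true labels, F1 per split)."""
    features = np.asarray(features)
    class_ground = np.asarray(class_ground)
    splitter = StratifiedShuffleSplit(n_splits=n_splits, test_size=test_size,
                                      random_state=seed)
    class_pred, class_test, scores = [], [], []
    for train_idx, test_idx in splitter.split(features, class_ground):
        X_train, y_train = features[train_idx], class_ground[train_idx]
        if sampler is not None:
            # Optional resampling, applied to the training split only.
            X_train, y_train = sampler.fit_resample(X_train, y_train)
        # Fresh, unfitted copy of the classifier for every split.
        model = clone(clf).fit(X_train, y_train)
        pred = model.predict(features[test_idx])
        class_pred.extend(pred)
        class_test.extend(class_ground[test_idx])
        scores.append(metrics.f1_score(class_ground[test_idx], pred, average="macro"))
    return np.asarray(class_pred), np.asarray(class_test), np.asarray(scores)
```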

We then define a classifier (we will use sklearn’s multilayer perceptron in this case; you can try others such as SVM, KNN, etc.):

clf = MLPClassifier(max_iter=500,activation="tanh")

and then run the function:

class_pred, class_test, f1_score = class_pipeline(StandardScaler().fit_transform(e_All),classes,clf)

Note that e_All should be standardized first, which is another common preprocessing step.

Great! Now let’s see the results! The mean F1 across all cross-validation splits (we have a slightly unbalanced class distribution, so, as a well-known rule of thumb says, F1 is a better performance estimator than accuracy in these cases) is:

>>> print(np.mean(f1_score))
0.73

An F1 = 0.73 is really not bad for such a small sample! Let’s now look at the confusion matrix to get a better idea of what is going on:

Confusion Matrix for literary movements
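The figure itself is not reproduced here. A sketch of how such a matrix could be computed from the pipeline outputs follows, assuming class_test and class_pred hold the true and predicted labels. The function name and the row normalization are assumptions, since the text does not say whether the original figure shows counts or proportions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def row_normalized_confusion(class_test, class_pred, labels=(1, 2, 3)):
    """Confusion matrix where each row (true class) sums to 1."""
    counts = confusion_matrix(class_test, class_pred, labels=list(labels))
    row_totals = counts.sum(axis=1, keepdims=True)
    # np.maximum avoids a division by zero for a class with no test samples.
    return counts / np.maximum(row_totals, 1)
```

The result can be passed to sns.heatmap (for example with annot=True) to draw it.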

Looks like Realism is causing more trouble. This might be due to the fact that it is the least represented class, a common problem in machine learning. We can try to balance the classes with an undersampling method from imblearn (oversampling creates synthetic data, so it is better to work with what we have available). For now, I balance the classes outside the cross-validation loop, although in practice I always prefer to do it within the validation loop and on the training set only. To do this simply uncomment lines 6–7 within class_pipeline:

#sm = EditedNearestNeighbours()
#[features, class_ground] = sm.fit_resample(features, class_ground)

And now let’s see the results:

>>> print(np.mean(f1_score))
0.78

Getting better! This probably indicates that a larger and balanced data set would show even better results!

Finally, you could argue that literary movements are subjectively defined, so why would these classes be robust enough to show consistent results? To test this we can shuffle the classes and re-run the pipeline. Simply do:

random.shuffle(classes)

to shuffle the class order we defined at the beginning. Running the pipeline again with shuffled classes yields F1 = 0.25, which suggests that our original class boundaries (Romanticism, Realism and Surrealism) are defined by real, measurable characteristics present within the text. Interesting!
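Note that random.shuffle(classes) shuffles the array in place, so the original ordering is lost afterwards. A small sketch of a non-destructive alternative follows, using a hypothetical helper, shuffled_copy, that returns a permuted copy and leaves classes untouched:

```python
import numpy as np


def shuffled_copy(classes, seed=None):
    """Return a randomly permuted copy of the labels; the input is not modified."""
    rng = np.random.default_rng(seed)
    # permutation() returns a new array, so class counts are kept but the order is random.
    return rng.permutation(np.asarray(classes))
```

The shuffled copy can then be passed to the pipeline in place of classes, and the real labels remain available for the next run.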

In short, all this looks rather OK! Remember, this was done with a relatively small sample (fewer than 170 quotes in total) and in four different languages! This approach seems promising! How could you improve it? What else do you think you could do with this approach?

A final analysis we could do is to look at the semantic distance between all quotes with our distance matrix. Remember sM_All? Well, now is when we use it:

Semantic similarity between all possible quote pairs. The heatmap represents a symmetric matrix with all semantic similarity values, where very similar quotes are colored in blue and very different ones in yellow. The graph on the right is a rendering of this matrix where nodes are quotes and connections portray thresholded (see source code) similarity values. Romanticism is highlighted with a turquoise blue line, Realism with green and Surrealism with gray. Final figure tweaked with Inkscape.
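The plotting code is not reproduced here. As a sketch of the thresholding step behind the graph, the following hypothetical helper turns a similarity matrix into a weighted edge list. The threshold is a free parameter of the sketch, since the value used for the figure is only given in the source code:

```python
import numpy as np


def thresholded_edges(similarity, threshold):
    """Return (i, j, similarity) for every quote pair whose similarity exceeds threshold."""
    sim = np.asarray(similarity, dtype=float)
    # Upper triangle only (k=1 skips the diagonal), so each pair appears once.
    rows, cols = np.nonzero(np.triu(sim > threshold, k=1))
    return [(int(i), int(j), float(sim[i, j])) for i, j in zip(rows, cols)]
```

The resulting list can be handed to networkx with Graph.add_weighted_edges_from to build the graph.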

Great! With this technique you can visualize the complete quote ecosystem. As you can see, some quote pairs are more similar than others (similarity closer to 1). What about specific quotes? With this information we can check, for any given quote, which other quote is closest in semantic form! We can do this with:

#Quote 0 as an example (from Poe):
find_closest(sM_All,merged_list_all,names_mult,0)

Which gives as output:

'Edgar_Alan_Poe',
'I have absolutely no pleasure in the stimulants in which I sometimes so madly indulge. It has not been in the pursuit of pleasure that I have periled life and reputation and reason. It has been the desperate attempt to escape from torturing memories, from a sense of insupportable loneliness and a dread of some strange impending doom.',
'Georges_Bataille',
'I enjoyed the innocence of unhappiness and of helplessness; could I blame myself for a sin which attracted me, which flooded me with pleasure precisely to the extent it brought me to despair?'
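The find_closest helper is not reproduced in this text. A minimal sketch, assuming names_mult holds one author name per quote in the same order as merged_list_all, and that the output is a flat list as shown above, might look like this:

```python
import numpy as np


def find_closest(similarity, quotes, names, index):
    """Return [author, quote, closest author, closest quote] for the quote at index."""
    # Copy the row so the similarity matrix itself is left unchanged.
    sims = np.array(similarity[index], dtype=float)
    # Exclude the quote itself, which is always its own best match.
    sims[index] = -np.inf
    closest = int(np.argmax(sims))
    return [names[index], quotes[index], names[closest], quotes[closest]]
```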

Do you think these quotes look similar in semantic content? To me they do! What about other quotes/authors? You can play around with the code to find out. This part of the analysis is something that I plan to explore deeply in the future.

REMINDER: You can check the complete notebook for this project here.

I hope that this walk-through provided you with some insights on how to use sentence embeddings and classification, as well as rendering and visualization of text similarities. You can apply this in any field, so do not feel limited to literature. In fact, the number of possible applications is quite large, and with so many languages supported, this algorithm is here to stay. Personally, I was very surprised to see such an F1 score with such a small sample. In the end, this is the power of Machine Learning (including Deep Learning, of course): with the correct processing steps, you build algorithms that do work that seems rather impossible with the raw data at hand. A librarian would be very happy to have the help of such an algorithm.

Thank you for reading!


Machine Learning Specialist settled in Barcelona | PhD in Computer Science | Make Complex Look Simple. LinkedIn: https://www.linkedin.com/in/victor-saenger/