
If you haven’t already, I’d suggest reading the first part of this series. It’s a literature review of the research papers we will implement in code in this article. I’ve linked the article below.
I dive into the Python implementation of mining text data for characters, creating a character network based on character interactions, and modelling that network with the Chung-Lu model. All of these concepts were previously discussed in "Mining and Modelling Character Networks" by Bonato et al. [1] and "Extraction and Analysis of Fictional Character Networks: A Survey" by Labatut and Bost [2].
The body of text this article analyzes is Alice in Wonderland; however, the code is structured so that any piece of text can be run through it to yield its own results. Please note that Alice in Wonderland is in the public domain worldwide: it was published before January 1, 1923, and Carroll died in 1898, more than 100 years ago [4].
Table of Contents
- Data
- Installation Requirements
- Download Alice in Wonderland
- Clean Data
- Mining Character Names
- Approach 1: Regex
- Approach 2: Names Corpus
- Finding Character Interactions
- Creating Character Networks
- Chung-Lu Model
- Real World Applications
- Conclusion
- Resources
Data
The following outlines the Python version and libraries used in this article.
Installation Requirements
Python=3.8.8
networkx=2.5
pandas=1.2.4
numpy=1.20.1
scipy=1.6.2
matplotlib=3.3.4
nltk=3.6.1
There are other requirements, such as re, collections, string, and random, but these come pre-installed with Python.
Download Alice in Wonderland
After importing NLTK, you can download the text of Alice in Wonderland very easily. Simply run the following code in your Python environment; if the text is already downloaded, proceed to the next segment of this article.
Clean Data
Before trying to identify characters, it’s best to do a little cleaning of the raw text data. At a minimum, we can remove stopwords and punctuation and lowercase the entire body of text. This will make cross-referencing character names from the text with the names corpus perform a lot better.
Mining Character Names
There are various ways to identify character names. Here I’ll go over two approaches. The first is a regex that finds candidate character names; the second cross-references a list of known names in the English language with the body of text to identify which known names appear in it. The second approach uses an open-source dataset provided by the government of Ontario: Ontario Baby Names 1917–2019.
Let me caveat this section by stating the obvious: none of these approaches will yield 100% accurate results for any given body of text. Even NLP solutions based on NER struggle, as it’s difficult for a machine or algorithm to identify aliases, groups, and other nuances that only a human can pick up. The best approach is for the user to create a dictionary where the keys are the known names in the body of text and the values are lists of aliases associated with each known name. For example, in Harry Potter the known dictionary would look something like this:
known_dct = {
    'Harry': ['Harry Potter', 'The Chosen One', ...],
    'Voldemort': ['He Who Must Not Be Named', 'Voldything', ...],
    ...
}
Approach 1: Regex
This implementation, although it worked, yielded a lot of false positives among character names. It’s quite difficult to structure a regex that works in all situations for any body of text. Some of the false positives this approach yielded are:
'Uglification Derision', 'Mouse Alice', 'Hatter Dormouse', 'William Conqueror', 'Caucusrace Caucusrace', 'Time Hatter', 'Elsie Lacie', 'Mad Tea', 'Soup Soup', 'Edgar Atheling', 'Gryphon Turn', 'King Queen', 'Alices Lewis'
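A simplified sketch of the idea: match runs of consecutive capitalized words as name candidates. The exact pattern here is an illustration, not the only option:

```python
import re
from collections import Counter

def find_names_regex(raw_text):
    """Count runs of one or two consecutive capitalized words."""
    pattern = r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)?"
    return Counter(re.findall(pattern, raw_text))
```

False positives like 'Mad Tea' and 'Gryphon Turn' arise because sentence-initial words, titles, and adjacent names are capitalized just like real two-word names.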
Approach 2: Names Corpus
The corpus of names was provided by the government of Ontario, which released two data sources consisting of the names of male and female babies from 1917–2019, a total of 164,034 names. Since Ontario is a relatively diverse province, the names cover various ethnic backgrounds, which makes the corpus more universally applicable to different bodies of text. You can download the names from here:
Or you can reference the CSV I’ve uploaded to my GitHub, which is the concatenation of the male and female names. This CSV is what’s referenced throughout the code in this article.
Overall, approach 2 yielded much better results, as it can also account for aliases when the user supplies a names dictionary. The results of approach 2 feed into the following parts of the article.
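A sketch of the cross-referencing step over the cleaned, lowercased text. The CSV filename and column name in the commented loading code are assumptions; adjust them to match the file you download:

```python
from collections import Counter

def find_names_corpus(cleaned_text, known_names):
    """Count tokens of the cleaned text that appear in the names corpus."""
    return Counter(w for w in cleaned_text.split() if w in known_names)

# Loading the Ontario names corpus (file and column names are assumptions):
# import pandas as pd
# names_df = pd.read_csv('ontario_names.csv')
# known_names = {str(n).lower() for n in names_df['name']}
```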
Other approaches to mining character names were also explored, such as NER; I didn’t show them here because only the high-end solutions yielded useful results. Small-scale solutions like the built-in NER models in NLTK and spaCy produced too many false positives; the approaches above also yield false positives, but not of the same magnitude. The one NER solution that did yield useful results was training BERT on labelled name data for this corpus of text. Although this gave good accuracy, it was very expensive in both time and money: I didn’t have the time or resources to label thousands of books with name data and retrain the BERT model, nor the budget for GPUs to aid the training process. Thus the simpler approach to finding names is the one I showcased here.
Finding Character Interactions
This section depends on how you define character interactions. From part 1 of this series, we found that there are many ways to identify them. Labatut and Bost highlight six different approaches for identifying interactions between characters [2]:
1. Co-occurrence of characters
2. Direct verbal interactions between characters
3. Explicit mentions between characters
4. Non-verbal interactions (fighting, kissing, killing, etc.)
5. Affiliation between characters (coworkers, family, friends, etc.)
6. Hybrid approaches of 1–5
For the purposes of this article, we will focus on the first approach: co-occurrence of characters. We can say that two characters co-occur if their names appear within N words of each other. The value of N can be chosen by the user; in "Mining and Modelling Character Networks", Bonato et al. chose N = 15, and we use the same value here.
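A sketch of window-based co-occurrence counting over the cleaned token stream, with N = 15 as in Bonato et al.:

```python
from collections import Counter
from itertools import combinations

def find_cooccurrences(tokens, characters, n=15):
    """Count unordered character pairs mentioned within n tokens of each other."""
    # Positions where a character name is mentioned
    mentions = [(i, w) for i, w in enumerate(tokens) if w in characters]
    cooc = Counter()
    for (i, a), (j, b) in combinations(mentions, 2):
        if a != b and j - i <= n:
            cooc[tuple(sorted((a, b)))] += 1
    return cooc
```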
Creating Character Networks
For this, we’ll create a weighted network. The nodes are the characters themselves, and the edges are formed from the co-occurrences found above. The weight of an edge is the frequency of co-occurrence in the given body of text. For example, if Alice and the Queen had 15 interactions in the body of text, the weight of the edge connecting Alice to the Queen would be 15.
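Building the weighted graph from the co-occurrence counts is straightforward with networkx:

```python
import networkx as nx

def build_network(cooccurrences):
    """Turn {(char_a, char_b): count} into a weighted undirected graph."""
    G = nx.Graph()
    for (a, b), count in cooccurrences.items():
        G.add_edge(a, b, weight=count)
    return G

# Toy example with made-up counts
G = build_network({('alice', 'queen'): 15, ('alice', 'king'): 10})
print(G['alice']['queen']['weight'])  # → 15
```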


As you can see in the network, Alice, the King, and the Queen are very prominent characters with high edge weights. Similarly, the most important characters (captured by the PageRank of each node) are indeed important characters in the book: Alice, the March Hare, the Queen of Hearts, and the King of Hearts are the top four.
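Node importance here is just networkx's weighted PageRank. A toy example (the edge weights are made up for illustration, not taken from the book):

```python
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ('Alice', 'Queen of Hearts', 15),
    ('Alice', 'King of Hearts', 10),
    ('Queen of Hearts', 'King of Hearts', 8),
    ('Alice', 'March Hare', 6),
])

# PageRank using co-occurrence counts as edge weights
ranks = nx.pagerank(G, weight='weight')
ordered = sorted(ranks, key=ranks.get, reverse=True)
print(ordered[0])  # Alice has the highest weighted degree, so she ranks first
```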
Chung-Lu Model
Based on the paper by Bonato et al., the Chung-Lu model was the best-performing model at capturing the properties of the original character network, hence we will implement it here. The model is outlined in "Generating large scale-free networks with the Chung-Lu random graph model" by Dario Fasino, Arianna Tonetto, and Francesco Tudisco [5], who have open-sourced the associated Python and MATLAB code via their GitHub repository [3]. We only need some slight refactoring of the input parameters to match the degree distribution of our network.
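The authors' repository [3] provides their optimized implementation. As a quick alternative for a network this small, networkx ships the same Chung-Lu construction as `expected_degree_graph`, which takes an expected degree sequence; the sequence below is illustrative, in practice you would read it off the character network:

```python
import networkx as nx

# Expected degree sequence; these numbers are illustrative, in practice use
# [d for _, d in character_network.degree()]
degrees = [9, 6, 5, 4, 3, 3, 2, 2, 1, 1]

# Chung-Lu model: nodes i and j are connected independently with
# probability ~ w_i * w_j / sum(w)
CL = nx.expected_degree_graph(degrees, selfloops=False, seed=42)
print(CL.number_of_nodes())  # → 10
```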
Real World Applications
Now you might be thinking that this was a fun experiment, but where can it really be applied in a real-world setting? Applying the methodology introduced in the papers discussed today is simply a matter of how you structure the problem. For example, in the legal industry it might be useful to create a legal network, where the nodes are laws and the edges connect laws broken by plaintiffs and defendants. Upon creating that network, you essentially have a strong graph database connecting a variety of laws with a variety of crimes and cases on which a judge or jury has given a verdict. This network could also be temporal: as time progresses, you can watch new nodes and edges form and get an idea of how future edges and nodes might be created or interacted with. These networks could be modelled with Chung-Lu or other models to drive business decisions.
Conclusion
Overall, this article went over implementing a variety of topics and concepts introduced during the literature review in the previous article. It covered various aspects of text mining, network analysis, and graph modelling. The purpose of this series was not only to cover a research paper but also to implement it, a task many research scientists and research engineers do at large companies.
A good problem solver is able to use what they’ve read and learned and apply it to the problems they’re facing. A network based approach similar to this might yield a strong result when it comes to the legal industry.
There are still many ways this area of research could be furthered. In future research papers covering this topic, I hope to see temporal network creation, where we can watch how the network changes as the story progresses. This would suit popular long-running books / movies / TV shows like Harry Potter, The Simpsons, and Fast and Furious. I’m also hoping to see future research address the fact that this approach wouldn’t yield strong results on a short story or novel.
The script below is all the code shown above wrapped into a main function. You can also reference the GitHub repository associated with this and many of my other articles here.
Resources
- [1] Bonato, A., D’Angelo, D. R., Elenberg, E. R., Gleich, D. F., & Hou, Y. (2016). "Mining and modeling character networks." In Algorithms and Models for the Web Graph: 13th International Workshop, WAW 2016, Montreal, QC, Canada, December 14–15, 2016, Proceedings 13 (pp. 100–114). Springer International Publishing
- [2] Vincent Labatut and Xavier Bost. 2019. "Extraction and Analysis of Fictional Character Networks: A Survey". ACM Computing Surveys 52(5):89. https://doi.org/10.1145/3344548
- [3] https://github.com/ftudisco/scalefreechunglu
- [4] https://www.legalcurrent.com/alice-in-wonderlands-adventures-in-case-law/#:~:text=Note%20that%20%E2%80%9CAlice%20in%20Wonderland,more%20than%20100%20years%20ago.
- [5] https://arxiv.org/pdf/1910.11341.pdf
If you liked this article, here are a few more you might enjoy:
Recommendation Systems Explained
Dimensionality Reduction Explained