My Computer Reads ‘The Wheel of Time’.

Natural Language Processing on an epic fantasy book series.

Stephan Hermanides
Towards Data Science


Photo by Cederic Vandenberghe on Unsplash

Once upon a time, in an era now known as The ’90s, a mighty battle was being waged for the title of Best Epic Fantasy post-Lord of the Rings. In one corner stood George R. R. Martin, with his dark and grim series ‘A Song of Ice and Fire’, now more commonly known as ‘Game of Thrones’. The other challenger was Robert Jordan, with his own series ‘The Wheel of Time’. This grand tale is less well known today, but that may change when the TV show based on the books comes out sometime next year; filming is underway at the time of writing.

As more and more announcements come out about filming this new fantasy show, my love for the books has been reignited as well. So when I was choosing my next project to work on as part of my Metis Data Science Bootcamp, I decided to have my computer ‘read’ the series as a Natural Language Processing (NLP) project.

The Books

The Wheel of Time is a massive story, spanning 15 books, 704 total chapters and a whopping 4.4 million words. It features a huge cast of characters and locations, and is told from 147 unique characters’ points of view, some more central to the story than others. The story starts in a small village with the central group of characters and grows in size and complexity from there. It takes place in a richly detailed world with fleshed-out cultures, magic and politics. As is common in epic fantasy, the central theme is an ancient battle between light and darkness, good and evil, fought over and over as the Wheel of Time spins.

The Wheel of Time is a massive story, spanning 15 books, 704 total chapters and a whopping 4.4 million words.

Unfortunately, Robert Jordan passed away before he was able to finish his life’s work, but he left behind extensive notes and even completed parts of future chapters. His wife and editor, Harriet, chose Brandon Sanderson to finish the books using the material available, and the final book came out in January 2013.

Methodology

Before I could get to my actual analysis, I had to make sure I could access the full text in Python. I purchased the eBook version of the bundled series (ISBN 9780765376862) and used the EbookLib library to access the books’ XHTML content, which I parsed with BeautifulSoup. Surprisingly, this turned out to be the most challenging step in my project, because this eBook bundles multiple books, several of which turned out to be formatted differently from the others. I ended up having to set up per-book exceptions and rules to parse and extract each one.
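The extraction boils down to something like this minimal sketch, assuming the bundle is saved as ‘wheel_of_time.epub’ (a hypothetical filename); my actual code adds the per-book exceptions mentioned above:

```python
# A rough sketch of the extraction step, not my exact parsing code.
import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup

book = epub.read_epub('wheel_of_time.epub')  # hypothetical filename

chapters = []
for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
    # Each document item is XHTML; strip the markup and keep the text
    soup = BeautifulSoup(item.get_content(), 'html.parser')
    text = soup.get_text(separator=' ', strip=True)
    if text:
        chapters.append(text)
```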

This was an unsupervised learning project, so I mainly focused on topic modeling to see whether the computer could make sense of this huge body of text. I tried multiple techniques to see what different modeling approaches would come up with. You can find my full code and results on my GitHub here, including annotations for every discarded option, but for this article it’s enough to say that I landed on a TF-IDF vectorizer and Non-Negative Matrix Factorization (NMF) for my final model. Going in, I had a feeling I would need bi-grams to capture the many fictional names and places in the books, but they turned out to be unnecessary and did not improve the quality of my topic modeling at all.

Of all the models I tried, the breakthrough was switching from a standard count vectorizer to TF-IDF. The moment I did, the topics started to align with my knowledge of the books.
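The final pipeline boils down to a sketch like this, assuming `chapters` holds the chapter texts from the extraction step; the `max_df`/`min_df` values and `random_state` are illustrative, not my exact settings:

```python
# A minimal sketch of the final model: TF-IDF + NMF via scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# max_df/min_df are illustrative choices, not my exact settings
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.5, min_df=5)
doc_term = vectorizer.fit_transform(chapters)

nmf = NMF(n_components=15, random_state=42)
doc_topic = nmf.fit_transform(doc_term)  # chapter-by-topic weights
topic_term = nmf.components_             # topic-by-word weights

# Print the ten highest-weighted words for each topic
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(topic_term):
    top = [words[j] for j in topic.argsort()[-10:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")
```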

Results

The results of my topic modeling were clearer than I could have hoped for when starting the project. The final model very clearly distinguishes character arcs throughout the 14 books. I limited the final number of topics to 15, and all but one have the name of a central character (or a duo of characters) as their highest-weighted feature. Increasing the number of topics still yields sensible new topics, but built around plots and characters of lesser importance. The one topic that does not revolve around a character centers on the Last Battle, which is so different and so pivotal that it makes sense for it to have its own topic.

The final model very clearly distinguishes character arcs throughout the 14 books

Having these character arcs so clearly defined made it possible to track them across the books and the series as a whole. I created a heat map visualization that shows the 15 topics and where they appear in the books. The x-axis shows the chapter numbers, the y-axis the topics; the red lines mark the boundaries between books, and the intensity of color shows how strongly a chapter is related to each topic.

Heat map of topics across the entire Wheel of Time book series
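A figure along these lines takes only a few lines of seaborn. A minimal sketch, assuming `doc_topic` from the modeling sketch above and a hypothetical `book_starts` list of each book’s first chapter index:

```python
# A minimal sketch of the topic heat map, not my exact plotting code.
import matplotlib.pyplot as plt
import seaborn as sns

book_starts = []  # fill in each book's first chapter index from the parse

fig, ax = plt.subplots(figsize=(16, 4))
sns.heatmap(doc_topic.T, cmap='Blues', cbar=False, ax=ax)  # topics x chapters
for start in book_starts:
    ax.axvline(start, color='red', linewidth=1)  # book boundaries
ax.set_xlabel('Chapter')
ax.set_ylabel('Topic')
plt.show()
```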

For those unfamiliar with the books, I’ll point out just how accurate this result is with an example. One character very central to the series is Moiraine, who appears in two topics in combination with her two strongest allies, Siuan and Lan. She is so central, in fact, that about halfway through the Wheel of Time’s run she got her own prequel exploring her backstory. That prequel is the very first book in the heat map above, and you can see that the visual is completely focused on her two topics for that book.

Results like the example above were plentiful, and I kept finding interesting applications for the final model. I collected book summaries from fan sites and compared them to individual book heat maps, like the one pictured below for the book Winter’s Heart. The results matched up exceptionally well, like the chapters showing an overlap between Mat’s and Tuon’s character arcs. Spoiler alert: those two end up married after meeting in this particular book.

Then I took the book with the lowest Goodreads rating in the series (book 10: Crossroads of Twilight) and modeled its topics separately, roughly as in the sketch below. The result surfaced the book’s overly political plots, as well as the topic fans have come to know as the Plotline of Doom: a particular plot that was drawn out far too long and featured some odd character behavior. I could give many more examples, but you get the idea.
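A minimal sketch of that per-book run, assuming hypothetical chapter boundaries `start` and `end` for the book within `chapters`, and an illustrative topic count:

```python
# Refit the same kind of model on a single book's chapters.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

start, end = 0, 0  # fill in Crossroads of Twilight's chapter range
book10 = chapters[start:end]

book_vectorizer = TfidfVectorizer(stop_words='english')
book_doc_topic = NMF(n_components=5, random_state=42).fit_transform(
    book_vectorizer.fit_transform(book10)  # n_components is illustrative
)
```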

The final test I performed, to really put this model through its paces, was to see how it would compare to a human interpretation of the books. The publisher’s website features a long-running series called the Wheel of Time Reread: an in-depth summary of each chapter, with added interpretation and analysis by author and super-fan Leigh Butler. I scraped every chapter summary from the website, ran them through my model, and found that the main topic matched on 86% of chapters, with many of the unmatched ones being very close calls. The comparison worked roughly as in the sketch below.
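A minimal sketch of that comparison, assuming the scraped summaries sit in a list `summaries` aligned one-to-one with `chapters`, and reusing `vectorizer`, `nmf` and `doc_topic` from the modeling sketch:

```python
# Score the Reread summaries against the fitted model and check whether
# each summary's dominant topic matches its chapter's dominant topic.
import numpy as np

summary_topic = nmf.transform(vectorizer.transform(summaries))

predicted = summary_topic.argmax(axis=1)
actual = doc_topic.argmax(axis=1)
print(f"Main-topic match rate: {np.mean(predicted == actual):.0%}")
```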

Limitations

As well as the model performs, the comparison to the Reread showed some limitations. Because the model was trained on the entire series, the topics are very high level. Some truly pivotal moments highlighted by the Reread summaries were not picked up by the computer, since they may not take up many of a chapter’s words. The computer also has no way to model pacing, plot resolution or subplots within a character arc, like the aforementioned Plotline of Doom.

As an enthusiastic reader, I am actually glad to know that there are still good reasons to read the books myself. As a data scientist, I am glad that my model yielded such interesting results. I enjoyed diving into one of my favorite book series in a different way, using the data science skills I have acquired. I would urge anyone who would like to try the same experiment on their own favorite book to have a look at my code and have some fun with it.
