
What can we learn from how readers organize their books online?

Goodreads vs The StoryGraph – how can intentional data collection help readers find their perfect match

Better data collection could get readers their oh-so-perfect recommendations to suit their every mood

Photo by Seven Shooter on Unsplash

On Goodreads.com, users have the opportunity to create their own "shelves" of books, and they can name those shelves anything they want. This open-ended data point has the potential to give us insight into how readers self-define the types of books they read, but it comes with its own challenges. These challenges could contribute to why Goodreads recommendations are notoriously mediocre (haven’t heard? Google and Find Out).

After exploring what Goodreads can tell us about reader interest groups, we’ll take a look at a rising star in online platforms for readers, The StoryGraph, and why it offers a promising answer to some of the challenges Goodreads faces.


In Part I we’ll explore the following questions:

  • What are different ways users virtually organize their books?
  • Why aren’t Goodreads recommendations on par with the Netflix experience?

In Part II we’ll better understand:

  • Can we use Goodreads data to group our readers into clusters based on interest?
  • How can improvements in the way user data is collected help provide better recommendations to readers?

Before we get started, here’s an example of a Goodreads user who has created custom shelves for the different genres they read:

Example of a user’s shelves on Goodreads

Part I. Data understanding and preparation

Data on users’ Goodreads "shelves" is sparse and messy. However, there is still information to be gleaned regarding how readers organize their books.

Process for collecting and processing data from Goodreads

  1. Use the Goodreads API to create a table of active users and their custom shelves.
  2. Exclude any shelves that don’t relate to the genre of the book.
  3. Map similar shelves to a single label (ya → young-adult).
  4. Normalize the skewed data, since most shelves are only used by a small portion of users.
  5. Address data sparsity by setting a minimum number of users per shelf and a minimum number of shelves per user (a rough code sketch of steps 2-5 follows this list).
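For those who want to follow along, here’s a minimal sketch of steps 2 to 5 in Python with pandas. The column names, alias map, and thresholds are illustrative assumptions rather than the exact values used in this project (the real code is in the GitHub repository), and it assumes the shelf data has already been pulled from the API into a DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical input: one row per (user, custom shelf), already pulled from the API.
shelves = pd.DataFrame({
    "user_id":    [1, 1, 2, 2, 3, 3, 3],
    "shelf_name": ["ya", "sci-fi", "young-adult", "kindle", "ya", "scifi", "mystery"],
    "book_count": [12, 4, 30, 55, 8, 3, 17],
})

# Steps 2-3: drop non-genre shelves and map aliases onto a single canonical label.
NON_GENRE = {"kindle", "to-buy", "owned", "audiobook"}
ALIASES = {"ya": "young-adult", "scifi": "sci-fi", "sf": "sci-fi"}

shelves = shelves[~shelves["shelf_name"].isin(NON_GENRE)].copy()
shelves["shelf_name"] = shelves["shelf_name"].replace(ALIASES)

# Step 4: log-transform the heavily skewed book counts.
shelves["book_count"] = np.log1p(shelves["book_count"])

# Step 5: keep only shelves used by several users, and users with several shelves.
MIN_USERS_PER_SHELF, MIN_SHELVES_PER_USER = 2, 2
shelves = shelves[shelves.groupby("shelf_name")["user_id"].transform("nunique") >= MIN_USERS_PER_SHELF]
shelves = shelves[shelves.groupby("user_id")["shelf_name"].transform("nunique") >= MIN_SHELVES_PER_USER]

# Pivot into the user x shelf matrix used for PCA and clustering in Part II.
user_shelf = shelves.pivot_table(index="user_id", columns="shelf_name",
                                 values="book_count", fill_value=0)
print(user_shelf)
```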

Initial observations and limitations

What are different ways users virtually organize their books? Why aren’t Goodreads recommendations on par with the Netflix experience?

Takeaway: Users categorize their books in organic, non-systematic ways. This holds potential for rethinking how we group readers by their interests, but Goodreads’ current method for collecting this data isn’t getting as much out of it as it could.

Challenge 1: Sparse data

Using the API to randomly search over 120K users yielded only 730 users (0.6%) who were active in 2020 and had created custom shelves.

  1. Plenty of users created accounts but never added any books!
  2. Many users don’t make use of the custom shelves feature and stick with the three default shelves – "read", "currently-reading", and "to-read". I have to admit I’m one of these people!
Photo by Clay Banks on Unsplash

Challenge 2: Messy data

  1. A lot of shelves don’t tell us about the genre, such as: "kindle", "to-buy", "meh", "beware-the-hype", and "would-i-like-this-now-probably-not". We do have to give credit to these shelves for their entertainment value.
  2. In this data set, over 3,300 shelves (84%) were so unique that only one user in the sample had a shelf of its kind! Some very specific shelves: "bawl-your-eyes-out", "elf-romance", "epic-multi-generational-narratives", "glasses-hero", and "cyborg".

These are great categories, but Goodreads is missing the connection between "thought-provoking-sci-fi" readers and "makes-you-think-sci-fi" readers.


Part II. Modeling and evaluation

Can we use Goodreads data to group our readers into clusters based on interest?

Here is a heatmap of the first 9 components generated with Principal Component Analysis, a method for reducing the number of shelves by grouping those that show similar patterns. The red areas indicate genres that were popular within a specific component, and the blue areas indicate genres that were rare in that same component.
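As a rough illustration of how such a heatmap can be produced (again, not this project’s exact code), here’s a sketch with scikit-learn and seaborn. It assumes the user-by-shelf matrix `user_shelf` from the preprocessing sketch above, and that the full data set has well over nine genre shelves:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the user x shelf matrix so no single popular genre dominates.
X = StandardScaler().fit_transform(user_shelf)

# Reduce the genre shelves to nine components that capture shared patterns.
pca = PCA(n_components=9, random_state=42)
pca.fit(X)

# Rows = components, columns = genre shelves. Red loadings mark genres that are
# popular within a component; blue loadings mark genres that are rare in it.
loadings = pd.DataFrame(pca.components_,
                        columns=user_shelf.columns,
                        index=[f"component {i}" for i in range(9)])

plt.figure(figsize=(12, 5))
sns.heatmap(loadings, cmap="coolwarm", center=0)
plt.tight_layout()
plt.show()
```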

Observations and areas for further research

Several cluster analyses were conducted but did not yield meaningful groups (see the GitHub repository for details).
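To give a flavour of what one of those analyses could look like, here’s a sketch continuing from the PCA snippet above: cluster users on their component scores with k-means and check how well separated the groups are. The real analyses and their results live in the repository.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Project users onto the nine principal components fitted above.
user_components = pca.transform(X)

# Try a range of cluster counts; silhouette scores near 0 would be consistent
# with the groups not being well separated or meaningful.
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(user_components)
    print(f"k={k}: silhouette={silhouette_score(user_components, labels):.3f}")
```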

The preprocessing addressed the hyper-specificity of shelves and sparse data, but it led to new barriers to forming meaningful groups:

  • Some users’ categories were too vague (I’m looking at you component 0).
  • Dropping the very unique shelves and rolling up other shelves ("ya-romance" to "young-adult") meant losing a lot of nuance.

How can improvements in the way user data is collected help provide better recommendations to readers?

You can see some potential for more tailored clusters emerging. Components 6 and 8 in particular show an interesting contrast: Both groups read mysteries, but group 6 favors historical-fiction whereas group 8 favors sci-fi and LGBTQ+ reads.

A recommendation system that takes into account multiple genres simultaneously could provide more targeted suggestions. For a reader who identifies more with component 6, it might recommend The Shadow of the Wind, a historical mystery. Conversely, for someone who leans toward component 8, it could recommend Whisperworld, a mysterious sci-fi with LGBTQ+ representation.
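The core idea can be sketched in a few lines of Python. The tiny catalogue below is purely illustrative (the genre tags are guesses, not real StoryGraph data), but it shows how combining "must include" and "must exclude" genres narrows the recommendations for each group:

```python
# Toy catalogue of books and genre tags; the first two titles come from the
# article, the tag sets and the third title are illustrative only.
books = {
    "The Shadow of the Wind": {"mystery", "historical-fiction"},
    "Whisperworld":           {"mystery", "sci-fi", "lgbtqia"},
    "A Generic Thriller":     {"mystery"},
}

def recommend(include, exclude=frozenset()):
    """Return titles tagged with every genre in `include` and none in `exclude`."""
    return [title for title, tags in books.items()
            if include <= tags and not (exclude & tags)]

# Component 6 readers: mysteries that are also historical fiction.
print(recommend({"mystery", "historical-fiction"}))                      # ['The Shadow of the Wind']
# Component 8 readers: sci-fi mysteries, with historical settings excluded.
print(recommend({"mystery", "sci-fi"}, exclude={"historical-fiction"}))  # ['Whisperworld']
```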

Recommendations for groups 6 and 8 from https://beta.thestorygraph.com/browse

But wait – how was I able to find these highly customized recommendations?

https://beta.thestorygraph.com/browse

By using The StoryGraph’s multi-genre search! Not only can you specify that a book should straddle multiple genres, but you can also cater to folks who identify with component 8 by adding fantasy and historical novels to the "exclude" section.


The future of reader data

In addition to looking at books that extend across genres, The StoryGraph has also launched a mood-based tagging system. Readers can tag books by mood when they leave a review (funny, reflective, tense). These tags then become searchable, so readers can specify whether they would prefer a "challenging, dark, mysterious fantasy" such as The Fifth Season or an "adventurous, lighthearted fantasy" such as The Long Way to a Small, Angry Planet.

This search system provides a new, shared vocabulary for describing books. The site encourages users to tag books by putting the tagging prompt front and center on the review page, and the more users add data to this system, the more powerful the search functionality will become.

Readers are complex and can’t be defined by a single genre.

Photo by Jilbert Ebrahimi on Unsplash

Providing a standardized, searchable vocabulary for tagging books, while keeping it easy for readers to use, will help enrich our data on books and, in turn, help that one user easily find those "epic-multi-generational-narratives" they so enjoy.

