From Goodreads to Great Reads
Using Python to predict what makes books great

What
This is a data set of the first 50,000 book ids pulled from Goodreads’ API on July 30th, 2020. A few thousand ids did not make it through because the book id was changed, the URL or API broke, or the information was stored in an atypical format.
Why
From the reader’s perspective, books are a multi-hour commitment of learning and leisure (they don’t call it Goodreads for nothing). From the author’s and publisher’s perspectives, books are a way of making a living (with some learning and leisure too). In both cases, knowing which factors explain and predict great books will save you time and money. And while different people have different tastes and values, how a book is rated in general is a sensible starting point; you can always update it later.
Environment
It’s good practice to work in a virtual environment, a sandbox with its own libraries and versions, so we’ll make one for this project. There are several ways to do this, but we’ll use Anaconda. To create and activate an Anaconda virtual environment called ‘gr’ (for Goodreads) using Python 3.7, run the following commands in your terminal or command line:
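The original commands aren’t reproduced here, but for an Anaconda environment named ‘gr’ on Python 3.7 they would look something like this:

```
conda create -n gr python=3.7
conda activate gr
```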
Installations
You should see ‘gr’ or whatever you named your environment at the left of your prompt. If so, run these commands. Anaconda will automatically install any dependencies of these packages, including matplotlib, numpy, pandas, and scikit-learn.
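The exact package list isn’t shown here, but based on the libraries used later in this post (XGBoost, eli5, plus encoding and plotting helpers, all assumptions on my part), the installs probably looked something like:

```
conda install -c conda-forge xgboost eli5 category_encoders seaborn
```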
Imports
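The import cell isn’t reproduced here either; a plausible set covering everything used below (the exact list is an assumption) would be:

```python
import json
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import category_encoders as ce
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
```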
Data Collection
We pull the first 50,000 book ids and their associated information using a lightweight wrapper around the Goodreads API written by Michelle D. Zhang (code and documentation here), then write each record as a dictionary to a JSON file called book_data.
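The collection loop looked roughly like the sketch below; the wrapper’s exact interface isn’t reproduced here, so get_book is passed in as a stand-in for its lookup call:

```python
import json

def collect_books(get_book, start_id=1, end_id=50_000, path="book_data.json"):
    """Pull books by id and append each record as a JSON dictionary.

    `get_book` stands in for the API wrapper's lookup call. Ids that fail
    (changed id, broken URL or API, atypical format) are simply skipped.
    """
    with open(path, "w") as f:
        for book_id in range(start_id, end_id + 1):
            try:
                record = get_book(book_id)
            except Exception:
                continue  # a few thousand ids don't make it through
            f.write(json.dumps(record) + "\n")
```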
Data Cleaning
We’ll define and describe some key functions below, but we’ll run them in one big wrangle function later.
Wilson Lower Bound
A rating of 4 stars based on 20 reviews and a rating of 4 stars based on 20,000 reviews are not equal. The rating based on more reviews has less uncertainty about it and is, therefore, a more reliable estimate of the "true" rating. In order to properly define and predict great books, we must transform average_rating by putting a penalty on uncertainty.
We’ll do this by calculating a Wilson Lower Bound, where we estimate the confidence interval of a particular rating and take its lower bound as the new rating. Ratings based on tens of thousands of reviews will barely be affected because their confidence intervals are narrow. Ratings based on fewer reviews, however, have wider confidence intervals and will be scaled down more.
Note: We modify the formula because our data comes from a 5-point rating system, not the binary system Wilson described. Specifically, we decrement average_rating by 1 for a conservative estimate of the true, non-inflated rating, and then normalize it. If this penalty is too harsh or too light, more ratings will, over time, raise or lower the book’s rating, respectively. In other words, with more information, the adjustment is self-correcting.
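Here is a sketch of the adjustment, assuming a 95% confidence level (z = 1.96) and mapping the result back to the 1-to-5 scale (an assumption consistent with the adjusted statistics reported below):

```python
import math

def wilson_lower_bound(avg_rating, num_ratings, z=1.96):
    """Wilson lower bound of a 1-5 star rating, penalizing small samples.

    The average rating is decremented by 1 and divided by 4 so it behaves
    like a proportion in [0, 1]; the lower bound of its Wilson score
    interval is then mapped back to the 1-5 scale.
    """
    if num_ratings == 0:
        return 0.0
    p_hat = (avg_rating - 1) / 4          # conservative, normalized rating
    n = num_ratings
    center = p_hat + z**2 / (2 * n)
    spread = z * math.sqrt((p_hat * (1 - p_hat) + z**2 / (4 * n)) / n)
    lower = (center - spread) / (1 + z**2 / n)
    return lower * 4 + 1                   # back to the 1-5 scale
```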
Genres
Goodreads’ API returns ‘shelves’, which encompass actual genres like "science-fiction" and user-created categories like "to-read". We extracted only the 5 most popular shelves when pulling the data to limit this kind of clean-up; here, we’ll finish the job.
After some inspection, we see that these substrings represent the bulk of non-genre shelves. We’ll filter them out using a regular expression. Note: We use two strings in the regex so the line doesn’t get cut off. Adjacent strings inside parentheses are joined at compile time.
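For illustration, the filter might look like this; the substring list below is an assumption, not the original one:

```python
import re

# Substrings that flag user-created, non-genre shelves. The two adjacent
# string literals inside the parentheses are joined at compile time.
NON_GENRE = re.compile(
    r"(to-read|currently-reading|owned|my-books|books-i-own|favou?rites"
    r"|wish-?list|library|audio|kindle|ebooks?|default|to-buy)"
)

def keep_genres(shelves):
    """Keep only the shelves that look like actual genres."""
    return [shelf for shelf in shelves if not NON_GENRE.search(shelf)]
```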
All-in-one Cleaning
Now we’ll build and run one function to wrangle the data set. This way, the cleaning is more reproducible and easier to debug.
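A sketch of what that function might look like, reusing the helpers defined above; the column names average_rating, ratings_count, and popular_shelves are assumptions here:

```python
def wrangle(df):
    """Clean the raw book data in one reproducible pass (sketch)."""
    df = df.copy()

    # Drop books with no ratings; their averages are meaningless.
    df = df[df["ratings_count"] > 0]

    # Penalize uncertainty with the Wilson lower bound defined above.
    df["adjusted_rating"] = [
        wilson_lower_bound(avg, n)
        for avg, n in zip(df["average_rating"], df["ratings_count"])
    ]

    # Keep only genre-like shelves.
    df["genres"] = df["popular_shelves"].apply(keep_genres)

    # Label the top 20% of books by adjusted rating as great.
    cutoff = df["adjusted_rating"].quantile(0.80)
    df["great"] = (df["adjusted_rating"] >= cutoff).astype(int)

    return df
```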
Compare Unadjusted and Adjusted Average Ratings
Numerically, the measures of central tendency, the mean (in blue) and the median (in green), decrease slightly, while the variance decreases substantially.
Visually, we can see the rating adjustment in the much smoother and wider distribution (although note that the x-axis is truncated). This comes from eliminating outlier books with no or very few ratings and from scaling down ratings with high uncertainty.

Unadjusted mean: 3.82
Unadjusted median: 3.93
Unadjusted variance: 0.48

Adjusted mean: 3.71
Adjusted median: 3.77
Adjusted variance: 0.17
Data Leakage
Because our target is derived from ratings, training our model using ratings is effectively training with the target. To avoid distorting the model, we must drop these columns.
It is also possible that review_count leaks a bit of information, but it seems more like a proxy for popularity than for greatness, in the same way that pop(ular) songs are not often considered classics. Of course, we’ll reconsider this if its permutation importance is suspiciously high.
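In code, that step is just a drop; the exact column names below are assumptions:

```python
# Ratings-derived columns would leak the target, so they don't go into X.
leaky_columns = ["average_rating", "ratings_count", "adjusted_rating"]
df = df.drop(columns=leaky_columns)
```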
Split Data
We’ll do an 85/15 train-test split, then re-split our train set to make the validation set about the same size as the test set.
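A sketch of the split, assuming the great target defined earlier and a stratified split (the random seed is arbitrary):

```python
from sklearn.model_selection import train_test_split

target = "great"
features = df.columns.drop(target)

# 85/15 train/test split, then carve a validation set of roughly the same
# size as the test set out of the remaining training data.
train, test = train_test_split(df, test_size=0.15, stratify=df[target], random_state=42)
train, val = train_test_split(train, test_size=len(test) / len(train),
                              stratify=train[target], random_state=42)

X_train, y_train = train[features], train[target]
X_val, y_val = val[features], val[target]
X_test, y_test = test[features], test[target]

print(X_train.shape, y_train.shape, X_val.shape, y_val.shape, X_test.shape, y_test.shape)
```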
(20281, 12) (20281,) (4348, 12) (4348,) (4347, 12) (4347,)
Evaluation Metrics
With classes this imbalanced, accuracy (correct predictions / total predictions) can be misleading. There just aren’t enough true positives for that fraction to be the best measure of model performance. So we’ll also use ROC AUC, the area under the Receiver Operating Characteristic curve. Here is a colored drawing of one, courtesy of Martin Thoma.

The ROC curve plots a classification model’s true positive rate (TPR) against its false positive rate (FPR) as the classification threshold varies. The ROC AUC is the area under this curve across the full [0, 1] range of false positive rates. Since optimal model performance maximizes true positives and minimizes false positives, the optimal point in this 1×1 plot is the top left, where the area under the curve (ROC AUC) = 1.
For imbalanced classes such as great, ROC AUC outperforms accuracy as a metric because it better reflects the relationship between true positives and false positives. It also depicts the classifier’s performance across all classification thresholds, giving us more information about when and where the model improves, plateaus, or suffers.
Fit Models
Predicting great books is a binary classification problem, so we need a classifier. Below, we’ll encode and impute the data, fit a linear model (Logistic Regression) and two tree-based models (Random Forest and XGBoost), and compare them to each other and to the majority-class baseline. We’ll calculate their accuracy and ROC AUC, and then visualize the results.
Majority Class Baseline
First, by construction, great books are the top 20% of books by Wilson-adjusted rating. That means our majority-class baseline (no books are great) has an accuracy of 80%.
Second, this "model" doesn’t improve, plateau, or suffer, since it has no discernment to begin with: a randomly chosen positive would be treated no differently than a randomly chosen negative. In other words, its ROC AUC = 0.5.
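A quick check of both numbers (a sketch, assuming the split defined above):

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Predict the majority class ("not great") for every book in the validation set.
majority = y_train.mode()[0]
y_pred = [majority] * len(y_val)

print("Baseline Validation Accuracy:", accuracy_score(y_val, y_pred))
print("Baseline Validation ROC AUC:", roc_auc_score(y_val, y_pred))
```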
Baseline Validation Accuracy: 0.8
Baseline Validation ROC AUC: 0.5
Logistic Regression
Now we’ll fit a linear model with cross-validation, re-calculate evaluation metrics, and plot a confusion matrix.
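A sketch of that pipeline, assuming one-hot encoding via category_encoders, median imputation, and scaling; the original preprocessing choices and hyperparameters aren’t shown here:

```python
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import ConfusionMatrixDisplay, roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

logreg = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),   # one-hot for the linear model
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegressionCV(cv=5, scoring="roc_auc", max_iter=1000, n_jobs=-1),
)
logreg.fit(X_train, y_train)

print("Logistic Regression Validation Accuracy:", logreg.score(X_val, y_val))
print("Logistic Regression Validation ROC AUC:",
      roc_auc_score(y_val, logreg.predict_proba(X_val)[:, 1]))

# Confusion matrix on the validation set (scikit-learn >= 1.0).
ConfusionMatrixDisplay.from_estimator(logreg, X_val, y_val)
```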
Baseline Validation Accuracy: 0.8
Logistic Regression Validation Accuracy: 0.8013
Baseline Validation ROC AUC: 0.5
Logistic Regression Validation ROC AUC: 0.6424
Logistic Regression Confusion Matrix

Random Forest Classifier
Now we’ll do the same as above with a tree-based model that uses bagging (bootstrap aggregating).
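A sketch, reusing the split above and swapping in an ordinal encoder, since trees don’t need one-hot columns; the hyperparameters are illustrative, not the original ones:

```python
import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline

forest = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42),
)
forest.fit(X_train, y_train)

print("Random Forest Validation Accuracy:", forest.score(X_val, y_val))
print("Random Forest Validation ROC AUC:",
      roc_auc_score(y_val, forest.predict_proba(X_val)[:, 1]))
```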
Baseline Validation Accuracy: 0.8
Logistic Regression Validation Accuracy: 0.8013
Random Forest Validation Accuracy: 0.8222
Majority Class Baseline ROC AUC: 0.5
Logistic Regression Validation ROC AUC: 0.6424
Random Forest Validation ROC AUC: 0.8015
Random Forest Confusion Matrix

XGBoost Classifier
Now we’ll do the same as above with another tree-based model, this time with boosting.
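A sketch along the same lines; again, the hyperparameters here are illustrative:

```python
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

xgb = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy="median"),
    XGBClassifier(n_estimators=200, learning_rate=0.1, n_jobs=-1, random_state=42),
)
xgb.fit(X_train, y_train)

print("XGBoost Validation Accuracy:", xgb.score(X_val, y_val))
print("XGBoost Validation ROC AUC:",
      roc_auc_score(y_val, xgb.predict_proba(X_val)[:, 1]))
```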
Baseline Validation Accuracy: 0.8
Logistic Regression Validation Accuracy: 0.8013
Random Forest Validation Accuracy: 0.8245
XGBoost Validation Accuracy: 0.8427
Majority Class Baseline ROC AUC: 0.5
Logistic Regression Validation ROC AUC: 0.6424
Random Forest Validation ROC AUC: 0.8011
XGBoost Validation ROC AUC: 0.84
XGBClassifier performs the best on both accuracy and ROC AUC.
Graph and Compare Models’ ROC AUC
Below, we see that Logistic Regression lags far behind XGBoost and Random Forest in ROC AUC. Between the top two, XGBoost initially outperforms Random Forest before the two roughly converge around FPR = 0.6. As the legend in the lower right shows, however, XGBoost has the highest AUC at 0.84, followed by Random Forest at 0.80 and Logistic Regression at 0.64.
In less technical language, the XGBoost model was the best at classifying great books as great (true positives) without classifying not-great books as great (false positives).
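The plot can be reproduced roughly like this, assuming the three fitted pipelines from the sketches above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

models = {"Logistic Regression": logreg, "Random Forest": forest, "XGBoost": xgb}

plt.figure(figsize=(8, 6))
for name, model in models.items():
    proba = model.predict_proba(X_val)[:, 1]
    fpr, tpr, _ = roc_curve(y_val, proba)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_val, proba):.2f})")

plt.plot([0, 1], [0, 1], linestyle="--", label="Majority Baseline (AUC = 0.50)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Validation ROC Curves")
plt.legend(loc="lower right")
plt.show()
```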

Permutation Importances
One intuitive way of identifying whether, and to what extent, something is important is to see what happens when you take it away. That approach works best when you are unconstrained by time and money.
But in the real world, with real constraints, we can use permutation instead. Rather than eliminating a column’s values by dropping them, we eliminate the column’s signal by randomizing it. If the column really were a predictive feature, the order of its values would matter, and shuffling them would substantially dilute, if not destroy, the relationship. So if the feature’s predictive power isn’t really hurt, or is even helped, by randomization, we can conclude that it is not actually important.
Let’s take a closer look at the permutation importances of our XGBoost model. We’ll have to refit it to be compatible with eli5.
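A sketch of that step, assuming the same ordinal encoding and median imputation as the tree models above, applied outside the pipeline so eli5 can permute the encoded columns directly (one column per raw feature):

```python
import category_encoders as ce
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.impute import SimpleImputer
from xgboost import XGBClassifier

# Encode and impute once, then refit a bare XGBClassifier on the matrices.
encoder = ce.OrdinalEncoder()
imputer = SimpleImputer(strategy="median")
X_train_enc = imputer.fit_transform(encoder.fit_transform(X_train))
X_val_enc = imputer.transform(encoder.transform(X_val))

model = XGBClassifier(n_estimators=200, learning_rate=0.1, n_jobs=-1, random_state=42)
model.fit(X_train_enc, y_train)

# Shuffle one column at a time and measure the drop in validation ROC AUC.
permuter = PermutationImportance(model, scoring="roc_auc", n_iter=5, random_state=42)
permuter.fit(X_val_enc, y_val)

eli5.show_weights(permuter, feature_names=list(X_train.columns))
```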
Permutation Importance Analysis

As we assumed at the beginning, review_count matters, but it is not suspiciously high, so it does not seem to rise to the level of data leakage. In practice, this means that if you’re wondering what book to read next, a useful indicator is how many reviews it has: a proxy for how many others have read it.
We see that genres is the second most important feature for ROC AUC in the XGBoost model.
author is third, which is surprising and perhaps a bit concerning. Because our test set is not big, the model may just be identifying authors whose books are the most highly rated in Wilson-adjusted terms, such as J.K. Rowling and Suzanne Collins. More data would be useful to test this theory.
Fourth is num_pages. I thought this would be higher for two reasons:
- Very long books seem to get a bit of an upward ratings bias, in that people willing to start and finish them tend to rate them higher. The length screens out less interested marginal readers, who probably wouldn’t have rated the book highly in the first place.
- Reading and showing off that you’re reading or have read long books is a sign of high social status. The archetypal example: Infinite Jest.
Takeaway
We’ve seen how to collect, clean, analyze, visualize, and model data. One actionable takeaway is that when a book is published and who publishes it don’t really matter, but its review count does: the more reviews, the better.
For further analysis, we could break down genres and authors to find out which ones were rated highest. For now, happy reading.