MyAnimeList user scores: Fun with web scraping and linear regression

Published in Towards Data Science

8 min read · May 5, 2020

The 3rd week of the Metis Data Science Bootcamp is behind us now, and so is the 2nd project of the bootcamp. Below I detail the project I did, which was to scrape data from MyAnimeList (MAL), and then use linear regression models to predict user scores based on the features of the anime.

TL;DR:

Scraped and cleaned 19 features from 11,541 anime entries
Transformed features through dummy variables; clipping outliers; applying log10 and sqrt; and multiplying/dividing features
Selected 11 features through statsmodels’ OLS; scikit-learn’s LASSO; metrics such as R² and MSE; and intuition and domain knowledge
LASSO edged out plain linear regression and Ridge in performance, but choice of linear regression model ultimately didn’t make a difference
The final model can predict whether an anime has high or middling scores, but has negative skew and a relatively wide spread
Different types of anime also had very different distributions for episodes and duration
Next steps would be to apply linear regression models separately to different anime types and to investigate more sophisticated models

I. Background

MAL states that it is “the world’s largest anime and manga database and community”. Most if not all of the information on the website is user-generated, from the scores and reviews given to an anime to the synopsis and list of characters in said anime.

The user-generated score is a good proxy for the quality and popularity of a given anime. Hence I wondered if I could use the data that MAL had on the anime to predict the kind of score it would receive.

This would be valuable information to have for an anime production company deciding on its next production, or a TV network deciding what anime to commission or license for broadcast.

II. Gathering the data

I used BeautifulSoup to scrape the data for 16,706 anime entries between 23 and 24 April 2020.
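As a rough illustration of the parsing step, here is a minimal sketch of pulling fields out of one page with BeautifulSoup. The HTML fragment, its tag and class names, and the parse_anime helper are all invented for this example; they are not MAL's actual markup or the code used in the project.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a downloaded anime page.
# The tag and class names are invented and are NOT MAL's real markup.
SAMPLE_HTML = """
<div class="anime">
  <h1 class="title">Example Anime</h1>
  <span class="score">7.85</span>
  <span class="episodes">12</span>
  <ul class="genres"><li>Action</li><li>Comedy</li></ul>
</div>
"""


def parse_anime(html):
    """Extract a flat dict of fields from one page of HTML."""
    soup = BeautifulSoup(html, "html.parser")
    score_text = soup.find("span", class_="score").get_text(strip=True)
    episodes_text = soup.find("span", class_="episodes").get_text(strip=True)
    return {
        "title": soup.find("h1", class_="title").get_text(strip=True),
        # A real scraper would need to handle entries without a score.
        "score": float(score_text),
        "episodes": int(episodes_text),
        "genres": [li.get_text(strip=True) for li in soup.select("ul.genres li")],
    }


record = parse_anime(SAMPLE_HTML)
print(record)
```

In practice each page would first be downloaded and the selectors adapted to whatever structure the site actually serves.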

Out of these entries, 5,165 were left out of the final testing and training dataset because they had no score (aka target value) attached to them. This left me with 11,541 entries, or about 69% of the original dataset. I then split off roughly 1/5th (2,279) of the remaining entries into a testing dataset, and kept the remaining 9,262 entries for training and validation.
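The filtering and hold-out split described above could be sketched as follows. The toy DataFrame, its column names and the random_state value are assumptions for illustration; only the idea of dropping unscored entries and holding out a fifth comes from the text.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the scraped data; column names are assumptions.
scores = [7.1, None, 6.4, 8.0, None, 5.5, 7.7, 6.9,
          None, 8.3, 7.0, None, 6.2, None, 7.5]
anime = pd.DataFrame({
    "title": [f"anime_{i}" for i in range(len(scores))],
    "score": scores,
})

# Keep only entries that have a target value.
scored = anime.dropna(subset=["score"])

# Hold out one fifth for testing; random_state is arbitrary.
train_val, test = train_test_split(scored, test_size=0.2, random_state=42)

print(len(anime), len(scored), len(train_val), len(test))
```

On the real data the same arithmetic is 16,706 - 5,165 = 11,541 scored entries, which were then divided into 9,262 for training and validation and 2,279 for testing.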

Below is the list of 19 features that were ultimately considered.

Features that were scraped but not used include the producers, licensors and studios, as well as broadcast timing in Japan. Features that were not scraped include the characters, voice actors and other staff involved in the anime. All these features can be included in any further investigation of MAL data.

III. Preparing the data

a) Null data

There were several instances of null data in the final dataset. So I chose to drop the following, leaving me with 9,144 entries:

150 shows that had not aired or that were currently airing. They had no end date listed and/or did not have an episode count
7 shows that had finished airing but did not have a start date

I also treated the remaining null data as follows:

Movies that did not have episode numbers listed were assumed to have 1 episode, as is the case with most movies
Null entries for the associated properties (related entries such as parent stories and prequels) were treated as empty lists
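A minimal sketch of these two fill rules, assuming a pandas DataFrame with hypothetical column names (type, episodes, prequel):

```python
import pandas as pd

# Toy data; the column names are assumptions for illustration.
anime = pd.DataFrame({
    "type": ["Movie", "TV", "Movie", "OVA"],
    "episodes": [None, 12, 1, 2],
    "prequel": [None, ["Show A"], None, ["Show B", "Show C"]],
})

# Movies with no episode count are assumed to have a single episode.
movie_missing = (anime["type"] == "Movie") & anime["episodes"].isna()
anime.loc[movie_missing, "episodes"] = 1

# Missing associated properties become empty lists.
anime["prequel"] = anime["prequel"].apply(
    lambda value: value if isinstance(value, list) else []
)

print(anime)
```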

b) Consolidating genres

There are 41 genres in the dataset, which potentially could mean up to 41 features; however, only 9 genres have more than 200 anime entries associated with them:

I thus consolidated all genres with fewer than 200 entries into an ‘Other’ category, resulting in a more manageable list of 10 genres:

c) Dummy variables for genres and type

I generated dummy variables for the 10 consolidated genres as well as the 6 anime types, of which TV and OVA listings take up the majority.
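The genre consolidation and dummy-variable steps might look something like the sketch below. The toy data, the threshold of 2 (the article used 200) and the use of MultiLabelBinarizer for the multi-valued genre field are my assumptions; the original code may well have done this differently.

```python
from collections import Counter

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

MIN_ENTRIES = 2  # toy threshold; the article used 200

anime = pd.DataFrame({
    "type": ["TV", "OVA", "Movie", "TV"],
    "genres": [["Action", "Comedy"], ["Action", "Yuri"],
               ["Comedy", "Cars", "Police"], ["Action"]],
})

# Count how many entries carry each genre tag.
genre_counts = Counter(g for genres in anime["genres"] for g in genres)
common = {g for g, n in genre_counts.items() if n >= MIN_ENTRIES}


def consolidate(genres):
    """Map rare genres to 'Other', dropping repeated 'Other' tags."""
    mapped = [g if g in common else "Other" for g in genres]
    return list(dict.fromkeys(mapped))


anime["genres"] = anime["genres"].apply(consolidate)

# One dummy column per anime type (single-valued field).
type_dummies = pd.get_dummies(anime["type"], prefix="type", dtype=int)

# One dummy column per genre (multi-valued field).
mlb = MultiLabelBinarizer()
genre_dummies = pd.DataFrame(
    mlb.fit_transform(anime["genres"]),
    columns=[f"genre_{g}" for g in mlb.classes_],
    index=anime.index,
)

features = pd.concat([type_dummies, genre_dummies], axis=1)
print(features)
```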

d) Count of genres and associated properties

Each anime entry can be tagged to multiple genres and associated properties. I thus generated the following:

A count of genres
A count of each associated property
An overall count of associated properties
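These counts are straightforward once each field is stored as a list. A small sketch, assuming hypothetical prequel and sequel columns as two examples of associated properties:

```python
import pandas as pd

# Toy data; 'prequel' and 'sequel' are assumed example properties.
anime = pd.DataFrame({
    "genres": [["Action", "Comedy"], ["Drama"], []],
    "prequel": [["Show A"], [], []],
    "sequel": [["Show B", "Show C"], [], ["Show D"]],
})
associated_cols = ["prequel", "sequel"]

# Count of genres per entry.
anime["genre_count"] = anime["genres"].apply(len)

# Count of each associated property, then the overall total.
for col in associated_cols:
    anime[f"{col}_count"] = anime[col].apply(len)
count_cols = [f"{col}_count" for col in associated_cols]
anime["associated_count"] = anime[count_cols].sum(axis=1)

print(anime[["genre_count"] + count_cols + ["associated_count"]])
```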

e) Numerical values for rating

The rating of anime ranged from suitable for all ages (G) to hentai aka pornographic. As such, I decided to assign a rank to each rating, using 0 for the most family-friendly rating and 5 for the least family-friendly.

A small proportion of anime have no rating; for those, I assumed that they are rated PG-13, which is the most common rating (44% of the training dataset).
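A sketch of this mapping, with missing ratings filled in first. The six short labels in RATING_RANK are my assumption about how the ratings were encoded; only the 0 to 5 ordering and the PG-13 default come from the text.

```python
import pandas as pd

# Assumed short labels, ordered from most to least family-friendly.
RATING_RANK = {"G": 0, "PG": 1, "PG-13": 2, "R": 3, "R+": 4, "Rx": 5}

ratings = pd.Series(["G", "PG-13", None, "Rx", "R"])

# Unrated entries default to PG-13 before being mapped to a rank.
rating_rank = ratings.fillna("PG-13").map(RATING_RANK)

print(rating_rank.tolist())
```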

f) Clipping episodes and duration

I found that there were extreme outliers for both episodes (some series had very long runs) and duration (some movies ran to almost 3 hours), as evident in the below graphs:

Note the peaks around <5 minutes (corresponding to very short-form anime) and ~25 minutes (corresponding to the length of a typical TV or OVA episode)

I therefore chose to clip episodes to 60 (slightly more than a year’s worth of anime) and duration to 150 minutes (the upper end of the length of a typical movie), as below.

Note the peaks at 1 episode (corresponding to movies and other one-shots) as well as at 13, 26 and 52 (corresponding to 1, 2 and 4 seasons of a TV anime)
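Clipping of this kind can be done with pandas' clip method. A minimal sketch on toy values, with assumed column names:

```python
import pandas as pd

anime = pd.DataFrame({
    "episodes": [1, 13, 26, 52, 500],
    "duration": [3, 24, 25, 110, 170],  # minutes per episode
})

# Values above the cap are replaced by the cap; others are unchanged.
anime["episodes_clipped"] = anime["episodes"].clip(upper=60)
anime["duration_clipped"] = anime["duration"].clip(upper=150)

print(anime)
```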

g) Year started and age of anime

From the Started field of the anime entry, I extracted the year the anime started, as well as calculated the age of the anime by year.
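A small sketch of this extraction. The ISO date strings and the reference year of 2020 (the year the data was scraped) are assumptions; the Started field on the site is formatted differently and would need its own parsing.

```python
import pandas as pd

REFERENCE_YEAR = 2020  # assumption: the year the data was scraped

# Toy start dates in ISO format (an assumed, already-cleaned format).
anime = pd.DataFrame({"started": ["1998-04-03", "2019-10-05", "2020-01-10"]})

started = pd.to_datetime(anime["started"])
anime["year_started"] = started.dt.year
anime["age"] = REFERENCE_YEAR - anime["year_started"]

print(anime)
```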

h) Other transformations

I applied log10 transformations on the following features, which showed a right-skewed, Poisson-like distribution:

Clipped episodes and clipped duration
Members and favorites (for these I also applied a second log10 transformation)
Age of anime
Genre and associated counts

Example of a feature that needed a log10 transformation

I also applied the following transformations:

Square root of the year started
Divided log favourites by log members (to obtain a ratio of favourites to members)
Multiplied clipped episodes with clipped duration to get clipped length; multiplied log episodes with log duration to get log length
Multiplied dummy variables for genres and associated properties with log length
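The transformations listed above can be sketched with NumPy and pandas as follows. Column names and values are invented. The sketch applies a single log10 only and uses strictly positive inputs; features that can be zero (such as counts or age) would need some offset before taking a log, and the article does not say how that was handled.

```python
import numpy as np
import pandas as pd

anime = pd.DataFrame({
    "episodes_clipped": [1, 10, 60],
    "duration_clipped": [100, 10, 1],
    "members": [1000, 10000, 100],
    "favorites": [10, 100, 100],
    "year_started": [1998, 2019, 2020],
    "genre_Action": [0, 1, 1],
})

# log10 of strictly positive features (zeros would need an offset first).
for col in ["episodes_clipped", "duration_clipped", "members", "favorites"]:
    name = "log_" + col.replace("_clipped", "")
    anime[name] = np.log10(anime[col])

# Square root of the year started.
anime["sqrt_year_started"] = np.sqrt(anime["year_started"])

# Ratio of log favorites to log members.
anime["log_fav_ratio"] = anime["log_favorites"] / anime["log_members"]

# Length features: product of episodes and duration, raw and logged.
anime["clipped_length"] = anime["episodes_clipped"] * anime["duration_clipped"]
anime["log_length"] = anime["log_episodes"] * anime["log_duration"]

# Interaction between a genre dummy and log length.
anime["genre_Action_x_log_length"] = anime["genre_Action"] * anime["log_length"]

print(anime.round(3))
```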

IV. Selecting the final features

In total, 86 features were considered; these included members and favorites as well as the following 84 transformed features.

To cut down on the number of features, I used the following:

statsmodels’ OLS, using the t-statistics of the parameters (and their p-values) to determine their significance (chosen cut-off point was 5% significance)
scikit-learn’s LASSO, to see which parameters are dropped (coefficients shrunk to zero) at lambda = 0.01, which scikit-learn exposes as the alpha parameter
scikit-learn’s LinearRegression, looking at R², mean squared error, mean absolute error and maximum error
Intuition and domain knowledge
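To illustrate just the LASSO part of this selection, the sketch below fits scikit-learn's Lasso with alpha=0.01 on synthetic data and lists which coefficients are shrunk to exactly zero. The data, the feature names and the standardisation step are my assumptions and are not taken from the project.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 500
feature_names = ["f0", "f1", "f2", "f3", "f4"]  # hypothetical names

X = rng.normal(size=(n_samples, len(feature_names)))
# Only the first two columns drive the target; the rest are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.02, size=n_samples)

# Put features on a common scale so the penalty treats them equally.
X_scaled = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.01)
lasso.fit(X_scaled, y)

kept = [name for name, coef in zip(feature_names, lasso.coef_) if coef != 0]
dropped = [name for name, coef in zip(feature_names, lasso.coef_) if coef == 0]

print("kept:", kept)
print("dropped:", dropped)
```

Features whose coefficients survive the penalty are candidates to keep; the OLS significance tests and domain knowledge would then be applied on top of this.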

This left me with 11 features as follows.

V. Fitting and selecting the model

For the purposes of this project I considered the following linear regression models from scikit-learn:

Linear Regression
Ridge
LASSO
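A comparison of these three models under k-fold cross-validation could be set up roughly as below. The synthetic data and the regularisation strengths are placeholders, and the metrics shown (R², MSE, MAE) are only a subset of those mentioned in this article.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.5, -0.5, 0.0, 2.0]) + rng.normal(scale=0.5, size=300)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),     # placeholder strength
    "LASSO": Lasso(alpha=0.01),    # placeholder strength
}
scoring = {
    "r2": "r2",
    "mse": "neg_mean_squared_error",
    "mae": "neg_mean_absolute_error",
}

results = {}
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=5, scoring=scoring)
    results[name] = {
        "r2": cv["test_r2"].mean(),
        # scikit-learn reports errors as negated scores, so flip the sign.
        "mse": -cv["test_mse"].mean(),
        "mae": -cv["test_mae"].mean(),
    }

for name, metrics in results.items():
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```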

These three models were trained with 5-fold cross-validation on the training dataset. The results were very similar, as can be seen in the below table:

I went with the LASSO model as it was marginally better than the other two models (specifically with regard to maximum error), but any of the three models would have worked fine. Below is a graphical representation of the training results for LASSO:

VI. Observations

Applying the final model to the test set gave the following results:

This is marginally better than the performance on the training/validation set. The final model also appears robust enough to indicate whether an anime may receive a high or middling score.

Examining the adjusted weights of the parameters as seen in the table below, I find the following:

The proportion of favorites to members was the most important predictor for user-generated anime scores. This makes sense, as a user who puts an anime on their favorites list is very likely to give said anime a high score.
Parent story and prequel counts are also important predictors. This could be chalked up to an anime belonging to a known franchise and thus becoming a known quantity among its fans.

This also indicates the risks of relying on user-generated scores; we already see that higher scores are more prevalent than lower scores in the dataset as a whole, so fan favourites may be predisposed to getting higher scores, regardless of the size of the audience.

The anime length is also an important predictor of user-generated scores, which makes sense, as long-running TV series usually run long because of popular demand and the resulting financial reward.

Adjusted weight = Absolute value of model weight divided by the value range
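The adjusted weight defined in the caption could be computed as in this sketch. It takes "value range" to mean the maximum minus the minimum of each feature, which is my reading of the caption rather than something the article states; the helper name and the toy numbers are invented.

```python
import numpy as np


def adjusted_weights(weights, X):
    """Absolute model weight divided by each feature's value range."""
    # Assumption: 'value range' means max minus min of each column.
    value_range = X.max(axis=0) - X.min(axis=0)
    return np.abs(weights) / value_range


# Toy feature matrix with column ranges of 2 and 40.
X = np.array([[0.0, 10.0], [1.0, 30.0], [2.0, 50.0]])
weights = np.array([-0.5, 2.0])

adj = adjusted_weights(weights, X)
print(adj)
```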

However, I noted the following limitations:

The model has a negative skew; it tends to predict too low for anime with scores > 8, and too high for anime with scores < 5
The model has a relatively high spread, especially for scores < 5. So precise score predictions are a little beyond it
The distributions for episodes and duration vary wildly between different anime types. For example, movies mostly have one episode with a duration ranging from 1 to 3 hours; TV series tend to have at least 13 episodes of ~24 minutes each.

Top row indicates duration, bottom row indicates episodes. Note how widely the distribution varies between anime types.
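One simple way to check for the bias and spread noted in the limitations is to summarise the residuals (predicted minus actual) by score band. The numbers below are invented purely to show the mechanics and are not results from the project.

```python
import numpy as np
import pandas as pd

# Toy actual and predicted scores, invented for illustration.
y_true = np.array([3.0, 4.5, 6.0, 7.0, 8.5, 9.0])
y_pred = np.array([4.0, 5.0, 6.1, 6.9, 8.0, 8.2])

errors = pd.DataFrame({
    # Bands are (0, 5], (5, 8] and (8, 10] on the actual score.
    "band": pd.cut(y_true, bins=[0, 5, 8, 10], labels=["low", "mid", "high"]),
    "residual": y_pred - y_true,
})

# A positive mean is over-prediction, a negative mean under-prediction;
# the standard deviation gives a sense of the spread within each band.
summary = errors.groupby("band", observed=True)["residual"].agg(["mean", "std"])
print(summary)
```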

VII. Further work and learning points

Evidently, a more sophisticated model is needed to accurately predict the user-generated scores of a MAL anime entry. Possible approaches include:

Applying linear regression or other models separately on each anime type (be it TV or movie). A decision tree algorithm might work wonders here.
Incorporating features that were left out in the current analysis, such as producers, studios and staff members.
Incorporating data from other anime sites such as AniDB and Anime News Network.
Applying more sophisticated models such as neural networks.
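As a sketch of the first idea, fitting a separate linear regression per anime type could look like this. The synthetic data deliberately gives each type a different slope; the single log_members feature and all the numbers are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n_samples = 200

anime = pd.DataFrame({
    "type": rng.choice(["TV", "Movie"], size=n_samples),
    "log_members": rng.normal(4.0, 1.0, size=n_samples),
})
# Synthetic target with a different slope per type, purely illustrative.
slope = anime["type"].map({"TV": 0.8, "Movie": 0.3})
noise = rng.normal(0.0, 0.1, size=n_samples)
anime["score"] = 4.0 + slope * anime["log_members"] + noise

# Fit one model per anime type.
models = {}
for anime_type, group in anime.groupby("type"):
    model = LinearRegression()
    model.fit(group[["log_members"]], group["score"])
    models[anime_type] = model

for anime_type, model in models.items():
    print(anime_type, round(model.coef_[0], 3), round(model.intercept_, 3))
```

Each fitted model recovers its own slope, which a single pooled regression would average away.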

I learned a lot about the workflow of a typical data analysis, and I’m excited to see how I can apply what I have learned to the next project.