Data Analysis on OTT Platforms: Which Service Should I Choose?

An attempt to resolve the choice dilemma of OTT Platforms through data analysis

Rick Kim
Towards Data Science


The way we consume videos has undergone massive changes. Now we have multiple OTT platforms such as Netflix, Amazon Prime Video, and Disney+ to stream TV shows and movies online. With an overabundance of information and multiple criteria for comparing OTT platforms, it has become increasingly difficult for users to find the best fit for their tastes. Through this study, we investigated different OTT platform datasets to give users insight into each platform and help them decide which services to subscribe to. Amongst the many factors affecting online streaming subscriptions, we mainly analyzed age, genre, and genome tags using Spark and Hive.

We discovered many similarities between Netflix and Amazon Prime Video: they had similar distributions of tags and genres. The distinguishing factor between the two platforms was the age group. Netflix had TV-MA films the most, while Amazon Prime Video had TV-PG films the most. Disney+ was, unsurprisingly, best suited for animation. Each OTT platform had its own distinct characteristics.

The following content was written by three authors:
Chrissy Jeon
Jung Soo Ha
Yung Ju Rick Kim

Introduction

In 2019, the OTT market was valued at 85.16 billion USD, and it is expected to reach 194.20 billion USD by 2025. Under COVID-19, many countries introduced social distancing measures that forced theaters to limit audience sizes or even shut down and encouraged people to stay at home, accelerating the growth in OTT platform subscriptions. Therefore, we thought it was the right time to analyze different OTT platforms and provide useful information for people who cannot decide which platform fits them best.

This study presents an analysis of three major OTT platforms — Netflix, Amazon Prime, and Disney+. Along with movie datasets for each platform, we incorporated two additional datasets: the IMDb movie dataset to investigate the distribution of movie genres, age limits, and their average ratings, and the MovieLens dataset to evaluate the tag genomes to find the most popular tags that represent each platform.

As the study by Reddy et al. on recommendation systems based on content-based filtering suggests, movie metadata, specifically genre and rating, are key determinants in predicting what a user may want to watch in the future. However, we discovered that accounting for only those two factors was not detailed enough to represent the diversity of contents within OTT platforms. For instance, a recommendation system based only on genre and ratings fails to satisfy the needs of a user who prefers visually appealing content or content based on real-life stories. To dive deeper into researching properties of contents, we looked into a genome tag dataset made available by MovieLens, which measured how strongly a movie exhibited particular properties represented by tags using a machine learning algorithm on user-contributed content such as tag ratings and textual reviews. Adding the analysis of genome tags helped us understand what kind of content properties each platform has.

Architecture Design Diagram

All the datasets needed for the analysis were ingested into Hadoop HDFS. Each OTT platform’s data went through extensive data cleaning and feature extraction in Hive and Spark to create a unified final schema. We analyzed age, genre, and tag genomes in Spark using this common base schema. The output data were written back to HDFS.

In the next section, we detail our motivation for this study and introduce some related studies that help explain the status of research in the field. Then, we describe the datasets used for the study: content lists for Netflix, Amazon Prime, and Disney+; IMDb ratings and genre classification; MovieLens ratings and tags. We explain the process of cleaning and processing the data to make it suitable for analysis. Finally, we present the analysis done using the datasets mentioned above.

Motivation

In recent years, the advent of various OTT platforms has introduced a novel issue: the difficulty of choosing which OTT platform to subscribe to. Netflix, Amazon Prime, and Disney+ are some of the many OTT services that are well-known to the public, but the number of services is growing as localized OTT platforms like Watcha (South Korea) and Voot (India) join the field.

As these platforms come up with new ways to stand out among competitors by presenting original content, more customers are struggling to decide which platform would suit them. Moreover, most of the available recommendation systems focus on suggesting content, not the platforms that hold and provide it. To ease the choice dilemma, our study aims to present a guideline for choosing the OTT platform that fits one’s personal preferences.

Datasets

Broadly, three groups of datasets were used for the study: content lists for Netflix, Amazon Prime, and Disney+; IMDb ratings and genre classification; MovieLens ratings and genome tags. Content lists for each platform were used as the basis for analysis. However, since they were all gathered from different sources, they were not consistent in their time of last update or in their columns. Therefore, to ease the process of analysis and make the results more reliable, we had to clean the data into an identical form and merge it with other datasets that contained consistent information about age limit, rating, and genre.

Among them were the IMDb and MovieLens datasets. Although each platform’s data contained information about genres and IMDb ratings, these were labeled according to different categorizations and updated at different times, so we needed separate data for genres and ratings. On top of that, we used MovieLens data to strengthen our analysis. Since we had no pre-existing studies to confirm our findings, we decided to run a similar analysis on two different datasets. MovieLens had its own rating and tag data, which made it a suitable comparison to IMDb ratings and genres. We assumed that getting similar results from the two sources would strengthen the validity of our analysis.

Netflix

The Netflix Content dataset consists of both TV shows and movies that are available on Netflix as of 2019, with 7788 rows including the header and 12 columns. The columns include show_id, type, title, director, cast, country, date_added, release_year, rating, duration, listed_in, and description, but only type, title, release_year, rating, and duration were used.

The Netflix original movies dataset contains 525 rows including the header and has 48 columns, and we only used the title and release date for our study.

Amazon Prime Video

The Amazon Prime Video dataset contains a total of 8128 rows and seven columns. The columns include title, language, IMDb rating, running time, year of release, maturity rating, and plot. In the end, we only used four out of those seven columns.

To distinguish original movies from the dataset, we scraped a list of original movies from Wikipedia. This contained 52 rows and 3 columns; the columns were release date, title, and notes.

Disney+

The Disney+ dataset contains data on films and TV series, including columns that are irrelevant to our study. It has a total of 992 rows and 19 columns, including title, plot, type, director, and genre, but only imdb_id, title, type, released_at, and rated were used for our purpose.

We also used a list of original films scraped from Wikipedia. The data contained 56 rows, with title, genre, release date, and runtime as columns. However, most of these films were released after 2020 and could not be used in the study.

IMDb

From IMDb, we used title.basics.tsv.gz and title.ratings.tsv.gz datasets.

title.basics.tsv.gz contained columns like IMDb ID, title, start year, runtime and genre, etc., but we only used IMDb ID, title, and genre from the data. Titles acted as a key to merge platform data and IMDb. We kept the IMDb ID because, when used as a key, it made merging rating data to platform data easier.

title.ratings.tsv.gz was used, as its name suggests, for retrieving rating data. It contained IMDb ID, average rating, and number of votes, from which we used the first two columns.

MovieLens

MovieLens is a website run by GroupLens that independently collects rating and tag information from its users. Thankfully, link.csv contained information that links IMDb ID to MovieLens ID, which made it convenient for us to combine MovieLens data with our existing dataframe. Unlike IMDb data which already aggregated average rating and classified genres for each film, MovieLens data had individual ratings and tags that had to be aggregated by us.

In MovieLens, there are a total of 1128 tags, and each film has a score ranging from 0 to 1 for each tag. The closer the score is to 1, the more relevant the tag is to the movie. Since there are so many movie and tag pairs, we kept only pairs that scored over 0.8, a threshold chosen through trial and error.
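The relevance filter can be sketched in plain Python. This is only a minimal illustration, not the original Spark job: the function name `filter_relevant_tags`, the tuple layout, and the sample rows are assumptions made for the example, and only the 0.8 cutoff comes from the text.

```python
from collections import defaultdict

RELEVANCE_THRESHOLD = 0.8  # cutoff stated in the text, chosen by trial and error


def filter_relevant_tags(scores, threshold=RELEVANCE_THRESHOLD):
    """Keep (movie_id, tag) pairs whose relevance is strictly above the threshold.

    `scores` is an iterable of (movie_id, tag, relevance) tuples. This layout is
    an assumption for the sketch, not the MovieLens file format itself.
    """
    tags_by_movie = defaultdict(list)
    for movie_id, tag, relevance in scores:
        if relevance > threshold:
            tags_by_movie[movie_id].append(tag)
    return dict(tags_by_movie)


# Made-up sample rows for illustration only.
sample_scores = [
    (1, "animation", 0.97),
    (1, "dark comedy", 0.12),
    (2, "drama", 0.85),
    (2, "animation", 0.80),  # exactly 0.8 is not "over 0.8", so it is dropped
]

relevant = filter_relevant_tags(sample_scores)
print(relevant)
```

Treating the cutoff as strict ("over 0.8") is a reading of the sentence above; a different threshold can be passed in to repeat the trial-and-error step.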

Analysis

The experiment consisted mainly of two steps: preprocessing and analysis. Preprocessing was an important step for the project because all the data came from different sources. There were difficulties that had to be overcome before the analysis could begin, such as disambiguating movies with the same titles and combining different age rating systems.

After preprocessing the source datasets, we explored how different platforms focus on content targeted towards specific audiences differing in age and genre preferences.

Preprocessing

Table 1. Common Base Schema For Analysis

Due to differences in schemas of the base movie datasets for Netflix, Amazon Prime, and Disney+, we went through multiple stages of cleaning and transformed the data into a common base schema (Table 1).

First, we cleaned our source datasets (the movie dataset and the original movie dataset) using Hadoop to filter out irrelevant columns and change the column format. For instance, for Netflix, we had to convert the release date from a “dd-mm-yyyy” string to a “yyyy” integer and strip extra characters from the “duration” values. Then, in Spark, we created a dataframe for each platform that contains the merged datasets, with an “is_original” column to flag original movies. Lastly, the dataframe was merged with the IMDb datasets to create the common schema across platforms. The IMDb basics and ratings datasets were themselves cleaned by filtering out rows with null “runtime” values and casting “runtime” to an integer. For Amazon Prime, after cleaning, Hive was used to join the original, non-original, and IMDb datasets to create the base schema without genres. Spark was then used to add the genre column and finalize the common base schema for the analytics.
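As a rough illustration of the two Netflix cleaning steps mentioned above, here is a small plain-Python sketch. It is not the Hadoop job itself: the helper names `to_release_year` and `to_duration`, and the sample value "90 min", are assumptions made for the example.

```python
import re

# Matches a "dd-mm-yyyy" string and captures the four-digit year.
DATE_PATTERN = re.compile(r"\d{2}-\d{2}-(\d{4})")


def to_release_year(date_str):
    """Convert a "dd-mm-yyyy" string to an integer year, or None if malformed."""
    match = DATE_PATTERN.fullmatch(date_str.strip())
    return int(match.group(1)) if match else None


def to_duration(raw):
    """Return the first run of digits in a duration string as an int, else None."""
    match = re.search(r"\d+", raw)
    return int(match.group()) if match else None


print(to_release_year("25-12-2018"))
print(to_duration("90 min"))
```

Returning None for malformed values, rather than raising, keeps bad rows easy to filter out in a later step; that choice is also an assumption of this sketch.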

The dataframes with common base schema were used for analyzing age group and genre. For analysis on genome tags, we created another dataframe for each platform by merging the common base schema with MovieLens datasets using the “imdb_id” field as the key.
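The merge on “imdb_id” amounts to an inner join. The plain-Python sketch below shows the idea under stated assumptions: the function name `attach_tags`, the dictionary shapes, the placeholder IDs, and the sample titles are all hypothetical, and the real merge ran in Spark.

```python
def attach_tags(platform_rows, imdb_to_movielens, tags_by_movie):
    """Inner-join platform rows with MovieLens tags, using imdb_id as the key.

    platform_rows:      list of dicts that each carry an "imdb_id" field
    imdb_to_movielens:  dict mapping an IMDb ID to a MovieLens movie ID
    tags_by_movie:      dict mapping a MovieLens movie ID to a list of tags
    """
    merged = []
    for row in platform_rows:
        movie_id = imdb_to_movielens.get(row["imdb_id"])
        tags = tags_by_movie.get(movie_id)
        if tags is None:
            continue  # title has no MovieLens match or no tags: drop it
        merged.append({**row, "movielens_id": movie_id, "tags": tags})
    return merged


# Made-up rows with placeholder IDs, for illustration only.
platform_rows = [
    {"imdb_id": "tt0000001", "title": "Example A", "is_original": True},
    {"imdb_id": "tt0000002", "title": "Example B", "is_original": False},
]
imdb_to_movielens = {"tt0000001": 101}
tags_by_movie = {101: ["drama", "comedy"]}

merged = attach_tags(platform_rows, imdb_to_movielens, tags_by_movie)
print(merged)
```

Because the join is inner, titles that MovieLens does not cover drop out, which is why the tag analysis works with fewer films than the age and genre analyses.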

IMDb Age Group Analysis

First, we analyzed what age group each platform primarily targeted based on the IMDb age group data. There was difficulty in aggregating this information because different age-rating systems were used for different contents. So, excluding some categories like Approved and Passed, we converted movies labeled using the Motion Picture Association film rating system into corresponding categories in TV Parental Guidelines, which was more finely classified.
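The conversion between the two rating systems can be expressed as a lookup table. The sketch below is illustrative only: the exact mapping is not listed in this article, so the pairs in `MPA_TO_TV` are an assumed, commonly used correspondence, and `normalize_rating` is a hypothetical helper name.

```python
# Assumed correspondence between MPA film ratings and TV Parental Guidelines.
# The article does not list the exact pairs it used, so treat these as a guess.
MPA_TO_TV = {
    "G": "TV-G",
    "PG": "TV-PG",
    "PG-13": "TV-14",
    "R": "TV-MA",
    "NC-17": "TV-MA",
}

# Categories the text says were excluded from the analysis.
EXCLUDED = {"Approved", "Passed"}

TV_RATINGS = {"TV-Y", "TV-Y7", "TV-G", "TV-PG", "TV-14", "TV-MA"}


def normalize_rating(rating):
    """Map a raw age rating onto the TV Parental Guidelines, or None if unusable."""
    if rating is None:
        return None
    rating = rating.strip()
    if rating in EXCLUDED:
        return None
    if rating in TV_RATINGS:
        return rating  # already in the target system
    return MPA_TO_TV.get(rating)  # None for labels the table does not know


print(normalize_rating("PG-13"))
print(normalize_rating("Approved"))
```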

Figure 1. Age Group Count of OTT Platforms. Image By Author.
Figure 2. Age Group Average Ratings of OTT Platforms. Image By Author.

From Figure 1, we notice that Netflix has an overwhelming count of films that are rated TV-MA, intended to be viewed by adult audiences, and unsuitable for children under 17. It has almost no films rated as TV-G which is suitable for all ages. It has around 900 films that are suitable for teenagers of age 14 and above. Amazon Prime has a balanced distribution in terms of age rating. It has a roughly similar number of films categorized as All, TV-PG (parental guidance recommended), and TV-MA. Finally, although not noticeable because it has a small number of films overall, Disney+ has no content labeled TV-MA and a small number of films rated TV-14. All other films are rated either TV-PG or TV-G, which could be watched by children with or without parental guidance.

As for ratings (Figure 2), there is not much difference across age groups. Disney+ appears to record a slightly higher rating on average. Yet most ratings range from 6 to 7, and this narrow range suggests that consumers should focus more on how many titles each platform offers per age group than on ratings.
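The two quantities plotted in Figures 1 and 2, the count and the average rating per age group, come from a simple group-by. A minimal plain-Python sketch follows; the function name `summarize`, the field names, and the sample ratings are assumptions for the example, and the real aggregation was done in Spark.

```python
from collections import defaultdict


def summarize(rows, group_key):
    """Return {group: {"count": n, "avg_rating": mean}} for the given key."""
    totals = defaultdict(lambda: {"count": 0, "rating_sum": 0.0})
    for row in rows:
        bucket = totals[row[group_key]]
        bucket["count"] += 1
        bucket["rating_sum"] += row["avg_rating"]
    return {
        group: {
            "count": bucket["count"],
            "avg_rating": bucket["rating_sum"] / bucket["count"],
        }
        for group, bucket in totals.items()
    }


# Made-up rows for illustration only; these are not results from the study.
rows = [
    {"title": "Example A", "age_group": "TV-MA", "avg_rating": 6.0},
    {"title": "Example B", "age_group": "TV-MA", "avg_rating": 7.0},
    {"title": "Example C", "age_group": "TV-PG", "avg_rating": 6.5},
]

summary = summarize(rows, "age_group")
print(summary)
```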

IMDb Genre Analysis

Figure 3. Genre Count and Average Ratings of OTT Platforms. Image By Author.

Generally, Netflix and Amazon Prime had similar genre distributions; drama made up the biggest proportion, with comedy and action following. However, Netflix still had more content across almost all genres, with Amazon Prime notably outnumbering Netflix only in action and romance films. Also, Netflix has more than 500 documentaries while Amazon Prime has almost none.

Disney+ showed the most distinctive distribution. It had far fewer titles than the other two, which could be a strong deterrent. Still, the shape of the distribution is worth looking at because it signals what to expect in the future. Disney+ clearly seems to focus on family, comedy, adventure, and animation films. It also has some action, drama, documentary, and fantasy movies but almost no movies in other genres. People who enjoy various genres may want to avoid Disney+.

In terms of ratings per genre, the scores ranged mostly from 5 to 7. News had a high rating of over 7 on both Netflix and Disney+, but considering that there is almost no News content on either platform, this is negligible. Similarly, biographies and documentaries scored well on all platforms, but only Netflix had enough documentaries for the score to be taken seriously. It is interesting to note that unpopular genres have better ratings than popular ones. Drama, comedy, and action films on Netflix and Amazon Prime, despite being the most popular, have low ratings of around 6. On the other hand, Disney+ does a good job in the genres it focuses on: animation has a 6.8 average rating, adventure 6.6, fantasy 6.47, and family 6.43.
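One detail behind the genre counts is that a single title can belong to several genres, so each title has to be counted once per genre. The sketch below assumes the genre field is a comma-separated string with `\N` marking a missing value, which is our understanding of the IMDb TSV format; the function name `count_genres` and the sample values are made up for the example.

```python
from collections import Counter

MISSING = "\\N"  # assumed null marker in the IMDb TSV files


def count_genres(genre_fields):
    """Count titles per genre, counting a multi-genre title once for each genre."""
    counts = Counter()
    for field in genre_fields:
        if not field or field == MISSING:
            continue  # skip titles with no genre information
        counts.update(genre.strip() for genre in field.split(","))
    return counts


# Made-up genre fields for illustration only.
genre_fields = ["Comedy,Drama", "Drama", "\\N", "Action,Drama,Comedy"]

genre_counts = count_genres(genre_fields)
print(genre_counts.most_common())
```

Counting this way means the per-genre totals add up to more than the number of titles, which is worth keeping in mind when comparing genre counts with the catalog sizes.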

MovieLens Tag Genome Analysis

Figure 4. Tag Genome Analysis For Netflix Original Movies. Image By Author.
Figure 5. Tag Genome Analysis For Netflix Non-Original Movies. Image By Author.

Netflix

The analysis of the MovieLens dataset for Netflix (Figures 4 and 5) produced findings consistent with the IMDb genre analysis. Tags such as drama, comedy, and action were among the most frequently occurring tags for both original and non-original movies. We also found properties that set Netflix apart from other platforms, such as “visually-appealing” or “good soundtrack”. The highest-rated tags differed between original and non-original movies, with “conspiracy”, “olympics”, and “russia” as the top three for original movies and “berlin”, “east germany”, and “entirely dialogue” as the top three for non-original movies. In general, the highly rated tags and tag occurrences for both original and non-original movies provided additional insight into the kinds of content Netflix had.

Figure 6. Tag Genome Analysis For Amazon Original Movies. Image By Author.
Figure 7. Tag Genome Analysis For Amazon Non-Original Movies. Image By Author.

Amazon Prime

The Amazon Prime dataset contained a total of 52 original movies, but the MovieLens data covered only about six of them. Although limited by this small number, the analysis showed that Amazon Prime original movies have tags related to the genres “drama” and “comedy”. The highest average ratings showed a similar trend. The non-original Amazon Prime movies had tags related to “action” the most, followed by “comedy”. The highest-rated tags were related to themes that action movies could have, such as “freedom”, “compassionate”, and “scifi cult”, although some were unrelated to the trends we discovered throughout the research. Generally, the MovieLens analysis showed results similar to those of our genre analysis.

Figure 8. Tag Genome Analysis For Disney+. Image By Author.

Disney+

For Disney+, only non-original films were analyzed because only 3 original films were both released by 2019 and available in MovieLens. Tags that appeared the most for Disney+ films were “animation”, “disney animated feature”, and “family”. In terms of rating per tag, topics that are usually related to animation, like “superheros”, “toys”, and “pixar animation”, recorded the highest ratings. Although a number of films were missing from the MovieLens data, the findings were consistent with those from the IMDb data.

Conclusion

Age, genre, and tag genomes are important factors in choosing a subscription. Through our research, we discovered distinct characteristics of each OTT platform. From the age analysis, we found that Netflix had far more TV-MA films than the other platforms, while Amazon Prime had an almost even distribution across maturity ratings. Disney+ had no movies rated TV-MA and consisted almost entirely of TV-PG and TV-G titles. These results suggest which platform to subscribe to depending on the age group of films a user would like to see more of. From the genre analysis, we discovered that Netflix and Amazon Prime had similar distributions, with drama, comedy, and action the most common genres on both. Nonetheless, Netflix had the most diverse content across genres. Although Disney+ had much less content than the other two, it was the strongest in family, adventure, and animation films. The genome-tag analysis served as a cross-check on these findings: our discoveries in the MovieLens analysis were mostly in line with the results of the genre analysis. Netflix and Amazon had a similar trend of tags related to drama, comedy, and action, while Disney+’s tags were more focused on animated films. However, because the dataset for original films was small, partly since the data was limited to films released before 2019, we believe further analysis with more recent movies is needed to provide a more accurate picture.

All of our work was done on the NYU Cluster. We would like to thank the NYU High Performance Computing team for hosting and providing the platform for the entire process. We would also like to thank Kaggle users for publishing their datasets for us to use in this analytics project. It was surprisingly difficult to find OTT platform data that would suit our needs. We thank Professor Ann Malavet for guiding our research and Tableau for providing us the software for creating the visualizations.

Github

Please refer to the links below for the code:

Netflix

Amazon Prime

Disney+

References

Kim, E., & Kim, S. (2017). Online movie success in sequential markets: Determinants of video-on-demand film success in Korea. Telematics and Informatics, 34(7), 987–995. https://doi.org/10.1016/j.tele.2017.04.009

Kristoffersen, M. S., Shepstone, S. E., & Tan, Z. (2020). The importance of context when recommending TV content: Dataset and algorithms. IEEE Transactions on Multimedia, 22(6), 1531–1541. https://doi.org/10.1109/TMM.2019.2944214

Reddy, S., Nalluri, S., Kunisetti, S., Ashok, S., & Venkatesh, B. (2019). Content-based movie recommendation system using genre correlation. In S. Satapathy, V. Bhateja, & S. Das (Eds.), Smart Intelligent Computing and Applications (Smart Innovation, Systems and Technologies, vol. 105). Springer, Singapore. https://doi.org/10.1007/978-981-13-1927-3_42

Wayne, M. L. (2018). Netflix, Amazon, and branded television content in subscription video on-demand portals. Media, Culture & Society, 40(5), 725–741.


I am a full-time Computer Science and Interactive Media Senior at New York University Abu Dhabi