Exploratory Data Analysis With Movies

An investigation into the metrics that make blockbuster and award-winning films

Published in Towards Data Science · 8 min read · Sep 6, 2020

As part of the Flatiron School bootcamp, we are required to complete a project at the end of each learning module that demonstrates our ability to apply what we’ve learned.

The prompt for the first project is as follows:

Microsoft wants to enter the movie industry. However, they have no prior knowledge of the industry and need help so that their movie studio can be successful.

The primary skills required to perform the exploratory data analysis (EDA) of the movie industry were webscraping, storing and cleaning the data in a pandas dataframe, and visualizing the data using seaborn and matplotlib. I’ll describe some of the methodology I used for webscraping and cleaning, and I’ll go through some of the recommendations we made in order to be successful as a movie studio.

Webscraping

I was unfamiliar with webscraping prior to the bootcamp, but I can say without a doubt it has been one of the most useful and fun skills that I have learned in the past few weeks. Webscraping is essentially the process of looking at the HTML for a webpage and deconstructing that HTML so that you can extract pertinent information for analysis. By using the requests and Beautiful Soup libraries, we can easily get all of the HTML into a Jupyter notebook and start picking apart the pieces. Some of the websites we used to develop recommendations were moviefone.com, imdb.com, and boxofficemojo.com. For example, one moviefone.com page listed release dates for movies released in 2019, so I ended up writing code like this:

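import requests
from bs4 import BeautifulSoup
movie_dates_page = requests.get("https://www.moviefone.com/movies/2019/?page=1")
soup = BeautifulSoup(movie_dates_page.content, 'lxml')
# Each movie title is an <a> tag with the class "hub-movie-title"
movie_title = soup.find_all("a", class_="hub-movie-title")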

Then I simply use the .text attribute of each of the elements in the movie_title variable to collect the movie titles on that webpage into a list. I use a similar approach to get all of the release dates into a list. The two lists can then be put into a dataframe, and the dates column can be manipulated using the datetime library so that we can count the number of movies released in a certain month or on a certain day. The construction of the dataframe would look something like this:

# movie_list and dates_list are previously constructed lists from webscraping
movie_dict = {'movies': movie_list, 'release_date': dates_list}
dates_df = pd.DataFrame(data=movie_dict)

For this particular project, it was easiest to decide which elements of the webpage would be most useful for EDA and then define a function to scrape those elements and construct the dataframe. A good rule of thumb when webscraping is to use a sleep timer in between the scraping of each page. Making repeated calls to a webpage can run the risk of being banned from a website because those repeated calls can cause lots of traffic.
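The sleep timer mentioned above could look something like the sketch below. The scrape_pages helper, its fetch parameter, and the delay values are hypothetical names chosen for illustration, not code from the project; fetch stands in for something like a call to requests.get, and the stand-in used here lets the sketch run without touching the network.

```python
import time


def scrape_pages(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests.

    `fetch` is any callable that takes a URL and returns the page content.
    """
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # pause before every request after the first
        pages.append(fetch(url))
    return pages


# Hypothetical usage with a stand-in fetcher so the sketch runs offline.
urls = [f"https://www.moviefone.com/movies/2019/?page={n}" for n in range(1, 4)]
pages = scrape_pages(urls, fetch=lambda url: f"<html>{url}</html>", delay_seconds=0.01)
```

A delay of a second or more per page is probably a more considerate choice for a real site than the short value used in this demonstration.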

Data Cleaning

After scraping various data from different webpages and compiling the data into a dataframe, the next step was to clean the data. Fortunately, many of the websites structured their movie data in a way that made cleaning relatively simple. String methods like .replace() were used to remove commas and dollar signs from budgets and profits so that the pandas .astype() method could convert each value from a string to an integer.
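As a rough illustration of that cleaning step, here is a minimal sketch. The dataframe name, column names, and dollar figures are made up for the example and are not the scraped data.

```python
import pandas as pd

# Hypothetical values in the style of scraped budget and profit figures.
money_df = pd.DataFrame({
    "budget": ["$82,500,000", "$1,200,000"],
    "profit": ["$150,000,000", "$350,000"],
})

for col in ["budget", "profit"]:
    # Strip dollar signs and commas, then convert the strings to integers.
    money_df[col] = (
        money_df[col]
        .str.replace("$", "", regex=False)
        .str.replace(",", "", regex=False)
        .astype("int64")
    )
```

Passing regex=False treats the dollar sign as a literal character rather than as a regular expression anchor.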

Using the example described above for the movie dates dataframe, the creation of new columns using the datetime library would look like this:

import datetime as dt
import pandas as pd

# Parse strings in "Month day, Year" form into datetime values
dates_df['release_date'] = pd.to_datetime(dates_df['release_date'], format='%B %d, %Y')
dates_df['release_month'] = dates_df['release_date'].map(lambda z: z.strftime('%B'))
dates_df['release_day'] = dates_df['release_date'].map(lambda z: z.strftime('%A'))
dates_df['release_year'] = dates_df['release_date'].map(lambda z: z.strftime('%Y'))
dates_df['release_year'] = dates_df['release_year'].astype(int)
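Once the month and day columns exist, counting releases per month or per weekday is a one-liner. The sketch below uses a few made-up titles and dates rather than the scraped data, and it uses the pandas .dt accessor as an alternative to the strftime calls above.

```python
import pandas as pd

# Hypothetical release dates standing in for the scraped data.
dates_df = pd.DataFrame({
    "movies": ["Movie A", "Movie B", "Movie C"],
    "release_date": ["January 04, 2019", "January 18, 2019", "March 08, 2019"],
})
dates_df["release_date"] = pd.to_datetime(dates_df["release_date"], format="%B %d, %Y")
dates_df["release_month"] = dates_df["release_date"].dt.month_name()
dates_df["release_day"] = dates_df["release_date"].dt.day_name()

# Number of releases per month and per weekday.
releases_per_month = dates_df["release_month"].value_counts()
releases_per_day = dates_df["release_day"].value_counts()
```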

The most difficult cleaning came from scraping a table on a Wikipedia page containing data about movies, their Oscar nominations, and the awards they went on to win. Although the numbers of nominations and awards were listed in their own separate columns, there were instances where a specific entry had a footnote that Beautiful Soup treated as text. There were only 11 movies where a footnote occurred, so it wasn’t a huge burden to correct them manually in the dataframe. However, it is worth noting that you should keep an eye out for messy data so that you can develop an appropriate method to clean it. Had there been hundreds or thousands of rows, a more robust solution would have been required so that you wouldn’t be manually cleaning data line by line.
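One possible shape for that more robust solution is a regular expression that strips the footnote markers before converting to an integer. This is a sketch under the assumption that the footnotes show up as bracketed markers such as [a] or [3] attached to the number; the parse_count helper and the sample cells are hypothetical.

```python
import re

FOOTNOTE = re.compile(r"\[[^\]]*\]")  # bracketed markers such as [a] or [12]


def parse_count(cell_text):
    """Strip footnote markers from a table cell and return the integer left over."""
    cleaned = FOOTNOTE.sub("", cell_text).strip()
    return int(cleaned)


# Hypothetical cell values, not taken from the actual table.
raw_cells = ["11", "4[a]", " 7 [3]"]
counts = [parse_count(cell) for cell in raw_cells]
```

If a cell contains anything other than a number once the markers are removed, int() raises a ValueError, which flags the row for a closer look instead of silently passing bad data through.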

Recommendations for a Successful Movie Studio

There were several questions we decided to tackle for this project and I’ll leave a link to my GitHub repo below for those who would like to see the entirety of the project. I’ll go through two of the questions/recommendations for this blog.

Question 1: How much should you spend to make a successful movie?

In order to answer this question we chose to only look at movie data that had a profit greater than zero. All budgets, revenues, and profits were adjusted for inflation by using an average inflation rate of 3.22%. Using seaborn, I created a scatter plot to see if I could identify any trends.
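For reference, the inflation adjustment mentioned above could be sketched as follows. This assumes the 3.22% average rate is compounded once per year up to a target year; the adjust_for_inflation helper and the 2020 target year are assumptions for illustration, not details from the project.

```python
AVERAGE_INFLATION = 0.0322  # average annual rate quoted above


def adjust_for_inflation(amount, release_year, target_year=2020, rate=AVERAGE_INFLATION):
    """Compound `amount` forward from release_year to target_year."""
    years = target_year - release_year
    return amount * (1 + rate) ** years


# Hypothetical example: a $100 million budget from 2010 expressed in 2020 dollars.
adjusted = adjust_for_inflation(100_000_000, release_year=2010)
```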

We can see from the plot above that the trend line is positive, leading us to believe that if we spend more money, then we can make more money. However, this plot alone is not enough to make a determination. The plot below shows profitable movie budgets versus their profit margin. Here the scatter plot shows a negative trend line, which cautions against spending too much money, as you run the risk of reducing the profit margin.

Adjusted Budget vs. Profit Margin for Profitable Movies (Image by Author)

So how did we decide upon an appropriate movie budget? We decided to look at the profit margins of the top 25 most profitable movies ever made and use the median profit margin as a target for success. We chose the median because there are extreme outliers that would make the mean less reliable as a measure of central tendency (Titanic, Avatar, and Avengers: Endgame would be unrealistic goals for a company making its first foray into the movie industry).
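To see why the median holds up better than the mean when one value sits far from the rest, here is a small sketch with made-up numbers. It assumes profit margin is defined as profit divided by revenue, which is my reading rather than a formula stated above.

```python
from statistics import mean, median

# Hypothetical (profit, revenue) pairs in millions of dollars.
movies = [(900, 1000), (850, 1000), (800, 1000), (450, 500), (60, 200)]

# Assumption: profit margin is profit divided by revenue.
margins = [profit / revenue for profit, revenue in movies]

median_margin = median(margins)  # middle value, barely moved by the outlier
mean_margin = mean(margins)      # pulled toward the one unusually low margin
```

With these numbers, the single low margin drags the mean well below the median, while the median stays close to the typical movie in the list.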

We found that the median profit margin was 0.84, and we recommended spending $82,500,000 on a movie, as that budget corresponded to a profit margin of around 0.8. A budget of $82,500,000 is significantly smaller than the budgets of the top 25 most profitable movies, which were around $200 million. Therefore, we determined it was possible to spend significantly less while making a movie with a profit margin that could compete with some of the most successful movies ever made.

Question 2: Which actors and directors bring the most value to a movie?

If we know how much money we should spend on a movie, it stands to reason we should also know who to hire to act in and direct that movie so that we can maximize profits. To determine who brought the most value to a movie, we created our own statistic called Value Above Replacement (VAR). For fans of baseball, this is our own watered-down version of the WAR statistic. The math behind VAR is simple: if the average net profit across all movies is 100 dollars and the average net profit of movies from ‘Actor: X’ is 200 dollars, that actor would have a VAR of 2. In other words, VAR expresses a person’s average net profit as a multiple of the overall average. We used a minimum cutoff of 10 movies for actors and 5 movies for directors.

The movie data used to calculate VAR came from imdb.com and the code we used to calculate VAR is below:

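# Keep only actors who appear in at least 10 movies
actor_counts = actors_df['value'].value_counts()
actor_list = actor_counts[actor_counts >= 10].index.tolist()
actors_df = actors_df[actors_df['value'].isin(actor_list)]

# Average net profit per actor, sorted from highest to lowest
actor_total = actors_df.groupby(['value'], as_index=False)['Net Profit'].mean().sort_values(by='Net Profit', ascending=False)

# VAR: each actor's average net profit relative to the mean of those averages
actor_total['VAR'] = actor_total['Net Profit'] / actor_total['Net Profit'].mean()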
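To make the calculation concrete, here is the same pipeline run on a tiny made-up dataset. The actor names and profit figures are hypothetical, and the cutoff is lowered to 2 movies so the toy data has something left after filtering.

```python
import pandas as pd

# Hypothetical data: one row per (actor, movie) pair, net profit in dollars.
actors_df = pd.DataFrame({
    "value": ["Actor X", "Actor X", "Actor Y", "Actor Y", "Actor Z"],
    "Net Profit": [250, 150, 60, 40, 500],
})

MIN_MOVIES = 2  # the project used 10 for actors; lowered here for the toy data

actor_counts = actors_df["value"].value_counts()
eligible = actor_counts[actor_counts >= MIN_MOVIES].index
filtered = actors_df[actors_df["value"].isin(eligible)]

actor_total = (
    filtered.groupby("value", as_index=False)["Net Profit"]
    .mean()
    .sort_values(by="Net Profit", ascending=False)
)
# Baseline here is the mean of the per-actor averages, as in the code above.
actor_total["VAR"] = actor_total["Net Profit"] / actor_total["Net Profit"].mean()
```

Actor Z is dropped for having only one movie, and the two remaining actors are scored against the mean of their own averages, so their VARs land on either side of 1.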

We see that directors tend to have higher VARs than actors, which certainly helps in determining how you should budget for your personnel.

We also explored other topics:

How much should you spend to make an Oscar-winning movie?
What time of the year should you release a movie?
Which genres are the most profitable?
Which studios should we look to emulate for best practices?

Next Steps

Given that we only used exploratory data analysis for this project, there are many more steps we could take to try to make more accurate recommendations. The amount of data we collected could certainly allow us to create some linear regression models to predict profits based upon one or more inputs. We could use simple linear regression to predict profit based on budget, or use multiple linear regression and choose several inputs such as budget, release month, actors, directors, and awards won. Variables like release month, actors, and directors would be categorical variables, while budget would serve as a continuous variable. Judging by the scatter plots above, we would first need to make sure our data satisfies the assumptions of linear regression before attempting to build a model.

It would also be beneficial to research and collect data about streaming services to see if there are greater returns from streaming versus the traditional box office. Box office data post-pandemic will certainly prove valuable as movie studios make decisions to recover or adjust as they attempt to insulate themselves from future economic downturns.
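As a taste of what the simple linear regression step could look like, here is a minimal ordinary least squares sketch in plain Python. The fit_simple_linear_regression helper and the budget and profit figures are hypothetical; a real model would be fit on the scraped data, most likely with a dedicated library, and only after checking the regression assumptions.

```python
def fit_simple_linear_regression(x, y):
    """Return (slope, intercept) of the least squares line y = slope * x + intercept."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical budgets and profits in millions of dollars.
budgets = [10, 20, 30, 40]
profits = [25, 45, 65, 85]

slope, intercept = fit_simple_linear_regression(budgets, profits)
predicted_profit = slope * 50 + intercept  # predicted profit for a $50 million budget
```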

This was a great project for understanding the beginning of the data science process and practicing various coding skills. I’ll definitely look to revisit some of this data once I’ve gained more data science skills so that I can refine and improve upon the previous recommendations!

GitHub: https://github.com/jeremy-lee93/dsc-mod-1-project-v2-1-onl01-dtsc-pt-052620

Youtube Presentation: https://youtu.be/C9YgIYwHaIQ