Box Office Revenue Analysis and Visualization

Day 4 and 5 of 100 Days of Data Science

Photo by Krists Luhaers on Unsplash
Welcome back to my 100 Days of Data Science Challenge journey. On days 4 and 5, I worked on the TMDB Box Office Prediction dataset available on Kaggle.

I’ll start by importing the libraries we need for this task.

import pandas as pd

# for visualizations
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('dark_background')

Data Loading and Exploration

Once you have downloaded the data from Kaggle, you will have three files. As this is a prediction competition, you get train, test, and sample_submission files. For this project, my goal is only to perform data analysis and visualization, so I am going to ignore the test.csv and sample_submission.csv files.

Let’s load train.csv into a data frame using pandas.

%time train = pd.read_csv('./data/tmdb-box-office-prediction/train.csv')
# output
CPU times: user 258 ms, sys: 132 ms, total: 389 ms
Wall time: 403 ms

About the dataset:

id: Unique integer id of each movie.
belongs_to_collection: Contains the TMDB id, name, movie poster, and backdrop URL of a movie in JSON format.
budget: Budget of a movie in dollars. Some rows contain 0, which means the budget is unknown.
genres: Contains the name and TMDB id of every genre in JSON format.
homepage: Contains the official URL of a movie.
imdb_id: IMDB id of a movie (string).
original_language: Two-letter code of the original language in which the movie was made.
original_title: The original title of a movie in original_language.
overview: Brief description of the movie.
popularity: Popularity of the movie.
poster_path: Poster path of a movie. You can see the full poster image by appending this path to the following link →
production_companies: Name and TMDB id of every production company of a movie, in JSON format.
production_countries: Two-letter code and full name of every production country, in JSON format.
release_date: The release date of a movie in mm/dd/yy format.
runtime: Total runtime of a movie in minutes (Integer).
spoken_languages: Two-letter code and full name of every spoken language.
status: Whether the movie is released or rumored.
tagline: Tagline of a movie.
title: English title of a movie.
Keywords: TMDB id and name of all the keywords in JSON format.
cast: TMDB id, name, character name, and gender (1 = Female, 2 = Male) of every cast member, in JSON format.
crew: Name, TMDB id, and profile path of crew members in various jobs, such as Director, Writer, Art, and Sound.
revenue: Total revenue earned by a movie in dollars.

Let’s have a look at the sample data.


As we can see, some features hold dictionaries, so I am dropping all such columns for now, along with the free-text tagline, overview, and homepage columns.

train = train.drop(['belongs_to_collection', 'genres', 'crew',
'cast', 'Keywords', 'spoken_languages', 'production_companies', 'production_countries', 'tagline','overview','homepage'], axis=1)
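
If you would rather keep these columns than drop them, each cell can be parsed into Python objects first. The following is a minimal sketch on a toy frame, assuming each cell is the string representation of a list of dictionaries with a name key. The sample values and the extract_names helper are illustrative and not taken from the dataset.

```python
import ast

import pandas as pd

# Toy stand-in for a column such as `genres` (values are made up for illustration).
toy = pd.DataFrame({
    'genres': [
        "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]",
        None,  # missing entries are assumed to be possible
    ]
})

def extract_names(cell):
    """Return the `name` field of every dictionary in a stringified list."""
    if not isinstance(cell, str):
        return []  # treat missing or non-string cells as "no entries"
    return [item['name'] for item in ast.literal_eval(cell)]

toy['genre_names'] = toy['genres'].apply(extract_names)
print(toy['genre_names'].tolist())
```

ast.literal_eval is used here instead of json.loads on the assumption that the cells may use single quotes, which strict JSON parsing would reject.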

Now it is time to have a look at the statistics of the data.

print("Shape of data is ")
train.shape
# Output
Shape of data is
(3000, 12)

Dataframe information.

train.info()
# Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 3000 non-null   int64  
 1   budget             3000 non-null   int64  
 2   imdb_id            3000 non-null   object 
 3   original_language  3000 non-null   object 
 4   original_title     3000 non-null   object 
 5   popularity         3000 non-null   float64
 6   poster_path        2999 non-null   object 
 7   release_date       3000 non-null   object 
 8   runtime            2998 non-null   float64
 9   status             3000 non-null   object 
 10  title              3000 non-null   object 
 11  revenue            3000 non-null   int64  
dtypes: float64(2), int64(3), object(7)
memory usage: 281.4+ KB
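
The info output shows that poster_path and runtime have fewer than 3000 non-null entries. A quick way to count missing values per column is sketched below on a toy frame; the values are made up for illustration, and only the column names come from the dataset.

```python
import pandas as pd

# Toy frame with a few deliberately missing values.
toy = pd.DataFrame({
    'runtime': [93.0, float('nan'), 118.0, float('nan')],
    'poster_path': ['/a.jpg', '/b.jpg', None, '/d.jpg'],
    'revenue': [100, 200, 300, 400],
})

# isnull() marks missing cells; summing the booleans counts them per column.
missing = toy.isnull().sum()
print(missing[missing > 0])
```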

Describe dataframe.

train.describe()


Let’s create new columns for the release day, weekday, month, and year.

train['release_date'] = pd.to_datetime(train['release_date'], infer_datetime_format=True)
train['release_day'] = train['release_date'].apply(lambda t: t.day)
train['release_weekday'] = train['release_date'].apply(lambda t: t.weekday())  # Monday = 0
train['release_month'] = train['release_date'].apply(lambda t: t.month)

# Two-digit years can be parsed into the wrong century (e.g. 2021 instead of 1921),
# so any year from 2018 onward is shifted back by 100 years.
train['release_year'] = train['release_date'].apply(lambda t: t.year if t.year < 2018 else t.year - 100)
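
As an alternative, here is a sketch that corrects the century on the date itself and then derives every feature with the vectorized .dt accessor. It runs on made-up dates, and the explicit format string is an assumption based on the mm/dd/yy description of release_date. One reason to correct the date first is that a date shifted by 100 years falls on a different weekday, so a weekday taken from an uncorrected date may be off for older films.

```python
import pandas as pd

# Toy dates in mm/dd/yy form (made up for illustration).
raw = pd.Series(['02/20/15', '08/06/65', '05/13/21'])
dates = pd.to_datetime(raw, format='%m/%d/%y')

# With %y, two-digit years below 69 are read as 20xx, so shift anything
# from 2018 onward back by one century before deriving features.
cutoff_year = 2018  # same threshold as in the article's code
fixed = dates.where(dates.dt.year < cutoff_year, dates - pd.DateOffset(years=100))

features = pd.DataFrame({
    'release_year': fixed.dt.year,
    'release_month': fixed.dt.month,
    'release_day': fixed.dt.day,
    'release_weekday': fixed.dt.weekday,  # Monday = 0
})
print(features)
```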

Data Analysis and Visualization

Photo by Isaac Smith on Unsplash
Question 1: Which movie made the highest revenue?

train[train['revenue'] == train['revenue'].max()]
train[['id','title','budget','revenue']].sort_values(['revenue'], ascending=False).head(10).style.background_gradient(subset='revenue', cmap='BuGn')
# Please note that the output has a gradient style, but it is not possible to show it on Medium.

The Avengers made the highest revenue.

Question 2: Which movie has the highest budget?

train[train['budget'] == train['budget'].max()]
train[['id','title','budget', 'revenue']].sort_values(['budget'], ascending=False).head(10).style.background_gradient(subset=['budget', 'revenue'], cmap='PuBu')

Pirates of the Caribbean: On Stranger Tides is the most expensive movie.

Question 3: Which movie is the longest?

train[train['runtime'] == train['runtime'].max()]
plt.hist(train['runtime'].fillna(0) / 60, bins=40);
plt.title('Distribution of length of film in hours', fontsize=16, color='white');
plt.xlabel('Duration of Movie in Hours')
plt.ylabel('Number of Movies')
train[['id','title','runtime', 'budget', 'revenue']].sort_values(['runtime'],ascending=False).head(10).style.background_gradient(subset=['runtime','budget','revenue'], cmap='YlGn')

Carlos is the longest movie, with 338 minutes (5 hours and 38 minutes) of runtime.
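
The conversion from minutes to hours and minutes can be checked with divmod. This is a small sketch; the helper name is hypothetical and not part of the original notebook.

```python
def minutes_to_hours_minutes(total_minutes):
    """Split a runtime in minutes into whole hours and remaining minutes."""
    hours, minutes = divmod(int(total_minutes), 60)
    return hours, minutes

hours, minutes = minutes_to_hours_minutes(338)
print(f"{hours} hours and {minutes} minutes")
```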

Question 4: In which year were the most movies released?

sns.countplot(x=train['release_year'].sort_values(), palette = "Dark2", edgecolor=(0,0,0))
plt.title("Movie Release count by Year",fontsize=20)
plt.xlabel('Release Year')
plt.ylabel('Number of Movies Released')
train['release_year'].value_counts().head()
# Output
2013    141
2015    128
2010    126
2016    125
2012    125
Name: release_year, dtype: int64

The most movies were released in 2013, with 141 in total.
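
Counts like these come from value_counts, which returns the frequencies sorted from most to least common. A minimal sketch on a made-up series of years:

```python
import pandas as pd

# Toy release years (made up for illustration).
toy_years = pd.Series([2013, 2015, 2013, 2010, 2013, 2015])

counts = toy_years.value_counts()  # sorted by frequency, descending
top_year = counts.idxmax()         # index label with the largest count
print(top_year, counts.loc[top_year])
```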

Question 5: Which movies have the highest and lowest popularity?

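Most popular movie:

train[train['popularity'] == train['popularity'].max()]

Least popular movie:

train[train['popularity'] == train['popularity'].min()]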

Let’s create a popularity distribution plot.

# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(x=train['popularity'])
plt.title("Movie Popularity Count",fontsize=20)

Wonder Woman has the highest popularity, 294.33, whereas Big Time has the lowest, which is 0.
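
Another way to look up the extremes is idxmax and idxmin, which return the row label of the largest and smallest value. The sketch below uses a toy frame: the two titles and their popularity values are the ones quoted in this article, while the third row is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['Wonder Woman', 'Big Time', 'Hypothetical Film'],
    'popularity': [294.33, 0.0, 7.5],  # last value is made up
})

# idxmax/idxmin give the index label of the extreme row; loc fetches the title.
most_popular = toy.loc[toy['popularity'].idxmax(), 'title']
least_popular = toy.loc[toy['popularity'].idxmin(), 'title']
print(most_popular, least_popular)
```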

Question 6: In which month were the most movies released from 1921 to 2017?

sns.countplot(x=train['release_month'].sort_values(), palette = "Dark2", edgecolor=(0,0,0))
plt.title("Movie Release count by Month",fontsize=20)
plt.xlabel('Release Month')
plt.ylabel('Number of Movies Released')
train['release_month'].value_counts()
# Output
9     362
10    307
12    263
8     256
4     245
3     238
6     237
2     226
5     224
11    221
1     212
7     209
Name: release_month, dtype: int64

The most movies, 362, were released in September.
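
To make the month numbers easier to read, they can be mapped to names with the standard calendar module. A small sketch using the counts from the output above:

```python
import calendar

# Release counts per month number, copied from the output above.
month_counts = {9: 362, 10: 307, 12: 263, 8: 256, 4: 245, 3: 238,
                6: 237, 2: 226, 5: 224, 11: 221, 1: 212, 7: 209}

# calendar.month_name is indexed 1-12 (index 0 is an empty string).
named_counts = {calendar.month_name[m]: n for m, n in month_counts.items()}
busiest = max(named_counts, key=named_counts.get)
print(busiest, named_counts[busiest])
```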

Question 7: On which day of the month are the most movies released?

sns.countplot(x=train['release_day'].sort_values(), palette = "Dark2", edgecolor=(0,0,0))
plt.title("Movie Release count by Day of Month",fontsize=20)
plt.xlabel('Release Day')
plt.ylabel('Number of Movies Released')
train['release_day'].value_counts().head()
# Output
1     152
15    126
12    122
7     110
6     107
Name: release_day, dtype: int64

The most movies, 152, were released on the first day of the month.

Question 8: On which day of the week are the most movies released?

sns.countplot(x=train['release_weekday'].sort_values(), palette='Dark2')
day_labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
loc = list(range(len(day_labels)))  # tick positions 0-6, where Monday = 0
plt.xlabel('Release Day of Week')
plt.ylabel('Number of Movies Released')
plt.xticks(loc, day_labels, fontsize=12)
train['release_weekday'].value_counts()
# Output
4    1334
3     609
2     449
1     196
5     158
0     135
6     119
Name: release_weekday, dtype: int64

The highest number of movies were released on Friday.
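
The weekday numbers can also be turned into labels and percentage shares. The sketch below rebuilds the counts from the output above as a pandas Series; the variable names are illustrative.

```python
import pandas as pd

# Release counts per weekday number (Monday = 0), copied from the output above.
weekday_counts = pd.Series({4: 1334, 3: 609, 2: 449, 1: 196, 5: 158, 0: 135, 6: 119})

day_labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
# Sort by weekday number, then replace each number with its label.
named = weekday_counts.sort_index().rename(index=dict(enumerate(day_labels)))

# Share of all releases per weekday, in percent.
share = (named / named.sum() * 100).round(1)
print(share)
```

With these counts, Friday alone accounts for roughly 44% of all releases.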

Final Words

I hope this article was helpful to you. I tried to answer a few questions using data science, and there are many more to ask. Tomorrow, I will move on to another dataset. All the code for the data analysis and visuals can be found in this GitHub repository or Kaggle kernel.

Thanks for reading.

I appreciate any feedback.

