
Box Office Revenue Analysis and Visualization

Day 4 and 5 of 100 Days of Data Science

Photo by Krists Luhaers on Unsplash

Welcome back to my 100 Days of Data Science Challenge journey. On days 4 and 5, I worked on the TMDB Box Office Prediction dataset available on Kaggle.

I’ll start by importing some useful libraries that we need in this task.

import numpy as np
import pandas as pd

# for visualizations
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('dark_background')

Data Loading and Exploration

Once you download the data from Kaggle, you will have three files. As this is a prediction competition, you get train, test, and sample_submission files. For this project, my goal is only data analysis and visualization, so I am going to ignore the test.csv and sample_submission.csv files.

Let’s load train.csv into a DataFrame using pandas.

%time train = pd.read_csv('./data/tmdb-box-office-prediction/train.csv')
# output
CPU times: user 258 ms, sys: 132 ms, total: 389 ms
Wall time: 403 ms

About the dataset:

id: Integer unique id of each movie
belongs_to_collection: Contains the TMDB Id, Name, Movie Poster, and Backdrop URL of a movie in JSON format.
budget: Budget of a movie in dollars. Some rows contain 0, which means the budget is unknown.
genres: Contains all the Genres Name & TMDB Id in JSON Format.
homepage: Contains the official URL of a movie.
imdb_id: IMDB id of a movie (string).
original_language: Two-letter code of the original language in which the movie was made.
original_title: The original title of a movie in original_language.
overview: Brief description of the movie.
popularity: Popularity of the movie.
poster_path: Poster path of a movie. You can see the full poster image by appending this path to the base URL → https://image.tmdb.org/t/p/original/
production_companies: Names and TMDB ids of all production companies of a movie, in JSON format.
production_countries: Two-letter code and the full name of each production country in JSON format.
release_date: The release date of a movie in mm/dd/yy format.
runtime: Total runtime of a movie in minutes (Integer).
spoken_languages: Two-letter code and the full name of each spoken language.
status: Is the movie released or rumored?
tagline: Tagline of a movie
title: English title of a movie
Keywords: TMDB Id and name of all the keywords in JSON format.
cast: All cast TMDB id, name, character name, gender (1 = Female, 2 = Male) in JSON format
crew: Name, TMDB id, and profile path of crew members, with jobs such as Director, Writer, Art, Sound, etc.
revenue: Total revenue earned by a movie in dollars.
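
Two of the notes above translate directly into code: building the full poster URL and treating a budget of 0 as unknown. Here is a minimal sketch on a made-up two-row frame. The titles, poster paths, and budgets are invented for illustration and are not taken from the dataset, and the names `sample`, `poster_url`, and `budget_clean` are my own.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; every value here is invented for illustration only.
sample = pd.DataFrame({
    'title': ['Movie A', 'Movie B'],
    'poster_path': ['/abc123.jpg', '/def456.jpg'],
    'budget': [14000000, 0],
})

POSTER_BASE_URL = 'https://image.tmdb.org/t/p/original/'

# Join base URL and path with exactly one slash, whether or not the path starts with one.
sample['poster_url'] = POSTER_BASE_URL.rstrip('/') + '/' + sample['poster_path'].str.lstrip('/')

# A budget of 0 means "unknown", so mark it as missing instead of zero dollars.
sample['budget_clean'] = sample['budget'].replace(0, np.nan)

print(sample[['title', 'poster_url', 'budget_clean']])
```

With the zeros turned into NaN, statistics such as the mean budget skip the unknown values instead of being pulled down by them.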

Let’s have a look at the sample data.

train.head()

As we can see, some features contain dictionaries, so I am dropping all such columns for now, along with the free-text columns tagline, overview, and homepage.

train = train.drop(['belongs_to_collection', 'genres', 'crew',
'cast', 'Keywords', 'spoken_languages', 'production_companies', 'production_countries', 'tagline','overview','homepage'], axis=1)
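
If you would rather use these columns than drop them, they can be parsed before the drop. The sketch below assumes each cell is a string that looks like a Python list of dictionaries. That layout is my assumption, so check a raw cell in your copy of the file first. The sample cell and the helper `extract_names` are hypothetical.

```python
import ast
import pandas as pd

# Hypothetical cells mimicking the assumed layout of the 'genres' column.
sample = pd.DataFrame({
    'genres': [
        "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]",
        None,  # a missing value
    ]
})

def extract_names(cell):
    """Return the 'name' of every dictionary in a stringified list, or [] if missing."""
    if pd.isna(cell):
        return []
    return [item['name'] for item in ast.literal_eval(cell)]

sample['genre_names'] = sample['genres'].apply(extract_names)
print(sample['genre_names'].tolist())
```

I use ast.literal_eval here because it only evaluates Python literals, which is safer than eval. If the cells turn out to be strict JSON, json.loads would be the natural replacement.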

Now it’s time to have a look at the statistics of the data.

print("Shape of data is ")
train.shape
# Output
Shape of data is
(3000, 12)

Dataframe information.

train.info()
# Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 3000 non-null   int64  
 1   budget             3000 non-null   int64  
 2   imdb_id            3000 non-null   object 
 3   original_language  3000 non-null   object 
 4   original_title     3000 non-null   object 
 5   popularity         3000 non-null   float64
 6   poster_path        2999 non-null   object 
 7   release_date       3000 non-null   object 
 8   runtime            2998 non-null   float64
 9   status             3000 non-null   object 
 10  title              3000 non-null   object 
 11  revenue            3000 non-null   int64  
dtypes: float64(2), int64(3), object(7)
memory usage: 281.4+ KB
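
The info() output shows that poster_path and runtime have fewer than 3000 non-null values, so they contain a few missing entries. A quick way to get those counts directly is isna().sum(). Here is a minimal sketch on a small made-up frame standing in for train; the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few gaps, standing in for train.
sample = pd.DataFrame({
    'poster_path': ['/a.jpg', None, '/c.jpg'],
    'runtime': [93.0, np.nan, np.nan],
    'revenue': [100, 200, 300],
})

# Count missing values per column.
missing = sample.isna().sum()

# Show only the columns that actually have missing values.
print(missing[missing > 0])
```

Going by the info() output above, the same call on train would report 1 missing poster_path and 2 missing runtime values.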

Describe dataframe.

train.describe()

Let’s create new columns for the release day, weekday, month, and year.

# release_date is stored in mm/dd/yy format
train['release_date'] = pd.to_datetime(train['release_date'], format='%m/%d/%y')
train['release_day'] = train['release_date'].dt.day
train['release_weekday'] = train['release_date'].dt.weekday
train['release_month'] = train['release_date'].dt.month

# two-digit years can land in the wrong century, so shift years after 2017 back by 100
train['release_year'] = train['release_date'].apply(lambda t: t.year if t.year < 2018 else t.year - 100)
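
The release_year line deserves a short explanation. The dates carry only two-digit years, so the parser has to guess the century, and an old film can end up in the future. The newest films in this dataset are from 2017, so any parsed year from 2018 onward must belong to the 1900s. Here is a minimal sketch with made-up dates, relying on the %y convention that 00 to 68 map to the 2000s and 69 to 99 map to the 1900s:

```python
import pandas as pd

# Made-up release dates in mm/dd/yy format.
dates = pd.Series(['2/20/15', '11/14/21', '7/4/85'])
parsed = pd.to_datetime(dates, format='%m/%d/%y')

# %y maps 00-68 to 20xx and 69-99 to 19xx, so '21' is read as 2021.
raw_years = parsed.dt.year

# Keep years before 2018 as they are; shift everything else back one century.
fixed_years = raw_years.where(raw_years < 2018, raw_years - 100)

print(pd.DataFrame({'date': dates, 'raw': raw_years, 'fixed': fixed_years}))
```

In this sketch the second date is parsed as 2021 and corrected to 1921, while the other two are left alone.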

Data Analysis and Visualization

Photo by Isaac Smith on Unsplash

Question 1: Which movie made the highest revenue?

train[train['revenue'] == train['revenue'].max()]
train[['id','title','budget','revenue']].sort_values(['revenue'], ascending=False).head(10).style.background_gradient(subset='revenue', cmap='BuGn')
# Please note that the output has a gradient style, which cannot be shown on Medium.

The Avengers made the highest revenue.
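
Sorting and taking the head works well, but pandas also offers nlargest for the same job, and idxmax when you only need the single top row. Here is a small sketch on invented numbers; the titles and revenues are placeholders, not dataset values.

```python
import pandas as pd

# Placeholder data; titles and revenues are invented.
sample = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C', 'Movie D'],
    'revenue': [120, 950, 430, 60],
})

# The two rows with the highest revenue, already sorted in descending order.
top2 = sample.nlargest(2, 'revenue')

# The title of the single highest-revenue row.
top_title = sample.loc[sample['revenue'].idxmax(), 'title']

print(top2)
print(top_title)
```

The same pattern applies to the budget and runtime questions below by swapping the column name.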


Question 2: Which movie has the highest budget?

train[train['budget'] == train['budget'].max()]
train[['id','title','budget', 'revenue']].sort_values(['budget'], ascending=False).head(10).style.background_gradient(subset=['budget', 'revenue'], cmap='PuBu')

Pirates of the Caribbean: On Stranger Tides is the most expensive movie.


Question 3: Which movie is the longest?

train[train['runtime'] == train['runtime'].max()]
plt.hist(train['runtime'].fillna(0) / 60, bins=40);
plt.title('Distribution of length of film in hours', fontsize=16, color='white');
plt.xlabel('Duration of Movie in Hours')
plt.ylabel('Number of Movies')
train[['id','title','runtime', 'budget', 'revenue']].sort_values(['runtime'],ascending=False).head(10).style.background_gradient(subset=['runtime','budget','revenue'], cmap='YlGn')

Carlos is the longest movie, with 338 minutes (5 hours and 38 minutes) of runtime.
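
The conversion from minutes to hours and minutes is a single divmod call. The helper name `minutes_to_hours_minutes` is my own, used here only for illustration.

```python
def minutes_to_hours_minutes(total_minutes):
    """Split a runtime in minutes into whole hours and remaining minutes."""
    hours, minutes = divmod(int(total_minutes), 60)
    return hours, minutes

hours, minutes = minutes_to_hours_minutes(338)
print(f"{hours} hours and {minutes} minutes")  # 5 hours and 38 minutes
```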


Question 4: In which year were the most movies released?

plt.figure(figsize=(20,12))
sns.countplot(x=train['release_year'].sort_values(), palette="Dark2", edgecolor=(0,0,0))
plt.title("Movie Release count by Year",fontsize=20)
plt.xlabel('Release Year')
plt.ylabel('Number of Movies Release')
plt.xticks(fontsize=12,rotation=90)
plt.show()
train['release_year'].value_counts().head()
# Output
2013    141
2015    128
2010    126
2016    125
2012    125
Name: release_year, dtype: int64

2013 had the most releases, with 141 movies.


Question 5: Which movies have the highest and lowest popularity?

Most popular Movie:

train[train['popularity']==train['popularity'].max()][['original_title','popularity','release_date','revenue']]

Least Popular Movie:

train[train['popularity']==train['popularity'].min()][['original_title','popularity','release_date','revenue']]

Let’s create a popularity distribution plot.

plt.figure(figsize=(20,12))
# histplot replaces the deprecated distplot; it draws a plain histogram by default
sns.histplot(x=train['popularity'])
plt.title("Movie Popularity Count",fontsize=20)
plt.xlabel('Popularity')
plt.ylabel('Count')
plt.xticks(fontsize=12,rotation=90)
plt.show()

Wonder Woman has the highest popularity, 294.33, whereas Big Time has the lowest, 0.


Question 6: In which month were the most movies released from 1921 to 2017?

plt.figure(figsize=(20,12))
sns.countplot(x=train['release_month'].sort_values(), palette="Dark2", edgecolor=(0,0,0))
plt.title("Movie Release count by Month",fontsize=20)
plt.xlabel('Release Month')
plt.ylabel('Number of Movies Release')
plt.xticks(fontsize=12)
plt.show()
train['release_month'].value_counts()
# Output
9     362
10    307
12    263
8     256
4     245
3     238
6     237
2     226
5     224
11    221
1     212
7     209
Name: release_month, dtype: int64

September has the most releases, with 362 movies.


Question 7: On which day of the month are the most movies released?

plt.figure(figsize=(20,12))
sns.countplot(x=train['release_day'].sort_values(), palette="Dark2", edgecolor=(0,0,0))
plt.title("Movie Release count by Day of Month",fontsize=20)
plt.xlabel('Release Day')
plt.ylabel('Number of Movies Release')
plt.xticks(fontsize=12)
plt.show()
train['release_day'].value_counts().head()
#Output
1     152
15    126
12    122
7     110
6     107
Name: release_day, dtype: int64

The first day of the month has the most releases, with 152 movies.


Question 8: On which day of the week are the most movies released?

plt.figure(figsize=(20,12))
sns.countplot(x=train['release_weekday'].sort_values(), palette='Dark2')
loc = np.array(range(len(train['release_weekday'].unique())))
day_labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
plt.xlabel('Release Day of Week')
plt.ylabel('Number of Movies Release')
plt.xticks(loc, day_labels, fontsize=12)
plt.show()
train['release_weekday'].value_counts()
# Output
4    1334
3     609
2     449
1     196
5     158
0     135
6     119
Name: release_weekday, dtype: int64

The highest number of movies were released on Friday.
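
The weekday numbers in the output follow the pandas convention where Monday is 0 and Sunday is 6, which is why 4 stands for Friday. If you prefer names in the counts, day_name() gives them directly. Here is a minimal sketch with three arbitrary dates I picked for illustration:

```python
import pandas as pd

# Three arbitrary dates used only to illustrate the mapping.
dates = pd.to_datetime(pd.Series(['2012-05-04', '2017-06-02', '2015-12-16']))

weekday_numbers = dates.dt.weekday   # Monday=0 ... Sunday=6
weekday_names = dates.dt.day_name()  # full English day names

print(pd.DataFrame({'number': weekday_numbers, 'name': weekday_names}))
print(weekday_names.value_counts())
```

Counting by name saves you from decoding the integers, at the cost of losing the natural Monday-to-Sunday sort order.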


Final Words

I hope this article was helpful to you. I tried to answer a few questions using data science, and there are many more to ask. Tomorrow, I will move on to another dataset. All the code for the data analysis and visuals can be found in this GitHub repository or Kaggle kernel.

Thanks for reading.

I appreciate any feedback.


100 Days of Data Science Progress

Analysing and Visualising the Country wise Population from 1955 to 2020 with Pandas, Matplotlib…



