Box Office Revenue Analysis and Visualization

Welcome back to my 100 Days of Data Science Challenge journey. On days 4 and 5, I worked on the TMDB Box Office Prediction dataset, available on Kaggle.

I’ll start by importing the libraries we need for this task.

import pandas as pd

# for visualizations
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('dark_background')

Data Loading and Exploration

Once you have downloaded the data from Kaggle, you will have three files. As this is a prediction competition, you get train, test, and sample_submission files. For this project, my goal is only to perform data analysis and visualization, so I am going to ignore the test.csv and sample_submission.csv files.

Let’s load train.csv into a data frame using pandas.

%time train = pd.read_csv('./data/tmdb-box-office-prediction/train.csv')

# output
CPU times: user 258 ms, sys: 132 ms, total: 389 ms
Wall time: 403 ms

About the dataset:

id: Integer unique id of each movie

belongs_to_collection: Contains the TMDB Id, Name, Movie Poster, and Backdrop URL of a movie in JSON format.

budget: Budget of a movie in dollars. Some rows contain 0, which means the budget is unknown.

genres: Contains all the genre names and TMDB ids in JSON format.

homepage: Contains the official URL of a movie.

imdb_id: IMDB id of a movie (string).

original_language: Two-letter code of the original language in which the movie was made.

original_title: The original title of a movie in original_language.

overview: Brief description of the movie.

popularity: Popularity of the movie.

poster_path: Poster path of a movie. You can see the full poster image by appending this path to the following URL → https://image.tmdb.org/t/p/original/

production_companies: Name and TMDB id of every production company of a movie, in JSON format.

production_countries: Two-letter code and full name of each production country, in JSON format.

release_date: The release date of a movie in mm/dd/yy format.

runtime: Total runtime of a movie in minutes (Integer).

spoken_languages: Two-letter code and full name of each spoken language.

status: Is the movie released or rumored?

tagline: Tagline of a movie

title: English title of a movie

Keywords: TMDB Id and name of all the keywords in JSON format.

cast: All cast TMDB id, name, character name, gender (1 = Female, 2 = Male) in JSON format

crew: Name, TMDB id, and profile path of crew members in various jobs, such as Director, Writer, Art, Sound, etc.

revenue: Total revenue earned by a movie in dollars.

Let’s have a look at the sample data.

train.head()

As we can see, some features contain dictionaries, so I am dropping all such columns for now, along with the free-text columns tagline, overview, and homepage.

train = train.drop(['belongs_to_collection', 'genres', 'crew',
'cast', 'Keywords', 'spoken_languages', 'production_companies', 'production_countries', 'tagline','overview','homepage'], axis=1)
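The dropped columns can still be useful later. Here is a minimal sketch of how one of them could be parsed, assuming the values are stored as Python-style strings holding a list of dictionaries. The sample rows are made up, and `extract_names` is a hypothetical helper, not part of the original notebook.

```python
import ast

import pandas as pd

# Made-up rows that mimic the 'genres' column; the real file is assumed to
# store each value as a string containing a list of dictionaries.
sample = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": [
        "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]",
        "[{'id': 28, 'name': 'Action'}]",
        None,  # missing value
    ],
})


def extract_names(cell):
    """Return the 'name' of every dictionary in one stringified list."""
    if not isinstance(cell, str):
        return []  # treat missing cells as "no entries"
    return [item["name"] for item in ast.literal_eval(cell)]


sample["genre_names"] = sample["genres"].apply(extract_names)
print(sample[["title", "genre_names"]])
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for this kind of string.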

Now it is time to have a look at the statistics of the data.

print("Shape of data is ")
train.shape

# Output

Shape of data is
(3000, 12)

Dataframe information.

train.info()

# Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 3000 non-null   int64  
 1   budget             3000 non-null   int64  
 2   imdb_id            3000 non-null   object 
 3   original_language  3000 non-null   object 
 4   original_title     3000 non-null   object 
 5   popularity         3000 non-null   float64
 6   poster_path        2999 non-null   object 
 7   release_date       3000 non-null   object 
 8   runtime            2998 non-null   float64
 9   status             3000 non-null   object 
 10  title              3000 non-null   object 
 11  revenue            3000 non-null   int64  
dtypes: float64(2), int64(3), object(7)
memory usage: 281.4+ KB

Describe dataframe.

train.describe()
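Keep in mind that budget uses 0 for unknown values, so its summary statistics are pulled down by those rows. A small sketch of one way to handle this, using made-up numbers rather than the real data, is to mask the zeros before aggregating:

```python
import pandas as pd

# Made-up budgets; 0 stands for "unknown", as in the column description.
sample = pd.DataFrame({"budget": [0, 1_000_000, 0, 3_000_000]})

# Turn the 0 placeholder into NaN so that mean() and median() skip it.
budget_known = sample["budget"].mask(sample["budget"] == 0)

print("Share of unknown budgets:", budget_known.isna().mean())
print("Mean with zeros:", sample["budget"].mean())
print("Mean without zeros:", budget_known.mean())
```

Whether to mask, impute, or keep the zeros depends on the question being asked; this only shows the effect on the mean.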

Let’s create new columns for the release day of the month, weekday, month, and year.

# Dates are stored as mm/dd/yy, so state the format explicitly
train['release_date'] = pd.to_datetime(train['release_date'], format='%m/%d/%y')

train['release_day'] = train['release_date'].apply(lambda t: t.day)

train['release_weekday'] = train['release_date'].apply(lambda t: t.weekday())

train['release_month'] = train['release_date'].apply(lambda t: t.month)

# Two-digit years can be parsed into the wrong century, so move any year after 2017 back by 100
train['release_year'] = train['release_date'].apply(lambda t: t.year if t.year < 2018 else t.year - 100)
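The `- 100` correction is needed because two-digit years are ambiguous. The sketch below uses made-up dates to show how `%y` resolves them and how the same rule as above shifts the future ones back by a century. The 2018 cutoff is taken from the code above and assumes no movie in the data was released after 2017.

```python
import pandas as pd

# Made-up release dates in mm/dd/yy form.
dates = pd.Series(["2/20/15", "12/25/99", "7/4/21", "1/1/68"])

parsed = pd.to_datetime(dates, format="%m/%d/%y")
years = parsed.dt.year
print(years.tolist())  # '21' and '68' land in the 2000s

# Keep years before 2018, move everything else back by one century.
corrected = years.where(years < 2018, years - 100)
print(corrected.tolist())
```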

Data Analysis and Visualization

Question 1: Which movie made the highest revenue?

train[train['revenue'] == train['revenue'].max()]

train[['id','title','budget','revenue']].sort_values(['revenue'], ascending=False).head(10).style.background_gradient(subset='revenue', cmap='BuGn')

# Note: the output has a gradient style, which cannot be shown on Medium.

The Avengers made the highest revenue.
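As a side note, pandas has shortcuts for this kind of "top N" question. A minimal sketch on made-up data, not the real dataset:

```python
import pandas as pd

# Made-up titles and revenues.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "revenue": [10, 40, 30, 20],
})

# nlargest returns the rows with the N largest values, already sorted.
top = sample.nlargest(2, "revenue")
print(top)

# idxmax gives the index label of the single largest value.
best = sample.loc[sample["revenue"].idxmax(), "title"]
print(best)
```

`nlargest` avoids sorting the whole frame when only a handful of rows is needed.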

Question 2: Which movie has the highest budget?

train[train['budget'] == train['budget'].max()]

train[['id','title','budget', 'revenue']].sort_values(['budget'], ascending=False).head(10).style.background_gradient(subset=['budget', 'revenue'], cmap='PuBu')

Pirates of the Caribbean: On Stranger Tides is the most expensive movie.

Question 3: Which movie is the longest?

train[train['runtime'] == train['runtime'].max()]

plt.hist(train['runtime'].fillna(0) / 60, bins=40);
plt.title('Distribution of length of film in hours', fontsize=16, color='white');
plt.xlabel('Duration of Movie in Hours')
plt.ylabel('Number of Movies')

train[['id','title','runtime', 'budget', 'revenue']].sort_values(['runtime'],ascending=False).head(10).style.background_gradient(subset=['runtime','budget','revenue'], cmap='YlGn')

Carlos is the longest movie, with 338 minutes (5 hours and 38 minutes) of runtime.
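The conversion from minutes to hours and minutes is a simple integer division with remainder. A tiny sketch, where `minutes_to_hours` is a hypothetical helper name of my own:

```python
def minutes_to_hours(total_minutes):
    """Format a runtime in minutes as hours and minutes."""
    hours, minutes = divmod(int(total_minutes), 60)
    return f"{hours} h {minutes} min"


print(minutes_to_hours(338))  # 5 h 38 min
```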

Question 4: In which year were the most movies released?

plt.figure(figsize=(20,12))
# pass the column as x so that seaborn counts the rows per year
sns.countplot(x=train['release_year'].sort_values(), palette="Dark2", edgecolor=(0,0,0))
plt.title("Movie Release Count by Year", fontsize=20)
plt.xlabel('Release Year')
plt.ylabel('Number of Movies Released')
plt.xticks(fontsize=12, rotation=90)
plt.show()

train['release_year'].value_counts().head()

# Output

2013    141
2015    128
2010    126
2016    125
2012    125
Name: release_year, dtype: int64

2013 had the most releases, with 141 movies.

Question 5: Which movies have the highest and lowest popularity?

Most popular Movie:

train[train['popularity']==train['popularity'].max()][['original_title','popularity','release_date','revenue']]

Least Popular Movie:

train[train['popularity']==train['popularity'].min()][['original_title','popularity','release_date','revenue']]

Let’s create a popularity distribution plot.

plt.figure(figsize=(20,12))
# histogram of the popularity scores, without a density curve
sns.histplot(train['popularity'])
plt.title("Movie Popularity Count", fontsize=20)
plt.xlabel('Popularity')
plt.ylabel('Count')
plt.xticks(fontsize=12, rotation=90)
plt.show()

Wonder Woman has the highest popularity, 294.33, whereas Big Time has the lowest, which is 0.

Question 6: In which month were the most movies released from 1921 to 2017?

plt.figure(figsize=(20,12))
# pass the column as x so that seaborn counts the rows per month
sns.countplot(x=train['release_month'].sort_values(), palette="Dark2", edgecolor=(0,0,0))
plt.title("Movie Release Count by Month", fontsize=20)
plt.xlabel('Release Month')
plt.ylabel('Number of Movies Released')
plt.xticks(fontsize=12)
plt.show()

train['release_month'].value_counts()

# Output
9     362
10    307
12    263
8     256
4     245
3     238
6     237
2     226
5     224
11    221
1     212
7     209
Name: release_month, dtype: int64

September has the most releases, with 362 movies.

Question 7: On which day of the month were the most movies released?
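Month numbers are a bit hard to read in a table. One way to label them, sketched here with the top three counts copied from the output above, is the standard library's `calendar` module:

```python
import calendar

import pandas as pd

# Top three month counts copied from the value_counts() output above.
counts = pd.Series({9: 362, 10: 307, 12: 263})

# calendar.month_abbr maps 1..12 to 'Jan'..'Dec' (in the default locale).
labelled = counts.rename(index=lambda m: calendar.month_abbr[m])
print(labelled)
```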

plt.figure(figsize=(20,12))
# pass the column as x so that seaborn counts the rows per day of the month
sns.countplot(x=train['release_day'].sort_values(), palette="Dark2", edgecolor=(0,0,0))
plt.title("Movie Release Count by Day of Month", fontsize=20)
plt.xlabel('Release Day')
plt.ylabel('Number of Movies Released')
plt.xticks(fontsize=12)
plt.show()

train['release_day'].value_counts().head()

# Output
1     152
15    126
12    122
7     110
6     107
Name: release_day, dtype: int64

The first day of the month has the highest number of releases, 152.

Question 8: On which day of the week were the most movies released?

plt.figure(figsize=(20,12))
sns.countplot(x=train['release_weekday'].sort_values(), palette='Dark2')
# weekday() returns 0 for Monday through 6 for Sunday
day_labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
loc = range(len(day_labels))
plt.xlabel('Release Day of Week')
plt.ylabel('Number of Movies Released')
plt.xticks(loc, day_labels, fontsize=12)
plt.show()

train['release_weekday'].value_counts()

# Output
4    1334
3     609
2     449
1     196
5     158
0     135
6     119
Name: release_weekday, dtype: int64

Most movies, 1,334 of them, were released on a Friday (weekday 4).
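The weekday numbers run from 0 (Monday) to 6 (Sunday). If you prefer names in the table, pandas can produce them directly. A small sketch on three made-up dates, with the result reindexed into calendar order:

```python
import pandas as pd

# Made-up release dates: two Fridays and one Wednesday.
dates = pd.to_datetime(pd.Series(["2015-02-20", "2012-05-04", "2010-07-14"]))

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Count releases per weekday name; days with no releases get 0.
counts = dates.dt.day_name().value_counts().reindex(day_order, fill_value=0)
print(counts)
```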

Final Words

I hope this article was helpful to you. I tried to answer a few questions using data science, and there are many more to ask. Tomorrow I will move on to another dataset. All the code for the data analysis and visuals can be found at this GitHub repository or Kaggle kernel.

Thanks for reading.

I appreciate any feedback.