A Beginner’s Guide to EDA

That is the first lesson that I learned after completing my first project through Flatiron’s Online Data Science Bootcamp. You can derive a lot of meaningful insights that answer relevant business questions without building a single model! And of course, if you are building a model, you are going to build a much more informed, effective model if you spend time getting to know your data.
This was the first ‘project’ I completed at Flatiron, but in retrospect it was really only the ‘EDA’ portion of the entire lifecycle. Ah, to be young again.
No matter – I still learned a lot from this project and it helped to drive home one important fact before I moved on to more complex projects:
EDA is important.
Project Overview
We were tasked with answering the question of: "What types of films are performing best at the box office?"

The Data
The data was provided by Flatiron and came from four main sources (as seen above):
- Box Office Mojo
- IMDB
- Rotten Tomatoes
- TheMovieDB.org
Features included: release_date, popularity, production_budget, domestic_gross, worldwide_gross, runtime, genre, and others related to the actors/directors involved with each film.
Keep in mind, this is the first project of the course and therefore no models are being built. This is pure and simple EDA that was then packaged up into a succinct, business-oriented presentation.
Cleaning the Data
Perhaps I am one of the few people who don’t mind cleaning the data. I also don’t mind cleaning my apartment, so maybe there’s a connection somewhere…
What I had to do to clean the data:
1️⃣ Convert strings to datetime objects, e.g.:
df['release_date'] = pd.to_datetime(df['release_date'])
Pretty simple. Pandas has a built-in function to do this, and datetime objects give us a lot more options to explore the data with, which leads me to my next point…
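As a minimal sketch of that conversion, here is a version that also handles rows that fail to parse. The sample rows and the "%b %d, %Y" date format are assumptions made for illustration; the real dataset may store dates differently.

```python
import pandas as pd

# Hypothetical sample rows; the date format is an assumption.
df = pd.DataFrame({"release_date": ["Dec 18, 2009", "May 20, 2011", "not a date"]})

# errors="coerce" turns strings that do not match the format into NaT
# instead of raising, so bad rows can be inspected afterwards.
df["release_date"] = pd.to_datetime(
    df["release_date"], format="%b %d, %Y", errors="coerce"
)

# Count the rows that could not be parsed.
unparsed = df["release_date"].isna().sum()
print(df)
print("Rows that failed to parse:", unparsed)
```

Checking the number of NaT values right after converting is a cheap way to catch unexpected date formats before they skew later analysis.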
2️⃣ Create a release_day feature based on the release_date column:
df['release_day'] = df['release_date'].map(lambda x: x.weekday())
Fun! Converting release_date to a datetime object allowed us to call the .weekday() method on each value to get the day of the week a movie was released (0 is Monday, 6 is Sunday).
I also created a release_month column to use for later analysis.
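An alternative to mapping a lambda over the column is pandas' vectorized .dt accessor. The sketch below uses made-up dates, and the release_day_name column is a hypothetical extra that is not part of the original project.

```python
import pandas as pd

# Hypothetical sample dates, already converted to datetime.
df = pd.DataFrame({"release_date": pd.to_datetime(["2009-12-18", "2015-07-01"])})

# .dt.weekday returns integers: Monday=0 ... Sunday=6.
df["release_day"] = df["release_date"].dt.weekday

# Hypothetical extra column: readable day names make plots easier to label.
df["release_day_name"] = df["release_date"].dt.day_name()

# .dt.month returns integers: January=1 ... December=12.
df["release_month"] = df["release_date"].dt.month

print(df)
```

Both approaches give the same integers; the accessor version just avoids calling a Python function once per row.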
3️⃣ Convert all numbers to integers. Surprisingly, a lot of columns containing numbers stored them as strings. No thank you!
df['domestic_gross'] = df['domestic_gross'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype('int')
You can tell that the numbers were actually stored as strings in the format $45,986,233. Before I could convert each record to an int, I first needed to remove the $ and the , characters. Passing regex=False makes sure the $ is treated as a literal character rather than a regex anchor.
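Since several columns shared this format, the same cleanup can be wrapped in a small helper. This is only a sketch: the function name currency_to_int and the sample values are hypothetical, and it assumes there are no missing values in the columns.

```python
import pandas as pd


def currency_to_int(series: pd.Series) -> pd.Series:
    """Strip '$' and ',' from strings like '$45,986,233' and return integers."""
    # Inside a character class, '$' is a literal dollar sign, not an anchor.
    cleaned = series.str.replace(r"[$,]", "", regex=True)
    # int64 avoids overflow for large grosses on platforms where int is 32-bit.
    return cleaned.astype("int64")


# Hypothetical sample values in the format described above.
df = pd.DataFrame({
    "domestic_gross": ["$45,986,233", "$0"],
    "worldwide_gross": ["$1,045,663,875", "$12"],
})

for col in ["domestic_gross", "worldwide_gross"]:
    df[col] = currency_to_int(df[col])

print(df.dtypes)
```

If a column does contain missing values, astype("int64") will raise, so those rows would need to be dropped or filled first.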
4️⃣ Add a few calculated columns. For example:
df['worldwide_net'] = df['worldwide_gross'] - df['production_budget']
This isn’t necessarily part of the ‘cleaning’ process, but I like to add any calculated columns I can think of while I’m cleaning just so they’re already there.
That was pretty much it for this dataset – nothing too tricky. I did also remove some features that I knew wouldn’t be helpful, just to keep things tight.
Exploring the Data
One big thing I learned from this project is that you CANNOT ‘mess up’ data exploration. It is impossible.
That was a big lesson that I learned, at least. For some reason when I first started performing EDA, I thought I was somehow going to do something ‘wrong’. This is the time that you are supposed to try anything and everything!
If you think there is a relationship worth exploring, then you are automatically right. The WORST thing that can happen is you find no correlation or find a relationship that’s opposite to what you expected.
But even still! That is still telling you something about your data!!
So I don’t know who needs to hear this, but one more time for the people in the back:
You cannot mess up data exploration.
Okay, now that I have that off my chest, let me tell you what I explored.
1. Runtime
I wanted to see if runtime affected either the popularity or worldwide_gross of a movie.
This is what I found for runtime and popularity:

I found something similar for runtime and worldwide_gross:

While we don’t necessarily find a particularly high correlation coefficient, if we look at the mean runtimes of the Top 100 and Bottom 100 movies, we can see the Top 100 hovering around the 120-minute mark, as opposed to the 90-minute mark for the Bottom 100.
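One way to run both checks (the correlation coefficient and the Top/Bottom comparison) is sketched below. The numbers are toy data invented for illustration, not the project's results, and n is set to 3 here where the analysis above used 100.

```python
import pandas as pd

# Toy data for illustration only; these are not the project's numbers.
df = pd.DataFrame({
    "runtime": [88, 92, 95, 118, 121, 125],
    "worldwide_gross": [5, 8, 6, 300, 450, 400],
})

# Series.corr computes the Pearson correlation coefficient by default.
r = df["runtime"].corr(df["worldwide_gross"])

n = 3  # the analysis above used 100
top_mean = df.nlargest(n, "worldwide_gross")["runtime"].mean()
bottom_mean = df.nsmallest(n, "worldwide_gross")["runtime"].mean()

print(f"Pearson r: {r:.2f}")
print(f"Mean runtime, top {n}: {top_mean:.1f} min")
print(f"Mean runtime, bottom {n}: {bottom_mean:.1f} min")
```

Comparing group means like this can surface a pattern even when a single correlation coefficient looks unimpressive.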
2. Budget
I wanted to see if a bigger budget always meant a more profitable/popular movie. Let’s take a look.
First, budget and popularity:

And again for budget and worldwide_gross:

We definitely see a stronger correlation for budget than we did for runtime. The numbers are clear: budget matters!
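To compare features side by side, the correlation matrix can be sliced down to just the pairs of interest. Again, this is a sketch on invented toy values, so the output says nothing about the real dataset.

```python
import pandas as pd

# Toy values for illustration only.
df = pd.DataFrame({
    "runtime": [100, 90, 120, 110],
    "production_budget": [10, 20, 30, 40],
    "popularity": [5, 7, 6, 9],
    "worldwide_gross": [30, 60, 90, 120],
})

# DataFrame.corr returns pairwise Pearson correlations for numeric columns.
corr = df.corr()

# Keep only the rows (candidate drivers) and columns (outcomes) we care about.
summary = corr.loc[
    ["runtime", "production_budget"],
    ["popularity", "worldwide_gross"],
]
print(summary.round(2))
```

Reading across a row shows how one feature relates to each outcome, which makes the budget vs. runtime comparison easy to see at a glance.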
3. Day/Month of Release
Remember how we converted our release_date column to datetime objects and then used that to calculate release_day and release_month, even though they weren’t explicitly given?
Well, now is the time to use them!
Let’s take a look to see if either the day of the week or the month of the year has any impact on a film’s performance.
release_day & release_month with popularity:

release_day & release_month with worldwide_gross:

I definitely would have expected Friday and Saturday to be the top days for movie releases. Nope!
Friday is the top day when it comes to both popularity and worldwide_gross, but according to this data, Saturday and Sunday are actually the two worst days to release a movie!
December also has a fairly significant lead in both categories.
AKA – release a movie on a Friday in December!
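A comparison like this usually comes down to a groupby on the derived columns. The sketch below assumes the mean is the aggregate being compared (the median would be a reasonable alternative) and uses a handful of made-up rows.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "release_date": pd.to_datetime(
        ["2009-12-18", "2015-07-01", "2010-01-01", "2011-05-20"]
    ),
    "worldwide_gross": [100, 40, 60, 80],
})

# Named days and months read better on a chart than integers.
df["release_day"] = df["release_date"].dt.day_name()
df["release_month"] = df["release_date"].dt.month_name()

# Average gross per group, best first.
by_day = (
    df.groupby("release_day")["worldwide_gross"].mean().sort_values(ascending=False)
)
by_month = (
    df.groupby("release_month")["worldwide_gross"].mean().sort_values(ascending=False)
)

print(by_day)
print(by_month)
```

It is worth checking how many films fall into each group (for example with .size()) before reading too much into a day or month with only a few releases.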
4. Genre
What are the most popular genres? Let’s look!
Let’s look based on popularity first.

And now worldwide_gross.

Cool! We can see that Action and Adventure top the list for both metrics, with Sci-Fi and Fantasy also making both lists.
Comedy makes the Top 5 when it comes to highest-grossing movies, and Drama makes the Top 5 when it comes to the most popular movies.
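Films often belong to more than one genre, so a ranking like this needs one row per film-genre pair. The sketch below assumes the genre column holds a comma-separated string (the actual storage format is not described here) and uses invented values.

```python
import pandas as pd

# Assumption: genres are stored as one comma-separated string per film.
df = pd.DataFrame({
    "genre": ["Action,Adventure", "Comedy", "Action,Sci-Fi"],
    "worldwide_gross": [300, 50, 200],
})

# Split the string into a list, then explode to one row per (film, genre).
genres = df.assign(genre=df["genre"].str.split(",")).explode("genre")

# Average gross per genre, highest first, keeping the top five.
top_genres = (
    genres.groupby("genre")["worldwide_gross"]
    .mean()
    .sort_values(ascending=False)
    .head(5)
)
print(top_genres)
```

Note that a multi-genre film counts toward each of its genres here, which is one of several reasonable ways to handle the overlap.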
Conclusion
To sum it all up:

That’s all there is to it! Come up with some ideas that you want to explore and then… explore them.
Sometimes your assumptions will be proven right and sometimes they will be proven wrong – neither is better than the other.
Both will tell you something useful about your data.
I hope this was helpful to somebody! Or at the very least, let somebody know that they are not alone starting off their Data Science journey.
More content on the way.