Popcorn Data — Analysing Cinema Seating Patterns (Part I)

What Can Data Analytics Reveal About Your Cinema Habits?

Noel Mathew Isaac
Towards Data Science
6 min read · Jun 10, 2020

--

By Noel Mathew Isaac and Vanshiqa Agrawal

In Part II, we analyse the data, visualise it, and then build a website for our findings.

Image by author

Part I — Obtaining the Data for Analysis

Ever felt the crippling disappointment of finding out your favourite seat at the theatre has been booked?

How popular really is your favourite seat?

We wanted to find out more about the movie trends in Singapore — from which seats people prefer to the way they like to watch different movies. So we created PopcornData — a website to get a glimpse of Singapore’s movie trends — by scraping data, finding interesting insights, and visualising it.

Photo by Krists Luhaers on Unsplash

On the website, you can see how people watched different movies, at different halls, theatres, and timings! Some unique aspects include heat maps to show the most popular seats and animations to show the order in which seats were bought. This two-part article elaborates on how we obtained the data for the website and our analysis of the data.

Scraping the Data

To implement our idea, the first and maybe even the most crucial step was to collect the data. We decided to scrape the website of Shaw Theatres, one of the biggest cinema chains in Singapore.

Starting with basic knowledge of scraping in Python, we initially tried using Python’s requests library to get the site’s HTML and the BeautifulSoup library to parse it, but quickly realized that the data we required was not present in the HTML we requested. This was because the website is dynamic: it requests its data from an external source using Javascript and renders the HTML dynamically. When we request the HTML directly, that dynamic part is never rendered, hence the missing data.
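For context, our first attempt looked roughly like this (the URL and the tag we search for are illustrative, not the exact ones on Shaw’s site):

import requests
from bs4 import BeautifulSoup

# Hypothetical showtimes page; the real page loads its data via Javascript.
html = requests.get("https://www.shaw.sg/movies").text
soup = BeautifulSoup(html, "html.parser")

# The showtime elements are rendered client-side,
# so the raw HTML contains none of them.
print(soup.find_all("div", class_="showtime"))  # -> []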

To fix this issue, we used Selenium, a web browser automation tool that can first render the website with its dynamic content before grabbing the HTML.

Issues With Selenium

Getting the Selenium driver to work and fixing minor issues with it was a steep learning curve. After countless StackOverflow searches and ‘giving up’ multiple times, we managed to scrape through (pun intended) and get it to work.

The main issues we faced were:

  1. Scrolling to a specific portion of the page to click a button so that the data would appear in the HTML.
  2. Figuring out how to run headless Selenium on the cloud.
  3. After deploying the script on Heroku, some of the data was not being scraped even though the script worked properly on the local machine. After racking our brains, we figured out that some pages loaded by Selenium were defaulting to the mobile version of the page, and we fixed it by explicitly specifying the screen size (the sketch below shows the combined setup).
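Putting those three fixes together, our Selenium setup looked roughly like this (the page URL and CSS selector are illustrative placeholders):

from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless")               # required to run on the cloud
options.add_argument("--window-size=1920,1080")  # force the desktop version of the page
driver = webdriver.Chrome(options=options)

driver.get("https://www.shaw.sg/movies")  # hypothetical showtimes page
button = driver.find_element(By.CSS_SELECTOR, ".showtimes-toggle")  # illustrative selector
driver.execute_script("arguments[0].scrollIntoView();", button)     # scroll to the button before clicking
button.click()

html = driver.page_source  # now includes the dynamically rendered data
driver.quit()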

With Selenium and BeautifulSoup, we were finally able to get the data for all the available movie sessions for a particular day!

Sample movie session data:

{
  "theatre": "Nex",
  "hall": "nex Hall 5",
  "movie": "Jumanji: The Next Level",
  "date": "18 Jan 2020",
  "time": "1:00 PM+",
  "session_code": "P00000000000000000200104"
}

We were halfway there! Now we needed to collect the seat data for each movie slot to see which seats were occupied and when they were bought. After going through the Network tab of the website in the Developer Tools, we found that the seat data was being requested from Shaw’s API.

The data could be obtained by requesting the URL https://www.shaw.sg/api/SeatingStatuses?recordcode=<session_code>, where <session_code> is the unique code for each movie session that we had already scraped earlier.
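Fetching the seat map for one session then becomes a single request (using the sample session code from above):

import requests

session_code = "P00000000000000000200104"  # scraped earlier for each movie session
url = f"https://www.shaw.sg/api/SeatingStatuses?recordcode={session_code}"
seats = requests.get(url).json()  # one JSON object per seat in the hall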

The data came back in JSON format. We parsed it and reordered the seats in ascending order of buy time to obtain an array of JSON objects, where each object contains information about one seat in the movie hall, including seat_number, seat_buy_time, and seat_status.

Sample seat data:

[
  {
    "seat_status": "AV",
    "last_update_time": "2020-01-20 14:34:53.704117",
    "seat_buy_time": "1900-01-01T00:00:00",
    "seat_number": "I15",
    "seat_sold_by": ""
  },
  ...,
  {
    "seat_status": "SO",
    "last_update_time": "2020-01-20 14:34:53.705116",
    "seat_buy_time": "2020-01-18T13:12:34.193",
    "seat_number": "F6",
    "seat_sold_by": ""
  }
]
  • seat_number: Unique identifier for a seat in the hall
  • seat_status: Availability of the seat (SO = seat occupied, AV = available)
  • seat_buy_time: Time the seat was purchased by the customer
  • last_update_time: Time the seat data was last scraped
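Because the timestamps are ISO-8601 strings, reordering the seats by buy time is a one-liner:

# Unsold seats keep the "1900-01-01" placeholder and simply sort to the front.
seats_sorted = sorted(seats, key=lambda seat: seat["seat_buy_time"])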

Halls have anywhere between 28 and 502 seats, and each seat corresponds to one JSON object in the array. Add to that the fact that there are upwards of 350 movie sessions in a single day, and the amount of data generated gets pretty big: storing a single day’s data took about 10 MB. The movie session data was combined with the seat data and stored in a MongoDB database.
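Storing a combined document looked roughly like this (the connection string, database, and collection names are placeholders, and session is one of the session dictionaries scraped earlier):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
collection = client["popcorn_data"]["sessions"]    # placeholder database and collection names

session["seats"] = seats_sorted  # attach the ordered seat array to its session record
collection.insert_one(session)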

We managed to scrape all the movie data from Shaw for January 2020.

A single document in the database

{
  "theatre": "Nex",
  "hall": "nex Hall 5",
  "movie": "Jumanji: The Next Level",
  "date": "18 Jan 2020",
  "time": "1:00 PM+",
  "session_code": "P00000000000000000200104",
  "seats": [
    {
      "seat_status": "AV",
      "last_update_time": "2020-01-20 14:34:53.704117",
      "seat_buy_time": "1900-01-01T00:00:00",
      "seat_number": "I15",
      "seat_sold_by": ""
    },
    ...,
    {
      "seat_status": "SO",
      "last_update_time": "2020-01-20 14:34:53.705116",
      "seat_buy_time": "2020-01-18T13:12:34.193",
      "seat_number": "F6",
      "seat_sold_by": ""
    }
  ]
}

To view the full document, follow this link: https://gist.github.com/noelmathewisaac/31a9d20a674f6dd8524ed89d65183279

The complete raw data collected can be downloaded here:

Time to Get Our Hands Dirty

Photo by Markus Spiske on Unsplash

It was now time to get our hands dirty by cleaning the data and pulling out relevant information. Using pandas, we parsed the JSON, cleaned it, and built a DataFrame to make the data easier to read and filter.
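A minimal sketch of that step, assuming the documents are read back from the MongoDB collection used earlier:

import pandas as pd

sessions = list(collection.find())  # all session documents for the month
df = pd.DataFrame(sessions).drop(columns=["_id", "seats"])  # the bulky seat arrays are handled separately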

Since the seat data took up a lot of memory, we could not include all of it in the DataFrame. Instead, we aggregated the seat data using Python to obtain the following (a simplified sketch of this aggregation appears after the list):

  1. Total Seats: Total number of seats available for a movie session
  2. Sold Seats: Number of seats sold for a movie session
  3. Seat Buy Order: 2-dimensional array showing the order in which seats were bought
[['A_10', 'A_11'], ['A_12'], ['B_4', 'B_7', 'B_6', 'B_5'], ['C_8', 'C_10', 'C_9'], ['B_1', 'B_2'], ['C_6', 'C_7'], ['C_5', 'C_4'], ['B_8', 'B_10', 'B_9'], ['D_8'], ['A_15', 'A_14', 'A_13']]

Each element in the array represents the seats bought at the same time and the order of elements represents the order in which the seats were purchased.

4. Seat Distribution: Dictionary showing how many bookings were made in groups of 1, 2, 3, or more seats

{
'Groups of 1': 8,
'Groups of 2': 30,
'Groups of 3': 9,
'Groups of 4': 3,
'Groups of 5': 1
}

5. Seat Frequency: Dictionary showing the number of times each seat in a hall was bought over the month

{'E_7': 4, 'E_6': 5, 'E_4': 11, 'E_5': 9, 'E_2': 2, 'E_1': 2, 'E_3': 7, 'D_7': 15, 'D_6': 17, 'C_1': 33, 'D_2': 15, 'D_1': 14, 'B_H2': 0, 'B_H1': 0, 'D_4': 45, 'D_5': 36, 'D_3': 32, 'C_3': 95, 'C_4': 94, 'A_2': 70, 'A_1': 70, 'B_2': 50, 'B_1': 47, 'C_2': 37, 'C_6': 53, 'C_5': 61, 'B_4': 35, 'B_3': 40}

6. Rate of Buying: Two dictionaries, where the first shows, for each movie, the time left before the showing (in days) at each ticket purchase, and the second shows the corresponding cumulative number of tickets bought.

{"1917": [4.1084606481481485..., 2.566423611111111, 2.245578703703704, 2.0319560185185184, 1.9269907407407407, 1.8979513888888888....],
...}
{"1917": [1, 3, 8, 10, 11, ...],
...}
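For reference, here is a simplified sketch of how the first four aggregates can be computed from a session’s seat array (the seat-number formatting and the frequency and rate-of-buying calculations are left out):

from collections import Counter
from itertools import groupby

def aggregate_session(seats):
    # Only seats with status "SO" (seat occupied) were actually sold.
    sold = sorted(
        (s for s in seats if s["seat_status"] == "SO"),
        key=lambda s: s["seat_buy_time"],
    )

    # Seats sharing the same buy time were purchased in a single transaction.
    buy_order = [
        [s["seat_number"] for s in group]
        for _, group in groupby(sold, key=lambda s: s["seat_buy_time"])
    ]

    # Count how many transactions there were for each group size.
    distribution = Counter(f"Groups of {len(group)}" for group in buy_order)

    return {
        "total_seats": len(seats),
        "sold_seats": len(sold),
        "seat_buy_order": buy_order,
        "seat_distribution": dict(distribution),
    }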

The cleaned data can be viewed here:

Finally, we were done! (with 20% of the work)

With our data scraped and cleaned, we could now get to the fun part — analysing the data to find patterns amongst the popcorn litter.

To find our analysis of the data and interesting patterns we found, check out Part II of this article:

The code for the scraper can be found in the following GitHub repo:

Don’t forget to check out our website at http://popcorn-data.herokuapp.com!
