Neo4j for Bollywood

A fun project for learning graph databases

Published in Towards Data Science · 8 min read · Sep 24, 2021

Tired of all those JOINs in SQL? Do you get a headache every time you need to modify a schema in a relational database? If either answer is “Yes”, then you should give a graph database such as Neo4j a try.

A graph database stores information as nodes and edges. Nodes are connected via edges, and both can have properties. We can retrieve and aggregate data with queries. Because their logic and semantics are closer to the way our minds model the real world, graph databases are easy to learn. Users of relational databases will quickly notice the similarities between SQL and Cypher, the query language of Neo4j. In addition, Neo4j comes with aggregate functions and machine learning modules that can deliver quick insights into the data.

In my previous articles (Neo4j for antibiotic resistance, for diseases and for genome analyses), I have shown that the graph database Neo4j can deliver many interesting insights that are not so obvious in relational databases. But they were all about bioinformatics, not the easiest subject out there, and they focused on the advantages of Cypher over SQL. Readers have asked me for a more general introduction to Neo4j to flatten the learning curve. I therefore searched for data for a Neo4j beginner project, especially one for readers with an SQL background. Recently, I stumbled upon the dataset of Bollywood movies (CC BY 4.0) by P. Premkumar and found it a nice fit.

The portmanteau Bollywood is coined from Bombay and Hollywood. As its name suggests, it is the Hindi/Urdu-language film industry based in Bombay. Its movies cover a wide range of topics, and some of them even touch on sensitive political and religious subjects. Many of the movies contain joyful dance scenes that lighten the mood. Bollywood movies are very popular around the world. For example, the movie Dangal was such a huge success in China that two thirds of its $330 million worldwide gross came from China alone. Hollywood has been written about abundantly, but Bollywood rarely gets its fair share of attention. It is therefore worthwhile to learn more about this film industry. The tabular data by Premkumar is small, but it contains 1,698 Bollywood movies released between 2005 and 2017, so I consider it an excellent data source for this introductory project. The code for this project is hosted on my GitHub repository here:

GitHub - dgg32/bollywood (github.com/dgg32/bollywood)

The dataset is from https://boxofficeindia.com. In comparison to IMDB.com, the data from boxofficeindia.com are very incomplete. Also, be aware that the table only lists the lead actors, so it underestimates the actor-movie connections. In addition, this project only includes actors, directors, and their movies. There is also a small gotcha in the raw data: the names of the budget and revenue columns have been swapped.

1. Prepare and import the data

First, download the data from the link here. Swap the column names Budget(INR) and Revenue(INR). Next, run all the cells in my Python notebook prepare.ipynb. They should generate three CSV files in the data_for_neo4j folder. You can find all the data files in my GitHub repository too.
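The column swap can be done by hand in a spreadsheet, or with a few lines of Python. The following is only a minimal sketch on a tiny in-memory CSV: the column names Budget(INR) and Revenue(INR) come from the dataset, while the `Movie Name` column, the sample row and the function name are made up for illustration.

```python
import csv
import io

def swap_budget_and_revenue(csv_text):
    """Return the CSV text with the two header names exchanged.

    Only the header row is touched; the data rows stay as they are.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    swap = {"Budget(INR)": "Revenue(INR)", "Revenue(INR)": "Budget(INR)"}
    rows[0] = [swap.get(name, name) for name in rows[0]]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()

# Hypothetical sample: one header row and one made-up movie.
sample = "Movie Name,Budget(INR),Revenue(INR)\nExample Film,500,100\n"
fixed = swap_budget_and_revenue(sample)
print(fixed)
```

For the real file, the same function could be applied to the text read from the downloaded CSV before running the notebook.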

Open Neo4j Desktop. Create a project called bollywood and put the three CSV files into its import folder.

Afterwards, open Neo4j Browser and enter the following commands. They will import the data into the database (See the “2. Import data into Neo4j” section for detailed instructions).

And we can check how the three types of nodes (actor, director and movie) are connected:

call db.schema.visualization()

Figure 1. Schema of the Bollywood project. Image by author.

2. Overview

Let’s first get an overview of the dataset. Open Neo4j Bloom in Neo4j Desktop. In Settings, increase the Node query limit to 3,600. In Perspective -> Search phrases, add a Search phrase called match all with the following query:

Return to the main window. In Search graph, select match all and run it. Bloom should generate the topological overview of the dataset (Figure 2).

Figure 2. The topological overview of the Bollywood dataset. Green: movie; Red: actor; Blue: director. Image by author.

The overview immediately gives us an overall impression of the Bollywood movie world. Many of the actor-movie-director trios are “islands”. But a large cluster is visible. We can learn more about it by zooming in or using the filters:

Figure 3. A detailed portion of the large cluster in the dataset. Green: movie; Red: actor; Blue: director. Image by author.

It is clear that this large cluster is held together by prolific Bollywood actors such as the Three Khans, Akshay Kumar, Emraan Hashmi and directors such as Vikram Bhatt and Mohit Suri.

We can also have a look at the genre distribution in the dataset:

The documentation from Neo4j explains that WITH is like the pipe operator in Bash. In the query above, I calculated the total number of films with the WITH statement and then used it in the percentage calculation. The results suggested that dramas, comedies, and thrillers are the top three movie genres in Bollywood.
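For readers who think more easily in Python, the two-stage logic can be sketched as follows. This is not the Cypher query itself, only an illustration of the idea: the first stage computes the total (the value that WITH would carry forward), and the second stage uses it for the percentages. The genre list is made up.

```python
from collections import Counter

# Made-up genre labels, one entry per movie.
genres = ["drama", "drama", "comedy", "thriller",
          "drama", "comedy", "horror", "thriller"]

# Stage 1: the total number of films, computed once and passed on.
total = len(genres)

# Stage 2: count per genre and the share of the total in percent.
counts = Counter(genres)
distribution = sorted(
    ((genre, n, round(100 * n / total, 1)) for genre, n in counts.items()),
    key=lambda row: row[1],
    reverse=True,
)

for genre, n, percent in distribution:
    print(genre, n, percent)
```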

3. Get some statistics

After the overview, we can quickly calculate statistics with Neo4j. For starters, let’s calculate the combined revenue of Aamir Khan’s movies:

This Cypher query first matched all films with Aamir Khan as the lead actor and then summed up their revenues. It quickly gave us a total of 29,030,895,000 Indian rupees ($393,932,987). As one of the most successful and influential actors in Bollywood, Aamir Khan has played the leading role in several films that were the highest-grossing Indian films of their year, such as Ghajini, 3 Idiots, Rangeela, Dhoom 3, PK, and Dangal. He has also been called “the King of the Chinese box office” because his films were hugely successful in China too.

Next, we can see the top earners:

In contrast to the previous query, this query removed the constraint on the actor nodes. It calculated the revenue-budget ratio of each film to show how profitable it was relative to its cost. Finally, the query sorted the results by revenue in descending order.
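The same ranking logic can be sketched in plain Python. The movie titles, numbers and field names below are made up and are not taken from the dataset; the sketch only illustrates the ratio calculation and the sort order described above.

```python
# Made-up movies; "budget" and "revenue" are hypothetical field names.
movies = [
    {"title": "Film A", "budget": 100, "revenue": 500},
    {"title": "Film B", "budget": 400, "revenue": 800},
    {"title": "Film C", "budget": 50, "revenue": 20},
]

def top_earners(movies, limit=10):
    """Add a revenue-budget ratio and return the highest-revenue films."""
    ranked = [
        dict(movie, ratio=movie["revenue"] / movie["budget"])
        for movie in movies
        if movie["budget"] > 0  # avoid dividing by a missing or zero budget
    ]
    ranked.sort(key=lambda movie: movie["revenue"], reverse=True)
    return ranked[:limit]

top = top_earners(movies, limit=2)
for movie in top:
    print(movie["title"], movie["revenue"], movie["ratio"])
```

Note that the list is sorted by revenue, not by ratio, so a film with a modest ratio can still rank first if its absolute revenue is the highest.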

The query revealed that Prabhas’s Baahubali 2 led the list. But that movie was also quite expensive to make. In comparison, Dangal showed a revenue-budget ratio of 5. Again, the Khans dominated the list. Be aware that these results do not fully agree with the data from Wikipedia.

We can also see the list of the biggest box-office bombs.

This quick query returned the top 10 flops in the dataset. At the top of the list was Bombay Velvet, which was criticized for its script and direction. The occupancy on the first day was only 10–20%, and all the theaters removed the film on the third day. It burnt a hole of 748,635,000 INR in its investors’ pockets. The second spot was occupied by Broken Horses. The film received mostly negative reviews, and it was also a commercial flop. But for me, having an Indian director alone does not qualify the film as a bona fide Bollywood movie.

Afterwards, we can also see which actors have the most thrillers or horror movies under their belt.

It appeared that Emraan Hashmi and Ajay Devgn were the most active thriller actors. Mr Hashmi has successfully established himself as a thriller movie star. And we can see the list of all his movies in the dataset:

It turns out that Mr Hashmi was also the lead actor in the last three installments of the horror movie series Raaz. According to the dataset, horror movies are not that common in Bollywood (52 out of 1,695, or 3%, see “2. Overview”). These three horror films alone have already made Mr Hashmi the actor with the most horror films in our dataset.

Finally, let’s see how often some actor-director duos worked together.

The bromance between superstar Ajay Devgn and director Rohit Shetty is well known. The film Phool Aur Kaante from 1991 was their first movie together. And so far they have worked together in 11 movies.
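Counting duos means walking from an actor to a movie and from that movie to its director, and then counting how often each actor-director pair occurs. A minimal Python sketch of that traversal is shown below; the names and the two edge lists (`acted_in`, `directed`) are made up and only stand in for the relationships in the graph.

```python
from collections import Counter

# Made-up edges: (person, film) pairs for each relationship type.
acted_in = [
    ("Actor X", "Film 1"),
    ("Actor X", "Film 2"),
    ("Actor Y", "Film 2"),
    ("Actor X", "Film 3"),
]
directed = [
    ("Director P", "Film 1"),
    ("Director P", "Film 2"),
    ("Director Q", "Film 3"),
]

# Index the directors by film so each acting edge can be joined to them.
directors_of = {}
for director, film in directed:
    directors_of.setdefault(film, []).append(director)

# Follow actor -> film -> director and count each pair.
duos = Counter(
    (actor, director)
    for actor, film in acted_in
    for director in directors_of.get(film, [])
)

most_common = duos.most_common(1)
print(most_common)
```

In Cypher, the join through the shared movie node is expressed directly in the pattern, so no lookup table is needed.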

4. Community detection

In Section 2, the topological overview by Neo4j Bloom showed us a large cluster centered around some of the biggest names in Bollywood. Now the question is whether it is possible to obtain that community through a query. With the help of Neo4j’s Graph Data Science Library (GDS), the answer is a resounding “Yes”.

In my previous article “Neo4j for Diseases”, I showed how to use the Louvain algorithm to calculate communities. But in this project, I failed to get optimal results with Louvain. I then found the Medium article “Subgraph filtering in Neo4j Graph Data Science library” by Tomaz Bratanic. In it, he mentions another algorithm, “Weakly Connected Components (WCC)”. This algorithm finds sets of connected nodes in an undirected graph, where all nodes in the same set form a connected component. My test showed that it could quickly identify disconnected groups and effectively isolate the large cluster from Section 2.

To use WCC, you need to enable the GDS plugin in your project (read the instructions in Section 5 of Neo4j for Diseases). First, we need to create an in-memory graph with the command:
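To make the idea behind WCC concrete, here is a small Python sketch that finds connected components with a union-find structure. It is only an illustration of the concept on a made-up toy graph; the GDS library may implement the algorithm differently, and the function name is my own.

```python
def weakly_connected_components(nodes, edges):
    """Group nodes into components, ignoring the direction of the edges."""
    parent = {node: node for node in nodes}

    def find(node):
        # Walk up to the root and shorten the path along the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # Merge the two components that each edge connects.
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    components = {}
    for node in nodes:
        components.setdefault(find(node), set()).add(node)
    # Largest component first, similar to ranking components by size.
    return sorted(components.values(), key=len, reverse=True)

# Made-up toy graph: one trio, one pair and one isolated node.
nodes = ["Actor X", "Film 1", "Director P", "Actor Y", "Film 2", "Actor Z"]
edges = [("Actor X", "Film 1"), ("Director P", "Film 1"), ("Actor Y", "Film 2")]

components = weakly_connected_components(nodes, edges)
print([len(component) for component in components])
```

In the toy graph, the actor-movie-director trio forms one "island", just like the small islands seen in the Bloom overview.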

Afterwards, we run WCC. It returns the top 10 largest communities (called “components” in WCC):

We can see that Component 0 has as many as 1,750 nodes. However, I have found that some nodes were double-counted. Let’s see what the nodes are with the DISTINCT keyword:

The query returns 1,734 nodes instead of the original 1,750. We can even show the networks in Neo4j Browser. But first, adjust the visualization parameters in Neo4j Browser (Figure 4).

Figure 4. Settings to display large clusters. Image by author.

And then run the command:

The COLLECT function transforms the names into a list. We then do a normal MATCH query and filter the results with that list.
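The deduplicate-collect-filter pattern can also be sketched in Python. The rows below are made up and merely mimic a WCC result with one duplicated name; the sketch is not the query used above, only an illustration of its logic.

```python
# Made-up WCC output: (name, component id), with "Actor X" listed twice.
rows = [
    ("Actor X", 0),
    ("Film 1", 0),
    ("Actor X", 0),
    ("Actor Y", 1),
    ("Film 2", 1),
]
# Made-up relationships between the named nodes.
edges = [("Actor X", "Film 1"), ("Actor Y", "Film 2")]

# DISTINCT + COLLECT: gather each name of Component 0 once, in order.
names = []
for name, component in rows:
    if component == 0 and name not in names:
        names.append(name)

# MATCH + filter: keep only the edges whose two ends are both in the list.
subgraph = [(a, b) for a, b in edges if a in names and b in names]

print(names)
print(subgraph)
```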

Figure 5. Component 0 in Bollywood data. Image by author.

Upon careful inspection, we can confirm that this is the large cluster we observed in Neo4j Bloom from Section 2.

Conclusions

The film Arrival has a message: a new language can change the way you think. This is also true for the transition from SQL to Cypher. Unlike in SQL, we need to define relationships explicitly in Cypher. Cypher is intuitive because we can formulate queries through the relationships without using foreign keys. In other words, Cypher makes us think in networks instead of tables. And the graph visualizations in Neo4j Desktop and Bloom make this easy by letting us explore the relationships interactively.

This project shows that Neo4j not only excels with relationship-rich data, but can also aggregate tabular data as easily as SQL. This tutorial primarily showcases the statistical functions in Neo4j. Readers can compare my Cypher queries above with the equivalent SQL queries and feel the differences in the syntax and the expressiveness of the two languages. According to Wikipedia, the syntax of Cypher is based on ASCII art, so the queries look very visual and are easy to understand. Combined with its aggregating and Graph Data Science functions, Neo4j can do just about anything that a relational database can do, and then some.

The Bollywood data also made the project a lot of fun. We learned that some of the biggest names in Bollywood have impressive statistics. And this knowledge may help us enjoy Bollywood movies more in the future.

Here is a provocative idea: can we only use graph databases such as Neo4j for all our needs?

Join Medium with my referral link - Sixing Huang

dgg32.medium.com