Hands-on Tutorials
Using clustering algorithms to segment Steven Spielberg films

Scene 1: Setting the Stage for Cinematic Clustering
"Whether in success or in failure, I’m proud of every single movie I’ve directed."
– Steven Spielberg
With a career spanning nearly six decades, filmmaker Steven Spielberg burst onto the scene with Jaws, making 1975 the "summer of the shark" according to Time Magazine. Since then, he’s built a legacy few have rivaled as he continually pushed himself – and audiences – outside their comfort zones. As Bilge Ebiri of Rolling Stone noted, Spielberg "has always alternated between blockbusters and more serious fare," with his films spanning multiple genres. Along the way he’s racked up 18 Academy Award nominations, finding a way to garner both critical and commercial acclaim as his 30+ films have grossed more than $10 billion at the domestic box office when adjusted for inflation. He doesn’t seem to be slowing down either; a quick glance at his IMDb page hints at the efforts underway to perhaps direct his next big hit.
Attempting to categorize or segment the filmography of perhaps the most well-known movie director of all time is no simple task. But let’s give it a shot. A simple chronological list or genre breakdown could suffice, but what if we took a more data-driven approach by assessing multiple film attributes, potentially surfacing underlying patterns or themes amongst the collection of Spielberg films?
Scene 2: K-Means Clustering, An Overview
Unsupervised Learning refers to models which attempt to extract meaning when there is no explicit identification of a label, class, or category for each data point. Clustering is an unsupervised learning approach that groups unlabeled data to help us better understand patterns in a dataset. Clustering is a key component of widely used recommendation engines embedded into the products of Amazon, Netflix, and many others that millions of people use every single day. While we’ll use clustering as the building blocks for a very simplistic movie suggestion tool here, other applications include:
Personalized marketing – Using demographic data from a customer database, marketers can group different types of users and offer them more appropriate products or personalized campaigns. Knowing that people in this group like "x" (and not "y") can decrease wasted ad spend while increasing campaign conversion rates.
Anomaly detection – Anomalies are data points that deviate from the common trends or patterns in a dataset. Combined with the visual nature of K-Means clustering, predetermined thresholds or boundary distances can be set to help observe and isolate outliers that stray away from the pack.
File reduction/image compression – Digital images are composed of pixels, each of which has a size of 3 bytes (RGB), with each byte having a value from 0 to 255, leading to a massive number of possible color combinations for each pixel. K-Means can be applied to image files to group like or similar colors to create a new version where the file size is reduced but the image is still interpretable. For example, instead of an image of a dog containing 200 colors, K-Means clustering can perhaps present the image using just 10 colors, without substantially degrading the visual nature of the image (see the sketch just after this list).
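To make that last idea concrete, here is a minimal sketch of color quantization with scikit-learn and Pillow; the file name dog.jpg is just a placeholder, not something from the original analysis:

```python
# A minimal K-Means color-quantization sketch (hypothetical file "dog.jpg").
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

img = np.asarray(Image.open("dog.jpg"))   # (height, width, 3) array of RGB values
pixels = img.reshape(-1, 3)               # one row per pixel

# Group pixels into 10 color clusters, then repaint each pixel with its centroid.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=42).fit(pixels)
quantized = kmeans.cluster_centers_[kmeans.labels_].astype(np.uint8)

Image.fromarray(quantized.reshape(img.shape)).save("dog_10_colors.jpg")
```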
A multitude of clustering algorithms exist; the problem you’re attempting to solve as well as the data at your disposal will often dictate which approach to take. K-Means clustering, one of the most widely used approaches, attempts to classify unlabeled data by creating a certain number of clusters, as defined by the letter "k". The algorithm determines the appropriate clusters and assigns every data point to a cluster based on its proximity (or similarity) to a centroid, which is the fancy name for the center of a cluster. It’s important to note that K-Means doesn’t ensure each cluster will have the same size but instead focuses on identifying the clusters that are most appropriately separated and grouped.
K-Means clustering, often considered a "partitioning" technique, offers several advantages when compared to other clustering approaches. For starters, it’s fairly simple: tell the model how many groups you want and assign all the data points to the appropriate group. This simplicity enables it to run quickly even on very large datasets. Additionally, its popularity has led to a plethora of open-source resources and related libraries in Python – which we’ll explore later.
While K-Means clustering is relatively simple to use and scales well with large datasets, a handful of drawbacks accompany the approach. K-Means clustering requires strictly numeric data, which might require some data preparation based on the dataset available. Also, the algorithm can be quite sensitive to outliers. Additionally, K-Means doesn’t scale very well with an increasing number of dimensions or variables; we’ll see how two or even three variables work well for this technique, but moving beyond that can create difficulties when it comes to interpretation. Fortunately, dimension reduction techniques such as principal components analysis (PCA) pair nicely with K-Means.
Interpretation is the most important step of any K-Means clustering exercise; as Dawn Iacobucci writes in her textbook _Marketing Models_, "once clusters have been derived, we must make sense of them." A proper analysis of the cluster output should reveal both answers to existing questions as well as generate new questions to ask about your dataset. The size, location, and number of clusters are all potentially helpful clues to inform the next steps of your analysis.
Scene 3: Breaking Down the Algorithm
The process of the K-Means clustering algorithm can be broken down into a few steps:
Step 1 (initialization) – First, we’ll need to decide how many clusters (and therefore centroids) we want. This step can either be done randomly or informed by the elbow method, which will be explained later. We’ll tell the model to select "K" random points that will serve as our initial centroids.
Step 2 (assignment) – The algorithm then assigns each data point to the closest centroid as measured by squared distance.
Step 3 (recalculation) – The centroids are then recalculated or adjusted by taking the mean of all data points currently assigned to that cluster.
The iterations of steps 2 and 3 continue until no further reassignments need to be made and every data point sits in the cluster whose data most closely resembles its own. Essentially, the algorithm keeps iterating in order to minimize the sum of squared errors (SSE), a common measure of clustering performance. This iterative process showcases the idea of "grouping like data points" by minimizing the variability within clusters and maximizing the variability between clusters.
When centroid positions are chosen randomly in step 1, K-Means can return different results on successive runs of the algorithm. To overcome this limitation, the algorithm can be run multiple times, keeping the run with the lowest SSE.
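To make those three steps concrete, here is a bare-bones NumPy sketch of the assign-and-recalculate loop. The variable names and structure are my own, not taken from the original post; scikit-learn’s implementation handles edge cases such as empty clusters, and covers the multiple-restart idea above via its n_init parameter.

```python
import numpy as np

def simple_kmeans(X, k, n_iter=100, seed=0):
    """Bare-bones K-Means on an (n_samples, n_features) array X."""
    rng = np.random.default_rng(seed)
    # Step 1 (initialization): choose k random data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2 (assignment): attach each point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3 (recalculation): move each centroid to the mean of its points
        # (assumes no cluster ends up empty, which production code handles).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # no further reassignments needed
            break
        centroids = new_centroids
    sse = ((X - centroids[labels]) ** 2).sum()     # sum of squared errors (SSE)
    return labels, centroids, sse
```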

Scene 4: Reading in and Exploring Data
While R is a completely suitable tool for this type of analysis, we’ll be using Python given its access to helpful K-Means methods as well as its strong plotting capabilities. Before loading up our data to analyze – and ultimately recommend – Spielberg films, we’ll import the libraries needed for K-Means, along with a few others for subsequent plotting, and read in our dataset.
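The original notebook isn’t reproduced line for line here, but a setup along these lines covers everything used below; the file name spielberg_films.csv is my placeholder for the combined dataset:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.cluster import KMeans

# Placeholder file name: one row per Spielberg-directed film, combining
# IMDb, Rotten Tomatoes, and The Numbers data.
films = pd.read_csv("spielberg_films.csv")
films.head()
```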

I’ve combined data from a few different sources – IMDb, Rotten Tomatoes, and The Numbers – to gather a collection of films crediting Spielberg as the director and relevant variables for each film such as year, film title, TomatoMeter rating, domestic box office performance (adjusted for inflation), and film duration. As mentioned above, K-Means clustering requires all numeric data; fortunately, that’s the case here, so we won’t need to transform our data set. (Check out my previous post for more detail on the value of data cleansing and preparation as well as this guide for data preparation efforts specific to clustering).
Pair plots can be a helpful tool to quickly explore distributions and relationships in your dataset. We’ll use the .pairplot function from the seaborn library; with a single line of code we get a sense of the relationships between our numeric variables (BoxOffice, TomatoMeter, and Duration). Unsurprisingly, we notice a weak but positive correlation between BoxOffice and TomatoMeter, indicating some shared attitudes between film critics and paying moviegoers when it comes to Spielberg films.
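Assuming the column names used throughout this post, that single line looks roughly like this:

```python
# Pairwise scatterplots (with histograms on the diagonal) of the numeric variables.
sns.pairplot(films[["BoxOffice", "TomatoMeter", "Duration"]])
plt.show()
```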

Visualizing our data helps to quickly give a sense of its shape, distribution, and symmetry. In addition to scatterplots, the .pairplot method also provides histograms of our variables (the .displot method can also be used to achieve this). We notice some tails on both, with TomatoMeter having a left-skewed distribution and BoxOffice skewing slightly right. Skewness can be removed with a logarithmic transformation, assuming the variables are positive. For the sake of this exercise – and given the relatively small size of our dataset and the fairly mild skewness – we’ll hold off on running that transformation, but it’s still good to know about.
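For reference only – we skip this step in the analysis – the transformation would be a one-liner:

```python
# Example log transform to reduce the right skew in BoxOffice; not applied below.
sns.displot(np.log(films["BoxOffice"]))
plt.show()
```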
Lastly, let’s use the graphing library plotly to create an interactive scatter plot that shows the films plotted against our two variables of interest, Rotten Tomatoes and Box Office.
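Something along these lines produces that interactive view; the Title and Year column names are assumptions about the dataset:

```python
# Interactive scatter: hover over a point to see which film it is.
fig = px.scatter(
    films,
    x="TomatoMeter",
    y="BoxOffice",
    hover_name="Title",   # assumed column holding the film title
    hover_data=["Year"],  # assumed column holding the release year
    title="Spielberg films: critical vs. commercial success",
)
fig.show()
```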
As noted above, K-Means can be very sensitive to outliers in a dataset. Removing these points before running the K-Means algorithm is advisable, but given the absence of major outliers and a relatively small dataset, we will skip that step.
Scene 5: Determining the Number of Clusters
When the data doesn’t suggest an obvious, predetermined number of clusters, we turn to the Elbow Method, a statistical approach that helps identify an appropriate number of clusters. Recall that clustering aims to define clusters so that the total intra-cluster variation (as measured by distance from the points to the centroid, also known as the WCSS or within-cluster sum of squares) is minimized. Naturally, an increase in the number of clusters has an inverse relationship with error; more clusters provide more options for observations in our data to find similar groups to join. However, adding more and more clusters creates smaller incremental gains and may lead to overfitting. The elbow method helps us strike the appropriate balance between too many and too few clusters.
The elbow method also adds a visual element to this effort. A plot of the elbow will show the WCSS value for each value of K, representing the total number of clusters. The point where we start to see diminishing returns, which can sometimes look like an elbow joint, can be considered a good candidate for the value of K.
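A typical elbow-plot loop, continuing from the setup above, looks like this:

```python
# Fit K-Means for k = 1..10 and record the WCSS (scikit-learn calls it inertia_).
X = films[["TomatoMeter", "BoxOffice"]].values
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Within-cluster sum of squares (WCSS)")
plt.title("Elbow method")
plt.show()
```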

It looks like the ideal number of clusters for us is three! No hard and fast rule exists when determining the number of clusters to use for analysis; therefore, questions such as "are these clusters interpretable?" or "how likely are they to be replicated with new data?" will help to inform the final decision. Proper application of the Elbow Method serves as a great example of how data analysis requires a unique blend of art and science.
Scene 6: Applying & Interpreting K-Means
While it’s usually best practice to run K-Means on both unscaled and scaled versions of the data, we’ll stick to running the algorithm on an unscaled version of our data set for the sake of simplicity and to maintain the interpretability of the variables and their relative scales. We can run the k_means() function from the cluster module of scikit-learn, making the implementation of K-Means relatively straightforward.
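A sketch of that call and a first plot of the result follows; the lower-level k_means() helper returns the centroids, labels, and SSE directly, and the KMeans class imported earlier would work just as well:

```python
from sklearn.cluster import k_means

# Fit three clusters, as suggested by the elbow plot, on the unscaled data.
centroids, labels, sse = k_means(X, n_clusters=3, n_init=10, random_state=42)
films["Cluster"] = labels

# First look at the clustered films, with centroids marked as black X's.
sns.scatterplot(data=films, x="TomatoMeter", y="BoxOffice",
                hue="Cluster", palette="Set1")
plt.scatter(centroids[:, 0], centroids[:, 1], marker="X", s=200, c="black")
plt.show()
```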

This is our first glimpse at our clustered data! But our job is far from over. As noted in the fantastic textbook Python for Marketing and Research Analytics, "the most crucial question in a segmentation project is the business aspect: will the results be useful for the purpose at hand?"
Thinking back to the original intent of identifying unique groupings of films directed by Spielberg, we can turn to Python to help us identify other puzzle pieces to pair with the above plot (while also using some simple tricks to help enhance the graphic itself). For example, we’ll use the .countplot method to get a sense of the relative size of each cluster while also assigning them to numeric groupings.
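One line of seaborn, continuing from the clustered dataframe above, gives us that count:

```python
# Relative size of each cluster (0, 1, and 2).
sns.countplot(data=films, x="Cluster")
plt.title("Number of films per cluster")
plt.show()
```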

Imbalanced clusters usually indicate distant outliers or a group of data points that are quite dissimilar, which would warrant further investigation. By glancing at the graphic above, we can see how Cluster 2 does seem to stand out for its relatively small size – even for arguably the greatest director ever, creating a movie that audiences and critics both overwhelmingly adore is still really hard to do! Given the three films in this cluster strike the desired balance of both box office and critical success, it’s certainly worth seeing if we can identify similarities between these three films beyond the person sitting in the director’s chair.
Calculating summary statistics of the centroids can help to start answering the question of "what makes this cluster unique?" We’ll add back in the duration variable here to see if that can help unlock some new insights.
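A quick groupby, with Duration added back in, covers the summary statistics:

```python
# Mean of each attribute per cluster, including Duration (which the
# clustering itself did not use).
cluster_summary = (
    films.groupby("Cluster")[["TomatoMeter", "BoxOffice", "Duration"]]
    .mean()
    .round(1)
)
print(cluster_summary)
```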

We’ll also create a Snake Plot so we can look at the variation across the variables of interest for each of the clusters in a more visual manner. This will require a normalized version of our dataset along with the cluster assignments. We’ll use the pandas .melt method to rearrange the normalized dataframe into an appropriate layout.
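A sketch of that normalization, reshape, and plot is below; z-scores are used for the normalization here, though the original may have scaled differently:

```python
# Standardize each attribute, keep the cluster labels, and melt to long format
# so each row is (Cluster, Attribute, Value).
cols = ["TomatoMeter", "BoxOffice", "Duration"]
normalized = (films[cols] - films[cols].mean()) / films[cols].std()
normalized["Cluster"] = films["Cluster"]

melted = normalized.melt(id_vars="Cluster", var_name="Attribute", value_name="Value")

# Snake plot: one line per cluster across the normalized attributes.
sns.lineplot(data=melted, x="Attribute", y="Value", hue="Cluster", palette="Set1")
plt.title("Snake plot of normalized cluster attributes")
plt.show()
```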

We can see how this plot offers a helpful visual representation of how the generated clusters differ across the attributes of interest. For example, it would seem as if the most distinct differentiator between the three clusters is the BoxOffice variable. Let’s bring back the K-Means scatterplot from above with some helpful annotations to bring everything together.
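Re-plotting with annotations only takes a small loop (again, Title is an assumed column name):

```python
# Cluster scatterplot with each film labeled by title.
ax = sns.scatterplot(data=films, x="TomatoMeter", y="BoxOffice",
                     hue="Cluster", palette="Set1")
for _, row in films.iterrows():
    ax.annotate(row["Title"], (row["TomatoMeter"], row["BoxOffice"]),
                fontsize=7, alpha=0.8)
plt.title("Spielberg films clustered by critical and commercial success")
plt.show()
```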

When combined, the Cluster Average, the Snake Plot, and the K-Means visualization can help to inspire some appropriate names that help us delineate our clusters. The red Cluster 0 could be "A Mixed Bag": while none of these films broke box office records, the group contains a wide range of TomatoMeter ratings. The green Cluster 1 could be the "Strong Contenders" cluster, a representation of films that were mostly liked by critics, outside of Jurassic Park 2, and performed fairly well at the box office. Finally, we could refer to the smaller, yellow Cluster 2 as "The Best Of". If you had to curate a Steven Spielberg Starter Pack for your film-loving friends, this group would be a great place to start.
What jumps out to me is the incredible proportion of quality, highly regarded films Spielberg has directed. Considering the average TomatoMeter score for a film was 59.3 in 2019, Spielberg’s films averaged a staggering 78. Another interesting takeaway emerges when we look at Duration, a variable included in the snake plot and summary statistics chart but not in our K-Means algorithm. The stark inverse relationship between Duration and BoxOffice raises the question of whether cutting a few scenes or lines of dialogue might lead to more commercial success for Spielberg’s next slew of films.
Remember, it’s critical to think about connecting the outputs of the cluster analysis to the eventual business objective or question you are attempting to answer. At the end of the day, a cluster output is simply a vector of group assignments for data points; it’s up to the data analyst to figure out whether that tells an insightful story!
Scene 7: Limitations, Future Research, Additional Resources, and Final Thoughts
As we noted above, the dataset we were working with here was relatively small in both the number of observations and the number of variables. Encoding film-related categorical variables such as genre or MPAA rating would be one way to potentially surface interesting patterns and enhance our film suggestion efforts. Moreover, combining our approach with a time-series analysis to see how Spielberg films have changed throughout the years is another avenue to explore in future research. In terms of approach, we only scratched the surface of clustering techniques in this discussion. Other methods like hierarchical clustering – and their fun-to-say "dendrograms" – help determine which data points are most similar (or dissimilar). While K-Means takes a global approach to analyzing a dataset, hierarchical clustering could lead to interesting insights about which of Spielberg’s films are most similar based on the variables at our disposal.
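As a taste of that future direction, a Ward-linkage dendrogram takes only a few lines with SciPy (Title again being an assumed column name):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Hierarchical (Ward) clustering on the same two variables used for K-Means.
Z = linkage(films[["TomatoMeter", "BoxOffice"]].values, method="ward")
dendrogram(Z, labels=films["Title"].tolist(), leaf_rotation=90)
plt.title("Hierarchical clustering of Spielberg films")
plt.tight_layout()
plt.show()
```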
If you’d like to learn more about the topic or check out additional examples of clustering algorithms in action, I’ve included below a list of helpful K-Means clustering articles and tutorials I came across in the process of researching and writing this post:
- Ben Alex Keen – K-Means Clustering in Python
- Amanda Dobbyn – Beer-in-Hand Data Science
- Real Python – K-Means Clustering in Python: A Practical Guide
- Variance Explained – K-Means clustering is not a free lunch
- Oracle AI & Data Science Blog – Introduction to K-Means Clustering
While I don’t think our very simplistic recommendation model will start stealing subscribers away from Netflix anytime soon, hopefully this high-level overview of K-Means clustering showed how even the simpler corners of Python can uncover unique insights and patterns in a dataset – the same kinds of patterns that inform some of the most widely used recommendation engines.
Check out the original code here. What did I forget or what would you have done differently? Any and all feedback is welcome via a comment below, or drop me a note at [email protected]. Thanks for sticking with this post til the end; people who enjoyed this article also enjoyed…. just kidding ; )