Getting Started

You don’t need a linear regressor to recognize one of the core tenets of data science – bad data leads to a bad study. This was vividly demonstrated in my second project at the Metis data science bootcamp: a linear regression model designed to predict the gross of a Broadway play or musical adapted from a film, based on the commercial success of said film. In this article, we’ll talk about data purity, and why a lack of it in this study led to a model that isn’t quite ready for the stage.
(For anyone interested in the nuts and bolts, check out the project repo!)
What is data purity?
Data purity is a term I first encountered while learning SQL, referring to a value in a column falling outside that column’s valid range. If you have a column with numbers representing the day of the week, any number greater than 7 would be considered impure. There are seven days in a week, so any number beyond that must come from a computational or human error along the way. This isn’t quite the same as an outlier (like a "1" representing Monday in a table full of weekend days – 6s and 7s); it refers to values that are logically and statistically impossible (-9, or 900).
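In pandas terms, a purity check is just a filter for values that can’t possibly be valid. A minimal sketch, with a made-up day-of-week column:

```python
import pandas as pd

# Made-up column that should only ever contain 1-7 (Monday through Sunday)
df = pd.DataFrame({"day_of_week": [6, 7, 1, 900, -9]})

# An outlier check might flag the lone weekday; a purity check flags the
# values that are logically impossible for this column.
impure = df[~df["day_of_week"].between(1, 7)]
print(impure)  # the rows containing 900 and -9
```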
Data purity becomes of particular concern when you’re working with data scraped from the web. Web scraping, a meticulous but oddly satisfying task, is best described as coding thousands of little bots to go into webpages and extract data. It is also necessary for some enormous data collection tasks – like scraping information from 10,000+ IMDB pages, for example. This was the first step in my project, and among other features I was specifically interested in the budget and the domestic and worldwide gross of each film. These variables are considered features, probably referred to by your high school science teacher as independent variables.
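For the curious, the scrape looked roughly like the sketch below – a simplified, hypothetical version using requests and BeautifulSoup. The CSS selector is a placeholder (IMDB’s markup changes often), so treat this as the shape of the approach rather than copy-paste-ready code:

```python
import requests
from bs4 import BeautifulSoup

def scrape_film_financials(imdb_url: str) -> dict:
    """Collect budget and gross text from a single IMDB title page.
    The selector below is illustrative only and would need to be
    checked against the live page before a real scrape."""
    response = requests.get(imdb_url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(response.text, "html.parser")

    financials = {}
    # Box-office figures tend to appear as labeled list items; store
    # whatever label/value pairs we can find.
    for item in soup.select("li[data-testid*='boxoffice']"):
        texts = list(item.stripped_strings)
        if len(texts) >= 2:
            financials[texts[0]] = texts[1]
    return financials

# Hypothetical usage:
# financials = scrape_film_financials(some_imdb_title_url)
```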
Feature Variable Purity
For most feature-length films, this information is readily available on IMDB, and quite reliable. The 2017 remake of Beauty & the Beast was a perfect reference when designing this web scrape, and the estimated $160 million budget made perfect sense in relation to the $504 million domestic gross and $1.26 billion international gross. However, some films’ information wasn’t nearly so robust or reliable. Take the 1933 film 42nd Street, the inspiration for the hit Broadway musical of the same name. This movie has an interesting balance between budget and worldwide gross:

You try explaining that margin to an investor.
Of course, 42nd Street made more than $1,600 in its commercial lifetime. In fact, it has made over 1,000 times more, to the tune of $2.2 million. These errors were not always rectified by additional research; in fact, cross-referenced sources often had clearly misrepresentative information while IMDB had the correct data. More often than not, though, the information didn’t exist anywhere. Sometimes there would be domestic gross information (but no budget), and sometimes there would be budget information (but no gross). Turns out there isn’t too much of a stir when the budget and gross for "Arsenic and Old Lace" aren’t readily available.
This, of course, was the most substantial issue. In the final dataset, almost half of the entries had impurities in these two features alone. Domestic gross, while not a reliable predictor of stage adaptation success, was the most highly correlated variable, yet it was only available for about two thirds of the scraped movies. The study benefitted from ironclad financial data on Broadway grosses (thanks to this Kaggle data set, drawn from the fastidious weekly gross records of the Broadway League), but all of the feature variables came from film performance.
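That kind of gap is easy to quantify up front. Here’s a minimal sketch, with hypothetical file and column names, that reports how much of each financial feature is actually present and how strongly each one tracks the Broadway target among the complete rows:

```python
import pandas as pd

# Hypothetical file and column names standing in for the scraped dataset
films = pd.read_csv("scraped_films.csv")
financial_cols = ["budget", "domestic_gross", "worldwide_gross"]

# Share of missing values per feature -- the first sign of an impurity problem
print(films[financial_cols].isna().mean())

# How strongly each feature tracks the target, using only complete rows
complete = films[financial_cols + ["broadway_lifetime_gross"]].dropna()
print(complete.corr()["broadway_lifetime_gross"].sort_values(ascending=False))
```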
The easiest way to ensure data purity is to confirm that accurate data actually exists for your features, and to use only that accurate data in your model. Missing values are a reality in almost every data set, but in a relatively small sample, you must ensure that all of the data is, you know, accurate. Once I realized the issue with my data, I spent hours searching the web and manually entering the values that hadn’t made it through the web scrape. This helped, but I still had hundreds of empty values. What could be done with all of those empty values?
Target Variable Purity
After scraping and cleaning the data from 10,000 movies off of IMDB, it was time to merge with the above-mentioned Broadway gross data set (for all shows since 1986). 27,064 matches were found by name alone! As you can imagine, there have been quite a few adaptations of Beauty & The Beast throughout the years, including a 2010 Lebanese adaptation, which inspired one reviewer to write "let the beast remain inside the cage". Several stages of whittling down brought the dataset to 521 unique movies, about half of which still didn’t have reliable or even complete financial information. I discovered MICE, a very handy algorithm that fills in missing values based on the values around them – much more sophisticated and thorough than simply imputing the mean.
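The chained-equations idea behind MICE is available in scikit-learn as the (still experimental) IterativeImputer. A minimal sketch, again with hypothetical file and column names:

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical merged dataset of films matched to Broadway productions
films = pd.read_csv("films_merged_with_broadway.csv")
financial_cols = ["budget", "domestic_gross", "worldwide_gross"]

# IterativeImputer models each column with missing values as a function of
# the other columns and cycles through them until the estimates stabilize --
# far more thorough than filling every gap with a column mean.
imputer = IterativeImputer(max_iter=10, random_state=42)
films[financial_cols] = imputer.fit_transform(films[financial_cols])
```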
Luckily, trends were starting to pop up. The genre portion of the web scrape was much more successful, capturing the genre(s) assigned on each movie’s IMDB page. Unsurprisingly, movies that were "Musical" in nature led to some high-grossing stage adaptations – musicals routinely reflect the highest grossing Broadway productions every week, with Hamilton alone often accounting for nearly 10% of the weekly grosses. Interestingly, "Adventure" films seem to make relatively successful stage adaptations – King Lear and The Lion In Winter are aptly labeled adventure movies with successful stage counterparts. However, somehow Cinderella snuck into this category, when it may have been a better fit for the similarly performing "Fantasy" or "Family" categories.

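Turning those scraped genres into something a linear model can use is a quick dummy-encoding step. A sketch, assuming (hypothetically) that the genres arrive as one comma-separated string per film:

```python
import pandas as pd

films = pd.read_csv("films_merged_with_broadway.csv")

# Assumed format: one comma-separated genre string per film,
# e.g. "Adventure, Family, Fantasy"
genre_dummies = films["genres"].str.get_dummies(sep=", ").add_prefix("genre_")
films = pd.concat([films, genre_dummies], axis=1)

# Which genres show up most often among the adapted films?
print(films.filter(like="genre_").sum().sort_values(ascending=False).head(10))
```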
These factors, in addition to the gross and budget information (be it real or imputed by the MICE algorithm), allowed for the completion of a serviceable model. Using linear regression, I built a model to predict the total lifetime gross of a Broadway show based on the success of the source film.
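A sketch of that modeling step, with hypothetical column names and a plain train/test split standing in for whatever validation the full project used:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

films = pd.read_csv("films_ready_for_modeling.csv")

# Film-side features: real or imputed financials plus the genre dummies
feature_cols = (["budget", "domestic_gross", "worldwide_gross"]
                + [c for c in films.columns if c.startswith("genre_")])
target_col = "broadway_lifetime_gross"

X_train, X_test, y_train, y_test = train_test_split(
    films[feature_cols], films[target_col], test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print(f"R^2 on held-out shows: {r2_score(y_test, model.predict(X_test)):.3f}")
```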

The result can be found on the left, and is probably enough to give hives to any self-respecting data scientist. This graph represents the difference between predicted and actual values. Those few but far-reaching outliers represent shows with vastly under-predicted grosses. These were shows like Cats and Phantom Of The Opera – long-running productions that had the chance to make much more money over their multi-decade runs than musicals like Hairspray or Legally Blonde, commercially successful productions with a more "normal" run of 1–3 years. One could deduce that length of run is an aspect of commercial success, and that "baking in" that element would help the algorithm. More money made over more time means it’s more successful, right? Turns out, heavily manufacturing the target variable really hindered the model, compared with simply letting average weekly gross represent the success of the show.
This leads to our second point – data purity in your target (predicted, or dependent) variable comes from keeping the target simple and unweighted. When determining what exactly you’re trying to predict, it’s best not to bake in additional factors before you’ve even calculated the relationship. Even factoring in the Consumer Price Index (which represents inflation and market sentiment by year) actually hurt the model. A simple target creates a more effective model. For factors like length of run or inflation rates, try treating them as features rather than folding them into the target or, better yet, implement a Generalized Linear Model, which Yuho Kida beautifully breaks down in this helpful article. This easy but important step will help your residual graph look a bit more like the one below, indicating a more informative distribution of prediction mistakes.

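To make that advice concrete, here’s a sketch (hypothetical column names once more) that keeps the target as plain average weekly gross and moves length of run over to the feature side:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

films = pd.read_csv("films_ready_for_modeling.csv")

# Keep the target simple: average weekly Broadway gross, rather than a
# lifetime figure that quietly bakes run length into the thing being predicted.
y = films["avg_weekly_gross"]

# Length of run (and an inflation index, if desired) can live here instead,
# as ordinary features the model weighs on its own.
feature_cols = ["budget", "domestic_gross", "weeks_running"]
X = films[feature_cols]

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)
print(residuals.describe())  # inspect the spread of the prediction mistakes
```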
Problem Purity – What am I solving, again?
The original intention of this study was to predict the success of a stage adaptation of a feature-length film. One factor that felt impossible to consider, given the limited sample size, was which came first. The data set had about 400 films that were movies before they were Broadway plays or musicals, leaving only 120 stage-to-film adaptations. Simply put, these are two separate problems. The film industry has historically outperformed the commercial theatre industry (which, by the way, is still a tourism and economic powerhouse, generating a reported $12 billion in 2018–2019 revenue from NYC’s Broadway district alone – thanks to Yaakov Bressler for this updated statistic!). Widespread awareness of a film will often lead to a successful stage adaptation, whereas a stage show is only likely to get a film adaptation if it is already a surefire Broadway hit. Factor in movie-to-musical-to-movie-again commodities like Footloose and Hairspray, and there are a few dynamics at play here that venture beyond the scope of this project.
Just as important as data purity is maintaining problem purity – our third and final tip is to always keep your "solvable" problem in mind. As I scrambled to make sure the sample size was large enough for the project, it became more and more difficult, if not impossible, to keep solving the same problem. This is a solvable problem, and a relationship does exist. Data purity is often a surmountable obstacle, if and when we take the time to notice the impurity. A data scientist must find supporting data on the front end that serves their project, rather than fickle and unsubstantiated data that will ultimately corrupt the model.

Future of this Model
With more manual entry, extensive research outside of web scraping, a finer process of sifting through duplicates, and a sophisticated Generalized Linear Model, I see the possibility of a genuinely useful tool for producers to determine the financial feasibility of a stage adaptation. I look forward to revisiting this model later this year, with the intention of incorporating all I’ve learned from preparing this blog post, and from working at Metis in general.
It feels uncouth to publish these results and tips without noting that this data may not be useful for some time. With Broadway’s reopening date being pushed later and later due to COVID-19 infection rates and uncertainty, it’s hard to know when exactly this sort of data will matter again. However, these sorts of problems will face Broadway every day as we stage a massive recovery in 2021 and beyond. That recovery is enhanced, not threatened, by the data available from commercial theatre. With pure data, open minds, and a commitment to the transformative power of Broadway theatre that no linear regression can capture, recovery is not only possible; it is imminent.