
Reconstructing The Year of the Legendary Florida Man

A Data Science microproject in Python to compile the top headlines per day

Miami Beach, potential spotting location of the Florida Man. Image by tammon via Pixabay

You probably know the meme. Google ‘Florida Man’ + your birthday, and the strangest search results pop up. Every day of the year, the enigmatic Florida Man seems to try his hand at something even more peculiar, violent, tragic or downright bizarre than the day before. Often, the headlines are so gaudy they can’t help but bring a smile to your face.

Googling what the Florida Man did on your birthday is fun, but you must be curious what he does the rest of the year as well (at least I was). Googling 366 days is a bit too much – although that would probably have been faster than writing a script for it – so let’s try to automate that in Python! We perform a Data Science microproject, dabbling in some cloud provisioning, web scraping, multi-threading, string operations, regular expressions and word clouds to top it all off.


Data Science Pipeline

For any Data Science project (even a microproject) I find it useful to follow the Data Science Pipeline as outlined by UC San Diego [1]. It contains five phases that encompass the full project:

  • Acquire: Use a Google Custom Search Engine and Google CSE JSON APIs to retrieve query data.
  • Prepare: Visualize raw output, filter out headlines from query results.
  • Analyze: Remove unsuitable headlines with a set of taboo words and a similarity check. Use string operations and regular expressions to produce uniformly styled output.
  • Report: Display sample output, visualize common words in word cloud.
  • Act: Write and publish a Medium article based on project.
Data Science pipeline (own work by author, inspired by [1])

Setting up Google Custom Search

For scraping Google's Custom Search Engine – automated requests on their regular search engine are not permitted – we use the Google Custom Search JSON API. This API provides up to 100 free queries per day, returning output in JSON format. Using Python, we create a query for each day and pass it to the API to extract the top headlines for each search. For more information on creating and configuring both the API and the Custom Search Engine, please check out my tutorial Scraping Google With Python – A Step-by-Step Tutorial. In this project, we configure the search engine such that it incorporates results from the United States within the .com domain.

Scraping Google With Python – A Step-by-Step Tutorial

Generating data queries

Once the Custom Search Engine is set up, we create a list of 366 birthdays (yes, including that pesky February 29th). 2020 was a leap year, so we are covered if we take all dates from that year (from '1-January' to '31-December'):
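The original gist is not reproduced here, but a minimal sketch of building the date list could look like this (the `'day-Month'` string format follows the example above):

```python
from datetime import date, timedelta

# 2020 was a leap year, so iterating over all of its days yields 366 dates
start = date(2020, 1, 1)
date_list = [
    f"{d.day}-{d.strftime('%B')}"  # e.g. '1-January', '29-February'
    for d in (start + timedelta(days=i) for i in range(366))
]
```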

Now we simply concatenate each date with 'Florida Man' to generate the 366 search queries in an orderly fashion: search_term_list = ['Florida Man ' + date for date in date_list]

Having our list of queries ready, we are ready to rumble!

Photo by Denys Kostyuchenko on Unsplash

Running the query

Well, almost ready. As you recall, we can only run 100 requests per day, and we definitely don't want to spend four days on this pet project. Fortunately, we can provision up to 12 APIs under a single account, so we can simply create four duplicates and sequentially run ~92 queries on each of them by switching keys.

Let's see if we can do better though! If we have to distribute the workload over four separate APIs anyway, we might as well do a bit of multi-threading. It probably won't save us a tremendous amount of time, but as we are dealing with an input/output-bound process that involves a lot of waiting, we should do it right.
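A sketch of how that might look with Python's ThreadPoolExecutor, one thread per key – the key names are placeholders, and the query function is a stub standing in for the actual Custom Search API call:

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder keys: in practice these are four Google Cloud API keys
API_KEYS = ["key_1", "key_2", "key_3", "key_4"]

def run_queries(api_key, queries):
    """Run one batch of queries against a single API key.
    (Stub: the real function calls the Custom Search JSON API.)"""
    return [f"result for '{q}' via {api_key}" for q in queries]

def scrape_all(search_term_list):
    """Distribute the queries over the API keys and run the four
    batches in parallel threads."""
    n = len(API_KEYS)
    chunks = [search_term_list[i::n] for i in range(n)]  # round-robin split
    with ThreadPoolExecutor(max_workers=n) as executor:
        futures = [executor.submit(run_queries, key, chunk)
                   for key, chunk in zip(API_KEYS, chunks)]
        return [result for f in futures for result in f.result()]
```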

Ok, now we are ready to rumble.

Let's run a single query and check how a single result looks. Even prettyprint couldn't make it pretty, and we have 3660 of those outputs. Fortunately, we are only interested in the headlines. Let's see what information we actually need from this mess and do some data cleaning.

Example of JSON output from the search query

The main problem we encounter is that the actual headlines – the title as displayed on the Google search page – are often incomplete due to their length. 'Florida Man Who Threw Toilet Through Window in East St. Louis …' is a cliffhanger, for sure, but clicking through to read the full story is exactly what we are trying to avoid here. The solution we pick is to instead read the field og:title, which includes the full title. og stands for the Open Graph protocol, a data format designed to make sharing across social media easier. Unfortunately, not every website includes it, so we drop these results. Still, plenty of source material remains.
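A sketch of pulling the og:title fields out of a single JSON response – the field names follow the Custom Search API's pagemap structure, and the full script additionally keeps the titles organized per date:

```python
def extract_full_titles(response):
    """Collect og:title from each search result, dropping results
    whose page does not expose the Open Graph tag."""
    titles = []
    for item in response.get("items", []):
        metatags = item.get("pagemap", {}).get("metatags") or [{}]
        og_title = metatags[0].get("og:title")
        if og_title:  # skip pages without an og:title field
            titles.append(og_title)
    return titles
```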

Now we have a dictionary with only full titles; things are getting more and more readable. With that, we also note many headlines are not exactly what we are looking for though. Time to perform a deeper analysis.

Photo by Ryan Spencer on Unsplash

Data analysis

First things first, we are going to reformat the headlines. Many headlines contain only lower-case words, one headline uses double quotation marks whereas another uses single ones, some media outlets post their own name in the title… We want a nice, uniform format, so we use string operations and regular expressions to match certain patterns in the headlines and update them accordingly. Some examples:
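The gist with the full rule set is not reproduced here, but a few of the rules might be sketched like this (the outlet-suffix pattern is illustrative, not the exact one used):

```python
import re

def format_headline(headline):
    """Apply a few illustrative normalization rules; the full script
    uses more patterns than shown here."""
    # Unify curly double quotes to straight ones
    headline = re.sub(r"[\u201c\u201d]", '"', headline)
    # Drop a trailing ' - Outlet' / ' | Outlet' media-outlet suffix
    headline = re.sub(r"\s+[|\u2013-]\s+[\w .]+$", "", headline)
    # Title-case headlines posted entirely in lower case
    if headline.islower():
        headline = headline.title()
    return headline.strip()
```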

Formatting headlines before checking which ones we actually want to keep is arguably not the most efficient way to go, but having the headlines in a uniform format certainly helps when filtering. Besides, speed is not the key objective of this script anyway.

Having nicely formatted the headlines, let’s see which ones we want to remove. As the Florida Man gained quite some traction in popular culture, many articles are about the phenomenon rather than the man himself. In our overview, we don’t want to see headlines like this polluting our results:

  • 90 Days of This Year’s Wildest Florida Man Headlines (So Far)
  • Florida Man February 28 (2/28) // Which Florida Man Are YOU? #FloridaMan
  • "Florida Man" Challenge Becomes Viral Sensation: Have You Tried It Tet?

To remove results such as the above, we create a taboo set to exclude any headline containing words such as 'Viral', 'Headline' and 'Challenge', symbols like '#' and '?', months, and social media platforms such as 'Twitter' and 'Facebook'. Upper- and lower-case variants are considered as well. It's a bit of a trial-and-error process, taking a few iterations to check whether all top headlines meet the intended criteria. Furthermore, we require 'Florida Man' to be the first two words of every single headline, so that we end up with a cohesive list.
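A minimal sketch of such a filter – the taboo set below is only an illustrative subset, and the substring check is deliberately coarse:

```python
# Illustrative subset of the taboo set; the real one also covers
# months, more platforms and additional symbols
TABOO = {"viral", "headline", "challenge", "twitter", "facebook", "#", "?"}

def keep_headline(headline):
    """Keep a headline only if it starts with 'Florida Man' and
    contains no taboo word or symbol (case-insensitive)."""
    if not headline.startswith("Florida Man"):
        return False
    lowered = headline.lower()
    return not any(word in lowered for word in TABOO)
```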

We are making progress. Now that we have a nice set of headlines, a final problem we run into is that of duplication. Similar (or even identical) headlines tend to make the news on consecutive days, for instance:

  • Florida Man Dressed Like Fred Flintstone Pulled Over In His "Footmobile"
  • Florida Man Dressed As Fred Flintstone Pulled Over For Driving "Footmobile"
  • Florida Man Dressed As Fred Flintstone Pulled Over For Speeding In Footmobile

To remove such results, we perform an overlap check using difflib's SequenceMatcher. It performs a fairly straightforward check to measure the overlap ratio between two sequences, but it works quite well for our purposes. As comparing each headline to a number of headlines for future dates is quite computationally intensive, we limit the lookahead horizon to ten days. To check for meaningful overlaps, we remove the substring Florida Man as well as all special symbols, and make each word lower case. The overlap threshold is set at 60%. This seems low, but exact matches are often hard to find.
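The steps above can be sketched as follows (the exact normalization rules in the full script may differ slightly):

```python
import re
from difflib import SequenceMatcher

def normalize(headline):
    """Strip 'Florida Man', special symbols and upper case before comparing."""
    headline = headline.replace("Florida Man", "")
    headline = re.sub(r"[^\w\s]", "", headline)  # drop special symbols
    return headline.lower().strip()

def is_duplicate(a, b, threshold=0.6):
    """Flag two headlines as duplicates when their normalized
    overlap ratio meets the 60% threshold."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold
```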

The overlap check works well and removes many similar headlines, but does not completely get the job done. Occasionally, headlines like this seep through:

  • Florida Man Carrying Nearly 500 Grams Of Weed Tries To Steal Plane To Meet Girlfriend, Police Say
  • Florida Man With Weed Steal Plane To See Girlfriend

For the human reader, it is clear both headlines pertain to the same incident. Although they are semantically similar, the SequenceMatcher obviously cannot verify this. Natural Language Processing could be a nice solution in theory, e.g., using a Doc2Vec model. Unfortunately, we only have 3660 headlines, far too few to properly train a sophisticated neural network. Constructing a custom search for matching nouns sounds feasible (e.g., 'Weed', 'Plane' and 'Girlfriend' in the example above), and manually removing keys would work as well (albeit unsatisfying from an automation perspective). At this point I decided the additional design effort was not worth the potential benefits. Given the lack of real added value or purpose of this project, one should know when to stop.

Results and visualization

After all that hard work, we naturally want to see the final overview. Without further ado, let’s print and present our carefully curated list of top Google headlines. Actually, the full list is a bit lengthy, so let’s stick to a single month:

Sample output for the month November

Although probably not the best way to represent information, let's create a word cloud to wrap the project up with a nice visual. Regular square word clouds are getting a bit stale though. To liven things up a bit, we fit the cloud into the shape of some pretty palm trees, creating a numpy mask from the image – mask = np.array(Image.open("img/palm tree.jpg")) – and using the ImageColorGenerator from the wordcloud library.

And that concludes it – the year of the Florida Man summarized in one picture, as unintelligible as the man himself!

Word cloud constructed with mask of palm tree image (own work by author)

_Curious for more? The full project code can be found on my GitHub repository._

Want to see the full list of headlines? Check out:

A Year In The Life Of The Florida Man


Takeaways

  • To study the life of the Florida Man, we performed a Data Science microproject ranging from data ingestion to reporting and visual presentation.
  • Multi-threading streamlines the scraping process with multiple Google Cloud APIs.
  • Data formatting combines string operations and regular expressions to convert the headlines into a uniform style.
  • Cycling through a set of taboo words removes headlines that are thematically outside the project scope. The SequenceMatcher succeeds in removing most duplicate headlines.
  • An image mask is used to generate a custom-shaped word cloud, using the ImageColorGenerator.
  • The Florida Man is a strange guy. Seriously.

References

[1] Altintas, I. & Porter, L. (2021). UCSanDiegoX DSE200x: Python for Data Science. Retrieved from learning.edx.org

