A Data Science microproject in Python to compile the top headlines per day

You probably know the meme. Google ‘Florida Man’ + your birthday, and the strangest search results pop up. Every day of the year, the enigmatic Florida Man seems to try his hand at something even more peculiar, violent, tragic or downright bizarre than the day before. Often, the headlines are so gaudy they can’t help but bring a smile to your face.
Googling what the Florida Man did on your birthday is fun, but you must be curious what he does the rest of the year as well (at least I was). Googling 366 days is a bit too much – although that would probably have been faster than writing a script for it – so let’s try to automate that in Python! We perform a Data Science microproject, dabbling in some cloud provisioning, web scraping, multi-threading, string operations, regular expressions and word clouds to top it all off.
Data Science Pipeline
For any Data Science project (even a microproject) I find it useful to utilize the Data Science Pipeline as outlined by UC San Diego [1]. It contains five phases that encompass the full project:
- Acquire: Use a Google Custom Search Engine and Google CSE JSON APIs to retrieve query data.
- Prepare: Visualize raw output, filter out headlines from query results.
- Analyze: Remove unsuitable headlines with a set of taboo words and a similarity check. Use string operations and regular expressions to produce uniformly styled output.
- Report: Display sample output, visualize common words in word cloud.
- Act: Write and publish a Medium article based on project.
![Data Science pipeline (own work by author, inspired by [1])](https://towardsdatascience.com/wp-content/uploads/2021/06/1aO7nkhjMFq0AZHuR8Yrllg.png)
Setting up Google Custom Search
For scraping Google’s Custom Search Engine – automated requests on their regular search engine are not permitted – we use the Google Custom Search JSON API. This API provides up to 100 free queries per day, returning output in JSON format. Using Python, we create a query for each day and pass them to the API to extract the top headlines for each search. For more information on creating and configuring both the API and the Google Custom Search Engine, please check out my tutorial Scraping Google With Python – A Step-by-Step Tutorial. In this project, we configure the search engine such that it incorporates results from the United States within the .com domain.
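As an illustration – the API key and search engine ID below are placeholders, and the helper name is my own – a single call to the Custom Search JSON API can be made with the `requests` library:

```python
import requests

API_KEY = "YOUR_API_KEY"          # placeholder: API key from the Google Cloud console
SEARCH_ENGINE_ID = "YOUR_CSE_ID"  # placeholder: ID of the configured Custom Search Engine

def google_search(query: str, api_key: str = API_KEY, cse_id: str = SEARCH_ENGINE_ID) -> dict:
    """Run a single query against the Custom Search JSON API and return the raw JSON response."""
    response = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": api_key, "cx": cse_id, "q": query},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()
```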
Generating data queries
Once the Custom Search Engine is set up, we create a list of 366 birthdays (yes, including that pesky February 29th). 2020 was a leap year, so we are covered if we take all dates from that year (from ‘1-January’ to ‘31-December’):
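A minimal sketch of generating that list with the standard `datetime` module (the ‘day-Month’ string format follows the examples above) looks as follows:

```python
from datetime import date, timedelta

# 2020 is a leap year, so iterating over all of its days yields exactly 366 dates
start = date(2020, 1, 1)
date_list = [
    f"{d.day}-{d.strftime('%B')}"                       # e.g. '1-January', '29-February'
    for d in (start + timedelta(days=i) for i in range(366))
]

print(date_list[0], date_list[-1])                      # 1-January 31-December
```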
Now we simply concatenate ‘Florida Man’ with each date to generate the 366 search queries in an orderly fashion: `search_term_list = ['Florida Man ' + date for date in date_list]`
Having our list of queries ready, we are ready to rumble!

Running the query
Well, almost ready. As you recall, we can only run 100 requests per day, and we definitely don’t want to spend four days on this pet project. Fortunately we can provision up to 12 APIs under a single account, so we can simply create four duplicates and sequentially run ~92 queries on each of them by switching keys.
Let’s see if we can do better though! If we have to distribute the workload over four separate APIs anyway, we might as well do a bit of multi-threading. It probably won’t save us a tremendous amount of time, but as we are dealing with an input/output process that involves a lot of waiting, we should do it right.
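A sketch of that setup, reusing the hypothetical `google_search` helper from above with four placeholder credential pairs, spreads the queries over one thread per API key:

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder (api_key, cse_id) pairs, one per duplicated Custom Search Engine
credentials = [("KEY_1", "CX_1"), ("KEY_2", "CX_2"), ("KEY_3", "CX_3"), ("KEY_4", "CX_4")]

# Split the 366 queries into four chunks of at most ~92 queries each
chunk_size = -(-len(search_term_list) // len(credentials))   # ceiling division
chunks = [search_term_list[i:i + chunk_size]
          for i in range(0, len(search_term_list), chunk_size)]

def run_chunk(chunk, creds):
    """Run all queries in one chunk with a single set of credentials (I/O-bound, so threads help)."""
    api_key, cse_id = creds
    return [google_search(query, api_key, cse_id) for query in chunk]

# One thread per API key; flatten the chunked results back into one list, in date order
with ThreadPoolExecutor(max_workers=len(credentials)) as executor:
    all_responses = [response
                     for chunk_result in executor.map(run_chunk, chunks, credentials)
                     for response in chunk_result]
```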
Ok, now we are ready to rumble.
Let’s run a single query and check how a single result looks. Even `prettyprint` couldn’t make it pretty, and we have 3660 of those outputs. Fortunately, we are only interested in the headlines. Let’s see what information we actually need from this mess and do some data cleaning.

The main problem we encounter is that the actual headlines – `title` as displayed on the Google search page – are often incomplete due to their length. ‘Florida Man Who Threw Toilet Through Window in East St. Louis …’ is a cliffhanger, for sure, but clicking to read the full story is exactly what we try to avoid here. The solution we pick is to instead read the field `og:title`, which includes the full title. `og` stands for the Open Graph protocol, a data format designed to make sharing across social media easier. Unfortunately not every website includes it, so we drop these results. Still, plenty of source material remains.
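A sketch of this cleaning step, assuming the raw responses were collected into an `all_responses` list (one JSON dictionary per date, in the same order as `date_list`) and that the Open Graph data sits in each item’s `pagemap` metatags, could look like this:

```python
headlines = {}  # maps each date string to a list of full headlines

for day, response in zip(date_list, all_responses):
    titles = []
    for item in response.get("items", []):
        # The Open Graph title lives in the pagemap metatags; results without it are dropped
        metatags = item.get("pagemap", {}).get("metatags", [{}])
        og_title = metatags[0].get("og:title")
        if og_title:
            titles.append(og_title)
    headlines[day] = titles
```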
Now we have a dictionary with only full titles; things are getting more and more readable. With that, we also note that many headlines are not exactly what we are looking for. Time to perform a deeper analysis.

Data analysis
First things first, we are going to reformat the headlines. Many headlines contain only lowercase words, one headline uses double quotation marks whereas another uses single ones, some media outlets put their own name in the title… We want a nice, uniform format, so we use string operations and regular expressions to detect certain patterns in the headlines and update them accordingly. Some examples:
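The rules below are only an illustrative sketch of this kind of reformatting, not the exact set used; the outlet names in particular are hypothetical:

```python
import re

# Hypothetical outlet suffixes; the real list would be built from the scraped results
OUTLETS = [" | Miami Herald", " - WFLA", " | HuffPost"]

def format_headline(headline: str) -> str:
    """Bring a single headline into a uniform style (illustrative rules only)."""
    headline = headline.replace("\u201c", '"').replace("\u201d", '"')   # normalize curly double quotes
    headline = headline.replace("\u2018", "'").replace("\u2019", "'")   # normalize curly single quotes
    for outlet in OUTLETS:                                              # strip outlet names from titles
        headline = headline.replace(outlet, "")
    headline = re.sub(r"\s+", " ", headline).strip()                    # collapse stray whitespace
    headline = headline.title()                                         # uniform Title Case
    return re.sub(r"'([A-Z])\b", lambda m: "'" + m.group(1).lower(), headline)  # fix 'S, 'T after .title()
```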
Formatting headlines before checking which ones we actually want to keep is arguably not the most efficient order of operations, but having the headlines in a uniform format certainly helps when filtering. Besides, speed is not the key objective of this script anyway.
Having nicely formatted the headlines, let’s see which ones we want to remove. As the Florida Man gained quite some traction in popular culture, many articles are about the phenomenon rather than the man himself. In our overview, we don’t want to see headlines like this polluting our results:
- 90 Days of This Year’s Wildest Florida Man Headlines (So Far)
- Florida Man February 28 (2/28) // Which Florida Man Are YOU? #FloridaMan
- "Florida Man" Challenge Becomes Viral Sensation: Have You Tried It Tet?
To remove results such as the above, we create a taboo set to exclude any headline containing words such as ‘Viral’, ‘Headline’, ‘Challenge’, symbols like ‘#’ and ‘?’, months, and social media platforms such as ‘Twitter’ and ‘Facebook’. Upper- and lowercase variants are considered as well. It’s a bit of a trial-and-error process, taking a few iterations to check whether all top headlines meet the intended result. Furthermore, we want ‘Florida Man’ to be the first two words of every single headline, so that we end up with a cohesive list. For an example of deploying the taboo set, check the gist below.
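A minimal sketch of that filter (the taboo words listed here are only an illustrative subset) could look like this:

```python
# Illustrative subset of the taboo set; the real set was tuned over several iterations
TABOO = {"viral", "headline", "challenge", "#", "?", "twitter", "facebook",
         "january", "february", "march"}  # ...plus the remaining months, platforms, etc.

def keep_headline(headline: str) -> bool:
    """Keep a headline only if it starts with 'Florida Man' and contains no taboo word."""
    if not headline.startswith("Florida Man"):
        return False
    lowered = headline.lower()                       # catches upper- and lowercase variants alike
    return not any(taboo in lowered for taboo in TABOO)

# Format first, then filter, keeping the per-date structure
filtered = {}
for day, titles in headlines.items():
    formatted = [format_headline(title) for title in titles]
    filtered[day] = [title for title in formatted if keep_headline(title)]
```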
We are making progress. Now that we have a nice set of headlines, a final problem we run into is that of duplication. Similar (or even identical) headlines tend to make the news on consecutive days, for instance:
- Florida Man Dressed Like Fred Flintstone Pulled Over In His "Footmobile"
- Florida Man Dressed As Fred Flintstone Pulled Over For Driving "Footmobile"
- Florida Man Dressed As Fred Flintstone Pulled Over For Speeding In Footmobile
To remove such results, we perform an overlap check using `difflib`'s `SequenceMatcher`. It performs a fairly straightforward check to measure the overlap ratio between two sequences, but it works quite well for our purposes. As comparing each headline to a number of headlines for future dates is quite computationally intensive, we limit the lookahead horizon to ten days. To check for meaningful overlaps, we remove the substring ‘Florida Man’ as well as all special symbols, and make each word lowercase. The overlap threshold is set at 60%. This seems low, but exact matches are often hard to find.
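A minimal sketch of that check, assuming the filtered headlines are stored per date in the `filtered` dictionary, could look like this:

```python
from difflib import SequenceMatcher
import re

def normalize(headline: str) -> str:
    """Strip the 'Florida Man' prefix, special symbols and upper case before comparing."""
    text = headline.replace("Florida Man", "").lower()
    return re.sub(r"[^a-z ]", "", text).strip()

def deduplicate(headlines_by_date: dict, horizon: int = 10, threshold: float = 0.6) -> dict:
    """Remove a headline if a sufficiently similar one appeared within the previous `horizon` days."""
    dates = list(headlines_by_date)
    deduped = {day: list(titles) for day, titles in headlines_by_date.items()}
    for i, day in enumerate(dates):
        for headline in deduped[day]:
            # Only compare against the next `horizon` days to keep the check tractable
            for future in dates[i + 1:i + 1 + horizon]:
                deduped[future] = [
                    other for other in deduped[future]
                    if SequenceMatcher(None, normalize(headline), normalize(other)).ratio() < threshold
                ]
    return deduped

deduped = deduplicate(filtered)
```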
The overlap check works well and removes many similar headlines, but does not completely get the job done. Occasionally, headlines like this seep through:
- Florida Man Carrying Nearly 500 Grams Of Weed Tries To Steal Plane To Meet Girlfriend, Police Say
- Florida Man With Weed Steal Plane To See Girlfriend
For the human reader, it is clear both headlines pertain to the same incident. Although the two are semantically comparable, the `SequenceMatcher` obviously cannot verify this. Natural Language Processing could be a nice solution in theory, e.g., using a `Doc2Vec` model. Unfortunately, we only have 3660 headlines, far too few to properly train a sophisticated neural network. Constructing a custom search for matching nouns sounds feasible (e.g., ‘Weed’, ‘Plane’ and ‘Girlfriend’ in the example above), and manually removing keys would work as well (albeit unsatisfying from an automation perspective). At this point I decided the additional design effort was not worth the potential benefits. Given the lack of real added value or purpose of this project, one should know when to stop.
Results and visualization
After all that hard work, we naturally want to see the final overview. Without further ado, let’s print and present our carefully curated list of top Google headlines. Actually, the full list is a bit lengthy, so let’s stick to a single month:
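As a small sketch, a simple loop over the final `deduped` dictionary prints the remaining headlines for a single month (March picked as an example):

```python
# Print the curated headlines for one month
for day, titles in deduped.items():
    if day.endswith("-March"):
        for title in titles:
            print(f"{day}: {title}")
```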

Although probably not the best way to represent information, let’s create a word cloud to wrap the project up with a nice visual. Regular square word clouds are getting a bit stale though. To liven things up a bit, we fit the cloud into the shape of some pretty palm trees, creating a `numpy` mask from the image with `mask = np.array(Image.open("img/palm tree.jpg"))` and using the `ImageColorGenerator` from the `wordcloud` library.
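A sketch of that final step, assuming the cleaned headlines live in the `deduped` dictionary from the earlier sketches and the palm tree image sits at the path shown above:

```python
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator

# Join all remaining headlines into one long text for the word cloud
text = " ".join(title for titles in deduped.values() for title in titles)

mask = np.array(Image.open("img/palm tree.jpg"))            # image mask defining the cloud's shape
wordcloud = WordCloud(mask=mask, background_color="white").generate(text)
wordcloud.recolor(color_func=ImageColorGenerator(mask))     # reuse the palm trees' colors for the words

plt.figure(figsize=(10, 10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```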
And that concludes it – the year of the Florida Man summarized in one picture, as unintelligible as the man himself!

_Curious for more? The full project code can be found on my GitHub repository._
Want to see the full list of headlines? Check out:
Takeaways
- To study the life of the Florida Man, we performed a Data Science microproject ranging from data ingestion to reporting and visual presentation.
- Multi-threading streamlines the scraping process with multiple Google Cloud APIs.
- Data formatting combines string operations and regular expressions to convert the headlines into a uniform style.
- Cycling through a set of taboo words removes headlines that are thematically outside the project scope. The `SequenceMatcher` succeeds in removing most duplicate headlines.
- An image mask is used to generate a custom-shaped word cloud, using the `ImageColorGenerator`.
- The Florida Man is a strange guy. Seriously.
References
[1] Altintas, I. & Porter, L. (2021). UCSanDiegoX DSE200x: Python for Data Science. Retrieved from learning.edx.org