The Wild World of Crossword Data

Kurt Reckziegel
Towards Data Science
5 min read · Feb 23, 2018
Photo by Natalia Ostashova on Unsplash

I’ve always been pretty into crosswords. My mom and sisters would pull the New York Times Crossword out of the Montreal Gazette and work on it together, sometimes throwing a sports-related question my way.

Fast-forward to 30+ Kurt, and now I’m solving full puzzles on my own, like a real boy. Where I really got into it, though, was over the most recent Christmas holiday, when the temperature outside was a brisk -31°F at our lakehouse on Lac Pilon in Sainte-Adèle. Needless to say, we spent most days inside by a crackling fire sipping something warm. Despite the cold, it was a delightful holiday, and with no cable (and pretty spotty wifi) we passed the time doing crosswords.

Now, back in NYC, where the weather is a little less arctic, I consider myself a full-blown puzzle-fiend, and I get my fix in a few different ways:

  • Since she still gets an actual newspaper delivered to her door every morning, my lovely mother is kind enough to scan and email each and every NYT Crossword to my girlfriend and me (which we promptly print out and add to the growing stack on our dedicated crossword clipboard).
  • I paid for a digital subscription to the NYT (I know, crazy, right?) just to get access to their crosswords. Yeah, this does duplicate what mom sends, but it also gives me access to the daily minis, which are perfect for the subway commute.
  • Speaking of subway commutes, mine’s pretty long. So if I don’t have any minis left to do, I sometimes get to scratch my crossword itch by solving a couple of clues in my head while peeking at someone’s Metro puzzle over their shoulder (creepy or resourceful?).

What’s the big deal about the NYT Crossword anyway?

I’m paraphrasing from Wikipedia here, but basically it’s a daily puzzle published by the New York Times and also syndicated to over 300 other newspapers and publications (which is how my mom gets it in Montreal).

It’s got a whole slew of different contributors, but since 1993 it has always been edited by good ol’ Will Shortz. The puzzle gets progressively harder as the week goes on, with Monday being the easiest and Saturday the hardest (Sundays are a larger puzzle, usually considered about as difficult as a Thursday). Over time, I’ve gotten pretty comfortable with Mondays and Tuesdays, but still have some trouble finishing a whole Wednesday puzzle.

Enough about history, let’s talk data

Thanks to michael, I was able to get my hands on a decently sized data dump that includes NYT Crossword data from December 1993 to July 2017. What did I do next? Well, first I put down the Wednesday crossword I was working on, and then I dove right in.

After uploading the data into Google Cloud, and running some SQL queries on it in BigQuery, here are a few fun facts I found:

Fun Facts

  • About 23.6 years (December 1993 to July 2017)
  • 8,238 puzzles
  • 432,205 clues
  • 108,423 unique answers
  • 54,820 (or 51%) unique answers were never used more than once

Ok, so wait, that means 49% of all unique answers were used multiple times?!
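For anyone who wants to play along at home, here’s a minimal Python sketch of how counts like these could be computed. The real analysis was done with SQL in BigQuery, so treat this as an illustration only: the `summarize_answers` helper and the toy answer list are made up, and the last line just redoes the percentage from the totals above.

```python
from collections import Counter


def summarize_answers(answers):
    """Return (total clues, unique answers, answers used exactly once)."""
    counts = Counter(a.upper() for a in answers)
    used_once = sum(1 for n in counts.values() if n == 1)
    return len(answers), len(counts), used_once


# Toy data: one entry per clue (hypothetical, not the real data dump).
toy_answers = ["ERA", "AREA", "ERA", "OREO", "ZYDECO", "AREA", "ERA", "QUOKKA"]

total, unique, once = summarize_answers(toy_answers)
print(f"{total} clues, {unique} unique answers, {once} used once "
      f"({once / unique:.0%} of unique answers)")

# Same arithmetic on the totals quoted in the Fun Facts list.
print(f"{54_820 / 108_423:.1%} of unique answers were used only once")
```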

Hmmm interesting, I wonder which words are used most frequently…

Top 10 Most Used Answers Overall (& Percent Puzzle Presence)

So, what have we got here? It looks like a set of short words that can easily be used to fill gaps left by longer answers. It’s also interesting to see the prominence of letters like E, A, and R, since those are the three most frequently used letters in the English language (according to Oxford Dictionaries). Those three letters combined account for 27.24% of the English language, while for the NYT, they make up 68.75% of the top 10 most used answers (22 out of 32 characters).
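The letter math is easy to check. Here’s a small Python sketch, assuming you simply count characters across a list of answers; `letter_share` is a made-up helper, and the toy list is NOT the actual top 10 from the chart, just a few short words in the same spirit.

```python
def letter_share(answers, letters):
    """Fraction of all characters in `answers` that belong to `letters`."""
    wanted = set(letters.upper())
    total = sum(len(a) for a in answers)
    hits = sum(1 for a in answers for ch in a.upper() if ch in wanted)
    return hits / total


# Hypothetical short answers, not the real top 10.
toy_top = ["ERA", "AREA", "ERE", "ORE"]
print(f"E/A/R share in toy list: {letter_share(toy_top, 'EAR'):.2%}")

# The figure quoted above: 22 of the 32 characters in the top 10.
print(f"E/A/R share in top 10:   {22 / 32:.2%}")
```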

Quick tip: based on this data, you can probably expect the word ERA to show up about once every 2.4 weeks. (Remember, the clue for this could be something related to time or baseball or equal rights.)
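Where does a number like “every 2.4 weeks” come from? A rough sketch in Python, assuming one puzzle per day and appearances spread evenly over time (both simplifications). The function names and toy puzzles are hypothetical; the last lines work backwards from 2.4 weeks to the puzzle presence it implies.

```python
def puzzle_presence(puzzles, answer):
    """Share of puzzles whose answer set contains `answer`."""
    return sum(1 for p in puzzles if answer in p) / len(puzzles)


def weeks_between(presence, puzzles_per_week=7):
    """Average gap, in weeks, between puzzles that contain the answer."""
    return 1 / (presence * puzzles_per_week)


# Toy data: each puzzle is the set of its answers (hypothetical).
toy_puzzles = [{"ERA", "AREA"}, {"OREO"}, {"ERA"}, {"ALOE", "ELI"}]
p = puzzle_presence(toy_puzzles, "ERA")  # ERA is in 2 of the 4 toy puzzles
print(f"toy presence: {p:.0%}, gap: {weeks_between(p):.2f} weeks")

# Working backwards from the quick tip: once every 2.4 weeks, daily puzzles.
implied = 1 / (2.4 * 7)
print(f"implied puzzle presence for ERA: {implied:.1%}")
```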

Curious to see how that looks broken down by day of the week? I got you.

Top 10 Most Used Answers by Day of Week

Something else I looked into was how long the clues were. I’ve found that they range from short, one-word clues to much longer sentences or even paragraphs. Since the puzzle difficulty changes as the week progresses, I thought it might be cool to look at how the average clue character count changes throughout the week.
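As a rough idea of the calculation, here’s a Python sketch that groups clues by the weekday of their puzzle and averages the character count. It assumes each row is a (puzzle date, clue text) pair and that spaces and punctuation count as characters; the helper name, row layout, and sample clues are all made up for illustration.

```python
from collections import defaultdict
from datetime import date

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def avg_clue_length_by_weekday(rows):
    """rows: iterable of (puzzle_date, clue_text) -> {weekday: mean length}."""
    totals = defaultdict(lambda: [0, 0])  # weekday -> [char sum, clue count]
    for puzzle_date, clue in rows:
        bucket = totals[DAYS[puzzle_date.weekday()]]
        bucket[0] += len(clue)  # spaces and punctuation are counted too
        bucket[1] += 1
    return {day: chars / n for day, (chars, n) in totals.items()}


# Toy rows (hypothetical clues): July 3, 2017 was a Monday, July 8 a Saturday.
toy_rows = [
    (date(2017, 7, 3), "Baseball stat"),
    (date(2017, 7, 3), "Aloe ___"),
    (date(2017, 7, 8), "It might be found in a stable environment"),
]

for day, avg in avg_clue_length_by_weekday(toy_rows).items():
    print(f"{day}: {avg:.1f} characters per clue")
```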

Average Clue Character Count by Day of Week

Well, there it is, clear as day: higher difficulty = longer clues

With the average character count rising day over day from Wednesday to Saturday, it looks like longer clues tend to go hand in hand with harder puzzles. And, this may just be me, but I find it so satisfying that on Sunday the average character count drops back down to almost exactly the same level as Thursday (remember, I mentioned earlier that Sundays are larger puzzles but about the same level of difficulty as a Thursday).

The one thing I can’t explain is why there’s a slight drop in clue length from Monday to Wednesday. Anyone have any ideas??

If you’ve ever worked on (or even looked at) a crossword, you know that there are clues and answers going in two directions: across and down. So, the last thing I looked into was whether there is a difference in clue lengths based on direction.

Average Clue Character Count by Answer Direction (& Day of Week)

Ok, this one didn’t pan out into much. Not a whole lot to look at here, but let’s try to stretch it into something. If we previously determined that longer clues usually mean harder clues, then based on the data above, could we say:

Down clues are easier than Across clues. ¯\_(ツ)_/¯

Well, that’s all for today, folks. I’ve got to make some coffee and get back to this pesky Wednesday. Anyone know a six-letter word for “joe”?

Tools Used

Google Cloud Storage, Google BigQuery, Excel, GitHub, infogram


Written by Kurt Reckziegel

Working on consumer insights and brand strategy (alum of @onepeloton @abinbev @VICE @jmsbconcordia) he/him
