WORDLE-VISION: Simple Analytics To Up Your Wordle Game

Ravi Gupta
Towards Data Science
5 min read · Jan 25, 2022


I love word games. For nearly a decade I had a not-so-healthy obsession with Words With Friends, playing up to a dozen games simultaneously each day with players who were, in truth, not my friends but random opponents. However, in recent years the once-popular mobile app has fallen out of favor, leaving me and a lingering crew of oddballs who would initiate bizarre conversations with me via the in-app chat function (see Figure 1 for one of many examples). Consequently, my interest in the app waned and eventually died out.

Figure 1: Snippet of a Words With Friends in-app conversation with a stranger (one of many). My attempt at humor always went unappreciated (Image by author).

Then I heard about this game called Wordle that was sweeping the nation. No downloads needed, no account logins either, no ads, and no unsolicited messages from randos. I was intrigued. For those unfamiliar with Wordle, the object of the game is to guess the 5-letter English word of the day (the “Wordle”) in 6 tries. The game tells you if you guessed a letter correctly and also if you guessed its position correctly within the word.

After playing for a week I started wondering: what is the optimal first guess? With your very first guess you’ve got nothing to go on, right? Well, that’s not completely true. There is some a priori knowledge we have going in. For example, we know the correct word must be an English word that is 5 letters long. It probably needs one or more vowels, or at least a Y. Also, you might have heard that E is the most used letter in the English language¹ (though this is not necessarily true of all 5-letter English words). The scientist in me itched to discover a way to decide on the best Wordle guess using publicly available word data.

Contrary to what you might read online, it’s not always necessary to approach a data problem with a state-of-the-art machine learning model. You may not need to spend days worrying about tuning hyperparameters, cross-validation, or exactly how many hidden layers to include in your neural network. Sometimes a cup of coffee, a few dozen lines of code, and a couple of bar charts can get you pretty far in a short amount of time.

The first step toward gleaning some quick insights was to find a lexicon of all words in the English language. Luckily, this proved to be fairly easy to do (because the Internet). I found a text file on GitHub in a repo by dwyl that contains 370,000 words. For the purposes of Wordle we only care about 5-letter words, which leaves us with almost 16,000 words. The next thing I wanted to know is what the most common letters are among 5-letter words. For this I needed to compute the frequency of each letter, the results for which are shown in Figure 2.

Figure 2: Frequency distribution of letters found in 5-letter words. 10.5% of letters are A, 9.8% of letters are E, and so on (Image by author).
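For anyone who wants to reproduce this counting step, here is a minimal Python sketch. It is not necessarily the code in the WORDLE-VISION repo. The helper names and the tiny sample list are made up for illustration, and the file name in the comment is an assumption about what a local copy of the dwyl word list might be called.

```python
from collections import Counter


def five_letter_words(words):
    """Keep purely alphabetic 5-letter words, upper-cased and de-duplicated."""
    cleaned = (w.strip().upper() for w in words)
    return sorted({w for w in cleaned if len(w) == 5 and w.isalpha()})


def letter_frequencies(words):
    """Return {letter: share of all letters}, ordered from most to least common."""
    counts = Counter("".join(words))
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.most_common()}


# With the full lexicon you might do something like this
# (assumed local file name, one word per line):
# with open("words_alpha.txt") as f:
#     words = five_letter_words(f)

# Tiny stand-in sample so the sketch runs on its own.
sample = ["salsa", "arise", "laser", "wince", "mince", "it's", "word"]
words = five_letter_words(sample)
freqs = letter_frequencies(words)

for letter, share in freqs.items():
    print(f"{letter}: {share:.1%}")
```

On the full word list, the printed shares should correspond to the bars in Figure 2. On this toy sample they only show the mechanics.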

I was a bit surprised to see that among 5-letter words A is the most common, comprising 10.5% of all letters. The letter E is a close second at 9.8%, followed by S at 8.2%. Unsurprisingly, all vowels sit in the top 10 while letters like V, Z, J, X, and Q are the least frequent. Another interesting way to view the data is to take the cumulative sum to obtain a cumulative frequency version of Figure 2. Figure 3 below shows what that distribution looks like.

Figure 3: Cumulative frequency distribution of letters found in 5-letter words. We can see that the top 7 letters account for over half of all letters (Image by author).
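The cumulative view is just a running total over the letters ranked from most to least common. The sketch below shows one way to compute it, along with a helper that returns the smallest group of top letters whose combined share passes a threshold. The function names are hypothetical and the toy shares are invented so the example is self-contained; they are not the values behind Figure 3.

```python
from itertools import accumulate


def cumulative_frequencies(freqs):
    """Given {letter: share}, return [(letter, running total)], most common first."""
    ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    running = accumulate(share for _, share in ranked)
    return [(letter, total) for (letter, _), total in zip(ranked, running)]


def letters_to_cover(freqs, threshold=0.5):
    """Top-ranked letters needed for the running total to exceed the threshold."""
    top = []
    for letter, total in cumulative_frequencies(freqs):
        top.append(letter)
        if total > threshold:
            break
    return top


# Toy shares (invented for illustration, NOT the measured letter frequencies).
toy_freqs = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}
print(cumulative_frequencies(toy_freqs))
print(letters_to_cover(toy_freqs, threshold=0.5))
```

Feeding in the real letter shares with a threshold of 0.5 is how you would arrive at a "top N letters cover half of everything" statement.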

The neat thing about the cumulative distribution is that we can clearly see that the top 7 letters (A, E, S, O, R, I, L) account for over half (53%) of all letters among 5-letter words! It seems logical that the best first Wordle guesses would be words that contain only these top 7 letters. It turns out that there are 231 words in this lexicon that only use the letters A, E, S, O, R, I, and L. So we’ve now shrunk our set of 16,000 5-letter words to 231 — a 98.5% reduction!

But these 231 words include ones like SALSA, which (while delicious) contains only 3 distinct letters and thus diminishes our power to eliminate letters on the first attempt. Therefore, a better first guess is a word that uses only the top 7 letters and uses each of its letters exactly once. Imposing this requirement leaves us with only 60 words, a more manageable number that we can sort through quickly.
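Both filters (top 7 letters only, no repeated letters) boil down to a couple of set operations. Here is a minimal sketch with hypothetical helper names and a five-word sample standing in for the full lexicon.

```python
def uses_only(word, allowed):
    """True if every letter of the word is in the allowed set."""
    return set(word) <= set(allowed)


def has_distinct_letters(word):
    """True if no letter appears more than once."""
    return len(set(word)) == len(word)


def candidate_guesses(words, allowed):
    """Words built only from allowed letters, with no repeats."""
    return [w for w in words if uses_only(w, allowed) and has_distinct_letters(w)]


TOP_LETTERS = "AESORIL"  # the top 7 letters from the analysis above

sample = ["SALSA", "ARISE", "LASER", "MINCE", "SOLAR"]
print(candidate_guesses(sample, TOP_LETTERS))  # ['ARISE', 'LASER', 'SOLAR']
```

SALSA passes the first filter but fails the second, while MINCE fails the first because M, N, and C are outside the top 7.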

One important caveat with this entire analysis is that I am using as input a full list of 5-letter English words. In fact, the list of possible Wordle words is a pared-down collection of about 2,500 words that are the most common.² Having access to this list of 2,500 words would greatly improve the results here. In lieu of this, I can instead manually sift through our list of 60 words and remove the more esoteric ones. The remaining words form my curated list of 21 Best Words to Use as Wordle Guesses (in alphabetical order):

AISLE
ALOES
ARISE
AROSE
EARLS
LAIRS
LASER
LIARS
LIERS
LORIS
LOSER
OILER
ORALS
RAILS
RAISE
REALS
RILES
ROILS
ROLES
SLIER
SOLAR

Let’s actually try some of the words on this list as the first guess in a couple of real Wordle games. In Figure 4 I show my Wordle results from Jan 22 (left) and Jan 23 (right).

Figure 4: LEFT — My Wordle (#217 from 22 Jan 2022) where I used the very first word, AISLE, from my curated list of 21 Best Words to Use as Wordle Guesses. RIGHT — My Wordle (#218 from 23 Jan 2022) where I tried another word from my list, RAISE, as a first guess (Author’s screenshot of Wordle).

Guessing AISLE on Jan 22 helped me solve the Wordle in 3 tries! After the initial guess, I used the frequencies from Figure 2 above to help guide my subsequent guesses. For example, I guessed MINCE before WINCE because M appears more frequently than W. Amazingly, each correct letter I guessed was in its correct position, illustrating that the role of luck in this game cannot be discounted. Guessing RAISE on Jan 23 also allowed me to solve the puzzle in 3 tries. I again referred to what I had learned about frequencies to help me select the next letters to guess.
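If you would rather not eyeball the bar chart for every follow-up guess, the same idea can be sketched as a simple score: add up the shares of a word's distinct letters and try the higher-scoring word first. This is only one possible heuristic, the function names are hypothetical, and the numbers below are placeholders chosen so that M ranks above W; they are not the measured shares from Figure 2.

```python
def frequency_score(word, freqs):
    """Sum the shares of the word's distinct letters (unknown letters count as 0)."""
    return sum(freqs.get(letter, 0.0) for letter in set(word))


def rank_guesses(candidates, freqs):
    """Order candidate words from highest to lowest frequency score."""
    return sorted(candidates, key=lambda w: frequency_score(w, freqs), reverse=True)


# PLACEHOLDER shares for illustration only (not the values from Figure 2).
placeholder_freqs = {"M": 0.03, "W": 0.02, "I": 0.06, "N": 0.05, "C": 0.04, "E": 0.10}

print(rank_guesses(["WINCE", "MINCE"], placeholder_freqs))  # ['MINCE', 'WINCE']
```

In practice you would pass in the letter shares computed from the full 5-letter word list instead of the placeholders.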

Now equipped with some data-driven insights, go forth and Wordle!

The code I wrote for this analysis along with the bar charts above can be found in my WORDLE-VISION GitHub repo.

[1] Lexico.com. “Which letters in the alphabet are used most often?” 2021. https://www.lexico.com/explore/which-letters-are-used-most (Accessed 23 Jan 2022)

[2] Victor, Daniel. “Wordle Is a Love Story.” The New York Times, 3 Jan 2022, https://www.nytimes.com/2022/01/03/technology/wordle-word-game-creator.html


Data Scientist at Disney+ with a PhD in astrophysics and 15 years of experience analyzing large data sets: www.raviryangupta.com