Wordle Word Length and Letter Frequency Analysis Using Julia

Exploring English word datasets to improve our game

Naresh Ram
Towards Data Science


Inspired by Wordle. Image by the Author.

Wordle™ is a great game that needs no introduction. But keeping a streak alive is no easy task. So, to better understand how the gameplay works, let’s load up the English words in Julia and perform a little analysis on them. We’ll look at letter frequencies in general and the 5-letter words specifically to help make better-informed guesses in the game. Along the way, let’s also look at why five was chosen as the word length for the game.

While I cannot guarantee it will reduce the number of guesses you make, I can guarantee a fun session of data analysis using Julia and Julia Plots.

At AAXIS, where I work, we implement business-to-business as well as business-to-consumer digital commerce solutions. Doing so often involves migrating large amounts of data from an older system to a newer one with different data structures. We use a variety of data tools to analyze the source data and ensure consistency. I routinely use the techniques outlined in this article for that purpose. A good way to become familiar with these techniques is to use them often, and the Wordle word list sounds like a fun way to practice.

So, let’s get started.

Setup

Photo by Linus Mimietz on Unsplash

If you are planning to work alongside me to learn Julia, I strongly recommend you go through this section. If not, you can safely skip it.

Installing Julia

Please install Julia from the Julia Language site. Installers are available for Windows, Mac, and Linux operating systems. I also installed the Julia Language Support extension for VSCode and am executing these commands in the Julia REPL in VSCode. That way, my graphs show up in the window, making it mightily convenient to analyze datasets.

Julia REPL in Visual Studio Code. Image by the Author.

Now, let’s go ahead and load up the graphics and other required libraries. After installation, pull up a Julia REPL window by typing CTRL-SHIFT-P and selecting Julia: Start REPL.

Type these in the Julia REPL to check to see if everything is working:

julia> println("Hello world")
Hello world

julia> f(x) = x^2 + 3x + 2
f (generic function with 1 method)

julia> f(2)
12

In VSCode, pressing SHIFT-ENTER while code is selected executes it and prints the result in the Julia REPL underneath. You can simply copy the code from this article, paste it into VSCode, select it, and press SHIFT-ENTER to execute it. To facilitate this, I will omit the julia> prompt in the code. Note that you can also type all these commands directly into the REPL, even the long ones with for-loops or functions in them.

Next, let’s load Plots. Your UI may look a little different, as Julia might download and install some packages. We will also need the DelimitedFiles and Statistics standard libraries:

import Pkg
Pkg.add("Plots")
using Plots
using DelimitedFiles
using Statistics

plot(f,-10,10)

That last line should generate a graph as seen in the screenshot above.

Word Datasets

Several word datasets are available online with varying degrees of cleanliness. Unfortunately, I don’t have permission to use most of them in this article. I have chosen SCOWL (Spell Checker Oriented Word Lists) and Friends, a database of information on English words useful for creating high-quality word lists suitable for spell checkers of most dialects of English, which is available for our use. Other word lists that are publicly available for educational purposes, should you decide to explore them, are listed here:

  • Infochimp's simple English word list is provided on GitHub; the GitHub page specifies that the copyright lies with Infochimp, though the link to Infochimp itself now goes to a non-existent page.
  • Collins Official Scrabble™ Words list is endorsed by WESPA for use in Tournament & Club play worldwide, excluding USA and Canada. The software is made available for private, non-commercial use only. It is an excellently curated list of English words.
  • The Wordle Guess List is the complete list of 13K allowed words in the game and is copyrighted by Josh Wardle and now probably by the NY Times. This list closely matches the subset of 5-letter words from the Scrabble list above. The answer list, which contains the 2.5K possible answers for Wordle, can be found here.

Now, let's go ahead and grab the SCOWL word list and get it prepared for analysis. Head over to http://app.aspell.net/create and download the word list. The options I selected are shown here. I used the 80-size list for the next part and named the file scowl_american_words_80.txt.

Image of options chosen for generating the word list. Screenshot by the Author.

I also downloaded the 35-size option and named it scowl_american_words_35.txt. I use this for analyzing more common words later in the article.

For word usage, I use the presidential speeches available at Miller Center¹. The speeches are in the public domain, so there are no restrictions on their use. Simply select the name of the president, click on the speech, and then click on the transcript. I copied the text and pasted it into a text editor (VSCode) for analysis.

Why 5-letter words?

Photo by Glen Carrie on Unsplash

My good friend and neighbor, Milan, brought up this question. He thought perhaps 5-letter words might be the most common in the English language. It’s certainly worth checking by loading the SCOWL word list dataset we discussed in setup and computing the frequency of word lengths.

rawwords = readdlm("data/scowl_american_words_80.txt", header=false,String)
343681×1 Matrix{String}:
"A"
"A'asia"
"A's"
"AA"
"AA's"
"AAA"
"AAM"

"zymurgies"
"zymurgy"
"zymurgy's"
"zythum"
"zyzzyva"
"zyzzyvas"
"zzz"

totalWords = length(rawwords)
343681

The aspell word list contains possessives like zymurgy's and proper nouns like Albert. Let's get rid of them:

dictwords = String[];
for word in rawwords
    # drop words containing apostrophes, hyphens, or spaces,
    # and words that don't start with a lowercase letter (proper nouns)
    if (length(word)>0 && !(occursin("\'",word)
            || occursin("-",word)
            || occursin(" ",word)
            || (word[1]<'a' || word[1]>'z')
            ))
        push!(dictwords,word)
    end
end

# get count
totalWords = length(dictwords)
244274

Next, we should build a histogram of word lengths. We can do this in three ways that I know of: a list comprehension, the map function, and a for-loop. They should all perform about the same in Julia. Let’s take a look:

# use list comprehension
wordlengths = zeros(Int64,0)
@time wordlengths = [length(x) for x in dictwords];
0.021251 seconds (45.30 k allocations: 4.246 MiB, 84.02% compilation time)

# use map
wordlengths = zeros(Int64,0);
@time wordlengths = map(word->length(word), dictwords);
0.024202 seconds (49.09 k allocations: 4.473 MiB, 85.97% compilation time)

# use for loop
wordlengths = zeros(Int64,0);
@time for word in dictwords
    push!(wordlengths, length(word))
end
0.027516 seconds (488.05 k allocations: 14.131 MiB)

Generally speaking, while the list comprehension performs slightly better, they are all pretty close, and I would pick the for-loop as it makes the logic clear and easy to follow. The wordlengths array now contains the length of each word. The min and max are:

lrange = minimum(wordlengths),maximum(wordlengths)
(1, 45)

The min is 1 and the max is 45, but for the purpose of the histogram, I decided to put all the words longer than 20 letters in the “20” bin ( len>20 ? 20 : len). Now let’s look at the distribution of word lengths. Julia Plots makes it easy.
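
Spelled out, that capping is a one-line comprehension. I put the result in a new array here (wordlengths20 is my name for it, not part of the original code) so that the summary statistics below still see the raw lengths; you would then pass wordlengths20 instead of wordlengths to histogram:

# cap lengths at 20 so the long tail lands in a single bin
wordlengths20 = [len > 20 ? 20 : len for len in wordlengths]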

julia> histogram(wordlengths,bins=20)

Word Lengths in the English words dataset. Image by the Author.

I actually ran the command above with some options to make the graph look a little more informative.

histogram(wordlengths,
    bins=20,
    xaxis=("WORD LENGTH"),
    yaxis=("COUNT"),
    xticks=([1:1:20;]),
    yticks=([0:5e3:4.5e4;],["$(x)k" for x=0:5:45]),
    label=("Word count"),
    xguidefontsize=8, yguidefontsize=8,
    margin=5mm, ylims = (0,4e4),
    framestyle = :box,
    fill = (0,0.5,:green),
    size=(800,420))

savefig("pics/2-english-word-lengths-v2.png")

And surprise, 5-letter words aren’t even in the 5 most frequent word lengths! They make up just 4.6% of our word dataset. Ravi Parikh² and Reginald Smith³ also report similar findings (5.2%). The difference may be because my word dataset probably lacks slang, hyphenated words, etc.
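
If you’d like to verify that 4.6% figure yourself, a quick count reproduces it with nothing beyond Base Julia:

# share of 5-letter words in the cleaned dictionary
count(x -> length(x) == 5, dictwords) / totalWords * 100   # ≈ 4.6%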

Firstly, the graph looks like a classic bell-curve distribution. This bell shape is a common feature of natural and psychological measurements⁴, such as the birth weights of babies, the heights of adult males, and blood pressure readings. I am no linguist, but to me this suggests that English word lengths evolved organically, much like a natural, random process.

julia> mean(wordlengths)
9.250849455938823

julia> median(wordlengths)
9.0
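
To confirm where the distribution peaks without squinting at the graph, a small tally does it. This is a sketch using Base only (counts is just a scratch array of mine; countmap from StatsBase would do the same job):

# count how many words have each length, then take the largest bucket
counts = zeros(Int, maximum(wordlengths))
for l in wordlengths
    counts[l] += 1
end
argmax(counts)   # the modal word length (9, per the stats above)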

The most common word lengths are 8 and 9, but I can’t seem to readily think of any. So, let’s list a few to make sure:

julia> filter(x->(length(x)==9), dictwords)[1:5]
5-element Vector{String}:
"aardvarks"
"aasvogels"
"abactinal"
"abamperes"
"abandoned"

julia> filter(x->(length(x)==9), dictwords)[27001:27005]
5-element Vector{String}:
"sassabies"
"sassafras"
"sassarara"
"sassiness"
"sassolite"

AASVOGELS? Apparently, it’s a South African vulture. SASSAFRAS? No wonder I am no good at Scrabble.

So, how many 5-letter words are there?

julia> length(filter(x->(length(x)==5), dictwords))
11210

As opposed to 36K 9-letter words, there are only 11K 5-letter words. Hmm …

Then why 5-letter words for Wordle?

Perhaps we use them more.

My editor Megan, the good lady who helps me with these articles, was of the opinion that an average person is most comfortable with 5-letter words. This would mean that 5-letter words should make up most of our communication and appear at a higher frequency. To validate this theory, I turned to presidential speeches.

Presidential speeches are often written professionally for a wide audience that includes people from different backgrounds and perspectives, so it’s important for the writing to be inclusive, culturally sensitive, and accessible. The speechwriter must also have a deep understanding of current events and political issues to effectively communicate the president’s stance and goals. There is a lot riding on a presidential speech, as it can impact public opinion and shape the political discourse. And hence, they make excellent representative texts of the language of that time.

And these speeches are in the public domain and easily accessible for us to analyze¹.

For this exercise, I grabbed speeches from the last four presidents of the United States of America. I then analyzed the word lengths and graphed them.

To make the graphing easy, I wrote a function to return the histogram data. Words longer than 18 letters get bucketed under 18.

# requires the WordTokenizers package (Pkg.add("WordTokenizers"))
using WordTokenizers

function wordLengthHistoFromFile(filename, doUnique=true)
    set_tokenizer(poormans_tokenize)

    # split the text in the file into word tokens
    words = collect(tokenize(read(filename, String)))

    if (doUnique)
        words = unique(words)
    end
    totalWords = length(words)

    # this array holds the counts; the index is the word length
    local wordLengths = zeros(Int64,18)

    # now iterate through the words
    for word in words
        l = length(word)

        # there should be very few words longer than 18 letters
        l = l<=18 ? l : 18;
        wordLengths[l] += 1
    end

    # return % of total word lengths used
    return wordLengths./(totalWords/100.0)
end

I am using the Poor man’s tokenizer. A tokenizer essentially splits a text into words or sub-words so we can process them individually. The Poor man’s tokenizer deletes all punctuation and splits on spaces, which isn’t all that sophisticated but is enough for our purpose.

The tokenizer takes a text as input and returns an array of individual words. We then simply loop through all the words in the array, compute each word’s length, and increment the bucket that represents that length.
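
To see what the tokenizer actually produces, here is a quick check (assuming the WordTokenizers package has been added via Pkg, just as we did with Plots):

using WordTokenizers
set_tokenizer(poormans_tokenize)
tokenize("Four score and seven years ago!")   # ["Four", "score", "and", "seven", "years", "ago"]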

Now that the function is ready, we can process the speech files in text format. First, let’s take a look at the inaugural speeches found here.

hBushJrInaugural = wordLengthHistoFromFile("data/speech-bushjr-inaugural.txt")
hObamaInaugural = wordLengthHistoFromFile("data/speech-obama-inaugural.txt")
hTrumpInaugural = wordLengthHistoFromFile("data/speech-trump-inaugural.txt")
hBidenInaugural = wordLengthHistoFromFile("data/speech-biden-inaugural.txt")

So, we process each inaugural speech and create an array of buckets. We horizontally concatenate them ( hcat) to get a single 18x4 array that we can plot.

plot(hcat(hBushJrInaugural,hObamaInaugural,hTrumpInaugural,hBidenInaugural),
    xticks=([1:18;]),
    yticks=([0:2:100;],["$(x)%" for x=0:2:100]),
    label=["Bush" "Obama" "Trump" "Biden"],
    yaxis=("% OF APPEARANCE"),
    xaxis=("WORD LENGTH"),
    xguidefontsize=8, yguidefontsize=8,
    margin=5mm,
    framestyle = :box,
    lw=2, size=(800,420), marker=(:circle),
    lc=[:red :blue :lightcoral :dodgerblue],
    mc=[:white :white :black :black]
    )

savefig("pics/3-presidential-speeches.png")

The first line above does the plot and then I save the graph as an image file.

Word length in presidential speeches. Image by the Author.

The graph here shows the distribution for each of the speeches. For example, just under 3% of all unique words in President Obama’s inaugural speech were 2-letter words.

5-letter words seem to be popular, along with 4- and 6-letter words. Note that while 9-letter words make up 15% of the dictionary, they constitute less than 8% of the unique words used in the speeches.
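
That 15% dictionary figure is easy to reproduce from the array we already built:

# share of 9-letter words in the cleaned dictionary
count(x -> length(x) == 9, dictwords) / length(dictwords) * 100   # ≈ 15%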

Now, let's look at a more unscripted setting. This should give us an idea of daily diction instead of carefully scripted work. I picked one press conference for each of the last four presidents and analyzed them.

hBushJrPC = wordLengthHistoFromFile("data/pc_bushjr_final.txt")
hObamaPC = wordLengthHistoFromFile("data/pc_obama_final.txt")
hTrumpPC = wordLengthHistoFromFile("data/pc_trump_laborday.txt")
hBidenPC = wordLengthHistoFromFile("data/pc_biden_first.txt")

And plot the results:

plot(hcat(hBushJrPC,hObamaPC,hTrumpPC,hBidenPC),
    xticks=([1:18;]),
    yticks=([0:2:20;],["$(x)%" for x=0:2:20]),
    label=["Bush" "Obama" "Trump" "Biden"],
    xaxis=("WORD LENGTH"),yaxis="% OF APPEARANCE",
    lw=2, size=(800,420), marker=(:circle),
    xguidefontsize=8, yguidefontsize=8,
    framestyle = :box,
    margin=5mm, ylims = (0,20),
    lc=[:red :blue :lightcoral :dodgerblue],
    mc=[:white :white :black :black]
    )

savefig("pics/4-presidential-press-conf.png")

Word length of presidential press conferences. Image by the Author.

Here the pattern is clearer. Except for President Biden (who seems to favor 4-letter words), 5-letter words are a clear winner.

In literature and online, we can find further evidence of this. The average word length of articles in the New York Times is apparently 4.9⁵. According to WolframAlpha, the average word length across English-language works is 5.1 characters. The same site also claims that the average word length for Encyclopaedia Britannica Online is 5.3 and for Wikipedia it is 5.2. This article has an average of 4.75 letters per word, computed here. That’s not bad company to be in.
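
If you want to check a text of your own against these numbers, a small helper in the same vein does it. This is a sketch (avgWordLength is my name, not from the original code) that reuses the WordTokenizers setup from earlier:

# average word length of a plain-text file
function avgWordLength(filename)
    words = tokenize(read(filename, String))
    return sum(length, words) / length(words)
end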

So, why 5 letters? In English, our diction appears to favor 5-letter words, even though there is a larger variety of longer words to choose from. Perhaps Josh Wardle tried four, five, and six letters and decided five made the most sense for the target audience and the target time to solve. The game should be challenging enough that you feel you accomplished something, but not too hard. Or, though less likely, he went through this kind of analysis and settled on five letters. I guess we will never know.

Letter Frequency

Photo by Isaac Smith on Unsplash

Now that we have put the question of why five letters to bed, let’s focus on improving our game.

One strategy for playing Wordle is to maximize hits (yellows or greens) so as to reduce the total number of guesses. An obvious approach is to choose words containing the most common letters⁶. Simulations at University College Dublin found that the mean number of guesses could drop from 5 to 4 based on a good first-word choice⁷. This is where letter frequency analysis can help.

First, let’s do some work on the entire word dataset we downloaded earlier. These will include words of all lengths. Once we have figured out the frequency in all English words, we can compare that with letter frequency in 5-letter words to see if there are any significant differences.

I can’t readily think of a way to do this using a list comprehension. However, map and a for-loop are both quite easy. Since map was a little faster than the for-loop earlier, let’s try that first.

charfrequency = zeros(Int64,26)
@time map(word->(
        map(
            char-> ((char>='a' && char<='z') ? charfrequency[char-'a'+1]+=1 : 1), collect(word))
        ), dictwords);

0.386160 seconds (5.85 M allocations: 142.942 MiB, 7.98% gc time, 26.68% compilation time)

# convert to percentage
charfrequency = charfrequency ./ (totalWords/100.0)

I don’t know about you, but if that wasn’t my code, I’d be a little lost as to what it is doing. Let’s see if for-loop is any better:

charfrequency = zeros(Int64,26)
@time for word in dictwords
    for char in collect(word)
        if (char>='a' && char<='z')
            charfrequency[char-'a'+1] += 1
        end
    end
end
0.437091 seconds (7.49 M allocations: 171.794 MiB, 6.45% gc time, 1.16% compilation time)
0.437091 seconds (7.49 M allocations: 171.794 MiB, 6.45% gc time, 1.16% compilation time)

# convert to percentage
charfrequency = charfrequency ./ (totalWords/100.0)

This one I can read: a for-loop over the words and then a for-loop over the characters. After that, I simply tally the letters. The performance isn’t bad either; a ten percent drop for increased readability seems worth it.

Let’s graph this array:

bar(charfrequency,
    orientation=:h,
    yticks=(1:26, 'A':'Z'),
    xlabel=("% OF APPEARANCE"),
    ylabel=("ALPHABET"),
    yflip=true,
    legend = :none, framestyle = :box,
    xticks=([0:10:110;],["$(x)%" for x=0:10:110]),
    xguidefontsize=8, yguidefontsize=8,
    margin=5mm, xlims = (0,110),
    ytickfont = font(6,"Arial"),
    fill = (0,0.5,:green),
    size=(800,420))

savefig("pics/5-dict-char-freq.png")

And here it is.

Percent of appearance of letters in the entire word dataset. Image by the Author.

The way the code works, we counted 2 for the two E’s in ‘SLEEP’. That is why E appears in over 100% of the dataset. We should count a letter only once per word, even if it appears multiple times. We can fix this by simply removing duplicate characters from each word before tallying.
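
Julia’s unique does exactly what we need when applied to a string:

unique("sleep")   # ['s', 'l', 'e', 'p'] (the repeated 'e' appears only once)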

charfrequencyu = zeros(Int64,26)
@time for word in dictwords
    for char in unique(word)
        if (char>='a' && char<='z')
            charfrequencyu[char-'a'+1] += 1
        end
    end
end

0.491082 seconds (7.41 M allocations: 241.289 MiB, 7.55% gc time)

# convert to percentage, as before, so both series share the % axis
charfrequencyu = charfrequencyu ./ (totalWords/100.0)

I changed the collect in the second for-loop to unique. Now, let’s graph both and see if anything changes:

bar(hcat(charfrequency, charfrequencyu),
    orientation=:h,
    yticks=(1:26, 'A':'Z'),
    xlabel=("% OF APPEARANCE"),
    ylabel=("ALPHABET"),
    yflip=true, framestyle = :box,
    xticks=([0:10:110;],["$(x)%" for x=0:10:110]),
    xguidefontsize=8, yguidefontsize=8,
    margin=5mm, xlims = (0,110),
    ytickfont = font(6,"Arial"),
    lc=[:black :black], mc=[:black :black],
    fill=[:green :blue], fillalpha=[0.5 0.5],
    label=["frequency" "unique"],
    size=(800,420))

savefig("pics/6-compare-vs-unique.png")

Here is the graph:

Comparison of the count of appearances with and without unique letters. Image by the Author.

I can’t see anything majorly different between the two patterns: the most frequent letters are also the ones that repeat the most. From this point forward, we will use the unique-letter counts to drive our word choices in Wordle.

The guess and answer word sets

The above analysis covers the entire English word dataset; would it be different with just the 5-letter words? Let's find out by making a new array with just the 5-letter words using Julia's filter function.

julia> guesswords = filter(x->(length(x)==5), dictwords);
julia> length(guesswords)
11210

The official Wordle guess list contains slightly under 13K words, so we are close enough. These are the words you are allowed to enter in the game box. The Wordle answer list is a subset of about 2.3K words, containing only the words that could possibly appear as the hidden answer. So, why the difference?

Josh Wardle and his partner, Palak Shah, created a curated list of answer words avoiding obscure and cryptic words to make the game fun for everyone⁸. While you’ll find ZOPPA is a valid guess word that you can use, don’t hold your breath for it to appear as the answer any time soon.

To see if there is a difference in letter frequency patterns between the two, let’s generate that list as well. I use the size-35 aspell word list as a starting point as it is supposed to have more common English words in it.

rawwords35 = readdlm("data/scowl_american_words_35.txt",header=false,String)
length(rawwords35)
50043

dictwords35 = String[];
for word in rawwords35
    if (length(word)>0 && !(occursin("\'",word)
            || occursin("-",word)
            || occursin(" ",word)
            || (word[1]<'a' || word[1]>'z')
            ))
        push!(dictwords35,word)
    end
end
totalWords = length(dictwords35)
39142

This list contains 39K words after cleanup, as compared to 244K words in the cleaned 80-size word list. Let’s filter for just the 5-letter words.

answerlist = filter(x->(length(x)==5), dictwords35)
length(answerlist)
3467

Wardle and Shah also removed all plural words from their answer list. Plurals such as CROPS, ROCKS, and DROPS have been removed from the answer list, whereas they are still present in the guess list. Let’s get a rough count of the plurals.

answerlist = filter(x->(x[5]!='s' ||  x[4]=='s'), answerlist)

length(answerlist)
2282

3467-2282
1185

35%, or 1185 of the 3467 five-letter words in our common-word list, are plurals! Words like CROSS and CLASS survive the filter and stay in the answer list; they end in S but have no 3- or 4-letter singular variant (all of these words have already appeared in Wordle, so no spoilers here).

This explains why, recently, my good friend Sirisha uncharacteristically struggled with that day’s word, finally solving it in 6 tries. She is an active member of my middle-school group, which solves Wordle daily, and she is usually very good at it. She felt she had too many choices, not knowing that roughly a third of the possible words, the plural forms, had been removed from the answer list⁸. Now armed with this information, she is unlikely to get stumped the next time an S is flagged green in the 5th slot. And so are we.

Before we start analyzing the answer list for patterns, let's take a look at the letter frequency in the three sets we have so far: the entire 80-size word list, the 5-letter guess list, and the 5-letter answer list.

Let’s write a helper function to compute the letter frequencies for a word list,

function letterFrequency(words)

    # count total words in the list
    totalWords = length(words);

    # initialize the frequency vector
    charFreq = zeros(Int64,26)
    for word in words
        for char in unique(word)
            if (char>='a' && char<='z')
                charFreq[char-'a'+1] += 1
            end
        end
    end

    # element-wise divide by totalWords and return
    return charFreq ./ (totalWords/100.0) ;
end

With the function ready, let’s compute the frequencies for each list.

cfDictwords = letterFrequency(dictwords)
cfGuesslist = letterFrequency(guesswords)
cfAnswerlist = letterFrequency(answerlist)

And now the comparison plot,

plot(hcat(cfDictwords, cfGuesslist, cfAnswerlist),
    label=["Dict Words" "Guess list" "Answer list"],
    yticks=([0:5:70;],["$(x)%" for x=0:5:70]),
    framestyle = :box,
    xlabel="ALPHABET",ylabel="% OF APPEARANCE",
    xguidefontsize=8, yguidefontsize=8,
    margin=5mm, ylims = (0,75),
    xticks=(1:26, 'A':'Z'),
    size=(800,420)
    )

savefig("pics/7-compare-guess-answer.png")

I concatenated the frequencies and then plotted them as a line plot.

Letter frequencies for words from the entire dataset, guess, and answer lists. Image by the Author.

In the graph above, each data point indicates the percentage of words that contain that specific letter. For example, 47% of all words in the full English word list have at least one A in them, while only about 37% of the words in the guess and answer lists contain an A. The difference is probably due to the longer words in the full list, since letters like A, E, I, and R appear with higher frequency in longer words. We already know that S scores so much lower in the answer list because of the removal of plurals.
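
The 47% figure for A, for instance, comes straight out of a one-liner:

# percent of dictionary words containing at least one 'a'
count(w -> 'a' in w, dictwords) / length(dictwords) * 100   # ≈ 47%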

Looking at the guess list may give incorrect patterns for S. So, let’s focus on the answer list and analyze it for trends in more detail.

Frequency of letters

We will now plot the letter frequency in sequential order so that the letters with the highest frequency of appearance are shown on top of the graph.

listOrder = sortperm(cfAnswerlist, rev=true)

bar(cfAnswerlist[listOrder],
    orientation=:h,
    yticks=(1:26, ('A':'Z')[listOrder]),
    xlabel=("% OF APPEARANCE"),
    ylabel=("ALPHABET"),
    yflip=true, framestyle = :box,
    xticks=([0:5:55;],["$(x)%" for x=0:5:55]),
    xguidefontsize=8, yguidefontsize=8,
    margin=5mm, xlims = (0,55),
    legend=:none,
    ytickfont = font(6,"Arial"),
    lc=(:black), mc=(:black),
    fill=(:orange), fillalpha=(0.5),
    size=(800,420))

savefig("pics/8-letterfrequency.png")

Julia’s sortperm returns a permutation vector I that puts cfAnswerlist[I] in sorted order. The bar chart receives cfAnswerlist[listOrder], the frequency values in descending order. The yticks parameter of the bar function is assigned ('A':'Z')[listOrder], which labels the bars with the letters in order of decreasing frequency.
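
A toy example makes the mechanics clear:

v = [0.2, 0.9, 0.5]
p = sortperm(v, rev=true)   # [2, 3, 1]
v[p]                        # [0.9, 0.5, 0.2]
(['A', 'B', 'C'])[p]        # ['B', 'C', 'A']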

Frequency of appearance of letters in the answer list. Image by the Author.

E is the most frequent, appearing at least once in 52% of the answer words. A, R, O, I, and T are the next five. ORATE appears to be a good starting word.

V, X, Z, Q, and J are the least common, with J appearing in slightly over 1% of the answer list. I haven't verified this, but the analysis stating that XYLYL was the worst starting word⁹ seems to be right.

How to use the distribution

The distribution of letters in the answer list is quite useful to study and remember. Let’s say you went with a first guess of ORATE or (my other favorite) AROSE. Suppose ORATE returned 3 yellows: R, A, and E. The next choice could be LASER or WAVER, to name a few. (I am assuming you are playing in “Hard Mode”.) The graph above tells us that L and S are more likely to be in the answer word than W and V; in fact, roughly 3 times as likely. So, my guess would be LASER.
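
That "roughly 3 times" claim can be sanity-checked directly against the frequency array; the letters index into cfAnswerlist just as in the tally code above:

freq(c) = cfAnswerlist[c - 'a' + 1]
(freq('l') + freq('s')) / (freq('w') + freq('v'))   # roughly 3, per the distribution above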

As another example, if AROSE returns no hits, I usually try UNWIT followed by GLYPH. For a no-hit on ORATE, SULCI can be a good second guess, followed by NYMPH.

As a strategy, you can either remember the full sequence of letters by frequency:

EAROI TLSDN CUHYP GMBKW FVXZQ J

Or, remember the three words ORATE, SULCI, and NYMPH (and D). When you get a hit on ORATE, say E, you can start picking letters from SULCI and NYMPH for your next word (CLUES comes to mind here).
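
That letter sequence, by the way, is just the bar chart's sorted tick labels, so you can regenerate it in one line rather than memorize it from the graph:

join(('A':'Z')[sortperm(cfAnswerlist, rev=true)])   # "EAROITLSDNCUHYPGMBKWFVXZQJ"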

Other researchers have recommended a similar strategy of picking three words with mutually exclusive letters. Nisansa de Silva¹⁰ recommends RAISE, CLOUT, and NYMPH. I personally have also used AROSE, UNWIT, and GLYPH in the past.

Conclusion

Photo by Sigmund on Unsplash

We’ve looked at two aspects of the game Wordle, using Julia to analyze the words. First, I tried to justify the choice of five letters for the game, and it turns out our speech and written works use 5-letter words at a higher frequency than words of other lengths. So, the choice of 5 letters makes a lot of sense.

Secondly, we looked at the letter frequencies in the entire English word list, and then in the 5-letter guess list and the condensed answer list. The distribution graph for the letters can be used to guess words more effectively. Remembering ORATE, SULCI, and NYMPH can help us guess words that are likely to return more hits, and using these letters as a guide will help improve your game.

With these tips in mind, all the best with your next Wordle game tomorrow!

References

  1. Miller Center of Public Affairs, University of Virginia. “Presidential Speeches: Downloadable Data.” Accessed March 17, 2022. https://millercenter.org/presidential-speeches-downloadable-data.
  2. Ravi Parikh, “Distribution of Word Lengths in Various Languages”, Website http://www.ravi.io/language-word-lengths
  3. Reginald Smith, “Distinct word length frequencies: distributions and symbol entropies”, https://arxiv.org/ftp/arxiv/papers/1207/1207.2334.pdf
  4. Liston Tellis, “Normal Distribution — The Bell Curve”, Medium, July 5, 2020. https://medium.com/analytics-vidhya/normal-distribution-the-bell-curve-4f4a5fc2caaa
  5. Ann Wylie, “What’s the best length of a word online?”, Ann Wylie’s blog of writing tips, https://www.wyliecomm.com/2021/11/whats-the-best-length-of-a-word-online/
  6. Ste Knight, “12 Wordle Tips and Tricks to Improve Your Score”, Make Use Of, Sept 2022, https://www.makeuseof.com/wordle-tips-hints-tricks/
  7. Barry Smyth, “Big Data in Little Wordle”, Jun 2022, Towards Data Science, https://towardsdatascience.com/big-data-in-little-wordle-306d5502c4d9
  8. Jonathan Lee, “The New York Times is finally making changes to Wordle”, The Washington Post, Nov 2022, https://www.washingtonpost.com/video-games/2022/11/07/wordle-new-answers-new-york-times-update/
  9. Brittany Alva, “The Worst Starting Word in Wordle, According to Science”, SVG, Mar 2022, https://www.svg.com/751712/the-worst-starting-word-in-wordle-according-to-science/
  10. Nisansa de Silva, “Selecting Seed Words for Wordle using Character Statistics”, Feb 2022, arXiv:2202.03457, https://arxiv.org/pdf/2202.03457.pdf


Chief Science Officer @ AAXIS.IO . Has fun solving complex business and technical problems. Link up at linkedin.com/in/nareshram