I know it’s all just an accident of language, but the fact that you can rearrange the letters in "the eyes" into "they see" feels… magical.
Ever since I was a kid, I’ve enjoyed anagrams. So, I thought I’d try to build an anagram generator in Python. It turned into a fun little weekend project that uses hash tables, in the form of Python dictionaries, and recursion.
Hash tables are used to implement things like database indexes and caches. They enable quick lookup of data by mapping keys that are cheap to compute to a value. But if you haven’t been exposed to things like trees and graphs, you might not think about making the values of a hash table other hash tables.
As you’ll see, building a tree structure out of nested hash tables can help make searching for patterns in an enormous set of data more manageable. And recursion is the tool that allows you to express the search logic cleanly and compactly.
The anagram-generating algorithm that you’ll see here is a type of combinatorial search algorithm. These kinds of Algorithms play an important role in robotics and artificial intelligence, where the space of possible states grows exponentially, or even factorially.
Before we dig into generating anagrams, though, let’s take a look at a simpler problem: recognizing when two strings are anagrams of one another.
How To Check If Two Strings Are Anagrams
string1
is an anagram of string2
if the characters in string1
are a rearrangement of the characters in string2
.
The order of the characters doesn’t really matter, though. What matters is that the count of each character in string1
is the same as the count of each character in string2
. If each character appears the same number of times in each string, then the two strings are anagrams of each other.
Python’s [collections.Counter](https://docs.python.org/3/library/collections.html#collections.Counter)
– which is a specialized type of dictionary – can calculate this:
See that empty string ' '
in the Counter
object? That’s a problem. "Theater" and "the tear" are anagrams, even though one has a space and the other doesn’t. Anagrams are case-insensitive. They may also have different punctuation, as in "vitalise" and "IT’S ALIVE!"
A robust solution needs to take all this into account, so collections.Counter
won’t cut it without some preprocessing:
Like most problems, there are many ways to solve this one.
I wanted a solution that is easy to understand and easy to change in case it doesn’t handle every case it should. I went with the string [.translate()](https://docs.python.org/3/library/stdtypes.html#str.translate)
method, which replaces each character in a string according to a translation dictionary. Characters mapped to None
are removed from the string. Plus, I can easily create translation dictionaries with [str.maketrans()](https://docs.python.org/3/library/stdtypes.html#str.maketrans)
.
This gives me fine-grained control over how the strings are processed before being passed to Counter()
:
It might be better to use [.casefold()](https://docs.python.org/3/library/stdtypes.html#str.casefold)
instead of .lower()
in the process()
function. But overall, this makes for a robust solution. It’s easy to understand, provided you know about Counter
, .translate()
, and str.maketrans()
. And it’s easy to swap out processing functions if needed.
But what about finding anagrams for a particular string?
How To Generate Anagrams
It’s harder than checking if two strings are anagrams of each other.
Simply generating all possible rearrangements of a string isn’t enough. You have to insert spaces and punctuation in places that make sense. The resulting words in the string need to be actual dictionary words.
We need a way to efficiently generate words that use letters from a string.
Solving It On Paper
Set aside the problem of obtaining a list of words and start simple:
eyes
the
they
see
sea
The process goes like this:
- Write down the phrase you want to generate anagrams for, such as "the eyes."
- Pick a letter from the phrase (say, "t") and scan the list for words that start with that letter (words = "the" and "they"). Cross off one instance of the letter in the phrase, and write down the letter as the first letter of an anagram (anagram = "t").
- Pick a letter from the phrase that hasn’t been crossed off yet (say, "h"), and filter the words selected in the last step to those whose second letter is the new letter you picked (words = "the" and "they"). Cross off the letter in the phrase and add it to your anagram (anagram = "th").
- Continue this process of selecting unused letters and filtering the list of words. When you reach the end of a word, check if the phrase you’ve generated so far is an anagram of the original phrase. If it is, you’re done. If not, start over with the complete list of words but only use letters that haven’t been crossed off.
- Repeat the whole process using different initial letters from the phrase.
If you follow this algorithm step-by-step on the small list of words above, you’ll end up with four anagrams:
- the eyes
- eyes the
- they see
- see they
Of course, "the eyes" is the phrase you started with. That’s fine. Just discard it.
Wouldn’t it be great if, instead of scanning the entire list for all words starting with the letter T, we could jump straight to them? Or jump to all the words that start with TH? Or begin with THE?
Sounds like we need a hash table!
What we really want is a hash table of hash tables.
A hash table of hash tables is one way to represent a tree. And, if you stop and think about it, the method I’ve described for generating anagrams feels a lot like a depth-first search. That’s a big clue for how to implement the algorithm.
And Python has all of the pieces we need.
Building a Tree Of Dictionaries
The tree we’re going to build will look something like this:
Python dictionaries are hash tables, so they’re the natural choice for the basis of our tree. The keys for each dictionary are strings containing a single letter, and the values are more dictionaries. We’ll need to know when we’re at the end of a word, so we’ll mark those places with the dictionary {" ": {}}
.
The tree of dictionaries for the words eyes, the, they, see, and sea looks like this:
But how do you construct a dictionary like this?
The key observation is that you can add words to the dictionary recursively. Start with an empty dictionary. For each word, pop off the first letter – call it char
. If char
is a key in the dictionary, grab its value. Otherwise, set the key char
in the dictionary with an empty dictionary for its value. Then repeat the process, using the dictionary mapped to char
and the word without its first character.
Assume word
has no whitespace and tack the string " "
on to the end so that we get the terminal dictionaries right.
Here’s the code:
Recursion is nice because it can compactly describe a repetitive process. But it can be difficult to understand what a recursive algorithm does without reasoning through the steps by hand. Documentation with some good examples can help a reader (including your future self!) understand a recursive function.
Now that we’ve got a tree to work with, we can start generating anagrams.
Traversing The Tree To Make Anagrams
You can implement the anagram algorithm as a walk over the nodes in a word tree.
We pick an initial node – one of the characters in the phrase we want anagrams for – and mark that node as visited in an array. Then move to the branch in the tree rooted at that node. Keep repeating this until you visit all of the characters in the phrase or reach the end of a word. More recursion!
It’s essentially a depth-first search algorithm, except that the nodes available to walk to are limited by the nodes in the tree and the characters in the phrase that haven’t been visited yet.
This is captured in the following Python code:
The filter_()
function should return a tuple of nodes available to walk to. There’s some nuance to how exactly this works for finding anagrams. We’ll get to that in a bit and hold off on defining a filter function for now.
We’re still not done implementing the walk for the anagram algorithm.
walks()
yields tuples of nodes visited by a walk – characters from a phrase in this case – but walks end as soon as they reach the end of a word.
To produce multi-word anagrams you need to circle around and continue walking the tree from the root. I call this "walking around" the tree. You keep walking around and around, collecting words until the walk is done.
Here’s the code:
We need to implement done
and filter
functions to pass to walk_around()
, both of which depend on the phrase we’re generating anagrams for.
A walk is done when the string of characters visited in the walk is an anagram of the phrase. We already know how to do that! We can use the are_anagrams()
function:
Now we need to define filter_()
.
I said there was some nuance to this.
At each step in the walk, we only need to move to a character in the phrase that we haven’t visited yet. Or a punctuation character, since we want to include words like contractions. Oh, and we can also move to a " "
character, since that is how we know we’re at the end of a word.
We can express these rules cleanly using sets:
Counter
makes it easy to calculate the characters in the phrase that haven’t been visited yet.
Subtracting a Counter
from another returns a Counter
with the counts subtracted. Characters whose count is zero are removed. For instance, Counter("tea") - Counter("e")
returns Counter({'t': 1, 'a': 1})
.
We have all the pieces in place to write an anagram generator:
functools.partial()
is used here to "fill in" the phrase
parameters of the done()
and filter_()
functions. We have to do this because the done
and filter_
parameters of walk_around()
expect functions without a phrase
parameter. Whitespace in the anagram is removed with .rstrip()
since walks produced by walk_around()
end with a space.
Let’s take anagrams()
for a spin on a small list of words:
Hey, it works!
Testing On a Large Set Of Words
If your computer has a Unix-like OS, you have a large list of words available right now.
On most machines, there’s a words
file located at /usr/share/dict/words
or /usr/dict/words
. It’s a newline delimited list of dictionary words. If you have a Windows machine or don’t have a words
file, you can find a tarball for the file in your language on the GNU website.
Let’s try it out:
It works, but the words
file includes single letters and two-letter words like "te." That adds a bunch of noise to our anagrams. We should probably clean it up.
Let’s filter out some words we know that we don’t want:
This definitely gives us better results. For a better anagram generator, you’ll need to spend some time curating words. And the words
file isn’t necessarily the best set of words to start with, since it doesn’t contain any contractions.
You can generate more interesting word lists using aspell’s SCOWL tool.
So What’s The Point Of All This?
First of all, it was a fun project to tackle and touched on a number of concepts I don’t get to use on a daily basis. That alone makes it worthwhile.
But the solution itself is interesting in general. It suggests an approach for constructing objects with combinatorial restrictions. One can imagine a number of applications for that, from constructing codes and genetic sequences to finding extremal solutions to combinatorial problems.
Not bad for a weekend project!
Become a better coder in 5 minutes.
One email, every Saturday, with one actionable tip.
Always less than 5 minutes of your time.
Subscribe here