How Bayes’ theorem helped win the Second World War

Part I: Introduction to Bayesian statistics and how it cracked the Japanese naval cipher JN 25

U.S. Navy SBD-5 scout plane flying patrol over USS Washington and USS Lexington during the Gilbert and Marshall Islands campaign (cropped), 1943, Public Domain

During World War II, cryptanalysts in both the United States and the UK were frantically trying to decipher encrypted communications sent by the Axis powers’ military branches. Among these efforts, the British code-breaking center at Bletchley Park probably stands out in its fame as the birthplace of the British Bombe, the machine that would decode the infamous German cipher Enigma between 1939 and the end of the war. While the Bombe and its successor models have received most of the public’s attention, their efforts would often have been futile without the support of statistical methods. The goal of this article is to cast light on one of these methods, namely Bayesian inference, first by introducing the theory behind it and then by outlining how it was used to crack Axis ciphers.

Hut 1 at Bletchley Park, photographed by Toby Oxborrow. Source: Flickr. This file is licensed under the Creative Commons Attribution-Share Alike 2.0 Generic license

Part I will be dedicated to the breaking of the Japanese Naval cipher JN 25, whereas Part II will outline the role Bayesian inference played in cracking Enigma. Large parts of this report are based on an article written by Edward Simpson, one of the code-breakers who worked at Bletchley Park. I have tried to make the topic more accessible by reviewing the Bayesian formalism and carrying out most of the math that is only implied in Simpson’s article. If at the end of this post, you find yourself intrigued by the topic, I encourage you to read the original article by Simpson for a fascinating first-hand report of his time at Bletchley Park.

The theorem

Like so many other famous results, Bayes’ theorem is surprisingly simple:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

The formula itself is easily derived and has many applications outside of Bayesian Statistics. However, its simplicity can be deceiving, as most of the theorem’s power lies in the interpretation of the probabilities P involved. The true controversy that has stuck with Bayes’ theorem for centuries lies in the way its usage challenges the more traditional frequentist approach: while so-called frequentists define the probability of an event as the "limit of its relative frequency in many trials" [1], Bayesians interpret probability as a measure of personal belief.

One may ask how anyone would dare to include something as subjective as belief in a mathematical theory. In voicing this critique, one would certainly be in good company: many heavyweights of statistics, including Fisher and Pearson, dismissed the Bayesian interpretation of probability on similar grounds [4].

To really understand the difference between the two approaches and be able to pass fair judgment, let us consider the following example:

Imagine a friend challenges you to a game of flipping coins and promptly produces a coin that she would like to use. Before agreeing to this game, you, being rather suspicious by nature, would like to ascertain that the coin she handed you is fair, i.e. that the chances of it landing on heads and tails are equal.

Frequentist approach

A frequentist will frame this problem as a hypothesis test with null hypothesis H₀: the coin is fair, and alternative hypothesis H₁: it is not. She will then decide on a number of trials (say, 𝑛=100) and a significance level (e.g., 𝛼=0.05). After flipping the coin 𝑛 times, the outcomes are recorded. If we let 𝑘 denote the number of times we observe heads and 𝑝 the probability that the coin lands on heads, then, following a binomial distribution, the probability of a particular outcome is given as

$$P(k \mid n, p) = \binom{n}{k}\, p^{k} (1-p)^{n-k}$$

Given the significance level 𝛼, she can then compute a rejection region, i.e. the values of 𝑘 for which the null hypothesis H₀: 𝑝=0.5 can be safely rejected. The region is bounded by the smallest deviation 𝑐 from the expected count of 50 heads that satisfies

$$P\big(|k - 50| > c \,\big|\, H_0\big) = \sum_{|k-50|>c} \binom{100}{k} \left(\tfrac{1}{2}\right)^{100} \le \alpha$$

Thankfully, people have dealt with these kinds of problems in the past, so rather than actually solving this equation herself, she can simply look up the solution in a table or use any statistical software to give her the answer. It turns out that for this problem 𝑐=10, so H₀ can be rejected if |𝑘−50|>10, or equivalently if 𝑘<40 or 𝑘>60.

Let us assume that out of 100 coin flips, 𝑘=73 came back as heads. Since 73 lies in the rejection region, the frequentist concludes that the coin is rigged; the test is constructed so that, if the coin were in fact fair, the chance of such a false alarm would be below 5% (the significance level).
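
For completeness, here is a quick sketch of how that lookup could be done with today’s software (assuming a reasonably recent scipy that provides binomtest); it reproduces the rejection boundaries quoted above and confirms that 𝑘=73 falls deep inside the rejection region:

```python
from scipy.stats import binom, binomtest

n, p0, alpha = 100, 0.5, 0.05

# Boundaries of the two-sided rejection region of the exact binomial test
lower = binom.ppf(alpha / 2, n, p0)   # 40: reject H0 if k < 40
upper = binom.isf(alpha / 2, n, p0)   # 60: reject H0 if k > 60
print(lower, upper)

# p-value for the observed outcome of 73 heads in 100 tosses
print(binomtest(73, n, p0).pvalue)    # far below alpha, so H0 is rejected
```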

Bayesian approach

You, a firm supporter of Bayesian Statistics, enter the scene. You are appalled by the "waste of time" and suggest that all of this could have been done with far fewer trials. To understand how you would proceed, let us first revisit Bayes’ theorem:

$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)}$$

Notice that we have replaced the variables A and B with the more meaningful symbols 𝜃 and 𝐷. 𝜃 denotes the model’s parameters, which in our case reduce to the probability of heads 𝑝, a characteristic inherent to the coin and the unknown value we are trying to infer. 𝐷 stands for the observed data, i.e. the number of times the coin lands on heads and tails.

  • 𝑃(𝜃|𝐷) is the probability of 𝜃 given the evidence or data 𝐷. In other words, the probability that our coin’s 𝑝 has a particular value after having landed on heads a given number of times. This is called the posterior.
  • 𝑃(𝐷|𝜃) answers the following question: what is the likelihood of observing the data 𝐷 given 𝜃? This is something the frequentist computes as well and it is called the likelihood.
  • 𝑃(𝜃) is the statistician’s belief in how likely different values for 𝜃 are before observing any data. It is commonly known as the model prior.
  • 𝑃(𝐷) is usually called the marginal and describes the probability of observing the data, independent of the model parameters. It ensures that the posterior distribution is normalized, but we can often find ways to avoid its direct computation.

The first thing we need to do is decide on a model describing the coin-tossing experiment. Just as before, the binomial distribution is our function of choice:

$$P(D \mid \theta) = P(k \mid n, \theta) = \binom{n}{k}\, \theta^{k} (1-\theta)^{n-k}$$

The data D comprises both the number of tosses 𝑛 and the number of heads 𝑘. Having defined the likelihood, we now have to pick a prior. Remember, the prior characterizes the statistician’s belief in parameter 𝜃 before seeing any data. So, if we are dealing with a trustworthy friend, we, as Bayesians, can pick a prior that has a sharp peak around 𝜃=0.5 (the value for a fair coin), as shown by the orange curve. If our friend has tried to pull tricks on us in the past, we might be more inclined to pick the blue curve as prior. It is wider and therefore puts less belief in any particular value for 𝜃.

As you’ve probably noticed, we didn’t write down any equations for the prior. Unfortunately, the math involved in Bayesian inference can be rather tricky. Oftentimes integrals cannot be computed analytically and one has to resort to numerical tools such as Monte-Carlo sampling. For that reason, we will rely on graphs to develop an intuitive understanding of Bayesian statistics.
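
That said, for a single-parameter problem like this one, the bookkeeping can also be done numerically on a simple grid. The sketch below does exactly that, with two assumptions of mine: the priors are taken to be Beta-shaped, a sharp Beta(50, 50) for the "High Trust" case and a broad Beta(2, 2) for the "Low Trust" case (the article’s plots don’t spell out the priors, but a Beta(2, 2) happens to reproduce the posterior mean of about 0.708 quoted further down), and the toss counts are those of the example that follows.

```python
import numpy as np
from scipy.stats import beta, binom

theta = np.linspace(0, 1, 1001)             # grid of candidate values for theta
dtheta = theta[1] - theta[0]

# Assumed priors: sharp "High Trust" vs. broad "Low Trust"
prior_high_trust = beta.pdf(theta, 50, 50)
prior_low_trust = beta.pdf(theta, 2, 2)

def posterior(prior, k, n):
    """P(theta | D) on the grid: likelihood times prior, normalized numerically."""
    post = binom.pmf(k, n, theta) * prior   # Bayes' theorem, up to the marginal P(D)
    return post / (post.sum() * dtheta)

post_6 = posterior(prior_low_trust, k=4, n=6)      # 4 heads out of 6 tosses
post_20 = posterior(prior_low_trust, k=15, n=20)   # 15 heads out of 20 tosses

print((theta * post_20).sum() * dtheta)            # posterior mean of theta, roughly 0.71
```

Plotting post_6 and post_20 against theta produces curves with the qualitative behaviour described next.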

Let’s say out of 6 tosses, the coin lands on heads 4 times. Using the "Low Trust" prior, the posterior distribution 𝑃(𝜃|𝐷) will look like this:

As you can see, the distribution has shifted to larger values of 𝜃 because heads appeared twice as often as tails. We can interpret the likelihood term as a filter acting on the prior distribution, only letting through values for 𝜃 that are more or less consistent with the data 𝐷.

The distribution is still rather wide, so we conclude that to make any reliable inference, we need more samples. After 20 tosses and having observed heads 15 times, the posterior looks like this:

We can conclude with high certainty that the coin is rigged, and we have been able to do so with 80 fewer coin tosses than the frequentist. Talk about efficient use of resources!

Now, should you agree to play with your friend? It depends. If she lets you pick either heads or tails, bet on heads! If, however, she insists on picking heads for herself, maybe it’s time to find a new friend…

One can actually calculate the expected gain for the favorable version of this game. If we bet one dollar on heads, then for a known value of 𝜃 our expected return is

$$\langle \text{gain} \mid \theta \rangle = (+1)\cdot\theta + (-1)\cdot(1-\theta) = 2\theta - 1$$

As we do not know the exact value for 𝜃, we need to integrate over its probability distribution:

$$\langle \text{gain} \rangle = \int_{0}^{1} (2\theta - 1)\, P(\theta \mid D)\, d\theta = 2\langle\theta\rangle - 1$$

For the last expression, we don’t actually need to compute any integrals. To get a rough approximation for the expectation value (the mean) of 𝜃 under the posterior distribution, it is enough to look at said distribution to find ⟨𝜃⟩≈0.7 (the exact value is about 0.708). So

$$\langle \text{gain} \rangle \approx 2\cdot 0.7 - 1 = 0.4$$

With an expected ROI of about 40% per coin toss I would say "go for it"; but before you bet all your money, make sure you’re aware of gambler’s ruin.

The USS Arizona (BB-39) burning after the Japanese attack on Pearl Harbor, Public Domain

Japanese Naval cipher – JN 25

Japanese perspective

The way the Japanese navy encoded its messages during World War II was rather straightforward. The sender would use a code book (I) to transform every word of his message into a five-digit number. As a safety measure, the sum of these five digits would always be divisible by 3. While it gave the Japanese a way to make sure their messages had been transmitted without error, this measure would also greatly help the Allied forces decode intercepted messages; but more about that later.

Having encoded the plain text, the sender would then consult an enciphering table for so-called additives (II). These five-digit numbers would be added (non-carrying) to the code to produce the final enciphered message (III). The receiver would have access to the same code book and additives and could simply decode the message by reversing the above procedure.
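
To make the arithmetic concrete, here is a small sketch of the enciphering step; the group 32151 and the additive 98213 are the ones that appear in the worked example further down, while the helper names are of course my own.

```python
def noncarry_add(code: str, additive: str) -> str:
    """Digit-wise addition without carrying, used to encipher a JN 25 code group."""
    return "".join(str((int(c) + int(a)) % 10) for c, a in zip(code, additive))

def noncarry_sub(cipher: str, additive: str) -> str:
    """The inverse operation: strip an additive from an enciphered group."""
    return "".join(str((int(c) - int(a)) % 10) for c, a in zip(cipher, additive))

def is_valid_group(code: str) -> bool:
    """Code-book groups were chosen so that their digit sum is divisible by 3."""
    return sum(int(d) for d in code) % 3 == 0

code, additive = "32151", "98213"        # code group and additive from the example below
cipher = noncarry_add(code, additive)    # -> "20364", the enciphered group
assert noncarry_sub(cipher, additive) == code and is_valid_group(code)
```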

Allied perspective

Being presented with a set of intercepted messages, the Allied cryptanalysts’ task would be to figure out the right additives. At the same time, so-called book builders would try to reproduce the Japanese code book by relying on a combination of linguistic and combinatorial skills.

As the cryptanalysts only had access to a limited amount of evidence, i.e. a given number of intercepted messages, the best they could hope for was to make a probabilistic statement about the most likely additives underlying a message. Therefore, (Bayesian) statistics was particularly suited to tackle this problem.

The deciphering process would start with a set of intercepted messages (IV) that were known to have the same additives. These messages were said to be "in depth" and one would speak of a "depth" of messages.

The five-digit ciphers were differenced against each other and the result was recorded (always picking the number below 55555). The logic behind this procedure was that, in taking differences between enciphered messages, the underlying additives would cancel out. The Allied forces already had knowledge of part of the Japanese code book, and many of the code groups were known. These were referred to as "good groups" and were tabulated along with their (non-carrying) differences (V). If 50 groups were known, 1225 differences had to be calculated and recorded.

Cryptanalysts would compare the differences calculated from the intercepted messages with those from good groups. If a match was found (in our case 22571, shown in red), a hypothetical additive was calculated (shown in green).
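
Continuing the sketch from above (the two helpers are repeated so the snippet runs on its own), the cancellation and the recovery of a hypothetical additive look like this. The second good group, 10680, is an assumption of mine, chosen only so that its difference from 32151 reproduces the 22571 of the example:

```python
def noncarry_add(x, y): return "".join(str((int(a) + int(b)) % 10) for a, b in zip(x, y))
def noncarry_sub(x, y): return "".join(str((int(a) - int(b)) % 10) for a, b in zip(x, y))

# Two good groups enciphered with the same, still unknown, additive 98213
c1 = noncarry_add("32151", "98213")
c2 = noncarry_add("10680", "98213")

print(noncarry_sub(c1, c2))         # -> "22571": the additive has cancelled out

# The match against the tabulated good-group differences suggests c1 conceals
# 32151, so a hypothetical additive follows by stripping that group off:
print(noncarry_sub(c1, "32151"))    # -> "98213"
```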

The final and most important task was to test this hypothetical additive for its validity. As a first step, the additive was subtracted from its corresponding encoded word in every message (VI). If the resulting code (shown in blue) broke the "divisible by 3" rule, the additive could be quickly discarded, saving the Allies a large amount of time. Usually, multiple potential additives would pass this test, and hence statistical analysis was used to determine their relative strengths. As evidence for a true additive, the cryptanalyst would use the presence of any other good groups in the above test. The math works out like this:

Let us denote by 𝐴 the event that the additive found is true, and by ¬𝐴 that it is not. We want to determine the posterior probability that the additive is true given the data 𝐷, i.e. the good groups that appear once the candidate additive has been stripped off:

$$P(A \mid D) = \frac{P(D \mid A)\,P(A)}{P(D)}$$

If we are only interested in the odds ratio (true vs. false), the marginal 𝑃(𝐷) in the denominator cancels out:

$$\frac{P(A \mid D)}{P(\neg A \mid D)} = \frac{P(D \mid A)}{P(D \mid \neg A)} \cdot \frac{P(A)}{P(\neg A)}$$

Not having any prior knowledge about the probability for A, one can always just use a prior that assigns equal probability to both events, i.e. 𝑃(𝐴)=𝑃(¬𝐴)=0.5. The last factor in the equation above then equals one, and one is left with determining the likelihood ratio, which is fairly straightforward to obtain. (Note that if we had any reason to believe that we are dealing with a correct additive, based on evidence other than the presence of good groups, we could always modify the prior to reflect this belief.)

As an example, let us consider the good group 32151, which can be recovered in Message 2. If 98213 is not a true additive, then 32151 is just a random five-digit number. The probability of this particular number occurring would be 1/10⁵; however, since the number must also satisfy the "divisible by 3" rule, we obtain 𝑃(32151|¬𝐴)=3/10⁵.

If 98213 is a true additive, then the likelihood 𝑃(32151|𝐴) is given by the relative frequency of occurrence of 32151 ("Fleet"). Certainly, words such as "Ship" or "Weather" occur more often than others like "Attack". This needs to be taken into account in a correct statistical treatment, as more frequent terms add more evidence towards the hypothesis. For the sake of this example, let us assume that every 80th intercepted word is "Ship" and every 250th is "Fleet". The total evidence for additive 98213 is then the product of the likelihood ratios of all good groups recovered in the depth; if, say, both "Ship" and "Fleet" show up, it amounts to

$$\frac{P(D \mid A)}{P(D \mid \neg A)} = \frac{1/80}{3\cdot 10^{-5}} \times \frac{1/250}{3\cdot 10^{-5}} \approx 417 \times 133 \approx 5.6\cdot 10^{4}$$

Technically, codes that are not contained in the list of good groups also add factors to the product above. However, as these factors are only slightly less than one, they can be safely ignored for the sake of saving time.

As a last simplifying trick, the likelihoods would be converted into their logarithmic values, trading complicated multiplication for simple addition. The log-likelihoods for every good group would be tabulated in advance, so that staff could simply look up values and the process of testing additives could be streamlined. One of the main advantages of this procedure was that the somewhat labor-intensive and repetitive task of testing additives could be outsourced to staff with only limited math knowledge. Expert opinion would merely need to be consulted for fringe cases or messages of great importance.
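
Put together, the whole test of a candidate additive reduces to a few subtractions, a divisibility check, and some table lookups. Below is a minimal sketch of that scoring procedure; the helper functions repeat those from the earlier snippets, and the frequencies, the group 10680 (standing in for "Ship"), and the two-message depth are illustrative assumptions rather than historical values.

```python
import math

def noncarry_sub(x, y): return "".join(str((int(a) - int(b)) % 10) for a, b in zip(x, y))
def is_valid_group(g): return sum(int(d) for d in g) % 3 == 0

# Assumed relative frequencies of two known good groups
GOOD_GROUP_FREQ = {"32151": 1 / 250,   # "Fleet", as in the example
                   "10680": 1 / 80}    # standing in for "Ship"
P_RANDOM = 3 / 10**5   # chance that a random five-digit group passes the mod-3 rule

def score_additive(additive, depth):
    """Accumulated log likelihood ratio (true vs. false additive) over a depth."""
    score = 0.0
    for group in depth:
        code = noncarry_sub(group, additive)   # strip the candidate additive
        if not is_valid_group(code):
            return float("-inf")               # fails the divisible-by-3 test
        if code in GOOD_GROUP_FREQ:
            # each recovered good group adds log[P(D|A) / P(D|not A)] to the evidence
            score += math.log(GOOD_GROUP_FREQ[code] / P_RANDOM)
    return score

depth = ["20364", "08893"]             # 32151 and 10680 enciphered with additive 98213
print(score_additive("98213", depth))  # large positive score: strong evidence
```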

Conclusion

I hope that I was able to cast some light on Bayesian inference and show how it was used at Bletchley Park to decipher JN 25. If you want to know more about the topic, check out Edward Simpson’s article here [3]. Stay tuned for Part II, where I will dissect the inner workings of the German Enigma and discuss how Bayes’ theorem helped break it.

References

[1] https://en.wikipedia.org/wiki/Frequentist_probability

[2] https://en.wikipedia.org/wiki/Bayesian_probability

[3] Edward Simpson, Bayes at Bletchley Park (2010), The Royal Statistical Society, http://mathcenter.oxford.emory.edu/site/math117/bayesTheorem/enigma_and_bayes_theorem.pdf

[4] Sharon Bertsch McGrayne, The Theory That Would Not Die: How Bayes’ Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy, Yale University Press

