The potato train — using Python with extremely large numbers and arbitrary precision for binomial probability

Florin Andrei
Towards Data Science
9 min read · Jun 7, 2020


Math is hard, let’s go shopping — for tutorials, that is. I definitely wish I’d had a tutorial like this before I tried some things in Python involving extremely large numbers (binomial probability for large values of n) and my code started to crash.

But wait, I hear you saying, Python can handle arbitrarily large numbers, limited only by the amount of RAM. Sure, as long as those are all integers. Now try to mix some float values in, for good measure, and things start crashing. Arbitrarily large numbers mixed with arbitrary precision floats are not fun in vanilla Python.

Let’s try to gradually increase the demands on integer arithmetic in Python while calculating binomial distributions and see what happens.

Welcome to…

The potato train

This is the Cold War era. There’s a very large industrial center, off in some faraway, frozen lands. Nothing grows there, so the workers are being sent food for sustenance. Once in a while, a train full of potatoes arrives at the station.

[Image: the potato train]

Now, this is hard stuff — we’re talking blizzards and storms — so potatoes get damaged. It is a fact that, in every transport, 20% of the potatoes arriving at the station are bad.

Well, it’s your turn to make dinner, so grab a big, hefty bag and head off to the station. 20 potatoes will be enough for your team. But it’s lights out, pitch black, leadership has wisely allocated all energy to the factories in order to meet the quotas, so you pick potatoes at random, blindly.

What are the odds that all potatoes you pick will be good? What are the odds that just 1 potato will be bad? Or 2 bad, or 3? Or all bad? (which would definitely ruin dinner)

Or maybe the comrades said: “Look, we all know how it is, just try not to pick more than 4 bad potatoes this time around. Up to 4 bad out of 20 is fine. We’ll survive. Less is better. More is not good.” What are the odds that your team won’t have a sad dinner this time around?

[Image: potatoes]

This is where you meet your arch-enemy:

The binomial distribution

There’s a very large pile of objects that are either type A or type B. The fraction of type A objects in the pile is p, where 0 <= p <= 1; the rest are type B. Out of that pile, you pick n objects, with n much smaller than the size of the whole pile. What are the odds that, out of the n objects you’ve picked, exactly k will be type A?

In other words, you pick n=20 potatoes, out of a train where p=0.2 of all potatoes (or 20%) are bad. What are the odds that k potatoes (k between 0 and 20) out of the ones you pick are bad?

This is the formula for the odds:

P(k) = n! / (k! (n - k)!) × p^k × (1 - p)^(n - k)

Many Stats 101 books will give you some sort of proof (see Credits at the end). I just cook potatoes, so let’s move on.

But before we do, please notice the factorials in the formula. Factorials can grow very large very quickly. This will turn around and nicely bite us in the code later on.

But not just yet. We’ll start slow, we’ll use small numbers at first. Trust me, I’m an engineer. Everything will be fine — for a while.

Let’s code!

Fire up your Jupyter notebook and tag along. Let’s import the good stuff:
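Something along these lines will do, with math for the factorials, pandas for a results DataFrame, and tqdm for progress bars:

```python
import math                # factorials for the binomial coefficient
import pandas as pd        # a DataFrame to collect the probabilities
from tqdm import tqdm      # progress bar for the loop over k
```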

Implement the probability formula:
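This is the same function that shows up in the error tracebacks later in the article:

```python
def bin_prob(n, p, k):
    # binomial coefficient: n! / (k! * (n - k)!)
    arr = math.factorial(n) / math.factorial(k) / math.factorial(n - k)
    # multiply by p^k and (1 - p)^(n - k)
    bp = arr * (p ** k) * ((1 - p) ** (n - k))
    return bp
```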

And compute the probabilities for all k values:
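A sketch of the driver loop; the loop body matches the tracebacks below, while the DataFrame setup is just one reasonable way to do it:

```python
n, p = 20, 0.2

# one row per possible number of bad potatoes, 0 through n
df = pd.DataFrame(columns=['probability'])

# compute probabilities for all k
for k in tqdm(range(n + 1)):
    df.loc[k] = bin_prob(n, p, k)
```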

Let’s see the results:

print(df)
probability
0 1.152922e-02
1 5.764608e-02
2 1.369094e-01
3 2.053641e-01
4 2.181994e-01
5 1.745595e-01
6 1.090997e-01
7 5.454985e-02
8 2.216088e-02
9 7.386959e-03
10 2.031414e-03
11 4.616849e-04
12 8.656592e-05
13 1.331783e-05
14 1.664729e-06
15 1.664729e-07
16 1.300570e-08
17 7.650410e-10
18 3.187671e-11
19 8.388608e-13
20 1.048576e-14

The probability that all potatoes are good is about 1%. The probability that exactly 1 potato is bad is about 6%; that 2 potatoes are bad, 14%; for 3 bad potatoes it’s 21%, for 4 bad it’s 22%, and it’s downhill from there: 17%, 11%, and so on. That all potatoes are bad is very unlikely: roughly 1 case in 100 trillion.

How about cumulative probabilities — that 4 or fewer potatoes are bad? It’s what the teammates asked. Let’s see:

dfcs = df.cumsum()
dfcs.rename(columns={'probability': 'cumulative probability'}, inplace=True)
print(dfcs)
cumulative probability
0 0.011529
1 0.069175
2 0.206085
3 0.411449
4 0.629648
5 0.804208
6 0.913307
7 0.967857
8 0.990018
9 0.997405
10 0.999437
11 0.999898
12 0.999985
13 0.999998
14 1.000000
15 1.000000
16 1.000000
17 1.000000
18 1.000000
19 1.000000
20 1.000000

There’s a 63% chance that only 4 or fewer potatoes will be bad. That’s almost 2 out of 3 times that your teammates will find dinner at the very least acceptable. Hey, it could be worse.

But seeing is believing, so let’s plot the probabilities for all k values. Code:

df.plot(figsize=(16,9))
dfcs.plot(figsize=(16,9))

And image results:

[Plots: probability and cumulative probability for n = 20]

The plot(‘s waistline) thickens

But alas, that’s not where it ends. You cook so well, now they want you to make dinner for the whole Precision Tooling section of the factory. That’s a bunch of teams, and you figure you’ll need 200 potatoes for everyone. So seize the means of production, which in this case are a couple of burlap sacks, find someone to help you, and head to the train station.

That’s n=200, so probably about 40 bad potatoes or less would be acceptable to the comrades, assuming all teams are like yours. Let’s see what the odds are.

Change the value for n in the code and re-run it. And this is where things, like so many potatoes in the train, go bad:

---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
<ipython-input-4-c84b1e23a65f> in <module>
7 # compute probabilities for all k
8 for k in tqdm(range(n + 1)):
----> 9 df.loc[k] = bin_prob(n, p, k)
<ipython-input-3-60f4e81371ca> in bin_prob(n, p, k)
1 def bin_prob(n, p, k):
----> 2 arr = math.factorial(n) / math.factorial(k) / math.factorial(n - k)
3 bp = arr * (p ** k) * ((1 - p) ** (n - k))
4 return bp
OverflowError: integer division result too large for a float

Ah, yes. We’re computing 200! in there, which has 375 digits (no joke). The / operator wants to handle things as float, and Python can’t do float in that range.

But wait, the // operator also does division (floor division, close enough), and handles everything as the integer type. Python’s only limit for integers is the size of the RAM. Those two divisions always make integers anyway in that formula (think why), so it’s okay to use // instead. Let’s give it a try again at n=200:
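Only the first line of bin_prob changes; this is the version you can see in the next traceback:

```python
def bin_prob(n, p, k):
    # floor division keeps everything in integer land; both divisions
    # are exact here, so no precision is lost
    arr = math.factorial(n) // math.factorial(k) // math.factorial(n - k)
    bp = arr * (p ** k) * ((1 - p) ** (n - k))
    return bp
```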

It works beautifully:

[Plots: probability and cumulative probability for n = 200]

The odds that 40 potatoes or less will be bad are about 54%. That’s pretty sketchy, but hey, you’re there to work, not to enjoy fancy food, so toughen up, comrade.

Industrial scale potatomongering

Sometimes we fail. Sometimes we succeed. And sometimes, as they used to say in the Eastern Bloc, we succeed too well.

Despite the rotten potatoes, your cooking is so excellent now you have to make dinner for the whole factory. That’s 2000 potatoes for a lot of hungry workers. Get your teammates, hop on the truck, drive to the train station. For the glory of the Motherland.

What are the odds that 400 potatoes, or less, will be bad in the bounty you bring back from the train? Change the code so that n=2000 and run it again. The result is pretty sad:

---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
<ipython-input-5-d85e2d80093f> in <module>
7 # compute probabilities for all k
8 for k in tqdm(range(n + 1)):
----> 9 df.loc[k] = bin_prob(n, p, k)
<ipython-input-4-88c12493fe27> in bin_prob(n, p, k)
1 def bin_prob(n, p, k):
2 arr = math.factorial(n) // math.factorial(k) // math.factorial(n - k)
----> 3 bp = arr * (p ** k) * ((1 - p) ** (n - k))
4 return bp
OverflowError: int too large to convert to float

We’re trying to compute 2000! in there. That number has 5736 digits, and the binomial coefficient built from it is still hundreds of digits long. Computing the integer itself is fine; the trouble starts when we multiply it by the float probabilities, because plain Python floats top out around 10^308, nowhere near that range.

Decimal to the rescue. It handles arbitrarily large numbers, integer or not. By default it computes with a precision of 28 significant digits, but you can raise that as high as you want (values you create directly keep all their digits; the precision setting only applies to the results of arithmetic).

import decimal
print(decimal.getcontext())
Context(prec=28,
rounding=ROUND_HALF_EVEN,
Emin=-999999,
Emax=999999,
capitals=1,
clamp=0,
flags=[],
traps=[InvalidOperation, DivisionByZero, Overflow]
)

It even has some math functions embedded, so int(decimal.Context().power(2, 3)) is 2³ = 8. You may have to convert formats into Decimal while working with it, then back to regular float or int, but that’s a small price to pay for eliminating size limits. Just keep in mind it’s a little slower than regular math libraries, so break this glass only in case of emergency.

Let’s fix our function:
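Here is a sketch of what the fix can look like, reconstructed from the description below; the exact cell in the notebook may differ slightly:

```python
import decimal

def bin_prob(n, p, k):
    ctx = decimal.Context()
    # the binomial coefficient is an exact (possibly huge) integer,
    # and Decimal takes it without complaint
    arr = decimal.Decimal(math.factorial(n) // math.factorial(k) // math.factorial(n - k))
    # p and 1 - p are floats, so wrap them in Decimal before handing them to ctx.power()
    bp = (arr
          * ctx.power(decimal.Decimal(p), decimal.Decimal(k))
          * ctx.power(decimal.Decimal(1 - p), decimal.Decimal(n - k)))
    # convert back to a plain float before returning
    return float(bp)
```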

The reason we wrap everything in Decimal() in there is that we’re dealing with float values; if those arguments were integers, ctx.power() would take them directly. At the end, we convert the Decimal result back to a regular float before returning it.

And, wow, does it ever work:

[Plots: probability and cumulative probability for n = 2000]

The probability that 400 potatoes, or less, will be bad is 51%. Maybe you won’t be cooking for the whole factory for too long.

To infinity and beyond

Decimal is awesome. I’ve cranked it up to n=20000. It took 10 minutes to complete, single-threaded, but it did the job. You may want to switch to multiprocessing at that point.
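If you do go that route, a minimal sketch with multiprocessing.Pool (my quick variation, not code from the notebook) could look like this, spreading the k values across all CPU cores:

```python
import math
import decimal
from functools import partial
from multiprocessing import Pool

def bin_prob(n, p, k):
    # same Decimal-based function as above
    ctx = decimal.Context()
    arr = decimal.Decimal(math.factorial(n) // math.factorial(k) // math.factorial(n - k))
    bp = (arr
          * ctx.power(decimal.Decimal(p), decimal.Decimal(k))
          * ctx.power(decimal.Decimal(1 - p), decimal.Decimal(n - k)))
    return float(bp)

if __name__ == '__main__':
    n, p = 20000, 0.2
    with Pool() as pool:
        # each worker computes bin_prob(n, p, k) for a slice of the k values
        probs = pool.map(partial(bin_prob, n, p), range(n + 1))
```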

Source code

If you want to see the Jupyter notebook with the whole code for this tutorial, here it is.

The shortcut

Of course, you can always import scipy and do the same calculation using library functions…

import scipy.stats as ss
n, p, k = 2000, 0.2, 40
ss.binom.cdf(k, n, p)

…and that will give you the same cumulative probability we calculated the hard way above (use pmf() instead of cdf() if you want the plain probability instead of the cumulative one).

But that was not the point — the point was to show how Python can handle very large numbers outside of specialized libraries, and to explore a bit of statistics while we’re at it.

Credits

The Basic Practice of Statistics, 8th edition, by Moore, Notz, and Fligner.

