
Benford’s Law – A Simple Explanation

What Latif Nasser didn't tell us about the "First-Digit Law" on his Netflix show, Connected.

Photo by Nick Hillier on Unsplash

If you haven’t seen it yet, check out the Netflix series, Connected. It’s a good show. The host, Latif Nasser, discusses various subjects in popular science. Netflix touts it as a series "that investigates the surprising and intricate ways in which we are connected to each other, the world and the universe." [1]

In Numbers, the fourth episode, Latif explores Benford’s Law (BL), also known as the First-Digit Law. It’s the observation that in many datasets, both man-made and from nature, more numbers start with the digit 1 than with any other digit – about 30% of all entries. The frequency drops steadily for each subsequent leading digit, down to 9, which appears as the first digit of only about 5% of the numbers. This is surprising because you would expect the nine possible leading digits to be evenly distributed at around 11% each.
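Those predicted frequencies come from a simple formula: the probability that a number’s leading digit is d is log10(1 + 1/d). A quick Python sketch (mine, not from the show) reproduces the percentages:

```python
import math

# Benford's Law: the probability that a number's leading digit is d
# is log10(1 + 1/d), for d = 1 through 9.
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"{d}: {p:.1%}")  # 1: 30.1% ... 9: 4.6%
```

Note that the nine probabilities sum to exactly 1, since the product of the ratios (2/1)(3/2)…(10/9) telescopes to 10.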

The show looks at the history of BL and shows that such varied datasets like the sizes of volcanoes, combined lengths of notes in classical music, and financial statements from companies seem to follow BL.

There is one question that was asked, but never really answered – why do many datasets follow BL? The show implies that the existence of BL reveals some sort of deep cosmic scheme of the universe.

Here’s a simple explanation that was never mentioned by Nasser or anyone he interviewed:

Datasets composed of numbers that are products of multiple, independent factors will tend to follow Benford’s Law.

This explanation has been known for a while [2][3][4] but didn’t quite make it into the show. Benford’s Law is not a mysterious property of our universe. It’s just basic math.

Overview

In this article, I’ll cover a brief background of BL, explain two key concepts: normal distributions and logarithms, show how a dice rolling exercise can lead to BL, and finally take a look at some real datasets to see if this explanation holds up.

Simon Newcomb photo from Wikimedia, Public Domain, Frank Benford photo illustration by the author, based on an image on Nigrini.com in the Public Domain

Background

Benford’s law is named after the American physicist Frank Benford, who published a 1938 paper called "The Law of Anomalous Numbers" describing the frequencies of first digits observed in datasets [5]. Note that this phenomenon had previously been observed and published by the Canadian astronomer Simon Newcomb in 1881 [6].

(Brief side note: Things are often named for someone who wasn’t the first to discover them. In fact, there is a name for this, Stigler’s Law of Eponymy. It was proposed by the American statistics professor Stephen Stigler in 1980 when he wrote that no scientific discovery is named after its original discoverer [7]. In an ironic twist, Stigler acknowledged that the American sociologist Robert Merton had previously discovered "Stigler’s law".)

Back to Benford. He observed in his paper that many diverse datasets closely adhered to the following distribution of first digits, shown as percentages in the chart below.

Image by Author

The datasets Benford looked at included such diverse things as the populations of cities, atomic weights of compounds, financial expenses, even all the numbers he could find in a particular newspaper. Here’s an excerpt from his paper.

Excerpt from Frank Benford's paper "The Law of Anomalous Numbers", 1938, Public Domain

He goes on to explain his law of observed first digit frequency in mathematical terms, but he doesn’t really pose a reason for it. He writes that the law "evidently goes deeper among the roots of primal causes than our number system unaided can explain." [5]

Up next, we’ll learn about normal distributions by looking at highway traffic speeds.

Image by Christine Sponchia from Pixabay

Normal Distributions

You have probably seen what a normal distribution looks like. It’s the famous "bell curve" from probability theory. A normal distribution, also known as a Gaussian distribution, is a type of continuous probability distribution for a variable.

For example, imagine that a city’s planners want to check the average speed of a particular spot on a highway. They put down a pair of sensor strips on the road and start recording the speeds of cars going by. Take a look at the sample data below in a histogram.

Image by Author

The average speed of vehicles is about 52 MPH and most travelers go between 45 and 60 MPH. There are a few exceptions. At least one car was poking along at 35 MPH and another car was speeding at over 70 MPH.

Logarithms

The second key concept is understanding how logarithms work. A logarithm is the inverse of exponentiation. For this discussion, we’ll stick to base-10 logarithms, although others exist. For example, if we take 10 to the 5th power we get 100,000 (a leading one with five zeros). So the log of 100,000 gives us 5. And the log of 10,000 gives us 4. You get the picture.
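The same examples, checked with Python’s math module:

```python
import math

# A base-10 logarithm answers: "10 raised to what power gives this number?"
print(math.log10(100_000))  # 5.0, because 10**5 == 100,000
print(math.log10(10_000))   # 4.0, because 10**4 == 10,000
```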

Logarithms are useful for looking at data where the values near zero are clustered together, but the higher values are more spread out. Consider the two graphs below.

Images by Author

Both graphs show the same data on different horizontal scales. The top graph shows the data points on a linear axis, and the bottom graph shows the data points on a log axis. Notice how the data points on the log scale are more evenly spread out. Also on the bottom graph, notice how the intervals between the numbers with a leading digit of one are much bigger than the other intervals. I’ll tell you a bit more about these intervals in the following section.

For the next section, we’ll run through three simulations of rolling dice.

Photo by Riho Kroll on Unsplash

A Roll of the Dice

To get a better understanding of how distributions follow BL, we’ll take a look at three simulations using dice rolls.

Imagine that you are taking an online statistics class. It’s a big class. There are 10,000 students. The teacher asks each student to roll a six-sided die and enter their result into a spreadsheet. Here is a histogram of the results.

Image by Author

With 10,000 rolls and six possible outcomes for each roll, the expected result would be about 1,666 rolls per outcome. The results above are close to this, ranging from 1,624 for the number 1, to 1,714 for the number 5. This is an approximation of an even distribution.
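This experiment is easy to simulate. Here’s a Python sketch (the seed is arbitrary, so your exact counts will differ from the classroom numbers above):

```python
import random
from collections import Counter

random.seed(42)  # arbitrary seed for reproducibility

# 10,000 students each roll one six-sided die.
rolls = [random.randint(1, 6) for _ in range(10_000)]
counts = Counter(rolls)

# Each face should come up roughly 10,000 / 6, or about 1,667 times.
for face in range(1, 7):
    print(face, counts[face])
```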

Summing Up Dice Rolls

For the next exercise, the teacher asks each student to roll their die 100 times and add up the results. The sum for each student is approximately 350, because the average value of a single roll is 3.5 (the midpoint of 1 and 6) and each student performed 100 rolls. You can see the results in the histogram below.

Image by Author

There’s our normal distribution again. Some students got a low result below 300, and some got a high result above 400, but most were in the range of 330 to 370.

Why do we get a bell curve as a result? The Central Limit Theorem (CLT) from probability theory states that adding up independent random variables will tend towards a normal distribution [4].
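The summing exercise can be sketched the same way (seed arbitrary); the sample mean lands near 350, as the CLT predicts:

```python
import random
import statistics

random.seed(0)  # arbitrary seed for reproducibility

# Each of 10,000 students rolls a die 100 times and sums the results.
sums = [sum(random.randint(1, 6) for _ in range(100)) for _ in range(10_000)]

# The expected sum is 100 * 3.5 = 350, and by the Central Limit
# Theorem the sums are approximately normally distributed around it.
print(statistics.mean(sums))
print(statistics.stdev(sums))
```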

Multiplying Dice Rolls

For the third and final exercise, the teacher asks each student to roll their die again, but only 15 times, and then multiply the 15 results together. The products get pretty big, spanning many orders of magnitude. The histogram below shows the results, but this time the horizontal axis uses a log scale.

Image by Author

Once again, we see our normal distribution. But because the x-axis is on a log scale, the distribution is called lognormal. Why do we get this distribution? The Multiplicative Central Limit Theorem (MCLT) states that multiplying independent random variables will tend toward a Lognormal Distribution [4].
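Here is a sketch of the multiplication exercise (seed arbitrary). Taking log10 of each product exposes the roughly normal shape that makes the distribution lognormal:

```python
import math
import random
import statistics

random.seed(1)  # arbitrary seed for reproducibility

# Each student rolls a die 15 times and multiplies the results together.
products = []
for _ in range(10_000):
    p = 1
    for _ in range(15):
        p *= random.randint(1, 6)
    products.append(p)

# By the multiplicative CLT, log10 of the products is roughly normal.
logs = [math.log10(p) for p in products]
print(statistics.mean(logs), statistics.stdev(logs))
```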

Notice that the x-axis of the histogram above uses powers of 10 for the tick marks. The results range from one million (10⁶) to 10 trillion (10¹³). Let’s take a closer look at the interval on the x-axis between 10⁹ and 10¹⁰.

Image by Author

You can see that the big green stripe spans about 30% of the segment, and the following 8 intervals between the tick marks get smaller in size from left to right until the next green stripe starts. In fact, the spans are exactly the size of the intervals defined by BL.

Leading-digit interval  1-2   2-3   3-4   4-5   5-6   6-7   7-8   8-9   9-10
Log-interval width     .301  .176  .125  .097  .079  .067  .058  .051  .046

Because the distribution in the histogram is somewhat continuous (roughly following a smooth line) and it spans many orders of magnitude (having seven stripes), it would follow that the numbers in this example will conform to BL. The numbers fall into Benford-sized buckets.

This observation is covered in R. M. Fewster’s paper, "A Simple Explanation of Benford’s Law". He relates it to a hat with stripes. In his analogy, the hat represents a lognormal distribution of numbers, the rim represents the x-axis, and the stripes represent the 30.1% area that has digits that start with 1. If the stripes cover a proportion of the rim, they will cover approximately the same proportion of the whole hat, given enough stripes [8].

Let’s take a look at the frequency of leading digits for our third dice rolling exercise.

Image by Author

Sure enough, the frequency of leading digits lines up fairly well with Benford’s predictions (the orange diamonds). If we ran the simulation with more rolls per student and/or more students, the results would conform even more closely.
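To check this end to end, the sketch below (seed arbitrary) tallies the first digit of each simulated product and compares it with Benford’s prediction:

```python
import math
import random
from collections import Counter

random.seed(2)  # arbitrary seed for reproducibility

def leading_digit(n: int) -> int:
    """First decimal digit of a positive integer."""
    return int(str(n)[0])

# Each of 10,000 students multiplies 15 die rolls together.
products = []
for _ in range(10_000):
    p = 1
    for _ in range(15):
        p *= random.randint(1, 6)
    products.append(p)

# Compare observed first-digit frequencies with Benford's prediction.
counts = Counter(leading_digit(p) for p in products)
for d in range(1, 10):
    observed = counts[d] / len(products)
    predicted = math.log10(1 + 1 / d)
    print(d, f"{observed:.3f}", f"{predicted:.3f}")
```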

Note that not all datasets with lognormal distributions conform to BL. The caveats are spelled out in the paper "Benford’s Law: An Empirical Investigation and a Novel Explanation" by Paul D. Scott and Maria Fasli at Essex University [2].

"Data whose distributions conform to a lognormal distribution whose [standard deviation] exceeds 1.2 should give rise to leading digit distributions satisfying [Benford’s] law. Data that are likely to satisfy this criterion will: (1) Have only positive values. (2) Have a unimodal distribution whose modal value is not zero. (3) Have a positively skewed distribution in which the median is no more than half of the mean." – Paul D. Scott and Maria Fasli
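As a rough translation of those criteria into code (the function name and structure are my own; I assume the quoted standard deviation refers to the natural-log parameter of the lognormal, and I omit the unimodality check):

```python
import math
import random
import statistics

def likely_benford(data):
    """Rough sketch of Scott & Fasli's criteria:
    positive values, strong positive skew (median <= mean / 2),
    and log standard deviation above 1.2 (natural log assumed).
    The unimodality condition from the paper is not checked here."""
    if not all(x > 0 for x in data):
        return False
    if statistics.median(data) > statistics.mean(data) / 2:
        return False
    return statistics.stdev(math.log(x) for x in data) > 1.2

# A wide lognormal sample passes; a narrow uniform sample does not.
random.seed(3)  # arbitrary seed for reproducibility
lognormal_sample = [random.lognormvariate(0, 2) for _ in range(5_000)]
uniform_sample = [random.uniform(40, 70) for _ in range(5_000)]
print(likely_benford(lognormal_sample))  # True
print(likely_benford(uniform_sample))    # False
```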

Conversely, not all datasets that conform to BL have lognormal distributions. This fact is covered by Arno Berger and Ted Hill in their book, "A basic theory of Benford’s Law" [3]. For example, they mention that combining independent datasets will result in conformity with BL.

Next up, we’ll take a look at some datasets from the real world.

Photo by ben o'bro on Unsplash

Populations of Cities and Towns

One of the "poster child" datasets that closely follow BL is the populations of cities and towns. It doesn’t matter whether you look at cities, counties, states, or countries. As long as you have hundreds of data points spanning several orders of magnitude, the data seems to line up well with BL.

Below is a dataset of US cities and towns from the 2010 US Census [9]. It ranges from towns with a population of one person, like Chesterfield, Indiana, to huge cities with 3.7 million people, like Los Angeles.

Images by Author

The distribution is clearly lognormal, and the first digits closely follow BL.

Why do city/town populations have a lognormal distribution? My first thought is that there may be multiple, independent factors at play here. For example, cities have different areas, densities of housing units, and numbers of residents per housing unit. Multiplying these, and perhaps other factors together could lead to a lognormal distribution.
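As a toy illustration of that idea (the factors and their ranges below are entirely invented, not census-derived), multiplying even a few independent factors already pushes the log of the result toward a bell shape:

```python
import math
import random
import statistics

random.seed(4)  # arbitrary seed for reproducibility

# Hypothetical model: population = area * housing density * occupancy.
# These factor ranges are invented for illustration only.
populations = []
for _ in range(10_000):
    area = random.uniform(1, 100)         # square miles
    density = random.uniform(10, 2000)    # housing units per square mile
    occupancy = random.uniform(1.5, 3.5)  # residents per housing unit
    populations.append(area * density * occupancy)

# The log of a product of independent factors tends toward normal,
# so the populations themselves tend toward lognormal.
logs = [math.log10(p) for p in populations]
print(statistics.mean(logs), statistics.stdev(logs))
```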

The distribution of city populations has been studied. For example, there’s a paper by Ethan Decker, et al., entitled "Global patterns of city size distributions and their fundamental drivers" [10].

"Here we show that national, regional and continental city size distributions, whether based on census data or inferred from cluster areas of remotely-sensed nighttime lights, are in fact lognormally distributed through the majority of cities … To explore generating processes, we use a simple model incorporating only two basic human dynamics, migration and reproduction…" Ethan Decker, et al.

OK, it seems to be a growth thing. Next up, we’ll look at finances.

Photo by StellrWeb on Unsplash

Finances

Many datasets in the world of finance seem to follow BL. Accountants can use this fact to help detect fraud and other irregularities.

"Benford’s law has been found to apply to many sets of financial data, including income tax or stock exchange data, corporate disbursements and sales figures, demographics and scientific data." [11] – Mark Nigrini

The chart below shows all of the expenses that the State of Oklahoma paid out in 2019 [12].

Images by Author

You can see that there is a lognormal distribution, but it is leaning to the left a bit. Also, the first digits deviate a bit from BL. For example, numbers that start with the digit 9 seem to be out of conformance. It’s not clear if this analysis shows a problem with the books. I’ll leave it up to the auditors to take a closer look.

There is a nice paper on this subject by Cindy Durtschi, et al., called "The effective use of Benford’s law to assist in detecting fraud in accounting data" [13]. The paper has a table that shows which types of financial data are expected to follow BL.

C Durtschi, W Hillison, C Pacini - Journal of Forensic Accounting, 2004

Notice that the first two examples where Benford analysis is likely useful are values computed as number sold × price and number bought × price. These values are products of independent factors. Other multiplicative factors for these kinds of values may include taxes and percentage fees. This would likely make such accounting data follow a lognormal distribution if the values span several orders of magnitude.

Let’s take a look at something from nature next: the lengths of rivers.

Photo by Dan Roizer on Unsplash

In the prior examples, we saw datasets with lognormal distributions that follow BL and that are composed of things determined by humans: city/town populations and financial line items. But these types of datasets can also be found in nature, with little or no human involvement. For example, we’ll take a look at the lengths of rivers in New York State, using the data available at data.ny.gov [14].

Images by Author

This time the distribution leans to the right. We can also see that the numbers with a leading digit of one fall below the BL prediction. This is probably because the dynamic range, the ratio between the largest and smallest values, is not very large. There are only three green stripes in the distribution histogram, and none of them catch the peak of the curve, whereas the population and payments examples above had five and six green stripes, respectively.

Why do the lengths of rivers follow a lognormal distribution? Alex Kossovsky offers a reasonable explanation [15]. He states that …

"… lengths and widths of rivers depend on average rainfall (being the parameter) and rainfall in turns depends on sunspots, prevailing winds, and geographical location, all serving as parameters of rainfall." – Alex Kossovsky

Wait, what? Sunspots affect rainfall? Apparently so, according to NASA [16]. So it looks like lengths of rivers are determined by multiple, independent factors.

There are other places in nature where we can find datasets with lognormal distributions. For example, Malcolm Sambridge, et al., explore a number of physical datasets in their paper, "Benford’s Law in the Natural Sciences" [17]. Here’s a table from their paper.

From Sambridge S., Tkalčić H., and Jackson A., "Benford's law in the natural sciences"

You can see that these datasets follow BL fairly closely. As to why this happens, Alex Kossovsky sums it up fairly well [4].

"One plausible explanation for the prevalence of Benford’s Law in the natural sciences is that such physical manifestations of the law are obtained via the cumulative effects of few or numerous multiplicative random factors, all of which leads to the Lognormal as the eventual distribution …" – Alex Kossovsky

Summary

In this article, I gave an overview of Benford’s Law, with some background and history. I explained normal distributions and logarithms as background for understanding lognormal distributions. With some theoretical dice rolling exercises, I showed how multiple, independent variables can lead to normal distributions (with addition) and lognormal distributions (with multiplication). I then showed how datasets with lognormal distributions can conform to BL. Finally, I walked through three examples of real datasets (city/town populations, accounts payable, and lengths of rivers) to show how datasets with a lognormal distribution will tend to adhere to BL.

Future Work

Future work could include exploring how the two types of analyses, adherence to lognormal distributions and compliance with Benford’s Law, may be related when the datasets do not closely match the ideals. These tools together may help in determining any underlying reasons for any discrepancies in the data.

Acknowledgments

I would like to thank Jennifer Lim and Matthew Conroy for their help and feedback on this project.

Source Code

All of the data and source code for creating the graphs in this article are available on GitHub. The sources are released under the CC BY-NC-SA license.


References

[1] Netflix, Connected, 2020, https://www.netflix.com/title/81031737

[2] Scott, P. and Fasli, M., "CSM-349 – Benford’s Law: An Empirical Investigation and a Novel Explanation", 2001, http://repository.essex.ac.uk/8664/1/CSM-349.pdf

[3] Berger, A., Hill, T.P., "A basic theory of Benford’s Law", Probability Surveys, 2011, https://projecteuclid.org/download/pdfview_1/euclid.ps/1311860830

[4] Kossovsky A. E., "Arithmetical Tugs of War and Benford’s Law", 2014, https://arxiv.org/ftp/arxiv/papers/1410/1410.2174.pdf

[5] Benford, F., "The Law of Anomalous Numbers", Proceedings of the American Philosophical Society, 78, 551–572, 1938, https://mdporter.github.io/SYS6018/other/(Benford) The Law of Anomalous Numbers.pdf

[6] Newcomb, S., "Note on the frequency of use of different digits in natural numbers", American Journal of Math. 4, 39–40, 1881

[7] Stigler S., "Stigler’s Law of Eponymy", 1980, https://archive.org/details/sciencesocialstr0039unse/page/147/mode/2up

[8] Fewster, R.M., "A Simple Explanation of Benford’s Law", The American Statistician, Vol. 63, No 1, 2009, https://www.stat.auckland.ac.nz/~fewster/RFewster_Benford.pdf

[9] US Census Data, 2010, https://www2.census.gov

[10] Decker, E. H., Kerkhoff, A. J., & Moses, M. E. (2007). Global patterns of city size distributions and their fundamental drivers. PloS one, 2(9), e934. https://doi.org/10.1371/journal.pone.0000934

[11] Nigrini, M.J., "I’ve Got Your Number", Journal of Accountancy, 1999, https://www.journalofaccountancy.com/issues/1999/may/nigrini.html

[12] The state of Oklahoma, "Oklahoma’s Open Data", https://data.ok.gov, 2019

[13] Durtschi, C., Hillison, W. Pacini C., "The effective use of Benford’s law to assist in detecting fraud in accounting data", Journal of Forensic Accounting, 2004, http://www.agacgfm.org/AGA/FraudToolkit/documents/BenfordsLaw.pdf

[14] New York State, Waterbody Classifications, 2019, https://data.ny.gov/Energy-Environment/Waterbody-Classifications/8xz8-5u5u

[15] Kossovsky A. E., "Towards A Better Understanding of the Leading Digits Phenomena", 2006, https://arxiv.org/ftp/math/papers/0612/0612627.pdf

[16] Rind, D., "Do Variations in the Solar Cycle Affect Our Climate System?" Science Briefs, Goddard Institute for Space Studies, NASA, 2009, https://www.giss.nasa.gov/research/briefs/rind_03/

[17] Sambridge S., Tkalčić H., and Jackson A., "Benford’s law in the natural sciences", Geophysical Research Letters, Vol. 37, L22301, 2010, https://agupubs.onlinelibrary.wiley.com/doi/epdf/10.1029/2010GL044830
