STATISTICS

Ever heard a co-worker confidently declare something like "The longer I lose at roulette, the closer I am to winning"? Or had a boss demand that you not overcomplicate things and provide "just one number", ignoring your attempts to explain why such a number doesn’t exist? Maybe you’ve even shared a birthday with a colleague, and everyone in the office commented on what a bizarre cosmic coincidence it must be.
These moments are typical examples of water cooler small talk – a special kind of small talk, thriving around break rooms, coffee machines, and of course, water coolers. It is where employees share all kinds of corporate gossip, myths and legends, inaccurate scientific opinions, outrageous personal anecdotes, or outright lies. Anything goes. So, in my Water Cooler Small Talk posts, I discuss strange and usually scientifically invalid opinions I have overheard in the office, and explore what’s really going on.
🍨 DataCream is a newsletter offering data-driven articles and perspectives on data, tech, AI, and ML. If you are interested in these topics subscribe here.
Today’s water cooler moment comes from a curious observation about invoices:
I was going through last month’s invoices, and it’s weird – so many of them start with a 1 or 2. That’s just random, right?
Nope, it’s not random. 🙃
In fact, the distribution of first digits in many naturally occurring datasets follows a phenomenon called Benford’s Law.
What about Benford’s Law?
Benford’s Law refers to the observation that in many naturally occurring datasets the leading digit is much more likely to be a 1 than a 9. In particular, it provides a formula for the expected distribution of first digits in natural datasets, and makes similar predictions about the distribution of second digits, third digits, digit combinations, and so on. By contrast, assigned and fabricated numbers, such as telephone numbers or cooked financial statements, usually don’t conform to Benford’s Law.
The law is named after physicist Frank Benford, who explained it in his 1938 article ‘The Law of Anomalous Numbers‘. Nonetheless, Benford was not the first person to make this observation – Simon Newcomb had previously stated the law in 1881, which is why it is also referred to as the Newcomb–Benford law. More specifically, Newcomb noticed that the early pages of logarithmic tables, which corresponded to numbers beginning with 1, were noticeably dirtier and more frequently used than later pages containing numbers with larger leading digits. Notably, Newcomb also published the correct distribution. Later on, Benford demonstrated the law’s applicability across a wide range of datasets.
In particular, according to Benford’s Law, the probability of each first digit d from 1 to 9 in base 10 is given by:

P(d) = log₁₀(1 + 1/d)
Aww, how nice! 🤗 This formula results in a reverse logarithmic pattern – the number 1 is the leading digit about 30.1% of the time, while 9 appears as the first digit in only 4.6% of cases. Although counterintuitive at first glance, this strange pattern holds true for a surprisingly large variety of datasets, ranging from stock prices to earthquake magnitudes, and from lengths of rivers to electricity bills. The phenomenon is scale-invariant, meaning the pattern remains consistent regardless of the unit of measurement, such as meters, kilometers, or miles. Note, however, that the law only applies to datasets spanning several orders of magnitude – amounts bound within a limited range, like human height or exam scores, don’t comply with Benford’s Law.
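We can reproduce these percentages directly from the formula, for example in Python:

```python
import numpy as np

# Benford's Law: P(d) = log10(1 + 1/d) for d = 1..9
digits = np.arange(1, 10)
probs = np.log10(1 + 1 / digits)

for d, p in zip(digits, probs):
    print(f"P({d}) = {p:.1%}")
```

Running this prints P(1) = 30.1% down to P(9) = 4.6%, and the nine probabilities sum to exactly 1 (the logarithms telescope to log₁₀(10)), so the formula defines a proper probability distribution.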

But why does this happen? 🤨 Many real-world phenomena grow proportionally or exponentially – financial quantities like stock prices and interest rates, or population sizes, are typical examples. This kind of growth leads to datasets whose leading digits inherently fit a logarithmic distribution, because of the logarithmic spacing of numbers. For instance, on a logarithmic scale, the interval between 1 and 2 is much larger than the interval between 9 and 10. As a result, numbers with smaller leading digits – like 1 – appear more frequently, aligning with Benford’s Law. On top of this, random natural phenomena like river lengths or lake sizes generally follow highly skewed distributions that align with logarithmic spacing.
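This intuition is easy to check with a small simulation (a sketch, not tied to any real dataset): a quantity growing by a fixed percentage per step spans many orders of magnitude, and its leading digits end up closely matching Benford’s distribution.

```python
import numpy as np

# a quantity growing 3% per step over 1000 steps,
# spanning roughly 13 orders of magnitude
values = 1.03 ** np.arange(1000)

# the leading digit is the integer part of the base-10 mantissa
mantissas = 10 ** (np.log10(values) % 1)
leading = mantissas.astype(int)

observed = np.array([(leading == d).mean() for d in range(1, 10)])
benford = np.log10(1 + 1 / np.arange(1, 10))

print("observed:", np.round(observed, 3))
print("benford: ", np.round(benford, 3))
```

The two printed rows agree to within about a percentage point – exponential growth alone is enough to generate Benford-like leading digits.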

Catching cheaters with Benford’s Law
An unexpected but very popular application of Benford’s Law is fraud detection, spanning from cooked accounting numbers to fake election votes. Organic processes often generate data that follow Benford’s distribution, making Benford’s Law an effective and simple test for detecting fraud, and there are numerous examples of it being used successfully. It is important to highlight that a deviation from the Benford distribution does not necessarily mean that the data were manipulated – even rounding the data may result in deviation from the nominal distribution. Nevertheless, such a deviation certainly means that the data look suspicious and warrant a closer, more careful look. In other words, it serves best as an initial screening tool for fraud detection.
Being Greek, I find it especially fascinating to reflect on the case of Greece allegedly manipulating macroeconomic data to join the EU back in the day. 🤡 This is a well-known incident, which has been repeatedly and publicly discussed since then – the European Commission has officially confirmed concerns about the reliability of the provided data, leading to widespread suspicions of manipulation. In particular, the EU requires candidate countries to provide data for checking the Stability and Growth Pact criteria, like public deficit, public debt, and gross national product. Sadly, Greece’s data from 1999 to 2009 showed the largest deviation from Benford’s distribution out of the 27 EU member states. Again, data not conforming to Benford’s Law is not conclusive proof. But c’mon! 🤷‍♀️
Another classic example of fraud caught through Benford’s Law is that of financial advisor Wesley Rhodes, whose financial statements failed a first-digit Benford’s Law test. On closer inspection, it was revealed that Rhodes was pulling his numbers out of thin air and had stolen millions of dollars from investors.
Another famous but much more controversial application of Benford’s Law is election fraud. In the 2020 U.S. presidential election between Biden and Trump, some analyses claimed that Trump’s distribution of vote tallies complied with Benford’s Law, whereas Biden’s did not. Of course there was a bit of a fuss, but the deviation was ultimately explainable. A more controversial case is the Iranian 2009 election – overall, the vote counts appear not to satisfy Benford’s Law, and look suspicious. Nonetheless, there is a large discussion about the applicability of Benford’s Law to election fraud detection, as electoral data often fail to meet the necessary conditions for the law to hold.
Benford’s Law can also be very handy for fraud detection in social media. More specifically, the law applies to social media metrics, such as follower counts, likes, or retweets. This allows for the identification of suspicious behavior, such as bot activity or purchased engagement, by comparing these counts to Benford’s distribution.
Getting our hands dirty
We can easily check whether a dataset fits Benford’s Law in Python. This allows us to quickly flag a dataset as legitimate-looking or suspicious, and thus in need of further examination. To demonstrate this, I will be using this Kaggle dataset for credit card fraud detection. The dataset is licensed as Open Data Commons, allowing commercial use.
The dataset contains numerous columns, but I will only use the following two:
- Amount: indicating the amount of a transaction
- Class: indicating if the transaction is legitimate (0), or fraudulent (1)
So, we can import the necessary libraries and the dataset in Python by:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv("creditcard.csv")
Then, we can easily calculate the nominal probabilities predicted by Benford’s Law by applying the respective formula.
# calculate nominal Benford's Law probabilities
benford_probabilities = np.log10(1 + 1 / np.arange(1, 10))
Next, we calculate the distribution of leading digits for the legitimate and fraudulent subsets of the dataset. To do this, we extract the leading digit of the ‘Amount‘ column for each transaction.
# extract the leading (first nonzero) digit of a positive amount
def leading_digit(x):
    if x <= 0:
        return None
    # scale amounts below 1 (e.g. $0.76) up to their first nonzero digit
    while x < 1:
        x *= 10
    return int(str(x)[0])

# split into legitimate and fraudulent transactions
legitimate_data = data[data["Class"] == 0]
fraudulent_data = data[data["Class"] == 1]

# calculate observed leading-digit frequencies
def calculate_frequencies(df):
    observed = df["Amount"].apply(leading_digit).value_counts(normalize=True).sort_index()
    return observed.reindex(range(1, 10), fill_value=0)

legit_freq = calculate_frequencies(legitimate_data)
fraud_freq = calculate_frequencies(fraudulent_data)
Then, we can plot the two distributions – legitimate and fraudulent – against the nominal Benford’s Law distribution. For the visualizations I use the Plotly library, as usual.
import plotly.graph_objects as go

fig_legit = go.Figure()

# bar for observed frequencies
fig_legit.add_trace(go.Bar(
    x=list(range(1, 10)),
    y=legit_freq.values,
    name="Observed (Legitimate)",
    marker_color="#4287f5"
))

# line for Benford's Law probabilities
fig_legit.add_trace(go.Scatter(
    x=list(range(1, 10)),
    y=benford_probabilities,
    mode="lines+markers",
    name="Benford's Law",
    line=dict(color="orange", width=2)
))

fig_legit.update_layout(
    title="Leading Digit Distribution for Legitimate Transactions",
    xaxis=dict(title="Leading Digit"),
    yaxis=dict(title="Frequency"),
    height=500,
    width=800,
    barmode="group",
    template="plotly_white",
    legend=dict(title="Legend"),
)

fig_legit.show()
Similarly, we produce the plot for the fraudulent transactions.

Even at a glance, it is visually apparent that the legitimate transactions align much more closely with the nominal distribution, whereas the fraudulent transactions show significant deviations. To further quantify these deviations, we can calculate the difference between the observed and nominal probabilities for each leading digit and then aggregate them.
# calculate deviations from nominal distribution
legit_score = np.sum(np.abs(legit_freq - benford_probabilities))
fraud_score = np.sum(np.abs(fraud_freq - benford_probabilities))
print(f"Legitimate deviation score: {legit_score:.2f}")
print(f"Fraudulent deviation score: {fraud_score:.2f}")
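A more formal alternative to this simple absolute-deviation score is a Pearson chi-square goodness-of-fit test on the raw digit counts. Here is a numpy-only sketch with illustrative made-up counts (in the pipeline, the counts come from value_counts() on the leading digits, without normalize=True):

```python
import numpy as np

benford = np.log10(1 + 1 / np.arange(1, 10))

# illustrative digit counts for 1000 transactions
counts = np.array([302, 176, 125, 97, 79, 67, 58, 51, 45])
expected = benford * counts.sum()

# Pearson chi-square statistic, 8 degrees of freedom (9 digits - 1)
chi2 = np.sum((counts - expected) ** 2 / expected)
print(f"chi2 = {chi2:.3f}")

# the critical value for df=8 at the 5% level is about 15.51;
# a statistic above that would flag the digit distribution as suspicious
print("suspicious" if chi2 > 15.51 else "consistent with Benford")
```

With scipy available, scipy.stats.chisquare(counts, f_exp=expected) would additionally return the p-value directly.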

Clearly, something is going on in the fraudulent subset, and a closer, more in-depth investigation would be warranted.
But does this make sense? 🤨 In credit card fraud, the transaction amounts themselves are not typically fabricated – fraudsters aim to charge your credit card with very real amounts. However, in their effort to bypass certain security thresholds, such as a $50 or $100 per-purchase limit, they may produce irregular patterns that deviate from Benford’s Law. For example, attempting to stay below a $100 limit might result in an overrepresentation of transactions starting with 9, such as $99.99. Thus, while the data may not be outright fabricated, the irregularity in the patterns suggests that something unusual is happening.
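We can illustrate this threshold-gaming effect with a quick simulation (entirely hypothetical numbers): if most fraudulent charges sit just under a $100 limit, the digit 9 becomes wildly overrepresented.

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical mix: mostly just-under-the-limit charges near $99,
# plus some smaller cover transactions
amounts = np.concatenate([
    rng.uniform(90, 99.99, 400),   # just under a $100 limit
    rng.uniform(10, 89.99, 100),
])

first_digits = np.array([int(str(a)[0]) for a in amounts])
share_of_nines = (first_digits == 9).mean()

print(f"share of amounts starting with 9: {share_of_nines:.0%}")
print(f"Benford expectation: {np.log10(1 + 1/9):.1%}")
```

Here 80% of the simulated amounts start with 9, against a Benford expectation of just 4.6% – exactly the kind of glaring deviation a first-digit screen would catch.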
On my mind
Ultimately, Benford’s Law is not proof of fraud or data manipulation, but rather an indicator that something may be going on. If the data do not conform to the expected distribution, the first step is to look for an explanation of why – what aspect of the data may be unnatural, forced, or fabricated. When no logical explanation can be found, it is probably time to take a closer, more detailed look.
Data problem? 🍨 DataCream can help!
- Insights: Unlocking actionable insights with customized analyses to fuel strategic growth.
- Dashboards: Building real-time, visually compelling dashboards for informed decision-making.
Got an interesting data project? Need data-centric editorial content? Drop me an email at 💌 [email protected] or contact me on 💼 LinkedIn.
💖 Loved this post?
Let’s be friends! Join me on 💌 Substack 💼 LinkedIn ☕ Buy me a coffee!
or, take a look at my other Water Cooler Small Talks:
Water Cooler Small Talk: Simpson’s Paradox
Water Cooler Small Talk: What Does Having a High IQ Even Mean?