
SARS genome fits Benford’s Law with the Cartesian product of nucleobases

~Python code~

Stars also follow Benford's Law in a multitude of ways. (Image by author)

Many people might have heard of Benford’s Law from "Digits," an episode of Netflix’s nerdy and entertaining documentary series Connected. Honestly, that’s where I first heard of it. The law describes the first-digit dominance of smaller numbers among 1 to 9: the hidden pattern we somehow all fit in and cannot escape from.

Given current events, I wondered how Benford’s Law could apply to viral genome sequences, since a large number of instances is needed for the law to apply properly. The genome of the SARS species (including SARS-CoV-2, commonly known as the novel coronavirus) is ~29,800 nucleotides, or nucleobases, long (a.k.a. the building blocks of the genome: A, T, G, C). Publications have already shown trends with Benford analysis on specific genes, but looking at the genome sequence alone does not seem to reveal much biological information, other than offering some fun "playing with" the law. Today, I will play with the law and unveil the "Benfordiness" of the SARS sequences. First, I imported the sequence text file and removed the line breaks. Then I split the string into individual letters and created a function, "count_pattern," just for counting. In this post, I show examples for both the 2003 SARS and the 2019 SARS-CoV-2 genomes.
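
The loading and counting steps can be sketched roughly as follows. This is a minimal sketch, not the original code: the name count_pattern comes from the text, but its body, the helper clean_sequence, and the toy FASTA-style record are assumptions made for illustration. It also works on the joined string directly rather than on a list of individual letters.

```python
def clean_sequence(raw_text):
    """Drop FASTA header lines (those starting with '>') and join the
    remaining lines into one uppercase string without line breaks."""
    lines = raw_text.splitlines()
    return "".join(
        line.strip() for line in lines if not line.startswith(">")
    ).upper()


def count_pattern(sequence, pattern):
    """Count non-overlapping occurrences of pattern (assumed behavior,
    based on Python's built-in str.count)."""
    return sequence.count(pattern)


# Toy FASTA-style text standing in for the real genome file (hypothetical).
example = ">toy_record hypothetical header\nATGCA\nTGGCA\n"
seq = clean_sequence(example)

print(seq)                        # ATGCATGGCA
print(count_pattern(seq, "CA"))   # 2
print(count_pattern(seq, "ATG"))  # 2
```

For a real genome, the text would come from reading the downloaded FASTA file instead of the toy string.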

One way to create more instances is to look at how many times a specific nucleobase (I call it bp) arrangement appears in the genome (e.g. how many times AG, AT, GC, ATG, GGG, CTA, ATGC, CGTT… are found). To exhaust the list, we just take the Cartesian product of the bp: A, T, G, C. [Briefly, what is a Cartesian product? It is permutation with repeats: say, putting apples, pears, and bananas, in some order, into four pockets #1, 2, 3, 4, with one fruit per pocket and the same kind of fruit allowed in more than one pocket. That gives 3 × 3 × 3 × 3 = 81 possible fillings; likewise, four bases in k positions give 4^k arrangements.] To avoid ending up with mostly single-digit counts, I tallied arrangements from a single bp (how many times A, T, G, C show up, respectively) up to 5 bp (e.g. AGTGC).
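
One way to enumerate these arrangements is with itertools.product from the standard library. The sketch below is an assumption about how the list could be built, not the original code; the names BASES, all_arrangements, and tally are hypothetical.

```python
from itertools import product

BASES = "ATGC"


def all_arrangements(max_len=5):
    """Yield every arrangement of A, T, G, C from length 1 up to max_len."""
    for k in range(1, max_len + 1):
        for combo in product(BASES, repeat=k):
            yield "".join(combo)


def tally(sequence, max_len=5):
    """Map each arrangement to its non-overlapping count in sequence."""
    return {p: sequence.count(p) for p in all_arrangements(max_len)}


patterns = list(all_arrangements())
print(len(patterns))   # 1364, i.e. 4 + 16 + 64 + 256 + 1024
print(patterns[:6])    # ['A', 'T', 'G', 'C', 'AA', 'AT']
```

Tallying up to 5 bp therefore yields 1,364 counts per genome, which is the pool of numbers whose first digits get examined.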

One "bug" worth mentioning is that Python 3’s str.count method counts only non-overlapping occurrences, so it does NOT reuse the same character in more than one match. For example, if encountering ‘..AAAA..,’ we know that there are technically 3 ‘AA’ in it, but the function would answer 2. Similarly, if encountering ‘..TTTTT..,’ the function would tally only 1 ‘TTT’ in this arrangement, while we would tally 3. To solve this problem quickly (forget elegance for one moment), I simply semi-manually added the counts of repeat=3 & 5 to the counts of 2 bp, repeat=4 & 5 to the counts of 3 bp, and repeat=5 to the counts of 4 bp. I am not digging deeper, but the math just works out this way. This correction only covers arrangements made of a single repeated base (AA, TTT, and so on); other self-overlapping arrangements, such as ATA inside ATATA, are also undercounted by str.count and are left as they are.
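
As an alternative to the semi-manual correction described above (and not the method used for the results in this post), overlapping occurrences can be counted directly by advancing the search one position at a time. The function name count_overlapping is hypothetical.

```python
def count_overlapping(sequence, pattern):
    """Count occurrences of pattern in sequence, allowing overlaps."""
    count = 0
    start = 0
    while True:
        idx = sequence.find(pattern, start)
        if idx == -1:
            return count
        count += 1
        start = idx + 1  # step one character, not len(pattern)


print("AAAA".count("AA"))                 # 2 (non-overlapping)
print(count_overlapping("AAAA", "AA"))    # 3
print("TTTTT".count("TTT"))               # 1 (non-overlapping)
print(count_overlapping("TTTTT", "TTT"))  # 3
```

This also handles mixed-base arrangements that overlap themselves: count_overlapping("ATATA", "ATA") gives 2, while "ATATA".count("ATA") gives 1.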

Interestingly, there seems to be a parity-dependent, alternating trend: counts for arrangements with an even number of bp follow a more Benford-like trend, meaning first digits are far more likely to be ones and twos, while counts for arrangements with an odd number of bp seem to greatly violate Benford’s Law. Note that the y-axis uses a natural-log scale rather than a linear one, so that the lower-count values show up well.

SARS by number of bp for Cartesian product (Image by author)
SARS-CoV-2 by number of bp for Cartesian product (Image by author)

When all the tallies are pooled together, however, the first-digit distribution looks Benford-like according to a Mann-Whitney test (P > 0.86 for both SARS and SARS-CoV-2 sequences, i.e. no significant difference from Benford’s distribution was detected). Although the Mann-Whitney test may not be the best at detecting nuances, I will leave it for now as the trend is clear. Black bars represent actual data, while the purple curve is Benford’s curve from the formula log10(1 + 1/d), where d is the first digit (1 to 9).
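
To make the comparison concrete, the sketch below computes Benford’s expected share for each first digit and the observed first-digit shares of a list of counts. It is an illustration under assumptions: the function names and the toy counts are hypothetical, zero counts are skipped because they have no first digit from 1 to 9, and the Mann-Whitney test itself is not reproduced here.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each first digit d under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit_shares(counts):
    """Observed share of each first digit among the positive counts."""
    digits = [int(str(c)[0]) for c in counts if c > 0]
    if not digits:
        return {d: 0.0 for d in range(1, 10)}
    digit_tally = Counter(digits)
    return {d: digit_tally.get(d, 0) / len(digits) for d in range(1, 10)}


expected = benford_expected()
print(round(expected[1], 3))  # 0.301
print(round(expected[9], 3))  # 0.046

# Toy counts (hypothetical), not taken from either genome.
toy_counts = [120, 15, 1, 0, 234, 19, 31, 8]
observed = first_digit_shares(toy_counts)
print(round(observed[1], 3))  # 0.571
```

With real data, the list of counts would be the pooled tallies of all 1 to 5 bp arrangements, and the two sets of shares are what the bar chart and curve compare.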

SARS sample result (Image by author)
SARS-CoV-2 sample result (Image by author)

What can we learn from this exercise? I am blown away by the prevalence of Benford’s Law: it seems to lurk underneath the counting of this world, one way or another. There are numerous ways to create instances to count (like the one introduced here). I am willing to bet that the conclusion from this post holds true for other genomes as well, although I am no expert in genetics. I also wonder whether randomly generated sequences would fit the conclusion. Is Benford’s Law underlying all measured things or just the "naturally occurring" ones? How is "naturally occurring" truly defined? Perhaps researchers could work on establishing more meaningful standards and guidelines for using Benford’s Law to detect fraud and provide actionable insights. An article showing that human brain electrical activity also obeys this omnipresent law can be found here. I will leave further insights and interpretations to whoever is interested, but I had fun exploring this topic a little.

Snapshot of text from FASTA
