Naive Bayes and disease detection

Moving from PDFs to discrete values

R Andrew Cocks
Towards Data Science

Previously I wrote about Bayesian inference in 1760, where I looked at the Bernoulli and beta distributions.

But you don’t need a probability density function to use Bayes’ theorem: it also works on discrete values. Let’s look at disease detection in a population.

Take a disease that occurs in 1% of the population, and a test for it that is 99% accurate for sick and healthy people alike:

True Positive (TP): 99% of sick people are correctly detected as having the disease
False Positive (FP): 1% of healthy people are incorrectly detected as having the disease
True Negative (TN): 99% of healthy people are correctly detected as not having the disease
False Negative (FN): 1% of sick people are incorrectly detected as not having the disease

A random person comes into the clinic and tests positive — what is the probability that this person actually has the disease? Think of an answer before you read on.

This can be solved numerically without Bayes. Let’s assume an arbitrary population size of 10,000 people. Of these people we know 1% of the population has the disease:

100 people with the disease
9900 people without the disease

Next we can calculate the number of positive tests. The test comes back positive for 99% of the disease group and for 1% of the healthy group:

100 * 99% = 99 True Positive
9900 * 1% = 99 False Positive

which gives us our answer:

TP / (TP + FP) = 99 / (99 + 99) = 50%

A random person who tests positive has only a 50% chance of having the disease when 1% of the population has the disease and your test is 99% accurate! Do you still remember the number you guessed? The true answer of 50% is a surprise to most people.
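If you want to check the head count yourself, here is a minimal Python sketch of the same arithmetic. The function name and its parameters are my own illustrative choices, not part of any library, and I use exact fractions so that rounding does not blur the result.

```python
from fractions import Fraction


def count_positive_tests(population, disease_rate, detection_rate, false_alarm_rate):
    """Split a population into sick and healthy, then count positive tests.

    All names here are illustrative assumptions, not from any library.
    """
    sick = population * disease_rate
    healthy = population - sick
    true_positives = sick * detection_rate        # sick people who test positive
    false_positives = healthy * false_alarm_rate  # healthy people who test positive
    share_sick = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, share_sick


tp, fp, share = count_positive_tests(
    population=10_000,
    disease_rate=Fraction(1, 100),      # 1% of people have the disease
    detection_rate=Fraction(99, 100),   # 99% of sick people test positive
    false_alarm_rate=Fraction(1, 100),  # 1% of healthy people test positive
)
print(tp, fp, share)  # 99 99 1/2
```

The population size cancels out of the final ratio, which is why picking 10,000 was arbitrary.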

Now again using Bayes’ theorem:

P(A ∣ B) = P(B ∣ A) × P(A) / P(B)

A = Disease
B = Positive test

The question again: what is the probability of having the disease given a positive test, that is, what is P(A ∣ B)?

Probability of a positive test given the person has the disease:

P(B ∣ A) = 0.99

Probability of having the disease:

P(A) = 0.01

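Probability of a positive test, which is the sum of the true positives among sick people and the false positives among healthy people:

P(B) = P(B ∣ A) × P(A) + P(B ∣ not A) × P(not A) = 0.99 × 0.01 + 0.01 × 0.99 = 2 × 0.99 × 0.01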

Complete the calculation:

P(A ∣ B) = P(B ∣ A) × P(A) / P(B)
= (0.99 × 0.01) / (2 × 0.99 × 0.01)
= 1/2
= 50%
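The same calculation can be sketched in a few lines of Python. This is only an illustration under my own naming: `posterior_given_positive`, `prior`, `p_pos_if_sick` and `p_pos_if_healthy` are hypothetical names standing in for P(A ∣ B), P(A), P(B ∣ A) and P(B ∣ not A).

```python
from fractions import Fraction


def posterior_given_positive(prior, p_pos_if_sick, p_pos_if_healthy):
    """Bayes' theorem for a positive test: returns P(A | B).

    Illustrative helper (not a library function):
      prior            = P(A), the share of the population with the disease
      p_pos_if_sick    = P(B | A), chance a sick person tests positive
      p_pos_if_healthy = P(B | not A), chance a healthy person tests positive
    """
    # P(B): positive tests from sick people plus positive tests from healthy people
    p_positive = p_pos_if_sick * prior + p_pos_if_healthy * (1 - prior)
    return p_pos_if_sick * prior / p_positive


posterior = posterior_given_positive(
    prior=Fraction(1, 100),
    p_pos_if_sick=Fraction(99, 100),
    p_pos_if_healthy=Fraction(1, 100),
)
print(posterior)  # 1/2
```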

Even with a test that is 99% accurate, a random patient with a positive result has only a 50–50 chance of actually having the disease! The prevalence of the disease in the population is a critical variable which many people overlook.
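To see how much the answer depends on how common the disease is, here is a small sketch that keeps the 99% test fixed and varies only the share of the population with the disease. The function name and the extra rates (0.1%, 10% and 50%) are assumptions I picked for illustration; only the 1% case comes from the example above.

```python
from fractions import Fraction


def chance_sick_given_positive(disease_rate, accuracy=Fraction(99, 100)):
    """P(disease | positive test) for a test that is equally accurate
    on sick and healthy people. Illustrative helper, not a library call."""
    true_positive_mass = accuracy * disease_rate
    false_positive_mass = (1 - accuracy) * (1 - disease_rate)
    return true_positive_mass / (true_positive_mass + false_positive_mass)


# Hypothetical disease rates; only 1% is taken from the worked example.
disease_rates = [Fraction(1, 1000), Fraction(1, 100), Fraction(1, 10), Fraction(1, 2)]
results = {rate: chance_sick_given_positive(rate) for rate in disease_rates}

for rate, posterior in results.items():
    print(f"disease rate {float(rate):.1%} -> P(disease | positive) = {float(posterior):.1%}")
```

With the same test, the chance that a positive result is genuine runs from roughly 9% when the disease affects 0.1% of people, to 50% at 1%, to about 92% at 10%, and 99% when half the population is sick.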

See how they get this wrong in the news: 81% of ‘suspects’ flagged by Met’s police facial recognition technology innocent, independent report says

See also: Bayesian Ranking System
