Oh! What are we if not patterns?
It’s amazing how we perform so many sophisticated tasks with ease and yet yearn to teach machines how to do them. Telling the difference between apples and oranges is quite trivial for us, but teaching it to someone who only understands ‘0’ and ‘1’ is troublesome. Now, what would seem a strenuous task can be made easy (or at least feasible) using a very familiar mathematical formula. But the question is: is the formula intuitive?
Bayes’ Theorem is like the E=mc² of probability theory. Everyone has seen it, but only a few understand it. If I had to choose one adjective for it, it’d be revolutionary! It changed the course of many applications we were fiddling with before. But before diving into the nitty-gritty of the algorithm, we need to set our fundamentals straight.
P.S.: To explain the concepts I have taken the example of binary classification for the sake of simplicity but without any loss of generality.
Marginal, Conditional and Joint Probabilities
Marginal Probability
When we usually talk about the probability of an event, it is the marginal probability we are concerned with. In other words, it is the probability of an event irrespective of any other factor/event/circumstance. Basically, you ‘marginalize’ the other events, hence the name. It is denoted by P(A) and read as "probability of A".
Conditional Probability
Conditional probability applies when the occurrence of an event is wholly or partially affected by other event(s). Alternatively put, it is the probability of occurrence of an event A when another event B has already taken place. It is denoted by P(A|B) and read as "probability of A given B".
Joint Probability
Joint probability is calculated when we are interested in the simultaneous occurrence of two different events. It is also often referred to as the probability of the intersection of two events. It is denoted by P(A, B) and read as "probability of A and B".
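To make these three quantities concrete, here is a minimal Python sketch built on a made-up 2×2 table of counts for two events A and B (all numbers are invented purely for illustration):

```python
import numpy as np

# Rows: A occurs / A does not occur; columns: B occurs / B does not occur.
# These counts are made up for illustration.
counts = np.array([[30.0, 10.0],
                   [20.0, 40.0]])

joint = counts / counts.sum()      # P(A, B) for every combination of outcomes

p_A = joint[0].sum()               # marginal P(A): sum the joint over all B
p_B = joint[:, 0].sum()            # marginal P(B): sum the joint over all A

p_A_and_B = joint[0, 0]            # joint P(A, B)
p_A_given_B = p_A_and_B / p_B      # conditional P(A|B) = P(A, B) / P(B)

print(f"P(A) = {p_A:.2f}, P(B) = {p_B:.2f}")                      # 0.40, 0.50
print(f"P(A, B) = {p_A_and_B:.2f}, P(A|B) = {p_A_given_B:.2f}")   # 0.30, 0.60
```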
Probability and Likelihood
There is a very subtle difference between likelihood and probability, and perhaps this is the very reason why people often consider them similar, if not the same.
To understand the difference between them, we first need to understand what a model is, more specifically, statistical model.
A model can be viewed as a process, relationship, equation, or approximation used to understand and describe the data.
Consider the graph below:
[Scatter plot: data points showing feature y increasing roughly linearly with feature x]
This could be a model, as it gives us a ‘description’ of what our data look like. In the above graph we can see a relationship between the features (x and y), i.e., how the features vary with respect to each other.
Now, if I try to approximate it with a mathematical equation, I get:
y = 1.3055 * x - 5.703
which when plotted gives:
[Plot: the same data points with the fitted line y = 1.3055 * x − 5.703 drawn through them]
This equation is also a model as it gives a more concrete description of the data (more specifically the relationship between the features).
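If you would like to reproduce such a model yourself, here is a hedged sketch: the data below is synthetic, generated with noise around the equation above, so a least-squares fit recovers roughly (not exactly) the same coefficients.

```python
import numpy as np

# Synthetic data: noisy observations around the line above (invented setup).
rng = np.random.default_rng(0)
x = np.linspace(0, 50, 100)
y = 1.3055 * x - 5.703 + rng.normal(0, 3, size=x.shape)

slope, intercept = np.polyfit(x, y, deg=1)   # least-squares straight-line fit
print(f"y = {slope:.4f} * x + ({intercept:.4f})")
```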
A lot of statistics is dedicated to determining if a model is a good or bad approximation of the data.
Now that we have the intuition behind a statistical model, we can address the difference between likelihood and probability.
Whenever we calculate the probability of an event in a stochastic process, it depends on the parameters of the model we are using to describe the process. Let O be an observed outcome of the process and θ be the parameter describing the underlying model. Then the probability we are interested in calculating is denoted by P(O|θ), i.e., "the probability of a particular outcome given the parameter of the model used to describe the process".
But seldom does it happen that we know the value of θ beforehand. We simply observe O, and the goal then is to arrive at an estimate of θ that would be a plausible choice given the observed outcome O. Our best estimate for the parameter is the value which gives the maximum probability for the outcome O to occur. We can then define the likelihood function L(θ|O), i.e., a function of θ for the given set of outcomes, and use it to find the optimal value of the parameter θ.
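Here is a minimal sketch of that idea in code, assuming (purely for illustration) that the process is Gaussian with known σ = 1 and unknown mean θ: we evaluate the likelihood L(θ|O) on a grid of candidate values and keep the maximizer.

```python
import numpy as np
from scipy.stats import norm

# Observed outcomes O, drawn from an assumed "true" mean of 2.0 (invented).
rng = np.random.default_rng(42)
O = rng.normal(loc=2.0, scale=1.0, size=200)

candidates = np.linspace(0.0, 4.0, 401)          # candidate values of theta
# Log-likelihood log L(theta | O) for each candidate, given the observations
log_L = [norm.logpdf(O, loc=t, scale=1.0).sum() for t in candidates]

theta_hat = candidates[np.argmax(log_L)]
print(f"Maximum-likelihood estimate of the mean: {theta_hat:.2f}")  # ~2.0
```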
Graphical explanation:
[Plot: a Gaussian curve; the area to the right of x is shaded pink, a narrow strip at x is shaded dark pink, and y marks the height of the curve at x]
Consider a Gaussian distribution as shown in the graph above. Let X be the random variable for the process under consideration. Then,
Probability of the random variable equals x given the underlying model is Gaussian:
P(X = x | N(μ, σ)) = 0 # For a continuous random variable, but it can be closely approximated by the dark pink area
Probability of the random variable being greater than x, given the underlying model is Gaussian:
P(X > x | N(μ, σ)) = (Pink + Dark Pink) area
Likelihood of the random variable at x:
L(N(μ, σ) | X = x) = y
In simple terms, the likelihood is how well the parameters (in this case, μ and σ) of the model (in this case, a Gaussian distribution) describe the outcomes (in this case, X); the optimal parameters are the ones that maximize it for the observed outcomes. The probability, on the other hand, is how probable an event or observation is for the given parameters of the model.
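In code, the distinction boils down to ‘area under the curve’ versus ‘height of the curve’. A small sketch using scipy, with arbitrary values for μ, σ and x:

```python
from scipy.stats import norm

mu, sigma, x = 0.0, 1.0, 1.5       # arbitrary parameters and observation

# Probability that X exceeds x: an area under the curve (pink + dark pink)
p_tail = norm.sf(x, loc=mu, scale=sigma)          # sf(x) = 1 - cdf(x)

# Likelihood of the parameters at x: the height of the curve (y)
likelihood = norm.pdf(x, loc=mu, scale=sigma)

print(f"P(X > x | N(mu, sigma)) = {p_tail:.4f}")      # ~0.0668
print(f"L(N(mu, sigma) | X = x) = {likelihood:.4f}")  # ~0.1295
```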
Now that we have gone through our fundamentals, we can finally dive into the working of Bayes’ Decision Theory.
Bayes’ Decision Theory
Bayesian Decision Theory is a fundamental statistical approach to the problem of pattern classification. It is considered the ideal pattern classifier and is often used as the benchmark for other algorithms, because its decision rule automatically minimizes its loss function. This might not make much sense right now, so hold on, we’ll unravel it all.
It makes the assumption that the decision problem is posed in probabilistic terms, and that all the relevant probability values are known.
Bayes’ Theorem
Derivation of Bayes’ Theorem:
We know from conditional probability:
P(A|B) = P(A, B) / P(B)
=> P(A, B) = P(A|B) * P(B) ... (i)
Similarly,
P(A, B) = P(B|A) * P(A) ... (ii)
From equation (i) and (ii):
P(A|B) * P(B) = P(B|A) * P(A)
=> P(A|B) = [P(B|A) * P(A)] / P(B)
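As a quick numerical sanity check of the derivation, here is a tiny sketch that plugs in the made-up probabilities from the earlier sketch (there, P(A) = 0.40, P(B) = 0.50 and P(A, B) = 0.30, so P(B|A) = 0.75):

```python
# Made-up probabilities carried over from the earlier sketch.
p_B_given_A, p_A, p_B = 0.75, 0.40, 0.50

p_A_given_B = p_B_given_A * p_A / p_B   # Bayes' theorem
print(f"P(A|B) = {p_A_given_B:.2f}")    # -> 0.60, matching P(A, B) / P(B)
```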
For the case of classification, let:
- A ≡ ω (state of nature, or the class of an entry)
- B ≡ x (input feature vector)
After substituting we get:
P(ω|x) = [P(x|ω) * P(ω)] / P(x)
which becomes:
P(ω|x) = [p(x|ω) * P(ω)] / p(x)
because,
* P(ω|x) ≡ called the posterior, it is the probability of the predicted class being ω for a given feature vector x. It is analogous to P(O|θ), because the class is the desired outcome to be predicted according to the data distribution (model). Capital ‘P’ because ω is a discrete random variable.
* p(x|ω) ≡ the class-conditional probability density function of the feature. We call it the likelihood of ω with respect to x, a term chosen to indicate that, other things being equal, the category (or class) for which it is large is more "likely" to be the true category. It is a function of parameters within the parametric space that describes the probability of obtaining the observed data (x). Lowercase ‘p’ because x is a continuous random variable. We usually assume it follows a Gaussian distribution.
* P(ω) ≡ the a priori probability (or simply, the prior) of class ω. It is usually pre-determined and depends on external factors. It expresses how probable the occurrence of class ω is out of all the classes.
* p(x) ≡ called the evidence, it is merely a scaling factor that guarantees that the posterior probabilities sum to one. p(x) = sum(p(x|ω)*P(ω)) over all the classes.
So finally we get the following equation to frame our decision rule:
P(ωi|x) = [p(x|ωi) * P(ωi)] / p(x), where p(x) = sum over j of p(x|ωj) * P(ωj)
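Here is a minimal sketch of this computation for binary classification, assuming Gaussian class-conditional densities as discussed above; the priors, means and standard deviations are invented for illustration:

```python
from scipy.stats import norm

# Invented model: two classes with Gaussian class-conditional densities.
priors = [0.6, 0.4]                    # P(w1), P(w2)
means, stds = [0.0, 3.0], [1.0, 1.0]   # parameters of p(x|w1) and p(x|w2)

x = 1.2                                # observed feature value

likelihoods = [norm.pdf(x, m, s) for m, s in zip(means, stds)]
evidence = sum(l * p for l, p in zip(likelihoods, priors))    # p(x)

# Posterior for each class via Bayes' theorem
posteriors = [l * p / evidence for l, p in zip(likelihoods, priors)]
print(f"P(w1|x) = {posteriors[0]:.3f}, P(w2|x) = {posteriors[1]:.3f}")
```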
Decision Rule
The above equation is the governing formula for our decision theory. The rule is as follows:
For each sample input, calculate the posterior for every class and assign the input to the class corresponding to the maximum posterior probability.
Mathematically it can be written as:
Decide ωi if P(ωi|x) > P(ωj|x) for all j ≠ i
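As a sketch, the rule itself is a one-liner over posteriors such as those computed above (the values passed in below are hypothetical):

```python
import numpy as np

def bayes_decide(posteriors):
    """Return the (0-based) index of the class with the maximum posterior."""
    return int(np.argmax(posteriors))

print(bayes_decide([0.86, 0.14]))   # -> 0, i.e. assign x to class w1
```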
Now that you know what the Bayes decision rule is and how it works, you might be wondering: what’s so special here? Well, the algorithm is remarkable, owing to its elegance and near-ideal results. Bayes’ Decision Theory is considered a benchmark for other classification algorithms.
Under Bayes’ theorem, no theory is perfect. Rather, it is a work in progress, always subject to further refinement and testing.
Let’s try to understand why Bayes’ Classifier is the best classifier!**
** (provided all the assumptions required by Bayes’ Decision Theory are true)
Loss Function
Let {ω1, ω2, ω3, …, ωc} be the finite set of c categories and {α1, α2, α3, …, αa} be the finite set of a possible actions. By an action I mean a particular decision out of all possible decisions; for example, one action could be assigning an input feature vector to category (hereinafter, class) 3 and another could be assigning an input to class 7.
We can generalize our loss function (λ) as follows:
λ(αi | ωj), for i = 1, …, a and j = 1, …, c
which can be understood as follows:
λ(αi | ωj) is the loss incurred on assigning a feature vector to class ωi when its true class is ωj. In other words, it is the loss suffered on taking the action αi of assigning an input to class ωi when it should have been in class ωj.
For example, in the case of binary classification, if we take the action of classifying an input feature vector into class 1 when it should have been in class 2, we incur the loss:
λ(α1 | ω2)
If i = j, then we get a smaller value of the loss as compared to the alternative cases because it corresponds to a correct decision.
But it would be totally naive to accept this as the complete function we seek for our loss. As can be seen, the loss is conditional: it depends on the class under concern at any particular time. To understand this better, answer the following question:
Will the loss be independent of how probable a class is for an input feature vector?
If you think the answer is ‘yes’, let me explain why the answer is ‘no’!
Say the posterior probability of class 2 is greater than that of class 1 for a given feature vector x, i.e.,
P(ω2|x) > P(ω1|x)
and let’s consider the action α1 of assigning x to class 1.
Since λ is the loss associated with one particular action (here, assigning x to class 1), we get its two values:
λ(α1 | ω1) and λ(α1 | ω2)
Now are these two loss values equal?
You cannot answer this question without knowledge of the posterior probabilities. You need to know how likely a class is to be the true class to calculate the loss associated with it. This modifies our loss as follows:
λ(α1 | ω1) * P(ω1|x) and λ(α1 | ω2) * P(ω2|x)
This explains why it is called conditional loss.
λ can be any suitable function. It could be, for example, a symmetrical or zero-one function.
Expected loss for a decision:
Now we are in a comfortable position to define the expected loss for a particular decision (or action) over all the classes, that is to say, the loss I would incur after taking a specific action and how it is affected by the ‘presence’ of all the classes and not just the true class. This expected loss is called the conditional risk:
R(αi | x) = sum over j = 1, …, c of λ(αi | ωj) * P(ωj|x)
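A small sketch of this computation for a two-class problem; the loss matrix and posteriors below are made up, with an asymmetric loss chosen deliberately to show that the risk-minimizing action need not be the posterior-maximizing one:

```python
import numpy as np

# loss[i, j] = lambda(a_i | w_j): loss for taking action a_i when truth is w_j.
# The values are invented; misclassifying a true w2 as w1 is made costly.
loss = np.array([[0.0, 3.0],
                 [1.0, 0.0]])

posteriors = np.array([0.6, 0.4])          # hypothetical P(w1|x), P(w2|x)

risk = loss @ posteriors                   # R(a_i|x) for every action a_i
print(f"R(a1|x) = {risk[0]:.2f}, R(a2|x) = {risk[1]:.2f}")    # 1.20, 0.60
print(f"Action minimizing the risk: a{np.argmin(risk) + 1}")  # -> a2
```

Note that even though ω1 has the higher posterior here, the asymmetric loss makes assigning x to class 2 the safer action; with a symmetric (zero-one) loss the two criteria coincide, as we will see below.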
We can minimize our expected loss by selecting the action that minimizes the conditional risk. We shall now show that this Bayes decision procedure actually provides the optimal performance.
Our problem is to find a decision rule that minimizes the overall risk. A general decision rule is a function α(x) that tells us which action to take for every possible observation x. To be more specific, for every x the decision function α(x) assumes one of the a values α1, α2, α3, …, αa. Because the conditional risk R(αi|x) is associated with the action αi, and because the decision rule specifies the action, the overall risk is given by:
R = ∫ R(α(x) | x) * p(x) dx
If α(x) is chosen so that the conditional risk is minimized for every x, then the overall risk is minimized too. This justifies the Bayes decision rule.
Perhaps an example would be better…
Let λ be the symmetrical, or zero-one, loss function:
λ(αi | ωj) = 0 if i = j, and 1 if i ≠ j (for i, j = 1, …, c)
Then the conditional risk becomes:
R(αi | x) = sum over j ≠ i of P(ωj|x) = 1 − P(ωi|x)
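A quick numerical check of this reduction; the posteriors are hypothetical, and the sketch uses three classes simply to show that nothing here is specific to the binary case:

```python
import numpy as np

posteriors = np.array([0.2, 0.5, 0.3])       # hypothetical P(w_j|x)
zero_one = 1.0 - np.eye(len(posteriors))     # lambda = 0 if i == j, else 1

risk = zero_one @ posteriors                 # R(a_i|x) for each action
print(risk)                                  # -> [0.8 0.5 0.7] == 1 - posteriors
print(np.argmin(risk) == np.argmax(posteriors))   # -> True
```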
So minimizing the conditional risk amounts to maximizing the posterior P(ωi|x): the very decision rule of Bayes’ Classifier automatically minimizes the conditional risk associated with an action.
And that’s Bayes’ Decision Theory!
I hope this article helped you in understanding this algorithm. Feel free to drop your suggestions and queries down below. And if you liked it, smash that clap button!
If you have any topic in mind and would like to see it here, leave it in the comment section. It would be my pleasure!
Godspeed!
(All the images but the thumbnail were created by the author.)