Introduction
Markov chain Monte Carlo (MCMC) is a family of algorithms that maps out a posterior distribution by drawing samples from it. We use this method instead of the quadratic approximation because that approximation assumes the posterior is roughly Gaussian; when we encounter distributions with multiple peaks, an optimizer can converge to a local maximum and fail to give us a true picture of the posterior. Monte Carlo algorithms instead use randomness to solve problems that would otherwise be difficult, if not impossible, to solve analytically.
Analogy of King Markov
Let’s start this discussion with an analogy, which we will update as we move through the different algorithms. Assume there are 10 islands arranged in a circle, overseen by a king. King Markov has been advised by his trustees that, in order to avoid a mutiny, he must visit each island regularly. The condition is that each island must be visited in proportion to its population.
The populations of the islands are distributed such that island 10 has 10 times the population of island 1, island 5 has 5 times the population of island 1, and so on; therefore, King Markov must stay on island 10 ten times as often as he stays on island 1. One of King Markov’s statisticians explains a method by which he can plan his visits and keep his people happy:
Step 1: Each week, King Markov decides between moving to a neighbouring island or staying on the current island.
Step 2: He flips a coin. If the coin lands heads, King Markov considers moving to the next island in the clockwise direction; if it lands tails, he considers moving to the next island in the counterclockwise direction. Let’s call this island the proposal island.
Step 3: If the population of the proposal island is greater than that of the current island, King Markov always accepts the proposal and moves to the proposal island. If the population of the proposal island is smaller, he accepts the proposal with probability population_proposal / population_current. Since this is a probability, the proposal can also be rejected, in which case King Markov does not move at all.
The code for simulating this is given below. As you increase the number of weeks, the proportion of weeks spent on each island converges to the island's relative population.
num_weeks <- 1e4
positions <- rep(0, num_weeks)
current <- 10
for (i in 1:num_weeks) {
  ## record the current position
  positions[i] <- current
  ## flip a coin to generate a proposal
  proposal <- current + sample(c(-1, 1), size = 1)
  ## wrap around so the proposal stays between 1 and 10
  if (proposal < 1) proposal <- 10
  if (proposal > 10) proposal <- 1
  ## move with probability equal to the population ratio
  prob_move <- proposal / current
  current <- ifelse(runif(1) < prob_move, proposal, current)
}
barplot(prop.table(table(positions)), ylim = c(0, 0.2))

Metropolis Algorithm
This algorithm is known as the Metropolis algorithm. It is the simplest algorithm in the class of Markov chain Monte Carlo methods. A generalization of it is the Metropolis-Hastings algorithm.
The difference between the two is that the Metropolis algorithm requires a symmetric proposal distribution (a plain random walk through parameter space), while Metropolis-Hastings allows asymmetric proposals and corrects for them in the acceptance ratio, which makes for much more sensible proposals. This can be understood in greater detail with reference to its mathematics here.
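As a brief sketch of that correction (standard notation, not spelled out in the original article), the Metropolis-Hastings acceptance probability for moving from the current point theta to a proposal theta' is

\alpha = \min\left(1, \frac{p(\theta' \mid D)\, q(\theta \mid \theta')}{p(\theta \mid D)\, q(\theta' \mid \theta)}\right)

where q is the proposal distribution. When q is symmetric, the q terms cancel and this reduces to the plain Metropolis rule, min(1, density_proposal / density_current).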
Let’s connect the dots. You pick a starting point on the posterior distribution; this is the island the king was on at the start of his journey. You generate a random number that proposes a direction in which the point should move, then calculate the density at the proposal point and the current point. If the proposal point has a higher posterior density than the current one, you move to it. If it has a lower density, you move to it with probability density_proposal / density_current.
So once you do this over and over again, the chain gravitates towards the high-density regions of the posterior, spending time in each region in proportion to its density. The samples you take along the way help you estimate the shape of the posterior distribution.
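As a minimal sketch of the same idea on a continuous posterior (my own illustrative code, not from the article's repository), here is a random-walk Metropolis sampler; the bimodal target density and the step_size value are assumptions chosen for the demo:

## unnormalized bimodal target density (mixture of two normals)
target <- function(x) dnorm(x, -2, 1) + dnorm(x, 2, 1)

n_samples <- 1e4
step_size <- 1          ## scale of the random-walk proposal (assumed for the demo)
samples <- numeric(n_samples)
current <- 0

for (i in 1:n_samples) {
  ## propose a move in a random direction
  proposal <- current + rnorm(1, 0, step_size)
  ## accept with probability density_proposal / density_current
  if (runif(1) < target(proposal) / target(current)) current <- proposal
  samples[i] <- current
}
hist(samples, breaks = 50, freq = FALSE)

The histogram recovers both peaks of the target, which is exactly the multimodal situation where the quadratic approximation struggles.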
The reason we use this algorithm, which samples from the posterior to learn its probable shape, is that writing down the posterior exactly requires evaluating the normalizing integral in the denominator of Bayes' theorem. Sometimes these integrals aren't even solvable, which is why sampling from the highest-density regions gives us a much quicker, and still reasonably precise, picture of the posterior's shape. Crucially, the acceptance rule only ever involves a ratio of posterior densities, so the intractable normalizing constant cancels and never needs to be computed.
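To make that cancellation explicit (a standard identity, not spelled out in the original):

\frac{p(\theta' \mid D)}{p(\theta \mid D)} = \frac{p(D \mid \theta')\, p(\theta') / p(D)}{p(D \mid \theta)\, p(\theta) / p(D)} = \frac{p(D \mid \theta')\, p(\theta')}{p(D \mid \theta)\, p(\theta)}

The evidence p(D), the hard integral, drops out of the acceptance ratio entirely.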
Analogy of King Monte
Let’s assume King Markov has a brother named King Monte, who oversees a valley that is steep around the edges and flat at the center. For the sake of ease, let’s say the valley is shaped like a bowl, and the population in the valley is spread in inverse proportion to the steepness of the terrain. In simple words, around the edges where the valley is steep, there is a lower population, and around the center where the valley is flatter, there is a higher population. King Monte has been advised, just like King Markov, that in order to avoid a mutiny he must visit each part of the valley in proportion to its population density.

This problem is a little trickier than the previous one: unlike King Markov's islands, we are dealing with a continuous stretch of land to explore. King Monte gets in his car and gives it a random momentum impulse in some random direction. When the car starts going uphill, its kinetic energy converts into potential energy until the car stops, turns around, and rolls back towards the center. At predefined intervals, King Monte stops his car and meets the people in the region where he finds himself, and this process repeats over and over. In the long run, King Monte will always visit the higher-population regions more, as gravity will always pull him towards the center more than the edges.

Let’s analyze how this applies to determining the shape of the posterior. We have understood so far that computing the exact solution for the posterior would be too computationally expensive; however, we are capable of calculating the probability density at any single point, and we can also compute the gradient of the posterior at any given point. The mathematics of this can be found here. The algorithm runs a dynamics simulation nearly identical to how King Monte's car behaves, except that our point moves on a frictionless plane. We must also define two things before starting the simulation: the number of steps the point takes before stopping (let's call these leapfrog steps) and the step size. The point is flicked in a random direction with some random momentum; at each step, the gradient and density are evaluated and the energy bookkeeping is updated. After the total number of leapfrog steps, the point stops and its current position is recorded as a sample from the posterior. The code for this can be found [here](https://github.com/sheharyarakhtar/MediumArticles). The figure below shows the path the point takes for 5 samples.
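For the curious, here is a minimal sketch of one HMC transition in R, following the standard leapfrog scheme; this is my own illustrative code rather than the article's repository code, and the 2D standard normal target U and the tuning values eps and L are assumptions chosen for the demo:

U <- function(q) sum(q^2) / 2   ## negative log density of a 2D standard normal (assumed target)
grad_U <- function(q) q         ## its gradient

hmc_step <- function(current_q, eps = 0.1, L = 20) {
  q <- current_q
  p <- rnorm(length(q))         ## flick the point with random momentum
  current_p <- p
  ## leapfrog integration: half step for p, alternating full steps, half step for p
  p <- p - eps * grad_U(q) / 2
  for (i in 1:L) {
    q <- q + eps * p
    if (i != L) p <- p - eps * grad_U(q)
  }
  p <- p - eps * grad_U(q) / 2
  ## accept or reject based on how well total energy was conserved
  current_H <- U(current_q) + sum(current_p^2) / 2
  proposed_H <- U(q) + sum(p^2) / 2
  if (runif(1) < exp(current_H - proposed_H)) q else current_q
}

samples <- matrix(0, nrow = 1000, ncol = 2)
q <- c(3, 3)
for (s in 1:1000) {
  q <- hmc_step(q)
  samples[s, ] <- q
}
plot(samples, pch = 20, cex = 0.4)

The accept/reject step at the end is what keeps the sampler exact: if the numerical integration conserved energy perfectly, every proposal would be accepted.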

Hamiltonian Monte Carlo
Hamilton did not create this algorithm, although he contributed a great deal to modern dynamics. Since HMC quite literally uses the principles of dynamics and energy conservation to explore the posterior, the algorithm was named after him. The reason HMC is so popular is that, unlike the Metropolis or Metropolis-Hastings algorithms, there is a very low chance of a proposal being rejected. This is because the next point to sample is determined by a long, deterministic chain of dynamics steps that nearly conserves the total energy of the system, so the acceptance probability stays close to one. The only time the proposal is rejected is when the energy of the system is not conserved, which usually happens when the numerical approximation of the trajectory is poor.
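In symbols (standard HMC notation, not from the original text), the total energy is the Hamiltonian, the sum of potential energy U(q) (the negative log posterior) and kinetic energy K(p), and the acceptance rule compares energies before and after the trajectory:

H(q, p) = U(q) + K(p), \qquad P(\text{accept}) = \min\left(1, e^{H(q,\, p) - H(q',\, p')}\right)

If the leapfrog integrator were exact, H would be conserved and the acceptance probability would always be one.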
The biggest reason for using HMC, however, is not this. With a small number of parameters, Metropolis and HMC behave very similarly; HMC is more efficient, but the outcome is much the same. As you move towards higher dimensions, the mode of the posterior lies further and further away from where most of the probability mass sits. It's difficult to visualize more than 3 dimensions, so think of it this way: with a relatively small number of parameters, say 3, the mode of the posterior is a good proxy for where most of the probability mass is. With 1,000 parameters, the distribution behaves more and more like a donut, with most of the probability mass lying in a thin shell far from the mode.
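A quick way to see this shell effect for yourself (my own illustrative snippet, assuming a standard multivariate normal): draw samples in increasing dimension and look at their distance from the mode at the origin.

set.seed(1)
for (d in c(3, 100, 1000)) {
  x <- matrix(rnorm(1e3 * d), ncol = d)  ## 1,000 draws from a d-dimensional standard normal
  r <- sqrt(rowSums(x^2))                ## distance of each draw from the mode at the origin
  cat(sprintf("d = %4d: mean radius %.1f, sd %.2f\n", d, mean(r), sd(r)))
}

The mean radius grows like the square root of the dimension while its spread stays roughly constant, so in high dimensions almost no samples sit near the mode.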
This causes the Metropolis algorithms to reject proposals far more frequently than HMC does, so they take much longer to converge. The plot shown below gives a little more meaning to this.

Summary
In this article we went through two popular MCMC methods to understand them conceptually. In articles to come, I will be going over their implementations in R and Stan. All the code used in this article to generate the plots and simulate the data can be found here.

References
[1] Richard McElreath, Statistical Rethinking: A Bayesian Course with Examples in R and Stan (2020)
[2] Stephanie Glen, "Metropolis-Hastings Algorithm / Metropolis Algorithm," StatisticsHowTo.com: Elementary Statistics for the rest of us!, https://www.statisticshowto.com/metropolis-hastings-algorithm/
[3] Colin Carroll, "Hamiltonian Monte Carlo from scratch" (2019), https://colindcarroll.com/2019/04/11/hamiltonian-monte-carlo-from-scratch/