
In a previous post I introduced different numerical sampling techniques, one of them being importance sampling. There we used this technique to draw samples from complex distributions that would otherwise be infeasible to sample from. However, importance sampling is frequently used for another reason, namely variance reduction: by choosing a suitable proposal distribution we can reduce the variance of our estimator – which is what we will cover here.
Recap of Importance Sampling
Assume we don’t just want to calculate the expectation E[X] of a random variable X, but instead the expectation of a function of that variable, f(X). In a continuous setting this is calculated as:

E[f(X)] = ∫ f(x) p(x) dx
We can approximate this expectation numerically – a Monte Carlo method – by sampling n random values from the distribution p and then calculating the sample mean:

E[f(X)] ≈ (1/n) Σ f(x_i),  with x_i ~ p
The idea behind importance sampling is now to use a simple reformulation trick and write the expectation as

E[f(X)] = ∫ f(x) p(x) dx = ∫ (f(x) p(x) / q(x)) q(x) dx

– giving the expectation of f(x)p(x)/q(x) over the distribution q! This allows us to calculate the sample mean by sampling from q instead:

E[f(X)] ≈ (1/n) Σ f(x_i) p(x_i) / q(x_i),  with x_i ~ q
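To make this concrete, here is a minimal sketch of what such an estimator could look like in Python (the helper name and signature are purely illustrative, not from the original post); it assumes we can evaluate the densities p and q and draw samples from q:

import numpy as np

def importance_sampling_mean(f, p_pdf, q_pdf, q_sample, n=100_000):
    """Estimate E_p[f(X)] using samples drawn from the proposal q."""
    x = q_sample(n)                     # draw n samples from the proposal q
    weights = p_pdf(x) / q_pdf(x)       # importance weights p(x) / q(x)
    values = f(x) * weights             # weighted function evaluations
    return values.mean(), values.var()  # estimate and per-sample variance

(With q_pdf = p_pdf and q_sample drawing from p, this reduces to the plain Monte Carlo estimator.)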
Variance Reduction
The variance of the standard Monte Carlo estimator is given by (dropping the 1/n factor common to both estimators):

Var_p[f(X)] = E_p[f(X)²] − (E_p[f(X)])²

The variance for the reformulated importance sampling estimator is:

Var_q[f(X) p(X)/q(X)] = E_q[(f(X) p(X)/q(X))²] − (E_p[f(X)])²
So as a first step we observe that the two variances differ, which suggests that a suitable choice of q can reduce the variance. And indeed it is relatively easy to see that this variance can be reduced to 0 by choosing q as:

q(x) = f(x) p(x) / E[f(X)]

(Insert this term into the expression above: f(x)p(x)/q(x) then equals the constant E[f(X)] for every x, so the estimator has variance 0.)
Naturally, we don’t know E[f(X)] – finding that expectation is the very reason we are sampling in the first place. However, we can think of E[f(X)] as just a normalisation constant (it is exactly the constant that makes f(x)p(x) integrate to 1), and at least take one essential insight away from this: we should construct q such that it has high density wherever f(x)p(x) is high. And with this, let’s dive into a practical example and apply this insight.
Practical Example
For the sake of demonstration, we want a pointy function f and a probability distribution p which do not overlap too well. For simplicity, let us set both to be normal densities, e.g. f = N(5, 1) and p = N(9, 2):

[Plot of the densities f = N(5, 1) and p = N(9, 2) – reproduced by the code below.]
I hope choosing a normal distribution for both does not confuse the reader, so let’s reiterate what we’re trying to do here: we want to compute E[f(X)], where X is a random variable which follows the distribution p – i.e. we want to compute the mean of f under p. Note this mean is not the mean usually associated with a normal distribution (a location on the x-axis, which for a normal coincides with its mode); instead we are after the mean of the y-values of f under p: in this example it is ~0.036 – a much lesser known and used value.
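As a quick sanity check (a small aside of mine, not needed for the method): since both f and p are Gaussian densities here, this expectation has a closed form – the integral of the product of two Gaussian pdfs is a Gaussian pdf with the variances added, evaluated at the difference of the means:

import numpy as np

# Closed-form value of E_p[f(X)] for f = N(5, 1) and p = N(9, 2)
s = np.sqrt(1**2 + 2**2)
exact = np.exp(-0.5 * ((5 - 9) / s) ** 2) / (s * np.sqrt(2 * np.pi))
print(exact)  # ~0.036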
To approximate this numerically, as stated above, we would now sample values x from the distribution p and compute the empirical mean of f(x).
Intuitively one can see why sampling from this distribution is a bad idea, hopefully made clear by the previous section: for most values sampled from p, f will be close to 0 – but for a few sampled values f will be much larger – and thus we obtain a large variance.
Therefore, following the ideas introduced above, we now propose a new distribution q = N(5.8, 1), which satisfies the derived criterion that its density is high in the regions where f(x)p(x) is high:

[Plot of f, p and the proposal q – reproduced by the code below.]
Note it’s not trivial to find such a q, and real-world scenarios are certainly much harder. We have to satisfy the criterion as well as possible, while also satisfying the importance sampling support requirement, namely that q(x) > 0 wherever f(x)p(x) ≠ 0. For this example I actually plotted f(x)p(x) and then picked a q which resembled it best.
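If you want to make that eyeballing step a little more systematic, one rough option (a sketch for this toy example, not a general recipe) is to evaluate f(x)p(x) on a grid and centre the proposal at its maximum – which for these two Gaussians lands at 5.8:

import numpy as np

def gauss_pdf(x, mean, std):
    # density of N(mean, std) evaluated at x
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

xs = np.linspace(0, 15, 10_001)
target = gauss_pdf(xs, 5, 1) * gauss_pdf(xs, 9, 2)  # shape of the ideal proposal f(x)p(x)
print(xs[np.argmax(target)])  # ~5.8, motivating q = N(5.8, 1)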
Python Implementation
Let’s code this in Python. First, we introduce the necessary functions and distributions, and for convenience use functools.partial to obtain a function representing a normal distribution with fixed mean / standard deviation:
from functools import partial

import numpy as np

MEAN_F, STD_F = 5, 1
MEAN_P, STD_P = 9, 2
MEAN_Q, STD_Q = 5.8, 1

def normal_dist(
    mean: float, standard_deviation: float, x: np.ndarray
) -> np.ndarray:
    # Density of a normal distribution with the given mean / standard deviation.
    return (
        1
        / (standard_deviation * np.sqrt(2 * np.pi))
        * np.exp(-0.5 * ((x - mean) / standard_deviation) ** 2)
    )

f = partial(normal_dist, MEAN_F, STD_F)
p = partial(normal_dist, MEAN_P, STD_P)
q = partial(normal_dist, MEAN_Q, STD_Q)
Then, we generate the plot from above for orientation:
x = np.linspace(0, 15, 100)
plt.plot(x, f(x), "b-", label="f")
plt.plot(x, p(x), "r-", label="p")
plt.plot(x, q(x), "y-", label="q")
plt.legend()
plt.show()
Finally, we come to the (importance) sampling part. First, we compute the direct Monte Carlo estimator for E[f(X)] by generating random samples x from p and calculating the mean of f(x):
x_p = np.random.normal(loc=MEAN_P, scale=STD_P, size=NUM_SAMPLES)
y_p = f(x_p)
Now we apply importance sampling, i.e. we sample from q and correct via the importance weights:
x_q = np.random.normal(loc=MEAN_Q, scale=STD_Q, size=NUM_SAMPLES)
y_q = f(x_q) * p(x_q) / q(x_q)
Putting it all together:
from functools import partial
import matplotlib.pyplot as plt
import numpy as np
NUM_SAMPLES = 1000000
MEAN_F, STD_F = 5, 1
MEAN_P, STD_P = 9, 2
MEAN_Q, STD_Q = 5.8, 1
def normal_dist(
    mean: float, standard_deviation: float, x: np.ndarray
) -> np.ndarray:
    return (
        1
        / (standard_deviation * np.sqrt(2 * np.pi))
        * np.exp(-0.5 * ((x - mean) / standard_deviation) ** 2)
    )
f = partial(normal_dist, MEAN_F, STD_F)
p = partial(normal_dist, MEAN_P, STD_P)
q = partial(normal_dist, MEAN_Q, STD_Q)
x = np.linspace(0, 15, 100)
plt.plot(x, f(x), "b-", label="f")
plt.plot(x, p(x), "r-", label="p")
plt.plot(x, q(x), "y-", label="q")
plt.legend()
plt.show()
# Direct Monte Carlo: sample from p and evaluate f
x_p = np.random.normal(loc=MEAN_P, scale=STD_P, size=NUM_SAMPLES)
y_p = f(x_p)
# Importance sampling: sample from q and reweight by p(x) / q(x)
x_q = np.random.normal(loc=MEAN_Q, scale=STD_Q, size=NUM_SAMPLES)
y_q = f(x_q) * p(x_q) / q(x_q)
print(
    f"Original mean / variance: {np.mean(y_p):.6f} / {np.var(y_p):.6f}"
)
print(
    f"Importance sampling mean / variance: {np.mean(y_q):.6f} / {np.var(y_q):.6f}"
)
The output will be something like:
Original mean / variance: 0.036139 / 0.007696
Importance sampling mean / variance: 0.036015 / 0.000027
Thus, we still obtain the correct mean, but have reduced the variance by more than two orders of magnitude (roughly a factor of 285 in this run)!
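One caveat worth adding (my remark, not part of the original output): the printed values are per-sample variances; the uncertainty of the mean estimate itself scales as sqrt(variance / NUM_SAMPLES), so the gain in standard error is roughly the square root of the variance ratio:

import numpy as np

# Standard error of the two mean estimates, using the variances printed above.
n = 1_000_000
print(np.sqrt(0.007696 / n))  # ~8.8e-05 for plain Monte Carlo
print(np.sqrt(0.000027 / n))  # ~5.2e-06 with importance sampling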
Conclusion
Importance sampling is a clever reformulation trick, allowing us to compute expectations and other moments by sampling from a different proposal distribution. This not only allows sampling from complex, otherwise hard-to-sample distributions, but also changes the variance of the resulting estimator. In this post we showed how to exploit this to reduce the variance. In particular, we showed that the variance vanishes for the ideal proposal q(x) ∝ f(x)p(x), and that in practice choosing a proposal with high density in regions where f(x)p(x) (the product of the original density and the function in question) is high yields the best results.
Thanks for reading!
This post is Part 2 of a series about sampling. You can find the others here: