The Paradox Series #1

The Two Envelopes Problem

How time and causality are emerging from randomness

Gabriel de Longeaux
Towards Data Science
13 min read · Sep 8, 2023


The two envelopes problem leads to paradoxical and inconsistent decisions when an intuitive but flawed Bayesian probability estimation is used to determine the best course of action. Correcting the mathematical mistake is simple, but there is more to it: first, by modifying the problem very slightly, we can make it undecidable, an example of the ambiguity of natural language as opposed to mathematical formalism; second, by comparing several possible solutions, we can observe how time emerges in the mathematical world, theoretically allowing us to test causal hypotheses.

The two envelopes problem (TEP)

Imagine I show you two seemingly identical envelopes on a table, telling you (truthfully) that both contain money, one twice as much as the other, and I invite you to take one of them and keep the money it contains.

Image generated by Midjourney

Once you have chosen one, and before you open it, I ask you if you want to modify your choice and rather take the other envelope.

What would you do?

You would probably tell me that it would be useless to switch envelopes, as the situation is the same whatever envelope you choose. However, you should note that you have chosen an envelope containing an unknown amount of money x, and the amount y in the other envelope can be 2x or x/2 with equal probability, meaning the expected amount y is 2x (1/2) + (x/2) (1/2) = 5x/4, which is greater than x. So maybe you should switch nevertheless?

Obviously you could also compute the expected amount x based on y: since x is 2y or y/2 with probability 1/2 each, you would find that the expected amount x is 5y/4, which is greater than y.

So what is wrong with this computation? Which envelope is more likely to contain more than the other, if any?

The mathematical flaw in the reasoning

We can arbitrarily label one envelope “X” and the other “Y”. Let us now properly compute the conditional expectation of the amount in the envelope X when we know the amount y is in the Y envelope.

The expectation of the amount in X given the observed amount y in Y, noted E[X|Y = y], obviously depends on the specific amount y observed. Even if, over all possible values for y, the amount x in X can be either y/2 or 2y with probability 1/2 each, it does not mean that this holds for each specific value of y. For example, if y is “very small” (in a sense that will be clarified later), it is more likely that x is bigger than y, and if y is “very big”, it is more likely that x is smaller than y. Over all possible values for y, the probabilities balance out so that X is half the time half of Y and half the time double Y, but this only means that P(X = Y/2) = P(X = 2Y) = 1/2, not that P(X = y/2|Y = y) = 1/2 and P(X = 2y|Y = y) = 1/2 for every specific y.

So we shall try to properly compute E[X|Y = y], but first we need to clarify the process that led to these two envelopes being on the table with labels “X” and “Y”. Let us assume that we filled a first envelope with a random amount U, and a second envelope with the amount 2U. Then we shuffled them and randomly named one of the envelopes X and the other Y. We can represent this naming process as follows: we draw a binary number Z, equal to 0 or 1 with probability 1/2 each. If Z = 0, X is the envelope containing U; otherwise (if Z = 1), X is the envelope containing 2U.

Now we can see that for the exterior observer who is being asked to choose but has no idea of what random numbers were picked for U and Z, the amounts in the envelopes look like this:

X = U + ZU = (1 + Z)U and Y = 2U - ZU = (2 - Z)U

We can verify that P(X = 2Y) = P(U + ZU = 4U - 2ZU) = P(3U - 3ZU = 0) = P(U=ZU) = P(Z = 1) = 1/2 (and it would be the same for P(X = Y/2)).
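This construction is easy to sanity-check by simulation. Below is a minimal sketch (the seed, the rate λ = 1 and the sample size are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 1.0                      # illustrative rate parameter for U
n = 1_000_000

# Generative process described above: U ~ Exp(lam), Z ~ Bernoulli(1/2),
# X = U + Z*U = (1 + Z)*U and Y = 2U - Z*U = (2 - Z)*U.
u = rng.exponential(1 / lam, n)
z = rng.integers(0, 2, n)
x = (1 + z) * u
y = (2 - z) * u

print(np.mean(x == 2 * y))     # estimate of P(X = 2Y), close to 0.5
print(np.mean(y == 2 * x))     # estimate of P(Y = 2X), close to 0.5
```

The two printed frequencies both hover around 1/2, as the algebra above predicts.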

Now we can properly compute E[X|Y = y] = E[3U - Y|Y = y] = E[3U|Y = y] - E[Y|Y = y] = 3E[U|Y = y] - y.

We still have to compute E[U|Y = y], and for this we need to know P(U=u|Y=y) that is (from Bayes’ theorem) proportional to P(Y=y|U=u)P(U=u).

To compute P(Y = y|U) we recall that Y is either U or 2U, meaning that the value u taken by U is either y or y/2:

  • when y is not u or u/2, there is no chance that Y = y: P(Y = y|U = u) = 0
  • when y is u, there is half a chance (Z = 1) that Y = y: P(Y = y|U = u) = 1/2
  • when y is u/2, there is half a chance (Z = 0) that Y = y: P(Y = y|U = u) = 1/2

With the mathematical formalism:

P(Y = y | U = u) = (1/2) 1{u = y} + (1/2) 1{u = y/2}

where:

1{A} denotes the indicator function of the condition A, equal to 1 when A holds and 0 otherwise.

All this summarizes as: given Y = y, the only possible values for U are y/2 and y, each with likelihood P(Y = y | U = u) = 1/2.

Then we have to know P(U = u). We can only make an assumption, e.g. that U is exponentially distributed on positive real numbers (with parameter λ > 0):

P(U = u) = λ e^(-λu) for u > 0

In the end, P(U = u|Y = y) is proportional to:

(1/2) λ e^(-λy/2) if u = y/2, (1/2) λ e^(-λy) if u = y, and 0 otherwise.

In other words:

P(U = y/2 | Y = y) = e^(-λy/2) / (e^(-λy/2) + e^(-λy))
P(U = y | Y = y) = e^(-λy) / (e^(-λy/2) + e^(-λy))

Now we have all we need to compute E[X|Y = y] = 3E[U|Y = y] - y, which is equal to:

3 [(y/2) e^(-λy/2) + y e^(-λy)] / (e^(-λy/2) + e^(-λy)) - y

Summarizing, we now know that:

E[X|Y = y] = y [(1/2) e^(-λy/2) + 2 e^(-λy)] / (e^(-λy/2) + e^(-λy))

This is quite different from the initial 5y/4!

The expectation for x is (strictly) greater than y if and only if:

(1/2) e^(-λy/2) + 2 e^(-λy) > e^(-λy/2) + e^(-λy), i.e. e^(-λy/2) > 1/2

or said otherwise if and only if:

y < 2 ln(2) / λ

(which is twice the median of the exponential distribution of parameter λ from which the amounts are drawn).

So we can better understand the error in our previous reasoning. While it is true, by design, that when averaging over all possible values y, X is half the time twice the amount y and half the time half of it, for a specific value of y the probabilities are not half and half: if y is “big” compared to what is expected from the way the values of U were picked, the envelope X more probably contains the smaller amount, and if y is “small”, the envelope X more probably contains the bigger amount. Here the frontier between “big” and “small” is simply twice the median of the exponential distribution.

The choice of X or Y is symmetric, as E[Y|X = x] = E[3U - X|X = x] = 3E[U|X = x] - x, and from here all previous computations still apply, mutatis mutandis.

It seems that the paradox is solved, but I claim that in reality the two envelopes problem can be undecidable, meaning that we cannot really know if the problem is symmetric, or if we should prefer one envelope to the other.

An undecidable problem

Let us now assume that on the table lie two envelopes, seemingly identical except that they have already been labelled “X” and “Y”. We are now told that the envelope X contains either half or double the amount in Y, with probability 1/2 for each possibility. By symmetry, the same applies to the envelope Y. You are now asked to choose one envelope: which one should you choose?

Based on the previous example, it seems obvious that we can choose indifferently one or the other. However, this is wrong! It all depends on our hypotheses, or said otherwise it all depends on the (statistical) representation of the problem.

Here, the fact that the envelopes are already labelled when we are asked to choose one is key. What was the process used to choose the amounts in the envelopes and label them? If they were randomly labelled as in the previously studied example, I would agree that choosing one or the other is statistically equivalent.

But let us imagine that the amount for X is chosen from an exponential distribution on positive real numbers (with parameter λ > 0), similarly to what was done for U in the previous example. Then the amount for the envelope Y is simply chosen at random as half or double the amount in X (with uniform probabilities): Y = HX, where H takes the values 1/2 or 2 with probability 1/2 each (H is independent from X).

Now let us compute the cumulative distribution of values for Y:

P(Y < y) = P(HX < y) = P(HX < y | H = 1/2) P(H = 1/2) + P(HX < y | H = 2) P(H = 2)

= (1/2) P(X/2 < y) + (1/2) P(2X < y) = (1/2) P(X < 2y) + (1/2) P(X < y/2)

= (1/2) F(2y) + (1/2) F(y/2), where F is the cumulative distribution function of X (exponential distribution),

for non-negative values of y.

Differentiating to get the probability density for Y = y, we get:

f(y) = λ e^(-2λy) + (λ/4) e^(-λy/2) = (1/2) [2λ e^(-2λy)] + (1/2) [(λ/2) e^(-λy/2)]

This is the average of two probability density functions of exponential distributions, one of parameter λ/2 and the other of parameter 2λ, meaning that the average value in the envelope Y is the average of the averages 2/λ and 1/(2λ):

E[Y] = (1/2) (2/λ) + (1/2) (1/(2λ)) = 5/(4λ)

This is more than the average value of X, since the mean of an exponential random variable of parameter λ is 1/λ. (For those interested only in the expectation: E[Y] = E[HX] = E[H]E[X] because H and X are independent, so E[Y] = [(1/2)(1/2) + 2(1/2)] E[X] = (5/4)E[X].)

The conclusion is that in this case, and if we care only about the mean to take a decision, then we should systematically choose the envelope Y.
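A short simulation illustrates this asymmetry between the two pre-labelled envelopes (the seed, λ = 1 and the sample size are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 1.0                          # illustrative rate parameter
n = 1_000_000

x = rng.exponential(1 / lam, n)    # amount in X ~ Exp(lam)
h = rng.choice([0.5, 2.0], n)      # H = 1/2 or 2 equally likely, independent of X
y = h * x                          # amount in Y = H X

print(x.mean())                    # close to E[X] = 1/lam = 1.0
print(y.mean())                    # close to E[Y] = 5/(4*lam) = 1.25
```

Even though “X is half or double Y” sounds symmetric, the generating process Y = HX makes Y preferable on average.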

However, we could also assume that instead of having Y = HX, we have X = HY, the amount in Y being drawn from an exponential distribution of parameter λ; in that case we should rather choose the envelope X.

We do not know enough about the process that generated the two envelopes on the table to be able to decide with no additional assumption what envelope we should choose.

Is that all there is to say? No, the most interesting part is still to come. We can see from what we have done up to this point that the physical processes generating the situation with the envelopes have to be modeled with random variables.

But in physical processes, there is time: for example, we choose an amount for X, and then we deduce from it the amount to be put in Y, or the reverse; and the statistical model is able to reproduce this, with different conclusions depending on whether the amount of X is chosen before the amount of Y, or after. In other words, our statistical models are able to reproduce mathematically the physical reality of time.

The emergence of time and causality from randomness

It is often said that mathematics can only prove correlation, not causation. In that regard, causality analysis in econometrics is no more than a correlation analysis as far as mathematics is involved. It is the human mind that decides that an event is the consequence of another one, based on the correlation between both events and on time: the event coming after the first one can only be the consequence, not the cause.

Because time is not a mathematical concept but a physical one, mathematics seem to be helpless to establish causal relationships independently from any human input about what phenomenon happened first (thus being characterized as the cause) and what phenomenon happened second (thus being characterized as the consequence). But is it really the case? The concept of time originates in the concept of irreversibility: when an object moves from left to right, it is not a change due to time because the object can move back to its original location; when an object is aging, it is a change due to the passage of time because the process is irreversible. Time is the irreversible change in the states of the world.

In physics, irreversibility is viewed as a consequence of an increase in disorder, formally called entropy: it is because the molecules composing an object are getting more disordered that the object will never be able to revert to its initial state, and so the changes are viewed not only as happening in time, but as happening because of time. While changes in states are sufficient to say that time goes by, physical irreversibility causes time to flow in only one direction, allowing us to distinguish causes from consequences.

Without entering too much into the details, only the macro-state of an aging object is irreversible: at a microscopic level, from the viewpoint of theoretical physics, molecules and particles can reorder themselves in a way similar to a past state. Thus, physical irreversibility cannot simply be modeled by a non-invertible mathematical function, as the microscopic reversibility would then be absent. Instead, random variables are macroscopically non-invertible but microscopically invertible: e.g. if Y = HX, it does not mean that X = Y/H (irreversibility from a macroscopic point of view); however, for any values y, h and x taken by Y, H and X, y = hx implies x = y/h (reversibility from a microscopic point of view). The two envelopes paradox is particularly confusing because in its formulation everything seems symmetrical (if x is half or twice y, then y is twice or half x), while this is only true at a “microscopic” level.

But how could the link between physical entropy and random variables help in the study of causality?

Let us consider again the last example with two pre-labelled envelopes X and Y, and let us assume we know that either Y = HX or X = HY, meaning that either Y is the consequence of X or vice versa. We can test each hypothesis by taking a large number of observations of X and Y in order to estimate the probability densities of these two random variables. One of the densities will be “more entropic” (under the specific mathematical relationship being tested), as it is derived from the density of the other random variable but “disordered” by the random variable H (whose density is assumed to be known).

Let us now consider more usual problems. Linear regressions are often performed to quantify a causal relationship between several variables: for instance, Y = αX, where we assume Y is the consequence of X and we want to quantify the causality coefficient α. However, this does not prove in any way a causal relationship from X to Y; it only allows us to quantify the assumed relationship between X and Y, if the assumption is true.

With such a simple example where Y is assumed to be equal to αX, it is not possible to identify a causal relationship mathematically, because it is equivalent to saying that X = Y/α. However, if the coefficient α is considered to be one historical value of a more general process A, it is possible to compare the distributions of Y, A and X and see which of Y = AX or X = Y/A is more plausible. Another example would be the study of a relationship Z = X + Y (Z is caused by X and Y), to be compared to other possibilities such as Y = Z - X (Y is caused by X and Z): a comparison of the distributions of X, Y and Z would provide an answer to the causality problem.

While such considerations are very theoretical and would not prove directly useful in real life, where properly estimating the distributions of the random variables could be costly, complicated or unfeasible, it is possible to imagine using aggregates to perform a causality analysis. For example, in the case where we have to choose between Y = HX and X = HY, we have seen that in the first case E[Y] > E[X] and in the second case E[X] > E[Y]. In the case of linear relationships, we could have to choose between X = Y + Z, Y = X - Z and Z = X - Y. The expectations are then not useful (except if we take the exponential, e.g. exp(X) = exp(Y)exp(Z)), as E[X] = E[Y] + E[Z] in every case; but if the causes are assumed independent, the relationship Var(X) = Var(Y) + Var(Z) (i.e. Cov(Y, Z) = 0) would hold only in the first case.
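The variance-based test can be sketched on synthetic data. Here the true structure is X = Y + Z with independent causes Y and Z; the distributions, seed and sample size are arbitrary illustrative choices, and only the variances matter for the test:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Toy setup: the true causal structure is X = Y + Z with independent causes.
y = rng.exponential(1.0, n)      # Var(Y) ~ 1
z = rng.normal(0.0, 2.0, n)      # Var(Z) ~ 4
x = y + z

# Hypothesis "X = Y + Z with independent causes": Var(X) = Var(Y) + Var(Z).
print(x.var(), y.var() + z.var())    # close to each other (~5): plausible
# Hypothesis "Y = X - Z with independent causes": Var(Y) = Var(X) + Var(Z).
print(y.var(), x.var() + z.var())    # far apart (~1 vs ~9): rejected
```

The wrong hypothesis fails because X and Z are strongly correlated (Cov(X, Z) = Var(Z) here), so the variance identity that assumes independent causes does not hold for it.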

Such techniques could provide useful indications about causal relationships, and help in testing hypotheses. But even more importantly, is it not beautiful that the physical time of our world emerges in the mathematical world from the concept of randomness?

Conclusion

Starting by analyzing a well-known statistical “paradox”, the two envelopes problem, we recognized that the paradox emerged not only from a mathematical flaw in the naïve solution of the problem, but also from an ambiguity in human language that made two distinct functions of random variables (HX and X/H) look equivalent.

Digging further, it appeared that equations involving random variables, while impossible to “reverse” in the general case (macroscopic view), were “reversible” when considering instead realizations of the random variables (microscopic view).

This was then the occasion to propose an analogy between the sample space Ω of the random variables and the phase space of physical systems, leading subsequently to observe the emergence of “physical entropy” in the statistical world and thus of irreversibility and time.

Finally, after time emerged from our obscure computations, we were able to draw conclusions about ways to test causality hypotheses that go beyond simple correlation analyses. All this from two envelopes!
