The Counter-Intuitive Nature of Probabilistic Relationships

That y can be estimated as a linear function of x does not imply that x can also be estimated as a linear function of y

Alireza Modirshanechi
Towards Data Science


An example of probabilistic relationships (same visualization style as in Figure 1A-B) — Image by the author

Consider two real-valued variables x and y, for example, the height of a father and the height of his son. The central problem of regression analyses in statistics is to guess y by knowing x, e.g., to guess the height of the son based on the height of his father¹.

The idea in linear regression is to use a linear function of x as a guess for y. Formally, this means to consider ŷ(x) = α₁x + α₀ as our guess and find α₀ and α₁ by minimizing the mean squared error between y and ŷ. Now, let’s assume that we use a huge dataset and find the best possible values for α₀ and α₁, so we know how to find the best estimate of y based on x. How can we use these best values for α₀ and α₁ to find a guess x̂(y) about x based on y? For example, if we always knew the best guess about the son’s height based on his father’s, then what would be our guess about the father’s height based on his son’s?
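As a quick illustration, here is a minimal Julia sketch of this fitting step (the data, seed, and numbers below are made up purely for illustration and are not the article's dataset):

```julia
using Random, Statistics

Random.seed!(1)

# made-up standardized heights, only for illustration
n = 10_000
x = randn(n)                     # fathers' heights (deviation from the mean)
y = 0.5 .* x .+ 0.5 .* randn(n)  # sons' heights (deviation from the mean)

# least-squares fit of ŷ(x) = α₁x + α₀ (minimizes the mean squared error)
α1 = cov(x, y) / var(x)
α0 = mean(y) - α1 * mean(x)

# the question of this article: given α0 and α1, what is the best guess x̂(y)?
```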

Such questions are special cases of “How can we use ŷ(x) to find x̂(y)?” Even though it may sound trivial, this question turns out to be surprisingly difficult to answer. In this article, I study the link between ŷ(x) and x̂(y) in both deterministic and probabilistic settings and show that our intuition for how ŷ(x) and x̂(y) relate to each other in deterministic settings cannot be generalized to probabilistic settings.

The formal statement of the problem

Deterministic settings

By deterministic settings, I mean situations where (i) there is no randomness and (ii) each value of x always corresponds to the same value of y. Formally, in these settings, I write y = f(x) for some function f: R → R. In such cases, where x determines y with complete certainty (i.e., no randomness or noise), the best choice of ŷ(x) is f(x) itself. For example, if the height of a son is always 1.05 times his father’s height (let’s ignore the impossibility of the example for now!), then our best guess about the son’s height is to multiply the father’s height by 1.05.

If f is an invertible function, then the best choice of x̂(y) is equal to the inverse of f. In the example above, this means that the best guess about the height of a father is always the height of his son divided by 1.05. Hence, the link between ŷ(x) and x̂(y) in deterministic cases is straightforward and can be reduced to finding the function f and its inverse.
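In code, the deterministic case is as straightforward as it sounds; the following toy sketch (the function names are mine) just applies f from the example above and its inverse:

```julia
# deterministic relationship and its inverse (toy example)
f(x)     = 1.05 * x   # son's height from father's height
f_inv(y) = y / 1.05   # father's height from son's height

x = 180.0
println(f_inv(f(x)) ≈ x)   # true: inverting f recovers x exactly
```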

Probabilistic settings

In probabilistic settings, x and y are samples of random variables X and Y. In such cases where a single value of x can correspond to several values of y, the best choice for ŷ(x) (in order to minimize the mean squared error) is the conditional expectation E[Y|X=x] — see footnote². In application-friendly words, this means that if you train a very expressive neural network to predict y given x (with a sufficiently big dataset), then your network would converge to E[Y|X=x].

Similarly, the best choice for x̂(y) is E[X|Y=y] — if you train your very expressive network to predict x given y, then it converges, in principle, to E[X|Y=y]. Hence, the question of how ŷ(x) relates to x̂(y) in probabilistic settings can be rephrased as how the conditional expectations E[Y|X=x] and E[X|Y=y] relate to each other.
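A small simulation makes this concrete (a sketch with a toy distribution of my own choosing, not the article's code): among predictors of y from x, the conditional expectation E[Y|X=x] attains the lowest mean squared error.

```julia
using Random, Statistics

Random.seed!(2)

# toy joint distribution: X ~ N(0, 1) and Y | X=x ~ N(x², 1)
n = 100_000
x = randn(n)
y = x .^ 2 .+ randn(n)

cond_mean  = x .^ 2            # E[Y | X=x], known here by construction
best_const = fill(mean(y), n)  # the best constant guess, for comparison

mse(pred) = mean((y .- pred) .^ 2)
println("MSE of E[Y|X=x]:         ", mse(cond_mean))   # ≈ 1 (the noise variance)
println("MSE of a constant guess: ", mse(best_const))  # ≈ 3, clearly worse
```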

The goal of this article

To simplify the problem, I focus on linear relationships, i.e., cases where ŷ(x) is linear in x. A linear deterministic relationship has a linear inverse, meaning that y = αx (for some α≠0) implies that x = βy with β = 1/α — see footnote³. The probabilistic linear relationship analogous to the deterministic relationship y = αx is

Equation 1:  Y = αX + Z

where Z is an additional random variable, often called ‘noise’ or ‘error term’, whose conditional average is assumed to be zero, i.e., E[Z|X=x] = 0 for all x; note that we do not always assume that Z is independent of X. Using Equation 1, the conditional expectation of Y given X=x is (see footnote⁴)

Equation 2:  ŷ(x) := E[Y|X=x] = αx

Equation 2 states that the conditional expectation ŷ(x) is linear in x, so it can be seen as the probabilistic twin of the linear deterministic relationship y = αx.

In the rest of this article, I ask two questions:

  1. Does Equation 2 imply that x̂(y) := E[X|Y=y] = βy for some β≠0? In other words, does the linear relationship in Equation 2 have a linear inverse?
  2. If it is indeed the case that x̂(y) = βy, then can we write β = 1/α as in the deterministic case?

I use two counterexamples to show that, counter-intuitive as it may sound, the answer to both questions is negative!

Example 1: When β is not the inverse of α

As the first example, let me consider the most typical setup of linear regression problems, summarized in the following three assumptions (in addition to Equation 1; see Figure 1A for visualization):

  1. Error term Z is independent of X.
  2. X has a Gaussian distribution with mean zero and variance 1.
  3. Z has a Gaussian distribution with mean zero and variance σ².

Figure 1. Visualizing examples 1 and 2. Panels A and B visualize the conditional distribution of Y given X for example 1 (A; α = 0.5 with fixed σ² = 3/4) and example 2 (B; α = 0.5 with σ² dependent on x). Given a value x for the random variable X, the random variable Y follows a Gaussian distribution in both examples: Black lines show the conditional expectation E[Y|X=x], and the shaded areas show the standard deviation of the Gaussian distributions. Points show 500 samples of the joint distribution of (X, Y). Panel C shows the marginal distribution of Y (with X having a standard normal distribution) for example 1 (blue) and example 2 (red): The marginal distribution of Y in example 1 is Gaussian with mean zero and variance α² + σ², but we can only numerically evaluate the marginal distribution of Y in example 2.

It is straightforward to show, after a few lines of algebra, that these assumptions imply that Y has a Gaussian distribution with mean zero and variance α² + σ². Moreover, the assumptions imply that X and Y are jointly Gaussian with mean zero and covariance matrix equal to

Σ = [ 1        α
      α   α² + σ² ]

Since we have the full joint distribution of X and Y, we can derive both conditional expectations (see footnote⁵):

ŷ(x) = E[Y|X=x] = αx    and    x̂(y) = E[X|Y=y] = (α / (α² + σ²)) · y = βy,  with  β = α / (α² + σ²).

Hence, given the assumptions of our first example, Equation 2 has a linear inverse of the form x̂(y) = βy, but β is not equal to its deterministic twin 1/α — unless we have σ = 0 which is equivalent to the deterministic case!
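A quick Monte Carlo check of this result (a sketch under the assumptions of example 1; the sample size and seed are mine) confirms that regressing x on y gives a slope close to α/(α² + σ²) rather than 1/α:

```julia
using Random, Statistics

Random.seed!(3)

α, σ2 = 0.5, 0.75
n = 1_000_000
x = randn(n)                       # X ~ N(0, 1)
z = sqrt(σ2) .* randn(n)           # Z ~ N(0, σ²), independent of X
y = α .* x .+ z                    # Equation 1

slope_y_on_x = cov(x, y) / var(x)  # ≈ α = 0.5
slope_x_on_y = cov(x, y) / var(y)  # ≈ α/(α² + σ²) = 0.5, and not 1/α = 2
println((slope_y_on_x, slope_x_on_y, α / (α^2 + σ2)))
```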

This result shows that our intuitions about deterministic linear relationships cannot be generalized to probabilistic linear relationships. To see more clearly how strange this result is, let us first consider α = 0.5 in a deterministic setting (σ = 0; blue curves in Figure 2A and 2B):

y = 0.5x    and, equivalently,    x = 2y.

This means that, given a value of x, the value of y is half of x, and, given a value of y, the value of x is twice y, which appears to be intuitive. Importantly, for positive values, we always have y < x. Now, let us again consider α = 0.5 but this time with σ² = 3/4 (red curves in Figure 2A and 2B). This choice of noise variance implies that β = α/(α² + σ²) = 0.5/(0.25 + 0.75) = 0.5 = α, resulting in

ŷ(x) = 0.5x    and    x̂(y) = 0.5y.

This means that, given a value of x, our estimation of y is half of x, yet, given a value of y, our estimation of x is also half of y! Strangely, for positive values, we always have x̂(y) < y and ŷ(x) < x, which would be impossible if the relationship between x and y were deterministic. What appears to be counter-intuitive is that Equation 1 can be rewritten as

Equation 3:  X = (Y - Z)/α = Y/α - Z/α

However, this only implies that (in contrast to Equation 2)

Equation 4:  x̂(y) := E[X|Y=y] = y/α - E[Z|Y=y]/α

The twist is that, while we have E[Z|X=x] = 0 by design, we cannot say anything about E[Z|Y=y] and its dependence on y! In other words, what makes x̂(y) different from y/α is that the observation y also carries information about the error Z: if we observe a very large value of y, then, with high probability, the error Z also has a large value, and this should be taken into account when estimating X.
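This dependence is easy to see in a simulation (a sketch under the example-1 assumptions; the binning width is an arbitrary choice of mine): conditioned on a large observed y, the average of the noise Z is clearly not zero.

```julia
using Random, Statistics

Random.seed!(4)

α, σ2 = 0.5, 0.75
n = 1_000_000
x = randn(n)
z = sqrt(σ2) .* randn(n)
y = α .* x .+ z

# estimate E[Z | Y ≈ y0] by averaging z over samples with y close to y0
for y0 in (-2.0, 0.0, 2.0)
    idx = findall(abs.(y .- y0) .< 0.05)
    println("y ≈ $y0:  mean(z) ≈ ", round(mean(z[idx]); digits = 2))
end
```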

Figure 2. Linear relationships and their inverses in examples 1 and 2. Panel A shows the linear relationship between ŷ(x) and x in the probabilistic settings of examples 1 and 2 (red; α = 0.5) and the equivalent deterministic relationship between y and x (blue); note that ŷ(x) as a function of x is the same in both examples. Panels B and C show the inverse relationships between x̂(y) and y in the probabilistic settings of example 1 (red in B; fixed σ² = 3/4) and example 2 (red in C; σ² dependent on x). The blue line shows the inverse of the equivalent deterministic relationship for reference. In all panels, the dashed black line shows the y = x line.

This is the simple explanation for seemingly contradictory statements like ‘tall fathers have sons who are (on average) tall but not as tall as themselves, and, at the same time, tall sons have fathers who are (on average) tall but not as tall as their sons’!

To conclude, our example 1 shows that even if the probabilistic linear relationship ŷ(x) = αx has a linear inverse of the form x̂(y) = βy, the slope β is not necessarily equal to its deterministic twin 1/α.

Example 2: When x̂(y) is nonlinear

Having an inverse of the form x̂(y) = βy is only possible if E[Z|Y=y] in Equation 4 is also a linear function of y. In the second example, I make a small modification to example 1 in order to break this condition!

In particular, I assume that the variance of the error term Z depends on the random variable X — as opposed to assumption 1 in example 1. Formally, I assume (in addition to Equation 1; see Figure 1B for visualization):

  1. X has a Gaussian distribution with mean zero and variance 1 (same as assumption 2 in example 1).
  2. Given X=x, the error Z has a Gaussian distribution with mean zero and variance σ² = 0.01 + 1/(1 + 2x²).

These assumptions effectively mean that, given X=x, the random variable Y has a Gaussian distribution with mean αx and variance 0.01 + 1/(1 + 2x²) (see Figure 1B). As opposed to example 1, where the joint distribution of X and Y was Gaussian, the joint distribution of X and Y in example 2 does not have an elegant form (see Figure 1C). However, we can still use Bayes' rule to find the relatively ugly conditional density of X given Y=y (see Figure 3 for some examples evaluated numerically):

Equation 5:  p(x|y) = N(y; αx, 0.01 + 1/(1 + 2x²)) · N(x; 0, 1) / ∫ N(y; αx′, 0.01 + 1/(1 + 2x′²)) · N(x′; 0, 1) dx′

where N(u; μ, σ²) denotes the probability density, evaluated at u, of a Gaussian distribution with mean μ and variance σ².
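For a given y, this density can be evaluated on a simple one-dimensional grid; the following sketch (my own; the grid and the choice y = 2 are arbitrary) is one way to compute a posterior like those shown in Figure 3:

```julia
# unnormalized posterior p(x|y) ∝ p(y|x) p(x) on a grid (Equation 5), here for y = 2
gauss(u, μ, v) = exp(-(u - μ)^2 / (2v)) / sqrt(2π * v)  # density N(u; μ, v)

α  = 0.5
y  = 2.0
xs = range(-4, 4; length = 2001)

prior = [gauss(x, 0.0, 1.0) for x in xs]                      # p(x)
lik   = [gauss(y, α * x, 0.01 + 1 / (1 + 2x^2)) for x in xs]  # p(y|x)
post  = prior .* lik
post ./= sum(post) * step(xs)   # normalize so the grid approximates a density
println("posterior mode at x ≈ ", xs[argmax(post)])
```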

Figure 3. Conditional distribution of X given Y=y in example 2. Prior distribution p(x) (blue curves), likelihood p(y|x) (orange curves), and the posterior distribution p(x|y) (black curves; evaluated numerically using Equation 5) for y = 0.5, 1.5, and 2, from left to right (assuming α = 0.5 in all cases).

We can then use numerical methods and evaluate the conditional expectation

Equation 6:  x̂(y) = E[X|Y=y] = ∫ x · p(x|y) dx

for a given y and α. Figure 2C shows x̂(y) as a function of y for α = 0.5. As counter-intuitive as it may sound, the inverse relationship is highly nonlinear — as a result of the x-dependent error variance shown in Figure 1B. This shows that the fact that y can be estimated well as a linear function of x does not imply that x can also be estimated well as a linear function of y. This is because E[Z|Y=y] in Equation 4 can have any strange functional dependence on y when we go beyond standard assumptions similar to those in example 1.
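For completeness, here is a minimal sketch (mine, under the example-2 assumptions; the grid range and resolution are arbitrary) of the numerical evaluation of Equations 5 and 6, which reproduces the nonlinear shape of x̂(y):

```julia
# numerically evaluate x̂(y) = E[X|Y=y] for example 2 (Equations 5 and 6)
gauss(u, μ, v) = exp(-(u - μ)^2 / (2v)) / sqrt(2π * v)  # density N(u; μ, v)

α  = 0.5
xs = range(-6, 6; length = 4001)   # integration grid for x

function xhat(y)
    prior = [gauss(x, 0.0, 1.0) for x in xs]                      # p(x)
    lik   = [gauss(y, α * x, 0.01 + 1 / (1 + 2x^2)) for x in xs]  # p(y|x)
    post  = prior .* lik
    post ./= sum(post)             # normalize on the grid
    return sum(xs .* post)         # ≈ ∫ x p(x|y) dx
end

for y in -3.0:1.0:3.0
    println("y = $y  →  x̂(y) ≈ ", round(xhat(y); digits = 3))
end
```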

To conclude, our example 2 shows that the probabilistic linear relationship ŷ(x) = αx does not necessarily have a linear inverse of the form x̂(y) = βy. Importantly, the inverse relationship between x̂(y) and y is dependent on the characteristics of the error term Z.

Conclusion

Throughout our education, most of us have built a rich intuition about deterministic relationships, based on all the cool results we have seen in calculus, analysis, etc. However, it is crucial to be aware of the limits of this intuition: it cannot be trusted when we reason about probabilistic relationships. In particular, examples 1 and 2 show that even extremely simple probabilistic relationships can behave against our intuition.

Acknowledgements

I am grateful to Johanni Brea, Mohammad Tinati, Martin Barry, Guillaume Bellec, Flavio Martinelli, and Ariane Delrocq for useful discussions and valuable feedback on the content of this article.

Code:

All code (in Julia language) for the analyses can be found here.

Footnotes:

¹ Interested readers can see “How the father’s height influences the son’s height” in Towards Data Science for an accessible treatment of this problem.

² See the “Minimum mean square error” page on Wikipedia for more details.

³ Without loss of generality, we always assume that both x and y have zero average. Hence, in the example of the heights of fathers and their sons, x and y denote the difference between their heights and the average heights of fathers and sons, respectively.

⁴ The relationship between Equations 1 and 2 is reversible, i.e., if Equation 2 is the only constraint on X and Y, then we can always write Y as in Equation 1 with a random variable Z that satisfies E[Z|X=x] = 0.

⁵ See the section ‘Bivariate conditional expectation’ in the ‘Multivariate normal distribution’ page on Wikipedia.
