(Image by Author) (Dataset Source: UC Irvine)

Understanding Conditional Variance and Conditional Covariance

And a tutorial on how to calculate them using a real-world data set

Towards Data Science
9 min read · Apr 10, 2022


Conditional Variance and Conditional Covariance are concepts that are central to statistical modeling. In this article, we’ll learn what they are, and we’ll illustrate how to calculate them using a real-world data set.

First, a quick refresher on what variance and covariance are.

The variance of a random variable measures its variation around its mean. The covariance between two random variables is a measure of how correlated their variations around their respective means are.
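As a quick illustration of these two definitions, here is a minimal sketch that computes the sample variance and sample covariance by hand and compares them with NumPy's built-ins. The arrays x and z are made-up toy values, not taken from the automobiles data set used later.

```python
import numpy as np

# Hypothetical toy data, not from the article's data set
x = np.array([2.0, 4.0, 6.0, 8.0])
z = np.array([1.0, 3.0, 2.0, 6.0])

n = len(x)
# Sample variance: sum of squared deviations from the mean, divided by n-1
var_x = np.sum((x - x.mean()) ** 2) / (n - 1)
# Sample covariance: sum of products of the two sets of deviations, divided by n-1
cov_xz = np.sum((x - x.mean()) * (z - z.mean())) / (n - 1)

# The hand-computed values should agree with NumPy's built-ins
print(var_x, np.var(x, ddof=1))
print(cov_xz, np.cov(x, z, ddof=1)[0, 1])
```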

Conditional variance

The conditional variance of a random variable X is a measure of how much variation is left over after some of it is ‘explained away’ via X’s association with other random variables Y, Z, W, etc.

It is written as Var(X|Y,Z,W) and read as the variance of X conditioned upon Y, Z and W.

First, let’s state the formula for the unconditional (total) variance:

The formula for the sample variance of X (Image by Author)
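Written out, and assuming the n-1 (sample) denominator that Pandas’ var() uses by default, the sample variance formula reads:

```latex
\mathrm{Var}(X) = \frac{1}{n-1}\sum_{i=1}^{n}\bigl(x_i - E(X)\bigr)^2
```

Here n is the number of rows in the data set and x_i is the value of X in row i.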

In the above formula, E(X) is the “unconditional” expectation (mean) of X.

The formula for conditional variance is obtained by simply replacing the unconditional expectation with the conditional expectation as follows. Note that in equation (2), we are now calculating the variance of Y (not X), conditioned upon X:

The formula for the sample variance of Y conditioned upon X (Image by Author)
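Written out, and assuming the same n-1 denominator as the sample variance, equation (2) reads:

```latex
\mathrm{Var}(Y \mid X) = \frac{1}{n-1}\sum_{i=1}^{n}\bigl(y_i - E(Y \mid X = x_i)\bigr)^2 \tag{2}
```

The only change from the unconditional formula is that each y_i is compared with its own conditional expectation rather than with a single overall mean.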

E(Y|X) is the value of Y that is predicted by a regression model that is fitted on a data set in which the dependent variable is Y and the explanatory variable is X. The index i is implicit in the conditional expectation, i.e. for each row i in the data set, we use E(Y|X=x_i).

Here, our choice of regression model is important. A well-chosen model will explain a substantial amount of the variance in Y, and therefore the conditional variance of Y on X will be correspondingly small. On the other hand, a poorly chosen model will leave a large conditional variance, since it is unable to explain most of the variance in Y.
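To make the effect of model choice concrete, here is a minimal sketch on synthetic data. The data, the helper conditional_variance() and the use of np.polyfit are all illustrative assumptions, not part of the article's own code. Y depends on X quadratically, so a straight-line model leaves much more variance unexplained than a quadratic one.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: y depends on x quadratically, plus a little noise
x = np.linspace(-3, 3, 200)
y = x ** 2 + rng.normal(scale=0.5, size=x.size)

def conditional_variance(y, y_pred):
    # Equation (2): spread of y around the model's predictions, n-1 denominator
    return np.sum((y - y_pred) ** 2) / (len(y) - 1)

# Straight-line model: a poor choice for this relationship
linear_pred = np.polyval(np.polyfit(x, y, deg=1), x)
# Quadratic model: matches how the data were generated
quadratic_pred = np.polyval(np.polyfit(x, y, deg=2), x)

print('Total variance:', np.var(y, ddof=1))
print('Conditional variance (linear model):', conditional_variance(y, linear_pred))
print('Conditional variance (quadratic model):', conditional_variance(y, quadratic_pred))
```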

The above formula for conditional variance can be extended to more than one conditioning variable by using a regression model in which the X matrix contains more than one regression variable.

Illustration

Let’s illustrate the procedure for calculating conditional variance using some real world data. The following data set contains specifications of 205 automobiles taken from the 1985 edition of Ward’s Automotive Yearbook. Each row contains a set of 26 specifications about a single vehicle.

The automobiles data set (Source: UC Irvine)

We’ll consider a small subset of this data set consisting of the following six variables:
City_MPG
Curb_Weight
Vehicle_Volume
Num_Cylinders
Vehicle_Price
Engine_Size

This 6-variable data set can be downloaded from here.

Let’s plot Engine_Size versus Num_Cylinders. We’ll use Python with the Pandas and Matplotlib packages to load the data into a DataFrame and display the plot.

Let’s import all the required packages, including ones that we will use later in the article.

import pandas as pd
from patsy import dmatrices
import numpy as np
import scipy.stats
import statsmodels.api as sm
import matplotlib.pyplot as plt

Now let’s load the data file into a Pandas DataFrame and plot Engine_Size versus Num_Cylinders.

#Read the automobiles dataset into a Pandas DataFrame
df = pd.read_csv('automobile_uciml_6vars.csv', header=0)
#Drop all empty rows
df = df.dropna()
#Plot Engine_Size versus Num_Cylinders
fig = plt.figure()
fig.suptitle('Engine_Size versus Num_Cylinders')
plt.xlabel('Num_Cylinders')
plt.ylabel('Engine_Size')
plt.scatter(df['Num_Cylinders'], df['Engine_Size'])
#Plot a horizontal line at the unconditional mean of Engine_Size
plt.plot([0, df['Num_Cylinders'].max()], [df['Engine_Size'].mean(), df['Engine_Size'].mean()],
         color='red', linestyle='dashed')

#Group the DataFrame by Num_Cylinders and calculate the mean for each group
df_grouped_means = df.groupby(['Num_Cylinders']).mean()

#Print out all the grouped means
print(df_grouped_means)

#Plot the group-specific means of Engine_Size
for i in df_grouped_means.index:
    mean = df_grouped_means['Engine_Size'].loc[i]
    plt.plot(i, mean, color='red', marker='o')

plt.show()

Here is the table of grouped means i.e. the means conditioned upon various values of Num_Cylinders.

Table of grouped means (conditional means) (Image by Author)

And we also see the following plot showing the variation in Engine_Size across different values of Num_Cylinders:

Scatter plot of Engine_Size versus Num_Cylinders showing the unconditional mean and the conditional mean of Engine_Size (Image by Author)

The red horizontal line indicates the unconditional mean value of 126.88. The red dots indicate the mean Engine_Size for different values of Num_Cylinders. These are the conditional means, a.k.a. conditional expectations, of Engine_Size for different values of Num_Cylinders, and they are denoted as E(Engine_Size|Num_Cylinders=x).

Unconditional (Total) variance in Engine_Size

Let’s revisit the formula for the total variance of X:

The formula for the sample variance of X (Image by Author)

In the above formula, if X=Engine_Size, the mean, denoted by E(X), is 126.88. Using this formula, we calculate the sample variance of Engine_Size as 1726.14. This is a measure of the variation of Engine_Size around the unconditional expectation of 126.88.

In Pandas, we can get the value of the total variance as follows:

unconditional_variance_engine_size = df['Engine_Size'].var()
print('Unconditional variance in Engine_Size='+str(unconditional_variance_engine_size))

We see the following output:

Unconditional variance in Engine_Size=1726.1394527363163

Conditional variance in Engine_Size

The variance of Engine_Size conditioned upon Num_Cylinders is the variance left over in Engine_Size after some of it has been ‘explained’ by the regression of Engine_Size on Num_Cylinders. We can use Equation (2) to calculate it as follows:

Variance of Engine_Size conditional upon Num_Cylinders (Image by Author)
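One way to carry out this calculation in Pandas is sketched below. It assumes that the conditional expectation E(Engine_Size|Num_Cylinders=x) is taken to be the group mean shown by the red dots in the plot, which is one possible choice of model. The toy_df rows are hypothetical stand-ins so that the snippet runs on its own; with the real data you would use df instead.

```python
import pandas as pd

# Hypothetical stand-in rows; replace toy_df with the article's df for real results
toy_df = pd.DataFrame({
    'Num_Cylinders': [4, 4, 4, 6, 6, 8],
    'Engine_Size':   [90, 110, 100, 170, 190, 300],
})

# E(Engine_Size | Num_Cylinders = x_i) for every row: the mean of that row's group
cond_mean = toy_df.groupby('Num_Cylinders')['Engine_Size'].transform('mean')

# Equation (2): squared deviations from the conditional means, n-1 denominator
residuals = toy_df['Engine_Size'] - cond_mean
cond_var = (residuals ** 2).sum() / (len(toy_df) - 1)

print('Total variance:', toy_df['Engine_Size'].var())
print('Variance conditioned upon Num_Cylinders:', cond_var)
```

Because most of the spread in the toy Engine_Size values is between cylinder groups rather than within them, the conditional variance comes out far smaller than the total variance.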

Now let’s look at a slightly more involved example.

Suppose we wish to calculate the variance of Engine_Size conditioned upon Curb_Weight, Vehicle_Volume and Num_Cylinders.

To do so, we use the following procedure:

  1. Construct a regression model in which the response variable is Engine_Size and the regression variables are Curb_Weight, Vehicle_Volume, Num_Cylinders and an intercept.
  2. Train the model on a data set.
  3. Run the trained model on the data set to get the predicted (expected) values of Engine_Size for each combination of Curb_Weight, Vehicle_Volume, Num_Cylinders. These are the set of conditional expectations:
    E(Engine_Size|Curb_Weight, Vehicle_Volume, Num_Cylinders) corresponding to the observed values of Engine_Size.
  4. Plug the observed values of Engine_Size and the predicted values calculated in step 3 into equation (2) to get the conditional variance.

Let’s calculate it!

#Construct the regression expression. A regression intercept is included by default
olsr_expr = 'Engine_Size ~ Curb_Weight + Vehicle_Volume + Num_Cylinders'
#Carve out the y and X matrices based on the regression expression
y, X = dmatrices(olsr_expr, df, return_type='dataframe')
#Build the OLS linear regression model
olsr_model = sm.OLS(endog=y, exog=X)
#Train the model
olsr_model_results = olsr_model.fit()
#Make the predictions on the training data set. These are the conditional expectations of y
y_pred=olsr_model_results.predict(X)
y_pred=np.array(y_pred)
#Convert y from a Pandas DataFrame into an array
y=np.array(y['Engine_Size'])
#Calculate the conditional variance in Engine_Size using equation (2)
conditional_variance_engine_size = np.sum(np.square(y-y_pred))/(len(y)-1)
print('Conditional variance in Engine_Size='+str(conditional_variance_engine_size))

We get the following output:

Conditional variance in Engine_Size=167.42578329039935

As expected, this variance of 167.43 is considerably less than the total variance in Engine_Size (1726.14).

Relationship of conditional variance to R-squared

R-squared for a linear regression model is the fraction of the total variance in the dependent variable that the explanatory variables are able to ‘explain’.

Definition of R-squared of a linear regression model (Image by Author)

We now know that the variance in y that X was not able to explain is the conditional variance Var(y|X). And the total variance in y is simply the unconditional variance Var(y). Hence R-squared can be expressed in terms of conditional and unconditional variance as follows:

R-squared expressed in terms of conditional and unconditional variance in y (Image by Author)
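Written out in the notation of the surrounding text, equation (4) reads:

```latex
R^2 = 1 - \frac{\mathrm{Var}(y \mid X)}{\mathrm{Var}(y)} \tag{4}
```

Both variances are assumed to use the same n-1 denominator, so the denominators cancel in the ratio.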

Let’s calculate R-squared for the linear regression model that we had constructed earlier. Recollect that the dependent variable y was Engine_Size while the explanatory variables X were Curb_Weight, Vehicle_Volume and Num_Cylinders.

The total variance in y was found to be 1726.1394527363163.

The conditional variance in y, i.e. variance in y conditioned upon Curb_Weight, Vehicle_Volume and Num_Cylinders was found to be 167.42578329039935.

Using equation (4), R-squared of this linear model is:

R-squared = 1 - 167.43/1726.14 = 0.903
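The arithmetic above can be reproduced in a couple of lines. The sketch below also checks the same identity on a small synthetic regression, using the fact that for an OLS fit with an intercept, R-squared equals the squared correlation between y and its fitted values. The synthetic data and the use of np.linalg.lstsq are illustrative assumptions, not the article's own code.

```python
import numpy as np

# The two variances reported earlier in the article
total_variance = 1726.1394527363163
conditional_variance = 167.42578329039935
r_squared_article = 1 - conditional_variance / total_variance
print(round(r_squared_article, 3))

# Hypothetical synthetic check of the identity for an OLS fit with an intercept
rng = np.random.default_rng(42)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -1.5]) + rng.normal(size=100)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_pred = X @ beta

var_y = np.sum((y - y.mean()) ** 2) / (len(y) - 1)
var_y_given_X = np.sum((y - y_pred) ** 2) / (len(y) - 1)
r_squared = 1 - var_y_given_X / var_y

# The two numbers printed here should agree
print(r_squared, np.corrcoef(y, y_pred)[0, 1] ** 2)
```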

This value matches perfectly with the value reported by statsmodels:

Model training summary of the linear regression model (Image by Author)

Conditional covariance

Recollect that covariance between two random variables X and Z is a measure of how correlated the variations in X and Z are with each other. Its formula is as follows:

Formula for sample covariance between X and Z (Image by Author)
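Written out, and assuming the n-1 (sample) denominator that Pandas’ cov() uses by default, equation (5) reads:

```latex
\mathrm{Cov}(X, Z) = \frac{1}{n-1}\sum_{i=1}^{n}\bigl(x_i - E(X)\bigr)\bigl(z_i - E(Z)\bigr) \tag{5}
```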

In this formula, E(X) and E(Z) are the unconditional means (a.k.a. unconditional expectations) of X and Z.

The covariance of X and Z, conditional upon some random variable(s) W, is a measure of how correlated the variations in X and Z are around the conditional expectations of X on W and Z on W respectively.

Formula for sample conditional covariance between X and Z (Image by Author)
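Written out, and assuming the same n-1 denominator as the unconditional covariance, the conditional covariance reads:

```latex
\mathrm{Cov}(X, Z \mid W) = \frac{1}{n-1}\sum_{i=1}^{n}\bigl(x_i - E(X \mid W = w_i)\bigr)\bigl(z_i - E(Z \mid W = w_i)\bigr)
```

As with conditional variance, the only change is that each deviation is measured from a row-specific conditional expectation instead of an overall mean.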

E(X|W) and E(Z|W) are the conditional expectations of X and Z on W. Hence (x_i - E(X|W)) is the variation in X after some of it has been explained by W. The same goes for (z_i - E(Z|W)). The index i is implicit in the two conditional expectations, i.e. for each row i in the data set, we use E(X|W=w_i) and E(Z|W=w_i).

Thus, the conditional covariance is a measure of how correlated the variations in X and Z are after some of their respective variances have been explained by W.

As with the procedure for calculating conditional variance, we can estimate the conditional expectations E(X|W) and E(Z|W) by regressing X on W, and Z on W. The respective regression model’s predictions on the training data set are the corresponding conditional expectations E(X|W) and E(Z|W) that we are seeking.
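The procedure just described can be sketched as a small helper function. This is an illustration under stated assumptions: the function name conditional_covariance() is hypothetical, E(X|W) and E(Z|W) are estimated with straight-line fits via np.polyfit, and the data are synthetic. In the synthetic data, x and z are related only through their shared dependence on w, so the covariance should largely vanish once w is netted out.

```python
import numpy as np

def conditional_covariance(x, z, w):
    # Estimate E(X|W) and E(Z|W) with straight-line fits on w (an assumed model choice)
    x_pred = np.polyval(np.polyfit(w, x, deg=1), w)
    z_pred = np.polyval(np.polyfit(w, z, deg=1), w)
    # Sum of products of the two sets of residuals, n-1 denominator
    return np.sum((x - x_pred) * (z - z_pred)) / (len(x) - 1)

rng = np.random.default_rng(1)
# Hypothetical data: x and z are related only through their shared dependence on w
w = rng.normal(size=500)
x = 2.0 * w + rng.normal(scale=0.5, size=500)
z = -3.0 * w + rng.normal(scale=0.5, size=500)

print('Unconditional covariance:', np.cov(x, z, ddof=1)[0, 1])
print('Covariance conditioned upon w:', conditional_covariance(x, z, w))
```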

Illustration

We’ll calculate the covariance between Engine_Size and Curb_Weight, conditional upon Vehicle_Volume.

First, we’ll establish a baseline by calculating the unconditional (total) covariance between Engine_Size and Curb_Weight. This can be easily done using equation (5) as follows:

Formula for total covariance between Engine_Size and Curb_Weight (Image by Author)

Using Pandas, we can calculate this covariance as follows:

covariance = df['Curb_Weight'].cov(df['Engine_Size'])
print('Covariance between Curb_Weight and Engine_Size='+str(covariance))

We see the following output:

Covariance between Curb_Weight and Engine_Size=18248.28333333333

Let’s also view the scatter plot of mean-centered Engine_Size and mean-centered Curb_Weight to get a visual feel for this covariance:

#Plot mean-centered Curb_Weight versus Engine_Size
fig = plt.figure()
fig.suptitle('Mean centered Curb_Weight versus Engine_Size')
plt.xlabel('Mean centered Engine_Size')
plt.ylabel('Mean centered Curb_Weight')
plt.scatter(df['Engine_Size']-df['Engine_Size'].mean(), df['Curb_Weight']-df['Curb_Weight'].mean())
plt.show()

We see the following plot:

Scatter plot of mean-centered (demeaned) Curb Weight versus demeaned Engine Size of vehicles (Image by Author)

One thing we immediately notice in this plot is that there appears to be a wide variation in curb weights for vehicles with similar engine size:

Scatter plot of variation in Curb Weight against the mean-centered Engine Size of vehicles showing wide variation in curb weights amongst vehicles with similar engine sizes (Image by Author)

There are other factors involved that could explain some of this variance in Curb Weight within a particular Engine Size.

Let’s look at Vehicle Volume as one such factor. Specifically, let’s calculate the covariance between Curb_Weight and Engine_Size conditional upon Vehicle Volume, i.e. after netting out the effect of Vehicle Volume.

Formula for covariance between Engine_Size and Curb_Weight conditional upon Vehicle_Volume (Image by Author)

In the above formula, the two conditional expectations marked in green can be obtained by regressing Engine_Size on Vehicle_Volume and Curb_Weight on Vehicle_Volume. As before, the index i is implicit in the two expectations.

Using Pandas and statsmodels, let’s calculate this conditional covariance as follows. In the code below, X=Engine_Size, Z=Curb_Weight and W=Vehicle_Volume.

#Carve out the X and W matrices. An intercept is automatically added to W.
X, W = dmatrices('Engine_Size ~ Vehicle_Volume', df, return_type='dataframe')
#Regress X on W
olsr_model_XW = sm.OLS(endog=X, exog=W)
olsr_model_XW_results = olsr_model_XW.fit()
#Get the conditional expectations E(X|W)
X_pred=olsr_model_XW_results.predict(W)
X_pred=np.array(X_pred)
X=np.array(df['Engine_Size'])

#Carve out the Z and W matrices
Z, W = dmatrices('Curb_Weight ~ Vehicle_Volume', df, return_type='dataframe')
#Regress Z on W
olsr_model_ZW = sm.OLS(endog=Z, exog=W)
olsr_model_ZW_results = olsr_model_ZW.fit()
#Get the conditional expectations E(Z|W)
Z_pred=olsr_model_ZW_results.predict(W)
Z_pred=np.array(Z_pred)
Z=np.array(df['Curb_Weight'])

#Construct the delta matrices
Z_delta=Z-Z_pred
X_delta=X-X_pred

#Calculate the conditional covariance
conditional_covariance = np.sum(Z_delta*X_delta)/(len(Z)-1)
print('Conditional Covariance between Curb_Weight and Engine_Size='+str(conditional_covariance))

We see the following output:

Conditional Covariance between Curb_Weight and Engine_Size=7789.498082862661

If we compare this value of 7789.5 with the total covariance of 18248.28 calculated earlier, we see that the covariance between Engine_Size and Curb_Weight net of the effect of Vehicle_Volume is much smaller than the total covariance, which includes that effect.

Here is the complete source code used in the article:

References, Citations and Copyrights

Data set

The Automobile Data Set citation: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. Download link

If you liked this article, please follow me at Sachin Date to receive tips, how-tos and programming advice on topics devoted to regression, time series analysis, and forecasting.
