
Xavier Glorot’s initialization is one of the most widely used methods for initializing weight matrices in neural networks. While it is straightforward to use in your deep learning setup, reflecting on the mathematical reasoning behind this standard initialization technique is well worth the effort. Moreover, a theoretical understanding of this method is often asked for in machine learning interviews, and knowing the derivation is a great opportunity to demonstrate the depth of your knowledge.
Based on the paper "Understanding the difficulty of training deep feedforward neural networks" ⁽¹⁾ by Xavier Glorot et al. (2010), we provide a detailed mathematical derivation in three steps: the forward pass equation, the backward pass equation, and the derivation of the weight matrix distribution. Xavier Glorot initialization is most often used with the tanh activation function, which fulfills all the assumptions required for the proof.
Notation
- A neural network with weight matrices Wⁱ and bias vectors bⁱ consists of a set of two successive transformations per layer: sⁱ = zⁱWⁱ + bⁱ and zⁱ⁺¹ = f(sⁱ)
- A layer has nⁱ units, hence zⁱ ∈ ℝⁿ⁽ⁱ⁾, Wⁱ ∈ ℝⁿ⁽ⁱ⁾ˣⁿ⁽ⁱ⁺¹⁾, and bⁱ ∈ ℝⁿ⁽ⁱ⁺¹⁾
- zⁱWⁱ + bⁱ has dimension (1 × nⁱ) × (nⁱ × nⁱ⁺¹) + (1 × nⁱ⁺¹) = 1 × nⁱ⁺¹
- f is an element-wise function, hence it does not affect the shape of a vector, and zⁱ⁺¹ = f(zⁱWⁱ + bⁱ) ∈ ℝⁿ⁽ⁱ⁺¹⁾
- For a neural network of depth n: z⁰ is the input layer and zⁿ is the output layer
- L is the loss function of the neural network
Assumptions
- Assumption 1: We assume that the activation function used for each layer is odd, with unit derivative at 0: f'(0) = 1. Recall that an odd function is defined by f(-x) = -f(x). A popular activation function to use with Glorot initialization is tanh, so we need to verify that it satisfies the assumption: tanh is odd since tanh(-x) = -tanh(x), and tanh'(x) = 1 - tanh²(x) ⇒ tanh'(0) = 1 - 0² = 1.
- Assumption 2: We assume that all inputs and all layers at initialization are iid, i.e. independent and identically distributed, and thus so are the weights and gradients.
- Assumption 3: We assume that inputs are normalized with zero mean, and that weights and biases are initialized from distributions centered at zero, i.e. 𝔼[z⁰] = 𝔼[Wⁱ] = 𝔼[bⁱ] = 0. From this and the linearity of f around zero, it follows that both zⁱ and sⁱ have zero expectation at initialization.
Motivation
The objective of this proof is to find the weight matrix distribution by determining Var[W] under two constraints:
- ∀i, Var[zⁱ] = Var[zⁱ⁺¹], i.e. the forward signal flows with constant variance
- ∀i, Var[∂L/∂sⁱ] = Var[∂L/∂sⁱ⁺¹], i.e. the backward signal flows with constant variance
With the goal of preventing exploding and vanishing gradients, the above two constraints guarantee that, at initialization, the variance of both activations and gradients is constant throughout the network, i.e. the signal gain is exactly one. By contrast, a gain above one is likely to lead to exploding gradients and divergence of the optimization, while a gain below one is likely to lead to vanishing gradients and stalled learning.
Math Proof: Xavier Glorot Initialization
I. Forward Pass
We are looking for Wⁱ such that the variance of each subsequent layer z is equal, i.e. Var[zⁱ] = Var[zⁱ⁺¹].
We know that zⁱ⁺¹ = f(zⁱWⁱ + bⁱ).
To simplify the upcoming calculation, we first apply the variance operator elementwise at index k on both sides of the equation. Please note that each output neuron depends on all the incoming neurons from the input layer. Hence, when taking element k of zⁱ⁺¹, the entire vector zⁱ is used in the computation. This is why z is indexed at k only on the left-hand side of the equation.
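As a sketch in LaTeX notation (the indexing convention below is ours, with W^i_{jk} denoting the entry in row j, column k of Wⁱ), the starting point is:

$$ \mathrm{Var}\left[z_k^{i+1}\right] = \mathrm{Var}\left[f\left(\sum_{j=1}^{n^i} z_j^i\, W_{jk}^i + b_k^i\right)\right] $$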
We now analyze in detail the steps behind this derivation:

- In the first step, following Assumption 1 stated earlier, given that f has a unit derivative at 0 and is odd, we can approximate f(x) ≈ x around 0. Then, zⁱWⁱ + bⁱ is assumed to be around 0 at initialization since Wⁱ and bⁱ are sampled from distributions centered at 0, and z⁰, the input vector to the neural network, is assumed to be normalized due to input normalization. Hence, each subsequent layer zⁱ will be 0 in expectation.
- In the second step, applying the additive property of variance under the independence of variables, i.e. Var[X+Y] = Var[X] + Var[Y] with X ⟂ Y, and the variance property of a constant c, i.e. Var[c] = 0, we know that Var[X+c] = Var[X]. Furthermore, for clarity, we write the vector and matrix multiplication as a sum over nⁱ elements given that zⁱ ∈ ℝⁿ⁽ⁱ⁾.

- In the third step, we use the assumption of independence z ⟂ W between input vector z and weight matrix W, which results from the fact that all variables are uncorrelated at initialization. Under independence, the variance of a sum is the sum of the variances.
- In the fourth step, analogously to the rule for the variance of a sum, the variance of a product of independent variables equals the product of their variances plus two cross-terms involving expectations and variances, i.e. Var[XY] = Var[X]Var[Y] + E[X]²Var[Y] + E[Y]²Var[X].

- In the fifth step, the equation simplifies nicely, because the terms involving expectations vanish given that both z and W have zero mean. This follows from the input-normalization assumption and from sampling W from a distribution centered at zero.
- In the sixth step, we conclude the final form by noting that each term of the sum is the same, since the entries of z and W are independent and identically distributed.
In summary, having broken down each step, here is the complete derivation one more time for review:
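A reconstruction of the full chain in LaTeX notation, following the six steps above (the approximation sign marks the linearization of f around 0, and bⁱ is treated as a constant at initialization):

$$
\begin{aligned}
\mathrm{Var}\left[z_k^{i+1}\right]
&= \mathrm{Var}\left[f\left(\sum_{j=1}^{n^i} z_j^i W_{jk}^i + b_k^i\right)\right]
\approx \mathrm{Var}\left[\sum_{j=1}^{n^i} z_j^i W_{jk}^i\right]
= \sum_{j=1}^{n^i} \mathrm{Var}\left[z_j^i W_{jk}^i\right] \\
&= \sum_{j=1}^{n^i} \left(\mathrm{Var}[z_j^i]\,\mathrm{Var}[W_{jk}^i] + \mathbb{E}[z_j^i]^2\,\mathrm{Var}[W_{jk}^i] + \mathbb{E}[W_{jk}^i]^2\,\mathrm{Var}[z_j^i]\right) \\
&= \sum_{j=1}^{n^i} \mathrm{Var}[z_j^i]\,\mathrm{Var}[W_{jk}^i]
= n^i\,\mathrm{Var}[z^i]\,\mathrm{Var}[W^i]
\end{aligned}
$$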

Finally, this brings us to the conclusion of the forward-pass proof: the variance of the weights Wⁱ of a layer is the inverse of the number of inputs nⁱ.
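In LaTeX notation, imposing the constraint Var[zⁱ⁺¹] = Var[zⁱ] on the last expression above gives:

$$ n^i\,\mathrm{Var}[W^i]\,\mathrm{Var}[z^i] = \mathrm{Var}[z^i] \;\Longrightarrow\; \mathrm{Var}[W^i] = \frac{1}{n^i} $$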

Interesting fact: the above proof also serves as a derivation of the LeCun initialization, originally introduced in "Efficient Backprop" ⁽²⁾ by LeCun et al. (1998).
II. Backward Pass
We are looking for Wⁱ such that Var[∂L/∂sⁱ] = Var[∂L/∂sⁱ⁺¹].
Here, sⁱ⁺¹ = f(sⁱ)Wⁱ⁺¹ + bⁱ⁺¹, with sⁱ ∈ ℝⁿ⁽ⁱ⁺¹⁾.
Before applying the variance operator, let us first calculate the derivative.
By chain rule:
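As a sketch in LaTeX notation, using sⁱ⁺¹ = f(sⁱ)Wⁱ⁺¹ + bⁱ⁺¹:

$$ \frac{\partial L}{\partial s^i} = \frac{\partial L}{\partial s^{i+1}}\,\frac{\partial s^{i+1}}{\partial f(s^i)}\,\frac{\partial f(s^i)}{\partial s^i} $$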

In matrix form:
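One way to write this product of Jacobians, consistent with the dimension check below (this formulation is our reconstruction):

$$ \frac{\partial L}{\partial s^i} = \frac{\partial L}{\partial s^{i+1}}\,\left(W^{i+1}\right)^{\top} \mathrm{diag}\!\left(f'(s^i)\right) $$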

Please take note of the dimensions of each matrix and vector on both sides of this equation: 1 × nⁱ⁺¹ = (1 × nⁱ⁺²) × (nⁱ⁺² × nⁱ⁺¹) × (nⁱ⁺¹ × nⁱ⁺¹).
In the next step, in order to simplify the calculation, we rewrite the same equation using elementwise notation. The gradient now becomes a derivative and the matrix of weights is truncated to its k-th column:
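In the indexing convention used above, the elementwise form reads:

$$ \frac{\partial L}{\partial s_k^i} = f'(s_k^i) \sum_{j=1}^{n^{i+2}} \frac{\partial L}{\partial s_j^{i+1}}\, W_{kj}^{i+1} $$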

Now we can apply variance on both sides.

- First, let us estimate the derivative f’. Since the expectation of sⁱ is 0 at initialization and f is assumed to be linear around 0 with f'(0) = 1, we can approximate f'(sⁱ) ≈ f'(0) = 1.
- Second, knowing that the variance of a sum of independent variables is equal to the sum of the variances, we can apply this rule to the variance on the right-hand side.

- Third, applying the rule for the variance of a product of independent variables, i.e. Var[XY] = Var[X]Var[Y] + E[X]²Var[Y] + E[Y]²Var[X], the equation expands to a sum of products of individual expectations and variances.

- Fourth, as both terms with expectations are equal to zero, the only remaining part is the sum of products of variances.
- Fifth, we conclude the final form by noting that each term of the sum is the same, since the partial derivatives of L with respect to sⁱ⁺¹ and the entries of W are independent and identically distributed.
In summary, here are all the steps one more time for your review:
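A reconstruction of the chain in LaTeX notation, following the steps above (the approximation sign marks f'(sⁱ) ≈ f'(0) = 1):

$$
\begin{aligned}
\mathrm{Var}\left[\frac{\partial L}{\partial s_k^i}\right]
&= \mathrm{Var}\left[f'(s_k^i)\sum_{j=1}^{n^{i+2}} \frac{\partial L}{\partial s_j^{i+1}}\, W_{kj}^{i+1}\right]
\approx \mathrm{Var}\left[\sum_{j=1}^{n^{i+2}} \frac{\partial L}{\partial s_j^{i+1}}\, W_{kj}^{i+1}\right]
= \sum_{j=1}^{n^{i+2}} \mathrm{Var}\left[\frac{\partial L}{\partial s_j^{i+1}}\, W_{kj}^{i+1}\right] \\
&= \sum_{j=1}^{n^{i+2}} \mathrm{Var}\left[\frac{\partial L}{\partial s_j^{i+1}}\right] \mathrm{Var}\left[W_{kj}^{i+1}\right]
= n^{i+2}\,\mathrm{Var}\left[\frac{\partial L}{\partial s^{i+1}}\right] \mathrm{Var}\left[W^{i+1}\right]
\end{aligned}
$$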

Finally, we obtain the following result:
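Imposing Var[∂L/∂sⁱ] = Var[∂L/∂sⁱ⁺¹] on the last expression, and re-indexing from Wⁱ⁺¹ to Wⁱ, gives the variance as the inverse of the number of outputs of the layer:

$$ n^{i+2}\,\mathrm{Var}[W^{i+1}] = 1 \;\Longrightarrow\; \mathrm{Var}[W^{i+1}] = \frac{1}{n^{i+2}}, \qquad \text{i.e.} \quad \mathrm{Var}[W^i] = \frac{1}{n^{i+1}} $$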

III. Weight Distribution
Following the above demonstrations for the forward and backward passes, we obtained two results:
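In LaTeX notation:

$$ \mathrm{Var}[W^i] = \frac{1}{n^i} \quad \text{(forward pass)}, \qquad \mathrm{Var}[W^i] = \frac{1}{n^{i+1}} \quad \text{(backward pass)} $$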

The paper’s authors propose to average the two variances for the final result. The main justification is that neural networks often have successive layers of identical width; when nⁱ = nⁱ⁺¹, the averaged result satisfies both of the previous equations exactly, and otherwise it is a compromise between them.
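This yields the Glorot (Xavier) variance:

$$ \mathrm{Var}[W^i] = \frac{2}{n^i + n^{i+1}} $$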

In practice, now that the variance of the distribution is known, the weights are initialized with either a normal distribution N(0, 𝜎²) or a uniform distribution U(-a, a). As mentioned above in Assumption 3, it is fundamental that the chosen distribution is centered at 0.
- For Normal distribution N(0, 𝜎²)
If X ~ N(0, 𝜎²), then Var[X] = 𝜎², thus the variance and standard deviation can be written as:
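That is, using the averaged variance from above:

$$ \sigma^2 = \frac{2}{n^i + n^{i+1}}, \qquad \sigma = \sqrt{\frac{2}{n^i + n^{i+1}}} $$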

We can therefore conclude that Wⁱ follows a normal distribution with coefficients:
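In LaTeX notation:

$$ W^i \sim \mathcal{N}\!\left(0,\; \frac{2}{n^i + n^{i+1}}\right) $$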

As a reminder, nⁱ is the number of inputs of the layer and nⁱ⁺¹ is the number of outputs of the layer.
- For Uniform distribution U(-a, a)
If X ~ U(-a, a), then using the formula below for the variance of a uniformly distributed random variable, we can find the bound a:
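A sketch of that calculation, using Var[U(-a, a)] = (2a)²/12 and the averaged variance from above:

$$ \mathrm{Var}[X] = \frac{\left(a - (-a)\right)^2}{12} = \frac{a^2}{3} = \frac{2}{n^i + n^{i+1}} \;\Longrightarrow\; a = \sqrt{\frac{6}{n^i + n^{i+1}}} $$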

Finally, we can conclude that Wⁱ follows a uniform distribution with coefficients:
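In LaTeX notation:

$$ W^i \sim \mathcal{U}\!\left(-\sqrt{\frac{6}{n^i + n^{i+1}}},\; +\sqrt{\frac{6}{n^i + n^{i+1}}}\right) $$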

Conclusion
In this article, we provided a detailed walkthrough of the individual steps to find the distribution of weight matrices according to the Xavier Glorot initialization.
Given an odd activation function with unit derivative at 0, such as tanh, we can follow this methodology to guarantee optimal forward and backward propagation of the signal at initialization, i.e. to keep the variance constant throughout both the forward and the backward pass.
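To make the result concrete, here is a minimal NumPy sketch (the helper names glorot_normal and glorot_uniform are our own; deep learning frameworks ship equivalent built-in initializers):

```python
import numpy as np

def glorot_normal(fan_in, fan_out, rng=None):
    # Sample a (fan_in, fan_out) weight matrix with Var[W] = 2 / (fan_in + fan_out).
    rng = np.random.default_rng() if rng is None else rng
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(loc=0.0, scale=std, size=(fan_in, fan_out))

def glorot_uniform(fan_in, fan_out, rng=None):
    # Sample from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)), which has the same variance a^2 / 3.
    rng = np.random.default_rng() if rng is None else rng
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(low=-a, high=a, size=(fan_in, fan_out))

# Example: a layer with n^i = 256 inputs and n^(i+1) = 128 outputs.
W = glorot_uniform(256, 128)
print(W.shape, W.var(), 2.0 / (256 + 128))  # the empirical variance should be close to 2 / (n^i + n^(i+1))
```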
Citations
(1) Understanding the difficulty of training deep feedforward neural networks, Glorot et al. (2010)
(2) Efficient Backprop, LeCun et al. (1998)
Source: All the above equations and images are my own.