Finding the Gradient of a Vector Function

Part 3 of Step by Step: The Math Behind Neural Networks

Chi-Feng Wang
Towards Data Science


In Part 1, we were given a problem: to calculate the gradient of this loss function:

Image 1: Loss function

To find the gradient, we have to find the derivative of the function. In Part 2, we learned how to calculate the partial derivative of a function with respect to each variable. However, most of the variables in this loss function are vectors. Being able to take the partial derivative with respect to vector variables is especially important because neural networks deal with large quantities of data; vector and matrix operations are a compact way to represent all of that data. How, exactly, can you find the gradient of a vector function?

Gradient of a Scalar Function

Say that we have a function, f(x,y) = 3x²y. Our partial derivatives are:

Image 2: Partial derivatives

If we organize these partials into a horizontal vector, we get the gradient of f(x,y), or ∇ f(x,y):

Image 3: Gradient of f(x,y)

6yx is the change in f(x,y) with respect to a change in x, while 3x² is the change in f(x,y) with respect to a change in y.

What happens when we have two functions? Let’s add another function, g(x,y) = 2x+y⁸. The partial derivatives are:

Image 4: Partials for g(x,y)

So the gradient of g(x,y) is:

Image 5: Gradient of g(x,y)
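As a quick sanity check (a sketch of my own, using NumPy and central finite differences rather than anything from the original article), we can confirm both gradients numerically at a sample point:

```python
import numpy as np

def f(x, y):
    return 3 * x**2 * y

def g(x, y):
    return 2 * x + y**8

def partial(func, x, y, wrt, h=1e-5):
    # Central finite-difference approximation of a partial derivative.
    if wrt == "x":
        return (func(x + h, y) - func(x - h, y)) / (2 * h)
    return (func(x, y + h) - func(x, y - h)) / (2 * h)

x, y = 2.0, 3.0
print(partial(f, x, y, "x"), 6 * x * y)    # ∂f/∂x = 6xy
print(partial(f, x, y, "y"), 3 * x**2)     # ∂f/∂y = 3x²
print(partial(g, x, y, "x"), 2)            # ∂g/∂x = 2
print(partial(g, x, y, "y"), 8 * y**7)     # ∂g/∂y = 8y⁷
```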

Representing Functions

When we have multiple functions with multiple parameters, it’s often useful to represent them in a simpler way. We can combine a function’s multiple parameters into a single vector argument, x, that looks as follows:

Image 6: Vector x

Therefore, f(x,y,z) will become f(x₁,x₂,x₃), which becomes f(x).

We can also combine multiple functions into a vector, like so:

Image 7: Vector y

Now, y = f(x), where f(x) is the vector [f₁(x), f₂(x), f₃(x), …, fₙ(x)].

For our previous example with two functions, f(x,y) ⇒ f(x) and g(x,y) ⇒ g(x). Here, vector x = [x₁, x₂], where x₁=x, and x₂=y. To simplify it even further, we can combine our functions: [f(x),g(x)] = [f₁(x), f₂(x)] = f(x) = y.

Image 8: Equations within vector function y

Often, the number of functions and the number of variables will be the same, so that a solution exists for each variable.
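As a concrete illustration (my own sketch, not something from the original article), the two functions above can be packed into a single vector-valued function that takes the vector x = [x₁, x₂] and returns y = [f₁(x), f₂(x)]:

```python
import numpy as np

def f_vec(x):
    # x is the vector [x1, x2]; returns y = [f1(x), f2(x)],
    # where f1(x) = 3·x1²·x2 and f2(x) = 2·x1 + x2⁸.
    x1, x2 = x
    return np.array([3 * x1**2 * x2, 2 * x1 + x2**8])

print(f_vec(np.array([2.0, 3.0])))  # [  36. 6565.]
```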

Gradient of a Vector Function

Now that we have two functions, how can we find the gradient of both of them? If we organize their gradients into a single matrix, we move from vector calculus into matrix calculus. This matrix, an organization of the gradients of multiple functions with multiple variables, is known as the Jacobian matrix.

Image 9: The Jacobian

There are multiple ways of representing the Jacobian. This layout, where we stack the gradients vertically, is known as the numerator layout, but other papers use the denominator layout, which is simply its transpose (flipped across the diagonal):

Image 10: Denominator layout of the Jacobian
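To make the numerator layout concrete, here is a sketch (my own, using NumPy and central finite differences) that builds the Jacobian of the vector function f(x) = [3x₁²x₂, 2x₁ + x₂⁸] from above: row i holds the gradient of fᵢ, so the result is a 2×2 matrix.

```python
import numpy as np

def f_vec(x):
    x1, x2 = x
    return np.array([3 * x1**2 * x2, 2 * x1 + x2**8])

def jacobian(func, x, h=1e-5):
    # Numerator layout: J[i, j] = ∂f_i/∂x_j, via central differences.
    out = func(x)
    J = np.zeros((out.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h
        J[:, j] = (func(x + step) - func(x - step)) / (2 * h)
    return J

x = np.array([2.0, 3.0])
print(jacobian(f_vec, x))
# Analytic Jacobian: [[6·x1·x2, 3·x1²], [2, 8·x2⁷]] = [[36, 12], [2, 17496]]
```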

Gradient of the Identity Function

Let’s take the identity function, y = f(x) = x, where fᵢ(x) = xᵢ, and find its gradient:

Image 11: Identity function

Just as we created our previous Jacobian, we can find the gradients of each scalar function and stack them up vertically to create the Jacobian of the identity function:

Image 12: Jacobian of the identity function

Since it’s an identity function, f₁(x) = x₁, f₂(x) = x₂, and so on. Therefore,

Image 13: Jacobian of the identity function

The partial derivative of a function with respect to a variable that’s not in the function is zero. For example, the partial derivative of 2x² with respect to y is 0. In other words,

Image 14: The partial derivative of a function with respect to a variable that’s not in the function is zero

Therefore, everything not on the diagonal of the Jacobian becomes zero. Meanwhile, the partial derivative of any variable with respect to itself is 1. For example, the partial derivative of x with respect to x is 1. Therefore, the Jacobian becomes:

Image 15: Jacobian of the identity function
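A quick numerical check (again my own sketch with NumPy finite differences): the Jacobian of f(x) = x really does come out as the identity matrix.

```python
import numpy as np

def jacobian(func, x, h=1e-5):
    # Numerator layout: J[i, j] = ∂f_i/∂x_j, via central differences.
    out = func(x)
    J = np.zeros((out.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h
        J[:, j] = (func(x + step) - func(x - step)) / (2 * h)
    return J

x = np.array([1.0, 2.0, 3.0])
print(np.round(jacobian(lambda v: v, x)))  # same as np.eye(3)
```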

Gradient of Element-Wise Vector Function Combinations

Element-wise binary operators are operations (such as addition, w+x, or comparison, w>x, which returns a vector of ones and zeros) that apply an operator item by item: the first items of both vectors produce the first item of the output, the second items produce the second item of the output…and so forth.

The original paper represents element-wise binary operations with this notation:

Image 16: Element-wise binary operation with f(x) and g(x)

Here, the ◯ means any element-wise operator (such as +), and not a composition of functions.

So how do you find the gradient of an element-wise operation of two vectors?

Since we have two sets of functions, we need two Jacobians, one representing the gradient with respect to x and one with respect to w:

Image 17: Jacobian with respect to w and x

Most arithmetic operations we’ll need are simple ones, so f(w) is often just the vector w itself; in other words, fᵢ(wᵢ) = wᵢ. For example, the operation w+x fits this category, as it can be represented as f(w)+g(x) where fᵢ(wᵢ) + gᵢ(xᵢ) = wᵢ + xᵢ.

Under this condition, every element in the two Jacobians simplifies to:

Image 18: Elements in the Jacobian

On the diagonal, i=j, so a value exists for the partial derivative. Off the diagonal, however, i≠j, so the partial derivatives become zero:

Image 19: Diagonal Jacobian

We can represent this more succinctly as:

Image 20: The Jacobian with respect to w and x

Let’s try to find the gradient of the function w+x. We know that everything off the diagonal is 0. The values of the partials on the diagonal with respect to w and x are:

Image 21: Partials with respect to w and x

So both Jacobians have a diagonal of 1. This looks familiar…it’s the identity matrix!

Let’s try it with multiplication: w*x. The values of the partials on the diagonal with respect to w and x are:

Image 22: Partials with respect to w and x

Therefore, the gradient with respect to w of w*x is diag(x), while the gradient with respect to x of w*x is diag(w).

Applying the same steps for subtraction and division, we can sum it all up:

Image 23: Gradients of common element-wise binary operations
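The table above can be verified numerically. Here is a sketch (mine, with NumPy finite differences) confirming that the Jacobians of w+x are identity matrices, and that the Jacobians of w*x are diag(x) and diag(w):

```python
import numpy as np

def jacobian(func, v, h=1e-5):
    # Numerator layout: J[i, j] = ∂f_i/∂v_j, via central differences.
    out = func(v)
    J = np.zeros((out.size, v.size))
    for j in range(v.size):
        step = np.zeros_like(v)
        step[j] = h
        J[:, j] = (func(v + step) - func(v - step)) / (2 * h)
    return J

w = np.array([1.0, 2.0, 3.0])
x = np.array([4.0, 5.0, 6.0])

print(np.round(jacobian(lambda w_: w_ + x, w)))  # identity matrix
print(np.round(jacobian(lambda w_: w_ * x, w)))  # diag([4, 5, 6]) = diag(x)
print(np.round(jacobian(lambda x_: w * x_, x)))  # diag([1, 2, 3]) = diag(w)
```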

Gradient of Vector Sums

One of the most common operations in deep learning is the summation operation. How can we find the gradient of the function y=sum(x)?

y=sum(x) can also be represented as:

Image 24: y=sum(x)

Therefore, the gradient can be represented as:

Image 25: Gradient of y=sum(x)

And since the partial derivative of a function with respect to a variable that’s not in the function is zero, it can be further simplified as:

Image 26: Gradient of y=sum(x)

Note that the result is a horizontal vector.
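A minimal sketch (mine, using NumPy finite differences) confirms that every partial of sum(x) is 1, giving a horizontal vector of ones:

```python
import numpy as np

def grad_sum(x, h=1e-5):
    # ∂ sum(x)/∂x_i for every i, via central differences.
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        g[i] = (np.sum(x + step) - np.sum(x - step)) / (2 * h)
    return g

print(np.round(grad_sum(np.array([1.0, 2.0, 3.0]))))  # [1. 1. 1.]
```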

What about the gradient of y=sum(xz)? The only difference is that we multiply every partial by a constant, z:

Image 27: Gradient of y=sum(xz) with respect to x

While that is the derivative with respect to x, the derivative with respect to the scalar z is simply a number:

Image 28: Gradient of y=sum(xz) with respect to z
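Here is a small sketch (my own, with NumPy finite differences) checking both results: the gradient with respect to x is a vector filled with z, while the derivative with respect to z is the single number sum(x).

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
z = 5.0
h = 1e-5

def y(x, z):
    return np.sum(x * z)

# ∂y/∂x_i = z for every i (a horizontal vector filled with z).
grad_x = np.zeros_like(x)
for i in range(x.size):
    step = np.zeros_like(x)
    step[i] = h
    grad_x[i] = (y(x + step, z) - y(x - step, z)) / (2 * h)
print(np.round(grad_x))                # [5. 5. 5.]

# ∂y/∂z = sum(x), a single number.
grad_z = (y(x, z + h) - y(x, z - h)) / (2 * h)
print(np.round(grad_z, 6), np.sum(x))  # 6.0 6.0
```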

Gradient of Chain Rule Vector Function Combinations

In Part 2, we learned about the multivariable chain rule. However, it only works for scalars. Let’s see how we can integrate it into vector calculations!

Let us take a vector function, y = f(x), and find its gradient. Let us define the function as:

Image 29: y = f(x)

Both f₁(x) and f₂(x) are composite functions. Let us introduce intermediate variables for f₁(x) and f₂(x), and rewrite our function:

Image 30: y = f(g(x))

Now, we can use our multivariable chain rule to compute the derivative of vector y. Simply compute the derivatives of f₁(x) and f₂(x), and place them one above the other:

Image 31: Gradient of y = f(g(x))

Voila! We have our gradient. However, we’ve come to our solution with scalar rules, merely grouping the numbers together into a vector. Is there a way to represent the multivariable chain rule for vectors?

Right now, our gradient is computed with:

Image 32: Gradient of y = f(g(x))

Notice that the first term of the gradients of both f₁(x) and f₂(x) include the partial of g₁ over x, and the second term of the gradients of both f₁(x) and f₂(x) include the partial of g₂ over x. This is just like matrix multiplication! We can therefore represent it as:

Image 33: Vector representation of the gradient of y = f(g(x))
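To see the matrix-multiplication form in action, here is a sketch of my own with a made-up pair of composite functions (f₁(g) = sin(g₁), f₂(g) = g₁ + g₂, with intermediates g₁ = x² and g₂ = 3x; these are illustrative choices, not the functions in the images). Multiplying ∂f/∂g by ∂g/∂x reproduces the derivatives computed directly:

```python
import numpy as np

x = 2.0

# Hypothetical intermediate variables: g1 = x², g2 = 3x.
g1, g2 = x**2, 3 * x

# ∂f/∂g for f1(g) = sin(g1), f2(g) = g1 + g2  (numerator layout, 2×2).
df_dg = np.array([[np.cos(g1), 0.0],
                  [1.0,        1.0]])

# ∂g/∂x (a 2×1 column, since x is a single scalar parameter).
dg_dx = np.array([[2 * x],
                  [3.0]])

# Vector chain rule: ∂y/∂x = (∂f/∂g) @ (∂g/∂x).
print(df_dg @ dg_dx)

# Direct derivatives for comparison:
# d/dx sin(x²)   = 2x·cos(x²)
# d/dx (x² + 3x) = 2x + 3
print(np.array([[2 * x * np.cos(x**2)],
                [2 * x + 3]]))
```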

Let’s test out this new representation of the vector chain rule:

Image 34: Vector chain rule

We get the same answer as the scalar approach! If instead of a single parameter x we have a vector parameter x, we just have to alter our rule a little to get the complete vector chain rule:

Image 35: Vector chain rule

In other words:

Image 36: Vector chain rule

In the example from Image 29 above, f is purely a function of g; that is, fᵢ is a function of gᵢ but not gⱼ (each function fᵢ matches with exactly one function gᵢ). In this case, everything off the diagonal becomes zero, and:

Image 37: Special case of vector chain rule
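A short sketch of this special case (my own example, assuming each fᵢ depends only on gᵢ and each gᵢ only on xᵢ): both Jacobians are diagonal, so their product is a diagonal matrix whose entries are the element-wise products (∂fᵢ/∂gᵢ)(∂gᵢ/∂xᵢ).

```python
import numpy as np

x = np.array([1.0, 2.0])
g = x**2                    # hypothetical g_i(x) = x_i² (each g_i depends only on x_i)
# f_i(g) = sin(g_i)         (each f_i depends only on g_i)

df_dg = np.diag(np.cos(g))  # diagonal Jacobian ∂f/∂g
dg_dx = np.diag(2 * x)      # diagonal Jacobian ∂g/∂x

# The chain-rule product is diagonal, with entries (∂f_i/∂g_i)·(∂g_i/∂x_i).
print(df_dg @ dg_dx)
print(np.diag(np.cos(x**2) * 2 * x))  # the same matrix, built element-wise
```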

Now we have all the pieces we need to find the gradient of the neural network we started our series with:

Image 38: Cost function

Check out Part 4 to find out how to compute its derivative!

If you haven’t already, read Parts 1 and 2.

Read Part 4 for the grand finale!

Download the original paper here.

If you like this article, don’t forget to leave some claps! Do leave a comment below if you have any questions or suggestions :)
