Finding the Gradient of a Vector Function
Part 3 of Step by Step: The Math Behind Neural Networks
In Part 1, we were given a problem: to calculate the gradient of this loss function:
To find the gradient, we have to find the derivative of the function. In Part 2, we learned how to calculate the partial derivative of a function with respect to each variable. However, most of the variables in this loss function are vectors. Being able to take partial derivatives with respect to vector variables is especially important because neural networks deal with large quantities of data, and vector and matrix operations are a compact way to represent operations over all of it. How, exactly, can you find the gradient of a vector function?
Gradient of a Scalar Function
Say that we have a function, f(x,y) = 3x²y. Our partial derivatives are:
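Working each one out (treating the other variable as a constant):

```latex
\frac{\partial f}{\partial x} = \frac{\partial}{\partial x}\left(3x^2y\right) = 6yx
\qquad
\frac{\partial f}{\partial y} = \frac{\partial}{\partial y}\left(3x^2y\right) = 3x^2
```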
If we organize these partials into a horizontal vector, we get the gradient of f(x,y), or ∇ f(x,y):
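In symbols:

```latex
\nabla f(x,y) = \left[\frac{\partial f}{\partial x},\; \frac{\partial f}{\partial y}\right] = \left[\,6yx,\; 3x^2\,\right]
```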
6yx is the change in f(x,y) with respect to a change in x, while 3x² is the change in f(x,y) with respect to a change in y.
What happens when we have two functions? Let’s add another function, g(x,y) = 2x+y⁸. The partial derivatives are:
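Differentiating term by term:

```latex
\frac{\partial g}{\partial x} = 2 \qquad \frac{\partial g}{\partial y} = 8y^7
```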
So the gradient of g(x,y) is:
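Organizing the partials into a horizontal vector, as before:

```latex
\nabla g(x,y) = \left[\,2,\; 8y^7\,\right]
```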
Representing Functions
When we have multiple functions with multiple parameters, it’s often useful to represent them in a simpler way. We can combine the parameters of a function into a single vector argument, x, that looks as follows:
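Conventionally, x collects the n parameters as a column vector:

```latex
\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
```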
Therefore, f(x,y,z) will become f(x₁,x₂,x₃) which becomes f(x).
We can also combine multiple functions into a vector, like so:
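Stacking the functions vertically:

```latex
\mathbf{y} = \mathbf{f}(\mathbf{x}) = \begin{bmatrix} f_1(\mathbf{x}) \\ f_2(\mathbf{x}) \\ \vdots \\ f_n(\mathbf{x}) \end{bmatrix}
```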
Now, y = f(x), where f(x) is the vector [f₁(x), f₂(x), f₃(x), …, fₙ(x)].
For our previous example with two functions, f(x,y) ⇒ f(x) and g(x,y) ⇒ g(x). Here, vector x = [x₁, x₂], where x₁=x, and x₂=y. To simplify it even further, we can combine our functions: [f(x),g(x)] = [f₁(x), f₂(x)] = f(x) = y.
Often, the number of functions and the number of variables will be the same, so that a solution exists for each variable.
Gradient of a Vector Function
Now that we have two functions, how can we find the gradient of both functions? If we organize both of their gradients into a single matrix, we move from vector calculus into matrix calculus. This matrix, an organization of the gradients of multiple functions with multiple variables, is known as the Jacobian matrix.
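Using our two example functions, with x₁ = x and x₂ = y, stacking ∇f on top of ∇g gives:

```latex
J = \begin{bmatrix} \nabla f(\mathbf{x}) \\ \nabla g(\mathbf{x}) \end{bmatrix}
  = \begin{bmatrix}
      \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} \\
      \frac{\partial g}{\partial x_1} & \frac{\partial g}{\partial x_2}
    \end{bmatrix}
  = \begin{bmatrix} 6x_2 x_1 & 3x_1^2 \\ 2 & 8x_2^7 \end{bmatrix}
```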
There are multiple ways of representing the Jacobian. This layout, where we stack the gradients vertically, is known as the numerator layout, but other papers will use the denominator layout, which is simply its transpose:
Gradient of the Identity Function
Let’s take the identity function, y = f(x) = x, where fᵢ(x) = xᵢ, and find its gradient:
Just as we created our previous Jacobian, we can find the gradients of each scalar function and stack them up vertically to create the Jacobian of the identity function:
Since it’s an identity function, f₁(x) = x₁, f₂(x) = x₂, and so on. Therefore,
The partial derivative of a function with respect to a variable that’s not in the function is zero. For example, the partial derivative of 2x² with respect to y is 0. In other words,
Therefore, everything not on the diagonal of the Jacobian becomes zero. Meanwhile, the partial derivative of any variable with respect to itself is 1. For example, the partial derivative of x with respect to x is 1. Therefore, the Jacobian becomes:
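Ones on the diagonal, zeros everywhere else:

```latex
\frac{\partial \mathbf{y}}{\partial \mathbf{x}} =
\begin{bmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{bmatrix} = I
```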
Gradient of Element-Wise Vector Function Combinations
Element-wise binary operators are operations (such as addition, w+x, or comparison, w>x, which returns a vector of ones and zeros) that apply an operator element by element: the first items of both vectors produce the first item of the output, the second items produce the second item of the output, and so forth.
This paper represents element-wise binary operations with this notation:
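In that notation, an element-wise combination of two vector functions is written:

```latex
\mathbf{y} = \mathbf{f}(\mathbf{w}) \bigcirc \mathbf{g}(\mathbf{x})
```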
Here, the ◯ means any element-wise operator (such as +), and not a composition of functions.
So how do you find the gradient of an element-wise operation of two vectors?
Since we have two sets of functions, we need two Jacobians, one representing the gradient with respect to x and one with respect to w:
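Each Jacobian holds the partial of every output element with respect to every element of the corresponding input vector. For the Jacobian with respect to w:

```latex
\frac{\partial \mathbf{y}}{\partial \mathbf{w}} =
\begin{bmatrix}
\frac{\partial}{\partial w_1}\bigl(f_1(\mathbf{w}) \bigcirc g_1(\mathbf{x})\bigr) & \cdots & \frac{\partial}{\partial w_n}\bigl(f_1(\mathbf{w}) \bigcirc g_1(\mathbf{x})\bigr) \\
\vdots & \ddots & \vdots \\
\frac{\partial}{\partial w_1}\bigl(f_n(\mathbf{w}) \bigcirc g_n(\mathbf{x})\bigr) & \cdots & \frac{\partial}{\partial w_n}\bigl(f_n(\mathbf{w}) \bigcirc g_n(\mathbf{x})\bigr)
\end{bmatrix}
```

The Jacobian with respect to x has the same shape, with partials taken with respect to x₁ through xₙ instead.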
Most arithmetic operations we’ll need are simple ones, so f(w) is often just the vector w; in other words, fᵢ(wᵢ) = wᵢ. For example, the operation w+x fits this category, as it can be represented as f(w)+g(x) where fᵢ(wᵢ) + gᵢ(xᵢ) = wᵢ + xᵢ.
Under this condition, every element in the two Jacobians simplifies to:
On the diagonal, i=j, so a value exists for the partial derivative. Off the diagonal, however, i≠j, so the partial derivatives become zero:
We can represent this more succinctly as:
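Using diag(·) for a matrix with the given elements on its diagonal and zeros elsewhere:

```latex
\frac{\partial \mathbf{y}}{\partial \mathbf{w}} = \operatorname{diag}\!\left(\frac{\partial}{\partial w_i}\bigl(f_i(w_i) \bigcirc g_i(x_i)\bigr)\right)
\qquad
\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \operatorname{diag}\!\left(\frac{\partial}{\partial x_i}\bigl(f_i(w_i) \bigcirc g_i(x_i)\bigr)\right)
```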
Let’s try to find the gradient of the function w+x. We know that everything off the diagonal is 0. The values of the partials on the diagonal with respect to w and x are:
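For addition, each diagonal entry is:

```latex
\frac{\partial (w_i + x_i)}{\partial w_i} = 1 \qquad \frac{\partial (w_i + x_i)}{\partial x_i} = 1
```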
So both Jacobians have a diagonal of 1. This looks familiar…it’s the identity matrix!
Let’s try it with multiplication: w*x. The values of the partials on the diagonal with respect to w and x are:
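Treating the other factor as a constant:

```latex
\frac{\partial (w_i \times x_i)}{\partial w_i} = x_i \qquad \frac{\partial (w_i \times x_i)}{\partial x_i} = w_i
```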
Therefore, the gradient with respect to w of w*x is diag(x), while the gradient with respect to x of w*x is diag(w).
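We can sanity-check these two results numerically. The sketch below (the helper `finite_difference_jacobian` is written for this illustration, not taken from any library) approximates each Jacobian with central differences and compares it against diag(x) and diag(w):

```python
import numpy as np

# Numerically check that for element-wise y = w * x, the Jacobian
# with respect to w is diag(x) and with respect to x is diag(w).

def finite_difference_jacobian(f, v, eps=1e-6):
    """Approximate the Jacobian of f at v with central differences."""
    out_dim = f(v).size
    jac = np.zeros((out_dim, v.size))
    for j in range(v.size):
        step = np.zeros(v.size)
        step[j] = eps
        jac[:, j] = (f(v + step) - f(v - step)) / (2 * eps)
    return jac

w = np.array([2.0, 3.0, 5.0])
x = np.array([7.0, 11.0, 13.0])

jac_w = finite_difference_jacobian(lambda v: v * x, w)  # d(w*x)/dw
jac_x = finite_difference_jacobian(lambda v: w * v, x)  # d(w*x)/dx

print(np.allclose(jac_w, np.diag(x)))  # prints True
print(np.allclose(jac_x, np.diag(w)))  # prints True
```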
Applying the same steps for subtraction and division, we can sum it all up:
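For the four basic element-wise operations (writing ⊗ and ⊘ for element-wise multiplication and division), the diagonal Jacobians come out to:

```latex
\begin{aligned}
\frac{\partial (\mathbf{w}+\mathbf{x})}{\partial \mathbf{w}} &= I &
\frac{\partial (\mathbf{w}+\mathbf{x})}{\partial \mathbf{x}} &= I \\
\frac{\partial (\mathbf{w}-\mathbf{x})}{\partial \mathbf{w}} &= I &
\frac{\partial (\mathbf{w}-\mathbf{x})}{\partial \mathbf{x}} &= -I \\
\frac{\partial (\mathbf{w}\otimes\mathbf{x})}{\partial \mathbf{w}} &= \operatorname{diag}(\mathbf{x}) &
\frac{\partial (\mathbf{w}\otimes\mathbf{x})}{\partial \mathbf{x}} &= \operatorname{diag}(\mathbf{w}) \\
\frac{\partial (\mathbf{w}\oslash\mathbf{x})}{\partial \mathbf{w}} &= \operatorname{diag}\!\left(\tfrac{1}{x_i}\right) &
\frac{\partial (\mathbf{w}\oslash\mathbf{x})}{\partial \mathbf{x}} &= \operatorname{diag}\!\left(-\tfrac{w_i}{x_i^2}\right)
\end{aligned}
```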
Gradient of Vector Sums
One of the most common operations in deep learning is the summation operation. How can we find the gradient of the function y=sum(x)?
y=sum(x) can also be represented as:
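That is:

```latex
y = \operatorname{sum}(\mathbf{x}) = \sum_{i=1}^{n} x_i
```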
Therefore, the gradient can be represented as:
And since the partial derivative of a function with respect to a variable that’s not in the function is zero, it can be further simplified as:
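Each partial keeps only the one term that contains its own variable, and that term differentiates to 1:

```latex
\nabla y = \left[\frac{\partial y}{\partial x_1},\ \frac{\partial y}{\partial x_2},\ \ldots,\ \frac{\partial y}{\partial x_n}\right] = [\,1,\ 1,\ \ldots,\ 1\,]
```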
Note that the result is a horizontal vector.
What about the gradient of y=sum(xz)? The only difference is that we multiply every partial with a constant, z:
While that is the derivative with respect to x, the derivative with respect to the scalar z is simply a number:
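Side by side:

```latex
\frac{\partial y}{\partial \mathbf{x}} = [\,z,\ z,\ \ldots,\ z\,]
\qquad
\frac{\partial y}{\partial z} = \sum_{i=1}^{n} x_i
```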
Gradient of Chain Rule Vector Function Combinations
In Part 2, we learned about the multivariable chain rule. However, it only works for scalars. Let’s see how we can extend it to vector calculations!
Let us take a vector function, y = f(x), and find its gradient. Let us define the function as:
Both f₁(x) and f₂(x) are composite functions. Let us introduce intermediate variables for f₁(x) and f₂(x), and rewrite our function:
Now, we can use our multivariable chain rule to compute the derivative of vector y. Simply compute the derivative of f₁(x) and f₂(x), and place them one above another:
Voila! We have our gradient. However, we’ve come to our solution with scalar rules, merely grouping the numbers together into a vector. Is there a way to represent the multivariable chain rule for vectors?
Right now, our gradient is computed with:
Notice that the first term of the gradients of both f₁(x) and f₂(x) include the partial of g₁ over x, and the second term of the gradients of both f₁(x) and f₂(x) include the partial of g₂ over x. This is just like matrix multiplication! We can therefore represent it as:
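With intermediate variables g₁ and g₂, the two stacked gradients factor into a product of two matrices:

```latex
\frac{\partial \mathbf{y}}{\partial x} =
\begin{bmatrix}
\frac{\partial f_1}{\partial g_1} & \frac{\partial f_1}{\partial g_2} \\
\frac{\partial f_2}{\partial g_1} & \frac{\partial f_2}{\partial g_2}
\end{bmatrix}
\begin{bmatrix}
\frac{\partial g_1}{\partial x} \\
\frac{\partial g_2}{\partial x}
\end{bmatrix}
```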
Let’s test this new representation of the vector chain rule:
We get the same answer as the scalar approach! If instead of a single parameter x we have a vector parameter x, we just have to alter our rule a little to get the complete vector chain rule:
In other words:
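Each factor is itself a Jacobian: ∂f/∂g holds the partials of every fᵢ with respect to every gⱼ, and ∂g/∂x holds the partials of every gᵢ with respect to every xⱼ:

```latex
\frac{\partial \mathbf{y}}{\partial \mathbf{x}} =
\frac{\partial \mathbf{f}}{\partial \mathbf{g}}\,
\frac{\partial \mathbf{g}}{\partial \mathbf{x}} =
\begin{bmatrix}
\frac{\partial f_1}{\partial g_1} & \cdots & \frac{\partial f_1}{\partial g_k} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial g_1} & \cdots & \frac{\partial f_m}{\partial g_k}
\end{bmatrix}
\begin{bmatrix}
\frac{\partial g_1}{\partial x_1} & \cdots & \frac{\partial g_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial g_k}{\partial x_1} & \cdots & \frac{\partial g_k}{\partial x_n}
\end{bmatrix}
```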
In our example above, f is purely a function of g; that is, fᵢ is a function of gᵢ but not of gⱼ (each fᵢ matches with exactly one gᵢ). In this case, everything off the diagonal becomes zero, and:
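Both Jacobians are diagonal, so their product is a diagonal matrix of products of scalar derivatives:

```latex
\frac{\partial \mathbf{y}}{\partial \mathbf{x}} =
\operatorname{diag}\!\left(\frac{\partial f_i}{\partial g_i}\right)
\operatorname{diag}\!\left(\frac{\partial g_i}{\partial x_i}\right) =
\operatorname{diag}\!\left(\frac{\partial f_i}{\partial g_i}\,\frac{\partial g_i}{\partial x_i}\right)
```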
Now we have all the pieces we need to find the gradient of the neural network we started our series with:
Check out Part 4 to find out how to compute its derivative!