There is widespread skepticism in the job market about machine learning engineers and the depth of their mathematical understanding. The fact is, all machine learning algorithms are essentially mathematical frameworks – support-vector machines formulated as a dual optimization problem, principal component analysis as spectral decomposition filtering, or neural networks as a composition of successive non-linear functions – and only a thorough mathematical understanding will allow you to truly grasp them.
Various Python libraries reduce advanced algorithms to a few simple steps: Scikit-learn offers KNN, K-means, decision trees, and more, while Keras lets you build neural network architectures without necessarily understanding the details behind CNNs or RNNs. However, becoming a good machine learning engineer requires much more than that, and interviews for such positions often include questions on, for example, implementing KNN or decision trees from scratch, deriving the matrix closed-form solution of linear regression, or deriving the softmax back-propagation equations.
In this article, we will review some fundamental concepts of calculus – derivatives of uni- and multi-dimensional functions, including the gradient, the Jacobian, and the Hessian – to get you started with your interview preparation and, at the same time, help you build a solid foundation for diving deeper into the mathematics behind machine learning, especially neural networks.
These concepts will be demonstrated with 5 examples of derivatives that you should absolutely have in your pocket for interviews:
- Derivative of a Composed Exponential Function – f(x) = e^(x²)
- Derivative of a Variable Base and Variable Exponent Function – f(x) = xˣ
- Gradient of a Multi-Dimensional Input Function – f(x,y,z) = 2ˣʸ + z·cos(x)
- Jacobian of a Multi-Dimensional Function – f(x,y) = [2x², x·√y]
- Hessian of a Multi-Dimensional Input Function – f(x,y) = x²y³
Derivative 1: Composed Exponential Function
The exponential function is a very foundational, common, and useful example. It is a strictly positive function, i.e. eˣ > 0 for all x in ℝ, and an important property to remember is that e⁰ = 1. In addition, you should remember that the exponential is the inverse of the logarithmic function. It is also one of the easiest functions to differentiate, because its derivative is simply the exponential itself, i.e. (eˣ)′ = eˣ. The derivative becomes trickier when the exponential is combined with another function. In such cases, we use the chain rule, which states that the derivative of the composition f(g(x)) is f′(g(x))·g′(x).
Applying the chain rule, we can compute the derivative of f(x) = e^(x²). We first find the derivative of the inner function g(x) = x², i.e. g′(x) = 2x. We also know that (eˣ)′ = eˣ. Multiplying these two intermediate results, we obtain f′(x) = 2x·e^(x²).
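If you want to double-check results like this while preparing, a symbolic library can do it for you. The snippet below is a minimal sketch using SymPy, a tool choice of mine for illustration rather than something this walk-through depends on:

```python
import sympy as sp

x = sp.symbols('x')
f = sp.exp(x**2)

# SymPy applies the chain rule automatically; expected result: 2*x*exp(x**2)
print(sp.diff(f, x))
```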
This is a very simple example that might seem trivial at first, but it is often asked by interviewers as a warmup question. If you haven’t seen derivatives for a while, make sure that you can promptly react to such simple problems, because while this won’t give you the job, failing on such a fundamental question could definitely cost you the job!
Derivative 2: Function with Variable Base and Variable Exponent
This function is a classic in interviews, especially in the financial/quant industry, where math skills are tested in even greater depth than in tech companies for machine learning positions. It sometimes brings the interviewees out of their comfort zone, but really, the hardest part of this question is to be able to start correctly.
The most important thing to realize when approaching a function in such an exponential form is, first, the inverse relationship between the exponential and the logarithm, and, second, the fact that every exponential function can be rewritten as a natural exponential function in the form aˣ = exp(x·ln(a)).
Before we get to our f(x) = xˣ example, let us demonstrate this property with the simpler function f(x) = 2ˣ. We first use the above identity to rewrite 2ˣ as exp(x·ln(2)) and then apply the chain rule to differentiate the composition, obtaining (2ˣ)′ = ln(2)·exp(x·ln(2)) = ln(2)·2ˣ.
Going back to the original function f(x) = xˣ, once you rewrite it as f(x) = exp(x·ln(x)), the derivative becomes relatively straightforward to compute, with the only potentially difficult part being the chain rule step: f′(x) = exp(x·ln(x))·(ln(x) + 1) = xˣ·(ln(x) + 1).
Note that here we used the product rule (uv)’=u’v+uv’ for the exponent xln(x).
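As before, a short SymPy sketch (my own check, not part of the original derivation) confirms this result on the positive reals:

```python
import sympy as sp

# Restrict to x > 0, in line with the domain discussion that follows
x = sp.symbols('x', positive=True)
f = x**x

# Expected result, mathematically equal to x**x * (log(x) + 1)
print(sp.simplify(sp.diff(f, x)))
```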
This function is generally asked without any information on the function's domain. If your interviewer doesn't specify the domain, they might be testing your mathematical acuity, and this is where the question gets deceptive. Without a specified domain, it seems that xˣ is defined for both positive and negative values. However, for negative x, e.g. (−0.9)^(−0.9), the result is a complex number, concretely −1.05 − 0.34i. A potential way out would be to define the domain of the function as ℤ⁻ ∪ ℝ⁺ (see here for further discussion), but the function would still not be differentiable for negative values. Therefore, in order to properly define the derivative of xˣ, we need to restrict the domain to strictly positive values. We exclude 0 because for the derivative to be defined at 0, the limit from the left (approaching 0 from negative values) must equal the limit from the right (approaching 0 from positive values), a condition that is broken here. Since the left limit, lim(x→0⁻) xˣ, does not exist (xˣ is not even real-valued for negative x close to 0), the function is not differentiable at 0, and the function's domain is therefore restricted to strictly positive values.
Before we move on to the next section, I leave you with a slightly more advanced version of this function to test your understanding: f(x) = x^(x²). If you understood the logic and steps behind the first example, adding the extra exponent shouldn't cause any difficulties, and you should arrive at the result f′(x) = x^(x²+1)·(2·ln(x) + 1).
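A quick symbolic check of this exercise, again a sketch of my own using SymPy under the assumption x > 0:

```python
import sympy as sp

x = sp.symbols('x', positive=True)
f = x**(x**2)

# The output is mathematically equal to x**(x**2 + 1) * (2*log(x) + 1)
print(sp.simplify(sp.diff(f, x)))
```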
Derivative 3: Gradient of a Multi-Dimensional Input Function
So far, the functions discussed in the first and second sections map ℝ to ℝ, i.e. both the domain and the range of the function are the real numbers. But machine learning is essentially vectorial, and its functions are multi-dimensional. A good example of such multi-dimensionality is a neural network layer with input size m and output size k, i.e. f(x) = g(Wᵀx + b), the composition of a linear mapping Wᵀx + b (with weight matrix W, input vector x, and bias b) and an element-wise non-linear mapping g (the activation function). In the general case, this can be viewed as a mapping from ℝᵐ to ℝᵏ.
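To make the dimensions concrete, here is a minimal NumPy sketch of such a layer; the sizes m = 3 and k = 2, the random weights, and the tanh activation are illustrative assumptions of mine, not taken from the article:

```python
import numpy as np

m, k = 3, 2                       # illustrative input and output sizes
rng = np.random.default_rng(0)

W = rng.standard_normal((m, k))   # weight matrix of shape (m, k)
b = rng.standard_normal(k)        # bias vector of shape (k,)
g = np.tanh                       # element-wise activation (tanh chosen arbitrarily)

x = rng.standard_normal(m)        # input vector in R^m
y = g(W.T @ x + b)                # output vector in R^k
print(x.shape, y.shape)           # (3,) (2,)
```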
In the specific case of k = 1, the derivative is called the gradient. Let us now compute the derivative of the following three-dimensional function mapping ℝ³ to ℝ: f(x,y,z) = 2ˣʸ + z·cos(x).
You can think of f as a function mapping a vector of size 3 to a vector of size 1.
The derivative of a multi-dimensional input function is called a gradient and is denoted by the symbol nabla (an inverted delta): ∇. The gradient of a function g that maps ℝⁿ to ℝ is a vector of n partial derivatives of g, where each partial derivative is itself a function of n variables. Thus, if g is a mapping from ℝⁿ to ℝ, its gradient ∇g is a mapping from ℝⁿ to ℝⁿ.
To find the gradient of our function f(x,y,z) = 2ˣʸ + z·cos(x), we construct the vector of partial derivatives ∂f/∂x, ∂f/∂y, and ∂f/∂z, and obtain ∇f = (ln(2)·y·2ˣʸ − z·sin(x), ln(2)·x·2ˣʸ, cos(x)).
Note that this example is similar to the previous section: we use the equivalence 2ˣʸ = exp(xy·ln(2)) to differentiate with respect to x and y.
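If you want to verify this gradient symbolically, the following minimal SymPy sketch (my own check, not part of the original walk-through) reproduces it component by component:

```python
import sympy as sp

x, y, z = sp.symbols('x y z')
f = 2**(x*y) + z*sp.cos(x)

# Expected: [2**(x*y)*y*log(2) - z*sin(x), 2**(x*y)*x*log(2), cos(x)]
gradient = [sp.diff(f, var) for var in (x, y, z)]
print(gradient)
```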
In conclusion, for a multi-dimensional function that maps ℝ³ to ℝ, the derivative is a gradient ∇f, which maps ℝ³ to ℝ³.
In the general case of mappings from ℝᵐ to ℝᵏ with k > 1, the derivative of a multi-dimensional function is a Jacobian matrix instead of a gradient vector. Let us investigate this in the next section.
Derivative 4: Jacobian of a Multi-Dimensional Input and Output Function
We know from the previous section that the derivative of a function mapping ℝᵐ to ℝ is a gradient mapping ℝᵐ to ℝᵐ. But what about the case where the output is also multi-dimensional, i.e. a mapping from ℝᵐ to ℝᵏ with k > 1?
In such a case, the derivative is called the Jacobian matrix. We can view the gradient simply as the special case of a Jacobian of dimension 1 × m, with m equal to the number of variables. The Jacobian J(g) of a function g mapping ℝᵐ to ℝᵏ is a mapping from ℝᵐ to ℝᵏ*ᵐ, i.e. its output is a matrix of shape k × m. In other words, each row i of J(g) is the gradient ∇gᵢ of the corresponding sub-function gᵢ of g.
Let us differentiate the function defined above, f(x,y) = [2x², x·√y], mapping ℝ² to ℝ², so both the input and output domains are multi-dimensional. In this particular case, since the square root is not defined for negative values, we need to restrict the domain of y to ℝ⁺. The first row of the Jacobian is the gradient of the first sub-function, ∇(2x²), and the second row the gradient of the second, ∇(x·√y), which gives J(f) = [[4x, 0], [√y, x/(2√y)]].
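A minimal SymPy sketch of this Jacobian, included as an optional check of mine (with y restricted to positive values, as above):

```python
import sympy as sp

x = sp.symbols('x')
y = sp.symbols('y', positive=True)   # restrict y > 0 because of the square root

f = sp.Matrix([2*x**2, x*sp.sqrt(y)])

# Expected: Matrix([[4*x, 0], [sqrt(y), x/(2*sqrt(y))]])
print(f.jacobian([x, y]))
```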
In deep learning, one example where the Jacobian is of special interest is the field of explainability (see, for example, Sensitivity based Neural Networks Explanations), which aims to understand the behavior of neural networks and analyzes the sensitivity of the network's outputs with respect to its inputs. The Jacobian helps investigate the impact of variations in the input space on the output. The same idea can also be applied to understand the concepts learned by intermediate layers of neural networks.
In summary, remember that while the gradient is the derivative of a scalar with respect to a vector, the Jacobian is the derivative of a vector with respect to another vector.
Derivative 5: Hessian of a Multi-Dimensional Input Function
So far, our discussion has focused only on first-order derivatives, but in neural networks we often talk about higher-order derivatives of multi-dimensional functions. A specific case is the second derivative, also called the Hessian matrix, denoted H(f) or ∇² (nabla squared). The Hessian of a function g mapping ℝⁿ to ℝ is a mapping H(g) from ℝⁿ to ℝⁿ*ⁿ.
Let us analyze how we went from ℝ to ℝⁿ*ⁿ in the output domain. The first derivative, i.e. the gradient ∇g, is a mapping from ℝⁿ to ℝⁿ, and its derivative is the Jacobian of the gradient. Differentiating each of the n partial derivatives ∂g/∂xᵢ again yields a mapping from ℝⁿ to ℝⁿ, and there are n such partial derivatives. You can think of it as if each element of the gradient vector expanded into a vector, turning the gradient into a vector of vectors, i.e. a matrix.
To compute the Hessian, we need to calculate the so-called cross-derivatives, that is, differentiate first with respect to x and then with respect to y, or vice-versa. One might ask whether the order in which we take the cross-derivatives matters, in other words, whether the Hessian matrix is symmetric. In cases where the function f is 𝒞², i.e. twice continuously differentiable, Schwarz's theorem states that the cross-derivatives are equal and thus the Hessian matrix is symmetric. However, for functions whose second-order partial derivatives exist but are not continuous, the cross-derivatives may differ.
Constructing the Hessian of a function amounts to finding all second-order partial derivatives of a scalar-valued function. For the specific example f(x,y) = x²y³, the computation yields H(f) = [[2y³, 6xy²], [6xy², 6x²y]].
You can see that the cross-derivatives 6xy² are indeed equal: we first differentiated with respect to x and obtained 2xy³, then differentiated again with respect to y, obtaining 6xy², and the reverse order gives the same value. The diagonal elements are simply the second derivatives with respect to x and y alone, i.e. ∂²f/∂x² = 2y³ and ∂²f/∂y² = 6x²y.
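As a final sketch of my own, SymPy's hessian helper reproduces this matrix and makes the symmetry visible:

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 * y**3

# Expected: Matrix([[2*y**3, 6*x*y**2], [6*x*y**2, 6*x**2*y]])
print(sp.hessian(f, (x, y)))
```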
An extension would be to discuss second-order derivatives of multi-dimensional functions mapping ℝᵐ to ℝᵏ, which can intuitively be seen as a second-order Jacobian. This is a mapping from ℝᵐ to ℝᵏ*ᵐ*ᵐ, i.e. a 3D tensor. Similarly to the Hessian, in order to differentiate the Jacobian a second time, we differentiate each element of the k × m matrix and obtain a matrix of vectors, i.e. a tensor. While it is rather unlikely that you would be asked to perform such a computation manually, it is important to be aware that higher-order derivatives exist for multi-dimensional functions.
Conclusion
In this article, we reviewed important calculus fundamentals behind machine learning and demonstrated them with examples of uni- and multi-dimensional functions, discussing the gradient, the Jacobian, and the Hessian. This review walks through concepts you may encounter in interviews and gives an overview of the calculus-related knowledge behind machine learning.