
In a convolutional network, the layers near the input extract spatial features. This behavior is inspired by what happens in the human visual system when we recognize an object. The first information our brain decodes is the shape, the color, the presence of textures, the orientation of the light, and the edges. We extract general information from the world which, as we process it, allows us to obtain more and more abstract information and ultimately recognize the object.
In this article, we will focus on edge detection, or rather the calculation of the image's first derivative, taking a look at the differences between the continuous and discrete worlds. Finally, we will analyze the convolution process through two derivative operators, highlighting their advantages and disadvantages.
1. Notions of continuous differential calculus
Mathematically, the derivative expresses the rate of local variability of a function with respect to a direction of development. To see this, let us consider a signal f : ℝ → ℝ with only one direction of development x, and let xi be a point in its domain. The information we want to obtain is whether the signal f, at the working point xi (local variability), undergoes a variation (increases or decreases) or remains constant. The idea is to study the neighborhood of the working point, or better: (i) evaluate f at xi; (ii) evaluate f at xi plus an infinitesimal quantity epsilon; (iii) calculate the difference between the two. This is what happens in the calculation of the first derivative, which we can formalize with Formula (1).
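Formula (1): f′(xi) = lim (ε → 0) [ f(xi + ε) − f(xi) ] / ε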

A function f is differentiable at a working point xi if the limit of the incremental ratio of the function exists and is finite as the increment ε of the independent variable tends to zero.
Multidimensional functions assume particular importance for the following discussion. Let f : ℝ^n → ℝ be a scalar field defined on n directions of development; the calculation of the derivative is done by considering the partial first derivatives, or rather the derivatives with respect to each of the n directions. The latter constitute the gradient ∇(f) : ℝ^n → ℝ^n, a vector field that associates a vector to each n-dimensional point of the scalar field. The gradient provides three important pieces of information for each point. The first, the modulus, expresses the amount of variation of the function f around the working point; the second is the sense of growth of the function at the calculation point; and the third is the direction, orthogonal to the contour lines of the scalar field. A graphical representation can be found in Fig.1, in which the scalar field f(x,y) = x² + y² in ℝ² is considered.
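For example, for the scalar field f(x,y) = x² + y² of Fig.1, the gradient is ∇(f)(x,y) = (2x, 2y): at the point (1,1) its modulus is √(2² + 2²) = 2√2, its sense points away from the origin (the function grows moving outwards), and its direction is orthogonal to the circular contour lines x² + y² = c.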
2. Edge detection
Let I[x,y] be our image, with two directions of development: x (width) and y (height). We define an edge as a region of I[x,y] in which there is a change in color intensity (Fig.2).

Edge detection aims to highlight this variation by calculating the gradient of the image. As we know, the gradient is made up of partial first derivatives. Their formalization, as presented in section 1, is valid in the continuous world. An image, on the other hand, is a discrete multidimensional signal.
2.1 Discrete partial derivative
Discrete multidimensionality requires approximating the continuous partial first derivative by a finite difference, in which the increment does not tend to zero, as in the continuous case (ε → 0), but takes on a finite value. In the case of our discrete signal I[x,y], the value of the increment is equal to one pixel: given the pixel [xm,yn], the pixel is the smallest quantity by which we can move around it to assess the local rate of variability. Formalizing, we can distinguish three types of finite differences: (a) Forward, (b) Backward, (c) Central.
One of the most well-known operations on images is convolution. Given a matrix K (the Kernel), the convolution makes K slide with a certain stride along the height and width of the image I, performing a weighted sum between the values of the Kernel and the overlapping region of the image. Through an appropriate Kernel, this technique is what we use to apply finite differences to images, calculating the partial first derivative in the two directions of development. A summary and formalization of what has just been said is presented in Tab.1. In the field of Computer Vision, and in particular for edge detection, the Central (also called symmetric) difference is used.
![Tab.1: Formalization of the three types of finite differences, Forward, Backward and Central, in the development directions x and y at the two-dimensional working point [xm,yn]. Each is associated with the corresponding Kernel representation, of dimension 1×3 for the development direction x and 3×1 for the development direction y. (Source: Image by me)](https://towardsdatascience.com/wp-content/uploads/2021/06/1XRHNrILWvKug8ICzwyK5xQ.png)
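As a minimal sketch of how the Central difference translates into a convolution, assuming the numpy and scipy.ndimage libraries, the snippet below derives a one-dimensional step edge along x. Note that the Kernel is premultiplied by −1 because convolve() flips it, as discussed in Appendix 2:

```python
import numpy as np
from scipy.ndimage import convolve

# 1-D signal with a step edge: intensity goes from 0 to 255
signal = np.array([0, 0, 0, 255, 255, 255], dtype=float)

# Central-difference Kernel [-1, 0, 1] premultiplied by -1:
# convolve() flips it back, computing I[x+1] - I[x-1]
Kx = np.array([1, 0, -1], dtype=float)

derivative = convolve(signal, Kx, mode='nearest')
print(derivative)  # [0, 0, 255, 255, 0, 0]: non-zero only around the edge
```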
2.2 Discrete gradient
The convolution produces a new image, called a feature map (F), having the same dimensions in terms of height and width as I (stride = 1 and padding = ‘same’), in which specific features of I are emphasized. As an example, if we calculate the convolution between I[x,y] and Kx (the derivative Kernel with respect to x), the result will be a new image Fx in which the vertical edges have a non-zero intensity value, while the horizontal edges are suppressed. In particular, we can say that:
the intensity value of the pixel Fx[xm,yn] corresponds to the value of the partial first derivative at the point I[xm,yn].
The same analysis can be carried out by considering Ky (the derivative Kernel with respect to y).
![Fig.3: Calculation of the modulus and direction of the gradient using the image I[x,y] as a discrete signal. (Source: Image by me)](https://towardsdatascience.com/wp-content/uploads/2021/06/1mzXSAyRX8s8Qm7gitQuY9g.png)
Once the values of the partial derivatives have been obtained, we can calculate the gradient G. The latter associates with each pixel I[xm,yn] the information on the modulus, which indicates the quantity or magnitude of variation of the image around [xm,yn], and on the direction, which expresses the direction of growth of the color intensity around the pixel of interest. Fig.3 shows a geometric representation of what has been said. Since we are in the discrete world, a more accurate representation can be found in Fig.4.
![Fig.4: (a) shows the image I[x,y] with two edges in the respective directions of development. (b) shows Fx = I ∗ Kx, or rather, the derivative of (a) with respect to x. We note how the vertical edges are highlighted with respect to the horizontal ones. In (c), Fy = I ∗ Ky, or rather, the derivative with respect to y, is represented. The behavior, in this case, is opposite to what we saw in (b). Figures (d), (e), (f) report information about the gradient. In particular, in (d) we have the Pythagorean sum of the derivatives Fx and Fy. The result is the magnitude or modulus |G[x,y]|. In (e) and (f) the direction of growth of the gradient is represented. As we can see, (e) shows the expected direction of the gradient for each pixel on the edge. The latter is directed to indicate the growth of the color intensity around the pixel of interest, in this case from a color intensity of 0 to 255. We can represent this information through an image (f). The angle θ, which indicates the orientation of the gradient, considers a reference system with the pixel in the top left corner as the intersection point of the Cartesian axes and the y-axis in the opposite direction to the standard one. (Source: Image by me)](https://towardsdatascience.com/wp-content/uploads/2021/06/1UyRzbCBlY689fof-0DaFYQ.png)
For completeness of discussion, the code used to obtain the examples in Fig.5 is presented below. As can be seen, both Kx and Ky are premultiplied by the value -1. More details can be found in Appendix 2.
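What follows is a minimal sketch of that procedure, assuming the numpy and scipy.ndimage libraries; the synthetic image and the variable names are illustrative:

```python
import numpy as np
from scipy.ndimage import convolve

# Illustrative grayscale image: a white square on a black background,
# giving two vertical and two horizontal edges
I = np.zeros((100, 100), dtype=float)
I[30:70, 30:70] = 255.0

# Central-difference Kernels, premultiplied by -1 because
# convolve() flips them (see Appendix 2)
Kx = np.array([[1, 0, -1]], dtype=float)  # derivative along x (width)
Ky = Kx.T                                 # derivative along y (height)

Fx = convolve(I, Kx, mode='nearest')      # highlights vertical edges
Fy = convolve(I, Ky, mode='nearest')      # highlights horizontal edges

G_mod = np.sqrt(Fx**2 + Fy**2)            # modulus: Pythagorean sum
theta = np.arctan2(Fy, Fx)                # direction of growth
```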
3. Noise
An image may present random variations in color intensity. This phenomenon is called digital noise and occurs especially in low-light situations and at high ISO values, or it can be introduced by the sensor itself (e.g. CMOS). For edge detection, noise can hide the presence of an edge.
In this regard, let us consider our image I[x,y] with the addition of noise (in this case salt-and-pepper noise) and plot the intensity profile (Tab.2). We notice that, compared to the case with no noise, the curve is much more oscillatory, presenting several peaks called false edges. The derivative will highlight each of them, losing the information on the true edges.
![Tab.2: Comparison of derivatives calculated on the same image I[x,y] with and without the presence of salt-and-pepper noise (Source: Image by me)](https://towardsdatascience.com/wp-content/uploads/2021/06/181rOs4W6g7lLD_u5l6LX2Q.png)
One way to reduce false edges is to blur the image. The type and amount of blurring depend on the intensity of the noise. Some image blurring techniques, combined with the Kx and Ky derivative Kernels, are presented below.
3.1 Blurring with moving average: Prewitt operator

Prewitt’s operator, as we can see from Fig.5, consists of several symmetric derivative Kernels of size 1×3 stacked on top of each other. In particular, it can be decomposed through a matrix product between a three-pixel moving-average Kernel and a derivative Kernel. In the literature, this operator can also be found premultiplied by 1/9 instead of 1/3: in this case, the average is performed over all the pixels of the 3×3 derivative Kernel, further decreasing the amount of noise. The same considerations can be made for the derivative Kernel along the y direction.
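A minimal sketch of this decomposition, assuming numpy (the variable names are illustrative):

```python
import numpy as np

# Three-pixel moving-average Kernel (column) times the
# symmetric derivative Kernel (row), premultiplied by -1 for convolve()
average = np.array([[1], [1], [1]], dtype=float) / 3.0
derivative = np.array([[1, 0, -1]], dtype=float)

prewitt_x = average @ derivative
# [[ 1/3  0  -1/3]
#  [ 1/3  0  -1/3]
#  [ 1/3  0  -1/3]]
```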
3.2 Blurring with a Gaussian filter: Sobel operator

The Sobel operator is obtained by calculating the derivative of the Gaussian filter. In particular, it can be decomposed through the matrix product between a discrete Gaussian filter and the derivative Kernel. An example of the Sobel operator along x, of size 3×3, is presented in Fig.6. The same considerations can be made for the derivative Kernel along the y direction.
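Similarly, a minimal sketch of the Sobel decomposition along x, assuming numpy, where [1, 2, 1]/4 acts as the discrete Gaussian filter:

```python
import numpy as np

# Discrete Gaussian (binomial) smoothing Kernel times the
# symmetric derivative Kernel, premultiplied by -1 for convolve()
gaussian = np.array([[1], [2], [1]], dtype=float) / 4.0
derivative = np.array([[1, 0, -1]], dtype=float)

sobel_x = gaussian @ derivative
# [[ 1/4  0  -1/4]
#  [ 1/2  0  -1/2]
#  [ 1/4  0  -1/4]]
```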
4. Problems due to derivative Kernels
Suppose we have a step edge in our image. Applying a derivative Kernel, in particular the Central one, the edge is represented by a minimum of two pixels. The number of pixels increases if we have a ramp or a roof edge. The same phenomenon can be caused not only by the nature of the edge itself but also, and above all, by the blurring of the image, as when using the Sobel and Prewitt operators.
Again, as mentioned in the previous section, the presence of noise can lead to the loss of edge information and the identification of false edges. Blurring can reduce this phenomenon but not eliminate it definitively. An example is presented in Fig.7, where false edges are still present after a blurring process.

These effects can be reduced by using Canny, an edge detector that applies non-maximum suppression and an edge-linking process to the result obtained by the Sobel operator, reducing the thickness of the edges and the presence of false positives, respectively.
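As a usage sketch, assuming the OpenCV library (cv2) and an illustrative input file, where the two thresholds are example values for the hysteresis used in edge linking:

```python
import cv2

# Illustrative input; Canny expects an 8-bit grayscale image
img = cv2.imread('image.png', cv2.IMREAD_GRAYSCALE)

# 100 and 200 are example low/high hysteresis thresholds
edges = cv2.Canny(img, 100, 200)
```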
Appendix
1. Compute |G[xm,yn]|
As anticipated in paragraph 2.2, the calculation of the gradient modulus is carried out considering the Pythagorean sum between the values of the derivatives with respect to the x and y development directions. This can be expensive in computational terms if we consider that it has to be applied at 25/30 fps (frames per second), unless we subsample. To lighten the computational load, the modulus of the gradient can be approximated using Formula (2).
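Formula (2): |G[xm,yn]| ≈ |Fx[xm,yn]| + |Fy[xm,yn]|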

2. Convolution
In the continuous world, the convolution between two signals f(t) and g(t) consists of: (i) expressing both functions in a support variable (e.g. τ); (ii) assuming that g(τ) is the function sliding along the new direction, flipping g with respect to its y-axis, obtaining g(−τ); (iii) adding a variable t to allow the sliding, obtaining g(t−τ); (iv) calculating the integral of f(τ)g(t−τ) in dτ [3][4].
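Formally: (f ∗ g)(t) = ∫ f(τ) g(t − τ) dτ, with the integral taken over τ from −∞ to +∞.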
The convolve() method of the scipy.ndimage library implements exactly this behavior. So, for the Kernels Kx and Ky to be applied correctly, we need to premultiply them by −1. At this point, applying the convolve() method, and in particular step (ii), we obtain the correct filter values (Tab.3).
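A minimal sketch of this behavior, assuming numpy and scipy.ndimage, comparing convolve() (which flips the Kernel) with correlate() (which does not):

```python
import numpy as np
from scipy.ndimage import convolve, correlate

signal = np.array([0, 0, 255, 255], dtype=float)  # edge from 0 to 255
K = np.array([-1, 0, 1], dtype=float)             # textbook derivative kernel

print(correlate(signal, K, mode='nearest'))  # [0, 255, 255, 0]: positive, as expected
print(convolve(signal, K, mode='nearest'))   # [0, -255, -255, 0]: sign is inverted
# Premultiplying K by -1 before convolve() restores the correct sign.
```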

Not inverting the derivative Kernels and performing the convolution will result in misreading the information for both the value of the derivative and the direction of the gradient. The modulus of the gradient is not affected, as it is formed by sums of squares. An example is shown in Fig.8. As we can see, for the left edge, where we have a transition from an intensity value of 255 to 0, we expect a negative derivative, but we obtain a positive value instead. The same considerations can be made for the right edge.

References
[2] Lecture 5: Edge Detection – Stanford University