From Circle to ML via Batman: Part II

More Art! And deriving beauties in ML.

Jasdeep Grover
Towards Data Science

--

“M”, “L” and the Batman symbol are the results of a single inequation each. View Graph

Background

Wait! Isn’t the above equation different from what we found last time? Yup, very different, yet it looks exactly the same, or maybe a bit better. Just in case you are wondering what I am talking about, please refer to Part I of this series. Many ideas used here are derived in that article and explained very clearly, the main ones being: why circles become like squares, how we can look at them as intersections of trenches, and how we can build our own graphs and equations by intersecting those trenches.

The Challenge

The trenches we previously used for making graphs (on left) caused extra regions in our Batman (on right)

Working with trenches isn’t a very joyful experience, is it? Honestly, they have two walls and we generally use only one. The other one just lingers around and sometimes creates unwanted regions; remember, we had to trim our Batman for that very reason. Next, we were able to perform only the intersection and negation (flipping the sign of the power) of trenches, while their union remained challenging yet very useful. And with great power comes great computational difficulty: raising our inputs to very high powers is neither common nor computationally efficient. Thus, the scope of what we derived and obtained earlier was a bit limited by our designing skills.

The universal set of mathematics is infinite and we would never be able to express it as the union of our finite ideas. So let’s start finding solutions to our challenges!

From Trenches to Boundaries

Remember, whenever the powers become too large to control, logarithms and exponents come to the rescue. The fundamental reason for getting trenches was that large even powers produce two walls: one at y-f(x)=1 (which we generally use) and the other at y-f(x)=-1 (which we generally discard). So we have to make a change that gives only one wall per trench (which makes it just a wall). We can do this pretty easily: just replace x^(2n) with e^(nx). The main reason everything worked before was that for absolute input values greater than 1 the function shot up above 1 very fast, while for values less than 1 it stayed near zero. In the case of e^(nx), for positive x the output shoots above 1 very fast, and for negative x it stays near zero. The first challenge is solved! Exponents are commonly used and have fast implementations. That’s a nice property to have, ain’t it?

A wall is always better than a trench, especially when no one has to pay for it (at least not computationally in our case).
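To see this behaviour numerically, here is a minimal NumPy sketch (the value n = 25 and the sample points are arbitrary choices for illustration):

```python
import numpy as np

# Trench vs wall: t^(2n) blows up on BOTH sides of |t| = 1,
# while e^(nt) blows up only for t > 0 and stays near zero for t < 0.
n = 25
t = np.array([-2.0, -1.0, -0.1, 0.0, 0.1, 1.0, 2.0])

trench = t ** (2 * n)   # two walls, at t = -1 and t = +1
wall = np.exp(n * t)    # a single wall, at t = 0

for ti, tr, wa in zip(t, trench, wall):
    print(f"t={ti:+.1f}   t^(2n)={tr:10.3g}   e^(nt)={wa:10.3g}")
```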

Complement, Intersection and Union

Once we have the right tool, we can see all the great things we can do with it. The boundary we just defined has near-zero values for negative x and tends to infinity for positive x. So what is a set here? A set is essentially an inequality like y≤x or x²+y²-1≤0, and its boundary is what we want to represent. We multiply by a large n and exponentiate both sides to get our boundaries. Thus y-x≤0 becomes e^(50(y-x))≤1, which is akin to (y-x)⁵⁰≤1 (one of our previous trench boundaries). The same logic we saw earlier applies to both cases.
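As a small sketch of this recipe in code (the helper name wall and the choice n = 50 are mine, purely for illustration):

```python
import numpy as np

def wall(g, n=50):
    """Turn a set described by g(x, y) <= 0 into the value e^(n*g(x, y)):
    far below 1 deep inside the set, exactly 1 on the boundary, huge outside."""
    return lambda x, y: np.exp(n * g(x, y))

# The half-plane y <= x, i.e. y - x <= 0, becomes e^(50(y - x)) <= 1.
half_plane = wall(lambda x, y: y - x)

print(half_plane(1.0, 0.0))   # inside  -> much smaller than 1
print(half_plane(0.0, 0.0))   # on the boundary -> exactly 1
print(half_plane(0.0, 1.0))   # outside -> much larger than 1
```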

Let’s look at the complement of our sets which we can obtain by simply changing the sign of the power as we did last time.

View Graph

Next, let’s look at our favourite, intersection. It’s the same as what we derived earlier, since nothing changes in the logic of a sum of small numbers staying below 1. This can be seen as follows:

View Graph

And finally our newcomer, union. Let’s derive it. By De Morgan’s law, we know that AUB = (A’∩B’)’. That means taking the inverse of the sum of the inverses of the sets. Ahhh! It’s just like how you evaluate the resistance of parallel resistors (1/Rt = 1/R1 + 1/R2). Or, for those who are familiar, it is like the harmonic mean instead of the sum. Let’s see:

View Graph

There is also a very important observation. The property of the value tending to zero for points inside the set and to infinity for points outside the set also holds for the results of the above set operations. Thus these set operations are repeatable without doing the exponentiation again, which means we can compute more complex expressions like AU(B∩(CUD)) by applying the operations one by one.
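Here is a minimal NumPy sketch of the three operations acting on such values (the function names are my own, not standard library ones), followed by one composed example:

```python
import numpy as np

# Each input value is <= 1 inside its set and > 1 outside,
# so the results can be fed straight back into these functions.

def complement(v):
    # Flipping the sign of the power is the same as taking the reciprocal.
    return 1.0 / v

def intersection(*values):
    # Inside all sets every term is tiny, so the sum stays below 1.
    return sum(values)

def union(*values):
    # De Morgan: (A' ∩ B')' -> reciprocal of the sum of reciprocals,
    # just like parallel resistors.
    return 1.0 / sum(1.0 / v for v in values)

# Composition AU(B∩(CUD)) evaluated at a single point (x, y).
n = 50
x, y = 0.3, -0.2
A = np.exp(n * (x - 1))    # the set x <= 1
B = np.exp(-n * (x + 1))   # the set x >= -1
C = np.exp(n * (y - 1))    # the set y <= 1
D = np.exp(-n * (y + 1))   # the set y >= -1
print(union(A, intersection(B, union(C, D))) <= 1)   # True: the point is in the set
```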

The union of a variety of ideas in algebra and set theory has guided us through the narrow intersection of mathematics and creative art. Summing up all our high-powered ideas, I can say that the difference from our goal will soon tend to zero. With one final activity, which I would like to exponentiate on, we will be all set to fit the application in Machine Learning.

Let’s Make Something!

Let’s make the following puzzle piece:

View Graph

The above equation can be obtained by first creating a square, then taking the union with 2 circles (the ones protruding), followed by the intersection with the complements of the other 2 circles (the ones removed). Let the 4 edges of the square be represented as A, B, C and D; each edge is a single line. Let the circles undergoing union be represented as E and F, and the circles being removed as G and H. Thus the above figure is: ((A ∩ B ∩ C ∩ D) U E U F) ∩ G’ ∩ H’.

Let’s represent the sets:

  • A: e^(50(x-1))
  • B: e^(-50(x+1)) (the minus sign changes the direction of the set, to make it face inside the figure)
  • C: e^(50(y-1))
  • D: e^(-50(y+1)) (the minus sign changes the direction of the set)
  • E: e^(50((x+1)²+y²-0.25))
  • F: e^(50((x)²+(y-1)²-0.25))
  • G: e^(50((x-1)²+(y)²-0.25))
  • H: e^(50((x)²+(y+1)²-0.25))

Performing (A ∩ B ∩ C ∩ D) gives: (e^(50(x-1))+e^(-50(x+1))+e^(50(y-1))+e^(-50(y+1)))

This is followed by taking the union with the two protruding circles, giving:

((e^(50(x-1))+e^(-50(x+1))+e^(50(y-1))+e^(-50(y+1)))^-1+e^(-50((x+1)²+y²-0.25))+e^(-50((x)²+(y-1)²-0.25)))^-1

Finally performing intersection with the complement of other 2 circles gives the desired equation shown in the figure above.

All the above operations are also compatible with what we learnt previously, so the square can take back its true form: x⁵⁰+y⁵⁰=1. This gives an alternate equation for the same figure, shown below. Note the first term.

((x⁵⁰+y⁵⁰)^-1+e^(-50((x+1)²+y²-0.25))+e^(-50((x)²+(y-1)²-0.25)))^-1+e^(-50((x-1)²+(y)²-0.25))+e^(-50((x)²+(y+1)²-0.25))≤1

View Graph
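To make the construction concrete, here is a rough NumPy reconstruction of the puzzle-piece value built from the wall values and set operations above (the helper names and test points are mine; points with value ≤ 1 lie inside the piece):

```python
import numpy as np

n = 50

def piece(x, y):
    # Square walls A-D.
    A = np.exp(n * (x - 1))
    B = np.exp(-n * (x + 1))
    C = np.exp(n * (y - 1))
    D = np.exp(-n * (y + 1))
    # Protruding circles E, F and removed circles G, H (radius 0.5).
    E = np.exp(n * ((x + 1) ** 2 + y ** 2 - 0.25))
    F = np.exp(n * (x ** 2 + (y - 1) ** 2 - 0.25))
    G = np.exp(n * ((x - 1) ** 2 + y ** 2 - 0.25))
    H = np.exp(n * (x ** 2 + (y + 1) ** 2 - 0.25))
    square = A + B + C + D                                   # A ∩ B ∩ C ∩ D
    with_bumps = 1.0 / (1.0 / square + 1.0 / E + 1.0 / F)    # ... U E U F
    return with_bumps + 1.0 / G + 1.0 / H                    # ... ∩ G' ∩ H'

for pt in [(0.0, 0.0), (-1.3, 0.0), (0.9, 0.0), (1.5, 1.5)]:
    print(pt, "inside" if piece(*pt) <= 1 else "outside")
# Expected: inside, inside (protruding circle), outside (removed circle), outside
```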

Real Applications

Remember, if we can create good-looking graphs and their equations, then we can also create some pretty interesting functions commonly used in various fields, or even make some of our own. My background is in Machine Learning, so I am acquainted with some of these functions used in my field.

Deriving Log-sum-exp and Softmax

We have all learnt so much about the max function, which takes in multiple numbers and spits out the largest one. In many applications in machine learning, we want the max operation not only to be as close as possible to the actual largest number but also to have some relation with the numbers which are not the largest. The closer these numbers are to the largest, the more they should contribute to the result. This is important because it allows gradients to propagate through non-maximum terms as well during backpropagation or other training algorithms. Without going deep into ML, we can say we need an approximation of the max operation which accounts for all the terms. Such approximation of fixed (hard) decision functions like max, if-else, sorting, etc. is called softening.

What is max, essentially? It is the union of the individual functions, with the largest one represented at the output. Note that in the union the largest function automatically has the smaller ones under it. As a 1-dimensional example, max(y=1, y=2, y=3) is y=3. We can also write this as the boundary of Union(y≤1, y≤2, y≤3): the union is y≤3, so the boundary is y=3. Let’s visualize this for some more realistic functions:

The dark blue curve represents the max of the 4 input lines. As this is the same as performing max on every output for a given input x, it is called pointwise max.

Let the input functions be f(x), g(x) and h(x). We can represent them as y-f(x)≤0, y-g(x)≤0 and y-h(x)≤0. Let’s perform the union of these sets to generate our approximation of the max function:

(e^(-n(y-f(x)))+e^(-n(y-g(x)))+e^(-n(y-h(x))))^-1 ≤ 1

Taking the points on the border, we get: (e^(-n(y-f(x)))+e^(-n(y-g(x)))+e^(-n(y-h(x))))^-1 = 1

Since the reciprocal equals 1, the sum itself equals 1. Factoring out e^(-ny) gives e^(-ny)(e^(nf(x))+e^(ng(x))+e^(nh(x))) = 1, i.e. e^(ny) = e^(nf(x))+e^(ng(x))+e^(nh(x)). Rearranging to make y a function of x gives:

y = ln(e^(nf(x))+e^(ng(x))+e^(nh(x)))/n

This is essentially the log-sum-exp form. 1/n is generally referred to as the temperature, denoted T, and is often taken to be 1. For multiple terms we get: y = ln(sum(e^(xᵢ), for all i)), hence the name log-sum-exp (LSE).
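As a quick numerical check of this claim, here is a plain NumPy sketch (in practice one would reach for a numerically stable library routine such as scipy.special.logsumexp):

```python
import numpy as np

def lse_max(values, n=1.0):
    # y = ln(sum(e^(n * v_i))) / n, the smooth max derived above.
    v = np.asarray(values, dtype=float)
    return np.log(np.sum(np.exp(n * v))) / n

xs = [1.0, 2.0, 3.0]
for n in (1, 5, 50):
    print(f"n={n:>2}  log-sum-exp={lse_max(xs, n):.4f}  true max={max(xs)}")
# The approximation is always above the true max and tightens as n grows.
```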

What about all that softening and ML stuff mentioned earlier? Note that the log-sum-exp is always greater than the actual max. This difference accounts for the terms which are second largest, third largest, and so on. The output is a mixture of all terms, as seen in the above equation, but larger terms contribute much more than the smaller ones since we have approximated the max operation. Remember: the larger the value of n, the closer the circle was to a square. Similarly, a larger n means the bigger terms contribute exponentially more than the smaller ones, making the approximation more accurate. Notice the curved, differentiable corners as well; this is another benefit of our approximation that is very helpful in ML. By the way, what’s the derivative of log-sum-exp? It’s our common friend Softmax (yes, the classification-layer one). I hope you now see how this famous function also derived its name. And yes, I know this is not a Wikipedia article, so I will move on.

Graphs for different values of n. The approximation improves with an increase in n; View Graph
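To back up the derivative claim, here is a minimal sketch comparing a finite-difference gradient of log-sum-exp (with n = 1) against the softmax of the same inputs (the helper names are mine):

```python
import numpy as np

def lse(v):
    return np.log(np.sum(np.exp(v)))

def softmax(v):
    e = np.exp(v - np.max(v))   # subtract the max for numerical stability
    return e / e.sum()

v = np.array([1.0, 2.0, 3.0])
eps = 1e-6
numeric_grad = np.array([
    (lse(v + eps * np.eye(3)[i]) - lse(v - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
print(numeric_grad)    # ~[0.0900, 0.2447, 0.6652]
print(softmax(v))      # matches the numerical gradient
```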

There is no dearth of applications of the log-sum-exp equation; many of its properties are discussed on its Wikipedia page. There are also papers using it directly as an architecture, for example as a universal convex approximator and as a universal function approximator. It has so many applications that it has efficient built-in implementations in nearly all ML libraries, from NumPy to PyTorch and TensorFlow. The upcoming applications are essentially special cases of this one.

Deriving Soft-Plus Activation function

The soft-plus activation is a type of non-linearity used in neural networks. One very common non-linearity is ReLU, which takes the form max(0,x). Under many conditions, as described in this paper, we need to approximate the ReLU function. Without diving deep into activation-function and ML details, we can handle this challenge by approximating ReLU with the methods we just learnt.

Thus we have to approximate max(0,x). Why not just refer to our derivation of log-sum-exp? The two needed components are y≤0 and y-x≤0. Their union is (e^(-ny)+e^(-n(y-x)))^-1 ≤ 1, which gives us the equation y = ln(1+e^(nx))/n. When n is 1, this is referred to as the softplus activation function. Beyond its original paper, this function is also used in other activation functions like swish, mish and soft++.

Approximation improves as n increases; View Graph
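A minimal NumPy sketch of this approximation against ReLU (the sample points and values of n are arbitrary):

```python
import numpy as np

def softplus(x, n=1.0):
    # y = ln(1 + e^(n*x)) / n; log1p(z) computes ln(1 + z).
    return np.log1p(np.exp(n * x)) / n

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for n in (1, 10):
    print(f"n={n:>2}", np.round(softplus(x, n), 4))
print("ReLU", relu(x))   # softplus approaches ReLU as n grows
```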

We can even go beyond and create our own variant of softplus, call it leaky-softplus: an approximation of leaky-ReLU (max(0.05x, x)) obtained with the same procedure. The function takes the form y = ln(e^(0.05nx)+e^(nx))/n, and according to the common ritual, we make n=1. The result is shown below. Testing and experimentation are left to the reader. ;-)

Our leaky soft-plus for different n; View Graph
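For the curious, a small sketch of this home-made variant (the name leaky_softplus and the 0.05 slope follow the text above; nothing here is a standard library function):

```python
import numpy as np

def leaky_softplus(x, n=1.0, slope=0.05):
    # y = ln(e^(slope*n*x) + e^(n*x)) / n, a smooth union of y <= slope*x and y <= x.
    return np.log(np.exp(slope * n * x) + np.exp(n * x)) / n

def leaky_relu(x, slope=0.05):
    return np.maximum(slope * x, x)

x = np.linspace(-3, 3, 7)
print(np.round(leaky_softplus(x, n=5), 4))
print(np.round(leaky_relu(x), 4))   # the two agree away from x = 0
```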

Deriving log-cosh loss

In many regression tasks, a loss function called absolute loss, which is essentially the average absolute value of the errors, is used. This loss is non-differentiable at zero and also has no second derivative for training algorithms that use higher-order derivatives. Log-cosh handles these problems very well by approximating it, looking like mean squared error near zero and like absolute loss away from it. More can be learnt in this article. Thus, we have to approximate |x|, which is essentially max(x,-x). We can use the same old trick and get: y = ln(e^(nx)+e^(-nx))/n. We haven’t reached our goal yet. We can now add and subtract ln(2)/n, giving us: ln((e^(nx)+e^(-nx))/2)/n + ln(2)/n. The next step is to make n=1 and ignore the constant, as it doesn’t affect the training procedure. This gives us ln((e^x+e^(-x))/2), which is ln(cosh(x)). This is our log-cosh loss function.

Note the similarity with the parabola near 0. That’s why we did the ln(2) adjustment. View Graph
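A minimal sketch of the loss and its two limiting behaviours (the function name is mine; this is not any particular library's implementation):

```python
import numpy as np

def log_cosh_loss(y_true, y_pred):
    # Mean of ln(cosh(error)): ~error^2 / 2 near zero, ~|error| - ln(2) for large errors.
    e = np.asarray(y_pred) - np.asarray(y_true)
    return np.mean(np.log(np.cosh(e)))

errors = np.array([0.01, 0.1, 1.0, 5.0])
print(np.log(np.cosh(errors)))       # the loss per error
print(0.5 * errors ** 2)             # parabola: close for small errors
print(np.abs(errors) - np.log(2))    # shifted absolute loss: close for large errors
```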

Conclusion

Simple ideas born out of sheer curiosity can have a very wide range of applications; there is no ceiling on curiosity and what it can give us. I hope these articles have provided you with new tools and perspectives that you can apply in fields as diverse as science, engineering and art. For curious minds, I would like to leave a good resource to improve your understanding of this idea: Differentiable Set Operations for Algebraic Expressions.

Challenge

Make this figure with a single inequation:

Solution!

Make this figure with a single inequation without using modulus function:

This figure extends infinitely in all directions. Solution.
