
Intro
In recent years, Swish has supplanted ReLU in several high-performing image classification models (e.g., EfficientNet). However, it has not shown a clear advantage across all machine learning tasks. A very similar activation function, the Gaussian Error Linear Unit (GELU), is used instead in OpenAI's GPT. Interestingly, when the fast sigmoid approximation of GELU is used, both non-linearities fall under the umbrella of sigmoid linear units; their only difference is that GELU scales the sigmoid's input by a constant factor. In this article, I will explore both functions and shed light on why one may work better than the other in practice.
What are Swish and GELU?
For the rest of the article, I will use Swish with its β parameter fixed at 1 (i.e. Swish-1), as recommended in the original paper, but the later conclusions extend easily to other values.
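As a reference point, here is a minimal sketch of the two functions as I use them in this article. The exact GELU is x·Φ(x), with Φ the Gaussian CDF; the fast approximation replaces Φ(x) with σ(1.702x), which is what makes the comparison to Swish so direct.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x); beta = 1 gives Swish-1 (the SiLU)
    return x * sigmoid(beta * x)

def gelu_fast(x):
    # Sigmoid approximation of GELU: x * sigmoid(1.702 * x)
    return x * sigmoid(1.702 * x)

xs = np.linspace(-4.0, 4.0, 9)
print(np.round(swish(xs), 3))
print(np.round(gelu_fast(xs), 3))
```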

With the two activations side by side, it is easy to see how similar they look. The feature that most distinguishes this class of activations is its continuous, non-monotonic bump. But what is the significance of this bump? It enables a folding of the input coordinate space, and differently sized bumps intuitively correspond to differently sized folds.
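To make the fold concrete in one dimension: because Swish-1 decreases and then increases again around its minimum (near x ≈ −1.28), two distinct pre-activations can land on essentially the same output value. The pair below is only an illustrative example:

```python
import numpy as np

def swish(x):
    # Swish-1 / SiLU: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

# Two pre-activations on opposite sides of the minimum map to
# (nearly) the same activation value -- a one-dimensional fold.
print(round(swish(-0.50), 3))   # about -0.189
print(round(swish(-2.51), 3))   # about -0.189 as well
```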
So we see what the bump does, but why is it useful?
This folding operation can actually be used in a variety of ways. The most striking is its ability to disentangle classes or more generally to simplify the incoming vector space (as visualized later).
![source: [1] open-access archive](https://towardsdatascience.com/wp-content/uploads/2021/03/1jF5gEtkHY8djBHiJg1X2RA.jpeg)
We can also tell that models take advantage of this operation from observations in the Swish paper [1], where the authors note that after training, a large proportion of pre-activations fall in the range of this bump. Their analysis stops at the conclusion that the non-monotonic bump is an important aspect of Swish, but the behavior is easily explained once you think of the folding operation the bump enables.
Learning with Swish & GELU
With the above intuition, we can now consider these operations during training and what they can learn. I put forth that GELU and Swish-1 networks have the same representational capacity: intuitively, any solution in one network class can be converted to a solution in the other simply by rescaling the weights by the approximation constant of 1.702:
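With the sigmoid approximation GELU(x) ≈ x·σ(1.702x) = Swish-1(1.702x)/1.702, a GELU layer becomes a Swish-1 layer once its incoming weights and bias are multiplied by 1.702 and the leftover factor of 1/1.702 is absorbed into the next layer's weights. Below is a quick numerical check of this correspondence (the shapes and values are only illustrative):

```python
import torch

torch.manual_seed(0)
c = 1.702                                   # constant from GELU's sigmoid approximation
x = torch.randn(5, 3)                       # a small batch of inputs
W1, b1 = torch.randn(4, 3), torch.randn(4)  # first layer
W2 = torch.randn(2, 4)                      # second layer

def silu(z):        # Swish-1 / SiLU
    return z * torch.sigmoid(z)

def gelu_fast(z):   # sigmoid approximation of GELU
    return z * torch.sigmoid(c * z)

out_gelu = gelu_fast(x @ W1.T + b1) @ W2.T
# Equivalent Swish-1 network: scale W1 and b1 up by 1.702,
# scale the following layer's weights down by 1.702.
out_swish = silu(x @ (c * W1).T + c * b1) @ (W2 / c).T
print(torch.allclose(out_gelu, out_swish, atol=1e-6))  # True
```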
The above transformation results in the same outputs, the same decision boundary, and the same loss. In this sense, a GELU network has a loss landscape similar to its Swish-1 counterpart, differing only in spread (i.e. Swish-1's loss landscape is an elongated/stretched version of GELU's); their corresponding peaks and dips are of the same magnitude as well.
Note: a formal analysis would be needed to solidify this proposal.
Differences between GELU and Swish-1
Now that I have established some connections between the two, let's see how they behave on a toy dataset. To explore the convergence behavior of GELU and Swish-1 networks, I repeatedly trained a 4-layer MLP with 2 hidden units in each layer on a 2-class circles dataset (200 samples per class) for different numbers of epochs. All networks were trained with SGD with momentum, a learning rate of 0.001, and a batch size of 20. The reported statistics are pulled from the minimum loss reached in each trial for a given training length, with 50 trials each at 3,000, 6,000, 10,000, and 20,000 epochs.
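A minimal sketch of this setup is below; the make_circles parameters, the momentum value, and the seed are illustrative choices rather than the exact experiment script:

```python
import torch
import torch.nn as nn
from sklearn.datasets import make_circles

# 2-class circles dataset: 200 samples per class
X, y = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=0)
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32).unsqueeze(1)

def make_mlp(act):
    # 4-layer MLP with 2 hidden units in each hidden layer
    return nn.Sequential(
        nn.Linear(2, 2), act(),
        nn.Linear(2, 2), act(),
        nn.Linear(2, 2), act(),
        nn.Linear(2, 1),
    )

def min_loss(model, epochs):
    opt = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    loss_fn = nn.BCEWithLogitsLoss()
    best = float("inf")
    for _ in range(epochs):
        for i in range(0, len(X), 20):          # batch size 20
            xb, yb = X[i:i + 20], y[i:i + 20]
            opt.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            opt.step()
            best = min(best, loss.item())
    return best

print(min_loss(make_mlp(nn.GELU), 3000))
print(min_loss(make_mlp(nn.SiLU), 3000))   # SiLU == Swish-1
```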




From these results, I can see that GELU converges faster than Swish-1 on average on this toy dataset, but given enough time they converge similarly. The two-sample Kolmogorov–Smirnov test is used here to check how likely it is that the per-trial min losses of the two model classes come from the same distribution, which gives an idea of how significantly their performance differs.
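For reference, the test is run on the per-trial min losses; a sketch with scipy follows (the arrays below are placeholders, not my recorded results):

```python
from scipy.stats import ks_2samp

# Placeholder per-trial min losses for one training length --
# substitute the 50 recorded values for each activation.
gelu_min_losses = [0.021, 0.034, 0.019, 0.045, 0.028]
swish_min_losses = [0.051, 0.047, 0.062, 0.039, 0.055]

stat, p_value = ks_2samp(gelu_min_losses, swish_min_losses)
# A small p-value suggests the two min-loss distributions differ.
print(stat, p_value)
```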
Why
Now, why would one class of models converge faster than the other if they have the same representational capacity? In this case, it is most likely due to the input data and the initialization scheme. Under the Kaiming-uniform initialization of PyTorch Linear layers, this network's weights are bounded by roughly ±0.7071 (1/√fan_in with a fan-in of 2), and its biases by the same. Intuitively, to realize a given fold, the weights feeding a GELU unit need to be roughly 1.702 times smaller than those feeding a Swish-1 unit, so on this toy problem GELU likely has a shorter distance to travel from initialization to an optimal solution than Swish-1.
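The bound itself is easy to verify for the fan-in of 2 used here; with PyTorch's default Linear initialization it works out to 1/√fan_in:

```python
import math
import torch.nn as nn

layer = nn.Linear(2, 2)                    # fan-in of 2, as in the hidden layers above
bound = 1 / math.sqrt(layer.in_features)   # ~0.7071
print(bound)
print(layer.weight.abs().max() <= bound,   # True under the default init
      layer.bias.abs().max() <= bound)     # True as well
```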
However, while short and sweet, this explanation does not take into account the shape of the loss landscapes. As mentioned previously, they differ in spread. At the origin, the two loss landscapes are locally similar; but for a given initialization, the two networks will not sit at corresponding locations, so the landscapes are not guaranteed to be locally similar around their starting points, nor are the networks guaranteed to follow the same relative paths to the same relative minima. In that case, the speed and quality of convergence depend on a wider set of factors that make up the overall loss landscape and how it is navigated.
While this is a somewhat underwhelming conclusion, I hope the above insights will lead to promising discussion and further research into better understanding the core components of our networks and how they work together.
Summary
Through the course of this article, I gave an introduction to two types of sigmoid linear units and tried to explain their behavior. I demonstrated how their striking non-monotonic region corresponds to folds of the input coordinate space. While I argued that the two functions have the same representational capacity, I gave an example of how their convergence behavior can differ depending on the data and initialization. Further work examining loss landscapes is required to make wider-reaching claims here, and this continues to be an interesting direction of ongoing research.
Remarks
While this article put Swish-1 in the cross-hairs, credit for the Sigmoid-weighted Linear Unit should go to Elfwing et al. Links to the original papers are listed below.
All visualizations were produced programmatically via a customized version of the open-source software manim.
[1] (Swish) https://arxiv.org/abs/1710.05941
[2] (SiL) https://arxiv.org/abs/1702.03118
[3] (GELU) https://arxiv.org/abs/1606.08415