Exploring the Softmax Function

Developing Intuition With the Wolfram Language

Arnoud Buzing
Towards Data Science

--

Photo by Kalen Emsley on Unsplash

In machine learning, classification problems are often solved with neural networks that return a probability for each class they are trained to recognize. A typical example is image classification: the input to the network is an image, and the output is a list of candidate labels, each with a probability.

The Wolfram Language (WL) comes with a large library of pre-trained neural networks including ones that solve classification problems. For example, the built-in system function ImageIdentify uses a pre-trained network that can recognize over 4,000 objects in images.

Image by the author using a photo by James Sutton on Unsplash

Side note: Because of the unique typesetting capabilities of the Wolfram notebook interface (such as mixing code with images), all code is shown with screen captures. A notebook with full code is included at the end of this story.
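In plain text, that lookup is a one-liner (the file name here is hypothetical; any cat photo will do):

img = Import["cat.jpg"]; (* load an image from disk *)
ImageIdentify[img] (* -> the most likely label, e.g. an entity for "domestic cat" *)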

You can also use the underlying neural network directly to access the probabilities for each of the 4,000+ possible objects. In this case "domestic cat" wins hands down, with a probability of almost 1. Other types of cat follow with lower probabilities, and the nonzero result for "shower curtain" is most likely an artifact of the image background. Summing all 4,000+ probabilities gives exactly 1.0.

Image by the author using a photo by James Sutton on Unsplash
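As a sketch of what this looks like in code, assuming the network is fetched by the model name used in the Wolfram Neural Net Repository:

net = NetModel["Wolfram ImageIdentify Net V1"]; (* the net behind ImageIdentify *)
probs = net[img, "Probabilities"]; (* association of label -> probability *)
TakeLargest[probs, 5] (* the five most likely objects *)
Total[probs] (* -> 1. *)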

When you examine the neural network in detail and look at its layers, you will notice that the final layer is a SoftmaxLayer. This layer type appears at the end of many classification networks because it converts a list of raw scores into a list of probabilities.

(image by author)
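One way to pull that layer out, assuming the model is a NetChain (negative indices count from the end):

NetExtract[net, -1] (* -> SoftmaxLayer[ ] *)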

SoftmaxLayer applies the softmax function, which takes a list of numbers as input and returns a normalized list of numbers as output. More specifically, each element of the input list is exponentiated and then divided by the sum of all the exponentiated elements.

(image by author)
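Written out as an ordinary WL function (a hypothetical helper named softmax, not a built-in), the definition is a one-liner, and it agrees with the built-in layer:

softmax[list_List] := Exp[list]/Total[Exp[list]]

softmax[{1., 2., 3.}] (* -> {0.0900306, 0.244728, 0.665241} *)
SoftmaxLayer[][{1., 2., 3.}] (* -> the same values *)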

It is clear from this definition that the output elements always sum to 1: each output element is a fraction, and the shared denominator is exactly the sum of all the numerators. What is less clear is how the shape of an arbitrary input list relates to the shape of its output list, because the softmax function is nonlinear.

To build intuition, I wrote a small WL function that visualizes softmax inputs and outputs. It simply draws two bar charts side by side: one for the input list and one for the output list.

understand[list_List] := Row[{
  BarChart[list],                 (* the raw input list *)
  Style[" \[Rule] ", 32],         (* a large arrow between the two charts *)
  BarChart[SoftmaxLayer[][list]]  (* the softmax of the input list *)
}]

Let’s start with a very simple input: three zeros. The output then has three equal elements as well, and because they must add up to 1, each is 1/3 ≈ 0.333.

(image by author)
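In code, the screen capture corresponds to a call like:

understand[{0, 0, 0}]
SoftmaxLayer[][{0., 0., 0.}] (* -> {0.333333, 0.333333, 0.333333} *)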

And this is true for any list where all elements are the same. For example, a four-element list of 7s will yield a result where all elements are 0.25:

(image by author)
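Or, with the hypothetical softmax helper defined earlier:

softmax[{7., 7., 7., 7.}] (* -> {0.25, 0.25, 0.25, 0.25} *)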

Things get more interesting when the input elements are not all equal. Let’s start with a list of linearly increasing elements. The output traces an exponential curve, scaled down so that its values sum to 1: exponentiating a linear sequence produces a geometric one.

(image by author)
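For example, with the understand function from above:

understand[Range[10]] (* consecutive output values differ by a constant factor of E *)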

Similarly, a list of linearly decreasing elements yields a decaying exponential:

(image by author)
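The mirrored case:

understand[Range[10, 1, -1]] (* the mirror image of the previous input *)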

A downward-opening parabola yields an output “curve” that looks like a normal distribution. In fact, up to normalization, it is exactly that: exponentiating -x² gives the Gaussian e^(-x²).

(image by author)
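One way to generate such an input, sampling -x² over a symmetric interval:

understand[Table[-x^2, {x, -3, 3, 0.25}]]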

An upward-opening parabola gives a much more extreme output, with the endpoint values dominating.

(image by author)
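And its flipped counterpart:

understand[Table[x^2, {x, -3, 3, 0.25}]]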

Finally, and mostly for fun: periodic inputs keep their periodicity, although the exponential rescaling distorts the wave shape:

(image by author)
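For example, a sampled sine wave:

understand[Table[Sin[x], {x, 0, 4 Pi, 0.2}]]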

Exploring these and other inputs in a notebook is very instructive. Understanding how the softmax function works helps you understand how neural networks arrive at their final classification probabilities. If you want to experiment yourself, download this notebook from the Wolfram Cloud. And if you’re completely new to WL, I recommend my recent post “Learning Wolfram: From Zero to Hero”.

--

I create awesome software at Wolfram Research, makers of Mathematica, Wolfram|Alpha, Wolfram Cloud, and many other products and services.