The Softmax Function, Simplified

How a regression formula improves accuracy of deep learning models

Hamza Mahmood
Towards Data Science
Nov 26, 2018 · 5 min read


tl;dr This is the first post in a series dedicated to topics in Machine/Deep Learning and Natural Language Processing. The post discusses Softmax Regression, where we exponentiate the input vector in order to normalize it into a probability distribution with values that sum to one. It is suited to multi-class classification rather than binary classification.

Prologue

I have taken a profound interest in Machine Learning (ML), specifically exploring the field of Natural Language Processing (NLP). Going through research papers and algorithms, what arouses my curiosity about this field, and AI in general, is that there is always room to improve experimental results, and that a hundred lines of code can empower industries to automate tasks that would otherwise take long, tedious hours to complete. Learning from a data set, outputting meaningful information that aligns with real-world expectations, and doing so iteratively to gain accuracy through incremental changes is, to me, absolutely fascinating.

My academic journey may seem peculiar to most, as the majority of my courses were related to product engineering and data analytics. I wanted to add some extra flavor to that experience, so I decided to enroll in a Natural Language Processing course offered by the school's CS department.

The course did an excellent job of exposing the vast range of topics there are to learn, and with so many advancements happening in this field, I was highly motivated to start writing about the concepts relevant to ML, NLP and Deep Learning. There was a challenge, however. The concepts studied in and out of class were mathematically intensive, chock-full of confusing notation and derivations. I wanted to simplify them so that anyone enthusiastic about a topic could easily understand it and maybe even apply it in research, a career, a startup or a hackathon.

Within the first few lectures, the professor took us through material on Language Modeling; that is, the techniques intelligent systems use to accurately predict the next word or sentence in any given text. The content made sense until the professor introduced a particular term: Softmax Regression. I was overwhelmed by notation and symbols that came across as French (or more like Latin?) to me. After a few hours of self-study and a trusty cup of warm hazelnut coffee, the concept finally clicked.

To explain it in a simplified and progressive manner, we will start with the definition, then understand the symbols involved, and finally see a coded implementation of the function itself.

Definition

Softmax regression is a form of logistic regression that normalizes an input vector into a vector of values following a probability distribution whose total sums to 1. The output values lie in the range [0,1], which is convenient because it lets us move beyond binary classification and accommodate as many classes as we need in our neural network model. This is why softmax is sometimes referred to as multinomial logistic regression.
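For a quick, concrete illustration, take the input vector [2.0, 1.0, 0.1]. Exponentiating gives e^2.0 ≈ 7.39, e^1.0 ≈ 2.72 and e^0.1 ≈ 1.11, which sum to roughly 11.21; dividing each term by that sum yields approximately [0.66, 0.24, 0.10]. The largest input receives the largest share of the probability, and the outputs sum to 1.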

As an aside, another name for Softmax Regression is Maximum Entropy (MaxEnt) Classifier.

The function is usually used to produce the probabilities from which a loss is computed when training on a data set. Well-known companions of softmax regression are the cross-entropy loss and noise contrastive estimation. These are only two among various techniques that attempt to optimize training so as to increase the likelihood of predicting the correct word or sentence. (We will touch upon the aforementioned techniques in the next few posts, so stay tuned for those.)

From the outset, the definition may sound trivial, but in the realm of Machine Learning and NLP this regression function has been useful as a baseline comparator: researchers who design new solutions carry out their experiments keeping the softmax results as a reference. It should be noted, however, that softmax is not typically used as an activation function between hidden layers the way sigmoid or ReLU (Rectified Linear Units) are; rather, it is usually applied at the output layer, turning the final scores into class probabilities.
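As a minimal sketch of where each activation typically sits, here is a small classifier in tf.keras; the layer sizes, the 784-feature input and the 10-class output are illustrative assumptions, not a prescribed architecture:

```python
import tensorflow as tf

# Hypothetical classifier: ReLU activations between layers,
# softmax only at the output to produce class probabilities.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),  # probabilities over 10 classes
])
```

ReLU keeps the hidden layers non-linear, while the softmax output converts the final scores into a probability distribution over the classes.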

Notation

The classifier function involves some high-level notation, which we are going to dive into next. The formula below illustrates what the Softmax function looks like. Let's try to understand it piece by piece.

A mathematical representation of the Softmax Regression function:

softmax(θ)ⱼ = e^(θⱼ) / Σₖ e^(θₖ),  where θ = wᵀx = w₀x₀ + w₁x₁ + … + wₖxₖ
  1. Given a net input parameter θ, our objective is to predict whether the trained set of features x, each with its own weight, belongs to class j. The class labels are typically one-hot encoded: a one-hot matrix consists of binary values, with the number 1 representing the element in the iᵗʰ position of the column while the rest are 0s (a relatively large matrix runs the risk of sparsity, which we will talk about in the next post).
  2. In the formula we compute the exponential of the input parameter and the sum of the exponentials of all values in the input. The output of the Softmax function is the ratio of the exponential of a parameter to that sum of exponentials.
  3. θ, at a high level, is the sum of the scores of each element in the vector. In generalized form, we say that θ is the transpose of the weight matrix w multiplied by the feature matrix x.
  4. The term w₀x₀ is the bias, which needs to be added to each net input (see the sketch after this list).
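To tie the notation together, here is a minimal NumPy sketch; the feature values, the random weights and the choice of 3 classes are illustrative assumptions:

```python
import numpy as np

# Illustrative dimensions: 3 classes, 4 features (plus x0 = 1, so w0*x0 acts as the bias).
x = np.array([1.0, 0.5, -1.2, 3.3, 0.7])   # feature vector; x[0] = 1 absorbs the bias term
W = np.random.randn(3, 5)                  # one weight row per class j

theta = W @ x                              # net input: theta_j = w_j^T x for each class
probs = np.exp(theta) / np.sum(np.exp(theta))

print(probs, probs.sum())                  # probabilities over the 3 classes; sums to 1.0
```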

Code Implementation

Implementing the code is extraordinarily easy, and the fun part is that it's only one line, given the necessary Python helper functions at our disposal. Below, I have used both NumPy and TensorFlow to write the Softmax function as explained in the previous section. The two libraries are popularly used to carry out mathematical and neural-network-related operations.
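Here is a minimal sketch of both versions (subtracting the maximum before exponentiating is a standard numerical-stability trick I've added; it does not change the result):

```python
import numpy as np
import tensorflow as tf

def softmax(z):
    # One-line Softmax: exponentiate, then normalize by the sum.
    # Subtracting max(z) first keeps np.exp from overflowing on large inputs.
    return np.exp(z - np.max(z)) / np.sum(np.exp(z - np.max(z)))

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))        # ~[0.659 0.242 0.099]

# TensorFlow ships the same function out of the box:
print(tf.nn.softmax(scores))  # the same probabilities, returned as a tf.Tensor
```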

Conclusion

In the next post, we shall see how the softmax function is used as part of a log-loss neural network to minimize prediction errors within word embeddings. We will take text samples that allow us to understand deeper concepts in NLP and its practical usage in the real world.

So stay tuned. :)

Works Cited

https://sebastianraschka.com/faq/docs/softmax_regression.html

Spread and share knowledge. This post is the first among many in this series dedicated to understanding core concepts of NLP. If this article piqued your interest, give a few claps, as it always motivates me to write more informative content. Also, do follow my profile for more tech-related articles. — Hamza

