
Breaking Linearity With ReLU

Explaining how and why the ReLU activation function is non-linear

Photo by Alina Grubnyak on Unsplash

Introduction

_Neural networks and deep learning_ are arguably one of the most popular reasons people transition into Data Science. However, this excitement can lead to overlooking the core concepts that make neural networks tick. In this post, I want to go over probably the most important feature of neural networks, which practitioners should be aware of to fully understand what is happening under the hood.

Why Do We Need Activation Functions?

_Activation functions are ubiquitous in data science and Machine Learning. The term typically refers to the transformation applied to the linear input of a neuron in a neural network:_

y = f\left(b + \sum_i w_i x_i\right)

where f is the activation function, y is the output, b is the bias, and w_i and x_i are the weights and their corresponding feature values.
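
To make this concrete, here is a minimal NumPy sketch of a single neuron; the weights, bias, and the choice of tanh as the activation are illustrative, not taken from this post's code:

```python
import numpy as np

def neuron_output(x, w, b, f):
    # Activation f applied to the neuron's linear input: b + sum_i w_i * x_i
    return f(b + np.dot(w, x))

# Illustrative values
x = np.array([0.5, -1.2, 3.0])   # feature values x_i
w = np.array([0.8, 0.1, -0.4])   # weights w_i
b = 0.2                          # bias

print(neuron_output(x, w, b, np.tanh))  # tanh used purely as an example activation
```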

But why do we need activation functions?

The simple answer is that they allow us to model complex patterns, and they do this by making the neural network _non-linear_. If there are no non-linear activation functions in the network, the whole model just becomes a _linear regression_ model!

A function is non-linear when a change in its input does not produce a proportional change in its output.

For example, consider a feed-forward two-layer neural network with two neurons in the middle layer, no activation functions, and the bias terms ignored:

h_1 = w_{11} x_1 + w_{12} x_2, \quad h_2 = w_{21} x_1 + w_{22} x_2

y = v_1 h_1 + v_2 h_2 = (v_1 w_{11} + v_2 w_{21}) x_1 + (v_1 w_{12} + v_2 w_{22}) x_2

We have managed to condense our 2-layer network into a single-layer network! The final equation in the above derivation is simply a linear regression model with features _x_1 and x_2_ and their corresponding coefficients.

So our ‘deep neural network’ would collapse to a single layer and become the good old linear regression model! This is a problem, as the network would no longer be able to model or fit complex functions to the data.
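
We can sanity-check this collapse numerically. The sketch below (with randomly chosen weights, purely for illustration) composes two bias-free linear layers and compares the result with the single collapsed layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-layer network with no activation functions and no biases:
# two neurons in the middle layer, one output neuron
W1 = rng.normal(size=(2, 2))   # input -> hidden weights (the w's above)
w2 = rng.normal(size=(1, 2))   # hidden -> output weights (the v's above)

x = rng.normal(size=2)         # an arbitrary input (x_1, x_2)

two_layer = w2 @ (W1 @ x)      # output of the 'deep' network
collapsed = (w2 @ W1) @ x      # output of the equivalent single linear layer

print(np.allclose(two_layer, collapsed))  # True: the two layers collapse into one
```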

Neural networks can approximate any continuous function due to something called the Universal Approximation Theorem. Check out this post here if you want to learn more!

The formal mathematical definition for a linear function is:

f(x + y) = f(x) + f(y) \quad \text{and} \quad f(cx) = c\,f(x) \quad \text{for all inputs } x, y \text{ and any constant } c

And here is a very simple example, using the function f(x) = 10x:

f(x + y) = 10(x + y) = 10x + 10y = f(x) + f(y)

f(cx) = 10(cx) = c(10x) = c\,f(x)

So the function f(x) = 10x is linear!

Note that if we added a bias term to the above equation, it would no longer be a linear function but rather an affine function. See this StackExchange thread discussing why this is the case.
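
As a quick sanity check, here is a tiny snippet (the test values are arbitrary) verifying both properties for f(x) = 10x:

```python
def f(x):
    return 10 * x

x, y, c = 3.0, -7.0, 2.5          # arbitrary test values

print(f(x + y) == f(x) + f(y))    # additivity holds -> True
print(f(c * x) == c * f(x))       # homogeneity holds -> True
```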

ReLU

The rectified linear unit (ReLU) is the most popular activation function as it is computationally efficient and mitigates the _vanishing gradient problem_.

Mathematically the function reads:

f(x) = \max(0, x)

We can visualise it graphically in Python:

Plot generated by author in Python.
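
For reference, here is a minimal sketch of how such a plot could be produced with NumPy and Matplotlib (not necessarily the exact code in the linked repository):

```python
import numpy as np
import matplotlib.pyplot as plt

def relu(x):
    # ReLU: f(x) = max(0, x), applied element-wise
    return np.maximum(0.0, x)

x = np.linspace(-5, 5, 500)

plt.plot(x, relu(x))
plt.title("ReLU activation function")
plt.xlabel("x")
plt.ylabel("max(0, x)")
plt.grid(True)
plt.show()
```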

Why Is ReLU Non-Linear?

The ReLU function may appear to be linear, as it is made up of two straight lines. In fact, it is _piece-wise linear_. However, it is precisely the kink where these two different straight lines meet that makes it non-linear.

We can show it is non-linear by carrying out the same linearity check as above, but with the ReLU function:

f(x + y) = \max(0, x + y) \quad \text{vs.} \quad f(x) + f(y) = \max(0, x) + \max(0, y)

Let's break it down with a concrete counterexample, taking x = 2 and y = -1:

f(2 + (-1)) = \max(0, 1) = 1

f(2) + f(-1) = \max(0, 2) + \max(0, -1) = 2 + 0 = 2

1 \neq 2 \;\Rightarrow\; f(x + y) \neq f(x) + f(y)

Therefore, ReLU is non-linear!
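
The same counterexample is easy to check numerically (the values x = 2 and y = -1 are just the ones used above):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

x, y = 2.0, -1.0

print(relu(x + y))          # relu(1)  -> 1.0
print(relu(x) + relu(y))    # 2 + 0    -> 2.0, which is not equal to relu(x + y)
```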

I have linked a good article here that showcases how you can approximate any function using ReLU.

Summary and Further Thoughts

Non-linearity is essential in neural networks as it allows the algorithm to deduce complex patterns in the data. Non-linearity comes from the activation functions, and the most popular one is ReLU, thanks to its computational efficiency and its ability to mitigate known issues, such as vanishing gradients, when training neural networks. The ReLU function is piece-wise linear, which, as we showed mathematically above, is enough to make it non-linear.

The full code can be found on my GitHub here:

Medium-Articles/relu.py at main · egorhowell/Medium-Articles

Another Thing!

I have a free newsletter, Dishing the Data, where I share weekly tips for becoming a better Data Scientist. There is no "fluff" or "clickbait," just pure actionable insights from a practicing Data Scientist.

Dishing The Data | Egor Howell | Substack
