If you have read my previous articles, you’ll know what’s coming next. In this part of the internet, we take complex-sounding concepts and make them fun and nbd by illustrating them. And if you haven’t read my previous articles, I highly recommend you start with my series of articles covering the basics of machine learning because you’ll find that a lot of the material covered there is relevant here.
Today, we’re going to tackle the big boy – an introduction to Neural Networks, a kind of Machine Learning model. This is just the first article in a whole series I plan on doing on Deep Learning. It will focus on how a simple artificial neural network learns and provide you with a deep (ha, pun) understanding of how a neural network is constructed, neuron by neuron, which is super essential as we’ll continue to build upon this knowledge. While we will dive into the mathematical details, there’s no need to worry because we will break down and illustrate each step. By the end of this article, you’ll realize that it’s waaaaay simpler than it sounds.
But before we explore that, you might be wondering: Why do we need neural networks? With so many machine learning algorithms available, why choose neural networks? The answers to this question are plentiful and extensively discussed, so we won’t delve too deeply into it. But it’s worth noting that neural networks are incredibly powerful. They can identify complex patterns in data that classical algorithms may struggle with, tackle highly complex machine learning problems (such as natural language processing and image recognition), and diminish the need for extensive feature engineering and manual efforts.
But all that said, neural network problems pretty much boil down to 2 main categories – Classification, predicting a discrete label for a given input (ex: is this a picture of a cat or a dog? is this movie review positive or negative?) or Regression, predicting a continuous value for a given input (ex: weather prediction – what will the temperature be tomorrow?).
Today we’ll focus on a regression problem. Consider a simple scenario: we recently moved to a new city and are currently searching for a new home. However, we notice that the prices of houses in the area vary significantly.
Since we are unfamiliar with the city, our only source of information is what we can find online. We come across a house that interests us but are unsure whether it is priced fairly.
So we decide to build a neural network for predicting the price of a house based on certain features – its size (in feet²), location (1 = urban, 2 = suburban, 3 = rural), age, and the number of bedrooms. Our goal is to use these features to predict the house price.
The first thing we do is collect data about houses in the neighborhood and what price they sold for.

Next, we want to train a neural network. Training involves feeding this dataset into the model, which learns the patterns in the data.
Terminology segue: Since we’re using the dataset above to train the model, it is called the training data. Usually our training data will contain 1000s if not 100000s of rows, but we’ll keep it simple for now.
As a result, the model becomes capable of predicting the price of a new house based on the available data.
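If you'd like to follow along in code, here's a rough sketch of what a training dataset like this could look like in Python. The numbers below are made up purely for illustration (the actual table is in the image above); pandas is assumed:

```python
import pandas as pd

# Hypothetical training data, made up purely for illustration.
# Each row is a house we found online; price is what it sold for.
training_data = pd.DataFrame({
    "size_sqft": [2100, 1400, 3200, 950],
    "location":  [1, 2, 1, 3],      # 1 = urban, 2 = suburban, 3 = rural
    "age":       [12, 35, 3, 50],
    "bedrooms":  [3, 2, 4, 2],
    "price":     [1_750_000, 620_000, 2_400_000, 310_000],
})

print(training_data)
```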
But before getting into the model building and training, let’s understand why it is called a neural network.
Background
A neural network enables computers to process data in a manner inspired by the human brain. It utilizes interconnected neurons arranged in layers, resembling the structure of the human brain.
This is a biological neuron.

It receives inputs, processes the received inputs or data (this processing is nothing short of magical), and generates an output.
An artificial neuron operates in much the same way: it receives inputs, processes them, and generates an output.

The blue lines here represent the inputs to the neuron. In the context of pricing a house, these inputs can be considered as the different feature variables, while the output will be the predicted house price.

Each input is associated with a constant term called a weight. So let’s add them to our artificial neuron.

The purpose of these weights is to indicate the importance of an input. A higher weight value means that the input is considered more important. So if the weight of age is higher than that of location, it means that the age of the house is given more importance than the location of the house.
Now, just like some magic happens in the biological neuron, this is what that magic looks like in the artificial neuron.

When we zoom in, we see that this magic is essentially 2 mathematical steps.

Magic, Part 1: Summation

The first part is a summation. Here, we multiply each input by its corresponding weight and then sum them together.

You may have also noticed a little b at the top. This is called the bias term and it is a constant value. We add this value to the weighted sum to complete the summation.

Mathematically:

where the features are represented by xᵢ and n = number of features
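In code, this summation step is just a multiply-and-add. Here's a minimal sketch (NumPy assumed; the weight and bias values below are placeholders, not values our model will actually learn):

```python
import numpy as np

def weighted_sum(x, w, b):
    """Part 1 of the 'magic': multiply each input by its weight, add them up, add the bias."""
    return np.dot(w, x) + b   # w₁x₁ + w₂x₂ + ... + wₙxₙ + b

# Placeholder values, just to see the mechanics:
x = np.array([2100, 1, 12, 3])         # size, location, age, bedrooms
w = np.array([0.5, -10.0, -2.0, 8.0])  # one weight per input (made up)
b = 100.0                              # bias (made up)

print(weighted_sum(x, w, b))
```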
Magic, Part 2: Activation Function
Next, the summation from the previous step is passed through something called the activation function.

Think of activation functions as the translators of raw data into meaningful insights. They take the summation from the previous step and transform it into an output that’s useful for our specific task.
Let’s start with the binary step function. It’s straightforward: if your input (let’s call it x) is equal to or greater than 0, the function spits out a 1; otherwise, it gives you a 0. This is super handy when you need a clear-cut decision, like a yes or no. For example, based on the inputs, will this house sell?

Then there’s the linear function, which tells it like it is. It simply returns whatever value it receives. So, if our summation is 5, the output is also 5.

Moving on to the sigmoid function, a real game-changer. It elegantly squishes any input value to fit within a 0 to 1 range. Why is this awesome? Because it’s perfect for probability-based questions. For example, what’s the likelihood of a house selling given certain conditions?

Then there’s the hyperbolic tangent function, or tanh for short. It’s similar to the sigmoid but with a twist: it outputs values ranging from -1 to 1. So, larger positive inputs hover near 1, while larger negative ones approach -1.

And, drumroll please, we have the rectifier function, also known as the ReLU (Rectified Linear Unit). This one’s a star in the neural network world. It’s simple but effective: if the input is positive, it keeps it; if negative, it turns it to zero. This functionality makes it incredibly useful in numerous scenarios.

We also have another one called Leaky ReLU (Leaky Rectified Linear Unit) which is a clever twist on the regular ReLU. While ReLU sets all negative inputs to zero, Leaky ReLU allows a small, non-zero, constant output for negative inputs. Imagine it as a slightly open faucet, letting a tiny trickle of water (or in our case, data) through, even when it’s mostly turned off.

The last one we’ll discuss, which has become more popular recently, is the Swish function. It multiplies the input by its own sigmoid, giving a smooth curve that behaves like ReLU for large positive inputs but still lets small negative values through.

There‘s a whole universe of other activation functions out there, each with unique characteristics. But these are some of the most popular and versatile ones. (read more about them here)
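If it helps to see them side by side, here's a quick sketch of each of these activation functions in plain NumPy (the 0.01 slope for Leaky ReLU is just a common default, not a required value):

```python
import numpy as np

def binary_step(x):
    return np.where(x >= 0, 1.0, 0.0)

def linear(x):
    return x

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, slope=0.01):
    # lets a small trickle through for negative inputs instead of zeroing them out
    return np.where(x >= 0, x, slope * x)

def swish(x):
    # the input multiplied by its own sigmoid
    return x * sigmoid(x)

for f in (binary_step, linear, sigmoid, tanh, relu, leaky_relu, swish):
    print(f.__name__, f(np.array([-2.0, 0.0, 2.0])))
```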
The cool thing about activation functions is that they can be tailored to our specific problem. For instance, if we’re predicting something continuous, like the price of a house (a regression problem), the rectifier function is a great pick. It only gives non-negative outputs, aligning well with the fact that house prices aren’t negative. But if we’re estimating probabilities, like the chances of a house selling, the sigmoid function is our go-to, with its neat 0 to 1 range mirroring probability values.
Let’s go ahead and choose the activation function to be a rectifier function in our neuron because that seems to make the most sense for our problem.

And this, folks, is considered a neural network model (!), albeit the simplest form of one. It consists of just 1 neuron, but it’s a great place to start nonetheless.
The next thing we need to figure out is what the values of the weights and bias terms should be. We know they are constant terms, but what should their values be?
Remember, we discussed training the neural network earlier? All that means is determining the optimal values for our weights and bias terms. We’ll get into specifics of exactly how this training happens later.
For now, let’s assume that we trained our neural network and obtained the optimal values. So, let’s replace the terms with these optimal values.

And this is what we call a trained neural network that is ready to be put into action. Essentially, this means we have used the available data to find the most effective one-neuron model we can. Now, we can make predictions about house prices by inputting the relevant features of the house whose value we are trying to determine.
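If you prefer seeing this as code, here's a minimal sketch of our one-neuron model. The weights, bias, and house features below are stand-ins, not the actual trained values from the diagram (we'll walk through the real numbers by hand next), but the mechanics are exactly the two "magic" steps above:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def predict_price(features, weights, bias):
    """One-neuron model: weighted sum + bias, passed through the rectifier."""
    summation = np.dot(weights, features) + bias   # Part 1: summation
    return relu(summation)                         # Part 2: activation

# Stand-in "trained" values (not the article's actual numbers):
weights = np.array([400.0, -50_000.0, -2_000.0, 30_000.0])  # size, location, age, bedrooms
bias = 50_000.0

house = np.array([2100, 1, 12, 3])  # features of a house we're curious about
print(predict_price(house, weights, bias))
```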
Let’s try predicting the price of the first house in our training dataset.


When we pass in the inputs, Part 1 of the magical data processing is the summation…

…and Part 2 is passing this summation value through the rectifier function:

Essentially our model takes the features of the first house as input and predicts a price of $1,036,000 based on those features. In other words, it’s saying, "Given these house features, I predict the price of the house to be $1,036,000."

But when we compare it to the actual house price of $1.75M, it’s not that great of a prediction, unfortunately. We’re $714,000 off. Yikes.
If we input the remaining houses into this simple model, we will obtain the following predicted prices:

And as we can see, the predicted prices are all quite inaccurate. This indicates that our model is not very effective, which is understandable considering its lack of sophistication. It consists of only one neuron. Just like the human brain, it’s only when neurons collaborate that they make more impactful decisions and process data with greater sophistication.
Let’s take a step back and consider if there is a more intuitive way to solve this problem. Perhaps there is a way to enhance our predictions by considering the interactions between different features. Maybe the combination of two features is more significant than the individual features alone?
For example, the combination of bedrooms and size could be valuable. It’s possible that a smaller house with many rooms might feel cramped, making it less appealing to buyers and resulting in a lower price. Similarly, the combination of age and location could be important. In urban areas, newer houses tend to be more expensive, while in rural areas, buyers may prefer the charm of older houses, which can increase their value. It is also possible that older houses in rural areas are more renovated. Moreover, the combination of location, size, and bedrooms can be interesting. In suburban and rural areas, having more bedrooms in smaller houses may not be favorable. However, in urban areas, where people prefer proximity to the city for work while still having enough space for their families, they may be willing to pay more for smaller houses as long as they have sufficient bedrooms.
The possibilities are endless, and it’s challenging to consider all the different combinations. Fortunately, this is where we leverage the power of multiple neurons. Similar to how biological neurons collaborate to make better decisions, artificial neurons also work together to achieve the same goal.
Let’s make our simple neural network more powerful by adding two more neurons to it. This will create a cobweb-like structure:

In this case, all the inputs are being fed into each of the 3 neurons. Since we have inputs going into 3 neurons and we know each input is associated with a weight, there will be a total of 12 (= 4 * 3) different weights. To keep them separate, let’s introduce some notation.
The weights are represented by wᵢⱼ, where i is the neuron number and j is the number of the input going into it. So for instance, this highlighted weight…

…is labeled w₁₂ because it’s the 2nd input to the 1st neuron. And this highlighted weight…

…is labeled w₃₄ because it’s the 4th input to the 3rd neuron. Similarly, here are all the weights labeled:

These weights can take on any value, which is determined during the training process.
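A handy way to keep track of all 12 weights in code is a 3-by-4 matrix, where row i holds the weights going into neuron i and column j corresponds to input j. Here's a quick sketch (the values, and the assumption that the inputs are ordered size, location, age, bedrooms, are placeholders, not trained values):

```python
import numpy as np

# Rows = neurons 1..3, columns = inputs 1..4 (size, location, age, bedrooms).
# These numbers are placeholders; training would determine the real ones.
W = np.array([
    [0.2, 1.1, 0.5, 5.0],   # weights into neuron 1: w₁₁ w₁₂ w₁₃ w₁₄
    [0.3, 1.5, 0.7, 2.0],   # weights into neuron 2: w₂₁ w₂₂ w₂₃ w₂₄
    [0.1, 4.0, 1.2, 3.0],   # weights into neuron 3: w₃₁ w₃₂ w₃₃ w₃₄
])

# w₃₄ (the 4th input to the 3rd neuron) is W[2, 3] with zero-based indexing:
print(W[2, 3])
```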
Let’s say the training process determined that only the bedrooms and size features are relevant for neuron 1, while the other 2 features are not considered. In that case, the weights for location and age going into the first neuron will be 0. Similarly, let’s say only the bedrooms, size, and location are important for the second neuron and age is ignored, so the weight for age going into the 2nd neuron is 0. Meanwhile, the third neuron only considers location and age as important features, while bedrooms and size are given a weight of 0.
The resulting neural network will look something like this:

Similarly, the training process will also produce optimal bias values. So let’s go ahead and add them here too (let’s also remove the inputs with weights = 0 just to make the diagram more readable):

You probably notice something odd: we have 3 outputs here. However, we only want one output, which is the predicted price. Therefore, we need to find a way to combine the outputs from the 3 neurons into one. To do this, let’s add another neuron after these 3, at the end of the network.

The structure remains the same as the previous ones, but instead of our 4 features being fed into the neuron as inputs, the outputs from the previous neurons are now used as inputs for the new neuron.
Terminology segue: Each layer is numbered, with the input layer typically being labeled as 0. The final layer is referred to as the output layer, while any layer situated between the input and output layer is considered a hidden layer.

And remember, every input is accompanied by a corresponding weight. Therefore, even these inputs to the new neuron will have weights, which are also determined during the training process, as is the new bias. As a result, the new neural network (assuming it’s fully trained) will have the following optimal values:

Now let’s move on to the activation functions. For this case, we’ll set all of them to be equal to the rectifier function. Generally, we have the flexibility to choose different activation functions based on the problem we are trying to solve. However, since the rectifier function is commonly used, let’s just go with that now.

NOTE: Usually, all neurons in the same layer use the same activation function.
Okay, finally the fun part. We trained our neural network with all the optimal bias and weight values. Now it’s time to take this baby for a spin and see how well it does in predicting house prices.
Let’s pass the features of our first house through this neural network again.

We’ll clarify the process by highlighting the activated inputs and neurons at each step.



And finally, using the outputs from the hidden layer and passing them through the output layer:

And that’s how we use this neural network to get outputs! This process of passing in inputs to get an output is called forward propagation.
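Here's what forward propagation through our 3-neuron hidden layer and 1-neuron output layer looks like in code. The weight and bias numbers below are placeholders (the actual trained values are in the diagrams above), and the feature ordering is an assumption, but the flow is exactly what we just did by hand:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def forward_propagation(x, W1, b1, W2, b2):
    """Pass the inputs through the hidden layer, then through the output layer."""
    hidden = relu(W1 @ x + b1)        # 3 hidden neurons: summation + ReLU each
    output = relu(W2 @ hidden + b2)   # output neuron combines the 3 hidden outputs
    return output

# Placeholder "trained" values (shapes match our network: 4 inputs -> 3 hidden -> 1 output).
# The zeros mirror the features each hidden neuron ignores in our example.
W1 = np.array([[0.2, 0.0, 0.0, 5.0],
               [0.3, 1.5, 0.0, 2.0],
               [0.0, 4.0, 1.2, 0.0]])
b1 = np.array([10.0, -5.0, 3.0])
W2 = np.array([[800.0, 1200.0, 650.0]])
b2 = np.array([20_000.0])

x = np.array([2100, 1, 12, 3])  # size, location, age, bedrooms (made up)
print(forward_propagation(x, W1, b1, W2, b2))
```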
We’ll repeat the same process for the rest of the houses:

Let’s compare these new predicted prices to the old predicted prices made by the neural network with just one neuron.

From just eyeballing it, it appears that the new predictions are performing better than the old ones. But what if we want to find a single number that quantifies how off our predictions are from the actual value?
This is where a cost function comes into play. A cost function tells us how far off our predictions are from the actual values. Depending on the type of prediction, we can use different cost functions. But for this problem, we’ll use one called the Mean Squared Error (MSE). The MSE allows us to a) measure the deviation of our predictions from the actual price and b) compare predictions made by different models.
It calculates the average of the squares of the differences between the predicted house prices and the actual house prices. Mathematically:

Terminology segue: It is common notation to refer to the actual price as "y" and the predicted price as "y hat" (denoted that way because the little notation on the top of the "y" looks like a hat)
The objective is to minimize the MSE. The closer MSE is to 0, the better our model is at predicting prices.
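In code, the MSE really is just "square the differences, then average them." A quick sketch with made-up actual and predicted prices:

```python
import numpy as np

def mean_squared_error(y, y_hat):
    """Average of the squared differences between actual (y) and predicted (y_hat) prices."""
    return np.mean((y - y_hat) ** 2)

# Made-up actual vs. predicted prices, just to show the calculation:
actual    = np.array([1_750_000, 620_000, 2_400_000])
predicted = np.array([1_036_000, 700_000, 2_100_000])

print(mean_squared_error(actual, predicted))
```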
So, using this formula, we can calculate the MSE of the old one-neuron model as:

Ugh…that’s a super gnarly number. This just confirms that our first model was pretty bad (cough horrendous cough).
Similarly, the MSE of the new, more complex model:

Still pretty bad but at least a little better than the previous MSE.
But we can consider creating a better model.
One approach is to add more neurons to the existing layer to improve the prediction power. Like this:

Or we could add an entirely new hidden layer:

Alternatively, we can place different activation functions at different layers:

As you can see, the possibilities are endless. We can adjust the complexity of our neural network to meet our specific needs. These different possibilities are called neural network architectures. We can customize the number of layers, the neurons at each layer, and the activation functions to fit the data and problem we’re trying to solve, making it as simple or complex as needed.
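If you're curious what picking an architecture looks like in practice, libraries like Keras let us spell out these choices directly: how many layers, how many neurons per layer, and which activation each layer uses. A sketch (assuming TensorFlow/Keras is installed; the layer sizes and activations here are arbitrary picks, not a recommendation for this problem):

```python
from tensorflow import keras
from tensorflow.keras import layers

# One possible architecture for our house-price problem:
# 4 input features -> two hidden layers -> 1 predicted price.
model = keras.Sequential([
    layers.Input(shape=(4,)),             # size, location, age, bedrooms
    layers.Dense(3, activation="relu"),   # hidden layer 1: 3 ReLU neurons
    layers.Dense(3, activation="tanh"),   # hidden layer 2: a different activation
    layers.Dense(1, activation="relu"),   # output layer: one predicted price
])

model.summary()
```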
Now that we understand how a neural network works, the next article (up now woohoo) will focus on understanding how it learns the optimal bias and weight values aka the training process!
Deep Learning Illustrated, Part 2: How Does a Neural Network Learn?
As always, feel free to connect with me on LinkedIn for any comments/questions!
Unless specified, all images are by the author.