Machine Learning 101 — Artificial Neural Networks

This post explains Artificial Neural Networks in general, and the underlying calculations in particular.

ayşe bilge gündüz
Towards Data Science


https://pixabay.com/images/id-1343877/

Artificial Neural Networks

The first thing you’ll learn about Artificial Neural Networks (ANN) is that they come from the idea of modeling the brain. So the terms we use in ANN are closely related to those of biological neural networks, with slight changes.

  • A neuron is still called a neuron in AI,
  • but a dendrite is called an input,
  • an axon is called an output,
  • and a synapse is called a weight.

At the beginning, this much jargon is quite enough.

Single Layer Perceptron

The Single Layer Perceptron is the most primitive version of an ANN. ANNs are inspired by the functionality of biological neural circuits[1], so a single layer perceptron naturally represents one such circuit: a single neuron that takes weighted inputs and produces a calculated output.

Fig.1 A representation of a Single Layer Perceptron with two inputs and one output
Fig. 2 The general equation of single layer perceptron

In a single layer perceptron, x represents the inputs, w represents the weights, and θ represents the threshold value.

Fig.3 The calculation according to Fig.1 representation

I have tried to explain the equation of the single layer perceptron as simply as possible; it gets more complicated after this point. The output value y is shaped by the type of activation function. There are many activation functions, but at the beginning three of them are more than enough.

1. Step Function

Depending on the value of the weighted sum, the output will be either 0 or 1.

2. Sign Function

The output will be either -1 or +1, depending on the sign of the weighted sum.

3. Sigmoid Function

In this approach, the perceptron acts like a logistic regression model. The sigmoid function normalizes values into the [0, 1] range, so that strongly negative values are pushed towards 0 and strongly positive values towards 1. Thus you don’t need to deal with negative values or very large numbers, which may cause miscalculation.
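As a quick reference, here is a minimal NumPy sketch of these three functions (the helper names are mine, not part of any library):

```python
import numpy as np

def step(v):
    # Step function: 1 if the weighted sum is above 0, otherwise 0
    return np.where(v > 0, 1, 0)

def sign(v):
    # Sign function: +1 for non-negative sums, -1 for negative sums
    return np.where(v >= 0, 1, -1)

def sigmoid(v):
    # Sigmoid: squashes any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-v))

# Example: weighted sum of two inputs minus the threshold
x, w, theta = np.array([1, 0]), np.array([0.3, -0.1]), 0.02
v = np.dot(x, w) - theta              # 0.28
print(step(v), sign(v), sigmoid(v))   # 1 1 ~0.57
```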

Training in Single Layer Perceptron

  1. Inputs and related output values are given to the model: D: <x, y>.
  2. All weight values are initialized in some way, let’s say randomly in the range [-0.5, 0.5] at this point.
  3. An output is calculated from the current values, and the error is calculated by comparing it with the expected output. The learning rate (0 < 𝜇 ≤ 1) defines the step size. It is often set close to 1 at the beginning, when we need to take bigger steps towards the optimum; as we get closer, the steps become smaller.
Fig.4 The equations to train a single layer perceptron

4. The error is corrected according to the threshold by changing the weight values (the learning rule is sketched after these steps).

5. Go back to step 3 and continue with the next epoch.
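A single pass over one training sample can be sketched in plain Python as follows; the function name and signature are just for illustration:

```python
def perceptron_step(x, w, theta, d, lr):
    """One incremental training step of a single layer perceptron."""
    # Forward pass: weighted sum minus the threshold, then step activation
    y = 1 if sum(xi * wi for xi, wi in zip(x, w)) - theta > 0 else 0
    # Learning rule: delta_w = lr * (expected - calculated) * input
    return [wi + lr * (d - y) * xi for wi, xi in zip(w, x)]
```

Running this step over every sample in the training set once corresponds to one epoch.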

Example 1: Let’s map the OR operation to a single layer perceptron by using a step function.

Here are the possible inputs and outputs of the OR operation:

x1 = 0, x2 = 0 → y = 0
x1 = 0, x2 = 1 → y = 1
x1 = 1, x2 = 0 → y = 1
x1 = 1, x2 = 1 → y = 1

accepted default values : 𝜃 = 0.02 ; 𝜇=0.1 ; w1=0.3 ; w2=-0.1

1st Epoch:

y = x1*w1 + x2*w2 - 𝜃

y1 = 0.3*0 + (-0.1)*0 - 0.02 = -0.02 < 0 → y1 = 0

y2 = 0.3*0 + (-0.1)*1 - 0.02 = -0.12 < 0 → y2 = 0

y3 = 0.3*1 + (-0.1)*0 - 0.02 = 0.28 > 0 → y3 = 1

y4 = 0.3*1 + (-0.1)*1 - 0.02 = 0.18 > 0 → y4 = 1

The calculated output and the expected output differ only for y2 (the expected output of (0,1) is 1). So we need to calculate the weight changes with the learning rate accordingly.

(Here, Δwij denotes the change of weight wi calculated on sample j.)

Δw11 = 0.1 * (0-0) * 0 = 0 ; Δw21 = 0.1 * (0-0) * 0 = 0 (sample 1)

Δw12 = 0.1 * (1-0) * 0 = 0 ; Δw22 = 0.1 * (1-0) * 1 = 0.1 (sample 2)

Δw13 = 0 ; Δw23 = 0 ; Δw14 = 0 ; Δw24 = 0 (samples 3 and 4 are already correct)

The latest weight values are calculated as:

w1’ = w1 + Δw12 = 0.3 + 0 = 0.3

w2’ = w2 + Δw22 = (-0.1) + 0.1 = 0

2nd Epoch:

The outputs are calculated again;

y1 = 0.3*0 + 0*0 -0.02 = -0.02 <0 → y1 = 0

y2 = 0.3*0 + 0*1 -0.02 = -0.02 <0 → y2 = 0

y3 = 0.3*1 + 0*0 -0.02 = 0.28 >0 → y3 = 1

y4 = 0.3*1 + 0*1 -0.02 = 0.28 > 0 → y4 = 1

Δw11 = 0 ; Δw21 = 0 ; Δw13 = 0 ; Δw23 = 0 ; Δw14 = 0 ; Δw24 = 0 (samples 1, 3 and 4 are already correct)

Δw12 = 0.1 * (1-0) * 0 = 0 → w1’ = 0.3 + 0 = 0.3

Δw22 = 0.1 * (1-0) * 1 = 0.1 → w2’ = 0 + 0.1 = 0.1

We still haven’t reached the right weight values, so more tuning is needed.

3rd Epoch:

y1 = 0.3*0 + 0.1*0 -0.02 = -0.02 <0 → y1 = 0

y2 = 0.3*0 + 0.1*1 -0.02 = 0.08 >0 → y2 = 1

y3 = 0.3*1 + 0.1*0 -0.02 = 0.28 >0 → y3 =1

y4 = 0.3*1 + 0.1*1 -0.02 = 0.38 > 0 → y4 = 1

Δw11 = 0 ; Δw21 = 0 ; Δw12 = 0 ; Δw22 = 0 ;

Δw13 = 0 ; Δw23 = 0 ; Δw14 = 0 ; Δw24 = 0

It seems we have found the right weight values. This is how a single layer perceptron works. It is essential to understand this model, since more complex models are built on top of the single layer perceptron.
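To double-check the epochs above, here is a small sketch that trains a single layer perceptron on the OR data with the same starting values (𝜃 = 0.02, 𝜇 = 0.1, w = [0.3, -0.1]); with incremental updates it ends up at w = [0.3, 0.1] and no remaining errors, just like the third epoch above.

```python
# OR truth table: ((x1, x2), expected output)
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, theta, lr = [0.3, -0.1], 0.02, 0.1

for epoch in range(1, 11):
    errors = 0
    for (x1, x2), d in data:
        # Forward pass with a step activation
        y = 1 if x1 * w[0] + x2 * w[1] - theta > 0 else 0
        if y != d:
            errors += 1
            # Perceptron learning rule, applied right after the sample
            w[0] += lr * (d - y) * x1
            w[1] += lr * (d - y) * x2
    print(f"epoch {epoch}: w = {w}, errors = {errors}")
    if errors == 0:
        break
```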

Multi Layer Perceptron

A single layer perceptron was enough to solve the OR operation, but when it comes to the XOR operation things change: a single layer perceptron is not enough to solve the problem by itself. This is where the idea of the multi layer perceptron (MLP) comes up. A multi layer perceptron consists of 3 layers, as can be seen in Fig. 5: an Input Layer, a Hidden Layer, and an Output Layer.

Fig.5 An illustration of Multi Layer Perceptron

An MLP is also a good starting point for deep learning[2]. It is known as a deep artificial neural network, although of course not as deep as a Convolutional Neural Network.

Back Propagation

In an MLP, the calculation of each layer serves as the input of the next layer. Backpropagation seems very complicated, but it is not: while calculating the error, the weights are rearranged by calculating backwards through the layers.

In the backward pass, using backpropagation and the chain rule of calculus, partial derivatives of the error function w.r.t. the various weights and biases are back-propagated through the MLP. That act of differentiation gives us a gradient, or a landscape of error, along which the parameters may be adjusted as they move the MLP one step closer to the error minimum. This can be done with any gradient-based optimisation algorithm such as stochastic gradient descent. The network keeps playing that game of tennis until the error can go no lower. This state is known as convergence [2].

Fig. 6 E represents the Euclidean distance between y and d and y_t represents the error of the sample t
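For a single sample, the error E in Fig. 6 is commonly written as half the squared distance between the expected output d and the calculated output y. Here is a tiny sketch of that error and of the gradient the backward pass propagates; this is one common convention, not necessarily the exact form in the original figure:

```python
def squared_error(d, y):
    # E = 1/2 * (d - y)^2: half the squared distance between target and output
    return 0.5 * (d - y) ** 2

def error_gradient(d, y):
    # dE/dy = -(d - y): the quantity backpropagation pushes through the layers
    return -(d - y)
```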

Example 2: Let’s map the XOR operation with multi layer perceptron by using a step function.

Here are the possible inputs and outputs of the XOR operation:

x1 = 0, x2 = 0 → y = 0
x1 = 0, x2 = 1 → y = 1
x1 = 1, x2 = 0 → y = 1
x1 = 1, x2 = 1 → y = 0

When you check the table, you will see why a single layer perceptron cannot handle this case. For the input (0,0) the output should be 0, which requires 𝜃 ≥ 0. For the inputs (0,1) and (1,0) the output should be 1, which requires x2*w2 > 𝜃 and x1*w1 > 𝜃, so both weights must be larger than the threshold. For the input (1,1), however, the output should be 0, which requires x1*w1 + x2*w2 ≤ 𝜃, and that is impossible when both weights are larger than 𝜃 ≥ 0. So, in this case, a Single Layer Perceptron cannot be a solution, and a Multi Layer Perceptron is recommended instead.
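If you want to convince yourself numerically, a quick brute-force sketch over a grid of candidate weights and thresholds finds no single layer perceptron with a step function that classifies all four XOR cases correctly (the grid values are arbitrary):

```python
import itertools

xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
grid = [i / 10 for i in range(-20, 21)]  # candidate values from -2.0 to 2.0

solutions = [
    (w1, w2, theta)
    for w1, w2, theta in itertools.product(grid, repeat=3)
    if all((1 if x1 * w1 + x2 * w2 - theta > 0 else 0) == d
           for (x1, x2), d in xor_data)
]
print(len(solutions))  # 0: no combination separates XOR linearly
```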

w13 = 0.5 ; w23 = 0.4 ; w14 = 0.9 ; w24 = 1

w35 = -1.2 ; w45 = 1.1

𝜃3 = 0.8 ; 𝜃4 = -0.1 ; 𝜃5 = 0.3

𝜇 = 0.1

Let’s accept the above values for the beginning. In the XOR operation, when x1 = x2 = 1, the expected output is y = 0. Since the values are continuous, we can use the sigmoid function.

y3 = sigmoid(x1*w13 + x2*w23 - 𝜃3) = 1/(1 + e^-(0.5*1 + 0.4*1 - 0.8)) = 0.525

y4 = sigmoid(x1*w14 + x2*w24 - 𝜃4) = 0.881

y5 = sigmoid(y3*w35 + y4*w45 - 𝜃5) = 1/(1 + e^-(0.525*(-1.2) + 0.881*1.1 - 0.3)) = 0.51 → actual output

Error = Ɛ = (0 - 0.51) = -0.51

Output Layer:

𝜍5 = y5*(1-y5)*Ɛ = 0.5097*(1-0.5097)*(-0.5097) ≈ -0.13

Δw35 = 𝜍5*y3*𝜇 = -0.13 * 0.525 * 0.1 ≈ -0.0068

Δw45 = 𝜍5*y4*𝜇 = -0.13 * 0.881 * 0.1 ≈ -0.011

w35' = w35 + Δw35 = -1.2 + (-0.0068) = -1.2068

w45' = w45 + Δw45 = 1.1 + (-0.011) = 1.089

Hidden Layer:

𝜍3 = y3*(1-y3)*𝜍5*w35 → Δw23 = 𝜇*x2*𝜍3 ; Δw13 = 𝜇*x1*𝜍3

𝜍3 = 0.525*(1-0.525)*(-0.13)*(-1.2) ≈ 0.039

Δw23 = 0.039*1*0.1 = 0.0039 ; Δw13 = 0.039*1*0.1 = 0.0039

w23' = 0.4 + 0.0039 = 0.4039 ; w13' = 0.5 + 0.0039 = 0.5039

𝜍4 = y4*(1-y4)*𝜍5*w45 → Δw24 = 𝜇*x2*𝜍4 ; Δw14 = 𝜇*x1*𝜍4

𝜍4 = 0.881*(1-0.881)*(-0.13)*1.1 ≈ -0.015

Δw24 = -0.015*1*0.1 = -0.0015 ; Δw14 = -0.015*1*0.1 = -0.0015

w24' = 1 + (-0.0015) = 0.9985 ; w14' = 0.9 + (-0.0015) = 0.8985

New updated weights:

w13 = 0.5039 ; w23 = 0.4039 ; w14 = 0.8985 ; w24 = 0.9985 ; w35 = -1.2068 ; w45 = 1.089

P.S.: We keep recalculating the weights until we find values that produce the right results. How many neurons a hidden layer should contain, and how many hidden layers the network should have, are questions to be answered by the architect of the network. For a classical artificial neural network, the maximum number of hidden layers is 3.
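The forward and backward pass above can be reproduced with a short script. It follows the same rules (sigmoid activations, 𝜇 = 0.1, thresholds kept fixed) and prints values that match the hand calculation up to rounding; the variable names mirror the w13, w23, ... notation of the example.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# Initial weights and thresholds from the example
w13, w23, w14, w24 = 0.5, 0.4, 0.9, 1.0
w35, w45 = -1.2, 1.1
t3, t4, t5 = 0.8, -0.1, 0.3
lr = 0.1

x1, x2, d = 1, 1, 0  # XOR(1, 1) = 0

# Forward pass
y3 = sigmoid(x1 * w13 + x2 * w23 - t3)  # ~0.525
y4 = sigmoid(x1 * w14 + x2 * w24 - t4)  # ~0.881
y5 = sigmoid(y3 * w35 + y4 * w45 - t5)  # ~0.510, the actual output
err = d - y5                            # ~-0.510

# Backward pass: output layer delta, then hidden layer deltas
d5 = y5 * (1 - y5) * err                # ~-0.127
d3 = y3 * (1 - y3) * d5 * w35           # ~0.038
d4 = y4 * (1 - y4) * d5 * w45           # ~-0.015

# Weight updates
w35 += lr * d5 * y3                     # ~-1.2067
w45 += lr * d5 * y4                     # ~1.0888
w13 += lr * d3 * x1                     # ~0.5038
w23 += lr * d3 * x2                     # ~0.4038
w14 += lr * d4 * x1                     # ~0.8985
w24 += lr * d4 * x2                     # ~0.9985

print(y5, w13, w23, w14, w24, w35, w45)
```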

Weight Update

1. Incremental Mode

Calculate the new weights after every training sample.

2. Batch Mode

Calculate the weight changes over the whole training set and apply them together once per epoch.

Momentum: this term connects the weight change at moment t to the change at moment (t-1). It mostly helps to smooth the updates.
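A minimal sketch of the momentum idea (α is the momentum coefficient; the names and the default value are illustrative, not from the original post):

```python
def momentum_update(w, grad_step, prev_delta, alpha=0.9):
    # New change = current gradient step + a fraction of the previous change
    delta = grad_step + alpha * prev_delta
    return w + delta, delta  # updated weight and the change to reuse at step t+1
```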

Conclusion

Artificial Neural Networks (ANNs) are a good starting point for Deep Learning. In this post, I intended to explain the types of ANN and the basic calculations used to adjust the weight values at their core.

References:

  1. Tensorflow Single Layer Perceptron, https://www.tutorialspoint.com/tensorflow/tensorflow_single_layer_perceptron.htm
  2. Multi Layer Perceptron, https://pathmind.com/wiki/multilayer-perceptron
