
The Basic Building Block of Neural Networks

Exploring the Dense layer from Keras, all the way down to the source code

The Dense layer (a regular fully-connected layer) is probably the most widely used and well-known Neural Network layer. It is the basic building block of many Neural Network architectures.

Understanding the Dense layer gives a solid base for exploring other types of layers and more complicated network architectures. Let’s dive deep into the Dense layer, down to the code that implements it.

In this article, I’m using Keras (https://keras.io/) to explore the layer implementation and its source code, but in general, most types of layers are quite generic and the main principles don’t depend much on the particular library implementing them.

Dense layer overview

Let’s start by taking a look at a visual representation of such a layer:

Dense layer representation (created by the author)

In this example, the Dense layer has 3 inputs, 2 units (and outputs) and a bias. Let’s take a look at each of these.

Layer inputs are represented here by x1, x2, x3. This is where data comes in – these can be either input feature values or the output of the previous layer. Technically these can be any numerical values, but in most cases, input values will be normalized to the interval [-1, 1]. Normalization can be done manually or using special layers (e.g. the BatchNormalization layer in Keras).

Big circles represent the units. This is where input values are converted to outputs. Outputs are represented by y1 and y2 in the visualization. The number of outputs always matches the number of units. The Dense layer is fully-connected, meaning that it connects every input to every output. This naturally means that every input value affects (or at least can affect – if the corresponding weight value is not zero) each output value.

Conversion from inputs to outputs is defined by the weights and the activation function. Each unit computes a weighted sum of the input values x1, x2, x3 – the weights are represented as w11, w21, w31 in the visualization – and passes it through the activation function to obtain its output value, e.g. y1. Weights give models the ability to learn – essentially, what a NN model learns are the values of the weights in all its layers.

There is one more thing in the visualization – the red number "1" connected to all units, representing the "bias". The bias captures some external influence on the output value that is not covered by the features provided in the input.

Keras logo (from keras.io)

Dense layer in Keras

Now let’s tie the Dense layer implementation in Keras to our visualization. In Keras, the Dense layer is defined as follows:

tf.keras.layers.Dense(
    units,
    activation=None,
    use_bias=True,
    kernel_initializer="glorot_uniform",
    bias_initializer="zeros",
    kernel_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    bias_constraint=None,
    **kwargs
)
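
As a quick illustration – a minimal sketch, with made-up sample values – here is how the layer from our visualization could be created and called:

import tensorflow as tf

# A Dense layer matching the visualization: 2 units, bias enabled by default,
# no activation function.
layer = tf.keras.layers.Dense(units=2)

# The number of inputs (3 in the visualization) is fixed only when the layer
# first sees data - here a batch with one sample of 3 (made-up) feature values.
x = tf.constant([[0.5, -0.2, 0.1]])
y = layer(x)

print(layer.kernel.shape)  # (3, 2) - one weight per input/unit pair
print(layer.bias.shape)    # (2,)   - one bias weight per unit
print(y.shape)             # (1, 2) - one output per unit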

From the mathematical point of view, the Dense layer implements a matrix multiplication between the feature values and the weights.

outputs = mat_mul(inputs, weights)

I won’t explain the math behind matrix multiplication here; I’ll just show how the base value of the first output would be computed:

y1 = x1*w11 + x2*w21 + x3*w31

Note that in reality this is not the final output value, as it can be further modified by the activation function and/or the bias vector, which we’ll take a look at later in this article.
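
To make this concrete, here is a minimal NumPy sketch with made-up numbers that computes these base values for both units of our example layer:

import numpy as np

# Made-up values for the 3 inputs and the 3x2 weight matrix.
inputs = np.array([1.0, 2.0, 3.0])         # x1, x2, x3
weights = np.array([[0.1, 0.4],            # w11, w12
                    [0.2, 0.5],            # w21, w22
                    [0.3, 0.6]])           # w31, w32

outputs = inputs @ weights                 # matrix multiplication

# y1 = x1*w11 + x2*w21 + x3*w31 = 1.0*0.1 + 2.0*0.2 + 3.0*0.3 = 1.4
print(outputs)  # [1.4 3.2]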

Now, let’s go through the parameters of the Keras Dense layer to see what they mean and how they impact the layer.

Units

The first (and the only mandatory) argument is "units".

  • units: Positive integer, dimensionality of the output space.

"Units" specifies the number of units (and the number of outputs) of the layer. In our example visualization units=2. Note, that Keras Dense layer doesn’t have an argument for specifying the number of inputs. In Keras, the number of inputs is defined by the number of outputs of the previous NN layer.

Practically, the number of units defines "the width" of the Neural Network. This is an important parameter to tune when designing an NN architecture for a particular task. Too many units in dense layers may lead to overfitting, especially if the network is also deep (has many layers). Too few units, on the other hand, can limit the network’s ability to learn the underlying patterns. Typically, the more units are used, the more regularization should be applied.
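
As an example – a minimal sketch where the layer widths of 16 and 1 are arbitrary choices – "units" sets the width of each layer, and the input size of the second Dense layer is inferred automatically:

import tensorflow as tf

# "units" sets the width of each layer.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),       # 3 input features
    tf.keras.layers.Dense(units=16),  # hidden layer, 16 units wide
    tf.keras.layers.Dense(units=1),   # output layer, 1 unit
])

# The second Dense layer automatically gets 16 inputs -
# the number of outputs of the previous layer.
model.summary()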

Activation function

Argument "activation" is optional and defines the activation function.

  • activation: Activation function to use. If you don’t specify anything, no activation is applied (ie. "linear" activation: a(x) = x).

If we look at the source code of the Keras library, the activation function is applied to the outputs right before they are returned.

if activation is not None:
    outputs = activation(outputs)

If the activation function is not specified (the default value is None), outputs are returned as-is. In practice, typically you will need some activation function to make the NN predict reasonable values. There are many possible activation functions, with different characteristics. An overview of these goes beyond the scope of this article.
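
For example – a small sketch with arbitrary sample input values – the only difference between these two layers is the activation argument:

import tensorflow as tf

# Identical layers except for the activation function.
linear_layer = tf.keras.layers.Dense(2)                   # default: a(x) = x
relu_layer = tf.keras.layers.Dense(2, activation="relu")  # a(x) = max(0, x)

x = tf.constant([[1.0, -2.0, 3.0]])
print(linear_layer(x))  # raw weighted sums, may be negative
print(relu_layer(x))    # negative weighted sums are clipped to 0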

Bias

As we saw from the layer definition, bias is used by default in Keras, but optionally can be turned off.

  • use_bias: Boolean, whether the layer uses a bias vector.

In our visualization, the bias is shown in red. There are several ways of visualizing it – I like to think of it as an extra feature with a constant value of "1". Note that bias weights are still calculated and optimized for all units, but the impact of the bias vector on the outputs is not affected by the actual feature values.

To understand better why the bias is needed at all, let’s consider an input sample with all feature values equal to "0". Without the bias vector, all the outputs would always be "0" as well – whatever the weight values are, they give zero when multiplied by a "0" feature value. Therefore, without the bias the layer wouldn’t be able to produce a meaningful output for such an input.

In practice, most of the time you would not need to remove the bias vector.

If we look at how the Keras implementation of the Dense layer uses the use_bias parameter, we find the following (some code parts are skipped for better readability):

if use_bias:
    bias = self.add_weight(
        'bias',
        shape=[self.units,],
        ...
        trainable=True)
else:
    bias = None
if bias is not None:
    outputs = nn_ops.bias_add(outputs, bias)

So if we use bias, the bias weights are added to the output values. Note that this happens before the activation function is applied, so the whole formula in pseudo-code looks as follows:

outputs = activation(dot(input, kernel) + bias)
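
We can check this formula by reproducing the layer’s outputs by hand – a small sketch with arbitrary sample values:

import numpy as np
import tensorflow as tf

layer = tf.keras.layers.Dense(2, activation="relu")
x = np.array([[0.5, -1.0, 2.0]], dtype=np.float32)
keras_out = layer(x).numpy()

# Recompute the same outputs by hand from the layer's weights:
# relu(dot(input, kernel) + bias)
kernel, bias = layer.get_weights()
manual_out = np.maximum(0.0, x @ kernel + bias)

print(np.allclose(keras_out, manual_out))  # True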

Initializers

There are two initializer arguments for the Dense layer – kernel (weights) and bias initializers.

  • kernel_initializer: Initializer for the kernel weights matrix.
  • bias_initializer: Initializer for the bias vector.

Simply speaking, these define the initial values of all the weights – both for the kernel (feature weights) and for the bias vector.

The model will change the layer weights in the process of training. However, if we choose the initial weights reasonably, the model can converge to the optimal values faster.

By default, the kernel weights will be initialized randomly using the Glorot uniform distribution, and the bias weights will be initialized with zeros.
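
Both initializers can be overridden when constructing the layer – a minimal sketch, where he_normal and the constant 0.1 are just example choices:

import tensorflow as tf

layer = tf.keras.layers.Dense(
    units=2,
    kernel_initializer="he_normal",                        # instead of glorot_uniform
    bias_initializer=tf.keras.initializers.Constant(0.1),  # instead of zeros
)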

Regularizers

There are 3 regularizer arguments for the Dense layer:

  • kernel_regularizer: Regularizer function applied to the kernel weights matrix.
  • bias_regularizer: Regularizer function applied to the bias vector.
  • activity_regularizer: Regularizer function applied to the output of the layer (its "activation").

The primary task of regularization is to control the weight and output values – to keep them as small as possible. Which values are being controlled depends on the regularizer type.

The kernel regularizer penalizes large values in the kernel weights. Why is it important to avoid huge weight values? Simply speaking – the larger the weights are allowed to grow, the more freedom the model has in fitting the training data. That increases model variance, and hence the risk of overfitting grows.

If we talk about how regularizers technically work – they add an additional loss term that grows with the magnitude of the weights. The bigger the weights, the bigger the loss – which in turn forces the model to decrease the weights.

Similarly, the bias regularizer controls the bias weights, and the activity regularizer controls the output values.
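
Here is a minimal sketch of how regularizers are passed to the layer – the factors 0.01 and 0.001 are arbitrary example values:

import tensorflow as tf

layer = tf.keras.layers.Dense(
    units=2,
    # adds 0.01 * sum(w^2) of the kernel weights to the loss
    kernel_regularizer=tf.keras.regularizers.l2(0.01),
    # adds 0.001 * sum(|output|) to the loss
    activity_regularizer=tf.keras.regularizers.l1(0.001),
)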

Constraints

There are 2 arguments for the Dense layer defining constraint functions:

  • kernel_constraint: Constraint function applied to the kernel weights matrix.
  • bias_constraint: Constraint function applied to the bias vector.

These functions are similar to regularizers, as they also control the weights. The difference is in how exactly the control is implemented: while regularizers introduce a penalty (additional loss) for big weights, constraints limit the values of the weights directly, e.g. by enforcing a maximum value.
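
And a similar minimal sketch for constraints – the maximum norm of 2.0 is an arbitrary choice:

import tensorflow as tf

layer = tf.keras.layers.Dense(
    units=2,
    # clip each unit's incoming weight vector to a norm of at most 2.0
    kernel_constraint=tf.keras.constraints.MaxNorm(max_value=2.0),
    # keep the bias weights non-negative
    bias_constraint=tf.keras.constraints.NonNeg(),
)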

Photo by Robynne Hu on Unsplash

Conclusions

If you are just starting to learn about Neural Networks, understanding the Dense layer is a very good place to start.

Of course, it is very important to understand the general idea and the math behind Neural Networks, but to successfully design effective architectures you will need a lot of practice and a trial-and-error approach. So why not start right now with some experiments of your own with the Keras Dense layer?

Thanks for reading!

