Jump with deep learning

A deep learning model that captures jump discontinuities, with an example application to a PDE

Shuyang Xiang
Towards Data Science

--

Image by author: jump discontinuity

Jump discontinuities in a function are formed when all the one-sided limits exist and are finite but are not equal; you usually encounter them in piece-wise functions, where different parts of the domain are defined by different continuous functions. According to the universal approximation theorem in the mathematical theory of artificial neural networks, a feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of a finite-dimensional real space, under mild assumptions on the activation function. However, learning functions with jump discontinuities is hard for a deep neural network, because the error around the point where the value jumps is usually large.

In this article, I will show how to build a specific block that can capture jump discontinuities, and how to use the resulting network to treat a partial differential equation (PDE) problem.

Neural networks fail to jump well

The classic network

Let us start by looking at a simple piece-wise function with only one jump discontinuity: the function equals sin(x) when x is positive and a shifted copy of sin(x) otherwise, so that the two branches meet with a jump at x = 0, as shown at the beginning of this article.
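For concreteness, the training data for such a target can be generated as in the sketch below; the offset of the negative branch is my own choice for illustration, since the exact shift used in the figure is not specified.

import numpy as np

# Piece-wise target with a single jump at x = 0; the -1 offset of the
# negative branch is an assumption made for illustration only.
x = np.linspace(-2 * np.pi, 2 * np.pi, 2000).reshape(-1, 1)
y = np.where(x > 0, np.sin(x), np.sin(x) - 1.0)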

We first build a simple, classic neural network with three dense layers: the two hidden layers have 32 units each and the output layer has a simple linear activation function. The figure below illustrates the architecture of the model.

Image by author: a three-layer model
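A minimal Keras sketch of this architecture could look like the following; the ReLU activations of the hidden layers are an assumption, since the article only specifies the layer sizes and the linear output.

from tensorflow import keras
from tensorflow.keras import layers

# Two hidden layers of 32 units each and a linear output layer
# (the ReLU hidden activations are assumed, not taken from the article)
classic_model = keras.Sequential([
    layers.Dense(32, activation="relu", input_shape=(1,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="linear"),
])
classic_model.compile(optimizer="adam", loss="mse")
classic_model.fit(x, y, epochs=20)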

After 20 training epochs, we observe that the model performs much worse as the function approaches the jump discontinuity, and the learned function does not even form a jump.

Image by author: original function vs learned function

The network with Heaviside activation

The result is not surprising: the activation functions of all layers are continuous, so they smooth out the original curve. A natural idea is therefore to set the activation function of the output layer to the Heaviside function.

The Heaviside function, also called the unit step function and usually denoted by H or θ, is a step function whose value is zero for negative arguments and one for positive arguments. We can define the Heaviside function with Keras in the following way. Note that since the derivative of the Heaviside function is the Dirac distribution, which is not defined everywhere, we simply set the gradient to 0.

import tensorflow as tf
from tensorflow import keras

@tf.custom_gradient
def heaviside(x):
    # 1 for positive arguments, 0 otherwise
    ones = tf.ones(tf.shape(x), dtype=x.dtype.base_dtype)
    zeros = tf.zeros(tf.shape(x), dtype=x.dtype.base_dtype)
    def grad(dy):
        return dy * 0   # the true derivative is a Dirac distribution; use a zero gradient instead
    return keras.backend.switch(x > 0, ones, zeros), grad

We now build a model with the same structure as before; the only change is that the activation function of the output layer is set to the Heaviside function we just defined. After 20 training epochs, we observe that the model indeed captures the jump discontinuity but completely fails to fit the smooth parts of the function.

Image by author: original function vs learned function with Heaviside activation
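For reference, a sketch of this variant changes only the output activation of the earlier hypothetical architecture:

# Same layer sizes as before, but the output activation is the custom heaviside
heaviside_model = keras.Sequential([
    layers.Dense(32, activation="relu", input_shape=(1,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation=heaviside),
])
heaviside_model.compile(optimizer="adam", loss="mse")
heaviside_model.fit(x, y, epochs=20)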

The reason for this behavior is that the minimizer of the loss function is searched for by computing gradients, while the derivative of the Heaviside function is a Dirac distribution whose value cannot be defined everywhere on the real axis. As the gradient-based method fails to reach the desired minimum, the learned function stays piece-wise constant on little plateaus.

Building the “good jump model”

To obtain the minimum with a well-behaved gradient and capture the jump discontinuity at the same time, we now construct a "good jump model" that takes advantage of both the classic approach and the one with a Heaviside activation. The idea is to build a layer whose output is a linear combination of a linear function and a Heaviside function, as in the following equation: output = w_s*S(input) + w_h*H(input) + b, where S and H are the linear function and the Heaviside function, and w_s, w_h and b are trainable parameters.

import tensorflow as tf
from tensorflow import keras

class Linear(keras.layers.Layer):
    # A plain affine layer: y = x @ w + b
    def __init__(self, units=32, input_dim=32, bias_trainable=True):
        super(Linear, self).__init__()
        self.w = self.add_weight(shape=(input_dim, units),
                                 initializer="random_normal", trainable=True)
        self.b = self.add_weight(shape=(units,),
                                 initializer="zeros", trainable=bias_trainable)

    def call(self, inputs):
        return tf.matmul(inputs, self.w) + self.b


class JumpBlock(keras.layers.Layer):
    # Outputs w_s*S(x) + w_h*H(x) + b for a pre-scaled input x
    def __init__(self, units=1, input_dim=1):
        super(JumpBlock, self).__init__()
        self.linear0 = Linear(units, input_dim)   # learnable pre-scaling of the input
        self.linear1 = Linear(units, units)       # Heaviside branch: w_h * H(x) + b
        # Linear branch: w_s * x, with its bias frozen at zero so the block has a single bias
        self.linear2 = Linear(units, units, bias_trainable=False)

    def call(self, inputs):
        x = self.linear0(inputs)
        h = self.linear1(heaviside(x))
        s = self.linear2(x)
        return h + s

We add the jump block to the network as its output layer. The illustration below shows the architecture of the "good jump model":

Image by author
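As a sketch, the full "good jump model" can be assembled on top of the same hypothetical hidden layers, with a one-unit dense layer feeding the jump block:

# Hidden layers as before, a 1-unit projection, then the jump block as output
jump_model = keras.Sequential([
    layers.Dense(32, activation="relu", input_shape=(1,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1),
    JumpBlock(),   # defaults to a scalar input and a scalar output
])
jump_model.compile(optimizer="adam", loss="mse")
jump_model.fit(x, y, epochs=20)   # epoch count assumed, as in the earlier experiments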

We end this section with the result of the "good jump model", which succeeds in capturing both the smooth part and the jump discontinuity of the given function.

Image by author: the result of the "good jump model"

For more details of the code, please check the notebook.

Application in hyperbolic PDEs

I would now like to apply the "good jump model" to a hyperbolic partial differential equation. To simplify the story, let us consider a simple but non-trivial equation: the inviscid Burgers equation, u_t + u*u_x = 0. We are interested in the initial-value problem: for a given function, find a solution that satisfies the equation and equals that function at the initial time. In general, even a smooth initial condition can develop a solution with shock waves, which can be treated as jump discontinuities.

Here, let us consider a time series given by a solution of the Burgers equation obtained with the Glimm scheme from time 0 to time 1, discretized into 500 timesteps, and let us use this time series to forecast the solution after time 1. For that purpose, we use a network that combines a temporal convolutional network (TCN) with our "good jump model". The idea is to let the TCN capture the temporal evolution of the solution and let the "good jump model" force the formation of shock waves.

The input of the model is the solution over 20 timesteps and the output is the solution at the next timestep. A short remark: the loss function we use is the mean squared error, because the Burgers equation is usually satisfied only in a weak sense, i.e. not everywhere, and we thus have to use an L2 norm. We trained the model for 100 epochs and obtained a validation loss on the order of 1e-4.
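Concretely, the sliding windows can be built as in the sketch below, which assumes the Glimm-scheme solution is stored in an array u of shape (number of timesteps, number of spatial cells); the array name and this layout are assumptions for illustration.

import numpy as np

time_step = 20                     # length of the input window
n_steps, space_size = u.shape      # u: Glimm-scheme solution, assumed shape (timesteps, cells)

# Each sample pairs 20 consecutive time slices with the slice that follows them
x = np.array([u[i:i + time_step] for i in range(n_steps - time_step)])
y = np.array([u[i + time_step] for i in range(n_steps - time_step)])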

from tcn import TCN
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

# TCN encoder followed by two dense layers, a projection to the spatial grid,
# and the jump block as the output layer
tcn_layer = TCN(input_shape=(time_step, space_size))
m = Sequential()
m.add(tcn_layer)
m.add(Dense(32))
m.add(Dense(32))
m.add(Dense(space_size))
m.add(JumpBlock(space_size, space_size))   # jump block sized to the spatial dimension
m.compile(optimizer='adam', loss='mse')

# First 400 windows for training, the remaining ones for validation
x_train = x[:400]
y_train = y[:400]
x_val = x[400:]
y_val = y[400:]
m.fit(x_train, y_train, epochs=100, validation_data=(x_val, y_val))
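
To reach times beyond the training horizon, the model can be rolled forward autoregressively, as in the sketch below; it assumes a timestep of 1/500, so 100 extra predictions carry the solution from time 1 to time 1.2.

# Feed the last 20 known slices, predict the next one, append it, and repeat
window = u[-time_step:].copy()
predictions = []
for _ in range(100):               # 100 steps of size 1/500: from t = 1 to t = 1.2
    next_slice = m.predict(window[np.newaxis, ...], verbose=0)[0]
    predictions.append(next_slice)
    window = np.vstack([window[1:], next_slice])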

Voilà, here is the solution of the Burgers equation predicted by the model at time = 1.2.

Next steps

In this article, we built a jump block that combines a linear function and the Heaviside function to learn functions with jump discontinuities. We then used the block on the Burgers equation to predict solutions at later times and obtained more or less satisfying results. However, for the prediction of the solution, only the data of the numerical solution was used, while the structure of the equation itself was ignored. An idea for future work is to combine such a block with the logic of physics-informed neural networks (PINNs), a deep learning method for solving PDEs that often fails to form shock waves on its own.
