Building Lipschitz constrained networks with DEEL-LIP

spectral normalization made easy

Thibaut Boissin
Towards Data Science

--

What are Lipschitz functions?

Named after Rudolf Lipschitz, a function f is said to be k-Lipschitz when its first derivatives are bounded by some constant k. The smallest such k is called the Lipschitz constant of the function. Such a function has an interesting property: for any two inputs x1 and x2,

‖f(x1) − f(x2)‖ ≤ k·‖x1 − x2‖

This means that if x1 and x2 are close to each other, then their predictions f(x1) and f(x2) will be close too. As neural networks approximate functions, we will focus on 1-Lipschitz networks.
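As a toy illustration (a quick numerical check, not taken from any library), one can verify this inequality for a simple 1-Lipschitz function such as sin, whose derivative is bounded by 1:

    import numpy as np

    # Toy check of the 1-Lipschitz inequality |f(x1) - f(x2)| <= |x1 - x2| for f = sin.
    rng = np.random.default_rng(0)
    x1, x2 = rng.uniform(-10.0, 10.0, size=(2, 100_000))
    keep = x1 != x2  # avoid dividing by zero on identical samples
    ratios = np.abs(np.sin(x1[keep]) - np.sin(x2[keep])) / np.abs(x1[keep] - x2[keep])
    print(ratios.max())  # stays below 1: no sampled pair violates the bound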

Why does the Lipschitz constant of a network matter?

These kinds of networks find many uses in modern machine learning: in particular, they have been shown to be more robust to adversarial attacks [1].

[Figure: the well-known adversarial example where a “panda” image plus an imperceptible perturbation is classified as a “gibbon”.] Source: Explaining and Harnessing Adversarial Examples, Goodfellow et al., ICLR 2015.

An adversarial attack is the act of finding an x2 (the “gibbon” image) close to our original x1 (the “panda” image) such that their predictions f(x1) and f(x2) differ. The main problem is that the noise added to build such an adversarial example is very small. Controlling the Lipschitz constant of a network has a great impact on adversarial robustness: the lower the Lipschitz constant, the more noise is needed to build an adversarial example. Feel free to read this gentle introduction to adversarial robustness to learn more about the topic.

But this is not the only use of Lipschitz networks: they are also used in Wasserstein distance estimation. The Wasserstein metric measures the distance between two probability distributions, and estimating it requires an optimization over 1-Lipschitz functions, which can be carried out with 1-Lipschitz networks. This is notably used in Wasserstein-GAN [2].

Different levels of Lipschitz constraints

With k = 1, the inequality above tells us that the norm of the gradient of our function is at most 1. But enforcing this constraint does not necessarily mean that the gradient will effectively reach 1 at some point.

Depending on the use case, we might require different levels of Lipschitz constraints:

  • “Soft” 1-Lipschitz constraint: we push the gradient norm to be close to one, and equal to one on average. In this case, the gradient can be greater than one at some specific points of the input domain. Using such a constraint during training does not guarantee that the final network will be 1-Lipschitz, but it helps regularize training.
  • “Hard” 1-Lipschitz constraint: we force the gradient norm to be lower than or equal to one at every point of the input domain. The gradient can be lower than one at some points. This is the setting used when dealing with adversarial robustness.
  • “Gradient equal to one almost everywhere”: we force the gradient norm to be equal to 1 almost everywhere. This particular class of networks is not suitable for traditional ML, but it is necessary when working with Wasserstein distance estimation [3].

We will focus on the last two levels of constraints.

How do we apply these constraints to neural networks?

Computing the exact Lipschitz constant of a neural network is known to be an NP-hard problem. However, several methods have been proposed to enforce such “hard” constraints during training. Most of these methods rely on constraining the Lipschitz constant of each layer and then using the following composition property: if f1 is k1-Lipschitz and f2 is k2-Lipschitz, then the composition f2∘f1 is (k1·k2)-Lipschitz. A network built by stacking 1-Lipschitz layers is therefore itself 1-Lipschitz.

For linear layers, such as Dense layers, there are numerous ways of achieving this property, like weight clipping or normalization.

We will now explore existing methods to do this for Dense layers. Recall the definition of a Dense layer as a linear operator composed with a non-linear activation function:

f(x) = σ(K·x + b)

where K is the weight matrix (the kernel), b the bias and σ the activation function.

We will first focus on the linear kernel K. We will later see how to include activation functions and how to extend these methods to other types of layers.
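As a quick illustration of the composition property (a NumPy sketch, not part of DEEL-LIP), the product of the spectral norms of the kernels of a stack of linear layers gives an upper bound on the Lipschitz constant of their composition:

    import numpy as np

    # Upper bound on the Lipschitz constant of a stack of linear layers:
    # the product of the spectral norms (largest singular values) of their kernels.
    rng = np.random.default_rng(0)
    kernels = [rng.normal(size=(64, 32)), rng.normal(size=(32, 16)), rng.normal(size=(16, 1))]
    lipschitz_bound = np.prod([np.linalg.norm(K, ord=2) for K in kernels])

    # The composed linear map is indeed bounded by this product.
    composed = kernels[0] @ kernels[1] @ kernels[2]
    print(np.linalg.norm(composed, ord=2), "<=", lipschitz_bound)

If every factor in the product is at most 1, the whole stack is 1-Lipschitz.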

Weight clipping

This is the naive method to enforce the Lipschitz constraint at the level of each linear layer. For instance, a dense layer with m inputs and n outputs is 1-Lipschitz if all its weights are clipped within the interval:

[−1/√(m·n), +1/√(m·n)]

This is enough to guarantee that the Lipschitz constant of the layer is lower than 1 (the spectral norm of K is bounded by its Frobenius norm, which is at most 1 under this clipping). However, in practice, the gradient at training data points is usually much lower than 1: the only way to obtain a gradient equal to 1 at some point is to set all weights to their clipping values, which is very restrictive. In other words, networks trained under such constraints have a Lipschitz constant smaller than 1, but many 1-Lipschitz networks do not satisfy this constraint.
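Here is a short sketch of this idea (the clipping interval below follows the Frobenius-norm bound mentioned above):

    import numpy as np

    m, n = 64, 10  # layer with m inputs and n outputs
    rng = np.random.default_rng(0)
    K = rng.normal(size=(m, n))

    # Clip every weight into [-1/sqrt(m*n), +1/sqrt(m*n)]:
    # then ||K||_2 <= ||K||_F <= sqrt(m*n) * max|K_ij| <= 1.
    bound = 1.0 / np.sqrt(m * n)
    K_clipped = np.clip(K, -bound, bound)
    print(np.linalg.norm(K_clipped, ord=2))  # <= 1, and usually much smaller in practice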

Can we then find a constraint without such drawbacks?

Spectral Normalization

This method tackles the limitations of weight clipping by using spectral normalization. Mathematically, this type of normalization relies on the singular value decomposition of the kernel:

K = U·Σ·Vᵀ

where U and V are orthogonal matrices and Σ is a diagonal matrix containing the singular values σ1 ≥ σ2 ≥ … ≥ 0. Since U and V are orthogonal, they do not affect the norm of the vectors they are applied to, so the gradient of x ↦ K·x is bounded by the largest singular value σ1:

‖K·x1 − K·x2‖ ≤ σ1·‖x1 − x2‖

If we divide all weights by this largest singular value, the gradient is then bounded by one:

‖(K/σ1)·x1 − (K/σ1)·x2‖ ≤ ‖x1 − x2‖

Such normalization provides very strong guarantees:

  • The gradient cannot be greater than 1.
  • There exists a set of points in the input domain such that the gradient will be equal to 1.

Moreover, the largest singular value can be computed quickly by using the power iteration method [4].
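Here is a minimal NumPy sketch of spectral normalization with power iteration (an illustration only, not DEEL-LIP's internal implementation):

    import numpy as np

    def spectral_normalize(K, n_iter=50):
        """Estimate the largest singular value of K by power iteration
        and return the kernel divided by it."""
        v = np.random.default_rng(0).normal(size=K.shape[1])
        for _ in range(n_iter):
            u = K @ v
            u /= np.linalg.norm(u)
            v = K.T @ u
            v /= np.linalg.norm(v)
        sigma_max = u @ K @ v  # estimate of the largest singular value
        return K / sigma_max

    K = np.random.default_rng(1).normal(size=(64, 32))
    K_normalized = spectral_normalize(K)
    print(np.linalg.norm(K_normalized, ord=2))  # ~1.0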

The SVD also provides a formal way to express the “gradient equal to 1 almost everywhere” constraint: all the singular values of K must be equal to 1 (Σ = I), i.e. K must be an orthogonal matrix (KᵀK = I), so that ‖K·x1 − K·x2‖ = ‖x1 − x2‖.

In practice, this constraint can be obtained by using the Björck orthonormalization algorithm [5].
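The sketch below shows the standard first-order Björck iteration, applied after spectral normalization so that it converges; DEEL-LIP's actual implementation may differ in its details:

    import numpy as np

    def bjorck_orthonormalize(K, n_iter=25):
        """First-order Bjorck iteration: pushes all singular values of K towards 1.
        K is assumed to be spectrally normalized already (largest singular value ~1)."""
        for _ in range(n_iter):
            K = 1.5 * K - 0.5 * K @ K.T @ K
        return K

    K = np.random.default_rng(0).normal(size=(64, 32))
    K = K / np.linalg.norm(K, ord=2)       # spectral normalization first
    K_ortho = bjorck_orthonormalize(K)
    print(np.linalg.svd(K_ortho, compute_uv=False))  # all singular values ~1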

What about activation functions?

To obtain a 1-Lipschitz neural network, the activation functions used must also be 1-Lipschitz. Most of them already are: ReLU, ELU, sigmoid, tanh, logSigmoid… Some need to be properly parameterized, such as LeakyReLU and PReLU, and finally some are not 1-Lipschitz at all.

What about other layers?

Applying the composition property requires that each layer respects the 1-Lipschitz constraint. We showed examples of normalization for dense layers, but is this applicable to any layer type?

Convolution layers: one might think that normalizing the kernel of a convolution layer (by clipping or spectral normalization) is enough, but there is a catch: convolution layers use padding, stride, dilation… All these operations have an effect on the norm of the output, thus changing the layer’s Lipschitz constant. To account for this phenomenon, a corrective factor can be computed from these parameters [6].

Pooling layers: these layers can be seen as a special case of convolutions, so we can also apply a corrective factor.

Batch Norm: as it rescales its inputs, this layer is not 1-Lipschitz. Besides, learning under a 1-Lipschitz constraint mitigates the need for such layers: it would mostly be useful to correct the bias induced after each layer, which can also be done by setting the use_bias parameter of each layer.

Dropout: it provides a kind of regularization by setting part of the layer outputs to zero. However, at inference time a scaling factor is applied to compensate for the dropout rate, which breaks the Lipschitz property.

k-Lipschitz neural nets made easy with DEEL-LIP

DEEL-LIP is a library built upon TensorFlow that extends the usual Keras elements, such as layers, initializers or activations, allowing the user to easily build 1-Lipschitz networks. The provided layers use spectral normalization and/or Björck orthonormalization. When required, the adequate corrective factor is computed and applied.

How to use it?

First, the library can be installed with pip:
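    pip install deel-lip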

The following code then demonstrates how to build and compile a neural network with DEEL-LIP. It is quite similar to standard Keras code.
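The snippet below is a sketch in the spirit of the DEEL-LIP documentation; the layer and parameter names it uses (SpectralConv2D, SpectralDense, ScaledL2NormPooling2D, deel.lip.model.Sequential and k_coef_lip) are taken from that documentation and may vary slightly across versions.

    import tensorflow as tf
    from deel.lip.layers import SpectralConv2D, SpectralDense, ScaledL2NormPooling2D
    from deel.lip.model import Sequential

    model = Sequential(
        [
            # Every layer below enforces the Lipschitz constraint by construction.
            SpectralConv2D(16, (3, 3), activation="relu", input_shape=(28, 28, 1)),
            ScaledL2NormPooling2D(pool_size=(2, 2)),
            SpectralConv2D(32, (3, 3), activation="relu"),
            ScaledL2NormPooling2D(pool_size=(2, 2)),
            tf.keras.layers.Flatten(),
            SpectralDense(64, activation="relu"),
            SpectralDense(10),
        ],
        k_coef_lip=1.0,  # Lipschitz constant of the whole model
    )

    # Compilation is plain Keras.
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-3),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    model.summary()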

example of NN construction with DEEL-LIP and Keras

DEEL-LIP has been developed to be simple to use for Keras users:

  • DEEL-LIP follows the same package structure as Tensorflow/Keras.
  • All elements (layers, activations, initializers…) are compatible with standard Keras elements.
  • When a layer overrides a standard Keras element, it implements the same interface, with the same parameters. The only difference lies in the parameter that controls the Lipschitz constant of the layer.

What’s important when dealing with Lipschitz layers?

As the Lipschitz constant of the network is bounded by the product of the constants of its layers, every single layer must respect the Lipschitz constraint. Adding a single layer that does not satisfy it is enough to break the Lipschitz constraint of the whole network.

In order to make things more convenient, any layer imported from DEEL-LIP is safe to use: if you can import it, you can use it. This is also true for the parameters of layers: wrong settings (i.e. breaking the Lipschitz constraint) on a layer will raise an error.

A complete list of available layers can be found in the documentation.

What is the overhead during backpropagation/inference?

During backpropagation, the normalization of the weights adds an overhead at each iteration. However, the spectral normalization is inserted directly into the computation graph and differentiated through during training. This is called a differentiable constraint, and it yields much more efficient optimization steps.

During inference, no overhead is added at all: the normalization is only required during training. DEEL-LIP provides an export feature that can be used to create a standard Keras model from a DEEL-LIP one: each Dense or Conv2D layer is then converted to a standard Keras layer. By exporting models, it is possible to run inference with networks trained using DEEL-LIP without even requiring the installation of DEEL-LIP.
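Assuming the model was built with deel.lip.model.Sequential as above, the export looks like this (the vanilla_export method name is taken from the DEEL-LIP documentation):

    # Export the trained DEEL-LIP model to a plain Keras model for inference.
    # The exported model no longer depends on the deel-lip package.
    vanilla_model = model.vanilla_export()
    vanilla_model.save("lipschitz_model_inference.h5")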

Conclusions

Lipschitz constrained networks are neural networks with bounded derivatives. They have many applications, ranging from adversarial robustness to Wasserstein distance estimation. There are various ways to enforce such constraints; spectral normalization and Björck orthonormalization are among the most efficient, and the DEEL-LIP library provides an easy way to train and use layers constrained in this way.

Thanks for reading!

References

[1] Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, Nicolas Usunier. Parseval Networks: Improving Robustness to Adversarial Examples. arXiv:1704.08847 [stat.ML]
[2] Martin Arjovsky, Soumith Chintala, Léon Bottou. Wasserstein GAN. arXiv:1701.07875 [stat.ML]
[3] Cédric Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, 2008.
[4] Wikipedia: power iteration method
[5] Cem Anil, James Lucas, Roger Grosse. Sorting out Lipschitz function approximation. arXiv:1811.05381 [cs.LG]
[6] Mathieu Serrurier, Franck Mamalet, Alberto González-Sanz, Thibaut Boissin, Jean-Michel Loubes, Eustasio del Barrio. Achieving robustness in classification using optimal transport with hinge regularization. arXiv:2006.06520 [cs.LG]
