Courage to Learn ML: Demystifying L1 & L2 Regularization (part 3)

Why L0.5, L3, and L4 Regularizations Are Uncommon

Amy Ma
Towards Data Science


Photo by Kelvin Han on Unsplash

Welcome back to the third installment of ‘Courage to Learn ML: Demystifying L1 & L2 Regularization’. Previously, we delved into the purpose of regularization and decoded the L1 and L2 methods through the lens of Lagrange multipliers.

Continuing our journey, our mentor-learner duo will further explore L1 and L2 regularization using Lagrange Multipliers.

In this article, we’ll tackle some intriguing questions that might have crossed your mind. If you’re puzzled about these topics, you’re in the right place:

  • What’s the reason behind not having an L0.5 regularization?
  • Why do we care whether a problem is convex, given that most deep learning problems are non-convex?
  • Why are norms like L3 and L4 not commonly used?
  • Can L1 and L2 regularization be combined? And what are the advantages and disadvantages of doing this?

I have a question from our last discussion. I checked that for the Lp norm, the value of p can be any number larger than 0. Why don’t we use p between 0 and 1? What’s the reason behind not having an L0.5 regularization?

I’m glad you brought up this question. To get straight to the point, we typically avoid p values less than 1 because they lead to non-convex optimization problems. Let me illustrate this with an image showing the shape of Lp norms for different p values. Take a close look at when p=0.5; you’ll notice that the shape is decidedly non-convex.

The shape of Lp norms for different values of p. Source: https://lh5.googleusercontent.com/EoX3sngY7YnzCGY9CyMX0tEaNuKD3_ZiF4Fp3HQqbyqPtXks2TAbpTj5e4tiDv-U9PT0MAarRrPv6ClJ06C0HXQZKHeK40ZpVgRKke8-Ac0TAqdI7vWFdCXjK4taR40bdSdhGkWB

This becomes even clearer when we look at a 3D representation, assuming we’re optimizing three weights. In this case, it’s evident that the problem isn’t convex, with numerous local minima appearing along the boundaries.

Lp norm shapes in 3D. Source: https://ekamperi.github.io/images/lp_norms_3d.png

The reason why we typically avoid non-convex problems in machine learning is their complexity. With a convex problem, you’re guaranteed a global minimum — this makes it generally easier to solve. On the other hand, non-convex problems often come with multiple local minima and can be computationally intensive and unpredictable. It’s exactly these kinds of challenges we aim to sidestep in ML.

When we use techniques like Lagrange multipliers to optimize a function under constraints, it’s crucial that those constraints are convex. Adding a convex constraint to the original problem doesn’t alter its fundamental properties, so the problem stays tractable; a non-convex constraint, such as an Lp ball with p < 1, can make an otherwise well-behaved problem considerably harder to solve.
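To make that concrete, here’s a quick numerical check (a small illustrative sketch of my own, not part of the original discussion): the midpoint of two points that lie on the unit L0.5 ball lands outside the ball, which is exactly what a convex set is not allowed to do.

```python
import numpy as np

def lp_norm(w, p):
    """Compute (sum_i |w_i|^p)^(1/p). For p < 1 this is not a true norm."""
    return np.sum(np.abs(w) ** p) ** (1.0 / p)

x = np.array([1.0, 0.0])   # on the unit L0.5 ball
y = np.array([0.0, 1.0])   # on the unit L0.5 ball
mid = 0.5 * (x + y)        # midpoint of the segment between them

print(lp_norm(x, 0.5))     # 1.0
print(lp_norm(y, 0.5))     # 1.0
print(lp_norm(mid, 0.5))   # 2.0  -> the midpoint falls OUTSIDE the unit L0.5 ball
print(lp_norm(mid, 2.0))   # ~0.71 -> for L2, the midpoint stays inside, as convexity requires
```

A convex constraint set must contain the entire segment between any two of its points; the L0.5 ball fails this test, which is the non-convexity you see in the figure.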

Why do we care here about whether a problem or a constraint is convex? Aren’t most deep learning problems non-convex anyway?

Your question touches on an interesting aspect of deep learning. It’s not that we prefer non-convex problems; rather, in deep learning we often encounter them and have no choice but to deal with them. Here’s why:

  1. Nature of Deep Learning Models leads to a non-convex loss surface: Most deep learning models, particularly neural networks with hidden layers, inherently have non-convex loss functions. This is due to the complex, non-linear transformations that occur within these models. The combination of these non-linearities and the high dimensionality of the parameter space typically results in a loss surface that is non-convex.
  2. Local minima are less of a problem in deep learning: In high-dimensional spaces, which are typical in deep learning, local minima are not as problematic as they might be in lower-dimensional spaces. Research suggests that many of the local minima in deep learning are close in value to the global minimum. Moreover, saddle points (points where the gradient is zero but that are neither maxima nor minima) are more common in such spaces and pose the bigger challenge.
  3. Advanced optimization techniques cope well with non-convex spaces: methods such as stochastic gradient descent (SGD) and its variants have been particularly effective at finding good solutions in these non-convex landscapes (see the sketch right after this list). While these solutions might not be global minima, they are often good enough to achieve high performance on practical tasks.
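To illustrate that third point, here is a toy sketch of my own (not from the original discussion): SGD with momentum minimizing a simple non-convex “double-well” loss. There is no convexity guarantee, yet every run settles into one of the minima.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w):
    """A simple non-convex loss with two local minima (near w = -1 and w = +1)."""
    return (w**2 - 1) ** 2 + 0.3 * w

def grad(w):
    return 4 * w * (w**2 - 1) + 0.3

for start in (-2.0, -0.5, 0.5, 2.0):
    w, velocity = start, 0.0
    for _ in range(500):
        g = grad(w) + rng.normal(scale=0.1)   # noisy gradient, standing in for mini-batch noise
        velocity = 0.9 * velocity - 0.01 * g  # momentum update
        w += velocity
    print(f"start={start:+.1f} -> w={w:+.3f}, loss={loss(w):.3f}")

# Different starting points may end up in different local minima, but each run
# converges to one of them; in the very high-dimensional losses of deep networks,
# such minima are often good enough in practice.
```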

Even though deep learning models have non-convex loss surfaces, they excel at capturing complex patterns and relationships in large datasets. Additionally, research into non-convex optimization is continually progressing, enhancing our understanding. Looking ahead, there’s potential for us to handle non-convex problems more efficiently, with fewer concerns.

Why don’t we consider using higher norms, like L3 and L4, for regularization?

Recall the image we discussed earlier showing the shapes of Lp norms for various values of p. As p increases, the Lp norm’s shape evolves. For example, at p = 3, it resembles a square with rounded corners, and as p nears infinity, it forms a perfect square.

The shape of Lp norms for different values of p. Source: https://lh5.googleusercontent.com/EoX3sngY7YnzCGY9CyMX0tEaNuKD3_ZiF4Fp3HQqbyqPtXks2TAbpTj5e4tiDv-U9PT0MAarRrPv6ClJ06C0HXQZKHeK40ZpVgRKke8-Ac0TAqdI7vWFdCXjK4taR40bdSdhGkWB

In the context of our optimization problem, consider higher norms like L3 or L4. Similar to L2 regularization, where the loss function and constraint contours intersect at rounded edges, these higher norms would also encourage weights to approach zero without zeroing them out. (If this part isn’t clear, feel free to revisit Part 2 for a more detailed explanation.) With that in mind, here are the two main reasons why L3 and L4 norms aren’t commonly used:

  1. L3 and L4 norms have effects similar to L2 without offering significant new advantages: they push weights close to zero but not exactly to zero. L1 regularization, in contrast, zeroes out weights and introduces sparsity, which is useful for feature selection.
  2. Computational cost is another vital aspect. The regularization term adds to the complexity of the optimization process, and L3 and L4 penalties are computationally heavier than L2, making them less feasible for most machine learning applications.

To sum up, while L3 and L4 norms could be used in theory, they don’t provide unique benefits over L1 or L2 regularization, and their computational cost makes them a less practical choice.
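One quick way to see the “similar effect to L2” point is to compare how hard each penalty pushes a weight toward zero as that weight gets small (a small illustrative sketch of my own):

```python
import numpy as np

def penalty_grad(w, p, lam=1.0):
    """Gradient of lam * |w|^p with respect to w (take w != 0 when p = 1)."""
    return lam * p * np.abs(w) ** (p - 1) * np.sign(w)

# Strength of the pull toward zero for progressively smaller weights
for w in (1.0, 0.1, 0.01):
    print(w, {p: round(float(penalty_grad(w, p)), 6) for p in (1, 2, 3, 4)})

# The L1 gradient stays at 1.0 no matter how small w is: a constant pull that
# produces exact zeros. The L2, L3 and L4 gradients all vanish as w -> 0, so they
# merely shrink weights toward zero; L3 and L4 add nothing qualitatively new over
# L2, while costing an extra power per weight to compute.
```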

Is it possible to combine L1 and L2 regularization?

Yes, it is indeed possible to combine L1 and L2 regularization, a technique often referred to as Elastic Net regularization. This approach blends the properties of both L1 (lasso) and L2 (ridge) regularization and can be useful, though it comes with its own trade-offs.

Elastic Net regularization is a linear combination of the L1 and L2 regularization terms: it adds both the L1 and the L2 norm to the loss function, so there are two parameters to tune, λ1 and λ2.

Elastic Net regularization: β̂ = argmin_β (‖y − Xβ‖² + λ2‖β‖² + λ1‖β‖₁). Source: https://wikimedia.org/api/rest_v1/media/math/render/svg/a66c7bfcf201d515eb71dd0aed5c8553ce990b6e
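Written out as code, the penalized objective looks like this (a minimal numpy sketch of the formula above; in practice you’d let a library minimize it for you, for example scikit-learn’s ElasticNet, whose alpha and l1_ratio parameters jointly play the role of λ1 and λ2):

```python
import numpy as np

def elastic_net_loss(beta, X, y, lam1, lam2):
    """||y - X beta||^2 + lam2 * ||beta||_2^2 + lam1 * ||beta||_1"""
    residual = y - X @ beta
    return (residual @ residual
            + lam2 * np.sum(beta ** 2)      # ridge part: smooth shrinkage of all weights
            + lam1 * np.sum(np.abs(beta)))  # lasso part: pushes some weights exactly to zero

# Tiny usage example on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
beta_true = np.array([2.0, 0.0, 0.0, -1.0, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=50)
print(elastic_net_loss(beta_true, X, y, lam1=0.5, lam2=0.5))
```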

What is the benefit of using Elastic Net regularization? And if it’s so beneficial, why don’t we use it more often?

By combining both regularization techniques, Elastic Net can improve the generalization capability of the model, reducing the risk of overfitting more effectively than using either L1 or L2 alone.

Let’s break down its advantages:

  1. Elastic Net provides more stability than L1. L1 regularization can lead to sparse models, which is useful for feature selection, but it can also be unstable in certain situations. For example, among highly correlated variables, L1 regularization tends to select one of them arbitrarily while driving the others’ coefficients to zero, whereas Elastic Net distributes the weights more evenly among those variables (see the sketch after this list).
  2. L2 can be more stable than L1 regularization, but it doesn’t encourage sparsity. Elastic Net aims to balance these two aspects, potentially leading to more robust models.
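Here’s a small sketch of that first point (an illustrative example with scikit-learn on made-up data): two nearly identical features carry the same signal, and we compare how lasso and Elastic Net assign the coefficients.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

# Two almost perfectly correlated features carrying the same signal
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, x + 1e-3 * rng.normal(size=200)])
y = 3 * x + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("Lasso:      ", lasso.coef_)  # typically piles (almost) all the weight on one column
print("Elastic Net:", enet.coef_)   # tends to split the weight between the two columns
```

Because the L2 part makes the Elastic Net objective strictly convex, correlated features end up sharing the weight instead of one being picked arbitrarily.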

However, Elastic Net regularization introduces an extra hyperparameter that demands meticulous tuning. Achieving the right balance between L1 and L2 regularization, and thus optimal model performance, involves increased computational effort. This added complexity is part of why it isn’t used more often.
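To give a sense of that tuning burden, here’s roughly what a basic cross-validated grid search over both knobs could look like (a sketch with scikit-learn; the grid values are arbitrary):

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=200)

# Two knobs instead of one: overall strength (alpha) and the L1/L2 mix (l1_ratio)
param_grid = {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.1, 0.5, 0.9]}
search = GridSearchCV(ElasticNet(max_iter=10_000), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```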

In our next session, we’ll explore L1 and L2 regularization from an entirely new angle, delving into the realm of Bayesian prior beliefs to deepen our understanding. Let’s pause here for now — looking forward to our next discussion!

Other posts in this series:

If you liked the article, you can find me on LinkedIn.

