Last week we introduced constrained optimisation and left you with no hint as to how it might help us tackle the Support Vector Machine problem numerically. This time we will put everything into context by formulating the Support Vector Machine as a Quadratic Programming (QP) problem!
In this article, we introduce the Karush-Kuhn-Tucker (KKT) conditions, formulate the Support Vector Machine problem as a convex optimisation problem and introduce several transformations that make it more convenient to solve numerically!
Let’s jump straight into it.
What are these KKT conditions and why do we care?
To put it plainly, the KKT conditions are the first-derivative tests that allow us to check whether we have indeed reached the optimal solution. We recall that, assuming the objective function and the inequality and equality constraints are differentiable, our Lagrangian has the following form
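Here f(x) is the objective we minimise, g_i(x) ≤ 0 are the inequality constraints, h_j(x) = 0 are the equality constraints, and lambda and nu are the corresponding multipliers (the standard choice of symbols):

$$
L(x, \lambda, \nu) = f(x) + \sum_{i=1}^{m} \lambda_i \, g_i(x) + \sum_{j=1}^{k} \nu_j \, h_j(x)
$$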

where x, lambda and nu are the primal and dual variables. Denoting the optimal solution by x* (with lambda* and nu* the optimal multipliers), we have the following KKT conditions
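In the same notation, the conditions group as follows:

$$
\begin{aligned}
&\text{Stationarity:} && \nabla f(x^{*}) + \sum_{i} \lambda_i^{*} \nabla g_i(x^{*}) + \sum_{j} \nu_j^{*} \nabla h_j(x^{*}) = 0 \\
&\text{Primal feasibility:} && g_i(x^{*}) \le 0, \qquad h_j(x^{*}) = 0 \\
&\text{Dual feasibility:} && \lambda_i^{*} \ge 0 \\
&\text{Complementary slackness:} && \lambda_i^{*} \, g_i(x^{*}) = 0
\end{aligned}
$$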

Do all these symbols make sense to you? If not, let me know and I’ll provide more explanations!
Here is the good news! 😃 If we assume that the problem is convex (convex objective and inequality constraints, affine equality constraints), these conditions become necessary and sufficient. In other words, we are guaranteed that x*, lambda* and nu* are primal and dual optimal with zero duality gap. Remember what that means?
So now we are fully set! We have our Lagrangian and KKT conditions, so how should we formulate the SVM as a constrained optimisation problem?
Support Vector Machines as a Quadratic Programming problem
We recall that the Support Vector Machine essentially finds an optimal separating hyperplane, where the "support vectors" are the observations that determine the position of the hyperplane and the width of its margins [1].
The idea of separating hyperplane
If we have a vector x in a p-dimensional space, a hyperplane or affine set L is defined by the equation
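With f(x) = β₀ + βᵀx (the notation used in [1]),

$$
L = \{\, x : f(x) = \beta_0 + \beta^{T} x = 0 \,\}
$$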

![Image by Author. Inspired by [1]](https://towardsdatascience.com/wp-content/uploads/2021/10/1-Amy_BnbY4AQ7sYIDBpVKQ.png)
That’s the picture you should have in your mind (sorry for the messy drawing). Big idea! 🙂 These betas are the ones that control the orientation of the line – thus, we want to find the optimal separation by minimising the norm of the betas!
We now list some important properties:
- For any two points x_1 and x_2 lying in L, we have
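(keeping the notation f(x) = β₀ + βᵀx from above)

$$
\beta^{T}(x_1 - x_2) = 0
$$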

Thus we can construct the unit normal vector β* = β/‖β‖ to the surface L. As we are in 2D here, we have a line instead of a surface. We recall that a normal vector is a vector that is perpendicular to the surface at a given point.
- For any given point x_0 in L, we have
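(since x_0 lies on the hyperplane, f(x_0) = 0)

$$
\beta^{T} x_0 = -\beta_0
$$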

- The signed distance of any point x to L is given by the expression below. Note that the derivative of f appears in the denominator of the final expression: taking the first-order derivative of f(x) = β₀ + βᵀx leaves us with the betas only, so ‖f′(x)‖ = ‖β‖.
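(with β* = β/‖β‖ and x_0 any point on L, as above)

$$
\beta^{*T}(x - x_0) \;=\; \frac{1}{\lVert \beta \rVert}\left(\beta^{T} x + \beta_0\right) \;=\; \frac{1}{\lVert f'(x) \rVert}\, f(x)
$$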

Step 1 – Formulate an optimisation problem.
The objective is to find the hyperplane that separates the data points while maximising the distance to the closest point from either class. If M is the signed distance from the boundary to the closest points (the margin), then we have the following optimisation problem
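Writing the training data as pairs (x_i, y_i) with labels y_i in {−1, +1}, as in [1], the problem reads

$$
\max_{\beta,\, \beta_0,\, \lVert \beta \rVert = 1} M
\quad \text{subject to} \quad
y_i\left(x_i^{T}\beta + \beta_0\right) \ge M, \quad i = 1, \dots, N
$$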

The inequality condition ensures that all the points are at least a signed distance M away from the decision boundary defined by the betas.
Step 2 – Think again! Reformulate for convenience.
Now follow carefully: there are three important adjustments we should make to the aforementioned formulation to account for misclassification and to make it slightly easier to solve numerically. We introduce the following changes:
- Arbitrarily set
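(following [1], the norm of beta is tied to the margin width M)

$$
\lVert \beta \rVert = \frac{1}{M}
$$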

This allows us to drop the norm constraint and rewrite the initial problem so that, further down the line, we will be minimising the squared norm of beta, which is done purely for convenience.
- A real-world problem is usually non-separable, as it almost always includes some observations that overlap in the feature space. We thus need to allow some points to be on the wrong side of the margin; this is the so-called soft margin. Mathematically, we introduce slack variables ξ (the Greek letter xi), the sum of which should not exceed a specified constant
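(ξ_i ≥ 0 is the slack for observation i; each margin constraint is relaxed proportionally)

$$
y_i\left(x_i^{T}\beta + \beta_0\right) \ge M(1 - \xi_i), \qquad \xi_i \ge 0, \qquad \sum_{i=1}^{N} \xi_i \le \text{constant}
$$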

This constraint is formulated in such a way as to express the proportional amount by which the prediction f(x_i) is on the wrong side of its margin.
We now have the following optimisation problem. We note that the 1/2 introduced before the squared norm is there purely for convenience – recall that once we take the first-order derivative, the power of 2 comes down and cancels the 1/2. Furthermore, we rewrite the bounding constraint on the slack variables as a "regularisation" term with cost parameter C, which is essentially the same thing, as we are penalising the misclassification mistakes!
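(with the norm of beta set to 1/M as above, the margin constraints become y_i f(x_i) ≥ 1 − ξ_i)

$$
\min_{\beta,\, \beta_0} \;\; \frac{1}{2}\lVert \beta \rVert^{2} + C \sum_{i=1}^{N} \xi_i
\quad \text{subject to} \quad
\xi_i \ge 0, \quad y_i\left(x_i^{T}\beta + \beta_0\right) \ge 1 - \xi_i \;\; \forall i
$$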

Step 3 – Write Lagrangian (we are almost there).
The above gives us the constrained optimisation problem; now we put everything together to get the Lagrangian!
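Introducing multipliers α_i ≥ 0 for the margin constraints and μ_i ≥ 0 for the constraints ξ_i ≥ 0 (the letters used in [1]), the primal Lagrangian is

$$
L_P = \frac{1}{2}\lVert \beta \rVert^{2} + C\sum_{i=1}^{N} \xi_i
- \sum_{i=1}^{N} \alpha_i \left[\, y_i\left(x_i^{T}\beta + \beta_0\right) - (1 - \xi_i) \,\right]
- \sum_{i=1}^{N} \mu_i \xi_i
$$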

So our new goal is to minimise this Lagrangian with respect to the vector beta, the scalar beta_0 and the vector of slack variables xi (the multipliers alpha and mu will be handled by the dual). Can you see where all these terms come from? If not, let me know and I will break it down.
What else are we missing here? Right! We still need to apply our first-derivative tests (the KKT conditions) to see whether the solution we are attempting to find is truly optimal. Let’s write these down now.
We first introduce the Stationarity conditions. This is the same as if we were trying to determine a stationary point, where the derivative of the function is zero. As we are minimising with respect to the vector beta, the scalar beta_0 and the vector of xi, we need to ensure that all three pass these stationarity tests! Thus, we write
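(differentiating L_P with respect to β, β₀ and ξ_i in turn and setting each derivative to zero)

$$
\begin{aligned}
\frac{\partial L_P}{\partial \beta} = 0 \;&\Rightarrow\; \beta = \sum_{i=1}^{N} \alpha_i y_i x_i \\
\frac{\partial L_P}{\partial \beta_0} = 0 \;&\Rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0 \\
\frac{\partial L_P}{\partial \xi_i} = 0 \;&\Rightarrow\; \alpha_i = C - \mu_i \quad \forall i
\end{aligned}
$$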

From the primal form of the problem, we have the so-called Primal Feasibility conditions. Feasibility simply means that our optimal point lies in the feasible region, i.e. it satisfies the constraints!
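(these are just the constraints of the reformulated problem from Step 2)

$$
y_i\left(x_i^{T}\beta + \beta_0\right) - (1 - \xi_i) \ge 0, \qquad \xi_i \ge 0 \qquad \forall i
$$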

Going further, we have the Multiplier (dual feasibility) conditions, which come directly from the KKT conditions: the multipliers associated with inequality constraints must be non-negative. Since our problem has two sets of inequality constraints, we have two sets of multipliers, alpha and mu. Hence, we write
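(non-negativity of the multipliers for both sets of inequality constraints)

$$
\alpha_i \ge 0, \qquad \mu_i \ge 0 \qquad \forall i
$$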

Last but not least, we have the Complementary Slackness conditions. These conditions go hand in hand with the zero duality gap: they guarantee that the optimal values of the primal and the dual are the same! You can read more about how these conditions are obtained here.
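(each multiplier can be non-zero only if the corresponding constraint holds with equality)

$$
\alpha_i \left[\, y_i\left(x_i^{T}\beta + \beta_0\right) - (1 - \xi_i) \,\right] = 0, \qquad \mu_i \xi_i = 0 \qquad \forall i
$$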

Step 4 – The Lagrangian looks a bit scary. Let’s write the dual form!
Our Lagrangian looks very scary and we also have 9(!) constraints to account for. We already know that we have a convex problem, so we can find the global optimum even if we approach the problem through its dual form.
Let’s try reformulating our primal problem into its dual form. We take the three Stationarity conditions and plug them into the Lagrangian. We write
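Substituting β = Σ α_i y_i x_i, Σ α_i y_i = 0 and α_i = C − μ_i makes all the ξ and μ terms cancel, leaving the dual objective

$$
L_D = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, x_i^{T} x_j
$$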

To summarise, we have the following problem
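(the box constraint 0 ≤ α_i ≤ C follows from α_i = C − μ_i together with α_i ≥ 0 and μ_i ≥ 0)

$$
\begin{aligned}
\max_{\alpha} \quad & \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, x_i^{T} x_j \\
\text{subject to} \quad & 0 \le \alpha_i \le C, \quad i = 1, \dots, N, \\
& \sum_{i=1}^{N} \alpha_i y_i = 0
\end{aligned}
$$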

This is much simpler than the Lagrangian we have seen before. It is also much more flexible, because the slack variables we introduced allow for misclassifications – in the dual they show up only through the upper bound C on the alphas!
Look at this expression carefully. What does it remind you of? Exactly, it has the form of a quadratic program!
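In matrix form, with Q the N×N matrix with entries Q_ij = y_i y_j x_i^T x_j (and flipping the sign to turn the maximisation into a minimisation), the dual is exactly a quadratic program:

$$
\min_{\alpha} \;\; \frac{1}{2}\alpha^{T} Q\, \alpha - \mathbf{1}^{T}\alpha
\quad \text{subject to} \quad
0 \le \alpha_i \le C \;\; \forall i, \qquad y^{T}\alpha = 0
$$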

This means that now we can apply QP methods to solve this problem numerically 🚀
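To make this concrete, here is a minimal sketch of how the dual could be handed to an off-the-shelf QP solver. It assumes the cvxopt package, some toy 2-D Gaussian data and an arbitrary choice of C = 1; treat it as an illustration of the mapping above rather than a polished implementation.

```python
import numpy as np
from cvxopt import matrix, solvers

# Toy data: two Gaussian blobs in 2D, labels y_i in {-1, +1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (20, 2)), rng.normal(2.0, 1.0, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
N = len(y)
C = 1.0  # cost parameter (arbitrary choice for the example)

# Q_ij = y_i y_j x_i^T x_j; a tiny ridge keeps the solver numerically happy
Q = np.outer(y, y) * (X @ X.T) + 1e-8 * np.eye(N)

P = matrix(Q)                                      # quadratic term
q = matrix(-np.ones(N))                            # minus sign: cvxopt minimises
G = matrix(np.vstack([-np.eye(N), np.eye(N)]))     # encodes 0 <= alpha_i <= C
h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
A = matrix(y.reshape(1, -1))                       # encodes sum_i alpha_i y_i = 0
b = matrix(0.0)

alpha = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()

# Recover beta from the stationarity condition, beta_0 from a margin support vector
beta = (alpha * y) @ X
on_margin = (alpha > 1e-5) & (alpha < C - 1e-5)
beta_0 = np.mean(y[on_margin] - X[on_margin] @ beta)
print("beta:", beta, "beta_0:", beta_0)
```

The solver only ever sees the matrices P, q, G, h, A and b, which is exactly the point of Step 4: once the dual is written in standard QP form, any QP routine will do.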
The nicest thing – here comes a kernel function!
One of the nicest things about the dual form is that the data (our x) enters only through dot products. This means that we can apply any kind of transformation to our data beforehand! This is very handy, as hardly any real dataset can be separated linearly!
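Concretely, wherever the dot product x_i^T x_j appears in the dual (and in the resulting decision function), we may swap in a kernel function, with φ denoting some feature transformation we never have to compute explicitly:

$$
x_i^{T} x_j \;\longrightarrow\; K(x_i, x_j) = \left\langle \varphi(x_i), \varphi(x_j) \right\rangle
$$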
References:
[1] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning (Springer Series in Statistics), 2nd ed., 2009. [Online]. Available: http://www.springerlink.com/index/D7X7KX6772HQ2135.pdf