Administrative/Meta:

**New room!** Siebel 1214.

Office hours **cancelled** on Monday, 9/26. Instead, we voted to have office hours Saturday, 9/24, 1-3pm. We will also add extra office hours next weekend.

No class next Wednesday.

Schedule change: SVMs today, since they continue the duality theme and the introduction of risk minimization.

A very thorough reference for SVMs is Steinwart and Christmann (2008).

I have an obsession with the minus signs in Fenchel duality, in particular with convex ERM (they’ve changed in this lecture, versus the last one…).

This lecture serves two purposes: giving a concrete instance of the abstract risk minimization setup from last time, and introducing some key theory of the SVM (support vector machine).

**Remark.** It’s worth asking: now that everyone talks about neural nets all day, why bring up SVMs?

We have few insights into non-linear function classes; SVMs provide one.

There’s still lots of interesting research going on; for instance see “kernel mean embedding” literature.

Let’s motivate the form of the SVM optimization problem geometrically; a later lecture will justify our abstract convex risk minimization setup more systematically (lecture “consistency of convex risk minimization”). In particular, this derivation is **just a heuristic**, but still illuminative.

Let data \((x_i)_{i=1}^n\) with \(x_i \in {\mathbb{R}}^d\), and labels \((y_i)_{i=1}^n\) with \(y_i \in\{-1,+1\}\) be given. Within the goal of linear classification (finding \(w\in{\mathbb{R}}^d\) and predicting \(x \mapsto {\mathbf{1}}[ w^\top x \geq 0 ]\)), a reasonable idea is to be correct with some *margin*, meaning we seek a feasible point to the following problem: \[
\text{find } w\in{\mathbb{R}}^d \centerdot \|w\|_2 = 1, \forall i \centerdot y_i{\left\langle w, x_i \right \rangle} \geq 1.
\] Geometrically, all points \(x_i\) have distance at least \(1\) from the hyperplane \(\{x\in{\mathbb{R}}^d : {\left\langle w, x \right \rangle} = 0\}\), and the side they fall on depends on \(y_i\). *(Picture drawn in class.)*

It may happen that a hyperplane separating the points exists, but only by some positive distance less than one. An easier objective is \[ \min\left\{ \|w\|_2^2 : w\in{\mathbb{R}}^d, \forall i\centerdot y_i{\left\langle w, x_i \right \rangle} \geq 1\right\}. \] An optimal solution \(\bar w\) will necessarily have its closest examples at distance \(1/\|\bar w\|_2\) from \(\{x\in{\mathbb{R}}^d : {\left\langle \bar w, x \right \rangle} = 0\}\).

It’s still possible that the constraints are infeasible, due to inputs that are not (strictly) linearly separable (e.g., the “xor” from lecture 2). Thus for every example \(i\), introduce a variable \(\epsilon_i\) giving some slack on that example: \[ \min\left\{ \frac{\lambda}{2} \|w\|_2^2 + \sum_i \epsilon_i : w\in{\mathbb{R}}^d, \epsilon \in {\mathbb{R}}_+^n, \forall i\centerdot y_i{\left\langle w, x_i \right \rangle} \geq 1 - \epsilon_i\right\}. \] The parameter \(\lambda \geq 0\) lets us trade off between the competing goals of having \(\|w\|_2\) small and having \(\sum_i \epsilon_i\) small:

- \(\lambda\) large prioritizes a small \(\|w\|_2\): more points may end up with positive slack (on the wrong side of the margin, or even of the hyperplane), but the points that do satisfy the constraint are geometrically far from the hyperplane.
- \(\lambda\) small prioritizes small total slack: more points end up on the right side of the hyperplane, but with little guarantee that many are far from it.

As a final simplification, note the optimal \(\bar \epsilon\) satisfies \(\bar \epsilon_i = \max\{0, 1 - y_i {\left\langle \bar w, x_i \right \rangle}\}\). Thus we end up with the regularized ERM problem \[ \min_{w\in{\mathbb{R}}^d} \frac \lambda 2 \|w\|_2^2 + \sum_{i=1}^n \max\{0, 1 - y_i {\left\langle w, x_i \right \rangle}\}. \]
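As a quick illustration, this objective can be minimized directly by subgradient descent. The following is only a sketch, with hypothetical synthetic data and a Pegasos-style \(1/(\lambda t)\) step size, both my choices rather than anything from the lecture:

```python
import numpy as np

def svm_objective(w, X, y, lam):
    """(lam/2)||w||^2 + sum_i max{0, 1 - y_i <w, x_i>}."""
    margins = y * (X @ w)
    return 0.5 * lam * (w @ w) + np.maximum(0.0, 1.0 - margins).sum()

def svm_subgradient(w, X, y, lam):
    """A subgradient of the objective: lam*w minus y_i x_i over violating examples."""
    active = y * (X @ w) < 1.0                       # examples with positive hinge loss
    return lam * w - (y[active, None] * X[active]).sum(axis=0)

rng = np.random.default_rng(0)
n, d, lam = 40, 2, 1.0
X = rng.normal(size=(n, d))                          # hypothetical inputs
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=n))      # nearly linearly separable labels

w = np.zeros(d)
best = svm_objective(w, X, y, lam)                   # at w = 0 the objective is just n
for t in range(1, 2001):
    w -= (1.0 / (lam * t)) * svm_subgradient(w, X, y, lam)  # Pegasos-style step
    best = min(best, svm_objective(w, X, y, lam))
```

Since subgradient descent is not monotone, the sketch tracks the best objective value seen along the way.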

**Remarks.**

The (non-convex!) problem \[ \min\left\{ \sum_{i=1}^n \max\{0, 1 - y_i {\left\langle w, x_i \right \rangle}\} : w\in{\mathbb{R}}^d, \|w\|_2 = 1\right\} \] has a pleasing interpretation: the objective value, for any \(w\) with \(\|w\|_2=1\), is the distance we must move all the points to be distance at least \(1\) from (and on the correct side of) hyperplane \(\{ x\in{\mathbb{R}}^d : {\left\langle w, x \right \rangle} = 0\}\).
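This interpretation is easy to check numerically: moving each violated point by its hinge value along \(\pm w\) puts it exactly at margin \(1\), and the distance moved equals its term in the objective. A small sketch with hypothetical data of my own construction:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 20, 4
X = rng.normal(size=(n, d))                       # hypothetical points
y = np.where(rng.random(n) < 0.5, 1.0, -1.0)
w = rng.normal(size=d)
w /= np.linalg.norm(w)                            # any unit-norm direction

hinge = np.maximum(0.0, 1.0 - y * (X @ w))        # per-example objective terms
X_moved = X + (y * hinge)[:, None] * w            # shift each point along +/- w by its hinge

margins = y * (X_moved @ w)                       # every moved point has margin >= 1
moved_dist = np.linalg.norm(X_moved - X, axis=1)  # movement per point equals its hinge term
```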

Another convention is to introduce a bias parameter \(b\in{\mathbb{R}}\), and learn a predictor \(x\mapsto 2\cdot {\mathbf{1}}[{\left\langle w, x \right \rangle} \geq b] - 1\). This has various consequences for the learning problem (particularly with the kernel), in part because \(b\) is not regularized.

Let’s return to the optimization setup from last time. Define **hinge loss** \(\ell\) (and per-coordinate variant \(\ell_i\)) as \[
\ell(z) := \max\{0, 1+z\}
\qquad
\ell_i(v) := \ell(v_i).
\] Collect all examples \(((x_i,y_i))_{i=1}^n\) as the first \(n\) rows of matrix \(A\in {\mathbb{R}}^{m\times d}\), meaning \(A_{ij} = y_i (x_i)_j\). For now, leave the rows \(i \in \{n+1,\ldots, m\}\) undefined; the case \(m>n\) will be useful shortly.
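A tiny sketch of this bookkeeping (hypothetical data, with \(m = n\) for simplicity): the rows \(A_{ij} = y_i (x_i)_j\) make \(\ell_i(-Aw)\) coincide with the hinge loss written directly in terms of \((x_i, y_i)\).

```python
import numpy as np

# Hypothetical tiny sample, with m = n = 3 (no extra unlabeled rows).
X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.5, -1.0]])
y = np.array([1.0, -1.0, -1.0])
A = y[:, None] * X                           # A_ij = y_i (x_i)_j
w = np.array([1.0, -0.5])

# ell(z) = max{0, 1+z}, applied coordinate-wise at z = (-Aw)_i ...
hinges = np.maximum(0.0, 1.0 + (-(A @ w)))
# ... agrees with max{0, 1 - y_i <w, x_i>} computed from (x_i, y_i) directly.
direct = np.maximum(0.0, 1.0 - y * (X @ w))
```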

With this notation, the SVM primal and dual are as follows.

**Theorem** (Baby Representer Theorem)**.** Suppose \(\lambda > 0\). Then \[
\min\left\{
\sum_{i=1}^n \ell_i(-Aw) + \frac \lambda 2 \|w\|_2^2 : w\in{\mathbb{R}}^d
\right\}
=
\max\left\{
\sum_{i=1}^n s_i - \frac 1 {2\lambda} \|A^\top s\|_2^2 : s \in [0,1]^n \times \{0\}^{m-n}
\right\}.
\] Primal-dual optimal pairs \((\bar w, \bar s)\) always exist. \(\bar s\) is optimal iff it has the following form: \[
\bar s_i \in \begin{cases}
\{0\} & i > n,
\\
\{0\} & i \leq n, (A\bar w)_i > 1,
\\
[0,1] & i \leq n, (A\bar w)_i = 1,
\\
\{1\} & i \leq n, (A\bar w)_i < 1.
\end{cases}
\] Lastly, \(\bar w\) is unique, and has the form \(\bar w = A^\top \bar s / \lambda\).
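Weak duality in this theorem is easy to sanity-check numerically: writing the dual objective in the classical sign convention \(s \mapsto \sum_i s_i - \frac{1}{2\lambda}\|A^\top s\|_2^2\), every dual-feasible \(s\) lower-bounds every primal value; indeed the gap is at least \(\frac\lambda2\|w - A^\top s/\lambda\|_2^2\). A sketch with hypothetical random data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 30, 3, 2.0
X = rng.normal(size=(n, d))                   # hypothetical data
y = np.where(rng.random(n) < 0.5, 1.0, -1.0)
A = y[:, None] * X                            # m = n here

def primal(w):
    return 0.5 * lam * (w @ w) + np.maximum(0.0, 1.0 - A @ w).sum()

def dual(s):                                  # classical sign convention
    v = A.T @ s
    return s.sum() - (v @ v) / (2.0 * lam)

# Weak duality: dual(s) <= primal(w) for every s in [0,1]^n and every w,
# since primal(w) - dual(s) >= (lam/2) * ||w - A^T s / lam||^2 >= 0.
best_dual = max(dual(rng.random(n)) for _ in range(100))
some_primal = min(primal(rng.normal(size=d)) for _ in range(100))
```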

**Remarks.**

**Support vectors.** There may be more than one dual optimum, but all of them are zero on every coordinate except those with \(i \leq n\) and \((A\bar w)_i \leq 1\). Expanding the definition of \(A\), \(\bar s_i > 0\) implies \(y_i {\left\langle \bar w, x_i \right \rangle} \leq 1\); such vectors are called **support vectors**. This is consistent with our earlier geometric development: we were finding a vector \(w\) so that \(y_i {\left\langle w, x_i \right \rangle} \geq 1\) for all \((x_i,y_i)\), and for any pair with \(y_j {\left\langle \bar w, x_j \right \rangle} > 1\) (where \(\bar w\) is optimal), we can wiggle \(x_j\) a little without affecting the optimal choice of \(\bar w\). Similarly, \(\bar s\) is completely determined by those examples which have \((A\bar w)_i \leq 1\).

**Kernel trick.** Define the **gram matrix** \(G := A A^\top\); by definition of \(A\), \(G_{ij} = y_i y_j {\left\langle x_i, x_j \right \rangle}\) for labeled rows \(i,j \leq n\). Note that \(\|A^\top s\|_2^2 = s^\top G s\), meaning the dual can be written without \(A\). So if we solve the problem in the dual, it seems we never need the full matrices \(A\) or \(G\), and can just do pairwise computations on the fly. We still seem to need a copy of \(A\) at prediction time: our prediction on an example \(i\) is \((A\bar w)_i\). This is the purpose of having \(m > n\): pack the unlabeled examples we care about at indices \(i>n\), meaning \(A_{ij} = (x_i)_j\) for \(i > n\), where there is no corresponding \(y_i\). But \(\bar w = A^\top \bar s / \lambda\), thus \((A\bar w)_i = (A A^\top \bar s / \lambda)_i = (G\bar s / \lambda)_i\), so once again at evaluation time we just need \(G\), or rather pairwise evaluations.
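The identity \((A\bar w)_i = (G\bar s/\lambda)_i\) is purely algebraic, so it holds for any \(s\) supported on the labeled rows, not just a dual optimum. A small sketch, with hypothetical data and a random, non-optimal \(s\):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, d, lam = 5, 8, 3, 1.0
X = rng.normal(size=(m, d))              # rows 1..n labeled, rows n+1..m unlabeled
y = np.where(rng.random(n) < 0.5, 1.0, -1.0)
A = X.copy()
A[:n] *= y[:, None]                      # A_ij = y_i (x_i)_j on labeled rows only

s = np.zeros(m)
s[:n] = rng.random(n)                    # any s supported on the labeled rows

G = A @ A.T                              # gram matrix of pairwise inner products
w = A.T @ s / lam                        # the representer form w = A^T s / lam

pred_w = A @ w                           # predictions via the explicit weights ...
pred_G = G @ s / lam                     # ... equal predictions via G alone
```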

Together, this gives the **kernel trick**: by optimizing SVMs in the dual and then predicting with \(A\bar w\), we can avoid explicit vector representations of examples and work only with inner products \({\left\langle x_i, x_j \right \rangle}\), which we can define as we wish. **This will be discussed properly in the next lecture;** the presentation I gave threw many people off.

**Dual optimization.** For a long time, optimizing SVMs was done in the dual, namely via **dual coordinate ascent**, something we’ll discuss in a few lectures. There are all sorts of ridiculous tricks here. *[ Todo to future matus: tell us some tricks please. ]*

**Uniqueness in the dual.** If \(G\) is invertible (namely, if it is symmetric positive definite; it is symmetric positive semi-definite by construction), then the dual objective is strongly concave and has a unique optimum. On the other hand, it may seem weird that \(\bar w\) is unique but satisfies \(\bar w = A^\top \bar s / \lambda\), where \(\bar s\) is *any* dual optimum. The resolution is related to the preceding point: any degree of freedom in \(\bar s\) is washed out by \(A^\top\), meaning \(A^\top \bar s\) is still unique.

**Representer theorem.** This theorem is usually presented as follows. If we make the matrix \(A\) bigger and bigger (while holding the training sample fixed), we can view it as a mapping into an infinite-dimensional space of functions; namely, \(Aw\) is now a function, and \((Aw)_i\) is its value on point \(i\), where we have infinitely many options for \(i\). As we’ll see shortly, the theorem still goes through, and the essential bit is that we can still write \((A \bar w)_i = (G \bar s / \lambda)_i\), meaning we can **represent** the optimum by just the nonzero entries (and where they appear) of \(\bar s\).

**Note:** some more material was presented after this, but the presentation wasn’t great; see the next lecture for a cleaned-up version.

*[ matus notes to future self: can prove infdim attainment via double-duality. should include something about primal-dual error gaps for approximate optima. ]*

Steinwart, Ingo, and Andreas Christmann. 2008. *Support Vector Machines*. 1st ed. Springer.