SVM basics

This lecture serves two purposes: giving a concrete instance of the abstract risk minimization setup from last time, and introducing some key theory of the SVM (support vector machine).

Remark. It’s worth asking: now that everyone talks about neural nets all day, why bring up SVMs?

Let’s motivate the form of the SVM optimization problem geometrically; a later lecture (“consistency of convex risk minimization”) will justify our abstract convex risk minimization setup more systematically. To be clear, this derivation is only a heuristic, but it is still illuminating.

Let data \((x_i)_{i=1}^n\) with \(x_i \in {\mathbb{R}}^d\), and labels \((y_i)_{i=1}^n\) with \(y_i \in\{-1,+1\}\) be given. Toward the goal of linear classification (finding \(w\in{\mathbb{R}}^d\) and predicting via \(x \mapsto \operatorname{sgn}( w^\top x )\)), a reasonable idea is to be correct with some margin, meaning we seek a feasible point of the following problem: \[ \text{find } w\in{\mathbb{R}}^d \centerdot \|w\|_2 = 1, \forall i \centerdot y_i{\left\langle w, x_i \right \rangle} \geq 1. \] Geometrically, all points \(x_i\) have distance at least \(1\) from the hyperplane \(\{x\in{\mathbb{R}}^d : {\left\langle w, x \right \rangle} = 0\}\), and the side they fall on is determined by \(y_i\). (Picture drawn in class.)

It may happen that a hyperplane separating the points exists, but only with some positive margin less than one, in which case the preceding problem is infeasible. An easier problem to satisfy drops the norm constraint (rescaling \(w\) can compensate for a small margin) and instead minimizes the norm: \[ \min\left\{ \|w\|_2^2 : w\in{\mathbb{R}}^d, \forall i\centerdot y_i{\left\langle w, x_i \right \rangle} \geq 1\right\}. \] An optimal solution \(\bar w\) will necessarily have some examples at distance exactly \(1/\|\bar w\|_2\) from the hyperplane \(\{x\in{\mathbb{R}}^d : {\left\langle \bar w, x \right \rangle} = 0\}\); otherwise \(\bar w\) could be scaled down while remaining feasible, contradicting optimality.
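As a concrete illustration (a minimal sketch, not from the lecture, assuming the cvxpy package), here is this minimum-norm problem on some made-up separable toy data; all names and data below are invented for the example.

```python
import cvxpy as cp
import numpy as np

# Made-up separable toy data: rows of X are the x_i, entries of y are the labels.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Minimize ||w||_2^2 subject to y_i <w, x_i> >= 1 for all i.
w = cp.Variable(X.shape[1])
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)),
                     [cp.multiply(y, X @ w) >= 1])
problem.solve()

w_bar = w.value
distances = y * (X @ w_bar) / np.linalg.norm(w_bar)
# The closest examples sit at distance 1 / ||w_bar||_2 from the hyperplane.
print(distances.min(), 1.0 / np.linalg.norm(w_bar))
```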

It’s still possible for these constraints to be infeasible, since the inputs may not be (strictly) linearly separable (e.g., the “xor” data from lecture 2). Thus, for every example \(i\), introduce a slack variable \(\epsilon_i\): \[ \min\left\{ \frac{\lambda}{2} \|w\|_2^2 + \sum_i \epsilon_i : w\in{\mathbb{R}}^d, \epsilon \in {\mathbb{R}}_+^n, \forall i\centerdot y_i{\left\langle w, x_i \right \rangle} \geq 1 - \epsilon_i\right\}. \] The parameter \(\lambda \geq 0\) trades off the competing goals of having \(\|w\|_2\) small and having \(\sum_i \epsilon_i\) small: a larger \(\lambda\) puts more weight on shrinking \(\|w\|_2\) (enlarging the margin), while a smaller \(\lambda\) puts more weight on shrinking the total slack.

As a final simplification, note that for any fixed \(w\), the optimal slack is \(\bar \epsilon_i = \max\{0, 1 - y_i {\left\langle w, x_i \right \rangle}\}\), the smallest value permitted by the constraints on \(\epsilon_i\). Substituting this in, we end up with the regularized ERM problem \[ \min_{w\in{\mathbb{R}}^d} \frac \lambda 2 \|w\|_2^2 + \sum_{i=1}^n \max\{0, 1 - y_i {\left\langle w, x_i \right \rangle}\}. \]
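To make this objective concrete, here is a minimal numpy sketch of subgradient descent on the regularized ERM problem above; the data, step size, and iteration count are arbitrary illustrative choices, not part of the lecture.

```python
import numpy as np

def objective(w, X, y, lam):
    """(lam/2) ||w||^2 + sum_i max(0, 1 - y_i <w, x_i>)."""
    margins = y * (X @ w)
    return 0.5 * lam * (w @ w) + np.maximum(0.0, 1.0 - margins).sum()

def subgradient(w, X, y, lam):
    """One valid subgradient: lam * w - sum over {i : margin_i < 1} of y_i x_i."""
    margins = y * (X @ w)
    active = margins < 1.0            # examples with positive hinge loss
    return lam * w - (y[active, None] * X[active]).sum(axis=0)

def subgradient_descent(X, y, lam=0.1, step=0.01, iters=2000):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= step * subgradient(w, X, y, lam)
    return w

# Made-up toy data (same conventions as in the text).
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w_hat = subgradient_descent(X, y)
print(w_hat, objective(w_hat, X, y, lam=0.1))
```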

Remarks.

SVM Duality

Let’s return to the optimization setup from last time. Define hinge loss \(\ell\) (and per-coordinate variant \(\ell_i\)) as \[ \ell(z) := \max\{0, 1+z\} \qquad \ell_i(v) := \ell(v_i). \] Collect all examples \(((x_i,y_i))_{i=1}^n\) as the first \(n\) rows of matrix \(A\in {\mathbb{R}}^{m\times d}\), meaning \(A_{ij} = y_i (x_i)_j\). For now, leave the rows \(i \in \{n+1,\ldots, m\}\) undefined; the case \(m>n\) will be useful shortly.
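As a small sketch of this notation (taking \(m = n\)), the matrix \(A\) and the losses \(\ell_i(-Aw)\) might be formed as follows; the data and the candidate \(w\) below are made up for illustration.

```python
import numpy as np

# Made-up data: rows A_i = y_i * x_i, so (A w)_i = y_i <w, x_i>.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
A = y[:, None] * X                # m = n here; extra rows would be appended later

def hinge(z):
    """ell(z) = max(0, 1 + z), applied coordinate-wise."""
    return np.maximum(0.0, 1.0 + z)

w = np.array([0.5, 0.5])          # an arbitrary candidate predictor
losses = hinge(-A @ w)            # ell_i(-A w) = max(0, 1 - y_i <w, x_i>)
print(losses, losses.sum())
```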

With this notation, the SVM primal and dual are as follows.

Theorem (Baby Representer Theorem). Suppose \(\lambda > 0\). Then \[ \min\left\{ \sum_{i=1}^n \ell_i(-Aw) + \frac \lambda 2 \|w\|_2^2 : w\in{\mathbb{R}}^d \right\} = \max\left\{ \sum_{i=1}^n s_i - \frac 1 {2\lambda} \|A^\top s\|_2^2 : s \in [0,1]^n \times \{0\}^{m-n} \right\}. \] Primal-dual optimal pairs \((\bar w, \bar s)\) always exist. Moreover, \(\bar s\) is optimal iff it has the following form: \[ \forall i \centerdot\ \bar s_i \in \begin{cases} \{0\} & i > n, \\ \{0\} & i \leq n, (A\bar w)_i > 1, \\ [0,1] & i \leq n, (A\bar w)_i = 1, \\ \{1\} & i \leq n, (A\bar w)_i < 1. \end{cases} \] Lastly, \(\bar w\) is unique, and has the form \(\bar w = A^\top \bar s / \lambda\).
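As a numerical sanity check on the theorem (not part of the lecture, assuming cvxpy and taking \(m = n\)), the following sketch solves both the primal and the dual on made-up toy data and recovers \(\bar w = A^\top \bar s / \lambda\) from the dual solution; \(\lambda\) and the data are arbitrary choices.

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
A = y[:, None] * X
n, d = A.shape
lam = 0.1

# Dual: maximize sum(s) - (1/(2*lam)) ||A^T s||^2 over s in [0,1]^n.
s = cp.Variable(n)
dual = cp.Problem(cp.Maximize(cp.sum(s) - cp.sum_squares(A.T @ s) / (2 * lam)),
                  [s >= 0, s <= 1])
dual.solve()

# Primal: minimize sum_i max(0, 1 - (A w)_i) + (lam/2) ||w||^2.
w = cp.Variable(d)
primal = cp.Problem(cp.Minimize(cp.sum(cp.pos(1 - A @ w))
                                + (lam / 2) * cp.sum_squares(w)))
primal.solve()

w_bar = A.T @ s.value / lam       # representer form from the theorem
print("dual value:   ", dual.value)
print("primal value: ", primal.value)   # should match the dual value
print("w from dual:  ", w_bar)
print("w from primal:", w.value)        # should match w_bar
```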

Remarks.

Note: some more material was presented after this, but the presentation wasn’t great; see the next lecture for a cleaned-up version.

[ matus notes to future self: can prove infdim attainment via double-duality. should include something about primal-dual error gaps for approximate optima. ]
