
Convex ERM

Let’s take a step back for a moment and remember why we are minimizing convex functions. Our running example is empirical risk minimization, namely the minimization over some function class \({\mathcal{F}}\) of \[ {\widehat{{\mathcal{R}}}}(f) := \frac 1 n \sum_{i=1}^n \ell(-y_i f(x_i)), \label{eq:1} \] where \(x_i \in {\mathbb{R}}^d\) and \(y_i \in \{-1,+1\}\). This procedure is selecting a predictor \({\mathcal{F}}\ni f : {\mathbb{R}}^d \to {\mathbb{R}}\); for instance, we’ve extensively discussed the case of linear predictors \({\mathcal{F}}:= \{ x \mapsto {\left\langle w, x \right \rangle} : w\in{\mathbb{R}}^d\}\), and we’ll allow the notation \[ {\widehat{{\mathcal{R}}}}(w) := {\widehat{{\mathcal{R}}}}(x\mapsto {\left\langle w, x \right \rangle}) = \frac 1 n \sum_{i=1}^n \ell(-y_i {\left\langle w, x_i \right \rangle}). \] (Recall that “linear” can be quite flexible by working with \(x_i\) in some richer space, for instance in boosting and SVMs.)
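To make the objective concrete, here is a minimal Python sketch of \({\widehat{{\mathcal{R}}}}\) for linear predictors. No particular \(\ell\) is fixed at this point in the notes, so the logistic loss \(\ell(z) = \log(1+e^z)\) below is only an illustrative convex choice, and the function names are placeholders.

```python
import numpy as np

def logistic_loss(z):
    # ell(z) = log(1 + e^z): one convex, nondecreasing surrogate loss,
    # written via logaddexp for numerical stability at large |z|.
    return np.logaddexp(0.0, z)

def surrogate_risk(w, X, y, loss=logistic_loss):
    # Empirical convex risk (1/n) sum_i ell(-y_i <w, x_i>) of the
    # linear predictor x -> <w, x>, for X of shape (n, d), y in {-1, +1}^n.
    margins = y * (X @ w)
    return loss(-margins).mean()
```

At \(w = 0\) every margin is zero, so this sketch returns \(\ell(0) = \log 2\) regardless of the data.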

To convert this into a linear classifier, one needs a map to \(\{-1,+1\}\), for instance \(x\mapsto {\text{sgn}}({\left\langle w, x \right \rangle})\).

Key question: how well does this convex procedure solve the original problem, namely the minimization of the classification error \[ {\widehat{{\mathcal{R}}_{\text{z}}}}(f) := \frac 1 n \sum_{i=1}^n {\mathbf{1}}[ {\text{sgn}}(f(x_i)) \neq y_i]\ ? \label{eq:2} \]
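For comparison, the zero-one objective for the same linear predictor can be sketched as follows; the value of \({\text{sgn}}(0)\) is not pinned down above, so the tie-breaking convention here is an assumption.

```python
import numpy as np

def zero_one_risk(w, X, y):
    # Classification error (1/n) sum_i 1[sgn(<w, x_i>) != y_i].
    preds = np.sign(X @ w)
    preds[preds == 0] = 1   # assumed convention: break the sgn(0) tie toward +1
    return float(np.mean(preds != y))
```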

This lecture and the next will give one negative and one positive result on this question.

Example losses.

Remark.

A negative scenario

Theorem (See Ben-David et al. (2012) for a similar result). Let \(\epsilon \in (0,1)\) and scalar \(r > 0\) be given. There exists a set of \(n = {\mathcal{O}}(1/\epsilon)\) labeled points \(((x_i,y_i))_{i=1}^n\) satisfying the following conditions: some linear predictor attains classification error \({\widehat{{\mathcal{R}}_{\text{z}}}}\) at most \({\epsilon}\), yet every minimizer \(\bar w\) of the convex empirical risk \({\widehat{{\mathcal{R}}}}\) satisfies \({\widehat{{\mathcal{R}}_{\text{z}}}}(\bar w) \geq 1 - {\epsilon}\).

Proof. [ picture showing the location and probability mass of the two distinct data points. ]

Choose any integer \(n\) with \({\epsilon}/2 \leq 1/n \leq {\epsilon}\), meaning \(n = {\mathcal{O}}(1/{\epsilon})\). Work in dimension \(d = 1\): place \(n-1\) points at \(+1\) with label \(+1\), and \(1\) point at \(-c\) where \(c:= n / r\), again with label \(+1\). Any predictor \(\hat w > 0\) then errs only on the lone point at \(-c\), achieving classification error \({\widehat{{\mathcal{R}}_{\text{z}}}}(\hat w) = 1/n \leq {\epsilon}\).

The convex empirical risk of \(w\in {\mathbb{R}}\) is \[ {\widehat{{\mathcal{R}}}}(w) := \frac 1 n \left(\ell(c w) + \sum_{i=1}^{n-1} \ell(-w)\right) = \frac 1 n \ell(cw) + \frac {n-1}{n} \ell(-w). \] [ The proof of existence of minimizers is omitted; for instance, the conditions on \(\ell\) imply bounded level sets. ] First note that \(\min\partial {\widehat{{\mathcal{R}}}}(0) > 0\): \[ \begin{aligned} \min (\partial {\widehat{{\mathcal{R}}}})(0) &= \min\left( \frac 1 n (\partial \ell(cw))(0) + \frac {n-1}{n} (\partial \ell(-w))(0) \right) \\ &= \min\left( \frac c n (\partial \ell)(0) - \frac {n-1}{n} (\partial \ell)(0) \right) \\ &\geq \frac c n \min(\partial \ell)(0) - \frac {n-1}{n} \max(\partial \ell)(0) \\ &\geq (\max (\partial\ell)(0)) \left( \frac {cr} n - \frac {n-1}{n}\right) \\ &> 0. \end{aligned} \] (The last two inequalities use the conditions on \(\ell\) together with \(cr/n = 1\), which holds since \(c = n/r\).)
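As a quick numerical sanity check of the displayed computation, under purely illustrative assumptions (the logistic loss, whose derivative at \(0\) is \(1/2\), together with \(r = 1\) and \(n = 10\), hence \(c = n/r = 10\)):

```python
import numpy as np

def sigmoid(z):
    # derivative of the logistic loss log(1 + e^z)
    return 1.0 / (1.0 + np.exp(-z))

n, r = 10, 1.0          # illustrative choices, not part of the statement
c = n / r

# d/dw [ (1/n) ell(c w) + ((n-1)/n) ell(-w) ] evaluated at w = 0
grad_at_zero = (c / n) * sigmoid(0.0) - ((n - 1) / n) * sigmoid(0.0)
print(grad_at_zero)     # 0.05 = (1/2) * (c*r/n - (n-1)/n) > 0
```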

Since \({\widehat{{\mathcal{R}}}}\) is convex, every \(w > 0\) then satisfies \({\widehat{{\mathcal{R}}}}(w) \geq {\widehat{{\mathcal{R}}}}(0) + w \min\partial{\widehat{{\mathcal{R}}}}(0) > {\widehat{{\mathcal{R}}}}(0)\), and \(0 \notin \partial{\widehat{{\mathcal{R}}}}(0)\) means \(w = 0\) is not a minimizer either; thus every minimizer \(\bar w\) (which exists by the bracketed remark above) has \(\bar w < 0\).

This means any optimal choice \(\bar w\) correctly classifies only the lone point at \(-c\), giving \({\widehat{{\mathcal{R}}_{\text{z}}}}(\bar w) = 1 - 1/n \geq 1- {\epsilon}\). \({\qquad\square}\)
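The construction is also easy to simulate. The sketch below is a numerical illustration under assumed choices (logistic loss, \(r = 1\), \(n = 10\), and a crude grid search for the minimizer); it reproduces both claims: a positive \(w\) attains classification error \(1/n\), while the minimizer of the convex risk is negative and errs on everything except the lone point at \(-c\).

```python
import numpy as np

n, r = 10, 1.0                      # illustrative choices
c = n / r

# n-1 points at +1 and one point at -c, all labeled +1 (so d = 1 here).
X = np.concatenate([np.ones(n - 1), [-c]]).reshape(-1, 1)
y = np.ones(n)

def surrogate_risk(w, X, y):
    # (1/n) sum_i log(1 + exp(-y_i <w, x_i>)), the logistic surrogate
    return np.logaddexp(0.0, -y * (X @ w)).mean()

def zero_one_risk(w, X, y):
    # (1/n) sum_i 1[sgn(<w, x_i>) != y_i], with the sgn(0) tie broken toward +1
    preds = np.sign(X @ w)
    preds[preds == 0] = 1
    return float(np.mean(preds != y))

ws = np.linspace(-5.0, 5.0, 10001)  # crude grid search over scalar w
risks = np.array([surrogate_risk(np.array([w]), X, y) for w in ws])
w_bar = np.array([ws[np.argmin(risks)]])

print(w_bar)                                  # a small negative number
print(zero_one_risk(np.array([1.0]), X, y))   # 1/n = 0.1
print(zero_one_risk(w_bar, X, y))             # 1 - 1/n = 0.9
```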

A positive scenario

[ In class, about \(20\) minutes were spent explaining how the “negative scenario” can be defeated, which naturally leads to the positive scenario. This material is deferred to the notes for the next lecture. ]

[ Note to future matus: the class liked this topic a lot and there was a lot of time spent on various questions, and based on a vote it was split into two lectures. Not sure what to do next year, though. ]

References

Ben-David, Shai, David Loker, Nathan Srebro, and Karthik Sridharan. 2012. “Minimizing the Misclassification Error Rate Using a Surrogate Convex Loss.” In ICML.

Guruswami, Venkatesan, and Prasad Raghavendra. 2006. “Hardness of Learning Halfspaces with Noise.” In FOCS.

Impagliazzo, Russell. 1995. “Hard-Core Distributions for Somewhat Hard Problems.” In FOCS, 538–45.

Zhang, Tong. 2004a. “Statistical Analysis of Some Multi-Category Large Margin Classification Methods.” JMLR 5: 1225–51.

———. 2004b. “Statistical Behavior and Consistency of Classification Methods Based on Convex Risk Minimization.” The Annals of Statistics 32: 56–85.