Using the tools from the last class, we obtained the following control on a single predictor \(f:{\mathbb{R}}^d\to{\mathbb{R}}\): with probability at least \(1-\delta\) over an i.i.d. draw of \(((x_i,y_i))_{i=1}^n\) from distribution \((X,Y)\), \[ {{\mathcal{R}}_{\text{z}}}({\text{sgn}}(f)) \leq {\widehat{{\mathcal{R}}_{\text{z}}}}({\text{sgn}}(f)) + \sqrt{\frac 1 {2n}\ln\left(\frac 1 \delta\right)}, \] where \[ {{\mathcal{R}}_{\text{z}}}(g) := {\text{Pr}}[ g(X) \neq Y] \qquad\text{and}\qquad {\widehat{{\mathcal{R}}_{\text{z}}}}(g) := \frac 1 n \sum_{i=1}^n {\mathbf{1}}[g(x_i) \neq y_i]. \]
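This deviation bound is easy to check numerically. The sketch below is a minimal Monte Carlo experiment under an assumed toy distribution (not from the notes): the fixed predictor errs on each example independently with probability \(0.2\), so its true risk is exactly \(0.2\), and Hoeffding guarantees the bound fails with frequency at most \(\delta\).

```python
import math
import random

def hoeffding_failure_rate(n=200, delta=0.05, trials=2000, seed=0):
    """Estimate how often R(sgn(f)) exceeds Rhat(sgn(f)) + sqrt(ln(1/delta)/(2n)).

    Toy setup (an assumption, not from the notes): the fixed predictor
    errs on each example independently with probability 0.2, so its
    true risk is exactly 0.2."""
    rng = random.Random(seed)
    eps = math.sqrt(math.log(1 / delta) / (2 * n))
    true_risk = 0.2
    failures = 0
    for _ in range(trials):
        # empirical risk = fraction of errors in a fresh i.i.d. sample
        emp_risk = sum(rng.random() < true_risk for _ in range(n)) / n
        failures += true_risk > emp_risk + eps
    return failures / trials  # Hoeffding guarantees this is at most delta
```

In practice the observed failure rate is far below \(\delta\); Hoeffding is loose here because it ignores the variance of the indicators.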

Our goal is to control not a single predictor, but the output of a training algorithm. The next two lectures will establish our basic tools here as follows.

Independence issue / overfitting

Suppose we obtain examples \(((x_i,y_i))_{i=1}^n\), feed them to an algorithm, and obtain a predictor \(\hat f\). What prevents us from applying the earlier proof technique?

Earlier, we defined random variables \(Z_i := {\mathbf{1}}[{\text{sgn}}(\hat f(X_i)) \neq Y_i]\) and \(W_i := Z_i - {\mathbb{E}}(Z_i)\), and then applied Hoeffding’s inequality to \((W_1,\ldots,W_n)\). The issue is that Hoeffding’s inequality requires \((W_1,\ldots,W_n)\) to be independent, but \(\hat f\) is a random variable depending on the entire sample \(((X_j,Y_j))_{j=1}^n\): in general, each \(W_i\) depends on every example.
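The failure is easy to exhibit numerically. In the toy sketch below (an assumed setup, not from the notes), labels are fair coin flips, so every predictor has true risk exactly \(1/2\); yet choosing \(\hat f\) to minimize empirical risk over many fixed candidates drives its empirical risk well below \(1/2\), precisely the bias a Hoeffding argument on the \(W_i\) would forbid.

```python
import random

def data_dependent_empirical_risk(n=50, k=1000, seed=0):
    """Toy illustration (an assumed setup, not from the notes): labels
    are fair coin flips, so every predictor has true risk exactly 1/2.
    Choosing f-hat to minimize empirical risk over k fixed candidate
    sign patterns makes Rhat(f-hat) fall far below 1/2 -- the W_i for
    f-hat are no longer independent, and Hoeffding does not apply."""
    rng = random.Random(seed)
    y = [rng.choice([-1, 1]) for _ in range(n)]
    best = 1.0
    for _ in range(k):
        # each candidate is a data-independent random sign pattern
        preds = [rng.choice([-1, 1]) for _ in range(n)]
        best = min(best, sum(p != t for p, t in zip(preds, y)) / n)
    return best  # empirical risk of f-hat, despite true risk 0.5
```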

Remark. We analyzed stochastic gradient descent in the first lecture, where we showed that the averaged iterate \(\bar w := (n+1)^{-1} \sum_i w_i\) satisfies, with probability at least \(1-\delta\), \[ f(\bar w) - \inf_{\|w\|_2\leq R} f(w) \leq {\mathcal{O}}\left(\frac {RL\sqrt{\ln(1/\delta)}}{\sqrt{n}} \right). \] This proof controls the output of an algorithm and seems to sidestep the above independence issue. What gives?

The analysis invokes Azuma’s inequality, not Hoeffding’s, on the martingale difference sequence \[ Z_i := {\left\langle \hat g_i - {\nabla f}(w_{i-1}), w_{i-1} - \bar w \right \rangle}, \] where the stochastic gradient satisfies \({\mathbb{E}}(\hat g_i) = {\nabla f}(w_{i-1})\); conditioned on \(((x_j,y_j))_{j=1}^{i-1}\), the iterate \(w_{i-1}\) (and hence \({\nabla f}(w_{i-1})\)) is determined while \(\hat g_i\) uses only fresh randomness, so \({\mathbb{E}}(Z_i \mid ((x_j,y_j))_{j=1}^{i-1}) = 0\), and we can apply Azuma’s inequality to the partial sums \(X_i := \sum_{j\leq i} Z_j\).
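A minimal numerical sketch of this structure (an assumed toy martingale, not the sgd analysis itself): below, each increment’s sign is chosen from the past, yet the fresh zero-mean noise keeps the conditional mean zero, so the partial sums still obey Azuma’s tail \({\text{Pr}}[X_n \geq c\sqrt{2n\ln(1/\delta)}] \leq \delta\) for increments bounded by \(c\).

```python
import math
import random

def azuma_failure_rate(n=400, trials=1000, c=1.0, delta=0.05, seed=0):
    """Toy martingale-difference check (an assumed noise model, not the
    notes' sgd setup): Z_i = s_i * e_i, where the sign s_i depends on
    the past but e_i is fresh symmetric +/-1 noise, so E[Z_i | past] = 0.
    Azuma: Pr[sum Z_i >= c * sqrt(2 n ln(1/delta))] <= delta."""
    rng = random.Random(seed)
    bound = c * math.sqrt(2 * n * math.log(1 / delta))
    failures = 0
    for _ in range(trials):
        s, total = 1, 0.0
        for _ in range(n):
            e = rng.choice([-1, 1])   # fresh noise, independent of past
            total += s * e
            s = 1 if total >= 0 else -1   # depends on the past only
        failures += total >= bound
    return failures / trials  # Azuma guarantees this is at most delta
```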

As such, this analysis and algorithm only barely dodge the earlier independence issue. Recall the open problem we stated: repeating even a single example breaks the sgd analysis, so the independence issue is lurking there as well.

Finite classes and uniform covering numbers

One resolution to the preceding independence issues is to be very careful about the way the algorithm uses random samples, as in the martingale analysis of sgd.

Another way, which is what we’ll use as the basis for the “generalization” part of the course, is to study the behavior of the random variable \[ \sup_{f\in{\mathcal{F}}} {\mathcal{R}}(f) - {\widehat{{\mathcal{R}}}}(f). \] This way, if we run any algorithm to pick some \(\hat f\in{\mathcal{F}}\), the preceding blanket statement also implies a guarantee on \(\hat f\), since the guarantee was made without seeing or depending on the data in any way.


As a first foray into this setting, let’s suppose \(|{\mathcal{F}}|\) is finite.

Theorem. Suppose \(|{\mathcal{F}}|<\infty\), and for every \(f\in{\mathcal{F}}\), \(\ell(f(X),Y) \in [0,b]\) with probability \(1\). Then with probability at least \(1-\delta\) over the i.i.d. draw of \(((x_i,y_i))_{i=1}^n\), every \(f\in{\mathcal{F}}\) satisfies \[ {\mathcal{R}}(f) - {\widehat{{\mathcal{R}}}}(f) \leq b \sqrt{\frac {\ln(|{\mathcal{F}}|) + \ln(1/\delta)}{2n}}. \]

Proof. Set \(\epsilon := b \sqrt{\ln(|{\mathcal{F}}|/\delta)/(2n)}\). By Hoeffding’s inequality, for any fixed \(f\in{\mathcal{F}}\), \[ {\text{Pr}}[ {\mathcal{R}}(f) - {\widehat{{\mathcal{R}}}}(f) \geq \epsilon ] \leq \exp\left(-\frac{2n\epsilon^2}{b^2}\right) = \frac {\delta}{|{\mathcal{F}}|}. \] By the union bound, \[ {\text{Pr}}[ \sup_{f\in{\mathcal{F}}} {\mathcal{R}}(f) -{\widehat{{\mathcal{R}}}}(f) \geq \epsilon ] \leq \sum_{f\in{\mathcal{F}}} {\text{Pr}}[ {\mathcal{R}}(f) - {\widehat{{\mathcal{R}}}}(f) \geq \epsilon ] \leq \delta. \] \({\qquad\square}\)
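The theorem is easy to sanity-check numerically. The sketch below uses an assumed toy class (not from the notes) of 16 predictors whose errors are independent coin flips of bias \(0.3\), so every \(f\) has true risk \(0.3\) and \(b=1\); it estimates how often the sup deviation exceeds the bound.

```python
import math
import random

def finite_class_failure_rate(n=200, size=16, b=1.0, delta=0.05,
                              trials=500, seed=0):
    """Estimate Pr[sup_f R(f) - Rhat(f) >= eps] for the theorem's eps.

    Toy model (an assumption, not from the notes): `size` predictors,
    each erring independently with probability 0.3, so every true risk
    is 0.3 and losses lie in [0, 1]."""
    rng = random.Random(seed)
    eps = b * math.sqrt((math.log(size) + math.log(1 / delta)) / (2 * n))
    failures = 0
    for _ in range(trials):
        # sup over the class of (true risk - empirical risk)
        sup_dev = max(
            0.3 - sum(rng.random() < 0.3 for _ in range(n)) / n
            for _ in range(size))
        failures += sup_dev >= eps
    return failures / trials  # the union bound guarantees at most delta
```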

Finite classes rarely arise directly; instead, we usually discretize some infinite class. This discretization will make it clear that having \(\ln(|{\mathcal{F}}|)\) (as opposed to \(\text{poly}(|{\mathcal{F}}|)\)) in the bound is valuable.

Given a function class \({\mathcal{F}}\), a (possibly infinite) set of inputs \({\mathcal{Z}}\), and a precision level \(\epsilon\), say that a finite subset \({\mathcal{G}}\subseteq {\mathcal{F}}\) is a primitive cover of \({\mathcal{F}}\) if for every \(f\in{\mathcal{F}}\) there exists \(g_f\in{\mathcal{G}}\) so that \(|g_f(z) - f(z)| \leq \epsilon\) for every \(z\in{\mathcal{Z}}\). The primitive covering number \({\mathcal{N}}(\epsilon, {\mathcal{F}},{\mathcal{Z}})\) is infinite if no primitive covers exist, and otherwise it is the size of the smallest primitive cover.
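As a tiny concrete instance (an assumed toy class, not one from the notes), \({\mathcal{F}} = \{z\mapsto wz : |w|\leq W\}\) over \({\mathcal{Z}} = [-1,1]\) admits a primitive cover of size \({\mathcal{O}}(W/\epsilon)\), since \(\sup_{|z|\leq 1}|wz - w'z| = |w-w'|\); a grid of slopes suffices:

```python
def primitive_cover_1d(eps, W=1.0):
    """Primitive cover for the toy class {z -> w*z : |w| <= W} over
    Z = [-1, 1]: since sup_{|z|<=1} |w*z - w'*z| = |w - w'|, a grid of
    slopes with spacing 2*eps (offset by eps) is an eps-cover."""
    grid, w = [], -W + eps
    while w < W + eps:
        grid.append(w)
        w += 2 * eps
    return grid  # size is O(W / eps), so ln N(eps, F, Z) = O(ln(W / eps))
```

The logarithm of this cover size is what enters the generalization bound, which is why the \(\ln(|{\mathcal{F}}|)\) dependence matters.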

Theorem. Consider losses of the form \(\ell:{\mathbb{R}}\to{\mathbb{R}}_+\), meaning \({\mathcal{R}}(f) = {\mathbb{E}}(\ell(-Yf(X)))\), where \(\ell\) is \(L\)-Lipschitz. Suppose \(Y\in\{-1,+1\}\) and \(X\in S\) for some set \(S\), with probability \(1\). Suppose \(|f(x)|\leq b\) for every \(x\in S\) and \(f\in{\mathcal{F}}\), and \(\ell(0) \leq Lb\). Then with probability at least \(1-\delta\) over the draw of \(((x_i,y_i))_{i=1}^n\), \[ \sup_{f\in{\mathcal{F}}} {\mathcal{R}}(f) - {\widehat{{\mathcal{R}}}}(f) \leq \inf_{\epsilon \geq 0} 2Lb \left( \epsilon + \sqrt{\frac{\ln({\mathcal{N}}(b\epsilon,{\mathcal{F}},S)) + \ln(1/\delta)} {2n}} \right). \]


Proof. Fix any \(\epsilon > 0\), suppose \({\mathcal{N}}(b\epsilon, {\mathcal{F}}, S) < \infty\) (the bound is vacuous otherwise), and fix any minimal cover \({\mathcal{G}}\). For any \(g\in{\mathcal{G}}\), with probability \(1\), every \((X,Y)\) satisfies \[ |\ell(-Yg(X))| \leq |\ell(-Yg(X)) - \ell(0)| + |\ell(0)| \leq L |-Yg(X) - 0| + Lb \leq 2Lb. \] Applying the finite class generalization bound to \({\mathcal{G}}\) gives \[ \sup_{g\in{\mathcal{G}}} {\mathcal{R}}(g) - {\widehat{{\mathcal{R}}}}(g) \leq 2Lb \sqrt{\frac {\ln({\mathcal{N}}(b\epsilon,{\mathcal{F}},S)) + \ln(1/\delta)}{2n}}. \] Now fix any \(f\in{\mathcal{F}}\), and let \(g\in{\mathcal{G}}\) denote its approximating element, so that \(|f(z) - g(z)| \leq b\epsilon\) for every \(z\in S\). Then \[ |{\widehat{{\mathcal{R}}}}(f) - {\widehat{{\mathcal{R}}}}(g)| \leq \frac 1 n \sum_{i=1}^n \left| \ell(-y_i f(x_i)) - \ell(-y_i g(x_i)) \right| \leq \frac L n \sum_{i=1}^n \left| (-y_i) (f(x_i) - g(x_i)) \right| \leq Lb \epsilon. \] Similarly, \(|{\mathcal{R}}(f) - {\mathcal{R}}(g)| \leq Lb\epsilon\). Thus \[ \begin{aligned} {\mathcal{R}}(f) -{\widehat{{\mathcal{R}}}}(f) &= ({\mathcal{R}}(f) - {\mathcal{R}}(g)) + ({\mathcal{R}}(g) - {\widehat{{\mathcal{R}}}}(g)) + ({\widehat{{\mathcal{R}}}}(g) - {\widehat{{\mathcal{R}}}}(f)) \\ &\leq 2Lb\epsilon + 2Lb\sqrt{\frac {\ln({\mathcal{N}}(b\epsilon,{\mathcal{F}},S)) + \ln(1/\delta)}{2n}}. \end{aligned} \] Since the bound holds for every \(\epsilon > 0\), it holds for the infimum. \({\qquad\square}\)

Example. Consider the case of linear prediction, meaning functions of the form \({\mathcal{F}}:=\{x\mapsto {\left\langle w, x \right \rangle} : \|w\|_2\leq W\}\), and suppose moreover that \(\|x\|_2 \leq X\) with probability \(1\). Pick a finite set \(C\) of weight vectors \(w\in{\mathbb{R}}^d\) so that for every \(\|w\|_2\leq W\), there exists \(w' \in C\) with \(\|w'-w\|_2 \leq \epsilon / X\). A standard estimate is that \(|C| \leq {\mathcal{O}}((XW/\epsilon)^d)\). Moreover, for any \(\|x\|_2\leq X\), \[ |{\left\langle w, x \right \rangle} - {\left\langle w', x \right \rangle}| \leq \|x\|_2 \|w-w'\|_2 \leq \epsilon. \] Thus \({\mathcal{N}}(\epsilon,{\mathcal{F}}, \{x\in{\mathbb{R}}^d : \|x\|_2\leq X\}) \leq {\mathcal{O}}((XW/\epsilon)^d)\), and the preceding bound gives (for any \(1\)-Lipschitz loss, for simplicity) \[ \begin{aligned} \sup_{f\in{\mathcal{F}}} {\mathcal{R}}(f) - {\widehat{{\mathcal{R}}}}(f) &\leq {\mathcal{O}}\left( \inf_{\epsilon > 0} XW\epsilon + XW\sqrt{\frac {d \ln(XW/\epsilon) + \ln(1/\delta)}{n}} \right) \\ &= {\mathcal{O}}\left( XW \sqrt{\frac {d \ln(nXW) + \ln(1/\delta)}{n}} \right), \end{aligned} \] where the final step chose \(\epsilon := 1/\sqrt{n}\). By contrast, the sgd bound lacked the extra log term! We’ll see how to drop it using Rademacher complexity.
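The grid construction behind the \({\mathcal{O}}((XW/\epsilon)^d)\) estimate can be sketched as follows (a minimal version under stated assumptions: an \(\ell_\infty\) grid with per-coordinate spacing \(2r/\sqrt d\) where \(r = \epsilon/X\), which yields the \(\ell_2\) guarantee at the cost of a \(\sqrt d\) factor inside the \({\mathcal{O}}(\cdot)\)):

```python
import itertools
import math

def weight_grid(radius, W=1.0, d=2):
    """Grid of weight vectors covering {w : ||w||_2 <= W}: every such w
    has a grid point within ell_2 distance `radius`.  Per-coordinate
    spacing h = 2*radius/sqrt(d) makes the worst-case ell_2 error
    sqrt(d)*(h/2) = radius; the grid size is O((W*sqrt(d)/radius)^d)."""
    h = 2 * radius / math.sqrt(d)
    axis, g = [], -W + h / 2
    while g < W + h / 2:
        axis.append(g)
        g += h
    return list(itertools.product(axis, repeat=d))
```

With \(r = \epsilon/X\), pairing each \(w\) with its nearest grid point gives exactly the cover used in the example.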