As we’ve discussed in the past few lectures, symmetrization / Rademacher complexity give us the following bound.

Theorem. Let a class of functions \({\mathcal{F}}\) be given with \(|f(z)| \leq c\) almost surely for every \(f\in{\mathcal{F}}\). With probability \(\geq 1-\delta\) over an i.i.d. draw \(S := (Z_1,\ldots,Z_n)\), every \(f\in{\mathcal{F}}\) satisfies \[ {\mathbb{E}}f \leq \hat {\mathbb{E}}f + 2{\text{Rad}}({\mathcal{F}}_{|S}) + 3c \sqrt{\frac 2 n \ln\left(\frac 2 \delta\right)}. \]

To make this meaningful to machine learning, we need to replace \({\mathbb{E}}f\) with some form of risk. Today we will discuss three choices.

  1. \({\mathcal{R}}_\ell\) where \(\ell\) is Lipschitz. We covered this last time but will recap a little.

  2. \({{\mathcal{R}}_{\text{z}}}(f) := {\text{Pr}}[f(X)\neq Y]\); for this we’ll use finite classes and discuss shatter coefficients and VC dimension.

  3. \({\mathcal{R}}_\gamma(f) = {\mathcal{R}}_{\ell_\gamma}\) where \(\ell_\gamma(z) := \max\{ 0, \min \{z/\gamma + 1, 1\}\}\) will lead to nice bounds for a number of methods, for instance boosting.

\({\mathcal{R}}_\ell\) recap

Last time we pointed out that bounded linear predictors \({\mathcal{F}}:= \{x\mapsto {\left\langle w, x \right \rangle} : \|w\|_2\leq W\}\) applied to bounded input values (\(\|x\|_2\leq X\)) with a nondecreasing \(L\)-Lipschitz loss (e.g., logistic loss is \(1\)-Lipschitz) gives with probability \(\geq 1-\delta\) for every \(f\in{\mathcal{F}}\) \[ {\mathcal{R}}_\ell(f) \leq {\widehat{{\mathcal{R}}}}_\ell(f) + LWX/\sqrt{n} + 3(LWX + \ell(0))\sqrt{\frac 2 n \ln\left(\frac 1 \delta\right)}. \]

Note moreover that regularization implies a choice of \(W\); namely, when \(\lambda >0\) and \(\ell \geq 0\), minimization of \[ g(w) := {\widehat{{\mathcal{R}}}}_\ell(w) + \lambda \|w\|_2^2/2 \] implies that the regularized ERM optimum \(w_\lambda\) satisfies \[ \lambda \|w_\lambda\|_2^2/2 \leq g(w_\lambda) \leq g(0) = {\widehat{{\mathcal{R}}}}_\ell(0), \] thus we can take \(W := \sqrt {2{\widehat{{\mathcal{R}}}}_\ell(0) / \lambda}\), and the earlier generalization term with \(W\) becomes \[ \frac {LWX}{\sqrt{n}} = LX \sqrt{\frac {2{\widehat{{\mathcal{R}}}}_\ell(0)}{\lambda n}}. \] Consequently, statistical learning theory typically recommends \(\lambda \geq 1/n^{1-{\varepsilon}}\) for some \({\varepsilon}> 0\) (even \({\varepsilon}\leq 1/2\) suffices) so that this bound goes to \(0\) as \(n\) increases.
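The calculation above can be sketched numerically. The following is illustrative only: the values of \(L\), \(X\), and the empirical risk at zero (here `R0`) are made up, and the point is just the scaling \(n^{-{\varepsilon}/2}\) of the generalization term when \(\lambda = n^{-(1-{\varepsilon})}\).

```python
import math

# Sketch: how the regularization strength lambda controls the norm bound W,
# and how the resulting generalization term decays with n.
# R0 stands in for the empirical risk at w = 0; L, X are placeholder constants.

def norm_bound(R0, lam):
    """W := sqrt(2 * R0 / lambda), from lambda * ||w||^2 / 2 <= R0."""
    return math.sqrt(2.0 * R0 / lam)

def gen_term(L, X, R0, lam, n):
    """The term L * W * X / sqrt(n) with W as above."""
    return L * X * math.sqrt(2.0 * R0 / (lam * n))

# With lambda = n**(-(1 - eps)), lambda * n = n**eps, so the term is n**(-eps/2).
L, X, R0, eps = 1.0, 1.0, math.log(2.0), 0.5
terms = [gen_term(L, X, R0, n ** (-(1 - eps)), n) for n in (100, 10_000, 1_000_000)]
for n, t in zip((100, 10_000, 1_000_000), terms):
    print(n, t)
```

As expected, the printed terms shrink as \(n\) grows, despite \(W\) itself growing when \(\lambda\) shrinks.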

Remark. This is just a sufficient condition, not a necessary condition.

\({{\mathcal{R}}_{\text{z}}}\), VC dimension

Turning to \({{\mathcal{R}}_{\text{z}}}\), we obtain a complexity term \[ {\text{Rad}}(\{ (x,y) \mapsto {\mathbf{1}}[{\text{sgn}}(f(x)) \neq y] : f\in {\mathcal{F}}\}_{|S}). \] The following definitions and lemmas show how we can simplify this.

Now consider the sign patterns that arise from a set of real-valued predictors, meaning \[ \begin{aligned} {\text{sgn}}({\mathcal{F}}) &:= \{ x \mapsto {\text{sgn}}(f(x)) : f\in{\mathcal{F}}\}, \\ {\text{sgn}}(U) &:= \{ ({\text{sgn}}(v_1),\ldots,{\text{sgn}}(v_n)) : v\in U \}. \end{aligned} \] Define the shatter coefficients \({\text{Sh}}\) and VC dimension \({\text{VC}}\) as \[ \begin{aligned} {\text{Sh}}({\mathcal{F}}_{|S}) &:= \left| {\text{sgn}}({\mathcal{F}}_{|S}) \right|, \\ {\text{Sh}}({\mathcal{F}};n) &:= \max_{|S| = n} {\text{Sh}}({\mathcal{F}}_{|S}), \\ {\text{VC}}({\mathcal{F}}) &:= \max\{ i\in {\mathbb{Z}}_+ : {\text{Sh}}({\mathcal{F}};i) = 2^i \}. \end{aligned} \]
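These definitions can be computed by brute force on small examples. Below is a sketch (not from the lecture) using the standard class of threshold functions \(x \mapsto {\text{sgn}}(x - t)\) on the line, whose VC dimension is \(1\); the particular points and thresholds are arbitrary choices.

```python
# Brute-force sketch of Sh(F_{|S}) for the class of threshold functions
# x -> sgn(x - t) on the real line (a standard VC-dimension-1 example).

def sign_patterns(points, thresholds):
    """Collect the distinct sign vectors, i.e. sgn(F_{|S}), realized on `points`."""
    patterns = set()
    for t in thresholds:
        patterns.add(tuple(1 if x > t else -1 for x in points))
    return patterns

points = [0.5, 1.5, 2.5]
# Enough thresholds to realize every achievable pattern on these points.
thresholds = [-1.0, 1.0, 2.0, 3.0]
shatter = len(sign_patterns(points, thresholds))
print(shatter)  # n + 1 = 4 patterns, far below 2^3 = 8, so these points are not shattered
```

Thresholds can only flip points from \(+1\) to \(-1\) left to right, so only \(n+1\) of the \(2^n\) patterns are achievable.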

The following two lemmas show how to use these concepts in controlling \({{\mathcal{R}}_{\text{z}}}\).

Lemma. \[ {\text{Rad}}(\{ (x,y) \mapsto {\mathbf{1}}[{\text{sgn}}(f(x)) \neq y] : f\in {\mathcal{F}}\}_{|S}) \leq \frac 1 2 {\text{Rad}}({\text{sgn}}({\mathcal{F}})_{|S}). \] Proof. For each coordinate \(i\), define a map \(f_i(z) := {\mathbf{1}}[ z \neq y_i ]\) for \(z\in \{-1,+1\}\), and then extend this to an affine function by interpolation (explicitly, \(f_i(z) = (1 - z y_i)/2\)). Each \(f_i\) is \(1/2\)-Lipschitz, thus the Lipschitz composition lemma gives the result. \({\qquad\square}\)

Lemma (Sauer-Shelah, Vapnik-Chervonenkis?, Warren?). Define \(V := {\text{VC}}({\mathcal{F}})\) for convenience. Then \[ {\text{Sh}}({\mathcal{F}};n) \leq \begin{cases} 2^n &\text{when $n \leq V$}, \\ \left(\frac{en}{V}\right)^V &\text{otherwise}, \end{cases} \] and in general \({\text{Sh}}({\mathcal{F}};n) \leq n^V + 1\).

Proof. Omitted; this is taught in lots of machine learning classes… \({\qquad\square}\)
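A quick numeric sanity check of the lemma's two bounds, again using the \(1\)-dimensional threshold class (\(V = 1\), exact shatter coefficient \(n+1\)); this example class is my choice, not the lecture's.

```python
import math

# Check that the exact shatter coefficient n + 1 of threshold functions
# (VC dimension V = 1) sits below both Sauer-Shelah forms.

def sauer_bound(V, n):
    """2^n for n <= V, (e*n/V)^V otherwise."""
    return 2.0 ** n if n <= V else (math.e * n / V) ** V

def poly_bound(V, n):
    """The cruder general bound n^V + 1."""
    return n ** V + 1

V = 1
checks = []
for n in range(1, 20):
    sh = n + 1  # exact shatter coefficient for thresholds on the line
    checks.append(sh <= sauer_bound(V, n) and sh <= poly_bound(V, n))
print(all(checks))
```

Note \(n^V + 1 = n + 1\) here, so the polynomial bound is tight for this class.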

Remark (historical). [ future matus: dig this all up properly. ]

  1. Essentially the definition of VC dimension appears in Warren’s 1960s paper, which attributes it to an earlier paper by Shapiro. There is evidence that Kolmogorov had a form of it decades earlier.

  2. Warren roughly gives the Sauer-Shelah lemma in his 1960s paper.

Putting these pieces together, we get the following.

Theorem (“VC theorem”). With probability \(\geq 1-\delta\), every \(f\in{\mathcal{F}}\) satisfies \[ {{\mathcal{R}}_{\text{z}}}(f) \leq \hat{{\mathcal{R}}_{\text{z}}}(f) + {\text{Rad}}({\text{sgn}}({\mathcal{F}})_{|S}) + 3\sqrt{\frac 2 n \ln\left(\frac 1 \delta\right)} \] where \[ {\text{Rad}}({\text{sgn}}({\mathcal{F}})_{|S}) \leq \sqrt{\frac{8 \ln({\text{Sh}}({\mathcal{F}}_{|S}))}{n}} \qquad\text{and}\qquad \ln({\text{Sh}}({\mathcal{F}}_{|S})) \leq \ln({\text{Sh}}({\mathcal{F}};n)) \leq {\text{VC}}({\mathcal{F}}) \ln(n+1). \]
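To get a feel for the rates, the excess-risk terms of the VC theorem can be evaluated directly, using the chained relaxations \({\text{Rad}} \leq \sqrt{8\ln({\text{Sh}})/n}\) and \(\ln({\text{Sh}}) \leq {\text{VC}}({\mathcal{F}})\ln(n+1)\). The VC dimension, sample sizes, and \(\delta\) below are arbitrary toy values.

```python
import math

# Illustrative evaluation of the VC theorem's excess-risk terms:
# sqrt(8 * VC * ln(n + 1) / n) + 3 * sqrt(2/n * ln(1/delta)).

def vc_excess(vc_dim, n, delta):
    rad = math.sqrt(8.0 * vc_dim * math.log(n + 1) / n)
    dev = 3.0 * math.sqrt(2.0 / n * math.log(1.0 / delta))
    return rad + dev

for n in (1_000, 100_000):
    print(n, vc_excess(vc_dim=10, n=n, delta=0.05))
```

The complexity term dominates the deviation term, and both decay roughly as \(\sqrt{\ln(n)/n}\) and \(1/\sqrt n\) respectively.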

Remark (on optimization).

  1. As discussed many times, there are trivial cases where minimizing \(\hat{{\mathcal{R}}_{\text{z}}}\) is NP-hard.

  2. Instead, it is common to minimize \({\mathcal{R}}_\ell\). This can be related as follows.

    Lemma. Suppose \(\ell:{\mathbb{R}}\to{\mathbb{R}}_+\) is nondecreasing, and pick any \(a\geq 0\) with \(\ell(-a) > 0\). Then \[ {{\mathcal{R}}_{\text{z}}}(f) \leq {\text{Pr}}[ Yf(X) \leq a ] \leq \frac{{\mathcal{R}}_\ell(f)}{\ell(-a)}. \]

    Proof. By Markov’s inequality, \[ \begin{aligned} {{\mathcal{R}}_{\text{z}}}(f) &= {\text{Pr}}[ {\text{sgn}}(f(X)) \neq Y ] \leq {\text{Pr}}[ Yf(X) \leq 0 ] \leq {\text{Pr}}[ Yf(X) \leq a ] \\ &= {\text{Pr}}[ -Yf(X) \geq -a ] \leq {\text{Pr}}[ \ell(-Yf(X)) \geq \ell(-a) ] \\ &\leq \frac {{\mathbb{E}}(\ell(-Yf(X)))}{\ell(-a)}. \end{aligned} \] \({\qquad\square}\)
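The lemma's chain of inequalities can be checked empirically. The sketch below uses the logistic loss \(\ell(z) = \ln(1+e^z)\) (nondecreasing with \(\ell(-a) > 0\)), the identity predictor \(f(x) = x\), and synthetic data with 10% label noise; all of these choices are illustrative, not from the lecture.

```python
import math
import random

# Empirical check of R_z(f) <= Pr[Y f(X) <= a] <= R_ell(f) / ell(-a)
# with logistic loss and a toy predictor f(x) = x on noisy synthetic data.

random.seed(0)
ell = lambda z: math.log(1.0 + math.exp(z))

data = []
for _ in range(10_000):
    x = random.gauss(0.0, 1.0)
    y = 1 if random.random() < 0.9 else -1  # 10% label noise
    y *= 1 if x >= 0 else -1                # labels mostly agree with sgn(x)
    data.append((x, y))

f = lambda x: x
a = 0.5
risk_z = sum(1 for x, y in data if (1 if f(x) >= 0 else -1) != y) / len(data)
margin_pr = sum(1 for x, y in data if y * f(x) <= a) / len(data)
risk_ell = sum(ell(-y * f(x)) for x, y in data) / len(data)
bound = risk_ell / ell(-a)
print(risk_z, margin_pr, bound)
```

Note the middle inequality holds sample-by-sample (Markov applied to the empirical distribution), so the check is deterministic, not just likely.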


\({\mathcal{R}}_\gamma\), margin bounds

Define \[ \begin{aligned} \ell_\gamma(z) &:= \max\{0, \min\{ 1, 1 + z/\gamma \}\}, \\ {\mathcal{R}}_\gamma(f) &:= {\mathcal{R}}_{\ell_\gamma}(f) = {\mathbb{E}}(\ell_\gamma(-Yf(X))). \end{aligned} \] These losses have a number of nice properties and are useful when analyzing boosting. Let’s first get a few more Rademacher lemmas out of the way (not all of which we’ll need here).
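One of those nice properties is the sandwich \({\mathbf{1}}[yf(x) \leq 0] \leq \ell_\gamma(-yf(x)) \leq {\mathbf{1}}[yf(x) \leq \gamma]\), which is what makes \({{\mathcal{R}}_{\text{z}}}\leq {\mathcal{R}}_\gamma\) automatic later. A minimal sketch (the specific \(\gamma\) and margins are arbitrary test values):

```python
# The ramp loss ell_gamma and its sandwich between two zero-one losses:
# 1[m <= 0] <= ell_gamma(-m) <= 1[m <= gamma], where m = y * f(x).

def ell_gamma(z, gamma):
    """max{0, min{1, 1 + z/gamma}}: 0 for z <= -gamma, 1 for z >= 0, linear between."""
    return max(0.0, min(1.0, 1.0 + z / gamma))

gamma = 0.5
ok = True
for margin in [-2.0, -0.6, -0.5, -0.3, 0.0, 0.1, 0.5, 1.0]:
    zero_one = 1.0 if margin <= 0 else 0.0
    ramp = ell_gamma(-margin, gamma)
    ok = ok and zero_one <= ramp <= (1.0 if margin <= gamma else 0.0)
print(ok)
```

The ramp is also \(1/\gamma\)-Lipschitz and bounded in \([0,1]\), which is exactly what the Rademacher machinery wants.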


Lemma. Let \(U \subseteq {\mathbb{R}}^n\), \(c \in {\mathbb{R}}\), and \(v \in {\mathbb{R}}^n\) be given.

  1. \({\text{Rad}}(U) \geq 0\).

  2. \({\text{Rad}}(cU + \{v\}) \leq |c|{\text{Rad}}(U)\).

  3. \({\text{Rad}}({\text{conv}}(U)) = {\text{Rad}}(U)\).

  4. Let sets of vectors \((U_i)_{i \geq 1}\) be given with \(\sup_{v\in U_i}{\left\langle \sigma, v \right \rangle}_n \geq 0\) for every \(\sigma\in\{-1,+1\}^n\). Then \[ {\text{Rad}}(\bigcup_i U_i) \leq \sum_i {\text{Rad}}(U_i). \] (For instance, it suffices to have \(U_i = -U_i\), or \(0\in U_i\).)


Proof.

  1. \({\mathbb{E}}_\sigma \sup_{v\in U} {\left\langle \sigma, v \right \rangle}_n \geq \sup_{v\in U} {\mathbb{E}}_\sigma {\left\langle \sigma, v \right \rangle}_n = 0.\)

  2. Use the Lipschitz composition lemma with the \(|c|\)-Lipschitz maps \(f_i(z) := c z + v_i\).

  3. As has been used a number of times in the course, optimizing a linear function over a polytope is achieved at a corner: \[ \begin{aligned} {\mathbb{E}}_\sigma \sup_{v\in{\text{conv}}(U)} {\left\langle \sigma, v \right \rangle}_n &= {\mathbb{E}}_\sigma \sup_{k \geq 1} \sup_{v_1,\ldots,v_k \in U} \sup_{a \in \Delta_k} {\left\langle \sigma, \sum_j a_jv_j \right \rangle}_n \\ &= {\mathbb{E}}_\sigma \sup_{v \in U} {\left\langle \sigma, v \right \rangle}_n. \end{aligned} \]

  4. Thanks to the condition on each \(U_i\), \[ \begin{aligned} {\mathbb{E}}_\sigma \sup_{v\in \bigcup_i U_i} {\left\langle \sigma, v \right \rangle}_n &= {\mathbb{E}}_\sigma \sup_{i\geq 1} \sup_{v\in U_i} {\left\langle \sigma, v \right \rangle}_n \leq {\mathbb{E}}_\sigma \sum_i\sup_{v\in U_i} {\left\langle \sigma, v \right \rangle}_n = \sum_i {\mathbb{E}}_\sigma \sup_{v\in U_i} {\left\langle \sigma, v \right \rangle}_n. \end{aligned} \] \({\qquad\square}\)

Remark. In the case of \({\text{Rad}}(\bigcup_i U_i)\), some condition on the \(U_i\) is needed; otherwise, any countable class \(U\) could be decomposed into singletons \(U_i\), and \(0\leq {\text{Rad}}(U) \leq \sum_i {\text{Rad}}(U_i) = 0\), which contradicts the existence of classes with cardinality \(2\) but positive Rademacher complexity.
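For finite \(U\) and small \(n\), these rules can be verified by exact enumeration over all \(2^n\) sign vectors. The sketch below checks nonnegativity and \({\text{Rad}}({\text{conv}}(U)) = {\text{Rad}}(U)\) on a tiny hand-picked \(U \subseteq {\mathbb{R}}^3\) (the vectors and the convex-combination grid are arbitrary choices for illustration).

```python
from itertools import product

# Exact Rademacher complexity of a finite U in R^n by enumerating all sigma.

def rad(vectors, n):
    total = 0.0
    for sigma in product((-1, 1), repeat=n):
        total += max(sum(s * v for s, v in zip(sigma, u)) / n for u in vectors)
    return total / 2 ** n

n = 3
U = [(1.0, 0.0, -1.0), (0.0, 1.0, 1.0)]
# A grid of convex combinations a*u1 + (1-a)*u2, endpoints included; per sigma
# the inner product is linear in a, so its max over the grid is hit at a corner.
conv_U = [tuple(a * x + (1 - a) * y for x, y in zip(*U)) for a in
          [i / 10 for i in range(11)]]
print(rad(U, n), rad(conv_U, n))
```

Because each \({\left\langle \sigma, \cdot \right \rangle}_n\) is linear, the supremum over mixtures matches the supremum over corners exactly, mirroring item 3 of the proof.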

These tools lead to the following control on \({\mathcal{R}}_\gamma\).

Theorem. Let some base class of functions \({\mathcal{H}}\) be given, and suppose \[ {\mathcal{F}}:= W{\text{conv}}({\mathcal{H}}\cup -{\mathcal{H}}) = \left\{ \sum_{i=1}^k \alpha_i h_i : k \in {\mathbb{Z}}_+,\ \alpha \in {\mathbb{R}}^k, \|\alpha\|_1 \leq W, h_1,\ldots,h_k \in {\mathcal{H}}\right\}. \] With probability \(\geq 1-\delta\), every \(f\in{\mathcal{F}}\) satisfies \[ {{\mathcal{R}}_{\text{z}}}(f) \leq {\mathcal{R}}_\gamma(f) \leq \hat{\mathcal{R}}_\gamma(f) + \frac {2W} \gamma {\text{Rad}}(({\mathcal{H}}\cup -{\mathcal{H}})_{|S}) + 3\sqrt{\frac 2 n \ln\left(\frac 1 \delta\right)}. \]

Proof. Automatically, \({{\mathcal{R}}_{\text{z}}}(f) \leq {\mathcal{R}}_\gamma(f)\) since \({\mathbf{1}}[z \leq 0] \leq \ell_\gamma(-z)\). For the rest, since \(\ell_\gamma\) is \((1/\gamma)\)-Lipschitz, the Lipschitz composition lemma peels off a factor \(1/\gamma\), and the scaling and convex hull rules above give \({\text{Rad}}({\mathcal{F}}_{|S}) = W {\text{Rad}}({\text{conv}}({\mathcal{H}}\cup -{\mathcal{H}})_{|S}) = W {\text{Rad}}(({\mathcal{H}}\cup -{\mathcal{H}})_{|S})\). (Note the deviation term has scale \(c = 1\) since \(\ell_\gamma \in [0,1]\); it is controlled before unwrapping the Rademacher complexity.) \({\qquad\square}\)
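The complexity term \(\frac{2W}{\gamma}{\text{Rad}}(({\mathcal{H}}\cup-{\mathcal{H}})_{|S})\) can be estimated by Monte Carlo for a concrete base class. The sketch below uses a tiny class of decision stumps on synthetic data; the base class, \(W\), \(\gamma\), sample, and trial count are all made-up illustrative choices.

```python
import random

# Monte Carlo estimate of the margin bound's complexity term
# (2W/gamma) * Rad((H u -H)_{|S}) for a small class of threshold stumps.

random.seed(1)
n = 200
xs = [random.uniform(-1.0, 1.0) for _ in range(n)]

# Base class restricted to the sample: stumps x -> sgn(x - t), plus negations.
thresholds = [-0.5, 0.0, 0.5]
H = [[1.0 if x > t else -1.0 for x in xs] for t in thresholds]
H_sym = H + [[-v for v in h] for h in H]

def rad_mc(vectors, n, trials=2000):
    """Average over random sign vectors of the max normalized inner product."""
    total = 0.0
    for _ in range(trials):
        sigma = [random.choice((-1.0, 1.0)) for _ in range(n)]
        total += max(sum(s * v for s, v in zip(sigma, h)) / n for h in vectors)
    return total / trials

W, gamma = 2.0, 0.5
complexity = (2.0 * W / gamma) * rad_mc(H_sym, n)
print(complexity)
```

Since the stump outputs lie in \(\{-1,+1\}\), the estimated Rademacher term is at most \(1\), so the whole term is at most \(2W/\gamma\); a small base class on a moderate sample lands well below that.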

Remark (on optimization, and the meaning of this bound).