• Homework 2 is out.

• Project presentations are on December 8 at noon (location TBA); details forthcoming on webpage.

## Overview

The culmination of Rademacher/symmetrization was the following bound.

Theorem. Let functions $${\mathcal{F}}$$ be given with $$|f(z)| \leq c$$ almost surely for every $$f\in{\mathcal{F}}$$. With probability $$\geq 1-\delta$$ over an i.i.d. draw $$S := (Z_1,\ldots,Z_n)$$, every $$f\in{\mathcal{F}}$$ satisfies ${\mathbb{E}}f \leq \hat {\mathbb{E}}f + 2{\text{Rad}}({\mathcal{F}}_{|S}) + 3c \sqrt{\frac 2 n \ln(2/\delta)},$ where \begin{aligned} {\mathbb{E}}f &:= {\mathbb{E}}(f(Z)), \\ \hat{\mathbb{E}}f &:= \frac 1 n \sum_{i=1}^n f(Z_i), \\ {\text{Rad}}(U) &:= {\mathbb{E}}_\sigma \sup_{v\in U} {\left\langle \sigma, v \right \rangle}_n, \\ {\left\langle a, b \right \rangle}_n &:= \frac 1 n \sum_{i=1}^n a_i b_i, \\ {\mathcal{F}}_{|S} &:= \left\{ (f(Z_1),\ldots,f(Z_n)) : f\in{\mathcal{F}}\right\}. \end{aligned}
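To make $${\text{Rad}}$$ concrete, here is a minimal numerical sketch (names and setup hypothetical) that estimates $${\text{Rad}}(U)$$ by Monte Carlo over the sign vector $$\sigma$$ for a small finite set of vectors $$U$$.

```python
import numpy as np

rng = np.random.default_rng(0)

def rad_mc(U, num_draws=2000, rng=rng):
    """Monte Carlo estimate of Rad(U) = E_sigma sup_{v in U} <sigma, v>_n,
    where the rows of U are the vectors (e.g., F restricted to the sample S)."""
    n = U.shape[1]
    total = 0.0
    for _ in range(num_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)   # uniform random signs
        total += np.max(U @ sigma) / n            # sup over the finite class
    return total / num_draws

# Two-element class: the constant functions +1 and -1 on a sample of size n;
# here sup_v <sigma, v>_n = |sum_i sigma_i|/n, so Rad is roughly sqrt(2/(pi n)).
n = 100
U = np.vstack([np.ones(n), -np.ones(n)])
est = rad_mc(U)
```

The same estimator applies to any finite restriction $${\mathcal{F}}_{|S}$$, which is all we will need below.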

Goals for today.

1. Bounds on $${\mathcal{R}}_\ell(f) := {\mathbb{E}}(\ell(-Yf(X)))$$ when $$\ell$$ is Lipschitz.

2. A Rademacher bound for finite classes. We’ll use this in the next class to discuss shatter coefficients and VC dimension.

## $${\mathcal{R}}_\ell$$ for Lipschitz $$\ell$$

Before getting the tools to work with Lipschitz losses, let’s see how easily we can control $$\ell_2$$-bounded linear functions with Rademacher complexity; this will let us derive bounds for logistic regression soon after.

Lemma. Set $$X:= \sup_{x\in S} \|x\|_2$$. Then ${\text{Rad}}(\{x\mapsto {\left\langle w, x \right \rangle} : \|w\|_2\leq W\}_{|S}) \leq W\sqrt{\sum_i \|x_i\|_2^2}/n \leq WX/\sqrt{n}.$ Proof. For any fixed $$\sigma\in\{-1,+1\}^n$$, setting $$x_\sigma := \sum_{i=1}^n \sigma_i x_i/n$$, the equality case of Cauchy-Schwarz grants $\sup_{\|w\|_2\leq W} \frac 1 n \sum_{i=1}^n \sigma_i {\left\langle w, x_i \right \rangle} = \sup_{\|w\|_2\leq W} {\left\langle w, x_\sigma \right \rangle} = \left\{ \begin{array}{ll} 0 & \text{if } x_\sigma = 0,\\ {\left\langle Wx_\sigma / \|x_\sigma\|_2, x_\sigma \right \rangle} & \text{otherwise,} \end{array} \right\} = W\|x_\sigma\|_2.$ Now to handle the expectation, we invoke the only inequality in the whole proof: by Jensen’s inequality, ${\mathbb{E}}_\sigma \|x_\sigma\|_2 = {\mathbb{E}}_\sigma \sqrt{\|x_\sigma\|_2^2} \leq \sqrt{{\mathbb{E}}_\sigma \|x_\sigma\|_2^2}.$ The rest is again equalities: since $${\mathbb{E}}_\sigma(\sigma_i\sigma_j) = {\mathbf{1}}[i = j]$$, ${\mathbb{E}}_\sigma\|x_\sigma\|_2^2 = \frac 1 {n^2} \sum_{j=1}^d {\mathbb{E}}_\sigma \Big(\sum_{i=1}^n x_{i,j}\sigma_i\Big)^2 = \frac 1 {n^2} \sum_{j=1}^d {\mathbb{E}}_\sigma \Big(\sum_{i=1}^n x_{i,j}^2\sigma_i^2 + \sum_{i\neq l} x_{i,j} x_{l,j} \sigma_i \sigma_l\Big) = \sum_{i=1}^n \|x_i\|_2^2 / n^2,$ which combines with the preceding display to give the first bound; the second follows since $$\sum_i \|x_i\|_2^2 \leq nX^2$$. $${\qquad\square}$$
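Since the proof gives the supremum in closed form ($$\sup_{\|w\|_2\leq W} {\left\langle w, x_\sigma \right \rangle} = W\|x_\sigma\|_2$$), the left-hand side is easy to estimate numerically; this sketch (sample and constants hypothetical) compares the estimate against the bound.

```python
import numpy as np

rng = np.random.default_rng(1)

n, d, W = 200, 5, 2.0
X = rng.normal(size=(n, d))   # sample points x_1, ..., x_n as rows

def rad_linear(X, W, num_draws=2000, rng=rng):
    """Estimate Rad of {x -> <w,x> : ||w||_2 <= W} restricted to the rows of X,
    using the equality case of Cauchy-Schwarz: the sup is W * ||x_sigma||_2."""
    n = X.shape[0]
    vals = []
    for _ in range(num_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)
        x_sigma = (sigma @ X) / n
        vals.append(W * np.linalg.norm(x_sigma))
    return float(np.mean(vals))

est = rad_linear(X, W)
bound = W * np.sqrt((X ** 2).sum()) / n   # W * sqrt(sum_i ||x_i||^2) / n
# Jensen's inequality is the only slack, so est should sit just below bound.
```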

Remark. This proof is particularly clean for $$\|\cdot\|_2$$, but in the homework we’ll see a clean trick to handle other norms.

Lemma. Let functions $$\vec \ell = (\ell_i)_{i=1}^n$$ be given with each $$\ell_i:{\mathbb{R}}\to{\mathbb{R}}$$ $$L$$-Lipschitz, and for any vector $$v\in{\mathbb{R}}^n$$ define the coordinate-wise composition $$\vec \ell \circ v := (\ell_i(v_i))_{i=1}^n$$, and similarly $$\vec \ell \circ U := \{ \vec \ell \circ v : v\in U\}$$. Then ${\text{Rad}}(\vec\ell \circ U) \leq L {\text{Rad}}(U).$

Remark. The proof seems straightforward but it is a little magical. It uses a step akin to symmetrization, but quite different. The difficulty arises since $$|\ell_i(a)-\ell_i(b)| \leq L|a-b|$$ by the definition of Lipschitz, but we need to erase that absolute value.

Proof. We will show that we can replace $$\ell_1$$ with the map $$v_1\mapsto Lv_1$$, and the proof is complete by recursing on $$\{2,\ldots,n\}$$. The idea is that we need to produce two terms that depend on $$\ell_1$$ in order to invoke the Lipschitz property. Proceeding from the definition of $${\text{Rad}}$$ and expanding the expectation over $$\sigma_1$$, \begin{aligned} {\text{Rad}}(\vec \ell \circ U) &= {\mathbb{E}}_\sigma \sup_{v\in U} \frac 1 n \sum_i \sigma_i \ell_i(v_i) \\ &= \frac 1 {2n} {\mathbb{E}}_{\sigma_{2:n}} \left( \sup_{v\in U}\Big( \ell_1(v_1) + \sum_{i\geq 2} \sigma_i \ell_i(v_i)\Big) + \sup_{w\in U}\Big( -\ell_1(w_1) + \sum_{i\geq 2} \sigma_i \ell_i(w_i)\Big) \right) \\ &\leq \frac 1 {2n} {\mathbb{E}}_{\sigma_{2:n}} \left( \sup_{\substack{v\in U\\w\in U}} L | v_1 - w_1| + \sum_{i\geq 2} \sigma_i ( \ell_i(v_i) + \ell_i(w_i) ) \right) \\ &= \frac 1 {2n} {\mathbb{E}}_{\sigma_{2:n}} \left( \sup_{\substack{v\in U\\w\in U\\v_1\geq w_1}} L | v_1 - w_1| + \sum_{i\geq 2} \sigma_i ( \ell_i(v_i) + \ell_i(w_i) ) \right) \\ &= \frac 1 {2n} {\mathbb{E}}_{\sigma_{2:n}} \left( \sup_{\substack{v\in U\\w\in U\\v_1\geq w_1}} L ( v_1 - w_1) + \sum_{i\geq 2} \sigma_i ( \ell_i(v_i) + \ell_i(w_i) ) \right) \\ &= \frac 1 {n} {\mathbb{E}}_{\sigma} \sup_{v\in U} \left(\sigma_1 L v_1 + \sum_{i\geq 2} \sigma_i \ell_i(v_i) \right), \end{aligned} where the constraint $$v_1\geq w_1$$ is free since the objective is symmetric in $$(v,w)$$ apart from the first term, and the last step re-introduces $$\sigma_1$$ by reversing the first step. The same technique is now applied for $$i\in \{2,\ldots,n\}$$. $${\qquad\square}$$
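A quick numerical illustration of the contraction inequality (finite class and loss chosen for convenience, not from the notes): take each $$\ell_i$$ to be the $$1$$-Lipschitz ReLU and compare Monte Carlo estimates of both sides.

```python
import numpy as np

rng = np.random.default_rng(2)

n, k = 100, 20
U = rng.normal(size=(k, n))   # a finite set U of k vectors in R^n

def rad_mc(V, num_draws=3000, rng=rng):
    """Monte Carlo estimate of Rad(V) over the rows of V."""
    n = V.shape[1]
    return float(np.mean([np.max(V @ rng.choice([-1.0, 1.0], size=n)) / n
                          for _ in range(num_draws)]))

relu_U = np.maximum(U, 0.0)   # coordinate-wise composition with ReLU (L = 1)
lhs = rad_mc(relu_U)          # estimate of Rad(relu o U)
rhs = rad_mc(U)               # estimate of L * Rad(U) with L = 1
# The lemma guarantees lhs <= rhs (up to Monte Carlo error).
```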

This gives the following useful bound.

Theorem. Let functions $${\mathcal{F}}$$ and loss $$\ell$$ be given. Suppose $$\ell$$ is $$L$$-Lipschitz, and $$|\ell(-yf(x))|\leq c$$ and $$|y|\leq 1$$ almost surely. Then with probability at least $$1-\delta$$ over $$S$$, every $$f\in{\mathcal{F}}$$ satisfies ${\mathcal{R}}_\ell(f) \leq {\widehat{{\mathcal{R}}}}_\ell(f) + 2L{\text{Rad}}({\mathcal{F}}_{|S}) + 3c \sqrt{\frac 2 n \ln\left(\frac 2\delta\right)}.$ Proof. This follows from the preceding Rademacher rules by noting that, since $$|y_i|\leq 1$$, each $$\ell_i(z) := \ell(-y_iz)$$ is $$L$$-Lipschitz, just like $$\ell$$. $${\qquad\square}$$

Remark (logistic regression). Combining these pieces, with $$\|x\|_2\leq X$$ and functions $${\mathcal{F}}:= \{x\mapsto {\left\langle w, x \right \rangle} : \|w\|_2\leq W\}$$ and nondecreasing $$L$$-Lipschitz loss (e.g., the logistic loss $$z\mapsto \ln(1+\exp(z))$$ is $$1$$-Lipschitz) we get, with probability $$\geq 1-\delta$$, that each $$f\in{\mathcal{F}}$$ satisfies ${\mathcal{R}}_\ell(f) \leq {\widehat{{\mathcal{R}}}}_\ell(f) + 2LWX/\sqrt{n} + 3(LWX + \ell(0)) \sqrt{\frac 2 n \ln\left(\frac 2\delta\right)},$ where the choice $$c = LWX + \ell(0)$$ works since $$|\ell(-yf(x))| \leq \ell(0) + L|yf(x)| \leq \ell(0) + LWX$$. We had roughly this bound with SGD, but via a very different analysis!
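For a feel of the scale, here is a sketch plugging hypothetical numbers (all chosen for illustration) into this logistic-regression bound, using the $$\ln(2/\delta)$$ concentration term from the theorem at the top of the notes.

```python
import numpy as np

# Hypothetical setting: n samples, unit norm bounds, logistic loss
# (so L = 1 and ell(0) = ln 2), confidence level delta = 0.05.
n, W, X, delta = 10_000, 1.0, 1.0, 0.05
L, ell0 = 1.0, np.log(2.0)

rad_term = 2 * L * W * X / np.sqrt(n)
conc_term = 3 * (L * W * X + ell0) * np.sqrt(2 / n * np.log(2 / delta))
gap = rad_term + conc_term   # amount added to the empirical risk
```

Note the Rademacher term decays at the same $$1/\sqrt{n}$$ rate as the concentration term, so neither dominates asymptotically.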

## Finite classes

The main Rademacher tool here is as follows. [ In class we also discussed Shatter coefficients and VC dimension, but these will be in the next lecture notes. ]

Theorem (Massart finite lemma). ${\text{Rad}}(U) \leq \frac {\max_{v\in U} \|v\|_2\sqrt{2\ln(|U|)}}{n}.$
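Before the proof, a Monte Carlo sanity check of this bound on a random finite class (setup hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)

n, k = 100, 50
U = rng.normal(size=(k, n))   # finite class: k vectors in R^n

def rad_mc(V, num_draws=3000, rng=rng):
    """Monte Carlo estimate of Rad(V) over the rows of V."""
    n = V.shape[1]
    return float(np.mean([np.max(V @ rng.choice([-1.0, 1.0], size=n)) / n
                          for _ in range(num_draws)]))

est = rad_mc(U)
bound = np.linalg.norm(U, axis=1).max() * np.sqrt(2 * np.log(k)) / n
# Massart: est <= bound; for this class the bound should be loose only
# by a modest constant factor.
```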

While this has a fancy name, it’s a consequence of the following lemma from homework.

Lemma. If $$(X_1,\ldots,X_n)$$ are $$c^2$$-subgaussian (but not necessarily independent or identically distributed), then ${\mathbb{E}}\max_i X_i \leq c \sqrt{2 \ln(n)}.$

Proof. As in the homework, this follows by noting $$\max_i X_i \leq \inf_{t>0}t^{-1} \ln \sum_i \exp(tX_i)$$, using the definition of $$c^2$$-subgaussian, and optimizing over $$t$$. (The homework problem had $$2n$$ rather than $$n$$, but it controlled $$\max_i |X_i|$$.) $${\qquad\square}$$
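Spelling out that optimization (a reconstruction of the homework argument, not verbatim from it): for any $$t>0$$, by Jensen's inequality and the subgaussian moment bound, ${\mathbb{E}}\max_i X_i \leq \frac 1 t \ln {\mathbb{E}}\sum_i \exp(tX_i) \leq \frac 1 t \ln\left( n e^{c^2t^2/2} \right) = \frac{\ln n}{t} + \frac{c^2 t}{2},$ and the choice $$t = \sqrt{2\ln(n)}/c$$ makes both terms equal to $$c\sqrt{\ln(n)/2}$$, giving the claimed $$c\sqrt{2\ln(n)}$$.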

Next we need to see how subgaussianity transfers from individual random variables to their averages.

Lemma. If $$(Z_1,\ldots,Z_n)$$ are $$c_i^2$$-subgaussian and independent, then $$\sum_i Z_i/n$$ is $$\sum_ic_i^2/n^2$$-subgaussian.

Proof. Set $$\bar Z := \sum_i Z_i / n$$. For any $$t\in {\mathbb{R}}$$, independence gives ${\mathbb{E}}(\exp(t \bar Z)) = \prod_i {\mathbb{E}}(\exp(tZ_i/n)) \leq \prod_i \exp(c_i^2t^2/(2n^2)) = \exp\left(t^2\sum_ic_i^2/(2n^2)\right).$ $${\qquad\square}$$
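For Rademacher signs this lemma can be checked exactly rather than by simulation (a small sketch, assuming Hoeffding's lemma gives that a uniform sign is $$1$$-subgaussian): the moment generating function of the average is $$\cosh(t/n)^n$$, and the pointwise inequality $$\cosh(x)\leq e^{x^2/2}$$ keeps it below $$\exp(t^2/(2n))$$.

```python
import numpy as np

# Z_i uniform on {-1,+1} are 1-subgaussian, so the lemma says their average
# Zbar over n terms is (1/n)-subgaussian: E exp(t * Zbar) <= exp(t^2 / (2n)).
# For signs the left side is exact: E exp(t * Zbar) = cosh(t/n)^n.
n = 25
for t in [0.5, 1.0, 3.0, 10.0]:
    mgf = np.cosh(t / n) ** n
    bound = np.exp(t ** 2 / (2 * n))
    assert mgf <= bound   # holds since cosh(x) <= exp(x^2 / 2)
```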

These lemmas suffice to prove the Rademacher bound.

Proof (of Massart finite lemma). For each $$v\in U$$, define the random variable $$X_v := {\left\langle \sigma, v \right \rangle}_n$$; crucially, the distribution of $$X_v$$ is determined by the distribution of $$\sigma$$. Moreover, $$\sigma_iv_i$$ is $$v_i^2$$-subgaussian by the Hoeffding lemma, so $$X_v$$ is $$\|v\|_2^2/n^2$$-subgaussian by the preceding lemma, which together with the lemma on maxima of subgaussian random variables gives the bound. $${\qquad\square}$$