Generalization?

Let’s summarize where we are in the course. To put everything together, let’s define a few quantities. \[ \begin{aligned} {\mathcal{R}}(f) &:= {\mathbb{E}}(\ell(f(X), Y)), \\ {\widehat{{\mathcal{R}}}}(f) &:= \frac 1 n \sum_{i=1}^n \ell(f(x_i), y_i), \\ \bar f &:= "{\text{argmin}}_{f\in {\mathcal{F}}}" {\mathcal{R}}(f), \\ \bar f_n &:= "{\text{argmin}}_{f\in {\mathcal{F}}}" {\widehat{{\mathcal{R}}}}(f), \\ \hat f &\quad\text{output of optimization algorithm}, \\ \bar g &:= "{\text{argmin}}_{g\text{ measurable}}" {\mathcal{R}}(g). \end{aligned} \] (As usual, \({\text{argmin}}\) has technical issues we are avoiding, hence the quotes.)

The goal in a machine learning problem is to make an algorithm that outputs \(\hat f\) so that \({\mathcal{R}}(\hat f) - {\mathcal{R}}(\bar g)\) is small. We can decompose this error into the following pieces: \[ \begin{aligned} {\mathcal{R}}(\hat f) - {\mathcal{R}}(\bar g) &= {\mathcal{R}}(\hat f) - {\widehat{{\mathcal{R}}}}(\hat f) &(\triangle) \\ &\quad+ {\widehat{{\mathcal{R}}}}(\hat f) - {\widehat{{\mathcal{R}}}}(\bar f_n) &(\square) \\ &\quad+ {\widehat{{\mathcal{R}}}}(\bar f_n) - {\widehat{{\mathcal{R}}}}(\bar f) &(\spadesuit) \\ &\quad+ {\widehat{{\mathcal{R}}}}(\bar f) - {\mathcal{R}}(\bar f) &(\triangle) \\ &\quad+ {\mathcal{R}}(\bar f) - {\mathcal{R}}(\bar g) &(\diamond). \end{aligned} \] These terms can be controlled as follows: the two \((\triangle)\) terms are statistical gaps between \({\widehat{{\mathcal{R}}}}\) and \({\mathcal{R}}\); \((\square)\) is the optimization error incurred in producing \(\hat f\); \((\spadesuit)\) is at most \(0\), since \(\bar f_n\) minimizes \({\widehat{{\mathcal{R}}}}\) over \({\mathcal{F}}\); and \((\diamond)\) is the approximation error of \({\mathcal{F}}\). (This course was designed around this decomposition!)
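
For concreteness, here is a tiny numerical instantiation of the decomposition: a minimal sketch assuming a toy distribution (uniform \(x\in\{0,\ldots,9\}\) with label \({\mathbf{1}}[x\geq 5]\) flipped with probability \(0.1\)), a two-element threshold class \({\mathcal{F}}\) that excludes the Bayes threshold, and a deliberately sloppy “optimizer” standing in for \(\hat f\); the five terms are computed and telescope back to \({\mathcal{R}}(\hat f)-{\mathcal{R}}(\bar g)\).

```python
# A sketch: all distributional choices, the class F, and the "optimizer" are arbitrary,
# picked only so that every term of the decomposition is easy to compute.
import numpy as np

rng = np.random.default_rng(0)
noise = 0.1                                    # label-flip probability; Bayes risk R(g_bar) = 0.1

def sample(n):
    x = rng.integers(0, 10, size=n)
    y = ((x >= 5).astype(int) ^ (rng.random(n) < noise)).astype(int)  # clean label 1[x >= 5], flipped w.p. noise
    return x, y

def true_risk(theta):
    # Exact 0-1 risk of f_theta(x) = 1[x >= theta] under the toy distribution.
    xs = np.arange(10)
    clean = (xs >= 5).astype(int)
    pred = (xs >= theta).astype(int)
    return float(np.mean(np.where(pred == clean, noise, 1 - noise)))

def emp_risk(theta, x, y):
    return float(np.mean((x >= theta).astype(int) != y))

F = [2, 7]                                     # threshold class excluding the Bayes threshold 5
x, y = sample(2000)

R     = {th: true_risk(th) for th in F}
Rhat  = {th: emp_risk(th, x, y) for th in F}
fbar  = min(F, key=R.get)                      # population risk minimizer over F
fbarn = min(F, key=Rhat.get)                   # empirical risk minimizer over F
fhat  = min(F, key=lambda th: emp_risk(th, x[:50], y[:50]))  # sloppy "optimizer": only sees 50 points
bayes = noise                                  # risk of the Bayes predictor g_bar(x) = 1[x >= 5]

terms = [
    ("triangle: R(fhat) - Rhat(fhat)",  R[fhat] - Rhat[fhat]),
    ("square:   Rhat(fhat) - Rhat(fn)", Rhat[fhat] - Rhat[fbarn]),
    ("spade:    Rhat(fn) - Rhat(fbar)", Rhat[fbarn] - Rhat[fbar]),
    ("triangle: Rhat(fbar) - R(fbar)",  Rhat[fbar] - R[fbar]),
    ("diamond:  R(fbar) - R(gbar)",     R[fbar] - bayes),
]
for name, val in terms:
    print(f"{name}: {val:+.4f}")
total = sum(val for _, val in terms)
print(f"sum: {total:+.4f}  vs  R(fhat) - R(gbar): {R[fhat] - bayes:+.4f}")
```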

So the rest of this course is part 3, the statistical problem above (comparing \({\widehat{{\mathcal{R}}}}\) and \({\mathcal{R}}\)), and then we’ll leave a few lectures for some advanced/miscellaneous topics.

Chernoff’s bounding method

We will work towards bounds of roughly the following form: with probability at least \(1-\delta\) over the draw of \(n\) i.i.d. samples \((X_1,\ldots,X_n)\) with \({\mathbb{E}}(X_1) = 0\), \[ \frac 1 n \sum_{i=1}^n X_i - {\mathbb{E}}(X_1) \leq {\mathcal{O}}\left( \text{scaling}\cdot \sqrt{\frac {\ln(1/\delta)}{n}} \right). \] The key thing to note is that the confidence parameter \(\delta\) appears as \(\ln(1/\delta)\); said another way, the number of samples scales linearly in the number of bits of desired confidence. For some of our uses, this type of scaling will be essential.
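
As a sanity check on this scaling (a numerical sketch; standard Gaussians and \(c=1\) are arbitrary choices): the empirical \((1-\delta)\)-quantile of the sample mean stays below \(\sqrt{2\ln(1/\delta)/n}\), the bound derived in the subgaussian theorem below, and both grow only like \(\sqrt{\ln(1/\delta)}\).

```python
# A numerical sketch (arbitrary choices: standard Gaussian samples, c = 1):
# compare the (1 - delta)-quantile of the sample mean to sqrt(2 ln(1/delta) / n),
# the subgaussian bound derived later in this section.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 100_000
means = rng.standard_normal((trials, n)).mean(axis=1)   # each entry: one draw of the sample mean

for delta in (0.1, 0.01, 0.001):
    empirical = np.quantile(means, 1 - delta)            # deviation exceeded with probability delta
    bound = np.sqrt(2 * np.log(1 / delta) / n)           # sqrt(2 c ln(1/delta) / n) with c = 1
    print(f"delta={delta}: empirical quantile {empirical:.4f}  vs  bound {bound:.4f}")
```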

Remark. All of our bounds will be of the preceding form: an upper bound on \(n^{-1} \sum_i X_i\), where \({\mathbb{E}}(X_1)=0\). To convert this into a lower bound, we can swap \(X_i\) with \(-X_i\), and to work with random variables which are not zero mean, we can use \(X_i - {\mathbb{E}}(X_i)\). There are, however, relevant bounds (which we will not cover) where the right hand side depends on whether it is a lower or upper bound; see for instance “multiplicative Chernoff bounds”. [ future matus: cite please. ] (Another choice, followed by some authors (for instance Maxim Raginsky’s book), is to work with \(|X_i - {\mathbb{E}}(X_i)|\) and union bound over both directions of the inequality, getting \(\ln(2/\delta)\) rather than \(\ln(1/\delta)\).)
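
For concreteness, here is the two-sided step sketched above, assuming each direction satisfies a tail bound of the form \(\exp(-n\epsilon^2/(2c))\) (as in the subgaussian theorem below): \[ {\text{Pr}}\left[ \left| \frac 1 n \sum_i X_i \right| \geq \epsilon \right] \leq {\text{Pr}}\left[ \frac 1 n \sum_i X_i \geq \epsilon \right] + {\text{Pr}}\left[ \frac 1 n \sum_i (-X_i) \geq \epsilon \right] \leq 2\exp\left( - \frac{n\epsilon^2}{2c} \right), \] and setting the right hand side to \(\delta\) gives, with probability at least \(1-\delta\), \(\left|\frac 1 n \sum_i X_i\right| < \sqrt{2c\ln(2/\delta)/n}\): the promised \(\ln(2/\delta)\).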

The first step along this path is the Markov inequality.

Theorem (Markov’s inequality). For any random variable \(X\) and any \(a>0\), \[ {\text{Pr}}[ |X| \geq a ] \leq \frac {{\mathbb{E}}|X|}{a}. \]

Proof. It suffices to note \({\mathbf{1}}[ |X|\geq a ] \leq |X|/a\) and apply \({\mathbb{E}}(\cdot)\) to both sides. \({\qquad\square}\)

Corollary. For any nondecreasing \(f :{\mathbb{R}}\to {\mathbb{R}}_+\) and any \(a\in{\mathbb{R}}\) with \(f(a) > 0\), \[ {\text{Pr}}[ X \geq a ] \leq \frac {{\mathbb{E}}(f(X)) }{ f(a) }. \]

Proof. Since \(f\) is nondecreasing, \({\text{Pr}}[ X \geq a ] \leq {\text{Pr}}[ f(X) \geq f(a) ]\), and the bound follows by Markov’s inequality. (If \(f\) were strictly increasing then the first inequality would be an equality.) \({\qquad\square}\)

Example.
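
One concrete numerical comparison (a sketch; the exponential distribution is an arbitrary choice): for \(X\) exponential with mean \(1\), the true tail \({\text{Pr}}[X\geq a] = e^{-a}\) decays much faster than Markov’s bound \({\mathbb{E}}|X|/a\), and the corollary with \(f(x) = \max(x,0)^2\) already does noticeably better.

```python
# A numerical sketch (arbitrary choice of distribution): the true tail Pr[X >= a],
# Markov's bound E|X|/a, and the second-moment bound E[f(X)]/f(a) from the
# corollary with f(x) = max(x, 0)^2 (which equals x^2 on the support of X here).
import numpy as np

rng = np.random.default_rng(0)
samples = rng.exponential(scale=1.0, size=1_000_000)   # X >= 0 with E[X] = 1, E[X^2] = 2

for a in (2.0, 4.0, 8.0):
    true_tail = (samples >= a).mean()                  # Monte Carlo estimate of Pr[X >= a]
    markov = samples.mean() / a                        # E|X| / a
    second_moment = (samples ** 2).mean() / a ** 2     # E[f(X)] / f(a) with f(x) = max(x, 0)^2
    print(f"a={a}: Pr[X>=a] ~ {true_tail:.5f}, Markov {markov:.5f}, second moment {second_moment:.5f}")
```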

So far, it has seemed that higher moments help; at least, with controls on enough moments, we get the \(\ln(1/\delta)\) that we desired. Unfortunately, the number of moments we need depends on the desired confidence \(\delta\), which is awkward.

Along these lines, recall the Taylor expansion \[ e^{x} = \sum_{i=0}^\infty \frac {x^i}{i!}. \] In the first homework, we discussed the moment generating function \({\mathbb{E}}(\exp(tX))\). Based on the above Taylor expansion, this function should be sensitive to all the moments, though in a strange way: moment \(p\) is rescaled by \(p!\). Let’s see what happens if we plug this into the earlier corollary which combined Markov with a nondecreasing function: for any \(\epsilon > 0\), \[ {\text{Pr}}[ X \geq \epsilon ] = \inf_{t\geq 0} {\text{Pr}}[ X \geq \epsilon ] \leq \inf_{t\geq 0} \frac {{\mathbb{E}}(\exp(tX))}{\exp(t\epsilon)}. \] If instead we have the random variable \(n^{-1} \sum_i X_i\), then we get \[ {\text{Pr}}[ n^{-1} \sum_i X_i \geq \epsilon ] \leq \inf_{t\geq 0} \frac {{\mathbb{E}}(\prod_i \exp(tX_i/n))}{\exp(t\epsilon)}, \] where so far we have neither assumed i.i.d., nor have we assumed \(X_i\) has mean zero!
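
As a small illustration of the method (a sketch; the standard Gaussian and the grid over \(t\) are arbitrary choices): the infimum can be computed numerically, and for \(X\sim N(0,1)\), where \({\mathbb{E}}(\exp(tX)) = \exp(t^2/2)\), it recovers the familiar \(\exp(-\epsilon^2/2)\) tail bound.

```python
# A sketch of Chernoff's method via a grid search over t >= 0: for a standard
# Gaussian, inf_t E[exp(tX)] / exp(t*eps) should come out to exp(-eps^2/2), at t = eps.
import numpy as np

def chernoff_bound(eps, log_mgf, ts=np.linspace(0.0, 10.0, 10_001)):
    # inf over a grid of t >= 0 of E[exp(tX)] / exp(t * eps), computed in log space
    return np.exp(np.min(log_mgf(ts) - ts * eps))

log_gaussian_mgf = lambda t: t ** 2 / 2   # log E[exp(tX)] for X ~ N(0, 1)

for eps in (0.5, 1.0, 2.0):
    print(f"eps={eps}: Chernoff bound {chernoff_bound(eps, log_gaussian_mgf):.4f}, "
          f"exp(-eps^2/2) = {np.exp(-eps ** 2 / 2):.4f}")
```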

Recall (from the homework) that a random variable \(Z\) is \(c\)-subgaussian if \({\mathbb{E}}(\exp(tZ)) \leq \exp(t^2 c / 2)\) for every \(t\in {\mathbb{R}}\). We can use this to complete the preceding derivation.

Theorem. Suppose \((X_1,\ldots,X_n)\) are independent, and each \(X_i\) is \(c\)-subgaussian. Then \[ {\text{Pr}}[ n^{-1} \sum_i X_i \geq \epsilon ] \leq \exp\left(- \frac {n\epsilon^2}{2c} \right). \] In particular, with probability at least \(1-\delta\), \[ \frac 1 n \sum_i X_i < \sqrt{\frac {2c\ln(1/\delta)} {n}}. \]

Proof. Continuing from the earlier derivation, using independence and the definition of \(c\)-subgaussian, \[ {\text{Pr}}[ n^{-1} \sum_i X_i \geq \epsilon ] \leq \inf_{t\geq 0} \frac {{\mathbb{E}}(\prod_i \exp(tX_i/n))}{\exp(t\epsilon)} = \inf_{t\geq 0} \frac {\prod_i {\mathbb{E}}(\exp(tX_i/n))}{\exp(t\epsilon)} \leq \inf_{t\geq 0} \exp\left( \frac {ct^2}{2n} - t\epsilon \right). \] The quantity inside the \(\exp(\cdot)\) is a convex quadratic in \(t\), minimized at \(t:= \epsilon n / c\). Plugging this in gives the first bound, and the second bound follows from setting the right hand side to \(\delta\) and solving for \(\epsilon\). \({\qquad\square}\)

Example.
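
For instance, a quick numerical check of the theorem (a sketch; standard Gaussians are used since \(N(0,1)\) is \(1\)-subgaussian): the empirical tail of the sample mean sits well below \(\exp(-n\epsilon^2/(2c))\).

```python
# A sketch with c = 1 (standard Gaussians are 1-subgaussian): compare the
# empirical tail Pr[mean >= eps] to the bound exp(-n * eps^2 / (2c)).
import numpy as np

rng = np.random.default_rng(0)
n, trials, c = 100, 100_000, 1.0
means = rng.standard_normal((trials, n)).mean(axis=1)

for eps in (0.1, 0.2, 0.3):
    empirical = (means >= eps).mean()
    bound = np.exp(-n * eps ** 2 / (2 * c))
    print(f"eps={eps}: empirical {empirical:.5f}  vs  bound {bound:.5f}")
```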

Before we close this topic, there is a final useful inequality for us, which goes beyond the pure i.i.d. setting.

Theorem (Azuma’s inequality). Let a martingale sequence \((X_0,\ldots,X_n)\) be given, meaning \({\mathbb{E}}(X_i - X_{i-1} | X_{i-1},\ldots,X_0) = 0\). Suppose further that \(|X_i - X_{i-1}| \leq c_i\) with probability \(1\) (for some \(c_i\)). Then \[ {\text{Pr}}\left[ X_n - X_0 \geq \epsilon \right] \leq \exp\left( - \frac {\epsilon^2} {2\sum_i c_i^2} \right). \]

Remark.

[ future matus: maybe only use martingale difference sequences? ]

Proof (of Azuma). For convenience, define \(Z_i := X_i - X_{i-1}\), whereby \(X_n - X_0 = \sum_i Z_i\). Stepping through the proof of Hoeffding’s inequality, independence was not used to obtain the inequality \[ {\text{Pr}}[ X_n - X_0 \geq \epsilon] = {\text{Pr}}\left[ \sum_i Z_i \geq \epsilon \right] \leq \inf_{t\geq 0} \frac {{\mathbb{E}}\left(\prod_{i=1}^n \exp(t Z_i)\right)}{\exp(t\epsilon)}. \] By properties of conditional expectation and martingales, we can manipulate this expression to conditionally apply the Hoeffding lemma: since \(Z_1,\ldots,Z_{n-1}\) are functions of \((X_0,\ldots,X_{n-1})\), \[ \begin{aligned} {\mathbb{E}}\left(\prod_{i=1}^n \exp(t Z_i)\right) &= {\mathbb{E}}\left({\mathbb{E}}\left(\prod_{i=1}^n \exp(t Z_i)| X_{n-1}, \ldots, X_0\right)\right) \\ &= {\mathbb{E}}\left(\prod_{i=1}^{n-1} \exp(t Z_i)\, {\mathbb{E}}\left(\exp(t Z_n)| X_{n-1}, \ldots, X_0\right)\right) \\ &\leq \exp(t^2 c_n^2 / 2)\, {\mathbb{E}}\left(\prod_{i=1}^{n-1} \exp(t Z_i)\right), \end{aligned} \] where the last step is the Hoeffding lemma applied conditionally: given \(X_{n-1},\ldots,X_0\), the increment \(Z_n\) has mean zero and lies in \([-c_n, c_n]\), an interval of length \(2c_n\), so its conditional moment generating function is at most \(\exp(t^2 (2c_n)^2/8) = \exp(t^2 c_n^2/2)\). Proceeding with this derivation recursively, we obtain \[ {\text{Pr}}[ X_n - X_0 \geq \epsilon] \leq \inf_{t\geq 0} \frac {\exp\left(t^2 \sum_i c_i^2 / 2\right)}{\exp(t\epsilon)}, \] and finish as in the subgaussian case (optimizing over \(t\) gives the stated bound). \({\qquad\square}\)
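
As a quick numerical check (a sketch; the \(\pm 1\) random walk is just one convenient bounded-difference martingale, with \(c_i = 1\), and here its increments happen to be i.i.d. even though Azuma does not need that): the empirical tail of \(X_n - X_0\) stays below \(\exp(-\epsilon^2/(2\sum_i c_i^2))\).

```python
# A sketch: a +/-1 random walk as a bounded-difference martingale with c_i = 1,
# comparing the empirical tail of X_n - X_0 to exp(-eps^2 / (2 * sum_i c_i^2)).
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 100_000
increments = rng.choice([-1.0, 1.0], size=(trials, n))   # martingale differences Z_i with |Z_i| <= 1
walks = increments.sum(axis=1)                            # X_n - X_0 for each trial
sum_c_sq = n * 1.0 ** 2                                   # sum_i c_i^2 with c_i = 1

for eps in (10.0, 20.0, 30.0):
    empirical = (walks >= eps).mean()
    bound = np.exp(-eps ** 2 / (2 * sum_c_sq))
    print(f"eps={eps}: empirical {empirical:.5f}  vs  Azuma bound {bound:.5f}")
```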

References