Syllabus overview

(The first part of lecture is interactive with the class, going over the course webpage: the syllabus, except for the schedule.)

Philosophy behind course

Machine learning theory overview

Machine learning is: using data to tune algorithms.

Example (linear regression). Suppose our goal is to predict \(y\in {\mathbb{R}}\) given \(x\in{\mathbb{R}}^d\), say via a linear predictor \(x\mapsto {\left\langle w, x \right \rangle}\) chosen to make the least squares error small. This immediately raises three questions: representation (which class of predictors should we even consider?), optimization (how do we efficiently find a good predictor within that class?), and generalization (why should a predictor chosen using the data we have perform well on data we haven't seen?).

The main part of the schedule is roughly split into these three issues. Note that we've ignored the question ``why use least squares error''!
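As one concrete formalization (an assumption on our part; the lecture left the loss abbreviated as ``least squares''), the quality of a weight vector \(w\) could be measured by its risk, the expected squared error on a fresh example: \[ {\mathcal{R}}(w) := {\mathbb{E}}\left( \left( {\left\langle w, x \right \rangle} - y \right)^2 \right). \notag \]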

Schedule overview

(The schedule is then discussed; it should make more sense now.)

A quick proof

Note: a tremendous amount of detail was omitted in lecture; this material will only appear on homeworks in a few weeks, once it has been presented properly in the ``optimization section''. In other words, this material is optional!

This proof will continue the above example: the statistical learning model, using linear regression. This means we get an IID sample \(((x_i,y_i))_{i=1}^t\), and must predict well on future examples from the same distribution. We'll sketch a derivation that should look natural, and at the end state the theorem it gives.
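In symbols (our own formalization, though it matches the theorem at the end of this section), the goal is to output \(\hat w\) from the sample so that the excess risk \[ {\mathbb{E}}\left( {\mathcal{R}}(\hat w) \right) - {\mathcal{R}}(u) \notag \] is small for every fixed comparator \(u\), where \({\mathcal{R}}\) is the risk (the expected loss on a fresh example from the same distribution), e.g. the least squares risk displayed above.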

Regarding the three questions above: ``representation'' will be discussed in the next class, but we must contend here with optimization and generalization. We will cheat on both: for optimization, we restrict to a convex problem (linear predictors over a convex set, with a convex loss), so that a simple gradient method provably works; for generalization, we run a single pass of stochastic gradient descent, touching each example exactly once, so the iterates directly control the population risk and no separate generalization argument is needed.

The algorithm will be given momentarily, and makes use of the following objects: a closed convex set \(B \subseteq {\mathbb{R}}^d\) of weight vectors, contained in a Euclidean ball of radius \(R\), together with the orthogonal projection \(\Pi_B\) onto \(B\); the risk \({\mathcal{R}}(w)\), assumed convex and differentiable in \(w\); stochastic gradient estimates \(\hat g_{i+1}\), each computed from a single fresh example and satisfying \(\|\hat g_{i+1}\| \leq G\); and a step size \(\eta > 0\).

The stochastic gradient iteration has \(w_0\in B\) arbitrary, and thereafter \[ \begin{aligned} w_{i+1} &:= \Pi_B( w_i - \eta \hat g_{i+1} ) \\ &= \Pi_B( w_i - \eta (\nabla {\mathcal{R}}(w_i) + \varepsilon_{i+1})), \end{aligned} \] where \(\hat g_{i+1}\) is the stochastic gradient computed from the fresh example \((x_{i+1},y_{i+1})\), and \(\varepsilon_{i+1} := \hat g_{i+1} - \nabla {\mathcal{R}}(w_i)\) is the corresponding gradient noise; since \(w_i\) depends only on the first \(i\) examples, the conditional expectation of \(\hat g_{i+1}\) given those examples is \(\nabla {\mathcal{R}}(w_i)\), and hence the conditional expectation of \(\varepsilon_{i+1}\) is zero.
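As a concrete illustration, here is a minimal code sketch of this one-pass iteration, assuming (purely for the example; neither is pinned down above) that \(B\) is the Euclidean ball of radius \(R\) centered at the origin and that the loss is the squared error from the linear regression example:

```python
import numpy as np

def projected_sgd(samples, R, eta, d):
    """One-pass projected stochastic gradient descent (a sketch).

    Illustrative assumptions (not fixed by the lecture): B is the Euclidean
    ball {v : ||v|| <= R} centered at the origin, and the loss on an example
    (x, y) is the squared error (<w, x> - y)^2.
    """
    w = np.zeros(d)                        # w_0 = 0, which lies in B
    iterates = [w.copy()]
    for x, y in samples:                   # each IID example is used exactly once
        g_hat = 2.0 * (w @ x - y) * x      # stochastic gradient from the fresh example
        w = w - eta * g_hat                # gradient step
        norm = np.linalg.norm(w)
        if norm > R:                       # Euclidean projection back onto B
            w = (R / norm) * w
        iterates.append(w.copy())
    # averaged iterate: \bar w_{t-1} = (w_0 + ... + w_{t-1}) / t
    return np.mean(iterates[:-1], axis=0)
```

The returned value is the averaged iterate \(\bar w_{t-1}\) that appears in the analysis below.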

Remark. Many of these assumptions are fine when \(B\) is compact and \(X,Y\) also lie in compact sets (almost surely). But those compactness assumptions themselves are often false in practice.

To analyze this, imagine that \(u\) is very good and unique; we'd like \(\|w_{i} - u\|\) to become small. Indeed (without any assumptions on \(u\) except \(u\in B\)): \[ \begin{aligned} \| w_{i+1} - u \|^2 &= \| \Pi_B( w_i - \eta \hat g_{i+1} ) - u \|^2 \\ &\leq \| w_i - \eta \hat g_{i+1} - u \|^2 &\text{(see below)} \\ &= \| w_i - u \|^2 + 2 \eta {\left\langle \hat g_{i+1}, u - w_i \right \rangle} + \eta^2 \|\hat g_{i+1}\|^2 \\ &\leq \| w_i - u \|^2 + 2 \eta {\left\langle (\nabla_{w} {\mathcal{R}})(w_i), u - w_i \right \rangle} + 2 \eta {\left\langle \varepsilon_{i+1}, u - w_i \right \rangle} + \eta^2 G^2 \\ &\leq \| w_i - u \|^2 + 2 \eta ({\mathcal{R}}(u) - {\mathcal{R}}(w_i)) + 2 \eta {\left\langle \varepsilon_{i+1}, u - w_i \right \rangle} + \eta^2 G^2, \end{aligned} \] where the orthogonal projection was removed because projections onto convex sets are nonexpansive, i.e. \(\|\Pi_B(z) - u\| \leq \|z - u\|\) for any \(u \in B\) (essentially the Pythagorean theorem; we'll do this in detail in a homework), the second-to-last step used the decomposition \(\hat g_{i+1} = \nabla {\mathcal{R}}(w_i) + \varepsilon_{i+1}\) together with the bound \(\|\hat g_{i+1}\| \leq G\), and the last step used the first-order convexity inequality \({\mathcal{R}}(u) \geq {\mathcal{R}}(w_i) + {\left\langle \nabla {\mathcal{R}}(w_i), u - w_i \right \rangle}\).

Rearranging and summing over \(i \in \{0,\ldots,t-1\}\) (the \(\|w_i - u\|^2\) terms telescope), and then applying Jensen's inequality to the convex function \({\mathcal{R}}\), \[ \begin{aligned} \|w_0 - u \|^2 + 2 \eta \sum_{i=0}^{t-1} {\left\langle \varepsilon_{i+1}, u - w_i \right \rangle} + t\eta^2 G^2 &\geq 2\eta \sum_{i=0}^{t-1} \left( {\mathcal{R}}(w_i) - {\mathcal{R}}(u) \right) \\ &= 2\eta t \sum_{i=0}^{t-1} \frac 1 t \left( {\mathcal{R}}(w_i) - {\mathcal{R}}(u) \right) \\ &\geq 2\eta t \left( {\mathcal{R}}(\bar w_{t-1}) - {\mathcal{R}}(u) \right), \end{aligned} \] where \(\bar w_{t-1} = \sum_{i=0}^{t-1} w_i / t\) is the averaged iterate, which occasionally appears in practice (though not commonly). Noting that \(w_0 \in B\) implies \(\|u-w_0\| \leq 2R\), dividing by \(2\eta t\), and loosening the first and last terms slightly (dropping two factors of \(1/2\)) gives \[ {\mathcal{R}}(\bar w_{t-1}) - {\mathcal{R}}(u) \leq \frac {4R^2}{\eta t} + \frac {1}{t} \sum_{i=0}^{t-1} {\left\langle \varepsilon_{i+1}, u - w_i \right \rangle} + \eta G^2. \notag \]

To simplify further, consider the middle term of the right hand side. Since \({\mathbb{E}}(\varepsilon_{i+1}) = 0\), it seems this term should vanish in expectation. But the two sides of the inner product are not independent, so we cannot simply pull the expectation inside. The trick is to condition on the first \(i\) examples: this determines \(w_i\) (and hence \(u - w_i\)), while the conditional expectation of \(\varepsilon_{i+1}\) is still zero: \[ \begin{aligned} {\mathbb{E}}({\left\langle \varepsilon_{i+1}, u - w_i \right \rangle}) &= {\mathbb{E}}({\mathbb{E}}({\left\langle \varepsilon_{i+1}, u - w_i \right \rangle}\ |\ ((x_1,y_1),\ldots,(x_i,y_i)))) \\ &= {\mathbb{E}}({\left\langle {\mathbb{E}}(\varepsilon_{i+1}\ |\ ((x_1,y_1),\ldots,(x_i,y_i))) , u - w_i \right \rangle}) = 0. \end{aligned} \] This gives \[ {\mathbb{E}}( {\mathcal{R}}(\bar w_{t-1}) - {\mathcal{R}}(u) ) \leq \frac {4R^2}{\eta t} + \eta G^2. \notag \]

The last step is to optimize over \(\eta\); any choice proportional to \(1/\sqrt{t}\) is fine, but the choice minimizing the bound is \(2R/(G\sqrt{t})\), which gives the following.
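Unpacking that final step (this is just calculus on the bound above, nothing beyond what the lecture stated): \[ \frac{d}{d\eta}\left( \frac{4R^2}{\eta t} + \eta G^2 \right) = -\frac{4R^2}{\eta^2 t} + G^2 = 0 \quad\Longleftrightarrow\quad \eta = \frac{2R}{G\sqrt{t}}, \qquad\text{whence}\qquad \frac{4R^2}{\eta t} + \eta G^2 = \frac{2RG}{\sqrt{t}} + \frac{2RG}{\sqrt{t}} = \frac{4RG}{\sqrt{t}}. \notag \]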

Theorem. Suppose \(w_0 = 0\), and thereafter \(w_i\) is produced by the SGD iteration with \(\eta = 2R/(G\sqrt{t})\). Then \[ {\mathbb{E}}( {\mathcal{R}}(\bar w_{t-1}) - {\mathcal{R}}(u) ) \leq \frac {4RG}{\sqrt{t}}, \notag \] where \(u \in B\) is arbitrary and the expectation is over the training set \(((x_i,y_i))_{i=1}^t\).
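For instance, continuing the code sketch above (every constant, the data distribution, and the value of \(G\) here are illustrative assumptions, not something fixed by the theorem), the theorem's step size would be plugged in as follows:

```python
import numpy as np

# Hypothetical usage of the projected_sgd sketch above; d, t, R, G, the noise
# level, and the data distribution are all made up for illustration.
rng = np.random.default_rng(0)
d, t, R = 5, 10_000, 1.0
G = 25.0                                   # a crude assumed bound on ||g_hat||
w_true = np.full(d, R / np.sqrt(d))        # a comparator u inside the ball B
X = rng.uniform(-1.0, 1.0, size=(t, d))    # bounded features keep gradients bounded
Y = X @ w_true + 0.1 * rng.uniform(-1.0, 1.0, size=t)
eta = 2 * R / (G * np.sqrt(t))             # the step size from the theorem
w_bar = projected_sgd(zip(X, Y), R=R, eta=eta, d=d)
```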

Remarks.

Open problems. Here are a few suggested by the lecture.
