
AdaBoost setup

Let’s now look at the AdaBoost setup in more detail.

The AdaBoost algorithm is then \(\|\cdot\|_1\) steepest descent, expanded with the present notation as follows.

  1. Set \(w_0 := 0\); this convention gives the invariant that \(t\) iterations of boosting produce \(w_t\) with at most \(t\) nonzero entries. This is desirable since the predictor \(x \mapsto \sum_{j=1}^d w_j h_j(x)\) becomes more expensive to evaluate as \(w\) becomes denser.

  2. For \(s = 1,\ldots, t\):

    1. Choose \[ v_s := {\text{argmax}}{\left\{ { {\left\langle \nabla ({\mathcal{L}}\circ (-A))(w_{s-1}), v \right \rangle} : \|v\|_1 \leq 1 } \right\}} = {\text{argmax}}{\left\{ { {\left\langle v, -A^\top\nabla{\mathcal{L}}(-Aw_{s-1}) \right \rangle} : \|v\|_1 \leq 1 } \right\}}, \] where the second form uses \(\nabla({\mathcal{L}}\circ(-A))(w) = -A^\top\nabla{\mathcal{L}}(-Aw)\). This can be simplified further by noting that the maximizer may be taken, without loss of generality, to be a signed coordinate: \(v_s := r\cdot {\textbf{e}}_j\), where \[ \begin{aligned} j &:= {\text{argmax}}_j \left| \left( A^\top \nabla{\mathcal{L}}(-Aw_{s-1}) \right)_j \right| \\ &= {\text{argmax}}_j \left| \sum_{i=1}^n y_i \ell'(-(Aw_{s-1})_i) h_j(x_i) \right|, \\ r &:= {\text{sgn}}\left( \left( -A^\top \nabla {\mathcal{L}}(-Aw_{s-1}) \right)_j \right). \end{aligned} \] Note that this optimization for \(j\) is a re-weighted ERM problem: the vector \(\nabla{\mathcal{L}}(-Aw_{s-1})\) provides the re-weighting. This \({\text{argmax}}\) searches over \({\mathcal{H}}\) for a good choice \(h_j\in{\mathcal{H}}\); namely, each round of AdaBoost selects another good hypothesis from \({\mathcal{H}}\).

    2. Update \(w_s := w_{s-1} - \eta_s v_s\). Previously, the step size \(\eta_s\) was set to \(\|\nabla ({\mathcal{L}}\circ (-A))(w_{s-1})\|_\infty / \beta\); here the choice will be somewhat more complicated, and is deferred to the statement of the convergence rate. (A code sketch of the full loop appears just after this list.)
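To make the loop concrete, here is a minimal NumPy sketch (the function name `adaboost_l1` and the argument `T` are made up for illustration), assuming the exponential loss \(\ell = \exp\) and a finite hypothesis class presented as the matrix \(A\) with \(A_{ij} = y_i h_j(x_i) \in [-1,1]\); the step size is the one from the convergence theorem below.

```python
import numpy as np

def adaboost_l1(A, T):
    """l1 steepest descent on w -> L(-Aw) with the exponential loss.

    A is the n x d matrix with A[i, j] = y_i * h_j(x_i), entries in [-1, 1].
    Returns w_T, which has at most T nonzero coordinates (since w_0 = 0).
    """
    n, d = A.shape
    w = np.zeros(d)                          # w_0 := 0
    for _ in range(T):
        g = np.exp(-(A @ w)) / n             # nabla L(-Aw); also the example re-weighting
        scores = A.T @ g                     # scores[j] = (1/n) sum_i y_i l'(-(Aw)_i) h_j(x_i)
        j = int(np.argmax(np.abs(scores)))   # re-weighted ERM: pick the best hypothesis h_j
        eta = np.abs(scores[j]) / np.sum(g)  # step size from the theorem below:
                                             #   ||A^T nabla L(-Aw)||_inf / ||nabla L(-Aw)||_1
        w[j] += eta * np.sign(scores[j])     # same as w := w - eta * v with v as in step 2.1
    return w
```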


Convergence under separability

This section proves a convergence rate for AdaBoost under the separability assumption, using only the earlier steepest descent tools. The rate is identical to the classical rate of Freund and Schapire (1997), and the step size is identical as well, so the iterate sequences coincide.

Before proceeding, there is another way to write the separability assumption, which will be more convenient mathematically. Say the weak learning assumption (WLA) is satisfied for matrix \(A\) with constant \(\gamma > 0\) if \[ \inf_{\phi\in {\mathbb{R}}_+^n\setminus \{0\}} \frac {\|A^\top \phi\|_\infty}{\|\phi\|_1} \geq \gamma. \]

Remark. Perhaps the WLA looks strange, but note the following properties. Since the ratio \(\|A^\top\phi\|_\infty/\|\phi\|_1\) is invariant to positive rescaling of \(\phi\), the infimum may equivalently be taken over the probability simplex. Moreover, by a standard minimax/duality argument (Telgarsky 2012), \[ \inf_{\phi\in{\mathbb{R}}_+^n\setminus\{0\}} \frac{\|A^\top\phi\|_\infty}{\|\phi\|_1} = \max_{\|w\|_1\leq 1}\, \min_{i\in[n]} (Aw)_i, \] so the WLA with constant \(\gamma\) holds exactly when some \(\bar w\) with \(\|\bar w\|_1\leq 1\) satisfies \(\min_i (A\bar w)_i \geq \gamma\); that is, it is the separability assumption with margin \(\gamma\). A small numerical illustration follows.
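To illustrate this numerically, here is a small sketch (the function name `wla_constant` is made up for illustration) that computes the left-hand side as a linear program via `scipy.optimize.linprog`, restricting \(\phi\) to the probability simplex.

```python
import numpy as np
from scipy.optimize import linprog

def wla_constant(A):
    """Compute min over probability vectors phi of ||A^T phi||_inf.

    Variables x = (phi_1, ..., phi_n, t); minimize t subject to
      -t <= (A^T phi)_j <= t for all j,   phi >= 0,   sum_i phi_i = 1.
    """
    n, d = A.shape
    c = np.concatenate([np.zeros(n), [1.0]])              # objective: minimize t
    A_ub = np.block([[A.T, -np.ones((d, 1))],             # (A^T phi)_j - t <= 0
                     [-A.T, -np.ones((d, 1))]])           # -(A^T phi)_j - t <= 0
    b_ub = np.zeros(2 * d)
    A_eq = np.concatenate([np.ones(n), [0.0]])[None, :]   # sum_i phi_i = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n + [(None, None)]              # phi >= 0, t free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.fun

# Example: 2 examples, 2 hypotheses, separable with margin 1 via w = e_1.
A = np.array([[1.0, -1.0],
              [1.0, 1.0]])
print(wla_constant(A))   # approximately 1.0, matching max_{||w||_1 <= 1} min_i (Aw)_i
```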

The convergence theorem is as follows. The step size may look odd; its origin will become clear in the proof.

Theorem. Suppose the WLA holds with constant \(\gamma > 0\), and \((w_i)_{i=0}^t\) are generated by AdaBoost with step size \(\eta_i := \|A^\top \nabla{\mathcal{L}}(-Aw_{i-1})\|_\infty / \|\nabla {\mathcal{L}}(-Aw_{i-1})\|_1\), which is at least \(\gamma\) by the WLA since \(\nabla{\mathcal{L}}(-Aw_{i-1})\) has nonnegative coordinates. Then \[ {\mathcal{L}}(-Aw_t) \leq \exp\left(- t\gamma^2 / 2\right). \]
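Remark. One way to read this rate: since the exponential loss dominates the 0-1 loss, \[ \frac{1}{n}\,\#\left\{ i : {\text{sgn}}\Big(\textstyle\sum_j w_{t,j} h_j(x_i)\Big) \neq y_i \right\} \;\leq\; \frac{1}{n}\sum_{i=1}^n \exp\left(-(Aw_t)_i\right) \;=\; {\mathcal{L}}(-Aw_t) \;\leq\; \exp\left(-t\gamma^2/2\right), \] and the left-hand side is a multiple of \(1/n\), so the training error is exactly \(0\) once \(t > 2\ln(n)/\gamma^2\).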


Let’s now try to prove the theorem, proceeding from an upper bound resulting from smoothness as usual. Or, rather, almost as usual: let’s slip in one detail, namely let’s suppose the function is \(\beta_i\)-smooth in iteration \(i\), and that the step size is \(\eta_i := \|A^\top \nabla{\mathcal{L}}(-Aw_{i-1})\|_\infty / \beta_i\). (By “\(\beta_i\)-smooth in iteration \(i\)”, we mean that \({\mathcal{L}}\circ (-A)\) is \(\beta_i\)-smooth with respect to \(\|\cdot\|_1\) over the sublevel set \(\{w\in{\mathbb{R}}^d : {\mathcal{L}}(-Aw) \leq {\mathcal{L}}(-Aw_{i-1})\}\); these are the only points we care about in this iteration.) This, combined with the WLA, gives \[ \begin{aligned} {\mathcal{L}}(-Aw_i) &\leq {\mathcal{L}}(-Aw_{i-1}) - \frac {\|A^\top \nabla {\mathcal{L}}(-Aw_{i-1})\|_\infty^2}{2\beta_i} \\ &\leq {\mathcal{L}}(-Aw_{i-1}) - \frac {\gamma^2 \|\nabla {\mathcal{L}}(-Aw_{i-1})\|_1^2}{2\beta_i}. \end{aligned} \]
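To spell out these two inequalities: the first is the usual smoothness (descent lemma) computation, using \(\beta_i\)-smoothness with respect to \(\|\cdot\|_1\) on this sublevel set (together with the standard check that the segment from \(w_{i-1}\) to \(w_i\) stays inside it), \(\|v_i\|_1 = 1\), and \({\left\langle \nabla({\mathcal{L}}\circ(-A))(w_{i-1}), v_i \right\rangle} = \|A^\top\nabla{\mathcal{L}}(-Aw_{i-1})\|_\infty\): \[ {\mathcal{L}}(-Aw_i) \leq {\mathcal{L}}(-Aw_{i-1}) - \eta_i \left\|A^\top\nabla{\mathcal{L}}(-Aw_{i-1})\right\|_\infty + \frac{\beta_i \eta_i^2}{2} = {\mathcal{L}}(-Aw_{i-1}) - \frac{\left\|A^\top\nabla{\mathcal{L}}(-Aw_{i-1})\right\|_\infty^2}{2\beta_i}, \] where the equality plugs in \(\eta_i = \|A^\top\nabla{\mathcal{L}}(-Aw_{i-1})\|_\infty/\beta_i\). The second inequality is the WLA applied to \(\phi := \nabla{\mathcal{L}}(-Aw_{i-1})\), which has nonnegative coordinates.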

It appears we are stuck, but momentarily we will prove the following (which will return us to our earlier step size).

Lemma. For any \(w,w'\) with \(\max\{{\mathcal{L}}(-Aw),{\mathcal{L}}(-Aw')\} \leq {\mathcal{L}}(-Aw_{i-1})\), \[ \left\| \nabla({\mathcal{L}}\circ (-A))(w) - \nabla({\mathcal{L}}\circ (-A))(w') \right\|_\infty \leq {\mathcal{L}}(-Aw_{i-1}) \|w-w'\|_1. \]

In particular, this lemma says we can take \(\beta_i := {\mathcal{L}}(-Aw_{i-1})\) (which recovers the step size from the theorem statement). Plugging this in and noting \(\|\nabla{\mathcal{L}}(-Aw_{i-1})\|_1 = {\mathcal{L}}(-Aw_{i-1})\) for the exponential loss, \[ \begin{aligned} {\mathcal{L}}(-Aw_i) &\leq {\mathcal{L}}(-Aw_{i-1}) - \frac {\gamma^2 \|\nabla{\mathcal{L}}(-Aw_{i-1})\|_1^2}{2{\mathcal{L}}(-Aw_{i-1})} \\ &= {\mathcal{L}}(-Aw_{i-1}) \left( 1 - \frac {\gamma^2}{2} \right). \end{aligned} \] Since \({\mathcal{L}}(-Aw_0) = {\mathcal{L}}(0) = 1\) and \(1 - \gamma^2/2 \leq \exp(-\gamma^2/2)\), unrolling this recursion over \(t\) iterations gives \({\mathcal{L}}(-Aw_t) \leq (1 - \gamma^2/2)^t \leq \exp(-t\gamma^2/2)\), as claimed.

All that remains is to prove the lemma.

Proof (of preceding lemma). The steps below are, in order: the \(\ell_\infty\) norm written as a maximum over the \(\ell_1\) ball; Hölder’s inequality together with \(\|Av\|_\infty \leq \|v\|_1 \leq 1\) (recall \(|A_{ij}| = |y_i h_j(x_i)| \leq 1\)); the fundamental theorem of calculus applied to \(\ell'\); the triangle inequality, \(\ell'' \geq 0\), and pulling out \(\|Aw - Aw'\|_\infty\); the bound \(\ell'' \leq \ell\) (an equality for the exponential loss) together with \(\|Aw - Aw'\|_\infty \leq \|w - w'\|_1\); rewriting the average as \({\mathcal{L}}\) at a point on the segment between \(w\) and \(w'\); and finally convexity of \({\mathcal{L}}\circ(-A)\) together with the assumption \(\max\{{\mathcal{L}}(-Aw),{\mathcal{L}}(-Aw')\} \leq {\mathcal{L}}(-Aw_{i-1})\). \[ \begin{aligned} \left\| \nabla({\mathcal{L}}\circ (-A))(w) - \nabla({\mathcal{L}}\circ (-A))(w') \right\|_\infty &= \max{\left\{ { {\left\langle Av, \nabla{\mathcal{L}}(-Aw) - \nabla{\mathcal{L}}(-Aw') \right \rangle} : \|v\|_1\leq 1 } \right\}} \\ &\leq \frac 1 n \sum_{i=1}^n \left| \ell'(-(Aw)_i) - \ell'(-(Aw')_i) \right| \\ &= \frac 1 n \sum_{i=1}^n \left| \int_0^1 \ell''\bigl(-(Aw')_i + r((Aw')_i - (Aw)_i)\bigr)\,\bigl((Aw')_i - (Aw)_i\bigr)\, dr \right| \\ &\leq \left( \int_0^1 \frac 1 n \sum_{i=1}^n \ell''\bigl(-(Aw')_i + r((Aw')_i - (Aw)_i)\bigr)\, dr \right) \|Aw - Aw'\|_\infty \\ &\leq \left( \int_0^1 \frac 1 n \sum_{i=1}^n \ell\bigl(-(Aw')_i + r((Aw')_i - (Aw)_i)\bigr)\, dr \right) \|w - w'\|_1 \\ &= \left( \int_0^1 {\mathcal{L}}\bigl(-A(rw + (1-r)w')\bigr)\, dr \right) \|w - w'\|_1 \\ &\leq {\mathcal{L}}(-Aw_{i-1}) \|w - w'\|_1. \end{aligned} \]

\({\qquad\square}\).
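As an aside, the proof in fact bounds the gradient difference by \(\max\{{\mathcal{L}}(-Aw),{\mathcal{L}}(-Aw')\}\,\|w-w'\|_1\), which is easy to sanity-check numerically; here is a minimal sketch (the helpers `L` and `grad` are made up for illustration), assuming the exponential loss and \(|A_{ij}|\leq 1\).

```python
import numpy as np

rng = np.random.default_rng(0)

def L(A, w):
    """Empirical exponential loss: L(-Aw) = (1/n) sum_i exp(-(Aw)_i)."""
    return np.mean(np.exp(-(A @ w)))

def grad(A, w):
    """Gradient of w -> L(-Aw), namely -(1/n) A^T exp(-Aw)."""
    return -(A.T @ np.exp(-(A @ w))) / A.shape[0]

# Random instances with |A[i, j]| <= 1, as assumed throughout.
for _ in range(1000):
    A = rng.uniform(-1.0, 1.0, size=(5, 3))
    w, wp = rng.normal(size=3), rng.normal(size=3)
    lhs = np.max(np.abs(grad(A, w) - grad(A, wp)))        # ||grad f(w) - grad f(w')||_inf
    rhs = max(L(A, w), L(A, wp)) * np.sum(np.abs(w - wp))  # max loss times ||w - w'||_1
    assert lhs <= rhs + 1e-9
print("lemma bound held on all random instances")
```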

“Consistency” of convex ERM

The end of the lecture gave a taste of the next one, which closes this segment on convex optimization by showing how convex optimization helps with our original problem, namely classification; two facts to that effect will be shown there.

References

Freund, Yoav, and Robert E. Schapire. 1997. “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting.” Journal of Computer and System Sciences 55 (1): 119–39.

Telgarsky, Matus. 2012. “A Primal-Dual Convergence Analysis of Boosting.” Journal of Machine Learning Research 13 (3): 561–606.