Intro: steepest descent method

The following method is sometimes called steepest descent, although terminology is not standard. We’ll analyze this method for (unconstrained!) minimization of smooth, differentiable functions.

  1. Let \(w_0 \in {\mathbb{R}}^d\) be given.

  2. For \(i = 1,\ldots, t\):

    1. Choose any \(v_i \in {\text{argmax}}{\left\{ { {\left\langle {\nabla f}(w_{i-1}), v \right \rangle} : \|v\|\leq 1 } \right\}}\).

    2. Set \(w_i := w_{i-1} - \eta_i v_i\).

The choice of \(v_i\) is similar to the choice of \(v_i\) in the Frank-Wolfe method (albeit with \(\max\) in place of \(\min\)), where the constraint set is a \(1\)-ball for some norm. However, the second step is different, moving to \(w_{i-1} - \eta_i v_i\) rather than \(w_{i-1} + \eta_i (v_i - w_{i-1})\). In particular, this method is not attempting to maintain a convex constraint.
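Under common norm choices the maximizing direction \(v_i\) has a closed form: the normalized gradient for \(\ell_2\), the coordinate-wise sign vector for \(\ell_\infty\), and a single signed coordinate for \(\ell_1\). Here is a minimal numpy sketch of the iteration, taking the gradient oracle `grad_f` and the step sizes as given (the function and argument names are illustrative):

```python
import numpy as np

def steepest_descent(grad_f, w0, etas, norm="l2"):
    """w_i = w_{i-1} - eta_i * v_i, where v_i maximizes
    <grad f(w_{i-1}), v> over the unit ball of the chosen (primal) norm."""
    w = np.asarray(w0, dtype=float)
    for eta in etas:
        g = grad_f(w)
        if np.linalg.norm(g) == 0:      # already at a critical point
            break
        if norm == "l2":                # v = g / ||g||_2
            v = g / np.linalg.norm(g)
        elif norm == "linf":            # v = sign(g), coordinate-wise
            v = np.sign(g)
        elif norm == "l1":              # v = signed best single coordinate
            v = np.zeros_like(g)
            j = np.argmax(np.abs(g))
            v[j] = np.sign(g[j])
        else:
            raise ValueError(norm)
        w = w - eta * v
    return w
```

Note that with the \(\ell_\infty\) ball the iteration becomes sign gradient descent, while the \(\ell_1\) ball yields a greedy coordinate-style update.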


Dual norms

In Fenchel duality we had a primal problem/space and a dual problem/space; the primal was for parameters, and the dual was for gradients (roughly speaking). It’s thus natural for the dual to also have norms whenever the primal does: given a norm \(\|\cdot\|\), define the dual norm \(\|\cdot\|_*\) as \[ \|v\|_* := \sup{\left\{ { {\left\langle u, v \right \rangle} : u\in {\mathbb{R}}^d, \|u\|\leq 1} \right\}}. \] This definition makes more sense with a bit of functional analysis or topology. For now, note that the definition gives a sort of generalized Hölder inequality for free: given any \(w\neq 0\) and any \(s\), \[ \|s\|_* \geq {\left\langle w/\|w\|, s \right \rangle} = {\left\langle w, s \right \rangle} / \|w\| \qquad\Longrightarrow\qquad {\left\langle w, s \right \rangle} \leq \|w\| \|s\|_*, \] and the case \(w = 0\) is trivial. In homework, we almost proved Hölder’s inequality; the proof is only completed with the “optional” problem, which gives the equality case and the exact form of dual norms for conjugate exponents.
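As a quick numerical sanity check (not a proof), one can verify this inequality on random vectors for the standard conjugate pairs \(\ell_1/\ell_\infty\) and \(\ell_2/\ell_2\); a numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(10)
s = rng.standard_normal(10)

# conjugate pairs (p, q): dual of l_1 is l_inf, l_2 is self-dual
for p, q in [(1, np.inf), (2, 2), (np.inf, 1)]:
    lhs = np.dot(w, s)
    rhs = np.linalg.norm(w, p) * np.linalg.norm(s, q)
    assert lhs <= rhs + 1e-12   # <w, s> <= ||w||_p ||s||_q
```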

Rule of thumb / sanity check. When doing derivations, parameters and other elements of the primal space should be measured with \(\|\cdot\|\), whereas gradients and other dual elements should be measured with \(\|\cdot\|_*\).

Remark. From the definition of dual norms, we immediately get a property about steepest descent. The choice of \(v_i\) is the maximizing element in the definition of the dual norm (where we always have \(\max\) rather than just \(\sup\) since we have only finitely many dimensions), thus \[ {\left\langle {\nabla f}(w_{i-1}), v_i \right \rangle} = \|{\nabla f}(w_{i-1})\|_*. \]
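This attainment is easy to check numerically for the \(\ell_\infty\) and \(\ell_1\) balls, whose maximizers have simple closed forms (the example vector is arbitrary):

```python
import numpy as np

g = np.array([3.0, -1.0, 2.0])   # stand-in for a gradient

# primal norm l_inf: the maximizer over {||v||_inf <= 1} is v = sign(g),
# and <g, v> = ||g||_1 (the dual norm of l_inf).
v_inf = np.sign(g)
assert np.isclose(np.dot(g, v_inf), np.linalg.norm(g, 1))

# primal norm l_1: the maximizer is a signed coordinate vector at the
# largest |g_j|, and <g, v> = ||g||_inf (the dual norm of l_1).
v_1 = np.zeros_like(g)
j = np.argmax(np.abs(g))
v_1[j] = np.sign(g[j])
assert np.isclose(np.dot(g, v_1), np.linalg.norm(g, np.inf))
```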

A key property is how norms and conjugates interact.

Proposition. If \(f = \|\cdot\|\) (that is, \(f(w) = \|w\|\)), then \[ f^*(s) = \begin{cases} 0 &\text{when $\|s\|_* \leq 1$}, \\ \infty &\text{otherwise}. \end{cases} \] (Since \(f\) is closed and convex, then \(f^{**} = f\).)

Remark. In the homework, we saw that \((\|\cdot\|_p^p/p)^* = \|\cdot\|_q^q/q\) for conjugate exponents \(p,q\in [1,\infty]\). The preceding proposition tells us what happens if we don’t mess with the powers. [ note to future matus. in class I discussed the example of lasso; can compare this to ridge regression. I should have presented OLS, lasso, and ridge clearly in an early lecture. Oh well. ]
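That homework fact can be spot-checked numerically in one dimension by grid search over the supremum defining the conjugate; here with the arbitrary illustrative choice \(p = 3\), \(q = 3/2\) (a sketch, not a proof):

```python
import numpy as np

# Check (|.|^p / p)* = |.|^q / q in one dimension for conjugate
# exponents, i.e. 1/p + 1/q = 1; here p = 3, q = 3/2.
p, q = 3.0, 1.5
s = 2.0

w = np.linspace(-5, 5, 200001)                # fine grid over the domain
conj = np.max(s * w - np.abs(w) ** p / p)     # sup_w <s, w> - |w|^p / p
assert abs(conj - abs(s) ** q / q) < 1e-6
```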

Proof. Let \(s\in{\mathbb{R}}^d\) be given, and consider two cases.

If \(\|s\|_* \leq 1\), then for every \(w\), the generalized Hölder inequality gives \({\left\langle s, w \right \rangle} \leq \|w\|\|s\|_* \leq \|w\|\), whereby \({\left\langle s, w \right \rangle} - \|w\| \leq 0\), with equality at \(w = 0\); thus \(f^*(s) = \sup_w ({\left\langle s, w \right \rangle} - \|w\|) = 0\).

If \(\|s\|_* > 1\), then by the definition of the dual norm there exists \(u\) with \(\|u\|\leq 1\) and \({\left\langle s, u \right \rangle} > 1\). Taking \(w = tu\) with \(t\to\infty\), \[ {\left\langle s, tu \right \rangle} - \|tu\| \geq t\,\bigl({\left\langle s, u \right \rangle} - 1\bigr) \to \infty, \] so \(f^*(s) = \infty\).


Now that we have a notion of norms and duals, we can correspondingly define smoothness with sensitivity to both norms. From there we will be able to analyze steepest descent.

A differentiable function \(f : {\mathbb{R}}^d \to {\mathbb{R}}\) is \(\beta\)-smooth with respect to \(\|\cdot\|\) (or simply \((\|\cdot\|, \beta)\)-smooth) when, for all \(w, w' \in {\mathbb{R}}^d\), \[ \|{\nabla f}(w) - {\nabla f}(w')\|_* \leq \beta \|w-w'\|. \] Following the same derivation as in the last set of notes, this implies \[ f(w') \leq f(w) + {\left\langle {\nabla f}(w), w'-w \right \rangle} + \frac {\beta}{2} \|w'-w\|^2 \] (which was the original definition of smoothness we gave).
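For completeness, here is a sketch of that derivation: apply the fundamental theorem of calculus along the segment from \(w\) to \(w'\), then the generalized Hölder inequality, then the Lipschitz property of \({\nabla f}\): \[ \begin{aligned} f(w') - f(w) - {\left\langle {\nabla f}(w), w'-w \right \rangle} &= \int_0^1 {\left\langle {\nabla f}(w + t(w'-w)) - {\nabla f}(w), w'-w \right \rangle} \,{\text{d}}t \\ &\leq \int_0^1 \|{\nabla f}(w + t(w'-w)) - {\nabla f}(w)\|_* \,\|w'-w\| \,{\text{d}}t \\ &\leq \int_0^1 \beta t \|w'-w\|^2 \,{\text{d}}t = \frac{\beta}{2}\|w'-w\|^2. \end{aligned} \]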

Remark. Recall the sanity check above: as expected, the definition of smoothness has dual norms on gradients, primal norms on parameters.

Examples. Let’s see how smoothness arises in two common machine learning problems. Whenever faced with a prediction problem (and a loss has not yet been specified), it’s good to check how these two examples work out.

Smoothness and gradient descent

We’ll finish the lecture by putting these pieces together to show an easy convergence result for steepest descent.

Let’s see how far we can get just with smoothness. Let’s use the step size \(\eta_{i+1} := \|{\nabla f}(w_i)\|_*/\beta\) (which can be found by minimizing over \(\eta_{i+1}\) in the following bound); by smoothness, \[ \begin{aligned} f(w_{i+1}) &\leq f(w_i) + {\left\langle {\nabla f}(w_i), -\eta_{i+1}v_{i+1} \right \rangle} + \frac \beta 2 \|\eta_{i+1} v_{i+1}\|^2 \\ &\leq f(w_i) - \eta_{i+1} \|{\nabla f}(w_i)\|_* + \frac {\beta \eta_{i+1}^2}{2} \\ &= f(w_i) - \frac {\|{\nabla f}(w_i)\|_*^2}{2\beta}, \end{aligned} \] where the second step used \({\left\langle {\nabla f}(w_i), v_{i+1} \right \rangle} = \|{\nabla f}(w_i)\|_*\) and \(\|v_{i+1}\|\leq 1\). This is good news: “intuitively”, since first-order conditions say \({\nabla f}(w) = 0\) iff \(w\) is optimal (for convex \(f\)), we can hope for \(\|{\nabla f}(w)\|_*\) to be large when we are far from optimal, and the above expression says that the error decreases quickly in that case.
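This per-iteration decrease can be checked numerically; a sketch for a least-squares objective under the \(\ell_2\) norm, where \(\beta\) is the largest eigenvalue of \(A^\top A\) (the problem data below is random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

f = lambda w: 0.5 * np.sum((A @ w - b) ** 2)
grad = lambda w: A.T @ (A @ w - b)
beta = np.linalg.eigvalsh(A.T @ A).max()   # smoothness constant w.r.t. l_2

w = np.zeros(5)
for _ in range(50):
    g = grad(w)
    w_next = w - g / beta          # eta = ||g||_2/beta, v = g/||g||_2
    # per-iteration guarantee: f drops by at least ||g||^2 / (2 beta)
    assert f(w_next) <= f(w) - np.dot(g, g) / (2 * beta) + 1e-9
    w = w_next
```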

We will use this expression to produce convergence rates in the next lecture, but for now we get the following.

Proposition. If \(\inf_w f(w) > -\infty\), then \({\nabla f}(w_i) \to 0\).

Remark. No convexity assumption is made!

Proof. Rearranging the earlier bound gives \[ \|{\nabla f}(w_i)\|_*^2 \leq 2\beta( f(w_i) - f(w_{i+1}) ). \] Applying \(\sum_{i=0}^t\) gives \[ \sum_{i=0}^t \|{\nabla f}(w_i)\|_*^2 \leq 2\beta \sum_{i=0}^t (f(w_i) - f(w_{i+1})) = 2\beta( f(w_0) - f(w_{t+1}) ) \leq 2 \beta( f(w_0) - \inf_w f(w) ), \] and applying \(\lim_{t\to\infty}\) to both sides gives \[ \sum_{i=0}^\infty \|{\nabla f}(w_i)\|_*^2 \leq 2\beta (f(w_0) - \inf_w f(w)) < \infty. \] The result follows since the terms of a convergent series must vanish.
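The telescoped bound is easy to verify numerically; a sketch on the simple quadratic \(f(w) = \|w - c\|^2/2\), which is \((\|\cdot\|_2, 1)\)-smooth with \(\inf_w f(w) = 0\) (the vector \(c\) is arbitrary):

```python
import numpy as np

c = np.array([1.0, -2.0, 3.0])
f = lambda w: 0.5 * np.sum((w - c) ** 2)
grad = lambda w: w - c
beta = 1.0

w = np.zeros(3)
total = 0.0
for _ in range(100):
    g = grad(w)
    total += np.dot(g, g)          # accumulate ||grad f(w_i)||_2^2
    w = w - g / beta               # step with eta = ||g||_2/beta

# telescoped bound: sum_i ||grad f(w_i)||^2 <= 2*beta*(f(w_0) - inf f)
assert total <= 2 * beta * f(np.zeros(3)) + 1e-9
```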