Projected gradient descent

To read this you should be familiar with the notions of subgradient and first-order optimality condition (FOC), both presented in Basics of Convex Analysis.

1. Introduction and FOC

We consider here the general minimization problem over a convex set \(C\) written in its unconstrained form:

\[ \begin{eqnarray} \min_{x}\quad f(x)+i_{C}(x), \end{eqnarray} \]

where \(f\) is continuously differentiable on \(C\) and \(i_{C}\) is the indicator function of \(C\) (equal to \(0\) on \(C\) and \(+\infty\) outside). Following the FOC, \(x^{\sharp}\) is a minimizer if \(0\in \nabla f(x^{\sharp})+\partial i_{C}(x^{\sharp})\). To obtain the subdifferential of \(i_{C}\) at a point \(x\in C\), note that if \(y\in\partial i_{C}(x)\) then, by definition,

\[ \begin{eqnarray} i_{C}(z) &\ge & i_{C}(x) + \langle z-x,y \rangle, \quad \forall z.\nonumber \end{eqnarray} \]

Now, for any \(z\notin C\), the above inequality trivially holds since its left-hand side is \(+\infty\). We can thus restrict to \(z\in C\); since \(x\in C\) as well, both indicator terms vanish and we are left with

\[ \begin{eqnarray} 0 &\ge & \langle z-x,y \rangle, \quad \forall z\in C.\nonumber \end{eqnarray} \]

This inequality characterizes the subdifferential of interest; it also happens to be the definition of the normal cone to the convex set \(C\) at \(x\), denoted \(N_{C}(x)\). Summarizing, we have:

\[ \begin{eqnarray} \partial i_{C}(x) &=& \{y \,|\, \langle z-x,y\rangle \le 0, \,\forall z\in C\} \,\,=:\,\, N_{C}(x). \end{eqnarray} \]
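As a quick illustration (a standard example, not part of the derivation above), take \(C=\mathbb R^{n}_{+}\), the nonnegative orthant, and \(x\in C\). Testing the defining inequality against \(z=x+e_{i}\) (giving \(y_{i}\le 0\)) and against \(z=0\) (giving \(\langle x,y\rangle\ge 0\), hence \(=0\)) yields

\[ \begin{eqnarray} N_{\mathbb R^{n}_{+}}(x) &=& \{y \,|\, y\le 0,\,\, \langle x,y\rangle = 0\},\nonumber \end{eqnarray} \]

so the normal cone is \(\{0\}\) at any interior point \(x>0\) and only grows when \(x\) touches the boundary.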

We can thus re-write the FOC for the problem as follows:

\[ \begin{eqnarray} x^{\sharp} \,\in\arg\min_{x\in C} \, f(x) &\Longleftrightarrow & 0 \,\in\, \nabla f(x^{\sharp})+N_{C}(x^{\sharp}) \label{FOC basic} \end{eqnarray} \]

Before moving on, observe that \(N_{C}(x)\) is, as the name indicates, a cone: it is stable under multiplication by a nonnegative scalar (immediate from the definition since \(\langle z-x,\alpha u\rangle = \alpha\langle z-x,u\rangle\)), a property we will use shortly:

\[ \begin{eqnarray} \text{if}\,\, u \,\in\, N_{C}(x) \,\,\text{then}\,\, \forall\alpha\ge0,\,\, \alpha u \,\in\, N_{C}(x) \label{normal cone multiplication}. \end{eqnarray} \]

2. Normal cone and Euclidean projection

The Euclidean projection onto a convex set \(C\), which we shall denote \(\pi_{C}\), is defined for any \(z\in X\) by

\[ \begin{eqnarray} \pi_{C}(z) &=& \arg \min_{x\in C}\quad{1\over 2} \|x-z\|^{2}_{2}. \end{eqnarray} \]

Note that in the unconstrained case, \(C=X\) and hence \(\pi_{C}\equiv \mathbb I\), the identity operator. Note also that the factor \(1/2\) is only there to make the gradient cleaner.
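Two classical sets where \(\pi_{C}\) has a simple closed form are boxes and Euclidean balls. Here is a minimal Python sketch (the function names are mine, chosen for illustration):

```python
import numpy as np

def project_box(z, lo, hi):
    # The projection onto the box {x : lo <= x <= hi} is a componentwise clip.
    return np.clip(z, lo, hi)

def project_ball(z, radius=1.0):
    # The projection onto the l2-ball of given radius (centered at 0) rescales
    # z whenever it falls outside the ball, and leaves it untouched otherwise.
    norm = np.linalg.norm(z)
    return z if norm <= radius else (radius / norm) * z
```

Both formulas can be checked directly against the definition of \(\pi_{C}\): in each case the returned point is the closest point of \(C\) to \(z\).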

The gradient of the objective \(x\mapsto \tfrac{1}{2}\|x-z\|_{2}^{2}\) is trivial to compute: it is \((x-z)\). Writing \(x^{\sharp}\) for the minimizer (so that \(x^{\sharp}=\pi_{C}(z)\)), the FOC \(\eqref{FOC basic}\) for this particular problem simply reduces to

\[ \begin{eqnarray} 0 \,\in\, (x^{\sharp}-z) + N_{C}(x^{\sharp}) \quad\!\!\text{or equivalently}\quad\!\! z \in x^{\sharp}+N_{C}(x^{\sharp})\label{reduced FOC}. \end{eqnarray} \]

In operator notation, the right-hand side can be written \(z\in (\mathbb I+N_{C})(x^{\sharp})\). Replacing \(x^{\sharp}\) by \(\pi_{C}(z)\) and re-arranging terms then gives

\[ \begin{eqnarray} \pi_{C}(z) &=& (\mathbb I + N_{C})^{-1}(z).\nonumber \end{eqnarray} \]

We have thus shown the classical relationship:

\[ \begin{eqnarray} \pi_{C} &\equiv& (\mathbb I+N_{C})^{-1}. \end{eqnarray} \]
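A one-dimensional example makes this identity tangible: take \(C=[0,+\infty)\), so that \(\pi_{C}(z)=\max(z,0)\), \(N_{C}(x)=\{0\}\) for \(x>0\) and \(N_{C}(0)=(-\infty,0]\). Then

\[ \begin{eqnarray} (\mathbb I+N_{C})(x) &=& \begin{cases} \{x\} & \text{if } x>0,\\ (-\infty,0] & \text{if } x=0, \end{cases}\nonumber \end{eqnarray} \]

and inverting this relation sends any \(z>0\) back to \(z\) and any \(z\le 0\) back to \(0\), which is exactly \(\max(z,0)\).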

Now take \(x\in C\), \(u\in N_{C}(x)\) and \(\alpha\ge 0\). By \(\eqref{normal cone multiplication}\), \(\alpha u\in N_{C}(x)\), which we can rewrite as

\[ \begin{eqnarray} (x+\alpha u) - x &\in & N_{C}(x)\nonumber \end{eqnarray} \]

Letting \(z:=x+\alpha u\), this reads \(z\in x+N_{C}(x)\), which is precisely the form in \(\eqref{reduced FOC}\) and is therefore equivalent to \(x=\pi_{C}(z)\). Thus,

\[ \begin{eqnarray} x &=& \pi_{C}(x + \alpha u), \quad (\alpha \ge 0, u\in N_{C}(x)).\nonumber \end{eqnarray} \]

The last thing to observe is that the FOC \(\eqref{FOC basic}\) says precisely that \(-\nabla f(x^{\sharp})\in N_{C}(x^{\sharp})\). Taking \(u=-\nabla f(x^{\sharp})\) in the identity above therefore yields a useful fixed-point form for the minimizer.

For any \(\alpha \ge 0\), a minimizer \(x^{\sharp}\) of the problem verifies the following fixed-point equation:

\[ \begin{eqnarray} x^{\sharp} &=& \pi_{C}(x^{\sharp}-\alpha \nabla f(x^{\sharp})). \end{eqnarray} \]
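As a numerical sanity check, here is a small Python sketch (the setup \(f(x)=\frac{1}{2}\|x-b\|_{2}^{2}\) over the unit ball is my own choice for illustration; when \(b\) lies outside the ball, the minimizer is \(b/\|b\|_{2}\)):

```python
import numpy as np

b = np.array([3.0, 4.0])          # ||b|| = 5, so b lies outside the unit ball
x_star = b / np.linalg.norm(b)    # minimizer of 1/2 ||x - b||^2 over the unit ball

def grad_f(x):
    # Gradient of f(x) = 1/2 ||x - b||^2.
    return x - b

def project_ball(z, radius=1.0):
    norm = np.linalg.norm(z)
    return z if norm <= radius else (radius / norm) * z

# The fixed-point equation should hold for every alpha >= 0.
for alpha in (0.0, 0.1, 1.0, 10.0):
    assert np.allclose(x_star, project_ball(x_star - alpha * grad_f(x_star)))
```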

3. The algorithm

So, as we had said in the introduction of this blog, a big part of convex optimization comes down to working out clever fixed-point equations satisfied by the minimizers and then applying a fixed-point iteration. In this case, it leads to the following algorithm.

The projected gradient descent (PGD) algorithm iteratively generates the sequence \(\{x_k\}\) via

\[ \begin{eqnarray} x_{k+1} &=& \pi_{C}(x_k-\alpha_{k} \nabla f(x_{k})) \end{eqnarray} \]

where \(\alpha_{k}>0\) is the step-size (ignoring the trivial case \(\alpha_k=0\)).
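In code, the iteration takes only a few lines. Below is a minimal Python sketch assuming a constant step-size \(\alpha_k=\alpha\) (a simple, not necessarily optimal, choice; the function names are mine):

```python
import numpy as np

def projected_gradient_descent(grad_f, project, x0, step=0.1,
                               max_iter=1000, tol=1e-8):
    # Iterate x_{k+1} = pi_C(x_k - step * grad_f(x_k)); stop once the update
    # is negligible, i.e. once we are (numerically) at a fixed point.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_next = project(x - step * grad_f(x))
        if np.linalg.norm(x_next - x) <= tol:
            return x_next
        x = x_next
    return x
```

With `grad_f` and `project_ball` as in the previous snippet, `projected_gradient_descent(grad_f, project_ball, np.zeros(2))` converges to \(b/\|b\|_{2}\).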

Roughly speaking, assuming the \(\alpha_k\) are picked sensibly and under some regularity conditions on the problem, the method enjoys a rate \((f(x_k)-f(x^\sharp))=\mathcal O(k^{-1/2})\). Convergence governed by the square root of the number of iterations is a standard result for gradient-descent-type methods.

4. Steepest descent

Beyond the fixed-point argument, it is well known that stepping against the gradient is a good idea, since \(-\nabla f(x)\) is the local direction of steepest decrease of the function. Indeed, consider the unconstrained case and a small step from \(x\) to \(x-\delta\). A Taylor expansion of \(f\) around \(x\) gives

\[ \begin{eqnarray} f(x)-f(x-\delta ) &=& \langle \delta,\nabla f(x)\rangle + o(\|\delta\|) \end{eqnarray} \]

and, for \(\delta\) of fixed, sufficiently small norm, this difference is maximized when \(\delta\) is aligned with \(\nabla f(x)\), i.e., when the step from \(x\) to \(x-\delta\) goes along the negative gradient.
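To spell out that last claim: fix the step length \(\|\delta\|=\varepsilon\). By the Cauchy-Schwarz inequality,

\[ \begin{eqnarray} \langle \delta,\nabla f(x)\rangle &\le & \|\delta\|\,\|\nabla f(x)\| \,\,=\,\, \varepsilon\,\|\nabla f(x)\|,\nonumber \end{eqnarray} \]

with equality if and only if \(\delta = \varepsilon\,\nabla f(x)/\|\nabla f(x)\|\) (assuming \(\nabla f(x)\neq 0\)); the first-order decrease is therefore largest when the step follows the gradient direction.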