# Projected gradient descent

To read this, you should be familiar with the notions of subgradient and first-order optimality condition (both presented in Basics of Convex Analysis).

## 1. Introduction and FOC

We consider here the general minimization problem over a convex set $$C$$ written in its unconstrained form:

$\begin{eqnarray} \min_{x}\quad f(x)+i_{C}(x). \end{eqnarray}$

where $$f$$ is convex and continuously differentiable on $$C$$. By the FOC, $$x^{\sharp}$$ is a minimizer if and only if $$0\in \nabla f(x^{\sharp})+\partial i_{C}(x^{\sharp})$$. To obtain the subdifferential of $$i_{C}$$ at a point $$x\in C$$, note that if $$y\in\partial i_{C}(x)$$ then, by definition,

$\begin{eqnarray} i_{C}(z) &\ge & i_{C}(x) + \langle z-x,y \rangle, \quad \forall z.\nonumber \end{eqnarray}$

Now, for any $$z\notin C$$, the above inequality is trivially verified since $$i_{C}(z)=+\infty$$. We can thus restrict to $$z\in C$$; since $$x\in C$$, both indicator terms vanish and we are left with

$\begin{eqnarray} 0 &\ge & \langle z-x,y \rangle, \quad \forall z\in C.\nonumber \end{eqnarray}$

This inequality defines the subdifferential of interest but also happens to be the definition of the normal cone to the convex set $$C$$ denoted $$N_{C}(x)$$. Summarizing, we have:

$\begin{eqnarray} \partial i_{C}(x) &=& \{y \,|\, \langle z-x,y\rangle \le 0, \,\forall z\in C\} \,\,=:\,\, N_{C}(x). \end{eqnarray}$
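As a quick numerical illustration of the defining inequality (a minimal sketch in NumPy; the set, point and candidate vector below are chosen purely for the example), consider the unit box and one of its corners:

```python
import numpy as np

rng = np.random.default_rng(1)

# C = [0,1]^2 (the unit box). At the corner x = (1,1) the normal cone
# is the nonnegative orthant, so any y >= 0 should satisfy the
# defining inequality <z - x, y> <= 0 for all z in C.
x = np.array([1.0, 1.0])
y = np.array([2.0, 0.5])  # a nonnegative vector, hence in N_C(x)

for _ in range(1000):
    z = rng.uniform(0.0, 1.0, size=2)  # a random point of C
    assert np.dot(z - x, y) <= 0.0
```

Geometrically, every direction from the corner $$(1,1)$$ into the box has nonpositive components, so its inner product with any nonnegative $$y$$ is nonpositive.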

We can thus re-write the FOC for the problem as follows:

$\begin{eqnarray} x^{\sharp} \,\in\arg\min_{x\in C} \, f(x) &\Longleftrightarrow & 0 \,\in\, \nabla f(x^{\sharp})+N_{C}(x^{\sharp}) \label{FOC basic} \end{eqnarray}$

Note also that the normal cone is indeed a cone: it is stable under multiplication by nonnegative scalars, i.e.,

$\begin{eqnarray} \text{if}\,\, u \,\in\, N_{C}(x) \,\,\text{then}\,\, \forall\alpha\ge0,\,\, \alpha u \,\in\, N_{C}(x) \label{normal cone multiplication}. \end{eqnarray}$

This simple property will be useful below.

## 2. Normal cone and Euclidean projection

The Euclidean projection onto a convex set $$C$$, which we shall denote $$\pi_{C}$$, is defined as follows for any $$z\in X$$:

$\begin{eqnarray} \pi_{C}(z) &=& \arg \min_{x\in C}\quad{1\over 2} \|x-z\|^{2}_{2}. \end{eqnarray}$

Note that in the unconstrained case $$C=X$$ and hence $$\pi_{C}\equiv \mathbb I$$, the identity operator. Note also that the factor $$1/2$$ is only there for aesthetics when computing the gradient.
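For many simple sets, the projection is available in closed form. As an illustrative sketch (the helper names `project_ball` and `project_box` are mine, not a standard API):

```python
import numpy as np

def project_ball(z, radius=1.0):
    """Projection onto the Euclidean ball {x : ||x||_2 <= radius}:
    points outside are scaled back to the boundary."""
    n = np.linalg.norm(z)
    return z if n <= radius else (radius / n) * z

def project_box(z, lo, hi):
    """Projection onto the box {x : lo <= x <= hi}:
    coordinate-wise clipping."""
    return np.clip(z, lo, hi)

z = np.array([3.0, 4.0])
p_ball = project_ball(z)          # scaled onto the unit sphere: (0.6, 0.8)
p_box = project_box(z, 0.0, 2.0)  # clipped to the box: (2.0, 2.0)
```

Cheap projections like these are exactly what makes the projected gradient method of Section 3 practical.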

The gradient of the objective function is trivial to compute ($$\nabla_{x}\,{1\over2}\|x-z\|^{2}_{2}=(x-z)$$); writing $$x^{\sharp}$$ for the minimizer (so that $$x^{\sharp}=\pi_{C}(z)$$), the FOC $$\eqref{FOC basic}$$ for this particular problem simply reduces to

$\begin{eqnarray} 0 \,\in\, (x^{\sharp}-z) + N_{C}(x^{\sharp}) \quad\!\!\text{or equivalently}\quad\!\! z \in x^{\sharp}+N_{C}(x^{\sharp})\label{reduced FOC}. \end{eqnarray}$

Using operator notation, the second inclusion can be written $$z\in (\mathbb I+N_{C})(x^{\sharp})$$ where $$\mathbb I$$ denotes the identity operator. Replacing $$x^{\sharp}$$ by $$\pi_{C}(z)$$ and re-arranging terms then gives

$\begin{eqnarray} \pi_{C}(z) &=& (\mathbb I + N_{C})^{-1}(z).\nonumber \end{eqnarray}$

We have thus shown the classical relationship:

$\begin{eqnarray} \pi_{C} &\equiv& (\mathbb I+N_{C})^{-1}. \end{eqnarray}$

Now take any $$x\in C$$, $$u\in N_{C}(x)$$ and $$\alpha\ge 0$$. By the cone property $$\eqref{normal cone multiplication}$$, $$\alpha u\in N_{C}(x)$$ and hence

$\begin{eqnarray} (x+\alpha u) - x &\in & N_{C}(x)\nonumber \end{eqnarray}$

Letting $$z:=(x+\alpha u)$$, we have $$z\in x+N_{C}(x)$$, which is exactly the form in $$\eqref{reduced FOC}$$ and is therefore equivalent to $$x=\pi_{C}(z)$$, so that

$\begin{eqnarray} x &=& \pi_{C}(x + \alpha u), \quad (\alpha \ge 0, u\in N_{C}(x)).\nonumber \end{eqnarray}$

The last thing to observe is that the FOC $$\eqref{FOC basic}$$ indicates that $$-\nabla f(x^{\sharp})\in N_{C}(x^{\sharp})$$. Taking $$u=-\nabla f(x^{\sharp})$$ in the identity above, we obtain a useful fixed-point form for the minimizer.

For any $$\alpha \ge 0$$, a minimizer $$x^{\sharp}$$ of the problem verifies the following fixed-point equation:

$\begin{eqnarray} x^{\sharp} &=& \pi_{C}(x^{\sharp}-\alpha \nabla f(x^{\sharp})). \end{eqnarray}$
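As a sanity check, we can verify this fixed-point equation numerically on a problem whose constrained minimizer is known in closed form (a minimal sketch; `project_ball` is a hypothetical helper implementing the ball projection):

```python
import numpy as np

def project_ball(z, radius=1.0):
    n = np.linalg.norm(z)
    return z if n <= radius else (radius / n) * z

# Minimize f(x) = 1/2 ||x - c||^2 over the unit ball C, with c outside C;
# the constrained minimizer is then x* = pi_C(c) itself.
c = np.array([3.0, 4.0])
x_star = project_ball(c)   # (0.6, 0.8)
grad = x_star - c          # nabla f(x*)

# The fixed-point equation x* = pi_C(x* - alpha * nabla f(x*))
# holds for every step size alpha >= 0.
for alpha in [0.0, 0.1, 1.0, 10.0]:
    assert np.allclose(x_star, project_ball(x_star - alpha * grad))
```

Note that the equation holds for *any* $$\alpha\ge0$$: here $$-\nabla f(x^{\sharp})$$ points radially outward, so the gradient step only moves $$x^{\sharp}$$ along the ray that projects back onto it.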

## 3. The algorithm

As mentioned in the introduction of this blog, a large part of convex optimization consists in working out clever fixed-point equations verified by the minimizers and then applying a fixed-point iteration. In this case, it leads to the following algorithm.

The projected gradient descent (PGD) algorithm iteratively generates the sequence $$\{x_k\}$$ via

$\begin{eqnarray} x_{k+1} &=& \pi_{C}(x_k-\alpha_{k} \nabla f(x_{k})) \end{eqnarray}$

where $$\alpha_{k}>0$$ is the step-size (ignoring the trivial case $$\alpha_k=0$$).
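The iteration above can be sketched in a few lines (a minimal illustration with a constant step size, run on the toy ball-constrained problem; the function and helper names are mine):

```python
import numpy as np

def projected_gradient_descent(grad_f, project, x0, alpha=0.1, n_iters=200):
    """Fixed-point iteration x_{k+1} = pi_C(x_k - alpha * grad f(x_k)),
    here with a constant step size for simplicity."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = project(x - alpha * grad_f(x))
    return x

# Toy problem: minimize f(x) = 1/2 ||x - c||^2 over the unit ball,
# whose solution is the projection of c onto the ball.
c = np.array([3.0, 4.0])
grad_f = lambda x: x - c

def project_ball(z):
    n = np.linalg.norm(z)
    return z if n <= 1.0 else z / n

x = projected_gradient_descent(grad_f, project_ball, x0=np.zeros(2))
# x is (approximately) (0.6, 0.8), i.e. c scaled onto the unit sphere
```

In practice the step sizes $$\alpha_k$$ are chosen by a line search or from known constants of the problem (e.g. a Lipschitz constant of $$\nabla f$$), rather than fixed arbitrarily as above.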

Roughly speaking, assuming the $$\alpha_k$$ are picked sensibly and under some regularity conditions on the problem, the method enjoys a rate $$(f(x_k)-f(x^\sharp))=\mathcal O(k^{-1/2})$$. This convergence in the square root of the number of iterations is a standard result for gradient-descent-type methods.

## 4. Steepest descent

Beyond the fixed-point equation, it is well known that following the direction opposite to the gradient is a good idea since the gradient is the local direction of steepest increase of the function. Indeed, let us consider the unconstrained case and a small step from $$x$$ to $$x-\delta$$. A Taylor expansion of $$f$$ around $$x$$ gives

$\begin{eqnarray} f(x)-f(x-\delta ) &=& \langle \delta,\nabla f(x)\rangle + o(\|\delta\|) \end{eqnarray}$

and, for a fixed and sufficiently small step length $$\|\delta\|$$, this difference is maximized when $$\delta$$ is aligned with $$\nabla f(x)$$ (by the Cauchy–Schwarz inequality).
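This steepest-descent property is easy to check numerically (a sketch on an arbitrary smooth test function of my choosing; the small tolerance absorbs the $$o(\|\delta\|)$$ term):

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary smooth test function and its gradient (for illustration only).
f = lambda x: x[0]**2 + x[1]**2 + np.sin(x[0])
grad_f = lambda x: np.array([2 * x[0] + np.cos(x[0]), 2 * x[1]])

x = np.array([1.0, -0.5])
g = grad_f(x)
eps = 1e-4  # step length, small enough for the first-order term to dominate

# Decrease f(x) - f(x - eps*d) along the normalized gradient direction...
decrease_grad = f(x) - f(x - eps * g / np.linalg.norm(g))

# ...is at least the decrease along any other unit direction (up to o(eps)).
for _ in range(100):
    d = rng.standard_normal(2)
    d /= np.linalg.norm(d)
    assert f(x) - f(x - eps * d) <= decrease_grad + 1e-7
```

Among all unit directions $$d$$, the first-order decrease $$\varepsilon\langle d,\nabla f(x)\rangle$$ is largest for $$d=\nabla f(x)/\|\nabla f(x)\|$$, which is what the loop confirms.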