The Mirror Descent Algorithm

To read this you should be familiar with the notes on the projected gradient descent.

Ideally you will also be familiar with the material from the ‘‘more convex analysis’’ page; it is not necessary, but you will then have to take a few facts for granted.

1   From Euclidean to arbitrary projections

In the notes on the projected gradient descent (PGD) we showed that one way to solve the constrained minimization problem \(\min_{x\in C} f(x)\) is to follow the iteration

\[ \begin{eqnarray} x_{k+1} &=& \pi_{C}(x_{k}-\alpha_{k}\nabla f(x_{k})), \end{eqnarray} \]

where \(\pi_{C}\) denotes the Euclidean projection onto \(C\). Using the definition of the projection, this can be rewritten as

\[ \begin{eqnarray} x_{k+1} &=& \arg\min_{x\in C}\quad \|x_{k}-\alpha_{k}\nabla f(x_{k}) - x\|^{2}_{2}\nonumber\\ &=& \arg\min_{x\in C}\quad \| (x-x_{k})+\alpha_{k}\nabla f(x_{k})\|^{2}_{2}. \end{eqnarray} \]

Now, since the \(\ell^{2}\)-norm is induced by the inner product, we can use the decomposition

\[ \begin{eqnarray} \| (x-x_{k}) +\alpha_{k}\nabla f(x_{k}) \|_{2}^{2} &=& \| x-x_{k}\|^{2}_{2} + 2\alpha_{k}\langle x,\nabla f(x_{k})\rangle + M_{k},\nonumber \end{eqnarray} \]

where \(M_{k}\) collects the terms that do not depend on \(x\) (and can thus be ignored in the minimization). Dropping \(M_{k}\), dividing by \(2\alpha_{k}\), and rearranging, we are left with the equivalent iteration:

\[ \begin{eqnarray} x_{k+1} &=& \arg\min_{x\in C} \quad \left\{ \langle x,\nabla f(x_{k})\rangle + {1\over \alpha_{k}}{\|x-x_{k}\|^{2}_{2}\over 2} \right\}. \end{eqnarray} \]
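This equivalence is easy to check numerically. Below is a minimal sketch (assuming numpy and scipy; the quadratic \(f\), the unit-ball constraint and the chosen point are illustrative assumptions, not part of these notes): both ways of computing \(x_{k+1}\) coincide.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative setup (assumption): f(x) = 0.5*||x - b||^2, C = unit l2-ball.
b = np.array([2.0, 1.0])
grad_f = lambda x: x - b

def proj_ball(y):
    """Euclidean projection onto C = {x : ||x||_2 <= 1}."""
    n = np.linalg.norm(y)
    return y if n <= 1.0 else y / n

x_k, alpha_k = np.array([0.5, -0.5]), 0.1
g = grad_f(x_k)

# (i) the projected step: x_{k+1} = pi_C(x_k - alpha_k * grad f(x_k))
x_proj = proj_ball(x_k - alpha_k * g)

# (ii) the equivalent argmin formulation derived above
obj = lambda x: x @ g + np.sum((x - x_k) ** 2) / (2.0 * alpha_k)
con = {"type": "ineq", "fun": lambda x: 1.0 - np.linalg.norm(x)}
x_min = minimize(obj, x0=x_k, constraints=[con]).x

print(np.allclose(x_proj, x_min, atol=1e-5))  # True: the two steps coincide
```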

 

Illustration of the PGD: the local direction of steepest descent at \(x_{k}\) (the negative of the gradient) is illustrated by the blue arrows, and the dashed circles represent the level sets of \(x\mapsto\|x-x_{k}\|_{2}\). There is a tradeoff between following the direction of steepest descent and moving further away from the current point \(x_{k}\) towards the boundary of the set \(C\); this tradeoff is balanced by the step size \(\alpha_{k}\).

Given this equivalent formulation of the PGD and the illustration above, one could then suggest keeping the exact same structure but replacing the \(\ell^{2}\)-distance by another distance-like function.

The generalized projected gradient descent (GPGD) is defined for any positive-definite function \(d:X\times X\to \mathbb R^+\) by the following iteration:

\[ \begin{eqnarray} x_{k+1} &=& \arg\min_{x\in C} \quad \left\{ \langle x,\nabla f(x_{k})\rangle + {1\over \alpha_{k}}d(x,x_{k}) \right\}\quad \text{with}\quad \alpha_{k}>0. \end{eqnarray} \]

We shall call positive-definite functions divergences. Recall that such a function satisfies \(d(u,v)>0\) for all \(u\neq v\) and \(d(u,u)=0\).
In ‘‘descent algorithms (1)’’ we will consider a more formal approach leading to this GPGD.
One reason to consider another distance-like function is that it can better reflect the geometry of the set \(C\), which can make each step easier to compute (for example, if the divergence grows to infinity as \(x\) approaches the boundary of \(C\), one can then ignore the constraint altogether); the sketch below illustrates how the divergence shapes the step. In ‘‘descent algorithms (2)’’ we will look further into this choice of function.
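Here is a minimal sketch of a single GPGD step with a non-Euclidean divergence (assuming numpy and scipy; the anisotropic quadratic divergence, the box constraint and the gradient value are illustrative assumptions). The geometry encoded in \(d\) directly changes the step taken.

```python
import numpy as np
from scipy.optimize import minimize

def gpgd_step(x_k, grad_fk, d, alpha_k, bounds):
    """One GPGD step: argmin over C of <x, grad_fk> + d(x, x_k)/alpha_k,
    with the inner problem solved numerically (C is a box here)."""
    obj = lambda x: x @ grad_fk + d(x, x_k) / alpha_k
    return minimize(obj, x0=x_k, bounds=bounds).x

# Anisotropic quadratic divergence d(x, y) = 0.5*(x - y)^T A (x - y):
# moving along the second coordinate is penalized ten times more.
A = np.diag([1.0, 10.0])
d = lambda x, y: 0.5 * (x - y) @ A @ (x - y)

x_k = np.array([0.5, 0.5])
grad_fk = np.array([1.0, 1.0])              # illustrative gradient at x_k
x_next = gpgd_step(x_k, grad_fk, d, alpha_k=0.5, bounds=[(0, 1), (0, 1)])
print(x_next)  # approx [0., 0.45]: a shorter move along the penalized axis
```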
For a given divergence, it remains to compute the corresponding iteration steps. A popular class of divergences for which the iteration steps take a more explicit form is that of Bregman divergences.

2   Bregman divergences and MDA

Recall that for a strictly convex function \(\psi\) we can define a positive-definite function \(B_{\psi}:X\times X\to \mathbb R^{+}\) by

\[ \begin{eqnarray} B_{\psi}(x,y) &:=& \psi(x)-\psi(y)-\langle x-y, \psi'(y)\rangle, \quad \text{with}\quad \psi'(y)\,\in\,\partial \psi(y),\nonumber \end{eqnarray} \]

called the Bregman divergence (associated with \(\psi\)).
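Two classical instances can be checked numerically. In the sketch below (assuming numpy; the points \(x\) and \(y\) are illustrative), \(\psi=\frac{1}{2}\|\cdot\|_{2}^{2}\) yields half the squared Euclidean distance, and \(\psi(x)=\sum_{i} x_{i}\log x_{i}\) (the negative entropy) yields the Kullback-Leibler divergence between points of the simplex.

```python
import numpy as np

def bregman(psi, grad_psi, x, y):
    """B_psi(x, y) = psi(x) - psi(y) - <x - y, grad psi(y)>."""
    return psi(x) - psi(y) - (x - y) @ grad_psi(y)

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])

# psi = 0.5*||.||^2  ->  B_psi(x, y) = 0.5*||x - y||^2
sq = bregman(lambda z: 0.5 * z @ z, lambda z: z, x, y)
assert np.isclose(sq, 0.5 * np.sum((x - y) ** 2))

# psi = negative entropy  ->  B_psi(x, y) = KL(x || y) + sum(y) - sum(x),
# which is exactly KL(x || y) when x and y both lie on the simplex.
kl = bregman(lambda z: np.sum(z * np.log(z)),
             lambda z: np.log(z) + 1.0, x, y)
assert np.isclose(kl, np.sum(x * np.log(x / y)))
```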
Let us now consider a continuously differentiable, (\(\mu\)-)strongly convex function \(\varphi\) and use the associated Bregman divergence \(B_{\varphi}\) in the generalized PGD. This gives the iteration

\[ \begin{eqnarray} x_{k+1} &\in& \arg\min_{x\in C} \quad \left\{ \langle x,\nabla f(x_{k})\rangle + {1\over \alpha_{k}}B_{\varphi}(x,x_{k}) \right\}\quad \text{with}\quad \alpha_{k}>0.\nonumber \end{eqnarray} \]

The first-order optimality condition (FOC) characterizing \(x_{k+1}\) reads, after multiplying through by \(\alpha_{k}\) (the normal cone \(N_{C}(x_{k+1})\) absorbs positive scalar factors):

\[ \begin{eqnarray} 0&\in& \alpha_{k}\nabla f(x_{k}) + \nabla \varphi(x_{k+1}) - \nabla\varphi(x_{k}) + N_{C}(x_{k+1}),\nonumber \end{eqnarray} \]

which, after rearranging terms, reads

\[ \begin{eqnarray} x_{k+1} &\in& (\nabla \varphi + N_{C})^{-1}(\nabla \varphi(x_{k})-\alpha_{k}\nabla f(x_{k})).\nonumber \end{eqnarray} \]

To complete this, observe that \((\nabla\varphi+N_{C})=\partial (\varphi+i_{C})\), where \(i_{C}\) is the indicator function of \(C\). Letting \(\phi:=\varphi+i_{C}\) and using that \((\partial h)^{-1}=\partial h^{\star}\) for \(h\in\Gamma_{0}(X)\) (cf. ‘‘more convex analysis’’), we are finally left with the MDA.

The Mirror Descent Algorithm (MDA) is the iteration

\[ \begin{eqnarray} x_{k+1} &=& \nabla \phi^{\star} (\nabla \varphi(x_{k})-\alpha_{k}\nabla f(x_{k})), \end{eqnarray} \]

where \(\phi^{\star}(y):=\max_{z\in C}\,[\langle z,y\rangle - \varphi(z)]\).

  • letting \(z^{+}:=\arg\min_{z\in C} [\varphi(z)-\langle z,y\rangle]\), we have

\[ \phi^{\star}(y) \,\,=\,\, \langle z^{+},y\rangle - \varphi(z^{+})\quad\text{and hence}\quad \nabla\phi^{\star}(y)\,\,=\,\, z^{+}. \]

  • \(\phi\) is also strongly convex, so that the subdifferential of its convex conjugate is single-valued everywhere, which is why we can write \(\nabla\phi^{\star}\).

  • as with the PGD, the whole development of the MDA can be carried out without assuming differentiability of \(f\); this leads to the iteration \(x_{k+1}=\nabla\phi^{\star}(\nabla\varphi(x_{k})-\alpha_{k} f'(x_{k}))\) with \(f'(x_{k})\in \partial f(x_{k})\).
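To make the iteration concrete: with \(\varphi=\frac{1}{2}\|\cdot\|_{2}^{2}\) one recovers the PGD, while on the probability simplex with \(\varphi\) the negative entropy, the mirror step becomes the multiplicative update \(x_{k+1,i}\propto x_{k,i}\exp(-\alpha_{k}[\nabla f(x_{k})]_{i})\), often called the exponentiated gradient method. Below is a minimal sketch of this entropic instance (assuming numpy; the quadratic \(f\) and the random data are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.normal(size=(20, 5)), rng.normal(size=20)
grad_f = lambda x: A.T @ (A @ x - b)   # f(x) = 0.5*||Ax - b||^2 (illustrative)

x = np.full(5, 0.2)                    # start at the centre of the simplex
for k in range(200):
    alpha_k = 0.1 / np.sqrt(k + 1)
    # mirror step: nabla varphi(x) = log(x) + 1; the constant 1 cancels in
    # the normalization below, so it is dropped.
    y = np.log(x) - alpha_k * grad_f(x)
    # nabla phi*(y) on the simplex is the softmax: renormalized exponential.
    w = np.exp(y - y.max())            # max-shift for numerical stability
    x = w / w.sum()

print(x, x.sum())  # the iterate stays in the simplex by construction
```

Note that the constraint never has to be enforced explicitly: the entropic mirror map keeps every iterate strictly inside the simplex, which is exactly the phenomenon mentioned in section 1.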

3   Some references

  • Beck and Teboulle, Mirror descent and nonlinear projected subgradient methods for convex optimization, 2003.

  • Nemirovski, Tutorial: Mirror Descent Algorithms for Large-Scale Deterministic and Stochastic Convex Optimization, 2012.