## Basics of Convex AnalysisWe introduce here briefly a few important building blocks of convex optimization: the notion of **subgradient**(aka. generalized gradient) and subdifferential, the**first-order optimality condition****strict**and**strong convexity**,**positive-definite functions**and a hint at**Bregman divergences**,the **convex conjugate**.
## 1. Subgradient and First-order Optimality ConditionWe say that \(y\in X\) is a \[ \begin{eqnarray} f(z) &\ge & f(x) + \langle z-x, y \rangle, \qquad \forall z\in X.\label{subgradient inequality} \end{eqnarray} \] The inequality above just indicates that the graph of the function \(f\) is supported by the hyperplane defined by the right-hand side. A subgradient is thus the ‘‘slope’’ of one such
Now, by definition of an optimal point \(x^{\sharp}\) for the unconstrained problem, we must have \(f(z)\ge f(x^{\sharp})\) for all \(z\in X\). This can be written as \[ \begin{eqnarray} f(z) &\ge& f(x^{\sharp}) + \langle z-x,0 \rangle, \qquad \forall z \in X,\nonumber \end{eqnarray} \] and hence \(0\) must be a subgradient of \(f\) at \(x^\sharp\). This is the \[ \begin{eqnarray} x^\sharp \,\in\, \arg\min_x \, f(x) &\Longleftrightarrow& 0\,\in\, \partial f(x^\sharp). \end{eqnarray} \] If we take the subdifferential as an operator then, intuitively, looking for a minimizer amounts to ‘‘inverting’’ the subdifferential and evaluating it at \(0\), i.e.: \(x^\sharp = (\partial f)^{-1}(0)\). We shall come back to this in more details but this idea of inverting an operator involving the subdifferential to find the minimizer is very important. Before moving on, it is useful to note (and not too hard to convince oneself) that the following inclusion holds for the subdifferential of a sum: \[ \begin{eqnarray} \sum_i \partial f_i &\subseteq& \partial \sum_i f_i. \end{eqnarray} \] For most problems of interest, it holds as an equality but note that if \(0\in \sum_i \partial f_i(x^\dagger)\) then the above inclusion implies that \(0\in\partial \sum_i f_i(x^\dagger)\) which is sufficient to show that \(x^\dagger\) is a minimizer (which is what we are mostly interested in). ## 2. Strict and Strong convexityThe function \(\psi\) is said to be \[ \begin{eqnarray} \psi(z) &>& \psi(x) + \langle z-x , y\rangle, \quad\forall z\in X,\, z\neq x \end{eqnarray} \] where \(y\in\partial \psi(x)\). Recalling that a function \(d: X\times X\) is said to be \[ \begin{eqnarray} B_{\psi}(z,x) &:=& \psi(z)-\psi(x) - \langle z-x,y \rangle, \quad \text{with}\quad y\in\partial \psi(x). \label{Bregman A} \end{eqnarray} \] Equation \(\eqref{Bregman A}\) actually defines The function \(\varphi\) is said to be \(\mu\)- \[ \begin{eqnarray} B_{\varphi}(x,y) &\ge& {\mu\over 2} {\|x-y\|^{2}_{2}}, \quad\forall x,y\in X. \end{eqnarray} \] The factor \(1/2\) might seem irrelevant but makes other developments look nicer. For now, observe that if we take the derivative of the right-hand side, then we are left with \((x-y)\) without a spurious factor \(2\)…
## 3. Legendre-Fenchel convex conjugateLet us consider again the definition of the subgradient of a convex function \(f\) at a point \(x\): \[ \begin{eqnarray} \partial f(x) &=& \{y\,|\, f(z)\,\ge\, f(x)+\langle z-x,y\rangle, \,\forall z\}.\nonumber \end{eqnarray} \] Note that we can then rearrange terms in the condition as follows: \[ \begin{eqnarray} \partial f(x) &=& \{y\,|\, \langle z,y\rangle - f(z)\,\le\, \langle x,y\rangle - f(x), \,\forall z\},\nonumber \end{eqnarray} \] but since the condition must hold for all \(z\), it must equivalently hold for all \(z\) that maximizes the lower bound. Note that the maximum of that lower bound tightens the inequality exactly (just take \(z=x\)). We can thus write the subgradient as \[ \begin{eqnarray} \partial f(x) &=& \{y \,|\, \max_{z} \,\, [\langle z,y\rangle -f(z)] \,=\, \langle x,y\rangle -f(x)\}.\nonumber \end{eqnarray} \] This justifies the definition of the \[ \begin{eqnarray} f^\star(y) &:=& \max_z \,\,[\langle z,y \rangle - f(z)]. \end{eqnarray} \] where \(\overline{\mathbb R}=\mathbb R \cup \{\pm\infty\}\) is the extended real line. The subgradient can then be expressed in terms of the convex conjugate as: \[ \begin{eqnarray} \partial f(x) &=& \{ y \,|\, f^{\star}(y) \,=\, \langle x,y\rangle - f(x)\}. \end{eqnarray} \] A useful property of the convex conjugate for We can consider a simple (and yet quite useful) example for the convex conjugate: if we define \(\psi(x):=\|x\|^2/2\), its convex conjugate is then \[ \begin{eqnarray} \psi^\star(y) &=& \max_z\,\, \langle z,y\rangle - \frac12\langle z,z\rangle, \end{eqnarray} \] and the FOC leads immediately to \(\psi^\star(y)=\psi(y)\). The definition of the convex conjugate also directly implies \[ \begin{eqnarray} f(x) + f^\star(y) &\ge & \langle x,y\rangle, \quad \forall x,y\in X. \end{eqnarray} \] ## 4. A couple of commentsWe use \(\max\) and \(\min\) everywhere but it would be more correct to use \(\sup\) and \(\inf\). In practice this point is usually irrelevant. For more about all these technical details (and much more), standard phrase: please refer to Rockafellar's book :-). |