Chapter 5 Gaussian Graphical Models

We say that \(X_V\) has a multivariate Gaussian distribution with parameters \(\mu\) and \(\Sigma\), written \(X_V \sim N_p(\mu, \Sigma)\), if the joint density is \[\begin{align*} f(x_V) &= \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{ - \frac{1}{2} (x_V - \mu)^T \Sigma^{-1} (x_V - \mu) \right\}, && x_V \in \mathbb{R}^p. \end{align*}\]
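As a quick sanity check of this formula, the density can be evaluated directly and compared with a library implementation. The Python sketch below is not part of the original text; it uses numpy and scipy with an arbitrary illustrative choice of \(\Sigma\) and evaluation point.

```python
# A minimal numerical check of the density formula above (illustrative values).
import numpy as np
from scipy.stats import multivariate_normal

p = 3
mu = np.zeros(p)
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])   # an arbitrary positive definite matrix
x = np.array([0.2, -1.0, 0.7])

# Density written exactly as in the displayed formula.
quad = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
f = np.exp(-0.5 * quad) / ((2 * np.pi) ** (p / 2) * np.linalg.det(Sigma) ** 0.5)

# Agreement with the library implementation.
print(np.isclose(f, multivariate_normal(mean=mu, cov=Sigma).pdf(x)))
```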

Proposition 5.1 Let \(X_V \sim N_p(\mu, \Sigma)\), and let \(A\) be a \(q \times p\) matrix of full row rank \(q \leq p\). Then \[ A X_V \sim N_q(A \mu, A\Sigma A^T). \] In particular, for any \(U \subseteq V\) we have \(X_U \sim N_{|U|}(\mu_{U}, \Sigma_{UU})\).
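A quick Monte Carlo check of the proposition, with arbitrary illustrative choices of \(A\), \(\mu\) and \(\Sigma\) (a sketch, not part of the text): the empirical mean and covariance of \(AX_V\) should be close to \(A\mu\) and \(A\Sigma A^T\).

```python
# Monte Carlo sanity check of Proposition 5.1 (illustrative values).
import numpy as np

rng = np.random.default_rng(0)
p, q, n = 3, 2, 200_000
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, -1.0]])   # q x p, full row rank

X = rng.multivariate_normal(mu, Sigma, size=n)   # rows are samples of X_V
Y = X @ A.T                                       # samples of A X_V

# Empirical moments of A X_V should be close to A mu and A Sigma A^T.
print(np.allclose(Y.mean(axis=0), A @ mu, atol=0.05))
print(np.allclose(np.cov(Y, rowvar=False), A @ Sigma @ A.T, atol=0.1))
```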

Proof (sketch; you should fill in the gaps). For \(q = p\) this follows directly from applying the transformation \(Z = A X_V\) to the density of \(X_V\).
If \(q < p\) then since \(\Sigma\) is positive definite we can write \(\Sigma = L L^T\) for a non-singular lower triangular matrix \(L\); then construct a non-singular \(p \times p\) matrix \[\begin{align*} \tilde A = \left(\begin{matrix} A \\ B \end{matrix}\right) \end{align*}\] whose first \(q\) rows are \(A\), and such that \(\tilde{A} L\) has its first \(q\) rows orthogonal to its last \(p-q\) rows. Then \[\begin{align*} \tilde{A} \Sigma \tilde{A}^T = \left(\begin{matrix} A \Sigma A^T & 0\\ 0 & B \Sigma B^T \end{matrix}\right) \end{align*}\] and the first \(q\) components have the desired marginal distribution.
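The construction in the proof sketch can be checked numerically: take \(\Sigma = LL^T\), choose \(B\) so that the rows of \(BL\) are orthogonal to those of \(AL\), and confirm that \(\tilde{A}\Sigma\tilde{A}^T\) is block diagonal. The matrices below are illustrative choices, not from the text.

```python
# Numerical illustration of the construction in the proof sketch.
import numpy as np
from scipy.linalg import cholesky, null_space

Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
A = np.array([[1.0, 0.0, 2.0]])           # q = 1, p = 3
q, p = A.shape

L = cholesky(Sigma, lower=True)           # Sigma = L L^T
M = A @ L                                 # first q rows of tilde{A} L
N = null_space(M).T                       # rows orthogonal to the rows of M
B = N @ np.linalg.inv(L)                  # so that B L = N
A_tilde = np.vstack([A, B])

V = A_tilde @ Sigma @ A_tilde.T
print(np.round(V, 6))
print(np.allclose(V[:q, q:], 0.0))        # off-diagonal blocks vanish
```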

For simplicity of notation, we will assume throughout that \(\mu = 0\). Note that the dependence structure is entirely determined by \(\Sigma\), and \(\mu\) is an orthogonal parameter to \(\Sigma\).

5.1 Gaussian Graphical Models

We only consider cases in which \(\Sigma\) is positive definite, so all our density functions are strictly positive. Hence, by the Hammersley-Clifford Theorem, the pairwise and global Markov properties and the factorization criterion all lead to the same conditional independence restrictions. If any of these hold, we will say without ambiguity that \(\Sigma\) is Markov with respect to the graph.

Recall that \(X_A \perp\hspace{-3.2mm}\perp X_B\) if and only if \(\Sigma_{AB} = 0\), and note that a corollary of this is that, for jointly Gaussian random variables, \(X \perp\hspace{-3.2mm}\perp Y\) and \(X \perp\hspace{-3.2mm}\perp Z\) together do imply \(X \perp\hspace{-3.2mm}\perp Y, Z\), unlike in the general case.

Theorem 5.1 Let \(X_V \sim N_p(\mu, \Sigma)\) for positive definite \(\Sigma\), with \(K = \Sigma^{-1}\).
Then the distribution of \(X_V\) is Markov with respect to \({\cal G}\) if and only if \(k_{ab} = 0\) whenever \(a \not\sim b\) in \({\cal G}\).

Proof. This follows immediately from Proposition 3.1.
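To illustrate the theorem numerically, the sketch below (not from the text) takes an illustrative precision matrix that is Markov with respect to the path graph \(1 - 2 - 3\) (so \(k_{13} = 0\)) and checks that the conditional covariance of \((X_1, X_3)\) given \(X_2\), computed via the usual Schur complement, has zero off-diagonal entry.

```python
# A small check of Theorem 5.1 for the path graph 1 - 2 - 3 (illustrative K).
import numpy as np

# Precision matrix with k_13 = 0, matching the missing edge between 1 and 3.
K = np.array([[1.0, 0.4, 0.0],
              [0.4, 1.2, 0.5],
              [0.0, 0.5, 1.1]])
Sigma = np.linalg.inv(K)

# Conditional covariance of (X_1, X_3) given X_2, via the Schur complement.
a = [0, 2]   # indices of X_1 and X_3
b = [1]      # index of X_2
cond_cov = (Sigma[np.ix_(a, a)]
            - Sigma[np.ix_(a, b)] @ np.linalg.inv(Sigma[np.ix_(b, b)]) @ Sigma[np.ix_(b, a)])

# The off-diagonal entry is zero, so X_1 is independent of X_3 given X_2.
print(np.round(cond_cov, 8))
print(np.isclose(cond_cov[0, 1], 0.0))
```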

We introduce some notation for convenience. If \(M\) is a matrix whose rows and columns are indexed by \(A \subseteq V\), we write \(\{M\}_{A,A}\) to indicate the matrix indexed by \(V\) (i.e. it has \(|V|\) rows and columns) whose \(A,A\)-entries are \(M\) and with zeroes elsewhere.

For example, if \(|V| = 3\) then \[\begin{align*} M &= \left(\begin{matrix} a & b\\ b & c\\ \end{matrix}\right) & \{M\}_{12,12} &= \left(\begin{matrix} a & b & 0\\ b & c & 0\\ 0 & 0 & 0\\ \end{matrix}\right), \end{align*}\] where \(12\) is used as an abbreviation for \(\{1,2\}\) in the subscript.
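This padding operation is straightforward to implement; the helper below is a minimal sketch (the name `pad` is just a label chosen here), and it reproduces the \(\{M\}_{12,12}\) example above with numerical entries in place of \(a, b, c\).

```python
# A minimal helper implementing the {M}_{A,A} padding notation above.
import numpy as np

def pad(M, A, V_size):
    """Return the V_size x V_size matrix whose (A, A) block is M, zero elsewhere."""
    out = np.zeros((V_size, V_size))
    out[np.ix_(A, A)] = M
    return out

M = np.array([[1.0, 2.0],    # plays the role of [[a, b],
              [2.0, 3.0]])   #                    [b, c]]
print(pad(M, A=[0, 1], V_size=3))   # the {M}_{12,12} pattern from the text
```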

Lemma 5.1 Let \({\cal G}\) be a graph with decomposition \((A,S,B)\), and \(X_V \sim N_p(0, \Sigma)\). Then \(p(x_V)\) is Markov with respect to \({\cal G}\) if and only if \[\begin{align*} \Sigma^{-1} = \left\{({\Sigma}_{A\cup S, A\cup S})^{-1} \right\}_{A\cup S, A\cup S} + \left\{({\Sigma}_{B\cup S, B\cup S})^{-1} \right\}_{B\cup S, B\cup S} - \left\{({\Sigma}_{S, S})^{-1}\right\}_{S, S}, \end{align*}\] and \(\Sigma_{A\cup S, A\cup S}\) and \(\Sigma_{B\cup S, B\cup S}\) are Markov with respect to \({\cal G}_{A \cup S}\) and \({\cal G}_{B \cup S}\) respectively.

Proof. We know from Lemma 4.1 that \[\begin{align*} p(x_V) \cdot p(x_S) = p(x_A, x_S) \cdot p(x_B, x_S), \end{align*}\] where \(p(x_A, x_S)\) and \(p(x_B, x_S)\) are Markov with respect to \({\cal G}_{A \cup S}\) and \({\cal G}_{B \cup S}\) respectively.
Since margins of multivariate Gaussians are also multivariate Gaussian, we can insert the appropriate density for each term, take logs and rearrange to see that \[\begin{align*} x_V^T \Sigma^{-1} x_V + x_S^T (\Sigma_{SS})^{-1} x_S = x_{A\cup S}^T (\Sigma_{A\cup S,A\cup S})^{-1} x_{A\cup S} + x_{B\cup S}^T (\Sigma_{B\cup S,B\cup S})^{-1} x_{B\cup S} + \text{const.} \end{align*}\] which is a quadratic polynomial in the variables \(x_v\). Comparing coefficients of each term, we obtain \[\begin{align*} {\Sigma}^{-1} = \left\{({\Sigma}_{A\cup S, A\cup S})^{-1} \right\}_{A\cup S, A\cup S} + \left\{({\Sigma}_{B\cup S, B\cup S})^{-1} \right\}_{B\cup S, B\cup S} - \left\{({\Sigma}_{S, S})^{-1}\right\}_{S, S}. \end{align*}\] This gives the result.

Applying the previous result repeatedly to a decomposable graph with cliques \(C_1, \ldots, C_k\) and separators \(S_2, \ldots, S_k\), we see that \(X_V\) is Markov with respect to \({\cal G}\) if and only if \[\begin{align*} \Sigma^{-1} = \sum_{i=1}^k \left\{(\Sigma_{C_i, C_i})^{-1} \right\}_{C_i, C_i} - \sum_{i=2}^k \left\{(\Sigma_{S_i, S_i})^{-1} \right\}_{S_i, S_i}. \end{align*}\]
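This identity can be verified numerically on the simplest decomposable example, the path \(1 - 2 - 3\) with cliques \(\{1,2\}, \{2,3\}\) and separator \(\{2\}\). The sketch below reuses the same illustrative precision matrix as earlier (with \(k_{13} = 0\)) and repeats the illustrative `pad` helper so that it is self-contained.

```python
# Checking the clique/separator formula on the decomposable path graph 1 - 2 - 3.
import numpy as np

def pad(M, A, p):
    out = np.zeros((p, p))
    out[np.ix_(A, A)] = M
    return out

K = np.array([[1.0, 0.4, 0.0],
              [0.4, 1.2, 0.5],
              [0.0, 0.5, 1.1]])     # Markov w.r.t. the path: k_13 = 0
Sigma = np.linalg.inv(K)

C1, C2, S2 = [0, 1], [1, 2], [1]    # cliques {1,2}, {2,3}; separator {2}
rhs = (pad(np.linalg.inv(Sigma[np.ix_(C1, C1)]), C1, 3)
       + pad(np.linalg.inv(Sigma[np.ix_(C2, C2)]), C2, 3)
       - pad(np.linalg.inv(Sigma[np.ix_(S2, S2)]), S2, 3))

print(np.allclose(rhs, K))          # True: the decomposition recovers Sigma^{-1}
```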

5.2 Maximum Likelihood Estimation

Let \(X_V^{(1)}, \ldots, X_V^{(n)}\) be i.i.d. \(N_p(0, \Sigma)\); then from Section 3 the sufficient statistic for \(\Sigma\) is the sample covariance matrix: \[\begin{align*} W\equiv \frac{1}{n} \sum_{i=1}^n X_V^{(i)} X_V^{(i)T}. \end{align*}\] In addition, \(\hat\Sigma = W\) is also the MLE for \(\Sigma\) under the unrestricted model (i.e. when all edges are present in the graph). Let \(\hat\Sigma^{\cal G}\) denote the MLE for \(\Sigma\) under the restriction that the distribution satisfies the Markov property for \({\cal G}\), and \(\hat{K}^{\cal G}\) its inverse.

Recall that if \(i \not\sim j\) then \(k_{ij} = 0\), so the sufficient statistics for a graph \({\cal G}\) reduce to the entries in \(W\) that correspond to edges in the graph.
The MLE involves picking \(\hat{K}\) such that: \[\begin{align*} \hat{k}_{ij} &= 0 &\text{whenever } &i \not\sim j\\ \hat{\sigma}_{ij} &= W_{ij} &&i \sim j; \end{align*}\] (here \(\hat\sigma_{ij}\) is the \((i,j)\) entry of the inverse of \(\hat{K}\)).

For a decomposable graph \({\cal G}\) with cliques \(C_1, \ldots, C_k\) and separators \(S_2, \ldots, S_k\), this means that the MLE can be written in the form \[\begin{align*} \left(\hat{\Sigma}^{{\cal G}}\right)^{-1} = \sum_{i=1}^k \left\{(W_{C_i, C_i})^{-1} \right\}_{C_i, C_i} - \sum_{i=2}^k \left\{(W_{S_i, S_i})^{-1} \right\}_{S_i, S_i}. \end{align*}\] This matches the sufficient statistics, in the sense that \(\hat{\Sigma}^{{\cal G}}_{C_i,C_i} = W_{C_i,C_i}\) for each \(i\).
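For the same path graph as before, the sketch below simulates data, forms \(W\), assembles \(\hat{K}^{\cal G}\) from the clique and separator blocks, and checks the two defining properties of the MLE stated above. The data, the true covariance and the `pad` helper are illustrative, not from the text.

```python
# Sketch of the decomposable MLE formula for the path graph 1 - 2 - 3 (simulated data).
import numpy as np

def pad(M, A, p):
    out = np.zeros((p, p))
    out[np.ix_(A, A)] = M
    return out

rng = np.random.default_rng(1)
Sigma_true = np.array([[2.0, 0.5, 0.0],
                       [0.5, 1.0, 0.3],
                       [0.0, 0.3, 1.5]])
X = rng.multivariate_normal(np.zeros(3), Sigma_true, size=500)
W = X.T @ X / X.shape[0]                       # sample covariance (mean known to be 0)

C1, C2, S2 = [0, 1], [1, 2], [1]               # cliques and separator of the path
K_hat = (pad(np.linalg.inv(W[np.ix_(C1, C1)]), C1, 3)
         + pad(np.linalg.inv(W[np.ix_(C2, C2)]), C2, 3)
         - pad(np.linalg.inv(W[np.ix_(S2, S2)]), S2, 3))
Sigma_hat = np.linalg.inv(K_hat)

# The two defining properties of the MLE from above:
print(np.isclose(K_hat[0, 2], 0.0))                                    # missing edge
print(np.allclose(Sigma_hat[np.ix_(C1, C1)], W[np.ix_(C1, C1)]))        # matches W on cliques
print(np.allclose(Sigma_hat[np.ix_(C2, C2)], W[np.ix_(C2, C2)]))
```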

5.3 Data Examples

Example 5.1

Whittaker (1990) analyses data on the results of five maths tests, in analysis, algebra, vectors, mechanics and statistics, administered to 88 students. The empirical concentration matrix (i.e. the inverse \(S^{-1}\) of the sample covariance matrix) is given by the following table (entries multiplied by \(10^3\)):
             mechanics   vectors   algebra   analysis   statistics
mechanics         5.24     -2.44     -2.74       0.01        -0.14
vectors          -2.44     10.43     -4.71      -0.79        -0.17
algebra          -2.74     -4.71     26.95      -7.05        -4.70
analysis          0.01     -0.79     -7.05       9.88        -2.02
statistics       -0.14     -0.17     -4.70      -2.02         6.45

Notice that some of the entries in the concentration matrix are quite small, suggesting that some conditional independences hold, at least approximately. Indeed, fitting the graphical model in Figure 5.1 gives an excellent fit (see Examples Sheet 2). The model suggests that ability in analysis and statistics is independent of ability in mechanics and vector calculus, conditional on one’s fundamental skill at algebra.


Figure 5.1: A graph for the maths test data.
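One way to judge which entries of the concentration matrix are ‘small’ is to rescale it into partial correlations, \(\rho_{ij \cdot V \setminus \{i,j\}} = -k_{ij}/\sqrt{k_{ii} k_{jj}}\). The sketch below is not part of the original analysis; it simply applies this standard rescaling to the table above (the factor of \(10^3\) cancels in the ratio).

```python
# Rescale the concentration matrix above into partial correlations.
import numpy as np

labels = ["mechanics", "vectors", "algebra", "analysis", "statistics"]
K = np.array([[ 5.24, -2.44, -2.74,  0.01, -0.14],
              [-2.44, 10.43, -4.71, -0.79, -0.17],
              [-2.74, -4.71, 26.95, -7.05, -4.70],
              [ 0.01, -0.79, -7.05,  9.88, -2.02],
              [-0.14, -0.17, -4.70, -2.02,  6.45]])   # entries x 10^3, as in the table

d = np.sqrt(np.diag(K))
partial_corr = -K / np.outer(d, d)        # rho_{ij . rest} = -k_ij / sqrt(k_ii k_jj)
np.fill_diagonal(partial_corr, 1.0)
print(np.round(partial_corr, 2))
```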

Bibliographic Notes

Gaussian graphical models have their origin in Dempster’s (1972) paper on ‘covariance selection’, and he essentially gave the result of Lemma 5.1.

References

Dempster, A. P. (1972). Covariance selection. Biometrics, 28(1), 157–175.