Chapter 3 Exponential Families and Contingency Tables
For much of the rest of the course we will be dealing with collections of random variables \(X_V \equiv (X_v : v \in V)\), indexed by a set \(V= \{1,\ldots, p\}\). Each \(X_v\) takes values in the set \({\cal X}_v\). For a subset of the variables \(A \subseteq V\), we write \(X_A\) to denote \((X_v : v \in A)\).
3.1 Exponential Families
Let \(p(\cdot; \theta)\) be a collection of probability
densities over \({\cal X}\) indexed by \(\theta \in \Theta\). We say
that \(p\) is an exponential family if it can be
written as
\[\begin{align*}
p(x; \theta) = \exp\left\{ \sum_i \theta_i \phi_i(x) - A(\theta) - C(x) \right\}.
\end{align*}\]
If \(\Theta\) is a non-empty open set then the family is said to be
regular.
The functions \(\phi_i\) are the sufficient statistics,
and the components \(\theta_i\) are called the canonical parameters (or natural parameters). We can
replace the sum with an inner product of vectors \(\theta = (\theta_i)\)
and \(\phi = (\phi_i(x))\):
\[\begin{align*}
p(x; \theta) = \exp\left\{ \langle \theta, \phi(x) \rangle - A(\theta) - C(x)\right\}.
\end{align*}\]
The function \(A(\theta)\) is the cumulant function,
and must be chosen so that the distribution normalizes, i.e.
\[\begin{align*}
A(\theta) = \log \int \exp\left\{ \langle\theta, \phi(x) \rangle - C(x) \right\} \, dx.
\end{align*}\]
\(Z(\theta) \equiv e^{A(\theta)}\) is also called the partition function.
Lemma 3.1 We have \[\begin{align*} \nabla A(\theta) &= \mathbb{E}_{\theta} \phi(X), & \nabla\nabla^T A(\theta) &= \operatorname{Cov}_{\theta} \phi(X). \end{align*}\] Consequently \(A(\theta)\) (and \(- \log p(x; \theta)\)) are convex in \(\theta\). In addition, the map \(\mu(\theta) : \theta \mapsto \nabla A(\theta)\) is bijective, and called the mean function.
Proof. For the first part,
\[\begin{align*}
e^{A(\theta)} \frac{\partial}{\partial \theta_i} A(\theta) &= \frac{\partial}{\partial \theta_i} e^{A(\theta)}\\
&= \frac{\partial}{\partial \theta_i} \int \exp\left\{ \langle \theta, \phi(x) \rangle -C(x)\right\} \, dx\\
&= \int \frac{\partial}{\partial \theta_i} \exp\left\{ \langle \theta, \phi(x) \rangle -C(x)\right\} \, dx\\
&= \int \phi_i(x) \exp\left\{ \langle \theta, \phi(x) \rangle -C(x)\right\} \, dx\\
&= e^{A(\theta)} \int \phi_i(x) \exp\left\{ \langle \theta, \phi(x) \rangle - A(\theta) -C(x)\right\} \, dx\\
&= e^{A(\theta)} \mathbb{E}_{\theta} \phi_i(X).
\end{align*}\]
The result for the Hessian follows similarly. The convexity of \(-\log p(x; \theta) = A(\theta) - \langle \theta, \phi(x) \rangle\)
is now immediate from the fact that its Hessian is a non-negative definite matrix.
That \(\mu(\theta)\) is bijective requires strict convexity; i.e. that
the Hessian is positive definite. This follows from a slight
extension to the above (see the book by Wainwright and Jordan, Proposition 3.1).
The property of convexity plays an important role in the computational advantages of exponential families. Convex functions are easy to work with for the purposes of optimization: in particular, any local minimum of a convex function is also a global minimum, so there are no spurious local optima to get trapped in.
Example 3.1 Let \(X \sim \operatorname{Poisson}(\lambda)\). We have \[\begin{align*} p_\lambda(x) &= e^{-\lambda} \frac{\lambda^x}{x!} = \frac{1}{x!} \exp\left\{x \log \lambda - \lambda\right\}. \end{align*}\] Clearly the canonical parameter is \(\theta = \log\lambda\), so we can rewrite as \[\begin{align*} p_\theta(x) &= \frac{1}{x!} \exp\left\{\theta x - e^{\theta}\right\}, \end{align*}\] giving \(A(\theta) = e^\theta\) (which is convex, as expected). Note that \(A'(\theta) = A''(\theta) = e^\theta = \lambda\), which is indeed the mean and variance of a Poisson distribution.
3.2 Empirical Moment Matching
To find the maximum likelihood estimate in an exponential family, we maximize the log-likelihood (ignoring \(C\), since it is constant in \(\theta\)) \[\begin{align*} l(\theta; X^{(1)},\ldots, X^{(n)}) &= \left\langle \sum_{i=1}^n \phi(X^{(i)}), \theta \right\rangle - n A(\theta)\\ n^{-1} l(\theta; X^{(1)},\ldots, X^{(n)}) &= \langle \overline{\phi(X)}, \theta \rangle - A(\theta) \end{align*}\] where \(\overline{\phi(X)} = n^{-1} \sum_i \phi(X^{(i)})\) is the sample mean of the sufficient statistics. To maximize this, we can differentiate and set to zero, obtaining \[\begin{align*} \overline{\phi(X)} - \nabla A(\theta) = 0, \end{align*}\] so in other words when we choose \(\theta\) so that \(\mathbb{E}_\theta \phi(X) = \overline{\phi(X)}\): the mean of the sufficient statistics matches the empirical mean from the data.
Note also that if we differentiate just with respect to \(\theta_i\), we obtain the same result for each sufficient statistic separately; hence if we update the parameters to match the moment \(\overline{\phi_i(X)} = \mathbb{E}_{\theta} \phi_i(X)\), then we increase the log-likelihood. If we iterate this over \(i\), we will converge to the global maximum likelihood estimate, because the log-likelihood is a (strictly) concave and differentiable function.
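As a concrete illustration (a minimal simulated sketch, not from the notes), for the Poisson family of Example 3.1 the moment-matching equation \(\exp(\hat\theta) = \overline{x}\) gives the MLE in closed form, and agrees with a direct numerical maximization of the log-likelihood:
set.seed(3)
x <- rpois(200, lambda = 4)                      # simulated Poisson data
theta_hat <- log(mean(x))                        # solves mean(phi(X)) = A'(theta) = exp(theta)
nll <- function(theta) -sum(dpois(x, exp(theta), log = TRUE))
optimize(nll, interval = c(-10, 10))$minimum     # numerical MLE; agrees with theta_hat
theta_hat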
3.3 Multivariate Gaussian Distribution
Let \(X_V = (X_1,\ldots,X_p)^T \in \mathbb{R}^p\) be a random vector.
Let \(\mu \in \mathbb{R}^p\) and \(\Sigma \in \mathbb{R}^{p\times p}\) be
a positive definite symmetric matrix. We say that \(X_V\) has a
multivariate Gaussian distribution with parameters \(\mu\) and
\(\Sigma\) if the joint density is
\[\begin{align*}
f(x_V) &= \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{ - \frac{1}{2} (x_V - \mu)^T \Sigma^{-1} (x_V - \mu) \right\}, && x_V \in \mathbb{R}^p.
\end{align*}\]
This is also called the multivariate normal distribution.
The concentration matrix is \(K \equiv \Sigma^{-1}\).
We can rewrite this as \[\begin{align*} f(x_V) &= \frac{1}{(2\pi)^{p/2}} \exp\left\{ - \frac{1}{2} x_V^T K x_V + \mu^T K x_V - \frac{1}{2} \mu^T K \mu + \frac{1}{2} \log|K| \right\}, && x_V \in \mathbb{R}^p. \end{align*}\] Noting that \(x_V^T K x_V = \sum_{i,j} k_{ij} x_i x_j\) we see that this is an exponential family with canonical parameters \(\eta \equiv K \mu\) and \(K\), and corresponding sufficient statistics \(\phi(x_V) = (x_V, -\frac{1}{2} x_V x_V^T)\).
We then obtain that \(2 A(\theta) = 2 A(K, \eta) = \eta^T K^{-1} \eta - \log|K|\), which by differentiating (using matrix calculus; see the footnote at the end of the chapter) gives \[\begin{align*} \nabla_\eta A(\theta) &= K^{-1} \eta = \mu\\ 2\nabla_K A(\theta) &= - K^{-1} - K^{-1} \eta \eta^T K^{-1} = -\Sigma - \mu \mu^T, \end{align*}\] and these are indeed the expectations of the sufficient statistics. By the previous observation that maximum likelihood estimation corresponds to moment matching for exponential families, this means that the MLEs are \(\hat\mu = \overline{X}_V\) and \(\hat\Sigma = \overline{X_V X_V^T} - \overline{X}_V \overline{X}_V^T\).
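A short simulation sketch (with made-up parameters and sample size, not from the notes) checking that these moment-matching estimators coincide with the sample covariance scaled by \((n-1)/n\):
set.seed(1)
n <- 1000
mu    <- c(1, -1)
Sigma <- matrix(c(2, 1, 1, 3), 2, 2)               # example covariance matrix
X <- MASS::mvrnorm(n, mu = mu, Sigma = Sigma)      # n x p data matrix

mu_hat    <- colMeans(X)                           # sample mean
Sigma_hat <- crossprod(X)/n - tcrossprod(mu_hat)   # mean of X X^T minus outer product of means
Sigma_hat
cov(X) * (n - 1)/n                                 # agrees with Sigma_hat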
Proposition 3.1 Let \(X_V\) have a multivariate Gaussian distribution with concentration matrix \(K = \Sigma^{-1}\). Then \(X_i \mathbin{\perp\hspace{-3.2mm}\perp}X_j \mid X_{V \setminus \{i,j\}}\) if and only if \(k_{ij} = 0\), where \(k_{ij}\) is the corresponding entry in the concentration matrix.
Proof. The log-density is
\[\begin{align*}
\log f(x_V) &= - \frac{1}{2} (x_V - \mu)^T K (x_V - \mu) + \text{const}
\end{align*}\]
where the constant term does not depend upon \(x_V\).
It is clear that the only term involving both \(x_i\) and
\(x_j\) is \(- k_{ij} (x_i-\mu_i) (x_j-\mu_j)\). Hence, \(k_{ij} = 0\)
if and only if the log-density separates into a sum of a term not involving \(x_i\) and a term not involving \(x_j\); equivalently, the density factorizes into a function that does not depend on \(x_i\) times one that does not depend on \(x_j\), which is precisely the stated conditional independence.
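As a small numerical sketch (with a made-up matrix), we can build a concentration matrix with \(k_{13} = 0\) and observe that the implied covariance between \(X_1\) and \(X_3\) is nevertheless non-zero: the two variables are marginally dependent, but conditionally independent given \(X_2\).
K <- matrix(c( 2, -1,  0,
              -1,  2, -1,
               0, -1,  2), 3, 3, byrow = TRUE)  # concentration matrix with k_13 = 0
Sigma <- solve(K)                               # implied covariance matrix
round(Sigma, 3)                                 # note Sigma[1, 3] is not zero
round(solve(Sigma), 3)                          # inverting recovers the zero in position (1, 3)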
We will return to the multivariate Gaussian distribution in Chapter 5.
3.4 Contingency Tables
In this section we will assume that our variables \(X_v\) are discrete with a finite set of levels \({\cal X}_v \equiv \{1,\ldots,d_v\}\). Though we use integers as labels, they can represent something completely arbitrary and unordered such as religion, social preference, or a car model.
Given a vector of these categories \(X_V^{(i)} = (X_1^{(i)}, \ldots, X_p^{(i)})\)
sampled over individuals \(i=1,\ldots,n\), it is
helpful to cross-tabulate their responses. Define:
\[\begin{align*}
n(x_V) \equiv \sum_{i=1}^n \mathbb{I}\{X^{(i)}_1 = x_1, \ldots, X^{(i)}_p = x_p\},
\end{align*}\]
i.e. the number of individuals who have the response
pattern \(x_V\).
These counts are the sufficient statistics
for a multinomial model, whose log-likelihood is
\[\begin{align*}
l(p; \boldsymbol{n}) &= \sum_{x_V} n(x_V) \log p(x_V), && p(x_V) \geq 0, \quad \sum_{x_V} p(x_V) = 1.
\end{align*}\]
Letting \(0_V\) mean the vector of zeros, we can rewrite this as
\[\begin{align*}
l(p; \boldsymbol{n}) &= \sum_{x_V\neq 0_V} n(x_V) \log \frac{p(x_V)}{p(0_V)} + n \log p(0_V).
\end{align*}\]
We immediately obtain that the multinomial distribution
is an exponential family with sufficient statistics
given by the counts \(n(x_V)\), and canonical parameters
given by the ratios of log-probabilities. The cumulant function is \(-\log p(0_V)\),
but it should be written as a function of the canonical parameters; you can
check that this gives
\[\begin{align*}
-\log p(0_V) = \log \left(1 + \sum_{x_V \neq 0_V} e^{\theta(x_V)}\right)
\end{align*}\]
for \(\theta(x_V) = \log p(x_V)/p(0_V)\), which is convex.
Note that canonical parameters
are only unique up to linear transformations; in particular, we could have used
any cell as the reference value instead of \(0_V\). We will use an
alternative parameterization below.
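As a quick check (a minimal sketch with made-up cell probabilities), the map between cell probabilities and this canonical parameterization is easy to compute directly:
p <- c(0.1, 0.2, 0.3, 0.4)                 # hypothetical cell probabilities; cell 1 plays the role of 0_V
theta <- log(p[-1]/p[1])                   # canonical parameters theta(x_V) = log(p(x_V)/p(0_V))
log(1 + sum(exp(theta)))                   # cumulant function; equals -log p(0_V)
-log(p[1])
c(1, exp(theta))/(1 + sum(exp(theta)))     # recovers the original probabilities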
Each possibility \(x_V\) is called a
cell of the table.
Given a subset of the responses \(A \subseteq V\) we may be
interested in the marginal table:
\[\begin{align*}
n(x_A) \equiv \sum_{x_B} n(x_A, x_B),
\end{align*}\]
where \(B = V \setminus A\).
Example 3.2 Consider the death penalty data again:
| Victim | Defendant | Death penalty: Yes | Death penalty: No |
|--------|-----------|--------------------|-------------------|
| White  | White     | 53                 | 414               |
| White  | Black     | 11                 | 37                |
| Black  | White     | 0                  | 16                |
| Black  | Black     | 4                  | 139               |
The marginal table over the death penalty verdict and the defendant's race is:

|     | White defendant | Black defendant |
|-----|-----------------|-----------------|
| Yes | 53              | 15              |
| No  | 430             | 176             |
So if we let \(D\) and \(R\) respectively denote indicators of whether the death penalty was imposed and the defendant’s race then, for example, \(n(x_D = \text{Yes}, x_R = \text{Black}) = 15\).
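In R, the full table \(n(x_V)\) and the marginal table \(n(x_A)\) can be obtained from the raw counts (here entered by hand, matching the counts shown in Example 3.4 below) using xtabs and margin.table:
df <- data.frame(DeathPen  = rep(c("Yes", "No"), 4),
                 Defendant = rep(rep(c("White", "Black"), each = 2), 2),
                 Victim    = rep(c("White", "Black"), each = 4),
                 freq      = c(53, 414, 11, 37, 0, 16, 4, 139))
tab <- xtabs(freq ~ DeathPen + Defendant + Victim, data = df)   # full table n(x_V)
margin.table(tab, margin = c(1, 2))       # marginal table over DeathPen and Defendant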
3.5 Computation
As noted in the introduction, even a moderately sized contingency table will cause statistical problems in practice due to the curse of dimensionality. If we have \(k\) binary variables, then the contingency table will have \(2^k\) cells. Even for \(k=10\) we will have over a thousand possibilities, and for \(k=50\) there are too many cells to store in a computer’s memory.
Conditional independence can help, however; suppose that
\(X_A \mathbin{\perp\hspace{-3.2mm}\perp}X_B \mid X_S\) for disjoint sets \(A\), \(B\), \(S\) with \(A\cup B\cup S = V\), so
that we have
\[\begin{align*}
p(x_V) = p(x_S) \cdot p(x_A \mid x_S) \cdot p(x_B \mid x_S).
\end{align*}\]
Now we can store each of these factors in computer
memory separately,
which means \(2^s + 2^{a+s} + 2^{b+s} = 2^s(1 + 2^a + 2^b)\) cells instead of \(2^{s+a+b}\), where \(a = |A|\), \(b = |B|\) and \(s = |S|\).
This is a considerable saving if \(s\) is small and the minimum of \(a\) and \(b\) is
not too small. With respect to calculations, if we want to find \(P(X_v=1)\) for
some \(v \in A\), then we need only sum over the \(2^{s+a}\) entries in
\(p(x_S) \cdot p(x_A \mid x_S)\)
rather than the \(2^{a+b+s}\) entries in \(p(x_V)\).
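A small sketch of this saving (with made-up sizes \(a=2\), \(b=3\), \(s=1\), all variables binary, and randomly generated factors):
set.seed(42)
a <- 2; b <- 3; s <- 1                    # hypothetical sizes |A|, |B|, |S|

p_S <- c(0.3, 0.7)                        # p(x_S): 2^s entries

# p(x_A | x_S): a (2^a x 2^s) matrix whose columns each sum to one;
# row i corresponds to the i-th configuration of x_A
p_A_S <- matrix(runif(2^a * 2^s), nrow = 2^a)
p_A_S <- sweep(p_A_S, 2, colSums(p_A_S), "/")

# p(x_B | x_S): similarly, a (2^b x 2^s) matrix
p_B_S <- matrix(runif(2^b * 2^s), nrow = 2^b)
p_B_S <- sweep(p_B_S, 2, colSums(p_B_S), "/")

# storage: 2^s (1 + 2^a + 2^b) numbers rather than 2^(a+b+s)
c(factored = length(p_S) + length(p_A_S) + length(p_B_S), full = 2^(a + b + s))

# P(X_v = 1) for the first variable v in A: only p(x_S) and p(x_A | x_S) are
# needed, since p(x_B | x_S) sums to one over x_B
conf_A <- expand.grid(x1 = 1:2, x2 = 1:2)   # configurations of x_A; row i matches row i of p_A_S
sum(p_A_S[conf_A$x1 == 1, ] %*% p_S)        # = P(X_v = 1)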
Of course, if there are other conditional independences present then one might imagine that further computational savings become possible: indeed this is correct, and is one of the main ideas behind graphical models.
3.6 Log-linear models
The log-linear parameters for \(p(x_V) > 0\) are defined by the relation \[\begin{align*} \log p(x_V) &= \sum_{A \subseteq V} \lambda_A(x_A)\\ &= \lambda_\emptyset + \lambda_1(x_1) + \cdots + \lambda_{V}(x_V), \end{align*}\] and the identifiability constraint \(\lambda_A(x_A) = 0\) whenever \(x_a = 1\) for some \(a \in A\). (Other identifiability constraints can also be used.)
In the case of binary variables (that is, each variable
takes only two states, \(d_v = 2\), \({\cal X}_v = \{1,2\}\)), there is only one
possibly non-zero level for each log-linear parameter \(\lambda_A(x_A)\),
which is when \(x_A = (2,\ldots,2)\).
In this case we will
simply write \(\lambda_A = \lambda_A(2,\ldots,2)\).
We will proceed under this assumption
from now on.
Example 3.3 Consider a \(2 \times 2\) table with probabilities \(\pi_{ij} = P(X=i, Y=j)\). The log-linear parametrization has \[\begin{align*} \log \pi_{11} &= \lambda_\emptyset & \log \pi_{21} &= \lambda_\emptyset + \lambda_X\\ \log \pi_{12} &= \lambda_\emptyset + \lambda_Y& \log \pi_{22} &= \lambda_\emptyset + \lambda_X + \lambda_Y + \lambda_{XY}. \end{align*}\] From this we can deduce that \[\begin{align*} \lambda_{XY} &= \log \frac{\pi_{11} \pi_{22}}{\pi_{21} \pi_{12}}. \end{align*}\] The quantity \(\exp\lambda_{XY}\) is called the odds ratio between \(X\) and \(Y\), and is a fundamental quantity in statistical inference.
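For instance (a small sketch using the marginal death-penalty table of Example 3.2, with empirical proportions standing in for the \(\pi_{ij}\)), the log odds-ratio can be computed directly; it is unchanged if counts are used in place of proportions:
tab2 <- matrix(c(53, 430, 15, 176), nrow = 2,
               dimnames = list(DeathPen = c("Yes", "No"), Defendant = c("White", "Black")))
lambda_XY <- log(tab2[1, 1] * tab2[2, 2] / (tab2[2, 1] * tab2[1, 2]))   # empirical log odds-ratio
exp(lambda_XY)                                                          # empirical odds ratio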
Multinomial models can be fitted as Poisson GLMs using the following fact.
Proposition 3.2 Let \(X_i \sim \operatorname{Poisson}(\mu_i)\) independently, and let \(N = \sum_{i=1}^k X_i\). Then, \[\begin{align*} N &\sim \operatorname{Poisson}\left({\textstyle \sum}_i \mu_i\right)\\ (X_1, \ldots, X_k)^T \mid N=n &\sim \operatorname{Multinom}(n, (\pi_1,\ldots,\pi_k)^T), \end{align*}\] where \(\pi_i = \mu_i/\sum_j \mu_j\).
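Since no proof is given, here is a sketch of the conditional distribution calculation: for \(\sum_i x_i = n\),
\[\begin{align*}
P(X_1 = x_1, \ldots, X_k = x_k \mid N = n) &= \frac{\prod_{i} e^{-\mu_i} \mu_i^{x_i}/x_i!}{e^{-\sum_j \mu_j} \left({\textstyle\sum_j} \mu_j\right)^n/n!} = \frac{n!}{\prod_i x_i!} \prod_{i} \left(\frac{\mu_i}{\sum_j \mu_j}\right)^{x_i},
\end{align*}\]
which is the \(\operatorname{Multinom}(n, \pi)\) probability mass function with \(\pi_i = \mu_i/\sum_j \mu_j\).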
3.7 Conditional Independence
Log-linear parameters provide a convenient way of expressing conditional independence constraints, since factorization of a density is equivalent to an additive separation of the log-density.
Theorem 3.1 Let \(p > 0\) be a discrete distribution on \(X_V\) with associated log-linear parameters \(\lambda_C\), \(C\subseteq V\). The conditional independence \(X_a \mathbin{\perp\hspace{-3.2mm}\perp}X_b \mid X_{V \setminus \{a,b\}}\) holds if and only if \(\lambda_C = 0\) for all \(\{a,b\} \subseteq C \subseteq V\).
Proof. See examples sheet.
If there is a conditional independence, then the log-linear parameters can be calculated by just looking at the distribution of each ‘piece’ of the conditional independence separately. For example, suppose that \(X_A \mathbin{\perp\hspace{-3.2mm}\perp}X_B \mid X_C\), where \(A \cup B \cup C = V\). Then, by the factorization characterization of conditional independence, we have \[\begin{align*} p(x_C) \cdot p(x_A, x_B, x_C) = p(x_A, x_C) \cdot p(x_B, x_C), \end{align*}\] and hence \[\begin{align*} \log p(x_A, x_B, x_C) = \log p(x_A, x_C) + \log p(x_B, x_C) - \log p(x_C). \end{align*}\] Then applying the log-linear expansions to each term, we get \[\begin{align*} \sum_{W \subseteq V} \lambda_W(x_W) = \sum_{W \subseteq A\cup C} \lambda^{AC}_W(x_W) + \sum_{W \subseteq B\cup C} \lambda^{BC}_W(x_W) - \sum_{W \subseteq C} \lambda^C_W(x_W). \end{align*}\] Evaluating at the all-ones vector shows that \(\lambda_\emptyset = \lambda^{AC}_\emptyset + \lambda^{BC}_\emptyset - \lambda^{C}_\emptyset\). Now start with a vector \(x_V\) whose entries are all 1 apart from some \(w \in A\) with \(x_w=2\). Then, after cancelling the constant terms using this relation, every remaining term is zero apart from those in \(\lambda_w(x_w) = \lambda^{AC}_w(x_w)\). Similarly we obtain that \(\lambda_w(x_w) = \lambda^{AC}_w(x_w) + \lambda^{BC}_w(x_w) - \lambda^{C}_w(x_w)\) for \(w \in C\). Proceeding in a straightforward way, using these incremental operations, we can deduce that \[\begin{align*} \lambda_W(x_W) &= \lambda_W^{AC}(x_W) && \text{for any } W \subseteq A\cup C \text{ with } W \cap A \neq \emptyset\\ \lambda_W(x_W) &= \lambda_W^{BC}(x_W) && \text{for any } W \subseteq B\cup C \text{ with } W \cap B \neq \emptyset\\ \lambda_W(x_W) &= \lambda_W^{AC}(x_W) + \lambda_W^{BC}(x_W) - \lambda_W^{C}(x_W) && \text{for any } W \subseteq C. \end{align*}\] So under this conditional independence, the log-linear parameters for \(p(x_V)\) are easily obtainable from those for \(p(x_A, x_C)\), \(p(x_B, x_C)\) and \(p(x_C)\).
Example 3.4 Let us now try applying this to our death penalty dataset using R. The file is available on the class website.
## DeathPen Defendant Victim freq
## 1 Yes White White 53
## 2 No White White 414
## 3 Yes Black White 11
## 4 No Black White 37
## 5 Yes White Black 0
## 6 No White Black 16
## 7 Yes Black Black 4
## 8 No Black Black 139
We can fit log-linear models using the glm command with a Poisson response. The following call fits the model corresponding to \(\text{DeathPen} \mathbin{\perp\hspace{-3.2mm}\perp}\text{Defendant} \mid \text{Victim}\).
mod1 <- glm(freq ~ DeathPen*Victim + Defendant*Victim,
family=poisson, data=df)
summary(mod1)$coefficients
The output (edited for brevity) is:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.0610 0.1258 32.283 < 2e-16 ***
DeathPenNo 1.9526 0.1336 14.618 < 2e-16 ***
VictimBlack -4.9711 0.5675 -8.760 < 2e-16 ***
DefendantBlack -2.2751 0.1516 -15.010 < 2e-16 ***
DeathPenNo:VictimBlack 1.7045 0.5237 3.255 0.00114 **
VictimBlack:DefendantBlack 4.4654 0.3041 14.685 < 2e-16 ***
We can verify that the coefficient of Victim-Defendant is the same as the marginal log odds-ratio between those two variables by fitting a model that ignores whether or not the death penalty was administered:
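The call producing the output below is not shown; presumably it was along the following lines (reusing the data frame df, with mod2 a name chosen here for illustration):
mod2 <- glm(freq ~ Defendant*Victim, family=poisson, data=df)
summary(mod2)$coefficients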
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.45318 0.04627 117.84 <2e-16 ***
DefendantBlack -2.27513 0.15157 -15.01 <2e-16 ***
VictimBlack -3.37374 0.25423 -13.27 <2e-16 ***
DefendantBlack:VictimBlack 4.46538 0.30407 14.69 <2e-16 ***
Note that the parameter estimates relating to the Defendant’s race (and their standard errors) are the same as in the larger model.
It is perhaps easier just to recover the predicted counts under the model:
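One way to obtain these (a guess at the call used, since it is not shown) is via the fitted values of mod1:
round(fitted(mod1), 3)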
## 1 2 3 4 5 6 7 8
## 58.035 408.965 5.965 42.035 0.403 15.597 3.597 139.403
Compare these to the actual counts: a goodness of fit test can be performed by using Pearson’s \(\chi^2\) test or (almost equivalently) by looking at the residual deviance of the model.
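A sketch of both checks in R (assuming the model object mod1 and data frame df from above):
X2 <- sum((df$freq - fitted(mod1))^2 / fitted(mod1))     # Pearson's chi-squared statistic
pchisq(X2, df = df.residual(mod1), lower.tail = FALSE)   # asymptotic p-value

# the residual deviance compared to the same chi-squared reference distribution
pchisq(deviance(mod1), df = df.residual(mod1), lower.tail = FALSE)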
Here I use matrix calculus; see, for example, The Matrix Cookbook (though note this is not examinable!)