Chapter 3 Exponential Families and Contingency Tables
For much of the rest of the course we will be dealing with collections of random variables \(X_V \equiv (X_v : v \in V)\), indexed by a set \(V= \{1,\ldots, p\}\). Each \(X_v\) takes values in the set \({\cal X}_v\). For a subset of the variables \(A \subseteq V\), we write \(X_A\) to denote \((X_v : v \in A)\).
3.1 Exponential Families
Let \(p(\cdot; \theta)\) be a collection of probability
densities over \({\cal X}\) indexed by \(\theta \in \Theta\). We say
that \(p\) is an exponential family if it can be
written as
\[\begin{align*}
p(x; \theta) = \exp\left\{ \sum_i \theta_i \phi_i(x) - A(\theta) - C(x) \right\}.
\end{align*}\]
If \(\Theta\) is a non-empty open set then the family is said to be
regular.
The functions \(\phi_i\) are the sufficient statistics,
and the components \(\theta_i\) are called the canonical parameters (or natural parameters). We can
replace the sum with an inner product of vectors \(\theta = (\theta_i)\)
and \(\phi = (\phi_i(x))\):
\[\begin{align*}
p(x; \theta) = \exp\left\{ \langle \theta, \phi(x) \rangle - A(\theta) - C(x)\right\}.
\end{align*}\]
The function \(A(\theta)\) is the cumulant function,
and must be chosen so that the distribution normalizes, i.e.
\[\begin{align*}
A(\theta) = \log \int \exp\left\{ \langle\theta, \phi(x) \rangle - C(x) \right\} \, dx.
\end{align*}\]
\(Z(\theta) \equiv e^{A(\theta)}\) is also called the partition function.
Lemma 3.1 We have \[\begin{align*} \nabla A(\theta) &= \mathbb{E}_{\theta} \phi(X), & \nabla\nabla^T A(\theta) &= \operatorname{Cov}_{\theta} \phi(X). \end{align*}\] Consequently \(A(\theta)\) (and \(- \log p(x; \theta)\)) are convex in \(\theta\). In addition, the map \(\mu(\theta) : \theta \mapsto \nabla A(\theta)\) is bijective, and called the mean function.
Proof. For the first part,
\[\begin{align*}
e^{A(\theta)} \frac{\partial}{\partial \theta_i} A(\theta) &= \frac{\partial}{\partial \theta_i} e^{A(\theta)}\\
&= \frac{\partial}{\partial \theta_i} \int \exp\left\{ \langle \theta, \phi(x) \rangle -C(x)\right\} \, dx\\
&= \int \frac{\partial}{\partial \theta_i} \exp\left\{ \langle \theta, \phi(x) \rangle -C(x)\right\} \, dx\\
&= \int \phi_i(x) \exp\left\{ \langle \theta, \phi(x) \rangle -C(x)\right\} \, dx\\
&= e^{A(\theta)} \int \phi_i(x) \exp\left\{ \langle \theta, \phi(x) \rangle - A(\theta) -C(x)\right\} \, dx\\
&= e^{A(\theta)} \mathbb{E}_{\theta} \phi_i(X).
\end{align*}\]
The result for the Hessian follows similarly. The convexity of \(-\log p(x; \theta) = A(\theta) - \langle \theta, \phi(x) \rangle\)
is now immediate from the fact that its Hessian is a non-negative definite matrix.
That \(\mu(\theta)\) is bijective requires strict convexity; i.e. that
the Hessian is positive definite. This follows from a slight
extension to the above (see the book by Wainwright and Jordan, Proposition 3.1).
The property of convexity plays an important role in the computational advantages of exponential families. Convex functions are easy to work with for the purposes of optimization: in particular, any local minimum of a convex function is also a global minimum, so there are no spurious local optima to get trapped in.
Example 3.1 Let \(X \sim \operatorname{Poisson}(\lambda)\). We have \[\begin{align*} p_\lambda(x) &= e^{-\lambda} \frac{\lambda^x}{x!} = \frac{1}{x!} \exp\left\{x \log \lambda - \lambda\right\}. \end{align*}\] Clearly the canonical parameter is \(\theta = \log\lambda\), so we can rewrite as \[\begin{align*} p_\theta(x) &= \frac{1}{x!} \exp\left\{\theta x - e^{\theta}\right\}, \end{align*}\] giving \(A(\theta) = e^\theta\) (which is convex, as expected). Note that \(A'(\theta) = A''(\theta) = e^\theta = \lambda\), which is indeed the mean and variance of a Poisson distribution.
3.2 Empirical Moment Matching
To find the maximum likelihood estimate in an exponential family, we maximize the log-likelihood (ignoring \(C\), since it is constant in \(\theta\)) \[\begin{align*} l(\theta; X^{(1)},\ldots, X^{(n)}) &= \left\langle \sum_{i=1}^n \phi(X^{(i)}), \theta \right\rangle - n A(\theta)\\ n^{-1} l(\theta; X^{(1)},\ldots, X^{(n)}) &= \langle \overline{\phi(X)}, \theta \rangle - A(\theta) \end{align*}\] where \(\overline{\phi(X)} = n^{-1} \sum_i \phi(X^{(i)})\) is the sample mean of the sufficient statistics. To maximize this, we can differentiate and set to zero, obtaining \[\begin{align*} \overline{\phi(X)} - \nabla A(\theta) = 0, \end{align*}\] so in other words when we choose \(\theta\) so that \(\mathbb{E}_\theta \phi(X) = \overline{\phi(X)}\): the mean of the sufficient statistics matches the empirical mean from the data.
Note also that if we differentiate just with respect to \(\theta_i\), we obtain the same result for each sufficient statistic separately; hence if we update the parameters to match the moment \(\overline{\phi_i(X)} = \mathbb{E}_{\theta} \phi_i(X)\), then we increase the log-likelihood. If we iterate this over \(i\), we will converge to the global maximum likelihood estimate, because the log-likelihood is a (strictly) concave and differentiable function.
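As a concrete illustration (a minimal simulated sketch, not from the notes), for the Poisson family of Example 3.1 the moment-matching equation \(\exp(\hat\theta) = \overline{x}\) gives the MLE in closed form, and agrees with a direct numerical maximization of the log-likelihood:
set.seed(3)
x <- rpois(200, lambda = 4)                      # simulated Poisson data
theta_hat <- log(mean(x))                        # solves mean(phi(X)) = A'(theta) = exp(theta)
nll <- function(theta) -sum(dpois(x, exp(theta), log = TRUE))
optimize(nll, interval = c(-10, 10))$minimum     # numerical MLE; agrees with theta_hat
theta_hat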
3.3 Multivariate Gaussian Distribution
Let \(X_V = (X_1,\ldots,X_p)^T \in \mathbb{R}^p\) be a random vector.
Let \(\mu \in \mathbb{R}^p\) and \(\Sigma \in \mathbb{R}^{p\times p}\) be
a positive definite symmetric matrix. We say that \(X_V\) has a
multivariate Gaussian distribution with parameters \(\mu\) and
\(\Sigma\) if the joint density is
\[\begin{align*}
f(x_V) &= \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{ - \frac{1}{2} (x_V - \mu)^T \Sigma^{-1} (x_V - \mu) \right\}, && x_V \in \mathbb{R}^p.
\end{align*}\]
This is also called the multivariate normal distribution.
The concentration matrix is \(K \equiv \Sigma^{-1}\).
We can rewrite this as \[\begin{align*} f(x_V) &= \frac{1}{(2\pi)^{p/2}} \exp\left\{ - \frac{1}{2} x_V^T K x_V + \mu^T K x_V - \frac{1}{2} \mu^T K \mu + \frac{1}{2} \log|K| \right\}, && x_V \in \mathbb{R}^p. \end{align*}\] Noting that \(x_V^T K x_V = \sum_{i,j} k_{ij} x_i x_j\) we see that this is an exponential family with canonical parameters \(\eta \equiv K \mu\) and \(K\), and corresponding sufficient statistics \(\phi(x_V) = (x_V, -\frac{1}{2} x_V x_V^T)\).
We then obtain that \(2 A(\theta) = 2 A(K, \eta) = \eta^T K^{-1} \eta - \log|K|\), which by differentiating (using matrix calculus; see the footnote at the end of the chapter) gives \[\begin{align*} \nabla_\eta A(\theta) &= K^{-1} \eta = \mu\\ 2\nabla_K A(\theta) &= - K^{-1} - K^{-1} \eta \eta^T K^{-1} = -\Sigma - \mu \mu^T, \end{align*}\] and these are indeed the expectations of the sufficient statistics. By the previous observation that maximum likelihood estimation corresponds to moment matching for exponential families, this means that the MLEs are \(\hat\mu = \overline{X}_V\) and \(\hat\Sigma = \overline{X_V X_V^T} - \overline{X}_V \overline{X}_V^T\).
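A short simulation sketch (with made-up parameters and sample size, not from the notes) checking that these moment-matching estimators coincide with the sample covariance scaled by \((n-1)/n\):
set.seed(1)
n <- 1000
mu    <- c(1, -1)
Sigma <- matrix(c(2, 1, 1, 3), 2, 2)               # example covariance matrix
X <- MASS::mvrnorm(n, mu = mu, Sigma = Sigma)      # n x p data matrix

mu_hat    <- colMeans(X)                           # sample mean
Sigma_hat <- crossprod(X)/n - tcrossprod(mu_hat)   # mean of X X^T minus outer product of means
Sigma_hat
cov(X) * (n - 1)/n                                 # agrees with Sigma_hat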
Proposition 3.1 Let \(X_V\) have a multivariate Gaussian distribution with concentration matrix \(K = \Sigma^{-1}\). Then \(X_i \mathbin{\perp\hspace{-3.2mm}\perp}X_j \mid X_{V \setminus \{i,j\}}\) if and only if \(k_{ij} = 0\), where \(k_{ij}\) is the corresponding entry in the concentration matrix.
Proof. The log-density is
\[\begin{align*}
\log f(x_V) &= - \frac{1}{2} (x_V - \mu)^T K (x_V - \mu) + \text{const}
\end{align*}\]
where the constant term does not depend upon \(x_V\).
It is clear that the only term involving both \(x_i\) and
\(x_j\) is \(- k_{ij} (x_i-\mu_i) (x_j-\mu_j)\). Hence, \(k_{ij} = 0\)
if and only if the log-density separates into a sum of a term not involving \(x_i\) and a term not involving \(x_j\); equivalently, the density factorizes into a function that does not depend on \(x_i\) times one that does not depend on \(x_j\), which is precisely the stated conditional independence.
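As a small numerical sketch (with a made-up matrix), we can build a concentration matrix with \(k_{13} = 0\) and observe that the implied covariance between \(X_1\) and \(X_3\) is nevertheless non-zero: the two variables are marginally dependent, but conditionally independent given \(X_2\).
K <- matrix(c( 2, -1,  0,
              -1,  2, -1,
               0, -1,  2), 3, 3, byrow = TRUE)  # concentration matrix with k_13 = 0
Sigma <- solve(K)                               # implied covariance matrix
round(Sigma, 3)                                 # note Sigma[1, 3] is not zero
round(solve(Sigma), 3)                          # inverting recovers the zero in position (1, 3)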
We will return to the multivariate Gaussian distribution in Chapter 5.
3.4 Contingency Tables
In this section we will assume that our variables \(X_v\) are discrete with a finite set of levels \({\cal X}_v \equiv \{1,\ldots,d_v\}\). Though we use integers as labels, they can represent something completely arbitrary and unordered such as religion, social preference, or a car model.
Given a vector of these categories \(X_V^{(i)} = (X_1^{(i)}, \ldots, X_p^{(i)})\)
sampled over individuals \(i=1,\ldots,n\), it is
helpful to cross-tabulate their responses. Define:
\[\begin{align*}
n(x_V) \equiv \sum_{i=1}^n \mathbb{I}\{X^{(i)}_1 = x_1, \ldots, X^{(i)}_p = x_p\},
\end{align*}\]
i.e. the number of individuals who have the response
pattern \(x_V\).
These counts are the sufficient statistics
for a multinomial model, whose log-likelihood is
\[\begin{align*}
l(p; \boldsymbol{n}) &= \sum_{x_V} n(x_V) \log p(x_V), && p(x_V) \geq 0, \quad \sum_{x_V} p(x_V) = 1.
\end{align*}\]
Letting \(0_V\) mean the vector of zeros, we can rewrite this as
\[\begin{align*}
l(p; \boldsymbol{n}) &= \sum_{x_V\neq 0_V} n(x_V) \log \frac{p(x_V)}{p(0_V)} + n \log p(0_V).
\end{align*}\]
We immediately obtain that the multinomial distribution
is an exponential family with sufficient statistics
given by the counts \(n(x_V)\), and canonical parameters
given by the ratios of log-probabilities. The cumulant function is \(-\log p(0_V)\),
but it should be written as a function of the canonical parameters; you can
check that this gives
\[\begin{align*}
-\log p(0_V) = \log \left(1 + \sum_{x_V \neq 0_V} e^{\theta(x_V)}\right)
\end{align*}\]
for \(\theta(x_V) = \log p(x_V)/p(0_V)\), which is convex.
Note that canonical parameters
are only unique up to linear transformations; in particular, we could have used
any cell as the reference value instead of \(0_V\). We will use an
alternative parameterization below.
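As a quick check (a minimal sketch with made-up cell probabilities), the map between cell probabilities and this canonical parameterization is easy to compute directly:
p <- c(0.1, 0.2, 0.3, 0.4)                 # hypothetical cell probabilities; cell 1 plays the role of 0_V
theta <- log(p[-1]/p[1])                   # canonical parameters theta(x_V) = log(p(x_V)/p(0_V))
log(1 + sum(exp(theta)))                   # cumulant function; equals -log p(0_V)
-log(p[1])
c(1, exp(theta))/(1 + sum(exp(theta)))     # recovers the original probabilities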
Each possibility \(x_V\) is called a
cell of the table.
Given a subset of the responses \(A \subseteq V\) we may be
interested in the marginal table:
\[\begin{align*}
n(x_A) \equiv \sum_{x_B} n(x_A, x_B),
\end{align*}\]
where \(B = V \setminus A\).
Example 3.2 Consider the death penalty data again:
| Victim | Defendant | Death penalty: Yes | Death penalty: No |
|--------|-----------|--------------------|-------------------|
| White  | White     | 53                 | 414               |
| White  | Black     | 11                 | 37                |
| Black  | White     | 0                  | 16                |
| Black  | Black     | 4                  | 139               |
The marginal table over the death penalty verdict and the defendant's race is:

|     | White defendant | Black defendant |
|-----|-----------------|-----------------|
| Yes | 53              | 15              |
| No  | 430             | 176             |
So if we let \(D\) and \(R\) respectively denote indicators of whether the death penalty was imposed and the defendant’s race then, for example, \(n(x_D = \text{Yes}, x_R = \text{Black}) = 15\).
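In R, the full table \(n(x_V)\) and the marginal table \(n(x_A)\) can be obtained from the raw counts (here entered by hand, matching the counts shown in Example 3.4 below) using xtabs and margin.table:
df <- data.frame(DeathPen  = rep(c("Yes", "No"), 4),
                 Defendant = rep(rep(c("White", "Black"), each = 2), 2),
                 Victim    = rep(c("White", "Black"), each = 4),
                 freq      = c(53, 414, 11, 37, 0, 16, 4, 139))
tab <- xtabs(freq ~ DeathPen + Defendant + Victim, data = df)   # full table n(x_V)
margin.table(tab, margin = c(1, 2))       # marginal table over DeathPen and Defendant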
3.5 Computation
As noted in the introduction, even a moderately sized contingency table will cause statistical problems in practice due to the curse of dimensionality. If we have \(k\) binary variables, then the contingency table will have \(2^k\) cells. Even for \(k=10\) we will have over a thousand possibilities, and for \(k=50\) there are too many cells to store in a computer’s memory.
Conditional independence can help, however; suppose that
\(X_A \mathbin{\perp\hspace{-3.2mm}\perp}X_B \mid X_S\) for disjoint sets \(A\), \(B\), \(S\) with \(A\cup B\cup S = V\), so
that we have
\[\begin{align*}
p(x_V) = p(x_S) \cdot p(x_A \mid x_S) \cdot p(x_B \mid x_S).
\end{align*}\]
Now we can store each of these factors in computer
memory separately,
which means \(2^s + 2^{a+s} + 2^{b+s} = 2^s(1 + 2^a + 2^b)\) cells instead of \(2^{s+a+b}\), where \(a = |A|\), \(b = |B|\) and \(s = |S|\).
This is a considerable saving if \(s\) is small and the minimum of \(a\) and \(b\) is
not too small. With respect to calculations, if we want to find \(P(X_v=1)\) for
some \(v \in A\), then we need only sum over the \(2^{s+a}\) entries in
\(p(x_S) \cdot p(x_A \mid x_S)\)
rather than the \(2^{a+b+s}\) entries in \(p(x_V)\).
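A small sketch of this saving (with made-up sizes \(a=2\), \(b=3\), \(s=1\), all variables binary, and randomly generated factors):
set.seed(42)
a <- 2; b <- 3; s <- 1                    # hypothetical sizes |A|, |B|, |S|

p_S <- c(0.3, 0.7)                        # p(x_S): 2^s entries

# p(x_A | x_S): a (2^a x 2^s) matrix whose columns each sum to one;
# row i corresponds to the i-th configuration of x_A
p_A_S <- matrix(runif(2^a * 2^s), nrow = 2^a)
p_A_S <- sweep(p_A_S, 2, colSums(p_A_S), "/")

# p(x_B | x_S): similarly, a (2^b x 2^s) matrix
p_B_S <- matrix(runif(2^b * 2^s), nrow = 2^b)
p_B_S <- sweep(p_B_S, 2, colSums(p_B_S), "/")

# storage: 2^s (1 + 2^a + 2^b) numbers rather than 2^(a+b+s)
c(factored = length(p_S) + length(p_A_S) + length(p_B_S), full = 2^(a + b + s))

# P(X_v = 1) for the first variable v in A: only p(x_S) and p(x_A | x_S) are
# needed, since p(x_B | x_S) sums to one over x_B
conf_A <- expand.grid(x1 = 1:2, x2 = 1:2)   # configurations of x_A; row i matches row i of p_A_S
sum(p_A_S[conf_A$x1 == 1, ] %*% p_S)        # = P(X_v = 1)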
Of course, if there are other conditional independences present then one might imagine that further computational savings become possible: indeed this is correct, and is one of the main ideas behind graphical models.
3.6 Log-linear models
The log-linear parameters for \(p(x_V) > 0\) are defined by the relation \[\begin{align*} \log p(x_V) &= \sum_{A \subseteq V} \lambda_A(x_A)\\ &= \lambda_\emptyset + \lambda_1(x_1) + \cdots + \lambda_{V}(x_V), \end{align*}\] and the identifiability constraint \(\lambda_A(x_A) = 0\) whenever \(x_a = 1\) for some \(a \in A\). (Other identifiability constraints can also be used.)
In the case of binary variables (that is, each variable
takes only two states, \(d_v = 2\), \({\cal X}_v = \{1,2\}\)), there is only one
possibly non-zero level for each log-linear parameter \(\lambda_A(x_A)\),
which is when \(x_A = (2,\ldots,2)\).
In this case we will
simply write \(\lambda_A = \lambda_A(2,\ldots,2)\).
We will proceed under this assumption
from now on.
Example 3.3 Consider a \(2 \times 2\) table with probabilities \(\pi_{ij} = P(X=i, Y=j)\). The log-linear parametrization has \[\begin{align*} \log \pi_{11} &= \lambda_\emptyset & \log \pi_{21} &= \lambda_\emptyset + \lambda_X\\ \log \pi_{12} &= \lambda_\emptyset + \lambda_Y& \log \pi_{22} &= \lambda_\emptyset + \lambda_X + \lambda_Y + \lambda_{XY}. \end{align*}\] From this we can deduce that \[\begin{align*} \lambda_{XY} &= \log \frac{\pi_{11} \pi_{22}}{\pi_{21} \pi_{12}}. \end{align*}\] The quantity \(\exp\lambda_{XY}\) is called the odds ratio between \(X\) and \(Y\), and is a fundamental quantity in statistical inference.
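For instance (a small sketch using the marginal death-penalty table of Example 3.2, with empirical proportions standing in for the \(\pi_{ij}\)), the log odds-ratio can be computed directly; it is unchanged if counts are used in place of proportions:
tab2 <- matrix(c(53, 430, 15, 176), nrow = 2,
               dimnames = list(DeathPen = c("Yes", "No"), Defendant = c("White", "Black")))
lambda_XY <- log(tab2[1, 1] * tab2[2, 2] / (tab2[2, 1] * tab2[1, 2]))   # empirical log odds-ratio
exp(lambda_XY)                                                          # empirical odds ratio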
Multinomial models can be fitted as Poisson GLMs using the following fact.
Proposition 3.2 Let \(X_i \sim \operatorname{Poisson}(\mu_i)\) independently, and let \(N = \sum_{i=1}^k X_i\). Then, \[\begin{align*} N &\sim \operatorname{Poisson}\left({\textstyle \sum}_i \mu_i\right)\\ (X_1, \ldots, X_k)^T \mid N=n &\sim \operatorname{Multinom}(n, (\pi_1,\ldots,\pi_k)^T), \end{align*}\] where \(\pi_i = \mu_i/\sum_j \mu_j\).
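Since no proof is given, here is a sketch of the conditional distribution calculation: for \(\sum_i x_i = n\),
\[\begin{align*}
P(X_1 = x_1, \ldots, X_k = x_k \mid N = n) &= \frac{\prod_{i} e^{-\mu_i} \mu_i^{x_i}/x_i!}{e^{-\sum_j \mu_j} \left({\textstyle\sum_j} \mu_j\right)^n/n!} = \frac{n!}{\prod_i x_i!} \prod_{i} \left(\frac{\mu_i}{\sum_j \mu_j}\right)^{x_i},
\end{align*}\]
which is the \(\operatorname{Multinom}(n, \pi)\) probability mass function with \(\pi_i = \mu_i/\sum_j \mu_j\).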
3.7 Conditional Independence
Log-linear parameters provide a convenient way of expressing conditional independence constraints, since factorization of a density is equivalent to an additive separation of the log-density.
Theorem 3.1 Let \(p > 0\) be a discrete distribution on \(X_V\) with associated log-linear parameters \(\lambda_C\), \(C\subseteq V\). The conditional independence \(X_a \mathbin{\perp\hspace{-3.2mm}\perp}X_b \mid X_{V \setminus \{a,b\}}\) holds if and only if \(\lambda_C = 0\) for all \(\{a,b\} \subseteq C \subseteq V\).
Proof. See examples sheet.
If there is a conditional independence, then the log-linear parameters can be calculated by just looking at the distribution of each ‘piece’ of the conditional independence separately. For example, suppose that \(X_A \mathbin{\perp\hspace{-3.2mm}\perp}X_B \mid X_C\), where \(A \cup B \cup C = V\). Then, by the factorization characterization of conditional independence, we have \[\begin{align*} p(x_C) \cdot p(x_A, x_B, x_C) = p(x_A, x_C) \cdot p(x_B, x_C), \end{align*}\] and hence \[\begin{align*} \log p(x_A, x_B, x_C) = \log p(x_A, x_C) + \log p(x_B, x_C) - \log p(x_C). \end{align*}\] Then applying the log-linear expansions to each term, we get \[\begin{align*} \sum_{W \subseteq V} \lambda_W(x_W) = \sum_{W \subseteq A\cup C} \lambda^{AC}_W(x_W) + \sum_{W \subseteq B\cup C} \lambda^{BC}_W(x_W) - \sum_{W \subseteq C} \lambda^C_W(x_W). \end{align*}\] Evaluating at the all-ones vector shows that \(\lambda_\emptyset = \lambda^{AC}_\emptyset + \lambda^{BC}_\emptyset - \lambda^{C}_\emptyset\). Now start with a vector \(x_V\) whose entries are all 1 apart from some \(w \in A\) with \(x_w=2\). Then, after cancelling the constant terms using this relation, every remaining term is zero apart from those in \(\lambda_w(x_w) = \lambda^{AC}_w(x_w)\). Similarly we obtain that \(\lambda_w(x_w) = \lambda^{AC}_w(x_w) + \lambda^{BC}_w(x_w) - \lambda^{C}_w(x_w)\) for \(w \in C\). Proceeding in a straightforward way, using these incremental operations, we can deduce that \[\begin{align*} \lambda_W(x_W) &= \lambda_W^{AC}(x_W) && \text{for any } W \subseteq A\cup C \text{ with } W \cap A \neq \emptyset\\ \lambda_W(x_W) &= \lambda_W^{BC}(x_W) && \text{for any } W \subseteq B\cup C \text{ with } W \cap B \neq \emptyset\\ \lambda_W(x_W) &= \lambda_W^{AC}(x_W) + \lambda_W^{BC}(x_W) - \lambda_W^{C}(x_W) && \text{for any } W \subseteq C. \end{align*}\] So under this conditional independence, the log-linear parameters for \(p(x_V)\) are easily obtainable from those for \(p(x_A, x_C)\), \(p(x_B, x_C)\) and \(p(x_C)\).
Example 3.4 Let us now try applying this to our death penalty dataset using R. The file is available on the class website.
## DeathPen Defendant Victim freq
## 1 Yes White White 53
## 2 No White White 414
## 3 Yes Black White 11
## 4 No Black White 37
## 5 Yes White Black 0
## 6 No White Black 16
## 7 Yes Black Black 4
## 8 No Black Black 139
We can fit log-linear models using the glm command with a Poisson response. The following call fits the model corresponding to \(\text{DeathPen} \mathbin{\perp\hspace{-3.2mm}\perp}\text{Defendant} \mid \text{Victim}\).
mod1 <- glm(freq ~ DeathPen*Victim + Defendant*Victim,
family=poisson, data=df)
summary(mod1)$coefficients
The output (edited for brevity) is:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.0610 0.1258 32.283 < 2e-16 ***
DeathPenNo 1.9526 0.1336 14.618 < 2e-16 ***
VictimBlack -4.9711 0.5675 -8.760 < 2e-16 ***
DefendantBlack -2.2751 0.1516 -15.010 < 2e-16 ***
DeathPenNo:VictimBlack 1.7045 0.5237 3.255 0.00114 **
VictimBlack:DefendantBlack 4.4654 0.3041 14.685 < 2e-16 ***
We can verify that the coefficient of Victim-Defendant is the same as the marginal log odds-ratio between those two variables by fitting a model that ignores whether or not the death penalty was administered:
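The call producing the output below is not shown; presumably it was along the following lines (reusing the data frame df, with mod2 a name chosen here for illustration):
mod2 <- glm(freq ~ Defendant*Victim, family=poisson, data=df)
summary(mod2)$coefficients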
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.45318 0.04627 117.84 <2e-16 ***
DefendantBlack -2.27513 0.15157 -15.01 <2e-16 ***
VictimBlack -3.37374 0.25423 -13.27 <2e-16 ***
DefendantBlack:VictimBlack 4.46538 0.30407 14.69 <2e-16 ***
Note that the parameter estimates relating to the Defendant’s race (and their standard errors) are the same as in the larger model.
It is perhaps easier just to recover the predicted counts under the model:
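One way to obtain these (a guess at the call used, since it is not shown) is via the fitted values of mod1:
round(fitted(mod1), 3)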
## 1 2 3 4 5 6 7 8
## 58.035 408.965 5.965 42.035 0.403 15.597 3.597 139.403
Compare these to the actual counts: a goodness of fit test can be performed by using Pearson’s \(\chi^2\) test or (almost equivalently) by looking at the residual deviance of the model.
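A sketch of both checks in R (assuming the model object mod1 and data frame df from above):
X2 <- sum((df$freq - fitted(mod1))^2 / fitted(mod1))     # Pearson's chi-squared statistic
pchisq(X2, df = df.residual(mod1), lower.tail = FALSE)   # asymptotic p-value

# the residual deviance compared to the same chi-squared reference distribution
pchisq(deviance(mod1), df = df.residual(mod1), lower.tail = FALSE)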
Here I use matrix calculus; see, for example, The Matrix Cookbook (though note this is not examinable!)