Chapter 2 Conditional Independence
The primary tool we will use to obtain statistically and computationally feasible models is conditional independence. It ensures that distributions factorize into smaller pieces that can be evaluated separately and quickly.
2.1 Independence
Recall that two discrete variables \(X\) and \(Y\) are independent if \[\begin{align*} P(X=x, Y=y) &= P(X=x) \cdot P(Y=y) && \forall x \in {\cal X}, y \in {\cal Y}. \end{align*}\] Note that this is equivalent to \[\begin{align*} P(X=x \mid Y=y) &= P(X=x) && \text{whenever }P(Y=y) > 0, \forall x \in {\cal X}. \end{align*}\] In other words, knowing the value of \(Y\) gives us no information about the distribution of \(X\); we say that \(Y\) is irrelevant for \(X\). Similarly, two variables with joint density \(f_{XY}\) are independent if \[\begin{align*} f_{XY}(x, y) &= f_X(x) \cdot f_Y(y) && \forall x \in {\cal X}, y \in {\cal Y}. \end{align*}\] The qualification that these expressions hold for all \((x, y) \in {\cal X}\times {\cal Y}\), a product space, is very important1, and sometimes forgotten.
Example 2.1 Suppose that \(X, W\) are independent Exponential\((\lambda)\) random variables. Define \(Y = X + W\). Then the joint density of \(X\) and \(Y\) is \[\begin{align*} f_{XY}(x, y) &= \left\{ \begin{array}{ll} \lambda^2 e^{- \lambda y} & \text{if $y > x > 0$}, \\ 0 & \text{otherwise} \end{array} \right.. \end{align*}\] Note that the expression factorizes within the region where it is positive, so if one forgets the constraint \(y > x > 0\) when performing the usual change of variables, one may mistakenly conclude that \(X\) and \(Y\) are independent.
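To see the dependence concretely, here is a minimal simulation sketch (not part of the example; the rate \(\lambda = 1\), the sample size, and the conditioning event are arbitrary choices) comparing the marginal and conditional behaviour of \(X\):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n = 1.0, 1_000_000

x = rng.exponential(1 / lam, size=n)   # X ~ Exponential(lambda)
w = rng.exponential(1 / lam, size=n)   # W ~ Exponential(lambda), independent of X
y = x + w                              # Y = X + W

# If X and Y were independent, P(X > 1) and P(X > 1 | Y < 1.5) would agree.
print(np.mean(x > 1))           # approximately exp(-1) = 0.37
print(np.mean(x[y < 1.5] > 1))  # approximately 0.08, so X and Y are clearly dependent
```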
2.2 Conditional Independence
Given random variables \(X,Y\) we denote the joint density \(p(x,y)\), and call \[\begin{align*} p(y) &= \int_{\cal X}p(x,y) \, dx \end{align*}\] the marginal density (of \(Y\)). The conditional density of \(X\) given \(Y\) is defined as any function \(p(x \,|\, y)\) such that \[\begin{align*} p(x, y) &= p(y) \cdot p(x \,|\, y). \end{align*}\] Note that if \(p(y) > 0\) then the solution is unique and given by the familiar expression \[\begin{align*} p(x \,|\, y) &= \frac{p(x, y)}{p(y)}. \end{align*}\]
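For a small discrete illustration (a made-up \(2 \times 3\) joint table, not from the notes), these definitions amount to simple array operations:

```python
import numpy as np

# Hypothetical joint distribution p(x, y) on a 2 x 3 product space.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.15, 0.25, 0.20]])

p_y = p_xy.sum(axis=0)        # marginal p(y): sum (integrate) over x
p_x_given_y = p_xy / p_y      # conditional p(x | y), defined wherever p(y) > 0

# The defining identity p(x, y) = p(y) * p(x | y) holds by construction.
assert np.allclose(p_xy, p_y * p_x_given_y)
print(p_x_given_y)
```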
Definition 2.1 Let \(X,Y\) be random variables defined on a product space \({\cal X}\times {\cal Y}\); let \(Z\) be a third random variable, and let the joint density be \(p(x,y,z)\). We say that \(X\) and \(Y\) are conditionally independent given \(Z\) if \[\begin{align*} p(x \,|\, y, z) &= p(x \,|\, z), &\forall x \in {\cal X}, y \in {\cal Y}, z \in {\cal Z}\text{ such that } p(y,z) > 0. \end{align*}\] When this holds we write \(X \mathbin{\perp\hspace{-3.2mm}\perp}Y \mid Z \, [p]\), possibly omitting the \(p\) for brevity.
In other words, once \(Z=z\) is known, the value of \(Y\) provides no additional information that would allow us to predict or model \(X\). If \(Z\) is degenerate (that is, there is some \(z\) such that \(P(Z=z) = 1\)), then the definition above is the same as saying that \(X\) and \(Y\) are independent. This is called marginal independence, and is denoted \(X \mathbin{\perp\hspace{-3.2mm}\perp}Y\).
Example 2.2 Let \(X_1, \ldots, X_k\) be a Markov chain. Then \(X_k\) is independent of \(X_1, \ldots, X_{k-2}\) conditional upon \(X_{k-1}\): \[\begin{align*} P(X_k = x \mid X_{k-1}=x_{k-1}, \ldots, X_1 = x_1) = P(X_k = x \mid X_{k-1}=x_{k-1}) \end{align*}\] for all \(x, x_{k-1}, \ldots, x_1\). That is, \(X_k \mathbin{\perp\hspace{-3.2mm}\perp}X_1, \ldots, X_{k-2} \mid X_{k-1}\). This is known as the Markov property, or memoryless property.
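A quick simulation makes the memoryless property visible (a sketch with an arbitrary two-state transition matrix, not part of the example): the estimate of \(P(X_3 = 1 \mid X_2 = 1, X_1 = x_1)\) is the same for both values of \(x_1\).

```python
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[0.7, 0.3],        # hypothetical transition matrix on states {0, 1}
              [0.4, 0.6]])
n = 500_000

x1 = rng.integers(0, 2, size=n)                  # X_1 uniform on {0, 1}
x2 = (rng.random(n) < P[x1, 1]).astype(int)      # X_2 | X_1 = s has P(X_2 = 1) = P[s, 1]
x3 = (rng.random(n) < P[x2, 1]).astype(int)      # X_3 | X_2, not using X_1

for a in (0, 1):
    sel = (x2 == 1) & (x1 == a)
    print(a, np.mean(x3[sel]))   # both close to P[1, 1] = 0.6
```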
Although the definition of conditional independence appears to be asymmetric in \(X\) and \(Y\), in fact it is not: if \(Y\) gives no additional information about \(X\) then the reverse is also true, as the following theorem shows.
Theorem 2.1 Let \(X,Y,Z\) be random variables on a Cartesian product space.
The following are equivalent.
(i) \(p(x \,|\, y, z) = p(x \,|\, z)\) for all \(x,y,z\) such that \(p(y,z) > 0\);
(ii) \(p(x, y \,|\, z) = p(x \,|\, z) \cdot p(y \,|\, z)\) for all \(x,y,z\) such that \(p(z) > 0\);
(iii) \(p(x, y, z) = p(y, z) \cdot p(x \,|\, z)\) for all \(x,y,z\);
(iv) \(p(z) \cdot p(x,y,z) = p(x,z) \cdot p(y,z)\) for all \(x,y,z\);
(v) \(p(x, y, z) = f(x, z) \cdot g(y, z)\) for some functions \(f,g\) and all \(x,y,z\).
Remark. Note that above we use expressions such as ‘for all \(x,y,z\) such that \(p(y,z) > 0\)’; this should be formally interpreted as the measure theoretic notion ‘\(\sigma(Y,Z)\)-almost everywhere’. In this course measure theoretic considerations are suppressed, so writing imprecise (though morally correct) statements such as ‘\(p(y,z) > 0 \implies p(z) > 0\)’ is fine. In the case of discrete distributions, the two formulations are equivalent.
Proof. Note that \(p(y,z) > 0\) implies \(p(z) > 0\), so (i) \(\implies\) (ii) follows from multiplying by \(p(y \,|\, z)\), and (ii) \(\implies\) (iii) by multiplying by \(p(z)\). (iii) \(\implies\) (i) directly.
The equivalence of (iii) and (iv) is also clear (note that if \(p(z)=0\) then both sides of (iii) are 0), and (iii) implies (v). It remains to prove that (v) implies the others. Suppose that (v) holds. Then \[\begin{align*} p(y,z) = \int p(x,y,z) \, dx = g(y,z) \int f(x,z) \, dx = g(y,z) \cdot \tilde{f}(z). \end{align*}\] If \(\tilde{f}(z) > 0\) (which happens whenever \(p(z) > 0\)) we have \[\begin{align*} p(x,y,z) = \frac{f(x,z)}{\tilde{f}(z)} p(y,z). \end{align*}\] But by definition \(f(x,z)/\tilde{f}(z)\) is a version of \(p(x \,|\, y,z)\); since it does not depend upon \(y\), integrating over \(y\) shows that it is also a version of \(p(x \,|\, z)\), and so we obtain (iii).
Conditional independence is a complicated and often unintuitive notion, as the next example illustrates.
Example 2.3 Below is a famous data set recording the races of the victim and the defendant in murder cases in Florida between 1976 and 1987, together with whether or not the death penalty was imposed upon the defendant. The data are presented as counts, though we can turn this into an empirical probability distribution by dividing by the total, 674.
Victim: White

| Death penalty | White defendant | Black defendant |
|---|---|---|
| Yes | 53 | 11 |
| No | 414 | 37 |

Victim: Black

| Death penalty | White defendant | Black defendant |
|---|---|---|
| Yes | 0 | 4 |
| No | 16 | 139 |
If we add those two \(2 \times 2\) tables together, we obtain the marginal table:
| Death penalty | White defendant | Black defendant |
|---|---|---|
| Yes | 53 | 15 |
| No | 430 | 176 |
Here we see that the chance of receiving a death sentence is approximately independent of the defendant’s race. \(P(\text{Death} \mid \text{White}) = 53/(53+430) = 0.11\), \(P(\text{Death} \mid \text{Black}) = 15/(15+176) = 0.08\). (One could fiddle the numbers to obtain exact independence.)
However, restricting only to cases where the victim was white, we see that black defendants received the death penalty in about 23% of cases (11 out of 48), compared with roughly 11% (53 out of 467) for white defendants. And for black victims the story is the same: a handful of black defendants were sentenced to death, while no white defendants were. (In fact we will see in Section 3.4 that this conditional dependence is not statistically significant either, but for the purposes of this discussion that doesn't matter: we could multiply all the counts by 10 and obtain a data set in which the associations are significant. For more on this data set, take a look at Example 2.3.2 in the book Categorical Data Analysis by Agresti.)
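The reversal is easy to verify directly from the counts in the tables above; here is a short sketch:

```python
import numpy as np

# Counts [death, no death], taken from the tables above.
white_victim = {"White": np.array([53, 414]), "Black": np.array([11, 37])}
black_victim = {"White": np.array([0, 16]),   "Black": np.array([4, 139])}

for defendant in ("White", "Black"):
    marginal = white_victim[defendant] + black_victim[defendant]
    rates = [counts[0] / counts.sum()
             for counts in (marginal, white_victim[defendant], black_victim[defendant])]
    print(defendant, "defendant: marginal %.3f, white victim %.3f, black victim %.3f" % tuple(rates))

# White defendant: marginal 0.110, white victim 0.113, black victim 0.000
# Black defendant: marginal 0.079, white victim 0.229, black victim 0.028
```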
The previous example teaches us the valuable lesson that marginal independence does not imply conditional independence (nor vice versa). More generally, conditioning on additional variables may induce dependence where none was present before. However, there are properties that relate conditional independences, the most important of which are given in the next theorem.
Theorem 2.2 Conditional independence satisfies the following properties, sometimes called the graphoid axioms.
1. \(X \mathbin{\perp\hspace{-3.2mm}\perp}Y \mid Z\) \(\implies\) \(Y \mathbin{\perp\hspace{-3.2mm}\perp}X \mid Z\);
2. \(X \mathbin{\perp\hspace{-3.2mm}\perp}Y, W \mid Z\) \(\implies\) \(X \mathbin{\perp\hspace{-3.2mm}\perp}Y \mid Z\);
3. \(X \mathbin{\perp\hspace{-3.2mm}\perp}Y, W \mid Z\) \(\implies\) \(X \mathbin{\perp\hspace{-3.2mm}\perp}W \mid Y, Z\);
4. \(X \mathbin{\perp\hspace{-3.2mm}\perp}W \mid Y, Z\) and \(X \mathbin{\perp\hspace{-3.2mm}\perp}Y \mid Z\) \(\implies\) \(X \mathbin{\perp\hspace{-3.2mm}\perp}Y,W \mid Z\);
5. if \(p(x,y,z,w) > 0\), then \(X \mathbin{\perp\hspace{-3.2mm}\perp}W \mid Y, Z\) and \(X \mathbin{\perp\hspace{-3.2mm}\perp}Y \mid W, Z\) \(\implies\) \(X \mathbin{\perp\hspace{-3.2mm}\perp}Y,W \mid Z\).
These properties are sometimes referred to respectively as symmetry, decomposition, weak union, contraction and intersection.
Proof.
1. Symmetry follows from Theorem 2.1.
2. Starting from \(p(x,y,w \,|\, z) = p(x \,|\, z) p(y,w \,|\, z)\) and integrating out \(w\) gives \(p(x,y\,|\, z) = p(x\,|\, z) p(y \,|\, z)\).
3 and 4. See Examples Sheet 1.
5. By Theorem 2.1 we have \(p(x,y,w, z) = f(x, y, z) g(y, w, z)\) and \(p(x,y,w, z) = \tilde{f}(x, w, z) \tilde{g}(y, w, z)\). By positivity, taking ratios shows that \[\begin{align*} f(x, y, z) &= \frac{\tilde{f}(x, w, z) \tilde{g}(y, w, z)}{ g(y, w, z)}\\ &= \frac{\tilde{f}(x, w_0, z) \tilde{g}(y, w_0, z)}{ g(y, w_0, z)} \end{align*}\] for any fixed \(w_0\), since the left-hand side does not depend upon \(w\); now we see that the right-hand side is a function of \(x,z\) times a function of \(y,z\), so \[\begin{align*} f(x, y, z) &= a(x,z) \cdot b(y,z). \end{align*}\] Plugging this into the first expression gives the result.
Remark. Properties 2–4 can be combined into a single ‘chain rule’: \[\begin{align*} X \mathbin{\perp\hspace{-3.2mm}\perp}W \mid Y, Z &&\text{and} && X \mathbin{\perp\hspace{-3.2mm}\perp}Y \mid Z &&\iff && X \mathbin{\perp\hspace{-3.2mm}\perp}Y,W \mid Z. \end{align*}\]
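These implications can also be checked numerically. The sketch below (not from the notes; the distributions are randomly generated) constructs a strictly positive joint distribution satisfying \(X \mathbin{\perp\hspace{-3.2mm}\perp}Y, W \mid Z\) and verifies properties 2 (decomposition) and 3 (weak union) via the characterisation in Theorem 2.1(iv):

```python
import numpy as np

rng = np.random.default_rng(2)

def normalise(a, axis=None):
    return a / a.sum(axis=axis, keepdims=True)

# Build a strictly positive joint p(x, y, w, z) with X independent of (Y, W) given Z,
# via p(x, y, w, z) = p(z) * p(x | z) * p(y, w | z); all variables are binary.
p_z = normalise(rng.random(2))
p_x_z = normalise(rng.random((2, 2)), axis=0)            # p(x | z), indexed [x, z]
p_yw_z = normalise(rng.random((2, 2, 2)), axis=(0, 1))   # p(y, w | z), indexed [y, w, z]
p = np.einsum('z,xz,ywz->xywz', p_z, p_x_z, p_yw_z)      # joint, indexed [x, y, w, z]

p_xyz = p.sum(axis=2)        # p(x, y, z)
p_ywz = p.sum(axis=0)        # p(y, w, z)
p_yz = p.sum(axis=(0, 2))    # p(y, z)
p_xz = p.sum(axis=(1, 2))    # p(x, z)

# Decomposition: X _||_ Y | Z  <=>  p(z) p(x, y, z) = p(x, z) p(y, z).
print(np.allclose(np.einsum('z,xyz->xyz', p_z, p_xyz),
                  np.einsum('xz,yz->xyz', p_xz, p_yz)))          # True

# Weak union: X _||_ W | Y, Z  <=>  p(y, z) p(x, y, w, z) = p(x, y, z) p(y, w, z).
print(np.allclose(np.einsum('yz,xywz->xywz', p_yz, p),
                  np.einsum('xyz,ywz->xywz', p_xyz, p_ywz)))     # True
```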
The fifth property is often extremely useful (as we shall see), but doesn’t generally hold if the distribution is not positive: see the Examples Sheet.
Remark. Since the events \(\{Y=y\}\) and \(\{Y=y, h(Y)=h(y)\}\) are equal for any (measurable) function \(h\), it follows that \[\begin{align*} p(x \mid y,z) = p(x \mid y, h(y), z). \end{align*}\] This can be used to prove that \[\begin{align*} X \mathbin{\perp\hspace{-3.2mm}\perp}Y \mid Z \quad \implies \quad X \mathbin{\perp\hspace{-3.2mm}\perp}h(Y) \mid Z\quad \text{ and } \quad X \mathbin{\perp\hspace{-3.2mm}\perp}Y \mid h(Y), Z, \end{align*}\] both of which are very useful facts.
2.3 Statistical Inference
Conditional independence crops up in various areas of statistics; here is an example that should be familiar.
Example 2.4 Suppose that \(X \sim f_\theta\) for some parameter \(\theta \in \Theta\). We say that \(T \equiv t(X)\) is sufficient for \(\theta\) if the likelihood can be written as \[\begin{align*} L(\theta \mid X=x) = f_\theta(x) = g(t(x), \theta) \cdot h(x). \end{align*}\] Note that under a Bayesian interpretation of \(\theta\), this is equivalent to saying that \(X \mathbin{\perp\hspace{-3.2mm}\perp}\theta \mid T\).
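For instance (a standard illustration, not taken from these notes), if \(X = (X_1, \ldots, X_n)\) consists of i.i.d. Bernoulli\((\theta)\) variables then \[\begin{align*} f_\theta(x) &= \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{t(x)} (1-\theta)^{n - t(x)}, && t(x) = \sum_{i=1}^n x_i, \end{align*}\] so taking \(g(t, \theta) = \theta^{t}(1-\theta)^{n-t}\) and \(h(x) = 1\) shows that \(T = \sum_i X_i\) is sufficient for \(\theta\).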
Conditional independence can also give huge computational advantages when dealing with complex distributions and large datasets. Take random variables \(X,Y,Z\) on a product space with joint density \[\begin{align*} p_{\theta}(x,y,z) = g_{\eta}(x,y) \cdot h_{\zeta}(y,z), && \forall x,y,z, \theta, \end{align*}\] for some functions \(g,h\), where \(\theta = (\eta,\zeta)\) ranges over a Cartesian product parameter space.
Now suppose that we observe i.i.d. data \((x_i, y_i, z_i)\), \(i=1,\ldots,n\), and wish to find the maximum likelihood estimate of \(\theta\); this is just \(\hat\theta = (\hat\eta, \hat\zeta)\), where \[\begin{align*} \hat\eta = \arg\max_\eta \prod_{i=1}^n g_{\eta}(x_i,y_i), & &\hat\zeta = \arg\max_\zeta \prod_{i=1}^n h_{\zeta}(y_i,z_i). \end{align*}\] So we can maximize the two pieces separately. Notice in particular that neither maximization requires all of the data!
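As a concrete sketch of this (a hypothetical binary model, not from the notes): take \(g_\eta(x,y)\) to be the joint mass function of \((X, Y)\) and \(h_\zeta(y,z)\) the conditional mass function of \(Z\) given \(Y\); each piece is then maximized by counting, using only the pairs of observations it involves.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

# Simulate from a hypothetical binary model p(x, y, z) = g_eta(x, y) * h_zeta(y, z),
# with g_eta the joint of (X, Y) and h_zeta(y, z) = p(z | y).
g_true = np.array([[0.3, 0.1],      # eta: p(x, y), indexed [x, y]
                   [0.2, 0.4]])
h_true = np.array([0.2, 0.7])       # zeta: p(z = 1 | y), indexed [y]

xy = rng.choice(4, size=n, p=g_true.ravel())
x, y = xy // 2, xy % 2
z = (rng.random(n) < h_true[y]).astype(int)

# The MLE of eta uses only the (x, y) pairs; the MLE of zeta uses only the (y, z) pairs.
eta_hat = np.array([[np.mean((x == i) & (y == j)) for j in (0, 1)] for i in (0, 1)])
zeta_hat = np.array([np.mean(z[y == j]) for j in (0, 1)])

print(eta_hat)    # close to g_true
print(zeta_hat)   # close to h_true
```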
If in a Bayesian mood, we might impose independent priors \(\pi(\eta, \zeta) = \pi(\eta) \cdot \pi(\zeta)\). Then \[\begin{align*} \pi(\eta, \zeta \mid \boldsymbol{x},\boldsymbol{y},\boldsymbol{z}) &\propto \pi(\eta) \cdot \pi(\zeta)\cdot \prod_i g_{\eta}(x_i,y_i) \cdot h_{\zeta}(y_i,z_i) \\ &= \left\{ \pi(\eta) \prod_i g_{\eta}(x_i,y_i) \right\} \cdot \left\{ \pi(\zeta) \prod_i h_{\zeta}(y_i,z_i) \right\}\\ &\propto \pi(\eta \mid \boldsymbol{x}, \boldsymbol{y}) \cdot \pi(\zeta \mid \boldsymbol{y}, \boldsymbol{z}). \end{align*}\] Applying Theorem 2.1(ii) we see that \(\eta \mathbin{\perp\hspace{-3.2mm}\perp}\zeta \mid \boldsymbol{X}, \boldsymbol{Y}, \boldsymbol{Z}\), and so we can perform inference about this distribution for the two pieces separately (e.g. by running an MCMC procedure or finding the posterior mode).
Indeed, each piece only requires part of the data, and for large problems this can be a tremendous computational saving.
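A matching Bayesian sketch (again a hypothetical binary model, with conjugate priors chosen purely for illustration): with a Dirichlet prior on \(\eta\) and independent Beta priors on the components of \(\zeta\), the two posterior factors are updated completely separately, each from only part of the data.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000

# Hypothetical binary data as in the previous sketch: (X, Y) jointly, then Z given Y.
g_true = np.array([0.3, 0.1, 0.2, 0.4])              # p(x, y) flattened over (x, y) pairs
xy = rng.choice(4, size=n, p=g_true)
y = xy % 2
z = (rng.random(n) < np.array([0.2, 0.7])[y]).astype(int)

# Independent priors: Dirichlet(1,1,1,1) on eta = p(x, y), Beta(1,1) on each P(Z=1 | Y=y).
# Because the posterior factorizes, each piece is updated using only part of the data.
eta_posterior = 1 + np.bincount(xy, minlength=4)          # Dirichlet(1 + counts of (x, y))
zeta_posterior = [(1 + np.sum(z[y == j]),                 # Beta(1 + #{z = 1}, 1 + #{z = 0})
                   1 + np.sum((y == j) & (z == 0)))
                  for j in (0, 1)]

print(eta_posterior)
print(zeta_posterior)
```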
Bibliographic Notes
The first systematic study of conditional independence was made by Dawid (1979), who also related it to ideas such as statistical sufficiency. Contributions were also made by Spohn (1980). At the time conditional independence was defined only for random variables, but it has since been extended to include deterministic quantities as well; see, for example, Constantinou & Dawid (2017). The Hammersley–Clifford Theorem first appeared in an unpublished manuscript (Clifford & Hammersley, 1971), and was subsequently reproved several times by other authors.
1. Of course, for continuous random variables densities are only defined up to a set of measure zero, so the condition should really read 'almost everywhere'. We will ignore such measure theoretic niceties in this course.