Chapter 4 Conditional independence
An important notion in the study of causality is that of conditional independence. This concept arises very naturally in distributions under interventions, and is also crucial in the understanding of graphical notions of causality.
4.1 Independence
Recall that two discrete variables \(X\) and \(Y\) are independent if \[\begin{align*} P(X=x, Y=y) &= P(X=x) \cdot P(Y=y) && \forall x \in \mathcal{X}, y \in \mathcal{Y}. \end{align*}\] Note that this is equivalent to \[\begin{align*} P(X=x \mid Y=y) &= P(X=x) && \text{whenever }P(Y=y) > 0, \forall x \in \mathcal{X}. \end{align*}\] In other words, knowing the value of \(Y\) gives us no information about the distribution of \(X\); we say that \(Y\) is irrelevant for \(X\). Similarly, two variables with joint density \(f_{XY}\) are independent if \[\begin{align*} f_{XY}(x, y) &= f_X(x) \cdot f_Y(y) && \forall x \in \mathcal{X}, y \in \mathcal{Y}. \end{align*}\] The qualification that these expressions hold for all \((x, y) \in \mathcal{X}\times \mathcal{Y}\), a product space, is very important[^1], and sometimes forgotten.
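For discrete variables the definition can be checked directly by comparing the joint probability table with the outer product of its marginals. The minimal Python sketch below does exactly this; the function name and the example tables are purely illustrative, and it emphasizes that the factorization must hold at every point of the product space.

```python
import numpy as np

def is_independent(joint, tol=1e-9):
    """Check whether a discrete joint pmf (2-D array) equals the outer
    product of its two marginals."""
    joint = joint / joint.sum()              # normalise to a probability table
    px = joint.sum(axis=1, keepdims=True)    # marginal of X (rows)
    py = joint.sum(axis=0, keepdims=True)    # marginal of Y (columns)
    return np.allclose(joint, px * py, atol=tol)

# Independent: joint constructed as an outer product of two marginals
print(is_independent(np.outer([0.3, 0.7], [0.5, 0.2, 0.3])))   # True
# Dependent: all mass on the diagonal
print(is_independent(np.diag([0.5, 0.5])))                     # False
```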
Example 4.1 Suppose that \(X, W\) are independent Exponential\((\lambda)\) random variables. Define \(Y = X + W\). Then the joint density of \(X\) and \(Y\) is \[\begin{align*} f_{XY}(x, y) &= \left\{ \begin{array}{ll} \lambda^2 e^{- \lambda y} & \text{if $y > x > 0$}, \\ 0 & \text{otherwise} \end{array} \right.. \end{align*}\] Note that the expression within the valid range for \(x,y\) factorizes, so when performing the usual change of variables one may mistakenly conclude that \(X\) and \(Y\) are independent.
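To see where this density comes from, apply the change of variables \((X, W) \mapsto (X, Y)\) with \(Y = X + W\), which has Jacobian 1: \[\begin{align*} f_{XY}(x, y) &= f_{XW}(x, y - x) = \lambda e^{-\lambda x} \cdot \lambda e^{-\lambda (y - x)} = \lambda^2 e^{-\lambda y}, && y > x > 0. \end{align*}\] The factorization of \(\lambda^2 e^{-\lambda y}\) into a function of \(x\) (a constant) times a function of \(y\) is of no help, because the support \(\{(x,y) : y > x > 0\}\) is not a product set.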
4.2 Conditional independence
Given random variables \(X,Y\) we denote their joint density by \(p(x,y)\), and call \[\begin{align*} p(y) &= \int_\mathcal{X}p(x,y) \, dx \end{align*}\] the marginal density (of \(Y\)). The conditional density of \(X\) given \(Y\) is defined as any function \(p(x \,|\, y)\) such that \[\begin{align*} p(x, y) &= p(y) \cdot p(x \,|\, y). \end{align*}\] Note that if \(p(y) > 0\) then the solution is unique and given by the familiar expression \[\begin{align*} p(x \,|\, y) &= \frac{p(x, y)}{p(y)}. \end{align*}\]
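As a worked illustration, return to Example 4.1: the marginal density of \(Y = X + W\) is \[\begin{align*} p(y) &= \int_0^y \lambda^2 e^{-\lambda y} \, dx = \lambda^2 y e^{-\lambda y}, && y > 0, \end{align*}\] a Gamma\((2, \lambda)\) density, and hence \[\begin{align*} p(x \,|\, y) &= \frac{\lambda^2 e^{-\lambda y}}{\lambda^2 y e^{-\lambda y}} = \frac{1}{y}, && 0 < x < y. \end{align*}\] That is, given \(Y = y\) the variable \(X\) is uniformly distributed on \((0, y)\); since this conditional distribution depends upon \(y\), we confirm that \(X\) and \(Y\) are not independent.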
Definition 4.1 Let \(X, Y, Z\) be random variables taking values in \(\mathcal{X}\), \(\mathcal{Y}\) and \(\mathcal{Z}\) respectively, with joint density \(p(x,y,z)\). We say that \(X\) and \(Y\) are conditionally independent given \(Z\) if \[\begin{align*} p(x \,|\, y, z) &= p(x \,|\, z), &\forall x \in \mathcal{X}, y \in \mathcal{Y}, z \in \mathcal{Z}\text{ such that } p(y,z) > 0. \end{align*}\] When this holds we write \(X \mathbin{\perp\hspace{-3.2mm}\perp}Y \mid Z \, [p]\), possibly omitting the \(p\) for brevity.
In other words, once \(Z=z\) is known, the value of \(Y\) provides no additional information that would allow us to predict or model \(X\). If \(Z\) is degenerate—that is, there is some \(z\) such that \(P(Z=z) = 1\)—then the definition above is the same as saying that \(X\) and \(Y\) are independent. This is called marginal independence, and denoted \(X \mathbin{\perp\hspace{-3.2mm}\perp}Y\).
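For discrete variables conditional independence can also be checked numerically. The minimal Python sketch below tests the equivalent product form \(p(z) \cdot p(x,y,z) = p(x,z) \cdot p(y,z)\) (condition (iv) of Theorem 4.1 below); the helper function and the example distributions are purely illustrative.

```python
import numpy as np

def is_cond_independent(joint, tol=1e-9):
    """Check X ⊥ Y | Z for a discrete joint pmf given as an array p[x, y, z],
    via the product form p(z) p(x,y,z) = p(x,z) p(y,z)."""
    joint = joint / joint.sum()          # normalise
    pz  = joint.sum(axis=(0, 1))         # p(z)
    pxz = joint.sum(axis=1)              # p(x, z)
    pyz = joint.sum(axis=0)              # p(y, z)
    lhs = joint * pz[None, None, :]
    rhs = pxz[:, None, :] * pyz[None, :, :]
    return np.allclose(lhs, rhs, atol=tol)

# Conditionally independent example: X and Y drawn independently given each Z = z
rng = np.random.default_rng(0)
px_given_z = rng.dirichlet(np.ones(3), size=2)   # p(x | z), shape (2, 3)
py_given_z = rng.dirichlet(np.ones(4), size=2)   # p(y | z), shape (2, 4)
pz = np.array([0.4, 0.6])
joint = np.einsum('zx,zy,z->xyz', px_given_z, py_given_z, pz)
print(is_cond_independent(joint))                # True

# Dependent example: given Z, Y is an exact copy of X
dep = np.zeros((2, 2, 2))
dep[0, 0, :] = dep[1, 1, :] = 0.25
print(is_cond_independent(dep))                  # False
```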
Example 4.2 Let \(X_1, \ldots, X_k\) be a Markov chain. Then \(X_k\) is independent of \(X_1, \ldots, X_{k-2}\) conditional upon \(X_{k-1}\): \[\begin{align*} P(X_k = x \mid X_{k-1}=x_{k-1}, \ldots, X_1 = x_1) = P(X_k = x \mid X_{k-1}=x_{k-1}) \end{align*}\] for all \(x, x_{k-1}, \ldots, x_1\). That is, \(X_k \mathbin{\perp\hspace{-3.2mm}\perp}X_1, \ldots, X_{k-2} \mid X_{k-1}\). This is known as the Markov property, or memoryless property.
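A quick simulation illustrates the property for a chain of length three; the two-state transition matrix below is an arbitrary illustrative choice.

```python
import numpy as np

# A minimal sketch: simulate a two-state Markov chain and compare the empirical
# conditional distributions P(X3 | X2) and P(X3 | X2, X1); up to Monte Carlo
# error they should agree.
rng = np.random.default_rng(1)
P = np.array([[0.9, 0.1],    # P[i, j] = P(next state = j | current state = i)
              [0.4, 0.6]])

n = 200_000
x1 = rng.binomial(1, 0.5, size=n)              # X1 ~ Bernoulli(1/2)
x2 = (rng.random(n) < P[x1, 1]).astype(int)    # X2 | X1
x3 = (rng.random(n) < P[x2, 1]).astype(int)    # X3 | X2

for a in (0, 1):
    cond = x2 == a
    print(f"P(X3=1 | X2={a})        ~ {x3[cond].mean():.3f}")
    for b in (0, 1):
        cond2 = cond & (x1 == b)
        print(f"P(X3=1 | X2={a}, X1={b}) ~ {x3[cond2].mean():.3f}")
```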
Although the definition of conditional independence appears to be asymmetric in \(X\) and \(Y\), in fact it is not: if \(X\) gives no additional information about \(Y\) then the reverse is also true, as the following theorem shows.
Theorem 4.1 Let \(X,Y,Z\) be random variables taking values in a Cartesian product space, with joint density \(p(x,y,z)\). The following are equivalent.
(i) \(p(x \,|\, y, z) = p(x \,|\, z)\) for all \(x,y,z\) such that \(p(y,z) > 0\);
(ii) \(p(x, y \,|\, z) = p(x \,|\, z) \cdot p(y \,|\, z)\) for all \(x,y,z\) such that \(p(z) > 0\);
(iii) \(p(x, y, z) = p(y, z) \cdot p(x \,|\, z)\) for all \(x,y,z\);
(iv) \(p(z) \cdot p(x,y,z) = p(x,z) \cdot p(y,z)\) for all \(x,y,z\);
(v) \(p(x, y, z) = f(x, z) \cdot g(y, z)\) for some functions \(f,g\) and all \(x,y,z\).
Remark. Note that above we use expressions such as ‘for all \(x,y,z\) such that \(p(y,z) > 0\)’; this should be formally interpreted as the measure theoretic notion ‘\(\sigma(Y,Z)\)-almost everywhere’. In this course measure theoretic considerations are suppressed, so writing imprecise (though morally correct) statements such as ‘\(p(y,z) > 0 \implies p(z) > 0\)’ is fine. In the case of discrete distributions, the two formulations are equivalent.
Proof. Note that \(p(y,z) > 0\) implies \(p(z) > 0\), so (i) \(\implies\) (ii) follows from multiplying by \(p(y \,|\, z)\), and (ii) \(\implies\) (iii) by multiplying by \(p(z)\). (iii) \(\implies\) (i) directly.
The equivalence of (iii) and (iv) is also clear (note that if \(p(z)=0\) then both sides of (iii) are 0), and (iii) implies (v). It remains to prove that (v) implies the others. Suppose that (v) holds. Then \[\begin{align*} p(y,z) = \int p(x,y,z) \, dx = g(y,z) \int f(x,z) \, dx = g(y,z) \cdot \tilde{f}(z). \end{align*}\] If \(\tilde{f}(z) > 0\) (which happens whenever \(p(z) > 0\)) we have \[\begin{align*} p(x,y,z) = \frac{f(x,z)}{\tilde{f}(z)} \, p(y,z). \end{align*}\] By definition \(f(x,z)/\tilde{f}(z)\) is therefore a version of \(p(x \,|\, y,z)\); since it does not depend upon \(y\), integrating over \(y\) shows that it is also a version of \(p(x \,|\, z)\), and we obtain (iii).
Conditional independence is a complicated and often unintuitive notion, as the next example illustrates.
Example 4.3 Below is a famous data set recording the race of the victim and of the defendant in murder cases in Florida between 1976 and 1987, together with whether or not the death penalty was imposed on the defendant. The data are presented as counts, with one table for each race of victim; we could turn them into an empirical probability distribution by dividing by the total number of cases, 674.
White victim:

| Death penalty | White defendant | Black defendant |
|---|---|---|
| Yes | 53 | 11 |
| No | 414 | 37 |

Black victim:

| Death penalty | White defendant | Black defendant |
|---|---|---|
| Yes | 0 | 4 |
| No | 16 | 139 |
If we add those two \(2 \times 2\) tables together, we obtain the marginal table:
| Death penalty | White defendant | Black defendant |
|---|---|---|
| Yes | 53 | 15 |
| No | 430 | 176 |
Here we see that the chance of receiving a death sentence is approximately independent of the defendant’s race: \(P(\text{Death} \mid \text{White}) = 53/(53+430) \approx 0.11\) and \(P(\text{Death} \mid \text{Black}) = 15/(15+176) \approx 0.08\). (One could fiddle the numbers to obtain exact independence.)
However, restricting to cases in which the victim was white, we see that black defendants had nearly a one in four chance of receiving the death penalty (\(11/48 \approx 0.23\)), compared with about one in nine for white defendants (\(53/467 \approx 0.11\)). For black victims the story is similar: a handful of black defendants were sentenced to death, while no white defendants were.
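The reversal can be checked directly from the counts; the short Python sketch below recomputes the marginal and stratified death-penalty rates, with the array simply transcribing the tables above.

```python
import numpy as np

# Counts from the tables above, indexed [victim, defendant, sentence],
# with 0 = white, 1 = black for the races and 0 = death, 1 = no death.
counts = np.array([[[53, 414],    # white victim, white defendant
                    [11, 37]],    # white victim, black defendant
                   [[0, 16],      # black victim, white defendant
                    [4, 139]]])   # black victim, black defendant

# Marginal over the victim's race: P(death | defendant's race)
marginal = counts.sum(axis=0)
print(marginal[:, 0] / marginal.sum(axis=1))      # ~ [0.11, 0.08]

# Stratified by the victim's race: P(death | defendant's race, victim's race)
for v, victim in enumerate(["white victim", "black victim"]):
    rates = counts[v, :, 0] / counts[v].sum(axis=1)
    print(victim, rates)   # white-defendant rate, black-defendant rate
```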
The previous example teaches us the valuable lesson that marginal independence does not imply conditional independence (nor vice versa). More generally, conditioning on additional variables may induce dependence where none existed before. However, there are properties that relate conditional independences, the most important of which are given in the next theorem.
Theorem 4.2 Conditional independence satisfies the following properties, sometimes called the graphoid axioms.
1. \(X \mathbin{\perp\hspace{-3.2mm}\perp}Y \mid Z\) \(\implies\) \(Y \mathbin{\perp\hspace{-3.2mm}\perp}X \mid Z\);
2. \(X \mathbin{\perp\hspace{-3.2mm}\perp}Y, W \mid Z\) \(\implies\) \(X \mathbin{\perp\hspace{-3.2mm}\perp}Y \mid Z\);
3. \(X \mathbin{\perp\hspace{-3.2mm}\perp}Y, W \mid Z\) \(\implies\) \(X \mathbin{\perp\hspace{-3.2mm}\perp}W \mid Y, Z\);
4. \(X \mathbin{\perp\hspace{-3.2mm}\perp}W \mid Y, Z\) and \(X \mathbin{\perp\hspace{-3.2mm}\perp}Y \mid Z\) \(\implies\) \(X \mathbin{\perp\hspace{-3.2mm}\perp}Y,W \mid Z\);
5. if \(p(x,y,z,w) > 0\), then \(X \mathbin{\perp\hspace{-3.2mm}\perp}W \mid Y, Z\) and \(X \mathbin{\perp\hspace{-3.2mm}\perp}Y \mid W, Z\) \(\implies\) \(X \mathbin{\perp\hspace{-3.2mm}\perp}Y,W \mid Z\).
These properties are sometimes referred to respectively as symmetry, decomposition, weak union, contraction and intersection.
Proof.
1. Symmetry follows from Theorem 4.1, since condition (iv) is symmetric in \(X\) and \(Y\).
2. Starting from \(p(x,y,w \,|\, z) = p(x \,|\, z) p(y,w \,|\, z)\) and integrating out \(w\) gives \(p(x,y\,|\, z) = p(x\,|\, z) p(y \,|\, z)\).
3 and 4. See Examples Sheet 1.
5. By Theorem 4.1 we have \(p(x,y,w, z) = f(x, y, z) g(y, w, z)\) and \(p(x,y,w, z) = \tilde{f}(x, w, z) \tilde{g}(y, w, z)\). By positivity, taking ratios shows that \[\begin{align*} f(x, y, z) &= \frac{\tilde{f}(x, w, z) \tilde{g}(y, w, z)}{ g(y, w, z)}\\ &= \frac{\tilde{f}(x, w_0, z) \tilde{g}(y, w_0, z)}{ g(y, w_0, z)} \end{align*}\] for any fixed \(w_0\), since the left-hand side does not depend upon \(w\). The right-hand side is a function of \(x,z\) times a function of \(y,z\), so \[\begin{align*} f(x, y, z) &= a(x,z) \cdot b(y,z). \end{align*}\] Plugging this into the first expression shows that \(p(x,y,w,z)\) is a function of \((x,z)\) times a function of \((y,w,z)\), so the result follows from Theorem 4.1(v).
Remark. Properties 2–4 can be combined into a single ‘chain rule’: \[\begin{align*} X \mathbin{\perp\hspace{-3.2mm}\perp}W \mid Y, Z &&\text{and} && X \mathbin{\perp\hspace{-3.2mm}\perp}Y \mid Z &&\iff && X \mathbin{\perp\hspace{-3.2mm}\perp}Y,W \mid Z. \end{align*}\]
The fifth property is often extremely useful (as we shall see), but doesn’t generally hold if the distribution is not positive.
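For example, let \(U\) be a fair coin flip and set \(X = Y = W = U\), with \(Z\) constant. Then \(X \mathbin{\perp\hspace{-3.2mm}\perp}W \mid Y, Z\) and \(X \mathbin{\perp\hspace{-3.2mm}\perp}Y \mid W, Z\), since conditioning on either \(Y\) or \(W\) determines \(X\) completely; but \(X \mathbin{\perp\hspace{-3.2mm}\perp}Y, W \mid Z\) fails, because \(X = Y\) and both are non-degenerate.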
Remark. Since the events \(\{Y=y\}\) and \(\{Y=y, h(Y)=h(y)\}\) are equal for any (measurable) function \(h\), it follows that \[\begin{align*} p(x \mid y,z) = p(x \mid y, h(y), z). \end{align*}\] This can be used to prove that \[\begin{align*} X \mathbin{\perp\hspace{-3.2mm}\perp}Y \mid Z \quad \implies \quad X \mathbin{\perp\hspace{-3.2mm}\perp}h(Y) \mid Z\quad \text{ and } \quad X \mathbin{\perp\hspace{-3.2mm}\perp}Y \mid h(Y), Z, \end{align*}\] both of which are very useful facts.
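To see the first of these, note that if \(X \mathbin{\perp\hspace{-3.2mm}\perp}Y \mid Z\) then \(p(x, y, z) = p(x \,|\, z) \cdot p(y, z)\). Writing \(w = h(y)\) and integrating (or summing) over \(\{y : h(y) = w\}\) gives \[\begin{align*} p(x, w, z) &= p(x \,|\, z) \cdot p(w, z), \end{align*}\] which factorizes as in Theorem 4.1(v), so \(X \mathbin{\perp\hspace{-3.2mm}\perp}h(Y) \mid Z\). For the second, note that \(p(x, y, w, z) = p(x \,|\, z) \cdot p(y, z)\) whenever \(w = h(y)\) (and is zero otherwise), which is a function of \((x,z)\) times a function of \((y, w, z)\); applying Theorem 4.1(v) with \((h(Y), Z)\) as the conditioning variable gives \(X \mathbin{\perp\hspace{-3.2mm}\perp}Y \mid h(Y), Z\).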
Bibliographic Notes
The first systematic study of conditional independence was made by A. Philip Dawid (1979), who also related it to ideas such as statistical sufficiency. Contributions were also made by Spohn (1980). At that time conditional independence applied only to stochastic variables, but it has since been extended to include deterministic quantities as well; see, for example, Constantinou and Dawid (2017).
References
[^1]: Of course, for continuous random variables densities are only defined up to a set of measure zero, so the condition should really read ‘almost everywhere’. We will ignore such measure-theoretic niceties in this course.