2  Causality

As you read the research literature and become more familiar with the kinds of questions answered in medicine, biology, the social sciences, public policy, and so on, you will learn that about half of researchers are trying to answer causal questions: questions about how a particular variable causes changes in another variable, or whether it does so at all. The other half of researchers are lying.1

Consequently, we need some formal way of describing causality. What does it mean to say something “causes” something else? And we will need a language to describe our research goals when answering causal questions. When we have numerous variables with complicated relationships, deciding what we are looking for—which variables must be included in a model, what coefficients or tests or predictions will answer the research question, what types of confounding or bias we must worry about—is often harder than conducting the analyses. Having a formal way to describe our goals will give us a formal way to reason about this problem.

There are two dominant frameworks used by statisticians to talk about causality: the potential outcomes framework, which provides a definition of “causality” and a way of reasoning about it mathematically, and causal diagrams, which provide a great way to illustrate causal relationships graphically and to reason about our models. We’ll start with counterfactuals, but we will mainly use causal diagrams in this course, simply because they let us sketch out a particular regression problem and reason about how we might estimate the causal effect we want.

2.1 Counterfactuals

2.1.1 Definitions

Let’s start with some notation. Suppose our dataset consists of \(N\) patients, subjects, or units, indexed with \(i = 1, \dots, N\). We are interested in how a particular treatment causes a particular outcome. For simplicity, we’ll suppose there are only two possible treatments, coded as 0 and 1, where 1 is a treatment and 0 is a control. We let: \[\begin{align*} Z_i &= \text{treatment assignment (0 or 1)} \\ X_i &= \text{vector of observed covariates} \\ U_i &= \text{vector of unobserved covariates} \end{align*}\] The treatment assignment \(Z_i\) might be within our control (in a randomized experiment) or might be out of our control (in an observational study). \(X_i\) contains other observed covariates that may affect the patient’s outcome, while \(U_i\) encapsulates all the unknown features of the patient that may also affect their outcome.

With these defined, we can define potential outcomes. The potential outcomes give the value of the outcome if the patient were to receive a particular treatment: \[\begin{align*} C_i(1) &= \text{potential outcome of $i$ under treatment ($Z_i = 1$)} \\ C_i(0) &= \text{potential outcome of $i$ under control ($Z_i = 0$)} \\ (C_i(0), C_i(1)) &= \text{causal effect} \\ Y_i &= C_i(Z_i) = \text{observed outcome.} \end{align*}\]

We call \(C_i(0)\) and \(C_i(1)\) the potential outcomes because they are exactly that: potential. They are hypothetical variables representing what would happen if a particular person received a particular treatment, holding everything else (including \(X_i\) and \(U_i\)) fixed. In a particular study, a person may only receive one treatment and not both, and so we only observe \(Y_i = C_i(Z_i)\).2 We may observe what happens if person \(i\) receives the treatment, but not what happens if they receive the control, and so we never observe both \(C_i(0)\) and \(C_i(1)\) for one person \(i\). We don’t have the convenience of simply loading up a data frame with the potential outcomes and running an analysis.

Example 2.1 (Fixitol and Solvix) Suppose we’re conducting a study of a new drug Fixitol for treating malaise, compared to the old standard medication Solvix. To know if Fixitol is better than Solvix, we will examine every patient with the Malaise Measurement Meter after they are treated, and it will produce a malaise score indicating the severity of their malaise. Fixitol is better if it reduces malaise scores. We’ll call Fixitol the treatment and Solvix the control.

However, we also know that other factors affect malaise, such as the amount of sunlight a person gets, their age, other medications they might be taking, and even their ketchup consumption. (Ketchup has natural mellowing agents that reduce malaise.) We can measure some of these factors, but others may be difficult to measure, or we may not even know we need to measure them because we don’t fully understand malaise and everything related to it.

In this case:

  • \(Z_i\) indicates if person \(i\) received Fixitol or Solvix to treat their malaise
  • \(X_i\) might be a vector giving their ketchup consumption, age, and other demographic variables
  • \(U_i\) is an unknown and unobserved vector of everything else that might be related to malaise
  • \(C_i(1)\) is the malaise score that person \(i\) would receive if they took Fixitol, while \(C_i(0)\) is the malaise score person \(i\) would receive if they took Solvix.

Potential outcomes let us define causality. A treatment causes a change in outcome for person \(i\) if \(C_i(0) \neq C_i(1)\), holding fixed all other attributes \(X_i\) and \(U_i\) of that person. In Example 2.1, Fixitol causes an improvement relative to Solvix if your malaise after taking Fixitol would be less than your malaise after taking Solvix.

Exercise 2.1 (Causal effect of race) This definition of causality has limits. Consider a classic example: Is it possible to define the causal effect of race on, say, socioeconomic status? Is it possible to change race (\(Z_i\)) while holding fixed all variables \(X_i\) and \(U_i\), and if so, does it answer the right question?

2.1.2 Estimating counterfactuals

Because counterfactuals are essentially unobservable (unless you have a time machine, or a particularly niche experimental setup in which you can give multiple treatments to each subject without them interacting with each other), we can’t actually estimate both \(C_i(0)\) and \(C_i(1)\) from data for each \(i\). But we might be interested in averages: can we estimate \(\E[C_i(0)]\) and \(\E[C_i(1)]\)?

That depends. There are many ways to estimate counterfactuals and causal effects, and you can take entire courses on sophisticated methods that apply in particular situations. But a very simple approach might be to ask: Can we get a random sample from the populations of \(C_i(0)\) and \(C_i(1)\)? If so, we can estimate their expectations with sample means. If they are true random samples, these will be unbiased, and the law of large numbers will be on our side.

How do we obtain a random sample? If we have \(N\) people in our study, take half at random. For that half, obtain \(C_i(1)\)—that is, assign them to take the treatment, and ensure they do. For the other half, obtain \(C_i(0)\), again by assigning them to take the control. Measure the outcome variable.

This is random assignment, the simplest and perhaps most effective way to estimate potential outcomes.
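
To make this concrete, here is a small simulation sketch in Python. The data-generating process is entirely invented for illustration (a constant 1-unit treatment effect plus individual noise); the point is that with random assignment, the sample mean in each arm estimates the corresponding potential-outcome mean.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Invented potential outcomes: everyone would score 1 unit higher under
# treatment, plus individual variation.
noise = rng.normal(0, 2, size=N)
C0 = 5 + noise          # potential outcome under control
C1 = 6 + noise          # potential outcome under treatment

# Random assignment: half the sample gets Z = 1, the other half gets Z = 0.
Z = rng.permutation(np.repeat([0, 1], N // 2))
Y = np.where(Z == 1, C1, C0)   # we only ever observe C_i(Z_i)

# Sample means within each arm estimate E[C(1)] and E[C(0)].
print(Y[Z == 1].mean(), Y[Z == 0].mean(), Y[Z == 1].mean() - Y[Z == 0].mean())
# approximately 6, 5, and 1
```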

Exercise 2.2 Returning again to Example 2.1, suppose we enroll \(N\) people in our clinical trial. We randomly assign them to take Fixitol or Solvix and measure their malaise afterward. By the above argument, and assuming we measured correctly and people took the medications they were assigned, we now have unbiased estimates of \(C_i(0)\) and \(C_i(1)\) in this population of \(N\) people. Are these necessarily unbiased estimates for the population of all people in the world who have malaise?

Without random assignment, the estimation problem is much more challenging. Consider that \(C_i(0)\) and \(C_i(1)\) may depend on the covariates \(X_i\) and the unobserved covariates \(U_i\). If the treatment assignment \(Z_i\) also depends on these covariates—if people with certain covariates are more or less likely to get the treatment—then we observe \(C_i(1)\) for a biased sample.

Exercise 2.3 (Non-random assignment) Suppose each person has a covariate \(X_i \in \R\). Its population distribution is \(\operatorname{Normal}(0, 4)\). Each person receives either a treatment or a control, and their response is measured. The true causal mechanism is \[\begin{align*} C_i(0) &= X_i \\ C_i(1) &= X_i + 1. \end{align*}\] Thus every person \(i\) would have a 1 unit higher response if they receive the treatment than if they do not. However, there is a wide range of individual outcomes, because each person has a different value of \(X_i\).

Finally, suppose that the treatment \(Z_i\) is assigned based on \(X_i\): \[ Z_i = \begin{cases} 1 & \text{if $X_i > 0$} \\ 0 & \text{if $X_i \leq 0$.} \end{cases} \]

Calculate the following quantities:

  1. \(\E[C_i(0)]\)
  2. \(\E[C_i(1)]\)
  3. \(\E[C_i(0) \mid Z_i = 0]\)
  4. \(\E[C_i(1) \mid Z_i = 1]\)

Comment on what these results mean. Specifically, what would you say is the true causal effect, and what would we estimate if we tried to estimate that causal effect by comparing the average response among the treated against the average response among the controls?

Hint: You may find it helpful to look up the properties of the half-normal distribution.

We can try to avoid this problem in various ways. We could, for instance, try to match using the covariates \(X_i\): compare each treated person to a control person with very similar covariates \(X_i\). This would work if there are no important unobserved covariates \(U_i\). We could also try to model the effects of \(X_i\), as we will see later.

In general the estimation of causal effects without random assignment is a complex subject, and there are many specialized methods for specific scenarios that are outside the scope of this course.

2.1.3 Average treatment effects

Using counterfactuals, we can define useful quantities that we may try to estimate using data. For instance, many studies are interested in the average treatment effect.

Definition 2.1 (Average treatment effect) The average treatment effect for a binary treatment is the difference \(\E[C_i(1) - C_i(0)]\).

Counterfactuals are useful for providing these definitions and for doing mathematical analyses of causal estimators. But for reasoning about models for a specific dataset, causal diagrams are very useful.

2.2 Causal diagrams

We can represent causal relationships between variables as directed acyclic graphs. Initially, these graphs will only represent dependence between random variables, but later we will see how to describe causal claims using the same graphical language.

Example 2.2 (A simple DAG) Suppose we have observable random variables \(X\), \(Y\), and \(Z\), and unobservable random variables \(U_X\), \(U_Y\), and \(U_Z\). We can draw a diagram:

Here each node represents a random variable in the population.

In a causal diagram, the directed arrows imply dependence: Example 2.2 implies that \[\begin{align*} Z &= f_Z(U_Z) \\ Y &= f_Y(U_Y) \\ X &= f_X(Y, Z, U_X), \end{align*}\] and the \(U\) variables are drawn from their own joint distribution. The form of these functions is unspecified by the diagram: they need not be linear, \(f_X\) need not be separable into components by variable, and so on.

We can think of the \(U\) variables as representing the random variation, error, or noise in our system, which is why I called them “unobservable”: in a particular observed dataset, we will have many observations of \(X\), \(Y\), and \(Z\), but not \(U_X\), \(U_Y\), or \(U_Z\). Typically we omit them from the diagrams, because every problem in statistics has noise everywhere, and relationships are very rarely deterministic. When we omit them, we imply they are independent; if two variables in the diagram have sources of error that are dependent, that error source should be explicitly included as a node in the diagram.
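
The structural equations implied by a diagram translate directly into a recipe for simulating from it: draw the noise variables, then evaluate each node as a function of its parents. Here is a sketch for Example 2.2, with arbitrary (invented) choices of \(f_Z\), \(f_Y\), and \(f_X\), since the diagram itself does not specify them.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Independent noise variables (usually omitted from the diagram).
U_Z = rng.normal(size=n)
U_Y = rng.normal(size=n)
U_X = rng.normal(size=n)

# Arbitrary structural equations consistent with the diagram: Z and Y depend
# only on their own noise, while X depends on Y, Z, and U_X.
Z = U_Z
Y = np.exp(U_Y)                       # f_Y need not be linear
X = 2 * Z - Y + 0.5 * Z * Y + U_X     # f_X need not be separable

# Z and Y were generated independently, so their sample correlation is near
# zero, while each is correlated with its child X.
print(np.corrcoef(Z, Y)[0, 1], np.corrcoef(Z, X)[0, 1], np.corrcoef(Y, X)[0, 1])
```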

Definition 2.2 (Graph terminology) Let’s define graph terminology by using Example 2.2:

  • A node or vertex is connected to other nodes by edges. The nodes \(X\) and \(Z\) are connected by an edge from \(Z \to X\), but there is no edge between \(Z\) and \(Y\).
  • A directed edge has a direction, represented by an arrow. All edges in causal diagrams are directed edges.
  • When an edge goes from one node to another, it starts at the parent node and ends at the child node. This is reflected by the direction of the arrow in the diagram. \(Z\) is a parent of \(X\), and \(X\) is its child.
  • A path between two nodes is a route between those two nodes following edges of the graph. For instance, there is a path \(Z \to X \leftarrow Y\).
  • A directed path is a path that follows the direction of arrows. There is no directed path from \(Y\) to \(Z\), but there is a directed path from \(U_Z\) to \(X\).
  • The descendants of a node are all nodes that can be reached from it by directed paths; the initial node is their ancestor.
  • On a path, a collider is a node with two arrows into it. On the path \(Z \to X \leftarrow Y\), \(X\) is a collider. But on the path \(U_Y \to Y \to X\), \(X\) is not a collider.

2.2.1 Conditional independence

Causal diagrams represent conditional independence relationships. The joint distribution of all variables can be decomposed as \[ \Pr(\text{variables}) = \prod \Pr(\text{child node} \mid \text{parents}). \]

In the example above, we have \[\begin{multline*} \Pr(X, Y, Z, U_X, U_Y, U_Z) = \\ \Pr(X \mid U_X, Y, Z) \Pr(Z \mid U_Z) \Pr(Y \mid U_Y) \Pr(U_X) \Pr(U_Y) \Pr(U_Z) \end{multline*}\] If we are interested in a particular relationship—say, \(\Pr(X \mid U_X, Y, Z)\)—we need not estimate the full joint distribution. We can estimate the conditional distribution of interest alone.

Example 2.3 (Chain of dependence) Consider this causal diagram:

From this diagram, we deduce \[ \Pr(Z = z \mid X = x, Y = y) = \Pr(Z = z \mid Y = y), \] i.e. conditioning on \(Y\) makes \(X\) independent of \(Z\). If we do not condition on \(Y\), we know \(Z\) and \(X\) are likely dependent.

Contrast this with Example 2.2. In that DAG, \[ \Pr(Z = z \mid X = x, Y = y) \neq \Pr(Z = z \mid X = x), \] because conditioning on \(X\) does not make \(Y\) irrelevant to \(Z\). In fact, conditioning on \(X\) makes \(Z\) dependent on \(Y\) when it was not dependent previously, because \(X\) is a collider on the path \(Z \to X \leftarrow Y\).

Example 2.4 (Berkson’s Paradox) Berkson (1946) described the collider phenomenon long before causal diagrams had been invented as a way to describe dependence. Suppose we want to know if diabetes is a cause of cholecystitis, an inflammation of the gallbladder. We examine the records of numerous hospital patients and record whether they have diabetes, cholecystitis, or both. This gives us the following causal diagram:

We want to know if the \(\text{diabetes} \to \text{cholecystitis}\) arrow is present. Notice the indicator of whether the person is hospitalized. By using only records of hospital patients, we are conditioning on this variable. This induces dependence between diabetes and cholecystitis even if there is no causal arrow between them: If you are hospitalized, there is presumably a good reason for it; people without diabetes must have some other reason for being hospitalized, making them more likely than average to have cholecystitis. This results in an apparent negative relationship between diabetes and cholecystitis, as Berkson noticed in real hospital records.
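
A short simulation reproduces Berkson’s observation. The prevalences below are invented and the two diseases are generated independently, but restricting attention to hospitalized patients (conditioning on the collider) creates a negative association between them.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Invented prevalences; the two conditions are independent in the population.
diabetes = rng.random(n) < 0.10
cholecystitis = rng.random(n) < 0.05

# Hospitalization is a collider: either condition, or some other reason,
# can put you in the hospital.
other_reason = rng.random(n) < 0.05
hospitalized = diabetes | cholecystitis | other_reason

# Whole population: cholecystitis rates are the same with or without diabetes.
print(cholecystitis[diabetes].mean(), cholecystitis[~diabetes].mean())

# Hospital records only: among patients without diabetes, cholecystitis is far
# more common, an apparent (and entirely spurious) negative association.
h = hospitalized
print(cholecystitis[h & diabetes].mean(), cholecystitis[h & ~diabetes].mean())
```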

2.2.2 d-separation

Given an arbitrarily complicated causal graph, and if we condition on an arbitrary collection of random variables in that graph, how can we tell if two selected variables would be statistically dependent or independent? The criterion we use is d-separation (d for “directional”). Two d-separated variables are independent; d-connected variables are usually dependent. But d-separation depends on conditioning: if we condition on some variables (i.e. hold their value fixed), that changes which variables are d-separated.

Definition 2.3 (d-separation) Consider two nodes \(X\) and \(Y\) in a causal diagram, and every path \(p\) connecting them. (These need not be directed paths.) Suppose we condition on a set of nodes \(\{Z\}\).

\(X\) and \(Y\) are d-separated (when conditioning on \(\{Z\}\)) if all paths \(p\) are blocked. A path \(p\) is blocked by \(\{Z\}\) if and only if

  1. \(p\) contains a chain \(A \to B \to C\) or fork \(A \leftarrow B \to C\) where \(B \in \{Z\}\), or
  2. \(p\) contains a collider \(A \to B \leftarrow C\) where \(B \notin \{Z\}\), and no descendant of \(B\) is in \(\{Z\}\).

If \(X\) and \(Y\) are not d-separated, they are d-connected.

Why does conditioning on a chain block the path, and hence the association? Consider the simplest possible case: the causal diagram is \(A \to B \to C\) and we are interested in the association between \(A\) and \(C\). This implies the following structural relationship: \[\begin{align*} A &= f_A(U_A) \\ B &= f_B(A, U_B) \\ C &= f_C(B, U_C), \end{align*}\] where each \(U\) is an independent random variable of unobserved error, and the functions are deterministic. Substituting these together, we have that \[ C = f_C(f_B(A, U_B), U_C), \] and so the value of \(C\) depends on the value of \(A\) (in some perhaps complicated way, depending on the form of \(f_C\) and \(f_B\)). But if we condition on \(B = b\), then we have \[ C = f_C(b, U_C), \] and the only remaining source of variation in \(C\) is \(U_C\). There is no longer any relationship to the particular value of \(A\).

Similarly, why does conditioning on a collider open the path and create association? Consider the path \(X \to Y \leftarrow Z\), implying that \[\begin{align*} X &= f_X(U_X)\\ Z &= f_Z(U_Z)\\ Y &= f_Y(X, Z, U_Y). \end{align*}\] The random variables \(X\) and \(Z\) are functions of independent noise sources (\(U_X\) and \(U_Z\)), so as written, they are independent. But if we condition on \(Y = y\), then \(f_Y(X, Z, U_Y)\) must be equal to \(y\), and so \(X\) and \(Z\) must have values such that \(f_Y(X, Z, U_Y) = y\). For example, if \[ f_Y(X, Z, U_Y) = 2X - 3Z + U_Y, \] then when we condition on \(Y = y\), \[ 3Z + y = 2X + U_Y, \] and so \(Z\) and \(X\) are linearly correlated.
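
Both arguments are easy to check numerically. The sketch below uses invented linear relationships with Gaussian noise, and approximates “conditioning” by restricting to observations where the conditioning variable falls in a narrow window.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Chain A -> B -> C: conditioning on B blocks the path.
A = rng.normal(size=n)
B = A + rng.normal(size=n)
C = B + rng.normal(size=n)
print(np.corrcoef(A, C)[0, 1])                   # clearly nonzero
near_b = np.abs(B - 1.0) < 0.05                  # condition on B ≈ 1
print(np.corrcoef(A[near_b], C[near_b])[0, 1])   # approximately zero

# Collider X -> Y <- Z: conditioning on Y opens the path.
X = rng.normal(size=n)
Z = rng.normal(size=n)
Y = 2 * X - 3 * Z + rng.normal(size=n)
print(np.corrcoef(X, Z)[0, 1])                   # approximately zero
near_y = np.abs(Y) < 0.05                        # condition on Y ≈ 0
print(np.corrcoef(X[near_y], Z[near_y])[0, 1])   # clearly nonzero
```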

Example 2.5 (d-separation demo) Consider the following causal diagram (Pearl, Glymour, and Jewell 2016, fig. 2.9):

  • No conditioning. \(W\) and \(Z_2\) are d-connected. The path \(Z_2 \to Z_3 \to X \to W\) is not blocked. But \(Z_1\) and \(Z_2\) are d-separated, because all possible paths between them contain a collider that is not conditioned on (such as \(Z_3\) in the path \(Z_1 \to Z_3 \leftarrow Z_2\)).
  • Condition on \(Z_3\). \(W\) and \(Z_2\) are still not d-separated. The \(Z_2 \to Y \leftarrow W\) path is blocked, because it has a collider (\(Y\)) that is not conditioned on; the path \(Z_2 \to Z_3 \to X \to W\) is a chain blocked by conditioning on \(Z_3\). But the \(Z_2 \to Z_3 \leftarrow Z_1 \to X \to W\) path is not blocked, because \(Z_3\) is a collider on that path and we have conditioned on it, opening the path. Similarly, \(Z_1\) and \(Z_2\) are now d-connected, because the path \(Z_1 \to Z_3 \leftarrow Z_2\) is not blocked, as the colliding node is conditioned on.

So why do we care about d-separation? First, it means causal diagrams have testable implications: From a diagram we can determine which variables are independent, and if we can test for independence using data, we can test the diagram.3 The field of causal discovery attempts to use tests to determine which causal diagrams are compatible with observed data.

Second, it helps us determine which effects can and cannot be estimated from observed data alone. Let’s consider two examples.

Example 2.6 (No confounding) Suppose patients receive a treatment and we measure an outcome. This outcome is also influenced by a prior condition the patients have, but we do not observe this prior condition. The diagram is:

We can estimate \(\Pr(\text{outcome} \mid \text{treatment})\) from data, but we can’t include the prior condition because it is unobserved. By the law of total probability, we are really estimating \[\begin{multline*} \Pr(\text{outcome} \mid \text{treatment}) = \\ \sum_\text{conditions} \Pr(\text{outcome} \mid \text{treatment}, \text{condition}) \Pr(\text{condition} \mid \text{treatment}), \end{multline*}\] and according to the diagram, \(\Pr(\text{condition} \mid \text{treatment}) = \Pr(\text{condition})\). Hence the association in observed data is the average relationship between treatment and outcome, averaging over prior conditions based on their prevalence.

We can thus estimate the association between treatment and outcome using only the observed variables.

Example 2.7 (Confounding) On the other hand, suppose that the prior condition influences the treatment and outcome, but the treatment has no effect whatsoever. For example, maybe doctors tend to treat patients differently based on their prior condition, but haven’t realized the treatments are equivalent:

Again, suppose the prior condition is unobserved. We can again write out the law of total probability, but in this case, \(\Pr(\text{condition} \mid \text{treatment}) \neq \Pr(\text{condition})\), while \(\Pr(\text{outcome} \mid \text{treatment}, \text{condition}) = \Pr(\text{outcome} \mid \text{condition})\) because the treatment is ineffective. Consequently, \[ \Pr(\text{outcome} \mid \text{treatment}) = \sum_\text{conditions} \Pr(\text{outcome} \mid \text{condition}) \Pr(\text{condition} \mid \text{treatment}), \] and so the distribution of outcomes can depend on the treatment, even though the treatment has no causal effect on outcome. Here prior condition is a confounding variable because it has a causal effect on both treatment and outcome, whereas in Example 2.6 it was not a confounding variable because it was only associated with outcome, not treatment. Confounding variables introduce association between variables that is not from a directed path between the two variables.

These examples illustrate that each path between two nodes in a causal diagram—paths that may be blocked or open based on the d-separation criteria—represents a path for association. Each open path between two variables represents a source of association between those variables, and all other variables on that path may influence the association.

If we are interested in the causal relationship between treatment and outcome, we are implicitly interested in only specific paths: the directed paths from the treatment to the outcome. If there are any other open paths, such as from confounding variables or from colliders we have conditioned on, the association between treatment and outcome will be the association from all the paths, not simply the directed causal paths of interest. If we want to estimate the causal effect alone—for example, to know what would happen if we assigned a particular treatment instead of allowing it to be influenced by prior condition—we need some way to isolate the particular causal effect of interest.

Exercise 2.4 (Pearl, Glymour, and Jewell (2016), problem 2.4.1) Suppose we have the causal graph below. All error terms are independent.

  1. Suppose we can only condition on variables in the set \(\{Z_3, W, X, Z_1\}\). What variables must we condition on, if any, to make each pair of variables below conditionally independent?

    • \((X, Y)\)
    • \((Z_1, Z_2)\)
    • \((W, Z_1)\)
  2. List all paths from \(X\) to \(Y\). On each path, indicate which nodes, if any, are colliders.

  3. Suppose we want to measure the causal effect of \(X\) on \(Y\). Which of the paths between them represent causal paths, and which are confounding paths? Identify the smallest set of variables we must condition on to block the confounding paths.

  4. Now suppose we want to predict the value of \(Y\) using measurements of the other variables in the graph. What is the smallest set of variables we could use to estimate \(Y\) which would be as good as using all the variables? In other words, if you want to build a model to predict \(Y\), which variables should you use?

    (Think about what \(Y\) is a function of, given the structure of the graph.)

  5. Suppose we want to predict \(Z_2\) from measurements of \(Z_3\). Would our prediction be better if we also used measurements of \(W\)? Consider using a nonparametric model of the joint distribution, and assume we have large amounts of data, so variance is not a concern. (Think about whether \(Z_2\) is independent of \(W\) when we condition on \(Z_3\).)

2.2.3 The backdoor criterion

The backdoor criterion gives us a precise way to determine when a causal effect of interest has been isolated, or to determine what variables we must observe and condition on in order to isolate a causal effect.

Definition 2.4 (Backdoor criterion) When estimating the causal effect of \(X\) on \(Y\), a set of variables \(\{Z\}\) satisfies the backdoor criterion if no node in \(\{Z\}\) is a descendant of \(X\), and the nodes in \(\{Z\}\) block every path between \(X\) and \(Y\) that contains an arrow into \(X\).

Paths between \(X\) and \(Y\) containing an arrow into \(X\) are called backdoor paths, as they represent confounding or other unwanted association.

The backdoor criterion is a sufficient condition, but not necessary: it is possible to isolate a causal effect in other ways. You can always use d-separation to tell if you have blocked paths other than the paths of interest. But most of the time, when we are simply controlling for confounding variables, we are ultimately satisfying the backdoor criterion. It’s only in unusual or creative cases that we identify a causal effect another way.

Example 2.8 Suppose we are examining the causal effect of a particular treatment on recovery, as in Example 2.6. However, we know there is a confounding path:

Here SES is socioeconomic status. It causally affects treatment (because some people cannot afford more expensive treatments), and also exercise (because personal trainers are expensive and having time for the gym is related to your job); and recovery from this condition is related to exercise.

The path from treatment to SES to exercise to recovery is a backdoor path, because it contains an arrow into treatment.

Suppose we cannot measure SES in this study and hence cannot condition on it. If we can measure exercise levels and condition on that variable, we have satisfied the backdoor criterion and hence blocked the backdoor path. Conditioning on exercise is sufficient to allow us to identify the causal effect of treatment on recovery.

Exercise 2.5 Consider the following causal diagram:

  1. What must you condition on to estimate the causal effect of \(X\) on \(Y\)?
  2. If we have selected on \(W = w\), what variables must we condition on, if any, to estimate the causal effect of \(X\) on \(Y\)?

2.2.4 Graph surgery

The first way to isolate causal effects in the presence of confounders is to perform graph surgery: instead of conditioning on variables, simply change the causal diagram so the backdoor criterion is satisfied. In other words, we intervene to assign the value of a particular node in the graph, instead of allowing it to be affected by other random variables.

Example 2.9 (Random assignment) Consider the situation in Example 2.7. Suppose we recruit a number of patients who need treatment, but instead of allowing doctors to choose their treatment based on their prior condition, we simply flip a coin to assign the treatment. In this sample of patients, prior condition no longer has a causal effect on treatment, and so the causal arrow from condition to treatment disappears. We are left with the situation in Example 2.6.

Random assignment is the most robust way to estimate a causal effect; in medicine, randomized clinical trials are known as the gold standard for providing evidence. Random assignment has the advantage of eliminating all possible backdoor paths, even ones we are unaware of. Alternatively, we can think of it as balancing all possible confounding variables: because the treatment is randomly assigned, there should be no systematic differences between the treatment groups apart from the treatment we assign.

However, random assignment is not foolproof:

  1. The causal effect of \(X\) on \(Y\) may not operate through the mechanism we intended. In a medical trial, for example, if the doctors and nurses on the staff are aware of which patients have received which treatment, they may treat the patients differently (intentionally or unintentionally), changing their outcomes. Hence the causal effect of treatment on outcome may involve the behavior of the medical staff, not just the effect of the treatment.
  2. Some random assignment is not so random. We could, for instance, flip a coin and decide that all patients in one hospital will receive one treatment, whereas all patients in another hospital will receive the other treatment. If the hospitals or their patients are systematically different, the effect we measure may be due to the hospital, not the treatment.
  3. Many randomized experiments involve people, and people are often human. They may not take the treatment, they may not complete the tasks as assigned, or they may not follow instructions.

Random assignment also is not always possible. If I want to study the causal effect of, say, childhood trauma on later life achievement, most authorities would not permit me to randomly assign some children to be traumatized. Similarly, the causal effect of many government policies is difficult to evaluate because governments will not offer to randomly apply the policies to some people and not others.

Example 2.10 (Natural experiments) Sometimes, circumstances provide something like random assignment without an experimenter deliberately assigning treatments. This is called a natural experiment. For example, Kirk (2009) was interested in the causes of recidivism—meaning the causes of re-offense among people convicted of crimes, imprisoned, and then released. One hypothesis was that released prisoners may return to the same neighborhood where they had friends and opportunities that got them into crime, and if they instead live in unfamiliar places without those factors that led them to crime, they may be less likely to reoffend.

However, it is difficult to force released prisoners to live in specific neighborhoods, and presumably their decision of where to live is influenced by many potential confounding factors. But in 2005, Hurricane Katrina hit the coast of Louisiana and destroyed many neighborhoods through floods and wind. Afterwards, some people released from prison could return to their old neighborhoods, but some could not, because their neighborhoods were destroyed. Kirk (2009) used data on prisoners released on parole to conclude that “moving away from former geographic areas substantially lowers a parolee’s likelihood of re-incarceration.”

2.2.5 The do-calculus

The notion of graph surgery through assignment also leads us to a useful concept: the “do-calculus”.

Definition 2.5 (do notation) In the do calculus, we write \(\cdo(X = x)\) to reflect \(X\) being assigned to the value \(x\). For example, \[ \Pr(Y \mid \cdo(X = x)) \] is the conditional distribution of \(Y\) when \(X\) is assigned to be \(x\). This is different from \[ \Pr(Y \mid X = x), \] which is the conditional distribution of \(Y\) for individuals who happen to have the value \(X = x\).

We can see the connection to counterfactuals more clearly now. In counterfactual notation, \(Z\) is a treatment assignment and \(C(Z)\) is the counterfactual under that treatment, meaning the potential outcome for each individual if they were assigned that treatment, while \(Y\) is the observed outcome. Here, \(\Pr(Y \mid Z = z)\) is the distribution of observed outcomes among people who get treatment \(z\), while \(\Pr(Y \mid \cdo(Z = z))\) is the distribution of outcomes among people assigned treatment \(z\), and represents those counterfactuals. When we have a binary treatment and binary outcome (such as recovered vs. not recovered), estimating the average treatment effect is the same as estimating \(\Pr(Y = 1 \mid \cdo(Z = 1)) - \Pr(Y = 1 \mid \cdo(Z = 0))\).

Returning to Example 2.9, \(\Pr(\text{outcome} \mid \text{treatment})\) does not represent the causal effect when there is a confounding variable, because \(\Pr(\text{outcome} \mid \text{treatment}) \neq \Pr(\text{outcome} \mid \cdo(\text{treatment}))\). But if we assign the treatment, or otherwise break the confounding path, then \(\Pr(\text{outcome} \mid \text{treatment}) = \Pr(\text{outcome} \mid \cdo(\text{treatment}))\). When we are looking for causal effects, we are looking for ways to estimate the latter.
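
We can see the distinction in a simulation. The scenario below is invented but mirrors Example 2.7: a prior condition affects both treatment choice and outcome, while the treatment itself does nothing. Conditioning on treatment shows an apparent effect; intervening on it (graph surgery by coin flip) shows none.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500_000

# Invented mechanism: the prior condition drives both treatment and outcome;
# the treatment term is deliberately absent from the outcome equation.
condition = rng.random(n) < 0.3
treated = rng.random(n) < np.where(condition, 0.8, 0.2)
outcome = 10 - 3 * condition + rng.normal(size=n)

# Observational contrast, E[outcome | treatment = 1] - E[outcome | treatment = 0]:
print(outcome[treated].mean() - outcome[~treated].mean())    # about -1.6

# Interventional contrast, E[outcome | do(treatment = 1)] - E[outcome | do(treatment = 0)]:
# assign treatment by coin flip and regenerate outcomes from the same mechanism.
treated_do = rng.random(n) < 0.5
outcome_do = 10 - 3 * condition + rng.normal(size=n)
print(outcome_do[treated_do].mean() - outcome_do[~treated_do].mean())  # about 0
```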

2.2.6 Adjustment

Beyond random assignment, the alternative way to isolate causal effects is adjustment. We often hear of studies that “adjust for” or “control for” variables, and these studies are attempting to isolate a causal effect by including variables in the conditioning set \(\{Z\}\).

If we identify a set \(\{Z\}\) that satisfies the backdoor criterion, we can adjust for it using the adjustment formula.

Definition 2.6 (Adjustment formula) To estimate the causal effect of \(X\) on \(Y\) in a causal diagram, where the set of variables \(\{Z\}\) satisfies the backdoor criterion, we can calculate \[ \Pr(Y = y \mid \cdo(X = x)) = \sum_z \Pr(Y = y \mid X = x, \{Z\} = z) \Pr(\{Z\} = z), \] where the sum is over all combinations of values of the variables in the conditioning set \(\{Z\}\). For continuous variables, the sum is replaced with an integral.

To calculate or estimate this, we need the values of all the variables in \(\{Z\}\).

Example 2.11 (Adjusting for confounding) In Example 2.6 and Example 2.7, we considered estimating the causal effect of treatment on an outcome when there is a prior condition that could confound the effect. We showed that if the prior condition is a confounder, estimating \(\Pr(\text{outcome} \mid \text{treatment})\) from the data would not estimate the causal effect.

Suppose we can observe the prior condition. Using the adjustment formula, the adjusted effect is \[\begin{multline*} \Pr(\text{outcome} \mid \cdo(\text{treatment})) =\\ \sum_{\text{condition } c} \Pr(\text{outcome} \mid \text{treatment}, \text{condition} = c) \Pr(\text{condition} = c), \end{multline*}\] where the sum is over all the values of the prior condition variable.
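
Here is a sketch of the adjustment formula in action on simulated data. The data-generating process is invented: a binary prior condition confounds a binary treatment and a binary outcome, and the true effect of treatment is to raise the success probability by 0.1 in every condition group. The naive comparison gets the effect badly wrong; the adjustment formula recovers it.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000

# Invented mechanism: patients with the prior condition are treated more often
# and succeed less often; treatment adds 0.1 to the success probability.
condition = rng.random(n) < 0.4
treated = rng.random(n) < np.where(condition, 0.7, 0.2)
success = rng.random(n) < (0.3 + 0.1 * treated + 0.4 * (~condition))

def p_success(mask):
    return success[mask].mean()

# Naive contrast: Pr(success | treated) - Pr(success | untreated).
naive = p_success(treated) - p_success(~treated)

# Adjustment formula: sum over c of Pr(success | treatment, condition = c) Pr(condition = c).
def adjusted(treat_mask):
    return sum(p_success(treat_mask & (condition == c)) * np.mean(condition == c)
               for c in (True, False))

print(naive)                                   # about -0.10: even the sign is wrong
print(adjusted(treated) - adjusted(~treated))  # close to the true effect, +0.10
```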

Adjustment allows us to estimate causal effects when we cannot randomly assign the treatment. But adjustment’s key weakness is that we must have all the variables in \(\{Z\}\), meaning we must trust that we have included all the variables related to \(X\) and \(Y\) in our causal diagram and we have connected them with the correct set of edges. If there is a confounding variable we are unaware of, or we misunderstand the causal relationship between the variables we are aware of, we may fail to control for all the necessary variables, and our estimate will not be for the correct causal effect.

Just as the backdoor criterion is not the only way to isolate causal effects, adjustment is not the only way to estimate them; there are other approaches that work in different cases. But situations where we can satisfy the backdoor criterion are the most common, and adjustment has a direct connection to the typical regression models we will use, so it’s what we’ll focus on.

Example 2.12 (Kidney stones) Kidney stones are mineral deposits that form in the urinary tract, hardening into small stones. Very small stones can be passed from the body in the urine, but larger stones can get stuck. Anecdotally, kidney stones are extremely painful, often described as worse than childbirth. There are a variety of methods to remove kidney stones, depending on their size and composition.

A 1986 study sought to compare several methods for removing kidney stones (Charig et al. 1986). One method was open surgery, which involves making an incision and physically removing the kidney stone. A newer method was percutaneous nephrolithotomy,4 which uses a much smaller incision to insert a tool to break up the stone so it can be removed in small pieces. The study was observational: it used hospital surgery records that recorded the type of surgery and the outcome. The surgery was defined to be successful if the stones were eliminated or less than 2 mm in size three months after surgery.

Table 2.1: Success rates for two kinds of kidney stone removal procedures.
Treatment                      Diameter < 2 cm   Diameter ≥ 2 cm   Overall
Open surgery                   93%               73%               78%
Percutaneous nephrolithotomy   87%               69%               83%

Table 2.1 shows the results of the study. Overall, percutaneous nephrolithotomy had a higher success rate than the more traditional open surgery. But among patients whose kidney stones were smaller than 2 cm, open surgery was more effective—and among patients whose kidney stones were larger, open surgery was also more effective. How could that happen?

The key is that “the probability of having open surgery or percutaneous nephrolithotomy varied according to the diameter of the stones” (Julious and Mullee 1994). Since percutaneous nephrolithotomy was new, surgeons likely chose to use it for simpler cases with smaller stones; and stone size is strongly related to success rate, since larger stones are more difficult to remove. Percutaneous nephrolithotomy was used in easier cases than open surgery. Or, in other words, stone size was a confounding variable and the causal diagram looked like this:

To estimate the causal effect of surgery type on success rate, we need to use the adjustment formula. The paradox exhibited by this data, where an association in the overall dataset reverses when we look within subgroups, is known as Simpson’s paradox.

Exercise 2.6 (Kidney stone adjustment) The raw data from Example 2.12 is shown in Table 2.2. Again, this is an observational study, and the table shows the number of surgeries that were successful or unsuccessful for stones of different sizes.

Table 2.2: Data from Charig et al. (1986) on the success rate of open surgery and percutaneous nephrolithotomy (PN) for kidney stone removal.
Stone size   Surgery   Successful   Unsuccessful
< 2 cm       Open      81           6
< 2 cm       PN        234          36
≥ 2 cm       Open      192          71
≥ 2 cm       PN        55           25

Using the data in the table, first estimate the unadjusted difference in success rates, \[ \Pr(\text{successful} \mid \text{open surgery}) - \Pr(\text{successful} \mid \text{PN}). \] Then use the adjustment formula (Definition 2.6) to estimate \[ \Pr(\text{successful} \mid \cdo(\text{open surgery})) - \Pr(\text{successful} \mid \cdo(\text{PN})), \] which is essentially the average treatment effect. Compare your results and give a sentence interpreting each.

Exercise 2.7 (Conditions for Simpson’s paradox) Simpson’s paradox, as illustrated in Example 2.12, is integrally connected with confounding. If \(X\) is the treatment, \(Y\) is the outcome, and \(Z\) is a confounding variable, and all three are binary, the paradox is that it is possible for the following to be true simultaneously (or their reverse): \[\begin{align*} \Pr(Y=1 \mid Z=1, X=1) &> \Pr(Y=1 \mid Z=1, X=0) & (1)\\ \Pr(Y=1 \mid Z=0, X=1) &> \Pr(Y=1 \mid Z=0, X=0) & (2)\\ \Pr(Y=1 \mid X=1) &< \Pr(Y=1 \mid X=0). & (3) \end{align*}\] Show that this paradox cannot occur if \(X\) and \(Z\) are independent by using the law of total probability. Comment on the connection between this probabilistic result and the causal diagrams: the paradox cannot occur if what is true about the diagram connecting \(X\), \(Y\), and \(Z\)?

2.3 Translating questions to causal diagrams

There is much more detail to causal diagrams than we have discussed here; see Pearl, Glymour, and Jewell (2016) for a more detailed overview. The basics here are nonetheless sufficient for our uses. They will facilitate the process of applied data analysis, which proceeds in steps:

  1. Attempt to understand the data, the variables that have been measured, and how they may be related to each other.
  2. Attempt to understand the question being asked. Often the question is vague: Which treatment is best? What decision should we make? How do we improve outcomes? The statistician must figure out what kinds of statistical results could answer the question.
  3. Determine what models and methods would be appropriate to answer the questions.
  4. Apply those models and methods to the data.
  5. Realize you have misunderstood the problem, the data is unclear, and the models don’t work. Return to step 1.
  6. Finally, attempt to communicate the results back to the scientist, policymaker, or client who had the question.

Causal diagrams can assist us with steps 1 through 3: By putting the situation on paper in a simple, easy-to-understand diagram, we can make sure we understand what’s going on, which specific causal effects are of interest, and what models might estimate those effects. Sometimes we discover that the causal effect cannot be estimated, or that some confounding variables cannot be measured or eliminated; sometimes we must tell the scientist their question cannot be answered. The causal diagram and associated methods can help us along the way.

Exercise 2.8 (COVID testing) During the COVID pandemic, regular testing was recommended as a strategy to limit spread. Many people sought COVID tests before traveling or visiting vulnerable relatives, after large events, or when they felt sick; some employers required employees to get regularly tested; and some state and federal agencies mandated regular testing among specific groups, such as healthcare workers. But access to testing was uneven: the supply was sometimes limited, it was often difficult to figure out where to get tests, and though tests were supposed to be free or covered by insurance, the expense and paperwork may have deterred some people. Other people chose not to test because they felt their personal risk was low, they felt the pandemic was overblown in general, or they otherwise didn’t think tests were necessary.

A government agency is interested in occupational risk of COVID, so they can create workplace programs to encourage people to get tested. Suppose they want to know in which occupations people are more or less likely to get a COVID test, and how much of this effect is due to the occupation—since of course different occupations may involve people of different demographic and socioeconomic groups, and may be conducted in different states with different testing policies and guidelines.

  1. Draw a causal diagram of factors relevant to COVID testing and the agency’s research question. This is open-ended: use your knowledge to think of relevant factors, including ones not listed here, and how they fit into the diagram.
  2. Based on your diagram, what kind of data should the agency collect to answer their research question? What factors must be included in the data?

Exercise 2.9 (Flight delays) The Bureau of Transportation Statistics collects data on every commercial airline flight in the United States, including the length of any flight delays and their causes. Suppose you have access to the following variables for thousands of flights:

  • Departure delay. How late the airplane was when it left the gate to take off
  • Arrival delay. How late the airplane was when it arrived at the destination
  • Time of flight. How long the airplane spent in the air
  • Distance. The distance between the origin and destination of the flight

You have an idea. When an airplane departs late, but it’s a long-distance flight, the pilot might be able to make up for lost time: by going a little bit faster they may be able to catch up and arrive on time. But if it’s a short-distance flight, going a bit faster will only save a minute or two, and won’t have a substantial effect on the arrival delay. So: if two planes both depart late by the same amount, is the one with the longer distance to fly more likely to arrive on time?

  1. Draw a causal diagram for this situation, including the variables listed above. It may also make sense to add additional variables not shown above, if you think they’re relevant, such as potential confounding variables or variables that influence the relationships between those shown above.
  2. Which causal relationship would we need to estimate to test your idea? Which variables must be conditioned on to do so? Point to specific paths in the diagram you want to estimate.
  3. What regression model should you fit to estimate that relationship? Assume the relationships are all linear, and write out a model like \(\text{outcome} = \beta_0 + \beta_1 \text{variable}\)
  4. If you fit a regression model to predict arrival delay with all the covariates listed above, does this satisfy the backdoor criterion for the causal effect of interest? What would the interpretation of the distance coefficient be?

Exercise 2.10 (Petting zoo) Recently, Cranberry Melon University installed an on-campus petting zoo to help students relax. A psychology professor is interested in whether spending time with the ducks at the petting zoo causes students to get better grades.

The professor conducts a survey5 of students and records the following variables:

  • Grade point average. This is the main outcome variable: the student’s GPA a semester after the ducks appeared.

  • Duck time. Time, during that semester, the student spent with the ducks.

  • Aptitude. The student’s SAT score from when they were admitted to Cranberry Melon. The psychologist was worried that SAT scores (and aptitude generally) may affect GPA as well as time spent with the ducks.

  • Dean’s List. Did the student appear on the Dean’s List? Students are put on the Dean’s List based on their GPA and sometimes their extraordinary SAT scores.

  • Socio-economic status. Measured by family income. This may affect aptitude and GPA, and could be confounded with duck time as well.

  • Salary after graduation. Affected by many factors—but employers definitely do not know how much time students spend with ducks, because of federal anti-duck-discrimination law.

  1. Draw a plausible causal diagram incorporating these variables.
  2. Identify the specific causal path the professor wants to estimate by giving the start and end nodes and all nodes on the path (e.g. \(A \to B \to \cdots \to Z\)).
  3. Name a minimum set of variables the professor must include in a model to be able to estimate that causal path. Specify the outcome variable and the variables that must be controlled for. Remember, this is an observational survey, not an experiment; but assume all students are equally likely to respond to the survey.
  4. Identify any variables that the psychologist must not include in the model, if any. Explain (in a sentence or two) why they must not be included.

Exercise 2.11 (SAT scores, Ramsey and Schafer (2013), section 12.1.1) Since 1926, many prospective American college students have taken the SAT before applying to college. In 1982, average scores by state were published by the College Entrance Examination Board (now the College Board), developers of the test. These averages showed large differences by state, from an average score of 790 out of 1600 in South Carolina to an average of 1088 in Iowa. Naturally many people were interested in the source of the differences.

Researchers compiled several possible factors associated with each state’s average score:

  • Takers: The percentage of eligible high school seniors in the state who took the SAT. Not every college required the SAT for admission; in some midwestern states, most colleges used other tests, so the only students taking the SAT were those applying to other colleges out of state.
  • Income: The median income of families of the students who took the SAT.
  • Years: The average number of years of coursework that the SAT takers had in social science, humanities, and the natural sciences.
  • Public: The percentage of the SAT takers who attended public high schools, rather than private schools.
  • Expend: The total state expenditure on high schools, in hundreds of dollars per student.
  • Rank: The median percentile ranking of the test-takers among their high school classes.

Later, we will try to use this data to answer a simple question: Which states have the best performance relative to the amount of money they spend? That is, which get the highest scores per dollar, after accounting for other factors that may influence the scores?

To approach this problem, we should first try to understand the relationships of the variables. Draw a plausible causal diagram of how these factors are related to each other, keeping in mind that the data gives averages per state, and is not at the individual student level. Write a paragraph justifying your choices of arrows and explaining what causal effects you expect.

2.3.1 Causal diagrams in complex settings

In principle, if we know the causal diagram for a particular problem and want to estimate a particular causal relationship, we can use d-separation and rules like the backdoor criterion to determine how to do so. With a method chosen, we merely need enough data to estimate the necessary conditional distributions.

But reality is rarely so kind to us. Perhaps in a precisely controlled physics experiment we may know all the variables and their causal relationships; but if we are evaluating the effect of a government policy, or studying human psychology, or trying to tell if bacon causes cancer, we only have a vague and incomplete notion of the full causal diagram. Drawing a diagram, even a woefully incomplete one, may help us organize our thinking, but we may be unable to determine how to estimate our desired causal effect.

TODO What if the situation is so complex that we can’t hope to draw the full diagram? Conditions for a partial diagram to be sufficient

2.3.2 Recognizing limitations

TODO Discuss how we should think about cases when we are uncertain about the diagram – what could that do to our conclusions? What about if we don’t know about unknown unknowns? https://arxiv.org/pdf/2208.13871.pdf, https://link.springer.com/article/10.1007/s10654-019-00494-6

2.4 Connecting causal diagrams to regression models

As defined above, causal diagrams give us ways to describe conditional independence relationships between variables. If we drew a diagram, found the conditional independence relationships, and determined which variables to condition on to find an effect of interest, we could use the observed data to estimate the conditional distributions. But estimating full distributions is difficult—we’d have to use some kind of conditional density estimation, which could require huge amounts of data to do well. In typical statistical practice, we’re more used to estimating expectations: building models to estimate \(\E[Y \mid X]\), for instance.

For example, let’s return to a situation like Example 2.7.

If we want to estimate the causal effect of \(X\) on \(Y\) by conditioning on \(Z\), we can use the adjustment formula (Definition 2.6), which says that \[ \Pr(Y = y \mid \cdo(X = x)) = \sum_z \Pr(Y \mid X = x, Z = z) \Pr(Z = z). \] Let’s write this instead in continuous form, for continuous random variables. The sum is replaced with an integral, and instead of \(\Pr\) for probability we use \(f\) for probability density: \[ f(y \mid \cdo(X = x)) = \int_z f(y \mid x, z) f(z) \dif z \] If we instead want the expectation—the expected causal outcome—we have \[\begin{align*} \E[Y \mid \cdo(X = x)] &= \int_y y \int_z f(y \mid x, z) f(z) \dif z \dif y \\ &= \int_z f(z) \int_y y f(y \mid x, z) \dif y \dif z \\ &= \int_z \E[Y \mid X, Z] f(z) \dif z \\ &= \E_Z[\E[Y \mid X, Z]], \end{align*}\] where \(\E_Z\) is the expectation with respect to the marginal distribution of \(Z\).

And this is what leads us to regression. A regression model of \(Y\) as a function of \(X\) and \(Z\) is a model that estimates \(\E[Y \mid X, Z]\). If our regression model is good and we have sufficient data, we can get a good estimate, and that can be used to get a good estimate of \(\E[Y \mid \cdo(X = x)]\). We can interpret this equation as saying that to model the average causal effect of \(X\) on \(Y\), we make a regression model that includes—controls for—\(Z\), and then simply average over \(Z\).
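
Here is a sketch of that recipe on simulated data with a single confounder. The linear data-generating process is invented, with a true causal effect of 2; fitting a regression of \(Y\) on \(X\) and \(Z\) and then averaging the fitted values over the observed distribution of \(Z\) (the continuous version of the adjustment formula) recovers it, while a regression on \(X\) alone does not.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000

# Invented linear mechanism: Z confounds X and Y; the causal effect of X is 2.
Z = rng.normal(size=n)
X = 1.5 * Z + rng.normal(size=n)
Y = 2 * X + 3 * Z + rng.normal(size=n)

# Least-squares fit of E[Y | X, Z] = beta0 + beta1 X + beta2 Z.
design = np.column_stack([np.ones(n), X, Z])
beta, *_ = np.linalg.lstsq(design, Y, rcond=None)

def expected_Y_do(x):
    """Estimate E[Y | do(X = x)] = E_Z[E[Y | X = x, Z]] by averaging the fitted
    regression over the observed (marginal) values of Z."""
    return np.mean(beta[0] + beta[1] * x + beta[2] * Z)

print(expected_Y_do(1) - expected_Y_do(0))   # close to the causal effect, 2

# For comparison, the regression of Y on X alone absorbs the confounding:
naive, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), X]), Y, rcond=None)
print(naive[1])                              # about 3.4, not 2
```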

An expectation is merely a summary, of course; it would be better if we knew the full distribution \(f(y \mid \cdo(X = x))\), since then we would know the distribution of expected outcomes, its shape, the probability of specific outcomes, and so on. But that is often impractical, and in any case we often make simplifying assumptions when we construct regression models. For instance, if we assume the model \(y = r(x, z) + e\), where \(e\) is a random error term and \(r\) is a deterministic function, then the conditional distribution of \(Y\) is due to the distribution of \(e\); and if we assume Gaussian errors, as we often do, then estimating the variance of the errors and the expectation of \(Y\) is sufficient to say \(Y\)’s distribution, conditional on \(X\) and \(Z\), is Gaussian. Of course, if our regression assumptions are incorrect—perhaps the distribution of \(Y\) changes dramatically in shape and variance depending on \(X\) and \(Z\), or the errors are extremely non-Gaussian, or our chosen regression model is poor and does not reflect the true relationship—then a causal estimate based on regression will be incorrect as well.


  1. An earlier version of this joke is due to Wasserman (1999) (“There are two types of statisticians: those who do causal inference and those who lie about it.”). This is only partly a joke: You will find many research papers that carefully note they have merely found an “association” between two variables, and haven’t established any causal relationship, but the papers are accompanied by editorials and press releases claiming the new research highlights the cause of some problem and suggests new ways to solve it. These researchers are, in other words, attempting to make causal claims while avoiding the obvious criticism that their data and methods cannot justify causal claims.↩︎

  2. A notable exception is the crossover trial, where participants receive one treatment, then “cross over” to the other group and receive the other treatment.↩︎

  3. However, there are equivalence classes of causal diagrams that have different edges but imply the same independence structure. We can’t distinguish between elements of these equivalence classes using data alone.↩︎

  4. Percutaneous nephrolithotomy is an extremely intimidating name, but in Latin it makes perfect sense: per- (through) cutaneous (skin) nephro- (kidney) lith- (stone) otomy (removal).↩︎

  5. The survey is, of course, named the Quantitative Undertaking for Anatidae Cuddling Knowledge.↩︎