Potential Outcomes - The Framework for Causal Inference

We often talk about it, implicitly or explicitly, but what do we actually mean by causality? It is a concept that is fundamental to the way we think and act, and as such it's deeply embedded into our language. When we say: I was late for work because I stayed up after midnight, in addition to acknowledging the current observed state of things - that we are late for work, we are also stating a belief: Had I gone to sleep before midnight, I would not have been late for work. This statement implies some sort of a choice and agency on our behalf, that we could have somehow done something else and witnessed different consequences. This is why causal inference is so important for decision making of any kind.

Statistics as a field has often distanced itself from causal questions, and even the pioneers of randomized controlled trials like Neyman (1923) and Fisher (1925) were cautious when considering them, especially in observational studies. Thus the famous statistician's mantra: Correlation is not causation! is rightly still alive today, even though it's a fact that some correlations can actually be measurements of causal relationships.
It wasn't until 1974 that Donald Rubin expanded on the initial ideas of Neyman and Fisher and defined the potential outcomes framework for tackling these questions, even in the case of observational studies.

This post is meant as a high level summary of the potential outcomes framework. I do not plan to go too deep into the details here, but I think this is a topic worth talking about, as it is very intuitive and opens doors to answering causal questions that were previously thought to be hopelessly closed. In addition, I have noticed that this topic is not something that data scientists are usually familiar with, even those working heavily on experimentation, so I hope this article makes a humble contribution to changing that.

In case you are curious and want to go deeper, I would encourage you to read Rubin's paper¹ (1974), as well as the book that Rubin co-authored: Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction². The book delves much deeper into the topic, and the introduction contains a very interesting history of causal analysis from Neyman and Fisher up to modern methods. Warm recommendations for anyone interested.

A Toy Example

Let's play a little with the example we started this post with - the causal question embedded in the sentence I was late for work because I stayed up after midnight. Let's imagine we are extremely curious to find out if going to sleep before midnight affects whether we wake up in time for work.

Let's also imagine that we have a very cool app that keeps track of many things that we were doing every day. Among those things are the time when we went to sleep, so we know if we stayed up after midnight, as well as if we were late to work the next day.

To be more formal, we can define:

W = \begin{cases} 1 & \text{if } \textit{went to sleep before 00:00}\\ 0 & \text{otherwise} \end{cases}

which signifies the assignment to either treatment (going to sleep before 00:00) or control (going to sleep after 00:00).

In addition, we can define the outcome:

Y = \begin{cases} 1 & \text{if } \textit{arrived to work on time next day}\\ 0 & \text{otherwise} \end{cases}

Let's also define $X$ to be a vector of additional data that the cool app keeps track of (e.g. caffeine intake, alcohol intake, amount of exercise, level of stress etc.).

We also have a full history of many days in the past, and we'll represent this history with $(\textbf{W}, \textbf{Y}, \textbf{X})$ , and we can denote the i'th unit as $(W_i,Y_i,X_i)$ .
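As a rough illustration of this notation, here is a minimal sketch of how the history could be stored in Python. The `Day` type, its field names, the covariate order and all the values are made up for this example; they are assumptions, not something the app actually exposes.

```python
from typing import NamedTuple, Tuple

class Day(NamedTuple):
    """One unit i: a single day recorded by the (hypothetical) app."""
    w: int                # W_i: 1 if we went to sleep before 00:00, else 0
    y: int                # Y_i: 1 if we arrived at work on time the next day, else 0
    x: Tuple[float, ...]  # X_i: (caffeine, alcohol, exercise, stress), made-up units

# A made-up history of three days.
history = [
    Day(w=1, y=0, x=(2.0, 0.0, 30.0, 0.4)),
    Day(w=0, y=0, x=(3.0, 1.0, 0.0, 0.7)),
    Day(w=1, y=1, x=(1.0, 0.0, 45.0, 0.2)),
]

# The vectors W, Y and X are just the columns of the history.
W = [day.w for day in history]
Y = [day.y for day in history]
X = [day.x for day in history]
print(W, Y)  # [1, 0, 1] [0, 0, 1]
```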

Missing Data - Fundamental Problem of Causal Inference

From the perspective of a single unit $i$ , one of the ways to quantify the causal effect would be $Y_i(\text{went to sleep before 00:00}) - Y_i(\text{went to sleep after 00:00})$ .

But there is a small problem here. This would require us to have data both in the case when we went to sleep early, and in the case we went to sleep late for the same night of our lives. Think of it as if, at the time when the $W$ is decided, two parallel worlds are created - one where $W=1$ , the other where $W=0$ , but everything else is the same, including the date and time and all the variables $X$ . It is clear that we can always only observe one of these two worlds. This is usually called the fundamental problem of causal inference.

To make this explicit in the notation, the vector of outcomes $\textbf{Y}$ is usually written as $\textbf{Y}_{obs}$ , to signify that these are the observed outcomes. As for $Y_i(\text{went to sleep before 00:00})$ and $Y_i(\text{went to sleep after 00:00})$ , these quantities are the famous potential outcomes that gave the name to the whole framework. To make the notation less verbose, they are usually written as $Y_i(1)$ and $Y_i(0)$ , and in the vector form $\textbf{Y(1)}$ and $\textbf{Y(0)}$ .

We can easily establish the connection between $\textbf{Y}_{obs}$ and the potential outcomes, given our assignment vector $\textbf{W}$ :

\textbf{Y}_{obs} = \textbf{W}\textbf{Y}(1) + (1 - \textbf{W})\textbf{Y}(0)
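The formula is applied element by element. A small sketch, assuming we could somehow see both potential outcomes for five days (the values of `Y1` and `Y0` are invented for illustration; in reality we never have both):

```python
# Hypothetical potential outcomes for five days.
W  = [1, 0, 1, 1, 0]   # 1 = went to sleep before 00:00
Y1 = [0, 1, 0, 1, 1]   # Y_i(1): on time if we had gone to sleep early
Y0 = [0, 0, 0, 1, 1]   # Y_i(0): on time if we had stayed up

# Element-wise version of Y_obs = W * Y(1) + (1 - W) * Y(0):
# the assignment picks which of the two potential outcomes we get to see.
Y_obs = [w * y1 + (1 - w) * y0 for w, y1, y0 in zip(W, Y1, Y0)]
print(Y_obs)  # [0, 0, 0, 1, 1]
```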

It can be helpful to see an example of what our data looks like in tabular format. Let's start with $(\textbf{W}, \textbf{Y}, \textbf{X})$:

W	$Y_{obs}$	X
1	0	x_1
0	0	x_2
1	0	x_3
1	1	x_4
...	...	...
0	1	x_n

Taking into account the potential outcomes and the assignment, we can represent the same data in another way:

Y(0)	Y(1)	X
?	0	x_1
0	?	x_2
?	0	x_3
?	1	x_4
...	...	...
1	?	x_n
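Going from the observed data to the potential outcomes view is mechanical. A minimal sketch, using the five visible rows of the $(W, Y_{obs})$ table and `None` as an (assumed) marker for the entries we cannot observe:

```python
# The five visible rows of the (W, Y_obs) table.
W     = [1, 0, 1, 1, 0]
Y_obs = [0, 0, 0, 1, 1]

# A unit reveals Y(1) only if it was treated and Y(0) only if it was not.
# None marks the potential outcome we did not get to observe.
Y0 = [y if w == 0 else None for w, y in zip(W, Y_obs)]
Y1 = [y if w == 1 else None for w, y in zip(W, Y_obs)]

print("Y(0)\tY(1)")
for y0, y1 in zip(Y0, Y1):
    print("\t".join("?" if v is None else str(v) for v in (y0, y1)))
```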

Looking at the data from the perspective of the potential outcomes leads us to a very important insight that we have Donald Rubin to thank for - the fundamental problem of causal inference is actually a missing data problem. So in order to solve it, we have to find a way to impute the missing data as correctly as possible³.

These missing data, i.e. the ? from the table above, can be represented in our notation quite conveniently, analogous to $Y_{obs}$ :

\textbf{Y}_{miss} = (1 - \textbf{W})\textbf{Y}(1) + \textbf{W}\textbf{Y}(0)
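To see why the observed and the missing outcomes together are all we would need, here is a small sketch with invented potential outcomes (`Y1` and `Y0` are hypothetical values). It shows that the assignment vector together with the observed and missing outcomes recovers both potential outcomes, and therefore every unit level effect:

```python
W  = [1, 0, 1, 1, 0]
Y1 = [0, 1, 0, 1, 1]   # hypothetical Y_i(1)
Y0 = [0, 0, 0, 1, 1]   # hypothetical Y_i(0)

Y_obs  = [w * y1 + (1 - w) * y0 for w, y1, y0 in zip(W, Y1, Y0)]
Y_miss = [(1 - w) * y1 + w * y0 for w, y1, y0 in zip(W, Y1, Y0)]

# Inverting the two formulas: W tells us which slot each value belongs to.
Y1_back = [w * yo + (1 - w) * ym for w, yo, ym in zip(W, Y_obs, Y_miss)]
Y0_back = [(1 - w) * yo + w * ym for w, yo, ym in zip(W, Y_obs, Y_miss)]

unit_effects = [y1 - y0 for y1, y0 in zip(Y1_back, Y0_back)]
print(Y_miss)        # [0, 1, 0, 1, 1]
print(unit_effects)  # [0, 1, 0, 0, 0]
```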

It's easy to see that the key component to imputing $\textbf{Y}_{miss}$ is the vector $\textbf{W}$, which leads us to the heart of the potential outcomes framework - the notion of the assignment mechanism.

The Assignment Mechanism

The assignment mechanism is what determines how each of the units ended up getting the intervention they received. It is a key component of the whole framework because it describes the data generating process - how we picked which treatment to give to each unit, or how the assignment vector $\textbf{W}$ is created.

It is easy to understand why this is important if you imagine we had a big simulator machine, and our assignment mechanism is just a piece of code that runs on it, after which the outcomes $\textbf{Y}$ are generated. This machine could then generate the $Y_{miss}$ for each of the units from our sample. This allows us to easily calculate the individual causal effect for each unit, as well as the average causal effect of the whole sample.

Of course, having this machine would be equivalent to being able to sample from the "parallel world" where we have applied the alternative intervention on the same units. We have already concluded this is hardly possible. The concept of simulation is still useful, as a simulation can be a good approximation of the world on average, which can allow us to calculate the average causal effect. Note that a simulation doesn't have to be a program: it can also be a simple Bayesian network, for example, or a sample from a distribution that approximates the real data on average. Or it can be an analytical solution that uses the assignment mechanism, the assignment vector $\textbf{W}$ and the observed sample $(\textbf{Y}_{obs}, \textbf{X})$ to derive conclusions about the unobserved data $\textbf{Y}_{miss}$. Thus we are using the word "simulation" here very broadly, as a way to answer the question: what would $\textbf{Y}$ be in case of a different assignment $\textbf{W}$?
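To make the simulator idea a bit more tangible, here is a toy sketch in which we invent the data generating process ourselves. Everything in it is an assumption made for illustration: the `simulate_day` function, the single `stress` covariate and the probabilities have nothing to do with real sleep data. The point is only that once both potential outcomes are generated, unit level and average causal effects become simple arithmetic.

```python
import random

def simulate_day(rng):
    """Hypothetical data generating process: returns (Y(0), Y(1)) for one day."""
    stress = rng.random()                        # made-up covariate in [0, 1)
    p_on_time_if_late_night = 0.5 - 0.3 * stress
    p_on_time_if_early_night = 0.8 - 0.3 * stress
    u = rng.random()                             # shared noise: the same "world" for both outcomes
    y0 = int(u < p_on_time_if_late_night)
    y1 = int(u < p_on_time_if_early_night)
    return y0, y1

rng = random.Random(0)
days = [simulate_day(rng) for _ in range(10_000)]

# With both potential outcomes available, the causal effects are trivial to compute.
unit_effects = [y1 - y0 for y0, y1 in days]
average_effect = sum(unit_effects) / len(unit_effects)
print(average_effect)  # close to 0.3, the gap we built into the two probabilities
```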

So the assignment mechanism, and the knowledge or assumptions we have about it, enables us to talk about and "simulate" these different assignments. Depending on the type of assignment mechanism we have, the causal analysis will be different.

As an example, let's look at the most famous type of assignment mechanism, one that has been well adopted in the industry - randomized controlled trials. But unlike the usual banter, let's look at them from the perspective of the potential outcomes framework and ask what kind of assignment mechanism they assume.

Randomized Controlled Trials

In randomized controlled trials, the assignment mechanism is well known, probabilistic and fully controlled by us. Each unit gets assigned the intervention randomly with a certain probability.

Let's first be more specific about what we mean by randomized controlled trials. This specificity concerns the properties of the assignment mechanism, and in classical randomized experiments, the assignment mechanism must be:

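Individualistic - The assignment probability of every unit is only dependent on that unit's own potential outcomes and covariates, not on those of other units⁴
Probabilistic - The assignment probability of every unit is strictly between 0 and 1, so every unit could end up in either group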
Unconfounded - The assignment probability is not dependent on the potential outcomes
Controlled by the experimenter
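One assignment mechanism that satisfies these properties is the completely randomized experiment, where a fixed number of units is picked for treatment uniformly at random. A minimal sketch (the function name and parameters are made up for this example):

```python
import random

def completely_randomized_assignment(n_units, n_treated, rng):
    """Draw W with exactly n_treated ones; every such vector is equally likely."""
    treated = set(rng.sample(range(n_units), n_treated))
    return [1 if i in treated else 0 for i in range(n_units)]

rng = random.Random(42)
W = completely_randomized_assignment(n_units=10, n_treated=4, rng=rng)
print(W, sum(W))
```

The function never looks at covariates or potential outcomes (unconfounded), every unit is treated with probability `n_treated / n_units`, which is strictly between 0 and 1 as long as `0 < n_treated < n_units` (probabilistic), and we are the ones running it (controlled by the experimenter).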

Now, let's see how Neyman originally used the idea. He was interested in finding an average causal effect given a random sample of size $N$ consisting of two groups - one where the treatment was applied ($W=1$) and the control group ($W=0$).

When we have a finite sample, we can talk about an average causal effect for the finite sample of size $N$, or we can treat the units as a random sample from a much bigger super-population and talk about the average causal effect in that super-population. The latter is essentially the expected value of the unit level causal effect for the super-population. This distinction is very important, and Neyman was interested in both, but as we will see, the estimators for both of these cases are identical.

Let's define these causal effects in terms of potential outcomes. We can write the finite sample causal effect as:

\tau_{fs} = \frac {1} {N} \sum_{i=1}^{N} \left( Y_i(1) - Y_i(0) \right)

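Splitting the sum, this is simply the difference between the average potential outcome under treatment and the average potential outcome under control, each taken over all $N$ units:

\tau_{fs} = \frac {1} {N} \sum_{i=1}^{N} Y_i(1) - \frac {1} {N} \sum_{i=1}^{N} Y_i(0) = \bar{Y}(1) - \bar{Y}(0)

Neither of these averages can be computed directly, since for every unit we observe only one of the two potential outcomes.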

and for the super-population causal effect:

\tau_{sp} = \mathbf{E}_{sp}[Y(1) - Y(0)]

Because the $N$ units are a random sample from the super-population, the expected value of $\tau_{fs}$ is equal to $\tau_{sp}$:

\mathbf {E}_{sp}[\tau_{fs}] = \frac {1} {N} \sum_{i=1}^{N} \mathbf {E}_{sp} [Y_i(1) - Y_i(0)] = \tau_{sp}

This is an interesting observation on its own, but it's even more relevant because this equality propagates down to what is actually important to us - the estimators of $\tau_{fs}$ and $\tau_{sp}$. This is because an unbiased estimator of $\tau_{fs}$ is also an unbiased estimator of $\tau_{sp}$.

Let's look at the estimator that Neyman proposed:

\hat{\tau} = \bar{Y_t}^{obs} - \bar{Y_c}^{obs}

where:

\bar{Y_t}^{obs} = \frac {1} {N_t} \sum_{i: W_i=1} Y_i^{obs} \\ \bar{Y_c}^{obs} = \frac {1} {N_c} \sum_{i: W_i=0} Y_i^{obs}
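In code, this estimator is just a difference of two group means. A minimal sketch (the function name is made up, and the data are five illustrative rows of observed assignments and outcomes):

```python
def difference_in_means(W, Y_obs):
    """Neyman's estimator: mean observed outcome of treated minus that of control."""
    treated = [y for w, y in zip(W, Y_obs) if w == 1]
    control = [y for w, y in zip(W, Y_obs) if w == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

W     = [1, 0, 1, 1, 0]
Y_obs = [0, 0, 0, 1, 1]

tau_hat = difference_in_means(W, Y_obs)
print(tau_hat)  # 1/3 - 1/2, i.e. roughly -0.167
```

With only five days the number itself means very little; it is here only to show the mechanics.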

To see that this estimator is indeed unbiased, we should go back to the definition of $Y^{obs}$ that we have mentioned before, and that we'll repeat here for convenience:

\textbf{Y}_{obs} = \textbf{W}\textbf{Y}(1) + (1 - \textbf{W})\textbf{Y}(0)

Given this, we can rewrite the estimator in the following way:

\hat{\tau} = \bar{Y_t}^{obs} - \bar{Y_c}^{obs} \\ \hat{\tau} = \frac {1} {N_t} \sum_{i: W_i=1} Y_i^{obs} - \frac {1} {N_c} \sum_{i: W_i=0} Y_i^{obs} \\ \hat{\tau} = \frac {1} {N} \sum_{i=1}^{N} \left( \frac {W_i Y_i(1)} {N_t / N} - \frac {(1-W_i) Y_i(0)} {N_c / N} \right)

Now notice that for the finite sample estimator, the only random component here is $\textbf{W}$ - everything else is considered fixed. In other words, we have a sample of $N$ units, and if we were to sample again, we would not draw new units, but only reassign the same units to different interventions.

So showing that $\hat{\tau}$ is an unbiased estimator for $\tau_{fs}$ amounts to showing that the expected value of the estimator across all possible assignments is equal to $\tau_{fs}$ . This distribution across all possible assignments is called the randomization distribution, and this is where the assignment mechanism plays a key role in the calculations.

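To see why, we need the expected value of the assignment indicator $W_i$, which is simply the probability that unit $i$ is assigned to treatment. In a completely randomized experiment, where exactly $N_t$ of the $N$ units are treated, this is: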

\mathbf{E_W} [W_i | Y(1), Y(0)] = \frac {N_t} {N}

Similarly, the expected value for the assignment to control is:

\mathbf{E_W} [(1 - W_i) | Y(1), Y(0)] = \frac {N_c} {N}
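Where does $N_t / N$ come from? A short worked step, under the assumption that every assignment vector with exactly $N_t$ treated units is equally likely (a completely randomized experiment). Since $W_i$ is binary, its expectation is the probability that it equals 1, which is the share of assignment vectors in which unit $i$ is treated:

```latex
\mathbf{E_W} [W_i | Y(1), Y(0)] = \Pr(W_i = 1) = \frac{\binom{N-1}{N_t-1}}{\binom{N}{N_t}} = \frac{N_t}{N}
```

The numerator counts the ways to choose the remaining $N_t - 1$ treated units once unit $i$ is fixed as treated, and the denominator counts all possible assignments. For example, with $N = 5$ and $N_t = 3$ this gives $6 / 10 = 3 / 5$. The expectation of $1 - W_i$ is then $1 - N_t / N = N_c / N$.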

Since the $W_i$-s are the only random variables in the expression for $\hat{\tau}$, the expectation of the whole expression is determined by the expectations of the $W_i$-s:

\mathbf{E_W} [\hat{\tau}] = \frac {1} {N} \sum_{i=1}^{N} \left( \frac {\mathbf{E_W} [W_i] Y_i(1)} {N_t / N} - \frac {\mathbf{E_W} [(1-W_i)] Y_i(0)} {N_c / N} \right) \\ = \frac {1} {N} \sum_{i=1}^{N} \left( \frac {(N_t / N) \cdot Y_i(1)} {N_t / N} - \frac {(N_c / N) \cdot Y_i(0)} {N_c / N} \right) \\ = \frac {1} {N} \sum_{i=1}^{N} \left( Y_i(1) - Y_i(0) \right) = \tau_{fs}
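For a small sample we can check this by brute force: enumerate the whole randomization distribution and average the estimator over it. The sketch below assumes a completely randomized experiment with $N = 5$ and $N_t = 3$, and uses invented potential outcomes (`Y1` and `Y0` are hypothetical values that we could never fully observe in practice). Exact fractions are used so that the comparison is not blurred by floating point rounding.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical potential outcomes for N = 5 units (fixed, not random).
Y1 = [0, 1, 0, 1, 1]
Y0 = [0, 0, 0, 1, 1]
N, N_t = len(Y1), 3
N_c = N - N_t

# The true finite sample average causal effect.
tau_fs = Fraction(sum(y1 - y0 for y1, y0 in zip(Y1, Y0)), N)

# Randomization distribution: every way of choosing which N_t units are treated.
estimates = []
for treated in combinations(range(N), N_t):
    treated_mean = Fraction(sum(Y1[i] for i in treated), N_t)
    control_mean = Fraction(sum(Y0[i] for i in range(N) if i not in treated), N_c)
    estimates.append(treated_mean - control_mean)

# All assignments are equally likely, so the expectation is a plain average.
expected_tau_hat = sum(estimates) / len(estimates)
print(tau_fs, expected_tau_hat)  # 1/5 1/5
```

Individual estimates differ from assignment to assignment, but their average over all 10 possible assignments matches $\tau_{fs}$ exactly.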

This proves that the estimator is unbiased for $\tau_{fs}$, and since $\mathbf{E}_{sp}[\tau_{fs}] = \tau_{sp}$, it is also unbiased for $\tau_{sp}$.

In standard A/B testing setups, we would also need to estimate confidence intervals for this estimator, and for this we need an estimator for the variance of $\hat{\tau}$. We will not derive this estimator here, or analyze its bias, since it's a bit involved, but we'll instead just show the estimator that Neyman came up with:

\hat{\mathbb{V}} = \frac{\hat{\sigma_t}^2}{N_t} + \frac{\hat{\sigma_c}^2}{N_c}

where the $\hat{\sigma}^2$-s are the usual sample variances of the observed outcomes in the treatment and control groups respectively.

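This is exactly the estimator of variance that is most commonly used in A/B testing setups! The great part is that the same formula serves both cases: it is unbiased for the variance of $\hat{\tau}$ as an estimator of $\tau_{sp}$, and for $\tau_{fs}$ it is in general conservative, meaning it tends to overestimate the variance unless the causal effect is the same for every unit. Pretty cool.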
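Putting the point estimate and the variance estimate together gives the familiar A/B test readout. A minimal sketch, with made-up data and a made-up function name; the 95% interval relies on a normal approximation with `z = 1.96`, which is an assumption of this example and is crude for a sample this small:

```python
from statistics import mean, variance

def neyman_estimate(W, Y_obs, z=1.96):
    """Difference in means, Neyman's variance estimate and a normal-approximation interval."""
    treated = [y for w, y in zip(W, Y_obs) if w == 1]
    control = [y for w, y in zip(W, Y_obs) if w == 0]
    tau_hat = mean(treated) - mean(control)
    # statistics.variance is the sample variance (divides by n - 1).
    var_hat = variance(treated) / len(treated) + variance(control) / len(control)
    half_width = z * var_hat ** 0.5
    return tau_hat, var_hat, (tau_hat - half_width, tau_hat + half_width)

# Made-up observations: four treated days followed by four control days.
W     = [1, 1, 1, 1, 0, 0, 0, 0]
Y_obs = [1, 1, 0, 1, 0, 1, 0, 0]

tau_hat, var_hat, interval = neyman_estimate(W, Y_obs)
print(tau_hat, var_hat)  # 0.5 0.125
print(interval)
```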

Other Assignment Mechanisms?

Making different assumptions on the assignment mechanism can lead to different types of analysis and constraints. For example, if we keep the same assumptions as in randomized controlled trials, except for the 4th one, which states that the assignment mechanism is known and in our control, we get a broad class of regular assignment mechanisms that has been studied well. These studies assume that the available covariates $X$ are enough to explain the assignment mechanism, and that within the subpopulations defined by $X$ the units are randomly assigned.
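One simple way to use that assumption, when $X$ takes only a few discrete values, is to compare treated and control units within each subpopulation and then average the within-group differences, weighted by group size. This is only an illustrative sketch of the idea, not a method taken from the references; the function name, the single covariate and the data are all made up.

```python
from collections import defaultdict

def stratified_difference_in_means(W, Y_obs, X):
    """Average of within-stratum differences in means, weighted by stratum size."""
    strata = defaultdict(lambda: {0: [], 1: []})
    for w, y, x in zip(W, Y_obs, X):
        strata[x][w].append(y)

    estimate = 0.0
    for x, groups in strata.items():
        if not groups[0] or not groups[1]:
            # No comparison is possible here: the assignment was not probabilistic in this stratum.
            raise ValueError(f"stratum {x!r} lacks treated or control units")
        n_x = len(groups[0]) + len(groups[1])
        diff = sum(groups[1]) / len(groups[1]) - sum(groups[0]) / len(groups[0])
        estimate += (n_x / len(W)) * diff
    return estimate

# Made-up data: one discrete covariate, with early nights more common on coffee days.
X     = ["coffee", "coffee", "coffee", "coffee", "none", "none", "none", "none"]
W     = [1, 1, 1, 0, 1, 0, 0, 0]
Y_obs = [1, 1, 0, 0, 1, 1, 0, 0]

print(stratified_difference_in_means(W, Y_obs, X))  # roughly 0.667
```

The estimate is only as good as the assumption behind it: if something outside $X$ also drives the assignment, the within-group comparisons are no longer comparisons of randomly assigned units.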

Another example where the assignment mechanism is in fact controlled by us, but randomizing over units breaks other assumptions, like unconfoundedness or the individualistic assignment mechanism, is the switchback experiment design, which I wrote about here.

And many more methods are being worked on and invented every day, where the potential outcomes framework plays a key role. This is a topic of ongoing research that is pretty hot and relevant, and perhaps worth articles of its own in the future.

Conclusion

The beauty of potential outcomes is that it opens the door to tackling causal questions with a simple change in notation and by identifying the assignment mechanism as a key ingredient.

Finally, what I want to emphasize with this post is that potential outcomes offer a unified language for causal inference that is intuitive, general and fairly simple. I am optimistic and excited that many new types of analysis can be created, especially with ML methods as a part of them.

¹ Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5), 688–701. https://doi.org/10.1037/h0037350
² Imbens, G., & Rubin, D. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9781139025751
³ The "as correctly as possible" should be taken seriously. For ML people, data imputation given some features $\mathbf{X}$ would be just a matter of training a model on available data and imputing the missing ones. But even though the model will output something, the predictions may not correspond to how the world actually works, and additional assumptions are required to make that extra step. So we need to take caution and be clear on what assumptions we are making, whatever method we use to impute the data for causal inference purposes.
⁴ Usually part of the Stable Unit Treatment Value Assumption (SUTVA), which is often implied in randomized controlled trials.