
Switchback Experiment Design - A Rare Paper on the Topic

Authors
  • Aleksandar Jovanovic

Switchback Design

Think about the following use case: hypothetically, you work at a ride-hailing company, and you need to solve the problem of drivers rejecting tasks too often. You have some evidence that the pricing model is inadequate for long-distance rides, so you want to test the impact of a new pricing model on the drivers' acceptance rate. The natural way to think about this experiment is to randomly select a subset of your drivers, apply the new pricing model to them, and let the rest of the drivers keep the old pricing model. Then you can compare the acceptance rate in the two groups, and hope to find statistically significant results.

This sounds simple enough, but there are issues with this approach. Imagine that your hypothesis is true - meaning the treatment group's acceptance rate is higher. This means there will be fewer tasks for the drivers in the control group, because both groups operate on the same pool of tasks. As a consequence, because the supply of tasks is lower, it is possible for the control group's acceptance rate to grow as well, making it difficult to detect the real impact of your treatment. In the worst case, you might reach the completely opposite conclusion - that the control group's acceptance rate is higher - which could seriously impact your business.

This is an example of a use case where the common A/B testing strategy no longer works. One way to get around this is the so-called switchback strategy - where instead of randomizing over the units that receive the treatment, as is usual, we switch back and forth between the variants over intervals of time.

This post will dive into the theory behind one of the rare papers I could find on the topic, which attempts to give switchback design a strong theoretical foundation[1]. It will focus on the general mathematical framework for switchback experiments presented there, rather than the many results, and will try to be concise and intuitive. If you want to see all the details, I recommend reading the paper itself, as there is nothing in this post that doesn't exist there. Also, it's very well written and contains a lot of interesting math, so definitely check the references below.

Mathematical Notation

Let's go over the notation that is used in the paper.

Consider a set of time intervals of equal length:

$$t \in [T] = \{ 1, 2, 3, \dots, T \}$$

Each time interval is assigned one of two interventions:

$$w_t \in \{ 0, 1 \}$$

So, if $w_t = 1$, we say the time interval is assigned to the treatment group, and if $w_t = 0$, we say it's assigned to the control group.

If we generate an assignment for each time interval, we can create an assignment path, which is then a binary vector of length $T$[2]:

$$w_{1:T} = (w_1, w_2, \dots, w_T) \in \{0,1\}^T$$

Each time interval corresponds to a potential outcome $Y_t$[3]. A potential outcome is necessarily a function of a concrete assignment path, so it can be represented as:

$$Y_t(w_{1:T})$$

We can represent the set of all potential outcomes as:

$$\mathbb{Y} = \{ Y_t(w_{1:T}) \}_{t \in [T],\, w_{1:T} \in \{0,1\}^T}$$

Note the key point here - this is the set of all possible outcomes over all possible assignment paths. So, if you imagine that there is a parallel world for every assignment path, then there would be $2^T$ worlds. The set $\mathbb{Y}$ covers all of them.
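
To get a feel for how quickly this set grows, here is a minimal Python sketch (my own, not from the paper) that enumerates every assignment path for a small $T$:

```python
from itertools import product

T = 4  # a small horizon, just for illustration

# An assignment path is a binary vector of length T, so there are 2^T of them.
all_paths = list(product([0, 1], repeat=T))

print(len(all_paths))  # 16 == 2**4
print(all_paths[:3])   # (0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 0)
```

Each of these paths indexes its own potential outcomes, which is exactly why the full set $\mathbb{Y}$ is never observable in practice.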

Assumptions?

Every mathematical model of reality necessarily makes assumptions, and it usually has to make a trade-off: with no assumptions it may be too complex, intractable and basically useless and inapplicable; with too many or too strong assumptions it may be too simplistic and, therefore, again inapplicable. In the world of modelling realistic phenomena, it is safe to say that the devil is in the details, and those details are usually the model assumptions. So if you want a measure of how useful a mathematical model is, it is probably a good idea to look at the assumptions it makes, evaluate how strong they are, and consider what the cost is if they are not met.

One of the things I really like about the framework presented in this paper[1] is that it starts from a very abstract mathematical formulation and introduces only a small set of assumptions to simplify it. There are only three of them, and you can be the judge of how reasonable they are:

Assumption 1 - Bounded Potential Outcomes

It simply means that $\vert Y_t(w_{1:T}) \vert \leq B$, where $B$ is some constant. In practice, it's hard to think of a case where this is not satisfied. In our initial example with ride-hailing platforms, it is satisfied by definition, since the acceptance rate must be $\leq 1$.

Assumption 2 - Non-anticipating Potential Outcomes

It means the outcome at time $t$ does not depend on future assignments. This sounds quite weird, I admit, but it makes much more sense when formulated mathematically: for $w_{1:t} \in \{0,1\}^t$ and two different $w'_{t+1:T} \in \{0,1\}^{T-t}$, $w''_{t+1:T} \in \{0,1\}^{T-t}$, it holds that:

$$Y_t(w_{1:t}, w'_{t+1:T}) = Y_t(w_{1:t}, w''_{t+1:T})$$

I am actually not sure if it's even possible to break this assumption. Nonetheless, it helps in simplifying the theory, so there is no harm in stating it explicitly.

Assumption 3 - No m-Carryover Effect[4]

This is a very important assumption, and the strongest one of the three. It means that there is a fixed number of time intervals $m$[5] that should pass before any carryover effect from a previous time interval is nullified.

We can right away eliminate the first $m$ points, so $t \in \{ m+1, m+2, \dots, T \}$. We can also write $w_{t-m:T} \in \{ 0,1 \}^{T-t+m+1}$. Don't let the notation confuse you - it only says that we keep the $m$ time periods immediately before $t$, which are exactly the ones whose carryover can still affect the outcome at $t$. If we introduce two other possible assignments $w'_{1:t-m-1}, w''_{1:t-m-1} \in \{0,1\}^{t-m-1}$, this assumption can be formally represented as:

$$Y_t(w'_{1:t-m-1}, w_{t-m:T}) = Y_t(w''_{1:t-m-1}, w_{t-m:T})$$

Let's see what this means in an example. Say we have an assignment path $w$ that consists of:

  • $w_{1:t-m-1} = \{ 1,0,1,0 \}$ - the assignments for the first 4 time periods
  • $w_{t-m:t-1} = \{ 1,1 \}$ - let's say $m=2$; these are the middle $m$ assignments.
  • $w_{t:T} = \{ 1,1,0,0 \}$ - the last 4 time periods.

This creates a list of assignments $w = w_{1:T} = \{ 1,0,1,0,1,1,1,1,0,0 \}$, which is just a concatenation of the above. Now, say we replace $w_{1:t-m-1}$ with $w'_{1:t-m-1} = \{ 0,1,0,1 \}$, which gives us $w' = \{ 0,1,0,1,1,1,1,1,0,0 \}$. The potential outcome $Y_t$ refers to the first of the last four assignments, i.e. $t = 7$. What this assumption says is that $Y_7(w) = Y_7(w')$: given that $m$ periods have passed, it doesn't matter what happened before, as it won't affect the potential outcome from moment $t$ onwards.
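
Here is a tiny Python sketch of this example (my own illustration, not from the paper), using a made-up outcome function that depends only on the last $m+1$ assignments, exactly as the assumption allows:

```python
m = 2
w  = [1, 0, 1, 0, 1, 1, 1, 1, 0, 0]   # w_{1:T} from the example above
w2 = [0, 1, 0, 1, 1, 1, 1, 1, 0, 0]   # w'_{1:T}: the first four assignments flipped

def outcome(t, path):
    """Toy potential outcome for period t (1-indexed): a function of w_{t-m:t} only."""
    window = path[t - m - 1 : t]       # the m+1 assignments w_{t-m}, ..., w_t
    return sum(window)                 # any function of this window would do

t = 7                                  # the first period of the last block
print(outcome(t, w), outcome(t, w2))   # identical: only w_{5:7} = (1, 1, 1) matters
```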

Simplifying the Notation

Introducing the three assumptions gives us a degree of independence between the outcomes, allowing the following to be true:

$$Y_t(w_{1:T}) = Y_t(w_{t-m:t})$$

Practically, the effect on time $t$ from before $t-m$ is invalidated by Assumption 3, and the effect from after $t$ is invalidated by Assumption 2.

This allows us to remove a lot of indices in our notation, writing:

$$Y_t(w_{t-m:t}) = Y_t(w)$$

Also, note that for two different assignments $w$ and $w'$, whenever $w_{t-m:t} = w'_{t-m:t}$, we can write:

$$Y_t(w) = Y_t(w')$$

This simplification may not seem like much, but it is quite useful, since instead of treating a whole assignment path as one single big sample of data, we get smaller chunks as samples of size $m+1$ which are fairly independent of each other under the 3 assumptions made above.

The Causal Effect

The next important concept from the paper is the definition of the causal effect.

Let's dive directly into the formula itself, and then we can decompose it piece by piece:

$$\tau_p(\mathbb{Y}) = \frac{1}{T - p} \sum_{t=p+1}^{T} \left[ Y_t(1_{p+1}) - Y_t(0_{p+1}) \right]$$

$1_{p+1}$ and $0_{p+1}$ - these are vectors of all ones or all zeros of size $p+1$. Imagine that for $p+1$ consecutive time intervals we apply the same intervention (either treatment or control).

But what the hell is $p$ now? Basically, $p$ is equivalent to the $m$ we had before, except we have to think about $p$ as the true time needed to eliminate the carryover effect, while $m$ was our belief of what the carryover effect is. This distinction is necessary as the theory here must assume we have a good idea of $p$, i.e. that $m = p$. We will make this assumption now, but I also want to emphasize that there is evidence in the paper of the framework being correct when $m > p$. But for the purpose of this post, we assume $m = p$.

Another important thing that took me a while to understand is why we are using vectors of length $p+1$. Well, the answer lies in the consequences of the assumptions we made before. I'll quote the last sentence from the previous heading here:

...instead of treating a whole assignment path as one single big sample of data, we get smaller chunks as samples of size $m+1$ which are fairly independent of each other under the 3 assumptions from above.

Now that $p = m$, the samples are of size $p+1$.

$Y_t(1_{p+1})$ and $Y_t(0_{p+1})$ - the potential outcomes for time interval $t$, with $t$ starting from $p+1$. Another confusing thing for me here was the fact that $t$ can go from $p+1$ to $T$, yet we are talking about a potential outcome for a vector of size $p+1$. The key thing to remember is that $Y_t(w_{t-m:t}) = Y_t(w)$ as a consequence of the three assumptions. If we apply the same intervention for $p+1$ consecutive time intervals, then we get either $1_{p+1}$ or $0_{p+1}$. We could also represent this as $Y_t(w_{t-p:t})$, which is again the same $p+1$-sized vector.

This is where the simplified notation especially comes in handy, since the definition of the causal effect becomes quite intuitive, elegant and easy to grasp. The core of it is $Y_t(1_{p+1}) - Y_t(0_{p+1})$, i.e. the difference between continuously applying intervention 1 up to time period $t$ and continuously applying intervention 0 over the same periods.

The overall equation defines the average causal effect of applying the treatment continuously compared to applying the control intervention continuously - in plain words, always using the treatment instead of the control. This is exactly what we want.
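
As a sanity check on the definition, here is a small sketch (my own, with a made-up potential outcome function) that computes $\tau_p(\mathbb{Y})$ directly from the formula:

```python
T, p = 10, 2

def Y(t, window):
    """Hypothetical potential outcome for period t, as a function of the
    p+1 assignments w_{t-p:t}. The functional form is made up for illustration."""
    return 0.6 + 0.2 * sum(window) / len(window) + 0.01 * t

ones  = (1,) * (p + 1)   # 1_{p+1}: treatment applied for p+1 consecutive periods
zeros = (0,) * (p + 1)   # 0_{p+1}: control applied for p+1 consecutive periods

# Average effect of always treating vs. always applying the control
tau_p = sum(Y(t, ones) - Y(t, zeros) for t in range(p + 1, T + 1)) / (T - p)
print(tau_p)  # ~0.2 for this toy outcome function
```

Note that this calculation needs the full set of potential outcomes $\mathbb{Y}$ (here faked by the function `Y`), which is exactly what we never have in practice - hence the need for an estimator.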

The Causal Effect Estimator

Now that the formal definition of the causal effect is there, it's possible to define the causal effect estimator, i.e. how we are actually going to calculate it from real-world data. Obviously, the causal effect itself compares both interventions on exactly the same time periods, and we cannot apply both interventions to exactly the same time periods. So it's necessary to find an equivalent mathematical definition that we can calculate from the data we can actually observe. Let's see how the authors of the paper approached this.

Our observed data is one assignment path $w_{1:T}^{obs}$ out of many possible paths. The paper's authors used a common estimator for the causal effect called the Horvitz-Thompson estimator. We won't go into the details of it here, as it could have a post of its own. In addition, they are mostly concerned with the class of regular switchback experiments, where the probability of selecting the treatment is fair, meaning equal to $1/2$. This is just a minor detail that should be mentioned, though the need to use some other probability is probably rare.

The core of the Horvitz-Thompson estimator is inverse-probability weighting - i.e. weighting the outcomes by the inverse of the corresponding assignment probability. This has a mathematical justification, but that is again out of scope for this post. To apply it, we first need to define the assignment probability.

Assignment Probability

This shouldn't be hard to do, since we are in control of the assignment process, i.e. when we reach a time period at which we want to randomly choose either the treatment or the control intervention, we know we will do it with probability $1/2$.

Let's assume we have a predefined schedule that determines the time periods at which we randomly choose a new intervention. That schedule can be represented with $\mathbb{T}$. Note that this is different from the set of all time periods $[T]$, and that $\mathbb{T} \subseteq [T]$.

The assignment probability at each $t \in [T]$ is then:

$$P(W_t = x) = \left\{ \begin{array}{ll} 1/2 & \text{if } t \in \mathbb{T} \\ \mathbb{1}\{W_{t-1} = x\} & \text{if } t \notin \mathbb{T} \end{array} \right.$$

This simply states that if the time period $t$ is one where the design calls for a random assignment, then the probability of assignment is $1/2$; otherwise the assignment must equal the previous period's, as we are not doing any switching. Based on this we can compute the probability of an assignment path $w_{1:T}$:

$$\eta_{\mathbb{T}}(w_{1:T}) = \left\{ \begin{array}{ll} 1/2^{K+1} & \text{if } \forall k \in \{ 0, 1, \dots, K \},\ w_{t_k} = w_{t_k+1} = \dots = w_{t_k+p} \\ 0 & \text{otherwise} \end{array} \right.$$

This states that, given a schedule $\mathbb{T}$, at each $t \in \mathbb{T}$ we have a probability of $0.5$, but only at those assignment points. So, the probability of switching an assignment at some point other than the ones specified in the design is 0. Also, the probability is a power of $1/2$ that depends on how many randomization points are included in the assignment sequence.

Here's an example, for $T = 4$ and $\mathbb{T} = \{ t_0 = 1, t_1 = 3 \}$, designed for $p=1$. Assignment $\{ 0,0,1,1 \}$ has a probability of $1/4$, as does $\{ 1,1,0,0 \}$, as well as $\{ 0,0,0,0 \}$. However, an assignment like $\{ 0,1,0,1 \}$ is impossible under our design, since switches would have to happen at $t = 2$ and $t = 4$, which are not in $\mathbb{T}$.
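
A minimal Python sketch of this sampling scheme and path probability (my own, assuming a fair coin at every randomization point, as in the regular designs discussed above):

```python
import random

def sample_path(T, schedule, rng=random):
    """Draw one assignment path w_{1:T} given a schedule of randomization
    points (1-indexed; the schedule is assumed to contain t = 1)."""
    w = []
    for t in range(1, T + 1):
        if t in schedule:
            w.append(rng.randint(0, 1))  # a fresh fair coin flip
        else:
            w.append(w[-1])              # carry the previous assignment over
    return w

def path_probability(w, schedule):
    """eta_T(w): 1/2^(K+1) if w never switches outside the schedule, else 0."""
    for t in range(2, len(w) + 1):
        if t not in schedule and w[t - 1] != w[t - 2]:
            return 0.0                   # a switch outside a randomization point
    return 0.5 ** len(schedule)          # one factor of 1/2 per randomization point

schedule = {1, 3}                        # the T = 4 example above
print(sample_path(4, schedule))          # e.g. [0, 0, 1, 1]
print(path_probability([0, 0, 1, 1], schedule))  # 0.25
print(path_probability([0, 1, 0, 1], schedule))  # 0.0 - impossible under this design
```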

Now, the Estimator

The estimator is then defined as follows:

$$\hat{\tau}_p(\eta_{\mathbb{T}}, w_{1:T}, \mathbb{Y}) = \frac{1}{T - p} \sum_{t=p+1}^{T} \left[ Y_t^{obs} \frac{\mathbb{1}\{ w_{t-p:t} = 1_{p+1} \}}{P(W_{t-p:t} = 1_{p+1})} - Y_t^{obs} \frac{\mathbb{1}\{ w_{t-p:t} = 0_{p+1} \}}{P(W_{t-p:t} = 0_{p+1})} \right]$$

This may look like a mouthful, but it's actually quite simple. The expression $\mathbb{1}\{ w_{t-p:t} = 1_{p+1} \}$ evaluates to $1$ if the last $p+1$ time periods (up to and including $t$) all have the treatment applied to them. It's the same for the control intervention in the other term. Thus for each time period $t$, either the left term or the right term in the subtraction, or both of them, will be 0. So only when we observe $p+1$ consecutive time periods with the same intervention will one of the terms "trigger".

An interesting property that I had initially missed is that more time periods will "trigger" than just the time periods in the schedule, i.e. $t \in \mathbb{T}$. This is because the random assignment at point $t_k$ can be the same as the one at the previous randomization point $t_{k-1}$, in which case all the time periods from $t_{k-1}$ to $t_k$ can be used, as they all close a run of $p+1$ consecutive time periods with the same intervention.
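
Here is a rough sketch of how the estimator could be computed from one observed path (my own code, not from the paper; it assumes fair randomization at each scheduled point, so that $P(W_{t-p:t} = 1_{p+1})$ can be obtained by counting the coin flips the window spans):

```python
def window_probability(t, p, schedule):
    """P(W_{t-p:t} = 1_{p+1}) under fair randomization: one coin flip decides
    the value carried into period t-p, plus one flip for every randomization
    point strictly inside (t-p, t]. By symmetry this also equals the
    probability of an all-zeros window."""
    flips = 1 + sum(1 for s in schedule if t - p < s <= t)
    return 0.5 ** flips

def ht_estimator(w_obs, y_obs, p, schedule):
    """Horvitz-Thompson estimate of tau_p from one observed assignment path
    w_obs and its observed outcomes y_obs (lists of length T; period t is
    stored at index t-1)."""
    T = len(w_obs)
    total = 0.0
    for t in range(p + 1, T + 1):
        window = w_obs[t - p - 1 : t]              # observed w_{t-p:t}
        prob = window_probability(t, p, schedule)
        if all(x == 1 for x in window):            # indicator for 1_{p+1}
            total += y_obs[t - 1] / prob
        elif all(x == 0 for x in window):          # indicator for 0_{p+1}
            total -= y_obs[t - 1] / prob
    return total / (T - p)

# The T = 4, p = 1 example: schedule {1, 3}, one observed path and some
# made-up observed acceptance rates. A single short path gives a noisy estimate.
print(ht_estimator([0, 0, 1, 1], [0.5, 0.6, 0.9, 0.8], p=1, schedule={1, 3}))
```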

Confirming that the causal effect estimator $\hat{\tau}_p$ is unbiased amounts to confirming that:

$$\mathbf{E}\left[ Y_t^{obs} \, \frac{\mathbb{1}\{ W_{t-p:t} = 1_{p+1} \}}{P(W_{t-p:t} = 1_{p+1})} \right] = Y_t(1_{p+1})$$

meaning that, in expectation over the randomness of the assignment, this term recovers the true potential outcome at time period $t$ under sustained treatment (and analogously for the control term). This is a property of the Horvitz-Thompson estimator itself, and it gives exactly the same result as the causal effect definition.
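
This unbiasedness claim is easy to check numerically. Below is a small self-contained Monte Carlo sketch (my own, with a made-up outcome function and a small design of my choosing) that compares the average of the estimator over many simulated paths with the true $\tau_p$:

```python
import random

T, m = 6, 1                     # short horizon; assume the true carryover is p = m = 1
schedule = {1, 3, 5}            # randomize every m + 1 = 2 periods

def Y(t, window):
    """Hypothetical potential outcome for period t, a function of w_{t-m:t} only."""
    return 1.0 + 0.5 * window[-1] + 0.25 * window[0] + 0.05 * t

def sample_path():
    w = []
    for t in range(1, T + 1):
        w.append(random.randint(0, 1) if t in schedule else w[-1])
    return w

def window_prob(t):
    # one coin flip carried into period t-m, plus one per randomization point in (t-m, t]
    return 0.5 ** (1 + sum(1 for s in schedule if t - m < s <= t))

def ht(w):
    total = 0.0
    for t in range(m + 1, T + 1):
        win = w[t - m - 1 : t]
        # since outcomes depend only on the window, Y(t, win) is the observed outcome
        if all(x == 1 for x in win):
            total += Y(t, win) / window_prob(t)
        elif all(x == 0 for x in win):
            total -= Y(t, win) / window_prob(t)
    return total / (T - m)

true_tau = sum(Y(t, (1,) * (m + 1)) - Y(t, (0,) * (m + 1))
               for t in range(m + 1, T + 1)) / (T - m)

random.seed(0)
estimates = [ht(sample_path()) for _ in range(100_000)]
print(true_tau, sum(estimates) / len(estimates))  # the two should be close (~0.75)
```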

Final Words

This is basically it for this post! Having the causal effect and its estimator defined completes the theoretical framework for switchback experiment design.

The paper I was referring to in this post[1] has many more interesting things to discover, for example:

  • A simulation study on two statistical tests for the causal effect estimate.
  • A method for finding the proper carryover effect.
  • A study on the theoretically optimal schedule for experiments.

I will just give a teaser here on the third point, to encourage you to read the paper, as the result for the optimal schedule is quite unintuitive. As it turns out, the optimal schedule is:

  1. If $m=0$, then $\mathbb{T}^\star = \{ 1, 2, 3, \dots, T \}$.
  2. If $m>0$, and the total period over which the experiment runs is long enough that $T=nm$ with $n \geq 4$, then $\mathbb{T}^\star = \{ 1, 2m+1, 3m+1, \dots, (n-2)m+1 \}$ - see the small code sketch below.
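
As a small illustration, here is a sketch of how this schedule could be constructed in code (my own translation of the formula above, under its stated conditions):

```python
def optimal_schedule(T, m):
    """Randomization points T* from the result above: every period when m = 0,
    otherwise {1, 2m+1, 3m+1, ..., (n-2)m+1}, assuming T = n*m with n >= 4."""
    if m == 0:
        return list(range(1, T + 1))
    n, remainder = divmod(T, m)
    if remainder != 0 or n < 4:
        raise ValueError("the formula assumes T = n*m with n >= 4")
    return [1] + [k * m + 1 for k in range(2, n - 1)]

print(optimal_schedule(15, 3))  # [1, 7, 10], i.e. {1, 2m+1, 3m+1} for n = 5
print(optimal_schedule(4, 0))   # [1, 2, 3, 4]
```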

Hope you had fun, and thanks for reading!

Footnotes

  1. Bojinov, Iavor, Simchi-Levi, David, and Zhao, Jinglong. Design and Analysis of Switchback Experiments (August 31, 2020). Available at SSRN: https://ssrn.com/abstract=3684168 or http://dx.doi.org/10.2139/ssrn.3684168

  2. The same approach can be generalized to multiple interventions, even though the focus is on two. Of course, sticking to two might be the better choice anyway, since more interventions will require more data, and thus longer and more costly experiments. Plus, two interventions are always easier to interpret and communicate to the stakeholders.

  3. Potential outcomes are the building blocks of a common conceptual framework used in causal inference.

  4. The carryover effect is the effect that a previous treatment can have on the current time period. In the ride-hailing platform example, we may start applying the new pricing model at a time when a previous ride is still ongoing - which will skew the results for the current time period.

  5. The choice of the time period length and of $m$ in terms of real units of time (e.g. seconds, minutes, etc.) depends on the use case; it is assumed that the carryover duration is an integer multiple of the period length, so that $m$ is an integer number of time periods.