American Political Science Review (2018) 112, 4, 1067–1082
doi:10.1017/S0003055418000357
© American Political Science Association 2018

How to Make Causal Inferences with Time-Series Cross-Sectional Data under Selection on Observables

MATTHEW BLACKWELL Harvard University
ADAM N. GLYNN Emory University

Repeated measurements of the same countries, people, or groups over time are vital to many fields of political science. These measurements, sometimes called time-series cross-sectional (TSCS) data, allow researchers to estimate a broad set of causal quantities, including contemporaneous effects and direct effects of lagged treatments. Unfortunately, popular methods for TSCS data can only produce valid inferences for lagged effects under some strong assumptions. In this paper, we use potential outcomes to define causal quantities of interest in these settings and clarify how standard models like the autoregressive distributed lag model can produce biased estimates of these quantities due to post-treatment conditioning. We then describe two estimation strategies that avoid these post-treatment biases (inverse probability weighting and structural nested mean models) and show via simulations that they can outperform standard approaches in small sample settings. We illustrate these methods in a study of how welfare spending affects terrorism.

INTRODUCTION

Many inquiries in political science involve the study of repeated measurements of the same countries, people, or groups at several points in time. This type of data, sometimes called time-series cross-sectional (TSCS) data, allows researchers to draw on a larger pool of information when estimating causal effects. TSCS data also give researchers the power to ask a richer set of questions than data with a single measurement for each unit (for example, see Beck and Katz 2011).
Using these data, researchers can move past the narrowest contemporaneous questions (what are the effects of a single event?) and instead ask how the history of a process affects the political world. Unfortunately, the most common approaches to modeling TSCS data require strict assumptions to estimate the effect of treatment histories without bias and make it difficult to understand the nature of the counterfactual comparisons.

Matthew Blackwell is an Associate Professor, Department of Government and Institute for Quantitative Social Science, Harvard University, 1737 Cambridge St., MA 02138. Web: http://www.mattblackwell.org (mblackwell@gov.harvard.edu).
Adam N. Glynn is an Associate Professor, Department of Political Science, Emory University, 327 Tarbutton Hall, 1555 Dickey Drive, Atlanta, GA 30322 (aglynn@emory.edu).
We are grateful to Neal Beck, Jake Bowers, Patrick Brandt, Simo Goshev, and Cyrus Samii for helpful advice and feedback and Elisha Cohen for research support. Any remaining errors are our own. This research project was supported by Riksbankens Jubileumsfond, Grant M13-0559:1, PI: Staffan I. Lindberg, V-Dem Institute, University of Gothenburg, Sweden and by European Research Council, Grant 724191, PI: Staffan I. Lindberg, V-Dem Institute, University of Gothenburg, Sweden. Replication files are available on the American Political Science Review Dataverse: https://doi.org/10.7910/DVN/SFBX6Z.
Received: September 30, 2017; revised: March 16, 2018; accepted: June 4, 2018. First published online: August 3, 2018.

This paper makes three contributions to the study of TSCS data. Our first contribution is to define some counterfactual causal effects and discuss the assumptions needed to identify them nonparametrically. We also relate these quantities of interest to common quantities in the TSCS literature, like impulse responses, and show how to derive them from the parameters of a common TSCS model, the autoregressive distributed lag (ADL) model.
These treatment effects can be nonparametrically identified under a key selection-on-observables assumption called sequential ignorability; unfortunately, however, many common TSCS approaches rely on more stringent assumptions, including a lack of causal feedback between the treatment and time-varying covariates. This feedback, for example, might involve a country's level of welfare spending affecting the vote share of left-wing parties, which in turn might affect future levels of spending. We argue that this type of feedback is common in TSCS settings. While we focus on a selection-on-observables assumption in this paper, we discuss the tradeoffs with this choice compared to standard fixed-effects methods, noting that the latter may also rule out this type of dynamic feedback.

Our second contribution is to provide an introduction to two methods from biostatistics that can estimate the effect of treatment histories without bias and under weaker assumptions than common TSCS models. We focus on two methods: (1) structural nested mean models or SNMMs (Robins 1997) and (2) marginal structural models (MSMs) with inverse probability of treatment weighting (IPTWs) (Robins, Hernán, and Brumback 2000). These models allow for consistent estimation of lagged effects of treatment by paying careful attention to the causal ordering of the treatment, the outcome, and the time-varying covariates. The SNMM approach generalizes the standard regression modeling of ADLs and often implies very simple and intuitive multi-step estimators. The MSM approach focuses on modeling the treatment process to develop weights that adjust for confounding in simple weighted regression models. Both of these approaches have the ability
to incorporate weaker modeling assumptions than traditional TSCS models. We describe the modeling choices involved and provide guidance on how to implement these methods.

Our third contribution is to show how traditional models like the ADL are biased for the direct effects of lagged treatments in common TSCS settings, while MSMs and SNMMs are not. This bias arises from the time-varying covariates: researchers must control for them to accurately estimate contemporaneous effects, but they induce post-treatment bias for lagged effects. Thus, ADL models can only consistently estimate lagged effects when time-varying covariates are unaffected by past treatment. SNMMs and MSMs, on the other hand, can estimate these effects even when such feedback exists. We provide simulation evidence that this type of feedback can lead to significant bias in ADL models compared to the SNMM and MSM approaches. Overall, these latter methods could be promising for TSCS scholars, especially those who are interested in longer-term effects.

This paper proceeds as follows. We first clarify the causal quantities of interest available with TSCS data and show how they relate to parameters from traditional TSCS models. Causal assumptions are a key part of any TSCS analysis and we discuss them in the following section. We then turn to discussing the post-treatment bias stemming from traditional TSCS approaches, and then introduce the SNMM and MSM approaches, which avoid this post-treatment bias, and show how to estimate causal effects using these methodologies. We present simulation evidence of how these methods outperform traditional TSCS models in small samples in the following section. Next, we present an empirical illustration of each approach, based on Burgoon (2006), investigating the connection between welfare spending and terrorism. Finally, we conclude with thoughts on both the limitations of these approaches and avenues for future research.
CAUSAL QUANTITIES OF INTEREST IN TSCS DATA

At their most basic, TSCS data consist of a treatment (or main independent variable of interest), an outcome, and some covariates all measured for the same units at various points in time. In our empirical setting below, we focus on a dataset of countries with the number of terrorist incidents as an outcome and domestic welfare spending as a binary treatment. With one time period, only one causal comparison exists: a country has either high or low levels of welfare spending. As we gather data on these countries over time, there are more counterfactual comparisons to investigate. How does the history of welfare spending affect the incidence of terrorism? Does the spending regime today only affect terrorism today or does the recent history matter as well? The variation over time provides the opportunity and the challenge of answering these complex questions.

To fix ideas, let $X_{it}$ be the treatment for unit $i$ in time period $t$. For simplicity, we focus first on the case of a binary treatment so that $X_{it} = 1$ if the unit is treated in period $t$ and $X_{it} = 0$ if the unit is untreated in period $t$ (it is straightforward to generalize to arbitrary treatment types). In our running example, $X_{it} = 1$ would represent a country that had high welfare spending in year $t$ and $X_{it} = 0$ would be a country with low welfare spending. We collect all of the treatments for a given unit into a treatment history, $X_i = (X_{i1}, \ldots, X_{iT})$, where $T$ is the number of time periods in the study. For example, we might have a country that always had high spending, $(1, 1, \ldots, 1)$, or a country that always had low spending, $(0, 0, \ldots, 0)$. We refer to the partial treatment history up to $t$ as $X_{i,1:t} = (X_{i1}, \ldots, X_{it})$, with $x_{1:t}$ as a possible particular realization of this random vector.
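The notation above can be made concrete with a small sketch. The dictionary `histories`, the helper `partial_history`, and the toy values are illustrative assumptions rather than data or code from the study; the sketch only shows how a full history $X_i$ and a partial history $X_{i,1:t}$ relate.

```python
# Hypothetical panel: each unit's treatment history X_i = (X_i1, ..., X_iT),
# where 1 means high welfare spending and 0 means low welfare spending.
histories = {
    "always_high": (1, 1, 1, 1),
    "always_low": (0, 0, 0, 0),
    "mixed": (0, 1, 1, 0),
}


def partial_history(history, t):
    """Return X_{i,1:t}: the first t entries of a history (t is 1-indexed)."""
    if not 1 <= t <= len(history):
        raise ValueError("t must lie between 1 and T")
    return history[:t]


T = len(histories["mixed"])                              # number of periods
mixed_up_to_3 = partial_history(histories["mixed"], 3)   # X_{i,1:3}
```

Storing histories as tuples keeps each realization $x_{1:t}$ hashable, which is convenient if one later wants to tabulate how many units follow each history.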
We define $Z_{it}$, $Z_{i,1:t}$, and $z_{1:t}$ similarly for a set of time-varying covariates that are causally prior to the treatment at time $t$, such as government capability, population size, and whether or not the country is in a conflict.

The goal is to estimate causal effects of the treatment on an outcome, $Y_{it}$, that also varies over time. In our running example, $Y_{it}$ is the number of terrorist incidents in a given country in a given year. We take a counterfactual approach and define potential outcomes for each time period, $Y_{it}(x_{1:t})$ (Rubin 1978; Robins 1986).¹ This potential outcome represents the incidence of terrorism that would occur in country $i$ in year $t$ if $i$ had followed a history of welfare spending equal to $x_{1:t}$. Obviously, for any country in any year, we only observe one of these potential outcomes since a country cannot follow multiple histories of welfare spending over the same time window. To connect the potential outcomes to the observed outcomes, we make the standard consistency assumption. Namely, we assume that the observed outcome and the potential outcome are the same for the observed history: $Y_{it} = Y_{it}(x_{1:t})$ when $X_{i,1:t} = x_{1:t}$.²

¹ The definition of potential outcomes in this manner implicitly assumes the usual stable unit treatment value assumption (SUTVA) (Rubin 1978). This assumption is questionable for many comparative politics and international relations applications, but we avoid discussing this complication in this paper to focus on the issues regarding TSCS data. Implicit in our definition of the potential outcomes is that outcomes at time $t$ only depend on past values of treatment, not future values (Abbring and van den Berg 2003).
² Implicit in the definition of the potential outcomes is that the treatment history can affect the outcome through the history of time-varying covariates: $Y_{it}(x_{1:t}) = Y_{it}(x_{1:t}, Z_{i,1:t}(x_{1:t-1}))$. Here, $Z_{i,1:t}(x_{1:t-1})$ represents the values that the covariate history would take under this treatment history.

To create a common playing field for all the methods we evaluate, we limit ourselves to making causal inferences about the time window observed in the data; that is, we want to study the effect of welfare spending on terrorism for the years in our data set. Under certain assumptions like stationarity of the covariates and error terms, many TSCS methods can make inferences about the long-term effects beyond the end of the study. This extrapolation is typically required with a single time series, but with the multiple units we have in TSCS data, we have the ability to focus our inferences on a particular window and avoid these assumptions about the time-series processes. We view this as a conservative approach because all methods for handling TSCS should be able to generate sensible estimates of causal effects in the period under study. There
is a tradeoff with this approach: we cannot study some common TSCS estimands like the long-run multiplier that are based on time-series analysis. We discuss this estimand in particular in the Supplemental Material.

Given our focus on a fixed time window, we will define expectations over cross-sectional units and consider asymptotic properties of the estimators as the number of these units grows (rather than the length of the time series). Asymptotics are only useful in how they guide our analyses in the real world of finite samples, and we may worry that “large-N, fixed-T” asymptotic results do not provide a reliable approximation when $N$ and $T$ are roughly the same size, as is often the case for TSCS data. Fortunately, as we show in the simulation studies below, our analysis of the various TSCS estimators holds even when $N$ and $T$ are small and close in size. Thus, we do not see the choices of “fixed time-window” versus “time-series analysis” or large-N versus large-T asymptotics to be consequential to the conclusions we draw.

The Effect of a Treatment History

For an individual country, the causal effect of a particular history of welfare spending, $x_{1:t}$, relative to some other history of spending, $x'_{1:t}$, is the difference $Y_{it}(x_{1:t}) - Y_{it}(x'_{1:t})$. That is, it is the difference in the potential or counterfactual level of terrorism when the country follows history $x_{1:t}$ minus the counterfactual outcome when it follows history $x'_{1:t}$. Given the number of possible treatment histories, there can be numerous causal effects to investigate, even with a simple binary treatment. As the length of time under study grows, so does the number of possible comparisons. In fact, with a binary treatment, there are $2^t$ different potential outcomes for the outcome in period $t$.
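To see how quickly the comparisons multiply, the following minimal sketch enumerates every binary history of length $t$ and evaluates one individual-level effect. The function `toy_potential_outcome` and its numbers (a baseline of 10 incidents, a geometric decay in the influence of older spending) are purely hypothetical assumptions chosen for illustration; they are not estimates or a model from the paper.

```python
from itertools import product


def all_histories(t):
    """Enumerate every binary treatment history x_{1:t}."""
    return list(product((0, 1), repeat=t))


def toy_potential_outcome(history):
    """Hypothetical Y_it(x_{1:t}) for one unit.

    Assumption for illustration only: a baseline of 10 incidents, reduced by
    high spending in each period, with more recent periods weighted more.
    """
    t = len(history)
    reduction = sum(x * 0.5 ** (t - s) for s, x in enumerate(history, start=1))
    return 10.0 - 2.0 * reduction


t = 3
histories = all_histories(t)
n_histories = len(histories)  # one potential outcome per history: 2 ** t

# Individual-level effect of "always high" relative to "always low" spending.
always_high = (1,) * t
always_low = (0,) * t
individual_effect = (
    toy_potential_outcome(always_high) - toy_potential_outcome(always_low)
)
```

With $t = 3$ the enumeration yields 8 histories, and under the assumed toy function the always-high versus always-low contrast is -3.5 incidents. Only one of the 8 potential outcomes would ever be observed for a real country, which is why the individual-level contrast cannot be computed from data directly.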
This large number of potential outcomes allows for a very large number of comparisons and a host of causal questions: Does the stability of spending over time matter for the impact on the incidence of terrorism? Is there a cumulative impact of welfare spending or is it only the current level that matters?

These individual-level causal effects are difficult to identify without strong assumptions, so we often focus on estimating the average causal effect of a treatment history (Robins, Greenland, and Hu 1999; Hernán, Brumback, and Robins 2001):

$$\tau(x_{1:t}, x'_{1:t}) = E[Y_{it}(x_{1:t}) - Y_{it}(x'_{1:t})]. \tag{1}$$

Here, the expectations are over the units so that this quantity is the average difference in outcomes between the world where all units had history $x_{1:t}$ and the world where all units had history $x'_{1:t}$. For example, we might be interested in the effect of a country having always high welfare spending versus a country always having low spending levels. Thus, this quantity considers the effect of treatment at time t, but also the effect of all lagged values of the treatment as well. A graphical depiction of the pathways contained in $\tau(x_{1:t}, x'_{1:t})$ is presented in Figure 1, where the dotted arrows correspond to components of the effect. These arrows represent all of the effects of $X_{it}$, $X_{i,t-1}$, $X_{i,t-2}$, and so on, that end up at $Y_{it}$. Note that many of these effects flow through the time-varying covariates, $Z_{it}$. This point complicates the estimation of causal effects in this setting and we return to it below.

FIGURE 1. Directed acyclic graph (DAG) of typical TSCS data. Dotted lines are the causal pathways that constitute the average causal effect of a treatment history at time t. (Nodes: $X_{t-1}$, $Z_{t-1}$, $Y_{t-1}$, $X_t$, $Z_t$, $Y_t$.)

Marginal Effects of Recent Treatments

As mentioned above, there are numerous possible treatment histories to compare when estimating causal effects.
This can be daunting for applied researchers who may only be interested in the effects of the first few lags of welfare spending. Furthermore, any particular treatment history may not be well-represented in the data if the number of time periods is moderate. To avoid these problems, we introduce causal quantities that focus on recent values of treatment and average over more distant lags. We define the potential outcomes from intervening only on treatment in the last j periods as $Y_{it}(x_{t-j:t}) = Y_{it}(X_{i,1:t-j-1}, x_{t-j:t})$. This “marginal” potential outcome represents the potential or counterfactual level of terrorism in country i if we let welfare spending run its natural course up to $t-j-1$ and just set the last j lags of spending to $x_{t-j:t}$.³ With this definition in hand, we can define one important quantity of interest, the contemporaneous effect of treatment (CET) of $X_{it}$ on $Y_{it}$:

$$\begin{aligned} \tau_c(t) &= E[Y_{it}(X_{i,1:t-1}, 1) - Y_{it}(X_{i,1:t-1}, 0)] \\ &= E[Y_{it}(1) - Y_{it}(0)]. \end{aligned}$$

Here we have switched from potential outcomes that depend on the entire history to potential outcomes that only depend on treatment in time t. The CET reflects the effect of treatment in period t on the outcome in period t, averaging across all of the treatment histories up to period t. Thus, it would be the expected effect of switching a random country from low levels of welfare spending to high levels in period t. A graphical depiction of a CET is presented in Figure 2, where the dotted arrow corresponds to the component of the effect. It is common in pooled TSCS analyses to assume that this effect is constant over time so that $\tau_c(t) = \tau_c$.

³ See Shephard and Bojinov (2017) for a similar approach to defining recent effects in time-series data.

Researchers are also often interested in how more distant changes to treatment affect the outcome. Thus,
we define the lagged effect of treatment, which is the marginal effect of treatment in time $t-1$ on the outcome in time t, holding treatment at time t fixed: $E[Y_{it}(1, 0) - Y_{it}(0, 0)]$. More generally, the j-step lagged effect is defined as follows:

$$\begin{aligned} \tau_l(t, j) &= E[Y_{it}(X_{i,1:t-j-1}, 1, 0_j) - Y_{it}(X_{i,1:t-j-1}, 0, 0_j)] \\ &= E[Y_{it}(1, 0_j) - Y_{it}(0_{j+1})], \end{aligned} \tag{2}$$

where $0_s$ is a vector of s zero values. For example, the two-step lagged effect would be $E[Y_{it}(1, 0, 0) - Y_{it}(0, 0, 0)]$ and represents the effect of welfare spending two years ago on terrorism today, holding the intervening welfare spending fixed at low levels. A graphical depiction of the one-step lagged effect is presented in Figure 3, where again the dotted arrows correspond to components of the effect. These effects are similar to a common quantity of interest in both time-series and TSCS applications called the impulse response (Box, Jenkins, and Reinsel 2013).

FIGURE 2. DAG of a TSCS setting where the dotted line represents the contemporaneous effect of treatment at time t. (Nodes: $X_{t-1}$, $Z_{t-1}$, $Y_{t-1}$, $X_t$, $Z_t$, $Y_t$.)

FIGURE 3. DAG of a panel setting where the dotted lines represent the paths that constitute the lagged effect of treatment at time $t-1$ on the outcome at time t. (Nodes: $X_{t-1}$, $Z_{t-1}$, $Y_{t-1}$, $X_t$, $Z_t$, $Y_t$.)

Another common quantity of interest in the TSCS literature is the step response, which is the cumulative effect of a permanent shift in treatment status on some future outcome (Box, Jenkins, and Reinsel 2013; Beck and Katz 2011). The step response function, or SRF, describes how this effect varies by time period and distance between the shift and the outcome:

$$\tau_s(t, j) = E[Y_{it}(1_j) - Y_{it}(0_j)], \tag{3}$$

where $1_s$ has a similar definition to $0_s$. Thus, $\tau_s(t, j)$ is the effect of j periods of treatment starting at time $t-j$ on the outcome at time t. Without further assumptions, there are separate lagged effects and step responses for each pair of periods.
As we discuss next, traditional modeling of TSCS data imposes restrictions on the data-generating processes, in part, to summarize this large number of effects with a few parameters.

Relationship to Traditional TSCS Models

The potential outcomes and causal effects defined above are completely nonparametric in the sense that they impose no restrictions on the distribution of $Y_{it}$. To situate these quantities in the TSCS literature, it is helpful to see how they are parameterized in a particular TSCS model. One general model that encompasses many different possible specifications is an ADL model:⁴

$$Y_{it} = \beta_0 + \alpha Y_{i,t-1} + \beta_1 X_{it} + \beta_2 X_{i,t-1} + \varepsilon_{it}, \tag{4}$$

where $\varepsilon_{it}$ are independent and identically distributed errors, independent of $X_{is}$ for all t and s. The key features of such a model are the presence of lagged independent and dependent variables and the exogeneity of the independent variables. This model for the outcome would imply the following form for the potential outcomes:

$$Y_{it}(x_{1:t}) = \beta_0 + \alpha Y_{i,t-1}(x_{1:t-1}) + \beta_1 x_t + \beta_2 x_{t-1} + \varepsilon_{it}. \tag{5}$$

In this form, it is easy to see what TSCS scholars have long pointed out: causal effects are complicated with lagged dependent variables (LDVs) since a change in $x_{t-1}$ can have both a direct effect on $Y_{it}$ and an indirect effect through $Y_{i,t-1}$. This is why even seemingly simple TSCS models such as the ADL imply quite complicated expressions for long-run effects.

The ADL model also has implications for the various causal quantities, both short term and long term. The coefficient on the contemporaneous treatment, $\beta_1$, is constant over time and does not depend on past values of the treatment, so it is equal to the CET, $\tau_c(t) = \beta_1$. One can derive the lagged effects from different combinations of $\alpha$, $\beta_1$, and $\beta_2$:

$$\tau_l(t, 0) = \beta_1, \tag{6}$$

$$\tau_l(t, 1) = \alpha\beta_1 + \beta_2, \tag{7}$$

$$\tau_l(t, 2) = \alpha^2\beta_1 + \alpha\beta_2. \tag{8}$$

Note that these lagged effects are constant across t.
The step response, on the other hand, has a stronger impact because it accumulates the impulse responses over time:

$$\tau_s(t, 0) = \beta_1, \tag{9}$$

$$\tau_s(t, 1) = \beta_1 + \alpha\beta_1 + \beta_2, \tag{10}$$

$$\tau_s(t, 2) = \beta_1 + \alpha\beta_1 + \beta_2 + \alpha^2\beta_1 + \alpha\beta_2. \tag{11}$$

⁴ For introductions to modeling choices for TSCS data in political science, see De Boef and Keele (2008) and Beck and Katz (2011).
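As a quick check on Equations (6)-(11), the following minimal sketch computes the lagged effects and step responses implied by the ADL model and compares them with differences in potential outcomes obtained by iterating Equation (5) directly. The function names and parameter values are illustrative assumptions, and the closed form used for j > 2 extrapolates the pattern in Equations (6)-(8) rather than quoting the paper.

```python
def adl_lagged_effect(j, alpha, beta1, beta2):
    """j-step lagged effect implied by the ADL model in Equation (4)."""
    if j == 0:
        return beta1
    # The one-step effect (alpha*beta1 + beta2) decays geometrically through the LDV.
    return alpha ** (j - 1) * (alpha * beta1 + beta2)

def adl_step_response(j, alpha, beta1, beta2):
    """Sum of the lagged effects 0..j, i.e. treatment held at 1 from t-j through t."""
    return sum(adl_lagged_effect(k, alpha, beta1, beta2) for k in range(j + 1))

def adl_potential_outcome(x_path, alpha, beta1, beta2, beta0=0.0):
    """Iterate Equation (5) along x_path with errors and initial conditions set to zero."""
    y_prev, x_prev = 0.0, 0
    for x_t in x_path:
        y_prev = beta0 + alpha * y_prev + beta1 * x_t + beta2 * x_prev
        x_prev = x_t
    return y_prev

alpha, beta1, beta2 = 0.5, 1.0, 0.4  # hypothetical parameter values
for j in range(3):
    impulse = (1,) + (0,) * j       # treated once, j periods ago
    control = (0,) * (j + 1)        # never treated
    by_recursion = (adl_potential_outcome(impulse, alpha, beta1, beta2)
                    - adl_potential_outcome(control, alpha, beta1, beta2))
    print(j,
          adl_lagged_effect(j, alpha, beta1, beta2),
          by_recursion,
          adl_step_response(j, alpha, beta1, beta2))
```

The recursion and the closed form agree because the model is linear, so the effect of an impulse passes through the LDV unchanged except for repeated multiplication by α.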
Note that the step response here is just the sum of all previous lagged effects. It is clear that one benefit of such a TSCS model is to summarize a broad set of estimands with just a few parameters. This helps to simplify the complexity of the TSCS setting while introducing the possibility of bias if this model is incorrect or misspecified.

CAUSAL ASSUMPTIONS AND DESIGNS IN TSCS DATA

Under what assumptions are the above causal quantities identified? When we have repeated measurements on the outcome-treatment relationship, there are a number of assumptions we could invoke to identify causal effects. In this section, we discuss several of these assumptions. We focus on cross-sectional assumptions given our fixed time-window approach. That is, we make no assumptions on the time-series processes such as stationarity even though imposing these types of assumptions will not materially affect our conclusions about the bias of traditional TSCS methods. This result is confirmed in the simulations below, where the data generating process is stationary and the biases we describe below still occur.

Baseline Randomized Treatments

A powerful, if rare, research design for TSCS data is one that randomly assigns the entire history of treatment, $X_{1:T}$, at time $t = 0$. Under this assumption, treatment at time t cannot be affected by, say, previous values of the outcome or time-varying covariates. In terms of potential outcomes, the baseline randomized treatment history assumption is

$$\{Y_{it}(x_{1:t}) : t = 1, \ldots, T\} \perp\!\!\!\perp X_{i,1:t} \mid Z_{i0}, \tag{12}$$

where $A \perp\!\!\!\perp B \mid C$ is defined as “A is independent of B conditional on C.” This assumes that the entire history of welfare spending is independent of all potential levels of terrorism, possibly conditional on baseline (that is, time-invariant) covariates. Hernán, Brumback, and Robins (2001) called $X_{i,1:t}$ causally exogenous under this assumption.
The lack of time-varying covariates or past values of $Y_{it}$ on the right-hand side of the conditioning bar in Equation (12) implies that these variables do not confound the relationship between the treatment and the outcome. For example, this assumes there are no time-varying covariates that affect both welfare spending and the number of terrorist incidents. Thus, baseline randomization relies on strong assumptions that are rarely satisfied outside of randomized experiments and is unsuitable for most observational TSCS studies.⁵

Baseline randomization is closely related to exogeneity assumptions in linear TSCS models. For example, suppose we had the following distributed lag model with no autoregressive component:

$$Y_{it} = \beta_0 + \beta_1 X_{it} + \beta_2 X_{i,t-1} + \eta_{it}. \tag{13}$$

Here, baseline randomization of the treatment history, combined with the assumptions implicit in linear TSCS models, implies the usual identifying assumption in these models, strict exogeneity of the errors:

$$E[\eta_{it} \mid X_{i,1:T}] = E[\eta_{it}] = 0. \tag{14}$$

This is a mean independence assumption about the relationship between the errors, $\eta_{it}$, and the treatment history, $X_{i,1:T}$.

⁵ Notable exceptions are experiments with a panel design that randomize rollout of a treatment (e.g., Gerber et al. 2011).

Sequentially Randomized Treatments

Beginning with Robins (1986), scholars in epidemiology have expanded the potential outcomes framework to handle weaker identifying assumptions than baseline randomization. These innovations centered on sequentially randomized experiments, where at each period, $X_{it}$ was randomized conditional on the past values of the treatment and time-varying covariates (including past values of the outcome). Under this sequential ignorability assumption, the treatment is randomly assigned not at the beginning of the process, but at each point in time and can be affected by the past values of the covariates and the outcome.
At its core, sequential ignorability assumes there is some function or subset of the observed history up to time t, $V_{it} = g(X_{i,1:t-1}, Y_{i,1:t-1}, Z_{i,1:t})$, that is sufficient to satisfy no unmeasured confounders for the effect of $X_{it}$ on future outcomes. Formally, the assumption states that, conditional on this set of variables, $V_{it}$, the treatment at time t is independent of the potential outcomes at time t:

Assumption 1 (Sequential Ignorability). For every treatment history $x_{1:T}$ and period t,

$$\{Y_{is}(x_{1:s}) : s = t, \ldots, T\} \perp\!\!\!\perp X_{it} \mid V_{it}. \tag{15}$$

For example, a researcher might assume that sequential ignorability for current welfare spending holds conditional on lagged levels of terrorism, lagged welfare spending, and some contemporaneous covariates, so that $V_{it} = \{Y_{i,t-1}, X_{i,t-1}, Z_{it}\}$. Unlike baseline randomization and strict exogeneity, it allows for observed time-varying covariates like conflict status and lagged values of terrorism to confound the relationship between welfare spending and current terrorism levels, so long as we have measures of these confounders. Furthermore, these time-varying covariates can be affected by past values of welfare spending.

In the context of traditional linear TSCS models such as Equation (4), with their implicit assumptions, sequential ignorability implies the sequential exogeneity assumption:

$$E[\varepsilon_{it} \mid X_{i,1:t}, Z_{i,1:t}, Y_{i,1:t-1}] = E[\varepsilon_{it} \mid X_{it}, V_{it}] = 0. \tag{16}$$
According to the model in Equation (4), the time-varying covariates here would include the LDV. This assumption states that the errors of the TSCS model are mean independent of welfare spending at time t given the conditioning set that depends on the history of the data up to t. Thus, this allows the errors for levels of terrorism to be related to future values of welfare spending.

Sequential ignorability weakens baseline randomization to allow for feedback between the treatment status and the time-varying covariates, including lagged outcomes. For instance, sequential ignorability allows for the welfare spending of a country to impact future levels of terrorism and for this terrorism to affect future welfare spending. Thus, in this dynamic case, treatments can affect the covariates and so the covariates also have potential responses: $Z_{it}(x_{1:t-1})$. This dynamic feedback implies that the lagged treatment may have both a direct effect on the outcome and an indirect effect through these covariates. For example, welfare spending might directly affect terrorism by reducing resentment among potential terrorists, but it might also have an indirect effect if it helps to increase levels of state capacity which could, in turn, help combat future terrorism.

In TSCS models, the LDV is often included in the above time-varying conditioning set, $V_{it}$, to assess the dynamics of the time-series process or to capture the effects of longer lags of treatment in a simple manner.⁶ In either case, sequential ignorability would allow the LDV to have an effect on the treatment history as well, but baseline randomization would not. For instance, welfare spending may have a strong effect on terrorism levels which, in turn, affect future welfare spending. Under this type of feedback, an LDV must be in the conditioning set $V_{it}$ and strict exogeneity will be violated.
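The contrast between strict and sequential exogeneity can be seen in a small simulation. The sketch below is a two-period toy example, not the paper's simulation design: the data-generating process, the function names, and every parameter value are assumptions chosen for illustration. Treatment in period 2 responds to the period-1 outcome, so the period-1 error is unrelated to current treatment but correlated with future treatment.

```python
import numpy as np

def simulate_feedback(n=100_000, beta=1.0, gamma=1.5, seed=0):
    """Two-period toy panel in which period-2 treatment responds to the period-1 outcome.

    All parameter values are illustrative assumptions, not estimates from the paper.
    """
    rng = np.random.default_rng(seed)
    x1 = rng.binomial(1, 0.5, size=n)   # period-1 treatment, randomized
    eps1 = rng.normal(size=n)           # period-1 outcome error
    y1 = beta * x1 + eps1               # period-1 outcome
    # Feedback: the probability of treatment in period 2 rises with the lagged outcome.
    p2 = 1.0 / (1.0 + np.exp(-gamma * (y1 - y1.mean())))
    x2 = rng.binomial(1, p2)
    return x1, x2, eps1

def corr(a, b):
    """Pearson correlation between two one-dimensional arrays."""
    return float(np.corrcoef(a, b)[0, 1])

x1, x2, eps1 = simulate_feedback()
print("corr(eps_1, X_1):", round(corr(eps1, x1), 3))  # current treatment
print("corr(eps_1, X_2):", round(corr(eps1, x2), 3))  # future treatment
```

In this toy setup the first correlation should be close to zero and the second clearly positive, which is the pattern that strict exogeneity in Equation (14) rules out and sequential exogeneity in Equation (16) permits.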
Unmeasured Confounding and Fixed Effects Assumptions

Sequential ignorability is a selection-on-observables assumption: the researcher must be able to choose a (time-varying) conditioning set to eliminate any confounding. An oft-cited benefit of having repeated observations is that it allows scholars to estimate causal effects in spite of time-constant unmeasured confounders. Linear fixed effects models have the benefit of adjusting for all time-constant covariates, measured or unmeasured. This would be very helpful if, for instance, each country had its own baseline level of welfare spending that was determined by factors correlated with terrorist attacks, but the year-to-year variation in spending within a country was exogenous. At first glance, this ability to avoid time-constant omitted variable bias appears to be a huge benefit.

⁶ In certain parametric models, the LDV can be interpreted as summarizing the effects of the entire history of treatment. More generally, the LDV may effectively block confounding for contemporaneous treatment even if it has no causal effect on the current outcome.

Unfortunately, these fixed effects estimation strategies require within-unit baseline randomization to identify any quantity other than the CET (Sobel 2012; Imai and Kim 2017). Specifically, standard fixed effects models assume that previous values of covariates like GDP growth or lagged terrorist attacks (that is, the LDV) have no impact on the current value of welfare spending. Thus, to estimate any effects of lagged treatment, fixed effects models would allow for time-constant unmeasured confounding but would also rule out a large number of TSCS applications where there is feedback between the covariates and the treatment.
Furthermore, the assumptions of fixed-effects-style models in nonlinear settings can impose strong restrictions on over-time variation in the treatment and outcome (Chernozhukov et al. 2013). For these reasons, and because there is a large TSCS literature in political science that relies on selection-on-observables assumptions, we focus on situations where sequential ignorability holds. We return to the avenues for future research on fixed effects models in this setting in the conclusion.

THE POST-TREATMENT BIAS OF TRADITIONAL TSCS MODELS

Under sequential ignorability, standard TSCS models like the ADL model above can become biased for common TSCS estimands. The basic problem with these models is that sequential ignorability allows for the possibility of post-treatment bias when estimating lagged effects in the ADL model. While this problem is well known in statistics (Rosenbaum 1984; Robins 1997; Robins, Greenland, and Hu 1999), we review it here in the context of TSCS models to highlight the potential for biased and inconsistent estimators.

The root of the bias in the ADL approach is the nature of time-varying covariates, $Z_{it}$. Under the assumption of baseline randomization, there is no need to control or adjust for these covariates beyond the baseline covariates, $Z_{i0}$, because treatment is assigned at baseline; future covariates cannot confound past treatment assignment. The ADL approach thrives in this setting. But when baseline randomization is implausible, as we argue is true in most TSCS settings, we will typically require conditioning on these covariates to obtain credible causal estimates. And this conditioning on $Z_{it}$ is what can create large biases in the ADL approach. To demonstrate the potential for bias, we focus on a simple case where we are only interested in the first two lags of treatment and the sequential ignorability assumption holds with $V_{it} = \{Y_{i,t-1}, Z_{it}, X_{i,t-1}\}$.
This means that treatment is randomly assigned conditional on the contemporaneous value of the time-varying covariate and the lagged values of the outcome and the treatment. Given this setting, the ADL approach would model the outcome as follows:

$$Y_{it} = \beta_0 + \alpha Y_{i,t-1} + \beta_1 X_{it} + \beta_2 X_{i,t-1} + Z_{it}'\delta + \varepsilon_{it}. \tag{17}$$
Assuming this functional form is correct and assuming that $\varepsilon_{it}$ are independent and identically distributed, this model would consistently estimate the CET, $\beta_1$, given the sequential ignorability assumption. But what about the effect of lagged treatment? In the ADL approach, one would combine the coefficients as $\alpha\beta_1 + \beta_2$. The problem with this approach is that, if $Z_{it}$ is affected by $X_{i,t-1}$, then $Z_{it}$ will be post-treatment and in many cases induce bias in the estimation of $\beta_2$ (Rosenbaum 1984; Acharya, Blackwell, and Sen 2016). Why not simply omit $Z_{it}$ from our model? Because this would bias the estimates of the contemporaneous treatment effect, $\beta_1$, due to omitted variable bias.7

In this setting, there is no way to estimate the direct effect of lagged treatment without bias with a single ADL model. Unfortunately, even weakening the parametric modeling assumptions via matching or generalized additive models will fail to overcome this problem, because it is inherent to the data generating process (Robins 1997). These biases exist even in favorable settings for the ADL, such as when the outcome is stationary and treatment effects are constant over time. Furthermore, as discussed above, standard fixed effects models cannot eliminate this bias because it involves time-dependent causal feedback. Traditional approaches can only avoid the bias under special circumstances such as when treatment is randomly assigned at baseline or when the time-varying covariates are completely unaffected by treatment. Both of these assumptions lack plausibility in TSCS settings, which is why many TSCS studies control for time-varying covariates. Below, we demonstrate this bias in simulations, but we first turn to two methods from biostatistics that can avoid these biases.
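The dilemma described above (control for $Z_{it}$ and lose part of the lagged effect, or omit it and confound the contemporaneous effect) can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not the simulation design the paper reports later: it uses a hypothetical two-period process with no LDV, and every coefficient, along with the helper names `ols` and `simulate`, is an illustrative choice made here. In this process the lagged treatment shifts the outcome directly (0.3) and through the covariate (0.5 times 1.0), so the lagged effect with contemporaneous treatment held fixed is 0.8.

```python
import numpy as np

def ols(y, *cols):
    """Return OLS coefficients of y on an intercept plus the given columns."""
    design = np.column_stack([np.ones_like(y), *cols])
    return np.linalg.lstsq(design, y, rcond=None)[0]

def simulate(n=200_000, seed=0):
    """Hypothetical DGP; all coefficients are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    x_lag = rng.normal(size=n)                # X_{i,t-1}: as-if randomized
    z = 0.5 * x_lag + rng.normal(size=n)      # Z_it is affected by lagged treatment
    x = 0.8 * z + rng.normal(size=n)          # X_it responds to Z_it
    y = 1.0 * x + 0.3 * x_lag + 1.0 * z + rng.normal(size=n)
    return y, x, x_lag, z

y, x, x_lag, z = simulate()
with_z = ols(y, x, x_lag, z)       # [intercept, X_it, X_{i,t-1}, Z_it]
without_z = ols(y, x, x_lag)       # [intercept, X_it, X_{i,t-1}]

TRUE_CET = 1.0                     # effect of X_it on Y_it
TRUE_LAGGED = 0.3 + 0.5 * 1.0      # direct path plus the path through Z_it

print("controlling for Z:", with_z[1], with_z[2])
print("omitting Z:       ", without_z[1], without_z[2])
```

In this setup, the regression that controls for $Z_{it}$ recovers the contemporaneous coefficient but returns roughly the direct path (0.3) for the lagged treatment rather than 0.8, while the regression that omits $Z_{it}$ overstates the contemporaneous effect. Neither single regression delivers both quantities.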
7 A second issue is that ADL models often only include conditioning variables to identify the contemporaneous effect, not any lagged effects of treatment. Thus, the effect of $X_{i,t-1}$ might also suffer from omitted variable bias. This issue can be more easily corrected by including the proper conditioning set, $V_{i,t-1}$, in the model.

TWO METHODS FOR ESTIMATING THE EFFECT OF TREATMENT HISTORIES

If the traditional ADL model is biased in the presence of time-varying covariates, how can we proceed with estimating both contemporaneous and lagged effects of treatment in the TSCS setting? In this section, we show how to estimate the causal quantities of interest defined above under sequential ignorability using two approaches developed in biostatistics specifically to address this potential for bias. The first approach is based on SNMMs, which, in their simplest form, represent an extension of the ADL approach to avoid the post-treatment bias described above. The second class of estimators, based on MSMs and IPTW, is semiparametric in the sense that it models the treatment history but leaves the relationship between the outcome and the time-varying covariates unspecified. Because of this, MSMs have the advantage of being robust to our ability or inability to model the outcome. We focus our attention on these two broad classes of models because they are commonly used approaches that both (a) avoid post-treatment bias in this setting and (b) do not require the parametric modeling of the distribution of the time-varying covariates.

One modeling choice that is common to all of these approaches, including the ADL, is the choice of causal lag length. Should we attempt to estimate the effect of the entire history of welfare spending on terrorist incidents with potential outcome $Y_{it}(x_{1:t})$? Or should we only investigate the contemporaneous and first lagged effects with potential outcome $Y_{it}(x_{t-1}, x_t)$?
As we discussed above, we can always focus on effects that marginalize over lags of treatment beyond the scope of our investigation. Thus, this choice of lag length is less about the "correct" specification and more about choosing what question the researcher wants to answer. A separate question is what variables and their lags need to be included in the various models for our answers to be correct. We discuss the details of what needs to be controlled for and when in our discussion of each estimator.

Structural Nested Mean Models

Our first class of models, SNMMs, can be seen as an extension of the ADL approach that allows for estimation of lagged effects in a relatively straightforward manner (Robins 1986, 1997). At their most general, these models focus on parameterizing a conditional version of the lagged effects (that is, the impulse response function):8

$$b_t(x_{1:t}, j) = E\left[Y_{it}(x_{1:t-j}, 0_j) - Y_{it}(x_{1:t-j-1}, 0_{j+1}) \mid X_{1:t-j} = x_{1:t-j}\right]. \tag{18}$$

Robins (1997) refers to these impulse responses as "blip-down functions." This function gives the effect of a change from 0 to $x_{t-j}$ in terms of welfare spending on levels of terrorism at time $t$, conditional on the treatment history up to time $t - j$. Inference in SNMMs focuses on estimating the causal parameters of this function. The conditional mean of the outcome given the covariates needs to be estimated as part of this approach, but this is seen as a nuisance function rather than the object of direct interest.

8 Because of our focus on being faithful to the ADL setup, we assume that the lagged effects are constant across levels of the time-varying confounders, as is standard in ADL models. One can include interactions with these variables, though SNMMs then require additional models for $Z_{it}$. See Robins (1997, sec. 8.3) for more details.

Given the chosen lag length to study, a researcher must only specify the parameters of the impulse response up to that many lags. If we chose a lag length of 1, for example, then we might parameterize the impulse response function as

$$b_t(x_{1:t}, j; \gamma) = \gamma_j x_{t-j}, \quad j \in \{0, 1\}. \tag{19}$$
Here, $\gamma_j$ is the impulse effect of a one-unit change of welfare spending at lag $j$ on levels of terrorism, which does not depend on the past treatment history, $x_{1:t-1}$, or the time period $t$. Keeping the desired lag length, we could generalize this specification and have an impulse response that depends on past values of the treatment:

$$b_t(x_{1:t}, j; \gamma) = \gamma_{1j} x_{t-j} + \gamma_{2j} x_{t-j} x_{t-j-1}, \quad j \in \{0, 1\}, \tag{20}$$

where $\gamma_{2j}$ captures the interaction between contemporaneous and lagged values of welfare spending. Note that, given the definition of the impulse response, if $x_{t-j} = 0$, then $b_t = 0$ since this would be comparing the average effect of a change from 0 to 0. Choosing this function is similar to modeling $X_{i,t-j}$ in a regression: it requires the analyst to decide what nonlinearities or interactions are important to include for the effect of treatment. If $Y_{it}$ is not continuous, it is possible to choose an alternative functional form (such as one that uses a log link) that restricts the effects to the proper scale (Vansteelandt and Joffe 2014).

Note that the noninteractive impulse response function in Equation (19) can be seen as an alternative parameterization of the ADL(1,1) in Equation (4). When $j = 0$ in Equation (19) and an ADL(1,1) model holds, the contemporaneous effect $\gamma_0$ corresponds to the $\beta_1$ parameter from the ADL model. When $j = 1$ in Equation (19) and an ADL(1,1) model holds, the impulse response effect $\gamma_1$ corresponds to the $\alpha\beta_1 + \beta_2$ combination of parameters from the ADL model. We derive this connection in more detail below, but one important difference can be seen in this example. The SNMM approach directly models the lagged effects while the ADL model recreates these effects from all constituent path effects.
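The way an ADL(1,1) "recreates" lagged effects from its path coefficients can be sketched in a few lines. This is a minimal illustration that assumes the ADL(1,1) has no time-varying covariates beyond the LDV; the function name `adl_impulse_responses` and the parameter values in the usage line are hypothetical. Each lag's response is the previous response carried forward through the LDV coefficient $\alpha$, with the direct coefficient $\beta_2$ entering only at lag 1.

```python
def adl_impulse_responses(alpha, beta1, beta2, max_lag):
    """Lag-j effects of a one-unit treatment pulse implied by an ADL(1,1).

    alpha: coefficient on the lagged outcome (LDV)
    beta1: coefficient on contemporaneous treatment
    beta2: coefficient on the first lag of treatment
    """
    responses = [beta1]  # lag 0: the contemporaneous effect
    for j in range(1, max_lag + 1):
        # Carry the previous response through the LDV; the direct
        # lagged-treatment coefficient enters only at lag 1.
        direct = beta2 if j == 1 else 0.0
        responses.append(alpha * responses[-1] + direct)
    return responses

# Hypothetical parameter values, chosen only for illustration.
effects = adl_impulse_responses(alpha=0.5, beta1=1.0, beta2=0.2, max_lag=4)
```

For $j \geq 1$ the recursion reproduces the closed form $\alpha^{j}\beta_1 + \alpha^{j-1}\beta_2$, so the whole sequence of lagged effects is pinned down by three parameters. An SNMM with one $\gamma_j$ per lag imposes no such restriction.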
The key to the SNMM identification approach is that problems of post-treatment bias can be avoided by using a transformation of the outcome that leads to easy estimation of each conditional impulse response ($\gamma_j$). This transformation is

$$\tilde{Y}^{j}_{it} = Y_{it} - \sum_{s=0}^{j-1} b_t(X_{i,1:t}, s), \tag{21}$$

which, under the modeling assumptions of Equation (19), would be

$$\tilde{Y}^{j}_{it} = Y_{it} - \sum_{s=0}^{j-1} \gamma_s X_{i,t-s}. \tag{22}$$

These transformed outcomes are called the blipped-down or demediated outcomes. For example, the first blipped-down outcome, which we will use to estimate the first lagged effect, subtracts the contemporaneous effect for each unit off of the outcome, $\tilde{Y}^{1}_{it} = Y_{it} - \gamma_0 X_{it}$. Intuitively, the blip-down transformation subtracts off the effect of $j$ lags of treatment, creating an estimate of the counterfactual level of terrorism at time $t$ if welfare spending had been set to zero for $j$ periods before $t$. Robins (1994) and Robins (1997) show that, under sequential ignorability, the transformed outcome, $\tilde{Y}^{j}_{it}$, has the same expectation as this counterfactual, $Y_{it}(x_{1:t-j}, 0_j)$, conditional on the past. Thus, we can use the relationship between $\tilde{Y}^{j}_{it}$ and $X_{i,t-j}$ as an estimate of the $j$-step lagged effect of treatment, which can be used to create $\tilde{Y}^{j+1}_{it}$ and estimate the lagged effect for $j + 1$. This recursive structure of the modeling is what gives SNMM the "nested" moniker.

We focus on one approach to estimating the parameters called sequential g-estimation in the biostatistics literature (Vansteelandt 2009).9 This approach is similar to an extension of the standard ADL model in the sense that it requires modeling the conditional mean of the (transformed) outcome to estimate the effect of each lag under study. In particular, for lag $j$ the researcher must specify a linear regression of $\tilde{Y}^{j}_{it}$ on the variables in the assumed impulse response function, $b_t(x_{1:t}, j; \gamma)$, and whatever covariates are needed to satisfy sequential ignorability.
For example, suppose we focused on the contemporaneous effect and the first lagged effect of welfare spending and we adopted the simple impulse response $b_t(x_{1:t}, j; \gamma) = \gamma_j x_{t-j}$ for both of these effects. As above, we assume that sequential ignorability holds conditional on $V_{it} = \{X_{i,t-1}, Y_{i,t-1}, Z_{it}\}$. Sequential g-estimation involves the following steps:

1. For $j = 0$, we would regress the untransformed outcome on $\{X_{it}, X_{i,t-1}, Y_{i,t-1}, Z_{it}\}$, just as we would for the ADL model. If the modeling is correctly specified (as we would assume with the ADL approach), the coefficient on $X_{it}$ in this regression will provide an estimate of the blip-down parameter, $\gamma_0$ (the contemporaneous effect).
2. We would use $\hat{\gamma}_0$ to construct the one-lag blipped-down outcome, $\tilde{Y}^{1}_{it} = Y_{it} - \hat{\gamma}_0 X_{it}$.
3. This blipped-down outcome would be regressed on $\{X_{i,t-1}, X_{i,t-2}, Y_{i,t-2}, Z_{i,t-1}\}$ to estimate the next blip-down parameter, $\gamma_1$ (the first lagged effect).

If more than two lags are desired, we could use $\hat{\gamma}_1$ to construct the second set of blipped-down outcomes, $\tilde{Y}^{2}_{it} = \tilde{Y}^{1}_{it} - \hat{\gamma}_1 X_{i,t-1}$, which could then be regressed on $\{X_{i,t-2}, X_{i,t-3}, Y_{i,t-3}, Z_{i,t-2}\}$ to estimate $\gamma_2$. This iteration can continue for as many lags as desired. This approach avoids including a post-treatment covariate when estimating a particular lagged effect. That is, when estimating the effect of welfare spending at lag $j$, only variables causally prior to welfare spending at that point are included in the regression. Standard errors for all of the estimated effects can be estimated using a consistent variance estimator presented in the Supplemental Material or via a block bootstrap.

9 See Acharya, Blackwell, and Sen (2016) for an introduction to this method in political science.
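The three steps of sequential g-estimation can be sketched with pooled OLS on a simulated panel. This is a minimal illustration under stated assumptions rather than the authors' implementation: the data-generating process, its coefficients, and the names `ols`, `simulate_panel`, and `sequential_g` are all hypothetical choices made here, and no standard errors are computed (the second-stage regression uses an estimated $\hat{\gamma}_0$, so naive OLS standard errors would not be appropriate). In this simulated process the lagged treatment affects the outcome directly (0.2), through the lagged outcome (1.0 times 0.4), and through the covariate (0.5 times 1.0), so the first lagged effect is 1.1.

```python
import numpy as np

def ols(y, cols):
    """OLS coefficients of y on an intercept plus a list of columns."""
    design = np.column_stack([np.ones(len(y))] + list(cols))
    return np.linalg.lstsq(design, y, rcond=None)[0]

def simulate_panel(n_units=5000, n_periods=20, seed=1):
    """Hypothetical panel with treatment-covariate feedback (illustrative values)."""
    rng = np.random.default_rng(seed)
    y = np.zeros((n_units, n_periods))
    x = np.zeros_like(y)
    z = np.zeros_like(y)
    for t in range(1, n_periods):
        z[:, t] = 0.5 * x[:, t - 1] + rng.normal(size=n_units)   # Z_it depends on X_{i,t-1}
        x[:, t] = 0.5 * z[:, t] + 0.1 * y[:, t - 1] + rng.normal(size=n_units)
        y[:, t] = (1.0 * x[:, t] + 0.2 * x[:, t - 1] + 0.4 * y[:, t - 1]
                   + 1.0 * z[:, t] + rng.normal(size=n_units))
    return y, x, z

def sequential_g(y, x, z):
    """Estimate gamma_0 and gamma_1 by sequential g-estimation (pooled OLS)."""
    # Align observations at t, t-1, and t-2 for t = 2, ..., T-1.
    y_t, x_t, z_t = y[:, 2:].ravel(), x[:, 2:].ravel(), z[:, 2:].ravel()
    y_l1, x_l1, z_l1 = y[:, 1:-1].ravel(), x[:, 1:-1].ravel(), z[:, 1:-1].ravel()
    y_l2, x_l2 = y[:, :-2].ravel(), x[:, :-2].ravel()

    # Step 1: ADL-style regression; the coefficient on X_it estimates gamma_0.
    step1 = ols(y_t, [x_t, x_l1, y_l1, z_t])
    gamma0 = step1[1]

    # Step 2: blip down the outcome by removing the contemporaneous effect.
    y_blip = y_t - gamma0 * x_t

    # Step 3: regress on variables causally prior to X_{i,t-1} (plus X_{i,t-1}).
    step3 = ols(y_blip, [x_l1, x_l2, y_l2, z_l1])
    gamma1 = step3[1]

    # For comparison: the ADL combination alpha * beta_1 + beta_2 from step 1.
    adl_lag1 = step1[3] * step1[1] + step1[2]
    return gamma0, gamma1, adl_lag1

gamma0_hat, gamma1_hat, adl_lag1_hat = sequential_g(*simulate_panel())
print(gamma0_hat, gamma1_hat, adl_lag1_hat)
```

Under these assumptions the blipped-down regression targets the full lagged effect of 1.1, whereas the ADL combination from the first-stage coefficients targets only 0.4 times 1.0 plus 0.2, or 0.6, because conditioning on $Z_{it}$ removes the path running through the covariate.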
This sequential g-estimation approach requires the correct specification of the relationship between the (transformed) outcome and the covariate and treatment histories. It thus requires a similar regression model to the ADL approach described above. More complicated SNMM estimators can incorporate a model for the treatment process, providing some robustness to the modeling choices for the outcome. These estimators are consistent for the parameters of the SNMM when either the model for the (transformed) outcome or the model for the treatment process is correctly specified. This property is called double robustness because there are "two shots" to achieve consistency. Vansteelandt and Joffe (2014) provide a review of these methods for SNMMs.

Relationship to the ADL model. As we mentioned above, the ADL approach and the sequential g-estimation version of SNMM presented above are very similar when the time-varying covariates, $Z_{it}$, are not affected by treatment. One intuition for this result is that the ADL model and the SNMM with the linear model are equivalent when there are no covariates aside from the LDV. To see this, suppose that the ADL model in Equation (4) is correct and perform the first transformation from step 2 above, noting, as above, that the contemporaneous effect is the same for both models, $\gamma_0 = \beta_1$:

\begin{align}
Y_{it} - \gamma_0 X_{it} &= Y_{it} - \beta_1 X_{it} \tag{23} \\
&= \beta_0 + \alpha Y_{i,t-1} + \beta_2 X_{i,t-1} + \varepsilon_{it} \tag{24} \\
&= \beta_0 + \alpha\left(\beta_0 + \alpha Y_{i,t-2} + \beta_1 X_{i,t-1} + \beta_2 X_{i,t-2} + \varepsilon_{i,t-1}\right) + \beta_2 X_{i,t-1} + \varepsilon_{it} \tag{25} \\
&= (\beta_0 + \alpha\beta_0) + \alpha^2 Y_{i,t-2} + \underbrace{(\alpha\beta_1 + \beta_2)}_{\gamma_1} X_{i,t-1} + \alpha\beta_2 X_{i,t-2} + (\alpha\varepsilon_{i,t-1} + \varepsilon_{it}). \tag{26}
\end{align}

From this, we can see that the coefficient on $X_{i,t-1}$ for this transformed outcome is simply the impulse response at lag 1, which is exactly the quantity that the SNMM targets.
Given the ADL and SNMM assumptions above, this quantity will be $\alpha\beta_1 + \beta_2$ for the ADL model and $\gamma_1$ for the SNMM. Of course, this correspondence will continue for all lagged effects and Table 1 shows how the two sets of quantities relate for various lags.

TABLE 1. The lagged effects, or impulse responses, under the ADL(1,1) in Equation (4) and SNMM in Equation (19).

Lag   ADL                                    SNMM
0     $\beta_1$                              $\gamma_0$
1     $\alpha\beta_1 + \beta_2$              $\gamma_1$
2     $\alpha^2\beta_1 + \alpha\beta_2$      $\gamma_2$
3     $\alpha^3\beta_1 + \alpha^2\beta_2$    $\gamma_3$
4     $\alpha^4\beta_1 + \alpha^3\beta_2$    $\gamma_4$

Furthermore, in the Supplemental Material, we show that the sequential g-estimation estimator with no covariates except an LDV is nearly mechanically equivalent to a traditional ADL estimator with one lag. The difference is that the traditional ADL model relies on an assumption that the contemporaneous effect is constant over time, whereas sequential g-estimation relaxes this assumption. This provides a useful interpretation of the ADL model in terms of counterfactual causal effects. It is important to note, however, that this equivalence also relies on the form of the ADL model, which uses only three parameters regardless of the number of lags, while the SNMM in this version uses a new parameter for every lag. Additionally, the near equivalence disappears once there is an additional time-varying covariate ($Z_{it}$) in the model.

Marginal Structural Models

One potential downside of the SNMM approach is that it requires the analyst to correctly model the relationship between the time-varying covariates and the outcome. This can be difficult when the outcome is a complicated process and there is little theoretical guidance for specifying the outcome-covariate relationships.
An alternative that relies instead on modeling the treatment-covariate relationship is called a marginal structural model or MSM (Robins, Hernán, and Brumback 2000).10 To specify an MSM, we first choose a potential outcome lag length to study and write a model for the marginal mean of those potential outcomes in terms of the treatment history. At the most general, then, an MSM would be the following:

\begin{equation}
E[Y_{it}(x_{1:t})] = g(x_{1:t}; \beta), \tag{27}
\end{equation}

where the function $g$ operates similarly to a link function in a generalized linear model.11 These models are similar to the impulse response functions in the SNMM approach, $b_t$, because they provide structure for the treatment-outcome relationship. For instance, suppose that we were focused on the contemporaneous effect and the effect of the first two lags and so we had to model $E[Y_{it}(x_{t-2:t})] = g(x_{t-2:t}; \beta)$, marginalizing over further lags and other covariates.

10 For a detailed introduction to and application of MSMs in political science, see Blackwell (2013).
11 These marginal structural models are similar in spirit to transfer functions in the context of pure time-series data (Box, Jenkins, and Reinsel 2013).

If $Y_{it}$ were approximately continuous, as in the case of the number of terrorist incidents, we might take $g$ to be linear and focus
on the additive effects of each period of treatment:12

\begin{equation}
g(x_{t-2:t}; \beta) = \beta_0 + \beta_1 x_t + \beta_2 x_{t-1} + \beta_3 x_{t-2}. \tag{28}
\end{equation}

If $Y_{it}$ were binary, we might instead assume $g$ to have a logistic form:

\begin{equation}
g(x_{t-2:t}; \beta) = \frac{\exp(\beta_0 + \beta_1 x_t + \beta_2 x_{t-1} + \beta_3 x_{t-2})}{1 + \exp(\beta_0 + \beta_1 x_t + \beta_2 x_{t-1} + \beta_3 x_{t-2})}. \tag{29}
\end{equation}

In both of these cases, we have restricted our attention to the last three periods of treatment and so we cannot answer questions about longer-term effects with these models. On the other hand, as we increase the number of lags under study, the number of parameters needed to summarize the effects grows and the model can become unwieldy. Thus, we may consider focusing on the effect of the cumulative number of treated periods, $\sum_{s=1}^{t} x_{is}$. This allows for the entire history of treatment to affect the outcome in a structured, low-dimensional way. Under any of these models, the average causal effect becomes

\begin{equation}
\tau(x_{1:t}, x'_{1:t}) = g(x_{1:t}; \beta) - g(x'_{1:t}; \beta). \tag{30}
\end{equation}

Of course, the MSM specification will place restrictions on the average causal effects. An MSM that is a function of only the cumulative treatment, for instance, implies that $\tau(x_{1:t}, x'_{1:t}) = 0$ if $x_{1:t}$ and $x'_{1:t}$ have the same number of treated periods, even if their sequence differs.

How can a researcher estimate an MSM? If one blindly follows the model in Equation (28) and regresses $Y_{it}$ on $\{X_{i,t-2}, X_{i,t-1}, X_{it}\}$ using ordinary least squares, there will be omitted variable bias in the estimated coefficients. But as we have seen above, simply including time-varying covariates in these models can lead to post-treatment bias. Fortunately, the causal parameters of these models are estimable using an IPTW approach where we adjust for time-varying covariates using the propensity score weights, not the outcome model itself, avoiding post-treatment bias (Robins, Hernán, and Brumback 2000).
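To make the restrictions implied by different MSM specifications concrete, here is a minimal Python sketch of the linear and logistic forms in Equations (28) and (29), a cumulative-treatment specification, and the contrast in Equation (30). The function names and coefficient values are hypothetical and serve only as an illustration; in practice the coefficients would be estimated.

```python
import numpy as np


def g_linear(x_hist, beta):
    """Linear MSM, Eq. (28). x_hist = (x_{t-2}, x_{t-1}, x_t); beta = (b0, b1, b2, b3)."""
    x_tm2, x_tm1, x_t = x_hist
    return beta[0] + beta[1] * x_t + beta[2] * x_tm1 + beta[3] * x_tm2


def g_logistic(x_hist, beta):
    """Logistic MSM, Eq. (29): inverse logit of the linear index."""
    eta = g_linear(x_hist, beta)
    return 1.0 / (1.0 + np.exp(-eta))


def g_cumulative(x_hist, beta):
    """MSM that depends only on the number of treated periods; beta = (b0, b1)."""
    return beta[0] + beta[1] * np.sum(x_hist)


def tau(g, x_hist, x_alt, beta):
    """Average causal effect of history x_hist versus x_alt, Eq. (30)."""
    return g(x_hist, beta) - g(x_alt, beta)


# Hypothetical coefficients, for illustration only.
beta_lin = (0.5, 1.0, 0.4, 0.1)
beta_cum = (0.0, 2.0)

early = (1, 0, 0)  # treated two periods ago only
late = (0, 0, 1)   # treated in the current period only

# Same number of treated periods, different timing.
effect_linear = tau(g_linear, early, late, beta_lin)
effect_cumulative = tau(g_cumulative, early, late, beta_cum)
print(effect_linear, effect_cumulative)
```

Under the cumulative specification the two histories are indistinguishable, so the contrast is zero by construction, whereas the lag-specific linear form lets the timing of treatment matter.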
The weighting balances the distribution of the time-varying covariates across values of the treatments, so that omitting these variables in the reweighted data produces no omitted variable bias.

12 When the treatment is binary and the chosen lag length is short, we can relax the linearity assumption here by saturating the model with all interactions between the periods under study.

To use IPTW, a researcher must develop a model for the probability of treatment in period $t$ given the variables that satisfy sequential ignorability. For example, suppose that sequential ignorability holds conditional on some conditioning set $V_{it}$. If $X_{it}$ is binary, then we must obtain a consistent estimate of $\Pr[X_{it} = 1 \mid V_{it} = v]$. This might be a pooled logit, a generalized additive model with a flexible functional form, a boosted regression (McCaffrey, Ridgeway, and Morral 2004), or a covariate-balancing propensity score (CBPS) model (Imai and Ratkovic 2015). The IPTW approach requires this model to provide consistent estimates of the conditional predicted probability of treatment.13 In spite of this requirement, some methods for propensity score estimation such as CBPS have good finite-sample properties in the face of model misspecification (Imai and Ratkovic 2015).

We use the predicted probabilities from this treatment model to construct weights for each country-year. For example, suppose that $V_{it}$ included lagged levels of terrorism, $Y_{i,t-1}$, lagged welfare spending, $X_{i,t-1}$, and a set of time-varying covariates, $Z_{it}$. Then, for a binary treatment, we would construct the weights as

\begin{equation}
SW_{it} = \prod_{s=1}^{t} \frac{\Pr[X_{is} \mid X_{i,s-1}; \gamma]}{\Pr[X_{is} \mid Z_{is}, Y_{i,s-1}, X_{i,s-1}; \alpha]}. \tag{31}
\end{equation}

The denominator of each term in the product is the predicted probability of observing unit $i$'s observed treatment status in period $s$ ($X_{is}$), conditional on the covariates that satisfy sequential ignorability.14 When we multiply this over time, it is the probability of seeing this unit's treatment history conditional on the past.
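The weight construction in Equation (31) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a full estimator: the fitted probabilities `p_num` and `p_den` are taken as given (they would come from whatever treatment models the analyst fits), rows are assumed to be sorted by time within each unit, and the function name and the toy numbers are hypothetical.

```python
import numpy as np


def stabilized_weights(unit, x, p_num, p_den):
    """Cumulative stabilized weights for a binary treatment.

    unit  : unit identifiers, rows sorted by time within unit (assumption)
    x     : observed binary treatment (0/1)
    p_num : fitted Pr[X = 1 | lagged treatment]
    p_den : fitted Pr[X = 1 | covariates, lagged outcome, lagged treatment]
    Fitted probabilities must lie strictly between 0 and 1.
    """
    unit = np.asarray(unit)
    x = np.asarray(x)
    p_num = np.asarray(p_num, dtype=float)
    p_den = np.asarray(p_den, dtype=float)

    # Probability of the treatment status that was actually observed.
    num = np.where(x == 1, p_num, 1.0 - p_num)
    den = np.where(x == 1, p_den, 1.0 - p_den)
    ratio = num / den

    # Multiply the period-specific ratios over time within each unit.
    sw = np.empty_like(ratio)
    for u in np.unique(unit):
        rows = unit == u
        sw[rows] = np.cumprod(ratio[rows])
    return sw


# Toy data: two units observed for three periods (made-up probabilities).
unit = [1, 1, 1, 2, 2, 2]
x = [1, 0, 1, 0, 0, 1]
p_num = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
p_den = [0.8, 0.5, 0.25, 0.5, 0.2, 0.5]

sw = stabilized_weights(unit, x, p_num, p_den)
print(sw)
```

Observations whose treatment status was unlikely given their covariate history (a small denominator) receive larger weights, which is how the reweighted data come to resemble a setting where treatment does not depend on the time-varying covariates.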
The numerators here are the marginal probability of the observed treatment history; they stabilize the weights, which can otherwise be highly variable and lead to poor finite-sample performance (Cole and Hernán 2008). For instance, to construct this numerator we might run a pooled logistic regression of welfare spending in year $t$ on welfare spending in year $t-1$, omitting any time-varying covariates or LDVs. While this choice of numerator is not required for consistency of the estimator (it can be replaced with 1, for instance), it can help to stabilize weights that are highly variable and thus increase efficiency.

13 This requirement makes it difficult to apply IPTW to fixed-effects settings with binary treatments since estimating the unit-specific models would face an incidental parameters problem, at least for a fixed time window.
14 To ensure the weights are well-defined, the conditional probability of treatment given the past must be bounded away from zero and one. In the biostatistics literature, this assumption is called positivity and is similar to the overlap condition in the matching literature.

Under these assumptions, the expectation of $Y_{it}$ conditional on $X_{i,1:t}$ in the reweighted data converges to the true MSM:

\begin{equation}
E_{SW}[Y_{it} \mid X_{i,1:t} = x_{1:t}] \xrightarrow{p} E[Y_{it}(x_{1:t})]. \tag{32}
\end{equation}

Here $E_{SW}[\cdot]$ is the expectation in the reweighted data. For example, if we used the linear MSM in Equation (28), then we can estimate the causal parameters of the MSM by running a weighted least squares (WLS) regression of the outcome, $Y_{it}$, on $\{X_{i,t-2}, X_{i,t-1}, X_{it}\}$ with $SW_{it}$ as the weights. If sequential ignorability holds, the coefficients on the components of $X_{i,1:t}$ from this regression will have a causal interpretation, though they may depend on the particular modeling choices of the MSM (Robins, Hernán, and Brumback 2000). Standard errors can