# Regression and Causality

Section 3.1.2 shows how regression gives the best (MMSE) linear approximation to the CEF. This understanding, however, does not help us with the deeper question of when regression has a causal interpretation. When can we think of a regression coefficient as approximating the causal effect that might be revealed in an experiment?

A regression is causal when the CEF it approximates is causal. This doesn’t answer the question, of course. It just passes the buck up one level, since, as we’ve seen, a regression inherits its legitimacy from a CEF. Causality means different things to different people, but researchers working in many disciplines have found it useful to think of causal relationships in terms of the potential outcomes notation used in Chapter 2 to describe what would happen to a given individual in a hypothetical comparison of alternative hospitalization scenarios. Differences in these potential outcomes were said to be the causal effect of hospitalization. The CEF is causal when it describes differences in average potential outcomes for a fixed reference population.

In empirical work, the causal relationship between schooling and earnings tells us what people would earn—on average—if we could either change their schooling in a perfectly controlled environment, or change their schooling randomly so that those with different levels of schooling would be otherwise comparable. As we discussed in Chapter 2, experiments ensure that the causal variable of interest is independent of potential outcomes, so that the groups being compared are truly comparable. Here, we would like to generalize this notion to causal variables that take on more than two values, and to more complicated situations where we must hold a variety of "control variables" fixed for causal inferences to be valid. This leads to the conditional independence assumption (CIA), a core assumption that provides the (sometimes implicit) justification for the causal interpretation of regression. This assumption is sometimes called selection-on-observables because the covariates to be held fixed are assumed to be known and observed (e.g., in Goldberger, 1972; Barnow, Cain, and Goldberger, 1981). The big question, therefore, is what these control variables are, or should be. We’ll say more about that shortly. For now, we just do the econometric thing and call the covariates "Xi". As far as the schooling problem goes, it seems natural to imagine that Xi is a vector that includes measures of ability and family background.

For starters, think of schooling as a binary decision, like whether Angrist goes to college. Denote this by a dummy variable, Ci. The causal relationship between college attendance and a future outcome like earnings can be described using the same potential-outcomes notation we used to describe experiments in Chapter 2. To address this question, we imagine two potential earnings variables:

potential outcome = { Y1i if Ci = 1; Y0i if Ci = 0 }

In this case, Y0i is i’s earnings without college, while Y1i is i’s earnings if he goes. We would like to know the difference between Y1i and Y0i, which is the causal effect of college attendance on individual i. This is what we would measure if we could go back in time and nudge i onto the road not taken. The observed outcome, Yi, can be written in terms of potential outcomes as

Yi = Y0i + (Y1i − Y0i)Ci.

We get to see one of Y1i or Y0i, but never both. We therefore hope to measure the average of Y1i − Y0i, or the average for some group, such as those who went to college. This is E[Y1i − Y0i|Ci = 1].
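The switching equation and the fact that we never observe both potential outcomes for the same person can be illustrated with a small simulation. This is a hypothetical sketch with made-up numbers (Python with NumPy assumed), not an example from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Hypothetical potential earnings (illustrative numbers only)
y0 = rng.normal(30_000, 5_000, n)   # earnings without college, Y0i
y1 = y0 + 8_000                     # earnings with college, Y1i (constant effect here)
c = rng.integers(0, 2, n)           # college attendance dummy, Ci

# Observed outcome via the switching equation Yi = Y0i + (Y1i - Y0i)Ci
y = y0 + (y1 - y0) * c

# For each i we observe y1 when c = 1 and y0 when c = 0 -- never both
assert np.allclose(y[c == 1], y1[c == 1])
assert np.allclose(y[c == 0], y0[c == 0])
```

The assertions simply verify that the observed outcome coincides with whichever potential outcome the attendance dummy selects.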

In general, comparisons of those who do and don’t go to college are likely to be a poor measure of the causal effect of college attendance. Following the logic in Chapter 2, we have

E[Yi|Ci = 1] − E[Yi|Ci = 0] = E[Y1i − Y0i|Ci = 1] + E[Y0i|Ci = 1] − E[Y0i|Ci = 0]. (3.2.1)

The term on the left is the observed difference in average earnings; the first term on the right is the average treatment effect on the treated, and the remaining two terms are selection bias.

It seems likely that those who go to college would have earned more anyway. If so, selection bias is positive, and the naive comparison, E[Yi|Ci = 1] − E[Yi|Ci = 0], exaggerates the benefits of college attendance.
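The decomposition in (3.2.1) can be checked numerically. The following sketch uses a made-up data-generating process in which unobserved ability raises both potential earnings and the chance of attending college; all parameter values are hypothetical, and note that the selection-bias term is computable here only because the simulation reveals Y0i for everyone:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Ability raises both potential earnings and college attendance,
# generating positive selection bias (illustrative parameters).
ability = rng.normal(0, 1, n)
y0 = 30_000 + 5_000 * ability + rng.normal(0, 2_000, n)
y1 = y0 + 8_000                                  # true effect: 8,000 for everyone
c = (ability + rng.normal(0, 1, n) > 0).astype(int)

y = y0 + (y1 - y0) * c                           # observed earnings

naive = y[c == 1].mean() - y[c == 0].mean()      # observed difference
att = (y1 - y0)[c == 1].mean()                   # effect on the treated
selection_bias = y0[c == 1].mean() - y0[c == 0].mean()

# The decomposition in (3.2.1) holds as an identity in the sample
assert np.isclose(naive, att + selection_bias)
assert selection_bias > 0                        # college-goers would earn more anyway
```

Here the naive comparison substantially overstates the (constant) effect of 8,000, exactly as the positive selection-bias term predicts.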

The CIA asserts that conditional on observed characteristics, Xi, selection bias disappears. In this example, the CIA says,

{Y0i, Y1i} ⫫ Ci | Xi. (3.2.2)

Given the CIA, conditional-on-Xi comparisons of average earnings across schooling levels have a causal interpretation. In other words,

E[Yi|Xi, Ci = 1] − E[Yi|Xi, Ci = 0] = E[Y1i − Y0i|Xi].
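A sketch of the CIA at work, under an assumed data-generating process where attendance depends only on a single observed binary covariate (all numbers hypothetical): the unconditional comparison is biased, but the conditional-on-Xi contrasts recover the causal effect in each cell.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400_000

# CIA holds by construction: attendance depends only on the observed covariate x
x = rng.integers(0, 2, n)                       # e.g. a family-background dummy
p_attend = np.where(x == 1, 0.7, 0.3)
c = (rng.random(n) < p_attend).astype(int)

y0 = 30_000 + 10_000 * x + rng.normal(0, 2_000, n)
y1 = y0 + 8_000                                 # constant causal effect of 8,000
y = y0 + (y1 - y0) * c

naive = y[c == 1].mean() - y[c == 0].mean()     # contaminated by selection on x

# Conditional-on-x contrasts recover E[Y1 - Y0 | x] = 8,000 in each cell
for v in (0, 1):
    cell = x == v
    diff = y[cell & (c == 1)].mean() - y[cell & (c == 0)].mean()
    assert abs(diff - 8_000) < 200

assert naive > 8_200                            # unconditional comparison biased upward
```

Because attendance is as good as randomly assigned within each covariate cell, conditioning eliminates the selection-bias term entirely.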

Now, we’d like to expand the conditional independence assumption to causal relations that involve variables that can take on more than two values, like years of schooling, Si. The causal relationship between schooling and earnings is likely to be different for each person. We therefore use the individual-specific notation,

Ysi ≡ fi(s)

to denote the potential earnings that person i would receive after obtaining s years of education. If s takes on only two values, 12 and 16, then we are back to the college/no college example:

Y0i = fi(12); Y1i = fi(16).

More generally, the function fi(s) tells us what i would earn for any value of schooling, s. In other words, fi(s) answers causal "what if" questions. In the context of theoretical models of the relationship between human capital and earnings, the form of fi(s) may be determined by aspects of individual behavior and/or market forces.

The CIA in this more general setup becomes

Ysi ⫫ Si | Xi. (CIA)

In many randomized experiments, the CIA crops up because Si is randomly assigned conditional on Xi (in the Tennessee STAR experiment, for example, small classes were randomly assigned within schools). In an observational study, the CIA means that Si can be said to be "as good as randomly assigned," conditional on Xi.

Conditional on Xi, the average causal effect of a one-year increase in schooling is E[fi(s) − fi(s − 1)|Xi], while the average causal effect of a 4-year increase in schooling is E[fi(s) − fi(s − 4)|Xi]. The data reveal only Yi = fi(Si), however; that is, fi(s) for s = Si. But given the CIA, conditional-on-Xi comparisons of average earnings across schooling levels have a causal interpretation. In other words,

E[Yi|Xi, Si = s] − E[Yi|Xi, Si = s − 1] = E[fi(s) − fi(s − 1)|Xi]

for any value of s. For example, we can compare the earnings of those with 12 and 11 years of schooling to learn about the average causal effect of high school graduation. This comparison has a causal interpretation because, given the CIA,

E[Yi|Xi, Si = 12] − E[Yi|Xi, Si = 11]
= E[fi(12) − fi(11)|Xi, Si = 12] + {E[fi(11)|Xi, Si = 12] − E[fi(11)|Xi, Si = 11]}.

Here, the selection bias term in braces is the average difference in the potential dropout-earnings of high school graduates and dropouts. Given the CIA, however, high school graduation is independent of potential earnings conditional on Xi, so the selection bias vanishes. Note also that in this case, the causal effect of high school graduation on high school graduates equals the average effect in the corresponding conditional population:

E[fi(12) − fi(11)|Xi, Si = 12] = E[fi(12) − fi(11)|Xi].

This is important, but less important than the elimination of selection bias in (3.2.1).

So far, we have constructed separate causal effects for each value taken on by the conditioning variable, Xi. This leads to as many causal effects as there are values of Xi, an embarrassment of riches. Empiricists almost always find it useful to boil a set of estimates down to a single summary measure, like the population average causal effect. By the law of iterated expectations, the population average causal effect of high school graduation is

E{E[Yi|Xi, Si = 12] − E[Yi|Xi, Si = 11]} (3.2.3)

= E{E[fi(12) − fi(11)|Xi]}

= E[fi(12) − fi(11)]. (3.2.4)

In the same spirit, we might be interested in the average causal effect of high school graduation on high school graduates:

E{E[Yi|Xi, Si = 12] − E[Yi|Xi, Si = 11] | Si = 12} (3.2.5)

= E{E[fi(12) − fi(11)|Xi] | Si = 12}

= E[fi(12) − fi(11)|Si = 12]. (3.2.6)

This parameter tells us how much high school graduates gained by virtue of having graduated. Likewise, for the effects of college graduation there is a distinction between E[fi(16) − fi(12)|Si = 16], the average causal effect on college graduates, and E[fi(16) − fi(12)], the population average effect.

The population average effect, (3.2.3), can be computed by averaging all of the Xi-specific effects using the marginal distribution of Xi, while the average effect on high school or college graduates averages the Xi-specific effects using the distribution of Xi in these groups. In both cases, the empirical counterpart is a matching estimator: we make comparisons across schooling groups for individuals with the same covariate values, compute the difference in their earnings, and then average these differences in some way.
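The two-step matching estimator described above can be sketched as follows, using a hypothetical discrete covariate and made-up cell-specific effects. The same cell-by-cell contrasts are averaged with two different weighting schemes, mirroring the population average effect (3.2.3) and the effect on graduates (3.2.5):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000

# Discrete covariate with heterogeneous, x-specific effects (illustrative setup)
x = rng.integers(0, 3, n)                        # three covariate cells
effect_by_x = np.array([2_000, 8_000, 14_000])   # E[f(12) - f(11) | x], hypothetical
p_grad = np.array([0.2, 0.5, 0.8])[x]            # graduation more likely for high x
d = (rng.random(n) < p_grad).astype(int)         # 1 = high school graduate

y0 = 20_000 + 3_000 * x + rng.normal(0, 1_000, n)
y = y0 + effect_by_x[x] * d                      # observed earnings

# Step 1 (matching): x-specific graduate-dropout contrasts
cells = np.unique(x)
diffs = np.array([y[(x == v) & (d == 1)].mean() - y[(x == v) & (d == 0)].mean()
                  for v in cells])

# Step 2 (averaging): weight by the marginal distribution of x, as in (3.2.3),
# or by the distribution of x among graduates, as in (3.2.5)
w_pop = np.array([(x == v).mean() for v in cells])
w_treated = np.array([((x == v) & (d == 1)).sum() / d.sum() for v in cells])

ate = diffs @ w_pop          # population average effect
att = diffs @ w_treated      # average effect on graduates

# Graduates are concentrated in high-effect cells, so the ATT exceeds the ATE here
assert att > ate
```

The example also makes concrete why the approach is not "automatic": matching and averaging are genuinely separate steps, and the answer depends on the weights used in the second step.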

In practice, there are many details to worry about when implementing a matching strategy. We fill in some of the technical details on the mechanics of matching in Section 3.3.1, below. Here we note that a global drawback of the matching approach is that it is not "automatic"; rather, it requires two steps, matching and averaging. Estimating the standard errors of the resulting estimates may not be straightforward, either.

A third consideration is that the two-way contrast at the heart of this subsection (high school or college completers versus dropouts) does not do full justice to the problem at hand. Since Si takes on many values, there are separate average causal effects for each possible increment in Si, which also must be summarized in some way. These considerations lead us back to regression.

Regression provides an easy-to-use empirical strategy that automatically turns the CIA into causal effects. Two routes can be traced from the CIA to regression. One assumes that fi(s) is both linear in s and the same for everyone except for an additive error term, in which case linear regression is a natural tool to estimate the features of fi(s). A more general but somewhat longer route recognizes that fi(s) almost certainly differs for different people, and, moreover, need not be linear in s. Even so, allowing for random variation in fi(s) across people, and for nonlinearity for a given person, regression can be thought of as a strategy for the estimation of a weighted average of the individual-specific difference, fi(s) − fi(s − 1). In fact, regression can be seen as a particular sort of matching estimator, capturing an average causal effect much like (3.2.3) or (3.2.5).

At this point, we want to focus on the conditions required for regression to have a causal interpretation and not on the details of the regression-matching analog. We therefore start with the first route, a linear constant-effects causal model. Suppose that

fi(s) = α + ρs + ηi. (3.2.7)

In addition to being linear, this equation says that the functional relationship of interest is the same for everyone. Again, s is written without an i subscript to index individuals, because equation (3.2.7) tells us what person i would earn for any value of s and not just the realized value, Si. In this case, however, the only individual-specific and random part of fi(s) is a mean-zero error component, ηi, which captures unobserved factors that determine potential earnings.

Substituting the observed value Si for s in equation (3.2.7), we have

Yi = α + ρSi + ηi. (3.2.8)

Equation (3.2.8) looks like a bivariate regression model, except that equation (3.2.7) explicitly associates the coefficients in (3.2.8) with a causal relationship. Importantly, because equation (3.2.7) is a causal model, Si may be correlated with potential outcomes, fi(s), or, in this case, the residual term in (3.2.8), ηi.

Suppose now that the CIA holds given a vector of observed covariates, Xi. In addition to the functional form assumption for potential outcomes embodied in (3.2.8), we decompose the random part of potential earnings, ηi, into a linear function of observable characteristics, Xi, and an error term, vi:

ηi = Xi′γ + vi,

where γ is a vector of population regression coefficients that is assumed to satisfy E[ηi|Xi] = Xi′γ. Because γ is defined by the regression of ηi on Xi, the residual vi and Xi are uncorrelated by construction. Moreover, by virtue of the CIA, we have

E[fi(s)|Xi, Si] = E[fi(s)|Xi] = α + ρs + Xi′γ.

Because mean-independence implies orthogonality, the residual in the linear causal model

Yi = α + ρSi + Xi′γ + vi (3.2.9)

is uncorrelated with the regressors, Si and Xi, and the regression coefficient ρ is the causal effect of interest. It bears emphasizing once again that the key assumption here is that the observable characteristics, Xi, are the only reason why ηi and Si (equivalently, fi(s) and Si) are correlated. This is the selection-on-observables assumption for regression models discussed over a quarter century ago by Barnow, Cain, and Goldberger (1981). It remains the basis of most empirical work in economics.
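As a numerical check on this argument, the following sketch simulates the constant-effects model with hypothetical parameter values (a scalar Xi for simplicity) and compares the long regression (3.2.9) with a short regression that omits Xi:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300_000

# Linear constant-effects model (3.2.9): Yi = alpha + rho*Si + Xi*gamma + vi,
# with schooling correlated with x (an ability measure) but vi independent of both.
alpha, rho, gamma = 10_000.0, 1_500.0, 4_000.0   # hypothetical parameters
x = rng.normal(0, 1, n)                          # observed ability measure
s = 12 + 2 * x + rng.normal(0, 1, n)             # schooling depends on x
v = rng.normal(0, 2_000, n)
y = alpha + rho * s + gamma * x + v

# Long regression of y on (1, s, x): the coefficient on s estimates rho
X_long = np.column_stack([np.ones(n), s, x])
b_long = np.linalg.lstsq(X_long, y, rcond=None)[0]

# Short regression omitting x: the coefficient on s picks up selection on ability
X_short = np.column_stack([np.ones(n), s])
b_short = np.linalg.lstsq(X_short, y, rcond=None)[0]

assert abs(b_long[1] - rho) < 50        # controls included: recovers rho
assert b_short[1] > rho + 1_000         # x omitted: biased upward
```

With the control included, OLS recovers ρ because vi is uncorrelated with both regressors; dropping it reintroduces exactly the selection-on-observables problem the CIA is meant to solve.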