Regression Meets Matching
The past decade or two has seen increasing interest in matching as an empirical tool. Matching as a strategy to control for covariates is typically motivated by the CIA, as for causal regression in the previous section. For example, Angrist (1998) used matching to estimate the effects of volunteering for military service on the later earnings of soldiers. These matching estimates have a causal interpretation assuming that, conditional on the individual characteristics the military uses to select soldiers (age, schooling, test scores), veteran status is independent of potential earnings.
An attractive feature of matching strategies is that they are typically accompanied by an explicit statement of the conditional independence assumption required to give matching estimates a causal interpretation. At the same time, we have just seen that the causal interpretation of a regression coefficient is based on exactly the same assumption. In other words, matching and regression are both control strategies. Since the core assumption underlying causal inference is the same for the two strategies, it’s worth asking whether or to what extent matching really differs from regression. Our view is that regression can be motivated as a computational device for a particular sort of weighted matching estimator, and therefore the differences between regression and matching are unlikely to be of major empirical importance.
To flesh out this idea, it helps to look more deeply into the mathematical structure of the matching and regression estimands, i.e., the population quantities that these methods attempt to estimate. For regression, of course, the estimand is a vector of population regression coefficients. The matching estimand is typically a particular weighted average of contrasts or comparisons across cells defined by covariates. This is easiest to see in the case of discrete covariates, as in the military service example, and for a discrete regressor such as veteran status, which we denote here by the dummy $D_i$. Since treatment takes on only two values, we can use the notation $Y_{1i} = Y_i(1)$ and $Y_{0i} = Y_i(0)$ to denote potential outcomes. A parameter of primary interest in this context is the average effect of treatment on the treated, $E[Y_{1i} - Y_{0i}|D_i = 1]$. This tells us the difference between the average earnings of soldiers, $E[Y_{1i}|D_i = 1]$, an observable quantity, and the counterfactual average earnings they would have obtained had they not served, $E[Y_{0i}|D_i = 1]$. Simply comparing observed earnings by veteran status gives a biased measure of the effect of treatment on the treated unless $D_i$ is independent of $Y_{0i}$. Specifically,
$$E[Y_i|D_i = 1] - E[Y_i|D_i = 0] = E[Y_{1i} - Y_{0i}|D_i = 1] + \{E[Y_{0i}|D_i = 1] - E[Y_{0i}|D_i = 0]\}.$$
In other words, the observed earnings difference by veteran status equals the average effect of treatment on the treated plus selection bias. This parallels the discussion of selection bias in Chapter 2.
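This decomposition is an accounting identity, so it holds exactly in any sample, not just in population. A quick simulated check (the data-generating process below is hypothetical, not Angrist's data):

```python
# Verify that the naive treated/untreated gap equals the effect of
# treatment on the treated plus selection bias. Simulated data only.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Potential outcomes: treatment raises earnings by 2; selection into
# treatment depends positively on Y0, generating selection bias.
y0 = rng.normal(10, 2, n)
d = (y0 + rng.normal(0, 2, n) > 10).astype(int)
y1 = y0 + 2.0
y = np.where(d == 1, y1, y0)

naive_gap = y[d == 1].mean() - y[d == 0].mean()
tot = (y1 - y0)[d == 1].mean()                       # effect on the treated
selection_bias = y0[d == 1].mean() - y0[d == 0].mean()

# The identity holds exactly in the sample.
assert np.isclose(naive_gap, tot + selection_bias)
```

Because selection here is positive (those with high $Y_{0}$ are more likely to be treated), the naive gap overstates the effect on the treated, just as in the veterans example.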
Given the CIA, selection bias disappears after conditioning on $X_i$, so the effect of treatment on the treated can be constructed by iterating expectations over $X_i$:
$$\delta_{TOT} = E[Y_{1i} - Y_{0i}|D_i = 1] = E\{E[Y_{1i}|X_i, D_i = 1] - E[Y_{0i}|X_i, D_i = 1]\,|\,D_i = 1\}.$$

Of course, $E[Y_{0i}|X_i, D_i = 1]$ is counterfactual. By virtue of the CIA, however,

$$E[Y_{0i}|X_i, D_i = 0] = E[Y_{0i}|X_i, D_i = 1].$$

Therefore,

$$\delta_{TOT} = E\{E[Y_i|X_i, D_i = 1] - E[Y_i|X_i, D_i = 0]\,|\,D_i = 1\} = E[\delta_X|D_i = 1], \qquad (3.3.1)$$

where

$$\delta_X \equiv E[Y_i|X_i, D_i = 1] - E[Y_i|X_i, D_i = 0]$$

is the random $X$-specific difference in mean earnings by veteran status at each value of $X_i$.
The matching estimator in Angrist (1998) uses the fact that $X_i$ is discrete to construct the sample analog of the right-hand side of (3.3.1). In the discrete case, the matching estimand can be written
$$E[Y_{1i} - Y_{0i}|D_i = 1] = \sum_x \delta_x P(X_i = x|D_i = 1), \qquad (3.3.2)$$
where $P(X_i = x|D_i = 1)$ is the probability mass function for $X_i$ given $D_i = 1$.[25] In this case, $X_i$ takes on values determined by all possible combinations of year of birth, test-score group, year of application to the military, and educational attainment at the time of application. The test score in this case is from the AFQT, used by the military to categorize the mental abilities of applicants (we included this as a control in the schooling regression discussed in Section 3.2.2). The Angrist (1998) matching estimator simply replaces $\delta_x$ by the sample veteran-nonveteran earnings difference for each combination of covariates, and then combines these cell-level contrasts in a weighted average using the empirical distribution of covariates among veterans.[26]
Note also that we can just as easily construct the unconditional average treatment effect,
$$\delta_{ATE} = E\{E[Y_i|X_i, D_i = 1] - E[Y_i|X_i, D_i = 0]\} = \sum_x \delta_x P(X_i = x) = E[Y_{1i} - Y_{0i}], \qquad (3.3.3)$$
which is the expectation of $\delta_X$ using the marginal distribution of $X_i$ instead of the distribution among the treated. $\delta_{TOT}$ tells us how much the typical soldier gained or lost as a consequence of military service, while $\delta_{ATE}$ tells us how much the typical applicant to the military gained or lost (since the Angrist, 1998, population consists of applicants).
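Both matching estimands are weighted averages of the same cell-specific contrasts; only the weights differ. A simulated sketch makes this concrete (the cell structure, treatment probabilities, and effect sizes below are invented for illustration):

```python
# Compute the matching estimands (3.3.2) and (3.3.3) as weighted averages
# of cell-specific treatment-control gaps. Simulated data, numpy only.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.integers(0, 4, n)                  # discrete covariate cells
p_treat = np.array([0.2, 0.4, 0.6, 0.8])   # P(D=1 | X=x), rises with x
d = (rng.random(n) < p_treat[x]).astype(int)
effect = np.array([1.0, 2.0, 3.0, 4.0])    # delta_x varies across cells
y = x + effect[x] * d + rng.normal(0, 1, n)

cells = np.arange(4)
delta_x = np.array([y[(x == c) & (d == 1)].mean() -
                    y[(x == c) & (d == 0)].mean() for c in cells])
p_x = np.array([(x == c).mean() for c in cells])
p_x_given_d1 = np.array([(x[d == 1] == c).mean() for c in cells])

tot = (delta_x * p_x_given_d1).sum()   # weights: covariate dist. of treated
ate = (delta_x * p_x).sum()            # weights: marginal covariate dist.
```

Because the cells with the largest effects are also the cells where treatment is most likely, `tot` exceeds `ate` here (about 3.0 versus 2.5 in population).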
The US military tends to be fairly picky about its soldiers, especially after downsizing at the end of the Cold War. For the most part, the military now takes only high school graduates with test scores in the upper half of the test score distribution. The resulting screening generates positive selection bias in naive comparisons of veteran and nonveteran earnings. This can be seen in Table 3.3.1, which reports differences-in-means, matching, and regression estimates of the effect of voluntary military service on the 1988-91 Social Security-taxable earnings of men who applied to join the military between 1979 and 1982. The matching estimates were constructed from the sample analog of (3.3.2). Although white veterans earn $1,233 more than nonveterans, this difference becomes negative once differences in covariates are matched away. Similarly, while nonwhite veterans earn $2,449 more than nonveterans, controlling for covariates reduces this premium to $840.
Table 3.3.1: Uncontrolled, matching, and regression estimates of the effects of voluntary military service on earnings

Notes: Adapted from Angrist (1998, Tables II and V). Standard errors are reported in parentheses. The table shows estimates of the effect of voluntary military service on the 1988-1991 Social Security-taxable earnings of men who applied to enter the armed forces between 1979 and 1982. The matching and regression estimates control for applicants' year of birth, education at the time of application, and AFQT score. There are 128,968 whites and 175,262 nonwhites in the sample.
Table 3.3.1 also shows regression estimates of the effect of voluntary military service, controlling for the same set of covariates that were used to construct the matching estimates. These are estimates of $\delta_R$ in the equation
$$Y_i = \sum_x d_{ix}\beta_x + \delta_R D_i + \varepsilon_i, \qquad (3.3.4)$$
where $d_{ix}$ is a dummy that indicates $X_i = x$, $\beta_x$ is the regression effect for cell $X_i = x$, and $\delta_R$ is the regression estimand. Note that this regression model allows a separate parameter for every value taken on by the covariates, and can therefore be said to be saturated in $X_i$. It is not "fully saturated," however, because it includes a single additive effect for $D_i$ with no $D_i \times X_i$ interactions.
Despite the fact that the matching and regression estimates control for the same variables, the regression estimates in Table 3.3.1 are somewhat larger than the matching estimates for both whites and nonwhites. In fact, the differences between the matching and regression results are statistically significant. At the same time, the two estimation strategies present a broadly similar picture of the effects of military service. The reason the regression and matching estimates are similar is that regression, too, can be seen as a sort of matching estimator: the regression estimand differs from the matching estimands only in the weights used to combine the covariate-specific effects, $\delta_X$, into a single average effect. In particular, matching uses the distribution of covariates among the treated to weight covariate-specific estimates into an estimate of the effect of treatment on the treated, while regression produces a variance-weighted average of these effects.
To see this, start by using the regression anatomy formula to write the coefficient on $D_i$ in the regression of $Y_i$ on $X_i$ and $D_i$ as

$$\delta_R = \frac{Cov(Y_i, \tilde{D}_i)}{V(\tilde{D}_i)} = \frac{E[(D_i - E[D_i|X_i])\,Y_i]}{V(\tilde{D}_i)} = \frac{E\{(D_i - E[D_i|X_i])\,E[Y_i|D_i, X_i]\}}{V(\tilde{D}_i)}. \qquad (3.3.5)$$

The second equality in this set of expressions uses the fact that saturating the model in $X_i$ means $E[D_i|X_i]$ is linear. Hence $\tilde{D}_i$, the residual from a regression of $D_i$ on $X_i$, is simply the difference between $D_i$ and $E[D_i|X_i]$. The third equality uses the fact that the regression of $Y_i$ on $D_i$ and $X_i$ is the same as the regression of $Y_i$ on $E[Y_i|D_i, X_i]$.
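The regression anatomy formula is easy to verify numerically. In the sketch below (simulated data, numpy only), the coefficient on the treatment dummy from a regression saturated in a discrete covariate is compared with the residual-based ratio $Cov(Y_i, \tilde{D}_i)/V(\tilde{D}_i)$, where $E[D_i|X_i]$ is estimated by cell means:

```python
# Numerical check of the regression-anatomy formula (3.3.5) in a model
# saturated in a discrete X. Simulated data; no econometrics packages.
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
x = rng.integers(0, 3, n)
d = (rng.random(n) < np.array([0.3, 0.5, 0.7])[x]).astype(int)
y = 2.0 * x + 1.5 * d + rng.normal(0, 1, n)

# Saturated regression: one dummy per X cell, plus D (no intercept needed).
X = np.column_stack([(x == c).astype(float) for c in range(3)] + [d])
delta_r = np.linalg.lstsq(X, y, rcond=None)[0][-1]

# Regression anatomy: residualize D on the cell means E[D|X].
e_d_given_x = np.array([d[x == c].mean() for c in range(3)])[x]
d_tilde = d - e_d_given_x
anatomy = np.cov(y, d_tilde)[0, 1] / d_tilde.var(ddof=1)

# The two computations agree up to numerical precision.
assert np.isclose(delta_r, anatomy, atol=1e-6)
```

Because the model is saturated in $X_i$, the fitted $E[D_i|X_i]$ is exactly the within-cell mean of $D_i$, which is why residualizing on cell means reproduces the multivariate coefficient.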
To simplify further, we expand the CEF, $E[Y_i|D_i, X_i]$, to get
$$E[Y_i|D_i, X_i] = E[Y_i|D_i = 0, X_i] + \delta_X D_i.$$
If covariates are unnecessary (in other words, if the CIA holds unconditionally, as in a randomized trial), this CEF becomes
$$E[Y_i|D_i, X_i] = E[Y_i|D_i = 0] + E[Y_{1i} - Y_{0i}]D_i,$$
from which we conclude that the regression of $Y_i$ on $D_i$ estimates the population average treatment effect in this case (e.g., as in the experiment discussed in Section 2.3). But here we are interested in the more general scenario where conditioning on $X_i$ is necessary to eliminate selection bias.
To evaluate the more general regression estimand, (3.3.5), we begin by substituting for $E[Y_i|D_i, X_i]$ in the numerator. This gives
$$E\{(D_i - E[D_i|X_i])E[Y_i|D_i, X_i]\} = E\{(D_i - E[D_i|X_i])E[Y_i|D_i = 0, X_i]\} + E\{(D_i - E[D_i|X_i])D_i\delta_X\}.$$
The first term on the right-hand side is zero because $E[Y_i|D_i = 0, X_i]$ is a function of $X_i$ and is therefore uncorrelated with $(D_i - E[D_i|X_i])$. For the same reason (i.e., because $\delta_X$ is a function of $X_i$), the second term simplifies to
$$E\{(D_i - E[D_i|X_i])D_i\delta_X\} = E\{(D_i - E[D_i|X_i])^2\delta_X\}.$$
Putting the numerator and denominator together, we have

$$\delta_R = \frac{E[(D_i - E[D_i|X_i])^2\delta_X]}{E[(D_i - E[D_i|X_i])^2]} = \frac{E\{E[(D_i - E[D_i|X_i])^2|X_i]\,\delta_X\}}{E\{E[(D_i - E[D_i|X_i])^2|X_i]\}} = \frac{E[\sigma_D^2(X_i)\,\delta_X]}{E[\sigma_D^2(X_i)]}, \qquad (3.3.7)$$

where

$$\sigma_D^2(X_i) = E[(D_i - E[D_i|X_i])^2|X_i]$$

is the conditional variance of $D_i$ given $X_i$. This establishes that the regression model, (3.3.4), produces a treatment-variance weighted average of $\delta_X$.
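With a saturated model and a binary treatment, this variance-weighting result holds exactly in sample, so it can be checked directly. The sketch below (simulated data with deliberately heterogeneous cell effects) compares the OLS coefficient on the treatment dummy with the treatment-variance weighted average of the cell contrasts:

```python
# Check that the saturated-in-X OLS coefficient on D equals the
# treatment-variance weighted average of cell-specific effects delta_x.
# Simulated data, numpy only.
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
x = rng.integers(0, 4, n)
p = np.array([0.1, 0.5, 0.5, 0.9])[x]        # P(D=1|X): varies across cells
d = (rng.random(n) < p).astype(int)
y = x + np.array([1.0, 2.0, 3.0, 4.0])[x] * d + rng.normal(0, 1, n)

# OLS with a full set of X dummies plus D.
X = np.column_stack([(x == c).astype(float) for c in range(4)] + [d])
delta_r = np.linalg.lstsq(X, y, rcond=None)[0][-1]

# Variance-weighted average of cell contrasts, all from sample quantities.
delta_x = np.array([y[(x == c) & (d == 1)].mean() -
                    y[(x == c) & (d == 0)].mean() for c in range(4)])
var_d = np.array([d[x == c].var() for c in range(4)])   # sigma^2_D(x)
p_x = np.array([(x == c).mean() for c in range(4)])
vw_avg = (delta_x * var_d * p_x).sum() / (var_d * p_x).sum()

assert np.isclose(delta_r, vw_avg)
```

Note that the extreme cells (treatment probabilities 0.1 and 0.9) get much less weight than the cells with probability 0.5, even though all four cells are equally common.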
Because the regressor of interest, $D_i$, is a dummy variable, one last step can be taken. In this case, $\sigma_D^2(X_i) = P(D_i = 1|X_i)(1 - P(D_i = 1|X_i))$, so
$$\delta_R = \frac{\sum_x \delta_x\,[P(D_i = 1|X_i = x)(1 - P(D_i = 1|X_i = x))]\,P(X_i = x)}{\sum_x [P(D_i = 1|X_i = x)(1 - P(D_i = 1|X_i = x))]\,P(X_i = x)}.$$
This shows that the regression estimand weights each covariate-specific treatment effect by $[P(D_i = 1|X_i = x)(1 - P(D_i = 1|X_i = x))]\,P(X_i = x)$. In contrast, the matching estimand for the effect of treatment on the treated can be written
$$\delta_{TOT} = \frac{\sum_x \delta_x\,P(D_i = 1|X_i = x)\,P(X_i = x)}{\sum_x P(D_i = 1|X_i = x)\,P(X_i = x)},$$

because

$$P(X_i = x|D_i = 1) = \frac{P(D_i = 1|X_i = x)\,P(X_i = x)}{P(D_i = 1)}.$$
So the weights used to construct $E[Y_{1i} - Y_{0i}|D_i = 1]$ are proportional to the probability of treatment at each value of the covariates.
The point of this derivation is that the treatment-on-the-treated estimand puts the most weight on covariate cells containing those who are most likely to be treated. In contrast, regression puts the most weight on covariate cells where the conditional variance of treatment status is largest. As a rule, this variance is maximized when $P(D_i = 1|X_i = x) = 1/2$, in other words, for cells where there are equal numbers of treated and control observations. Of course, the difference in weighting schemes is of little importance if $\delta_x$ does not vary across cells (though weighting still affects the statistical efficiency of estimators). In this example, however, the men who were most likely to serve in the military appear to have benefited least from their service. This is probably because those most likely to serve were most qualified, and therefore also had the highest civilian earnings potential. This fact leads matching estimates of the effect of military service to be smaller than regression estimates based on the same vector of control variables.[27]
Importantly, neither the regression nor the covariate-matching estimands give any weight to covariate cells that do not contain both treated and control observations. Consider a value of $X_i$, say $x^*$, where either no one is treated or everyone is treated. Then $\delta_{x^*}$ is undefined, while the regression weight, $P(D_i = 1|X_i = x^*)(1 - P(D_i = 1|X_i = x^*))$, is zero. In the language of the econometric literature on matching, both the regression and matching estimands impose common support; that is, they are limited to covariate values where both treated and control observations are found.[28]
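A small simulated example makes the common-support point concrete: in a cell where everyone is treated there are no controls, the cell contrast is undefined, and the regression weight $P(D_i = 1|X_i)(1 - P(D_i = 1|X_i))$ is exactly zero:

```python
# Common support: a cell with P(D=1|X) = 1 contributes nothing to either
# the matching or the regression estimand. Simulated data, numpy only.
import numpy as np

rng = np.random.default_rng(4)
n = 30_000
x = rng.integers(0, 3, n)
p = np.array([0.4, 0.6, 1.0])[x]          # in cell 2, everyone is treated
d = (rng.random(n) < p).astype(int)
y = x + 2.0 * d + rng.normal(0, 1, n)

weights = []   # per-cell regression weight component p(1-p)
controls = []  # per-cell count of untreated observations
for c in range(3):
    in_c = x == c
    p_hat = d[in_c].mean()
    weights.append(p_hat * (1 - p_hat))
    controls.append(int((in_c & (d == 0)).sum()))

# Cell 2 has no controls and exactly zero regression weight.
assert weights[2] == 0.0 and controls[2] == 0
```

Cells 0 and 1 retain positive weight, so the estimand is effectively defined only over the subpopulation with common support.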
The step from estimand to estimator is a little more complicated. In practice, both regression and matching estimators are implemented using modeling assumptions that implicitly involve a certain amount of extrapolation across cells. For example, matching estimators often combine covariate cells with few observations. This violates common support if the cells being combined do not each contain both treated and untreated observations. Regression models that are not saturated in $X_i$ may also violate common support, since covariate cells without both treated and control observations can end up contributing to the estimates by extrapolation. Here too, however, we see a symmetry between the matching and regression strategies: they are in the same class, in principle, and require the same sort of compromises in practice.[29]
Even More on Regression and Matching: Ordered and Continuous Treatments*
Does the pseudo-matching interpretation of regression outlined above for a binary treatment apply to models with ordered and continuous treatments? The long answer is fairly technical and may be more than you want to know. The short answer is, to one degree or another, "yes."
As we've already discussed, one interpretation of regression is that the population OLS slope vector provides the minimum mean squared error (MMSE) linear approximation to the CEF. This, of course, works for ordered and continuous regressors as well as for binary ones. A related property is the fact that regression coefficients have an "average derivative" interpretation. In multivariate regression models, this interpretation is unfortunately complicated by the fact that the OLS slope vector is a matrix-weighted average of the gradient of the CEF. Matrix-weighted averages are difficult to interpret except in special cases (see Chamberlain and Leamer, 1976). An important special case in which the average derivative property is relatively straightforward is a regression model for an ordered or continuous treatment with a saturated model for covariates. To avoid lengthy derivations, we simply explain the formulas. A derivation is sketched in the appendix to this chapter. For additional details, see the appendix to Angrist and Krueger (1999).
For the purposes of this discussion, the treatment intensity, $S_i$, is assumed to be a continuously distributed random variable, not necessarily non-negative. Suppose that the CEF of interest can be written $h(t) = E[Y_i|S_i = t]$, with derivative $h'(t)$. Then
$$\frac{E[Y_i(S_i - E[S_i])]}{E[S_i(S_i - E[S_i])]} = \frac{\int h'(t)\,\mu_t\,dt}{\int \mu_t\,dt}, \qquad (3.3.8)$$

where

$$\mu_t = \{E[S_i|S_i > t] - E[S_i|S_i < t]\}\{P(S_i > t)[1 - P(S_i > t)]\}, \qquad (3.3.9)$$

and the integrals in (3.3.8) run over the possible values of $S_i$. Note that the left-hand side of (3.3.8) is just the bivariate regression coefficient of $Y_i$ on $S_i$. This formula weights each possible value of $S_i$ in proportion to the difference in the conditional mean of $S_i$ above and below that value. More weight is also given to points close to the median of $S_i$, since $P(S_i > t)[1 - P(S_i > t)]$ is maximized at $P(S_i > t) = 1/2$.
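The weighting formula can be checked numerically. In the sketch below, the CEF is $h(t) = t^2$ (so $h'(t) = 2t$) and $S_i$ is uniform; the empirical $\mu_t$ is computed on a grid and the weighted average derivative is compared with the bivariate OLS slope. All specific choices here are illustrative, not from the text:

```python
# Numerical check of (3.3.8)-(3.3.9): the bivariate regression slope of
# Y on S equals the mu_t-weighted average of h'(t). Simulated data.
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
s = rng.random(n)                        # S ~ Uniform(0, 1)
y = s**2 + rng.normal(0, 0.1, n)         # CEF h(t) = t^2

slope = np.cov(y, s)[0, 1] / s.var(ddof=1)

# Empirical weighting function mu_t on an evenly spaced grid of t values.
t = np.linspace(0.01, 0.99, 99)
mu = np.array([(s[s >= v].mean() - s[s < v].mean())
               * (s >= v).mean() * (s < v).mean() for v in t])

# Ratio of Riemann sums; the constant grid spacing cancels.
weighted_deriv = (2 * t * mu).sum() / mu.sum()

assert abs(slope - weighted_deriv) < 0.02
```

For uniform $S_i$ both quantities are 1 in population: $\mu_t = t(1-t)/2$, and the $\mu_t$-weighted average of $2t$ over $(0,1)$ works out to exactly the regression slope of $S_i^2$ on $S_i$.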
With covariates, $X_i$, the weights in (3.3.8) become $X$-specific. A covariate-averaged version of the same formula applies to the multivariate regression coefficient of $Y_i$ on $S_i$, after partialling out $X_i$. In particular,
$$\frac{E[Y_i(S_i - E[S_i|X_i])]}{E[S_i(S_i - E[S_i|X_i])]} = \frac{E\left[\int h_X'(t)\,\mu_{tX}\,dt\right]}{E\left[\int \mu_{tX}\,dt\right]}, \qquad (3.3.10)$$

where $h_X'(t) = \partial E[Y_i|X_i, S_i = t]/\partial t$ and

$$\mu_{tX} = \{E[S_i|X_i, S_i > t] - E[S_i|X_i, S_i < t]\}\{P(S_i > t|X_i)[1 - P(S_i > t|X_i)]\}.$$
It bears emphasizing that equation (3.3.10) reflects two types of averaging: an integral that averages along the length of a nonlinear CEF at fixed covariate values, and an expectation that averages across covariate cells. An important point in this context is that population regression coefficients contain no information about the effect of $S_i$ on the CEF at values of $X_i$ where $P(S_i > t|X_i)$ equals 0 or 1. This includes values of $X_i$ where $S_i$ is fixed. In the same spirit, it's worth noting that if $S_i$ is a dummy variable, we can extract equation (3.3.7) from the more general formula, (3.3.10).
Angrist and Krueger (1999) construct the average weighting function for a schooling regression with state-of-birth and year-of-birth covariates. Although equations (3.3.8) and (3.3.10) may seem arcane, or at least non-obvious, in this example the average weights, $E[\mu_{tX}]$, turn out to be a reasonably smooth and symmetric function of $t$, centered at the mode of $S_i$.
The implications of (3.3.8) and (3.3.10) can be explored further given a model for the distribution of regressors. Suppose, for example, that $S_i$ is Normally distributed. Let $Z_i = (S_i - E[S_i])/\sigma_S$, where $\sigma_S$ is the standard deviation of $S_i$, so that $Z_i$ is standard Normal, and let $t^* = (t - E[S_i])/\sigma_S$ denote the standardized value of $t$. From truncated Normal formulas (see, e.g., Johnson and Kotz, 1970), we know that

$$E[Z_i|Z_i > t^*] = \frac{\phi(t^*)}{1 - \Phi(t^*)} \quad \text{and} \quad E[Z_i|Z_i < t^*] = -\frac{\phi(t^*)}{\Phi(t^*)},$$

where $\phi(\cdot)$ and $\Phi(\cdot)$ are the standard Normal density and distribution function. Substituting in the formula for $\mu_t$, (3.3.9), we have

$$\mu_t = \sigma_S\,\phi(t^*)\left[\frac{1}{1 - \Phi(t^*)} + \frac{1}{\Phi(t^*)}\right][1 - \Phi(t^*)]\,\Phi(t^*) = \sigma_S\,\phi(t^*).$$

We have therefore shown that

$$\delta_R = \frac{Cov(Y_i, S_i)}{V(S_i)} = \frac{\int h'(t)\,\phi(t^*)\,dt}{\int \phi(t^*)\,dt} = E[h'(S_i)].$$
In other words, the regression of $Y_i$ on $S_i$ is the (unweighted!) population average derivative, $E[h'(S_i)]$, when $S_i$ is Normally distributed. Of course, this result is a special case of a special case.[30] Still, it seems reasonable to imagine that Normality might not matter very much. And in our empirical experience, the average derivatives (also called "marginal effects") constructed from parametric nonlinear models for limited dependent variables (e.g., Probit or Tobit) are usually indistinguishable from the corresponding regression coefficients, regardless of the distribution of regressors. We expand on this point in Section 3.4.2, below.
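The Normal special case is also easy to check by simulation: with a cubic CEF and a standard Normal regressor, the bivariate OLS slope should line up with the unweighted sample average of $h'(S_i)$ (both equal 3 in population here). The setup is an invented illustration:

```python
# Check of the Normal-regressor result: OLS slope = E[h'(S)], a version
# of Stein's identity. Here h(t) = t^3, so h'(t) = 3t^2. Simulated data.
import numpy as np

rng = np.random.default_rng(6)
n = 500_000
s = rng.normal(0, 1, n)                  # S ~ N(0, 1)
y = s**3 + rng.normal(0, 1, n)           # CEF h(t) = t^3

# Bivariate regression slope of Y on S.
slope = np.cov(y, s)[0, 1] / s.var(ddof=1)

# Unweighted sample average derivative E[h'(S)].
avg_deriv = (3 * s**2).mean()

assert abs(slope - avg_deriv) < 0.05
```

With a non-Normal regressor the two quantities generally differ, since the $\mu_t$ weights in (3.3.8) are then no longer proportional to the density of $S_i$.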