Pairing Off

One sample average is the loneliest number that you’ll ever do. Luckily, we’re usually concerned with two. We’re especially keen to compare averages for subjects in experimental treatment and control groups. We reference these averages with a compact notation, writing Y1 for Avgn[Yi|Di = 1] and Y0 for Avgn[Yi|Di = 0]. The treatment group mean, Y1, is the average for the n1 observations belonging to the treatment group, with Y0 defined similarly for the n0 control observations. The total sample size is n = n0 + n1.
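In code, these group averages are just conditional means. Here is a minimal sketch in Python with NumPy; the outcomes Y and treatment indicators D are made up for illustration:

```python
import numpy as np

# Hypothetical data: outcomes Yi and treatment indicators Di (1 = treated)
Y = np.array([3.1, 2.4, 4.0, 3.3, 2.9, 3.8, 2.2, 3.5])
D = np.array([1, 0, 1, 1, 0, 1, 0, 0])

Y1 = Y[D == 1].mean()        # Avgn[Yi | Di = 1], the treatment-group mean
Y0 = Y[D == 0].mean()        # Avgn[Yi | Di = 0], the control-group mean
n1, n0 = (D == 1).sum(), (D == 0).sum()
n = n0 + n1                  # total sample size
```

The difference Y1 − Y0 computed here is the quantity whose sampling variance occupies the rest of this section.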

For our purposes, the difference between Y1 and Y0 is either an estimate of the causal effect of treatment (if Y is an outcome), or a check on balance (if Y is a covariate). To keep the discussion focused, we’ll assume the former. The most important null hypothesis in this context is that treatment has no effect, in which case the two samples used to construct treatment and control averages come from the same population. On the other hand, if treatment changes outcomes, the populations from which treatment and control observations are drawn are necessarily different. In particular, they have different means, which we denote μ1 and μ0.

We decide whether the evidence favors the hypothesis that μ1 = μ0 by looking for statistically significant differences in the corresponding sample averages. Statistically significant results provide strong evidence of a treatment effect, while results that fall short of statistical significance are consistent with the notion that the observed difference in treatment and control means is a chance finding. The expression “chance finding” in this context means that in a hypothetical experiment involving very large samples—so large that any sampling variance is effectively eliminated—we’d find treatment and control means to be the same.

Statistical significance is determined by the appropriate t-statistic. A key ingredient in any t recipe is the standard error that lives downstairs in the t ratio. The standard error for a comparison of means is the square root of the sampling variance of Y1 − Y0. Using the fact that the variance of a difference between two statistically independent variables is the sum of their variances, we have

V(Y1 − Y0) = V(Y1) + V(Y0) = σ_Y²/n1 + σ_Y²/n0 = σ_Y²(1/n1 + 1/n0).
The second equality here uses equation (1.5), which gives the sampling variance of a single average. The standard error we need is therefore

SE(Y1 − Y0) = σ_Y √(1/n1 + 1/n0).
In deriving this expression, we’ve assumed that the variances of individual observations are the same in treatment and control groups. This assumption allows us to use one symbol, σ_Y², for the common variance. A slightly more complicated formula allows variances to differ across groups even if the means are the same (an idea taken up again in the discussion of robust regression standard errors in the appendix to Chapter 2).

Recognizing that σ_Y must be estimated, in practice we work with the estimated standard error

SE(Y1 − Y0) = S(Yi) √(1/n1 + 1/n0),
where S(Yi) is the pooled sample standard deviation. This is the sample standard deviation calculated using data from both treatment and control groups combined.
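The pooled calculation can be sketched in a few lines. One common convention (an assumption here; the text does not spell out the exact pooling formula) sums squared deviations from each group’s own mean and divides by n − 2, losing one degree of freedom per estimated group mean:

```python
import numpy as np

# Hypothetical data, as before
Y = np.array([3.1, 2.4, 4.0, 3.3, 2.9, 3.8, 2.2, 3.5])
D = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y1, y0 = Y[D == 1], Y[D == 0]
n1, n0 = len(y1), len(y0)

# Pooled variance: squared deviations from each group mean, divided by n - 2
pooled_var = (((y1 - y1.mean()) ** 2).sum()
              + ((y0 - y0.mean()) ** 2).sum()) / (n1 + n0 - 2)
S = np.sqrt(pooled_var)                  # pooled sample standard deviation S(Yi)
se = S * np.sqrt(1 / n1 + 1 / n0)        # estimated standard error of Y1 - Y0
```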

Under the null hypothesis that μ1 − μ0 is equal to the value μ, the t-statistic for a difference in means is

t(μ) = [(Y1 − Y0) − μ] / SE(Y1 − Y0).

We use this t-statistic to test working hypotheses about μ1 − μ0 and to construct confidence intervals for this difference. When the null hypothesis is one of equal means (μ = 0), the statistic t(0) equals the difference in sample means divided by the estimated standard error of this difference. When the t-statistic is large enough to reject a difference of zero, we say the estimated difference is statistically significant. The confidence interval for a difference in means is the difference in sample means plus or minus two standard errors.
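Putting the pieces together, here is a sketch of the equal-means t-statistic and the rule-of-thumb two-standard-error confidence interval (same invented data; the n − 2 pooling convention is again an assumption):

```python
import numpy as np

# Hypothetical data
Y = np.array([3.1, 2.4, 4.0, 3.3, 2.9, 3.8, 2.2, 3.5])
D = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y1, y0 = Y[D == 1], Y[D == 0]
n1, n0 = len(y1), len(y0)

diff = y1.mean() - y0.mean()
pooled_var = (((y1 - y1.mean()) ** 2).sum()
              + ((y0 - y0.mean()) ** 2).sum()) / (n1 + n0 - 2)
se = np.sqrt(pooled_var * (1 / n1 + 1 / n0))

t0 = diff / se                        # t(0): t-statistic for the null mu = 0
ci = (diff - 2 * se, diff + 2 * se)   # difference plus or minus two standard errors
```

For reference, `scipy.stats.ttest_ind(y1, y0)` with its default `equal_var=True` uses this same pooled formula and should reproduce `t0`.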

Bear in mind that t-statistics and confidence intervals have little to say about whether findings are substantively large or small. A large t-statistic arises when the estimated effect of interest is large but also when the associated standard error is small (as happens when you’re blessed with a large sample). Likewise, the width of a confidence interval is determined by statistical precision as reflected in standard errors and not by the magnitude of the relationships you’re trying to uncover. Conversely, t-statistics may be small either because the difference in the estimated averages is small or because the standard error of this difference is large. The fact that an estimated difference is not significantly different from zero need not imply that the relationship under investigation is small or unimportant. Lack of statistical significance often reflects lack of statistical precision, that is, high sampling variance. Masters are mindful of this fact when discussing econometric results.

1 For more on this surprising fact, see Jonathan Gruber, “Covering the Uninsured in the United States,” Journal of Economic Literature, vol. 46, no. 3, September 2008, pages 571–606.

2 Our sample is aged 26-59 and therefore does not yet qualify for Medicare.

3 An Empirical Notes section after the last chapter gives detailed notes for this table and most of the other tables and figures in the book.

4 Robert Frost’s insights notwithstanding, econometrics isn’t poetry. A modicum of mathematical notation allows us to describe and discuss subtle relationships precisely. We also use italics to introduce repeatedly used terms, such as potential outcomes, that have special meaning for masters of ’metrics.

5 Order the n observations on Yi so that the n0 observations from the group indicated by Di = 0 precede the n1 observations from the Di = 1 group. The conditional average

Avgn[Yi|Di = 0] = (1/n0) Σ_{i=1}^{n0} Yi

is the sample average for the n0 observations in the Di = 0 group. The term Avgn[Yi|Di = 1] is calculated analogously from the remaining n1 observations.

6 Six-sided cubes with one to six dots engraved on each side. There’s an app for ’em on your smartphone.

7 Our description of the HIE follows Robert H. Brook et al., “Does Free Care Improve Adults’ Health? Results from a Randomized Controlled Trial,” New England Journal of Medicine, vol. 309, no. 23, December 8, 1983, pages 1426–1434. See also Aviva Aron-Dine, Liran Einav, and Amy Finkelstein, “The RAND Health Insurance Experiment, Three Decades Later,” Journal of Economic Perspectives, vol. 27, Winter 2013, pages 197–222, for a recent assessment.

8 Other HIE complications include the fact that instead of simply tossing a coin (or the computer equivalent), RAND investigators implemented a complex assignment scheme that potentially affects the statistical properties of the resulting analyses (for details, see Carl Morris, “A Finite Selection Model for Experimental Design of the Health Insurance Study,” Journal of Econometrics, vol. 11, no. 1, September 1979, pages 43–61). Intentions here were good, in that the experimenters hoped to insure themselves against chance deviation from perfect balance across treatment groups. Most HIE analysts ignore the resulting statistical complications, though many probably join us in regretting this attempt to gild the random assignment lily. A more serious problem arises from the large number of HIE subjects who dropped out of the experiment and the large differences in attrition rates across treatment groups (fewer left the free plan, for example). As noted by Aron-Dine, Einav, and Finkelstein, “The RAND Experiment,” Journal of Economic Perspectives, 2013, differential attrition may have compromised the experiment’s validity. Today’s “randomistas” do better on such nuts-and-bolts design issues (see, for example, the experiments described in Abhijit Banerjee and Esther Duflo, Poor Economics: A Radical Rethinking of the Way to Fight Global Poverty, Public Affairs, 2011).

9 The RAND results reported here are based on our own tabulations from the HIE public use file, as described in the Empirical Notes section at the end of the book. The original RAND results are summarized in Joseph P. Newhouse et al., Free for All? Lessons from the RAND Health Insurance Experiment, Harvard University Press, 1994.

10 Participants in the free plan had slightly better corrected vision than those in the other plans; see Brook et al., “Does Free Care Improve Health?” New England Journal of Medicine, 1983, for details.

11 See Amy Finkelstein et al., “The Oregon Health Insurance Experiment: Evidence from the First Year,” Quarterly Journal of Economics, vol. 127, no. 3, August 2012, pages 1057-1106; Katherine Baicker et al., “The Oregon Experiment—Effects of Medicaid on Clinical Outcomes,” New England Journal of Medicine, vol. 368, no. 18, May 2, 2013, pages 1713-1722; and Sarah Taubman et al., “Medicaid Increases Emergency Department Use: Evidence from Oregon’s Health Insurance Experiment,” Science, vol. 343, no. 6168, January 17, 2014, pages 263-268.

12 Why weren’t all OHP lottery winners insured? Some failed to submit the required paperwork on time, while about half of those who did complete the necessary forms in a timely fashion turned out to be ineligible on further review.

13 Lind’s experiment is described in Duncan P. Thomas, “Sailors, Scurvy, and Science,” Journal of the Royal Society of Medicine, vol. 90, no. 1, January 1997, pages 50-54.

14 Charles S. Peirce and Joseph Jastrow, “On Small Differences in Sensation,” Memoirs of the National Academy of Sciences, vol. 3, 1885, pages 75-83.

15 Ronald A. Fisher, Statistical Methods for Research Workers, Oliver and Boyd, 1925, and Ronald A. Fisher, The Design of Experiments, Oliver and Boyd, 1935.

16 Sample variances tend to underestimate population variances. Sample variance is therefore sometimes defined as

S(Yi)² = (1/(n − 1)) Σ_{i=1}^{n} (Yi − Avgn[Yi])²,

that is, dividing by n − 1 instead of by n. This modified formula provides an unbiased estimate of the corresponding population variance.

Using separate variances for treatment and control observations, we have

SE(Y1 − Y0) = √( V1(Yi)/n1 + V0(Yi)/n0 ),

where V1(Yi) is the variance of treated observations, and V0(Yi) is the variance of control observations.
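A sketch of this unequal-variances standard error, with made-up groups chosen so the treated outcomes are much noisier than the controls:

```python
import numpy as np

y1 = np.array([5.0, 7.0, 9.0, 11.0])   # hypothetical treated group: spread out
y0 = np.array([3.9, 4.0, 4.1, 4.0])    # hypothetical controls: tightly bunched
n1, n0 = len(y1), len(y0)

v1 = np.var(y1, ddof=1)   # V1(Yi): sample variance of treated observations
v0 = np.var(y0, ddof=1)   # V0(Yi): sample variance of control observations

# Standard error allowing the two group variances to differ
se_unequal = np.sqrt(v1 / n1 + v0 / n0)
```

This is the standard error that underlies Welch’s t-test (for instance, `scipy.stats.ttest_ind` with `equal_var=False`).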

Chapter 2



Kwai Chang Caine: A worker is known by his tools. A shovel for a man who digs. An ax for a woodsman. The econometrician runs regressions.

Kung Fu, Season 1, Episode 8

Our Path

When the path to random assignment is blocked, we look for alternate routes to causal knowledge. Wielded skillfully, ’metrics tools other than random assignment can have much of the causality-revealing power of a real experiment. The most basic of these tools is regression, which compares treatment and control subjects who have the same observed characteristics. Regression concepts are foundational, paving the way for the more elaborate tools used in the chapters that follow. Regression-based causal inference is predicated on the assumption that when key observed variables have been made equal across treatment and control groups, selection bias from the things we can’t see is also mostly eliminated. We illustrate this idea with an empirical investigation of the economic returns to attendance at elite private colleges.
