# Appendix: Derivation of the simple Moulton factor

Write

$$X = \begin{bmatrix} X_1' & \cdots & X_G' \end{bmatrix}', \qquad X_g = \iota_g x_g',$$

and

$$\Psi = E[ee'|X] = \sigma_e^2\,\mathrm{diag}\big\{(1-\rho)I_{n_g} + \rho\,\iota_g\iota_g'\big\},$$

where $\iota_g$ is a column vector of $n_g$ ones and $G$ is the number of groups. Note that

$$X'\Psi X = \sigma_e^2 \sum_g x_g \iota_g'\big[(1-\rho)I_{n_g} + \rho\,\iota_g\iota_g'\big]\iota_g x_g' = \sigma_e^2 \sum_g n_g\big[1 + (n_g - 1)\rho\big]x_g x_g'.$$

Let $\tau_g = 1 + (n_g - 1)\rho$, so we get $X'\Psi X = \sigma_e^2 \sum_g n_g \tau_g x_g x_g'$.

With this in hand, we can write

$$V(\hat\beta) = (X'X)^{-1}X'\Psi X(X'X)^{-1} = \sigma_e^2\Big(\sum_g n_g x_g x_g'\Big)^{-1}\Big(\sum_g n_g \tau_g x_g x_g'\Big)\Big(\sum_g n_g x_g x_g'\Big)^{-1}.$$

We want to compare this with the standard OLS covariance estimator, $\sigma_e^2\big(\sum_g n_g x_g x_g'\big)^{-1}$.

If the group sizes are equal, $n_g = n$ and $\tau_g = \tau = 1 + (n - 1)\rho$, so that

$$V(\hat\beta) = \tau\sigma_e^2\Big(\sum_g n x_g x_g'\Big)^{-1}\Big(\sum_g n x_g x_g'\Big)\Big(\sum_g n x_g x_g'\Big)^{-1} = \tau\sigma_e^2\Big(\sum_g n x_g x_g'\Big)^{-1},$$

which implies (8.2.4).
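The block-matrix algebra is easy to check numerically: build the block-diagonal Ψ for a few groups and compare X′ΨX computed directly with the Moulton expression σ²Σ nᵍτᵍxᵍxᵍ′. A minimal numpy sketch (group sizes, ρ, and σ² are made-up illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
rho, sigma2 = 0.3, 2.0
n_g = [3, 5, 4]                    # illustrative group sizes
x_g = rng.normal(size=(3, 2))     # one covariate row per group

# Stack X: each group's regressor row repeated n_g times
X = np.vstack([np.tile(x_g[g], (n_g[g], 1)) for g in range(3)])

# Block-diagonal error covariance: sigma2 * [(1 - rho) I + rho * ones]
Psi = np.zeros((sum(n_g), sum(n_g)))
i = 0
for n in n_g:
    Psi[i:i+n, i:i+n] = sigma2 * ((1 - rho) * np.eye(n) + rho * np.ones((n, n)))
    i += n

# Direct computation vs. the Moulton expression sigma2 * sum_g n_g tau_g x_g x_g'
direct = X.T @ Psi @ X
tau = [1 + (n - 1) * rho for n in n_g]
moulton = sigma2 * sum(n_g[g] * tau[g] * np.outer(x_g[g], x_g[g]) for g in range(3))
print(np.allclose(direct, moulton))  # True
```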

Table 8.1.1: Monte Carlo results for robust standard errors

|                                    | Mean (1) | Std. dev. (2) | 5% rej. rate, Normal (3) | 5% rej. rate, t (4) |
|------------------------------------|---------:|--------------:|-------------------------:|--------------------:|
| **A. Lots of Heteroskedasticity**  |          |               |                          |                     |
| β̂₁                                 | -0.001   | 0.586         |                          |                     |
| *Standard errors:*                 |          |               |                          |                     |
| Conventional                       | 0.331    | 0.052         | 0.278                    | 0.257               |
| HC0                                | 0.417    | 0.203         | 0.247                    | 0.231               |
| HC1                                | 0.447    | 0.218         | 0.223                    | 0.208               |
| HC2                                | 0.523    | 0.260         | 0.177                    | 0.164               |
| HC3                                | 0.636    | 0.321         | 0.130                    | 0.120               |
| max(Conventional, HC0)             | 0.448    | 0.172         | 0.188                    | 0.171               |
| max(Conventional, HC1)             | 0.473    | 0.190         | 0.173                    | 0.157               |
| max(Conventional, HC2)             | 0.542    | 0.238         | 0.141                    | 0.128               |
| max(Conventional, HC3)             | 0.649    | 0.305         | 0.107                    | 0.097               |
| **B. Little Heteroskedasticity**   |          |               |                          |                     |
| β̂₁                                 | 0.004    | 0.600         |                          |                     |
| *Standard errors:*                 |          |               |                          |                     |
| Conventional                       | 0.520    | 0.070         | 0.098                    | 0.084               |
| HC0                                | 0.441    | 0.193         | 0.217                    | 0.202               |
| HC1                                | 0.473    | 0.207         | 0.194                    | 0.179               |
| HC2                                | 0.546    | 0.250         | 0.156                    | 0.143               |
| HC3                                | 0.657    | 0.312         | 0.114                    | 0.104               |
| max(Conventional, HC0)             | 0.562    | 0.121         | 0.083                    | 0.070               |
| max(Conventional, HC1)             | 0.578    | 0.138         | 0.078                    | 0.067               |
| max(Conventional, HC2)             | 0.627    | 0.186         | 0.067                    | 0.057               |
| max(Conventional, HC3)             | 0.713    | 0.259         | 0.053                    | 0.045               |
| **C. No Heteroskedasticity**       |          |               |                          |                     |
| β̂₁                                 | -0.003   | 0.611         |                          |                     |
| *Standard errors:*                 |          |               |                          |                     |
| Conventional                       | 0.604    | 0.081         | 0.061                    | 0.050               |
| HC0                                | 0.453    | 0.190         | 0.209                    | 0.193               |
| HC1                                | 0.486    | 0.203         | 0.185                    | 0.171               |
| HC2                                | 0.557    | 0.247         | 0.150                    | 0.136               |
| HC3                                | 0.667    | 0.309         | 0.110                    | 0.100               |
| max(Conventional, HC0)             | 0.629    | 0.109         | 0.055                    | 0.045               |
| max(Conventional, HC1)             | 0.640    | 0.122         | 0.053                    | 0.044               |
| max(Conventional, HC2)             | 0.679    | 0.166         | 0.047                    | 0.039               |
| max(Conventional, HC3)             | 0.754    | 0.237         | 0.039                    | 0.031               |

Note: The table reports results from a sampling experiment with 25,000 replications.
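The HC estimators compared in the table differ only in how squared residuals are inflated, with HC2 and HC3 using the leverage values h_ii. A compact numpy sketch of the four sandwich estimators (on simulated data, not the experiment reported in the table):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 30
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # leverage h_ii
k = X.shape[1]

def sandwich(w):
    """(X'X)^-1 [sum_i w_i x_i x_i'] (X'X)^-1 for inflated squared residuals w."""
    meat = (X * w[:, None]).T @ X
    return XtX_inv @ meat @ XtX_inv

V = {
    'HC0': sandwich(e**2),
    'HC1': sandwich(e**2 * N / (N - k)),
    'HC2': sandwich(e**2 / (1 - h)),
    'HC3': sandwich(e**2 / (1 - h)**2),
}
for name, v in V.items():
    print(name, np.sqrt(v[1, 1]))
```

Since 0 < h_ii < 1, the inflation grows from HC0 to HC2 to HC3, which is why the mean standard errors rise down each panel of the table.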

Table 8.2.1: Standard errors for class size effects in the STAR data

|                                                                  | Standard error |
|------------------------------------------------------------------|---------------:|
| Robust (HC1)                                                     | 0.090          |
| Parametric Moulton correction (using Moulton intraclass coefficient) | 0.222      |
| Parametric Moulton correction (using ANOVA intraclass coefficient)   | 0.230      |
| Clustered                                                        | 0.232          |
| Block bootstrap                                                  | 0.231          |
| Estimation using group means (weighted by class size)            | 0.226          |

Note: The table reports estimates from a regression of average percentile scores on class size for kindergartners, using the public use data set from Project STAR. The coefficient on class size is -.62. The group level for clustering is the classroom. The number of observations is 5,743. The bootstrap estimate uses 1,000 replications.

 Milgram was later played by the actor William Shatner in a TV special, an honor that no economist has yet received, though Angrist is still hopeful.

A recent example is Bertrand and Mullainathan (2004), who compared employers' responses to resumes with blacker-sounding and whiter-sounding first names, like Lakisha and Emily (though Fryer and Levitt, 2004, note that names may carry information about socioeconomic status as well as race).

The Perry data continue to get attention, particularly as policy interest has returned to early education. A recent re-analysis by Michael Anderson (2006) confirms many of the findings from the original Perry study, though Anderson also shows that the overall positive effects of Perry are driven entirely by the impact on girls. The Perry intervention seems to have done nothing for boys.

The potential outcomes idea is a fundamental building block in modern research on causal effects. Important references developing this idea are Rubin (1974, 1977), and Holland (1986), who refers to a causal framework involving potential outcomes as the Rubin Causal Model.

Krueger (1999) devotes considerable attention to the attrition problem. Differences in attrition rates across groups may result in a sample of students in higher grades that is not randomly distributed across class types. The kindergarten results, which were unaffected by attrition, are therefore the most reliable.

Randomized trials are never perfect and STAR is no exception. Pupils who repeated or skipped a grade left the experiment. Students who entered an experimental school one grade later were added to the experiment and randomly assigned to one of the classes. One unfortunate aspect of the experiment is that students in the regular and regular/aide classes were reassigned after the kindergarten year, possibly in response to protests from parents with children in the regular classrooms. There was also some switching of children after the kindergarten year. Despite these problems, the STAR experiment seems to have been an extremely well implemented randomized trial. Krueger's (1999) analysis suggests that none of these implementation problems affected the main conclusions of the study.

The Angrist-Lavy (1999) results turn up again in Chapter 6, as an illustration of the quasi-experimental regression-discontinuity research design.

Examples of pedagogical writing using the "population-first" approach to econometrics include Chamberlain (1984), Goldberger (1991), and Manski (1991).

The discussion of asymptotic OLS inference in this section is largely a condensation of material in Chamberlain (1984).

Important pitfalls and problems with this asymptotic theory are covered in the last chapter.

Econometricians like to use matrices because the notation is so compact. Sometimes (not very often) we do too. Suppose

For a derivation of the delta method formula using the Slutsky and continuous mapping theorems, see, e.g., Knight (2000, pp. 120-121).

Residuals defined in this way are not necessarily mean-independent of Xi; for mean-independence, we need a linear CEF.

X is the matrix whose rows are given by Xi′ and y is the vector with elements yi, for i = 1, ..., N. The sample moment (1/N)Σi XiXi′ is X′X/N and the sample moment (1/N)Σi Xiyi is X′y/N. Then we can write β̂ = (X′X)⁻¹X′y, a familiar matrix formula.
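In code, the matrix formula is one line; a numpy sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=N)

# beta_hat = (X'X)^{-1} X'y, the sample analog of E[XiXi']^{-1} E[XiYi]
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)
```

Using `solve` rather than an explicit inverse is the numerically safer way to evaluate the same formula.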

The cross-product term resulting from an expansion of the quadratic in the middle of 3.1.9 is zero because Yi — E[Yi|Xi] is mean-independent of Xi.

With a third dummy variable in the model, say X3i, a saturated model includes 3 main effects, 3 second-order interaction terms, {X1iX2i, X1iX3i, X2iX3i}, and one third-order term, X1iX2iX3i.
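With three dummies the saturated design has 1 + 3 + 3 + 1 = 8 columns, one parameter per covariate cell; a quick numpy check that the design is full rank on the 8 cells:

```python
import numpy as np
from itertools import product

# All 8 combinations of three dummy variables
cells = np.array(list(product([0, 1], repeat=3)))
x1, x2, x3 = cells.T

# constant, 3 main effects, 3 two-way interactions, 1 three-way interaction
design = np.column_stack([
    np.ones(8), x1, x2, x3,
    x1 * x2, x1 * x3, x2 * x3,
    x1 * x2 * x3,
])
print(np.linalg.matrix_rank(design))  # 8: one free parameter per cell
```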

For example, we might construct the average effect over s using the distribution of Si. In other words, estimate E[fi(s) — fi(s — 1)] for each s by matching, and then compute the average difference

$$\sum_s E[f_i(s) - f_i(s-1)]P(s),$$

where P(s) is the probability mass function for Si. This is a discrete approximation to the average derivative, E[fi′(Si)].

Here is the multivariate generalization of OVB: Let βˢ denote the coefficient vector on a K1 × 1 vector of variables, X1i, in a (short) regression that has no other variables, and let βˡ denote the coefficient vector on these variables in a (long) regression that includes a K2 × 1 vector of control variables, X2i, with coefficient vector γ. Then βˢ = βˡ + E[X1iX1i′]⁻¹E[X1iX2i′]γ.
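The formula holds exactly in sample, so it is easy to verify: run the short and long regressions and compare the short coefficients with the long coefficients plus the OVB term. A numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 5000
X1 = rng.normal(size=(N, 2))                     # included variables
X2 = 0.5 * X1[:, :1] + rng.normal(size=(N, 1))   # control correlated with X1
y = X1 @ [1.0, -1.0] + X2 @ [2.0] + rng.normal(size=N)

def ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

b_short = ols(X1, y)
long_all = ols(np.hstack([X1, X2]), y)
b_long, gamma = long_all[:2], long_all[2:]

# short = long + (X1'X1)^{-1} X1'X2 gamma  (sample analog of the OVB formula)
ovb = np.linalg.solve(X1.T @ X1, X1.T @ X2) @ gamma
print(np.allclose(b_short, b_long + ovb))  # True
```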

As highly educated people, we like to assume that ability and schooling are positively correlated. This is not a foregone conclusion, however: Mick Jagger dropped out of the London School of Economics and Bill Gates dropped out of Harvard, perhaps because the opportunity cost of schooling for these high-ability guys was high (of course, they may also be a couple of very lucky college dropouts).

 A large empirical literature investigates the consequences of omitting ability variables from schooling equations. Key early references include Griliches and Mason (1972), Taubman (1976), Griliches (1977), and Chamberlain (1978).

This program appears to raise earnings, primarily because workers in the training group went back to work more quickly.

Lotteries have been used to distribute private school tuition subsidies; see, e. g., Angrist, et al. (2002).

The same problem arises in "conditional-on-positive" comparisons, discussed in detail in section (3.4.2), below.

In this example, selection bias is probably negative, that is, E[Y0i|W1i = 1] < E[Y0i|W0i = 1]. It seems reasonable to think that any college graduate can get a white collar job, so E[Y0i|W1i = 1] is not too far from E[Y0i]. But someone who gets a white collar job without benefit of a college degree (i.e., W0i = 1) is probably special, i.e., has a better than average Y0i.

Griliches and Mason (1972) is a seminal exploration of the use of early and late ability controls in schooling equations. See also Chamberlain (1977, 1978) for closely related studies. Rosenbaum (1984) offers an alternative discussion of the proxy control idea using very different notation, outside of a regression framework.

This matching estimator is discussed by Rubin (1977) and used by Card and Sullivan (1988) to estimate the effect of subsidized training on employment.

With continuous covariates, exact matching is impossible and some sort of approximation is required, a fact that leads to bias. See Abadie and Imbens (2006), who derive the implications of approximate matching for the limiting distribution of matching estimators.

It’s no surprise that regression gives the most weight to cells where P(Di = 1|Xi = x) = 1/2 since regression is efficient for a homoskedastic constant-effects linear model. We should expect an efficient estimator to give the most weight to cells where the common treatment effect is estimated most precisely. With homoskedastic residuals, the most precise treatment effects

The support of a random variable is the set of realizations that occur with positive probability. See Heckman, Ichimura, Smith, and Todd (1998) and Smith and Todd (2001) for a discussion of common support in matching.

Matching problems involving finely distributed X-variables are often solved by aggregating values to make coarser groupings or by pairing observations that have similar, though not necessarily identical, values. See Cochran (1965), Rubin (1973), or Rosenbaum (1995, Chapter 3) for discussions of this approach. With continuously-distributed covariates, matching estimators are biased because matches are imperfect. Abadie and Imbens (2008) have recently shown that a regression-based bias correction can eliminate the (asymptotic) bias from imperfect matches.

More specialized results in this spirit appear in Ruud (1986), who considers distribution-free estimation of limited-dependent-variable models with Normally distributed regressors.

Propensity-score methods can be adapted to multi-valued treatments, though this has yet to catch on. See Imbens (2000) for an effort in this direction.

A similar but more extended propensity-score face-off appears in the exchange between Smith and Todd (2005) and Dehejia (2005).

The HIE was considerably more complicated than described here. There were 14 different treatments, including assignment to a prepaid HMO-like service. The experimental design did not use simple random assignment, but rather a more complicated assignment scheme meant to ensure covariate balance across groups.

A generalization of Tobit is the sample selection model, where the latent variable determining participation is not the same as the latent expenditure variable. See, e. g., Maddala (1983). The same conceptual problems related to the interpretation of effects on latent variables arise in the sample selection model as with Tobit.

We should note that our favorite regression example – a regression of log wages on schooling – may have a COP problem since the sample of log wages naturally omits those with zero earnings. This leads to COP-style selection bias if education affects the probability of working. In practice, therefore, we focus on samples of prime-age males where participation rates are high and reasonably stable across schooling groups (e. g., white men aged 40-49 in Figure 3.1.1).

Yule's first applied paper on the poor laws was published in 1895 in the Economic Journal, where Pischke is proud to serve as co-editor. The theory of multiple regression that goes along with this appears in Yule (1897).

 Recent years have seen an increased willingness by statisticians to discuss statistical models for observational data in an explicitly causal framework; see, for example, Freedman’s (2005) review.

Key historical references here are Wald (1940) and Durbin (1954), both discussed below.

See Angrist and Krueger (2001) for a brief exposition of the history and uses of IV; Stock and Trebbi (2003) for a detailed account of the birth of IV; and Morgan (1990) for an extended history of econometric ideas, including the simultaneous equations model.

Other explanations are possible, the most likely being some sort of family background effect associated with season of birth (see, e.g., Bound, Jaeger, and Baker, 1995). Weighing against the possibility of omitted family background effects is the fact that the quarter of birth pattern in average schooling is much more pronounced at the schooling levels most affected by compulsory attendance laws. Another possible concern is a pure age-at-entry effect which operates through channels other than highest grade completed (e.g., achievement). The causal effect of age-at-entry on learning is difficult, if not impossible, to separate from pure age effects, as noted in Chapter 1. A recent study by Elder and Lubotsky (2008) argues that the evolution of putative age-at-entry effects over time is more consistent with effects due to age differences per se than to a within-school learning advantage for older students.

Note that s̃ᵢ* = Z̃ᵢπ̂₁₁, where Z̃ᵢ is the residual from a regression of Zᵢ on Xᵢ, so that the 2SLS estimator is therefore the
This gain may not be without cost, as the use of many additional instruments opens up the possibility of increased bias, an issue discussed in Chapter 8, below.

 As noted in the introduction to this chapter, measurement error in regressors tends to shrink regression coefficients towards zero. To eliminate this bias, Wald (1940) suggested that the data be divided in a manner independent of the measurement error, and the coefficient of interest estimated as a ratio of differences in means as in (4.1.12). Durbin (1954) showed that Wald’s method of fitting straight lines is an IV estimator where the instrument is a dummy marking Wald’s division of the data. Hausman (2001) provides an overview of econometric strategies for dealing with measurement error.

An exception is the classical measurement error model, where both the variable to be instrumented and the instrument are assumed to be continuous. Here, we have in mind IV scenarios involving omitted variables bias.

Continuous instruments recoded as dummies can be seen as providing a parsimonious non-parametric model for the underlying first-stage relation, E[Di|Zi]. In homoskedastic models with constant coefficients, the asymptotically efficient instrument is E[Di|Zi] (Newey, 1990).

 See, e. g., the preface to Borjas (2005).

With a single endogenous variable and more than one instrument, Γ is [k + 1] × 1, while Zi is [k + Q] × 1 for Q > 1. Hence the resulting linear system cannot be solved unless there is a linear dependency that makes some of the instruments redundant.

"Quadratic form" is matrix language for a weighted sum of squares. Suppose v is an N × 1 vector and M is an N × N matrix. A quadratic form in v is v′Mv. If M is an N × N diagonal matrix with diagonal elements mᵢ, then v′Mv = Σᵢ mᵢvᵢ².
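A two-line numpy check of this identity, with arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(4)
v = rng.normal(size=5)
m = rng.uniform(1, 2, size=5)
M = np.diag(m)

# For diagonal M, the quadratic form v'Mv collapses to sum_i m_i * v_i^2
quad = v @ M @ v
print(np.isclose(quad, np.sum(m * v**2)))  # True
```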

Much more detailed explanations can be found in Newey (1985), Newey and West (1987), and the original Hansen (1982) GMM paper.

If, for example, the instrument takes on three values, one of which is assigned to the constant, and the model includes a constant and a single endogenous variable only, the test statistic has 1 degree of freedom.

The Wald estimator and Wald test are named after the same statistician, Abraham Wald, but the latter reference is Wald (1943).

The fact that Wald and LM testing procedures for the same null are equivalent in linear models was established by Newey and West (1987). Angrist (1991) gives a formal statement of the argument in this paragraph.

A quadratic form is the matrix-weighted product, x′Ax, where x is a random vector of, say, dimension K and A is a K × K matrix of constants.

Applications of TSIV include Bjorklund and Jantti (1997), Jappelli, Pischke, and Souleles (1998), Currie and Yelowitz (2000), and Dee and Evans (2003). In a recent paper, Inoue and Solon (2005) compare the asymptotic distributions of alternative TSIV estimators, and introduce a maximum likelihood (LIML-type) version of TSIV. They also correct a mistake in the distribution theory in Angrist and Krueger (1995), discussed further, below.

Angrist and Krueger called this estimator SSIV because they were concerned with a scenario where a single data set is deliberately split in two. As discussed in Section (4.6.4), the resulting estimator may have less bias than conventional 2SLS.

Inoue and Solon (2005) refer to the estimator Angrist and Krueger (1995) called SSIV as Two-sample 2SLS or TS2SLS.

This shortcut formula uses the standard errors from the manual SSIV second stage. The correct asymptotic covariance matrix formula, from Inoue and Solon (2005), is

$$\{B[(\sigma_{11} + \kappa\,\Gamma'\Sigma_{22}\Gamma)A]^{-1}B'\}^{-1},$$

where B = plim(Z₁′X₁/n₁) = plim(Z₂′X₂/n₂), A = plim(Z₁′Z₁/n₁) = plim(Z₂′Z₂/n₂), plim(n₁/n₂) = κ, σ₁₁ is the variance of the reduced-form residual in data set 1, and Σ₂₂ is the covariance matrix of the first-stage residuals in data set 2. In principle, these pieces are easy enough to calculate. Other approaches to SSIV inference include those of Dee and Evans (2003), who calculate standard errors for just-identified models using the delta-method, and Bjorklund and Jantti (1997), who use a bootstrap.

The distinction between internal and external validity is relatively new to applied econometrics but has a long history in social science. See, for example, the chapter-length discussion in Shadish, Cook, and Campbell (2002), the successor to a classic text on research methods by Campbell and Stanley (1963).

Hirano, Imbens, Rubin and Zhou (2000) note that the exclusion restriction that Yi(d, z) equals Yi(d, z′) can be weakened to require only that the distributions of Yi(d, z) and Yi(d, z′) be the same.

As it turns out, there is not much of a relationship between schooling and lottery numbers in the Angrist and Krueger (1992) data, probably because educational deferments were phased out during the lottery period.

Angrist (1990) interprets draft lottery estimates as the penalty for lost labor market experience. This suggests draft lottery estimates should have external validity for the effects of conscription in other periods, a conjecture borne out by the results for WWII draftees in Angrist and Krueger (1994).

With a constant effect, ρ, the reduced form is

$$E[Y_i|Z_i = 1] - E[Y_i|Z_i = 0] = E[Y_{1i} - Y_{0i}|D_{1i} > D_{0i}]P[D_{1i} > D_{0i}] - E[Y_{1i} - Y_{0i}|D_{1i} < D_{0i}]P[D_{1i} < D_{0i}]$$
$$= \rho\{P[D_{1i} > D_{0i}] - P[D_{1i} < D_{0i}]\}$$
$$= \rho E[D_{1i} - D_{0i}].$$

So a zero reduced form effect means either the first stage is zero or ρ = 0.

Another application of IV to data from a randomized trial is Krueger (1999). This study uses randomly assigned class size as an instrument for actual class size with data from the Tennessee STAR experiment. For students in first grade and higher, actual class size differs from randomly assigned class size in the STAR experiment because parents and teachers move students around in years after the experiment began. Krueger (1999) also illustrates 2SLS applied to a model with variable treatment intensity, as discussed in section 4.5.3.

In fact, maintaining the hypothesis that all instruments in an over-identified model are valid, the traditional over-identification test statistic becomes a formal test for treatment-effect heterogeneity.

Using twins instruments alone, the IV estimate of the effect of a third child on female labor force participation is -.084 (s.e. = .017). The corresponding samesex estimate is -.138 (s.e. = .029). Using both instruments produces a 2SLS estimate of -.098 (.015). The 2SLS weight in this case is .74 for twins, .26 for samesex, due to the much stronger twins first stage.

Note that the variability in E[Di|Xi, Zi] conditional on Xi comes from Zi. So the weighting formula gives more weight to covariate values where the instrument creates more variation in fitted values. The first line of the weight formula, (4.5.4), holds for any endogenous variable in a 2SLS setup. The second is a consequence of the fact that here the endogenous variable is a dummy.

For compliers,

$$P[D_i = 1|\{Y_{1i}, Y_{0i}\}, X_i, D_{1i} > D_{0i}] = P[Z_i = 1|\{Y_{1i}, Y_{0i}\}, X_i, D_{1i} > D_{0i}].$$

And by conditional independence,

$$P[Z_i = 1|\{Y_{1i}, Y_{0i}\}, X_i, D_{1i} > D_{0i}] = P[Z_i = 1|X_i, D_{1i} > D_{0i}].$$

The class of approximating functions needn't be linear. Instead of αDi + Xi′β, it might make sense to use a nonlinear function like an exponential (if the dependent variable is non-negative) or probit (if the dependent variable is zero-one). We return to this point at the end of this chapter. As noted in Section (4.4.4), the kappa-weighting scheme can be used to characterize covariate distributions for compliers as well as to estimate outcome distributions.

 Abadie (2003) gives formulas for standard errors and Alberto Abadie has posted software to compute them. The bootstrap provides a simple alternative, which we used to construct standard errors for the Abadie estimates mentioned in this paragraph.

The insight that consistency of 2SLS estimates in a traditional SEM does not depend on correct specification of the first-stage CEF goes back to Kelejian (1971). Use of a nonlinear plug-in first-stage may not do too much damage in practice - a probit first-stage can be pretty close to linear - but why take a chance when you don't have to?

The coefficient on average schooling in an equation with individual schooling can be interpreted as the Hausman (1978) test statistic for the equality of OLS estimates and 2SLS estimates of private returns to schooling using state dummies as instruments. Borjas (1992) discusses a similar problem affecting the estimation of ethnic-background effects.

Here is a direct proof that the regression of Sij on S̄j is always unity:

$$\frac{\sum_j\sum_i S_{ij}(\bar S_j - \bar S)}{\sum_j n_j(\bar S_j - \bar S)^2} = \frac{\sum_j (\bar S_j - \bar S)\sum_i S_{ij}}{\sum_j n_j(\bar S_j - \bar S)^2} = \frac{\sum_j (\bar S_j - \bar S)(n_j \bar S_j)}{\sum_j n_j(\bar S_j - \bar S)^2} = 1,$$

where the last equality uses the fact that $\sum_j n_j(\bar S_j - \bar S)\bar S = 0$.
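This algebraic fact is easy to confirm numerically: regress individual schooling on its own group mean and the slope is exactly 1. A numpy sketch with arbitrary, unequal group sizes:

```python
import numpy as np

rng = np.random.default_rng(5)
groups = np.repeat(np.arange(4), [3, 7, 5, 10])   # unequal group sizes
s = rng.normal(size=groups.size)                  # individual S_ij

# Attach each observation's group mean S_bar_j
group_means = np.array([s[groups == j].mean() for j in range(4)])
s_bar = group_means[groups]

# Bivariate slope: cov(S_ij, S_bar_j) / var(S_bar_j)
slope = np.cov(s, s_bar)[0, 1] / np.var(s_bar, ddof=1)
print(slope)  # 1.0 up to floating point
```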

The analogy between nonlinear LDV models and GLS is more than rhetorical. Consider a Probit model with nonlinear CEF E[Yi|Xi] = Φ(Xi′β). The first-order conditions for maximum likelihood estimation of this model are

Abadie, and nonlinear structural estimates of models for hours worked. Angrist (1991) compares 2SLS and bivariate Probit estimates in sampling experiments.

A more precise statement is that OLS is unbiased when either (a) the CEF is linear, or (b) the regressors are non-stochastic, i.e., fixed in repeated samples. In practice, these qualifications do not seem to matter much. As a rule, the sampling distribution of β̂ = [Σᵢ XᵢXᵢ′]⁻¹Σᵢ XᵢYᵢ tends to be centered on the population analog, β = E[XᵢXᵢ′]⁻¹E[XᵢYᵢ], in samples of any size, whether or not the CEF is linear or the regressors are stochastic.
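A small simulation illustrates the point: with a deliberately nonlinear CEF (E[Y|X] = X² for standard normal X), the population projection of Y on (1, X) has intercept E[X²] = 1 and slope E[X³] = 0, and the Monte Carlo mean of the OLS estimates sits close to that analog. A sketch, with parameters chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)

def ols_draw(n=100):
    # Nonlinear CEF: E[Y|X] = X^2; OLS fits the best linear predictor anyway
    x = rng.normal(size=n)
    y = x**2 + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    return np.linalg.solve(X.T @ X, X.T @ y)

draws = np.array([ols_draw() for _ in range(2000)])
print(draws.mean(axis=0))  # close to the population analog (1, 0)
```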

Key references are Nelson and Startz (1990a, b); Buse (1992); Bekker (1994); and especially Bound, Jaeger, and Baker (1995).

See Bekker (1994) and Angrist and Krueger (1995). This is also called a group-asymptotic approximation because it can be derived from an asymptotic sequence that lets the number of instruments go to infinity at the same time as the number of observations goes to infinity, thereby keeping the number of observations per instrument constant.

Sort of; the actual F-statistic is (1/σ̂²η)π̂′Z′Zπ̂/Q, where hats denote estimates. (1/σ²η)E(π′Z′Zπ)/Q is therefore sometimes called the population F-statistic, since it's the F-statistic we'd get in an infinitely large sample. In practice, the distinction between population and sample F matters little in this context.

LIML is available in SAS and in STATA 10. With weak instruments, LIML standard errors are not quite right, but Bekker (1994) gives a simple fix for this. Why is LIML unbiased? Expression (4.6.21) shows that the approximate bias of 2SLS is proportional to the bias of OLS. From this we conclude that there is a linear combination of OLS and 2SLS that is approximately unbiased. LIML turns out to be just such a "combination estimator". Like the bias of 2SLS, the approximate unbiasedness of LIML can be shown using a Bekker-style group-asymptotic sequence that fixes the ratio of instruments to sample size. It's worth mentioning, however, that LIML is biased in models with a certain type of heteroskedasticity; see Hausman, Newey, and Wouterson (2006) for details.

 A recent paper by Chernozhukov and Hansen (2007) formalizes this maxim.

Cruz and Moreira (2005) similarly conclude that, low F-statistics notwithstanding, there is little bias in the Angrist and Krueger (1991) 180-instrument specifications.

In some cases, we can allow heterogeneous treatment effects so that

$$E[Y_{1it} - Y_{0it}|A_i, X_{it}, t] = \rho_i.$$

See, e.g., Wooldridge (2005), who discusses estimators for the average of ρi.

An alternative to the fixed-effects specification is "random effects" (see, e.g., Wooldridge, 2006). The random-effects model assumes that ai is uncorrelated with the regressors. Because the omitted variable in a random-effects model is uncorrelated with included regressors, there is no bias from ignoring it - in effect, it becomes part of the residual. The most important consequence of random effects is that the residuals for a given person are correlated across periods. Chapter 8 discusses the implications of this for standard errors. An alternative approach is GLS, which promises to be more efficient if the assumptions of the random-effects model are satisfied (linear CEF, homoskedasticity). We prefer OLS/fix-the-standard-errors to GLS under random-effects assumptions. As discussed in Section 3.4.1, GLS requires stronger assumptions than those we are comfortable with and the resulting efficiency gain is likely to be modest.

Why is deviating from means the same as estimating each fixed effect in (5.1.3)? Because, by the regression anatomy formula, (3.1.3), any set of multivariate regression coefficients can be estimated in two steps. To get the multivariate coefficient on one set of variables, first regress them on all the other included variables, then regress the original dependent variable on the residuals from this first step. The residuals from a regression on a full set of person-dummies in a person-year panel are deviations from person means.
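The equivalence is easy to verify: OLS with a full set of person dummies and OLS on person-demeaned data return the same coefficient. A numpy sketch with made-up panel data:

```python
import numpy as np

rng = np.random.default_rng(7)
n_people, T = 20, 5
person = np.repeat(np.arange(n_people), T)
alpha = rng.normal(size=n_people)                  # person fixed effects
x = alpha[person] + rng.normal(size=n_people * T)  # regressor correlated with alpha
y = 2.0 * x + alpha[person] + rng.normal(size=n_people * T)

# (a) brute force: include a dummy for every person
D = (person[:, None] == np.arange(n_people)).astype(float)
b_dummies = np.linalg.lstsq(np.column_stack([x, D]), y, rcond=None)[0][0]

# (b) within transformation: deviations from person means
def demean(v):
    means = np.bincount(person, weights=v) / T
    return v - means[person]

b_within = demean(x) @ demean(y) / (demean(x) @ demean(x))
print(np.isclose(b_dummies, b_within))  # True
```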

The fixed effects are not estimated consistently in a panel where the number of periods T is fixed while N → ∞. This is called the "incidental parameters problem," a name which reflects the fact that the number of parameters grows with the sample size. Nevertheless, other parameters in the fixed effects model - the ones we care about - are consistently estimated.

 See Griliches and Hausman (1986) for a more complete analysis of measurement error in panel data.

The DD idea is at least as old as IV. Kennan (1995) references a 1915 BLS report using DD to study the employment effects of the minimum wage (Obenauer and von der Nienburg, 1915).

The common trends assumption can be applied to transformed data, for example,

$$E[\log Y_{0ist}|s, t] = \gamma_s + \lambda_t.$$

Note, however, that if there is a common trend in logs, there will not be one in levels and vice versa. Athey and Imbens (2006) introduce a semi-parametric DD estimator that allows for common trends after an unknown transformation, which they propose to use the data to estimate. Poterba, Venti and Wise (1995) and Meyer, Viscusi, and Durbin (1995) discuss DD-type models for quantiles.

Card weights estimates of (5.2.4) by the sample size used to construct averages for each state. Other specifications in the spirit of (5.2.4) put a normalized function of state and federal minimum wages on the right hand side instead of FAs · dt. See, for example, Neumark and Wascher (1992), who work with the difference between state and federal minima, adjusted for minimum-wage coverage provisions, and normalized by state average hourly wages.

Abadie, Diamond, and Hainmueller (2007) develop a semiparametric version of the lagged-dependent variables model, more flexible than the traditional regression setup. As with our regression setup, the key assumption in this model is conditional independence.

See Holtz-Eakin, Newey and Rosen (1988), Arellano and Bond (1991), Blundell and Bond (1998) for details and examples.

In particular, setting θ = 1 in (5.3.3) does not produce the fixed-effects model as a special case of the lagged dependent variables model. Instead we get

$$\Delta Y_{it} = \alpha + \lambda_t + \rho D_{it} + X_{it}'\delta + \varepsilon_{it},$$

i.e., a differenced dependent variable with regressors in levels. This is not the model with first differences on both the right and left side needed to kill the fixed effect.

The basic structure of RD designs appears to have emerged simultaneously in a number of disciplines but has only recently become important in applied econometrics. Cook (2008) gives an intellectual history. In a recent paper using Lalonde (1986) style within-study comparisons, Cook and Wong (2008) find that RD generally does a good job of reproducing the results from randomized trials.

Hoxby (2000) also uses this idea to check RD estimates of class size effects. A fully nonparametric approach requires data-driven rules for selection of the width of the discontinuity-sample window, also known as "bandwidth". The bandwidth must shrink with the sample size at a rate sufficiently slow so as to ensure consistent estimation of the underlying conditional mean functions. See Imbens and Lemieux (2007) for details. We prefer to think of estimation using (6.1.4) or (6.1.6) as essentially parametric: in any given sample, the estimates are only as good as the model for E[Y0i|xi] that you happen to be using. Promises about how you might change the model if you had more data should be irrelevant.

The fitted values in this figure are from a Logit model for the probability of winning as a function of the cutoff indicator Di = 1(xi > 0), a 4th-order polynomial in xi, and interactions between the polynomial terms and Di.

The idea of using jumps in the probability of assignment as a source of identifying information appears to originate with Trochim (1984), although the IV interpretation came later. Not everyone agrees that fuzzy RD is IV, but this view is catching on. In a recent history of the RD idea, Cook (2008) writes about the fuzzy design: "In many contexts, the cutoff value can function as an IV and engender unbiased causal conclusions. . . fuzzy assignment does not seem as serious a problem today as earlier."

van der Klaauw's original working paper circulated in 1997. Note that the fact that (6.2.2) is only an approximation of E[Di|xi] is not very important; second-stage estimates are still consistent.

Alternately, center neither the first stage nor the second stage. In this case, however, ρ no longer captures the treatment effect at the cutoff.

The Angrist and Lavy (1999) study differs modestly from the description here in that the data used to estimate equation (6.2.6) are class averages. But since the covariates are all defined at the class or school level, the only difference between student-level and class-level estimation is the implicit weighting by number of students in the student-level estimates.

More generally, we can define the CQF for discrete random variables and random variables with less-than-well-behaved densities as

$$Q_\tau(Y_i|X_i) = \inf\{y : F_Y(y|X_i) \geq \tau\}.$$
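A direct implementation of the inf definition for an empirical distribution makes its behavior at discrete mass points concrete:

```python
import numpy as np

def empirical_quantile(y, tau):
    """Q_tau(Y) = inf{ q : F_Y(q) >= tau } for the empirical CDF of y."""
    ys = np.sort(np.asarray(y))
    cdf = np.arange(1, len(ys) + 1) / len(ys)
    return ys[np.searchsorted(cdf, tau)]

# Discrete example: Y in {0, 1, 2} with masses .3, .4, .3
y = np.array([0]*3 + [1]*4 + [2]*3)
print(empirical_quantile(y, 0.5))   # 1
print(empirical_quantile(y, 0.71))  # 2  (F(1) = .7 < .71)
```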

See Card and Lemieux (1996) for an empirical example of a regression model with this sort of heteroskedasticity. Koenker and Portnoy (1996) call this a linear location-scale model.

The results in table 7.1.1 include two sets of standard errors. The first are conventional standard errors, of the sort reported by Stata’s qreg command (also specifying "robust"). These presume the CQF is truly linear. The formula for these is

τ(1 − τ){E[fuτ(0|Xi)XiXi′]}⁻¹ E[XiXi′] {E[fuτ(0|Xi)XiXi′]}⁻¹,

where fuτ(0|Xi) is the conditional density of the quantile-regression residual at zero. If the residuals are homoskedastic, this simplifies to [τ(1 − τ)/fu(0)²] E[XiXi′]⁻¹. The second set are robust to misspecification, computed using formulas in Angrist, Chernozhukov, and Fernandez-Val (2006). In this example, the impact of nonlinearity on standard errors is minor.

For example, if y is the conditional median, then Fy(y|Xi) = .5 and half of all conditional quantiles are below y. The relation (7.1.9) can be proved formally using the change of variables formula.

For an alternative approach, see Chernozhukov and Hansen (2005), which allows for regressors of any type (i.e., not just dummies), but invokes a rank-invariance assumption that is unnecessary in the QTE framework.

See, for example, Heckman, Smith, and Clements (1997).

Intuitively, this is because κ "finds compliers." A formal statement of this result appears in Abadie, Angrist, and Imbens (2002; Lemma 3.2).

Step-by-step, it goes like this:

1. Probit Zi on Yi and Xi separately in the Di = 0 and Di = 1 subsamples. Save these fitted values.

2. Probit Zi on Xi in the whole sample. Save these fitted values.

3. Construct E[κi|Yi, Di, Xi] by plugging the two sets of fitted values into (7.2.5). Set anything less than zero to zero and anything greater than one to one.

4. Use these kappas to weight quantile regressions.

5. Bootstrap this whole procedure to construct standard errors.

See Bloom et al. (1997).

The property Σj hij = 1 holds whenever the regression includes a constant, since the rows of the hat matrix H = X(X′X)⁻¹X′ then sum to one. The diagonal elements hii sum to the number of regressors, so in a bivariate regression, Σi hii = 2. Think of hii as a random variable with a uniform distribution in the sample; then E[hii] = 2/N, so the typical leverage is small when the sample is large.
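Both properties of the hat matrix are easy to verify numerically; a minimal numpy check with simulated regressors (ours, for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
X = np.column_stack([np.ones(N), rng.normal(size=N)])   # bivariate regression

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix X(X'X)^{-1}X'
h = np.diag(H)                          # leverages h_ii

print(H.sum(axis=1)[:3])   # each row sums to 1 (constant included)
print(np.trace(H))         # sum of leverages equals the number of regressors
print(h.mean())            # average leverage equals 2/N
```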

A jackknife variance estimator estimates sampling variance from the empirical distribution generated by omitting one observation at a time. Stata computes HC1, HC2, and HC3. You can also use a trick suggested by Messer and White (1984): divide Yi and Xi by √ψ̂i and instrument the transformed model by Xi/√ψ̂i, for your preferred estimate ψ̂i of the residual variance.
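The HC family itself is a one-line sandwich once the leverages are in hand; a minimal numpy sketch under simulated heteroskedastic data (the data-generating process is ours, for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ [1.0, 0.5] + rng.normal(size=N) * (1 + X[:, 1] ** 2)  # heteroskedastic errors

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b                                 # OLS residuals
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # leverages h_ii

def sandwich(w):
    """(X'X)^{-1} X' diag(w) X (X'X)^{-1} for a vector of weights w."""
    return XtX_inv @ (X * w[:, None]).T @ X @ XtX_inv

k = X.shape[1]
V = {
    "HC0": sandwich(e**2),                 # White's original estimator
    "HC1": sandwich(e**2) * N / (N - k),   # degrees-of-freedom correction
    "HC2": sandwich(e**2 / (1 - h)),       # leverage adjustment
    "HC3": sandwich(e**2 / (1 - h) ** 2),  # approximate jackknife
}
se = {name: np.sqrt(np.diag(v)) for name, v in V.items()}
```

Since 1/(1 − hii)² ≥ 1/(1 − hii) ≥ 1, the estimates are ordered: HC3 ≥ HC2 ≥ HC0, element by element on the diagonal.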

This is known as the Behrens-Fisher problem (see e. g. DeGroot and Schervish, 2001, ch. 8).

Notice that HC2 is an unbiased estimator of the sampling variance, while the mean of the HC2 standard errors across sampling experiments (0.52) is still below the standard deviation of b (0.59). This comes from the fact that the standard error is the square root of the sampling variance, the sampling variance is itself estimated and hence has sampling variability, and the square root is a concave function.
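Jensen's inequality makes the direction of this bias easy to see in a toy simulation (the setup is ours: unbiased chi-square variance estimates whose square roots are biased down):

```python
import numpy as np

rng = np.random.default_rng(4)
true_var = 1.0
k = 10   # degrees of freedom of the variance estimate
# chi2_k / k draws are unbiased estimates of the true variance.
v_hat = true_var * rng.chisquare(k, size=100_000) / k

print(v_hat.mean())            # close to 1: the variance estimate is unbiased
print(np.sqrt(v_hat).mean())   # below 1: sqrt is concave, so E[sqrt(v)] < sqrt(E[v])
```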

The large sampling variance of robust standard error estimators is noted by Chesher and Austin (1991). Kauermann and Carroll (2001) propose an adjustment to confidence intervals to correct for this.

Yang, Hsu, and Zhao (2005) formalize the notion of test procedures based on the maximum of a set of test statistics with differing efficiency and robustness properties.

This sort of residual correlation structure is also a consequence of stratified sampling (see, e. g., Wooldridge, 2003). Most of the samples that we work with are close enough to random that we typically worry more about the dependence due to a group structure than clustering due to stratification.

With non-stochastic regressors and homoskedastic residuals, the Moulton factor is a finite-sample result. Survey statisticians call the Moulton factor the design effect because it tells us how much to adjust standard errors in stratified samples for deviations from simple random sampling (Kish, 1965).

Clustering can also be a problem in regression-discontinuity designs if the variable that determines treatment assignment varies only at a group level (see Card and Lee, 2008, for details).

Use Stata’s loneway command, for example.

See, e. g., Angrist and Lavy (2007) for an example of the latter two weighting schemes.

The Somebody Else’s Problem (SEP) Field, first identified as a natural phenomenon in Adams’ Life, the Universe, and Everything, is, according to Wikipedia, "a generated energy field that affects perception. . . Entities within the field will be perceived by an outside observer as ‘Somebody Else’s Problem’, and will therefore be effectively invisible unless the observer is specifically looking for the entity."

The matrix Ag is not unique; there are many such decompositions. Bell and McCaffrey (2002) use the symmetric square root of (I − Hg)⁻¹,

Ag = P Λ1/2 P′,

where P is the matrix of eigenvectors of (I − Hg)⁻¹, Λ is the diagonal matrix of the corresponding eigenvalues, and Λ1/2 is the diagonal matrix of the square roots of the eigenvalues. One problem with the Bell and McCaffrey adjustment is that (I − Hg) may not be of full rank, and hence the inverse may not exist for all designs. This happens, for example, when one of the regressors is a dummy variable that equals one for exactly one of the clusters and zero otherwise. This includes the panel DD model discussed by Bertrand et al. (2004), where you include a full set of state dummies and cluster by state. Moreover, the eigenvalue decomposition is computed for matrices the size of the groups; in many applications, group sizes are large enough that this becomes computationally intractable.
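The symmetric square root can be formed directly from the eigendecomposition; a numpy sketch for a single cluster in a simulated design (ours, for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
N, n_g = 50, 10
X = np.column_stack([np.ones(N), rng.normal(size=N)])
Xg = X[:n_g]                                # regressor rows for one cluster g

Hg = Xg @ np.linalg.solve(X.T @ X, Xg.T)    # H_g = X_g (X'X)^{-1} X_g'
M = np.linalg.inv(np.eye(n_g) - Hg)         # (I - H_g)^{-1}

lam, P = np.linalg.eigh(M)                  # M = P diag(lam) P'
Ag = P @ np.diag(np.sqrt(lam)) @ P.T        # A_g = P Lambda^{1/2} P'

np.allclose(Ag @ Ag, M)   # True: the square root recovers (I - H_g)^{-1}
```

In the degenerate designs described above, (I − Hg) has a zero eigenvalue and the inversion step fails, which is exactly the rank problem noted in the text.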

Donald and Lang (2007) discuss serial correlation examples where the regressor is fixed within the clustering dimension, but this is not the typical differences-in-differences setup.