The Bias of 2SLS
It is a fortunate fact that the OLS estimator is not only consistent, it is also unbiased. This means that in a sample of any size, the estimated OLS coefficient vector has a distribution that is centered on the population coefficient vector.[74] The 2SLS estimator, in contrast, is consistent, but biased. This means that the 2SLS estimator only promises to be close to the causal effect of interest in large samples. In small samples, the 2SLS estimator can differ systematically from the population estimand.
For many years, applied researchers have lived with the knowledge that 2SLS is biased without losing too much sleep. Neither of us heard much about the bias of 2SLS in our graduate econometrics classes. A series of papers in the early 1990s changed this, however. These papers show that 2SLS estimates can be highly misleading in cases relevant for empirical practice.[75]
The 2SLS estimator is most biased when the instruments are “weak,” meaning the correlation with endogenous regressors is low, and when there are many overidentifying restrictions. When the instruments are both many and weak, the 2SLS estimator is biased towards the probability limit of the corresponding OLS estimate. In the worst-case scenario for many weak instruments, when the instruments are so weak that there really is no first stage in the population, the 2SLS sampling distribution is centered on the probability limit of OLS. The theory behind this result is a little technical but the basic idea is easy to see. The source of the bias in 2SLS estimates is the randomness in estimates of the first-stage fitted values. In practice, the first-stage estimates reflect some of the randomness in the endogenous variable since the first-stage coefficients come from a regression of the endogenous variable on the instruments. If the population first stage is zero, then all of the randomness in the first stage is due to the endogenous variable. This randomness turns into finite-sample correlation between first-stage fitted values and the second-stage errors, since the endogenous variable is correlated with the second-stage errors (or else you wouldn’t be instrumenting in the first place).
A more formal derivation of 2SLS bias goes like this. To streamline the discussion we use matrices and vectors and a simple constant-effects model (it’s difficult to discuss bias in a heterogeneous-effects world, since the target parameter may vary across estimators). Suppose you are interested in estimating the effect of a single endogenous regressor, stored in a vector x, on a dependent variable, stored in the vector y, with no other covariates. The causal model of interest can then be written
y = βx + η.    (4.6.17)
The N × Q matrix of instrumental variables is Z, with the associated first-stage equation
x = Zπ + ξ.    (4.6.18)
OLS estimates of (4.6.17) are biased because η_i is correlated with ξ_i. The instruments, z_i, are uncorrelated with ξ_i by construction and uncorrelated with η_i by assumption.
The 2SLS estimator is
β̂_2SLS = (x′P_Z x)⁻¹ x′P_Z y = β + (x′P_Z x)⁻¹ x′P_Z η,
where P_Z = Z(Z′Z)⁻¹Z′ is the projection matrix that produces fitted values from a regression of x on Z. Substituting for x in x′P_Z η, we get
β̂_2SLS − β = (x′P_Z x)⁻¹ (π′Z′ + ξ′) P_Z η = (x′P_Z x)⁻¹ π′Z′η + (x′P_Z x)⁻¹ ξ′P_Z η.    (4.6.19)

The bias in 2SLS comes from the nonzero expectation of terms on the right-hand side.
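The algebra above is easy to verify numerically. The sketch below is our own illustration (every data-generating choice in it — the number of instruments, the coefficients, the error correlation — is assumed, not taken from the text): it computes 2SLS both through the projection matrix P_Z and as two explicit regression stages, which are algebraically identical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Illustrative data: three strong instruments, and correlated first- and
# second-stage errors (so OLS is biased while 2SLS is not, in large samples).
Z = rng.standard_normal((n, 3))
eta = rng.standard_normal(n)                    # second-stage error
xi = 0.8 * eta + 0.6 * rng.standard_normal(n)   # first-stage error, corr. with eta
pi = np.array([1.0, 0.5, 0.5])
x = Z @ pi + xi                                 # first stage, as in (4.6.18)
beta = 1.0
y = beta * x + eta                              # causal model, as in (4.6.17)

# 2SLS via the projection matrix P_Z = Z (Z'Z)^{-1} Z'
Pz = Z @ np.linalg.solve(Z.T @ Z, Z.T)
b_2sls = (x @ Pz @ y) / (x @ Pz @ x)

# The same estimator as two explicit stages: fit x on Z, then y on the fitted values
x_hat = Pz @ x
b_two_stage = (x_hat @ y) / (x_hat @ x_hat)

assert np.isclose(b_2sls, b_two_stage)
print(b_2sls)   # close to beta = 1, unlike the OLS slope (x @ y) / (x @ x)
```

With instruments this strong the bias terms in (4.6.19) are negligible; the weak-instrument problems discussed below arise when they are not.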
The expectation of (4.6.19) is hard to evaluate because the expectation operator does not pass through the inverse (x′P_Z x)⁻¹, a nonlinear function. It’s possible to show, however, that the expectation of the ratios on the right-hand side of (4.6.19) can be closely approximated by the ratio of expectations. In other words,
E[β̂_2SLS − β] ≈ (E[x′P_Z x])⁻¹ E[π′Z′η] + (E[x′P_Z x])⁻¹ E[ξ′P_Z η].
This approximation is much better than the usual first-order asymptotic approximation invoked in large-sample theory, so we think of it as giving us a good measure of the finite-sample behavior of the 2SLS estimator.[76] Furthermore, because E[π′Z′ξ] = 0 and E[π′Z′η] = 0, we have
E[β̂_2SLS − β] ≈ [E(π′Z′Zπ) + E(ξ′P_Z ξ)]⁻¹ E(ξ′P_Z η).    (4.6.20)
The approximate bias of 2SLS therefore comes from the fact that E(ξ′P_Z η) is not zero unless η_i and ξ_i are uncorrelated. But correlation between η_i and ξ_i is what led us to use IV in the first place.
Further manipulation of (4.6.20) generates an expression that is especially useful:

E[β̂_2SLS − β] ≈ (σ_ηξ/σ_ξ²) [E(π′Z′Zπ)/(Qσ_ξ²) + 1]⁻¹

(see the appendix for a derivation). The term E(π′Z′Zπ)/(Qσ_ξ²) is the F-statistic for the joint significance of all regressors in the first-stage regression. Call this statistic F, so that we can write

E[β̂_2SLS − β] ≈ (σ_ηξ/σ_ξ²) · 1/(F + 1).    (4.6.21)
From this we see that as the first-stage F-statistic gets small, the bias of 2SLS approaches σ_ηξ/σ_ξ². The bias of the OLS estimator is σ_ηξ/σ_x², which also equals σ_ηξ/σ_ξ² if π = 0. Thus, we have shown that 2SLS is centered on the same point as OLS when the first stage is zero. More generally, we can say 2SLS estimates are "biased towards" OLS estimates when there isn’t much of a first stage. On the other hand, the bias of 2SLS vanishes when F gets large, as should happen in large samples when π ≠ 0.
When the instruments are weak, the F-statistic itself varies inversely with the number of instruments. To see why, consider adding useless instruments to your 2SLS model, that is, instruments with no effect on the first-stage R-squared. The model sum of squares, E(π′Z′Zπ), and the residual variance, σ_ξ², will both stay the same while Q goes up. The F-statistic becomes smaller as a result. From this we learn that the addition of many weak instruments increases bias.
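It helps to plug numbers into (4.6.21) to see the magnitudes involved. The sketch below is our own arithmetic, with assumed values (σ_ξ² = 1, σ_ηξ = 0.8, and a fixed first-stage signal E(π′Z′Zπ) = 10 — none of these come from the text): holding the signal fixed while adding useless instruments drives F down and the approximate bias up.

```python
def approx_2sls_bias(signal, q, sigma_xi2=1.0, sigma_eta_xi=0.8):
    """Approximate 2SLS bias from (4.6.21).

    signal       -- E(pi' Z'Z pi), the first-stage "model sum of squares"
    q            -- number of instruments
    sigma_xi2    -- variance of the first-stage error (assumed 1.0 here)
    sigma_eta_xi -- covariance of first- and second-stage errors (assumed 0.8)
    """
    F = signal / (q * sigma_xi2)            # population analog of the first-stage F
    return (sigma_eta_xi / sigma_xi2) / (F + 1.0)

# Fixed signal, growing Q: F falls from 10 to 5 to 0.5 as q goes 1 -> 2 -> 20.
for q in (1, 2, 20):
    print(q, round(approx_2sls_bias(signal=10.0, q=q), 3))
```

With these assumed values the approximate bias rises from about 0.07 with one instrument to about 0.53 with twenty, even though the first-stage signal is unchanged throughout.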
Intuitively, the bias in 2SLS is a consequence of the fact that the first stage is estimated. If the first-stage coefficients were known, we could use x̂_pop = Zπ for the first-stage fitted values. These fitted values are uncorrelated with the second-stage error. In practice, however, we use x̂ = P_Z x = Zπ + P_Z ξ, which differs from x̂_pop by the term P_Z ξ. The bias in 2SLS arises from the fact that P_Z ξ is correlated with η, so some of the correlation between errors in the first and second stages seeps into our 2SLS estimates through the sampling variability in x̂. Asymptotically, this correlation is negligible, but real life does not play out in "asymptopia."
The bias formula, (4.6.21), shows that the bias in 2SLS is an increasing function of the number of instruments, so clearly bias is least in the just-identified case when the number of instruments is as low as it can get. It turns out, however, that just-identified 2SLS (say, the simple Wald estimator) is approximately unbiased. This is hard to show formally because just-identified 2SLS has no moments (i.e., the sampling distribution has fat tails). Nevertheless, even with weak instruments, just-identified 2SLS is approximately centered where it should be (we therefore say that just-identified 2SLS is median-unbiased). This is not to say that you can happily use weak instruments in just-identified models. With a weak instrument, just-identified IV estimates tend to be highly unstable and imprecise.
The LIML estimator is approximately median-unbiased for overidentified constant-effects models, and therefore provides an attractive alternative to just-identified estimation using one instrument at a time (see, e.g., Davidson and MacKinnon, 1993, and Mariano, 2001). LIML has the advantage of having the same large-sample distribution as 2SLS (under constant effects) while providing finite-sample bias reduction.[77] A number of estimators reduce the bias in overidentified 2SLS models, but an extensive Monte Carlo study by Flores-Lagunes (2007) suggests that LIML does at least as well as the alternatives in a wide range of circumstances (in terms of bias, mean absolute error, and the empirical rejection rates for t-tests). Another advantage of LIML is that many statistical packages compute it, while other estimators typically require some programming.
We use a small Monte Carlo experiment to illustrate some of the theoretical results from the discussion above. The simulated data are drawn from the following model,

y_i = βx_i + η_i

x_i = Σ_{j=1}^{Q} π_j z_{ij} + ξ_i

with β = 1, π₁ = 0.1, and π_j = 0 for all j > 1, where the z_{ij} are independent, normally distributed random variables with mean zero and unit variance. The sample size is 1,000.
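A version of this experiment is easy to run. The sketch below implements OLS, 2SLS, and LIML as k-class estimators (k = 0, k = 1, and the LIML eigenvalue, respectively). One caveat: the correlation between η_i and ξ_i is our assumption (we use 0.8); the text does not report the value it used, so the medians below only roughly track the figures.

```python
import numpy as np

rng = np.random.default_rng(42)

def quad(u, v, Z):
    """u' Pz v, computed without forming the n-by-n projection matrix."""
    return (Z.T @ u) @ np.linalg.solve(Z.T @ Z, Z.T @ v)

def k_class(y, x, Z, k):
    """k-class estimator: k = 0 is OLS, k = 1 is 2SLS, k = LIML kappa is LIML."""
    num = (1 - k) * (x @ y) + k * quad(x, y, Z)
    den = (1 - k) * (x @ x) + k * quad(x, x, Z)
    return num / den

def liml_kappa(y, x, Z):
    """Smallest eigenvalue of (W' Mz W)^{-1} (W'W), W = [y, x],
    for a model with no exogenous covariates (as in this design)."""
    W = np.column_stack([y, x])
    WW = W.T @ W
    WPW = np.array([[quad(a, b, Z) for b in (y, x)] for a in (y, x)])
    return np.linalg.eigvals(np.linalg.solve(WW - WPW, WW)).real.min()

def draw(n=1000, q=2, rho=0.8):
    # beta = 1, pi_1 = 0.1, remaining pi_j = 0, z_ij ~ N(0, 1), as in the text;
    # rho, the correlation between eta and xi, is our assumption.
    Z = rng.standard_normal((n, q))
    eta = rng.standard_normal(n)
    xi = rho * eta + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    x = 0.1 * Z[:, 0] + xi
    return x + eta, x, Z                     # y, x, Z

draws = {"OLS": [], "IV": [], "2SLS": [], "LIML": []}
for _ in range(1000):
    y, x, Z = draw()
    draws["OLS"].append(k_class(y, x, Z, 0.0))
    draws["IV"].append(k_class(y, x, Z[:, :1], 1.0))   # just-identified, Q = 1
    draws["2SLS"].append(k_class(y, x, Z, 1.0))        # Q = 2
    draws["LIML"].append(k_class(y, x, Z, liml_kappa(y, x, Z)))

for name, d in draws.items():
    print(name, round(float(np.median(d)), 2))
```

The qualitative pattern matches the discussion: the OLS median sits well above β = 1, just-identified IV and LIML are roughly centered on 1, and overidentified 2SLS is pulled part of the way towards OLS.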
Figure 4.6.1 shows the Monte Carlo distributions of four estimators: OLS, just-identified IV (i.e., 2SLS with Q = 1, labeled IV), 2SLS with two instruments (Q = 2, labeled 2SLS), and LIML with Q = 2. The OLS estimator is biased and centered around a value of about 1.79. IV is centered around 1, the value of β. 2SLS with one weak and one uninformative instrument is moderately biased towards OLS (the median is 1.07). The distribution function for LIML with Q = 2 is basically indistinguishable from that for just-identified IV, even though the LIML estimator uses a completely uninformative instrument.
Figure 4.6.2 reports simulation results where we set Q = 20. Thus, in addition to the one informative but weak instrument, we added 19 worthless instruments. The figure again shows OLS, 2SLS, and LIML distributions. The bias in 2SLS is now much worse (the median is 1.53, close to the OLS median). The sampling distribution of the 2SLS estimator is also much tighter than in the Q = 2 case. LIML continues to perform well and is centered around β = 1, with a bit more dispersion than in the Q = 2 case.[78]
Finally, Figure 4.6.3 reports simulation results from a model that is truly unidentified. In this case, we set π_j = 0 for j = 1, …, 20. Not surprisingly, all the sampling distributions are centered around the same value as OLS. On the other hand, the 2SLS sampling distribution is much tighter than the LIML distribution. We would say advantage LIML in this case, because the widely dispersed LIML sampling distribution correctly reflects the fact that the sample is uninformative about the parameter of interest.
What does this mean in practice? Besides retaining a vague sense of worry about your first stage, we recommend the following:
1. Report the first stage and think about whether it makes sense. Are the magnitude and sign as you would expect, or are the estimates too big or wrong-signed? If so, perhaps your hypothesized first-stage mechanism isn’t really there; rather, you simply got lucky.
2. Report the F-statistic on the excluded instruments. The bigger this is, the better. Stock, Wright, and Yogo (2002) suggest that F-statistics above about 10 put you in the safe zone, though obviously this cannot be a theorem.
3. Pick your best single instrument and report just-identified estimates using this one only. Just-identified IV is median-unbiased and therefore unlikely to be subject to a weak-instruments critique.
4. Check overidentified 2SLS estimates with LIML. LIML is less precise than 2SLS but also less biased. If the results come out similar, be happy. If not, worry, and try to find stronger instruments.
5. Look at the coefficients, t-statistics, and F-statistics for excluded instruments in the reduced-form regression of dependent variables on instruments. Remember that the reduced form is proportional to the causal effect of interest. Most importantly, the reduced-form estimates, since they are OLS, are unbiased. As Angrist and Krueger (2001) note, if you can’t see the causal relation of interest in the reduced form, it’s probably not there.[79]
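The last point rests on a simple identity: with one instrument, the IV estimate is exactly the reduced-form coefficient divided by the first-stage coefficient, so a reduced form indistinguishable from zero means the IV estimate has no content. A quick numerical illustration (our own toy data, not from the study discussed below):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Toy data: one instrument z, one endogenous regressor x, true effect beta = 1.
z = rng.standard_normal(n)
eta = rng.standard_normal(n)
x = 0.5 * z + 0.8 * eta + 0.6 * rng.standard_normal(n)
y = 1.0 * x + eta

first_stage = (z @ x) / (z @ z)     # slope from regressing x on z
reduced_form = (z @ y) / (z @ z)    # slope from regressing y on z
iv = reduced_form / first_stage     # just-identified IV (Wald) estimate

# The identity holds exactly: (z'y / z'z) / (z'x / z'z) = z'y / z'x
assert np.isclose(iv, (z @ y) / (z @ x))
print(iv)
```

Because the reduced-form slope is estimated by OLS, you can trust its sign and significance even when the IV estimate itself is fragile.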
We illustrate some of this reasoning in a reanalysis of the Angrist and Krueger (1991) quarter-of-birth study. Bound, Jaeger, and Baker (1995) argued that bias is a major concern when using quarter of birth as an instrument for schooling, in spite of the fact that the sample size exceeds 300,000. “Small sample” is clearly relative. Earlier in the chapter, we saw that the QOB pattern in schooling is clearly reflected in the reduced form, so there would seem to be little cause for concern. On the other hand, Bound, Jaeger, and Baker (1995) argue that the most relevant models have additional controls not included in these reduced forms. Table 4.6.2 reproduces some of the specifications from Angrist and Krueger (1991) as well as other specifications in the spirit of Bound, Jaeger, and Baker (1995).
Table 4.6.2: Alternative IV estimates of the economic returns to schooling

|                                       | (1)     | (2)     | (3)     | (4)     | (5)     | (6)     |
|---------------------------------------|---------|---------|---------|---------|---------|---------|
| 2SLS                                  | 0.105   | 0.435   | 0.089   | 0.076   | 0.093   | 0.091   |
|                                       | (0.020) | (0.450) | (0.016) | (0.029) | (0.009) | (0.011) |
| LIML                                  | 0.106   | 0.539   | 0.093   | 0.081   | 0.106   | 0.110   |
|                                       | (0.020) | (0.627) | (0.018) | (0.041) | (0.012) | (0.015) |
| F-statistic (excluded instruments)    | 32.27   | 0.42    | 4.91    | 1.61    | 2.58    | 1.97    |
| Controls:                             |         |         |         |         |         |         |
| Year of birth                         | ✓       | ✓       | ✓       | ✓       | ✓       | ✓       |
| State of birth                        |         |         |         |         | ✓       | ✓       |
| Age, age squared                      |         | ✓       |         | ✓       |         | ✓       |
| Excluded instruments:                 |         |         |         |         |         |         |
| Quarter of birth                      | ✓       | ✓       |         |         |         |         |
| Quarter of birth × year of birth      |         |         | ✓       | ✓       | ✓       | ✓       |
| Quarter of birth × state of birth     |         |         |         |         | ✓       | ✓       |
| Number of excluded instruments        | 3       | 2       | 30      | 28      | 180     | 178     |

Notes: The table compares 2SLS and LIML estimates using alternative sets of instruments and controls. The OLS estimate corresponding to the models reported in columns 1-4 is 0.071; the OLS estimate corresponding to the models reported in columns 5-6 is 0.067. Data are from the Angrist and Krueger (1991) 1980 Census sample. The sample size is 329,509. Standard errors are reported in parentheses.
The first column in the table reports 2SLS and LIML estimates of a model using three quarter-of-birth dummies as instruments, with year-of-birth dummies as covariates. The OLS estimate for this specification is 0.071, while the 2SLS estimate is a bit higher at 0.105. The first-stage F-statistic is over 32, well above the danger zone. Not surprisingly, the LIML estimate is almost identical to 2SLS in this case.
Angrist and Krueger (1991) experimented with models that include age and age squared (measured in quarters) as additional controls. These controls are meant to pick up omitted age effects that might confound the quarter-of-birth instruments. The addition of age and age squared reduces the number of instruments to two, since age in quarters, year of birth, and quarter of birth are linearly dependent. As shown in column 2, the first-stage F-statistic drops to 0.4 when age and age squared are included as controls, a sure sign of trouble. But the 2SLS standard error is high enough that we would not draw any substantive conclusions from this estimate. The LIML estimate is even less precise. This model is effectively unidentified.
Columns 3 and 4 report the results of adding interactions between quarter-of-birth dummies and year-of-birth dummies to the instrument list, so that there are 30 instruments, or 28 when the age and age squared variables are included. The first-stage F-statistics are 4.9 and 1.6 in these two specifications. The 2SLS estimates are a bit lower than in column 1 and hence closer to OLS. But LIML is not too far from 2SLS. Although the LIML standard error is pretty big in column 4, it is not so large that the estimate is uninformative. On balance, there seems to be little cause for worry about weak instruments, even with the age quadratic included.
The most worrisome specifications are those reported in columns 5 and 6. These estimates were produced by adding 150 interactions between quarter of birth and state of birth to the 30 interactions between quarter of birth and year of birth. The rationale for including state-of-birth interactions in the instrument list is to exploit differences in compulsory schooling laws across states. But this leads to highly overidentified models with 180 (or 178) instruments, many of which are weak. The first-stage F-statistics for these models are 2.6 and 2.0, well into the discomfort zone. On the plus side, the LIML estimates again look fairly similar to 2SLS. Moreover, the LIML standard errors differ little from the 2SLS standard errors in this case. This suggests that you can’t always determine instrument relevance using a mechanical rule such as "F > 10". In some cases, a low F may not be fatal.[80]
Finally, it’s worth noting that in applications with multiple endogenous variables, the conventional first-stage F is no longer appropriate. To see why, suppose there are two instruments for two endogenous variables, and that the first instrument is strong and predicts both endogenous variables well while the second instrument is weak. The first-stage F-statistics in each of the two first-stage equations are likely to be high, but the model is weakly identified because one instrument is not enough to capture two causal effects. A simple modification of the first-stage F for this case is given in the appendix.
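This failure of per-equation F-statistics is easy to demonstrate. In the sketch below (our own illustration; all values are assumed), one instrument strongly predicts both endogenous regressors while the other is pure noise: each first-stage F is enormous, yet the 2×2 matrix X′P_Z X that 2SLS must invert is nearly singular, so the two coefficients are barely identified.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2_000

# z1 strongly predicts BOTH endogenous regressors; z2 is useless noise.
z1, z2 = rng.standard_normal(n), rng.standard_normal(n)
Z = np.column_stack([z1, z2])
x1 = z1 + rng.standard_normal(n)
x2 = z1 + rng.standard_normal(n)

def first_stage_F(x, Z):
    """F-statistic for the joint significance of the instruments in a
    no-constant first stage x = Z pi + error (enough for this demo)."""
    x_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ x)
    ess = x_hat @ x_hat               # explained sum of squares
    rss = x @ x - ess                 # residual sum of squares
    q = Z.shape[1]
    return (ess / q) / (rss / (n - q))

F1, F2 = first_stage_F(x1, Z), first_stage_F(x2, Z)
print(F1, F2)                         # both far above any rule-of-thumb cutoff

# Yet X' Pz X is nearly singular: one strong instrument cannot
# separately identify two causal effects.
X = np.column_stack([x1, x2])
XPX = (Z.T @ X).T @ np.linalg.solve(Z.T @ Z, Z.T @ X)
print(np.linalg.cond(XPX))            # very large condition number
```

Statistics built on the smallest eigenvalue of a matrix like this one, rather than on equation-by-equation F-tests, are what diagnostics for multiple endogenous variables are designed to capture.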
Figure 4.6.1: Distribution of the OLS, IV, 2SLS, and LIML estimators. IV uses one instrument, while 2SLS and LIML use two instruments.