Two-Stage Least Squares
The reduced-form equation, (4.1.4b), can be derived by substituting the first-stage equation, (4.1.4a), into the causal relation of interest, (4.1.6), which is also called a “structural equation” in simultaneous equations language. We then have:
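$$
\begin{aligned}
Y_i &= \alpha' X_i + \rho\,[X_i'\pi_{10} + \pi_{11} z_i + \xi_{1i}] + \eta_i \\
    &= X_i'[\alpha + \rho\pi_{10}] + \rho\pi_{11} z_i + [\rho\xi_{1i} + \eta_i] \\
    &= X_i'\pi_{20} + \pi_{21} z_i + \xi_{2i},
\end{aligned}
\tag{4.1.7}
$$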
where $\pi_{20} = \alpha + \rho\pi_{10}$, $\pi_{21} = \rho\pi_{11}$, and $\xi_{2i} = \rho\xi_{1i} + \eta_i$ in equation (4.1.4b). Equation (4.1.7) again shows why $\rho = \pi_{21}/\pi_{11}$. Note also that a slight rearrangement of (4.1.7) gives
$$
Y_i = \alpha' X_i + \rho\,[X_i'\pi_{10} + \pi_{11} z_i] + \xi_{2i},
\tag{4.1.8}
$$
where $[X_i'\pi_{10} + \pi_{11} z_i]$ is the population fitted value from the first-stage regression of $s_i$ on $X_i$ and $z_i$. Because $z_i$ and $X_i$ are uncorrelated with the reduced-form error, $\xi_{2i}$, the coefficient on $[X_i'\pi_{10} + \pi_{11} z_i]$ in the population regression of $Y_i$ on $X_i$ and $[X_i'\pi_{10} + \pi_{11} z_i]$ equals $\rho$.
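The relation $\rho = \pi_{21}/\pi_{11}$ can be checked numerically. The following is a minimal sketch on simulated data, not the Angrist and Krueger sample: the data-generating process, the coefficient values, the variable `ability` (an unobserved confounder), and the helper `ols` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
rho = 0.5  # assumed true causal effect of s on Y

# Simulated data; every coefficient below is an illustrative assumption.
x = rng.normal(size=n)                          # exogenous covariate X_i
z = rng.integers(0, 2, size=n).astype(float)    # binary instrument z_i
ability = rng.normal(size=n)                    # unobserved confounder
s = 1.0 + 0.5 * x + 1.0 * z + ability + rng.normal(size=n)
y = 2.0 + 0.3 * x + rho * s + ability + rng.normal(size=n)


def ols(X, y):
    """Return OLS coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


const = np.ones(n)
exog = np.column_stack([const, x, z])   # covariates plus the instrument

pi_1 = ols(exog, s)   # first stage, as in (4.1.4a): s on X and z
pi_2 = ols(exog, y)   # reduced form, as in (4.1.4b): Y on X and z

ils = pi_2[2] / pi_1[2]                               # pi_21 / pi_11
ols_rho = ols(np.column_stack([const, x, s]), y)[2]   # OLS of Y on X and s

print(f"ratio of reduced form to first stage: {ils:.3f}")
print(f"OLS coefficient on s:                 {ols_rho:.3f}")
```

In this simulated setting the ratio should land near the assumed value of `rho`, while the OLS coefficient is pushed away from it by the omitted confounder.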
In practice, of course, we almost always work with data from samples. Given a random sample, the first-stage fitted values in the population are consistently estimated by
$$
\hat{s}_i = X_i'\hat{\pi}_{10} + \hat{\pi}_{11} z_i,
$$
where $\hat{\pi}_{10}$ and $\hat{\pi}_{11}$ are OLS estimates from equation (4.1.4a). The coefficient on $\hat{s}_i$ in the regression of $Y_i$ on $X_i$ and $\hat{s}_i$ is called the Two-Stage Least Squares (2SLS) estimator of $\rho$. In other words, 2SLS estimates can be constructed by OLS estimation of the “second-stage equation,”
$$
Y_i = \alpha' X_i + \rho\,\hat{s}_i + [\eta_i + \rho(s_i - \hat{s}_i)].
\tag{4.1.9}
$$
This is called 2SLS because it can be done in two steps, the first estimating $\hat{s}_i$ using equation (4.1.4a), and the second estimating equation (4.1.9). The resulting estimator is consistent for $\rho$ because (a) first-stage estimates are consistent; and (b) the covariates, $X_i$, and instruments, $z_i$, are uncorrelated with both $\eta_i$ and $(s_i - \hat{s}_i)$.
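The two steps can be sketched directly with two OLS regressions. This is an illustrative sketch on simulated data; the data-generating process and the helper `ols` are assumptions, not part of the original analysis. It also compares the two-step coefficient with the ILS ratio (reduced-form coefficient on the instrument divided by the first-stage coefficient), which coincide when there is one instrument.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Simulated data; every coefficient below is an illustrative assumption.
x = rng.normal(size=n)                          # exogenous covariate X_i
z = rng.integers(0, 2, size=n).astype(float)    # instrument z_i
ability = rng.normal(size=n)                    # unobserved confounder
s = 1.0 + 0.5 * x + 1.0 * z + ability + rng.normal(size=n)
y = 2.0 + 0.3 * x + 0.5 * s + ability + rng.normal(size=n)


def ols(X, y):
    """Return OLS coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


const = np.ones(n)
first_stage_X = np.column_stack([const, x, z])

# Step 1: estimate the first stage and form the fitted values s_hat.
pi_hat = ols(first_stage_X, s)
s_hat = first_stage_X @ pi_hat

# Step 2: regress Y on the covariates and s_hat, as in (4.1.9).
rho_2sls = ols(np.column_stack([const, x, s_hat]), y)[2]

# ILS: reduced-form coefficient on z over first-stage coefficient on z.
rho_ils = ols(first_stage_X, y)[2] / pi_hat[2]

print(f"two-step 2SLS: {rho_2sls:.6f}")
print(f"ILS ratio:     {rho_ils:.6f}")
```

The two numbers agree up to floating-point error because the equality is algebraic, not a feature of this particular simulated sample.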
The 2SLS name notwithstanding, we don’t usually construct 2SLS estimates in two steps. For one thing, the resulting standard errors are wrong, as we discuss later. Typically, we let specialized software routines (such as are available in SAS or Stata) do the calculation for us. This gets the standard errors right and helps to avoid other mistakes (see Section 4.6.1, below). Still, the fact that the 2SLS estimator can be computed by a sequence of OLS regressions is one way to remember why it works. Intuitively, conditional on covariates, 2SLS retains only the variation in $s_i$ that is generated by quasi-experimental variation, i.e., generated by the instrument, $z_i$.
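A small simulation can illustrate why the manually computed second-step standard errors differ from the ones software reports. The passage defers the explanation; the sketch below relies on the standard textbook account, stated here as an assumption: the second-step OLS residual is built from $\hat{s}_i$, whereas the 2SLS residual uses the actual $s_i$. For simplicity the sketch uses the conventional homoskedastic variance formula rather than robust standard errors, and all data are simulated.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Simulated data; every coefficient below is an illustrative assumption.
x = rng.normal(size=n)
z = rng.integers(0, 2, size=n).astype(float)
ability = rng.normal(size=n)
s = 1.0 + 0.5 * x + 1.0 * z + ability + rng.normal(size=n)
y = 2.0 + 0.3 * x + 0.5 * s + ability + rng.normal(size=n)


def ols(X, y):
    """Return OLS coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


const = np.ones(n)
Z = np.column_stack([const, x, z])            # first-stage regressors
W = np.column_stack([const, x, s])            # regressors with actual s_i
s_hat = Z @ ols(Z, s)
W_hat = np.column_stack([const, x, s_hat])    # regressors with fitted s_hat

beta = ols(W_hat, y)                          # 2SLS coefficients
k = W_hat.shape[1]
bread = np.linalg.inv(W_hat.T @ W_hat)

# Naive: residuals from the second-step regression itself (built from s_hat).
resid_naive = y - W_hat @ beta
# 2SLS: same coefficients, residuals evaluated at the actual s_i.
resid_2sls = y - W @ beta

se_naive = np.sqrt(resid_naive @ resid_naive / (n - k) * bread[2, 2])
se_2sls = np.sqrt(resid_2sls @ resid_2sls / (n - k) * bread[2, 2])

print(f"naive second-step SE: {se_naive:.4f}")
print(f"2SLS SE:              {se_2sls:.4f}")
```

In this simulated example the naive standard error is too large; in other settings the discrepancy can go in either direction, which is one reason to leave the calculation to a dedicated 2SLS routine.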
2SLS is a many-splendored thing. For one, it is an instrumental variables estimator: the 2SLS estimate of $\rho$ in (4.1.9) is the sample analog of $\frac{Cov(Y_i, \hat{s}_i^*)}{Cov(s_i, \hat{s}_i^*)}$, where $\hat{s}_i^*$ is the residual from a regression of $\hat{s}_i$ on $X_i$. This follows from the multivariate regression anatomy formula and the fact that $Cov(s_i, \hat{s}_i^*) = V(\hat{s}_i^*)$. It is also easy to show that, in a model with a single endogenous variable and a single instrument, the 2SLS estimator is the same as the corresponding ILS estimator.[41]
The link between 2SLS and IV warrants a bit more elaboration in the multi-instrument case. Assuming each instrument captures the same causal effect (a strong assumption that is relaxed below), we might want to combine these alternative IV estimates into a single more precise estimate. In models with multiple instruments, 2SLS provides just such a linear combination by combining multiple instruments into a single instrument. Suppose, for example, we have three instrumental variables, $z_{1i}$, $z_{2i}$, and $z_{3i}$. In the Angrist and Krueger (1991) application, these are dummies for first-, second-, and third-quarter births. The first-stage equation then becomes
$$
s_i = X_i'\pi_{10} + \pi_{11} z_{1i} + \pi_{12} z_{2i} + \pi_{13} z_{3i} + \xi_{1i},
\tag{4.1.10a}
$$
while the 2SLS second stage is the same as (4.1.9), except that the fitted values are from (4.1.10a) instead of (4.1.4a). The IV interpretation of this 2SLS estimator is the same as before: the instrument is the residual from a regression of first-stage fitted values on covariates. The exclusion restriction in this case is the claim that all of the quarter-of-birth dummies in (4.1.10a) are uncorrelated with $\eta_i$ in equation (4.1.6).
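The IV interpretation with several instruments can be verified numerically. The sketch below is an illustration on simulated data (the data-generating process, coefficient values, and the helper `ols` are assumptions): it builds three quarter-of-birth dummies, forms first-stage fitted values, residualizes them on the covariates, and uses that residual as a single instrument.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

# Simulated data; every coefficient below is an illustrative assumption.
x = rng.normal(size=n)                        # exogenous covariate
qob = rng.integers(1, 5, size=n)              # quarter of birth, 1..4
z = np.column_stack([(qob == q).astype(float) for q in (1, 2, 3)])
ability = rng.normal(size=n)                  # unobserved confounder
s = (12.0 + 0.5 * x + z @ np.array([-1.0, -0.6, -0.3])
     + ability + rng.normal(size=n))
y = 5.0 + 0.3 * x + 0.1 * s + ability + rng.normal(size=n)


def ols(X, y):
    """Return OLS coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


X = np.column_stack([np.ones(n), x])          # covariates, incl. constant
first_stage_X = np.column_stack([X, z])       # covariates plus 3 instruments

# First-stage fitted values, as in (4.1.10a), and the two-step 2SLS estimate.
s_hat = first_stage_X @ ols(first_stage_X, s)
rho_2sls = ols(np.column_stack([X, s_hat]), y)[-1]

# Single combined instrument: residual from regressing s_hat on covariates.
s_star = s_hat - X @ ols(X, s_hat)
cov_s_sstar = np.cov(s, s_star)[0, 1]
var_sstar = np.var(s_star, ddof=1)
rho_iv = np.cov(y, s_star)[0, 1] / cov_s_sstar

print(f"two-step 2SLS:              {rho_2sls:.6f}")
print(f"IV with residualized s_hat: {rho_iv:.6f}")
```

Both the equality of the two estimates and the equality of `cov_s_sstar` and `var_sstar` are algebraic identities, so they hold in any sample up to floating-point error.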
The results of 2SLS estimation of a schooling equation using three quarter-of-birth dummies, as well as other interactions, are shown in Table 4.1.1, which reports OLS and 2SLS estimates of models similar to those estimated by Angrist and Krueger (1991). Each column in the table contains OLS and 2SLS estimates of $\rho$ from an equation like (4.1.6), estimated with different combinations of instruments and control variables. The OLS estimate in column 1 is from a regression of log wages with no control variables, while the OLS estimates in column 2 are from a model adding dummies for year of birth and state of birth as control variables. In both cases, the estimated return to schooling is around .075.
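[41] With a single instrument, $\hat{s}_i^* = \tilde{z}_i\hat{\pi}_{11}$, where $\tilde{z}_i$ is the residual from a regression of $z_i$ on $X_i$, so the 2SLS estimator is the sample analog of $\frac{1}{\hat{\pi}_{11}} \cdot \frac{Cov(Y_i, \tilde{z}_i)}{V(\tilde{z}_i)}$. But the sample analog of the numerator, $\frac{Cov(Y_i, \tilde{z}_i)}{V(\tilde{z}_i)}$, is the OLS estimate of $\pi_{21}$ in the reduced form, (4.1.4b), while $\hat{\pi}_{11}$ is the OLS estimate of the first-stage effect, $\pi_{11}$, in (4.1.4a). Hence, 2SLS with a single instrument is ILS, i.e., the ratio of the reduced-form effect of the instrument to the corresponding first-stage effect, where both the first stage and the reduced form include covariates.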
Table 4.1.1: 2SLS estimates of the economic returns to schooling

                            OLS                 2SLS
                            (1)       (2)       (3)       (4)       (5)       (6)       (7)       (8)
Years of education          0.075     0.072     0.103     0.112     0.106     0.108     0.089     0.061
                            (0.0004)  (0.0004)  (0.024)   (0.021)   (0.026)   (0.019)   (0.016)   (0.031)
Covariates:
Age (in quarters)           -         -         -         -         -         -         -         ✓
Age (in quarters) squared   -         -         -         -         -         -         -         ✓
9 year of birth dummies     -         ✓         -         -         ✓         ✓         ✓         ✓
50 state of birth dummies   -         ✓         -         -         ✓         ✓         ✓         ✓

Instruments, by column:
(3) dummy for QOB=1
(4) dummy for QOB=1 or QOB=2
(5) dummy for QOB=1
(6) full set of QOB dummies
(7) full set of QOB dummies int. with year of birth dummies
(8) full set of QOB dummies int. with year of birth dummies

Notes: The table reports OLS and 2SLS estimates of the returns to schooling using the Angrist and Krueger (1991) 1980 Census sample. This sample includes native-born men, born 1930-1939, with positive earnings and nonallocated values for key variables. The sample size is 329,509. Robust standard errors are reported in parentheses.
The first pair of IV estimates, reported in columns 3 and 4, are from models without controls. The instrument used to construct the estimates in column 3 is a single dummy for first-quarter births, while the instruments used to construct the estimates in column 4 are a pair of dummies indicating first- and second-quarter births. The resulting estimates range from .10 to .11. The results from models including year of birth and state of birth dummies as control variables are similar, not surprisingly, since quarter of birth is not closely related to either of these controls. Overall, the 2SLS estimates are mostly a bit larger than the corresponding OLS estimates. This suggests that the observed association between schooling and earnings is not driven by omitted variables like ability and family background.
Column 7 in Table 4.1.1 shows the results of adding interaction terms to the instrument list. In particular, this specification adds interactions between the three quarter-of-birth dummies and 9 dummies for year of birth (the sample includes cohorts born 1930-39), for a total of 30 excluded instruments. The first-stage equation becomes
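$$
\begin{aligned}
s_i = {} & X_i'\pi_{10} + \pi_{11} z_{1i} + \pi_{12} z_{2i} + \pi_{13} z_{3i} \\
& + \sum_j (B_{ij} z_{1i})\kappa_{1j} + \sum_j (B_{ij} z_{2i})\kappa_{2j} + \sum_j (B_{ij} z_{3i})\kappa_{3j} + \xi_{1i},
\end{aligned}
\tag{4.1.10b}
$$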
where $B_{ij}$ is a dummy equal to one if individual $i$ was born in year $j$, for $j$ equal to 1931-39. The coefficients $\kappa_{1j}$, $\kappa_{2j}$, $\kappa_{3j}$ are the corresponding year-of-birth interactions. These interaction terms capture differences in the relation between quarter of birth and schooling across cohorts. The rationale for adding these interaction terms is an increase in precision that comes from increasing the first-stage $R^2$, which goes up because the quarter-of-birth pattern in schooling differs across cohorts. In this example, the addition of interaction terms to the instrument list leads to a modest gain in precision; the standard error declines from .0194 to .0161.[42]
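The count of 30 excluded instruments follows from 3 quarter-of-birth main effects plus 3 x 9 interactions. The sketch below constructs such an instrument list on simulated birth dates; the uniform draws and variable names are assumptions, and the 50 state-of-birth dummies used in the actual specification are omitted to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000

# Simulated birth dates (assumed uniform over cohorts and quarters).
yob = rng.integers(1930, 1940, size=n)    # year of birth, 1930..1939
qob = rng.integers(1, 5, size=n)          # quarter of birth, 1..4

# z_1i, z_2i, z_3i: dummies for first-, second-, and third-quarter births.
qob_dummies = np.column_stack([(qob == q).astype(float) for q in (1, 2, 3)])

# B_ij: dummies for birth years 1931..1939 (1930 is the omitted year).
yob_dummies = np.column_stack(
    [(yob == j).astype(float) for j in range(1931, 1940)]
)

# Interactions B_ij * z_qi: 3 quarters x 9 years = 27 columns.
interactions = np.column_stack([
    yob_dummies[:, j] * qob_dummies[:, q]
    for q in range(3)
    for j in range(9)
])

instruments = np.column_stack([qob_dummies, interactions])
controls = np.column_stack([np.ones(n), yob_dummies])
first_stage_X = np.column_stack([controls, instruments])

n_instruments = instruments.shape[1]
rank = np.linalg.matrix_rank(first_stage_X)

print(f"excluded instruments: {n_instruments}")
print(f"first-stage columns:  {first_stage_X.shape[1]}, rank: {rank}")
```

With a constant, 9 year dummies, and 30 instruments, the first-stage design has 40 columns, one for each year-by-quarter cell, so it has full column rank as long as every cell is populated.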
The last 2SLS model reported in Table 4.1.1 includes controls for linear and quadratic terms in age-in-quarters in the list of covariates, $X_i$. In other words, someone who was born in the first quarter of 1930 is recorded as being 50 years old on census day (April 1), 1980, while someone born in the fourth quarter is recorded as being 49.25 years old. This finely coded age variable, entered into the model with a linear and quadratic term, provides a partial control for the fact that small differences in age may be an omitted variable that confounds the quarter-of-birth identification strategy. As long as the effects of age are similarly smooth, the quadratic age-in-quarters model will pick them up.
This variation in the 2SLS setup illustrates the interplay between identification and estimation. For the 2SLS procedure to work, there must be some variation in the first-stage fitted values conditional on whatever control variables (covariates) are included in the model. If the first-stage fitted values are a linear combination of the included covariates, then the 2SLS estimate simply does not exist. In equation (4.1.9) this is manifest by perfect multicollinearity. 2SLS estimates with quadratic age exist. But the variability “left over” in the first-stage fitted values is reduced when the covariates include variables like age in quarters, that are closely related to the instruments (quarter-of-birth dummies). Because this variability is the primary determinant of 2SLS standard errors, the estimate in column 8 is markedly less precise than that in column 7, though it is still close to the corresponding OLS estimate.
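The existence condition can be illustrated by measuring how much variation remains in the first-stage fitted values once the covariates are partialled out. The example below is hypothetical and uses simulated data: the instrument is a dummy for being born in the first half of the year, and the helper `leftover_variance` is an assumption introduced for this sketch. When the covariates are a full set of quarter-of-birth dummies, the instrument is an exact linear combination of them and nothing is left over.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4_000

# Simulated data; every coefficient below is an illustrative assumption.
qob = rng.integers(1, 5, size=n)              # quarter of birth, 1..4
x = rng.normal(size=n)                        # unrelated covariate
first_half = (qob <= 2).astype(float)         # hypothetical instrument
s = 12.0 - 1.0 * first_half + 0.5 * x + rng.normal(size=n)


def ols(X, y):
    """Return OLS coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def leftover_variance(X, z, s):
    """Variance of first-stage fitted values after partialling out X."""
    Z = np.column_stack([X, z])
    s_hat = Z @ ols(Z, s)                     # first-stage fitted values
    s_star = s_hat - X @ ols(X, s_hat)        # residual of s_hat on X
    return np.var(s_star)


const = np.ones(n)

# Case 1: covariates unrelated to the instrument.
X_ok = np.column_stack([const, x])

# Case 2: covariates are quarter-of-birth dummies, so the instrument
# (Q1 dummy + Q2 dummy) is an exact linear combination of them.
qob_dummies = np.column_stack([(qob == q).astype(float) for q in (1, 2, 3)])
X_bad = np.column_stack([const, qob_dummies])

leftover_ok = leftover_variance(X_ok, first_half, s)
leftover_bad = leftover_variance(X_bad, first_half, s)

print(f"leftover variation, unrelated covariates: {leftover_ok:.4f}")
print(f"leftover variation, collinear covariates: {leftover_bad:.2e}")
```

In the second case the leftover variation is zero up to rounding error, so the second-stage regression would be perfectly collinear and the 2SLS estimate would not exist. Intermediate cases, where the covariates are closely but not perfectly related to the instruments, leave a small positive amount of variation and hence large standard errors.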
Recap of IV and 2SLS Lingo
As we’ve seen, the endogenous variables are the dependent variable and the independent variable(s) to be instrumented; in a simultaneous equations model, endogenous variables are determined by solving a system of stochastic linear equations. To treat an independent variable as endogenous is to instrument it, i.e., to replace it with fitted values in the second stage of a 2SLS procedure. The independent endogenous variable in the Angrist and Krueger (1991) study is schooling. The exogenous variables include the exogenous covariates that are not instrumented and the instruments themselves. In a simultaneous equations model, exogenous variables are determined outside the system. The exogenous covariates in the Angrist and Krueger (1991) study are dummies for year of birth and state of birth. We think of exogenous covariates as controls. 2SLS aficionados live in a world of mutually exclusive labels: in any empirical study involving instrumental variables, the random variables to be studied are either dependent variables, independent endogenous variables, instrumental variables, or exogenous covariates. Sometimes we shorten this to: dependent and endogenous variables, instruments and covariates (fudging the fact that the dependent variable is also endogenous in a traditional SEM).