# Two-Sample IV and Split-Sample IV

The GMM interpretation of 2SLS highlights the fact that the IV estimator can be constructed from sample moments alone, with no micro data. Returning to the sample moment condition, (4.2.3), and re-arranging slightly produces a regression-like equation involving second moments:

Z'y/N = [Z'W/N]Γ + Z'η/N.   (4.3.1)

GLS estimates of Γ in (4.3.1) are consistent because E[Z'η] = 0. The 2SLS minimand can be thought of as GLS applied to equation (4.3.1), after multiplying by √N to keep the residual from disappearing as the sample size gets large. In other words, 2SLS minimizes a quadratic form in the residuals from (4.3.1) with a (possibly non-diagonal) weighting matrix.

An important insight that comes from writing the 2SLS problem in this way is that we do not need the individual observations in our sample to estimate (4.3.1). Just as the OLS coefficient vector can be constructed from the sample conditional mean function, IV estimators can also be constructed from sample moments. The moments needed for IV are Z'y and Z'W. The dependent variable, Z'y, is a vector of dimension [k+q] x 1. The regressor matrix, Z'W, is of dimension [k+q] x [k+1]. The second-moment equation cannot be solved exactly unless q = 1, so it makes sense to make the fit as good as possible by minimizing a quadratic form in the residuals. The most efficient weighting matrix for this purpose is the asymptotic covariance matrix of Z'η/√N. This again produces the 2SLS minimand, J_N(g).

A related insight is the fact that the moment matrices on the left- and right-hand sides of the equals sign in equation (4.3.1) need not come from the same data set, provided these data sets are drawn from the same population. This observation leads to the two-sample instrumental variables (TSIV) estimator used by Angrist (1990) and developed formally in Angrist and Krueger (1992). Briefly, let Z1 and Y1 denote the instrument/covariate matrix and dependent variable vector in data set 1 of size N1, and let Z2 and W2 denote the instrument/covariate matrix and endogenous variable/covariate matrix in data set 2 of size N2. Assuming plim (Z1'Z1/N1) = plim (Z2'Z2/N2), GLS estimates of the two-sample moment equation

Z1'Y1/N1 = [Z2'W2/N2]Γ + η12

are also consistent for Γ. The limiting distribution of this estimator is obtained by normalizing by √N1 and assuming plim (N1/N2) is a constant.
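As a concrete illustration of the point that no micro data are needed, the following NumPy sketch computes 2SLS twice on simulated data: once the conventional way, and once from the second moments Z'y, Z'W, and Z'Z alone. The data-generating process, sample size, and all variable names are illustrative assumptions, not from the text.

```python
import numpy as np

# Simulated data: the data-generating process below is purely illustrative.
rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=(n, 1))                      # instrument
u = rng.normal(size=(n, 1))                      # unobserved confounder
w = 0.8 * z + u + rng.normal(size=(n, 1))        # endogenous regressor
y = 2.0 * w + 3.0 * u + rng.normal(size=(n, 1))  # outcome; true coefficient is 2.0

Z = np.hstack([np.ones((n, 1)), z])  # instruments, including a constant
W = np.hstack([np.ones((n, 1)), w])  # regressors, including a constant

# Conventional 2SLS from micro data: regress y on first-stage fitted values.
W_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ W)
beta_2sls = np.linalg.solve(W_hat.T @ W, W_hat.T @ y)

# The same estimator built from second moments alone: Z'y, Z'W, and Z'Z.
Zy, ZW, ZZ = Z.T @ y, Z.T @ W, Z.T @ Z
A = ZW.T @ np.linalg.solve(ZZ, ZW)
b = ZW.T @ np.linalg.solve(ZZ, Zy)
beta_moments = np.linalg.solve(A, b)

print(np.allclose(beta_2sls, beta_moments))  # True: identical up to floating point
```

The two computations are algebraically the same estimator; only the second can be carried out when just the cross-product matrices are available.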

The utility of TSIV comes from the fact that it widens the scope for IV estimation to situations where observations on dependent variables, instruments, and the endogenous variable of interest are hard to find in a single sample. It may be easier to find one data set with information on outcomes and instruments, with which the reduced form can be estimated, and another data set with information on endogenous variables and instruments, with which the first stage can be estimated. For example, in Angrist (1990), administrative records from the Social Security Administration (SSA) provide information on the dependent variable (annual earnings) and the instruments (draft lottery numbers coded from dates of birth, as well as covariates for race and year of birth). The SSA, however, does not track participants' veteran status. This information was taken from military records, which also contain dates of birth that can be used to code lottery numbers. Angrist (1990) used these military records to construct Z2'W2, the first-stage correlation between lottery numbers and veteran status conditional on race and year of birth, while the SSA data were used to construct Z1'Y1.

Two further simplifications make TSIV especially easy to use. First, as noted previously, when the instruments consist of a full set of mutually exclusive dummy variables, as in Angrist (1990) and Angrist and Krueger (1992), the second moment equation, (4.3.1), simplifies to a model for conditional means. In particular, the 2SLS minimand for the two-sample problem becomes

J_N(g) = Σ_j ω_j (ȳ1j − g'W̄2j)²,   (4.3.2)

where ȳ1j is the mean of the dependent variable at instrument/covariate value j in one sample, W̄2j is the mean of endogenous variables and covariates at instrument/covariate value j in a second sample, and ω_j is an appropriate weight. This amounts to weighted least squares estimation of the VIV equation, except that the dependent and independent variables do not come from the same sample. Again, Angrist (1990) and Angrist and Krueger (1992) provide illustrations. The optimal weights for asymptotically efficient TSIV are given by the variance of ȳ1j − g'W̄2j. This variance is affected by the fact that the moments come from different samples, as are the TSIV standard errors, which are easy to compute in the dummy-instrument case since the estimator is equivalent to weighted least squares.
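The grouped-data version can be sketched in a few lines: compute cell means of the outcome in one sample and of the endogenous dummy in a second, independent sample, then run weighted least squares of the first on the second. The simulated design below, the cell probabilities, and the use of simple sample-1 cell counts as weights (rather than the efficient variance weights described above) are all illustrative assumptions.

```python
import numpy as np

# Simulated grouped data: the DGP, cell probabilities, and sizes are illustrative.
rng = np.random.default_rng(1)
true_g = 1.5  # causal effect of the endogenous dummy

def draw(n):
    """Draw (instrument cell, endogenous dummy, outcome) from one population."""
    j = rng.integers(0, 4, size=n)                       # 4 mutually exclusive cells
    w = (rng.random(n) < 0.2 + 0.15 * j).astype(float)   # first stage: P(w=1) varies by cell
    y = true_g * w + rng.normal(size=n)
    return j, w, y

j1, _, y1 = draw(20000)  # sample 1: outcome and instrument only
j2, w2, _ = draw(20000)  # sample 2: endogenous variable and instrument only

ybar = np.array([y1[j1 == c].mean() for c in range(4)])  # cell means of y, sample 1
wbar = np.array([w2[j2 == c].mean() for c in range(4)])  # cell means of w, sample 2

# Weighted least squares of ybar on (1, wbar), weighting by sample-1 cell counts.
n_cell = np.array([(j1 == c).sum() for c in range(4)], dtype=float)
X = np.column_stack([np.ones(4), wbar])
XtOmega = X.T * n_cell
coef = np.linalg.solve(XtOmega @ X, XtOmega @ ybar)
print(coef[1])  # close to true_g
```

Note that the dependent and independent variables in the final regression come from different samples, which is exactly the two-sample feature of (4.3.2).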

Second, Angrist and Krueger (1995) introduced a computationally attractive TSIV-type estimator that requires no matrix manipulation and can be implemented with ordinary regression software. This estimator, called Split-Sample IV (SSIV), works as follows. The first-stage estimates in data set 2 are given by (Z2'Z2)⁻¹Z2'W2. These fitted values can be carried over to data set 1 by constructing the cross-sample fitted value, W12 = Z1(Z2'Z2)⁻¹Z2'W2. The SSIV second stage is a regression of Y1 on W12. The correct limiting distribution for this estimator is derived in Inoue and Solon (2005), who show that the limiting distribution presented in Angrist and Krueger (1995) requires the assumption that Z1'Z1 = Z2'Z2 (as would be true if the marginal distribution of the instruments and covariates is fixed in repeated samples). It's worth noting, however, that the limiting distributions of SSIV and 2SLS are the same when the coefficient on the endogenous variable is zero. The standard errors for this special case are simple to construct and probably provide a reasonably good approximation to the general case.
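The SSIV recipe sketched above translates directly into code: estimate the first stage in data set 2, carry the fitted values over to data set 1, and run the second stage there. The simulated two-sample design below (sizes, coefficients, and variable names) is an illustrative assumption.

```python
import numpy as np

# Simulated two-sample setting; the DGP and sample sizes are illustrative.
rng = np.random.default_rng(2)

def draw(n):
    z = rng.normal(size=(n, 1))
    u = rng.normal(size=(n, 1))                      # unobserved confounder
    w = 0.7 * z + u + rng.normal(size=(n, 1))        # endogenous regressor
    y = 1.0 * w + 2.0 * u + rng.normal(size=(n, 1))  # true coefficient is 1.0
    Z = np.hstack([np.ones((n, 1)), z])
    W = np.hstack([np.ones((n, 1)), w])
    return Z, W, y

Z1, _, y1 = draw(10000)  # data set 1: outcomes and instruments
Z2, W2, _ = draw(10000)  # data set 2: endogenous variable and instruments

# First stage estimated in data set 2: (Z2'Z2)^{-1} Z2'W2 ...
pi_hat = np.linalg.solve(Z2.T @ Z2, Z2.T @ W2)
# ... with fitted values carried over to data set 1: W12 = Z1 (Z2'Z2)^{-1} Z2'W2.
W12 = Z1 @ pi_hat

# SSIV second stage: ordinary regression of y1 on the cross-sample fitted values.
beta_ssiv = np.linalg.solve(W12.T @ W12, W12.T @ y1)
print(beta_ssiv[1, 0])  # close to the true coefficient
```

The second stage is just OLS on constructed regressors, which is why SSIV needs only ordinary regression software once W12 is in hand.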