# Parametric Estimation

## 1.2 Two-stage estimation

Consider the estimation of the model (18.1). Assume that u and ε are jointly normally distributed with zero means and with the variance of ε normalized to unity, i.e. var(ε) = 1. The least squares procedure applied to the observed y and x will give inconsistent estimates if E(u | x, zγ > ε) is not zero and is correlated with x. This omitted selection-bias term needs to be corrected for consistent estimation (Heckman, 1979). With normally distributed disturbances, E(u | x, z, I = 1) = −σ₁ε φ(zγ)/Φ(zγ), where φ and Φ denote, respectively, the standard normal density and distribution functions, and σ₁ε is the covariance of u and ε. The bias-corrected regression equation is y = xβ − σ₁ε φ(zγ)/Φ(zγ) + η, where E(η | x, z, I = 1) = 0. A two-stage method can be applied to estimate the corrected equation for β (Heckman, 1979). In the first stage, γ is estimated by the probit maximum likelihood method. The least squares method can then be applied to estimate β and σ₁ε in

y = xβ − σ₁ε φ(zγ̂)/Φ(zγ̂) + η,    (18.7)

with the observed subsample corresponding to I = 1, where γ̂ is the probit maximum likelihood estimate of γ. The estimator is consistent, but the asymptotic distribution of the two-stage estimator is not the conventional one of a linear regression model. The disturbances η and, hence, the regression residuals are heteroskedastic. The estimated bias-correction term is a generated regressor: the replacement of γ by the estimate γ̂ introduces additional errors, which are correlated across different sample units. By taking into account the heteroskedasticity of the disturbances but ignoring the randomness of γ̂ in equation (18.7), the constructed asymptotic variance matrix of the two-stage estimates β̂ and σ̂₁ε will underestimate the correct one unless σ₁ε = 0.
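As a minimal sketch of the two-stage procedure, the following simulates data from a selection model and then runs a probit first stage (fitted here by Fisher scoring, one standard way to compute the probit MLE) followed by least squares on the selected subsample with the estimated correction term −φ(zγ̂)/Φ(zγ̂) added as a regressor. All variable names, parameter values, and the exclusion restriction (w enters z but not x) are hypothetical choices for illustration.

```python
import numpy as np
from math import erf

rng = np.random.default_rng(0)
n = 20_000

# exclusion restriction: w enters the selection index z but not the outcome x
x1 = rng.normal(size=n)
w = rng.normal(size=n)
eps = rng.normal(size=n)                        # selection disturbance, var(eps) = 1
u = 0.8 * eps + rng.normal(scale=0.6, size=n)   # outcome disturbance, sigma_1eps = 0.8

Z = np.column_stack([np.ones(n), x1, w])
X = np.column_stack([np.ones(n), x1])
gamma_true = np.array([0.5, 1.0, 1.0])
beta_true = np.array([1.0, 2.0])

I = (Z @ gamma_true > eps).astype(float)        # I = 1 iff z*gamma > eps
y = X @ beta_true + u                           # outcome, observed only when I = 1

Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / np.sqrt(2.0))))  # normal cdf
phi = lambda t: np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)         # normal density

# first stage: probit MLE for gamma by Fisher scoring
gamma = np.zeros(3)
for _ in range(25):
    zg = Z @ gamma
    P = np.clip(Phi(zg), 1e-10, 1 - 1e-10)
    p = phi(zg)
    score = Z.T @ ((I - P) * p / (P * (1 - P)))
    W = p**2 / (P * (1 - P))
    gamma = gamma + np.linalg.solve(Z.T @ (W[:, None] * Z), score)

# second stage: OLS of y on x and the bias-correction term -phi/Phi,
# using only the selected subsample (I = 1)
zg = Z @ gamma
mills = -phi(zg) / Phi(zg)                      # E(eps | zg > eps) = -phi/Phi
sel = I == 1
R = np.column_stack([X[sel], mills[sel]])
coef, *_ = np.linalg.lstsq(R, y[sel], rcond=None)
beta_hat, sigma1e_hat = coef[:2], coef[2]
print(beta_hat, sigma1e_hat)                    # near (1.0, 2.0) and 0.8
```

The coefficient on the correction regressor estimates σ₁ε; as noted above, the conventional OLS standard errors from this second stage are not valid because η is heteroskedastic and γ̂ is a generated first-stage estimate.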

For the two-sector model (18.2)-(18.3), the expected observable outcome equation for y₁* is E(y₁ | x, I = 1) = x₁β₁ + σ₁ε(−φ(zγ)/Φ(zγ)), and the expected observable outcome for y₂* will be E(y₂ | x, I = 0) = x₂β₂ + σ₂ε(φ(zγ)/(1 − Φ(zγ))). For this model, each bias-corrected

equation can be either separately or jointly estimated by the two-stage method (Lee, 1978). As σ₁ε is the covariance of u₁ and ε and σ₂ε is the covariance of u₂ and ε, their signs may be of special interest for some empirical studies, since they determine the direction of selection bias. When σ₁ε is negative, the observed outcome y₁ is subject to positive selection, as σ₁ε(−φ(zγ)/Φ(zγ)) is then strictly positive. For example, in a study of the return to college education, if high school graduates with high unobserved ability are likely to go to college and that ability could increase earnings, one might expect the observed earnings of college graduates to be subject to positive selection. In some situations, negative selection might also be meaningful. For example, from the view of comparative advantage, what matters is the sign of σ₂ε − σ₁ε. As the measure of expected unobservable comparative advantage is E(u₁ − u₂ | x, I = 1) + E(u₂ − u₁ | x, I = 0) = (σ₂ε − σ₁ε)(φ(zγ)/Φ(zγ) + φ(zγ)/(1 − Φ(zγ))), an individual has comparative advantage in his or her chosen task when σ₂ε − σ₁ε is positive. The relevance of comparative advantage in self-selection has been explored in Lee (1978) and Willis and Rosen (1979). Heckman and Honoré (1990) provide an in-depth analysis of the implications of Roy's model.
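The sign argument can be checked numerically: the factor φ(zγ)/Φ(zγ) + φ(zγ)/(1 − Φ(zγ)) multiplying σ₂ε − σ₁ε is strictly positive at every value of the index, so the sign of the expected comparative advantage is that of σ₂ε − σ₁ε. A small sketch (the grid of index values is arbitrary):

```python
import numpy as np
from math import erf

Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / np.sqrt(2.0))))  # normal cdf
phi = lambda t: np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)         # normal density

zg = np.linspace(-3.0, 3.0, 121)       # arbitrary grid of index values z*gamma

# selection-bias terms of the two observed-outcome equations
bias1 = -phi(zg) / Phi(zg)             # enters E(y1 | x, I = 1), weighted by sigma_1eps
bias2 = phi(zg) / (1.0 - Phi(zg))      # enters E(y2 | x, I = 0), weighted by sigma_2eps

# expected unobservable comparative advantage, up to (sigma_2eps - sigma_1eps)
adv = phi(zg) / Phi(zg) + phi(zg) / (1.0 - Phi(zg))
print(adv.min())                       # strictly positive over the whole grid
```

Since bias1 is everywhere negative and bias2 everywhere positive, a negative σ₁ε (positive selection into sector 1) and a negative σ₂ε − σ₁ε (comparative disadvantage in the chosen task) are both visible directly from these terms.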

The normal distribution is a common distributional assumption for sample selection models. Regarding the simplicity of the two-stage estimation method, Olsen (1980) pointed out that the crucial property underlying the derivation of the bias correction is the linearity of the conditional expectation of u given ε. Based on that property, Olsen specified the potential outcome equation y* = xβ + λ(ε − με) + η, where με is the mean of ε, as the basic structure and suggested a linear probability modification to correct for the selection bias in observed outcomes. This modification is useful as it provides an alternative parametric specification that is not restricted to normal disturbances. The correction of selection bias now depends on the marginal distribution of ε. However, the selectivity-bias terms may be sensitive to the specific choice probability model. Lee (1982) suggested the use of nonlinear transformations to overcome possible restrictions in Olsen's approach, and proposed flexible functional forms and series expansions for the selection-bias correction. The dichotomous indicator I is determined by the selection decision such that I = 1 if and only if zγ > ε. Thus, for any strictly increasing transformation J, I = 1 if and only if J(zγ) > J(ε). The Olsen specification was generalized into y* = xβ + λ(J(ε) − μ_J) + η, where μ_J = E(J(ε)). The selection-bias term for the observed y is E(ε* | J(zγ) > ε*) = p(J(zγ))/F(zγ), where F is the distribution function of ε, ε* = J(ε), and p(J(zγ)) = ∫_{−∞}^{J(zγ)} ε* f_J(ε*) dε*, with f_J being the implied density function of ε*. Conditional on y being observed, the outcome equation becomes y = xβ + λ(p(J(zγ))/F(zγ) − μ_J) + η, which can be estimated by a simple two-stage method.

This approach generates a large class of models with selectivity, while the probability choice model can be chosen to be a specific popular choice model and remain unchanged. The choice among different possible Js is, in general, a regressor selection problem in a linear regression model.
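Both Olsen's linear-probability correction and its J-transformed generalization can be verified by simulation. The sketch below uses a uniform ε on [0, 1] and, purely for illustration, J(e) = e² (strictly increasing on [0, 1]); the truncated means match the analytic values zγ/2 and, for the transformed disturbance, (zγ)²/3. The index value c is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)
eps = rng.uniform(0.0, 1.0, size=200_000)   # uniform disturbance in the choice equation

# Olsen's linear-probability correction: E(eps | z*gamma > eps) = z*gamma / 2
c = 0.5                                     # an arbitrary value of the index z*gamma
linear_bias = eps[eps < c].mean()
print(linear_bias)                          # close to c / 2 = 0.25

# generalized correction with a strictly increasing J; J(e) = e**2 is illustrative
J = lambda e: e**2
# E(eps* | J(z*gamma) > eps*) with eps* = J(eps), i.e. p(J(z*gamma)) / F(z*gamma)
transformed_bias = J(eps[eps < c]).mean()
print(transformed_bias)                     # analytic value c**2 / 3 = 1/12
```

The linear case makes the collinearity problem discussed below transparent: zγ/2 is an exact linear function of z, so if z is contained in x the correction regressor adds no independent variation.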

The specification of the conditional expectation of u given J(ε) being linear can be further relaxed by introducing a flexible expansion of distributions (Lee, 1982). It was noted that if the bivariate distribution of u* and ε, where u* = u/σᵤ, can be represented by a bivariate Edgeworth expanded distribution, then the conditional expectation of u* given ε is E(u* | ε) = {ρε + Σ_{r≥3} [ρA₀ᵣH_{r+1}(ε)/r! + A_{1,r−1}H_{r−1}(ε)/(r − 1)!]}/D(ε), where D(ε) = 1 + Σ_{r≥3} A₀ᵣHᵣ(ε)/r!, the A_{rs} are functions of the cumulants of u* and ε, and Hᵣ(ε) is the rth-order Hermite polynomial. When the marginal distribution of ε is normal, D(ε) = 1. Bias correction can be based on the expanded conditional expectation. With a normal ε (or a normally transformed ε*) and expanded terms up to r + s = 4 (instead of an infinite series), the bias-corrected outcome equation is

y = xβ + ρσᵤ[−φ(zγ)/Φ(zγ)] + μ₁₂σᵤ[−(zγ)φ(zγ)/(2Φ(zγ))]
  + (μ₁₃ − 3ρ)σᵤ[(1 − (zγ)²)φ(zγ)/(6Φ(zγ))] + η.    (18.8)

The additional terms generalize the selection-bias correction from a normally distributed u to a flexible distribution. The two-stage estimation can be applied to equation (18.8). The correct variance matrix of the two-stage estimator must take into account both the heteroskedasticity of the disturbance η and the distribution of the first-stage estimator of γ. For the model in (18.8), the exact expression of the heteroskedastic variance of η would be too complicated to be useful for estimation. To overcome that complication, Lee (1982) suggests adopting White's robust variance formulation. White's correction estimates the first component of the correct variance matrix; the second component consists of the variances and covariances due to the first-stage estimate of γ. Equation (18.8) can also be used to test normality of the disturbance by checking whether the coefficients of the last two terms are zero. A test of the presence of selection bias is to check whether the coefficients of all (three) bias-correction terms are zero. In principle, it is possible to formulate the bias correction with more terms. However, with additional terms, one might quickly run into increasing multicollinearity in a regression framework. The proper selection of expanded terms is an issue of regressor selection, or a model selection problem. One may think it sensible to incorporate more expanded terms as the sample size increases. Such a strategy can be better justified in a semiparametric estimation framework (Newey, Powell, and Walker, 1990).
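The three correction regressors of equation (18.8) are simple functions of the index zγ. The sketch below constructs them on simulated data with γ treated as known for simplicity (in practice zγ̂ from a first-stage probit would be used); all parameter values are illustrative. Under jointly normal disturbances, the coefficient on the first term estimates ρσᵤ = σ₁ε, while the coefficients on the last two terms should be near zero, which is the basis of the normality check described above.

```python
import numpy as np
from math import erf

Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / np.sqrt(2.0))))  # normal cdf
phi = lambda t: np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)         # normal density

def bias_terms(zg):
    """The three selection-bias correction regressors of equation (18.8)."""
    P, p = Phi(zg), phi(zg)
    t1 = -p / P                           # normal-distribution (Mills ratio) term
    t2 = -zg * p / (2.0 * P)              # third-order expansion term
    t3 = (1.0 - zg**2) * p / (6.0 * P)    # fourth-order expansion term
    return t1, t2, t3

rng = np.random.default_rng(3)
n = 200_000
x1 = rng.normal(size=n)                   # outcome regressor
z1 = rng.normal(size=n)                   # selection regressor, excluded from x
zg = 0.5 + z1                             # index z*gamma, gamma treated as known
eps = rng.normal(size=n)
u = 0.7 * eps + rng.normal(scale=0.5, size=n)   # normal u, so sigma_1eps = 0.7
sel = zg > eps                            # I = 1 iff z*gamma > eps
y = 1.0 + 2.0 * x1 + u

t1, t2, t3 = bias_terms(zg)
R = np.column_stack([np.ones(n), x1, t1, t2, t3])[sel]
coef, *_ = np.linalg.lstsq(R, y[sel], rcond=None)
print(coef)   # coef[2] near 0.7; coef[3] and coef[4] near 0 under normality
```

A Wald test that the coefficients on t2 and t3 are jointly zero is the normality check; adding still higher-order terms is possible in the same way but, as noted above, quickly runs into multicollinearity among the correction regressors.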

For the estimation of the sample selection model, multicollinearity due to the addition of the bias-correction term in the observed outcome equation remains a serious point of contention. Olsen's specification with a linear probability choice equation highlighted the multicollinearity issue eloquently at an early development stage of the literature. The linear choice probability corresponds to a uniform distribution for ε. With ε being a uniform random variable on [0, 1], E(ε | zγ > ε) = zγ/2. If z = x or z is a subvector of x, the bias-correction term will be perfectly collinear with the included x of the outcome equation. Consequently, the two-stage estimation method will break down completely. This multicollinearity issue is related to model identification. With normal disturbances, the bias-correction term is a nonlinear function of zγ and, because of the nonlinearity, the two-stage method does not completely break down. However, severe multicollinearity might still be an issue in some samples. Nawata (1993) and Leung and Yu (1996) showed that φ(zγ)/Φ(zγ) is almost linear in zγ over a range of approximately [-3, 3]. The two-stage estimator would not be reliable under multicollinearity. Wales and Woodland (1980), Nelson (1984), Manning, Duan, and Rogers (1987), Nawata and Nagase (1996), and Leung and Yu (1996), among others, investigate this issue in several Monte Carlo studies. The overall conclusion from these Monte Carlo studies is that the effectiveness of the two-stage estimator depends either on exclusion restrictions, whereby some relevant variables in z do not appear in x, or on at least one of the relevant variables in z displaying sufficient variation to make the nonlinearity effective. In a distribution-free sample selection model, the exclusion restriction is a necessary condition for identification, as it is needed to rule out the linear probability setting of Olsen.
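The near-linearity of the Mills ratio is easy to reproduce: regressing φ(v)/Φ(v) on a constant and v over v ∈ [-3, 3] leaves little unexplained variation, so when zγ̂ stays in that range and z adds nothing beyond x, the correction term is close to a linear combination of regressors already in the outcome equation. A quick check (the grid is arbitrary):

```python
import numpy as np
from math import erf

Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / np.sqrt(2.0))))  # normal cdf
phi = lambda t: np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)         # normal density

v = np.linspace(-3.0, 3.0, 601)
mills = phi(v) / Phi(v)                   # inverse Mills ratio, the bias regressor

# least-squares fit of the Mills ratio on a constant and v
A = np.column_stack([np.ones_like(v), v])
coef, *_ = np.linalg.lstsq(A, mills, rcond=None)
resid = mills - A @ coef
r2 = 1.0 - (resid @ resid) / ((mills - mills.mean()) @ (mills - mills.mean()))
print(r2)   # most of the variation in the Mills ratio is linear in v on [-3, 3]
```

The high R² of this fit is why, absent an exclusion restriction or wide variation in zγ, the correction regressor is nearly redundant and the two-stage estimates become unstable.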

The possible multicollinearity of the two-stage estimation procedure created a debate in the health economics literature on whether a sample selection model is a better model than multi-part models for modeling discrete choices with outcome equations. A multi-part model essentially assumes uncorrelated disturbances among the outcome and choice equations (Manning et al., 1987; Maddala, 1985; Hay and Olsen, 1984; Leung and Yu, 1996). Hay and Olsen (1984) and Maddala (1985) point out that an important feature of the sample selection model is its use for investigating potential outcomes in addition to observed outcomes. For the normal distribution model, in the presence of severe multicollinearity, one has to resort to the method of maximum likelihood for better inference.