Selection Bias
Selection bias occurs when your sample is truncated and the cause of that truncation is correlated with your dependent variable. Ignoring the correlation, the model could be estimated using least squares or truncated least squares. In either case, obtaining consistent estimates of the regression parameters is not possible. This section presents the basic features of the model.
Consider a model consisting of two equations. The first is the selection equation, defined as

$$z_i^* = \gamma_1 + \gamma_2 w_i + u_i, \quad i = 1, \ldots, N$$

where $z_i^*$ is a latent variable, $\gamma_1$ and $\gamma_2$ are parameters, $w_i$ is an explanatory variable, and $u_i$ is a random disturbance. A latent variable is unobservable, but we do observe the dichotomous variable

$$z_i = \begin{cases} 1 & \text{if } z_i^* > 0 \\ 0 & \text{otherwise} \end{cases}$$

The second equation, called the regression equation, is the linear model of interest. It is

$$y_i = \beta_1 + \beta_2 x_i + e_i, \quad i = 1, \ldots, n, \quad N > n$$

A selectivity problem arises when $y_i$ is observed only when $z_i = 1$ and $\rho \neq 0$, where $\rho$ is the correlation between the errors $u_i$ and $e_i$. In this case the ordinary least squares estimator of $\beta$ in the regression equation (16.26) is biased and inconsistent. A consistent estimator was suggested by Heckman (1979) and is commonly referred to as Heckman's two-step estimator or, more simply, Heckit. Because the errors are assumed to be normally distributed, there is also a maximum likelihood estimator of the parameters. Gretl includes routines for both.
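The two-step (Heckit) estimator is based on the conditional mean of $y_i$ given that it is observed:

$$E[y_i \mid z_i^* > 0] = \beta_1 + \beta_2 x_i + \beta_\lambda \lambda_i \qquad (16.28)$$

where

$$\lambda_i = \frac{\phi(\gamma_1 + \gamma_2 w_i)}{\Phi(\gamma_1 + \gamma_2 w_i)} \qquad (16.29)$$

is the inverse Mills ratio; $(\gamma_1 + \gamma_2 w_i)$ is the index function; $\phi(\cdot)$ is the standard normal probability density function evaluated at the index; and $\Phi(\cdot)$ is the standard normal cumulative distribution function evaluated at the index. Adding a random disturbance yields:

$$y_i = \beta_1 + \beta_2 x_i + \beta_\lambda \lambda_i + v_i \qquad (16.30)$$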
It can be shown that (16.30) is heteroskedastic and if $\lambda_i$ were known (and nonstochastic), then the selectivity corrected model (16.30) could be estimated by generalized least squares. Alternately, the heteroskedastic model (16.30) could be estimated by ordinary least squares, using White's heteroskedasticity consistent covariance estimator (HCCME) for hypothesis testing and the construction of confidence intervals. Unfortunately, $\lambda_i$ is not known and must be estimated using the sample. The stochastic nature of $\lambda_i$ in (16.30) makes the automatic use of HCCME in this context inappropriate.
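The form of the conditional mean in (16.28) can be motivated with a short derivation. This is only a sketch under the standard Heckman assumptions, which are not spelled out above and are therefore assumptions here: $(u_i, e_i)$ are bivariate normal with $\mathrm{var}(u_i) = 1$, $\mathrm{var}(e_i) = \sigma_e^2$, and correlation $\rho$.

```latex
\begin{aligned}
E[y_i \mid z_i^* > 0]
  &= \beta_1 + \beta_2 x_i + E[e_i \mid u_i > -(\gamma_1 + \gamma_2 w_i)] \\
  &= \beta_1 + \beta_2 x_i + \rho\,\sigma_e\,
     \frac{\phi(\gamma_1 + \gamma_2 w_i)}{\Phi(\gamma_1 + \gamma_2 w_i)} \\
  &= \beta_1 + \beta_2 x_i + \beta_\lambda \lambda_i,
     \qquad \beta_\lambda = \rho\,\sigma_e .
\end{aligned}
```

Under these assumptions $\beta_\lambda = 0$ exactly when $\rho = 0$, so a test of the coefficient on the inverse Mills ratio doubles as a test for selection bias.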
The two steps of the Heckit estimator are:
1. Estimate the selection equation to obtain $\hat{\gamma}_1$ and $\hat{\gamma}_2$. Use these in equation (16.29) to estimate the inverse Mills ratio, $\hat{\lambda}_i$.
2. Add $\hat{\lambda}_i$ to the regression model as in equation (16.30) and estimate it using least squares.
This ignores the problem of properly estimating the standard errors, which requires an additional step. Gretl takes care of this automatically when you use the heckit command.
The example from POE4 uses the mroz.gdt data. The first thing we'll do is estimate the model ignoring selection bias, using least squares on the nonzero observations. Load the data and generate the natural logarithm of wages. Since wages are zero for a portion of the sample, gretl will generate an error when you take the natural logs. You can safely ignore it, as gretl will simply create missing values for the observations that cannot be transformed. Then use the ols command to estimate a linear regression on the truncated subset.
    open "@gretldir\data\poe\mroz.gdt"
    logs wage
    ols l_wage const educ exper
The results are:
    Model 1: OLS estimates using the 428 observations 1-428
    Dependent variable: l_wage

               Coefficient    Std. Error    t-ratio    p-value
    const      -0.400174      0.190368      -2.1021    0.0361
    educ        0.109489      0.0141672      7.7283    0.0000
    exper       0.0156736     0.00401907     3.8998    0.0001
Notice that the sample has been truncated to include only the 428 observations for which wages are actually observed, that is, the women who worked. The estimated return to education is about 11%, and the estimated coefficients of both education and experience are statistically significant.
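If you would rather not rely on missing values to drop the nonworking women, the estimation sample can be restricted explicitly before taking logs. This is a minimal sketch, not the method used in the text. It assumes that the participation indicator lfp equals 1 exactly for the women with positive wages (as the 428 observations suggest) and that the data file sits in the POE data directory of your gretl installation.

```hansl
# path is an assumption: adjust to where the POE4 data are installed
open "@gretldir\data\poe\mroz.gdt"
# keep only the women in the labor force, so every wage is positive
smpl lfp == 1 --restrict
logs wage
ols l_wage const educ exper
# restore the full sample of 753 observations for later use
smpl --full
```

Because l_wage is created while the sample is restricted, it is missing for the excluded observations once the full sample is restored, which is what the selection model needs.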
The Heckit procedure takes into account that the decision to work for pay may be correlated with the wage a person earns. It starts by modeling the decision to work and estimating the resulting selection equation using a probit model. The model can contain more than one explanatory variable, $w_i$, and in this example we have four: a woman's age, her years of education, a dummy variable for whether she has children, and the marginal tax rate that she would pay on earnings if employed. Generate a new variable kids, a dummy variable indicating the presence of any children in the household. Estimate the probit model, generate the index function, and use it to compute the inverse Mills ratio. Finally, estimate the regression including the inverse Mills ratio (IMR) as an explanatory variable.
    open "@gretldir\data\poe\mroz.gdt"
    # dummy: 1 if the household has any children
    series kids = (kidsl6+kids618>0)
    logs wage
    list X = const educ exper
    list W = const mtr age kids educ
    # step 1: probit for the decision to work
    probit lfp W
    # index function built from the probit coefficients
    series ind = $coeff(const) + $coeff(age)*age + \
      $coeff(educ)*educ + $coeff(kids)*kids + $coeff(mtr)*mtr
    # inverse Mills ratio
    series lambda = dnorm(ind)/cnorm(ind)
    # step 2: least squares with the IMR added as a regressor
    ols l_wage X lambda
The variables for the regression are put into the list X and those for the selection equation are put into W. The dnorm and cnorm functions return the standard normal density and the standard normal cumulative distribution function evaluated at the argument, respectively. The results are:
    OLS estimates using the 428 observations 1-428
    Dependent variable: l_wage

               Coefficient    Std. Error    t-ratio    p-value
    const       0.810542      0.494472       1.6392    0.1019
    educ        0.0584579     0.0238495      2.4511    0.0146
    exper       0.0163202     0.00399836     4.0817    0.0001
    lambda     -0.866439      0.326986      -2.6498    0.0084

    Mean dependent var    1.190173    S.D. dependent var    0.723198
    Sum squared resid     187.0967    S.E. of regression    0.664278
    R-squared             0.162231    Adjusted R-squared    0.156304
    F(3, 424)             27.36878    P-value(F)            3.38e-16
    Log-likelihood       -430.2212    Akaike criterion      868.4424
    Schwarz criterion     884.6789    Hannan-Quinn          874.8550
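The index and inverse Mills ratio built by hand in the script above can also be computed more compactly. The following is a sketch rather than the author's method. It assumes the data are already loaded and that kids, lambda and the list W exist as in that script, and it uses gretl's lincomb and invmills functions together with the $xlist and $coeff accessors. Because invmills(x) returns $\phi(x)/(1 - \Phi(x))$, the ratio $\phi(ind)/\Phi(ind)$ used in this section corresponds to invmills(-ind). The names ind2 and lambda2 are hypothetical, chosen so that the series created earlier are not overwritten.

```hansl
# assumes mroz.gdt is open and that kids, lambda and list W already exist
probit lfp W --quiet
# index function: linear combination of the probit regressors
series ind2 = lincomb($xlist, $coeff)
# phi(ind2)/Phi(ind2), written via invmills(x) = phi(x)/(1 - Phi(x))
series lambda2 = invmills(-ind2)
# the two versions of the inverse Mills ratio should agree closely
summary lambda lambda2
```

This avoids retyping every coefficient when the selection equation changes, since the index is rebuilt from whatever regressors the probit used.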
Notice that the estimated coefficient of the inverse Mills ratio is statistically significant, implying that there is selection bias in the least squares estimator. Also, the estimated return to education has fallen from approximately 11% (which is inconsistently estimated) to approximately 6%. Unfortunately, the usual standard errors do not account for the fact that the inverse Mills ratio is itself an estimated value, so they are not technically correct. To obtain the correct standard errors, use gretl's built-in heckit command.
The heckit command syntax is
    heckit

    Arguments:  depvar indepvars ; selection equation
    Options:    --quiet (suppress printing of results)
                --robust (QML standard errors)
                --two-step (perform two-step estimation)
                --vcv (print covariance matrix)
                --verbose (print extra output)
    Example:    heckit y 0 x1 x2 ; ys 0 x3 x4
                See also heckit.inp
In terms of our example the generic syntax will be
    heckit y const x2 x3 ... xk ; z const w2 w3 ... ws --two-step

where const x2 ... xk are the k independent variables for the regression and const w2 ... ws are the s independent variables for the selection equation. In this example, we've used the --two-step option, which mimics the manual procedure employed above but returns the correct standard errors.
    heckit l_wage X ; lfp W --two-step
Again, we've used the lists created earlier with the list command, which put the independent variables for the regression into X and the variables for the selection equation into W.
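Since gretl's heckit command estimates by maximum likelihood unless the two-step option is given, the maximum likelihood estimator mentioned earlier is obtained by dropping that option. A minimal sketch, assuming the series l_wage and the lists X and W already exist from the scripts above:

```hansl
# maximum likelihood Heckit (gretl's default when --two-step is omitted)
heckit l_wage X ; lfp W
# two-step version, for comparison with the manual procedure
heckit l_wage X ; lfp W --two-step
```

The two sets of estimates will generally differ somewhat. The discussion that follows refers to the two-step results.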
The results appear below in Table 16.3. Notice that in this model, the return to another year of schooling is about 5.8%. The parameter on the inverse Mills ratio is significant, which is evidence of selection bias.
To use the pull-down menus, select Model>Nonlinear models>Heckit from gretl's main window. This will reveal the dialog shown in figure 16.5. Enter l_wage as the dependent variable and the indicator variable lfp as the selection variable. Then enter the desired independent variables for the regression and selection equations. Finally, select the 2-step estimation button at the bottom of the dialog box and click OK.
You will notice that the coefficient estimates are identical to the ones produced manually above. However, the standard errors, which are now consistently estimated, have changed. The t-ratio of the coefficient on the inverse Mills ratio, $\hat{\lambda}$, has fallen in absolute value to 2.17, but it is still significant at the 5% level. Gretl also produces the estimates of the selection equation, which appear directly below those for the regression.

    Mean dependent var    1.190173    S.D. dependent var    0.723198
    $\hat{\sigma}$        0.932559    $\hat{\rho}$         -0.929098

    Total observations: 753
    Censored observations: 325 (43.2%)

Table 16.3: Two-step Heckit results.
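As a quick consistency check, read the two values reported beneath Table 16.3 (0.932559 and 0.929098) as the estimates $\hat{\sigma}_e$ and $\hat{\rho}$. This is an interpretation, since the labels are not spelled out here. In the selection model the coefficient on the inverse Mills ratio satisfies $\beta_\lambda = \rho\,\sigma_e$, so the Heckit estimate should equal their product. In absolute value:

```latex
|\hat{\rho}|\,\hat{\sigma}_e = 0.929098 \times 0.932559 \approx 0.8664
```

This agrees with the magnitude of the estimated coefficient on lambda, 0.866439.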