# Selection Bias

Selection bias occurs when your sample is truncated and the cause of that truncation is correlated with your dependent variable. If this correlation is ignored, the model could be estimated using least squares or truncated least squares, but in either case the resulting estimates of the regression parameters are inconsistent. In this section the basic features of this model will be presented.

Consider a model consisting of two equations. The first is the selection equation, defined

zᵢ* = γ₁ + γ₂wᵢ + uᵢ,  i = 1, …, N

where zᵢ* is a latent variable, γ₁ and γ₂ are parameters, wᵢ is an explanatory variable, and uᵢ is a random disturbance. A latent variable is unobservable, but we do observe the dichotomous variable

zᵢ = 1 if zᵢ* > 0, and zᵢ = 0 otherwise

The second equation, called the regression equation, is the linear model of interest. It is

yᵢ = β₁ + β₂xᵢ + eᵢ,  i = 1, …, n,  N > n (16.26)

A selectivity problem arises when yᵢ is observed only when zᵢ = 1 and the disturbances uᵢ and eᵢ are correlated (ρ ≠ 0). In this case the ordinary least squares estimator of β₂ in (16.26) is biased and inconsistent. A consistent estimator was suggested by Heckman (1979) and is commonly referred to as Heckman’s two-step estimator, or more simply, Heckit. Because the errors are assumed to be normally distributed, there is also a maximum likelihood estimator of the parameters. Gretl includes routines for both.

The two-step (Heckit) estimator is based on the conditional mean of yᵢ given that it is observed:

E[yᵢ | zᵢ* > 0] = β₁ + β₂xᵢ + βλλᵢ (16.28)

where

λᵢ = φ(γ₁ + γ₂wᵢ) / Φ(γ₁ + γ₂wᵢ) (16.29)

is the inverse Mills ratio; (γ₁ + γ₂wᵢ) is the index function; φ(·) is the standard normal probability density function evaluated at the index; and Φ(·) is the standard normal cumulative distribution function evaluated at the index. Adding a random disturbance yields:

yᵢ = β₁ + β₂xᵢ + βλλᵢ + vᵢ (16.30)

It can be shown that (16.30) is heteroskedastic, and if λᵢ were known (and nonstochastic), then the selectivity-corrected model (16.30) could be estimated by generalized least squares. Alternatively, the heteroskedastic model (16.30) could be estimated by ordinary least squares, using White’s heteroskedasticity-consistent covariance estimator (HCCME) for hypothesis testing and the construction of confidence intervals. Unfortunately, λᵢ is not known and must be estimated from the sample. The fact that λᵢ must itself be estimated makes the automatic use of the HCCME in this context inappropriate.
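Numerically, the inverse Mills ratio is just the standard normal pdf divided by the cdf, evaluated at the index. A quick sketch in Python (using scipy; the index values below are illustrative only, not taken from any data set):

```python
from scipy.stats import norm

def inv_mills(index):
    """Inverse Mills ratio: phi(index) / Phi(index)."""
    return norm.pdf(index) / norm.cdf(index)

# The correction shrinks toward zero as the index grows:
# observations that are very likely to be selected need
# almost no adjustment.
for idx in (-1.0, 0.0, 1.0, 2.0):
    print(f"index = {idx:+.1f}  lambda = {inv_mills(idx):.4f}")
```

At an index of zero the ratio is φ(0)/Φ(0) = 0.3989/0.5 ≈ 0.80, and it declines monotonically as the index rises.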

The two steps of the Heckit estimator consist of:

1. Estimate the selection equation by probit to obtain γ̂₁ and γ̂₂. Use these in equation (16.29) to estimate the inverse Mills ratio, λ̂ᵢ.

2. Add λ̂ᵢ to the regression model as in equation (16.30) and estimate it using least squares.

This ignores the problem of properly estimating the standard errors, which requires an additional step. Gretl takes care of this automatically when you use the heckit command.
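The mechanics of the two steps can be sketched outside of gretl as well. The Python fragment below is only an illustration on simulated data (the simulated design and all variable names are invented here, and the probit is fit by a bare-bones maximum likelihood routine, not gretl's):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(42)
n = 2000

# Simulate a selection model whose two errors are correlated
# (rho != 0), with the regressor x also driving selection.
x = rng.normal(size=n)
w = rng.normal(size=n)
rho = 0.8
u = rng.normal(size=n)
e = rho * u + np.sqrt(1 - rho**2) * rng.normal(size=n)
sel = (0.5 + 1.0 * w + 1.0 * x + u) > 0   # z_i = 1 when index + u > 0
y = 1.0 + 0.5 * x + e                     # true slope beta2 = 0.5

# Step 1: probit of the selection indicator to get gamma-hat.
Z = np.column_stack([np.ones(n), w, x])
def neg_loglik(g):
    idx = Z @ g
    return -np.where(sel, norm.logcdf(idx), norm.logcdf(-idx)).sum()
gamma_hat = minimize(neg_loglik, np.zeros(3)).x

# Inverse Mills ratio, evaluated for the selected observations only.
idx_sel = (Z @ gamma_hat)[sel]
lam = norm.pdf(idx_sel) / norm.cdf(idx_sel)

# Step 2: least squares of y on x and lambda, selected sample only.
X2 = np.column_stack([np.ones(sel.sum()), x[sel], lam])
beta_heckit, *_ = np.linalg.lstsq(X2, y[sel], rcond=None)

# Naive OLS on the truncated sample, for comparison.
X1 = np.column_stack([np.ones(sel.sum()), x[sel]])
beta_naive, *_ = np.linalg.lstsq(X1, y[sel], rcond=None)

print("two-step slope:", beta_heckit[1])  # near the true 0.5
print("naive slope:   ", beta_naive[1])   # biased away from 0.5
```

The second-step slope recovers the truth while the naive truncated-sample OLS does not; this mirrors what happens with the wage example that follows.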

The example from POE4 uses the mroz.gdt data. The first thing we’ll do is estimate the model, ignoring selection bias, using least squares on the nonzero observations. Load the data and generate the natural logarithm of wages. Since wages are zero for a portion of the sample, gretl will generate an error when you take the natural logs. You can safely ignore it: gretl simply creates missing values for the observations that cannot be transformed. Then use the ols command to estimate a linear regression on the truncated subset.

    open "@gretldir\data\poe\mroz.gdt"
    logs wage
    ols l_wage const educ exper

The results are:

    Model 1: OLS estimates using the 428 observations 1-428
    Dependent variable: l_wage

                 Coefficient   Std. Error   t-ratio   p-value
      const      -0.400174     0.190368     -2.1021   0.0361
      educ        0.109489     0.0141672     7.7283   0.0000
      exper       0.0156736    0.00401907    3.8998   0.0001

Notice that the sample has been truncated to include only 428 observations for which hours worked are actually observed. The estimated return to education is about 11%, and the estimated coefficients of both education and experience are statistically significant.
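As a side note on the 11% figure: in a log-wage regression the coefficient on educ is (approximately) the proportional change in wages per additional year of schooling. A quick check of the arithmetic in Python:

```python
import math

b_educ = 0.109489  # OLS coefficient on educ from the truncated sample

# The coefficient itself is the approximate proportional return;
# the exact figure implied by the log specification is exp(b) - 1.
approx = 100 * b_educ
exact = 100 * (math.exp(b_educ) - 1)
print(f"approximate return: {approx:.1f}% per year")
print(f"exact return:       {exact:.1f}% per year")
```

Both calculations round to roughly 11-12%, consistent with the text.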

The Heckit procedure takes into account that the decision to work for pay may be correlated with the wage a person earns. It starts by modeling the decision to work and estimating the resulting selection equation using a probit model. The selection equation can contain more than one explanatory variable, wᵢ, and in this example we have four: a woman’s age, her years of education, a dummy variable for whether she has children, and the marginal tax rate that she would pay on earnings if employed. Generate a new variable, kids, which is a dummy variable indicating the presence of any kids in the household. Estimate the probit model, generate the index function, and use it to compute the inverse Mills ratio variable. Finally, estimate the regression including the IMR as an explanatory variable.

    open "@gretldir\data\poe\mroz.gdt"
    logs wage
    series kids = (kidsl6 + kids618 > 0)
    list X = const educ exper
    list W = const mtr age kids educ
    probit lfp W
    series ind = $coeff(const) + $coeff(age)*age + \
        $coeff(educ)*educ + $coeff(kids)*kids + $coeff(mtr)*mtr
    series lambda = dnorm(ind)/cnorm(ind)
    ols l_wage X lambda

The variables for the regression are put into the list X and those for the selection equation are put into W. The dnorm and cnorm functions return the standard normal density and the standard normal cumulative distribution function evaluated at the argument, respectively. The results are:

    OLS estimates using the 428 observations 1-428
    Dependent variable: l_wage

                 Coefficient   Std. Error   t-ratio   p-value
      const       0.810542     0.494472      1.6392   0.1019
      educ        0.0584579    0.0238495     2.4511   0.0146
      exper       0.0163202    0.00399836    4.0817   0.0001
      lambda     -0.866439     0.326986     -2.6498   0.0084

    Mean dependent var    1.190173   S.D. dependent var   0.723198
    Sum squared resid    187.0967    S.E. of regression   0.664278
    F(3, 424)             27.36878   P-value(F)           3.38e-16
    Log-likelihood      -430.2212    Akaike criterion     868.4424
    Schwarz criterion    884.6789    Hannan-Quinn         874.8550

Notice that the estimated coefficient of the inverse Mills ratio is statistically significant, implying that there is a selection bias in the least squares estimator. Also, the estimated return to education has fallen from approximately 11% (which is inconsistently estimated) to approximately 6%. Unfortunately, the usual standard errors do not account for the fact that the inverse Mills ratio is itself an estimated value, so they are not technically correct. To obtain the correct standard errors, you will use gretl’s built-in heckit command.

The heckit command syntax is

    heckit depvar indepvars ; selection equation

    Options:
      --quiet      (suppress printing of results)
      --robust     (QML standard errors)
      --two-step   (perform two-step estimation)
      --vcv        (print covariance matrix)
      --verbose    (print extra output)

    Example:
      heckit y 0 x1 x2 ; ys 0 x3 x4

    See also heckit.inp

In terms of our example the generic syntax will be

    heckit y const x2 x3 … xk ; z const w2 w3 … ws --two-step

where const x2 … xk are the k independent variables for the regression and const w2 … ws are the s independent variables for the selection equation. In this example, we’ve used the --two-step option, which mimics the manual procedure employed above but returns the correct standard errors.

    heckit l_wage X ; lfp W --two-step

Again, we’ve used the lists created earlier, which put the independent variables for the regression into X and the variables for the selection equation into W.

The results appear below in Table 16.3. Notice that in this model, the return to another year of schooling is about 5.8%. The parameter on the inverse Mills ratio is significant, which is evidence of selection bias.

To use the pull-down menus, select Model>Nonlinear models>Heckit from gretl’s main window. This will reveal the dialog shown in figure 16.5. Enter l_wage as the dependent variable and the indicator variable lfp as the selection variable. Then enter the desired independent variables for the regression and selection equations. Finally, select the 2-step estimation button at the bottom of the dialog box and click OK.

You will notice that the coefficient estimates are identical to the ones produced manually above. However, the standard errors, which are now consistently estimated, have changed. The t-ratio

    Two-step Heckit estimates using the 428 observations 1-428
    Dependent variable: l_wage
    Selection variable: lfp

                 Coefficient   Std. Error    z-stat    p-value
      const       0.810542     0.610798      1.3270    0.1845
      educ        0.0584579    0.0296354     1.9726    0.0485
      exper       0.0163202    0.00420215    3.8838    0.0001
      lambda     -0.866439     0.399284     -2.1700    0.0300

    Selection equation

      const       1.19230      0.720544      1.6547    0.0980
      mtr        -1.39385      0.616575     -2.2606    0.0238
      age        -0.0206155    0.00704470   -2.9264    0.0034
      kids       -0.313885     0.123711     -2.5372    0.0112
      educ        0.0837753    0.0232050     3.6102    0.0003

    Mean dependent var   1.190173   S.D. dependent var   0.723198
    sigma                0.932559   rho                 -0.929098

    Total observations: 753
    Censored observations: 325 (43.2%)

Table 16.3: Two-step Heckit results.

of the coefficient on the inverse Mills ratio, λ, has fallen to −2.17, but it is still significant at the 5% level. Gretl also produces the estimates of the selection equation, which appear directly below those for the regression.
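As a consistency check on Table 16.3: in the Heckman model the coefficient on the inverse Mills ratio equals ρσₑ, so the reported σ̂ and ρ̂ should reproduce the lambda coefficient (values copied from the table above):

```python
sigma_hat = 0.932559     # sigma reported by heckit
rho_hat = -0.929098      # rho reported by heckit
beta_lambda = -0.866439  # coefficient on lambda in Table 16.3

# beta_lambda = rho * sigma_e in the Heckman two-step setup
implied = rho_hat * sigma_hat
print(f"implied lambda coefficient: {implied:.6f}")  # -0.866439
```

The product matches the reported coefficient to six decimal places, which is a useful sanity check when reading heckit output.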