# Tobit

The tobit model is essentially a linear regression in which some observations on the dependent variable have been censored. A censored variable is one that, once it reaches a limit, is recorded at that limit no matter what its actual value might be. For instance, anyone earning \$1 million or more per year might be recorded in your dataset at the upper limit of \$1 million. That means that Bill Gates and the authors of your textbook earn the same amount in the eyes of your dataset (just kidding, gang). Least squares can be seriously biased in this case, and it is wise to use a censored regression model (tobit) to estimate the parameters of the regression when a portion of your sample is censored.
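To make the mechanics of censoring concrete, here is a minimal sketch in Python (the incomes are made up for illustration) of how top-coding records every value at the censoring limit:

```python
import numpy as np

# Hypothetical incomes; everything at or above the $1,000,000 cap is
# recorded at the cap, so the true values above it are unrecoverable.
income = np.array([45_000, 250_000, 1_000_000, 3_500_000, 80_000_000])
cap = 1_000_000
recorded = np.minimum(income, cap)
print(recorded.tolist())  # [45000, 250000, 1000000, 1000000, 1000000]
```

Any regression fit to `recorded` rather than `income` works with systematically distorted values of the dependent variable, which is the source of the bias discussed below.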

Hill et al. (2011) consider the following model of hours worked for a sample of women, given in equation (16.20).

hoursᵢ = β₁ + β₂educᵢ + β₃experᵢ + β₄ageᵢ + β₅kidsl6ᵢ + eᵢ    (16.20)

They estimate the model as a censored regression since a number of people in the sample are found to work zero hours. The command for censored regression in gretl is tobit, the syntax for which is shown below.

tobit

Arguments:  depvar indepvars

Options:    --llimit=lval (specify left bound)
            --rlimit=rval (specify right bound)
            --vcv (print covariance matrix)
            --robust (robust standard errors)
            --verbose (print details of iterations)

The routine allows you to specify the left and right points at which censoring occurs. You can also choose a covariance estimator that is robust with respect to the normality assumption used to obtain the MLE (not with respect to heteroskedasticity).
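To see what the tobit MLE is actually maximizing, here is a rough sketch (an illustration in Python, not gretl's actual implementation) of the left-censored tobit log-likelihood, coded and maximized directly on simulated data:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def tobit_negloglik(params, y, X, limit=0.0):
    """Negative log-likelihood for a left-censored tobit model."""
    *b, log_sigma = params
    sigma = np.exp(log_sigma)              # keeps sigma positive
    xb = X @ np.array(b)
    censored = y <= limit
    ll = np.where(censored,
                  norm.logcdf((limit - xb) / sigma),              # P(y* <= limit)
                  norm.logpdf((y - xb) / sigma) - np.log(sigma))  # uncensored density
    return -np.sum(ll)

# Toy data: y* = 1 + 2x + e, censored from below at zero
rng = np.random.default_rng(42)
x = rng.uniform(-2.0, 2.0, 500)
ystar = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, 500)
y = np.where(ystar > 0.0, ystar, 0.0)
X = np.column_stack([np.ones_like(x), x])
res = minimize(tobit_negloglik, x0=np.zeros(3), args=(y, X), method="BFGS")
b0, b1, sigma = res.x[0], res.x[1], np.exp(res.x[2])
```

Censored observations contribute the probability of falling at or below the limit, while uncensored ones contribute the usual normal density; gretl's tobit command handles this bookkeeping for you.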

Estimation of this model in gretl is shown in the following script which replicates the example from POE4. The script estimates a tobit model of hours worked and generates the marginal effect of another year of schooling on the average hours worked. Hours are assumed to be censored at zero and no lower limit need be specified.

1 open "@gretldir\data\poe\mroz.gdt"
2 list xvars = const educ exper age kidsl6
3 tobit hours xvars

The results from the basic tobit estimation of the hours worked equation are:

Tobit, using observations 1-753
Dependent variable: hours
Standard errors based on Hessian

              Coefficient   Std. Error       z       p-value
  const        1349.88       386.298       3.4944    0.0005
  educ           73.2910      20.4698      3.5804    0.0003
  age           -60.7678       6.88310    -8.8286    0.0000
  exper          80.5353       6.28051    12.8231    0.0000
  kidsl6       -918.918      111.588      -8.2349    0.0000

Chi-square(4)        244.2972    p-value             1.10e-51
Log-likelihood      -3827.143    Akaike criterion    7666.287
Schwarz criterion    7694.031    Hannan-Quinn        7676.975

σ̂ = 1133.7 (40.8769)
Left-censored observations: 325
Right-censored observations: 0

Test for normality of residual -
  Null hypothesis: error is normally distributed
  Test statistic: χ²(2) = 6.31679
  with p-value = 0.0424938
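The reported p-value of the normality test is easy to verify; a quick sketch using Python's scipy (rather than gretl):

```python
from scipy.stats import chi2

# Upper-tail p-value of the chi-square(2) statistic reported above
stat = 6.31679
p_value = chi2.sf(stat, df=2)
print(p_value)  # approximately 0.0425, matching the reported value
```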

The marginal effect of another year of schooling on hours worked is

∂E(hoursᵢ)/∂educᵢ = Φ(ĥoursᵢ)β̂₂,    (16.21)

where ĥoursᵢ is the estimated regression function evaluated at the mean levels of education, experience, and age for a person with one child under the age of six. Then, the cnorm function is used to compute the normal cdf, Φ(ĥoursᵢ), evaluated at the prediction.

1 matrix beta = $coeff
2 scalar H_hat = beta[1] + beta[2]*mean(educ) + beta[3]*mean(exper) \
3       + beta[4]*mean(age) + beta[5]*1
4 scalar z = cnorm(H_hat/$sigma)
5 scalar me_educ = z*$coeff(educ)
6 printf "\nThe computed scale factor = %6.5g\nand marginal effect of \
7       another year of schooling = %5.5g.\n", z, me_educ

This produces

The computed scale factor = 0.363

and marginal effect of another year of schooling = 26.605.

Note, the backslashes (\) at the ends of the continued lines in the generation of H_hat and in the printf statement are continuation commands. The backslash at the end of a line tells gretl that the next line is a continuation of the current line. This helps keep your programs looking good (and, in this case, fitting within the margins of the page!).

A slightly easier way to evaluate the index, ĥoursᵢ, is to use matrices. In the alternative version we convert the data to a matrix and create a vector of the variable means. The average number of children under six (0.24) is replaced with a 1 and the index is computed using vector algebra.

1 tobit hours xvars
2 matrix beta = $coeff
3 matrix X = { xvars }
4 matrix meanx = meanc(X)
5 matrix meanx[1,5] = 1
6 scalar h_hat = meanx*beta
7 printf "\nTwo ways to compute a prediction get %8.4f and %8.4f\n", h_hat, H_hat

This produces

Two ways to compute a prediction get -397.3022 and -397.3022
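As an arithmetic cross-check, the scale factor and marginal effect reported earlier can be recomputed from the printed index (−397.3022), σ̂ = 1133.7, and the educ coefficient 73.2910; a quick sketch in Python:

```python
from scipy.stats import norm

h_hat = -397.3022      # the index printed above
sigma = 1133.7         # sigma-hat from the tobit output
b_educ = 73.2910       # tobit coefficient on educ
scale = norm.cdf(h_hat / sigma)
me_educ = scale * b_educ
print(round(scale, 3), round(me_educ, 3))  # 0.363 26.605
```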

Finally, estimates of the restricted sample using least squares and the full sample that includes the zeros for hours worked follow.

1 list xvars = const educ exper age kidsl6
2 smpl hours > 0 --restrict
3 ols hours xvars
4 smpl --full
5 ols hours xvars

Notice that the sample is restricted to the positive observations using the smpl hours > 0 --restrict statement. To estimate the model using the entire sample, the full range is restored using smpl --full.

These were added to a model table and the result appears below:

Dependent variable: hours

             (1)         (2)         (3)
             Tobit       OLS         OLS

const        1350**      1830**      1335**
            (386.3)     (292.5)     (235.6)
educ         73.29**    -16.46       27.09**
            (20.47)     (15.58)     (12.24)
exper        80.54**     33.94**     48.04**
            (6.281)     (5.009)     (3.642)
age         -60.77**    -17.11**    -31.31**
            (6.883)     (5.458)     (3.961)
kidsl6     -918.9**    -305.3**    -447.9**
           (111.6)     (96.45)     (58.41)

Standard errors in parentheses
* indicates significance at the 10 percent level
** indicates significance at the 5 percent level

You can see that the tobit regression in column (1) and the OLS regression in column (3) use the entire sample. Estimating the model by OLS with the zero observations in the model reduces all of the slope coefficients by a substantial amount. Tossing out the zero observations as in the OLS regression in column (2) reverses the sign on years of schooling (though it is not significant). For only women in the labor force, more schooling has no effect on hours worked. If you consider the entire population of women, more schooling does increase hours, presumably by enticing more women into the labor force.

## 16.5 Simulation

In this section a simple Monte Carlo experiment demonstrates the performance of least squares and tobit when the sample is censored. Samples are generated using

y*ᵢ = -9 + 1xᵢ + eᵢ,    (16.22)

where eᵢ is normally distributed with mean 0 and standard deviation 4, and the dependent variable is censored from below at zero. The experiment is repeated in 1000 simulated samples. It shows that least squares is indeed biased when all observations, including the zero ones, are included in the sample. The line series yc = (y > 0) ? y : 0 is a logical statement that generates 'y' or '0' depending on whether the statement in parentheses is true. Thus, a new variable, yc, is created that takes the value y if y > 0 and is zero otherwise. Another logical is used in line 10 to generate an indicator variable, w. The variable w = 1 when the statement in parentheses (yc > 0) is true; otherwise it is equal to zero. The variable w is used in wls to exclude the observations that have zero hours of work.

 1 nulldata 200
 2 series xs = 20*uniform()
 3 list x = const xs
 4 series ys = -9 + 1*xs
 5 loop 1000 --progressive --quiet
 6     series y = ys + normal(0,4)
 7     series yc = (y > 0) ? y : 0
 8     ols y x
 9     ols yc x
10     series w = (yc>0)
11     wls w yc x
12     tobit yc x
13 endloop
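For readers more comfortable outside gretl, a rough Python analogue of the loop above (same data generating process; only the two OLS comparisons are reproduced here) illustrates the attenuation caused by censoring:

```python
import numpy as np

rng = np.random.default_rng(1234)
n, nmc = 200, 1000
xs = 20 * rng.uniform(size=n)                 # series xs = 20*uniform()
X = np.column_stack([np.ones(n), xs])
slope_full, slope_cens = [], []
for _ in range(nmc):
    y = -9 + 1 * xs + rng.normal(0, 4, n)     # y = ys + normal(0,4)
    yc = np.where(y > 0, y, 0.0)              # yc = (y > 0) ? y : 0
    slope_full.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
    slope_cens.append(np.linalg.lstsq(X, yc, rcond=None)[0][1])
print(np.mean(slope_full))   # close to the true slope of 1
print(np.mean(slope_cens))   # well below 1: censoring biases OLS toward zero
```

The average slope from the uncensored regressions is centered on the true value, while OLS on the censored series is substantially attenuated, exactly the pattern reported in the gretl output below.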

Because the tobit estimator is iterative, there is a lot of output generated to the screen. However, if you scroll down you will find the results from this simulation. Recall, the value of the constant was set at —9 and the slope to 1. The column labeled ‘mean of the estimated coefficients’ is the average value of the estimator in 1000 iterations of the Monte Carlo. When the estimator is unbiased, this number should be ‘close’ to the true value in the statistical sense. You can use the next column (std. dev. of estimated coefficients) to compute a Monte Carlo standard error to perform a test.

Since the coefficients are being averaged over the number, NMC, of simulated samples, the central limit theorem should apply; the mean should be approximately normally distributed with standard error σ/√NMC. The result in the column labeled 'std. dev. of estimated coefficients' is an estimate of σ. To test for unbiasedness of the tobit estimator of the slope (H₀: β₂ = 1 against the two-sided alternative) compute:

√NMC (b̄₂ − 1)/σ̂ = √1000 (1.00384 − 1)/0.0737160 = 1.647 ∼ N(0, 1)    (16.23)

if the estimator is unbiased. The 5% critical value is 1.96 and the unbiasedness of the tobit estimator cannot be rejected at this level of significance. See Adkins (2011b) for more examples and details.
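The arithmetic in (16.23) is straightforward to verify, e.g.:

```python
import math

# Monte Carlo unbiasedness test for the tobit slope estimator
nmc = 1000
z = math.sqrt(nmc) * (1.00384 - 1.0) / 0.0737160
print(round(z, 3))  # 1.647
```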

OLS estimates using the 200 observations 1-200 Statistics for 1000 repetitions Dependent variable: y

OLS estimates using the 200 observations 1-200 Statistics for 1000 repetitions Dependent variable: yc

WLS estimates using the 108 observations 1-200 Statistics for 1000 repetitions Dependent variable: yc

Tobit estimates using the 200 observations 1-200 Standard errors based on Hessian Statistics for 1000 repetitions Dependent variable: yc

The estimators in the first and last sets are unbiased. OLS in the first instance uses the full sample that has not been censored. In reality, the censored observations won't be available, so this estimator is not feasible outside of the Monte Carlo. The tobit estimator in the last set is feasible, however, and it is clearly working well with this data generating process. The second set of results estimates the model by OLS on the entire sample with 0 recorded for the censored observations. It performs poorly and is no better than the third set of results, which discards the zero observations. The third regression does reveal what happens conditional on being in the labor force, though, so it is not without its uses.