# Sample Attrition and Sample Selection

Missing observations occur frequently in panel data. If individuals are missing randomly, most estimation methods for the balanced panel can be extended in a straightforward manner to the unbalanced panel (e.g. Hsiao, 1986). For instance, suppose that

$$d_{it} y_{it} = d_{it}\,[\alpha_i + \gamma' z_{it} + u_{it}], \tag{16.34}$$

where $d_{it}$ is an observable scalar indicator variable that denotes whether information about $(y_{it}, z'_{it})$ for the $i$th individual in the $t$th time period is available. The indicator $d_{it}$ is assumed to depend on a vector of variables $w_{it}$, an individual-specific effect $\lambda_i$, and an unobservable error term $\eta_{it}$,

$$d_{it} = I(\lambda_i + \delta' w_{it} + \eta_{it} > 0), \tag{16.35}$$

where $I(\cdot)$ is the indicator function that takes the value 1 if $\lambda_i + \delta' w_{it} + \eta_{it} > 0$ and 0 otherwise. In other words, the indicator variable $d_{it}$ determines whether $(y_{it}, z_{it})$ in (16.34) is observed or not (e.g. Hausman and Wise, 1979).
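To make the setup concrete, the following is a minimal simulation sketch of the outcome and selection equations. It is not taken from the chapter: the function name `simulate_panel`, the scalar regressors, the standard normal effects and errors, and the parameter values are all assumptions chosen for illustration.

```python
import random


def simulate_panel(n=500, t_periods=2, gamma=1.0, delta=1.0, rho=0.5, seed=0):
    """Draw (y, z, w, d) from the outcome and selection equations.

    Assumptions made for this sketch only: scalar z and w, standard
    normal effects and errors, and corr(u, eta) = rho.
    """
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        alpha_i = rng.gauss(0.0, 1.0)  # individual effect, outcome equation
        lam_i = rng.gauss(0.0, 1.0)    # individual effect, selection equation
        for t in range(t_periods):
            z = rng.gauss(0.0, 1.0)
            w = rng.gauss(0.0, 1.0)
            eta = rng.gauss(0.0, 1.0)
            # u is correlated with eta; this is what creates selection bias
            u = rho * eta + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
            d = 1 if lam_i + delta * w + eta > 0 else 0
            # y is recorded only when the selection indicator equals 1
            y = alpha_i + gamma * z + u if d == 1 else None
            rows.append({"i": i, "t": t, "y": y, "z": z, "w": w, "d": d})
    return rows


panel = simulate_panel()
observed = [r for r in panel if r["d"] == 1]
print(len(observed), "of", len(panel), "person-periods are observed")
```

Setting `rho=0.0` in this sketch removes the correlation between the two errors, which corresponds to the case where observations are missing at random.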

Without sample selectivity, that is, $d_{it} = 1$ for all $i$ and $t$, (16.34) is the standard variable intercept (or fixed effects) model for panel data discussed in Section 2. With sample selection, if $\eta_{it}$ and $u_{it}$ are correlated, then $E(u_{it} \mid z_{it}, d_{it} = 1) \neq 0$. Let $\theta(\cdot)$ denote the conditional expectation of $u_{it}$ given $d_{it} = 1$ and $w_{it}$; then (16.34) can be written as

$$y_{it} = \alpha_i + \gamma' z_{it} + \theta(\lambda_i + \delta' w_{it}) + \varepsilon_{it}, \tag{16.36}$$

where $E(\varepsilon_{it} \mid z_{it}, d_{it} = 1) = 0$. The form of the selection function is derived from the joint distribution of $u$ and $\eta$. For instance, if $u$ and $\eta$ are bivariate normal, then we have the Heckman (1979) sample selection model with

$$\theta(\lambda_i + \delta' w_{it}) = \sigma_{u\eta}\, \frac{\phi(\lambda_i + \delta' w_{it})}{\Phi(\lambda_i + \delta' w_{it})},$$

where $\sigma_{u\eta}$ denotes the covariance between $u$ and $\eta$, $\phi(\cdot)$ and $\Phi(\cdot)$ are the standard normal density and distribution functions, respectively, and the variance of $\eta$ is normalized to 1. Therefore, in the presence of sample attrition or selection, regressing $y_{it}$ on $z_{it}$ using only the observed information is invalidated by two problems: first, the presence of the unobserved effects $\alpha_i$, and second, the "selection bias" arising from the fact that $E(u_{it} \mid z_{it}, d_{it} = 1) = \theta(\lambda_i + \delta' w_{it})$.
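As an illustration of the bivariate normal case, the selection term can be evaluated numerically. The sketch below uses only the Python standard library; the function names and the covariance value 0.5 are hypothetical choices, not part of the original text.

```python
import math


def normal_pdf(x):
    """Standard normal density phi(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    """Standard normal distribution function Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def selection_term(index, sigma_u_eta):
    """theta(index) = sigma_u_eta * phi(index) / Phi(index).

    index stands for lambda_i + delta'w_it; var(eta) is normalized to 1.
    """
    return sigma_u_eta * normal_pdf(index) / normal_cdf(index)


# The correction shrinks as the selection index grows, i.e. as the
# observation becomes more likely to be observed.
for index in (-1.0, 0.0, 1.0):
    print(index, selection_term(index, sigma_u_eta=0.5))
```

When `sigma_u_eta` is zero the term vanishes for every index value, which matches the statement that selection bias arises only when the two errors are correlated.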

When individual effects are random and the joint distribution function of $(u, \eta, \alpha_i, \lambda_i)$ is known, both maximum likelihood and two- or multi-step estimators can be derived (e.g. Heckman, 1979; Ryu, 1998). The resulting estimators are consistent and asymptotically normally distributed, and they converge at a rate proportional to the square root of the sample size. However, if the joint distribution of $u$ and $\eta$ is misspecified, then even without the presence of $\alpha_i$, both the maximum likelihood and Heckman (1979) two-step estimators will be inconsistent. This sensitivity of the parameter estimates to the exact specification of the error distribution has motivated the interest in semiparametric methods.

The presence of individual effects is easily handled by pairwise differencing for those individuals that are observed in two time periods $t$ and $s$, i.e. who have $d_{it} = d_{is} = 1$. However, the sample selectivity factors are not eliminated by pairwise differencing. The expected value of $y_{it} - y_{is}$ given $d_{it} = 1$ and $d_{is} = 1$ takes the form

$$E(y_{it} - y_{is} \mid d_{it} = 1, d_{is} = 1) = (z_{it} - z_{is})'\gamma + E[u_{it} - u_{is} \mid d_{it} = 1, d_{is} = 1]. \tag{16.37}$$

In general,

$$\theta_{its} = E(u_{it} - u_{is} \mid d_{it} = 1, d_{is} = 1) \neq 0 \tag{16.38}$$

and the selection effects for the two periods differ from each other. If $(u_{it}, \eta_{it})$ are independent and identically distributed (iid) and are independent of $\alpha_i$, $\lambda_i$, $z$ and $w$, then

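$$\begin{aligned}
\theta_{it} &= E(u_{it} \mid d_{it} = 1, d_{is} = 1) = E(u_{it} \mid d_{it} = 1) \\
&= E(u_{it} \mid \eta_{it} > -w'_{it}\delta - \lambda_i) = \theta(\delta' w_{it} + \lambda_i),
\end{aligned} \tag{16.39}$$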

where the second equality is due to the independence over time of the error vector and the third equality is due to the independence of the errors from the individual effects and the explanatory variables. The function $\theta(\cdot)$ of the single index $\delta' w_{it} + \lambda_i$ is the same over $i$ and $t$ because of the iid assumption on $(u_{it}, \eta_{it})$, but in general $\theta(\delta' w_{it} + \lambda_i) \neq \theta(\delta' w_{is} + \lambda_i)$ because of the time variation of the scalar index $\delta' w_{it}$. However, for an individual $i$ that has $\delta' w_{it} = \delta' w_{is}$ and $d_{it} = d_{is} = 1$, the sample selection effect $\theta_{it}$ will be the same in the two periods. Therefore, for this particular individual, time differencing eliminates both the unobserved individual effect and the sample selection effect,

$$y_{it} - y_{is} = \gamma'(z_{it} - z_{is}) + (\varepsilon_{it} - \varepsilon_{is}). \tag{16.40}$$
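The differencing step can be spelled out. Writing $\theta_{it} = \theta(\lambda_i + \delta' w_{it})$ for the selection term in the selection-corrected equation for $y_{it}$, and subtracting the period-$s$ equation from the period-$t$ equation for an individual observed in both periods, gives the following short derivation (it uses only symbols already defined in the text):

```latex
\begin{aligned}
y_{it} - y_{is}
  &= (\alpha_i - \alpha_i) + \gamma'(z_{it} - z_{is})
     + (\theta_{it} - \theta_{is}) + (\varepsilon_{it} - \varepsilon_{is}) \\
  &= \gamma'(z_{it} - z_{is}) + (\varepsilon_{it} - \varepsilon_{is})
     \qquad \text{when } \delta' w_{it} = \delta' w_{is}.
\end{aligned}
```

The individual effect $\alpha_i$ always cancels; the selection terms cancel only when the two selection indices coincide.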

This suggests estimating $\gamma$ by least squares from a subsample that consists of those observations that satisfy $\delta' w_{it} = \delta' w_{is}$ and $d_{it} = d_{is} = 1$,

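$$\hat{\gamma} = \left[ \sum_{i=1}^{N} \sum_{1 \le s < t \le T_i} (z_{it} - z_{is})(z_{it} - z_{is})'\, 1\{(w_{it} - w_{is})'\delta = 0\}\, d_{it} d_{is} \right]^{-1} \left[ \sum_{i=1}^{N} \sum_{1 \le s < t \le T_i} (z_{it} - z_{is})(y_{it} - y_{is})\, 1\{(w_{it} - w_{is})'\delta = 0\}\, d_{it} d_{is} \right], \tag{16.41}$$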

where $T_i$ denotes the number of time series observations for the $i$th individual.
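A minimal sketch of this matched-pair least squares idea follows, under simplifying assumptions that are not in the original: a single scalar regressor $z$, a scalar $w$, a known $\delta$, and a hypothetical row layout with keys `i`, `t`, `y`, `z`, `w`, `d`. The toy data are noise-free and constructed so that $\gamma = 2$.

```python
from itertools import combinations


def matched_pair_estimate(rows, delta, tol=1e-12):
    """Scalar-regressor version of least squares on matched pairs.

    Uses only pairs of periods for the same individual with d = 1 in
    both periods and (w_it - w_is) * delta equal to zero (up to tol).
    """
    by_person = {}
    for r in rows:
        if r["d"] == 1:
            by_person.setdefault(r["i"], []).append(r)
    num = 0.0
    den = 0.0
    for obs in by_person.values():
        for a, b in combinations(obs, 2):
            if abs((a["w"] - b["w"]) * delta) <= tol:
                dz = a["z"] - b["z"]
                dy = a["y"] - b["y"]
                den += dz * dz
                num += dz * dy
    if den == 0.0:
        raise ValueError("no usable matched pairs")
    return num / den


# Toy data: y = alpha_i + 2 * z + selection term, where the selection
# term is 0.3 when w = 1 and 0.7 when w = 0 (made-up values).
rows = [
    {"i": 0, "t": 0, "y": 5.0 + 2 * 1.0 + 0.3, "z": 1.0, "w": 1, "d": 1},
    {"i": 0, "t": 1, "y": 5.0 + 2 * 3.0 + 0.3, "z": 3.0, "w": 1, "d": 1},
    {"i": 1, "t": 0, "y": -1.0 + 2 * 0.0 + 0.7, "z": 0.0, "w": 0, "d": 1},
    {"i": 1, "t": 1, "y": -1.0 + 2 * 2.0 + 0.7, "z": 2.0, "w": 0, "d": 1},
    # Individual 2 has different w in the two periods, so the pair is dropped.
    {"i": 2, "t": 0, "y": 1.0 + 2 * 1.0 + 0.3, "z": 1.0, "w": 1, "d": 1},
    {"i": 2, "t": 1, "y": 1.0 + 2 * 4.0 + 0.7, "z": 4.0, "w": 0, "d": 1},
]
estimate = matched_pair_estimate(rows, delta=1.0)
print(estimate)
```

Including the pair from individual 2 would contaminate the estimate with the difference in selection terms, which is why the indicator in the estimator restricts attention to pairs with equal selection indices.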

The estimator (16.41) cannot be directly implemented because $\delta$ is unknown. Moreover, the scalar index $\delta' w_{it}$ will typically be continuous if any of the variables in $w_{it}$ is continuous, so exact equality of the index across two periods will rarely occur. Ahn and Powell (1993) note that if $\theta$ is a sufficiently "smooth" function and $\hat{\delta}$ is a consistent estimator of $\delta$, observations for which the difference $(w_{it} - w_{is})'\hat{\delta}$ is close to zero should have $\theta_{it} - \theta_{is} \approx 0$. They propose a two-step procedure. In the first step, consistent semiparametric estimates of the coefficients of the "selection" equation are obtained. The result is used to obtain estimates of the "single index" variables, $w'_{it}\delta$, characterizing the selectivity bias in the equation of interest. The second step estimates the parameters of the equation of interest by a weighted instrumental variables regression of pairwise differences in the dependent variable on the corresponding differences in explanatory variables; the weights put more emphasis on pairs with $w'_{it}\hat{\delta} \approx w'_{i,t-1}\hat{\delta}$.

Kyriazidou (1997) and Honore and Kyriazidou (1998) generalize this concept and propose to estimate the fixed effects sample selection models in two steps. In the first step, $\delta$ is estimated by either the Anderson (1970) and Chamberlain (1980) conditional maximum likelihood approach or the Manski (1975) maximum score method. In the second step, the estimate $\hat{\delta}$ is used to estimate $\gamma$, based on pairs of observations for which $d_{it} = d_{is} = 1$ and for which $(w_{it} - w_{is})'\hat{\delta}$ is "close" to zero. This last requirement is operationalized by weighting each pair of observations with a weight that depends inversely on the magnitude of $(w_{it} - w_{is})'\hat{\delta}$, so that pairs with larger differences in the selection effects receive less weight in the estimation. The Kyriazidou (1997) estimator takes the form:
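Written out under the assumption that it mirrors (16.41), with the indicator replaced by the kernel weight just described, the estimator referred to as (16.42) would read as follows. This is a reconstruction from the surrounding definitions; any further normalization used in the original, such as per-individual weights, is not shown.

```latex
\hat{\gamma}_K =
\left[ \sum_{i=1}^{N} \sum_{1 \le s < t \le T_i}
  (z_{it} - z_{is})(z_{it} - z_{is})'\,
  K\!\left( \frac{(w_{it} - w_{is})'\hat{\delta}}{h_N} \right) d_{it} d_{is} \right]^{-1}
\left[ \sum_{i=1}^{N} \sum_{1 \le s < t \le T_i}
  (z_{it} - z_{is})(y_{it} - y_{is})\,
  K\!\left( \frac{(w_{it} - w_{is})'\hat{\delta}}{h_N} \right) d_{it} d_{is} \right]
```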

where $K$ is a kernel density function which tends to zero as the magnitude of its argument increases and $h_N$ is a positive constant that decreases to zero as $N \to \infty$.

Under appropriate regularity conditions, Kyriazidou (1997) shows that the estimator (16.42) is consistent and asymptotically normally distributed. However, its rate of convergence is slower than the standard square root of the sample size.
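The second-step weighting can be sketched in a few lines. The code below is an illustrative simplification rather than the estimator as published: it assumes a single scalar regressor, a scalar $w$, a first-step estimate supplied by the caller as `delta_hat`, a standard normal density as the kernel, and a hypothetical row layout with keys `i`, `t`, `y`, `z`, `w`, `d`.

```python
import math
from itertools import combinations


def kernel_weighted_estimate(rows, delta_hat, bandwidth):
    """Kernel-weighted pairwise-difference least squares (scalar z and w).

    Each within-individual pair with d = 1 in both periods is weighted
    by K((w_it - w_is) * delta_hat / bandwidth), with K the standard
    normal density (one possible kernel choice).
    """
    by_person = {}
    for r in rows:
        if r["d"] == 1:
            by_person.setdefault(r["i"], []).append(r)
    num = 0.0
    den = 0.0
    for obs in by_person.values():
        for a, b in combinations(obs, 2):
            gap = (a["w"] - b["w"]) * delta_hat / bandwidth
            weight = math.exp(-0.5 * gap * gap) / math.sqrt(2.0 * math.pi)
            dz = a["z"] - b["z"]
            dy = a["y"] - b["y"]
            den += weight * dz * dz
            num += weight * dz * dy
    if den == 0.0:
        raise ValueError("all pairs received zero weight")
    return num / den


# Toy data with gamma = 2; the selection term is 0.3 when w = 1 and
# 0.7 when w = 0 (made-up values). Individual 2 changes w over time.
rows = [
    {"i": 0, "t": 0, "y": 5.0 + 2 * 1.0 + 0.3, "z": 1.0, "w": 1, "d": 1},
    {"i": 0, "t": 1, "y": 5.0 + 2 * 3.0 + 0.3, "z": 3.0, "w": 1, "d": 1},
    {"i": 1, "t": 0, "y": -1.0 + 2 * 0.0 + 0.7, "z": 0.0, "w": 0, "d": 1},
    {"i": 1, "t": 1, "y": -1.0 + 2 * 2.0 + 0.7, "z": 2.0, "w": 0, "d": 1},
    {"i": 2, "t": 0, "y": 1.0 + 2 * 1.0 + 0.3, "z": 1.0, "w": 1, "d": 1},
    {"i": 2, "t": 1, "y": 1.0 + 2 * 4.0 + 0.7, "z": 4.0, "w": 0, "d": 1},
]
narrow = kernel_weighted_estimate(rows, delta_hat=1.0, bandwidth=0.05)
wide = kernel_weighted_estimate(rows, delta_hat=1.0, bandwidth=100.0)
print(narrow, wide)
```

In this toy example a narrow bandwidth effectively discards the mismatched pair and recovers a value close to 2, while a very wide bandwidth weights all pairs almost equally and lets the difference in selection terms bias the estimate. With real data the bandwidth trades off this bias against the variance from using fewer pairs.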

There is an explosion of techniques and procedures for the analysis of panel data (e.g. Matyas and Sevestre, 1996). In this chapter we have discussed some popular panel data models. We did not discuss issues of duration and count data models (e.g. Cameron and Trivedi, 1998; Heckman and Singer, 1984; Lancaster, 1990; Lancaster and Intrator, 1998), simulation-based inference (e.g. Gourieroux and Monfort, 1993), specification analysis (e.g. Baltagi and Li, 1995; Lee, 1987; Li and Hsiao, 1998; Maddala, 1995; Wooldridge, 1995), measurement errors (e.g. Biorn, 1992; Griliches and Hausman, 1984; Hsiao, 1991; Hsiao and Taylor, 1991), pseudo panels or matched samples (e.g. Deaton, 1985; Moffitt, 1993; Peracchi and Welsch, 1995; Verbeek, 1992), etc. In general, there does not exist a panacea for panel data analysis. It appears more fruitful to explicitly recognize the limitations of the data and focus attention on providing solutions for a specific type of model. A specific model often contains specific structural information that can be exploited. However, the power of panel data depends on the validity of the assumptions upon which the statistical methods have been built (e.g. Griliches, 1979).

## Notes

* This work was supported in part by National Science Foundation grant SBR96-19330. I would like to thank two referees for helpful comments.

1 Normality is assumed for ease of relating the sampling approach and Bayesian approach estimators. It is not required.

2 See Chamberlain (1984) and Hausman and Taylor (1981) for approaches to estimating models when u and є are correlated.

3 Under smoothness conditions, Horowitz (1992) proposed a smoothed maximum score estimator that has an $n^{-2/5}$ rate of convergence. With even stronger conditions, Lee (1999) is able to propose a root-n consistent semiparametric estimator.

## References

Ahn, H., and J. L. Powell (1993). Semiparametric estimation of censored selection models with a nonparametric selection mechanism. Journal of Econometrics 58, 3-30.

Ahn, S. C., and P. Schmidt (1995). Efficient estimation of models for dynamic panel data. Journal of Econometrics 68, 29-52.

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Proc. 2nd Int. Symp. Information Theory, pp. 267-81.

Amemiya, T., and T. MaCurdy (1986). Instrumental variable estimation of an error components model. Econometrica 54, 869-81.

Anderson, E. B. (1970). Asymptotic properties of conditional maximum likelihood estimators. Journal of the Royal Statistical Society series B 32, 283-301.

Anderson, E. B. (1973). Conditional Inference and Models for Measuring. Kobenhavn: Mentalhygiejnisk Forlag.

Anderson, T. W., and C. Hsiao (1981). Estimation of dynamic models with error components. Journal of the American Statistical Association 76, 598-606.

Anderson, T. W., and C. Hsiao (1982). Formulation and estimation of dynamic models using panel data. Journal of Econometrics 18, 47-82.

Arellano, M., and O. Bover (1995). Another look at the instrumental variable estimation of error-components models. Journal of Econometrics 68, 29-52.

Balestra, P., and M. Nerlove (1966). Pooling cross-section and time series data in the estimation of a dynamic model: The demand for natural gas. Econometrica 34, 585-612.

Baltagi, B. H. (1995). Econometric Analysis of Panel Data. New York: Wiley.

Baltagi, B. H., and Q. Li (1995). Testing AR(1) against MA(1) disturbances in an error components model. Journal of Econometrics 68, 133-52.

Biorn, E. (1992). Econometrics of panel data with measurement errors. In L. Matyas and P. Sevestre (eds.) Econometrics of Panel Data: Theory and Applications, pp. 152-95. Kluwer.

Bhargava, A., and J. D. Sargan (1983). Estimating dynamic random effects models from panel data covering short time periods. Econometrica 51, 1635-59.

Blundell, R., and S. Bond (1998). Initial conditions and moment restrictions in dynamic panel data models. Journal of Econometrics 87, 115-43.

Breusch, T. S., and A. R. Pagan (1979). A simple test for heteroscedasticity and random coefficient variation. Econometrica 47, 1287-94.

Cameron, A. C., and P. K. Trivedi (1998). Regression Analysis of Count Data. Cambridge: Cambridge University Press.

Chamberlain, G. (1980). Analysis of covariance with qualitative data. Review of Economic Studies 47, 225-38.

Chamberlain, G. (1984). Panel data. In Z. Griliches and M. Intriligator (eds.), Handbook of Econometrics, Volume 2. pp. 1247-1318. Amsterdam: North-Holland.

Chintagunta, P., E. Kyriazidou, and J. Perktold (1998). Panel data analysis of household brand choices. Journal of Econometrics.

Deaton, A. (1985). Panel data from time series of cross-sections. Journal of Econometrics 30, 109-26.

Gelfand, A. E., and A. F. M. Smith (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association 85, 398-409.

Gourieroux, C., and A. Monfort (1993). Simulation based inference: A survey with special reference to panel data models. Journal of Econometrics 59, 5-34.

Griliches, Z. (1979). Sibling models and data in economics, beginning of a survey. Journal of Political Economy 87, supplement 2, S37-S64.

Griliches, Z., and J. A. Hausman (1984). Errors-in-variables in panel data. Journal of Econometrics 31, 93-118.

Hausman, J. A. (1978). Specification tests in econometrics. Econometrica 46, 1251-71.

Hausman, J. A., and W. E. Taylor (1981). Panel data and unobservable individual effects. Econometrica 49, 1377-98.

Hausman, J. A., and D. A. Wise (1979). Attrition bias in experimental and panel data: The Gary income maintenance experiment. Econometrica 47, 455-73.

Heckman, J. (1979). Sample selection bias as a specification error. Econometrica 47, 153-61.

Heckman, J. J., and B. Singer (1984). Econometric duration analysis. Journal of Econometrics 24, 63-132.

Honore, B. E., and E. Kyriazidou (1997). Panel data discrete choice models with lagged dependent variables. Mimeo.

Honore, B. E., and E. Kyriazidou (1998). Estimation of tobit-type models with individual specific effects. Mimeo.

Hsiao, C. (1974). Statistical inference for a model with both random cross-sectional and time effects. International Economic Review 15, 12-30.

Hsiao, C. (1975). Some estimation methods for a random coefficients model. Econometrica 43, 305-25.

Hsiao, C. (1985). Benefits and limitations of panel data. Econometric Reviews 4, 121-74.

Hsiao, C. (1986). Analysis of Panel Data. Econometric Society monographs No. 11, New York: Cambridge University Press.

Hsiao, C. (1989). Consistent estimation for some nonlinear errors-in-variables models. Journal of Econometrics 41, 159-85.

Hsiao, C. (1990). A mixed fixed and random coefficients framework for pooling cross-section and time series data. Paper presented at the Third Conference on Telecommunication Demand Analysis with Dynamic Regulation, Hilton Head, S. Carolina.

Hsiao, C. (1991). Identification and estimation of latent binary choice models using panel data. Review of Economic Studies 58, 717-31.

Hsiao, C. (1992). Nonlinear latent variables models. In L. Matyas and P. Sevestre (eds.) Econometrics of Panel Data. pp. 242-61. Kluwer.

Hsiao, C. (1995). Panel analysis for metric data. In G. Arminger, C. C. Clogg, and M. E. Sobel (eds.) Handbook of Statistical Modelling in the Social and Behavioral Sciences, (3rd edn). pp. 361-400. Plenum.

Hsiao, C., and D. Mountain (1995). A framework for regional modelling and impact analysis – an analysis of demand for electricity by large municipalities in Ontario, Canada. Journal of Regional Science 34, 361-85.

Hsiao, C., and A. K. Tahmiscioglu (1997). A panel analysis of liquidity constraints and firm investment. Journal of the American Statistical Association 92, 455-65.

Hsiao, C., and G. Taylor (1991). Some remarks on measurement errors and the identification of panel data models. Statistica Neerlandica 45, 187-94.

Hsiao, C., T. W. Applebe, and C. R. Dineen (1993). A general framework for panel data models – with an application to Canadian customer-dialed long distance telephone service. Journal of Econometrics 59, 63-86.

Hsiao, C., D. C. Mountain, K. Y. Tsui, and M. W. Luke Chan (1989). Modelling Ontario regional electricity system demand using a mixed fixed and random coefficients approach. Regional Science and Urban Economics 19, 567-87.

Hsiao, C., M. H. Pesaran, and A. K. Tahmiscioglu (1998). Maximum likelihood estimation of fixed effects dynamic panel data models covering short time periods. Mimeo, Cambridge University.

Hsiao, C., M. H. Pesaran, and A. K. Tahmiscioglu (1999). Bayes estimation of short-run coefficients in dynamic panel data models. In C. Hsiao, L. F. Lee, K. Lahiri, and M. H. Pesaran (eds.) Analysis of Panels and Limited Dependent Variables Models. Cambridge: Cambridge University Press, pp. 268-96.

Hsiao, C., B. H. Sun, and J. Lightwood (1995). Fixed vs. random effects specification for panel data analysis. Paper presented in International Panel Data Conference, Paris.

Horowitz, J. (1992). A smoothed maximum score estimator for the binary response model. Econometrica 60, 505-31.

Kuh, E. (1963). Capital Stock Growth: A Micro-Econometric Approach. Amsterdam: North-Holland.

Kyriazidou, E. (1997). Estimation of a panel data sample selection model. Econometrica 65, 1335-64.

Lancaster, T. (1990). The Econometric Analysis of Transition Data. Cambridge: Cambridge University Press.

Lancaster, T. (1998). Some econometrics of scarring. In C. Hsiao, K. Morimune, and J. Powell (eds.) Nonlinear Statistical Inference. Cambridge: Cambridge University Press.

Lancaster, T., and O. Intrator (1998). Panel data with survival: Hospitalization of HIV-positive patients. Journal of the American Statistical Association 93, 46-53.

Lee, L. F. (1987). Nonparametric testing of discrete panel data models. Journal of Econometrics 34, 147-78.

Lee, M. J. (1999). A root-n consistent semiparametric estimator for related effect binary response panel data. Econometrica 67, 427-33.

Li, Q., and C. Hsiao (1998). Testing serial correlation in semiparametric panel data models. Journal of Econometrics 87, 207-37.

Lindley, D. V., and A. F. M. Smith (1972). Bayes estimates for the linear model. Journal of the Royal Statistical Society B 34, 1-41.

Maddala, G. S. (1995). Specification tests in limited dependent variable models. In G. S. Maddala, P. C. B. Phillips, and T. N. Srinivasan (eds.) Advances in Econometrics and Quantitative Economics: Essays in Honor of C. R. Rao. Oxford: Blackwell. pp. 1-49.

Manski, C. F. (1975). Maximum score estimation of the stochastic utility model of choice. Journal of Econometrics 3, 205-28.

Matyas, L., and P. Sevestre (1996). The Econometrics of Panel Data – Handbook of Theory and Applications, 2nd edn. Dordrecht: Kluwer.

Min, C. K., and A. Zellner (1993). Bayesian and non-Bayesian methods for combining models and forecasts with applications to forecasting international growth rate. Journal of Econometrics 56, 89-118.

Moffitt, R. (1993). Identification and estimation of dynamic models with a time series of repeated cross-sections. Journal of Econometrics 59, 99-123.

Mundlak, Y. (1978). On the pooling of time series and cross section data. Econometrica 46, 69-85.

Neyman, J., and E. L. Scott (1948). Consistent estimates based on partially consistent observations. Econometrica 16, 1-32.

Peracchi, F., and F. Welsch (1995). How representative are matched cross-sections? Evidence from the current population survey. Journal of Econometrics 68, 153-80.

Pesaran, M. H., and R. Smith (1995). Estimation of long-run relationships from dynamic heterogeneous panels. Journal of Econometrics 68, 79-114.

Ryu, K. K. (1998). New approach to attrition problem in longitudinal studies. In C. Hsiao, K. Morimune, and J. Powell (eds.) Nonlinear Statistical Inference. Cambridge University Press.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6, 461-4.

Swamy, P. A. V. B. (1970). Efficient inference in a random coefficient regression model. Econometrica 38, 311-23.

Verbeek, M. (1992). The design of panel surveys and the treatment of missing observations. Ph.D. dissertation, Tilburg University.

Wallace, T. D., and A. Hussain (1969). The use of error components models in combining cross-section with time series data. Econometrica 37, 55-72.

Wansbeek, T., and T. Knaap (1999). Estimating a dynamic panel data model with heterogeneous trend. Annales d’Economie et de Statistique 55-6, 331-50.

Wooldridge, J. M. (1995). Selection corrections for panel data models under conditional mean independence assumptions. Journal of Econometrics 68, 115-32.

Zellner, A. (1962). An efficient method of estimating seemingly unrelated regressions and tests for aggregation bias. Journal of the American Statistical Association 57, 348-68.

Ziliak, J. P. (1997). Efficient estimation with panel data when instruments are predetermined: An empirical comparison of moment-condition estimators. Journal of Business and Economic Statistics 15, 419-31.
