Grouped Data

In the biometrics literature, grouped data most likely comes from laboratory experiments, see Cox (1970). In the insecticide example, every dosage level x_i is administered to a group of insects of size n_i, and the proportion of insects that die, p_i, is recorded. This is done for i = 1, 2, …, M dosage levels. In this case,

π_i = P[y_i = 1] = P[I_i* ≤ I_i] = Φ(α + βx_i)

where I_i* is the tolerance and I_i = α + βx_i is the stimulus. In economics, observations may be grouped by income levels or age, and we observe the labor force participation rate for each income or age group. For this type of grouped data, we estimate the probability of participating in the labor force, π_i, with p_i, the proportion from the sample. This requires a large number of observations in each group, i.e., a large n_i for i = 1, 2, …, M. In this case, the approximation is

z_i = Φ^(-1)(p_i) = α + βx_i    (13.9)

For each p_i, we compute the standardized normal variate z_i, which gives an estimate of α + βx_i. Note that the standard normal distribution assumption is not restrictive, in the sense that if I_i* is N(μ, σ²) rather than N(0, 1), then one standardizes P[I_i* ≤ I_i] by subtracting μ and dividing by σ, in which case the new I_i* is N(0, 1), the new α is (α − μ)/σ, and the new β is β/σ. This also implies that μ and σ are not separately estimable. A plot of the z_i's versus the x_i's would give estimates of α and β. For the biometrics example, one can compute LD50, the dosage level that kills 50% of the insect population. This corresponds to z_i = 0, which solves for x_i = −α/β. Similarly, LD95 corresponds to z_i = 1.645, which solves for x_i = (1.645 − α)/β. Alternatively, for the economic example, LD50 is the minimum reservation wage necessary for a 50% labor participation rate.
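The transform-and-fit procedure is easy to carry out. The sketch below uses hypothetical dosage-mortality numbers, chosen only to illustrate the mechanics: compute z_i = Φ^(-1)(p_i), fit the line by OLS, and back out LD50 and LD95.

```python
from statistics import NormalDist

# Hypothetical dosage-mortality data: dose x_i, group size n_i, deaths d_i
doses = [1.0, 1.5, 2.0, 2.5, 3.0]
n = [50, 50, 50, 50, 50]
deaths = [4, 12, 26, 38, 47]

nd = NormalDist()  # standard normal
p = [d / m for d, m in zip(deaths, n)]
z = [nd.inv_cdf(pi) for pi in p]   # z_i = Phi^{-1}(p_i), as in (13.9)

# Simple OLS of z on x gives rough estimates of alpha and beta
xbar = sum(doses) / len(doses)
zbar = sum(z) / len(z)
beta = sum((x - xbar) * (zi - zbar) for x, zi in zip(doses, z)) / \
       sum((x - xbar) ** 2 for x in doses)
alpha = zbar - beta * xbar

ld50 = -alpha / beta            # dose at which z_i = 0
ld95 = (1.645 - alpha) / beta   # dose at which z_i = 1.645
```

With these made-up numbers, the fitted LD50 lands near a dose of 2, where roughly half the insects die, as one would expect.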

One could improve on this method by including more x's on the right-hand side of (13.9). In this case, one can no longer plot the z_i values against the x variables, but one can still run OLS of z on these x's. One problem remains: OLS ignores the heteroskedasticity in the error term. To see this:

p_i = π_i + ε_i = F(x_i'β) + ε_i    (13.10)

where F is a general c.d.f. and π_i = F(x_i'β). Using the properties of the binomial distribution, E(p_i) = π_i and var(p_i) = π_i(1 − π_i)/n_i. Defining z_i = F^(-1)(p_i), we obtain from (13.10)

z_i = F^(-1)(p_i) = F^(-1)(π_i + ε_i) ≈ F^(-1)(π_i) + [dF^(-1)(π_i)/dπ_i]ε_i    (13.11)

where ≈ denotes a Taylor series expansion around π_i as ε_i → 0. Since F is monotonic, π_i = F(F^(-1)(π_i)). Let w_i = F^(-1)(π_i) = x_i'β; differentiating with respect to π_i gives

1 = [dF(w_i)/dw_i][dw_i/dπ_i]    (13.12)

Alternatively, this can be rewritten as

dF^(-1)(π_i)/dπ_i = dw_i/dπ_i = 1/[dF(w_i)/dw_i] = 1/f(w_i) = 1/f(x_i'β)    (13.13)

where f is the probability density function corresponding to F. Using (13.13), equation (13.11) can be rewritten as

z_i = F^(-1)(p_i) ≈ F^(-1)(π_i) + ε_i/f(x_i'β)    (13.14)

    = F^(-1)(F(x_i'β)) + ε_i/f(x_i'β) = x_i'β + ε_i/f(x_i'β)

From (13.14), it is clear that the disturbances of the regression of z_i on x_i are given by u_i = ε_i/f(x_i'β), with E(u_i) = 0 and σ_i² = var(u_i) = var(ε_i)/f²(x_i'β) = π_i(1 − π_i)/(n_i f_i²) = F_i(1 − F_i)/(n_i f_i²), since π_i = F_i, where the subscript i on f or F denotes that the argument of that function is x_i'β. This heteroskedasticity in the disturbances renders OLS on (13.14) consistent but inefficient. For the probit, σ_i² = Φ_i(1 − Φ_i)/(n_i φ_i²), and for the logit, σ_i² = 1/[n_i Λ_i(1 − Λ_i)], since f_i = Λ_i(1 − Λ_i). Using 1/σ_i as weights, a WLS procedure can be performed on (13.14). Note that F^(-1)(p_i) for the logit is simply log[p_i/(1 − p_i)]. This is one more reason why the logistic functional form is so popular. In this case, one regresses log[p_i/(1 − p_i)] on x_i, correcting for heteroskedasticity using WLS. This procedure is also known as the minimum logit chi-square method and is due to Berkson (1953).

In order to obtain feasible estimates of the σ_i's, one could use the OLS estimates of β from (13.14) to estimate the weights. Greene (1993) argues that one should not use the proportions p_i's as estimates of the π_i's, because this is equivalent to using the y_i²'s instead of the σ_i²'s in the heteroskedastic regression. This leads to inefficient estimates. If OLS on (13.14) is reported, one should use White's heteroskedasticity-robust variance-covariance option; otherwise the standard errors are biased and inference is misleading.
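The two-step feasible WLS procedure can be sketched as follows, with hypothetical grouped data and a single regressor. Following Greene's advice, the weights are built from the first-step fitted values rather than the raw p_i's:

```python
import math

# Hypothetical grouped data: regressor x_i, group size n_i, successes s_i
x = [1.0, 2.0, 3.0, 4.0, 5.0]
n = [80, 80, 80, 80, 80]
s = [10, 22, 40, 58, 70]

p = [si / ni for si, ni in zip(s, n)]
z = [math.log(pi / (1 - pi)) for pi in p]   # logit transform F^{-1}(p_i)

def wls(xs, ys, w=None):
    # (weighted) least squares for a single regressor plus intercept
    w = w or [1.0] * len(xs)
    sw = sum(w)
    xm = sum(wi * xi for wi, xi in zip(w, xs)) / sw
    ym = sum(wi * yi for wi, yi in zip(w, ys)) / sw
    b = sum(wi * (xi - xm) * (yi - ym) for wi, xi, yi in zip(w, xs, ys)) / \
        sum(wi * (xi - xm) ** 2 for wi, xi in zip(w, xs))
    return ym - b * xm, b

# Step 1: consistent but inefficient OLS of z on x
a0, b0 = wls(x, z)

# Step 2: feasible weights 1/sigma_i^2 = n_i * Lambda_i * (1 - Lambda_i),
# evaluated at the fitted values, not at the raw proportions p_i
lam = [1 / (1 + math.exp(-(a0 + b0 * xi))) for xi in x]
w = [ni * li * (1 - li) for ni, li in zip(n, lam)]

a1, b1 = wls(x, z, w)   # minimum logit chi-square estimates
```

This is exactly Berkson's minimum logit chi-square method: the second step downweights groups whose fitted probabilities are near 0 or 1, where the log-odds are noisiest.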

Example 1: Beer Taxes and Motor Vehicle Fatality. Ruhm (1996) used grouped logit analysis with fixed time and state effects to study the impact of beer taxes and a variety of alcohol-control policies on motor vehicle fatality rates. Ruhm collected panel data for 48 states (excluding Alaska, Hawaii and the District of Columbia) over the period 1982-1988. The dependent variable is a proportion p, denoting the total vehicle fatality rate per capita for state i at time t. One can perform the logit transformation, log[p/(1 − p)], provided p is not zero or one, and run the usual fixed effects regression described in Chapter 12. Denote this dependent variable by LVFR. The explanatory variables include the real beer tax rate on 24 (12 oz.) containers of beer (BEERTAX), the minimum legal drinking age (MLDA) in years, the percentage of the population living in dry counties (DRY), the average number of vehicle miles per person aged 16 and over (VMILES), and the percentage of young drivers (15-24 years old) (YNGDRV). Also included are some dummy variables indicating the presence of alcohol regulations. These include BREATH, which takes the value 1 if the state authorized the police to administer pre-arrest breath tests to establish probable cause for driving under the influence (DUI), and JAILD, which takes the value 1 if the state passed legislation mandating jail or community service (COMSERD) for the first DUI conviction. Other variables included are the unemployment rate, real per capita income, and state and time dummy variables. Details on these variables are given in Table 1 of Ruhm (1996). Some of the variables in this data set can be downloaded from the Stock and Watson (2003) web site at www.aw.com/stock_watson. Table 13.1 replicates, to the extent possible, the grouped logit regression results in column (d) of Table 2, p. 444 of Ruhm (1996).
This regression omits some of the other alcohol regulation variables, which were not provided in the data set. The results were replicated using the robust White cross-section option in EViews.

Table 13.1 Grouped Logit, Beer Tax and Motor Vehicle Fatality

Dependent Variable: LVFR
Method: Panel Least Squares
Sample: 1982 1988
Cross-sections included: 48
Total panel (unbalanced) observations: 335
White cross-section standard errors & covariance (d.f. corrected)
Effects Specification: cross-section fixed (dummy variables); period fixed (dummy variables)
[Coefficient estimates, standard errors, and summary statistics not reproduced here.]

Table 13.1 shows that the beer tax is negative and significant, while the minimum legal drinking age is not significant. Neither are the breath test law, JAILD, or COMSERD variables, all of which represent state alcohol-safety-related legislation. Income per capita and the percentage of the population living in dry counties have a positive and significant effect on motor vehicle fatality rates. The state dummy variables are jointly significant, with an observed F-value of 34.9 which is distributed as F(47, 272). The year dummies are jointly significant, with an observed F-value of 2.97 which is distributed as F(6, 272). Problem 12 asks the reader to replicate Table 13.1. These results imply that increasing the minimum legal drinking age, or imposing stiffer punishments like mandating jail or community service, are not effective policy tools for decreasing traffic-related deaths. However, increasing the real tax on beer is an effective policy for reducing traffic-related deaths.

For grouped data, the sample sizes n_i for each group have to be sufficiently large. Also, the p_i's cannot be zero or one. One modification suggested in the literature is to add 1/(2n_i) to p_i when computing the log-odds ratio, see Cox (1970).
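A small helper sketches this boundary adjustment (the function name and example sizes are ours, not from the source):

```python
import math

def adjusted_logit(successes, n):
    # Log-odds with the 1/(2n) boundary adjustment discussed in Cox (1970);
    # groups with p = 0 or p = 1 would otherwise give an infinite logit.
    p = successes / n
    if p == 0.0:
        p = 1.0 / (2 * n)
    elif p == 1.0:
        p = 1.0 - 1.0 / (2 * n)
    return math.log(p / (1 - p))
```

For instance, with n_i = 50 and zero successes, the adjusted proportion is 1/100 rather than 0, giving a finite log-odds of about −4.6 instead of −∞.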

Example 2: Fractional Response. Papke and Wooldridge (1996) argue that in many economic settings p_i may be 0 or 1 for a large number of observations, for example, when studying participation rates in pension plans or high school graduation rates. They propose a fractional logit regression which handles fractional response variables based on quasi-likelihood methods. Fractional response variables are bounded variables; without loss of generality, they can be restricted to lie between 0 and 1. Examples include the proportion of income spent on charitable contributions and the fraction of total weekly hours spent working. Papke and Wooldridge (1996) propose modeling E(y_i | x_i) as a logistic function Λ(x_i'β). This ensures that the predicted value of y_i lies in the interval (0, 1). It is also well defined even if y_i takes the values 0 or 1 with positive probability. It is important to note that in case y_i is a proportion from a group of known size n_i, the quasi-maximum likelihood method ignores the information on n_i. Using the Bernoulli log-likelihood function, one gets

L_i(β) = y_i log[Λ(x_i'β)] + (1 − y_i) log[1 − Λ(x_i'β)]

for i = 1, 2, …, n, with 0 < Λ(x_i'β) < 1.

Maximizing Σ_{i=1}^n L_i(β) with respect to β yields the quasi-MLE, which is consistent and √n-asymptotically normal regardless of the distribution of y_i conditional on x_i, see Gourieroux, Monfort and Trognon (1984) and McCullagh and Nelder (1989). The latter proposed the generalized linear models (GLM) approach to this problem in statistics. Logit QMLE can be done in Stata using the GLM command, with the binomial family function indicating Bernoulli and the link function indicating the logistic distribution.
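The quasi-MLE itself is straightforward to compute: the objective Σ_i L_i(β) is globally concave in β for any y_i in [0, 1], so a few Newton-Raphson steps suffice. The sketch below uses made-up fractional responses (including exact 0's and 1's) and a single regressor:

```python
import math

# Hypothetical fractional responses (may be exactly 0 or 1) and one regressor
y = [0.0, 0.2, 0.5, 0.9, 1.0, 1.0]
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]

def lam(t):
    # logistic c.d.f. Lambda(t)
    return 1 / (1 + math.exp(-t))

# Newton-Raphson on (a, b) for the Bernoulli quasi-log-likelihood:
# sum_i [ y_i*log(mu_i) + (1 - y_i)*log(1 - mu_i) ], mu_i = lam(a + b*x_i)
a, b = 0.0, 0.0
for _ in range(50):
    mu = [lam(a + b * xi) for xi in x]
    # score vector: sum_i (y_i - mu_i) * (1, x_i)
    ga = sum(yi - mi for yi, mi in zip(y, mu))
    gb = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
    # Hessian: -sum_i mu_i(1 - mu_i) * (1, x_i)(1, x_i)'
    w = [mi * (1 - mi) for mi in mu]
    h11 = sum(w)
    h12 = sum(wi * xi for wi, xi in zip(w, x))
    h22 = sum(wi * xi * xi for wi, xi in zip(w, x))
    det = h11 * h22 - h12 ** 2
    a += (h22 * ga - h12 * gb) / det   # solve H * step = score
    b += (h11 * gb - h12 * ga) / det
```

Note that the quasi-score sums (y_i − μ_i)(1, x_i), so the 0 and 1 observations enter symmetrically with the interior fractions; no boundary adjustment is needed, which is exactly the advantage over the log-odds approach.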

Papke and Wooldridge (1996) derive the robust asymptotic variance of the QMLE of β and suggest some specification tests based on Wooldridge (1991). They apply their methods to participation in 401(K) pension plans. The data are from the 1987 IRS Form 5500 reports of pension plans with more than 100 participants. This data set, containing 4734 observations, can be downloaded from the Journal of Applied Econometrics Data Archive. We focus on a subset of their data which includes 3784 observations of plans with match rates less than or equal to one. Match rates above one may indicate end-of-plan-year employer contributions made to avoid IRS disqualification. Participation rates (PRATE) in this sample are high,

Table 13.2 Logit Quasi-MLE of Participation Rates in 401(K) Plan

glm prate mrate log_emp log_emp2 age age2 sole if one==1, f(bin) l(logit) robust
note: prate has non-integer values

Generalized linear models              Number of obs   = 3784
Optimization: ML: Newton-Raphson       Residual df     = 3777
Deviance = 1273.60684                  (1/df) Deviance = .3372006
Pearson  = 724.4199889                 (1/df) Pearson  = .1917977
Variance function: V(u) = u*(1-u)      [Bernoulli]
Link function:     g(u) = ln(u/(1-u))  [Logit]
Standard errors:   Sandwich
Log pseudo-likelihood = -1179.278516
AIC = .6269971                         BIC = -29843.34715
[Coefficient estimates, robust standard errors, z-statistics, and 95% confidence intervals not reproduced here.]

averaging 84.8%. Over 40% of the plans have a participation proportion of one. This makes the log-odds ratio approach awkward, since adjustments would have to be made to more than 40% of the observations. The plan match rate (MRATE) averages about 41 cents on the dollar. Other explanatory variables include total firm employment (EMP), the age of the plan (AGE), and a dummy variable (SOLE) which takes the value 1 if the 401(K) plan is the only pension plan offered by the employer. The 401(K) plans average 12 years in age, and they are the sole plan in 37% of the sample. Average employment is 4622. Problem 14 asks the reader to replicate the descriptive statistics given in Table I of Papke and Wooldridge (1996, p. 627). Table 13.2 gives the Stata output for the logit QMLE using the same specification given in Table II of Papke and Wooldridge (1996, p. 628). Note that it uses the GLM command with the Bernoulli variance function and the logit link function. The results show that there is a positive and significant relationship between the match rate and the participation rate. All the other included variables are significant except for SOLE. Problem 14 asks the reader to replicate this result and compare with OLS. The latter turns out to have a lower R² and fails a RESET test, see Chapter 8.
