Grouped Data
In the biometrics literature, grouped data typically comes from laboratory experiments; see Cox (1970). In the insecticide example, every dosage level x_i is administered to a group of insects of size n_i, and the proportion of insects that die, p_i, is recorded. This is done for i = 1, 2, …, M dosage levels.
P[y_i = 1] = π_i = P[I_i* ≤ I_i] = Φ(α + βx_i)
where I_i* is the tolerance and I_i = α + βx_i is the stimulus. In economics, observations may be grouped by income levels or age, and we observe the labor participation rate for each income or age group. For this type of grouped data, we estimate the probability of participating in the labor force, π_i, with p_i, the proportion from the sample. This requires a large number of observations in each group, i.e., a large n_i for i = 1, 2, …, M. In this case, the approximation is
z_i = Φ⁻¹(p_i) ≈ α + βx_i    (13.9)
For each p_i, we compute the standardized normal variate z_i, which gives an estimate of α + βx_i. Note that the standard normal distribution assumption is not restrictive in the sense that if I* is N(μ, σ²) rather than N(0, 1), then one standardizes P[I* ≤ I_i] by subtracting μ and dividing by σ, in which case the new I* is N(0, 1), the new α is (α − μ)/σ, and the new β is β/σ. This also implies that μ and σ are not separately estimable. A plot of the z_i's versus the x_i's would give estimates of α and β. For the biometrics example, one can compute LD50, which is the dosage level that will kill 50% of the insect population. This corresponds to z_i = 0, which solves for x_i = −α/β. Similarly, LD95 corresponds to z_i = 1.645, which solves for x_i = (1.645 − α)/β. Alternatively, for the economic example, LD50 is the minimum reservation wage that is necessary for a 50% labor participation rate.
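The transformation just described can be sketched numerically. The dosage levels and kill proportions below are made-up illustrative numbers, not data from Cox (1970):

```python
# Sketch of the grouped-probit transformation: z_i = Phi^{-1}(p_i)
# estimates alpha + beta * x_i, and LD50/LD95 follow from the fitted line.
import numpy as np
from scipy.stats import norm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # dosage levels x_i (illustrative)
p = np.array([0.12, 0.30, 0.52, 0.71, 0.88])  # observed kill proportions p_i

# Standardized normal variates z_i = Phi^{-1}(p_i)
z = norm.ppf(p)

# Fit the line z = alpha + beta * x by least squares
beta, alpha = np.polyfit(x, z, 1)

# LD50: the dosage with z = 0, i.e. x = -alpha/beta
ld50 = -alpha / beta
# LD95: the dosage with z = 1.645
ld95 = (1.645 - alpha) / beta
print(alpha, beta, ld50, ld95)
```

With these made-up proportions the fitted LD50 falls near the middle of the dosage range, and LD95 lies above it, as expected for an upward-sloping dose-response line.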
One could improve on this method by including more x's on the right-hand side of (13.9). In this case, one can no longer plot the z_i values versus the x variables. However, one can run OLS of z on these x's. One problem remains: OLS ignores the heteroskedasticity in the error term. To see this, write
p_i = π_i + ε_i = F(x_i'β) + ε_i    (13.10)
where F is a general c.d.f. and π_i = F(x_i'β). Using the properties of the binomial distribution, E(p_i) = π_i and var(p_i) = π_i(1 − π_i)/n_i. Defining z_i = F⁻¹(p_i), we obtain from (13.10)
z_i = F⁻¹(p_i) = F⁻¹(π_i + ε_i) ≈ F⁻¹(π_i) + [dF⁻¹(π_i)/dπ_i]ε_i    (13.11)
where the approximation ≈ is a Taylor series expansion around π_i with ε_i → 0. Since F is monotonic, π_i = F(F⁻¹(π_i)). Let w_i = F⁻¹(π_i) = x_i'β; differentiating with respect to π_i gives
1 = [dF(w_i)/dw_i](dw_i/dπ_i)    (13.12)
Alternatively, this can be rewritten as
dF⁻¹(π_i)/dπ_i = dw_i/dπ_i = 1/[dF(w_i)/dw_i] = 1/f(w_i) = 1/f(x_i'β)    (13.13)
where f is the probability density function corresponding to F. Using (13.13), equation (13.11) can be rewritten as
z_i = F⁻¹(p_i) ≈ F⁻¹(π_i) + ε_i/f(x_i'β)    (13.14)
    = F⁻¹(F(x_i'β)) + ε_i/f(x_i'β) = x_i'β + ε_i/f(x_i'β)
From (13.14), it is clear that the disturbances of the regression of z_i on x_i are given by u_i = ε_i/f(x_i'β), with E(u_i) = 0 and σ_i² = var(u_i) = var(ε_i)/f²(x_i'β) = π_i(1 − π_i)/(n_i f_i²) = F_i(1 − F_i)/(n_i f_i²), since π_i = F_i, where the subscript i on f or F denotes that the argument of that function is x_i'β. This heteroskedasticity in the disturbances renders OLS on (13.14) consistent but inefficient. For the probit, σ_i² = Φ_i(1 − Φ_i)/(n_i φ_i²), and for the logit, σ_i² = 1/[n_i Λ_i(1 − Λ_i)], since f_i = Λ_i(1 − Λ_i). Using 1/σ_i as weights, a WLS procedure can be performed on (13.14). Note that F⁻¹(p_i) for the logit is simply log[p_i/(1 − p_i)]. This is one more reason why the logistic functional form is so popular. In this case one regresses log[p_i/(1 − p_i)] on x_i, correcting for heteroskedasticity using WLS. This procedure is also known as the minimum logit chi-square method and is due to Berkson (1953).
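The logit case of (13.13) can be checked symbolically: with F⁻¹(p) = log[p/(1 − p)], and f evaluated at w = F⁻¹(p) equal to Λ(1 − Λ) = p(1 − p), the derivative dF⁻¹(p)/dp should equal 1/f. A quick sympy verification:

```python
# Symbolic check of (13.13) for the logistic c.d.f.
import sympy as sp

p = sp.symbols("p", positive=True)

Finv = sp.log(p / (1 - p))      # F^{-1}(p) for the logit
lhs = sp.diff(Finv, p)          # dF^{-1}(p)/dp
f = p * (1 - p)                 # f(w) at w = F^{-1}(p): Lambda(1-Lambda) = p(1-p)
print(sp.simplify(lhs - 1 / f)) # simplifies to 0, confirming (13.13)
```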
In order to obtain feasible estimates of the σ_i's, one could use the OLS estimates of β from (13.14) to estimate the weights. Greene (1993) argues that one should not use the proportions p_i as estimates for the π_i, because this is equivalent to using the y_i²'s instead of the σ_i²'s in the heteroskedastic regression. This leads to inefficient estimates. If OLS on (13.14) is reported, one should use the robust White heteroskedastic variance-covariance option; otherwise the standard errors are biased and inference is misleading.
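The two-step procedure described above can be sketched on simulated grouped data; the group sizes, regressor values and true coefficients below are illustrative choices, not from any cited study:

```python
# Minimal sketch of Berkson's minimum logit chi-square (WLS) estimator.
import numpy as np

rng = np.random.default_rng(0)
alpha_true, beta_true = -1.0, 0.5

x = np.linspace(-2, 4, 20)        # M = 20 group-level regressor values
n = np.full(x.shape, 500)         # group sizes n_i (large, as required)
pi = 1.0 / (1.0 + np.exp(-(alpha_true + beta_true * x)))
p = rng.binomial(n, pi) / n       # observed proportions p_i

z = np.log(p / (1.0 - p))         # logit transform F^{-1}(p_i)
X = np.column_stack([np.ones_like(x), x])

# First stage: OLS of z on X to get fitted probabilities for the weights
b_ols = np.linalg.lstsq(X, z, rcond=None)[0]
lam = 1.0 / (1.0 + np.exp(-X @ b_ols))

# Second stage: WLS with weights 1/sigma_i^2 = n_i * Lambda_i * (1 - Lambda_i)
w = n * lam * (1.0 - lam)
Xw = X * w[:, None]
b_wls = np.linalg.solve(X.T @ Xw, Xw.T @ z)
print(b_wls)   # close to (alpha_true, beta_true)
```

With large n_i, the WLS estimates recover the true intercept and slope closely, while plain OLS on z would be consistent but less efficient.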
Example 1: Beer Taxes and Motor Vehicle Fatality. Ruhm (1996) used grouped logit analysis with fixed time and state effects to study the impact of beer taxes and a variety of alcohol-control policies on motor vehicle fatality rates. Ruhm collected panel data for 48 states (excluding Alaska, Hawaii and the District of Columbia) over the period 1982-1988. The dependent variable is a proportion p, the total vehicle fatality rate per capita for state i at time t. One can perform the logit transformation, log[p/(1 − p)], provided p is not zero or one, and run the usual fixed effects regression described in Chapter 12. Denote this dependent variable by LFVR. The explanatory variables included the real beer tax rate on 24 (12 oz.) containers of beer (BEERTAX), the minimum legal drinking age (MLDA) in years, the percentage of the population living in dry counties (DRY), the average number of vehicle miles per person aged 16 and over (VMILES), and the percentage of young drivers (15-24 years old) (YNGDRV). Also included are dummy variables indicating the presence of alcohol regulations: BREATH, which takes the value 1 if the state authorized the police to administer pre-arrest breath tests to establish probable cause for driving under the influence (DUI), and JAILD, which takes the value 1 if the state passed legislation mandating jail or community service (COMSERD) for the first DUI conviction. Other variables included are the unemployment rate, real per capita income, and state and time dummy variables. Details on these variables are given in Table 1 of Ruhm (1996). Some of the variables in this data set can be downloaded from the Stock and Watson (2003) web site at www.aw.com/stock_watson. Table 13.1 replicates, to the extent possible, the grouped logit regression results in column (d) of Table 2, p. 444 of Ruhm (1996).
This regression does not include some of the other alcohol regulations that were not provided in the data set. The results were replicated using the robust White cross-section option in EViews.
Table 13.1 Grouped Logit, Beer Tax and Motor Vehicle Fatality

Table 13.1 shows that the beer tax is negative and significant, while the minimum legal drinking age is not significant. Neither are the breath test law, JAILD, or COMSERD variables, all of which represent state alcohol-safety legislation. Income per capita and the percentage of the population living in dry counties have a positive and significant effect on motor vehicle fatality rates. The state dummy variables are jointly significant, with an observed F-value of 34.9 which is distributed as F(47, 272). The year dummies are jointly significant, with an observed F-value of 2.97 which is distributed as F(6, 272). Problem 12 asks the reader to replicate Table 13.1. These results imply that increasing the minimum legal drinking age, or imposing stiffer punishments like mandating jail or community service, are not effective policy tools for decreasing traffic-related deaths. However, increasing the real tax on beer is an effective policy for reducing traffic-related deaths.
For grouped data, the sample sizes n_i for each group have to be sufficiently large. Also, the p_i's cannot be zero or one. One modification suggested in the literature is to add 1/(2n_i) to p_i when computing the log-odds ratio; see Cox (1970).
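The adjustment can be written as a small helper; the function name and sample numbers are illustrative:

```python
# Sketch of the Cox (1970) adjustment: add 1/(2*n_i) before taking log odds,
# so groups with p_i = 0 or 1 remain usable.
import numpy as np

def adjusted_logit(p, n):
    """log[(p + 1/(2n)) / (1 - p + 1/(2n))], defined even at p = 0 or 1."""
    p = np.asarray(p, dtype=float)
    n = np.asarray(n, dtype=float)
    return np.log((p + 1.0 / (2.0 * n)) / (1.0 - p + 1.0 / (2.0 * n)))

p = np.array([0.0, 0.25, 0.5, 1.0])   # illustrative group proportions
n = np.array([40, 40, 40, 40])        # illustrative group sizes
z = adjusted_logit(p, n)
print(z)   # all finite; z = 0 exactly at p = 0.5, symmetric at p = 0 and 1
```

Note this is algebraically the same as adding 1/2 to both the success and failure counts, since multiplying numerator and denominator by n gives (np + 1/2)/(n(1 − p) + 1/2).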
Example 2: Fractional Response. Papke and Wooldridge (1996) argue that in many economic settings p_i may be 0 or 1 for a large number of observations, for example, when studying participation rates in pension plans or high school graduation rates. They propose a fractional logit regression which handles fractional response variables based on quasi-likelihood methods. Fractional response variables are bounded variables; without loss of generality, they can be restricted to lie between 0 and 1. Examples include the proportion of income spent on charitable contributions and the fraction of total weekly hours spent working. Papke and Wooldridge (1996) propose modeling E(y_i|x_i) as a logistic function Λ(x_i'β). This ensures that the predicted value of y_i lies in the interval (0, 1). It is also well defined even if y_i takes the values 0 or 1 with positive probability. It is important to note that when y_i is a proportion from a group of known size n_i, the quasi-maximum likelihood method ignores the information on n_i. Using the Bernoulli log-likelihood function, one gets
L_i(β) = y_i log[Λ(x_i'β)] + (1 − y_i) log[1 − Λ(x_i'β)]
for i = 1, 2, …, n, with 0 < Λ(x_i'β) < 1.
Maximizing the sum of L_i(β) over i = 1, …, n with respect to β yields the quasi-MLE, which is consistent and √n-asymptotically normal regardless of the distribution of y_i conditional on x_i; see Gourieroux, Monfort and Trognon (1984) and McCullagh and Nelder (1989). The latter proposed the generalized linear models (GLM) approach to this problem in statistics. Logit QMLE can be done in Stata using the GLM command with the family function indicating Bernoulli and the link function indicating the logistic distribution.
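As a minimal sketch of how this Bernoulli quasi-log-likelihood can be maximized directly (outside Stata), one can hand a generic optimizer the objective on simulated fractional data; all names and numbers below are illustrative, and this is not the Papke-Wooldridge data:

```python
# Fractional logit QMLE: maximize the Bernoulli quasi-log-likelihood
# with a fractional response y in [0, 1].
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n_obs = 2000
x = rng.normal(size=n_obs)
X = np.column_stack([np.ones(n_obs), x])
beta_true = np.array([0.5, 1.0])

# Fractional response with E(y|x) = Lambda(x'beta): a beta-distributed
# y around the logistic mean (an illustrative data-generating choice)
mu = 1.0 / (1.0 + np.exp(-(X @ beta_true)))
y = rng.beta(5.0 * mu, 5.0 * (1.0 - mu))

def neg_quasi_loglik(b):
    # Bernoulli quasi-log-likelihood; valid even for fractional y
    lam = 1.0 / (1.0 + np.exp(-(X @ b)))
    eps = 1e-12   # guards the logs at numerically extreme fitted values
    return -np.sum(y * np.log(lam + eps) + (1 - y) * np.log(1 - lam + eps))

res = minimize(neg_quasi_loglik, np.zeros(2), method="BFGS")
print(res.x)   # close to beta_true, by QMLE consistency
```

Because the quasi-likelihood only requires the conditional mean to be correctly specified, the estimates are consistent here even though y is beta-distributed rather than Bernoulli.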
Papke and Wooldridge (1996) derive the robust asymptotic variance of the QMLE of β and suggest some specification tests based on Wooldridge (1991). They apply their methods to participation in 401(K) pension plans. The data are from the 1987 IRS Form 5500 reports of pension plans with more than 100 participants. This data set, containing 4734 observations, can be downloaded from the Journal of Applied Econometrics Data Archive. We focus on a subset of their data which includes 3874 observations on plans with match rates less than or equal to one. Match rates above one may indicate end-of-plan-year employer contributions made to avoid IRS disqualification. Participation rates (PRATE) in this sample are high, averaging 84.8%. Over 40% of the plans have a participation proportion of one. This makes the log-odds ratio approach awkward, since adjustments have to be made to more than 40% of the observations. The plan match rate (MRATE) averages about 41 cents on the dollar. Other explanatory variables include total firm employment (EMP), age of the plan (AGE), and a dummy variable (SOLE) which takes the value 1 if the 401(K) plan is the only pension plan offered by the employer. The 401(K) plans average 12 years in age, and they are the SOLE plan in 37% of the sample. Average employment is 4622. Problem 14 asks the reader to replicate the descriptive statistics given in Table I of Papke and Wooldridge (1996, p. 627). Table 13.2 gives the Stata output for the logit QMLE using the same specification as Table II of Papke and Wooldridge (1996, p. 628). Note that it uses the GLM command, the Bernoulli variance function and the logit link function. The results show a positive and significant relationship between the match rate and the participation rate. All the other variables included are significant except for SOLE. Problem 14 asks the reader to replicate this result and compare it with OLS. The latter turns out to have a lower R² and fails a RESET test; see Chapter 8.

Table 13.2 Logit Quasi-MLE of Participation Rates in 401(K) Plan

prate      Coef.       Robust Std. Err.    z       P>|z|    [95% Conf. Interval]
mrate       1.39008     .1077064           12.91   0.000     1.17898     1.601181
log emp    -1.001874    .1104365           -9.07   0.000    -1.218326   -.7854229
log emp2    .0521864    .0071278            7.32   0.000     .0382161    .0661568
age         .0501126    .0088451            5.67   0.000     .0327766    .0674486
age2       -.0005154    .0002117           -2.43   0.015    -.0009303   -.0001004
sole        .0079469    .0502025            0.16   0.874    -.0904482    .1063421
_cons       5.057997    .4208646           12.02   0.000     4.233117    5.882876