# Results of Cosslett: Part I

Cosslett (1981a) proved the consistency and the asymptotic normality of CBMLE in the model where both $f$ and $Q$ are unknown and also proved that CBMLE asymptotically attains the Cramér-Rao lower bound. These results require much ingenuity and deep analysis because maximizing a likelihood function with respect to a density function $f$ as well as parameters $\beta$ creates a new and difficult problem that cannot be handled by the standard asymptotic theory of MLE. As Cosslett noted, his model does not even satisfy the conditions of Kiefer and Wolfowitz (1956) for consistency of MLE in the presence of infinitely many nuisance parameters.

Cosslett’s sampling scheme is a generalization of the choice-based sampling we have hitherto considered.$^{16}$ His scheme, called generalized choice-based sampling, is defined as follows: Assume that the total sample of size $n$ is divided into $S$ subsamples, with $n_s$ people ($n_s$ is a fixed known number) in the $s$th subsample. A person in the $s$th subsample faces alternatives $J_s$, a subset of the total alternatives $(0, 1, 2, \ldots, m)$. He or she chooses alternative $j$ with probability $Q(j)/\tilde{Q}(s)$, where $\tilde{Q}(s) = \sum_{j \in J_s} Q(j)$. Given $j$, he or she chooses a vector of exogenous variables $x$ with a conditional density $f(x|j)$.$^{17}$ Therefore the contribution of this typical person to the likelihood function is $\tilde{Q}(s)^{-1} Q(j) f(x|j)$, which can be equivalently expressed as

$$\tilde{Q}(s)^{-1} P(j|x, \beta) f(x). \tag{9.5.45}$$

Taking the product of (9.5.45) over all the persons in the $s$th subsample (denoted by $I_s$) and then over all $s$, we obtain the likelihood function

$$L = \prod_{s=1}^{S} \prod_{i \in I_s} \tilde{Q}(s)^{-1} P(j_i|x_i, \beta) f(x_i). \tag{9.5.46}$$

In the model under consideration, the $Q(j)$’s are unknown. Therefore we may wonder how, in subsample $s$, alternatives in $J_s$ are sampled in such a way that the $j$th alternative is chosen with probability $\tilde{Q}(s)^{-1} Q(j)$. To this question Cosslett (1981b) gave an example of interviewing train riders at a train station, some of whom have traveled to the station by their own cars and some of whom have come by taxi. Thus this subsample consists of two alternatives, each of which is sampled according to the correct probability by random sampling conducted at the train station.

Cosslett’s generalization of choice-based sampling is an attractive one, not only because it contains the simple choice-based sampling of the preceding sections as a special case (take $J_s = \{s\}$), but also because it contains an interesting special case of “enriched” samples.$^{18}$ That is, a researcher conducts random sampling, and if a particular alternative is chosen with a very low frequency, the sample is enriched by interviewing more of those who chose this particular alternative. Here we have, for example, $J_1 = (0, 1, 2, \ldots, m)$ and $J_2 = (0)$, if the 0th alternative is the one infrequently chosen.
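To make the within-subsample choice probabilities $Q(j)/\tilde{Q}(s)$ concrete, the following minimal sketch computes them for an enriched sample with $m = 2$. The function name `subsample_choice_probs` and the numerical values of $Q(j)$ are hypothetical and chosen only for illustration.

```python
def subsample_choice_probs(Q, J):
    """Return, for each subsample s, the dict {j: Q(j) / Q~(s)} over j in J_s."""
    result = []
    for J_s in J:
        Q_tilde = sum(Q[j] for j in J_s)      # Q~(s): sum of Q(j) over j in J_s
        result.append({j: Q[j] / Q_tilde for j in J_s})
    return result

# Hypothetical population shares for alternatives 0, 1, 2;
# alternative 0 is rarely chosen, so the sample is enriched with J_2 = (0).
Q = [0.05, 0.60, 0.35]
J = [(0, 1, 2), (0,)]
probs = subsample_choice_probs(Q, J)
```

In the first subsample the probabilities coincide with the population shares, because $J_1$ contains every alternative; in the second subsample alternative 0 is drawn with probability 1.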

Our presentation of Cosslett’s results will not be rigorous, and we shall not spell out all the conditions needed for consistency, asymptotic normality, and asymptotic efficiency. However, we want to mention the following identification conditions as they are especially important.

Assumption 9.5.6. $\bigcup_{s=1}^{S} J_s = (0, 1, 2, \ldots, m)$.

Assumption 9.5.7. A subset $M$ of integers $(1, 2, \ldots, S)$ such that $\left(\bigcup_{s \in M} J_s\right) \cap \left(\bigcup_{s \in \bar{M}} J_s\right) = \emptyset$, where $\bar{M}$ is the complement of $M$, cannot be found.

Note that if $S = 2$ and $m = 1$, for example, $J_1 = (0)$ and $J_2 = (1)$ violate Assumption 9.5.7, whereas $J_1 = (0)$ and $J_2 = (0, 1)$ satisfy it. Thus simple choice-based sampling violates this assumption. Cosslett noted that these assumptions are needed for the multinomial logit model but may not be needed for other QR models. Cosslett also gave an intuitive reason for these assumptions: If alternatives $j$ and $k$ are both contained in some subsample, then $Q(j)/Q(k)$ can be estimated consistently, and under Assumption 9.5.7 we have enough of these ratios to estimate all the $Q(j)$ separately. Thus Assumption 9.5.7 would not be needed in the case where $Q(j)$ are known.
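The two identification conditions can be checked mechanically for a given design. The sketch below is one possible reading, not Cosslett’s own formulation: it assumes that $M$ in Assumption 9.5.7 ranges over nonempty proper subsets, in which case the assumption is equivalent to saying that the subsamples form a connected graph when two subsamples are linked whenever they share an alternative. The function names are hypothetical.

```python
def satisfies_9_5_6(J, m):
    """Assumption 9.5.6: the J_s together cover all alternatives 0, 1, ..., m."""
    return set().union(*J) == set(range(m + 1))

def satisfies_9_5_7(J):
    """Assumption 9.5.7, read as: the subsamples cannot be split into two
    nonempty groups whose sets of alternatives are disjoint."""
    sets = [set(J_s) for J_s in J]
    reached = {0}          # subsamples linked (directly or not) to subsample 0
    frontier = [0]
    while frontier:
        s = frontier.pop()
        for t in range(len(sets)):
            if t not in reached and sets[s] & sets[t]:
                reached.add(t)
                frontier.append(t)
    return len(reached) == len(sets)

# The two designs discussed in the text (S = 2, m = 1).
simple_design = [(0,), (1,)]       # simple choice-based sampling
overlap_design = [(0,), (0, 1)]    # second subsample contains both alternatives
```

Applied to the designs in the text, `satisfies_9_5_7(simple_design)` is false and `satisfies_9_5_7(overlap_design)` is true, in line with the remark that simple choice-based sampling violates the assumption.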

Before embarking on a discussion of Cosslett’s results, we must list a few more symbols as an addition to those listed in Section 9.5.1:

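$$P(s) = P(s|x, \beta) = \sum_{j \in J_s} P(j|x, \beta)$$

$$P_0(s) = P(s|x, \beta_0) = \sum_{j \in J_s} P(j|x, \beta_0)$$

$$\tilde{Q}(s) = \sum_{j \in J_s} Q(j)$$

$$\tilde{Q}_0(s) = \sum_{j \in J_s} Q_0(j)$$

$$H(s) = n_s/n$$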

From (9.5.46) we obtain

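$$\log L = \sum_{i=1}^{n} \log P(j_i|x_i, \beta) + \sum_{i=1}^{n} \log f(x_i) - \sum_{s=1}^{S} n_s \log \int P(s|x, \beta) f(x)\,dx, \tag{9.5.47}$$

where the last term uses $\tilde{Q}(s) = \int P(s|x, \beta) f(x)\,dx$.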

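Our aim is to maximize (9.5.47) with respect to a parameter vector $\beta$ and a function $f(\cdot)$. By observing (9.5.47) we can immediately see that we should put $f(x) = 0$ for $x \neq x_i$, $i = 1, 2, \ldots, n$. Therefore our problem is reduced to the more manageable one of maximizing

$$\log L_1 = \sum_{i=1}^{n} \log P(j_i|x_i, \beta) + \sum_{i=1}^{n} \log w_i - \sum_{s=1}^{S} n_s \log \sum_{i=1}^{n} w_i P(s|x_i, \beta) \tag{9.5.48}$$

with respect to $\beta$ and $w_i$, $i = 1, 2, \ldots, n$. Differentiating (9.5.48) with respect to $w_i$ and setting the derivative equal to 0 yields

$$\frac{\partial \log L_1}{\partial w_i} = \frac{1}{w_i} - \sum_{s=1}^{S} \frac{n_s P(s|x_i, \beta)}{\sum_{k=1}^{n} w_k P(s|x_k, \beta)} = 0. \tag{9.5.49}$$

If we insert (9.5.49) into (9.5.48), we recognize that $\log L_1$ depends on $w_i$ only through the $S$ terms $\sum_{i=1}^{n} w_i P(s|x_i, \beta)$, $s = 1, 2, \ldots, S$. Thus, if we define

$$\lambda_s(\beta) = \frac{H(s)}{\sum_{i=1}^{n} w_i P(s|x_i, \beta)}, \tag{9.5.50}$$

we can write $\log L_1$ as a function of $\beta$ and $\{\lambda_s(\beta)\}$ as

$$\log L_1 = \sum_{i=1}^{n} \log P(j_i|x_i, \beta) - \sum_{i=1}^{n} \log \sum_{s=1}^{S} \lambda_s(\beta) P(s|x_i, \beta) + \sum_{s=1}^{S} n_s \log \lambda_s(\beta) - \sum_{s=1}^{S} n_s \log n_s. \tag{9.5.51}$$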
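The manipulation behind (9.5.51) is short enough to sketch. The steps below are a reconstruction, assuming the definition $\lambda_s(\beta) = H(s)/\sum_{i} w_i P(s|x_i, \beta)$ with $H(s) = n_s/n$; they are offered as a check and not as Cosslett’s own derivation. The first-order condition for $w_i$ gives

$$\frac{1}{w_i} = \sum_{s=1}^{S} \frac{n_s P(s|x_i, \beta)}{\sum_{k=1}^{n} w_k P(s|x_k, \beta)} = n \sum_{s=1}^{S} \lambda_s(\beta) P(s|x_i, \beta),$$

so that

$$\sum_{i=1}^{n} \log w_i = -n \log n - \sum_{i=1}^{n} \log \sum_{s=1}^{S} \lambda_s(\beta) P(s|x_i, \beta),$$

$$-\sum_{s=1}^{S} n_s \log \sum_{i=1}^{n} w_i P(s|x_i, \beta) = -\sum_{s=1}^{S} n_s \log \frac{n_s}{n \lambda_s(\beta)} = n \log n + \sum_{s=1}^{S} n_s \log \lambda_s(\beta) - \sum_{s=1}^{S} n_s \log n_s.$$

Adding these two lines to $\sum_{i} \log P(j_i|x_i, \beta)$, the $n \log n$ terms cancel because $\sum_{s} n_s = n$, and the remaining terms are those of (9.5.51).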

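But maximizing (9.5.51) with respect to $\beta$ and $\lambda_s(\beta)$, $s = 1, 2, \ldots, S$, is equivalent to maximizing$^{19}$

$$\Omega = \sum_{i=1}^{n} \log P(j_i|x_i, \beta) - \sum_{i=1}^{n} \log \sum_{s=1}^{S} \lambda_s P(s|x_i, \beta) + \sum_{s=1}^{S} n_s \log \lambda_s \tag{9.5.52}$$

with respect to $\beta$ and $\lambda_s$, $s = 1, 2, \ldots, S$. This can be shown as follows. Inserting (9.5.50) into (9.5.49) yields

$$w_i = \frac{1}{n \sum_{s=1}^{S} \lambda_s(\beta) P(s|x_i, \beta)}. \tag{9.5.53}$$

Multiplying both sides of (9.5.53) by $P(s|x_i, \beta)$, summing over $i$, and using (9.5.50) yields

$$\frac{n_s}{\lambda_s(\beta)} = \sum_{i=1}^{n} \frac{P(s|x_i, \beta)}{\sum_{s'=1}^{S} \lambda_{s'}(\beta) P(s'|x_i, \beta)}. \tag{9.5.54}$$

But setting the derivative of $\Omega$ with respect to $\lambda_s$ equal to 0 yields

$$\frac{\partial \Omega}{\partial \lambda_s} = \frac{n_s}{\lambda_s} - \sum_{i=1}^{n} \frac{P(s|x_i, \beta)}{\sum_{s'=1}^{S} \lambda_{s'} P(s'|x_i, \beta)} = 0. \tag{9.5.55}$$

Clearly, a solution of (9.5.55) is a solution of (9.5.54).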

In the maximization of (9.5.52) with respect to $\beta$ and $\{\lambda_s\}$, some normalization is necessary because $\Omega(a\lambda_1, a\lambda_2, \ldots, a\lambda_S, \beta) = \Omega(\lambda_1, \lambda_2, \ldots, \lambda_S, \beta)$. The particular normalization adopted by Cosslett is

$$\lambda_S = H(S). \tag{9.5.56}$$

Thus we define CBMLE of this model as the value of $\beta$ that is obtained by maximizing $\Omega$ with respect to $\beta$ and $\{\lambda_s\}$ subject to the normalization (9.5.56).
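The scale invariance that makes the normalization necessary can be checked numerically. The following is a minimal sketch, not Cosslett’s implementation: it assumes a multinomial logit form for $P(j|x, \beta)$, and the function names `mnl_probs` and `omega` as well as all data values are hypothetical.

```python
import numpy as np

def mnl_probs(X, beta):
    """P(j | x_i, beta) under a multinomial logit (an assumed form for P).
    X is (n, k); beta is (m, k); alternative 0 has utility normalized to 0."""
    util = np.column_stack([np.zeros(len(X)), X @ beta.T])   # (n, m + 1)
    util -= util.max(axis=1, keepdims=True)                   # numerical stability
    expu = np.exp(util)
    return expu / expu.sum(axis=1, keepdims=True)

def omega(beta, lam, X, choice, n_s, J):
    """Objective (9.5.52) for given beta and lambda = (lambda_1, ..., lambda_S)."""
    P = mnl_probs(X, beta)
    # P(s | x_i, beta): sum of P(j | x_i, beta) over j in J_s, one column per s.
    P_sub = np.column_stack([P[:, list(J_s)].sum(axis=1) for J_s in J])
    own = np.log(P[np.arange(len(X)), choice]).sum()
    mix = np.log(P_sub @ lam).sum()
    return own - mix + (n_s * np.log(lam)).sum()

# Hypothetical data: S = 2 subsamples with J_1 = (0, 1, 2) and J_2 = (0).
rng = np.random.default_rng(0)
J = [(0, 1, 2), (0,)]
n_s = np.array([6, 4])
X = rng.normal(size=(n_s.sum(), 2))
choice = np.concatenate([rng.integers(0, 3, size=6), np.zeros(4, dtype=int)])
beta = np.array([[0.5, -0.2], [0.1, 0.3]])
lam = np.array([1.3, 0.7])

base = omega(beta, lam, X, choice, n_s, J)
scaled = omega(beta, 3.0 * lam, X, choice, n_s, J)    # equal to base up to rounding
normalized = lam * (n_s[-1] / n_s.sum()) / lam[-1]    # rescaled so lambda_S = H(S)
```

Multiplying every $\lambda_s$ by $a$ subtracts $n \log a$ from the second term and adds $\sum_s n_s \log a = n \log a$ to the third, which is why `base` and `scaled` agree. A numerical optimizer would therefore hold $\lambda_S$ fixed at $H(S)$ and search over $\beta$ and $\lambda_1, \ldots, \lambda_{S-1}$ only.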

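Although our aim is to estimate $\beta$, it is of some interest to ask what the MLE of $\{\lambda_s\}$ is supposed to estimate. The probability limits of the MLE $\hat{\beta}$ and $\{\hat{\lambda}_s\}$ can be obtained as the values that maximize $\operatorname{plim}_{n \to \infty} n^{-1}\Omega$. We have

$$\begin{aligned}
\operatorname{plim} n^{-1}\Omega &= E \log P(j_i|x_i, \beta) - E \log \sum_{s=1}^{S} \lambda_s P(s|x_i, \beta) + \sum_{s=1}^{S} H(s) \log \lambda_s \\
&= \sum_{s=1}^{S} H(s) \int \sum_{j \in J_s} [\log P(j|x, \beta)]\, \tilde{Q}_0(s)^{-1} P(j|x, \beta_0) f(x)\,dx \\
&\quad - \sum_{s=1}^{S} H(s) \int \left[\log \sum_{s'=1}^{S} \lambda_{s'} P(s'|x, \beta)\right] \tilde{Q}_0(s)^{-1} P(s|x, \beta_0) f(x)\,dx + \sum_{s=1}^{S} H(s) \log \lambda_s.
\end{aligned} \tag{9.5.57}$$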

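Differentiating (9.5.57) with respect to $\lambda_\tau$, $\tau = 1, 2, \ldots, S - 1$, and $\beta$ yields

$$\frac{\partial}{\partial \lambda_\tau} \operatorname{plim} n^{-1}\Omega = -\int \frac{\sum_{s=1}^{S} H(s) \tilde{Q}_0(s)^{-1} P(s|x, \beta_0)}{\sum_{s=1}^{S} \lambda_s P(s|x, \beta)}\, P(\tau|x, \beta) f(x)\,dx + \frac{H(\tau)}{\lambda_\tau}, \qquad \tau = 1, 2, \ldots, S - 1, \tag{9.5.58}$$

and

$$\begin{aligned}
\frac{\partial}{\partial \beta} \operatorname{plim} n^{-1}\Omega &= \sum_{s=1}^{S} H(s) \int \sum_{j \in J_s} \frac{1}{P(j|x, \beta)} \frac{\partial P(j|x, \beta)}{\partial \beta}\, \tilde{Q}_0(s)^{-1} P(j|x, \beta_0) f(x)\,dx \\
&\quad - \int \frac{\sum_{s=1}^{S} H(s) \tilde{Q}_0(s)^{-1} P(s|x, \beta_0)}{\sum_{s=1}^{S} \lambda_s P(s|x, \beta)} \sum_{s=1}^{S} \lambda_s \frac{\partial P(s|x, \beta)}{\partial \beta} f(x)\,dx.
\end{aligned} \tag{9.5.59}$$

The right-hand sides of (9.5.58) and (9.5.59) both become 0 when we put $\lambda_s = \tilde{Q}_0(s)^{-1} H(s) \tilde{Q}_0(S)$ and $\beta = \beta_0$. First, insert these values into the right-hand side of (9.5.58). The ratio inside the integral reduces to $\tilde{Q}_0(S)^{-1}$, and because $\int P(\tau|x, \beta_0) f(x)\,dx = \tilde{Q}_0(\tau)$, the two terms of (9.5.58) are $-\tilde{Q}_0(\tau)/\tilde{Q}_0(S)$ and $\tilde{Q}_0(\tau)/\tilde{Q}_0(S)$, which sum to 0. Next, insert these same values into the right-hand side of (9.5.59). Then each term becomes

$$\sum_{s=1}^{S} H(s) \tilde{Q}_0(s)^{-1} \int \sum_{j \in J_s} \left.\frac{\partial P(j|x, \beta)}{\partial \beta}\right|_{\beta_0} f(x)\,dx,$$

so that the two terms cancel. We conclude that

$$\operatorname{plim} \hat{\beta} = \beta_0 \tag{9.5.60}$$

and

$$\operatorname{plim} \hat{\lambda}_s = \tilde{Q}_0(s)^{-1} H(s) \tilde{Q}_0(S). \tag{9.5.61}$$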
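The claim that the limit objective is stationary at $\beta = \beta_0$ and $\lambda_s = \tilde{Q}_0(s)^{-1} H(s) \tilde{Q}_0(S)$ can be checked numerically in a toy case. The sketch below makes several simplifying assumptions that are not in the text: a binary logit with scalar $\beta$, a discrete distribution for $x$ (so that integrals become sums), and hypothetical values for $\beta_0$, $H(s)$, and the support of $x$.

```python
import numpy as np

# Hypothetical setup: m = 1, S = 2, J_1 = (0, 1), J_2 = (0).
x_vals = np.array([-1.0, 0.0, 2.0])     # support of x
f_x = np.array([0.3, 0.5, 0.2])         # probabilities playing the role of f(x)
beta0 = 0.8
J = [(0, 1), (0,)]
H = np.array([0.7, 0.3])                # H(s) = n_s / n

def choice_probs(beta):
    """Binary logit P(j | x, beta), one row per support point of x."""
    p1 = 1.0 / (1.0 + np.exp(-beta * x_vals))
    return np.column_stack([1.0 - p1, p1])

def subsample_probs(beta):
    """P(s | x, beta): sum of P(j | x, beta) over j in J_s, one column per s."""
    P = choice_probs(beta)
    return np.column_stack([P[:, list(J_s)].sum(axis=1) for J_s in J])

Q0_tilde = f_x @ subsample_probs(beta0)  # Q~_0(s) for s = 1, ..., S

def plim_objective(beta, lam):
    """Discrete-x analogue of the second expression for plim n^{-1} Omega."""
    P, P0 = choice_probs(beta), choice_probs(beta0)
    Ps, Ps0 = subsample_probs(beta), subsample_probs(beta0)
    total = 0.0
    for s, J_s in enumerate(J):
        idx = list(J_s)
        first = f_x @ (np.log(P[:, idx]) * P0[:, idx]).sum(axis=1)
        second = f_x @ (np.log(Ps @ lam) * Ps0[:, s])
        total += H[s] * (first - second) / Q0_tilde[s] + H[s] * np.log(lam[s])
    return total

lam_star = H * Q0_tilde[-1] / Q0_tilde   # candidate limit; lam_star[-1] equals H(S)

# Central-difference derivatives at (beta0, lam_star), holding lambda_S fixed.
h = 1e-5
step = np.array([h, 0.0])
grad_beta = (plim_objective(beta0 + h, lam_star)
             - plim_objective(beta0 - h, lam_star)) / (2 * h)
grad_lam1 = (plim_objective(beta0, lam_star + step)
             - plim_objective(beta0, lam_star - step)) / (2 * h)
```

Both numerical derivatives should be zero up to finite-difference error, which is consistent with the stationarity argument; the check does not by itself establish that the stationary point is a maximum.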

It is interesting to note that if we replace $\lambda_s$ that appears in the right-hand side of (9.5.52) by $\operatorname{plim} \hat{\lambda}_s$ given in (9.5.61), $\Omega$ becomes identical to the logarithm of $\Psi$ given in (9.5.39). This provides an alternative interpretation of the Manski-McFadden estimator.

The asymptotic normality of $\hat{\beta}$ and $\{\hat{\lambda}_s\}$ can be proved by a straightforward, although cumbersome, application of Theorem 4.1.3. Although the same value of $\beta$ maximizes $\log L$ given in (9.5.47) and $\Omega$ given in (9.5.52), $\Omega$ is, strictly speaking, not a log likelihood function. Therefore we must use Theorem 4.1.3, which gives the asymptotic normality of a general extremum estimator, rather than Theorem 4.2.4, which gives the asymptotic normality of MLE. Indeed, Cosslett showed that the asymptotic covariance matrix of $\hat{\alpha} = (\hat{\beta}', \hat{\lambda}_1, \hat{\lambda}_2, \ldots, \hat{\lambda}_{S-1})'$ is given by a formula like that given in Theorem 4.1.3, which takes the form $A^{-1}BA^{-1}$, rather than by $[-E\,\partial^2\Omega/\partial\alpha\,\partial\alpha']^{-1}$. However, Cosslett showed that the asymptotic covariance matrix of $\hat{\beta}$ is equal to the first $K \times K$ block of $[-E\,\partial^2\Omega/\partial\alpha\,\partial\alpha']^{-1}$.

Cosslett showed that the asymptotic covariance matrix of $\hat{\beta}$ attains the lower bound of the covariance matrix of any unbiased estimator of $\beta$. This remarkable result is obtained by generalizing the Cramér-Rao lower bound (Theorem 1.3.1) to a likelihood function that depends on an unknown density function as in (9.5.47).

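To present the gist of Cosslett’s argument, we must first define the concept of differentiation of a functional with respect to a function. Let $F(f)$ be a mapping from a function to a real number (such a mapping is called a functional). Then we define

$$\left.\frac{\partial F}{\partial f}\right|_{\zeta} = \lim_{\epsilon \to 0} \frac{F(f + \epsilon\zeta) - F(f)}{\epsilon},$$

where $\zeta$ is a function for which $F(f + \epsilon\zeta)$ can be defined. Let $t_1$ (a $K$-vector) and $t_2$ (a scalar) be unbiased estimators of $\beta$ and $\int f(x)\zeta(x)\,dx$ for some function $\zeta$ such that $\int \zeta(x)\,dx = 0$ and $\int \zeta(x)^2\,dx = 1$. Then Cosslett showed that the $(2K + 2)$-vector

$$\left[\, t_1',\ t_2,\ \frac{\partial \log L}{\partial \beta'},\ \left.\frac{\partial \log L}{\partial f}\right|_{\zeta} \right]'$$

has covariance matrix of the form

$$\begin{bmatrix} C & I \\ I & R(\zeta) \end{bmatrix},$$

where $C$ is the covariance matrix of $(t_1', t_2)'$. Because this covariance matrix is positive definite, we have $C > R(\zeta)^{-1}$, as in Theorem 1.3.1. Finally, it is shown that the asymptotic covariance matrix of $\hat{\beta}$ is equal to the first $K \times K$ block of $\max_{\zeta} R(\zeta)^{-1}$.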
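As a small illustration of the functional derivative used in this argument, consider the functional $F(f) = \log \int P(s|x, \beta) f(x)\,dx$, which is the kind of term that enters the log likelihood through $\tilde{Q}(s)$. The calculation below is a worked example added for clarity, with $\zeta$ denoting an arbitrary direction for which the integrals exist; it is not part of Cosslett’s proof.

$$F(f + \epsilon\zeta) = \log\left[\int P(s|x, \beta) f(x)\,dx + \epsilon \int P(s|x, \beta) \zeta(x)\,dx\right],$$

so that

$$\left.\frac{\partial F}{\partial f}\right|_{\zeta} = \lim_{\epsilon \to 0} \frac{F(f + \epsilon\zeta) - F(f)}{\epsilon} = \frac{\int P(s|x, \beta) \zeta(x)\,dx}{\int P(s|x, \beta) f(x)\,dx}.$$

The derivative is linear in the direction $\zeta$, just as an ordinary directional derivative is linear in the direction vector.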

Thus Cosslett seems justified in saying that $\hat{\beta}$ is asymptotically efficient in the sense defined in Section 4.2.4. As we remarked in that section, this does not mean that $\hat{\beta}$ has the smallest asymptotic covariance matrix among all consistent estimators. Whether the results of LeCam and Rao mentioned in Section 4.2.4 also apply to Cosslett’s model remains to be shown.