# Choice-Based Sampling

## 9.5.1 Introduction

Consider the multinomial QR model (9.3.1) or its special case (9.3.4). Up until now we have specified only the conditional probabilities of alternatives $j = 0, 1, \ldots, m$ given a vector of exogenous or independent variables $x$ and have based our statistical inference on the conditional probabilities. Thus we have been justified in treating $x$ as a vector of known constants just as in the classical linear regression model of Chapter 1. We shall now treat both $j$ and $x$ as random variables and consider different sampling schemes that specify how $j$ and $x$ are sampled.

First, we shall list a few basic symbols frequently used in the subsequent discussion:

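- $P(j \mid x, \beta)$ or $P(j)$: conditional probability that the $j$th alternative is chosen, given the exogenous variables $x$
- $P(j \mid x, \beta_0)$ or $P_0(j)$: the above evaluated at the true value of $\beta$
- $f(x)$: true density of $x$
- $g(x)$: density according to which a researcher draws $x$
- $H(j)$: probability according to which a researcher draws $j$
- $Q(j) = Q(j \mid \beta) = \int P(j \mid x, \beta) f(x)\,dx$
- $Q_0(j) = Q(j \mid \beta_0) = \int P(j \mid x, \beta_0) f(x)\,dx$

Let $i = 1, 2, \ldots, n$ be the individuals sampled according to some scheme. Then we can denote the alternative and the vector of the exogenous variables observed for the $i$th individual by $j_i$ and $x_i$, respectively.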

We consider two types of sampling schemes called exogenous sampling and endogenous sampling (or choice-based sampling in the QR model). The first refers to sampling on the basis of exogenous variables, and the second refers to sampling on the basis of endogenous variables. The different sampling schemes are characterized by their likelihood functions. The likelihood function associated with exogenous sampling is given by

$$L_e = \prod_{i=1}^{n} P(j_i \mid x_i, \beta)\, g(x_i). \tag{9.5.1}$$

The likelihood function associated with choice-based sampling is given by

$$L_c = \prod_{i=1}^{n} P(j_i \mid x_i, \beta)\, f(x_i)\, Q(j_i \mid \beta)^{-1} H(j_i) \tag{9.5.2}$$

if $Q(j \mid \beta_0)$ is unknown and by

$$L_{c0} = \prod_{i=1}^{n} P(j_i \mid x_i, \beta)\, f(x_i)\, Q(j_i \mid \beta_0)^{-1} H(j_i) \tag{9.5.3}$$

if $Q(j \mid \beta_0)$ is known. Note that if $g(x) = f(x)$ and $H(j) = Q(j \mid \beta_0)$, (9.5.1) and (9.5.3) both become

$$L^* = \prod_{i=1}^{n} P(j_i \mid x_i, \beta)\, f(x_i), \tag{9.5.4}$$

which is the standard likelihood function associated with random sampling. This is precisely the likelihood function considered in the previous sections.
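The reduction to the random-sampling likelihood can be checked numerically. The following is a minimal sketch, assuming a binary logit with a scalar regressor and a discrete density for x on three support points; the function names, the support points, the parameter values, and the five-observation sample are all hypothetical and chosen only for illustration.

```python
import math

def choice_prob(j, x, beta):
    """P(j | x, beta) for a binary logit with alternatives j in {0, 1}."""
    p1 = 1.0 / (1.0 + math.exp(-beta * x))
    return p1 if j == 1 else 1.0 - p1

# Hypothetical true density f(x) on a discrete support (an assumption).
f = {-1.0: 0.5, 0.0: 0.3, 2.0: 0.2}

def Q(j, beta):
    """Marginal choice probability Q(j | beta) = sum over x of P(j | x, beta) f(x)."""
    return sum(choice_prob(j, x, beta) * fx for x, fx in f.items())

def log_L_e(sample, beta, g):
    """Log of (9.5.1): exogenous sampling with design density g."""
    return sum(math.log(choice_prob(j, x, beta) * g[x]) for j, x in sample)

def log_L_c(sample, beta, H):
    """Log of (9.5.2): choice-based sampling, Q(j | beta) varies with beta."""
    return sum(math.log(choice_prob(j, x, beta) * f[x] * H[j] / Q(j, beta))
               for j, x in sample)

def log_L_c0(sample, beta, H, Q0):
    """Log of (9.5.3): choice-based sampling with Q(j | beta_0) known."""
    return sum(math.log(choice_prob(j, x, beta) * f[x] * H[j] / Q0[j])
               for j, x in sample)

def log_L_star(sample, beta):
    """Log of the random-sampling likelihood: product of P(j | x, beta) f(x)."""
    return sum(math.log(choice_prob(j, x, beta) * f[x]) for j, x in sample)

beta0 = 0.8                                  # hypothetical true parameter
Q0 = {j: Q(j, beta0) for j in (0, 1)}        # Q(j | beta_0)
sample = [(1, 2.0), (0, -1.0), (1, 0.0), (0, 0.0), (1, -1.0)]  # (j_i, x_i) pairs
beta = 0.5                                   # trial value at which to evaluate

# With g = f and H = Q0, both (9.5.1) and (9.5.3) agree with log L*.
print(log_L_e(sample, beta, f) - log_L_star(sample, beta))
print(log_L_c0(sample, beta, Q0, Q0) - log_L_star(sample, beta))
# (9.5.2) still differs, because Q(j | beta) is evaluated at the trial beta.
print(log_L_c(sample, beta, Q0) - log_L_star(sample, beta))
```

The first two differences are zero up to rounding, while the third is not, which illustrates why the known-Q and unknown-Q cases lead to different likelihood functions even under the same design.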

Although (9.5.2) may seem unfamiliar, it can be explained easily as follows. Consider drawing random variables $j$ and $x$ in the following order. We can first draw $j$ with probability $H(j)$ and then, given $j$, we can draw $x$ according to the conditional density $f(x \mid j)$. Thus the joint probability is $f(x \mid j) H(j)$, which by Bayes's rule is equal to $P(j \mid x) f(x) Q(j)^{-1} H(j)$.
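The joint probability just derived can be tabulated for a small discrete example. The sketch below assumes a binary logit, a discrete density for x, and a design that draws each alternative with probability one half; all of these values and names are hypothetical. It checks that the joint probabilities sum to one, that the marginal distribution of j is the design distribution H, and that choosing H equal to Q gives back the random-sampling joint probability P(j | x) f(x).

```python
import math

def choice_prob(j, x, beta):
    """P(j | x, beta) for a binary logit with alternatives j in {0, 1}."""
    p1 = 1.0 / (1.0 + math.exp(-beta * x))
    return p1 if j == 1 else 1.0 - p1

f = {-1.0: 0.5, 0.0: 0.3, 2.0: 0.2}   # hypothetical density of x
beta = 0.8                             # hypothetical parameter value
H = {0: 0.5, 1: 0.5}                   # hypothetical design probabilities

# Marginal choice probabilities Q(j) = sum over x of P(j | x) f(x).
Q = {j: sum(choice_prob(j, x, beta) * fx for x, fx in f.items()) for j in (0, 1)}

def joint(j, x, design):
    """f(x | j) H(j), written via Bayes's rule as P(j | x) f(x) H(j) / Q(j)."""
    return choice_prob(j, x, beta) * f[x] * design[j] / Q[j]

# The joint probabilities over all (j, x) pairs sum to one.
total = sum(joint(j, x, H) for j in (0, 1) for x in f)

# Summing over x recovers the design distribution H(j).
marginal_j = {j: sum(joint(j, x, H) for x in f) for j in (0, 1)}

# With the design set to Q, the joint reduces to P(j | x) f(x).
max_gap = max(abs(joint(j, x, Q) - choice_prob(j, x, beta) * f[x])
              for j in (0, 1) for x in f)

print(total, marginal_j, max_gap)
```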

This sampling scheme is different from a scheme in which the proportion of people choosing alternative $j$ is a priori determined and fixed. This latter scheme may be a more realistic one. (Hsieh, Manski, and McFadden, 1983, have discussed this sampling scheme.) However, we shall adopt the definition of the preceding paragraphs (following Manski and Lerman, 1977) because in this way choice-based sampling contains random sampling as a special case [$Q(j) = H(j)$] and because the two definitions lead to the same estimators with the same asymptotic properties.

Choice-based sampling is useful in a situation where exogenous sampling or random sampling would find only a very small number of people choosing a particular alternative. For example, suppose a small proportion of residents in a particular city use the bus for going to work. Then, to ensure that a certain number of bus riders are included in the sample, it would be much less expensive to interview people at a bus depot than to conduct a random survey of homes. Thus it is expected that random sampling augmented with choice-based sampling of rare alternatives would maximize the efficiency of estimation within the budgetary constraints of a researcher. Such augmentation can be analyzed in the framework of generalized choice-based sampling proposed by Cosslett (1981a) (to be discussed in Section 9.5.4).
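A rough sense of the bus example can be had from simple arithmetic. The sketch below assumes, purely for illustration, that 5 percent of commuters ride the bus and that a choice-based design draws bus riders with probability one half; neither figure comes from the text, and differences in the cost per interview are ignored.

```python
def expected_count(n, share):
    """Expected number of sampled individuals who chose the alternative."""
    return n * share

def interviews_needed(target, share):
    """Sample size at which the expected count of choosers reaches `target`."""
    return target / share

Q0_bus = 0.05   # hypothetical population share of bus riders
H_bus = 0.5     # hypothetical design probability of drawing a bus rider
n = 1000        # hypothetical total number of interviews

riders_random = expected_count(n, Q0_bus)        # random sampling: H = Q0
riders_choice_based = expected_count(n, H_bus)   # choice-based design

# Interviews required for an expected 200 bus riders under each scheme.
needed_random = interviews_needed(200, Q0_bus)
needed_choice_based = interviews_needed(200, H_bus)

print(riders_random, riders_choice_based)
print(needed_random, needed_choice_based)
```

Under these assumed figures the choice-based design yields ten times as many bus riders for the same number of interviews, which is the sense in which it economizes on observations of a rare alternative.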

In the subsequent subsections we shall discuss four articles: Manski and Lerman (1977), Cosslett (1981a), Cosslett (1981b), and Manski and McFadden (1981). These articles together cover the four different types of models, varying according to whether $f$ is known and whether $Q$ is known, and cover five estimators of $\beta$: the exogenous sampling maximum likelihood estimator (ESMLE), the random sampling maximum likelihood estimator (RSMLE), the choice-based sampling maximum likelihood estimator (CBMLE), the Manski-Lerman weighted maximum likelihood estimator (WMLE), and the Manski-McFadden estimator (MME).

A comparison of RSMLE and CBMLE is important because within the framework of choice-based sampling a researcher controls $H(j)$, and the particular choice $H(j) = Q_0(j)$ yields random sampling. The choice of $H(j)$ is an important problem of sample design and, as we shall see later, $H(j) = Q_0(j)$ is not necessarily an optimal choice.

Table 9.6 indicates how the definitions of RSMLE and CBMLE vary with the four types of model; it also indicates in which article each case is discussed. Note that RSMLE = CBMLE if $Q$ is known. ESMLE, which is not listed in Table 9.6, is the same as RSMLE except when $f$ is unknown and $Q$ is known. In that case ESMLE maximizes $L_e$ with respect to $\beta$ without constraints. RSMLE and CBMLE for the case of known $Q$ will be referred to as the constrained RSMLE and the constrained CBMLE, respectively. For the case of unknown $Q$, we shall attach unconstrained to each estimator.

Table 9.6 Models, estimators, and cross references

| $f$ | $Q$ | RSMLE | CBMLE | WMLE | MME |
|---|---|---|---|---|---|
| Known | Known | Max. $L^*$ wrt $\beta$ subject to $Q_0 = \int Pf\,dx$ | MM: Max. $L_c$ wrt $\beta$ subject to $Q_0 = \int Pf\,dx$ | MAL | — |
| Known | Unknown | Max. $L^*$ wrt $\beta$ | MM: Max. $L_c$ wrt $\beta$ | — | — |
| Unknown | Known | Max. $L^*$ wrt $\beta$ and $f$ subject to $Q_0 = \int Pf\,dx$ | C2 (see also Cosslett, 1978): Max. $L_{c0}$ wrt $\beta$ and $f$ subject to $Q_0 = \int Pf\,dx$ | MAL | MM |
| Unknown | Unknown | Max. $L^*$ wrt $\beta$ | C1 (also proves asymptotic efficiency): Max. $L_c$ wrt $\beta$ and $f$ | — | — |

Note: RSMLE = random sampling maximum likelihood estimator; CBMLE = choice-based sampling maximum likelihood estimator; WMLE = Manski-Lerman weighted maximum likelihood estimator; MME = Manski-McFadden estimator.

MM = Manski and McFadden (1981); MAL = Manski and Lerman (1977); C2 = Cosslett (1981b); C1 = Cosslett (1981a).