# Multinomial Generalizations

In all the models we have considered so far in Section 10.10, the sign of y₁ᵢ* determined two basic categories of observations, such as union members versus nonunion members, states with an antidiscrimination law versus those without, or college-goers versus non-college-goers. By a multinomial generalization of Type 5, we mean a model in which observations are classified into more than two categories. We shall devote most of this subsection to a discussion of the article by Duncan (1980).

Duncan presented a model of joint determination of the location of a firm and its input-output vectors. A firm chooses the location for which profits are maximized, and only the input-output vector for the chosen location is observed. Let sᵢ(k) be the profit of the ith firm when it chooses location k, i = 1, 2, . . . , n and k = 1, 2, . . . , K, and let yᵢ(k) be the input-output vector for the ith firm at the kth location. To simplify the analysis, we shall subsequently assume yᵢ(k) is a scalar, for a generalization to the vector case is straightforward. It is assumed that

sᵢ(k) = x⁽¹⁾ᵢₖ′β + uᵢₖ  (10.10.26)

and

yᵢ(k) = x⁽²⁾ᵢₖ′β + vᵢₖ,  (10.10.27)

where x^’ and are vector functions of the input-output prices and eco­nomic theory dictates that the same P appears in both equations.19 It is as­sumed that (un, ua,. . . , uiK, vn, va,… , vx) are i. i.d. drawings from a 2AT-variate normal distribution. Suppose s,(k,) > Sj(j) for any j Ф kt. Then a researcher observes y,(fc,) but does not observe yt(j) for j Ф kt.

For the following discussion, it is useful to define K binary variables for each i by

wᵢ(k) = 1 if the ith firm chooses the kth location  (10.10.28)

      = 0 otherwise

and define the vector wᵢ = [wᵢ(1), wᵢ(2), . . . , wᵢ(K)]′. Also define Pᵢₖ = P[wᵢ(k) = 1] and the vector Pᵢ = (Pᵢ₁, Pᵢ₂, . . . , PᵢK)′.

There are many ways to write the likelihood function of the model, but perhaps the most illuminating way is to write it as

L = Πᵢ f[yᵢ(kᵢ)|wᵢ(kᵢ) = 1] Pᵢₖᵢ,  (10.10.29)

where kᵢ is the actual location the ith firm was observed to choose.

The estimation method proposed by Duncan can be outlined as follows:

Step 1. Estimate the β that characterize f in (10.10.29) by nonlinear WLS.

Step 2. Estimate the β that characterize P in (10.10.29) by the multinomial probit MLE using the nonlinear WLS iteration.

Step 3. Choose the optimal linear combination of the two estimates of β obtained in steps 1 and 2.

To describe step 1 explicitly, we must evaluate μᵢ = E[yᵢ(kᵢ)|wᵢ(kᵢ) = 1] and σᵢ² = V[yᵢ(kᵢ)|wᵢ(kᵢ) = 1] as functions of β and the variances and covariances of the error terms of Eqs. (10.10.26) and (10.10.27). These conditional moments can be obtained as follows. Define zᵢ(j) = sᵢ(kᵢ) − sᵢ(j) and the (K − 1)-vector zᵢ = [zᵢ(1), . . . , zᵢ(kᵢ − 1), zᵢ(kᵢ + 1), . . . , zᵢ(K)]′. To simplify the notation, write zᵢ as z, omitting the subscript. Similarly, write yᵢ(kᵢ) as y. Also define R = E(y − Ey)(z − Ez)′[E(z − Ez)(z − Ez)′]⁻¹ and Q = Vy − RE(z − Ez)(y − Ey). Then we obtain²⁰

μᵢ = E(y|z > 0) = Ey + RE(z|z > 0) − REz  (10.10.30)

and

σᵢ² = V(y|z > 0) = RV(z|z > 0)R′ + Q.  (10.10.31)
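Equations (10.10.30) and (10.10.31) follow from joint normality: y equals its linear projection Ey + R(z − Ez) plus an error with variance Q that is independent of z, so conditioning on z > 0 shifts the mean and variance of y only through z. A minimal Monte Carlo check, with illustrative parameter values (not Duncan's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters: scalar y and a 2-vector z, jointly normal.
mean = np.array([1.0, 0.3, 0.5])          # (Ey, Ez1, Ez2)
cov = np.array([[1.0, 0.5, 0.3],
                [0.5, 1.0, 0.4],
                [0.3, 0.4, 1.0]])

Vy, Cyz, Vz = cov[0, 0], cov[0, 1:], cov[1:, 1:]
R = Cyz @ np.linalg.inv(Vz)               # R = E(y-Ey)(z-Ez)'[V(z)]^{-1}
Q = Vy - R @ Cyz                          # Q = Vy - R E(z-Ez)(y-Ey)

draws = rng.multivariate_normal(mean, cov, size=1_000_000)
y, z = draws[:, 0], draws[:, 1:]
keep = (z > 0).all(axis=1)                # condition on z > 0 componentwise
y_t, z_t = y[keep], z[keep]

mu = mean[0] + R @ (z_t.mean(axis=0) - mean[1:])   # (10.10.30)
s2 = R @ np.cov(z_t.T) @ R + Q                     # (10.10.31)

print(mu - y_t.mean(), s2 - y_t.var())    # both differences are near zero
```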

The conditional moments of z appearing in (10.10.30) and (10.10.31) can be found in the articles by Amemiya (1974b, p. 1002) and Duncan (1980, p. 850). Finally, we can describe the nonlinear WLS iteration of step 1 as follows: Estimate σᵢ² by inserting the initial estimates (for example, those obtained by minimizing Σᵢ [yᵢ(kᵢ) − μᵢ]²) of the parameters into the right-hand side of (10.10.31); call it σ̂ᵢ². Minimize

Σᵢ σ̂ᵢ⁻² [yᵢ(kᵢ) − μᵢ]²  (10.10.32)

with respect to the parameters that appear in the right-hand side of (10.10.30). Use these estimates to evaluate the right-hand side of (10.10.31) again to get another estimate of σᵢ². Repeat the process to yield new estimates of β.
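The step 1 iteration can be sketched as follows. The mean and variance functions here are hypothetical stand-ins chosen so the example is self-contained; in Duncan's procedure they would be the right-hand sides of (10.10.30) and (10.10.31).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins for the model-implied conditional moments.
def mu(theta, x):
    return theta * x                       # conditional mean

def sigma2(theta, x):
    return 1.0 + mu(theta, x) ** 2         # parameter-dependent variance

n = 5_000
x = rng.uniform(0.5, 2.0, n)
theta_true = 1.5
y = mu(theta_true, x) + np.sqrt(sigma2(theta_true, x)) * rng.standard_normal(n)

# Initial estimate: minimize the unweighted sum of squares
# (closed form because mu is linear in theta here).
theta = (x @ y) / (x @ x)

# Iterate: evaluate sigma2 at the current estimate, re-minimize the
# weighted sum  sum_i sigma2_i^{-1} (y_i - mu_i)^2,  and repeat until
# the estimates converge.
for _ in range(100):
    w = 1.0 / sigma2(theta, x)
    theta_new = ((w * x) @ y) / ((w * x) @ x)
    if abs(theta_new - theta) < 1e-10:
        theta = theta_new
        break
    theta = theta_new

print(theta)   # close to theta_true
```

Because the weights depend only on x and the current estimate, each re-minimization remains consistent for theta, and the iteration settles at a fixed point.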

Now consider step 2. Define

Σᵢ ≡ E(wᵢ − Pᵢ)(wᵢ − Pᵢ)′ = Dᵢ − PᵢPᵢ′,  (10.10.33)

where Dᵢ is the K × K diagonal matrix the kth diagonal element of which is Pᵢₖ. To perform the nonlinear WLS iteration, first, estimate Σᵢ by inserting the initial estimates of the parameters into the right-hand side of (10.10.33) (denote the estimate thus obtained as Σ̂ᵢ); second, minimize

Σᵢ (wᵢ − Pᵢ)′ Σ̂ᵢ⁻ (wᵢ − Pᵢ),  (10.10.34)

where the minus sign in the superscript denotes a generalized inverse, with respect to the parameters that characterize Pᵢ; and repeat the process until the estimates converge. A generalized inverse A⁻ of A is any matrix that satisfies AA⁻A = A (Rao, 1973, p. 24). A generalized inverse Σᵢ⁻ is obtained from the matrix Dᵢ⁻¹ + Pᵢₖ⁻¹11′, where 1 is a vector of ones, by replacing its kth column and row by a zero vector. It is not unique because we may choose any k.
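The covariance matrix (10.10.33), the construction of its generalized inverse, and the defining property AA⁻A = A can all be checked numerically; the probability vector below is an illustrative example.

```python
import numpy as np

P = np.array([0.1, 0.2, 0.3, 0.4])   # illustrative choice probabilities (K = 4)
Sigma = np.diag(P) - np.outer(P, P)  # (10.10.33): D - PP'

# Build a generalized inverse: form D^{-1} + P_k^{-1} 11' and replace the
# kth row and column by zeros; any choice of k works, so it is not unique.
def g_inverse(P, k):
    G = np.diag(1.0 / P) + np.ones((P.size, P.size)) / P[k]
    G[k, :] = 0.0
    G[:, k] = 0.0
    return G

# Verify the defining property A A^- A = A (Rao, 1973) for two choices of k.
print(np.allclose(Sigma @ g_inverse(P, 0) @ Sigma, Sigma))
print(np.allclose(Sigma @ g_inverse(P, 3) @ Sigma, Sigma))
```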

Finally, regarding step 3, if we denote the two estimates of β obtained by steps 1 and 2 by β̂₁ and β̂₂, respectively, and their respective asymptotic variance-covariance matrices by V₁ and V₂, the optimal linear combination of the two estimates is given by (V₁⁻¹ + V₂⁻¹)⁻¹V₁⁻¹β̂₁ + (V₁⁻¹ + V₂⁻¹)⁻¹V₂⁻¹β̂₂. This final estimator is asymptotically not fully efficient, however. To see this, suppose the regression coefficients of (10.10.26) and (10.10.27) differ: call them β₂ and β₁, say. Then, by a result of Amemiya (1976b), we know that β̂₂ is an asymptotically efficient estimator of β₂. However, as we have indicated in Section 10.4.4, β̂₁ is not asymptotically efficient. So a weighted average of the two could not be asymptotically efficient.
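The matrix-weighted combination of step 3 can be sketched as follows; the estimates and variance matrices are made-up numbers. Under the independence that the optimal weighting presumes, the variance of the combination is (V₁⁻¹ + V₂⁻¹)⁻¹, which is smaller than either V₁ or V₂ in the matrix sense.

```python
import numpy as np

def combine(b1, V1, b2, V2):
    """Optimal matrix-weighted average of two independent estimators."""
    V1i, V2i = np.linalg.inv(V1), np.linalg.inv(V2)
    W = np.linalg.inv(V1i + V2i)          # variance of the combination
    return W @ (V1i @ b1 + V2i @ b2), W

# Made-up estimates and asymptotic variance-covariance matrices.
b1 = np.array([1.1, 0.4]); V1 = np.array([[0.10, 0.02], [0.02, 0.08]])
b2 = np.array([0.9, 0.6]); V2 = np.array([[0.05, 0.01], [0.01, 0.20]])

b, W = combine(b1, V1, b2, V2)
print(b)
# V1 - W and V2 - W are positive semidefinite: the combination is at
# least as precise as either estimator alone.
print(np.linalg.eigvalsh(V1 - W).min() >= 0,
      np.linalg.eigvalsh(V2 - W).min() >= 0)
```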

Dubin and McFadden (1984) used a model similar to that of Duncan in their study of the joint determination of the choice of electric appliances and the consumption of electricity. In their model, sᵢ(k) may be interpreted as the utility of the ith family when it uses the kth portfolio of appliances, and yᵢ(k) as the consumption of electricity for the ith family holding the kth portfolio. The estimation method is essentially similar to Duncan's method. The main difference is that Dubin and McFadden assumed that the error terms of (10.10.26) and (10.10.27) are distributed with Type I extreme-value distribution and hence that the P part of (10.10.29) is multinomial logit.
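The logit form of the P part can be illustrated directly: with i.i.d. Type I extreme-value (Gumbel) disturbances added to deterministic utilities, the probability that alternative k yields the maximum utility is exp(vₖ)/Σⱼ exp(vⱼ). The utilities below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def mnl_probs(v):
    e = np.exp(v - v.max())               # subtract max to avoid overflow
    return e / e.sum()

v = np.array([0.5, 1.0, -0.2])            # hypothetical deterministic utilities
p = mnl_probs(v)

# Simulate utility maximization with Gumbel errors and compare choice
# frequencies with the closed-form logit probabilities.
choices = (v + rng.gumbel(size=(500_000, 3))).argmax(axis=1)
freq = np.bincount(choices, minlength=3) / choices.size
print(p)
print(freq)
```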

Exercises

1. (Section 10.4.3)

Verify (10.4.19).

2. (Section 10.4.3)

Verify (10.4.28).

3. (Section 10.4.3)

Consider Vγ̂W and Vγ̃W given in (10.4.32) and (10.4.33). As stated in the text, the difference of the two matrices is neither positive definite nor negative definite. Show that the first part of Vγ̂W, namely, σ²(Z′Σ⁻¹Z)⁻¹, is smaller than Vγ̃W in the matrix sense.

4. (Section 10.4.5)

In the standard Tobit model (10.2.3), assume that σ² = 1, β is a scalar and the only unknown parameter, and {xᵢ} are i.i.d. binary random variables taking 1 with probability p and 0 with probability 1 − p. Derive the formulae of p · AV[√n(β̂ − β)] for β̂ = Probit MLE, Tobit MLE, Heckman's LS, and NLLS. Evaluate them for β = 0, 1, and 2.

5. (Section 10.4.6)

Consider the following model:

yᵢ = 1 if yᵢ* ≥ 0

   = 0 if yᵢ* < 0, i = 1, 2, . . . , n,

where {yᵢ*} are independent N(xᵢ′β, 1). It is assumed that {yᵢ} are observed but {yᵢ*} are not. Write step-by-step instructions for the EM algorithm to obtain the MLE of β and show that the MLE is an equilibrium solution of the iteration.

6. (Section 10.6)

Consider the following model:

y₁ᵢ* = x₁ᵢ′β₁ + u₁ᵢ

y₂ᵢ* = x₂ᵢ′β₂ + u₂ᵢ

y₁ᵢ = y₁ᵢ* if y₂ᵢ* > 0
    = 0 if y₂ᵢ* ≤ 0
y₂ᵢ = 1 if y₂ᵢ* > 0
    = 0 if y₂ᵢ* ≤ 0, i = 1, 2, . . . , n,

where (u₁ᵢ, u₂ᵢ) are i.i.d. with the continuous density f(·, ·). Denote the marginal density of u₁ᵢ by f₁(·) and that of u₂ᵢ by f₂(·).

a. Assuming that y₁ᵢ, y₂ᵢ, x₁ᵢ, and x₂ᵢ are observed for i = 1, 2, . . . , n, express the likelihood function in terms of f, f₁, and f₂.

b. Assuming that y₁ᵢ, y₂ᵢ, x₁ᵢ, and x₂ᵢ are observed for all i, express the likelihood function in terms of f and f₂.

7. (Section 10.6)

Consider the following model:

yᵢ* = αzᵢ + uᵢ
zᵢ* = βyᵢ + vᵢ
yᵢ = 1 if yᵢ* ≥ 0
   = 0 if yᵢ* < 0
zᵢ = 1 if zᵢ* ≥ 0
   = 0 if zᵢ* < 0,

where uᵢ and vᵢ are jointly normal with zero means and nonzero covariance. Assume that y*, z*, u, and v are unobservable and y and z are observable. Show that the model makes sense (that is, y and z are uniquely determined as functions of u and v) if and only if αβ = 0.

8. (Section 10.6)

In the model of Exercise 7, assume that β = 0 and that we have n i.i.d. observations on (yᵢ, zᵢ), i = 1, 2, . . . , n. Write the likelihood function of α. You may write the joint density of (u, v) as simply f(u, v) without explicitly writing the bivariate normal density.

9. (Section 10.6)

Suppose yᵢ* and zᵢ*, i = 1, 2, . . . , n, are i.i.d. and jointly normally distributed with nonzero correlation. For each i, we observe (1) only yᵢ*, (2) only zᵢ*, or (3) neither, according to the following scheme:

(1) Observe yᵢ = yᵢ* and do not observe zᵢ* if yᵢ* ≥ zᵢ* ≥ 0.

(2) Observe zᵢ = zᵢ* and do not observe yᵢ* if zᵢ* > yᵢ* ≥ 0.

(3) Do not observe either if yᵢ* < 0 or zᵢ* < 0.

Write down the likelihood function of the model. You may write the joint normal density simply as f(·, ·).

10. (Section 10.7.1)

Write the likelihood function of the following two models (cf. Cragg, 1971).

a. (y₁*, y₂*) ~ Bivariate N(x₁′β₁, x₂′β₂, 1, σ², σ₁₂)

y₂ = y₂* if y₁* > 0 and y₂* > 0

   = 0 otherwise.

We observe only y₂.

b. (y₁*, y₂*) ~ Bivariate N(x₁′β₁, x₂′β₂, 1, σ², σ₁₂) with y₂* truncated so that y₂* > 0

y₂ = y₂* if y₁* > 0

   = 0 if y₁* ≤ 0.

We observe only y₂.

11. (Section 10.9.4)

In Tomes' model defined by (10.9.7) through (10.9.9), consider the following estimation method: Step 1. Regress y₂ᵢ on x₁ᵢ and x₂ᵢ and obtain the least squares predictor ŷ₂ᵢ. Step 2. Substitute ŷ₂ᵢ for y₂ᵢ in (10.9.7) and apply the Tobit MLE to Eqs. (10.9.7) and (10.9.9). Will this method yield consistent estimates of γ₁ and β₁?

12. (Section 10.10)

Suppose the joint distribution of a binary variable w and a continuous variable y is determined by P(w = 1|y) = Λ(γ₁y) and f(y|w) = N(γ₂w, σ²). Show that we must assume σ²γ₁ = γ₂ for logical consistency.

13. (Section 10.10.1)

In model (10.10.1), Type 5 Tobit, define an observed variable yᵢ by

yᵢ = y₂ᵢ* if y₁ᵢ* > 0
   = y₃ᵢ* if y₁ᵢ* ≤ 0

and assume that a researcher does not observe whether y₁ᵢ* > 0 or ≤ 0; that is, the sample separation is unknown. Write the likelihood function of this model.

14. (Section 10.10.4)

Let (y₁ᵢ*, y₂ᵢ*, y₃ᵢ*) be a three-dimensional vector of continuous random variables that are independent across i = 1, 2, . . . , n but may be correlated among themselves for each i. These random variables are unobserved; instead, we observe zᵢ and yᵢ defined as follows:

zᵢ = y₂ᵢ* if y₁ᵢ* > 0
   = 0 if y₁ᵢ* ≤ 0

yᵢ = 0 with probability λ, = y₃ᵢ* with probability 1 − λ, if y₁ᵢ* > 0
   = 0 if y₁ᵢ* ≤ 0.

Write down the likelihood function. Use the following symbols:

f₂(y₁ᵢ, y₂ᵢ)  joint density of y₁ᵢ* and y₂ᵢ*

f₃(y₁ᵢ, y₃ᵢ)  joint density of y₁ᵢ* and y₃ᵢ*

15. (Section 10.10.4)

Consider a regression model: y₁ᵢ* = x₁ᵢ′β₁ + u₁ᵢ and y₂ᵢ* = x₂ᵢ′β₂ + u₂ᵢ, where the observable random variable yᵢ is defined by yᵢ = y₁ᵢ* with probability λ and yᵢ = y₂ᵢ* with probability 1 − λ. This is called a switching regression model. Write down the likelihood function of the model, assuming that (u₁ᵢ, u₂ᵢ) are i.i.d. with joint density f(·, ·).

16. (Section 10.10.6)

Show ΣᵢΣᵢ⁻Σᵢ = Σᵢ, where Σᵢ is given in (10.10.33) and Σᵢ⁻ is given after (10.10.34). Let wᵢ* and Pᵢ* be the vectors obtained by eliminating the kth element from wᵢ and Pᵢ, where k can be arbitrary, and let Σᵢ* be the variance-covariance matrix of wᵢ*. Then show (wᵢ − Pᵢ)′Σᵢ⁻(wᵢ − Pᵢ) = (wᵢ* − Pᵢ*)′(Σᵢ*)⁻¹(wᵢ* − Pᵢ*).