# Maximum Likelihood Estimator

4.2.1 Definition

Let $L_T(\theta) = L(\mathbf{y}, \theta)$ be the joint density of a $T$-vector of random variables $\mathbf{y} = (y_1, y_2, \ldots, y_T)'$ characterized by a $K$-vector of parameters $\theta$. When we regard it as a function of $\theta$, we call it the likelihood function. The term maximum likelihood estimator (MLE) is often used to mean two different concepts: (1) the value of $\theta$ that globally maximizes the likelihood function $L(\mathbf{y}, \theta)$ over the parameter space $\Theta$; or (2) any root of the likelihood equation

$$\frac{\partial L_T(\theta)}{\partial \theta} = 0 \tag{4.2.1}$$

that corresponds to a local maximum. We use it only in the second sense and use the term global maximum likelihood estimator to refer to the first concept. We sometimes use the term local maximum likelihood estimator to refer to the second concept.

4.2.2 Consistency

The conditions for the consistency of the global MLE or the local MLE can be immediately obtained from Theorem 4.1.1 or Theorem 4.1.2 by putting $Q_T(\theta) = \log L_T(\theta)$. We consider the logarithm of the likelihood function because $T^{-1} \log L_T(\theta)$ usually converges to a finite constant. Clearly, taking the logarithm does not change the location of either global or local maxima.

So far we have not made any assumption about the distribution of $\mathbf{y}$. If we assume that $\{y_t\}$ are i.i.d. with common density function $f(\cdot, \theta)$, we can write

$$\log L(\mathbf{y}, \theta) = \sum_{t=1}^{T} \log f(y_t, \theta). \tag{4.2.2}$$
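The practical value of working with the sum in (4.2.2) rather than the product of densities can be seen in a small numerical sketch. It assumes, purely for illustration, an $N(\theta, 1)$ density with true value $\theta_0 = 0$; the function names are hypothetical and not part of the text.

```python
import math
import random

def log_f(y, theta):
    """Log density of N(theta, 1); the normal form is an illustrative assumption."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (y - theta) ** 2

def log_likelihood(ys, theta):
    """Equation (4.2.2): the log likelihood is the sum of the log densities."""
    return sum(log_f(y, theta) for y in ys)

rng = random.Random(0)                         # fixed seed for reproducibility
T = 2000
ys = [rng.gauss(0.0, 1.0) for _ in range(T)]   # sample drawn at theta_0 = 0

raw_likelihood = math.prod(math.exp(log_f(y, 0.0)) for y in ys)
average_log_lik = log_likelihood(ys, 0.0) / T

print(raw_likelihood)    # the product of 2000 densities underflows to 0.0
print(average_log_lik)   # stays finite, near -0.5*log(2*pi) - 0.5
```

The raw likelihood is numerically useless at this sample size, whereas $T^{-1} \log L_T$ settles near a finite constant, which is the quantity the consistency theorems work with.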

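In this case we can replace assumption C of either Theorem 4.1.1 or Theorem 4.1.2 by the following two assumptions:

$$E \sup_{\theta \in \Theta} |\log f(y_t, \theta)| < M, \quad \text{for some positive constant } M, \tag{4.2.3}$$

and

$$\log f(y_t, \theta) \text{ is a continuous function of } \theta \text{ for each } y_t. \tag{4.2.4}$$

In Theorem 4.2.1 we shall show that assumptions (4.2.3) and (4.2.4) imply

$$\operatorname{plim}_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \log f(y_t, \theta) = E \log f(y_t, \theta) \quad \text{uniformly in } \theta \in \Theta. \tag{4.2.5}$$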

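Furthermore, we have by Jensen's inequality (see Rao, 1973, p. 58, for the proof)

$$E \log \frac{f(y_t, \theta)}{f(y_t, \theta_0)} < \log E \frac{f(y_t, \theta)}{f(y_t, \theta_0)} \quad \text{for } \theta \neq \theta_0, \tag{4.2.6}$$

where the expectation is taken using the true value $\theta_0$. The right-hand side is $\log \int f(y, \theta)\, dy = \log 1 = 0$ (when the densities share a common support), and, therefore,

$$E \log f(y_t, \theta) < E \log f(y_t, \theta_0) \quad \text{for } \theta \neq \theta_0. \tag{4.2.7}$$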
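The inequality $E \log f(y_t, \theta) < E \log f(y_t, \theta_0)$ can be checked numerically. The following is a minimal Monte Carlo sketch, assuming an $N(\theta, 1)$ density and $\theta_0 = 1$; both choices, and the function names, are illustrative assumptions.

```python
import math
import random

def log_f(y, theta):
    """Log density of N(theta, 1); an illustrative choice of f."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (y - theta) ** 2

def expected_log_f(theta, theta0, draws=50_000, seed=0):
    """Monte Carlo estimate of E log f(y, theta) with y drawn at theta0."""
    rng = random.Random(seed)   # same seed: every theta is evaluated on the same draws
    total = sum(log_f(rng.gauss(theta0, 1.0), theta) for _ in range(draws))
    return total / draws

theta0 = 1.0
at_truth = expected_log_f(theta0, theta0)
elsewhere = {th: expected_log_f(th, theta0) for th in (0.0, 0.5, 1.5, 2.0)}

for th, value in elsewhere.items():
    print(th, round(value, 3), value < at_truth)
```

For this density the exact value is $-\tfrac{1}{2}\log 2\pi - \tfrac{1}{2}[1 + (\theta - \theta_0)^2]$, so the estimates should fall away from the maximum at $\theta_0$ roughly quadratically.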

As in (4.2.7), we have $T^{-1} E \log L_T(\theta) < T^{-1} E \log L_T(\theta_0)$ for $\theta \neq \theta_0$ and for all $T$. However, when we take the limit of both sides of the inequality (4.2.7) as $T$ goes to infinity, we have

$$\lim_{T \to \infty} T^{-1} E \log L_T(\theta) \leq \lim_{T \to \infty} T^{-1} E \log L_T(\theta_0).$$

Hence, one generally needs to assume that $\lim_{T \to \infty} T^{-1} E \log L_T(\theta)$ is uniquely maximized at $\theta = \theta_0$.

That (4.2.3) and (4.2.4) imply (4.2.5) follows from the following theorem when we put $g_t(\theta) = \log f(y_t, \theta) - E \log f(y_t, \theta)$.

Theorem 4.2.1. Let $g(y, \theta)$ be a measurable function of $y$ in Euclidean space for each $\theta \in \Theta$, a compact subset of $R^K$ (Euclidean $K$-space), and a continuous function of $\theta \in \Theta$ for each $y$. Assume $E g(y, \theta) = 0$. Let $\{y_t\}$ be a sequence of i.i.d. random vectors such that $E \sup_{\theta \in \Theta} |g(y_t, \theta)| < \infty$. Then $T^{-1} \sum_{t=1}^{T} g(y_t, \theta)$ converges to 0 in probability uniformly in $\theta \in \Theta$.

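Proof. Partition $\Theta$ into $n$ nonoverlapping regions $\Theta_1^n, \Theta_2^n, \ldots, \Theta_n^n$ in such a way that the distance between any two points within each $\Theta_i^n$ goes to 0 as $n$ goes to $\infty$. Let $\theta_1, \theta_2, \ldots, \theta_n$ be an arbitrary sequence of $K$-vectors such that $\theta_i \in \Theta_i^n$, $i = 1, 2, \ldots, n$. Then writing $g_t(\theta)$ for $g(y_t, \theta)$, we have for any $\varepsilon > 0$

$$\begin{aligned}
P\left[\sup_{\theta \in \Theta} \left| T^{-1} \sum_{t=1}^{T} g_t(\theta) \right| > \varepsilon \right]
&\leq P\left[\bigcup_{i=1}^{n} \left\{ \sup_{\theta \in \Theta_i^n} \left| T^{-1} \sum_{t=1}^{T} g_t(\theta) \right| > \varepsilon \right\} \right] \\
&\leq \sum_{i=1}^{n} P\left[ \sup_{\theta \in \Theta_i^n} \left| T^{-1} \sum_{t=1}^{T} g_t(\theta) \right| > \varepsilon \right] \\
&\leq \sum_{i=1}^{n} P\left[ \left| T^{-1} \sum_{t=1}^{T} g_t(\theta_i) \right| > \frac{\varepsilon}{2} \right]
+ \sum_{i=1}^{n} P\left[ T^{-1} \sum_{t=1}^{T} \sup_{\theta \in \Theta_i^n} |g_t(\theta) - g_t(\theta_i)| > \frac{\varepsilon}{2} \right],
\end{aligned} \tag{4.2.8}$$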

where the first inequality follows from the fact that if $A$ implies $B$ then $P(A) \leq P(B)$ and the last inequality follows from the triangle inequality. Because $g_t(\theta)$ is uniformly continuous in $\theta \in \Theta$, we have for every $i$

$$\lim_{n \to \infty} \sup_{\theta \in \Theta_i^n} |g_t(\theta) - g_t(\theta_i)| = 0. \tag{4.2.9}$$

But, because

$$\sup_{\theta \in \Theta_i^n} |g_t(\theta) - g_t(\theta_i)| \leq 2 \sup_{\theta \in \Theta} |g_t(\theta)| \tag{4.2.10}$$

and the right-hand side of the inequality (4.2.10) is integrable by our assumptions, (4.2.9) implies by the Lebesgue convergence theorem (Royden, 1968, p. 88)

$$\lim_{n \to \infty} E \sup_{\theta \in \Theta_i^n} |g_t(\theta) - g_t(\theta_i)| = 0 \tag{4.2.11}$$

uniformly for $i$. Take $n$ so large that the expected value in (4.2.11) is smaller than $\varepsilon/2$. Finally, the conclusion of the theorem follows from Theorem 3.3.2 (Kolmogorov LLN 2) by taking $T$ to infinity.
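The uniform convergence asserted by Theorem 4.2.1 can be illustrated by simulation. The sketch below assumes $y_t \sim N(0, 1)$, $\Theta = [-2, 2]$, and $g(y, \theta) = \log f(y, \theta) - E \log f(y, \theta)$ for an $N(\theta, 1)$ density; these choices and the helper names are illustrative assumptions, and the supremum is approximated on a finite grid.

```python
import random

def g(y, theta):
    """g = log f - E log f for an N(theta, 1) density when the true mean is 0."""
    return -0.5 * (y - theta) ** 2 + 0.5 * (1.0 + theta ** 2)

def sup_deviation(T, grid, seed=0):
    """Largest |T^{-1} sum_t g(y_t, theta)| over a finite grid of theta values."""
    rng = random.Random(seed)
    ys = [rng.gauss(0.0, 1.0) for _ in range(T)]
    return max(abs(sum(g(y, th) for y in ys) / T) for th in grid)

# Grid on the compact set [-2, 2]; a finite grid only approximates the supremum.
grid = [-2.0 + 0.1 * i for i in range(41)]
for T in (100, 1_000, 10_000):
    print(T, round(sup_deviation(T, grid), 4))
```

The printed deviations should shrink toward zero as $T$ grows, although a single simulated path need not decrease monotonically.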

This theorem can be generalized to the extent that $T^{-1} \sum_{t=1}^{T} g_t(\theta_i)$ and $T^{-1} \sum_{t=1}^{T} \sup_{\theta \in \Theta_i^n} |g_t(\theta) - g_t(\theta_i)|$ can be subjected to a law of large numbers. The following theorem, which is a special case of a theorem attributed to Hoadley (1971) (see also White, 1980b), can be proved by making a slight modification of the proof of Theorem 4.2.1 and using Markov's law of large numbers (Chapter 3, note 10).

Theorem 4.2.2. Let $g_t(y, \theta)$ be a measurable function of $y$ in Euclidean space for each $t$ and for each $\theta \in \Theta$, a compact subset of $R^K$ (Euclidean $K$-space), and a continuous function of $\theta$ for each $y$ uniformly in $t$. Assume $E g_t(y, \theta) = 0$. Let $\{y_t\}$ be a sequence of independent and not necessarily identically distributed random vectors such that $E \sup_{\theta \in \Theta} |g_t(y_t, \theta)|^{1+\delta} \leq M < \infty$ for some $\delta > 0$. Then $T^{-1} \sum_{t=1}^{T} g_t(y_t, \theta)$ converges to 0 in probability uniformly in $\theta \in \Theta$.

We will need a similar theorem (Jennrich, 1969, p. 636) for the case where $y_t$ is a vector of constants rather than of random variables.

Theorem 4.2.3. Let $y_1, y_2, \ldots, y_T$ be vectors of constants. We define the empirical distribution function of $(y_1, y_2, \ldots, y_T)$ by $F_T(\alpha) = T^{-1} \sum_{t=1}^{T} \chi(y_t < \alpha)$, where $\chi$ takes the value 1 or 0 depending on whether the event in its argument occurs or not. Note that $y_t < \alpha$ means every element of the vector $y_t$ is smaller than the corresponding element of $\alpha$. Assume that $g(y, \theta)$ is a bounded and continuous function of $y$ in Euclidean space and $\theta$ in a compact set $\Theta$. Also assume that $F_T$ converges to a distribution function $F$. Then $\lim_{T \to \infty} T^{-1} \sum_{t=1}^{T} g(y_t, \theta) = \int g(y, \theta)\, dF(y)$ uniformly in $\theta$.
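A scalar illustration of Theorem 4.2.3 may help. The sketch below assumes a hypothetical fixed design $y_t = t\varphi \bmod 1$ with $\varphi = (\sqrt{5} - 1)/2$, whose empirical distribution converges to the uniform distribution on $[0, 1]$, and the bounded continuous function $g(y, \theta) = \cos(\theta y)$ on $\Theta = [0.5, 3]$; none of these choices comes from the text.

```python
import math

def average_g(T, theta):
    """T^{-1} sum_t g(y_t, theta) for the fixed sequence y_t = frac(t * phi)."""
    phi = (math.sqrt(5.0) - 1.0) / 2.0   # golden-ratio rotation, an assumed design
    return sum(math.cos(theta * ((t * phi) % 1.0)) for t in range(1, T + 1)) / T

def limit_g(theta):
    """Integral of cos(theta * y) dF(y) when F is uniform on [0, 1]."""
    return math.sin(theta) / theta

thetas = [0.5 + 0.25 * i for i in range(11)]   # grid on the compact set [0.5, 3]
for T in (50, 500, 5_000):
    worst = max(abs(average_g(T, th) - limit_g(th)) for th in thetas)
    print(T, round(worst, 5))
```

The largest discrepancy over the grid should shrink as $T$ increases, which is the uniform convergence the theorem describes for nonrandom $y_t$.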

There are many results in the literature concerning the consistency of the maximum likelihood estimator in the i.i.d. case. Rao (1973, p. 364) has presented the consistency of the local MLE, which was originally proved by Cramér (1946). Wald (1949) proved the consistency of the global MLE without assuming compactness of the parameter space, but his conditions are difficult to verify in practice. Many other references concerning the asymptotic properties of the MLE can be found in survey articles by Norden (1972, 1973).

As an example of the application of Theorem 4.1.1, we shall prove the consistency of the maximum likelihood estimators of $\beta$ and $\sigma^2$ in Model 1. Because in this case the maximum likelihood estimators can be written as explicit functions of the sample, it is easier to prove consistency by a direct method, as we have already done in Section 3.5. We are considering this more complicated proof strictly for the purpose of illustration.

Example 4.2.1. Prove the consistency of the maximum likelihood estimators of the parameters of Model 1 with normality using Theorem 4.1.1, assuming that $\lim T^{-1} X'X$ is a finite positive definite matrix.

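In Section 1.1 we used the symbols $\beta$ and $\sigma^2$ to denote the true values because we did not need to distinguish them from the domain of the likelihood function. But now we shall put the subscript 0 to denote the true value; therefore we can write Eq. (1.1.4) as

$$y = X\beta_0 + u, \tag{4.2.12}$$

where $V u_t = \sigma_0^2$. From (1.3.1) we have

$$\begin{aligned}
\log L_T &= -\frac{T}{2} \log 2\pi - \frac{T}{2} \log \sigma^2 - \frac{1}{2\sigma^2} (y - X\beta)'(y - X\beta) \\
&= -\frac{T}{2} \log 2\pi - \frac{T}{2} \log \sigma^2 - \frac{1}{2\sigma^2} [X(\beta_0 - \beta) + u]'[X(\beta_0 - \beta) + u],
\end{aligned} \tag{4.2.13}$$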

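where the second equality is obtained by using (4.2.12). Therefore

$$\operatorname{plim}_{T \to \infty} \frac{1}{T} \log L_T = -\frac{1}{2} \log 2\pi - \frac{1}{2} \log \sigma^2 - \frac{1}{2\sigma^2} (\beta_0 - \beta)' \left( \lim \frac{X'X}{T} \right) (\beta_0 - \beta) - \frac{\sigma_0^2}{2\sigma^2}. \tag{4.2.14}$$

Define a compact parameter space $\Theta$ by

$$c_1 \leq \sigma^2 \leq c_2, \quad \beta'\beta \leq c_3, \tag{4.2.15}$$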

where $c_1$ is a small positive constant and $c_2$ and $c_3$ are large positive constants, and assume that $(\beta_0', \sigma_0^2)$ is an interior point of $\Theta$. Then, clearly, the convergence in (4.2.14) is uniform in $\Theta$ and the right-hand side of (4.2.14) is uniquely maximized at $(\beta_0', \sigma_0^2)$. Put $\theta = (\beta', \sigma^2)'$ and define $\hat{\theta}_T$ by

$$\log L_T(\hat{\theta}_T) = \max_{\theta \in \Theta} \log L_T(\theta). \tag{4.2.16}$$

Then $\hat{\theta}_T$ is clearly consistent by Theorem 4.1.1. Now define $\tilde{\theta}_T$ by

$$\log L_T(\tilde{\theta}_T) = \max \log L_T(\theta), \tag{4.2.17}$$

where the maximization in (4.2.17) is over the whole Euclidean $(K + 1)$-space. Then the consistency of $\tilde{\theta}_T$, which we set out to prove, would follow from

$$\lim_{T \to \infty} P(\hat{\theta}_T = \tilde{\theta}_T) = 1. \tag{4.2.18}$$

The proof of (4.2.18) would be simple if we used our knowledge of the explicit formulae for $\tilde{\theta}_T$ in this example. But that would be cheating. The proof of (4.2.18) using condition D given after the proof of Theorem 4.1.1 is left as an exercise.
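As a numerical illustration rather than a proof, the consistency claimed in Example 4.2.1 can be observed by simulation. The sketch below assumes a toy version of Model 1 with a single regressor alternating between 1 and 2 (so that $T^{-1}X'X \to 2.5$), $\beta_0 = 2$, and $\sigma_0^2 = 1.5$; it uses the explicit formulae for the unconstrained MLE, and all names and parameter values are illustrative assumptions.

```python
import random

def simulate_mle(T, beta0=2.0, sigma0_sq=1.5, seed=0):
    """MLE of (beta, sigma^2) in y = x*beta0 + u with one regressor (a toy Model 1)."""
    rng = random.Random(seed)
    xs = [1.0 if t % 2 == 0 else 2.0 for t in range(T)]   # T^{-1} X'X -> 2.5
    ys = [beta0 * x + rng.gauss(0.0, sigma0_sq ** 0.5) for x in xs]
    beta_hat = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    sigma_sq_hat = sum((y - beta_hat * x) ** 2 for x, y in zip(xs, ys)) / T
    return beta_hat, sigma_sq_hat

for T in (50, 500, 5_000, 50_000):
    b, s2 = simulate_mle(T)
    print(T, round(b, 4), round(s2, 4))
```

The estimates should approach $(2, 1.5)$ as $T$ grows, in line with the consistency result.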

There are cases where the global maximum likelihood estimator is inconsistent, whereas a root of the likelihood equation (4.2.1) can be consistent, as in the following example.

Example 4.2.2. Let $y_t$, $t = 1, 2, \ldots, T$, be independent with the common distribution defined by

$$f(y_t) = \begin{cases} N(\mu_1, \sigma_1^2) & \text{with probability } \lambda \\ N(\mu_2, \sigma_2^2) & \text{with probability } 1 - \lambda. \end{cases} \tag{4.2.19}$$

This distribution is called a mixture of normal distributions. The likelihood function is given by

$$L = \prod_{t=1}^{T} \left[ \frac{\lambda}{\sqrt{2\pi}\,\sigma_1} \exp\left( -\frac{(y_t - \mu_1)^2}{2\sigma_1^2} \right) + \frac{1 - \lambda}{\sqrt{2\pi}\,\sigma_2} \exp\left( -\frac{(y_t - \mu_2)^2}{2\sigma_2^2} \right) \right]. \tag{4.2.20}$$

If we put $\mu_1 = y_1$ and let $\sigma_1$ approach 0, the term of the product that corresponds to $t = 1$ goes to infinity, and, consequently, $L$ goes to infinity. Hence, the global MLE cannot be consistent. Note that this example violates assumption C of Theorem 4.1.1 because $Q(\theta)$ does not attain a global maximum at $\theta_0$. However, the conditions of Theorem 4.1.2 are generally satisfied by this model. An extension of this model to the regression case is called the switching regression model (see Quandt and Ramsey, 1978).
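The unboundedness of the mixture likelihood in Example 4.2.2 is easy to see numerically. The following sketch evaluates the log of the mixture likelihood on a small hypothetical sample, setting $\mu_1 = y_1$ and shrinking $\sigma_1$; the data and the remaining parameter values are arbitrary assumptions.

```python
import math

def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)

def mixture_log_lik(ys, lam, mu1, sigma1, mu2, sigma2):
    """Log of the two-component normal mixture likelihood."""
    return sum(
        math.log(lam * normal_pdf(y, mu1, sigma1) + (1.0 - lam) * normal_pdf(y, mu2, sigma2))
        for y in ys
    )

ys = [0.4, -1.1, 2.3, 0.8, -0.2]            # hypothetical sample
sigmas = (1.0, 1e-3, 1e-6, 1e-9)
# Put mu1 equal to the first observation and let sigma1 shrink toward zero.
values = [mixture_log_lik(ys, 0.5, ys[0], s, 0.0, 1.0) for s in sigmas]
for s, v in zip(sigmas, values):
    print(s, round(v, 3))
```

The log likelihood keeps increasing as $\sigma_1$ shrinks, because the $t = 1$ term grows like $-\log \sigma_1$ while the other terms stay bounded through the second component.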

It is hard to construct examples in which the maximum likelihood estimator (assuming the likelihood function is correctly specified) is not consistent and another estimator is. Neyman and Scott (1948) have presented an interesting example of this type. In their example MLE is not consistent because the number of incidental (or nuisance) parameters goes to infinity as the sample size goes to infinity.

4.2.3 Asymptotic Normality

The asymptotic normality of the maximum likelihood estimator or, more precisely, a consistent root of the likelihood equation (4.2.1), can be analyzed by putting $Q_T = \log L_T$ in Theorem 4.1.3. If $\{y_t\}$ are independent, we can write

$$\log L_T = \sum_{t=1}^{T} \log f_t(y_t, \theta), \tag{4.2.21}$$

where $f_t$ is the marginal density of $y_t$. Thus, under general conditions on $f_t$, we can apply a law of large numbers to $\partial^2 \log L_T / \partial\theta\,\partial\theta'$ and a central limit theorem to $\partial \log L_T / \partial\theta$. Even if $\{y_t\}$ are not independent, a law of large numbers and a central limit theorem may still be applicable as long as the degree of dependence is limited in a certain manner, as we shall show in later chapters. Thus we see that assumptions B and C of Theorem 4.1.3 are expected to hold generally in the case of the maximum likelihood estimator.

Moreover, when we use the characteristics of $L_T$ as a joint density function, we can get more specific results than Theorem 4.1.3. Namely, as we have shown in Section 1.3.2, the regularity conditions on the likelihood function given in assumptions A' and B' of Section 1.3.2 imply

$$A(\theta_0) = -B(\theta_0). \tag{4.2.22}$$

Therefore, we shall make (4.2.22) an additional assumption and state it formally as a theorem.

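Theorem 4.2.4. Under the assumptions of Theorem 4.1.3 and assumption (4.2.22), the maximum likelihood estimator $\hat{\theta}_T$ satisfies

$$\sqrt{T}(\hat{\theta}_T - \theta_0) \to N\left( 0,\; -\left[ \lim E\, \frac{1}{T} \frac{\partial^2 \log L_T}{\partial\theta\,\partial\theta'} \bigg|_{\theta_0} \right]^{-1} \right). \tag{4.2.23}$$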

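If $\{y_t\}$ are i.i.d. with the common density function $f(\cdot, \theta)$, we can replace assumptions B and C of Theorem 4.1.3 as well as the additional assumption (4.2.22) with the following conditions on $f(\cdot, \theta)$ itself:

$$\int \frac{\partial f}{\partial\theta}\, dy = 0, \tag{4.2.24}$$

$$\int \frac{\partial^2 f}{\partial\theta\,\partial\theta'}\, dy = 0, \tag{4.2.25}$$

$$\operatorname{plim} \frac{1}{T} \sum_{t=1}^{T} \frac{\partial^2 \log f}{\partial\theta\,\partial\theta'} = E\, \frac{\partial^2 \log f}{\partial\theta\,\partial\theta'} \quad \text{uniformly in } \theta \text{ in an open neighborhood of } \theta_0. \tag{4.2.26}$$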

A sufficient set of conditions for (4.2.26) can be found by putting $g_t(\theta) = \partial^2 \log f_t / \partial\theta_i\,\partial\theta_j$ in Theorem 4.2.1. Because $\log L_T = \sum_{t=1}^{T} \log f(y_t, \theta)$ in this case, (4.2.26) implies assumption B of Theorem 4.1.3 because of Theorem 4.1.5. Assumption C of Theorem 4.1.3 follows from (4.2.24) and (4.2.26) on account of Theorem 3.3.4 (Lindeberg-Lévy CLT) since (4.2.24) implies $E(\partial \log f / \partial\theta)_{\theta_0} = 0$. Finally, it is easy to show that assumptions (4.2.24)-(4.2.26) imply (4.2.22).
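The equality $A(\theta_0) = -B(\theta_0)$ in (4.2.22) can be checked by Monte Carlo in a simple i.i.d. case. The sketch below assumes the exponential density $f(y, \theta) = \theta e^{-\theta y}$ with $\theta_0 = 2$, and reads $A(\theta_0)$ as the expected second derivative of $\log f$ and $B(\theta_0)$ as the expected squared score; that reading of Theorem 4.1.3, which is not reproduced here, and the choice of density are assumptions.

```python
import random

def score(y, theta):
    """d log f / d theta for the exponential density f(y, theta) = theta*exp(-theta*y)."""
    return 1.0 / theta - y

def hessian(y, theta):
    """d^2 log f / d theta^2 for the same density (it does not depend on y)."""
    return -1.0 / theta ** 2

theta0 = 2.0
rng = random.Random(0)
ys = [rng.expovariate(theta0) for _ in range(100_000)]   # draws at the true value

mean_score = sum(score(y, theta0) for y in ys) / len(ys)
A_hat = sum(hessian(y, theta0) for y in ys) / len(ys)       # estimates A(theta0)
B_hat = sum(score(y, theta0) ** 2 for y in ys) / len(ys)    # estimates B(theta0)
print(round(mean_score, 4), round(A_hat, 4), round(B_hat, 4))
```

The average score should be close to zero and the two estimates should be close to $-1/\theta_0^2$ and $1/\theta_0^2$, respectively.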

We shall use the same model as that used in Example 4.2.1 and shall illustrate how the assumptions of Theorem 4.1.3 and the additional assumption (4.2.22) are satisfied. As for Example 4.2.1, the sole purpose of Example 4.2.3 is as an illustration, as the same results have already been obtained by a direct method in Chapter 1.

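Example 4.2.3. Under the same assumptions made in Example 4.2.1, prove the asymptotic normality of the maximum likelihood estimator $\hat{\theta} = (\hat{\beta}', \hat{\sigma}^2)'$. We first obtain the first and second derivatives of $\log L$:

$$\frac{\partial \log L}{\partial \beta} = \frac{1}{\sigma^2} X'(y - X\beta), \tag{4.2.27}$$

$$\frac{\partial \log L}{\partial \sigma^2} = -\frac{T}{2\sigma^2} + \frac{1}{2\sigma^4} (y - X\beta)'(y - X\beta), \tag{4.2.28}$$

$$\frac{\partial^2 \log L}{\partial\beta\,\partial\beta'} = -\frac{1}{\sigma^2} X'X, \tag{4.2.29}$$

$$\frac{\partial^2 \log L}{\partial(\sigma^2)^2} = \frac{T}{2\sigma^4} - \frac{1}{\sigma^6} (y - X\beta)'(y - X\beta), \tag{4.2.30}$$

$$\frac{\partial^2 \log L}{\partial\sigma^2\,\partial\beta} = -\frac{1}{\sigma^4} X'(y - X\beta). \tag{4.2.31}$$

From (4.2.29), (4.2.30), and (4.2.31) we can clearly see that assumptions A and B of Theorem 4.1.3 are satisfied. Also from these equations we can evaluate the elements of $A(\theta_0)$:

$$\operatorname{plim} \frac{1}{T} \frac{\partial^2 \log L}{\partial\beta\,\partial\beta'} \bigg|_{\theta_0} = -\frac{1}{\sigma_0^2} \lim \frac{X'X}{T}, \tag{4.2.32}$$

$$\operatorname{plim} \frac{1}{T} \frac{\partial^2 \log L}{\partial(\sigma^2)^2} \bigg|_{\theta_0} = -\frac{1}{2\sigma_0^4}, \tag{4.2.33}$$

$$\operatorname{plim} \frac{1}{T} \frac{\partial^2 \log L}{\partial\sigma^2\,\partial\beta} \bigg|_{\theta_0} = 0. \tag{4.2.34}$$

From (4.2.27) and (4.2.28) we obtain

$$\frac{1}{\sqrt{T}} \frac{\partial \log L}{\partial\beta} \bigg|_{\theta_0} = \frac{1}{\sigma_0^2} \frac{X'u}{\sqrt{T}} \quad \text{and} \quad \frac{1}{\sqrt{T}} \frac{\partial \log L}{\partial\sigma^2} \bigg|_{\theta_0} = \frac{1}{2\sigma_0^4} \frac{u'u - T\sigma_0^2}{\sqrt{T}}.$$

Thus, by applying either the Lindeberg-Feller or Liapounov CLT to a sequence of an arbitrary linear combination of a $(K + 1)$-vector $(x_{1t}u_t, x_{2t}u_t, \ldots, x_{Kt}u_t, u_t^2 - \sigma_0^2)$, we can show

$$\frac{1}{\sqrt{T}} \frac{\partial \log L}{\partial\beta} \bigg|_{\theta_0} \to N\left( 0,\; \frac{1}{\sigma_0^2} \lim \frac{X'X}{T} \right) \tag{4.2.35}$$

and

$$\frac{1}{\sqrt{T}} \frac{\partial \log L}{\partial\sigma^2} \bigg|_{\theta_0} \to N\left( 0,\; \frac{1}{2\sigma_0^4} \right) \tag{4.2.36}$$

with zero asymptotic covariance between (4.2.35) and (4.2.36). Thus assumption C of Theorem 4.1.3 has been shown to hold. Finally, results (4.2.32) through (4.2.36) show that assumption (4.2.22) is satisfied. We write the conclusion (4.2.23) specifically for the present example as

$$\sqrt{T} \begin{bmatrix} \hat{\beta} - \beta_0 \\ \hat{\sigma}^2 - \sigma_0^2 \end{bmatrix} \to N\left\{ 0,\; \begin{bmatrix} \sigma_0^2 (\lim T^{-1} X'X)^{-1} & 0 \\ 0' & 2\sigma_0^4 \end{bmatrix} \right\}. \tag{4.2.37}$$

[Figure 4.1: The log likelihood function in a nonregular case]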

There are cases where the global maximum likelihood estimator exists but does not satisfy the likelihood equation (4.2.1). Then Theorem 4.1.3 cannot be used to prove the asymptotic normality of MLE. The model of Aigner, Amemiya, and Poirier (1976) is such an example. In their model, $\operatorname{plim} T^{-1} \log L_T$ exists and is maximized at the true parameter value $\theta_0$ so that MLE is consistent. However, problems arise because $\operatorname{plim} T^{-1} \log L_T$ is not smooth at $\theta_0$; it looks like Figure 4.1. In such a case, it is generally difficult to prove asymptotic normality.
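Returning to the regular case of Example 4.2.3, the asymptotic variances $\sigma_0^2 (\lim T^{-1}X'X)^{-1}$ and $2\sigma_0^4$ can be compared with Monte Carlo variances. The sketch below assumes a single regressor alternating between 1 and 2 (so $T^{-1}X'X = 2.5$), $\beta_0 = 2$, and $\sigma_0 = 1$; the design, the sample size, and the helper names are illustrative assumptions.

```python
import random

def mle(T, rng, beta0=2.0, sigma0=1.0):
    """One draw of (beta_hat, sigma_sq_hat) for y = x*beta0 + u, u ~ N(0, sigma0^2)."""
    xs = [1.0 if t % 2 == 0 else 2.0 for t in range(T)]   # T^{-1} X'X = 2.5 for even T
    ys = [beta0 * x + rng.gauss(0.0, sigma0) for x in xs]
    b = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    s2 = sum((y - b * x) ** 2 for x, y in zip(xs, ys)) / T
    return b, s2

def scaled_variances(T=200, reps=2_000, seed=0):
    """Monte Carlo variances of sqrt(T)*(beta_hat - beta0) and sqrt(T)*(s2 - sigma0^2)."""
    rng = random.Random(seed)
    draws = [mle(T, rng) for _ in range(reps)]

    def var(zs):
        m = sum(zs) / len(zs)
        return sum((z - m) ** 2 for z in zs) / len(zs)

    return T * var([b for b, _ in draws]), T * var([s2 for _, s2 in draws])

var_beta, var_sigma_sq = scaled_variances()
# Asymptotic values implied by the example under these assumptions: 1/2.5 = 0.4 and 2.0.
print(round(var_beta, 3), round(var_sigma_sq, 3))
```

With a few thousand replications the simulated variances should lie within Monte Carlo error of the asymptotic values.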

4.2.4 Asymptotic Efficiency

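The asymptotic normality (4.2.23) means that if $T$ is large the variance-covariance matrix of a maximum likelihood estimator may be approximated by

$$-\left[ E\, \frac{\partial^2 \log L_T}{\partial\theta\,\partial\theta'} \bigg|_{\theta_0} \right]^{-1}. \tag{4.2.38}$$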

But (4.2.38) is precisely the Cramér-Rao lower bound of an unbiased estimator derived in Section 1.3.2. At one time statisticians believed that a consistent and asymptotically normal estimator with the asymptotic covariance matrix (4.2.38) was asymptotically minimum variance among all consistent and asymptotically normal estimators. But this was proved wrong by the following counterexample, attributed to Hodges and reported in LeCam (1953).

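Example 4.2.4. Let $\hat{\theta}_T$ be an estimator of a scalar parameter $\theta$ such that $\operatorname{plim} \hat{\theta}_T = \theta$ and $\sqrt{T}(\hat{\theta}_T - \theta) \to N[0, v(\theta)]$. Define the estimator $\theta_T^* = w_T \hat{\theta}_T$, where

$$w_T = \begin{cases} 0 & \text{if } |\hat{\theta}_T| < T^{-1/4} \\ 1 & \text{if } |\hat{\theta}_T| \geq T^{-1/4}. \end{cases}$$

It can be shown (the proof is left as an exercise) that $\sqrt{T}(\theta_T^* - \theta) \to N[0, v^*(\theta)]$, where $v^*(0) = 0$ and $v^*(\theta) = v(\theta)$ if $\theta \neq 0$.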
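The behavior of the estimator in Example 4.2.4 can be seen in a short simulation. The sketch below assumes that $\hat{\theta}_T$ is the mean of $T$ draws from $N(\theta, 1)$, so that $v(\theta) = 1$, and draws $\hat{\theta}_T$ directly from its exact distribution $N(\theta, 1/T)$; the setup and function names are illustrative assumptions.

```python
import random

def hodges(theta_hat, T):
    """Example 4.2.4: set the estimate to zero whenever |theta_hat| < T^(-1/4)."""
    return theta_hat if abs(theta_hat) >= T ** -0.25 else 0.0

def scaled_mse(theta, T, reps=5_000, seed=0):
    """Monte Carlo estimate of T * E(theta_star - theta)^2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        # Sample mean of T draws from N(theta, 1), drawn from its exact law N(theta, 1/T).
        theta_hat = rng.gauss(theta, T ** -0.5)
        total += (hodges(theta_hat, T) - theta) ** 2
    return T * total / reps

for theta in (0.0, 1.0):
    print(theta, [round(scaled_mse(theta, T), 3) for T in (100, 10_000, 1_000_000)])
```

At $\theta = 0$ the scaled mean squared error should collapse toward $v^*(0) = 0$, while at $\theta = 1$ it should stay near $v(1) = 1$.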

The estimator $\theta_T^*$ of Example 4.2.4 is said to be superefficient. Despite the existence of superefficient estimators, we can still say something good about an estimator with the asymptotic variance-covariance matrix (4.2.38). We shall state two such results without proof. One is the result of LeCam (1953), which states that the set of $\theta$ points on which a superefficient estimator has an asymptotic variance smaller than the Cramér-Rao lower bound is of Lebesgue measure zero. The other is the result of Rao (1973, p. 350) that the matrix (4.2.38) is the lower bound for the asymptotic variance-covariance matrix of all the consistent and asymptotically normal estimators for which the convergence to a normal distribution is uniform over compact intervals of $\theta$. These results seem to justify our use of the term asymptotically efficient in the following sense:

Definition 4.2.1. A consistent estimator is said to be asymptotically efficient if it satisfies statement (4.2.23).

Thus the maximum likelihood estimator under the appropriate assumptions is asymptotically efficient by definition. An asymptotically efficient estimator is also referred to as best asymptotically normal (BAN for short). There are BAN estimators other than MLE. Many examples of these will be discussed in subsequent chapters: for example, the weighted least squares estimator will be discussed in Section 6.5.3, the two-stage and three-stage least squares estimators in Sections 7.3.3 and 7.4, and the minimum chi-square estimators in Section 9.2.5. Barankin and Gurland (1951) have presented a general method of generating BAN estimators. Because their results are mathematically too abstract to present here, we shall state only a simple corollary of their results: Let $\{y_t\}$ be an i.i.d. sequence of random vectors with $E y_t = \mu(\theta)$, $E(y_t - \mu)(y_t - \mu)' = \Sigma(\theta)$, and with exponential family density

$$f(y, \theta) = \exp\left[ \alpha_0(\theta) + \beta_0(y) + \sum_i \alpha_i(\theta) \beta_i(y) \right]$$

and define $z_T = T^{-1} \sum_{t=1}^{T} y_t$. Then the minimization of $[z_T - \mu(\theta)]' \Sigma(\theta)^{-1} [z_T - \mu(\theta)]$ yields a BAN estimator of $\theta$ (see also Taylor, 1953; Ferguson, 1958).⁴

Different BAN estimators may have different finite sample distributions. Recently many interesting articles in the statistical literature have compared the approximate distribution (with a higher degree of accuracy than the asymptotic distribution) of MLE with those of other BAN estimators. For example, Ghosh and Subramanyam (1974) have shown that in the exponential family the mean squared error up to $O(T^{-2})$ of MLE after correcting its bias up to $O(T^{-1})$ is smaller than that of any other BAN estimator with similarly corrected bias. This result is referred to as the second-order efficiency of MLE,⁵ and examples of it will be given in Sections 7.3.5 and 9.2.6.