Asymptotic Normality
THEOREM 7.4.3 Let the likelihood function be L(X₁, X₂, . . . , Xₙ | θ). Then, under general conditions, the maximum likelihood estimator θ̂ is asymptotically distributed as

(7.4.16)   \hat\theta \sim N\!\left[\theta_0,\ \left(-E\,\frac{\partial^2 \log L}{\partial\theta^2}\right)^{\!-1}\right].
(Here we interpret the maximum likelihood estimator as a solution of the likelihood equation obtained by equating the derivative to zero, rather than as the global maximum likelihood estimator. Since asymptotic normality can be proved only for this local maximum likelihood estimator, henceforth this is always what we mean by the maximum likelihood estimator.)
Sketch of Proof. By definition, ∂log L/∂θ evaluated at θ̂ is zero. We expand it in a Taylor series around θ₀ to obtain

(7.4.17)   0 = \frac{\partial \log L}{\partial\theta}\bigg|_{\hat\theta} = \frac{\partial \log L}{\partial\theta}\bigg|_{\theta_0} + \frac{\partial^2 \log L}{\partial\theta^2}\bigg|_{\theta^*}\,(\hat\theta - \theta_0),

where θ* lies between θ̂ and θ₀. Solving for (θ̂ − θ₀) and multiplying by √n, we obtain

(7.4.18)   \sqrt{n}\,(\hat\theta - \theta_0) = \frac{\dfrac{1}{\sqrt{n}}\dfrac{\partial \log L}{\partial\theta}\bigg|_{\theta_0}}{-\dfrac{1}{n}\dfrac{\partial^2 \log L}{\partial\theta^2}\bigg|_{\theta^*}}.
But we can show (see the paragraph following this proof) that

(7.4.19)   \frac{1}{\sqrt{n}}\frac{\partial \log L}{\partial\theta}\bigg|_{\theta_0} \xrightarrow{d} N\!\left[0,\ E\!\left(\frac{\partial \log f}{\partial\theta}\right)^{\!2}\right]

and

(7.4.20)   -\frac{1}{n}\frac{\partial^2 \log L}{\partial\theta^2}\bigg|_{\theta^*} \xrightarrow{P} -E\,\frac{\partial^2 \log f}{\partial\theta^2},

where we have simply written f for f(Xᵢ). But we have (the derivatives being evaluated at θ₀ throughout)
(7.4.21)   E\!\left(\frac{\partial \log f}{\partial\theta}\right)^{\!2} = -E\,\frac{\partial^2 \log f}{\partial\theta^2},
as in (7.4.3). Therefore, by (iii) of Theorem 6.1.4 (Slutsky), we conclude

(7.4.22)   \sqrt{n}\,(\hat\theta - \theta_0) \xrightarrow{d} N\!\left[0,\ \left(-E\,\frac{\partial^2 \log f}{\partial\theta^2}\right)^{\!-1}\right].
We may paraphrase (7.4.22) as

(7.4.23)   \hat\theta \sim N\!\left[\theta_0,\ \left(-nE\,\frac{\partial^2 \log f}{\partial\theta^2}\right)^{\!-1}\right].
Finally, the conclusion of the theorem follows from the identity

(7.4.24)   nE\,\frac{\partial^2 \log f}{\partial\theta^2} = E\,\frac{\partial^2 \log L}{\partial\theta^2}. □
The convergence result (7.4.19) follows from noting that

\frac{1}{\sqrt{n}}\frac{\partial \log L}{\partial\theta}\bigg|_{\theta_0} = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{\partial \log f(X_i)}{\partial\theta}\bigg|_{\theta_0}

and that the right-hand side satisfies the conditions for the Lindeberg-Levy CLT (Theorem 6.2.2). Somewhat more loosely than the above, (7.4.20) follows from noting that

-\frac{1}{n}\frac{\partial^2 \log L}{\partial\theta^2}\bigg|_{\theta^*} \cong -\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2 \log f(X_i)}{\partial\theta^2}\bigg|_{\theta_0}

and that the right-hand side satisfies the conditions for Khinchine's LLN (Theorem 6.2.1).
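The conclusion of Theorem 7.4.3 is easy to check by simulation. The following Python sketch is an illustration with assumed values, not part of the text: it uses the exponential density f(x) = θe^(−θx), for which the maximum likelihood estimator is 1/X̄ and −E ∂²log f/∂θ² = 1/θ², and standardizes the estimator over many replications.

```python
import numpy as np

# Monte Carlo check of Theorem 7.4.3 for f(x) = theta * exp(-theta * x):
# the MLE is theta_hat = 1 / X_bar, and -E[d^2 log f / d theta^2] = 1/theta^2,
# so sqrt(n) * (theta_hat - theta0) / theta0 should be roughly N(0, 1).
rng = np.random.default_rng(0)
theta0, n, reps = 2.0, 400, 5000
samples = rng.exponential(scale=1.0 / theta0, size=(reps, n))
theta_hat = 1.0 / samples.mean(axis=1)          # MLE in each replication
z = np.sqrt(n) * (theta_hat - theta0) / theta0  # standardized statistic
print(round(z.mean(), 2), round(z.std(), 2))    # both near 0 and 1
```

With n = 400 the standardized statistic has mean near 0 and standard deviation near 1, in agreement with (7.4.22).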
A significant consequence of Theorem 7.4.3 is that the asymptotic variance of the maximum likelihood estimator is identical with the Cramér-Rao lower bound given in (7.4.1). This is almost like (but not quite the same as) saying that the maximum likelihood estimator has the smallest asymptotic variance among all consistent estimators. Therefore we define
DEFINITION 7.4.1 A consistent estimator is said to be asymptotically efficient if its asymptotic distribution is given by (7.4.16).
Thus the maximum likelihood estimator is asymptotically efficient essentially by definition.
We shall give three examples to illustrate the properties of the maximum likelihood estimator and to compare it with the other estimators.
EXAMPLE 7.4.3 Let X have density f(x) = (1 + θ)x^θ, θ > −1, 0 < x < 1. Obtain the maximum likelihood estimator of EX (= μ) based on n observations X₁, X₂, . . . , Xₙ and compare its asymptotic variance with the variance of the sample mean X̄.
We have

(7.4.27)   \mu = (1+\theta)\int_0^1 x^{\theta+1}\,dx = \frac{1+\theta}{2+\theta}.

Since (7.4.27) defines a one-to-one function and θ > −1, we must have 0 < μ < 1. Solving (7.4.27) for θ, we have

(7.4.28)   \theta = \frac{2\mu-1}{1-\mu}.
The log likelihood function in terms of θ is given by

(7.4.29)   \log L = n\log(1+\theta) + \theta\sum_{i=1}^{n}\log x_i.
Inserting (7.4.28) into (7.4.29), we can express the log likelihood function in terms of μ as

(7.4.30)   \log L = n\log\frac{\mu}{1-\mu} + \frac{2\mu-1}{1-\mu}\sum_{i=1}^{n}\log x_i.
Differentiating (7.4.30) with respect to μ yields

(7.4.31)   \frac{\partial \log L}{\partial\mu} = \frac{n}{\mu(1-\mu)} + \frac{1}{(1-\mu)^2}\sum_{i=1}^{n}\log x_i.
Equating (7.4.31) to zero, we obtain the maximum likelihood estimator

(7.4.32)   \hat\mu = \frac{n}{n - \sum_{i=1}^{n}\log x_i}.
Differentiating (7.4.31) again, we obtain

(7.4.33)   \frac{\partial^2 \log L}{\partial\mu^2} = -\frac{n(1-2\mu)}{\mu^2(1-\mu)^2} + \frac{2}{(1-\mu)^3}\sum_{i=1}^{n}\log x_i.
Since we have, using integration by parts,

(7.4.34)   E\log X = (1+\theta)\int_0^1 x^{\theta}\log x\,dx = -\frac{1}{1+\theta} = \frac{\mu-1}{\mu},

we obtain from (7.4.33)

(7.4.35)   E\,\frac{\partial^2 \log L}{\partial\mu^2} = -\frac{n}{\mu^2(1-\mu)^2}.
Therefore, by Theorem 7.4.3, the asymptotic variance of μ̂, denoted AV(μ̂), is given by

(7.4.36)   AV(\hat\mu) = \frac{\mu^2(1-\mu)^2}{n}.
Next we obtain the variance of the sample mean. We have

(7.4.37)   EX^2 = (1+\theta)\int_0^1 x^{\theta+2}\,dx = \frac{1+\theta}{3+\theta} = \frac{\mu}{2-\mu}.

Hence

V\bar{X} = \frac{1}{n}\left(\frac{\mu}{2-\mu} - \mu^2\right) = \frac{\mu(1-\mu)^2}{n(2-\mu)}.

Comparing this with (7.4.36), we see that AV(μ̂)/VX̄ = μ(2 − μ) ≤ 1, so the maximum likelihood estimator has the smaller asymptotic variance.
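This variance comparison can be checked numerically. The Python sketch below is an illustrative addition with assumed parameter values, not part of the worked example; it samples from f by inverting the CDF F(x) = x^(1+θ).

```python
import numpy as np

# Simulate Example 7.4.3: compare the sampling variance of the MLE
# mu_hat = n / (n - sum(log x_i)) in (7.4.32) with that of the sample mean.
rng = np.random.default_rng(0)
theta, n, reps = 1.0, 200, 20000
mu = (1 + theta) / (2 + theta)                  # true mean, here 2/3
x = rng.uniform(size=(reps, n)) ** (1.0 / (1.0 + theta))  # X = U^(1/(1+theta))
mu_mle = n / (n - np.log(x).sum(axis=1))        # MLE in each replication
mu_bar = x.mean(axis=1)                         # sample mean in each replication
print(n * mu_mle.var(), mu**2 * (1 - mu)**2)    # near mu^2 (1-mu)^2 = 4/81
print(n * mu_bar.var(), mu * (1 - mu)**2 / (2 - mu))  # near 1/18
```

The two numbers in each printed pair should nearly agree, and their ratio across pairs reflects μ(2 − μ) < 1.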
There are several points worth noting with regard to this example, which we state as remarks.
Remark 1. In this example, solving ∂log L/∂μ = 0 for μ led to the closed-form solution (7.4.32), which expressed μ̂ as an explicit function of the sample, as in Examples 7.3.1, 7.3.2, and 7.3.3. This is not possible in many applications; in such cases the maximum likelihood estimator can be defined only implicitly by the likelihood equation, as pointed out in Section 7.3.3. Even then, however, the asymptotic variance can be obtained by the method presented here.
Remark 2. Since μ̂ in (7.4.32) is a nonlinear function of the sample, the exact mean and variance, let alone the exact distribution, of the estimator are difficult to find. That is why our asymptotic results are useful.
Remark 3. In a situation such as this example, where the maximum likelihood estimator is explicitly written as a function of the sample, its consistency can be shown directly by using the convergence theorems of Chapter 6, without appealing to the general result of Section 7.4.2. For this purpose rewrite (7.4.32) as
(7.4.41)   \hat\mu = \frac{1}{1 - \dfrac{1}{n}\sum_{i=1}^{n}\log X_i},

where Xᵢ has been substituted for xᵢ because we must treat μ̂ as a random variable. But since {log Xᵢ} are i.i.d. with mean (μ − 1)/μ, as given in (7.4.34), we have by Khinchine's LLN (Theorem 6.2.1)

(7.4.42)   \operatorname{plim}_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\log X_i = \frac{\mu-1}{\mu}.
Therefore the consistency of μ̂ follows from Theorem 6.1.3.
Remark 4. Similarly, we can derive the asymptotic normality directly without appealing to Theorem 7.4.3:

(7.4.43)   \sqrt{n}\,(\hat\mu-\mu) = \frac{\mu}{1-\dfrac{1}{n}\sum_{i=1}^{n}\log X_i}\cdot\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n}\log X_i - \frac{\mu-1}{\mu}\right) \overset{LD}{=} \mu^2\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n}\log X_i - \frac{\mu-1}{\mu}\right) \xrightarrow{d} N[0,\ \mu^2(1-\mu)^2].

Therefore, we can state

(7.4.44)   \hat\mu \sim N\!\left[\mu,\ \frac{\mu^2(1-\mu)^2}{n}\right].
The second equality with LD in (7.4.43), as defined in Definition 6.1.4, means that both sides have the same limit distribution and is a consequence of (iii) of Theorem 6.1.4 (Slutsky). The convergence in distribution appearing next in (7.4.43) is a consequence of the Lindeberg-Levy CLT (Theorem 6.2.2). Here we need the variance of log X, which can be obtained as follows. By integration by parts,
(7.4.45)   E(\log X)^2 = (1+\theta)\int_0^1 x^{\theta}(\log x)^2\,dx = \frac{2}{(1+\theta)^2} = \frac{2(1-\mu)^2}{\mu^2}.

Therefore

(7.4.46)   V\log X = \frac{(1-\mu)^2}{\mu^2}.
Remark 5. We first expressed the log likelihood function in terms of μ in (7.4.30) and found the value of μ that maximizes (7.4.30). We would get the same estimator if we maximized (7.4.29) with respect to θ and inserted the resulting maximum likelihood estimator of θ into the right-hand side of (7.4.27). More generally, if two parameters θ₁ and θ₂ are related by a one-to-one continuous function θ₁ = g(θ₂), the respective maximum likelihood estimators are related by θ̂₁ = g(θ̂₂).
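The invariance property stated in Remark 5 can be verified numerically for this example. The following Python sketch, an illustration with arbitrary assumed parameter values, maximizes (7.4.29) over θ and maps the result through (7.4.27).

```python
import numpy as np

# Check Remark 5 numerically: maximize (7.4.29) over theta and map the result
# through (7.4.27); it must equal the direct formula (7.4.32) for mu_hat.
rng = np.random.default_rng(1)
theta = 1.5
x = rng.uniform(size=500) ** (1.0 / (1.0 + theta))  # draw from f by inverse CDF
s = np.log(x).sum()
theta_hat = -1.0 - len(x) / s          # solves n/(1 + theta) + s = 0, cf. (7.4.29)
mu_via_theta = (1 + theta_hat) / (2 + theta_hat)    # g(theta_hat), eq. (7.4.27)
mu_direct = len(x) / (len(x) - s)                   # eq. (7.4.32)
print(abs(mu_via_theta - mu_direct) < 1e-9)         # True
```

The two routes agree to floating-point precision, since algebraically (1 + θ̂)/(2 + θ̂) = n/(n − Σ log xᵢ).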
EXAMPLE 7.4.4 Assuming σ² = μ² in Example 7.3.3 (normal density), so that μ is the sole parameter of the distribution, obtain the maximum likelihood estimator of μ and directly prove its consistency. Assume that μ ≠ 0.
From (7.3.14) we have

(7.4.47)   \log L = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\mu^2 - \frac{1}{2\mu^2}\sum_{i=1}^{n}(x_i-\mu)^2 = A + B + C,

where

(7.4.48)   A = -\frac{n}{2}\log\mu^2 - \frac{\sum_{i=1}^{n}x_i^2}{2\mu^2},

(7.4.49)   B = \frac{\sum_{i=1}^{n}x_i}{\mu},

and C is a constant term that does not depend on the parameter μ.
We shall study the shape of log L as a function of μ. The function A is an even function, depicted in Figure 7.7. The shape of the function B depends on the sign of ∑ᵢ₌₁ⁿ xᵢ and looks like Figure 7.8. From these two figures it is clear that log L is maximized at a positive value of μ when ∑ᵢ₌₁ⁿ xᵢ > 0 and at a negative value of μ when ∑ᵢ₌₁ⁿ xᵢ < 0.
Setting the derivative of (7.4.47) with respect to μ equal to zero yields

(7.4.50)   -\frac{n}{\mu} + \frac{\sum_{i=1}^{n}x_i^2}{\mu^3} - \frac{\sum_{i=1}^{n}x_i}{\mu^2} = 0,

which can be written as

(7.4.51)   n\mu^2 + \mu\sum_{i=1}^{n}x_i - \sum_{i=1}^{n}x_i^2 = 0.
There are two roots of (7.4.51), one positive and one negative, given by

(7.4.52)   \hat\mu = \frac{1}{2n}\left[-\sum_{i=1}^{n}x_i \pm \left\{\left(\sum_{i=1}^{n}x_i\right)^{\!2} + 4n\sum_{i=1}^{n}x_i^2\right\}^{1/2}\right].

We know from the argument of the preceding paragraph that the positive root is the maximum likelihood estimator if ∑ᵢ₌₁ⁿ xᵢ > 0 and the negative root if ∑ᵢ₌₁ⁿ xᵢ < 0.
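The root selection just described is easy to mechanize. The Python sketch below is an illustrative addition with assumed parameter values: it solves the quadratic (7.4.51) and keeps the root whose sign matches that of ∑xᵢ.

```python
import numpy as np

# Root selection in Example 7.4.4: solve (7.4.51),
# n*mu^2 + mu*sum(x_i) - sum(x_i^2) = 0, keeping the root whose sign
# matches that of sum(x_i).
def mle_normal_var_eq_mean_sq(x):
    n, s1, s2 = len(x), x.sum(), (x ** 2).sum()
    disc = np.sqrt(s1 ** 2 + 4 * n * s2)       # discriminant term in (7.4.52)
    return (-s1 + disc) / (2 * n) if s1 > 0 else (-s1 - disc) / (2 * n)

rng = np.random.default_rng(2)
mu = 3.0
x = rng.normal(loc=mu, scale=abs(mu), size=2000)  # sigma = |mu| by assumption
print(mle_normal_var_eq_mean_sq(x))               # close to 3
```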
Next we shall directly prove the consistency of the maximum likelihood estimator in this example. We have, using Khinchine's LLN (Theorem 6.2.1),

(7.4.53)   \operatorname{plim}_{n\to\infty}\frac{\sum_{i=1}^{n}X_i}{n} = \mu

and

(7.4.54)   \operatorname{plim}_{n\to\infty}\frac{\sum_{i=1}^{n}X_i^2}{n} = 2\mu^2.

Therefore, taking the probability limit of (7.4.52),

(7.4.55)   \operatorname{plim}_{n\to\infty}\hat\mu = \frac{1}{2}\left[-\mu \pm (\mu^2 + 8\mu^2)^{1/2}\right] = \frac{-\mu \pm 3|\mu|}{2},

which shows that the positive root is consistent if μ > 0 and the negative root is consistent if μ < 0. But because of (7.4.53), the signs of ∑ᵢ₌₁ⁿ Xᵢ and μ are the same with probability approaching one as n goes to infinity. Therefore the maximum likelihood estimator is consistent.
EXAMPLE 7.4.5 Let the model be the same as in Example 7.2.5. The likelihood function of the model is given by

(7.4.56)   L = \frac{1}{\theta^n} \quad \text{for } \theta \ge z,
           = 0 \quad \text{otherwise},

where z = max(x₁, x₂, . . . , xₙ), the observed value of Z defined in Example 7.2.5. Clearly, therefore, the maximum likelihood estimator of θ is Z. Since μ = θ/2, the maximum likelihood estimator of μ is Z/2 because of Remark 5 of Example 7.4.3. Thus we see that μ̂₂ defined in that example is the bias-corrected maximum likelihood estimator.
In this example the support of the likelihood function depends on the unknown parameter θ, so the regularity conditions do not hold, and the asymptotic distribution cannot be obtained by the standard procedure given in Section 7.4.3.
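This breakdown of the regular theory can be seen in simulation. In the Python sketch below, an illustration with assumed values of θ and n for the uniform U[0, θ] model of Example 7.2.5, the estimation error of Z shrinks at rate n rather than √n, and its rescaled limit is exponential rather than normal.

```python
import numpy as np

# Non-regular behavior in Example 7.4.5: for U[0, theta] the MLE is
# Z = max(X_i), and n*(theta - Z) converges to an exponential distribution
# with mean theta, not to a sqrt(n)-rate normal limit.
rng = np.random.default_rng(3)
theta, n, reps = 2.0, 1000, 20000
z = rng.uniform(0, theta, size=(reps, n)).max(axis=1)  # MLE in each replication
t = n * (theta - z)                                    # rescaled estimation error
print(round(t.mean(), 1), round(t.std(), 1))  # both near theta = 2 (exponential)
```

An exponential limit has equal mean and standard deviation, which is what the two printed numbers exhibit here.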
4. (Section 7.2.2)
Show that criteria (1) through (5) are transitive, whereas (6) is not.
5. (Section 7.2.2)
Let X₁ and X₂ be independent, each taking the value 1 with probability p and 0 with probability 1 − p. Let two estimators of p be defined by T = (X₁ + X₂)/2 and S = X₁. Show that Eg(T − p) ≤ Eg(S − p) for any convex function g and for any p. Note that a function g(·) is convex if for any a < b and any 0 < λ < 1, λg(a) + (1 − λ)g(b) ≥ g[λa + (1 − λ)b]. (A more general theorem can be proved: in this model the sample mean is the best linear unbiased estimator of p in terms of an arbitrary convex loss function.)
6. (Section 7.2.3)
Let X₁, X₂, and X₃ be independent binary random variables taking 1 with probability p and 0 with probability 1 − p. Define two estimators p̂₁ = X̄ and p̂₂ = (X̄/2) + (1/4), where X̄ = (X₁ + X₂ + X₃)/3. For what values of p is the mean squared error of p̂₂ smaller than that of p̂₁?
7. (Section 7.2.3)
Let X₁ and X₂ be independent, and let each take the value 1 with probability p and 0 with probability 1 − p. Define the following two estimators of θ = p(1 − p) based on X₁ and X₂:

θ̂₁ = (X₁ + X₂ − 2X₁X₂)/2,
θ̂₂ = X₁(1 − X₂).
Which estimator do you prefer? Why?
8. (Section 7.2.3)
Let X₁, X₂, and X₃ be independently distributed as B(1, p) and let two estimators of p be defined as follows:

p̂₁ = (X₁ + X₂ − X₃)/2,
p̂₂ = X₁ + X₂ − X₃.
Obtain the mean squared errors of the two estimators. Can you say one estimator is better than the other?
9. (Section 7.2.5)
Suppose we define better in the following way: "Estimator X is better than Y in the estimation of θ if P(|X − θ| ≤ ε) ≥ P(|Y − θ| ≤ ε) for every ε > 0 and > for at least one value of ε." Consider the binary model: P(Xᵢ = 1) = p and P(Xᵢ = 0) = 1 − p. Show that the sample mean X̄ is not the best linear unbiased estimator. You may consider the special case where n = 2 and the true value of p is equal to 3/4.
10. (Section 7.3.1)
Suppose we want to estimate the probability that Stanford will win a football game, denoted by p. Suppose the only information we have about p consists of the forecasts of n people published in the Stanford Daily. Assume that these forecasts are independent and that each forecast is accurate with a known probability π. If r of them say Stanford will win, how would you estimate p? Justify your choice of estimator.
11. (Section 7.3.1)
Suppose the probability distribution of X and Y is given as follows:
P(X = 1) = p, P(X = 0) = 1 − p, P(Y = 1) = 1/2,
P(Y = 0) = 1/2, and X and Y are independent.
Define Z = X + Y. Supposing that twenty i.i.d. observations on Z yield "Z = 2" four times, "Z = 1" eight times, and "Z = 0" eight times, compute the maximum likelihood estimator of p. Note that we observe neither X nor Y.
12. (Section 7.3.1)
A proportion μ of n jurors always acquit everyone, regardless of whether a defendant has committed a crime or not. The remaining 1 − μ proportion of jurors acquit a defendant who has not committed a crime with probability 0.9 and acquit a criminal with probability 0.2. If it is known that the probability that a defendant has committed a crime is 0.5, find the maximum likelihood estimator of μ when we observe that r jurors have acquitted the defendant. If n = 5 and r = 3, what is your maximum likelihood estimate of μ?
13. (Section 7.3.1)
Let X ~ B(n, p). Find the maximum likelihood estimator of p based on a single observation on X, assuming you know a priori that 0 ≤ p ≤ 0.5. Derive its variance for the case of n = 3.
14. (Section 7.3.1)
Suppose the joint probability distribution of Xᵢ and Yᵢ is given as follows:

P(Xᵢ = 1) = α, P(Xᵢ = 0) = 1 − α,
P(Yᵢ = 1 | Xᵢ = 1) = 3/4, P(Yᵢ = 0 | Xᵢ = 1) = 1/4,
P(Yᵢ = 1 | Xᵢ = 0) = 1/2, P(Yᵢ = 0 | Xᵢ = 0) = 1/2.

(a) Given the i.i.d. sample Y₁, Y₂, . . . , Yₙ, find the maximum likelihood estimator of α.
(b) Find the exact mean and variance of the maximum likelihood estimator of α, assuming that n = 4 and the true value of α is 1.
15. (Section 7.3.2)
Let X₁, . . . , Xₙ be a sample drawn from the uniform distribution U[θ − 0.5, θ + 0.5]. Find the maximum likelihood estimator of θ.
16. (Section 7.3.2)
Suppose that Xᵢ − θ, i = 1, . . . , n, are i.i.d. with the common density f(x) = (1/2)exp(−|x|) (the Laplace or double exponential density).
(a) Show that the maximum likelihood estimator of θ is the same as the least absolute deviations estimator that minimizes ∑|Xᵢ − θ|.
(b) Show that it is also equal to the median of {Xᵢ}.
17. (Section 7.3.2)
Let X₁, . . . , Xₙ be a sample from the Cauchy distribution with the density f(x, θ) = {π[1 + (x − θ)²]}⁻¹.
(a) Show that if n = 1, the maximum likelihood estimator of θ is X₁.
(b) Show that if n = 2, the likelihood function has multiple maxima, and the maximum likelihood estimator is not unique.
18. (Section 7.3.2)
The density of X is given by

f(x) = 3/(4θ)  for 0 < x ≤ θ,
     = 1/(4θ)  for θ < x ≤ 2θ,
     = 0  otherwise.

Assuming that a sample of size 4 from this distribution yielded the observations 1, 2.5, 3.5, and 4, calculate the maximum likelihood estimator of θ.
19. (Section 7.3.2)
Let the density function of X be given by

f(x) = 2x/θ  for 0 < x ≤ θ,
     = 2(x − 1)/(θ − 1)  for θ < x ≤ 1,

where 0 < θ < 1. Supposing that two independent observations on X yield x₁ and x₂, derive the maximum likelihood estimator of θ. Assume x₁ < x₂.
20. (Section 7.3.2)
Show that the μ̂ and σ̂² obtained by solving (7.3.15) and (7.3.16) indeed maximize log L given by (7.3.14).
21. (Section 7.3.3)
Suppose that X₁, . . . , Xₙ are independent and that it is known that (Xᵢ)^λ − 10 has a standard normal distribution, i = 1, . . . , n. This is called the Box-Cox transformation. See Box and Cox (1964).
(a) Derive the second-round estimator λ₂ of the Newton-Raphson iteration (7.3.19), starting from the initial guess λ₁ = 1.
(b) For the following data, compute λ₂:
96, 125, 146, 76, 114, 69, 130, 119, 85, 106.
22. (Section 7.4.1)
Given f(x) = θ exp(−θx), x > 0, θ > 0,
(a) Find the maximum likelihood estimator of θ.
(b) Find the maximum likelihood estimator of EX.
(c) Show that the maximum likelihood estimator of EX is best unbiased.
23. (Section 7.4.1)
Suppose X ~ N(μ, 1) and Y ~ N(2μ, 1), independent of each other. Obtain the maximum likelihood estimator of μ based on N_x i.i.d. observations on X and N_y i.i.d. observations on Y, and show that it is best unbiased.
24. (Section 7.4.2)
Let {X₁ᵢ} and {X₂ᵢ}, i = 1, 2, . . . , n, be independent of each other and across i, each distributed as B(1, p). We are to observe X₁ᵢ − X₂ᵢ, i = 1, 2, . . . , n. Find the maximum likelihood estimator of p, assuming we know 0 < p < 0.5. Prove its consistency.
25. (Section 7.4.3)
Using a coin whose probability of a head, p, is unknown, we perform ten experiments. In each experiment we toss the coin until a head appears and record the number of tosses required. Suppose the experiments yielded the following sequence of numbers:
1, 3, 4, 1, 2, 2, 5, 1, 3, 3.
Compute the maximum likelihood estimator of p and an estimate of its asymptotic variance.
26. (Section 7.4.3)
Let {Xᵢ}, i = 1, 2, . . . , n, be a random sample on N(μ, μ), where we assume μ > 0. Obtain the maximum likelihood estimator of μ and prove its consistency. Also obtain its asymptotic variance and compare it with the variance of the sample mean.
27. (Section 7.4.3)
Let {Xᵢ}, i = 1, 2, . . . , 5, be i.i.d. N(μ, 1) and let {Yᵢ}, i = 1, 2, . . . , 5, be i.i.d. N(μ², 1). Assume that all the X's are independent of all the Y's. Suppose that the observed values of {Xᵢ} and {Yᵢ} are (−2, 0, 1, −3, −1) and (1, 1, 0, 2, −1.5), respectively. Calculate the maximum likelihood estimator of μ and an estimate of its asymptotic variance.
28. (Section 7.4.3)
It is known that in a certain criminal court those who have not committed a crime are always acquitted. It is also known that those who have committed a crime are acquitted with 0.2 probability and are convicted with 0.8 probability. If 30 people are acquitted among 100 people who are brought to the court, what is your estimate of the true proportion of people who have not committed a crime? Also obtain the estimate of the mean squared error of your estimator.
29. (Section 7.4.3)
Let X and Y be independent and distributed as N(μ, 1) and N(0, μ), respectively, where μ > 0. Derive the asymptotic variance of the maximum likelihood estimator of μ based on a combined sample of (X₁, X₂, . . . , Xₙ) and (Y₁, Y₂, . . . , Yₙ).
30. (Section 7.4.3)
Suppose that X has the Hardy-Weinberg distribution:

X = 1 with probability p²,
  = 2 with probability 2p(1 − p),
  = 3 with probability (1 − p)²,

where 0 < p < 1. Suppose we observe X = 1 three times, X = 2 four times, and X = 3 three times.
(a) Find the maximum likelihood estimate of p.
(b) Obtain an estimate of the variance of the maximum likelihood estimator.
(c) Show that the maximum likelihood estimator attains the Cramér-Rao lower bound in this model.
31. (Section 7.4.3)
In the same model as in Exercise 30, let Nᵢ be the number of times X = i in N trials. Prove the consistency of p̂₁ = √(N₁/N) and of p̂₂ = 1 − √(N₃/N), and obtain their asymptotic distributions as N goes to infinity.
32. (Section 7.4.3)
Let {Xᵢ}, i = 1, 2, . . . , n, be i.i.d. with P(X > t) = exp(−λt). Define θ = exp(−λ). Find the maximum likelihood estimator of θ and its asymptotic variance.
33. (Section 7.4.3)
Suppose f(x) = θ/(1 + x)^(1+θ), 0 < x < ∞, θ > 0. Find the maximum likelihood estimator of θ based on a sample of size n from f and obtain its asymptotic variance in two ways:
(a) Using an explicit formula for the maximum likelihood estimator.
(b) Using the Cramér-Rao lower bound.
Hint: E log(1 + X) = θ⁻¹, V log(1 + X) = θ⁻².
34. (Section 7.4.3)
Suppose f(x) = θ⁻¹ exp(−x/θ), x > 0, θ > 0. Observe a sample of size n from f. Compare the asymptotic variances of the following two estimators of θ:
(a) θ̂ = the maximum likelihood estimator (derive it).
(b) θ̃ = √(∑xᵢ²/(2n)).
35. (Section 7.4.3)
Suppose f(x) = 1/(b − a) for a < x < b. Observe a sample of size n from f. Compare the asymptotic variances of the following two estimators of θ = b − a:
(a) θ̂ = the maximum likelihood estimator (derive it).
(b) θ̃ = 2√(3∑(xᵢ − x̄)²/n).
36. (Section 7.4.3)
Let the joint distribution of X and Y be given as follows:
P(X = 1) = θ, P(X = 0) = 1 − θ,
P(Y = 1 | X = 1) = θ, P(Y = 0 | X = 1) = 1 − θ,
P(Y = 1 | X = 0) = 0.5, P(Y = 0 | X = 0) = 0.5,
where we assume 0.25 < θ < 1. Suppose we observe only Y and not X, and we see that Y = 1 occurs N₁ times in N trials. Find an explicit formula for the maximum likelihood estimator of θ and derive its asymptotic distribution.
37. (Section 7.4.3)
Suppose that P(X = 1) = (1 − θ)/3, P(X = 2) = (1 + θ)/3, and P(X = 3) = 1/3. Suppose X is observed N times, and let Nᵢ be the number of times X = i. Define θ̂₁ = 1 − (3N₁/N) and θ̂₂ = (3N₂/N) − 1. Compute their variances. Derive the maximum likelihood estimator and compute its asymptotic variance.
38. (Section 7.4.3)
A box contains cards on which are written the consecutive whole numbers 1 through N, where N is unknown. We are to draw cards at random from the box with replacement. Let Xᵢ denote the number obtained on the ith drawing.
(a) Find EXᵢ and VXᵢ.
(b) Define the estimator N̂ = 2X̄ − 1, where X̄ is the average value of the n numbers drawn. Find EN̂ and VN̂.
(c) If five drawings produced the numbers 411, 950, 273, 156, and 585, what is the numerical value of N̂? Do you think N̂ is a good estimator of N? Why or why not?
39. (Section 7.4.3)
Verify (7.4.15) in Examples 7.4.1 and 7.4.2.
Obtaining an estimate of a parameter is not the final purpose of statistical inference. Because we can never be certain that the true value of the parameter is exactly equal to an estimate, we would like to know how close the true value is likely to be to an estimated value, in addition to just obtaining the estimate. We would like to be able to make a statement such as "the true value is believed to lie within a certain interval with such and such confidence." This degree of confidence obviously depends on how good an estimator is. For example, suppose we want to know the true probability, p, of getting a head on a given coin, which may be biased in either direction. We toss it ten times and get five heads. Our point estimate using the sample mean is 1/2, but we must still allow for the possibility that p may be, say, 0.6 or 0.4, although we are fairly certain that p will not be 0.9 or 0.1. If we toss the coin 100 times and get 50 heads, we will have more confidence that p is very close to 1/2, because we will have, in effect, a better estimator.
More generally, suppose that θ̂(X₁, X₂, . . . , Xₙ) is a given estimator of a parameter θ based on the sample X₁, X₂, . . . , Xₙ. The estimator θ̂ summarizes the information concerning θ contained in the sample. The better the estimator, the more fully it captures the relevant information contained in the sample. How should we express the information contained in θ̂ about θ in the most meaningful way? Writing down the observed value of θ̂ is not enough; this is merely the act of point estimation. More information is contained in θ̂: namely, the smaller the mean squared error
of θ̂, the greater confidence we have that θ is close to the observed value of θ̂. Thus, given θ̂, we would like to know how much confidence we can have that θ lies in a given interval. This is an act of interval estimation and utilizes more information contained in θ̂.
Note that we have used the word confidence here and have deliberately avoided using the word probability. As discussed in Section 1.1, in classical statistics we use the word probability only when a probabilistic statement can be tested by repeated observations; therefore, we do not use it concerning parameters. The word confidence, however, has the same practical connotation as the word probability. In Section 8.3 we shall examine how the Bayesian statistician, who uses the word probability for any situation, carries out statistical inference. Although there are certain important differences, the classical and Bayesian methods of inference often lead to a conclusion that is essentially the same except for a difference in the choice of words. The classical statistician’s use of the word confidence may be somewhat like letting probability in through the back door.