Definition of Best

Before we prove that the least squares estimator is best linear unbiased, we must define the term best. First we shall define it for scalar estimators, then for vector estimators.

Definition 1.2.1. Let $\hat{\theta}$ and $\theta^*$ be scalar estimators of a scalar parameter $\theta$. The estimator $\hat{\theta}$ is said to be at least as good as (or at least as efficient as) the estimator $\theta^*$ if $E(\hat{\theta} - \theta)^2 \le E(\theta^* - \theta)^2$ for all parameter values. The estimator $\hat{\theta}$ is said to be better (or more efficient) than the estimator $\theta^*$ if $\hat{\theta}$ is at least as good as $\theta^*$ and $E(\hat{\theta} - \theta)^2 < E(\theta^* - \theta)^2$ for at least one parameter value. An estimator is said to be best (or efficient) in a class if it is better than any other estimator in the class.

The mean squared error is a reasonable criterion in many situations and is mathematically convenient. So, following the convention of the statistical literature, we have defined “better” to mean “having a smaller mean squared error.” However, there may be situations in which a researcher wishes to use other criteria, such as the mean absolute error.
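The requirement "for all parameter values" in Definition 1.2.1 can be illustrated with a small sketch. The setting below is an assumption made for illustration and is not taken from the text: the mean $\mu$ of a population with variance `sigma2` is estimated from `n` independent observations, once by the sample mean and once by the sample mean multiplied by a hypothetical shrinkage factor `c`. The mean squared errors are written in closed form (variance plus squared bias), and the comparison is checked only on a finite grid of parameter values.

```python
def mse_sample_mean(mu, sigma2=1.0, n=10):
    """MSE of the sample mean: it is unbiased, so the MSE is its variance sigma2/n."""
    return sigma2 / n


def mse_shrunk_mean(mu, sigma2=1.0, n=10, c=0.9):
    """MSE of c * (sample mean): variance c^2*sigma2/n plus squared bias (1-c)^2*mu^2."""
    return c ** 2 * sigma2 / n + (1.0 - c) ** 2 * mu ** 2


def at_least_as_good(mse_a, mse_b, grid):
    """Check MSE_a <= MSE_b at every parameter value in a finite grid (illustrative only)."""
    return all(mse_a(mu) <= mse_b(mu) for mu in grid)


# Hypothetical grid of values for mu, from -3.0 to 3.0 in steps of 0.1.
grid = [m / 10 for m in range(-30, 31)]
print(at_least_as_good(mse_sample_mean, mse_shrunk_mean, grid))
print(at_least_as_good(mse_shrunk_mean, mse_sample_mean, grid))
```

Under these assumed values both checks return `False`: the shrunk estimator has the smaller mean squared error near $\mu = 0$, the sample mean has the smaller one for large $|\mu|$, so neither is at least as good as the other in the sense of Definition 1.2.1.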

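Definition 1.2.2. Let $\hat{\theta}$ and $\theta^*$ be estimators of a vector parameter $\theta$. Let $A$ and $B$ be their respective mean squared error matrices; that is, $A = E(\hat{\theta} - \theta)(\hat{\theta} - \theta)'$ and $B = E(\theta^* - \theta)(\theta^* - \theta)'$. Then we say $\hat{\theta}$ is better (or more efficient) than $\theta^*$ if

$$c'(B - A)c \ge 0 \quad \text{for every vector } c \text{ and every parameter value} \tag{1.2.19}$$

and

$$c'(B - A)c > 0 \quad \text{for at least one value of } c \text{ and at least one value of the parameter.} \tag{1.2.20}$$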

This definition of better clearly coincides with Definition 1.2.1 if $\theta$ is a scalar. In view of Definition 1.2.1, equivalent forms of statements (1.2.19) and (1.2.20) are statements (1.2.21) and (1.2.22):

$$c'\hat{\theta} \text{ is at least as good as } c'\theta^* \text{ for every vector } c \tag{1.2.21}$$

and

$$c'\hat{\theta} \text{ is better than } c'\theta^* \text{ for at least one value of } c. \tag{1.2.22}$$

Using Theorem 4 of Appendix 1, they also can be written as

$$B \ge A \quad \text{for every parameter value} \tag{1.2.23}$$

and

$$B \ne A \quad \text{for at least one parameter value.} \tag{1.2.24}$$

(Note that $B \ge A$ means $B - A$ is nonnegative definite and $B > A$ means $B - A$ is positive definite.)

We shall now prove the equivalence of (1.2.20) and (1.2.24). Because the phrase “for at least one parameter value” is common to both statements, we shall ignore it in the following proof. First, suppose (1.2.24) is not true. Then $B = A$. Therefore $c'(B - A)c = 0$ for every $c$, a condition that implies that (1.2.20) is not true. Second, suppose (1.2.20) is not true. Then, because (1.2.19) holds, $c'(B - A)c = 0$ for every $c$, and every diagonal element of $B - A$ must be 0 (choose $c$ to be the zero vector, except for 1 in the $i$th position). Also, the $i,j$th element of $B - A$ is 0 (choose $c$ to be the zero vector, except for 1 in the $i$th and $j$th positions, and note that $B - A$ is symmetric). Thus (1.2.24) is not true. This completes the proof.
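Conditions (1.2.23) and (1.2.24) can be checked numerically for a given pair of mean squared error matrices. The following is a minimal sketch, assuming NumPy is available; the function names `is_nonneg_definite` and `is_better_at_point` and the tolerance `tol` are hypothetical choices made here. It tests a single pair of matrices, that is, a single parameter value, whereas the definition requires (1.2.23) to hold at every parameter value.

```python
import numpy as np


def is_nonneg_definite(M, tol=1e-10):
    """True if the symmetric matrix M has no eigenvalue below -tol."""
    M = np.asarray(M, dtype=float)
    return bool(np.all(np.linalg.eigvalsh(M) >= -tol))


def is_better_at_point(A, B, tol=1e-10):
    """True if B - A is nonnegative definite and B != A for this one pair of matrices."""
    D = np.asarray(B, dtype=float) - np.asarray(A, dtype=float)
    return is_nonneg_definite(D, tol) and not np.allclose(D, 0.0, atol=tol)


# Hypothetical matrices whose difference is nonnegative definite but singular.
A = np.diag([1.0, 0.0])
B = np.diag([2.0, 0.0])
print(is_better_at_point(A, B))   # A is ranked above B
print(is_better_at_point(B, A))   # the reverse ranking fails
print(bool(np.all(np.linalg.eigvalsh(B - A) > 0)))  # B - A is not positive definite
```

In this assumed example the estimator with matrix `A` is ranked above the one with matrix `B` even though `B - A` is singular, which a requirement of positive definiteness would not permit.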

Note that replacing $B \ne A$ in (1.2.24) with $B > A$, or making the corresponding change in (1.2.20) or (1.2.22), is unwise because we could not then rank one estimator higher than another whenever the difference of their mean squared error matrices is nonnegative definite and nonzero but singular.


A problem with Definition 1.2.2 (more precisely, a problem inherent in the comparison of vector estimates rather than in this definition) is that often it does not allow us to say one estimator is either better or worse than the other. For example, consider

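$$A = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \quad \text{and} \quad B = \begin{bmatrix} 2 & 0 \\ 0 & \tfrac{1}{2} \end{bmatrix}. \tag{1.2.25}$$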

Clearly, neither $A \ge B$ nor $B \ge A$. In such a case one might compare the trace and conclude that $\hat{\theta}$ is better than $\theta^*$ because $\operatorname{tr} A < \operatorname{tr} B$. Another example is

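$$A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \quad \text{and} \quad B = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}. \tag{1.2.26}$$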

Again, neither $A \ge B$ nor $B \ge A$. If one were using the determinant as the criterion, one would prefer $\hat{\theta}$ over $\theta^*$ because $\det A < \det B$.

Note that $B \ge A$ implies both $\operatorname{tr} B \ge \operatorname{tr} A$ and $\det B \ge \det A$. The first follows from Theorem 7 and the second from Theorem 11 of Appendix 1. As these two examples show, neither $\operatorname{tr} B \ge \operatorname{tr} A$ nor $\det B \ge \det A$ implies $B \ge A$.
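The two examples can be checked numerically. The sketch below assumes NumPy. The matrix entries are assumptions chosen to match the description in the text (neither matrix dominates the other, with the trace favoring $A$ in the first pair and the determinant favoring $A$ in the second), and the helper `summarize` is a hypothetical name.

```python
import numpy as np

# Assumed entries for the two example pairs (illustrative values).
A1 = np.array([[1.0, 0.0], [0.0, 1.0]])
B1 = np.array([[2.0, 0.0], [0.0, 0.5]])
A2 = np.array([[2.0, 1.0], [1.0, 2.0]])
B2 = np.array([[2.0, 0.0], [0.0, 2.0]])


def summarize(A, B):
    """Eigenvalues of B - A plus the trace and determinant of each matrix."""
    eig = np.linalg.eigvalsh(B - A)
    return {
        "eigenvalues_of_B_minus_A": eig,
        # Comparable means B - A or A - B is nonnegative definite.
        "comparable": bool(np.all(eig >= 0) or np.all(eig <= 0)),
        "trace": (float(np.trace(A)), float(np.trace(B))),
        "det": (float(np.linalg.det(A)), float(np.linalg.det(B))),
    }


for name, (A, B) in {"first pair": (A1, B1), "second pair": (A2, B2)}.items():
    print(name, summarize(A, B))
```

With these assumed entries, `B - A` has one positive and one negative eigenvalue in both pairs, so neither matrix dominates. In the first pair the traces differ while the determinants are equal, and in the second pair the determinants differ while the traces are equal, so the two scalar criteria need not agree with each other.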

Use of the trace as a criterion has an obvious intuitive appeal, inasmuch as it is the sum of the individual variances. Justification for the use of the determinant involves more complicated reasoning. Suppose $\hat{\theta} \sim N(\theta, V)$, where $V$ is the variance-covariance matrix of $\hat{\theta}$. Then, by Theorem 1 of Appendix 2, $(\hat{\theta} - \theta)'V^{-1}(\hat{\theta} - \theta) \sim \chi^2_K$, the chi-square distribution with $K$ degrees of freedom, $K$ being the number of elements of $\theta$. Therefore the $100(1 - \alpha)\%$ confidence ellipsoid for $\theta$ is defined by

$$\{\theta \mid (\hat{\theta} - \theta)'V^{-1}(\hat{\theta} - \theta) < \chi^2_K(\alpha)\}, \tag{1.2.27}$$

where $\chi^2_K(\alpha)$ is the number such that $P[\chi^2_K \ge \chi^2_K(\alpha)] = \alpha$. Then the volume of the ellipsoid (1.2.27) is proportional to the square root of the determinant of $V$, and hence increasing in it, as shown by Anderson (1958, p. 170).
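The link between the volume of the confidence ellipsoid and $\det V$ can be sketched numerically. The code below relies on the standard formula for the volume of the ellipsoid $\{x \mid x'V^{-1}x \le c\}$ in $K$ dimensions, namely the volume of the unit ball times $c^{K/2}\sqrt{\det V}$. The function name `ellipsoid_volume` is hypothetical, and the critical value `c` is passed in directly as an assumed number instead of being computed from the chi-square distribution.

```python
import math

import numpy as np


def ellipsoid_volume(V, c):
    """Volume of {x : x' V^{-1} x <= c} for a symmetric positive definite matrix V."""
    V = np.asarray(V, dtype=float)
    K = V.shape[0]
    # Volume of the K-dimensional unit ball.
    unit_ball = math.pi ** (K / 2) / math.gamma(K / 2 + 1)
    return unit_ball * c ** (K / 2) * math.sqrt(np.linalg.det(V))


# Two hypothetical covariance matrices with determinants 1 and 4.
V_small = np.eye(2)
V_large = np.diag([4.0, 1.0])
print(ellipsoid_volume(V_small, 1.0))
print(ellipsoid_volume(V_large, 1.0))
```

For a fixed critical value, quadrupling the determinant doubles the volume in this example, so ranking estimators by the volume of the confidence ellipsoid gives the same order as ranking them by the determinant.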

A more intuitive justification for the determinant criterion is possible for the case in which $\theta$ is a two-dimensional vector. Let the mean squared error matrix of an estimator $\hat{\theta} = (\hat{\theta}_1, \hat{\theta}_2)'$ be

$$\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}.$$

Suppose that $\theta_2$ is known; define another estimator of $\theta_1$ by $\tilde{\theta}_1 = \hat{\theta}_1 + \alpha(\hat{\theta}_2 - \theta_2)$. Its mean squared error is $a_{11} + \alpha^2 a_{22} + 2\alpha a_{12}$ and attains the minimum value of $a_{11} - (a_{12}^2/a_{22})$ when $\alpha = -a_{12}/a_{22}$. The larger $a_{12}$ is in absolute value, the more efficient can be estimation of $\theta_1$. Because a larger $|a_{12}|$ implies a smaller determinant, the preceding reasoning provides a justification for the determinant criterion.
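The minimization step can be written out explicitly. In the sketch below, $\tilde{\theta}_1$ denotes the adjusted estimator and $D = a_{11}a_{22} - a_{12}^2$ the determinant of the mean squared error matrix; both labels are assumptions introduced here for illustration, as are the conditions $a_{12} = a_{21}$ and $a_{22} > 0$.

```latex
\begin{align*}
E(\tilde{\theta}_1 - \theta_1)^2
  &= E\bigl[(\hat{\theta}_1 - \theta_1) + \alpha(\hat{\theta}_2 - \theta_2)\bigr]^2
   = a_{11} + 2\alpha a_{12} + \alpha^2 a_{22}, \\
\frac{d}{d\alpha}\bigl(a_{11} + 2\alpha a_{12} + \alpha^2 a_{22}\bigr)
  &= 2a_{12} + 2\alpha a_{22} = 0
   \quad\Longrightarrow\quad \alpha = -\frac{a_{12}}{a_{22}}, \\
\min_{\alpha} E(\tilde{\theta}_1 - \theta_1)^2
  &= a_{11} - \frac{a_{12}^2}{a_{22}}
   = \frac{a_{11}a_{22} - a_{12}^2}{a_{22}}
   = \frac{D}{a_{22}}.
\end{align*}
```

For fixed $a_{11}$ and $a_{22}$, the minimum mean squared error is therefore proportional to the determinant, which is why a smaller determinant corresponds to more efficient estimation of $\theta_1$.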

Despite these justifications, both criteria inevitably suffer from a certain degree of arbitrariness and therefore should be used with caution. Another useful scalar criterion based on the predictive efficiency of an estimator will be proposed in Section 1.6.
