# Properties of Estimators

(i) Unbiasedness

$\hat{\mu}$ is said to be unbiased for $\mu$ if and only if $E(\hat{\mu}) = \mu$.

For $\hat{\mu} = \bar{X}$, we have $E(\bar{X}) = \sum_{i=1}^{n} E(X_i)/n = \mu$, so $\bar{X}$ is unbiased for $\mu$. No distributional assumption is needed as long as the $X_i$'s are distributed with the same mean $\mu$. Unbiasedness means that "on the average" our estimator is on target. Let us explain this last statement. If we repeat our drawing of a random sample of 100 households, say 200 times, then we get 200 $\bar{X}$'s. Some of these $\bar{X}$'s will be above $\mu$, some below $\mu$, but their average should be very close to $\mu$. Since in real-life situations we observe only one random sample, there is little consolation if our observed $\bar{X}$ is far from $\mu$. But the larger $n$ is, the smaller is the dispersion of this $\bar{X}$, since $\text{var}(\bar{X}) = \sigma^2/n$, and the less likely it is that this $\bar{X}$ is very far from $\mu$. This leads us to the concept of efficiency.
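The repeated-sampling argument above can be sketched in a few lines of Python. This is only an illustration under assumed values: the Normal population, the mean of 50, the standard deviation of 10, and the function name `simulate_sample_means` are all hypothetical choices, not part of the text (which requires no distributional assumption).

```python
import random
import statistics

def simulate_sample_means(mu=50.0, sigma=10.0, n=100, replications=200, seed=0):
    """Draw `replications` random samples of size n and return their sample means.

    The Normal population and the parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(replications)
    ]

means = simulate_sample_means()
# The average of the 200 sample means should be close to mu = 50.
print("average of the sample means:", statistics.fmean(means))
# Their dispersion should be close to sigma^2 / n = 100 / 100 = 1.
print("variance of the sample means:", statistics.variance(means))
```

Individual sample means scatter above and below the assumed $\mu$, while their average and variance land near $\mu$ and $\sigma^2/n$.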

(ii) Efficiency

For two unbiased estimators, we compare their efficiencies by the ratio of their variances. We say that the one with lower variance is more efficient. For example, taking $\hat{\mu}_1 = X_1$ versus $\hat{\mu}_2 = \bar{X}$, both estimators are unbiased, but $\text{var}(\hat{\mu}_1) = \sigma^2$ whereas $\text{var}(\hat{\mu}_2) = \sigma^2/n$, and {the relative efficiency of $\hat{\mu}_1$ with respect to $\hat{\mu}_2$} $= \text{var}(\hat{\mu}_2)/\text{var}(\hat{\mu}_1) = 1/n$; see Figure 2.1. To compare all unbiased estimators, we find the one with minimum variance. Such an estimator, if it exists, is called the MVU (minimum variance unbiased) estimator. A lower bound for the variance of any unbiased estimator $\hat{\mu}$ of $\mu$ is known in the statistical literature as the Cramer-Rao lower bound, and is given by

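$$\text{var}(\hat{\mu}) \geq \frac{1}{n\,E\left[\left(\partial \log f(X;\mu)/\partial \mu\right)^2\right]} = \frac{-1}{n\,E\left[\partial^2 \log f(X;\mu)/\partial \mu^2\right]} \qquad (2.2)$$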

where we use either representation of the bound on the right hand side of (2.2) depending on which one is the simplest to derive.

Example 1: Consider the normal density

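$$\log f(X_i;\mu) = -\tfrac{1}{2}\log \sigma^2 - \tfrac{1}{2}\log 2\pi - \tfrac{1}{2}(X_i - \mu)^2/\sigma^2$$

$$\partial \log f(X_i;\mu)/\partial \mu = (X_i - \mu)/\sigma^2$$

$$\partial^2 \log f(X_i;\mu)/\partial \mu^2 = -(1/\sigma^2)$$

with $E\{\partial^2 \log f(X_i;\mu)/\partial \mu^2\} = -(1/\sigma^2)$. Therefore, the variance of any unbiased estimator of $\mu$, say $\hat{\mu}$, satisfies the property that $\text{var}(\hat{\mu}) \geq \sigma^2/n$.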
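As a rough numerical check that the two representations of the bound agree in the normal case, the sketch below estimates the expected squared score by simulation and compares it with the (constant) second derivative. The parameter values and the helper names `score_mu`, `hessian_mu` and `cramer_rao_bounds` are hypothetical, chosen only for illustration.

```python
import random
import statistics

def score_mu(x, mu, sigma2):
    """First derivative of log f(x; mu) with respect to mu for the normal density."""
    return (x - mu) / sigma2

def hessian_mu(x, mu, sigma2):
    """Second derivative of log f(x; mu) with respect to mu (does not depend on x)."""
    return -1.0 / sigma2

def cramer_rao_bounds(mu=2.0, sigma2=4.0, n=25, draws=100_000, seed=1):
    """Return the bound computed from E[score^2] and from -E[second derivative]."""
    rng = random.Random(seed)
    xs = [rng.gauss(mu, sigma2 ** 0.5) for _ in range(draws)]
    e_score_sq = statistics.fmean(score_mu(x, mu, sigma2) ** 2 for x in xs)
    e_hessian = statistics.fmean(hessian_mu(x, mu, sigma2) for x in xs)
    return 1.0 / (n * e_score_sq), -1.0 / (n * e_hessian)

bound_from_score, bound_from_hessian = cramer_rao_bounds()
# Both should be near sigma^2 / n = 4 / 25 = 0.16 for the assumed values.
print(bound_from_score, bound_from_hessian)
```

The second representation is exact here because the second derivative is a constant, which is why it is the simpler one to derive in this example.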

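Turning to $\sigma^2$, let $\theta = \sigma^2$; then

$$\log f(X_i;\theta) = -\tfrac{1}{2}\log \theta - \tfrac{1}{2}\log 2\pi - \tfrac{1}{2}(X_i - \mu)^2/\theta$$

$$\partial \log f(X_i;\theta)/\partial \theta = -1/2\theta + (X_i - \mu)^2/2\theta^2 = \{(X_i - \mu)^2 - \theta\}/2\theta^2$$

$$\partial^2 \log f(X_i;\theta)/\partial \theta^2 = 1/2\theta^2 - (X_i - \mu)^2/\theta^3 = \{\theta - 2(X_i - \mu)^2\}/2\theta^3$$

$E[\partial^2 \log f(X_i;\theta)/\partial \theta^2] = -(1/2\theta^2)$, since $E(X_i - \mu)^2 = \theta$. Hence, for any unbiased estimator of $\theta$, say $\hat{\theta}$, its variance satisfies the property $\text{var}(\hat{\theta}) \geq 2\theta^2/n$, or $\text{var}(\hat{\sigma}^2) \geq 2\sigma^4/n$.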

Note that if one finds an unbiased estimator whose variance attains the Cramer-Rao lower bound, then this is the MVU estimator. It is important to remember that this is only a lower bound, and sometimes it is not attained. If the $X_i$'s are normal, $\bar{X} \sim N(\mu, \sigma^2/n)$. Hence, $\bar{X}$ is unbiased for $\mu$ with variance $\sigma^2/n$ equal to the Cramer-Rao lower bound. Therefore, $\bar{X}$ is MVU for $\mu$. On the other hand,

$$\hat{\sigma}^2_{MLE} = \sum_{i=1}^{n} (X_i - \bar{X})^2/n$$

and it can be shown that $n\hat{\sigma}^2_{MLE}/(n-1) = s^2$ is unbiased for $\sigma^2$. In fact, $(n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$, and the expected value of a Chi-squared variable with $(n-1)$ degrees of freedom is exactly its degrees of freedom. Using this fact,

$$E\{(n-1)s^2/\sigma^2\} = E(\chi^2_{n-1}) = n - 1$$

Therefore, $E(s^2) = \sigma^2$.¹ Also, the variance of a Chi-squared variable with $(n-1)$ degrees of freedom is twice these degrees of freedom. Using this fact,

$$\text{var}\{(n-1)s^2/\sigma^2\} = \text{var}(\chi^2_{n-1}) = 2(n-1)$$

or

$$\{(n-1)^2/\sigma^4\}\,\text{var}(s^2) = 2(n-1).$$

Hence, $\text{var}(s^2) = 2\sigma^4/(n-1)$, and this does not attain the Cramer-Rao lower bound. In fact, it is larger than $2\sigma^4/n$. Note also that $\text{var}(\hat{\sigma}^2_{MLE}) = \{(n-1)^2/n^2\}\,\text{var}(s^2) = 2(n-1)\sigma^4/n^2$. This is smaller than $2\sigma^4/n$! How can that be? Remember that $\hat{\sigma}^2_{MLE}$ is a biased estimator of $\sigma^2$, and hence $\text{var}(\hat{\sigma}^2_{MLE})$ should not be compared with the Cramer-Rao lower bound. This lower bound pertains only to unbiased estimators.
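The ordering of these three quantities can be illustrated with a small Monte Carlo experiment. The sketch below assumes a standard Normal population, $n = 10$, and 20,000 replications; these values and the function name `simulate_variance_estimators` are hypothetical choices for illustration only.

```python
import random
import statistics

def simulate_variance_estimators(sigma=1.0, n=10, replications=20000, seed=3):
    """Simulate s^2 and the MLE of sigma^2; return their Monte Carlo variances."""
    rng = random.Random(seed)
    s2_draws, mle_draws = [], []
    for _ in range(replications):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        sum_sq = sum((x - xbar) ** 2 for x in sample)
        s2_draws.append(sum_sq / (n - 1))   # unbiased estimator s^2
        mle_draws.append(sum_sq / n)        # biased MLE of sigma^2
    return statistics.variance(s2_draws), statistics.variance(mle_draws)

var_s2, var_mle = simulate_variance_estimators()
cr_bound = 2 * 1.0 ** 4 / 10   # 2 sigma^4 / n for the assumed sigma = 1, n = 10

print("var(s^2)        :", var_s2)    # theory: 2 sigma^4 / (n - 1), above the bound
print("Cramer-Rao bound:", cr_bound)
print("var(MLE)        :", var_mle)   # theory: 2 (n - 1) sigma^4 / n^2, below the bound
```

The simulated variance of the biased MLE falls below the bound while that of $s^2$ sits above it, which is the pattern described in the text.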

Warning: Attaining the Cramer-Rao lower bound is only a sufficient condition for efficiency. Failing to satisfy this condition does not necessarily imply that the estimator is not efficient.

Example 2: For the Bernoulli case

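$$\log f(X_i;\theta) = X_i \log \theta + (1 - X_i)\log(1 - \theta)$$

$$\partial \log f(X_i;\theta)/\partial \theta = (X_i/\theta) - (1 - X_i)/(1 - \theta)$$

$$\partial^2 \log f(X_i;\theta)/\partial \theta^2 = (-X_i/\theta^2) - (1 - X_i)/(1 - \theta)^2$$

and $E[\partial^2 \log f(X_i;\theta)/\partial \theta^2] = (-1/\theta) - 1/(1 - \theta) = -1/[\theta(1 - \theta)]$. Therefore, for any unbiased estimator of $\theta$, say $\hat{\theta}$, its variance satisfies the following property:

$$\text{var}(\hat{\theta}) \geq \theta(1 - \theta)/n.$$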

For the Bernoulli random sample, we proved that $\mu = E(X_i) = \theta$. Similarly, it can be easily verified that $\sigma^2 = \text{var}(X_i) = \theta(1 - \theta)$. Hence, $\bar{X}$ has mean $\mu = \theta$ and $\text{var}(\bar{X}) = \sigma^2/n = \theta(1 - \theta)/n$. This means that $\bar{X}$ is unbiased for $\theta$ and it attains the Cramer-Rao lower bound. Therefore, $\bar{X}$ is MVU for $\theta$.
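For the Bernoulli case the mean and variance of $\bar{X}$ can be computed exactly, because $n\bar{X}$ is Binomial. The following sketch does this by summing over the binomial probabilities; the values $\theta = 0.3$ and $n = 20$ and the function name `exact_moments_of_sample_mean` are illustrative assumptions.

```python
from math import comb

def exact_moments_of_sample_mean(theta, n):
    """Exact mean and variance of X-bar for n Bernoulli(theta) draws.

    Uses the fact that the sum of the draws is Binomial(n, theta).
    """
    pmf = [comb(n, k) * theta ** k * (1 - theta) ** (n - k) for k in range(n + 1)]
    mean = sum(p * k / n for k, p in enumerate(pmf))
    var = sum(p * (k / n - mean) ** 2 for k, p in enumerate(pmf))
    return mean, var

theta, n = 0.3, 20
mean, var = exact_moments_of_sample_mean(theta, n)
cr_bound = theta * (1 - theta) / n

print("E(X-bar)        :", mean)      # agrees with theta up to rounding
print("var(X-bar)      :", var)       # agrees with the bound up to rounding
print("Cramer-Rao bound:", cr_bound)
```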

Unbiasedness and efficiency are finite sample properties (in other words, true for any finite sample size $n$). Once we let $n$ tend to $\infty$, we are in the realm of asymptotic properties.

Example 3: For a random sample from any distribution with mean $\mu$, it is clear that $\tilde{\mu} = (\bar{X} + 1/n)$ is not an unbiased estimator of $\mu$, since $E(\tilde{\mu}) = E(\bar{X} + 1/n) = \mu + 1/n$. However, as $n \to \infty$, $\lim E(\tilde{\mu})$ is equal to $\mu$. We say that $\tilde{\mu}$ is asymptotically unbiased for $\mu$.

Example 4: For the Normal case

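$$\hat{\sigma}^2_{MLE} = (n-1)s^2/n \quad \text{and} \quad E(\hat{\sigma}^2_{MLE}) = (n-1)\sigma^2/n.$$

But as $n \to \infty$, $\lim E(\hat{\sigma}^2_{MLE}) = \sigma^2$. Hence, $\hat{\sigma}^2_{MLE}$ is asymptotically unbiased for $\sigma^2$.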

Similarly, an estimator which attains the Cramer-Rao lower bound in the limit is asymptotically efficient. Note that $\text{var}(\bar{X}) = \sigma^2/n$, and this tends to zero as $n \to \infty$. Hence, we consider $\sqrt{n}\bar{X}$, which has finite variance since $\text{var}(\sqrt{n}\bar{X}) = n\,\text{var}(\bar{X}) = \sigma^2$. We say that the asymptotic variance of $\bar{X}$, denoted by asymp. $\text{var}(\bar{X})$, is $\sigma^2/n$ and that it attains the Cramer-Rao lower bound in the limit. $\bar{X}$ is therefore asymptotically efficient. Similarly,

$$\text{var}(\sqrt{n}\,\hat{\sigma}^2_{MLE}) = n\,\text{var}(\hat{\sigma}^2_{MLE}) = 2(n-1)\sigma^4/n$$

which tends to $2\sigma^4$ as $n \to \infty$. This means that asymp. $\text{var}(\hat{\sigma}^2_{MLE}) = 2\sigma^4/n$ and that it attains the Cramer-Rao lower bound in the limit. Therefore, $\hat{\sigma}^2_{MLE}$ is asymptotically efficient.

(iii) Consistency

Another asymptotic property is consistency. This says that as $n \to \infty$, $\lim \Pr[|\bar{X} - \mu| > c] = 0$ for any arbitrary positive constant $c$. In other words, $\bar{X}$ will not differ from $\mu$ as $n \to \infty$. Proving this property uses Chebyshev's inequality, which states in this context that

$$\Pr[|\bar{X} - \mu| > k\sigma_{\bar{X}}] \leq 1/k^2.$$

If we let $c = k\sigma_{\bar{X}}$, then $1/k^2 = \sigma^2_{\bar{X}}/c^2 = \sigma^2/nc^2$, and this tends to 0 as $n \to \infty$, since $\sigma^2$ and $c$ are finite positive constants. A sufficient condition for an estimator to be consistent is that it is asymptotically unbiased and that its variance tends to zero as $n \to \infty$.²

Example 1: For a random sample from any distribution with mean $\mu$ and variance $\sigma^2$, $E(\bar{X}) = \mu$ and $\text{var}(\bar{X}) = \sigma^2/n \to 0$ as $n \to \infty$; hence $\bar{X}$ is consistent for $\mu$.

Example 2: For the Normal case, we have shown that $E(s^2) = \sigma^2$ and $\text{var}(s^2) = 2\sigma^4/(n-1) \to 0$ as $n \to \infty$; hence $s^2$ is consistent for $\sigma^2$.

Example 3: For the Bernoulli case, we know that $E(\bar{X}) = \theta$ and $\text{var}(\bar{X}) = \theta(1-\theta)/n \to 0$ as $n \to \infty$; hence $\bar{X}$ is consistent for $\theta$.

Warning: This is only a sufficient condition for consistency. Failing to satisfy this condition does not necessarily imply that the estimator is inconsistent.
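The consistency of $\bar{X}$ and the Chebyshev bound $\sigma^2/nc^2$ can be illustrated by simulation. The sketch below is only a rough check under assumed values: a Normal population with $\mu = 0$ and $\sigma = 2$, a tolerance $c = 0.5$, and the hypothetical helper names `tail_probability` and `chebyshev_bound`.

```python
import random
import statistics

MU, SIGMA, C = 0.0, 2.0, 0.5   # assumed population mean, std. deviation and tolerance c

def tail_probability(n, replications=2000, seed=2):
    """Estimate Pr[|X-bar - mu| > c] by simulation for samples of size n."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(replications):
        xbar = statistics.fmean(rng.gauss(MU, SIGMA) for _ in range(n))
        if abs(xbar - MU) > C:
            exceed += 1
    return exceed / replications

def chebyshev_bound(n):
    """Chebyshev upper bound sigma^2 / (n c^2), capped at 1 since it bounds a probability."""
    return min(1.0, SIGMA ** 2 / (n * C ** 2))

for n in (10, 40, 160):
    print(n, tail_probability(n), chebyshev_bound(n))
```

The simulated tail probability shrinks as $n$ grows and stays below the Chebyshev bound, which is typically far from tight.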

(iv) Sufficiency

$\bar{X}$ is sufficient for $\mu$ if $\bar{X}$ contains all the information in the sample pertaining to $\mu$. In other words, $f(X_1, \ldots, X_n \mid \bar{X})$ is independent of $\mu$. To prove this fact one uses the factorization theorem due to Fisher and Neyman. In this context, $\bar{X}$ is sufficient for $\mu$ if and only if one can factorize the joint p.d.f.

$$f(X_1, \ldots, X_n; \mu) = h(\bar{X}; \mu) \cdot g(X_1, \ldots, X_n)$$

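where $h$ and $g$ are any two functions, with the latter being only a function of the $X$'s and independent of $\mu$ in form and in the domain of the $X$'s.

Example 1: For the Normal case, it is clear from equation (2.1) that by subtracting and adding $\bar{X}$ in the summation we can write, after some algebra,

$$f(X_1, \ldots, X_n; \mu, \sigma^2) = (1/2\pi\sigma^2)^{n/2}\, e^{-(1/2\sigma^2)\sum_{i=1}^{n}(X_i - \bar{X})^2}\, e^{-(n/2\sigma^2)(\bar{X} - \mu)^2}$$

Hence, $h(\bar{X}; \mu) = e^{-(n/2\sigma^2)(\bar{X} - \mu)^2}$ and $g(X_1, \ldots, X_n)$ is the remainder term, which is independent of $\mu$ in form. Also, $-\infty < X_i < \infty$ and hence independent of $\mu$ in the domain. Therefore, $\bar{X}$ is sufficient for $\mu$.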

Example 2: For the Bernoulli case,

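$$f(X_1, \ldots, X_n; \theta) = \theta^{n\bar{X}}(1 - \theta)^{n(1 - \bar{X})}, \qquad X_i = 0, 1 \text{ for } i = 1, \ldots, n.$$

Therefore, $h(\bar{X}, \theta) = \theta^{n\bar{X}}(1 - \theta)^{n(1 - \bar{X})}$ and $g(X_1, \ldots, X_n) = 1$, which is independent of $\theta$ in form and domain. Hence, $\bar{X}$ is sufficient for $\theta$.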
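Sufficiency in the Bernoulli case can also be checked directly from the definition: conditional on the sample sum (equivalently on $\bar{X}$), the probability of any particular sequence of zeros and ones should not depend on $\theta$. The sketch below enumerates all sequences for a small, assumed $n = 4$; the function name `conditional_probs_given_sum` is hypothetical.

```python
from itertools import product
from math import comb

def conditional_probs_given_sum(theta, n, total):
    """Pr[(X_1, ..., X_n) = x | sum of X_i = total] for every 0/1 sequence x with that sum."""
    joint = {
        x: theta ** sum(x) * (1 - theta) ** (n - sum(x))
        for x in product((0, 1), repeat=n)
        if sum(x) == total
    }
    norm = sum(joint.values())   # Pr[sum of X_i = total]
    return {x: p / norm for x, p in joint.items()}

for theta in (0.2, 0.7):
    probs = conditional_probs_given_sum(theta, n=4, total=2)
    # Every sequence with two ones has conditional probability 1 / C(4, 2),
    # whatever the value of theta.
    print(theta, set(round(p, 12) for p in probs.values()), round(1 / comb(4, 2), 12))
```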

Under certain regularity conditions on the distributions we are sampling from, one can show that the MVU of any parameter $\theta$ is an unbiased function of a sufficient statistic for $\theta$.³

The advantages of maximum likelihood estimators are that (i) they are sufficient estimators when they exist. (ii) They are asymptotically efficient. (iii) If the distribution of the MLE satisfies certain regularity conditions, then making the MLE unbiased results in a unique MVU estimator. A prime example of this is $s^2$, which was shown to be an unbiased estimator of $\sigma^2$ for a random sample drawn from the Normal distribution. It can be shown that $s^2$ is sufficient for $\sigma^2$ and that $(n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$. Hence, $s^2$ is an unbiased sufficient statistic for $\sigma^2$ and therefore it is MVU for $\sigma^2$, even though it does not attain the Cramer-Rao lower bound. (iv) Maximum likelihood estimates are invariant with respect to continuous transformations.

To explain the last property, consider the estimator of $e^{\mu}$. Given $\hat{\mu}_{MLE} = \bar{X}$, an obvious estimator is $e^{\hat{\mu}_{MLE}} = e^{\bar{X}}$. This is in fact the MLE of $e^{\mu}$. In general, if $g(\mu)$ is a continuous function of $\mu$, then $g(\hat{\mu}_{MLE})$ is the MLE of $g(\mu)$. Note that $E(e^{\hat{\mu}_{MLE}}) \neq e^{E(\hat{\mu}_{MLE})} = e^{\mu}$; in other words, expectations are not invariant to all continuous transformations, especially nonlinear ones, and hence the resulting MLE may not be unbiased. $e^{\bar{X}}$ is not unbiased for $e^{\mu}$ even though $\bar{X}$ is unbiased for $\mu$.
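The bias of $e^{\bar{X}}$ as an estimator of $e^{\mu}$ can be seen in a short simulation. The sketch below assumes a Normal population with $\mu = 0$, $\sigma = 1$ and a small sample size $n = 4$ so that the bias is visible; these values and the function name `mean_of_exp_xbar` are illustrative assumptions. Under normality, $E(e^{\bar{X}}) = e^{\mu + \sigma^2/2n}$, a standard lognormal result that is not stated in the text and is used here only as a reference value.

```python
import math
import random
import statistics

def mean_of_exp_xbar(mu=0.0, sigma=1.0, n=4, replications=20000, seed=4):
    """Monte Carlo estimate of E(exp(X-bar)) for Normal samples of size n."""
    rng = random.Random(seed)
    values = []
    for _ in range(replications):
        xbar = statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        values.append(math.exp(xbar))
    return statistics.fmean(values)

estimate = mean_of_exp_xbar()
print("simulated E(exp(X-bar))            :", estimate)
print("exp(mu)                            :", math.exp(0.0))
# Reference value under normality: exp(mu + sigma^2 / (2n)).
print("reference exp(mu + sigma^2 / (2n)) :", math.exp(0.0 + 1.0 / (2 * 4)))
```

The simulated mean lies above $e^{\mu}$, and the gap narrows as $n$ increases because $\sigma^2/2n$ tends to zero.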

In summary, there are two routes for finding the MVU estimator. One is systematically following the derivation of a sufficient statistic, proving that its distribution satisfies certain regularity conditions, and then making it unbiased for the parameter in question. Of course, MLE provides us with sufficient statistics, for example,

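$$X_1, \ldots, X_n \sim IIN(\mu, \sigma^2) \;\Rightarrow\; \hat{\mu}_{MLE} = \bar{X} \quad \text{and} \quad \hat{\sigma}^2_{MLE} = \sum_{i=1}^{n}(X_i - \bar{X})^2/n$$

are sufficient for $\mu$ and $\sigma^2$, respectively. $\bar{X}$ is unbiased for $\mu$, and its Normal distribution satisfies the regularity conditions needed for $\bar{X}$ to be MVU for $\mu$. $\hat{\sigma}^2_{MLE}$ is biased for $\sigma^2$, but $s^2 = n\hat{\sigma}^2_{MLE}/(n-1)$ is unbiased for $\sigma^2$, and $(n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$, which also satisfies the regularity conditions for $s^2$ to be an MVU estimator for $\sigma^2$.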

Alternatively, one finds the Cramer-Rao lower bound and checks whether the usual estimator (obtained from, say, the method of moments or the maximum likelihood method) achieves this lower bound. If it does, this estimator is efficient, and there is no need to search further. If it does not, the former strategy leads us to the MVU estimator. In fact, in the previous example $\bar{X}$ attains the Cramer-Rao lower bound, whereas $s^2$ does not. However, both are MVU for $\mu$ and $\sigma^2$, respectively.

Minimizing the risk when the loss function is quadratic is equivalent to minimizing the Mean Square Error (MSE). From its definition, $MSE(\hat{\theta}) = E(\hat{\theta} - \theta)^2 = \text{var}(\hat{\theta}) + [\text{bias}(\hat{\theta})]^2$, the MSE shows the trade-off between bias and variance. MVU theory sets the bias equal to zero and minimizes $\text{var}(\hat{\theta})$. In other words, it minimizes this risk function, but only over $\hat{\theta}$'s that are unbiased. If we do not restrict ourselves to unbiased estimators of $\theta$, minimizing MSE may result in a biased estimator such as $\hat{\theta}_2$, which beats $\hat{\theta}_1$ because the gain from its smaller variance outweighs the loss from its small bias; see Figure 2.2.

[Figure 2.2: Bias Versus Variance. Vertical axis: $f(\hat{\theta})$.]
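A concrete instance of this trade-off is the pair $s^2$ and $\hat{\sigma}^2_{MLE}$ from the Normal case. The sketch below combines the variances derived earlier with the bias of the MLE, $E(\hat{\sigma}^2_{MLE}) - \sigma^2 = -\sigma^2/n$, to compute both mean square errors. The function name `mse_of_variance_estimators` and the chosen values of $\sigma^2$ and $n$ are hypothetical, and the comparison is offered as an illustration rather than as the example the figure depicts.

```python
def mse_of_variance_estimators(sigma2, n):
    """Return (MSE of s^2, MSE of the MLE of sigma^2) under normality.

    MSE = variance + bias^2, using the variance formulas from the Normal case.
    """
    sigma4 = sigma2 ** 2
    # s^2 is unbiased, so its MSE equals its variance.
    mse_s2 = 2 * sigma4 / (n - 1)
    # The MLE has bias -sigma^2 / n and variance 2 (n - 1) sigma^4 / n^2.
    bias_mle = -sigma2 / n
    var_mle = 2 * (n - 1) * sigma4 / n ** 2
    mse_mle = var_mle + bias_mle ** 2
    return mse_s2, mse_mle

for n in (5, 20, 100):
    mse_s2, mse_mle = mse_of_variance_estimators(sigma2=1.0, n=n)
    print(n, mse_s2, mse_mle)
```

For each $n$ shown, the biased MLE has the smaller MSE: its squared bias $\sigma^4/n^2$ is more than offset by its lower variance.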