# Least-Squares Estimation

Observe that

$$
\begin{aligned}
E[(Y - X\theta)^T(Y - X\theta)] &= E\big[(U + X(\theta_0 - \theta))^T(U + X(\theta_0 - \theta))\big] \\
&= E[U^T U] + 2(\theta_0 - \theta)^T E\big(X^T E[U \mid X]\big) \\
&\quad + (\theta_0 - \theta)^T \big(E[X^T X]\big)(\theta_0 - \theta) \\
&= n \cdot \sigma^2 + (\theta_0 - \theta)^T \big(E[X^T X]\big)(\theta_0 - \theta).
\end{aligned} \tag{5.33}
$$
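The decomposition (5.33) can be checked numerically. The sketch below is a Monte Carlo illustration, with an assumed sample size, assumed coefficients, and an assumed error variance (all names and values hypothetical): it estimates the expected sum of squares at an arbitrary candidate $\theta$ and compares it with the right-hand side of (5.33).

```python
import numpy as np

rng = np.random.default_rng(7)
n, k, sigma = 10, 2, 1.0                 # illustrative sample size, regressors, error sd
theta0 = np.array([1.0, -1.0])           # true parameter (assumed)
theta = np.array([0.5, 0.0])             # arbitrary candidate parameter

reps = 20000
X = rng.normal(size=(reps, n, k))        # one random design per replication
U = rng.normal(scale=sigma, size=(reps, n))

# Y - X theta = U + X (theta0 - theta), as in the first line of (5.33)
E = U + np.einsum('rnk,k->rn', X, theta0 - theta)
mc = (E * E).sum(axis=1).mean()          # Monte Carlo E[(Y - X theta)'(Y - X theta)]

# Right-hand side of (5.33): n sigma^2 + (theta0 - theta)' E[X'X] (theta0 - theta)
EXX = np.einsum('rnk,rnl->kl', X, X) / reps
formula = n * sigma**2 + (theta0 - theta) @ EXX @ (theta0 - theta)
assert np.isclose(mc, formula, rtol=0.02)
```

Note that the cross term of (5.33) drops out in the simulation average exactly because $E[U \mid X] = 0$.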

Hence, it follows from (5.33) that

$$
\theta_0 = \underset{\theta \in \mathbb{R}^k}{\operatorname{argmin}}\, E[(Y - X\theta)^T(Y - X\theta)] = \big(E[X^T X]\big)^{-1} E[X^T Y], \tag{5.34}
$$

provided that the matrix $E[X^T X]$ is nonsingular. However, the nonsingularity of the distribution of $Z_j = (Y_j, X_j^T)^T$ guarantees that $E[X^T X]$ is nonsingular because it follows from Theorem 5.5 that the solution (5.34) is unique if $\Sigma_{XX} = \mathrm{Var}(X_j)$ is nonsingular.

The expression (5.34) suggests estimating $\theta_0$ by the ordinary least-squares estimator

$$
\hat\theta = \underset{\theta \in \mathbb{R}^k}{\operatorname{argmin}}\, (Y - X\theta)^T(Y - X\theta) = (X^T X)^{-1} X^T Y. \tag{5.35}
$$
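A minimal numerical sketch of the OLS formula (5.35), using NumPy on simulated data; the sample size, coefficient values, and noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3            # sample size and number of regressors (assumed)
sigma = 0.5              # error standard deviation (assumed)

# Design matrix with an intercept column, true coefficients theta0, errors U
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
theta0 = np.array([1.0, 2.0, -0.5])
U = rng.normal(scale=sigma, size=n)
Y = X @ theta0 + U

# OLS estimator (5.35): theta_hat = (X'X)^{-1} X'Y.
# Solving the normal equations is preferred over forming the inverse explicitly.
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against the library least-squares routine
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(theta_hat, theta_lstsq)
```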

It follows easily from (5.32) and (5.35) that

$$
\hat\theta - \theta_0 = (X^T X)^{-1} X^T U; \tag{5.36}
$$

hence, $\hat\theta$ is conditionally unbiased: $E[\hat\theta \mid X] = \theta_0$, and therefore also unconditionally unbiased: $E[\hat\theta] = \theta_0$. More generally,

$$
\hat\theta \mid X \sim N_k\big[\theta_0,\ \sigma^2 (X^T X)^{-1}\big]. \tag{5.37}
$$

Of course, the unconditional distribution of $\hat\theta$ is not normal.
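The identity (5.36) holds exactly for every realization, and conditional unbiasedness can be seen by holding $X$ fixed and redrawing $U$. A sketch (all parameter values assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, sigma = 200, 4, 1.0                # illustrative values
X = rng.normal(size=(n, k))              # design held fixed: we condition on X
theta0 = rng.normal(size=k)
U = rng.normal(scale=sigma, size=n)
Y = X @ theta0 + U

theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Identity (5.36): the estimation error is this exact linear function of U
assert np.allclose(theta_hat - theta0, np.linalg.solve(X.T @ X, X.T @ U))

# Conditional unbiasedness: average theta_hat over fresh error draws, X fixed
draws = np.array([
    np.linalg.solve(X.T @ X, X.T @ (X @ theta0 + rng.normal(scale=sigma, size=n)))
    for _ in range(2000)
])
assert np.allclose(draws.mean(axis=0), theta0, atol=0.05)
```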

Note that the OLS estimator is not efficient because $\sigma^2\big(E[X^T X]\big)^{-1}$ is the Cramér–Rao lower bound of an unbiased estimator of $\theta_0$, and $\mathrm{Var}(\hat\theta) = \sigma^2 E\big[(X^T X)^{-1}\big] \neq \sigma^2\big(E[X^T X]\big)^{-1}$. However, the OLS estimator is the most efficient of all conditionally unbiased estimators $\tilde\theta$ of $\theta_0$ that are linear functions of $Y$. In other words, the OLS estimator is the best linear unbiased estimator (BLUE). This result is known as the Gauss–Markov theorem:

Theorem 5.16: (Gauss–Markov theorem) Let $C(X)$ be a $k \times n$ matrix whose elements are Borel-measurable functions of the random elements of $X$, and let $\tilde\theta = C(X)\,Y$. If $E[\tilde\theta \mid X] = \theta_0$, then for some positive semidefinite $k \times k$ matrix $D$, $\mathrm{Var}[\tilde\theta \mid X] = \sigma^2 C(X) C(X)^T = \sigma^2 (X^T X)^{-1} + D$.

Proof: The conditional unbiasedness condition implies that $C(X)X = I_k$; hence, $\tilde\theta = \theta_0 + C(X)U$, and thus $\mathrm{Var}(\tilde\theta \mid X) = \sigma^2 C(X) C(X)^T$. Now

$$
\begin{aligned}
D &= \sigma^2\big[C(X) C(X)^T - (X^T X)^{-1}\big] \\
&= \sigma^2\big[C(X) C(X)^T - C(X) X (X^T X)^{-1} X^T C(X)^T\big] \\
&= \sigma^2 C(X)\big[I_n - X (X^T X)^{-1} X^T\big] C(X)^T = \sigma^2 C(X) M C(X)^T,
\end{aligned}
$$

say, where the second equality follows from the unbiasedness condition $C(X)X = I_k$. The matrix

$$
M = I_n - X (X^T X)^{-1} X^T \tag{5.38}
$$

is idempotent; hence, its eigenvalues are either 1 or 0. Because all the eigenvalues are nonnegative, $M$ is positive semidefinite, and so is $C(X) M C(X)^T$. Q.E.D.
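The properties of $M$ used in this proof are easy to verify numerically. A sketch with an assumed random design matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 30, 3                      # illustrative sizes
X = rng.normal(size=(n, k))

# The matrix M = I_n - X (X'X)^{-1} X' of equation (5.38)
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

# Idempotency: M M = M
assert np.allclose(M @ M, M)

# Eigenvalues of a symmetric idempotent matrix are 0 or 1
eig = np.linalg.eigvalsh(M)       # ascending order
assert np.allclose(eig[:k], 0) and np.allclose(eig[k:], 1)

# trace(M) = n - k, consistent with the rank computation later in the text
assert np.isclose(np.trace(M), n - k)
```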

Next, we need an estimator of the error variance $\sigma^2$. If we observed the errors $U_j$, then we could use the sample variance $S^2 = (1/(n-1)) \sum_{j=1}^n (U_j - \bar U)^2$ of the $U_j$'s as an unbiased estimator. This suggests using the OLS residuals,

$$
\hat U_j = Y_j - X_j^T \hat\theta, \quad j = 1, \dots, n, \tag{5.39}
$$

instead of the actual errors $U_j$ in this sample variance. Taking into account that

$$
(1/n) \sum_{j=1}^n \hat U_j = 0, \tag{5.40}
$$

this sample variance becomes $(1/(n-1)) \sum_{j=1}^n \hat U_j^2$. However, this estimator is not unbiased, but a minor correction will yield an unbiased estimator of $\sigma^2$, namely,

$$
S^2 = (1/(n-k)) \sum_{j=1}^n \hat U_j^2, \tag{5.41}
$$

which is called the OLS estimator of $\sigma^2$. The unbiasedness of this estimator is a by-product of the following more general result, which is related to the result of Theorem 5.13.
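A sketch of the residual and variance computations (5.39)–(5.41) on simulated data. The parameter values are illustrative assumptions, and the residuals below sum to zero because an intercept column is included in the design:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, sigma = 50, 3, 2.0          # illustrative values
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # intercept included
theta0 = np.array([0.5, -1.0, 3.0])
Y = X @ theta0 + rng.normal(scale=sigma, size=n)

theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ theta_hat          # OLS residuals, as in (5.39)

# With an intercept in X the residuals average to zero, as in (5.40)
assert np.isclose(resid.mean(), 0.0)

# OLS variance estimator (5.41): divide by n - k rather than n - 1
S2 = resid @ resid / (n - k)
```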

Theorem 5.17: Conditional on $X$ as well as unconditionally, $(n - k) \cdot S^2 / \sigma^2 \sim \chi^2_{n-k}$; hence, $E[S^2] = \sigma^2$.

Proof: Observe that

$$
\begin{aligned}
\sum_{j=1}^n \hat U_j^2 &= \big(U - X(\hat\theta - \theta_0)\big)^T \big(U - X(\hat\theta - \theta_0)\big) \\
&= U^T U - 2 U^T X (\hat\theta - \theta_0) + (\hat\theta - \theta_0)^T X^T X (\hat\theta - \theta_0) \\
&= U^T U - U^T X (X^T X)^{-1} X^T U = U^T M U,
\end{aligned} \tag{5.42}
$$

where the last two equalities follow from (5.36) and (5.38), respectively. Because the matrix M is idempotent with rank

$$
\begin{aligned}
\mathrm{rank}(M) = \mathrm{trace}(M) &= \mathrm{trace}(I_n) - \mathrm{trace}\big(X (X^T X)^{-1} X^T\big) \\
&= \mathrm{trace}(I_n) - \mathrm{trace}\big((X^T X)^{-1} X^T X\big) = n - k,
\end{aligned}
$$

it follows from Theorem 5.10 that, conditional on $X$, (5.42) divided by $\sigma^2$ has a $\chi^2_{n-k}$ distribution:

$$
\sum_{j=1}^n \hat U_j^2 / \sigma^2 \,\Big|\, X \sim \chi^2_{n-k}. \tag{5.43}
$$

It is left as an exercise to prove that (5.43) also implies that the unconditional distribution of (5.42) divided by $\sigma^2$ is $\chi^2_{n-k}$:

$$
\sum_{j=1}^n \hat U_j^2 / \sigma^2 \sim \chi^2_{n-k}. \tag{5.44}
$$

Because the expectation of the $\chi^2_{n-k}$ distribution is $n - k$, it follows from (5.44) that the OLS estimator (5.41) of $\sigma^2$ is unbiased. Q.E.D.
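Theorem 5.17 can be illustrated by simulation: with $X$ held fixed, $(n-k) S^2 / \sigma^2$ should match the first two moments of the $\chi^2_{n-k}$ distribution. A sketch, with an assumed design and assumed parameter values:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, sigma = 20, 4, 1.5                 # illustrative values
X = rng.normal(size=(n, k))              # fixed design across replications
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)   # the matrix (5.38)

reps = 20000
U = rng.normal(scale=sigma, size=(reps, n))
S2 = ((U @ M) * U).sum(axis=1) / (n - k)  # S^2 = U'MU/(n-k) per replication

# E[S^2] = sigma^2 (Theorem 5.17)
assert np.isclose(S2.mean(), sigma**2, rtol=0.02)

# (n-k) S^2 / sigma^2 should have the chi-square(n-k) mean n-k and variance 2(n-k)
q = (n - k) * S2 / sigma**2
assert np.isclose(q.mean(), n - k, rtol=0.02)
assert np.isclose(q.var(), 2 * (n - k), rtol=0.1)
```

Note that $S^2 = U^T M U / (n-k)$ does not depend on $\theta_0$ at all, which is why the simulation can set the regression part aside.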

Next, observe from (5.38) that $X^T M = O$, and thus by Theorem 5.7, $(X^T X)^{-1} X^T U$ and $U^T M U$ are independent conditionally on $X$; that is,

$$
P[X^T U \le x \ \text{and}\ U^T M U \le z \mid X] = P[X^T U \le x \mid X] \cdot P[U^T M U \le z \mid X], \quad \forall\, x \in \mathbb{R}^k,\ z \ge 0.
$$

Consequently,

Theorem 5.18: Conditional on $X$, $\hat\theta$ and $S^2$ are independent, but unconditionally they can be dependent.
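The conditional independence in Theorem 5.18 implies, in particular, zero conditional correlation between each component of $\hat\theta$ and $S^2$, which can be checked by simulation with $X$ held fixed (all values assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
n, k, sigma = 25, 3, 1.0          # illustrative values
X = rng.normal(size=(n, k))       # X held fixed: we condition on it
XtX_inv = np.linalg.inv(X.T @ X)
M = np.eye(n) - X @ XtX_inv @ X.T

reps = 50000
U = rng.normal(scale=sigma, size=(reps, n))
theta_err = U @ X @ XtX_inv       # theta_hat - theta_0 per replication, via (5.36)
S2 = ((U @ M) * U).sum(axis=1) / (n - k)

# Conditional on X, each component of theta_hat is uncorrelated with S^2
corr = [np.corrcoef(theta_err[:, i], S2)[0, 1] for i in range(k)]
assert np.all(np.abs(corr) < 0.02)
```

Zero correlation is of course weaker than the independence the theorem asserts; it is only a sanity check, not a proof.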

Theorems 5.17 and 5.18 yield two important corollaries, which I will state in the next theorem. These results play a key role in statistical testing.

Theorem 5.19: (a) Let $c$ be a given nonrandom vector in $\mathbb{R}^k$. Then

$$
\frac{c^T (\hat\theta - \theta_0)}{S \sqrt{c^T (X^T X)^{-1} c}} \sim t_{n-k}. \tag{5.45}
$$

(b) Let $R$ be a given nonrandom $m \times k$ matrix with rank $m \le k$. Then

$$
\frac{(\hat\theta - \theta_0)^T R^T \big(R (X^T X)^{-1} R^T\big)^{-1} R (\hat\theta - \theta_0)}{m \cdot S^2} \sim F_{m,\, n-k}. \tag{5.46}
$$

Proof of (5.45): It follows from (5.37) that $c^T(\hat\theta - \theta_0) \mid X \sim N\big[0,\ \sigma^2 c^T (X^T X)^{-1} c\big]$; hence,

$$
\frac{c^T (\hat\theta - \theta_0)}{\sigma \sqrt{c^T (X^T X)^{-1} c}} \,\Big|\, X \sim N(0, 1). \tag{5.47}
$$

It follows now from Theorem 5.18 that, conditional on $X$, the random variable in (5.47) and $S^2$ are independent; hence, it follows from Theorem 5.17 and the definition of the $t$-distribution that (5.45) is true, conditional on $X$ and therefore also unconditionally.

Proof of (5.46): It follows from (5.37) that $R(\hat\theta - \theta_0) \mid X \sim N_m\big[0,\ \sigma^2 R (X^T X)^{-1} R^T\big]$; hence, it follows from Theorem 5.9 that

$$
\frac{(\hat\theta - \theta_0)^T R^T \big(R (X^T X)^{-1} R^T\big)^{-1} R (\hat\theta - \theta_0)}{\sigma^2} \,\Big|\, X \sim \chi^2_m. \tag{5.48}
$$

Again it follows from Theorem 5.18 that, conditional on $X$, the random variable in (5.48) and $S^2$ are independent; hence, it follows from Theorem 5.17 and the definition of the $F$-distribution that (5.46) is true, conditional on $X$ and therefore also unconditionally. Q.E.D.
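A sketch of how the pivotal statistics in (5.45) and (5.46) are computed in practice, here evaluated at the true $\theta_0$ as under a null hypothesis. The restriction vector $c$ and matrix $R$ are illustrative choices, and SciPy is assumed available for the reference distributions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, k, sigma = 40, 3, 1.0          # illustrative values
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
theta0 = np.array([1.0, 0.5, -2.0])
Y = X @ theta0 + rng.normal(scale=sigma, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
theta_hat = XtX_inv @ X.T @ Y
resid = Y - X @ theta_hat
S2 = resid @ resid / (n - k)      # the OLS variance estimator (5.41)

# t-statistic (5.45) for a single linear combination c'theta
c = np.array([0.0, 1.0, 0.0])     # picks out the second coefficient
t_stat = c @ (theta_hat - theta0) / np.sqrt(S2 * (c @ XtX_inv @ c))
p_t = 2 * stats.t.sf(abs(t_stat), df=n - k)   # two-sided p-value

# F-statistic (5.46) for m joint restrictions on theta
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
m = R.shape[0]
d = R @ (theta_hat - theta0)
F_stat = d @ np.linalg.solve(R @ XtX_inv @ R.T, d) / (m * S2)
p_F = stats.f.sf(F_stat, m, n - k)
```

In applications $\theta_0$ is replaced by the hypothesized value, so that large $|t|$ or $F$ values are evidence against the null.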

Note that the results in Theorem 5.19 do not hinge on the assumption that the vector $X_j$ in model (5.31) has a multivariate normal distribution. The only conditions that matter for the validity of Theorem 5.19 are that, in (5.32), $U \mid X \sim N_n(0, \sigma^2 I_n)$ and $P[0 < \det(X^T X) < \infty] = 1$.