# Theory of Least Squares

In this section we shall define the least squares estimator of the parameter $\beta$ in Model 1 and shall show that it is the best linear unbiased estimator. We shall also discuss estimation of the error variance $\sigma^2$.

1.1.2 Definition of Least Squares Estimators of $\beta$ and $\sigma^2$

The least squares (LS) estimator $\hat\beta$ of the regression parameter $\beta$ in Model 1 is defined to be the value of $\beta$ that minimizes the sum of squared residuals

$$
\begin{aligned}
S(\beta) &= (y - X\beta)'(y - X\beta) \\
&= y'y - 2y'X\beta + \beta'X'X\beta.
\end{aligned}
\tag{1.2.1}
$$

Putting the derivatives of $S(\beta)$ with respect to $\beta$ equal to 0, we have

$$
\frac{\partial S}{\partial \beta} = -2X'y + 2X'X\beta = 0, \tag{1.2.2}
$$

where $\partial S/\partial\beta$ denotes the $K$-vector the $i$th element of which is $\partial S/\partial\beta_i$, $\beta_i$ being the $i$th element of $\beta$. Solving (1.2.2) for $\beta$ gives

$$
\hat\beta = (X'X)^{-1}X'y. \tag{1.2.3}
$$

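Clearly, $S(\beta)$ attains the global minimum at $\hat\beta$, since the matrix of second derivatives, $2X'X$, is positive definite when $X$ has full column rank.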
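The closed form (1.2.3) can be checked numerically. The following is a minimal sketch in Python with NumPy, not part of the original text. The simulated data, the coefficient values, and the helper names `ls_estimator` and `sum_sq_resid` are assumptions made for illustration only.

```python
import numpy as np

def ls_estimator(X, y):
    """Return beta_hat = (X'X)^{-1} X'y by solving the normal equations X'X b = X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def sum_sq_resid(X, y, beta):
    """Sum of squared residuals S(beta) = (y - X beta)'(y - X beta), as in (1.2.1)."""
    r = y - X @ beta
    return float(r @ r)

# Hypothetical data: T = 50 observations, K = 3 regressors (the first is a constant).
rng = np.random.default_rng(0)
T, K = 50, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=T)

beta_hat = ls_estimator(X, y)

# The first-order condition (1.2.2) should hold at beta_hat up to rounding error.
grad = -2 * X.T @ y + 2 * X.T @ X @ beta_hat

# Any other value of beta should give a larger sum of squared residuals.
perturbed = beta_hat + 0.1 * rng.normal(size=K)

print("beta_hat:", beta_hat)
print("max |gradient|:", np.abs(grad).max())
print("S(beta_hat) < S(perturbed):",
      sum_sq_resid(X, y, beta_hat) < sum_sq_resid(X, y, perturbed))
```

Solving the normal equations with `np.linalg.solve` avoids forming $(X'X)^{-1}$ explicitly; the result should agree with `np.linalg.lstsq(X, y, rcond=None)[0]` up to rounding.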

Let us consider the special case $K = 2$ and $x_t' = (1, x_{2t})$ and represent each of the $T$ observations $(y_t, x_{2t})$ by a point on the plane. Then, geometrically, the least squares estimates are the intercept and the slope of a line drawn in such a way that the sum of squares of the deviations between the points and the line is minimized in the direction of the $y$-axis. Different estimates result if the sum of squares of deviations is minimized in any other direction.
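To illustrate the last remark, the sketch below fits a line to the same points twice: once minimizing deviations in the direction of the $y$-axis (regressing $y$ on $x$) and once in the direction of the $x$-axis (regressing $x$ on $y$ and re-expressing the fitted line in terms of $y$). The data and the helper `simple_ls` are hypothetical and serve only as an example.

```python
import numpy as np

def simple_ls(dep, reg):
    """Intercept and slope from the LS regression of dep on (1, reg)."""
    X = np.column_stack([np.ones_like(reg), reg])
    return np.linalg.solve(X.T @ X, X.T @ dep)

# Hypothetical data scattered around the line y = 1 + 2x.
rng = np.random.default_rng(1)
T = 200
x = rng.normal(size=T)
y = 1.0 + 2.0 * x + rng.normal(size=T)

# Deviations measured along the y-axis: regress y on x.
a_v, b_v = simple_ls(y, x)

# Deviations measured along the x-axis: regress x on y, x = c + d*y,
# then rewrite the fitted line as y = -c/d + (1/d)*x.
c, d = simple_ls(x, y)
a_h, b_h = -c / d, 1.0 / d

print("vertical fit:   intercept %.3f, slope %.3f" % (a_v, b_v))
print("horizontal fit: intercept %.3f, slope %.3f" % (a_h, b_h))
```

The two slopes differ by the factor of the squared sample correlation between $x$ and $y$, so they coincide only when the points lie exactly on a line.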

Given the least squares estimator $\hat\beta$, we define

$$
\hat u = y - X\hat\beta \tag{1.2.4}
$$

and call it the vector of the least squares residuals. Using $\hat u$, we can estimate $\sigma^2$ by

$$
\hat\sigma^2 = T^{-1}\hat u'\hat u, \tag{1.2.5}
$$

called the least squares estimator of $\sigma^2$, although the use of the term least squares here is not as compelling as in the estimation of the regression parameters.
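As a small numerical illustration of (1.2.4) and (1.2.5), the following sketch computes the residual vector and the variance estimate with divisor $T$. The data-generating values are assumptions chosen only for the example.

```python
import numpy as np

# Hypothetical data: T = 40 observations, a constant and one regressor.
rng = np.random.default_rng(2)
T = 40
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([0.5, 1.5]) + rng.normal(scale=2.0, size=T)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (1.2.3)
u_hat = y - X @ beta_hat                       # residual vector, (1.2.4)
sigma2_hat = (u_hat @ u_hat) / T               # divisor T, as in (1.2.5)

print("sigma2_hat:", sigma2_hat)
print("max |X'u_hat|:", np.abs(X.T @ u_hat).max())  # zero up to rounding
```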

Using (1.2.4), we can write

$$
y = X\hat\beta + \hat u = Py + My, \tag{1.2.6}
$$

where $P = X(X'X)^{-1}X'$ and $M = I - P$. Because $\hat u$ is orthogonal to $X$ (that is, $\hat u'X = 0$), least squares estimation can be regarded as decomposing $y$ into two orthogonal components: a component that can be written as a linear combination of the column vectors of $X$ and a component that is orthogonal to $X$. Alternatively, we can call $Py$ the projection of $y$ onto the space spanned by the column vectors of $X$ and $My$ the projection of $y$ onto the space orthogonal to $X$. Theorem 14 of Appendix 1 gives the properties of a projection matrix such as $P$ or $M$. In the special case where both $y$ and $X$ are two-dimensional vectors (that is, $K = 1$ and $T = 2$), the decomposition (1.2.6) can be illustrated as in Figure 1.1, where the vertical and horizontal axes represent the first and second observations, respectively, and the arrows represent vectors.
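The two-dimensional case can also be worked through numerically. The sketch below uses hypothetical values for $X$ and $y$ (they are not taken from Figure 1.1) and checks that $P$ and $M$ behave as projection matrices.

```python
import numpy as np

# Hypothetical K = 1, T = 2 example: X and y are both vectors in the plane.
X = np.array([[2.0], [1.0]])
y = np.array([1.0, 3.0])

P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection onto the column space of X
M = np.eye(2) - P                      # projection onto its orthogonal complement

Py, My = P @ y, M @ y

# For these values Py is (2, 1) and My is (-1, 2), up to rounding.
print("Py:", Py)
print("My:", My)
print("Py + My equals y:", np.allclose(Py + My, y))
print("Py'My:", Py @ My)               # zero up to rounding
```

Here $Py$ is a multiple of $X$ (the multiple being $\hat\beta = 1$), and $My$ is the residual vector $\hat u$, which is perpendicular to $X$.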

From (1.2.6) we obtain

$$
y'y = y'Py + y'My. \tag{1.2.7}
$$
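The intermediate steps can be written out as follows. This is a sketch that uses only the standard properties of $P$ and $M$ as symmetric idempotent matrices ($P' = P$, $P^2 = P$, $M' = M$, $M^2 = M$) together with $PM = P(I - P) = 0$:

$$
\begin{aligned}
y'y &= (Py + My)'(Py + My) \\
&= y'P'Py + 2y'P'My + y'M'My \\
&= y'Py + 2y'PMy + y'My \\
&= y'Py + y'My.
\end{aligned}
$$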

The goodness of fit of the regression of $y$ on $X$ can be measured by the ratio $y'Py/y'y$, sometimes called $R^2$. However, it is more common to define $R^2$ as the square of the sample correlation between $y$ and $Py$:

$$
R^2 = \frac{(y'LPy)^2}{y'Ly \cdot y'PLPy}, \tag{1.2.8}
$$

where $L = I_T - T^{-1}\mathbf{1}\mathbf{1}'$ and $\mathbf{1}$ denotes the $T$-vector of ones. If we assume one of the columns of $X$ is $\mathbf{1}$ (which is usually the case), we have $LP = PL$. Then we can rewrite (1.2.8) as

$$
R^2 = \frac{y'LPLy}{y'Ly} = 1 - \frac{y'My}{y'Ly}.
$$

Thus $R^2$ can be interpreted as a measure of the goodness of fit of the regression of the deviations of $y$ from its mean on the deviations of the columns of $X$ from their means. (Section 2.1.4 gives a modification of $R^2$ suggested by Theil, 1961.)
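The equivalence of these expressions for $R^2$ when $X$ contains a column of ones can be verified numerically. The following sketch uses simulated data; all numerical values are assumptions for illustration.

```python
import numpy as np

# Hypothetical data: T = 60 observations, a constant and two regressors.
rng = np.random.default_rng(3)
T = 60
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
y = X @ np.array([1.0, 0.8, -1.2]) + rng.normal(size=T)

P = X @ np.linalg.solve(X.T @ X, X.T)   # X (X'X)^{-1} X'
M = np.eye(T) - P
ones = np.ones((T, 1))
L = np.eye(T) - ones @ ones.T / T       # L y gives deviations of y from its mean

# Definition (1.2.8): squared sample correlation between y and Py.
r2_def = (y @ L @ P @ y) ** 2 / ((y @ L @ y) * (y @ P @ L @ P @ y))

# The two rewritten forms, valid because X contains a column of ones.
r2_explained = (y @ L @ P @ L @ y) / (y @ L @ y)
r2_resid = 1.0 - (y @ M @ y) / (y @ L @ y)

# The ratio y'Py / y'y mentioned first; in general it differs from the above.
r2_uncentered = (y @ P @ y) / (y @ y)

print("R^2 from (1.2.8):       ", r2_def)
print("R^2 as y'LPLy / y'Ly:   ", r2_explained)
print("R^2 as 1 - y'My / y'Ly: ", r2_resid)
print("y'Py / y'y:             ", r2_uncentered)
```

Building the $T \times T$ matrices $P$, $M$, and $L$ explicitly mirrors the notation of the text; for large $T$ one would normally work with the fitted values and residuals directly instead.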